Conference Agenda

Overview and details of the sessions of this conference.

 
 
Session Overview
Session
Symposium: Machine Learning in Psychology Part II: New Advancements in Statistical Methods Using Machine Learning and AI
Time:
Tuesday, 30/Sept/2025:
3:30pm - 5:00pm

Session Chair: Susanne Frick
Location: Hörsaal 1a


Presentations

Machine Learning in Psychology Part II: New Advancements in Statistical Methods Using Machine Learning and AI

Chair(s): Susanne Frick (TU Dortmund, Statistik), Mirka Henninger (Universität Basel)

In this symposium, we present new advancements that can improve statistical modeling in psychology and beyond by employing machine learning and artificial intelligence (AI). When thoughtfully integrated, machine learning and AI tools can enhance statistical modeling and facilitate research and test construction. This symposium presents new developments that are tailored to the specifics of social science research.

The first two talks focus on the use of large language models (LLMs). In the first talk, Mirka Henninger presents how LLMs can be used to streamline abstract screening. In the second talk, Rudolf Debelak presents a framework for employing LLMs both to generate items and to predict their properties. The next two talks focus on improving causal effect estimation. Seyda Betul Aydin examines how transfer learning can integrate information from several datasets to estimate individual treatment effects. Roberto Faleh proposes a deep learning framework for estimating causal effects in observational data that pose several methodological challenges. Finally, Philipp Doebler presents an efficient ensemble-based method for detecting anomalies.

In sum, this symposium exemplifies how the flexibility and automation of machine learning can improve the estimation of statistical models and save resources in the research process.

 

Presentations of the Symposium

 

Using Large Language Models for Abstract Screening

Mirka Henninger1, Jan Radek1, Martin Pauly2
1Universität Basel, 2Exnaton AG

When conducting research synthesis projects, such as systematic reviews or meta-analyses, researchers need to identify all relevant studies, screen titles and abstracts, and extract information from the articles’ full text. This screening and data extraction process is time-consuming, labor-intensive, and error-prone, and it requires substantial training and expertise from reviewers.

In this project, we explore how large language models (LLMs) can be used to automate and streamline the abstract screening process in research synthesis projects. We first review some central aspects of LLMs. Then, we present a possible procedure for using LLMs in abstract screening and illustrate it with data from a current research synthesis project. We show how this procedure can be automated, how prompts and hyperparameters can be adapted, and how outputs compare across different hyperparameter settings. Given the rapid pace of LLM development, we do not aim to provide a final recipe for using LLMs in research synthesis. Rather, we share our experiences and offer a starting point for future discussions and workflows.
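
As an illustration of what one step of such an automated screening pipeline could look like (not the authors' actual procedure), the following Python sketch queries an LLM through the OpenAI API for an include/exclude decision per abstract. The model name, prompt wording, and inclusion criteria are assumptions for illustration.

    # Minimal sketch of LLM-based abstract screening (illustrative only; the
    # model name, prompt, and criteria are assumptions, not the authors' setup).
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SCREENING_PROMPT = (
        "You are screening abstracts for a research synthesis project. "
        "Inclusion criteria: {criteria}\n\n"
        "Abstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

    def screen_abstract(abstract: str, criteria: str, temperature: float = 0.0) -> str:
        """Ask the LLM for an include/exclude decision on one abstract."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical choice; any chat model works
            messages=[{"role": "user",
                       "content": SCREENING_PROMPT.format(criteria=criteria,
                                                          abstract=abstract)}],
            temperature=temperature,  # one hyperparameter to vary and compare
        )
        return response.choices[0].message.content.strip()

    # Compare decisions across hyperparameter settings, as discussed in the talk.
    for temp in (0.0, 0.7):
        decision = screen_abstract("Example abstract text ...",
                                   criteria="empirical studies on topic X",
                                   temperature=temp)
        print(f"temperature={temp}: {decision}")

Running the same abstracts at several temperatures (or with reworded prompts) and comparing the resulting decisions is one simple way to probe how stable such a screening step is.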

 

Developing and Evaluating Assessment Items: An LLM Framework

Rudolf Debelak1, Rasmus Alexander Jensen2, Alexandre Clin Deffarges2, Laura Stahlhut2
1Universität Zürich, 2ETH Zürich, Institut für Bildungsevaluation

The development of large language models has led to numerous possibilities for their application in education. In this talk, we present a framework for using large language models for the development and quality control of item pools for large-scale assessments. By combining methods such as prompt engineering, fine-tuning, and self-reflection, the proposed framework allows the generation and sequential enhancement of items for diverse fields such as reading comprehension, knowledge tests, and assessments in mathematics or the sciences in a scalable and cost-efficient way. The framework's evaluation component predicts psychometric characteristics, such as item parameters from item response models, and provides an assessment of item suitability for specific student populations.
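
As a hedged sketch of how such a generate-and-refine loop might be organized (the prompts, model choice, and helper function below are hypothetical, not the framework's actual components):

    # Illustrative generate-critique-revise loop for assessment items.
    # The prompts and the llm() helper are stand-ins for whatever
    # prompt-engineering and fine-tuned models the framework actually uses.
    from openai import OpenAI

    client = OpenAI()

    def llm(prompt: str) -> str:
        """Single LLM call; the model choice is an assumption for illustration."""
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def generate_item(topic: str, n_rounds: int = 2) -> str:
        item = llm(f"Write one multiple-choice reading comprehension item on: {topic}")
        for _ in range(n_rounds):  # self-reflection: critique, then revise
            critique = llm(f"List flaws of this test item (ambiguity, cueing, difficulty):\n{item}")
            item = llm(f"Revise the item to address this critique.\nItem:\n{item}\nCritique:\n{critique}")
        return item

    print(generate_item("interpreting a short newspaper text"))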

To demonstrate the framework's application, we will present preliminary findings from an ongoing pilot study in Switzerland, where this framework is being applied for the development of large-scale assessments in primary and secondary schools. The presentation will include examples of generated items, illustrate the AI-driven item generation and enhancement process, and provide an initial assessment of the prediction accuracy for psychometric characteristics. These results will offer crucial insights into the practical feasibility and potential of LLM-driven item development.

 

Individual Treatment Effect Estimation Through Transfer Learning

Seyda Betul Aydin, Roberto Faleh, Holger Brandt
Universität Tübingen, Methods Center

Extending causal insights to new and varied contexts remains an important challenge in scientific research. Findings obtained from large-scale datasets must frequently be generalized to distinct and novel environments, underscoring the necessity of external validity. Transfer learning addresses this challenge by using pre-trained models to capture intricate patterns from extensive datasets and adapting these patterns to different contexts.

In this talk, we examine how transfer learning can facilitate the estimation of individual treatment effects (ITEs) through Treatment-Agnostic Representation Networks (TARNets). Estimating such models typically requires substantial sample sizes, which often limits their feasibility for the smaller datasets common in the social and behavioral sciences. We illustrate that leveraging transfer learning in TARNets significantly enhances the estimation accuracy of ITEs by utilizing knowledge derived from larger source datasets and applying it to smaller target datasets.

We explore multiple scenarios, including the transfer of insights from large-scale observational datasets to experimental settings, as well as bias mitigation in non-randomized target datasets through the appropriate selection of source data. Additionally, we investigate scenarios in which the source dataset lacks complete group information, such as a missing treatment or control group. To address these challenges, we combine transfer learning with Inverse Probability Weighting (IPW) to adjust for the missing information and enhance the reliability of transferred inferences. Finally, we discuss the use of the Integral Probability Metric (IPM) to quantify distributional discrepancies, allowing researchers to assess the appropriateness of knowledge transfer from particular source datasets to specific target contexts.
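
To make the TARNet idea concrete, here is a minimal PyTorch sketch of the architecture and the transfer-learning step described above: a shared, treatment-agnostic representation feeding two outcome heads, pretrained on a large source dataset and then partially frozen for a small target dataset. The layer sizes, freezing strategy, and learning rate are illustrative assumptions, not the authors' specification.

    # Minimal TARNet sketch with a transfer-learning step (PyTorch).
    import torch
    import torch.nn as nn

    class TARNet(nn.Module):
        def __init__(self, n_covariates: int, hidden: int = 64):
            super().__init__()
            # Shared, treatment-agnostic representation Phi(x)
            self.phi = nn.Sequential(nn.Linear(n_covariates, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
            # Separate outcome heads for control (t = 0) and treatment (t = 1)
            self.head0 = nn.Linear(hidden, 1)
            self.head1 = nn.Linear(hidden, 1)

        def forward(self, x):
            z = self.phi(x)
            return self.head0(z), self.head1(z)

        def ite(self, x):
            """Individual treatment effect estimate: mu1(x) - mu0(x)."""
            y0, y1 = self.forward(x)
            return (y1 - y0).squeeze(-1)

    # Transfer learning: pretrain on a large source dataset, then reuse the
    # representation and fine-tune only the heads on a small target dataset.
    model = TARNet(n_covariates=20)  # covariate count is an assumption
    # ... pretrain `model` on the source data here ...
    for p in model.phi.parameters():
        p.requires_grad = False  # freeze the shared representation
    optimizer = torch.optim.Adam(
        list(model.head0.parameters()) + list(model.head1.parameters()), lr=1e-3)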

 

Improving Causal Estimates with Sparse Autoencoders and Integral Probability Metrics

Roberto Faleh, Sofia Morelli, Holger Brandt
Universität Tübingen, Methods Center

Estimating causal effects in observational studies is particularly challenging in the presence of high-dimensional covariates, nonlinear treatment assignment mechanisms, and residual confounding. Traditional propensity score-based methods often struggle under these conditions. To address these limitations, we propose a deep learning framework that combines a Deep Neural Network with a Sparse Autoencoder to extract low-dimensional, informative representations from complex covariate structures.

We incorporate Integral Probability Metrics (IPMs; e.g., the Wasserstein distance) into the training objective. This encourages the model to produce compact and balanced latent representations, reducing bias in the causal estimates and improving the estimation of conditional treatment assignment probabilities.
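
As a rough illustration of such a training objective (not the authors' implementation), the following PyTorch sketch combines a reconstruction loss, an L1 sparsity penalty on the latent codes, and an IPM-style balance term. For simplicity, the balance term is the distance between treated and control latent means (a linear-kernel MMD) standing in for the Wasserstein distance; all layer sizes and penalty weights are assumptions.

    # Sketch of a sparse-autoencoder training loss with an IPM-style
    # balance penalty on the latent representation (illustrative only).
    import torch
    import torch.nn as nn

    encoder = nn.Sequential(nn.Linear(100, 16), nn.ReLU())  # assumed sizes
    decoder = nn.Sequential(nn.Linear(16, 100))

    def loss_fn(x, t, l1_weight=1e-3, ipm_weight=1.0):
        z = encoder(x)
        recon = nn.functional.mse_loss(decoder(z), x)  # reconstruction
        sparsity = z.abs().mean()                      # L1 sparsity on codes
        # IPM surrogate: distance between treated and control latent means
        # (linear-kernel MMD; a full implementation might use Wasserstein).
        ipm = (z[t == 1].mean(dim=0) - z[t == 0].mean(dim=0)).norm()
        return recon + l1_weight * sparsity + ipm_weight * ipm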

We demonstrate the effectiveness of our method through empirical evaluations on both synthetic and semi-synthetic real-world datasets.

 

Highly Efficient Anomaly Detection with Ensembles of Small Linear Models

Philipp Doebler
TU Dortmund, Statistik

Relative to a reference distribution, anomaly detection (AD) aims to find unusual observations, including (multivariate) outliers but also less obvious anomalies. Most modern AD techniques are unsupervised machine learning approaches developed in the context of industrial applications, and current approaches often build on computationally involved deep neural networks to accommodate high-dimensional data. Anomalies are of interest in psychology as well: because they deviate from a reference distribution, they might indicate an impending relapse in a clinical context, or they might be indicative of careless responding, boredom, cheating, or dysfunctional strategies in psychological and educational testing.

After a general introduction to AD, we discuss the potential of Shallow Ensemble ANomaly detection (SEAN; Klüttermann, Peka, Doebler & Müller, ICMLA 2024) in a psychological context. SEAN is a novel anomaly detection algorithm originally designed for real-time applications in predictive maintenance. SEAN is an ensemble of linear models: for each submodel, a low-dimensional random linear combination of the original variables is computed, and a linear model without an intercept is fitted with these new variables. Because the dependent variable is fixed to 1 in these linear models, the expected predicted value under the reference distribution is close to 1. Submodels are pooled by taking the maximum deviation from 1 across all submodels.

SEAN operates over 20,000 times faster than a comparable state-of-the-art deep learning alternative, with a negligible sacrifice in detection accuracy. Variants of SEAN are discussed, and the approach is evaluated on psychological data.
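
The abstract describes the SEAN recipe precisely enough to sketch it. The following NumPy sketch follows that description (random projections, intercept-free least squares against a constant target of 1, max-deviation pooling) but is not the reference implementation; the ensemble size and projection dimension are illustrative defaults.

    # Minimal NumPy sketch of the SEAN idea as described above.
    import numpy as np

    def fit_sean(X_ref, n_models=100, k=3, rng=None):
        """Fit an ensemble of small intercept-free linear models on reference data."""
        rng = np.random.default_rng(rng)
        n, d = X_ref.shape
        models = []
        for _ in range(n_models):
            P = rng.standard_normal((d, k))   # random low-dim linear combination
            Z = X_ref @ P
            # Least-squares fit of Z @ w ~ 1 (dependent variable fixed to 1,
            # no intercept), so predictions are near 1 under the reference.
            w, *_ = np.linalg.lstsq(Z, np.ones(n), rcond=None)
            models.append((P, w))
        return models

    def sean_score(models, X):
        """Anomaly score: maximum deviation from 1 across all submodels."""
        deviations = [np.abs(1.0 - (X @ P) @ w) for P, w in models]
        return np.max(deviations, axis=0)

    # Usage: scores stay near 0 for reference-like points, grow for anomalies.
    rng = np.random.default_rng(0)
    X_ref = rng.standard_normal((500, 10))
    models = fit_sean(X_ref, rng=1)
    print(sean_score(models, X_ref[:5]))

Because each submodel is a tiny least-squares fit on a random projection, both fitting and scoring reduce to a handful of matrix products, which is what makes the ensemble so fast relative to deep alternatives.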