Conference Agenda

Session
Symposium: Machine Learning in Psychology Part I: Evaluation and Guidelines for Using Machine Learning in Psychological Research
Time:
Tuesday, 30/Sept/2025:
1:30pm - 3:00pm

Session Chair: Mirka Henninger
Location: Hörsaal 1a


Presentations

Machine Learning in Psychology Part I: Evaluation and Guidelines for Using Machine Learning in Psychological Research

Chair(s): Mirka Henninger (University of Basel, Switzerland), Susanne Frick (TU Dortmund)

Machine learning has become increasingly popular for data analysis over the past decade. However, many of its properties have not yet been thoroughly evaluated with respect to their application in psychological research. This symposium presents, evaluates, and discusses both the potential and the limitations of applying machine learning methods in psychology.

In the first talk, Carolin Strobl highlights and examines five properties of a widely used machine learning model—random forests—focusing on their statistical background and using illustrations in R. In the second talk, Florian Scharf addresses the gap between the promises of machine learning and its empirical performance in psychological research. Through simulation studies, the presenter identifies conditions under which machine learning outperforms classical statistical methods, and vice versa.

In the third talk, Constantin Wiegand explores methods for quantifying stability and uncertainty in machine learning. He evaluates which stability measures may pose a risk of misinterpretation, potentially leading researchers to draw incorrect conclusions. The fourth and fifth talks adopt a systematic review approach. In the fourth talk, Jan Radek investigates when and how psychological researchers apply and interpret machine learning analyses in practice. In the fifth talk, Loreen Sabel examines the transfer of anomaly detection methods—such as those for identifying outliers or careless responses—from industrial contexts to psychological research.

Taken together, this symposium sheds light on the feasibility and appropriateness of machine learning approaches in psychological research, integrating findings from various projects and perspectives.

 

Presentations of the Symposium

 

The top five things I wished I had known when starting to work with random forests

Carolin Strobl
University of Zurich

Compared to other machine learning methods, random forests are intuitive, easy to use, and often perform surprisingly well even without tuning. Still, it can be helpful to be aware of a few properties of random forests that are not so widely known. This presentation will highlight five such properties:

- when random forests work well even in "small n large p" settings,

- which random forest variant is suitable for data sets with different types of predictors,

- which tuning parameter does not have a smart default value,

- how many trees are enough, and

- when a single tree might even be enough.

The properties will be explained with regard to their statistical background, and practical examples will be shown in R.
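To give a flavor of the kind of R examples the talk refers to, here is a minimal sketch (ours, not the speaker's; the data frame df with numeric response y is hypothetical) touching on three of the properties above: the number of trees, the mtry tuning parameter, and an unbiased forest variant for mixed predictor types.

# Minimal R sketch; `df` with numeric response `y` is a hypothetical data set.
library(randomForest)
library(partykit)

set.seed(1)
fit <- randomForest(y ~ ., data = df,
                    ntree = 500,        # "how many trees are enough"
                    importance = TRUE)

# mtry (predictors tried per split) is the tuning parameter without a
# universally smart default; tuneRF searches over it via the OOB error.
tuned <- tuneRF(x = df[, setdiff(names(df), "y")], y = df$y,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01)

# For data sets mixing predictor types (e.g., categorical and numeric),
# conditional inference forests avoid variable selection bias:
fit_cf <- cforest(y ~ ., data = df, ntree = 500)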

 

Old Data + Machine Learning = New Knowledge? When Can We Expect Strong Predictive Performance in Psychology?

Florian Scharf1, Kristin Jankowsky1, Kim-Laura Speck1, Katrin Jansen2, Ulrich Schroeders1
1University of Kassel, 2University of Muenster

Machine learning (ML) models have gained increasing popularity across a range of disciplines, including psychology. While their utility in technical domains is well established, their usefulness in psychological research remains debated—characterized by both notable successes and sobering disappointments. Crucially, the application of ML models does not inherently yield superior performance compared to more traditional statistical approaches. We argue that the gap between the promise of ML and its empirical performance in psychology can often be attributed to the characteristics of psychological datasets, which are typically too small for complex models and suffer from imprecise measurement. A prototypical research approach has been the reanalysis of large psychological datasets using ML models, often in direct comparison to classical regression models. In a series of simulation studies, we systematically evaluated the predictive performance of ML models under varying conditions, including differences in the underlying population model, levels of measurement error, and effect sizes. Our findings highlight that ML models outperform classical approaches only under highly specific conditions—conditions that are rarely met in typical psychological research settings. We therefore argue that the utility of ML in psychology critically depends on the availability of datasets that are explicitly designed to support the requirements of ML-based analyses.
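A stripped-down illustration of this kind of comparison might look like the following R sketch (purely ours; the linear population model, reliability, and sample sizes are arbitrary stand-ins, not the conditions of the reported studies):

# Compare a linear model and a random forest on data from a linear
# population model with measurement error in the predictors.
library(randomForest)
set.seed(42)

n <- 300                                   # a modest, psychology-sized sample
x_true <- matrix(rnorm(n * 5), ncol = 5)   # latent predictor scores
y <- as.numeric(x_true %*% c(0.5, 0.3, 0.2, 0, 0)) + rnorm(n)
x_obs <- x_true + matrix(rnorm(n * 5, sd = 0.7), ncol = 5)  # unreliable measures

dat   <- data.frame(y = y, x_obs)
train <- sample(n, 200)
rmse  <- function(obs, pred) sqrt(mean((obs - pred)^2))

fit_lm <- lm(y ~ ., data = dat[train, ])
fit_rf <- randomForest(y ~ ., data = dat[train, ])
rmse(dat$y[-train], predict(fit_lm, dat[-train, ]))
rmse(dat$y[-train], predict(fit_rf, dat[-train, ]))
# With a linear population model and noisy measures, the linear model
# typically matches or beats the random forest out of sample.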

 

Evaluating Methods to Assess the Stability of Decision Trees

Constantin Wiegand1, Florian Scharf2, Mirka Henninger1
1University of Basel, 2University of Kassel

Decision trees can model non-linear and complex relationships while still being easy to interpret. At the same time, they have been criticized for being unstable, meaning that small changes in the data can lead to substantial changes in the tree structure. A potential remedy is to assess tree stability using bootstrap or subsampling procedures. In this project, we assess bootstrapping and subsampling procedures for decision trees with regard to variable selection and prediction stability in a simulation study. We varied the sample size and the number of informative and uninformative predictors and analyzed the generated data using two decision tree algorithms: traditional regression trees (CART) and conditional inference trees (ctree). With regard to predictor selection, our results show increased false alarm rates of up to 50% for bootstrap sampling in combination with ctrees. To a lesser degree, the results show increased false alarm rates for the CART algorithm combined with the bootstrapping procedure. With regard to prediction stability, we are currently finalizing the simulation and evaluating the results. We discuss potential risks and opportunities when assessing stability in tree-based learners, implications for empirical researchers who wish to assess tree stability, and directions for future research.
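For readers who want to try such a stability check themselves, a minimal R sketch of bootstrap-based variable-selection stability for a CART tree could look as follows (ours, not the authors' simulation code; df with response y and some purely uninformative noise predictors is hypothetical):

# How often is each predictor selected across bootstrap resamples?
library(rpart)
set.seed(7)

selected_vars <- function(fit) {
  v <- as.character(fit$frame$var)
  unique(v[v != "<leaf>"])          # variables actually used for splitting
}

preds <- setdiff(names(df), "y")
B <- 200
sel <- replicate(B, {
  idx <- sample(nrow(df), replace = TRUE)   # bootstrap resample
  preds %in% selected_vars(rpart(y ~ ., data = df[idx, ]))
})

# Per-predictor selection proportion; for a noise predictor this
# estimates its false-alarm rate.
setNames(rowMeans(sel), preds)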

 

How is Machine Learning Used and Interpreted in Psychological Research? A Systematic Review

Jan Radek1, Carolin Strobl2, Mirka Henninger1
1University of Basel, 2University of Zurich

The use of machine learning is experiencing an upswing in many areas of psychology and related disciplines. However, there is limited understanding of the behaviors and practices of researchers when it comes to employing machine learning to analyze psychological data. To examine the current state of the field, we conduct a systematic review that focuses on how psychological researchers utilize machine learning methods in empirical research and how they interpret the results. To this end, we searched the APA PsycInfo database for articles that presented an empirical study, employed supervised machine learning for data analysis, and were published in psychological or psychiatric journals in 2022. To investigate the extent and form in which machine learning applications are conducted, we developed a structured framework describing four major steps in applying machine learning within psychology. Based on this framework, we address the following questions: What data characteristics and preprocessing steps are associated with machine learning applications? How do researchers in psychology select and evaluate machine learning models to fit data? How do they assess the importance of predictors in explaining or predicting a response variable? Finally, how do they interpret the strength and shape of an effect derived from machine learning?

 

Multivariate Anomaly Detection Methods and their Application in Psychology

Loreen Sabel
TU Dortmund

Anomaly detection is a common task in unsupervised machine learning, rooted in computer science and data science. Statistical methods and algorithms aim to detect anomalous data instances as deviations from the norm. Frequent applications include network intrusion detection in cybersecurity and fraud detection in financial transactions. In industrial domains, the development and application of anomaly detection methods are being researched intensively. Industrial data, however, differ from psychological data: they rarely contain missing values, can often be assumed to be independent and identically distributed, or consist of high-frequency time series. Because these differences make a direct transfer of industrial anomaly detection methods challenging, we assume that only limited applications exist for psychological datasets. An initial literature search revealed the sporadic use of anomaly detection methods in multivariate outlier detection, careless response detection, and symptom change or relapse detection. In order to obtain an overview of multivariate anomaly detection methods with applications in psychology, we are conducting a systematic review of the existing psychological research. After providing a general introduction to anomaly detection and its conceptual differences from outlier detection, we will present initial results that include a characterization of the identified methods and an overview of their application aims. Additionally, we plan to analyze and compare the performance of anomaly detection methods on empirical psychological datasets.
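As one concrete example of the kind of method in question, here is a minimal R sketch of multivariate outlier detection via robust Mahalanobis distances (ours, for illustration only; `scores` is a hypothetical numeric matrix of questionnaire scale scores):

# Flag multivariate anomalies via robust Mahalanobis distances.
library(MASS)

rob <- cov.rob(scores, method = "mcd")           # robust center and covariance
d2  <- mahalanobis(scores, rob$center, rob$cov)  # squared distances

# Under multivariate normality, d2 is roughly chi-squared with
# ncol(scores) degrees of freedom; flag the upper tail.
anomalies <- which(d2 > qchisq(0.975, df = ncol(scores)))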