Conference Agenda

Overview and details of the sessions of this conference.

 
 
Session Overview
Session
Symposium: Measurement and Machine Learning
Time: Monday, 29 Sept 2025, 2:30pm - 4:00pm

Session Chair: Melanie Viola Partsch
Location: Hörsaal 1a


Presentations

Measurement and Machine Learning

Chair(s): Melanie Viola Partsch (Utrecht University, The Netherlands)

Researchers face various challenges when measuring psychological constructs, and machine learning (ML) contributes both novel solutions and novel challenges to psychological measurement and research. Our symposium starts with the practical consequences of violations of measurement invariance (MI) and a causal approach to MI, continues with the challenges that MI poses for ML, and closes with the performance of ML in predicting rare events and in detecting misspecifications in measurement models.

In the first talk, Caterina Luz Sanchez Steinhagen (LMU) presents a simulation study on the relationship between scalar non-invariance and bias in mean comparisons, focusing on the practical consequences of MI violations for the inferences that applied researchers draw.

The second talk by Philipp Sterner (RUB) then introduces a causal framework for cross-sectional and longitudinal investigations of measurement invariance based on directed acyclic graphs (DAGs), facilitating a substantive interpretation of latent mean comparisons between groups.

In the third talk, David Goretzko (UU) focuses on how traditional and novel data sources, that is, self-report data and sensing data, introduce different kinds of non-invariance into ML applications in psychology, limiting the comparability of predictions.

The fourth talk by Kristin Jankowsky (University of Kassel) centers on the predictive performance of ML models in rare-event classification and provides guidance on the minimum data characteristics required for such prediction models.

In the last talk, Melanie Partsch (UU) introduces and illustrates a new ML-based method for detecting and classifying misfit in latent measurement models, which informs researchers about the type and severity of misspecification present in their model and guides them through its revision.

 

Presentations of the Symposium

 

How Much Is Too Much? The Relationship Between Scalar Non-Invariance and Bias in Mean Comparisons

Caterina Luz Sanchez Steinhagen1, Philipp Sterner2, David Goretzko3
1LMU Munich, 2Ruhr University Bochum, 3Utrecht University

The importance of measurement invariance (MI) for drawing valid inferences is increasingly recognized not only by methodologists but also by applied researchers. When testing scalar MI using multi-group confirmatory factor analysis (MG-CFA), researchers typically rely on fit index differences between a model in which only loadings are constrained (metric model) and one in which both loadings and intercepts are constrained (scalar model). However, evaluating (non-)invariance based on static cutoffs is to some extent arbitrary and reflects a binary view of MI that may not always align with the practical consequences of MI violations.

In this talk, we present a simulation study that investigates the functional relationship between violations of scalar MI and (1) bias in latent mean comparisons and (2) type I and II error rates of the corresponding hypothesis test. Additionally, it provides insight into how differences in fit indices relate to the actual consequences of non-invariance. Key features of the simulation include modeling scalar non-invariance as a continuous variable, varying the true latent mean difference between groups, and manipulating the direction of intercept differences relative to that true effect, allowing us to examine both exaggeration and compensation effects. By focusing on the practical consequences of MI violations on estimation and statistical inference, our findings aim to help researchers better assess how potential violations may impact the inferences they draw, supporting a more nuanced and integrated approach to MI in applied research.
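For readers less familiar with the MG-CFA routine that the study builds on, a minimal sketch in R with the lavaan package illustrates the metric-versus-scalar comparison the abstract refers to; it uses lavaan's built-in HolzingerSwineford1939 example data and is not the simulation code from the talk.

# Minimal MG-CFA invariance check (illustrative only)
library(lavaan)

model <- ' visual =~ x1 + x2 + x3 '

# Metric model: factor loadings constrained equal across groups
fit_metric <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = "loadings")

# Scalar model: loadings and intercepts constrained equal across groups
fit_scalar <- cfa(model, data = HolzingerSwineford1939, group = "school",
                  group.equal = c("loadings", "intercepts"))

# Fit-index differences that are typically compared against static cutoffs
fitMeasures(fit_metric, c("cfi", "rmsea", "srmr"))
fitMeasures(fit_scalar, c("cfi", "rmsea", "srmr"))

# Likelihood-ratio test of the metric against the scalar model
lavTestLRT(fit_metric, fit_scalar)

In the simulation study, the question is precisely how differences in such indices relate to the actual bias in the latent mean comparison, rather than to a binary pass/fail decision.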

 

A causal framework for cross-sectional and longitudinal investigations of measurement invariance

Philipp Sterner1, David Goretzko2
1Ruhr University Bochum, 2Utrecht University

Measurement invariance (MI) describes the equivalence of measures of a construct across groups or time. When comparing latent means, it is crucial to establish MI prior to analyses; otherwise the results might be distorted. The most common way to test MI is multi-group confirmatory factor analysis (MG-CFA). Although numerous guides on how to test MI in this framework exist, a recent review has shown that it is very rarely done in practice. We argue that one reason for this might be that the results of the MG-CFA approach are uninformative as to why MI does not hold between groups. Even more severely, if the causal relationships between the measurement model and surrounding covariates are neglected, the results of MG-CFA might be misleading. We show how directed acyclic graphs (DAGs) from the causal modeling literature can help guide researchers in constructing informed tests of MI and in reasoning about the causes of non-invariance. For this, we demonstrate how DAGs for measurement models can encode assumptions about causes of non-invariance in both cross-sectional and longitudinal designs. Especially for longitudinal designs, this causal view on MI can be beneficial, because changes in the interpretation of a construct or in the use of a scale (i.e., non-invariance) might be expected or even desired (e.g., after psychotherapeutic interventions). Ultimately, our goal is to provide a framework in which MI testing is not deemed a “gateway test” but rather a part of the whole process of latent mean comparisons between groups.
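As a hedged illustration of how such causal assumptions can be encoded, a toy DAG written with the R package dagitty is shown below; the node names (G for group, Eta for the latent construct, I for an item-specific cause, Y for the observed item) are our own simplification, not the graphs from the talk.

# Toy DAG: the direct path G -> I -> Y encodes a group-specific influence
# on the item that bypasses the construct, i.e., a source of non-invariance.
library(dagitty)

g <- dagitty("dag {
  G   -> Eta
  Eta -> Y
  G   -> I
  I   -> Y
}")

# Conditional independencies implied by the assumed causal structure
impliedConditionalIndependencies(g)

# Quick visual check of the graph
plot(graphLayout(g))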

 

Non-Invariance in Machine Learning: Limiting the Comparability of Predictions

David Goretzko1, Philipp Sterner2
1Utrecht University, 2Ruhr University Bochum

Psychological research increasingly combines digital data and machine learning to predict latent variables such as personality traits. Smartphone sensing, for instance, enables researchers to monitor human behavior efficiently across diverse, everyday contexts and extended periods. Applying machine learning to these rich datasets offers new opportunities for psychological research but still faces challenges related to data quality, comparability, and classical measurement issues.

Sensing data can be strongly influenced by preprocessing steps, hardware, operating systems, and app versions, while psychological traits are typically assessed via error-prone self-reports. Accordingly, machine learning modeling in psychology is affected by both traditional measurement challenges, such as non-invariant measurement models, and incomparable input data due to measurement or data handling issues.

In this presentation, we discuss the impact of these different measurement non-invariances on machine learning predictions and their implications for fairness, robustness, and generalizability. We examine not only non-invariant target or dependent variables but also non-invariant input or independent variables. By investigating the causes of differential predictions in machine learning models through both empirical and conceptual studies, we aim to enhance replicability and generalizability in psychological research.
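To make the notion of non-invariant input data concrete, the following small simulation (our own illustration, not material from the talk) shows how a device-specific shift in a sensing feature biases predictions when a model trained on one device group is applied to another.

# Two device groups record the same true behavior, but device B adds a
# constant offset to the sensor feature (input non-invariance). A model
# trained on device A then yields shifted predictions for device B.
set.seed(1)
n <- 500
behavior <- rnorm(n)                                # true behavior
trait    <- 0.6 * behavior + rnorm(n, sd = 0.8)     # self-reported trait

sensor_A <- behavior + rnorm(n, sd = 0.3)           # device A feature
sensor_B <- behavior + 0.5 + rnorm(n, sd = 0.3)     # device B feature (offset)

fit <- lm(trait ~ sensor_A)                         # trained on device A only

pred_A <- predict(fit, newdata = data.frame(sensor_A = sensor_A))
pred_B <- predict(fit, newdata = data.frame(sensor_A = sensor_B))

# Mean predictions differ between devices even though behavior and trait
# are identically distributed in both groups
mean(pred_B) - mean(pred_A)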

 

The Potential and Limits of Machine Learning in Rare Event Classifications

Kristin Jankowsky1, Katrin Jansen2, Florian Scharf1
1University of Kassel, 2University of Münster

Over the past decade, the use of machine learning (ML) in psychological research, particularly in clinical psychology, has increased significantly. For example, researchers in the field of suicide prevention have asked whether the use of flexible ML algorithms could eventually lead to more successful screening of at-risk individuals. Previous simulation studies have shown the detrimental effect of low data quality and quantity on the ability of ML algorithms to accurately detect complex relationships and on overall predictive performance. However, they either focused on metric outcomes or did not necessarily reflect data characteristics common in psychological research (i.e. high measurement error). Thus, it is necessary to take another look at the data requirements for a useful application of ML algorithms predicting rare events such as suicidal behaviors. In a simulation study, we compared two ML algorithms - elastic net regression with interactions and gradient boosting machines - in their ability to detect complex relationships and achieve predictive performance consistent with the simulated effects. We systematically varied a) sample size, b) event rate, c) measurement error, and d) effect size. In addition, we evaluated the effect of different modelling and validation choices in dealing with unequal group sizes (upsampling vs. recalibration of the classification cutoff). The overall aim of this study is to provide recommendations on the minimum data characteristics required to build meaningful clinical prediction models for rare events and to provide researchers with the tools to (re-)evaluate previous findings.
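For readers who want to see what the two strategies for handling unequal group sizes look like in code, the sketch below uses glmnet on simulated data; it is a simplified, hedged illustration and not the simulation code of the study.

# Rare-event sketch: elastic net (alpha = 0.5) with (a) upsampling of the
# minority class vs. (b) recalibrating the classification cutoff.
# Predictions are evaluated in-sample for brevity.
library(glmnet)

set.seed(42)
n <- 2000; p <- 10
x <- matrix(rnorm(n * p), n, p)
eta <- -4 + x[, 1] + 0.5 * x[, 2]                 # low base rate (~2-3%)
y <- rbinom(n, 1, plogis(eta))

# (a) Upsampling: duplicate minority-class rows until the classes are balanced
idx_pos <- which(y == 1)
idx_up  <- c(seq_len(n),
             sample(idx_pos, sum(y == 0) - length(idx_pos), replace = TRUE))
fit_up  <- cv.glmnet(x[idx_up, ], y[idx_up], family = "binomial", alpha = 0.5)
pred_a  <- as.integer(predict(fit_up, newx = x, s = "lambda.min",
                              type = "response") > 0.5)

# (b) Original data, but the cutoff is recalibrated to the observed base rate
fit     <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)
p_hat   <- predict(fit, newx = x, s = "lambda.min", type = "response")
pred_b  <- as.integer(p_hat > mean(y))

table(observed = y, upsampled = pred_a)
table(observed = y, recalibrated = pred_b)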

 

Introduction and illustration of a new method for detecting and classifying SEM misfit using machine learning

Melanie Viola Partsch, David Goretzko
Utrecht University

Despite the popularity of structural equation modeling (SEM), investigating the fit of SEM models is still challenging. The most commonly applied method is the use of fixed cutoffs for model fit indices, such as the CFI, RMSEA, and SRMR. This method is error-prone because the cutoffs do not generalize well beyond the few model and data conditions considered in the simulation study they were derived from. In addition, it only informs the researcher about the fit versus misfit of a model (i.e., facilitates a binary decision) and provides no guidance in revising a misspecified model.

In this talk, we present a new machine learning (ML)-based approach to model fit evaluation in SEM that overcomes the above-mentioned shortcomings of fixed fit index cutoffs. In the development of this method, we trained several ML models to detect misspecifications in multi-factorial measurement models and to classify their type and severity based on 173 model and data features extracted from more than 1 million simulated data sets and fitted models. We briefly walk the audience through the development of the ML-based method and show how it outperforms fixed fit index cutoffs in model fit evaluation, both in general and by means of a specific example. Furthermore, we present a workflow—wrapped into a user-friendly R function—that informs the researcher how likely it is that a certain type of misspecification is present in their measurement model, how severe it can be expected to be, and how they could proceed in revising their measurement model or, possibly, the underlying operationalization/theory.
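The authors' R function and trained classifiers are not reproduced here; as a rough sketch of the kind of model and data features such a method could consume, one might extract the following from a fitted lavaan model (the feature set below is a small, hypothetical subset, not the 173 features used in the method).

# Extract a few fit-related features from a fitted measurement model; in the
# ML-based method, features of this kind feed a trained classifier that
# predicts the type and severity of misspecification (classifier not shown).
library(lavaan)

model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
'
fit <- cfa(model, data = HolzingerSwineford1939)

features <- c(
  fitMeasures(fit, c("cfi", "tli", "rmsea", "srmr", "chisq", "df")),
  n_obs   = lavInspect(fit, "nobs"),
  max_res = max(abs(residuals(fit, type = "cor")$cov))  # largest residual correlation
)
features

In the workflow described in the talk, features like these are passed to the trained classifiers wrapped in the R function mentioned above.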