Session
Differential Item Functioning and Parameter Instability

Presentations
Differential Item Functioning in Polytomous Diagnostic Classification Models: An Extension of the Sequential G-DINA Model

1Research Unit of Psychological Assessment, Faculty of Rehabilitation Sciences, TU Dortmund University, Dortmund, Germany; 2Department of English, Vali-e-Asr University of Rafsanjan, Rafsanjan, Iran; 3Department of Educational Psychology, University of Minnesota, Minneapolis, U.S.A.

In psychological assessment, differential item functioning (DIF) occurs when respondents from different groups (e.g., gender groups) respond differently to an item despite having the same underlying symptom profile. In the framework of diagnostic classification models (DCMs), an item is flagged for DIF if respondents from different groups with the same symptom profile have different probabilities of endorsing the item, suggesting that the item may be influenced by group-specific factors unrelated to the underlying attributes. Despite the growing interest in DCMs, little attention has been paid to DIF analyses using polytomous DCMs, such as the sequential Generalized Deterministic Inputs, Noisy “And” gate (sG-DINA) model. One major challenge is that existing R packages such as GDINA and CDM currently lack support for conducting DIF analyses within polytomous DCMs. To address this gap, we developed custom code extending the GDINA package that allows DIF detection for polytomous items and provides detailed information on response category thresholds. Using this extension, we analyzed responses from 50,831 German participants (both clinical and non-clinical) to the simplified version of the Beck Depression Inventory (BDI-S), a polytomous psychological screening tool, to investigate DIF across gender. The results of the Wald test identified DIF in 20 response categories across different items, indicating potential measurement inequivalence in the BDI-S.
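As a hedged illustration of the kind of Wald test used for such DIF screening (a generic sketch in Python, not the authors' actual R-based GDINA extension; all numbers are made up), the test compares group-specific item-parameter estimates via the squared Mahalanobis distance of their difference, referred to a chi-square distribution:

```python
import numpy as np
from scipy.stats import chi2

def wald_dif_test(beta_ref, beta_foc, cov_ref, cov_foc):
    """Wald test of H0: the reference and focal groups share the same
    item parameters (no DIF) for one item or response category.

    beta_* : group-specific item-parameter estimates (1-D arrays)
    cov_*  : estimated covariance matrices of those estimates
    """
    diff = np.asarray(beta_ref, dtype=float) - np.asarray(beta_foc, dtype=float)
    # Independent group samples: covariance of the difference is the sum
    cov = np.asarray(cov_ref, dtype=float) + np.asarray(cov_foc, dtype=float)
    stat = float(diff @ np.linalg.solve(cov, diff))  # squared Mahalanobis distance
    df = diff.size
    p_value = float(chi2.sf(stat, df))
    return stat, df, p_value

# Toy example: hypothetical parameter estimates for one response category
stat, df, p_value = wald_dif_test(
    beta_ref=[0.2, 0.7], beta_foc=[0.1, 0.9],
    cov_ref=np.diag([0.01, 0.02]), cov_foc=np.diag([0.01, 0.02]),
)
```

A small p-value would flag the category as exhibiting DIF; in practice the test is repeated over all items and categories with an appropriate multiplicity correction.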
These findings highlight the importance of evaluating DIF in polytomous DCMs, especially when they are applied in diverse populations, to ensure the validity and fairness of psychological assessments.

A Local Nonparametric Framework for Detecting DIF Along a Continuous Covariate Across Diverse IRT Models

Institut für Psychologie, Goethe-Universität Frankfurt am Main

Differential item functioning (DIF) threatens the validity of test score interpretations by biasing comparisons of person abilities, making its detection a central issue. Research has developed three main strategies for detecting DIF with respect to a continuous covariate: (1) model-agnostic procedures (e.g., the multiple-group approach) are compatible with various item response theory (IRT) models but rely on discretizing continuous covariates; (2) model-specific extensions (e.g., moderated factor analysis) allow for continuous covariates but are restricted to specific IRT models with predefined functional forms (e.g., quadratic trends) for the DIF parameters; and (3) model-agnostic tests (e.g., score-based tests) preserve covariate continuity and are not restricted to particular IRT models but merely flag the presence of DIF without describing how it changes across the range of the covariate. Inspired by local structural equation models, we propose a local nonparametric DIF detection framework that inherits the flexibility of all three strategies while avoiding their limitations. The framework integrates kernel-based local weighting with an overlapping, weighted multiple-group M-estimator. For statistical inference on DIF, a person-level cluster bootstrap is employed because of the overlapping samples. Using a preliminary simulation, we demonstrate that the proposed framework outperforms previous methods while revealing nonlinear DIF patterns along the covariate continuum.
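The core idea of kernel-based local weighting can be sketched as follows (a minimal illustration under simplified assumptions, not the proposed M-estimator: here only a kernel-weighted endorsement rate is tracked along the covariate; all names and data are hypothetical):

```python
import numpy as np

def local_weighted_endorsement(responses, covariate, focal_points, bandwidth):
    """Kernel-weighted local estimate of an item's endorsement rate at
    each focal value of a continuous covariate. Persons near a focal
    point receive high weight; weighting windows overlap by design."""
    responses = np.asarray(responses, dtype=float)
    covariate = np.asarray(covariate, dtype=float)
    estimates = []
    for x0 in focal_points:
        # Gaussian kernel weights centred at the focal covariate value
        w = np.exp(-0.5 * ((covariate - x0) / bandwidth) ** 2)
        estimates.append(float(np.sum(w * responses) / np.sum(w)))
    return np.array(estimates)

# Simulated nonlinear drift: endorsement probability rises with age
rng = np.random.default_rng(0)
age = rng.uniform(20, 60, size=2000)
p_true = 0.2 + 0.6 * (age - 20) / 40
y = rng.binomial(1, p_true)
curve = local_weighted_endorsement(y, age, focal_points=[25, 40, 55], bandwidth=5.0)
```

Because neighbouring focal points reuse overlapping subsets of persons, the resulting local estimates are dependent, which is why the abstract's framework resorts to a person-level cluster bootstrap for inference.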
Trend Estimation in Longitudinal Assessments: Comparing Concurrent Calibration, Item Parameter Drift Detection-Based Methods, Robust Linking, and Regularized Estimation

1IPN – Leibniz Institute for Science and Mathematics Education, Kiel, Germany; 2Centre for International Student Assessment (ZIB), Kiel, Germany

In longitudinal assessments, tests are frequently used to estimate trends. When item parameters lack invariance, comparisons across time points can be distorted, requiring appropriate statistical methods for accurate trend estimation. This talk compares trend estimates under the 2PL model with item parameter drift (IPD) across four linking approaches for two time points. First, two methods assume invariant item parameters: concurrent calibration jointly estimates item parameters across time points, while fixed item parameter calibration estimates them at one time point and fixes them at the other. Second, separate calibration of the two time points is followed by robust Haberman or robust Haebara linking via common items to place parameters on a common scale. Third, noninvariant items are detected using likelihood ratio tests or the root mean square deviation (RMSD) statistic with fixed or outlier-based cutoffs, and trend estimates are recomputed using only the items identified as invariant. Fourth, regularized estimation under a smooth Bayesian information criterion (SBIC) is applied, shrinking small or null IPD effects toward zero while estimating all others as nonzero. The simulation varied sample size, number of items, IPD effect size, the proportion of IPD items, balanced versus unbalanced IPD, and the average change in ability between time points. Bias and relative RMSE were evaluated for the mean and SD at the second time point. Results suggest that SBIC generally performed best, followed by Haberman linking with the L0 loss function.
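The RMSD statistic used in the detection-based approach can be sketched on a latent-trait grid (a minimal illustration; the item curves, grid, and the cutoff of 0.05, one fixed value discussed in the literature, are hypothetical choices, not the study's settings):

```python
import numpy as np

def rmsd_item_fit(p_observed, p_model, weights):
    """RMSD item-fit statistic: root of the weighted mean squared gap
    between (pseudo-)observed and model-implied response probabilities,
    evaluated over a grid of latent-trait values.

    weights : trait-distribution weights over the grid (normalized here)
    """
    p_observed = np.asarray(p_observed, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(np.sqrt(np.sum(w * (p_observed - p_model) ** 2)))

# Toy grid: an item whose observed curve has drifted relative to the model
theta = np.linspace(-4, 4, 41)
w = np.exp(-0.5 * theta**2)                 # standard-normal trait weights
p_model = 1 / (1 + np.exp(-(theta - 0.0)))  # logistic curve, difficulty 0.0
p_obs = 1 / (1 + np.exp(-(theta - 0.5)))    # drifted curve, difficulty 0.5
rmsd = rmsd_item_fit(p_obs, p_model, w)
flagged = bool(rmsd > 0.05)  # hypothetical fixed cutoff for noninvariance
```

Under this logic, items whose RMSD exceeds the cutoff are excluded before the trend is re-estimated; the choice of cutoff directly controls how aggressively drifting items are screened out.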
For the detection-based approach, commonly used RMSD cutoffs may be too lenient; stricter thresholds appear necessary to achieve satisfactory parameter estimates.

On the Meaning of Measurement Invariance in Social Relations – Confirmatory Factor Analysis for Relative Variance Parameters

Universität Konstanz, Germany

We present and illustrate meaningful ways to assess relative variance parameters (variance components) in multiple-indicator social relations – confirmatory factor analysis models for dyadic round-robin data in which different types of measurement invariance may hold. With simulation studies, we investigate under which conditions of sample size, true parameter values, and (mis-)specified invariance restrictions estimation issues as well as biased and inaccurate parameter estimates occur. Estimation issues are common in realistic data situations with low person-level variances and comparably few members per round-robin group. However, such issues can be effectively avoided by (falsely) imposing invariance restrictions across factor loadings without severely biasing the relative variances for the sum score and the reciprocity correlations. Implications and limitations are discussed.
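To make the notion of relative variance parameters concrete (a generic sketch with made-up numbers, not the authors' model), a social relations decomposition splits the total dyadic variance into actor, partner, and relationship components, and each relative variance is that component's share of the total:

```python
def relative_variances(var_actor, var_partner, var_relationship):
    """Share of total dyadic variance attributable to each social
    relations component (actor, partner, relationship effects)."""
    components = {"actor": var_actor, "partner": var_partner,
                  "relationship": var_relationship}
    total = sum(components.values())
    return {name: value / total for name, value in components.items()}

# Hypothetical variance-component estimates for one round-robin indicator
shares = relative_variances(0.30, 0.10, 0.60)
```

In the multiple-indicator CFA setting of the abstract, these components are estimated per latent factor rather than computed from raw scores, and the invariance restrictions on factor loadings determine whether such shares are comparable across indicators.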