Conference Agenda

Session

Item selection

Time: Monday, 29/Sept/2025, 2:30pm - 4:00pm

Session Chair: Martin Schultze

Location: Raum L 116

60

Presentations

Utilizing lexical information in automated item selection procedures

Martin Schultze

Goethe Universität Frankfurt, Germany

Automated item selection procedures have gained popularity in the construction of psychological questionnaires in recent years. Especially when combined with LLM-based item generation, this leads to a wide array of powerful possibilities to tailor questionnaires to specific measurement needs. In most applications, meta-heuristic approaches are used for item selection, with the goal of optimizing the final questionnaires in terms of classical psychometric indicators (e.g., reliability, predictive validity, discriminant validity) and/or statistical indicators derived from confirmatory factor analysis (e.g., model fit, measurement invariance). These objectives are often combined with constraints derived from theory about the factorial structure of the assumed constructs. In this talk, I present an extension of automated item selection procedures to include lexical information about the item phrasing itself. Such indicators can be used in the initial construction to increase the probability of broader applicability of a questionnaire (e.g., by maximizing ease of item translation). Specifically, three categories of lexical indicators are presented and investigated as possible components in objective functions: (1) (dis-)similarity of item phrasing, (2) item phrasing complexity, and (3) adherence of lexical clustering to the psychometric model. In an evaluation study, the lexical indicators are combined with classical and CFA quality indicators, and their performance in a priori, empirical, and adaptive objective functions is evaluated.
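As a rough illustration of the first category of lexical indicators, (dis-)similarity of item phrasing, the following minimal sketch computes a token-overlap (Jaccard) similarity between item texts. The abstract does not say which similarity measure the talk uses, so the measure, the function names, and the example items are all assumptions made for illustration only.

```python
import re
from itertools import combinations


def tokens(item):
    """Lower-case an item and split it into a set of word tokens."""
    return set(re.findall(r"[a-z']+", item.lower()))


def jaccard(item_a, item_b):
    """Share of distinct words that two items have in common (0 to 1)."""
    tokens_a, tokens_b = tokens(item_a), tokens(item_b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def mean_pairwise_similarity(items):
    """Average Jaccard similarity over all item pairs of a candidate scale."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical item phrasings, not taken from any real questionnaire.
candidate_items = [
    "I enjoy meeting new people.",
    "I enjoy trying new foods.",
    "I worry about the future.",
]
print(mean_pairwise_similarity(candidate_items))
```

A value like this could, in principle, enter an objective function as one term to be minimized or maximized alongside psychometric criteria; how the indicators are actually weighted is not stated in the abstract.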

Applications of Integer Programming by the Example of Anticlustering

Martin Papenberg

Heinrich Heine University Düsseldorf, Germany

Integer programming (IP) is a general framework for automatically solving computationally difficult problems that arise in research planning and data analysis. Despite important applications in many areas of psychology, including cluster analysis, test design, and the assembly of stimulus sets, IP remains comparatively unknown. The present talk gives a primer on the IP method and presents an example application that employs an IP model to create stimulus sets in experimental psychology. IP uses mathematical modelling to represent a problem as a combination of (a) decision variables, (b) an objective function to be optimized, and (c) a system of constraints. Unlike heuristic methods such as local neighbourhood search or ant colony optimization, IP guarantees globally optimal results according to the objective function. Although attaining globally optimal results becomes computationally infeasible in theory as the problem size increases, IP often finds optimal results even for large real-life applications. In an illustrative example, we partitioned a large stimulus set of 400 images (https://doi.org/10.3758/s13428-024-02351-1) into 10 subsets. The subset partitioning optimized two objectives in a sequential procedure. First, we used a graph coloring IP model to obtain globally optimal visual dissimilarity of images assigned to the same set. Via anticlustering, we then maximized similarity of rating variables (familiarity, visual complexity, mental imagery) between subsets, while maintaining the optimal within-set dissimilarity. The example was implemented using the free and open source R package anticlust (https://cran.r-project.org/package=anticlust). All code and data needed to implement the application are available from the Open Science Framework (https://osf.io/caqug/).
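The three ingredients named above (decision variables, objective function, constraints) can be made concrete with a toy anticlustering problem. The sketch below is not the model from the talk and does not use the anticlust package: it assumes six hypothetical rating values, two equal-sized groups, and a diversity-style objective (sum of within-group pairwise distances), and it finds the global optimum by exhaustively enumerating all binary assignments rather than calling an IP solver. The objective is written with products of binary variables for readability; a solver-ready IP model would typically linearize these terms.

```python
from itertools import product

# Hypothetical toy data: one rating value per stimulus (not from the talk).
ratings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n_items, n_groups = len(ratings), 2
group_size = n_items // n_groups


def is_feasible(x):
    """Constraints: each item is in exactly one group; groups are equal-sized."""
    one_group_each = all(sum(x[i]) == 1 for i in range(n_items))
    balanced = all(
        sum(x[i][k] for i in range(n_items)) == group_size
        for k in range(n_groups)
    )
    return one_group_each and balanced


def objective(x):
    """Sum of pairwise distances between items that share a group."""
    total = 0.0
    for i in range(n_items):
        for j in range(i + 1, n_items):
            same_group = sum(x[i][k] * x[j][k] for k in range(n_groups))
            total += same_group * abs(ratings[i] - ratings[j])
    return total


def solve():
    """Enumerate every binary decision vector and keep the best feasible one."""
    best_value, best_x = None, None
    for flat in product((0, 1), repeat=n_items * n_groups):
        # Decision variables: x[i][k] == 1 means item i is assigned to group k.
        x = [flat[i * n_groups:(i + 1) * n_groups] for i in range(n_items)]
        if not is_feasible(x):
            continue
        value = objective(x)
        if best_value is None or value > best_value:
            best_value, best_x = value, x
    groups = [
        [i for i in range(n_items) if best_x[i][k]] for k in range(n_groups)
    ]
    return best_value, groups


best_value, groups = solve()
print(best_value, groups)
```

Enumeration is only workable at this toy scale (4,096 candidate vectors here); its role is to show what "globally optimal according to the objective function" means, whereas a dedicated IP solver reaches the same guarantee without visiting every candidate.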

Using Binary Integer Programming to Revise the Big Five Triplets

Jan Killisch¹, Susanne Frick², Eunike Wetzel¹

¹Rheinland-Pfälzische Technische Universität Kaiserslautern-Landau; ²TU Dortmund

Introduction. In the multidimensional forced-choice (MFC) format, respondents are asked to rank blocks of items according to how well the items describe the respondents’ personality. To construct an MFC questionnaire, items must first be assigned to blocks. Second, blocks must be selected to be included in the final test. Both steps are combinatorially challenging. This is why heuristic optimization algorithms for block assembly and selection are frequently used. In comparison, exact optimization using binary integer programming (BIP) has not been applied empirically yet.

Objectives. We present a first empirical application of BIP to MFC test construction and identify key areas for further methodological developments.

Methods. We empirically revised the Big Five Triplets (BFT) using BIP. For this purpose, we re-analyzed 9 datasets (N = 6,070) to create an item pool of 278 items and applied BIP to assemble MFC blocks while optimizing factor loadings under constraints. To investigate the method in detail, we systematically tested 907,130 combinations of constraints. A 69-block solution was then chosen and piloted in an online access panel (N = 1,051). In a second instance of BIP, we optimized the trace of block information to select a subset of 33 blocks. Finally, a validation study was run (N = 1,099).

Results. We improved the BFT’s construct validity by using BIP for MFC test construction.

Discussion. BIP is feasible for MFC test construction, but further research is needed before it becomes a viable alternative to heuristic optimization approaches. We discuss key areas for future developments.

Technology-Based Speed Tests: The Added Value of Timing Data for Item Selection

Janine Buchholz, Thomas Canz

Institute for Educational Quality Improvement (IQB), Berlin

Educational and psychological assessments typically distinguish between two major test formats: power tests and speed tests. Speed tests, the focus of this study, are timed assessments consisting of many relatively easy items, capturing the number of correct responses given within a certain time limit. They are widely used to assess constructs such as processing speed, attention, and fluency - e.g., reading fluency which reflects the degree of automatization in reading and is considered a key prerequisite for reading comprehension.
While many widely used instruments are paper-based, the growing use of technology-based testing offers several advantages, including immediate feedback and, importantly, the tracking of item-level response times. These data provide valuable diagnostic information that can inform and enhance test development.
In this presentation, we demonstrate how response-time data - combined with theory-driven hypotheses about the target construct - can be used to refine item selection. Data come from a reading fluency assessment administered to elementary school students (grades 2-4), comprising both a word recognition and a sentence verification task. Hypotheses about item difficulty and time demands were derived from psycholinguistic research (e.g., falsifying incorrect sentences is expected to take longer than verifying correct ones). Items were classified by surface characteristics, then flagged based on empirical difficulty and response time. Using concrete examples, we illustrate how selection decisions are made by integrating psychometric properties with theory-based contextual information.
This ongoing work supports more theory-informed item selection, with particular attention to contextual information. Preliminary results point toward promising directions for future research and application.