Journal of Applied Measurement

A publication of the Department of Educational Psychology and Counseling
National Taiwan Normal University

Volume 25, Issue 1/2 (2024)
Special Issue in Commemoration of Prof. Wen-Chung Wang, Part 1

Life and Contributions of Professor Wen-Chung Wang: From His Loving Wife

Hsueh-Chu Chen
The Education University of Hong Kong

Citation:
Chen, H.-C. (2024). Life and contributions of Professor Wen-Chung Wang: From his loving wife. Journal of Applied Measurement, 25(1/2), 1–5.


Obituary, Wen-Chung Wang, 1961–2018

Mark Wilson
University of California, Berkeley
Magdalena Mok
The Education University of Hong Kong
National Taichung University of Education

Citation:
Wilson, M., & Mok, M. (2024). Obituary, Wen-Chung Wang, 1961–2018. Journal of Applied Measurement, 25(1/2), 6–11. (Reprinted from "Obituary," 2018, British Journal of Mathematical and Statistical Psychology, 71[3], 561–566, https://doi.org/10.1111/bmsp.12138)


Assessing Differential Item Functioning in Computerized Adaptive Testing

Ching-Lin Shih
National Sun Yat-sen University
Kuan-Yu Jin
Hong Kong Examinations and Assessment Authority
Chia-Ling Hsu
Hong Kong Examinations and Assessment Authority

To implement computerized adaptive testing (CAT), it is critical to monitor the parameter stability of operational items and to check the quality of newly written items. In particular, assessing differential item functioning (DIF) is a vital step in ensuring test fairness and improving test reliability and validity. This study investigated the performance in CAT of several nonparametric DIF assessment methods: the odds ratio (OR; Jin et al., 2018) approach, the modified Mantel–Haenszel method (Zwick et al., 1994a, 1994b; Zwick & Thayer, 2002), the modified logistic regression method (Lei et al., 2006), and the CAT version of the simultaneous item bias test (SIBTEST) method (Nandakumar & Roussos, 2001), via a series of simulation studies. The results showed that the OR approach outperformed the other three methods in controlling false positive rates and producing high true positive rates when a test contained many DIF items. Moreover, combining the OR approach with a scale purification procedure further improved DIF assessment in CAT when the percentage of DIF items exceeded 10%.
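
For readers unfamiliar with odds-ratio-based DIF screening, the following is a minimal sketch of a Mantel–Haenszel-style common odds ratio for a single studied item, with examinees matched on a total-score variable. It illustrates the general idea only, not the OR procedure of Jin et al. (2018) or the CAT-specific methods compared in the article; the function name and simulated data are hypothetical.

```python
# Minimal sketch: Mantel-Haenszel-style common odds ratio for one studied item.
# A value near 1 suggests no DIF; values far from 1 flag the item for review.
import numpy as np

def mh_odds_ratio(item_correct, matching_score, is_focal):
    """item_correct:   0/1 responses to the studied item
    matching_score: matching variable (e.g., number-correct or binned CAT theta)
    is_focal:       True for focal-group examinees, False for reference group"""
    item_correct = np.asarray(item_correct)
    matching_score = np.asarray(matching_score)
    is_focal = np.asarray(is_focal, dtype=bool)

    num, den = 0.0, 0.0
    for s in np.unique(matching_score):        # one 2x2 table per score stratum
        in_s = matching_score == s
        ref, foc = in_s & ~is_focal, in_s & is_focal
        a = np.sum(item_correct[ref] == 1)     # reference group, correct
        b = np.sum(item_correct[ref] == 0)     # reference group, incorrect
        c = np.sum(item_correct[foc] == 1)     # focal group, correct
        d = np.sum(item_correct[foc] == 0)     # focal group, incorrect
        t = a + b + c + d
        if t > 0:
            num += a * d / t
            den += b * c / t
    return num / den if den > 0 else np.nan

# Illustrative call with simulated data for 200 examinees
rng = np.random.default_rng(0)
focal = rng.integers(0, 2, 200).astype(bool)
score = rng.integers(0, 21, 200)
responses = rng.integers(0, 2, 200)
print(mh_odds_ratio(responses, score, focal))
```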

Keywords: differential item functioning, computerized adaptive test, scale purification, odds ratio, CATSIB

Citation:
Shih, C.-L., Jin, K.-Y., & Hsu, C.-L. (2024). Assessing differential item functioning in computerized adaptive testing. Journal of Applied Measurement, 25(1/2), 12–25.


Exploratory Differential Item Functioning Assessment With Multiple Grouping Variables: The Application of the DIF-Free-Then-DIF Strategy

Jyun-Hong Chen
National Cheng Kung University
Chi-Chen Chen
National Academy for Educational Research, Taiwan
Hsiu-Yi Chao
Soochow University

To ensure test fairness and validity, it is crucial for test practitioners to assess differential item functioning (DIF) simultaneously for all grouping variables to avoid omitted variable bias (OVB; Chao et al., 2018). In testing practice, however, DIF assessment is often hampered by insufficient information, such as the absence of DIF-free anchor items. This scenario, referred to as exploratory DIF assessment involving multiple grouping variables, has received limited attention, even though it hinges on accurately identifying DIF-free items to serve as anchors for all grouping variables. To address this issue, this study proposed the parallel DIF-free-then-DIF (p-DFTD) strategy, which selects DIF-free items simultaneously for each grouping variable and utilizes them as anchors in the constant item method for DIF assessment. A comprehensive simulation study was conducted to evaluate the performance of the p-DFTD strategy. The findings revealed that the conventional approach of assessing DIF with one grouping variable at a time was vulnerable to OVB, leading to inflated Type I error rates. In contrast, the p-DFTD strategy successfully identified DIF-free anchor items and effectively controlled Type I errors while maintaining satisfactory statistical power in most conditions. The empirical analysis further supported these findings, showing that the p-DFTD strategy provided more accurate and consistent DIF detection than methods that do not account for all grouping variables simultaneously. In conclusion, the p-DFTD strategy, which demonstrated robust performance in this study, holds promise as a reliable approach for test developers to conduct exploratory DIF assessments involving multiple grouping variables, thereby ensuring fairness and validity in testing practices.
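
As background to the omitted variable bias issue, the sketch below shows a generic logistic-regression DIF screen in which all grouping variables enter one model at the same time. It is not the p-DFTD strategy itself; the variable names and simulated data are hypothetical, and in practice the matching variable would be based on purified or DIF-free anchor items.

```python
# Minimal sketch: logistic-regression DIF screen with two grouping variables
# entered jointly, so neither acts as an omitted variable for the other.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 1000
matching_score = rng.integers(0, 31, n)   # matching variable (anchor-based in practice)
gender = rng.integers(0, 2, n)            # grouping variable 1
language = rng.integers(0, 2, n)          # grouping variable 2

# Simulated studied item with uniform DIF against the language group only
true_logit = -3 + 0.2 * matching_score - 0.8 * language
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(np.column_stack([matching_score, gender, language]))
fit = sm.Logit(y, X).fit(disp=False)
print(fit.params)    # a sizeable coefficient on the language column flags uniform DIF
print(fit.pvalues)
```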

Keywords: differential item functioning, DIF-free-then-DIF strategy, scale purification, omitted variable bias

Citation:
Chen, J.-H., Chen, C.-C., & Chao, H.-Y. (2024). Exploratory differential item functioning assessment with multiple grouping variables: The application of the DIF-free-then-DIF strategy. Journal of Applied Measurement, 25(1/2), 26–47.


Enhancing DIF Detection in Cognitive Diagnostic Models: An Improved LCDM Approach Using TIMSS Data

Su-Pin Hung
National Cheng Kung University
Po-Hsi Chen
National Taiwan Normal University
Hung-Yu Huang
National Cheng Kung University

Cognitive diagnostic models (CDMs) are increasingly utilized in educational research, especially for analyzing large-scale international assessment datasets that compare skill mastery across countries and gender groups. However, establishing test invariance before making such comparisons is critical to ensure that differential item functioning (DIF) does not distort the results. Earlier research on DIF detection within CDMs has predominantly dealt with non-compensatory models, leaving the influence of several factors unclear. This study addresses these issues with the log-linear cognitive diagnosis model (LCDM; Henson et al., 2009), enhancing its practical applicability. The aim is to improve the LCDM-DIF method and to evaluate the efficacy of the purification procedure when total scores serve as matching criteria for the non-parametric methods, namely logistic regression (LR) and the Mantel–Haenszel (MH) procedure. Factors examined include test length, percentage of DIF items, DIF magnitude, group distributions, and sample size. Using data from the Trends in International Mathematics and Science Study (TIMSS) and considering participants' nationality and gender, an empirical study gauges the performance of the proposed methods. The results reaffirm that the model-based method surpasses the MH and LR methods in controlling Type I errors and achieving higher power rates. Additionally, the LCDM-based approach offers broader insights into the results. The study discusses its value, potential applications, and future research areas, emphasizing the significance of tackling contaminated matching criteria in DIF detection within CDMs.
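
For reference, the item response function of the LCDM (Henson et al., 2009) is commonly written as below, where $\lambda_{j,0}$ is the intercept for item $j$, $\boldsymbol{\lambda}_j$ collects the main-effect and interaction weights, and $\mathbf{h}(\boldsymbol{\alpha}_i, \mathbf{q}_j)$ maps examinee $i$'s attribute pattern and the item's Q-matrix entries to those effects. The notation follows common usage and is not reproduced from the article.

$$
P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) =
\frac{\exp\!\left(\lambda_{j,0} + \boldsymbol{\lambda}_j^{\top}\mathbf{h}(\boldsymbol{\alpha}_i, \mathbf{q}_j)\right)}
{1 + \exp\!\left(\lambda_{j,0} + \boldsymbol{\lambda}_j^{\top}\mathbf{h}(\boldsymbol{\alpha}_i, \mathbf{q}_j)\right)}
$$

In a model-based DIF analysis, group-specific item parameters under this form can be compared directly, which is one source of the additional interpretive detail the abstract attributes to the LCDM-based approach.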

Keywords: cognitive diagnostic models, differential item functioning, log-linear cognitive diagnosis model, purification procedure

Citation:
Hung, S.-P., Chen, P.-H., & Huang, H.-Y. (2024). Enhancing DIF detection in cognitive diagnostic models: An improved LCDM approach using TIMSS data. Journal of Applied Measurement, 25(1/2), 48–74.


Examining Illusory Halo Effects Across Successive Writing Assessments: An Issue of Stability and Change

Thomas Eckes
TestDaF Institute, University of Bochum
Kuan-Yu Jin
Hong Kong Examinations and Assessment Authority

Halo effects are a common source of rating errors in assessments. When raters assign scores to examinees on multiple performance dimensions or criteria, they may fail to discriminate between them, lowering each criterion's information value regarding an examinee's performance. Using the mixture Rasch facets model for halo effects (MRFM-H), we studied halo tendencies in four successive high-stakes writing assessments administered to 15,677 examinees over a 10-month period and scored by 162 raters. The MRFM-H separates illusory halo, which stems from judgmental biases, from true halo, which reflects the actual overlap between the criteria. Applying this model, we aimed to detect illusory halo effects in the first exam and to track their occurrence across the other three exams. We also ran the standard Rasch facets model (RFM) and computed raw-score correlational and standard deviation halo indices, rH and SDH, for comparison purposes. The findings revealed that (a) the MRFM-H fit the rating data better than the RFM in all four exams, (b) in the first exam, 11 out of 100 raters exhibited illusory halo effects, (c) the halo raters showed evidence of both stability and change in their rating tendencies over exams, (d) the non-halo raters mostly remained stable, (e) the rH and SDH statistics did not distinguish between the halo and non-halo raters, and (f) the illusory halo effects had a small but demonstrable impact on examinee rank orderings, which may have consequences for selection decisions. The discussion focuses on the model's practical implications for performance assessments, such as rater training, monitoring, and selection, and highlights future research perspectives.
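
The raw-score indices rH and SDH can be operationalized in several ways; the sketch below shows one common choice, computed from a single rater's examinee-by-criteria score matrix. The exact definitions used in the article may differ, and the function name and simulated ratings are hypothetical.

```python
# Minimal sketch: two raw-score halo indices for one rater.
# ratings: examinees x criteria matrix of scores assigned by that rater.
import numpy as np

def halo_indices(ratings):
    ratings = np.asarray(ratings, dtype=float)
    # rH: mean pairwise correlation among criteria (higher -> stronger halo)
    corr = np.corrcoef(ratings, rowvar=False)
    r_h = corr[np.triu_indices_from(corr, k=1)].mean()
    # SDH: mean within-examinee standard deviation across criteria
    # (smaller -> less differentiation between criteria -> stronger halo)
    sd_h = ratings.std(axis=1, ddof=1).mean()
    return r_h, sd_h

rng = np.random.default_rng(2)
demo = rng.integers(1, 6, size=(50, 4))   # 50 examinees, 4 criteria, 1-5 scale
print(halo_indices(demo))
```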

Keywords: rater effects, illusory halo, writing assessment, Rasch measurement, mixture Rasch model

Citation:
Eckes, T., & Jin, K.-Y. (2024). Examining illusory halo effects across successive writing assessments: An issue of stability and change. Journal of Applied Measurement, 25(1/2), 75–95.


Making Multiple Regression Narratives Accessible: The Affordances of Wright Maps

Alexander Mario Blum
Stanford University Graduate School of Education
Enrich Your Academics
James M. Mason
University of California, Berkeley
Aaryan Shah
Stanford University
Sam Brondfield
University of California, San Francisco

Wright Maps have been an important tool in promoting meaning making about measurement results for measurement experts and substantive researchers alike. However, their potential to do so for latent regression results is underexplored. In this paper, we augmented Wright Maps with hypothetical group mean locations corresponding to realistic scenarios of interest. We analyzed data from an instrument measuring the cognitive load experienced by medical fellows and residents while performing inpatient consults. We focused on extraneous load (EL; i.e., distraction) and variables potentially associated with distraction. Through collaborative examination of the Wright Map, we found not only regions corresponding to construct levels but also a region with important practical consequences: the third threshold represented a critical level of cognitive load, one that could impact patient care. We augmented the Wright Map with the locations of two typical scenarios differing only in the novelty of the consult, representing the lowest and highest levels of novelty, respectively. These group locations were plotted on the Wright Map approximately 1.5 logits apart, allowing for a kind of visual relative effect size, as this difference can be perceived relative to other features of the Wright Map. In this case, both scenarios were within the same band of the Wright Map, leading to the practical interpretation that although EL was significantly increased, the risk of cognitive overload was not. However, because of the problematic nature of the third threshold, a 1.5-logit difference does not have the same practical consequences along the entire scale; other realistic scenarios with higher initial EL are possible, in which increased novelty could lead to cognitive overload. This kind of visualization technique, along with a combinatorial view of a multiple-regression analysis, could be helpful in other substantive and practical contexts and with more complex regression models.
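
The sketch below illustrates the kind of augmented display the abstract describes: item thresholds arranged on the logit scale with two scenario means drawn as reference lines 1.5 logits apart, so their separation can be read against the threshold bands. All thresholds and group locations here are hypothetical, not the study's estimates.

```python
# Minimal sketch: a Wright-Map-style plot augmented with two hypothetical
# group (scenario) locations whose separation acts as a visual effect size.
import matplotlib.pyplot as plt

thresholds = {                          # hypothetical item thresholds (logits)
    "Item 1": [-1.8, -0.4, 1.1],
    "Item 2": [-1.2, 0.1, 1.6],
    "Item 3": [-0.9, 0.5, 2.0],
}
low_novelty, high_novelty = -0.5, 1.0   # hypothetical scenario means, 1.5 logits apart

fig, ax = plt.subplots(figsize=(5, 6))
for x, (item, taus) in enumerate(thresholds.items(), start=1):
    ax.scatter([x] * len(taus), taus, marker="_", s=400)
    for k, tau in enumerate(taus, start=1):
        ax.annotate(f"{item}.{k}", (x, tau), textcoords="offset points", xytext=(8, -3))

# Scenario means drawn as horizontal reference lines across the map
ax.axhline(low_novelty, linestyle="--", label="Low-novelty scenario")
ax.axhline(high_novelty, linestyle=":", label="High-novelty scenario")
ax.set_xlim(0.5, len(thresholds) + 0.5)
ax.set_xticks(range(1, len(thresholds) + 1), list(thresholds.keys()))
ax.set_ylabel("Logits")
ax.set_title("Wright Map with hypothetical group locations")
ax.legend()
plt.show()
```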

Keywords: latent regression, Wright map, item response theory, visual relative effect size, combinatorial interpretation

Citation:
Blum, A. M., Mason, J. M., Shah, A., & Brondfield, S. (2024). Making multiple regression narratives accessible: The affordances of Wright maps. Journal of Applied Measurement, 25(1/2), 96–108.


How Does Knowledge About Higher Education Develop at the End of High School: A Longitudinal Analysis of 11th and 12th Graders

Maria Veronica Santelices
Pontificia Universidad Católica de Chile
Millennium Nucleus, Student Experience in Higher Education in Chile
Ximena Catalán
Pontificia Universidad Católica de Chile
Millennium Nucleus, Student Experience in Higher Education in Chile
Magdalena Zarhi
Duoc UC
Juan Acevedo
Universidad de Los Andes de Chile
Catherine Horn
University of Houston

The information and guidance available to secondary school students are positively related to access to higher education. Levels of information, however, have been described as low, especially for students of low socioeconomic status whose parents have not attended higher education. We explore how knowledge about higher education changes between 11th and 12th grade, identifying possible differences across socio-demographic groups and possible effects of school activities. We use a complex multidimensional measure to capture Knowledge and Perception of Knowledge about Higher Education. Results from our study show that students exhibit low levels of information in both knowledge subdimensions. Despite positive changes observed from 11th to 12th grade, the low level of information persists into the last year of high school.

Keywords: information, transition to higher education, multidimensional model

Citation:
Santelices, M. V., Catalán, X., Zarhi, M., Acevedo, J., & Horn, C. (2024). How does knowledge about higher education develop at the end of high school: A longitudinal analysis of 11th and 12th graders. Journal of Applied Measurement, 25(1/2), 109–131.


An Examination of an Epistemic Cognition Developmental Progression: A Rasch Analysis

Man Ching Esther Chan
The University of Melbourne
Mark Wilson
University of California, Berkeley
The University of Melbourne

In the era of post-truth and misinformation, there are continuing calls for an emphasis on critical thinking in school education. Epistemic cognition has been proposed as foundational to critical thinking inside and outside of the classroom. However, due to a lack of understanding of the construct and its development, teachers are not well equipped to foster effective epistemic cognition, and thus critical thinking, in the classroom. Drawing from previous literature, an assessment tool was piloted and subsequently administered to 168 Year 8 students (13- to 14-year-olds) in Melbourne, Australia, to examine the theorized development of epistemic cognition. Students' responses were examined qualitatively using think-aloud protocols and quantitatively using classical test theory and Rasch modelling. The instrument showed good person-separation reliability (.76) and internal consistency (Cronbach's alpha = .79). Based on the analysis, the student responses generally aligned with the theorized construct map, demonstrating strong construct validity. The findings offer empirical evidence for a developmental progression in epistemic cognition, which may be used to inform the teaching of critical thinking.
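
For readers who want to reproduce the kind of internal-consistency summary reported above, the following is a minimal sketch of Cronbach's alpha computed from an examinees-by-items score matrix. It uses the standard formula and simulated data; it is not the authors' analysis code.

```python
# Minimal sketch: Cronbach's alpha for an examinees x items score matrix.
import numpy as np

def cronbach_alpha(scores):
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # per-item variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(3)
demo = rng.integers(0, 2, size=(168, 20))        # e.g., 168 students, 20 dichotomous items
print(cronbach_alpha(demo))
```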

Keywords: epistemic cognition, developmental progression, item response theory, test development, psychometrics

Citation:
Chan, M. C. E., & Wilson, M. (2024). An examination of an epistemic cognition developmental progression: A Rasch analysis. Journal of Applied Measurement, 25(1/2), 132–149.
