Journal of Applied Measurement

A publication of the Department of Educational Psychology and Counseling
National Taiwan Normal University


Newly accepted!

Differences in School Leaders’ and Teachers’ Perceptions on School Emphasis on Academic Success: An Exploratory Comparative Study

Sijia Zhang and Cheng Hua
Accepted on: 16 October 2024

Abstract

This quantitative study examined how principals and teachers from all participating countries and regions perceive school emphasis on academic success (SEAS) differently. Participants (N = 26,302) were all principals and teachers who completed the SEAS scale from PIRLS 2021. A second-order confirmatory factor analysis and a many-faceted Rasch analysis were used to investigate the psychometric properties of the SEAS and whether school leaders’ and teachers’ perceptions of the construct differed within and across countries. The factor analysis yielded a three-factor solution, and the SEAS scale demonstrated satisfactory psychometric properties. The Rasch analysis indicated good model-data fit at both the overall and item levels. This study explored a new instrument for measuring academic emphasis and compared leaders’ and teachers’ perceptions of SEAS in an international setting. Future studies are encouraged to explore the psychometric properties of the SEAS and how it relates to other school-related variables and student outcomes.
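
For readers unfamiliar with the method, a many-facet Rasch analysis of this kind typically adds a facet for respondent role (school leader vs. teacher) alongside the person, item, and threshold parameters. A minimal sketch of such a model, with notation assumed for illustration rather than taken from the article, is:

```latex
% Hedged sketch of a many-facet rating scale formulation (not the authors' exact specification):
% probability that respondent n in group g (school leader vs. teacher) endorses category k
% rather than k-1 on SEAS item i.
\ln\!\left(\frac{P_{nigk}}{P_{nig(k-1)}}\right) = \theta_n - \delta_i - \gamma_g - \tau_k
```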


Analysis of Multidimensional Forced-Choice Items using Rasch Ipsative Models with ConQuest

Xuelan Qiu and Dan Cloney
Accepted on: 24 September 2024

Abstract

Multidimensional forced-choice (MFC) items have been widely used to assess career interests, values, and personality because they can prevent response biases. This tutorial first introduces the typical types of MFC items and the item response theory models used to analyze them. It then shows how to analyze dichotomously and polytomously scored MFC items with paired statements under the Rasch ipsative models using the computer program ACER ConQuest. The assessment of differential statement functioning in ConQuest is also demonstrated.
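
As a rough illustration of the modeling idea (not the exact parameterization used in the tutorial or in ConQuest), the probability that person n prefers statement s, measuring dimension d, over statement t, measuring dimension d′, in a dichotomously scored pair can be written in a Rasch-type form:

```latex
% Illustrative Rasch-type formulation for a paired-statement forced-choice comparison;
% the Rasch ipsative models discussed in the tutorial impose additional (ipsative)
% constraints on the latent traits that are not shown here.
P(s \succ t) = \frac{\exp\{(\theta_{nd} + \delta_s) - (\theta_{nd'} + \delta_t)\}}
                    {1 + \exp\{(\theta_{nd} + \delta_s) - (\theta_{nd'} + \delta_t)\}}
```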


Modeling the Effect of Reading Item Clarity on Item Discrimination

Paul Montuoro and Stephen Humphry
Accepted on: 16 August 2024

Abstract

The logistic measurement function (LMF) satisfies Rasch’s criteria for measurement while allowing discrimination to vary among sets of items. Previous research has shown how the LMF can be applied in test equating. This article demonstrates the advantages of dividing reading test items into three sets and then applying the LMF instead of the Rasch model. The first objective is to examine the effect of item clarity and transparency on item discrimination using a new technique for dividing reading items into sets, referred to as an item clarity review. In this article, the technique is used to divide the items of a reading test with different levels of discrimination into three sets. The second objective is to show that, where three such sets exist, applying the LMF leads to improved item fit compared to the standard Rasch model and, in turn, to the retention of more items. The item sets were shown to have different between-set discrimination but relatively uniform within-set discrimination. The results show that, in this context, reading test item clarity and transparency affect item discrimination. These findings and other implications are discussed.
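
For orientation, a simplified form of a model in which items within a set share a common discrimination is shown below; the notation is illustrative and may differ from the authors’ formulation of the LMF.

```latex
% Items are grouped into sets s; items within a set share a scale (discrimination)
% parameter rho_s, so the Rasch model is the special case rho_s = 1 for every set.
P(X_{ni}=1) = \frac{\exp\{\rho_{s(i)}(\theta_n - \delta_i)\}}{1 + \exp\{\rho_{s(i)}(\theta_n - \delta_i)\}}
```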


Psychometric Properties of the Statistical Anxiety Scale and the Current Statistics Self-Efficacy Using Rasch Analysis in a Sample of Community College Students

Samantha Estrada Aguilera, Emily Barena, and Erica Martinez
Accepted on: 2 February 2024

Abstract

Community college students have rarely been the focus of study within statistics education research. This study examines the psychometric properties of two popular scales used in statistics education, the Current Statistics Self-Efficacy (CSSE) scale and the Statistical Anxiety Scale (SAS), focusing on a population of community college students. A survey was administered to N = 161 community college students enrolled in an introductory statistics course. The unidimensional structure of the CSSE was confirmed using confirmatory factor analysis (CFA), and after selecting the rating scale model approach, we found no misfitting items and good reliability. Concurrent and discriminant validity were examined using the SAS. The three-factor structure of the SAS was also assessed, along with item fit. One item in the SAS Fear of Asking for Help subscale was flagged as misfitting. Overall, both the CSSE and the SAS demonstrated sound psychometric properties when used with a population of community college students.
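
As a reminder, the rating scale model applied to the CSSE items takes the standard Andrich form (generic notation, not taken from the article):

```latex
% Andrich rating scale model: person measure theta_n, item location delta_i,
% and a common set of threshold parameters tau_k shared by all items.
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k
```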


Using Explanatory Item Response Models to Evaluate Surveys

Jing Li and George Engelhard
Accepted on: 30 January 2024

Abstract

This study evaluates the psychometric quality of surveys by using explanatory item response models. The specific focus is on how item properties can be used to improve the meaning and usefulness of survey results. The study uses a food insecurity survey, the Household Food Security Survey Module (HFSSM), as a case study. The case study examines 500 households with data collected between 2012 and 2014 in the United States. Eleven items from the HFSSM are classified in terms of two item properties: referent (household, adult, and child) and content (worry, ate less, cut meal size, hungry, and did not eat for a whole day). A set of explanatory linear logistic Rasch models is used to explore the relationships between these item properties and item locations on the food insecurity scale. The results suggest that both the referent and the item content are significant predictors of item location on the food insecurity scale. Explanatory item response models are thus a promising method for examining the psychometric quality of surveys: by providing insight into the relationship between item properties and survey responses, they can enhance the meaning and usefulness of survey results and help researchers ensure that surveys measure what they intend to measure. This, in turn, can lead to better-informed policy decisions and interventions aimed at tackling social issues such as food insecurity, poverty, and inequality.
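
The explanatory approach described here follows the linear logistic test model (LLTM), in which item locations are decomposed into effects of item properties; a generic form, with q_ik denoting the design weight of property k for item i, is:

```latex
% Linear logistic test model (LLTM): the location of item i is a weighted sum of
% property effects eta_k (here, referent and content effects).
\ln\!\left(\frac{P(X_{ni}=1)}{P(X_{ni}=0)}\right) = \theta_n - \sum_{k} q_{ik}\,\eta_k
```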


Examining Equivalence of Three Versions of Mathematics Tests in China’s National College Admission Examination Using a Single Group Design

Chunlian Jiang, Stella Yun Kim, Chuang Wang, and Jincai Wang
Accepted on: 12 September 2023

Abstract

The National College Admission Examination, also known as the Gaokao, is the most competitive examination in China because students’ Gaokao scores are used as the sole criterion to screen applicants for college admission. Chinese students’ Gaokao scores are also accepted by many universities around the world. A one-syllabus, multiple-tests practice has been in place since 1985, but little research has examined the extent to which the multiple tests are equivalent. This study examines the equivalence of three versions of the Gaokao mathematics test and illustrates the methodological procedure using a single group design with an item response theory (IRT) approach. The results indicated that the three versions were comparable in terms of content coverage; however, most items were found to be easy for the students, so more challenging items should be included to distinguish students with average and high mathematics competencies. Some differences were also noted in the differential item functioning analyses and the factor structure.
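
One standard way to compare forms under a single group design, broadly consistent with the approach described here (the exact linking procedure is not specified in the abstract), is to compare test characteristic curves after calibrating all forms on a common scale:

```latex
% Test characteristic curve of form v: the expected proportion-correct at ability theta,
% averaged over that form's n_v items. Similar curves across forms suggest comparable tests.
T_v(\theta) = \frac{1}{n_v}\sum_{i \in v} P_i(\theta)
```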


Examining the Psychometric Properties of the Spanish Version Life Orientation Test-Revised (LOT-R): Finding Multi-dimensionality and Issues with Translation for Reverse-Scored Items

Rosalba Hernandez, Kendon J. Conrad, John Ruiz, Melissa Flores, Judith T. Moskowitz, Linda C. Gallo, Erin L. Merz, Frank J. Penedo, Ramon A. Durazo-Arvizu, Angelica P. Gutierrez, Jinsong Chen, and Martha L. Daviglus
Accepted on: 18 September 2023

Abstract

The Life Orientation Test-Revised (LOT-R) is the most widely used instrument for assessing dispositional optimism. We examined the psychometric properties of the LOT-R in a diverse sample of U.S. Hispanics/Latinos. The analysis included 5,140 adults aged 18–74 from the Hispanic Community Health Study/Study of Latinos and its Sociocultural Ancillary Study. We employed the Rasch measurement model using Winsteps software. The Rasch person reliability for the 6-item LOT-R and Cronbach’s alpha were both 0.54. When testing convergent validity, correlations were statistically significant but small to medium in magnitude. Neither the ratio of the variance explained by the measures to the variance explained by the first contrast nor the correlation between the subscales met the expected unidimensionality criteria. The item “I hardly ever expect things to go my way” displayed differential item functioning by language (Spanish vs. English), and the reverse-scored items were found to be problematic. Use of the LOT-R in its present form with U.S. Hispanics/Latinos is not supported by the psychometric evidence.
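
For context, the Rasch person reliability reported here is conventionally computed as the ratio of estimated true person variance to observed person variance; the formula below is the standard definition, not a result from the article:

```latex
% Rasch person (separation) reliability: observed person variance minus the mean-square
% measurement error of the person measures, divided by the observed person variance.
R_{person} = \frac{SD_{person}^2 - \mathrm{MSE}}{SD_{person}^2}
```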


Validity and Test-Length Reduction Strategies for Complex Assessments

Lance M. Kruse, Gregory E. Stone, Toni A. May, and Jonathan D. Bostic
Accepted on: 12 April 2023

Abstract

Lengthy standardized assessments decrease instructional time while increasing concerns about student cognitive fatigue. This study presents a methodological approach to item reduction within a complex assessment setting using the Problem Solving Measure for Grade 6 (PSM6). Five item-reduction methods were used to reduce the number of items on the PSM6, and each shortened instrument was evaluated through validity evidence for test content, internal structure, and relationships to other variables. The two quantitative methods (Rasch model and point-biserial) produced the psychometrically strongest shortened assessments but did not represent all content subdomains, while the three qualitative (content-preservation) methods produced psychometrically weaker assessments that retained all subdomains. Specifically, the ten-item Rasch and ten-item point-biserial shortened tests demonstrated the strongest overall validity evidence, but future research is needed to explore the psychometric performance of these versions in a new, independent sample and the necessity of subdomain representation. The study provides a methodological framework that researchers can use to reduce the length of existing instruments while identifying how the various reduction strategies may sacrifice different information from the original instrument. Practitioners are encouraged to examine carefully the extent to which a reduced instrument aligns with their predetermined criteria.
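
As a sketch of how one of the quantitative reduction strategies works, the snippet below computes corrected point-biserial correlations (each item against the rest-of-test score) and keeps the ten highest-ranking items. It is a generic illustration with toy data, not the authors’ code or the PSM6 items.

```python
import numpy as np

def corrected_point_biserial(x):
    """Point-biserial correlation of each item with the rest-of-test score.

    x: (n_persons, n_items) array of 0/1 scored responses.
    """
    n_items = x.shape[1]
    total = x.sum(axis=1)
    r = np.empty(n_items)
    for i in range(n_items):
        rest = total - x[:, i]          # corrected total: exclude the item itself
        r[i] = np.corrcoef(x[:, i], rest)[0, 1]
    return r

# Illustrative use: retain the 10 items with the highest corrected point-biserials.
rng = np.random.default_rng(1)
x = (rng.uniform(size=(200, 20)) < 0.6).astype(int)   # toy data only
keep_10 = np.argsort(corrected_point_biserial(x))[::-1][:10]
print(sorted(keep_10.tolist()))
```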


Impact of Violation of Equal Item Discrimination on Rasch Calibrations

Chunyan Liu, Wenli Ouyang, and Raja Subhiyah
Accepted on: 8 March 2023

Abstract

The Rasch model, a widely used item response theory (IRT) model, assumes equal discrimination across all items when estimating item difficulties and examinee proficiencies. However, the impact on Rasch calibrations of item misfit due to violations of the equal-discrimination assumption remains somewhat unclear. In this simulation study, we assess the effects of balanced and systematic variation in item discrimination on Rasch difficulty estimates and Rasch model fit statistics. Our findings suggest that item misfit due to unequal item discrimination can negatively affect item difficulty estimates and INFIT/OUTFIT statistics for both misfitting and well-fitting items. Test developers may find our results useful for improving the accuracy of item difficulty estimates and, ultimately, of the estimated examinee proficiencies.
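
For reference, the kind of misfit studied here arises when responses follow a two-parameter logistic (2PL) model with varying discriminations a_i but are calibrated with the Rasch model (all a_i = 1); INFIT and OUTFIT are the usual mean-square diagnostics of that misfit. The definitions below are standard, not the authors’ derivations.

```latex
% Generating model (2PL); the Rasch model is the special case a_i = 1 for all items.
P(X_{ni}=1) = \frac{\exp\{a_i(\theta_n - b_i)\}}{1 + \exp\{a_i(\theta_n - b_i)\}}

% Standardized residual, and the unweighted (OUTFIT) and information-weighted (INFIT)
% mean-square fit statistics for item i over N persons, where E_ni is the model-expected
% response and W_ni its model variance under the fitted Rasch model.
z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{W_{ni}}}, \qquad
\mathrm{OUTFIT}_i = \frac{1}{N}\sum_n z_{ni}^2, \qquad
\mathrm{INFIT}_i = \frac{\sum_n W_{ni}\, z_{ni}^2}{\sum_n W_{ni}}
```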


A Rasch Analysis of the Mindful Attention Awareness Scale—Children

Daniel Strissel and Julia A. Ogg
Accepted on: 19 December 2022

Abstract

The Mindful Attention Awareness Scale—Children (MAAS-C) was developed using traditional psychometric methods to measure dispositional mindfulness in children. The MAAS-C is based on the MAAS, a highly cited mindfulness scale for adults. This study extended that work by applying the Rasch model to investigate the psychometric properties of the MAAS-C. Evidence from Rasch analyses of the MAAS had suggested local dependence among items and the need for modifications, including a rescoring algorithm. We examined how the MAAS-C performed under Rasch analysis using a sample of 406 fifth- and sixth-grade children. All 15 items on the MAAS-C worked in the same direction; the fit statistics fell within a range suitable for productive measurement; a principal components analysis of residuals (PCAR) revealed an unpatterned distribution of residuals; and differential item functioning (DIF) was not found for any of the grouping variables. However, the items were neither evenly distributed nor well targeted for children with the highest levels of dispositional mindfulness. We demonstrated that the precision and item functioning of the MAAS-C could be improved by uniformly rescoring the response categories. Once the scale is rescored, the provided ordinal-to-interval conversion table (see Table 5) can be used to optimize scoring on the MAAS-C.


Psychometric Assessment of an Adapted Social Provisions Survey in a Sample of Adults with Prediabetes

Kathryn E. Wilson, Tzeyu L. Michaud, Cynthia Castro Sweet, Jeffrey A. Katula, Fabio A. Almeida, Fabiana A. Brito, Robert Schwab, & Paul A. Estabrooks
Accepted on: 17 November 2022

Abstract

The relevance of social support for weight management is not well documented in people with prediabetes. An important consideration is the adequate assessment of social provisions related to weight management. We aimed to assess the factor structure and measurement invariance of an adapted Social Provisions Scale specific to weight management (SPS-WM). Participants in a diabetes prevention trial (n = 599) completed a demographic survey and the SPS-WM. Confirmatory analyses tested the factor structure of the SPS-WM, and measurement invariance was assessed across gender, weight status, education level, and age. Removal of two items resulted in acceptable model fit, supporting six correlated factors for social provisions specific to weight management. Measurement invariance was supported across all subgroups. The results support score interpretations for these scales as reflecting distinct components of social support specific to weight management, in alignment with those of the original survey.


Exploring the Impact of Open-Book Assessment on the Precision of Test-Taker and Item Estimates Using an Online Medical Knowledge Assessment

Stefanie A. Wind, Cheng Hua, Stefanie S. Sebok-Syer
Accepted on: 7 November 2022

Abstract

Researchers concerned with the psychometric quality of assessments often examine indicators of model-data fit to inform the interpretation and use of parameter estimates for items and persons. Fit assessment techniques generally focus on overall model fit (i.e., global fit), as well as fit indicators for individual items and persons. In this study, we demonstrate how one can also use item-level information from individual responses (e.g., the use of outside assistance) to explore the impact of such behaviors on model-data fit. Open-book assessment formats, in which test-takers are permitted to consult outside resources while completing assessments, have become increasingly common across a wide range of disciplines and contexts. This study’s purpose was to use data from an open-book format assessment to examine the impact of examinee help-seeking behavior on the psychometric quality of item and person estimates. With our analysis, we illustrated an approach for evaluating model-data fit by combining residual analyses with test-takers’ self-reported information about their test-taking behavior. Our results indicated that the use of outside resources affected the pattern of expected and unexpected responses differently for individual test-takers and individual items. Analysts can use these techniques across a variety of assessment contexts where information about test-taker behavior is available to inform assessment development, interpretation, and use; these include evaluating psychometric properties following pilot testing, maintaining item banks, evaluating results from operational exam administrations, and making decisions about assessment interpretation and use.
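
The residual analysis referred to here relies on the standardized residuals routinely produced by Rasch programs; in generic notation (not the authors’ specific implementation), the residual for person n on item i is:

```latex
% Standardized residual: observed minus expected response, scaled by the model-implied
% standard deviation; large |z| flags unexpected responses that can be cross-referenced
% with self-reported help-seeking behavior.
z_{ni} = \frac{x_{ni} - E_{ni}}{\sqrt{\mathrm{Var}(x_{ni})}}
```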


Seeing Skill: Heuristic Reasoning when Forecasting Outcomes in Sports

Veronica U. Weser, Karen M. Schmidt, Gerald L. Clore, & Dennis R. Proffitt
Accepted on: 2 October 2022

Abstract

Expert advantage in making judgments about the outcomes of sporting events is well documented. It is not known, however, whether experts have an advantage in the absence of objective information, such as the current score or the relative skill of the players. Participants viewed 5-second clips of modern Olympic fencing matches and selected the likely winners. Participants’ predictions were compared with the actual winners to compute accuracy. Study 1 revealed small but significant differences between the accuracy of experts and novices, but it was not clear what fencing behaviors informed participants’ judgments. Rasch modeling was used to select stimuli for Study 2, in which fencing-naïve participants rated the gracefulness, competitiveness, and confidence of competitors before selecting winners. With the stimulus set developed through Rasch modeling, fencing-naïve participants were able to identify winners at above-chance rates. The results further indicate that, in the absence of concrete information, competitiveness and confidence may be used as heuristics for selecting winning athletes.


The Effects of Textual Borrowing Training on Rater Accuracy When Scoring Students’ Responses to an Integrated Writing Task

Kevin R. Raczynski, Jue Wang, George Engelhard, Jr, Allan S. Cohen
Accepted on: 23 September 2022

Abstract

Integrated writing (IW) tasks require students to incorporate information from provided source material into their written responses. The growth of IW tasks has outpaced research on the scoring challenges that raters experience in this assessment context, and several researchers have reported that raters in their studies struggled to agree about whether students demonstrated successful integration of source material in their written responses. One suggestion for meeting this challenge is to provide rater training on textual borrowing, a topic not covered in all training programs. We randomly assigned 17 middle school teachers to two training conditions to determine whether teachers who completed an augmented training protocol specific to textual borrowing would score students’ responses to an IW task more accurately than teachers who completed a comparison training protocol that did not include instruction on textual borrowing. After training, all teachers scored the same set of 30 benchmark essays. We compared the teachers’ scores with professional raters’ scores, dichotomized each score based on whether it matched, and then analyzed the resulting data using a FACETS model for accuracy. As a group, the teachers who completed the augmented training scored more accurately than the comparison group. Policy implications for scoring rubric design and rater training are discussed.
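
A plausible sketch of a facets model for rater accuracy of the kind described, with dichotomous accuracy scores (1 = matched the professional raters’ score), is given below; the facet structure is assumed for illustration and may differ from the authors’ specification, which could include additional facets such as scoring domain or training condition.

```latex
% Rasch facets model for accuracy: lambda_n is the accuracy of teacher (rater) n and
% delta_i the difficulty of scoring benchmark essay i accurately; P(1) is the probability
% of an accurate (matching) score.
\ln\!\left(\frac{P_{ni}(1)}{P_{ni}(0)}\right) = \lambda_n - \delta_i
```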


The Development and Validation of a Thinking Skills Assessment for Students with Disability Using Rasch Measurement Approaches

Toshiko Kamei, Masa Pavlovic
Accepted on: 23 September 2022

Abstract

Twenty-first-century skills such as thinking are gaining prominence in curricula internationally as fundamental to thriving and learning in an evolving global environment. This study developed and investigated the validity of a thinking skills assessment based on a learning progression for students with disability. It followed an established method (Griffin, 2007) previously used to develop assessments based on learning progressions of foundational learning skills of students with disability. An initial review of the research and a co-design process with teachers with expertise in teaching students with disability were used to develop a set of assessment items based on a hypothesized criterion-referenced framework of thinking skills. This was followed by an empirical exploration of the assessment through a Rasch partial credit model (Masters, 1982) analysis of data from a field trial of the items involving 864 students. Subject matter expert (SME) review, person and item fit statistics, and reliability coefficients provided evidence to support the validity of the assessment for its intended purpose and supported arguments for a single underlying construct. A thinking skills assessment based on a learning progression for school-age students with disability was thus derived, drawing on teacher interpretation and calibration of student assessment data. The resulting assessment provides a practical tool that teachers can use in the classroom to support the teaching and learning of thinking skills for a cohort and level of learning not previously targeted.
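
The partial credit model referred to here (Masters, 1982) allows each item to have its own step structure; in standard notation:

```latex
% Partial credit model: the log-odds of scoring in category k rather than k-1 on item i
% depends on person ability theta_n and an item-specific step parameter delta_ik.
\ln\!\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_{ik}
```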


Career Advancement Inventory: Assessing Decent Work among Individuals with Psychiatric Disabilities

Uma Chandrika Millner, Sarah A. Satgunam, James Green, Tracy Woods, Richard Love, Amanda Nutton, Larry Ludlow
Accepted on: 14 April 2022

Abstract

Comprehensive assessments of the outcomes of vocational programs and interventions are necessary to ameliorate the significant employment disparities experienced by individuals with psychiatric disabilities. In particular, measuring the attainment of decent work is critical for assessing their vocational outcomes. In the absence of existing vocational instruments that assess progress toward decent work among individuals with psychiatric disabilities, we developed the Career Advancement Inventory (CAI). The CAI was theoretically grounded in the Career Pathways Framework (CPF) and in a review of focus group data and existing literature, and it was constructed using an iterative scale development approach and a combination of classical test theory and item response theory principles, specifically Rasch modeling. The CAI includes five subscales: Self-Efficacy, Environmental Awareness, Work Motivation, Vocational Identity, and Career Adaptabilities. Rasch analyses yielded mixed results: some items in the subscales mapped onto the hierarchical, stage-like progression proposed by the CPF, while others did not. The results support the construct validity of the subscales, with the exception of Work Motivation, and contribute to expanding the theoretical propositions of the CPF. The CAI has the potential to be an effective career assessment for individuals with psychiatric disabilities and has implications for vocational psychology and vocational rehabilitation.


Extended Rater Representations in the Many-Facet Rasch Model

Mark Elliott, Paula J. Buttery
Accepted on: 14 March 2022

Abstract

Many-facet Rasch models (Eckes, 2009, 2015; Engelhard & Wind, 2017; Linacre, 1994) provide a framework for measuring rater effects in examiner-scored assessments, even under sparse data designs. However, representing a rater as a single global scalar measure assumes uniform severity across the range of rating scales and across the criteria within each scale. We introduce extended rater representations: vectors or matrices of local measures relating to individual rating scales and criteria. We contrast these extended representations with previous work on local rater effects (Myford & Wolfe, 2003) and discuss issues related to their application, for raters and for other facets. We conduct a simulation study to evaluate the models, using an extension of the CPAT algorithm (Elliott & Buttery, 2021). We conclude that extended representations reflect the role of the rater within the assessment process more naturally and completely, and provide greater inferential power, than the traditional global measure of severity. Extended representations also apply to other facets that may have non-uniform effects across items and thresholds.
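
In rough terms (the notation here is assumed for illustration, not the authors’), the contrast is between a single severity parameter per rater and a rater-by-criterion array of local severities:

```latex
% Global representation: one severity lambda_r per rater r, across all criteria c.
\ln\!\left(\frac{P_{nrck}}{P_{nrc(k-1)}}\right) = \theta_n - \delta_c - \lambda_r - \tau_k

% Extended representation: a local severity lambda_{rc} for each rater-by-criterion
% combination (a vector per rater; a matrix if thresholds are also made rater-specific).
\ln\!\left(\frac{P_{nrck}}{P_{nrc(k-1)}}\right) = \theta_n - \delta_c - \lambda_{rc} - \tau_k
```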


Tracing Morals: Reconstructing the Moral Foundations Questionnaire in New Zealand and Sweden Using Mokken Scale Analysis and Optimal Scaling Procedure

Erik Forsberg, Anders Sjöberg
Accepted on: 21 March 2022
Abstract updated on: 8 April 2022

Abstract

The Moral Foundations Questionnaire, consisting of the Relevance subscale and the Judgment subscale, was constructed within the framework of classical test theory for the purpose of measuring five moral foundations. So far, however, no study has investigated the latent properties of the questionnaire. Two independent samples, one from the New Zealand Attitudes and Values Study (N = 3989) and one nationally representative sample from Sweden (N = 1004), were analysed using Mokken scale analysis and an optimal scaling procedure. The results indicate strong shared effects across both samples. Most notably, the Moral Foundations Questionnaire holds two latent trait dimensions, corresponding to the theoretical partitioning between Individualizing and Binding foundations. However, while the Relevance subscale was, on the whole, reliable in ordering respondents on the latent trait, the Judgment subscale was not. Moreover, the dimensionality analysis showed that the Relevance subscale contains three cross-cultural homogeneity outlier items (items addressing loyalty and disorder concerns) in both samples. Lastly, while the test for local independence indicated adequate fit for the Individualizing trait dimension, the Binding dimension was theoretically ambiguous. Suggestions for improvements and future directions are discussed.
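
For readers less familiar with Mokken scaling, the analyses reported here rest on Loevinger’s scalability coefficients; the item-pair coefficient is defined, in its standard form rather than anything specific to this study, as:

```latex
% Scalability of item pair (i, j): one minus the ratio of observed Guttman errors F_ij
% to the errors E_ij expected under marginal independence. Item and scale coefficients
% H_i and H aggregate these pairwise values; H >= .3 is the usual minimum for a Mokken scale.
H_{ij} = 1 - \frac{F_{ij}}{E_{ij}}
```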


Measuring the Complexity of Equity-Centered Teaching Practice: Development and Validation of a Rasch/Guttman Scenario Scale

Wen-Chia C. Chang
Accepted on: 4 March 2022

Abstract

The TEES Scale was developed to measure the complexity of teaching practice for equity by integrating Rasch measurement and Guttman facet theory. This paper extends that work to develop and validate an efficient, short-form TEES Scale that can be used for research and evaluation purposes. The Rasch rating scale model is used to analyze the responses of 354 teachers across the United States. Validity evidence addressing data/theory alignment, item and person fit, rating scale functioning, dimensionality, generalizability, and relations to external variables is examined to support the adequacy and appropriateness of the proposed score interpretations and uses. The short-form TEES Scale functions well for measuring teaching practice for equity and can provide evidence for research or evaluation studies on whether, and to what extent, teachers or candidates learn to enact equity-centered practice. Limitations and future directions for the scale are discussed.


Using An Exploratory Quantitative Text Analysis (EQTA) to Synthesize Research Articles

Cheng Hua, Catanya Stager, Stefanie A. Wind
Accepted on: 4 January 2022

Abstract

An Exploratory Quantitative Text Analysis (EQTA) method was proposed to synthesize large sets of scholarly publications and to examine thematic characteristics in the Journal of Applied Measurement (JAM). After synthesizing 578 articles published in JAM from 2000 to 2020, the authors classified each article into one of five categories and compared differences across three phases: (1) word frequency analysis from the EQTA; (2) descriptive analysis of trends in research articles and classification counts; and (3) thematic analysis of word frequencies across article classifications. We found that (1) the most frequently used words are Item, Rasch model, and Measure; (2) most articles’ authors are from North America (380/578; 65.74%), followed by Europe (68/578; 11.76%) and other countries (130/578; 22.5%); (3) the most common classifications are model comparisons (77/578; 13%), followed by methodological developments (69/578; 12%) and reviews/other (43/578; 7%); and (4) differences in classifications between application and methodology are displayed using pyramid plots. The EQTA revealed deeper insight into the nature of JAM publications, including common topics and areas of emphasis, and it can be recommended for future research of this kind, as it is not limited to JAM.
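
As a minimal illustration of the word-frequency phase of an analysis like EQTA (toy code and toy data; not the authors’ pipeline), one could count token frequencies across abstracts as follows:

```python
import re
from collections import Counter

def word_frequencies(texts, stopwords=frozenset({"the", "a", "an", "of", "and", "in", "to", "for", "with"})):
    """Count word frequencies across a collection of texts (toy version of a frequency pass)."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())   # lowercase alphabetic tokens only
        counts.update(t for t in tokens if t not in stopwords)
    return counts

sample = [
    "The Rasch model was used to examine item fit.",
    "Item difficulty and person measures were estimated with the Rasch model.",
]
print(word_frequencies(sample).most_common(5))
```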


Examining Rating Designs with Cross-Classification Multilevel Rasch Models

Jue Wang, Zhenqiu Lu, George Engelhard Jr., Allan S. Cohen
Accepted on: 23 November 2021

Abstract

The scoring of rater-mediated assessments relies largely on human raters, and their ratings empirically reflect student proficiency in a specific skill. Incomplete rating designs are common in operational scoring procedures because raters do not typically score all student performances. Cross-classification mixed-effects models can be used to examine data with such a complex structure. By incorporating Rasch measurement models into multilevel models, the cross-classification multilevel Rasch model (CCM-RM) can place both students and raters on a single latent continuum and can also examine random effects for higher-level units. In addition, the CCM-RM provides flexibility for modeling characteristics of raters and features of student performances. This study investigates, through simulation, the effect of different rating designs on the estimation accuracy of the CCM-RM with consideration of sample size and rater variance. We also illustrate the use of the CCM-RM for evaluating rater accuracy under different rating designs using data from a statewide writing assessment.
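
A compact way to think about the CCM-RM, with notation assumed here for illustration, is as a Rasch model with crossed random effects for students and raters:

```latex
% Cross-classification multilevel Rasch sketch: student proficiency theta_n and rater
% severity lambda_r are crossed random effects; delta_i is the difficulty of item/domain i.
\ln\!\left(\frac{P_{nri}(1)}{P_{nri}(0)}\right) = \theta_n - \lambda_r - \delta_i,
\qquad \theta_n \sim N(0, \sigma_\theta^2), \quad \lambda_r \sim N(0, \sigma_\lambda^2)
```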


Effects of Item Misfit on Proficiency Estimates Under the Rasch Model

Chunyan Liu, Peter Baldwin, Raja Subhiyah
Accepted on: 12 November 2021

Abstract

When IRT parameter estimates are used to make inferences about examinee performance, assessment of model-data fit is an important consideration. Although many studies have examined the effects of violating IRT model assumptions, relatively few have focused on the effects of violating the equal discrimination assumption on examinee proficiency estimation conditional on true proficiency under the Rasch model. The findings of this simulation study suggest that systematic item misfit due to violating this assumption can have noticeable effects on proficiency estimation, especially for candidates with relatively high or low proficiency. We also consider the consequences of misfit for examinee classification and show that while the effects on overall classification (e.g., pass/fail) rates are generally very small, false-negative and false-positive rates can still be affected in important ways.
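
To make the setup concrete, here is a small, self-contained simulation in the spirit of the study (our own illustrative code, not the authors’): responses are generated from a 2PL model with unequal discriminations, abilities are then estimated under the Rasch model with the generating difficulties treated as known, and bias is summarized by band of true proficiency.

```python
import numpy as np

rng = np.random.default_rng(7)

n_persons, n_items = 2000, 40
theta_true = rng.normal(0, 1, n_persons)
b = rng.normal(0, 1, n_items)
a = rng.lognormal(mean=0.0, sigma=0.3, size=n_items)   # unequal discriminations (2PL generating model)

# simulate responses under the 2PL
p = 1 / (1 + np.exp(-a * (theta_true[:, None] - b[None, :])))
x = (rng.uniform(size=p.shape) < p).astype(int)

def rasch_theta_mle(responses, b, n_iter=30):
    """Rasch (a = 1) maximum-likelihood ability estimates, item difficulties treated as known."""
    theta = np.zeros(responses.shape[0])
    for _ in range(n_iter):
        p_hat = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
        score_resid = (responses - p_hat).sum(axis=1)   # first derivative of the log-likelihood
        info = (p_hat * (1 - p_hat)).sum(axis=1)        # Fisher information
        theta = theta + score_resid / info              # Newton-Raphson step
    return theta

# drop perfect/zero scores, for which finite MLEs do not exist
raw = x.sum(axis=1)
keep = (raw > 0) & (raw < n_items)
theta_hat = rasch_theta_mle(x[keep], b)

# bias conditional on true proficiency
bias = theta_hat - theta_true[keep]
for lo, hi in [(-3, -1), (-1, 1), (1, 3)]:
    band = (theta_true[keep] >= lo) & (theta_true[keep] < hi)
    print(f"theta in [{lo}, {hi}): mean bias = {bias[band].mean():.3f}")
```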