Applications of statistical models to validate test score inferences for research, instructional and accountability uses
- The three methodological papers in this dissertation apply novel or recently developed statistical models to improve our understanding and use of educational test scores. The first paper explores the use of diagnostic classification models to validate scores on a multiple-choice test written to be more instructionally useful to teachers. The second paper describes a novel application of hierarchical logistic regression that allows researchers to formally test and quantify variation in item bias across different test administrations. The third paper presents an improved method for obtaining estimates of test score distributions for small groups when only aggregate proficiency data are available.

Paper 1 uses a generalized diagnostic classification model (GDCM) to provide validity evidence for a test measuring student misconceptions in middle school geometry. The test is an example of a "distractor-driven" test that includes selected-response questions with systematically written incorrect response options, and is intended to provide teachers with an efficient means of obtaining instructionally useful information about their students' reasoning, including whether students may be reasoning with common misconceptions that could interfere with their learning. This paper illustrates how graphical and numerical results from the GDCM can be used to evaluate current uses of the test and to guide future test development. The discussion considers both the strengths and limitations of applying the GDCM framework to this type of distractor-driven test.

Paper 2 proposes a new approach for studying variation in differential item functioning (DIF) across test administrations. DIF analyses, which can help identify biased test items, are widely used in test development and validation to ensure that test scores are fair for all test-takers.
Research in social psychology and related fields suggests that contextual features of test-taking environments may adversely affect some test-takers' performance, which could lead to variance in DIF across test administrations. Most commonly used DIF detection methods assume DIF is constant across test administrations, an assumption that could lead to incorrect or incomplete inferences in the presence of such heterogeneity. This paper proposes a novel use of hierarchical logistic regression (HLR) models to detect both DIF and DIF variance across test administrations. A real data analysis and a simulation study are used to demonstrate and evaluate the proposed model. The results show that the HLR model has well-controlled Type I error rates and good statistical power when testing for DIF variance, and provides a more accurate test of uniform DIF than the standard logistic regression DIF model in the presence of DIF variance.

Paper 3 describes an improved method for analyzing test score data reported as aggregate proficiency data. Aggregate proficiency data indicate the number of students in different schools or demographic groups scoring in each of a small number of ordered performance levels. Because these data do not include complete information about the full test score distributions, they are of limited use for many statistical analyses. However, they are often the only source of data available to those addressing important questions about differences in educational achievement outcomes across groups. Heteroskedastic ordered probit (HETOP) models can be used to recover estimates of the means and standard deviations of the full test score distributions, but may yield biased or very imprecise estimates when group sample sizes are small. This paper describes a pooled HETOP model that pools data across grade levels to improve small-sample estimates.
Two simulation studies demonstrate that the pooled HETOP model can reduce the bias and sampling error of test score standard deviation estimates when group sample sizes are very small. An analysis of real test score data finds that the pooled HETOP model's assumptions are plausible, supporting the use of the model in applied settings.
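Paper 2's HLR model builds on the standard logistic regression DIF test, in which an item response is regressed on a matching (ability) variable and a group indicator; a nonzero group coefficient signals uniform DIF. The following is a minimal sketch of that baseline test using simulated data; the variable names, the simulated effect sizes, and the Newton-Raphson fitter are illustrative assumptions, not the dissertation's code or data.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson (IRLS)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: solve (X'WX) d = X'(y - p) and update beta.
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
n = 4000
theta = rng.normal(size=n)             # matching/ability variable
group = np.repeat([0, 1], n // 2)      # 0 = reference, 1 = focal group
# Simulate uniform DIF: focal examinees' log-odds of a correct response
# are shifted by 0.8 even after conditioning on ability.
logit = -0.2 + 1.0 * theta + 0.8 * group
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

X = np.column_stack([np.ones(n), theta, group])
b0, b_theta, b_group = fit_logistic(X, y)
print(b_group)  # estimate of the group (DIF) effect; 0.8 in the simulation
```

A single-administration model like this treats the DIF effect as one fixed coefficient; the HLR extension described in Paper 2 instead lets that coefficient vary across administrations so its variance can be tested directly.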
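The distributional recovery underlying Paper 3 can be illustrated in miniature: for a single group with known cut scores and normally distributed scores, each cut c satisfies c = mu + sigma * Phi^{-1}(p), where p is the proportion of students below that cut, so the probit-transformed cumulative proportions identify the group's mean and standard deviation. The cut scores, sample size, and bisection-based inverse normal CDF below are illustrative assumptions; actual HETOP estimation is by maximum likelihood, with cuts estimated jointly across many groups.

```python
import numpy as np
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ndtri(p):
    """Inverse standard normal CDF by bisection (illustrative helper)."""
    lo, hi = -8.0, 8.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(7)

# Invented example: one group's latent scores ~ N(0.3, 1.2^2), observed
# only as counts in 4 ordered proficiency levels defined by 3 cut scores.
mu_true, sigma_true = 0.3, 1.2
cuts = np.array([-0.5, 0.5, 1.5])
scores = rng.normal(mu_true, sigma_true, size=50_000)

# Cumulative proportion of students scoring below each cut.
p_below = np.array([(scores < c).mean() for c in cuts])

# c = mu + sigma * ndtri(p), so regressing the cuts on the
# probit-transformed proportions recovers sigma (slope) and mu (intercept).
z = np.array([ndtri(p) for p in p_below])
sigma_hat, mu_hat = np.polyfit(z, cuts, 1)
print(mu_hat, sigma_hat)  # estimates of the group mean and SD
```

With large groups this recovery is precise; the small-sample instability the abstract describes arises when category counts are small and the proportions (and hence the probit transforms) are noisy, which is what motivates pooling across grade levels.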
|Type of resource
|electronic; electronic resource; remote
|1 online resource.
|Shear, Benjamin Rogers
|Stanford University, Graduate School of Education.
|Reardon, Sean F
|Statement of responsibility
|Benjamin Rogers Shear.
|Submitted to the Graduate School of Education.
|Thesis (Ph.D.)--Stanford University, 2016.
- © 2016 by Benjamin Rogers Shear
- This work is licensed under a Creative Commons Attribution Non Commercial 3.0 Unported license (CC BY-NC).