219 results for "DeMars, Christine E."
Search Results
2. Violation of Conditional Independence in the Many-Facets Rasch Model
- Author
-
DeMars, Christine E.
- Abstract
Estimation of parameters for the many-facets Rasch model requires that conditional on the values of the facets, such as person ability, item difficulty, and rater severity, the observed responses within each facet are independent. This requirement has often been discussed for the Rasch models and 2PL and 3PL models, but it becomes more complex with more facets. To show how violation of conditional independence may be exhibited, three scenarios with different types of dependency are developed: (1) raters rating the same work; (2) a residual ability shared by two tasks; and (3) score on one task dependent on observed score on a previous task.
- Published
- 2021
- Full Text
- View/download PDF
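For reference, the dichotomous form of the many-facets Rasch model described in the record above can be written as follows; this is the standard three-facet (person, item, rater) statement of the model, given here only as a sketch of the conditional-independence assumption the abstract discusses.

```latex
% Dichotomous many-facets Rasch model: person n, item i, rater r
\log \frac{P(X_{nir}=1 \mid \theta_n, \delta_i, \lambda_r)}
          {P(X_{nir}=0 \mid \theta_n, \delta_i, \lambda_r)}
  = \theta_n - \delta_i - \lambda_r

% Conditional (local) independence: given the facet parameters, the
% joint probability of the responses factors into a product of
% person-by-item-by-rater terms.
P(\mathbf{X} \mid \boldsymbol{\theta}, \boldsymbol{\delta}, \boldsymbol{\lambda})
  = \prod_{n} \prod_{i} \prod_{r} P(X_{nir} \mid \theta_n, \delta_i, \lambda_r)
```

Each of the three scenarios in the abstract (raters scoring the same piece of work, a residual ability shared by two tasks, one task's score depending on a previous task's observed score) breaks this product form.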
3. Item Parameter Recovery: Sensitivity to Prior Distribution.
- Author
-
DeMars, Christine E. and Satkus, Paulius
- Subjects
-
STATISTICAL models, DATA analysis, DIFFERENTIAL item functioning (Research bias), PROBABILITY theory, RESEARCH methodology evaluation, EDUCATIONAL tests & measurements, CONVALESCENCE, PSYCHOMETRICS, STATISTICS, SENSITIVITY & specificity (Statistics)
- Abstract
Marginal maximum likelihood, a common estimation method for item response theory models, is not inherently a Bayesian procedure. However, due to estimation difficulties, Bayesian priors are often applied to the likelihood when estimating 3PL models, especially with small samples. Little focus has been placed on choosing the priors for marginal maximum likelihood estimation. In this study, using sample sizes of 1,000 or smaller, not using priors often led to extreme, implausible parameter estimates. Applying prior distributions to the c-parameters alleviated the estimation problems with samples of 500 or more; for the samples of 100, priors on both the a-parameters and c-parameters were needed. Estimates were biased when the mode of the prior did not match the true parameter value, but the degree of the bias did not depend on the strength of the prior unless it was extremely informative. The root mean squared error (RMSE) of the a-parameters and b-parameters did not depend greatly on either the mode or the strength of the prior unless it was extremely informative. The RMSE of the c-parameters, like the bias, depended on the mode of the prior for c.
- Published
- 2024
- Full Text
- View/download PDF
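The record above describes placing prior distributions on the 3PL c-parameters (and, for very small samples, the a-parameters) during marginal maximum likelihood estimation. One common way such priors are incorporated is as log-prior penalty terms added to the marginal log-likelihood; the sketch below illustrates that idea only. The Beta and lognormal forms and the hyperparameter values are illustrative assumptions, not the settings used in the study.

```python
import math

def log_beta_prior(c, alpha=5.0, beta=17.0):
    """Log density (up to a constant) of a Beta prior on the 3PL lower
    asymptote c; with alpha=5 and beta=17 the prior mode is
    (alpha - 1) / (alpha + beta - 2) = 0.2."""
    return (alpha - 1.0) * math.log(c) + (beta - 1.0) * math.log(1.0 - c)

def log_lognormal_prior(a, mu=0.0, sigma=0.5):
    """Log density (up to a constant) of a lognormal prior on the
    discrimination a, centered at exp(mu) on the a metric."""
    return -((math.log(a) - mu) ** 2) / (2.0 * sigma ** 2) - math.log(a)

def penalized_marginal_loglik(marginal_loglik, a, c, use_a_prior=False):
    """Bayes-modal style objective: marginal log-likelihood plus
    log-prior penalties.  The c prior is applied routinely; the a prior
    is added only for very small samples, mirroring the pattern
    reported in the abstract."""
    total = marginal_loglik + log_beta_prior(c)
    if use_a_prior:
        total += log_lognormal_prior(a)
    return total
```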
4. Higher Education Faculty Engagement in a Modified Mapmark Standard Setting
- Author
-
Horst, S. Jeanne and DeMars, Christine E.
- Abstract
The Mapmark standard setting method was adapted to a higher education setting in which faculty leaders were highly involved. Eighteen university faculty members participated in a day-long standard setting for a general education communications test. In Round 1, faculty set initial cut-scores for each of four student learning objectives. In Rounds 2 and 3, participants used a Mapmark item map to consider information from four student learning objectives at one glance and to integrate this information into a single cut-score. Participants and faculty leaders reported that the process was intuitive, and there was support for a defensible cut-score from the majority of participants and faculty leaders. Practical suggestions and implications are discussed.
- Published
- 2016
5. Multilevel Rasch Modeling: Does Misfit to the Rasch Model Impact the Regression Model?
- Author
-
DeMars, Christine E.
- Abstract
Multilevel Rasch models are increasingly used to estimate the relationships between test scores and student and school factors. Response data were generated to follow one-, two-, and three-parameter logistic (1PL, 2PL, 3PL) models, but the Rasch model was used to estimate the latent regression parameters. When the response functions followed 2PL or 3PL models, the proportion of variance explained in test scores by the simulated student or school predictors was estimated accurately with a Rasch model. Proportion of variance within and between schools was also estimated accurately. The regression coefficients were misestimated unless they were rescaled out of logit units. However, item-level parameters, such as DIF effects, were biased when the Rasch model was violated, similar to single-level models.
- Published
- 2020
- Full Text
- View/download PDF
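As context for the latent regression described in the record above, a generic two-level Rasch formulation (students p nested in schools j, one student-level predictor X and one school-level predictor W) can be sketched as below; the notation is illustrative rather than the study's exact specification.

```latex
% Measurement (Rasch) model for item i:
\log \frac{P(X_{pji}=1 \mid \theta_{pj})}{P(X_{pji}=0 \mid \theta_{pj})}
  = \theta_{pj} - \delta_i

% Structural (latent regression) model:
\theta_{pj} = \gamma_{00} + \gamma_{10} X_{pj} + \gamma_{01} W_{j}
              + u_{j} + e_{pj},
\qquad u_j \sim N(0, \tau), \quad e_{pj} \sim N(0, \sigma^2)
```

The abstract's finding is that variance proportions from a model of this kind were recovered well even when the items actually followed a 2PL or 3PL model, but the gamma coefficients sat on a distorted logit metric unless rescaled.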
6. Comparing Examinee-Based and Response-Based Motivation Filtering Methods in Remote Low-Stakes Testing
- Author
-
Alahmadi, Sarah and DeMars, Christine E.
- Published
- 2024
- Full Text
- View/download PDF
7. Revised Parallel Analysis with Nonnormal Ability and a Guessing Parameter
- Author
-
DeMars, Christine E.
- Abstract
Previous work showing that revised parallel analysis can be effective with dichotomous items has used a two-parameter model and normally distributed abilities. In this study, both two- and three-parameter models were used with normally distributed and skewed ability distributions. Relatively minor skew and kurtosis in the underlying ability distribution had almost no effect on Type I error for unidimensional data and slightly reduced power for two-dimensional data at the smaller sample size of 400. Using a two-parameter model on three-parameter data produced dramatically increased rejection rates for the unidimensional data. Using the correct three-parameter model reduced the unidimensional rejection rates but yielded lower power than was obtained for the two-parameter data in some conditions.
- Published
- 2019
- Full Text
- View/download PDF
8. Brief Research Report: A User Friendly Effect Size--When Normality Matters
- Author
-
Waterbury, Glenn Thomas and DeMars, Christine E.
- Abstract
There is a need for effect sizes that are readily interpretable by a broad audience. One index that might fill this need is π, which represents the proportion of scores in one group that exceed the mean of another group. The robustness of estimates of π to violations of normality had not been explored. Using simulated data, three estimates of π (π̂_direct, r, and r_robust) were studied under varying conditions of sample size, distribution shape, and group mean difference. This study demonstrated that r and r_robust were biased estimates of π when data were nonnormal. We recommend that neither be used in estimating π unless data are normally distributed.
- Published
- 2019
- Full Text
- View/download PDF
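The effect size π in the record above is the proportion of scores in one group exceeding the mean of the other group. Below is a minimal sketch of a direct (distribution-free) estimate and a normal-theory estimate based on the standardized mean difference; the abstract's r_robust variant is not reproduced here, and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def pi_direct(group1, group2):
    """Direct estimate of pi: the proportion of group-1 scores that
    exceed the mean of group 2 (no distributional assumption)."""
    return float(np.mean(np.asarray(group1) > np.mean(group2)))

def pi_normal_theory(group1, group2):
    """Normal-theory estimate, Phi(d), with d the standardized mean
    difference using the pooled SD.  Estimates of this kind were found
    to be biased when the data are nonnormal."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    pooled_sd = np.sqrt(((n1 - 1) * g1.var(ddof=1) +
                         (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2))
    d = (g1.mean() - g2.mean()) / pooled_sd
    return float(norm.cdf(d))
```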
9. The Role of Gender in Test-Taking Motivation under Low-Stakes Conditions
- Author
-
DeMars, Christine E., Bashkov, Bozhidar M., and Socha, Alan B.
- Abstract
Examinee effort can impact the validity of scores on higher education assessments. Many studies of examinee effort have briefly noted gender differences, but gender differences in test-taking effort have not been a primary focus of research. This review of the literature brings together gender-related findings regarding three measures of examinee motivation: attendance at the assigned testing session, time spent on each test item, and self-reported effort. Evidence from the literature is summarized, with some new results presented. Generally, female examinees exert more effort, with differences mostly at very low levels of effort--the levels at which effort is most likely to impact test scores. Examinee effort is positively correlated with conscientiousness and agreeableness, and negatively correlated with work avoidance. The gender differences in these constructs may account for some of the gender differences in test-taking effort. Limitations and implications for higher education assessment practice are discussed.
- Published
- 2013
10. Scoring Subscales Using Multidimensional Item Response Theory Models
- Author
-
DeMars, Christine E.
- Abstract
Several methods for estimating item response theory scores for multiple subtests were compared. These methods included two multidimensional item response theory models: a bi-factor model where each subtest was a composite score based on the primary trait measured by the set of tests and a secondary trait measured by the individual subtest, and a model where the traits measured by the subtests were separate but correlated. Composite scores based on unidimensional item response theory, with each subtest borrowing information from the other subtest, as well as independent unidimensional scores for each subtest were also considered. Correlations among scores from all methods were high, though somewhat lower for the independent unidimensional scores. Correlations between course grades and test scores, a measure of validity, were similar for all methods, though again slightly lower for the unidimensional scores. To assess bias and RMSE, data were simulated using the parameters estimated for the correlated factors model. The independent unidimensional scores showed the greatest bias and RMSE; the relative performance of the other three methods varied with the subscale. (Contains 6 tables, 4 figures, and 2 footnotes.)
- Published
- 2005
11. An Application of Item Response Time: The Effort-Moderated IRT Model
- Author
-
Wise, Steven L., DeMars, Christine E., and Kong, Xiaojing
- Abstract
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account different levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity. (Contains 11 figures, 5 tables, and 2 footnotes.)
- Published
- 2005
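The effort-moderated model in the record above mixes a standard 3PL response function with a chance-level function, using item response time to decide which applies. A minimal sketch is below, assuming a single rapid-guessing time threshold per item and a chance rate of one over the number of options; the threshold-setting procedure and parameter values are assumptions made only for illustration.

```python
import math

def p_3pl(theta, a, b, c):
    """Standard 3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def p_effort_moderated(theta, a, b, c, response_time, time_threshold,
                       n_options=4):
    """Effort-moderated response probability: solution behavior
    (response time at or above the item's threshold) follows the 3PL,
    while rapid guessing follows a flat chance-level function."""
    if response_time >= time_threshold:
        return p_3pl(theta, a, b, c)
    return 1.0 / n_options
```

Because the chance-level term does not depend on theta, rapid-guessed responses contribute essentially no information about proficiency, which is how the model keeps low-effort responses from distorting the item and person estimates.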
12. Modeling Student Outcomes in a General Education Course with Hierarchical Linear Models. AIR 2002 Forum Paper.
- Author
-
DeMars, Christine E.
- Abstract
When students are nested within course sections, the assumption of independence of residuals is unlikely to be met, unless the course section is explicitly included in the model. Hierarchical linear modeling (HLM) allows for modeling the course section as a random effect, leading to more accurate standard errors. In this study, students chose one of four themes for a communications course, with multiple sections and instructors within each theme. Complete data were available for 1,304 students in 67 sections. HLM was used to test for differences by theme in scores on a final exam; the differences were not significant when Scholastic Assessment Test (SAT) scores were controlled. (Author/SLD)
- Published
- 2002
13. Recovery of Graded Response and Partial Credit Parameters in MULTILOG and PARSCALE.
- Author
-
DeMars, Christine E.
- Abstract
Using simulated data, the MULTILOG and PARSCALE software packages were compared for their recovery of item and trait parameters under the graded response and generalized partial credit item response theory models. The shape of the latent population distribution (normal, skewed, or uniform) and the sample size (250 or 500) were varied. Parameter estimates were essentially unbiased under all conditions, and the root mean square error was similar for both software packages. The choice between these packages can therefore be based on considerations other than the accuracy of parameter estimation. (Contains 3 tables, 5 figures, and 22 references.) (Author/SLD)
- Published
- 2002
14. Item Parameter Recovery: Sensitivity to Prior Distribution
- Author
-
DeMars, Christine E. and Satkus, Paulius
- Published
- 2023
- Full Text
- View/download PDF
15. Assessment of Undergraduate Students' Environmental Stewardship Reasoning and Knowledge
- Author
-
Hartman, Christie-Joy Brodrick, DeMars, Christine E., Griscom, Heather Peckham, and Butner, Harold Martin
- Abstract
Purpose: The purpose of this paper is to present a public university's design and implementation of an assessment approach that measures the change in undergraduate students' environmental stewardship reasoning and knowledge abilities over time. Design/methodology/approach: In support of a university's strategic emphasis on environmental stewardship, members of a university committee developed environmental stewardship learning outcomes for undergraduate students. The learning outcomes were not required in specific academic courses or in general education. Subsequently, volunteers from a variety of roles, in cooperation with committee members, developed a corresponding assessment test that focused on reasoning and knowledge. The instrument was revised between Spring 2011 and Spring 2014, and its validity was evaluated. An exploratory analysis of student learning over time was conducted using 22 items shared by different test forms. Findings: A series of implementations and revisions resulted in a 50-question test, the Environmental Stewardship Reasoning and Knowledge Assessment (ESRKA), which showed good reliability (0.83). A comparative analysis provided evidence of the validity of the instrument. Results from a small sample of students showed that second-year students generally performed better on the 22 items than incoming first-year students. Those taking the assessment as second-year students, 18 months after their initial assessment, scored significantly higher on the 22 items by about 10.4 percentage points (0.61 standard deviation units, t(68) = 6.23, p < 0.0001). Research limitations/implications: Because of the small sample size and revision of the items, the analysis of student learning is only exploratory. Originality/value: The learning outcomes and validated assessment instrument may be used either in whole or part by other institutions. The approach to measure changes in students' environmental stewardship reasoning and knowledge abilities as cohorts over time could assist universities in tracking environmental stewardship learning and could inform strategic implementation of learning opportunities through the curriculum, as well as through other student learning experiences.
- Published
- 2017
- Full Text
- View/download PDF
16. Detecting Multidimensionality Due to Curricular Differences
- Author
-
DeMars, Christine E.
- Published
- 2003
17. Assessment of undergraduate students’ environmental stewardship reasoning and knowledge
- Author
-
Hartman, Christie-Joy Brodrick, DeMars, Christine E., Griscom, Heather Peckham, and Butner, Harold Martin
- Published
- 2017
- Full Text
- View/download PDF
18. Partially Compensatory Multidimensional Item Response Theory Models: Two Alternate Model Forms
- Author
-
DeMars, Christine E.
- Abstract
Partially compensatory models may capture the cognitive skills needed to answer test items more realistically than compensatory models, but estimating the model parameters may be a challenge. Data were simulated to follow two different partially compensatory models, a model with an interaction term and a product model. The model parameters were then estimated for both models and for the compensatory model. Either the model used to simulate the data or the compensatory model generally had the best fit, as indexed by information criteria. Interfactor correlations were estimated well by both the correct model and the compensatory model. The predicted response probabilities were most accurate from the model used to simulate the data. Regarding item parameters, root mean square errors seemed reasonable for the interaction model but were quite large for some items for the product model. Thetas were recovered similarly by all models, regardless of the model used to simulate the data.
- Published
- 2016
- Full Text
- View/download PDF
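To make the contrast in the record above concrete, the two-dimensional compensatory model and the product (partially compensatory) form are often written as below; the exact parameterizations used in the study, including its interaction-term model, may differ, so treat this as a sketch.

```latex
% Compensatory 2PL in two dimensions: a deficit on one trait can be
% offset by strength on the other.
P(X=1 \mid \theta_1, \theta_2)
  = \frac{1}{1 + \exp[-(a_1\theta_1 + a_2\theta_2 + d)]}

% Product (partially compensatory) form: the overall probability is the
% product of per-dimension terms, so a deficit on either trait keeps
% the probability low.
P(X=1 \mid \theta_1, \theta_2)
  = \prod_{k=1}^{2} \frac{1}{1 + \exp[-a_k(\theta_k - b_k)]}
```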
19. The Interaction of Ability Differences and Guessing When Modeling Differential Item Functioning with the Rasch Model: Conventional and Tailored Calibration
- Author
-
DeMars, Christine E. and Jurich, Daniel P.
- Abstract
In educational testing, differential item functioning (DIF) statistics must be accurately estimated to ensure the appropriate items are flagged for inspection or removal. This study showed how using the Rasch model to estimate DIF may introduce considerable bias in the results when there are large group differences in ability (impact) and the data follow a three-parameter logistic model. With large group ability differences, difficult non-DIF items appeared to favor the focal group and easy non-DIF items appeared to favor the reference group. Correspondingly, the effect sizes for DIF items were biased. These effects were mitigated when data were coded as missing for item-examinee encounters in which the person measure was considerably lower than the item location. Explanation of these results is provided by illustrating how the item response function becomes differentially distorted by guessing depending on the groups' ability distributions. In terms of practical implications, results suggest that measurement practitioners should not trust the DIF estimates from the Rasch model when there is a large difference in ability and examinees are potentially able to answer items correctly by guessing, unless data from examinees poorly matched to the item difficulty are coded as missing.
- Published
- 2015
- Full Text
- View/download PDF
20. Differential Item Functioning Detection with the Mantel-Haenszel Procedure: The Effects of Matching Types and Other Factors
- Author
-
Socha, Alan, DeMars, Christine E., Zilberberg, Anna, and Phan, Ha
- Abstract
The Mantel-Haenszel (MH) procedure is commonly used to detect items that function differentially for groups of examinees from various demographic and linguistic backgrounds--for example, in international assessments. As in some other DIF methods, the total score is used to match examinees on ability. In thin matching, each of the total score points is used as its own matching category, whereas in thick matching the total score is discretized into several score ranges. Evidence regarding how matching type affects the accuracy of the MH procedure is inconclusive. The current study investigated the effects of thin and thick matching in conjunction with sample size, purification, symmetric and asymmetric group sample sizes, test length, and differences in the ability distributions. Results suggest that whenever feasible, purification should be used in conjunction with thin matching.
- Published
- 2015
- Full Text
- View/download PDF
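The matching question in the record above concerns how total scores are stratified before computing the Mantel-Haenszel common odds ratio: thin matching treats every score point as its own stratum, while thick matching collapses scores into ranges. A minimal sketch of the computation under either scheme follows; variable names and the binning rule are illustrative assumptions.

```python
import numpy as np

def mh_common_odds_ratio(item_correct, group, total_score, bin_width=1):
    """Mantel-Haenszel common odds ratio for one studied item.

    item_correct: 0/1 responses to the studied item
    group:        0 = reference group, 1 = focal group
    total_score:  matching variable (total test score)
    bin_width:    1 gives thin matching; a value > 1 collapses scores
                  into ranges (thick matching)
    """
    item_correct = np.asarray(item_correct)
    group = np.asarray(group)
    strata = np.asarray(total_score) // bin_width

    num = den = 0.0
    for s in np.unique(strata):
        in_s = strata == s
        n_s = in_s.sum()
        a = np.sum(in_s & (group == 0) & (item_correct == 1))  # ref right
        b = np.sum(in_s & (group == 0) & (item_correct == 0))  # ref wrong
        c = np.sum(in_s & (group == 1) & (item_correct == 1))  # foc right
        d = np.sum(in_s & (group == 1) & (item_correct == 0))  # foc wrong
        num += a * d / n_s
        den += b * c / n_s
    return num / den  # values near 1 indicate little uniform DIF
```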
21. High School-Based Treatment for Adolescents with Attention-Deficit/Hyperactivity Disorder: Results from a Pilot Study Examining Outcomes and Dosage
- Author
-
Evans, Steven W., Schultz, Brandon K., and DeMars, Christine E.
- Abstract
The purpose of this study was to examine the efficacy and dose-response relationship of a school-based treatment program for high school students with attention-deficit/hyperactivity disorder (ADHD). Two paraprofessionals provided interventions to 24 students with ADHD randomly assigned to the treatment condition at two public high schools. They met in dyadic coaching sessions during one school year. In addition, parents attended weekly parent meetings and adolescents attended group sessions targeting social functioning in the evenings for 10 weeks in the fall semester. Intent-to-treat analyses showed little statistically significant benefit for the participants; however, effect sizes indicated moderate improvements in parent ratings of inattention, relationships with peers, academic impairment, and family functioning. There was large variability in the dosage of services received across participants, and an analysis of outcomes by dosage suggests large differences in response based on the number of sessions attended. This school-based intervention provides a viable option for educators and school mental health professionals who wish to provide interventions for high school students with ADHD, but further development and evaluation are needed.
- Published
- 2014
22. An Investigation of Sample Size Splitting on ATFIND and DIMTEST
- Author
-
Socha, Alan and DeMars, Christine E.
- Abstract
Modeling multidimensional test data with a unidimensional model can result in serious statistical errors, such as bias in item parameter estimates. Many methods exist for assessing the dimensionality of a test. The current study focused on DIMTEST. Using simulated data, the effects of sample size splitting for use with the ATFIND procedure for empirically deriving a subtest composed of items that potentially measure a second dimension versus DIMTEST for assessing whether this subtest represents a second dimension were investigated. Conditions explored included proportion of sample used for ATFIND, sample size, test length, interability correlations, test structure, and distribution of item difficulties. Overall, it appears that DIMTEST has Type I error rates near the nominal rate and good power in detecting multidimensionality, although Type I error inflation is observed for larger sample sizes. Results suggest that a 50/50 split maximizes power and keeps the Type I error rate below the nominal level unless the test is short and the sample is large. A 75/25 split controls Type I error better for short tests and large samples. (Contains 4 figures and 1 note.)
- Published
- 2013
- Full Text
- View/download PDF
23. A Note on Specifying the Guessing Parameter in ATFIND and DIMTEST
- Author
-
Socha, Alan and DeMars, Christine E.
- Abstract
The software program DIMTEST can be used to assess the unidimensionality of item scores. The software allows the user to specify a guessing parameter. Using simulated data, the effects of guessing parameter specification for use with the ATFIND procedure for empirically deriving the Assessment Subtest (AT; that is, a subtest composed of items that potentially measure a second dimension) and DIMTEST for assessing whether that AT represents a second dimension were investigated. Results suggest that specifying higher guessing parameters in ATFIND and DIMTEST results in higher Type I error rates. (Contains 3 figures.)
- Published
- 2013
- Full Text
- View/download PDF
24. A Tutorial on Interpreting Bifactor Model Scores
- Author
-
DeMars, Christine E.
- Abstract
This tutorial addresses possible sources of confusion in interpreting trait scores from the bifactor model. The bifactor model may be used when subscores are desired, either for formative feedback on an achievement test or for theoretically different constructs on a psychological test. The bifactor model is often chosen because it requires fewer computational resources than other models for subscores. The bifactor model yields a score on the general or primary trait measured by the test overall, as well as specific or secondary traits measured by the subscales. Interpreting the general trait score is straightforward, but the specific traits must be interpreted as residuals relative to the general trait. Trait scores on the specific factors are contrasted with trait scores on a simple-structure model with correlated factors, using example data from one TIMSS test booklet and a civic responsibility measure. The correlated factors model was used for contrast because its scores correspond to a more intuitive interpretation of subscores, and thus it helps to illustrate how the bifactor scores should NOT be interpreted. Estimation details are covered in an appendix.
- Published
- 2013
- Full Text
- View/download PDF
25. The Impact of External Events on Low-Stakes Assessment: A Cautionary Tale.
- Author
-
Nason, Kelsey and DeMars, Christine E.
- Subjects
SPRING, SUICIDE victims, STUDENTS
- Abstract
Universities administer assessments for accountability and program improvement. Student effort during these assessments is often low because the perceived consequences are minimal. The effects of low effort are compounded by the assessment context. This project investigates validity concerns caused by minimal effort and exacerbated by contextual factors. Systematic disruptions that affect effort impact the validity of scores. Effort and scores from four administrations of James Madison University's (JMU) remote Assessment Day were examined; these semesters presented unique, changing contexts. Special attention was paid to Spring 2022, which had numerous contextual factors (e.g., online assessment, campus suicides) affecting students and their assessment environments. Time spent testing varied across semesters, mirroring the varied scores. With one exception, our results showed lower effort in Spring (posttest) than Fall (pretest) assessments, which led to estimates of little or no gain between pretest and posttest. Implications and limitations are discussed.
- Published
- 2023
26. Investigating the Impact of Compromised Anchor Items on IRT Equating under the Nonequivalent Anchor Test Design
- Author
-
Jurich, Daniel P., DeMars, Christine E., and Goodman, Joshua T.
- Abstract
The prevalence of high-stakes test scores as a basis for significant decisions necessitates the dissemination of accurate and fair scores. However, the magnitude of these decisions has created an environment in which examinees may be prone to resort to cheating. To reduce the risk of cheating, multiple test forms are commonly administered. When multiple forms are employed, the forms must be equated to account for potential differences in form difficulty. If cheating occurs on one of the forms, the equating procedure may produce inaccurate results. A simulation study was conducted to examine the impact of cheating on item response theory (IRT) true score equating. Recovery of equated scores and scaling constants was assessed for the Stocking-Lord IRT scaling method under various conditions. Results indicated that cheating artificially increased the equated scores of the entire examinee group that was administered the compromised form. Future research should focus on the identification and removal of compromised items. (Contains 5 tables, 2 figures and 1 note.)
- Published
- 2012
- Full Text
- View/download PDF
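The record above evaluates Stocking-Lord scaling within IRT true-score equating. For context, the Stocking-Lord method chooses the slope A and intercept B of the linear metric transformation by minimizing the squared gap between the anchor items' test characteristic curves on the two metrics; a generic statement of that criterion (a sketch, not the study's exact implementation) is:

```latex
% Metric transformation: \theta^{*} = A\theta + B, with anchor-item
% parameters rescaled as a_j^{*} = a_j / A and b_j^{*} = A b_j + B.
% Stocking-Lord criterion, evaluated over quadrature points \theta_q:
F(A,B) = \sum_{q} \Bigg[
   \sum_{j \in \text{anchor}} P_j\!\big(\theta_q;\, a_j^{\text{base}},
        b_j^{\text{base}}, c_j^{\text{base}}\big)
 - \sum_{j \in \text{anchor}} P_j\!\big(\theta_q;\, a_j^{*}, b_j^{*}, c_j\big)
   \Bigg]^{2}
```

When some anchor responses are compromised by cheating, the estimated A and B absorb the artificially inflated anchor performance, which is how the equated scores for the whole group taking that form can be pushed upward.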
27. Software Note: Using BILOG for Fixed-Anchor Item Calibration
- Author
-
DeMars, Christine E. and Jurich, Daniel P.
- Abstract
The nonequivalent groups anchor test (NEAT) design is often used to scale item parameters from two different test forms. A subset of items, called the anchor items or common items, is administered as part of both test forms. These items are used to adjust the item calibrations for any differences in the ability distributions of the groups taking different forms. One method for scaling the item parameters is the fixed-anchor item method, in which the parameters for the anchor items are held constant at the values calibrated in the base form. With this method, the parameters for the new form should be on the metric of the old form, with no need for further scaling. The fixed-anchor procedure may be more complicated when the base-form and new-form ability distributions show large differences, as they would for vertical scaling. Several studies have demonstrated that in BILOG (Zimowski, Muraki, Mislevy, & Bock, 2003), if one fixes the anchor item parameters and specifies NOADJUST on the CALIB line, and if the new-form ability distribution is not close to N(0,1) relative to the metric of the anchor items, the scaling for the new-form nonanchor items and the new-form ability distribution will not be accurate (Kang & Peterson, 2009; Kim, 2006). Before discussing these findings and proposing an alternative, several BILOG options will be discussed. These options will be described in the context of separate calibration for the new form. (Contains 2 figures, 1 table, and 2 notes.)
- Published
- 2012
- Full Text
- View/download PDF
28. Confirming Testlet Effects
- Author
-
DeMars, Christine E.
- Abstract
A testlet is a cluster of items that share a common passage, scenario, or other context. These items might measure something in common beyond the trait measured by the test as a whole; if so, the model for the item responses should allow for this testlet trait. But modeling testlet effects that are negligible makes the model unnecessarily complicated and risks capitalization on chance, increasing the error in parameter estimates. Checking each testlet to see if the items within the testlet share something beyond the primary trait could therefore be useful. This study included (a) comparison between a model with no testlets and a model with testlet "g," (b) comparison between a model with all suspected testlets and a model with all suspected testlets "except" testlet "g," and (c) a test of essential unidimensionality. Overall, Comparison b was most useful for detecting testlet effects. Model comparisons based on information criteria, specifically the sample-size adjusted Bayesian Information Criteria (SSA-BIC) and BIC, resulted in fewer false alarms than statistical significance tests. The test of essential unidimensionality had true hit rates and false alarm rates similar to the SSA-BIC when the testlet effect was zero for all testlets except the studied testlet. But the presence of additional testlet effects in the partitioning test led to higher false alarm rates for the test of essential unidimensionality. (Contains 4 tables, 3 figures, and 2 notes.)
- Published
- 2012
- Full Text
- View/download PDF
29. A Comparison of Limited-Information and Full-Information Methods in Mplus for Estimating Item Response Theory Parameters for Nonnormal Populations
- Author
-
DeMars, Christine E.
- Abstract
In structural equation modeling software, either limited-information (bivariate proportions) or full-information item parameter estimation routines could be used for the 2-parameter item response theory (IRT) model. Limited-information methods assume the continuous variable underlying an item response is normally distributed. For skewed and platykurtic latent variable distributions, 3 methods were compared in Mplus: limited information, full information integrating over a normal distribution, and full information integrating over the known underlying distribution. Interfactor correlation estimates were similar for all 3 estimation methods. For the platykurtic distribution, estimation method made little difference for the item parameter estimates. When the latent variable was negatively skewed, for the most discriminating easy or difficult items, limited-information estimates of both parameters were considerably biased. Full-information estimates obtained by marginalizing over a normal distribution were somewhat biased. Full-information estimates obtained by integrating over the true latent distribution were essentially unbiased. For the "a" parameters, standard errors were larger for the limited-information estimates when the bias was positive but smaller when the bias was negative. For the "d" parameters, standard errors were larger for the limited-information estimates of the easiest, most discriminating items. Otherwise, they were generally similar for the limited- and full-information estimates. Sample size did not substantially impact the differences between the estimation methods; limited information did not gain an advantage for smaller samples. (Contains 4 tables, 1 footnote, and 9 figures.)
- Published
- 2012
- Full Text
- View/download PDF
30. Effectiveness of the Challenging Horizons After-School Program for Young Adolescents with ADHD
- Author
-
Evans, Steven W., Schultz, Brandon K., and DeMars, Christine E.
- Abstract
There are no empirically supported psychosocial treatments for adolescents with attention-deficit hyperactivity disorder (ADHD). This study examined the treatment benefits of the Challenging Horizons Program (CHP), a psychosocial treatment program designed to address the impairment and symptoms associated with this disorder in young adolescents. In addition to evaluating social and academic functioning outcomes, two critical questions from previous studies pertaining to the timing, duration, and family involvement in treatment were addressed. Forty-nine students recruited in two cohorts were randomly assigned to receive either the CHP or a community care condition. Outcomes suggested that students who received the CHP improved compared to students in the control condition on measures of symptoms and impairment. Implications related to timing, duration, and family involvement are reported, as well as recommendations for future studies. (Contains 3 tables.)
- Published
- 2011
- Full Text
- View/download PDF
31. Differential Item Functioning Detection with Latent Classes: How Accurately Can We Detect Who Is Responding Differentially?
- Author
-
DeMars, Christine E. and Lau, Abigail
- Abstract
There is a long history of differential item functioning (DIF) detection methods for known, manifest grouping variables, such as sex or ethnicity. But if the experiences or cognitive processes leading to DIF are not perfectly correlated with the manifest groups, it would be more informative to uncover the latent groups underlying DIF. The use of item response theory (IRT) mixture models to detect latent groups and estimate the DIF caused by these latent groups has been explored/interpreted with real data sets, but the accuracy of model estimation has not been thoroughly explored. The purpose of this simulation research was to assess the accuracy of the recovery of classes, item parameters, and DIF effects in contexts where relatively small clusters of items showed DIF. Overall, the results from the study reveal that the use of IRT mixture models for latent DIF detection may be problematic. Class membership recovery was poor in all conditions tested. Discrimination parameters were estimated well for the invariant items, as well as for the DIF items when there was no group impact. But when there was group impact, discriminations for the DIF items were positively biased. When there was no group impact, DIF effect estimates tended to be positively biased. In general, having fewer items was associated with more biased estimates and larger standard errors. (Contains 1 figure and 5 tables.)
- Published
- 2011
- Full Text
- View/download PDF
32. An Analytic Comparison of Effect Sizes for Differential Item Functioning
- Author
-
DeMars, Christine E.
- Abstract
Three types of effect sizes for DIF are described in this exposition: log of the odds-ratio (differences in log-odds), differences in probability-correct, and proportion of variance accounted for. Using these indices involves conceptualizing the degree of DIF in different ways. This integrative review discusses how these measures are impacted in different ways by item difficulty, item discrimination, and item lower asymptote. For example, for a fixed discrimination, the difference in probabilities decreases as the difference between the item difficulty and the mean ability increases. Under the same conditions, the log of the odds-ratio remains constant if the lower asymptote is zero. A non-zero lower asymptote decreases the absolute value of the probability difference symmetrically for easy and hard items, but it decreases the absolute value of the log-odds difference much more for difficult items. Thus, one cannot set a criterion for defining a large effect size in one metric and find a corresponding criterion in another metric that is equivalent across all items or ability distributions. In choosing an effect size, these differences must be understood and considered. (Contains 2 tables, 6 footnotes, and 7 figures.)
- Published
- 2011
- Full Text
- View/download PDF
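The contrast drawn in the record above is easy to see numerically. The small sketch below evaluates the probability-difference and log-odds-difference metrics for a hypothetical uniform-DIF item (a 0.5 difficulty shift) at several difficulty levels with a nonzero lower asymptote; all parameter values are illustrative.

```python
import math

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def dif_effect_sizes(theta, a, b_ref, b_focal, c):
    """Probability difference and log-odds difference between the
    reference and focal response functions at one ability value."""
    p_ref = p3pl(theta, a, b_ref, c)
    p_foc = p3pl(theta, a, b_focal, c)
    prob_diff = p_ref - p_foc
    log_odds_diff = (math.log(p_ref / (1.0 - p_ref))
                     - math.log(p_foc / (1.0 - p_foc)))
    return prob_diff, log_odds_diff

# With c = 0 the log-odds difference would stay constant across b; with
# c = 0.2 it shrinks for hard items, while the probability difference
# shrinks as the item moves away from theta = 0 in either direction.
for b in (-1.5, 0.0, 1.5):
    print(b, dif_effect_sizes(theta=0.0, a=1.0,
                              b_ref=b, b_focal=b + 0.5, c=0.2))
```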
33. Type I Error Inflation for Detecting DIF in the Presence of Impact
- Author
-
DeMars, Christine E.
- Abstract
In this brief explication, two challenges for using differential item functioning (DIF) measures when there are large group differences in true proficiency are illustrated. Each of these difficulties may lead to inflated Type I error rates, for very different reasons. One problem is that groups matched on observed score are not necessarily well matched on true proficiency, which may result in the false detection of DIF due to inaccurate matching. The other problem is that a model that does not allow for a nonzero asymptote can produce what seems to be DIF. These issues have been discussed separately in the literature earlier. This article brings them together in a nontechnical form. (Contains 3 tables and 3 figures.)
- Published
- 2010
- Full Text
- View/download PDF
34. Can Differential Rapid-Guessing Behavior Lead to Differential Item Functioning?
- Author
-
DeMars, Christine E. and Wise, Steven L.
- Abstract
This investigation examined whether different rates of rapid guessing between groups could lead to detectable levels of differential item functioning (DIF) in situations where the item parameters were the same for both groups. Two simulation studies were designed to explore this possibility. The groups in Study 1 were simulated to reflect differences between high-stakes and low-stakes conditions, with no rapid guessing in the high-stakes condition. Easy, discriminating items with high rates of rapid guessing by the low-stakes group were detected as showing DIF favoring the high-stakes group when using the Mantel-Haenszel index. The groups in Study 2 were simulated to reflect gender differences in rapid guessing on a low-stakes test. Both groups had some rapid guessing, but the focal group guessed more. Easy items with greater differences in rapid guessing were more likely to be detected as showing DIF. When the group with more rapid guessing had lower mean proficiency, the overall proportion of flagged items was lower but the effect of difference in rapid guessing remained. Our results suggest that there likely are instances in which statistically identified DIF is observed due to the behavioral characteristics of the studied subgroups rather than the content of the items. (Contains 8 figures, 1 table and 5 notes.)
- Published
- 2010
- Full Text
- View/download PDF
35. Examinee Noneffort and the Validity of Program Assessment Results
- Author
-
Wise, Steven L. and DeMars, Christine E.
- Abstract
Educational program assessment studies often use data from low-stakes tests to provide evidence of program quality. The validity of scores from such tests, however, is potentially threatened by examinee noneffort. This study investigated the extent to which one type of noneffort--rapid-guessing behavior--distorted the results from three types of commonly used program assessment designs. It was found that, for each design, a modest amount of rapid guessing had a pronounced effect on the results. In addition, motivation filtering was found to be successful in mitigating the effects caused by rapid guessing. It is suggested that measurement practitioners routinely apply motivation filtering whenever the data from low-stakes tests are used to support program decisions. (Contains 5 tables, 3 figures and 1 footnote.)
- Published
- 2010
- Full Text
- View/download PDF
36. Modification of the Mantel-Haenszel and Logistic Regression DIF Procedures to Incorporate the SIBTEST Regression Correction
- Author
-
DeMars, Christine E.
- Abstract
The Mantel-Haenszel (MH) and logistic regression (LR) differential item functioning (DIF) procedures have inflated Type I error rates when there are large mean group differences, short tests, and large sample sizes. When there are large group differences in mean score, groups matched on the observed number-correct score differ on true score, contributing to inflated Type I error rates. The simultaneous item bias test procedure has incorporated an adjustment for this difference, originally using a linear regression correction and later using a nonlinear correction. In this study, these adjustments are applied to the MH and LR procedures. They effectively reduce the Type I error inflation for the MH and the LR test of uniform DIF, but not the LR test of nonuniform DIF. For large samples and large group mean differences, the Δ effect size is estimated with greater accuracy using these adjustments. (Contains 3 tables and 5 figures.)
- Published
- 2009
- Full Text
- View/download PDF
37. Polytomous Differential Item Functioning and Violations of Ordering of the Expected Latent Trait by the Raw Score
- Author
-
DeMars, Christine E.
- Abstract
The graded response (GR) and generalized partial credit (GPC) models do not imply that examinees ordered by raw observed score will necessarily be ordered on the expected value of the latent trait (OEL). Factors were manipulated to assess whether increased violations of OEL also produced increased Type I error rates in differential item functioning (DIF) procedures conditioned on the raw score. Shorter tests and greater variance in item slope parameters increased OEL violations for the GR data but not for the GPC data. These same factors, combined with group mean differences between the reference and focal groups, increased the Type I error rate for the observed raw score DIF methods for "both" the GR and GPC data. A procedure conditioned on the classical test theory latent score estimate instead of the observed score helped reduce the Type I error in some of the conditions but not for the shortest tests. (Contains 2 figures and 4 tables.)
- Published
- 2008
- Full Text
- View/download PDF
38. Changes in Rapid-Guessing Behavior over a Series of Assessments
- Author
-
DeMars, Christine E.
- Abstract
A series of 8 tests was administered to university students over 4 weeks for program assessment purposes. The stakes of these tests were low for students; they received course points based on test completion, not test performance. Tests were administered in a counterbalanced order across 2 administrations. Response time effort, a measure of the proportion of items on which solution behavior rather than rapid-guessing behavior was used, was higher when a test was administered in the 1st week. Test scores were also higher. Differences between Week 1 and Week 4 test scores decreased when the test was scored with an effort-moderated model that took into account whether the student used solution or rapid-guessing behavior. Differences further decreased when students who used rapid-guessing on 5 or more of the 30 items were filtered from the data set.
- Published
- 2007
- Full Text
- View/download PDF
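The record above relies on response time effort (RTE), the proportion of items on which an examinee showed solution behavior rather than rapid guessing, and on filtering students who rapid-guessed on 5 or more of the 30 items. A minimal sketch of those two steps is below, assuming per-item rapid-guessing time thresholds are already available; how the thresholds are set is a separate question not addressed here.

```python
import numpy as np

def response_time_effort(times, thresholds):
    """RTE: proportion of items on which the response time met or
    exceeded the item's rapid-guessing threshold (solution behavior)."""
    return float(np.mean(np.asarray(times, float) >=
                         np.asarray(thresholds, float)))

def motivation_filter(time_matrix, thresholds, max_rapid_guesses=4):
    """Boolean mask keeping examinees with at most max_rapid_guesses
    rapid-guessed items (the abstract's rule drops examinees with 5 or
    more rapid guesses on a 30-item test).  Rows are examinees, columns
    are items."""
    rapid = np.asarray(time_matrix, float) < np.asarray(thresholds, float)
    return rapid.sum(axis=1) <= max_rapid_guesses
```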
39. 'Guessing' Parameter Estimates for Multidimensional Item Response Theory Models
- Author
-
DeMars, Christine E.
- Abstract
Two software packages commonly used for multidimensional item response theory (IRT) models require the user to input values for the lower asymptotes of the item response functions. One way of selecting these values is to estimate lower asymptotes with a one-dimensional IRT model and use those estimates as fixed values in the multidimensional model. This procedure was compared to simply setting the asymptotes to a reasonable value. For two-factor tests, the use of unidimensional asymptotes worked well, yielding results nearly comparable to setting the lower asymptotes to the true values. With four-factor tests, in contrast, the item parameter and item response surface estimates were less accurate when the lower asymptotes were estimated through a unidimensional model. The estimates of the lower asymptotes from the unidimensional model tended to be too high for the four-factor tests, which likely caused the decreased accuracy of this procedure. (Contains 8 tables.)
- Published
- 2007
- Full Text
- View/download PDF
40. An Application of Item Response Time: The Effort-Moderated IRT Model
- Author
-
Wise, Steven L. and DeMars, Christine E.
- Abstract
The validity of inferences based on achievement test scores is dependent on the amount of effort that examinees put forth while taking the test. With low-stakes tests, for which this problem is particularly prevalent, there is a consequent need for psychometric models that can take into account differing levels of examinee effort. This article introduces the effort-moderated IRT model, which incorporates item response time into proficiency estimation and item parameter estimation. In two studies of the effort-moderated model when rapid guessing (i.e., reflecting low examinee effort) was present, one based on real data and the other on simulated data, the effort-moderated model performed better than the standard 3PL model. Specifically, it was found that the effort-moderated model (a) showed better model fit, (b) yielded more accurate item parameter estimates, (c) more accurately estimated test information, and (d) yielded proficiency estimates with higher convergent validity. (Contains 3 notes.)
- Published
- 2006
- Full Text
- View/download PDF
41. Application of the Bi-Factor Multidimensional Item Response Theory Model to Testlet-Based Tests
- Author
-
DeMars, Christine E.
- Abstract
Four item response theory (IRT) models were compared using data from tests where multiple items were grouped into testlets focused on a common stimulus. In the bi-factor model each item was treated as a function of a primary trait plus a nuisance trait due to the testlet; in the testlet-effects model the slopes in the direction of the testlet traits were constrained within each testlet to be proportional to the slope in the direction of the primary trait; in the polytomous model the item scores were summed into a single score for each testlet; and in the independent-items model the testlet structure was ignored. Using the simulated data, reliability was overestimated somewhat by the independent-items model when the items were not independent within testlets. Under these nonindependent conditions, the independent-items model also yielded greater root mean square error (RMSE) for item difficulty and underestimated the item slopes. When the items within testlets were instead generated to be independent, the bi-factor model yielded somewhat higher RMSE in difficulty and slope. Similar differences between the models were illustrated with real data.
- Published
- 2006
- Full Text
- View/download PDF
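Writing out the bifactor item response function makes the constraint described in the record above concrete. The notation below is generic; it is only a sketch of the bifactor and testlet-effects models named in the abstract.

```latex
% Bifactor 2PL: item i in testlet t(i) loads on the primary trait
% \theta_0 and on its own testlet trait \theta_{t(i)}.
P(X_i = 1 \mid \theta_0, \theta_{t(i)})
  = \frac{1}{1 + \exp\!\big[-(a_{i0}\theta_0 + a_{it}\theta_{t(i)} + d_i)\big]}

% Testlet-effects constraint: within each testlet the testlet-direction
% slope is proportional to the primary-direction slope,
%   a_{it} = \lambda_{t(i)}\, a_{i0},
% so only one additional parameter per testlet is estimated.
```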
42. Type I Error Rates for PARSCALE's Fit Index
- Author
-
DeMars, Christine E.
- Abstract
Type I error rates for PARSCALE's fit statistic were examined. Data were generated to fit the partial credit or graded response model, with test lengths of 10 or 20 items. The ability distribution was simulated to be either normal or uniform. Type I error rates were inflated for the shorter test length and, for the graded-response model, also for the longer test length when the ability distribution was uniform. In conditions in which α was inflated, it was particularly high when one or more response categories were used infrequently. Overall, PARSCALE's fit index is not recommended for short tests.
- Published
- 2005
- Full Text
- View/download PDF
43. Low Examinee Effort in Low-Stakes Assessment: Problems and Potential Solutions
- Author
-
Wise, Steven L. and DeMars, Christine E.
- Abstract
Student test-taking motivation in low-stakes assessment testing is examined in terms of both its relationship to test performance and the implications of low student effort for test validity. A theoretical model of test-taking motivation is presented, with a synthesis of previous research indicating that low student motivation is associated with a substantial decrease in test performance. A number of assessment practices and data analytic procedures for managing the problems posed by low student motivation are discussed.
- Published
- 2005
- Full Text
- View/download PDF
44. Detection of Item Parameter Drift over Multiple Test Administrations
- Author
-
DeMars, Christine E.
- Abstract
Three methods of detecting item drift were compared: the procedure in BILOG-MG for estimating linear trends in item difficulty, the CUSUM procedure that Veerkamp and Glas (2000) used to detect trends in difficulty or discrimination, and a modification of Kim, Cohen, and Park's (1995) χ² test for multiple-group differential item functioning (DIF), using linear contrasts on the discrimination and difficulty parameters. Data were simulated as if collected over 3, 4, or 5 time points, with parameter drift in either a gradual, linear pattern, a less linear but still monotonic pattern, or as a sudden shift at the third time point. The BILOG-MG procedure and the modification of the Kim et al. procedure were more powerful than the CUSUM procedure, nearly always detecting drift. All three procedures had false alarm rates for nondrift items near the nominal alpha. The procedures were also illustrated on a real data set.
- Published
- 2004
45. Scoring 'Neutral or Unsure' on an Identity Development Instrument for Higher Education
- Author
-
DeMars, Christine E. and Erwin, T. Dary
- Abstract
The use of "neutral or unsure" on an instrument designed to measure identity development in college students was explored. The nominal response model from item response theory was used to evaluate whether "neutral or unsure" was used more frequently by those at low or middle levels of development; the results depended on the subscale and sometimes on the item within the subscale. Scoring based on the nominal response model allows for this category to be treated differently for different items.
- Published
- 2004
- Full Text
- View/download PDF
46. Type I Error Rates for Generalized Graded Unfolding Model Fit Indices
- Author
-
DeMars, Christine E.
- Abstract
Type I error rates were examined for several fit indices available in GGUM2000: extensions of Infit, Outfit, Andrich's χ², and the log-likelihood ratio χ². Infit and Outfit had Type I error rates much lower than nominal alpha. Andrich's χ² had Type I error rates much higher than nominal alpha, particularly for shorter tests or larger sample sizes. The log-likelihood ratio χ² had Type I error rates near or below nominal alpha for small samples or longer tests but had inflated error rates with large samples and shorter tests. For conditions in which the log-likelihood ratio χ² did not perform well, alternative fit indices or modifications to these procedures should be considered in future studies.
- Published
- 2004
- Full Text
- View/download PDF
47. Revising the Scale of Intellectual Development: Application of an Unfolding Model.
- Author
-
DeMars, Christine E. and Erwin, T. Dary
- Abstract
An unfolding model was selected for the scores on the Scale of Intellectual Development to take into account that, for stage-based instruments, agreement with a statement first increases as the student approaches the stage represented by the statement, then decreases as the student progresses beyond that stage. (Contains 21 references.) (Author)
- Published
- 2003
48. Detecting Multidimensionality Due to Curricular Differences.
- Author
-
DeMars, Christine E.
- Abstract
Generated data to simulate multidimensionality resulting from including two or four subtopics on a test. DIMTEST analysis results suggest that including multiple topics, when they are commonly taught together, can lead to conceptual multidimensionality and mathematical multidimensionality. (SLD)
- Published
- 2003
49. Sample Size and the Recovery of Nominal Response Model Item Parameters.
- Author
-
DeMars, Christine E.
- Abstract
Varied the number of items and categories per item to explore the effects on estimation of item parameters in the nominal response model. Simulation results show that increasing the number of items had little effect on item parameter recovery, but increasing the number of categories increased the error variance of the parameter estimates. (SLD)
- Published
- 2003
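The record above (like the identity-development scoring study earlier in this list) uses Bock's nominal response model. For reference, the category response function is the standard multinomial-logit form; nothing here is specific to the study.

```latex
% Nominal response model: probability of choosing category k of item i,
% with category slope a_{ik} and intercept c_{ik}; K_i categories.
P(X_i = k \mid \theta)
  = \frac{\exp(a_{ik}\theta + c_{ik})}
         {\sum_{m=1}^{K_i} \exp(a_{im}\theta + c_{im})}
```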
50. Standard Setting: A Systematic Approach to Interpreting Student Learning.
- Author
-
DeMars, Christine E., Sundre, Donna L., and Wise, Steven L.
- Abstract
Describes workshops designed to set standards for freshman technological literacy at James Madison University (Virginia). Results indicated that about 30% of incoming freshmen could meet the standards set initially; by the end of the year, an additional 50-60% could meet them. Provides recommendations for standard setting in a general education context. (Contains 14 references.) (AUTH)
- Published
- 2002