13 results for "Choi, Ikkyu"
Search Results
2. Benchmark Keystroke Biometrics Accuracy from High-Stakes Writing Tasks. Research Report. ETS RR-21-15
- Author
- Choi, Ikkyu, Hao, Jiangang, Deane, Paul, and Zhang, Mo
- Abstract
"Biometrics" are physical or behavioral human characteristics that can be used to identify a person. It is widely known that keystroke or typing dynamics for short, fixed texts (e.g., passwords) could serve as a behavioral biometric. In this study, we investigate whether keystroke data from essay responses can lead to a reliable biometric measure, with implications for test security and the monitoring of writing fluency and style changes. Based on keystroke data collected from a high-stakes writing testing setting, we established a preliminary biometric benchmark for detecting test-taker identity by using features extracted from their writing process logs. We report a benchmark keystroke biometric accuracy of equal error rate of 4.7% for identifying same versus different individuals on an essay task. In particular, we show that the inclusion of writing process features (e.g., features designed to describe the writing process) in addition to the widely used typing-timing features (e.g., features based on the time intervals between two-letter key sequences) improves the accuracy of the keystroke biometrics. The proposed keystroke biometrics can have important implications for the writing assessments administered through the remotely proctored tests that have been widely adopted during the COVID pandemic.
- Published
- 2021
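The 4.7% figure above is an equal error rate (EER): the operating point at which the false acceptance rate and the false rejection rate coincide. A minimal sketch of that computation from genuine (same-writer) and impostor (different-writer) similarity scores, using NumPy only; the score distributions below are synthetic stand-ins, not the study's data:

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Find the threshold where false acceptance ~= false rejection.

    genuine:  similarity scores for same-writer pairs
    impostor: similarity scores for different-writer pairs
    """
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    # FAR: impostors accepted; FRR: genuine writers rejected.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))          # closest crossing point
    return (far[i] + frr[i]) / 2, thresholds[i]

# Toy example with synthetic scores (illustrative only).
rng = np.random.default_rng(0)
genuine = rng.normal(0.8, 0.1, 1000)
impostor = rng.normal(0.5, 0.1, 1000)
eer, tau = equal_error_rate(genuine, impostor)
print(f"EER ~ {eer:.3f} at threshold {tau:.3f}")
```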
3. Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation
- Author
- Casabianca, Jodi M., Donoghue, John R., Shin, Hyo Jeong, Chao, Szu-Fu, and Choi, Ikkyu
- Abstract
Using item response theory to model rater effects provides an alternative to standard performance metrics for rater monitoring and diagnosis. To fit such models, the ratings data must be sufficiently connected for rater effects to be estimable. Because of the rating designs popular in large-scale testing, there tends to be a large proportion of missing data, yielding sparse matrices and estimation issues. In this article, we explore the impact of different types of connectedness, or linkage, brought about by a linkage set: a collection of responses scored by most or all raters. We also explore the impact of the properties and composition of the linkage set, the connectedness yielded by different rating designs, and the role of scores from automated scoring engines. For monitoring systems that use the rater response version of the generalized partial credit model, the study results suggest using a linkage set, especially a large one composed of responses representing the full score scale. Results also show that a double-human-scoring design provides more connectedness than a design pairing one human with an automated scoring engine, and that scores from automated scoring engines alone do not provide adequate connectedness. We discuss considerations for operational implementation and further study.
- Published
- 2023
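Whether ratings data are "sufficiently connected" can be checked by treating raters and responses as nodes of a bipartite graph and testing whether all raters end up in one connected component. A minimal sketch in plain Python with union-find; the rating records are hypothetical:

```python
def find(parent, x):
    # Path-compressing find for union-find.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def raters_connected(ratings):
    """ratings: iterable of (rater_id, response_id) pairs."""
    parent = {}
    for rater, resp in ratings:
        for node in (("r", rater), ("x", resp)):
            parent.setdefault(node, node)
        a, b = find(parent, ("r", rater)), find(parent, ("x", resp))
        parent[a] = b                     # union rater with response
    raters = {("r", r) for r, _ in ratings}
    roots = {find(parent, r) for r in raters}
    return len(roots) == 1               # one component => connected

# Two raters linked only through a shared linkage-set response (id 9).
print(raters_connected([("A", 1), ("A", 9), ("B", 2), ("B", 9)]))  # True
print(raters_connected([("A", 1), ("B", 2)]))                      # False
```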
4. Evaluating Writing Process Features in an Adult EFL Writing Assessment Context: A Keystroke Logging Study
- Author
- Choi, Ikkyu and Deane, Paul
- Abstract
Keystroke logs provide a comprehensive record of observable writing processes. Previous studies examining the keystroke logs of young L1 English writers performing experimental writing tasks have identified writing process features predictive of response quality. In contrast, large-scale studies on the dynamic and temporal nature of L2 writing processes are scarce, especially in assessment settings. This study used the keystroke logs of adult English as a foreign language (EFL) learners responding to assessment tasks to examine the usefulness of the process features in this new context. We evaluated the features in terms of stability, explored factor structures for their correlations, and constructed models to predict response quality. The results showed that most of the process features were stable and that their correlations could be efficiently represented with a five-factor structure. Moreover, response quality prediction improved over a baseline by up to 48%. These findings have implications for the evaluation and understanding of writing process features and for the substantive understanding of writing processes under assessment conditions.
- Published
- 2021
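A five-factor structure for feature correlations, as reported above, can be explored with ordinary exploratory factor analysis. A sketch using scikit-learn; the feature matrix X (writers by keystroke process features) is a synthetic stand-in, and the 0.4 loading cutoff is an illustrative convention, not the study's:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

# X: n_writers x n_features matrix of keystroke process features
# (synthetic stand-in here; real features would be pause/burst-type measures).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))

Z = StandardScaler().fit_transform(X)      # put features on a common scale
fa = FactorAnalysis(n_components=5, random_state=0).fit(Z)

loadings = fa.components_.T                # features x factors
print(loadings.shape)                      # (20, 5)
# Inspect which factor each feature loads on most strongly,
# e.g. flag salient loadings with |loading| > 0.4.
print(np.abs(loadings).argmax(axis=1)[:10])
```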
5. Evaluating Subscore Uses across Multiple Levels: A Case of Reading and Listening Subscores for Young EFL Learners
- Author
- Choi, Ikkyu and Papageorgiou, Spiros
- Abstract
Stakeholders of language tests are often interested in subscores. However, reporting a subscore is not always justified; a subscore should provide reliable and distinct information to be worth reporting. When a subscore is used for decisions across multiple levels (e.g., individual test takers and schools), its reliability and distinctiveness need to be justified at every relevant level. In this study, we examined whether reporting seven Reading and Listening subscores of the "TOEFL Primary"® test, a standardized English proficiency test for young English as a foreign language learners, could be justified at the individual and school levels. We analyzed data from pilot administrations in which 4,776 students from 51 schools participated. We employed the classical test theory (CTT) based approaches of Haberman (2008) and Haberman, Sinharay, and Puhan (2009) for the individual- and school-level investigations, respectively. We supplemented the CTT-based approaches with a factor analytic approach at the individual level and a multilevel modeling approach at the school level. The results differed across the two levels: we found little support for reporting the subscores at the individual level, but strong evidence for the added value of the school-level subscores when the sample size for each school exceeded 50.
- Published
- 2020
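Haberman's CTT criterion, cited above, asks whether the observed subscore predicts its own true score better than the observed total does, comparing proportional reductions in mean squared error (PRMSE). A rough sketch under the usual CTT assumption of measurement errors uncorrelated across subscores; the moment inputs are illustrative values you would estimate from data:

```python
def subscore_adds_value(var_s, var_x, cov_sx, rel_s):
    """Haberman-style check: does subscore s beat total x
    as a predictor of the true subscore?

    var_s, var_x: variances of observed subscore and total
    cov_sx:       covariance between subscore and total
    rel_s:        reliability of the subscore
    Assumes errors uncorrelated across subscores (a CTT assumption).
    """
    prmse_s = rel_s                        # PRMSE of the subscore itself
    # cov(x, true subscore) = cov(x, s) - error variance of s,
    # since x contains s and errors are assumed independent.
    cov_x_st = cov_sx - var_s * (1.0 - rel_s)
    # Squared correlation of the total with the true subscore.
    prmse_x = cov_x_st**2 / (var_x * rel_s * var_s)
    return prmse_s > prmse_x, prmse_s, prmse_x

adds, p_s, p_x = subscore_adds_value(var_s=16.0, var_x=100.0,
                                     cov_sx=30.0, rel_s=0.85)
print(adds, round(p_s, 3), round(p_x, 3))   # True 0.85 0.56
```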
6. The Impact of Operational Scoring Experience and Additional Mentored Training on Raters' Essay Scoring Accuracy
- Author
- Choi, Ikkyu and Wolfe, Edward W.
- Abstract
Rater training is essential to ensuring the quality of constructed response scoring. Most current knowledge about rater training comes from experimental contexts with an emphasis on short-term effects. Little empirical evidence is available on whether and how raters become more accurate as they gain scoring experience, or on the long-term effects training can have. In this study, we addressed this gap by tracking how the accuracy of new raters changes with experience and by examining the impact of an additional training session on their accuracy in scoring calibration and monitoring essays. We found that, on average, raters' accuracy improved with scoring experience and that individual raters differed in their accuracy trajectories. The estimated average effect of the training was an approximately six percent increase in calibration essay accuracy; we observed a smaller impact on monitoring essay accuracy. Our follow-up analysis showed that this differential impact of the additional training could be accounted for by successful gatekeeping through calibration.
- Published
- 2020
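Accuracy trajectories that differ across raters, as described above, are naturally captured by a mixed-effects growth model: a common experience slope plus rater-specific deviations. One plausible sketch with statsmodels; the long-format data (one row per rater per scoring occasion) are simulated stand-ins, not the study's records:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: 40 raters, 25 scoring occasions each.
rng = np.random.default_rng(2)
raters = np.repeat(np.arange(40), 25)
exp_ = np.tile(np.arange(25), 40)
slope = rng.normal(0.004, 0.002, 40)[raters]   # rater-specific growth
acc = 0.75 + slope * exp_ + rng.normal(0, 0.05, raters.size)
df = pd.DataFrame({"rater": raters, "experience": exp_, "accuracy": acc})

# Random intercept and random experience slope per rater.
model = smf.mixedlm("accuracy ~ experience", df,
                    groups=df["rater"], re_formula="~experience")
fit = model.fit()
print(fit.params["experience"])   # average accuracy gain per occasion
```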
7. Item Response Models for Multiple Attempts with Incomplete Data
- Author
- Bergner, Yoav, Choi, Ikkyu, and Castellano, Katherine E.
- Abstract
Allowing multiple chances to answer constructed response questions is a prevalent feature of computer-based homework and exams. We consider the use of item response theory to estimate item characteristics and student ability when multiple attempts are allowed but no explicit penalty is deducted for extra tries. This is common practice in online formative assessments, where the number of attempts is often unlimited. In these environments, some students may not answer until correct but may instead abandon an item after one or more incorrect tries. We contrast graded and sequential item response models, both unidimensional models that do not explicitly account for factors other than ability. These approaches differ not only in their log-odds assumptions but, importantly, in how they handle incomplete data. We explore the consequences of model misspecification through a simulation study and with four online homework data sets. Our results suggest that model selection is insensitive when data are complete but quite sensitive to whether missing responses are regarded as informative (of inability) or not (e.g., missing at random). Under realistic conditions, a sequential model with parametric degrees of freedom similar to a graded model can account for more response patterns and outperforms the latter in terms of model fit.
- Published
- 2019
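In a sequential (continuation-ratio) model, each extra attempt is a fresh Bernoulli step: the probability of succeeding at attempt k given that the student reached it. A sketch of one item's pattern likelihood, including the choice the abstract highlights, whether an abandoned item contributes an extra unobserved "failure" step (informative missingness) or not (missing at random); the Rasch-type step parameterization is illustrative:

```python
import math

def p_step(theta, b_k):
    # Probability of success at one attempt, Rasch-type step.
    return 1.0 / (1.0 + math.exp(-(theta - b_k)))

def pattern_loglik(theta, b, n_failures, solved, missing_informative=False):
    """Log-likelihood of one item's attempt pattern.

    b:          step difficulties b[0], b[1], ... per attempt
    n_failures: number of incorrect tries observed
    solved:     True if the final attempt was correct
    missing_informative: if the student quit, score the unobserved
        next attempt as an extra failure (informative missingness)
        instead of ignoring it (missing at random).
    """
    ll = sum(math.log(1.0 - p_step(theta, b[k])) for k in range(n_failures))
    if solved:
        ll += math.log(p_step(theta, b[n_failures]))
    elif missing_informative:
        ll += math.log(1.0 - p_step(theta, b[n_failures]))
    return ll

# Fail twice then quit: the MAR and informative treatments differ.
b = [0.0, 0.5, 1.0]
print(pattern_loglik(0.2, b, n_failures=2, solved=False))
print(pattern_loglik(0.2, b, n_failures=2, solved=False,
                     missing_informative=True))
```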
8. Investigating the Benefits of Scaffolding in Assessments of Young English Learners: A Case for Scaffolded Retell Tasks
- Author
- Choi, Ikkyu, Wolf, Mikyung Kim, Pooler, Emilie, Sova, Lorraine, and Faulkner-Bond, Molly
- Abstract
Scaffolding, an instructional strategy of providing support until learners can perform a task on their own, holds potential to improve assessments for test takers who need support to demonstrate their abilities, such as English learners (ELs). In this study, we evaluated the benefits of including scaffolded tasks in standardized speaking assessments for young ELs. Our focus was on scaffolded retell tasks, which require test takers to retell a given story with procedural scaffolds. We collected responses from 233 third-grade ELs to two scaffolded retell tasks and investigated the relationship between the pre- and post-scaffolding retell performances to assess the impact of the scaffolding on post-scaffolding performance. We also examined the psychometric added value of the scaffolding steps and post-scaffolding responses through an application of item response theory (IRT) measurement models. The findings showed that, conditional on pre-scaffolding performance, the EL test takers' performance on the scaffolding steps was positively associated with their post-scaffolding retell performance. Moreover, the responses to the scaffolding steps and the post-scaffolding retell provided additional information about test takers' oral English proficiency. These findings provide empirical support for the benefits of using scaffolding in standardized EL assessments.
- Published
- 2019
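The key empirical claim above, that scaffolding-step performance predicts post-scaffolding retell quality conditional on pre-scaffolding performance, amounts to a regression with the pre-scaffolding score partialed out. A sketch with statsmodels; the variable names and simulated scores are hypothetical stand-ins for the study's measures:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical scores for 233 test takers on one scaffolded retell task.
rng = np.random.default_rng(3)
pre = rng.normal(0, 1, 233)
scaffold = 0.5 * pre + rng.normal(0, 1, 233)
post = 0.4 * pre + 0.3 * scaffold + rng.normal(0, 1, 233)
df = pd.DataFrame({"pre": pre, "scaffold": scaffold, "post": post})

# A positive 'scaffold' coefficient indicates an association with
# post-scaffolding performance after conditioning on the pre score.
fit = smf.ols("post ~ pre + scaffold", data=df).fit()
print(fit.params[["pre", "scaffold"]])
```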
9. Adding Value to Second-Language Listening and Reading Subscores: Using a Score Augmentation Approach
- Author
- Papageorgiou, Spiros and Choi, Ikkyu
- Abstract
This study examined whether reporting subscores for groups of items within a test section assessing a second-language modality (specifically, reading or listening comprehension) added value, from a measurement perspective, to the information already provided by the section scores. We analyzed the responses of 116,489 test takers to reading and listening items from operational administrations of two large-scale international tests of English as a foreign language. To strengthen the reliability of the subscores, and thus improve their added value, we applied a score augmentation method (Haberman, 2008). Our aim was to examine whether reporting augmented subscores for specific groups of reading and listening items could improve the added value of these subscores and consequently justify providing more fine-grained information about test-taker performance. Our analysis indicated that, in general, there was a lack of support for reporting subscores from a psychometric perspective, and that score augmentation only marginally improved their added value. We discuss several implications of our findings for test developers wishing to report more fine-grained information about test performance, and we conclude by arguing that research on how best to report such refined feedback should remain a focus of future efforts related to second-language proficiency tests.
- Published
- 2018
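Score augmentation in the Haberman (2008) sense shrinks an observed subscore toward the information carried by the whole test: the augmented score is the linear combination of observed subscore and total that best predicts the true subscore. A rough sketch under the same CTT assumption as before (errors uncorrelated across subscores); the moment inputs and simulated scores are illustrative:

```python
import numpy as np

def augmented_subscore(s, x, rel_s, var_s, var_x, cov_sx):
    """Best linear predictor of the true subscore from (s, x).

    s, x: arrays of observed subscores and totals
    Assumes CTT with errors uncorrelated across subscores.
    """
    # Covariances of (s, x) with the true subscore s_t:
    c = np.array([rel_s * var_s,                     # cov(s, s_t)
                  cov_sx - var_s * (1.0 - rel_s)])   # cov(x, s_t)
    C = np.array([[var_s, cov_sx],
                  [cov_sx, var_x]])
    beta = np.linalg.solve(C, c)          # regression weights
    s_bar, x_bar = s.mean(), x.mean()
    return s_bar + beta[0] * (s - s_bar) + beta[1] * (x - x_bar)

rng = np.random.default_rng(4)
s = rng.normal(20, 4, 1000)
x = s + rng.normal(40, 8, 1000)
aug = augmented_subscore(s, x, rel_s=0.7, var_s=s.var(),
                         var_x=x.var(), cov_sx=np.cov(s, x)[0, 1])
print(aug[:3].round(2))
```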
10. Exploring the Validity of a Second Language Intercultural Pragmatics Assessment Tool
- Author
- Timpe-Laughlin, Veronika and Choi, Ikkyu
- Abstract
Pragmatics has been a key component of language competence frameworks. While the majority of second/foreign language (L2) pragmatics tests have targeted productive skills, the assessment of receptive pragmatic skills remains a developing field. This study explores validation evidence for a test of receptive L2 pragmatic ability, the American English Sociopragmatic Comprehension Test (AESCT), a Web-based assessment consisting of 54 tasks measuring knowledge of speech acts, routine formulae, and culture-dependent lexical differences. The AESCT is intended to be used as a learning-oriented assessment in university-level applied linguistics classes. The study collected evidence on construct validity supporting the AESCT design as a measure of pragmatic comprehension and as a source of feedback in low-stakes learning contexts. Ninety-seven university-level English language learners took the AESCT along with the Cambridge Placement Test and a background questionnaire on their exposure to target-language input. Descriptive statistics, correlation analyses, and linear regression were used to analyze aspects of construct validity. Results indicate that the AESCT is sufficiently reliable. Overall, learners performed as previous research suggests: sociopragmatic knowledge was related to L2 exposure and L2 proficiency. Alongside directions for future research on L2 pragmatics test validation, implications for instruction-oriented use of the AESCT are discussed.
- Published
- 2017
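The analyses named above (descriptive statistics, correlations, linear regression) can be reproduced in outline with standard tools. A sketch with SciPy and statsmodels; the variables (AESCT total, proficiency measure, hours of L2 exposure) and their simulated values are hypothetical stand-ins for the study's measures:

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
import statsmodels.formula.api as smf

# Hypothetical data for 97 learners.
rng = np.random.default_rng(5)
exposure = rng.gamma(2.0, 5.0, 97)           # hours of L2 input
proficiency = rng.normal(0, 1, 97)
aesct = 30 + 2.0 * proficiency + 0.3 * exposure + rng.normal(0, 3, 97)
df = pd.DataFrame({"aesct": aesct, "proficiency": proficiency,
                   "exposure": exposure})

print(df.describe().loc[["mean", "std"]])            # descriptives
print(pearsonr(df["aesct"], df["proficiency"]))      # correlation
fit = smf.ols("aesct ~ proficiency + exposure", data=df).fit()
print(fit.params)                                    # regression
```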
11. Empirical Profiles of Academic Oral English Proficiency from an International Teaching Assistant Screening Test
- Author
- Choi, Ikkyu
- Abstract
Language proficiency constitutes a crucial barrier for prospective international teaching assistants (ITAs). Many US universities administer screening tests to ensure that ITAs possess the required academic oral English proficiency for their TA duties. Such ITA screening tests often elicit a sample of spoken English, which is evaluated in terms of multiple aspects by trained raters. In this light, ITA screening tests provide an advantageous context in which to gather rich information about test taker performances. This study introduces a systematic way of extracting meaningful information for major stakeholders from an ITA screening test administered at a US university. In particular, this study illustrates how academic oral English proficiency profiles can be identified based on test takers' subscale score patterns, and discusses how the resulting profiles can be used as feedback for ITA training and screening policy makers, the ITA training program of the university, ESL instructors, and test takers. The proficiency profiles were identified using finite mixture modeling based on the subscale scores of 960 test takers. The modeling results suggested seven profile groups. These groups were interpreted and labeled based on the characteristic subscale score patterns of their members. The implications of the results are discussed, with the main focus on how such information can help ITA policy makers, the ITA training program, ESL instructors, and test takers make important decisions.
- Published
- 2017
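Finite mixture modeling of subscale score patterns, as used above to identify the seven profile groups, can be sketched with a Gaussian mixture plus an information criterion to choose the number of profiles. The subscale matrix below is synthetic, and the original study's model family may differ from a plain Gaussian mixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in: 960 test takers x 4 subscale scores,
# drawn from three distinct score-pattern groups.
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 0.5, size=(320, 4))
               for m in ([1, 1, 1, 1], [2, 1, 2, 1], [3, 3, 2, 2])])

# Fit mixtures with 1..9 components and compare BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0)
              .fit(X).bic(X)
        for k in range(1, 10)}
best_k = min(bics, key=bics.get)
print(best_k)                              # chosen number of profiles

# Mean subscale pattern per profile, used to interpret and label groups.
gm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
print(gm.means_.round(2))                  # characteristic score patterns
```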
12. Structural Equation Modeling Reporting Practices for Language Assessment
- Author
- Ockey, Gary J. and Choi, Ikkyu
- Abstract
Studies that use structural equation modeling (SEM) techniques are increasingly encountered in the language assessment literature. This popularity has created the need for a set of guidelines that can indicate what should be included in a research report and make it possible for research consumers to judge the appropriateness of the interpretations made from a reported study. This article attempts to fill this void by providing a set of reporting guidelines appropriate for language assessment researchers.
- Published
- 2015
13. Review of Pearson Test of English Academic: Building an Assessment Use Argument
- Author
- Wang, Huan, Choi, Ikkyu, and Schmidgall, Jonathan
- Abstract
This review departs from current practice in reviewing tests in that it employs an "argument-based approach" to test validation to guide the review (e.g., Bachman, 2005; Kane, 2006; Mislevy, Steinberg, & Almond, 2002). Specifically, it follows an approach to test development and use that Bachman and Palmer (2010) call the process of "assessment justification," which focuses on investigating the extent to which the intended uses of a particular test can be justified to stakeholders. The justification process includes two interrelated activities. The first is articulating an assessment use argument (AUA; Bachman, 2003, 2005; Bachman & Palmer, 2010), which makes explicit the claims that link test takers' performance to the consequences of test use. The second is collecting evidence to support the claims articulated in the AUA. This review examines the evidence provided by the test developer about the intended uses of the test so that readers and test users will be better informed in making their own judgments.
- Published
- 2012