18 results on '"Tom, Bramley"'
Search Results
2. Metaphors and the psychometric paradigm
- Author
-
Tom Bramley
- Subjects
Educational measurement, Psychometrics, Literal and figurative language, Education, Educational assessment, Psychology, Cognitive psychology
- Abstract
This paper analyses the psychometric paradigm for understanding educational measurement and standard setting by considering the extent to which that understanding is based on metaphors. I argue tha...
- Published
- 2020
- Full Text
- View/download PDF
3. The effect of adaptivity on the reliability coefficient in adaptive comparative judgement
- Author
-
Tom Bramley and Sylvia Vitello
- Subjects
Computer science, Judgement, Industrial engineering, Education, Adaptive comparative judgement, Scale (social sciences), Evaluation methods, Quality (business), Reliability (statistics)
- Abstract
Comparative Judgement (CJ) is an increasingly widely investigated method in assessment for creating a scale, for example of the quality of essays. One area that has attracted attention in C...
- Published
- 2018
- Full Text
- View/download PDF
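For orientation (this is background, not text from the article): the reliability coefficient usually reported in comparative judgement studies is a Rasch-style scale separation reliability computed from the estimated script measures, commonly written as

\[
\mathrm{SSR} = \frac{\mathrm{SD}_\beta^{2} - \mathrm{RMSE}^{2}}{\mathrm{SD}_\beta^{2}},
\]

where \(\mathrm{SD}_\beta\) is the standard deviation of the estimated script parameters and \(\mathrm{RMSE}\) is the root mean square of their standard errors. The article investigates how adaptively choosing which scripts to pair affects this coefficient.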
4. Some implications of choice of tiering model in GCSE mathematics for inferences about what students know and can do
- Author
-
Tom Bramley
- Subjects
Rasch model, Secondary education, General Mathematics, Inference, Ambiguity, Certificate, Education, Mathematics education, Reliability (statistics), Mathematics
- Abstract
This study compared models of assessment structure for achieving differentiation across the range of examinee attainment in the General Certificate of Secondary Education (GCSE) examination taken by 16-year-olds in England. The focus was on the “adjacent levels” model, where papers are targeted at three specific non-overlapping ranges of grades. Examinees enter for a pair of papers at adjacent levels and receive the highest grade achieved. There is no aggregation of marks across papers. This study used simulation, based on data from a GCSE Mathematics exam, to compare the adjacent levels model with two other tiering models in terms of: (1) suitability of grade boundary locations; (2) score distributions; and (3) reliability. The adjacent levels model led to lower reliability but arguably improved two aspects of validity: the strength of the inference about what examinees with a given grade would know and be able to do; and the removal of the ambiguity about overlapping grades inherent in the curre...
- Published
- 2017
- Full Text
- View/download PDF
5. Handbook of test development – review of Section 2, edited by Suzanne Lane, Mark R. Raymond, and Thomas M. Haladyna
- Author
-
Tom Bramley
- Subjects
Sight, Heading (navigation), Philosophy, Section (typography), Art history, Education, Test (assessment)
- Abstract
The five chapters that make up the second section of the Handbook are grouped under the heading ‘Content’. At first sight, the fifth of these (Chapter 11) about ‘Setting performance standards on te...
- Published
- 2017
- Full Text
- View/download PDF
6. Spoilt for choice? Issues around the use and comparability of optional exam questions
- Author
-
Victoria Crisp and Tom Bramley
- Subjects
Exploit, Level data, Comparability, Data science, Education, Item response theory, Selection (linguistics), Social psychology, Mathematics
- Abstract
For many years, question choice has been used in some UK public examinations, with students free to choose which questions they answer from a selection (within certain parameters). There has been little published research on choice of exam questions in recent years in the UK. In this article we distinguish different scenarios in which choice arises, explore the arguments for and against using optional questions, and exploit the item level data that has recently become available from on-screen marking of examinations to exemplify methods for investigating the (statistical) comparability of optional questions. We conclude that unless there is a very good reason for allowing question choice it should be avoided.
- Published
- 2017
- Full Text
- View/download PDF
7. Rasch Measurement in the Social Sciences and Quality of Life Research
- Author
-
Tom Bramley
- Subjects
Rasch model, Book Reviews, Interpretation (philosophy), Item bank, Educational research, Educational assessment, Computerized adaptive testing, Social science, Form of the Good, Psychology, Research question, General Psychology
- Abstract
Rasch measurement is an approach to measuring attributes such as ‘mathematics ability’ or ‘quality of life’ – that is, the kind of attributes encountered in the non-physical sciences. It began in the 1960s in the field of educational assessment, where it is most widely known, but has spread in recent decades to other areas of social science. Within educational assessment, its main applications are in item banking, test equating and computer adaptive testing (CAT), where the focus is on estimating the level of ability/achievement/proficiency of test-takers who have taken different tests. Within social science, its main application is in developing instruments (usually questionnaires with Likert-style response formats) to measure a specific concept of research interest. Rasch Measurement in the Social Sciences and Quality of Life Research is a short book for social science researchers, written with the aim of encouraging them to use the Rasch model in their own work by providing a brief introduction to the Rasch measurement perspective, discussing some of the criticisms of the use of the Rasch model, and illustrating how various kinds of research question can be approached within the Rasch framework. The preface to the book warns readers that the aim of the book is “not to describe the models exhaustively nor to provide a step by step guide to their use”. The brevity of the book means that many of the issues it raises are not dealt with as thoroughly as they should be for it to be useful to practitioners who already know the basics of Rasch measurement. For readers without much of a quantitative background the text may well be too technical in places – terms like ‘sufficient statistics’, ‘principal components analysis’, ‘threshold estimates’ and ‘eigenvalue’ are used with little or no explanation. On the other hand, quantitative researchers familiar with statistical methods who know nothing of Rasch or IRT may want to see more discussion of how Rasch-based methods would compare and contrast with other methods they were familiar with for approaching the same kinds of research question. Chapter 1 at 22 pages is the longest in the book, and is an introduction to the conceptual and practical issues of Rasch measurement. It draws quite heavily on other introductory textbooks. There is a good discussion of invariance and unidimensionality. Other features of Rasch models are also described, sometimes in a way which assumes a certain level of familiarity on the part of the reader. It is not emphasised clearly enough that some attractive properties of the Rasch model only apply when the data fits the model. More discussion would have been helpful of what the implications of misfit are for interpretation of results, and what the implications of addressing misfit (e.g. by removing items) are for understanding the concept being measured. There are also several pitfalls for the unwary in the form of occasional mistakes, inaccuracies and oversimplifications. Chapter 2 is drawn from an article published by the author and colleagues in the British Journal of Educational Research, discussing some of the criticisms of the Rasch model and giving counter-arguments, although the brevity of the chapter means that many of the details and subtleties of the arguments for both sides cannot be explored in depth. Chapters 3 to 6 are based on other work of the author and colleagues, some of it published in this journal (i.e. Europe’s Journal of Psychology). 
They illustrate the way that different types of analysis within the Rasch framework can be used to address various problems of research interest, such as developing and refining a questionnaire, investigating unidimensionality, and investigating person misfit to the model. All the examples are interesting, but many will raise further questions in the mind of the reader which they will need to go elsewhere to answer more fully. Chapter 7 shows how the number of citations of Rasch-based articles has increased dramatically over the decades, and concludes with a summary of the basic principles and advantages of Rasch models. Before the list of references the author gives some book suggestions, with the good advice for novice and experienced researchers to start with Applying the Rasch Model by Bond & Fox (2007). I would agree, and also think that enough good material is freely available online about the topics covered by the book reviewed here to make it a non-essential purchase. This book will be of most interest to those who are looking for examples and guidance relating to the analytical possibilities available within the Rasch framework when developing and validating instruments for assessing concepts of the kind found in social science research.
- Published
- 2015
- Full Text
- View/download PDF
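For readers meeting the model for the first time, the dichotomous Rasch model that the review assumes familiarity with (standard notation, not reproduced from the book) gives the probability that person \(v\) answers item \(i\) correctly as

\[
P(X_{vi}=1 \mid \theta_v,\delta_i) = \frac{\exp(\theta_v - \delta_i)}{1 + \exp(\theta_v - \delta_i)},
\]

where \(\theta_v\) is the person's location on the latent trait and \(\delta_i\) is the item's difficulty. The 'sufficient statistics' the review refers to are the raw total scores: under this model a person's total score carries all the information in the responses about \(\theta_v\), which is what underlies the invariance property discussed in Chapter 1.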
8. How accurate are examiners’ holistic judgements of script quality?
- Author
-
Tom Bramley and Tim Gill
- Subjects
Judgement, Expert judgement, Psychology, Grading (education), Research setting, Social psychology, Education, Cognitive psychology
- Abstract
This research investigated the accuracy (agreement with the original marking and grading) of examiners’ holistic judgements of the quality of examination scripts that were close together in overall mark. For a History and a Physics exam, examiners considered pairs of scripts (with marks removed) and made three types of judgement: (1) Absolute – which grade each script was worth; (2) Relative – which of the pair was better in terms of overall quality; (3) Confidence – how confident they were about judgements (1) and (2). In both subjects, relative judgements were more accurate than absolute judgements, and judgements rated as ‘very confident’ were more accurate than other judgements. In Physics, the further apart the two scripts in terms of overall mark the greater was the likelihood of a correct relative judgement, but in History this expected pattern was not found. Despite differences between the research setting and the use of expert judgement in grading the live examinations, these results suggest that...
- Published
- 2013
- Full Text
- View/download PDF
9. Problems in estimating composite reliability of ‘unitised’ assessments
- Author
-
Tom Bramley and Vikas Dhawan
- Subjects
Secondary education, Computer science, Modular design, Test theory, Education, Weighting, Reliability engineering, Educational assessment, Item response theory, Unit level, Reliability (statistics)
- Abstract
This paper discusses the issues involved in calculating indices of composite reliability for ‘modular’ or ‘unitised’ assessments of the kind used in GCSEs, AS and A level examinations in England. The increasingly widespread use of on-screen marking has meant that the item-level data required for calculating indices of reliability is now routinely available for most (but not all) units of unitised assessments. Whilst it is relatively straightforward to obtain indices of reliability at unit level, it is far more complex to obtain indices at overall assessment level because of problems created by: (i) the number of different possible ‘routes’ to the final assessment; (ii) the different knowledge, skills and understanding assessed in different units; (iii) the wide variety in item type and size within and across units; (iv) the fact that the item-level data required for calculating reliability indices is not available (or does not exist) for certain units and (v) the different intended weighting of different ...
- Published
- 2013
- Full Text
- View/download PDF
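For orientation on what a composite index of this kind looks like: when item-level data are available for every unit and measurement errors can be treated as independent across units, the reliability of a weighted total \(Y=\sum_k w_k X_k\) can be written in terms of the unit reliabilities \(\rho_k\) and unit score variances \(\sigma_k^{2}\) as

\[
\rho_Y = 1 - \frac{\sum_k w_k^{2}\,\sigma_k^{2}\,(1-\rho_k)}{\sigma_Y^{2}}.
\]

This is a textbook formula given as background rather than the specific index used in the paper; the obstacles listed in the abstract (e.g. multiple routes to the final assessment, units without item-level data, differing intended weightings) are exactly the conditions under which the quantities in this formula are hard to obtain.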
10. 'Measurement' and 'Construct' Need to Be Clarified First. Commentary on Newton, P. E. 'Clarifying the Consensus Definition of Validity'
- Author
-
Tom Bramley
- Subjects
Statistics and Probability, Applied Mathematics, Psychology, Construct (philosophy), Social psychology, Education, Epistemology
- Published
- 2012
- Full Text
- View/download PDF
11. The interrelations of features of questions, mark schemes and examinee responses and their impact upon marker agreement
- Author
-
Beth Black, Tom Bramley, and Irenka Suto
- Subjects
Inter-rater reliability, Logical analysis, Computer science, Data mining, Artificial intelligence, Grading (education), Natural language processing, Education
- Abstract
In this paper we develop an evidence-based framework for considering many of the factors affecting marker agreement in GCSEs and A levels. A logical analysis of the demands of the marking task sugg...
- Published
- 2011
- Full Text
- View/download PDF
12. Evaluating the rank‐ordering method for standard maintaining
- Author
-
Tom Bramley and Tim Gill
- Subjects
Rasch model, Rank (computer programming), Item bank, Test validity, Education, Test (assessment), Test score, Statistics, Equating, Thurstone scale, Artificial intelligence, Psychology, Natural language processing
- Abstract
The rank‐ordering method for standard maintaining was designed for the purpose of mapping a known cut‐score (e.g. a grade boundary mark) on one test to an equivalent point on the test score scale of another test, using holistic expert judgements about the quality of exemplars of examinees’ work (scripts). It is a novel application of an old technique (Thurstone’s paired comparison method for scaling psychological stimuli), and one that can be applied when the more familiar methods of statistical equating or item banking are not possible. How should a method like this be evaluated? If the correct mapping were known, then the outcome of a rank‐ordering exercise could be compared against that. However, in the contexts for which the method was designed, there is no ‘right answer’. This paper presents an evaluation of the rank‐ordering method in terms of its rationale, its psychological validity and the stability of the outcome when various factors incidental to the method are varied (e.g. the number of judges...
- Published
- 2010
- Full Text
- View/download PDF
13. A response to an article published in Educational Research's Special Issue on Assessment (June 2009). What can be inferred about classification accuracy from classification consistency?
- Author
-
Tom Bramley
- Subjects
Psychometrics, Computer science, National curriculum, Education, Test (assessment), Educational research, Argument, Consistency (statistics), Item response theory, Statistics, Artificial intelligence, Reliability (statistics), Natural language processing
- Abstract
Background: A recent article published in Educational Research on the reliability of results in National Curriculum testing in England (Newton, The reliability of results from national curriculum testing in England, Educational Research 51, no. 2: 181–212, 2009) suggested that: (1) classification accuracy can be calculated from classification consistency; and (2) classification accuracy on a single test administration is higher than classification consistency across two tests. Purpose: This article shows that it is not possible to calculate classification accuracy from classification consistency. It then shows that, given reasonable assumptions about the distribution of measurement error, the expected classification accuracy on a single test administration is higher than the expected classification consistency across two tests only in the case of a pass–fail test, but not necessarily for tests that classify test-takers into more than two categories. Main argument and conclusion: Classification accuracy is...
- Published
- 2010
- Full Text
- View/download PDF
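The article's two claims can be made concrete with a short simulation. The sketch below is a minimal illustration under textbook assumptions (normally distributed true scores and measurement error, a single pass-fail cut-score); it is not the article's own analysis, and the variable names are invented for the example.

import numpy as np

rng = np.random.default_rng(1)

n   = 200_000   # simulated test-takers
cut = 0.0       # pass-fail cut-score on the true-score scale
sem = 0.5       # standard error of measurement (illustrative value)

true = rng.normal(0.0, 1.0, n)           # true scores
obs1 = true + rng.normal(0.0, sem, n)    # observed scores, first administration
obs2 = true + rng.normal(0.0, sem, n)    # observed scores, independent retest

accuracy    = np.mean((obs1 >= cut) == (true >= cut))  # agreement with the true category
consistency = np.mean((obs1 >= cut) == (obs2 >= cut))  # agreement between two administrations

print(f"classification accuracy:    {accuracy:.3f}")
print(f"classification consistency: {consistency:.3f}")

With these settings, accuracy comes out around 0.85 and consistency around 0.80, in line with the article's claim for the pass-fail case; the article's point is that this ordering need not hold for tests that classify examinees into more than two categories, and that neither figure can be deduced from the other without a model for the measurement error.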
14. Investigating a judgemental rank‐ordering method for maintaining standards in UK examinations
- Author
-
Beth Black and Tom Bramley
- Subjects
Inter-rater reliability, Operations research, Equating, Rank (computer programming), Raw score, Context (language use), Test theory, Psychology, Academic standards, Education, Test (assessment)
- Abstract
A new judgemental method of equating raw scores on two tests, based on rank‐ordering scripts from both tests, has been developed by Bramley. The rank‐ordering method has potential application as a judgemental standard‐maintaining mechanism, because given a mark on one test (e.g. the A grade boundary mark), the equivalent mark (i.e. at the same judgemental standard) on the other test can be determined. If the two tests come from different years then the standard from the earlier year can be applied to the later year. The current standard maintaining method used by Awarding Bodies in England, Wales and Northern Ireland, is the ‘awarding meeting’. Here expert judgement takes place within the context of a variety of statistical information, including score distributions and hence likely pass rates. The rank‐ordering method, in contrast, involves harnessing expert judgement independently of any statistical information. The aim of this study was to investigate the extent to which the outcome of an awarding meet...
- Published
- 2008
- Full Text
- View/download PDF
15. Accessibility, easiness and standards
- Author
-
Tom Bramley
- Subjects
Rasch model, Standardization, Process (engineering), Item response theory, Equating, Mathematics education, National curriculum, Construct (philosophy), Psychology, Data science, Education, Test (assessment)
- Abstract
In setting the cut-scores on National Curriculum tests it is important to maintain standards. In the process of test development, both within and across years, changes are made to the style of the questions in order to increase their ‘accessibility’. This raises the question of whether a more accessible test should have higher cut-scores. Purely statistical definitions of equating are blind to differences between ‘accessibility’ and ‘easiness’ and cut-scores derived from statistical equating methods will be higher for a more accessible test. Arguments about the increased validity of the more accessible test are sometimes used to justify not raising the cut-scores as much as would be indicated by statistical methods. These arguments are shown to be equivalent to postulating that changing the accessibility is changing the construct measured by the test. Using a statistical measurement model can provide a rational basis for understanding accessibility and identifying types of question where accessibility iss...
- Published
- 2005
- Full Text
- View/download PDF
16. The Question Tariff Problem in GCSE Mathematics
- Author
-
Tom Bramley
- Subjects
Single process, Repetition (rhetorical device), Item bank, Tariff, Behavioral disciplines and activities, Education, Test (assessment), Statistics, Raw score, Quality (business), Reliability (statistics), Mathematics
- Abstract
In GCSE mathematics there are many question parts (items) worth 2 or 3 marks. The mark distribution in some of these items is U-shaped - for example in a 2-mark item where the majority of candidates score either 2 or 0. Such items are effectively single-mark items being given double weight. This contributes to the spread of raw scores and therefore makes the test appear to be more discriminating than perhaps it is. The source of the problem is in the design of the mark scheme, which has to allocate the marks to fit a tight specification of content coverage and ability targeting. After identifying such items, analysis of their mark schemes showed that the most common causes are awarding an intermediate mark for correct working, and awarding marks for repetition of a single process. The items were re-scored to remove under-used middle categories. Several comparisons with the original scores, including a simulation of test scores, showed that the amount of measurement error was reduced and hence the quality ...
- Published
- 2001
- Full Text
- View/download PDF
17. A rank-ordering method for equating tests by expert judgment
- Author
-
Tom Bramley
- Subjects
Male, Adolescent, England, Reading, Humans, Female, Curriculum
- Abstract
This paper describes a new method of comparing the raw mark scales on two tests using expert judgment. The two tests do not need to have any common items, nor to be taken by common groups of candidates. This study used scripts (i.e., the complete work of a candidate on the test) from England's National Curriculum Test for Reading at Key Stage 3 (14-year olds) in 2003 and 2004. Each member of a panel of 12 experts was given four packs each containing ten scripts--five scripts from each year's test. Marks and annotations from these scripts had been removed. Their task was to put the ten scripts into a single rank order, based on a holistic judgment of the level of performance exhibited in each. Because the design of the study linked scripts across judges and packs it was possible to construct a single latent trait of judged quality of performance. This was done using two different analytical methods: the Rasch formulation of Thurstone paired comparisons, and the Rasch Partial Credit model. Relating the two raw mark scales to the single latent scale allowed the two years' tests to be equated. The merits of using this standard-maintaining method as opposed to a standard-setting method in this particular context are discussed.
- Published
- 2005
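The 'Rasch formulation of Thurstone paired comparisons' named in the abstract treats each ordering of two scripts as a binary outcome. In its usual form (standard notation, not copied from the paper), the probability that script A is judged better than script B is

\[
P(A \succ B) = \frac{\exp(\beta_A - \beta_B)}{1 + \exp(\beta_A - \beta_B)},
\]

where \(\beta_A\) and \(\beta_B\) are the scripts' locations on a latent 'judged quality' scale. Each rank order of ten scripts implies a set of pairwise outcomes (45 per pack if all implied pairs are used), and fitting the model across judges and packs places the 2003 and 2004 scripts on one common scale; relating each year's raw mark scale to that common scale is what allows a cut-score on one test to be mapped to the other.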
18. Erratum
- Author
-
Tom Bramley
- Subjects
Education - Published
- 2013
- Full Text
- View/download PDF