Adaptive comparative judgement

Adaptive Comparative Judgement is a technique borrowed from psychophysics that can generate reliable results for educational assessment; as such it is an alternative to traditional exam script marking. Judges are presented with pairs of student work and asked to choose which of the two is better. By means of an iterative and adaptive algorithm, a scaled distribution of student work can then be obtained without reference to criteria.

Introduction

Traditional exam script marking began in Cambridge in 1792 when, with undergraduate numbers rising, the proper ranking of students was growing in importance. In that year the new Proctor of Examinations, William Farish, introduced marking, a process in which every examiner gives a numerical score to each response by every student, and the overall total mark puts the students in the final rank order. Francis Galton (1869) noted that, in an unidentified year about 1863, the Senior Wrangler scored 7,634 out of a maximum of 17,000, while the Second Wrangler scored 4,123. (The ‘Wooden Spoon’ scored only 237.)

Prior to 1792, a team of Cambridge examiners convened at 5pm on the last day of examining, reviewed the 19 papers each student had sat, and published their rank order at midnight. Marking solved the problem of numbers and prevented unfair personal bias, and its introduction was a step towards modern objective testing, the format it is best suited to. But the technology of testing that followed, with its major emphasis on reliability and the automatisation of marking, has been an uncomfortable partner for some areas of educational achievement: assessing writing, speaking and other kinds of performance needs something more qualitative and judgemental.

The technique of Adaptive Comparative Judgement is an alternative to marking. It returns to the pre-1792 idea of sorting papers according to their quality, but retains the guarantee of reliability and fairness. It is by far the most reliable way known to score essays or more complex performances. It is much simpler than marking, and has been preferred by almost all examiners who have tried it. The real appeal of Adaptive Comparative Judgement lies in how it can re-professionalise the activity of assessment and how it can re-integrate assessment with learning.

Thurstone’s Law of Comparative Judgement

“There is no such thing as absolute judgement” (Laming, 2004)

The science of comparative judgement began with Louis Leon Thurstone of the University of Chicago. A pioneer of psychophysics, he proposed several ways to construct scales for measuring sensation and other psychological properties. One of these was the Law of Comparative Judgment (Thurstone, 1927a, 1927b), which defined a mathematical way of modelling the chance that one object will ‘beat’ another in a comparison, given values for the ‘quality’ of each. This is all that is needed to construct a complete measurement system.

A variation on his model (see Pairwise comparison and the BTL model) states that the difference between the quality values of two objects, v_A and v_B, is equal to the log of the odds that object A will beat object B:

$$ \ln \frac{P(A \text{ beats } B)}{1 - P(A \text{ beats } B)} = v_A - v_B $$
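
Rearranging the log-odds relation gives the probability that A beats B as 1 / (1 + exp(-(v_A - v_B))). The short Python sketch below illustrates this; the function names `win_probability` and `log_odds` are illustrative choices, not part of any published implementation.

```python
import math


def win_probability(v_a: float, v_b: float) -> float:
    """Probability that object A beats object B, given their quality values.

    Uses the logistic form implied by: log-odds(A beats B) = v_a - v_b.
    """
    return 1.0 / (1.0 + math.exp(-(v_a - v_b)))


def log_odds(p: float) -> float:
    """Log of the odds p / (1 - p) for a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Two scripts whose quality values differ by one unit.
p = win_probability(1.5, 0.5)
print(round(p, 3))            # 0.731
print(round(log_odds(p), 3))  # 1.0, i.e. the difference in quality values
```

Only the difference between the two values matters, so the origin of the scale is arbitrary.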

Before the availability of modern computers, the mathematics needed to calculate the ‘values’ of each object’s quality meant that the method could only be used with small sets of objects, and its application was limited. For Thurstone, the objects were generally sensations, such as intensity, or attitudes, such as the seriousness of crimes, or statements of opinions. Social researchers continued to use the method, as did market researchers for whom the objects might be different hotel room layouts, or variations on a proposed new biscuit.
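
To give a sense of the calculation involved, the sketch below estimates quality values from a list of pairwise outcomes under the logistic (BTL) form of the model. It uses a standard iterative fixed-point update for that model; this is an assumption made for illustration, not necessarily the procedure used by Thurstone or by any system described in this article. The function name `estimate_qualities` and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def estimate_qualities(comparisons, n_items, iterations=200):
    """Estimate quality values from (winner, loser) index pairs.

    Returns log-strengths centred on zero. Every item needs at least one
    win and one loss, otherwise its estimate would be infinite.
    """
    wins = [0] * n_items
    losses = [0] * n_items
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        losses[loser] += 1
        pair_counts[(min(winner, loser), max(winner, loser))] += 1
    if any(w == 0 or l == 0 for w, l in zip(wins, losses)):
        raise ValueError("each item needs at least one win and one loss")

    strength = [1.0] * n_items
    for _ in range(iterations):
        denom = [0.0] * n_items
        for (i, j), n in pair_counts.items():
            share = n / (strength[i] + strength[j])
            denom[i] += share
            denom[j] += share
        strength = [wins[i] / denom[i] for i in range(n_items)]
        # Fix the arbitrary origin: rescale so the mean log-strength is zero.
        mean_log = sum(math.log(s) for s in strength) / n_items
        strength = [s / math.exp(mean_log) for s in strength]
    return [math.log(s) for s in strength]


# Toy data: script 0 usually beats 1 and 2; script 1 usually beats 2.
data = ([(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
        + [(0, 2)] * 3 + [(2, 0)])
estimates = estimate_qualities(data, n_items=3)
print(sorted(range(3), key=lambda i: -estimates[i]))  # [0, 1, 2]
```

Even for this toy case the update has to be repeated many times over every pair, which suggests why hand calculation was practical only for small sets of objects.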

In the 1970s and 1980s Comparative Judgement appeared, almost for the first time in educational assessment, as a theoretical basis or precursor for the new Latent Trait or Item Response Theories (Andrich, 1978). These models are now standard, especially in item banking and adaptive testing systems.

Re-introduction in education

The first published paper using Comparative Judgement in education was Pollitt & Murray (1994), essentially a research paper concerning the nature of the English proficiency scale assessed in the speaking part of Cambridge’s CPE exam. The objects were candidates, represented by 2-minute snippets of video recordings from their test sessions, and the judges were Linguistics post-graduate students with no assessment training. The judges compared pairs of video snippets, simply reporting which they thought the better student, and were then clinically interviewed to elicit the reasons for their decisions.

Pollitt then introduced Comparative Judgement to the UK awarding bodies, as a method for comparing the standards of A Levels from different boards. Comparative Judgement replaced their existing method, which required direct judgement of a script against the official standard of a different board. For the first two or three years of this, Pollitt carried out all of the analyses for all the boards, using a program he had written for the purpose. It immediately became the only experimental method used to investigate exam comparability in the UK; the applications for this purpose from 1996 to 2006 are fully described in Bramley (2007).

In 2004 Pollitt presented a paper at the conference of the International Association for Educational Assessment titled ‘Let’s Stop Marking Exams’, and another at the same conference in 2009 titled ‘Abolishing Marksism’. In each paper the aim was to convince the assessment community that there were significant advantages to using Comparative Judgement in place of marking for some types of assessment. In 2010 he presented a paper at the Association for Educational Assessment – Europe, ‘How to Assess Writing Reliably and Validly’, which presented evidence of the extraordinarily high reliability that has been achieved with Comparative Judgement in assessing primary school pupils’ skill in first language English writing.

Comparative Judgement becomes a viable alternative to marking when it is implemented as an adaptive web-based assessment system. In this, the 'scores' (the model parameter for each object) are re-estimated after each 'round' of judgements in which, on average, each object has been judged one more time. In the next round, each script is compared only to another whose current estimated score is similar, which increases the amount of statistical information contained in each judgement. As a result, the estimation procedure is more efficient than random pairing, or any other pre-determined pairing system like those used in classical comparative judgement applications.
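
A minimal sketch of one adaptive round follows. The text says only that each script is paired with another of similar current score, so the specific rule used here (sort by score and pair neighbours) is an assumption, as are the names `pair_by_similar_score` and `judgement_information`. The second function shows why close pairings are more informative: under the logistic model the statistical information in one judgement is p(1 - p), which peaks when the two scores are equal.

```python
import math


def pair_by_similar_score(scores):
    """Pair scripts that are adjacent in the current score order.

    scores: dict mapping script id -> current estimated score.
    With an odd number of scripts, the highest-scoring one sits out.
    """
    ordered = sorted(scores, key=scores.get)
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


def judgement_information(v_a, v_b):
    """Information carried by one A-versus-B judgement under the logistic model."""
    p = 1.0 / (1.0 + math.exp(-(v_a - v_b)))
    return p * (1.0 - p)


current = {"a": 0.1, "b": 2.0, "c": 0.0, "d": 1.8}
print(pair_by_similar_score(current))   # [('c', 'a'), ('d', 'b')]
print(judgement_information(0.0, 0.0))  # 0.25, the maximum possible
print(judgement_information(0.0, 3.0) < judgement_information(0.0, 0.5))  # True
```

A practical system would also need to avoid repeating the same pair and to balance the load across judges; those details are omitted here.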

As with computer-adaptive testing, this adaptivity maximises the efficiency of the estimation procedure, increasing the separation of the scores and reducing the standard errors. The most obvious advantage is that this produces significantly enhanced reliability, compared to assessment by marking, with no loss of validity.
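
The link between score separation, standard errors and reliability can be illustrated with a separation-based reliability coefficient of the kind often reported for Rasch-type models: the share of observed score variance that is not attributable to estimation error. The article does not state which coefficient was used, so this formula and the name `separation_reliability` are assumptions for illustration only.

```python
import statistics


def separation_reliability(estimates, standard_errors):
    """(observed variance - mean squared standard error) / observed variance."""
    observed_variance = statistics.variance(estimates)  # sample variance
    error_variance = statistics.fmean(se ** 2 for se in standard_errors)
    return (observed_variance - error_variance) / observed_variance


# Hypothetical scores for five scripts, each with a standard error of 0.5.
scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
errors = [0.5] * 5
print(round(separation_reliability(scores, errors), 3))  # 0.9
```

Spreading the scores further apart or shrinking the standard errors both push the coefficient towards 1.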

e-scape

The first application of Comparative Judgement to the direct assessment of students was in a project called e-scape, led by Prof. Richard Kimbell of London University’s Goldsmiths College (Kimbell & Pollitt, 2008). The development work was carried out in collaboration with a number of awarding bodies in a Design & Technology course. Kimbell’s team developed a sophisticated and authentic project in which students were required to develop, as far as a prototype, an object such as a children’s pill dispenser in two three-hour supervised sessions.

The web-based judgement system was designed by Karim Derrick and Declan Lynch from TAG Developments, a part of BLi Education, and based on the MAPS assessment portfolio system. Goldsmiths, TAG Developments and Pollitt ran three trials, increasing the sample size from 20 to 249 students, and developing both the judging system and the assessment system. There are three pilots, involving Geography and Science as well as the original in Design & Technology.

Primary school writing

In late 2009 TAG Developments and Pollitt trialled a new version of the system for assessing writing. A total of 1000 primary school scripts were evaluated by a team of 54 judges in a simulated national assessment context. The reliability of the resulting scores after each script had been judged 16 times was 0.96, considerably higher than in any other reported study of similar writing assessment. Further development of the system has shown that reliability of 0.93 can be reached after about 9 judgements of each script, when the system is no more expensive than single marking but still much more reliable.

Several projects are underway at present, in England, Scotland, Ireland, Israel, Singapore and Australia. They range in context from primary school to university, cover subjects from writing to Mathematics, and include both formative and summative assessment. The basic web system is now available on a commercial basis from TAG Developments (http://www.tagdevelopments.com), and can be modified to suit specific needs.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.