Making sense of scores

A guide to Classical Test Theory and Item Response Theory

Statisticians Katherine Shaw and Rhiannan Davies explain the two key psychometric theories which underpin millions of assessments around the world.

For those working in assessment, curriculum design, or qualification development, understanding what’s behind a score is essential. It affects how we judge the quality of a test, how we interpret results, and how we make fair and reliable decisions for learners.

This article provides a clear and practical introduction, for anyone without a technical background, to two key approaches used worldwide: Classical Test Theory (CTT) and Item Response Theory (IRT).

Why do we have exams?

We cannot directly observe how proficient someone is in a given subject, so we use exams to try to measure it. However, an exam score does not always exactly reflect the level of what is being measured: the outcome may differ from the learner’s true score.

Things like a bad night’s sleep, feeling unwell, or being distracted can affect how someone performs on a test. These kinds of unpredictable influences are called random error.

Classical Test Theory

Classical Test Theory (CTT) is a psychometric approach which states that an exam score can be represented as the true score plus random error, as in the equation below:

X = T + E

X = total test score
T = true score
E = random error
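
To make the equation concrete, here is a minimal Python sketch that simulates one learner sitting the same test many times. The true score of 60, the error spread of 3 marks, the normally distributed error and the function name are all illustrative assumptions, not something CTT itself specifies.

```python
import random

def simulate_observed_scores(true_score, error_sd, sittings, seed=0):
    """Return simulated observed scores X = T + E for repeated sittings."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # E is drawn from a normal distribution centred on zero (an assumption).
    return [true_score + rng.gauss(0, error_sd) for _ in range(sittings)]

scores = simulate_observed_scores(true_score=60, error_sd=3, sittings=10_000)
mean_score = sum(scores) / len(scores)
print(f"lowest {min(scores):.1f}, highest {max(scores):.1f}, mean {mean_score:.1f}")
```

Any single sitting can land a few marks above or below 60, but because the simulated error averages out to roughly zero, the mean over many sittings sits close to the true score.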

In this approach to testing, a person’s overall test score is used as an estimate of their proficiency. A single question (called an “item”) is judged by comparing how well someone does on that question to how well they did on the whole test.

This allows an exam provider to assess the quality of the test overall and the quality of each item within the test. There are several ways in which CTT can be used to judge the quality of a test and its items.

Test and Item Statistics

Reliability

Internal reliability of a test is a measure of how consistently different items measure the same concept. In CTT, one of the most common ways to measure internal reliability is Cronbach’s alpha. This measures whether the more able candidates did well on all the items across the whole test. It is based on the average inter-item correlation, which reflects how similarly pairs of items place the candidates in rank order, together with the number of items in the test.

Rank order: the candidates are ranked based on how well they did on the item (so whoever got the highest mark is rank 1, second highest, rank 2, etc).

Cronbach’s alpha tells us how well the rank order on item 1 matches the rank order on item 2, on item 3, on item 4, and so on. The comparison is repeated for every possible pair of items and the mean is calculated. Table 1 shows the range and interpretation of the result.

For example, an alpha value of 0.8 on a hypothetical test would tell us that the same candidates performed better on most questions on the exam, meaning the exam has good internal reliability.
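
As a rough illustration, Cronbach’s alpha can be computed from a table of item marks using the standard variance form of the formula: alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The five candidates and four right/wrong items below are made-up data, and the function name is hypothetical.

```python
from statistics import variance

def cronbach_alpha(scores):
    """scores: one row per candidate, one mark per item in each row."""
    k = len(scores[0])                      # number of items
    items = list(zip(*scores))              # regroup the marks by item
    item_variance_sum = sum(variance(item) for item in items)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

# Made-up data: 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # 0.8
```

In this made-up data the stronger candidates get every item that the weaker candidates get, which is the kind of consistent pattern that produces a high alpha.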

However, the length of an exam affects alpha. Longer exams tend to produce higher alpha values so, while important, Cronbach’s alpha should be considered in conjunction with other statistics.
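
The effect of test length can be seen with the standardised form of alpha, which depends only on the number of items k and the average inter-item correlation r: alpha = k × r / (1 + (k - 1) × r). The sketch below holds the average correlation at an assumed value of 0.2 and changes only the number of items; the function name is hypothetical.

```python
def standardised_alpha(n_items, mean_inter_item_r):
    """Standardised alpha from test length and average inter-item correlation."""
    return n_items * mean_inter_item_r / (1 + (n_items - 1) * mean_inter_item_r)

# Same assumed average correlation (0.2), increasing test length.
for n_items in (10, 20, 40):
    print(n_items, round(standardised_alpha(n_items, 0.2), 2))
# 10 0.71
# 20 0.83
# 40 0.91
```

The items are no more consistent with each other in the longer tests, yet alpha still rises, which is why it should not be read in isolation.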

Table 1: Cronbach’s Alpha values

Facility Value

The facility value of an item is the proportion of candidates that got the question right, used as an indicator of item difficulty.

It is generally expressed as a decimal between 0 and 1 but can also be shown as a percentage. The closer the facility value is to 1, the more people got the question right. A ‘good’ facility value is typically between 0.3 and 0.8, and an exam works best with a spread of items with different facility values to help differentiate between learners.

Items with a facility value below 0.3 are normally seen as too difficult, and items with a facility value above 0.8 are considered too easy and don’t provide much information on a candidate’s ability. However, this is not a strict rule, as it is sometimes necessary to test essential knowledge that every learner should get correct.

Facility values are strongly affected by cohort ability, so they should be used in conjunction with other item statistics, such as discrimination and alpha, to paint a full picture.
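
A minimal sketch of the calculation for right/wrong items is shown here. The data are made up, the function names are hypothetical, and the 0.3 and 0.8 cut-offs are the rule of thumb described above rather than fixed limits.

```python
def facility_values(scores):
    """Proportion of candidates answering each item correctly."""
    n_candidates = len(scores)
    return [sum(item) / n_candidates for item in zip(*scores)]

def describe(facility, low=0.3, high=0.8):
    """Apply the rule-of-thumb range; the cut-offs are not strict."""
    if facility < low:
        return "possibly too difficult"
    if facility > high:
        return "possibly too easy"
    return "within the typical range"

# Made-up data: one row per candidate, 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for number, facility in enumerate(facility_values(scores), start=1):
    print(f"item {number}: {facility:.2f} ({describe(facility)})")
```

Here item 4 has a facility value of 0.2 and would be flagged for review. The same item given to a stronger cohort would produce a higher value.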

Discrimination

Item discrimination describes how well a question distinguishes between stronger and weaker test-takers. It is measured as the correlation between performance on that question and the person’s total score on the rest of the test (excluding that question). Table 2 shows the range that discrimination values should fall in and how to interpret them.

An item with a discrimination below 0.2 does not differentiate well between more and less able learners. Such an item adds little to the exam result and may affect learners' confidence by being confusing or misleading.
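
The calculation described above, the correlation between each item and the total score on the rest of the test, can be sketched as follows. The data are made up, the function names are hypothetical, and the sketch assumes every item has at least one right and one wrong answer (otherwise the correlation is undefined).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def discrimination(scores):
    """Correlate each item with the total on the rest of the test."""
    values = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]  # exclude the item itself
        values.append(pearson(item, rest))
    return values

# Made-up data: one row per candidate, 1 = correct, 0 = incorrect.
scores = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
]
values = discrimination(scores)
weak_items = [number for number, d in enumerate(values, start=1) if d < 0.2]
print("Items with discrimination below 0.2:", weak_items)
```

In this made-up data, item 5 is answered correctly by a mix of stronger and weaker candidates, so its correlation with the rest of the test is close to zero and it is the only item flagged.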

Table 2: discrimination range and interpretation

Item Response Theory

Another approach to testing is Item Response Theory (IRT), which estimates the learner’s underlying ability. The idea is that ability is something real but hidden: we can’t observe it directly, so we use someone’s pattern of answers on a test to make an informed estimate of it.

Example:

You want to judge the spelling ability of two people, Learner A and Learner B, so you give them three words to spell aloud with the following results.

Even though both learners received the same score on the spelling test (2/3 correct), it’s clear who is the better speller based on the difficulty of the words they were asked to spell. This is how IRT works: it considers how challenging each item is, together with a learner’s pattern of answers, to estimate their level of ability.

IRT differs from Classical Test Theory in that it is sample independent: estimates of item difficulty are not tied to how able a particular cohort is, and estimates of ability are not tied to how difficult a particular set of items is. Classical Test Theory would say these candidates are of the same spelling proficiency.
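
The idea can be sketched with the simplest IRT model, the one-parameter (Rasch) model, assuming the two learners were given words of different difficulty. The word difficulties below are invented for illustration, the function names are hypothetical, and the ability estimate is a basic maximum-likelihood estimate found by bisection rather than a production method.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

def estimate_ability(difficulties, n_correct, low=-6.0, high=6.0):
    """Maximum-likelihood ability estimate under the Rasch model.

    The estimate is the ability at which the expected number of correct
    answers equals the observed number. It is only defined when the
    learner got at least one item right and at least one wrong.
    """
    for _ in range(60):
        mid = (low + high) / 2
        expected = sum(p_correct(mid, d) for d in difficulties)
        if expected < n_correct:
            low = mid    # ability must be higher to explain the score
        else:
            high = mid
    return (low + high) / 2

easier_words = [-2.0, -1.0, 0.0]  # invented difficulties
harder_words = [0.0, 1.0, 2.0]    # invented difficulties

ability_on_easier = estimate_ability(easier_words, n_correct=2)
ability_on_harder = estimate_ability(harder_words, n_correct=2)
print(round(ability_on_easier, 2), round(ability_on_harder, 2))
```

Both learners score 2 out of 3, but the learner who achieved that score on the harder words receives the higher ability estimate, whereas a raw score would treat them as equal.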

IRT plotted

Learner ability and the probability of getting an item right can be plotted against each other, and a line of best fit can be drawn. This is an Item Response Function (IRF), as shown in this graph. The curve is steepest in the middle, where its steepness shows how well the item differentiates between people of different abilities, and it flattens at the top and bottom.

This can be used to customise an exam for an individual learner’s ability or to customise an exam with a particular threshold to determine which learners are of an appropriate competence.

This curve is also known as an Item Characteristic Curve (ICC).

IRT features

Item difficulty and person ability are measured on the same scale. This means that, in the simplest IRT models, when a person’s proficiency level matches an item’s difficulty level, the model predicts a 50% probability of a correct response. If an Item Response Function sits closer to the y axis (further to the left on the ability scale), the item is easier.

This graph shows three Item Response Functions, one for each of three items of different difficulty. A person with an ability of -1 will have a 50% chance of correctly answering the item of difficulty -1, about a 26% chance of correctly answering the item of difficulty 0 and about a 15% chance of correctly answering the item of difficulty 1.
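
For readers who want to see where such probabilities come from, here is a minimal sketch of an Item Response Function under the one-parameter (Rasch) model, which is an assumption: the model behind the graph is not stated. The exact percentages depend on the model and scale used, so the values printed here need not match those read from the graph.

```python
import math

def p_correct(ability, difficulty):
    """Rasch model: probability of a correct response."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

ability = -1.0
for difficulty in (-1.0, 0.0, 1.0):
    print(difficulty, round(p_correct(ability, difficulty), 2))
# -1.0 0.5
# 0.0 0.27
# 1.0 0.12
```

The pattern is the same as in the graph: the probability is 50% when ability and difficulty match, and it falls as the item becomes harder relative to the person.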

Requirements for IRT

There are some constraints to the use of IRT:

Only one underlying skill can be measured at a time (e.g., only mathematical competence or only literacy).
- So don’t place different assessments on the same scale, e.g. don’t measure maths and English at once.
Items should be independent of one another: getting Q1 right shouldn’t affect whether you get Q2 right.
- “Continuation marks” in maths, or one item giving away the answer to another, violate this.
In the simplest models there should only be two outcomes from an item: a right answer or a wrong answer.
- This means something like a 25-mark essay poses a problem and would need a more complex model.

Uses of IRT

IRT requires a large item bank and a lot of pre-testing to determine the item difficulties that are then used to judge learner ability. This is unrealistic for exams such as GCSEs, where pre-testing isn’t allowed for security reasons, so CTT is used instead. CTT is useful for understanding how assessments perform and for informing future assessment and item design, and it can provide insights for samples of as few as 20 learners.

Depending on the model, IRT needs a minimum of 100 to 1,000 responses per item to provide useful insights. IRT is useful for standard setting for certification exams because, by taking into account candidate ability and item difficulties, it produces a more reliable pass mark than CTT. It is also used in Computerised Adaptive Testing (CAT), which is currently used in schools in Wales to create personalised assessments and better understand learner performance. These are formative exams taken by learners each academic year to identify areas of focus for learners and teachers. See our Wales case study here or learn more about CAT here.

