Elevator pitch
There is little to no consensus in the academic literature over whether centralised, standardised exams are better for students than teacher assessments. While a growing body of evidence from economics highlights bias in teacher assessments, educationalists and psychologists point to the harm caused by high-stakes exam-related stress and argue that exams and teacher assessments generally agree very closely. This lack of academic consensus is reflected in policy: a wide variety of assessment methods are used across (and even within) countries. Policymakers should be aware of the potential for inequalities in non-blind assessments and consider carefully the consequences of relying on a single method of assessment.
Key findings
Pros
Regular contact allows teachers to form a richer picture of students’ abilities.
Teachers can assess a much broader curriculum than even the best-designed standardised tests.
Regular opportunities for assessment reduce the randomness inherent in sitting a limited number of exams.
Teachers can (implicitly) account for student (dis)advantage in a way not possible for standardised tests.
Cons
There is significant variation in the assessments different teachers make even of identical pieces of work.
There is also (likely) significant variation in the criteria used by teachers to assess their students.
The variation mentioned above means teacher assessments are likely to be biased against some groups of students.
The variation in teacher assessments makes it difficult to compare scores across teachers and schools (and can harm students assessed by particularly strict teachers).
Author's main message
Studies from the US and Europe have revealed evidence of gaps between teacher-assessed grades and those from externally marked exams that are correlated with students’ characteristics, suggestive of potential bias in teacher assessments. Several mechanisms have been explored in the literature. Teachers may use factors other than ability, such as behavior, or may favor types of students who have performed well in previous years, or those who belong to the minority group (e.g. by gender) in a field. In terms of consequences, some studies show that biased teacher assessments in certain subjects, such as maths, can affect pupils’ progress, their choice of academic track, and even their degree choices.
Motivation
Recent years have seen a growing debate over the merits of teacher assessment versus externally marked exams. A leading example of this debate is the role of SATs versus GPA in US college admissions. SATs were vilified as a “wealth test” because of the high correlation between SAT scores and parental income. Many selective colleges went ‘test-optional’ during the Covid-19 pandemic; however, one study suggests that while SATs might be unfair to disadvantaged students, the (current) alternatives are even more damaging for inequality.
The issue is not unique to the US. In the UK there was a recent government consultation over whether the current university admissions system, which relies on teacher-assessed “predicted” grades, should be scrapped in favor of a post-qualification application system. The chief driver of this reform was concern about the fairness and accuracy of predicted grades, 86% of which are inaccurate.
Another argument for standardised, external exams is that they help to keep teachers and schools accountable. If the only measure of a student’s performance is an assessment by their teacher, comparisons across teachers and schools become difficult at best, and impossible if teachers act in their own self-interest. Standardised exams provide a metric by which teachers’ and schools’ performance can be assessed, improving accountability. On the other side of the debate, education researchers and psychologists have raised concerns that high-stakes exams lead to significant test anxiety, harming students’ wellbeing. These two aspects are beyond the scope of this article; instead, it focuses on the existence of bias in teacher assessments, the potential mechanisms behind it, and the consequences of mis-assessment.
Discussion of pros and cons
Proponents of teacher assessment suggest that the regular contact teachers have with their students allows them to form a “richer picture of what students know and can do than tests alone” [1], p. 78. Even the best-designed standardised tests are limited, both in the range of abilities they can assess and in the time available to assess them.
Conversely, perhaps the chief criticism of teacher assessments is their variability, both in the scores teachers assign to students and, particularly, in the criteria they use to reach these scores. Although variability in the scores individual markers assign is still an issue with external tests, blind marking can ensure that markers see only the information they are supposed to use to grade the student’s work, while moderation and double-marking can reduce the variability.
Despite general agreement that teacher assessments and external exams measure different underlying traits, the majority of the literature assumes that teacher-assessed grades and standardised tests attempt to measure the same thing, i.e. that teachers (aim to) use the same criteria as those measured by standardised tests. A move towards standards-based grading (SBG) has helped teachers align more closely with external tests: in SBG, student grades are judged only against specific criteria, or standards, and other judgements, such as those about behavior, are reported separately. However, even under such an assumption, the extent of the variability in teacher assessments is large. Figure 1 shows the results of an experiment from 1912 in which 142 teachers were asked to mark the same paper [2]. The variation in scores is striking and highlights the need for standardisation of marks even when exams are marked blind – something that is only possible with externally set, standardised exams. Another study replicated this experiment with high-school teachers in a single US district [3]. The scores from the 73 teachers who graded the same paper on a 0-100 scale ranged from 50 to 96. This variability is likely to be problematic when grades are particularly high stakes.

Another concern is that when teachers possess additional information about a pupil, they will use this in their assessment. Several authors have compared students’ performance in standardised tests to their teacher-assessed grades, under the assumption that, on average across the population, standardised tests provide a “true” measure of attainment. Their findings suggest that teachers appear to use information other than “pure attainment” (that is, attainment as measured by a standardised test) to assess students, and that the differences between teacher-assessed and external grades vary systematically with student characteristics, such as race and gender. The next paragraphs discuss some of this evidence in more detail; it generally relies on similar methods and on a key assumption: that unbiased teacher assessments (non-blind) and external exams (blind or quasi-blind) are directly comparable and hence should be the same on average.
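To fix ideas, the typical empirical approach can be sketched as a regression of the gap between the two measures on student characteristics. This is a stylized illustration in our own notation; the exact specifications vary across the cited studies:

$$grade^{teacher}_{i} - grade^{exam}_{i} = \alpha + \beta G_{i} + X_{i}'\gamma + \varepsilon_{i}$$

Here $G_i$ indicates whether student $i$ belongs to the group of interest (e.g. female) and $X_i$ collects other characteristics. Under the key assumption above, the gap should be zero on average for unbiased assessments, so a non-zero estimate of $\beta$ indicates that teacher-assessed grades diverge from exam scores systematically with group membership, which the literature interprets as suggestive of bias.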

Many studies focus on differences by gender. Most find that boys and girls with the same standardised test scores are graded differently, with girls being awarded more generous grades by their teachers. Girls were shown to be favored in all subjects in primary schools in the US [5], while this was only true in mathematics in a study of middle schools in France [4]. Another study exploits evidence from high-school matriculation exams in Israel, where students are assessed using both external blind exams and within-school exams, and finds that female students are favored in teacher assessments [6]. Analysing Swedish data, one author also finds a bias against male students when comparing the teacher-assessed “School-Leaving Certificate” to results from standardised tests [7]. Yet another study finds that boys do relatively better in externally blind-marked central exit exams in Norway than in teacher assessments [8].
Although a large part of the literature has focused on gender, there is also evidence that teacher assessments vary with other characteristics. For example, one study finds that teachers grade students differently based on their ethnicity [9]. Similarly, another study reports that teachers in Brazil award lower maths assessments to black students than to their white peers with the same test scores. Two authors run a field experiment in India, finding that teacher assessments vary with the caste of the exam taker, and suggest this is due to statistical discrimination on the part of the teacher [10]. In Italy, one study finds that teachers award immigrant students lower grades than their native peers of similar (tested) ability [11].
The literature on teacher bias generally assumes that teacher-assessed grades follow standards-based grading (SBG), a conceptual approach that allows for both an easily comparable summary attainment grade and the recognition of other factors: grades are based on certain standards of achievement, while other factors, such as effort and behavior, are reported separately. Hence teacher-assessed grades that follow SBG measure the same underlying ability (or achievement) as standardised tests. For example, many of the papers discussed in this article argue that teachers are told to use standards-based grading (see e.g. [6] and [7]), with one study providing a link to “online [materials] to support teachers in ‘aligning their judgments systematically with national standards’” [9], p. 540.
Mechanisms
Many of the studies mentioned above have also explored the possible mechanisms behind their findings. One study assesses whether student behavior has an impact on teacher assessments, finding that teachers inflate the scores of better-behaved students, and that this explains a large part of the gender gap in their results [12]. The authors of another study point to the same mechanism, finding that once non-cognitive skills are accounted for, there is no difference in how teachers grade boys versus girls [5]. The authors of yet another study find that their results are driven by student-teacher interactions during coursework, which favor girls in teacher-assessed grades [8].
In contrast to [5], one study rules out differences in behavior as a driver of differences across genders and asserts that they are instead due to differences in teacher behavior, i.e. to gender-based discrimination [6]. Others also test several mechanisms, finding that a stereotype model fits best (although they do not rule out the other models completely) [9]. Their analysis suggests that groups who had performed well in previous years were favored in teacher assessments: teachers categorise students and create prototypes or exemplars, which they use to make conscious or unconscious judgments about future students of the same group. Yet another study also provides support for the role of teacher stereotypes, finding that teachers’ immigrant-native bias is reduced when they are informed about their own stereotypes [11].
Another possible mechanism is suggested by two studies that find a correlation between the degree of male dominance in STEM fields and the pro-female bias in non-blind oral exams relative to gender-blind written exams, suggesting that examiners favor those among the minority gender in a field [13]. Figure 3 shows the correlation between the share of females in a field and the relative performance of males versus females, by field, in a test common to all fields.

While there is considerable evidence, and consensus, that teacher assessments are biased with respect to certain groups, there is much less evidence (and consensus) on the causes of these biases. Further research is required to better understand these causes and the contexts in which the biases occur. Policymakers can aid this research by ensuring that the data required to perform such analyses are available to researchers. For example, to understand whether bias in teacher assessments is correlated with the characteristics of teachers themselves, researchers need access to data not only on students but also on their teachers, data which is often unavailable.
Consequences of teacher mis-assessment
But do these divergences between teacher assessments and external exams matter for students’ later outcomes? Evidence on this question is limited, but a small number of studies have attempted to answer it by comparing the outcomes of those who were over- or under-assessed by their teachers (again, relative to external exams). One study suggests that pupils might be placed in a lower secondary-school “set” due to underassessment, harming their future outcomes and motivation [9]. Another study finds that underassessed children are less likely to enrol in ambitious high-school tracks, with underassessment also being a key contributor to the migrant-enrolment gap in high school.
One author compares blind and non-blind test scores among high-school pupils in grades 6-11, exploiting quasi-random assignment of pupils to biased teachers [4]. She finds that teachers’ gender biases in French and maths affect pupils’ future progress and their likelihood of choosing certain high-school tracks, but in ways that are not straightforward. Bias against boys in maths does not affect boys’ progress in maths, but it does improve girls’ progress, while bias against boys in French reduces boys’ progress. Bias in favor of girls in maths increases girls’ probability of selecting a scientific track in high school.
One study finds that underassessment can affect students’ future degree choices. The authors exploit a situation in Denmark where high-school students are randomly allocated to an external exam in one subject (as well as being teacher-assessed) [14]. They find that girls do relatively better than boys in the external maths exam compared with teacher assessments, and that assignment to the maths exam reduces the gender gap in maths degree uptake.
Limitations and gaps
Investigating divergences between teacher assessments and external exams requires students to be assessed by both teachers and external exams at similar ages and educational stages, and requires the researcher to observe scores from both sources. It also relies on a key assumption: that external exams provide an accurate measure of student attainment, and that teacher assessments should be close to this measure. A failure of this assumption to hold could invalidate many of these findings, and in most cases it cannot be tested. Another issue is that different commonly used methods are often assumed to be directly comparable when in fact they are not; for example, one study shows how differences in the gender distributions of test scores can lead to biased estimates of bias.
To date, the only study that does not rely on this assumption is a field experiment in India in which the authors randomly assigned child characteristics (including gender and caste) to the cover pages of exam sheets before teachers graded them [10]. This ensured there could be no systematic relationship between the observed characteristics and the quality of the exam script, meaning any effect of the characteristics on test scores must be due to bias. However, the setting of this study, a developing country with quite different cultural and educational norms, makes it unclear whether these results would hold in Western developed countries such as the UK and the US. Investigating teacher bias in developed countries using alternative methods that rely on different assumptions is therefore an important goal for future research.
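To restate in stylized form why this design identifies bias (our notation, not the study’s): randomisation ensures that the assigned characteristic $G_i$ on the cover page is statistically independent of the quality of the underlying script, so any gap in average awarded scores,

$$E[\,score_i \mid G_i = 1\,] - E[\,score_i \mid G_i = 0\,],$$

can only reflect graders’ responses to the label itself rather than differences in the work being graded.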
Summary and policy advice
Studies from the US and Europe have revealed evidence of divergence between teacher assessments and externally marked exams, suggesting that teachers may use criteria other than student ability when grading.
Much of the literature investigating gaps in teacher assessment by student characteristics has focused on gender, finding that girls are typically favored by teachers. A smaller number of studies have focused on other characteristics, such as ethnicity, but there remains a lack of evidence on discrepancies in teacher judgement by socio-economic status.
Why do these discrepancies between teacher judgements and exams exist? Evidence suggests that factors other than pupil ability (such as behavior) affect non-blind teacher assessments, and that certain types of students may be favored, such as those who have performed well in previous years or who are among the minority in a field.
But does this matter? Evidence on the extent to which using teacher judgements versus external exams has consequences for student outcomes is limited. However, a small number of articles have shown that biased teacher assessments in certain subjects, such as maths, can impact pupil progress, their choice of academic track, and even their degree choices. More research in this area is needed.
Policymakers should be aware of the consequences of using teacher assessments versus externally marked exams. Moving away from external examinations, as some countries have discussed doing, may result in grades that do not reflect students’ true ability, and may benefit certain types of student over others, with long-run consequences for inequality. Solutions include giving teachers better training on what assessments should (and should not) take into account, or training teachers in unconscious bias, which has been shown to successfully reduce biases [11]. However, if the aim is to award grades purely on the basis of ability, the evidence suggests that a system of externally set and marked exams is likely to be the best way of achieving this.
Acknowledgments
The authors thank the anonymous referee(s) and the IZA World of Labor editors for many helpful suggestions on earlier drafts.
Competing interests
The IZA World of Labor project is committed to the IZA Guiding Principles of Research Integrity. The authors declare to have observed these principles.
© Oliver Cassagneau-Francis and Gill Wyness