Awarding Gaps as Validity Deficits

The provocation

Oxford’s Teaching & Learning Symposium this year had a keynote speech from Rachel Forsyth, whose new book is all about assessment.

Her talk gave a really supportive analysis of what it means to design and review assessments, and part of the discussion was about some of the principles underlying them. Some of the contemporary concern with generative AI, for example, was set up in the wider context of authenticity.

Validity and Reliability

Another part of this conversation about the principles of assessment dwelt on the ideas of validity and reliability. Broadly, validity is about whether you are testing the right thing. A Shakespeare question in a Physics exam would be low-validity content, for example, and a recall question in a problem-solving exam would be a poor fit in terms of validity as well.

Reliability is about whether the assessment gives the same result when it’s used over time or on different groups. This is a trickier concept in academic assessment, but we perhaps engaged with the idea of reliability during COVID when we became aware that lateral flow tests sometimes gave false positives and false negatives. Reliability in academic assessment is hard because you can’t normally give the same student the same test twice, and because the whole point of education is to help students become better at things - you might even see this as a deliberate act of invalidation!

Nonetheless, reliability is worth taking seriously. It seems reasonable to imagine that comparing grade distributions between years is a valuable activity, and to think that aggregated parameters like average marks might be meaningfully analysed between cohorts.

Awarding Gaps as Evidencing Reliability

I’ve been thinking a lot about awarding gaps in Oxford, and Dr Forsyth’s talk crystallised a key question for me. If, say, the gender gap is persistent, then you might describe the assessment as having very high reliability in discriminating between genders. It’s not good! But it is reliable.
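
To make that measurement point concrete, here is a minimal sketch in Python, using invented numbers purely for illustration (these are not real marks data), of what a persistent gap looks like when you compare mean marks between two groups across cohorts:

```python
# Minimal sketch with invented numbers (not real marks data): a "persistent"
# awarding gap shows up as a stable difference in mean marks between two
# groups across successive cohorts.

cohort_means = {
    # year: (mean mark for group A, mean mark for group B)
    2019: (68.2, 64.9),
    2020: (67.8, 64.5),
    2021: (69.1, 65.7),
    2022: (68.5, 65.3),
}

gaps = {year: a - b for year, (a, b) in cohort_means.items()}

for year, gap in sorted(gaps.items()):
    print(f"{year}: gap = {gap:.1f} marks")

# A roughly constant gap, year after year, is a highly *reliable* signal in
# the measurement sense - the assessment consistently separates the groups -
# but reliability alone says nothing about whether the right thing is being
# measured (validity).
```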

The question this invites is whether awarding gaps are evidence of low validity in our assessments. Are women under-performing because we’re doing subtle versions of Shakespeare in the Physics exam, or recall in the problem-solving paper?

Confounding Factors

I’ve touched on a few of these before, but without hard evidence it’s possible to pin awarding gaps onto several interrelated factors. The way we admit students to Universities, the academic experience of teaching, and the non-academic experience of socialising are all important parts of the puzzle.

At the same time, it seems increasingly clear that the diversification of assessment imposed by COVID had the effect of substantially narrowing awarding gaps, including in Oxford-like Universities. In principle, of course, this COVID effect might be partly attributable to other factors, like the shift to online teaching.

The merits of a Validity Deficit lens

I don’t want to claim that assessment is the whole picture, but focusing on assessment to fix awarding gaps is a pragmatic decision because it is a somewhat-centralised process which affords scope for quality assurance mechanisms. I think assessment is a space where you can pull levers at scale, in a way that teaching or socialising are not. You can develop processes like “the Chair of Examiners must endorse that the exam questions reflect content within the published syllabus”. You can do things in the assessment space, and that has the potential to be really constructive.

I think a “validity deficit” argument is worth exploring because it aligns institutional objectives with academics’ values. How do we fix the gender gap? We make our assessments more rigorous by making sure we’re testing exactly the right thing (“you’re asking about radioactivity here, but that’s sort of like seeing a Shakespeare question in a Physics paper”; “you’re asking about a specific molecule here, but to give the right answer students don’t have to solve a problem”). Centring the idea of rigour is important because it allows academics to do an even better job of what they want to do. No-one is against testing the right thing!

Conclusion

One of the biggest revelations for me about inclusive practice in education is that something which helps one group normally helps everyone. Recorded lectures are great for people with chronic conditions, but they are also useful for people with hangovers. You don’t have a duty to include hungover students in your Equality Act adjustments, of course, but surely it’s a good thing if they can still engage with the material?

A Validity Deficit lens is an interesting idea to play with in this light, because it has the scope to advance the agenda of staff as well as students. EDI work is often framed as allowing disadvantaged groups to access disciplines which remain unchanged (“so we run the same exam, but students with Dyslexia get extra time”), but inclusive practice might be a way to advance our vision for what it means to have a rigorous training in our discipline.

Which seems like a reasonable reading to me! Awarding gaps are telling us that something we thought was good actually has some substantial weaknesses. We should be approaching them as new information about the education we’ve been providing for generations, not as a new requirement which has suddenly appeared.