Standard Setting Non-Existent Exams

I wrote a blog about those #SQAresults, #covid, the Scottish education system, and tried not to panic about what’s going to happen next academic year …

The Scottish Qualifications Authority (SQA) released its exam results this week to huge uproar. This is a fascinating, and horrifying, glimpse into what might await us in higher education and in the rest of the UK. Let’s talk about it. 

The SQA Higher exams are sat in 5th year (15-16 years old) and 6th year (16-17 years old), and are typically the exams that get you into university. Typically, the maximum you can sit in a year is 5. I have 6 Highers, because I sat 5 in 5th Year and 2 in 6th Year (and failed my Higher Psychology – a discussion for another day). There are also Advanced Highers, which we won’t discuss here. 

To get in to study Zoology at the University of Glasgow, you now need five A-grade Highers by the end of 6th Year (it was easier back in my day!).

You sit a preliminary exam around Christmas time (the prelim), which is, to the best of my knowledge, set by the individual school based on what has been taught so far. The actual Higher paper, sat in May, is held at the same time and same place with the same paper across the country. This year, students could not sit their Highers, and so the SQA asked teachers to estimate what they thought students would get instead. 

The majority of students who sit these exams are aged 15-18 years old. Over the past four years, 76.8% (+/-1.2%) of students aged 15-18 have achieved an A, B or C grade in their Highers. This year, teachers estimated that 88.9% of this age category would achieve an A, B, or C grade. 

The SQA had a problem. 

The teachers’ estimates would have meant a 12-percentage-point rise, across the board, in the number of students who received an A–C Higher grade. Why were the teachers’ estimates so high? What was the SQA going to award students? Could the SQA use the students’ last exams, a prelim that wasn’t standardised across material or paper, to fairly discriminate between ‘excellent’ and ‘satisfactory’ students? What were they going to do?

Option A: Use the teachers estimates

Teachers were told to guess at what their students could do on their best possible day. This, I think, is a crucial mistake in the story, because best possible days are rare, and have big impacts on performance. I rarely ever had my best possible day in my exams (see my failed Higher Psychology). I expect teachers also felt very sorry for students, and I expect they wanted to support students through this. I would not be surprised if a few schools leant on their teachers to whisper “hey, we could do with some better results this year”. The result? The estimates were far out of line with normal exam results. 

If the SQA awarded the estimated grades, they would devalue the exam and the accreditation. This is complicated because the SQA is also the first national examining body in the UK to release grades, thanks to Scotland’s early summer. Every year, we get stories about how grade inflation is making exams easier to pass, and their results harder to trust. These stories arise from creeps of 2 or 3 percentage points. 12 percentage points would have been a scandal. Scottish students would have found it difficult to use those grades to demonstrate their ability, and access to university may have been a challenge. The SQA may have feared that other examining bodies would take a different line, and that Scottish students would be disadvantaged by Scotland being perceived as lenient. Who knows? Certainly the rest of the UK is watching Scotland right now. 

Option B: Use the prelim grade

The next solution may have seemed logical – why not use the last exam the students sat? The one that they would have based any appeals on in a better year? (My Higher Psychology prelim was a C if I remember correctly. The appeal went nowhere). 

Exams are a pretty poor way of assessing students. The one thing we can agree on is that you can be broadly sure the right student is sitting in the right seat (ehhhhh), and that every student is seeing the same paper at the same time. At a national level, that requires a massive amount of coordination. It is a phenomenal amount of work to ensure that the Higher Psychology paper I sat in C201 in Park Mains in 2004 was the exact same paper that the other 2778 students were sitting. That when I left, the first moment I could, enough time had passed that I wasn’t likely to be texting my pal in Stornoway the answers. That me and the other 826 students who failed that paper were all fairly marked. It is an exercise in logistics that prelims, which are taken from past papers (in fact, I think I knew exactly which past papers were being used in my Psychology prelim) and are dictated at the level of the school, cannot match up to. 

Again we come back to standards. If the students didn’t all sit the same exam, how can we be sure that these 2020 grades are the passport to the future our schooling system is built on?

Option C: Standard set

And so the SQA took a third road. If about 77% of students usually achieve an A–C grade, then we have no real reason to assume that this year, had exams gone ahead, about 77% of students wouldn’t have achieved the same.

But therein lies the rub. The SQA did not take the national average – it took the average of your school, perhaps hoping to smooth over that prelim issue a little. Unfortunately . . . exams are a really, really terrible way to assess students, and students in lower Scottish Index of Multiple Deprivation (SIMD) categories consistently perform worse. If you’re in the poorest 20% of the population, you are probably going to a school in a deprived area, with other poor students. Historically, your school will do poorly . . .
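The moderation step is easier to see with numbers. Here’s a toy sketch in R. To be clear: the figures and the weighting scheme below are entirely invented for illustration, and this is not the SQA’s actual methodology.

```r
# Toy illustration of school-level moderation: NOT the SQA's actual
# algorithm, and all numbers are invented for demonstration.
schools <- data.frame(
  school        = c("A", "B"),
  historic_rate = c(0.85, 0.60),  # historic proportion achieving A-C
  estimate_rate = c(0.95, 0.80)   # teacher-estimated proportion for 2020
)

# One simple moderation scheme: a weighted average that pulls each
# school's estimates back towards its own history.
weight <- 0.7  # weight given to history; an assumption for illustration
schools$moderated <- weight * schools$historic_rate +
  (1 - weight) * schools$estimate_rate

schools$moderated  # 0.88 for school A, 0.66 for school B
```

Notice what any scheme like this does to individuals: an excellent student at school B is capped by their school’s history, while a middling student at school A is buoyed by it. That, in miniature, is the SIMD problem.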

And this is what the data shows. 

Most people’s scores were inflated above the usual. Most people’s scores were brought back in line with what their school would likely do. Some very bright students in poor areas have probably done very poorly. Some middling students in very good schools may have benefited. There has been a lot of anger about this: 

And some more big-picture observations:

https://twitter.com/mgshanks/status/1290650545758371847?s=20

Model Answer: So what do we do?

The Scottish Greens have issued a ‘no detriment’ petition, which I have signed. This petition proposes that students should at least achieve the grade they achieved at their prelim. But I actually don’t think this is a good answer either. 

The Scottish Government have assured students normal appeals procedures will go ahead, taking prelims into account, but I know from personal experience this doesn’t always get you the result you want, and up the page we just said prelims weren’t standardised, so . . . what do we do?

These exams didn’t happen. Even if they had, they would have been as shit as they always are in terms of equity, diversity and inclusion. COVID will disproportionately affect students in deprived areas, so why are we trying to pretend that four or five letters beside someone’s name, plucked from the aether, can tell us anything about these students’ abilities?

If I was in charge of university admissions, or had the ear of parliament and the SQA, I’d be advocating for “NULL” in those fields. I’d be advocating for more holistic assessment of incoming students to uni, much like Multiple Mini Interviews in medicine and veterinary medicine, and I’d be advocating for Scotland to take the lead here, because we need to fix this issue. We could be Finland, but we playing.

Exams are shit at assessing anything but whether a student can sit an exam. I don’t set exams in my courses at uni for this very reason, instead I set skills-based assessments wherever I can. I’m not perfect at this, and I could do better. I’ve recently had interesting conversations on twitter about whether we in the UK have an overly aggressive quality assurance approach when it comes to exams, and flexibility in QA this year is something I was firm we had to raise in our 10 Simple Rules paper. But I do like the Scottish Credit and Qualifications Framework, I like what it tries to standardise in terms of assessment throughout all levels of Scottish education.

I just don’t think we should pretend students have sat exams that they haven’t.

Covid fucking sucks. 

You can find my visualisation code over on GitHub.

Can sin a-rithist?

Failte gu Fluffy Sciences! Is mise Jill NicAoidh. Tha aon cat agam. Seo Athena. (Welcome to Fluffy Sciences! My name is Jill NicAoidh. I have one cat. This is Athena.) 

In late 2019, Duolingo launched the Scottish Gaelic version of its app. My dad and sister have been learning Gaelic for some time, and I’ve been trying to pick up a few phrases here and there. I’ve been doing this mostly through Speaking Our Language, a brilliant BBC Scotland series that I think is supposed to take place in a post-SNP-victory Scotland where English has been outlawed and people wander around Glasgow stumbling through broken Gaelic with frightened faces. It’s wonderful and I love it and you should watch it:

At school, I didn’t find languages easy, and therefore I considered them hard. Like many perfectionist people I would then announce I was terrible at languages. After a few weeks of playing around on Duolingo, I can confidently say I speak more Gaelic than German, which I learned at school for many years. I’m trying to avoid ‘classifying’ my language abilities these days as part thinking about how assessment and learning intertwine. 

In education conferences, particularly whenever gamification is mentioned, Duolingo is the Ur Example people use to illustrate how points, leaderboards, and rewards can be used to motivate learners. My partner and I have both taken the app up this month (he’s learning Spanish, I’m learning Gaelic), and I have some thoughts on how gamification and motivation tie in. 

I am very motivated to learn Gaelic. My little sister is currently shaming me, which is a big one, but there’s something beautiful about reviving a language that I see on signs every day, but is spoken by very few people. I recently learned that my grandparents used to speak Gaelic in the home, and my teuchter family must have done for many generations. It’s strange to think how quickly a language can disappear. 

There are lots of benefits to learning a language. There’s reasonable evidence that being bilingual slows the onset of Alzheimer’s, and learning new skills as an adult (and educator) can help you think more about learning. There is also, for me, a huge benefit in being able to read the street signs in my country. 

When you drive from England to Scotland you pass beautiful blue signs that read Failte gu Alba! I’ve had that said to me several times, but in my head I always read it as ‘Fail-ta goo Alba’. Now I read it, naturally, as ‘Fael-cha gu Alaba’. Many people in Scotland use odd turns of phrase or strange grammar. The Scots dialect would say “It’s wanting cleaned”, and I see echoes of that in the way Gaelic constructs sentences: tha mi ag iarraidh tì (‘I am wanting tea’). I’ve no idea if these parallels are true, but I feel as though I’m recovering something precious. If it’s something I can do to roll back ‘Scottish Cringe’, I’m all for it. In primary school we were simultaneously taught to recite Scottish poetry but penalised for writing ‘yous’ and ‘wur’, and there’s a lot that’s needed to undo that damage. 

Learning on Duolingo is interesting though. I’m fascinated by silos in learning from a curriculum design point of view. There’s a phenomenon where if you learn something in one context you aren’t able to generalise it to another context. I feel like I’ve been fighting learning silos for my entire teaching career, and it frustrates me no end to find my own Gaelic abilities vanishing the moment I close the Duolingo app. I’ve peppered some Gaelic throughout this blog, all phrases I can reliably type into the app, and all of them I had to google in front of my word document. Duolingo does suggest you should write down as many phrases as you can remember after a lesson, but is that enough? When you scaffold ‘extra’ learning outside of class time, is that really divorced enough from the course context to break down these walls?

Both my partner and I have observed that our language skills aren’t persisting outside of the app’s ‘classroom’, even though we’re both motivated to learn. I have no answers for this problem, yet, but it’s been an interesting experience to have first hand. 

Tha mi a’ bruidhinn Gaidhlig, tha mi cho toilichte. (I am speaking Gaelic, I am so happy.)  

Complexity

I have the beginnings of some thoughts about teaching statistical modelling

One of my fabulous colleagues has started a book club on campus where a group of us work through Advanced R by Hadley Wickham. Apart from the day I learned about the tidyverse, this Advanced R book club has been the biggest leap forward in my R skills, and I’m probably only understanding about a fifth of it.

This week we began the chapter on functional programming – and Ian’s code and examples are on github. I went home and spent the evening doing this:

There was one example that Ian drew up that I can’t stop thinking about from a teaching perspective. Teaching stats is really, really intimidating, because the more you know about it, the more you recognise how subjective it can be. I often see people take refuge in complexity where they refuse to answer a learner’s question in favour of reiterating the memorised textbook response. I’ve done this myself! At the same time, I’ve had a really intriguing stats challenge with a colleague where I’ve gone around the houses trying to make sure I can justify our choices.

This comes down to model selection, which is one of the most Fun(™) conversations you can ever have about statistics. The more I learn about statistics the more I feel that model selection is the personification of this tweet from my colleague:

You see, there really are no ‘right’ answers in model selection, just ‘less wrong’ ones. This is the subject of a lot of interesting blogs. One of them is David Robinson’s excellent ‘Variance Explained’.

Another of @drob’s posts that I’ve linked to before I’m sure is this one: Teach tidyverse to beginners. This idea fascinates me. David (and I feel I can call him David because I once asked him a question at a demo and he said it was a good question and it was honestly one of the highlights of my life) suggests that students should have goals, and they should be doing those goals as soon as possible.

I don’t know how much educational training the Data Camp/RStudio folks have but I’m always really impressed with the way they teach.

(It’s important here to take a moment to acknowledge the problems Data Camp is having at the moment regarding how they addressed a sexual harassment complaint. I have the utmost sympathy for all involved, and at the moment I don’t feel that boycotting Data Camp is the answer, but it’s worth pointing towards blog posts like this one to give a different opinion.)

‘Doing’ as soon as possible is something we struggle with in higher education. I’ve just had to rewrite a portion of a paper to defend why I think authentic assessment is so vital for science. We put ‘doing’ at the top of our assessment pyramids, and talk about how it takes us a long time to get there.

During this week’s bookclub, my colleague Ian had a great example of using the broom and purrr packages in R to fit multiple models to a dataset quickly and easily. And I had to derail the conversation in the room for a bit. Why don’t we teach this to our students straight away? At present, the way I teach model selection is a laborious process of fitting each model one by one, examining the results individually, and then trying to get those results into some kind of comparable format. After some brief discussion, with all the usual sciencey caveats, our Advanced R bookclub was all keen to use this as a way of introducing model selection to students.
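For what it’s worth, that workflow looks something like this. This is my own minimal sketch, not Ian’s actual code: it uses the built-in `mtcars` data as a stand-in, and the model formulas are arbitrary examples rather than a recommendation.

```r
# A sketch of the broom + purrr model-comparison workflow (mtcars is a
# stand-in dataset and the formulas are arbitrary examples).
library(purrr)
library(broom)
library(dplyr)

models <- list(
  simple      = mpg ~ wt,
  with_hp     = mpg ~ wt + hp,
  interaction = mpg ~ wt * hp
)

# Fit every candidate model, then collect the fit statistics into one
# tidy, directly comparable table.
fits <- map(models, ~ lm(.x, data = mtcars))
comparison <- map_dfr(fits, glance, .id = "model") %>%
  select(model, r.squared, AIC, BIC)

comparison
```

The pedagogical appeal is that the comparison table arrives in one step: students meet model selection as a choice among alternatives from the very start, rather than as a sequence of isolated fits they must stitch together by hand.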

I feel as though this is tickling at the edge of something quite important for higher education, especially for the sciences. Something about empowering students, and getting them to ask me about things I don’t know the answer to more quickly. I also feel just a little irate about the fact I can’t formalise this as nicely as I know David Robinson and the RStudio lot can. I kind of feel like some of the most useful stuff I’m doing lately is in the Open Educational Resources range, such as my Media Hopper channels and on my GitHub. There’s a freedom in OERs to push the boat out, and to start teaching the complex things first.

And ultimately, my disjointed ramblings might just help someone else connect a few dots. Happy spring, people!

The Gold Standard

This is a blog about assessment and urine. I promise there’s more of a point than the punny title.

This is a blog about assessment and urine. Please stay . . .  

I was very proud of myself this morning for collecting a urine sample from Athena. She seems to be suffering from cystitis, which is common in cats in her demographic. By a bizarre coincidence I happen to have a UTI this week as well, which is a common occurrence in my demographic. The upshot of this is that on Wednesday I saw a GP deal with my case very effectively, and a vet deal with Athena’s case very effectively. Both practitioners impressed me.

In medical education we have a concept, Miller’s Pyramid, which describes the different levels of ability in a practitioner.

  • You know
  • You know how
  • You show how
  • You do

Obviously the ‘doing’ is the most important part. Both my GP and my vet did an excellent job of doing, with a lot of similarities in how they handled their respective cases. Both were good at providing detail, providing treatment options, making me feel consulted, and both were respectively gentle with their patients (although I will say Athena was less grateful than she could have been). But large parts of that ‘doing’ are subjective, involving my feelings and Athena’s feelings, as best we can know them.

Let’s take a less medical example. An excellent question for a statistician might be:

Calculate the likelihood of a cohabiting 32-year-old woman and a 4-year-old spayed indoor female cat presenting with cystitis in the same week.

A statistician would need to investigate the prevalence of these conditions in these populations and then calculate how often these populations intersect. We might then ask them to comment on the factors which may make this an under/over estimate, and see if they show enough awareness of the real world to realise that I’m probably more sensitive to Athena’s problems when I’m in pain myself.

Even with this example, which uses lovely objective maths, there isn’t a true ‘right’ answer for doing. You might use different estimates, for example, or you may bring in other information (such as the fact cystitis may be associated with stress, in cats, and possibly in women). The best you can do is give your estimate and outline your thinking as to why this is the case.
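As a sketch of that back-of-envelope reasoning, in R. The prevalence figures below are completely made up for illustration, and the naive independence assumption is exactly the kind of thing you’d want the statistician to go on and criticise.

```r
# Back-of-envelope estimate: the prevalence figures are entirely
# invented for illustration, and independence is assumed naively.
p_human <- 0.003  # assumed weekly risk of a UTI for a woman in her 30s
p_cat   <- 0.001  # assumed weekly risk of cystitis for a young female cat

# If the two events were independent, the chance of both in one week:
p_both <- p_human * p_cat
p_both  # 3e-06 under these made-up inputs
```

The statistician’s real job starts after that last line: cohabiting is precisely the condition under which independence is suspect (shared household stress, and an owner in pain who is watching her cat more closely), so the honest answer is this number plus a paragraph on why it is probably wrong.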

At the same time, it’s MSc marking season. We say the gold standard for an MSc is to be of ‘publishable quality’, but as #PeerReviewWeek18 (yeah, that is unbelievably a thing) reminds us, we scientists can’t agree on that amongst ourselves. A recent study has shown that, as readers, scientists are reasonably good at guessing which papers will not replicate, and yet we still allow those papers to be published – we are the ones who peer review them, after all.

My GP and my vet were responsive to me, and both were very accepting of the ‘grey’ areas in diagnoses. My vet deeply impressed me by strongly recommending a painkiller for Athena (who is currently snoozing very comfortably on my left leg), and my GP was extremely good at parsing my confused jumble of “I’m not sure if this is a symptom or if I’m just overly-anxious today”.

When I was asked to collect a sample of Athena’s urine I thought back to when I used to perform similar tasks in the wildlife hospital I worked in over ten years ago. Then, the assessment criteria (that I perceived, anyway) were to perform the task quickly, with economic use of resources and with a minimum of fuss. But this morning I wanted to do it calmly, inflicting as little stress on Athena as possible, and still get to my first meeting on time. Similar task, two different sets of criteria.

The same task in different contexts requires different definitions of ‘doing’ – and good practitioners are adaptable. But funnily enough, this week has made me a lot more confident in ‘assessing’ practice. You recognise good care when you get it, not necessarily because it ‘works’, but because afterwards you feel better. Athena and I feel better today, and even if our respective problems aren’t fixed, we’re better for having seen good health professionals. Conversely, the next time I think a paper isn’t publishable, I’ll remember that I’m capable of recognising quality when I see it. 

And just an observation, it’s those ‘softer’ skills that my practitioners used to demonstrate their excellence . . . 

Jill Goes Back to the Chalet School

It was my birthday recently, and one of my friends gave me an old copy of The Chalet School. It’s one of the best presents I’ve ever been given. I’ve been hunting for the Chalet School books for years, but they’re very difficult to find and seem to be out of print at the moment.

For the uninitiated, the Chalet School series was written by Elinor M Brent-Dyer in the 1920s. It is probably a trope codifier for the ‘boarding school’ genre in English fiction. There are 58 books in the series and I reckon in my childhood I read a good 50 of them. The books serve as morality tales, preaching obedience and diligence to the girls, while recognising that the most fun girls still have character flaws. Jo, one of the great heroes, is frequently described as dishevelled and romantically dreaming of Napoleon’s conquests.

When I was little, I could devour two or three of these books in a week, so I imagine there was a period of about a year when I was obsessed with them. I remember constructing elaborate fantasies in my head about being sent to the Chalet School where I could somehow become Jo, and my two younger sisters would also be sent to the Chalet School and they would cause trouble and I would have to rescue them, while nearby a handsome Doctor would be waiting for me to turn of legal marriageable age. I also remember going through a period of putting brushes in peoples’ beds and being deeply disappointed by my mum’s utter lack of reaction (an excellent example of negative punishment).

I was aware that the Chalet School existed in another time. After all, it takes ten books to get to the second world war which lasts another five books in itself. But reading the book as an adult, there were a few things that jumped out to me. Firstly, I vividly remembered the odd feeling I had when Simone and Jo interacted and I recognise now that I identified their relationship as romantic long before I identified myself as bi. Secondly, the quality of the German in the book is appalling. Thirdly, the imperialistic tone of the book is really quite troubling at times even if you do try to remind yourself it was written in 1925, the same time as The Great Gatsby and Mein Kampf.

But the fourth thing . . . I think we could learn a little about curriculum design from the Chalet School. Re-reading the book, not just as an adult, but as an educator, was fascinating. I was never one to play ‘teacher’ as a kid (my fantasies were more about letting both my little sisters nearly drown in the ice-covered lakes of the Tyrol before deigning to rescue them in the nick of time so I could be lauded by a much older Doctor), so it’s interesting now to note how often the Chalet School’s curriculum is referenced. The girls are very much trained to be good wives, with needlework and mending forming a decent chunk of the timetable. They also must be fluent in three languages and possess good numeracy skills (which many of the heroes struggle with).

I’m not advocating a return to home-making skills in our higher curriculum, but in both the #UoELTConf18 and VetEd18 we had discussions about how much higher education should encourage community spirit and social responsibility. There was considerable debate in fact about to what extent it’s the responsibility of universities to do this. Many of my friends and family work in all stages of teaching and I happen to know that (in Scotland at least) there is a focus on community in early years education, so I’m not trying to pass this responsibility on.

In some ways, I wonder if we come at this from the wrong perspective. Perhaps what we’re really asking for is authentic assessment. In my elaborate self-insert fantasies where a handsome doctor was waiting in the wings for me to turn 18, I was being assessed on how good I’d be as a wife. That assessment is unique to each individual pairing, and has unique criteria. I really like Gulikers et al.’s (2004) framework for thinking about authentic assessment. They suggest that authenticity comes from:

  • Task
    • i.e. a problem which will occur in practice
  • Physical context
    • i.e. in a space that will be equivalent to the space that you’ll be in in practice
  • Social context
    • i.e. reflecting the social structure you will be in in practice
  • Assessment form
    • i.e. the output of the assessment has a relevance or parallel in the real world
  • Assessment criteria
    • i.e. the things you mark are relevant to how that task will be assessed in the real world.

 

If we stay with the Chalet School a little longer, the tall Doctor waiting in the wings will presumably want me to remain calm under pressure around patients (i.e. rescue my drowning hypothermic sisters), in an unsupervised environment (The Austrian mountains), while not pointing out any of my working class roots (jolly good), and provide continued life for my sisters while keeping up appearances the whole time.

I think that when we wring our hands over whether our students demonstrate social responsibility and community spirit, we’re actually bemoaning how our programme design and assessment don’t translate to what the real world values. Unlike the Chalet School, we don’t want to produce good spouses in higher education, but we do want to produce good citizens. And therefore we need to make space in our curriculum and our assessments to reflect that importance.

And if anyone spots any other Chalet School books in the charity shops . . . do let me know.