Standard Setting Non-Existent Exams

I wrote a blog about those #SQAresults, #covid, the Scottish education system, and tried not to panic about what’s going to happen next academic year …

The Scottish Qualifications Authority (SQA) released its exam results this week to huge uproar. This is a fascinating, and horrifying, glimpse into what might await us in higher education, and the rest of the UK. Let’s talk about it.

The SQA Higher exams are sat in 5th year (15-16 years old) and 6th year (16-17 years old), and are typically the exams that get you into university. Typically, the maximum you can sit in a year is 5. I have 6 Highers, because I sat 5 in 5th Year and 2 in 6th Year (and failed my Higher Psychology – a discussion for another day). There are also Advanced Highers, which we won’t discuss here.

To get in to do Zoology at the University of Glasgow, you now need 5 A Highers at the end of 6th Year (it was easier back in my day!).

You sit a preliminary exam around Christmas time (the prelim) which is, to the best of my knowledge, set by the individual school based on what has been taught so far. The actual Higher paper, sat in May, is held at the same time and same place with the same paper across the country. This year, students could not sit their Highers, and so the SQA asked teachers to estimate what they thought students would get instead.

The majority of students who sit these exams are aged 15-18 years old. Over the past four years, 76.8% (+/-1.2%) of students aged 15-18 have achieved an A, B or C grade in their Highers. This year, teachers estimated that 88.9% of this age category would achieve an A, B, or C grade. 

The SQA had a problem. 

The teachers’ estimates would have meant a 12 percentage point rise in the proportion of students across the board who received an A-C Higher grade. Why were the teachers’ estimates so high? What was the SQA going to award students? Could the SQA use the students’ last exams, a prelim that wasn’t standardised across material or paper, to fairly discriminate between ‘excellent’ and ‘satisfactory’ students? What were they going to do?
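To make that 12 point figure concrete, here is a minimal Python sketch that uses only the two percentages quoted above. The variable names are illustrative, and the relative rise is an extra comparison for context rather than anything the SQA reported.

```python
# Figures quoted in the post (share of 15-18 year olds with an A-C grade)
historical_rate = 76.8  # % over the past four years
estimated_rate = 88.9   # % in the teachers' estimates for this year

# Percentage POINT rise: the simple difference between the two rates
point_rise = estimated_rate - historical_rate

# Relative rise: how much bigger the estimated rate is than the usual one
relative_rise = point_rise / historical_rate * 100

print(f"Rise in percentage points: {point_rise:.1f}")
print(f"Relative rise: {relative_rise:.1f}%")
```

The distinction matters when reading headlines: a jump of about 12 percentage points is a relative increase of nearly 16% in the share of students getting an A-C grade.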

Option A: Use the teachers’ estimates

Teachers were told to guess at what their student could do on their best possible day. This, I think, is a crucial mistake in the story, because best possible days are rare, and have big impacts on performance. I rarely ever had my best possible day on my exams (see my failed Higher Psychology). I expect teachers also felt very sorry for students, and I expect they wanted to support students through this. I would not be surprised if a few schools leant on their teachers to whisper “hey, we could do with some better results this year”. The result? The estimates were far out of line with normal exams.

If the SQA awarded the estimated grade, they would devalue the exam and the accreditation. This is complicated because the SQA is also the first national examining body in the UK to release grades, thanks to Scotland’s early summer. Every year, we get stories about how grade inflation is making exams easier to pass, and their results harder to trust. These stories arise from creeps of 2 or 3 percentage points. A rise of 12 percentage points would have been a scandal. Scottish students would have found it difficult to use those grades to demonstrate their ability, and access to university may have been a challenge. The SQA may have feared that other examining bodies would take a different line, and they would disadvantage Scottish students by being perceived as lenient. Who knows? Certainly the rest of the UK is watching Scotland right now.

Option B: Use the prelim grade

The next solution may have seemed logical – why not use the last exam the students sat? The one that they would have based any appeals on in a better year? (My Higher Psychology prelim was a C if I remember correctly. The appeal went nowhere). 

Exams are a pretty poor way of assessing students. The one thing we can agree on is that you can be broadly sure the right student is sitting in the right seat (ehhhhh), and that every student is seeing the same paper at the same time. At a national level, that requires a massive amount of coordination. It is a phenomenal amount of work to ensure that the Higher Psychology paper I sat in C201 in Park Mains in 2004, is the exact same paper that the other 2778 students were sitting. That when I left, the first moment I could, enough time had passed that I wasn’t likely to be texting my pal in Stornoway the answers. That me and the other 826 students who failed that paper were all fairly marked. It is an exercise in logistics that prelims, which are taken from past papers (in fact, I think I knew exactly which past papers were being used in my Psychology prelim), and are dictated at the level of the school, cannot match up to. 

Again we come back to standards. If the students didn’t all sit the same exam, how can we be sure that these 2020 grades are the passport to the future our schooling system is built on?

Option C: Standard set

And so the SQA took a third road. If about 77% of students usually achieve an A-C grade, then we have no real reason to assume that, had this been a normal year, about 77% of students wouldn’t have achieved the same.

But therein lies the rub. The SQA did not take the average of everyone – it took the average of your school, perhaps hoping to smooth over that prelim issue a little. Unfortunately . . . exams are a really, really terrible way to assess students, and students in lower Scottish Index of Multiple Deprivation (SIMD) categories consistently perform worse. If you’re in the poorest 20% of the population, you are probably going to a school in a deprived area, with other poor students. Historically, your school will do poorly . . .

And this is what the data shows. 

Most people’s scores were inflated above the usual. Most people’s scores were brought back in line with what their school would likely do. Some very bright students in poor areas have probably done very poorly. Some middling students in very good schools may have benefited. There has been a lot of anger about this.

And some more big picture observations

https://twitter.com/mgshanks/status/1290650545758371847?s=20

Model Answer: So what do we do?

The Scottish Greens have issued a ‘no detriment’ petition, which I have signed. This petition proposes that students should at least achieve the grade they achieved at their prelim.  But I actually don’t think this is a good answer either. 

The Scottish Government have assured students normal appeals procedures will go ahead, taking prelims into account, but I know from personal experience this doesn’t always get you the result you want, and up the page we just said prelims weren’t standardised, so . . . what do we do?

These exams didn’t happen. Even if they had, they would have been as shit as they always are in terms of equity, diversity and inclusion. COVID will disproportionately affect students in deprived areas, so why are we trying to pretend that four or five letters beside someone’s name, plucked from the aether, can tell us anything about these students’ abilities?

If I was in charge of university admissions, or had the ear of parliament and the SQA, I’d be advocating for “NULL” in those fields. I’d be advocating for more holistic assessment of incoming students to uni, much like Multiple Mini Interviews in medicine and veterinary medicine, and I’d be advocating for Scotland to take the lead here, because we need to fix this issue. We could be Finland, but we playing.

Exams are shit at assessing anything but whether a student can sit an exam. I don’t set exams in my courses at uni for this very reason. Instead, I set skills-based assessments wherever I can. I’m not perfect at this, and I could do better. I’ve recently had interesting conversations on Twitter about whether we in the UK have an overly aggressive quality assurance approach when it comes to exams, and flexibility in QA this year is something I was firm we had to raise in our 10 Simple Rules paper. But I do like the Scottish Credit and Qualifications Framework, and I like what it tries to standardise in terms of assessment throughout all levels of Scottish education.

I just don’t think we should pretend students have sat exams that they haven’t.

Covid fucking sucks. 

You can find my visualisation code over on GitHub.

Can sin a-rithist?

Failte gu Fluffy Sciences! Is mise Jill NicAoidh. Tha aon cat agam. Seo Athena. 

In late 2019, Duolingo launched the Scottish Gaelic version of its app. My dad and sister have been learning Gaelic for some time, and I’ve been trying to pick up a few phrases here and there. I’ve been doing this mostly through Speaking Our Language, a brilliant BBC Scotland series that I think is supposed to take place in a post-SNP-victory Scotland where English has been outlawed and people wander around Glasgow stumbling through broken Gaelic with frightened faces. It’s wonderful and I love it and you should watch it:

At school, I didn’t find languages easy, and therefore I considered them hard. Like many perfectionist people I would then announce I was terrible at languages. After a few weeks of playing around on Duolingo, I can confidently say I speak more Gaelic than German, which I learned at school for many years. I’m trying to avoid ‘classifying’ my language abilities these days as part of thinking about how assessment and learning intertwine.

In education conferences, particularly whenever gamification is mentioned, Duolingo is the Ur Example people use to illustrate how points, leaderboards, and rewards can be used to motivate learners. Both my partner and I have taken the app up this month (he’s learning Spanish, I’m learning Gaelic), and I have some thoughts on how gamification and motivation tie in.

I am very motivated to learn Gaelic. My little sister is currently shaming me, which is a big one, but there’s something beautiful about reviving a language that I see on signs every day, but is spoken by very few people. I recently learned that my grandparents used to speak Gaelic in the home, and my teuchter family must have done for many generations. It’s strange to think how quickly a language can disappear. 

There are lots of benefits to learning a language. There’s reasonable evidence that being bilingual slows the onset of Alzheimer’s, and learning new skills as an adult (and educator) can help you think more about learning. There is also, for me, a huge benefit in being able to read the street signs in my country.

When you drive from England to Scotland you pass beautiful blue signs that read Failte gu Alba! I’ve had that said to me several times, but in my head I always read it as ‘Fail-ta goo Alba’. Now I read it, naturally, as ‘Fael-Cha gu Alaba’. Many people in Scotland use odd turns of phrase or strange grammar. The Scots dialect would say “It’s wanting cleaned”, and I see echoes of that in the way Gaelic constructs sentences, tha mi ag iarraidh ti. I’ve no idea if these parallels are true, but I feel as though I’m recovering something precious. If it’s something I can do to roll back ‘Scottish Cringe’ I’m all for it. In primary school we were simultaneously taught to recite Scottish poetry but penalised for writing ‘yous’ and ‘wur’, and there’s a lot that’s needed to undo that damage.

Learning on Duolingo is interesting though. I’m fascinated by silos in learning from a curriculum design point of view. There’s a phenomenon where if you learn something in one context you aren’t able to generalise it to another context. I feel like I’ve been fighting learning silos for my entire teaching career, and it frustrates me no end to find my own Gaelic abilities vanishing the moment I close the Duolingo app. I’ve peppered some Gaelic throughout this blog, all phrases I can reliably type into the app, and all of them I had to google in front of my word document. Duolingo does suggest you should write down as many phrases as you can remember after a lesson, but is that enough? When you scaffold ‘extra’ learning outside of class time, is that really divorced enough from the course context to break down these walls?

Both my partner and I have observed that our language skills aren’t persisting outside of the app’s ‘classroom’, even though we’re both motivated to learn. I have no answers for this problem, yet, but it’s been an interesting experience to have first hand. 

Tha mi a’ bruidhinn Gaidhlig, tha mi cho toilichte.  

The Gold Standard

This is a blog about assessment and urine. I promise there’s more of a point than the punny title.

This is a blog about assessment and urine. Please stay . . .  

I was very proud of myself this morning for collecting a urine sample from Athena. She seems to be suffering from cystitis, which is common in cats in her demographic. By a bizarre coincidence I happen to have a UTI this week as well, which is a common occurrence in my demographic. The upshot of this is that on Wednesday I saw a GP deal with my case very effectively, and a vet deal with Athena’s case very effectively. Both practitioners impressed me.

In medical education we have a concept, Miller’s Pyramid, which describes the different levels of ability in a practitioner.

  • You know
  • You know how
  • You show
  • You do

Obviously the ‘doing’ is the most important part. Both my GP and my vet did an excellent job of doing, with a lot of similarities in how they handled their respective cases. Both were good at providing detail, providing treatment options, making me feel consulted, and both were respectively gentle with their patients (although I will say Athena was less grateful than she could have been). But large parts of that ‘doing’ are subjective, involving my feelings and Athena’s feelings, as best we can know them.

Let’s take a less medical example. An excellent question for a statistician might be:

Calculate the likelihood of a cohabiting 32 year old woman and 4 year old spayed indoor female cat presenting with cystitis in the same week.

A statistician would need to investigate the prevalence of these conditions in these populations and then calculate how often these populations intersect. We might then ask them to comment on the factors which may make this an under/over estimate, and see if they show enough awareness of the real world to realise that I’m probably more sensitive to Athena’s problems when I’m in pain myself.

Even with this example, which uses lovely objective maths, there isn’t a true ‘right’ answer for doing. You might use different estimates, for example, or you may bring in other information (such as the fact cystitis may be associated with stress, in cats, and possibly in women). The best you can do is give your estimate and outline your thinking as to why this is the case.
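As a rough sketch of the first step a statistician might take, the Python below multiplies two weekly probabilities together on the assumption that the two cases are independent. The numbers are hypothetical placeholders, not real prevalence figures, and the `risk_ratio` parameter is an invented knob for illustrating the ‘shared stress’ caveat.

```python
def joint_weekly_probability(p_owner: float, p_cat: float,
                             risk_ratio: float = 1.0) -> float:
    """Chance that owner and cat both present with cystitis in one week.

    risk_ratio = 1.0 treats the two events as independent.
    risk_ratio > 1.0 is a hypothetical way of saying the cat is more
    likely to present in a week when the owner is also unwell.
    """
    # Probability for the cat given the owner is affected, capped at 1
    p_cat_given_owner = min(1.0, p_cat * risk_ratio)
    return p_owner * p_cat_given_owner


# HYPOTHETICAL placeholder values, NOT real prevalence data
p_owner_week = 0.002
p_cat_week = 0.001

independent = joint_weekly_probability(p_owner_week, p_cat_week)
linked = joint_weekly_probability(p_owner_week, p_cat_week, risk_ratio=3.0)

print(f"Independent estimate: {independent:.2e}")
print(f"With a shared-stress assumption: {linked:.2e}")
```

Swapping in different estimates, or a different risk ratio, changes the answer, which is rather the point: the working matters more than the final number.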

At the same time, it’s MSc marking season. We say the gold standard for an MSc is to be of ‘publishable quality’, but in line with #PeerReviewWeek18 (yeah, that is unbelievably a thing), we scientists can’t decide that amongst ourselves. A recent study has shown that as readers, scientists are reasonably good at guessing which papers will not be replicated, and yet we still allow those papers to be published – we are the ones who peer review them after all.

My GP and my vet were responsive to me, and both were very accepting of the ‘grey’ areas in diagnoses. My vet deeply impressed me by strongly recommending a painkiller for Athena (who is currently snoozing very comfortably on my left leg), and my GP was extremely good at parsing my confused jumble of “I’m not sure if this is a symptom or if I’m just overly-anxious today”.

When I was asked to collect a sample of Athena’s urine I thought back to when I used to perform similar tasks in the wildlife hospital I worked in over ten years ago. Then, the assessment criteria (that I perceived anyway) were to perform the task quickly, with economic use of resources and with a minimum of fuss. But this morning I wanted to do it calmly, inflicting as little stress on Athena as possible, and still get to my first meeting on time. Similar task, two different sets of criteria.

The same task in different contexts requires different definitions of ‘doing’ – and good practitioners are adaptable. But funnily enough, this week has made me a lot more confident in ‘assessing’ practice. You recognise good care when you get it, not necessarily because it ‘works’, but because afterwards you feel better. Athena and I feel better today, and even if our respective problems aren’t fixed, we’re better for having seen good health professionals. Vice versa, the next time I think a paper isn’t publishable, I’ll remember that I’m capable of recognising quality when I see it. 

And just an observation, it’s those ‘softer’ skills that my practitioners used to demonstrate their excellence . . . 

Clever Cat

A friend recently asked me about their dog who was showing some unusual behaviour. The dog was suddenly acting fearfully around traffic, although there hadn’t been an obvious incident to spook him. I said “Sometimes clever animals get spooked by things just when they’re slightly ‘off’, they’re clever enough to recognise the pattern is wrong and start obsessing over why.”

In some ways, this explanation is mainly to soothe the owner’s feelings. People like to think their animal is clever.

And I’m proof of this. Since having this conversation, I’ve been quietly re-evaluating some of Athena’s behaviours. Athena is a great example of a fearful cat, who runs away at the slightest provocation . . . except when Edinburgh had a brief but very welcome thunderstorm she sat by the window watching the lightning, completely calm. Living where we do, she has got a lot of experience with fireworks and other things that typically frighten animals, and she is utterly blasé about them.

In fact, earlier this month we had a packed house, full of noisy family doing all the unpredictable things that Athena finds uncomfortable, and she still chose to join us, and to complain loudly about all the people sitting in her various spots. She even chose to sleep on the bed with our guest rather than on the floor with me (cow).

Like many people in my age and general middle class demographic, I greatly value intelligence. I want, very dearly, to believe that Athena’s general quirks are due to a very intelligent little cat mind that tries to understand a human world. And yet, as much as I want it, I still have to acknowledge this is a cat who regularly walks off windowsills and sofa edges because she’s too busy talking to me to watch where she’s going. 

I got very angry recently at a news article about the increased levels of unconditional offers being made to university. This was supposed to be bad because it would encourage students to take their foot off the gas and make them slack in the year before they got to university. There is a lot to unpack in that statement, which I may get into another time, but I had also recently read this interesting blog post purporting that examinations make it easier for students with poor social capital to demonstrate their ability.

As an academic, I wouldn’t dream of suggesting that exams test intelligence. I can just about say that certain formats of them test knowledge and skill acquisition. When scientists try to measure intelligence, they get caught in a whole heap of challenging research. There is, we think, a thing about some people’s brains that makes them perform better in the tests we give them (tests which we’ve designed, and which are not unbiased). However, believing that intelligence is malleable seems to also make people perform better in these tasks. There are many ways in which social capital helps you perform better in many of the ways we judge intelligence.

What about Athena’s social capital? Daughter of a teenage mum, separated from her mother shortly after birth and raised in foster care. She was ill for a period as a baby, and so was slow to gain weight. She was separated from her own kind and adopted by someone who then suffered a mental health problem. She lived apart from her own kind, as is the culture she was adopted into. She developed a long-term health condition that gives her pain and discomfort.

When I think about it that way, watching Athena study some loud, cheerful strangers from a safe spot beside me seems like a very, very intelligent response to something unusual. It’s just my measurement is bad.

A Dangerous Demographic

I have a bit of a thing for adverts on the internet, because I love looking at how an algorithm decides what I like. (See the book, chapter 11).

But there’s an advert I’m getting a lot lately, and I can’t get it out of my head. The advert is about three minutes long, and thankfully skippable, and it plays in front of every YouTube video I watch. Doing my daily yoga? Advert. Watching a group of gamers murder and mutilate one another (virtually)? Advert.

Towards the end of the ad, the presenter says “Imagine . . . never having to worry about that time consuming process of creating courses and coaching programmes.”

Hold up. Wait. Insert record-scratch noise.

Never having to create a course again?

 

This advert is for a service which will provide you ‘content’ for a price. They seem to be mostly selling blog posts and ‘top 10 tips’ lists. They seem to be talking mainly about ‘coaching’ services, but I can’t get that phrase out my head. “Never have to create a course again”.

Off the top of my head, I’ve been involved in the creation of about 25 courses in higher education. Four of them have been courses which were owned by me, and that I would have to do the bulk of the delivery for. I think that gives me an unusual perspective on course design.

There’s a part of me that very much wants to write a dystopian future novel about a higher education environment where the educators purchase the materials of the course from the same place that sells the answers to the assignments to the students. Yes – I think the next logical step for essay mills is course creation.

I am being a little flippant here, as I actually think essay mills are one of the greatest failures of higher education. It horrifies me that we have a whole cohort of students, a marketable population who value product over process. I don’t think this company is interested particularly in writing university courses, but I am certain they wouldn’t object to me using the content I might purchase from them in such a way. In fact, I think they would even start working to develop content in that area, if they thought people would pay for it.

It’s interesting that this ad comes up on YouTube because some people on the platform have been paid to promote essay mills in their own content. It’s also interesting that no matter how many times I tell YouTube I don’t like the advert, it continues to show it to me. Something in that algorithm is overriding the information I myself give to Google. I can’t help but wonder if there’s something about me specifically that the company wants to reach. Since GDPR, I’ve had some truly weird and wonderful adverts, including a company who thinks I’m in the market to buy a bulk order of silicon processors (Google Ads thinks I like Business and Productivity Software, Business News and Business Services as well as Computer Components which . . . is kind of disappointing, Google). And I have seen an unfortunate resurgence in the number of adverts to the all important 30+ woman demographic which means that pregnancy testing companies think I spend all day urinating. (Aside from all the research implications, this has been the biggest issue I have with GDPR. I had JUST trained Google out of this).

What really, really worries me is that I fit into a demographic here. I know that Google Ads aren’t that clever. And I know how essay mills sell. They say that essay assignments are unfair, are impossible to be marked unless you know the system, and they say they have PhD students waiting to write for you. They talk about unemployed professors wanting to get one over a system that wronged them.

I look at staff who are fighting for pensions, and yet will be punished for this year’s poor NSS scores. I remember the incredulous face of a colleague when I described how that overall satisfaction is actually calculated. I think of the papers which demonstrate that department, not university, not subject, but that little culture of people in a building – is the greatest contributor to variation in the NSS scores.

I wonder about those departments, those unhappy and stressed people who are told that leaving academia is weak and shameful, and I wonder . . .

When they click a YouTube link, do they hear “Imagine . . . never having to worry about that time consuming process of creating courses”?

Jill Goes Back to the Chalet School

It was my birthday recently, and one of my friends gave me an old copy of The Chalet School. It’s one of the best presents I’ve ever been given. I’ve been hunting for the Chalet School books for years, but they’re very difficult to find and seem to be out of print at the moment.

For the uninitiated, the Chalet School series is one that Elinor M Brent-Dyer began writing in the 1920s. It is probably a trope codifier for the ‘boarding school’ genre in English fiction. There are 58 books in the series and I reckon in my childhood I read a good 50 of them. The books serve as morality tales, preaching obedience and diligence to the girls, while recognising that the most fun girls still have character flaws. Jo, one of the great heroes, is frequently described as dishevelled and romantically dreaming of Napoleon’s conquests.

When I was little, I could devour two or three of these books in a week, so I imagine there was a period of about a year when I was obsessed with them. I remember constructing elaborate fantasies in my head about being sent to the Chalet School where I could somehow become Jo, and my two younger sisters would also be sent to the Chalet School and they would cause trouble and I would have to rescue them, while nearby a handsome Doctor would be waiting for me to reach legal marriageable age. I also remember going through a period of putting brushes in people’s beds and being deeply disappointed by my mum’s utter lack of reaction (an excellent example of negative punishment).

I was aware that the Chalet School existed in another time. After all, it takes ten books to get to the Second World War, which lasts another five books in itself. But reading the book as an adult, there were a few things that jumped out at me. Firstly, I vividly remembered the odd feeling I had when Simone and Jo interacted and I recognise now that I identified their relationship as romantic long before I identified myself as bi. Secondly, the quality of the German in the book is appalling. Thirdly, the imperialistic tone of the book is really quite troubling at times even if you do try to remind yourself it was written in 1925, the same time as The Great Gatsby and Mein Kampf.

But the fourth thing . . . I think we could learn a little about curriculum design from the Chalet School. Re-reading the book, not just as an adult, but as an educator, was fascinating. I was never one to play ‘teacher’ as a kid (my fantasies were more about letting both my little sisters nearly drown in the ice-covered lakes of the Tyrol before deigning to rescue them in the nick of time so I could be lauded by a much older Doctor), so it’s interesting now to note how often the Chalet School’s curriculum is referenced. The girls are very much trained to be good wives, with needlework and mending forming a decent chunk of the timetable. They also must be fluent in three languages and possess good numeracy skills (which many of the heroes struggle with).

I’m not advocating a return to home-making skills in our higher curriculum, but in both the #UoELTConf18 and VetEd18 we had discussions about how much higher education should encourage community spirit and social responsibility. There was considerable debate in fact about to what extent it’s the responsibility of universities to do this. Many of my friends and family work in all stages of teaching and I happen to know that (in Scotland at least) there is a focus on community in early years education, so I’m not trying to pass this responsibility on.

In some ways, I wonder if we come at this from the wrong perspective. Perhaps what we’re really asking for is authentic assessment. In my elaborate self-insert fantasies where a handsome doctor was waiting in the wings for me to turn 18, I was being assessed on how good I’d be as a wife. That assessment is unique to each individual pairing, and has unique criteria. I really like Gulikers et al.’s (2004) framework for thinking about authentic assessment. They suggest that authenticity comes from:

  • Task
    • i.e. a problem which will occur in practice
  • Physical context
    • i.e. in a space that will be equivalent to the space that you’ll be in in practice
  • Social context
    • i.e. reflecting the social structure you will be in in practice
  • Assessment form
    • i.e. the output of the assessment has a relevance or parallel in the real world
  • Assessment criteria
    • i.e. the things you mark are relevant to how that task will be assessed in the real world.

 

If we stay with the Chalet School a little longer, the tall Doctor waiting in the wings will presumably want me to remain calm under pressure around patients (i.e. rescue my drowning hypothermic sisters), in an unsupervised environment (The Austrian mountains), while not pointing out any of my working class roots (jolly good), and provide continued life for my sisters while keeping up appearances the whole time.

I think that when we wring our hands over whether our students demonstrate social responsibility and community spirit, we’re actually bemoaning how our programme design and assessment don’t translate to what the real world values. Unlike the Chalet School, we don’t want to produce good spouses in higher education, but we do want to produce good citizens. And therefore we need to make space in our curriculum and our assessments to reflect that importance.

And if anyone spots any other Chalet School books in the charity shops . . . . do let me know.