For a while now I’ve been playing with a dataset from a benchmarking exercise. These exercises engage students with assessment criteria by asking them to apply those criteria to known reference texts (texts for which we already have marks), giving us and the students feedback on how accurate their assessments are. I’ve written (here, and here) about why I find this approach exciting (broadly construed, the approach comes under a few labels, including my own ‘diagnostic peer review’, at UTS ‘benchmarking’, at UCLA ‘calibrated peer review’, and in the Digital Assess system ‘adaptive comparative judgement’).
In this case the dataset is structured such that each row (representing an individual student from one of 4 years) contains:
- 6 values, representing the student’s scores for 3 reference texts against 2 rubric items (i.e. 3 texts for which we already have marks are each scored by the student against the 2 rubric items; simple distance measures can be computed by subtracting the student score from the reference value)
- a comment for each reference text for each rubric item (i.e. 3×2 = 6 comments across 3 texts on 2 rubric items)
- for a subset (the most recent year cohort), a final grade for the student on their submitted assignment (completed at a later date), along with their scores on each rubric element and their self-assessments on those rubric elements
- for that same subset, an overall comment from the tutor and a self-reflective comment from the student
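As a sketch of the simple distance measures mentioned above, something like the following could be computed per student. The column/key names and reference marks here are hypothetical, not the actual dataset:

```python
# Sketch of per-student distance measures: reference value minus student score.
# The keys (text/item labels) and reference marks below are invented for illustration.

REFERENCE = {  # assumed reference marks for 3 texts x 2 rubric items
    ("text1", "item1"): 70, ("text1", "item2"): 65,
    ("text2", "item1"): 55, ("text2", "item2"): 50,
    ("text3", "item1"): 40, ("text3", "item2"): 45,
}

def distances(student_scores):
    """Signed distance for each judgement: reference minus student score.

    Positive means the student marked below the reference (harsh);
    negative means they marked above it (lenient).
    """
    return {key: ref - student_scores[key] for key, ref in REFERENCE.items()}

student = {
    ("text1", "item1"): 75, ("text1", "item2"): 60,
    ("text2", "item1"): 60, ("text2", "item2"): 55,
    ("text3", "item1"): 45, ("text3", "item2"): 40,
}
d = distances(student)
mean_bias = sum(d.values()) / len(d)  # overall lenient/harsh tendency
```

A negative `mean_bias` would indicate a student who, on average, marks above the reference values.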
We might hypothesise that we should see groups. We might expect the best students to make judgements that are close to the exemplars, with only small deviations. Next, we might expect to see those who are too lenient (marking high) or too harsh (marking low) in their judgements, but who can still notice the relative differences between the exemplars. Finally, we might expect the students who struggle most to be inconsistent in their judgements and to miss the relative differences between the exemplars.
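Those three hypothesised groups could be operationalised with simple rules, roughly as in the sketch below. The tolerance threshold and the choice of rank-order agreement as the "notices relative differences" measure are my assumptions, not part of the original analysis:

```python
# Hypothetical rule-based grouping of students by their judgements of
# 3 reference texts on one rubric item. The tolerance is illustrative.

def classify(student_scores, reference_scores, tol=5):
    """Return 'accurate', 'biased-but-consistent', or 'inconsistent'."""
    errors = [s - r for s, r in zip(student_scores, reference_scores)]
    # "Consistent" = the student's rank ordering of the texts matches the references'
    rank = lambda xs: sorted(range(len(xs)), key=lambda i: xs[i])
    consistent = rank(student_scores) == rank(reference_scores)
    if consistent and all(abs(e) <= tol for e in errors):
        return "accurate"               # close to the exemplar marks
    if consistent:
        return "biased-but-consistent"  # marks high or low, but sees relative differences
    return "inconsistent"               # does not recover the relative ordering

classify([72, 55, 41], [70, 55, 40])  # → 'accurate'
classify([85, 70, 55], [70, 55, 40])  # → 'biased-but-consistent' (marks high)
classify([50, 70, 40], [70, 55, 40])  # → 'inconsistent' (ordering scrambled)
```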
We can imagine testing various interventions for their impact on final grade, or on improvement over time in self-assessment (or in the benchmarking tasks themselves, although we’d want to know that this transfers to self-regulated learning contexts) – e.g. interventions to calibrate students up, down, or at all. We might also expect types of comments to emerge, and perhaps to be able to group some together, so that an easier overview of these short texts could be created for better-targeted overall feedback.
In terms of relationships, it isn’t clear what links we should expect between benchmarking success, final grade, and self-review. Ideally, if the exercise is as productive as we hope, even students who perform poorly on it should do well later, because the poor performance should itself provide a learning experience.
In any case, I’ve tried analysing this data in a number of ways, but not had much luck:
- I’ve tried analysing the text comments (and made a few attempts at slicing these according to accuracy) to identify topics, or discriminating terms.
- Note it might just be worth doing a really basic term analysis, perhaps including use of something like the ‘academic word list’
- I’ve tried analysing the distances from the reference values (using mid-point and gradient distance measures), both making rules to create groups and using clustering to create clusters – with a dataset this small, though, it isn’t obvious that there is a large enough group of ‘consistent’ students (whether on, over, or under the reference values), or enough structure in their behaviour, to cluster students meaningfully.
- I’ve had a basic look at relationships between benchmarking performance and final performance (though I’m not sure this is a particularly productive avenue)
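On the really-basic-term-analysis idea: even a crude frequency comparison between the comments of more- and less-accurate assessors might surface candidate discriminating terms. A minimal sketch, where the comment strings and the accurate/inaccurate split are invented for illustration:

```python
from collections import Counter
import re

def terms(texts):
    """Lower-cased word counts across a list of comments."""
    counts = Counter()
    for t in texts:
        counts.update(re.findall(r"[a-z']+", t.lower()))
    return counts

def discriminating(group_a, group_b, min_count=2):
    """Terms ranked by the difference in relative frequency between two comment groups."""
    a, b = terms(group_a), terms(group_b)
    total_a, total_b = sum(a.values()) or 1, sum(b.values()) or 1
    vocab = {w for w in (a | b) if a[w] + b[w] >= min_count}
    score = {w: a[w] / total_a - b[w] / total_b for w in vocab}
    return sorted(score, key=score.get, reverse=True)

# Invented example comments, split by (hypothetical) assessor accuracy
accurate = ["clear thesis and strong evidence", "evidence supports the argument"]
inaccurate = ["nice essay, I liked it", "it was nice and easy to read"]
top = discriminating(accurate, inaccurate)  # terms most over-represented in the accurate group first
```

The same split could just as easily be lenient vs harsh markers, and the word list could be intersected with something like the academic word list mentioned above.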
So, I’m thinking about why that might be:
- There isn’t enough data to show patterns
- The data we have wouldn’t show the information I’m looking for even if we had more of it (we either need additional data, or different data – on the former, that might include repeated tests, pre-post assessments, or other information from running the same tasks again; on the latter it might include changes to the tasks such as varying the qualities of the reference texts, etc.).
- There are patterns there, I just haven’t found them yet/used the right method
E.g. it may need some more manual analysis to guide the computational analysis.
So in terms of next steps, addressing each of those is an option (and suggestions are very welcome). There’s also the potential to think about how the task is constructed and, for example, to foreground the data to students in a different way (e.g. visualising the comments and their divergence from the reference scores in some way) in order to support their sensemaking around the feedback they receive on the task.
I’m also wondering what item response theory folks would say about the task (not my area, but interesting in light of Geoff Crisp’s recent talk).
Some suggestions from my wonderful colleague Andrew Gibson:
- Can the students recognise extreme examples? – very good versus very bad. (5+ short exemplars)
- Can the students recognise close examples? – good vs not so good. (3+ longer exemplars)
- Do they improve after feedback? – another good vs not so good, but looking for improvement on 2. (3+ longer exemplars)