Etherpad for co-operation/collaboration learning analytics

[Image: The Etherpad at #MozParty Newcastle]

One of my data sources is the set of etherpads that were used to write a report (mostly in small groups). This data includes the final output, with some basic formatting, in which we could look for the presence of URLs, cue-phrases, etc. It also includes the whole revision history of each etherpad.

I’ve been looking around for some code to help me out, and to my surprise it appears that (a) not much learning research has been done on such data (specifically using trace data, rather than video, etc.) and (b) where research has been conducted, it used custom tools/code that have not been released.

From the academic side, a quick look on Google Scholar (the first 10 pages of ~30, plus some citation following; note this probably excludes some Google Docs and wiki-related research and similar) turned up a few interesting things:

  1. This cool project – Collabode – which developed a collaborative real-time coding space using Etherpad+Eclipse alludes to such analysis in the paper about it, but I can’t see any expansion or available code.
  2. Stian Håklev has also talked about some cool ideas around etherpad idea-convergence and scripts to work with etherpads.
  3. Another paper (Hirsch, Benjamin, et al. “Collaborative learning in action.” Teaching, Assessment and Learning for Engineering (TALE), 2013 IEEE International Conference on. IEEE, 2013.), which used an etherpad, describes analysis, but again without code: “Each keystroke entered into the CLEs collaborative editing pad is recorded, including edits, deletes, copy/pastes, etc., and is stored in a database. A statistics module was built in order to visually display information about the pad (see Fig. 2), including a breakdown of each students contribution (such as typed characters, final characters, copy/paste/delete actions) along with a means to see how students contributed to the pad over time. This tool can be accessed by faculty in order to assist in forming an understanding of each individual’s contribution to a collaborative assignment.”
  4. A similar paper: Vahakangas, Taneli, and Joel Pyykko. “VisciPad: Peeking into a Collaborative Creative Writing Project in Elementary School.” Creating, Connecting and Collaborating through Computing (C5), 2012 10th International Conference on. IEEE, 2012.
  5. Which cites this nice paper: Southavilay, Vilaythong, Kalina Yacef, and Rafael A. Calvo. “Process Mining to Support Students’ Collaborative Writing.” EDM. 2010. It used process mining (ProM) and an LSA tool to analyse contribution types in Google Docs (again, I can’t see anything reproducible/downloadable). The tool is described in more detail in this paper: Southavilay, Vilaythong, Kalina Yacef, and Rafael A. Calvo. “WriteProc: A framework for exploring collaborative writing processes.” ADCS 2009 (2009): 129.
  6. Liu, M., Calvo, R. A., & Pardo, A. (2013, July). Tracer: A Tool to Measure and Visualize Student Engagement in Writing Activities. In Advanced Learning Technologies (ICALT), 2013 IEEE 13th International Conference on (pp. 421-425). IEEE.  – Tracer looks interesting, but I can’t see any code
  7. Southavilay, V., Yacef, K., Reimann, P., & Calvo, R. A. (2013, April). Analysis of collaborative writing processes using revision maps and probabilistic topic models. In Proceedings of the Third International Conference on Learning Analytics and Knowledge (pp. 38-47). ACM.

    From the abstract, analysis includes: “(1) the revision map, which summarises the text edits made at the paragraph level, over the time of writing. (2) the topic evolution chart, which uses probabilistic topic models, especially Latent Dirichlet Allocation (LDA) and its extension, DiffLDA, to extract topics and follow their evolution during the writing process. (3) the topic-based collaboration network, which allows a deeper analysis of topics in relation to author contribution and collaboration, using our novel algorithm DiffATM in conjunction with a DiffLDA-related technique”

  8. Handayani, N. S. (2012). Examining the Writing Phases and Revision Patterns in Online Collaborative Writing: What Can We Learn from Them?. Malaysian Journal of Distance Education, 14(2), 39-62.

And I’ve seen a few discussions in forums and plugin spaces:

  1. This plugin exports pads along with authorship metadata for each contribution (but only at the line level)
  2. This plugin also exports authorship metadata, possibly at a finer grain?
  3. A developer was working on a stats plugin (commercial license; here’s another, dead, thread on it) which (according to John, whom I contacted) includes:
    1. Characters
    2. Word counts
    3. Revision counts
    4. Saved revisions
    5. Authors
    6. A set of author stats, including (a) n of words contributed, (b) n of lines contributed to, (c) n of lines as only contributor, (d) n of characters

So the third is probably worth a look, and the second might be useful if the ‘spans’ of authorship colour are fully exported in the HTML (much easier to work with than the etherpad data structure).
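
If the authorship spans do survive in the exported HTML, pulling them out should be fairly straightforward. Here is a minimal sketch using Python’s standard `html.parser`, assuming each authored run is wrapped in a `<span>` whose class starts with `author-`. That prefix, and the names `AuthorSpanParser` and `author_runs`, are assumptions for illustration only; I haven’t confirmed what either plugin actually emits.

```python
from html.parser import HTMLParser


class AuthorSpanParser(HTMLParser):
    """Collect (author, text) runs from spans tagged with an author class.

    ASSUMPTION: authored text is wrapped in <span class="author-XYZ ...">.
    Adjust CLASS_PREFIX to whatever the real export uses.
    """

    CLASS_PREFIX = "author-"  # hypothetical class prefix

    def __init__(self):
        super().__init__()
        self.runs = []     # (author, text) tuples, in document order
        self._stack = []   # author of each open span (None if unattributed)

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        author = next(
            (c[len(self.CLASS_PREFIX):] for c in classes
             if c.startswith(self.CLASS_PREFIX)),
            None,
        )
        self._stack.append(author)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # The innermost enclosing span that names an author wins.
        author = next((a for a in reversed(self._stack) if a is not None), None)
        self.runs.append((author, data))


def author_runs(html):
    """Return the list of (author, text) runs found in an HTML string."""
    parser = AuthorSpanParser()
    parser.feed(html)
    parser.close()  # flush any text buffered after the last tag
    return parser.runs


sample = '<span class="author-a1">Hello </span><span class="author-b2">world</span>'
for author, text in author_runs(sample):
    print(author, repr(text))
```

Text outside any author span comes back with `None` as its author, which makes it easy to spot if the export turns out to drop attribution somewhere.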

Given the above bits of research, and thinking about (a) what etherpad records and (b) what sort of things we’re interested in for learning contexts, it’s interesting to consider what we’d want from an etherpad analytic tool. For example:

  1. The stats from the ep_stats plugin above (especially author contribution counts, and a proportion based measure here).
  2. N of ‘touch points’ – e.g. if every other word were written by a different author, the N of touch points would be 1/2 the N of words. We’d probably want some way to express this as a number between 0 and 1.
  3. N of uninterrupted blocks (similar to (c) above: ‘n of lines as only contributor’)
  4. Temporal analysis?
    1. Perhaps including contribution over time versus contribution in the final pad (a crude ‘survival’ measure), or deletion over time (and authorial deletion). E.g. did one author appear to contribute less when in fact their edits were deleted?
    2. Do groups engage in different processes? E.g. working on their own sections throughout versus co-editing throughout; making notes and then refining versus starting from ‘clean’ text; ‘linear’ editing (adding at the bottom) versus other forms of insertion, etc.
    3. Possibly tied in with other trace data (e.g. when URL x was inserted, was it being talked about in the chat?)
  5. Topic-based analysis (possibly related to temporal) as per Southavilay et al. (2013) above. We might be interested in whether individuals contribute to just one topic or across several, and whether they tend to start topics, join them later, or a mixture, etc. (On topic things, see also Stian’s stuff around topic tags in ‘2’ above.)
  6. Chat data (if using the etherpad chat), e.g. did those chatting more edit more, do chats and edits co-occur
  7. ???
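
As a rough illustration of the crude ‘survival’ measure in 4.1 above, the sketch below replays an edit log while tracking who wrote each character, then compares what each author typed with what is left in the final pad. The `(author, op, pos, payload)` tuples are a hypothetical, simplified format, not Etherpad’s own changeset encoding, so real revision data would need converting into something like it first.

```python
from collections import Counter


def replay(edits):
    """Replay a simplified edit log, tracking the author of every character.

    Each edit is (author, op, pos, payload), a HYPOTHETICAL format:
      (author, "insert", pos, text)  inserts text at character offset pos
      (author, "delete", pos, n)     deletes n characters starting at pos
    Returns (typed, surviving, deleted_by).
    """
    doc = []                # one author id per character currently in the pad
    typed = Counter()       # characters ever inserted, per author
    deleted_by = Counter()  # (deleter, original author) -> characters removed
    for author, op, pos, payload in edits:
        if op == "insert":
            doc[pos:pos] = [author] * len(payload)
            typed[author] += len(payload)
        elif op == "delete":
            for original in doc[pos:pos + payload]:
                deleted_by[(author, original)] += 1
            del doc[pos:pos + payload]
        else:
            raise ValueError(f"unknown op: {op!r}")
    surviving = Counter(doc)  # characters left in the final pad, per author
    return typed, surviving, deleted_by


def survival_rates(typed, surviving):
    """Fraction of each author's typed characters still in the final pad."""
    return {a: surviving[a] / n for a, n in typed.items() if n}


edits = [
    ("ann", "insert", 0, "hello world"),
    ("bob", "insert", 5, " big"),
    ("ann", "delete", 0, 5),   # ann removes her own "hello"
    ("bob", "delete", 4, 3),   # bob removes three of ann's characters
]
typed, surviving, deleted_by = replay(edits)
print(survival_rates(typed, surviving))
print(deleted_by)
```

Keeping `deleted_by` keyed on (deleter, original author) is what lets you separate self-editing from having your text removed by someone else, which is the ‘authorial deletion’ question.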

One thing I’m interested in is a very simple operationalisation in which:

  1. Collaboration is taken to be the extent to which authors interact around the same text (i.e., their edits ‘touch’ more often, as per ‘2’ above)
  2. Co-operation is taken to be the extent to which authors edit their own areas, so that their contributions are ‘stacked’ rather than interlinked (i.e. there are fewer edits touching, even though the overall pad size might be similar, as per ‘3’ above)
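
A minimal sketch of how this could be turned into numbers, assuming the final text has already been reduced to a list of author ids, one per word. Here a ‘touch point’ is counted as any boundary between adjacent words where the author changes, and the count is divided by the number of boundaries to give a value between 0 and 1. That definition and normalisation are just one possible choice, and the function names are made up for illustration.

```python
from itertools import groupby


def touch_point_ratio(authors):
    """Share of adjacent word pairs written by different authors, in [0, 1].

    ASSUMPTION: a touch point is any boundary where the author changes.
    0.0 means a single author (or fewer than two words); 1.0 means the
    author changes at every word.
    """
    if len(authors) < 2:
        return 0.0
    touches = sum(a != b for a, b in zip(authors, authors[1:]))
    return touches / (len(authors) - 1)


def uninterrupted_blocks(authors):
    """Collapse the sequence into (author, length) runs of one author."""
    return [(a, sum(1 for _ in run)) for a, run in groupby(authors)]


stacked = ["ann"] * 4 + ["bob"] * 4   # each author keeps to their own area
interleaved = ["ann", "bob"] * 4      # authors alternate word by word

print(touch_point_ratio(stacked), uninterrupted_blocks(stacked))
print(touch_point_ratio(interleaved), len(uninterrupted_blocks(interleaved)))
```

Both example pads are the same size and have the same per-author word counts, but the ‘stacked’ one scores low (more co-operative, on this reading) and the interleaved one scores 1.0 (more collaborative).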

What else might we want?  And does anyone know any (ideally easy, or implemented) ways to do any interesting things with such data?


This Post Has 15 Comments

  1. Stian says:

    Hi Simon,

    interesting problem and great review. I have been using Etherpad quite a lot myself, however mostly for cowriting in the classroom, where there’s often a single person taking notes, so I haven’t looked into detailed tracking of edits etc. One of my ideas would be to convert it into a format that already has mature analysis tools, for example it would be interesting to turn it into a git repository (and shouldn’t be very hard). There are some tools for visualizing contributions, movement of text etc. (The same with a mediawiki site). However, this assumes more of a “handing off” editing strategy (I do some edits, you do some edits). If you have multiple people writing at the same time, the git commits would not make any sense. I guess it would depend a lot on what you are actually looking for.

    I’ve been more focusing on “orchestration tools” related to managing multiple Etherpads – right now I can quite easily set up a bunch of Etherpads with initial text, and the nice thing is that once they’re in use, they function as two-way channels – I can read text from them, and redistribute them to other pads, or I can push out additional information during the class (which I used to first have people brainstorm, then go to different URLs to get information, then have their ideas sent to other pads for peer review etc).

    I’ve thought about building tools that make it easier to monitor this – for example some kind of live-ish activity graph showing how people are editing the pads you set up, maybe with screenshots, so you can see whether all groups are doing OK, etc.

    I’ve also thought about pattern analysis of wikis. A particularly interesting example is that for a recent course, we had an open book exam where only the wiki was allowed as a source of information. Before the exam, students came together to edit an impressive study guide, and during the exam I casually observed various access strategies – some people went to the week page for a certain topic, some used search extensively, and others had their own notes typed up, and mostly referred to a single page. Since I logged every single page access during the exam, and I have ethics approval to use this together with the final grades etc, there might be some interesting ways of analyzing this data. 🙂

    Would love to talk more.
