My Year – developments, collaborations, skills

Sitting outside in Ireland (near the place below) seems as good a time as any to think about how the year’s gone, collaborations forged, skills developed, etc. (this is also one of the last things I need to slot into my 1st year report before I’m done with it!).  While this is primarily something I’m writing for my own purposes – as is the rest of the blog – aspects of it may be of interest to others working in similar areas or possibly just thinking about reflecting on the year (and look, there’s a pretty picture).

Blog as staging ground

First off, the blog has been a staging ground for a lot of my thinking this year – I’ve often drafted thoughts here (where they are searchable and organised in posts) and then copied much of the text into formal publications.  That’s been useful to me, perhaps partly because I can “publish” things unfinished here, and it pushes me away from my horrible habit of having dozens of documents open at any one time (although I still have a problem with >400 tabs open!).

When I set the blog up I was pondering pulling a Belshaw, and writing my thesis largely online from the beginning (on wikis and wordpress).  I was particularly interested in using CommentPress or for paragraph level feedback, and ZotPress for citation management.  In the end, although the blog has been a drafting space, it hasn’t been in the same way as I’d originally conceived.  That’s partly because of the way I like to write.  It’s also because I was shouted out for having a hideous blog (thanks Gol)…so I (Gol) fixed it up, including removing the para-level commenting (which in any case no one was using).  ZotPress has remained, and it’s actually got much better (I’d highly recommend it to academic bloggers) but for longer pieces of writing the use of shortcodes is still problematic (you need to flit between the published and raw versions to see what citation the shortcode refers to – and sometimes that’s just not workable).

My main plan for the blog next year is to move to self-hosting or at least my own domain name – long term it’s pretty stupid to send people to an institutional webpage given I’m likely to only be there for another 2 years.  I’d welcome any other suggestions on things to improve!

Conferences and Meetings

I’ve been lucky enough to attend a few conferences this year, with a couple of other things lined up for the future.

Wikimedia visits (including San Francisco and Lincoln AGM)

One of the things I’ve been exploring this year is the potential of mediawiki for learning analytics, and whether wikimedia contributions could be badged.  That would be particularly interesting for me given that such platforms might offer insight into epistemic cognition, and that the platform might be used as a collaborative information seeking/sharing one (see below).

More broadly I’m also interested in the potential of collective intelligence tools such as Wikipedia for learning environments and OER development (and talked about this at the WMUK AGM/Conference in Lincoln earlier in the year), and am particularly keen to encourage people to edit Wikipedia – including editing the Learning Analytics article and Massive Open Online Course article.  Prior to this year I’d never edited Wikipedia (or, I had, but it was one minor correction some time ago); my work on ORBIT involved using the platform which helped develop the skills, but it’s something I’ve become more interested in and both the platform-skills around organising knowledge, and the specific skills of writing a Wikipedia style article are pretty valuable and something I’d recommend thinking about.

CSCW13 (both workshops) – San Antonio, Texas

In February I attended CSCW in San Antonio, Texas with two workshop papers.  One of those was on the use of a toolset (including Cohere) to support collaborative sensemaking in collaborative information seeking environments; in an updated form this is included as part of my first year report.  In addition, that workshop has led to a collaboration in which two other attendees and I are co-authoring a paper on the nature of ‘context’ in CIS.  The other workshop – on the relationship between CSCW and Education – was also a useful networking event, and I have kept in contact with (and since met up again with) at least one other attendee.

LAK13 – Leuven, Belgium

Almost immediately upon starting the PhD we (my supervisors and I) set about writing a submission for the 3rd Learning Analytics and Knowledge conference, held in Leuven, Belgium.  In the end, what we submitted was nominated for the ‘Best Paper’ award, and will be revised and updated for submission to the first issue of the Journal of Learning Analytics (as well as forming a significant chunk of the earlier parts of my first year report).  Much of the work on this paper also informed my and Simon’s work on the 6th week of the Learning Analytics Open Course this year – which was on epistemology and LAK, and included talks from me, George Siemens (on connectivism), and David Williamson Shaffer (on epistemic games).

In addition to that paper, Karen and I wrote a paper for the Discourse Centric Learning Analytics workshop on the importance of context for educational discourse, and some challenges to DCLA of context.  This paper informed my subsequent analysis of DCLA techniques and a paper (in draft, included as a Work in Progress in the report) on the multiple levels of context in the analysis of exploratory dialogue, some challenges for machine learning techniques, and a proposed method.

LASI – Palo Alto, California

In early July, I was at Stanford (in fact, I’ve really only just returned) at the Learning Analytics Summer Institute.  Unusually, I didn’t write a blog post afterwards, for two reasons: 1) I was exhausted, so I went to the coast and just chilled for a couple of days afterwards; 2) it was so busy, and there were so many small useful conversations, that it was hard to summarise or think of anything useful to say to other people (other than what I’d been tweeting, etc. during the event).  In terms of “things to report”: there are a number of hopeful collaborations (e.g. a group of us looking at information seeking/knowledge management including Dragan, Bodong and Emily); I got to talk to some Google people and I’m hoping a few things will come of that; I finally met Rebecca Eynon, who holds a joint post between the Oxford Internet Institute (OII) and the Education dept there and works on young people’s internet use; I had a really useful chat with Carolyn Rosé about machine learning stuff; Tony Hirst and I finally met (despite both being at the OU, this hadn’t happened yet!) and had some interesting chats about search based pedagogy [jfgi] and the scope for interesting research using Google Trends/Insights.

Society of the Query Conference – Amsterdam

One of my continued interests is in how we conceptualise knowledge, particularly in the context of tools such as Google and Wikipedia.  Paul Matthews wrote a piece on this in the context of social epistemology (he a) was at CSCW, and b) is someone I hope to write something with in the not too distant future), and I also have great hopes for the Extended Knowledge Project (which I hope to be able to contribute to).  In this area, I’ve been invited to contribute to the 2nd Society of the Query Conference in Amsterdam this November on the subject of Education and the role of context (see below), and to submit a piece to their reader.

The particular panel is:

4. Search in context

There is a long-term cultural shift in trust happening, away from the library, the book store, even the school towards Google’s algorithms. What does that mean? How are search engines used in today’s classrooms and do teachers have enough critical understanding of what it means to hand over authority? We think we find more and in a faster way, while we might actually find less or useless information. The way we search is related to the way we see the world – how do we learn to operate in this context?

Specific projects – including:

Over the year I’ve also been working on specific projects, particularly developing skills and practical work around the literature I’ve been reading and writing about.

Developing a CIS environment for Epistemic Commitments

A core deliverable for my work is the development of an environment on which to conduct my research.  Over the year I’ve been exploring a range of internal (Cohere and Evidence Hub particularly) and external (mediawiki, and the existing CIS environments in particular) tools which could be used for my research.

As a part of this I learnt how to use WAMPserver to mirror an existing Wiki (the Schome project at the OU) and download and install extensions to that Wiki.  I have also explored the use of Google Analytics for tracking user behaviours on websites (not appropriate due to constraints on identifiable information), and the potential to use external feeds (RSS) to seed another environment (Cohere).

From these explorations and my reading I designed a specification for a tool in collaboration with my supervisors and Michelle Bachler (a developer in KMi) which will be a Firefox addon for the EvidenceHub tool, developed by Michelle.  In addition I have had useful conversations with others about developing tools to explore epistemic commitments (e.g. Sean Lip at Google).


Another core piece of my work is around educationally productive dialogue.  In particular, given my interest in epistemic commitments a core part of my project is to identify when commitments are ‘accountable’ in the group (i.e. are within the scope of exploratory/accountable dialogue).  To some extent keyword spotting may be enough here particularly within the constrained environment of the EvidenceHub (and indeed, that is the finding of the Epistemic Games group).  However, given existing work in the department to further develop from bag of words approaches, and my own paper with Karen Littleton discussing the role of context in exploratory dialogue, it was of interest to explore how machine learning techniques might be used for such classification.
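As a rough illustration of what keyword spotting could look like in practice, here is a minimal R sketch.  The cue phrases and the example utterances are hypothetical (they are not the Epistemic Games group’s list, nor what the EDDM tool uses); the technique is nothing more than case-insensitive pattern matching.

```r
# Hypothetical cue phrases sometimes associated with exploratory talk;
# this list is an assumption made purely for illustration.
cues <- c("because", "i think", "what if", "do you agree", "why")

# Hypothetical utterances standing in for contributions to a discussion.
utterances <- c(
  "I think this source is reliable because it cites the study",
  "Paste that link in",
  "Why do you trust that site?",
  "ok"
)

# One regular expression that matches any cue as a whole word or phrase.
pattern <- paste0("\\b(", paste(cues, collapse = "|"), ")\\b")

# Flag utterances that contain at least one cue (case-insensitive).
flagged <- grepl(pattern, utterances, ignore.case = TRUE, perl = TRUE)

# Count how many cue matches each utterance contains.
hits <- regmatches(
  utterances,
  gregexpr(pattern, utterances, ignore.case = TRUE, perl = TRUE)
)
cue_count <- lengths(hits)

result <- data.frame(utterance = utterances,
                     exploratory = flagged,
                     cue_count = cue_count)
print(result)
```

A sketch like this ignores context entirely (a “because” in a joke counts the same as one in a justification), which is exactly the limitation that motivates looking at machine learning approaches.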

Therefore, over the course of the year I have learnt to deploy the existing Exploratory Discourse Detection Module (see section on Maturing EDAM in the first year report) which is built on the MALLET command line tool.  My checking of outputs from this tool has been conducted in Excel (the tool essentially produces a .csv format), and I would not anticipate delving further into work with MALLET.  I have, however, further explored the use of GATE, and to some extent WEKA (and  I am pleased to have met one of its founders – Ian Witten – at LASI).  I have also had useful discussions with Elijah Mayfield who developed the LightSIDE tool, and has been kind enough to share the code he used to detect ‘authoritative talk’ using an ‘Integer Linear Programming’ approach.  I would hope over the course of the PhD to be able to utilise GUI tools such as LightSIDE and GATE in appropriate contexts, while also working with machine learning specialists to develop custom tools.  To that end I have had helpful technical conversations (for which I am very grateful) with Carolyn Rosé and Elijah Mayfield, Zhongyu Wei (who conducted much of the original work on the EDDM tool), and Yulan He (who also worked on the EDDM tool, and who we hope to continue to work with).  The joint paper with Karen Littleton on Maturing EDAM includes a technical proposal for continued work which we hope provides a specification for the next generation tool.

In 2012 Microsoft released a dataset from the social search tool ‘‘ to researchers.  The tool is an experimental social network in which, when one searches, a post is created based around that search, to which interesting results from the search may be pinned.  It is multimedia intensive, and visually quite attractive.  The original intention was that the tool be used particularly in universities, although that appears to have died down a bit now – one interesting new development may be its use at TEDActive, where “conference-goers can assemble images, research links, videos, and text into collages that express their reactions and associations around the TED Talks.”

In addition to the literature review which offers a justification for the interest in dialogue around CIS, some public blog posts around interesting discussions (e.g. on whether Aliens built the pyramids) indicated it might potentially hold some interesting exploratory dialogue.  Thus, in order to attempt to investigate the exploratory properties of dialogue around CIS the dataset was requested, ethical clearance granted, and the dataset opened in R.

In R I looked at language detection modules and the potential of SNA, but mostly used it for subsetting and column classifications (e.g. if >80, then ‘yes’) – a poor use of R’s power.

A key lesson here is that while R has lots of great packages, loading a whole dataset into dataframes in R is not a good idea – instead, it is better to load the data into MySQL, do the database stuff within the database (like most subsetting, joins, etc.) and then run R commands where necessary on subsets from the larger database.
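A minimal sketch of that workflow, using the DBI package, might look like the following.  To keep it self-contained an in-memory SQLite database (via the RSQLite package) stands in for the MySQL server, which is an assumption; the table name, the toy data and the query are also hypothetical, with only the three ActionId codes taken from my own notes.

```r
library(DBI)

# In-memory SQLite stands in for MySQL here (an assumption made so the
# sketch runs anywhere); with a real server only this line would change.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical miniature of the behaviour log.
behav <- data.frame(
  UserId   = c(1, 2, 2, 3, 3, 3),
  ActionId = c(78, 12, 49, 153, 49, 7),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "behav", behav)

# Do the subsetting and counting inside the database...
sub_behav <- dbGetQuery(con, "
  SELECT UserId, COUNT(*) AS n_messages
  FROM behav
  WHERE ActionId IN (78, 49, 153)
  GROUP BY UserId
  ORDER BY UserId
")

# ...and only then work on the (much smaller) result in R.
print(sub_behav)
mean_messages <- mean(sub_behav$n_messages)

dbDisconnect(con)
```

The point of the design is that R only ever holds the aggregated rows, not the full log, so memory use depends on the size of the answer rather than the size of the dataset.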

[expand title="Those commands can be expanded here"]

<pre lang="rsplus" line="1" file="download.txt" colla="+">

#to create summary tables you can load a table using e.g.
#then tables using table(), at most basic just pass a single column for frequency see e.g.

# E.g. to convert from factor to numeric
BLE$ExploratoryProbability1 <- as.numeric(as.character(BLE$ExploratoryProbability))

#run an ifelse on numeric in this case, you can insert an & (after the 40 on this) for additional conditions
BLE$Exploratory40 <- ifelse(BLE$ExploratoryProbability1 >= 40, "yes", "no")

#To join a table and a sparsely populated table with same columns one nice way is to subset out all the Table1 rows that exist in Table2, then rbind Table1(a) and Table2
#using something like: data2[data1$char1 %in% c("string1","string2"),1]. OR do the reverse subset to the one you did before using !
NotText = subset(behavfiles, !(ActionId %in% c('78', '49', '153')))
#78=message on party, 153=message, 49=a comment.  You might want to do something with the likes data at some point too
#to filter data down subset
SubBehav = subset(behavfiles, ActionId %in% c('78', '49', '153'))

#You'll want to export the data, to do that: write.table(SubBehav1534978, "c:/mydata.txt", sep="\t")
#First though, lets put into the format EDAM will actually accept!
#blank some columns for EDAM (so not actually blank)
SubBehavExporting$TargetUserId = "NA"
#Then you can reorder columns using subsetting
SubBehavExporting1 = subset(SubBehavExporting,selectc=

#Detect language
BehavLang = detectLanguage(SubBehavExporting$Context,isPlainText=FALSE,includeExtendedLanguages=FALSE, pickSummaryLanguage=FALSE,removeWeakMatches=FALSE, hintTopLevelDomain=NULL, hintLanguageCode=Languages$UNKNOWN_LANGUAGE, hintEncoding=Encodings$UNKNOWN)
BL = cbind(SubBehavExporting,BehavLang)
BL = subset(BL, detectedLanguage == 'ENGLISH')
write.table(BL, "BL.txt", sep = "\t")

# Load behav files as:
behav1 = read.delim2("BehaviorData_0001.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
behav2 = read.delim2("BehaviorData_0002.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
behav3 = read.delim2("BehaviorData_0003.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
behav4 = read.delim2("BehaviorData_0004.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
behav5 = read.delim2("BehaviorData_0005.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
# Combine behav files using
behavfiles = rbind(behav1,behav2,behav3,behav4,behav5)

# Load user files
users = read.delim2("Users1.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
usersDel = read.delim2("DeletedUsers.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")

# Load Action logs (assigned only to PostsN, so that the users table loaded above is not overwritten)
Posts1 = read.delim2("OnelinePosts-2012.11.01-2012.11.20.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts2 = read.delim2("OnelinePosts-2012.10.01-2012.11.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts3 = read.delim2("OnelinePosts-2012.09.01-2012.10.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts4 = read.delim2("OnelinePosts-2012.08.01-2012.09.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts5 = read.delim2("OnelinePosts-2012.05.01-2012.06.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts6 = read.delim2("OnelinePosts-2012.06.01-2012.07.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts7 = read.delim2("OnelinePosts-2012.07.01-2012.08.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts8 = read.delim2("OnelinePosts-2012.01.01-2012.02.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts9 = read.delim2("OnelinePosts-2012.02.01-2012.03.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts10 = read.delim2("OnelinePosts-2012.03.01-2012.04.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")
Posts11 = read.delim2("OnelinePosts-2012.04.01-2012.05.01.txt", header = TRUE, sep = "\t", quote = "", comment.char = "", encoding = "UTF-8")

# Combine post files using
postfiles = rbind(Posts1, Posts2, Posts3, Posts4, Posts5, Posts6, Posts7, Posts8, Posts9, Posts10, Posts11)
</pre>

[/expand]



CIS on Wikipedia (R)

The lesson regarding R and MySQL was further reinforced by another case.  Wikipedia Talk pages are a place in which editors can make sense of, and share, information – I hypothesised that we could see these two distinct types of behaviour in link patterns: first, links moving from articles to talk pages (sensemaking); second, links moving from talk pages to articles (CIS in Wikipedia Talk pages).  So, with some generous help from Aaron Halfaker (who scraped the edit histories of Wikipedia to send me, for each edit on each page, every link added or removed), I set about trying to use R to process Wikipedia LinkFlow data…again, this was a mistake – although I did learn some useful R along the way (as detailed in that blog post).

While at Stanford, hosted by the Lytics Lab there, I had a chance to talk to René Kizilcec, and the weekend I left we (mostly he) did some playing around, trying to get the dataset into a readable form in order to get it into MySQL, and then reshape it (or just count) as per the discussion in that blog post, in which I discuss looking for strings of ATDR (Inserted on Article; Inserted on Talk; Deleted from Article; Removed from Talk [essentially the same thing but we need to distinguish the two]).  By looking at that, we could then count the number of times any particular link has moved from A to T or vice versa (as well as the other doubles) and we could even insert S/N – same user, not same user – on each double to explore that aspect too.  If we wanted to do a bit more, we could index each link such that if the same link appears on multiple pages it has the same ID – that would allow us to start exploring SNA potential too.
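To make the counting of ‘doubles’ concrete, here is a small base R sketch.  The data frame, its column names and the toy events are all hypothetical (the real LinkFlow data is in a different format, and is far too large to handle this way); the sketch only shows the pairing and S/N logic.

```r
# Hypothetical link events: one row per link insertion/removal.
# code: A = inserted on Article, T = inserted on Talk,
#       D = deleted from Article, R = removed from Talk.
events <- data.frame(
  link = c("x.org", "x.org", "y.org", "x.org", "y.org", "y.org"),
  code = c("T",     "A",     "A",     "D",     "T",     "R"),
  user = c("ann",   "ann",   "bob",   "cat",   "dan",   "dan"),
  time = c(1,       2,       3,       4,       5,       6),
  stringsAsFactors = FALSE
)

# Order events chronologically within each link.
events <- events[order(events$link, events$time), ]

# Pair each event with the next event, keeping only pairs on the same link.
n <- nrow(events)
first  <- events[-n, ]
second <- events[-1, ]
same_link <- first$link == second$link

doubles <- data.frame(
  link   = first$link[same_link],
  double = paste0(first$code[same_link], second$code[same_link]),
  # S = same user made both edits, N = not the same user.
  who    = ifelse(first$user[same_link] == second$user[same_link], "S", "N"),
  stringsAsFactors = FALSE
)

# Count each double, split by same / not-same user.
double_counts <- table(doubles$double, doubles$who)
print(double_counts)
```

In this toy data the link x.org goes Talk then Article with the same user (a “TA” / S double), while y.org goes Article then Talk with different users (“AT” / N).  In a database the same pairing could presumably be done with a self-join on link and event order, which is the part I still need to learn.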

Having returned to the UK and been very busy, as is René, this is currently not quite on hold, but certainly not fully active (I’m hoping to get a ‘Note’ style paper written by mid-September).  The version of the data I managed to get in to R wasn’t complete (the process must have failed at some point) so the next step is to get my version of the data into the same format we got it at Stanford, get it in to MySQL and go from there (which will include me learning to operate on the dataset within a database, and then if appropriate moving partially into R).


One of the hopes with the dataset (above) was that it might provide some interesting coded data on which Epistemic Network Analysis could be conducted.  My visit to University of Wisconsin-Madison was to learn how to use this method, and deploy it on some data – intended initially to be the dataset, but in the end I recoded my MPhil data.

ENA is based on the theory of Epistemic Frames, which posits that the important component of ‘knowledge’ is not facts and skills in isolation, but understanding how those are connected.  For example, in the case of information seeking, seeing that users seek ‘authority’ is, in isolation, not terribly informative (because standards of ‘authority’ for knowledge may be inappropriate or appropriate depending on other contextual factors).  However, understanding that a searcher’s ‘authority seeking’ talk is connected to other talk related to community practices (perhaps around who we assume authorities to be, such as scientists – a ‘value’ of that community), or seeing that searchers engage in what Shaffer would call ‘epistemic’ talk and what I have called accountable or exploratory talk (to justify their selection of authorities) is interesting.  So, we see in this example a case where simply exploring one component in isolation provides relatively little information, while looking for combinations offers more insight.  For example, we could explore not only a reliance on authority/corroboration in sourcing, but also instances in which they are more/less likely to be connected to particular types of justification (attempts to understand the material v. simply matching information to plug answers in).  This work is ongoing and has required me to learn to use the ENA tool.  Again I have a paper in draft which I hope to finish by mid-November.
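The basic intuition of looking at combinations rather than single codes can be sketched in a few lines of base R.  This is only co-occurrence counting over hypothetical coded utterances, with code names I have made up for the example; the ENA tool itself does considerably more than this, and the sketch is not a reimplementation of it.

```r
# Hypothetical coded data: one row per utterance, one 0/1 column per code.
# The code names are illustrative assumptions, not an actual codebook.
coded <- data.frame(
  authority     = c(1, 1, 0, 1),
  corroboration = c(0, 1, 1, 0),
  justification = c(1, 1, 0, 0)
)

m <- as.matrix(coded)

# crossprod(m) is t(m) %*% m: for every pair of codes it counts the
# utterances in which both codes occur together.
cooccur <- crossprod(m)

# The diagonal is just each code's own frequency; keep it separately and
# zero it so the matrix describes only connections between different codes.
code_freq <- diag(cooccur)
diag(cooccur) <- 0

print(code_freq)
print(cooccur)
```

Here ‘authority’ appears three times on its own count, which tells us little, whereas the co-occurrence matrix shows it is connected to ‘justification’ twice and to ‘corroboration’ once, which is the kind of pattern the paragraph above is interested in.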

Multiple Document Processing & Google

Hope to learn some Python

Hope to be able to play with credibility judgement work
