I’ve been a bit quiet recently – preparing for my US trip (on Monday – 5 weeks of it!), writing my first year report, and playing with [Wikipedia data to look at link flow to get some idea of collaborative information seeking]1 in that environment. So this blog describes some (fairly filthy) R on Wikipedia data. The first bit is probably a pretty common problem to do with ‘wide’ data; the coded bit at the bottom is a combination of code I’ve used so far and some notes for where I’m going (my R session’s crashed so I’m at a loose end!).

Some caveats/disclaimers: I am an idiot – I write horrible code, it’s slow, and my knowledge of libraries and some probably fairly basic things is very poor. This is a project to learn a bit of R by doing something useful – but, to be honest, I’m not putting a huge amount of effort into learning the best ways. My only prior experience is writing VBA macros…and those weren’t very nice either – you’re right to laugh (although I did share some useful ones for SIMS [here]2, almost certainly defunct now). For example, I’ve held off using RMySQL until today because I’ve never used MySQL (except via that lovely GUI :-)), so I am more than open to suggestions and ideas. I think the ‘conceptual’ side of what I’m doing makes sense, although I may have missed some things (again, do suggest).

Finally, I’m focussed on one specific task – looking at link flow. I’m aware social network analysis and other sorts of analysis might also be of interest here. I’m hoping to make the data available on our R server once I’ve finished so other people can do what they want with it. This page is as much about me saving my notes while I can’t access my script file as anything else.

Converting ‘wide’ data held in a single cell to long data

I’m trying to construct some data so I can do summaries, stats, and further work on it. The data is essentially of the form described in this thread, where some fields (in this case ‘major’) have multiple values (actual data at the end of the post):

first last sex major
John Smith M ANTH
Jane Doe F HIST,BIOL

(This question is similar too, with SQL, but again I’m not sure it helps me.)

My default position is to duplicate rows with multiple ‘major’ values and split the values across the rows – counting their separator (,), looping through copying rows, and splitting the ‘major’ equivalent into the new rows (I’d still need to work out how to do this!). But with the data I have, that’d take me from 6.7 million rows to well over double that, with huge data duplication. I’m going ahead anyway on the basis that a) on reflection I have a (separate-project) table of >40 million rows already and that’s OK; b) doing it another way would involve running within-‘cell’ (what is the equivalent name in R?) comparisons every time rather than across rows, which adds complexity for me, and my time is more valuable to me than the server’s (sorry…); and c) I think I can factorise the values in the column to split (which reduces the data?), but I don’t think that’d be possible if each ‘cell’ in the column held more than one value. Working with formulae that loop through strings of concatenated columns is something I’ll think about in the future, though, given the redundancy in this method (lots of duplication, and lots of NA columns created to accommodate, in this case, one edit with >2000 links added). Tony (Hirst) got in touch and pointed out OpenRefine has functions to do this (it looks really good!).

Note too that the issue above is not the same as a simple long→wide (or vice versa) conversion, in that the ‘wide’ values are all in the same column, separated by a comma. For those conversions there’s a nice description here: [http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/]3
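As a minimal sketch of that duplicate-and-split idea on the toy ‘major’ data above (column names are from the example thread, not my real data):

```r
# Toy data: 'major' holds comma-separated multiple values
students <- data.frame(first = c("John", "Jane"),
                       last  = c("Smith", "Doe"),
                       major = c("ANTH", "HIST,BIOL"),
                       stringsAsFactors = FALSE)

# Split each cell on the separator, then repeat the other columns
# once per value so every row ends up holding a single major
majors <- strsplit(students$major, ",")
n.each <- sapply(majors, length)
long <- data.frame(first = rep(students$first, n.each),
                   last  = rep(students$last, n.each),
                   major = unlist(majors),
                   stringsAsFactors = FALSE)
```

Jane’s row gets duplicated once per major, which is exactly the duplication worry above at 6.7 million rows.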

What the data looked like, what it looks like now

So before, my data contained:

• rev_id: The identifier for the revision
• rev_timestamp: When the revision was saved
• page_namespace: Useful for identifying articles vs. talk pages
• page_title: The page’s title
• UID
• link count: the number of links edited in that revision
• links removed href: the URLs removed
• links added href: the URLs added – the largest number here was >2000 (it was a user page; presumably they were drafting some things and moved them into main Wikipedia)

Moving that to long obviously added rather a lot of new rows… I now have a long dataset with columns:

• rev_id: The identifier for the revision
• rev_timestamp: When the revision was saved
• page_namespace: Useful for identifying articles vs. talk pages
• page_title: The page’s title
• UID
• action: either “insert” or “remove”
• page name: the raw page title (without ‘Talk:’) to keep article/talk pages together
• href: the URL of the single link added or removed

I also had to do some column-name changing to merge the links added/removed columns (these are now marked by an ‘Add’/‘Rem’ value in a new column), plus other bits as described below.

Where I’m going – link flow triples?

So I’m going to give a few bits of the code I’ve used so far.

#download and load (download.file needs a destfile; gzipped files should be written with mode "wb")
download.file("http://...SOME URL_link_set_diffs_not_empty.tsv.gz", destfile = "WikiLinks/link_set_diffs_not_empty.tsv.gz", method = "auto", quiet = FALSE, mode = "wb")


#read table
linkdiffsnotempty = read.delim("WikiLinks/link_set_diffs_not_empty.tsv", header = F, sep = "\t", stringsAsFactors = F, encoding = "UTF-8")

#set column names (matching the sample data at the end of this post)
colnames(linkdiffsnotempty) <- c("rev_id", "rev_timestamp", "rev_shal", "page_id", "page_namespace", "page_title", "user_id", "user_name", "link_count", "links_added", "links_removed")

#sum a column ignoring NAs - this was useful to do a sum on a column containing a count of the no. of links added in that edit session
sum(linkdiffsnotempty$link_count, na.rm = TRUE)

#copy linkdiffsnotempty "links added" to start stacking/rbind process to move from wide to long
LinksAdded = subset(linkdiffsnotempty, select=c(rev_id,rev_timestamp,rev_shal,page_id,page_namespace,page_title,user_id,user_name,link_count,links_added))
LinksRemoved = subset(linkdiffsnotempty, select=c(rev_id,rev_timestamp,rev_shal,page_id,page_namespace,page_title,user_id,user_name,link_count,links_removed))

#Add a column to that - "Added/Removed"
LinksAdded$Status <- "Add"
LinksRemoved$Status <- "Rem"

#Mark rows with no links by doing a 'countif' on "://" (i.e. counting how many times http:// etc. appears; needs the stringr package)
library(stringr)
LinksRemoved$Number <- str_count(LinksRemoved$links_removed, "://")
LinksAdded$Number <- str_count(LinksAdded$links_added, "://")
#Subset empty rows out
LinksAdded = subset(LinksAdded,Number!=0)
LinksRemoved = subset(LinksRemoved,Number!=0)

#Split the values and duplicate rows for each link (go long from wide)
#For wide data in separate columns the R cookbook describes good methods http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/
df <- data.frame(LinksRemoved)
#Split each comma-separated cell into a list of links, and count the elements per row
long.links_removed <- strsplit(df$links_removed, ",")
el.len <- sapply(long.links_removed, length)
LinksRemoved = data.frame(rev_id=rep(df$rev_id,el.len), rev_timestamp=rep(df$rev_timestamp,el.len), rev_shal=rep(df$rev_shal,el.len), page_id=rep(df$page_id,el.len), page_namespace=rep(df$page_namespace,el.len), page_title=rep(df$page_title,el.len), user_id=rep(df$user_id,el.len), user_name=rep(df$user_name,el.len), link_count=rep(df$link_count,el.len), Number=rep(df$Number,el.len), Status=rep(df$Status,el.len), links_removed=unlist(long.links_removed))

#To do this for the added links I needed to split the file, this is a very inelegant method, obviously you could loop through to do this, but for my purposes this was just as fast.  Note this actually loads all of these files as duplicates of the original into memory...so it's a really awful way of doing things

LinksAdded1 = LinksAdded[1:1000000,]
LinksAdded2 = LinksAdded[1000001:2000000,]
LinksAdded3 = LinksAdded[2000001:3000000,]
LinksAdded4 = LinksAdded[3000001:4000000,]
LinksAdded5 = LinksAdded[4000001:5000000,]
LinksAdded6 = LinksAdded[5000001:5811865,]

df <- data.frame(LinksAdded1)
long.links_added <- strsplit(df$links_added, ",")
el.len <- sapply(long.links_added, length)
LinksAdded1a = data.frame(rev_id=rep(df$rev_id,el.len), rev_timestamp=rep(df$rev_timestamp,el.len), rev_shal=rep(df$rev_shal,el.len), page_id=rep(df$page_id,el.len), page_namespace=rep(df$page_namespace,el.len), page_title=rep(df$page_title,el.len), user_id=rep(df$user_id,el.len), user_name=rep(df$user_name,el.len), link_count=rep(df$link_count,el.len), Number=rep(df$Number,el.len), Status=rep(df$Status,el.len), links_added=unlist(long.links_added))

#then the same again for each (...again, you could loop) 

LinksAddedCombi = rbind(LinksAdded1a, LinksAdded2a, LinksAdded3a, LinksAdded4a, LinksAdded5a, LinksAdded6a)
rm(LinksAdded1a, LinksAdded2a, LinksAdded3a, LinksAdded4a, LinksAdded5a, LinksAdded6a)
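The loop version mentioned above could look something like this – a sketch only, using a tiny two-row stand-in for LinksAdded so it runs standalone (on the real data chunk_size would be 1000000, and the column set would be the full one):

```r
# Expand one chunk of rows: each comma-separated link gets its own row
expand_links <- function(df, link_col) {
  parts <- strsplit(df[[link_col]], ",")
  n <- sapply(parts, length)
  out <- df[rep(seq_len(nrow(df)), n), ]  # duplicate rows per link
  out[[link_col]] <- unlist(parts)
  out
}

# Toy stand-in for LinksAdded (the real frame has ~5.8M rows)
LinksAdded <- data.frame(rev_id = 1:2,
                         links_added = c("http://a,http://b", "http://c"),
                         stringsAsFactors = FALSE)

chunk_size <- 1  # 1000000 on the real data
starts <- seq(1, nrow(LinksAdded), by = chunk_size)
LinksAddedCombi <- do.call(rbind, lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(LinksAdded))
  expand_links(LinksAdded[rows, ], "links_added")
}))
```

This still holds everything in memory at once, so it shares the awfulness noted above – it just avoids the six hand-named copies.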

#Rename the link columns so they match in each data.frame, then combine (rbind) the two for overall
#generic form: names(df)[names(df) == 'old.var.name'] <- 'new.var.name'
names(LinksAddedCombi)[names(LinksAddedCombi) == 'links_added'] <- 'links'
names(LinksRemoved)[names(LinksRemoved) == 'links_removed'] <- 'links'

LinksExpanded = rbind(LinksAddedCombi, LinksRemoved)

#strip [ ] and whitespace
LinksExpanded$links = gsub('[','',LinksExpanded$links,fixed=TRUE )
LinksExpanded$links = gsub(']','',LinksExpanded$links,fixed=TRUE )
LinksExpanded$links = gsub(' ','',LinksExpanded$links,fixed=TRUE )

#I want R to evaluate http://www. and http:// as the same. Could just strip them, but given URLs might be useful later, that's not a good approach...
#Alternatively, we just take what users used...saves a job

#We need to link Talk and Article name spaces. Their page_id isn't useful, so doing it through stripping 'Talk:' may be the best (only?) option
LinksRemoved$raw_title = gsub('Talk:','',LinksRemoved$page_title,fixed=TRUE )
LinksAdded$raw_title = gsub('Talk:','',LinksAdded$page_title,fixed=TRUE )

#At this point I've encountered a problem - the task never completes for LinksAdded. Subsetting into smaller files is inelegant and also ineffective it seems in this case.  One option is a loop over row ranges instead (a working sketch of what I started on):
chunk <- ceiling(nrow(LinksAdded) / 11)
for(i in 1:11){
  rows <- ((i - 1) * chunk + 1):min(i * chunk, nrow(LinksAdded))
  LinksAdded$raw_title[rows] <- gsub('Talk:', '', LinksAdded$page_title[rows], fixed = TRUE)
}

#After a lot of attempting to use variations of the same method, the shift to RMySQL seems inevitable.  First, I do what I should have done earlier and...
#Remove templates/categories from the LinksExpanded list by subsetting the namespaces out, they're interesting but add additional complexity (and extraneous data), subset these out:
LinksExpandedTA = subset(LinksExpanded, page_namespace==0|page_namespace==1)
#At some point in the future looking at link movement for e.g. between user pages and talk/article pages would be very interesting (e.g. that very sizeable move above), but I just don't have the capacity to cope with that now!

#Then setup a db connection

#Write the table to the db
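A sketch of what those two steps might look like with RMySQL – the connection details here are placeholders, and the table name is just illustrative:

```r
library(RMySQL)

# Placeholder credentials - substitute your own server details
con <- dbConnect(MySQL(), user = "me", password = "secret",
                 dbname = "wikilinks", host = "localhost")

# Write the cleaned frame to a table
dbWriteTable(con, "links_expanded", LinksExpandedTA, row.names = FALSE)

# Later, pull back all rows for a single link across talk/article pages
res <- dbGetQuery(con,
  "SELECT * FROM links_expanded WHERE links = 'http://www.victoriacross.net'")

dbDisconnect(con)
```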

#Originally in R I intended to use recast to go wide with columns: talk/article (from name_space), UID (just a straight copy over in the first instance), and maybe the date (although the rows should be ordered by this first, so it's probably not necessary). I think this would have required a loop...again, not a good idea, particularly with large data.frames.  'apply' might have been a better option.
#Having moved into a db though, we'll want to query MySQL to get a data.frame for each link represented on each page name (talk and article).  Then we want to write to a new table, with columns: 
#Page   Link   String[Talk|Article|DeletedArticle|RemovedTalk]   User[...]
#TADR (I'm pronouncing that 'ta-da!'...it's the small things wot keep me amused)
#Each link double (TA, TD, TT, TR, AT, AD, AR, AA, DA, DT, DR, DD [?], RA, RT, RD, RR [?]) can be interleaved with the user (same/different) such that...
#we have an actor-oriented triples table (TSAS, etc.; you can use a wildcard to ignore the middle one, e.g. T$AS indicates the same person put a link on talk as put it in article space) [or whatever the wildcard is]
#From the above, create a doubles table - TA = CIS, AT = sensemaking (or whatever)
#We're also interested in the first occurrence of each link and probably the second so we want to mark those in a separate column too (and user)
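For the first-occurrence marking, one possible sketch (toy columns, not my real frame): sort by timestamp, then `!duplicated()` flags the first time each link appears.

```r
# Toy frame: flag the first appearance of each link by timestamp order
links <- data.frame(href = c("http://a", "http://b", "http://a"),
                    rev_timestamp = c(100, 150, 200),
                    stringsAsFactors = FALSE)
links <- links[order(links$rev_timestamp), ]
links$first_occurrence <- !duplicated(links$href)
```

The second occurrence would need another pass (e.g. `duplicated()` minus later repeats), but the same idea applies.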

#For future stuff it might be interesting to reshape to wide for each link, with a column for pagenames - so we could see how links were moved around wikipedia...but I think we'd need some deeper analysis of where that was likely (semantically related pages) first.  
#It might, however, be interesting to see if there are common links across all pages (that could give us a subset making it worth doing ^^ that with).
#Using factors to index links and pages is also worth looking at (at the moment they're all just stored as char I think)
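The factor idea can be sketched like this (toy vector; whether it actually saves memory depends on R’s string cache, so it would be worth measuring with `object.size()` on the real data):

```r
# Toy illustration: repeated character links stored as a factor
links.chr <- rep(c("http://a.example", "http://b.example"), 1000)
links.fac <- factor(links.chr)

# A factor stores one copy of each distinct link plus integer codes
nlevels(links.fac)
```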

Some sample data from the head of the original file is below; it’s tab delimited, so imagine it all being on the same row (easier to break here):

> head(linkdiffsnotempty)
     rev_id rev_timestamp                                 rev_shal page_id
1   7171936    1099772601 bad9d3e017094101319610f59510346939129fb1 1139257
2   8134663    1101670454 c7c479206925b1c937f3e6b245016a548f0d0b86 1219556
3 106201915    1170810868 02a8f5965bf6e3006a85a795ecbd27ac254c44ec 1219556
4 274881064    1236156945 029b24af56bf27ba113b665135aac130882c4d48 1219556
5  11405728    1111459229 055922859737417c7edee757696240a42b61b8d2 1139257
6  11580672    1111525002 1859b2a9372dfb0a7d89ddb0b042f3c8f3d775cf 1139257
  page_namespace   page_title user_id      user_name link_count
1              0         DUKW    1325      Ortolan88          2
2              0 Kulbir Thapa  142018 J.N. Houterman          2
3              0 Kulbir Thapa   59986            PDH          1
4              0 Kulbir Thapa  736651    Ohconfucius          1
5              0         DUKW   84951 ArnoldReinhold          3
6              0         DUKW   84951 ArnoldReinhold          4
  links_added
1 [http://www.archives.gov/exhibit_hall/a_people_at_war/science_pitches_in/dukw.html, http://members.aol.com/twodukw/twodukw.htm]
2 [http://www.victoriacross.net, http://haynese.winthrop.edu/india/medals/VC/1KThapa.html]
3 []
4 [http://web.archive.org/web/20070310222143rn_1/faculty.winthrop.edu/haynese/india/medals/VC/1KThapa.html]
5 [http://www.transchool.eustis.army.mil/Museum/DUKW.htm]
6 [http://www.bostonducktours.com/]
  links_removed
1 []
2 []
3 [http://www.victoriacross.net]
4 [http://haynese.winthrop.edu/india/medals/VC/1KThapa.html]
5 []
6 []

Alternatively, download a sample [SampleOriginalWikiData]4, and another sample [SampleTransformedWikiData]5. For reasons I don’t understand the comments on this post aren’t working. I’ll try and get that fixed, but if anyone does have any comments either put them on the [CIS wikipedia post]1 or [tweet me]6.


  1. http://sjgknight.com/finding-knowledge/2013/05/collaborative-information-seeking-on-wikipedia-talk-pages/ “Collaborative Information Seeking on Wikipedia Talk Pages”

  2. https://sites.google.com/site/sjgknight/work

  3. http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

  4. http://sjgknight.com/finding-knowledge/static/2013/06/sample.csv

  5. http://sjgknight.com/finding-knowledge/static/2013/06/samplenow.csv

  6. www.twitter.com/sjgknight