From time to time I forget that I neither particularly enjoy coding nor am much good at it, especially web scraping (and maps, why are maps so hard?!). So I decide I’ll remind myself by picking an “oh, that should be quite a quick thing” project… and lockdown doesn’t help this impulse. For some baffling reason the ARC publishes its list of the College of Experts with only names and affiliations (mostly; a few are missing, and I know some are outdated) and a link to each institutional profile (again, some are broken, some are actually query strings [I assume they resolve], and some are missing). This means that if you want to explore who from your subject area is on the College, and/or look at the wider balance of members, you need to either know individuals by name already or hunt their information down. So I thought I’d scrape them and try to find out at least what faculty/department/school/institute/PVC office/unit (getting an idea of the issue?) they’re in. I looked at:

  1. Searching Google Scholar by name + affiliation and scraping the profile information. Issue: not everyone has a Scholar profile, Google limits requests, and the R package requires the scholar’s ID, which means you need two queries (and there’s some probably obvious error in my code).

  2. Scraping their institutional page, looking for instances of “Professor of…” etc. and “Department of…” etc., and returning the paragraphs in which those occur (the reason not to match simply ‘professor’ or ‘department’ is that I think I’d end up with a lot of false positives… there are quite a lot as it is). Issue: some profiles load via JavaScript and you need RSelenium to render them; the phrases aren’t consistent, so I’m missing some I could grab (I know using “professor NAME” would help); and for some pages too many results are returned (e.g., you received funding from ‘department of…’).

  3. Scraping their institutional page and looking for identifier links (ORCID, Scholar, ResearcherID, Scopus profile). Same issues as above really, plus this doesn’t give you any information unless you then go and scrape those, although at least their structure should be standardised (unlike uni pages). I started to write the code to do this for ORCIDs.

The 3rd approach is probably the most promising, although you might also be able to query Bing, look for snippets with the name +, and return those. The sensible thing to do would be to merge a bunch of these to run at once, tidy the outputs, etc… but I’ve reached the limits of my interest in the exercise, even for a mindless evening challenge. The code sort of works; it’s just less complete than I’d hoped, independent of mess and errors on my part. Maybe the few names here with information will be helpful to someone, or maybe some generous soul might fix my code.
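
For anyone who wants the gist of approaches 2 and 3 without reading the real thing, here is a minimal sketch of the idea. It is in Python rather than R, uses only the standard library, and is not taken from the gist: the phrase list, the Scholar and Scopus URL patterns, the names `ProfileParser` and `scan_profile`, and the sample HTML are all assumptions made up for illustration. It also only handles static HTML, so pages that render via JavaScript would still need a browser driver first.

```python
import re
from html.parser import HTMLParser

# Phrases hinting at a role or an organisational unit (assumed list, extend as needed).
ROLE_OR_UNIT = re.compile(
    r"\b(?:Professor|Lecturer|Fellow)\s+(?:of|in)\b"
    r"|\b(?:Department|School|Faculty|Institute|Centre|Center)\s+(?:of|for)\b",
    re.IGNORECASE,
)

# Identifier links to look for. The ORCID iD shape (four blocks of four, last
# character a digit or X) is standard; the other URL shapes are assumptions.
IDENTIFIER_LINKS = {
    "orcid": re.compile(r"orcid\.org/(\d{4}-\d{4}-\d{4}-\d{3}[\dX])"),
    "scholar": re.compile(r"scholar\.google\.[a-z.]+/citations\?.*?\buser=([\w-]+)"),
    "scopus": re.compile(r"scopus\.com/authid/detail\.uri\?.*?\bauthorId=(\d+)"),
}


class ProfileParser(HTMLParser):
    """Collects paragraph text and link targets from one profile page."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self.links = []
        self._buffer = None  # text pieces of the <p> currently open, if any

    def flush(self):
        # Store the open paragraph (whitespace-normalised) and reset the buffer.
        if self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
        self._buffer = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # copes with <p> tags that were never closed
            self._buffer = []
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


def scan_profile(html):
    """Return paragraphs matching a role/unit phrase, plus any identifier IDs."""
    parser = ProfileParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    hits = [p for p in parser.paragraphs if ROLE_OR_UNIT.search(p)]
    identifiers = {}
    for href in parser.links:
        for name, pattern in IDENTIFIER_LINKS.items():
            match = pattern.search(href)
            if match:
                identifiers.setdefault(name, match.group(1))  # keep first hit only
    return {"paragraphs": hits, "identifiers": identifiers}


# Made-up profile page; the ORCID iD is ORCID's own fictitious example record.
SAMPLE = """
<div>
  <p>Jane Doe is Professor of Sociology in the School of Social Sciences.</p>
  <p>She enjoys hiking.</p>
  <p>Her work was funded by the Department of Industry.</p>
  <a href="https://orcid.org/0000-0002-1825-0097">ORCID</a>
  <a href="https://scholar.google.com/citations?user=abc123XYZ&amp;hl=en">Scholar</a>
</div>
"""

result = scan_profile(SAMPLE)
print(result["paragraphs"])
print(result["identifiers"])
```

Note that the funding sentence comes back alongside the real affiliation, which is exactly the false-positive problem described in approach 2; filtering on the person’s name as well would probably cut those down.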

The code is below or in [this gist]1; the output is [here]2.




  2. table1.html