Interesting discussion data corpora

I recently published a post collecting together student writing corpora. I’m also interested in ‘discussion’ style corpora: for example chat data, social Q&A, discussion boards, and news, article, or blog comments. Some of these are openly available, and there must be more than I’ve listed here (others could of course be collected from webpages), but so far I’ve come across:

  1. Reddit released a corpus of comments in 2015
  2. Wikipedia’s talk pages (and edit comments) are of course openly available and a ripe source of discussion data (see especially here, here, and Google Scholar)
  3. There’s also, of course, the Wikipedia Teahouse Q&A data
  4. Stack Overflow data dump (under a CC license) (research project based on it)
  5. Yahoo Answers data
  6. TREC data on Q&A
  7. Usenet corpus
  8. NPS Chat Data
  9. SMS corpus

There are also some excellent tools for analysing and constructing this type of data. For example, I’d like to play with the WikiTalk page parser and the Microsoft Discussion Graph Tool:

Discussion Graph Tool (DGT) simplifies social media analysis by making it easy to extract high-level features and co-occurrence relationships from raw data. With just 3-4 simple lines of script, you can load your social media data, extract complex features such as mood, gender and location, and generate a graph among arbitrary features. Throughout, DGT automates best-practices, such as tracking the context of relationships.
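
To give a rough sense of what a co-occurrence graph among features involves, here is a minimal Python sketch. It is not DGT's script language or its implementation: the function name `cooccurrence_graph`, the record fields, and the sample posts are all made up for illustration, and it assumes the features (mood, location) have already been extracted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(records, features):
    """Count how often pairs of feature values appear in the same record.

    Returns a Counter mapping ((feature, value), (feature, value)) -> count,
    which can be read as a weighted edge list.
    """
    edges = Counter()
    for record in records:
        # Nodes are (feature, value) pairs present in this record,
        # sorted so each pair is always counted under the same key.
        nodes = sorted(
            (f, record[f]) for f in features if record.get(f) is not None
        )
        for a, b in combinations(nodes, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical, hand-written records; a real corpus would be loaded from file.
posts = [
    {"mood": "happy", "location": "Sydney"},
    {"mood": "happy", "location": "Sydney"},
    {"mood": "sad", "location": "London"},
    {"mood": "sad"},  # no location, so it contributes no edge
]

graph = cooccurrence_graph(posts, ["mood", "location"])
for (a, b), weight in graph.items():
    print(a, b, weight)
```

Each edge weight is simply the number of posts in which the two feature values appear together, so the resulting edge list could be handed to any graph library for further analysis.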

 



This Post Has 1 Comment

  1. Simon Knight says:

    http://www.hlt.utdallas.edu/~saidul/stance/stance.html This page is a distribution site of the dataset for the task of Stance Classification. Data available on this page have been collected from a popular online debate platform called CreateDebate.

    https://nlds.soe.ucsc.edu/iac2 The Internet Argument Corpus (IAC) version 2 is a collection of corpora for research in political debate on internet forums. It consists of three datasets: 4forums (414K posts), ConvinceMe (65K posts), and a sample from CreateDebate (3K posts). It includes topic annotations, response characterizations (4forums), and stance.
