I recently published a post collecting [student writing corpora][1]. I’m also interested in ‘discussion’ style corpora, for example chat data, social Q&A, discussion boards, and comments on news, articles, and blogs. Some of these are openly available, and there must be more than I’ve listed here (others could of course be collected from webpages), but so far I’ve come across:

1. [Reddit released a corpus of comments in 2015][2]
2. Wikipedia’s talk pages (and edit comments) are openly available and a ripe source of discussion data (see e.g. [especially here][3], [here][4], and [Google Scholar][5])
3. The [Wikipedia Teahouse Q&A data][6]
4. [Stack Overflow data dump][7] (under a CC license), with a [research project][8] based on it
5. [Yahoo Answers data][9]
6. [TREC data on Q&A][10]
7. [Usenet corpus][11]
8. [NPS Chat Data][12]
9. [SMS corpus][13]

There are also some excellent tools for analysing and constructing this type of data. For example, I’d like to play with the [WikiTalk page parser][14] and the Microsoft [discussion graph tool]:

> Discussion Graph Tool (DGT) simplifies social media analysis by making it easy to extract high-level features and co-occurrence relationships from raw data. With just 3-4 simple lines of script, you can load your social media data, extract complex features such as mood, gender and location, and generate a graph among arbitrary features. Throughout, DGT automates best-practices, such as tracking the context of relationships.
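
DGT has its own script language, which I haven't reproduced here. To make the co-occurrence idea in that description a bit more concrete, here is a minimal Python sketch of what "a graph among arbitrary features" could look like. It is not DGT's method or API: the function name `cooccurrence_graph`, the feature keys, and the hand-made `posts` records are all hypothetical, and it assumes the features (mood, location) have already been extracted as strings.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(records, feature_keys):
    """Count how often pairs of (feature, value) nodes share a record.

    Hypothetical helper for illustration only; not part of DGT.
    """
    edges = Counter()
    for record in records:
        # Collect the (feature, value) nodes present in this record.
        # Sorting gives each unordered pair a single canonical key.
        nodes = sorted(
            (key, record[key])
            for key in feature_keys
            if record.get(key) is not None
        )
        # Every unordered pair of nodes in one record is one co-occurrence.
        for a, b in combinations(nodes, 2):
            edges[(a, b)] += 1
    return edges


# Hypothetical records standing in for already-extracted features.
posts = [
    {"mood": "happy", "location": "Sydney"},
    {"mood": "happy", "location": "London"},
    {"mood": "sad", "location": "London"},
    {"mood": "happy", "location": "London"},
]

graph = cooccurrence_graph(posts, ["mood", "location"])

# Print edges from heaviest to lightest.
for (a, b), weight in graph.most_common():
    print(a, "--", b, weight)
```

The edge weights are plain counts of records in which both values appear together, so the result can be passed to whatever graph library or spreadsheet you prefer.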



  1. http://sjgknight.com/finding-knowledge/2015/12/student-writing-corpora/

  2. https://archive.org/details/2015_reddit_comments_corpus

  3. https://www.ukp.tu-darmstadt.de/data/discourse-analysis/wikipedia-discussion-corpora/

  4. http://www.cs.cornell.edu/~cristian/Echoes_of_power.html

  5. https://scholar.google.co.uk/scholar?hl=en&q=wikipedia+talk+pages&btnG=&as_sdt=1%2C5&as_sdtp=

  6. https://datahub.io/dataset/teahouse-corpus

  7. http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/

  8. http://www.cs.berkeley.edu/~bjoern/projects/stackoverflow/

  9. http://webscope.sandbox.yahoo.com/catalog.php?datatype=l

  10. http://trec.nist.gov/data/qamain.html

  11. http://www.psych.ualberta.ca/~westburylab/downloads/usenetcorpus.download.html

  12. http://faculty.nps.edu/cmartell/NPSChat.htm

  13. http://linguistics.stackexchange.com/questions/1412/does-anyone-know-of-text-message-corpora/1416#1416

  14. https://github.com/sdivad/WikiTalkParser