I’ve been thinking for some time about how I could use various tools that analyse text and social media data to do interesting things with our learning data. Of course, R has a set of natural language processing tools (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html) and Python has many tools, including www.nltk.org/ (Tony Hirst talks about some of these a bit here). I’m reasonably comfortable using these in lots of contexts, but having a GUI is often nice, and it opens up the potential to introduce both students and other academics to the use of (for example) text analyses in the humanities. So, this is a bit of a random selection of some tools I might want to play with (I welcome other suggestions in the comments):
Tools to explore corpora
- http://docs.voyant-tools.org/start/ Voyant Tools is a web-based text reading and analysis environment. It’s designed to make it easy for you to work with your own text or collection of texts in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word.
- I’m interested in whether we could develop a similar tool that would let you cut a corpus by a feature (for example, grade) to explore textual differences
- Wmatrix corpus analysis and comparison tool http://ucrel.lancs.ac.uk/wmatrix/ and a thesis on it http://ucrel.lancs.ac.uk/people/paul/publications/phd2003.pdf Wmatrix is a software tool for corpus analysis and comparison. It provides a web interface to the English USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains.
- Might allow us to navigate a corpus (with statistical analysis), although it has a cost associated
- http://interrogator.github.io/corpkit/doc_interrogate.html corpkit is a GUI-based corpus explorer. I think it does the same things as the first two, but that isn’t entirely clear (and I haven’t played with it yet). It’s free and open source, and it definitely allows sub-corpora comparison.
- http://www.textarc.org/ A TextArc is a visual representation of a text: the entire text (twice!) on a single page. A funny combination of an index, concordance, and summary; it uses the viewer’s eye to help uncover meaning. Here are more detailed overviews of the interactive work and the prints.
- http://graphics.cs.wisc.edu/Vis/SequenceSurveyor/TextDNA.html TextDNA: Analyzing Text as a Sequence TextDNA allows users to explore and analyze word usage across text collections of varying scale. With TextDNA, users can compare word usage between document collections (e.g., across different decades), between individual documents, or between elements within a document (e.g., chapters or acts). Word usage can be explored across raw texts, i.e., text documents not subject to processing. Additionally, word usage can be explored across different metrics, such as how frequently words are used within a document.
- Possibly interesting for exploring sequential data in our corpus of essays, including sequences of rhetorical moves
- TAPoR is generally great, Jigsaw is one interesting example http://tapor-test.artsrn.ualberta.ca/tools/509 Jigsaw is a free visual analytics application for exploring collections of documents such as text or spreadsheets. It is aimed at analysts and researchers, particularly to “help analysts reach more timely and accurate understandings of the larger stories and important concepts embedded throughout textual results” via a collection of visualizations representing aspects such as important entities and their interconnections. The visualizations include graph, temporal and connections-based, and can be viewed on a document or corpus level.
- Visualizing Linguistic Variation with LATtice http://winedarksea.org/?p=1285 How can we return from the complex yet “opaque” collection of floating point numbers to the linguistic richness of the texts they represent? I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to “drill down” to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire “gene” of the text within a logical space in relation to other texts.
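The idea noted above of cutting a corpus by a feature such as grade can be tried out in a few lines of Python before reaching for a GUI. This is only a minimal sketch: the `(grade, text)` input format, the naive regex tokenizer, and the function names are all my own assumptions for illustration, not how any of the tools listed here work.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def frequencies_by_feature(documents):
    """Build one word-frequency Counter per feature value.

    `documents` is a hypothetical iterable of (feature_value, text) pairs,
    e.g. ("A", "essay text ...") where the feature is the grade awarded.
    """
    counts = defaultdict(Counter)
    for feature_value, text in documents:
        counts[feature_value].update(tokenize(text))
    return counts


def relative_frequency(counts, word):
    """Share of all tokens in each sub-corpus accounted for by `word`."""
    result = {}
    for feature_value, counter in counts.items():
        total = sum(counter.values())
        if total:
            result[feature_value] = counter[word] / total
    return result


if __name__ == "__main__":
    essays = [
        ("A", "However, the evidence suggests otherwise."),
        ("C", "The essay is about the thing."),
    ]
    by_grade = frequencies_by_feature(essays)
    print(relative_frequency(by_grade, "the"))
```

Relative rather than raw frequencies are used so that sub-corpora of different sizes can be compared side by side.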
Navigating social media
- The Govcom.org Foundation, Amsterdam, and its collaborators have developed a software tool that locates and visualizes networks on the Web. The Issue Crawler, at http://issuecrawler.net, is used by NGOs and other researchers to answer questions about specific networks and effective networking more generally. You also may do in-depth research with the software.
- Different category, visualizing online conversations https://netlytic.org/home/ Netlytic is a community-supported text and social networks analyzer that can automatically summarize and discover social networks from online conversations on social media sites. It is made by researchers for researchers, no programming/API skills required.
- Scraping the history of a Wikipedia article’s evolution http://cloud.tapor.ca/wiscker/
- Fantastic set of tools at https://wiki.digitalmethods.net/Dmi/ToolDatabase including various tools to extract data from social media websites, a tool to compare networks over time, one to check ‘whether a URL is censored in a particular country by using proxies located around the world’, and Lippmannian Device which lets you (1) see how sources refer to particular issues; and (2) within a source, what is the balance of issues discussed?
- Discursis is a computer-based visual text analytic tool for analysing human communication. Communication can be in the form of conversations, web forums, training scenarios, and many more. Discursis automatically processes transcribed text to show participants’ individual topic use, and their interactions around topics with other conversation participants, over the entire time-course of the conversation. Discursis can assist practitioners in understanding the structure, information content, and inter-speaker relationships that are present within input data. http://www.discursis.com/index.php/about2/
Looking for particular features
- http://www.cohmetrix.com/ and http://www.kristopherkyle.com/taaco.html and other similar tools provide insight into textual structure
- ReaderBench is an automated software framework designed to support both students and tutors by making use of text mining techniques, advanced natural language processing, and social network analysis tools. ReaderBench is centered on comprehension prediction and assessment based on a cohesion-based representation of the discourse applied on different sources (e.g., textual materials, behavior tracks, metacognitive explanations, Computer Supported Collaborative Learning – CSCL – conversations). Therefore, ReaderBench can act as a Personal Learning Environment (PLE) which incorporates both individual and collaborative assessments. Besides the a priori evaluation of textual materials’ complexity presented to learners, our system supports the identification of reading strategies evident within the learners’ self-explanations or summaries. Moreover, ReaderBench integrates a dedicated cohesion-based module to assess participation and collaboration in CSCL conversations. http://readerbench.com/
- How do we test differences between two corpora? http://www.lancaster.ac.uk/fss/courses/ling/corpus/blue/l08_4.htm and http://ucrel.lancs.ac.uk/llwizard.html and https://de.dariah.eu/tatom/feature_selection.html#log-likelihood-ratio-and-feature-selection and a paper comparing statistical approaches http://sites-test.uclouvain.be/cecl/archives/PAQUOT_BESTGEN_2009_Distinctive_words_in_academic_writing_ICAME2008.pdf
- Tests for categorical/count data, including the G-test with code, at http://www.biostathandbook.com/gtestind.html and, for ordered categorical data, http://www.stat.ufl.edu/~aa/articles/liu_agresti_2005.pdf
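To make the corpus-comparison question concrete, here is a rough Python sketch of the log-likelihood (G2) statistic for a single word's frequency in two corpora, in the two-term form I understand the UCREL log-likelihood wizard linked above to use. Treat it as an illustration under that assumption rather than a reference implementation; the function names and the toy token lists are mine.

```python
import math
from collections import Counter


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """G2 for one word: observed counts vs. counts expected if both
    corpora used the word at the same rate."""
    combined_rate = (freq_a + freq_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing (0 * ln 0 -> 0)
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def key_words(tokens_a, tokens_b, top_n=5):
    """Rank every word by how strongly its frequency differs between corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    scores = {
        word: log_likelihood(counts_a[word], total_a, counts_b[word], total_b)
        for word in set(counts_a) | set(counts_b)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


if __name__ == "__main__":
    high_grade = ["argue"] * 8 + ["the"] * 12
    low_grade = ["say"] * 8 + ["the"] * 12
    for word, score in key_words(high_grade, low_grade):
        print(f"{word}\t{score:.2f}")
```

A score above about 3.84 corresponds to p < 0.05 on one degree of freedom, but the statistic does not say which corpus overuses the word, so compare the relative frequencies as well. Testing every word in a vocabulary also means many comparisons, so a stricter threshold is usually sensible.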