Tools for analysing learning texts

I’ve been thinking for some time about how I could use tools that analyse text and social media data to do interesting things with our learning data. R has a set of natural language processing tools (https://cran.r-project.org/web/views/NaturalLanguageProcessing.html), and Python has many, including NLTK (www.nltk.org/); Tony Hirst talks about some of these a bit here. I’m reasonably comfortable using these in lots of contexts, but a GUI is often nice, and it opens up the potential to introduce both students and other academics to (for example) text analysis in the humanities. So this is a somewhat random selection of tools I might want to play with (other suggestions welcome in the comments):

Tools to explore corpora

  1. http://docs.voyant-tools.org/start/ Voyant Tools is a web-based text reading and analysis environment. It’s designed to make it easy for you to work with your own text or collection of texts in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word.
    • Interested in whether we can develop a tool like this, but that would allow you to cut a corpus by feature – for example, grade – to explore textual differences
  2. Wmatrix corpus analysis and comparison tool http://ucrel.lancs.ac.uk/wmatrix/ and a thesis on it http://ucrel.lancs.ac.uk/people/paul/publications/phd2003.pdf Wmatrix is a software tool for corpus analysis and comparison. It provides a web interface to the English USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains.
    • Might allow us to navigate a corpus (with statistical analysis) although it has a cost associated
  3. http://interrogator.github.io/corpkit/doc_interrogate.html corpkit is a GUI-based corpus explorer. I think it does the same things as the first two, but that isn’t entirely clear (and I haven’t played with it yet). It’s free and open source, and it definitely does allow sub-corpora comparison.
  4. http://www.textarc.org/ A TextArc is a visual representation of a text: the entire text (twice!) on a single page. A funny combination of an index, concordance, and summary; it uses the viewer’s eye to help uncover meaning. Here are more detailed overviews of the interactive work and the prints.
  5. http://graphics.cs.wisc.edu/Vis/SequenceSurveyor/TextDNA.html TextDNA: Analyzing Text as a Sequence TextDNA allows users to explore and analyze word usage across text collections of varying scale. With TextDNA, users can compare word usage between document collections (e.g., across different decades), between individual documents, or between elements within a document (e.g., chapters or acts). Word usage can be explored across raw texts, i.e., text documents not subject to processing. Additionally, word usage can be explored across different metrics, such as how frequently words are used within a document.
    • Possibly interesting for exploring sequential data in our corpus of essays, including sequences of rhetorical moves
  6. TAPoR is generally great, Jigsaw is one interesting example http://tapor-test.artsrn.ualberta.ca/tools/509 Jigsaw is a free visual analytics application for exploring collections of documents such as text or spreadsheets. It is aimed at analysts and researchers, particularly to “help analysts reach more timely and accurate understandings of the larger stories and important concepts embedded throughout textual results” via a collection of visualizations representing aspects such as important entities and their interconnections. The visualizations include graph, temporal and connections-based, and can be viewed on a document or corpus level.
  7. Visualizing Linguistic Variation with LATtice http://winedarksea.org/?p=1285 How can we return from the complex yet “opaque” collection of floating point numbers to the linguistic richness of the texts they represent? I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to “drill down” to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire “gene” of the text within a logical space in relation to other texts.
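
The idea noted under Voyant above, cutting a corpus by a feature such as grade and comparing the pieces, can be sketched in a few lines of Python. This is only a rough illustration, not how any of the tools listed here work: the `essays` data, the `grade` field and both function names are made up for the example, and the tokenizer is deliberately crude.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and pull out runs of letters/apostrophes (crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def frequencies_by_feature(documents, feature):
    """Build one word-frequency Counter per value of a metadata feature.

    `documents` is assumed to be a list of dicts, each with a "text" key
    plus whatever metadata keys (e.g. "grade") we want to cut the corpus by.
    """
    counts = defaultdict(Counter)
    for doc in documents:
        counts[doc[feature]].update(tokenize(doc["text"]))
    return counts


# Hypothetical mini-corpus of essays tagged with a grade band.
essays = [
    {"grade": "high", "text": "The evidence suggests that the claim holds."},
    {"grade": "high", "text": "However, the evidence is limited."},
    {"grade": "low", "text": "I think the claim is good."},
]

by_grade = frequencies_by_feature(essays, "grade")
for grade, counter in by_grade.items():
    print(grade, counter.most_common(3))
```

The per-grade counters could then be fed into a comparison statistic to see which words distinguish one sub-corpus from another.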

Navigating social media

  1. The Govcom.org Foundation, Amsterdam, and its collaborators have developed a software tool that locates and visualizes networks on the Web. The Issue Crawler, at http://issuecrawler.net, is used by NGOs and other researchers to answer questions about specific networks and effective networking more generally. You also may do in-depth research with the software.
  2. Different category, visualizing online conversations https://netlytic.org/home/ Netlytic is a community-supported text and social networks analyzer that can automatically summarize and discover social networks from online conversations on social media sites. It is made by researchers for researchers, no programming/API skills required.
  3. Scraping the history of a Wikipedia article’s evolution http://cloud.tapor.ca/wiscker/
  4. Fantastic set of tools at https://wiki.digitalmethods.net/Dmi/ToolDatabase including various tools to extract data from social media websites, a tool to compare networks over time, one to check ‘whether a URL is censored in a particular country by using proxies located around the world’, and Lippmannian Device which lets you (1) see how sources refer to particular issues; and (2) within a source, what is the balance of issues discussed?

Visualising discussion

  1. Discursis is a computer-based visual text analytic tool for analysing human communication. Communication can be in the form of conversations, web forums, training scenarios, and many more. Discursis automatically processes transcribed text to show participants’ individual topic use, and their interactions around topics with other conversation participants, over the entire time-course of the conversation. Discursis can assist practitioners in understanding the structure, information content, and inter-speaker relationships that are present within input data. http://www.discursis.com/index.php/about2/

Looking for particular features

  1. http://www.cohmetrix.com/ and http://www.kristopherkyle.com/taaco.html and the other tools provide insight into textual structure
  2. ReaderBench is an automated software framework designed to support both students and tutors by making use of text mining techniques, advanced natural language processing, and social network analysis tools. ReaderBench is centered on comprehension prediction and assessment based on a cohesion-based representation of the discourse applied on different sources (e.g., textual materials, behavior tracks, metacognitive explanations, Computer Supported Collaborative Learning – CSCL – conversations). Therefore, ReaderBench can act as a Personal Learning Environment (PLE) which incorporates both individual and collaborative assessments. Besides the a priori evaluation of textual materials’ complexity presented to learners, our system supports the identification of reading strategies evident within the learners’ self-explanations or summaries. Moreover, ReaderBench integrates a dedicated cohesion-based module to assess participation and collaboration in CSCL conversations. http://readerbench.com/

Statistical analyses

  1. How do we test differences between two corpora? http://www.lancaster.ac.uk/fss/courses/ling/corpus/blue/l08_4.htm and http://ucrel.lancs.ac.uk/llwizard.html and https://de.dariah.eu/tatom/feature_selection.html#log-likelihood-ratio-and-feature-selection and a paper comparing statistical approaches http://sites-test.uclouvain.be/cecl/archives/PAQUOT_BESTGEN_2009_Distinctive_words_in_academic_writing_ICAME2008.pdf
  2. Tests for categorical/count data, including the G-test (with code), at http://www.biostathandbook.com/gtestind.html or, for ordered categorical data, http://www.stat.ufl.edu/~aa/articles/liu_agresti_2005.pdf
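
To make the log-likelihood comparison of two corpora a bit more concrete, here is a minimal Python sketch of the usual two-corpus calculation: work out the frequency a word would be expected to have in each corpus if it were used at the same rate in both, then sum observed × ln(observed / expected) over the two corpora and double it. The function name and the example counts are my own assumptions for illustration, and it is worth checking results against the Lancaster wizard linked above before relying on them.

```python
import math


def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood (G2) for one word's frequency in two corpora.

    freq1, freq2: observed counts of the word in corpus 1 and corpus 2.
    size1, size2: total number of tokens in each corpus.
    """
    total = freq1 + freq2
    # Expected counts if the word occurred at the same rate in both corpora.
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)

    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Made-up counts: a word used 20 times in one 1,000-word corpus
# and 5 times in another 1,000-word corpus.
score = log_likelihood(20, 5, 1000, 1000)
print(round(score, 2))
```

A score above 3.84 corresponds to p < 0.05 on a chi-square distribution with one degree of freedom, though testing many words at once calls for a stricter threshold.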

This Post Has 2 Comments

  1. Simon Knight says:

    Another tool (not available anywhere I think)
    “This paper reviews the gap between current methods of text visualization and the needs of corpus-linguistic research, and introduces a tool that takes a step towards bridging that gap. Current text visualization methods tend to treat the problem as a data-encoding issue only, and do not strive for interactive, tightly coupled representations of text that would foster discovery. The paper argues that such visualizations should always be linked for effortless movement between the text and its visualization, and that the visualization controls should provide continuous and immediate feedback to facilitate exploration. We introduce a tool, Text Variation Explorer (TVE), to demonstrate the aforementioned requirements. TVE allows visual and interactive examining of the behaviour of linguistic parameters affected by text window size and overlap, and in addition, performs interactive principal component analysis based on a user-given set of words.” https://benjamins.com/#catalog/journals/ijcl.19.3.05sii/details

  2. Simon Knight says:

    Use of concordancers by learners http://www.lextutor.ca/cv/conc_fb.htm

    “This paper addresses the problem of evaluating the quality of students’ productions by using evaluation methods partially based on quantitative measurements. A corpus has been created consisting of essays on literary topics written over a school year by a group of second-year students of English in a French university. The essays have been individually parsed and examined by a computer-based concordance that records measurable elements such as total words, unique words, number of sentences, forms of verbs, pronouns, etc. Individual achievement and group performance over a period of time have been recorded in the form of percentages and numbers. Working from these figures, the author will try to answer a series of questions. Can trends be discerned? Can measures like these be considered valid methods of evaluating student performance? And can computers be used to test the quality of our teaching?” https://asp.revues.org/4132?lang=en

    Concordancing with Heart: Students Analyse Their Own Writing http://www.hltmag.co.uk/aug09/idea.htm

    http://neon.niederlandistik.fu-berlin.de/en/textstat/ TextSTAT is a simple programme for the analysis of texts. It reads plain text files (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files. This version includes a web-spider which reads as many pages as you want from a particular website and puts them in a TextSTAT-corpus. The new news-reader, too, puts news messages in a TextSTAT-readable corpus file.
