Computational Text Analysis Workgroup

Aimed at introducing colleagues to Natural Language Processing (NLP)

Role

I frequently participate in a workgroup focused on NLP, but it works only with English. Latin and Greek pose interesting challenges to standard NLP processes and require their own support packages, and no one else at my university has worked with NLP on ancient languages. So the only way to learn more and develop my skills further was to teach others what I already knew and solidify my own understanding as a workgroup leader.

Software
  •  Python
  •  Jupyter Notebooks
  •  Classical Language Toolkit (CLTK)

The Goal

To create accessible and functional tutorials for colleagues with no programming experience.


Computational methods for studying Greek and Latin have really started taking off in the last decade, especially with the advent of the CLTK. After working with it myself, I knew that my colleagues should be introduced to this new methodology.

A flyer I made for the workgroup.




Overview

The workgroup was structured to introduce participants to the fundamental functions and processes of NLP, such as tokenization and the bag-of-words approach. After these, we tackled two more advanced topics, stylometry and topic modelling, both of which involved a foray into basic statistics.

You can check out the tutorials and even work through them yourself here.


Intro to Python & Jupyter Notebooks

In the first session, I introduced everyone to the very basics of programming in Python and the utter joy of using Jupyter Notebooks to do so. Participants also got a taste of what is possible through computational text analysis: to conclude, I demonstrated how we could find all the attested Latin palindromes with a simple script.
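As a rough illustration of the kind of script that closing demo involved (this is not the actual workgroup code), a palindrome search can be sketched in plain Python. The function name `find_palindromes`, the `min_length` parameter, and the sample sentence are all made up for this example, and the sketch skips the spelling normalization (u/v, i/j) a real Latin corpus would need.

```python
import re

def find_palindromes(text, min_length=3):
    """Return the sorted set of palindromic words found in `text`."""
    # Lowercase the text and keep only runs of unaccented Latin letters.
    words = re.findall(r"[a-z]+", text.lower())
    # A word is a palindrome if it reads the same when reversed.
    return sorted({w for w in words if len(w) >= min_length and w == w[::-1]})

# Invented sample text, used only to exercise the function.
sample = "Sator Arepo tenet opera rotas. Ibi non erat Anna, sed esse voluit."
print(find_palindromes(sample))  # ['anna', 'esse', 'ibi', 'non', 'tenet']
```

Running the same function over a full corpus only requires reading the files into a string first.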


Tokens, Lemmas, & Stopwords

Participants worked through some of the basic functions of NLP and started building their repertoire of tools. Tokenization, lemmatization, stopwords, stoplists, and the bag-of-words method were all covered in this session.
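To give a sense of how these pieces fit together, here is a minimal sketch using only the Python standard library rather than the CLTK. The tiny `STOPWORDS` set and the helper names `tokenize` and `bag_of_words` are assumptions made for illustration; a real stoplist is far longer, and lemmatization is left out because it needs a language-specific tool.

```python
import re
from collections import Counter

# A tiny, hypothetical stoplist; real Latin stoplists are much longer.
STOPWORDS = {"et", "in", "non", "est", "ad", "qui"}

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text, stopwords=STOPWORDS):
    """Count each token that is not on the stoplist, ignoring word order."""
    return Counter(tok for tok in tokenize(text) if tok not in stopwords)

# Vergil, Eclogues 10.69
line = "omnia vincit Amor: et nos cedamus Amori"
print(tokenize(line))
print(bag_of_words(line))
```

Note that `amor` and `amori` are counted as two different words here. Merging inflected forms like these under one headword is exactly the job of a lemmatizer, such as the ones the CLTK provides for Latin and Greek.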


Concordances & N-Grams

Working mostly with the Aeneid, I introduced the participants to concordances and n-gram analysis. While these two techniques aren't ground-breaking or new, they are foundational to NLP. Moreover, they provided a great introduction to using some of the tools participants learned in the previous session.
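Both ideas are simple enough to sketch in a few lines of plain Python. The helpers below (`tokenize`, `ngrams`, `concordance`) are hypothetical names written for this example, not functions from the CLTK or from the workgroup notebooks, and the input is just the opening of the Aeneid.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

# Aeneid 1.1-3 (opening words)
opening = ("Arma virumque cano, Troiae qui primus ab oris "
           "Italiam, fato profugus, Laviniaque venit litora")
tokens = tokenize(opening)

print(ngrams(tokens, 2)[:3])
print(concordance(tokens, "primus"))
```

The concordance shows each hit with its neighbours (here `troiae qui` to the left of `primus` and `ab oris` to the right), which is what makes it useful for reading a word in context across a whole text.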


Stylometry

Moving into slightly more advanced aspects of NLP, I guided the participants through a study of my own in stylometric analysis. Using the wonderful lesson on stylometry by François Dominic Laramée as a basis, we discussed John Burrows's Delta method and how to implement it. Our case study was on the text of Valerius Maximus and sought to analyze his authorial style.
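For readers who have not met Delta before, the core of the method can be sketched as follows: pick the most frequent words in the corpus, turn each author's relative frequencies into z-scores, and measure how far the unknown text's z-scores sit from each author's. This is a simplified standard-library sketch under my own assumptions, not the code from the lesson or from the workgroup notebooks; the function names, the `n_features` parameter, and the toy token lists are all invented for illustration.

```python
from collections import Counter
from statistics import mean, stdev

def relative_freqs(tokens, features):
    """Relative frequency of each feature word in a list of tokens."""
    counts = Counter(tokens)
    return {f: counts[f] / len(tokens) for f in features}

def burrows_delta(known, unknown_tokens, n_features=4):
    """Return {author: Delta} for an unknown text against known subcorpora."""
    # 1. Features: the most frequent words across the whole known corpus.
    pooled = Counter(tok for toks in known.values() for tok in toks)
    features = [word for word, _ in pooled.most_common(n_features)]

    # 2. Relative frequency of every feature in every author's subcorpus.
    freqs = {a: relative_freqs(toks, features) for a, toks in known.items()}

    # 3. Mean and sample standard deviation of each feature across authors.
    mu = {f: mean([freqs[a][f] for a in freqs]) for f in features}
    sigma = {f: stdev([freqs[a][f] for a in freqs]) for f in features}
    # Features that never vary cannot be z-scored, so drop them.
    features = [f for f in features if sigma[f] > 0]

    def z_scores(fr):
        return {f: (fr[f] - mu[f]) / sigma[f] for f in features}

    # 4. Delta: mean absolute difference between the two sets of z-scores.
    z_unknown = z_scores(relative_freqs(unknown_tokens, features))
    return {
        a: mean([abs(z_unknown[f] - z) for f, z in z_scores(freqs[a]).items()])
        for a in known
    }

# Invented toy data: two "authors" with different function-word habits.
known = {
    "author_a": "et in et ad non et in et".split(),
    "author_b": "non ad non in non ad et non".split(),
}
unknown = "et in et et ad in non et et in".split()

deltas = burrows_delta(known, unknown)
for author, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(author, round(delta, 3))
```

The author with the smallest Delta is the closest stylistic match. With real texts, one would use far more feature words and much larger samples than this toy example.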


Our little group working away.

Reflection


Using pre-coded Jupyter notebooks for the tutorials worked incredibly well since the participants were not experienced programmers. It saved a lot of time and allowed the workgroup to focus on the computational processes we wished to study.

This workgroup had a niche focus and thus attracted few participants. That was to be expected, but it meant that whenever one or two regular participants were absent, the group seemed less productive.

Overall, I think the workgroup succeeded in introducing participants to an up-and-coming methodology for the study of ancient languages and in getting them to think about it. Hopefully, whenever they encounter an article or presentation that employs computational linguistics, they will be better prepared to engage with it.