

Introduction to Computational Text Analysis (DHSI 2026)

Format

in person

Event Language

English

Description

Computational text analysis offers powerful tools to explore patterns of style, meaning, and theme in large bodies of text. This intensive, intermediate-level course provides hands-on training in three foundational methods widely used in digital humanities and computational linguistics. Designed to build a solid foundation, the course empowers participants to work independently and develop their skills confidently beyond the classroom.

The course introduces three key digital methods of text analysis that can be applied to virtually any type of text: stylometry, word embeddings, and topic modeling. A sample corpus will be provided for the exercises, but participants are also encouraged to bring their own corpora. The methods will be used to discuss authorship, literary style, chronology, textual themes, semantics, cultural stereotypes, and their relation to quantifiable measures.

The first method to be explored in depth is stylometry, a technique which measures textual similarity based on word or n-gram frequencies. While best known for its success in authorship attribution, stylometry is also widely used to analyze stylistic trends, thematic structures, and translator-specific features. By the end of this module, participants will be able to conduct stylometric analyses using the stylo package in R, generate visualizations of their results, and build network diagrams highlighting similarities across a corpus. They will also learn how to compare two distinct corpora and detect segments of text likely written by different authors.
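The mechanics of a frequency-based similarity measure can be sketched in a few lines. The course itself uses the stylo package in R; what follows is only an illustrative, stripped-down Python sketch of Burrows' Delta (z-scored most-frequent-word frequencies compared by mean absolute difference), run on a made-up toy corpus:

```python
from collections import Counter

def relative_freqs(text, vocab):
    """Relative frequencies of the tracked words in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total for w in vocab]

def delta(texts, vocab):
    """Pairwise Burrows' Delta: mean absolute difference of z-scored
    most-frequent-word frequencies across the corpus."""
    profiles = [relative_freqs(t, vocab) for t in texts]
    n = len(texts)
    zs = []
    for j in range(len(vocab)):  # z-score each word's frequency column
        col = [p[j] for p in profiles]
        mean = sum(col) / n
        sd = (sum((x - mean) ** 2 for x in col) / n) ** 0.5 or 1.0
        zs.append([(x - mean) / sd for x in col])
    return [[sum(abs(zs[j][i] - zs[j][k]) for j in range(len(vocab))) / len(vocab)
             for k in range(n)] for i in range(n)]

# Toy corpus: texts 0 and 1 share function-word habits; text 2 differs.
texts = [
    "the cat sat on the mat and the dog sat on the rug",
    "the sun rose on the hill and the mist lay on the field",
    "a bird flew over a tree near a lake by a road",
]
vocab = ["the", "on", "and", "a"]  # most-frequent "function" words to track
dist = delta(texts, vocab)
# dist[0][1] (similar styles) comes out smaller than dist[0][2]
```

In real stylometric work the tracked vocabulary is the few hundred most frequent words of the whole corpus, and the resulting distance matrix feeds cluster analyses and network diagrams of the kind produced by stylo.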

The second method covered in the course is word embeddings, a powerful technique grounded in distributional semantics, which represents words as vectors in a multidimensional space. This approach allows for the analysis of meaning, context, and relationships between words, uncovering connotations, cultural associations, and even implicit biases or stereotypes present within a given text corpus.
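As a rough illustration of the distributional idea (not the trained neural embeddings typically used in practice), the sketch below builds raw co-occurrence-count vectors from a hypothetical toy corpus and compares words by cosine similarity; words that appear in similar contexts end up with similar vectors:

```python
from math import sqrt

def cooc_vectors(sentences, window=2):
    """Represent each word as a vector of co-occurrence counts with
    every word in the corpus (a bag-of-contexts embedding)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = {w: [0] * len(vocab) for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    vecs[w][index[s[j]]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy corpus: "cat" and "dog" occur in near-identical contexts.
sentences = [
    "the cat chased the mouse".split(),
    "the dog chased the mouse".split(),
    "the cat ate fresh fish".split(),
    "the dog ate fresh meat".split(),
    "the mason cut the stone".split(),
]
vecs = cooc_vectors(sentences)
# cosine(cat, dog) exceeds cosine(cat, stone)
```

Trained embedding models (word2vec, GloVe, fastText) learn dense, low-dimensional versions of such vectors from very large corpora, which is what makes analyses of connotation and bias possible.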

The third and final core method covered in the course is topic modeling, a technique used to discover recurring patterns across collections of texts. Topics are clusters of words which frequently co-occur within the corpus, based on the assumption that textual proximity reflects underlying semantic relationships. It can be used for text classification, but also for identifying quantifiable features of literary style. The participants will learn how to extract topics from a corpus, identify the most prominent ones in individual texts, and track how topic distributions change across different sections of a single work.
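To make the "clusters of co-occurring words" idea concrete, here is a minimal, illustrative collapsed-Gibbs sampler for LDA in pure Python, run on a made-up corpus; real analyses would use a dedicated library such as gensim or MALLET:

```python
import random

def toy_lda(docs, n_topics=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed-Gibbs LDA: assign each token a topic at random,
    then repeatedly re-sample assignments from the conditional
    distribution implied by the current counts."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * n_topics for _ in docs]      # doc-topic counts
    nkw = [[0] * V for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                       # tokens per topic
    z = []                                    # topic assignment per token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            t = rng.randrange(n_topics)
            zd.append(t)
            ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndk[d][t] -= 1; nkw[t][widx[w]] -= 1; nk[t] -= 1
                weights = [(ndk[d][k] + alpha) * (nkw[k][widx[w]] + beta)
                           / (nk[k] + V * beta) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1; nkw[t][widx[w]] += 1; nk[t] += 1
    top_words = [[vocab[i] for i in sorted(range(V), key=lambda i: -nkw[k][i])[:3]]
                 for k in range(n_topics)]
    return top_words, ndk

# Two made-up "themes": seafaring words and court words.
docs = [
    "ship sea sail ship sea".split(),
    "sea sail ship sail sea".split(),
    "king crown throne king crown".split(),
    "throne king crown throne king".split(),
]
top_words, doc_topics = toy_lda(docs)
```

With such a cleanly separated toy corpus the sampler typically sorts the nautical and courtly vocabularies into different topics; `doc_topics` then gives each document's topic mixture, which is how topic distributions can be tracked across sections of a single work.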

In addition, the course will introduce simpler methods, such as concordance analysis and measures of collocation strength, along with various online tools that can support computational text analysis. Key concepts of machine learning will also be explained to help participants understand how certain linguistic models work.
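A concordance (keyword-in-context, or KWIC) view is simple to sketch; the function below is a hypothetical illustration, not course material, listing each occurrence of a keyword with a few words of context on either side:

```python
def kwic(text, keyword, width=3):
    """Keyword-in-context: each occurrence of `keyword` with up to
    `width` words of context on either side."""
    tokens = text.lower().split()
    lines = []
    for i, w in enumerate(tokens):
        if w == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{w}] {right}")
    return lines

lines = kwic("The cat sat on the mat because the cat was tired", "cat")
# -> ["the [cat] sat on the", "mat because the [cat] was tired"]
```

Scanning such lines side by side is often the quickest way to see how a word is actually used in a corpus before reaching for quantitative measures.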

The software used in the course can be easily installed on participants' computers. Strong coding skills are not required, though basic programming knowledge will be helpful for customizing the provided scripts for participants' own purposes. Applying the methods presented during the course will allow participants to develop tailored solutions for their own research or digital humanities projects.

Instructor(s)

Wojciech Łukasik is a digital humanist affiliated with the Center for Quantitative Research in Political Science at the Jagiellonian University, the Jagiellonian Centre for Digital Humanities, and the Department of Polish Studies. He also cooperates with the Institute of Polish Language at the Polish Academy of Sciences, where he obtained his PhD in linguistics. In his thesis, he applied digital methods including corpus analysis, stylometry, and topic modeling to a corpus of Young Poland literature. His work also involves the digitization of historical dictionaries and data processing for digital scholarly editions.

Jacek Bąkowski is a researcher at the Institute of Polish Language, Polish Academy of Sciences, with an academic background in mathematics, computer science, and linguistics. His research focuses primarily on semantic similarity measures, distributional semantics, and machine learning techniques applied to natural language processing, particularly in the context of South Asian languages. He has also worked on stylometry, lexical analysis, and authorship attribution in Sanskrit texts.

Click here for an example of a previous syllabus and course material (2025)

3150 Rue Jean Brillant
Montreal, Québec H3T 1N7 Canada