Led by: Joanna Byszuk and Jacek Bąkowski
This is a beginner- to intermediate-level course in computational text analysis. It focuses on using digital tools to enhance and deepen traditional ways of reading and analyzing texts. We will explore ways of answering questions about authorship, style (textual, chronological, and authorial), genre, and meaning, using some freely available and easy-to-use tools, such as ‘LIWC’ or ‘Stylo’, and the most commonly applied methods, such as stylometry.
While stylometry, i.e. the analysis of countable linguistic features of texts, has usually been associated with authorship attribution, the same methods are successfully applied to more general text analysis and, recently, even to the analysis of other modes such as music, image and video. The statistics of even such simple features as word, word n-gram or character n-gram frequencies are not only a highly precise tool for identifying authorship but can also reveal patterns of similarity and difference between groups of works, between individual works, or between specific voices within them, such as the idiolects of characters in novels. Such methods are also frequently applied to compare works by one author, by various authors or by various translators, as well as works that differ in chronology, genre or narrative style. The results of computational text analysis can be compared and confronted with the findings of traditional studies, opening a new set of questions about style and its transfer, as well as the nature of particular features and language.
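To make the notion of countable features concrete, here is a minimal Python sketch of how relative frequencies of words, word n-grams and character n-grams might be computed. It is an illustration only, not the course software: the function names, the deliberately simple tokenizer and the sample sentence are all assumptions introduced for this example.

```python
import re
from collections import Counter


def word_tokens(text):
    # Assumption: a deliberately simple tokenizer that lowercases the text
    # and keeps runs of letters, optionally joined by one apostrophe.
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())


def ngrams(items, n):
    # Slide a window of length n over a sequence of words or characters.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def relative_frequencies(items):
    # Share of each feature among all features counted in the text.
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


# Invented sample sentence, used only to show the calls.
sample = "the cat sat on the mat and the dog sat too"
words = word_tokens(sample)

word_freq = relative_frequencies(words)
word_bigram_freq = relative_frequencies(ngrams(words, 2))
char_trigram_freq = relative_frequencies(ngrams(list(sample), 3))
```

In practice such frequencies would be computed for every text in a corpus and restricted to a shared list of the most frequent features before any comparison is made.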
With this course, we aim to help participants build the knowledge and skills required to identify the problem they want to examine, define relevant research questions and apply the right method, and, finally, to design and complete their own experiments from corpus building to interpretation of results. Participants will learn how to use major modern stylometric methods in a reliable and reproducible manner, from simple keyword extraction and feature selection and analysis, to supervised and unsupervised machine learning based on text features, followed by visualization techniques ranging from PCA and dendrograms to networks. The software used in the course can easily be installed and run on participants’ own computers. While we do not expect participants to have strong programming skills, a basic understanding of how to run and read code will help them get more out of the course. We will provide text corpora for training purposes but also encourage participants to bring their own data and research problems to work on during the course.
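As a rough idea of what an unsupervised comparison based on text features can look like, the Python sketch below computes pairwise distances between texts from z-scored frequencies of the most frequent words, in the style of Burrows's Delta. The description above does not name this measure; it is chosen here only as a commonly used example. The toy corpus is invented, and the function names, the `n_mfw` parameter and the use of the population standard deviation are assumptions of this sketch.

```python
from collections import Counter
from statistics import mean, pstdev


def rel_freqs(text):
    # Relative word frequencies, using plain whitespace tokenization.
    words = text.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}


def delta_matrix(corpus, n_mfw=2):
    # corpus: dict mapping a text label to its raw text.
    freqs = {name: rel_freqs(text) for name, text in corpus.items()}

    # Most frequent words counted over the whole corpus.
    totals = Counter()
    for text in corpus.values():
        totals.update(text.lower().split())
    mfw = [w for w, _ in totals.most_common(n_mfw)]

    # z-score each word's frequency across texts
    # (population standard deviation is an assumption of this sketch).
    z = {name: [] for name in corpus}
    for w in mfw:
        values = [freqs[name].get(w, 0.0) for name in corpus]
        mu, sigma = mean(values), pstdev(values)
        for name, v in zip(corpus, values):
            z[name].append((v - mu) / sigma if sigma > 0 else 0.0)

    # Delta-style distance: mean absolute difference of z-scores.
    names = list(corpus)
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for a in names
        for b in names
    }


# Invented toy corpus: a1 and a2 share a function-word profile, b does not.
toy_corpus = {
    "a1": "the sea and the sky and the wind",
    "a2": "the hill and the road and the rain",
    "b": "a sea of rain and a sky of wind",
}
delta = delta_matrix(toy_corpus, n_mfw=2)
```

A distance table of this kind is the sort of input that dendrograms and network visualizations are typically built from; real experiments would use far longer texts and many more features.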
This course combines elements of courses previously taught at DHSI by the Computational Stylistics Group (Maciej Eder, Jan Rybicki, Joanna Byszuk, Jeremi K. Ochab), i.e. ‘Stylometry with R’ and ‘DIY Computational Text Analysis with R’, as well as ‘Out of the Box Text Analysis’, taught by the late David Hoover.