Led by: Chris Tănăsescu
The course offers an effective hands-on introduction to natural language processing (NLP), text and media analysis, and text and/or media corpus network visualization and analysis. It will harness the power and scale of large language models (LLMs) alongside other computing resources in analyzing both individual data points and big data, be they text or media or both. The skills, affordances, methods, and concepts will be paced and assembled into a pipeline that starts with locating, collecting/scraping, and (pre)processing relevant datasets, continues with deploying specialized libraries and developing algorithms for multi-feature data analysis, and culminates in fine-grained, holistic networked assemblages that model and scrutinize the datasets in depth and comparatively across corpora and media.
We will code in Python and learn how to use (and compare) (sub)word, text, and media modeling open-source LLMs/frameworks such as GPT (2 and later), (M)BERT, GPT-NeoX, T5, (Meta-)Llama, OLMo, and a host of others in conjunction with a wide range of relevant libraries including scikit-learn, NLTK, fastText, Stanza, and spaCy (displaCy), involving embeddings with text classifiers and/or image/video/audio vectorization, e.g., deep learning architectures, CLIP, MediaPipe, TensorFlow & Keras, PyTorch, LibROSA, etc. In this context, we will also learn how to train or fine-tune our own LLMs.
After using BeautifulSoup, Selenium, and pytesseract (Python-tesseract) to automatically collect and (if needed) OCR our data, we will translate the subsequent computational analyses into networks ranging from plain (single-layer) graphs to multiplexes to the most general multilayer networks, to be visualized and/or analyzed by means of NetworkX or, in the more specific or complex cases, in-house/indie algorithms. The translation to networks will also involve correlations between the various forms of vectorization applied to text (and/as inter)media as they coexist in, or are combined into, the modeling of the data.
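As a rough illustration of the translation from vectorized data to a multiplex, the sketch below builds two layers over the same set of nodes, one edge set per kind of vectorization. It is only a minimal sketch under stated assumptions: the three-line toy corpus, the function names, the word-length profile (a crude stand-in for a second feature or medium), and the 0.5 threshold are all invented for illustration and are not the course's own code. It uses only the Python standard library; each layer's edge dictionary could equally be loaded into a NetworkX graph.

```python
import math
import re
from collections import Counter
from itertools import combinations


def tokens(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Vectorization 1 (illustrative): raw word counts."""
    return Counter(tokens(text))


def length_profile(text):
    """Vectorization 2 (illustrative): counts of word lengths."""
    return Counter(len(t) for t in tokens(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_layer(vectors, threshold):
    """One graph layer: an edge joins two nodes whose similarity
    meets the threshold; the similarity is kept as the edge weight."""
    edges = {}
    for u, v in combinations(sorted(vectors), 2):
        weight = cosine(vectors[u], vectors[v])
        if weight >= threshold:
            edges[(u, v)] = weight
    return edges


# Toy corpus, invented for this example.
corpus = {
    "p1": "the river sings of night and stone",
    "p2": "night river, stone and song",
    "p3": "machines dream in electric code",
}

# A multiplex here is simply: same nodes, one edge set per layer.
THRESHOLD = 0.5  # hypothetical cut-off, chosen arbitrarily
multiplex = {
    "lexical": similarity_layer(
        {k: bag_of_words(t) for k, t in corpus.items()}, THRESHOLD
    ),
    "word_length": similarity_layer(
        {k: length_profile(t) for k, t in corpus.items()}, THRESHOLD
    ),
}

for layer, edges in multiplex.items():
    for (u, v), weight in edges.items():
        print(f"{layer}: {u} -- {v} (weight {weight:.3f})")
```

Keeping every layer on the same node set is what makes the structure a multiplex rather than a general multilayer network, and it allows the layers to be compared or correlated edge by edge.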
On the fifth day (Friday, June 6th), everybody will have the opportunity to participate in the #GraphPoem event, an intermedia social computing and data-commoning performance drawing on the algorithms, methods, and programming presented or developed in class.
The knowledge and skills acquired, alongside our in-class applications, will be useful in education, research, and analytical-creative work involving NLP, automated text and (mono- and multilingual) corpus analysis, network science (or graph theory) applications, inter/trans-disciplinary text (and) media studies, computational literary studies/analysis/criticism, computational linguistics, multimodal and intermedia(lity) studies and creativity, HCI creative writing and experimental/intersemiotic/literary translation, digital editions, digital poetry/e-lit/digital art, social (media/network) analysis, complexity studies in/and social science, and applications in the philosophy of mathematics.
This is a hands-on course with some lecture components.