Automatic Text Recognition of Historical Documents: Building Text Corpora and Datasets (DHSI 2026)

Event Language
English
Format
in person/face-à-face
Description
The course is an introduction to automatic text recognition technologies, focusing on the example of Kraken and eScriptorium but including an overview of other existing solutions. By the end of the course, participants will have a better understanding of machine learning through the example of automatic text recognition, and first-hand experience using transcription software, producing data and models, and publishing and reusing specialized datasets for training transcription models. They will understand how to organize a transcription campaign, individually or as a team, while complying with existing annotation standards. The course is intended for participants with little to no knowledge of automatic transcription. Students, librarians, and all scholars are welcome.
Instructor(s)
Alix Chagué is a specialist in automatic text recognition applied to historical documents. Her PhD thesis, which she will defend in 2026, focuses on questions relating to Open Science and to data creation for using and training transcription models. She has contributed to the development of several essential infrastructures for the advancement of automatic transcription, including the open source application eScriptorium and HTR-United, an ecosystem for publishing reusable gold data for text recognition. Since 2019, she has taught several workshops introducing automatic text recognition and its software solutions to both beginners and advanced users.
