How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures

Abstract

In the last decade, OCR progress has triggered a massive trend towards the digitisation of legacy documents, with several Digital Humanities projects exploring means for structuring retro-digitised dictionaries. However there is a lack of awareness of the impact of the OCRs quality on the information extraction process. In this work, we shed light on the relationship between these two steps through experiments carried out with a TEI-based system for automatic parsing of dictionaries.

Publication
19th annual Conference and Members’ Meeting of the Text Encoding Initiative Consortium (TEI) -What is text, really? TEI and beyond
Pedro Javier Ortiz Suárez
Pedro Javier Ortiz Suárez
PhD Student

I’m a PhD student in Computer Science at Sorbonne Université and at the ALMAnaCH research team at Inria