The Learnable Typewriter

A Generative Approach to Text Analysis

Ioannis Siglidis Nicolas Gonthier Julien Gaubil Tom Monnier Mathieu Aubry

LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Valle, France

ICDAR 2024 (Best Paper Award)

Abstract

We present a generative document-specific approach to character analysis and recognition in text lines. Our main idea is to build on unsupervised multi-object segmentation approaches and in particular methods which reconstruct images based on a limited amount of visual elements, called sprites. Our approach can learn a large number of different characters and leverage line-level annotations when available. Our contributions are twofold. First, we provide the first adaptation and evaluation of a deep unsupervised multi-object segmentation approach for text line analysis. Since these methods have mainly been evaluated on synthetic data in a completely unsupervised setting, demonstrating they can be adapted and quantitatively evaluated on real images, even as structured as text images, and using weak supervision are significant progresses. Second, we demonstrate the potential of the method we present for new applications, in particular in the field of paleography, which studies the history and variations of handwriting. We evaluate our approach on three very different datasets: a printed volume of the Google1000 dataset, the Copiale cipher and historical handwritten charters from the XIIth and early XIIIth century.

Approach

The Typewriter Module

Learning to Extract Fonts & Scripts

In both the supervised and unsupervised settings our method produces meaningful sprites and accurate reconstructions, even despite the high number of characters and their variability.

Google1000: scanned historical printed books.

Input, supervised and unsupervised semantic segmenation.

Copiale Cipher: an oculist German manuscript from a 18th century secret society.

More results.

Historical Fonts

For 8 different fonts from the "ICDAR2024 Competition on Multi Font Group Recognition and OCR", we compare a-t, A-T between learned sprites (ours), and manualy extracted exemplars. Fonts are sorted by descending SSIM, computed on post-processed sprites (see details in supplementary material):

Note, that although fonts come from a single family, they may present allographs. For example, Italic contains different variants of “Q”, where a random exemplar can significantly differ from the one summarized using our method.

Paleography

Focusing on a collection of 14 historical charters from the Fontenay abbey we show that our approach can be used to perform paleographic analysis. After training on the whole collection of documents, we are able to display character variations across individual documents (hard to describe with words) by simply finetuning our approach to each one of them:

Sprites learned for similar documents in Praegothica script.

'a' and 'g' sprite for each document and a associated example of the character. Note how the variations of the descending part of the 'g' sprites closely match the variations observed in the documents. Also note the subtle variations of the 'a' which are clear in the sprites but would be hard to notice and describe from the original images for a non-expert.

The appearance variations of individual instances associated to the 'e' character in the document are accurately visually summarized by the sprite.

The double appearance of the ascending line of the 'd' sprite shown on the left is related to the co-existence of two different kinds of 'd' in the document, as shown in the examples on the right.

BibTeX

        
@article{the-learnable-typewriter,
	title = {The Learnable Typewriter: A Generative Approach to Text Analysis},
	author = {Siglidis, Ioannis and Gonthier, Nicolas and Gaubil, Julien and Monnier, Tom and Aubry, Mathieu},
	publisher = {ICDAR},
	year = {2024},
}

Acknowledgements

We would like to thank Malamatenia Vlachou and Dominique Stutzmann for sharing ideas, insights and data for applying our method in paleography; Vickie Ye and Dmitriy Smirnov for useful insights and discussions; Romain Loiseau, Mathis Petrovich, Elliot Vincent, Sonat Baltacı for manuscript feedback and constructive insights. This work was partly supported by the European Research Council (ERC project DISCOVER, number 101076028), ANR project EnHerit ANR-17-CE23-0008, ANR project VHS ANR-21-CE38-0008 and HPC resources from GENCI-IDRIS (2022-AD011012780R1, AD011012905).