In Codice Ratio:
Scalable Transcription of Vatican Registers

by Donatella Firmani, Paolo Merialdo (Roma Tre University) and Marco Maiorino (Vatican Secret Archives)

In Codice Ratio is an end-to-end workflow for the automatic transcription of the Vatican Registers, a corpus of more than 18,000 pages held in the Vatican Secret Archives. The workflow consists of a character recognition phase, based on a deep convolutional neural network, followed by a transcription phase that uses language statistics. The results produced so far are of high quality and require limited human effort.

Historical handwritten documents are an essential source of knowledge concerning past cultures and societies [3]. Many libraries and archives have recently begun digitizing their assets, including the Bibliothèque nationale de France, the Virtual Manuscript Library of Switzerland, and the Vatican Apostolic Library. Due to the sheer size of the collections and the many challenges involved in fully automatic handwriting transcription (such as irregularities in writing, ligatures and abbreviations), many researchers in recent years have focused on solving easier problems, most notably keyword spotting. However, as more and more libraries worldwide digitize their collections, greater effort is being put into the creation of full-fledged transcription systems.

Our contribution is a scalable end-to-end transcription workflow based on fine-grained segmentation of text elements into characters and symbols. We first partition sentences and words into text segments. Most segments contain actual characters, but some contain only spurious ink strokes. (Perfect segmentation cannot be achieved without transcription; this is known as Sayre’s paradox.) Then, we submit all the segments to a deep convolutional neural network (CNN), designed following recent progress in deep learning [2] and the de facto standards for complex optical character recognition (OCR) problems. The labels returned by the deep CNN are very accurate when a segment contains an actual character, but can be wrong otherwise. Finally, we reassemble these noisy labels into words and sentences using language statistics, similarly to [1].
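The reassembly step can be illustrated with a minimal sketch. It is not the project's actual implementation: it assumes that the classifier returns a probability for each candidate character of a segment, that the language statistics take the form of a character bigram table, and that a segment can be discarded as a spurious stroke at a fixed cost. The function `decode`, the parameters `skip_logp` and `floor`, and all the numbers below are hypothetical.

```python
import math

def decode(segments, bigram, skip_logp=math.log(0.05), floor=1e-6):
    """Viterbi-style decoding of noisy per-segment labels.

    segments: list of dicts mapping a candidate character to the
              classifier probability for that segment (assumed format).
    bigram:   dict mapping (previous_char, char) to a probability;
              "^" marks the start of the word. Unseen pairs get `floor`.
    Any segment may also be skipped as a spurious stroke at cost `skip_logp`.
    """
    # best[last_char] = (log score, text decoded so far)
    best = {"^": (0.0, "")}
    for scores in segments:
        nxt = {}
        for prev, (logp, text) in best.items():
            # Option 1: treat the segment as noise and keep the current state.
            cand = (logp + skip_logp, text)
            if prev not in nxt or cand[0] > nxt[prev][0]:
                nxt[prev] = cand
            # Option 2: emit one of the classifier's candidate characters.
            for ch, p in scores.items():
                lm = bigram.get((prev, ch), floor)
                cand = (logp + math.log(p) + math.log(lm), text + ch)
                if ch not in nxt or cand[0] > nxt[ch][0]:
                    nxt[ch] = cand
        best = nxt
    return max(best.values())[1]

# Toy language statistics and classifier outputs (made-up numbers).
bigram = {("^", "d"): 0.5, ("d", "e"): 0.6, ("e", "i"): 0.5,
          ("d", "c"): 0.001, ("c", "i"): 0.2}
ambiguous = [{"d": 0.9, "o": 0.1}, {"c": 0.55, "e": 0.45}, {"i": 0.8, "l": 0.2}]
with_noise = [{"d": 0.9}, {"i": 0.2, "l": 0.2}, {"e": 0.9}, {"i": 0.9}]

word_a = decode(ambiguous, bigram)
word_b = decode(with_noise, bigram)
```

In the first toy input the classifier slightly prefers "c" for the second segment, but the bigram table makes "dei" the best reading overall. In the second, the low-confidence second segment is skipped as a spurious stroke.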

In Codice Ratio is an interdisciplinary project involving the Humanities and Engineering departments of Roma Tre University and the Vatican Secret Archives, one of the largest historical libraries in the world. The project started in 2016 and aims at the complete transcription of the “Vatican Registers” corpus. The corpus, which is part of the Vatican Secret Archives, consists of more than 18,000 pages of official correspondence of the Roman Curia in the 13th century, including letters and opinions on legal questions, addressed to and from kings and sovereigns, as well as many political and religious institutions throughout Europe. Never having been transcribed in the past, these documents are of unprecedented historical relevance. A small sample of the Vatican Registers is shown in Figure 1.

Figure 1: Sample text from the manuscript “Liber septimus regestorum domini Honorii pope III”, in the Vatican Registers.

State-of-the-art transcription algorithms generally follow a segmentation-free approach, in which individual characters do not need to be segmented. While this removes one of the hardest steps in the process, it requires full-text transcriptions for the training corpus, which in turn call for expensive labelling by paleographers with expertise in the period under consideration. Our character-level classification instead has a much smaller training cost, and allows the collection of a large corpus of annotated data through a cheap crowdsourcing procedure. Specifically, we implemented a custom crowdsourcing platform and enlisted more than a hundred high-school students to manually label the dataset. To overcome the difficulty of reading ancient scripts, we provided the students with positive and negative examples of each symbol. After a data augmentation process, the result is an inexpensive, high-quality dataset of 23,000 characters, which we plan to make publicly available online. Our deep CNN trained on this dataset achieves an overall accuracy of 96%, which is one of the highest results reported in the literature so far.
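The article does not detail the augmentation procedure. As an illustration only, a common form of augmentation for character images is to add slightly translated copies of each labelled glyph. The sketch below assumes glyphs are stored as small grids of pixel values; the functions `shift` and `augment` and the parameter `max_shift` are hypothetical names, not part of the project's code.

```python
def shift(image, dy, dx, fill=0):
    """Translate a glyph (list of pixel rows) by (dy, dx), padding with `fill`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            # Pixels moved outside the frame are dropped.
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = image[y][x]
    return out

def augment(image, max_shift=1):
    """Return every translation of the glyph by up to `max_shift` pixels.

    The unshifted glyph (offset 0, 0) is included in the result.
    """
    offsets = range(-max_shift, max_shift + 1)
    return [shift(image, dy, dx) for dy in offsets for dx in offsets]

# A 5x5 toy glyph with a single ink pixel in the centre.
glyph = [[0] * 5 for _ in range(5)]
glyph[2][2] = 1
variants = augment(glyph)
```

With `max_shift=1`, each labelled glyph yields nine training images that share the same label.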

The project is based in Rome (Italy) and features interdisciplinary collaborators throughout Europe and the world, including Trinity College Dublin (Ireland), the Max Planck Institute for European Legal History in Frankfurt (Germany), and the University of Notre Dame in South Bend (Indiana, USA). Domestic collaborators include Sapienza University in Rome (Italy) and two Roman high schools, namely Liceo Keplero and Liceo Montale.

The transcriptions produced so far account for lower-case letters and a subset of the abbreviation symbols. Future activities include transcription of abbreviations and upper-case letters, as well as an extensive experimental evaluation of the whole pipeline.

References:
[1] D. Keysers, et al.: “Multi-language online handwriting recognition”, IEEE TPAMI, 2017.
[2] D. Kingma, J. Ba: “Adam: A method for stochastic optimization”, arXiv, 2014.
[3] J. Michel, et al.: “Quantitative analysis of culture using millions of digitized books”, Science, 2011.

Please contact:
Donatella Firmani, Paolo Merialdo
Roma Tre University, Italy
+39 06 5733 3229
+39 06 5733 3218

Marco Maiorino
Vatican Secret Archives
Vatican City State
