by Muhammad Hanif and Anna Tonazzini (ISTI-CNR)
Archival, ancient manuscripts constitute a primary carrier of information about our history and civilisation process. In the recent past they have been the object of intensive digitisation campaigns, aimed at their preservation, accessibility and analysis. At ISTI-CNR, the availability of the diverse information contained in the multispectral, multisensory and multiview digital acquisitions of these documents has been exploited to develop several dedicated image processing algorithms. The aim of these algorithms is to enhance the quality and reveal the obscured contents of the manuscripts, while preserving their best original appearance according to the concept of “virtual restoration”. Following this research line, within an ERCIM “Alain Bensoussan” Fellowship, we are now studying sparse image representation and dictionary learning methods to restore the natural appearance of ancient manuscripts affected by spurious patterns due to various ageing degradations.
The collection of ancient manuscripts serves as history’s own closet, carrying stories of enigmatic, unknown places or incredible events that took place in the distant past, many of which are yet to be revealed. These manuscripts are of great interest and importance for historians to study people of the past, their culture, civilisation and way of life. Most of the ancient classic documents have had a very narrow escape from total annihilation. Thus, digital preservation of our documental heritage has been one of the first focusses of the massive archive and library digitisation campaigns performed in the recent years. This, in turn, has contributed to the birth of digital humanities as a science. In addition to preservation, computing technologies applied to the digital images of these documents have quickly become a powerful and versatile tool to simplify their study and retrieval, and to facilitate new insights into the documents’ contents.
The quality of the digital records, however, depends on the current status of the original manuscripts, which in most cases are affected by several types of degradation, such as spots, ink fading, or ink seeping from the reverse side, due to bad storage conditions and the fragile nature of the materials (e.g., humidity, mould, ink chemical properties, paper porosity). In particular, the phenomenon of ink seeping is perhaps the most frequent one in ancient manuscripts written on both sides of the sheet. This effect, termed as bleed-through, becomes visible as an unpleasant and disturbing degradation pattern, severely impairing legibility, aesthetics, and interpretation of the source document.
In general, bleed-through removal is addressed as a classification problem, where image pixels are labelled as either background (paper texture), bleed-through (seeped ink), or foreground (original text). This classification problem is very difficult, since the intensity of both foreground text and bleed-through pattern can be so highly variable as to make it extremely hard or even impossible to distinguish them. In addition, when the aim is to obtain a very accurate and plausible restoration, another very critical issue is to replace the identified bleed-through pixels with appropriate replacement colour values, which do not alter the original look of the manuscript.
Recently, we proposed a two-step method to address bleed-through document restoration from a pre-registered pair of recto and verso images of the manuscript. First, the bleed-through pixels are identified on both sides ; then, a sparse representation based image inpainting technique is applied to fill-in the bleed-through pixels, by taking into account their propensity to aggregation, and in accordance with the natural texture of the surrounding background.
Sparse representation methods are reported with state-of-the-art results in different image processing applications. These methods process the whole image by operating on a patch-by-patch level. For this specific application, our aim is to reproduce the background texture to maintain the original look of the document. In the sparse representation setup, an over-complete dictionary is learned using a set of training patches from the recto and verso image pair. In the training set we only select patches with no bleed-though pixels. This choice speeds up the training process since it excludes non-informative image regions. For each patch to be inpainted, we first search for its mutual similar patches in a small bounded neighbourhood window. We used a block matching technique with Euclidean distance metric as similarity criterion. The similar patches are grouped together in a matrix. A group-based sparse representation method  is exploited to find the befitting fill-in for the bleed-through strokes. The use of similar patch groups incorporates local information that helps to preserve the natural colour/texture continuation property of the physical manuscript.
Figure 1: A visual comparison of an original ancient manuscript effected by bleed-through degradation and its restored version using our method. The first row shows the degraded recto and verso pair, and the restored images are presented in the second row.
An original degraded manuscript and its restored version are presented in Figure 1. It is worth noting that our algorithm can be directly applied to inpaint any other possible interference pattern detected in the paper support (e.g., stains). We are also studying the extension of the method to the restoration of broken or faded foreground characters.
 A. Tonazzini, P. Savino, and E. Salerno: “A nonstationary density model to separate overlapped texts in degraded documents,” Signal, Image and Video Processing, vol. 9, pp. 155–164, 2015.
 J. Zhang and D. Zhaocand W. Gao: “Group-based sparse representation for image restoration,” IEEE Trans. Image Process., vol. 32, pp. 1307–1314, 2016.
Anna Tonazzini, ISTI-CNR, Pisa