by Kaspar Riesen and Darko Brodic
Many libraries globally have started digitizing their most valuable old handwritings in order to preserve the world's cultural heritage. To improve the accessibility of the large number of available handwritten document images they must be made amenable to searching and browsing. A recent research project aims at a novel graph based keyword spotting framework applicable to historical documents. For testing the novel framework, isolated word images from the Miroslav Gospels (one of the oldest surviving documents written in old church Slavonic) will be represented by graphs.
{jcomments on}“Keyword spotting” refers to the process of retrieving all instances of a given keyword or key phrase from a document. A large variety of algorithms have been developed for keyword spotting over the last two decades. For instance, a common approach is to represent a word as a sequence of features, extracted via a sliding window. Subsequently, the sequences are compared using dynamic time warping (DTW) [1]. The major objective of the present project is to employ graph based pattern representation for the problem of keyword spotting. To the knowledge of the authors, graph based algorithms have never been used for this task. This is rather surprising as it turns out that the paradigm of graph matching, ie the comparison of pairs of graphs, perfectly meets the requirements of keyword spotting.
Various procedures for graph matching have been proposed in the literature. In [2] a novel graph matching procedure was proposed. This method is based on an (optimal) fast optimization procedure mapping nodes and their local structure of one graph to nodes and their local structure of another graph. This procedure allows the approximation of graph dissimilarities in polynomial time complexity. The distances found by this framework are equal to, or larger than, the exact graph distance. Empirical investigations show that relatively large distance values tend to be overestimated while small distances remain nearly unaffected by our approximation. Keyword spotting is a typical application where an overestimation of large distances is acceptable since we are only interested in document words that are very near to the keyword.
We plan to extract graphs from skeleton word images by means of a procedure similar to [3]. First, keypoints that include endpoints, intersections, and corner points of circular structures are extracted from the word skeleton image. Next, a node is added to the word graph for each skeleton keypoint, labeled with its position. After all keypoints have been included as nodes in the word graph, connection points are also added as nodes to the graph. Given a skeleton line connection between two keypoints in form of a pixel chain, connection points are inserted at regular distances along that chain. Finally, skeleton connections between two nodes are represented by undirected and unlabeled edges.
Through the representation of isolated words with graphs, the search and retrieval of given words in documents can be interpreted as matching of an input graph (keyword) with a large set of graphs (document). More formally, in order to spot a certain keyword wi, all p graph instances gi1, … , gip of that word wi occurring in the training set are matched against all graph words in each text line using our adapted graph matching procedure. That is, for a given word wi and a specific text line s pairwise distances between all prototypical graphs gi1, …,gip and the m word graphs g'1, … ,g'm from text line s are obtained first. The minimum of these graph distances serves as a distance function d(wi,s) of the keyword’s word class wi to the text line s. If the distance d(wi,s) of a keyword to the text line is below a given threshold, the text line s and the word from s having the minimum distance is returned as a positive match to the keyword wi.
A very important aspect of the whole project is the experimental evaluation of our novel graph based procedure for keyword spotting. We plan to carry out exhaustive experimental evaluations on the Miroslav Gospels. Miroslav Gospels is a 362-page illuminated manuscript Gospel Book on parchment with very rich decorations. It is one of the oldest surviving documents written in Old Church Slavonic and represents one of the most precious and significant documents in cultural heritage of Serbia (see Figure 1 for a detailed view of a page of the Miroslav Gospel).
The outlined project has been submitted as a proposal for joint research project to the SCOPES programme (Scientific co-operation between Eastern Europe and Switzerland). The research project will be mainly carried out by the Technical Faculty in Bor at the University of Belgrade (Serbia) and the Institute for Information Systems at the University of Applied Sciences and Arts Northwestern Switzerland. Further information about the SCOPES programme can be found at http://www.snf.ch/E/international/europe/scopes/Pages/default.aspx
Links:
Video lecture on our novel algorithmic framework for approximate graph distances: http://videolectures.net/gbr07_riesen_bgm
The algorithmic framework for approximate graph distances: http://www.fhnw.ch/wirtschaft/iwi/gmt
Miroslav Gospels:
http://en.wikipedia.org/wiki/Miroslav_Gospel
SCOPES programme:
http://www.snf.ch/E/international/europe/scopes
References:
[1] T. M. Rath, R. Manmatha: “Word spotting for historical documents”, Int. Journal on Document Analysis and Recognition, p. 139-152, 2007
[2] K. Riesen, H. Bunke: “Approximate graph edit distance computation by means of bipartite graph matching”, Image and Vision Computing, 27(4):950-959, 2009.
[3] A. Fischer, K. Riesen, H. Bunke: “Graph similarity features for HMM-based handwriting recognition in historical documents”, in proc. Int. Conf. on Frontiers in Handwriting Recognition, pages 253–258, 2010.
Please contact:
Kaspar Riesen
University of Applied Science and Arts Northwestern Switzerland, Olten, Switzerland
Tel: +41 79 688 77 19
E-mail: