by Massimo Ruffolo and Ermelinda Oro
The Web is the largest knowledge repository ever built. In recent years there has been considerable interest in languages and approaches providing structured (eg XML) and semantic (eg Semantic Web) representation of Web content. However, most of the information available is still accessed via Web pages in HTML and documents in PDF, both of which have internal encodings designed to present content on screen to human users. This makes automatic information extraction problematic.
In Presentation-Oriented Documents (PODs) content is laid out to provide visual patterns that help human readers to make sense of it. A human reader is able to look at an arbitrary document and intuitively recognize its logical structure and understand the various layout conventions and complex visual patterns that have been used in its presentation. This aspect is particularly evident, for instance, in Deep Web pages where Web designers arrange data records and data items with visual regularity, and in PDF documents where tables are used to meet the reading habits of humans. However, the internal representations of PODs are often very intricate and not expressive enough to allow the associated meaning to be extracted, even though it is clearly evidenced by the presentation.
To extract information from such documents, it is necessary to consider their internal representation structures as well as the spatial relationships between presented elements. Typical problems that must be addressed, especially in the case of PDF documents, arise from the separation between document structure and spatial layout. Layout is important as it often indicates the semantics of data items corresponding to complex structures that are conceptually difficult to query. For example, in western languages, the meaning of a cell entry in a table is most easily defined by the leftmost cell of the same row and the topmost cell of the same column. Even when the internal encoding provides fine-grained annotation, the conceptual gap between the low-level representation of PODs and the semantics of the elements is extremely wide. This makes it difficult:
- for humans and applications attempting to manipulate POD content. For example, languages such as XPath 1.0 are currently not applicable to PDF documents;
- for machines attempting to learn extraction rules automatically. In fact, existing wrapper induction approaches infer the regularity of the structure of PODs only by analyzing their internal structure.
The effectiveness of manual and automated wrapper construction is thus limited by the need to analyze the internal encoding of PODs with increasing structural complexity. The intrinsic print/visual oriented nature of PDF encoding poses many issues in defining ‘ad hoc’ information extraction approaches.
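The table example given earlier, in which a cell is interpreted through the leftmost cell of its row and the topmost cell of its column, can be made concrete with a minimal Python sketch. It assumes that each text fragment has already been reduced to a bounding box by some layout analysis step. The `Box` class, the `headers_for` function and the sample coordinates are hypothetical and are not taken from any ICAR-CNR software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A text fragment with its bounding box (hypothetical structure)."""
    text: str
    x0: float  # left edge
    y0: float  # top edge (y grows downward, as on screen)
    x1: float  # right edge
    y1: float  # bottom edge


def overlaps(a0, a1, b0, b1):
    """True if the intervals [a0, a1] and [b0, b1] share some extent."""
    return min(a1, b1) - max(a0, b0) > 0


def headers_for(cell, boxes):
    """Return (row header, column header) of a cell using geometry only."""
    same_row = [b for b in boxes if overlaps(b.y0, b.y1, cell.y0, cell.y1)]
    same_col = [b for b in boxes if overlaps(b.x0, b.x1, cell.x0, cell.x1)]
    row_header = min(same_row, key=lambda b: b.x0)  # leftmost in the row
    col_header = min(same_col, key=lambda b: b.y0)  # topmost in the column
    return row_header.text, col_header.text


# Made-up table fragments, deliberately listed out of reading order to
# mimic an internal encoding that does not follow the visual layout.
boxes = [
    Box("90", 110, 24, 150, 34),
    Box("2008", 60, 0, 100, 10),
    Box("Costs", 0, 24, 50, 34),
    Box("135", 110, 12, 150, 22),
    Box("Revenue", 0, 12, 50, 22),
    Box("2009", 110, 0, 150, 10),
    Box("120", 60, 12, 100, 22),
    Box("80", 60, 24, 100, 34),
]

cell = boxes[0]  # the fragment "90"
print(headers_for(cell, boxes))
```

The lookup never consults the order in which fragments are stored, so under these assumptions it gives the same answer however the internal encoding arranges them.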
A number of spatial query languages for Web pages, query languages for multimedia databases and presentations, visual Web wrapping approaches, and PDF wrapping approaches have been proposed in the literature. However, so far, these proposals provide limited capabilities for navigating and querying PODs for information extraction purposes. In particular, existing approaches are not able to generate extraction rules that are reusable when the internal structure changes, or for different documents in which information is presented by the same visual pattern. Information extraction approaches are needed that can exploit the presentation features of PODs.
ICAR-CNR is addressing these problems through the definition of spatial and semantic wrapper induction and querying approaches that allow users to query PODs by exploiting the visual patterns provided in the presentation. These approaches are grounded in document layout analysis and page segmentation algorithms combined with techniques for automatic wrapper induction and spatial languages like SXPath, a spatial extension of XPath 1.0. The innovative approaches for information extraction from PODs now being studied at ICAR-CNR permit: (i) the analysis of document layout and recognition of complex content structures like tables, sections, titles, data records, page columns, etc.; (ii) the automatic learning of extraction rules and creation of wrappers that enable relevant information, such as records and objects belonging to specific classes, to be extracted from documents; (iii) the navigation and querying of both Web and PDF documents by spatial primitives that exploit the spatial arrangement of content elements resulting from document presentation.
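To give a flavour of what querying by spatial primitives might look like, the following Python sketch defines a single direction predicate over bounding boxes and uses it to express one extraction rule that works on two differently encoded pages. It is an illustration under stated assumptions only: it does not reproduce SXPath syntax or semantics, and the names `Element`, `is_east_of` and `value_for`, as well as the sample pages, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A rendered content element and its bounding box (hypothetical)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def is_east_of(a, b):
    """True if a lies to the right of b and shares vertical extent with it."""
    return a.x0 >= b.x1 and min(a.y1, b.y1) > max(a.y0, b.y0)


def value_for(label_text, elements):
    """Rule: the value is the nearest element east of the given label."""
    label = next(e for e in elements if e.text == label_text)
    candidates = [e for e in elements if is_east_of(e, label)]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.x0 - label.x1).text


# Two made-up pages showing the same visual pattern (label, then value on
# the same line) with different element order and absolute positions.
page_a = [
    Element("Title:", 10, 10, 60, 22),
    Element("Spatial Wrappers", 70, 10, 200, 22),
    Element("Price:", 10, 30, 60, 42),
    Element("25 EUR", 70, 30, 120, 42),
]
page_b = [
    Element("25 EUR", 320, 500, 370, 512),
    Element("Price:", 250, 500, 300, 512),
    Element("Title:", 250, 480, 300, 492),
    Element("Spatial Wrappers", 310, 480, 440, 492),
]

for page in (page_a, page_b):
    print(value_for("Price:", page))
```

Because the rule refers only to the relative position of rendered elements, the same function applies to both pages without modification, which is the kind of reuse that rules tied to internal structure cannot offer.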
A CNR spin-off and start-up company, Altilia srl, will implement the approaches defined at ICAR-CNR. Altilia will provide semantic content capture technologies for the content management area of the IT market.
Massimo Ruffolo, ICAR-CNR, Italy