A Model-Driven Data Provenance Method in a Semantic Web-Based Environment

by Tibor Gottdank

The goal of data provenance is to provide a method and a standard with which to manage the validity and the origin of information. At SZTAKI, a model-level data provenance method is implemented in a distributed service-oriented system, as part of a Hungarian national research and development project called SINTAGMA (Semantic INtegration Technology Applied in Grid-based, Model-driven Architectures).

The underlying principle of the SINTAGMA project (the project partners are IQSYS Computing Ltd., Budapest University of Technology and Economics (BME), HunorIS and SZTAKI) is to employ loosely coupled components and to combine data-centric and process-based integration. This is done by providing appropriate wrappers that integrate relational, object-oriented and semi-structured sources, as well as Web Services, in a unified framework.

In the SINTAGMA system, metadata (knowledge about information systems) is stored in a so-called model repository. This knowledge is represented in a formalism (description logics or UML). The properties of a model (concepts, structure of classes, description of relations of classes) are stored in a knowledge base. Constraints also play an important role in defining objects and relations; to express them, both OCL (Object Constraint Language) and the language of Description Logics are used.

The SINTAGMA system itself is an integration tool, both technically and semantically, for high-level access to heterogeneous data sources.

The model-level implementation is based on an internal language (SILan), which describes the models stored in the model repository (knowledge base). In SINTAGMA, metadata is stored as a model in the model repository: each data source is described by a model, and the mappings between models are expressed as associations. The SINTAGMA system provides external components (eg agents) with data that is integrated on the fly (that is to say, virtually).

This article focuses on the data provenance issue in this distributed SOA environment. The goal is to provide the ability to access and analyse provenance, which records past events and gives users a guide to future steps.

Figure 1: Data integration process via the SINTAGMA system.

Technically, the models in the model repository are extended with provenance information, which is derived and queried like any other information. The method is a detailed guide that the knowledge engineer can use to append new information and to maintain existing data. The additional information comprises the source name, the current time and the model name. This information is never lost during data derivation, so the end user is also able to use it.
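The three provenance attributes named above can be pictured as a small record attached to each piece of data. The following Python sketch is only an illustration and is not SILan: the names `Provenance`, `Record` and `annotate`, as well as the sample source and model names, are assumptions introduced for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """The three provenance attributes described in the article."""
    source_name: str       # where the data came from
    model_name: str        # model that describes the source
    recorded_at: datetime  # time at which the data was recorded


@dataclass(frozen=True)
class Record:
    """A data item together with its provenance (hypothetical structure)."""
    values: dict
    provenance: Provenance


def annotate(values: dict, source_name: str, model_name: str,
             now: Optional[datetime] = None) -> Record:
    """Attach provenance to a data item; 'now' defaults to the current UTC time."""
    stamp = now if now is not None else datetime.now(timezone.utc)
    return Record(dict(values), Provenance(source_name, model_name, stamp))


# Hypothetical source and model names, used only for illustration.
record = annotate(
    {"title": "Sample item"},
    source_name="mysql:library.books",
    model_name="LibraryRelations",
    now=datetime(2007, 6, 1, tzinfo=timezone.utc),
)
```

Because the provenance travels inside the record, any component that receives the record can read the source name, model name and time without consulting the original source.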

The method consists of three steps within the typical use cases that cover the whole provenance activity. The use cases are based on a general situation where the user is looking for a piece of information and wants to be sure of the authenticity of incoming data. First, the sources are derived. In the second step the data provenance information itself is derived. Through derivation, the provenance information moves to a higher level and provenance attributes are inserted into other attributes of class objects. Provenance data queries and the identification of sources are therefore possible at all levels of data derivation (third step).
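As a rough illustration of how the three steps fit together, the Python sketch below tags source rows with provenance attributes, carries those attributes into a higher-level object, and then queries the derived objects by source. It is a simplified analogy rather than the SINTAGMA implementation; the attribute keys (`prov_source`, `prov_model`, `prov_time`), the function names and the sample values are all assumptions.

```python
# Hypothetical attribute names for the provenance data.
PROV_KEYS = ("prov_source", "prov_model", "prov_time")


def insert_provenance(row, source, model, time):
    """Step 1: tag a row taken from a source with provenance attributes."""
    return {**row, "prov_source": source, "prov_model": model, "prov_time": time}


def derive(row, keep):
    """Step 2: build a higher-level object from selected attributes,
    carrying the provenance attributes upward unchanged."""
    derived = {key: row[key] for key in keep}
    derived.update({key: row[key] for key in PROV_KEYS})
    return derived


def query_by_source(objects, source):
    """Step 3: select objects, at any derivation level, by originating source."""
    return [obj for obj in objects if obj["prov_source"] == source]


low_level = [
    insert_provenance({"id": 1, "name": "A"}, "mysql:museum.items",
                      "MuseumRelations", "2007-06-01T12:00:00"),
    insert_provenance({"id": 2, "name": "B"}, "mysql:library.books",
                      "LibraryRelations", "2007-06-01T12:00:00"),
]
high_level = [derive(row, keep=("name",)) for row in low_level]
hits = query_by_source(high_level, "mysql:library.books")
```

The same `query_by_source` call works on `low_level` and on `high_level`, which mirrors the point that sources can be identified at every level of derivation.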

Step 1: Insertion of provenance data into models: derivation of sources (eg MySQL database tables).

Step 2: Derivation of data provenance information: attributes for provenance data are added to the classes of a given level (eg by Web Services).

Figure 2: Sample SILan code (bold indicates provenance information).

The derivation of source data consists of four use cases:

Derivation of relations-level models. In the model repository the derivation is realized through abstraction. During abstraction, the values passed to particular provenance attributes are first inserted into class attributes. The constant values, which correspond to given sources and higher-level model names, are added to the attributes by the knowledge engineer.
Derivation of further low-level models.
Derivation of higher-level models. The derivation is processed among application-level models. In this case the derivation process has already occurred on the relations level.
Association and derivation of higher-level models. This class derivation is on the application level as well, but the relation realized here is an association type relation.
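The difference between derivation by abstraction and derivation through an association can be sketched in Python as follows. This is a simplified model under stated assumptions, not SILan or SINTAGMA code: the `ModelClass` structure, the function names and the sample source and model names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelClass:
    """A class in some model, with its accumulated provenance (hypothetical)."""
    name: str
    level: str          # "relations" or "application"
    sources: frozenset  # names of the originating data sources
    models: tuple       # names of the models passed through, in order


def from_source(name, source_name, model_name):
    """Relations-level class described directly from a data source."""
    return ModelClass(name, "relations", frozenset([source_name]), (model_name,))


def abstract(parent, name, level, model_name):
    """Derivation by abstraction: one parent, provenance carried upward."""
    return ModelClass(name, level, parent.sources, parent.models + (model_name,))


def associate(left, right, name, model_name):
    """Derivation through an association: the provenance of both sides is merged."""
    extra = tuple(m for m in right.models if m not in left.models)
    return ModelClass(name, "application",
                      left.sources | right.sources,
                      left.models + extra + (model_name,))


# Hypothetical sources and models, used only for illustration.
books = from_source("BooksTable", "mysql:books", "LibraryRelations")
photos = from_source("PhotosTable", "mysql:photos", "ArchiveRelations")
book = abstract(books, "Book", "application", "LibraryApp")
photo = abstract(photos, "Photo", "application", "ArchiveApp")
item = associate(book, photo, "IllustratedBook", "CatalogueApp")
```

In this sketch an abstraction keeps a single line of origin, whereas an association yields a class whose provenance names every contributing source.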

Step 3: Retrieval of provenance information: Here the requested information can be queried.

Figure 3: Result of the above query in the GUI of the SINTAGMA system.

The SINTAGMA technology (and the model-driven provenance method within it) provides cost-effective integration of information for medium-size enterprises, where improving the quality and efficiency of services is important.

Currently, the potential target field includes map businesses, travel agencies, libraries and museums. In academia, BME and SZTAKI will use the results of basic and applied research in their education and research programmes. By merging, filtering and grouping data from different data providers, in cooperation with the Hungarian News Agency (MTI) and the National Széchényi Library (OSZK), SZTAKI provides the Hungarian Digital Library with the possibility of effective data search.

The project (completed this summer) will demonstrate the wide applicability of the technology to be developed using two significantly different application environments. First, the IT services linked to the Hungarian cultural heritage (building on the National Digital Repository) will be improved, and second, specific problems suffered by SMEs in data integration and electronic business/commerce (related to the so-called Enterprise Information Integration market trend) will be solved.

Link:
http://www.sintagma.hu

Please contact:
Tibor Gottdank, SZTAKI, Hungary
Tel: +36 1 279 6205
E-mail: gottdank@sztaki.hu