by Alexandra Roatiș

The WaRG framework brings flexibility and semantics to data warehousing. The development of Semantic Web data represented within W3C’s Resource Description Framework [1] (RDF), and the associated standardization of the SPARQL query language, now at v1.1, have led to the emergence of many systems storing, querying, and updating RDF. However, as more and more RDF datasets (graphs) are made available, in particular Linked Open Data, application requirements also evolve.

DW4RDF (Data Warehousing for RDF) is a three-year R&D project funded by the Digiteo foundation, an important player in the French IT R&D environment in the greater Parisian area. Within this project, we have developed the Warehousing RDF Graphs (WaRG) framework, a joint project involving Dario Colazzo (U. Paris Dauphine, France), François Goasdoué (U. Rennes 1, France) and Ioana Manolescu (Inria & U. Paris Sud, France). Within this framework, we have developed new models and tools for analytics and OLAP-style analysis of RDF data, taking into account data heterogeneity, its lack of a strict structure, and its rich semantics.

Heterogeneity
Data heterogeneity significantly complicates Big Data analytics. For example, although we can reasonably expect to find information about restaurants online, we cannot always find the menu, opening hours or closing days. Existing data warehousing tools tackle such issues by cleaning the data in the Extract Transform Load process, or allow nulls and nested columns in tables. In contrast, we view heterogeneity not as a problem, but as a desired feature. Instead of trying to eliminate or hide it, we propose ways of incorporating heterogeneity within the data warehousing model, and tools to build meaningful aggregates over heterogeneous data.

Warehouses need not revolve around a single concept
Generally, data warehouses follow a star (or snowflake) schema, where facts of a single kind can be analysed based on certain dimensions and measures. To analyze different concepts, such as restaurants, shops, museums, each must be modelled by a different schema and put into a distinct data warehouse. WaRG models a type of data warehouse where several core concepts coexist, interlinked by meaningful queries.

Just a matter of semantics?
RDF Schema is a valuable feature of RDF that allows the descriptions in RDF graphs to be enhanced by declaring semantic constraints between classes and properties. Such constraints are interpreted under the open-world assumption [2], propagating instances from one relationship to another. For example, in a database where “Océan is a pancake house” and “a pancake house is a restaurant”, we can infer that “Océan is a restaurant”. Querying the database for restaurants should also return all the pancake houses! Our framework is centred on RDF, thus it natively supports RDF semantics when querying.
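The pancake house inference can be sketched in a few lines of plain Python. This is only an illustration of subclass reasoning over a toy triple set, not the WaRG implementation; the `superclasses` and `instances_of` helpers and the second restaurant "ChezLéon" are hypothetical names introduced for the example.

```python
# Toy RDF graph as a set of (subject, property, object) triples.
# Every identifier below is illustrative, not taken from WaRG.
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

triples = {
    ("Océan", TYPE, "PancakeHouse"),
    ("PancakeHouse", SUBCLASS, "Restaurant"),
    ("ChezLéon", TYPE, "Restaurant"),  # hypothetical extra instance
}

def superclasses(cls, graph):
    """Return cls together with every class reachable via rdfs:subClassOf."""
    seen = {cls}
    stack = [cls]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if p == SUBCLASS and s == current and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen

def instances_of(cls, graph):
    """Instances of cls, including those typed with one of its subclasses."""
    return {
        s for s, p, o in graph
        if p == TYPE and cls in superclasses(o, graph)
    }

restaurants = instances_of("Restaurant", triples)
print(sorted(restaurants))
```

Asking for restaurants returns Océan even though no triple states its restaurant type explicitly, which is the behaviour the paragraph above describes.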

Figure 1: Sample analytical schema and query

Explore data through analytics
Even for an experienced database developer, understanding a new dataset is always a challenge, as each brings its own set of features, which may be particularly subtle in the case of semantics-endowed data such as RDF. To facilitate their understanding, datasets are generally published along with a schema. But how does one understand the schema? Contemporary published schemas tend to be complex and can be seen as datasets in their own right. While still small compared to the data, they can be real puzzles for the analyst. Working with RDF, one can seamlessly query the schema and the data, for example by asking for all the relationships linking people to other entities. Our model allows analytics not only over the data, but over the schema as well.
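As a rough illustration of querying "all the relationships linking people to other entities", the sketch below treats property names as ordinary values that a query can return. The data, the `Person` class and the `properties_of_class` helper are all assumptions made for the example.

```python
# Toy triples; people, places and property names are invented for illustration.
TYPE = "rdf:type"

triples = {
    ("alice", TYPE, "Person"),
    ("bob", TYPE, "Person"),
    ("alice", "livesIn", "Paris"),
    ("bob", "worksAt", "Inria"),
    ("alice", "knows", "bob"),
    ("Paris", "locatedIn", "France"),
}

def properties_of_class(cls, graph):
    """Return every property (other than rdf:type) whose subject is an instance of cls."""
    members = {s for s, p, o in graph if p == TYPE and o == cls}
    return {p for s, p, o in graph if s in members and p != TYPE}

person_properties = properties_of_class("Person", triples)
print(sorted(person_properties))
```

Because properties are part of the triples themselves, the same mechanism that retrieves data can also retrieve schema-level information.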

Data cubes, no longer a dictatorship
In order to perform data warehouse analysis, one must first establish the dimensions and measures according to which to analyze the facts. Data cubes are built as a result of aggregating the measures along the dimensions. For instance, when asking “what are the total sales for region Lorraine in autumn 2013?”, the sales are a measure, while region and period represent dimensions. However, such a warehouse cannot answer the query “how many regions registered sales in autumn 2013?”, since region is a dimension, and relational data cubes do not allow aggregating over the dimensions. In contrast, our framework is very flexible, allowing a choice of dimensions and measures at data cube (query) time, not at data warehouse design time.
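The difference between fixing dimensions at design time and choosing them at query time can be sketched as follows. This is a minimal illustration over in-memory records, not the WaRG query engine; the `cube` function and the sales figures are hypothetical.

```python
from collections import defaultdict

# Invented sales facts; amounts are arbitrary illustrative numbers.
sales = [
    {"region": "Lorraine", "period": "autumn 2013", "amount": 120},
    {"region": "Lorraine", "period": "autumn 2013", "amount": 80},
    {"region": "Alsace", "period": "autumn 2013", "amount": 50},
    {"region": "Alsace", "period": "spring 2013", "amount": 70},
]

def cube(facts, dimensions, measure, aggregate):
    """Group facts by the chosen dimensions and aggregate the chosen measure.

    Both dimensions and measure are plain attribute names picked by the caller,
    so any attribute can play either role in a given query.
    """
    groups = defaultdict(list)
    for fact in facts:
        key = tuple(fact[d] for d in dimensions)
        groups[key].append(fact[measure])
    return {key: aggregate(values) for key, values in groups.items()}

# "Total sales per region and period": amount is the measure.
total_sales = cube(sales, ["region", "period"], "amount", sum)

# "How many regions registered sales per period": region is now the measure.
regions_per_period = cube(sales, ["period"], "region", lambda vs: len(set(vs)))

print(total_sales[("Lorraine", "autumn 2013")])
print(regions_per_period[("autumn 2013",)])
```

In the second call, `region` is aggregated rather than grouped on, which is the kind of query a cube with a fixed dimension/measure split cannot express.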

WaRG models the analytical schema of an RDF warehouse as a graph. Each node represents a set of facts, modelling a new RDF class. The edges connecting these nodes are defined independently and correspond to new RDF properties. The instances of these classes and properties, modelling the data warehouse contents to be further analyzed, are intensionally defined in the schema, following the well-known “Global As View” approach for data integration. For more details we refer the interested reader to [3].
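One possible reading of this design, sketched under simplifying assumptions: each schema node and edge is paired with a query over the base graph, and the warehouse instance is whatever those queries return. The `Venue` class, the `venueCity` property, and the helpers `typed`, `linked` and `materialize` are hypothetical and do not reflect the actual WaRG definitions in [3].

```python
TYPE = "rdf:type"

# Base RDF graph (invented data).
base = {
    ("Océan", TYPE, "Restaurant"),
    ("Océan", "locatedIn", "Rennes"),
    ("Louvre", TYPE, "Museum"),
    ("Louvre", "locatedIn", "Paris"),
    ("Rennes", TYPE, "City"),
    ("Paris", TYPE, "City"),
}

def typed(cls):
    """Query returning the resources declared with the given type."""
    return lambda g: {s for s, p, o in g if p == TYPE and o == cls}

def linked(prop):
    """Query returning the (subject, object) pairs connected by prop."""
    return lambda g: {(s, o) for s, p, o in g if p == prop}

# Analytical schema: each node (new class) is defined by a query.
node_queries = {
    "Venue": lambda g: typed("Restaurant")(g) | typed("Museum")(g),
    "City": typed("City"),
}
# Each edge (new property) is defined by its own, independent query.
edge_queries = {
    ("Venue", "venueCity", "City"): linked("locatedIn"),
}

def materialize(graph):
    """Evaluate node and edge queries to build the warehouse instance."""
    instance = set()
    members = {}
    for node, query in node_queries.items():
        members[node] = query(graph)
        instance |= {(x, TYPE, node) for x in members[node]}
    for (src, prop, dst), query in edge_queries.items():
        for s, o in query(graph):
            # Keep only pairs whose ends belong to the edge's endpoint classes.
            if s in members[src] and o in members[dst]:
                instance.add((s, prop, o))
    return instance

warehouse = materialize(base)
print(sorted(warehouse))
```

Here a single `Venue` node gathers restaurants and museums, so several kinds of facts coexist in one warehouse instead of being split across separate star schemas.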

Our ongoing work includes RDF analytical schema recommendation and efficient algorithms for massively parallel RDF analytics.

Link: https://team.inria.fr/oak/warg/

References:
[1] W3C, Resource Description Framework, http://www.w3.org/RDF/
[2] S. Abiteboul, R. Hull, V. Vianu: “Foundations of Databases”, Addison-Wesley, 1995
[3] D. Colazzo, F. Goasdoué, I. Manolescu, A. Roatiș: “Warehousing RDF Graphs”, in “Bases de Données Avancées”, 2013, http://hal.inria.fr/docs/00/86/86/16/PDF/paper.pdf.

Please contact:
Alexandra Roatiș
Inria Saclay and LRI, Université Paris-Sud
http://www.lri.fr/~roatis/