by Minh-Duc Pham and Peter Boncz
The Resource Description Framework (RDF) has been used as the main data model for the semantic web and Linked Open Data, providing great flexibility for users to represent and evolve data without need for a prior schema. This flexibility, however, poses challenges in implementing efficient RDF stores. It i) leads to query plan with many self-joins in triple tables, ii) blocks the use of advanced relational physical storage optimization such as clustered indexes and data partitioning, and iii) the lack of a schema sometimes makes it problematic for users to comprehend the data and formulate queries . In the Database Architecture group at CWI, Amsterdam, we tackle these RDF data management problems by automatically recovering the structure present in RDF data, leveraging this structure both internally inside the database systems (in storage, optimization, and execution), and externally as an emergent schema towards the users who pose queries.
This research project is inspired by the work on “characteristic sets” (CS's)  showing that it is relatively easy to recover a large part of the implicit structure underlying RDF data. Here, a characteristic set is a set of properties that occurs frequently with the same subject. We have observed that structure typically surfaces with certain kinds of subjects having the same set of properties (belonging to the same CS), and certain CS’s being connected over the same kind of property paths (“foreign key” relationships). The preliminary evaluation (see Figure 1) shows that even in the most dirty and messy RDF datasets, such as web-crawl data, 85% of the triples can be covered by a few hundred CS’s. The key idea in this research is to discover a rough “emergent” relational schema which covers most input RDF triples (e.g., 85% of the dataset). This schema consists of a set of CS’s and foreign key relationships between them, where tables and columns have logical names. This idea is being realized and experimentally evaluated inside MonetDB, an open-source column-store. By exploiting this schema, we provide better data locality and lower query processing cost. Our research focuses on three topics:
- Emergent Schema Recognition: Our goal is to represent a large part of the RDF data using a compact and understandable relational schema, derived from its emergent structure. To achieve this, after using the basic algorithm to detect CS’s based on common property sets, we enrich the schema by analysing the frequency distributions of Object-Subject references between CS’s and analysing the literal type frequency distributions of CS’s properties. To compact the schema, we merge CS’s that are semantically and structurally similar, and further optimize the schema by filtering out dirty data such as infrequent Properties or Subjects that miss most properties. To allow users to understand and easily browse the schema via SQL-based tools, we develop methods to label each CS and its properties using understandable names exploiting (when available) ontologies, special Properties (e.g., rdf:type), the content of URI values, and labels of Properties that often reference the Subjects belonging to a CS.
- Data Storage: The emergent schema is exploited to build a physical storage scheme that clusters the triples in a particular order, such that relational clustered indexing and data partitioning become possible. From the RDF store point of view this comes down to Subject clustering where loaded triples get a renumbered Subject: for each CS, the Subjects that belong to it get densely ascending object identifiers. The PSO triple representation (ie, triples in lexicographic Predicate–Subject–Object order) thus becomes highly efficient, as P and S compress away because they are repetitive resp. densely increasing. It can alternatively be viewed as columnar table storage of only Objects, and, in fact, gets stored as regular MonetDB table columns. The resulting “CS” tables are directly usable in SQL queries. Note that these tables do not represent the irregular triples (those that do not belong to a CS). These are stored separately as a triple table, and only play a role in SPARQL query answering.
- Query Execution: To leverage this emergent schema in SPARQL queries, we propose to extend the database kernel with new algorithms (RDFscan and RDFjoin). These operators accelerate typical “star join” patterns in SPARQL queries, avoiding self-joins. They stitch together CS columns in one sequential scan, like column stores do, but simultaneously merge-join the data with the irregular triple table to provide correct SPARQL query answers.
We believe this architecture will benefit RDF data management. It solves the performance issues of RDF stores, adds semantic web features to the relational model, helps SPARQL users better comprehend RDF data, and allows them to enjoy the benefits of the relational tool-chain.
 P. Minh-Duc, P.A. Boncz: “Self-organizing Structured RDF in MonetDB,” PhD Symposium, ICDE, 2013, http://oai.cwi.nl/oai/asset/20747/20747B.pdf
 T. Neumann, G. Moerkotte: “Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins”, in ICDE, IEEE, 2011, dx.doi.org/10.1109/ICDE.2011.5767868
CWI, The Netherlands