by Irene Petrou and George Papastefanatos
Linked Open Data (LOD) technology is an emerging way of making structured data available on the Web. This project aims to develop a generic methodology for publishing statistical datasets, mainly stored in tabular formats (e.g., CSV and Excel files) and relational databases, as LOD. We build statistical vocabularies and LOD storage technologies on top of existing publishing tools to ease the process of publishing these data. Our efforts focus on census data collected during Greece’s 2011 Census Survey and provided by the Hellenic Statistical Authority. We develop a platform through which the Greek Census Data are converted, interlinked and published.
Statistical or fact-based data about observations of socioeconomic indicators are maintained by statistical agencies and organizations, and are harvested via surveys or aggregated from other sources. Census data include demographic, economic, housing and household information, as well as a set of indices concerning the population over time, such as mortality, dependency rate, total fertility rate, life expectancy at birth, etc.
The main objective in publishing socio-demographic data, such as census data, as LOD is threefold: to make these data available in an easier-to-process format (they can be crawled or queried via SPARQL), to make them identifiable at the record level by assigning them URIs, and to make them citable, i.e., to make it possible for other sources to link to and connect with them. Being available in LOD format will make them easier for third parties to access and use, facilitating data exploration and the development of novel applications. Furthermore, publishing Greek census data as LOD will facilitate their comparison and linkage with datasets derived from other administrative resources (e.g., public bodies, Eurostat, etc.), and deliver consistency and uniformity between current and future census results.
Best practices for publishing Linked Data encourage the reuse of vocabularies for describing common concepts in a specific domain. In this way, interoperability and interlinking between published datasets are achieved. In the statistics field, a number of statistical vocabularies and interoperability standards have been proposed, such as the SDMX (Statistical Data and Metadata eXchange) standard, the Data Cube Vocabulary, and SCOVO. In our approach, we employ the Data Cube Vocabulary for representing census results. The Data Cube Vocabulary relies on the multidimensional (or cube) model. Its main components are the dimensions, the attributes and the measures. Dimension components capture common characteristics across datasets, such as the reference period or the reference location; an attribute component captures attributes of the observed value(s), such as the unit of measure. Finally, a measure component represents the phenomenon being observed, such as the number of inhabitants. Furthermore, the Data Cube Vocabulary uses SKOS and SDMX concepts for defining classifications, hierarchies and common statistical concepts.
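To make the three component roles concrete, the following minimal sketch builds a Data Cube data structure definition as plain triples, using only the Python standard library. The `qb:` and SDMX property IRIs are the published ones; the `EX` base IRI, the component numbering scheme and the choice of these four components are assumptions made for illustration and are not taken from the project's actual model.

```python
# Illustrative sketch only: declares two dimensions, one attribute and one
# measure in a qb:DataStructureDefinition. EX is a hypothetical base IRI.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_ATTR = "http://purl.org/linked-data/sdmx/2009/attribute#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
EX = "http://example.org/census/"  # assumption, not the project's namespace

# (role, property IRI): the role names match the qb:dimension,
# qb:attribute and qb:measure properties of the vocabulary.
COMPONENTS = [
    ("dimension", SDMX_DIM + "refPeriod"),     # reference period
    ("dimension", SDMX_DIM + "refArea"),       # reference location
    ("attribute", SDMX_ATTR + "unitMeasure"),  # unit of measure
    ("measure", SDMX_MEASURE + "obsValue"),    # observed value
]


def structure_triples(dsd_iri, components):
    """Return (subject, predicate, object) IRI triples for the structure."""
    triples = [(dsd_iri, RDF_TYPE, QB + "DataStructureDefinition")]
    for index, (role, prop) in enumerate(components, start=1):
        spec = f"{dsd_iri}/component/{index}"  # hypothetical IRI pattern
        triples.append((dsd_iri, QB + "component", spec))
        triples.append((spec, RDF_TYPE, QB + "ComponentSpecification"))
        triples.append((spec, QB + role, prop))
    return triples


def to_ntriples(triples):
    """Serialise IRI-only triples as N-Triples lines."""
    return [f"<{s}> <{p}> <{o}> ." for s, p, o in triples]


DSD_TRIPLES = structure_triples(EX + "dsd/population", COMPONENTS)

for line in to_ntriples(DSD_TRIPLES):
    print(line)
```

Each observation in a dataset that follows this structure would then carry one value per dimension, the attribute, and the measure.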
To publish the data, we apply the methodology shown in Figure 1, which comprises the following steps:
- Data modelling: This step involves identifying and modelling custom ontologies for all census-specific concepts and indices that are not defined by other sources. An important part of this step is to tackle problems related to the evolution of the concepts (both in terms of structure and data values) over time. A typical example concerns the structure of administrative divisions in Greece: in the 2001 Census Survey the divisions were defined according to the “KAPODISTRIAS” Plan, which contained six hierarchy levels of divisions, whereas in 2011, restructuring according to the “KALLIKRATIS” Plan resulted in eight levels.
- Data RDF-ization: This step involves cleaning up the data, selecting the appropriate URI scheme for each type of resource (e.g., datasets, dimensions, observations, etc.) and mapping each concept within the source file (e.g., a column in the case of XLS files) either to the appropriate component of the Data Cube Vocabulary or to a concept of another related vocabulary (SDMX, SKOS). The data are then exported to RDF. Both the mapping and the RDF generation are performed in real time within the custom platform developed for data transformation.
- Data interlinking: The transformed data are interlinked with other resources. For example, indices are linked with datasets from the World Bank, while economic activity, occupational and educational data are linked with Eurostat’s datasets via the NACE, ISCO and ISCED classifications, respectively.
- Data storage: The produced RDF data are uploaded, stored and maintained in an LOD triple store. OpenLink Virtuoso is used for storing and dereferencing data.
- Data publication: Finally, the data become available for dereferencing and further exploration through a SPARQL endpoint service and for downloading as RDF dumps.
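The RDF-ization step above can be sketched roughly as follows. This is not the project's platform, only a minimal standard-library illustration of cleaning values, minting URIs from a fixed scheme and mapping source columns to Data Cube components. The base IRI, the column names, the URI patterns and the sample row (whose values are made-up placeholders, not census figures) are all assumptions.

```python
import csv
import io
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD = "http://www.w3.org/2001/XMLSchema#"
QB = "http://purl.org/linked-data/cube#"
SDMX_DIM = "http://purl.org/linked-data/sdmx/2009/dimension#"
SDMX_MEASURE = "http://purl.org/linked-data/sdmx/2009/measure#"
BASE = "http://example.org/census/2011/"  # hypothetical URI scheme root

# Hypothetical mapping: source column -> (target property, value kind).
COLUMN_MAP = {
    "area_code": (SDMX_DIM + "refArea", "iri"),
    "year": (SDMX_DIM + "refPeriod", "gYear"),
    "population": (SDMX_MEASURE + "obsValue", "integer"),
}
KEY_COLUMNS = ("area_code", "year")  # dimensions that identify a record


def clean(value):
    """Minimal clean-up: trim surrounding whitespace."""
    return value.strip()


def observation_iri(dataset, row):
    """Mint one IRI per observation from its dimension values."""
    key = "/".join(quote(clean(row[c]), safe="") for c in KEY_COLUMNS)
    return f"{BASE}{dataset}/obs/{key}"


def row_to_ntriples(dataset, row):
    """Convert one source row into N-Triples lines for one observation."""
    obs = observation_iri(dataset, row)
    lines = [
        f"<{obs}> <{RDF_TYPE}> <{QB}Observation> .",
        f"<{obs}> <{QB}dataSet> <{BASE}{dataset}> .",
    ]
    for column, (prop, kind) in COLUMN_MAP.items():
        value = clean(row[column])
        if kind == "iri":
            obj = f"<{BASE}area/{quote(value, safe='')}>"
        elif kind == "integer":
            obj = f'"{int(value)}"^^<{XSD}integer>'
        else:  # gYear: accept digits only so the literal stays valid
            if not value.isdigit():
                raise ValueError(f"unexpected year value: {value!r}")
            obj = f'"{value}"^^<{XSD}gYear>'
        lines.append(f"<{obs}> <{prop}> {obj} .")
    return lines


def convert(csv_text, dataset):
    """Convert a whole CSV document into a list of N-Triples lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    output = []
    for row in reader:
        output.extend(row_to_ntriples(dataset, row))
    return output


# Placeholder input; the code and count are invented for illustration.
SAMPLE_CSV = "area_code,year,population\n X01 ,2011, 1234 \n"

for line in convert(SAMPLE_CSV, "population"):
    print(line)
```

Deriving the observation URI from the dimension values keeps it stable across repeated conversions of the same source file, which matters once other datasets start linking to individual records. The resulting lines could then be loaded into a triple store and exposed through a SPARQL endpoint, as described in the storage and publication steps.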
The current work is implemented in the context of a large-scale national project on the management of socio-demographic data in Greece. The project is co-financed by the European Union (European Regional Development Fund - ERDF) and Greek national funds through the Operational Program “Competitiveness and Entrepreneurship” (OPCE ΙΙ) of the National Strategic Reference Framework (NSRF) - Research Funding Program: KRIPIS. The project started at the beginning of 2013 and will run over the next two years. It involves the Institute for the Management of Information Systems at the Research Center “Athena”, and the Institute of Social Research of the National Centre for Social Research.
 I. Petrou, G. Papastefanatos, T. Dalamagas: “Publishing Census as Linked Open Data. A Case Study”, in proc. of the 2nd Int. Workshop on Open Data (WOD’13), Paris, France, 2013.
Athena Research Centre, Greece