by Dumitru Roman and Ahmet Soylu (SINTEF)
Corporate information, ranging from basic company information such as company name(s) and incorporation date to complex balance sheets and personal data about directors and shareholders, is the foundation that many data value chains depend upon in various sectors. However, collecting and aggregating information about a business entity from relevant public and private sources, especially across borders and languages, is a tedious and very expensive task that renders many potential business models infeasible.
The euBusinessGraph project [L1], funded by the European Union’s Horizon 2020 programme, laid the foundations for a business knowledge graph of companies and delivered a set of innovative data-driven business products and services dealing with company information.
Governments and other public bodies are increasingly publishing firmographic data as open data, along with contextual databases that reference companies. For example, the UK, Norway, France, and Denmark openly publish records about companies, while other countries have various degrees of openness for their company registries. Examples of contextual databases include the EU TED (Tenders Electronic Daily) public procurement notices and gazette notices.
Unfortunately, firmographics datasets are not yet fully harmonised and interoperable: the semantics of the data differ widely from one source to another, and data formats vary, ranging from the UK’s five-star Linked Data to poorly accessible and poorly documented datasets. Furthermore, contextual databases are not linked to the national company registries, and they still use different company identifier systems or, in some cases, no identifiers at all. Private businesses also produce valuable company-related data, which is seldom linked to the public sources mentioned above. For example, media publishers often reference businesses and legal entities by name (hence ambiguously), even within their digital publications. This occurs because there is no widespread mark-up schema for annotating a digital reference to a company and no standardised way of accessing its information once it has been unambiguously identified. As a result, finding, interpreting and reconciling these data from private sector sources is extremely expensive, time-consuming, and error-prone.
One of the immediate consequences is that the business information sector is not very cost-efficient in itself, which is reflected in a lack of transparency and efficiency of the markets [1]. The most relevant consequence in this context, however, is that these inefficiencies severely harm digital innovation across sectors, which is often introduced by small and agile actors (e.g., start-ups, civil society organisations) who lack the capacity to invest time and resources.
Figure 1: Homepage of the euBusinessGraph data marketplace prototype.
The euBusinessGraph project used ontologies as a key mechanism for aggregating, linking, provisioning and analysing company-related data in order to create a “business knowledge graph”: a highly interconnected graph of company-related information. A prototype data marketplace was created on top of the provisioned knowledge graph to enable the creation of data-driven products and services. It exemplifies the democratisation of the company information market, which is currently dominated by a few large international players that create a market barrier for smaller company data providers. The marketplace shows how such smaller players can join a common ecosystem to promote their data offerings, and how data consumers can have a central point where they can easily compare company data offerings.
An ontology, the euBusinessGraph ontology [L2], was developed by following common techniques recommended by well-established ontology development methods. The main sources used in its development were existing ontologies and vocabularies, such as the W3C Organization Ontology, and company data from four data providers. The data providers are: (i) OpenCorporates, with core company data on over 145 million entities, obtained from more than 120 company registers around the world; (ii) SpazioDati, with basic firmographics about more than 11 million business entities in the UK and Italy and information about 13 million directors and managers; (iii) Brønnøysund Register Centre (Brønnøysundregistrene), with a database that contains information on all legal entities in Norway, such as commercial enterprises and governmental agencies; and (iv) Ontotext, with data from the Bulgarian Trade Register for commercial and non-profit organisations. The data made available by the data providers originally came from both official sources (e.g., national and regional company registers) and unofficial sources (e.g., the corporate web, business-centric news aggregators and social networks).
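To illustrate why a shared vocabulary helps when providers describe the same facts under different field names, the following minimal sketch renames provider-specific fields to a common set of keys. Everything in it is an assumption made for illustration: the provider labels, the source field names, and the target keys are hypothetical and are not taken from the euBusinessGraph ontology or from any provider's actual data format.

```python
# Hypothetical per-provider mappings from source field names to shared keys.
# Neither the source fields nor the target keys come from the real ontology.
FIELD_MAPPINGS = {
    "provider_a": {
        "company_name": "legal_name",
        "company_number": "identifier",
        "incorporation_date": "founding_date",
    },
    "provider_b": {
        "navn": "legal_name",
        "organisasjonsnummer": "identifier",
        "stiftelsesdato": "founding_date",
    },
}


def harmonise(record, provider):
    """Rename a provider-specific record to the shared keys.

    Fields that are missing or empty in the source record are skipped,
    and surrounding whitespace is trimmed from the values that remain.
    """
    mapping = FIELD_MAPPINGS[provider]
    return {
        target: record[source].strip()
        for source, target in mapping.items()
        if record.get(source)
    }


# Two records with different layouts end up with the same keys.
uk = harmonise(
    {"company_name": " Example Ltd ", "company_number": "01234567"},
    "provider_a",
)
no = harmonise(
    {
        "navn": "Eksempel AS",
        "organisasjonsnummer": "999999999",
        "stiftelsesdato": "2001-05-17",
    },
    "provider_b",
)
```

Once records share the same keys, downstream steps such as interlinking and search can be written once rather than per provider. A real ontology does considerably more than rename fields, since it also fixes the meaning and types of each term.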
A data provisioning infrastructure was developed to onboard data from various data providers [2]. Using this infrastructure, data source files from data providers were processed and mapped to the euBusinessGraph ontology. The data provisioning infrastructure includes a set of data ingestion services and data preparation tools that can be used to simplify data cleaning and transformation from the various sources. The services include tools for data transformation, enrichment, interlinking, and metadata generation in order to publish the business knowledge graph data as Linked Data. DataGraft [3] was used to clean, transform, enrich and convert tabular data to Linked Data. Currently, more than 1.4 billion Linked Data triples are available in the business knowledge graph. A data marketplace prototype [L3], depicted in Figure 1, was implemented on top of the knowledge graph and includes functionality for advanced full-text search and detailed faceted search for exploring the company knowledge graph. Furthermore, the marketplace offers analytics services such as data aggregation and visualisation (e.g., company activities per city), search for company news articles, and search for company events.
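The general idea of cleaning tabular data and converting it to Linked Data can be sketched in a few lines of standard-library Python. This is not how DataGraft works internally and it does not use the euBusinessGraph ontology; it only shows the shape of the transformation. The base IRI, the column names, and the sample rows are hypothetical, while the vocabulary terms (`org:Organization` and `org:identifier` from the W3C Organization Ontology, and `skos:prefLabel`) are real.

```python
import csv
import io

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORG = "http://www.w3.org/ns/org#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/company/"  # hypothetical base IRI

# Hypothetical input; the column names are assumptions for illustration.
SAMPLE_CSV = '''id,jurisdiction,name
123456789,NO,  Eksempel   AS
00000001,GB,"Example ""Holdings"" Ltd"
'''


def escape_literal(value):
    """Escape backslashes, double quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def row_to_triples(row):
    """Clean one tabular row and return it as three N-Triples statements."""
    company_id = row["id"].strip()
    jurisdiction = row["jurisdiction"].strip().lower()
    name = " ".join(row["name"].split())  # collapse stray whitespace
    subject = f"<{BASE}{jurisdiction}/{company_id}>"
    return [
        f"{subject} <{RDF_TYPE}> <{ORG}Organization> .",
        f'{subject} <{SKOS}prefLabel> "{escape_literal(name)}" .',
        f'{subject} <{ORG}identifier> "{escape_literal(company_id)}" .',
    ]


def csv_to_ntriples(text):
    """Convert CSV text with id, jurisdiction and name columns to N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        triples.extend(row_to_triples(row))
    return triples


if __name__ == "__main__":
    print("\n".join(csv_to_ntriples(SAMPLE_CSV)))
```

The sketch assumes that identifiers contain only characters that are safe inside an IRI; a production pipeline would validate or percent-encode them, and would add enrichment and interlinking steps that are out of scope here.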
The project partners are SINTEF AS (Norway, coordinator), OpenCorporates (UK), Cerved (Italy), SpazioDati (Italy), Evry AS (Norway), Deutsche Welle (Germany), Ontotext (Bulgaria), Brønnøysund Register Centre (Norway), Jozef Stefan Institute (Slovenia), and University of Milano-Bicocca (Italy).
[1] Janssen et al.: “Driving public sector innovation using big and open linked data (BOLD)”, Information Systems Frontiers 19(2), 2017.
[2] Maurino et al.: “Modelling and Linking Company Data in the euBusinessGraph Platform”, in Proc. of DSMM@SIGMOD 2019.
[3] Roman et al.: “DataGraft: One-stop-shop for open data management”, Semantic Web 9(4), 2018.
Ahmet Soylu, SINTEF AS, Norway