by the guest editors Rudolf Mayer (SBA Research) and Thijs Veugen (TNO and CWI)
Many branches of the economy and society are increasingly dependent on data-driven predictions, decision support and autonomous decision-making by systems. These systems often depend on the availability of large amounts of data to learn from, which may include details about individuals, or otherwise sensitive information. Disclosing the identity of the individuals represented in this data must be avoided, for ethical reasons or because of regulatory requirements, which have been tightened in recent years, e.g. by the introduction of the EU’s General Data Protection Regulation (GDPR). This means that the use of data is restricted, making sharing, combining, and analysing data problematic. Privacy-preserving computation tries to bridge this gap: to find a way to leverage data while preserving the privacy of individuals.
As diverse as data collection, analysis scenarios, and workflows are, so are the approaches for privacy-preserving data analysis. These range from settings where data is collected centrally, or analysed by third parties, to settings where data is locally collected at many individual locations, and jointly analysed in a secure way; sharing and centralising the data in one location is often not a feasible option in this case.
Cryptographic approaches aim to keep data confidential during computation. Secure multi-party computation (SMPC) allows a joint result to be computed from data distributed over multiple sources, while keeping the data inputs private. The efficiency of SMPC techniques has improved considerably in recent decades, enabling more complex computation, such as sophisticated machine learning algorithms. A related technique is homomorphic encryption, suitable for outsourcing computations on sensitive data.
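To make the idea behind SMPC more tangible, the following is a minimal sketch of one common building block, additive secret sharing, used here to compute a joint sum. It is an illustration under stated assumptions, not a protocol taken from any of the contributions in this issue: the ring size `MODULUS` and the function names `share`, `reconstruct` and `secure_sum` are hypothetical, all parties are simulated in one process, and the parties are assumed to follow the protocol honestly.

```python
import secrets

# Illustrative ring of integers modulo a power of two (an assumption for this sketch).
MODULUS = 2**64


def share(value, n_parties):
    """Split `value` into n additive shares that sum to `value` modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recombine additive shares into the shared value."""
    return sum(shares) % MODULUS


def secure_sum(private_inputs):
    """Simulate a joint sum in which no party sees another party's input."""
    n = len(private_inputs)
    # Each party splits its own input and hands one share to every party.
    all_shares = [share(x, n) for x in private_inputs]
    # Party j adds the shares it received; each share alone looks uniformly random.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Only the partial sums are published and combined into the joint result.
    return reconstruct(partial_sums)
```

For example, `secure_sum([12, 30, 7])` yields the joint total of the three inputs, while each simulated party only ever handles random-looking shares of the others' values.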
Novel paradigms, such as federated learning, offer an alternative for analysing distributed data at low computational overhead.
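As a rough illustration of federated learning, the sketch below simulates federated averaging for a deliberately tiny model, y ≈ w · x, with a single parameter. Everything in it is an assumption made for the example: the model, the learning rate, the number of rounds, the data in `EXAMPLE_CLIENTS`, and the function names. The point it shows is that clients only send their locally updated parameter to the server, never their raw data.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one client's local data.

    The loss is the mean squared error of the model y = w * x.
    """
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, clients):
    """One round: clients train locally, the server averages the results."""
    updates = [(local_update(w_global, data), len(data)) for data in clients]
    total = sum(n for _, n in updates)
    # Weight each client's parameter by the size of its local dataset.
    return sum(w * n for w, n in updates) / total


def train(clients, rounds=50):
    """Repeat federated rounds starting from w = 0."""
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, clients)
    return w


# Hypothetical local datasets of three clients, all following y = 3 * x.
EXAMPLE_CLIENTS = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (2.5, 7.5)],
]
```

Calling `train(EXAMPLE_CLIENTS)` returns a value close to 3, the slope shared by all three local datasets. The overhead per round is one local training pass and one parameter exchange, with no cryptographic operations involved in this plain variant.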
In terms of anonymisation, implementations of differential privacy can help mitigate privacy risks for dynamic database queries and beyond. Methods to measure privacy or leakage are a cornerstone of evaluating the success of privacy-preserving computation.
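One standard way to realise differential privacy for a database query is the Laplace mechanism, sketched below for a counting query. This is a minimal example under assumptions: the function names, the `epsilon` parameter name and the sample data are hypothetical, and a production system would also need to track the privacy budget across repeated queries, which is not shown here.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Answer 'how many records satisfy predicate?' with Laplace noise.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data: ages of six individuals.
EXAMPLE_AGES = [23, 35, 41, 52, 67, 70]
```

For instance, `private_count(EXAMPLE_AGES, lambda age: age >= 40, epsilon=1.0)` returns the true count perturbed by random noise; a smaller `epsilon` gives stronger privacy at the cost of a noisier answer.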
Within Europe, considerable research is being conducted into privacy-preserving machine learning. This is, in part, because Europe is a pioneer in data protection, with the GDPR being one of the world’s strongest and most comprehensive regulations. This naturally puts a strong research focus on data privacy, and this is reflected in the research funding schemes of the EU – either in the form of project funding, dealing directly with novel approaches to protecting individuals’ data, or indirectly, as strong privacy assessment and management are required in many research endeavours, e.g. in most research projects dealing with medical and health data.
Privacy by design
Even as early as the data collection stage, regulations need to be considered. Bianchini et al. define a framework that evolves the usual approach of the DevSecOps paradigm by introducing the analysis and validation of privacy and data protection, and security dimensions at the point of conception of the technology, enforcing a holistic assessment. Chrysakis et al. discuss socio-technical tools to promote collective awareness and informed consent.
Miksa et al. describe a platform that allows non-disclosive data analysis on sensitive data, complemented by a detailed auditing mechanism. A consent management system and linkage between different data sources allow data across these sources to be integrated and analysed. Gremaud et al. present, for the use case of smart home IoT devices, a system for storing data in a centralised manner; they eliminate the need to trust a system operator by utilising an enclave where data is always encrypted; this also enables collaboration between different data owners. Abraham et al. propose a marketplace for brokering personal data; the planned exchange of information between participants is via an SMPC framework. Rocha et al. discuss how to collaborate among distributed immunological datasets. To this end, they provide a system that manages consent and assigns adequate permission levels according to the sensitivity of the data to be shared. Delsate et al. describe a platform that manages data access to multiple sources, utilising pseudonymisation and data aggregation to reduce the risk of identifying individuals in the database.
Lorünser et al. show how collaboration and optimisation of processes across company boundaries can be facilitated by SMPC, without disclosing actual business secrets to competitors. Abspoel et al. examine ways to improve the efficiency of SMPC protocols; their specific focus is on computation domains such as integers modulo powers of two. Pennekamp et al. discuss how non-disclosive computation, e.g. in the form of homomorphic encryption, can enable industrial collaboration, especially in settings where a strong emphasis is put on confidentiality of the data exchanged and shared in the collaboration. Kamphorst et al. discuss two solutions to collaboratively mine distributed data for clinical research in the oncology domain. The solutions perform privacy-preserving survival analysis in different scenarios: depending on how the patient data is distributed, i.e. whether the (hypothetically) combined data yields information on more individuals or more information per individual, either the solution based on federated learning or the one based on SMPC applies. Van Egmond et al. address the issue of information exchange between banks to detect money laundering. As data sharing is generally not an option, they investigate the feasibility of SMPC to enable this collaboration.
Spini et al. propose a workflow to exchange information between multiple stakeholders that need to collaborate to identify individuals eligible for certain social welfare programmes, without revealing their data to each other. This is achieved through homomorphic encryption to maintain confidentiality. Grimm et al. utilise federated learning to detect fraud in accounting and auditing, and thus enable anomaly detection and classification without the need to exchange data. Basile focuses on formal verification of SMPC settings, to create a contract-based design methodology to enforce security accountability and reputation of distributed digital entities.
Nitz et al. analyse requirements for data anonymisation in the domain of cyber security, with the aim of helping small and medium enterprises share their cyber threat intelligence data with others, to fast-track and improve the process of detecting attacks. Pejó et al. investigate how important the contributions of individual parties are to a model learned collaboratively via a federated learning approach. This can be the basis for identifying both participants that just want to benefit from the collaborative model without adequately contributing, and those that want to exaggerate their impact on the model when trying to tamper with (poison) the result. Hittmeir et al. address the domain of the human microbiome, which has attracted the interest of researchers because of its links with a number of medical conditions. The authors investigate and assess the privacy risks stemming from identification of individuals participating in a database of microbiome samples taken from multiple parts of the human body, and discuss countermeasures such as anonymisation and data synthetisation. Campbell et al. address the domain of voice-based interaction with computing devices, which is heavily dependent on user-created data to adapt to the multitude of languages and dialects individuals use to communicate with these devices. To reduce the risk of identifying specific users in the aggregated training data, methods to anonymise the voice and text representation are proposed, e.g. to make the recorded speech less identifiable. Šarčević and Mayer investigate multiple utility measures, which are employed to estimate the impact of anonymisation techniques on the information remaining within the data. They conclude that in many cases, simple and generic measures are not able to correctly predict which anonymised version is best for specific tasks, such as training a machine learning model.
This issue of ERCIM News focuses on current research on privacy-preserving machine learning. As the scope, nature and goals of machine learning activities are very broad, so is the need for solutions that can best fit a specific setting. While there is a wealth of approaches, new data collection and analysis settings emerge frequently, with the introduction of new devices and services, and their widespread adoption by many users. Thus, further research in this area will have to address these novel, as yet unknown settings, while also improving the current solutions, to increase privacy and still achieve the maximum possible utility. While strong data protection guidelines such as the GDPR might, at first glance, seem to contradict the latter goal of highest utility, such challenges can often be the catalyst for novel and enhanced solutions.
SBA Research, Austria
TNO and CWI, The Netherlands