Handling Privacy Preservation in a Software Ecosystem for the Querying and Processing of Deep Sequencing Data

by Artur Rocha, Alexandre Costa, Marco Amaro Oliveira (INESC TEC) and Ademar Aguiar (University of Porto and INESC TEC)

iReceptor Plus will enable researchers around the world to share and analyse huge distributed immunological datasets from multiple countries, containing sequencing data pertaining to both healthy and sick individuals. Most Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data is currently stored and curated by individual labs, using a variety of tools and technologies.

iReceptor Plus aims to lower the barrier to accessing and analysing large AIRR-seq datasets, which will make this important data more available to academia, industry and clinical partners.

The project will stimulate the public sharing of AIRR-seq data, while providing a mechanism for users to protect private data when required. To this end, we are developing a layered security framework across a distributed (federated) software ecosystem.

The international iReceptor Plus [L1] (iR+) consortium aims to promote the storage, integration and controlled sharing of human immunological data for a wide range of clinical and scientific purposes. iR+ is an ongoing four-year project that started in 2019 and is co-funded by the EU and the Canadian government. It aims to develop an innovative platform that integrates distributed repositories of Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) data [1], enabling improved personalised medicine and immunotherapy in cancer, inflammatory and autoimmune diseases, allergies and infectious diseases.

This platform will empower researchers around the world to share and analyse huge distributed immunological datasets from multiple countries that contain sequencing data pertaining to both healthy and sick individuals. Currently, most of these data are stored and curated by individual labs, using a variety of tools and technologies. The iR+ software ecosystem will lower the barriers to accessing and analysing large AIRR-seq datasets, making these important data more available to academia, industry and clinical partners.

Layered security framework
AIRR sequencing [L2] technology has made it possible to sample the immune repertoire in exquisite detail but also poses substantial challenges, such as the preservation of the privacy of data subjects.

Privacy is a topic of continuous discussion within the health informatics community, especially when it comes to genetic datasets, which are subject to constraints of confidentiality, security, rights and ownership. While analyses performed on these datasets may provide crucial research evidence, both access to the data and their processing must be conducted in a way that does not compromise privacy.

The role of the iR+ layered security framework is to enable secure access between the components of the software ecosystem, following current security standards, and to provide multiple levels of authentication and authorisation to AIRR Data Commons (ADC) [2] compliant software.

The layered security framework delivers to iR+ a working Authentication and Authorisation Infrastructure enabling the following features [L3]:

federated authentication for data consumers, compatible with multiple third-party identity providers (and identity brokers);
ADC repository endpoints secured according to the permissions set by data stewards;
a dashboard for data stewards to manage data consumers' permissions for each endpoint and resource they own.

Due to the distributed nature of the data providers and to the technological heterogeneity of the various repository services, the security framework was implemented following a technology-agnostic approach. It was vital to determine an interoperable mechanism for managing resources, independent of the underlying repository implementation.

Authorisation component
The main standard used for managing authorisation is User-Managed Access (UMA 2.0), an OAuth-based access management protocol. It gives data stewards the ability to manage permissions on their resources and to control which data consumers can access them. The basic workflow follows an exchange of permission tickets between the security framework and the requesting user. The process is used to identify the user, determine which dataset the user is trying to access, and finally resolve which sets of data should be returned to the user.
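The permission ticket exchange can be illustrated with a small in-memory simulation. This is only a sketch of the general UMA 2.0 pattern (request, permission ticket, exchange for a requesting party token, retry), not the iR+ implementation: the class names, the "read" scope and the policy layout are assumptions made for illustration, and no real network calls or token formats are involved.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class AuthorizationServer:
    """Toy stand-in for a UMA authorisation server (hypothetical layout)."""
    # policies: resource_id -> {user: set of scopes}, as set by data stewards
    policies: dict
    _tickets: dict = field(default_factory=dict)
    _rpts: dict = field(default_factory=dict)

    def create_permission_ticket(self, resource_id, scopes):
        # Called by the resource server when a request lacks a valid token.
        ticket = secrets.token_urlsafe(16)
        self._tickets[ticket] = (resource_id, frozenset(scopes))
        return ticket

    def exchange_ticket(self, ticket, user):
        # Swap a single-use ticket plus the user's identity for a
        # requesting party token (RPT); return None if access is denied.
        resource_id, scopes = self._tickets.pop(ticket, (None, frozenset()))
        if resource_id is None:
            return None
        granted = self.policies.get(resource_id, {}).get(user, set())
        if not scopes <= granted:
            return None
        rpt = secrets.token_urlsafe(16)
        self._rpts[rpt] = (resource_id, scopes)
        return rpt

    def introspect(self, rpt):
        # Return (resource_id, scopes) for a known RPT, else None.
        return self._rpts.get(rpt)


@dataclass
class ResourceServer:
    """Toy stand-in for a protected repository endpoint."""
    auth: AuthorizationServer
    data: dict

    def get(self, resource_id, rpt=None):
        perm = self.auth.introspect(rpt) if rpt else None
        if perm and perm[0] == resource_id and "read" in perm[1]:
            return 200, self.data[resource_id]
        # No valid token: answer with 401 and a fresh permission ticket.
        ticket = self.auth.create_permission_ticket(resource_id, {"read"})
        return 401, ticket


# Usage: the steward has granted "alice" read access to one repertoire.
auth = AuthorizationServer(policies={"repertoire-1": {"alice": {"read"}}})
repo = ResourceServer(auth, {"repertoire-1": {"repertoire_id": "repertoire-1"}})
status, ticket = repo.get("repertoire-1")      # first attempt is refused
rpt = auth.exchange_ticket(ticket, "alice")    # ticket is exchanged for an RPT
status, body = repo.get("repertoire-1", rpt)   # retry with the RPT succeeds
```

In a real deployment the ticket travels in an HTTP 401 response and the exchange happens at the authorisation server's token endpoint; the simulation only keeps the order of the steps.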

The UMA 2.0 authorisation standard was designed specifically for protected data. However, in iR+, protected data may live side by side with public data in the same repositories. The security framework therefore had to work around this limitation and extend regular UMA implementations to support both authenticated and unauthenticated access to publicly available data.
By default, any request made to a secured repository returns the data defined by the data steward as publicly accessible. This means the requesting user does not need to be registered and does not need to explicitly request access from the data steward to view public data. To access protected data, the request must carry a custom HTTP header, which triggers the default UMA authorisation workflow.
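The routing decision behind this behaviour can be sketched as follows. The header name, the "public" flag on each record and the shape of the response are all hypothetical, since the article does not specify them; the sketch also simplifies by returning every record once a valid token is present, whereas the real framework resolves permissions per resource.

```python
# Hypothetical header name; the actual header used by iR+ is not given here.
PROTECTED_HEADER = "x-protected-access"


def route_request(headers, records, has_valid_rpt=False):
    """Decide what a secured repository returns for one request.

    headers: mapping of HTTP header names to values
    records: list of dicts, each with a boolean "public" flag (assumed layout)
    has_valid_rpt: whether the request carried a valid UMA token
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    wants_protected = PROTECTED_HEADER in {name.lower() for name in headers}

    if not wants_protected:
        # Default path: no registration needed, public data only.
        return {"status": 200, "data": [r for r in records if r["public"]]}

    if not has_valid_rpt:
        # Header present but no token yet: start the UMA workflow.
        return {"status": 401, "data": [], "next": "uma"}

    # Simplification: a valid token unlocks all records in this sketch.
    return {"status": 200, "data": list(records)}


records = [{"id": 1, "public": True}, {"id": 2, "public": False}]
anonymous = route_request({}, records)
challenged = route_request({"X-Protected-Access": "1"}, records)
```

The design keeps anonymous clients fully compatible with an unsecured repository: they never see a 401 unless they explicitly opt in to the protected path.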

Dashboard component
The security dashboard allows data stewards to control access at different levels of granularity through an interface modelled after the ADC data standards. It enables fine-grained customisation of what is exposed by the security framework.

Accessibility levels may be customised using arbitrary permission scopes. For example, a data steward may enable an intermediary permission level for exploratory data analysis, where only aggregated, non-identifying information is delivered to the requesting user, by defining a specific scope for such purpose.
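A minimal sketch of such an intermediary level is shown below. The scope names "statistics" and "raw_sequence" are invented for illustration, as is the response layout; v_call and junction_aa are used as example rearrangement fields.

```python
from collections import Counter


def respond(records, granted_scopes):
    """Shape a response according to the scopes granted to the user.

    records: list of rearrangement-like dicts (illustrative data only)
    granted_scopes: set of scope names; both names below are hypothetical
    """
    if "raw_sequence" in granted_scopes:
        # Full access: individual records are returned unchanged.
        return {"level": "full", "rearrangements": records}

    if "statistics" in granted_scopes:
        # Intermediary access: only aggregated, non-identifying counts.
        counts = Counter(r["v_call"] for r in records)
        return {
            "level": "aggregate",
            "total": len(records),
            "v_call_counts": dict(counts),
        }

    # No matching scope: nothing is disclosed.
    return {"level": "none"}


sample = [
    {"v_call": "IGHV1-69", "junction_aa": "CARDYW"},
    {"v_call": "IGHV1-69", "junction_aa": "CARGYW"},
    {"v_call": "IGHV3-23", "junction_aa": "CAKDYW"},
]
summary = respond(sample, {"statistics"})
```

A user holding only the exploratory scope receives gene usage counts but never the individual sequences.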

The dashboard also provides security templates that allow data stewards to quickly set up recurring accessibility levels for different datasets.

ADC-Middleware
The ADC-Middleware [L4] is the central component of the security framework and the main service responsible for providing a control layer between the requesting users and the ADC repositories. It takes into account all the security configurations, data ownership, with whom the data was shared and any fine-grained customisations, and uses this information to control which sets of data should be exposed to users.

It effectively acts as a barrier between the requesting user and the ADC repository: it only allows requests for content the user has access to and filters out any data the user is not permitted to see. This filtering process builds on the UMA 2.0 authorisation service to determine permissions, and on the ADC-Middleware's internal filtering engine to apply more fine-grained access control.
The ADC-Middleware provides programmatic access to AIRR-seq datasets following the same querying and filtering formats that a normal ADC API would, and is fully interoperable with ADC API implementations.
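One way such a filtering layer could treat an ADC-style query before forwarding it to a repository is sketched below. This is an assumption about the approach, not a description of the ADC-Middleware's actual engine: the function names are hypothetical, and the query is taken to be a JSON object with a "filters" tree (an "op" plus "content") and an optional "fields" list.

```python
def filter_fields(node):
    """Collect every field name referenced in an ADC-style filter tree."""
    if not node:
        return set()
    content = node["content"]
    if isinstance(content, list):
        # Logical operators such as "and"/"or" hold a list of sub-filters.
        found = set()
        for child in content:
            found |= filter_fields(child)
        return found
    return {content["field"]}


def restrict_query(query, allowed):
    """Return a copy of the query limited to fields the user may see.

    Filtering on a hidden field could leak its values, so such queries
    are rejected instead of being silently rewritten.
    """
    blocked = filter_fields(query.get("filters")) - allowed
    if blocked:
        raise PermissionError(f"filter uses restricted fields: {sorted(blocked)}")
    requested = query.get("fields") or sorted(allowed)
    return {**query, "fields": [f for f in requested if f in allowed]}


allowed = {"repertoire_id", "subject.sex"}
query = {
    "filters": {
        "op": "and",
        "content": [
            {"op": "=", "content": {"field": "subject.sex", "value": "female"}},
            {"op": "=", "content": {"field": "repertoire_id", "value": "r1"}},
        ],
    },
    "fields": ["repertoire_id", "subject.age_min"],
}
safe_query = restrict_query(query, allowed)
```

Because the rewritten query keeps the same structure as the original, it can be forwarded unchanged to any repository that accepts this query format.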

Conclusions
The layered security framework builds on the privacy by design and data minimisation principles to attain privacy preservation in a federated software ecosystem for the querying and processing of AIRR-seq data. If data has been previously made public, it can be accessed via standard APIs without triggering the default UMA workflow. Should access restrictions apply, data stewards can use the security framework to configure adequate permission levels according to the sensitivity of the data to be shared.

As an example, summary statistics, non-disclosive features derived from genetic data, or other forms of aggregated data can be set to an intermediary level of permissions. A registered user could then access these features in an exploratory data analysis stage, before deciding to activate the necessary legal instruments for the sharing of potentially sensitive data.

Figure 1: Overview of the Security Framework and interaction among its main components.

Links:
[L1] https://www.ireceptor-plus.com/
[L2] https://www.antibodysociety.org/the-airr-community/
[L3] https://ireceptorplus.inesctec.pt/wiki/start
[L4] https://github.com/ireceptorplus-inesctec/adc-middleware

References:
[1] F. Rubelt, et al.: “Adaptive immune receptor repertoire community recommendations for sharing immune-repertoire sequencing data”, Nature Immunology 18.12 (2017): 1274-1278.
[2] S. Christley, et al.: “The ADC API: a web API for the programmatic query of the AIRR Data Commons”, Front. Big Data 3: 22. doi: 10.3389/fdata (2020).

Please contact:
Artur Rocha
Institute for Systems and Computer Engineering, Technology and Science, Portugal