by Hubert Schölnast, Peter Kieseberg, Patrick Kochberger and Henri Ruotsalainen (University of Applied Sciences St. Pölten)

While the concepts of data sharing and data reuse are simple in theory, they face a plethora of challenges and obstacles when transferred into real-life applications. In this article we discuss the major challenges encountered in the successful construction of a sharing infrastructure for oncological data, as well as best practices and learnings in order to overcome similar issues.

Utilization of health data has a long tradition, especially when referring to concepts like P4 medicine [1]. While the task seems trivial in theory, there are many obstacles in practice, especially when considering small teams targeting very specialized data and research questions below the multi-million euro development budgets. In this paper we discuss learnings from previous projects that acted as basis for the practical implementation of an oncological platform in Lower Austria together with new challenges derived from the changes in the geopolitical landscape. Figure 1 provides an overview of these selected challenges.

 

Five selected challenges for real-world medical data exchange environments
Figure 1: Five selected challenges for real-world medical data exchange environments.

 

Obstacle 1 – Anonymisation versus Pseudonymization: One major issue for misunderstanding is the term “anonymization”. While it is clearly reserved in IT Security and regulations like the General Data Protection Regulation (GDPR) as methods that do not allow re-identification of persons in data sets, neither directly through identifiers nor indirectly through a combination of so-called “quasi identifiers”, it is often used as a word for “masking” data in the sense of using pseudonyms for the identifiers in medical environments. While this seems to be a trivial issue on first glance, it results in completely different protection levels: Regulations like the GDPR enforce strict protection of non-anonymous data, which also explicitly includes pseudonymized data with respect to e.g. consent, right to deletion and others. Furthermore, typically applicable anonymization mechanisms result in data distortion which could be detrimental to the analysis conducted by AI algorithms.
This topic is a typical example for different interpretations of technological terms in different fields. Thus, defining each of these terms is of major importance. Furthermore, we found that referencing specific standards or regulations can greatly enhance the quality of understanding between experts from different fields.

Obstacle 2 – Digital Sovereignty: Due to recent reshaping of the geopolitical landscape, the topic of European digital sovereignty emerged as a major issue, especially considering the high risk involved when dealing with personal medical information for the patients. In addition, as technological support becomes a bargaining chip in geopolitical games, requiring availability of foreign technological platforms becomes a liability. While the decision to stay with purely European platforms needs to be decided on case by case, the principle issue of digital sovereignty needs to be taken into consideration. Furthermore, upcoming regulations are currently foreseen to strongly emphasize on utilizing European platforms wherever possible, in order to decouple from US dominance, adding further complexity to the design phase. Thus, the requirement for using sovereign software and technologies needs to be discussed and clarified.

Obstacle 3 – Bills of Materials:In tandem with the topic of digital sovereignty, so-called “Bills of Material” (BOMs) [2] become increasingly important. While this is already required for Software under certain circumstances and is a major mechanism for ensuring digital sovereign as outlined in Obstacle 2, the nature of modern machine learning algorithms also requires a BOM for the data that is used for training the models, since these models define the actual behaviour of the algorithms. Furthermore, this test data might not only unwillingly introduce biases but can also be used to include specific miscalculations on purpose. Thus, the Data BOM needs to make sure that it covers the actual data that was used for training, which is typically hard to prove in case of “training as a service” or “model as a service” environments. While this can be solved by training the model on premise under the control of a trusted supervisor, this certainly increases the costs, as model training requires a lot of hardware resources. In case of utilizing cryptography for data protection, “Cryptographic Bills of Materials” (CBOMs) are required as well, not only catering for the algorithm that was used, but for the specific implementation, as this is a major weak spot in the application of encryption.

Obstacle 4 – Legal and regulatory realities:
Obstacle 4 – Legal and regulatory realities: While sharing seems to be trivial, and the FAIR criteria lay out respected guidelines on what needs to be fulfilled for future proof information exchange accompanied by a lot of research and best practices on how to achieve it, the practical implementation in actual projects requires a lot of legal involvement, especially when dealing with different partners for data provisioning and analysis: Different legislations have different rules on the treatment of patient data, as this topic is typically heavily nationalised. Even within the same legal framework, partners have different internal rules and responsibilities, as well as different strategies regarding risks. This alignment is often overlooked in projects driven from technical / research side and can result in long delays and critical uncertainties.

Obstacle 5 – Testdata for Development: A major obstacle in the development of AI-enhanced medical research platforms lies in the provisioning of data for development and testing purposes: While this is often clarified for the actual operational workflow, where the platform is e.g. under the control of the original data owner, things become far more muddied in the development phase, when external partners need access to data and formats, especially when externalizing AI model training. While synthetic data can of course solve this problem, this means that data generators need to be developed that cover the full width of possible test cases. This test coverage is even a problem when using real data in case of rare cases and thus requires far more data engineering than typically assigned to this task.

In conclusion, the practical implementation of data driven processes encounters practical difficulties, especially when dealing with multiple partners in complex legal and ethical environments. In order to navigate around these obstacles, the APOG project [L1] has defined guidelines for qualified interviews that clarify these issues in a structured and comprehensive manner that will be published in the course of the project.

Link: 
[L1] https://research.ustp.at/en/projects/access-point-for-basic-oncological-research-in-lower-austria  

References: 
[1] P. Sobradillo, F. Pozo, and Á. Agustí, “P4 medicine: The future around the corner,” Archivos de Bronconeumología (English Edition), vol. 47, no. 1, pp. 35–40, 2011.
[2] É. Ó. Muirí, “Framing software component transparency: Establishing a common software bill of material (SBOM),” NTIA, Nov. 12, 2019.
[3] M. D. Wilkinson et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Scientific Data, vol. 3, no. 1, pp. 1–9, 2016. 

Please contact: 
Hubert Schölnast 
University of Applied Sciences St. Pölten, Austria
This email address is being protected from spambots. You need JavaScript enabled to view it.

 

Next issue: October 2026
Special theme:
Quantum Technology
Call for the next issue
Image ERCIM News 145 cover
This issue in pdf

 

Image ERCIM News 145 epub
This issue in ePub format

Get the latest issue to your desktop
RSS Feed