by Andras Benczur (HUN-REN SZTAKI) and Dominik Ślęzak (University of Warsaw)
Large-scale data analytics empowers organizations to harness the full potential of the vast amounts of data they generate and collect. By driving innovation, enhancing business operations, personalizing customer experiences, and improving risk management, insights derived from large-scale data analytics are critical for gaining a competitive advantage and making informed, data-driven decisions. With the exponential growth of data generated by businesses, consumers, and connected devices, it is essential to address key challenges in handling Big Data, processing real-time information, and enabling timely, actionable insights.
The ERCIM News Special Theme on Large-Scale Data Analytics focuses on two main areas. On the one hand, it highlights cutting-edge techniques such as machine learning, predictive modeling, and advanced analytics methods. On the other hand, it explores applications across diverse sectors, industries, and societal challenges.
Articles on Big Data Infrastructure and Technologies delve into topics such as distributed data processing, the edge-cloud continuum, federated data analysis, and the integration of heterogeneous data sources. A special focus is given to data analytics for Open Science. From a technological perspective, machine vision and spatial data analysis emerge as key tools in several domains. Articles on Big Data applications span various verticals, including healthcare, energy, transportation, robotics, finance, agri-food, environment, sustainability, and science. In many cases, critical issues of data governance, privacy, and security are addressed. These include techniques for anonymization and de-identification of large datasets, as well as ensuring transparency, reproducibility, and explainability in large-scale data systems.
Data infrastructure and ecosystems
The first group of articles focuses on data infrastructure and ecosystems. Barba-González et al. present a semantic-driven workflow automation system for large-scale analytics, with applications in areas such as machine learning and e-Science. The large-scale analytics platform developed by Assante et al. facilitates collaboration and advances reproducible research by enabling researchers to share, reuse, and build upon each other’s work across diverse scientific disciplines. Tzagkarakis et al. discuss the decision intelligence platform of the TwinODIS Horizon-WIDERA project, which combines AI and operations research to address challenges in large-scale, uncertain systems. In the Italian National Center for Sustainable Mobility project, Millitarì et al. developed a data mining platform designed for predictive maintenance in the railway sector. Sartzetakis and Chamanara describe the results of the DataBri-X project, which introduced a trustworthy AI platform that aligns with European values and ethical standards, emphasizing the transformation of data-sharing ecosystems.
Scientific data
Articles focusing on scientific data are closely linked to initiatives such as the European Open Science Cloud (EOSC), the AI4EU on-demand platform and ecosystem, and the EGI Foundation's infrastructure services. This area partly overlaps with developments in data infrastructure, as illustrated in the work of Assante et al. Scientific digital twins are explored by Manzi et al., with an emphasis on reusing modular components. The article describes use cases including flood impact modeling and early warning systems, cyclone projections, the Virgo Gravitational Wave Interferometer Noise simulations, and high-energy physics particle detector simulations. Sipos and Schaap discuss an AI-driven platform developed in the iMagine project, designed to analyze vast amounts of image data for aquatic sciences. This platform connects with the EOSC and AI4EU initiatives and seeks to expand its scope through open calls for additional use cases.
Application verticals
The application verticals in this Special Issue cover a wide range of domains. Articles focusing on data infrastructure and open science address areas such as science (Barba-González et al., Sipos and Schaap, Manzi et al.), transportation (Millitarì et al.), and the environment (Tzagkarakis et al., Manzi et al.), among others.
Medwenitsch et al. discuss how advanced data analysis can transform agriculture in Austria’s climate-stricken Seewinkel region. In the energy sector, Klikovits and Fabianek propose solutions to overcome challenges such as security, privacy, and GDPR compliance, which often impede data sharing, analysis, and interpretation. Similarly, Rotskos et al. present the outcomes of the Glaciation project, which focuses on scalable anomaly detection within the edge-cloud continuum for grid management. In the realm of sustainability, Suta et al. analyze information on sustainability extracted from large-scale European digital financial reports.
Additional verticals are explored in articles addressing computer vision and spatial data applications. Finally, medical and health applications form a distinct and dedicated section of this Special Issue, emphasizing their unique challenges and contributions.
Computer vision or spatial data
Several articles highlight computer vision or spatial data as their technological focus. For example, Hubner and Nausner describe the Multimodal Fusion Architecture for Sensor Applications, a robust system for real-time sensor integration that enhances situational awareness and provides precise decision support for railway security. Parisot explores resource-aware detection of satellite streaks in deep sky image streams, utilizing lightweight machine vision on edge devices. Bouchal et al. present GLayer, a GPU-accelerated software platform designed for the fast aggregation, filtering, and visualization of large-scale spatial data, particularly traffic data. García-Nieto et al. discuss their big data workflow for processing and analyzing Earth observation remote sensing satellite data, showcasing its potential for large-scale geospatial analysis.
Medical and health applications
Four articles focus on medical and health applications. Pejo et al. propose a privacy-friendly contribution evaluation technique designed to address the growing issue of selfish incentives in the self-evaluation of medical records. Al-Radhi and Németh explore methods to translate brain activity into clear and intelligible speech, aiming to restore communication abilities for individuals with severe speech disorders. Koltai et al. employ anomaly detection techniques to reduce false alarms in telemonitoring medical devices, achieving this without centralizing sensitive patient data. Finally, Zimeras integrates diverse data sources to create visualizations for digital epidemiology, enhancing the analysis and understanding of health trends.
The Special Theme demonstrates the wide range of verticals in large-scale data analytics addressed in Europe. These developments proceed in close collaboration with European data spaces and open science platforms, and often address reproducibility. The presented results emphasize the European values of openness, fairness, and the protection of personal values and privacy.
Please contact:
Andras Benczur
HUN-REN SZTAKI, Hungary
Dominik Ślęzak
University of Warsaw, Poland