by José García-Nieto (ITIS, University of Málaga), Virginia García Millán (ITIS, University of Málaga), and José F. Aldana-Montes (ITIS, University of Málaga)
Researchers from ITIS Software are developing big data workflows for processing and analysing satellite remote sensing data for earth observation. These workflows handle massive amounts of data to deliver value-added applications in agroforestry, the environment, smart cities, and society in general. As a use case, this paper presents the use of Sentinel-2 satellite data to generate a land-cover map over a large area, the Mediterranean basin, using machine-learning algorithms and big data analysis.
Remote sensing-based earth observation (EO) is becoming increasingly important as it provides a robust technological framework for developing innovative applications in diverse areas, including climate change, precision agriculture, smart urban planning, and land cover evolution. Within this context, projects such as GreenSenti [L1] and EnBiC2-Lab2 (LifeWatch ERIC) [L2] are being developed by experts from ITIS Software (University of Málaga) to provide EO-based research tools to support investigations into the functions and services of agroforestry and ecosystems, aiding society in addressing critical challenges.
One of the key tasks in these projects, and one in which EO plays a major role, is land cover (LC) mapping, which provides a meaningful way to describe the Earth’s surface. Spatially detailed land-cover data are essential for local, national, and international decisions regarding natural resource management. Land cover influences the functional relationship between topography, climate, and soil, and it offers biophysical insights into the environment and the factors driving change. In an increasingly digital world, LC mapping has evolved into a big data challenge, in which managing the sheer volume of data is itself a demanding task. Remote sensing generates extensive datasets with distinctive properties: they are multi-source, multi-scale, high-dimensional, dynamic, and nonlinear. Moreover, satellite remote sensing has long been considered the most effective method and data source for large-scale land cover classifications.
Mapping land cover on a large scale presents significant challenges due to spectral heterogeneity and the complexity of terrain. The European Copernicus programme, supported by the Sentinel satellite missions, provides an effective solution for mapping vegetation on global, regional, and local scales, with periodic and repeated observations. Since 2015, Sentinel-2 has continuously collected optical imagery, delivering high-resolution multispectral images (10–60 m) every 2 to 5 days, enabling comprehensive global monitoring. Other international initiatives, such as the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) aboard NASA's Terra satellite, deliver multispectral imagery with spatial resolutions ranging from 15 to 90 meters. ASTER's dual cameras enable the generation of a stereoscopic digital surface model of the earth with a nominal resolution of 25 meters per pixel.
These satellite platforms are widely used today for large-scale land cover mapping, incorporating multiple images and other data to support monitoring, mapping, and modelling activities. To do so, big data workflows are orchestrated and deployed in high-performance processing environments, which carry out tasks of massive data collection and preprocessing, classification strategies, stratification, integration of auxiliary data, optimization of processes and accuracy assessment. Classifying land cover over vast regions (involving several countries) requires methods that are both reliable and reproducible, presenting unique challenges that go beyond those of traditional large-area image classifications. The immense volume of unprocessed remote sensing data introduces the “four Vs” of big data: volume, variety, velocity, and veracity, which are recognized as its core challenges.
To identify effective techniques for mapping land cover patterns, the remote sensing community has explored various approaches, among which supervised machine learning (ML) classification methods are commonly employed. Of these, Random Forest (RF) has been shown to perform efficiently in land cover mapping, thanks to its ability to manage high-dimensional data and multicollinearity, while being both fast and resistant to overfitting.
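To make the idea behind Random Forest concrete, the following is a minimal, pure-Python sketch of its two ingredients: each tree is trained on a bootstrap sample of the pixels, and each node considers only a random subset of the features, with the final label decided by majority vote. It is not the implementation used in the project; in practice a library implementation would be used. The two features (NIR and SWIR reflectance), the training values, and all function names are assumptions made purely for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]


def best_split(X, y, feature_ids):
    """Return the (feature, threshold) pair with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in feature_ids:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold: midpoint between neighbours
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(X, y, rng, n_sub, depth=0, max_depth=5):
    """Grow one tree; every node looks at a random subset of the features."""
    if depth >= max_depth or len(set(y)) == 1:
        return majority(y)  # leaf: a class label
    split = best_split(X, y, rng.sample(range(len(X[0])), n_sub))
    if split is None:
        return majority(y)
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in left], [y[i] for i in left],
                       rng, n_sub, depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right],
                       rng, n_sub, depth + 1, max_depth))


def predict_tree(node, row):
    """Follow the splits until a leaf (a plain label) is reached."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training pixels."""
    rng = random.Random(seed)
    n_sub = max(1, int(len(X[0]) ** 0.5))  # features tried per node
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample with replacement
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng, n_sub))
    return forest


def predict_forest(forest, row):
    """Majority vote over all trees."""
    return majority([predict_tree(tree, row) for tree in forest])


# Hypothetical training pixels: (NIR, SWIR) reflectance, values invented for illustration.
X_train = [(0.03, 0.01), (0.04, 0.02), (0.05, 0.02), (0.06, 0.03), (0.04, 0.01), (0.05, 0.03),
           (0.40, 0.15), (0.44, 0.17), (0.47, 0.18), (0.50, 0.20), (0.42, 0.16), (0.48, 0.19)]
y_train = ["water"] * 6 + ["forest"] * 6

forest = fit_forest(X_train, y_train)
print(predict_forest(forest, (0.05, 0.02)))
print(predict_forest(forest, (0.45, 0.18)))
```

Because every tree sees a different sample and different features, individual trees disagree on borderline pixels, and averaging their votes is what gives the method its resistance to overfitting.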
In the context of the initiatives presented above, a methodology to streamline big data workflows for land cover classification over extensive areas has been implemented [1]. As a validation use case, the complete Mediterranean Basin is mapped, which is covered by more than 450 Sentinel-2 tiles and around 1200 ASTER tiles. Three seasonal datasets from Sentinel-2 are used, along with several derived products from both Sentinel-2 and ASTER. Together they amount to more than 4 TB of information, which considerably increases the computational demands of processing. To tackle this issue, several specific parts of the workflow have been parallelized using Dask, a Python library for parallel and distributed computing.
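The tile-level parallelism described above follows a simple pattern: each tile is processed independently, and the per-tile results are gathered at the end. The sketch below illustrates only that pattern, using Python's standard `concurrent.futures` module as a stand-in for Dask so that it runs without extra dependencies; in Dask the same structure would typically be expressed with `dask.delayed` or a distributed client's `map`. The derived product (mean NDVI per tile), the tile identifiers, and the reflectance values are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def ndvi(red, nir):
    """Normalised difference vegetation index for one pixel."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


def process_tile(tile):
    """Compute the mean NDVI of one tile (a hypothetical derived product)."""
    values = [ndvi(r, n) for r, n in zip(tile["red"], tile["nir"])]
    return tile["id"], sum(values) / len(values)


def process_all(tiles, workers=4):
    """Map process_tile over all tiles in parallel and collect results by tile id."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_tile, tiles))


# Hypothetical tiles: identifiers and reflectance values are invented.
tiles = [
    {"id": "tile_a", "red": [0.25, 0.25], "nir": [0.75, 0.75]},
    {"id": "tile_b", "red": [0.5, 0.5], "nir": [0.5, 0.5]},
    {"id": "tile_c", "red": [0.0], "nir": [0.0]},
]

results = process_all(tiles)
print(results)
```

Since no tile depends on another, the work scales by adding workers; a threaded pool in pure Python shows the structure only, whereas a scheduler such as Dask is what spreads real raster workloads across cores or machines.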
The overall approach to big data storage and management is illustrated in Figure 1, along with metadata protocols. The first phase comprises data collection and distributed processing at tile level. The resulting data are then stored in a MinIO S3-compatible object storage system, while metadata are stored in a NoSQL MongoDB database for efficient indexing and querying. After this, data analysis and validation are performed, which involve generating a high-quality training set and building and testing the ML model. This last phase is refined until an accuracy higher than 90% is reached, so that land-cover maps are obtained with low error rates and the following identified labels: closed and open forests (green and olive), shrubland (orange), herbaceous vegetation (yellow), herbaceous wetland (cyan), bare vegetation (grey), cropland (pink), built-up (red) and permanent water bodies (blue) (Figure 1, right). The final land-cover maps are then made available through WebGIS visualisation services.
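The accuracy check that drives this refinement loop can be sketched as follows. This is a minimal illustration rather than the project's validation code: it assumes accuracy is measured as overall accuracy on a labelled validation set, and the sample labels below are invented. Only the 90% threshold and the class names come from the text.

```python
from collections import Counter


def confusion_matrix(reference, predicted):
    """Count (reference, predicted) label pairs."""
    return Counter(zip(reference, predicted))


def overall_accuracy(reference, predicted):
    """Fraction of validation samples whose predicted label matches the reference."""
    matrix = confusion_matrix(reference, predicted)
    correct = sum(n for (ref, pred), n in matrix.items() if ref == pred)
    return correct / len(reference)


def producers_accuracy(reference, predicted):
    """Per-class recall: correctly mapped samples divided by reference samples."""
    matrix = confusion_matrix(reference, predicted)
    totals = Counter(reference)
    return {label: matrix[(label, label)] / totals[label] for label in totals}


TARGET = 0.90  # accuracy threshold quoted in the text

# Hypothetical validation samples: one shrubland pixel is mislabelled as cropland.
reference = ["cropland"] * 8 + ["shrubland"] * 6 + ["built-up"] * 6
predicted = ["cropland"] * 8 + ["shrubland"] * 5 + ["cropland"] + ["built-up"] * 6

accuracy = overall_accuracy(reference, predicted)
meets_target = accuracy > TARGET
print(accuracy, meets_target)
print(producers_accuracy(reference, predicted))
```

Reporting per-class figures alongside the overall value helps show which classes still need better training samples when the target is not met.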
Figure 1: Global overview of the strategy for large-scale Earth Observation data integration and analysis in the context of land cover in the Mediterranean basin scenario. Approximately 450 Sentinel-2 and around 1200 ASTER tile products were processed (> 4 TB). Data are collected, pre-processed and stored in a high-scale MinIO repository. MongoDB is used for data indexing and flexible querying. Additionally, RDF mapping is performed according to the RESEO ontology, so that reasoning tasks can be defined to support knowledge extraction. The final phases consist of training data management and machine learning model generation and testing. As a result, land cover maps are obtained with identified labels: closed and open forests (green and olive), shrubland (orange), herbaceous vegetation (yellow), herbaceous wetland (cyan), bare vegetation (grey), cropland (pink), built-up (red) and permanent water bodies (blue). This graphical scheme is a composition of figures found in [1] and [2].
Another key component in this knowledge-driven approach involves providing human experts with domain knowledge representation, supported by data standardisation and semantic integration of sources. To this end, the use of ontologies and semantic web technologies has proved highly successful for knowledge representation in many fields, and earth observation is no exception. This is tackled in this project by the RESEO ontology [2], which considers the special nature and structure of different satellite and airborne data products. It is implemented in standard OWL 2, and an RDF repository has been generated according to it to allow advanced SPARQL querying. This component allows the integration, reasoning and linking of heterogeneous data, such as meteorology, historical crop records and land use. This will support the construction of advanced applications on top of it for future experts.
Links:
[L1] https://khaos.uma.es/green-senti
[L2] https://enbic2lab.uma.es/
[L3] https://itis.uma.es/
References:
[1] A. M. Burgueño, J. F. Aldana-Martín, M. Vázquez-Pendón, et al., “Scalable approach for high-resolution land cover: a case study in the Mediterranean Basin,” Journal of Big Data, vol. 10, p. 91, 2023. https://doi.org/10.1186/s40537-023-00770-z
[2] J. F. Aldana-Martín, J. García-Nieto, M. M. Roldán García, and J. Aldana Montes, “Semantic modelling of Earth observation remote sensing,” Expert Systems with Applications, vol. 187, p. 115838, 2021. https://doi.org/10.1016/j.eswa.2021.115838
Please contact:
Virginia García Millán (ITIS, University of Málaga, Spain)
José García-Nieto (ITIS, University of Málaga, Spain)