by Giulia Millitarì (University of Pisa and CNR-ISTI), Alessio Ferrari (CNR-ISTI) and Giorgio O. Spagnolo (CNR-ISTI)
We describe the initial and crucial phase of an analysis for a project belonging to Spoke 4 on “Railway Transportation” of the Italian National Center for Sustainable Mobility (MOST) [L1], which is part of the National Recovery and Resilience Plan (PNRR). The objective of the project is to implement a predictive maintenance strategy within the decision-making process of Trenord [L2], a railway company that operates regional passenger trains, mostly in Lombardy. Before conducting the analysis, it was essential to perform extensive data mining procedures so that the data from the remote diagnostic system could be used to extract meaningful insights and to apply machine learning techniques effectively.
Trains are a sustainable and efficient way to transport people and goods over long distances. The core pillars of rail transport are efficiency, safety, and reliability, all of which are guaranteed and managed through maintenance activities. Effective maintenance strategies help prevent equipment failures and unplanned downtimes, reducing costs and improving overall service performance.
In recent years, predictive maintenance [1] has emerged as an efficient and promising strategy, gaining prominence in the era of Industry 4.0, in which vast amounts of data are collected. In its data-driven form, this type of maintenance exploits information such as sensor and maintenance data to anticipate failures and optimize maintenance schedules accordingly.
For the above-mentioned project, Trenord provided data on the maintenance plan, as well as service and diagnostic events for trains in its fleet. Our analysis specifically focused on data related to the traction system of a TSR (Regional Service Train).
The data included two separate datasets on scheduled and corrective maintenance, specifying the type of maintenance, start and end dates, and the wagon with issues. The service data provided details such as departure and arrival stations and kilometers traveled during each service. Additionally, diagnostic data related to the traction system (e.g., power supply loss, water temperatures exceeding a fixed threshold, absence of motor speed signal, and electrical current imbalances) were extracted from Trenord’s remote diagnostics system. These data provide insights into the train’s operational status, performance, and potential issues, while the system enables remote monitoring of the rolling stock by collecting diagnostic events detected by the train’s control units. Figure 1 illustrates a simplified version of the system’s structure. The diagnostic events dataset comprised 1,892,948 observations containing information such as the alert criticality, the affected wagon, the timestamp, the train speed, and the latitude and longitude GPS coordinates of the train.
Figure 1: Diagram of on-board hardware structure.
Given the heterogeneity and suboptimal quality of the available datasets, we prioritized a comprehensive data analysis process focused on ensuring data quality. This involved necessary steps in data cleaning and pre-processing to achieve standards of accuracy, timeliness, and interpretability. Consequently, we were able to create a unified dataset containing all the information required to compute meaningful descriptive statistics and accomplish the research objectives. Figure 2 summarizes the entire data mining procedure performed specifically on the diagnostic events dataset.
Figure 2: Diagram of data mining procedures on the diagnostic events dataset.
Through both the documents provided by Trenord and a detailed analysis of the data, such as plotting variable distributions or performing cross-tabulation (contingency tables), we identified several issues related to data incompleteness and inconsistency, particularly in the diagnostic events dataset. The most significant data cleaning effort focused on addressing duplicate data and aberrant values, which had multiple underlying causes. Duplicates were identified not only as records with identical values across all variables, but also as records repeated because of specific service-related situations and issues within the remote diagnostic system. Aberrant values, including false declarations, measurement errors, input mistakes, and inconsistencies, were detected by examining each variable’s values and by consulting Trenord staff, which required a detailed and thorough analysis. Overall, we addressed these data quality issues by removing instances based on ad hoc rules and criteria, using algorithms that we developed for the specific problems identified.
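As an illustration of this kind of rule-based cleaning, the following is a minimal sketch and not the algorithms developed in the project. The field names, the event code, the 5-second repeat window, the 200 km/h speed limit, and the treatment of a (0, 0) GPS fix as invalid are all assumptions made for the example. It uses only the Python standard library so that it stays self-contained.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DiagnosticEvent:
    """One diagnostic event; all field names are illustrative assumptions."""
    wagon: str
    code: str            # hypothetical diagnostic event code
    timestamp: datetime
    speed_kmh: float
    lat: float
    lon: float


def drop_exact_duplicates(events):
    """Keep the first occurrence of events that are identical on every field."""
    seen = set()
    unique = []
    for ev in events:
        if ev not in seen:
            seen.add(ev)
            unique.append(ev)
    return unique


def drop_near_duplicates(events, window=timedelta(seconds=5)):
    """Drop an event repeating the same wagon and code within `window`
    of the last kept one (a hypothetical rule, not Trenord's)."""
    last_kept = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.wagon, ev.code)
        prev = last_kept.get(key)
        if prev is not None and ev.timestamp - prev <= window:
            continue
        last_kept[key] = ev.timestamp
        kept.append(ev)
    return kept


def is_plausible(ev, max_speed_kmh=200.0):
    """Assumed plausibility checks on speed and GPS coordinates."""
    return (
        0.0 <= ev.speed_kmh <= max_speed_kmh
        and -90.0 <= ev.lat <= 90.0
        and -180.0 <= ev.lon <= 180.0
        and not (ev.lat == 0.0 and ev.lon == 0.0)  # (0, 0) treated as a missing fix
    )


def clean(events):
    """Apply the three rules in sequence and return the surviving events."""
    events = drop_exact_duplicates(events)
    events = drop_near_duplicates(events)
    return [ev for ev in events if is_plausible(ev)]
```

In a setting like the one described above, each rule would be derived from the documentation and from discussion with domain experts rather than fixed in advance, and the thresholds shown here would need to be validated against the actual data.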
Later, we carried out data pre-processing tasks: data integration to combine the multiple datasets; data transformation to create new variables not included in the original datasets, such as seasonal effects and service-related factors; and data reduction to aggregate data using an a priori criterion and so mitigate the curse of dimensionality. Additionally, we applied an undersampling (temporal downsampling) technique to convert the data from a timestamp-based frequency to a daily frequency, helping to regularize the time series and potentially transform the dataset into a cross-sectional one. As a result, all variables had to be reevaluated, as their meanings changed in the daily frequency format. In this format, each row represents a day’s events, so, for example, the speed variable reflects the average speed of all events on that day.
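The conversion to a daily frequency can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's actual procedure: the tuple layout of the events, the `km_by_day` lookup standing in for the service dataset, the output column names, and the use of meteorological seasons as the seasonal variable are all hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean

# Meteorological seasons (northern hemisphere), used as an assumed seasonal variable.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def to_daily(events, km_by_day):
    """Aggregate timestamped events into one row per (wagon, day).

    events: iterable of (wagon, timestamp, speed_kmh) tuples
    km_by_day: dict mapping (wagon, date) to kilometers traveled,
               standing in for the integrated service data
    """
    # Group event speeds by wagon and calendar day.
    groups = defaultdict(list)
    for wagon, ts, speed in events:
        groups[(wagon, ts.date())].append(speed)

    rows = []
    for (wagon, day), speeds in sorted(groups.items()):
        rows.append({
            "wagon": wagon,
            "day": day,
            "n_events": len(speeds),                 # reduction: count per day
            "mean_speed_kmh": mean(speeds),          # speed becomes a daily average
            "km_traveled": km_by_day.get((wagon, day)),  # integration; None if no service record
            "season": SEASON_BY_MONTH[day.month],    # transformation: derived variable
        })
    return rows
```

Note that days without any diagnostic event produce no row in this sketch; obtaining a regular daily series would require filling in those days explicitly.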
In this way, all the heterogeneous datasets were combined into a single, smaller dataset that is more suitable for computing summary statistics and performing statistical analyses, both cross-sectional and time series. This also supports the final research objective of implementing a predictive maintenance strategy within Trenord’s decision-making process. This initial phase of the project highlighted the importance of thoroughly examining the nature of the data, revealing weaknesses in the data collection system and emphasizing the challenges of enabling immediate predictive maintenance.
Spoke 4 on “Railway Transportation” of the National Centre for Sustainable Mobility (MOST) received funding from the European Union – NextGenerationEU (PIANO NAZIONALE DI RIPRESA E RESILIENZA (PNRR) – MISSIONE 4 COMPONENTE 2, INVESTIMENTO 1.4 – D.D. 1033 17/06/2022, CN00000023). The project will run until February 2026 and is coordinated by Marco Bocciolone from the Polytechnic University of Milan. Further partners in T3.1 include the universities of Florence, Naples, Parma, and Roma “Sapienza”, as well as the industrial partners Accenture, Hitachi, Lutech, and Trenord.
Links:
[L1] https://www.centronazionalemost.it/eg/
[L2] https://www.trenord.it/
Reference:
[1] M. Binder, V. Mezhuyev and M. Tschandl, “Predictive Maintenance for Railway Domain: A Systematic Literature Review,” in IEEE EMR, doi: 10.1109/EMR.2023.3262282
Please contact:
Giulia Millitarì, CNR-ISTI, Italy