by Olivier Parisot and Benoît Otjacques (Luxembourg Institute of Science and Technology)
In order to support knowledge extraction from data streams, we propose a visual platform for quickly identifying main features and for computing predictive models in real time. To this end, we have adapted state-of-the-art algorithms in stream learning and visualisation.
Useful information can be extracted from the high-frequency data streams that are common in various domains. For example, textual data from social media such as Twitter and Facebook can be studied to extract hot topics and anticipate trends. In a different scenario, numerical data collected by a network of environmental sensors can be inspected in order to capture events that could precede potential disasters such as floods, storms or pollution peaks.
Therefore, numerous data mining techniques have recently been proposed in order to extract predictive models from data streams. On the one hand, classical analytics techniques can be applied to streams by working on a bounded pool of observations (for example, a sliding window). On the other hand, specific online/incremental methods can be applied to refresh results dynamically. In both cases, a well-chosen data obsolescence strategy is necessary so that the data windows considered are both significant and up to date, and so that the methods remain efficient (without accessing too much historical data).
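The window-based strategy described above can be sketched as follows. This is a minimal illustration and not the platform's code: the class name `SlidingWindowMean` and its `capacity` parameter are assumptions introduced here. The sketch keeps only the most recent observations and refreshes a statistic incrementally as each value arrives, so that obsolete data are dropped automatically.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: mean of the most recent `capacity` observations of a stream. */
final class SlidingWindowMean {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int capacity;
    private double sum = 0.0;

    SlidingWindowMean(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds an observation, evicts the oldest one if the window is full, and returns the current mean. */
    double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > capacity) {
            sum -= window.removeFirst(); // obsolete value leaves the window
        }
        return sum / window.size();
    }

    public static void main(String[] args) {
        SlidingWindowMean mean = new SlidingWindowMean(3);
        for (double v : new double[] {1, 2, 3, 4, 5}) {
            System.out.println(mean.add(v));
        }
    }
}
```

The window size is the obsolescence parameter here: a small window reacts quickly to change but is noisier, while a large one is more stable but slower to forget old data.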
In order to improve stream analytics, we have developed a Java platform to inspect data streams on the fly and to apply leading predictive models. Various third-party components can be integrated into the software, such as WEKA for static data mining or MOA for stream processing.
The platform was designed to support two kinds of data sources:
- Remote streams (i.e., available through web APIs): processed on-the-fly.
- Local streams (i.e., obtained from potentially huge files): iteratively processed in a single-pass, without accessing the previous values.
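To illustrate what single-pass processing of a local stream can look like, the sketch below reads numeric values line by line and updates summary statistics without storing previous values. It is only an illustration under stated assumptions: the class `SinglePassSummary`, the one-value-per-line input format and the chosen statistics are hypothetical and are not taken from the platform.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical single-pass summary: each value is seen once and then discarded. */
final class SinglePassSummary {
    long count = 0;
    double min = Double.POSITIVE_INFINITY;
    double max = Double.NEGATIVE_INFINITY;
    double mean = 0.0;

    /** Updates the statistics with one new value, using constant memory. */
    void accept(double x) {
        count++;
        min = Math.min(min, x);
        max = Math.max(max, x);
        mean += (x - mean) / count; // incremental mean update
    }

    /** Reads one numeric value per line (an assumed format) in a single pass. */
    static SinglePassSummary summarize(Reader source) {
        SinglePassSummary summary = new SinglePassSummary();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    summary.accept(Double.parseDouble(line));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return summary;
    }

    public static void main(String[] args) {
        SinglePassSummary s = summarize(new StringReader("3.5\n1.0\n4.5\n"));
        System.out.println(s.count + " values, mean " + s.mean);
    }
}
```

For a potentially huge file, the `StringReader` would be replaced by a file reader; memory use stays constant because no history is kept.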
The user interface was designed to be reactive (the continuously arriving values are plotted on the fly) and interactive (the end-user keeps real control over the data stream and can play, pause or stop it, or select the processing speed).
Additionally, various analytics modules were developed in order to continuously inspect the considered data streams.
First, we have implemented a “Feature similarity” module to extract the meaningful characteristics from data. More precisely, we have designed an innovative real-time 2D projection, based on multidimensional scaling and dedicated to time series, in order to show the correlations (and, respectively, the inverse correlations) between series over the recent history. As an example, this module could help to determine whether the IBM and Oracle stock quotes are following the same pattern.
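A projection of this kind needs a pairwise dissimilarity between time series as input. The sketch below shows one common choice, a correlation-based distance computed over a recent window of values. It is an assumption made for illustration: the class `CorrelationDistance` and the distance 1 - r are not taken from the platform, and the multidimensional scaling step itself is not shown.

```java
/** Hypothetical helper: correlation-based distances between time series windows. */
final class CorrelationDistance {

    /** Pearson correlation of two windows of equal length. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("windows must have the same length (>= 2)");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0, sxx = 0.0, syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0.0 || syy == 0.0) {
            return 0.0; // constant window: correlation is undefined, treated as 0 in this sketch
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Distance matrix with d = 1 - r: 0 for correlated series, 2 for inversely correlated ones. */
    static double[][] distanceMatrix(double[][] series) {
        int m = series.length;
        double[][] d = new double[m][m];
        for (int i = 0; i < m; i++) {
            for (int j = i + 1; j < m; j++) {
                d[i][j] = 1.0 - pearson(series[i], series[j]);
                d[j][i] = d[i][j];
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[][] recent = {{1, 2, 3, 4}, {2, 4, 6, 8}, {8, 6, 4, 2}};
        System.out.println(java.util.Arrays.deepToString(distanceMatrix(recent)));
    }
}
```

With this distance, series that follow the same pattern end up close together in the projection, while inversely correlated series are placed far apart. Recomputing the matrix on each new window is what would keep the view tied to recent history.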
Second, we have developed a “Predictive modelling” component that creates and refreshes models so that they continuously take recent history into account. Many techniques exist for predictive analytics, and a critical issue for the data scientist is to select the appropriate one according to the data characteristics (completeness, linear/non-linear relationships, noise, etc.) and the tasks to be carried out. Our aim is to make it easier for the user to understand the predictive models. Therefore, we selected decision trees, because they yield models that are both efficient and easy to interpret. For classification, we applied VFDT (Very Fast Decision Tree), the reference method for incremental classification tree induction. For predicting numerical values, we used model trees (i.e., decision trees combined with linear regressions) built with the recent FIMT-DD algorithm.
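VFDT decides when to split a leaf using the Hoeffding bound, which states how many observations are enough to trust that the best split attribute is really better than the second best. The sketch below illustrates only this test, not a full tree learner and not the platform's implementation (MOA provides one). The class name `HoeffdingSplitCheck`, the method names and the parameter values are assumptions chosen for illustration.

```java
/** Hypothetical illustration of the split test used by Hoeffding trees such as VFDT. */
final class HoeffdingSplitCheck {

    /**
     * Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
     * range (R) is the range of the split criterion (log2 of the number of classes
     * for information gain), delta the allowed error probability, n the number of
     * observations seen at the leaf.
     */
    static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /**
     * Split when the best attribute beats the second best by more than epsilon,
     * or when epsilon is so small that the two candidates are considered tied.
     */
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n, double tieThreshold) {
        double epsilon = hoeffdingBound(range, delta, n);
        return (bestGain - secondBestGain > epsilon) || (epsilon < tieThreshold);
    }

    public static void main(String[] args) {
        // Two-class problem with information gain: R = log2(2) = 1.
        System.out.println(hoeffdingBound(1.0, 1e-7, 1000));
        System.out.println(shouldSplit(0.30, 0.10, 1.0, 1e-7, 1000, 0.05));
    }
}
```

Because epsilon shrinks as n grows, a leaf that cannot be split confidently now can simply wait for more data, which is what lets the tree grow in a single pass over the stream.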
The platform was applied to various real-world data streams (Figure 1). Initially, we tested our approach on live stock quotes from the Yahoo Finance website (CAC40 index, one record per second), which allowed us to check the “Feature similarity” module in real-life settings. Then, we processed the data from the French electricity transmission system operator (RTE) in order to analyse energy consumption in France (oil, coal, gas, nuclear, wind, solar, bioenergy, hydraulic and pumping, as 15-min time series). In this case, we had to process heterogeneous values with different scales, which raises questions such as how to use both gas and oil consumption in order to produce meaningful predictions.
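One common way to make series with different scales comparable in a streaming setting is to standardise each value against a running mean and standard deviation. The sketch below uses Welford's online algorithm for this. It is a generic technique shown for illustration, not necessarily the one used in the platform, and the class name `OnlineStandardizer` is hypothetical.

```java
/** Hypothetical online z-score standardiser based on Welford's algorithm. */
final class OnlineStandardizer {
    private long count = 0;
    private double mean = 0.0;
    private double m2 = 0.0; // running sum of squared deviations from the mean

    /** Updates the running statistics with x and returns its z-score (0 until it is defined). */
    double update(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean);
        double std = stdDev();
        return std == 0.0 ? 0.0 : (x - mean) / std;
    }

    double mean() {
        return mean;
    }

    /** Sample standard deviation of the values seen so far (0 if fewer than two values). */
    double stdDev() {
        return count < 2 ? 0.0 : Math.sqrt(m2 / (count - 1));
    }

    public static void main(String[] args) {
        OnlineStandardizer small = new OnlineStandardizer();
        OnlineStandardizer large = new OnlineStandardizer();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            // The same pattern at two very different scales yields the same z-scores.
            System.out.println(small.update(v) + " " + large.update(1000 * v));
        }
    }
}
```

Each series gets its own standardiser, so a variable measured in large units no longer dominates one measured in small units. Note that these statistics cover the whole history; combining them with a window or a decay factor would be needed to follow changes in scale over time.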
Figure 1: Live visualisation of quotes (Yahoo Finance API), feature similarity analysis on the French energy consumption data (RTE – eco2mix) and extreme flooding prediction using hydrological time series from Luxembourg.
Finally, we inspected hydrological data obtained from the hydrometric stations in Luxembourg: the sensor network considered is composed of 24 stations and continuously produces 15-min time series. Owing to the poor quality of the data, we had to apply techniques that are robust to noise and missing values in sensor data: for example, the platform was successfully used to fill data gaps in hydrological time series.
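As a simple illustration of gap filling, the sketch below replaces missing values (encoded as `NaN`, an assumption) by linear interpolation between the nearest valid neighbours. This is deliberately a basic baseline and not the method used in the work described above; the class name `GapFiller` is hypothetical.

```java
import java.util.Arrays;

/** Hypothetical baseline: fills interior gaps (NaN) by linear interpolation. */
final class GapFiller {

    /** Returns a copy of the series where interior NaN runs are interpolated. */
    static double[] fillGaps(double[] values) {
        double[] out = values.clone();
        int lastValid = -1;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                continue; // still inside a gap (or before the first valid value)
            }
            if (lastValid >= 0 && i - lastValid > 1) {
                // Fill the gap between the two valid neighbours.
                double step = (out[i] - out[lastValid]) / (i - lastValid);
                for (int k = lastValid + 1; k < i; k++) {
                    out[k] = out[lastValid] + step * (k - lastValid);
                }
            }
            lastValid = i;
        }
        return out; // leading and trailing gaps are left as NaN
    }

    public static void main(String[] args) {
        double[] level = {1.0, Double.NaN, Double.NaN, 4.0, Double.NaN};
        System.out.println(Arrays.toString(fillGaps(level)));
    }
}
```

In a streaming context, a gap can only be interpolated once the next valid value has arrived, so this approach introduces a delay equal to the gap length. Long gaps in river flow records usually call for more informed methods that exploit neighbouring stations or similar past episodes.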
We plan to extend the software in order to help data scientists quickly identify and eliminate bad data that pollute predictive models. To this end, we are implementing real-time modules for extreme value detection, missing data imputation and live clustering.
H. Nguyen et al.: “A survey on data stream clustering and classification”, Knowledge and Information Systems, Springer, 12/2015
E. Ikonomovska, J. Gama: “Learning model trees from data streams”, Discovery Science, 10/2008
L. Giustarini et al.: “A user-driven case-based reasoning tool for infilling missing values in daily mean river flow records”, Environmental Modelling and Software, 8/2016