MR-DIS: A Scalable Instance Selection Algorithm using MapReduce on Spark
by Álvar Arnaiz-González, Alejandro González-Rogel and Carlos López-Nozal (University of Burgos)
Efficient methods are required to process increasingly massive data sets. Most pre-processing techniques (e.g., feature selection, prototype selection) and learning processes (e.g., classification, regression, clustering) do not scale to huge data sets, and many problems emerge as the volume of information grows. Here, parallelisation can help. Recently, many parallelisation techniques have been developed to simplify the tedious and difficult task of scheduling and planning parallel executions. MR-DIS applies one of the most successful of these paradigms, MapReduce, to the instance selection method ‘Democratic Instance Selection’. The main strength of this algorithm is its complexity: linear in the number of examples, i.e., O(n).