by Álvar Arnaiz-González, Alejandro González-Rogel and Carlos López-Nozal (University of Burgos)

Efficient methods are required to process increasingly massive data sets. Most pre-processing techniques (e.g., feature selection, prototype selection) and learning processes (e.g., classification, regression, clustering) are not suitable for dealing with huge data sets, and many problems emerge as the volume of information grows. Here, parallelisation can help. Recently, many parallelisation techniques have been developed to simplify the tedious and difficult task of scheduling and planning parallel executions. One such technique is the instance selection method ‘Democratic Instance Selection’, which uses the successful paradigm MapReduce. The main strength of this algorithm is its complexity: linear in the number of examples, i.e., *O(n)*.

MapReduce is a parallel programming paradigm for processing massive data sets. It takes a set of pairs <key, value> and applies two separate and distinct tasks to it: a Map task and a Reduce task. The Map tasks receive the input data and produce an intermediate set of pairs <key,value>. Then, the Reduce tasks combine these intermediate values and return an output, which will be the final result of the method [1]. The main advantage of this technique is the opacity on the management of the resources, which provides reliable and fault tolerance executions.

Democratic instance selection algorithm (DIS) [2] is based on the well-known paradigm ‘divide and conquer’. It starts by performing a fixed number of rounds r. Every round starts by splitting the data set into subsets of approximately equal size and continues by applying a classic instance selection algorithm independently over the different partitions. Those instances selected by the algorithm for removing receive a vote. After having performed a predefined number of rounds, those instances with more votes than a threshold are removed (the process of determining the threshold is beyond the scope of this article but details can be found in [2]).

Below we present a summary of the adaptation of DIS method to the MapReduce paradigm. This new approach has been called MapReduce DIS or, for short, MR-DIS and its complete implementation, written in Scala and for the Spark framework, is publicly accessible in the following GitHub repository: https://bitbucket.org/agr00095/tfg-alg.-seleccion-instancias-spark

As mentioned above, by applying MapReduce we work with <key, value> pairs and divide the initial algorithm into two phases (Map and Reduce). In our implementation, we use a set of <votes, inst>. <inst> represents the instance attributes, while <votes> accumulates the number of rounds that the instance has not been selected. The <votes> value is initialised at the beginning of the MapReduce DIS algorithm to zero. Figure 1 shows a possible scenario: the original data set has n instances, and the MapReduce DIS algorithm divides them into partitions with one thousand instances each and performs ten rounds of voting. After that, a threshold of four votes is computed and instances whose votes are below this threshold are selected. Below, both phases of the MapReduce model are explained:

- Map performs the rounds of voting, updating the value ‘votes’ of the pairs in each iteration. At the beginning of each round, the original data set is split into disjoint partitions. In the current version, the partition and distribution of instances are randomly made, thus the value of the key (number of votes of each instance) has no influence. The instance selection algorithm used in each subset is, put simply, the first proposal for condensation (condensed nearest neighbour or CNN), and is applied independently to each partition.
- Reduce estimates the votes’ thresholds and selects those instances that have enough votes. For this task we take into consideration the number of votes each instance has received during the Map phase. In order to estimate the threshold it is necessary to generate r number of groups (as many as rounds), each containing those instances whose number of votes is equal or less than the number of the group. Then, the computation of the threshold requires a classifier for estimating the error. As expected, the calculation of the error by using a sequential algorithm would make the task impracticable (it would produce a bottleneck). For this reason, in the actual implementation, a parallel implementation of nearest neighbour algorithm (kNN) [3] has been used. Finally, those instances with a number of votes below the threshold are selected for the output.

This work was supported by Ministerio Español de Economía y Competitividad under Grant No. TIN2015-67534-P

**Links:**

http://hadoop.apache.org/

http://spark.apache.org/

https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

https://bitbucket.org/agr00095/tfg-alg.-seleccion-instancias-spark

**References:**

[1] J. Dean, S. Ghemawat: “MapReduce: Simplified Data Processing on Large Clusters Commun”, ACM, 2008, 51, 107-113.

[2] C. García-Osorio, A. de Haro-García, N. García-Pedrajas: “Democratic instance selection: A linear complexity instance selection algorithm based on classifier ensemble concepts”, Artificial Intelligence, 2010, 174, 410 – 441.

[3] J. Maillo, S. Ramírez, I. Triguero, F. Herrera: “kNN-IS: An Iterative Spark-based design of the k-Nearest Neighbors classifier for big data Knowledge-Based Systems, http://dx.doi.org/10.1016/j.knosys.2016.06.012

**Please contact:**

Álvar Arnaiz-González, Alejandro González-Rogel, Carlos López Nozal, University of Burgos, Spain