Multiobjective Sparse Ensemble Learning by Means of Evolutionary Algorithms

Ensemble learning can improve the performance of individual classifiers by combining their decisions. The sparseness of ensemble learning has attracted much attention in recent years. In this paper, a novel multiobjective sparse ensemble learning (MOSEL) model is proposed. Firstly, to describe ensemble classifiers more precisely, the detection error trade-off (DET) curve is taken into consideration. The sparsity ratio (sr) is treated as a third objective to be minimized, in addition to the minimization of the false positive rate (fpr) and the false negative rate (fnr). MOSEL thus turns out to be an augmented DET (ADET) convex hull maximization problem. Secondly, several evolutionary multiobjective algorithms are exploited to find sparse ensemble classifiers with good performance, and the relationship between the sparsity and the performance of ensemble classifiers in the ADET space is explained. Thirdly, an adaptive MOSEL classifier selection method is designed to select the most suitable ensemble classifier for a given dataset. The proposed MOSEL method is applied to the well-known MNIST datasets and to a real-world remote sensing image change detection problem. Experimental results on both tasks show that MOSEL performs significantly better than conventional ensemble learning methods.


Introduction
The idea of ensemble learning methods [1] is to construct a set of classifiers with base learning algorithms and then classify new data points by taking a (weighted) vote of their predictions. Generally, ensemble methods combine the predictions of individual methods and can obtain better predictive performance than any individual method alone. Ensemble learning methods have attracted much attention in recent years. Not only have many ensemble algorithms been proposed [2,3], but ensemble learning methods have also been applied to many areas [4,5], such as medical information processing [1] and satellite image classification [6].
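The weighted vote described above can be sketched minimally as follows. This is an illustrative helper, not the paper's implementation; the function name `weighted_vote` and the ±1 label convention are assumptions taken from the binary setting used later in the paper.

```python
import numpy as np

def weighted_vote(predictions, weights):
    """Combine ±1 base-classifier predictions with a weighted vote.

    predictions: (n_samples, n_classifiers) array with entries in {-1, +1}
    weights:     (n_classifiers,) array of non-negative vote weights
    """
    predictions = np.asarray(predictions)
    weights = np.asarray(weights)
    scores = predictions @ weights          # weighted sum of votes per sample
    return np.where(scores >= 0, 1, -1)    # sign of the score, ties -> +1
```

With equal weights this reduces to a plain majority vote; unequal weights let stronger base learners dominate the decision.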
In general, an ensemble learning algorithm is constructed in two steps, i.e., training a number of component classifiers and then combining the predictions of the components. The most prevailing approaches for training component classifiers are bagging [7], boosting [8], random subspace [9], and rotation forest [10].
Recently, research has drawn attention to multiobjective optimization of ensemble learning [11,12], and several evolutionary multiobjective algorithms (EMOAs) have been used to deal with it. Generally, most of this work tries to obtain a set of classifiers with good performance on both diversity and accuracy by using multiobjective optimization algorithms with different objectives. The multiobjective deep belief networks (DBNs) ensemble method was proposed in [13], in which a MOEA was applied to evolve multiple DBNs by considering accuracy and diversity as two conflicting objectives. A divide-and-conquer based optimization framework for ensemble classifier generation was proposed in [12], in which the accuracy of each class was treated as an objective describing the performance of classifiers; besides, maximizing the ensemble size was taken as an additional objective. Pareto image features were applied for candidate classifier generation in [14] by using a multiobjective evolutionary trace transform algorithm. These methods do not consider the redundancy between classifiers or the efficiency of ensemble learning: a large amount of memory is required to store the candidate classifiers, and much computation time is needed to predict the label of each new input instance.

In this paper, we focus on combining the predictions of component classifiers by finding appropriate sparse weight vectors for them. Many works have addressed the complexity of ensemble classifiers by reducing the number of classifiers in the component candidate set. The relationship between ensemble learning and its component classifiers is analyzed in [15], which reveals that a better performance can be obtained by ensembling many instead of all the available classifiers.
A genetic algorithm is adopted to evolve the weights of the component classifiers, showing that it can generate ensemble classifiers of small size but good generalization ability. However, only the accuracy is considered in this method; since the sparsity of ensemble classifiers is not taken into account, the result contains redundant classifiers. Several pruning strategies are analyzed in [16], including reduction error (RE), Kappa pruning (KP), complementarity measure (CM) and margin distance (MD). Matching pursuit (MP) is used to prune the ensemble classifiers in [17] by balancing the diversity and the individual accuracy. In these methods, a greedy strategy is used to search for the optimal classifier set, which easily gets stuck in local optima.
The theoretical and empirical evidence in [18] suggests that a smaller ensemble can often obtain better performance than a larger one. It is, therefore, possible to obtain an ensemble which minimizes the number of individual classifiers while preserving or improving performance attributes such as accuracy and misclassification cost. Sparse ensembles were proposed in [19], where the outputs of multiple classifiers are combined using a sparse weight vector. The hinge loss and 1-norm regularization were exploited to calculate the sparse weight vector, formulated as a linear programming problem. However, the 1-norm metric cannot describe the sparseness of ensemble classifiers precisely: a weight vector with a group of small values can improve the 1-norm measurement but does not improve the sparseness.
The 0-norm metric can describe the sparseness more precisely [20]. Sparse ensemble learning has been applied to synthetic aperture radar (SAR) image classification in [6] and to Youtube video classification in [21]. However, 0-norm learning is NP-hard, and searching for the global optimum remains an open problem.

Compressed sensing (CS) [22] was brought to ensemble learning in [23]. It explores the globally optimal subset of classifiers for a given ensemble. To solve the compressed sensing problem, a sparse weighting vector containing many zeros is generated first, and then appropriate weights are assigned to the remaining classifiers according to their relative importance. Several popular methods such as SpaRSA [24], OMP [25], FISTA [26] and PFP [27] are used to tune the weight vector of ensemble classifiers. In [23] it is shown that compressed sensing ensembles are often as accurate as, or more accurate than, conventional ensembles, although they use only small subsets of the total set of classifiers. However, the sparseness must be set in advance when using compressed sensing methods. Meanwhile, the characteristics of unbalanced data classification are not taken into consideration.
In this paper, we propose the novel concept of a multiobjective sparse ensemble learning (MOSEL) method, in which the relationship between the sparsity and the classification performance is explained. To accurately describe the performance of ensemble classifiers, the detection error trade-off (DET) [28] performance is taken into consideration by adopting the false positive rate (fpr) and the false negative rate (fnr) simultaneously. Besides, the sparsity ratio (sr) of ensemble classifiers is treated as a third objective to be minimized. The DET can describe classifiers more precisely than the accuracy metric, especially for unbalanced data classification problems [28]. Besides, the evolutionary multiobjective algorithm (EMOA) [29] technique is applied for the first time to evolve the combining weights of ensemble component classifiers. With this tri-objective ensemble learning technique, we obtain a set of ensemble classifiers with different sparseness, rather than a single ensemble classifier with a sparseness that is set in advance. The sparsity and the error rates of ensemble classifiers are explainable and their trade-offs are quantifiable in the augmented DET (ADET) space.

We analyze the properties of the ADET space for sparse ensemble learning, and several state-of-the-art many-objective optimization algorithms are applied to solve the augmented DET convex hull (ADCH) maximization problem, including the two-archive algorithm (Two Arch2) [30], which treats convergence and diversity separately; the decomposition-based algorithm NSGA-III [31]; the evolutionary algorithm based on both dominance and decomposition (MOEA/DD) [32]; the reference vector guided evolutionary algorithm (RVEA) [33]; an indicator-based evolutionary algorithm with reference point adaptation (AR-MOEA) [34]; and the 3D convex-hull-based evolutionary multiobjective optimization algorithm (3DFCH-EMOA) [35,36].
By using EMOAs we can obtain a set of potentially optimal ensemble classifiers with different sr-fpr-fnr trade-offs.

The remainder of the paper is organized as follows. Section 2 gives a brief introduction to the multiobjective sparse ensemble method. Section 3 presents the results of several classification problems with MNIST [37] and remote sensing change detection datasets, and Section 4 provides concluding remarks.

Ensemble Learning
The idea of a sparse ensemble of classifiers is to combine the predictions of all classifiers in the candidate set using a sparse weight vector. The sparse vector has many zero elements, and only the classifiers corresponding to nonzero weights are selected for the ensemble. To improve the performance of the ensemble classifier and to reduce the memory demand for the components, an optimal subset of classifiers and the corresponding weight vector must be selected. The problem of seeking sparse weight vectors can be modeled as a combinatorial optimization problem, which can be solved by evolutionary algorithms [30].
In this paper, we only consider binary supervised ensemble classification problems. We are given a set of training samples X_tr = {(x_j, y_j) | x_j ∈ R^d, y_j ∈ {−1, +1}, j = 1, 2, …, M_tr}, where y_j is the class label corresponding to a given input x_j, d is the dimensionality of the feature space, and M_tr is the number of instances; the class labels are set to −1 and +1. A classifier can be obtained by applying a machine learning algorithm to a training dataset, and can be described as an estimate of the unknown function y = f(x). The classifier C_i(x) is a hypothesis f_i(x) about the true function f(x), which can predict the class label y for a new input vector x from a testing dataset X_ts or a validation dataset X_val. Usually, the training dataset is used for base classifier learning, the validation dataset is used for ensemble pruning, and the test dataset is used for evaluating ensemble classification performance. Denote by f_ji the prediction of the ith learner C_i(x) for the jth validation sample x_j, as described by Eq. (1):

f_ji = C_i(x_j).    (1)
The prediction output label vector f_i can be obtained by applying the classifier C_i to the validation dataset X_val of size M_val, as denoted in Eq. (2):

f_i = [f_1i, f_2i, …, f_{M_val}i]^T.    (2)
The matrix F of prediction labels for all instances obtained by all of the classifiers can be denoted by Eq. (3):

F = [f_1, f_2, …, f_N],    (3)

where N is the number of candidate classifiers. Ensemble learning can improve the performance of classifiers by combining the decisions of the individual classifiers, assigning a weight w_i to each classifier C_i(x); the vector of weights is denoted by Eq. (4):

w = [w_1, w_2, …, w_N]^T.    (4)

The predicted label vector y_predict obtained by ensemble learning for the input dataset X can be described as in Eq. (5):

y_predict = sign(F w).    (5)
The perfect ensemble classifier would solve the equation system y_val = y_predict exactly. Usually, the number of equations is larger than the number of weighting variables in the system, so there is typically no exact solution. In this case, the system can be solved approximately with optimization algorithms that minimize the difference between the true labels and the predicted labels.
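The construction of F, the ensemble prediction, and an approximate least-squares solution of y_val ≈ F w can be sketched as follows. This is a minimal sketch under the assumptions above (±1 labels, F holding one column per classifier); the sign convention for ties and the function names are illustrative, and an unconstrained least-squares fit stands in for the paper's constrained optimization.

```python
import numpy as np

def ensemble_predict(F, w):
    """Eq. (5): y_predict = sign(F @ w) for a (M_val, N) matrix of ±1 predictions."""
    scores = np.asarray(F, float) @ np.asarray(w, float)
    return np.where(scores >= 0, 1, -1)       # ties broken as +1

def least_squares_weights(F, y_val):
    """Approximately solve the overdetermined system y_val ≈ F @ w
    in the least-squares sense (no sparsity or sign constraints here)."""
    w, *_ = np.linalg.lstsq(np.asarray(F, float),
                            np.asarray(y_val, float), rcond=None)
    return w
```

For example, with F = [[1, 1], [1, -1], [-1, 1]] and y_val = [1, 1, -1], the least-squares solution puts all weight on the first classifier, whose column already matches the labels.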

Multiobjective optimization of ensemble learning
The DET curve [28] is taken into consideration to describe the performance of ensemble classifiers; it has been proved to be a good measurement for evaluating classifiers [38]. The definition of the DET curve is closely related to the two-by-two confusion matrix, which describes the relationship between the ground truth and the predicted class for a binary classifier. A confusion matrix is shown in Table 1 and includes four possible outcomes. An outcome is a true positive if a positive instance is correctly classified, and a true negative if a negative instance is correctly classified. Whenever a negative instance is classified as positive, we call it a false positive; whenever a positive instance is classified as negative, we call it a false negative.

To obtain sparse ensemble classifiers with good performance, not only should the difference between the true label vector y_val and the predicted label vector y_predict be minimized, but the number of nonzero elements in the weight vector w should be minimized as well. In Eq. (6) we define the sparsity ratio (sr) to describe the sparseness of the ensemble:

sr = ||w||_0 / N.    (6)

Here, N is the number of classifiers in the candidate ensemble set and ||w||_0 represents the number of nonzero entries in the weight vector. The weight vector w is constrained to non-negative values, as negative weightings are neither intuitively meaningful nor reliable [23]. We try to find ensemble classifiers with a low value of sr in order to reduce classification effort and to counteract overfitting of the ensemble classifier. The computational cost of an ensemble classifier with high sr is higher than that of an ensemble classifier with low sr, so given two ensemble classifiers with the same performance criteria (fpr, fnr), we prefer the one with the lower sr. Hence sr, fpr and fnr conflict with each other.

A low value of sr means that only a small number of classifiers are selected for the ensemble, which tends to degrade fpr and fnr. By treating the sparsity term sr as a third objective, sparse ensemble learning turns into a multiobjective problem. We denote it as multiobjective sparse ensemble learning (MOSEL), described in Eq. (7):

min_{w ∈ Ω} (sr(w), fpr(w), fnr(w)),    (7)

where Ω is the set of all possible weight vectors and w refers to weightings of sparse ensemble classifiers with good performance.
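The evaluation of the three MOSEL objectives for a candidate weight vector can be sketched as follows, using the sparsity ratio of Eq. (6) and the ±1 prediction convention introduced earlier. The function name `mosel_objectives` is a hypothetical helper, not part of the paper's implementation.

```python
import numpy as np

def mosel_objectives(w, F, y_val):
    """Return the tri-objective tuple (sr, fpr, fnr) for weight vector w.

    w:     (N,) non-negative weight vector
    F:     (M_val, N) matrix of ±1 base-classifier predictions
    y_val: (M_val,) true ±1 validation labels
    """
    w = np.asarray(w, float)
    y_val = np.asarray(y_val)
    sr = np.count_nonzero(w) / w.size                 # Eq. (6): ||w||_0 / N
    y_pred = np.where(np.asarray(F, float) @ w >= 0, 1, -1)
    pos, neg = (y_val == 1), (y_val == -1)
    fpr = float(np.mean(y_pred[neg] == 1)) if neg.any() else 0.0
    fnr = float(np.mean(y_pred[pos] == -1)) if pos.any() else 0.0
    return sr, fpr, fnr
```

An EMOA would call such a routine once per individual to place it in the sr-fpr-fnr objective space.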

Sparse real encoding
The sparse real encoding strategy is designed to represent the weight vector for the evolutionary algorithms; it is an improved version of the real encoding method. A sparse real encoding is an array of real values in the interval [0, 0.1]. The length of the chromosome is determined by the number of candidate classifiers for the ensemble. Two strategies are used to modify the real encoding approach for multiobjective sparse ensembles: one is called hard thresholding and the other inequality constraint. Details are discussed below.
A classifier with a small weight in the ensemble learning system does not contribute much to the final decision. In this paper, we ignore classifiers with small weights by adopting a hard threshold strategy: weights smaller than the threshold are set to zero, as described in Eq. (8):

w_i = 0 if w_i < σ,    (8)

where σ is the hard threshold and N is the number of candidate classifiers. In the experimental section, the value is set to 0.05. The sparse real encoding can model the solution of sparse ensemble learning, and then several EMOAs can be applied to evolve the individuals in the population set.
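The hard-threshold step of Eq. (8) can be sketched directly; `hard_threshold` is an illustrative name, and the default σ = 0.05 follows the value stated above.

```python
import numpy as np

def hard_threshold(w, sigma=0.05):
    """Eq. (8): zero out all weights strictly below the hard threshold sigma."""
    w = np.asarray(w, float).copy()   # copy so the chromosome is not mutated
    w[w < sigma] = 0.0
    return w
```

Applied to a chromosome evolved by an EMOA, this turns a dense real vector into the sparse weight vector actually used for the ensemble.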

Adaptive MOSEL classifiers selection
The proposed MOSEL delivers a set of ensemble classifiers; in this section we design an adaptive selection method to choose the most suitable classifier for a given dataset [39]. Let p(P+) signify the frequency of positive samples and p(N−) that of negative samples in a dataset. For an ensemble classifier, the risk R can be denoted as in Eq. (9):

R = λ(FN, P+) · fnr · p(P+) + λ(FP, N−) · fpr · p(N−),    (9)

where λ(FN, P+) is the loss incurred for deciding Negative when the true label is Positive, and λ(FP, N−) is the loss for deciding Positive when the true label is Negative.

In many real-world problems we cannot obtain the label of each sample; however, we can estimate the class distributions of a dataset with a predefined classifier, and we denote these estimates as p̂(P+) and p̂(N−). Since we do not consider cost-sensitive classification problems in this paper, Eq. (9) can be simplified to Eq. (10):

R = fnr · p̂(P+) + fpr · p̂(N−).    (10)

Algorithm 1 Adaptive MOSEL classifiers selection (mosel, X_ts)
Require: mosel is the ensemble classifiers set; the performance of each classifier in the ADET space with X_val has been obtained
Ensure: the most suitable ensemble classifier for X_ts
1: Set t ← 0 and select a classifier i_t from the solution set mosel randomly
2: Predict the labels for X_ts by EnC_{w_{i_t}} and evaluate the dataset
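A one-shot simplification of the adaptive selection can be sketched with the risk of Eq. (10): score every ensemble classifier by its validation (fpr, fnr) and the estimated class frequencies, and keep the one with minimum estimated risk. This collapses the iterative Algorithm 1 into a single pass; `select_classifier` and its inputs are illustrative names.

```python
def select_classifier(classifiers, p_pos_hat, p_neg_hat):
    """Pick the index of the ensemble classifier minimising Eq. (10).

    classifiers:          list of (fpr, fnr) pairs measured on X_val
    p_pos_hat, p_neg_hat: estimated positive / negative frequencies on X_ts
    """
    risks = [fnr * p_pos_hat + fpr * p_neg_hat      # R = fnr·p̂(P+) + fpr·p̂(N−)
             for fpr, fnr in classifiers]
    return min(range(len(classifiers)), key=risks.__getitem__)
```

Note how the choice adapts to the estimated distribution: on a positive-heavy dataset a low-fnr classifier wins, on a negative-heavy one a low-fpr classifier wins.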

Framework of MOSEL
The framework of MOSEL is given in Alg. 2. Firstly, a set of candidate classifiers is trained with X_tr by adopting bagging or random subspace strategies. Secondly, the sparse vector w is optimized by using EMOAs with X_val, which is used to evaluate the performance of each individual of the EMOAs. Thirdly, the most suitable ensemble classifier for X_ts is obtained by the adaptive MOSEL classifiers selection algorithm.
Algorithm 2 Learning Procedure for MOSEL
1: Train a set of candidate classifiers with X_tr
2: Optimize the sparse vector w by using EMOAs with X_val
3: Obtain the most suitable ensemble classifier for X_ts by using the adaptive MOSEL classifiers selection algorithm

The compared compressed sensing methods include SpaRSA [24] and OMP [25], which are among the most popular methods for solving sparse reconstruction problems [23]. The compared pruning methods are Kappa pruning (KP) [16] and ensemble based on matching pursuit (MP) [17]. Several state-of-the-art EMOAs are used to search for solutions of MOSEL, including Two Arch2 [30], NSGA-III [31], MOEA/DD [32], RVEA [33], AR-MOEA [34] and 3DFCH-EMOA [36]. The MNIST [37] and remote sensing change detection datasets are selected to evaluate the performance of the above methods. The strategy of random subspaces [9] is adopted for dataset manipulation and the classification and regression tree (CART) [40] is used as the base learner. For each mentioned algorithm, 10 independent trials are conducted.
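The three steps of Alg. 2 can be sketched end to end as a toy stand-in, not the paper's implementation: decision stumps replace the CART base learners, and a seeded random search with the hard-threshold sparsification stands in for the EMOA; all function names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, y, feature):
    """A decision stump on one feature: threshold at the class-mean midpoint."""
    thr = (X[y == 1, feature].mean() + X[y == -1, feature].mean()) / 2
    sign = 1 if X[y == 1, feature].mean() >= thr else -1
    return feature, thr, sign

def stump_predict(stump, X):
    feature, thr, sign = stump
    return np.where(X[:, feature] >= thr, sign, -sign)

def mosel_pipeline(X_tr, y_tr, X_val, y_val, n_candidates=20, n_trials=200):
    # Step 1: train candidate classifiers on random feature subspaces
    stumps = [train_stump(X_tr, y_tr, rng.integers(X_tr.shape[1]))
              for _ in range(n_candidates)]
    F = np.column_stack([stump_predict(s, X_val) for s in stumps])
    # Step 2: random search over sparse non-negative weight vectors
    # (an EMOA stand-in; error plus a small sparsity penalty as scalar score)
    best_w, best_err = None, np.inf
    for _ in range(n_trials):
        w = rng.random(n_candidates)
        w[w < 0.5] = 0.0                       # hard-threshold sparsification
        y_pred = np.where(F @ w >= 0, 1, -1)
        err = (np.mean(y_pred != y_val)
               + 0.01 * np.count_nonzero(w) / n_candidates)
        if err < best_err:
            best_w, best_err = w, err
    # Step 3: return the selected sparse ensemble
    return stumps, best_w
```

On well-separated synthetic data this recovers a sparse weight vector that classifies the validation set almost perfectly while using only a fraction of the candidates.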

Parameter setting
The stopping criterion for the six EMOAs is set to a maximum of 30000 function evaluations.

The MNIST dataset we use in this part is described in the left part of Table 2. As we only consider binary classification problems in this paper, we select several sub-datasets from the whole dataset, namely ds1-ds9 (details are listed in the right part of Table 2). All of the sub-datasets contain two classes; for instance, the positive class in ds2 includes '1' and the negative class includes '0' and '2'. Both balanced and unbalanced datasets are created; for instance, in the ds9 dataset the ratio of positive to negative instances is about 1:9. For each of the datasets, half of the training instances are randomly selected for candidate classifier generation and the rest are used for ensemble performance evaluation.

Firstly, the reference Pareto front is shown to illustrate the properties of the solutions of the tested EMOAs; it is calculated as the best set of solutions achieved by the algorithms in the first experimental run. Without loss of generality, we only discuss the result on the ds3 dataset of Table 2. The obtained reference Pareto front is shown in Fig. 2(a). We can see that the reference Pareto front consists of a set of discrete points on the ADET surface. To illustrate the reference Pareto front clearly, two-dimensional projections are shown in Fig. 2(b) and Fig. 2(c), corresponding to the fpr × sr projection and the fnr × sr projection, respectively.

Several metrics are chosen to evaluate the performance of the studied algorithms in the comparative experiment on these datasets, including classification accuracy (acc), false positive rate (fpr), false negative rate (fnr), sparsity ratio (sr) and the Kappa coefficient (Kappa) [41]. The Kappa coefficient is a statistical indicator which measures inter-rater agreement for categorical items. It is generally considered a more robust measure than a simple percent-agreement calculation, as Kappa takes into account the possibility of the agreement occurring by chance. Generally, the larger the value of Kappa, the better the performance of the algorithm.
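The Kappa coefficient used above can be computed directly from the observed agreement p_o and the chance agreement p_e as κ = (p_o − p_e) / (1 − p_e); a minimal sketch for binary ±1 labels, with `kappa` as an illustrative name:

```python
def kappa(y_true, y_pred):
    """Cohen's Kappa for binary ±1 labels: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n   # observed agreement
    p_pos = sum(t == 1 for t in y_true) / n                 # positive rate, truth
    q_pos = sum(p == 1 for p in y_pred) / n                 # positive rate, prediction
    p_e = p_pos * q_pos + (1 - p_pos) * (1 - q_pos)         # chance agreement
    return (p_o - p_e) / (1 - p_e)
```

A perfect prediction gives κ = 1, while a prediction no better than chance gives κ = 0, which is why Kappa is preferred over raw accuracy on unbalanced datasets.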
The statistical results of these metrics are listed in the following tables, where the best results are marked in light grey and the second best in dark grey. Table 3 shows the mean and standard deviation of the classification accuracy; the average classification accuracy for each method is listed in the last column of the table. As most of the datasets used in this part are large and their distributions unbalanced, a small improvement in accuracy and Kappa can cause many samples to be correctly classified and greatly reduce misclassification costs. To show the classification performance in more detail, fpr and fnr are compared in Table 5 and Table 6, respectively. From these tables we can see that the MOSEL methods outperform the other compared methods on fpr, which represents the misclassification ratio of negative instances.

Experimental results of image change detection
Remote sensing image change detection is a real-world problem that aims to find the change information that has occurred between two images of the same area taken at different times [42]. It has been applied in many areas, including disaster monitoring, changed target detection and supervision of country resources [43]. Supervised methods have been widely used for remote sensing image change detection [44], as a small amount of labeled data can be used for model training and then the built model can be applied to large-scale image change detection. The change detection problem is an unbalanced classification problem, as the proportion of the changed area compared to the total observed area is small. In this part, both synthetic aperture radar (SAR) [45] and optical images are used for evaluating the proposed methods.

Datasets description
Six pairs of remote sensing images are used for classification performance evaluation; details are described in the following. The first dataset is the Ottawa dataset of two SAR images with a spatial resolution of 10m × 10m and a spatial size of 290 × 350, acquired in July and August 1997, respectively. They were acquired over the city of Ottawa by the Radarsat SAR sensor and were provided by Defence Research and Development Canada (DRDC)-Ottawa. Fig. 4(a) and (b) present the flood-afflicted areas and Fig. 4(c) shows the manually defined reference map. The sample patch for model training and validation is marked in blue, with a spatial size of 100 × 100, in the log-ratio difference image shown in Fig. 4(d).

Fig. 7 shows changed areas that appear as newly reclaimed farmland, with a spatial size of 306 × 291. Fig. 8 shows a coastline where the changed areas are relatively small, with a spatial size of 450 × 280. Inland water, where the changed areas are concentrated on the borderline of the river, is shown in Fig. 9; its spatial size is 291 × 444. The spatial and sample patch sizes of these remote sensing datasets are listed in the corresponding table.

Experimental results and discussion
The mean and standard deviation of the classification accuracy are shown in Table 10. By comparing all the results we can conclude that: 1) OMP performs best on the Ottawa and Farmland datasets; 2) the MOSEL methods outperform CS and pruning ensemble methods on most of the datasets except Ottawa and Farmland; 3) 3DFCH-EMOA obtains the highest accuracy except for the Farmland dataset.

The metrics fpr and fnr are compared in Table 12 and Table 13, respectively. By comparing the results we can find that: 1) 3DFCH-EMOA and MOEA/DD perform better than the other methods and obtain a lower average fpr, which represents the percentage of unchanged pixels misclassified; 2) SpaRSA and 3DFCH-EMOA obtain a lower average fnr, which represents the percentage of changed pixels misclassified. The two objectives fpr and fnr conflict with each other; generally, a good method can find a good trade-off between them.

Table 14 shows the mean value and standard deviation of the number of non-zero classifiers in the ensemble weight.
By comparing the results we can conclude that KP and OMP have good performance on sparsity, however, they perform poorly on accuracy and Kappa metrics.
As 3DFCH-EMOA has good performance on most of the compared metrics, we make a more comprehensive comparison between 3DFCH-EMOA and the other ensemble methods. The results of the Wilcoxon rank-sum test are listed in Table 15. By comparing the results we can find that 3DFCH-EMOA significantly outperforms CS and pruning ensemble methods on accuracy and Kappa metrics for most of the datasets.

Conclusion
In this paper, we proposed the multiobjective sparse ensemble learning model and analyzed its properties in the ADET space. Firstly, MOSEL is modeled as an ADCH maximization problem, and the relationship between the sparsity and the performance of ensemble classifiers in the ADET space is explained. Secondly, a sparse real encoding is designed as a bridge between MOSEL and EMOAs, and six EMOAs were used to find sparse ensemble classifiers with good performance. Thirdly, an adaptive MOSEL classifier selection algorithm was proposed to select the most suitable ensemble classifier for a given dataset. Experimental results based on the well-known MNIST and remote sensing change detection datasets show that the proposed MOSEL performs significantly better than conventional ensemble learning methods. However, the distribution of the MOSEL solutions obtained by the EMOAs is not even; finding evenly distributed solutions requires further study.