
Classifier performance optimization in machine learning can be stated as a multi-objective optimization problem. In this context, recent works have shown the utility of simple evolutionary multi-objective algorithms (NSGA-II, SPEA2) to conveniently optimize the global performance of different anti-spam filters. The present work extends existing contributions in the spam filtering domain by using three novel indicator-based (SMS-EMOA, CH-EMOA) and decomposition-based (MOEA/D) evolutionary multi-objective algorithms. The proposed approaches are used to optimize the performance of a heterogeneous ensemble of classifiers in two different but complementary scenarios: parsimony maximization and e-mail classification under a low confidence level. Experimental results using a publicly available standard corpus allowed us to draw interesting conclusions regarding both the utility of rule-based classification filters and the appropriateness of a three-way classification system in the spam filtering domain.


Introduction
Nowadays, the use of Internet mailing services has become indispensable in the daily life of millions of users worldwide. Additionally, the combination of e-mail with the latest always-connected smartphones provides a simple but powerful method to stay in touch with other people and efficiently exchange documents at any time. As a result, both instant messaging (IM) applications and e-mail are commonly used for this purpose. However, a fundamental difference between popular IM applications (such as Whatsapp or GTalk) and Internet mailing services is the existence of consent management methods (e.g., blocking users), which can be found only in the former. This situation has greatly facilitated the use of e-mail as an aggressive/massive advertisement method and virus distribution platform, giving rise to the spam phenomenon.
Since the inception of spam, many companies and research teams have combined their efforts to fight against spam deliveries using different approaches and methods [1]. In this context, and from a scientific perspective, several machine learning (ML) algorithms have been successfully adapted and applied to filter spam messages, mainly including Naïve Bayes (NB) [2], ensemble techniques [3], Support Vector Machines (SVM) [4] and other memory-based systems [5]. Additionally, the computer security industry and the open source community have also contributed effective techniques such as DNS black and white lists [6][7], hashing schemes [8] and the development of SpamAssassin [9], the most popular filtering framework used to combine heterogeneous and complementary anti-spam techniques.
Since its creation, SpamAssassin has been widely used as the base of commercial products and filtering services including McAfee SpamKiller and Symantec Brightmail [10]. It allows system administrators to define specific filters using ad hoc rules. Each rule contains a logical expression (used as a trigger) and defines its associated score. Every time an e-mail is received for evaluation, SpamAssassin finds all the rules matching the target message and computes the sum of their scores. This accumulative value is then compared with a configurable threshold (required score) to finally classify the new incoming message as spam or legitimate (also known as ham).
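This accumulative scoring scheme can be illustrated with a minimal sketch. The rule names and scores below are hypothetical illustrations, not actual SpamAssassin rules:

```python
# Hypothetical rule set: rule name -> score added when the rule's trigger matches.
RULES = {
    "SUBJ_ALL_CAPS": 1.6,   # trigger: subject written entirely in capitals
    "HTML_ONLY":     0.8,   # trigger: message body contains only HTML
    "URI_SHORTENER": 2.1,   # trigger: body links through a URL shortener
}

def classify(matched_rules, required_score=5.0):
    """Sum the scores of all matching rules and compare against the threshold."""
    total = sum(RULES[r] for r in matched_rules)
    return ("spam" if total >= required_score else "ham"), total

# A message matching two of the three rules stays below the default threshold:
label, score = classify(["SUBJ_ALL_CAPS", "URI_SHORTENER"])
```

Lowering `required_score` (or raising individual rule scores) trades false negatives for false positives, which is precisely the tension the multi-objective formulation addresses.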
In order to define accurate anti-spam filters, the SpamAssassin framework provides implementations of several techniques including regular expressions, DNS black and white lists, Distributed Checksum Clearinghouses [11], Naïve Bayes [12][13], Sender Policy Framework [14], Hashcash [15], DomainKeys Identified Mail [16], language guessing [17], as well as several extra protocol error checks. Additionally, SpamAssassin allows the use of user-defined plugins to further extend the number of available techniques that compose a given filter.
Given the configurable structure of the SpamAssassin framework, and taking into consideration that the final accuracy of each user-defined filter strongly depends on the diversity of the underlying classifiers, the optimization of rule weights and other parameters governing the primary rule-based filtering process is still a challenge. In such a situation, initial approaches for the optimization of rule-based filters were formulated as a single objective problem, where a general performance index (e.g., number of errors, kappa index, f-score, or Total Cost Ratio) is commonly used [18]. However, a more intuitive formulation of this problem involves several objectives. In fact, at least two complementary indexes should be simultaneously considered for minimization in the development of novel accurate anti-spam filters: (i) the number of false negative (FN) errors (i.e., spam messages classified as legitimate) and (ii) the number of false positive (FP) errors (i.e., legitimate messages classified as spam). Nevertheless, these objectives are in conflict, since minimizing the number of FP errors can be done only at the expense of increasing the number of spam messages going into e-mail boxes, and vice versa.
Single objective optimization approaches (also known as 'a priori' methods) require that sufficient preference information is expressed (blindly) before the solution set is computed (i.e., assigning weights to the target objectives so as to aggregate them into a single objective, which is optimized subsequently). In contrast, multi-objective optimization ('a posteriori') methods provide insights into the conflicts between the objectives, i.e., to what extent one objective can be improved at the cost of the other(s). Thus, the user can select the resulting solution that best fits his/her preferences. In this context, some initial approaches [19][20][10] have evaluated the suitability of applying different multi-objective evolutionary algorithms (MOEA) in the spam filtering domain for optimizing both FN and FP errors at the same time.
However, in the aforementioned studies only classical MOEA techniques (NSGA-II and SPEA2) were applied, and questions such as how to better adapt these algorithms using domain specific knowledge and how to consider other objective functions remained unanswered.
In this line, we carried out a preliminary study on the performance of several MOEA approaches for solving different optimization questions [21]. In detail, this study included the spam filtering problem as part of a MOEA benchmarking protocol, with the goal of providing insight into the conflicts between the objectives to be minimized. However, our past work did not contribute a method to accurately evaluate the structure of the decision space (i.e., a detailed analysis of the relevance of each rule), which is essential for administrators to maintain (and continuously improve) filtering services.
Moreover, in order to fight against spam in environments where the cost associated with misclassification errors is high, the three-way classification scheme [22][23][24] emerged as a reliable way of mitigating information loss and security risks. Under this scheme, classifiers can avoid providing a solution when there is not enough evidence to assign target instances to one of the two available classes (i.e., spam or legitimate). In such a situation, these messages are labelled as 'suspicious', 'doubtful' or 'borderline', and become the subject of further manual examination by the final user. In this context, to increase security while revising suspicious e-mails, images, links and dangerous attachments should not be automatically loaded. Since suspicious e-mails do not count as errors but are handled at the expense of increased user effort, the number of messages labelled in this way should also be minimized (i.e., if all the messages belonging to a given corpus are labelled as 'suspicious', the number of misclassifications will be zero). Complementarily, the appropriateness of using a three-way classification scheme was also suggested as future work in our preliminary study [21].
In the present work, we complement previous findings by using three modern plus two classic MOEA approaches in two different ways: (i) by studying the structure of the decision space in the optimization of traditional binary classification processes (i.e., minimizing the amount of necessary rules and the number of FP and FN errors) and (ii) evaluating the suitability of three-way classification schemes to accurately filter spam contents. In the former case, we take advantage of the first optimization objective (parsimony) to specifically assess the contribution of each rule when generating a correct classification. In the second case, we carry out a performance study about the minimization of FP and FN errors when working with a three-way classification filter. These analyses have been implemented as two different optimization scenarios, making use of a well-known publicly available corpus.
While this section has introduced the motivation for this work, the rest of the paper is organized as follows: Section 2 presents the problem formulation, explains how to optimize ML classifiers with evolutionary algorithms (EAs) and summarizes previous works on anti-spam filter optimization using MOEA. Section 3 introduces the two case studies, defines the benchmarking protocol, establishes the performance metrics to be used, and presents and discusses relevant issues regarding each case study. Finally, Section 4 provides conclusions and identifies future research work.

Materials and methods
In spite of the fast progress in computer technology and the constant increase of computational power, performing exhaustive searches in large continuous and combinatorial spaces is still challenging. In this context, the remarkable popularity of EAs over other optimization techniques is mainly motivated by their ability to search these spaces and find approximate (near) optimal solutions [25]. In the particular domain of multi-objective optimization, EAs stand for well-established computational methods whose population-based approach makes them suitable for searching for approximation sets to the efficient set.
In this way, EAs were found to be particularly useful for dealing with multi-objective problems characterized by several conflicting goals, for which not simply a single optimum solution, but a set of Pareto optimal or non-dominated solutions need to be obtained.
Together, these solutions represent the trade-offs between the existing objectives, being optimal in the sense of Pareto dominance. In such a situation, a Pareto optimal solution can only be improved in one objective at the expense of a loss in the other(s). As long as a population of possible solutions is used in parallel to solve these problems, the search is directed not towards a single optimum but towards multiple Pareto optimal solutions, which is the case of MOEAs (also known as Evolutionary Multi-objective Optimization Algorithms, EMOAs).
The constant development of novel MOEAs, striving for better performance with respect to the quality of the obtained set of solutions (according to both convergence and coverage of the Pareto front approximation), has led to several generations of MOEAs [26]. The theoretical foundations of these approaches are thoroughly discussed in [27], while more recent approaches can be found in the works of Laumanns [28] and Auger et al. [29]. Additionally, a large number of real-world applications are also discussed in the work of Coello [26].

Problem formulation: spam filtering domain perspective
As previously discussed, in the context of traditional binary classifiers, misclassifications are commonly grouped into FP and FN errors. In order to correctly apply EAs to optimize anti-spam filters, a normalized count of the false negative and false positive occurrences is adopted. These measures are known as the false negative rate (FNR) and the false positive rate (FPR), respectively. Expression (1) shows how to compute their corresponding values.
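Assuming the usual convention that spam is the positive class, these rates are defined as:

```latex
\mathrm{FNR} = \frac{FN}{FN + TP}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}
```

Both rates fall in the interval [0, 1], with 0 corresponding to a filter that commits no errors of the corresponding kind.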
Additionally, when working with traditional ML classifiers, their hits can be separated into true positive (TP) and true negative (TN) classes, depending on whether the target message was really spam or legitimate, respectively. In this context, the relationship between the true labels and the predicted ones can therefore be presented in a straightforward two-by-two confusion matrix, as shown in Table 1.
However, considering only rigid binary classifiers, for which every new instance is simply categorized as positive or negative, may result in a high number of misclassifications, leading to high costs. To reduce these errors, the final user of an anti-spam filter could help in the classification task with the goal of improving the accuracy of the filter and reducing its associated cost. In this context, the use of a three-way classification scheme allows us to take advantage of final users both to improve the classification accuracy of the filter and to maximize the user experience. In such a situation, the initial confusion matrix (introduced in Table 1) changes to become a three-by-two matrix, as shown in Table 2.

Table 2 Confusion matrix of three-way classifiers.

Optimizing ML classifiers with EAs
Traditional ML classification tasks involving parameter optimization and model selection can be successfully reformulated as multi-objective optimization problems. In fact, they usually require achieving improvements on several conflicting goals simultaneously, such as recall/sensibility, precision/specificity and classifier complexity [21]. Apart from this classical perspective, many other examples of multi-objective optimization of ML classifiers can also be considered, such as the trade-off between learning new information and forgetting old information, or between learning as many details as possible and generalizing the model to its maximum in pattern recognition [30]. However, it is only relatively recently that the design of ML systems has been conceived from a multi-objective point of view, considering the simultaneous optimization of multiple conflicting objectives instead of combining them into a single objective. Due to the large combinatorial space of such problems, solving them with exact algorithms is not possible in most cases; hence, they are often solved using evolutionary approaches, MOEAs in particular.
In this regard, a usual way to measure the performance of different ML classifiers (or different configurations of the same model) is through the Receiver Operating Characteristic (ROC) curve, which conveniently summarizes the classifier performance when varying discriminative thresholds for two-class classification problems. ROC Convex Hull (ROCCH) is currently being widely used by the scientific community, being able to represent the convex hull area of a set of points, each of which stands for an optimal classifier [21].
Maximizing ROCCH leads to finding the group of classifiers that together provides the best range of optimal classifiers.
Even though the concepts of the ROC Convex Hull and the Pareto front were reported to be similar (leading to the application of EMOAs for approximating the ROCCH [31]), a specific and important property of the ROCCH makes it more valuable than the Pareto front. When using the ROCCH, any two classifiers belonging to the Convex Hull can be joined by a line, on which a new virtual classifier is represented as a point with its corresponding performance [32]. This property can be straightforwardly used to save computational resources [31]. The final selection among the obtained solutions is usually done by decision maker(s). As this is not an easy task, different decision-aiding tools able to take into account the preferences of the decision maker(s) are used to support this judgement.
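The virtual-classifier property can be sketched as follows: randomly delegating each decision to one of two hull classifiers yields, in expectation, the convex combination of their ROC operating points. This is a generic illustration under our own naming, not code from the cited works:

```python
import random

def virtual_classifier(clf_a, clf_b, lam, x):
    """Delegate each decision to clf_a with probability lam, otherwise clf_b.
    The expected (FPR, TPR) operating point of this randomized classifier is
    lam*(FPR_a, TPR_a) + (1-lam)*(FPR_b, TPR_b), i.e. a point on the segment
    joining the two hull classifiers."""
    return clf_a(x) if random.random() < lam else clf_b(x)

def expected_point(point_a, point_b, lam):
    """Operating point of the virtual classifier on the ROC plane."""
    return tuple(lam * a + (1 - lam) * b for a, b in zip(point_a, point_b))

# Midpoint between a conservative and an aggressive classifier:
p = expected_point((0.05, 0.60), (0.30, 0.95), lam=0.5)  # ≈ (0.175, 0.775)
```

Any target trade-off along the hull segment is therefore reachable without training a new model, which is the computational saving referenced above.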
The Pareto fronts achieved by MOEAs are usually approximated by a finite number of points.
In fact, the hypervolume indicator (which favours a bounded-size set of points that jointly dominates a maximal part of the objective space relative to a reference point) or the Convex Hull (which dominates a large part of this space) are commonly used for this purpose. The latter is argued to be more appropriate when it comes to ROC curve approximation [31].
In a three-way classification scenario, apart from the recommendable maximization of TPR and TNR, the third obvious objective is to maximize the number of instances labeled as spam or legitimate by the filter, namely, the classified instances rate (CR) or coverage. Considering a graphical 3D plot of all these variables (see Figure 1), the trade-off surface between these three objectives can be easily observed.

Relevant advances on multi-objective optimization methods
Considering a general (not problem-specific) EMOA, the choice of selection operator has a huge influence on its efficiency. In each iteration, this operator picks out the individuals to be passed to the next generation and, hence, the parents for the mating phase. In this context, the evolution of EMOAs has been widely influenced by the research done in the scope of selection operators.
Pareto-based EMOAs, such as the Non-dominated Sorting Genetic Algorithm (NSGA) [33] and the Multi-Objective Genetic Algorithm (MOGA) [34], were the first approaches using Pareto non-dominance in the ranking and selection of individuals. The next group, that of elitist EMOAs, which includes NSGA-II [35] and the Strength Pareto Evolutionary Algorithm (SPEA) [36], is characterized by selecting non-dominated solutions and allowing their preservation with respect to earlier generations. A different approach was adopted in the design of region-based EMOAs, such as the Pareto Envelope-based Selection Algorithm (PESA-II) [37], whose focus is centered on the competition of regions of the objective space instead of the rivalry of individuals. This strategy is similar to the concept of niching [38], with a restricted number of individuals being selected from each niche in the objective space (note that niching in the decision space is also common [39]).
Later, the development of performance assessment measures (indicators) revealed the possibility of using EMOA performance indicators directly in the selection operator, which gave birth to indicator-based EMOAs, such as the Indicator-Based Evolutionary Algorithm (IBEA) [40] and the Multi-objective Selection Based on Dominated Hypervolume (SMS-EMOA) [41][42]. In general, any performance indicator may be used in IBEA, but binary indicators preserving Pareto non-dominance are commonly adopted. For instance, the hypervolume and epsilon indicators satisfy the desired properties (e.g., compliance with the Pareto dominance principle and monotonicity) and are commonly suggested as the default choice for the selection operator to be based on. Additionally, it was also reported that the hypervolume indicator selects extreme points and concentrates the search around the knee region of the Pareto front [41].
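As an illustration of such binary indicators, the additive epsilon indicator measures the smallest amount by which an approximation set A must be shifted so that it weakly dominates a reference set B. The sketch below is our own simplified implementation for minimization problems:

```python
def additive_epsilon(front_a, front_b):
    """I_eps+(A, B): smallest eps such that every point of B is weakly
    dominated by some point of A translated by eps in every objective
    (all objectives to be minimized)."""
    return max(
        min(max(a_i - b_i for a_i, b_i in zip(a, b)) for a in front_a)
        for b in front_b
    )

# front_a dominates front_b here, so the indicator is negative:
eps = additive_epsilon([(1, 2), (2, 1)], [(2, 3), (3, 2)])
```

A value of at most 0 certifies that A weakly dominates B, which is the Pareto-compliance property mentioned above.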
NSGA-II stems from its predecessor NSGA and adopts the same non-dominated sorting procedure for allocating all solutions to classes with respect to their non-domination rank [43]. In order to address the mating selection of parents that will participate in offspring production and the environmental selection (from both parents and offspring), individuals from the fronts with the best (first) ranks are preferred. In particular, parents are randomly chosen from the population and compared in a binary tournament. If the two individuals belong to fronts with different ranks, the winner entering the mating pool is the one from the non-dominated front with the better rank. If the two individuals belong to the same front, the comparison is based on their crowding distance (the one with the largest crowding distance is preferred). Similarly, to fill the next population of individuals from the union of the parent and offspring populations, individuals from fronts having better ranks are selected first. When there are more available individuals than needed from the last front, those with the largest crowding distance (i.e., the largest distance to the nearest neighbors) are selected. However, the crowding distance mechanism of diversity preservation works poorly when there are more than three objectives. The general framework of NSGA-II is described in the algorithm shown in Figure 2, and details of non-dominated sorting and crowding distance sorting are provided in the algorithms shown in Figures 3 and 4, respectively.
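The crowding distance computation used in both selection steps can be sketched as follows. This is a simplified illustration of the standard procedure, not the reference NSGA-II implementation:

```python
def crowding_distance(front):
    """Crowding distance of each solution within one non-dominated front.
    `front` is a list of objective vectors; boundary solutions on each
    objective receive an infinite distance so they are always preserved."""
    n, m = len(front), len(front[0])
    dist = [0.0] * n
    for k in range(m):
        order = sorted(range(n), key=lambda i: front[i][k])
        lo, hi = front[order[0]][k], front[order[-1]][k]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue  # degenerate objective: all values equal
        for pos in range(1, n - 1):
            i = order[pos]
            # normalized side length of the cuboid around solution i
            dist[i] += (front[order[pos + 1]][k] - front[order[pos - 1]][k]) / (hi - lo)
    return dist

d = crowding_distance([(0.0, 1.0), (0.5, 0.5), (1.0, 0.0)])
# the two extreme points get inf; the middle point gets a finite distance
```

In the binary tournament, the individual with the larger crowding distance wins when both belong to the same front.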
Many other methods were also developed to improve the performance of NSGA-II and to address diversity preservation in the selection operator. In this line, SMS-EMOA, one of the earliest hypervolume-based methods, builds a single offspring per iteration and performs a non-dominated sorting of the population. Then, the offspring is ranked against the already sorted population, and the individual from the last non-dominated front with the smallest hypervolume contribution is removed. Figure 5 describes the general framework of the SMS-EMOA approach, whilst Figure 6 introduces the details of its replace procedure.
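The replace step can be sketched for the bi-objective case, where each point's exclusive hypervolume contribution has a simple closed form. This is our own simplified illustration of that step, not the published implementation:

```python
def hv_contributions_2d(front, ref):
    """Exclusive hypervolume contribution of each point of a 2D
    non-dominated front (minimization; `ref` is the reference point)."""
    pts = sorted(front)  # ascending in the first objective, descending in the second
    contrib = {}
    for i, (x, y) in enumerate(pts):
        right_x = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        upper_y = pts[i - 1][1] if i > 0 else ref[1]
        contrib[(x, y)] = (right_x - x) * (upper_y - y)  # exclusive rectangle
    return contrib

def sms_emoa_discard(front, ref):
    """Remove the individual with the smallest hypervolume contribution."""
    c = hv_contributions_2d(front, ref)
    worst = min(front, key=lambda p: c[p])
    return [p for p in front if p != worst]

survivors = sms_emoa_discard([(1, 4), (2, 2), (3, 1.9)], ref=(5, 5))
```

Here the point (3, 1.9) contributes the smallest exclusive rectangle and is discarded, illustrating how SMS-EMOA trims the worst-contributing member of the last front.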

Available datasets for anti-spam research
With the goal of boosting the design of novel and accurate filters and the execution of new anti-spam experiments, several companies, in conjunction with the scientific community, have publicly shared their e-mail datasets (corpora). All these available corpora can be used for testing purposes, enabling results to be compared between different research works. Table 3 compiles the most popular datasets that can be freely downloaded from the Internet.

Experimental study
In order to correctly study the advantages of applying the three novel indicator- and decomposition-based MOEAs, two optimization scenarios were defined. The first one stands for a tri-objective binary-real representation, in which a binary decision variable is included for representing each available rule, allowing its activation or deactivation depending on its associated relevance in the whole classification process. This problem formulation, initially introduced in [21], is maintained here with two main purposes: (i) assuring accurate performance comparisons with previous approaches and (ii) enabling a relevance analysis to consider the different types of rules. In detail, our previous work was focused on providing an appropriate benchmark procedure to compare several MOEA approaches aimed at solving different optimization problems. In contrast, the present work takes advantage of the same evaluation scenario to assess the contribution of each rule to accurately classifying incoming messages.
The second scenario stands for a tri-objective real-valued representation for three-way classification. As in the previous case, a real decision vector is used for representing the scores of all the available anti-spam rules. Additionally, two real variables in the interval [0, 1] are used to represent the two threshold values that establish the bounds defining 'unclassified' e-mails. An additional constraint is introduced here in order to guarantee that the lower bound value is smaller than the upper one. By following this problem formulation, three labels are required for classifying e-mails: legitimate, spam and unclassified. Therefore, whenever the target message achieves a score below the lower bound threshold, it is classified as legitimate. Conversely, if the e-mail score is above the upper bound threshold, the message is classified as spam. Otherwise, if the score falls inside the interval, the message is set aside for further examination (unclassified). The rationale is that it is better to leave an e-mail unclassified than to provide a wrong classification.
In our second scenario, the three objectives are the minimization of FNR, FPR and the unclassified ratio of e-mails (UR). This third objective also falls in the range of [0, 1], as shown in Expression (4).
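The evaluation of a candidate solution in this scenario can be sketched as follows. Variable names, scores and thresholds are our own illustrative choices:

```python
def three_way_rates(scores, labels, lower, upper):
    """Compute FNR, FPR and UR for a two-threshold (three-way) filter.
    `scores` are accumulated rule scores, `labels` the true classes
    ('spam'/'ham'); messages scoring between the bounds stay unclassified."""
    assert lower < upper  # constraint from the problem formulation
    fn = fp = unclassified = 0
    n_spam = labels.count("spam")
    n_ham = labels.count("ham")
    for s, y in zip(scores, labels):
        if s <= lower:          # confidently legitimate
            if y == "spam":
                fn += 1
        elif s > upper:         # confidently spam
            if y == "ham":
                fp += 1
        else:                   # suspicious: left for the user
            unclassified += 1
    return fn / n_spam, fp / n_ham, unclassified / len(labels)

fnr, fpr, ur = three_way_rates(
    scores=[0.2, 6.3, 3.1, 0.4],
    labels=["ham", "spam", "spam", "ham"],
    lower=1.0, upper=5.0)
```

Widening the interval between the two thresholds drives FNR and FPR towards zero while inflating UR, which is exactly why all three objectives must be minimized jointly.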

Benchmarking protocol and parameter setup
In order to guarantee the reproducibility of our experiments, this subsection introduces a straightforward description of all the configuration details needed to run our experimental tests including the target filter definition, the available datasets, and different parameter details regarding the configuration of the executed algorithms.
With reference to the target filter to be optimized, we selected the default and standard spam filter configuration included in the Debian GNU/Linux Squeeze distribution running SpamAssassin 3.3.1 [9]. Since only a subset of the available SpamAssassin rules actually match e-mails in the target corpus, only those matching rules were finally selected to be part of our multi-objective optimization process.
From all the available alternatives shown in Table 3, we finally selected the well-known SpamAssassin corpus [46] in order to run our experimental testbed. This selection guarantees a medium-sized corpus (containing 9349 e-mails) providing both legitimate and spam messages (6951 legitimate vs 2398 spam), characterized by a legitimate/spam ratio very similar to the proportion found in current e-mail inboxes. Moreover, the SpamAssassin corpus has been distributed in the same raw format in which messages are transmitted through the Internet (RFC 5322 format [47]). Hence, using the spamc and spamd tools included in the SpamAssassin package [9], we can easily compute the matching rules for each available message (spamc -y < file.rfc5322.eml). The output of this command is saved to a file, which is later used to improve the evaluation speed for each configuration.
As previously stated, in this work we apply three novel indicator- and decomposition-based MOEAs (SMS-EMOA, CH-EMOA and MOEA/D) together with two classical approaches (NSGA-II and SPEA2). For all the experiments, we established a maximum number of 25,000 function evaluations.
Both the single point crossover and bit flip mutation operators were applied to manipulate binary data in the tri-objective binary-real representation experiments.
Additionally, the SBX crossover and polynomial mutation operators were used to manipulate real data belonging to both problem representations (i.e., tri-objective real-valued and tri-objective binary-real). For binary variables, a bit flip mutation was used with a probability of

Discussion of performance measures
Although many global performance measures can be found in the scientific literature, comparing EMOAs results is still an open problem. In contrast to single objective algorithms, the performance assessment of algorithms with multiple objectives constitutes a complex task. Among others, it involves quality of the outcome assessment (i.e., how to measure quality), computing resources used (e.g., time, number of function evaluations, etc.) as well as the analysis of several runs of the (stochastic-based) algorithm to take randomness and parameterization into account. Therefore, the analysis requires extensions of comparison methods. Instead of comparing objective vectors, approximation sets of several independent runs of algorithms need to be compared.
In multi-objective optimization, it is often impossible to know the (true) Pareto optimal set to be used in the comparison with the outcomes of EMOAs. Thus, general performance assessment criteria for multi-objective optimization algorithms should be considered including accuracy, coverage and variance, also called convergence, uniformity and spread.
Under the best of circumstances, the obtained Pareto-optimal solutions are accurate, which means they are as close as possible to the true Pareto front of non-dominated solutions, well distributed and widely spread. Coverage and spread measures are closely related but not exactly the same: the former requires a representation of each region of the Pareto front, while the latter makes sure that the distance between points of the Pareto front approximation is evenly distributed (apparently tending to give higher preference to boundary points).
In this context, theoretical and empirical techniques can be used for performance assessment. In such a situation, and with the goal of accurately evaluating different approximation sets from multiple runs of several stochastic multi-objective algorithms, complementary techniques can also be combined. To this end, we adopted both dominance-compliant quality indicators and 3D graphical representations of the reference fronts (composed of Pareto front solutions selected from all runs of an algorithm) for carrying out the performance assessment. While the former reduces each approximation set to a single quality value, applying statistical tests to the samples, the latter shows the samples of the approximation sets, giving information about how and where the performance differences occur.
Quality indicators allow the analysis of two algorithms to determine how much, and in which specific aspects, one of them is better than the other. However, these alternatives can only measure specific/limited quality aspects. Therefore, in our study we compute and analyze two complementary quality indicators that have reasonable properties related to the domain specific multi-objective optimization algorithms used in our experiments: SPREAD [35] and VUS [49][50].
The SPREAD indicator is commonly used for the comparison of different EMOAs. This indicator can be evaluated in either the objective or decision spaces, showing how far the Pareto front or set spreads in the objective or decision space, respectively. Hence, the larger the spread of the Pareto front is, the wider the range of values on objectives/variables it covers [51].
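One common bi-objective formulation of this indicator is Deb's Δ metric [35], sketched below in a simplified form of our own (lower values denote a more uniform and extended front):

```python
import math

def deb_spread(front, extreme_lo, extreme_hi):
    """Deb's Delta spread metric for a bi-objective front (lower is better).
    `front` must be sorted along the first objective; `extreme_lo` and
    `extreme_hi` are the extreme points of the true (or reference) front."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    gaps = [dist(front[i], front[i + 1]) for i in range(len(front) - 1)]
    d_mean = sum(gaps) / len(gaps)
    d_f = dist(extreme_lo, front[0])   # gap to the first extreme point
    d_l = dist(extreme_hi, front[-1])  # gap to the last extreme point
    return (d_f + d_l + sum(abs(g - d_mean) for g in gaps)) / (
        d_f + d_l + len(gaps) * d_mean)

# A perfectly uniform front touching both extremes yields Delta = 0:
delta = deb_spread([(0, 1), (0.5, 0.5), (1, 0)], (0, 1), (1, 0))
```

Fronts with large gaps between consecutive points, or fronts that fail to reach the extremes, inflate the numerator and thus the indicator value.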
Volume Under the ROC hypersurface indicator (VUS) has a great significance in studying ML approaches for classification problems by the means of a ROC centered analysis. This indicator provides information on the volume under the convex hull, and can evaluate the solution set directly. Therefore, the better solution set is the one that obtains the higher value of VUS. The set of solutions on the ROC curve represents an approximation towards the set of optimal solutions (optimal ROC curve). However, although VUS closely resembles the commonly used hypervolume indicator, VUS is specific for the learning task. In particular, VUS considers the volume of the convex hull instead of the volume of the Pareto dominated subspace, working with a fixed reference point (also called the anti-ideal classifier). The reason for using VUS instead of the hypervolume indicator is that, for an ensemble of hard classifiers, it is always possible to create classifiers that have characteristics in the criterion space, which are given by the convex combinations of the objective function vectors of the classifiers in the ensemble. Therefore, VUS is the hypervolume indicator of the ensemble augmented by all these convex combinations.
Although the area under the (ROC) convex hull (AUC) has become a standard performance evaluation criterion in binary pattern recognition problems (being widely used to compare different classifiers independently of priors and costs), the AUC measure is only applicable to binary classification problems. The ROC curve (originally used in binary classification problems) was extended to a multi-class scenario in the work of Srinivasan and Srinivasan [52]. Moreover, some simpler generalizations to compute VUS [49] as well as more complex approaches to compute the ROC hyper-surface (VUS) [53] were also proposed.
For several years, the trade-off between the computational complexity and the precision of MOEA performance measures has been discussed. In this regard, Edwards et al. [54] showed that the VUS value of a near-guessing classifier is about the same as that of a near-perfect classifier when more than two classes are considered. Alternative definitions of VUS were subsequently introduced to deal with such situations. In our work, we focus on finding sets of classifiers with optimal VUS, considering the simplified ROC proposed in the work of Landgrebe and Duin [50].

Results and discussion
This section introduces and discusses the results achieved from our experimental testbed, outlining relevant aspects concerning the behavior and performance of the proposed algorithms. The results analyzed in this section are organized following the two previously defined scenarios: parsimony maximization (3D-BinaryReal) and three-way classification (3D-Real).
In detail, in the first scenario we extend our previous study [21] by addressing not only the complexity of the classifier, but also the analysis of the rules as well as their type and relevance, with the goal of improving the classifier accuracy. Moreover, in the second scenario we test a three-way classification approach as an effective method to mark messages that need to undergo further examination by the user because of their low classification confidence.
On the one hand, the optimization results achieved in the first scenario confirmed that increasing the number of rules does not necessarily improve anti-spam classification. On the other hand, increasing the number of anti-spam filtering rules has an impact on the classifier complexity, not only in the computational resources consumed for e-mail classification, but also in the administrators' ability to understand the filtering behavior and to maintain the anti-spam system. From all the executed experiments, we can state that the classifier was able to reach high levels of accuracy in both the FNR and FPR dimensions. In particular, we found that FPR was close to zero even when using only 20% of the anti-spam rules. Additionally, the best accuracy trade-offs were also achieved using only those 20% of the available rules. From a complementary perspective, increasing the number of rules used in the classification process beyond this 20% produced only a marginal impact on the classifier accuracy.
With the goal of better understanding these rules, we specifically studied the set of 20% of anti-spam rules having the greatest impact on achieving the maximal classification accuracy. Table 4 presents the ranking of the best rules (used by all the algorithms in all experiments), which form part of the best solutions of the reference Pareto front.
The rules shown in Table 4 are sorted according to their frequency of appearance in the reference Pareto front solutions (activation frequency). As we can see from Table 4, these rules apply both to the e-mail header (e.g., the message origin domain) and to the body (i.e., the message content). While the former are based on administrative measures and information-sharing mechanisms between different anti-spam systems, the latter have a more customized nature according to the language, economic area or activity of the institution, and user preferences.
Moreover, the information shown in Table 4 indicates that 32.11% of the relevant rules are related to the message body content and 6.5% are based on manually crafted regular expressions.
Table 4 Rank of the 20% anti-spam rules used in the best solutions (i.e., individuals) comprising the reference Pareto front.
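To illustrate the two kinds of rules discussed above, the fragment below shows what a hand-written header rule and a regular-expression-based body rule look like in SpamAssassin's configuration syntax. The rule names, patterns and scores are purely illustrative and are not taken from the evaluated ruleset.

```
# Hypothetical header rule checking the claimed origin domain.
header   LOCAL_BAD_FROM     From =~ /@example-spammer\.test$/i
describe LOCAL_BAD_FROM     Message claims to come from a blocked domain
score    LOCAL_BAD_FROM     3.0

# Hypothetical body rule using a manually crafted regular expression.
body     LOCAL_CHEAP_MEDS   /\bcheap\s+(?:meds|pills)\b/i
describe LOCAL_CHEAP_MEDS   Mentions cheap medication offers
score    LOCAL_CHEAP_MEDS   2.5
```

Each matching rule adds its score to the message total, and it is precisely these per-rule scores that the studied MOEAs optimize.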
As previously commented, the second analyzed scenario aims at minimizing misclassification errors through three-way classification. In order to find statistically significant differences in the VUS and SPREAD performance indicators among the MOEAs described above, and taking into consideration that the underlying data do not fit a normal distribution, we performed a statistical analysis of the median differences by executing several Kruskal-Wallis tests.
In detail, Shapiro-Wilk tests were first carried out for the five EMOAs, revealing low p-values, which allowed us to reject the hypothesis that the data come from a normal distribution, except for the 3DCH-EMOA approach, whose p-value was above the 0.05 significance level. After that, we used the Kruskal-Wallis test to check the null hypothesis that the medians of the five algorithms are the same. Since the p-value was less than 0.05 for both the VUS and SPREAD indicators, we can confirm that there are statistically significant differences amongst the medians. In order to show specifically which medians are significantly different from each other, Figures 8 and 9 show the Box-and-Whisker plots corresponding to the VUS and SPREAD indicators, respectively. Complementarily, a comparison of the statistically significant differences between each pair of algorithms is also shown in Tables 7 and 8, respectively.
Table 8 Kruskal-Wallis analysis corresponding to the SPREAD performance indicator.
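The testing procedure above (per-algorithm normality check, then a non-parametric comparison of medians) can be sketched as follows. The per-run VUS samples here are synthetic stand-ins generated only to demonstrate the workflow; the real values come from the experimental runs.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical per-run VUS samples for three of the algorithms
# (assumed data, only to illustrate the statistical procedure).
vus = {
    "3DCH-EMOA": rng.normal(0.90, 0.01, 30),
    "MOEA/D":    rng.normal(0.85, 0.03, 30),
    "NSGA-II":   rng.normal(0.80, 0.04, 30),
}

# Step 1: Shapiro-Wilk normality check per algorithm; a low p-value
# rejects normality and motivates a non-parametric comparison.
for name, sample in vus.items():
    w, p_sw = stats.shapiro(sample)
    print(f"{name}: Shapiro-Wilk p = {p_sw:.3f}")

# Step 2: Kruskal-Wallis H-test on the samples of all algorithms;
# p < 0.05 means at least one median differs significantly.
h, p = stats.kruskal(*vus.values())
print(f"Kruskal-Wallis H = {h:.2f}, p = {p:.2e}")
```

A significant omnibus result is then followed up with pairwise comparisons, which is what Tables 7 and 8 report for the full set of five algorithms.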
As expected, the three-objective CH-EMOA implementation (3DCH-EMOA) performs much better than all the other tested algorithms, presenting not only a high average classification quality, but also stable (predictable) behavior evidenced by a low VUS variance. This good performance is mainly explained by the fact that CH-EMOA is an indicator-based algorithm that uses VUS itself as its selection criterion. The second position is occupied by the decomposition-based algorithm MOEA/D, which shows a much worse VUS average and a higher variance, far from the performance level and stability obtained by the 3DCH-EMOA approach.
Additionally, the SPREAD indicator is useful for assessing the diversity of the solutions obtained by all the algorithms under consideration, as shown in Figure 9. Indeed, the results shown in Figure 11 confirm the effectiveness of keeping a small number of e-mails unclassified with the goal of increasing classification quality. From Figure 11 we can observe that the Pareto front with a few unclassified e-mails (20) dominates the Pareto front with fewer than 10 errors (3 FP and 3 FN). Therefore, we confirm the utility of filters that keep messages unclassified when the computed solution has a low confidence level. Finally, to assess the burden of these filter optimization methods, we measured both the overall time required to run the full experiments and the relative burden of each EMOA.
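The three-way decision itself amounts to comparing a message's filter score against two thresholds and deferring the band in between to the user. A minimal sketch, with illustrative thresholds (in the actual system the scores come from the optimized SpamAssassin ruleset and the decision boundaries are part of the evolved solution):

```python
def three_way_label(score, t_low, t_high):
    """Three-way decision on a SpamAssassin-style message score.

    Scores at or above t_high are labelled spam, at or below t_low ham,
    and the band in between is left 'unclassified' for human review.
    The threshold values here are illustrative only.
    """
    if score >= t_high:
        return "spam"
    if score <= t_low:
        return "ham"
    return "unclassified"

scores = [-1.2, 4.7, 5.3, 9.8]
print([three_way_label(s, t_low=3.0, t_high=6.0) for s in scores])
# → ['ham', 'unclassified', 'unclassified', 'spam']
```

Widening the boundary region trades a few deferred messages for fewer hard errors, which is exactly the dominance relation observed between the two Pareto fronts in Figure 11.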
To this end, we executed the experiments on a computer with a quad-core Intel Xeon E5520 CPU at 2.27 GHz and 8 GB of RAM, running the Debian GNU/Linux operating system. Table 9 shows the obtained results.
Table 9 Comparison of the execution times of each algorithm.
As shown in Table 9, NSGA-II, SPEA2 and MOEA/D execution times are very similar.
These algorithms are approximately five times faster than the SMS-EMOA approach and thirty times faster than the 3DCH-EMOA alternative. The high computational burden of steady-state methods such as SMS-EMOA is, in this specific case, worsened by the increased dimensionality (i.e., three objectives to minimize) of the optimization problem. The high computational burden of 3DCH-EMOA is related to the complexity of the convex hull calculation, as described in [21,31]. MOEA/D is the algorithm that requires the least computational time to compute an optimized SpamAssassin ruleset. However, the computational footprint of these EMOAs seems to limit their applicability in real domains.
To cope with these issues, next subsection introduces a set of practical considerations to properly deploy these optimization techniques in real environments.

Practical deployment considerations
As previously discussed, the use of different MOEAs to optimize SpamAssassin filter scores entails a significant computational footprint. Since this issue must be taken into consideration when using them in real environments, we have compiled a set of recommendations for deploying score optimization mechanisms into production e-mail filtering servers.
First of all, the optimization process requires target-domain messages that have been previously classified. Since the filtering process in a Mail Transfer Agent (MTA) can be customized through a configurable script, these e-mails can easily be compiled by modifying that script. Moreover, the classification could also be achieved by using the SpamAssassin client (spamc). To cope with SpamAssassin misclassifications, two e-mail users (e.g., not_spam and not_ham) could be created to receive feedback from the final users.
With this configuration, those messages compiled within an appropriate time period could be further used to execute the optimization process.
Complementarily, and keeping in mind the nature of the proposed methods, the optimization process should be repeated periodically (e.g., once a month or once a week), depending on the available computational capabilities. However, the proposed optimization process should not run on the production e-mail filtering server itself, to avoid lags in message exchange.
Finally, with the goal of improving speed at an affordable cost, high-performance parallel computing and cluster techniques (e.g., MapReduce, CUDA) can be employed. We consider the application of these techniques suitable for speeding up the studied MOEAs.

Conclusions and future work
In this work, we have evaluated the utility of several multi-objective evolutionary algorithms to optimize rule-based anti-spam filters from different but complementary perspectives. To this end, we presented two experimental case studies where filter complexity and a three-way classification strategy were considered as additional objectives. The first scenario (parsimony maximization) revealed that the number of rules could be significantly reduced without affecting the filter performance. Moreover, the experimental results related to the use of a three-way classification approach demonstrated the utility of defining a boundary region (where the classifier confidence is too low) to reduce the number of misclassification errors.
In this context, and from the experiments carried out, we would like to emphasize that of the 330 rules that match messages in the SpamAssassin corpus, only 5% to 20% are really needed to achieve an optimal classification. Moreover, taking into consideration the particular nature of the spam filtering domain, a considerable proportion of the relevant rules are based on regular expressions. These rules are used to specifically parse and check the e-mail structure, syntax and content, representing a major contribution to anti-spam filter customization. The design of this type of rules constitutes an important share of the effort made by system administrators to release novel and accurate anti-spam filters. Therefore, research aiming at the automatic generation of regular expressions from a given corpus is of high interest, having been initially addressed in the work of Basto-Fernandes et al. [55].
With regard to our three-way classification experiments, it was revealed that indicator-based algorithms perform well when carrying out multi-objective optimization of ROC curve performance. The best results for the VUS indicator were achieved by 3DCH-EMOA.
Additionally, according to the SPREAD indicator results, this algorithm also achieves good performance, taking into account that this approach does not allow including points in the concave parts of the Pareto front. Finally, with the introduction of an extra 'unclassified' label in the filter (intended to inform the user of messages with a low confidence level), a considerable improvement in quality can be achieved, avoiding harmful misclassifications at a low cost (time) for e-mail users.
Current and future work includes the investigation of whether obtained results generalize to data sets from other domains (e.g., web spam) where classification is commonly used.
Moreover, as previously stated, the automatic generation of regular expressions remains an interesting challenge in the domain of spam filtering.
All algorithms used in this study were implemented in the JMetal Java framework and are available from the authors upon request.