A SPAM FILTERING MULTI-OBJECTIVE OPTIMIZATION STUDY COVERING PARSIMONY MAXIMIZATION AND THREE-WAY CLASSIFICATION

Classifier performance optimization in machine learning can be stated as a multi-objective optimization problem. In this context, recent works have shown the utility of simple evolutionary multi-objective algorithms (NSGA-II, SPEA2) to conveniently optimize the global performance of different anti-spam filters. The present work extends existing contributions in the spam filtering domain by using three novel indicator-based (SMS-EMOA, CH-EMOA) and decomposition-based (MOEA/D) evolutionary multi-objective algorithms. The proposed approaches are used to optimize the performance of a heterogeneous ensemble of classifiers in two different but complementary scenarios: parsimony maximization and e-mail classification under low confidence level. Experimental results using a publicly available standard corpus allowed us to draw interesting conclusions regarding both the utility of rule-based classification filters and the appropriateness of a three-way classification system in the spam filtering domain.


Introduction
Nowadays, the use of Internet mailing services has become indispensable in the daily life of millions of users worldwide. Additionally, the combination of e-mail with the latest mobile always-connected smart-phones provides a simple but powerful method to stay in touch with other people and efficiently exchange documents at any time. As a result, both instant messaging (IM) applications and e-mail are commonly used for this purpose. However, a discusses relevant issues regarding each case study. Finally, Section 4 provides conclusions and identifies future research work.

Materials and methods
In spite of the fast progress in computer technology and the constant increase of computational power, performing exhaustive searches in large continuous and combinatorial spaces is still challenging. In this context, the remarkable popularity of EAs over other optimization techniques is mainly motivated by their ability to search these spaces and find approximate (near) optimal solutions [25]. In the particular domain of multi-objective optimization, EAs are well-established computational methods whose population-based approach makes them suitable to search for approximation sets to the efficient set.
In this way, EAs were found to be particularly useful for dealing with multi-objective problems characterized by several conflicting goals, for which not simply a single optimum solution, but a set of Pareto optimal or non-dominated solutions need to be obtained.
Together, these solutions represent the trade-offs between the existing objectives, being optimal in the sense of Pareto dominance. In such a situation, a Pareto optimal solution can only be improved in one objective at the expense of a loss in one or more of the others. Since a population of candidate solutions is used in parallel to solve these problems, the search is directed not towards a single optimum but towards multiple Pareto optimal solutions, which is the case of MOEAs (also known as Evolutionary Multi-objective Optimization Algorithms, EMOAs).
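The notion of Pareto dominance used above can be made concrete with a minimal sketch. It assumes that every objective is minimized (as with FNR and FPR); the function names and the sample objective vectors are illustrative only and are not taken from the experiments.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if a Pareto-dominates b, assuming all objectives are minimized."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def non_dominated(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point in the list dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (FNR, FPR) pairs for four filter configurations.
candidates = [(0.10, 0.02), (0.05, 0.05), (0.12, 0.03), (0.02, 0.10)]
front = non_dominated(candidates)
print(front)
```

In this example the configuration (0.12, 0.03) is discarded because (0.10, 0.02) is at least as good in both objectives and strictly better in one; the remaining three are mutually non-dominated trade-offs.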
The constant development of novel MOEAs, aimed at improving the quality of the obtained set of solutions (in terms of both convergence and coverage of the Pareto front approximation), has led to several generations of MOEAs [26]. The theoretical foundations of these approaches are thoroughly discussed in [27] while more recent approaches can be found in the works of Laumanns [28] and Auger et al. [29]. Additionally, a large number of real-world applications are also discussed in the work of Coello [26].

Problem formulation: spam filtering domain perspective
As previously discussed, in the context of traditional binary classifiers, misclassifications are commonly grouped into FP and FN errors. In order to correctly apply EAs to optimize anti-spam filters, a normalized count of the false negative and false positive occurrences is adopted. These measures are called false negative rate (FNR) and false positive rate (FPR), respectively. Expression (1) shows how to compute their corresponding values.
Additionally, when working with traditional ML classifiers, their hits can be separated into true positives (TP) and true negatives (TN) classes, depending on whether the target message was really spam or legitimate, respectively. In this context, the relationship between the true labels and the predicted ones can therefore be presented in a two-by-two straightforward confusion matrix as shown in Table 1. Taking into consideration the values that are part of the confusion matrix, FNR and FPR can be easily computed, as shown in Expression (2).
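As a minimal sketch, the usual textbook definitions of both rates can be computed directly from the four confusion matrix cells. It is an assumption that these definitions coincide with the expressions referenced above; spam is taken as the positive class, and the counts in the example are hypothetical (only the class totals of 2398 spam and 6951 legitimate messages come from the corpus description).

```python
from typing import Tuple


def error_rates(tp: int, fn: int, fp: int, tn: int) -> Tuple[float, float]:
    """Return (FNR, FPR) from confusion matrix counts, spam being the positive class."""
    spam_total = tp + fn   # all messages that are really spam
    ham_total = fp + tn    # all messages that are really legitimate
    if spam_total == 0 or ham_total == 0:
        raise ValueError("both classes must contain at least one message")
    fnr = fn / spam_total  # share of spam delivered as legitimate
    fpr = fp / ham_total   # share of legitimate mail flagged as spam
    return fnr, fpr


# Hypothetical outcome of one filter configuration on the corpus.
fnr, fpr = error_rates(tp=2300, fn=98, fp=7, tn=6944)
print(f"FNR={fnr:.4f} FPR={fpr:.4f}")
```

Normalizing by the class totals keeps both objectives in the range [0, 1] regardless of the legitimate/spam ratio of the corpus.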
However, if we consider only rigid binary classifiers, for which every new instance is simply categorized as positive or negative, the result may be a high number of misclassifications leading to high costs. To reduce these errors, the final user of an anti-spam filter could help in the classification task with the goal of improving the accuracy of the filter and reducing its associated cost. In this context, the use of a three-way classification scheme allows us to take advantage of final users both to improve the classification accuracy of the filter and to maximize the user experience. In such a situation, the initial confusion matrix (introduced in Table 1) becomes a three-by-two matrix, as shown in Table 2.
Table 2 Confusion matrix of three-way classifiers.

Optimizing ML classifiers with EAs
Traditional ML classification tasks involving parameter optimization and model selection can be successfully reformulated as multi-objective optimization problems. In fact, they usually require achieving improvements on several conflicting goals, such as recall/sensitivity, precision/specificity and classifier complexity [21], simultaneously. Apart from this classical perspective, many other examples of multi-objective optimization of ML classifiers can also be considered, such as the trade-off between learning new information and forgetting old information, or between learning as many details as possible and generalizing the model to its maximum in pattern recognition [30]. However, it is only relatively recently that the design of ML systems was conceived from a multi-objective point of view, considering the simultaneous optimization of multiple conflicting objectives instead of combining them into a single objective. Due to the large combinatorial space of such problems, solving them with exact algorithms is not possible in most cases; hence they are often solved using evolutionary approaches, MOEAs in particular.
In this regard, a usual way to measure the performance of different ML classifiers (or different configurations of the same model) is through the Receiver Operating Characteristic (ROC) curve, which conveniently summarizes the classifier performance when varying discriminative thresholds for two-class classification problems. ROC Convex Hull (ROCCH) is currently being widely used by the scientific community, being able to represent the convex hull area of a set of points, each of which stands for an optimal classifier [21].
Maximizing ROCCH leads to finding the group of classifiers that together provides the best range of optimal classifiers.
Even though the concepts of ROC Convex Hull and the Pareto front were reported to be similar (leading to the application of EMOA for approximating ROCCH [31]), a specific and important property of ROCCH makes it more valuable than the Pareto front. When using ROCCH, any two classifiers belonging to the Convex Hull can be joined using a line in which a new virtual classifier is represented as a point with its corresponding performance [32]. This property can be straightforwardly used to save computational resources [31].
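The line-joining property can be illustrated with a short sketch. It assumes that each classifier is summarized by a (FPR, TPR) point and that the virtual classifier is realized by randomly delegating each message to one of the two real classifiers, which is the standard ROC interpolation argument; the function names and the sample points are hypothetical.

```python
import random
from typing import Callable, Tuple

Point = Tuple[float, float]  # (FPR, TPR) of one classifier


def virtual_point(a: Point, b: Point, lam: float) -> Point:
    """Expected (FPR, TPR) when a is used with probability lam and b otherwise."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (lam * a[0] + (1.0 - lam) * b[0],
            lam * a[1] + (1.0 - lam) * b[1])


def virtual_classify(clf_a: Callable[[str], bool],
                     clf_b: Callable[[str], bool],
                     lam: float,
                     message: str,
                     rng: random.Random) -> bool:
    """Delegate one message to clf_a with probability lam, else to clf_b."""
    chosen = clf_a if rng.random() < lam else clf_b
    return chosen(message)


# Two hypothetical classifiers lying on the convex hull.
conservative = (0.0, 0.6)
aggressive = (0.2, 1.0)
midpoint = virtual_point(conservative, aggressive, 0.5)
print(midpoint)
```

Because every point on the segment is reachable this way, only the hull vertices need to be stored or evaluated, which is one plausible reading of how the property saves computational resources.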
When dealing with a multi-objective optimization problem, there are typically m objective functions,

( )
in the remaining cases. The selection of a single solution among the set of Pareto optimal solutions is usually done by decision maker(s). As this is not an easy task, different decision-aiding tools able to take into account the preferences of decision maker(s) are used to support this judgment.
The Pareto fronts achieved by MOEAs are usually approximated by a finite number of points.
In fact, the hypervolume indicator (i.e., a bounded size set of points that jointly dominates a maximal part of the objective space relative to a reference point) or the Convex Hull (which dominates a large part of this space) are commonly used for this purpose. The latter is argued to be more appropriate when it comes to ROC curves approximation [31].
In a three-way classification scenario, apart from the recommendable maximization of TPR and TNR, the third obvious objective is to maximize the number of instances labeled as spam or legitimate by the filter, namely, the classified instances rate (CR) or coverage. Considering a graphical 3D plot of all these variables (see Figure 1  In Figure 1, the surface

Relevant advances on multi-objective optimization methods
Considering a general (not problem-specific) EMOA, the selection operator chosen has a huge influence over its efficiency. In each iteration, this operator picks out individuals to be passed to the next generation and hence, parents in the mating phase. In this context, the evolution of EMOAs has been widely influenced by all the research done in the scope of the selection operators.
Pareto-based EMOAs, such as the Non-dominated Sorting Genetic Algorithm (NSGA) [33] and the Multi-Objective Genetic Algorithm (MOGA) [34], were the first approaches using Pareto non-dominance in the ranking and selection of individuals. The next group, elitist EMOAs, which includes NSGA-II [35] and the Strength Pareto Evolutionary Algorithm (SPEA) [36], is characterized by selecting non-dominated solutions and allowing their preservation with respect to earlier generations. A different approach was adopted in the design of region-based EMOAs, such as the Pareto Envelope-based Selection Algorithm (PESA-II) [37], which focuses on the competition between regions of the objective space instead of the rivalry between individuals. This strategy is similar to the concept of niching [38], with a restricted number of individuals being selected from each niche in the objective space (note that niching in the decision space is also common [39]).
Later, the development of performance assessment measures (indicators) revealed the possibility of using EMOAs performance indicators directly in the selection operator, which gave birth to indicator-based EMOAs, such as the Indicator-Based Evolutionary Algorithm (IBEA) [40] and the Multi-objective Selection Based on Dominated Hypervolume (SMS-EMOA) [41][42]. In general, any performance indicator may be used in IBEA, but binary indicators preserving Pareto non-dominance are commonly adopted. For instance, hypervolume and epsilon indicators satisfy the desired properties (e.g., compliance with the Pareto dominance principle and monotonicity) and are commonly suggested as the default choice for the selection operator to be based on. Additionally, it was also reported that the hypervolume indicator selects extreme points and concentrates search around the knee region of the Pareto front [41].
NSGA-II stems from its predecessor NSGA and adopts the same non-dominated sorting procedure for allocating all solutions to classes with respect to their non-domination rank [43]. In order to address the mating selection of parents that will participate in offspring production and environmental selection (from both parents and offspring), individuals from the fronts with the best (first) ranks are preferred. In particular, parents are randomly chosen from the population and compared in a binary tournament. If two individuals belong to fronts with different ranks, then the winner that enters the mating pool is the one from the front with the better rank. Moreover, if two individuals belong to the same front, the comparison is done based on their crowding distance (the one with the largest crowding distance is preferred). Similarly, to fill the next population of individuals from the union of parents and offspring populations, individuals from fronts having better ranks are selected first. When there are more available individuals than needed from the last front, those with the largest crowding distance (having the largest distance to the nearest neighbors) are selected. However, the crowding distance mechanism of diversity preservation works poorly when there are more than three objectives. The general framework of NSGA-II is described in the algorithm shown in Figure 2, and details of non-dominated sorting and crowding distance sorting are provided in the algorithms shown in Figures 3 and 4, respectively.
Many other methods were also developed to improve the performance of NSGA-II and to address diversity preservation in the selection operator. In this line, SMS-EMOA, one of the earliest hypervolume-based methods, performs a non-dominated sorting of the population to build a single offspring during an initial stage.
Then, the offspring is ranked against the already sorted population and an individual from the last non-dominated front with the smallest hypervolume contribution is removed. Figure 5 describes the general framework of the SMS-EMOA approach whilst Figure 6 introduces the details of its replace procedure.  The success of SMS-EMOA motivated developers to create the steady state version of NSGA-II [44], in which only a single offspring is created in each generation. Therefore, only the worst individual is removed from the population at each iteration of the algorithm.
Indeed, the steady state version of NSGA-II performs better than the original one [35] at the expense of higher computational costs.
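The removal step of SMS-EMOA can be sketched for the simplest case. The sketch below assumes two minimized objectives, a front of distinct and mutually non-dominated points, and a reference point that is worse than every front member in both objectives; the experiments in this work use three objectives, for which the contribution computation is more involved. Function names and sample values are illustrative.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def hv_contributions_2d(front: List[Point], ref: Point) -> Dict[Point, float]:
    """Exclusive hypervolume contribution of each point of a 2-D non-dominated front."""
    pts = sorted(front)  # ascending f1, therefore descending f2
    contributions = {}
    for i, (f1, f2) in enumerate(pts):
        # Right neighbour bounds the box in f1, left neighbour bounds it in f2;
        # the reference point closes the boxes of the two extreme points.
        right_f1 = pts[i + 1][0] if i + 1 < len(pts) else ref[0]
        upper_f2 = pts[i - 1][1] if i > 0 else ref[1]
        contributions[(f1, f2)] = (right_f1 - f1) * (upper_f2 - f2)
    return contributions


def drop_least_contributor(front: List[Point], ref: Point) -> List[Point]:
    """Steady-state replacement: discard the point with the smallest contribution."""
    contributions = hv_contributions_2d(front, ref)
    worst = min(contributions, key=contributions.get)
    return [p for p in front if p != worst]


# Hypothetical last front after one offspring has been added.
last_front = [(0, 10), (4, 5), (5, 4.5), (10, 0)]
reference = (12, 12)
survivors = drop_least_contributor(last_front, reference)
print(survivors)
```

Here (5, 4.5) is removed because it lies close to (4, 5) and adds the least dominated area, which shows how hypervolume-based selection penalizes crowded points without an explicit distance measure.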

Available datasets for anti-spam research
With the goal of boosting the design of novel and accurate filters and the execution of new anti-spam experiments, several companies in conjunction with the scientific community have publicly shared their e-mail datasets (corpora). All these available corpora can be used for testing purposes, enabling results to be compared between different research works. Table 3 compiles the most popular datasets that can be freely downloaded from the Internet.
to guarantee that the lower bound value is smaller than the upper one. By following this problem formulation, three labels are required for classifying e-mails: legitimate, spam and unclassified. Therefore, whenever the target message achieves a score below the lower bound threshold, it is classified as legitimate. Conversely, if the e-mail score is above the upper bound threshold, the message is classified as spam. Otherwise, if the score is located inside the interval, the message is set aside for further examination (unclassified). Under this situation, the rationale is that it is better to leave an e-mail unclassified than to provide a wrong classification.
In our second scenario, the three objectives are the minimization of FNR, FPR and the unclassified ratio of e-mails (UR). This third objective also falls in the range of [0, 1], as shown in Expression (4).
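The three-way decision rule and its three objectives can be sketched as follows. Several details are assumptions made for illustration: FNR and FPR are normalized by the total number of spam and legitimate messages respectively, UR is the fraction of all messages left unclassified, and the scores and thresholds in the example are hypothetical.

```python
from typing import List, Tuple


def three_way_label(score: float, lower: float, upper: float) -> str:
    """Label a message from its filter score and the two thresholds."""
    if lower >= upper:
        raise ValueError("the lower bound must be smaller than the upper bound")
    if score < lower:
        return "legitimate"
    if score > upper:
        return "spam"
    return "unclassified"  # score falls inside the low-confidence interval


def three_way_objectives(scores: List[float], is_spam: List[bool],
                         lower: float, upper: float) -> Tuple[float, float, float]:
    """Return (FNR, FPR, UR), all to be minimized and all within [0, 1]."""
    labels = [three_way_label(s, lower, upper) for s in scores]
    n_spam = sum(is_spam)
    n_ham = len(is_spam) - n_spam
    fn = sum(1 for lab, y in zip(labels, is_spam) if y and lab == "legitimate")
    fp = sum(1 for lab, y in zip(labels, is_spam) if not y and lab == "spam")
    unclassified = labels.count("unclassified")
    return fn / n_spam, fp / n_ham, unclassified / len(labels)


# Six hypothetical messages with their scores and true classes.
scores = [1.0, 4.0, 5.5, 7.2, 2.0, 6.5]
truth = [False, False, True, True, True, False]
result = three_way_objectives(scores, truth, lower=3.0, upper=6.0)
print(result)
```

Widening the interval between the two thresholds lowers FNR and FPR at the price of a higher UR, which is exactly the conflict that makes the problem tri-objective.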

Benchmarking protocol and parameter setup
In order to guarantee the reproducibility of our experiments, this subsection introduces a straightforward description of all the configuration details needed to run our experimental tests including the target filter definition, the available datasets, and different parameter details regarding the configuration of the executed algorithms.
With reference to the target filter to be optimized, we selected the default and standard spam filter configuration included in the Debian GNU/Linux Squeeze distribution running SpamAssassin 3.3.1 [9]. This decision was mainly motivated by the fact that SpamAssassin rules match some e-mails, so only those rules were finally selected to be part of our multi-objective optimization process.
From all the available alternatives shown in Table 3, we finally selected the well-known SpamAssassin corpus [46] in order to run our experimental testbed. Our selection guarantees a medium-sized corpus (containing 9349 e-mails) providing both legitimate and spam messages (6951 legitimate vs 2398 spam) characterized by a legitimate/spam ratio very similar to the proportion found in current e-mail in-boxes. Moreover, the SpamAssassin corpus has been distributed in the same raw format in which messages were transmitted through the Internet (RFC 5322 format [47]). Hence, using the spamc and spamd tools included in the SpamAssassin package [9], we can easily compute the matching rules for each available message (spamc -y < file.rfc5322.eml). The output of the previous command is saved to a file, which is later used to improve the evaluation speed for each configuration.
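A minimal sketch of how such cached output might be reused is given below. It assumes that each saved file holds a single comma-separated line of rule names, that a candidate solution is a mapping from rule names to scores, and that a parsimony mask is a set of active rule names; the helper names and the score values are hypothetical and do not describe the implementation used in the experiments.

```python
from typing import Dict, FrozenSet, Iterable, Optional


def parse_matches(line: str) -> FrozenSet[str]:
    """Turn one cached line of comma-separated rule names into a set."""
    return frozenset(name.strip() for name in line.split(",") if name.strip())


def message_score(matched: Iterable[str], scores: Dict[str, float],
                  active: Optional[Iterable[str]] = None) -> float:
    """Sum the candidate scores of the matched rules, optionally restricted to active rules."""
    allowed = None if active is None else set(active)
    return sum(scores.get(rule, 0.0)
               for rule in matched
               if allowed is None or rule in allowed)


# Hypothetical cached matches for one message and a hypothetical score vector.
cached_line = "BAYES_99,HTML_MESSAGE,MIME_HTML_ONLY"
candidate_scores = {"BAYES_99": 3.5, "HTML_MESSAGE": 0.5, "MIME_HTML_ONLY": 1.5}

matched_rules = parse_matches(cached_line)
full_score = message_score(matched_rules, candidate_scores)
sparse_score = message_score(matched_rules, candidate_scores, active={"BAYES_99"})
print(full_score, sparse_score)
```

Since rule matching does not depend on the scores being optimized, evaluating a new configuration reduces to sums over cached sets, so the filter does not have to be invoked again for every candidate solution.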
As previously stated, in this work we apply three novel indicators and decomposition-based For all the experiments, we established a maximum number of 25,000 function evaluations.
Both the single point crossover and bit flip mutation operators were applied to manipulate binary data in the tri-objective binary-real representation experiments.
Additionally, the SBX crossover and polynomial mutation operators were used to manipulate real data belonging to both problem representations (i.e., tri-objective real-valued and tri-objective binary-real). For binary variables, a bit flip mutation was used with a probability of

Discussion of performance measures
Although many global performance measures can be found in the scientific literature, comparing EMOAs results is still an open problem. In contrast to single objective algorithms, the performance assessment of algorithms with multiple objectives constitutes a complex task. Among others, it involves assessing the quality of the outcome (i.e., how to measure quality), the computing resources used (e.g., time, number of function evaluations, etc.) as well as the analysis of several runs of the (stochastic-based) algorithm to take randomness and parameterization into account. Therefore, the analysis requires extensions of comparison methods. Instead of comparing objective vectors, approximation sets of several independent runs of algorithms need to be compared.
In multi-objective optimization, it is often impossible to know the (true) Pareto optimal set to be used in the comparison with the outcomes of EMOAs. Thus, general performance assessment criteria for multi-objective optimization algorithms should be considered including accuracy, coverage and variance, also called convergence, uniformity and spread.
Under the best of circumstances, the obtained Pareto-optimal solutions are accurate, which means they are as close as possible to the true Pareto front of non-dominated solutions, well distributed and widely spread. Coverage and spread measures are closely related but are not exactly the same, because the former requires a representation of each region of the Pareto front, while the latter makes sure that the distance between points of the Pareto front approximation is evenly distributed (apparently it tends to give higher preference to boundary points).
In this context, theoretical and empirical techniques can be used for performance assessment. Quality indicators allow the analysis of two algorithms to determine how much, and in which specific aspects, one of them is better than the other. However, these alternatives can only measure specific/limited quality aspects. Therefore, in our study we compute and analyze two complementary quality indicators that have reasonable properties related to the domain specific multi-objective optimization algorithms used in our experiments: SPREAD [35] and VUS [49][50].
The SPREAD indicator is commonly used for the comparison of different EMOAs. This indicator can be evaluated in either the objective or decision spaces, showing how far the Pareto front or set spreads in the objective or decision space, respectively. Hence, the larger the spread of the Pareto front is, the wider the range of values on objectives/variables it covers [51].
The Volume Under the ROC hypersurface indicator (VUS) is of great significance when studying ML approaches for classification problems by means of a ROC-centered analysis. This indicator provides information on the volume under the convex hull, and can evaluate the solution set directly. Therefore, the better solution set is the one that obtains the higher value of VUS. The set of solutions on the ROC curve represents an approximation towards the set of optimal solutions (optimal ROC curve). However, although VUS closely resembles the commonly used hypervolume indicator, VUS is specific to the learning task. In particular, VUS considers the volume of the convex hull instead of the volume of the Pareto dominated subspace, working with a fixed reference point (also called the anti-ideal classifier). The reason for using VUS instead of the hypervolume indicator is that, for an ensemble of hard classifiers, it is always possible to create classifiers whose characteristics in the criterion space are given by the convex combinations of the objective function vectors of the classifiers in the ensemble. Therefore, VUS is the hypervolume indicator of the ensemble augmented by all these convex combinations.
Although the area under the (ROC) convex hull (AUC) has become a standard performance evaluation criterion in binary pattern recognition problems (being widely used to compare different classifiers independently of priors and costs), the AUC measure is only applicable to binary classification problems. The ROC curve (originally used in binary classification problems) was extended to a multi-class scenario in the work of Srinivasan and Srinivasan [52]. Moreover, some simpler generalizations to compute VUS [49] as well as more complex approaches to compute the ROC hyper-surface (VUS) [53] were also proposed.
For several years, computational complexity and precision of MOEA trade-offs were discussed. In this regard, Edwards et al. [54] showed that the VUS value of a near guessing classifier is about the same as that of a near perfect classifier when more than two classes are considered. Alternative definitions of VUS were subsequently introduced for dealing with such situations. In our work, we focus on finding sets of classifiers with optimal VUS, considering the simplified ROC proposed in the work of Landgrebe and Duin [50].

Results and discussion
This section introduces and discusses the results achieved from our experimental testbed, outlining relevant aspects concerning the behavior and performance of the proposed algorithms. The results analyzed in this section are organized following the two previously defined scenarios: parsimony maximization (3D-BinaryReal) and three-way classification (3D-Real).
In detail, in the first scenario we extend our previous study [21] by addressing not only the complexity of the classifier, but also the analysis of the rules as well as their type and relevance, with the goal of improving the classifier accuracy. Moreover, in the second scenario we test a three-way classification approach as an effective method to mark messages that need to undergo further examination by the user because of their low classification confidence.
On the one hand, optimization results achieved in the first scenario confirmed that the increment in the number of rules does not necessarily lead to an improvement in anti-spam filtering classification. On the other hand, increasing the number of anti-spam filtering rules has an impact on the classifier complexity, not only in the computational resources consumption for e-mail classification, but also in the administrators' ability to understand the filtering behavior and to maintain the anti-spam system. From all the executed experiments, we can state that the classifier was able to reach high levels of accuracy in both FNR and FPR dimensions. In particular, it was found that FPR was close to zero even when using only 20% of the anti-spam rules. Additionally, the best accuracy trade-offs were also achieved using only those 20% of the available rules. From another complementary perspective, increasing the number of rules used in the classification process to exceed this 20% only produced a marginal impact on the classifier accuracy.
With the goal of better understanding those types of rules, we specifically studied the set of 20% anti-spam rules having the major impact in achieving the maximal classification accuracy. Table 4 presents the rank of the best rules (being used by all the algorithms in all the experiments), which are part of the best solutions of the reference Pareto front.
The rules shown in Table 4 are sorted according to their appearance in the reference Pareto front solutions (activation frequency). As we can see from Table 4, these rules are applied to both the e-mail header (e.g., messages origin domain) and body (i.e., message content). While the former is based on administrative measures and information sharing mechanisms between different anti-spam systems, the latter has a more customized nature according to language, economic area or activity of the institutions, and user preferences.
Moreover, the information shown in Table 4 indicates that 32.11% of relevant rules are related to the message body content and 6.5% are based on regular expressions, manually created by system administrators for parsing and checking e-mail structure, syntax and content. Remaining rules are related to e-mail message headers and different e-mail system administration policies.
Table 4 Rank of the 20% anti-spam rules being used in the best solutions (i.e., individuals) comprising the reference Pareto front.
As previously commented, for the second analyzed scenario (aiming at the minimization of FNR, FPR and UR) we provide a performance assessment based on the graphical and indicator-based analysis of the Pareto front. In this way, boxplots depicting median, quartiles and outliers on the multi-criteria performance indicators (SPREAD and VUS) are shown for the five algorithms under consideration. The comparison of those algorithms is done with respect to the reference Pareto front, which is taken as the closest approximation of the true Pareto front.
In order to find statistically significant differences corresponding to VUS and SPREAD performance indicators among the MOEAs described above, and also taking into consideration that the underlying data do not fit a normal distribution, we performed a statistical analysis of the median differences by executing several Kruskal-Wallis tests.
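For readers unfamiliar with the test, the Kruskal-Wallis H statistic can be sketched in a few lines. The version below uses average ranks for tied values but omits the tie correction factor, and it compares H against a tabulated chi-square critical value instead of computing a p-value; the three groups of indicator values are hypothetical and do not reproduce the reported results.

```python
from typing import List, Sequence


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic (average ranks for ties, no tie correction)."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums: List[float] = [0.0] * len(groups)
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        # Positions i..j-1 hold equal values; they share the mean of ranks i+1..j.
        average_rank = (i + 1 + j) / 2.0
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        i = j
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    return 12.0 / (n * (n + 1)) * weighted - 3.0 * (n + 1)


# Hypothetical indicator values from three algorithms (three runs each).
runs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_value = kruskal_wallis_h(runs)

# Chi-square critical value for 2 degrees of freedom at the 0.05 level.
CRITICAL_DF2 = 5.991
print(h_value, h_value > CRITICAL_DF2)
```

With five algorithms the statistic has four degrees of freedom, for which the corresponding 0.05 critical value is 9.488. Because the test works on ranks only, it does not require the normality assumption that the indicator values fail to meet.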
In detail, Shapiro-Wilk tests were first carried out for the five EMOAs, revealing low p-values, which allowed us to reject the hypothesis that the data come from a normal distribution, except for the 3DCH-EMOA approach with a p-value greater than 0.
After that, we used the Kruskal-Wallis test to check the null hypothesis that the medians of the five algorithms are the same. Since the p-value was less than 0.05 for both VUS and SPREAD indicators, we can confirm that there are statistically significant differences amongst the medians. In order to specifically show which medians are significantly different from each other, Figures 8 and 9 show the Box-and-Whisker plots corresponding to the VUS and SPREAD indicators, respectively. Complementarily, a comparison of the statistically significant differences amongst each pair of algorithms is also shown in Tables 7 and 8.
Table 7 Kruskal-Wallis analysis corresponding to the VUS performance indicator.
Indeed, results shown in Figure 11 confirm the effectiveness of maintaining a small number of e-mails unclassified, with the goal of increasing the classification quality. From Figure 11 we can observe that the Pareto front with a few unclassified e-mails (20) dominates the Pareto front with fewer than 10 errors (3 FP and 3 FN). Therefore, we confirm the utility of those filters that keep messages unclassified when the computed solution has a low confidence level.
Finally, to check the burden of these filter optimization methods, we measured both the overall time required to run the full experiments and also the relative burden of each EMOA.
To this end, we executed the experiments on a computer with a quad-core Intel Xeon E5520 CPU at 2.27 GHz and 8 GB of RAM, running the Debian GNU/Linux operating system. Table 9 shows the obtained results.
Table 9 Comparison of the execution times belonging to each algorithm.
As shown in Table 9, NSGA-II, SPEA2 and MOEA/D execution times are very similar.
These algorithms are approximately five times faster than the SMS-EMOA approach and thirty times faster than the 3DCH-EMOA alternative. The high computational burden of steady state methods such as SMS-EMOA is in this specific case worsened by the increased dimensionality (i.e., three objectives to minimize) of the optimization problem. The 3DCH-EMOA high computational burden is related to the convex hull calculation complexity, as described in [21,31]. MOEA/D is the algorithm that requires the smallest amount of computational time to compute an optimized SpamAssassin ruleset. However, the computational footprint of these EMOAs seems to limit their applicability in real domains.
To cope with these issues, the next subsection introduces a set of practical considerations to properly deploy these optimization techniques in real environments.

Practical deployment considerations
As previously discussed, the use of different MOEAs to optimize SpamAssassin filter scores has a significant computational footprint. Since this issue must be taken into consideration before using them in real environments, we have compiled a set of recommendations to deploy score optimization mechanisms into production e-mail filtering servers.
First of all, the optimization process requires previously classified messages from the target domain. Since the filtering process in a Mail Transfer Agent (MTA) can be customized through a configurable script, these e-mails can be easily collected by modifying this script. Moreover, the classification could also be achieved by using the SpamAssassin client (spamc). To cope with SpamAssassin misclassifications, two e-mail users (e.g., not_spam and not_ham) could be created to receive feedback from the final user.
With this configuration, those messages compiled within an appropriate time period could be further used to execute the optimization process.
Complementarily, and keeping in mind the nature of the proposed methods, the optimization process should be repeated periodically (once a month or once a week, depending on the available computational capabilities). However, the proposed optimization process should not be run on a production e-mail filtering server, in order to avoid lags in message exchange.
Finally, with the goal of improving speed at affordable costs, high performance parallel computing and cluster techniques (e.g., MapReduce, CUDA, etc.) are commonly used. We found these techniques suitable for improving the performance of MOEAs.

Conclusions and future work
In this work, we have evaluated the utility of several multi-objective evolutionary algorithms to optimize rule-based anti-spam filters from different but complementary perspectives. To this end, we presented two experimental case studies where filter complexity and three-way classification strategy were considered as additional objectives. The first scenario (parsimony maximization) revealed that the number of rules could be significantly reduced without affecting the filter performance. Moreover, experimental results related to the use of a three-way classification approach demonstrated the utility of defining a boundary region (where the classifier confidence is too low) to reduce the number of misclassification errors.
In this context, and from the experiments carried out, we would like to emphasize that from the 330 rules that match messages in the SpamAssassin corpus, only 5% to 20% of rules are really needed to achieve an optimal classification. Moreover, and taking into consideration the particular nature of the spam filtering domain, a considerable number of relevant rules are based on regular expressions. These rules are used to specifically parse and check the e-mail structure, syntax and content, representing a major contribution in anti-spam filtering customization. The design of this type of rule constitutes an important share of the effort made by system administrators to release novel and accurate anti-spam filters. Therefore, research aiming at the automatic generation of regular expressions from any given corpus is of high interest, having been initially addressed in the work of Basto-Fernandes et al. [55].
With regard to our three-way classification experiments, it was revealed that indicator-based algorithms perform well when carrying out multi-objective optimization of ROC curve performance. The best results for the VUS indicator were achieved by 3DCH-EMOA.
Additionally, according to SPREAD indicator results, this algorithm also achieves good performance taking into account that this approach does not allow including points in the concave parts of the Pareto front. Finally, with the introduction of an extra 'unclassified' label in the filter (targeted to inform the user of those messages with a low confidence level), a considerable improvement in quality can be achieved to avoid harmful misclassifications at low cost for e-mail users (time).
Current and future work includes the investigation of whether obtained results generalize to data sets from other domains (e.g., web spam) where classification is commonly used.
Moreover, as previously stated, the automatic generation of regular expressions remains an interesting challenge in the domain of spam filtering.
All algorithms used in this study were implemented in the jMetal Java framework and are available from the authors upon request.