Repositório ISCTE-IUL

Machine learning methods have become an indispensable tool for utilizing large knowledge and data repositories in science and technology. In the pharmaceutical domain, the amount of acquired knowledge about the design and synthesis of pharmaceutical agents and bioactive molecules (drugs) is enormous. The primary challenge for automatically discovering new drugs from molecular screening information is the high dimensionality of datasets, where a wide range of features is included for each candidate drug. Thus, improved techniques that ensure adequate manipulation and interpretation of data become mandatory. To mitigate this problem, our tool (called D2-MCS) homogeneously splits the dataset features into several groups (feature subsets) and subsequently determines the most suitable classifier for each group. Finally, the tool determines the biological activity of each molecule by a voting scheme. The D2-MCS tool was tested on a standardized, high-quality dataset gathered from ChEMBL and outperformed well-known single classification models.


1. Introduction and motivation
Technological advances achieved during recent decades have allowed important findings to be obtained in several highly relevant disciplines such as (i) computer science (Internet (Cohen-Almagor, 2013) and mobile communications (Charlesworth, 2009)), (ii) biology (DNA sequencing (França, Carrilho, & Kist, 2002)), and (iii) biomedicine (such as Face2Gene (Radke, 2017)). More specifically, the high performance achieved by the latest communication and computer systems has turned computer science into one of the most important areas of knowledge due to its wide application in various areas, and in multidisciplinary projects in particular. A clear example of its relevance is the emergence and development of several interdisciplinary research areas such as bioinformatics (development of new methods and software tools to facilitate the interpretation of biological data) or cheminformatics (use of computer and informational techniques to improve decision making in the area of drug lead identification and optimization).
In fact, the high computational capabilities of computer systems together with the reduced price of storage systems allow advances in processing large amounts of information. In detail, they allow (i) efficiently manipulating huge amounts of information, (ii) applying techniques that previously went unused (due to their high computational requirements) and (iii) implementing new exploratory techniques for dealing with large amounts of information (Cao et al., 2018; H. Chen, Engkvist, Wang, Olivecrona, & Blaschke, 2018). Healthcare is one of the most favoured investment and research sectors due to the immense amount of information collected over time (such as diseases, vaccines, drugs or chemical substructures) and its impact on the wellbeing of our society as a whole. The distinct characteristics and structures of information related to the drug discovery domain (which are completely different from those used in other healthcare areas such as vaccines) together with the immense (and diverse) domain knowledge seriously hamper a straightforward manipulation of the information. This issue forced healthcare companies to intensify efforts and resources in the continuous development and improvement of specific database techniques.
On average, pharmaceutical companies invest approximately 18% (Morgan, Grootendorst, Lexchin, Cunningham, & Greyson, 2011) of their budget in research and development tasks, in order to reduce the time and resources needed to develop new drugs or improve existing ones. In fact, during the period 2015-2017, an average of 38.5 drugs were approved annually, which represents an increase of 47% when compared to the 2008-2013 period (Woodcock, 2017, 2018). Additionally, the market expansion in pharmerging countries and demographic trends in developed countries (with an ageing population) have positioned the pharmaceutical sector at the top of the most profitable industries worldwide. Recent studies (Aitken, 2016; Civaner, 2012) have predicted that the pharmaceutical market will reach nearly USD 1,485 billion by 2021, representing an increase in profits of between 14-17% when compared to revenues achieved during the period 2013-2017.
Nevertheless, the complexity and elevated cost of the stages involved in drug development and the approval process hamper the fast creation of new drugs (Adams & Brantner, 2006). One of the biggest challenges takes place during the first stage (preclinical research), where thousands of compounds are analyzed and combined in order to obtain new potential candidates for development as a medical treatment. Screening methods allow detecting the most promising molecules and reduce the effort wasted on testing futile compounds. As described in (DiMasi, Hansen, & Grabowski, 2003; Hefti, 2008), only 0.1% of the tested compounds achieved promising results according to properties required for a potential candidate to become a drug (i.e., bioactivity, toxicity levels or chemical interactions) and are suitable for further study. Consequently, the knowledge acquired by pharmaceutical laboratories from preclinical research work is highly unbalanced (low number of promising compounds and high number of useless compounds). The particular characteristics of this kind of information (high number of available chemical substructures, their distinct formatting representations and low rate of valid compounds) require the use of customized high-dimensional techniques in order to enhance data interpretation. To alleviate this problem, several researchers (Bajorath, 2002; Lipinski, Lombardo, Dominy, & Feeney, 2001) developed various techniques specially adapted to deal with the specifics of the drug discovery stages. Another line of research focuses on the development of efficient approaches for selecting the most promising subsets of potential drug candidates based on the predicted bioactivity of molecules and their diversity (Yevseyeva et al., 2019).
However, after a deep analysis of the state of the art in the pharmaceutical domain, we found a lack of high-performance decision-making and prediction techniques suitable for tackling the early pre-clinical stages of the drug discovery process.
The usage of simple Machine Learning (ML) classifiers for screening molecules (represented by the information about their chemical substructures) has been applied with quite good results during the last years. However, we believe that the usage of high-dimensional datasets that often include dependent features has had a significant impact on the performance of classifiers. In fact, the "curse of dimensionality" (Domingos, 2012;Wilcox, 1961;Zhai, Ong, & Tsang, 2014;Zhang, Golbraikh, Oloff, Kohn, & Tropsha, 2006) issue emerged as the complexity of finding linear (and even non-linear) transformations of input variables to assess the target class. Moreover, some classifiers (such as Naïve Bayes) require the independence of input variables. The usage of feature selection schemes could be an adequate form to address this issue. However, the elimination of features could lead to a loss of information. Keeping this in mind, we believe that a Multiple Classifier System (MCS) combining the outputs of several ML classifiers created by using different subsets of features included in the original dataset could improve the screening performance achieved by single classifiers. In this work, we introduce a proposal to create disjoint feature subsets from the original data source (feature-clusters) and maximize the independence of the attributes belonging to each concrete cluster. Hence, classifiers using the MCS would achieve interesting conditions to perform better: (i) lower dimensionality and (ii) independence of input attributes.
Using MCS (Chow, 1965; Woźniak, Graña, & Corchado, 2014) provides additional advantages. Concretely, they achieve better performance regardless of the amount of available data. Moreover, a combination of classifiers (ensemble of classifiers) tends to outperform individual classifiers, which entails a better probability of finding an optimal model. Finally, they allow exploiting parallel computing and computer clustering technologies for faster operation while taking advantage of the capabilities/properties provided by each individual classifier. Despite these interesting features of MCSs, to the best of our knowledge, they have not been applied to automatically select promising chemical substances and improve the drug discovery process. Keeping this idea in mind and guided by the importance of the preclinical research stage and the lack of techniques to address this problem, we decided to design and develop D2-MCS (Ruano-Ordás, 2018), a novel multiple-classifier system able to automatically determine the biological activity of a specific chemical compound based on its composition (i.e. chemical substructures and physicochemical descriptors). The scientific challenges for the creation of D2-MCS were (i) the choice of an effective but simple method to evaluate the independence of features, (ii) the identification of the number of feature clusters, (iii) training and tuning of classifiers and (iv) the combination of the outputs of classifiers included in the MCS.
While this section has presented the motivations of our work, the rest of the paper is structured as follows: Section 2 outlines the most-common ML techniques used for in-silico screening. Section 3 introduces the architectural design of our current biological activity detector software; Section 4 shows the experimental protocol carried out to demonstrate the suitability of our tool. Finally, Section 5 summarizes the main conclusions extracted from this work and outlines future research lines.

2. In-silico screening background
Pre-clinical studies are the first stage of the complex drug discovery process. The goal of these studies is to identify drug candidates that would be tested in humans (clinical trials) and may become approved drugs. Preclinical studies include a great amount of work and comprise all activities, from the identification of candidate molecules to tests of the drug in living cells and animals. Since conventional methods of identifying candidate molecules (screening) are expensive in terms of time and cost, it is of key importance to develop high-performance in-silico (computer-based) screening methods. The recent availability of 'big data' in cheminformatics makes data-science methods for finding structure-activity relationships (SARs) a highly auspicious direction for in-silico screening of molecular compounds.
In-silico screening (sometimes called virtual screening) has been addressed before with quite good results (Burbidge, Trotter, Buxton, & Holden, 2001; Lavecchia, 2015; Lee, Lee, & Kim, 2017). In (Lavecchia, 2015), a thorough review of the usage of different ML approaches for ligand-based and structure-based in-silico screening is introduced. Additionally, these works show the usage of some ML techniques including Support Vector Machines (used in (Burbidge et al., 2001)), decision trees (DT), ensemble methods (such as Adaboost or Random Forests used in (Lee et al., 2017)), Naïve Bayesian based approaches, K-Nearest Neighbor Methods (kNN) and Artificial Neural Networks (ANN, studied in (Burbidge et al., 2001)).
In spite of the great amount of research done in the area of in-silico screening, the opportunity for further development still exists. The performance of classifiers is widely hindered by the unbalanced nature of datasets (which contain only a small amount of active substances), the dimensionality of datasets (more than two thousand features) and the hidden dependences between the features of the datasets. Moreover, the excellent performance achieved by ensemble methods (especially Random Forests) suggests that using a combination of classifiers has a high potential to improve the achieved results.
Inspired by the above ideas, we designed D2-MCS, a novel screening method able to outperform simple classifiers used in previous works (Burbidge et al., 2001; Lavecchia, 2015; Lee et al., 2017). The next section contains a detailed description of our proposal from the perspectives of software design and method operation.
In short, the D2-MCS model aims to automatically predict the biological activity of a specific chemical compound through a deep analysis of its chemical substructures. D2-MCS was developed entirely in the R programming language, which has become a favorite language for data analysts and scientists all over the world (Gentleman, 1996; Voskoglou, 2017). This choice was mainly motivated by its: (i) ability to handle complex and large datasets, (ii) capability to easily program and execute complex simulations, and (iii) compatibility with high-performance computer clusters. Additionally, the experimental benchmarks reported in (Fernández-Delgado, Cernadas, Barro, & Amorim, 2014; Statnikov, Wang, & Aliferis, 2008; Tan & Gilbert, 2003) show the high performance achieved by the classification models provided by the R platform. As depicted in Figure 1, the first stage incorporates several feature-clustering functions f(x) able to adequately split the dataset attributes (features) into k groups. Given the wide variety of ways of representing and (or) encoding information, the use of customized data-oriented clustering methods is required. To this end, D2-MCS incorporates an interface able to automatically load user-defined feature-clustering techniques that can be easily developed using a simple inheritance scheme. By default, D2-MCS provides two simple feature-clustering techniques: (i) BinaryFisherClustering, able to deal just with binary features, and (ii) MultiTypeFisherClustering, capable of managing any type of feature (such as qualitative, discrete and continuous values). To accomplish this task, both methods compute the significance value of each binary feature (fb) by using Equation 1.
where binary(features) stands for the features having binary values and p-value(fisher.test(fb, class)) computes the significance value of each feature with respect to the class by executing Fisher's exact test (Pett, 2015). Since the null hypothesis for fisher.test is the independence of two variables, 1 − p-value can be used to assess the dependence between fb and the target attribute (class). Then, the ungrouped features are homogeneously placed in clusters according to the cluster global significance value. Equation 2 illustrates how this value is calculated for each cluster.
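As a rough illustration of the significance computation described above, the following Python sketch computes a two-sided Fisher exact test p-value for a binary feature against a binary class and returns 1 − p-value as the dependence score. D2-MCS itself is written in R and relies on fisher.test; the function names `fisher_exact_p` and `feature_significance` are hypothetical and serve only to illustrate the idea.

```python
from math import comb


def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


def feature_significance(feature, target):
    """Return 1 - p-value for a binary feature and a binary class (lists of 0/1)."""
    pairs = list(zip(feature, target))
    a = sum(1 for f, y in pairs if f == 1 and y == 1)  # present, active
    b = sum(1 for f, y in pairs if f == 1 and y == 0)  # present, inactive
    c = sum(1 for f, y in pairs if f == 0 and y == 1)  # absent, active
    d = sum(1 for f, y in pairs if f == 0 and y == 0)  # absent, inactive
    return 1.0 - fisher_exact_p(a, b, c, d)
```

A feature that perfectly tracks the class on a small sample still yields a score below 1, because the exact test accounts for the sample size.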
Finally, the enhanced feature-managing capabilities of the MultiTypeFisherClustering method allow the creation of an additional cluster composed of all the existing non-binary features. In contrast, BinaryFisherClustering cannot handle this type of feature and therefore ignores them throughout the following stages.

Stage 2: Train and Tune models
Once the features are successfully grouped into clusters, stage 2 (which involves the training and tuning of models) is automatically executed to determine the best models (and parameters) for each cluster. As can be seen from Figure 1, stage 2 is responsible for building a set of classification models (grouped into 12 different families) over each previously obtained feature cluster by using an objective function (called δ) to guide the model parameter-optimization process. This function was created to simplify the building of models according to the classification purpose (such as minimizing false negative (FN) or false positive (FP) errors). D2-MCS allows selecting one or several objective functions related to different performance metrics that are well known in the Machine Learning field (Coffin & Saltzman, 2000; García, Fernández, Luengo, & Herrera, 2010). Below, Table 1 shows a brief description of the available objective functions together with their associated performance measures.

Sensitivity
Refers to the ability to correctly identify positive values (e.g. detect patients with a disease).

Specificity
Specificity (Lalkhen & McCluskey, 2008) Computes the ability of the test to correctly identify negative values (e.g. identify patients without a disease).

Kappa
Cohen's Kappa Coefficient (Cohen, 1968; Thompson & Walter, 1988) Measures inter-rater agreement for qualitative items. It is a more robust measure than a simple percentage agreement calculation, as it takes into account the possibility of the agreement occurring by chance.

Accuracy
Accuracy (Makridakis, 1993) Assesses the proportion of true results (both true positives and true negatives) among the total number of cases examined.

MCC
Matthews Correlation Coefficient (Boughorbel, Jarray, & El-Anbari, 2017) The MCC is, in essence, a correlation coefficient between the observed and predicted binary classifications. Returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.

PPV
Positive Predictive Value (Bewick et al., 2004; Hajian-Tilaki, 2013) Indicates how often a positive test result represents a true positive.
However, the lack of a standardized way of representing the information, together with the large number of available performance metrics, requires the usage of specific data-oriented methodologies. To mitigate this problem, D2-MCS is equipped with the capability to automatically load new user-defined objective functions in order to build customized data-adapted classification models.
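To make the measures in Table 1 concrete, the sketch below computes each of them from the four confusion-matrix counts using their standard definitions, and shows how an objective function could be passed around as a plain callable. It is a Python illustration under our own naming (`confusion_counts`, `evaluate` and the argument order `tp, fp, tn, fn` are assumptions), not the interface D2-MCS exposes in R.

```python
from math import sqrt


def _ratio(numerator, denominator):
    # Return 0.0 instead of failing when a measure is undefined.
    return numerator / denominator if denominator else 0.0


def sensitivity(tp, fp, tn, fn):
    return _ratio(tp, tp + fn)


def specificity(tp, fp, tn, fn):
    return _ratio(tn, tn + fp)


def accuracy(tp, fp, tn, fn):
    return _ratio(tp + tn, tp + fp + tn + fn)


def ppv(tp, fp, tn, fn):
    return _ratio(tp, tp + fp)


def mcc(tp, fp, tn, fn):
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return _ratio(tp * tn - fp * fn, denominator)


def kappa(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    observed = _ratio(tp + tn, n)
    # Agreement expected by chance, from the row and column totals.
    expected = _ratio((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp), n * n)
    return _ratio(observed - expected, 1 - expected)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for two equally long label sequences."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def evaluate(objective, y_true, y_pred):
    """Apply any objective, built-in or user-defined, to a set of predictions."""
    return objective(*confusion_counts(y_true, y_pred))


# A user-defined objective is just another callable with the same signature.
def balanced_ppv_mcc(tp, fp, tn, fn):
    return 0.5 * ppv(tp, fp, tn, fn) + 0.5 * mcc(tp, fp, tn, fn)
```

Treating the objective as a callable is one simple way to let the tuning step stay unaware of which measure it is optimizing.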
After an objective function is selected, stage two automatically executes the classifier-creation process. To this end, we used the caret package for the R programming language (Kuhn, 2008). This package internally includes the implementation of different methods of hyper-parameter tuning during the training process. To ensure the convergence of each ML model, the tuning configuration (grid or random search) was selected according to the caret package recommendations. Additionally, during this stage, classifiers are built using a k-fold stratified cross-validation scheme (with k = 10) (Efron & Gong, 1983; Kohavi, 1995) over each previously obtained cluster (C1, C2, ..., Cn). D2-MCS provides the 33 most suitable classification models (extracted from the caret package) to handle high-dimensional datasets (Fernández-Delgado et al., 2014; Statnikov et al., 2008; Tan & Gilbert, 2003). Table 2 shows a brief description of each classification model together with its corresponding R package and model family.
[Table 2: classification models with their R package and model family. Only fragments survive: packages HDclassif (Berge, Bouveyron, & Girard, 2018), sparsediscrim (Ramey, 2017), klaR (Friedman, 1989), randomForest (Breiman, 2001) and RWeka; families Neural Networks and Rule-Based Models; models Random Forest, Rule-Based Model, Rule-Based Classifier and Conditional Inference Random Forest.]
Once the models are fitted with the best-guess hyper-parameters, stage 2 automatically selects the classification model achieving the best performance value (according to the objective function). Optionally, to give an overall perspective on the global behavior of all available classifiers, the performance achieved by each model can be plotted graphically.
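The tuning procedure described above (candidate hyper-parameters scored by stratified k-fold cross-validation, with the best configuration retained) can be sketched in a few lines of Python. This is only an illustration of the general technique: D2-MCS delegates this work to caret in R, and the toy model `fit_threshold`, its `shift` hyper-parameter and the helper names below are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, keeping class ratios similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def cross_validated_score(fit, params, objective, xs, ys, k=10):
    """Mean objective value of one configuration over k stratified folds."""
    scores = []
    for held_out in stratified_folds(ys, k):
        held = set(held_out)
        train = [i for i in range(len(ys)) if i not in held]
        predict = fit(params, [xs[i] for i in train], [ys[i] for i in train])
        predictions = [predict(xs[i]) for i in held_out]
        scores.append(objective([ys[i] for i in held_out], predictions))
    return sum(scores) / len(scores)


def grid_search(fit, grid, objective, xs, ys, k=10):
    """Return (best_params, best_score) over an explicit list of candidates."""
    scored = [(cross_validated_score(fit, p, objective, xs, ys, k), p) for p in grid]
    best_score, best_params = max(scored, key=lambda item: item[0])
    return best_params, best_score


def fit_threshold(params, xs, ys):
    # Toy model: decision boundary halfway between class means, plus a tunable shift.
    mean_pos = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    mean_neg = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    boundary = (mean_pos + mean_neg) / 2 + params["shift"]
    return lambda x: 1 if x >= boundary else 0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic one-dimensional data: 20 inactive (low values) and 20 active (high values).
xs = [0.1 * i for i in range(20)] + [3.0 + 0.1 * i for i in range(20)]
ys = [0] * 20 + [1] * 20
grid = [{"shift": s} for s in (-1.0, 0.0, 1.0)]
best_params, best_score = grid_search(fit_threshold, grid, accuracy, xs, ys, k=10)
```

In the full system the same loop would run once per feature cluster and per candidate model, and the model with the highest objective value would be kept for that cluster.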

Stage 3: Classification
Finally, in stage three, the previously selected classifiers (the best of each cluster) are used to perform the classification task. To this end, the individual results of each classification model are combined into a single result by applying a specific voting system. Our tool applies the majority voting system due to its adequate balance between performance, resource consumption, and computational speed (Dietterich, 2000; Ruta & Gabrys, 2005; van Erp, Vuurpijl, & Schomaker, 2002). In order to increase the flexibility of the application, the voting system is implemented as a callback function. This scheme allows users to easily test and execute their customized voting strategies.
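The callback design mentioned above can be illustrated with a short Python sketch in which the voting strategy is an ordinary function argument. The names `majority_vote`, `any_active` and `classify`, as well as the three toy classifiers, are hypothetical and do not reproduce the R interface of D2-MCS.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties resolve to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def any_active(votes):
    """Alternative strategy: a single 'Active' vote is enough."""
    return "Active" if "Active" in votes else "Inactive"


def classify(instance_views, classifiers, voting=majority_vote):
    """Combine one prediction per feature cluster with a pluggable voting callback.

    instance_views[i] is the instance restricted to the features of cluster i,
    and classifiers[i] is the model selected for that cluster.
    """
    votes = [clf(view) for clf, view in zip(classifiers, instance_views)]
    return voting(votes)


# Three toy per-cluster classifiers standing in for the selected models.
classifiers = [
    lambda view: "Active" if view[0] == 1 else "Inactive",
    lambda view: "Active" if sum(view) >= 2 else "Inactive",
    lambda view: "Active" if view[0] > 300.0 else "Inactive",
]
views = [[1], [1, 0, 0], [250.0]]

default_result = classify(views, classifiers)
custom_result = classify(views, classifiers, voting=any_active)
```

With an odd number of binary classifiers, as in the three-cluster configuration, majority voting cannot produce a tie.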

4. Model evaluation
In order to assess the effectiveness of the proposed model for determining the biological activity of molecules, we designed and executed a set of experiments on a dataset comprising 3925 chemical compounds represented by 2132 descriptors. In order to reduce the elevated cost of the drug discovery process, it is important to minimize the number of tests of invalid compounds (biologically inactive compounds). With the aim of minimizing this problem (reducing the number of FP errors), we performed the experimental protocol using MCC and PPV as objective functions. Finally, we implemented a benchmarking comparison of D2-MCS against the model achieving the best performance values among those included in Table 2. Our experimental setup together with the selected dataset is introduced in Section 4.1, while Section 4.2 presents and discusses the achieved results.

4.1. Experimental setup
To perform a straightforward and reproducible protocol, we used a standardized, high-quality dataset gathered from ChEMBL version 22 based on UniProt accession P34972 (Gaulton et al., 2012). Regarding activity data, potential duplicates were ignored, no activity or data validity comments were allowed, and only data from binding assays with a pChEMBL value were kept. This led to a dataset composed of 3925 chemical compounds (instances) represented using 2132 features. ... (Veber et al., 2002) or Molecular Weight (Tresadern et al., 2017)). Additionally, the set was transformed into a binary classification set where the activity cut-off was defined at a pChEMBL value > 7 (Lenselink et al., 2016) to ensure highly active compounds. Finally, each compound was written into a tab-delimited text file. The final set contained 1977 active compounds and 1948 inactive compounds. Table 3 shows the codification of each feature grouped by type. As can be seen from Table 3, each chemical substructure fingerprint is codified using binary notation to indicate its presence (1) or absence (0) for each specific chemical compound. Moreover, the physicochemical descriptors are represented with discrete or continuous values according to the descriptor type and metric representation.
To perform the experimental setup, the dataset was randomly divided into four equally distributed splits. Each split comprises 25% of the whole dataset and preserves the proportion of Active and Inactive compounds. Moreover, to avoid model overfitting and therefore ensure realistic classification results, each split is assigned to a specific stage of the D2-MCS process. Table 4 shows a brief description of the main characteristics of each split (such as number of compounds or class ratio) together with its relationship to each stage. As can be seen from Table 4, the union of the first two splits is used to accomplish stage one (perform the clustering of features). During this stage, the data-oriented MultiTypeFisherClustering is applied as the feature-clustering method due to its ability to handle features having different codifications (binary, continuous and discrete). Then, the second and third splits are used as input for both the model-building and hyper-parameter-optimization processes (stage two). The partial overlap of the data used for the first two stages was designed to reduce potential coupling problems while maximizing the information available to execute both stages. Finally, the last split is used as a test set to evaluate the performance of our proposal (and is not used for any other purpose, to ensure the significance of the results).
For comparison purposes, our proposal has been benchmarked against simple and ensemble classifiers. The same dataset organization has been used for this purpose. However, in this case, splits 1, 2 and 3 have been used for training and optimization, while split 4 is used for model evaluation.
From a general perspective and given the similarity of the drug discovery domain to a binary classification problem (i.e., determining the absence/presence of biological activity), there is a wide range of statistical methods available to assess the performance of the classification models (Kosinski, 2013). However, as mentioned in (Baldi, Brunak, Chauvin, Andersen, & Nielsen, 2000), the particular characteristics of this domain require the usage of adequate problem-oriented statistical methods in order to ensure a successful and realistic assessment of the achieved results. Following the suggestions of medicinal chemistry experts and authors (Boughorbel et al., 2017; Powers, 2011), we considered that the most adequate measures to evaluate the final performance of the previously constructed models are PPV and MCC, due to their usage as objective functions during the second stage.
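The split protocol described above can be sketched as follows. This is a minimal Python illustration, assuming a simple shuffled round-robin stratification; the function names `stratified_splits` and `assign_stages`, the stage labels and the fixed seed are our own choices, and only the class counts (1977 active, 1948 inactive) come from the text.

```python
import random


def stratified_splits(labels, n_splits=4, seed=0):
    """Randomly partition sample indices into n_splits with similar class ratios."""
    rng = random.Random(seed)
    splits = [[] for _ in range(n_splits)]
    for label in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class out to the splits in turn.
        for position, index in enumerate(indices):
            splits[position % n_splits].append(index)
    return splits


def assign_stages(splits):
    """Map the four splits to the stages: 1+2 clustering, 2+3 tuning, 4 testing."""
    s1, s2, s3, s4 = splits
    return {
        "stage1_feature_clustering": sorted(s1 + s2),
        "stage2_train_and_tune": sorted(s2 + s3),
        "stage3_test": sorted(s4),
    }


# Class sizes reported for the final dataset: 1977 active and 1948 inactive compounds.
labels = [1] * 1977 + [0] * 1948
splits = stratified_splits(labels)
stages = assign_stages(splits)
```

Keeping the fourth split out of both earlier stages is what makes the final evaluation an estimate on unseen data.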

4.2. Results and discussion
To validate and test the performance of our D2-MCS tool, we consider two different scenarios: (i) the MCC scenario, where the MCC coefficient is used as the objective function for building the classification model, and (ii) the PPV scenario, where the PPV measure is considered for the optimization of classifiers. Additionally, in order to demonstrate the suitability of D2-MCS, we also execute a performance benchmarking comparison of our proposal and the simple ML algorithm achieving the best performance.
As previously stated, the first stage of the D2-MCS operation uses the MultiTypeFisherClustering strategy due to its ability to handle multi-type features. It is important to highlight that obtaining adequate clustering homogeneity is mandatory to guarantee good classification performance. To this end, we executed MultiTypeFisherClustering to find a set of feature clusters (G) that minimizes the dispersion of the global significance. To reduce computational requirements, we limited the maximum number of clusters included in G to 50 and plotted the best dispersion achieved against the number of clusters included in G (Figure 2). As can be seen from Figure 2, the lowest dispersion is achieved when grouping the binary features into two clusters, whereas the usage of 41 clusters provided the worst dispersion results. Moreover, a deeper analysis of the results depicted in Figure 2 shows three particular aspects: (i) the high dependence between the dispersion and the number of cluster divisions, (ii) abrupt changes in the dispersion values for contiguous clustering configurations and (iii) the dispersion worsens as the number of clusters increases (and therefore the limit of 50 clusters for the configuration seems adequate). In view of the achieved results, the usage of two clusters is the best configuration to minimize the dispersion of the global significance between feature clusters. Following our method, an additional cluster is created to allocate the remaining features (continuous and discrete ones). Finally, non-binary features having constant values were ignored in the classification process because they carry no discriminative information.
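The search over cluster counts described above can be sketched in Python as follows. Since the exact definition of the cluster global significance is not reproduced here, the sketch makes three explicit assumptions that may differ from the method used by D2-MCS: the global significance of a cluster is the sum of its feature significances, the dispersion is the population standard deviation of those sums, and features are allocated greedily (largest value first, always to the currently lightest cluster). The function names are hypothetical.

```python
from statistics import pstdev


def allocate(significances, k):
    """Greedy balanced allocation: largest values first, each to the lightest cluster."""
    clusters = [[] for _ in range(k)]
    totals = [0.0] * k
    order = sorted(
        range(len(significances)), key=lambda i: significances[i], reverse=True
    )
    for feature in order:
        lightest = totals.index(min(totals))
        clusters[lightest].append(feature)
        totals[lightest] += significances[feature]
    return clusters, totals


def best_cluster_count(significances, max_clusters=50):
    """Return (k, dispersion, clusters) for the k with the lowest dispersion."""
    best = None
    upper = min(max_clusters, len(significances))
    for k in range(2, upper + 1):
        clusters, totals = allocate(significances, k)
        dispersion = pstdev(totals)  # assumed dispersion measure
        if best is None or dispersion < best[1]:
            best = (k, dispersion, clusters)
    return best


# Illustrative significance values (1 - p-value) for six binary features.
example = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
k, dispersion, clusters = best_cluster_count(example)
```

For this toy input the three-cluster configuration balances the sums almost perfectly, so it is the one returned; on real data the outcome depends entirely on the significance values.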
The second step (model building and hyper-parameter optimization) is executed for each of the three previously obtained clusters. Figures 3 and 4 show the performance results achieved for each optimized ML model using PPV and MCC measures as objective functions, respectively.
[Figure 3, surviving panel captions: b) Performance achieved for cluster 2 of 3 (binary features). c) Performance achieved for cluster 3 of 3 (non-binary features).]
As can be noted from the results plotted in Figure 3, the performance of each ML model is closely related to the features included in each cluster. Figure 3a reveals the strong classification performance achieved by the svmRadialWeights classifier (0.975), while AdaBag obtains the worst performance evaluation (0.634). With regard to the second feature cluster (see Figure 3b), the best model is AdaBag (0.978), whilst rpart1SE achieves the poorest evaluation (0.720). Finally, as shown in Figure 3c (third cluster), the svmRadialWeights and hdda models achieved the best (0.996) and worst (0.730) performance values, respectively. Table 5 summarizes the best ML models and hyper-parameter values for each feature cluster. Although svmRadialWeights achieves the highest performance results in two feature clusters, the optimized configurations computed for each of them are significantly different due to their intrinsic characteristics.
Figure 4 shows the best performance achieved by ML models using the MCC measure to optimize classifier parameters.
[Figure 4, surviving panel caption: c) Performance achieved for cluster 3 of 3 (binary features).]
A brief glance at the results included in Figure 4 reveals ranger as the model with the best performance for all clusters. Among the other models, naive_bayes (clusters one and two) and hdda (cluster three) achieved the worst evaluation results. Table 6 shows a brief summary of the best configuration obtained for ranger for each cluster. As in the previous scenario (PPV), although the ranger model achieves the best performance, the specific configuration needed to achieve it is quite different for each feature cluster.
During the third stage, configurations achieved in previous stages are benchmarked against simple and ensemble ML classifiers. As stated before, there are three different classifiers (with their particular outputs) for each scenario (PPV and MCC) and therefore, the final simple classification result for each instance is computed by executing a majority-voting system over the results achieved by each classification model. The primary outcome of the third stage of each scenario comprises the set of confusion matrices achieved that are included in Table 7. The confusion matrix brings together the number of different types of errors and hits including: (i) false positive errors (FP, inactive compounds classified as active); (ii) false negative errors (FN, undetected active compounds); (iii) true positive hits (TP, number of active compounds detected); and (iv) true negative hits (TN, number of inactive compounds correctly classified). As reflected from analysis of Table 7, the usage of PPV measure as objective function allows minimizing FP errors at expenses of penalizing the FN ones. The difficulty of finding active compounds in the drugs discovery domain forces the need of minimizing misclassification of potential drug candidates as inactive compounds (FP errors). On the other hand, the usage of MCC allows achieving a balanced value between FP and FN errors. In fact, MCC reduces the number of FN errors up to 3.5 times but increases the FP errors up to 18 times when compared to PPV measure. Taking this fact into account, this measure is suitable for usage in environments where available resources are limited (mainly personal and monetary) and the emergence of FP errors does not cause major problems.
To facilitate the understanding of results included in Table 7, Figure 5 presents a plot including the Accuracy, MCC and PPV measures obtained for each objective function, MCC and PPV, respectively. As can be observed from Figure 5, using MCC as objective function allows to achieve the best Accuracy evaluations due to its good performance when classifying inactive compounds and a good MCC assessment (0.8271 indicates a very strong positive relationship). Otherwise, using PPV as objective function enables achieving the highest PPV results (0.9937) and a quite good MCC evaluation (0.9327). In view of the obtained results, it is easy to highlight the importance of choosing a problem-oriented objective function to achieve the most suitable results.
With the aim of showing the performance gained during the classification stage, we included in Figure 6 a graphical plot highlighting the classification performance achieved for each objective function during both training and testing stages. As can be seen from Figure 6, the classification performance achieved after applying the majority-voting system (indicated as stripped lines) outperforms the results achieved during the training stage. In fact, despite using for testing purposes a dataset never used during previous stages, for MCC (see Figure 6a) classification performance was able to significantly improve best training result (achieved by cluster 2) up to 0.176, while classification performance achieved by PPV improved up to 0.016 the result obtained by cluster 2 (see Figure 6b).
Finally, we performed an experimental benchmark to assess the suitability of our model against the single best-performing ML classification model. To reproduce the conditions used when executing the D2-MCS experimental protocol, the classification models described in Table 2 were optimized (hyper-parameter configuration) using a straightforward 10-fold cross-validation strategy applied over the whole feature set of the first three dataset parts (splits 1, 2 and 3). Then, using the optimized configuration, the models were trained on all instances included in the same splits. From all models in each scenario, we selected the one achieving the best classification performance over the remaining dataset instances (split 4) and compared it with D2-MCS. Table 8 shows the performance results achieved by the best single classifier and by D2-MCS. To ease the comparison between models, Table 8 includes the performance achieved by both models during the optimization/training and testing stages (for the MCC and PPV scenarios). As can be seen from Table 8, D2-MCS outperforms the best classification model in each cluster. Because the training/optimization stage in D2-MCS provides three different performance values (one for each classifier), we computed the performance differential value (included in brackets) as the difference between the result achieved in the testing stage and the maximum value reached during the training/optimization stage. The positive performance difference achieved by D2-MCS (highlighted in green) indicates its ability to build a model that generalizes knowledge suitably. Conversely, although the single classification model achieves adequate performance during the training/optimization stage, its negative performance difference (highlighted in red) in both scenarios (using PPV and MCC as objective functions) shows a clear overfitting trend.
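The performance differential described above (testing result minus the maximum training/optimization result) can be written as a short Python sketch. The function name and all scores below are hypothetical values chosen for illustration; they are not taken from Table 8.

```python
def performance_differential(test_score, training_scores):
    """Testing result minus the best training/optimization result.

    A positive value suggests the model generalizes; a negative value
    suggests overfitting to the training data.
    """
    return test_score - max(training_scores)

# Hypothetical scores: three per-cluster training values for an ensemble,
# and a single training value for a standalone classifier.
ensemble_diff = performance_differential(0.80, [0.60, 0.65, 0.62])
single_diff = performance_differential(0.70, [0.90])

for name, diff in [("ensemble", ensemble_diff), ("single", single_diff)]:
    trend = "generalizes" if diff > 0 else "overfitting trend"
    print(f"{name}: {diff:+.3f} ({trend})")
```

Taking the maximum of the training values makes the comparison conservative for the ensemble: its testing result must beat its best individual classifier to obtain a positive differential.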

Conclusions and future work
This work presents D2-MCS, an MCS tool designed to automatically determine the biological activity of molecules based on 2048 chemical substructures (codified using binary values) and 84 physicochemical properties (codified using discrete and continuous values). To handle this high-dimensional dataset successfully, D2-MCS performs a three-stage classification process comprising: (i) feature clustering, (ii) building a classification model and optimizing its hyperparameters, and (iii) molecule classification. Additionally, we performed an experimental benchmark comprising two scenarios (using the PPV and MCC measures as objective functions) in order to measure the suitability of our D2-MCS tool. Finally, we performed a comparative analysis to assess the suitability of our model against the results achieved by simple and ensemble ML classifiers.
The results shown in Section 4 reveal that our proposed approach outperforms other single and ensemble ML classifiers (see Table 8). Moreover, the comparison of the results achieved during the training/optimization (2) and testing (3) stages also suggests that D2-MCS is able to generalize knowledge and avoid overfitting problems. Furthermore, although a single classifier achieves better performance during the training/optimization stage, its reduced classification performance during the testing stage shows a clear overfitting trend.
The promising results achieved by our D2-MCS tool rest on two key features: (i) splitting the dataset into groups of features (using feature clustering methods), which facilitates both the handling of the information and the classification training tasks (a divide-and-conquer strategy), and (ii) the incorporation of an objective function able to choose the most suitable classifier according to the problem to be addressed.
Finally, despite the performance achieved by our tool, we believe that further improvements are still necessary to strengthen D2-MCS. We use a majority-voting system to obtain the final decision concerning the biological activity of each molecule. We are aware that using evolutionary strategies (such as Genetic Algorithms) could increase the performance of the classification system by designing an intelligent weighting mechanism for each cluster (Friese, Bartz-Beielstein, & Emmerich, 2016). Furthermore, the dependence between features could be assessed using other feature evaluation methods employed for ranking in the context of feature selection (such as Information Gain or χ²) (Zheng, Wu, & Srihari, 2004). Moreover, the research and usage of domain-specific feature clustering methods should also be included in future work. In fact, the vast amount of domain-specific information in the datasets suggests that problem-oriented data management techniques are useful for both (i) facilitating information processing and (ii) building classification systems able to adapt to new knowledge with a minimum loss of accuracy. To this end, the development of new problem-oriented clustering methods should help to increase the classification performance of our proposed tool. Finally, we are aware of the applicability of D2-MCS in many other disciplines, such as the classification of content in general-purpose databases.
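As an illustration of the per-cluster weighting mechanism mentioned above as future work, the following Python sketch contrasts a weighted vote with the plain majority vote. It is only an assumption of how such a scheme might look: the function name and the weights are hypothetical, and in practice the weights would be learned (for instance, by a Genetic Algorithm) rather than set by hand.

```python
def weighted_vote(votes, weights):
    """Return the label with the largest total weight across clusters."""
    totals = {}
    for label, weight in zip(votes, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)

# Hypothetical votes of three per-cluster classifiers for one molecule.
votes = ["inactive", "inactive", "active"]

# Equal weights reproduce the plain majority vote.
equal_result = weighted_vote(votes, [1.0, 1.0, 1.0])

# Hand-set weights (hypothetical) that trust the third cluster the most.
tuned_result = weighted_vote(votes, [0.2, 0.2, 0.7])
```

With equal weights the two "inactive" votes win, whereas the hand-set weights let the single, more trusted cluster overturn the majority.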