ISCTE-IUL Repository

When conducting Discrete Discriminant Analysis, alternative models provide different levels of predictive accuracy, which has encouraged research in combined models. This research seems especially promising when small or moderate sized samples are considered, which often occurs in practice. In this work we evaluate the performance of a linear combination of two Discrete Discriminant Analysis models: the First-order Independence Model and the Dependence Trees Model. The proposed methodology also uses a Hierarchical Coupling Model when addressing multi-class classification problems, decomposing the multi-class problems into several bi-class problems using a binary tree structure. The analysis is based on both simulated and real data sets. Results of the proposed approach are compared with those obtained by Random Forests, being generally more accurate. Measures of precision regarding a training set, a test set and cross-validation are presented. The R software is used for the algorithms' implementation.


Introduction
Discrete Discriminant Analysis (DDA) is a multivariate data analysis technique that aims to classify multivariate observations of discrete variables into one of K a priori defined classes.
DDA has two main goals: 1. To identify the variables that best differentiate the K classes; 2. To assign objects whose class membership is unknown to one of the K classes, by means of a classification rule.
This work is focused on the second goal and we consider objects characterized by binary variables, in the bi-class and in the multi-class case. Note that for P binary variables there are S = 2^P possible states (i.e. S = 2^P possible observable vectors).
To derive the classification rule, based on the referred data, one should determine the posterior probability of an observation. Based on the Bayes formula, the posterior probability of an observation x* being assigned to one of the a priori known classes can be written as follows:

P(C_k | x*) = π_k f_k(x*) / Σ_{j=1}^{K} π_j f_j(x*), k = 1, ..., K,

where π_k represents the prior probability of class C_k and f_k(x*) represents the probability function of x* in the same class. By applying this rule, an observation x* is classified into the class with the maximum posterior probability, thus minimizing the assignment error. The prior probabilities π_k often have to be estimated using the sample at hand. When this sample is randomly selected from the population without taking into account the observations' class membership, maximum likelihood estimators are used: π̂_k = n_k / n, where n_k is the dimension of class C_k. Otherwise, if the sample considered is the union of K independent samples of size n_k, k = 1, ..., K, previously selected within each class C_k, equal prior probabilities are considered for all classes, π_k = 1/K. Usually, the states probability function in each class C_k is unknown and must be estimated using the sample observations X.
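As a minimal sketch of this classification rule, the posterior probabilities and the maximum a posteriori assignment can be written as follows. This is an illustration in Python, not the paper's R implementation; the toy probability functions `f1` and `f2` are hypothetical.

```python
def posterior(x, priors, class_pmfs):
    """Posterior P(C_k | x) via Bayes' rule:
    proportional to pi_k * f_k(x), normalized over the K classes."""
    scores = [pi * f(x) for pi, f in zip(priors, class_pmfs)]
    total = sum(scores)
    return [s / total for s in scores]

def classify(x, priors, class_pmfs):
    """Assign x to the class with maximum posterior probability."""
    post = posterior(x, priors, class_pmfs)
    return max(range(len(post)), key=lambda k: post[k])

# Toy example with K = 2 classes and hypothetical pmfs over one binary variable.
f1 = lambda x: 0.8 if x == (1,) else 0.2   # class C1 favours x = 1
f2 = lambda x: 0.3 if x == (1,) else 0.7   # class C2 favours x = 0
print(classify((1,), [0.5, 0.5], [f1, f2]))  # -> 0 (class C1)
```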
In DDA, the multinomial model is considered the most natural model, where the states probability functions are estimated by the corresponding sample relative frequencies. This is the so-called Full Multinomial Model (FMM), which demands a large number of parameters to be estimated (Goldstein and Dillon, 1978). To overcome this dimensionality problem, several variants of the FMM have been proposed. In this study, we work with two specific FMM variants: the First-order Independence Model (FOIM) (Goldstein and Dillon, 1978), which assumes that the P discrete variables are independent within each class C_k, and an alternative model that takes into account the dependence between variables, the Dependence Trees Model (DTM) (Celeux and Nakache, 1994).
In real classification problems, the classification errors resulting from different models differ and are often associated with different subjects. Therefore, researchers derive and compare several classification rules and resort to multiple models. The use of multiple models generally enhances the accuracy of the results. These models may originate from diverse subsamples drawn from an original data set: e.g. Breiman (1996) uses the bagging strategy and Friedman (2001) uses the boosting strategy for drawing the successive subsamples. As an alternative approach, when considering a fixed data set, multiple models may result from different parameterizations of a specific model type (e.g. a tree model with different numbers of levels), or diverse types of models may be considered. In this context, the analyst often selects the classification rule that provides the best classification accuracy. However, the selection of a single classification rule means a high loss of information from the previously estimated models, which could be very relevant for classification. In fact, the classification results may be provided by a combination of models, overcoming the referred loss of information and enhancing the stability and accuracy of the classification results, e.g. Friedman and Popescu (2008). Several combined methods can be found in the literature. Recently, Kotsiantis (2011), for example, proposed a combined model for classification: Random Subspace using Naïve Bayes (Domingos and Pazzani, 1997) and C4.5 (Quinlan, 1993). Based on 26 well known data sets (with continuous predictors), the author found the results of the proposed method encouraging. However, most studies (Kotsiantis (2011) reviews several) refer to Discriminant Analysis in general; DDA studies are rare. In the present work, we address DDA problems considering a simple linear combination of FOIM and DTM (Marques et al., 2013) and assess its performance in numerical experiments based on real and simulated data sets.
In order to deal with multi-class problems, the Hierarchical Coupling Model, which decomposes the original multi-class problem into several bi-class problems using a binary tree structure, is also considered (Sousa Ferreira et al., 2000). We compare the performance of the proposed combined model, a non-generative ensemble according to Re and Valentini (2011), with the performance of Random Forests (Breiman, 2001), a generative ensemble (according to the same authors) that generates sets of base learners acting on the structure of the data set to try to actively improve the diversity and accuracy of the base learners. According to Kotsiantis (2013, p. 278): "Random forests (Breiman, 2001) are one of the best performing methods for constructing Ensembles". In addition, Random Forests tend to perform better when dealing with discrete categorical features (Kotsiantis et al., 2006). The new DDA approach is presented in the second chapter, after introducing the models FOIM and DTM. In the third chapter, the performance of the new model is analyzed, based both on simulated and real data sets, with small and moderate sizes. Finally, conclusions are drawn and perspectives of future work are indicated.


Discrete Discriminant Analysis Models
The reference model in DDA is the Full Multinomial Model (FMM) (Goldstein and Dillon, 1978; Celeux and Nakache, 1994), where the within-class states probability functions are multinomial. However, for the case where we have P binary variables, this model involves the estimation of 2^P − 1 parameters in each class. Therefore this approach needs to rely on large samples, which can be very difficult to obtain in some application domains, such as health sciences and psychology. As previously referred, the FOIM model assumes the independence of variables within each class, therefore reducing the number of parameters to estimate. However, this model may be unrealistic in some situations. Among alternative models that take into account the interactions between variables, the Dependence Trees Model (DTM) can be considered (Celeux and Nakache, 1994).
These models, FOIM and DTM, are described next.

The First-order Independence Model
The First-order Independence Model (FOIM) (Goldstein and Dillon, 1978; Celeux and Nakache, 1994) is one of the most commonly used DDA models. It assumes that the P discrete variables are independent within each class C_k, reducing to P the number of parameters needed to be estimated for each class C_k. The conditional probability of assigning x* to class C_k is estimated by

f̂_k(x*) = ∏_{p=1}^{P} θ̂_kp^{x*_p} (1 − θ̂_kp)^{1 − x*_p},

where θ̂_kp = n_kp / n_k, n_kp being the number of observations in class C_k with x_p = 1 and n_k representing the C_k class sample dimension.
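A minimal sketch of the FOIM estimation and the resulting class-conditional probability function, assuming the relative-frequency estimates above (illustrative Python, not the paper's R code; `fit_foim` and `foim_pmf` are hypothetical names):

```python
def fit_foim(sample):
    """Estimate theta_kp = P(X_p = 1 | C_k) by the relative frequency
    of ones for each variable, within one class sample."""
    n_k = len(sample)
    P = len(sample[0])
    return [sum(x[p] for x in sample) / n_k for p in range(P)]

def foim_pmf(theta):
    """Class-conditional pmf under first-order independence:
    f_k(x) = prod_p theta_kp^{x_p} * (1 - theta_kp)^{1 - x_p}."""
    def f(x):
        prob = 1.0
        for xp, t in zip(x, theta):
            prob *= t if xp == 1 else (1.0 - t)
        return prob
    return f

sample = [(1, 1), (1, 0), (1, 1), (0, 1)]   # hypothetical class sample, P = 2
theta = fit_foim(sample)                     # [0.75, 0.75]
f = foim_pmf(theta)
print(f((1, 1)))                             # 0.75 * 0.75 = 0.5625
```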

The Dependence Trees Model
The Dependence Trees Model (DTM) (Celeux and Nakache, 1994; Pearl, 1988) takes into account conditional dependence relationships between the predictors. DTM provides, for each class, an estimate of the conditional probability functions based on the idea proposed by Pearl (1988). Pearl demonstrated that, through the knowledge of a graph G, where X_1, ..., X_P represent its P vertices, the probability distribution f_G associated with this graph can be calculated as the product of the conditional probabilities:

f_G(x) = f(x_r) ∏_{p ≠ r} f(x_p | x_l(p)),

where x_l(p) represents the variable that is linked to the variable x_p in this graph, one vertex x_r being arbitrarily chosen as the root of the graph. To construct the graph for each class, we rely on the algorithm of Chow and Liu (Celeux and Nakache, 1994; Pearl, 1988), where the length of each edge referring to the pair of variables (x_p, x_p′) represents a measure of the association between these variables, in particular their mutual information. Mutual information (I) is defined as follows:

I(x_p, x_p′) = Σ_{x_p} Σ_{x_p′} f(x_p, x_p′) log [ f(x_p, x_p′) / ( f(x_p) f(x_p′) ) ],

where f(x_p, x_p′) is estimated using the maximum-likelihood approach.
After the calculation of the C(P, 2) = P(P − 1)/2 mutual information values, the graph G with P − 1 edges corresponding to the highest total mutual information is selected. For example, take P = 5 variables: if the most important predictor relations are (X_2, X_1), (X_3, X_2), (X_4, X_2) and (X_5, X_2), then Figure 1 represents the corresponding dependence tree.

Figure 1: Example of a dependence tree for the case of P = 5 variables.

The probability distribution of this first-order dependence tree is

f̂_G(x) = f̂(x_1) f̂(x_2 | x_1) f̂(x_3 | x_2) f̂(x_4 | x_2) f̂(x_5 | x_2),

where the marginal and conditional probability functions are determined simply using the observed relative frequencies in sample X.
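The Chow and Liu construction above amounts to a maximum-weight spanning tree over the C(P, 2) mutual information values. A sketch of that step in Python (illustrative only, assuming empirical frequencies as described; a Kruskal-style greedy selection with a simple union-find stands in for whichever variant the original R implementation uses):

```python
import math
from itertools import combinations

def mutual_information(sample, p, q):
    """Empirical mutual information between binary variables p and q."""
    n = len(sample)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            fab = sum(1 for x in sample if x[p] == a and x[q] == b) / n
            fa = sum(1 for x in sample if x[p] == a) / n
            fb = sum(1 for x in sample if x[q] == b) / n
            if fab > 0:
                mi += fab * math.log(fab / (fa * fb))
    return mi

def chow_liu_edges(sample):
    """Greedy maximum-weight spanning tree over the C(P,2) MI values:
    keeps the P - 1 edges with the highest total mutual information."""
    P = len(sample[0])
    pairs = sorted(combinations(range(P), 2),
                   key=lambda e: mutual_information(sample, *e), reverse=True)
    parent = list(range(P))
    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i
    edges = []
    for p, q in pairs:
        rp, rq = find(p), find(q)
        if rp != rq:            # adding the edge keeps the graph acyclic
            parent[rp] = rq
            edges.append((p, q))
    return edges
```

For instance, on a sample where X_1 and X_2 are perfectly associated and X_3 is unrelated, the edge (X_1, X_2) is always retained.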

Combining Models
The idea of combining different models currently appears in an increasing number of papers, aiming to obtain more robust and stable models, e.g. Leblanc and Tibshirani (1996). The present study develops from the contribution of Sousa Ferreira (2004), which combines FMM and FOIM using a single coefficient β (0 ≤ β ≤ 1) to define a linear combination and explores several strategies to estimate this coefficient, including a regression approach using least squares minimization and likelihood maximization. This approach reveals good performances, with results intermediate between FOIM and FMM, in the small sample setting, particularly when data have independent structures in each class, or equal correlation structures. Using an integrated likelihood ratio approach, interesting results are also observed, particularly in the moderate or large sample settings and when data have different correlation structures in each class. However, in this FOIM-FMM combination, the coefficient derived often tends to heavily weight FOIM, while reducing substantially the contribution of FMM, even when considering smoothed frequencies. Based on this empirical conclusion, we consider the replacement of FMM, in the combination, by DTM. The corresponding conditional probability function is thus estimated as follows:

f̂_k(x*) = β f̂_k^FOIM(x*) + (1 − β) f̂_k^DTM(x*), 0 ≤ β ≤ 1.

The performance of this FOIM-DTM linear convex combination is the focus of the present paper. In addition, we consider the performance of the Hierarchical Coupling Model (Sousa Ferreira et al., 2000) integrating this specific combination.
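The convex combination is straightforward to express in code. A sketch (illustrative Python; the component estimates `f_a` and `f_b` are hypothetical constants standing in for fitted FOIM and DTM pmfs):

```python
def combined_pmf(f_foim, f_dtm, beta):
    """Convex combination of the two class-conditional estimates:
    f_k(x) = beta * f_FOIM(x) + (1 - beta) * f_DTM(x), 0 <= beta <= 1."""
    assert 0.0 <= beta <= 1.0
    return lambda x: beta * f_foim(x) + (1.0 - beta) * f_dtm(x)

# Hypothetical component estimates for one state x:
f_a = lambda x: 0.4   # FOIM estimate
f_b = lambda x: 0.2   # DTM estimate
f = combined_pmf(f_a, f_b, 0.5)
print(f(None))        # 0.5 * 0.4 + 0.5 * 0.2 = 0.3
```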

The Hierarchical Coupling Model
In the multi-class case, the Hierarchical Coupling Model (HIERM) (Sousa Ferreira et al., 2000) may be considered as an alternative to the simple FOIM-DTM convex combination. HIERM decomposes one multi-class problem into several bi-class problems using a binary tree structure and implements two decisions at each level of the tree: 1. Selection of the hierarchical coupling among the 2^(K−1) − 1 possible class couplings; 2. Choice of the model, or combined model, that gives the best classification rule for the chosen coupling.
In the beginning, we have K classes corresponding to the samples that we want to reorganize into two classes. We propose either to explore all the hierarchical coupling solutions or to select the two new classes that are the most separable. These classes can be selected using the affinity coefficient (Bacelar-Nicolau, 1985; Matusita, 1955).
For each bi-class problem, an intermediate position between the FOIM and DTM models may be considered. The process stops when a decomposition of classes leads to a single class. For example, with three a priori classes C_1, C_2 and C_3, the following couplings of classes can be considered: C_1 vs C_2 ∪ C_3; C_2 vs C_1 ∪ C_3; and C_3 vs C_1 ∪ C_2. Therefore, we can derive the classification rules in these three cases and select the one that yields the smallest misclassification error. Note that in this case (K = 3) we only have three tree configurations to consider, so it is possible to explore all the hierarchical coupling solutions (see Figure 2). E.g. in Tree (a), an observation is first classified into C_1 vs C_2 ∪ C_3 and, if it proceeds to the 2nd level, it is finally classified into C_2 or C_3, according to a minimum classification error criterion. However, when the number of classes is large (greater than three), the number of admissible tree configurations becomes larger and more difficult to handle. Then, a criterion to select the trees to consider is needed. In the present work we adopt a similarity coefficient based approach and select the best tree using the affinity coefficient referred to above (Sousa Ferreira, 2010).
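The 2^(K−1) − 1 count of candidate couplings at the root can be verified by enumeration. A sketch (illustrative Python, assuming distinct class labels; `bipartitions` is a hypothetical helper, not part of the paper's implementation):

```python
from itertools import combinations

def bipartitions(classes):
    """All 2^(K-1) - 1 ways of splitting K classes into two
    non-empty groups, each split counted once."""
    splits = []
    # Fix the first class in the left group to avoid counting a split twice.
    rest = classes[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = (classes[0],) + extra
            right = tuple(c for c in classes if c not in left)
            if right:
                splits.append((left, right))
    return splits

print(len(bipartitions(("C1", "C2", "C3"))))   # 2^(3-1) - 1 = 3
```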

Performance Measures
To evaluate the performance of a classification rule, according to a particular model, one relies on performance measures which derive from classification results as depicted in a confusion matrix -a contingency table that associates actual and predicted classes.
In the binary case, with a priori classes labeled 0 and 1, the confusion matrix is as follows:

                 Predicted 0   Predicted 1
    Actual 0          a             b
    Actual 1          c             d

where N_0 = a + b and N_1 = c + d.
In order to find the most appropriate measure of performance, several studies have been carried out (Goodman and Kruskal, 1954, 1959; Marzban, 1997; Murphy and Daan, 1985). In Discriminant Analysis the Total Success Rate (TSR) measure is commonly used. It is the average of the group-specific success rate estimates weighted by the classes' prior probabilities (McLachlan, 1992). When the group prior probabilities are estimated by the relative group sizes, this measure is called Efficiency (EFF):

EFF = (a + d) / (a + b + c + d).

The EFF measure is simply the proportion of observations correctly classified (based on the diagonal of the confusion matrix) and misses the use of the remaining available information in the confusion matrix. Since this information can benefit the evaluation of the performance of the proposed combined models, we should consider an additional evaluation measure. In fact, according to Paik (1998), the EFF measure may sometimes over-estimate the "true" success rate, particularly when the classes' sizes are disproportionate or the success rates within the classes are very different. Therefore we use an additional measure of performance in the present study, the Phi statistic (φ), or index of mean square contingency, based on all the data in the confusion matrix (Goodman and Kruskal, 1954):

φ = (ad − bc) / sqrt((a + b)(c + d)(a + c)(b + d)).
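Both measures can be computed directly from the four cells of the confusion matrix. A sketch (illustrative Python with a hypothetical confusion matrix; EFF as the diagonal proportion and φ as the standard mean square contingency index):

```python
import math

def eff(a, b, c, d):
    """Proportion of correctly classified observations
    (diagonal of the 2x2 confusion matrix)."""
    return (a + d) / (a + b + c + d)

def phi(a, b, c, d):
    """Phi statistic (index of mean square contingency),
    using all four cells of the confusion matrix."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

# Hypothetical confusion matrix: 40 + 45 correct out of 100.
print(eff(40, 10, 5, 45))           # 0.85
print(round(phi(40, 10, 5, 45), 3))
```

Note that a perfect classifier (b = c = 0) yields φ = 1, while EFF alone cannot distinguish how the errors are distributed across the two classes.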

Data Analysis and results
In the present work, we use the FOIM-DTM combination to solve DDA problems. In addition, when multiple classes are considered, we suggest using HIERM, also recurring to the FOIM-DTM combination to obtain intermediate classification results in each tree node. Regarding the combination coefficient β, we propose to use a grid of values β ∈ [0, 1], with increments of 0.1, to weight the contribution of each model. The Random Forest (RF) algorithm (Breiman, 2001) is used to provide a comparative performance evaluation of the proposed DDA approach. The implementation used is in the R package randomForest (Liaw and Wiener, 2013). For each RF we consider 500 trees, based on 500 bootstrap samples. Additionally, for each sample with replacement, we build P RF derived from subsets of features with 1 to P features. Finally, we combine all the RF into one large RF and consider the votes of 500 × P trees for classification. In order to evaluate the performance of the proposed models, we consider both real and simulated data sets.
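The grid search over β can be sketched as follows (illustrative Python, not the paper's R code; the scoring function `score` is a hypothetical stand-in for evaluating, say, EFF on a validation sample at each β):

```python
def best_beta(betas, evaluate):
    """Pick the combination coefficient from a grid of beta values
    by maximizing a performance measure supplied by the caller."""
    return max(betas, key=evaluate)

grid = [round(0.1 * i, 1) for i in range(11)]   # 0.0, 0.1, ..., 1.0
# Hypothetical performance curve peaking at beta = 0.3:
score = lambda b: 1.0 - (b - 0.3) ** 2
print(best_beta(grid, score))    # -> 0.3
```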

Simulated data
We conduct numerical experiments for simulated data using small and moderate sample sizes.
The data is simulated using the Bahadur model, as proposed in Goldstein and Dillon (1978) and in Celeux and Mkhadri (1992). The data sets considered derive from a previous study (Sousa Ferreira 2010; Sousa Ferreira et al. 2001).
In order to simulate the predictive binary variables' values, this model defines the class-conditional probabilities for C_k (k = 1, ..., K) as

f_k(x) = ∏_{p=1}^{P} θ_kp^{x_p} (1 − θ_kp)^{1 − x_p} [1 + Σ_{p<g} ρ_k(p, g) z_p z_g],

where X_kp is a Bernoulli variable with parameter θ_kp = E(X_kp), p = 1, ..., P, z_p = (x_p − θ_kp) / sqrt(θ_kp (1 − θ_kp)) is the corresponding standardized variable, and ρ_k(p, g) is the within-class correlation between X_kp and X_kg (second-order Bahadur expansion). We consider two types of population structures, with P = 6 variables, for the cases of K = 2 and K = 4 classes. For each structure, the generated data sets have 60 observations per class (small samples) or 200 observations per class (moderate samples). The first structure, denoted IND (Independent), is generated according to FOIM (ρ_k(p, p) = 1 and ρ_k(p, g) = 0 if p ≠ g, k = 1, ..., K; p, g = 1, ..., 6) for all classes. The second structure, denoted DIF, considers different correlation structures in each class.
The prior probabilities are considered equal.
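Under the IND structure, with all cross-correlations set to zero, the Bahadur model reduces to independent Bernoulli draws, which makes the simulation straightforward. A sketch (illustrative Python, not the paper's R code; the θ_kp values shown are hypothetical, not the ones used in the study):

```python
import random

def simulate_ind(theta_k, n, seed=0):
    """Draw n observations from the IND structure: with
    rho_k(p, g) = 0 for all p != g, the Bahadur model reduces to
    P independent Bernoulli variables with parameters theta_kp."""
    rng = random.Random(seed)
    return [tuple(1 if rng.random() < t else 0 for t in theta_k)
            for _ in range(n)]

# Hypothetical class parameters for P = 6 variables, small-sample setting.
data = simulate_ind([0.2, 0.4, 0.5, 0.6, 0.7, 0.8], n=60)
print(len(data), len(data[0]))   # 60 6
```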

Real data
We conduct numerical experiments on a very small real data set that refers to 34 dermatological patients with a diagnosis of psoriasis with chronic evolution (Prazeres, 1996). The relationship between three classes of patients with different degrees of Alexithymia (referring to difficulty in expressing emotions) and Rorschach test indicators (personality projective test indicators) is explored. Nowadays, alexithymia is considered a risk factor in the process of somatic and psychological illness. Since it is difficult to identify, due to the absence of obvious mental symptoms, contributions that help to support its identification are relevant. One of the most commonly used measures of alexithymia is the Toronto Alexithymia Scale. This test is a 20-item (5-point Likert) instrument. Its final score is the sum of the values assigned to the 20 items (Prazeres, 1996). According to the test scores, the whole sample is divided into three small classes: Non-alexithymics Class (C_1, n_1 = 14), Alexithymics Class (C_2, n_2 = 13), Intermediate Class (C_3, n_3 = 7).
In this study, the goal is to explore the differences between the classes, based on the fact that alexithymia manifestations often occur after the appearance of an organic disease which, given its emotional significance and seriousness, is often reflected in the Rorschach psychological test. This is a psychological test in which subjects' perceptions of inkblots are recorded and analyzed. It consists of a large number of variables measured on different scales, allowing us to characterize a person's personality and emotional functioning.
In the present study, the characterization of each patient is based on six binary indicators (predictor variables) of the Rorschach test (Exner, 2001):
• CF + C > 0 - Dichotomization of the variable CF + C based on an empirically established value. The value 1 was assigned when the condition is met and 0 otherwise. CF + C is the sum of chromatic color responses in which the formal element is secondary or absent. It indicates less affective modulation;
• (CF + C) − FC > 0 - Dichotomization of the variable (CF + C) − FC. The value 1 was assigned when the condition is met and 0 otherwise. A positive value of (CF + C) − FC indicates less affective modulation, where FC represents the number of chromatic color responses in which form features are of primary importance;
• V > 0 - In pure vista responses the shading features are interpreted as depth or dimensionality. No form is involved. The value 1 was assigned when the condition is met and 0 otherwise;
• C′ > 2 - In pure achromatic color responses the response is based on the grey, black or white features of the blot, when they are used as color. No form is involved. The value 1 was assigned when the condition is met and 0 otherwise;
• T = 1 - In pure texture responses the shading components of the blot are used to represent a tactual phenomenon, with no consideration of the form features. The value 1 was assigned when T = 1 and the value 0 otherwise;
• SumSH − SumC > 0 - Dichotomization of the variable SumSH − SumC, which compares the sum of shading responses plus the achromatic responses with the sum of chromatic color responses. The value 1 was assigned when the condition is met and 0 otherwise.
The variables involving the chromatic color, achromatic color and shading determinants (C, C', T, V) characterize the emotional functioning. An increase in T relates to emotional loss (e.g., marital separation). An increase in V relates to feelings of guilt or remorse. Y is related to situational stress. An increase in C' signifies the presence of disturbing negative feelings that result from an inhibition of emotional expression.
Chromatic color responses (FC, CF, C) are related to the release or discharge of emotion and to the extent to which the release is controlled or modulated. Chromatic color responses are expected to be higher than achromatic responses (FC′, C′F, C′). When SumC′ is greater than SumC, the individual is inhibiting the release of emotions and, as a result, is burdened by irritating feelings.
(CF + C) − FC offers information concerning the modulation of emotional discharges. The FC responses relate to well-controlled emotional experiences, whereas the CF and C responses relate to less restrained forms of emotional discharge. Adults without psychological problems are expected to yield higher FC than CF + C.
Since the data were not collected in a mixture model, we could not estimate prior probabilities using relative frequencies, so the prior probabilities are taken to be equal, π k = 1 K = 1 3 , k = 1, 2, 3.

Classification Results
The classification results concerning the simulated data sets are presented in Tables 2 to 7. The FOIM-DTM combination coefficient values (β values) appear in the tables' first column, along with the Random Forests combination results.
The EF F and φ measures reported refer to the training and test samples (for moderate sized samples) or to the training sample and two-fold cross-validation results (for small sized samples).

• Simulated Data Results
Results referring to bi-class problems are presented in Tables 2 and 3. Generally, in the multi-class case, the models' performance tends to be very poor when the HIERM approach is not considered. HIERM causes a sharp rise in the classification rates: see Tables 6 and 7 as opposed to Tables 4 and 5.
In general, in the numerical experiments conducted, the proposed approach outperforms Random Forests: it provides consistently better results when referring to small samples and, in conjunction with the HIERM approach for multi-class problems, it is clearly the winning classifier (see Table 9).

• Real Data Results
As in the simulated data results, the HIERM approach clearly improves classification results. The best result on the real data set is attained for β = 0.2 to 0.4 according to the Phi measure, illustrating the potential of the proposed combination approach to outperform the performances of the individual component models. Note that the best binary tree, corresponding to the most separable classes (see Figure 3), corresponds to the smallest affinity coefficient (aff(C_1, (C_2 ∪ C_3)) = 0.435). The first decomposition chosen by the HIERM model suggests that the union of the extreme classes forms a class well separated from the class composed of the intermediate patients, since these subjects obtained balanced scores. Since the data set is very sparse (2^6 = 64 states and only 17 observations), the HIERM model provides the lowest estimated misclassification risk.

Conclusions and Perspectives
In the present work we propose using a combination of two classification models, FOIM (First-order Independence Model) and DTM (Dependence Trees Model), to overcome the limitations of the individual models, namely in small and moderate sized sample settings. In addition, we propose using the HIERM (Hierarchical Coupling Model) approach to address multi-class problems, recurring to a binary tree decomposition scheme. We conduct an experimental study based on 8 simulated data sets and 1 real data set. We focus on small and moderately sized samples, which tend to increase the difficulty of classification problems. Since all features are categorical, we perform comparisons with a well known ensemble algorithm recognized to perform well in this setting (Kotsiantis et al., 2006): the Random Forests ensemble approach (Breiman, 2001). The results obtained are very encouraging: the performance of the proposed FOIM-DTM combined approach consistently exceeds the Random Forests performance on small data sets. When conjugated with the HIERM approach for multi-class problems, the proposed model outperforms Random Forests in 7 out of the 8 simulated data sets. In the real data set a very small sample is considered and, in this setting, the HIERM approach outperforms both the simple FOIM-DTM combination and Random Forests.
We conclude that the FOIM-DTM combination is very flexible, being able to deal with different data correlation structures. In the conditionally independent case (the IND structure for simulated data), FOIM naturally tends to yield the best results, but the FOIM-DTM combination sometimes emerges as a better alternative than FOIM, especially in the small sized sample cases. In the conditionally non-independent case (the DIF structure for simulated data), DTM naturally tends to prevail, although the FOIM-DTM combination sometimes emerges as a better alternative than DTM, namely in the moderate sized sample cases. For the two-class problems, the performance measures used generally agree as to the selection of the best solution. For the multi-class problems with the small sample sizes considered, the performance indicators may disagree. Understanding the disagreement between performance indicators should thus be the subject of future research. The benefits of the proposed approach should be further investigated using simulated data sets with diverse correlation structures, and considering imbalanced data sets too. Also, the use of more real data sets should further evidence the advantage of the proposed combined approach.