Enhancing the selection of a model-based clustering with external categorical variables

In cluster analysis, it can be useful to interpret the partition built from the data in the light of external categorical variables which are not directly involved in clustering the data. An approach is proposed in the model-based clustering context to select a number of clusters which both fits the data well and takes advantage of the potential illustrative ability of the external variables. This approach makes use of the integrated joint likelihood of the data and the partitions at hand, namely the model-based partition and the partitions associated with the external variables. It is noteworthy that each mixture model is fitted by the maximum likelihood methodology to the data alone: the external variables are used only to select a relevant mixture model. Numerical experiments illustrate the promising behaviour of the derived criterion.


Introduction
In model selection, assuming that the data arose from one of the models in competition is often somewhat unrealistic and could be misleading. However, this assumption is implicitly made when using standard model selection criteria such as AIC or BIC. This "true model" assumption could lead to overestimating the model complexity in practical situations. Moreover, a common feature of standard penalized likelihood criteria such as AIC and BIC is that they do not take the modelling purpose into account. Our opinion is that taking this purpose into account when selecting a model is worthwhile, as it leads to more flexible criteria favoring useful and parsimonious models. This point of view could be exploited in many statistical learning situations.
Whereas cluster analysis is an exploratory data analysis tool, any information on the objects to be clustered, available in addition to the clustering variables, could be very useful for a meaningful interpretation of the clusters. Here we address the case where this additional information is provided by external categorical illustrative variables. The purpose of this paper is to introduce a model selection criterion in the model-based clustering context that takes advantage of these illustrative variables. This criterion aims to select a classification of the data which achieves a good compromise: it is expected to provide a parsimonious and sensible clustering with a relevant interpretation with respect to the illustrative categorical variables. It is important to stress that we do not want the external variables to affect the classifications derived from the clustering variables: they are merely used to highlight some of them.
The paper is organised as follows. In Sect. 2, the framework of model-based clustering is described. Our new penalised likelihood criterion is presented in Sect. 3. Numerical experiments on simulated and real data sets are presented in Sect. 4 to illustrate the behavior of this criterion and highlight its possible interest. A short discussion section concludes the paper.

Model-based clustering
Model-based clustering consists of modelling the data to be classified by a mixture distribution and associating a class with each of the mixture components. Embedding cluster analysis in this precise framework is useful in many respects. In particular, it allows the number K of clusters (i.e. the number of mixture components) to be chosen rigorously.

Finite mixture models
Please refer to McLachlan and Peel (2000) for a comprehensive introduction to finite mixture models.
The data to be clustered y = (y_1, ..., y_n), with y_i ∈ R^d, are modelled as observations of iid random variables with a mixture distribution:

f(y_i; \theta) = \sum_{k=1}^{K} p_k \, \phi(y_i; a_k),   (1)

where the p_k's are the mixing proportions, \phi(\cdot\,; a_k) denotes the component probability density function (typically the d-dimensional Gaussian density) with parameter a_k, and \theta = (p_1, ..., p_{K-1}, a_1, ..., a_K). A mixture model can be regarded as a latent structure model involving unknown label data z = (z_1, ..., z_n), which are binary vectors with z_{ik} = 1 if and only if y_i arises from component k. These indicator vectors define a partition C = (C_1, ..., C_K) of the data y with C_k = {i : z_{ik} = 1}. However, the indicator vectors are not observed in a clustering problem: the model is usually fitted through the maximization of the likelihood of model (1), and an estimated partition is deduced from it by the MAP rule recalled below in (2). The parameter estimator, denoted from now on by \hat\theta, is generally derived from the EM algorithm (Dempster et al. 1977; McLachlan and Krishnan 2008).
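To make the notation concrete, here is a minimal base-R sketch evaluating the mixture density of (1) and the resulting log-likelihood for a two-component univariate Gaussian mixture; all the numerical values below are purely illustrative, not taken from the paper.

    ## Toy two-component univariate Gaussian mixture (illustrative values only)
    p     <- c(0.4, 0.6)          # mixing proportions p_k
    mu    <- c(0, 3)              # component means
    sigma <- c(1, 0.5)            # component standard deviations
    y     <- c(-0.2, 2.8, 1.5)    # toy observations

    ## f(y_i; theta) = sum_k p_k * phi(y_i; a_k), cf. (1)
    mixture_density <- function(yi) sum(p * dnorm(yi, mean = mu, sd = sigma))
    loglik <- sum(log(sapply(y, mixture_density)))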
There are usually several models to choose among (typically, when the number of components is unknown). Note that a mixture model m is characterized not only by the number of components K, but also by assumptions on the proportions and the component variance matrices (see for instance Celeux and Govaert 1995). The corresponding parameter space is denoted by \Theta_m. From a density estimation perspective, a classical way of choosing a mixture model is to select the model maximising the integrated likelihood

f(y; m) = \int_{\Theta_m} f(y; m, \theta_m) \, \pi(\theta_m) \, d\theta_m,

\pi(\theta_m) being a weakly informative prior distribution on \theta_m. For n large enough, it can be approximated with the BIC criterion (Schwarz 1978)

\mathrm{BIC}(m) = \log f(y; m, \hat\theta_m) - \frac{\nu_m}{2} \log n,

with \nu_m the number of free parameters in the mixture model m and \hat\theta_m the maximum likelihood estimator of \theta_m. Numerical experiments (see for instance Roeder and Wasserman 1997) and theoretical results (see Keribin 2000) show that BIC works well at selecting the true number of components when the data actually arise from one of the mixture models in competition.

Choosing K from the clustering view point
In the model-based clustering context, an alternative to the BIC criterion is the ICL criterion (Biernacki et al. 2000), which aims at maximising the integrated likelihood of the complete data (y, z),

f(y, z; m) = \int_{\Theta_m} f(y, z; m, \theta_m) \, \pi(\theta_m) \, d\theta_m.

Indeed the latter can be approximated with a BIC-like approximation:

\log f(y, z; m) \approx \log f(y, z; m, \hat\theta^*_m) - \frac{\nu_m}{2} \log n, \quad \hat\theta^*_m = \arg\max_{\theta_m} f(y, z; m, \theta_m).

But z and \hat\theta^*_m are unknown. Arguing that \hat\theta_m \approx \hat\theta^*_m if the mixture components are well separated for n large enough, Biernacki et al. (2000) replace \hat\theta^*_m by \hat\theta_m and the missing data z with \hat z = \mathrm{MAP}(\hat\theta_m) defined by

\hat z_{ik} = 1 \text{ if } k = \arg\max_{\ell} \tau_{i\ell}(\hat\theta_m), \text{ and } 0 \text{ otherwise},   (2)

where \tau_{ik}(\theta_m) denotes the conditional probability of the kth mixture component given y_i:

\tau_{ik}(\theta_m) = \frac{p_k \, \phi(y_i; a_k)}{\sum_{\ell=1}^{K} p_\ell \, \phi(y_i; a_\ell)}.   (3)

The ICL criterion (McLachlan and Peel 2000) is then obtained by replacing the estimated labels \hat z_{ik} by their respective conditional expectations \tau_{ik}(\hat\theta_m). Denoting by

\mathrm{Ent}(m) = - \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{ik}(\hat\theta_m) \log \tau_{ik}(\hat\theta_m)

the estimated mean entropy, we get that ICL is the criterion BIC decreased by Ent(m):

\mathrm{ICL}(m) = \mathrm{BIC}(m) - \mathrm{Ent}(m).   (4)

Because of this additional entropy term, ICL favors models which lead to partitioning the data with the greatest evidence. The derivation and approximations leading to ICL are questioned in Baudry (2009, Chapter 4) and in Baudry (2012). In practice, ICL appears to provide a stable and reliable estimate of the number of mixture components for real data sets, and also for simulated data sets, from the clustering viewpoint. ICL, which does not aim at discovering the true number of mixture components, can underestimate the number of components for simulated data arising from mixtures with poorly separated components (Biernacki et al. 2000). It focuses on selecting a relevant number of clusters, for which a clustering with low uncertainty can be provided.
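As an illustration of these quantities, here is a small base-R sketch computing the MAP partition (2), the entropy term and the resulting ICL value; the matrix tau and the values of loglik and nu_m are toy placeholders standing in for the output of a fitted model.

    ## Toy n x K matrix of conditional probabilities tau_ik(theta-hat)
    tau <- matrix(c(0.9, 0.1,
                    0.2, 0.8,
                    0.6, 0.4), ncol = 2, byrow = TRUE)
    n <- nrow(tau)

    z_hat <- apply(tau, 1, which.max)   # MAP rule (2)
    ent   <- -sum(tau * log(tau))       # estimated mean entropy Ent(m)

    loglik <- -250; nu_m <- 5           # placeholder values from a fitted model
    bic <- loglik - nu_m / 2 * log(n)   # BIC approximation
    icl <- bic - ent                    # ICL(m) = BIC(m) - Ent(m), cf. (4)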

Selecting a clustering with the support of illustrative variables
Now suppose that, besides y, known classifications u^1, ..., u^r (e.g. associated with extra illustrative categorical variables with U_1, ..., U_r levels) are available.
Let us introduce an example of such a situation, which is fully presented and analysed in Sect. 4.3. The Wholesale data set (Bache and Lichman 2013) refers to customers of a wholesale distributor in Portugal. The quantitative data y consist of the annual spending of each customer on different product categories. In addition, each customer's Channel (either Retail or Horeca: Hotel/Restaurant/Café) and Region (Lisbon, Oporto or Other) are available and are regarded as the illustrative categorical variables. The objective of the study is to segment the customers and define profiles based on their spending, and it is of interest to be able to relate these profiles to the Channel and the Region.
A possible approach is to insert the external variables into the model. See for instance Hennig and Liao (2013), which makes use of a latent class mixture model assuming conditional independence between variables. An important problem when dealing with continuous and categorical data in the same exercise lies in the balance of both types of variables: which weights should the categorical and the quantitative variables receive? This question is not easy to answer since these variables do not shed the same light on the objects to be analysed; they carry information of a different nature and are difficult to compare. It could be more beneficial and safer to consider the categorical variables as illustrative variables, especially when the quantitative variables are more numerous than the categorical variables. Thus, they could be used to assess clusterings derived from the continuous variables (see for instance Gordon 1999, p. 185). The Supported Integrated Completed Likelihood (SICL) criterion that is now presented aims at helping the user in this perspective.
For the sake of simplicity, let us first consider the situation where there is only one illustrative variable u. The general aim is to build a classification z based on y in a situation where relating the classifications z and u could be of interest to get a suggestive and simple interpretation of z. Therefore, we propose to build the classification z in each model based on y only, but to involve u in the model selection step, particularly for the choice of the number of components. Hopefully u might highlight some of the solutions among which y alone would not enable a clear decision. This might help to select a model providing a good compromise between the mixture model fit to the data and its ability to lead to a classification of the observations well related to the external classification u. To derive our heuristics, y and u are supposed to be conditionally independent given z, which means that all the relevant information shared by u and y can be caught by z. This is for example true in the particular case where u can be written as a function of z: u is then a reduction of the information included in z, and we hope to be able to retrieve more information from u using the (conditionally independent) information brought by y.
Our heuristics is as follows. It is based on the intent to find the mixture model maximizing the integrated completed likelihood

f(y, u, z; m) = \int_{\Theta_m} f(y, u, z; m, \theta_m) \, \pi(\theta_m) \, d\theta_m.

Assuming that y and u are conditionally independent knowing z, which should hold at least for models with enough components, it can be written for any \theta_m \in \Theta_m:

f(y, u, z; m, \theta_m) = f(u \mid z) \, f(y, z; m, \theta_m).   (5)

Let us denote by (n_{k\ell}) the contingency table relating u and z, with n_{k\ell} = \mathrm{card}\{i : z_{ik} = 1 \text{ and } u_i = \ell\}, and n_{k\cdot} = \sum_{\ell=1}^{U} n_{k\ell}. Estimating the conditional distribution of u given z by the corresponding multinomial maximum likelihood estimator, we get

\log f(u \mid z) = \sum_{\ell=1}^{U} \sum_{k=1}^{K} n_{k\ell} \log \frac{n_{k\ell}}{n_{k\cdot}},   (6)

and thus, from (5) and (6),

\log f(y, u, z; m) \approx \sum_{\ell=1}^{U} \sum_{k=1}^{K} n_{k\ell} \log \frac{n_{k\ell}}{n_{k\cdot}} + \log \int_{\Theta_m} f(y, z; m, \theta_m) \, \pi(\theta_m) \, d\theta_m.

Now, \log \int f(y, z; m, \theta_m) \pi(\theta_m) \, d\theta_m can be approximated by ICL as in (4). Finally, this leads to the SICL criterion

\mathrm{SICL}(m) = \mathrm{ICL}(m) + \sum_{\ell=1}^{U} \sum_{k=1}^{K} n_{k\ell} \log \frac{n_{k\ell}}{n_{k\cdot}},

where the n_{k\ell} are computed from the MAP partition \hat z. The last additional term \sum_{\ell=1}^{U} \sum_{k=1}^{K} n_{k\ell} \log (n_{k\ell} / n_{k\cdot}) quantifies the strength of the link between the categorical variable u and z. It can be regarded as minus the weighted mean entropy of u given z. This might eventually be helpful for the interpretation of the classification z.
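In practice, the additional term is easily computed from the MAP partition and u. A base-R sketch with toy data follows; the numerical value used for ICL(m) is a placeholder.

    ## Toy MAP partition and external categorical variable (same length n)
    z_hat <- c(1, 1, 2, 2, 2, 3)
    u     <- factor(c("a", "a", "b", "b", "a", "b"))

    n_kl <- table(z_hat, u)             # contingency table (n_kl)
    n_k  <- rowSums(n_kl)               # n_k.
    link <- sum(ifelse(n_kl > 0, n_kl * log(n_kl / n_k), 0))  # 0 log 0 = 0

    icl  <- -260                        # placeholder ICL value
    sicl <- icl + link                  # SICL(m) = ICL(m) + link term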
Taking several external variables into account. The same kind of derivation enables us to derive a criterion that takes several external variables u^1, ..., u^r into account. Suppose that y, u^1, ..., u^r are conditionally independent knowing z. Then (5) becomes

f(y, u^1, ..., u^r, z; m, \theta_m) = f(y, z; m, \theta_m) \prod_{j=1}^{r} f(u^j \mid z),

with \hat\theta^*_m = \arg\max_{\theta_m} f(y, u^1, ..., u^r, z; m, \theta_m). As before, we assume that \hat\theta_m \approx \hat\theta^*_m and apply the BIC-like approximation. Finally, and as before, \log f(u^j \mid z) is derived from the contingency table (n^j_{k\ell}) relating the categorical variable u^j and z: for any k \in \{1, ..., K\} and \ell \in \{1, ..., U_j\}, U_j being the number of levels of the variable u^j,

n^j_{k\ell} = \mathrm{card}\{ i : z_{ik} = 1 \text{ and } u^j_i = \ell \}.

Finally, with n_{k\cdot} = \sum_{\ell=1}^{U_j} n^j_{k\ell}, which does not depend on j, we get the "multiple" external variables criterion:

\mathrm{SICL}(m) = \mathrm{ICL}(m) + \sum_{j=1}^{r} \sum_{\ell=1}^{U_j} \sum_{k=1}^{K} n^j_{k\ell} \log \frac{n^j_{k\ell}}{n_{k\cdot}}.
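A direct generalisation of the previous sketch in base R (toy data again; the helper function name sicl_link is ours, not part of any package):

    ## Link term for one external variable
    sicl_link <- function(z_hat, u) {
      n_kl <- table(z_hat, u)
      n_k  <- rowSums(n_kl)
      sum(ifelse(n_kl > 0, n_kl * log(n_kl / n_k), 0))
    }

    ## Several external variables: the link terms simply add up
    u_list <- list(u1 = factor(c("a", "a", "b", "b", "a", "b")),
                   u2 = factor(c("x", "y", "x", "y", "y", "y")))
    z_hat  <- c(1, 1, 2, 2, 2, 3)
    icl    <- -260                      # placeholder ICL value
    sicl   <- icl + sum(sapply(u_list, function(u) sicl_link(z_hat, u)))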

Numerical experiments
All the numerical experiments rely on the RmixmodSICL package, which we make available and which itself relies heavily on the Rmixmod package for R (Lebret et al. 2012).
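For reference, a typical call looks as follows. This is a hedged sketch relying only on the standard Rmixmod interface, not on the RmixmodSICL-specific functions, and the data object y is assumed to be a data frame of quantitative variables.

    library(Rmixmod)

    ## Fit Gaussian mixtures for K = 1, ..., 10 and retrieve BIC and ICL;
    ## the SICL link term can then be added from the MAP partition.
    out   <- mixmodCluster(y, nbCluster = 1:10, criterion = c("BIC", "ICL"))
    z_hat <- out@bestResult@partition   # MAP partition of the retained model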
We first present an illustration of the behaviour of SICL on the Iris and the Crabs data sets. The behaviour of SICL is then analysed in various situations by numerical experiments on simulated data sets. Finally an application on the real data set Wholesale is presented.

Illustrative numerical experiments
This section is not devoted to emphasising the qualities of SICL, but rather to indicating typical behaviours of this criterion. In particular, it illustrates that this approach can be regarded as a way to evaluate the agreement between a true classification and the clustering induced by the internal variables.

Iris
The first example is an application to the Iris data set (Fisher 1936), which consists of 150 observations of four measurements (y) for three species of Iris (u). Two of the first three principal components for these data are plotted in Fig. 1.

Simulated numerical experiments
All the parameters of the distributions from which the data sets in this section are drawn are detailed in the Appendix.

Random or True Label Experiments
We simulated 200 observations from a Gaussian mixture in R^2 depicted in Fig. 3. In the first case (Fig. 4), which we call the True Label experiment, the variable u corresponds exactly to the Gaussian mixture component z* from which each observation actually arises. In the second case (Fig. 5), which we call the Random Label experiment, the variable u is not z*, as is apparent by comparing Figs. 3 and 5: the variable u for the right-hand side components is simulated from a Bernoulli distribution with probability 1/2. In the first case, u is a function of z*, while in the second case, u is not a function of z* but it is still conditionally independent of y given z*. Thus the conditional independence assumption can be expected to hold in models with at least four components for the True Label experiment, but not in models with fewer components. For the Random Label experiment, the minimal number of components for the assumption to hold is three.

Diagonal mixture models (i.e. with diagonal variance matrices: [p_k L_k B_k] in Celeux and Govaert 1995) with numbers of components K ranging from 1 to 10 were fitted. The criteria AIC, BIC, ICL and SICL as functions of K for one simulation are plotted in Fig. 6. We repeated this experiment with 100 different simulated data sets (cf. Table 4). BIC gets trapped in this difficult situation (overlapping Gaussian components with a reasonable number of observations): it mostly misses the true number of components (four) and selects three instead. ICL almost always selects three clusters, as expected. But SICL gives quite a different answer depending on the choice of u: when u is conditionally "random", it behaves like ICL. This illustrates that it does not necessarily push the selected number of clusters toward the number of classes of u: it depends on the quality of the relation between u and the classification obtained by the model-based clustering study with the number of components at hand. Indeed SICL mostly selects four clusters in the True Label situation where u = z*: in this case the classification obtained with the four-component model is mostly well related to u, since this is the true model. However, let us notice that the SICL answers are quite spread in this case. This is linked to the difficulty of the optimisation involved in fitting the model in this situation: the "true" distribution is mostly missed by the approximation of the MLE with the EM algorithm, even in the four-component model. Actually, it can be checked in the numerical results that SICL mostly selects four clusters when the "true" distribution is well estimated, and other numbers of clusters when it is not (cf. Figs. 7, 8 for examples).
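For concreteness, the two label mechanisms can be sketched as follows in R; the mixture parameters below are placeholders, not those of Table 9.

    set.seed(1)
    n <- 200
    ## Placeholder four-component Gaussian mixture in R^2 (cf. Table 9)
    means  <- rbind(c(0, 0), c(0, 3), c(5, 0), c(5, 3))
    z_star <- sample(1:4, n, replace = TRUE)
    y      <- means[z_star, ] + matrix(rnorm(2 * n), n, 2)

    ## True Label experiment: u is exactly the mixture component
    u_true <- z_star

    ## Random Label experiment: u is drawn at random (Bernoulli 1/2) for the
    ## right-hand components, so u is conditionally independent of y given
    ## z_star without being a function of z_star
    u_rand <- ifelse(z_star <= 2, z_star, 3 + rbinom(n, 1, 0.5))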

Mixture of Mixtures Experiment
In this experiment SICL gives a relevant solution different from the solutions selected with BIC or ICL. We simulated 600 observations from a diagonal three-component Gaussian mixture depicted in Fig. 9 (left), where the classes of u are in red and in black. (The red class consists of two "horizontal" Gaussian components while the black one consists of a single "vertical" Gaussian component.) The most general Gaussian mixture models with numbers of components K ranging from 1 to 10 were fitted; the criteria as functions of K for one simulation are plotted in Fig. 9 (right). We repeated this experiment with 100 different simulated data sets: from Table 5, BIC almost always recovers the three Gaussian components of the simulated mixture, ICL mostly selects one cluster, and SICL selects two classes well related to u.

About the conditional independence assumption

The heuristics leading to SICL assumes that u and y are conditionally independent given z (see Sect. 3). This assumption is questionable. This experiment, which we call Conditionally Dependent Label, aims at studying the behaviour of SICL when this assumption can be regarded as inappropriate, even in the models with relevant numbers of components. We consider a two-component diagonal Gaussian mixture and a two-class partition u "orthogonal" to this mixture. In Fig. 10 (left), the classes according to u are in red and in black: u = 1_{y>0}. Diagonal mixture models with free volumes and proportions but fixed shapes and numbers of components K ranging from 1 to 10 are fitted ([p_k L_k B] in Celeux and Govaert 1995). The criteria AIC, BIC, ICL and SICL as functions of K for one simulation are plotted in Fig. 10 (right). We repeated this experiment with 100 different simulated data sets. As expected in this rather easy setting, BIC and ICL almost always select two clusters (see Table 6). SICL selects the two-component solution less systematically and also gives some weight to the four-component solution. Actually, the conditional independence assumption does not hold in the two-component model, because this model is not fitted so that z corresponds to u here, but it roughly holds in the four-component model, for example, when it is fitted as in Fig. 11. The dispersion of the numbers of clusters selected with SICL (Table 6) illustrates that, when the conditional independence does not hold for the relevant numbers of clusters, SICL should have a tendency to select a higher number of clusters, for which the conditional independence assumption holds and for which the estimated z is well related to u.
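The Conditionally Dependent Label construction can be sketched as follows in R; the parameters are placeholders, not those of Table 11.

    set.seed(2)
    n <- 200
    ## Placeholder two-component Gaussian mixture in R^2, separated along
    ## the second coordinate (cf. Table 11)
    z_star <- sample(1:2, n, replace = TRUE)
    y      <- cbind(rnorm(n, mean = 0, sd = 2),
                    rnorm(n, mean = ifelse(z_star == 1, -2, 2)))

    ## u depends on y directly, so u and y are NOT conditionally
    ## independent given z_star (here u = 1_{y_1 > 0})
    u <- as.integer(y[, 1] > 0)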

Real data set: wholesale customers
As announced in Sect. 3, the segmentation of 440 customers of a wholesale distributor, described by six continuous variables and two categorical variables, is performed to illustrate the performance of the SICL criterion. The continuous variables are fresh products, milk products, grocery, frozen products, detergents and paper products, and delicatessen. The two categorical variables u^1 and u^2 (see Sect. 3) are summarised in Tables 7 and 8. A simple description of the continuous variables (not reported here) clearly shows that it is highly preferable to consider a log transformation to model them with a Gaussian mixture.
The clusterings considered below are represented in Fig. 12 on the first PCA plane which is easily interpreted with respect to the original variables (see Fig. 13): the first axis is a "Grocery products" axis and is discriminant for the Channel variable while the second axis is a "Food products" axis.
As expressed in Sect. 3, it is possible to gather the log-transformed quantitative and categorical variables in a composite mixture model with a conditional independence assumption for all the variables. This means, in particular, that the variance matrices of the quantitative variables are assumed to be diagonal. This model has been recently added to the Rmixmod package. Running Rmixmod in this configuration leads to a seven-component mixture with BIC and to a five-component mixture with ICL. See Fig. 12.
We now turn to a Gaussian mixture on the log-transformed quantitative variables using the default model in Rmixmod, the so-called [p_k L_k C] model (Celeux and Govaert 1995), namely a Gaussian mixture with free proportions and variance matrices with free determinants but a common orientation and shape. Here the component variance matrices are not restricted to be diagonal. In this case, as shown in Fig. 13, BIC selects a five-component mixture, ICL a two-component mixture and SICL a four-component mixture. The solutions selected by BIC and ICL are sensible from their respective points of view. The two-cluster solution of ICL is quite parsimonious, but it is not related at all to the two external categorical variables; it is for instance almost orthogonal to the Channel variable. On the contrary, the four-cluster solution selected with SICL is easily interpreted with respect to the two levels of the Channel variable. See Fig. 12.
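A hedged sketch of this analysis with Rmixmod follows; the file name and the column names of the Wholesale data are assumptions taken from the UCI repository description, not from the paper.

    library(Rmixmod)

    ## Wholesale customers data (Bache and Lichman 2013); file name and
    ## column names assumed from the UCI repository
    wholesale <- read.csv("Wholesale_customers_data.csv")
    y_log <- log(wholesale[, c("Fresh", "Milk", "Grocery", "Frozen",
                               "Detergents_Paper", "Delicassen")])

    ## Gaussian mixture [pk Lk C]: free proportions and volumes, common
    ## orientation and shape; K = 1, ..., 10
    out <- mixmodCluster(y_log, nbCluster = 1:10,
                         models = mixmodGaussianModel(listModels = "Gaussian_pk_Lk_C"),
                         criterion = c("BIC", "ICL"))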
Finally, the solution selected with SICL is interesting: it is more parsimonious, and thus easier to read and interpret, than those of BIC or even ICL with the composite mixture model. And it is better related to the external variables, here particularly to the Channel variable, than the even more parsimonious solution selected by ICL with the continuous variables only (see Fig. 12, bottom left). Thus it can be an interesting solution to consider in a study for which relating the clusters to the external variables can help to interpret or study them. Further, as compared to the ICL solution with the composite mixture model, SICL also provides clusters more coherent with the external variables.

Discussion
The criterion SICL has been conceived in the model-based clustering context to choose a sensible number of clusters, possibly taking advantage of an external categorical variable or a set of external categorical variables of interest (variables other than the variables on which the clustering is based). This criterion can be useful to draw attention to a well-grounded classification related to these external categorical variables.
As already stated in Sect. 3, another possible approach consists of involving the continuous and categorical variables together in a composite mixture model. This is a natural approach which can provide interesting results but, beside the aforementioned difficulty of balancing the weights between different types of variables, the assumption of conditional independence of the variables given the latent class is quite strong: it is then coherent to choose a model in which all the variables are conditionally independent, and if this assumption is broken, the criteria can be expected to select too high a number of clusters.

The SICL criterion also relies on a conditional independence assumption, but only for the selection of the number of clusters and not directly for the design of the clusters. This assumption is then less critical than in the composite mixture model approach. First, our procedure enables one to model the possible dependences between the observed variables by fitting an appropriate model form, since the conditional independence assumption is not involved at this step (see Sect. 4.3). Second, if this assumption does not hold, the consequence is lessened by the very fact that it is involved only in the selection of the number of clusters. It can be expected to hold at least for models with high enough numbers of components: at worst, SICL can be led to select too high a number of clusters, particularly when some of the corresponding solutions are well related to the external variables. But the clusters can still be expected to be relevant. The risk with the composite latent-class model seems to be more severe in this situation, since the assumption is involved in the design of the clusters. Studying this more deeply, notably with a simulation study, could be further work. Moreover, as one of the referees noticed, it may be possible to get the same solution as SICL by including the external variables in the model, as is the case for example for the Iris data set. But not including them in the model provides stronger evidence of the link between the external variables and the clustering, for the very reason that the illustrative variables are not involved in the design of the clustering.
Finally SICL could highlight partitions of special interest with respect to external categorical variables. Therefore, we think that SICL deserves to enter into the toolkit of model selection criteria for clustering. In most cases, it will select a sensible solution and when it points out an original solution, it could be of great interest for practical purposes.
Acknowledgments The authors would like to thank Christophe Biernacki for his help in the numerical experiment with the composite mixture model and helpful comments and Christian Hennig for helpful discussion which in particular helped to make the point about the conditional independence assumption. The authors are grateful to two reviewers and the Editor for their helpful comments which led to improvements of the article.

Appendix: Details for the simulated numerical experiments
Random or True Label Experiments

Each data set is the observation of a sample of size 200 from a Gaussian mixture in R^2 whose parameters are given in Table 9.

Mixture of Mixtures Experiment
Each data set is the observation of a sample of size 600 from a Gaussian mixture in R^2 whose parameters are given in Table 10.

Conditionally Dependent Label Experiment
Each data set is the observation of a sample of size 200 from a Gaussian mixture in R^2 whose parameters are given in Table 11.