
In data clustering, the problem of selecting the subset of most relevant features from the data has been an active research topic. Feature selection for clustering is a challenging task due to the absence of class labels for guiding the search for relevant features. Most methods proposed for this goal are focused on numerical data. In this work, we propose an approach for clustering and selecting categorical features simultaneously. We assume that the data originate from a finite mixture of multinomial distributions and implement an integrated expectation-maximization (EM) algorithm that estimates all the parameters of the model and selects the subset of relevant features simultaneously. The results obtained on synthetic data illustrate the performance of the proposed approach. An application to real data, referring to official statistics, shows its usefulness.


INTRODUCTION
Feature selection is considered a fundamental task in several areas of application that deal with large data sets containing many features, such as data mining, machine learning, image retrieval, text classification, customer relationship management, and analysis of DNA micro-array data. In these settings, it is often the case that not all the features are useful: some may be redundant, irrelevant, or too noisy. Feature selection extracts valuable information from the data sets by choosing a meaningful subset of all the features. Some benefits of feature selection include reducing the dimensionality of the feature space, removing noisy features, and providing a better understanding of the underlying process that generated the data.
In supervised learning, namely in classification, feature selection is a clearly defined problem, where the search is guided by the available class labels. In contrast, for unsupervised learning, namely in clustering, the lack of class information makes feature selection a less clear problem and a much harder task.
An overview of the methodologies for feature selection as well as guidance on different aspects of this problem can be found in [1], [2] and [3].
In this work, we focus on feature selection for clustering categorical data, using an embedded approach to select the relevant features. We adapt the approach developed by Law et al. [4] for continuous data, which simultaneously clusters and selects the relevant subset of features. The method is based on a minimum message length (MML) criterion [5] to guide the selection of the relevant features and an expectation-maximization (EM) algorithm [6] to estimate the model parameters. This variant of the EM algorithm seamlessly integrates model estimation and feature selection into a single algorithm. We work within the commonly used framework for clustering categorical data that assumes that the data originate from a multinomial mixture model. We assume that the number of components of the mixture model is known and implement a new EM variant following previous work in [7].

RELATED WORK
Feature selection methods aim to select a subset of relevant features from the complete set of available features in order to enhance the clustering analysis performance. Most methods can be categorized into four classes: filters, wrappers, hybrid, and embedded.
The filter approach assesses the relevance of features by considering the intrinsic characteristics of the data and selects a feature subset without resorting to a clustering algorithm. Some popular criteria used to evaluate the goodness of a feature, or of a feature subset, are distance, information, dependency, or consistency measures. Some filter methods produce a feature ranking and use a threshold to select the feature subset. Filters are computationally fast and can be used in unsupervised learning.
Wrapper approaches include the interaction between the feature subset and the clustering algorithm. They select the feature subset among various candidate subsets of features that are sequentially generated (usually in a forward or backward way), in an attempt to improve the clustering algorithm's results. Usually, wrapper methods are more accurate than filters but, even for algorithms with moderate complexity, the number of iterations that the search process requires results in a high computational cost.
Hybrid methods aim at taking advantage of the best of both worlds (filters and wrappers). The main goal of hybrid approaches is to obtain the efficiency of filters and the accuracy of wrappers. Usually, hybrid algorithms use a filter method to reduce the search space that will subsequently be considered by a wrapper. Hybrid methods are faster than wrappers, but slower than filters.
In embedded methods, feature selection is incorporated into the clustering algorithm, thus fully exploiting the interplay between the selected features and the clustering task. Embedded methods are reported to be much faster than wrappers, although their performance also depends on the clustering algorithm [8].
In clustering problems, feature selection is both challenging and important. Filters used for supervised learning can be used for clustering since they do not resort to class labels. The vast majority of work on feature selection for clustering has focused on numerical data, namely on Gaussian-mixture-based methods (e.g., [9], [4], and [10]). In contrast, work on feature selection for clustering categorical data is relatively rare [11].
Finite mixture models are widely used for cluster analysis. These models allow a probabilistic approach to clustering in which model selection issues (e.g., the number of clusters or the subset of relevant features) can be formally addressed. Some advantages of this approach are: it identifies the clusters, it is able to deal with different types of feature measurements, and it outperforms more traditional approaches (e.g., k-means). Finite mixture models assume specific intra-cluster probability functions, which may belong to the same family but differ in the parameter values. The purpose of model estimation is to identify the clusters and estimate the parameters of the distributions underlying the observed data within each cluster. The maximum likelihood estimators cannot be found analytically, and the EM algorithm [6] has often been used as an effective method for approximating the estimates. To our knowledge, there is only one proposal [11] within this setting for clustering and selecting categorical features. In that work, Talavera presents a wrapper and a filter to select categorical features. The proposed wrapper method, EM-WFS (EM wrapper with forward search), combines EM with forward feature selection. Assuming that feature dependencies play a crucial role in determining feature importance for clustering, a filter ranker based on a mutual information measure, EM-PWDR (EM pairwise dependency ranker), is also proposed. In supervised learning, filter approaches usually measure the correlation of each feature with the class label by using distance, information, or dependency measures [12]. In the absence of class labels, one can instead consider as irrelevant those features that exhibit low dependency with the other features [13]. Under this assumption, the proposed filter considers features that are highly correlated with other features as good candidates for selection. However, feature subset evaluation criteria like scatter separability or maximum likelihood seem to be more efficient for the purpose of clustering than the dependence between features. In our work, we propose an embedded method for feature selection, using a minimum message length model selection criterion to select the relevant features and a new EM algorithm for performing model-based clustering.

THE MODEL
Let $Y = (y_1, \ldots, y_n)'$ be a sample of $n$ independent and identically distributed random vectors, where $y = (Y_1, \ldots, Y_L)$ is an $L$-dimensional random vector. It is said that $y$ follows a $K$-component finite mixture distribution if its log-likelihood can be written as

\[ \log f(y \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k \, f(y_i \mid \theta_k), \]

where $\alpha_1, \ldots, \alpha_K$ are the mixing probabilities ($\alpha_k \geq 0$, $k = 1, \ldots, K$, and $\sum_{k=1}^{K} \alpha_k = 1$), $\theta = (\theta_1, \ldots, \theta_K, \alpha_1, \ldots, \alpha_K)$ is the set of all the parameters of the model, and $\theta_k$ is the set of parameters defining the $k$-th component. In our case, for categorical data, $f(\cdot)$ is the probability function of a multinomial distribution.
Assuming that the features are conditionally independent given the component label, the log-likelihood is

\[ \log f(y \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k \prod_{l=1}^{L} f(y_{il} \mid \theta_{lk}). \]

The maximum likelihood estimators cannot be found analytically, and the EM algorithm has often been used as an effective method for approximating the corresponding estimates. The basic idea behind the EM algorithm is to regard the data $Y$ as incomplete data, the cluster allocations being unknown. In finite mixture models, the variables $Y_1, \ldots, Y_L$ (the incomplete data) are augmented by component-label latent variables $z_i = (Z_{1i}, \ldots, Z_{Ki})$, a set of $K$ binary indicator latent variables with $Z_{ki} \in \{0, 1\}$ and $Z_{ki} = 1$ if and only if $y_i \in C_k$ (component $k$), implying that the corresponding probability function is $f(y_i \mid \theta_k)$. Assuming that $z_1, \ldots, z_n$ are i.i.d., following a multinomial distribution of $K$ categories with probabilities $\alpha_1, \ldots, \alpha_K$, the log-likelihood of a complete data sample $(y, z)$ is given by

\[ \log f(y, z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ki} \left[ \log \alpha_k + \log f(y_i \mid \theta_k) \right]. \]

The EM algorithm produces a sequence of estimates $\hat{\theta}(t)$, $t = 1, 2, \ldots$, until some convergence criterion is met.
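For concreteness, one EM iteration for this multinomial mixture model can be sketched as follows; the variable names and array layout are our own choices for illustration, not the paper's notation.

```python
import numpy as np

def em_step(Y, alpha, theta):
    # One EM iteration for a K-component mixture of multinomials,
    # with features conditionally independent given the component.
    # Y: (n, L) integer category codes; theta[l]: (K, c_l) row-stochastic.
    n, L = Y.shape
    # E-step: responsibilities w[i, k] = P(Z_ki = 1 | y_i)
    logw = np.tile(np.log(alpha), (n, 1))            # (n, K)
    for l in range(L):
        logw += np.log(theta[l][:, Y[:, l]].T)       # add log f(y_il | theta_lk)
    logw -= logw.max(axis=1, keepdims=True)          # numerical stabilisation
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    # M-step: closed-form updates from weighted counts
    alpha_new = w.sum(axis=0) / n
    theta_new = []
    for l in range(L):
        onehot = np.eye(theta[l].shape[1])[Y[:, l]]  # (n, c_l)
        counts = w.T @ onehot                        # (K, c_l) weighted counts
        theta_new.append(counts / counts.sum(axis=1, keepdims=True))
    return alpha_new, theta_new, w
```

Iterating `em_step` until the log-likelihood stops increasing yields the usual EM estimates for this model.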

Feature Saliency
The concept of feature saliency is essential in the context of the feature selection methodology. There are different definitions of feature saliency/(ir)relevancy. Law et al. [4] adopt the following definition: a feature is irrelevant if its distribution is independent of the cluster labels, i.e., an irrelevant feature has a probability function that is common to all clusters.
Let us denote the probability functions of relevant and irrelevant features by $p(\cdot)$ and $q(\cdot)$, respectively. For categorical features, $p(\cdot)$ and $q(\cdot)$ refer to multinomial distributions. Let $B_1, \ldots, B_L$ be the binary indicators of the features' relevancy, where $B_l = 1$ if feature $l$ is relevant and zero otherwise.
Using this definition of feature irrelevancy, the log-likelihood becomes

\[ \log f(y \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k \prod_{l=1}^{L} \left[ p(y_{il} \mid \theta_{lk}) \right]^{b_l} \left[ q(y_{il} \mid \theta_l) \right]^{1 - b_l}. \]

Defining feature saliency as the probability of the feature being relevant, $\rho_l = P(B_l = 1)$, the log-likelihood is (the proof is in [4])

\[ \log f(y \mid \theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \alpha_k \prod_{l=1}^{L} \left[ \rho_l \, p(y_{il} \mid \theta_{lk}) + (1 - \rho_l) \, q(y_{il} \mid \theta_l) \right]. \]

The features' saliencies are unknown and are estimated using an EM variant based on the MML criterion. This criterion encourages the saliencies of relevant features to go to one and those of irrelevant features to go to zero, thereby pruning the feature set.
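The marginalization behind this identity can be sketched in one step (full details are in [4]). Assuming the indicators $B_1, \ldots, B_L$ are mutually independent with $P(B_l = 1) = \rho_l$, summing over the $2^L$ configurations of $b = (b_1, \ldots, b_L)$ factorizes across features:

```latex
\begin{align*}
f(y_i \mid \theta)
&= \sum_{k=1}^{K} \alpha_k \sum_{b} \prod_{l=1}^{L}
   \left[\rho_l \, p(y_{il} \mid \theta_{lk})\right]^{b_l}
   \left[(1-\rho_l) \, q(y_{il} \mid \theta_l)\right]^{1-b_l} \\
&= \sum_{k=1}^{K} \alpha_k \prod_{l=1}^{L}
   \left[\rho_l \, p(y_{il} \mid \theta_{lk})
   + (1-\rho_l) \, q(y_{il} \mid \theta_l)\right].
\end{align*}
```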

The Proposed Method
We propose an embedded approach for clustering categorical data, assuming that the data originate from a multinomial mixture and that the number of mixture components is known. The new EM algorithm is implemented using an MML criterion to estimate the mixture parameters, including the features' saliencies. This work extends that of Law et al. [4] by dealing with categorical features.

The Minimum Message Length (MML) Criterion
The MML-type criterion chooses the model providing the shortest description (in an information theory sense) of the observations [5]. According to Shannon's information theory, if $Y$ is some random variable with probability distribution $p(y \mid \theta)$, the optimal code-length for an outcome $y$ is $l(y \mid \theta) = -\log_2 p(y \mid \theta)$, measured in bits and ignoring that $l(y \mid \theta)$ should be an integer [14]. When the parameters $\theta$ are unknown, they need to be encoded as well, so the total message length is given by $l(y, \theta) = l(y \mid \theta) + l(\theta)$, where the first part encodes the observation $y$ and the second the parameters of the model.
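The two-part trade-off can be illustrated with a toy Bernoulli model. Note this is illustrative only: the half-log-n-bits-per-free-parameter cost used here is a standard asymptotic approximation, not the exact criterion derived for the multinomial mixture below.

```python
import numpy as np

def message_length_bits(y, p, n_params=1):
    # Two-part message length l(y, theta) = l(y | theta) + l(theta)
    # for binary data y with Bernoulli parameter p.
    n = len(y)
    # data part: -log2 likelihood (Shannon code length, in bits)
    data_bits = -np.sum(np.log2(np.where(y == 1, p, 1 - p)))
    # parameter part: ~0.5 * log2(n) bits per free parameter
    # (asymptotic approximation, assumed here for illustration)
    param_bits = 0.5 * n_params * np.log2(n)
    return data_bits + param_bits
```

A better-fitting parameter lowers the data part while the parameter cost stays fixed, so the criterion favors models that compress the data after paying for their own description.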
Under the MML criterion, for categorical features, the estimate of $\theta$ is the one that minimizes a description length function, $l(y, \theta)$, whose parameter-cost terms depend on $c_l$, the number of categories of feature $Y_l$.
A Dirichlet-type prior (the natural conjugate prior of the multinomial) is used for the saliencies. As a consequence, the MAP (maximum a posteriori) parameter estimates are obtained by minimizing the proposed description length function, $l(y, \theta)$.

The Integrated EM
To estimate all the parameters of the model, we implemented a new version of the EM algorithm integrating clustering and feature selection: the integrated Expectation-Maximization (iEM) algorithm. Its complexity is the same as that of the standard EM for mixtures of multinomials. The iEM algorithm to maximize $[-l(y, \theta)]$ has two steps: the E-step, which computes the posterior probabilities according to (1), and the M-step, which updates the parameter estimates according to (2), (3) and (4). After running the iEM, the saliencies are usually not zero or one. Since our goal is to reduce the initial feature set, we check whether pruning the feature with the smallest saliency produces a lower message length. This procedure is repeated until every feature has a saliency equal to zero or one. At the end, we choose the model with the minimum message length value. The proposed algorithm is summarized in Figure 1.

Initialization (resorting to the empirical distribution):
  set the parameters θ_lk of the mixture components p(y_l|θ_lk), (l = 1,...,L; k = 1,...,K)
  set the common distribution parameters θ_l, to cover all the data, q(y_l|θ_l), (l = 1,...,L)
  set all feature saliencies ρ_l = 0.5, (l = 1,...,L)
  store the initial log-likelihood
  store the initial message length (iml); mindl ← iml
  continue ← 1
while continue do
  while increases in log-likelihood are above δ do
    M-step according to (2), (3) and (4)
    E-step according to (1)
    if feature l becomes relevant (ρ_l = 1), q(y_l|θ_l) is pruned
    if feature l becomes irrelevant (ρ_l = 0), p(y_l|θ_lk) is pruned for all k
    compute the log-likelihood and the current message length (ml)
  end while
  if ml < mindl then
    mindl ← ml; update all the parameters of the model
  end if
  if there are saliencies ρ_l ∉ {0, 1} then
    prune the feature with the smallest saliency
  else
    continue ← 0
  end if
end while
The best solution, including the saliencies, corresponds to the final mindl obtained.
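A simplified sketch of one iEM iteration under the saliency model is given below. It omits the MML penalty terms and the pruning logic (so it corresponds to plain maximum-likelihood updates in the style of Law et al. [4], adapted to multinomials); all variable names and array shapes are our own assumptions, not the paper's equations (1)-(4).

```python
import numpy as np

def iem_step(Y, alpha, theta, q, rho):
    # One EM iteration with feature saliencies (MML penalties omitted).
    # Y: (n, L) category codes; theta[l]: (K, c_l) per-component
    # distributions; q[l]: (c_l,) common distribution; rho: (L,) saliencies.
    n, L = Y.shape
    K = alpha.shape[0]
    u = np.empty((n, K, L))                       # rho_l * p(y_il | theta_lk)
    v = np.empty((n, L))                          # (1 - rho_l) * q(y_il)
    for l in range(L):
        u[:, :, l] = rho[l] * theta[l][:, Y[:, l]].T
        v[:, l] = (1.0 - rho[l]) * q[l][Y[:, l]]
    mix = u + v[:, None, :]                       # (n, K, L)
    c = alpha[None, :] * mix.prod(axis=2)         # unnormalized posteriors
    w = c / c.sum(axis=1, keepdims=True)          # component responsibilities
    a = w[:, :, None] * (u / mix)                 # P(component k, l relevant)
    b = w[:, :, None] * (v[:, None, :] / mix)     # P(component k, l irrelevant)
    # M-step: weighted-count updates
    alpha_new = w.sum(axis=0) / n
    theta_new, q_new = [], []
    for l in range(L):
        onehot = np.eye(theta[l].shape[1])[Y[:, l]]
        t = a[:, :, l].T @ onehot                 # (K, c_l)
        theta_new.append(t / t.sum(axis=1, keepdims=True))
        s = b[:, :, l].sum(axis=1) @ onehot       # (c_l,)
        q_new.append(s / s.sum())
    rho_new = a.sum(axis=(0, 1)) / n              # updated saliencies
    return alpha_new, theta_new, q_new, rho_new
```

In the full algorithm, the saliency update also carries the MML penalty terms, which is what drives the saliencies towards zero or one.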

NUMERICAL EXPERIMENTS
For an $L$-variate multinomial, with features conditionally independent given the component, we have

\[ f(y_i \mid \theta_k) = \prod_{l=1}^{L} \prod_{j=1}^{c_l} \theta_{lkj}^{\,y_{ilj}}, \]

where $c_l$ is the number of categories of feature $Y_l$ and $y_{ilj} = 1$ if observation $i$ takes category $j$ on feature $l$ (and zero otherwise).

Synthetic Data
We use two types of synthetic data: in the first type, the irrelevant features have exactly the same distribution for all components. Since, with real data, irrelevant features could exhibit small (non-relevant) differences between the components, we consider a second type of data where we simulate irrelevant features with similar distributions between the components. In both cases, the irrelevant features are also distributed according to a multinomial distribution. Our approach is tested with 8 simulated data sets. We ran the proposed EM variant (iEM) 10 times and chose the best solution. According to the results obtained using the iEM, the estimated probabilities corresponding to the categorical features almost exactly match the actual (simulated) probabilities. Two of our data sets are presented in Tables 1 and 2.
In Table 1, results refer to one data set with 900 observations, 4 categorical features, and 3 components with 200, 300 and 400 observations. The first two features are relevant, with 2 and 3 categories respectively; the other two are irrelevant and have 3 and 2 categories each. These irrelevant features have the same distribution for the 3 components. In Table 2, the data set has 900 observations and 5 categorical features. Features 1, 4 and 5 have 3 categories each, and features 2 and 3 have 2 categories. The first three features are relevant and the last two are irrelevant, with similar distributions between components.
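A data set of the first synthetic type can be generated along these lines. The category probabilities below are hypothetical placeholders, chosen only to match the stated dimensions (900 observations; components of 200, 300 and 400; relevant features with 2 and 3 categories; irrelevant features with 3 and 2 categories shared across components), not the values actually simulated in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [200, 300, 400]                     # component sizes, as in Table 1

# relevant features: one multinomial per component (assumed values)
rel = [
    [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]],                  # feature 1: 2 cats
    [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.2, 0.2, 0.6]],   # feature 2: 3 cats
]
# irrelevant features: a single distribution common to all components
irr = [[0.4, 0.4, 0.2],                                    # feature 3: 3 cats
       [0.5, 0.5]]                                         # feature 4: 2 cats

cols = []
for probs in rel:
    cols.append(np.concatenate(
        [rng.choice(len(p), size=s, p=p) for s, p in zip(sizes, probs)]))
for p in irr:
    cols.append(rng.choice(len(p), size=sum(sizes), p=p))

Y = np.stack(cols, axis=1)   # (900, 4) categorical data matrix
```

The second synthetic type is obtained the same way, but with slightly perturbed (rather than identical) distributions for the irrelevant features across components.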

Real Data
An application to real data referring to European official statistics (EOS) illustrates the usefulness of the proposed approach. This EOS data set originates from a survey on perceived quality of life in 75 European cities, with 23 quality-of-life indicators (clustering base features). For modeling purposes, the original answers, referring to each city's respondents, are summarized into: Scale 1) agree (including strongly agree and somewhat agree) and disagree (including somewhat disagree and strongly disagree); and Scale 2) satisfied (including very satisfied and rather satisfied) and unsatisfied (including rather unsatisfied and not at all satisfied).
A two-step approach is implemented: first, the number of clusters is determined based on the MML criterion (see [15]); second, the proposed iEM algorithm is applied 10 times and the solution with the lowest message length is chosen. The mean and standard deviation of the features' saliencies over the 10 runs are presented in Table 3.
Applying the iEM algorithm to group the 75 European cities into 4 clusters, 2 quality-of-life indicators are considered irrelevant: Presence of foreigners is good for the city and Foreigners here are well integrated, meaning that the opinions regarding these features are similar across all the clusters. In fact, most of the citizens (79%) agree that the presence of foreigners is good for the city, but they do not agree that foreigners are well integrated (only 39% agree). Clustering results along with the features' saliencies are presented in Table 4; the reported probabilities regard the agree and satisfied categories.
According to the obtained results, we conclude that most respondents across all surveyed cities feel safe in their neighborhood and in their city. In the cities of cluster 1, air pollution and noise are relevant problems and it is not easy to find good housing at a reasonable price. It is not easy to find a job in the cities of cluster 2. Citizens of the cities in cluster 3 have a higher quality of life than the others: e.g., they feel safer, are more committed to fighting climate change, and are generally satisfied with sport facilities, the beauty of the streets, public spaces, and outdoor recreation. Air pollution and noise are major problems in the cities of cluster 4; in this cluster, cities are not considered clean or healthy to live in.

Fig. 1. The iEM algorithm for clustering and selecting categorical features.

Table 1. iEM results for a synthetic data set where irrelevant features have the same distributions between components.

Table 2. iEM results for a synthetic data set where irrelevant features have similar distributions between components.

Table 3. Features' saliencies: mean and standard deviation over 10 runs.