Validation of archetypal analysis

We use an information-theoretic criterion to assess the goodness-of-fit of the output of archetypal analysis (AA), also intended as a fuzzy clustering tool. It is an adaptation of an existing AIC-like measure to the specifics of AA. We test its effectiveness using artificial data and some data sets arising from real life problems. In most cases, the results achieved are similar to those provided by an external similarity index. The average reconstruction accuracy is about 93%.


I. INTRODUCTION
The application of a matrix factorization approach to fuzzy clustering dates back to at least 1974. Woodbury and Clive devise in [23] a method to estimate fuzzy partitions underlying multivariate categorical data. It is called grade of membership model, and focuses especially on clinical data for diagnostic and prognostic purposes. Independently, Mirkin and Satarov [14] propose an extension of this model to real-valued data, assuming the data as convex linear combinations of a set of c ≥ 2 prototypes. In the same vein, Cutler and Breiman [4] present their archetypal analysis (AA), which has received some popularity and following in the literature. The subject has recently attracted researchers' attention. The work by Ding, Li, and Jordan [5] is an example of how to fit crisp k-means cluster analysis into the framework of matrix factorization; or the factorized fuzzy c-mean algorithm [18], which provides an alternative way to performing the traditional fuzzy c-means (FCM) [1] clustering.
The matrix factorization approach to fuzzy clustering can be explored from different perspectives. This study focuses on the validation problem, particularly that of archetypal analysis. Indeed, one of the key issues of AA is the lack of credible measures to verify its validity; this is actually an ongoing topic of research [15]. The present work is an attempt to answer the following question: how can we assess the goodness-of-fit of a fuzzy c-partition, c = 2, 3, ..., determined by an AA? We aim to examine the effectiveness of an AA as a fuzzy clustering tool, in addition to its ability to reconstruct the original data set. To the best of our knowledge, strategies to select the number of prototypes, i.e. c, have to date been based mostly on visual inspection. Scree plot-like or elbow criteria exploring the monotonic nature of either an objective function [7], or of some measure of the variation explained by different models [4], [15], are examples of validation methods we find in the literature. Our proposal is analytical, and relies on information-theoretic principles. It is an adaptation of an AIC-like measure, proposed in [19], to the specifics of AA.
We evaluate its effectiveness using artificial data and also some data sets arising from real life problems. The numerical results attest its reliability in both dimensions of interest: clustering and reconstruction of the source data.
This paper is organized as follows. In Section II, we briefly describe the matrix factorization approach to fuzzy clustering and then highlight the way archetypal analysis positions itself in this context. Section III is devoted to theoretical aspects of our validity measure for AA. Some results of its numerical assessment are presented in Section IV; finally, Section V concludes.

II. ARCHETYPAL ANALYSIS
Let X = [x jk ] ∈ R n×N be an n×N sample real data matrix, where n ≥ 2 is the dimension of the feature space, and N > n is the sample size. We denote the kth data point by the column vector x k . Consider two matrices, and let P = [p jk ] be equal to their product This matrix configures a polytope with c extreme points, spanned by the c columns of V, namely v 1 , v 2 , ..., and v c ; denote it Π (c) V . We assume the data are drawn from Π (c) V , and read with small errors, i.e, where E is the matrix that accounts for the measurement errors. This is known as a matrix factorization approach to data analysis and here, in particular, it means X is modeled or structured by a fuzzy c-partition. Therefore, we refer to the columns of V as prototypes and, in this context, each entry of the partition matrix U, i.e. the membership degree µ ik , additionally expresses the proportion of v i present in x k [16]. Hence, every data point x k is in the convex hull of c prototypes, apart from an error: The first question is how to estimate U and V from the observed data X.
Given a pre-specified value of c, the matrices U and V are often estimated by the minimization of the objective function 1 subject to the constraints on µ ik referred to above. The symbol ∥A∥ F denotes the Frobenius norm of the matrix A. The objective function J c (4) is separately convex in U and V, but not in the product VU. Therefore, no practical computational method for solving the problem should be expected anytime soon. On the other hand, given V, the optimization problem is reduced to N independent constrained least squares problems. As a result, a flat common solver can be used to minimize J c (U|V, X), regardless of the way the matrix of prototypes V is obtained. Here, for example, we adopt an alternating optimization scheme for estimation purposes; it is also a valid option when the computational tool allows parallelization [11], as in this case. The total number of parameters involved in the estimation of U is The subsequent second question is how to estimate V, given U. Before addressing this question, we need to know whether there is any constraint in the definition of a prototype. According to the AA approach, the prototypes, now called archetypes, are convex combinations of the observed data points, where 0 ≤ β ki ≤ 1 and ∑ N k=1 β ki = 1. Given c, the total number of free β parameters is then Gathering them all in an N × c matrix B = [β ki ], allows us to write (6) in a matrix form, as follows Hence, in the context of an AA, the estimation of V converts into the estimation of the matrix B. In this study, we estimate this matrix using the algorithm proposed by Ding et al. in [5]; alternative approaches are found elsewhere (e.g. [4], [20]). Specifically, the β parameters are estimated using the following update rule [5]: where A T means the transpose of the matrix A, and (A) ± = (abs (A) ± A) /2. The estimation process alternates between updates of U and V, until convergence. The V matrix is updated using (8). Interested readers may find a brief review of alternative approaches to the estimation of V in [18]. 1 Two alternative objective functions are provided in [22] III. VALIDITY MEASURE The original version of our proposal is given in [19], where the reconstruction ability of the Bezdek [1] fuzzy c-means (FCM) algorithm is tested. We do not present its functional formula here, for reasons that soon will become clear. In this study, we adapt it to the specificity of an AA, as follows.
Motivated by the regression-like model adopted for the observed data (3), we start from the loss function used to select regressors in multiple regression analysis [21]: whereσ 2 is an estimate of the residual variance σ 2 , and cpx is a nonnegative variable accounting for the complexity of a given model. For example, in Akaike's information criterion (AIC), the value of cpx is twice the number of the estimated parameters. Following [19], we use the quantity to account for the residual variance; here,P is an estimate of P given in (1), so that δ 2 is the objective function (4) related to n × N . It has been noted in the literature that, in cluster analysis, the AIC approach to cpx in (9) tends to favor data partitions with fewer clusters, and may potentially lead to poor clustering [12]; this is also observed by Suleman in [19]. This effect might be connected to the number of parameters of a fuzzy c-partition, which increases with the sample size N and, thus, potentially tends to infinity [10]. Therefore, the aforementioned author proposes to additionally balance the complexity term in (9) using a measure of efficiency of the sought fuzzy c-partition, to prevent the underestimation of the true value of c. For this purpose, he considers the quantity effic (P |X which proves effective in an FCM framework. In the expression (11), tr (A) is the trace of the matrix A; Σ X and ΣP are the covariance matrices of X andP, respectively. Nevertheless, unlike [19], in (9) we adopt twice the quantity npar effic(P|X) for cpx, i.e.
which mimics the AIC and, in our experiments, provides better results. In this expression, npar = K µ + K β + 1 is the total number of parameters involved in the estimation process: K µ is the number of membership degrees µ ik , as in (5); K β accounts for the β parameters (7); and the extra one is for δ 2 (10); the function effic (.) is given in (11). Hence, the final form of the validity measure we are using to assess the goodness-of-fit of an AA outcome is this: We note, however, that in most real life applications, both K µ and K β ≫ 1, and N − 1 ≃ N ; therefore K µ + K β + 1 ≃ N (2c − 1) and, consequently, one can use a simplified approximate version of υ AA (c), In our numerical experiments, we make the same inferences about the underlying data structure, whether using (12) or (13). Hence, in practice, we can validate an AA using the latter index, which looks a simpler option. In sum, given a collection of competing fuzzy c-partitions of the same observed data X, c = 2, 3, ..., the best partition is selected by solving arg min c υ AA (c) or, alternatively, arg min cυAA (c).
We end this section stressing that the formula (12) differs from the one proposed in [19] essentially in the second term of its right hand side: here, we instead duplicate the quantity npar effic(P|X) , and obtain better results.

A. General procedure
The numerical computations are performed in MATLAB. We use the lsqlin() function to estimate the membership degrees µ ik , with the interior-point algorithm option [3], and explore the potential of the parallel computing feature. The maximum number of iterations is set to be 500, and the maximum absolute difference between two consecutive estimates of µ ik , i.e. the error term, is set as 0.01.
In all numerical experiments, we use data sets arranged in c * ≥ 2 clusters with known class labels. The best fuzzy c-partition is obtained by varying c from 2 to c max = max (8, 1.5 × c * ), and eventually solving At the same time as we compute υ AA (c), we record the value of the fuzzy generalization of the Dice index or criterion proposed in [9], Ψ Dice (c), and let be the optimal value of c, according to the Dice criterion. The quantity Ψ Dice (c Dice ) is a reference metric. We also record the Dice index value associated with the fuzzy c opt -partition (14), i.e. Ψ Dice (c opt ). This procedure is intended to see how effective an AA is as a fuzzy clustering tool. It is implicit here that an external index provides a more accurate evaluation of the estimated fuzzy partitions, since it uses the information of the cluster structure of the data. Recall that Ψ Dice ranges between 0 and 1, and the higher the values of Ψ Dice , the closer the estimated partition is to the partition being used as the ground truth. A FCM clustering is performed beforehand to provide for a matrix of prototypes, for seeding purposes. The weighting exponent parameter is set to 2. This algorithm has proved helpful in solving the initialization problem in related matrix factorization contexts [17], [24]. For alternative seeding methods see, for example, [2], [5], [15].
We test our proposal using artificial data as well as data sets arising from real life problems, and available in the UCI Machine Repository [13]. Given the estimated fuzzy cpartitions of X, 2 ≤ c ≤ c max , we compute the following statistics for decision purposes: υ AA (12),υ AA (13), the Dice index Ψ Dice (c), and Ψ Dice (c opt ); and, of course, keep track of c opt (14) and c Dice (15).

B. Synthetic data sets 1) Test data generation:
The artificial test data are drawn from the polytope Π (c * ) V , c * = 2, 3, .., or 7, whose vertices are located on the unit (hyper)sphere of R n , centered at the origin, for n = 2, 3, 4, or 5. When possible, i.e. for c * > n, we generate the vertices of V matrix using polymake software [8], and for c * ≤ n, we use our own software code. As a crossvalidation procedure, we verify the location of the columns of V on the hypershepre, and confirm their extremality. The latter procedure is merely an LP problem [6].
We consider four threshold levels, γ, for the membership degree in underlying fuzzy clusters: γ = 0.95, 0.85, 0.75, or 0.65. For example, γ = 0.75 means the proportion of the prototype v i in data point x k , i.e. µ ik , is at least 0.75. Each fuzzy cluster is populated with N i = 50 points; the sample size is, therefore, N = 50× c * . The membership degrees are generated from the standard uniform distribution, giving rise to the partition matrix U. This enables us to eventually build the matrix P, as defined in (1). In the next step, we add normal (0, σI) noise to P, to generate, i.e. simulate, the real data matrix X, as in (2). Here, we consider three levels for the noise, σ = 0.001, 0.01, or 0.05; I is the n × n identity matrix. Our synthetic data sets can therefore be expressed as X ≡ X (c * , n, γ, σ). We focus particularly on the effect of the space dimension, n, since, in practice, we have no control over other parameters. Summing up, six cluster structures, four dimensions, four threshold levels for µ ik and three noise levels, amounts to N cc = 6 × 4 × 4 × 3 = 288 cluster contexts.
We generate 10 random samples or runs for each cluster context, which total to 2880 artificial data sets. An eleventh run provides a flat seed for the AA algorithm, for every case; specifically, an initial guess of the matrix of prototypes V. Then, the AA algorithm alternates between updates of U and V, until convergence.
Besides the statistics referred to above, we also calculate the relative error between the optimal value of υ AA (12) and υ AA (13), denoted here by η, as well as a measure for the reconstruction accuracy of an AA, that is 1 − R, where, and Here,P =VÛ is an estimate of P, andV andÛ are the outputs of AA algorithm. We stress thatP is estimated from noisy data matrix X, but has been compared in (17) to noiseless data P. For inferential purposes, we employ the average values based on a trial of 10 runs.
2) Experimental results: To evaluate the accuracy of an AA as a fuzzy clustering tool, we begin by examining the distribution of the difference between Ψ Dice (c Dice ) and Ψ Dice (c opt ) by means of box plots, in function of the space dimension, n (Fig. 1). We can see at a glance that the AA behaves differently in the cases n = 2 and n > 2. Even though it performs better in the latter case, we find no severe outliers when n = 2. Table  I gives a numerical account of how far the partition unveiled by the proposed measure of similarity υ AA is from that provided by the external index. For higher dimensions, the difference between the two values is less than 0.05 in more than 84% of cases. The rate is slightly lower for n = 3, and much lower for 2D data sets.  We further represent (Fig. 2) the distribution of Dice index values associated with the fuzzy c-partitions determined by υ AA , i.e. Ψ Dice (c opt ). We notice a good clustering performance of the AA, given that, for n > 2, the first quartile and the median are higher than 0.7 and 0.8, respectively. Hence, combining this result with the previous one (Fig. 1), we see enough empirical evidence to support υ AA as a credible measure for assessing the goodness-of-fit of an AA, notably for the data arranged in clusters and drawn from three-or higher-dimensional space. Although a good clustering does not entail the condition c opt = c * , we believe that a match between these two quantities is always appealing. In our experiments, it occurs in 63.5% of cases, and in 31.6% of cases we find c opt < c * . When we make the same comparison for c Dice , we obtain 68.4% and 10.4%, respectively, which signals that our similarity measure may be somewhat conservative.
Another aspect that deserves attention is understanding the extent to which an AA allows the reconstruction of the original data set. The histogram in Fig. 3 shows the distribution of 1 − R (17), regardless of the cluster context. The average reconstruction accuracy is found to be fairly good: 0.93±0.05. This becomes even better, e.g. ∼ 0.99, if we alternatively measure the accuracy using 1 − R 2 , as in [20]. Given this, one can expect a negligible loss of information when replacing the original data set X by the fuzzy c-partition determined by υ AA index. Fig. 3. An empirical distribution of the reconstruction accuracy of AA. The quantities Q 1 , Q 2 , and Q 3 are the first, second (median) and third quartiles, respectively; Ncc is the number of cluster contexts. Now we give a brief account of howυ AA (13) differs from υ AA (12). The histogram of Fig. 4 shows an empirical  4. An empirical distribution of the relative difference between υ AA andυ AA , η, as given in (16). The quantities Q 1 , Q 2 , and Q 3 are the first, second (median) and third quartiles, respectively; Ncc is the number of cluster contexts.
We end this account on the experimental results with a curiosity: unlike most related research studies, here the ground truth is itself a fuzzy partition, i.e. the class labels are themselves fuzzy, meaning that µ * ik ∈ [0, 1] rather than crisp, {0, 1}. Therefore, this context is appropriate for using the generalized Dice index [9].

C. Real datasets
We select from the UCI Machine Repository [13] several data sets devoted to the clustering task, and look how differently our index υ AA and the Dice index perform in a more realistic setting. As before, we calculate the best partition according to the Dice criterion, Ψ Dice (c Dice ), and the corresponding value associated with the fuzzy c opt -partition, notably Ψ Dice (c opt ). The results obtained are displayed in Table II; the absolute difference between these two quantities are indicated in the last column, labeled ∆. In most cases, both indices provide quite similar results. However, there are two results that deserve particular attention: the υ AA index identifies the same number of clusters as expected theoretically in the case of Forest Type Mapping data set, c * = 4, and, on the contrary, sees an abnormal number of clusters in the Hill-Valley data set (6 vis-á-vis 2). We note that for the data set identified as Glass (Window / Non-W) in Table II, the glasses are alternatively categorized into window and nonwindow type, hence c * = 2.

V. CONCLUSION
We propose an analytical formula to address the issue of assessing the goodness-of-fit of archetypal analysis (AA). The proposed internal measure of similarity, υ AA , relies on information-theoretic principles, and takes into account the specifics of fuzzy clustering, namely, that it involves unlimited number of parameters. This leads us to balance the number of parameters in the complexity term in (9) with a measure of efficiency of a fuzzy c-partitions.
The results of comparing the output of υ AA index to that of an external index confirm it as a credible criterion in the unsupervised clustering framework. This is further reinforced by the estimated reconstruction accuracy, which is about 93%. We note, however, that our artificial data sets are balanced. The next step of our empirical research is to evaluate how υ AA behaves in the presence of imbalanced data.