Murua Alejandro

RESEARCH

The goal of every clustering (or unsupervised classification) method is to estimate the "true" clustering structure or partition underlying the feature space. This estimate is usually unveiled by optimizing a score function that gives a figure-of-merit on admissible partitions of the data. In fields such as bioinformatics (e.g. microarray gene expression data, proteomic sequencing) and text mining (document clustering), clustering high-dimensional data sets is a common problem. Note that one of the main di culties with genome data is the usually small sample size together with the high-dimensionality of the data. Sound machine learning and statistical methods for variable selection and/or data reduction based on the intrinsic structure of the data are needed. In the absence of such techniques, efficient and reliable clustering methods whose score functions are based on similarity or distance matrices are advantageous. Kernel methods and their extensions such as kernel K-means, multiway normalized cut (MNCut), and, recently, adaptive Potts model clustering use the similarity matrix information in an effective way, often yielding good results.

The goal of my current research is to develop sound machine learning and probabilistic methods to find (1) intrinsic structure (clustering), and (2) patterns (variable selection or functionals that intrinsically describe and interpret the data in few dimensions), in mainly complex and/or high-dimensional data and large datasets.

This research is partly based on the belief that the key to the success of modeling complex data structures lies in the embedding of kernel-based methods within parametric and Bayesian models. For example, the Bayesian parametric paradigm gives a framework for introducing structural constraints, whereas the Kernel based methods introduces flexibility (e.g. local, spatial and/or temporal adaptation). Our penalized adaptive Potts model clustering implicitly models the clusters densities via kernel density estimators (a by-product of using a Mercer kernel within a Potts model), and uses Bayesian computational techniques (Markov Chain Monte Carlo) to estimate the clusters and their number, and other parameters of the model.

Some examples of key applications guiding this research are finding groups of genes associated to certain diseases (e.g. leukemia or lung cancer) in order to devise effective individualized treatments; discovering the di erent developmental stages of sensory organs (e.g. vision); finding the role of proteins through association with proteins whose functions are known; finding relevant association between drugs and potential adverse effects; and discovering neuronal firing patterns in the brain using fMRI images.

SELECTED RESEARCH CONTRIBUTIONS

Chekouo, T., and Murua, A., "Chekouo*, T., Murua, A. and Raffelsberger, W., (2014) ``The Gibbs-Plaids Biclustering Model," Tentatively accepted in Annals of Applied Statistics, (2015).

Biclustering is the simultaneous clustering of two related dimensions, for example, of genes and experimental conditions. Most of the current research has focused on algorithms. In the first work we shed light on associated statistical models behind the algorithms. This allows us to generalize most of the popular biclustering techniques, and to justify and improve on, the algorithms used to find the biclusters. Our penalized plaids biclustering model links two previously unrelated models: the biclustering mixture model of Cheng and Church (2000) and the plaids model of Lazzeroni and Owens (2002). Our model also introduces in a natural measure of biclustering complexity (number of biclusters and overlapping), We also present a suitable version of the DIC criterion to choose the number of biclusters, a problem that has not been adequately addressed yet. Also neglected in the biclustering literature is the prior information about genes or conditions, and pairwise interaction between them. In our second paper we propose a model that does take into account this information. We adopt a Gaussian plaid model as the model describing the biclustering structure. We incorporate prior information on the dependency between genes and between conditions through relational graphs, one for the genes and another for the conditions. These graphs are conveniently described by auto-logistic models (Besag, 1974, 2001; Winkler, 2003) for genes and conditions. These distributions are pairwise-interaction Gibbs Random Fields for dependent binary data. They can be interpreted as generalizations of the finite-lattice Ising model (Ising, 1925). In our prior, the inter-dependencies between the genes are elicited through the information contained in the GO (Gene Ontology) (Ashburner et al, 2000) collection. The normalizing constants of our priors are not known. So in order to efficiently obtain estimates of the posterior distribution, we developed a hybrid procedure that mixes Metropolis-Hastings with a variant of the Wang-Landau algorithm (Wang and Landau, 2001; Atchade and Liu, 2010).

In high dimensional data, the number of covariates is considerably larger than the sample size. We propose a new method to analyze these data. The method searches simultaneously for the cluster structure of the observations and the informative variables. Our model for clustering and variable selection is inspired by the plaid model. It may be seen as a multiplicative mixture model for overlapping clustering. Unlike conventional clustering, within this model an observation may be explained by several clusters. Parameter estimation is done via the Monte Carlo EM algorithm and Important Sampling. Using an extensive simulation as well as comparisons with some competitive methods, we show the advantages of our methodology both in terms of variable selection and clustering. In particular, an application of our approach to gene expression data of kidney renal cell carcinoma (KIRC) taken from The Cancer Genome Atlas validates some previously identified cancer biomarkers.

. Murua, A., Stanberry, L. and Stuetzle, W., (2008) "On Potts model clustering, kernel K-means and density estimation," Journal of Computational & Graphical Statistics, 17 (2008), 629-658.

This work uncovers the link between several seemingly unconnected kernel-based methods (e.g. kernel K-means, Multiway Normalized Cut). Within the same framework it reveals (and re-introduces) the Potts model as one of the simplest to conceive kernel-based clustering method. This work also connects kernel-based methods to non-parametric kernel density estimation. It uses these two findings to present a powerful kernel-based clustering methodology based on Potts model, kernel density estimation, and consensus clustering. The second paper presents an important application to the exploration of brain connectivity via our adaptive Potts model clustering methodology described in the first paper. Our work has been presented at several conferences all over the world. The third paper presents a Bayesian kernel-based clustering method. The model arises as an embedding of the Potts density for label membership probabilities into an extended Bayesian model for joint data and label membership probabilities. In contrast to previous work, our Conditional-Potts clustering model introduces a complete principled way to do clustering. This is accomplished through the elucidation of an informative prior for the model parameters, and also by the development of a Wang-Landau flat-histogram-like stochastic algorithm to estimate the posterior of the parameters. In particular, the link between the Potts and the random cluster model models allows us to elucidate, using random graph theory, a very informative prior for the temperature parameter.

Kernels are now everywhere present in statistics as far as a dot product is at hand. However to the best of our knowledge kernels have not been used in mixture models. In the present work we show that they can be useful for classification purposes. They o er a flexibility in the modeling process through the kernel trick which enables capturing interesting features in some cases more easily than just using the standard methods. Our method is based on mixtures of Gamma distributions. These model the point distances to cluster centroids in the transformed space. The distances are readily computed using the kernel trick. Nested within our model are the two special cases of a mixture of exponentials and the kernel K-means method. We suggest using the log-likelihood ratio or the Bayesian Information criterion to select an appropriate parsimonious model for the data at hand. A comparison with other popular classification methods such as support vector machines, shows that our method is very competitive and computationally efficient.

We consider Bayesian nonparametric regression through random partition models. Our approach involves the construction of a prior distribution on partitions of individuals that is dependent on covariate information. Our main emphasis is to use these covariates to improve predictive inference. To do so we propose a prior on partitions based on the Potts clustering model as applied to the set of observed covariates, which drives the formation of clusters and the prior predictive distribution by covariate proximity, as desired. The resulting prior model is flexible enough to support many different types of likelihood models, although we focus the discussion on nonparametric regression. Implementation details are discussed for the specific case when the sampling model is described as multivariate multiple linear regression. Our proposal scores well in terms of model fitting when compared to other alternative nonparametric regression approaches and outperforms these alternatives in terms of predictions, as concluded from a simulation study we carried out. We illustrate the methodology with an application to health status at the turn of the 21st century across nations.

. Murua, A., Stuetzle, W., Tantrum, J. and Sieberts, S., "Model based document classification and clustering." International Journal of Statistics and Tomography, 8 (2008) 1-25.

In this work we develop a complete methodology for document classification and clustering based on model-based clustering. This explicitly models the data as being drawn from a Gaussian mixture. One main advantage of our approach is the ability to automatically select the number of clusters present in the document collection via Bayes factors. We also reviewed fractionation, an idea originally conceived by Cutting, Karger, Pedersen and Tukey (1992) for non-parametric hierarchical clustering of large data sets, and described an adaptation of it that extends model-based clustering to large data sets. We called this procedure model-based fractionation. The main idea is to reduce the data size by working with meta-data form by clusters of data points. Each meta-data is formed recursively and carries the information contained in the empirical moments of the associated cluster. We also conceived a further extension of fractionation, called model-based refractionation. This latter procedure proved successful even when dealing with di cult situations comprising sets with large numbers of small clusters. A key advantage of model-based refractionation over other competing clustering of large data sets approaches is that it does not require the number of clusters be known a priori. Part of this work was first presented at the 2002 International Conference On Data Mining and Knowledge Discovery (KDDâ€™02) in Edmonton, Canada, where it was selected as one of the best papers. We were later on invited to submit our extended work to the Information Systems Journal.

. Yeung, K.Y., Fraley, C., Murua, A., Raftery, A. and Ruzzo, L. (2001), "Model-Based clustering and data transformations for gene expression data," Bioinformatics, 17, 977-987.

This is a pioneering work on exploratory analysis of gene-expression microarray data. In it we benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external validation were available. We showed that on real gene expression data model-based clustering performed as well as leading heuristic clustering methods. But our approach has the advantage of suggesting the number of clusters (through the Bayesian information criterion BIC) as well as an appropriate model. We also assessed the degree to which real gene expression data fits the multivariate Gaussian mixture assumptions on several common data transformations. We showed that suitable chosen transformations can greatly enhance normality in gene expression data, and that models have varying performance on data sets that are transformed di erently. This paper has been identified by Thomson-ISI as one of the most cited papers in the research area of Gene Expression Data, and it continues to be cited in most journals that publish work on bioinformatics.

. Murua, A. Martin, D. (2005) "S+Bayes" library.

Doug Martin and I have conceived, developed and implemented several Bayesian hierarchical generalized linear mixed models. Among them are linear, Poisson and Binomial models. The fruit of this work has been compiled in the so-called S+Bayes software library (written in C++ with Graphical User Interface in S-PLUS). The goal is to facilitate Bayesian inference techniques to a larger audience than just statisticians in Academia. The library is freely available from Insightful Corporation.

In this work I introduced a useful notion of weak dependence between many classifiers built using the same training data. I showed that if both this dependence is low and the expected margins are large, then decision rules based on linear combinations of these classifiers can achieve error rates that decrease exponentially fast. Several experiments showed that the three common methods used to construct multiple classifiers from the same training data -bagging, boosting and randomized trees (random forests)- produce mutually weakly dependent trees. Furthermore, these results also suggested that there is a trade-off between weak dependence and expected margins: to compensate for low expected margins, there should be low mutual dependence among the classifiers making up the combination. This result has been cited in the engineering and pattern recognition literature as M "Murua's theory."

In this work we recognized speech signals using a large collection of coarse acoustic events which describe temporal relations between a few local cues in the spectrogram. The invariance to changes in duration of speech signal events was addressed by defining temporal relations in a rather coarse manner, allowing a large degree of slack. The problem is of combinatorial nature. However our acoustic events were defined in terms of labeled graphs and hence inherited a partial ordering. This allowed us to use multiple randomized trees to access the large pool of acoustic events in a systematic manner. The trees were aggregated to produce a classifier. We showed that the learning stage of this procedure is much more efficient than that for hidden Markov models (HMM). It also performed much better than HMM in our experiments. This work was presented as an invited talk at the World Congress of the Bernoulli Society.