Murua Alejandro


The goal of every clustering (or unsupervised classification) method is to estimate the "true" clustering structure or partition underlying the feature space. This estimate is usually unveiled by optimizing a score function that gives a figure-of-merit on admissible partitions of the data. In fields such as bioinformatics (e.g. microarray gene expression data, proteomic sequencing) and text mining (document clustering), clustering high-dimensional data sets is a common problem. Note that one of the main di culties with genome data is the usually small sample size together with the high-dimensionality of the data. Sound machine learning and statistical methods for variable selection and/or data reduction based on the intrinsic structure of the data are needed. In the absence of such techniques, efficient and reliable clustering methods whose score functions are based on similarity or distance matrices are advantageous. Kernel methods and their extensions such as kernel K-means, multiway normalized cut (MNCut), and, recently, adaptive Potts model clustering use the similarity matrix information in an effective way, often yielding good results.

The goal of my current research is to develop sound machine learning and probabilistic methods to find (1) intrinsic structure (clustering), and (2) patterns (variable selection or functionals that intrinsically describe and interpret the data in few dimensions), in mainly complex and/or high-dimensional data and large datasets.

This research is partly based on the belief that the key to the success of modeling complex data structures lies in the embedding of kernel-based methods within parametric and Bayesian models. For example, the Bayesian parametric paradigm gives a framework for introducing structural constraints, whereas the Kernel based methods introduces flexibility (e.g. local, spatial and/or temporal adaptation). Our penalized adaptive Potts model clustering implicitly models the clusters densities via kernel density estimators (a by-product of using a Mercer kernel within a Potts model), and uses Bayesian computational techniques (Markov Chain Monte Carlo) to estimate the clusters and their number, and other parameters of the model.

Some examples of key applications guiding this research are finding groups of genes associated to certain diseases (e.g. leukemia or lung cancer) in order to devise effective individualized treatments; discovering the di erent developmental stages of sensory organs (e.g. vision); finding the role of proteins through association with proteins whose functions are known; finding relevant association between drugs and potential adverse effects; and discovering neuronal firing patterns in the brain using fMRI images.


. Belbahri, M., Murua, A., Gandouet, O. and Partovi Nia, V., "Uplift Regression: The R Package tools4uplift," Submitted.

Uplift modeling aims at predicting the causal effect of an action such as a medical treatment or a marketing campaign on a particular individual, by taking into consideration the response to a treatment. The treatment group contains individuals who are subject to an action; a control group serves for comparison. Uplift modeling is used to order the individuals with respect to the value of a causal effect, e.g., positive, neutral, or negative. Though there are some computational methods available for uplift modeling, most of them exclude statistical regression models. The R Package tools4uplift intends to fill this gap. This package comprises tools for: i) quantization, ii) visualization, iii) feature selection, and iv) model validation.

. Murua, A. and Maitra, R., "Fast spatial inference in the homogeneous Ising model," Submitted.

In this work we propose accurate approximations for numerically calculating the normalizing constant, mean number of active vertices and mean spin interaction in the homogeneous Ising model. The approximations obviate the need for computationally expensive stochastic methods and widen the range of possible approaches in performing inference. We evaluate performance through simulations and illustrate context via Bayesian methods for detecting activation in a functional Magnetic Resonance Imaging experiment. The Ising model is a very important model in use for a wide range of applications, but has intractable normalizing constants and other moments. As explained in this manuscript, a number of authors have been limited in their ability to use more complex models and interactions because of the intractable partition function and other moments. Our accurate approximation formulas provide a way out of such situations and therefore, we believe that it will be of interest to a wide range of researchers. At present, we are extending our method to more complex models such as non-homogeneous Ising models and the Potts model.

. Adjogou, F., Murua, A. and Raffelsberger, W., "Bayesian lasso functional clustering for time-course and longitudinal data." Submitted.

In this work we are particularly interested in shedding light into the molecular mechanisms underlying critical exposure to tobacco smoke. We do this by extending a previous study (Stevenson et al., 2007) of the microarray transcriptome of rats exposed to tobacco smoke. We develop a flexible model for the analysis and clustering of these data, that can be used more generally, for complete and/or sparse time-course or longitudinal data. The model combines functional analysis and model-based clustering. The functional framework is used to model time-course data. Principal functional components are described by score coefficients which embed the curves in a much lower-dimensional space. Model based clustering is performed on the score space, thus avoiding the curse of dimensionality in the curves' space. We developed a Bayesian version of the lasso and elastic net penalty in order to render the model selection step more efficient. The number of clusters as well as the dimension of the score space are determined via this Bayesian lasso penalty model. Monte Carlo techniques are used in order to estimate the normalizing constants for different values of the penalty parameters. One of the advantages of the Bayesian lasso model is that it avoids the need to perform a costly cross-validation to select the penalty parameters.

. Murua, A., and Quintana, F., "Semiparametric Bayesian Regression Via Potts Model," Journal of Computational and Graphical Statistics, 26(2) (2017) 265-274.

In this work we consider Bayesian nonparametric regression through random partition models. Our approach involves the construction of a covariate-dependent prior on partitions of subjects. Our goal is to use covariate information to improve predictive inference. We propose a prior on partitions based on the Potts model associated with the observed covariates. To our knowledge we are the first to propose the use of the Potts model within the context of Bayesian nonparametrics. Currently, with two of my students, the approach is being extended to the Uplift marketing model, and to Cox survival regression.

. Chekouo, T., and Murua, A., "The Penalized Biclustering Model And Related Algorithms," Journal of Applied Statistics, 42 (2015) 1255-1277.
Chekouo, T., and Murua, A., "Chekouo*, T., Murua, A. and Raffelsberger, W., (2014) ``The Gibbs-Plaids Biclustering Model," Annals of Applied Statistics, 9(3) (2015) 1643-1670.
Chekouo*, T. and Murua, A., (2018) ``High-dimensional variable selection with the plaid mixture model for clustering'.'' Computational Statistics. Accepted for publication, April 2018. (DOI:10.1007/s00180-018-0818-7).

Biclustering is the simultaneous clustering of two related dimensions, for example, of genes and experimental conditions. Most of the current research has focused on algorithms. In the first work we shed light on associated statistical models behind the algorithms. This allows us to generalize most of the popular biclustering techniques, and to justify and improve on, the algorithms used to find the biclusters. Our penalized plaids biclustering model links two previously unrelated models: the biclustering mixture model of Cheng and Church (2000) and the plaids model of Lazzeroni and Owens (2002). Our model also introduces in a natural measure of biclustering complexity (number of biclusters and overlapping), We also present a suitable version of the DIC criterion to choose the number of biclusters, a problem that has not been adequately addressed yet. Also neglected in the biclustering literature is the prior information about genes or conditions, and pairwise interaction between them. In our second paper we propose a model that does take into account this information. We adopt a Gaussian plaid model as the model describing the biclustering structure. We incorporate prior information on the dependency between genes and between conditions through relational graphs, one for the genes and another for the conditions. These graphs are conveniently described by auto-logistic models (Besag, 1974, 2001; Winkler, 2003) for genes and conditions. These distributions are pairwise-interaction Gibbs Random Fields for dependent binary data. They can be interpreted as generalizations of the finite-lattice Ising model (Ising, 1925). In our prior, the inter-dependencies between the genes are elicited through the information contained in the GO (Gene Ontology) (Ashburner et al, 2000) collection. The normalizing constants of our priors are not known. So in order to efficiently obtain estimates of the posterior distribution, we developed a hybrid procedure that mixes Metropolis-Hastings with a variant of the Wang-Landau algorithm (Wang and Landau, 2001; Atchade and Liu, 2010). In the third paper we present a new method which simultaneously explores the cluster structure and the informative (discriminative) variables of the data. Our model for clustering and variable selection is inspired by the plaid model, and may be seen as a multiplicative mixture model for overlapping clustering. An extensive simulation and comparison results with the methods proposed by Pan and Shen (JMLR 2007), Witten and Tibshirani (JASA 2010) and others show that our method is very competitive in terms of clustering and much more precise in terms of variable selection than the other methods.

.Murua, A., Stanberry, L. and Stuetzle, W., (2008) "On Potts model clustering, kernel K-means and density estimation," Journal of Computational & Graphical Statistics, 17 (2008), 629-658.
Stanberry, L., Murua, A. and Cordes, D., "Functional connectivity mapping using the ferromagnetic Potts spin model." Human Brain mapping, 29 (2008) 422-440.
Murua, A., and Wicker, N., (2010) "The Conditional-Potts Clustering Model," Journal of Computational & Graphical Statistics, 23 (2014), 717--739. .

This work uncovers the link between several seemingly unconnected kernel-based methods (e.g. kernel K-means, Multiway Normalized Cut). Within the same framework it reveals (and re-introduces) the Potts model as one of the simplest to conceive kernel-based clustering method. This work also connects kernel-based methods to non-parametric kernel density estimation. It uses these two findings to present a powerful kernel-based clustering methodology based on Potts model, kernel density estimation, and consensus clustering. The second paper presents an important application to the exploration of brain connectivity via our adaptive Potts model clustering methodology described in the first paper. Our work has been presented at several conferences all over the world. The third paper presents a Bayesian kernel-based clustering method. The model arises as an embedding of the Potts density for label membership probabilities into an extended Bayesian model for joint data and label membership probabilities. In contrast to previous work, our Conditional-Potts clustering model introduces a complete principled way to do clustering. This is accomplished through the elucidation of an informative prior for the model parameters, and also by the development of a Wang-Landau flat-histogram-like stochastic algorithm to estimate the posterior of the parameters. In particular, the link between the Potts and the random cluster model models allows us to elucidate, using random graph theory, a very informative prior for the temperature parameter.

. Murua, A., and Wicker, N., "Kernel-based Mixture Models for Classification," Computational Statistics, 30(2) (2015) 317-344 (Published online in October 2014).
Murua, A. and Saulnier*, B., (2018), ``High-dimensional data classification with mixtures of hardening distances,''In progress.

To the best of our knowledge kernels have not been used in mixture models. In this work we show that they can be useful for classification. They offer a flexibility in the modeling process through the kernel trick which enables capturing interesting features in some cases more easily than just using the standard methods. Our method is based on mixtures of Gamma distributions. These model the point distances to cluster centroids in the transformed space. The distances are readily computed using the kernel trick. Nested within our model are the two special cases of a mixture of exponentials and the kernel K-means method. We use the Akaike Information criterion to select an appropriate parsimonious model for the data at hand. A comparison with other popular generative classification methods shows that our method is very competitive and computationally efficient. The fact that the squared distance of a multivariate normal variable is a gamma variable is the base to represent each data class by a mixture of gamma distributions. On the other hand, Leonard and Gauvin (2013) modeled the distance of untransformed high-dimension observations to centroids by normal distributions applying the asymptotic theory of the central limit theorem to the data dimension. In the second work, we show that an alternative asymptotic distribution of the distances can be used. This distribution turns out to be a generalized extension of the gamma distribution. We refer to it as a non-central generalized gamma distribution, or NCG-gamma law, for short. We propose to replace the gamma mixture distribution used in the KMM model with a mixture formed by NCG-Gamma distributions when the dimension of the data is big. One advantage of this law with respect to the gamma one used in the original KMM is that it has a stronger theoretical basis. We apply this model to large data sets from gene expression data and compare its performance with that of KMM with gamma laws.

. Murua, A., Stuetzle, W., Tantrum, J. and Sieberts, S., "Model based document classification and clustering." International Journal of Statistics and Tomography, 8 (2008) 1-25.
Tantrum, J., Murua, A. and Stuetzle, W. (2004), "Hierarchical model-based clustering of large datasets through fractionation and refractionation," Information Systems, 29, 315-326.
Tantrum, J., Murua, A. and Stuetzle, W. (2003), "Assessment and pruning of hierarchical model-based clustering," Proceedings of the 9th International Conference on Knowledge Discovery and Data Mining (KDD03), 197-205.

In this work we develop a complete methodology for document classification and clustering based on model-based clustering. This explicitly models the data as being drawn from a Gaussian mixture. One main advantage of our approach is the ability to automatically select the number of clusters present in the document collection via Bayes factors. We also reviewed fractionation, an idea originally conceived by Cutting, Karger, Pedersen and Tukey (1992) for non-parametric hierarchical clustering of large data sets, and described an adaptation of it that extends model-based clustering to large data sets. We called this procedure model-based fractionation. The main idea is to reduce the data size by working with meta-data form by clusters of data points. Each meta-data is formed recursively and carries the information contained in the empirical moments of the associated cluster. We also conceived a further extension of fractionation, called model-based refractionation. This latter procedure proved successful even when dealing with di cult situations comprising sets with large numbers of small clusters. A key advantage of model-based refractionation over other competing clustering of large data sets approaches is that it does not require the number of clusters be known a priori. Part of this work was first presented at the 2002 International Conference On Data Mining and Knowledge Discovery (KDD’02) in Edmonton, Canada, where it was selected as one of the best papers. We were later on invited to submit our extended work to the Information Systems Journal.

.Yeung, K.Y., Fraley, C., Murua, A., Raftery, A. and Ruzzo, L. (2001), "Model-Based clustering and data transformations for gene expression data," Bioinformatics, 17, 977-987.

This is a pioneering work on exploratory analysis of gene-expression microarray data. In it we benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external validation were available. We showed that on real gene expression data model-based clustering performed as well as leading heuristic clustering methods. But our approach has the advantage of suggesting the number of clusters (through the Bayesian information criterion BIC) as well as an appropriate model. We also assessed the degree to which real gene expression data fits the multivariate Gaussian mixture assumptions on several common data transformations. We showed that suitable chosen transformations can greatly enhance normality in gene expression data, and that models have varying performance on data sets that are transformed di erently. This paper has been identified by Thomson-ISI as one of the most cited papers in the research area of Gene Expression Data, and it continues to be cited in most journals that publish work on bioinformatics.

. Murua, A. Martin, D. (2005) "S+Bayes" library.

Doug Martin and I have conceived, developed and implemented several Bayesian hierarchical generalized linear mixed models. Among them are linear, Poisson and Binomial models. The fruit of this work has been compiled in the so-called S+Bayes software library (written in C++ with Graphical User Interface in S-PLUS). The goal is to facilitate Bayesian inference techniques to a larger audience than just statisticians in Academia. The library is freely available from Insightful Corporation.

. Murua, A. (2002), "Upper bounds for error rates associated to linear combination of classifiers, "IEEE Transactions on Pattern Analysis and Machine Intelligence, 9, 591-602.

In this work I introduced a useful notion of weak dependence between many classifiers built using the same training data. I showed that if both this dependence is low and the expected margins are large, then decision rules based on linear combinations of these classifiers can achieve error rates that decrease exponentially fast. Several experiments showed that the three common methods used to construct multiple classifiers from the same training data -bagging, boosting and randomized trees (random forests)- produce mutually weakly dependent trees. Furthermore, these results also suggested that there is a trade-off between weak dependence and expected margins: to compensate for low expected margins, there should be low mutual dependence among the classifiers making up the combination. This result has been cited in the engineering and pattern recognition literature as M "Murua's theory."

. Amit, Y. and Murua, A. (2001), "Speech recognition using randomized relational decision trees," IEEE Transactions on Speech and Audio Processing, 9, 333-341..

In this work we recognized speech signals using a large collection of coarse acoustic events which describe temporal relations between a few local cues in the spectrogram. The invariance to changes in duration of speech signal events was addressed by defining temporal relations in a rather coarse manner, allowing a large degree of slack. The problem is of combinatorial nature. However our acoustic events were defined in terms of labeled graphs and hence inherited a partial ordering. This allowed us to use multiple randomized trees to access the large pool of acoustic events in a systematic manner. The trees were aggregated to produce a classifier. We showed that the learning stage of this procedure is much more efficient than that for hidden Markov models (HMM). It also performed much better than HMM in our experiments. This work was presented as an invited talk at the World Congress of the Bernoulli Society.