Lajoie, Guillaume
- Associate Professor
Faculty of Arts and Science - Department of Mathematics and Statistics
André-Aisenstadt Office 4161
Affiliations
- Member, Centre de recherches mathématiques (CRM)
- Member, CIRCA (Centre interdisciplinaire de recherche sur le cerveau et l'apprentissage)
- Holder, Canada Research Chair in Neural Computation and Interfacing
- Member, Mila (Institut des algorithmes d'apprentissage de Montréal)
Student supervision
Intrinsic latent structures in machine learning: diffusion distances, representation alignment, and graph inference
Theses and supervised dissertations / 2026-02
Natik, Amine
Abstract
Over the last two decades, Machine Learning (ML) and Deep Learning (DL) have revolutionized many domains by uncovering predictive structure in large, high-dimensional datasets. At the same time, leveraging these methods effectively depends critically on uncovering the geometric and latent structures inherent in both the data and the model architectures. The theme uniting this dissertation is that real-world data often reside on or near low-dimensional geometric structures embedded in high-dimensional spaces. By identifying such latent structures (geometric, probabilistic, or combinatorial), we can improve the interpretability, robustness, and scalability of ML methods. Models leveraging these ideas, such as Graph Neural Networks (GNNs), diffusion-based embeddings, and spectral methods, are now commonplace in applications spanning bioinformatics, computer vision, and natural language processing. This dissertation advances the understanding and application of intrinsic geometric and latent structures through four complementary contributions: (1) Diffusion Earth Mover's Distance (Diffusion EMD), a geometric distance between distributions; (2) an investigation of the effectiveness of Centered Kernel Alignment (CKA) for comparing neural representations; (3) GraphPPD, a Bayesian posterior predictive framework for uncertainty-aware graph inference; and (4) a theoretical study of the spectral seriation algorithm and its robustness for recovering latent orderings from noisy graph data.

First, we propose Diffusion EMD, a fast and scalable method for comparing high-dimensional datasets modeled as distributions on a shared data graph. By diffusing probability mass across the graph rather than relying on Euclidean distances, Diffusion EMD aligns more closely with the manifold geometry of the data. It is topologically equivalent to classical EMD with a geodesic ground distance under manifold assumptions, while being more efficient and differentiable. This enables accurate large-scale comparisons, such as those in single-cell biology, where it reveals patient-level structure by embedding sample distances into a global manifold.

Second, we study the effectiveness of CKA, a popular similarity metric for neural representations. We show that CKA can be highly sensitive to simple transformations, such as affine shifts, especially in high-dimensional or small-sample settings. This raises concerns about its reliability and motivates the need for more robust alternatives.

Third, we introduce GraphPPD, a variational Bayesian framework for graph-level prediction tasks. While most methods focus on node- or link-level uncertainty, GraphPPD captures full posterior predictive distributions over graph-level outputs. Built on embeddings from standard GNN backbones, it enables adaptive uncertainty quantification and improves calibration under distribution shifts, supporting more reliable decision-making.

Fourth, we study spectral seriation, a classical method for recovering latent sequential orderings from pairwise similarity data. We analyze its performance under a general class of random graph models, showing that it consistently recovers hidden orderings even in the presence of significant noise. Under mild assumptions on the graph structure, spectral seriation yields ordering estimates with provable convergence rates. These results highlight the robustness of spectral methods in contexts such as genomics and recommender systems.

Collectively, these contributions demonstrate the significance of intrinsic geometry and latent structures in modern ML. This dissertation follows a portfolio-style format, presenting four adapted and integrated research projects. While each chapter addresses a distinct problem, together they illuminate complementary facets of capturing, analyzing, and leveraging latent structure to advance both the theory and practice of machine learning.
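To illustrate the kind of CKA sensitivity discussed above, here is a minimal sketch of *linear* CKA (one common variant; the dissertation's exact setup may differ). It shows that the metric is invariant to an orthogonal rotation of features, yet can change drastically when a single sample is translated far from the rest. The data sizes and shift magnitude are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representations (rows = samples, cols = features)."""
    # Center each feature column
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))

# Invariance: an orthogonal rotation of the features leaves CKA at 1 (up to float error)
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
cka_rotated = linear_cka(X, X @ Q)

# Sensitivity: translating a single sample far from the rest changes CKA substantially
Y = X.copy()
Y[0] += 100.0
cka_shifted = linear_cka(X, Y)
```

Here `cka_rotated` stays at 1 while `cka_shifted` falls well below it, even though only one of 100 samples moved.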
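Spectral seriation, as studied above, is classically implemented by sorting the entries of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue). The sketch below illustrates this on a synthetic similarity matrix with a hidden ordering plus additive noise; the decay rate, noise level, and graph size are illustrative assumptions chosen so that recovery succeeds in this example.

```python
import numpy as np

def spectral_seriation(S):
    """Return an ordering of the nodes of similarity matrix S (up to reversal)."""
    d = S.sum(axis=1)
    L = np.diag(d) - S                  # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    fiedler = vecs[:, 1]                # eigenvector of 2nd-smallest eigenvalue
    return np.argsort(fiedler)

n = 50
rng = np.random.default_rng(1)
latent = rng.permutation(n)             # hidden sequential positions

# Similarity decays with latent distance, corrupted by additive noise
D = np.abs(latent[:, None] - latent[None, :])
S = np.exp(-D / 10.0) + 0.05 * rng.random((n, n))
S = (S + S.T) / 2                       # symmetrize
np.fill_diagonal(S, 0)

order = spectral_seriation(S)
recovered = latent[order]               # near-monotone (up to reversal) on success
```

Sorting by the Fiedler vector is the standard heuristic because, for noiseless similarity matrices that decrease away from the diagonal (Robinson structure), this eigenvector is monotone in the latent order; the dissertation's contribution concerns how robust this remains under random-graph noise.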