-
A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities,
Yatin Dandi, Luca Pesce,
Hugo Cui,
Florent Krzakala, Yue M Lu, Bruno Loureiro,
AISTATS 2025 (oral)
[arXiv]
Abstract:
A key property of neural networks is their capacity to adapt to data during training.
Yet, our current mathematical understanding of feature learning and its relationship to generalization remain limited.
In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the
target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between
the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter
model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of
certain low-dimensional operators. This allows us to sharply characterize the impact of training in the asymptotic
feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum
modify with training. The deterministic equivalent further yields the exact asymptotic generalization error,
shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes
beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest.
In contrast to previous work, our result holds in the challenging maximal learning rate regime, is fully
rigorous, and allows for finitely supported second-layer initialization, which turns out to be crucial for
studying the functional expressivity of the learned features. This provides a sharp description of the
impact of feature learning on the generalization of two-layer neural networks, beyond the random features
and lazy training regimes.
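A minimal numpy sketch of the setting described above may help fix ideas: a two-layer network whose first layer takes a single large gradient step, after which the spectrum of the empirical covariance of the post-activation features is inspected. All dimensions, the tanh activation, the single-index target, and the sqrt(p) learning-rate scaling are illustrative assumptions, not the paper's exact setup.
```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's scalings)
d, p, n = 500, 500, 2000                           # input dim, width, batch size
rng = np.random.default_rng(0)

W0 = rng.standard_normal((p, d)) / np.sqrt(d)      # first-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)   # second layer, finitely supported init
theta = rng.standard_normal(d) / np.sqrt(d)        # teacher direction (single-index target)

X = rng.standard_normal((n, d))
y = np.tanh(X @ theta)                             # illustrative non-linear target

# One "aggressive" gradient step on the first layer (square loss), with a large
# learning rate scaling as sqrt(p) -- an assumption made only for illustration.
feats0 = np.tanh(X @ W0.T)                         # (n, p) post-activations at init
err = feats0 @ a - y
grad_W = (a[:, None] * ((err * (1 - feats0**2).T) @ X)) / n
W1 = W0 - np.sqrt(p) * grad_W

# Spectrum of the empirical covariance of the updated features
feats1 = np.tanh(X @ W1.T)
cov = feats1.T @ feats1 / n
eigs = np.linalg.eigvalsh(cov)
print("largest eigenvalues:", np.sort(eigs)[-5:])  # training affects outliers and tails
```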
-
A phase transition between positional and semantic learning in a solvable model of dot-product attention,
Hugo Cui,
Freya Behrens, Florent Krzakala, Lenka Zdeborová, NeurIPS 2024 (spotlight)
[arXiv]
Abstract:
We investigate how a dot-product attention layer learns a positional attention matrix
(with tokens attending to each other based on their respective positions) and a semantic attention matrix
(with tokens attending to each other based on their meaning). For an algorithmic task, we experimentally
show how the same simple architecture can learn to implement a solution using either the positional or
semantic mechanism. On the theoretical side, we study the learning of a non-linear self-attention layer
with trainable tied and low-rank query and key matrices. In the asymptotic limit of high-dimensional data
and a comparably large number of training samples, we provide a closed-form characterization of the global
minimum of the non-convex empirical loss landscape. We show that this minimum corresponds to either a
positional or a semantic mechanism and evidence an emergent phase transition from the former to the latter
with increasing sample complexity. Finally, we compare the dot-product attention layer to a linear positional
baseline, and show that it outperforms the latter using the semantic mechanism, provided it has access to
sufficient data.
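For concreteness, here is a minimal sketch of a single non-linear self-attention layer with a tied, low-rank query/key matrix, the architecture studied above. The sequence length, embedding dimension, rank, and the softmax non-linearity are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
L_seq, d, r = 8, 32, 1                 # sequence length, embedding dim, rank (assumptions)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tied_lowrank_attention(X, Q):
    """Self-attention with a tied, low-rank query/key matrix Q (d x r): the same
    trainable matrix plays the role of both queries and keys."""
    scores = (X @ Q) @ (X @ Q).T / np.sqrt(d)      # (L_seq, L_seq) content-based scores
    A = softmax(scores, axis=-1)                   # row-wise attention weights
    return A @ X                                   # attended representation

X = rng.standard_normal((L_seq, d))                # one sequence of token embeddings
Q = rng.standard_normal((d, r)) / np.sqrt(d)       # trainable tied query/key matrix
print(tied_lowrank_attention(X, Q).shape)          # (L_seq, d)
```
In this parametrization the attention scores depend only on token content; a purely positional mechanism would instead correspond to an attention matrix that is independent of the embeddings, which is the distinction underlying the phase transition above.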
-
Asymptotics of feature learning in two-layer networks after one gradient-step,
Hugo Cui,
Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M Lu, Lenka Zdeborová, Bruno Loureiro, ICML 2024 (spotlight).
[arXiv]
Abstract:
In this manuscript we investigate the problem of how two-layer neural networks learn features from data,
and improve over the kernel regime, after being trained with a single gradient descent step.
Leveraging a connection from (Ba et al., 2022) with a non-linear spiked matrix model and recent progress on
Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization
error in the high-dimensional limit where the number of samples, the width and the input dimension grow at
a proportional rate. We characterize exactly how adapting to the data is crucial for the network to
efficiently learn non-linear functions in the direction of the gradient, whereas at initialization it can
only express linear functions in this regime. To our knowledge, our results provide the first tight
description of the impact of feature learning on the generalization of two-layer neural networks in the
large learning rate regime, beyond perturbative finite-width corrections of the conjugate and neural
tangent kernels.
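The following hedged sketch contrasts the two regimes discussed above: ridge regression on the readout of a random (untrained) first layer versus the same readout fitted after one large gradient step on the first layer. All sizes, the tanh activation, the single-index target, and the sqrt(p) learning rate are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n, n_test = 400, 400, 1600, 1000     # illustrative proportional dimensions

theta = rng.standard_normal(d) / np.sqrt(d)
target = lambda X: np.tanh(X @ theta)      # single-index target along one direction

W = rng.standard_normal((p, d)) / np.sqrt(d)
a0 = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = target(X), target(Xt)

def one_step(W, lr):
    """One full-batch gradient step on the first layer, square loss (sketch)."""
    h = np.tanh(X @ W.T)
    err = h @ a0 - y
    grad = (a0[:, None] * ((err * (1 - h**2).T) @ X)) / n
    return W - lr * grad

def ridge_readout_error(W, lam=1e-2):
    """Train only the readout by ridge regression on the (possibly updated) features."""
    Phi, Phit = np.tanh(X @ W.T), np.tanh(Xt @ W.T)
    b = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
    return np.mean((Phit @ b - yt) ** 2)

print("random features (no step):", ridge_readout_error(W))
print("after one large step     :", ridge_readout_error(one_step(W, lr=np.sqrt(p))))
```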
-
Asymptotics of Learning with Deep Structured (Random) Features,
Dominik Schröder, Daniil Dmitriev,
Hugo Cui,
Bruno Loureiro, ICML 2024.
[arXiv]
Abstract:
For a large class of feature maps we provide a tight asymptotic characterisation of the
test error associated with learning the readout layer, in the high-dimensional limit where
the input dimension, hidden layer widths, and number of training samples are proportionally large.
This characterization is formulated in terms of the population covariance of the features.
Our work is partially motivated by the problem of learning with Gaussian rainbow neural networks,
namely deep non-linear fully-connected networks with random but structured weights,
whose row-wise covariances are further allowed to depend on the weights of previous layers.
For such networks we also derive a closed-form formula for the feature covariance in terms of the
weight matrices. We further find that in some cases our results can capture feature maps learned by deep,
finite-width neural networks trained under gradient descent.
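As an illustration of the objects involved, the sketch below builds a deep random-feature map with structured (row-wise correlated) frozen weights and estimates the population covariance of its last-layer features by Monte Carlo; the asymptotic characterization above is formulated in terms of precisely such a covariance. The specific weight covariance, widths, and activation are illustrative assumptions, not the Gaussian rainbow construction of the paper.
```python
import numpy as np

rng = np.random.default_rng(2)
d, widths = 200, [300, 250]                        # input dim and hidden widths (assumptions)

def structured_weights(fan_out, fan_in):
    """Random weights whose rows share a non-trivial covariance (illustrative choice)."""
    C = np.linalg.qr(rng.standard_normal((fan_in, fan_in)))[0]
    scales = np.linspace(0.5, 1.5, fan_in)
    cov_sqrt = C * np.sqrt(scales)                 # square root of a row covariance
    return (rng.standard_normal((fan_out, fan_in)) @ cov_sqrt.T) / np.sqrt(fan_in)

Ws, fan_in = [], d
for w in widths:
    Ws.append(structured_weights(w, fan_in))
    fan_in = w

def feature_map(X):
    """Deep non-linear feature map with frozen structured weights."""
    h = X
    for W in Ws:
        h = np.tanh(h @ W.T)
    return h

# Monte Carlo estimate of the population covariance of the last-layer features.
Phi = feature_map(rng.standard_normal((20000, d)))
Omega = Phi.T @ Phi / Phi.shape[0]
print(Omega.shape, np.linalg.eigvalsh(Omega)[-3:])
```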
-
Analysis of learning a flow-based generative model from limited sample complexity,
Hugo Cui,
Florent Krzakala, Eric Vanden-Eijnden and Lenka Zdeborová, ICLR 2024.
[arXiv]
Abstract:
We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder,
to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem.
First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a
shallow denoising auto-encoder trained on a finite number n of samples from the target distribution.
Building on this analysis, we provide a sharp description of the corresponding generative flow,
which pushes the base Gaussian density forward to an approximation of the target density.
In particular, we provide closed-form formulae for the distance between the mean of the generated mixture
and the mean of the target mixture, which we show decays as 1/n. Finally,
this rate is shown to be in fact Bayes-optimal.
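As a rough illustration of the generative step, the sketch below pushes a base Gaussian sample through an Euler discretization of a flow ODE whose velocity field is parametrized by a two-layer autoencoder with a skip connection. The weights, the skip-connection schedule, and the tanh non-linearity are random placeholders standing in for the learned quantities analyzed above, not the paper's trained denoiser.
```python
import numpy as np

rng = np.random.default_rng(3)
d, n_steps = 50, 100                               # illustrative dimension and ODE steps

def velocity(x, t, W, b_skip):
    """Placeholder two-layer autoencoder velocity field: a skip connection plus a
    tied-weight non-linear term; trained weights would come from n target samples."""
    return b_skip(t) * x + W.T @ np.tanh(W @ x) / np.sqrt(d)

W = rng.standard_normal((10, d)) / np.sqrt(d)      # placeholder "trained" weights
b_skip = lambda t: 1.0 - t                         # placeholder skip-connection schedule

# Push a base Gaussian sample forward with an Euler discretization of the flow ODE.
x = rng.standard_normal(d)
dt = 1.0 / n_steps
for k in range(n_steps):
    x = x + dt * velocity(x, k * dt, W, b_skip)
print(np.linalg.norm(x))
```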
-
High-dimensional Asymptotics of Denoising Autoencoders,
Hugo Cui,
Lenka Zdeborová, NeurIPS 2023 (spotlight).
[arXiv]
Abstract:
We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear
autoencoder with tied weights and a skip connection. We consider the high-dimensional limit where
the number of training samples and the input dimension jointly tend to infinity while the number of
hidden units remains bounded. We provide closed-form expressions for the denoising mean-squared
test error. Building on this result, we quantitatively characterize the advantage of the considered
architecture over the autoencoder without the skip connection that relates closely to principal
component analysis. We further show that our results accurately capture the learning curves on a
range of real data sets.
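A minimal sketch of the architecture and test metric considered above: a two-layer autoencoder with tied weights and a skip connection, evaluated by its denoising mean-squared error on a Gaussian mixture. The parameters below are random placeholders (in the paper they are learned from training samples), and the dimensions and noise level are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n_test, delta = 200, 1, 2000, 1.0            # dims, O(1) hidden units, noise level

mu = rng.standard_normal(d)                        # mixture mean direction
s = rng.choice([-1.0, 1.0], size=n_test)
x_clean = s[:, None] * mu + rng.standard_normal((n_test, d))       # Gaussian mixture
x_noisy = x_clean + np.sqrt(delta) * rng.standard_normal((n_test, d))

def dae(x, W, b):
    """Two-layer autoencoder with tied weights W (k x d) and a skip connection b."""
    return b * x + np.tanh(x @ W.T) @ W / np.sqrt(d)

# Placeholder parameters; in the paper these are learned from the training set.
W = rng.standard_normal((k, d)) / np.sqrt(d)
b = 0.5

mse = np.mean(np.sum((dae(x_noisy, W, b) - x_clean) ** 2, axis=1)) / d
print("denoising test MSE per dimension:", mse)
```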
-
Error rates for kernel classification under source and capacity conditions,
Hugo Cui,
Bruno Loureiro, Florent Krzakala, Lenka Zdeborová, MLST 2023.
[arXiv]
Abstract:
In this manuscript, we consider the problem of kernel classification. Works on kernel regression
have shown that the rate of decay of the prediction error with the number of samples for a large
class of data-sets is well characterized by two quantities: the capacity and source of the data-set.
In this work, we compute the decay rates for the misclassification (prediction) error under the
Gaussian design, for data-sets satisfying source and capacity assumptions. We derive the rates as a
function of the source and capacity coefficients for two standard kernel classification settings, namely
margin-maximizing Support Vector Machines (SVM) and ridge classification, and contrast the two
methods. As a consequence, we find that the known worst-case rates are loose for this class of
data-sets. Finally, we show that the rates presented in this work are also observed on real data-sets.
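The sketch below generates synthetic Gaussian-design data with power-law (source/capacity-like) structure and measures the misclassification error of ridge classification and a linear SVM as the sample size grows. The specific exponent convention for the teacher, the regularization, and the SVM constant are assumptions made only for illustration.
```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(5)
p, alpha, r = 800, 1.5, 0.5                        # feature dim, capacity, source (illustrative)

# Power-law spectrum lambda_i ~ i^{-alpha} and a power-law teacher whose decay is
# controlled by the source exponent (one common convention, assumed here only to
# generate synthetic data).
lam = np.arange(1, p + 1, dtype=float) ** (-alpha)
theta = np.arange(1, p + 1, dtype=float) ** (-(1 + alpha * (2 * r - 1)) / 2)

def sample(n):
    X = rng.standard_normal((n, p)) * np.sqrt(lam)
    return X, np.sign(X @ theta)

Xt, yt = sample(5000)
for n in [100, 400, 1600]:
    X, y = sample(n)
    # Large C approximates the margin-maximizing SVM; both choices are illustrative.
    for clf in [RidgeClassifier(alpha=1e-3), LinearSVC(C=10.0)]:
        err = np.mean(clf.fit(X, y).predict(Xt) != yt)
        print(f"n={n:5d}  {clf.__class__.__name__:15s}  test error={err:.3f}")
```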
-
Bayes-optimal learning of deep random networks of extensive-width,
Hugo Cui,
Florent Krzakala, Lenka Zdeborová, ICML 2023 (oral).
[arXiv]
Abstract:
We consider the problem of learning a target function corresponding to a deep,
extensive-width, non-linear neural network with random Gaussian weights.
We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large and propose a closed-form expression
for the Bayes-optimal test error, for regression and classification tasks. We further compute closed-form expressions for the test errors of
ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression,
achieve Bayes-optimal performance, while the logistic loss yields a near-optimal test error for classification. We further show numerically
that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test
error close to zero from quadratically many samples.
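To illustrate the learning problem above (not the closed-form Bayes-optimal expressions), the sketch below draws a target from a deep, extensive-width network with random Gaussian weights and compares ridge regression with kernel ridge regression on samples from it. All sizes, the tanh activation, the RBF kernel, and the regularization strengths are illustrative assumptions.
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(6)
d, widths, n, n_test = 100, [150, 150], 400, 2000  # illustrative proportional sizes

# Target: deep, extensive-width network with random Gaussian weights.
Ws, fan_in = [], d
for w in widths:
    Ws.append(rng.standard_normal((w, fan_in)) / np.sqrt(fan_in))
    fan_in = w
v = rng.standard_normal(fan_in) / np.sqrt(fan_in)

def target(X):
    h = X
    for W in Ws:
        h = np.tanh(h @ W.T)
    return h @ v

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = target(X), target(Xt)

for model in [Ridge(alpha=1e-1), KernelRidge(alpha=1e-1, kernel="rbf", gamma=1.0 / d)]:
    pred = model.fit(X, y).predict(Xt)
    print(model.__class__.__name__, "test MSE:", np.mean((pred - yt) ** 2))
```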
-
Deterministic equivalent and error universality of deep random features learning,
Dominik Schröder*,
Hugo Cui*,
Daniil Dmitriev, Bruno Loureiro, ICML 2023.
[arXiv]
Abstract:
This manuscript considers the problem of learning a random Gaussian
network function using a fully connected network with frozen intermediate layers and a trainable readout layer.
This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures.
First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers,
and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep
random features sample covariance matrices which can be of independent interest.
Second, we conjecture the asymptotic Gaussian universality of the test error in the more
general setting of arbitrary convex losses and generic learner/target architectures.
We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions
for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay
between architecture design and implicit regularization.
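As a small numerical companion, the sketch below builds a deep random-feature map with frozen intermediate layers and computes the normalized trace of the resolvent of the sample covariance of its features, the type of quantity for which a deterministic equivalent is established above. Widths, activation, and the spectral parameter are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(7)
d, widths, n = 150, [200, 200], 600                # illustrative proportional sizes

Ws, fan_in = [], d
for w in widths:
    Ws.append(rng.standard_normal((w, fan_in)) / np.sqrt(fan_in))
    fan_in = w

def features(X):
    """Deep random-feature map: frozen intermediate layers, non-linear activations."""
    h = X
    for W in Ws:
        h = np.tanh(h @ W.T)
    return h

# Empirical trace of the resolvent of the sample covariance of the deep features.
Phi = features(rng.standard_normal((n, d)))
cov = Phi.T @ Phi / n
z = 0.1                                            # illustrative spectral/regularization parameter
resolvent_trace = np.trace(np.linalg.inv(cov + z * np.eye(fan_in))) / fan_in
print("normalized resolvent trace:", resolvent_trace)
```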
-
Large deviations of semisupervised learning in the stochastic block model,
Hugo Cui,
Luca Saglietti, Lenka Zdeborová, PRE 2022.
[arXiv]
Abstract:
In semisupervised community detection, the membership of a set of revealed nodes is known in addition to the graph structure
and can be leveraged to achieve better inference accuracies. While previous works investigated the case where the revealed nodes are selected at random,
this paper focuses on correlated subsets leading to atypically high accuracies. In the framework of the dense stochastic block model,
we employ statistical physics methods to derive a large deviation analysis of the number of these rare subsets, as characterized by their free energy.
We find theoretical evidence of a nonmonotonic relationship between reconstruction accuracy and the free energy
associated with the posterior measure of the inference problem. We further discuss possible implications for active learning applications in community detection.
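For readers unfamiliar with the setting, the sketch below samples a dense two-community stochastic block model and reveals the labels of a subset of nodes; the analysis above concerns how atypical, correlated choices of this subset affect the achievable accuracy, which the sketch does not reproduce. The connection probabilities and the revealed fraction are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(8)
N, rho = 300, 0.1                                  # nodes, fraction of revealed labels
p_in, p_out = 0.6, 0.4                             # dense SBM connection probabilities

# Two balanced communities with labels +/-1.
labels = rng.permutation(np.repeat([1, -1], N // 2))

# Dense stochastic block model: edge probability depends on whether labels agree.
prob = np.where(np.equal.outer(labels, labels), p_in, p_out)
A = np.triu(rng.random((N, N)) < prob, k=1)
A = (A | A.T).astype(float)

# Reveal the labels of a subset of nodes; the paper studies how the choice of this
# subset (typical vs. atypically informative) affects the achievable accuracy.
revealed = rng.choice(N, size=int(rho * N), replace=False)
print(A.shape, "revealed nodes:", len(revealed))
```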
-
Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime,
Hugo Cui,
Bruno Loureiro, Florent Krzakala, Lenka Zdeborová, NeurIPS 2021.
[arXiv]
Abstract:
In this manuscript we consider Kernel Ridge Regression (KRR) under the Gaussian design. Exponents for
the decay of the excess generalization error of KRR have been reported in various works under the assumption of power-law decay of eigenvalues of
the feature covariance. These decays were, however, provided for substantially different setups, namely in the noiseless case with constant regularization
and in the noisy optimally regularized case. Intermediary settings have been left substantially uncharted. In this work,
we unify and extend this line of work, providing characterization of all regimes and excess error decay rates that can be observed in terms of
the interplay of noise and regularization. In particular, we show the existence of a crossover in the noisy setting from the noiseless
exponents to their noisy values as the sample complexity is increased. Finally, we illustrate how this crossover can also be observed on real data sets.
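The sketch below mimics the Gaussian-design setting above: features with a power-law covariance spectrum, ridge regression, and a comparison of the excess-error decay with and without label noise. The exponents, the teacher decay, the regularization, and the noise level are illustrative assumptions, not the paper's characterization.
```python
import numpy as np

rng = np.random.default_rng(9)
p, alpha, lam = 1000, 1.5, 1e-3                    # spectrum exponent, ridge penalty (assumptions)

# Gaussian design: power-law covariance spectrum and an illustrative teacher decay.
ev = np.arange(1, p + 1, dtype=float) ** (-alpha)
theta = np.arange(1, p + 1, dtype=float) ** (-1.0)

def excess_error(n, noise_std):
    X = rng.standard_normal((n, p)) * np.sqrt(ev)
    y = X @ theta + noise_std * rng.standard_normal(n)
    theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    diff = theta_hat - theta
    return np.sum(ev * diff**2)                    # population excess risk under this design

for n in [100, 400, 1600]:
    print(f"n={n:5d}  noiseless: {excess_error(n, 0.0):.4f}   noisy: {excess_error(n, 0.5):.4f}")
```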
-
Learning curves of generic features maps for realistic datasets with a teacher-student model,
Bruno Loureiro, Cédric Gerbelot,
Hugo Cui,
Sebastian Goldt, Marc Mézard, Florent Krzakala, Lenka Zdeborová, NeurIPS 2021.
[arXiv]
Abstract:
Teacher-student models provide a framework in which the typical-case performance of high-dimensional
supervised learning can be described in closed form. The assumptions of Gaussian i.i.d. input data underlying the
canonical teacher-student model may, however, be perceived as too restrictive to capture the behaviour of realistic data sets.
In this paper, we introduce a Gaussian covariate generalisation of the model where the teacher and student can act on different spaces,
generated with fixed, but generic feature maps. While still solvable in a closed form, this generalization is able
to capture the learning curves for a broad range of realistic data sets, thus redeeming the potential of the teacher-student framework.
Our contribution is then two-fold: First, we prove a rigorous formula for the asymptotic training loss and generalisation error.
Second, we present a number of situations where the learning curve of the model captures the one of a realistic data set learned with
kernel regression and classification, with out-of-the-box feature maps such as random projections or scattering transforms,
or with pre-learned ones, such as the features learned by training multi-layer neural networks. We discuss both the power and the limitations of the framework.
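A minimal sketch of the Gaussian covariate setup described above: the teacher acts on the input space while the student acts on a different space obtained through a fixed, generic feature map, whose readout is then fitted by ridge regression. The random-projection-plus-tanh map, the sizes, and the regularization are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(10)
d, p, n, n_test, lam = 200, 300, 800, 4000, 1e-2   # illustrative sizes

# Teacher acts directly on the input; the student acts on a different space obtained
# through a fixed, generic feature map (here a random projection followed by a
# non-linearity, as one out-of-the-box choice).
theta_teacher = rng.standard_normal(d) / np.sqrt(d)
F = rng.standard_normal((p, d)) / np.sqrt(d)
student_map = lambda X: np.tanh(X @ F.T)

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = X @ theta_teacher, Xt @ theta_teacher

Phi, Phit = student_map(X), student_map(Xt)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
print("student ridge test error:", np.mean((Phit @ w - yt) ** 2))
```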
-
Large deviations for the perceptron model and consequences for active learning,
Hugo Cui,
Luca Saglietti, Lenka Zdeborová, MSML 2020 & MLST 2021.
[arXiv]
Abstract:
Active learning (AL) is a branch of machine learning that deals with problems where unlabeled
data is abundant yet obtaining labels is expensive. The learning algorithm can query a limited
number of samples to obtain the corresponding labels, which are subsequently used for
supervised learning. In this work, we consider the task of choosing the subset of samples to be
labeled from a fixed finite pool of samples. We assume the pool of samples to be a random matrix
and the ground truth labels to be generated by a single-layer teacher random neural network. We
employ replica methods to analyze the large deviations for the accuracy achieved after supervised
learning on a subset of the original pool. These large deviations then provide optimal achievable
performance bounds for any AL algorithm. We show that the optimal learning performance
can be efficiently approached by simple message-passing AL algorithms. We also provide a
comparison with the performance of some other popular active learning strategies.
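The sketch below sets up the pool-based scenario described above, with a random pool and single-layer teacher labels, and compares labeling a random subset against a simple uncertainty-based selection heuristic. The heuristic, the logistic-regression learner, and all sizes are illustrative assumptions; in particular, this is not the paper's replica analysis or its message-passing algorithms.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
d, pool_size, budget = 100, 2000, 200              # illustrative dimensions and label budget

X_pool = rng.standard_normal((pool_size, d)) / np.sqrt(d)
w_teacher = rng.standard_normal(d)
y_pool = np.sign(X_pool @ w_teacher)               # single-layer teacher labels

def pool_accuracy(selected):
    """Fit a logistic-regression student on the labeled subset, score on the full pool."""
    clf = LogisticRegression(max_iter=2000).fit(X_pool[selected], y_pool[selected])
    return np.mean(clf.predict(X_pool) == y_pool)

# Baseline: label a random subset of the pool.
random_sel = rng.choice(pool_size, size=budget, replace=False)

# Simple uncertainty-based selection (a stand-in heuristic, not the paper's
# message-passing algorithms): repeatedly label the point closest to the boundary.
pos, neg = np.where(y_pool > 0)[0], np.where(y_pool < 0)[0]
sel = list(rng.choice(pos, 5, replace=False)) + list(rng.choice(neg, 5, replace=False))
while len(sel) < budget:
    clf = LogisticRegression(max_iter=2000).fit(X_pool[sel], y_pool[sel])
    margins = np.abs(clf.decision_function(X_pool))
    margins[sel] = np.inf                          # never re-query an already labeled point
    sel.append(int(np.argmin(margins)))

print("random selection accuracy     :", pool_accuracy(random_sel))
print("uncertainty selection accuracy:", pool_accuracy(np.array(sel)))
```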