-
A random matrix theory perspective on the spectrum of learned features and asymptotic generalization capabilities,
Yatin Dandi, Luca Pesce,
Hugo Cui,
Florent Krzakala, Yue M Lu, Bruno Loureiro,
AISTATS 2025 (oral)
[arXiv]
Abstract:
A key property of neural networks is their capacity to adapt to data during training.
Yet, our current mathematical understanding of feature learning and its relationship to generalization remain limited.
In this work, we provide a random matrix analysis of how fully-connected two-layer neural networks adapt to the
target function after a single, but aggressive, gradient descent step. We rigorously establish the equivalence between
the updated features and an isotropic spiked random feature model, in the limit of large batch size. For the latter
model, we derive a deterministic equivalent description of the feature empirical covariance matrix in terms of
certain low-dimensional operators. This allows us to sharply characterize the impact of training in the asymptotic
feature spectrum, and in particular, provides a theoretical grounding for how the tails of the feature spectrum
modify with training. The deterministic equivalent further yields the exact asymptotic generalization error,
shedding light on the mechanisms behind its improvement in the presence of feature learning. Our result goes
beyond standard random matrix ensembles, and therefore we believe it is of independent technical interest.
In contrast to previous work, our result holds in the challenging maximal learning rate regime, is fully
rigorous, and allows for finitely supported second-layer initialization, which turns out to be crucial for
studying the functional expressivity of the learned features. This provides a sharp description of the
impact of feature learning on the generalization of two-layer neural networks, beyond the random features
and lazy training regimes.
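A minimal numpy sketch of the setting described above may help fix ideas: a two-layer network whose first layer takes a single large gradient step, after which the spectrum of the empirical covariance of the post-activation features is inspected. All dimensions, the tanh activation, the single-index target, and the sqrt(p) learning-rate scaling are illustrative assumptions, not the paper's exact setup.
```python
import numpy as np

# Illustrative dimensions (assumptions, not the paper's scalings)
d, p, n = 500, 500, 2000                           # input dim, width, batch size
rng = np.random.default_rng(0)

W0 = rng.standard_normal((p, d)) / np.sqrt(d)      # first-layer weights at initialization
a = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)   # second layer, finitely supported init
theta = rng.standard_normal(d) / np.sqrt(d)        # teacher direction (single-index target)

X = rng.standard_normal((n, d))
y = np.tanh(X @ theta)                             # illustrative non-linear target

# One "aggressive" gradient step on the first layer (square loss), with a large
# learning rate scaling as sqrt(p) -- an assumption made only for illustration.
feats0 = np.tanh(X @ W0.T)                         # (n, p) post-activations at init
err = feats0 @ a - y
grad_W = (a[:, None] * ((err * (1 - feats0**2).T) @ X)) / n
W1 = W0 - np.sqrt(p) * grad_W

# Spectrum of the empirical covariance of the updated features
feats1 = np.tanh(X @ W1.T)
cov = feats1.T @ feats1 / n
eigs = np.linalg.eigvalsh(cov)
print("largest eigenvalues:", np.sort(eigs)[-5:])  # training affects outliers and tails
```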
-
A phase transition between positional and semantic learning in a solvable model of dot-product attention,
Hugo Cui,
Freya Behrens, Florent Krzakala, Lenka Zdeborová, NeurIPS 2024 (spotlight)
[arXiv]
Abstract:
We investigate how a dot-product attention layer learns a positional attention matrix
(with tokens attending to each other based on their respective positions) and a semantic attention matrix
(with tokens attending to each other based on their meaning). For an algorithmic task, we experimentally
show how the same simple architecture can learn to implement a solution using either the positional or
semantic mechanism. On the theoretical side, we study the learning of a non-linear self-attention layer
with trainable tied and low-rank query and key matrices. In the asymptotic limit of high-dimensional data
and a comparably large number of training samples, we provide a closed-form characterization of the global
minimum of the non-convex empirical loss landscape. We show that this minimum corresponds to either a
positional or a semantic mechanism and evidence an emergent phase transition from the former to the latter
with increasing sample complexity. Finally, we compare the dot-product attention layer to a linear positional
baseline, and show that it outperforms the latter using the semantic mechanism, provided it has access to
sufficient data.
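For concreteness, here is a minimal sketch of a single non-linear self-attention layer with a tied, low-rank query/key matrix, the architecture studied above. The sequence length, embedding dimension, rank, and the softmax non-linearity are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
L_seq, d, r = 8, 32, 1                 # sequence length, embedding dim, rank (assumptions)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tied_lowrank_attention(X, Q):
    """Self-attention with a tied, low-rank query/key matrix Q (d x r): the same
    trainable matrix plays the role of both queries and keys."""
    scores = (X @ Q) @ (X @ Q).T / np.sqrt(d)      # (L_seq, L_seq) content-based scores
    A = softmax(scores, axis=-1)                   # row-wise attention weights
    return A @ X                                   # attended representation

X = rng.standard_normal((L_seq, d))                # one sequence of token embeddings
Q = rng.standard_normal((d, r)) / np.sqrt(d)       # trainable tied query/key matrix
print(tied_lowrank_attention(X, Q).shape)          # (L_seq, d)
```
In this parametrization the attention scores depend only on token content; a purely positional mechanism would instead correspond to an attention matrix that is independent of the embeddings, which is the distinction underlying the phase transition above.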
-
Asymptotics of feature learning in two-layer networks after one gradient-step,
Hugo Cui,
Luca Pesce, Yatin Dandi, Florent Krzakala, Yue M Lu, Lenka Zdeborová, Bruno Loureiro, ICML 2024 (spotlight).
[arXiv]
Abstract:
In this manuscript we investigate the problem of how two-layer neural networks learn features from data,
and improve over the kernel regime, after being trained with a single gradient descent step.
Leveraging a connection from (Ba et al., 2022) with a non-linear spiked matrix model and recent progress on
Gaussian universality (Dandi et al., 2023), we provide an exact asymptotic description of the generalization
error in the high-dimensional limit where the number of samples, the width and the input dimension grow at
a proportional rate. We characterize exactly how adapting to the data is crucial for the network to
efficiently learn non-linear functions in the direction of the gradient, whereas at initialization it can
only express linear functions in this regime. To our knowledge, our results provide the first tight
description of the impact of feature learning on the generalization of two-layer neural networks in the
large learning rate regime, beyond perturbative finite-width corrections of the conjugate and neural
tangent kernels.
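The following hedged sketch contrasts the two regimes discussed above: ridge regression on the readout of a random (untrained) first layer versus the same readout fitted after one large gradient step on the first layer. All sizes, the tanh activation, the single-index target, and the sqrt(p) learning rate are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(1)
d, p, n, n_test = 400, 400, 1600, 1000     # illustrative proportional dimensions

theta = rng.standard_normal(d) / np.sqrt(d)
target = lambda X: np.tanh(X @ theta)      # single-index target along one direction

W = rng.standard_normal((p, d)) / np.sqrt(d)
a0 = rng.choice([-1.0, 1.0], size=p) / np.sqrt(p)

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = target(X), target(Xt)

def one_step(W, lr):
    """One full-batch gradient step on the first layer, square loss (sketch)."""
    h = np.tanh(X @ W.T)
    err = h @ a0 - y
    grad = (a0[:, None] * ((err * (1 - h**2).T) @ X)) / n
    return W - lr * grad

def ridge_readout_error(W, lam=1e-2):
    """Train only the readout by ridge regression on the (possibly updated) features."""
    Phi, Phit = np.tanh(X @ W.T), np.tanh(Xt @ W.T)
    b = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
    return np.mean((Phit @ b - yt) ** 2)

print("random features (no step):", ridge_readout_error(W))
print("after one large step     :", ridge_readout_error(one_step(W, lr=np.sqrt(p))))
```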
-
Asymptotics of Learning with Deep Structured (Random) Features,
Dominik Schröder, Daniil Dmitriev,
Hugo Cui,
Bruno Loureiro, ICML 2024.
[arXiv]
Abstract:
For a large class of feature maps we provide a tight asymptotic characterisation of the
test error associated with learning the readout layer, in the high-dimensional limit where
the input dimension, hidden layer widths, and number of training samples are proportionally large.
This characterization is formulated in terms of the population covariance of the features.
Our work is partially motivated by the problem of learning with Gaussian rainbow neural networks,
namely deep non-linear fully-connected networks with random but structured weights,
whose row-wise covariances are further allowed to depend on the weights of previous layers.
For such networks we also derive a closed-form formula for the feature covariance in terms of the
weight matrices. We further find that in some cases our results can capture feature maps learned by deep,
finite-width neural networks trained under gradient descent.
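As an illustration of the objects involved, the sketch below builds a deep random-feature map with structured (row-wise correlated) frozen weights and estimates the population covariance of its last-layer features by Monte Carlo; the asymptotic characterization above is formulated in terms of precisely such a covariance. The specific weight covariance, widths, and activation are illustrative assumptions, not the Gaussian rainbow construction of the paper.
```python
import numpy as np

rng = np.random.default_rng(2)
d, widths = 200, [300, 250]                        # input dim and hidden widths (assumptions)

def structured_weights(fan_out, fan_in):
    """Random weights whose rows share a non-trivial covariance (illustrative choice)."""
    C = np.linalg.qr(rng.standard_normal((fan_in, fan_in)))[0]
    scales = np.linspace(0.5, 1.5, fan_in)
    cov_sqrt = C * np.sqrt(scales)                 # square root of a row covariance
    return (rng.standard_normal((fan_out, fan_in)) @ cov_sqrt.T) / np.sqrt(fan_in)

Ws, fan_in = [], d
for w in widths:
    Ws.append(structured_weights(w, fan_in))
    fan_in = w

def feature_map(X):
    """Deep non-linear feature map with frozen structured weights."""
    h = X
    for W in Ws:
        h = np.tanh(h @ W.T)
    return h

# Monte Carlo estimate of the population covariance of the last-layer features.
Phi = feature_map(rng.standard_normal((20000, d)))
Omega = Phi.T @ Phi / Phi.shape[0]
print(Omega.shape, np.linalg.eigvalsh(Omega)[-3:])
```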
-
Analysis of learning a flow-based generative model from limited sample complexity,
Hugo Cui,
Florent Krzakala, Eric Vanden-Eijnden and Lenka Zdeborová, ICLR 2024.
[arXiv]
Abstract:
We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder,
to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem.
First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a
shallow denoising auto-encoder trained on a finite number n of samples from the target distribution.
Building on this analysis, we provide a sharp description of the corresponding generative flow,
which pushes the base Gaussian density forward to an approximation of the target density.
In particular, we provide closed-form formulae for the distance between the mean of the generated mixture
and the mean of the target mixture, which we show decays as 1/n. Finally,
this rate is shown to be in fact Bayes-optimal.
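As a rough illustration of the generative step, the sketch below pushes a base Gaussian sample through an Euler discretization of a flow ODE whose velocity field is parametrized by a two-layer autoencoder with a skip connection. The weights, the skip-connection schedule, and the tanh non-linearity are random placeholders standing in for the learned quantities analyzed above, not the paper's trained denoiser.
```python
import numpy as np

rng = np.random.default_rng(3)
d, n_steps = 50, 100                               # illustrative dimension and ODE steps

def velocity(x, t, W, b_skip):
    """Placeholder two-layer autoencoder velocity field: a skip connection plus a
    tied-weight non-linear term; trained weights would come from n target samples."""
    return b_skip(t) * x + W.T @ np.tanh(W @ x) / np.sqrt(d)

W = rng.standard_normal((10, d)) / np.sqrt(d)      # placeholder "trained" weights
b_skip = lambda t: 1.0 - t                         # placeholder skip-connection schedule

# Push a base Gaussian sample forward with an Euler discretization of the flow ODE.
x = rng.standard_normal(d)
dt = 1.0 / n_steps
for k in range(n_steps):
    x = x + dt * velocity(x, k * dt, W, b_skip)
print(np.linalg.norm(x))
```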
-
High-dimensional Asymptotics of Denoising Autoencoders,
Hugo Cui,
Lenka Zdeborová, NeurIPS 2023 (spotlight).
[arXiv]
Abstract:
We address the problem of denoising data from a Gaussian mixture using a two-layer non-linear
autoencoder with tied weights and a skip connection. We consider the high-dimensional limit where
the number of training samples and the input dimension jointly tend to infinity while the number of
hidden units remains bounded. We provide closed-form expressions for the denoising mean-squared
test error. Building on this result, we quantitatively characterize the advantage of the considered
architecture over the autoencoder without the skip connection that relates closely to principal
component analysis. We further show that our results accurately capture the learning curves on a
range of real data sets.
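A minimal sketch of the architecture and test metric considered above: a two-layer autoencoder with tied weights and a skip connection, evaluated by its denoising mean-squared error on a Gaussian mixture. The parameters below are random placeholders (in the paper they are learned from training samples), and the dimensions and noise level are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(4)
d, k, n_test, delta = 200, 1, 2000, 1.0            # dims, O(1) hidden units, noise level

mu = rng.standard_normal(d)                        # mixture mean direction
s = rng.choice([-1.0, 1.0], size=n_test)
x_clean = s[:, None] * mu + rng.standard_normal((n_test, d))       # Gaussian mixture
x_noisy = x_clean + np.sqrt(delta) * rng.standard_normal((n_test, d))

def dae(x, W, b):
    """Two-layer autoencoder with tied weights W (k x d) and a skip connection b."""
    return b * x + np.tanh(x @ W.T) @ W / np.sqrt(d)

# Placeholder parameters; in the paper these are learned from the training set.
W = rng.standard_normal((k, d)) / np.sqrt(d)
b = 0.5

mse = np.mean(np.sum((dae(x_noisy, W, b) - x_clean) ** 2, axis=1)) / d
print("denoising test MSE per dimension:", mse)
```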
-
Error rates for kernel classification under source and capacity conditions,
Hugo Cui,
Bruno Loureiro, Florent Krzakala, Lenka Zdeborová, MLST 2023.
[arXiv]
Abstract:
In this manuscript, we consider the problem of kernel classification. Works on kernel regression
have shown that the rate of decay of the prediction error with the number of samples for a large
class of data-sets is well characterized by two quantities: the capacity and source of the data-set.
In this work, we compute the decay rates for the misclassification (prediction) error under the
Gaussian design, for data-sets satisfying source and capacity assumptions. We derive the rates as a
function of the source and capacity coefficients for two standard kernel classification settings, namely
margin-maximizing Support Vector Machines (SVM) and ridge classification, and contrast the two
methods. As a consequence, we find that the known worst-case rates are loose for this class of
data-sets. Finally, we show that the rates presented in this work are also observed on real data-sets.
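The sketch below generates synthetic Gaussian-design data with power-law (source/capacity-like) structure and measures the misclassification error of ridge classification and a linear SVM as the sample size grows. The specific exponent convention for the teacher, the regularization, and the SVM constant are assumptions made only for illustration.
```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(5)
p, alpha, r = 800, 1.5, 0.5                        # feature dim, capacity, source (illustrative)

# Power-law spectrum lambda_i ~ i^{-alpha} and a power-law teacher whose decay is
# controlled by the source exponent (one common convention, assumed here only to
# generate synthetic data).
lam = np.arange(1, p + 1, dtype=float) ** (-alpha)
theta = np.arange(1, p + 1, dtype=float) ** (-(1 + alpha * (2 * r - 1)) / 2)

def sample(n):
    X = rng.standard_normal((n, p)) * np.sqrt(lam)
    return X, np.sign(X @ theta)

Xt, yt = sample(5000)
for n in [100, 400, 1600]:
    X, y = sample(n)
    # Large C approximates the margin-maximizing SVM; both choices are illustrative.
    for clf in [RidgeClassifier(alpha=1e-3), LinearSVC(C=10.0)]:
        err = np.mean(clf.fit(X, y).predict(Xt) != yt)
        print(f"n={n:5d}  {clf.__class__.__name__:15s}  test error={err:.3f}")
```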
-
Bayes-optimal learning of deep random networks of extensive-width,
Hugo Cui,
Florent Krzakala, Lenka Zdeborová, ICML 2023 (oral).
[arXiv]
Abstract:
We consider the problem of learning a target function corresponding to a deep,
extensive-width, non-linear neural network with random Gaussian weights.
We consider the asymptotic limit where the number of samples, the input dimension and the network width are proportionally large and propose a closed-form expression
for the Bayes-optimal test error, for regression and classification tasks. We further compute closed-form expressions for the test errors of
ridge regression, kernel and random features regression. We find, in particular, that optimally regularized ridge regression, as well as kernel regression,
achieve Bayes-optimal performance, while the logistic loss yields a near-optimal test error for classification. We further show numerically
that when the number of samples grows faster than the dimension, ridge and kernel methods become suboptimal, while neural networks achieve test
error close to zero from quadratically many samples.
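To illustrate the learning problem above (not the closed-form Bayes-optimal expressions), the sketch below draws a target from a deep, extensive-width network with random Gaussian weights and compares ridge regression with kernel ridge regression on samples from it. All sizes, the tanh activation, the RBF kernel, and the regularization strengths are illustrative assumptions.
```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(6)
d, widths, n, n_test = 100, [150, 150], 400, 2000  # illustrative proportional sizes

# Target: deep, extensive-width network with random Gaussian weights.
Ws, fan_in = [], d
for w in widths:
    Ws.append(rng.standard_normal((w, fan_in)) / np.sqrt(fan_in))
    fan_in = w
v = rng.standard_normal(fan_in) / np.sqrt(fan_in)

def target(X):
    h = X
    for W in Ws:
        h = np.tanh(h @ W.T)
    return h @ v

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = target(X), target(Xt)

for model in [Ridge(alpha=1e-1), KernelRidge(alpha=1e-1, kernel="rbf", gamma=1.0 / d)]:
    pred = model.fit(X, y).predict(Xt)
    print(model.__class__.__name__, "test MSE:", np.mean((pred - yt) ** 2))
```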
-
Deterministic equivalent and error universality of deep random features learning,
Dominik Schröder*,
Hugo Cui*,
Daniil Dmitriev, Bruno Loureiro, ICML 2023.
[arXiv]
Abstract:
This manuscript considers the problem of learning a random Gaussian
network function using a fully connected network with frozen intermediate layers and a trainable readout layer.
This problem can be seen as a natural generalization of the widely studied random features model to deeper architectures.
First, we prove Gaussian universality of the test error in a ridge regression setting where the learner and target networks share the same intermediate layers,
and provide a sharp asymptotic formula for it. Establishing this result requires proving a deterministic equivalent for traces of the deep
random features sample covariance matrices which can be of independent interest.
Second, we conjecture the asymptotic Gaussian universality of the test error in the more
general setting of arbitrary convex losses and generic learner/target architectures.
We provide extensive numerical evidence for this conjecture, which requires the derivation of closed-form expressions
for the layer-wise post-activation population covariances. In light of our results, we investigate the interplay
between architecture design and implicit regularization.
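As a small numerical companion, the sketch below builds a deep random-feature map with frozen intermediate layers and computes the normalized trace of the resolvent of the sample covariance of its features, the type of quantity for which a deterministic equivalent is established above. Widths, activation, and the spectral parameter are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(7)
d, widths, n = 150, [200, 200], 600                # illustrative proportional sizes

Ws, fan_in = [], d
for w in widths:
    Ws.append(rng.standard_normal((w, fan_in)) / np.sqrt(fan_in))
    fan_in = w

def features(X):
    """Deep random-feature map: frozen intermediate layers, non-linear activations."""
    h = X
    for W in Ws:
        h = np.tanh(h @ W.T)
    return h

# Empirical trace of the resolvent of the sample covariance of the deep features.
Phi = features(rng.standard_normal((n, d)))
cov = Phi.T @ Phi / n
z = 0.1                                            # illustrative spectral/regularization parameter
resolvent_trace = np.trace(np.linalg.inv(cov + z * np.eye(fan_in))) / fan_in
print("normalized resolvent trace:", resolvent_trace)
```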
-
Large deviations of semisupervised learning in the stochastic block model,
Hugo Cui,
Luca Saglietti, Lenka Zdeborová, PRE 2022.
[arXiv]
Abstract:
In semisupervised community detection, the membership of a set of revealed nodes is known in addition to the graph structure
and can be leveraged to achieve better inference accuracies. While previous works investigated the case where the revealed nodes are selected at random,
this paper focuses on correlated subsets leading to atypically high accuracies. In the framework of the dense stochastic block model,
we employ statistical physics methods to derive a large deviation analysis of the number of these rare subsets, as characterized by their free energy.
We find theoretical evidence of a nonmonotonic relationship between reconstruction accuracy and the free energy
associated with the posterior measure of the inference problem. We further discuss possible implications for active learning applications in community detection.
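For readers unfamiliar with the setting, the sketch below samples a dense two-community stochastic block model and reveals the labels of a subset of nodes; the analysis above concerns how atypical, correlated choices of this subset affect the achievable accuracy, which the sketch does not reproduce. The connection probabilities and the revealed fraction are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(8)
N, rho = 300, 0.1                                  # nodes, fraction of revealed labels
p_in, p_out = 0.6, 0.4                             # dense SBM connection probabilities

# Two balanced communities with labels +/-1.
labels = rng.permutation(np.repeat([1, -1], N // 2))

# Dense stochastic block model: edge probability depends on whether labels agree.
prob = np.where(np.equal.outer(labels, labels), p_in, p_out)
A = np.triu(rng.random((N, N)) < prob, k=1)
A = (A | A.T).astype(float)

# Reveal the labels of a subset of nodes; the paper studies how the choice of this
# subset (typical vs. atypically informative) affects the achievable accuracy.
revealed = rng.choice(N, size=int(rho * N), replace=False)
print(A.shape, "revealed nodes:", len(revealed))
```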
-
Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime,
Hugo Cui,
Bruno Loureiro, Florent Krzakala, Lenka Zdeborová, NeurIPS 2021.
[arXiv]
Abstract:
In this manuscript we consider Kernel Ridge Regression (KRR) under the Gaussian design. Exponents for
the decay of the excess generalization error of KRR have been reported in various works under the assumption of power-law decay of eigenvalues of
the feature covariance. These decays were, however, provided for substantially different setups, namely in the noiseless case with constant regularization
and in the noisy optimally regularized case. Intermediary settings have been left substantially uncharted. In this work,
we unify and extend this line of work, providing characterization of all regimes and excess error decay rates that can be observed in terms of
the interplay of noise and regularization. In particular, we show the existence of a crossover in the noisy setting from the noiseless
exponents to their noisy values as the sample complexity is increased. Finally, we illustrate how this crossover can also be observed on real data sets.
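The sketch below mimics the Gaussian-design setting above: features with a power-law covariance spectrum, ridge regression, and a comparison of the excess-error decay with and without label noise. The exponents, the teacher decay, the regularization, and the noise level are illustrative assumptions, not the paper's characterization.
```python
import numpy as np

rng = np.random.default_rng(9)
p, alpha, lam = 1000, 1.5, 1e-3                    # spectrum exponent, ridge penalty (assumptions)

# Gaussian design: power-law covariance spectrum and an illustrative teacher decay.
ev = np.arange(1, p + 1, dtype=float) ** (-alpha)
theta = np.arange(1, p + 1, dtype=float) ** (-1.0)

def excess_error(n, noise_std):
    X = rng.standard_normal((n, p)) * np.sqrt(ev)
    y = X @ theta + noise_std * rng.standard_normal(n)
    theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    diff = theta_hat - theta
    return np.sum(ev * diff**2)                    # population excess risk under this design

for n in [100, 400, 1600]:
    print(f"n={n:5d}  noiseless: {excess_error(n, 0.0):.4f}   noisy: {excess_error(n, 0.5):.4f}")
```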
-
Learning curves of generic features maps for realistic datasets with a teacher-student model,
Bruno Loureiro, Cédric Gerbelot,
Hugo Cui,
Sebastian Goldt, Marc Mézard, Florent Krzakala, Lenka Zdeborová, NeurIPS 2021.
[arXiv]
Abstract:
Teacher-student models provide a framework in which the typical-case performance of high-dimensional
supervised learning can be described in closed form. The assumptions of Gaussian i.i.d. input data underlying the
canonical teacher-student model may, however, be perceived as too restrictive to capture the behaviour of realistic data sets.
In this paper, we introduce a Gaussian covariate generalisation of the model where the teacher and student can act on different spaces,
generated with fixed, but generic feature maps. While still solvable in a closed form, this generalization is able
to capture the learning curves for a broad range of realistic data sets, thus redeeming the potential of the teacher-student framework.
Our contribution is then two-fold: First, we prove a rigorous formula for the asymptotic training loss and generalisation error.
Second, we present a number of situations where the learning curve of the model captures the one of a realistic data set learned with
kernel regression and classification, with out-of-the-box feature maps such as random projections or scattering transforms,
or with pre-learned ones, such as the features learned by training multi-layer neural networks. We discuss both the power and the limitations of the framework.
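A minimal sketch of the Gaussian covariate setup described above: the teacher acts on the input space while the student acts on a different space obtained through a fixed, generic feature map, whose readout is then fitted by ridge regression. The random-projection-plus-tanh map, the sizes, and the regularization are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(10)
d, p, n, n_test, lam = 200, 300, 800, 4000, 1e-2   # illustrative sizes

# Teacher acts directly on the input; the student acts on a different space obtained
# through a fixed, generic feature map (here a random projection followed by a
# non-linearity, as one out-of-the-box choice).
theta_teacher = rng.standard_normal(d) / np.sqrt(d)
F = rng.standard_normal((p, d)) / np.sqrt(d)
student_map = lambda X: np.tanh(X @ F.T)

X, Xt = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
y, yt = X @ theta_teacher, Xt @ theta_teacher

Phi, Phit = student_map(X), student_map(Xt)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)
print("student ridge test error:", np.mean((Phit @ w - yt) ** 2))
```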
-
Large deviations for the perceptron model and consequences for active learning,
Hugo Cui,
Luca Saglietti, Lenka Zdeborová, MSML 2020 & MLST 2021.
[arXiv]
Abstract:
Active learning (AL) is a branch of machine learning that deals with problems where unlabeled
data is abundant yet obtaining labels is expensive. The learning algorithm can query a limited
number of samples to obtain the corresponding labels, which are subsequently used for
supervised learning. In this work, we consider the task of choosing the subset of samples to be
labeled from a fixed finite pool of samples. We assume the pool of samples to be a random matrix
and the ground truth labels to be generated by a single-layer teacher random neural network. We
employ replica methods to analyze the large deviations for the accuracy achieved after supervised
learning on a subset of the original pool. These large deviations then provide optimal achievable
performance bounds for any AL algorithm. We show that the optimal learning performance
can be efficiently approached by simple message-passing AL algorithms. We also provide a
comparison with the performance of some other popular active learning strategies.
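The sketch below sets up the pool-based scenario described above, with a random pool and single-layer teacher labels, and compares labeling a random subset against a simple uncertainty-based selection heuristic. The heuristic, the logistic-regression learner, and all sizes are illustrative assumptions; in particular, this is not the paper's replica analysis or its message-passing algorithms.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)
d, pool_size, budget = 100, 2000, 200              # illustrative dimensions and label budget

X_pool = rng.standard_normal((pool_size, d)) / np.sqrt(d)
w_teacher = rng.standard_normal(d)
y_pool = np.sign(X_pool @ w_teacher)               # single-layer teacher labels

def pool_accuracy(selected):
    """Fit a logistic-regression student on the labeled subset, score on the full pool."""
    clf = LogisticRegression(max_iter=2000).fit(X_pool[selected], y_pool[selected])
    return np.mean(clf.predict(X_pool) == y_pool)

# Baseline: label a random subset of the pool.
random_sel = rng.choice(pool_size, size=budget, replace=False)

# Simple uncertainty-based selection (a stand-in heuristic, not the paper's
# message-passing algorithms): repeatedly label the point closest to the boundary.
pos, neg = np.where(y_pool > 0)[0], np.where(y_pool < 0)[0]
sel = list(rng.choice(pos, 5, replace=False)) + list(rng.choice(neg, 5, replace=False))
while len(sel) < budget:
    clf = LogisticRegression(max_iter=2000).fit(X_pool[sel], y_pool[sel])
    margins = np.abs(clf.decision_function(X_pool))
    margins[sel] = np.inf                          # never re-query an already labeled point
    sel.append(int(np.argmin(margins)))

print("random selection accuracy     :", pool_accuracy(random_sel))
print("uncertainty selection accuracy:", pool_accuracy(np.array(sel)))
```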