Asymptotic theory of in-context learning by linear attention

Yue M Lu; Mary Letey; Jacob A Zavatone-Veth; Anindita Maiti; Cengiz Pehlevan

doi:10.1073/pnas.2502599122

. 2025 Jul 9;122(28):e2502599122. doi: 10.1073/pnas.2502599122

Asymptotic theory of in-context learning by linear attention

Yue M Lu ^a,¹, Mary Letey ^a, Jacob A Zavatone-Veth ^a,^b,^c,^d,², Anindita Maiti ^e,², Cengiz Pehlevan ^a,^b,^f,¹

PMCID: PMC12280938 PMID: 40632569

Significance

Attention-based architectures are a powerful force in modern AI. In particular, the emergence of in-context learning abilities enables task generalization far beyond the original next-token prediction objective. There is still significant room for progress in a rigorous understanding of how and when such capacity emerges. Our work provides a sharp characterization of ICL in an analytically tractable model, offering insights into the sample complexity and data quality requirements. Furthermore, we demonstrate that these insights can be applied to more complex, realistic architectures.

Keywords: machine learning, transformers, in-context learning

Abstract

Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers’ success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model’s behavior between low and high task diversity regimes: in the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine ICL and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.

Since their introduction by Vaswani et al. in (1), Transformers have become a cornerstone of modern AI. Originally designed for sequence modeling tasks, such as language modeling and machine translation, Transformers achieve state-of-the art performance across many domains, even those that are not inherently sequential (2). Most strikingly, they underpin the breakthroughs achieved by large language models such as BERT (3), LLaMA (4), the GPT series (5–8), the Claude model family (9), and DeepSeek R1 (10).

The technological advancements enabled by Transformers have inspired a substantial body of research aimed at understanding their working principles. One key observation is that language models gain new behaviors and skills as their number of parameters and the size of their training datasets grow (7, 11–13). A particularly important emergent skill is in-context learning (ICL), which describes the model’s ability to learn and execute tasks based on the context provided within the input itself, without the need for explicit prior training on those specific tasks. To give an example from natural language processing, a pretrained large language model might be able to successfully translate English to Italian after being prompted with a few example translations, even if it has not been specifically pretrained on that translation task (7). ICL enables language models to perform new, specialized tasks without retraining, which is arguably a key reason for their general-purpose abilities.

Despite recent progress in understanding ICL, fundamental questions about when and how ICL emerges in large language models remain unresolved. These models are typically trained or pretrained using a next-token prediction objective, but the impact of various algorithmic and hyperparameter choices during pretraining on ICL performance is still not well understood. What mechanisms do Transformers implement to facilitate ICL? How many pretraining examples are necessary for ICL capabilities to arise? Furthermore, how many examples must be provided within the input context for the model to successfully perform an in-context task? Another important question is the degree of task diversity in the training data: how diverse must the training tasks be to enable ICL for truly novel tasks that the model has not encountered before?

In this paper, we address these questions by investigating the ICL capabilities of a linear attention module for linear regression tasks. This simplified model setting, which exhibits the minimal architectural features to perform ICL for linear regression, enables us to derive an asymptotically precise theory of ICL performance, elucidating its exact dependence on various hyperparameters. In the remainder of this section, we first provide an overview of related work on ICL. We then summarize our main contributions.

1.1. Related Work.

1.1.1. ICL in transformer architectures.

The striking ICL abilities of Transformers were thrust to the fore by Brown et al.’s (7) work on GPT-3. Focusing on natural language processing (NLP) tasks, they showed that ICL performance dramatically improves with an increase in the number of model parameters, with an increase in the number of examples in the model’s context, and with the addition of a natural language task description. In subsequent work, Wei et al. (13) proposed that the emergence of ICL with increasing scale is an abrupt, unpredictable transition. This perspective has substantially influenced proposed theories for the emergence of ICL (14). However, Schaeffer et al. (15) have challenged the idea that the emergence of ICL is unpredictable; they suggest that appropriately chosen measures of otherwise hidden progress (16) reveal that ICL gradually develops with scale.

1.1.2. Empirical studies of synthetic ICL tasks.

Though ICL in NLP is both impressive and useful, these natural data do not allow precise experimental control and study. Toward a fine-grained understanding of the conditions required for ICL, many recent works have explored ICL of parametrically controllable synthetic tasks, notably linear regression and classification. These works have identified various features of pretraining data distributions that contribute to the emergence of ICL (17–21). Closely related to our work is a study of ICL of linear regression by Raventós et al. (20). Their work identified a task diversity threshold for the emergence of ICL, below which a pretrained Transformer behaves as a Bayesian estimator with a prior determined by the limited set of pretraining tasks. Above this threshold, the model’s performance approaches that of within-context optimal Bayesian ridge regression, corresponding to a Gaussian prior over all tasks, including those not seen during pretraining. A motivating objective of our work is to provide a theoretical account of the empirical findings made by Raventós et al. (20), which underscore the roles of task diversity, regularization, model capacity, and data structure in the emergence of ICL.

1.1.3. Theoretical studies of ICL.

Many theoretical studies of ICL have centered on the idea that Transformers learn a particular algorithm during pretraining, which is then flexibly deployed to solve in-context tasks. In broad strokes, papers from this program of research often consider a particular algorithm for solving an in-context task, prove that Transformers can approximately implement this algorithm, and then empirically compare the ICL performance of a pretrained Transformer to the performance of that algorithm (22–29). A clear consensus on which algorithm underlies ICL of linear regression in full transformers has yet to emerge (22–29). Within this line of research, closest to our work are a series of papers that consider ICL of linear regression by simplified Transformers using linear, rather than softmax, attention modules (25, 27–32). Zhang et al. (29) studied these models in the limit of infinite pretraining dataset size (i.e., the population risk limit), and show that their performance on in-context linear regression nearly matches that of the Bayes-optimal estimator for the ICL task. However, they found that linear Transformers are not robust to shifts in the within-context covariate distribution. Zhang et al. (28) then showed that any optimizer of the within-context risk for a linear Transformer solves the ICL task with an approximation to one step of gradient descent from a learnable initialization, and that the resulting estimator can saturate the Bayes error for tasks with a Gaussian prior and nonzero mean. As we will discuss in Section 2, our reduction of the linear attention module is inspired in part by these works. In very recent work, Duraisamy (32) has studied the fine-sample risk of in-context linear regression with a single step of gradient descent, without directly analyzing Transformers. An et al. (25) and Wu et al. (30) investigated how linear Transformers adapt to limited pretraining data and context length, again showing that in certain cases nearly optimal error is achievable. Like these studies, our work considers linear attention; but our analysis, with its asymptotically sharp predictions, allows us to elucidate the exact dependence of ICL performance on various hyperparameters and to pinpoint when and how the transition from memorization to ICL of linear regression occurs. In closing, we highlight the work of Reddy (21), who analyzed the emergence of in-context classification through a phenomenological model.

1.2. Summary of Contributions.

We now summarize the main contributions of our paper relative to the prior art reviewed above. Building on recent literature, we focus on a simplified model of a Transformer that captures its key architectural motif: the linear self-attention module (25, 27–31). Linear attention includes the quadratic pairwise interactions between inputs that lie at the heart of softmax attention, but it omits the normalization steps and fully connected layers. This simplification makes the model more amenable to theoretical analysis. Our main result is a sharp asymptotic analysis of ICL for linear regression using linear attention, leading to a more precisely predictive theory than previous population risk analyses or finite-sample bounds (28, 29). The main contributions of our paper are structured as follows:

We begin in Section 2 by developing a simplified parameterization of linear self-attention that allows pretraining on the ICL linear regression task to be performed using ridge regression.
Within this simplified model, we identify a phenomenologically rich scaling limit in which the ICL performance can be analyzed (Section 3). As the token dimension tends to infinity, we allow the number of pretraining examples scale quadratically with the token dimension, while the context length and pretraining task diversity scale linearly. In this joint limit, we compute sharp asymptotics for ICL performance using random matrix theory.
The asymptotically precise theory curves we derive reveal several interesting phenomena (Section 3). First, we observe double-descent in the model’s ICL generalization performance as a function of pretraining dataset size, reflecting our assumption that it is pretrained to interpolation. Second, we study the nonmonotonic dependence of ICL performance on context length. Last, we uncover a transition from memorization to ICL as the pretraining task diversity increases. This transition recapitulates the empirical findings of Raventós et al. (20) in full Transformer models.
In Section 4, we demonstrate through numerical experiments that the insights from our theory, derived using the simplified linear attention model, transfer to full Transformer models with softmax self-attention. In particular, the scaling of pretraining sample complexity and task diversity with token dimension required for successful ICL is consistent.

2. Problem Formulation

We begin by describing the setting of our study.

2.1. ICL of Linear Regression.

In an ICL task, the model takes as input a sequence of tokens ${x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{ℓ}, y_{ℓ}, x_{ℓ + 1}}$ , and outputs a prediction of $y_{ℓ + 1}$ . We will often refer to an input sequence as a “context.” The pairs ${x_{i}, y_{i}}_{i = 1}^{ℓ + 1}$ are i.i.d. samples from a context-dependent joint distribution $P (x, y)$ . Hence, the model needs to gather information about $P (x, y)$ from the first ℓ examples and use this information to predict $y_{ℓ + 1}$ from $x_{ℓ + 1}$ . We will refer to ℓ as the “context length.”

In this work, we focus on an approximately linear mapping between $x_{i} \in R^{d}$ and $y_{i} \in R$ :

\begin{matrix} y_{i} = ⟨ x_{i}, w ⟩ + ϵ_{i}, \end{matrix}

[1]

where $ϵ_{i}$ is a Gaussian noise and $w \in R^{d}$ is referred to as a task vector. We note that the task vector w is fixed within a context but can change between different contexts. The model has to learn w from the ℓ pairs presented within the context, and use it to predict $y_{ℓ + 1}$ from $x_{ℓ + 1}$ .

2.2. Linear Self-Attention.

The model that we will analytically study is the linear self-attention block (33). Linear self-attention takes as input an embedding matrix Z, whose columns hold the sequence tokens. The mapping of sequences to matrices is not unique. Here, following the convention in refs. 29, 30, and 33, we will embed the input sequence ${x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{ℓ}, y_{ℓ}, x_{ℓ + 1}}$ as

\begin{matrix} Z = [\begin{matrix} x_{1} & x_{2} & \dots & x_{ℓ} & x_{ℓ + 1} \\ y_{1} & y_{2} & \dots & y_{ℓ} & 0 \end{matrix}] \in R^{(d + 1) \times (ℓ + 1)}, \end{matrix}

[2]

where 0 in the lower-right corner is a token that prompts the missing value $y_{ℓ + 1}$ to be predicted.

For value matrix $V \in R^{(d + 1) \times (d + 1)}$ and key and query matrices $K, Q$ such that $K^{⊤} Q \in R^{(d + 1) \times (d + 1)}$ , the output of a linear-attention block (33–35) is given by

\begin{matrix} A : = Z + \frac{1}{ℓ} V Z {(K Z)}^{⊤} (Q Z) . \end{matrix}

[3]

The output A is a matrix while our goal is to predict a scalar, $y_{ℓ + 1}$ . Following the choice of positional encoding in Eq. 2, we will take $A_{d + 1, ℓ + 1}$ , the element of A corresponding to the 0 prompt, as the prediction for $y_{ℓ + 1}$ :

\begin{matrix} \hat{y} : = A_{d + 1, ℓ + 1} . \end{matrix}

[4]

2.3. Pretraining Data.

The model is pretrained on n sample sequences, where the μth sample is a collection of $ℓ + 1$ vector–scalar pairs ${x_{i}^{μ} \in R^{d}, y_{i}^{μ} \in R}_{i = 1}^{ℓ + 1}$ related by the approximate linear mapping in Eq. 1: $y_{i}^{μ} = ⟨ x_{i}^{μ}, w^{μ} ⟩ + ϵ_{i}^{μ}$ . Here, $w^{μ}$ denotes the task vector associated with the μth sample. We make the following statistical assumptions:

$x_{i}^{μ}$ are d-dimensional random vectors, sampled i.i.d. over both i and μ from an isotropic Gaussian distribution $N (0, I_{d} / d)$ .
At the beginning of training, construct a finite set with k task vectors, denoted by
$\begin{matrix} ω_{k} = {w_{1}, w_{2}, \dots, w_{k}} . \end{matrix}$
The elements of this set are independently drawn once from
$\begin{matrix} w_{i} \sim_{i.i.d.} N (0, I_{d}) . \end{matrix}$
For $1 \leq μ \leq n$ , the task vector $w^{μ}$ associated with the μth sample context is uniformly sampled from $ω_{k}$ . Note that the variable k controls the task diversity in the pretraining dataset. Importantly, k can be less than n, in which case the same task vector from $ω_{k}$ will be repeated multiple times.
The noise terms $ϵ_{i}^{μ}$ are i.i.d. over both i and μ, and drawn from a normal distribution $N (0, ρ)$ .

We denote a sample from this distribution by $(Z, y_{ℓ + 1}) \sim P_{train}$ .

2.4. Parameter Reduction.

Before specifying the training procedure, it is insightful to first examine the prediction mechanism of the linear attention module for the ICL task. This proves to be a fruitful exercise, shedding light on key questions: can linear self-attention learn linear regression in-context? If so, what do the model parameters learn from the data to solve this ICL problem? By closely analyzing these aspects, we can also formulate a simplified problem that lends itself to analytical study.

We start by rewriting the output of the linear attention module, Eq. 4, in an alternative form. Following ref. 29, we define

\begin{matrix} V = [\begin{matrix} V_{11} & v_{12} \\ v_{21}^{⊤} & v_{22} \end{matrix}], M = [\begin{matrix} M_{11} & m_{12} \\ m_{21}^{⊤} & m_{22} \end{matrix}] : = K^{⊤} Q, \end{matrix}

[5]

where $V_{11} \in R^{d \times d}$ , $v_{12}, v_{21} \in R^{d}$ , $v_{22} \in R$ , $M_{11} \in R^{d \times d}$ , $m_{12}, m_{21} \in R^{d}$ , and $m_{22} \in R$ . We assume that the inner dimension of $K^{⊤} Q$ is greater than equal to $d + 1$ so that matrix M can achieve full rank. From Eqs. 3 and 4, one can check that

\begin{matrix} \hat{y} & = \frac{1}{ℓ} 〈 x_{ℓ + 1}, v_{22} M_{11}^{⊤} \sum_{i = 1}^{ℓ} y_{i} x_{i} + v_{22} m_{21} \sum_{i = 1}^{ℓ} y_{i}^{2} \\ + M_{11}^{⊤} \sum_{i = 1}^{ℓ + 1} x_{i} x_{i}^{⊤} v_{21} + m_{21} \sum_{i = 1}^{ℓ} y_{i} x_{i}^{⊤} v_{21} 〉, \end{matrix}

where $⟨ \cdot, \cdot ⟩$ stands for the standard inner product.

This expression reveals several interesting points, including how this model could express a solution to the ICL task. First, not all parameters in Eq. 5 contribute to the output: we can discard all the parameters except for the last row of V and the first d columns of M. Second, the first term

\begin{matrix} \frac{1}{ℓ} v_{22} M_{11}^{⊤} \sum_{i = 1}^{ℓ} y_{i} x_{i} \end{matrix}

offers a hint about how the linear attention module might be solving the task. The sum $\frac{1}{ℓ} \sum_{i \leq ℓ} y_{i} x_{i}$ is a noisy estimate of $E [x x^{⊤}] w$ for that context. Hence, if the parameters of the model are such that $v_{22} M_{11}^{⊤}$ is approximately $E {[x x^{⊤}]}^{- 1}$ , this term alone makes a good prediction for the output. Third, the third term, $M_{11}^{⊤} \sum_{i = 1}^{ℓ + 1} x_{i} x_{i}^{⊤} v_{21}$ does not depend on outputs y, and thus does not directly contribute to the ICL task that relies on the relationship between x and y. Finally, the fourth term, $m_{21} \sum_{i = 1}^{ℓ} y_{i} x_{i}^{⊤} v_{21}$ , only considers a one-dimensional projection of x onto $v_{21}$ . Because the task vectors w and x are isotropic in the statistical models that we consider, there are no special directions in the problem. Consequently, we expect the optimal $v_{21}$ to be approximately zero by symmetry considerations.

Motivated by these observations, and for analytical tractability, we study the linear attention module with the constraint $v_{21} = 0$ . In this case, collecting the remaining parameters in a matrix

\begin{matrix} Γ : = v_{22} [\begin{matrix} M_{11}^{⊤} / d & m_{21} \end{matrix}] \in R^{d \times (d + 1)} \end{matrix}

[6]

and the input sequence in another matrix $H_{Z}$ , defined as

\begin{matrix} H_{Z} : = x_{ℓ + 1} [\begin{matrix} \frac{d}{ℓ} \sum_{i \leq ℓ} y_{i} x_{i}^{⊤} & \frac{1}{ℓ} \sum_{i \leq ℓ} y_{i}^{2} \end{matrix}] \in R^{d \times (d + 1)}, \end{matrix}

[7]

we can rewrite the predicted label as

\begin{matrix} \hat{y} = ⟨ Γ, H_{Z} ⟩ . \end{matrix}

[8]

The $1 / d$ scaling of $M_{11}$ in Γ is chosen so that the columns of $H_{Z}$ scale similarly; it does not affect the final predictor $\hat{y}$ .

We note that ref. 29 provides an analysis of the population risk (whereas we focus on empirical risk) for a related reduced model in which both $v_{21}$ and $m_{21}$ are set to 0. Consequently, the predictors considered in ref. 29 slightly differ from ours in Eqs. 6–8 by an additive term (due to $m_{21}$ ). The authors of ref. 29 justify this reduced model through an optimization argument: if these parameters $v_{21}$ and $m_{21}$ are initialized to zero, they remain zero under gradient descent optimization of the population risk.

In the remainder of this paper, we will examine the ICL performance of the reduced model given in Eqs. 7 and 8, except when making comparisons to a full, nonlinear Transformer architecture. Henceforth, unless explicitly stated otherwise, we will refer to this reduced model as the linear attention module.

2.5. Model Pretraining.

The parameters of the linear attention module are learned from n samples of input sequences,

\begin{matrix} {x_{1}^{μ}, y_{1}^{μ}, \dots, x_{ℓ + 1}^{μ}, y_{ℓ + 1}^{μ}}, μ = 1, \dots, n . \end{matrix}

We estimate model parameters by minimizing MSE loss on next-output prediction with ridge regularization, giving

\begin{matrix} Γ^{*} & = \underset{Γ}{arg min} \sum_{μ = 1}^{n} {(y_{ℓ + 1}^{μ} - ⟨ Γ, H_{Z^{μ}} ⟩)}^{2} + \frac{n}{d} λ Γ_{F}^{2}, \end{matrix}

[9]

where $λ > 0$ is a regularization parameter, and $H_{Z^{μ}}$ refers to the input matrix Eq. 7 populated with the μth sample sequence. The factor $n / d$ in front of λ makes sure that, when we take the $d \to \infty$ or $n \to \infty$ limits later, there is still a meaningful ridge regularization when $λ > 0$ . The solution to the optimization problem in Eq. 9 can be expressed explicitly as

\begin{matrix} vec (Γ^{*}) & = {(\frac{n}{d} λ I + \sum_{μ = 1}^{n} vec (H_{Z^{μ}}) vec {(H_{Z^{μ}})}^{⊤})}^{- 1} \\ \times \sum_{μ = 1}^{n} y_{ℓ + 1}^{μ} vec (H_{Z^{μ}}), \end{matrix}

[10]

where $vec (\cdot)$ denotes the vectorization operation. Throughout this paper, we adopt the row-major convention. Thus, for a $d_{1} \times d_{2}$ matrix A, $vec (A)$ is a vector in $R^{d_{1} d_{2}}$ , formed by stacking the rows of A together.

2.6. Evaluation.

For a given set of parameters Γ, the model’s generalization error is defined as

\begin{matrix} e (Γ) : = E_{P_{test}} {(y_{ℓ + 1} - ⟨ Γ, H_{Z} ⟩)}^{2}, \end{matrix}

where $(Z, y_{ℓ + 1}) \sim P_{test}$ is a new sample drawn from the probability distribution of the test dataset. We consider two different test data distributions $P_{test}$ :

ICL task: $x_{i}$ and $ϵ_{i}$ are i.i.d. Gaussians as in the pretraining case. However, each task vector $w^{test}$ associated with a test input sequence of length ℓ is drawn independently from $N (0, I_{d})$ . We will denote the test error under this setting by $e^{ICL} (Γ)$ .
In-distribution generalization (IDG) task: The test data are generated in exactly the same manner as the training data, i.e., $P_{test} = P_{train}$ , hence the term in-distribution generalization. In particular, the set of unique task vectors ${w_{1}, \dots, w_{k}}$ is identical to that used in the pretraining data. We will denote the test error under this setting by $e^{IDG} (Γ)$ . This task can also be referred to as in-weight learning.

The ICL task evaluates the true ICL performance of the linear attention module. The task vectors in the test set differ from those seen in training, requiring the model to infer them from context. The IDG task assesses the model’s performance on task vectors it has previously encountered during pretraining. High performance on the IDG task but low performance on the ICL task indicates that the model memorizes the training task vectors. Conversely, high performance on the ICL task suggests that the model can learn genuinely new task vectors from the provided context.

To understand the performance of our model on both ICL and IDG tasks, we will need to evaluate these expressions for the pretrained attention matrix $Γ^{*}$ given in Eq. 10. An asymptotically precise prediction of $e^{ICL} (Γ^{*})$ and $e^{IDG} (Γ^{*})$ will be a main result of this work.

2.7. Baselines: Bayes-Optimal Within-Context Estimators.

Following ref. 20, it is useful to compare the predictions made by the trained linear attention to optimal estimators that use only the current context information. These estimators rely solely on the data within the given context for their predictions but need oracle knowledge of the full statistical models underlying the data. Under the mean square loss, the optimal Bayesian estimator ${\hat{y}}_{Bayes} = E_{P_{test}} [y_{ℓ + 1} | x_{1}, y_{1}, x_{2}, y_{2}, \dots, x_{ℓ}, y_{ℓ}, x_{ℓ + 1}]$ in our setting has the form

\begin{matrix} {\hat{y}}_{Bayes} = {(w_{Bayes})}^{⊤} x_{ℓ + 1}, \end{matrix}

where $w_{Bayes}$ is the Bayes estimator of the task vector w.

For the ICL task, the Bayes-optimal ridge regression estimator is given by

\begin{matrix} w_{Bayes}^{ridge} : = {(\sum_{i = 1}^{ℓ} x_{i} x_{i}^{⊤} + ρ I_{d})}^{- 1} (\sum_{i = 1}^{ℓ} y_{i} x_{i}), \end{matrix}

[11]

where the ridge parameter is set to the noise variance ρ. We will refer to it as the ridge estimator. For the IDG task, the Bayes-optimal estimator is given by

\begin{matrix} w_{Bayes}^{dMMSE} : = \frac{\sum_{j = 1}^{k} w_{j} e^{- \frac{1}{2 ρ} \sum_{i = 1}^{ℓ} {(y_{i} - w_{j}^{⊤} x_{i})}^{2}}}{\sum_{j = 1}^{k} e^{- \frac{1}{2 ρ} \sum_{i = 1}^{ℓ} {(y_{i} - w_{j}^{⊤} x_{i})}^{2}}} . \end{matrix}

[12]

Here, we assume that the training task vectors ${w_{1}, \dots, w_{k}}$ are known to the estimator. Following ref. 20, we will refer to this estimator as the discrete minimum mean squared error (dMMSE) estimator.

The test performance of these estimators are calculated by

\begin{matrix} e_{P_{test}}^{Bayes} = E_{P_{test}} {(y_{ℓ + 1} - {(w_{Bayes})}^{⊤} x_{ℓ + 1})}^{2}, \end{matrix}

where $P_{test}$ can be associated with either the ICL or the IDG task, and $w_{Bayes}$ can be the ridge or the dMMSE estimator. To avoid possible confusion, we emphasize that we will sometimes plot the performance of an estimator on a task for which it is not optimal. For example, we will test the dMMSE estimator, which is Bayes-optimal for the pretraining distribution, on the ICL task, where it is not optimal. This will be done for benchmarking purposes.

3. Theoretical Results

To answer the questions raised in the introduction, we provide a precise asymptotic analysis of the learning curves of the linear attention module for ICL of linear regression. We then focus on various implications of these equations, and verify through simulations that our insights gained from this theoretical analysis extend to more realistic nonlinear Transformers.

3.1. Joint Asymptotic Limit.

We have now defined both the structure of the training data as well as the parameters to be optimized. For our theoretical analysis, we consider a joint asymptotic limit in which the input dimension d, the pretraining dataset size n, the context length ℓ, and the number of task vectors in the training set k, go to infinity together such that

\begin{matrix} \frac{ℓ}{d} : = α = Θ (1), \frac{k}{d} : = κ = Θ (1), \frac{n}{d^{2}} : = τ = Θ (1) . \end{matrix}

[13]

Identification of these scalings constitutes one of the main results of our paper. As we will see, the linear attention module exhibits rich learning phenomena in this limit.

The intuition for these scaling parameters can be seen as follows. Standard results in linear regression (36–38) show that to estimate a d-dimensional task vector w from the ℓ samples within a context, one needs at least $ℓ = Θ (d)$ . The number of unique task vectors that must be seen to estimate the covariance matrix of the true d-dimensional task distribution $N (0, I_{d})$ should also scale with d, i.e., $k = Θ (d)$ . Finally, we see from Eq. 6 that the number of linear attention parameters to be learned is $Θ (d^{2})$ . This suggests that the number of individual contexts the model sees during pretraining should scale similarly, i.e., $n = Θ (d^{2})$ .

3.2. Learning Curves for ICL of Linear Regression by a Linear Attention Module.

Our theoretical analysis, explained in detail in SI Appendix, leads to asymptotically precise expressions for the generalization errors under the two test distributions being studied. Specifically, our theory predicts that, as $d, n, ℓ, k \to \infty$ in the joint limit given in Eq. 13,

\begin{matrix} e^{ICL} (Γ^{*}) ⟶ e^{ICL} (τ, α, κ, ρ, λ) almost surely, \end{matrix}

and

\begin{matrix} e^{IDG} (Γ^{*}) ⟶ e^{IDG} (τ, α, κ, ρ, λ) almost surely, \end{matrix}

where $e^{ICL} (τ, α, κ, ρ, λ)$ and $e^{IDG} (τ, α, κ, ρ, λ)$ are two deterministic functions of the parameters τ, α, κ, ρ and λ. The exact expressions of these two functions can be found in SI Appendix, section 5. For simplicity, we only present in what follows the ridgeless limit (i.e., $λ \to 0^{+}$ ) of the asymptotic generalization errors.

Result 1.

(ICL generalization error in the ridgeless limit). Let

$\begin{matrix} q^{*} : = \frac{1 + ρ}{α}, m^{*} : = M_{κ} (q^{*}), μ^{*} : = q^{*} M_{κ / τ} (q^{*}), \end{matrix}$ [14]

where $M_{κ} (\cdot)$ , defined in SI Appendix, section 8, is a function related to the Stieltjes transform of the Marchenko–Pastur law. Then

$\begin{matrix} e_{ridgeless}^{ICL} : = lim_{λ \to 0^{+}} e^{ICL} (τ, α, κ, ρ, λ) = \\ \{\begin{matrix} \frac{τ (1 + q^{*})}{1 - τ} [1 - τ {(1 - μ^{*})}^{2} + μ^{*} (ρ / q^{*} - 1)] \\ - 2 τ (1 - μ^{*}) + (1 + ρ) & τ < 1 \\ (q^{*} + 1) (1 - 2 q^{*} m^{*} - {(q^{*})}^{2} M_{κ}^{'} (q^{*}) + \frac{(ρ + q^{*} - {(q^{*})}^{2} m^{*}) m^{*}}{τ - 1}) \\ - 2 (1 - q^{*} m^{*}) + (1 + ρ) & τ > 1 \end{matrix}, \end{matrix}$ [15]

where $M_{κ}^{'} (\cdot)$ means the derivative of $M_{κ} (q^{*})$ with respect to $q^{*}$ .

Result 2.

(IDG generalization error in the ridgeless limit). Let $q^{*}$ , $m^{*}$ , and $μ^{*}$ be the scalars defined in Eq. 14. We have

$\begin{matrix} e_{ridgeless}^{IDG} : = lim_{λ \to 0^{+}} e^{IDG} (τ, α, κ, ρ, λ) = \\ \{\begin{matrix} \frac{τ}{1 - τ} (\frac{ρ + q^{*} - 2 q^{*} (1 - τ) (q^{*} / ξ^{*} + 1)}{1 - p^{*} (1 - τ)} + \frac{τ μ^{*} {(q^{*} + ξ^{*})}^{2}}{q^{*}}) & τ < 1 \\ \frac{τ}{τ - 1} [ρ + q^{*} (1 - q^{*} m^{*})] & τ > 1 \end{matrix}, \end{matrix}$ [16]

where $ξ^{*} = \frac{(1 - τ) q^{*}}{τ μ^{*}}$ and $p^{*} = (1 - κ (\frac{κ ξ^{*}}{1 - τ} + 1)^{- 2})^{- 1}$ .

We will discuss various implications of these equations in the next sections.

We derived these results using techniques from random matrix theory. The full setup and technical details are presented in SI Appendix. The computations involve analysis of the properties of the finite-sample optimal parameter matrix $Γ^{*}$ (Eq. 10). A key technical component of our analysis involves characterizing the spectral properties of the sample covariance matrix of $n = Θ (d^{2})$ i.i.d. random vectors in dimension $Θ (d^{2})$ . Each of these vectors is constructed as the vectorized version of the matrix in Eq. 7. Related but simpler versions of this type of random matrices involving the tensor product of i.i.d. random vectors have been studied in recent work (39). Some of our derivations are based on nonrigorous yet technically plausible heuristics. We support these predictions with numerical simulations and discuss in SI Appendix the steps required to achieve a fully rigorous proof.

3.3. Sample-Wise Double-Descent.

How large should n, the pretraining dataset size, be for the linear attention to successfully learn the task in-context? In Fig. 1, we plot our theoretical predictions for the ICL and IDG error as a function of $τ = n / d^{2}$ and verify them with numerical simulations. Our results demonstrate that the quadratic scaling of sample size with input dimensions is indeed an appropriate regime where nontrivial learning phenomena can be observed.

As apparent in Fig. 1, we find that the generalization error for both ICL and IDG tasks are not monotonic in the number of samples. In the ridgeless limit, both ICL and IDG errors diverge at $τ = 1$ , with the leading order behavior in the $τ ↑ 1$ (respectively $τ ↓ 1$ ) limit given by $\frac{c_{1}}{1 - τ}$ (respectively $\frac{c_{2}}{τ - 1}$ ), where $c_{1}$ (respectively $c_{2}$ ) is a τ-independent constant. This leads to a “double-descent” behavior (38, 40) in the number of samples. As in other models exhibiting double-descent (38, 40, 41), the location of the divergence is at the interpolation threshold: the number of parameters of the model (elements of Γ) is, to leading order in d, equal to $d^{2}$ , which matches the number of pretraining samples at $τ = 1$ . Further, we can investigate the effect of ridge regularization on the steepness of the double descent, as illustrated in Fig. 1C for the ICL task. As we would expect from other models exhibiting double-descent (38, 40, 41), increasing the regularization strength suppresses the peak in error around the interpolation threshold.

Finally, we note that if we take the limit of $κ \to \infty$ and $α \to \infty$ in Result 1 (in either order), the ICL generalization error in the ridgeless case reduces to the generalization error of simple ridgeless interpolation with isotropic Gaussian covariates in $d^{2}$ dimensions (38, 41):

\begin{matrix} lim_{α \to \infty} lim_{κ \to \infty} e_{ridgeless}^{ICL} & = lim_{κ \to \infty} lim_{α \to \infty} e_{ridgeless}^{ICL} \\ = \{\begin{matrix} 1 - τ + \frac{ρ}{1 - τ} & τ < 1, \\ \frac{ρ τ}{τ - 1} & τ > 1 . \end{matrix} \end{matrix}

This result makes sense, given that in this limit the ICL generalization problem reduces to the generalization error of ridge regression in $d^{2}$ dimensions with covariates formed as the tensor product of i.i.d. Gaussian vectors, which by universality results in ref. 39 should in turn be asymptotically equal to that for isotropic Gaussian covariates (38). We note that taking the limit $α \to \infty$ is not strictly necessary for this universality, but doing so simplifies the expressions and makes the double-descent obvious.

3.4. ICL and IDG Error Curves Can Have Nonmonotonic Dependence on Context Length.

How large should the context length be? In Fig. 2, we plot our theoretical results verified with experiments. We observe that we have correctly identified the regime where in-context and in-weight learning appear: context length indeed scales linearly with input dimensions, as numerical simulations computed using finite d fit the asymptotic error curves.

An interesting observation is that the IDG and ICL errors do not always monotonically decrease with context length; we see that there are parameter configurations for which the error curves are minimized at some finite $α^{*}$ .

While the functional form of $e_{ridgeless}^{IDG}$ and $e_{ridgeless}^{ICL}$ is too complex to study their minimizers analytically, we can investigate the nonmonotonicity of IDG and ICL error in α numerically. This is done in Fig. 3, where the blue-green colormap corresponds to the value $α^{*}$ , if it exists, that minimizes $e_{ridgeless}^{IDG} (τ, α, κ, ρ)$ or $e_{ridgeless}^{ICL} (τ, α, κ, ρ)$ at fixed $τ, κ, ρ$ . The gray color indicates cases where the corresponding function is monotonic in α. The results reveal that the nonmonotonicity of the errors with respect to context length α depends in a nontrivial manner on both the task diversity κ and the noise-to-signal ratio ρ.

Fig. 3. — Phase diagram of nonmonotonicity in α of ridgeless IDG and ICL error, for fixed τ against task diversity κ and label noise ρ. The colormap corresponds to the value $α^{*}$ that minimizes error in α, if it exists, at that particular $κ, ρ$ pair; gray is plotted if the error curve is monotonic at that $κ, ρ$ pair. Figures (A–C) in this plot correspond to the respective setup in Fig. 2, and points A–E correspond to the respective curve in Fig. 2. The dashed vertical lines are plotted at $κ = min (τ, 1)$ . We know from Eq. 17 and surrounding discussion that ICL error diverges in α for all $κ, ρ$ to the left of this line, and IDG error diverges on this line for $τ < 1$ .

In addition to the nonmonotonicity of error curves in α, we also notice a divergence in error as α increases for some $κ, τ$ values. To determine when this occurs, we compute the large α limit of the above error curves. For ICL error, we have

\begin{matrix} lim_{α \to \infty} e_{ridgeless}^{ICL} & = \{\begin{matrix} \infty & κ \leq min (τ, 1), \\ 1 - τ + ρ + \frac{ρ κ τ}{(κ - τ) (1 - τ)} & τ < 1, κ > τ, \\ ρ + \frac{ρ κ}{(κ - 1) (τ - 1)} & τ > 1, κ > 1 . \end{matrix} \end{matrix}

[17]

Note the explicit divergence of ICL error for $κ \leq min (τ, 1)$ (independent of ρ). Similar investigation of Result 2 under the limit $α \to \infty$ reveals an explicit divergence of IDG error for $κ = τ$ . The divergence in α of ICL and IDG errors is a peculiarity of the ridgeless limit of errors we have taken in Results 1 and 2 and is further discussed in SI Appendix, section 5.

3.5. Learning Transition with Increasing Pretraining Task Diversity.

It is important to quantify if and when a given model is actually learning in-context, i.e., solving a new regression problem by adapting to the specific structure of the task rather than relying solely on memorized training task vectors. We refer to this phenomenon as task generalization.

We expect pretraining task diversity (k) to play a key role in task generalization. To provide some motivation for this, suppose the model was pretrained on only a single task vector ( $k = 1$ ). It is reasonable to expect that the model will memorize this task vector and make predictions under the prior that the task vector in the test examples is identical to that particular pretraining task vector. On the other extreme, suppose the model was pretrained by drawing a fresh task vector for every training sequence. Here, the expectation would be that the model achieves task generalization. Indeed, results from previous computational studies on the linear regression task (20) suggest that transformers are better in-context learners if they have seen a greater task diversity in the training set, and further the models transition from a task memorization phase to a task generalization phase as task diversity increases. Hence, we study the task generalization behavior of the linear transformer as the task diversity increases and precisely quantify the task diversity scale at which such transition happens.

First, we need to discuss how to measure task generalization. To motivate our solution, we take a closer look at the ridge estimator ( $w_{Bayes}^{ridge}$ as in Eq. 11), which is the Bayes-optimal estimator given the ICL task structure. We analytically characterize $e_{ICL}$ and $e_{IDG}$ errors for this estimator in SI Appendix, section 6, giving

e_{ICL}^{ridge} = ρ (1 + \frac{1}{α} M_{α} (\frac{ρ}{α})) = e_{IDG}^{ridge} .

Note that the performance of the ridge estimator is identical on the ICL and IDG test distributions, meaning that it does not rely at all on memorizing task vectors.

Motivated by these arguments, we propose to measure the task generalization capability of a model by studying the difference between its performance on the ICL test distribution and its performance on the IDG test distribution. Specifically, we consider the quantity $g_{task} = e_{ICL} - e_{IDG}$ for a given model or estimator. This difference being large implies the model, performing better on training tasks, has not learned the true task distribution and is not generalizing in task. Conversely, a small difference between ICL error and IDG error suggests that the model is leveraging the underlying structure of the task distribution rather than overfitting to and interpolating specific task instances seen in training.

Simulations of $g_{task}$ as a function of task diversity, $κ = k / d$ , are shown in Fig. 4 for two inference models: dMMSE estimator, and the linear transformer. The dMMSE estimator is included as a benchmark quantifying the performance of a “pure memorizer,” as it is the Bayes-optimal estimator that assumes task vectors only seen in the training test. We see that, as the task diversity parameter κ increases, $g_{task}$ falls for both estimators but at very different rates.

The $g_{task}$ for the dMMSE estimator falls very slowly as a function of κ. This is expected because the explicit form of $w_{Bayes}^{dMMSE}$ , given in Eq. 12, is that of a kernel smoother employing a Gaussian kernel as the weighting function around each task vector. Known results about this class of estimators show the performance of the dMMSE estimator on the ICL task suffers from the “curse of dimensionality” (42–44), i.e., the required number of training samples $k = d κ$ would need to be exponential in d to outperform some given $e_{ICL}$ error tolerance. Intuitively, the IDG task distribution $P_{test} = P_{train} = Unif {w_{1}, \dots, w_{k}}, w_{i} \sim N (0, I_{d})$ and the ICL task distribution $P_{test} = N (0, I_{d})$ are indistinguishable (by appropriate measures) only if k is exponentially large in d. We present more formal arguments in SI Appendix, section 6.

On the other hand, the $g_{task}$ for the linear transformer estimator first behaves similarly to the dMMSE estimator, but starts falling sharply after around $κ = 1$ . How quickly does $g_{task}^{LT}$ limit to 0 as $κ \to \infty$ ? By expanding $e_{ICL} - e_{IDG}$ from Eq. 15 and Eq. 16 in κ we have the leading asymptotic behavior given by

g_{task}^{LT} = 0 + \frac{1}{κ} \{\begin{matrix} c_{1} & τ < 1 \\ c_{2} & τ > 1 \end{matrix} + O (\frac{1}{κ^{2}})

for constants $c_{1}, c_{2}$ dependent on $α, τ, ρ$ , provided in SI Appendix, section 6. This sharp fall signals a transition of the model to the task generalization regime. To further understand the nature of this transition and the role of κ in the solution learned by the linear attention mechanism, consider the regime where $τ, α \to \infty$ with $α / τ \equiv c^{*}$ kept fixed. Under this setting, we have

\begin{matrix} lim_{\begin{matrix} τ \to \infty \\ α \to \infty \end{matrix}} g_{task}^{LT} = \{\begin{matrix} (1 - κ) (1 + \frac{ρ}{1 + ρ} c^{*}) & κ < 1 \\ 0 & κ > 1 \end{matrix} . \end{matrix}

[18]

This change in analytical behavior indicates a phase transition at $κ = 1$ . We see this phenomenon in both Fig. 4 A and B where the $g_{task}$ curve plotted against κ becomes nondifferentiable at $κ = 1$ for $α \to \infty$ .

We conclude that the linear transformer model is a much more efficient task generalizer than the dMMSE estimator which is the Bayes-optimal memorizer, as shown in Fig. 4. The $1 / κ$ decay in $g_{task}^{LT}$ vs. the much slower decay of $g_{task}^{dMMSE}$ suggests that the linear transformer quickly learns an inference algorithm that generalizes in-context rather than interpolates between training tasks.

4. Experiments with Fully Parameterized Nonlinear Transformers

Given that our theoretical results are derived from a simplified linear attention setting, we aim to determine whether these insights can be extended to nonlinear and deep Transformer models with full $K, Q, V$ parameterization. Even though we would not expect the specific algorithm discussed in Section 2.4 to transfer to these nonlinear settings, we will test whether our proposed scalings and main qualitative predictions remain. Specifically, we will investigate the following: 1) whether we have identified the correct scaling regime for nontrivial learning in an ICL task; 2) whether the full Transformer exhibits a sample-wise double descent, where the location of the peak error scales quadratically with input dimensions as predicted by our theory; and 3) whether the transition from memorization to task generalization occurs.

Our experiments^* are done with a standard Transformer architecture, where each sample context initially takes the form given by Eq. 2. The fully parameterized linear transformer Fig. 5A and the softmax-only transformer Fig. 5B do not use a multilayer perceptron (MLP) in addition to attention. If MLPs are used (e.g., Fig. 5 C and D), the architecture consists of blocks with: 1) a single-head softmax self-attention with $K, Q, V \in R^{(d + 1) \times (d + 1)}$ matrices, followed by 2) a two-layer dense MLP with GELU activation and hidden layer of size $d + 1$ (1). Residual connections are used between the input tokens (padded from dimension d to $d + 1$ ), the pre-MLP output, and the MLP output. We use a variable number of attention+MLP blocks before returning the final logit corresponding to the $(d + 1, ℓ + 1)$ th element in the original embedding structure given by Eq. 2. The loss function is the mean squared error (MSE) between the predicted label (the output of the model for a given sample Z) and the true value $y_{ℓ + 1}$ . We train the model in an offline setting with n total samples $Z_{1}, \dots, Z_{n}$ , divided into 10 batches, using the Adam optimizer (45) with a learning rate $10^{- 4}$ until the training error converges, typically requiring 10,000 epochs.^† The structure of the pretraining and test distributions exactly follows the setup for the ICL and IDG tasks described in Section 2.

In Fig. 5, we plot the average error for a range of fully parameterized linear (A) and nonlinear architectures (B and C) on the ICL task against the parameter $τ = n / d^{2}$ , which measures the pretraining sample complexity. Notice that the large-d full $K, Q, V$ parameterized linear architecture studied in (A) follows the general trend of the theory derived on a simplified parameterization. Further, in all figures, plots across different values of d roughly collapse, suggesting again that our scaling relations capture the observed behavior.

Across all architectures, we observe a double descent in τ. The peak of this curve occurs at the interpolation threshold, which we identify by tracking the value of τ at which the training loss becomes nonzero (see figure caption). Our theory predicts that the number of samples n at the peak, as well as the interpolation threshold, should scale with $d^{2}$ . Indeed, for the fully parameterized linear transformer in (Fig. 5A), the double descent occurs very close to $τ = 1$ for larger d, as predicted by the reduced-parameter linear theory. This general quadratic scaling is summarized in (Fig. 5D) across a range of architectures. These findings suggest that nonlinear Transformers exhibit a similar interpolation transition, with the same quadratic scaling in sample size as seen in the simplified model.

Finally, we can investigate the effects of task diversity on $g_{task}$ in a nonlinear model, shown in Fig. 6. This provides empirical evidence toward the choice of κ scaling, as errors are consistent over a range of token dimension choices. Further, we recover the memorization-to-task-generalization transition in κ, as observed before in ref. 20. In Fig. 6, we see the value of κ for which $g_{task}$ appears to converge toward 0 is below 1 (as in Eq. 18), suggesting that the more powerful architecture used here may be more “task-efficient” than the linear architecture in achieving memorization-to-generalization transition.

5. Discussion

In this paper, we computed sharp asymptotics for the ICL performance in a simplified model of ICL for linear regression using linear attention. This exactly solvable model demonstrates a transition from memorizing pretraining task vectors to generalizing beyond them as the diversity of pretraining tasks increases, echoing empirical findings in full Transformers (20). Additionally, we observe a sample-wise double descent as the amount of pretraining data increases. Our numerical experiments show that full, nonlinear Transformers exhibit qualitatively similar behavior in the scaling regime relevant to our solvable model. Understanding the mechanistic underpinnings of ICL of well-controlled synthetic tasks in solvable models is an important prerequisite to understanding how it emerges from pretraining on natural data (21).

Our paper falls within a broader program of research that seeks sharp asymptotic characterizations of the performance of machine learning algorithms. This program has a long history in statistical physics (41, 46, 47), and has in recent years attracted substantial attention in machine learning (38, 41, 48–54). For simplicity, we have assumed that the covariates in the in-context regression problem are drawn from an isotropic Gaussian distribution. However, our technical approach could be extended to anisotropic covariates, and, perhaps more interestingly, to featurized linear attention models in which the inputs are passed through some feature map before linear attention is applied (34, 35). This extension would be possible thanks to an appropriate form of Gaussian universality: for certain classes of regression problems, the asymptotic error coincides with that of a model where the true features are replaced with Gaussian features of matched mean and covariance (38, 39, 48–53, 55). This would allow for a theoretical characterization of ICL for realistic data structure in a closer approximation of full softmax attention, yielding more precise predictions of how performance scales in real Transformers.

In our analysis, we have assumed that the model is trained to interpolation on a fixed dataset. This allows us to cast our simplified form of linear attention pretraining as a ridge regression problem, which in turn enables our random matrix analysis. In contrast, Transformer-based large language models are usually trained in a nearly online setting, where each gradient update is estimated using fresh examples with no repeating data (56). Some of our findings, such as double-descent in the learning curve as a function of the number of pretraining examples, are unlikely to generalize to the fully online setting. It will be interesting to probe these potential differences in future work.

Finally, our results have some bearing on the broad question of what architectural features are required for ICL (7, 13, 21). Our work shows that a full Transformer—or indeed even full linear attention—is not required for ICL of linear regression. However, our simplified model retains the structured quadratic pairwise interaction between inputs that is at the heart of the attention mechanism. It is this quadratic interaction that allows the model to solve the ICL regression task, which it does essentially by reversing the data correlation (Section 2.4). One would therefore hypothesize that our model is minimal in that further simplifications within this model class would impair its ability to solve this ICL task. In the specific context of regression with isotropic data, a simple point of comparison would be to fix $Γ = I_{d}$ , which gives a pretraining-free model that should perform well when the context length is very long. However, this further-reduced model would perform poorly if the covariates of the in-context task are anisotropic. Generally, it would be interesting to investigate when models lacking this precisely engineered quadratic interaction can learn linear regression in-context (57), and if they are less sample-efficient than the attention-based models considered here.

Supplementary Material

Appendix 01 (PDF)

pnas.2502599122.sapp.pdf^{(397.7KB, pdf)}

Acknowledgments

We thank William L. Tong for helpful discussions regarding numerics. Y.M.L. was supported by NSF Award CCF-1910410, by the Harvard Faculty of Arts and Sciences Dean’s Fund for Promising Scholarship, and by a Harvard College Professorship. J.A.Z.-V. and C.P. were supported by NSF Award DMS-2134157 and NSF CAREER Award IIS-2239780. J.A.Z.-V. is presently supported by a Junior Fellowship from the Harvard Society of Fellows. C.P. is further supported by a Sloan Research Fellowship. A.M. acknowledges support from Perimeter Institute, which is supported in part by the Government of Canada through the Department of Innovation, Science and Economic Development and by the Province of Ontario through the Ministry of Colleges and Universities. This work has been made possible in part by a gift from the Chan Zuckerberg Initiative Foundation to establish the Kempner Institute for the Study of Natural and Artificial Intelligence. This research was supported in part by grants NSF PHY-1748958 and PHY-2309135 to the Kavli Institute for Theoretical Physics, through the authors’ participation in the Fall 2023 program “Deep Learning from the Perspective of Physics and Neuroscience.”

Author contributions

Y.M.L. and C.P. designed research; Y.M.L., M.L., J.A.Z.-V., A.M., and C.P. performed research; and Y.M.L., M.L., J.A.Z.-V., A.M., and C.P. wrote the paper.

Competing interests

The authors declare no competing interest.

Footnotes

This article is a PNAS Direct Submission.

^*Code to reproduce all experiments is available at https://github.com/Pehlevan-Group/incontext-asymptotics-experiments (58).

^†Note that larger d models are often trained for fewer epochs than smaller d models due to early stopping; that said, whether or not early stopping is used in training does not affect either the alignment of error curves in d-scaling nor the qualitative behavior (double descent in τ and transition in κ).

Contributor Information

Yue M. Lu, Email: yuelu@seas.harvard.edu.

Cengiz Pehlevan, Email: cpehlevan@seas.harvard.edu.

Data, Materials, and Software Availability

Code to reproduce experiments’ data have been deposited in Github (https://github.com/Pehlevan-Group/incontext-asymptotics-experiments) (58).

Supporting Information

References

1.Vaswani A., et al. , Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017). [Google Scholar]
2.A. Dosovitskiy et al. , An image is worth 16x16 words: Transformers for image recognition at scale. arXiv [Preprint] (2021). 10.48550/arXiv.2010.11929 (Accessed 1 June 2025). [DOI]
3.J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2019, June), “Bert: Pre-training of deep bidirectional transformers for language understanding” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), J. Burstein, C. Doran, T. Solorio, Eds. (Association for Computational Linguistics, 2019), pp. 4171–4186.
4.H. Touvron et al. , Llama: Open and efficient foundation language models. arXiv [Preprint] (2023). https://arxiv.org/abs/2302.13971.
5.A. Radford et al. , “Improving language understanding by generative pre-training” (OpenAI, 2018; https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf).
6.Radford A., et al. , Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019). [Google Scholar]
7.Brown T., et al. , Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020). [Google Scholar]
8.J. Achiam et al. , Gpt-4 technical report. arXiv [Preprint] (2023). https://arxiv.org/abs/2303.08774 (Accessed 1 June 2025).
9.Anthropic, “The claude 3 model family: Opus, sonnet, haiku” (Claude-3 Model Card, Anthropic, 2024).
10.DeepSeek-AI et al. , Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv [Preprint] (2025). 10.48550/arXiv.2501.12948 (Accessed 1 June 2025). [DOI]
11.D. Ganguli et al. , “Predictability and surprise in large generative models” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (2022), pp. 1747–1764.
12.A. Srivastava et al. , Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. TMLR (2023).
13.J. Wei et al. , Emergent abilities of large language models. arXiv [Preprint] (2022). 10.48550/arXiv.2206.07682. (Accessed 24 June 2025). [DOI]
14.C. Olsson et al. , In-context learning and induction heads. Transformer Circuits Thread (2022). https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Accessed 24 June 2025.
15.R. Schaeffer, B. Miranda, S. Koyejo, “Are emergent abilities of large language models a mirage?” in Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA, 2023), pp. 55565–55581.
16.Barak B., et al. , Hidden progress in deep learning: SGD learns parities near the computational limit. Adv. Neural Inf. Process. Syst. 35, 21750–21764 (2022). [Google Scholar]
17.S. C. Y. Chan et al. , “Data distributional properties drive emergent in-context learning in transformers” in Proceedings of the 36th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA, 2022), pp. 18878–18891.
18.A. K. Singh et al. , “The transient nature of emergent in-context learning in transformers” in Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA 2023) pp. 27801–27819.
19.Bietti A., Cabannes V., Bouchacourt D., Jegou H., Bottou L., Birth of a transformer: A memory viewpoint. Adv. Neural Inf. Process. Syst. 36, 1560–1588 (2023). [Google Scholar]
20.Raventós A., Paul M., Chen F., Ganguli S., Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression. Adv. Neural Inf. Process. Syst. 36, 14228 (2023). [Google Scholar]
21.G. Reddy, “The mechanistic basis of data dependence and abrupt learning in an in-context classification task” in The Twelfth International Conference on Learning Representations (2024).
22.Bai Y., Chen F., Wang H., Xiong C., Mei S., Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Adv. Neural Inf. Process. Syst. 36, 57125–57211 (2023). [Google Scholar]
23.Y. Li, M. E. Ildiz, D. Papailiopoulos, S. Oymak, “Transformers as algorithms: Generalization and stability in in-context learning” in Proceedings of the 40th International Conference on Machine Learning (PMLR, Honolulu, Hawaii, 2023), pp. 19565–19594.
24.E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, D. Zhou, What learning algorithm is in-context learning? arXiv [Preprint] (2022). 10.48550/arXiv.2211.15661 (Accessed 24 June 2025). [DOI]
25.K. Ahn, X. Cheng, H. Daneshmand, S. Sra, “Transformers learn to implement preconditioned gradient descent for in-context learning” in Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA, 2023), pp. 45614–45650.
26.D. Fu, T. Q. Chen, R. Jia, V. Sharan, Transformers learn higher-order optimization methods for in-context learning: A study with linear models. arXiv [Preprint] (2023). 10.48550/arXiv.2310.17086 (Accessed 24 June 2025). [DOI]
27.J. Von Oswald et al. , “Transformers learn in-context by gradient descent” in Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, A. Krause et al. , Eds. (PMLR, 2023), vol. 202, pp. 35151–35174.
28.R. Zhang, J. Wu, P. L. Bartlett, In-context learning of a linear transformer block: Benefits of the MLP component and one-step GD initialization (2024).
29.Zhang R., Frei S., Bartlett P. L., Trained transformers learn linear models in-context. J. Mach. Learn. Res. 25, 1–55 (2024). [Google Scholar]
30.J. Wu et al. , How many pretraining tasks are needed for in-context learning of linear regression? arXiv [Preprint] (2023). 10.48550/arXiv.2310.08391. (Accessed 24 June 2025). [DOI]
31.P. Chandra, T. K. Sinha, K. Ahuja, A. Garg, N. Goyal, Towards analyzing self-attention via linear neural network. OpenReview. ICLR submission [Internet] (2024). https://openreview.net/pdf?id=4fVuBf5HE9. Accessed 24 June 2025.
32.K. Duraisamy, Finite sample analysis and bounds of generalization error of gradient descent in in-context linear regression. arXiv [Preprint] (2024). https://arxiv.org/abs/2405.02462 (Accessed 1 June 2025).
33.S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma, Linformer: Self-attention with linear complexity. arXiv [Preprint] (2020). 10.48550/arXiv.2006.04768 (Accessed 24 June 2025). [DOI]
34.Z. Shen, M. Zhang, H. Zhao, S. Yi, H. Li, “Efficient attention: Attention with linear complexities” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021), pp. 3531–3539.
35.A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention” in International Conference on Machine Learning (PMLR, 2020), pp. 5156–5165.
36.Marchenko V. A., Pastur L. A., Distribution of eigenvalues for some sets of random matrices. Mat. Sb. 114, 507–536 (1967). [Google Scholar]
37.Z. Bai, J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices (Springer, 2010), vol. 20.
38.Hastie T., Montanari A., Rosset S., Tibshirani R. J., Surprises in high-dimensional ridgeless least squares interpolation. Ann. Stat. 50, 949–986 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
39.S. Dubova, Y. M. Lu, B. McKenna, H. T. Yau, Universality for the global spectrum of random inner-product kernel matrices in the polynomial regime. arXiv [Preprint] (2023). 10.48550/arXiv.2310.18280 (Accessed 1 June 2025). [DOI]
40.Belkin M., Hsu D., Ma S., Mandal S., Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. U.S.A. 116, 15849–15854 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
41.A. B. Atanasov, J. A. Zavatone-Veth, C. Pehlevan, Scaling and renormalization in high-dimensional regression. arXiv [Preprint] (2024). 10.48550/arXiv.2405.00592 (Accessed 1 June 2025). [DOI]
42.Tsybakov A., Introduction to Nonparametric Estimation. Springer Series in Statistics (Springer, New York, NY, 2008). [Google Scholar]
43.M. Belkin, A. Rakhlin, A. B. Tsybakov, “Does data interpolation contradict statistical optimality?” in Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, K. Chaudhuri, M. Sugiyama, Eds. (PMLR, 2019), vol. 89, pp. 1611–1619.
44.Zavatone-Veth J. A., Pehlevan C., Nadaraya-Watson kernel smoothing as a random energy model. J. Stat. Mech.: Theory Exp. 2025, 013404 (2025). [Google Scholar]
45.Kingma D., Ba J., Adam: A method for stochastic optimization in International Conference on Learning Representations (ICLR) (San Diego, CA, USA, 2015). [Google Scholar]
46.Watkin T. L. H., Rau A., Biehl M., The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499–556 (1993). [Google Scholar]
47.A. Engel, C. van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001).
48.F. Gerace, B. Loureiro, F. Krzakala, M. Mézard, L. Zdeborová, “Generalisation error in learning with random features and the hidden manifold model” in International Conference on Machine Learning (PMLR, 2020), pp. 3452–3462.
49.Loureiro B., et al. , Learning curves of generic features maps for realistic datasets with a teacher–student model. Adv. Neural Inf. Process. Syst. 34, 18137–18151 (2021). [Google Scholar]
50.Mei S., Montanari A., The generalization error of random features regression: Precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75, 667–766 (2022). [Google Scholar]
51.Hu H., Lu Y. M., Universality laws for high-dimensional learning with random features. IEEE Trans. Inf. Theory 69, 1932–1964 (2022). [Google Scholar]
52.H. Hu, Y. M. Lu, T. Misiakiewicz, Asymptotics of random feature regression beyond the linear scaling regime. arXiv [Preprint] (2024). https://arxiv.org/abs/2403.08160 (Accessed 1 June 2025).
53.O. Dhifallah, Y. M. Lu, A precise performance analysis of learning with random features. arXiv [Preprint] (2020). https://arxiv.org/abs/2008.11904 (Accessed 1 June 2025).
54.H. Cui, F. Behrens, F. Krzakala, L. Zdeborová, “A phase transition between positional and semantic learning in a solvable model of dot-product attention” in Proceedings of the 38th International Conference on Neural Information Processing Systems, (Curran Associates Inc., Vancouver, BC, Canada, 2025), pp. 36342–36389.
55.A. Montanari, B. N. Saeed, “Universality of empirical risk minimization” in Proceedings of Thirty Fifth Conference on Learning Theory, Proceedings of Machine Learning Research, P. L. Loh, M. Raginsky, Eds. (PMLR, 2022), vol. 178, pp. 4310–4312.
56.N. Muennighoff et al. , “Scaling data-constrained language models” in Thirty-Seventh Conference on Neural Information Processing Systems (2023).
57.W. L. Tong, C. Pehlevan, “MLPs learn in-context on regression and classification tasks” in The Thirteenth International Conference on Learning Representations (2025).
58.M. Letey, W. Tong, incontext-asymptotics-experiments. Github. https://github.com/Pehlevan-Group/incontext-asymptotics-experiments. Accessed 20 June 2025.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Appendix 01 (PDF)

pnas.2502599122.sapp.pdf^{(397.7KB, pdf)}

Data Availability Statement

Code to reproduce experiments’ data have been deposited in Github (https://github.com/Pehlevan-Group/incontext-asymptotics-experiments) (58).

[r1] 1.Vaswani A., et al. , Attention is all you need. Adv. Neural Inf. Process. Syst. 30, 5998–6008 (2017). [Google Scholar]

[r2] 2.A. Dosovitskiy et al. , An image is worth 16x16 words: Transformers for image recognition at scale. arXiv [Preprint] (2021). 10.48550/arXiv.2010.11929 (Accessed 1 June 2025). [DOI]

[r3] 3.J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2019, June), “Bert: Pre-training of deep bidirectional transformers for language understanding” in Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), J. Burstein, C. Doran, T. Solorio, Eds. (Association for Computational Linguistics, 2019), pp. 4171–4186.

[r4] 4.H. Touvron et al. , Llama: Open and efficient foundation language models. arXiv [Preprint] (2023). https://arxiv.org/abs/2302.13971.

[r5] 5.A. Radford et al. , “Improving language understanding by generative pre-training” (OpenAI, 2018; https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf).

[r6] 6.Radford A., et al. , Language models are unsupervised multitask learners. OpenAI Blog 1, 9 (2019). [Google Scholar]

[r7] 7.Brown T., et al. , Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020). [Google Scholar]

[r8] 8.J. Achiam et al. , Gpt-4 technical report. arXiv [Preprint] (2023). https://arxiv.org/abs/2303.08774 (Accessed 1 June 2025).

[r9] 9.Anthropic, “The claude 3 model family: Opus, sonnet, haiku” (Claude-3 Model Card, Anthropic, 2024).

[r10] 10.DeepSeek-AI et al. , Deepseek-r1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv [Preprint] (2025). 10.48550/arXiv.2501.12948 (Accessed 1 June 2025). [DOI]

[r11] 11.D. Ganguli et al. , “Predictability and surprise in large generative models” in Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (2022), pp. 1747–1764.

[r12] 12.A. Srivastava et al. , Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. TMLR (2023).

[r13] 13.J. Wei et al. , Emergent abilities of large language models. arXiv [Preprint] (2022). 10.48550/arXiv.2206.07682. (Accessed 24 June 2025). [DOI]

[r14] 14.C. Olsson et al. , In-context learning and induction heads. Transformer Circuits Thread (2022). https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html. Accessed 24 June 2025.

[r15] 15.R. Schaeffer, B. Miranda, S. Koyejo, “Are emergent abilities of large language models a mirage?” in Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA, 2023), pp. 55565–55581.

[r16] 16.Barak B., et al. , Hidden progress in deep learning: SGD learns parities near the computational limit. Adv. Neural Inf. Process. Syst. 35, 21750–21764 (2022). [Google Scholar]

[r17] 17.S. C. Y. Chan et al. , “Data distributional properties drive emergent in-context learning in transformers” in Proceedings of the 36th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA, 2022), pp. 18878–18891.

[r18] 18.A. K. Singh et al. , “The transient nature of emergent in-context learning in transformers” in Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA 2023) pp. 27801–27819.

[r19] 19.Bietti A., Cabannes V., Bouchacourt D., Jegou H., Bottou L., Birth of a transformer: A memory viewpoint. Adv. Neural Inf. Process. Syst. 36, 1560–1588 (2023). [Google Scholar]

[r20] 20.Raventós A., Paul M., Chen F., Ganguli S., Pretraining task diversity and the emergence of non-Bayesian in-context learning for regression. Adv. Neural Inf. Process. Syst. 36, 14228 (2023). [Google Scholar]

[r21] 21.G. Reddy, “The mechanistic basis of data dependence and abrupt learning in an in-context classification task” in The Twelfth International Conference on Learning Representations (2024).

[r22] 22.Bai Y., Chen F., Wang H., Xiong C., Mei S., Transformers as statisticians: Provable in-context learning with in-context algorithm selection. Adv. Neural Inf. Process. Syst. 36, 57125–57211 (2023). [Google Scholar]

[r23] 23.Y. Li, M. E. Ildiz, D. Papailiopoulos, S. Oymak, “Transformers as algorithms: Generalization and stability in in-context learning” in Proceedings of the 40th International Conference on Machine Learning (PMLR, Honolulu, Hawaii, 2023), pp. 19565–19594.

[r24] 24.E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, D. Zhou, What learning algorithm is in-context learning? arXiv [Preprint] (2022). 10.48550/arXiv.2211.15661 (Accessed 24 June 2025). [DOI]

[r25] 25.K. Ahn, X. Cheng, H. Daneshmand, S. Sra, “Transformers learn to implement preconditioned gradient descent for in-context learning” in Proceedings of the 37th International Conference on Neural Information Processing Systems (Curran Associates Inc., New Orleans, LA, 2023), pp. 45614–45650.

[r26] 26.D. Fu, T. Q. Chen, R. Jia, V. Sharan, Transformers learn higher-order optimization methods for in-context learning: A study with linear models. arXiv [Preprint] (2023). 10.48550/arXiv.2310.17086 (Accessed 24 June 2025). [DOI]

[r27] 27.J. Von Oswald et al. , “Transformers learn in-context by gradient descent” in Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, A. Krause et al. , Eds. (PMLR, 2023), vol. 202, pp. 35151–35174.

[r28] 28.R. Zhang, J. Wu, P. L. Bartlett, In-context learning of a linear transformer block: Benefits of the MLP component and one-step GD initialization (2024).

[r29] 29.Zhang R., Frei S., Bartlett P. L., Trained transformers learn linear models in-context. J. Mach. Learn. Res. 25, 1–55 (2024). [Google Scholar]

[r30] 30.J. Wu et al. , How many pretraining tasks are needed for in-context learning of linear regression? arXiv [Preprint] (2023). 10.48550/arXiv.2310.08391. (Accessed 24 June 2025). [DOI]

[r31] 31.P. Chandra, T. K. Sinha, K. Ahuja, A. Garg, N. Goyal, Towards analyzing self-attention via linear neural network. OpenReview. ICLR submission [Internet] (2024). https://openreview.net/pdf?id=4fVuBf5HE9. Accessed 24 June 2025.

[r32] 32.K. Duraisamy, Finite sample analysis and bounds of generalization error of gradient descent in in-context linear regression. arXiv [Preprint] (2024). https://arxiv.org/abs/2405.02462 (Accessed 1 June 2025).

[r33] 33.S. Wang, B. Z. Li, M. Khabsa, H. Fang, H. Ma, Linformer: Self-attention with linear complexity. arXiv [Preprint] (2020). 10.48550/arXiv.2006.04768 (Accessed 24 June 2025). [DOI]

[r34] 34.Z. Shen, M. Zhang, H. Zhao, S. Yi, H. Li, “Efficient attention: Attention with linear complexities” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021), pp. 3531–3539.

[r35] 35.A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret, “Transformers are RNNs: Fast autoregressive transformers with linear attention” in International Conference on Machine Learning (PMLR, 2020), pp. 5156–5165.

[r36] 36.Marchenko V. A., Pastur L. A., Distribution of eigenvalues for some sets of random matrices. Mat. Sb. 114, 507–536 (1967). [Google Scholar]

[r37] 37.Z. Bai, J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices (Springer, 2010), vol. 20.

[r38] 38.Hastie T., Montanari A., Rosset S., Tibshirani R. J., Surprises in high-dimensional ridgeless least squares interpolation. Ann. Stat. 50, 949–986 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r39] 39.S. Dubova, Y. M. Lu, B. McKenna, H. T. Yau, Universality for the global spectrum of random inner-product kernel matrices in the polynomial regime. arXiv [Preprint] (2023). 10.48550/arXiv.2310.18280 (Accessed 1 June 2025). [DOI]

[r40] 40.Belkin M., Hsu D., Ma S., Mandal S., Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. U.S.A. 116, 15849–15854 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]

[r41] 41.A. B. Atanasov, J. A. Zavatone-Veth, C. Pehlevan, Scaling and renormalization in high-dimensional regression. arXiv [Preprint] (2024). 10.48550/arXiv.2405.00592 (Accessed 1 June 2025). [DOI]

[r42] 42.Tsybakov A., Introduction to Nonparametric Estimation. Springer Series in Statistics (Springer, New York, NY, 2008). [Google Scholar]

[r43] 43.M. Belkin, A. Rakhlin, A. B. Tsybakov, “Does data interpolation contradict statistical optimality?” in Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, K. Chaudhuri, M. Sugiyama, Eds. (PMLR, 2019), vol. 89, pp. 1611–1619.

[r44] 44.Zavatone-Veth J. A., Pehlevan C., Nadaraya-Watson kernel smoothing as a random energy model. J. Stat. Mech.: Theory Exp. 2025, 013404 (2025). [Google Scholar]

[r45] 45.Kingma D., Ba J., Adam: A method for stochastic optimization in International Conference on Learning Representations (ICLR) (San Diego, CA, USA, 2015). [Google Scholar]

[r46] 46.Watkin T. L. H., Rau A., Biehl M., The statistical mechanics of learning a rule. Rev. Mod. Phys. 65, 499–556 (1993). [Google Scholar]

[r47] 47.A. Engel, C. van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, 2001).

[r48] 48.F. Gerace, B. Loureiro, F. Krzakala, M. Mézard, L. Zdeborová, “Generalisation error in learning with random features and the hidden manifold model” in International Conference on Machine Learning (PMLR, 2020), pp. 3452–3462.

[r49] 49.Loureiro B., et al. , Learning curves of generic features maps for realistic datasets with a teacher–student model. Adv. Neural Inf. Process. Syst. 34, 18137–18151 (2021). [Google Scholar]

[r50] 50.Mei S., Montanari A., The generalization error of random features regression: Precise asymptotics and the double descent curve. Commun. Pure Appl. Math. 75, 667–766 (2022). [Google Scholar]

[r51] 51.Hu H., Lu Y. M., Universality laws for high-dimensional learning with random features. IEEE Trans. Inf. Theory 69, 1932–1964 (2022). [Google Scholar]

[r52] 52.H. Hu, Y. M. Lu, T. Misiakiewicz, Asymptotics of random feature regression beyond the linear scaling regime. arXiv [Preprint] (2024). https://arxiv.org/abs/2403.08160 (Accessed 1 June 2025).

[r53] 53.O. Dhifallah, Y. M. Lu, A precise performance analysis of learning with random features. arXiv [Preprint] (2020). https://arxiv.org/abs/2008.11904 (Accessed 1 June 2025).

[r54] 54.H. Cui, F. Behrens, F. Krzakala, L. Zdeborová, “A phase transition between positional and semantic learning in a solvable model of dot-product attention” in Proceedings of the 38th International Conference on Neural Information Processing Systems, (Curran Associates Inc., Vancouver, BC, Canada, 2025), pp. 36342–36389.

[r55] 55.A. Montanari, B. N. Saeed, “Universality of empirical risk minimization” in Proceedings of Thirty Fifth Conference on Learning Theory, Proceedings of Machine Learning Research, P. L. Loh, M. Raginsky, Eds. (PMLR, 2022), vol. 178, pp. 4310–4312.

[r56] 56.N. Muennighoff et al. , “Scaling data-constrained language models” in Thirty-Seventh Conference on Neural Information Processing Systems (2023).

[r57] 57.W. L. Tong, C. Pehlevan, “MLPs learn in-context on regression and classification tasks” in The Thirteenth International Conference on Learning Representations (2025).

[r58] 58.M. Letey, W. Tong, incontext-asymptotics-experiments. Github. https://github.com/Pehlevan-Group/incontext-asymptotics-experiments. Accessed 20 June 2025.

PERMALINK

Asymptotic theory of in-context learning by linear attention

Yue M Lu

Mary Letey

Jacob A Zavatone-Veth

Anindita Maiti

Cengiz Pehlevan

Significance

Abstract

1.1. Related Work.

1.1.1. ICL in transformer architectures.

1.1.2. Empirical studies of synthetic ICL tasks.

1.1.3. Theoretical studies of ICL.

1.2. Summary of Contributions.

2. Problem Formulation

2.1. ICL of Linear Regression.

2.2. Linear Self-Attention.

2.3. Pretraining Data.

2.4. Parameter Reduction.

2.5. Model Pretraining.

2.6. Evaluation.

2.7. Baselines: Bayes-Optimal Within-Context Estimators.

3. Theoretical Results

3.1. Joint Asymptotic Limit.

3.2. Learning Curves for ICL of Linear Regression by a Linear Attention Module.

Result 1.

Result 2.

3.3. Sample-Wise Double-Descent.

Fig. 1.

3.4. ICL and IDG Error Curves Can Have Nonmonotonic Dependence on Context Length.

Fig. 2.

Fig. 3.

3.5. Learning Transition with Increasing Pretraining Task Diversity.

Fig. 4.

4. Experiments with Fully Parameterized Nonlinear Transformers

Fig. 5.

Fig. 6.

5. Discussion

Supplementary Material

Acknowledgments

Author contributions

Competing interests

Footnotes

Contributor Information

Data, Materials, and Software Availability

Supporting Information

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases