In-Context Learning Demystified

4 minute read


📖 TL;DR: the next-token prediction of a transformer block taking some context and query as input is equivalent to the output of the same transformer with weights updated by the context and with only the query as input.

Researchers at Google recently published a really cool result [1] that goes a long way towards understanding the known phenomenon of in-context learning (ICL) in large language models. The paper is titled Learning without training: the implicit dynamics of in-context learning.

As first clearly shown by GPT-3 [2], ICL is the capability of a language model to learn to perform a task from examples in its prompt or context without updating its parameters—hence in-context learning. As an example, I just asked ChatGPT:

“if 2+3 = 10 and 4+2 = 12, what is 2+8?”

It is unlikely (though possible) that this exact task appeared in the pretraining data, and yet ChatGPT figured out from just two examples the hidden rule (double the result of the sum) and correctly answered “20”, which is quite remarkable. This happened at inference time, with no parameter updates. How is this possible?
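Spelled out, the pattern to be inferred is that each answer is twice the true sum:

\[2+3=5 \rightarrow 10, \qquad 4+2=6 \rightarrow 12, \qquad 2+8=10 \rightarrow 20.\]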

There has been a surge of studies trying to explain ICL. First among these was von Oswald et al. (2023) [3], who gave a simple construction in which a forward pass through a single linear self-attention layer is equivalent to a gradient-descent step on a regression loss over the context, thus exhibiting a form of meta-learning. Since then, many papers have generalised and extended these results [4][5][6]. However, as noted by [1], most theoretical studies have relied on highly simplified models of self-attention, for example without softmax.
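To make the flavour of that result concrete, here is a minimal numerical sketch in the spirit of [3]. It is not their actual attention-weight construction, just the identity underlying it, under the assumption of zero initial weights and an unnormalised linear-attention readout: one gradient-descent step on the in-context least-squares loss gives exactly the same prediction as a linear attention pass over the context.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 8                      # feature dimension, number of context examples
X = rng.normal(size=(n, d))      # context inputs x_i
y = rng.normal(size=n)           # context targets y_i
x_q = rng.normal(size=d)         # query input
eta = 0.1                        # learning rate

# View 1: one gradient-descent step on the in-context least-squares loss
# L(w) = 0.5 * sum_i (w @ x_i - y_i)**2, starting from w0 = 0.
w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y)        # gradient of L at w0
w1 = w0 - eta * grad             # weights after one step
pred_gd = w1 @ x_q               # prediction for the query

# View 2: an unnormalised linear-attention readout over the context,
# with keys x_i, values y_i and query x_q (no softmax, no learned projections).
pred_attn = eta * np.sum((X @ x_q) * y)

print(pred_gd, pred_attn)        # the two numbers coincide
```

Roughly speaking, the construction in [3] packs this computation into the key, query and value matrices of a linear self-attention layer.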

By contrast to these earlier approaches, the authors of [1] go the opposite way and abstract out what they see as the key property of context-aware layers such as self-attention. Remarkably, they derive a quite general result that, for transformers, can be stated as follows:

the next-token prediction of a transformer block with some context \(C\) and query token \(x\) as input is equivalent to the output of the same transformer with weights updated by the context and with only the query as input. Mathematically, this can be written as:

\[f_W(C, x) = f_{W + \Delta W(C)}(x)\]

where \(f_W(\cdot)\) is the function computed by the transformer block, whose parameters include an MLP weight matrix \(W\) (all other parameters are left implicit). This notation is not quite accurate, but it serves to get the main point across. The derivation is, in my opinion, remarkably simple and elegant.
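To give a flavour of the argument (this is my own schematic sketch, not the paper's exact statement): write \(a\) for the attention output at the query position when only the query is fed in, \(a_C\) for the attention output at the query position when the context is also present, and \(\delta = a_C - a\). Because the layer applying \(W\) is linear,

\[W a_C = W a + W \delta = \left( W + \frac{(W\delta)\, a^\top}{\|a\|^2} \right) a,\]

so the whole effect of the context can be absorbed into a rank-one update \(\Delta W(C)\) of the MLP weights, while the rest of the block is left untouched.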

Said another way, feeding a query together with some context into a transformer block turns out to be the same as feeding only the query into the same block with updated MLP weights, where the update depends on the context. The statement of the theorem is actually a bit more precise and general, so check out the paper for the details.
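Here is a quick numerical check of the rank-one identity sketched above (again my own sketch; the exact formula for \(\Delta W(C)\) is stated more carefully in [1]):

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out = 6, 5

W = rng.normal(size=(d_out, d_in))   # first MLP weight matrix of the block
a_x = rng.normal(size=d_in)          # attention output at the query position, query alone
a_cx = rng.normal(size=d_in)         # attention output at the query position, with context
delta = a_cx - a_x                   # shift induced by the context

# Rank-one update absorbing the context shift into the weights
# (schematic form from the sketch above; see [1] for the exact statement).
dW = np.outer(W @ delta, a_x) / (a_x @ a_x)

# Original weights applied to the context-aware activation vs. updated weights
# applied to the query-only activation: the two agree.
print(np.allclose(W @ a_cx, (W + dW) @ a_x))   # True
```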

The authors further derive an explicit formula for the implicit weight update and verify their results on some toy problems. They also nicely show that building up the context token by token defines an implicit gradient-descent learning dynamics on the MLP weights, which aligns with the intuition that the longer the context, the less the output (or the implicit weight update) should change with each additional token.

The work still has some limitations in that it does not consider the effect of multiple blocks or the generation of more than one token at a time. These are interesting research directions, but to my mind the result already provides a very satisfying explanation for ICL.

References

[1] Dherin, B., Munn, M., Mazzawi, H., Wunder, M., & Gonzalvo, J. (2025). Learning without training: The implicit dynamics of in-context learning. arXiv preprint arXiv:2507.16003.

[2] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.

[3] Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023, July). Transformers learn in-context by gradient descent. In International Conference on Machine Learning (pp. 35151-35174). PMLR.

[4] Ahn, K., Cheng, X., Daneshmand, H., & Sra, S. (2023). Transformers learn to implement preconditioned gradient descent for in-context learning. Advances in Neural Information Processing Systems, 36, 45614-45650.

[5] Zhang, Y., Singh, A. K., Latham, P. E., & Saxe, A. (2025). Training dynamics of in-context learning in linear attention. arXiv preprint arXiv:2501.16265.

[6] He, J., Pan, X., Chen, S., & Yang, Z. (2025). In-context linear regression demystified: Training dynamics and mechanistic interpretability of multi-head softmax attention. arXiv preprint arXiv:2503.12734.