♾️ Infinite Widths (& Depths) Part III: The Maximal Update Parameterisation (\(\mu\)P)
This is the third and last post of a short series on the infinite-width limits of deep neural networks (DNNs). In Part I, we showed that the output of a random network becomes Gaussian distributed in the infinite-width limit. Part II went beyond initialisation and showed that infinitely wide networks trained with gradient descent (GD) are essentially kernel methods.
We also saw that the main limitation of this kernel (NTK) regime is that the weights and so the layer preactivations barely move during training at large width [1][2]. This fails to capture the behaviour of practical, finite-width networks and results in worse generalisation performance.
Here, we review the Maximal Update Parameterisation (\(\mu\)P) [3], a rapidly developing, and far more practically influential, parameterisation of DNNs that effectively puts feature learning back into the infinite-width limit. I am grateful to Alexandru Meterez for helping me understand \(\mu\)P much more quickly than I would have on my own.
TL;DR
The Maximal Update Parameterisation: roughly, \(\mu\)P and its extensions are a prescription for how to scale a model such that the order of the feature updates at each layer does not vary with the model size (e.g. width and depth).
Under \(\mu\)P, it turns out that you not only get more stable training dynamics but also stable hyperparameters, meaning that optimal hyperparameters are (approximately) conserved across different model sizes. This unlocks zero-shot hyperparameter transfer [4][9]: you can tune a small model and transfer its optimal hyperparameters, such as the learning rate, to bigger (wider and/or deeper) models, resulting in major efficiencies at large scale.
\(\mu\)P
Motivated by the lack of feature learning in the NTK or “lazy” regime, [3] introduced \(\mu\)P as a parameterisation that allows for as much feature learning as possible in the infinite-width limit. By as much as possible, we mean that the features or preactivations at each layer are allowed to change as much as possible without blowing up with the width \(N\); the parameterisation is maximal (hence \(\mu\)P) in this sense. More specifically, under the NTK parameterisation the feature updates are of order \(\mathcal{O}(N^{-1/2})\), so the features remain practically unchanged during training at large width. Under \(\mu\)P, the feature updates are instead of order \(\mathcal{O}_N(1)\).
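In symbols (notation mine, simply restating the above rather than deriving it): writing \(h^\ell_t(x)\) for the preactivations of layer \(\ell\) at training step \(t\) and \(\Delta h^\ell_t(x) = h^\ell_{t+1}(x) - h^\ell_t(x)\) for their update, the typical per-coordinate size of the update scales as
\[
\frac{\lVert \Delta h^{\ell}_t(x) \rVert_2}{\sqrt{N}} \;=\;
\begin{cases}
\mathcal{O}(N^{-1/2}) & \text{NTK parameterisation}, \\
\Theta_N(1) & \mu\text{P},
\end{cases}
\]
i.e. under \(\mu\)P the features move by an amount that neither vanishes nor blows up as \(N \to \infty\).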
More formally, \(\mu\)P can be derived from the following 3 desiderata:
- the layer preactivations are \(\mathcal{O}_N(1)\) at initialisation;
- the network predictions are \(\mathcal{O}_N(1)\) during training; and
- the layer feature updates are also \(\mathcal{O}_N(1)\) during training.
These are desiderata because they are not strictly necessary or sufficient conditions, but rather properties that we would like DNNs to have in order to ensure more stable training dynamics and, as it turns out, more stable hyperparameters across scales.
Satisfying these desiderata boils down to solving a system of equations for a set of scalars (commonly referred to as “abcd”) parameterising the layer transformation, the (Gaussian) initialisation variance, and the learning rate [5][6]. Different optimisers (e.g. SGD vs Adam) and layer types (e.g. fully connected vs convolutional) lead to different “abcd” scalings. One version of \(\mu\)P rescales each layer by \(1/\sqrt{\mathtt{fan\_in}}\), except for the output, which is scaled by \(1/N\). If you read Part II of this series, you might notice that this scaling recipe is very similar to the NTK parameterisation. The only difference lies in the output scaling, which turns out to be critical: it is what allows the features to change in the infinite-width limit. [3] also showed that while in the standard parameterisation (SP) of DNNs (based on He and similar initialisations) the features do evolve, the output diverges with the width.
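To make this concrete, here is a minimal sketch (my own, not from [3]) of the forward-pass side of this recipe for a fully connected network: unit-variance Gaussian initialisation, a \(1/\sqrt{\mathtt{fan\_in}}\) multiplier on every layer, and a \(1/N\) multiplier on the output. Note that a full \(\mu\)P setup also prescribes how the (per-layer) learning rates scale with width, which depends on the optimiser and is omitted here.

```python
import math
import torch
import torch.nn as nn

class MuPMLP(nn.Module):
    """Sketch of the forward multipliers of (one version of) muP.

    All weights are initialised as standard Gaussians; each layer's output is
    rescaled by 1/sqrt(fan_in), except the readout, which is rescaled by
    1/fan_in = 1/N. The 1/N readout multiplier is the only difference from the
    NTK parameterisation. Learning-rate scalings (part of "abcd") are omitted.
    """

    def __init__(self, d_in: int, width: int, d_out: int, n_hidden: int = 2):
        super().__init__()
        dims = [d_in] + [width] * n_hidden
        self.hidden = nn.ModuleList(
            [nn.Linear(dims[i], dims[i + 1], bias=False) for i in range(n_hidden)]
        )
        self.readout = nn.Linear(width, d_out, bias=False)
        for layer in [*self.hidden, self.readout]:
            nn.init.normal_(layer.weight, mean=0.0, std=1.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.hidden:
            # 1/sqrt(fan_in) keeps the preactivations O(1) at initialisation
            x = torch.relu(layer(x) / math.sqrt(layer.in_features))
        # 1/fan_in (= 1/N) on the readout is what allows O(1) feature updates
        return self.readout(x) / self.readout.in_features
```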
Remarkably, [4] showed that in \(\mu\)P many optimal hyperparameters also remain stable as the width changes. As noted above, this means that you can tune a small model and then use the optimal hyperparameters such as the learning rate to train a bigger (i.e. wider) model, avoiding the expensive tuning at large scale.
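In practice you rarely implement these scalings by hand: the mup package (linked under Other resources below) provides drop-in PyTorch components that handle the multipliers, initialisation, and per-layer learning-rate corrections. Very roughly, and glossing over details covered in the repo's README (e.g. re-initialising parameters via mup.init), the tune-small-then-train-large workflow looks something like this; treat the exact calls as indicative rather than authoritative:

```python
import torch.nn as nn
from mup import MuReadout, MuSGD, set_base_shapes

def make_model(width: int) -> nn.Module:
    # The output layer is replaced by MuReadout so the muP output scaling
    # is handled automatically; hidden layers are ordinary nn.Linear.
    return nn.Sequential(nn.Linear(32, width), nn.ReLU(), MuReadout(width, 10))

# Record how each dimension scales by comparing against small "base"/"delta" models.
model = make_model(width=4096)
set_base_shapes(model, make_model(width=64), delta=make_model(width=128))

# MuSGD (or MuAdam) applies the width-dependent learning-rate corrections, so a
# learning rate tuned on a narrow model can be reused directly on this wide one.
optimizer = MuSGD(model.parameters(), lr=0.1)
```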
Extensions
Standard (width-only) \(\mu\)P has been extended to some local algorithms [12], sparse networks [13], second-order methods [14], and sharpness-aware minimisation [15].
Excitingly, \(\mu\)P has also been extended to depth for ResNets (“Depth-\(\mu\)P”) [7][8], such that training dynamics remain stable and hyperparameters transfer independently of the network depth \(L\) [9]. This is achieved mainly by scaling each residual block by \(1/\sqrt{L}\), and is enabled by the fact that the infinite-width and infinite-depth limits of ResNets commute [10][11].
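As a minimal illustration (my own sketch, ignoring the width-direction \(\mu\)P scalings and any architectural details such as normalisation or attention), the depth scaling amounts to dividing each residual branch by \(\sqrt{L}\):

```python
import math
import torch
import torch.nn as nn

class DepthScaledResNet(nn.Module):
    """Sketch of the Depth-muP 1/sqrt(L) residual-branch scaling."""

    def __init__(self, width: int, depth: int):
        super().__init__()
        self.depth = depth
        self.blocks = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(depth)]
        )
        for block in self.blocks:
            # 1/sqrt(fan_in) initialisation keeps each block's output O(1) in the width
            nn.init.normal_(block.weight, mean=0.0, std=1.0 / math.sqrt(width))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # dividing each residual branch by sqrt(L) keeps the sum over
            # L blocks O(1) in the depth, so the updates neither vanish
            # nor blow up as the network gets deeper
            x = x + block(torch.relu(x)) / math.sqrt(self.depth)
        return x
```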
Recently, I found that using Depth-\(\mu\)P for a local algorithm called predictive coding allowed, for the first time, stable training of 100+ layer networks [16]. See the paper and blog post for more.
Concluding thoughts
I think \(\mu\)P is amazing. It’s only a slight overstatement to say that \(\mu\)P is the only theory that has had a major impact on practice: many frontier AI labs, including OpenAI, xAI and Apple (and probably others too), make use of it. Of course, \(\mu\)P builds on previous theoretical advances, including the NTK, the theory of signal propagation in DNNs, and mean-field theories, among others.
Other resources
Besides the references below, I found the following material useful in understanding \(\mu\)P:
- Microsoft’s blog post introducing \(\mu\)P;
- this conversation with Greg Yang focused on “Tensor Programs”;
- Microsoft’s blog post on the hyperparameter transfer results;
- the mup GitHub repo (PyTorch); and
- this talk on the scaling exponents of different parameterisations.
For other reviews of \(\mu\)P, see:
See also the nanoGPT-mup GitHub repo (PyTorch).
References
[1] Chizat, L., Oyallon, E., & Bach, F. (2019). On lazy training in differentiable programming. Advances in neural information processing systems, 32.
[2] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., & Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32.
[3] Yang, G., & Hu, E. J. (2021). Tensor Programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning (pp. 11727-11737). PMLR.
[4] Yang, G., Hu, E., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., ... & Gao, J. (2021). Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34, 17084-17097.
[5] Pehlevan, C., & Bordelon, B. (2023). Lecture Notes on Infinite-Width Limits of Neural Networks.
[6] Yang, G., & Littwin, E. (2023). Tensor Programs IVb: Adaptive optimization in the infinite-width limit. arXiv preprint arXiv:2308.01814.
[7] Yang, G., Yu, D., Zhu, C., & Hayou, S. (2023). Tensor Programs VI: Feature learning in infinite-depth neural networks. arXiv preprint arXiv:2310.02244.
[8] Bordelon, B., Noci, L., Li, M. B., Hanin, B., & Pehlevan, C. (2023). Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit. arXiv preprint arXiv:2309.16620.
[9] Noci, L., Meterez, A., Hofmann, T., & Orvieto, A. (2024). Super consistency of neural network landscapes and learning rate transfer. Advances in Neural Information Processing Systems, 37, 102696-102743.
[10] Hayou, S. (2024). Commutative Scaling of Width and Depth in Deep Neural Networks. Journal of Machine Learning Research, 25(299), 1-41.
[11] Hayou, S., & Yang, G. (2023, July). Width and depth limits commute in residual networks. In International Conference on Machine Learning (pp. 12700-12723). PMLR.
[12] Ishikawa, S., Yokota, R., & Karakida, R. (2024). Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation. arXiv preprint arXiv:2411.02001.
[13] Dey, N., Bergsma, S., & Hestness, J. (2024). Sparse maximal update parameterization: A holistic approach to sparse training dynamics. arXiv preprint arXiv:2405.15743.
[14] Ishikawa, S., & Karakida, R. (2023). On the parameterization of second-order optimization effective towards the infinite width. arXiv preprint arXiv:2312.12226.
[15] Haas, M., Xu, J., Cevher, V., & Vankadara, L. C. Effective Sharpness Aware Minimization Requires Layerwise Perturbation Scaling. In High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning.
[16] Innocenti, F., Achour, E. M., & Buckley, C. L. (2025). $\mu$PC: Scaling Predictive Coding to 100+ Layer Networks. arXiv preprint arXiv:2505.13124.