Blog posts

2025

In-Context Learning Demystified

4 minute read

📖 TL;DR: the next-token prediction of a transformer block given some context and a query as input is equivalent to the output of the same block, with its weights updated by the context, given only the query as input.
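
As a quick illustration of this equivalence, here is a toy numerical check for a single unnormalized linear-attention head, a simplification of the full softmax transformer block discussed in the post; the matrices, dimensions, and token counts below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # embedding dimension (arbitrary)
n_ctx = 5      # number of context tokens (arbitrary)

# random projections of a single (unnormalized) linear-attention head
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

C = rng.normal(size=(n_ctx, d))   # context tokens
x_q = rng.normal(size=d)          # query token

def linear_attention(tokens, query):
    """Unnormalized linear attention: sum_i (k_i . q) v_i."""
    q = W_Q @ query
    K = tokens @ W_K.T            # rows are keys k_i
    V = tokens @ W_V.T            # rows are values v_i
    return V.T @ (K @ q)

# (1) full pass: attention over context + query
full = linear_attention(np.vstack([C, x_q]), x_q)

# (2) query-only pass with an implicit weight update built from the context:
#     delta_W = W_V (sum_c c c^T) W_K^T, applied to W_Q x_q
delta_W = W_V @ (C.T @ C) @ W_K.T
updated = linear_attention(x_q[None, :], x_q) + delta_W @ (W_Q @ x_q)

assert np.allclose(full, updated)   # the two computations agree exactly
```

In this toy setting the context's entire effect is absorbed into `delta_W`, which is the sense in which the context "updates the weights" while only the query is fed forward.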

Energy-based Transformers

3 minute read

📖 TL;DR: Energy-based Transformers (EBTs) learn a scalar energy function parameterised by a transformer. Empirically, EBTs show promising scaling and reasoning properties on both language and vision tasks.
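
For a flavour of what learning a scalar energy function buys you at inference time, here is a minimal PyTorch sketch in which a small MLP stands in for the transformer energy and a prediction is obtained by gradient descent on the candidate output; the architecture, sizes, and step sizes are placeholders, and the training procedure from the paper is not shown.

```python
import torch

# A stand-in energy function E(x, y) -> scalar (a tiny MLP, not a real
# transformer; assumed architecture for illustration only).
energy = torch.nn.Sequential(
    torch.nn.Linear(6, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1)
)

def predict(x, n_steps=50, lr=0.1):
    """Inference as optimization: start from a candidate y and follow
    the gradient of the energy with respect to y."""
    y = torch.zeros(2, requires_grad=True)
    opt = torch.optim.SGD([y], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        E = energy(torch.cat([x, y])).squeeze()   # scalar energy E(x, y)
        E.backward()                              # gradient w.r.t. y only
        opt.step()
    return y.detach()

x = torch.randn(4)
y_hat = predict(x)          # prediction = (approximate) energy minimizer
print(y_hat, energy(torch.cat([x, y_hat])).item())
```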

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.
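
A quick way to see that NNGP statement numerically is to sample many random one-hidden-layer networks at increasing widths and inspect the distribution of the output at a fixed input; the tanh nonlinearity, standard-normal weights, and 1/sqrt(width) readout scaling below are illustrative choices, not the post's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # a fixed input, d = 3

def random_net_output(width):
    """One-hidden-layer net with 1/sqrt(width) readout scaling."""
    W1 = rng.normal(size=(width, x.size))
    w2 = rng.normal(size=width)
    h = np.tanh(W1 @ x)                # hidden activations
    return w2 @ h / np.sqrt(width)     # scaled scalar output

for width in (1, 10, 100, 1000):
    s = np.array([random_net_output(width) for _ in range(5000)])
    # excess kurtosis -> 0 as the output distribution becomes Gaussian
    kurt = np.mean((s - s.mean()) ** 4) / s.std() ** 4 - 3
    print(f"width={width:5d}  std={s.std():.3f}  excess kurtosis={kurt:+.3f}")
```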

2024

KANs Made Simple

2 minute read

🤔 Confused about the recent KAN: Kolmogorov-Arnold Networks paper? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).
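
To make the contrast concrete, here is a toy forward pass of one MLP layer next to one KAN-style layer; the polynomial basis stands in for the B-splines used in the paper, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_basis = 3, 2, 5
x = rng.normal(size=d_in)

# MLP layer: learnable linear weights, fixed nonlinearity on the nodes
W = rng.normal(size=(d_out, d_in))
mlp_out = np.tanh(W @ x)

# KAN-style layer: a learnable 1-D function on every edge (i -> j),
# expanded here in a fixed polynomial basis (a stand-in for B-splines)
def basis(t):
    return np.array([t ** k for k in range(n_basis)])

coeffs = rng.normal(size=(d_out, d_in, n_basis))   # learnable per-edge coefficients
kan_out = np.array([
    sum(coeffs[j, i] @ basis(x[i]) for i in range(d_in))
    for j in range(d_out)
])
```

The learnable parameters sit in different places: the MLP learns the linear map `W` and keeps `tanh` fixed, whereas the KAN layer learns a separate 1-D function (the coefficients in `coeffs`) on every input-to-output edge and then simply sums them.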

2023

🧠 Predictive Coding as a 2nd-Order Method

10 minute read

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on the neurons, which in some cases allows it to converge faster than backpropagation with standard stochastic gradient descent.
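
As a reminder of the mechanics behind that statement, here is a toy predictive-coding step for a two-layer linear network: the hidden activities are first relaxed with 1st-order gradient updates on the energy, and the weights are then updated using the relaxed activities. Layer sizes, step sizes, and the number of relaxation steps are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1, d2 = 4, 6, 3
W1 = rng.normal(size=(d1, d0)) * 0.1
W2 = rng.normal(size=(d2, d1)) * 0.1
x0, y = rng.normal(size=d0), rng.normal(size=d2)   # input and target

# predictive-coding energy for a 2-layer linear network (toy version):
#   F = 0.5 ||x1 - W1 x0||^2 + 0.5 ||y - W2 x1||^2
x1 = W1 @ x0                                 # initialise at the feedforward pass
for _ in range(100):                         # 1st-order (gradient) updates on neurons
    eps1 = x1 - W1 @ x0                      # prediction error at the hidden layer
    eps2 = y - W2 @ x1                       # prediction error at the output
    x1 -= 0.1 * (eps1 - W2.T @ eps2)         # gradient step on dF/dx1

# weight updates computed with the relaxed activities
W1 += 0.01 * np.outer(x1 - W1 @ x0, x0)      # -dF/dW1
W2 += 0.01 * np.outer(y - W2 @ x1, x1)       # -dF/dW2
```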