Posts by Tags

💭 My experience as an Applied Scientist Intern at Amazon

7 minute read

Published: April 27, 2024

♾️ Infinite Widths Part I: Neural Networks as Gaussian Processes

6 minute read

Published: November 16, 2024

This is the first post of a short series on the infinite-width limits of deep neural networks (DNNs). We start by reviewing the correspondence between neural networks and Gaussian Processes (GPs).

Thermodynamic Natural Gradient Descent

7 minute read

Published: July 19, 2024

I recently came across the paper "Thermodynamic Natural Gradient Descent" by Normal Computing. I found it very interesting, so below is my brief take on it.

Predictive Coding as a 2nd-Order Method

10 minute read

Published: August 10, 2023

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons, which in some cases allows it to converge faster than backpropagation with standard stochastic gradient descent.

KANs Made Simple

2 minute read

Published: October 09, 2024

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

PhD Reflections

17 minute read

Published: October 21, 2025

Having recently submitted my PhD thesis, I’ve been thinking a lot about my PhD experience. Here I would like to share some reflections. Needless to say, this is my own, biased experience, and PhDs can vary greatly depending on the field, lab, supervisor, etc.

Can We Scale Predictive Coding? or Why the Brain Might Be Much Wider Than Deep

10 minute read

Published: May 29, 2026

📖 TL;DR: The gradients computed by predictive coding converge to those of backpropagation for networks that are much wider than they are deep (like the brain), under stable parameterisations.

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published: May 20, 2025

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

⛰️ The Energy Landscape of Predictive Coding Networks

9 minute read

Published: October 01, 2024

📖 TL;DR: Predictive coding makes the loss landscape of feedforward neural networks more benign and robust to vanishing gradients.

♾️ Infinite Widths (& Depths) Part III: The Maximal Update Parameterisation (\(\mu\)P)

8 minute read

Published: April 09, 2025

This is the third and last post of a short series on the infinite-width limits of deep neural networks (DNNs). In Part I, we showed that the output of a random network becomes Gaussian distributed in the infinite-width limit. Part II went beyond initialisation and showed that infinitely wide nets trained with gradient descent (GD) are basically kernel methods.

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

Published: February 20, 2025

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian Processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.

Energy-based Transformers

5 minute read

Published: July 18, 2025

📖 TL;DR: Energy-based Transformers (EBTs) learn a scalar energy function parameterised by a transformer. Empirically, EBTs show promising scaling and reasoning properties on both language and vision tasks.

In-Context Learning Demystified?

4 minute read

Published: August 01, 2025

📖 TL;DR: A transformer block implicitly uses the input context to modify its MLP weights.

Francesco Innocenti

Posts by Tags

Amazon

Bayesian inference

Bayesian neural networks

Fisher information

Gaussian processes

KAN

Kolmogorov-Arnold networks

Kolmogorov-Arnold representation theorem

Normal Computing

PhD

applied scientist

backpropagation

central limit theorem

deep information propagation

deep neural networks

depth-mup

dynamical mean field theory

energy-based models

energy-based transformers

feature learning

gradient descent

hyperparameter transfer

implicit gradient descent dynamics

in-context learning

industry

inference as optimisation

inference learning

infinite depth

infinite width

infinite width limit

internship

interpretability

kernel methods

large language models

lazy learning

linear regime

local learning

loss landscape

machine learning

maximal update parameterisation

muP

multi-layer perceptrons

mup

natural gradient descent

neural scaling laws

neural tangent kernel

optimisation theory

predictive coding

reflections

research

rich regime

saddle points

saddles

second-order method

second-order methods

splines

system-2 thinking

tensor programs