Posts by Tags

Amazon

Bayesian inference

Bayesian neural networks

Fisher information

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

Gaussian processes

KAN

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

Kolmogorov-Arnold networks

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

Kolmogorov-Arnold representation theorem

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

Normal Computing

PhD

PhD Reflections

17 minute read

Published:

Having recently submitted my PhD thesis, I’ve been thinking a lot about my PhD experience. Here I would like to share some reflections. Needless to say, this is my own, biased experience, and PhDs can vary greatly depending on the field, lab, supervisor, etc.

applied scientist

backpropagation

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

central limit theorem

deep information propagation

deep neural networks

PhD Reflections

17 minute read

Published:

Having recently submitted my PhD thesis, I’ve been thinking a lot about my PhD experience. Here I would like to share some reflections. Needless to say, this is my own, biased experience, and PhDs can vary greatly depending on the field, lab, supervisor, etc.

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

Published:

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian Processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

depth-mup

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

dynamical mean field theory

energy-based models

Energy-based Transformers

5 minute read

Published:

📖 TL;DR: Energy-based Transformers (EBTs) learn a scalar energy function parameterised by a transformer. Empirically, EBTs show promising scaling and reasoning properties on both language and vision tasks.

energy-based transformers

Energy-based Transformers

5 minute read

Published:

📖 TL;DR: Energy-based Transformers (EBTs) learn a scalar energy function parameterised by a transformer. Empirically, EBTs show promising scaling and reasoning properties on both language and vision tasks.

feature learning

gradient descent

hyperparameter transfer

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

implicit gradient descent dynamics

In-Context Learning Demystified?

4 minute read

Published:

📖 TL;DR: a transformer block implicitly uses the input context to modify its MLP weights.

in-context learning

In-Context Learning Demystified?

4 minute read

Published:

📖 TL;DR: a transformer block implicitly uses the input context to modify its MLP weights.

industry

inference as optimisation

Energy-based Transformers

5 minute read

Published:

📖 TL;DR: Energy-based Transformers (EBTs) learn a scalar energy function parameterised by a transformer. Empirically, EBTs show promising scaling and reasoning properties on both language and vision tasks.

inference learning

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

infinite depth

infinite width

infinite width limit

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

Published:

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian Processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.

internship

interpretability

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

kernel methods

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

Published:

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian Processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.

large language models

In-Context Learning Demystified?

4 minute read

Published:

📖 TL;DR: a transformer block implicitly uses the input context to modify its MLP weights.

lazy learning

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

Published:

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian Processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.

linear regime

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

Published:

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian Processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.

local learning

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

loss landscape

machine learning

maximal update parameterisation

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

muP

multi-layer perceptrons

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

mup

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

natural gradient descent

neural scaling laws

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

neural tangent kernel

♾️ Infinite Widths Part II: The Neural Tangent Kernel

7 minute read

Published:

This is the second post of a short series on the infinite-width limits of deep neural networks (DNNs). Previously, we reviewed the correspondence between neural networks and Gaussian Processes (NNGP), showing that, as the number of neurons in the hidden layers grows to infinity, the output of a random network becomes Gaussian distributed.

optimisation theory

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

predictive coding

PhD Reflections

17 minute read

Published:

Having recently submitted my PhD thesis, I’ve been thinking a lot about my PhD experience. Here I would like to share some reflections. Needless to say, this is my own, biased experience, and PhDs can vary greatly depending on the field, lab, supervisor, etc.

Scaling Predictive Coding to 100+ Layer Networks

5 minute read

Published:

📖 TL;DR: We introduce \(\mu\)PC, a reparameterisation of predictive coding networks that enables stable training of 100+ layer ResNets on simple tasks with zero-shot hyperparameter transfer.

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

reflections

PhD Reflections

17 minute read

Published:

Having recently submitted my PhD thesis, I’ve been thinking a lot about my PhD experience. Here I would like to share some reflections. Needless to say, this is my own, biased experience, and PhDs can vary greatly depending on the field, lab, supervisor, etc.

research

PhD Reflections

17 minute read

Published:

Having recently submitted my PhD thesis, I’ve been thinking a lot about my PhD experience. Here I would like to share some reflections. Needless to say, this is my own, biased experience, and PhDs can vary greatly depending on the field, lab, supervisor, etc.

rich regime

saddle points

saddles

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

second-order method

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

second-order methods

splines

KANs Made Simple

2 minute read

Published:

Confused about the recent KAN: Kolmogorov-Arnold Networks? I was too, so here’s a minimal explanation that makes it easy to see the difference between KANs and multi-layer perceptrons (MLPs).

system-2 thinking

Energy-based Transformers

5 minute read

Published:

📖 TL;DR: Energy-based Transformers (EBTs) learn a scalar energy function parameterised by a transformer. Empirically, EBTs show promising scaling and reasoning properties on both language and vision tasks.

tensor programs

thermodynamic AI

transformers

In-Context Learning Demystified?

4 minute read

Published:

📖 TL;DR: a transformer block implicitly uses the input context to modify its MLP weights.

Energy-based Transformers

5 minute read

Published:

📖 TL;DR: Energy-based Transformers (EBTs) learn a scalar energy function parameterised by a transformer. Empirically, EBTs show promising scaling and reasoning properties on both language and vision tasks.

trust region

Predictive Coding as a 2nd-Order Method

10 minute read

Published:

📖 TL;DR: Predictive coding implicitly performs a 2nd-order weight update via 1st-order (gradient) updates on neurons that in some cases allow it to converge faster than backpropagation with standard stochastic gradient descent.

vanishing gradients