This way, even if the features with the highest variance aren't present, we can still identify the dog using the less overt features.
Or, we can use an optimizer that normalizes the features implicitly. Somewhere in the gradients are all the features related to the dog: fur, body shape, leash, context... If we whiten the gradient, we get closer to the model learning all the dog-related features equally.
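To make the intuition concrete, here's a toy NumPy sketch of what whitening does to a batch of gradients (this is just the textbook inverse-square-root-of-covariance transform, not PSGD's actual preconditioner): one direction starts out with 100x the scale of the other, and after whitening every direction has unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "gradients": one high-variance direction (face/ears) and one
# low-variance direction (fur texture, context), both informative.
G = rng.normal(size=(1000, 2)) * np.array([10.0, 0.1])

# Whitening: eigendecompose the gradient covariance and rescale each
# eigendirection to unit variance, so no single direction dominates.
cov = G.T @ G / len(G)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # inverse matrix square root
G_white = G @ W

print(np.var(G, axis=0))        # wildly different scales
print(np.var(G_white, axis=0))  # ~1.0 in every direction
```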
Only learning these high-variance features makes a poor model, though, because as soon as those features aren't visible it has trouble identifying the dog. We battle this explicitly by cropping pictures (data aug) so the model is forced to learn other parts of the dog.
If we think about it from the perspective of image classification and are trying to identify a dog, the most identifiable features are likely the face, ears, or tail, which have high variance. These high-variance features will stand out most in the gradient, so the model will mainly learn these.
Many second-order optimizers aim to whiten the gradient, which scales each direction in the gradient to unit length. But why is this useful?
In a world of tuning, I wanted to see how PSGD Kron would fare without any tuning whatsoever on some Atari RL. I plugged it into CleanRL PPO with defaults and the same LR as Adam, and it did quite well, check out some graphs! W&B report: api.wandb.ai/links/evanat...
I implemented mLSTM from the xLSTM paper from @HochreiterSepp and team in JAX, it can more or less be used in place of attention. Haven't done a lot of experiments with it yet, if you give it a try please report back!
github.com/evanatyourse...
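The real implementation is in the repo above; as a rough illustration, here's my reading of the per-step mLSTM recurrence from the xLSTM paper in plain NumPy: a matrix memory updated with an outer product, plus a normalizer state, with the paper's gate stabilization omitted for brevity (so this version can overflow on long sequences).

```python
import numpy as np

def mlstm_step(x, C, n, params):
    """One mLSTM recurrence step (simplified; stabilization omitted)."""
    Wq, Wk, Wv, wi, wf, Wo = params
    d = Wq.shape[0]
    q = Wq @ x
    k = (Wk @ x) / np.sqrt(d)
    v = Wv @ x
    i = np.exp(wi @ x)                    # exponential input gate (scalar)
    f = 1.0 / (1.0 + np.exp(-(wf @ x)))   # sigmoid forget gate (scalar)
    o = 1.0 / (1.0 + np.exp(-(Wo @ x)))   # output gate (vector)
    C = f * C + i * np.outer(v, k)        # matrix memory update
    n = f * n + i * k                     # normalizer state
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)
    return o * h_tilde, C, n

d = 4
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)] + \
         [rng.normal(scale=0.1, size=d) for _ in range(2)] + \
         [rng.normal(scale=0.1, size=(d, d))]
C, n = np.zeros((d, d)), np.zeros(d)
for x in rng.normal(size=(6, d)):         # run a short toy sequence
    h, C, n = mlstm_step(x, C, n, params)
print(h.shape)
```

Because the state per step is just (C, n) rather than a growing KV cache, the recurrence can be applied step by step or parallelized over the sequence, which is why it can stand in for attention.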
Hi @clementpoiret.bsky.social, I am one of the co-authors of PSGD from 2022, and am actively working on PSGD Kron with Xilin and @evanatyourservice.bsky.social. Glad you are excited about PSGD Kron!
Just put together a starter pack for Deep Learning Theory. Let me know if you'd like to be included or suggest someone to add to the list!
go.bsky.app/2qnppia
woah
PSGD ❤️ MARS
MARS is an exciting new variance reduction technique from @quanquangu.bsky.social's group that can help stabilize and accelerate your deep learning pipeline. All that is needed is a gradient buffer. Here MARS speeds up the convergence of PSGD, ultimately leading to a better solution.
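The "all you need is a gradient buffer" part can be sketched in a few lines. This is my hedged reading of the MARS correction (the exact coefficients and clipping rule are from my memory of the paper, so treat them as illustrative): nudge the current gradient by a scaled difference against the buffered previous gradient, clip to unit norm, and hand the result to whatever base optimizer you like.

```python
import numpy as np

def mars_correct(g, g_prev, gamma=0.025, beta1=0.95):
    """MARS-style variance-reduced gradient (sketch): add a scaled
    difference with the buffered previous gradient, then clip the
    result to unit norm before the base optimizer sees it."""
    c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)
    norm = np.linalg.norm(c)
    return c / norm if norm > 1.0 else c

g_prev = np.zeros(3)                      # the only extra state: a gradient buffer
for g in [np.array([0.5, -0.2, 0.1]), np.array([0.4, -0.1, 0.2])]:
    c = mars_correct(g, g_prev)
    g_prev = g
    # c now replaces g in PSGD / Adam / any base optimizer update
print(np.linalg.norm(c) <= 1.0)
```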
Thanks Zhipeng, glad to be a part!