This way, even if the features with the highest variance aren't present, we can still identify the dog using the less overt features.
Or, we can use an optimizer that normalizes the features implicitly. Somewhere in the gradients are all the features related to the dog: fur, body shape, leash, context... If we whiten the gradient, we get closer to the model learning all the dog-related features equally.
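To make the intuition concrete, here's a toy NumPy sketch of what whitening does to a batch of gradients (this is just the textbook inverse-square-root-of-covariance transform, not PSGD's actual preconditioner): one direction starts out with 100x the scale of the other, and after whitening every direction has unit variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "gradients": one high-variance direction (face/ears) and one
# low-variance direction (fur texture, context), both informative.
G = rng.normal(size=(1000, 2)) * np.array([10.0, 0.1])

# Whitening: eigendecompose the gradient covariance and rescale each
# eigendirection to unit variance, so no single direction dominates.
cov = G.T @ G / len(G)
eigvals, eigvecs = np.linalg.eigh(cov)
W = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T  # inverse matrix square root
G_white = G @ W

print(np.var(G, axis=0))        # wildly different scales
print(np.var(G_white, axis=0))  # ~1.0 in every direction
```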
Only learning these high-variance features makes a poor model, though, because as soon as those features aren't visible it has trouble identifying the dog. We battle this explicitly by cropping pictures (data aug) so the model is forced to learn other parts of the dog.
If we think about it from the perspective of image classification and are trying to identify a dog, the most identifiable features are likely the face, ears, or tail, which have high variance. These high-variance features will stand out most in the gradient, so the model will mainly learn these.
Many second-order optimizers aim to whiten the gradient, which scales each direction in the gradient to unit length. But why is this useful?
In a world of tuning, I wanted to see how PSGD Kron would fare without any tuning whatsoever on some Atari RL. I plugged it into CleanRL PPO with defaults and the same LR as Adam, and it did quite well, check out some graphs! W&B report: api.wandb.ai/links/evanat...
I implemented mLSTM from the xLSTM paper from @HochreiterSepp and team in JAX, it can more or less be used in place of attention. Haven't done a lot of experiments with it yet, if you give it a try please report back!
github.com/evanatyourse...
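The real implementation is in the repo above; as a rough illustration, here's my reading of the per-step mLSTM recurrence from the xLSTM paper in plain NumPy: a matrix memory updated with an outer product, plus a normalizer state, with the paper's gate stabilization omitted for brevity (so this version can overflow on long sequences).

```python
import numpy as np

def mlstm_step(x, C, n, params):
    """One mLSTM recurrence step (simplified; stabilization omitted)."""
    Wq, Wk, Wv, wi, wf, Wo = params
    d = Wq.shape[0]
    q = Wq @ x
    k = (Wk @ x) / np.sqrt(d)
    v = Wv @ x
    i = np.exp(wi @ x)                    # exponential input gate (scalar)
    f = 1.0 / (1.0 + np.exp(-(wf @ x)))   # sigmoid forget gate (scalar)
    o = 1.0 / (1.0 + np.exp(-(Wo @ x)))   # output gate (vector)
    C = f * C + i * np.outer(v, k)        # matrix memory update
    n = f * n + i * k                     # normalizer state
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)
    return o * h_tilde, C, n

d = 4
rng = np.random.default_rng(0)
params = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)] + \
         [rng.normal(scale=0.1, size=d) for _ in range(2)] + \
         [rng.normal(scale=0.1, size=(d, d))]
C, n = np.zeros((d, d)), np.zeros(d)
for x in rng.normal(size=(6, d)):         # run a short toy sequence
    h, C, n = mlstm_step(x, C, n, params)
print(h.shape)
```

Because the state per step is just (C, n) rather than a growing KV cache, the recurrence can be applied step by step or parallelized over the sequence, which is why it can stand in for attention.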
Hi @clementpoiret.bsky.social, I am one of the co-authors of PSGD from 2022, and am actively working on PSGD Kron with Xilin and @evanatyourservice.bsky.social. Glad you are excited about PSGD Kron!
Just put together a starter pack for Deep Learning Theory. Let me know if you'd like to be included or suggest someone to add to the list!
go.bsky.app/2qnppia
woah
PSGD ❤️ MARS
MARS is an exciting new variance reduction technique from @quanquangu.bsky.social's group that can help stabilize and accelerate your deep learning pipeline. All that is needed is a gradient buffer. Here MARS speeds up the convergence of PSGD, ultimately leading to a better solution.
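The "all you need is a gradient buffer" part can be sketched in a few lines. This is my hedged reading of the MARS correction (the exact coefficients and clipping rule are from my memory of the paper, so treat them as illustrative): nudge the current gradient by a scaled difference against the buffered previous gradient, clip to unit norm, and hand the result to whatever base optimizer you like.

```python
import numpy as np

def mars_correct(g, g_prev, gamma=0.025, beta1=0.95):
    """MARS-style variance-reduced gradient (sketch): add a scaled
    difference with the buffered previous gradient, then clip the
    result to unit norm before the base optimizer sees it."""
    c = g + gamma * (beta1 / (1.0 - beta1)) * (g - g_prev)
    norm = np.linalg.norm(c)
    return c / norm if norm > 1.0 else c

g_prev = np.zeros(3)                      # the only extra state: a gradient buffer
for g in [np.array([0.5, -0.2, 0.1]), np.array([0.4, -0.1, 0.2])]:
    c = mars_correct(g, g_prev)
    g_prev = g
    # c now replaces g in PSGD / Adam / any base optimizer update
print(np.linalg.norm(c) <= 1.0)
```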
Thanks Zhipeng, glad to be a part!