Looking forward to hearing it! Happy to help if there is more you need.
Always happy to compare to good and interesting work :)
Our experiments use absolute coordinates. It would probably work about the same with normalized coordinates, but I reckon it would require fiddling a bit with the initialization range of the projection matrix (W_p) if you prefer normalized.
Thanks! I am not too familiar with those tasks, but no, I don't think it would be hard to test, and it would be quite interesting to do. Our code is available, and the implementation is plug-and-play with standard attention: you only need to give it the nD position of each token.
And here it is. Maybe something along the lines of what you were thinking of? Designed directly for vision, tested on 2D, 2D-T, 3D, and multi-modal data, and it extrapolates very well.
Paper: arxiv.org/abs/2602.01418
Website: chrisohrstrom.github.io/parabolic-po...
Code: github.com/DTU-PAS/para...
Where to Attend: A Principled Vision-Centric Position Encoding with Parabolas
Paper: arxiv.org/abs/2602.01418
Website: chrisohrstrom.github.io/parabolic-po...
Code: github.com/DTU-PAS/para...
@rgring.bsky.social @lanalpa.bsky.social
What if position encodings were designed for vision from scratch? We introduce PaPE: Parabolic Position Encoding. It outperforms RoPE on 7/8 datasets and extrapolates to higher resolutions without fine-tuning or position interpolation. Paper, code, and website in thread 🧵
Actually working on a principled encoding for 2D, 2D-T, and 3D. Coming soon in a couple of weeks ;)
Congratulations. You are now officially Danish.
Thanks to my collaborators @rgring.bsky.social @lanalpa.bsky.social.
Try it out for yourself: github.com/DTU-PAS/spik...
We also get much smaller input sizes, with up to a 6.9x reduction over voxels and up to an 8.9x reduction over frames.
Results are pretty good: inference speedups of up to 3.4x over voxels for a point cloud network and up to 10.4x over frames for a Transformer.
This comes without sacrificing accuracy. We even outperform voxels and frames in most cases on gesture recognition and object detection.
Spiking Patches works by creating a grid of patches and letting each patch act as a spiking neuron. A patch increases its potential whenever an event arrives within it, and a token is created every time a patch spikes (i.e., when its potential exceeds a threshold).
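For intuition, the patch-spiking mechanism described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: the patch size, threshold, event format (x, y, t, polarity), and the idea of emitting the buffered events as the token content are all assumptions for the sake of the example; see the repo for the real thing.

```python
from collections import defaultdict

def spiking_patches(events, patch_size=16, threshold=8):
    """Toy sketch: each patch in a grid acts as a spiking neuron that
    integrates incoming events and emits a token when it spikes."""
    potential = defaultdict(int)   # per-patch "membrane potential"
    buffer = defaultdict(list)     # events accumulated per patch
    tokens = []
    for (x, y, t, p) in events:
        key = (x // patch_size, y // patch_size)  # which patch the event falls in
        buffer[key].append((x, y, t, p))
        potential[key] += 1                        # each event raises the potential
        if potential[key] >= threshold:            # patch spikes -> emit a token
            tokens.append((key, buffer[key]))
            potential[key] = 0                     # reset after spiking
            buffer[key] = []
    return tokens

# 8 events landing in the same 16x16 patch with threshold 8 -> one token
tokens = spiking_patches([(i, i, i, 1) for i in range(8)])
```

Because tokens are only emitted where and when events actually arrive, this preserves both the asynchrony and the spatial sparsity of the event stream.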
We achieve this through tokenization of events. Our tokenizer is called Spiking Patches.
Something cool is that the tokens are compatible with GNNs, PCNs, and Transformers.
This is the first time that tokenization has been applied to events, and we hope to encourage more work in this direction.
What if we could represent events (from event cameras) in a way that preserves both asynchrony and spatial sparsity?
Excited to share our latest work, where we answer this question positively.
Spiking Patches: Asynchronous, Sparse, and Efficient Tokens for Event Cameras
Paper: arxiv.org/abs/2510.26614
How are external links to be understood? Is it okay, for example, to link to a video (not our own) with examples of a concept that we describe as a preliminary?
Can Dynamic Neural Networks boost Computer Vision and Sensor Fusion?
We are very happy to share this awesome collection of papers on the topic!
True. Not much of an issue on small codebases. Mostly just feels better with a snappier formatter for those.
black is great, but I prefer Ruff because of its speed, and it is also a really nice linter. docs.astral.sh/ruff/
The inventors of flow matching have released a comprehensive guide going over the math & code of flow matching!
Also covers variants like non-Euclidean & discrete flow matching.
A PyTorch library is also released with this guide!
This looks like a very good read! 🔥
arxiv: arxiv.org/abs/2412.06264
Maybe this is it? arxiv.org/abs/2409.10202 @jakubgregorek.bsky.social