[10/🧵] This work is the result of an amazing team effort w/ Julius von Kügelgen, Alain Ryser, Thomas Sutter, Bernhard Schölkopf, Julia Vogt
arXiv: arxiv.org/abs/2502.06314
👩‍💻 Code: github.com/alicebizeul/...
[9/🧵] As a result, PMAE's masking ratio becomes a more interpretable and robust hyperparameter!
In MAEs, the optimal ratio varies across datasets. In PMAE, we show that masking PCs that account for 20% of the data variance consistently yields near-optimal performance.
[8/🧵] What about the masking ratio?
In MAEs, this ratio represents the proportion of masked-out pixels.
In PMAE, we make the masking ratio more data-driven by leveraging PCA. The masking ratio now reflects the proportion of data variance captured by the set of masked PCs.
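
For intuition, here is a minimal NumPy sketch of a variance-based masking ratio. It is an illustration, not the released code: the helper names `explained_variance_ratio` and `sample_masked_pcs` are made up for this example, random data stands in for flattened images, and drawing PCs in random order until the variance budget is reached is an assumption about how the masked set could be picked.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each PC of X (n_samples, n_features)."""
    Xc = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional to per-PC variances.
    S = np.linalg.svd(Xc, compute_uv=False)
    var = S ** 2
    return var / var.sum()

def sample_masked_pcs(var_ratio, target=0.2, rng=None):
    """Draw PCs in random order until their summed variance share reaches `target`.

    Assumption for illustration only: the thread does not say how the
    masked set is selected.
    """
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(var_ratio))
    cum = np.cumsum(var_ratio[order])
    # Shortest prefix of the shuffled order whose cumulative share is >= target.
    k = min(int(np.searchsorted(cum, target)) + 1, len(order))
    return np.sort(order[:k])

# Toy usage: random data standing in for flattened images.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
ratio = explained_variance_ratio(X)
masked = sample_masked_pcs(ratio, target=0.2, rng=rng)
print(masked, ratio[masked].sum())
```

The point of the sketch: the hyperparameter is a variance budget, so the number of masked PCs adapts to the dataset's spectrum instead of being a fixed pixel count.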
[7/🧵] We show that PMAE outperforms MAEs in downstream image classification on CIFAR10, TinyImageNet and MedMNIST datasets.
Using a ViT-Tiny, we observe an average 38% improvement in linear probing performance compared to MAEs with the standard 75% masking ratio.
[6/🧵] However, instead of working with a subset of pixels, the ViT processes the original image with a subset of its principal components (PCs) masked out. The model is then trained to output images that, when projected onto the masked PCs, match the ground truth.
[5/🧵] Our approach, Principal Masked Autoencoders (PMAE), closely follows the design of the Masked Autoencoder (MAE): a Vision Transformer (ViT) encoder-decoder is trained to reconstruct missing information from the visible parts.
[4/🧵] We posit that this reduces the redundancy between visible and masked-out information and ensures the visible information is predictive of masked-out components.
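
To make this concrete, here is a minimal NumPy sketch of the two pieces involved: building an input with some PCs removed, and a loss that compares prediction and ground truth only along the masked PCs. It is a sketch under assumptions, not the authors' implementation: `pca_basis`, `mask_pcs` and `masked_pc_loss` are hypothetical helpers, images are treated as flattened vectors, the loss is a plain mean squared error, and the ViT is left out (the "prediction" is a stand-in).

```python
import numpy as np

def pca_basis(X):
    """Mean and orthonormal principal directions (as rows) of X (n_samples, n_features)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt

def mask_pcs(x, mean, components, masked_idx):
    """Return x in pixel space with the selected PCs zeroed out."""
    coeffs = (x - mean) @ components.T      # coordinates in the PC basis
    coeffs[..., masked_idx] = 0.0           # drop the masked components
    return coeffs @ components + mean       # back to pixel space

def masked_pc_loss(pred, target, mean, components, masked_idx):
    """MSE between pred and target after projecting both onto the masked PCs.

    `mean` is kept for a uniform signature; it cancels in the difference.
    """
    P = components[masked_idx]              # (n_masked, n_features)
    diff = (pred - target) @ P.T
    return float(np.mean(diff ** 2))

# Toy usage: random data standing in for flattened images.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 12))
mean, comps = pca_basis(X)
masked_idx = np.array([1, 4, 7])            # arbitrary choice for the demo

x_visible = mask_pcs(X, mean, comps, masked_idx)   # what the encoder would see
pred = x_visible                                   # stand-in for the model output
print(masked_pc_loss(pred, X, mean, comps, masked_idx))
```

Because the loss only looks at the masked directions, anything the model outputs along the visible PCs does not affect it.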
[3/🧵] Need a refresher on PCA?
For natural images, projecting data onto its principal components partitions the information into a set of global features.
By masking principal components instead of raw pixels, we effectively mask more global rather than local features.
[2/🧵] What if, instead of masking pixels, we mask information in a more meaningful space using off-the-shelf image transformations?
We keep it simple: we consider the space of principal components and reconstruct masked-out principal components instead of raw pixels.
[1/🧵] Unlike text, images are not compact representations. Masking and reconstructing 75% of raw pixels, a common practice in MIM, can thus lead to failure cases:
- Visible pixels may be redundant with the masked ones.
- Visible pixels may be unpredictive of the masked regions.
✨ New Preprint ✨ Ever thought that reconstructing masked pixels for image representation learning seems sub-optimal?
In our new preprint, we show how masking principal components, rather than raw pixel patches, improves Masked Image Modelling (MIM).
Find out more below 🧵