
Bill Psomas

@billpsomas

MSCA Postdoctoral Fellow @ Visual Recognition Group, CTU in Prague. Deep Learning for Computer Vision. Former IARAI, Inria, Athena RC intern. Photographer. Crossfit freak. 📍 Prague, CZ. 🔗 http://users.ntua.gr/psomasbill/

539 Followers · 207 Following · 57 Posts · Joined 21.11.2024

Latest posts by Bill Psomas @billpsomas

Preview
Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? Open-vocabulary segmentation (OVS) extends the zero-shot recognition capabilities of vision-language models (VLMs) to pixel-level prediction, enabling segmentation of arbitrary categories specified by...

7/7 Resources 📄

Paper: arxiv.org/abs/2602.23339
Code: github.com/TilemahosAra...

Joint work with: @tim-arav.bsky.social, @stojnicv.xyz, Nikos Komodakis, and @gtolias.bsky.social.

We thank @noagarciad.bsky.social, @skamalas.bsky.social, @ekazakos.bsky.social for the feedback.

See you @ #CVPR2026

09.03.2026 16:19 👍 4 🔁 0 💬 0 📌 0
Post image

6/n Personalized Segmentation

🥤 RNS can easily be employed for fine-grained tasks like personalized segmentation by simply expanding the support set with a few examples of a specific instance, letting it separate that instance from its broader class.
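For concreteness, here is a minimal sketch of what expanding the support set could look like, assuming each label maps to a list of annotated examples; the names and placeholder entries are purely illustrative, not the paper's API.

```python
# Hypothetical sketch of support-set expansion for personalization.
# Assumes each label maps to a list of annotated examples (placeholders here).
support = {
    "dog": ["dog_ex1", "dog_ex2"],  # broad-class examples
    "cat": ["cat_ex1"],
}
# Add a few annotated examples of one specific dog under its own label:
support["rex"] = ["rex_ex1", "rex_ex2", "rex_ex3"]
# The same retrieve-and-classify pipeline can now separate the instance
# "rex" from its broader class "dog".
```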

09.03.2026 16:19 👍 0 🔁 0 💬 1 📌 0
Post image

5/n Bridging the Gap

⚡ RNS improves over different kinds of OVS approaches by 14.1% on average, while maintaining open-vocabulary generalization.

09.03.2026 16:19 👍 0 🔁 0 💬 1 📌 0
Post image

4/n Dynamic Few-shot Scenarios

We investigate multiple few-shot settings where visual or textual information may be missing for some test classes.

🎉 We consistently improve respective baselines, making RNS a practical and open-world OVS method.

09.03.2026 16:19 👍 0 🔁 0 💬 1 📌 0
Post image

3/n How does it work?

💾 RNS stores VLM features from visual and textual examples in a memory-efficient manner.

🖼️ At test time, it retrieves the examples most relevant to the test image and trains a linear classifier on both modalities.
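In other words, inference is retrieval plus a small classifier adapted per test image. A minimal numpy sketch of that loop, assuming precomputed, L2-normalized features; all names are illustrative, not the actual RNS code (linked in 7/7).

```python
# Minimal sketch of retrieve-then-classify, assuming precomputed,
# L2-normalized VLM features. Illustrative only, not the RNS codebase.
import numpy as np

def retrieve(test_feat, support_feats, k=16):
    """Indices of the k support examples most similar to the test image."""
    sims = support_feats @ test_feat  # cosine similarity (unit-norm rows)
    return np.argsort(-sims)[:k]

def fit_linear_classifier(feats, labels, n_classes, lr=0.1, steps=100):
    """Train a softmax linear classifier on the retrieved examples."""
    W = np.zeros((n_classes, feats.shape[1]))
    for _ in range(steps):
        logits = feats @ W.T
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(labels)), labels] -= 1.0  # dCE/dlogits
        W -= lr * (p.T @ feats) / len(labels)    # gradient step
    return W

# idx = retrieve(test_feat, support_feats)
# W = fit_linear_classifier(support_feats[idx], support_labels[idx], C)
# pixel_logits = patch_feats @ W.T  # per-patch scores -> segmentation
```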

09.03.2026 16:19 👍 0 🔁 0 💬 1 📌 0
Post image

2/n Zero-shot open-vocabulary segmentation (OVS) significantly underperforms fully supervised methods.

🌉 RNS bridges this gap using a few pixel-level annotated visual examples along with class names.

With a few adaptation steps on each test image, we improve zero-shot performance by up to 34% on average.

09.03.2026 16:19 👍 0 🔁 0 💬 1 📌 0
Video thumbnail

1/n #CVPR2026 Accepted Paper 🚀

Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Retrieve and Segment (RNS) answers this question.

Paper/code at the end 👇🏼

09.03.2026 16:19 👍 8 🔁 2 💬 1 📌 1
Post image Post image Post image Post image

Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?

Tilemachos Aravanis @stojnicv.xyz @billpsomas.bsky.social Nikos Komodakis @gtolias.bsky.social

tl;dr: almost yes with 1-3 images; no with more (Fig. 6)
arxiv.org/abs/2602.23339
#CVPR2026

27.02.2026 16:17 👍 7 🔁 3 💬 0 📌 0

🎉🎉🎉

26.02.2026 23:05 👍 1 🔁 0 💬 0 📌 0

Excited to share that our paper “Global-Aware Edge Prioritization for Pose Graph Initialization” has been accepted to CVPR 2026! #CVPR2026 See you soon in Denver! 🥳🥳 Code is coming soon 🚧
❓ How would you do accurate and efficient pose graph initialization in a global manner? arxiv.org/abs/2602.21963

26.02.2026 15:54 👍 10 🔁 3 💬 1 📌 0
Post image Post image Post image

Global-Aware Edge Prioritization for Pose Graph Initialization

@weitong8591.bsky.social, @gtolias.bsky.social, Jiri Matas, @danielbarath.bsky.social

tl;dr: rank pose graph edges -> global consistency -> improve SfM

arxiv.org/abs/2602.21963
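To make the tl;dr concrete, here is a generic sketch of the edge-prioritization idea (not the paper's actual algorithm): score relative-pose edges by some reliability measure, then initialize poses from a maximum spanning tree over those scores, so only the most trusted edges fix the first camera estimates.

```python
# Generic sketch of edge prioritization for pose graph initialization.
# Not the paper's algorithm; the scores below are made-up placeholders
# that could come from inlier counts or a learned reliability predictor.
import networkx as nx

G = nx.Graph()
edges = [(0, 1, 0.9), (1, 2, 0.4), (0, 2, 0.8), (2, 3, 0.7)]  # (i, j, score)
for i, j, s in edges:
    G.add_edge(i, j, weight=s)

# A maximum spanning tree keeps only the highest-scoring edges, giving a
# cycle-free set of relative poses to chain into an initialization.
tree = nx.maximum_spanning_tree(G, weight="weight")
print(sorted(tree.edges(data="weight")))
# The remaining edges then refine the result globally (e.g. rotation
# averaging and bundle adjustment in an SfM pipeline).
```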

26.02.2026 13:24 👍 9 🔁 2 💬 1 📌 1
Post image

Sleeping while waiting on an “anywhere in the world” paper decision release. #CVPR2026

20.02.2026 21:21 👍 17 🔁 1 💬 2 📌 0
Preview
Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency As fine-tuning becomes impractical at scale, probing is emerging as the preferred evaluation protocol. However, standard linear probing can understate the capability of models whose pre-training optim...

8/8 Resources 📄

Paper: arxiv.org/abs/2506.10178
Code: github.com/billpsomas/e...

Joint work with: Dionysis Christopoulos, @eirinibaltzi.bsky.social, @ikakogeorgiou.bsky.social, @tim-arav.bsky.social, Nikos Komodakis, Konstantinos Karantzalos, Yannis Avrithis, @gtolias.bsky.social.

See you @ ICLR 2026 🇧🇷

20.02.2026 15:03 👍 0 🔁 0 💬 0 📌 0

7/n Take-home messages 💡

EP:
- Plug-and-play.
- Compatible with all pre-training families.
- Unlocks the potential of encoders optimized for local representations.
- Complementary to PEFT.
- Better to have it than not to have it. 👀

20.02.2026 15:03 👍 0 🔁 0 💬 1 📌 0
Post image

6/n EP + PEFT = 🔥

- EP captures information that LoRA alone does not, and vice versa.
- LoRA+EP improves over both pure EP and pure LoRA.

📌 Example: a LoRA+EP configuration with 250K params reaches 72%, 4.3% above linear probing (67.7%), while using over 3× fewer parameters.
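The 3× figure is easy to sanity-check with back-of-the-envelope arithmetic, assuming ViT-B features (d = 768) and ImageNet-1K's 1000 classes (both assumptions, not stated in the post):

```python
# Sanity check of "over 3x fewer parameters", assuming ViT-B features
# (d = 768) and ImageNet-1K (1000 classes); both are assumptions.
d, n_classes = 768, 1000
linear_probe_params = d * n_classes + n_classes  # weights + biases
print(linear_probe_params)            # 769000 -> roughly 769K
print(linear_probe_params / 250_000)  # ~3.1x the 250K LoRA+EP budget
```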

20.02.2026 15:03 👍 0 🔁 0 💬 1 📌 0
Post image Post image

5/n Interpretability 🔍

- EP queries specialize in distinct spatial regions.
- Attention maps are complementary.
- Semantic correspondences emerge (e.g. tails, feet).
- Verified quantitatively too.

20.02.2026 15:03 👍 0 🔁 0 💬 1 📌 0
Post image

4/n Designed for local representations 🧩

📊 Across ImageNet-1K:

- Consistent gains over k-NN and Linear Probing (LP).
- Particularly strong improvements for MIM, VL, and generative pre-training.
- Minimal overhead.

20.02.2026 15:03 👍 0 🔁 0 💬 1 📌 0
Post image Post image

3/n Core observation ⚙️

Prior attentive probing uses redundant projections.

🔍 Introducing Efficient Probing (EP):

📌 Multi-query cross-attention.
🔌 Plug-and-play on top of frozen encoders.
💸 Lightweight and parameter-efficient.
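A minimal sketch of what a multi-query cross-attention probe over frozen patch tokens can look like; the dimensions, single-head attention, and mean-pooling over queries are illustrative choices, not the actual EP design (code linked in 8/8).

```python
# Illustrative multi-query cross-attention probe on frozen patch tokens.
# Not the actual EP implementation; see the repo linked in 8/8.
import torch
import torch.nn as nn

class MultiQueryProbe(nn.Module):
    def __init__(self, dim=768, n_queries=4, n_classes=1000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, patch_tokens):  # (B, N, D) from a frozen encoder
        # Each learned query attends over all patch tokens.
        attn = torch.einsum("qd,bnd->bqn", self.queries, patch_tokens)
        attn = (attn / patch_tokens.shape[-1] ** 0.5).softmax(dim=-1)
        pooled = torch.einsum("bqn,bnd->bqd", attn, patch_tokens)
        return self.classifier(pooled.mean(dim=1))  # average query outputs

probe = MultiQueryProbe()
logits = probe(torch.randn(2, 196, 768))  # e.g. ViT-B/16 patch tokens
print(logits.shape)  # torch.Size([2, 1000])
```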

20.02.2026 15:03 👍 0 🔁 0 💬 1 📌 0
Post image

2/n Why revisit probing? 🤔

- Linear probing underestimates encoders optimized for local representations.
- Full fine-tuning is costly at scale.
- Attentive probing helps, yet existing methods are over-parametrized and under-studied.

👉 Can we get the benefits of attention without that much overhead?

20.02.2026 15:03 👍 0 🔁 0 💬 1 📌 0
Post image

1/n Attention, Please! 🚀

Our work “Revisiting Attentive Probing Through the Lens of Efficiency” has been accepted at #ICLR2026.

We introduce Efficient Probing (EP), a lightweight, multi-query attentive probing method for frozen encoders.

Paper + code at the end 👇

20.02.2026 15:03 👍 11 🔁 4 💬 1 📌 1
Post image Post image Post image Post image

BBoxMaskPose v2: Expanding Mutual Conditioning to 3D

Miroslav Purkrabek, Constantin Kolomiiets, Jiri Matas

tl;dr: SOTA in human pose estimation, especially for the hard cases
arxiv.org/abs/2601.15200

03.02.2026 10:49 👍 12 🔁 1 💬 0 📌 0

Would love to try

13.01.2026 18:33 👍 1 🔁 0 💬 0 📌 0

Best promo anyone could make for this position 👏🏾🏰 And, amazingly, everything said is true 🎆

09.01.2026 05:36 👍 2 🔁 0 💬 1 📌 0
Preview
Postdoctoral research position in Instance-level visual generation Czech Technical University in Prague (CTU) offers a fellowship program, the CTU Global Postdoc Fellowship. This new and attractive two-year fellowship-program offers excellent researchers who have rec...

I have an opening for a two-year post-doc position on instance-level (personalized) visual generation. Eligibility: (i) <=7 years from Ph.D.; (ii) studies or 1 year outside of Czechia; (iii) >=3 journal papers with IF or CORE A*/A conference papers. Deadline: 15 Feb.
Details: www.euraxess.cz/jobs/399390

08.01.2026 11:11 👍 12 🔁 10 💬 2 📌 1
Post image

🚀 New task: Instance-level Image+Text→Image Retrieval

🔎 Given a query image + an edit (“during night”), retrieve the same specific instance after the change, not just any similar object.

🛢 New dataset on HF: i-CIR huggingface.co/datasets/bil...

🔥 Download, run, and share results!
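A possible starting point with the `datasets` library; since the link above is truncated, the dataset ID below is a guess and should be checked against the HF page before use.

```python
# Hypothetical usage sketch; the dataset ID is a placeholder inferred from
# the truncated link above and is NOT verified.
from datasets import load_dataset

ds = load_dataset("billpsomas/i-cir")  # placeholder ID, check the HF page
print(ds)  # inspect the available splits and fields
# Each entry should pair a query image with a textual edit; the target is
# the same instance after the change, not just a visually similar object.
```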

06.01.2026 20:00 👍 12 🔁 5 💬 0 📌 0

12/12 Joint work with Giorgos Petsangourakis, Christos Sgouropoulos, Theodoros Giannakopoulos, Giorgos Sfikas, @ikakogeorgiou.bsky.social.

27.12.2025 10:32 👍 1 🔁 0 💬 0 📌 0
Preview
REGLUE Your Latents with Global and Local Semantics for Entangled Diffusion Latent diffusion models (LDMs) achieve state-of-the-art image synthesis, yet their reconstruction-style denoising objective provides only indirect semantic supervision: high-level semantics emerge slo...

11/n Summary 🏁

REGLUE shows that the way we leverage VFM semantics matters for diffusion. Combining compact local semantics with global context yields faster convergence and state-of-the-art image generation.

📄 arXiv: arxiv.org/abs/2512.16636
💻 Project: reglueyourlatents.github.io

27.12.2025 10:30 👍 1 🔁 0 💬 1 📌 0
Post image Post image

10/n Faster convergence 🔥

REGLUE (SiT-B/2) achieves 12.9 and 28.7 FID at 400K iterations in conditional and unconditional generation, respectively, outperforming REPA, ReDi, and REG. REGLUE (SiT-XL/2) matches 1M-step SOTA performance in just 700k iterations (~30% fewer steps).

27.12.2025 10:30 👍 0 🔁 0 💬 1 📌 0

9/n Alignment effects ⚓

External alignment complements joint modeling, but its benefits depend on the signal. Local alignment yields consistent gains, whereas global-only alignment can degrade performance. Spatial joint modeling remains the primary driver.

27.12.2025 10:29 👍 1 🔁 0 💬 1 📌 0

8/n Local > Global Semantics 🧩

Our analysis shows that joint modeling with patch-level semantics drives most of the gains. The global [CLS] token helps, but fine-grained spatial features deliver a substantially larger FID improvement, highlighting the importance of local structure for diffusion.
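For intuition, patch-level alignment of the kind described here is often implemented as a cosine loss between projected denoiser features and frozen VFM patch features (REPA-style); the sketch below is generic, with made-up dimensions, not REGLUE's exact objective.

```python
# Generic patch-level semantic alignment loss (REPA-style), not REGLUE's
# exact objective. Dimensions are made up for illustration.
import torch
import torch.nn.functional as F

def local_alignment_loss(diff_feats, vfm_feats, proj):
    """diff_feats: (B, N, D_diff) intermediate denoiser features;
    vfm_feats: (B, N, D_vfm) frozen VFM patch features; proj: learned head."""
    pred = proj(diff_feats)  # project into the VFM feature space
    return 1 - F.cosine_similarity(pred, vfm_feats, dim=-1).mean()

proj = torch.nn.Linear(1152, 768)  # illustrative dims
loss = local_alignment_loss(torch.randn(2, 256, 1152),
                            torch.randn(2, 256, 768), proj)
print(loss)
```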

27.12.2025 10:29 👍 0 🔁 0 💬 1 📌 0