(@jana-z) — KonKok

This is also the first paper of my PhD — huge thanks to my amazing co-authors:
@thwiedemer.bsky.social, Fanfei Li, @thokle.bsky.social, @prasannamayil.bsky.social, Matthias Bethge, Felix Wichmann, Ryan Cotterell, and @wielandbrendel.bsky.social

03.02.2026 13:12 👍 0 🔁 0 💬 0 📌 0

Overall, our results point to a dual failure of machine mental imagery:
models struggle both to generate and to interpret visual states as actionable evidence for sequential decision-making.

03.02.2026 08:36 👍 0 🔁 0 💬 0 📌 0

Is the problem simply bad image generation?
We provide models with ground-truth visual chains of thought (oracle intermediate states) and instruct them to use these visuals in their reasoning.
Performance improves only in some tasks, and often remains at chance.

03.02.2026 08:36 👍 0 🔁 0 💬 0 📌 0

Same tasks, different representation:
When visual states are transcribed into text, many models can solve problems they fail in the visual setting.
This suggests the bottleneck is not logic, but reasoning in the visual domain itself.
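
To make "transcribed into text" concrete, here is a minimal sketch of how a Rush Hour-style board could be turned into a plain-text description. The character-grid encoding, the `transcribe` helper, and the wording of the output are all assumptions made for illustration; they are not the transcription format used in the paper.

```python
from typing import Dict, List, Tuple

def transcribe(board: List[str]) -> str:
    """Describe each car's orientation, length, and top-left cell in plain text.

    Hypothetical encoding: each string is one row, '.' is an empty cell,
    and every other character names a car.
    """
    cars: Dict[str, List[Tuple[int, int]]] = {}
    for r, row in enumerate(board):
        for c, ch in enumerate(row):
            if ch != ".":
                cars.setdefault(ch, []).append((r, c))

    lines = [f"Grid: {len(board)} rows x {len(board[0])} columns."]
    for name in sorted(cars):
        cells = cars[name]
        # A car whose cells share one row is horizontal, otherwise vertical.
        orientation = "horizontal" if len({r for r, _ in cells}) == 1 else "vertical"
        r0, c0 = min(cells)  # smallest (row, column) pair is the top-left cell
        lines.append(
            f"Car {name}: {orientation}, length {len(cells)}, "
            f"top-left at row {r0}, column {c0}."
        )
    return "\n".join(lines)

example = [
    "..B.",
    "AAB.",
    "....",
    "....",
]
text = transcribe(example)
print(text)
```

A description like this carries the same state information as the image, so any remaining difficulty in the text setting would come from the logic rather than from perception.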

03.02.2026 08:36 👍 0 🔁 0 💬 0 📌 0

Zooming in on Rush Hour, we compare reasoning paradigms ranging from text-only MLLMs to models with latent or explicit visual reasoning.

None of these paradigms reliably outperforms the others, indicating that making visual reasoning more explicit does not solve the problem.

03.02.2026 08:36 👍 0 🔁 0 💬 0 📌 0

Across all tasks, state-of-the-art multimodal models often perform at or near chance, even at relatively low difficulty.

Performance degrades rapidly as soon as reasoning requires sequential visual state updates; it does not take long-horizon planning or complex rules to cause the drop.

03.02.2026 08:35 👍 1 🔁 0 💬 0 📌 0

What does Mentis Oculi test?
A collection of visual reasoning tasks (e.g. Rush Hour, Sliding Puzzle) designed to probe whether models can mentally transform visual states across multiple steps.
Each puzzle is specified by a single image, but solving it requires a visual rollout.
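
As a rough illustration of what a "rollout" of states means, the sketch below applies a sequence of moves to a small Rush Hour-style board and records every intermediate state. The grid encoding and the `move` and `rollout` helpers are hypothetical and chosen for brevity; the benchmark itself presents puzzles as images, and this is not its code.

```python
from typing import List, Tuple

Board = List[str]  # assumed encoding: one string per row, '.' is an empty cell

def move(board: Board, car: str, delta: int) -> Board:
    """Slide `car` by `delta` cells along its own axis; raise ValueError if blocked."""
    cells = [(r, c) for r, row in enumerate(board)
             for c, ch in enumerate(row) if ch == car]
    if not cells:
        raise ValueError(f"car {car!r} is not on the board")
    # Cars are assumed to span at least two cells, so the axis is unambiguous.
    horizontal = len({r for r, _ in cells}) == 1
    step = 1 if delta > 0 else -1
    grid = [list(row) for row in board]
    for _ in range(abs(delta)):
        # Shift every cell of the car by one step and check the target cells.
        new_cells = [(r, c + step) if horizontal else (r + step, c) for r, c in cells]
        for r, c in new_cells:
            if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
                raise ValueError("move leaves the board")
            if grid[r][c] not in (".", car):
                raise ValueError("move is blocked")
        for r, c in cells:
            grid[r][c] = "."
        for r, c in new_cells:
            grid[r][c] = car
        cells = new_cells
    return ["".join(row) for row in grid]

def rollout(board: Board, moves: List[Tuple[str, int]]) -> List[Board]:
    """Return the initial state followed by the state after each move."""
    states = [board]
    for car, delta in moves:
        states.append(move(states[-1], car, delta))
    return states

start = [
    "..B.",
    "AAB.",
    "....",
    "....",
]
# Car B must slide down before car A can reach the right edge.
states = rollout(start, [("B", 2), ("A", 2)])
for state in states:
    print("\n".join(state), end="\n\n")
```

Each entry in `states` depends on the one before it, which is the property the tasks probe: a solver has to keep the updated board in mind after every move, not just the starting image.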

03.02.2026 08:35 👍 0 🔁 0 💬 0 📌 0

Can AI reason by “imagining” — not just by seeing or reading?

We introduce Mentis Oculi, a benchmark for machine mental imagery: multi-step visual puzzles that require maintaining and updating visual states over time.
📄 arxiv.org/abs/2602.02465
🌐 jana-z.github.io/mentis-oculi/

🧵⬇️

03.02.2026 08:34 👍 4 🔁 2 💬 7 📌 2
