Human-level 3D shape perception emerges from multi-view learning
check out our project page for interactive examples and some really useful visualizations of model dynamics:
tzler.github.io/human_multiv...
all the code, images, and human behavior are available:
code: github.com/tzler/human_...
benchmark: huggingface.co/datasets/tzl...
26.02.2026 16:28
the emergent alignment between these models and human perception opens up so many scientific opportunities
we're already working on some extensions of these results. if you have ideas/questions, we'd love to hear from you!
26.02.2026 16:28
there is an emergent alignment between these multi-view models and human perception
concretely, VGGT matches human-level accuracy (left), error patterns (center), and reaction time (right)
again: no training, no fine-tuning, no linear decoders needed to predict these behavioral measures
26.02.2026 16:28
a graphic titled "protocol for estimating model performance on a single trial" with four sections left to right.
far left: three images of abstract shapes from the example 'oddity' trial used previously. each image is labeled with either A, A', or B.
left: the pairs of images (A B, A A', B A') stacked vertically
right: an arrow extends from each image pair to the word "model" and then an arrow extends to a visualization of the "confidence map" extracted by the model for this pair
far right: a bar chart with "confidence" on the y axis and each pair (A B, A A', B A') on the x axis. the pair with the match object (A A') has the highest confidence
we develop an evaluation framework for these 'multi-view' models
using a pair-wise encoding strategy, we design a set of metrics based on model 'confidence', 'confidence margin', and 'solution layer'
these are all zero-shot metrics. no experimental behavior/stimuli necessary
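a minimal sketch of the pairwise scoring scheme, assuming a hypothetical `pair_confidence(img_a, img_b)` helper that reduces the model's confidence map for a two-image sequence to a scalar (these names are illustrative, not from the paper):

```python
from itertools import combinations

def oddity_choice(images, pair_confidence):
    """Given three images [A, A', B] and a pair_confidence(x, y) function,
    return the index of the predicted odd-one-out.

    The pair with the highest confidence is taken to depict the same
    object from two viewpoints; the left-out image is the oddity.
    """
    pairs = list(combinations(range(3), 2))            # (0,1), (0,2), (1,2)
    scores = [pair_confidence(images[i], images[j]) for i, j in pairs]
    best = pairs[scores.index(max(scores))]            # highest-confidence pair
    return ({0, 1, 2} - set(best)).pop()               # index of the left-out image
```

the key design choice is that the model never sees all three images at once: each pair is encoded separately, and the trial-level decision falls out of comparing pairwise confidences.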
26.02.2026 16:28
a new class of models leverages similar visual-spatial data
given a sequence of images, multi-view vision transformers (e.g., DUSt3R, Pi3, VGGT) learn to predict associated spatial information, including depth and camera pose.
this is a radical departure from standard vision encoders
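roughly, the input/output contract looks like this (a stub sketch with hypothetical shapes, just to illustrate what "multi-view" means here; real models like VGGT differ in heads and conventions):

```python
import numpy as np

def multiview_forward(images: np.ndarray):
    """Sketch of a multi-view transformer's interface.

    images: (N, H, W, 3) array, a sequence of N RGB views of a scene.
    Returns per-view depth maps and 4x4 camera poses, the spatial
    quantities these models are trained to predict. The bodies are
    placeholders, not a real model.
    """
    n, h, w, _ = images.shape
    depth = np.zeros((n, h, w))              # one depth map per view
    poses = np.tile(np.eye(4), (n, 1, 1))    # one 4x4 camera pose per view
    return depth, poses
```

contrast this with a standard vision encoder, which maps a single image to a feature vector with no explicit spatial targets.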
26.02.2026 16:28
okay, so, what kind of data *do* we learn from?
at the very least: visual sequences, depth, and self-motion cues. @brialong.bsky.social, @mcxfrank.bsky.social and Linda Smith have done incredible work characterizing these experiences with head-mounted cameras
(thank you Bria for the video!)
26.02.2026 16:28
why models fail is relevant for cognitive theories:
e.g., do models need object-level inductive biases to perceive 3D? or do they need more human-like sensory experience?
achieving human-level performance provides a proof of principle for which computational strategies might work
26.02.2026 16:28
a bar chart comparing humans and many vision models. the y axis is "3D perceptual accuracy" and the x axis contains humans (orange) and models of different sizes from DINOv2, CLIP, and MAE. the best models are only half as good (.4) as humans are (.8) on this 3D perception benchmark
there's been a substantial gap between humans and vision models on 3D perception tasks
we put together a NeurIPS benchmark (MOCHI) to evaluate vision models: arxiv.org/abs/2409.05862
since then, we've observed this failure across a wide range of architectures and training objectives
26.02.2026 16:28
an example 'oddity' trial. there are three images side by side, containing abstract shapes from different viewpoints that are challenging to discriminate between
how do we evaluate human 3D perception?
here's an example from a cognitive science task. each trial has three images: two images depict the same object from different viewpoints (A, A'), the other depicts a different object (B). the task is to select the non-matching object (B)
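for reference, chance on this three-alternative oddity task is 1/3, so human accuracy around .8 is well above chance; a toy simulation (not from the paper) confirms the baseline:

```python
import random

def chance_accuracy(n_trials=100_000, seed=0):
    """Accuracy of a guesser picking one of the three images uniformly at
    random, with the odd-one-out fixed at index 2. Should approach 1/3."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(3) == 2 for _ in range(n_trials))
    return hits / n_trials
```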
26.02.2026 16:28
Human-level 3D shape perception emerges from multi-view learning
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligenc...
excited to share some recent work!
neural networks trained on multi-view sensory data are the first to match human-level 3D shape perception
we predict human accuracy, error patterns, and reaction time, all zero-shot, no training on experimental data
arxiv.org/abs/2602.17650
1/🧵
26.02.2026 16:28
thank you Marianna!!!
28.11.2025 16:29
thank you Iris!!!
28.11.2025 16:28
THANK YOU ALEX!!
28.11.2025 16:28
thank you!!
28.11.2025 16:28
i'd love that!
28.11.2025 16:27
thank you Anna, for everything!!!!
28.11.2025 16:27
thank you Dota!!
28.11.2025 16:26
thank you!!!
28.11.2025 16:26
thanks Marcelo!
28.11.2025 16:25
🥳
28.11.2025 16:25
thanks Ida!
28.11.2025 16:25
thank you!!
28.11.2025 16:24
thanks Laura!
28.11.2025 16:24
thank you!
28.11.2025 16:24
thanks Nacho!!! i'll definitely reach out about apartments and live music
28.11.2025 16:23
thanks Tobi
28.11.2025 16:22
!!!! thank you Natalia !!!!
28.11.2025 16:21
me toooo!
28.11.2025 16:20
thank you Nina!!
28.11.2025 16:20
thank you Victoria!
28.11.2025 16:19