Human-level 3D shape perception emerges from multi-view learning
check out our project page for interactive examples and some really useful visualizations of model dynamics:
tzler.github.io/human_multiv...
all the code, images, and human behavior are available:
code: github.com/tzler/human_...
benchmark: huggingface.co/datasets/tzl...
26.02.2026 16:28
the emergent alignment between these models and human perception opens up so many scientific opportunities
we're already working on some extensions of these results. if you have ideas/questions, we'd love to hear from you!
26.02.2026 16:28
there is an emergent alignment between these multi-view models and human perception
concretely, VGGT matches human-level accuracy (left), error patterns (center), and reaction time (right)
again: no training, no fine-tuning, no linear decoders needed to predict these behavioral measures
26.02.2026 16:28
a graphic titled "protocol for estimating model performance on a single trial" with four sections left to right.
far left: three images of abstract shapes from the example 'oddity' trial used previously. each image is labeled with either A, A', or B.
left: the pairs of images (A B, A A', B A') stacked vertically
right: an arrow extends from each image pair to the word "model" and then an arrow extends to a visualization of the "confidence map" extracted by the model for this pair
far right: a bar chart with "confidence" on the y axis and each pair (A B, A A', B A') on the x axis. the pair with the match object (A A') has the highest confidence
we develop an evaluation framework for these 'multi-view' models
using a pair-wise encoding strategy, we design a set of metrics based on model 'confidence', 'confidence margin', and 'solution layer'
these are all zero-shot metrics. no experimental behavior/stimuli necessary
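a minimal sketch of the pairwise scoring scheme, assuming a hypothetical `pair_confidence(img_a, img_b)` helper that reduces the model's confidence map for a two-image sequence to a scalar (these names are illustrative, not from the paper):

```python
from itertools import combinations

def oddity_choice(images, pair_confidence):
    """Given three images [A, A', B] and a pair_confidence(x, y) function,
    return the index of the predicted odd-one-out.

    The pair with the highest confidence is taken to depict the same
    object from two viewpoints; the left-out image is the oddity.
    """
    pairs = list(combinations(range(3), 2))            # (0,1), (0,2), (1,2)
    scores = [pair_confidence(images[i], images[j]) for i, j in pairs]
    best = pairs[scores.index(max(scores))]            # highest-confidence pair
    return ({0, 1, 2} - set(best)).pop()               # index of the left-out image
```

the key design choice is that the model never sees all three images at once: each pair is encoded separately, and the trial-level decision falls out of comparing pairwise confidences.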
26.02.2026 16:28
a new class of models leverages similar visual-spatial data
given a sequence of images, multi-view vision transformers (e.g., DUSt3R, Pi3, VGGT) learn to predict associated spatial information, including depth and camera pose.
this is a radical departure from standard vision encoders
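roughly, the input/output contract looks like this (a stub sketch with hypothetical shapes, just to illustrate what "multi-view" means here; real models like VGGT differ in heads and conventions):

```python
import numpy as np

def multiview_forward(images: np.ndarray):
    """Sketch of a multi-view transformer's interface.

    images: (N, H, W, 3) array, a sequence of N RGB views of a scene.
    Returns per-view depth maps and 4x4 camera poses, the spatial
    quantities these models are trained to predict. The bodies are
    placeholders, not a real model.
    """
    n, h, w, _ = images.shape
    depth = np.zeros((n, h, w))              # one depth map per view
    poses = np.tile(np.eye(4), (n, 1, 1))    # one 4x4 camera pose per view
    return depth, poses
```

contrast this with a standard vision encoder, which maps a single image to a feature vector with no explicit spatial targets.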
26.02.2026 16:28
okay, so, what kind of data *do* we learn from?
at the very least: visual sequences, depth, and self-motion cues. @brialong.bsky.social, @mcxfrank.bsky.social and Linda Smith have done incredible work characterizing these experiences with head-mounted cameras
(thank you Bria for the video!)
26.02.2026 16:28
why models fail is relevant for cognitive theories:
e.g., do models need object-level inductive biases to perceive 3D? or do they need more human-like sensory experience?
achieving human-level performance provides a proof of principle for which computational strategies might work
26.02.2026 16:28
a bar chart comparing humans and many vision models. the y axis is "3D perceptual accuracy" and the x axis contains humans (orange) and models of different sizes from DINOv2, CLIP, and MAE. the best models are only half as good (.4) as humans are (.8) on this 3D perception benchmark
there's been a substantial gap between humans and vision models on 3D perception tasks
we put together a NeurIPS benchmark (MOCHI) to evaluate vision models: arxiv.org/abs/2409.05862
since then, we've observed this failure across a wide range of architectures and training objectives
26.02.2026 16:28
an example 'oddity' trial. there are three images side by side, containing abstract shapes from different viewpoints that are challenging to discriminate between
how do we evaluate human 3D perception?
here's an example from a cognitive science task. each trial has three images: two images depict the same object from different viewpoints (A, A'), the other depicts a different object (B). the task is to select the non-matching object (B)
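for reference, chance on this three-alternative oddity task is 1/3, so human accuracy around .8 is well above chance; a toy simulation (not from the paper) confirms the baseline:

```python
import random

def chance_accuracy(n_trials=100_000, seed=0):
    """Accuracy of a guesser picking one of the three images uniformly at
    random, with the odd-one-out fixed at index 2. Should approach 1/3."""
    rng = random.Random(seed)
    hits = sum(rng.randrange(3) == 2 for _ in range(n_trials))
    return hits / n_trials
```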
26.02.2026 16:28
Human-level 3D shape perception emerges from multi-view learning
Humans can infer the three-dimensional structure of objects from two-dimensional visual inputs. Modeling this ability has been a longstanding goal for the science and engineering of visual intelligenc...
excited to share some recent work!
neural networks trained on multi-view sensory data are the first to match human-level 3D shape perception
we predict human accuracy, error patterns, and reaction time, all zero-shot, no training on experimental data
arxiv.org/abs/2602.17650
1/🧵
26.02.2026 16:28
thank you Marianna!!!
28.11.2025 16:29
thank you Iris!!!
28.11.2025 16:28
THANK YOU ALEX!!
28.11.2025 16:28
thank you!!
28.11.2025 16:28
i'd love that!
28.11.2025 16:27
thank you Anna, for everything!!!!
28.11.2025 16:27
thank you Dota!!
28.11.2025 16:26
thank you!!!
28.11.2025 16:26
thanks Marcelo!
28.11.2025 16:25
🥳
28.11.2025 16:25
thanks Ida!
28.11.2025 16:25
thank you!!
28.11.2025 16:24
thanks Laura!
28.11.2025 16:24
thank you!
28.11.2025 16:24
thanks Nacho!!! i'll definitely reach out about apartments and live music
28.11.2025 16:23
thanks Tobi
28.11.2025 16:22
!!!! thank you Natalia !!!!
28.11.2025 16:21
me toooo!
28.11.2025 16:20
thank you Nina!!
28.11.2025 16:20
thank you Victoria!
28.11.2025 16:19