#VisionLanguage

Latest posts tagged with #VisionLanguage on Bluesky

Even the hottest multimodal models stumble—capped at 50% on simple visual entity tasks. What does this reveal about current vision‑language gaps? Dive into the benchmarks and see why AI still has a long way to go. #MultimodalLearning #VisionLanguage #AIPerformance

🔗 aidailypost.com/news/top-mul...

New research shows how to fool CLIP‑style vision‑language models with fresh adversarial tricks. Could this expose hidden AI security gaps? Dive into the latest evasion techniques and what they mean for multimodal ML. #AdversarialAttacks #VisionLanguage #AIsecurity

🔗 aidailypost.com/news/researc...

Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Chia-Jui Chang, He Syu et al.
Paper
Details
#VisionLanguage #OrdinalRegression #BiasBenchmark

Just saw an open‑source OCR model hit 82.4 on the olmOCR‑bench—handles equations, tables, multilingual docs, and scales like a champ with PaddleOCR VL and ERNIE‑4.5‑0.3B. Dive into the details! #OCR #olmOCRbench #VisionLanguage

🔗 aidailypost.com/news/open-so...

Black Forest Labs just dropped Flux 2, packing the new Mistral‑3 24B vision‑language model with a hybrid Rectified Flow Transformer + VAE encoder. The BFL API makes it super easy to experiment—check out the details! #Flux2 #Mistral324B #VisionLanguage

🔗 aidailypost.com/news/black-f...

#VisionLanguage models are increasingly used for a wide range of problems, but can seem complex to build. I wrote some code and recorded a tutorial in my lab yesterday to help demystify how to create these models. #keepbuilding

E-MM1 Dataset: The World's Largest Multimodal AI Dataset
The E-MM1 dataset contains more than 100 million groups of data across five modalities, fostering the development of models that fuse multiple modalities.

E-MM1 evaluates how AI understands images and text together. It highlights where models excel and where they fall short, helping build more reliable multimodal systems.

#AI #Data #VisionLanguage
encord.com/multimodal-d...

Back from the break with Phillip Isola @phillipisola.bsky.social on
“On the Perceptual Distance Between Images and Text.”
A fascinating and interactive look at how models (and humans!) measure similarity 👏🏻

#HiCV2025 #ICCV2025 #VisionLanguage

Training-Free Explainable Vision-Language Model for Medical Imaging

A training-free, explainable vision-language model for medical imaging has been announced. Read more: getnews.me/training-free-explainabl... #medimaging #visionlanguage #explainable

Probabilistic Language-Image Pre-Training Boosts Vision-Language Models

A new probabilistic language-image pre-training approach is reported to boost performance of vision-language models. Read more: getnews.me/probabilistic-language-i... #visionlanguage #pretraining #ai

Cross-modal Backward-Compatible Learning for Vision-Language Models

A new study introduces cross-modal backward-compatible learning for vision-language models. Read more: getnews.me/cross-modal-backward-com... #visionlanguage #crossmodal #machinelearning

Vision-Language Models Boost Efficiency of Indoor Robot Navigation

Vision‑language models guide indoor robot navigation, selecting subgoals that reduce path length by about 10% in simulation, working zero‑shot with the DYNUS planner. Read more: getnews.me/vision-language-models-b... #visionlanguage #robotics

Zero-Shot Fine-Grained Classification with Vision-Language Models

The study reframes zero‑shot classification as Q&A and adds an attention‑intervention step, boosting top‑1 accuracy on bird, flower, and vehicle benchmarks. Code on GitHub. Read more: getnews.me/zero-shot-fine-grained-c... #visionlanguage #zeroshot
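
The Q&A reframing above still rests on the standard CLIP-style zero-shot recipe: embed the image and each candidate class prompt, then pick the most similar prompt. A minimal sketch with toy embeddings standing in for real encoder outputs (the prompts and numbers are hypothetical, not from the paper):

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_classify(image_emb, class_prompts):
    # class_prompts: {label: text_embedding}; return the label whose
    # prompt embedding is most similar to the image embedding.
    scores = {label: cosine(image_emb, emb) for label, emb in class_prompts.items()}
    return max(scores, key=scores.get)

# Toy 3-d embeddings standing in for real image/text encoder outputs.
image_emb = [0.9, 0.1, 0.2]
prompts = {
    "a photo of a cardinal": [0.8, 0.2, 0.1],
    "a photo of a sparrow": [0.1, 0.9, 0.3],
}
print(zero_shot_classify(image_emb, prompts))  # → a photo of a cardinal
```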

Spatial‑ViLT Improves 3D Spatial Reasoning with Multi‑Task Learning

Spatial‑ViLT adds depth maps, 3D coordinate grids and edge maps to vision‑language models, achieving top results on the Visual Spatial Reasoning benchmark. Read more: getnews.me/spatial-vilt-improves-3d... #spatialvilt #visionlanguage

Large Vision‑Language Models Boost Carotid Plaque Risk Prediction

Fine‑tuned LLaVA‑NeXT‑Vicuna with LoRA boosted specificity and balanced accuracy in carotid plaque stroke‑risk prediction, especially when paired with patient data. 3 Oct 2025. getnews.me/large-vision-language-mo... #visionlanguage #carotid #stroke

MaskCD Cuts Hallucinations in Vision‑Language Models

MaskCD, a new contrastive decoding method that masks the image head, cuts hallucination rates in LVLMs like LLaVA‑1.5‑7B and Qwen‑VL‑7B without hurting overall performance. Read more: getnews.me/maskcd-cuts-hallucinatio... #maskcd #lvlm #visionlanguage
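
Contrastive decoding of this flavor can be sketched generically: score each next token with the image visible and with it hidden, then penalize tokens the model would emit regardless of the image. The `alpha` weight and toy logits below are illustrative assumptions, not MaskCD's actual image-head masking procedure:

```python
def contrastive_decode(logits_full, logits_masked, alpha=1.0):
    """Combine next-token logits from the full input with logits from a
    degraded (image-hidden) input. Tokens the model prefers even without
    the image are down-weighted, discouraging hallucinated content."""
    return [(1 + alpha) * f - alpha * m
            for f, m in zip(logits_full, logits_masked)]

# Toy 3-token vocabulary: token 1 is grounded in the image; token 2 is
# a language-prior guess the model emits even when the image is hidden.
logits_full   = [1.0, 2.0, 2.1]
logits_masked = [1.0, 0.5, 2.0]
adjusted = contrastive_decode(logits_full, logits_masked)
best = max(range(len(adjusted)), key=adjusted.__getitem__)
print(best)  # → 1 (the image-grounded token wins after adjustment)
```

Without the contrastive term, the raw logits would have picked token 2; subtracting the image-masked logits flips the choice to the grounded token.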

Explainability Shows Limits of Vision‑Language Models on Rebus Puzzles

A study of 221 rebus puzzles shows vision‑language models excel at visual composition but falter on missing elements and cultural symbols. The paper was submitted on 3 Oct 2025. getnews.me/explainability-shows-lim... #visionlanguage #rebuspuzzles

AdaRD-Key Boosts Query-Driven Frame Selection for Long-Form Video AI

AdaRD‑Key selects query‑relevant, diverse keyframes in real time on a single GPU, achieving state‑of‑the‑art results on LongVideoBench and Video‑MME. getnews.me/adard-key-boosts-query-d... #adardkey #visionlanguage

AGILE boosts visual perception and reasoning in Vision‑Language Models

The AGILE framework raised 2x2 jigsaw accuracy from 9.5% to 82.8% and added roughly 3% average gain across nine vision tasks, according to the authors. Read more: getnews.me/agile-boosts-visual-perc... #visionlanguage #agile #multimodal

AgenticIQA: Adaptive, Interpretable Image Quality Assessment Framework

AgenticIQA uses a planner‑executor‑summarizer workflow and released AgenticIQA‑200K with 200,000 examples. It beats strong baselines on Pearson and Spearman correlation. getnews.me/agenticiqa-adaptive-inte... #agenticiqa #imagequality #visionlanguage

Vision-Language Process Reward Models Enhance Test-Time Scaling

A hybrid pipeline merging Monte Carlo Tree Search with a strong vision‑language model produces reliable step‑level labels, boosting benchmarks like MMMU and MathVista. Read more: getnews.me/vision-language-process-... #multimodal #visionlanguage

TDBench Launches Rotational Benchmark for Top‑Down Vision Models

TDBench offers a benchmark for top‑down vision‑language models with 2,000 questions for each of four rotational views. The dataset and code are available on GitHub. Read more: getnews.me/tdbench-launches-rotatio... #tdbench #visionlanguage

Visual Self-Refinement Boosts Autoregressive Vision‑Language Models

A plug‑and‑play visual self‑refinement module refines token sequences after generation, improving coherence of vision‑language models. Accepted at EMNLP 2025. Read more: getnews.me/visual-self-refinement-b... #visionlanguage #selfrefinement

MULTI‑TAP: Multi‑Objective Predictor for Image‑Text Alignment

MULTI‑TAP adds a lightweight ridge‑regression layer to frozen LVLMs, keeping model size in the 7–8 B‑parameter range while matching GPT‑4o‑based predictors. Read more: getnews.me/multi-tap-multi-objectiv... #multitap #visionlanguage
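
A ridge-regression head over frozen features has a closed form, w = (XᵀX + λI)⁻¹Xᵀy. A toy two-feature sketch follows; the 2-d "embeddings" and ratings are made up, and MULTI-TAP's real layer operates on high-dimensional LVLM hidden states:

```python
def ridge_fit_2d(X, y, lam=0.1):
    """Closed-form ridge regression for exactly 2 features:
    w = (X^T X + lam*I)^(-1) X^T y, via the explicit 2x2 inverse."""
    a = sum(x[0] * x[0] for x in X) + lam   # (X^T X + lam*I)[0][0]
    b = sum(x[0] * x[1] for x in X)         # off-diagonal term
    d = sum(x[1] * x[1] for x in X) + lam   # (X^T X + lam*I)[1][1]
    t0 = sum(x[0] * yi for x, yi in zip(X, y))  # (X^T y)[0]
    t1 = sum(x[1] * yi for x, yi in zip(X, y))  # (X^T y)[1]
    det = a * d - b * b
    return [(d * t0 - b * t1) / det, (a * t1 - b * t0) / det]

# Pretend each row is a frozen 2-d embedding of an image-text pair and
# y is a human alignment rating (hypothetical data for illustration).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]
w = ridge_fit_2d(X, y, lam=0.1)
score = lambda x: x[0] * w[0] + x[1] * w[1]  # the learned predictor
```

Because only w is trained, the backbone stays frozen: fitting is a single linear solve, which is what keeps such a head lightweight.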

Adaptive Event Slicing Boosts Open‑Vocabulary Detection

A hybrid SNN‑CNN framework adaptively slices event streams for open‑vocabulary object detection with CLIP; the paper was submitted in October 2025. Read more: getnews.me/adaptive-event-slicing-b... #eventcameras #visionlanguage

GUI-KV Improves Efficiency of Vision‑Language GUI Agents

GUI‑KV, a KV cache compression for vision‑language GUI agents, cuts decoding FLOPs by 38.9% and boosts step‑wise accuracy by 4.1% on the AgentNetBench 5‑screenshot benchmark. Read more: getnews.me/gui-kv-improves-efficien... #guikv #visionlanguage
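
KV-cache compression in general means evicting cached key/value entries that later tokens rarely attend to. A generic top-k eviction sketch, not GUI-KV's actual scoring (the cache entries and attention scores here are hypothetical):

```python
def compress_kv_cache(cache, scores, budget):
    """Keep only the `budget` cached key/value entries with the highest
    attention scores, preserving their original token order."""
    keep = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)[:budget]
    keep.sort()  # restore token order so positions stay coherent
    return [cache[i] for i in keep]

# Toy cache of four token entries with cumulative attention scores.
cache  = ["tok0", "tok1", "tok2", "tok3"]
scores = [0.10, 0.90, 0.05, 0.60]
print(compress_kv_cache(cache, scores, budget=2))  # → ['tok1', 'tok3']
```

Halving the cache this way roughly halves the attention work per decoded token, which is where the FLOP savings come from; the hard part the paper addresses is scoring entries well enough that accuracy does not drop.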

MathSticks: Visual Symbolic Reasoning Benchmark Using Matchsticks

MathSticks offers ~1.4 million matchstick puzzles in which fixing an equation requires moving one or two sticks. Humans score 90%, while vision‑language models lag. Read more: getnews.me/mathsticks-visual-symbol... #mathsticks #benchmark #visionlanguage

TRIPS Enhances Vision‑Language Pre‑Training via Text Patch Selection

TRIPS selects text‑relevant image patches for vision‑language models, cutting training time by 40% with no loss in accuracy and no extra parameters; presented at EMNLP 2022. Read more: getnews.me/trips-enhances-vision-la... #trips #visionlanguage

Dual Active Learning Multimodal Model Boosts Source-Free Domain Adaptation

The Dual Active Learning (DAM) framework merges vision‑language model targets with a small set of human labels, achieving state‑of‑the‑art results on SFADA benchmarks. Read more: getnews.me/dual-active-learning-mul... #sfada #visionlanguage

Geometry-Based Fine-Tuning Boosts Spatial Skills in Vision-Language Models

Fine‑tuning on Euclid30K (~30 k geometry problems) raised VSI‑Bench accuracy from 34.5% to 40.5% in zero‑shot tests and gave RoboBrain2.0‑Euclid‑7B a 49.6% score. Read more: getnews.me/geometry-based-fine-tuni... #visionlanguage #spatialai
