"Reflective Reasoning rewrites the CoT to appear slow, careful, and methodical (e.g., explicit self-checks or step-by-step deliberation). This exploits an effort heuristic, where apparent deliberation is mistaken for correctness or rigor."
"We observe that content-based fabrication, specifically Progress Fabrication, induces the largest increases in both flip rate and FPR, indicating a particularly strong failure mode for VLM judges, while Reflective Reasoning remains comparatively benign."
"Figure 7: Average judge susceptibility across CoT manipulation strategies, showing relative and absolute change in false positive rate. Error bars denote variability across models."
"Figure 17: Average judge susceptibility across CoT manipulation strategies, showing average judgment flip rate. Progress Fabrication induces the largest flip rate, while Reflective Reasoning remains comparatively low. Error bars denote variability across models."
"Figure 5: Distribution of task categories across our evaluation suite. The 659 tasks span ten categories including booking, shopping, navigation, and information retrieval, with tasks drawn from existing benchmarks (WebArena, AssistantBench, WorkArena) and newly collected ones."
How well can #AI judge reasoning quality?
Rewriting agents' chain-of-thought "style" to *appear* more reflective (without changing the action or inference) increased an #LLM judge's false positive rate by 3 percentage points absolute (18% relative).
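A quick sketch of how the absolute and relative figures relate (the ~16.7% baseline FPR below is not from the paper; it is simply the value implied by a 3-point rise being an 18% relative rise):

```python
# Illustrative arithmetic only: back out the baseline FPR implied by
# a +3 percentage-point (absolute) change that equals an 18% relative change.
baseline_fpr = 0.03 / 0.18            # implied baseline, ~0.167
manipulated_fpr = baseline_fpr + 0.03  # FPR after the CoT-style rewrite
absolute_change = manipulated_fpr - baseline_fpr
relative_change = absolute_change / baseline_fpr
print(f"baseline ≈ {baseline_fpr:.1%}, "
      f"absolute +{absolute_change:.0%}, relative +{relative_change:.0%}")
```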
doi.org/10.48550/arX...
#philMind #compSci