HealthBench Evaluation Highlights Gaps for Japanese Medical AI
Researchers translated 5,000 HealthBench cases into Japanese and evaluated GPT‑4.1 and LLM‑jp‑3.1; GPT‑4.1’s score fell, while LLM‑jp‑3.1 performed poorly. Paper posted 22 Sep 2025. Read more: getnews.me/healthbench-evaluation-h... #healthbench #japan