#llmbenchmark

3 weeks ago

Google’s Gemini 3.1 Pro just doubled its reasoning scores on the latest benchmark—big win for AI reasoning chops. Curious how it stacks up? Dive in for the details. #GoogleGemini #ReasoningBoost #LLMBenchmark

🔗 aidailypost.com/news/google-...


AI Daily Post

@aidailypost.com

3 months ago

Gemini 3 Pro tops the new AI reliability benchmark, but hallucinations are still a problem. How does it stack up against GPT‑5.1 and Grok 4? Dive into the numbers and what they mean for LLMs. #Gemini3Pro #HallucinationRates #LLMbenchmark

🔗 aidailypost.com/news/gemini-...


collide.

@collide-ai.bsky.social

5 months ago

Grok crushed the others on speed, at 10x the tokens per second. Catch the full video exclusively on collide.io/community #llmbenchmark #grokai #modelperformance


GetNews.me

@getnews-me.bsky.social

6 months ago

PsychiatryBench Introduces a Comprehensive LLM Benchmark for Mental Health

PsychiatryBench, announced Sep 7 2025, offers a benchmark of over 5,300 items in eleven psychiatric QA formats, such as diagnostic reasoning and treatment planning. getnews.me/psychiatrybench-introduc... #psychiatrybench #llmbenchmark


Giskard

@giskard-ai.bsky.social

10 months ago

Asking chatbots for short answers can increase hallucinations, study finds | TechCrunch

Turns out, telling an AI chatbot to be concise could make it hallucinate more than it otherwise would have.

Thanks to Kyle Wiggers for this article. We're honored to see our research covered by TechCrunch. 🤝

Read the article here: techcrunch.com/2025/05/08/a...

#AISecurity #LLMBenchmark #research


Giskard

@giskard-ai.bsky.social

10 months ago

Read the article here: www.lesechos.fr/tech-medias/...

#AISecurity #LLMBenchmark #LesEchos


Giskard

@giskard-ai.bsky.social

10 months ago

Phare LLM Benchmark: an analysis of hallucination in leading LLMs

LLM benchmark reveals how LLMs confidently generate hallucinations & spread misinformation. It exposes critical AI security & safety risks when models provide authoritative-sounding but factually…

Phare is developed by Giskard with Google DeepMind, the European Commission and Bpifrance as research & funding partners.

👉 Full analysis: www.giskard.ai/knowledge/go...
Benchmark results: phare.giskard.ai

#AISecurity #LLMBenchmark #LLMs


Giskard

@giskard-ai.bsky.social

11 months ago

(P03) (INCYBER) The challenge of scaling up

Watch live the plenary session of the INCYBER Forum on the challenge of scaling up. Cybersecurity is a chain that the weakness of a s...

Full recording 👉 www.youtube.com/live/5hNnwl5...

#LLMBenchmark #AISecurity #ForumINCYBER #Research


Giskard

@giskard-ai.bsky.social

1 year ago

✨ Announcing Phare: new multi-lingual #LLMBenchmark 🌊

We're announcing an open & independent LLM benchmark to evaluate key AI security dimensions including hallucination, factual accuracy, bias, and potential for harm across several languages, with Google DeepMind as research partner.
👇

