Devstral2's "pelican riding a bicycle" benchmark drew scrutiny. Is it a relevant measure of coding-model quality, or just a quirk? Many compared its output with DeepSeek's and Claude's for practical utility, emphasizing real-world coding tasks. #LLMbenchmarks 2/6