15 Cloud/local LLMs benchmarked on 38 real tasks. MiniMax and Kimi tied for 2nd
Posted by ianlpaterson 3 hours ago
Comments
Comment by ianlpaterson 3 hours ago
The surprising part was the QA process. My initial results showed Haiku beating Sonnet. That turned out to be a json_array scorer bug: max_score was set to expected_row_count instead of len(expected_rows), which produced quality scores above 100%. A thin-space Unicode character (U+2009) in Gemini Flash responses silently broke three regex scorers. I ended up running five separate QA passes, each with a different model, and each pass found bugs the previous ones missed.
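For anyone curious what those two failure modes look like in practice, here's a minimal sketch. The function names and fields (`score_json_array_*`, `expected_row_count`) are my reconstruction, not the actual scorer code from the post; the assumption is that `expected_row_count` was a separately-maintained field that drifted out of sync with the real expected rows.

```python
import re

# Reconstruction of the json_array scorer bug: dividing by a stale
# expected_row_count field instead of len(expected_rows) lets the
# score exceed 1.0 whenever the field lags behind the actual data.
def score_json_array_buggy(matched_rows, expected_rows, expected_row_count):
    max_score = expected_row_count        # bug: wrong denominator
    return matched_rows / max_score

def score_json_array_fixed(matched_rows, expected_rows):
    max_score = len(expected_rows)        # fix: derive from the data itself
    return matched_rows / max_score

rows = ["a", "b", "c", "d", "e"]
print(score_json_array_buggy(5, rows, 4))   # 1.25 -> reported as "125%"
print(score_json_array_fixed(5, rows))      # 1.0

# The thin-space failure: U+2009 is not an ASCII space, so a regex
# written with a literal " " never matches, and the scorer silently
# reports a miss instead of raising an error.
pattern = re.compile(r"\d+ \d+")
text = "1\u2009234"                          # thin space between digits
print(bool(pattern.search(text)))            # False: silent failure

# One defensive fix: collapse all Unicode whitespace to ASCII spaces
# before scoring (Python's \s matches U+2009 in str patterns).
normalized = re.sub(r"\s", " ", text)
print(bool(pattern.search(normalized)))      # True
```

The silent part is the dangerous part: a regex miss looks identical to a genuinely wrong answer, which is why it took a separate QA pass to catch.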
Gemini 2.5 Flash scored 97.1% at $0.003/run with a 1.1s median response time. Opus scored 100% at $0.69/run. GPT-oss-20b scored 98.3% for $0. For most tasks, a 200x+ cost spread between models that all score above 95% is genuinely hard to justify.
Scoring code and raw results are in the post. Happy to answer questions about methodology.