GLM-5.2 is the new leading open weights model on Artificial Analysis
Posted by himata4113 6 hours ago
Comments
Comment by Tiberium 5 hours ago
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
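A rough sketch of that conversion, in Python. The token counts are the averages quoted above; the prices are assumptions borrowed from figures mentioned elsewhere in this thread ($25/M for Anthropic, and $2.20/M for what appears to be a third-party GLM 5.2 host), not official price sheets. It also bills every token at one flat rate, which is a simplification.

```python
# Average tokens per task as quoted above (from artificialanalysis.ai).
AVG_TOKENS = {"Opus 4.8": 41_000, "GLM 5.2": 42_000}

# ASSUMED prices in USD per million tokens; figures mentioned elsewhere in
# this thread, not official price sheets. Swap in real numbers before use.
ASSUMED_USD_PER_MTOK = {"Opus 4.8": 25.00, "GLM 5.2": 2.20}


def request_cost(tokens: int, usd_per_mtok: float) -> float:
    """Cost of one request if every token is billed at a single flat rate."""
    return tokens * usd_per_mtok / 1_000_000


if __name__ == "__main__":
    for model, tokens in AVG_TOKENS.items():
        cost = request_cost(tokens, ASSUMED_USD_PER_MTOK[model])
        print(f"{model}: {tokens} tokens -> ${cost:.4f} per request")
```

Under those assumed prices, nearly identical token counts still leave GLM 5.2 roughly an order of magnitude cheaper per request. How long it takes to generate 42k tokens is a separate question.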
Comment by benjiro29 4 hours ago
If you want reasonable token usage, you need to run GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks), and it cuts token usage by 2 to 2.5x. GLM 5.2 Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8's little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none, I tell you! /sarcasm
Comment by maxdo 18 minutes ago
Comment by vitalyan123 4 hours ago
Comment by duskdozer 4 hours ago
Sarcasm, considering the source of their own training data?
Comment by margalabargala 58 minutes ago
Comment by orphea 3 hours ago
Comment by baron3dl 1 hour ago
Comment by ComputerGuru 1 hour ago
Comment by overfeed 33 minutes ago
Comment by mannanj 2 hours ago
Though only in particular situations, like when it’s done to them and not when they do it. Cause they have the power and are morally right and know better than you. And if you question this at all, well, you’re a threat to American values and a supporter of the Chinese and leading to the breakdown of democracy.
This isn’t a type of reasoning argument or manipulation tactic used by the rich throughout history to trick the naive and gullible masses or anything like that. Trust me, I’m rich and I’m morally right. /sarcasm
Comment by vorticalbox 4 hours ago
To the point where I stop it and simply tell it to “start writing code, you can work it out as you go along”
Seems writer's block also affects LLMs
Comment by robertkarl 1 hour ago
In this paper they nerf an LLM's ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.
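For illustration only, here is a minimal Python sketch of what suppressing such tokens at decode time could look like. This is not the paper's code: the word list comes from the comment above, the scores are made up, and real decoders operate on token IDs rather than whole words.

```python
import math

# Words named in the comment above; a real implementation would ban token IDs.
BANNED_WORDS = {"wait", "but", "alternatively"}


def mask_banned(logits: dict[str, float], banned=BANNED_WORDS) -> dict[str, float]:
    """Return a copy of the logits with banned words pushed to -inf."""
    return {
        word: (-math.inf if word.lower() in banned else score)
        for word, score in logits.items()
    }


def greedy_pick(logits: dict[str, float]) -> str:
    """Pick the highest-scoring candidate (greedy decoding)."""
    return max(logits, key=logits.get)


if __name__ == "__main__":
    # Made-up next-word scores for one decoding step.
    step_logits = {"Wait": 3.2, "so": 2.9, "but": 2.5, "return": 1.0}
    print(greedy_pick(step_logits))               # Wait
    print(greedy_pick(mask_banned(step_logits)))  # so
```

Masking before the pick means the model can never choose the hedging word, so the next-best continuation wins instead.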
Comment by mikeocool 4 hours ago
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
Comment by giancarlostoro 3 hours ago
Another thing I tell Claude to do is not to guess but to look at the documentation. It messes up a lot less; it might use some tokens reading docs, but at least it has a higher success rate code-wise.
Comment by xstas1 2 hours ago
Comment by giancarlostoro 2 hours ago
https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic...
Comment by saltsucker 2 hours ago
Also just think about it, why would a model trained on the world’s corpus of text (that isn't formatted in XML) perform better with XML? It would be a better study if that post tested markdown, org, xml, json, etc. 10 times to see if there is a difference.
Comment by swingboy 1 minute ago
Comment by adastra22 46 minutes ago
Comment by root-parent 2 hours ago
Comment by noworriesnate 31 minutes ago
Comment by thinkingtoilet 3 hours ago
Comment by epolanski 4 hours ago
It's clear it was the vibe-coding model: like no other model before it, it fully turned you into its assistant instead of the other way around.
Comment by RyanHamilton 3 hours ago
Comment by epolanski 3 hours ago
Comment by happyPersonR 2 hours ago
Comment by overfeed 1 minute ago
Comment by h14h 2 hours ago
Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.
Comment by bertili 5 hours ago
Comment by Tiberium 5 hours ago
Comment by andai 4 hours ago
Comment by epolanski 4 hours ago
Low nailed the overwhelming majority of mundane tasks on its own, medium was good for more complex stuff.
Comment by robmccoll 3 hours ago
Comment by gbingles 48 minutes ago
Comment by rdsubhas 2 hours ago
Comment by esafak 58 minutes ago
Comment by cmrdporcupine 3 hours ago
GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.
And this was high, not max.
Comment by kristopolous 3 hours ago
All it does is pull a JSON from their main table page and parse it for the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
score age size name
47.1 58 large Kimi K2.6
47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
47.5 70 - Muse Spark
47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
48.6 55 - GPT-5.5 (Non-reasoning)
48.7 188 - GPT-5.2 (xhigh)
50.1 29 - Qwen3.7 Max
50.7 1 large GLM-5.2 (max)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
51.5 92 - GPT-5.4 mini (xhigh)
52.1 55 - GPT-5.5 (low)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
55.5 118 - Gemini 3.1 Pro Preview
56.2 55 - GPT-5.5 (medium)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
57.2 104 - GPT-5.4 (xhigh)
58.5 55 - GPT-5.5 (high)
59.1 55 - GPT-5.5 (xhigh)
62 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
To see everything, run it like so
$ curl day50.dev/art-analysis.sh | bash
The repo: https://github.com/day50-dev/aa-eval-email
some key takeaways:
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
Comment by papersail 3 hours ago
score age size name
62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1 55 - GPT-5.5 (xhigh)
58.5 55 - GPT-5.5 (high)
57.2 104 - GPT-5.4 (xhigh)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2 55 - GPT-5.5 (medium)
55.5 118 - Gemini 3.1 Pro Preview
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1 55 - GPT-5.5 (low)
51.5 92 - GPT-5.4 mini (xhigh)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7 1 large GLM-5.2 (max)
50.1 29 - Qwen3.7 Max
48.7 188 - GPT-5.2 (xhigh)
48.6 55 - GPT-5.5 (Non-reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
Comment by christoff12 13 minutes ago
Are the scores here normalized such that each point difference is equidistant?
Comment by papersail 33 minutes ago
rank score age size name
1 62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
2 59.1 55 - GPT-5.5 (xhigh)
3 58.5 55 - GPT-5.5 (high)
4 57.2 104 - GPT-5.4 (xhigh)
5 56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
6 55.5 118 - Gemini 3.1 Pro Preview
7 53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
8 53.1 132 - GPT-5.3 Codex (xhigh)
9 52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
10 51.5 92 - GPT-5.4 mini (xhigh)
11 50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
12 50.7 1 large GLM-5.2 (max)
13 50.1 29 - Qwen3.7 Max
14 48.7 188 - GPT-5.2 (xhigh)
15 48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
16 47.8 205 - Claude Opus 4.5 (Reasoning)
17 47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
18 47.5 70 - Muse Spark
19 47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
20 47.1 58 large Kimi K2.6
21 47.1 29 - Gemini 3.5 Flash (minimal)
22 46.7 449 - Gemini 2.5 Pro Preview (Mar' 25)
23 46.5 211 - Gemini 3 Pro Preview (high)
24 46.5 16 - Qwen3.7 Plus
25 46.4 120 - Claude Sonnet 4.6 (Non-reasoning, High Effort)
26 45.6 5 large Kimi K2.7 Code
27 45.6 104 - GPT-5.4 (low)
28 45.5 56 large MiMo-V2.5-Pro
29 45.1 43 - GPT-5.5 Instant (May 2026)
30 45.0 29 - Gemini 3.5 Flash (high)
31 44.9 58 - Qwen3.6 Max Preview
32 44.7 216 - GPT-5.1 (high)
33 44.2 188 - GPT-5.2 (medium)
34 44.2 126 large GLM-5 (Reasoning)
35 43.9 92 - GPT-5.4 nano (xhigh)
36 43.4 71 large GLM-5.1 (Reasoning)
37 43.4 16 large MiniMax-M3
38 43.2 54 large DeepSeek V4 Pro (Reasoning, High Effort)
39 43.0 188 - GPT-5.2 Codex (xhigh)
40 42.9 76 - Qwen3.6 Plus
41 42.9 205 - Claude Opus 4.5 (Non-reasoning)
42 42.6 182 - Gemini 3 Flash Preview (Reasoning)
43 42.2 99 - Grok 4.20 0309 (Reasoning)
44 42.1 56 large MiMo-V2.5
45 41.9 91 large MiniMax-M2.7
46 41.4 91 - MiMo-V2-Pro
47 41.3 121 large Qwen3.5 397B A17B (Reasoning)
48 41.0 48 - Grok 4.3 (high)
49 40.5 71 - Grok 4.20 0309 v2 (Reasoning)
50 40.5 342 - Grok 4
51 39.8 54 large DeepSeek V4 Flash (Reasoning, High Effort)
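The filtering behind a list like this one (keep at most two entries per model, drop everything below a cutoff score) can be sketched in Python. The row format, and the rule that a model's name is whatever precedes the first parenthesis, are assumptions made for this sketch, not a description of how the list was actually produced.

```python
from collections import defaultdict


def base_name(name: str) -> str:
    """'GPT-5.5 (xhigh)' -> 'GPT-5.5' (assumed grouping rule)."""
    return name.split("(")[0].strip()


def curate(rows, per_model=2, cutoff=0.0):
    """rows: iterable of (score, name). Returns kept rows, best score first."""
    kept = []
    seen = defaultdict(int)  # how many entries we have kept per base model
    for score, name in sorted(rows, key=lambda r: r[0], reverse=True):
        if score < cutoff:
            break  # rows are sorted, so everything after is below the cutoff too
        model = base_name(name)
        if seen[model] < per_model:
            seen[model] += 1
            kept.append((score, name))
    return kept


if __name__ == "__main__":
    rows = [
        (59.1, "GPT-5.5 (xhigh)"),
        (58.5, "GPT-5.5 (high)"),
        (56.2, "GPT-5.5 (medium)"),
        (52.1, "GPT-5.5 (low)"),
        (50.7, "GLM-5.2 (max)"),
        (39.8, "DeepSeek V4 Flash (Reasoning, High Effort)"),
    ]
    for score, name in curate(rows, cutoff=39.8):
        print(f"{score:5.1f}  {name}")
```

With these sample rows, the medium and low GPT-5.5 entries are dropped, which matches the list above showing only the xhigh and high variants.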
A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.
Comment by cmrdporcupine 17 minutes ago
Surprised to see MiniMax M3 so low on that list, not really my experience, I found it smarter than Gemini for a lot of things, that's for sure.
Also surprised to see Gemini 3.1 ranked that high there. It remains IMHO blatantly incompetent for tool use even in their own harnesses, so I can only assume this benchmark isn't ranking workflow things very high. Gemini can write code just fine. It just can't work well as an agent.
GLM 5.2 and Qwen3.7 max were from my experience fairly expensive to use on a per token price and hard to argue in favour of when the SOTA coding plans have a fixed price that makes them potentially more cost effective. (Yes I know z.ai has a coding plan but I've heard reliability nightmare stories, and it's not very cheap)
DeepSeek is clearly the best value for $$. With the right harness and prompting.
Comment by bel8 2 hours ago
Comment by kristopolous 2 hours ago
If you really want to see all of them:
Or run the script
Comment by ashenke 2 hours ago
Comment by tcp_handshaker 3 hours ago
- GPT 5.5 is consistently the best, an opinion that gets me constant downvotes here from the Anthropic Marketeer strike force...
- China is going to eat the US's lunch on AI
- What have European universities and companies been doing? It's as if, in a parallel past/future, Nikola Tesla and Edison had created flying cyberpunk machines while European researchers were getting together to request EU funds to investigate how to breed faster horses.
- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for it... should he be fired?
Comment by Certhas 2 hours ago
Mistral is clearly not currently competing at the frontier-model level. Whether this is due to a lack of VC funds or a lack of technical ability, or the former arising from the latter, would be interesting to know.
The top models are from startups. Among the FAANG only Google managed to get a frontier model, and they literally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there though.
So why did no EU-based startups succeed while two US startups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese open weights mixed in. The EU consistently cannot turn its considerable economic output into fast-moving tech firms.
Comment by Quarrel 1 hour ago
They've got a heap of contractors working to help industry adopt LLMs. It is just classic consulting work, and they'd look like a really great company if we weren't comparing them to literal $2T+ companies losing money hand-over-fist...
Comment by sschueller 2 hours ago
Comment by kristopolous 2 hours ago
They had Watson, remember? It won on Jeopardy like 15 years ago. They've been at this for a long time.
Maybe it's good at something else?
Comment by tekchip 2 hours ago
Comment by JSR_FDED 11 minutes ago
Upon closer inspection the $1B is (a) over 10 years, (b) mostly internal cross-billing between departments.
Comment by vunderba 45 minutes ago
Comment by root-parent 2 hours ago
They had to start from scratch, but don't seem to have management smart enough to stop doing it in-house. They could have just acquired a startup that could build a frontier model.
Which is also very ironic, since their whole business for the last 15 years has been buying companies à la CA Associates...
Their previous Watson branding and the collapse of Watson expectations cost them one CEO, but the current CEO was part of the same team. They just don't learn...
Comment by greenavocado 2 hours ago
Comment by marcus_cemes 2 hours ago
ETH Zurich and EPFL universities recently put out an open model called Apertus (was on the HN front page a few months back), it's not a frontier model, but they built it properly regarding copyright and data transparency.
It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.
Comment by dr_dshiv 2 hours ago
Doing things with ethical intentions does not necessarily produce outcomes that are beneficial for society at large.
Comment by marcus_cemes 1 hour ago
Comment by tw1984 54 minutes ago
Europoor is not doing anything. If your lack of AI progress is caused by regulations and respect for IP laws, what about EVs, robotics, drones, batteries, quantum computing? Are those also slowed down by your over-regulation? LOL.
Europoor is called Europoor for a reason, your attitude here is the best explanation on how it happened.
Comment by JKCalhoun 1 hour ago
Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.
Comment by JSR_FDED 9 minutes ago
Comment by kristopolous 2 hours ago
Also what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits facebook's needs perfectly...
Comment by jansan 2 hours ago
Comment by kristopolous 2 hours ago
I mean that is the smart move here. Focus the model on optimizing the core business. For Meta, that's not coding tools.
Comment by applicative 2 hours ago
They will forever have superior weights?
Comment by JKCalhoun 1 hour ago
Comment by rapind 41 minutes ago
Comment by cmrdporcupine 24 minutes ago
1. SamA and his company have a well-deserved bad reputation and Anthropic got some early good PR for basically not being SamA.
2. Claude Code got early head space, Boris and crew basically "invented" this kind of agent, and so has first mover advantage despite its known reliability and cost issues.
3. Most people I talk to haven't even tried Codex for some reason
Also it's uncool to complain about downvotes.
Comment by senordevnyc 2 hours ago
And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.
As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.
Comment by ricardobayes 2 hours ago
Comment by alecco 3 hours ago
Comment by kristopolous 3 hours ago
But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...
Add an argument (any argument) and it will be sorted as you specified. It just works as a toggle flipping the order ... so literally any string will do.
The original link has been updated accordingly with the new code.
Comment by datadrivenangel 2 hours ago
Comment by kristopolous 2 hours ago
$ ./art-analysis.sh | grep small
or maybe just the qwen
$ ./art-analysis.sh | grep Qwen
only the ones in the past 30 days
$ ./art-analysis.sh | awk '$2 < 31'
I use it in pipes like this.
Comment by spwa4 3 hours ago
Comment by bodhi_mind 2 hours ago
Comment by slig 3 hours ago
Comment by kristopolous 3 hours ago
Comment by duckmysick 2 hours ago
That aside, this is a good script you're running. Thanks.
Comment by tasuki 2 hours ago
Comment by fridder 3 hours ago
Comment by snsnbsne 2 hours ago
Set up a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.
Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.
Comment by scrollop 1 hour ago
Comment by mrngld 4 hours ago
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
Comment by undecidabot 1 hour ago
Comment by lukewarm707 2 hours ago
openai, google and anthropic subscriptions are not available with privacy.
looking at the link there it's interesting that going from cursor cli to codex cli take gpt 5.5 from 7th to 3rd. but they didn't do open model in codex.
so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.
Comment by vadansky 1 hour ago
Unless you're running it locally, aren't you just trusting some other entity?
Comment by lukewarm707 1 hour ago
however the legal terms are different, openai reads your data. they store it for 30 days, but of course once it hits the disk you can keep as long as you like in a civil case like nyt v openai.
the same for google and anthropic. so, it's not always nice if someone is paid to read your data for safety. people upload sensitive matters, personal videos and so on.
i wouldn't prioritise it myself but you can also know that the data will all come out in discovery if you are in a legal issue. maybe that's not important, but people thought it did matter to give some protections to patient records, legal advice and therapy. you upload that to gpt and it goes into discovery.
Comment by ttul 2 hours ago
Fable 5 is cool and all, but we have not yet seen GPT-5.6.
Comment by cmrdporcupine 4 hours ago
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
Comment by pjerem 2 hours ago
With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because i don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.
Comment by zooming 3 hours ago
Comment by re-thc 2 hours ago
GPT can find fault in everything and anything including its own work.
Comment by gbingles 37 minutes ago
Code is somewhat artistic. If you don't have well defined standards and priorities, the AI review cycle can spiral infinitely figuratively debating what makes art good, and your code will be no better for it.
Comment by cmrdporcupine 32 minutes ago
This makes it slower to work with for prototyping, and it will, if not properly disciplined, litter your code with "legacy adapters" and "bridge code" and temporary incremental refactoring steps [arguably not terrible for work in real commercial software projects]. And it will create too many unit & integration tests, if you're not careful.
But it does, in my opinion, tend to produce more reliable software and I trust it far more than I did when I was working in Claude.
When I could afford it, I had both plans running: Claude to produce new features, and then Codex to brutally critique it, battle-test it, sharpen the edges, and produce better tests. This flow went extremely well.
Now I just work with Codex and various open models.
Comment by cmrdporcupine 55 minutes ago
Somehow it's just way more careful than the others, and also much better at empirical verification of its hypotheses, writing tests, etc. I assume a lot of RL was done on that kind of flow, and on seeking out negative cases, failure points, race conditions.
Comment by unrvl22 5 hours ago
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
Comment by stanac 4 hours ago
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
edit:
I see, croft [2] 8bit for $0.50/$0.08/$2.20
Comment by scrlk 3 hours ago
Comment by johnnyApplePRNG 26 minutes ago
Claude Shannon is rolling in his grave.
Comment by ComputerGuru 1 hour ago
Comment by scrlk 1 hour ago
So you could end up paying more for unquantised weights, only to get silently hit with a quantised KV cache...
Comment by benjiro29 4 hours ago
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper than API rates, and about 2.5x more tokens vs zai's own subscription service.
Yes, it's FP8, but let's be honest, do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.
Comment by CuriouslyC 5 hours ago
Comment by thehamkercat 3 hours ago
(there's a table which shows comparison between vendors)
Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier
Comment by cedws 4 hours ago
Comment by kilroy123 3 hours ago
Comment by orbital-decay 2 hours ago
Comment by ComputerGuru 1 hour ago
Comment by alecco 3 hours ago
Comment by unrvl22 5 hours ago
Comment by Schiendelman 5 hours ago
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
Comment by andai 4 hours ago
https://docs.z.ai/devpack/tool/claude
Here's my setup. I add this to my .bashrc
export ZAI_API_KEY="your_key_here"
alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'
Then I just run claudez
Pro tip: the same thing works with DeepSeek https://api-docs.deepseek.com/guides/anthropic_api
Even more pro tip: Claude Code can set this up for you haha
Comment by Schiendelman 4 hours ago
Unless this were a massive differentiator, people aren't going to be "talking about it" the way GP suggests!
Comment by cromka 1 hour ago
It's crazy that apparently writing software without knowing how to edit a single config file is normal now.
Comment by johnnyApplePRNG 24 minutes ago
Comment by bityard 18 minutes ago
Comment by fc417fc802 4 hours ago
Comment by Schiendelman 4 hours ago
Comment by donohoe 2 hours ago
Comment by fc417fc802 4 hours ago
Comment by Schiendelman 3 hours ago
Comment by neonstatic 3 hours ago
Comment by ramraj07 1 hour ago
Comment by skeledrew 3 hours ago
Comment by Schiendelman 2 hours ago
1) You haven't even heard of it.
2) You have to know to look for both GLM and Z.ai. These are usually in the same article when reporting about GLM is written, at least.
3) You have to understand there could be a benefit in trying it; you have to want to try it for some reason. Their own blog post puts it below Opus 4.8 in each of the three benchmarks they used.
4) You have to figure out the pricing, which isn't obviously in the blog post...
5) When I first went to Z.ai, I got an error popup (not logged in): "You do not have permission to access this resource. Please contact your administrator for assistance." I am using a personal computer...
6) When I typed something in the resultant field and pressed enter, I got "Clear Current Chat? To start a new chat, your current conversation will be discarded. Sign in to save chats"
I think today's article helped with 1 and 2, which helps their top of funnel. But they're fighting a big uphill battle.
Comment by chen66996 4 hours ago
Comment by chillfox 1 hour ago
Comment by re-thc 2 hours ago
There's ZCode (https://zcode.z.ai). Which is like the Codex App.
That's as "easy" as it is for non-devs that you're complaining about.
Comment by qingcharles 24 minutes ago
Comment by Schiendelman 1 hour ago
Comment by gerryf2 2 hours ago
I'd pay for an out of the box solution. i.e. an Installer with updates
Comment by embedding-shape 5 hours ago
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, things like the submission is just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
Comment by unrvl22 5 hours ago
Comment by cedws 4 hours ago
Comment by knollimar 3 hours ago
Comment by redox99 3 hours ago
Comment by smith7018 2 hours ago
Comment by redox99 2 hours ago
Do note that GLM is not multimodal, which can be a deal breaker. And these open models are not good outside coding.
Comment by unrvl22 2 hours ago
Comment by smith7018 1 hour ago
I wish I had the time to set it up and work on side projects but unfortunately life and work have been crazy (as I'm sure many here feel). That's why I asked for anecdotes about it.
Comment by Hamuko 5 hours ago
Comment by andai 4 hours ago
https://github.com/QuantiusBenignus/Zshelf/discussions/2
Not accounting for hardware, of course :)
Comment by NorwegianDude 3 hours ago
Nvidia GPUs are much more efficient than Apple hardware for inference (and training).
Comment by Hamuko 4 hours ago
Not accounting hardware in my costs, since I didn’t buy my hardware for running models. Running models is just something it can do in addition to what I got it for.
Comment by igravious 5 hours ago
Comment by Hamuko 4 hours ago
Comment by simianwords 4 hours ago
Comment by anuramat 4 hours ago
link?
> Why
imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"
more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench
Comment by CuriouslyC 2 hours ago
Comment by simonw 3 hours ago
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
Comment by 0xbadcafebee 2 hours ago
Comment by ashenke 2 hours ago
Comment by _pdp_ 3 hours ago
Comment by simonw 3 hours ago
Even the local models I run on my Mac are getting surprisingly good at that now.
Comment by tiahura 2 hours ago
Comment by CuriouslyC 5 hours ago
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
Comment by sdesol 2 hours ago
This is honestly what I care about the most now, which is how well they can write. I think we have reached a point now where, if you know how to program, you can provide enough information for the models to pretty much do what you need.
What they still struggle immensely with is the writing, which has too many nuances, but they are truly getting better.
Comment by Havoc 4 hours ago
Discovered today that they set reasoning effort to max by default. So that’s probably why
Comment by andai 4 hours ago
Comment by igravious 5 hours ago
Comment by elwebmaster 4 hours ago
Comment by theplumber 4 hours ago
Review the commits with both Claude and GPT 5.5 Xhigh. You can see that Fable is still sloppy(er) compared to GPT. You can test it the other way around as well (drive the dev with GPT and review with GPT and Claude). You get the same result. Claude does have an edge though, and that’s in building more beautiful user interfaces.
Comment by fragmede 4 hours ago
Comment by CubsFan1060 4 hours ago
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
Comment by wongarsu 4 hours ago
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
Comment by user43928 1 hour ago
There they can deploy these models while using the existing legal frameworks.
Comment by CubsFan1060 3 hours ago
Comment by bitmasher9 3 hours ago
We’re approaching a world where running a primer frontier model is possible on a workstation, probably will have something under $30k that looks like a desktop for Nvidia’s next generation. It sounds expensive, until you look at your Anthropic bill.
It’s similar unit economics to cloud computing for the open models. You can save a ton on the expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is open models are usually 2 quarters behind frontier, and your competitors are probably trying to get access to mythos.
Comment by program_whiz 53 minutes ago
That's a $500K-$1M+ rig as of now. That's a lot of $200 subscriptions to break even, but reasonable if you are paying Anthropic $25/M tokens. Then of course there's the power, cooling, and maintenance to consider...
But yeah, I can see if the prices come down 10x in a few years, or crater after the bubble, $30-40k might get you a decent machine.
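As a back-of-the-envelope check on those figures, a small Python sketch. It uses only the numbers quoted above ($500K to $1M for the rig, $200 subscriptions, $25 per million tokens) and deliberately ignores power, cooling, maintenance and depreciation, so the real break-even points would come later than these.

```python
def subscription_months(rig_usd: float, sub_usd_per_month: float = 200.0) -> float:
    """How many subscription-months cost the same as the rig."""
    return rig_usd / sub_usd_per_month


def breakeven_tokens(rig_usd: float, usd_per_mtok: float = 25.0) -> float:
    """How many API tokens cost the same as the rig."""
    return rig_usd / usd_per_mtok * 1_000_000


if __name__ == "__main__":
    for rig in (500_000, 1_000_000):
        months = subscription_months(rig)
        tokens = breakeven_tokens(rig)
        print(f"${rig:,} rig = {months:,.0f} subscription-months "
              f"or {tokens / 1e9:,.0f}B tokens at $25/M")
```

That works out to 2,500 to 5,000 subscription-months (for a hypothetical 100-person team, roughly two to four years of $200 plans), or 20 to 40 billion tokens at the API rate.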
Comment by wongarsu 3 hours ago
But prices are changing rapidly, and not for the better
Comment by MikhailTal 3 hours ago
Your usage will peak during certain time zones' work hours (even if you are a huge multinational company, most of your engineers/users tend to be in only a few locations), so you have a bunch of GPUs doing nothing the rest of the day. Especially with latency-sensitive stuff, this is a decades-old tradeoff problem; it's not unique to LLMs.
Comment by Havoc 4 hours ago
Would need to be a pretty determined medium biz
Comment by moffkalast 4 hours ago
Comment by petesergeant 4 hours ago
Comment by CubsFan1060 4 hours ago
Comment by tancop 4 hours ago
Comment by petesergeant 2 hours ago
I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or this is just vibes?
Comment by re-thc 4 hours ago
Years.
Even Microsoft said they don't have enough for Github and need to call Amazon.
Getting a few even at decent prices is hard. Unless the shortage eases...
Comment by tensegrist 4 hours ago
am i missing something?
Comment by OtherShrezzing 4 hours ago
Comment by xiaoyu2006 4 hours ago
Comment by simianwords 4 hours ago
Comment by stymaar 3 hours ago
We have no proof in either direction, it's not like we had access to their financial numbers in details.
And the pricing itself muddies the water, as input tokens that are already in the KV cache are practically free for the provider, whereas other tokens are expensive. So they could still make money overall thanks to people having multi-turn conversation (and as such, paying multiple times for the same token), but lose money on actual compute done.
> there are lots of third party hosting services that will still run at breakeven/profit.
How can you be sure that they are making profit directly from token price, and are not billing at marginal cost (i.e. electricity price, without counting the cost of the GPUs) and aiming to make a profit later on from the valuable training data that they are collecting in the process?
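To make the multi-turn point concrete, here is a small sketch of how billed input tokens can outgrow the tokens that actually need fresh compute. It assumes a perfect KV cache hit on the existing history and a provider that bills the full context on every turn; real providers often discount cached tokens, so this is the worst case for the customer, not a description of any specific price list.

```python
# Billed input tokens vs. tokens needing fresh compute in a multi-turn chat.
# Assumptions: the whole history is resent and billed every turn, and the
# provider's KV cache already holds everything except the newly added tokens.

def billed_vs_fresh_input_tokens(new_tokens_per_turn: list[int]) -> tuple[int, int]:
    """Return (total billed input tokens, total tokens that miss the cache)."""
    billed = 0
    fresh = 0
    history = 0
    for new in new_tokens_per_turn:
        history += new      # context grows by this turn's new tokens
        billed += history   # the entire context is billed again
        fresh += new        # only the new suffix needs real prefill compute
    return billed, fresh


if __name__ == "__main__":
    billed, fresh = billed_vs_fresh_input_tokens([1000, 1000, 1000])
    print(billed, fresh)  # 6000 3000
```

With three turns of 1,000 new tokens each, 6,000 input tokens are billed while only 3,000 need fresh compute, and the gap widens quadratically as the conversation gets longer.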
Comment by simianwords 2 hours ago
You are free to believe that they are doing all this. Or you can simply trust the intuition that models are getting cheaper by the day. I can run Gemma 4 31B on my laptop today.
Comment by leemoore 1 hour ago
Comment by XCSme 4 hours ago
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
Comment by XCSme 4 hours ago
This means that models are losing more and more general and domain-specific knowledge.
Look at those graphs on Artificial Analysis, GLM-5.1 still performs similarly or better:
AA-Omniscience Accuracy: https://i.snipboard.io/5DYmpx.jpg
IFBench: https://i.snipboard.io/74kg0R.jpg
I still feel like models have not been getting any smarter for a few months now. They just changed their training to focus more on some areas than others, shifting the intelligence from one place to another without necessarily increasing the overall intelligence or "AGI" score.
Comment by sourcecodeplz 3 hours ago
Comment by xiaoyu2006 4 hours ago
Comment by kingstnap 5 hours ago
Excited to see if this turns out to be an open-weight Opus 4.5 or better.
Comment by andai 4 hours ago
I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.
There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)
As far as they go, though, these harder benchmarks match my experience more closely:
and https://cognition.ai/blog/frontier-code
Where we see "top" models drop way down in score when given longer tasks.
That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)
By the time I'm done testing all the Chinese models, they'll be obsolete :)
Comment by adastra22 29 minutes ago
Comment by wongarsu 4 hours ago
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
Comment by hereme888 48 minutes ago
Comment by bityard 8 minutes ago
Comment by ponyous 2 hours ago
Here are the results compared to Gemini 3.5 Flash:
Model + config             CodeErr/gen   Cost/gen   Median time   Quality
gemini-3.5-flash, low      0.71          $0.18      68s           baseline
GLM 5.2, reasoning high    0.61          $0.18      289s          -6.0%
GLM 5.2, reasoning off     1.52          $0.10      126s          -13.6%
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly, high reasoning produces fewer code errors than Gemini 3.5 Flash, but when I actually look at the models, they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3, and this is clearly the open-source SOTA model, by far.
Comment by NiloCK 2 hours ago
I think there's probably some good juice to squeeze in terms of spatial awareness by doing a benchmark something like
- give 3d modelling task
- render and snapshot from a variety of angles
- feed to third-party vision model for a "what is this" type query
- grade on end-to-end accuracy
Bonus points for asking the vision model something like "how beautiful is this 1-10".
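The steps above could be wired together roughly like this. It is only a sketch: `build_model`, `render_views`, and `identify` are hypothetical callables standing in for the model under test, the renderer, and the third-party vision model, and the "majority of views must match by exact label" rule is my own simplification of "grade on end-to-end accuracy".

```python
# Sketch of the proposed benchmark loop. The three callables are hypothetical
# stand-ins; plug in a real generator, renderer, and vision model as needed.
from typing import Callable, Sequence


def spatial_benchmark_accuracy(
    tasks: Sequence[str],
    build_model: Callable[[str], object],
    render_views: Callable[[object], Sequence[object]],
    identify: Callable[[object], str],
) -> float:
    """Fraction of tasks where most rendered views are identified as the target."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = 0
    for label in tasks:
        model = build_model(label)      # give 3D modelling task
        views = render_views(model)     # render and snapshot from several angles
        hits = sum(
            1 for view in views
            if identify(view).strip().lower() == label.lower()  # "what is this?"
        )
        if hits * 2 > len(views):       # assumed rule: strict majority of views
            passed += 1
    return passed / len(tasks)          # end-to-end accuracy


if __name__ == "__main__":
    # Trivial stubs: the "model" is just its label and the "vision model" echoes it.
    score = spatial_benchmark_accuracy(
        ["coffee cup", "bookend"],
        build_model=lambda label: label,
        render_views=lambda model: [model] * 4,
        identify=lambda view: str(view),
    )
    print(score)  # 1.0
```

Exact string matching is too strict for real vision-model answers; a fuzzy or embedding-based match would probably be needed in practice.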
Comment by ponyous 1 hour ago
I was benchmarking using a soon-to-be-released new version of my AI CAD modeling software[0]. It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, use Blender to combine sculpts + parametric models, inspect the model (visually and with code), search datasheets, ...
I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.
Here is how I score adherence (the AI used the same scale, though I also tried methods where it would just give back a boolean "pass" or not):
<0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
<0.4 → Weak – Partially relevant; significant omissions or errors.
<0.6 → Fair – Covers main points but lacks completeness or precision.
<0.8 → Good – Mostly accurate; minor gaps or deviations.
<=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
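As a small illustration, the bands above map onto a lookup like the following. The function name and the range check are my own additions; the thresholds and labels are taken directly from the list.

```python
# Map an adherence score in [0, 1] to the band labels listed above.
# The function name and the input validation are illustrative additions.

def adherence_label(score: float) -> str:
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    bands = [(0.2, "Poor"), (0.4, "Weak"), (0.6, "Fair"), (0.8, "Good")]
    for upper, label in bands:
        if score < upper:  # bands use strict upper bounds
            return label
    return "Excellent"     # everything from 0.8 up to and including 1.0


if __name__ == "__main__":
    for s in (0.1, 0.2, 0.55, 0.79, 0.8, 1.0):
        print(s, adherence_label(s))
```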
Here is the scenario list (prompts are much more detailed):
dragon-bottle-stopper
editing-param-mid-conv
editing-parametric-enclosure
editing-swap-material-param
editing-text-edit-cube
multi-turn-bird-house
multi-turn-dice-tower
multi-turn-modular-planter
multi-turn-phone-stand
multi-turn-shelf
one-shot-bookend
one-shot-cable-clip
one-shot-chess-queen
one-shot-coaster
one-shot-coffee-cup
one-shot-dog-tag
one-shot-dragon-figurine
one-shot-hex-bracket
one-shot-keychain-fob
one-shot-low-poly-tree
one-shot-pegboard-hook
one-shot-pi4-case
one-shot-threaded-jar
[0]: https://grandpacad.com
Comment by ComputerGuru 1 hour ago
Comment by ponyous 1 hour ago
Edit: Surprisingly very good results with 3.0 flash with high thinking.
Cost: $0.06
Duration: 3.22 min
Code Errors: 1.3 per attempt (meaning on average it had to retry 1.3 times)
Adherence was on par with 3.5 flash Low thinking
Comment by ComputerGuru 1 hour ago
Comment by osti 49 minutes ago
Comment by m-dot-reviews 2 hours ago
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
Comment by rahidz 4 hours ago
Comment by segmondy 2 hours ago
Comment by dryarzeg 4 hours ago
Comment by 0xbadcafebee 2 hours ago
Comment by mordae 4 hours ago
It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
Comment by osti 4 hours ago
Comment by freigeist79 1 hour ago
Comment by adrian_b 4 hours ago
With open-weights LLMs, it is affordable to use many different models, each for whatever it does best.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.
Comment by Havoc 4 hours ago
Comment by _pdp_ 4 hours ago
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
Comment by LUmBULtERA 2 hours ago
Comment by kreddor 1 hour ago
Comment by davidwritesbugs 5 hours ago
Comment by ramon156 4 hours ago
I haven't extensively used 5.2 yet, but it seems a lot better.
Comment by RDTvlokip 1 hour ago
Comment by alansaber 2 hours ago
Comment by dizhn 3 hours ago
Comment by Alifatisk 1 hour ago
Comment by robertwt7 2 hours ago
This is a great step up in open models. However, the pricing to support z.ai is not much cheaper than a Claude / OpenAI subscription.
Comment by Pragmata 4 hours ago
QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs on the same hardware, right?
Comment by XCSme 4 hours ago
GLM-5.2 is already close to Opus-4.7 level:
https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
Comment by XCSme 4 hours ago
Comment by segmondy 2 hours ago
Comment by Pragmata 4 hours ago
QWEN3.6 27b is pretty good, but I can still notice some spots where it's not as good as the frontier models.
Comment by segmondy 2 hours ago
Comment by Pragmata 2 hours ago
Comment by KaoruAoiShiho 2 hours ago
Comment by piterrro 2 hours ago
Comment by 0xbadcafebee 2 hours ago
Comment by enraged_camel 2 hours ago
I'm not interested in using AI to write code that would have taken me 5-10 minutes to write myself. I use AI to debug complex bugs and develop large features that span multiple domains - stuff that normally takes hours, if not days/weeks. A model that is "enough for 95%" does not cut it for that, because the failures compound during long-horizon tasks and the thing becomes a mess.
Comment by JustSkyfall 3 hours ago
Comment by CuriouslyC 2 hours ago
Comment by bel8 2 hours ago
I work on mid-sized projects currently (200k to 1kk lines of code).
Comment by Mashimo 3 hours ago
Comment by Alifatisk 1 hour ago
You're probably referring to GLM-4.7
Comment by segmondy 2 hours ago
Comment by Alifatisk 1 hour ago
Comment by jingpostmedia 2 hours ago
Comment by creamyhorror 4 hours ago
Comment by hyqzz8 1 hour ago
Comment by zftnb666 3 hours ago
Comment by Havoc 5 hours ago
Their servers are melting though - getting more timeouts etc
Comment by nh43215rgb 5 hours ago
That is unfortunate...
Comment by lousken 4 hours ago
Comment by Marciplan 4 hours ago
Comment by lousken 4 hours ago
Comment by 0xbadcafebee 1 hour ago
Comment by jayess 1 hour ago
Comment by eckelhesten 4 hours ago
I feel like I threw 15 dollars into the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
Comment by Alifatisk 2 hours ago
I had the Lite plan, I NEVER maxed out the quota because I considered these things. If I, for example, switched over to GLM-5-Turbo, then I could've easily burned through quota.
Comment by granra 3 hours ago
My workflow is usually:
- read file. I want to achieve X, how do? Do not implement anything.
- I would do a, b and c
- sketch a brief implementation of your suggestion
- <code> (not writing files yet)
- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?
- <code>
- nice, implement this
- starts writing files, run tests, etc.
Comment by eckelhesten 2 hours ago
You'll see that it quickly gives up. Thing is, they seem to count cached hits as if they were the non-cached tokens.
I won't be subscribing again, that's for sure. I am not paying iPhone money for a Xiaomi.
Comment by Computer0 2 hours ago
Comment by sourcecodeplz 3 hours ago
Comment by Alifatisk 2 hours ago
Comment by dsrtslnd23 4 hours ago
Comment by Imustaskforhelp 3 hours ago
To everyone on Hacker News: I am curious what agent harness you are using it with.
Previously I was using opencode, and then I switched to Opencode + obra/superpowers and creating custom skill.md files myself for it. I found that things take more time and that I intervene more, but the result is that it works better.
I have now also started using oh-my-pi, and I found it to be faster than Opencode.
I am unsure how much of the difference is real and how much is placebo, but what is your opinion on the best agent harness for GLM 5.2?
Comment by Alifatisk 2 hours ago
Comment by hit8run 4 hours ago
Comment by attogram 2 minutes ago
Comment by kissgyorgy 4 hours ago
Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...
Comment by segmondy 2 hours ago
Comment by ac29 1 hour ago
Comment by zozbot234 1 hour ago
Comment by ComputerGuru 1 hour ago
Comment by Havoc 4 hours ago
Comment by osti 4 hours ago
Comment by Asfand3099 2 hours ago