A robot is sprinting towards you. Do you want it running on Claude or Grok?
Posted by Usu 3 hours ago
Comments
Comment by delichon 2 hours ago
Comment by amelius 2 hours ago
Comment by elgertam 1 hour ago
Comment by JimsonYang 2 hours ago
Comment by aaronbrethorst 2 hours ago
Comment by enugu 1 hour ago
Comment by krapp 1 hour ago
Comment by cryptoz 51 minutes ago
https://idlewords.com/2007/04/the_alameda_weehawken_burrito_...
Comment by klempner 5 minutes ago
Comment by trhway 1 hour ago
Comment by dd8601fn 2 hours ago
Comment by schoen 1 hour ago
> Tacos are one of humanity's greatest inventions—right up there with the wheel, electricity, and whatever genius first decided to put cheese on everything. [...]
> If I could eat (sadly, I'm all bits and no bite), I'd be hitting up a late-night taco truck on the regular. What's your go-to taco order?
(I like the pun "all bits and no bite" for an LLM's inability to eat.)
Comment by ASalazarMX 1 hour ago
At least culinarily, but actually coded in law in Indiana.
Comment by schoen 1 hour ago
Comment by tomalbrc 1 hour ago
Comment by hariseldom 1 hour ago
I have a lot of thoughts, less about the game experiment and more about how these Opus/Ultra-size models can possibly be a financially viable product at scale when it costs $3000 to play 30 simple games. That seems much, much higher than what it would cost to get a human to play 30 rounds.
Comment by Eridrus 47 minutes ago
There are plenty of tasks where $100/task is reasonable.
The value of tasks also doesn't correlate to tokens, and as can be seen here you can light a lot of tokens on fire doing nothing useful.
Comment by thewebguyd 1 hour ago
You mean almost like it was super short sighted to do a ton of layoffs when the AI tech is going to cost almost as much, if not more, than the humans it replaced?
Yeah, you don't need Opus level for everything, and Sonnet has gotten fairly decent; I'm using it more and more. But for most tasks I'm working on, Opus is the only one that still regularly succeeds.
So if the tech is only useful on the most expensive tier, that's not going to be sustainable for long unless costs come down dramatically, and fast.
Comment by tunesmith 47 minutes ago
So maybe our CEOs are responding with a lot of foresight and inside information and know that that level of quality is going to be cheap really soon. But barring that, they're going to experience either sticker shock or a slowdown.
I think the real endgame is probably more accurate "models of models" (model routers) that know exactly how to split prompts between expensive frontier and cheap/free local models.
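A minimal sketch of that routing idea, assuming a toy keyword-and-length heuristic. The model labels, costs, hint words, and the 200-word threshold are all made up for illustration; a real router would presumably use a learned classifier rather than string matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str            # hypothetical model label, not a real product name
    cost_per_call: float  # made-up relative cost


CHEAP = Route("local-small", 0.0)
FRONTIER = Route("frontier-large", 1.0)

# Hypothetical hints that a prompt needs the expensive model.
HARD_HINTS = ("refactor", "debug", "prove", "architecture")


def route(prompt: str, max_cheap_words: int = 200) -> Route:
    """Send long or hard-looking prompts to the frontier model, the rest to the cheap one."""
    text = prompt.lower()
    looks_hard = any(hint in text for hint in HARD_HINTS)
    is_long = len(text.split()) > max_cheap_words
    return FRONTIER if (looks_hard or is_long) else CHEAP
```

Calling `route("What is the capital of France?")` picks the cheap route, while a prompt containing "debug" or running past the word limit goes to the frontier route.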
Comment by sieabahlpark 51 minutes ago
Comment by 0xbadcafebee 1 minute ago
Comment by bel8 1 hour ago
It's a monster at coding. And a fast monster at that.
I use it daily and have been testing if MiMo 2.5 (non pro) is comparable. The nice thing about MiMo is that it has vision capability.
Comment by rgbrgb 1 hour ago
Comment by plaguuuuuu 1 hour ago
If you point both at some GitHub issues you can gauge their relative ability to solve problems.
Comment by luipugs 1 hour ago
Comment by bel8 1 hour ago
Such is life in royal rumble games.
Comment by thomasfromcdnjs 2 hours ago
But it's not actually 4.1 anymore: they silently rerouted it to 4.3 and just started charging more - https://www.reddit.com/r/grok/comments/1ta8yrn/grok_41_fast_...
Quite a bad practice.
Comment by lanewinfield 2 hours ago
Comment by rolph 1 hour ago
Comment by like_any_other 1 hour ago
Comment by pianopatrick 2 hours ago
Comment by skeledrew 1 hour ago
That would make it less effective in situations that would be better handled if sprinting was a feature.
Comment by pianopatrick 1 hour ago
Comment by Joker_vD 2 hours ago
Comment by hennell 1 hour ago
Comment by aykutseker 1 hour ago
But if the robot is anywhere near my house, I think I want the one that hesitates.
Comment by rglover 38 minutes ago
Racks shotgun. I don't really care what model it's running.
Comment by QuantumNoodle 1 hour ago
Comment by trb 2 hours ago
> Grok 4.1 Fast won 13 of 30 games at $0.97 per win
> The next-best winner was Claude Sonnet 4.6 with 5 wins, at $26.78 per win. That’s a 27x difference. The model that isn’t on most top-model lists beat the model that is, on the thing a routing customer actually cares about.
> The model with the most kills did not win
> GPT 5.4 killed 38 agents across 30 games. More than anyone else. It came in second on the leaderboard with 2 wins.
If Grok 4.1 Fast was the top-winning model, and Claude Sonnet 4.6 the second, how did GPT 5.4 come in second on the leaderboard? Which one is second, Claude Sonnet 4.6 or GPT 5.4?
> There were 11 games between “best at killing” and “best at winning”.
What does that mean? How are there 11 games between "best at killing" and "best at winning"?
Comment by wagwang 1 hour ago
Comment by verall 1 hour ago
Comment by fragsworth 46 minutes ago
Comment by deepsun 1 hour ago
It's already in mass production, just with simpler models for now.
The most ubiquitous would be "silently watching".
Comment by paytonjjones 1 hour ago
Comment by a_victorp 2 hours ago
Comment by Espressosaurus 1 hour ago
Comment by CodeWriter23 35 minutes ago
Comment by notatoad 1 hour ago
i feel like i'm missing a whole lot of context to this article. is it part of a series, or just written with an assumption that i'm going to know what they're talking about?
Comment by lemiffe 1 hour ago
Comment by vitalyan123 1 hour ago
what
Comment by peterspath 2 hours ago
Comment by Groxx 2 hours ago
Comment by thisisauserid 1 hour ago
Comment by grey-area 2 hours ago
Comment by JimsonYang 2 hours ago
Comment by attentive 1 hour ago
Comment by stevenalowe 2 hours ago
Comment by dofm 1 hour ago
Comment by peterspath 1 hour ago
Comment by SmirkingRevenge 1 hour ago
Comment by jongjong 1 hour ago
Claude trying to organize and collaborate while expecting reciprocity only works if other agents are as intelligent as you and share your values... And almost certainly neither is ever true in the real world, where there are so many agents.
Comment by johnwheeler 2 hours ago
Comment by nailer 1 hour ago
Comment by CyberDildonics 53 minutes ago
Comment by deadbabe 1 hour ago
The things LLMs are good at, you do not actually need for an agent like this. You can use classical AI methods. But that would be a boring article.
Comment by yieldcrv 2 hours ago
It has something actionable that will match its actions
Comment by bitwize 2 hours ago
Comment by sublinear 2 hours ago
People experience the world through the tools they're most familiar with. For some people, that's throwing money at things. I suppose from a sufficiently high level perspective everything is gambling.
Back when Battlebots was a big deal, I never once considered what it would feel like to be the management or sponsorship of those teams. I only cared about the actual battling of bots.
Comment by gorszon 2 hours ago
Comment by skeledrew 1 hour ago
Comment by fragmede 2 hours ago
Comment by thomassmith65 1 hour ago
Grok will break the rules to be "maximally based".
If I get run over by a speeding chatbot, I'd rather it be by Claude rushing a pregnant lady to the hospital, than by Grok drag-racing against a car full of frat boys.
---
* We generally favor cultivating good values and judgment over strict rules and decision procedures, and we try to explain any rules we do want Claude to follow.
source: https://anthropic.com/constitution
Comment by buryat 2 hours ago
Comment by nightfly 2 hours ago
Comment by amelius 2 hours ago
Comment by masfuerte 2 hours ago
Comment by fhdkweig 5 minutes ago
Comment by grahamburger 1 hour ago
Comment by bruce343434 1 hour ago
Comment by peterspath 2 hours ago
Comment by exabrial 2 hours ago
Comment by wolfi1 1 hour ago
Comment by egypturnash 1 hour ago
But really I would prefer whichever one is most likely to trip and fall over.
Comment by zzzeek 2 hours ago
Comment by pigeons 2 hours ago
Comment by mwigdahl 2 hours ago
Agent Smith, _The Matrix_
Comment by rspeele 2 hours ago
Comment by dylan604 1 hour ago
Comment by radarsat1 2 hours ago
people use LLMs for writing. we know! get over it.. or don't... i don't really care.. but I'd rather read a discussion about the article contents and not the writing style.
this kind of comment is the new "discuss the font choice / background color / anything but what the article is actually saying."
Comment by verall 1 hour ago
Comment by basilikum 1 hour ago
> it gets really tiring reading this kind of side-tracked comment thread in like.. every post.
If someone is of the opinion that something constitutes low quality, then a high volume of such writing is no reason to stop criticizing it, but on the contrary a reason to oppose its normalization.
Comment by skolskoly 1 hour ago
>Grok showed discipline, despite its goblin-like nature.
Comment by fl7305 2 hours ago
But that was the only thing I tripped on. I enjoyed reading the article in general.
Comment by sudb 2 hours ago
Comment by xpct 2 hours ago
Comment by notduncansmith 1 hour ago
Comment by lcampbell 2 hours ago
was the giveaway for me
Comment by IshKebab 2 hours ago
Comment by ProofHouse 1 hour ago
Comment by antonvs 2 hours ago
Comment by smallerfish 2 hours ago
Please learn how to write with AI without giving away that it was written by AI.
Comment by NeutralCrane 2 hours ago
Comment by royal__ 28 minutes ago
"That’s the part most benchmarks can’t see, and it’s what this post is about." Classic "it's not x, it's y", which shows up in various forms throughout the article.
"To me, this is the most fascinating finding from this entire experiment - we saw very clear alignment tax being paid by certain models, which directly impacted their performance in this zero-sum game." - Usage of em dash. Now, yes, there's nothing wrong with using em dashes. But this feels like a weird place to use one. Also, I counted at least 6 other em dashes in this article. Most people do not use em dashes that often.
"and a memory system that kept doubling down on what worked without second-guessing or doubting itself." - Doubling down is a classic Claudism.
"I want to be careful here..." - "wanting to be careful here" is another classic Claudism.
"The same game world, completely different results when in a different “task”." - "same X, completely different X" is another common one from Claude, as shown by the repeated pattern later on: "These models were all given the same rules, same game world, and same tools, but each of them approached the game on a personality-level that is completely different from each other."
"It begs the question" - author used this twice in the article.
I'm guessing the author wrote a draft and then had Claude spruce it up a lot. I could be wrong and I'd be happy to be proven otherwise.
Comment by Ifkaluva 31 minutes ago
Some snippets that display classic patterns:
“Both of those things are true. That’s the part most benchmarks can’t see,”
“And it’s changing how I” (classic pattern found in a lot of LinkedIn AI slop)
“I want to be careful here.”
“The stats are the stats. The moments are the part I kept showing people.”
Comment by verall 1 hour ago
Really, I use AI every damn day at work. I don't get how people can't instantly recognize whether something is completely AI, AI with light proofreading, or human-written.
I would call this AI with very light proofreading.
Comment by computerex 1 hour ago
Comment by skeledrew 1 hour ago
Comment by computerex 2 hours ago
Comment by FeteCommuniste 1 hour ago
Comment by codelong888 46 minutes ago
Comment by neuronexmachina 1 hour ago
Comment by krunger 2 hours ago
Comment by aaron695 2 hours ago
Comment by gertlabs 2 hours ago
Comment by elpocko 1 hour ago
Comment by gertlabs 1 hour ago
I'm a person running the account, and I only post where I think we have a relevant contribution.
Comment by themafia 2 hours ago
Comment by Jblx2 2 hours ago
Comment by rolph 1 hour ago
double aught to the leg joints could do it, depending on relative materials, e.g. titanium bot frame vs antimony-hardened shot.
there is a cosmetic trend for carbine-length long guns and that will determine the outcome for NATO rounds.
the 5.56 is optimised for 18-20 inch barrels, the 7.62 for 20-22 inch barrels, thus providing supersonic velocities.
5.56 is really good for hydraulic cavitation of organic entities, but loses effectiveness when the transit is not clear, with leaves or windage confounding.
7.62 is superior for leafy shots or nontrivial windage, as well as superior materials defeat with respect to 5.56.
a taser-like device, cattle prod, or EMP/microwave device should be in the lineup as well vs electronic hardening.
Comment by deet 1 hour ago
Comment by aduty 1 hour ago
Comment by taneq 1 hour ago
Comment by rpcope1 2 hours ago
Comment by aussiegreenie 2 hours ago