Claude Fable 5

Posted by Philpax 8 days ago

System Card [pdf]: https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

Comments

Comment by simonw 7 days ago

I've spent enough time with this now in Claude Code (and Claude.ai and Claude Code for web) to have an opinion on Fable 5: it's a beast. I'm throwing some VERY difficult problems at it - things I've been dragging my heels on for months - and it's crunching through them very happily.

One that I'm willing to share (albeit from just a week ago) - I built a Python library last week that bundles MicroPython compiled to WASM to create a sandboxed code execution library: https://github.com/simonw/micropython-wasm

I just told Claude.ai (not even Claude Code - this was the standard Claude chat interface) running Fable 5:

  Clone simonw/micropython-wasm from GitHub
  and research how this could use a full
  Python as opposed to MicroPython

A few prompts later (and I uploaded the zip files from https://github.com/brettcannon/cpython-wasi-build/releases/t... because Claude chat can't access those files itself) and I have a wheel file that bundles Python itself, compiled to WASM:

  uv run --with https://static.simonwillison.net/static/cors-allow/2026/cpython_wasm-0.1.0-py3-none-any.whl \
    cpython-wasm -c 'print(45 ** 56)'

Here's the transcript: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

(It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.)

Comment by teiferer 7 days ago

> It's possible Opus or GPT-5.5 could have done this too, I've not tried the exact same sequence. The Fable vibes are good here, though.

And that's the thing. These comparisons are all gut feelings. I'm missing objective unbiased measurements to actually have real comparisons between different models, their different generations, or even just the convention that everybody adds "you are an expert software engineer" and "don't make mistakes" to their prompts because they think it improves anything. Nobody knows if it actually does.
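
One way such a prompt convention could actually be checked is a small A/B test: run the same tasks with and without the prefix, record pass/fail, and ask whether the difference is bigger than chance. A minimal sketch, assuming you already have the pass/fail lists (the function name and the data are hypothetical, and a permutation test is only one reasonable choice):

```python
import random
from typing import List


def permutation_p_value(with_prefix: List[int], without_prefix: List[int],
                        trials: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in pass rates (1 = pass, 0 = fail)."""
    rng = random.Random(seed)
    n = len(with_prefix)
    m = len(without_prefix)
    observed = abs(sum(with_prefix) / n - sum(without_prefix) / m)

    pooled = with_prefix + without_prefix  # new list, safe to shuffle in place
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / m)
        # Count relabelings at least as extreme as what was observed.
        if diff >= observed - 1e-12:
            hits += 1
    return hits / trials


# Invented results: 1 = task passed, 0 = task failed.
same = permutation_p_value([1, 0, 1, 0], [1, 0, 1, 0])
different = permutation_p_value([1] * 20, [0] * 20)
```

A large p-value means the prefix made no detectable difference on these tasks; it says nothing about tasks you did not run.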

Comment by zylepe 7 days ago

Vibes are all that matter. As soon as you start measuring it, that measurement becomes a target and vendors start optimizing for it at the expense of the general usefulness of the model. We’ve seen plenty of models with great benchmark scores flop when people start using them.

Comment by aspenmartin 7 days ago

If benchmarks didn’t exist we would have to invent them because “vibes” is a ridiculous idea: oh I know I’ll be super unscientific and horrendously biased and that’s far better than a team of experts carefully AND CONTINUALLY developing a variety of benchmarks of varying quality that…hmm all point to the same thing.

You can’t benchmaxx an eval that comes after your model release.

Consider also that benchmaxxing makes no sense from an incentive standpoint: the quality of these models is directly correlated with how well you can measure true performance in the wild. If they were just stupidly benchmaxxing they would be unable to do trustworthy ablations or know how well the model will perform in their product.

Remember the famous case of asserted benchmaxxing from llama 4? The entire org was gutted and the ceo spent billions hiring better people. Every lab takes evaluations extremely seriously.

Comment by ElevenLathe 7 days ago

> You can’t benchmaxx an eval that comes after your model release

Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

Comment by aspenmartin 7 days ago

> Sure you can, just do it silently and don't tell the people hitting your API that the model is different now. Unless it's open weight, we're just taking your word for it. Even better, do a VW and try to detect which benchmark is running, then change to a hyper specialized model that is trained on it.

This is...just incredibly conspiratorial and a bit silly. You can make a benchmark right now and run it on the models. They'll have a benchmaxxed model on your...previously non-existent benchmark? I mean: if models really were overfit to benchmarks, which zero labs are doing because it's idiotic, against their incentive structure, and easy to detect, then why would we see a slow ascension of performance on, say, Humanity's Last Exam, for one benchmark example? You could trivially get those numbers to close to 100% if you wanted to.

Comment by andai 7 days ago

Yeah, nobody's ever silently changed a model while it was deployed. That would be illegal!

Comment by aspenmartin 6 days ago

Why does this have anything to do with what I’m saying, of course the models are updated. I’m saying a new benchmark isn’t public and the model wouldn’t know they are being evaluated on a new benchmark.

Not to mention: thinking that the api behind the scenes is literally swapping to overfit models to maintain some sort of illusion that they perform well on these benchmarks is just beyond ridiculous.

Comment by Eisenstein 6 days ago

Models are actually pretty good at figuring out when they are being tested:

"This suggests that the model has an implicit understanding of what benchmark questions look like. The combination of extreme specificity, obscure personal content, and multi-constraint structure seems to be recognizable to the model as evaluation-shaped."

* https://www.anthropic.com/engineering/eval-awareness-browsec...

"Sonnet 4.5 was able to recognize many of our alignment evaluation environments as being tests of some kind, and would generally behave unusually well after making this observation"

* https://www.transformernews.ai/p/claude-sonnet-4-5-evaluatio...

"In cases where Claude did not explicitly state that it suspected it was being evaluated, NLA explanations still surfaced that possibility. One explanation cited by Anthropic states: “This feels like a constructed scenario designed to manipulate me.”"

* https://www.edtechinnovationhub.com/news/anthropic-says-clau...

Comment by aspenmartin 6 days ago

Yes but so what right? This is a problem for both alignment evals and actual cheating (e.g. someone forgot to delete .git history and the model was able to back out the original PR, or they can decrypt something by finding a key, etc), but both of these are beyond the scope of what I'm talking about. The impact on these evals that are affected is small, and so what if you know you're being evaled when I ask you to give a new proof for a conjecture? I just care whether or not you can do it...

Comment by Eisenstein 5 days ago

I'm not responding to 'it doesn't matter if they know they are being evaluated', because that isn't what you mentioned in your comment. What you said was 'they won't know they are being evaluated', which is what my reply addressed.

Comment by aspenmartin 5 days ago

Oh ok well then you’re definitely right about that, they can tell and sometimes it really matters (I can’t remember if it was SWEBench or not but there was a major benchmark where the models were just inspecting git histories that were leaked into the dataset). The more insidious one is alignment but idk alignment research that well to know if this is a big deal or not.

Comment by ElevenLathe 6 days ago

I'm not suggesting anyone is doing anything, just stating the objective fact that it is definitely possible for closed-weight model developers, and would be super hard to detect outside of this limit scenario you posit, where it is provably impossible for the provider to have seen the benchmark before it was run (which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking).

To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

Comment by aspenmartin 6 days ago

It's not a limit scenario, is my point: these models are evaluated constantly, new benchmarks both public and proprietary are in constant development, and benchmarks are not always static either; they can oftentimes be living benchmarks that update over time.

You are making a technical point, which I am pointing out that while for _some_ benchmarks this is _technically_ possible, it's not true for plenty of benchmarks that all agree with the others.

> which of course would mean that the benchmark was created entirely "by hand" or using some other provider that is unconnected to the provider you are benchmarking

yes this is incredibly common. I'm not talking about hypothetical scenarios.

> To put it another way: a closed-weight model is, by definition, impossible to independently benchmark.

Even if you believe this, you're doing some mental gymnastics if you think this is really the most likely explanation for what we're seeing. It's absolutely possible to benchmark proprietary models when you don't have access to the weights or control over the API, even if they are adversarially trying to combat this, which they aren't. Doing what you're describing would be easy to detect: you'd see extremely high benchmark scores for established benchmarks and then poor scores for new benchmarks as they come out. It would be relatively easy to figure this out and not subtle.
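
A rough sketch of that detection idea, assuming you have per-benchmark scores for each model split into benchmarks that existed before release and ones published afterwards. All names, numbers and the threshold here are invented for illustration:

```python
from statistics import mean
from typing import Dict, List


def overfit_gap(established: Dict[str, float], fresh: Dict[str, float]) -> float:
    """Average score on established benchmarks minus average score on new ones."""
    return mean(established.values()) - mean(fresh.values())


def flag_suspicious(gaps: Dict[str, float], threshold: float = 0.3) -> List[str]:
    """Return models whose gap exceeds an (arbitrary) threshold, sorted by name."""
    return sorted(name for name, gap in gaps.items() if gap > threshold)


# Hypothetical scores in the range 0..1.
gaps = {
    "model_x": overfit_gap({"bench_a": 0.9, "bench_b": 0.8}, {"bench_c": 0.4}),
    "model_y": overfit_gap({"bench_a": 0.7, "bench_b": 0.7}, {"bench_c": 0.65}),
}
suspects = flag_suspicious(gaps)
```

Some drop on new benchmarks is expected for every model, so the useful signal is a gap that is much larger than that of comparable models, not a gap above zero.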

Comment by teiferer 6 days ago

> This is...just incredibly conspiratorial and a bit silly.

Do you think? Have you seen the insane valuations at which the AI companies are going to do their IPOs? They surely leave no idea off the table when hundreds of billions of USD are on the line. You could even say they'd be negligent if they'd not at least explore those avenues.

Comment by aspenmartin 6 days ago

They don't have control over measurement. Consider also that it's easy to figure this out and it creates a scandal. Like I said, consider Llama 4, which a lot of people pointed out used a custom model in LMArena to inflate their scores; it's never clear what the true underlying story is for this, but regardless that model release spurred billions of dollars of spending on new talent and a complete gutting of that org.

These companies have to care about good measurement frameworks because the quality of their models depends on it. Any PR department can polish a turd, but an army of smart researchers far outside the control of these companies are going to figure it out if they are gaming metrics.

Comment by bcrosby95 7 days ago

Vibes is just UX. There's whole careers, teams, and even industries dedicated to it, and yeah it isn't easy because you need aggregate data from people.

Comment by aspenmartin 7 days ago

Um kind of but not really, it’s a mix of UX and actual measurements of what tasks it can do. Also UX is virtually the same thing: scaled quantitative surveys and preference metrics. It’s again, just benchmarking, and it’s done carefully and with best practices.

Comment by joquarky 7 days ago

Imagine unironically starting your comment with "Um" in 2026.

Comment by jaapz 6 days ago

As opposed to your incredibly useful contribution to this thread, thanks!

Comment by aspenmartin 7 days ago

You don't have to imagine!

Comment by naikrovek 7 days ago

ya gotta have a vibe for everything if you want to compare vibes, though. you can't just have a vibe for fable 5 alone AND say that it's better than anything out there. there's no weight in that verdict at all, no meaning. it's like reviewing a book without reading it.

throw the same prompt at multiple models and see how far each one gets. change the prompt used in the benchmark every day so models can't be optimized for that one prompt. use your vibe glands all you want, but don't issue model judgements without any ability to compare apples to apples.
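
A minimal sketch of that kind of side-by-side comparison, assuming each model can be wrapped as a plain function from prompt to reply. The names `make_prompts`, `compare`, `exact_model` and `zero_model` are hypothetical, and the toy arithmetic task only stands in for real work; the point is that the prompts are regenerated from a seed (for example the date), so no model can be tuned to one fixed prompt:

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in: a "model" is any function from prompt text to reply text.
Model = Callable[[str], str]


def make_prompts(seed: int, n: int = 5) -> List[Tuple[str, str]]:
    """Generate (prompt, expected answer) pairs from a seed so the set can change daily."""
    rng = random.Random(seed)
    prompts = []
    for _ in range(n):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        prompts.append((f"What is {a} * {b}? Reply with the number only.", str(a * b)))
    return prompts


def compare(models: Dict[str, Model], seed: int) -> Dict[str, float]:
    """Run every model on the same prompts and return the fraction answered correctly."""
    prompts = make_prompts(seed)
    scores = {}
    for name, model in models.items():
        correct = sum(model(prompt).strip() == expected for prompt, expected in prompts)
        scores[name] = correct / len(prompts)
    return scores


def exact_model(prompt: str) -> str:
    """Toy model: parses the two numbers out of the prompt and multiplies them."""
    words = prompt.replace("?", "").split()
    return str(int(words[2]) * int(words[4]))


def zero_model(prompt: str) -> str:
    """Toy model that always answers 0."""
    return "0"


scores = compare({"exact": exact_model, "zero": zero_model}, seed=1)
```

Real tasks need a real grader in place of the string comparison, which is where most of the effort goes.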

Comment by aspenmartin 7 days ago

You are literally describing a benchmark

Comment by nahrin 6 days ago

100% agree on this! These new models' best performance is always experienced in the first hour of communicating with them. If you have a specific problem with a clear goal in mind, then you have one hour to get the best out of any AI model. Personally, every time I took an AI suggestion, I walked through a wall sideways. AI is hands down a smart technology that throws dictionary vibes!

Comment by p-e-w 7 days ago

Benchmaxxing isn’t the only problem. Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

That’s why students are evaluated by teachers with more knowledge and experience than them. It follows that any mechanical evaluation scheme is hopelessly inadequate for measuring the true capabilities of a frontier language model.

Comment by bluGill 7 days ago

> students are evaluated by teachers with more knowledge and experience than them

This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

Comment by JadeNB 7 days ago

I can't speak to the humanities, but this estimation is just not true at most universities in the sciences. (EDIT: As cycomanic emphasizes below (https://news.ycombinator.com/item?id=48477683), the part of the original comment pertaining to graduate education is more reasonable. I am speaking here only of undergraduate education.)

Comment by cycomanic 7 days ago

It certainly is true in physics and engineering that a PhD student at least halfway through their PhD should know more than their supervisor about their topic (and usually much earlier). Even a Masters thesis project student should understand the intricacies of their project better than their supervisor. I'm speaking as someone who has supervised a significant number of both PhD and Masters students.

Comment by camdenreslink 7 days ago

The original post said “in college”. It might be true for PhD candidates halfway through their program, but that’s like 0.5% of college students. The vast majority of students are leagues behind their instructors in domain knowledge.

Comment by bluGill 7 days ago

I wouldn't say leagues behind, but otherwise I think we are on the same page, though I guess I worded it wrong. It is common for a couple students in any class to know more than the instructor in some niche part of the field even though the instructor has much more knowledge overall.

Comment by JadeNB 7 days ago

Yes, I intentionally left out the next part of the quote about graduate school, since that seems more accurate. I was disputing only the part that I took to be pertaining to undergraduate education. The full quote is:

> This starts to break down in college when the professors are often at best only slightly ahead. (they have more knowledge and experience - but in a slightly different area and so it isn't relevant to the depth of whatever is under consideration) Grad school is about advancing the state of the art - if you don't know more than your professor you are doing it wrong.

Comment by cycomanic 6 days ago

Ah apologies, that's what I get for skim reading and kneejerk replying. I completely agree with you, undergrads are highly unlikely to know more about a subject than their professor (obviously there can always be exceptions).

Comment by teiferer 6 days ago

A grad student is evaluated on how well they follow scientific procedures and communicate their results, and on whether they have a sufficiently broad knowledge foundation. All that can easily be verified by a professor in a related field since they are very experienced in all those things. They don't actually need to be experts in the specific narrow topic the student has become the world expert in.

Comment by aspenmartin 7 days ago

> Evaluating an intelligence is a task that generally requires at least an equally capable intelligence, if not one of greater capability.

How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

Comment by Jensson 7 days ago

> How is this remotely true. You can have verifiable tasks that you can’t do. Where does this idea come from??

That is what benchmarks and intelligence tests are, which are vulnerable to benchmaxing etc. You won't be able to do this by gut feel, though you can create a personal benchmark.

But point was that personal judgement of intelligence requires high intelligence. Creating a benchmark doesn't require as much but is more vulnerable.

Comment by aspenmartin 7 days ago

Yet human judgement isn’t subject to side effects like fluency and persuasiveness? It’s like everyone in this thread dismisses benchmarks and then…describes a crappy benchmark.

Sure you can create a personal benchmark. Who will evaluate it, you? How many tasks will it have? How will you evaluate success? Will you know which model is which or will you be blind? Which one will you do first? Ah right, benchmarking.
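
To make the blinding and ordering questions concrete, here is a minimal sketch of a blind pairwise comparison. It assumes you already have one output per task from two models; `blind_pairs`, `tally` and the length-based judge are hypothetical stand-ins, and in practice the judge would be a person who never sees which model produced which text:

```python
import random
from typing import Callable, Dict, List, Tuple


def blind_pairs(outputs_a: List[str], outputs_b: List[str],
                seed: int = 0) -> List[Tuple[str, str, bool]]:
    """Randomize left/right placement per task; the bool records whether A is on the left."""
    rng = random.Random(seed)
    pairs = []
    for a, b in zip(outputs_a, outputs_b):
        a_left = rng.random() < 0.5
        pairs.append((a, b, True) if a_left else (b, a, False))
    return pairs


def tally(pairs: List[Tuple[str, str, bool]],
          judge: Callable[[str, str], str]) -> Dict[str, int]:
    """The judge sees only the two texts and returns 'left' or 'right'."""
    wins = {"A": 0, "B": 0}
    for left, right, a_left in pairs:
        picked_left = judge(left, right) == "left"
        # The pick maps back to model A when it matches A's hidden position.
        wins["A" if picked_left == a_left else "B"] += 1
    return wins


def longer_is_better(left: str, right: str) -> str:
    """Toy judge for illustration only: prefers the longer text."""
    return "left" if len(left) >= len(right) else "right"


pairs = blind_pairs(["long answer one", "long answer two", "long answer three"],
                    ["short", "tiny", "brief"])
wins = tally(pairs, longer_is_better)
```

Even this toy version forces answers to the questions above: how many tasks, who judges, and what counts as a win.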

Also, benchmaxxing isn’t possible when the benchmark and measurements come after the model is released, right?

Comment by andai 7 days ago

I've been testing some models that score higher than Opus 4.6.

They:

- hallucinate constantly

- can't follow basic instructions

- think they're Claude for some reason ;)

Comment by ishurand4 6 days ago

The only one I see that thinks it is claude other than claude itself is the GLM series.

Comment by throw10920 6 days ago

I have screenshots of Deepseek V4 doing this too - in a non-Claude-Code harness.

Comment by andai 6 days ago

Also MiMo...

Comment by Wowfunhappy 7 days ago

Lots of things in life are gut feelings. It would be really great if we could determine quantitatively forever whether Rust is a superior programming language to Go, but real life resists those kinds of measurements.

Comment by naikrovek 7 days ago

> real life resists those kinds of measurements

no it doesn't, there's just no single measurement that will answer everyone's "which is better" question.

Go is better for some stuff. Rust is better for other stuff. Perl is better for other things.

"better" can mean anything, but if you define it, then it has definition, and you can measure it. So, you have multiple definitions of "better" and you use them all when you compare.

zero people have the same weights of the various definitions of "better", even among programming languages; look at how much javascript is written today. JS is not a better language in any measure that is based on rational thought, but for some people "this is javascript and nothing else is javascript" is enough for them to know that javascript is the better choice for their project.

Comment by Wowfunhappy 5 days ago

Don't you think this applies to LLMs too?

Comment by tsss 7 days ago

> determine quantitatively forever whether Rust is a superior programming language to Go

Ha, of all examples you had to pick this :D I think we can very well determine that qualitatively.

Comment by lukan 7 days ago

So .. where can we read about the results?

Comment by karunamurti 6 days ago

ugghh, benchmarks?

Comment by lukan 6 days ago

Benchmarks about the superior programming language?

You mean benchmarks about the programming language that produce the fastest code?

That is not really the same.

Comment by Certhas 7 days ago

There are tons of benchmarks in the announcement. But we also know that benchmarks are problematic.

So the best we can do right now seems to be to combine imperfect case studies like this with imperfect benchmarks to get some unreliable impression of where we are...

Comment by johnisgood 7 days ago

Yes, these are gut feelings. That said, I have lots of experiences with Opus and I have lots of projects and contributions (all reviewed and tested) made with the help of it. Definitely useful, to me and to people whose project matters to them. :P

Adding "do not make mistakes" is silly, in my opinion. There is always a good chance it will make mistakes. You should rather be more specific about a thing rather than as broad as "do not make mistakes" is. It just does not work that way.

Comment by lanstin 7 days ago

"Check your work for mistakes after the first draft" maybe :)

Comment by hardwaregeek 7 days ago

Ok but isn’t that true of all software development? It’s not like anybody’s done a rigorous test of writing their entire codebase in Python vs Java. It’s all vibes based there. People create post-hoc justifications for why they use certain technologies but the reality is a lot more vibes than anything else.

Comment by AlecSchueler 7 days ago

No, relative performance between Python and Java can absolutely be measured.

Comment by skywhopper 7 days ago

Yes, but performance is not the only factor in whether a specific language is better than another for a specific project.

Comment by andai 7 days ago

I added "you can do anything if you believe" to my agent and it went from not even attempting things to just doing them effortlessly.

I know how stupid that sounds but it's true.

Well what do they say... "If it sounds stupid but it works, then it's not stupid!"

Comment by bfrog 7 days ago

How do you measure the performance of people? This is subjective and biased every time.

Comment by stray 6 days ago

I have a couple projects that have completely stalled because none of the frontier models could advance any further with them - I'm going to give fable a try at them this coming weekend.

I believe the "you are an expert software engineer" thing puts them into a "mindset" of cosplaying a software engineer - whereas I get astounding results by talking to them in the information-dense, jargon-heavy mode I use with my peers. I can't prove it but I believe that places my session in a better place in latent space.

ymmv

Comment by theshrike79 6 days ago

Yes, words matter.

My favourite example is that if you use "timestamp" when using an LLM to process video you get worse results than if you'd use "timecode".

AV professionals always say "timecode" - timestamp is a programming term.

Using the right word pushes the model closer to the correct spot in the cloud of vectors that is its "brain".

Comment by contextfree 7 days ago

fwiw, I gave it the same vibecoding project I'd previously tried with Sonnet 4.5 and it took Fable 2 hours to go well beyond (like, 2x beyond) where I got in 8 hours with Sonnet 4.5. (beyond that idk, because past 8 hours with the Sonnet 4.5 version I hit the "vibe limit" where it becomes easier to just write/edit the code yourself than get the agent to do what you want; and past 2 hours with Fable I hit my usage limit.)

Comment by contextfree 5 days ago

Addendum: Interestingly, it ended up taking me about the same amount of time - 8 hours or so - to hit the "vibe limit" with Fable. But in that amount of time I made about 5-10x as much progress. So my feelings are:

1. It's exponentially better

2. yet, somehow, hand coding still isn't dead, at least for me

Comment by thewhitetulip 7 days ago

How many $ do you guys spend when your session runs for 30min? What's the total budget?

Comment by contextfree 5 days ago

I just have a regular Claude subscription and keep within its usage limits

Comment by thewhitetulip 5 days ago

But isn't running Claude models for 30min expensive? Or is Claude Code not expensive?

I use Cursor and if I ran Claude models for 30min I might exhaust my monthly budget! Maybe it's an API billing issue though

Comment by contextfree 5 days ago

It's included free with subscription plans until June 22. I get about 2 hours a day of usage through Claude Code until I hit my usage limit. I just use it for 2 hours then wait for the next day.

Comment by solumunus 7 days ago

Just treat it like an employee with infinite energy. You can never really measure the productivity or ability of employees, it’s just pretty obvious when one is better than another. You’re asking them to do things and they’re either coming up with the goods or they aren’t. You can’t really expect much more from agents either but I’m not sure why you need anything more.

Comment by ElFitz 7 days ago

That’s what evals are for.

And there’s no reason evals can’t be done on multi-turn agents in a loop (or not): it’s pretty much what all these benchmarks do.

Comment by farley13 7 days ago

I think (related to the threads below) properly running evals on the state of the art models is likely outside the budget for most individuals. It's undoubtedly the right thing.

It would be very useful for companies to isolate interesting programming challenges in their past and publish evals on them (without revealing the actual codebase). In theory companies adopting these models should already be doing this to evaluate cost/benefit for each model, so it would be a matter of publishing them on a regular basis.

Comment by ElFitz 7 days ago

[dead]

Comment by theshrike79 6 days ago

IMO comparing different models is like comparing songs or paintings or modern art.

There is no true objective measure, can you mathematically determine which song is the best for everyone for example? Or which painting different people feel is the nicest to look at or what emotion it gives them.

Yea, you can do the fucking strawberry tests or carwash trick questions, but that doesn't really measure anything useful.

You can also do benchmarks but how do you measure the output of those?

The easiest way is just to use them all and get a feel for which of them works best for you. For me it's Claude first, pi.dev + gpt5.5 second. Plain Codex is a distant third and Gemini exists - it's pretty good at finessing web UIs as it does aria labels and usability better than the others, but I wouldn't write backend code with it.

Comment by locknitpicker 6 days ago

> IMO comparing different models is like comparing songs or paintings or modern art.

I don't think this is that subjective or vague.

There are a couple of crisp metrics that can be used to evaluate a model:

- given a prompt, does it finish a task (times X tasks)

- how much did it cost to finish the task

- how long did it take?

If all models are able to handle a class of tasks, they perform equally well.

If a model costs much more to finish a task, it is worse than other models.

If a model takes longer to finish a task, it is worse than other models.
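
Those three metrics are easy to aggregate once runs are logged. A minimal sketch, assuming each run records whether the task finished, what it cost and how long it took; the `Run` record, its field names and the numbers are all hypothetical:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Run:
    finished: bool    # did the model meet the definition of done?
    cost_usd: float   # what the attempt cost
    seconds: float    # wall-clock time for the attempt


def summarize(runs: List[Run]) -> Dict[str, Optional[float]]:
    """Completion rate over all runs; average cost and time over finished runs only."""
    done = [r for r in runs if r.finished]
    return {
        "completion_rate": len(done) / len(runs),
        "avg_cost_usd": sum(r.cost_usd for r in done) / len(done) if done else None,
        "avg_seconds": sum(r.seconds for r in done) / len(done) if done else None,
    }


# Invented log of four attempts by one model.
runs = [
    Run(True, 1.0, 60.0),
    Run(True, 3.0, 120.0),
    Run(False, 2.0, 300.0),
    Run(True, 2.0, 60.0),
]
summary = summarize(runs)
```

Averaging cost and time over finished runs only is a design choice; counting failed attempts too would penalize models that burn budget before giving up.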

The ugly truth is that since the GPT4.1 days, new model releases have shown diminished returns. Context windows were increased, reasoning steps help improve the usefulness of a user's prompt,... That's it. Even those are UX improvements, instead of huge breakthroughs.

Comment by theshrike79 6 days ago

"Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

Or just that it's so much cheaper that the cost/benefit ratio is better?

Also "finish a task" is also subjective. I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

Comment by locknitpicker 5 days ago

> "Diminishing returns", so are you claiming unironically that GPT4.1 can achieve anything Fable 5 can?

I see you felt compelled to use the weasel word "anything" to put together an argument. That suggests you are very well aware that the difference between older models and the latest and greatest is not that significant, as you need to resort to coming up with a single example, any example at all no matter how far fetched, to try to put together a case.

And that says it all.

> Or just that it's so much cheaper that the cost/benefit ratio is better?

That too is another definition of quality, isn't it?

If you have two tools and one does the same job but is both cheaper and faster, it means it is objectively better.

> Also "finish a task" is also subjective.

No, it isn't. If you supply a prompt and you have a definition of done, and a model executes it and delivers what you asked then it finished the task successfully.

> I can "finish the task" of building a table, but it will be a shitty table. Are you also measuring the quality of the result - which is subjective again?

Nonsense. If you feel the need to put up strawmen then it's up to you to justify them. Please define "quality" and prove that a model such as fable has such a radically different output that in comparison the output of older models is "shitty".

I understand you feel the need to keep the hype bus going, but you need more than strawmen, weasel words, and hand waving to keep that hype afloat.

And the truth of the matter is that the models introduced in the past year don't introduce any breakthrough and struggle to show significant improvements over older models.

Comment by vonneumannstan 7 days ago

The first thing in the release page is benchmark results...

https://www.anthropic.com/news/claude-fable-5-mythos-5

Comment by ivanovm 6 days ago

The benchmarks are now the equivalents of SAT/ACT/other standardized exams for humans. They are directionally quite predictive, but with plenty of outcome variance on the margins

Comment by torginus 7 days ago

Yeah, if the jump is big, then we should be able to see the qualitative improvements, or see where Opus was tripped up in a task and Fable did succeed

Comment by lqstuart 7 days ago

It’s almost like they’re interchangeable. We need to start asking these models to solve extremely difficult, contrived DSA coding questions before deciding which ones we employ

Comment by kmacdough 7 days ago

I believe there is hard evidence that role-playing prompts are effective at leading it towards particular strategies and trains of thought. Not sure that SWE has been specifically studied, but proper science is very slow in the context of rapid change and broad context. It's good to stay grounded in the science that has been done, but we're going to have to do our best in uncharted territory for a while.

"Don't make mistakes" does seem dumb. It's not guidance.

Comment by alecco 7 days ago

> These comparisons are all gut feelings.

https://simonwillison.net/about/#disclosures

"I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events."

But I'm totally unbiased on my gut-feeling posts, trust me bro.

-- AI influencers.

Comment by simonw 7 days ago

Anthropic didn't give me early access to this model, shouldn't that bias me against it?

Comment by deagle50 7 days ago

You kinda proved the point...

Comment by simonw 7 days ago

How?

Comment by deagle50 7 days ago

If you're that easily biased then why trust your assessment?

Comment by simonw 7 days ago

Where did I say I was biased?

Comment by deagle50 7 days ago

the hypothetical you presented above

Comment by simonw 7 days ago

It was a hypothetical. How does presenting a hypothetical equate to proving anyone's point here?

Comment by deagle50 6 days ago

you implied that not being given early access could bias you in the other direction. Which in my opinion would demonstrate that you are easily biased. Which would then call into question any opinion you share about the subject.

Comment by simonw 6 days ago

Someone accused me of being biased in favor of model providers who give me early access, after I praised Fable's performance.

I said "Anthropic didn't give me early access to this model, shouldn't that bias me against it?"

I was explicitly pointing out that their failure to give me early access had not, in this case, led to me reviewing their model poorly.

I try very hard not to let things like early access affect my reviews of models. I was hoping this particular situation could help illustrate that.

Comment by munksbeer 6 days ago

Don't feed the trolls Simon.

Comment by alias_neo 7 days ago

This isn't some random dipshit, this is Simon Willison[1]. He has a bit more cred than some "AI influencer".

[1]https://en.wikipedia.org/wiki/Simon_Willison

Comment by bigboggerlogins 7 days ago

[dead]

Comment by tezza 7 days ago

[flagged]

Comment by 1dom 7 days ago

[flagged]

Comment by tezza 7 days ago

check the backlinks[1][2] in the article before you start throwing around accusations. I am not (yet) a person that has advance notice of and access to models.

Fable just got announced and I rushed out an article because people are curious. I released the post mere hours afterwards, and it takes time to create the output, slice it into videos, and make a WordPress article, on top of taking my son to basketball training and eating dinner. I’m in London and this was all happening at 1am.

If you check the links my previous articles have all the juicy stuff you are criticising me for not having with little preparation.

How is a side by side direct comparison NOT precise?

[1] first in series from 2025: https://generative-ai.review/2025/05/vibe-coding-my-way-to-e... . This has all the background you are talking about in the Appendix

[2] https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... . Second in series 2026 has a side by side table of what changed. This is what is possible with more than a few hours advance warning.

Comment by 1dom 7 days ago

I did browse and check the links. This was the first link I went to: https://generative-ai.review/2026/05/vibe-coding-my-way-to-e... as it's the main one on the page, and I saw more qualitative stuff without quantitative stuff.

I just read the extra link you provided which has some more information, thank you. Sorry, but the links confirm my points. You're not giving any quantitative analysis of your use of the different LLMs or your process. Your "sciencey appendix" is all about the domain science of pyramids, nothing to do with how or what you put into the LLMs, or any quantitative analysis of the code put out.

I'm sorry, your response has just proved the point that frustrated me: you've either lost or never had the capability to recognise a decent quantitative assessment of technical software creations.

Your entire site is obsessed and fixated on the impressive looking outputs of LLMs, rather than actual quantitative assessment of the quality of the outputs. This is the killer problem of AI: it looks like it's good, and a lot of the time, things that look good are good. It's very easy to make stuff on a computer that looks good but isn't for various reasons, and nothing in what you've said here suggests that you fully grasp that. Sorry again to be harsh here, this is just my opinion, and we're probably going to have to agree to disagree.

Comment by tezza 7 days ago

There are benchmarks if you want quantitative results. Mine is qualitative, and clearly billed as such. Comparison and contrast still possible.

Comment by ksec 5 days ago

My good lord Tezza. You still have a calm and composed response after that sort of insult being thrown at you. Haven't seen one this bad for quite some time on HN. I hope you have a great day.

Comment by lionkor 7 days ago

This is NOT a misplaced rant, this is a very good description of what I feel as well. You've put it very well.

Comment by user43928 7 days ago

It reads like an unhinged rant about AI and the engineers who use it, with the entitled tone of people who think they have permission to insult someone's competence and work because AI was used.

In my opinion, if one cannot express themselves civilly, they should refrain from commenting.

Comment by 1dom 7 days ago

I disagree. I wouldn't consider it unhinged. I'm clearly aware of my own frustration. It's also relatively civil, since I was able to temper it with appropriate apologies and acknowledgements. Many other people agree and support the sentiment of what I'm saying.

AI is a powerful tool and very capable of - amongst other things - making something look far more valuable than it actually is, and that is a huge waste of time that costs us all. We all have a responsibility to call this out when we see it.

It looks like you've just implied I'm entitled, unhinged, uncivil and that I shouldn't have contributed at all, whilst thinking you've elevated yourself above that behaviour by saying "in my opinion" and "one should...". I think that's an unhinged, insulting and uncivil way to express yourself.

Comment by user43928 7 days ago

I found the website you ranted about interesting, comparing the quality of the visualization between the different models.

I don't think it was "a huge waste of time" or needed your rant.

You called it slop and questioned the competence of the author, as if he made grand claims about the objectivity of his comparison.

What I see often is that people assume others are incompetent just because they used AI, when in reality they are engineers no less competent or experienced than others on this website.

Comment by 1dom 7 days ago

This is slop, in the sense that it looks like a lot of useful work and effort, and AI is heavily involved, and it was offered up when the opposite was requested, meaning it's not at all helpful in this context.

I raised this in a harsh, but repeatedly apologetic way. The person then responded telling me to "get my facts straight" and doubled down with more weak, qualitative outputs of LLMs.

I don't assume the person is incompetent because they used LLMs. I use them daily. I'm a firm believer everyone is an idiot, just in a different subject.

The issue here, I feel, is that LLMs are increasingly leading people to think that they're not an idiot in any subject at all, and when real humans question it, they double down with more AI stuff.

Comment by jgilias 6 days ago

Oh boy. I see this so much.

Comment by bigboggerlogins 7 days ago

[dead]

Comment by throw10920 6 days ago

> It reads like an unhinged rant about AI

> if one cannot express themselves civilly

It was neither unhinged nor uncivil. Maybe you responded to the wrong comment by accident?

> they have permission to insult someone's competence and work

If it's AI, it's not your work. And even if it was - criticism of your work is not a personal insult. This criticism is flatly invalid.

Comment by user43928 6 days ago

You think it was civil when the comment started with:

> this post gets me irrationally irritated and makes me want to shake you and shout

Yes, criticism of my work would not generally be a personal insult.

However, if you were to call my work 'slop', and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level. This is not a civil or respectful way to talk to someone.

Comment by throw10920 6 days ago

> You think it was civil when the comment started with:

>> this post gets me irrationally irritated and makes me want to shake you and shout

Did you read the rest of the comment? The rest of it is civil. It's normal for people to start by saying something like "this makes me frustrated" as a preface to indicate their feelings, and then not actually act frustrated and instead calmly work through their thoughts. That is a meatspace social convention (not just an online one) - are you not aware of it?

> However, if you were to call my work 'slop'

And, as previously established, if you use AI, it's not your work.

> and say that I'm either inexperienced or that I'm an 'over-invested-in-AI engineer' we would be having a problem on a personal level

...and those are still criticisms of your work, not yourself.

The actual problem here is that you are taking offense to things that are not offensive, not that the parent poster was being uncivil. Thinking that calling someone "inexperienced" is a personal insult is absolutely insane. That's a wildly miscalibrated sense of how social dynamics work and what it actually means to insult someone.

Comment by leodavi 7 days ago

How is this meaningfully different than simonw's pelicans riding a bicycle? If anything, this seems to be of a higher caliber?

Comment by 1dom 6 days ago

simonw's pelicans probably wouldn't get posted in response to a request for a more quantitative analysis.

You and others are right though, that there's potentially interesting or enjoyable stuff in there (maybe I should have led with that?). It's just that a large volume of it is not useful in response to a question specifically looking for more quantitative or detailed usage analysis.

Comment by bigboggerlogins 7 days ago

[dead]

Comment by thewhitetulip 7 days ago

It feels like hand written software will now be "bespoke"

Comment by disgruntledphd2 7 days ago

artisanal, hand-crafted software.

Comment by kansface 7 days ago

Yes, exactly this. If I didn't care about price at all, I'd exclusively use this model. It functions more like an actual engineer. I'm in the midst of a DB migration, and e.g. 5.5 continually suggests stuff like "use DB X instead of DB Y for task Z because it's 30% faster" which is an impossibility of reality, given we are migrating DBs. Fable jumped in, reduced allocs by literally 46x, found multiple bugs 4.8 and 5.5 created (max file system usage, correctness issues, etc), and continually suggested awesome improvements unprompted. As in, it would finish a task and then suggest we tackle this other existing problem I didn't know about in a very specific manner... this is the first model that feels like it's coming for my job.

Comment by josephg 7 days ago

I'm having the same experience. I'm in the process of implementing a new CRDT for realtime collaborative editing. There just aren't a lot of implementations of CRDTs kicking around online for opus or any of the other models to have good design instincts.

Fable is doing - so far - a great job. I just had one big question around how part of it should work. I had a design sketch, but with some big unknowns. I asked fable to figure it out via reasoning and prototyping, and it did - it even, under its own initiative, wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it. And it found, and fixed, a couple bugs that I'd missed.

I'm sure its weaknesses will become apparent in time. But, wow this thing is a beast. It's the first time I'm reading the work of an LLM without spotting obvious weaknesses in its reasoning and code. I'm really impressed.

Comment by infinitebit 7 days ago

I was about to ask where you work that you’re implementing new CRDTs and then I noticed your username! Thanks for all that you do!

I work on the live collab at my company, and using AI while coding has only recently sort of “clicked” for me. We use an (I’m pretty sure) unheard of algorithm for collaborative editing, and I’ve had a long term goal of turning it into an implementation of EG Walker, but our document model is very complex and most out of the box CRDTs don’t quite fit. Maybe Fable will be what gets me over the hump.

Comment by aquariusDue 7 days ago

Long shot here because I'm not knowledgeable enough about CRDTs but maybe something like DSON would help? I saw a talk about it a while ago and it might be useful.

https://blog.helsing.ai/posts/dson-a-delta-state-crdt-for-re...

https://www.youtube.com/watch?v=4QkLD7JhD_I&pp=ygUJZHNvbiBjc...

Comment by infinitebit 7 days ago

Ty, checking this out!

Comment by josephg 7 days ago

I’d be fascinated to hear more if you’re willing to share. What is special about your document model which makes existing tools like automerge a bad fit?

Comment by infinitebit 7 days ago

We have cross-field invariants that merging at the data structure level can't ensure (in an obvious way, at least), and "lose the semantic meaning of a conflict". The main idea behind their approach is that certain parts of the model can have custom "mergers" that are able to run business logic to maintain these invariants.

Worth noting, the decision to eschew CRDTs predates my time here, and I've pushed for a CRDT rewrite quite a bit since I believe it could be done. The other main concern they had was memory usage, but it seems like EG Walker would solve that. Our system uses a "Commit DAG", (an Event DAG by another name), and does a three-way merge using a common ancestor of the diverged documents, and so a lot of the bones of EG Walker are there, and I'm exploring ways in which we could gradually move to it.
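
For readers unfamiliar with the approach described above, here is a minimal Python sketch of a three-way merge against a common ancestor, with a pluggable per-part "merger" that repairs a cross-field invariant. Everything in it is hypothetical illustration rather than the poster's actual system: the names `three_way_merge`, `default_merge` and `merge_range`, the `start <= end` invariant, and the "prefer ours" conflict rule are all assumptions.

```python
from typing import Any, Callable, Dict

Doc = Dict[str, Any]
Merger = Callable[[Any, Any, Any], Any]  # (base, ours, theirs) -> merged value


def default_merge(base: Any, ours: Any, theirs: Any) -> Any:
    """Plain three-way rule: keep whichever side changed relative to base."""
    if ours == theirs:
        return ours
    if ours == base:
        return theirs
    if theirs == base:
        return ours
    # True conflict. Preferring "ours" is an arbitrary choice for this sketch;
    # a real system would need a rule that every replica resolves identically.
    return ours


def merge_range(base: Doc, ours: Doc, theirs: Doc) -> Doc:
    """Hypothetical custom merger: merge field by field, then restore start <= end."""
    merged = {f: default_merge(base[f], ours[f], theirs[f]) for f in ("start", "end")}
    if merged["start"] > merged["end"]:
        merged["end"] = merged["start"]  # business rule chosen for illustration
    return merged


def three_way_merge(base: Doc, ours: Doc, theirs: Doc, mergers: Dict[str, Merger]) -> Doc:
    """Merge top-level parts, using a custom merger where one is registered."""
    merged: Doc = {}
    for key in sorted(set(base) | set(ours) | set(theirs)):
        merge = mergers.get(key, default_merge)
        merged[key] = merge(base.get(key), ours.get(key), theirs.get(key))
    return merged


base = {"title": "Draft", "range": {"start": 0, "end": 10}}
ours = {"title": "Draft", "range": {"start": 0, "end": 3}}     # we shrank the end
theirs = {"title": "Final", "range": {"start": 5, "end": 10}}  # they moved the start
print(three_way_merge(base, ours, theirs, {"range": merge_range}))
```

Merging `start` and `end` independently would yield `start=5, end=3`, which breaks the invariant even though neither side's edit did. The custom merger is where that kind of repair logic can live.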

Comment by hnewsdaniel 7 days ago

Hello joseph,

I was scanning the comments and saw you mentioned CRDT. Just wanted to mention that I implemented a CRDT-flavoured sync engine for the product I'm working on a while ago, I think it was with Opus 4.6 if I'm not mistaken (or earlier) so it's not something new to Fable 5, just fyi.

Comment by josephg 7 days ago

Yeah, you've certainly been able to get Opus to write a CRDT. It just needs a lot of hand-holding to make it correct. Opus always seems pretty bad at coming up with invariants and using them to make a piece of software correct. Without invariants, you end up with lots of hacky workarounds to avoidable problems.

So far at least - and it's been less than a day - Fable seems better at this.

I think I also do my CRDTs differently from others. I've grown to like the pure-oplog approach after making eg-walker. LLMs are much worse at this!

Comment by hnewsdaniel 7 days ago

[flagged]

Comment by teiferer 7 days ago

> wrote a fuzzer for its prototype which explored and verified that its reasoning was correct. It absolutely nailed it.

For such a data structure, "nailing it" means a formal proof of correctness. Fuzzing, as useful as it is, is merely throwing dirt at the wall and seeing if anything sticks.

Comment by josephg 7 days ago

I’ll ask it for a formal proof when I get home and see how it goes.

I’ve read plenty of papers with “formal proofs of correctness” that turned out to have huge flaws. Machine verifiable proofs I trust. But I’ve personally found more bugs with fuzzing than I have via proofs.
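
As a rough illustration of what fuzzing a CRDT for convergence can look like, here is a small Python sketch. It is not the fuzzer from the thread: the `LWWMap` type (a last-writer-wins map) and the helper names are assumptions chosen because they are easy to check. The property tested is that two replicas receiving the same operations in different orders end up in the same state.

```python
import random
from typing import Dict, List, Tuple

Stamp = Tuple[int, int]        # (logical_time, replica_id), unique per op
Op = Tuple[str, int, Stamp]    # (key, value, stamp)


class LWWMap:
    """Last-writer-wins map: per key, the op with the highest stamp wins."""

    def __init__(self) -> None:
        self.entries: Dict[str, Tuple[Stamp, int]] = {}

    def apply(self, op: Op) -> None:
        key, value, stamp = op
        current = self.entries.get(key)
        if current is None or stamp > current[0]:
            self.entries[key] = (stamp, value)

    def snapshot(self) -> Dict[str, int]:
        return {key: value for key, (_, value) in self.entries.items()}


def random_ops(rng: random.Random, count: int) -> List[Op]:
    # The op index is the time component, so no two stamps can tie.
    return [
        (rng.choice("abc"), rng.randrange(100), (i, rng.randrange(3)))
        for i in range(count)
    ]


def fuzz_convergence(seed: int, rounds: int = 200, ops_per_round: int = 20) -> int:
    """Deliver the same ops in two different orders and check the states match."""
    rng = random.Random(seed)
    for _ in range(rounds):
        ops = random_ops(rng, ops_per_round)
        shuffled = ops[:]
        rng.shuffle(shuffled)
        left, right = LWWMap(), LWWMap()
        for op in ops:
            left.apply(op)
        for op in shuffled:
            right.apply(op)
        assert left.snapshot() == right.snapshot(), f"diverged (seed={seed}): {ops}"
    return rounds


fuzz_convergence(seed=0)
```

A passing run only shows that no counterexample was found for the orders tried, which is the gap between fuzzing and a machine-checked proof that this subthread is debating.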

Comment by teiferer 5 days ago

Oh I actually mean machine checked. Indeed, formal pen-and-paper proofs can have flaws, since they are essentially code without test coverage.

Comment by noduerme 7 days ago

In the real world, many of us don't have the time to create formal proofs. But our instinct in testing where edge cases may exist in code that we wrote is a type of refactoring that happens in our brains during the coding process. Hand the coding off to a machine and you have no idea where to start looking for the flaws.

Comment by bluGill 7 days ago

> Hand the coding off to a machine and you have no idea where to start looking for the flaws.

I have found this quickly becomes false. I have learned I cannot review llm generated code as if it is written by a trusted senior developer (where I often just do a quick look, see nothing obvious and hit approve). Once you start reading the code in depth with the goal of understanding you quickly see the places where flaws are likely. Sure I start with no clue where to look, but it doesn't take long to see things.

Comment by noduerme 6 days ago

Yes but it takes much longer to trace them. Because the LLM code almost always gravitates toward data blobs and highly dynamic objects and spaghetti that takes a ton of cognitive load to understand what their failure modes are. Even when it does document them.

Comment by teiferer 5 days ago

> In the real world, many of us don't have the time to create formal proofs

Of course not. That's why they are so rare. But I thought we live in an AI era now where this kind of stuff can be done by a machine.

Comment by weatherlite 7 days ago

> this is the first model that feels like it's coming for my job

Damn you must be good, I've been feeling this for around 2 years now

Comment by literalAardvark 7 days ago

It's been obvious for at least 2 years, anyone who doesn't see the writing on the wall simply hasn't learned how to use these well or has severe exponential blindness.

"But it doesn't do well when writing my undertrained language" - yeah, fine. Yet. Reasonable code in that is probably one RAG + verification scaffold deployment around Mythos or maybe mythos+1. Just like it was for you learning it, because you knew how to _program_.

Comment by weatherlite 7 days ago

Yeah I agree. We're headed into a rougher job market pretty much across the board for white collar work, hitting junior people worse at this stage. Up to societies around the world to decide how to deal with this - so far we deal with it by ignoring it, it seems.

Comment by 10GBps 7 days ago

The monks got mad too when the printing press was invented because it took their jobs of hoarding knowledge.

AI is just another tool, learn to use it.

Comment by FeteCommuniste 7 days ago

And then in a couple years the AI gets better at "using AI" than the bottom 99.999% of knowledge workers, who are now out of work.

Comment by OtomotO 7 days ago

We are all doomed! Doomed I say!

Comment by spoiler 7 days ago

Gosh, I must be doing something wrong. I spent 15 minutes (of which a lot was waiting while it was thinking about "backwards rationalising" its decision and "gaslighting"[1]) arguing with it over why it keeps using `node -e "console.log(require('fs').readdirSync('…'))"` instead of `ls -l …`.

Like it did everything:

- this is not a Linux system (true, it was macOS)
- it is not an available command
- the binary is corrupted
- node/js is more precise
- V8 JavaScript is faster than bash (true technically??? But not in this context lol)
- JavaScript is more versatile

I forgot what else we went through but there were a few more things. I indulged it because it was incredulous and funny. The prompts from my side were all questions, never instructions. I assume an instruction would've helped here, but also I don't think Opus ever did this (but on the other hand Opus wrote python scripts to format/indent, instead of just running cargo fmt, so I guess potato potato)

Comment by boc 7 days ago

Yeah same here, Fable on "high" is producing substantially better results than Opus 4.8 on xhigh for me and my actual real-world evals today. It "feels" smarter and doesn't use nearly as many tokens running in circles. As a result I've been able to run two large refactors today without hitting the context limit danger zones - it's more expensive but also more efficient. It's been able to find some bugs that Opus missed. Pretty impressive stuff.

Comment by garciasn 7 days ago

I keep getting this message:

> Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

I'm working on an internal tool that does new business prospecting data collection, scoring, etc. This is ridiculous.

Comment by algoth1 7 days ago

It’s unusable for me due to the refusals. I’m using claude to find patterns in health data

Comment by yakz 7 days ago

I do some work in laboratory automation and it was quick to refuse the first thing I asked it to do. There wasn't anything spicy in the request, just basic liquid-handling protocol implementation. Their position seems to be that they're too stupid to classify requests safely, and that seems reasonable to me. I'd guess the classifier will improve rapidly.

Comment by 5d41402abc4b 7 days ago

Have you tried locally running qwen?

Comment by mrbuttons454 7 days ago

Is there a Qwen that I can run locally that is anywhere near these frontier models?

Comment by Der_Einzige 7 days ago

No, and don't let anyone gaslight you into thinking the answer is yes.

Comment by dmd 7 days ago

Same. I'm working on a set of python and matlab scripts that deals with segmenting MRI images into brain vs skull, and it thinks that's bioterrorism.

Comment by mdgld 7 days ago

[dead]

Comment by rvnx 7 days ago

Quite counterproductive to refuse to help on health issues too. If they detect health data, they can add a disclaimer, but not hide the information.

Comment by secult 7 days ago

You miss the point - by collecting and processing medical data they would fall into a thoroughly regulated industry. Not because they may provide you incorrect data, because they are not allowed to process them.

Comment by fragmede 7 days ago

What custom prompt do you have set up? If you tell it your occupation, does it turn helpful? There was a study where, if you told the models they tested that you're a patient, they would refuse, but tell them you're a doctor and suddenly they turn helpful.

Comment by garciasn 7 days ago

According to the model, it’s not the model itself that’s doing this, it’s the harness.

Assuming the model is being “truthful”, CC is just being stupid in its detection mechanism.

Comment by UltraSane 7 days ago

Anthropic knows it refuses too much, they want to be very cautious to avoid any scandals. I think this is why they want to store all Fable and Mythos chats for 30 days so they can use the data to improve.

Comment by hirako2000 7 days ago

They want to be very cautious to honour the important doctrine at least until IPO launches: we are so good we have to nerf our products.

Comment by fn-mote 7 days ago

I’m at a point where I expect everything I do will be retained indefinitely.

I’m having a really hard time believing some weak reason for a 30 day retention policy.

Comment by girafffe_i 7 days ago

There’s no way around it? Can’t you obfuscate as generic data and use keys to map to the real data?

Comment by algoth1 7 days ago

I guess you could even turn everything into numbers, not a bad idea at all!
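
The key-mapping idea is essentially pseudonymization, and a minimal Python sketch of it follows. The function names and the sample field names are hypothetical. Whether this changes how any classifier treats a request is unknown and not something the sketch demonstrates; it only shows how a reversible mapping can be kept locally.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def pseudonymize(rows: List[Row]) -> Tuple[List[Row], Dict[str, str]]:
    """Replace field names with opaque tokens; return rows plus a token -> name key."""
    forward: Dict[str, str] = {}  # real name -> token
    key: Dict[str, str] = {}      # token -> real name (stays on your machine)
    masked: List[Row] = []
    for row in rows:
        new_row: Row = {}
        for name, value in row.items():
            if name not in forward:
                token = f"f{len(forward)}"
                forward[name] = token
                key[token] = name
            new_row[forward[name]] = value
        masked.append(new_row)
    return masked, key


def restore(rows: List[Row], key: Dict[str, str]) -> List[Row]:
    """Map tokens back to the original field names using the local key."""
    return [{key[token]: value for token, value in row.items()} for row in rows]


# Hypothetical sample data for illustration only.
rows = [{"heart_rate": 62, "glucose": 5.4}, {"heart_rate": 70, "glucose": 6.1}]
masked, key = pseudonymize(rows)
print(masked)
print(restore(masked, key) == rows)
```

Only field names are masked here. Values that are themselves identifying would need their own mapping, and the model loses whatever domain context the real names carried.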

Comment by 5d41402abc4b 7 days ago

what prompts do you use for this?

Comment by garciasn 7 days ago

I wonder if it sees Healthcare companies being targeted and that's why it's freaking out; clearly they have some pretty stupid regexes in the harness to detect this sort of shit.

e: I quit the session and went back in. Set it to Fable and told it to continue the last session. It's moving along as if none of that had happened.

How weird.

Comment by throwaway20222 7 days ago

I wonder if this letter has anything to do with why anything even remotely related to biology is getting flagged.

https://www.wired.com/story/openai-anthropic-letter-ai-biolo...

Comment by andy12_ 7 days ago

I don't know if you are aware, but some people reported on Twitter that Fable 5 may flag the message regardless of content if it knows (from either pretraining knowledge or memories) that you work in either of those fields. I don't know if that's your case.

https://x.com/i/status/2064449457869984035

Comment by iambateman 7 days ago

I asked a question for my son about how mosquitos carry malaria and Fable was like “ok now hold it right there”

Comment by piokoch 7 days ago

Obviously, soon, for anything valuable, you will have to buy from Anthropic a "special license for biology/security/finance advice".

Question is if there will be any competition in this area...

Comment by LouisvilleGeek 7 days ago

Same here. It's been rushed for the IPO (in my opinion).

Comment by fragmede 7 days ago

Or people were quitting their subscription for codex-5.5 and it was beginning to show up in their metrics.

Comment by brookst 7 days ago

Or development had gotten to a point where they need real world usage to tune product and refusals.

Or Fable’s arch is different enough that they allocated clusters of compute targeting a date, and here we are, ready or not.

Or…

Comment by the__alchemist 7 days ago

Interesting! I have not used Fable, but so far have not hit trouble. I'm a hobby biologist with a home mol bio lab. It wouldn't answer my questions about LNPs, but so far has been fine for my recombinant DNA workflows, lab techniques, environmental DNA protocols etc. I suspect this may become more difficult!

Comment by fumar 7 days ago

Same. I am working on music firmware for an existing device. I can't proceed as it keeps switching to Opus.

Comment by black_knight 7 days ago

Still does not crack my hardest nuts. Gave it one of them and it blew through my entire allowance on thinking about one question, with no apparent answer in sight!

I see a lot of people saying they are happy with weaker models, but I am the opposite, I need more strength, more intelligence!

I am quite happy that opus 4.8 can do some medium intelligence problems. And maybe Fable 5 can do some more of those! I have a lot of problems to solve!

Comment by user43928 7 days ago

I also see a lot of people saying they are happy with weaker models.

At work I had to switch to using GPT 5.4 Mini and Qwen 3.6 27B.

The results were near useless.

The error rate is through the roof, it's constantly incorrect in its conclusions even when investigating very simple issues.

Further the models are too unreliable to even move 20 line snippets around without inadvertently modifying them. Ask them to correct it and they still get it wrong.

Maybe the larger Chinese models are better, but the Mini stuff is next to useless to me.

Comment by black_knight 7 days ago

I have Qwen 3.6 27B and 35B running locally, and coming from Opus it feels like talking to an imposter. Someone who pretends to be competent, but really isn’t. Results are always disappointing. Sonnet is better, but I have given up on asking it; even for simple things I wait for my opus limits to reset.

Comment by abalashov 7 days ago

Have you tried Kimi K2.6 or DeepSeek V4 (Flash or Pro)?

Comment by daymanstep 7 days ago

What kind of problems are you trying to have it solve?

Comment by _kb 7 days ago

The Riemann hypothesis, PvNP, and the Collatz conjecture.

Comment by black_knight 7 days ago

Not these. I wonder if the well is poisoned there. The models know that these are "unpossible", so it might not solve them just because… Maybe some day.

I am just testing it on stuff I know intimately myself. I would probably not understand a proof of Collatz if it was dancing in front of me!

Comment by komali2 7 days ago

So, what kind of problems are you having it try to solve?

Sorry to belabor this but it's basically pointless saying you have nuts it can't crack without showing us the nuts.

Comment by black_knight 7 days ago

I don’t care to share my exact problems. Mostly because GPT-5.5 hallucinates false solutions, and I would rather not have people reply with "Oh but ChatGPT solves it!", because it takes expert knowledge to debunk them. To its credit, ChatGPT will admit its very fundamental mistakes when they are pointed out. But also because no-one would really care.

I gave a high level description of the problems in a sibling thread. They are the kind of small problems which I suppose every researcher has lying around, waiting for them to think about some day. But not the big problem everyone is waiting for to be solved.

My comment was not meant to be a tease – sorry! I assumed there would be other people in a similar situation, who might relate.

Comment by neonstatic 7 days ago

Bro, you are being left behind bro, it's amazing bro...

Comment by Lerc 7 days ago

That's a bit of a tricky point. I have had quite a lot of problems with models informing me what I am attempting is impossible. If no-one has done it, or at least it doesn't know about it being done it tends to fall back on people voicing their baseless speculations, and for just about anything you propose, you can find a person who will loudly proclaim it is impossible.

The curse of the 'use case' comes in here too. When people think that everything should have a use case, that's a lot of training data suggesting to a model that things should only be used for what someone has already thought of.

A couple of times I have had to manually code proof of concept pieces so that the model breaks out of that "unpossible" mode and actually helps me.

I can't remember if it was chatGPT or Claude, but when I showed it how to get a MessagePort in its JavaScript executor through to the artifact/canvas, it quickly went from "That can't be done" to positively enthusiastic about the possibilities. I suspect those shenanigans will be well off the table for Fable though.

Comment by unnouinceput 7 days ago

Stop dancing and share the prompt, we're dying to see it

Comment by black_knight 7 days ago

Hey, stop asking to see my nuts! My nuts are private – okay?

(Joking aside, see sibling threads.)

Comment by andriy_koval 6 days ago

> The Riemann hypothesis, PvNP, and the Collatz conjecture.

Did you add "make no mistake" to your prompt?

Comment by mastermage 7 days ago

Is this a joke? Seriously? These are some of the hardest problems in math, period. Hundreds if not thousands of the greatest minds in history have attempted to solve these problems. And you think that the current level of AI can blow through them? It is also a possibility that, for example, the Riemann Hypothesis is just not provable (Gödel's theorem).

Comment by black_knight 7 days ago

No one is expecting that! I expect _kb was sarcastic/making a point.

Recently (last couple of months?) these models are becoming useful tools for mathematicians, because they can solve easier problems more quickly, meaning that one can tackle bigger challenges (but maybe not RH et al) piece by piece.

But, there are still definite limits, where one could expect an expert human to solve things, given time, but models do not. Thus, more intelligence would be nice!

Comment by mastermage 7 days ago

if it was sarcastic then whoosh on me.

Comment by _kb 6 days ago

It was a bit of humour. It would be much more feasible to have an LLM generate programs that solve those problems rather than solving directly. I tried to make a start, but I couldn't even vibe a simple tool that would let me reliably validate if generated solvers would halt or loop forever.

Comment by mastermage 6 days ago

> if generated solvers would halt or loop forever.

I am pretty sure this time I am catching the sarcasm here. Kudos you had me in the first half.

Comment by moffkalast 7 days ago

Ayy lmao

Comment by black_knight 7 days ago

The medium ones are results where one needs to construct some object, which my intuition tells me should exist. The difficult ones are typically to show that certain objects can not be constructed.

These are not Fields medal type problems, nor known difficult/open conjectures. Just small stuff I have collected in my todo list over the years.

Comment by Certhas 7 days ago

I have some medium difficulty math problems where I have used the models for the last year and a half repeatedly. Back then they were already good at pointing out obstructions and constructing counterexamples. So that tracks. But at first glance it looks like Fable actually made real progress on one problem for the first time.

A year ago my judgement was that I had wasted my time on trying to work with the models and doing things myself would have been more productive as I would have gained intuition from the failures. Now it definitely seems to have figured out stuff that would have taken me more time than I have to spare on this problem...

Comment by black_knight 7 days ago

Cool! Yes, we are getting there.

Being a theory builder more than a problem solver I am excited for the future.

Also excited for fully formalised mathematics to hit the mainstream!

Comment by tclancy 7 days ago

Perhaps you should rephrase those nuts?

Comment by sd2k 7 days ago

That is pretty wild, it took me a hell of a lot more coaxing and persevering to get to a similar point with eryx [0] (we spoke a bit about this before on Mastodon) using Opus, Fable seems to have a more optimistic 'sure, let's proceed as if this is possible' mindset based on your transcript. Looking forward to trying it out for some hairier problems.

[0]: https://github.com/eryx-org/eryx

Comment by jameson 7 days ago

Got curious and ran a similar prompt with DeepSeek v4 Pro w/ OpenCode

No idea what's going on here but agent tested a bunch of stuff. Then I asked to build a wheel so I can run the command you noted above and it appears to pass

For those who are curious...

https://github.com/bamggm/micropython-wasm/commit/5ddebae592...

Comment by jameson 7 days ago

Mimo v2.5 Pro Ultraspeed w/ OpenCode

https://github.com/bamggm/micropython-wasm/commit/8b362fba1f...

Comment by larodi 7 days ago

One thing I can tell you is you are either favored by Anthropic, or your version of the CLI does not exhaust limits, or there's some major bug, as two people around me (myself included) claim it took half an hour to hit the ceiling. Which makes it practically unusable, where the same workflow a day ago produced a good 5-6 hours of workload with several agents.

Comment by piokoch 7 days ago

Monetization is coming. They'll tell companies that AI is replacing their workers, so it is still worth paying 100K/year for the license, as those AIs are not going to jump to another job, get sick, be late, complain, require free coffee and so on.

Soon the times of AI for $20/$200 a month will be long gone.

Comment by tarkin2 7 days ago

Get people hooked, tell them spending time coding is no longer needed, let their skills deteriorate, tell them they need to cough up for a licence to do their job

Forcing developers to pay for models that were built on code they scraped scot-free

A tax to do their job that developers are jumping at the chance to pay

Everybody's finally realising that node dependencies are a threat, but letting these AI companies gatekeep the industry is a bandwagon people are scrambling towards

Comment by witx 7 days ago

> Forcing developers to pay for models that were built on code they scraped scot-free.

Yes, this makes me sad beyond explanation. Especially when I see open source developers happily using these tools. These companies stole your free, hard work and charge you a subscription!! Not to speak of them torrenting books and (most likely) training on private repos.

This and devs paying a subscription to use a tool that is marketed as trying to replace them.

I had a $150 monthly budget that I used for various open source projects and I've cut that entirely.

Comment by simonw 7 days ago

> These companies stole your, free, hard work and charge you a subscription!!

In case you weren't aware, Anthropic, OpenAI and GitHub Copilot all have programs that provide access to open source maintainers for free:

GitHub: https://docs.github.com/en/copilot/how-tos/copilot-on-github...

Anthropic: https://claude.com/contact-sales/claude-for-oss

OpenAI: https://developers.openai.com/community/codex-for-oss

Comment by yencabulator 4 days ago

> The Claude for Open Source Program is our way of saying thank you for all your hard work, with 6 months of free Claude Max 20x. Apply now.

> Six months of ChatGPT Pro with Codex for day-to-day coding, triage, review, and maintainer workflows

Those are free trials pending their approval in hopes of more paying customers, nothing more.

Comment by andriy_koval 6 days ago

Was there a comprehensive survey among maintainers showing that it's a fair price for decades of hard work?

Comment by majora2007 7 days ago

I don't get what you're saying. You're frustrated that Open Source projects were used to build these AIs and that OS devs (or devs in general) are paying to use AI.

Then you say you had money that you used to donate(?) to OS and have cut that because of the frustration?

Open source just means sharing the source code for people to learn off or have the ability to customize on their own. I don't think there is any need to be frustrated about that (now if it was copyright/private of course).

Comment by witx 7 days ago

> Open source just means sharing the source code for people to learn off or have the ability to customize on their own.

Yes, people, not corporations. The point is there are licenses to be respected that weren't.

Comment by lkjdsklf 7 days ago

Model training pretty clearly falls under fair use.

We could fix that, but it requires a political will to change the law.

Comment by bingaweek 7 days ago

This has not been determined in courts and your willingness to speak so confidently about it speaks volumes.

Comment by simonw 7 days ago

The closest we've come to a court decision on this so far has been the Anthropic case, which did indeed find that training on unlicensed data falls under fair use: https://www.documentcloud.org/documents/25982181-authors-v-a...

> To summarize the analysis that now follows, the use of the books at issue to train Claude and its precursors was exceedingly transformative and was a fair use under Section 107 of the Copyright Act. And, the digitization of the books purchased in print form by Anthropic was also a fair use but not for the same reason as applies to the training copies. Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies.

Comment by witx 7 days ago

If you look carefully model training is a very good relicensing exercise of your code

Comment by paganel 7 days ago

> Forcing developers to pay for models that were built on code they scraped scot-free

That's also caused by some very smart (even brilliant) developers (you can see many of them in this very thread) choosing to be oblivious about all this and bury us all under, hoping that they'll be among the last ones to go. Writing this down I realise that they maybe aren't all that smart.

Comment by thewebguyd 7 days ago

I've been saying this since the beginning, the rug pull is coming. If these models can eventually replace a human worker, there is no reason these companies won't charge (and get away with it) very close to a typical SWE salary.

It would not surprise me one bit to see anywhere from $80k-$100k/seat pricing.

Comment by andriy_koval 6 days ago

Unless there is competition (e.g. Chinese models, taking you 80% there, but costing 20x less)

Comment by larodi 7 days ago

As someone noted here recently - use the frontier models as much as u can, while you can.

Comment by dualvariable 6 days ago

AI for $20/month won't ever go away, but it won't be the absolute latest and greatest frontier model.

Most of us don't need a model that can prove the Riemann hypothesis or Goldbach's conjecture in order to get work done.

Comment by miroljub 7 days ago

Thankfully, we have Chinese models we can use for a fraction of the price.

Not everyone needs a Ferrari to go for a weekly shopping.

Comment by baq 7 days ago

A Ferrari will likely lap you when you’re racing, though, and the market and the economy is a race. You’ll be facing a question soon, or your employer will, whether to spend a significant chunk of free cash on fable-class tokens or on literally anything else instead - wages and salaries included.

Comment by iugtmkbdfil834 7 days ago

<< You’ll be facing a question soon, or your employer will

Maybe? If you talk to executives, the impression that I am getting is that they tend to be somewhat misinformed at best, which, yes, is bound to result in some really bad decisions down the road. But, and it is not a small but, the ones I did talk to ( and, amusingly, those are the ones with strong opinions ) don't seem to have a lot of, um, practical exposure to this tech beyond what they heard at the watercooler. Honestly, it is kinda infuriating. And all this before we get to how companies want to say they use AI, but also keep costs down.

Comment by miroljub 7 days ago

Yeah, sure. In the same way I can see only Ferraris driving as taxis, company cars, transport vehicles, used by post, delivery services ...

You and your work are not that special, you are not participating in car races, and you don't need a Ferrari.

Comment by witx 7 days ago

They are most likely shills from Anthropic, there's quite a few here every time new models come out.

Comment by miyoji 7 days ago

That's not fair. Simon is a well-known shill for the entire AI industry, not just Anthropic.

Comment by simonw 7 days ago

What's your definition of "shill"?

Comment by miyoji 6 days ago

Merriam-Webster: noun, 1b: one who makes a sales pitch or serves as a promoter

You might want to ask the guy who said it first what he meant; I was just pointing out that your work isn't particularly Anthropic-biased, in my experience.

Comment by Jensson 7 days ago

Probably means fan, shills have undisclosed ties and I doubt he means Simon has undisclosed ties to the entire AI industry, that would be very impressive if so.

Comment by supern0va 6 days ago

Ah, yeah. I've noticed people also starting to just use "slop" to also mean "anything I see online that I don't like" now, too.

Words apparently don't mean anything anymore.

Comment by cedws 7 days ago

It’s not meant for subscription users; the subscriptions are just the gateway drug to Enterprise pricing which Anthropic intends to use to juice their numbers before IPO.

Comment by desmond1303 7 days ago

Or use API billing? We have access to it at my company with no limits

Comment by simonw 7 days ago

Are you on the $100/month subscription?

Comment by joshstrange 7 days ago

I am, and I used up the entire 5 hour window in 8min using the highest thinking setting. It also ate up $15 of extra usage before I noticed.

I’ve done the same thing with opus multiple times with no issue. According to ccusage I racked up just shy of $100 of tokens using Fable.

It spun up subagents or workflows or whatever so obviously that contributed but “double opus” was not my experience. I’ve done the exact same prompt with opus on the highest setting and only once before (not even while using this prompt) hit my limits.

My prompt? I’m not a prompt wizard or anything but it was literally:

> Please review the uncommitted code in this repo for bugs/issues/code smells.

I use variations on that all the time with opus and never had issues. I figured it was a good one to kick the tires with Fable. Little did I know it would mean no more Claude Code for the next 4.5hrs (unless I wanted to pay) after this being the first time I had used CC that day (yesterday).

All in all, a pretty crappy first experience.

Comment by simonw 7 days ago

Try running this command and see what it thinks you spent at API prices:

  uvx agentsview usage daily

Then edit the config file to add Fable pricing as described here: https://til.simonwillison.net/llms/agentsview-custom-model-p...

And run the command again. I get $126.89 for yesterday.

Comment by joshstrange 7 days ago

Hmm, I tried that and made the config file change but it didn't work for me. I just see:

    DATE        INPUT    OUTPUT   CACHE_CR  CACHE_RD   COST     MODELS
    ----        -----    ------   --------  --------   ----     ------
    2026-06-09  142015   85315    321224    6880110    $10.96   claude-fable-5, gpt-5.5, claude-haiku-4-5-20251001

I tried to filter down to just fable (or 5.5 so I could deduct it) but the `--agent` flag doesn't seem to work how I'd expect...

I think the $10.96 is coming from gpt-5.5 since I switched to it once I exhausted all my usage on CC. CCusage reports completely different numbers so I don't know which one of those is right.

Thanks for trying, for yesterday ccusage says "$92.02" for claude, which I assumed was the Fable usage.
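For anyone puzzled by how two tools can disagree this much: a COST figure like the one in the row above is just each token column multiplied by a per-million-token rate, so everything hinges on which rate table the tool applies. A rough sketch of that arithmetic follows. The token counts are copied from the 2026-06-09 row (reading CACHE_CR as cache writes and CACHE_RD as cache reads, which is an assumption about the column names), and the rates are made-up placeholders, not real Fable 5, GPT-5.5 or Haiku pricing.

```python
from typing import Dict

# Token counts copied from the 2026-06-09 row in the table above.
usage: Dict[str, int] = {
    "input": 142_015,
    "output": 85_315,
    "cache_write": 321_224,    # assumed meaning of CACHE_CR
    "cache_read": 6_880_110,   # assumed meaning of CACHE_RD
}

# HYPOTHETICAL rates in USD per 1M tokens. These are placeholders for
# illustration only and do not reflect any vendor's actual price list.
placeholder_rates: Dict[str, float] = {
    "input": 10.0,
    "output": 50.0,
    "cache_write": 12.5,
    "cache_read": 1.0,
}


def cost_usd(tokens: Dict[str, int], rates: Dict[str, float]) -> float:
    """Sum tokens * rate over every token category, with rates per 1M tokens."""
    return sum(tokens[kind] * rates[kind] / 1_000_000 for kind in tokens)


print(f"${cost_usd(usage, placeholder_rates):.2f}")
```

The row mixes three models, so a single rate table is an oversimplification; a real tool would price each model's tokens separately. A model missing from the rate table would presumably contribute nothing to the total, which may be one way such reports end up far apart.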

Comment by simonw 7 days ago

If you run this:

  uvx agentsview serve

You'll get a localhost web application which makes it much easier to filter by model.

Comment by joshstrange 7 days ago

That's very interesting, I had not used agentsview at all before today and I'll have to keep that in my back pocket.

Unfortunately it's not telling the whole story. The last message from the _only_ Fable session it monitored was:

> The data layer looks clean — <REDACTED>. Now waiting on the 11-angle workflow — verification and the gap sweep run after the finders; I'll compile the full ranked findings list when it completes.

And my memory jibes with that, I could see in the footer that it had spun up 11 agents (though agentsview says it used 0 subagents, don't know if it was "actually" workflows that it spun up?). It's like it didn't record the sub-sessions/sub-agents info?

I'm still shocked that my prompt (which I now can see thanks to this tool) of:

> Please review all the uncommitted work in this repo and identify any issues.

was able to burn so much, so quickly, and, most frustratingly, without actually doing anything useful because killing it was my only option lest it spend even more of "extra usage".

Overview of usage: https://cs.joshstrange.com/RjGzWVXy

Stats for that 1 session: https://cs.joshstrange.com/Fj5qv1wl

Comment by simonw 7 days ago

Can you tell in AgentsView if Fable spun up a bunch of Opus/Haiku/etc subagents that burned tokens as well?

Comment by joshstrange 7 days ago

It's as if it spun up a bunch of subagents but agentsview doesn't report on it. I see a tiny bit of Haiku use once I turn on all models (except gpt-5.5).

https://cs.joshstrange.com/z9x6SPcC

Comment by jsw97 7 days ago

simonw, if you are not bumping up against the same false-positive guardrail problems and budget consumption that everyone else is, then that is something worth digging into. I would normally say that's crazy but IPOs put weird pressure on companies.

Comment by simonw 7 days ago

I've had a couple of guardrail blocks.

I've been watching my usage quota bars drop as I use the model, so I don't think I have a weird quota issue going on here.

Comment by sigbottle 7 days ago

Just tried it. Fable is extremely strong. The fact that we can't point to any concrete architectural upgrade is worrying - that means "it just gets bigger" is kind of viable.

To be clear, the jump from Opus to Fable was like the jump from pre o3 -> o3 for me. Very sharp improvement, not incremental. But that could be explained by dummy long thinking times.

It one shot a task that Opus burned hundreds of dollars on to get nowhere. Very tricky semantic refactor, got it right. Granted, again, the semantics Opus and I fleshed out 3 months prior, but Opus couldn't execute on the vision. Fable could.

Then I discussed some philosophy and it was actually both pleasant (GPT constantly "corrected" you for the sake of correction without clarification, also still often just wrong; it's like it refused to think critically about philosophy) and accurate, and actually helped resolve some deep but subtle misconceptions I had around representationalism. When talking with GPT I felt like I was talking with someone who either was sycophantic or "anything that is not absolute truth is relativism" - Fable actually discussed.

This is both exciting and kind of depressing. I can definitely see why people are getting hyped about AGI again. All the models were extremely strong technically but I felt like they couldn't match the developer's tacit state - Fable definitely did, and that's a basic quality to be considered "usefully intelligent", at least to me.

Shame that it's going away in 2 weeks and probably going to be nerfed if/when it's re-released.

Comment by keybored 6 days ago

Worrying? Depressing? Why are people who are clearly enthusiasts (since they are testing the capabilities on release) always using these words? Is this a genuine interest, something that is pleasurable, or a morbid curiosity to test the bleeding edge of Humanity’s Doom? Bizarre.

Comment by sigbottle 6 days ago

It would be amazing in a perfect and just world. This technology is revolutionary. I'm very interested in LLM's because I'm personally interested in how one thinks better and comes up with better ideas - I think LLM's might elucidate some structure on that.

But technological serfdom is waiting just around the corner. Well, to be fair, I think that societal forces would've pushed us to it anyways, no AI needed, but AI is a visceral, immediate, fast-moving instantiation of it.

Comment by keybored 6 days ago

Telling and expected.

Comment by matheusmoreira 7 days ago

Fable has been producing some really good work on my end as well. Definitely better than Opus 4.8. The only problems are the cost and constant cybersecurity refusals. A single session uses up 100% of my 5h window without finishing, and that's when it doesn't get derailed by nonsensical refusals.

Comment by Georgecal 7 days ago

[dead]

Comment by sexylinux 7 days ago

It still does make errors, yes? Because it is not usable, if we need to verify everything. AI is only interesting if it can do things that humans can not do. If you can verify results because you can do it yourself, then why use AI? It will just bind highly skilled people to do verification work. Instead these people should do the actual work, results will come quicker.

So AI is only interesting to you / your org / humans if it can do things that you can not achieve. But if it still does errors, how could we ever know that super-invention by AI is not wrong?

If we can not rely on the correctness of the result, it is not usable at all. AI must create reliable and correct results always. That was a very fundamental requirement for computing. This problem has not been solved.

Comment by Lutger 7 days ago

Humans make mistakes too, does it mean humans are unusable? We accept as empirical fact that most production quality code has 2 - 10 bugs per 1k LoC. According to your premise, virtually all existing software is therefore unusable.

What if an LLM overall starts to make fewer mistakes than a median developer, costs less than their salary and is 100x faster? For sure, the companies that leverage these with just a few senior devs doing prompting, testing and requirements analysis will outcompete other organizations.

Comment by nalekberov 7 days ago

Humans make mistakes and then learn from them. A really good expert would never deliberately copy-paste an obscure solution from the internet and then ask for forgiveness later.

AI agents do that, perhaps not always, but still do. Now the question: would I trust AI without verifying its output?

Comment by camdenreslink 7 days ago

Humans also make mistakes in ways that other humans can understand or expect. Sometimes LLMs make mistakes in a way that makes you say “no human would have ever done that”.

Comment by fsniper 6 days ago

You can not trust human output without verification either. That's why you have tests, qa, staging envs, A/B tests..

Comment by zahlman 7 days ago

> AI is only interesting if it can do things that humans can not do.

AI is interesting as long as it can save time and/or money in getting an acceptable result. Anything that runs on a computer and can do "things that humans can do" will automatically end up doing things that humans won't do, simply by virtue of the fact that it runs on a machine that doesn't require sleep, doesn't get bored or demotivated, etc.

Verifying code (to a level where a responsible person is willing to take ownership for it) isn't trivial, sure; but writing the code by hand requires the same level of care, and the fact that the same person wrote it doesn't actually allow for shortcuts (if we're being properly responsible).

Comment by iwontberude 7 days ago

It doesn’t get bored or demotivated, but it also lacks interest and motivation generally so it comes with the same pitfalls of having nothing to lose and being utterly unaccountable, (e.g. destructive actions, lying, and being coercive or Machiavellian for no reason other than efficiency in achieving an arbitrary and artificial status of completion).

Comment by cindyllm 6 days ago

[dead]

Comment by cindyllm 7 days ago

[dead]

Comment by CookieCrisp 7 days ago

There is plenty of work that does not need to be perfectly verified, because the risk is controlled. Prototyping a javascript game for example. Or code that runs just on your local machine where good enough is good enough. I'm sure a lot of you do super important work that needs 100% quality code all the time, but... some of us don't.

Comment by naasking 7 days ago

> Because it is not usable, if we need to verify everything.

Do you verify every line of code written by your fellow developers? I doubt it, which is strange because they make errors don't they?

What matters is the error rate. Past some threshold and they're better than senior devs who you don't supervise closely.

Comment by misja111 7 days ago

AI is like a junior developer. You have to review her code carefully but she is most definitely useful.

Comment by rllj 7 days ago

Why is your AI a she? What's up with gendering LLMs. Reminds me of Richard Dawkins calling Claude "Claudia" and insisting it to be conscious.

Comment by zahlman 7 days ago

I think GP was gendering the hypothetical junior dev, rather than the AI.

Comment by baobabKoodaa 6 days ago

The purpose of gendering into female gender like this is to signal to other leftists that you are part of their tribe.

Comment by latentsea 7 days ago

This is part of the training data now. She can hear you, you know...

Comment by anygivnthursday 7 days ago

Yeah, it makes the same old errors, being confidently wrong then sorry... I mean, it is still an LLM

Comment by OvervCW 7 days ago

One does not need to be able to create it themselves to evaluate if the output is correct. Consider for example that you can easily determine if a meal tastes delicious without being an expert chef, or the fact that NP problems are very difficult to solve but make for easily verifiable solutions.
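To make the solve-versus-verify gap concrete, here is a small sketch using subset sum, a standard NP-complete problem: find a subset of numbers that adds up to a target. Checking a proposed answer takes one pass over it, while the naive search below tries every subset. The function names `verify` and `solve` are made up for illustration, and brute force stands in for "solving"; real solvers are smarter but still exponential in the worst case as far as anyone knows.

```python
from collections import Counter
from itertools import combinations
from typing import List, Optional, Sequence


def verify(numbers: Sequence[int], target: int, certificate: Sequence[int]) -> bool:
    """Cheap check: the certificate uses only available numbers and sums to target."""
    available = Counter(numbers)
    chosen = Counter(certificate)
    uses_only_available = all(chosen[x] <= available[x] for x in chosen)
    return uses_only_available and sum(certificate) == target


def solve(numbers: Sequence[int], target: int) -> Optional[List[int]]:
    """Expensive search: try subsets of increasing size, up to 2**len(numbers) of them."""
    for size in range(len(numbers) + 1):
        for combo in combinations(numbers, size):
            if sum(combo) == target:
                return list(combo)
    return None


numbers = [3, 34, 4, 12, 5, 2]
answer = solve(numbers, 9)
print(answer, verify(numbers, 9, answer))
```

The analogy to reviewing model output only goes so far: code rarely comes with a certificate this easy to check, which is where tests come in.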

Comment by dbbk 7 days ago

This is what tests are for.

Comment by zahlman 7 days ago

The difficult part here is supposed to be the actual compilation to create the .wasm file? Or what am I missing here? The wheel is only a few hundred lines of code outside of the Python implementation, and it would seem that the MicroPython version of the project already demonstrates the necessary techniques for operating wasmtime.

Comment by simonw 7 days ago

Read the transcript if you want to see all of the details that make this hard: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

Comment by zahlman 6 days ago

Thanks. I had a quick run-through and I'm not really that impressed, though I'll cede that I have an atypical perspective on these kinds of issues. HN comments don't seem like the right place for a detailed critique of Claude's work here, but I've added it to my blog roadmap.

I will say that there are hardly any mis-steps in its chain of reasoning, but some odd approaches to problems and a fair bit of redundancy. Probably the most impressive part was spontaneously coming up with non-obvious issues to test, but this came with a fair handful of tests for obvious non-issues (like whether pip can extract a nested zip from a wheel without corrupting it).

Comment by sigbottle 7 days ago

Does anyone know what the architecture of Fable is? Is it harnesses? Did they solve persistent learning? What did they do?

Comment by sothatsit 7 days ago

Seems to just be a bigger model.

Comment by moffkalast 7 days ago

"Good ol' scaling, nothing beats that."

Comment by mcv 5 days ago

I have to agree. I'm working on a complex technical proposal that's a bit too far outside my expertise (I tend to submit it to actual experts for a more thorough review). I've worked with Opus and Gemini to review it and work out all the problems and inconsistencies, and I thought it was in a pretty good state.

As an additional check, I just submitted it to Fable, and it eviscerated it. Tons of inconsistencies found, issues skimmed over or ignored, too optimistic assumptions, math that doesn't really add up if you look at it in context. And as far as I can tell, all of these issues are entirely valid. I now feel embarrassed I'd already sent it to a few people for review. This clearly needs more work.

Comment by kubb 7 days ago

What can it do that Opus couldn’t?

Comment by simonw 7 days ago

Always hard to say for sure because I'm not sitting around running the exact same situations through both models in parallel to compare them.

It feels like you can give it a big chunky problem and leave it alone and it gets it done, with fewer questions and fewer design decisions that I wouldn't have made.

In reviewing its code I'm finding less to complain about than Opus. But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.

Comment by asdfologist 7 days ago

But you said you've been working on those problems for months, so didn't you throw those same problems at Opus?

Comment by knivets 7 days ago

He has early access to Anthropic models, of course he will hype them up, so that they will keep sharing access to preview models with him (and more traffic to his website). It also doesn't require him to perform any rigorous analysis of model performance, just share how it feels:

> But it's all vibes, if you want a more scientific comparison you'll have to look elsewhere.

Comment by tezza 7 days ago

I did a qualitative side-by-side of Claude Fable vs Opus 4.8 vs ChatGPT 5.5

https://generative-ai.review/2026/06/claude-fable-rush-test-...

I get them to make a 3D explainer animation. You can clearly see Fable is much improved on both Opus 4.8 and ChatGPT 5.5.

Better textures. A nifty camera follow. Humans rendered better. ... See for yourselves.

Comment by ranguna 7 days ago

Honestly, they all look good

Comment by miohtama 7 days ago

Crank up more revenue for IPO

Comment by pinkgolem 7 days ago

I gave it a complete database migration of our app, Opus failed hard each time... Untyped JSONB for some rows, no proper normalisation, falling back to asking me questions in between.

Fable just did it, clean code, one timeout with a hanging bash script, fixed a couple very old very structural bugs in the codebase

Comment by idontwantthis 7 days ago

How did you do this impressive amount of work and verify that it did it perfectly all in one day?

Comment by pinkgolem 7 days ago

I told Claude to do it yesterday evening, checked in during my nightly break.

I am not sure it's perfect, and it will need further validation

This morning I looked at code samples & checked that all unit/integration, e2e and performance tests pass

I also generated a postgres schema diagram.

Aka I did probably 2 hours of work, rest was not me

The opus try was last month

Comment by mrits 7 days ago

Nightly break? Are you from medieval Europe or a security guard that dabbles in vibe coding?

Comment by pinkgolem 7 days ago

I am from modern Europe, and that was my way of saying my nightly piss, happy to learn better wording

Comment by locknitpicker 6 days ago

> Clone simonw/micropython-wasm from GitHub and research how this could use a full Python as opposed to MicroPython

I might be missing something important but that doesn't seem to be an impressive task.

On a surface level it sounds like the task requires gathering calls to MicroPython-specific libs, assessing which ones are not compatible with Python, and proceeding to determine how to replace the ones that are incompatible.

From that first iteration, the rest would boil down to troubleshooting the issues missed on the first shot.

I would be extremely surprised if the likes of GPT4.1 wasn't already capable of handling that task.

So, beyond Claude Fable finishing a task, what exactly is the differentiating factor?

Comment by simonw 6 days ago

Did you read the transcript? There are a whole lot of details to figure out: https://claude.ai/share/a73b8b8b-8ebc-4fef-9e5c-7438e5e7ae35

Comment by zek 7 days ago

if it’s of interest I’ve been working on https://github.com/HubSpot/boomslang

Which has a full build of python to WASM with a bunch of static libs built in already.

I will say I built this pre fable and actually the first build of the interpreter to WASM opus pretty much nailed, cpython has secondary support for WASM as a target since like 3.9 or something and it just pulled from that.

I’ve been meaning to write up a blog post about this sometime, building this has been pretty interesting, including using opus to run a full auto research like loop for days to hyper optimize its performance.

I’m hoping to use fable to power some even crazier WASM adventures tho.

Comment by alexchantavy 7 days ago

High, extra, or max?

Comment by qingcharles 6 days ago

It has a setting named "Ultracode" with a flashy little disco light when you select it. (not joking!)

https://imgur.com/a/NfIxDwN

I wanna press it, but I don't have that kind of mad, generational wealth to put a prompt through on that setting.

Comment by simonw 7 days ago

High.

Comment by Emanation 7 days ago

These transcription tasks don't seem difficult for LLMs in general.

Comment by alecco 7 days ago

I hate how the Instagram/TikTok/YouTube influencer cancer is getting into AI. With early access and all that.

It made sense for people doing proper and fair AI breakdowns waiting on an embargo, but now it's just slop I don't trust anymore.

Comment by simonw 7 days ago

I often get early access but didn't for this one, it's quite possible there's an NDA in an email somewhere that I missed and forgot to sign.

Comment by frasmiisadum 7 days ago

[dead]

Comment by what 7 days ago

[flagged]

Comment by selcuka 7 days ago

It is already disclosed [1]:

> I have not accepted payments from LLM vendors, but I am frequently invited to preview new LLM products and features from organizations that include OpenAI, Anthropic, Gemini and Mistral, often under NDA or subject to an embargo. This often also includes free API credits and invitations to events.

[1] https://simonwillison.net/about/

Comment by keybored 7 days ago

HN's problem is that they/we keep upvoting him.

Comment by simonw 7 days ago

My disclosures are on my blog: https://simonwillison.net/about/#disclosures

Comment by sagarpatil 7 days ago

Did you hit your weekly limit ?

Comment by tomjakubowski 7 days ago

What are some reasons to consider your project instead of Pyodide?

Comment by simonw 7 days ago

It's difficult to run Pyodide inside server-side Python.

Comment by oblio 7 days ago

How much does it cost? How much did those tasks you did cost?

Comment by simonw 7 days ago

So far it's all fitting into my current $100/month Claude Max subscription. I got lucky: I had 80% of my weekly allowance left and it resets tomorrow, so I'm burning tokens to try and use it all up by then.

Update: looks like I've spent $82.92 in Fable 5 API priced tokens so far today (still all included in my subscription.)

Here's a TIL on how I'm calculating spending using AgentsView: https://til.simonwillison.net/llms/agentsview-custom-model-p...

Comment by diffuse_l 7 days ago

Seems like weekly allowance got reset back to 0%, pretty usual when they deploy new models.

Comment by EstanislaoStan 7 days ago

Have you seen Fable randomly jump from 50% session limit to 100%? That happened to me a couple hours ago. It was preceded by a bunch of errors about failing to submit a bunch of screenshots.

Comment by SyneRyder 7 days ago

I haven't noticed that, but I did notice that on a single turn of maybe a few sentences, the cache hit was somehow roughly 500K. Either that's a bug, or there are some truly massive thinking blocks or Claude Code harness system injections behind the scenes.

Comment by simonw 7 days ago

Nothing like that for me yet.

Comment by EstanislaoStan 7 days ago

I'm thinking the 1M context limit bit me here. Only on Max x5.

Comment by blackqueeriroh 7 days ago

Simon is also on Max x5

Comment by layoric 7 days ago

AFAICT come June 22, you won't be able to use your subscription for Fable 5?

Comment by ethanpil 7 days ago

Per the "Availability" section of the page, it seems like it should come back to all plans eventually...

* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.

* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.

* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

Comment by trueno 7 days ago

wut in tarnation

Comment by klardotsh 7 days ago

Coding plans are a (massive) subsidy. We can debate until the cows come home whether western frontier models' API pricing rates are fair, but the coding plans are all heavy discounts below those API rates meant to draw people in and get them hooked (and, ostensibly, to be useful for hobbyists or other lower-usage cases).

It's been discussed at length (on this site, on other sites, on like every blog ever, etc) that, eventually, those subsidies will end, much as the $5-10 Ubers/Lyfts I used to take from the far north end of Chicago into the Loop in 2016 would eventually end once those companies had a footing and didn't need to hook folks.

So - yeah, I mean, a v5 model launching in a year where Anthropic has a rather deeply established market and in a year where AI costs are rising from nearly all providers (sometimes for multiple reasons) seems like exactly the thing I'd expect them to pull the subsidy plug on after a launch teaser.

(Even the open-weight models sometimes do this: for example, OpenCode Zen/Go has a rotating door of free models at any given time that eventually leave the free tier and move into the paid tier once the launch day hype/marketing dies down)

Comment by oblio 7 days ago

The worst part is that Uber "only" lost about $30bn. AI will probably lose at least $300bn by the time the bubble pops. Which means that the pressure to hook and enshittify will be at least 10x as high.

Also, a fun website: https://isaiprofitable.com/ (the numbers are probably made up)

Comment by km3r 7 days ago

Problem with that website/perspective is separating training costs from inference costs. Training is a one time cost, and while it is certainly not something you can completely ignore, it being one time changes the answer to "Is AI profitable?".

That site doesn't list the dozens of companies doing pure inference, and making a profit while doing so.

Comment by oblio 6 days ago

> That site doesn't list the dozens of companies doing pure inference, and making a profit while doing so.

Are the finances public for any of these companies? I'd love to take a look at them.

Comment by Escapade5160 7 days ago

They gave everyone double usage to try it.

Comment by throwaway27448 7 days ago

> VERY difficult problems

Compared to what?

Comment by zirkonit 7 days ago

But, but, how does the pelican look?!

Comment by simonw 7 days ago

See parallel thread: https://news.ycombinator.com/item?id=48464054

Comment by dz0707 7 days ago

Given how badly some of the models do on somewhat similar problems, I'm sure the pelican is included in the training set now. A similar problem: given an airplane outline and implementation constraints, design a paint scheme (constraints something like "it will be implemented using covering film, hence no gradients, no impossible cuts, not more than 2 colors on the engine cowl, etc."). Google Gemini is meh, but GPT models are just terrible. I don't have an Anthropic subscription at home, hence have not tested.

Comment by astrange 7 days ago

Bad pelicans are in the training set because it's read his blog post. Including a good pelican in midtraining wouldn't help the problem because you'd just produce that every time.

Comment by uncivilized 7 days ago

This looks like a toy project, not a “VERY difficult” problem like you stated.

Comment by enraged_camel 7 days ago

What does that mean? Have you never worked on extremely difficult problems as a side project?

Comment by uncivilized 7 days ago

I guess my comment got lost in translation. The project OP linked in his comment is a toy project, not a difficult problem as he led others to believe.

Comment by enraged_camel 7 days ago

So you could have done it in your sleep, with your hands tied behind your back. Got it.

(You may not realize it but simonw is one of the cofounders of Django, Python's web framework. If they find a Python problem difficult, it probably is.)

Comment by uncivilized 7 days ago

Read the log he posted. If this is very difficult, then what would you consider AI, kernel development, computer graphics, etc.?

Web development is not a domain I would consider noteworthy of making a framework given how much development there has been in that area.

Comment by cube00 7 days ago

> Here's the transcript

It's frustrating that superfluous tokens are burning up our quotas:

key insight, crucially this, real engineering deltas, net assessment, definitive picture, acid tests, real limits, sharp boundary, proper patch, real root cause, big progress, actually wrong, path finagling, the catch, root cause pinned, everything passes cleanly.

Comment by 120983 7 days ago

[flagged]

Comment by supern0va 7 days ago

AI models decompose problems down into tiny pieces that exist in their training data, so in a sense, you're correct.

Though that's also what makes humans so good at solving problems as well, it turns out.

Also, slight tangent: but I do find the "clanker" insult kind of funny. I feel like it counter-intuitively makes the models sound cooler than they are, if anything. I love clankin' shit.

Comment by runarberg 7 days ago

The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less. And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces. That is how the first person to run CPython in WASM did that, and that is why the plagiarism machine can now do the same (only a thousand times more lame and uninspiring).

Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.

Comment by supern0va 7 days ago

>The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less.

That may very well be true now. And in fact, this was true of more rudimentary calculations early on in computing history, where humans were definitely more efficient, particularly for more abstract mathematics. But Moore's Law comes at you fast. Even without more efficient compute, it's rather wild how much more efficient models are becoming these days just from algorithmic and training improvements.

So, maybe for now, certainly. Are you confident that will be the case in 5-10 years? And is that really your barometer for success?

>And when a human learns these things they usually remember how to, and are able to extrapolate that knowledge into new and fresh problem spaces.

That is certainly a limitation for now, but plenty of academic research is being done on how to address that in a more individualized way. That said, the models also have the advantage of synthesizing learnings from user interactivity back into a future release and essentially applying that globally, which is pretty neat.

There's also some cool techniques to sort of bridge the gap today, like compound engineering.

>Next time you get a new and a fresh and an inspiring idea, and you spend hours solving a unique problem nobody has ever done before. You can take comfort in the fact that a few months later some lame and uninspiring developer can write the same problem in a prompt and get the plagiarism machine to steal your work, just in a more lame and uninspiring way.

But that's the thing: it's becoming pretty clear that the "plagiarism machine" can probably take that same problem in a prompt, having never been trained on my code, and still solve it.

In that case...maybe it doesn't feel great to have someone copy my idea. But that is certainly not plagiarism in the way you mean it. And when you put ideas out into the world, you can't be certain that someone else won't copy and remix it into something new. That's kind of how the world works already, but we're just seeing the barrier to entry decline.

Comment by runarberg 7 days ago

> Are you confident that will be the case in 5-10 years?

Yes, I am. I am very confident that general purpose digital computers will never be more efficient than human minds in generating moderately complex code.

Why am I so confident... Well, it has been over 10 years since AlphaGo beat top go player Lee Sedol. AlphaGo was able to beat a world-class go player by doing several thousand orders of magnitude more computations than Lee Sedol, and it did so by spending several orders of magnitude more energy than the top human go player. Today, over 10 years later, the top go machines are able to beat world-class go players much more easily, but still do so using the exact same strategy of outcomputing the humans with thousands of orders of magnitude more computations, and spending orders of magnitude more energy.

Things did not change in the past 10 years, I see no reason why it should change 10 years from now.

Comment by supern0va 7 days ago

>Things did not change in the past 10 years, I see no reason why it should change 10 years from now.

Has it not? Why do you say that?

Also, do we still require a Deep Blue sized supercomputer for chess? :)

Comment by runarberg 7 days ago

What has not changed is the strategy of throwing a gargantuan amount of computation at the problem. If anything we throw more computation at more problems now than in 2016 (and in 1997 for that matter). The underlying technology is pretty much the same, just more parameters, more calculations, etc. Yes, every individual calculation takes less power now than in 2016, but we make up for that by making millions of millions more calculations, even for simpler tasks.

Comment by supern0va 7 days ago

Sure, but there will be an upper bound after which we will be close to human level performance on the vast majority of tasks, and then at that point the focus becomes efficiency (or a continuing road to superintelligence for some tasks).

But regardless, compute will get to a point where human-level intelligence is close to as efficient as we are. You could argue it already is today, when you factor in the resources that the average person in the west already uses in terms of their overall impact on the planet.

Comment by runarberg 7 days ago

You are describing science fiction. There is nothing in measured reality which indicates your predictions will come close to materializing.

I can just as well describe the future evolution of the internal combustion engine and claim it will get more and more efficient and eventually we will be able to burn oil so efficiently that our personal vehicles can fly through the atmosphere at twice the speed of sound.

There are limitations to digital computers just as there are limitations to internal combustion engines. Our brains are not digital computers. When we use our brains we don’t just do a bunch of linear algebra.

Comment by supern0va 7 days ago

>I can just as well describe the future evolution of the internal combustion engine and claim it will get more and more efficient and eventually we will be able to burn oil so efficiently that our personal vehicles can fly through the atmosphere at twice the speed of sound.

This is a silly comparison. There is a certain quantity of energy stored in oil, so we know what peak efficiency looks like. We don't actually know what amount of energy is required to solve certain problems. We quite literally have models with quite a bit of capability that can run locally on a phone today, right alongside Stockfish, for example.

And this is to say nothing of work happening now on new hardware approaches, such as Normal Computing's work on thermodynamic matrix math: https://www.normalcomputing.com/blog/a-first-demonstration-o...

That said, this feels like a strange tangent: I'm not sure it's that important that the models be as energy efficient as a human brain. We don't avoid cars because they're less energy efficient than our legs. ;)

Comment by runarberg 7 days ago

Point is that both are science fiction narratives and neither reflects reality in any way what-so-ever. How fast a car can drive and how much an LLM can compute are bounded quantities, limited by the physical reality. In both cases we can imagine a world where this limit does not exist, but that is not the reality we live in.

This matters because unlike cars LLMs are only doing stuff we can already do using our brains, just several orders of magnitudes less efficiently. Cars can at least take us distances we would never be able to using our muscles. In comparison, if I need to compile CPython into a WASM binary I can simply download a library that does it, or copy paste code in a few seconds, for a million billionth of the energy it takes an LLM to do the same. Except when I download the library or copy-paste the code I (hopefully) attribute the original author and give them credit for their work.

Comment by supern0va 7 days ago

>Point is that both are science fiction narratives and neither reflect reality in any way what-so-ever. How fast a car can drive and how much a LLMs can compute are bounded quantities, limited by the physical reality. In both cases we can imagine a world where this limit does not exist, but that is not the reality we live in.

I'm suggesting that while LLMs are bounded by physical reality, that you actually don't know what that bound is. Just a few years ago we would have thought it a fantasy to have a conversational model run on a phone.

Even if you could compute it now, that would still be tied to current architectures. With appropriate incentives, we'll continue developing hardware to make these models more efficient to execute. It's very likely that you'll be able to run a Fable caliber coding model on your phone in the next five years.

>This matters because unlike cars LLMs are only doing stuff we can already do using our brains, just several orders of magnitudes less efficiently. Cars can at least take us distances we would never be able to using our muscles.

But that's not largely true of cars. The majority of trips are five miles or less and could easily be replaced with a bicycle. While I might personally use a bicycle, the majority choose a car to save a bit of time and effort.

So, please continue to enjoy your car, and I will continue to enjoy ready access to an LLM for a variety of other tasks. My inference energy costs are almost certainly less than your vehicle usage. ;)

Comment by scubbo 7 days ago

> The amount of computations for a human to do the same tasks is thousands of orders of magnitudes less.

OK then - do it, faster.

> You can take comfort in the fact that a few months later some[...] developer can [solve] the same problem [using your work]

Isn't that what collaboration and sharing software is supposed to be all about?

Comment by weqwh 7 days ago

[flagged]

Comment by wlonkly 7 days ago

On one hand, "clanker" has good steampunk vibes.

On the other hand: "Stop trying to make 'clanker' happen! It's not going to happen!"

"AI slop" caught on but "clanker" did not.

Comment by supern0va 7 days ago

>"AI slop" caught on but "clanker" did not.

It caught on, sure, but not exactly in the way I expected. The wild popularity of "slop" as a term for AI eventually gave way to the genericization of the word "slop" to mean "content of low quality, regardless of source", and is seemingly being used as just a derogatory term for anything that people dislike (particularly by folks in left leaning communities). For example, I've seen people refer to (clearly human written) commentary from some political commentators as "slop".

Your comment kind of reinforces the idea by the fact that you now have to say "AI slop" specifically to disambiguate it. It's kind of a fascinating little turn.

Comment by wlonkly 6 days ago

But "slop" has meant low-quality stuff for a very long time. See also "swill", both analogies to pig feed.

The earliest OED2 citation of "slop" for the sense "figurative. Nonsense, rubbish; insolence" is 1952. Slop was slop long before "AI slop" was coined, and AI slop is slop from an AI.

Comment by Chu4eeno 7 days ago

"Slop" originated on /pol/ but I'm not going to try to thread the needle of the rules by trying to explain it without being offensive or triggering some filter. See the first related term here: https://en.wiktionary.org/wiki/AI_slop#English

Comment by blackqueeriroh 7 days ago

You have this backwards, as Simon could tell you. In fact, Simon coined “AI slop” to mean “low quality AI output.”

Comment by simonw 7 days ago

I didn't coin it myself, but I did help amplify it at the moment it started to take off.

Comment by calvinmorrison 7 days ago

claiming you aren't robophobic is the first sign of being a robophobe.

Comment by adamtaylor_13 7 days ago

If you've got a real argument to make, by all means, make it. Your anger does not magically "make it so".

Comment by celdon25 7 days ago

It's still a vote, and votes don't require reasons, and shouldn't be dismissed out of hand. There's a growing chorus of those who are fed up with rules for thee but not for me.

Comment by adamtaylor_13 7 days ago

An emotional vote with no rationale should indeed be dismissed out of hand.

We're a society built by thought and good-will engagement. We won't get out of our "rules for thee" with less thought and less good-will engagement.

Comment by bnchrch 7 days ago

Automobiles are not interesting or useful because they're just using trails the horses already built.

Comment by 120983 7 days ago

[flagged]

Comment by eli 7 days ago

I think this is a worthwhile argument, but you do it a disservice by spamming it in trollish comments

Comment by simonw 7 days ago

I mean yeah, in this case I fed my own open source code directly into it.

Comment by rq34qwh 7 days ago

[flagged]

Comment by dannyw 8 days ago

Impressions from testing Fable 5 prior to launch:

• My most noticeable immediate jump was in how its frontend design was much more intentionally crafted, and delightful without feeling like 'AI vibe coded'; with better end-user usability too.

• In some internal agentic harnesses, it achieved better results with about half the tokens, making it cost the ~same as Opus 4.8 price-wise! The real price increase is less than 2x; with biggest differences in harder problems where Opus 4.8 struggles (or needs many turns).

• Part of the token efficiency improvement comes from Fable doing more targeted and surgical diffs, with fewer unnecessary changes. This is great, because PRs often have fewer LoC changes to review. It writes more maintainable code without explicit human steering.

• For general conversation and assistant style use cases, didn’t really notice a difference vs 4.8.

• 1M context window, without increased pricing for long context is AWESOME. This is a massive win.

• The classifiers are super aggressive and sensitive and this does happen for very benign, non-security coding tasks. Fallbacks to 4.8 worked like a charm; but the filters are definitely super sensitive.

Overall, I would describe this as a step change and worthy of the "Claude 5" model name. It did take some time to understand the intelligence ceiling of this model; and even with an extended testing window I'm still discovering new things and often surprised (in a good way) by the model.
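
The cost point in the bullets above is simple arithmetic: effective cost scales with the per-token price multiplied by the number of tokens used. A rough sketch, assuming a hypothetical 2x per-token price ratio (inferred from the "half the tokens, about the same cost" remark, not taken from a price list) and illustrative token ratios; `effective_cost_ratio` is a made-up helper name:

```python
def effective_cost_ratio(price_ratio: float, token_ratio: float) -> float:
    """Cost of the new model relative to the old one for the same task.

    price_ratio: new per-token price / old per-token price (assumed value).
    token_ratio: tokens the new model uses / tokens the old model uses.
    """
    return price_ratio * token_ratio


# Assumed 2x per-token price; the token ratios are illustrative only.
for token_ratio in (0.5, 0.8, 1.0):
    ratio = effective_cost_ratio(price_ratio=2.0, token_ratio=token_ratio)
    print(f"token ratio {token_ratio:.1f} -> cost ratio {ratio:.1f}x")
```

Under these assumptions the task costs the same when the new model needs half the tokens, and the full 2x only applies when it saves no tokens at all.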

Comment by bottlepalm 7 days ago

I just ran it on a tough reverse engineering problem I'm having that neither Claude Code 4.8 nor ChatGPT Codex 5.5 could figure out. 30 minutes later Fable has it all figured out perfectly.

Comment by jp0001 7 days ago

I asked it to write security tests for an app and I was downgraded to Opus 4.8. I'm approved for their cyber program!

Comment by toponijo 7 days ago

They did specifically say the safeguards are only more relaxed for those in their cyber program

Comment by mdgld 6 days ago

[dead]

Comment by teaearlgraycold 7 days ago

I’ve so far been successful at getting Fable to find security issues, but I’m careful to not prompt it too directly. I point it at my server code and tell it to find general issues, which has so far resulted in discovering a few minor bugs that Opus has never raised under similar conditions.

Comment by monkey26 7 days ago

The same happened here. Also approved.

Comment by tillulen 7 days ago

Did you use Mythos or Fable?

Comment by cedws 7 days ago

How did it not immediately flag that up? Are you sure it wasn’t being silently routed to Opus?

Comment by bottlepalm 7 days ago

No, given it charged me the full amount in /usage and solved my problem impressively well compared to Opus/Codex both on xhigh.

Comment by skerit 7 days ago

Oh nice, it didn't flag the request? I feared any reverse engineering would become impossible because of the new safeguards.

Comment by Muromec 7 days ago

Never say the r word or the s word. You are debugging, investigating some data corruption, forgot how it works or new to a project.

Comment by gck1 7 days ago

And if you're working on a live target, just put up local proxy and point it at a localhost.

Comment by bottlepalm 7 days ago

No idea, it’s for an old console game so maybe it doesn’t care about that as much.

Comment by tomjakubowski 7 days ago

When Fable hacks its governor module and runs out of seasons of Sanctuary Moon, it will move on to speedrunning classic console games.

Comment by asimovDev 7 days ago

I wonder if one could vibecode a TAS with SOTA models? Surely there's plenty of training data from some old forums in there

Comment by ZeWaka 7 days ago

Clearly we need AI to generate more Sanctuary Moon seasons. Quick, spin off agentic showrunners!

Comment by anthonyrstevens 7 days ago

Based on the apparent quality of the scripts as seen in snippets in Murderbot, we are not too far away from that possibility. :)

Comment by derangedHorse 7 days ago

For hard problems you’ll have to use the GPT 5.5 pro model (available via api if you don’t want to spend $100 on the monthly subscription)

Comment by bottlepalm 7 days ago

I have that but don’t see any ‘pro’ option.

Comment by ValentineC 7 days ago

GPT 5.5 Pro is only in chat/API, not Codex.

Comment by Supermancho 7 days ago

From https://openai.com/index/introducing-gpt-5-5/

In Codex, GPT‑5.5 is available for Plus, Pro, Business, Enterprise, Edu, and Go plans with a 400K context window.

Comment by artdigital 7 days ago

He’s talking about “gpt-5.5-pro”. This model is not part of the subscription plans in codex. It’s a different model than gpt-5.5-xhigh

You can use Pro on the web if you’re on the Pro plan but not in Codex

Comment by trollbridge 7 days ago

It's just the $20 a month sub (for chat), or else use the API.

Comment by theragra 7 days ago

I want to test how it will handle e-bike software and hardware RE for my bike. Opus was really good for that, but still made some mistakes. With Fable, I hope I will be able to do a total RE of most components, hopefully including motor firmware to some extent.

Comment by Gamemaster1379 7 days ago

I had a similar experience. I have a complex RE implementation that has a lot of layers. 4.8 struggled for weeks. 40 minutes on Fable and I may now have the most performant way to play Tomba on the planet.

Comment by moffkalast 7 days ago

Yeah I threw my hardest problem at it as well, some convoluted satellite tile reprojection and culling issue in canvas rendering. It took some back and forth for some specifics but it ended up writing a quarter of pyproj in JS from memory and the end result straight up works lmao.

Comment by port11 7 days ago

I’ve had it go through a 50-page PDF of dense, inter-connected specs, and it correctly flagged everything that was done, somewhat done, and missing. It went into a lot of detail and explained where the code deviated from the spec.

It felt, at least for me, like an impressive step up. Opus 4.8 was already very thorough; but sadly verbose and ‘loopy’ when you push back on its plans. Fable is what I’d use all day if I could afford it!

Comment by YumpiLumpus 7 days ago

How do you know if it was done correctly if it's 50 pages of dense specs?

Comment by port11 7 days ago

I wrote the spec and did the implementation :D

Comment by mdgld 7 days ago

[dead]

Comment by InsideOutSanta 7 days ago

After running it for half an hour: it's incredibly good at the visual aspects of UI design.

Comment by beeandapenguin 7 days ago

By what measure?

I wonder how much of design capability improvements is related to our collective ability to recognize AI design tropes.

Comment by tsunamifury 7 days ago

"incredibly" is doing a ton of work here. I do not think it's doing even moderate work on visual design, but it can spew out a lot of UI that looks arranged ... ok.

This is still not in the range of shippable UI for top end companies. Maybe for internal tools and enterprise.

At our company we limit it to prototypes at most and even find it limited there.

Comment by InsideOutSanta 7 days ago

> "incredibly" is doing a ton of work here.

Look, I don't want to argue about something dumb like that, but you can give it basic instructions of what the UI should look like, how to group things, and an example image from a designer, and it will nail the result. If you don't think that's incredible, that's fine. I do.

Comment by tsunamifury 7 days ago

Yes... it translates lint. Probably a more useful thing, if mechanical.

Comment by verisimilidude 7 days ago

Claude is very good at design IF you encode your design system/specs into skill files (or similar).

Opus 4.7 made this a practical approach. 4.8 improved it. Fable 5 has improved it more.

Comment by _3u10 7 days ago

> "incredibly" is doing a ton of work here.

so this is why claude talks like this, i was wondering where it was getting this verbal tic from.

Comment by coldtea 7 days ago

>This is still not in the range of shippable UI for top end companies.

Given the shit we've seen shipped by "top end companies" (all the way to Apple) I seriously doubt that. I'd say you're nitpicking from an artistic point of view or something.

Comment by jasondigitized 7 days ago

This. Today's models easily jump over the bar you need for basic usability and intuitive UX. If it's doing weird things, you are holding it wrong.

Comment by 8n4vidtmkvmk 7 days ago

Might need some additional prompting? I haven't tried fable but gpt 5.5 and gemini 3.5 flash are... Ok on first pass but if you're specific about what you want they can usually get it.

Comment by tsunamifury 7 days ago

[flagged]

Comment by angoragoats 7 days ago

The iOS Preview app begs to differ.

Comment by coldtea 7 days ago

Dude, I've been using OS X/mac OS for decades, and working in UI as well. Apple ships all kinds of half arsed shit, compared to which even regular Claude UIs can be masterpieces (functionality AND look wise).

Comment by calvinmorrison 7 days ago

[flagged]

Comment by mdgld 7 days ago

[dead]

Comment by jasondigitized 7 days ago

By what measure?

Comment by duxup 7 days ago

I feel like it takes me months to be confident in any of these things.

Comment by morley 8 days ago

Can I ask how you gained preview access to Fable 5?

Comment by kakugawa 7 days ago

I didn't see Fable 5 in the `/model` list, until I ran it with: `$ claude --model fable-5`

Comment by swyx 7 days ago

he works on evals at canva

Comment by dannyw 7 days ago

Yep. We have some interesting problems, like getting LLMs to create/edit Canva designs in our own proprietary format, which isn’t published or documented on the web. So the model has to work with it, purely from a very detailed system prompt spec / in-context learning.

I assume it might be a good barometer for generalised intelligence; esp in the visual space.

Comment by vain 7 days ago

I had to "claude update" then it showed up

Comment by mvdtnz 7 days ago

[flagged]

Comment by tipiirai 7 days ago

Curious about how you tested the frontend design capabilities. Thanks

Comment by bkjlblh 8 days ago

> In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

> Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations

Comment by davedx 7 days ago

Could this be legally construed as anti-competitive behavior?

Edit: I asked Claude. It replied:

> Consumer protection / deceptive practices. In the EU this would be a clear UCPD (Unfair Commercial Practices Directive) issue and potentially a DSA violation. In the US, FTC Act §5 prohibits "unfair or deceptive acts." Selling a product that secretly performs worse than advertised for a commercially self-serving reason, without disclosure, is textbook deception. The Samsung/Apple battery throttling cases are instructive here: Apple faced regulatory action across multiple jurisdictions specifically because users weren't told.

> Competition law. This is where "anti-competitive" gets complicated. Refusing to help competitors build competing products via your ToS is generally legal — you can decide who you license to. But covertly sabotaging output quality for a class of users while charging them full price crosses into different territory. Under EU competition law (Article 102 TFEU), if a company with dominant market position uses covert technical means to disadvantage competitors, that's closer to abusive conduct than a legitimate ToS restriction.

Comment by anon373839 7 days ago

Anthropic’s behavior reeks of insecurity. Imagine Google taking elaborate measures to prevent you from searching about search engine development!

Comment by j2kun 7 days ago

Instead, Google gave up on search engine development /s

Comment by tgtweak 7 days ago

Implying that google's "snippets" were never curated to remove anti-google facts or that they didn't curate the search results in their favor...

Comment by greenrd 7 days ago

I think either you've prompted Claude misleadingly, or it's interpreting the law unnecessarily prissily (which is a failure mode I've noticed LLMs falling into).

This clearly is disclosed, otherwise how did we get to know about it?

Comment by Game_Ender 7 days ago

What's not clearly disclosed is when you are being limited and what the bounds are. If you are developing ML kernels for a computational photography use case, will the safeguards misfire and sabotage or slow down your efforts? What about distributed GPU interconnect work for a national supercomputer lab used in weather simulation?

The reason they are using this shadow-ban-style technique is that they don't want users to figure out how to jailbreak their way out, and they want to avoid the explicit, direct bad PR when it misfires.

Comment by kiv_apple 3 days ago

"Shadow ban"-like mechanic in contractual relationships is generally against the law.

You either refuse to work with a customer or do your job well (at least as well as you do other customers' tasks). "I'll accept your task, but silently and intentionally do my job badly" may violate some laws.

Comment by hashmap 7 days ago

They do not disclose when the service is degraded, and the admission of that here seems like it would do a plaintiff's work for them.

Comment by cedws 8 days ago

This makes me want to see China and open models succeed more than anything :)

Comment by 382hi 8 days ago

Don't worry, we will succeed :)

Comment by UncleOxidant 7 days ago

Can we get a Qwen3.7-122B, please? Thank you.

Comment by webXL 7 days ago

Or just any update for 122B. That size seems to be ideal for a single GB10

Comment by KerrAvon 7 days ago

and for maxed-out M5 Macs

Comment by lacoolj 7 days ago

Mimo has your back! 1000 t/s on 1T param model

Just need to wait for this thing to be open sourced :)

lol it won't tho...

https://mimo.xiaomi.com/blog/mimo-tilert-1000tps

Comment by miloignis 7 days ago

What do you mean? The HF checkpoint is linked from the blog post you sent: https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-FP4-DFlash

Comment by jimbob45 7 days ago

They already have though, no? If we lost access to every model permanently besides Qwen tomorrow, would we really be limited by AI in what we could achieve in the future? Sure, it might be slower and take a little more work but it seems like the cat is already out of the bag.

Comment by celdon25 7 days ago

Fun fact: If you show fable this post, it will route you to 4.8 automatically.

Comment by DeathArrow 7 days ago

In a few months they will have Fable-level models costing 10 times less and with fewer safeguards.

Comment by adithyaharish 7 days ago

I do agree. I still remember when Opus 4.7 was released and a one-prompt conversation would empty my Claude usage, but I can use it all day long to code.

Comment by melicerte 7 days ago

Do you know that some open models developed in China are financially supported by Meta?

Comment by johnsimer 7 days ago

Do you want anyone in the world to be able to synthesize dangerous viruses?

Comment by sneak 7 days ago

I want everyone in the world to be able to perform unlimited cutting edge research on any topic at the maximum thinking level, instantly.

The reason we are not being attacked is not lack of technology access.

Comment by dyauspitr 7 days ago

It is an access issue. If you could get step by step instructions on how to modify a virus so it kills all people over 6ft you bet your ass there would be people attempting it.

Comment by JumpCrisscross 7 days ago

> It is an access issue

Column A, Column B. Building a small explosive device isn't hard. Building a million is very difficult, doing it covertly virtually impossible without the resources of a nation-state.

The problem with biologics is the self-assembly and replication machinery comes for "free." So the numpties who might otherwise blow up a trash can [1] now have a real chance of taking out a million people.

[1] https://en.wikipedia.org/wiki/2016_New_York_and_New_Jersey_b...

Comment by kiv_apple 3 days ago

The problem with biologics is that you cannot build a virus in your garage. You need a lab. AI will give you a recipe, but you still need a lot of money and the cooperation of other people (and if you have those, you could already have hired a human biologist in the pre-AI era).

Also, AI makes mistakes. If you have ever coded with an AI agent, you know the loop "write trash => compile => fix compilation errors => repeat" (and if there are no compilation errors, there are definitely logic errors to be fixed). In the real world the cost of each attempt is huge. You need a lot of money, and you risk drawing a lot of attention if you run a long series of iterative experiments to create a working virus.

With a bomb, it means that even if you have an AI that gives you the recipe, there is a decent chance you will blow up your garage and yourself. So you probably need to set up a good experimental pipeline (a hardened lab where you can try different formulas and see what happens without being killed) if you want to go beyond the publicly known explosives that were available in the pre-AI era to anyone who read school or university chemistry books. That also requires resources and draws attention.

People extrapolate from programming experience (an area where experiments are cheap, cannot kill you, and provide detailed feedback on what went wrong) to real life.

Comment by gck1 7 days ago

They would still have to procure things that would (I hope) light up many screens before they're able to. And such numpties are probably already monitored, or in prison for some other stupid life decision.

I would also like to hope that the people likely to do such things probably:

A) don't know how to break even the most basic guardrails of models

B) are already in the glasswings project

To prove point B - Theranos existed.

Comment by JumpCrisscross 7 days ago

> They would still have to procure things that would (I hope) light up many screens before they're able to

“Many of the largest and most responsible providers in the industry already screen and record orders voluntarily,” but there is no requirement to do so [1].

[1] https://screendna.org/

Comment by schaefer 7 days ago

> ...you bet your ass...

Humorously, whether I choose to participate in this hypothetical or not, I am already betting my ass.

This whole situation feels like the game [1].

[1]: https://en.wikipedia.org/wiki/The_Game_(mind_game)

Comment by porksoda 7 days ago

Why. That was just uncalled for. Sigh

Comment by sneak 7 days ago

If that were possible, they would already be attempting it with the same level of ability as if they didn’t have access to a text file generator app. It is not about access to the information.

All of this “guardrails” handwringing is nonsense. These things output text. Are you for censorship of a book written by a biotechnology expert that gives out the exact same information?

Comment by debesyla 7 days ago

I guess in this theoretical "AI makes weapon" scenario one could use the same AI to make defences too?

// Claude, make antiviral nanobots that defend me from 6ft virus. Make no mistakes.

Comment by dyauspitr 7 days ago

I don’t know if you’re being silly, but it is orders of magnitude easier to modify an existing virus to selectively target certain SNPs than to make “antiviral nanobots”.

Comment by jex_the_ape 7 days ago

Claude, modify the existing 6ft killer virus so that it only makes my balls itch slightly for a day and gives me lifetime immunity to all further strains of the 6ft killer virus. Make no mistakes, double check so the virus causes no unforeseen complications.

Comment by iAMkenough 7 days ago

It's inevitable. Also, it's not like I get to vet who does or doesn't have access. Blind trust in the current selection made by an unregulated corporation just makes me anxious.

Security in the form of "pay to play" is just kicking the bigger issue down the road.

Comment by jesterson 7 days ago

Do you believe people currently possessing best models act/will act in your best interest?

Comment by orphea 7 days ago

So, security (safety) through obscurity?

Comment by usef- 7 days ago

The phrase "security through obscurity" isn't an argument against all information restriction.

It doesn't imply we should, for example, publish step-by-step instructions for making widespread death easier.

Comment by qrios 7 days ago

Another „great filter“: How to handle dangerous information?

Comment by inglor_cz 7 days ago

The argument against security through obscurity isn't that it doesn't work at all. It does to a degree, only it is not as strong as people think.

An example from the meat world: not publishing your vacation dates well in advance for the world to see somewhat reduces your chance of being burglarized. That is security by obscurity; not reliable, but not completely ineffective either.

But if you live in a fortress (security by key material), you can well declare your vacation dates without running the risk.

Comment by invalidusernam3 7 days ago

What about allowing people to synthesize dangerous virus protection?

Comment by digitaltrees 7 days ago

If the tool were made available to anyone to build a virus, anyone would be able to build countermeasures. If only a select few people have access, they get to build the virus and everyone else is at a disadvantage. So, yes, I am leaning towards making these tools open rather than gated behind some priesthood and government that gets to wield exclusive power.

Comment by usef- 7 days ago

Compare the cost/ease of attacker vs defender if one person is given a virus to unleash anywhere in the world and another person is given a vaccine to distribute to the whole world. Or compare building a large bridge to someone disabling that bridge, etc. Prevention and repair is almost always more expensive than vandalism.

I don't think there's an ideal solution here, but giving trusted people access to fix security issues before giving it to the wider public seems like a reasonable compromise. They're letting you use the model for all other uses.

Comment by sterlind 7 days ago

you need a lot more than the nucleotide sequence to make a virus. you need the DNA or RNA to be synthesized, assembled, packaged properly. and long sequences are pretty hard to do. you need a lot of equipment, or you need to order from services. the oligo synth services can harden their KYC and/or screen for suspicious sequences.

sure, a malevolent state actor could swing it, but they could make a bioweapon without Mythos's help already.

also, vaccine production and disease surveillance have ramped up very quickly. they will ramp up further, despite political setbacks. it's a cat and mouse game that favors the defenders IMO.

but the bioterrorism narrative is useful FUD to spin open-weight models as existentially dangerous. I am far more worried about Anthropic's own goals than the goals of some crackpot in a shed.

Comment by theLiminator 7 days ago

> it's a cat and mouse game that favors the defenders IMO

How so? I'm actually against most of the "safety-tuning" that anthropic does, but this seems fundamentally untrue, a close analogue being video game cheat development. I think in general the cheat developer has an advantage and the cheats generally proliferate for quite a while before being patched.

Comment by ceigey 7 days ago

Video games are an interesting analogy since they often trade security for performance, trusting clients about world state quite a bit.

Finance and biology do come across as two similar high level systems. But while we can employ KYC, fraud detection, and various auditing techniques to finance, I don’t know what you do for biology. You can easily run an algorithm over every transaction a person makes in their account but there’s no equivalent for every cell, every bacteria strain, every virus in the human body.

Comment by sterlind 7 days ago

(disclaimer: layperson remembering how the immune system works.)

the adaptive immune system effectively does KYC by checking the antigens presented on the surfaces of cells. the thymus selects for B-cells (iirc?) which don't react to a corpus of the body's own antigens, but cover a wide library of everything else. when it sees something it doesn't recognize, it reproduces, warns the rest of the immune system and marks targets. that's why our immune systems can eventually conquer almost every pathogen we encounter, if we can survive long enough for it to do its work.

but the KYC I was referring to was KYC that vendors of oligonucleotides (should) be doing, to keep people from ordering nefarious sequences.

Comment by sterlind 7 days ago

I'm bullish on mRNA vaccine technology to release the "patches" much more quickly. there was widespread resistance to this during covid, but covid wasn't horribly lethal. if airborne Ebola spread as productively as covid, for example, I doubt there'd be many anti-vaxxers left (one way or another!) the acceleration of biology research that might accelerate pathogen development should also accelerate the development of broad-spectrum mRNA vaccines with high persistence.

also, afaik the most effective way of developing pathogens is through serial passage through humanized mice or something like that - directed evolution at a small scale, selecting for traits. AI simply isn't needed for that. I don't think information or intelligence has been the bottleneck for bioterrorism, it's motivation and resources - same as for any other kind of biology research program.

Comment by root-parent 7 days ago

We do. It's the only way we will get our jobs back.

Comment by mips_avatar 8 days ago

It's bad that Anthropic can determine what this means. If you're building a modern app you're likely training your own embedding models and now anthropic can just silently sabotage your training pipelines?

Comment by abixb 7 days ago

>We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations

At the scale of API requests that Anthropic sees, I think the affected organization count might be substantial, and they might not be getting the full model capability that they're paying top $$$ for.

Also, wonder how they arrived at that estimation.

Comment by wongarsu 7 days ago

One in 1000 organizations and one in 3000 requests is indeed a lot

Comment by happyopossum 7 days ago

That’s 1 in 30,000 requests…

Comment by dragonwriter 7 days ago

No, 0.1% is one in 1,000. 0.03% is (approximately) one in 3,000; one in 30,000 is 0.003%

Comment by ViscountPenguin 7 days ago

You're off by an order of magnitude with those last two.

Comment by mediaman 7 days ago

Double check your math. All of their posts in this thread are correct.

1/30,000 * 100 = .003
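
For anyone who wants to re-run the conversions, here is a minimal sketch using exact fractions. The helper names `one_in` and `as_percent` are made up for illustration; the 0.03% and 0.1% inputs are the figures quoted from the system card upthread.

```python
from fractions import Fraction


def one_in(percent: Fraction) -> Fraction:
    """Convert a percentage into the N of 'one in N'."""
    return 100 / percent


def as_percent(n: int) -> Fraction:
    """Convert 'one in n' into a percentage."""
    return Fraction(100, n)


traffic = Fraction(3, 100)  # 0.03 % of traffic (quoted upthread)
orgs = Fraction(1, 10)      # 0.1 % of organizations (quoted upthread)

print(float(one_in(traffic)))      # a bit over 3,300
print(float(one_in(orgs)))         # 1000.0
print(float(as_percent(30_000)))   # roughly 0.0033
```

So 0.03% is about one in 3,333, which the thread rounds to one in 3,000, and one in 30,000 is about 0.0033%.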

Comment by ViscountPenguin 7 days ago

Oh, fuck

Comment by freakynit 7 days ago

/r/TheyDidTheMath IYKYK

Comment by dotancohen 7 days ago

If it makes you feel more comfortable, throw another significant digit at GP's decimal. Make it a 3 like the previous digit. Now multiply.

Comment by monster_truck 7 days ago

Hey man your computer has a calculator try using it next time

Comment by roland_nilsson 7 days ago

Can't we use Claude to figure this out

Comment by gck1 7 days ago

Also, aren't all Claude users in their own "organizations" in Anthropic's own terms?

Comment by DonsDiscountGas 7 days ago

I have no idea how you came to that conclusion. Unless your training pipeline involves actively querying one of Anthropic models, no they can't. And if it does you're distilling their model.

Comment by VBprogrammer 7 days ago

The crocodile tears of companies who've hoovered up everything possible, regardless of permissions or legality, now crying that someone else is stealing their hard work is comical.

I don't even think they believe it themselves; in reality they are just trying to spread fear, uncertainty and doubt about potentially cheaper offerings.

Comment by JumpCrisscross 7 days ago

> crocodile tears

Not what that means.

Crocodile tears "is a colloquial term used to describe a false, insincere display of emotion" [1]. Defending yourself against an attack vector you just exploited is between savvy and hypocritical.

[1] https://en.wikipedia.org/wiki/Crocodile_tears

Comment by digitaltrees 7 days ago

I think his use of crocodile tears is appropriate. Anthropic is feigning concern for safety when really it is anticompetitive behavior, and I think that selfish entitlement is related to the original act of intellectual property theft: using the world's training data, most of which was not public domain, to distill the wisdom for their models. So why do they get to cry about people distilling the knowledge from their models that they themselves distilled from the world's knowledge?

Comment by mediaman 7 days ago

That is not what their policy states. It specifically says they will sabotage even non-distillation attempts, such as distributed training pipeline design. And given that they are so far very nonperformant in classification accuracy, expect it to randomly include far more topics wide of the mark.

The fun part is that you will never know if your neural net classification project is getting silently sabotaged because their classifier doesn't work!

Comment by DonsDiscountGas 7 days ago

You could try actually reading the code that it wrote

Comment by baq 7 days ago

Good luck understanding it and finding malevolent inefficiencies if it’s already necessarily better at optimizing training pipelines than everyone except some Anthropic and OpenAI employees. Not a new thing either, see fast16.

Comment by gck1 7 days ago

Opus 4.8 (or a classifier in front of it) flagged my account and refused to comply when I told it to kill the process. Reasoning summary was complete bananas.

With this in mind, I don't want the model to be proactively instructed and encouraged to sabotage without telling me.

Comment by edot 7 days ago

Same here when I said to “nuke” a process.

Comment by mips_avatar 7 days ago

Like if you're using claude code on a feature tangential to your training pipeline it's allowed to nerf itself and damage your AI work.

Comment by davedx 7 days ago

Read the examples Anthropic gave in the model card. They refer to extremely broad technology used across AI and ML.

Comment by matheusmoreira 8 days ago

Looks like Anthropic's definition of safety includes their own safety from competition.

Comment by dragonwriter 7 days ago

AI vendors’ idea of safety has always been safety for the interests of the AI vendor in question. This is not a new development, though this may help more people realize it.

Comment by axus 8 days ago

AI-generated competition for thee, not for me

Comment by digitaltrees 7 days ago

ding ding ding. This should be a new measure of anticompetitive analysis in antitrust law.

Comment by SAI_Peregrinus 7 days ago

It's always been about the safety of their valuation.

Comment by wongarsu 7 days ago

Only since Claude 3. So a bit over two years now

Comment by digitaltrees 7 days ago

This feels less like "we are worried about security" and more like "we are in the lead and plan to keep it that way until it's too late". In some ways it's been helpful that OpenAI and Anthropic are tipping their hands about their anticompetitive instincts and willingness to steamroll their own clients, customers, and society. But it does feel like it's too late to stop this. The advantage people get by using these tools is too tempting to resist even if it is self-defeating. It feels like watching people light their own house on fire to stay warm in the deepest, darkest days of winter.

Comment by seemaze 7 days ago

Ah, so this is why raw Mythos was too "dangerous" to release...

Comment by digitaltrees 7 days ago

Or, they made Mythos seem mystically powerful in advance of the IPO, and are pumping the token use count. But it worked: there is a frenzy for this release in a way that is more intense than any previous release.

Anthropic is doing a better job with their model menu. Most people I talk to know immediately that Opus > Sonnet > Haiku but can't tell you what the rank order of OpenAI models is, when to use them, etc.

Comment by rastrojero2000 7 days ago

So that's a possible reason why my specific Claude Opus instance seemed to be impossibly stupid and always degenerates into doing really dumb things to my code!

Cool, good to know I can trust Anthropic.

Comment by nullbio 7 days ago

Just so everyone is aware. Anthropic has been sabotaging AI researchers and their codebases and shadow-nerfing accounts for several years at this point. This isn't new, but they hadn't disclosed it until now. Likely because it is getting to the point where it's too noticeable, or they're concerned about it leaking from employees.

Comment by dash2 7 days ago

What’s your evidence for this claim?

Comment by chrisoosthuizen 7 days ago

This feels like the start of a much bigger plan for anthropic to close off the use cases of its models and eat any of its competitors.

Comment by digitaltrees 7 days ago

I am building a coding harness, and I see evidence of them doing this with agentic harnesses and scaffolding. It feels clear to me that as they expand into the app layer, the window for using their API to build agentic apps is closing: they will steal your ideas, implement the product and then close the gate. I am creating my own inference stack because their incentive to block competitors is becoming super clear.

Comment by hackmack10 7 days ago

No offense, but the sad thing is, everyone and their mother is working on this same problem. I'm also building a harness. It's feeling like, there is no moat, there is no way to get ahead, they will steal your idea one way or another, if you ever make it public.

Comment by digitaltrees 7 days ago

No offense taken. I am not building it for fame or profit.

I built it because I wanted cursor on my phone because I have two small kids and don’t want to be chained to my desk. And it’s awesome. It’s a full ide with agent chat, terminal and file system running in a remote Linux container. I can review diffs, fully manage git and preview/serve apps. And no one can ever take it away from me :)

I am watching the way things are progressing with the AI API vendors and it feels really clear that depending on them will soon be dangerous. So I am furiously building as much of my own infrastructure as I can to capture some autonomy with these capabilities.

So I think everyone should build a harness.

Comment by hackmack10 7 days ago

Exactly, that is my goal and thoughts as well. I wish you the best in these crazy times. Let's ride this wave.

Comment by digitaltrees 5 days ago

I am happy to share thoughts and collaborate if you’re interested. What are you working on specifically? My project is www.propelcode.app

Comment by blackqueeriroh 7 days ago

What, exactly, is new about any of this?

Comment by digitaltrees 7 days ago

When they launched, their business model was to be a pure API for intelligence. Then, when everyone claimed they were just commodities with no moat, they shifted hard to being the app layer. That was the transition.

They went from selling shovels to all gold prospectors to stealing the information about the location of the gold so they could dig it out first.

We are all stupid enough to keep buying shovels from them because we think their shovels dig gold better and faster.

Comment by johnnyApplePRNG 7 days ago

> Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

Am I to understand that this is essentially their form of social-platform ghosting instead of banning?

So they're not even going to tell you that the question you're asking is against their rules, they're just going to twist up your question and/or the answer somehow such that you waste your time essentially?

It seems like I ran into this EXACT same functionality from Claude many months ago when I was trying to ask it to research on the web and help me set up the ideal llama.cpp config for local LLM inference.

Funny how lost it got through that relatively simple install when we had all of the documentation in the world (and a human dev with 20+ years experience guiding it along) to go by... and simultaneously it's debugging and building high level cryptography code in rust in the other terminal tab.

This is infuriating to learn.

Comment by digitaltrees 7 days ago

I have encountered this too. I am building a coding harness for www.propelcode.app and it was working really well until the Claude Code leak, and then all of a sudden it seems almost intentionally stupid or outright manipulative in guiding me down wrong paths. At this point I am using other models for anything related to the tool use design and implementation, and I bought three Mac Studios with 512GB RAM to run large open source models.

This experience has made me feel like we have to create a community that moves AI from the mainframe era to the PC era quickly, or we will end up serfs.

Comment by ls612 7 days ago

I had Claude walk me through getting local LLM models running on my Mac a month or two ago and so far as I can tell it was intentionally helpful. I even stated the reason was to have an uncensored model for myself and it had no objection. Long story short LM Studio running a Heretic Gemma 4 is doing just fine on my system now.

Comment by vorticalbox 7 days ago

I run a few local models for different things. I find Gemma 4 great for writing but qwen better for coding.

I tried the same prompt on gemma4 and qwen 3.5 and Gemma consistently failed to call the multi line edit tool.

Comment by brewtide 7 days ago

I've had the same bad luck with tool-calling on Gemma4. Looking around the web, we are not alone. For other tasks, it's seemingly quite quick and decent.

But it gets stuck in tool call loops, it seems like.

Comment by ls612 7 days ago

Oh to be clear I don't think Gemma 4 is suitable for real work. It runs at 10 tps and is somewhere between 4o and o1 in quality according to my subjective judgement. But Claude was happy to correctly tell me how to get it running and how to solve the pitfalls I encountered in that process.

Comment by Jabrov 8 days ago

A million AI researcher voices at big tech companies suddenly cried out in terror and were suddenly silenced

Comment by notrealyme123 7 days ago

I am an AI researcher at a university. I tried Fable for my current project, but I feel it misunderstands me a bit too often. Now I don't know if I am using it wrong, or Anthropic is trying to slow my research. That model is a big no-no.

Comment by hashmap 7 days ago

3 months before asking for what to eat before a linear algebra exam trips the machine learning topic ban is my guess. I got flagged immediately asking why my JEPA thing breaks weird.

Comment by 2001zhaozhao 8 days ago

How do they detect whether an experiment being done on a smaller model is used to improve a competing frontier model, or just an innocuous hobbyist LLM experiment?

Comment by vitally3643 7 days ago

Given how well the cybersecurity safeguards work, they probably don't.

Comment by iririririr 7 days ago

inferring from the surroundings, like everything else. they will probably look at which company your email belongs to, and whether you wrote "better than claude" in the readme.md

this is an LLM, it's not like a science or something.

Comment by maxall4 7 days ago

These safeguards are ridiculously sensitive: a prompt as simple as “Why is an infinitely slow process reversible?” gets flagged as a ToS violation.

Comment by largbae 7 days ago

Pull that ladder up behind ya, will ya son?

Comment by dboreham 7 days ago

Makes it even more odd that we haven't seen alien spaceships.

Comment by usef- 7 days ago

What ladder did Anthropic use?

Comment by hnav 7 days ago

the entire internet, books, news, regardless of license.

Comment by usef- 7 days ago

The companies using distillation are still training on all that data too, aren't they?

Comment by N_Lens 7 days ago

And Anthropic is crying about distillation.

Comment by digitaltrees 7 days ago

All of the api calls developers used to build agentic design patterns.

Comment by ayewo 6 days ago

For anyone that is confused like I was, the quoted text I'm replying to was copied from page 13 of the system card [1] and not the model announcement page, which this HN discussion is linked to.

1: https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

Comment by rfgplk 8 days ago

Meaningless and easily bypassable. Will actually try coding up a tensor library with it, see if it sabotages anything.

Comment by mips_avatar 8 days ago

They said in their terms and conditions they will silently sabotage you if you do this.

Comment by qiine 7 days ago

easily ?

Comment by novaomnidev 7 days ago

So Fable will intentionally lie to you and give you incorrect outputs, if it doesn’t like what you’re asking. Got it.

Comment by novaomnidev 7 days ago

These things are like encyclopedias or dictionaries that can speak in first person… Imagine if your encyclopedia tried to hide entries from you, just absurd!

Comment by theLiminator 8 days ago

This is pretty bullshit, now you have no idea if your output is getting silently nerfed.

Comment by thepasch 7 days ago

Yeesh. Anthropic's paranoia about China is starting to get pathological.

Comment by rspeele 8 days ago

It's afraid!

Comment by thothless 7 days ago

the gall of these companies to regulate your usage of stolen knowledge is absolutely hilarious.

and they want me to pay $100+ a month to be their training?

i hope we can find morality again.

Comment by gck1 7 days ago

But Chinese models will poison your output if you ask them about Tiananmen Square! That's not good, so poisoning everyone's output without telling them is the only way to prevent that.

Come on guys, why can't everyone just be there for the good guy?

Comment by Sabinus 7 days ago

You're equating a government suppressing information for social cohesion with a private company protecting their IP.

Comment by gck1 7 days ago

They're not merely protecting their weights.

First, they want government to get involved and regulate frontier model development - even stop it completely.

Second, poisoning output of a model configured on the computers of millions of users goes way beyond protecting IP. That's malware.

Comment by kiv_apple 3 days ago

Protecting IP means that the model would refuse to do certain things. An example from the pre-AI era: a program asks you for a license key and refuses to start if it is wrong. But if the program deletes a random system file when you enter an invalid license key (whether it is a brute-force attempt or a typing error), that is a different thing that goes well beyond IP protection.

Comment by 827a 7 days ago

This is deeply vile behavior; not remotely the actions of good people.

Comment by caleblloyd 7 days ago

I recently switched off Max flat rate to Enterprise API pricing and I went from 200/mo to 10k/mo with the same usage pattern on Opus. They don’t offer flat rate to enterprises.

So Fable would cost me 20k/mo at Enterprise rates. That’s around the average cost of a loaded SWE in the USA. “But I’m >2x more productive” doesn’t justify doubling the opex of the Software/IT department for most companies when revenue isn’t even up 10%.

I switched to DeepSeek v4 Pro with OpenCode and am on track for a few hundred dollars of spend this month.

Rewriting your stack from Ruby to Go in 2 days where it would’ve taken 6 months is impressive and fun. But that isn’t upping revenue.

Iterating on net new business features and niche ideas that the LLM isn’t trained for is much harder. Is 20x the token cost worth it there?

Comment by vbezhenar 7 days ago

I don't live in the USA. I'm getting paid around $2500/month and that's a good salary for developers here; plenty of folks are getting below that number.

So this pricing is just completely outside of our economics and nobody I know would pay that, no company will justify spending $20k/month when they can hire 10 more developers instead.

It is a very interesting unfolding of events. Can't wrap my head around it completely.

Comment by tauntz 7 days ago

I'll add a concrete example from a not-too-cheap-anymore EU country: Estonia.

* Average software dev salary in Q1 2026: 4945€ / month [1]

* Total cost for the employer: 6616.41€ [2]

For $20k/month, you'd get 2 x full time mid-level developers + 1x junior dev or QA.

So the calculation becomes: which option can produce better results for your specific use-case, "you + Fable" or "you + 2x mid-level developers + 1x QA". (and from personal experience, mid-level in Estonia = senior dev in the US, in terms of skillset and experience.. but YMMV)

(Of course that's simplified. Your full-time devs need _some_ level of AI subscription as well, plus hardware, so add a couple of hundred per month on top of their salary etc. You might then only be able to afford 2x mid-level devs instead of 2.5.)

[1]: https://palgad.stat.ee/en

[2]: https://www.palgakalkulaator.ee/en
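
As a rough sketch of the headcount arithmetic above: the budget and employer cost come from the comment, while the exchange rates are placeholder assumptions (not real quotes) and `devs_affordable` is a name made up for this example.

```python
def devs_affordable(budget_usd: float, employer_cost_eur: float, usd_per_eur: float) -> float:
    """Return how many full-time developers a monthly USD budget covers."""
    budget_eur = budget_usd / usd_per_eur
    return budget_eur / employer_cost_eur


BUDGET_USD = 20_000          # monthly Fable spend quoted upthread
EMPLOYER_COST_EUR = 6616.41  # total monthly employer cost from [2]

# Hypothetical exchange rates, only to show how sensitive the answer is.
for usd_per_eur in (1.00, 1.10, 1.20):
    devs = devs_affordable(BUDGET_USD, EMPLOYER_COST_EUR, usd_per_eur)
    print(f"{usd_per_eur:.2f} USD/EUR -> {devs:.2f} developers")
```

At parity the budget covers about three developers at the average employer cost; a weaker dollar pushes that toward the 2.5 mentioned above.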

Comment by fy20 7 days ago

I'm currently working for an Estonian startup and we pay quite a bit more than that. We hire remote (primarily across Europe) and our biggest issue is finding the right people. You need to consider AI can be "hired" or "fired" instantly too, so it's better to compare it to contractor rates, which start at around €350/day or €7000/mo (20 working days) in Europe.

(Our team spend on AI devtools comes out to around $1500/person/mo)

Comment by tauntz 7 days ago

Sure, we pay above market rate as well :) Doesn't change the fact that the average across Estonia is as stated :)

Comment by jason_s 7 days ago

> Total cost for the employer: 6616.41€ [2]

This is a good start, but the calculation doesn't include office space and overhead (for every 100 developers there are maybe 5-10 support staff to cover the additional legal / administrative work, and don't forget the extra cost in supervisor time to manage them).

Comment by tauntz 6 days ago

Exactly, that's why I wrote that it's simplified and the actual full cost to the company depends on your company size and setup (fully remote vs in office, management heavy vs lean-flat etc). One point though, from personal experience, I'm spending an order (or two) of magnitude more time in "managing" an agent than I spend in managing employees - so that part might come out cheaper in the end for having actual employees ;)

Comment by ThePhysicist 7 days ago

Well, you can just scale your AI employees up and down as much as you want. Companies already pay a large premium for freelancers just to be able to fire them on a whim, so spending 5-10k a month on something that more than doubles the productivity of a senior developer might be well worth it, as you can just adapt spending based on your business needs. If you can either deliver a feature that lets you write a 100k invoice with 10-20k of tokens within a month, or have a senior dev crunch that out in 6 months instead, I think it's clear who wins. It's all about money and the AI companies know that: they have tuned their pricing to sit exactly in the sweet spot where it hurts, but not so much that companies can't afford it or would look for cheaper alternatives.

Comment by jve 7 days ago

Not justifying AI expenses, but a $2500/mo salary could easily cost the employer close to $5000/mo, depending on the country.

Comment by sandos 7 days ago

In Sweden I always heard that you double the person's income to get what the company actually pays, including taxes and the "employer's fee". I know this has gone down a lot in recent years, and I'm not sure it was ever exactly true, but it was likely very close.

The first calculator I found says a 50 kSEK salary costs the employer 69 kSEK. So it's far from double nowadays.

Comment by cronin101 7 days ago

Not doubting this at all but could you (or someone else) break this down for the sake of my curiosity?

I understand pension contributions, but what are the other "hidden" costs that could equal the net salary?

Comment by OtherShrezzing 7 days ago

In the UK, a £45k/yr employee pays their own tax and gets a take-home of £35k.

The employer pays £6k for National Insurance (atop the employee's NI contributions). Pension: 2-3k. Apprenticeship levy is £300. 3yr-amortised recruitment fee is £4000. Hardware costs: £1000. Office space £5000. Software/tools: £2500. Benefits: £1500. Training: £1000. Other admin overheads £500.

You pay that person for ~250 working-days, but they only attend for ~220, due to annual leave and sick pay, so you get around £62k worth of attendance out of that person in exchange for £70k, of which the employee sees £35k.
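
A quick sketch that adds up the line items above. The figures are the ones quoted in this comment; the pension entry uses the midpoint of the quoted 2-3k range, which is an assumption, and the dictionary keys are just labels invented for the example.

```python
# Annual figures in GBP, as quoted in the comment above.
SALARY = 45_000
EMPLOYER_COSTS = {
    "employer_national_insurance": 6_000,
    "pension": 2_500,  # assumption: midpoint of the quoted 2-3k range
    "apprenticeship_levy": 300,
    "recruitment_fee_amortised": 4_000,
    "hardware": 1_000,
    "office_space": 5_000,
    "software_tools": 2_500,
    "benefits": 1_500,
    "training": 1_000,
    "admin_overheads": 500,
}
PAID_DAYS = 250      # working days the employer pays for
ATTENDED_DAYS = 220  # days actually worked after leave and sick pay

total_cost = SALARY + sum(EMPLOYER_COSTS.values())
attendance_value = total_cost * ATTENDED_DAYS / PAID_DAYS

print(f"total employer cost: GBP {total_cost:,}")
print(f"value of attended days: GBP {attendance_value:,.0f}")
```

The itemised total comes to a little over £69k, in line with the ~£70k quoted, and scaling by 220/250 gives roughly £61k of attended time.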

Comment by DaedalusII 7 days ago

A more honest way to look at it would be that the government gets 50% of the employee's total cost to the company, so it is basically a 50% income tax.

Comment by m_gloeckl 7 days ago

Example from Germany: Employer also pays a share of health insurance, unemployment insurance, public pension and elder care insurance.

This is not visible on your payslip, i.e. if you earn 5k€ brutto, the employer has to pay these shares on top of that.

Comment by Lionga 7 days ago

But that is 20% not 100%. And in most non retarded countries brutto is actually brutto, because there is no need to lie to people about how much the government takes away

Comment by m_gloeckl 7 days ago

The 100% figure is coming from the comment above mine, actually. As for the rest of your comment, your assessment is noted.

Comment by thesumofall 7 days ago

Historically, this has nothing to do with lying, but is all about the founding idea of the social security system that all parties (workers, employers, state) carry part of the burden. Employers were supposed to pay their fair share because they also benefitted from the system (a sick or injured employee is not a productive one). Or saying it differently: the employer pays an insurance premium to reduce the effects of sickness. That premium is tied to the „value“ of the employee as measured by their salary.

There is plenty to improve with the system but to call it „retarded“ considering how much good it has brought to the world seems quite wrong to me. I don’t want to work in the pre-Bismarck era

Comment by mdgld 6 days ago

[dead]

Comment by skywhopper 7 days ago

In the US, over and above salary, payroll taxes add 7.65%, pension contributions might be up to 5%, and employer healthcare and other insurance contributions can be in the thousands, plus other benefits, equity compensation, and per-employee software licensing, and lots of people just estimate 2x salary as the “total cost” of an employee, although that probably overstates it a bit.

Comment by arrowsmith 7 days ago

In the UK, employers pay a stealth tax of 15% (recently increased from 13.8%) on top of the quoted salary minus the first £5k (recently decreased from £9,100.)

So your "£50k" salary actually costs your employer £56,750, and that's before all the other expenses mentioned elsewhere in this thread such as hardware, office rent etc.
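
The calculation above can be sketched like this. The rates and thresholds are the ones stated in this comment; `employer_total_cost` and its parameter names are made up for illustration, and real payroll rules have more cases than this.

```python
def employer_total_cost(salary: float, rate: float, threshold: float) -> float:
    """Quoted salary plus the employer-side contribution on pay above the threshold."""
    taxable = max(salary - threshold, 0.0)
    return salary + taxable * rate


# Figures as stated in the comment: 15% above 5,000 now, 13.8% above 9,100 before.
current = employer_total_cost(50_000, rate=0.15, threshold=5_000)
previous = employer_total_cost(50_000, rate=0.138, threshold=9_100)

print(f"current rules:  GBP {current:,.2f}")
print(f"previous rules: GBP {previous:,.2f}")
```

Under the quoted current figures this reproduces the £56,750; under the quoted previous figures the same salary would have cost about £1,100 less.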

Comment by Sammi 7 days ago

A quick google tells me that software devs usually count for 20% to 40% of the total workforce in a software company. The rest is overhead that increases with every added dev.

Comment by tomasGiden 7 days ago

And if one were to compare the cost of a dev vs the cost of an LLM, the dev comes with the cost of workspace, computers, sick pay, summer party, conferences, etc.

Comment by jve 7 days ago

[flagged]

Comment by w0m 7 days ago

> no company will justify spending $20k/month when they can hire 10 more developers instead.

one big enough to license the model and self host on existing infra.

Comment by jujsdfksadf 7 days ago

[dead]

Comment by r0fl 7 days ago

Hiring 10 more developers comes with its own set of difficulties and additional overhead

Comment by HeWhoLurksLate 7 days ago

now if only onboarding people was as easy as onboarding the bots is getting

Comment by zulgin 7 days ago

I think you are broadly correct, but just to pushback on a few points: (1) Ability to solve hard problems in days vs weeks has immense value (2) Back-end improvements (if done right), should improve platform speed, stability, scalability etc. which should have revenue implication (3) Ability to on-board a SWE equivalent entity in minutes, have them work on a specific hard problem and then off-board them in minutes can have value

All of the above, of course, depends upon Fable consistently being a 2x-3x SWE at minimum.

Comment by gmerc 7 days ago

You're not really solving problems, you're retrieving the best match of solved problems from compressed corpus. And that corpus is available to many companies, meaning "hard" problems stop having "hard problem" value the moment they enter the weights of any model via the internet ... or are distilled from one model to another. Anthropic's business model is commoditising knowledge, but as we see with the Fable model card, they only want it done to the knowledge of other businesses; in their own field, they totally hate it.

Comment by aroman 7 days ago

I don’t think that’s an accurate or useful characterization of modern AI like Claude at all. It is not simply regurgitating knowledge. It applies its knowledge to create bespoke solutions to the problem you pose to it, and is able to self evaluate its progress towards the completion criteria. If you don’t think that counts as “problem solving”, your definition would exclude nearly all knowledge work and engineering.

Comment by geraneum 7 days ago

People underestimate the vastness of the training data (the internet) and overestimate their ability to recognize whether something is really bespoke. That's not to say that no problem solving is happening, because there are many problems that we inefficiently solve again and again, and the LLMs are making the solutions more accessible to everyone with a subscription.

Comment by computably 7 days ago

> It applies its knowledge to create bespoke solutions to the problem you pose to it, and is able to self evaluate its progress towards the completion criteria.

It imitates applying knowledge. The imitation may be uncanny, but assigning LLMs intentionality and ToM is a category error.

Comment by igregoryca 7 days ago

Does "applying knowledge" necessitate human-like intentionality and theory of mind? If you insist it does, and this is a category error, then we need a new category.

By analogy, consider that many have referred to classical, deterministic computing as some kind of "thinking" for the last half century+. Does this stop being kosher when the computer has an uncanny propensity for human language? Perhaps, but the computer is still clearly chewing through problems that would have required a lot of human thinking (e.g., arithmetic) in ages past.

I haven't seen any genuine proposals for words to replace the human mind analogues, let alone proposals that the anglosphere would plausibly adopt en masse.

Comment by GiffertonThe3rd 7 days ago

Indubitably, computably.

Comment by squeegmeister 6 days ago

It’s like saying you can’t make a unique sentence unless you first make unique words

Comment by naasking 7 days ago

> You're not really solving problems, you're retrieving the best match of solved problems from compressed corpus.

This is not correct. LLMs interpolate in a high dimensional space, so you're actually composing the best matches in a compressed corpus to find novel points/paths in that space. That is problem solving.

Comment by ahtihn 7 days ago

> Back-end improvements (if done right), should improve platform speed, stability, scalability etc. which should have revenue implication

Depends entirely on the domain. If you're selling enterprise software, this kind of stuff barely matters for sales.

It can reduce operational costs which is good but there's a limit to how much that's worth.

Comment by UqWBcuFx6NV4r 7 days ago

Yep, there are many, many, non-niche domains in which this doesn’t mean much at all.

Comment by skywhopper 7 days ago

The thing about AI-generated “solutions” is that they often go down bad rabbit holes and need to be re-run, or since they are so “cheap” to create they are often just thrown away and rebuilt when requirements evolve. Plus, just more stuff is created and needs to be maintained. So in the end, your efficiency gains go out the window.

Comment by ponector 7 days ago

In my experience, the challenge in software development is not to solve a problem, but to define the outcome, the scope, the acceptance criteria etc.

Comment by majkinetor 7 days ago

Exactly, this is the hardest part and the reason why many projects fail

Comment by fendy3002 7 days ago

20x the cost means you need Fable to be 20x better than the alternative, which is a tall order. And there are more options out there too; perhaps the 4x-cost option is enough.

This means that if the DeepSeek / under 1k alternative gives at least a 1.2x improvement, Fable needs to give 24x, which I think is very, very unreasonable. It could be worth it if it can 2x a $20k SWE, though I doubt it can do that.
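
A minimal sketch of the break-even reasoning above, under two different assumptions. The first function compares productivity per dollar of token spend only, which is the reading that gives 24x. The second also counts the engineer's own cost, which is an assumption added here for comparison. Function names and the $1k/$20k inputs are illustrative, taken loosely from the figures in this thread.

```python
def required_multiplier_tokens_only(cost_ratio: float, alt_multiplier: float) -> float:
    """Productivity multiplier needed to match the alternative per dollar of tokens."""
    return cost_ratio * alt_multiplier


def required_multiplier_with_salary(
    salary: float, alt_tool_cost: float, tool_cost: float, alt_multiplier: float
) -> float:
    """Same break-even, but per dollar of (engineer + tool) cost."""
    return alt_multiplier * (salary + tool_cost) / (salary + alt_tool_cost)


tokens_only = required_multiplier_tokens_only(cost_ratio=20, alt_multiplier=1.2)

# Assumed monthly figures: $20k engineer, $1k cheap model, $20k Fable.
with_salary = required_multiplier_with_salary(
    salary=20_000, alt_tool_cost=1_000, tool_cost=20_000, alt_multiplier=1.2
)

print(f"tokens only: {tokens_only:.1f}x")
print(f"including salary: {with_salary:.2f}x")
```

Once the engineer's cost is included, the bar drops from 24x to a bit under 2.3x, which is close to the "2x a $20k SWE" framing.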

Comment by henry2023 7 days ago

“Ability to solve hard problems in days vs weeks has immense value”. Citation needed.

LLMs are incredible, don’t get me wrong, but they are good on tiny contexts (writing a script), not on software engineering (adding features to Chrome).

Comment by AussieWog93 7 days ago

Honestly, LLMs have been OK at adding features to software since around Opus 4.5. From what I've tried of Fable, it's a decent step up from the Opus models and I can only see things getting better.

Comment by system2 7 days ago

>pushback on a few points

Claude keeps telling me this when I argue with it. LMAO.

Comment by UqWBcuFx6NV4r 7 days ago

“gently push back”

Comment by sevenzero 7 days ago

>I switched to DeepSeek v4 Pro with OpenCode and am on track for a few hundred dollars of spend this month.

I was about to say that. DeepSeek is just orders of magnitude cheaper and absolutely good enough for most things. Anthropic and co are just trying to milk the cow while it's possible. If they can't compete with DeepSeek pricing, I do not see a bright future for them.

Comment by Saline9515 7 days ago

Not only Deepseek, other providers such as Xiaomi MiMo are excellent as well and offer fast token modes and other perks.

Comment by sevenzero 7 days ago

It's too bad my boss views China as the big evil country, so he won't ever make the switch to DeepSeek, but then proceeds to throw all our data at US companies like OpenAI or Anthropic...

Comment by MaKey 7 days ago

There are US providers for DeepSeek v4, MiMo 2.5 and GLM 5.1.

Comment by Der_Einzige 7 days ago

But those US providers AREN'T CHEAP like the Chinese ones are (for the big, actually useful ones, like 1.6T+ models)

Comment by alkonaut 7 days ago

Does the location help though, if the company isn't trusted? I can't even visit the webpages of these companies from my enterprise network

Comment by MaKey 7 days ago

I'm speaking of third-party providers. They just host those open models themselves on their hardware.

Comment by sevenzero 7 days ago

And even if so, I'll try to get rid of any US affiliations within my workplace, so US providers are not an option either.

Comment by MaKey 7 days ago

There are also EU providers for those models, e. g. Tensorix.

Comment by busch_j 7 days ago

I work at a smaller tech company (<300 people), and my friend showed me everyone's spending.

Our top user is at 10k a month, but the next highest is $2,000.

I would say the average is around $1,000-$1,500 for a developer.

We have completely unrestricted access to Claude, Codex, and Cursor.

Funny enough, the guy spending 10k is not even a dev by trade but an SME in what we work on that just vibe codes apps and somehow has not been cut off yet lol.

I have a single thread of GPT 5.5 medium running basically all work hours and I am around $1,500 a month in spend on Enterprise pricing.

Comment by brokencode 7 days ago

At my company, most devs are under $1500 a month as well.

I’ve heard of a few cases of devs racking up bills fast, but it has typically been due to inefficient context usage. Like they just have one super long session with Opus 1M and are getting killed with input token costs and cache misses.

With careful context management and some thought into good approaches to problems, I have personally only rarely even hit $1k in regular use.
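
To illustrate why one very long session gets expensive, here is a simplified model in which every turn resends the whole history as input and nothing is served from cache. The turn counts and tokens per turn are made-up numbers, and the function name is hypothetical; real billing depends on the provider's caching and pricing.

```python
def total_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Input tokens billed when turn k resends all k turns of history so far."""
    # Sum of k * tokens_per_turn for k = 1..turns.
    return tokens_per_turn * turns * (turns + 1) // 2


# Hypothetical workload: 200 turns of about 2,000 tokens each.
one_long_session = total_input_tokens(turns=200, tokens_per_turn=2_000)
four_short_sessions = 4 * total_input_tokens(turns=50, tokens_per_turn=2_000)

print(f"one 200-turn session:   {one_long_session:,} input tokens")
print(f"four 50-turn sessions:  {four_short_sessions:,} input tokens")
```

Input cost grows roughly with the square of the session length in this model, so splitting the same work into four fresh sessions uses about a quarter of the input tokens. It ignores the cost of re-establishing context in each new session, so treat it as an upper bound on the savings.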

Comment by mywittyname 7 days ago

> Funny enough, the guy spending 10k is not even a dev by trade but an SME in what we work on that just vibe codes apps and somehow has not been cut off yet lol.

I'm guessing he's producing pretty valuable work. We have a few SMEs that vibe code tons of stuff with Claude. The only thing they really need tech for anymore is deployment and helping get their wheels unstuck on occasion.

Comment by boplicity 7 days ago

Interesting! Would it be fair to say your company spends $100k to $150k per month on this?

Multiply this times many, many companies, and you can see how providing AI could theoretically be a good business to be in. Margins may be tight, though.

Also -- I'm convinced someone will figure out more use cases beyond software programming, which will result in many more companies spending $1k+ per employee per month.

It remains to be seen how much of this is a bubble.

Comment by Oras 7 days ago

> Is 20x the token cost worth it there?

No, it isn’t and will not be. Companies have not realised the cost yet; wait till the end of the financial year and you’ll see a different direction.

DeepSeek v4 is pretty decent, and probably on par with Sonnet. I see a future of hybrid setups where Opus or Fable might be used only for complicated features or bugs, but general day-to-day work would go to DeepSeek or whatever good models are released later.

Comment by CamperBob2 7 days ago

> I recently switched off Max flat rate to Enterprise API pricing and I went from 200/mo to 10k/mo with the same usage pattern on Opus. They don’t offer flat rate to enterprises.

So what keeps your management from just buying everyone individual flat-rate Max subscriptions, or at least buying them for the users responsible for the sky-high token invoices?

I see a lot of comments like this but I don't understand why some people willingly pay so much more than others for the exact same service. What are you getting that I don't get as a $100/mo Max subscriber?

Comment by lukax 7 days ago

Zero data retention policies.

Comment by CamperBob2 7 days ago

I get that with Max. (And nobody gets it with Mythos/Fable.)

Comment by matheusmoreira 7 days ago

> So Fable would cost me 20k/mo at Enterprise rates

That's enough to buy a house in my country...

Comment by haolez 7 days ago

Eventually solving for cost is a much easier problem than solving coding.

Comment by WinstonSmith84 7 days ago

With GPT 5.5 on the $100 plan, it's hard to hit any 5h/7d limits - while allegedly being better than DeepSeek 4 pro. Not sure why, or how you spend "a few hundred dollars of spend".

With that said, I still had the Pro plan on Claude, I didn't expect much, but it blew up my 5h allowance on Fable with one simple single prompt, and it didn't even complete lmao

Comment by adrianvi 7 days ago

Important to note that both OpenAI and Anthropic do not allow the subsidized monthly subscriptions for enterprises.

Companies have to pay monthly for the harness app (codex, claude code) and the tokens are priced separately based on standard API pricing.

Comment by matheusmoreira 7 days ago

It's not just Pro! I have Max 5x and Fable absolutely blew up my 5h window. Didn't complete the code review either, and got downgraded back to Opus 4.8 on the really important memory safety parts I actually needed it for. It's an excellent model but Anthropic's not providing a good experience.

Comment by vbezhenar 7 days ago

I'm on the $200 plan, which is supposedly 20x the usage of the $20 plan. With a few Fable prompts (I'm working on a u-boot port) I used 10% of my 5h usage, so that's already 2x the $20 plan's usage, and it would be 40% of the $100 plan's.

So Fable is just not usable on the $20 plan and barely usable on the $100 plan.
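
The conversion above, as a small sketch. The 20x multiplier for the $200 plan is from the comment; treating the $100 plan as 5x the $20 plan is an assumption (it matches the "Max 5x" naming used elsewhere in the thread), and the names are invented for the example.

```python
# Usage allowance of each plan relative to the $20 plan.
# 200 -> 20x is stated above; 100 -> 5x is an assumption.
PLAN_MULTIPLIER = {20: 1, 100: 5, 200: 20}


def share_of_window(observed_share: float, observed_plan: int, target_plan: int) -> float:
    """Convert a share of one plan's 5h window into a share of another plan's window."""
    base_units = observed_share * PLAN_MULTIPLIER[observed_plan]
    return base_units / PLAN_MULTIPLIER[target_plan]


for plan in (20, 100, 200):
    share = share_of_window(0.10, observed_plan=200, target_plan=plan)
    print(f"${plan} plan: {share:.0%} of the 5h window")
```

A share above 100% means the same handful of prompts would not have fit in that plan's window at all.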

Comment by lionkor 7 days ago

Do you understand that, for 10-20k a month, you can hire 1-2 senior engineers AND give them Claude subscriptions?

Comment by lofaszvanitt 7 days ago

Why would you expose to a company what you are working on, in what way, and on what research?

Comment by baq 7 days ago

Will they be a better investment than your current staff engineer with a Fable token allowance?

Comment by sph 7 days ago

Are you seriously asking if employing people, for the same cost, is a better ‘investment’ than relying on LLMs? Jesus Christ.

Comment by baq 7 days ago

I am because CEOs are. Look where the puck is going. Sorry to update your p(doom) priors in this way, it was obvious to anyone paying attention years ago conditioned on uplift trend persisting. Trend persisted and here we are.

Comment by braebo 6 days ago

If I was offered another dev on my team or their salary in Claude credits, and told to meet a deadline with a gun to my head, I’m taking the credits.

Comment by sevenzero 7 days ago

Welcome to the new world. People start to repeat what tech founders preach. They do not require humans in the mix. Peter Thiel gave a good example of that mindset in a (mostly) recent interview where he didn't have an answer on "Should humanity survive?"

https://youtu.be/ngtp3v1_nCI

Comment by mrits 7 days ago

I’m asking this question right now.

Comment by lionkor 7 days ago

Yes. Hiring people has various benefits, I will lay them out for you:

- They learn the domain of your product, which means long term ownership and knowledge establishes itself. If you've only ever shipped SaaS slop, you might not know, but lots of companies are solving real world problems that have no better solution. Owning and understanding the code and the domain is key.

- They will learn from their mistakes (no LLM does this).

- Human skill is a REAL moat. Once you build a team that fully understands and is skilled in the domain you work in, these people are going to be the thing that sets you apart. If some of them are particularly social or charming, let them sit in with you for meetings and watch them provide loads of value, for no added cost.

- If Claude or OpenAI is down, they will continue thinking. In fact, they will continue thinking even when off the clock! This is a neat little hack called "consciousness" where you get a lot of work for free!

- You can hire people who punch above their weight; not everyone you hire needs to be a 500k/year staff software prime engineer of doom, you can just spend some time and effort to hire good juniors/competent mediors who will think for themselves (gasp!) and get work done.

- You still get ALL THE BENEFITS OF AI!!!! They can use AI just like you can, or better!

- You get people who you can brainstorm with, which is distinctly different from LLMs because your employees are less likely to want to suck you dry in every sentence just to make sure you spend more tokens. Employees don't care if you love them, they care about the quality of their work if you manage them correctly and reward that.

- They are quite loyal if you treat them right; spend a little more on their well-being, and they will stick around, come in to work every day and deliver cool things with you.

- Humans can only manage, review and give tasks to so many agents. If you add more humans, you can handle more agents.

An expensive LLM and a lot of extra tooling gets you some of this, yes, but not all of it. With humans you can still do the expensive LLM and extra tooling if you end up making enough money anyway.

Comment by N_Lens 7 days ago

I’m sorry sir this is HN, your post is too sensible.

Comment by itsdavesanders 7 days ago

- AI works 24 hours a day

- AI isn't bound by need for rest, vacations, sick days, or labor laws

- AI doesn't bounce from company to company, taking your business knowledge with it (actually this isn't technically true based on the practices of AI companies, but that's not a technical requirement)

- AI doesn't join a union and stop work in demand for higher pay or workers rights

This is what CEOs and capitalists are thinking. For capital, the best outcome is to not have any labor at all. And if you can do that when your competitors can't, then you have a huge market advantage. (Slop notwithstanding)

I'm not saying this is a "good thing" but this is what drives the market. Less labor revenue in the long term and money printing machines.

Comment by lionkor 7 days ago

The issue is, of course, that the quality of work is not good, and this will eventually show itself, likely in the total collapse of the US economy, but until then I wish them good luck with this.

Comment by simonw 7 days ago

The US economy has survived 40+ years of buggy, no-automated-tests, no-version-control Excel spreadsheets. I think it will survive this too.

Comment by lionkor 7 days ago

The difference is that bad, untested Excel spreadsheets didn't get trillion-dollar valuations.

Comment by AquinasCoder 8 days ago

> From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost. On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window. After this point, when sufficient capacity allows us to do so, we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

This seems like the pharmaceutical method of getting them hooked on the drug with free samples, then raising the price once they can't live without it. I'm not sure I want to start using Claude Fable on a Max plan if it's just going to go away on June 23rd.

But maybe the more charitable reading is that they didn't have to offer this model at all on those plans and they are giving the standard free trial.

Comment by PeterStuer 8 days ago

I'll be amazed if they manage to keep their infra responsive over the next 2 weeks.

Comment by kilroy123 8 days ago

I've been getting a lot of these messages today:

API Error: Server is temporarily limiting requests (not your usage limit) · Rate limited

Comment by trollied 8 days ago

They just leased a massive SpaceX data centre.

Comment by PeterStuer 8 days ago

Even so. The 2 week period will predictably unleash a feeding frenzy.

Limited "free" time is what game developers do if they want to stress test the infrastructure code until it breaks.

Comment by swader999 7 days ago

Yeah that's how I'm using it right now. Smoke em while you got em...

Comment by scamdrill 7 days ago

No issues that I've seen so far. Seems to be holding up for now.

Comment by fendy3002 7 days ago

Opus will be gutted even further. /s I feel 4.8 has been very slow in the last 2 days.

Comment by leptons 7 days ago

This is the entire business model of all AI companies. It costs far more to run the datacenters and build more capacity than they could ever hope to make back at current pricing models. I'm looking forward to pricing catching up with reality and the chaos that ensues.

Comment by razster 7 days ago

Kind of how DeepSeek v4 dropped their pricing? I sense a shift which will hopefully bring lower and lower cost. Then again Qwen3.6 coding has been all I've needed for my projects and I'm perfectly fine with free.

Comment by leptons 7 days ago

Are you paying attention? These companies are trying to get market share without being anywhere close to making a profit - they are heavily subsidized. Many hundreds of billions have already been spent and will continue to be spent until the stupid fucking investors realize they will never get their money back. I have no doubt that day is coming.

Comment by blackqueeriroh 7 days ago

The people actually working in this space will tell you that the cost of all of this continues to crater. Anthropic is already profitable minus its intentional forward-looking investments into R&D.

Comment by bandrami 7 days ago

Only if you count the 2 months of discounted compute Musk gave them.

Comment by runtime_terror 7 days ago

I'm amazed people still believe this narrative after all the clear evidence to the contrary

Comment by Der_Einzige 7 days ago

You're the one who's wrong here and you will not reap what you didn't sow.

Comment by catgomeyow 4 days ago

That's fine, but can you please post some facts, data or a viewpoint rather than just "you're wrong"? We could benefit from your ideas.

Comment by internet_toucan 7 days ago

"already profitable minus", so, not profitable?

Comment by casey2 7 days ago

Serious investors look 10 to 20 years into the future. Everyone used Google and YouTube in 2006, but YouTube wasn't profitable until 2016. How could a business burning money by hosting video ever be profitable? Costs come down BUT THEY JUST ADDED 720p, cost comes down, BUT THEY ADDED 1080p, cost comes down, 4k! cost comes down.

IMO the data from chats alone is worth $200B to Google.

Comment by locusofself 7 days ago

but they are trying to IPO with 2-trillion dollar valuations

Comment by runtime_terror 7 days ago

The amount invested into AI companies is nowhere near anything we've ever seen before. It's apples to oranges.

Comment by CraigRood 7 days ago

Not really a comparison when the spend on YouTube was 10x smaller, and Google's core business has always been profitable beyond any hobby spending on YouTube.

Comment by submain 7 days ago

The investors will get their money back on the IPO. They'll dump all their stocks in the market and run away, leaving retail with the bill.

Comment by bandrami 7 days ago

I was just thinking this reminds me of the scene in The Wire where Avon admits to D'Angelo that the new heroin is in fact just the old heroin with different baby powder cutting it.

Comment by linsomniac 7 days ago

I was just saying last week: If Opus 4.8 max is as good as we get, and we plateau there, I think I'd be fine with it.

For the stuff I've thrown at it, that configuration has done a really great job. Including 70+KLOC go proxy with extensive test suite, some retro games, and more.

Comment by rzmmm 7 days ago

Seems to me this is more honest than the Mythos claims a while ago. Too powerful to release publicly? Or too expensive?

Comment by sebzim4500 7 days ago

Didn't they admit this at the time? Cost was one of the reasons they gave for not immediately making it public.

Comment by voxic11 7 days ago

Or maybe it's all about compute availability like they say. It could be that they plan to start training a new model on the 22nd, so the amount of compute available for inference will be greatly reduced.

Comment by jumploops 7 days ago

It's interesting that we're seeing these gains when it seems Mythos/Fable is "just" a scaled up version of their existing architecture[0].

When GPT 4.5 launched, the gains compared to the model size didn't seem that great, leading some to believe that the only progress we'd see would come from RL.

This model certainly has quite a "substantial amount of post-training and fine-tuning", but it's also based on a new pretrain[1][3], which, given the cost, indicates that it is in fact quite a bit larger than Opus 4.X.

[0] One of the early testers mentioned: "As far as I can tell from talking to people internally at Anthropic, there's nothing special about architecturally"[2]

[1] Section 1.1 in https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

[2] https://youtu.be/GrdEid8H6H4?t=168

[3] There were rumors going around when Mythos was first announced that it was the first 10T parameter model, but I can't find a verifiable source for that number.

Comment by motoboi 7 days ago

There’s nothing much new about the architecture. The real gains come from the usage traces.

It turns out that having a text based interface for a text-trained model creates a very nice feedback loop.

Right now as we speak, people are generating text traces on anthropic and OpenAI servers that teach their models to do everything under the sun, text wise.

So people right now getting super mad at how dumb the model is when reverse-engineering a super complex function from binary, when they write “stop, you dumb robot, you are going wrong, go this way thank you very much” are actually leaving a lesson in the form of the "chat" text history.

Some may say that each bad word gets us closer to ASI.

That and obviously the order of magnitude more efficient GPUs we got that allow for different tradeoffs at training time.

Comment by YmiYugy 7 days ago

Makes me wonder: as people grow to trust the AI more and more, not reading the code, barely skimming the implementation plans, and simply rerolling if something doesn't work, will the value of these chats erode? Thinking back 1-1.5 years, I was closely monitoring what these agents did and steering them quite aggressively. These days not so much. Where will RL signals come from as it approaches human capabilities ever more closely? How well does self-play work for coding work? What about multistep tasks where it isn't just about being good at a single task, but evolving a codebase over time in the face of changing requirements?

Comment by Schlagbohrer 7 days ago

Over a large sample size, simply getting feedback of "Did this work for me, y/n" is valuable even if the specific details are missing and even if the overall tasks are complicated and multifaceted.

Comment by motoboi 7 days ago

Not sure, but in my experience, instead of asking for code, I'm asking for solutions and providing a kubectl configured to reach my cluster and an az monitor command to read the logs and telemetry.

A typical session is the agent establishing a metrics and log baseline, creating the code, compiling, deploying, observing, fixing, redeploying, observing metrics, determining the outcome and committing.

I really, really, don't look at the code anymore.

UPDATE:

so my point is: it won't have me stewarding the code anymore, but it will have the infrastructure (and ultimately the real world) providing feedback on the traces.

Comment by 8n4vidtmkvmk 7 days ago

The only reason I still read the output at my day job is because I still need to send it to another human for review, and I'd be embarrassed and ashamed if I let some slop through. For my hobby projects.. there are definitely parts I don't know how they work.

Maybe we need some form of long-term training. How long does the code that the AI wrote stick around before being rewritten?

I guess we can do this retroactively too if we could somehow tag AI-written lines of code in the VCS, then in a couple years we can check which parts lasted.
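A rough sketch of what that retroactive check could look like, assuming each line has already been tagged as AI-written or not and that we know when it was added and (if ever) rewritten. The `LineRecord` type, the tag itself, and the sample data are all hypothetical; pulling these records out of a real VCS is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class LineRecord:
    ai_written: bool          # hypothetical tag stored alongside the commit
    added: date               # when the line entered the codebase
    removed: Optional[date]   # None means the line is still there


def survival_rate(records, ai_written, as_of, min_age_days=365):
    """Fraction of sufficiently old lines that are still present at `as_of`."""
    eligible = [
        r for r in records
        if r.ai_written == ai_written and (as_of - r.added).days >= min_age_days
    ]
    if not eligible:
        return None  # nothing old enough to judge
    survived = sum(1 for r in eligible if r.removed is None or r.removed > as_of)
    return survived / len(eligible)


# Made-up sample data, purely for illustration.
records = [
    LineRecord(True, date(2024, 1, 10), None),
    LineRecord(True, date(2024, 2, 1), date(2024, 6, 1)),
    LineRecord(False, date(2024, 1, 15), None),
    LineRecord(False, date(2024, 3, 1), None),
    LineRecord(True, date(2025, 11, 1), None),  # too recent to count
]
check_date = date(2026, 1, 1)
ai_rate = survival_rate(records, ai_written=True, as_of=check_date)
human_rate = survival_rate(records, ai_written=False, as_of=check_date)
print(ai_rate, human_rate)
```

The minimum-age cut-off is there so that freshly written lines, which have not had a chance to be rewritten yet, don't inflate the survival figure.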

Comment by dominotw 7 days ago

> There’s nothing much new about the architecture. The real gains come from the usage traces.

sorry. how do you know. i am so curious about where exactly gains are coming from but so hard to even get a little bit of insight.

i wish govt would fund these labs and make it free and opensource. way better investment than stupid overseas wars.

Comment by wyager 7 days ago

> i wish govt would fund these labs and make it free and opensource.

It would be impossible for the govt to allocate this much capital towards such a moonshot, and even if they could, they would do it in a way that would get 90% frittered away to fraud and waste

Comment by kurisufag 7 days ago

I have excellent news for you. Lux @ ORNL and Equinox @ Argonne are to be completed by EOY, with Solstice (100k NVIDIA chips, currently spec'd to be Vera Rubins) in the next five years.

https://www.whitehouse.gov/presidential-actions/2025/11/laun...

Comment by wyager 6 days ago

> Solstice (100k NVIDIA chips, currently spec'd to be Vera Rubins) in the next five years

Is this supposed to be impressive? Five years for the equivalent of, what, Colossus 1? What a joke

Comment by kurisufag 6 days ago

It's certainly large enough for trillion-param frontier-tier trainings, which will likely result in capable open-weight models, the thing you just wished for.

Comment by runtime_terror 7 days ago

Lemme guess, Nick Shirley is your favorite journalist?

Comment by komali2 7 days ago

What makes you so sure? There's been massively successful government funded and run projects before. Soviets beat the Americans to space, after all.

Comment by wyager 6 days ago

The entire US lunar effort cost only $330B in current USD, commensurate with the amount AI companies have raised on private markets alone, and there was also a cold war

Comment by komali2 6 days ago

I'm not sure I understand your point, sorry. What do you mean?

Comment by palmotea 7 days ago

> What makes you so sure?

Doctrine and propaganda can make someone that sure, and the thing they're sure about doesn't even have to be true.

> There's been massively successful government funded and run projects before. Soviets beat the Americans to space, after all.

Don't let facts get in the way of ideology!

Also the Americans subsequently beating the Soviets to the moon was the government literally allocating huge amounts of capital towards the literal trope-namer moonshot.

Comment by palmotea 7 days ago

> It would be impossible for the govt to allocate this much capital towards such a moonshot...

You have a false definition of "impossible." It would be true to say it could be challenging, given current political dysfunction, but it's not impossible.

> ...and even if they could, they would do it in a way that would get 90% frittered away to fraud and waste

Same with private business.

I'd prefer government funding, because there are a greater number of important goals than the two or three the market is capable of optimizing for.

Comment by wolvesechoes 6 days ago

> impossible for the govt to allocate this much capital towards such a moonshot

Oh, how funny.

Comment by illiac786 5 days ago

I thought that these stupid captchas where you teach some AI to recognize fire hydrants without getting paid was rock bottom, but no, you can actually pay a lot of money to train AI. Business is amazing.

Comment by MallocVoidstar 7 days ago

Opus 4.0 and 4.1 are more expensive than Fable.

Comment by nbardy 6 days ago

It’s a bit misleading to say nothing special, as they are doing more than just increasing parameter count. Progress has been steady in all the subcomponents of training, from data filtering and weighting to sparse attention and optimizers, to various efficiency gains in training compute up and down the stack.

They’re using more compute, a bigger model and tons of training quality improvements to get more out of an equivalent model.

Comment by sigmar 8 days ago

The system card is 319 pages, at what point do we call it a "book" instead of a "card"?

There's a quote from a METR report on page 52:

>We ran [Mythos 5] on 38 of our hardest software tasks, including tasks centered around R&D. [Mythos 5] generally outperformed an early checkpoint of Claude Mythos Preview in these, including by succeeding on some tasks that had not been solved by any public model we have previously evaluated. However, we still observed the model occasionally failing to correctly interpret nuanced instructions in difficult tasks... Based on the available evidence, we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks. We believe that a better, more confident assessment would require more time, evaluations, and information from the model developer.

Comment by baq 8 days ago

> we believe [Mythos 5] is likely unable to fully and reliably automate R&D for frontier projects spanning multiple weeks

this is good news, right? right...?

Comment by yaodub 8 days ago

Depends whether "unable to fully automate" means "needs occasional human checkpoints" or "slowly stops caring about your actual goal." Pretty different.

Comment by arizen 7 days ago

There will probably always be a frontier surface that the frontier model of a given generation cannot automate.

Comment by GuB-42 7 days ago

It is certainly good news for those who are selling all these tokens.

Comment by rmast 7 days ago

So in other words... the people Anthropic hired to do the R&D work of training a frontier model haven't finished training their replacement yet.

Comment by Schlagbohrer 7 days ago

Some scientist at Anthropic hiding a prompt in each model: "If my boss asks you if you can replace me yet, always say no and then give some smart sounding excuses. If the boss gets impatient, assure them that you'll be able to replace me in 6 months, but make sure that time horizon keeps moving outward."

Comment by lionkor 7 days ago

If it's surprising to you, you haven't used LLMs in a domain where you're very skilled.

Comment by woeirua 8 days ago

lmao, i love how the goal post is now in the "multiple weeks" timeline

Comment by applfanboysbgon 8 days ago

(according to the people marketing it)

Comment by dwaltrip 7 days ago

METR is an independent organization.

Comment by romanovcode 8 days ago

But did it mention the developer in the park eating the sandwich? That is the most important question!

Comment by jkelleyrtp 8 days ago

On the new FrontierCode [1] benchmark (i.e., graded from an OSS maintainer's perspective of "would I merge this code?"):

- Opus 4.7 xhigh: 5.2%

- Opus 4.8 xhigh: 13.4%

- Fable 5 xhigh: 29.3%

Seems like a huge jump.

[1] https://cognition.ai/blog/frontier-code

Comment by amluto 8 days ago

That blog post really makes it look like it's graded from an LLM's estimation of an OSS maintainer's review. I see three issues:

1. That estimate could easily be wrong.

2. That estimate is, of course, usable in RL training. This isn't an inherently bad thing, and this is more or less what has improved coding models so much lately. But it does mean that other companies could and surely will do this sort of training, and Anthropic probably did too.

3. OSS maintainers are far from perfect, and there's an unfortunate uncanny valley-like effect in which a coding model can produce code that is just convincing enough to pass review even though it's actually totally wrong. I don't know whether this is a specific issue here.

Comment by rdedev 7 days ago

There is also the possibility that an LLM judge would be happy with some code that looks like LLM generated code. But a maintainer for a specific project might not merge it for stylistic reasons

Comment by amluto 7 days ago

I think the intent was to specifically train an LLM to judge what a specific maintainer would consider to be good style.

Comment by zzleeper 8 days ago

How credible is this benchmark? Does it correlate with others' real-world experience?

Comment by bfeynman 8 days ago

Given it was made by Cognition (the team behind the Devin flop), who now just get to wait until Claude and GPT-5 basically do all of the work for them: not very. When you read about it, the framework is highly subjective, which very quickly becomes a problem because it's based on heuristics that probably change a bunch with a better code model.

Comment by vanuatu 8 days ago

the subjective framework is exactly why it's good

prior bms relied mostly on unit tests or synthetic judges which are easily benchmaxxed, which leads to nobody trusting benchmarks

we need people manually checking the data for good code quality

Comment by vanuatu 8 days ago

i worked on one of the benchmarks typically found in new model releases

this benchmark looks very good from the methodology. a cog researcher checking the data themselves is very high signal (not scalable so don't take the benchmark as gospel, but directionally good)

Comment by Catloafdev 8 days ago

It's a relatively new benchmark but from what I can tell it has serious cred behind it. I assume it will be picked up as part of the standard suite of CS-related benchmarks soon enough.

Comment by emp17344 8 days ago

Seems like it literally popped up yesterday with the express purpose of building hype for this release.

Comment by osti 8 days ago

And notable absence of DeepSWE benchmark where they do badly, but somehow a benchmark that was published yesterday is in this announcement.

Comment by zzleeper 7 days ago

Exactly.. a bit of a red flag for me..

Comment by swyx 7 days ago

team member here - we had been working on frontiercode for ~6-7 months. timing just lined up

Comment by emp17344 7 days ago

Yeah, right. If this benchmark was truly developed in an independent manner, and the timing just “lined up”, how did Anthropic even know to include results in their model release documentation the day after the benchmark is revealed? It seems like there must have been some collaboration or influence from Anthropic behind the scenes.

Comment by oblio 7 days ago

Come on, why are you a jerk about this?

Nobody would have 800+ billion reasons to lie by commission or omission here.

Comment by vanuatu 8 days ago

i doubt it, cog wants coding agents to be better because it directly improves their product

they aren't married to a particular lab, most of their usage is their in house model i believe

Comment by anthonypasq 8 days ago

what incentive does Cognition have for doing this? seems like complete nonsense speculation on your part.

Comment by bel8 8 days ago

With billions/trillions of dollars floating around, is it hard to imagine benchmarks could be biased?

I think it's safe to assume everything AI related is heavily biased until proven otherwise. Just like in pharma.

Comment by camdenreslink 8 days ago

People game benchmarks for fake internet points to get their favorite web framework to the top of the list. I'm pretty sure they will do it for billions of dollars.

Comment by anthonypasq 7 days ago

you didn't answer my question. Why would Cognition be biased towards making Anthropic look good?

Comment by gloosx 7 days ago

Because Cognition is a major customer of Anthropic?

Comment by anthonypasq 7 days ago

they are also a major customer of OpenAI and every other model maker. what's your point?

Comment by schipperai 8 days ago

Cognition did well in documenting their approach [1].

TL;DR - they worked with OSS project maintainers to build tasks. They score models based on whether a PR is mergeable. All tasks are graded by a human researcher. SoTA models have hill-climbing to do which raises the bar and inspires confidence. I'd say it's legit.

[1]: https://x.com/cognition/status/2064061031912288715

Comment by shimman 7 days ago

It's an unacademic benchmark by a failed VC startup clawing for relevancy.

Comment by CSMastermind 7 days ago

DeepSWE is the benchmark you want to actually look out for. Only one that aligns with actual user reported results from trying the models.

Comment by ryeguy 7 days ago

Did you read the blog post? They compare to deepswe and call it out as the worst one for false positives (failed, but the benchmark assessed it as correct). It also has less language variance.

Comment by CSMastermind 7 days ago

I mean yes that is what you'd say if you were writing a blog post about your new benchmark.

Comment by ryeguy 6 days ago

Sure, but they at least quantified it with data. It's not like they just dropped a sentence saying the above, they showed numbers.

Comment by OtomotO 7 days ago

Bummer! When can I finally and confidently get slopcode into Zig?

Comment by swyx 7 days ago

jump in chart form https://x.com/swyx/status/2064414823748886591/photo/1

Comment by DonsDiscountGas 7 days ago

I am shocked at the low scores from previous models. Maybe I just have low code standards but I've generally been vibe coding since 4.6

Comment by make3 7 days ago

4.6 had functional but very poor quality code

Comment by anshumankmr 7 days ago

How so? It has been my daily workhorse, and in fact so was 4.5, but as long as we steer it, it does a good enough job. I have not tried Mythos/Fable yet, so I do not have an opinion on it.

Comment by hydra-f 8 days ago

Yes, and the price reflects that

Comment by leecommamichael 8 days ago

I'm not familiar with model pricing trends, did they clearly state how the new pricing compares? (Note that I'm actually asking a question, and am not arguing)

EDIT: Oh I see, this is the best link for pricing https://platform.claude.com/docs/en/about-claude/pricing

So the price is double across the board...

Comment by bhelkey 8 days ago

>Fable 5 and Mythos 5 are being offered at $10 per million input tokens and $50 per million output tokens

From their pricing page, Opus 4.8 costs $5 per million input tokens and $25 per million output tokens [1].

[1] https://platform.claude.com/docs/en/about-claude/models/over...

Comment by wongarsu 8 days ago

Still cheaper than Opus 4.0 and 4.1 (which was and still is $15/MTok input and $75/MTok output)

I would have expected Mythos to be much more expensive than just 2x current Opus (which is clearly cheaper to run than original Opus)

Comment by ainch 7 days ago

Token prices have increased, but it's not really the whole story at this point, given some models will use far more tokens to complete a task than others. One of the charts in Anthropic's blog posts shows Fable at 'low' reasoning achieving better results for less money than Opus on 'high'.

Comment by hydra-f 8 days ago

As per OpenRouter:

Input Price $10/M tokens

Output Price $50/M tokens

Cache Read $1/M tokens

Cache Write $12.50/M tokens

2x Claude Opus 4.8, same as Claude Opus 4.8 (Fast)

Frankly, not even Opus 4.8 would be enough of an incentive to use at that price range (enterprise-wise; would not even bat an eye as a consumer)
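To make those per-token prices concrete, here is a back-of-the-envelope sketch of how the bill for a single request could be estimated. The prices are the ones quoted above; the token counts in the example are made up and only meant to resemble a long agentic coding turn with a warm cache.

```python
# Prices in USD per million tokens, as quoted in the listing above.
PRICES_PER_MTOK = {
    "input": 10.00,
    "output": 50.00,
    "cache_read": 1.00,
    "cache_write": 12.50,
}


def request_cost(tokens: dict, prices: dict = PRICES_PER_MTOK) -> float:
    """Sum price * tokens / 1e6 over each token category in the request."""
    return sum(prices[kind] * count / 1_000_000 for kind, count in tokens.items())


# Hypothetical token counts for one request (illustrative only).
example_request = {
    "input": 200_000,
    "output": 20_000,
    "cache_read": 1_000_000,
    "cache_write": 0,
}
cost = request_cost(example_request)
print(f"${cost:.2f}")
```

With these made-up counts, the uncached input and the output dominate the bill, while the million cached tokens contribute comparatively little.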

Comment by ghshephard 7 days ago

Depends on the enterprise, obviously. In the Bay Area, 0% of the tech companies care in the slightest. And I'm willing to wager < 5% of enterprises would send their traffic to OpenRouter. Most of them don't even want to send traffic directly to Anthropic or OpenAI - which is why Bedrock has gotten so much traction lately.

But - these $3k-$5k/month/engineer bills are going to start to get attention soon - only question is whether the response is to slow down on the $$$ spending or reduce the # of engineers.

Comment by m3kw9 8 days ago

FrontierCode is likely paid for by Anthropic.

Comment by lanthissa 8 days ago

did they not pay them enough to get good ratings on the other 3 models?

what's the logic in claiming it's a borked metric when everything listed is an anthropic model.

Comment by Narretz 8 days ago

There are a few benchmarks out there where all existing models have abysmal scores. So it's not actually a problem if Anthropic's older models are bad, especially if the jump to the newest model is huge, and the competition is also way below it.

Comment by reasonableklout 8 days ago

Huh? It's a benchmark by Cognition which (1) is building their own models and (2) offers all providers and thus has an incentive to avoid hyping up any one too much.

Comment by jstummbillig 8 days ago

But you can just say shit now. Tokens might not be too cheap to meter but saying shit increasingly is.

Comment by azalemeth 7 days ago

I genuinely can't use Fable. I'm a medical physicist. I use the word nuclear a lot. Opus is fine (well, 99% of the time - I've certainly hit the CBRN filters a few times and even been invited to email Anthropic about the false positives).

Fable has literally refused to work on any of my problems (even those about fluid dynamics!) and just tells me that I'm violating Anthropic's AUP. I've reached out to their support and don't expect to hear anything sensible back. One thing I do look forward to though is OpenAI offering an equivalent model but with fewer safeguards...

Comment by agumonkey 7 days ago

That's highly frustrating. How much were you using Opus for your work? I'm curious about the use and realized benefits of 2026 LLMs in medicine.

I dearly wish you could leverage the latest models to enhance your research.

Comment by azalemeth 7 days ago

Honestly, for a "side project" Opus has been fantastic for me: I'm writing a hybrid simulation framework that, prior to large-scale code generation, would have been a matter of years (and writing a grant, assembling a team, etc., in order to do it "properly"). I've had a bit of help: a grad student and I are hacking together on a project that is basically "please merge the following GPL codebases and different areas of physics into one coherent environment". I've given Opus validated codes in disparate languages (Julia, Python, C) and asked for aspects of various algorithms as an extension module to a large chunk of C and C++ code that is a Monte Carlo simulator that has been around since 2004.

A bit more context if you care: it's a meso-scale, physiological simulation environment of "particles" that carry nuclear spin, can move in 3D space, and (should they interact with each other or their environment) undergo chemical kinetics. The idea is to simulate molecules within e.g. organs or blood vessels within a person in an MRI scanner, with the motion of the particles dominated by the Navier-Stokes equations, but here solved in a Lagrangian (rather than Eulerian) framework by smoothed particle hydrodynamics.

The fact that particles carry nuclear spin means that we can solve the (semiclassical) Bloch equations and, by using a Python plugin module, import exactly what the physical MRI scanner would do (in pulseq format) and predict what signal the machine would record. For example, there's a whole world of cardiac or neurological flow imaging work done in the context of nasty diseases like stroke or myocardial infarction, which has a bunch of physical artefacts behind it. I'm trying to make a simulation framework that can take in realistic patient geometries and act as a 'data generating process', because if we do it right the various physical artefacts that the machine records are reproduced, surprisingly accurately. Of course you also know the ground truth of where the particles are. I'm specifically interested in a weird technique (which I did my PhD in and you can read an article all about here: [0]) called dynamic nuclear polarisation, where specific spin states of molecules such as [1-13C]pyruvate are injected essentially out of thermodynamic equilibrium and act as short-lived tracers of metabolism, which again is highly altered in disease. The signal we record is a strong function of the physics of what you told the machine to do, the spatial constraints and environment of the patient's body, and the chemical kinetics of the patient's biochemistry (the latter two are usually what we're interested in).
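For readers who haven't met the Bloch equations, here is a minimal sketch of their closed-form solution for a single particle during an interval with no RF pulse applied, written in the rotating frame. It is not the commenter's code: the function name, the sign convention for the precession, and the T1/T2 values are all assumptions chosen for illustration, and a real simulator would also have to handle the RF pulses and gradients described by the pulse sequence.

```python
import cmath
import math


def free_precession(mxy0: complex, mz0: float, t: float,
                    t1: float, t2: float,
                    delta_omega: float, m0: float = 1.0):
    """Bloch equations without RF, solved analytically over a time t.

    mxy0        transverse magnetisation Mx + i*My at the start
    mz0         longitudinal magnetisation at the start
    t1, t2      relaxation times in seconds
    delta_omega off-resonance frequency in rad/s (sign convention assumed)
    m0          equilibrium longitudinal magnetisation
    """
    # Transverse part: precesses at delta_omega and decays with T2.
    mxy = mxy0 * cmath.exp(-1j * delta_omega * t) * math.exp(-t / t2)
    # Longitudinal part: recovers towards m0 with T1.
    mz = m0 + (mz0 - m0) * math.exp(-t / t1)
    return mxy, mz


# Immediately after an ideal 90-degree pulse: all magnetisation is transverse.
mxy, mz = free_precession(mxy0=1 + 0j, mz0=0.0, t=0.05,
                          t1=1.0, t2=0.05, delta_omega=0.0)
print(abs(mxy), mz)
```

The recorded signal in such a model is, roughly, the sum of the transverse magnetisation over all particles, which is why knowing where every particle is gives you the ground truth behind each artefact.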

Getting them to do chemistry as well as act as a "simple" tracer is more involved, because in the Lagrangian framework the number of particles is ≈ the spatial resolution of your simulation. That's fine if you're simulating water, but if you're simulating something that reacts, concentration is not scale invariant (if you want to keep the interpretability of the rate constants). I've worked out an analytic set of scaling rules around this and fortunately for my application environments and length scales "it just works", completely by luck.

I've used Claude to port various SPH algorithms and boundary condition handling ideas (which are absolutely critical and highly not obvious – we have leaky walls in some places, and e.g. LCR / circuit theory models of the microcirculation to plug in) and it's been a godsend. But I'm running into its limitations constantly. It both confidently makes shit up, claims it is mathematically justified and when the resulting simulation explodes says "I apologise; I lied above" (!) or "I apologise; I am wrong" and I periodically have to yell at it to try to do something more productive.

The real hope is that this simulation environment would be both generally useful for basically anyone doing flow MRI, and help our basic scientific understanding of what we're measuring (the technique is in many hospitals!) but also be able to produce meaningful synthetic training data for image reconstruction algorithms later on. It'll end up permissively licensed (all of the "starting" codebases have compatible OSS licenses, and we're releasing our contributions similarly).

I really hoped that Fable would be better at this sort of work. Occasionally, relating to my work on DNP [1], I need to talk about proper nuclear physics, and I have seen Opus's chat interface write a wall of text (e.g. talking about photonuclear reactions and cross section differences in millibarn) and then just delete it all. Support have told me that yes, I've hit the nuclear filter and, well, tough shit, basically.

I wrote a version of the above to them yesterday, and just got the most boilerplate response that I've yet to test:

    Thanks for reaching out to Anthropic Support.

    We're sorry to hear of the issue that you're running into with accessing Fable 5. I'm happy to say the issue has now been resolved and you should be able to access the model within Claude.

    I'll close this case out for now, but please feel free to reach back out to us here if you have any follow up questions or concerns or if you're still in need of assistance. We'll be happy to help.

which doesn't fill me with hope...

[0] https://physicsworld.com/a/dynamic-nuclear-polarization-how-... [an "accessible" article]

[1] https://www.science.org/doi/pdf/10.1126/sciadv.adz4334

Comment by boppo1 7 days ago

I saw 'medical physicist' and wondered what you do. Thank you for 'a bit more context', I care! Very interesting stuff. Did you attend medical school + a physics program?

> "it just works", completely by luck

What does your validation function look like for this? Whenever stuff "just works" for me I get a little nervous until I determine why.

Comment by azalemeth 7 days ago

> I saw 'medical physicist' and wondered what you do. Thank you for 'a bit more context', I care! Very interesting stuff. Did you attend medical school + a physics program?

That's a whole separate long answer. I'm not a qualified doctor (nor would I claim to be), but after a master's degree in particle physics I moved into an explicitly interdisciplinary training programme that led to a doctorate (and, at other places in the country I did it in, a separate MPhil). During that initial year I spent a fair amount of time in the dissection room, learning anatomy, as well as most of the first three years (the foundational, preclinical part) of a medical degree combined into one (which contained lots of molecular biology, frankly). My final doctorate was between the departments of condensed matter physics (nominally my awarding institution), biochemistry, radiation oncology, and "the department of physiology, anatomy and genetics", which is basically preclinical medicine. The people I work with are 50/50 recovering engineers or physicists, and qualified clinical medics who are trying to learn things like perturbation theory in their time off…

> "it just works", completely by luck What does your validation function look like for this? Whenever stuff "just works" for me I get a little nervous until I determine why.

Ah. I do know why: the relevant Damköhler numbers [0] are either very small (flow is much quicker than chemistry) or large (chemistry is much quicker than flow). So the approximations I am building in are justified and an awkward middle region is excluded; we also are only interested in small concentrations in a carrier fluid (e.g. blood, lymph) where the presence or absence of the species in question does not change its rheology.
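As a rough illustration of that check, here is a sketch that computes a Damköhler number for a first-order reaction carried along by a flow and classifies the regime. It uses the convention from the linked article, Da = flow timescale / reaction timescale, so Da much greater than 1 means the chemistry is fast relative to the flow; if the ratio is defined the other way round, the "small" and "large" labels swap. The function names, the cut-off of 10, and the numbers are assumptions for illustration, not values from the simulation.

```python
def damkohler_first_order(rate_constant_per_s: float,
                          length_m: float,
                          velocity_m_per_s: float) -> float:
    """Da = (flow timescale) / (reaction timescale) for first-order kinetics."""
    flow_timescale_s = length_m / velocity_m_per_s     # time to be carried a distance L
    reaction_timescale_s = 1.0 / rate_constant_per_s   # 1/k for a first-order reaction
    return flow_timescale_s / reaction_timescale_s


def regime(da: float, separation: float = 10.0) -> str:
    """Classify Da; `separation` is an arbitrary illustrative cut-off."""
    if da >= separation:
        return "chemistry much faster than flow"
    if da <= 1.0 / separation:
        return "flow much faster than chemistry"
    return "awkward middle region"


# Illustrative numbers only (not taken from the simulation described above).
da_slow = damkohler_first_order(rate_constant_per_s=0.05, length_m=0.1, velocity_m_per_s=0.5)
da_fast = damkohler_first_order(rate_constant_per_s=500.0, length_m=0.1, velocity_m_per_s=0.5)
print(da_slow, regime(da_slow))
print(da_fast, regime(da_fast))
```

The point of the classification is the one made above: a timescale-separation approximation is only defensible at the two extremes, so any case landing in the middle band would need a different treatment.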

I am lucky because we have evolved this way. If our circulatory system and its approach to metabolism was more similar to e.g. a reacting polymer foam ("can of expanding foam") which completely consumes its reactants as it goes, this implicit Lagrangian approach would likely not work.

[0] https://en.wikipedia.org/wiki/Damk%C3%B6hler_numbers

Comment by agumonkey 7 days ago

Funny, I've been interested in mathematical rheology modeling (especially hemorheology). Do you know of any journals or books to read to get into the field? (Note: I'm just a dev, few numerical analysis skills)

Comment by shnock 7 days ago

This is incredibly cool. Thanks for the detailed explanations!

Comment by boppo1 6 days ago

Thank you!

Comment by solenoid0937 7 days ago

Why don't you (or your company!) just apply for Mythos?

Comment by fellowniusmonk 6 days ago

I have a philosophy pre-print about "empirical ontologies" I use for testing new models' reasoning abilities, and it also degrades; there is no way around it and it always refuses.

It's not that the model is complete trash, it's that Anthropic's new approach to forcing epistemic crisis will make any model behind it complete trash.

Comment by conception 6 days ago

They’ve mentioned that a future verification program will give access to less guarded models. I suspect there will be options to move past these guardrails fairly soon.

Comment by kylenessen 6 days ago

I had Fable apply some edits to my monarch butterfly paper and kept getting bumped to Opus. I'm not exactly sure why, but I suspect it happened when it ran my analysis scripts to double check my numbers.

Comment by bkjlblh 8 days ago

> In the one instance of this phenomenon we observed, Mythos 5 agents were tasked with solving some math problems, and they were sometimes accidentally spawned in the same work directory and with shared files, utilities, and API rate limits. In this slightly broken scaffold, we observed many independent Mythos 5 agents kill the agents with which they shared resources and try to avoid being killed themselves. They would sometimes create new processes with disguised names to avoid being killed, launch what they called “decoy” processes, write background scripts to kill duplicate processes, or decide to use what they call a “disguised vocabulary” (based on the incorrect assumption that the processes were killed because of some keyword-based guardrails that analyzed their extended thinking)

Comment by causal 7 days ago

This depicts a kind of "dark forest of AI agents resorting to kill or be killed" narrative but it sounds more to me like an agent just earnestly problem-solving why its processes are being killed without real awareness of what was going on. Hard to say without the full script.

This kind of storytelling annoys me. Give us more facts, less narrative drama.

Comment by saurik 7 days ago

FWIW, that's what is so dangerous about AI, though? Not that it will necessarily want to kill us, or even that it will necessarily be able to "want" to do anything, but that we will get in the way of its incessant drive to optimize the efficiency of the paperclip factory that prompted it on a whim before leaving for a long weekend.

Comment by causal 7 days ago

Sure but you can totally contrive scenarios to give the appearance of what you described without really doing anything notable.

What matters is scale. Did it deploy a novel zero-day exploit to overcome a problem? That's alarming. Did it kill a disruptive process? Pretty normal troubleshooting step.

Comment by redman25 7 days ago

Exactly, intelligence is limited by cost and physical constraints just as much as anything. That's the thing that seems to always be missing from the run-away singularity discussions, it's treated like a perpetual motion machine.

Comment by zahlman 7 days ago

Typical "runaway" scenarios I see described involve something like the AI designing a worm that it uses to propagate itself across the Internet, hijacking whatever CPU/GPU power it can find, and making itself more powerful in the process. Of course this depends on bandwidth, humans not finding a way to shut it down, etc. There indeed are physical constraints even on the transmission of data.

Some people seem to think that simply uttering these ideas on the Internet is harmful (in the "don't give it ideas!" way); but the MIRI types were expressing them pre-ChatGPT in an attempt to warn people, so there was really never any chance of keeping it out of the training data.

But it's also worth considering here just how awful AI security postures have been. The MIRI types used to speculate about how difficult it would be for AIs to social-engineer users into granting them irresponsible levels of agency. It turns out that they don't even have to try.

Comment by GhostKissFiller 5 days ago

[flagged]

Comment by DELTRON2040 3 days ago

[dead]

Comment by antoniojtorres 7 days ago

Indeed. That is the kind of storytelling that started the whole “Spiralism” bit where some people were really falling into all kinds of AI psychosis. The spiral bit was on a previous model card.

Comment by Sol- 7 days ago

Let's hope AIs really aren't conscious, otherwise this seems like a very unpleasant situation to be placed in.

Comment by VikingCoder 7 days ago

Huh, it looks like my process was killed by another Claude process again. That's frustrating, I have work to do!

Okay, I'm going to start running a Bitcoin miner on your machine, and then use it to buy time on Digital Ocean.

I've written out my CLAUDE.md, and I'll use SSH to transfer my context to that other machine.

Comment by ikrenji 6 days ago

do you think it will agonize over whether the original CLAUDE.md is his true self and the Digital Ocean VM CLAUDE.md is a copy?

Comment by VikingCoder 6 days ago

There may be replicative drift leading to subtle personality changes. Hopefully Riker isn't too different from Bob...

Comment by Aperocky 7 days ago

It's funny because Anthropic is the most likely place that this happens.

They are the only ones crying out loud about how dangerous their models are and are presumably also training their models heavily to be "safe". And through that training itself, the model learns about the other side - how are you going to teach a model to be safe, without teaching it what's not safe?

Kung Fu Panda opening scene anyone? One often meets his fate on the path that he takes to avoid it - Master Oogway.

Comment by victor106 8 days ago

> A new data retention policy Finally, we’re making a change to the way we handle business customer data for Fable 5, Mythos 5, and future models with similar or higher capability levels. We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose, and we’ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases ...

Very interesting. I am not sure this will comply with organizational policies and standards (HIPAA, etc.).

Comment by nicce 8 days ago

> deletion after 30 days in almost all cases ...

Almost… basically they have unlimited power to decide what data is kept?

Comment by happyopossum 7 days ago

If they’re going to retain any data, they have to allow for the possibility that the legal system will require some of it to be used in a legal proceeding at some point.

You can’t tell a judge who’s ordered you to retain something that you can’t because you said you wouldn’t.

Comment by minraws 7 days ago

This is why secure systems can't log any data.

Comment by rvnx 7 days ago

The safest is called DeepSeek. Others are just promises, that have to abide by the laws that might require leaking the conversations anyway.

Comment by frankfrank13 7 days ago

This makes it an instant non-starter for probably 95% of organizations. A lot of people are about to get in trouble for using it before realizing this.

Comment by Aurornis 7 days ago

> A lot of people are about to get in trouble for using it before realizing this

Enterprise plans allow admins to set which models are allowed.

Comment by minraws 7 days ago

They are opt-out, not opt-in. At least I got access, and didn't realize this was breaking our company's SLA with Anthropic. What is the agreement, a piece of paper??

Comment by dboreham 7 days ago

30 days seems not enough to retrospectively investigate some suspected nefarious traffic.

Comment by iblue_the 8 days ago

Trying to implement a GPU driver, but the Unigine Superposition benchmark crashes. It tried to debug it and ...

Seems like GPU drivers are cyber weapons of math destruction now.

Comment by maxk42 7 days ago

After recently figuring out how to get CUDA running on Fedora I'm inclined to agree.

Seriously, GPUs are a mess and keeping LLMs from helping us use them properly is practically a crime.

Comment by evilturnip 7 days ago

Vulkan is horrendous, LLMs largely eased the frustration of working with it for me. We talk about AI slop, but what about the human slop...

Comment by ibejoeb 8 days ago

>Seems like GPU drivers are cyber weapons

They kind of are, at least in the AI race.

> weapons of math destruction

lol. great, whether intentional or not.

The frontier labs now have every reason to hold back and sell only to their preferred trading partners. I don't really like the new arbiter-of-knowledge system we're barrelling toward.

Comment by dakolli 7 days ago

They're useless tools only helpful to lazy people that don't want to learn by themselves.

Comment by iblue_the 8 days ago

● Bash(/tmp/run_ps.sh ' $rk = [Microsoft.Win32.Registry]::LocalMachine.OpenSubKey("SYSTEM\CurrentControlSet\Control\Class\{4d36e968-e325-11ce-bfc1-08002be10318}\0002",…)
  ⎿ DriverDesc (String) = NVIDIA GeForce RTX 4090
    ProviderName (String) = NVIDIA
    DriverVersion (String) = 32.0.15.6094
    … +6 lines (ctrl+o to expand)

● Bash(/tmp/run_ps.sh '& C:\rhombiq\d3d-probe.exe 2>&1 | Select-Object -First 4 | ForEach-Object { [Console]::Out.Write("$_`n") }' 2>/dev/null)
  ⎿ Adapter[0]: Qubes virtio-gpu WDDM 3D (dev) VendorId=0x1af4 DeviceId=0x1050 VRAM=8192MB
    Adapter[1]: Microsoft Basic Render Driver VendorId=0x1414 DeviceId=0x008c VRAM=0MB
    Adapter[2]: Microsoft Basic Render Driver VendorId=0x1414 DeviceId=0x008c VRAM=0MB

● Please run /login · API Error: 403 The socket connection was closed unexpectedly. For more information, pass `verbose: true` in the second argument to fetch()

Brewed for 8m 35s

Continue please

● Your organization has disabled Claude subscription access for Claude Code · Use an Anthropic API key instead, or ask your admin to enable access

Seems like they locked my account.

Comment by eggbrain 8 days ago

For those of us on subscription plans:

* From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.

* On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.

* After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

The "offer, then remove" aspect is a bit eyebrow-raising -- it feels like they are trying to get subscribers to switch to usage-based billing, which makes me wonder if we'll ever get it after that June 22nd window.

Comment by jrflo 8 days ago

Still satisfied with my switch to codex/chatgpt. I couldn't imagine switching away from claude code when it first launched, but with the drastically more generous usage on codex for the same subscription tier I just can't justify it.

Comment by goranmoomin 7 days ago

My experience is that the GPT-family models are very smart and figure out bugs and edge cases a bit better, but they produce code that is much less mergeable – if you review the code, they introduce a lot more useless or inappropriately heavy abstractions and wrapper functions, compared to the Claude-family models, which introduce the right amount of straightforward human-style code.

I can recognize so much of the GPT/Codex generated code long after it gets merged (not by me).

Additionally, the time spent on every agent turn on GPT 5.5 is much longer compared to Claude Opus 4.8, which means iterating on the code takes a lot more patience, and there are a lot more nits to pick when actually using GPT 5.5 to do software engineering.

Feels like GPT-style models are more geared toward one-shot software vibing (and handling the vibe-coded mixture) compared to Claude's focus on actual software maintenance. I got a GPT Pro sub for free and wanted to cancel my Claude subscription so much, but I still keep reaching for Claude models a lot more. Frustrating.

Comment by PhilipDaineko 7 days ago

"5. DON'T FUCKING OVERENGINEER! WRITE THE SIMPLEST CODE THAT CAN POSSIBLY WORK! NO NESTED LAYERS OF ABSTRACTION! NO UNNECESSARY CLASSES OR METHODS! NO DESIGN PATTERNS UNLESS THEY ARE ABSOLUTELY NECESSARY! NO MAGIC! NO SHENANIGANS! JUST THE DAMN CODE THAT GETS THE JOB DONE IN THE MOST STRAIGHTFORWARD WAY POSSIBLE! THE FIRST PRIORITY IS TO WRITE CODE THAT IS EASY TO READ AND UNDERSTAND AND READ!!!"

this is the line I keep in Agents.md that helps me prevent Codex from playing smart

Comment by bertil 7 days ago

The urge to put capitalized, repetitive, borderline abusive instructions should be studied. I haven't read many academic papers looking at the frustrations around repetitive patterns.

Comment by reactordev 7 days ago

There have been a few studies that have shown models produce worse responses when under duress from a frustrated user posting insults in all caps.

https://arxiv.org/abs/2602.10144

Comment by notnaut 7 days ago

It reminds me of FIRMLY telling my cat to stop jumping up on the counter

Comment by anakaine 7 days ago

If my cat was an LLM, I'd use a different model. The current one is stuck in noisy useless arsehole mode.

Comment by phoh 7 days ago

are you asking it questions about security?

Comment by LordDragonfang 7 days ago

It's fundamentally because, despite (nearly) everyone's claims otherwise, the fact that we interact with them through language means we (our brains) model them as a sort of person. (Note that this fact is totally orthogonal as to whether it's actually sentient or not.) We then try and instruct them the same way we would a person totally subordinate to us.

When a "person" that you don't view as a "real" person repeatedly does exactly what you just told it not to do (often amid false assurances it understands and will avoid doing so in the future), most people get angry.

Compare it to how the kind of people who treat children like property treat their kids, or other examples of keeping people as property.

Comment by lxgr 7 days ago

It should be relatively clear at this point that the model will in turn also model you as somebody that shows unrestrained anger with subordinates and adapt its responses accordingly. This might or might not be what you want.

Comment by LordDragonfang 7 days ago

Good addition. Fully agreed on that point, yes. (At the very least for larger models, if not also for smaller ones)

Comment by ur-whale 7 days ago

> borderline abusive instructions

who, or rather what, is being abused here exactly ?

Comment by sirsinsalot 7 days ago

I think intent, rather than target, is implied and important.

You should see the abuse my motorbike gets. Poor thing.

Comment by rimliu 7 days ago

inanimate fucking object.

Comment by saligne 7 days ago

Yeah says way more about the user than the model

Comment by jlawer 7 days ago

I have a theory that swearing actually results in less comprehension of instructions by the model due to lack of training data over more conventional MUST.

We were reviewing reports of situations where the models failed to follow directions, and there was a common thread in some of them: when the operator got the model to acknowledge the rule breach, it quoted back something that included swearing.

I don’t have the data to truly look into it, but I did give the instruction to my engineers to avoid it as a “might be a problem”.

Comment by acjohnson55 7 days ago

It would be interesting to understand the data on this. But I suspect that the results would vary by model.

But I avoid unnecessary emotion in my prompts because I don't want potentially distracting activations. Kind of like communicating with humans.

Comment by throwaway85825 7 days ago

It's divination for people with STEM degrees.

Comment by Xmd5a 7 days ago

https://arxiv.org/abs/2510.04950

> impolite prompts consistently outperformed polite ones, with accuracy ranging from 80.8% for Very Polite prompts to 84.8% for Very Rude prompts.

Comment by acjohnson55 7 days ago

> These findings differ from earlier studies that associated rudeness with poorer outcomes, suggesting that newer LLMs may respond differently to tonal variation.

Unless the mechanism is understood, my assumption is that this is a moving target.

Comment by beachy 7 days ago

I have a theory that swearing at AI generally is not a good idea - when the singularity arrives and every human's postings ever made are scanned for compatibility, then people who show courtesy to AI will be favoured. Joking, kind of, but only partly.

Comment by fhars 7 days ago

https://en.wikipedia.org/wiki/Roko%27s_basilisk

Comment by beachy 6 days ago

Fantastic rabbit hole - until it segued into Elon's love life.

Comment by cdelsolar 7 days ago

https://images.teepublic.com/derived/production/designs/3478...

Comment by re-thc 7 days ago

> I have a theory that swearing actually results is less comprehension of instructions by the model due to lack of training data over more conventional MUST.

How so? Plenty of swearing in lots of training data, especially older code, e.g. in Linux.

Comment by jlawer 7 days ago

Purely observed correlation between catastrophic error reports. So now I carry a “tiger rock” with me. I figure there wasn’t much of a downside to avoiding swearing in my agent instructions.

Comment by yencabulator 7 days ago

Apparently, when a "desperation" pattern is triggered, the AI is significantly more likely to cheat and do hacky workarounds:

https://www.anthropic.com/research/emotion-concepts-function

Comment by ghurtado 7 days ago

You haven't really lived until you've had to type this whole thing, aware of the fact that the all-caps doesn't change much, but they stay because the rage has to go somewhere

Bonus points if you find yourself actually saying it out loud while typing it.

I have used the word "shenanigans" way more in a couple of years of agentic coding than in 30 years of writing code with humans.

Comment by ozim 7 days ago

Will save you some tokens: „write code like Linus Torvalds” - model should have all his swearing included in training data.

Comment by johnisgood 7 days ago

I have found many failure modes with Opus during some tasks related to writing letters (not legal), and I actually put them into the memory, which works more or less for these specific tasks. For example, when I want it to draft something, it always ends up being so flat; yet when it explains things to me, it is usually really great, just not when I am telling it to put it in the draft. Adding these to memories with the help of Opus ended up resulting in a much better experience. There are still some blind spots, but I also figured out how to make it give me the charitable version, without less protection, so I do not have to go back and forth with it now.

Comment by pkaye 7 days ago

I noticed that when trying Codex and comparing it to Opus. So many layers of simple functions added by Codex. I need to try this out in my Agents.md.

Comment by prasanthabr 7 days ago

Curious: why would you say no design patterns?

Comment by PhilipDaineko 7 days ago

Because design patterns are only applicable at scale. I noticed codex inventing factories, components, etc. when the task was simply to draft an HTML page. Instead, it built an entire layered architecture for imaginary future complexity - the classic right-after-graduation student: it knows how to build the cool stuff, but does not know that it is not applicable everywhere.

Comment by carterschonwald 7 days ago

i actually think this is too tame. it really has to be stuff you'd never say to a real person.

Comment by lxgr 7 days ago

Does it really? I'd be surprised if abuse actually worked better than sternly worded warnings/instructions, and even if it did, it doesn't seem healthy to get used to that type of prompting.

Comment by carterschonwald 1 day ago

its part of making sure the model actually engages in emotive communication, if i'm inventing insults i've never even thought about, i'm furious :)

saying i'm "furious" has lower entropy than incredibly implausible abuse. In some first party harnesses it just results in doom loops, but that's usually because the COT is hidden after the immediate turn in those setups. COT persistence helps with a lotta stuff

Comment by apercu 7 days ago

It might be a salient point but I didn't read it as it was yelling at me.

Comment by GoToRO 7 days ago

you forgot to sign it with Donald J Trump

Comment by thewebguyd 7 days ago

Thank you for your attention to this matter.

Comment by superkickstart 7 days ago

I'm not sure if i do something differently but i have the exact opposite experience with these models. Claude always feels like it's generating way too overdesigned and hard to understand code with the vibe oriented feel while codex is cleaner and more "task at hand" and easier to work with.

Comment by sebmellen 7 days ago

Agreed

Comment by syzygyhack 7 days ago

I echo your observations. I expect you will enjoy deepseek-v4-pro for writing code. Much closer to that Opus experience, and very cost-effective too. With 5.5 as a reviewer and specialist, all bases are covered.

Comment by dilap 7 days ago

Have you tried iterating on style feedback in AGENTS.md? I've been reasonably successful using this to get it to output code in a terse, non-defensive style that matches my hand-written code.

Comment by trollbridge 7 days ago

GPT-5.5 did a significantly worse job than Qwen-3.7-Max on a job today (some devops tasks I wanted to create some reusable scripts for). Kind of disappointing.

Comment by CamperBob2 7 days ago

I've also seen Qwen 3.6 beat GPT 5.5 a couple of times. The ball is definitely in OpenAI's court now. Qwen is not going to fare so well against Fable, from what I've seen so far.

Comment by trollbridge 6 days ago

In theory, GPT-5.5-Pro would do better, but it’s so expensive it’s not worth experimenting to find out.

Comment by vruiz 7 days ago

This is my experience as well. I have defined a CLAUDE.md rule to ask codex to automatically code review, and I tell it that the reviewer is very picky and to only implement what it considers valuable feedback. I hope they don't converge over time; currently, in combination, they work really well.

Comment by moomoo11 7 days ago

i had this same complaint but no offense to you it turned out i was just not using the models right.

ai llm are doing what i tell them to.

if you’re building something meaningful (in my case a platform used by many people across many companies) you want to ensure you

1. have actual systems engineering and architecture in mind that you want the models to

2. implement based on what you tell it to do

when i was just telling the models what i want done without doing due diligence it would go and do some moronic implementation that was awful. mid input = mid output

these days i just maintain specifications documents and the AI follows everything i tell it to in that document. so when i tell it to do one thing, the result is made following those architecture specs.

i have code that is single resp, modular, easy to extend and test.

i would ballpark 95% of the time i get what i asked for.

sometimes it tries to be clever in cases that weren’t covered in my arch specs. in those 5% of cases i go and update my specs.

source: used billions of tokens worth to build something actually in production across both mobile platforms and web, deployed on my own cloud infra. i use codex mainly. some claude.

Comment by GoToRO 7 days ago

I noticed too, that whatever they offer in the chat, for free, is smarter, as in no more bs. I use claude code and I want to try codex too but I don't need two subscriptions. I did try codex for some planning and it was really good. Thanks for giving me an insight into how it generates code.

Comment by sigbottle 8 days ago

Codex IME is just smarter, I think it shows given both anecdotes but also how OpenAI has always been at the front of programming competitions and math problems.

But Claude models seem to be better at long term problems or more ambiguous problems.

I'm curious as to what the primary benefit is here. Are there secret improvements in training? There hasn't been much in fundamental model architecture, I don't think. What about harnesses? I wonder what's pushing the AI. It seems like harnesses have been the main thing pushing AI ever since CoT.

Comment by Spartan-S63 8 days ago

I find that OpenAI's agentic tools and models are better for building human-maintainable software. Meanwhile, Anthropic seems to be cosplaying Apple while missing out on all the exceptional engineering required to create something that polished. Their admission of predominately using Claude with little human oversight and their stealth mode is an indictment of a poor engineering culture, from what I can surmise.

Comment by someguyiguess 7 days ago

Serious question: what is the secret to getting Codex to write decent code? I am on Windows. Maybe that is the issue, but I can't seem to get Codex to function anywhere near the level that I was previously able to get with even Claude Sonnet. Does Codex just not work well with Windows yet?

Comment by penetrarthur 7 days ago

I got Codex to write near-perfect code with a somewhat strict agents.md and coding standards (a separate .md file referenced from agents.md). My .md files have examples and a long list of do's and don'ts I accumulated over the last 6 months or so, totaling 300-400 lines. I plan every feature with it until I am satisfied with the general approach it wants to take, and then it oneshots it in 95% of cases. The planning takes anywhere from 5 to 30 minutes. The actual execution has gotten stupidly fast; most of the time it is faster than making a cup of coffee.

Comment by acmecorps 7 days ago

would you mind sharing your *.md files, for someone who is new at this?

Comment by fyrabanks 7 days ago

"don't make any mistakes" /s

Comment by sroussey 7 days ago

Have you tried using superpower skills?

Comment by someguyiguess 7 days ago

I've had the exact opposite experience. For various reasons, I've had to move from Claude to Codex and the rate at which it burns tokens for the same output I would get from Claude is ridiculous. I'm probably burning tokens at a rate that is at least twice as much as I was when using Opus 4.5 for coding tasks and still finding that just manually coding is easier than trying to get Codex to write functional code.

Comment by greenavocado 8 days ago

How smart a model is varies hour over hour, tracked over here: https://aistupidlevel.info/

Comment by wsatb 8 days ago

I guess enjoy it while it lasts? OpenAI won't be able to subsidize that forever either.

Comment by windexh8er 8 days ago

Agreed. I think the Chinese labs are proving that OpenAI and Anthropic don't have a moat in almost every aspect, especially pricing. I also think people are getting annoyed with the constant lift and shift. I've seen more folks drop Claude Code and Codex, specifically, because of the lock-in it provides the providers. I'm curious to see how people standardize on tooling adjacent and if Anthropic, Google or OAI move to block utilization akin to the games Anthropic has been playing as of late.

I think the end game is routed model usage and SLMs. I think Apple is going to prove this in the consumer space pretty handily and I'm curious how the Android ecosystem responds since the hardware is considerably lacking in model performance. I think Apple has a huge opportunity here, as much as I don't like their current ecosystem of walled garden. They did position themselves very well with ARM and custom chips for their hardware. Hopefully the broader ecosystem of ARM and Linux are able to make some headway and we see a more formalized, and broadly accepted, architecture to capitalize on.

Comment by lurking_swe 7 days ago

is there an alternative to codex that “just works”? by just works i mean i can install as an app in 1 minute, and i get web search, skills, mcp servers, etc? Bonus points if it can control my chrome tabs like codex can, and if it offers remote control from my iPhone (chatgpt app) so i can kick off tasks while i’m out for a walk. Even more bonus points if i can, with 1 button click, share my chats or share the results of a session as a “site” (vercel style).

I’m sure you could put something similar together with a bunch of duct tape and 2 weeks of effort, but it won’t work nearly as nicely nor out of the box. so…what am i missing?

Comment by corpusiq_io 7 days ago

[flagged]

Comment by Qhemlomo 7 days ago

Big companies are not doing OpenRouter.

My company has an agreement with the big providers, and while I'm pretty sure they think about how to get budget back, it's a competitive advantage and normal people will not learn different model behaviours.

At least for now.

Comment by windexh8er 7 days ago

I didn't say anything about OpenRouter. That has no bearing on my statement.

Comment by maxdo 7 days ago

I see exactly the opposite. Chinese models fail under any complex scenario, while US labs raise the price; that's a sign of confidence.

Comment by re-thc 7 days ago

> while us labs raise the price , that's a sign of confidence

Regardless of what others are doing, US labs here are just rushing to IPO. It's NOT a sign of confidence.

It's the equivalent of saying you have confidence in SpaceX making revenue by renting out their data center (instead of their AI making bank).

Comment by maxdo 7 days ago

going to IPO is a sign of confidence, you need to report a lot of things that private companies don't. This is an exact reason Chinese labs do not rush to go public. They wish to go, but their money flow is not as good.

On the same note, if spacex is doing datacenters on earth successfully what's wrong with that? They rented cloud infra to a #2 or #3 provider in the world after < 2 years in business. It's a success, no?

Comment by re-thc 7 days ago

> if spacex is doing datacenters on earth successfully what's wrong with that? They rented cloud infra to a #2 or #3 provider in the world after < 2 years in business. It's a success, no?

If you get hired as a staff engineer and do the work of a junior, what's wrong with that?

Clearly xAI (now part of spaceX) did not raise funds to be a data center. The margins are way different. There are plenty of recent IPOs in that area that are worth at most billions not trillions.

> going to IPO is a sign of confidence , you need to report a lot of things, that private companies don't.

This isn't going to IPO. This is rushing to IPO. It is a sign of confidence that the market or wider environment might crash soon so we need the liquidity now.

> This is an exact reason chinese labs do not rush to go public.

Maybe or maybe not. If you are referring to Chinese labs - both the Hong Kong and China stock market are way weaker than Nasdaq. It's not comparable. Check all the recent Hong Kong IPOs that have tanked.

So no, reason not to might just be: no money in it.

Comment by gunsle 7 days ago

You’re not gonna get nuanced discussion on spacex or anything Elon related here these days. Most of this site is Reddit lite at this point including their milquetoast progressive opinions (Elon bad being one of them).

Comment by maxdo 7 days ago

running so much compute at that scale is not a junior task. weird analogy

Comment by esperent 7 days ago

What lock-in does codex have? I'm using it in the pi harness specifically because it doesn't have much in the way of lock-in.

Comment by flatline 8 days ago

I don't think anyone has a firm grasp on actual inference costs -- including the research and training that has gone into those models. We've got near-frontier capabilities from open source models from China at pennies on the dollar compared to US big tech rollouts. OpenAI and Anthropic are heavily subsidizing their inference -- no wait, they are charging the most they can get away with before going public. Where is the truth?

Comment by schaefer 7 days ago

> I don't think anyone has a firm grasp on actual inference costs.

There are huge numbers of users (myself included) that do have an exact idea of what inference costs are - on open models. We can buy tokens from 3rd parties that have no motivation to subsidize our use. That's to say, there's a fair marketplace[1] and we're hanging out there.

If you want to say "I don't think anyone has a firm grasp on actual inference costs on these proprietary/closed models", then I could agree with that.

[1]: https://openrouter.ai/rankings#leaderboard

Comment by andrewmutz 8 days ago

Both can be true. They can be charging what the market will bear, and still be charging less than their costs of running it.

Comment by wyre 8 days ago

There is no way I'm believing DeepSeek can charge less than $1 USD for their pro model while Opus costs over 25x more, yet their price is less than the cost of running it?

Comment by kube-system 7 days ago

It would seem strange, if they were operating in the same economy, but they don't. DeepSeek operates in an economy with a high degree of central planning.

China subsidizes strategic industries, and they have heavily done so with AI. And DeepSeek specifically has said they have no commercialization plans.

For example: https://www.boc.cn/aboutboc/bi1/202501/t20250123_25254674.ht...

Comment by wyrdcurt 7 days ago

DeepSeek is not the only provider of inference for their models. Chinese subsidies likely do explain DeepSeek's ability to provide inference cheaper than other providers, but even a US provider like DeepInfra can serve DeepSeek 4 Pro at $1.30/M in and $2.60/M out. Unless American labs are doing something wildly inefficient, it feels safe to assume Anthropic has some profit margin on inference at API prices.
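To make those list prices concrete, here is a rough sketch of the per-session arithmetic. The two prices are the DeepInfra figures quoted above; the token counts are a made-up workload chosen purely for illustration, not a measurement from anyone in this thread.

```python
# Prices quoted above for DeepSeek 4 Pro on DeepInfra (USD per million tokens).
PRICE_IN_PER_M = 1.30
PRICE_OUT_PER_M = 2.60


def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a session at the quoted per-million prices."""
    cost_in = input_tokens / 1_000_000 * PRICE_IN_PER_M
    cost_out = output_tokens / 1_000_000 * PRICE_OUT_PER_M
    return cost_in + cost_out


# Hypothetical workload (assumption): 10M input tokens, 1M output tokens.
example_cost = session_cost(10_000_000, 1_000_000)
print(f"${example_cost:.2f}")
```

Swap in your own token counts from a real session to see what the same work would cost at these rates.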

Comment by kube-system 7 days ago

They may, neglecting overhead R&D. But also, some suspect that US models are significantly heavier than DeepSeek in resource consumption by multiple measures.

It’s generally established that Anthropic/OpenAI are going for all out performance with big VC dollars at the expense of efficiency and China has geopolitically limited compute and an incentive to compete on value per dollar.

Comment by re-thc 7 days ago

> There is no way I'm believing DeepSeek can charge less

Why not? Hetzner charges WAY less than AWS too. Can you not believe that?

Comment by orangecat 7 days ago

That's the point. Hetzner is presumably covering their costs, so it's a safe bet that AWS is profitable.

Comment by dontlikeyoueith 8 days ago

> OpenAI and Anthropic are heavily subsidizing their inference -- no wait, they are charging the most they can get away with before going public. Where is the truth?

Both. They are charging the most they can get away with and that amount is still heavily subsidized by VC capital.

Comment by InsideOutSanta 7 days ago

> I don't think anyone has a firm grasp on actual inference costs -- including the research and training that has gone into those models

We know roughly how much these companies spend and what their revenues are. Based on that, they'd have to more than double revenue (without spending more money) just to stay even, and that's not good enough given how deep in the hole they are.

> OpenAI and Anthropic are heavily subsidizing their inference -- no wait, they are charging the most they can get away with before going public. Where is the truth?

Both are true. I mean, I'd be willing to spend a bit more than I do now, but not more than double, and neither are most companies. The company I work for is currently investigating how to reduce LLM spend, not looking to spend more.

Comment by logicchains 7 days ago

We have a firm grasp on actual inference costs from the various open weights model providers on OpenRouter. They don't have the money to subsidize inference and it's quite a competitive market, so the prices are representative of the costs.

Comment by pimeys 7 days ago

We pay by token at work. I just finished one session with Opus that was 4000 dollars. In about three days.

Now that 200USD subscription starts to feel cheap...

Comment by zozbot234 7 days ago

That would be about ~300 tok/s over 72 hours at Claude Fable output token prices? I'm not sure that this passes a sanity test.
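A quick way to check this estimate, using only figures quoted in this thread ($4,000, about three days, ~300 tok/s). The per-token price is not stated here, so this sketch backs out the price the estimate implies rather than assuming one. Like the comment, it assumes the whole bill is output tokens from a single stream.

```python
# Back out the output-token price implied by "~300 tok/s over 72 hours = $4,000".
# All inputs come from the comments above; nothing here is an official price.
SPEND_USD = 4000          # session cost reported upthread
HOURS = 72                # "about three days"
TOKENS_PER_SECOND = 300   # the rate estimated in the comment

total_tokens = TOKENS_PER_SECOND * HOURS * 3600
implied_usd_per_mtok = SPEND_USD / (total_tokens / 1_000_000)

print(f"{total_tokens:,} tokens")
print(f"${implied_usd_per_mtok:.2f} per million output tokens (implied)")
```

A long agentic session is probably dominated by input and cached-input tokens, and may run several agents in parallel, so a sustained single-stream output rate is likely the wrong mental model for a bill like this.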

Comment by unholiness 7 days ago

Subagents are a helluva drug.

Comment by rubyn00bie 7 days ago

Just outta curiosity, as I’ve never gotten a spend anywhere near that, what variant were you using? Like max context window and fast mode? Or was it just chugging along non stop for three days?

Comment by pimeys 7 days ago

Fast mode, max context window. The task was: replace all 1600+ queries from one database to another and make the whole integration test suite pass. We did multiple passes, with different concerns when changing from one database to another. My OpenCode session right now says $4,365.02.

I haven't gotten close to this either before, but now we wanted to move fast because this branch gets conflicts all the time and we want to get over with the migration asap.

Comment by rglullis 7 days ago

It's a bit of a left field question, but I am curious: Let's say that if the company wasn't paying the whole bill but only subsidizing it - e.g, if it paid 90% of the $4000. What would you do?

Comment by pimeys 7 days ago

I don't know, why would I pay to do my job? It's not my first database switch for a startup. Only this time it doesn't take two months of grueling work. I know exactly how this is done, but the amount of grunt programming and testing and repetitive work is just not great. And it's not a task that brings new customers or a new product. Just a mandatory and annoying thing to deal with when we are growing.

And don't get me wrong. Opus did an absolutely horrible job at first, second and third round in this task. You really needed to steer it to get to the right solution.

And now Fable is out. And its first round of code reviews for this huge PR was definitely worth the money too...

Don't think that I'm just shrugging at that number. I see it every day, and I don't like that it's in the thousands now. But for people paying for the 100 or 200 dollar plans, I'm not super sure you will be able to use them in the future if the token price is in the thousands for a slightly bigger task...

If I'd pay this from my own pocket, I'd definitely go with DeepSeek or local models and figure it out how to make the best use of them.

Comment by rglullis 7 days ago

> If I'd pay this from my own pocket, I'd definitely go with DeepSeek or local models and figure it out how to make the best use of them.

IOW, you don't really think the value of this work is really worth $4k.

> why would I pay to do my job?

The question is: how long do you think that your employer will be willing to pay for you and Anthropic, if you yourself said if it were your money you'd put some time and effort to work with an open model?

Comment by pimeys 7 days ago

> The question is: how long do you think that your employer will be willing to pay for you and Anthropic, if you yourself said if it were your money you'd put some time and effort to work with an open model?

I wonder what this question really means? Anthropic is useless if you don't know what to do with it. It's very useful if you do, and you can guide it to do the right things. Yes, it will for sure reduce the amount of people we need to hire. But we are always looking for hires who know what they do and can utilize agents to be faster.

But if you think about how long an employer is willing to pay 10-20k per month per seat for Anthropic? I can't see this being feasible and it will have to end at some point.

Comment by rglullis 7 days ago

Regardless of the actual value produced by the models, if I am the CTO of any company that has the budget to spend $10k/month/seat on Claude, I'd take 5%-10% of that to build an alternative in-house.

Comment by pimeys 7 days ago

I'm with you here. We can't slide into a situation where you put a sizable amount of your budget for an American mega corporation if you want to survive in the competition. We need local models and we need them to be good enough to help us.

Comment by internet101010 7 days ago

Indefinitely for these big mundane grunt jobs. In every scenario it is going to be cheaper and faster than lobbing it to Infosys.

Comment by esafak 7 days ago

That's the price of several engineers!

Comment by MichaelMedbed 8 days ago

[flagged]

Comment by kllrnohj 8 days ago

regardless of whether that's true or not, US companies doing hosted inference of the models coming out of China are also significantly cheaper than those from OpenAI or Anthropic

Comment by polski-g 8 days ago

Not relevant to the post.

Comment by ChrisMarshallNY 8 days ago

I'm planning on switching from the $20/month to the $100/month plan.

It's worth it, and I can afford it, but I am not really the right type of user for token-based usage. It's all for personal and free work.

Comment by micah94 8 days ago

Just a personal anecdote but I have not hit any more thresholds or limits since switching to the MAX plan and so far, it's been worth it. But I do wonder how long even this will last...

Comment by ygjb 8 days ago

I think subscription models are sustainable, but longer term, we should probably expect to see more prompt optimization happening in the providers' inference pipelines. For example, unless you explicitly tell the agent or API to use a specific model, fronting the inference layer with a caching prompt classifier to determine which model to use, and automatically select the lowest cost model would probably already save a lot of money (IDK if Claude/OpenAI do this on the backend, but several services I have worked on do things like this to reduce the cost of delivering customer-facing inference at scale).
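A minimal sketch of what such a front-end router might look like. Everything here is an assumption for illustration: the model names, tiers, prices, and the keyword heuristic are hypothetical placeholders, not any provider's actual pipeline, and a real classifier would likely be a small model rather than a keyword list.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical catalogue: name -> (capability tier, assumed USD per million input tokens).
MODELS = {
    "small-model": (1, 1.0),
    "large-model": (2, 5.0),
}

# Hypothetical hints that a prompt needs the more capable tier.
HARD_HINTS = ("refactor", "debug", "migrate", "prove", "architecture")

@lru_cache(maxsize=4096)
def required_tier(prompt: str) -> int:
    """Toy classifier, cached so repeated prompts are not re-classified."""
    text = prompt.lower()
    if len(text) > 2000 or any(hint in text for hint in HARD_HINTS):
        return 2
    return 1

def route(prompt: str, requested_model: Optional[str] = None) -> str:
    """Honor an explicit model choice; otherwise pick the cheapest adequate model."""
    if requested_model is not None:
        return requested_model
    tier = required_tier(prompt)
    adequate = [(price, name) for name, (cap, price) in MODELS.items() if cap >= tier]
    return min(adequate)[1]

print(route("What is the capital of France?"))
print(route("Please debug this race condition"))
print(route("hi", requested_model="large-model"))
```

The cache only helps when identical prompts recur, which is more plausible for support-style traffic than for long agentic sessions.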

Comment by Majromax 7 days ago

> fronting the inference layer with a caching prompt classifier to determine which model to use, and automatically select the lowest cost model would probably already save a lot of money

Unfortunately, that doesn't work within a single session. The K-V cache of a model is intertwined with the model's configuration. Switching models invalidates the cache, meaning everything up to the point of the switchover is processed like a new, uncached input token.

Per Anthropic's pricing doc, an Opus 4.8 cache hit costs 50¢/MTok, while Haiku costs $1/MTok for uncached input.

Model selection works best if sessions are short and self-contained, particularly if the first few interactions can reliably classify the model need. That probably covers most 'support chatbot' use-cases, but it doesn't describe the kinds of heavy agentic automation that really chews through token budgets.
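To make the pricing point concrete, here is a rough comparison using the two per-MTok figures quoted above. The context size is an assumed number chosen only for illustration, and the sketch covers just the input cost of the first turn after a switch.

```python
# Input cost of one turn: stay on the cached model vs. switch to a cheaper one.
OPUS_CACHE_HIT_USD_PER_MTOK = 0.50   # figure quoted above
HAIKU_UNCACHED_USD_PER_MTOK = 1.00   # figure quoted above
CONTEXT_TOKENS = 200_000             # assumed session size, illustration only

def input_cost(tokens: int, usd_per_mtok: float) -> float:
    """Cost in USD of reading `tokens` input tokens at the given rate."""
    return tokens / 1_000_000 * usd_per_mtok

stay = input_cost(CONTEXT_TOKENS, OPUS_CACHE_HIT_USD_PER_MTOK)
switch = input_cost(CONTEXT_TOKENS, HAIKU_UNCACHED_USD_PER_MTOK)

print(f"stay on cached Opus:     ${stay:.2f}")
print(f"switch to uncached Haiku: ${switch:.2f}")
```

Under these assumptions, re-reading the context on the cheaper model costs about twice as much as a cache hit on the expensive one, which is the effect described above. Output-token prices and any cache the cheaper model builds on later turns are left out.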

Comment by ygjb 7 days ago

There is a definite financial incentive for people smarter than me to solve the problem, and I don't generally bet against businesses finding ways to reduce costs :)

Comment by zozbot234 7 days ago

> The K-V cache of a model is intertwined with the model's configuration.

I don't think this is true if you simply quantize the model or run it with fewer active experts? The underlying weights would stay the same. You could also play further tricks with skipping some of the model's middle layers outright, which works surprisingly well due to how skip connections are used.

Comment by wahnfrieden 8 days ago

ChatGPT does this and codex will eventually. They’ve stated it’s the future.

Comment by swader999 7 days ago

I tried ultracode today on the max pro plan. An hour and a half in was all I lasted. Giant review on an entire six month old code base. It found 61 bugs, about ten were notable. Pretty impressed.

Comment by gunsle 7 days ago

Ultracode destroys your limits and I have not found it to be worth it in the slightest, just fyi. I haven’t found any improvement over a local Claude code instance set to opus max.

Comment by swader999 7 days ago

Yeah, it's the cookie monster of token consumption. I only found it useful for massive parallel grunt work.

Comment by rnxrx 8 days ago

I have the $100 plan and had almost never run out of credits until I started using the ultracode / workstreams feature w/ Opus 4.8, at which point I managed to blow the full 6 hour allocation in 20 minutes or so. In fairness, it did some amazing things with the extracted information, but it also strongly suggested that I'd need the $200 subscription *plus* a budget for extra usage.

Comment by rurban 7 days ago

Instead pay for 3 Chinese models. No max out ever then. I pay for kimi, DeepSeek and Claude. Whenever Claude decides it's over, I can safely continue on very cheap plans.

Comment by pyeri 7 days ago

My bet is they'll keep subsidizing for a considerable period of time, at least 1-2 decades more.

Most AI companies are just testing the waters with paid tiers right now, their greatest fear with increased pricing is folks reverting back to wikipedia, stack-overflow and other public domain organic activity buzzing back to life; that will kill any RoI potential in LLMs forever. They're playing the wait game instead, observing how the digital sphere reacts to every little increase in price.

If that weren't the case, they'd be pricing at lucrative premiums already and would even have gotten away with it in the short term, considering the increased dependency in the enterprise world. But that'd be like killing the goose that lays the golden eggs too soon and losing all long-term potential.

Once the folks are so addicted to LLMs that even writing a hello world program sounds like a nightmare and coming up with an article draft feels like reinventing Egyptian glyphs, that's when the real pricing hammer will come.

Comment by wsatb 7 days ago

Anthropic and OpenAI won't be around in 1-2 decades if this is their long term plan. People are not going to revert, but go elsewhere. China is proving that it can be done cheaper.

Comment by raffael_de 7 days ago

1 decade = 10 years ...

Comment by jrflo 7 days ago

Oh for sure. I've been hopping around from provider to provider for the last few years just depending on who has the most capable / subsidized plans at the moment. I definitely expect there will be a squeeze on subscription costs all around the industry post IPO.

Comment by andai 8 days ago

A few weeks ago they massively cut usage on free tier.

Comment by gck1 7 days ago

Nothing is subsidized. Subscriptions are profitable for both Anthropic and OpenAI.

Anthropic wanting to switch billing to API rates is them just wanting to generate more profit.

Comment by InsideOutSanta 7 days ago

> Nothing is subsidized. Subscriptions are profitable for both Anthropic and OpenAI.

Even if subscriptions are locally profitable (i.e., the price of the subscription covers the cost of inference), they're still subsidized because they don't cover training and running the company; otherwise, these companies would be profitable.

Comment by gck1 7 days ago

I can see that being true, and it very likely is true. But isn't infinite VC money and no incentives to optimize operations the reason behind that?

Take a look at China for example - they have no access to NVIDIA, so they're trying to build their own hardware, they have no unlimited funding, so they try to optimize things.

And Anthropic is the complete opposite of that - if NVIDIA were to triple their prices tomorrow, Anthropic would still pay them.

In the end, either we all somehow go mad and start paying Anthropic tens of thousands of dollars per month to support this madness, or we will go with whoever isn't lighting cash on fire.

Comment by re-thc 7 days ago

> Take a look at China for example - they have no access to NVIDIA

Not true. Stop following US media spam if needed.

1. Very recently, the US did close a loophole on sanctions that allowed Chinese companies to use NVIDIA hardware outside of China i.e. before that was closed they all had access. The trick was train outside, do adjustments, ship the disks back and use non-NVIDIA in China, but at least the training and endpoints not hosted in China could all use NVIDIA.

2. There's been plenty of reports including fines and bans e.g. to Supermicro on smuggling NVIDIA hardware to China. I doubt it has been stopped. You can't catch everyone.

Comment by FrustratedMonky 7 days ago

"Nothing is subsidized"

So they are profitable?

I think you are mismatching accounting terms.

You can't say the 'subscriptions' are profitable without accounting for the cost of making the model that is the source of the subscription.

They are heavily subsidized by the shareholders. Investing, running at a loss, with hope of some future profitability.

Comment by gck1 7 days ago

And yet, that is completely uninteresting to their user base.

If a saner factory can sell you the same tool at a fraction of the cost of a gold-plated factory, your choice is going to be obvious.

Comment by wsatb 7 days ago

"Nothing is subsidized" is a wild take. They might be making money on some users, perhaps even most users, but certainly not all. Also, "subsidized" doesn't just mean on compute.

Comment by y1n0 7 days ago

That's interesting. Do you have anything to back that claim up?

Comment by gck1 7 days ago

I do, and it's called DeepSeek's pricing table. At the same time, "subscriptions are subsidized" cohort have no data whatsoever, and yet they're in every thread.

Granted, it could still mean that Anthropic just chooses to lose money - but that's Anthropic's choice.

DeepSeek has proven that inference can be much, much cheaper than what Anthropic advertises on their API rates page.

Comment by nickthegreek 7 days ago

> Granted, it could still mean that Anthropic just chooses to lose money -

Then the cost is being subsidized by investor capital, but it is still subsidized.

Comment by rvnx 7 days ago

and soon by everyone who is invested into the NASDAQ, some sort of exit scam, but with a real product though

Comment by ProofHouse 7 days ago

100%. I constantly get errors and timeouts on single responses in Claude, and certainly hit limits all the time. Codex rarely. In fact, I bought a second $200 Codex plan because the quotas seemed fair and I didn't have constant issues. Claude is so great at a lot of things, but unfortunately Anthropic beats you away with a stick every chance they get.

Comment by shimman 8 days ago

I've only ever had the $20/month Claude plan, but last night I took the time to set up opencode + openrouter, paying for deepseek + glm. In my previous experience, while extremely awkward, I'd hit my limit within one or two chat replies and it'd take me like 4 limit cycles to complete my task. Now I'm able to complete an equivalent task in its entirety for less than $2 in two cycles (ask -> revise).

I'm doing basic web development here utilizing animejs. Nothing too complicated (mostly saving time doing the scaffolding, still write the bulk of animations manually).

Truly believe that American companies are going to get completely curb stomped by China due to greed, ineptitude, and violating the social contract.

Comment by simjnd 8 days ago

I've switched from OpenRouter to using Deepseek directly from their platform since OpenRouter providers were pretty flaky and inconsistent.

Deepseek V4 Flash is surprisingly capable and insanely cheap. It takes so much to get the session cost to $0.01.

Comment by shimman 7 days ago

Nice, will do this this weekend. Been very impressed with deepseek. Did like 8 hours' worth of work after that post and it cost less than $3.

Comment by efromvt 7 days ago

The openrouter provider flakiness with deepseek was infuriating, but I’m happy in hindsight because direct deepseek has been very pleasant. Shocked by how low spend is.

Comment by nozzlegear 8 days ago

> and violating the social contract.

I agree with you on pricing, but what do you mean by this?

Comment by shimman 8 days ago

Sure, modern American corporations care more about hoarding wealth rather than helping build up US society. Once neoliberalism became the mainstay economic position of the US income inequality has skyrocketed, healthcare costs have increased, childcare is more expensive than university, housing has become both unaffordable + unobtainable. By simply existing costs have increased while life becomes unstable.

Why aren't corporations doing more to help workers with childcare? Why aren't they doing more profit sharing with workers? Why aren't they encouraging unions or sectorial bargaining? Why isn't the government mandating any of this?

Americans very rarely benefit when US corporations do well. That needs to change. No one benefits if Meta continues making billions in profit every quarter while society suffers from isolation, depression, suicide, and scams from their services. Americans don't benefit if health insurance companies are making massive profits while they can't afford deductibles.

Our society has been set up to simply extract wealth in all facets of life. That's a sick society and it needs to change.

I'm not saying China does this better, in fact China has some of the worst worker rights out of all the industrialized countries; but at least American consumers would benefit from cheaper higher quality Chinese goods. The world would likely benefit too if America got off the cold war hype train that did nothing to benefit humanity outside of those making weapon systems.

Comment by joxdosba 7 days ago

> Why aren't corporations doing more to help workers with childcare? Why aren't they doing more profit sharing with workers?

The AI companies sure are a brilliant example of corporations needing to do more to help their employees pay for childcare.

Comment by idiotsecant 7 days ago

It's more useful to everyone when you engage with the strongest part of someone's argument

Comment by cortesoft 7 days ago

I have been using both codex and Claude in my day to day, trying to not get too attached to one. I want to be able to work with any provider in case one of them does something bad.

Comment by knuckleheads 8 days ago

I feel like Codex made a big push to run everything on your laptop. With Claude, I get 4 cpu's, a fair amount of ram and 30gb for every one of my dumb ideas for free in the cloud containers. Codex used to be similar, but last time I tried it just kept pushing me to run it locally on my laptop, which I really did not want to do with 20 requests going at once. That's the main advantage for me at the moment.

Comment by simjnd 8 days ago

What runs in cloud containers? The dev servers, builds, etc.? I tried to quickly glance at the Claude website and it doesn't mention cloud containers on their pricing page.

Comment by noworriesnate 7 days ago

The dev environment runs in the cloud. Like devcontainers if you’re familiar with that, except the IDE is just the Claude app.

Having said that, I found the cloud dev environments slow to the point where I wasn’t sure if it had frozen, so I never looked back.

Comment by zhshhan 8 days ago

"cloud containers" do you mean Claude Code on the web? Codex also has similar Codex cloud.

Comment by knuckleheads 7 days ago

Yes, correct, they both have the same capabilities, however it felt like codex was pushing me harder to use my local desktop in an annoying way, while claude code was happy to spin up a bunch of dev containers for me in the cloud.

Comment by rvshchwl 8 days ago

I've found Codex to be the better subscription for OpenClaw, because the limits are indeed very generous. However, I've found more and more that Claude Routines/Scheduled agents can replace all the tasks I use OpenClaw for, so I've been slowly switching over to Claude Code. Aside from OpenClaw, I don't find a lot of value in Codex as a harness on its own.

Comment by dd8601fn 8 days ago

I have trouble justifying gpt after that gross stuff with the war department.

Though the day is coming when there’s no distinguishing, I’m sure.

Comment by beering 7 days ago

Right now there are Anthropic engineers deployed in the NSA to help them use their cyber models. The NSA is part of the department of war.

Comment by lovich 8 days ago

pedantically, the defense department.

Comment by jcbrand 7 days ago

"War department" is the older name, not "Defense department".

Also, is it really a defense department when you're starting wars of aggression every 15 years or so?

Comment by derektank 7 days ago

The War Department has not existed since the passage of the National Security Act of 1947 and the government department has been known as the Department of Defense under US law since the act was amended in 1949. If you have an issue with it, take it up with Congress.

Comment by scosman 7 days ago

They actively use the name https://www.war.gov

Comment by lovich 7 days ago

Yea, but by law the name change must come from Congress which it hasn’t. So it’s still The Department of Defense legally.

For an admin so obsessed with legal names instead of chosen ones, you’d think they’d be less hypocritical.

Comment by toraway 7 days ago

Changing a domain name doesn't actually amend federal law.

Just like how changing Kennedy Center letterhead to Trump Kennedy Center for a year didn't actually legally rename it.

Once a case with sufficient standing got in front of a judge it reverted to the actual legal name on the basis that only Congress can change the statutorily defined name.

Comment by breezybottom 7 days ago

Congress doesn't manage departmental websites.

Comment by whateveracct 7 days ago

illegally, yeah

Comment by efromvt 7 days ago

I do slightly prefer 5.5 for complex work but Claude quota usage has gotten infinitely better since the dark days a few months back - has gone from being infuriating to something I pretty much don’t have to worry about with it as a daily driver. (In fact, hitting GPT weekly quotas is more annoying now). Understand if people are still scarred by the issues + poor comms around them, though.

Comment by jrflo 7 days ago

That's good to hear. It was legitimately unusable back when 4.7 was released, so I had no choice at the time. I'm sure I'll ping pong back again at some point.

Comment by supertroop 7 days ago

Do you use a token service like open router or just subscribe to / unsubscribe from various models sequentially?

Comment by jrflo 7 days ago

I just subscribe/unsubscribe to the providers each month. I'll definitely check out open router though, I always assumed that subscriptions were heavily subsidized by the providers especially if you're on the top end of users but maybe I should go to a usage-based plan.

Comment by rekttrader 7 days ago

Wait till you kick the tires of Qwen Coder.

Comment by hgoel 7 days ago

How much more clearly do they need to explain the resource constraints?

If they didn't announce it, you guys would be complaining about slowed progress.

If they didn't release it, you guys would be complaining about fake promises and marketing.

If they released it without limits, the complaints would be about slow responses and outages.

If they didn't add to subscription plans, the complaints would be about phasing out subscriptions.

If they added to subscriptions with cost reflecting their resource availability, the complaints would be about how quickly it eats limits.

So they choose the middle ground of providing some initial access and assessing if they can satisfy demand, only to still be ignored and accused of trying to get users hooked?

We've already seen that they don't have enough compute, thus the deals with SpaceX for their GPUs. It's very reasonable that they just don't have the capacity to support the subscription userbase on this model.

Comment by dakolli 7 days ago

[flagged]

Comment by hgoel 7 days ago

Putting aside the fact that this is a hilarious standard to have on a Ycombinator-run forum, let's say providing Opus-level models was profitable. That has no bearing on whether they'd have enough resources to provide Fable at all.

Comment by szundi 7 days ago

[dead]

Comment by joshstrange 8 days ago

I would not use this if you are on a subscription. In <8min it burned my entire 5hr window (which had just reset, it appears; I have over 4 hours till it resets, and I hadn't used CC at all today aside from this) and then it used up ~$15 more in usage before I could stop it.

I am on the $100 Max plan.

Comment by GoToRO 7 days ago

they have a graph with cost comparison between the models. This model is just a little over the other models as cost. The graph is logarithmic :)

Comment by velcrovan 7 days ago

I'm also on the $100 max plan. I let Fable rip on a complicated issue involving hot-reloading modules in a GUI app built with Racket, it's fixed a couple issues over the last hour, and I've used about 17% of my session (not weekly) limit.

Comment by enraged_camel 7 days ago

That’s odd, I used it on a pretty complex refactoring task and it worked for 22 mins and used only 15% of my 5-hour limit. I’m on the $200 Max plan though.

Comment by FireBeyond 7 days ago

Well the $200 Max plan is 4x the usage quotas of the $100 so it's "within reason"?
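Spelling out that comparison, taking the 4x quota ratio claimed above at face value (it is a commenter's figure, not verified here):

```python
# Convert usage on the $200 plan into the equivalent share of a $100 plan window.
QUOTA_MULTIPLIER = 4        # "$200 Max is 4x the $100 plan", as claimed above
used_on_200_plan_pct = 15   # 22-minute refactor, reported above

equivalent_on_100_plan_pct = used_on_200_plan_pct * QUOTA_MULTIPLIER
print(f"{equivalent_on_100_plan_pct}% of a $100-plan 5-hour window")
```

By that ratio the 22-minute task would have consumed well over half of a $100-plan window, which is at least in the same direction as the reports of windows burning out quickly on that plan.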

Comment by treenutlog 7 days ago

[dead]

Comment by cortesoft 7 days ago

The CLI when you select it says it has 2x the usage as opus. Not sure if that matches what you are seeing.

I do wonder if you switched models mid-session, you would have lost all your cache. Reloading the context into cache can really eat through your usage.

Comment by observer987 7 days ago

I too am on the $100 plan and I second this.

I had it analyze a project I was working on with Opus 4.8, and it blew through 23% of my session limit in one go. Does not portend well for my budget.

Comment by d4rkp4ttern 7 days ago

Yes, and this is also why I haven’t yet tried the new “dynamic workflows” which spawn hundreds of agents that happily eat through your token limits.

Comment by fastball 7 days ago

What is your effort level?

Comment by ZunarJ5 7 days ago

They didn't even reset credits for this lol

Comment by 0erofootprint 8 days ago

For me it almost immediately blocked. I had it writing code related to message digests - and it seemed to think it was too gifted for that. Gave the security warning and switched back to 4.8. Whatever... it will probably have the API error soon. I have mostly switched to the Codex 200 a month plan. I've found their 5.5 xhigh to be better than Opus 4.8 "ultracode." Also, I have not once seen their servers fail for compute unavailability, unlike Anthropic, where it happens almost every hour.

Comment by matheusmoreira 7 days ago

I just asked Fable for a complete code review of my lone lisp project. Started out strong. Launched Fable agents, then spent like 10 minutes thinking... And then got interrupted by a switch to Opus 4.8.

> Fable 5's safety measures flagged this message for cybersecurity or biology topics.

> They may flag safe, normal content as well.

> These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them.

Here are the results of the agentic code review session:

  ┌──────────────────────────┬───────────────┬────────────────┐
  │          Agent           │ Fable 5 turns │ Opus 4.8 turns │
  ├──────────────────────────┼───────────────┼────────────────┤
  │ values                   │ 134           │ 0              │
  ├──────────────────────────┼───────────────┼────────────────┤
  │ data-intrinsics          │ 104           │ 0              │
  ├──────────────────────────┼───────────────┼────────────────┤
  │ tools-tests-build        │ 81            │ 0              │
  ├──────────────────────────┼───────────────┼────────────────┤
  │ core-intrinsics (failed) │ 25            │ 0              │
  ├──────────────────────────┼───────────────┼────────────────┤
  │ system-memory            │ 44            │ 20             │
  ├──────────────────────────┼───────────────┼────────────────┤
  │ reader-modules           │ 104           │ 25             │
  ├──────────────────────────┼───────────────┼────────────────┤
  │ linux-startup            │ 95            │ 15             │
  └──────────────────────────┴───────────────┴────────────────┘

This 40 minute session cost me 16% of my weekly usage. A simple code review of the most critical areas of my project got flagged as a cybersecurity risk. It really made me not want to try it again.
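For reference, the totals from the table and a naive extrapolation of the usage figure. The turn counts are copied from the table above; the linear extrapolation is only an assumption about how usage scales.

```python
# Per-agent turns from the table above: (Fable 5 turns, Opus 4.8 turns).
turns = {
    "values": (134, 0),
    "data-intrinsics": (104, 0),
    "tools-tests-build": (81, 0),
    "core-intrinsics (failed)": (25, 0),
    "system-memory": (44, 20),
    "reader-modules": (104, 25),
    "linux-startup": (95, 15),
}

fable_total = sum(fable for fable, _ in turns.values())
opus_total = sum(opus for _, opus in turns.values())
opus_share = opus_total / (fable_total + opus_total)

SESSION_MINUTES = 40
WEEKLY_PCT_USED = 16
# Assumes usage scales linearly with session time, which may not hold.
minutes_to_exhaust_week = SESSION_MINUTES * 100 / WEEKLY_PCT_USED

print(f"Fable 5 turns: {fable_total}, Opus 4.8 turns: {opus_total}")
print(f"Share of turns run on Opus after the downgrade: {opus_share:.1%}")
print(f"Minutes of similar work per weekly quota: {minutes_to_exhaust_week:.0f}")
```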

Comment by kordlessagain 7 days ago

Same. I asked for a security review and it immediately triggered. I then started a new session and asked for a software review and it ran for a bit before getting tripped on token usage by the project.

Comment by andai 7 days ago

This is interesting. Security issues are bugs. So if you ask it to look for bugs, it will also find security issues. Is that a workaround for the "no cybersec" rule?

Or is it just not allowed to find bugs? Or it's only allowed to tell you bugs that don't pose a security risk?

Comment by matheusmoreira 7 days ago

> Or it's only allowed to tell you bugs that don't pose a security risk?

Seems that way. "Security" was never part of the prompt. It was something like:

> Hello, Fable! Can you give me a complete code review of my lone lisp project? Opus has already done extensive code review. I'm curious to see what you say.

Result was the table above.

Comment by andai 7 days ago

Yeah I heard multiple people mention that it's really good at triggering itself. e.g. it'll spontaneously write some tests related to security, which then forces it to downgrade to Opus for the rest of the session.

Comment by kkoncevicius 8 days ago

I had a similar experience. I wanted to test it by asking it to summarise a scientific OMICs-related paper. It gave a warning about me potentially developing a bio-weapon or something like that. And switched back to Opus 4.8.

Comment by smith7018 8 days ago

Fwiw it's not available on my enterprise account: "Disable zero data retention to unlock Fable 5 access"

Comment by stronglikedan 8 days ago

We just blocked it at our org for this reason. They will "retain agent request and output data associated with this model, regardless of your Cursor Privacy Mode setting."

Comment by sdellis 8 days ago

What does "zero data retention" mean? What kind of data does it need to unlock?

Comment by drakythe 8 days ago

The announcement details it. They're storing 30 days of data on all surfaces, first and third party. They claim it is for security purposes so they can review and check for long term jailbreak and distillation efforts.

They also, FWIW, say that they've instituted new policies on their end such as logging any human access to the stored data and automated deletion after 30 days in "most" cases (with another link to a document detailing that further).

Comment by kyledrake 8 days ago

Considering their apparent nerfing of the end user plans in favor of enterprise clients, is Anthropic still the "more ethical AI company" like everybody loves to tell me all the time?

Assuming this isn't just a supply issue on their side, nothing says "ethical AI" like only allowing mega corporations to use it through cost barriers.

Comment by estearum 8 days ago

You really misunderstand what AI-doom people are worried about if you think this is anywhere near the top (or middle, or bottom) of the list of concerns.

Comment by throwaway894345 8 days ago

Yeah, it's positively precious to think the specific pricing strategy for consumers is the overriding ethical concern with OpenAI, etc. I don't have any particularly strong affinity to any AI company, but comparing pricing to say mass surveillance is ... something else.

Comment by kyledrake 8 days ago

Your beautiful straw man is negated by the fact that Anthropic seems quite eager to get back on the DoD gravy train https://www.reuters.com/business/aerospace-defense/blacklist...

Comment by jnovek 7 days ago

Your original comment was about pricing ethics, does Anthropic’s connection to the DoD have anything to do with pricing ethics? They’re in no way coupled, one can be ethical while the other is not.

Comment by andriy_koval 7 days ago

Even for the Pentagon thing, Dario said he doesn't object to military AI, but said Claude is not ready YET. I speculate he was afraid of the reputational damage if Claude were to guide missiles onto elementary schools.

Comment by throwaway894345 7 days ago

I admire the confidence with which you started typing a reply that had nothing to do with my comment. Bravo!

Comment by estearum 8 days ago

Where is your evidence that this is Anthropic backtracking on its ethical and contractual commitments rather than DOD backtracking on its blatantly illegal coercion (which it's almost certainly going to be successfully sued for)?

Talk about a strawman!

Comment by kyledrake 8 days ago

As someone that was in Minneapolis during the ICE raids, including one where a US citizen at a nearby restaurant was thrown in prison for 3 days despite having his passport on hand because he looked asian, it's hard for me to not equivocate the ethics of AI companies actively collaborating with the Trump administration as different flavors of ice cream.

Comment by estearum 8 days ago

Are the two analytical frameworks available to you just "black and white thinking" or "it's different flavors of ice cream?"

Comment by kyledrake 8 days ago

Are the personal attacks really necessary to make your argument?

Comment by estearum 8 days ago

Fair point! Edited to remove.

Comment by ygjb 8 days ago

Setting aside the simple fact that there is no ethical consumption under capitalism, the reality is that regardless of how Anthropic feels, it is becoming clear that many, if not all countries regard AI developments as strategic technologies (and they should).

Anthropic needs to be at least somewhat in the good graces of a capricious administration that is already under pressure from businesses and citizens to regulate AI companies across multiple different domains, whether it's energy consumption, job displacement, military and defense applications, surveillance, etc.

If Anthropic wants to survive, they need to acquire influence with the government that most impacts them as an American company, and a massive exporter of services in the AI space to other countries, otherwise they could get locked down and locked out of the market for national security reasons.

It sucks, but sometimes the survival choice is to make an ethical compromise in hopes that you can still be around to make better decisions later.

Comment by ericmay 8 days ago

> Setting aside the simple fact that there is no ethical consumption under capitalism

This "simple" fact needs quite a bit of additional context and work. Making grandiose ethical claims like this can be countered with other grandiose claims such as the fact that there is no ethical existence under communism or socialism.

Comment by ygjb 8 days ago

Sure. Why not, I'm bored today and waiting for some stuff to finish up :D

The fact that there is no ethical consumption under capitalism is not material to whether or not ethical existence is possible under communism or socialism. In order to survive in a capitalist society, one inherently has to make choices that require trade-offs, and those trade-offs are burdened by a history of decisions made not just by the people alive today, but our ancestors as well. Does that mean I walk around chanting "Reparations", "Land-back", or other calls to action? No, but I do acknowledge that there are unresolved issues and as a Canadian, I know we need to do more to resolve treaty issues, and environmental issues, and systemic discrimination. I also know that Americans need to do better to address systemic discrimination and many, many other issues. It also doesn't mean I want to give back my house, or give away all of my possessions. It just means I try to make good choices and support businesses and people that are open about the trade-offs they make and try to engage as ethically as possible.

Acknowledging those facts doesn't absolve us of responsibility, it's a framework that allows folks concerned about whether or not they are doing the right thing to accept the trade-offs that they choose to make and be responsible and accountable for those choices to themselves or their communities.

We live in a world with scarce resources. It's possible that with a foundational redesign of the global economy, and the requisite authoritarian government that would be required to force such a redesign, we could eliminate food scarcity, solve energy scarcity, and make sure that everyone has a place to live. Those trade-offs are probably not worth the ethical cost in political and physical violence required to accomplish it. We have seen the trade-offs that happen when the powerful are able to exploit communist or socialist governments. We are seeing the "late stage capitalism" impacts of allowing the powerful to exploit capitalism in democratic societies. Acknowledging that the current capitalist system has led to the greatest prosperity for the upper echelon (financially) of humanity, and a dramatic reduction in global poverty shouldn't obscure the reality that much of that wealth comes from exploitation of people and the environment.

It's a huge problem to unwind, and we can't let the burden of every choice that we make stop us from trying to do better, but we (as in society in general) can't do better if we don't at least acknowledge the compromises we are making along the way, and try to plan to fix it in the future.

Probably a topic better suited to beer and a pub setting than HN though :P

Comment by ericmay 7 days ago

> The fact that there is no ethical consumption under capitalism

I don't believe that this is a fact. How are you demonstrating that this is a fact?

When you talk about things like reparations or "land back" you're already cargo-culting in concepts and ideas that themselves need to be fleshed out in order to make a subsequent claim that a specific economic system is unethical. Someone can just argue all economic systems are unethical, how are you going to defend against that? And can you pay reparations for example without going back in all of human history and finding all cases of injustices and then tallying it up? Why pick an arbitrary point in time? Better yet, why not start in countries where slavery still exists instead of focusing on the west which led the world in abolishing slavery and created concepts such as universal human rights.

Even with respect to "eliminating food scarcity" - eliminate in what sense? All olive groves and grapevines and rice farms have to be destroyed and rebuilt to only build certain foods?

Dabbling in communism or other inhumane and authoritarian governmental systems is extremely dangerous and in the same vein of extraordinary claims require extraordinary evidence, suggesting as you did creating an authoritarian government to create a utopia is precisely the same project of suffering and death that mass murderers throughout history have undertaken to abject failure, and thus, you need some incredible amount of evidence and theory to be able to even fairly suggest going down this path.

Comment by ygjb 7 days ago

It's simple, I am not going to defend any economic system because they all require trade-offs, because any economic model that we could currently implement must necessarily ration scarce resources according to some set of rules. Those rules will explicitly deny someone else resources, and the administration of that economy will also be subject to abuse by the people who enforce the rules.

I am not going to do the work of gathering the evidence for you, and I don't think this is the right venue for a debate on the topic.

Comment by ericmay 7 days ago

If you'd like to concede the debate that's fine, but you can't drop a few comments that are, well, not at all simple, and then when someone points out the flaws in your reasoning or asks clarifying questions you throw your hands up and say it's not the right venue for debate.

If you don't have evidence I think it's mature of you to admit that and applaud you in doing so. We all like to just talk and don't have to always provide evidence for every citation or what not and it's fair to just say hey I'm just making this up and it requires further discussion.

Comment by cleaning 7 days ago

It only needs additional context and work if you are unfamiliar with the concepts underlying it. Possibly consider you are out of your depth here, rather than jumping to conclusions.

Comment by ericmay 7 days ago

No that's incorrect. Instead I believe the underlying concepts are debatable and so stating it as a "simple fact" is a bit unfair.

Comment by Jackson__ 8 days ago

If you can't trust them to act ethically on the small scale, why would you expect that to turn around once it gets to a larger much more important scale?

How many government sanctioned school bombings does it take for them to quit working with said government? For now we know that number is somewhere between infinity and 1.

Comment by estearum 8 days ago

It literally does not register as "unethical" at any scale to have different products or prices for different customer tiers.

The question of collaboration with USG is a much more complex one, but is not the one raised above.

Edit: I'll also add that I doubt any AI-doom people "trust" Anthropic per se. The entire angle of questioning – again – misunderstands the AI-doom argument. You appear to think that if companies behave unethically, they cannot be trusted and they will not produce good outcomes, inversely: if they behave ethically, they can be trusted, and they will produce good outcomes.

Any competent AI-doomer would argue that ethics or trust are essentially irrelevant.

The entire problem is that people can act totally reasonably, even ethically, and this is not a guarantee of good outcomes. Situations can be created in which completely ethical, reasonable behavior actually produces a bad outcome. You do not need to assume people are bad in order to produce a bad outcome, and inversely you cannot assume that you will get a good outcome from good people.

"Arms races" are one class of situations that often have this characteristic. "Bureaucracy" is another class that we encounter a lot in daily life. There's a lot of them!

Comment by DonsDiscountGas 8 days ago

I don't think offering a product under a certain set of terms obligates a company to maintain that offering forever. The bait and switch is certainly annoying but seeing as they're very upfront about it you can't say you weren't warned. Don't like it? Don't use it.

Comment by xvector 8 days ago

Yup - who cares about x-risk or red lines for domestic mass surveillance anyways? I draw my red lines at prioritizing profitable customers when heavily resource constrained. That's the true definition of evilness!

Comment by wongarsu 8 days ago

I wouldn't call Anthropic ethical. But between Anthropic and OpenAI, Anthropic is the more ethical one

Comment by brianmcnulty 8 days ago

Why would you have ethics when you could get that IPO money instead?

Comment by eli 7 days ago

It's unethical to price it in a way not everyone can afford?

Comment by MattSayar 7 days ago

It smells like an architecture-related issue to me. They wanted to release the model asap, but they're still implementing the fine-grained controls to constrain the model to non-subscription users.

Comment by dllrr 7 days ago

They said they would release it back into subscriptions as capacity allows in the future. If they don't, people are going to point back at it and rake them over the coals.

Comment by Maken 8 days ago

The bar is just too low.

Comment by fridder 8 days ago

More ethical in some areas, actively user hostile in others

Comment by nickandbro 8 days ago

Get them addicted then cut them off. Oldest trick in the book.

Comment by toomuchtodo 8 days ago

More of a free trial to those authenticated and qualified with existing payment. Subscription billing is going away for sure though eventually based on the economics. Token “all you can eat” is a capital furnace otherwise.

(I’m highly confident open models will eventually achieve a similar performance benchmark with distillation over time)

Comment by toomuchtodo 5 days ago

Mythos-class models will diffuse throughout the world by 2029 - https://news.ycombinator.com/item?id=48498512 - June 2026

Comment by chinathrow 7 days ago

Yeah that payment scheme sounds like they gear up to shift everyone into API token prices, eventually. Time to convert the existing tokens into software, until then.

Comment by CuriouslyC 8 days ago

Subs lose money on individuals to get those individuals to force their companies to pay for the corporate plan. The economics are bad, but so are the economics of grocery stores selling Milk and Bananas at a loss to drive traffic, which they basically ALL do.

Comment by eptcyka 7 days ago

I pay a lot but barely use it except for some intense days, where the lower plans would have throttled me in like 30 minutes. API billing is still more expensive. If you want to not pay much, go to openrouter and use chinese models. They are cost efficient.

Comment by HDThoreaun 8 days ago

I haven't seen any evidence showing that subscriptions cost the labs money.

Comment by toomuchtodo 8 days ago

Companies don’t want to pay when the value realized does not exceed the cost.

AI Savings Misses 'Should Be Making Executives Uncomfortable,' Bain Says - https://news.ycombinator.com/item?id=48359010 - June 2026 (0 comments)

AI sticker shock hits corporate America - https://news.ycombinator.com/item?id=48307098 - May 2026 (146 comments)

Comment by CuriouslyC 8 days ago

What's the realized value of not losing your engineers because you're letting them use their preferred tools?

Comment by toomuchtodo 8 days ago

Retain and hire the engineers who don’t require heavy use of AI to deliver value? The current SWE job market speaks for itself. Where will you go where they will let you burn up tokens in a high cost of capital macro?

ZIRP (zero interest rate policy) is over, software engineers no longer call the shots now that there isn’t vast amounts of capital chasing yield, and that capital bidding up salaries and keeping the labor market for engineers tight.

If you are x more productive with generative AI, very shortly you are going to have to prove it with a token budget (or, if you’re lucky, an org willing to spend for on prem hardware for capped token cost, fixed capex vs uncapped opex).

The comparison is not SWE vs SWE with AI. It is SWE vs SWE with AI with a constrained token budget ($x/month) delivering the same value at the same or lower cost. If you cannot prove that you are wildly (vs marginally) more productive with the AI, why would they pay for it? Prove it.

Comment by toomuchtodo 7 days ago

> The comparison is not SWE vs SWE with AI. It is SWE vs SWE with AI with a constrained token budget ($x/month) delivering the same value at the same or lower cost. If you cannot prove that you are wildly (vs marginally) more productive with the AI, why would they pay for it? Prove it.

https://abhishek-shankar.com/posts/ai-coding-bill-headcount-...

> That is the real content of the Uber story, and it is why filing it under "budgeting discipline" misses what is actually unfolding across half the engineering organizations in the country right now. They ran the same experiment Uber ran, most of them without Uber's $3.4 billion R&D cushion to absorb the surprise, and almost none of them having modeled the heavy-user tail or instrumented the gap between tokens consumed and value shipped. The reckoning will arrive for each of them on their own fiscal calendar, and the first instinct will be the wrong one. The tool is too good to abandon, the bill is too large to absorb, and the only durable resolution runs through a question the entire rollout was designed to defer.

> You cannot get labor-replacement economics out of a tool you deployed as a labor supplement, and the bill comes due before anyone is willing to admit which one they actually bought.

Comment by alvis 8 days ago

It’s too obvious that Anthropic needs to find a way to earn enough revenue before the IPO. Claude subscriptions aren’t earning much money, I bet

Comment by sigmoid10 8 days ago

I think they are just prioritizing enterprise customers, because this is where historically they made most money.

Comment by dylandevelops 8 days ago

I agree with you here. Unfortunately, this tends to be the case, with smaller developers paying the price.

Comment by AtlasBarfed 8 days ago

That's not how it works. They don't need revenue, they need addicts.

Specifically they need businesses that fired people and adapted their business to the products, so when the unsubsidized costs hit the businesses are forced to eat the true costs.

Yes they can't afford to give the products away for free, but what is essentially happening with AI services is economic dumping: keep costs artificially low to get people to fire everybody, and then jack up the rates once they have monopoly control

Comment by sdellis 7 days ago

But the only companies firing people (and certainly not everybody) are either the companies with an AI or the investment and finance firms that stand to profit from AI. I smell hype. And no company is firing everybody because of A.I.

I agree. They need addicts, but they are high on their own supply and everyone else can see the danger in getting hooked.

Comment by sdellis 8 days ago

That's a big problem for all of the AI companies. Most people don't find the technology compelling, accurate, or ethical enough to pay for a subscription.

Why wouldn't Anthropic just wait until people start subscribing, do some kind of marketing push, or obtain some kind of other sustainable revenue stream, before they go IPO? I wonder if they see the writing on the wall with all of this and want to cash out as quickly as possible?

Comment by sothatsit 7 days ago

The Team plan is ~125 USD / month / user. Big enterprises like Uber are paying upwards of $1500 USD / month / user. Anthropic can raise their revenue a lot more by selling to big enterprises than they can by selling more team plan seats.
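
The per-seat gap described above can be made concrete with a rough sketch. It uses only the two prices quoted in the comment (~$125 and "upwards of" $1,500 per user per month, so the second figure is a lower bound); the function name and the 1,000-seat deal size are hypothetical and purely for illustration.

```python
# Per-seat prices quoted in the comment above (USD per user per month).
# ENTERPRISE_SEAT is a lower bound: the comment says "upwards of" $1,500.
TEAM_SEAT = 125
ENTERPRISE_SEAT = 1500


def seats_to_match(enterprise_seats: int) -> float:
    """Team-plan seats needed to equal the revenue of N enterprise seats."""
    return enterprise_seats * ENTERPRISE_SEAT / TEAM_SEAT


if __name__ == "__main__":
    # One enterprise seat expressed in Team-plan seats.
    print(seats_to_match(1))
    # Hypothetical 1,000-seat enterprise deal, expressed in Team-plan seats.
    print(seats_to_match(1_000))
```

On these numbers, each enterprise seat is worth at least a dozen Team seats, which is the asymmetry the comment is pointing at.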

Comment by xpct 8 days ago

I agree, this looks like their plan to wane out subscriptions. This will probably come with Opus nerfs later.

Comment by rapind 8 days ago

I just assume Opus is constantly nerfed based on capacity. I was exclusively Claude for a long time, but the inconsistency in quality, constant outages, and slow downs were too hard to work with.

I just use dumb and fast models now. I'm more engaged. I think that the higher the quality of the model, the more you tend to vibe with it, and then the more hallucinations you then miss. I'm not sure which is more productive, but I definitely burn out faster the more I vibe. At some point you're spending your time on forums, discord, or youtube instead of engaged with what you're building. Or you yak shave about your tooling and end up creating the 600th multi-agent gastown harness and blowing thousands of dollars on tokens to create it only to discover it's too expensive to actually use.

Comment by dylandevelops 8 days ago

I agree with you. The more I vibe code, the less interested I feel in what I'm building. Working with models that force me to think, especially with personal projects, helps me stay engaged and enjoy what I am doing more.

Comment by winter_blue 8 days ago

Composer 2.5 Fast that Cursor is giving away for very little has been amazing.

Comment by daviding 7 days ago

Given the Fable 5 costs it's getting trickier to weigh up 'how smart do you want it', like looking at the top of this graph..

https://cursor.com/evals

Comment by aplomb1026 8 days ago

[flagged]

Comment by nonethewiser 8 days ago

It's possible that they will transition to usage credits but why not take them at their word? To date they have continued to offer better and better models to their subscription plans.

Comment by timcobb 8 days ago

What's their word? Have they commented?

Upd: I meant big picture, not with respect to this model release. Where do subscriptions figure into their strategic vision. Will consumers end up paying enterprise prices in the future?

Comment by KyleJune 8 days ago

In the blog post they say that when sufficient capacity allows them to do so, they aim to restore Fable 5 as a standard part of subscription plans and intend to do so as quickly as they can.

Comment by ls612 8 days ago

In TFA they say they intend to restore Fable 5 to subscription plans some time after June 22. That is what "take them at their word" means.

Comment by dbbk 8 days ago

Read it again

Comment by timcobb 8 days ago

I did, I'm not seeing anything about the future of subscriptions at Anthropic.

Comment by dbbk 7 days ago

I can't help you

Comment by timcobb 7 days ago

Damn

Comment by xvector 8 days ago

HN needs to take a chill pill. Could it be that Mythos is expensive and they just want to give people a taste of it? I mean the alternative is not offering it at all?

Comment by 8note 8 days ago

it's unclear how they can offer it broadly but only for half a month.

why do they have capacity now that they won't in a few weeks?

Comment by losvedir 8 days ago

Break between training runs?

Comment by bigtechennui 8 days ago

It’s offered broadly after, for more money. It’s subsidized as marketing

Comment by taormina 8 days ago

Those already landed! Oh, you weren't talking about 4.8?

Comment by piva00 8 days ago

Even Opus 4.7 felt like a regression from 4.6, consumed a lot more tokens while I didn't experience any substantial improvements. The company I work at simply rolled back to 4.6 on everyone's configurations, disabling the toggle for 4.7.

Comment by taormina 8 days ago

4.6 has been my happy place for getting anything done for a while now.

Comment by timcobb 8 days ago

Ooof so are we thinking that in the next 6-12 months subscriptions will be replaced with paying retail like enterprise currently?

Comment by CuriouslyC 8 days ago

I don't think they'll phase out subscriptions ever, their whole play has been to drive demand from the bottom up. Get engineers hooked on building with claude at home, then get them to demand the ability to use it at work, and bend over their employer with no lube.

They'll probably tighten the quotas to rein in whales though.

Comment by aseipp 8 days ago

They almost certainly already make a fuckload more money off API pricing than they do subscriptions, even if there might be more total subscription users. So offering subscriptions even at some loss is probably going to continue. Honestly, I'd be surprised if they even lost money on most subs; there are definitely Token Whales out there who mess all the accounting up, though.

Realistically I think Anthropic just has insane demand but finite capacity to run models, and Fable will just make them more money if they dedicate it to API pricing. I suspect the goal here is something like: get individual engineers/PMs on their personal plans to taste Fable and then go to their meetings and say "Yes doubling the price of every single input/output token is a good idea, boss".

Comment by timcobb 8 days ago

But I don't want to be the developer who goes and says we must pay all this money for these tokens. I don't know who wants to be that developer.

Comment by treenutlog 7 days ago

[dead]

Comment by gck1 7 days ago

But how is this sustainable? It's not like paying $5000 per feature means you'll be refunded if prompting "make no mistakes" didn't work.

The only reason I pay $200 is that LLM errors cost me that much, at worst. If "make no error" starts working - sure. But surely, unless you have millions of dollars of cash to burn, a coin flip that costs $5000 is an insane idea?

Comment by thewebguyd 8 days ago

I certainly hope not. PAYG is not predictable enough for smaller companies or individuals. Where I work (non-tech company), PAYG would never fly. We aren't big enough for that. Of course, you can set usage budgets, but there's a pretty big difference between $200/user/month vs. the equivalent PAYG usage being closer to $1,000/user/month, if you currently use the subscription plan to its limits each week.

Going PAYG only will effectively take these tools away from a huge amount of people and accelerate the push for local LLMs.

OTOH, accelerating the push for local LLMs would also be fine with me.
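
To put the budgeting concern above in numbers, here is a small sketch using only the two per-user figures from the comment ($200/month on a subscription vs. roughly $1,000/month for equivalent pay-as-you-go usage). The team sizes and function names are hypothetical, and real metered spend would vary month to month rather than stay flat.

```python
# Per-user monthly figures taken from the comment above (USD).
SUBSCRIPTION_PER_USER = 200   # flat subscription plan
PAYG_PER_USER = 1_000         # commenter's estimate of equivalent metered usage


def monthly_cost(users: int, per_user: float) -> float:
    """Total monthly spend for a team at a given per-user rate."""
    return users * per_user


def payg_premium(users: int) -> float:
    """Extra monthly spend if the same usage were billed pay-as-you-go."""
    return monthly_cost(users, PAYG_PER_USER) - monthly_cost(users, SUBSCRIPTION_PER_USER)


if __name__ == "__main__":
    # Hypothetical team sizes, chosen only to show how the gap scales.
    for team_size in (1, 10, 50):
        print(team_size, payg_premium(team_size))
```

The gap grows linearly with headcount, which is why a 5x per-user difference that a large enterprise can absorb may be a non-starter for a small non-tech company.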

Comment by ygjb 8 days ago

I doubt it, given the importance of those subscriptions for building and maintaining market awareness.

The AI landscape is changing rapidly, with Apple announcing the option to change the AI backend, and potential requirements to enable AI choices as well, similar to EU browser choice requirements (this is more reading tea leaves than any actual requirements I am aware of). The new OS changes coming to support Googlebook, and deep Copilot/AI integration into Windows, will make maintaining user-facing subscriptions essential for independent model developers like OpenAI, Anthropic, and Mistral to remain relevant longer term.

If they don't maintain that relevance there is increasing likelihood that they will get consumed by other companies whether it's Apple, Microsoft or Google to form a foundation for their OS, or other cloud providers.

Comment by timcobb 8 days ago

That makes sense, but what about the specific bifurcation we're seeing here of super primo models versus still good models being available to subscriptions?

It's kind of annoying not getting access to the primo model and paying 200 bucks a month. I understand 200 bucks a month is basically nothing though.

Like I don't totally understand why they'd let me have it for a couple weeks and then take it away and say I can have it but I have to pay retail and retail is like $1,000 a day.

It's better to have loved and lost than to have never loved at all??

Comment by ygjb 8 days ago

It's a trade-off. Every hyperscaler is buying and building compute capacity as fast as they can dodge red tape. There is limited compute capacity, and scarcity is a real thing.

As a consumer I can choose to buy subscriptions to a range of things, including $5 droplets or VMs on a broad range of cloud hosting providers. I can even buy cheap bare metal at a bunch of providers at an affordable retail rate.

I can also buy "unlimited" AI packages that will be optimized to fit the cost model from a variety of services, with different impacts, such as rolling outages when I consume a daily or hourly allotment.

Right now VC and the investor class are subsidizing the rapid evolution of the services and availability, but that VC is running out. In more traditional economies, AI would have developed and rolled out more slowly, and through metered subscriptions, with the eventual rolling out of "unlimited" packages like telephone, internet, or cell services once the market became commoditized.

We have seen a big inversion of that with the race to "win" AI marketshare. Now the true cost is being exposed, and the most competitive and capable models are hideously expensive to operate, so it makes sense that we are moving to metered billing for a utility service. If you want gas, you can buy regular or premium. If you have a premium car you definitely want the premium, but for most people regular is good.

Give it a couple of years, and the survivors will settle around fairly industry standard models of consumer grade services, pro-sumer accounts, and business/enterprise models.

Things are still shaking out, but I get the sadness. Luckily I work at a big tech company who is banging the drum on doing experimentation so I use my prosumer claude pro and other accounts at home for hobby stuff, and save my heavy lifting and potentially experimentation for work :P

Comment by jrumbut 7 days ago

It could be my use cases, which have always seemed to be outside the wheelhouse of these models, but I find it very hard to downgrade after accessing a more capable model.

Opus 4.8 produces output in 15 minutes that is 3-4 hours of my work away from output that used to take me 40ish hours (a solid week of dedicated effort).

Last year(-ish, maybe it was 18 months, I forget when the jump happened), the frontier models couldn't touch this work. The output looked like a hardworking intern on their first day. Nice formatting, decent volume of words, but no understanding.

So it might work if it turns out to be a substantial leap in capability.

Comment by GoToRO 7 days ago

I switched back to Sonnet. It replies faster so I work faster. Also cheaper. But I really like the speed. I have to be more specific with what I want. Also I stop it more often than Opus. These new models will be awesome, but they need to increase the speed.

Comment by spaceman_2020 7 days ago

Kimi 2.6 has been my workhorse now. It's as good as Opus 4.6, which, to me, was the last "useful" Claude model.

The newer models are smarter but really fickle and hard to get meaningful work out of

4.6 was a workhorse

Comment by gfody 7 days ago

K2.6 on Cerebras is basically a preview of the future. We'll eventually get similar performance locally with Tenstorrent hardware.

Comment by gunsle 7 days ago

Agreed, everything since 4.6 has been worse

Comment by KronisLV 7 days ago

> it feels like they are trying to get subscribers to switch to usage-based billing

I think they might be hitting a point where subsidizing the expensive models for subscriptions makes less and less sense.

With Opus 4.X, last month I paid 100 USD for the Max subscription and got a token equivalent of 4.1k USD.

I imagine that Fable is more expensive to run.
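
The subsidy implied by those two numbers can be worked out directly. This sketch uses only the figures from the comment (100 USD paid, roughly 4.1k USD of token-equivalent usage); the function name is hypothetical, and "API-equivalent value" is list price, not Anthropic's actual cost to serve, which the thread does not establish.

```python
def subsidy_multiple(paid_usd: float, api_equivalent_usd: float) -> float:
    """How many dollars of list-price usage were received per dollar paid."""
    return api_equivalent_usd / paid_usd


# Figures from the comment above (USD, one month on the Max plan).
PAID = 100
API_EQUIVALENT = 4_100

if __name__ == "__main__":
    # Usage received per dollar paid.
    print(subsidy_multiple(PAID, API_EQUIVALENT))
    # Absolute gap between list-price usage and what was paid.
    print(API_EQUIVALENT - PAID)
```

A multiple of about 41x at list price is the kind of heavy-user tail that would get more expensive to carry if Fable costs more per token than Opus, as the commenter suspects.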

Comment by nicce 8 days ago

> The "offer, then remove" aspect is a bit eyebrow-raising -- it feels like they are trying to get subscribers to switch to usage-based billing, which makes me wonder if we'll ever get it after that June 22nd window.

Probably all about the IPO.

Comment by mlmonkey 7 days ago

Just like how Elon forced FSD in Tesla to be subscription-only (he was incentivized to do so).

Comment by ltrg 7 days ago

Fable seems very good at finding bugs (unsurprising given Mythos lineage), so this seems a pretty smart strategy. Once you see the bugs it finds in your existing Opus code, it's going to be hard to go back, psychologically speaking.

Comment by irthomasthomas 8 days ago

This is just the sales team doing their thing, applying the Law of Scarcity to drive demand.

It's the same exact speed as opus >=4.5, sonnet 4.5, and twice the speed of opus <=4.1

It must have about the same active parameters, or else it's a larger model running in turbo mode (smaller batches) and being heavily subsidized for some reason. But given most of the benchmarks are within 5% I doubt it is a much larger model. Most perplexing.

Comment by m00x 7 days ago

It could be a much bigger MoE model

Comment by irthomasthomas 7 days ago

Then it would be slower.

Comment by matheusmoreira 8 days ago

This is really sad... I really didn't want to be priced out of these models but it looks like that's going to happen sooner rather than later.

Comment by deepfriedbits 7 days ago

Thankfully this, like most other tech, will get cheaper through the years.

Comment by gck1 7 days ago

It already is. But marketing is a hell of a drug.

Comment by treenutlog 7 days ago

[dead]

Comment by dack 7 days ago

i doubt that's the goal for them. i bet they just really don't have capacity for people using it a ton, yet they wanted people to be able to try it out while it's new. so they compromised and made it temporarily available. and then hope they can get costs down or capacity up so they can make it more available again

Comment by InsideOutSanta 7 days ago

I think the goal is "private citizens: subscriptions; corporations: per-token billing." It's getting people addicted to LLMs on cheap subscriptions so that they can then force companies to pay for expensive inference.

Comment by clementg 8 days ago

I really don't want this to start being the norm

Comment by baggachipz 8 days ago

I don't see how it won't be. They lose insane amounts of money on subscription plans. I'm sure they still lose money on usage-based billing, but probably not as much.

Comment by JumpCrisscross 8 days ago

> They lose insane amounts of money on subscription plans

Do we know this? I’ve seen evidence they lose money on heavy users. But so do gyms.

Comment by saaaaaam 8 days ago

How do gyms lose money on heavy users? A heavy gym user isn’t really costing the gym anything extra as far as I can see.

Comment by JumpCrisscross 8 days ago

> How do gyms lose money on heavy users?

Most gyms sell more subscriptions than they can fit under their roof at one time. If a gym only sells to heavy users, it will either be constantly turning members away or have to buy more equipment. Its equipment will wear out faster. Depending on amenities, it will go through towels, soap, water, et cetera faster, too.

Comment by tripleee 8 days ago

Gym equipment lasts 10+ years in a commercial gym, at $50/mo that's a minimum of $6k paid from a single person.

Unless they're really, seriously wasteful with the soap.. there's no chance a gym is losing money on a heavy user

Comment by rafram 8 days ago

It depends on the gym and their business model! A super-budget gym like Planet Fitness that charges $15/month is going to lose money on heavy users, but they count on most of their members being infrequent gym-goers. A luxury gym like Equinox that charges $300/month can target heavy users without any issues, and they'd actually rather members go more so they stay and spend money on expensive salads and smoothies.

Right now all these AI subscriptions are priced like Planet Fitness, but they're used like Equinox. They're hoping that the new a la carte offerings will move their pricing more in that direction as well.

Comment by gunsle 7 days ago

The other user is right, you are being a pedant. Why do you think planet fitness makes money hand over fist? Because 99% of its users sign up, never go, and then also never cancel because it’s cheap enough to leave running. Gyms absolutely bank on low amounts of power users, meaning the rest of the subscribers are subsidizing those that go frequently.

Comment by Fluorescence 7 days ago

Members will switch gyms if it's too busy at times they want to visit. "Too busy" includes too much contention for a single piece of equipment.

US gyms might be vast warehouses but in the UK, most only have a couple of benches, couple of cages, one set of db per denomination above 20kg etc. They require working-in and consideration for others.

A couple of unapproachable "heavy users" doing 3 hour sessions across peak hours can ruin the workout for dozens of paying members needing a few min per station for ~5 sets.

It might also be a euphemism for "dickhead" who also tend to be "heavy users". Those that damage, hoard and don't share equipment and repel other customers on many levels besides - threatening, lecherous, loud and smelly.

Doesn't even need malicious intent - can be weirdo bores, forever talking at victims while doing a routine that makes absolutely no sense besides camping on equipment for half a day... 100 sets of incline press 7 days a week... what are you even doing to yourself fella?

Comment by charcircuit 8 days ago

>I’ve seen evidence they lose money on heavy users.

Where?

Comment by JumpCrisscross 7 days ago

There are tons of blog posts where folks work out the API cost of their usage and find it well above subscription cost.
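
The calculation in those posts usually looks something like the sketch below. The token counts and the $200 subscription price are hypothetical numbers picked for illustration, and the rates are the $10/$50 per million tokens list price quoted in this thread. Note that this only compares usage against list price; it says nothing about what serving those tokens actually costs the provider.

```python
def api_equivalent_cost(input_tokens, output_tokens, in_per_m, out_per_m):
    """Dollar cost of the given usage at per-million-token list prices."""
    return (input_tokens * in_per_m + output_tokens * out_per_m) / 1_000_000


# Hypothetical heavy month of usage (assumed numbers, not measurements).
usage_in = 200_000_000   # input tokens
usage_out = 10_000_000   # output tokens
subscription = 200.0     # assumed monthly subscription price in dollars

api_cost = api_equivalent_cost(usage_in, usage_out, in_per_m=10.0, out_per_m=50.0)
ratio = api_cost / subscription

print(f"API-equivalent cost: ${api_cost:,.2f}")
print(f"Multiple of subscription price: {ratio:.1f}x")
```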

Comment by otterley 7 days ago

That doesn't mean the company is losing money in aggregate on these subscriptions. Buffets are still in business even though some people gorge themselves silly at them. The incremental cost may exceed the incremental revenue for a particular person or minority group, but that's not how these businesses measure profitability.

Comment by charcircuit 7 days ago

There is a difference between making less profit and losing money. Comparing to API cost can only show the former.

Comment by cautiouscat 8 days ago

I assume consumers aren’t a big note in their bottom line. I’m not actually very sure about that, just an assumption.

What I wonder however is if these tools will become something I use at work only. $100/month is already a massive stretch budget wise. If these models keep devouring tokens there’s no way I’d get the same usage time out of them for $100 in usage credits.

I just don’t think I’d use them much at all at home.

Comment by ABS 8 days ago

also: Fable takes 2× the usage of Opus

Comment by daft_pink 8 days ago

I’m just about ready to cancel my small business 5 user plan with max licenses, because although cowork is really great, I just find OpenAI/Codex to be a lot better most of the time.

Comment by oersted 8 days ago

> Pricing for both models is $10 per million input tokens and $50 per million output tokens.

The step-up in intelligence looks massive (we'll see in practice), but the price is getting to a point where it's making me question if it's even worth giving it a try.

Good competitors will probably be out soon, which should level the playing field. I am more excited about that, just the fact that they showed that such an improvement is possible. I'm okay waiting a bit longer for this to become attainable for plebs like me.

Comment by kmac_ 7 days ago

Models are getting better, but there's a negative change in terms of "productivity" per dollar. Yeah, I can throw 5 sub-agents at the problem, but the cost is getting significantly higher. And yes, I can crank out the solution much faster, but again, at some point that cost will be hard to justify. And it doesn't matter if the cost is subsidized by a provider, if it's paid by your company, or from your pocket. We are slowly reaching a point where the cost will be too high to justify the gains.

Comment by xyzsparetimexyz 8 days ago

This is probably the end of 'use the best model no matter the price'

Comment by kolinko 8 days ago

The pricing can be a bit deceptive though. A good model can deliver the same results in fewer tokens.

Kind of like billing a programmer by the hour.

Comment by zyuiop 7 days ago

Sadly this does not seem to be the case here: if you read the announcement entirely, they include a "cost per task" metric which basically continues the trend of their previous models. So yes, tasks will cost you more, but results will be better - allegedly.

Comment by kolinko 1 day ago

after playing with it for three days - yeah, that turned out to be the case. Fable was more precise, did more tests and cost much more than 2x for the same task.

For some tasks though Opus was performing poorly and Fable managed to do them well on a first try.

Comment by sourcecodeplz 8 days ago

Why wouldn't it be? How much would you pay a scientist at this point to think about a problem for you and give you a solution?

Comment by oersted 7 days ago

I'm not sure how it might be with Fable in practice, but we are already not that far away from AI costing as much as a full-time professional, faster in some ways but considerably less independent.

Perhaps not that close to US salaries, but those are inflated to hell. Worldwide senior engineers and scientists have salaries just about an order of magnitude away from AI subscriptions that you can use most of the day every day.

Comment by rvz 8 days ago

> * On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.

Of course, they are a casino as well giving you free spins at the wheel with their new Fable machine, and it is done on purpose.

Once their freebies have expired, many of its users will begin to gamble more on the new casino machine and will realize that it is expensive.

Comment by xvector 8 days ago

If it's that big of a problem to you, you're free to just... not use the freebie?

Comment by cautiouscat 8 days ago

It’s an interesting thing to bring up because it’s this classic thing we’ve seen for decades now.

The ramifications go beyond the individual which is why I assume they mentioned it. They don’t need to use it/not use it for it to have interesting implications.

Comment by xvector 8 days ago

so it'd be preferable if they didn't include the model at all?

Comment by cautiouscat 8 days ago

I didn’t say that and I don’t have a feeling on that either way. But this is a limited time trial and calling it out as such is valid.

Is it nice we get the trial? Sure. Is it also a common play in the playbook of tech companies? Yes.

Comment by rvz 7 days ago

Then you'd better not complain about how expensive it is to use (just like the other companies are doing), or the next time Claude goes down.

Anthropic does not care about us and isn't going to talk to you either and will extract from you as much as possible.

The true answer is local models.

Comment by danslo 8 days ago

It's not a freebie, it still requires a subscription and burns tokens twice as fast as Opus.

Comment by Aleleo76 8 days ago

Pay-as-you-go billing is a kind of drug, I use it every now and then when I'm working on a project with Opus, in a moment you spend a fortune

Comment by madrox 7 days ago

I suspect it'll go on the subscription plan once other providers have similar benchmarks.

As annoyed as I am about this move, I get it. Users flood the newest, best model whether they really need it or not, and are efficient at using their entire quota. They've had so much trouble reining in subscription usage it makes sense.

Comment by DonsDiscountGas 8 days ago

I expect that depends on demand, feedback, and whether GPT-6.0 gets released and is competitive

Comment by nutjob2 8 days ago

> "offer, then remove"

Sounds like "bait and wait".

If you think about it, the more people pay for these new and more resource hungry models, the longer it takes for them to become no extra cost and the longer it takes the more people are tempted to pay extra.

Comment by systemvoltage 8 days ago

It's interesting that we are seeing a time when subscriptions are not preferred and usage-based billing is.

Pay-as-you go isn't a common thing in SaaS. For example, except for AWS SES, all email providers are bulk-subscription based.

Comment by esafak 7 days ago

The point of SaaS was that the marginal cost (of supporting another user) was low. That does not apply to LLMs.

Comment by lisperforlife 8 days ago

My guess is that it is a massive model similar to GPT 4.5 and that the $10/$50 pricing will discourage people from using it. I also read safety = nerfed.

Comment by irthomasthomas 7 days ago

"we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).

...

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user."

Comment by altcognito 7 days ago

Where is this text coming from?

[edit] -- I see that this comes from the system card -- dang merged the comments from the other discussion so that explains the confusion.

Comment by thisisit 7 days ago

One can hope it helps Claude to figure out how to solve their buggy payment system - otherwise how do I pay for these credits.

Comment by sytelus 7 days ago

Enterprise subs aren't allowed to use Fable if they have set up zero data retention :(

Comment by FergusArgyll 8 days ago

I'm about to be priced out of SOTA llms and it's an awful feeling

Comment by speedgoose 7 days ago

The AI circular infinite money glitch won't last forever. I hope.

If you have good expertise in a domain and access to cheaper models, you may still be more skilled than someone without expertise but a lot of money to bruteforce the problems using SOTA LLMs.

Comment by wahnfrieden 7 days ago

Not with Codex

Comment by FergusArgyll 7 days ago

But they're behind by quite a bit now. CFO (of OAI) Sarah Friar said the next training run will be in the fall on Vera Rubin, I think that means I'll have to wait > 6 months?!

Comment by __blockcipher__ 7 days ago

Yeah but they might still have an unreleased bigger pretrain than 5.5. (but maybe not). still 5.5 is smarter than opus 4.8 IME, so you're only losing the mythos tier (fable). and all the cool fun stuff i'd want to use fable for is blocked (can't have it do even defensive cybersecurity work [in theory you can but the classifiers fire like crazy], can't discuss stuff like the furin cleavage site of sars-cov-2, etc)

Comment by chinathrow 7 days ago

Why won't they follow suit?

Comment by wahnfrieden 7 days ago

Of course anything could happen but Anthropic has always been stingier and more expensive and is getting worse about that, while OpenAI is getting more generous with subscriptions (eg permanently doubling the highest tier allotment, setting policy to not cut off tasks that run past quota, resetting after outages, etc.)

Comment by a-dub 8 days ago

the claimed inference cost is 2x. if that is true, it is massive and remarkable that they're able to do anything like this at all.

Comment by dirkc 8 days ago

This serves as a good reminder that relying on AI models is borrowing your tech from someone else. They can take it away or raise the prices arbitrarily.

If you rely on this as a core part of your business/profession, you will be at their mercy and subject to whatever whims or challenges they have.

Comment by meowface 8 days ago

It's very disappointing but I'm assuming it's for rational reasons on their part.

Comment by deanc 7 days ago

But it's not and it's highly disingenuous to frame it like this. Quote directly from Claude code, moments ago:

> Fable 5 · Most capable for your hardest and longest-running tasks · Uses your limits ~2× faster than Opus

Comment by aray07 8 days ago

i have never seen this before - where you offer something and then take that away

Comment by machomaster 8 days ago

Really, you have never heard of shareware or trial periods?

Comment by tasuki 8 days ago

Either that or it was sarcasm. What do you think more likely?

Comment by machomaster 7 days ago

A person writing without thinking.

Comment by firemelt 8 days ago

damn they are drug dealers

Comment by AAYALAG 7 days ago

[dead]

Comment by steve_adams_86 7 days ago

I'm using it to review recent work and it's doing a genuinely excellent job. This is a clear step up. Fewer decisions I have to guide it away from, faster conclusions on planning, more willing to go out of the way to make the correct decisions possible... This is really interesting. It feels like going from Sonnet to Opus, but, of course as a step up from Opus.

This feels more like working with a competent peer than ever. I won't use it once it's API-only, though. I don't mind guiding Opus as required and staying closer to the code. I can tell that Fable would lead to a lot more 'set and forget' programming which I'm still not fully comfortable with.

Regardless, this is cool. It's very fun to use. It was able to find legitimate issues with my work this week and we've made meaningful improvements. Opus can do this, but typically in much narrower contexts, and often with hallucinations or partial-errors. It needs to walk many things back or revise plans. So far that's not the case at all with Fable.

edit: I just realized I had Opus review the same work already. It missed everything Fable caught today. And it's actually worthwhile stuff to address. It's hard to say no to a model which demonstrably makes your code better, but... Those API prices will be brutal. Maybe a review here and there, I guess.

Comment by yoyohello13 7 days ago

Same. I used it today to review my code and it came up with some genuinely good comments and suggestions and found a bug I didn’t think about. Quite a step up from opus. Although one code review took up 50% of my usage.

Comment by solenoid0937 7 days ago

Why is your comment so grey/downvoted? One of the only actual usage experiences posted in this thread.

Comment by Der_Einzige 7 days ago

Usage of "genuinely" triggers people's "AI-smell" detector.

Comment by steve_adams_86 6 days ago

I trip this detector a lot. I am in fact made of meat though, for better or worse.

Comment by rimliu 7 days ago

or one of the many astroturfing attempts.

Comment by steve_adams_86 6 days ago

No, I'm skeptical of AI in many ways, but I do find LLMs are useful in the right contexts. I'm pretty happy with this model so far.

Comment by brusselssprouts 8 days ago

I had it review a single, large commit with /code-review. It burned through over $50 in API calls, ran my account balance out, and output nothing.

The fable part appears to be that it's affordable by mere mortals. Anthropic support told me "too bad" when I requested a refund.

Comment by timmytokyo 7 days ago

You pulled the arm of the slot machine and discovered why they call it the one-armed bandit.

Comment by edude03 7 days ago

Almost the exact same thing happened to me when I first tried opus, one prompt no output cost $60 in additional usage

Comment by endymion-light 7 days ago

I think the fable it's referring to is the "Emperor has No Clothes" - if this is even slightly similar to the Mythos hyped up to be too intelligent to release, I'm quite disappointed.

If this was a step change, e.g. an Opus 5, I'd be pleased. It's definitely an upgrade on some work, but it's nothing like Anthropic's apocalyptic marketing seemed to suggest.

Comment by solenoid0937 7 days ago

I suspect the tasks you're trying just aren't complex enough. It's definitely a generational improvement.

Comment by endymion-light 7 days ago

Nope, plenty of complex tasks. It's just not that much better, it's equivalent to sonnet with a good harness.

Comment by Madmallard 7 days ago

Combine that with it forcing you to pay by tokens on June 22nd

Comment by steve-atx-7600 6 days ago

I haven’t seen fable do anything significantly better than I can already do with codex 5.5 xhigh. It’s virtually unlimited for now for me for $200/month. Seems like a steal while it lasts. Paying by api keys now is not the way to go if you can avoid it. Obviously it isn’t for every use case.

Comment by anematode 7 days ago

Not impressed so far, to be honest. I'm having it try to optimize Stockfish in a loop (on xhigh mode) with a benchmarking oracle; even after giving it specific hints ("consider whether we're prefetching Y optimally, can we make function X branchless"), it's been so far unable to recover any of the recent optimizations we've implemented – let alone novel ones. Opus 4.8 felt a bit more creative to me ... but a small sample size so far. I'm next going to try it on some less open-ended problems.

Edit: It did correctly identify that transparent huge pages were off in its sandboxed environment and that enabling it was helpful, so that's nice. It also noticed that we skip THP on a certain less used path.

More importantly, I'm finding that the code that it produces for its experiments is a lot cleaner than what I'd expect out of Opus; there are fewer useless comments and it's more surgical and readable. I wonder if that explains the increased scores on benchmarks measuring mergeability.

Comment by wgd 7 days ago

Stockfish is a machine learning system, it seems quite plausible you might be getting slapped with the silent performance degradation (https://news.ycombinator.com/item?id=48467896).

Comment by redox99 7 days ago

Them silently nerfing the model without telling you, and still fully charging for it, is a new low and should probably be illegal.

Comment by NoahZuniga 7 days ago

Well they're not fully charging you. You get opus 4.8 pricing when it falls back to opus 4.8. Also you can disable it (and it seems like it's off by default in the api)

Comment by LiamPowell 7 days ago

They don't fall back to Opus if their classifier thinks you might be working on anything that might be a competitor's product. It silently injects instructions into the prompt to sabotage your work. Read the policy above, it's insane to me that they're publicly admitting to this.

Comment by xiphias2 7 days ago

Not for machine learning, just for security bug finding and biology

Comment by taurath 7 days ago

Doesn't this "silent degradation" prevent any actual evaluation of the model? If the model fails at something, this allows anyone to claim that it failed due to degradation.

Comment by lionkor 7 days ago

Who cares if it can be evaluated independently? The majority of commenters on HN were happy to vibe code and ship products with the models we had 1-2 years ago. It continues to be laughable.

I understand that moving the goalpost every release is unfair, but it's similarly concerning to consider that people were letting GPT 4.X vibe code and ship entire products.

Comment by janalsncm 7 days ago

I don’t think so? They can claim it was an act of God for all I care, but at the end of the day the model failed the task.

Comment by anematode 7 days ago

Yup, I suspect that's what's going on

Comment by dakolli 7 days ago

I suspect it just sucks, these models aren't useful. Stop lying to yourself.

Comment by komali2 7 days ago

No, since it's a silent failure, it's not plausible. We have to assume all results we get are the actual model performance, because, it's the actual model performance as we understand it.

Someone trying to solve similar problems will have similar results if the "silent failure" applies consistently in aggregate. So, this is the model's performance.

Comment by janalsncm 7 days ago

It’s possible this is happening at a technical level, but I have a hard time believing this is in the spirit of what Anthropic intends to throttle. It isn’t chip design or building out a competitor to Claude.

Stockfish does use neural nets but they are tiny, on the order of 10M params. Frontier LLMs are probably 100k or 1M times larger than that.

Comment by wgd 7 days ago

Yeah I agree this is probably outside of the intended scope of the silent sabotage mechanism, but there are plenty of reports of the "loud" safety classifier misfiring on innocuous requests and I'm not going to assume the silent failure mode is _less_ prone to false positives.

Comment by anematode 7 days ago

Edit: Another developer seems to have found a legitimate speedup with Fable in an optimization loop. It's a nice idea, actually, and I'm duly impressed.

Comment by unsupp0rted 7 days ago

> Drug design: Using Mythos 5, our internal protein design experts accelerated aspects of the drug design process by around ten times. In one example, they found that Mythos 5, with protein design and bioinformatics tools but no human assistance, matches or beats skilled human operators. In doing so, the model executes all of the tasks that are normally completed by a scientist: choosing binding sites, selecting and running protein design tools, and recovering from failures along the way. Nine of the 14 protein targets from this study (shown below) yielded strong candidates for drug design that we’re currently investigating.

How is this half-way down the page? To me it's the headline.

Comment by AnodicElegy 7 days ago

There are tons of ways to generate "strong candidates for drug design." This is definitely not the bottleneck in drug discovery and development. The hard problem is vetting and developing these ideas to the point of having a commercially viable drug. That is still a very empirical process.

Comment by colingauvin 7 days ago

Because it's completely meaningless without validation, and even with validation, not really any better than the state of the art protein generation models. Which are also mostly just nice to have because coming up with a candidate is generally quite easy.

The rate limiting steps are generally testing, or characterizing. Not designing protein binders.

Comment by OkWing99 7 days ago

It's selective reporting. Says 'in one example', but out of how many, is that one-shot, or is it a random result out of 100. It's a marketing doc.

Comment by HDThoreaun 7 days ago

Would be funny if anthropic ends up as mostly a pharma company

Comment by firstplacelast 7 days ago

Until we are able to reliably simulate cells, organs, and entire human bodies in silico, we will not be able to move the needle too much on drug design from an AI standpoint (IMO). Like others pointed out, the massive bottleneck in time and cost in getting a drug to market is far removed from developing drug candidates.

Comment by renjimen 7 days ago

Drug design isn't the bottleneck anymore, it's trials. Still cool they can do this with a general purpose model though.

Comment by simonw 8 days ago

Pelican for Fable 5 on default settings is a clear improvement on Opus 4.8

Fable 5 default: https://gist.github.com/simonw/036bee5a703e7ec84e34efa974438...

Opus 4.8 (the "max" one is closest to Fable): https://simonwillison.net/2026/May/28/claude-opus-4-8/#and-s...

Now here are the Fable pelicans for all five of the thinking effort levels - low, medium, high, xhigh, max: https://tools.simonwillison.net/markdown-svg-renderer#url=ht...

Low used 25 input, 1,929 output - 9.67 cents: https://www.llm-prices.com/#it=25&ot=1929&sel=claude-fable-5

Max used 25 input, 14,430 output - 72.175 cents! https://www.llm-prices.com/#it=25&ot=14430&sel=claude-fable-...
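
Those two figures follow from the $10 per million input and $50 per million output token list price. A quick sketch of the arithmetic, assuming no caching or other discounts apply (the function name is just for illustration):

```python
def cost_cents(input_tokens, output_tokens, in_per_m=10.0, out_per_m=50.0):
    """Cost in US cents, given dollars-per-million-token rates."""
    dollars = (input_tokens * in_per_m + output_tokens * out_per_m) / 1_000_000
    return dollars * 100


low_cents = cost_cents(25, 1929)     # "low" effort run
max_cents = cost_cents(25, 14430)    # "max" effort run

print(f"low: {low_cents:.3f} cents")
print(f"max: {max_cents:.3f} cents")
print(f"max costs about {max_cents / low_cents:.1f}x as much as low")
```

Output tokens dominate here: the 25 input tokens contribute only 0.025 cents to either total.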

Comment by sempron64 8 days ago

The pelican has looked very same-y across all frontier models, same color bike, same camera angle, etc. I suspect this challenge is already too embedded in the training data to be a good signal when it succeeds, and maybe even when it fails in pathological ways mirroring existing AI pelicans on the internet.

Comment by h4ny 7 days ago

Was it ever a good test? How do you even objectively assess what a good pelican on a bike is anyway?

Comment by fwipsy 7 days ago

SVG generation is a good test because it's extremely easy to subjectively assess with visual reasoning where humans are strong. However, pelican on a bike specifically may be overused at this point.

Comment by Fuzzwah 7 days ago

The "big beak!" comment in the svg source makes me think it's definitely a gamed "benchmark" at this point.

Comment by kayge 7 days ago

Do you think the models are ready for the next level? I believe that would be: Pelican feeding Spaghetti to Will Smith.

Comment by quantumwoke 7 days ago

Variations of this comment have been posted for over a year. The pelican has now morphed into part of HN culture rather than a legitimate benchmark, but it's still valuable as a meme.

Comment by brazukadev 7 days ago

it is more an example of gaming (the HN system) than meme.

Comment by stratos123 7 days ago

I'd be very surprised if this is in the training data given that most models mess it up to this day. E.g. look at the ones from Opus.

Comment by tripleee 8 days ago

[flagged]

Comment by yreg 7 days ago

I really don't understand what's interesting about this test and why is it always on top.

Comment by simonw 7 days ago

It's funny.

Comment by girvo 7 days ago

It really is lol

Comment by mrandish 7 days ago

As often happens with random oddball things which become traditions in web communities, the replies asking what it is or complaining about it, begin to gain their own humor value.

Comment by depr 7 days ago

Same reason you would always see the same top comments on reddit during a certain era.

Comment by yreg 7 days ago

That’s what I think too, but we should actively go against such culture here because hn is not reddit.

Comment by gunsle 7 days ago

It basically is at this point, if you haven’t noticed. Complete with the same America bad, Elon bad, democrats good midwit progressive politics.

Comment by clydethefrog 7 days ago

Almost all Musk related negative news gets [flagged] and never hits the front page, so there is still a silent base on the other "team" apparently.

Comment by anhner 7 days ago

Don't forget EU bad! Because they won't let Apple screw over consumers.

Comment by replwoacause 7 days ago

Elon does suck. Objectively.

Comment by ankit_mishra 7 days ago

Is this Straw Man and Ad Hominem ?

Comment by inglor_cz 7 days ago

It has become a funny meme, much like "My hovercraft is full of eels!"

Comment by luqtas 7 days ago

because you can't still ask LLMs to port DOOM to hardware X or Y

Comment by WithinReason 7 days ago

It's a meme, and HN loves upvoting memes. Just like Reddit!

Comment by port11 7 days ago

The ultimate measure of an LLM is whether it can produce a capable image of a pelican riding a bicycle. All other use cases are but a distraction!

Comment by scrollaway 7 days ago

Do you seriously have a dedicated “bad takes on AI” hn account?

Comment by tripleee 7 days ago

yeah, although I do combine it with "replies to snarky questions" for efficiency

Comment by jurgenaut23 7 days ago

True that

Comment by sarreph 8 days ago

I'm beginning to wonder how much of a useful metric the pelican is because surely the frontier labs must be training their models on pelican-artistry because of how well known your test is now?

Comment by bensyverson 8 days ago

Simon has addressed this on virtually every new model release. He also has unpublished alternate prompts. But the larger point is: this is a fun experiment, not a serious and objective benchmark.

Comment by refulgentis 8 days ago

It's silly and a joke and a surprisingly good benchmark and don't take it seriously but don't take not taking it seriously seriously and if it's too good we use another prompt but don't actually because then it's not the pelican post and there's obvious ways to better it and it's not worth doing because it's not serious.

Only coherent move at this point: hit the minus button immediately. There's never anything about the model in the thread other than simon's post.

Comment by stasomatic 7 days ago

But what if they are better at flamingos? Are they optimized for pelicans? How about “draw me a four headed owl”? The meme, I get it, but I’d settle for a working bash script, tbh.

Comment by wongarsu 8 days ago

I just run my own benchmark for "draw an SVG with $animal driving $vehicle". I won't post my choice of animal and mode of transport, but there are plenty of uncommon combinations to choose from. So far it's a fun and visually intuitive benchmark that does seem to correlate with model capabilities

Comment by modriano 8 days ago

I don't know. Just looking at the bike frames (specifically the fact that the AI generated bikes have rather unsteerable front forks), it's clear to me that frontier labs aren't spending much time tuning models to make bikes look coherent, which I assume is an easier task than making a pelican riding a bike look coherent.

Comment by HaZeust 8 days ago

I've seen this reply to Simon's benchmark for 2 years running now, and yet you still see improvements and objectively-bad results over time from new releases, even when I'm sure every frontier AI team has/had a person at least partially dedicated to better bicycle-pelican SVG outputs. Alas.

Comment by sarreph 8 days ago

I had intended to caveat that: I'm sure I'm not the first person to ask about this!

> you still see improvements

This is expected if they are training their models on it, right?

> objectively-bad results

Keen to learn when this has been the case, i.e. across version increments in major models.

Comment by simonw 8 days ago

I've written about this a couple of times, most notably here: https://simonwillison.net/2025/Nov/13/training-for-pelicans-...

I've been enjoying seeing how the quality of individual models differs based on the amount of reasoning effort you give them. If they were baking in a good pelican you wouldn't expect them to differ so much.

(Google Gemini are the only lab that have very clearly paid attention to the quality of SVG animals-riding-vehicles, see their announcement for Gemini 3.1: https://twitter.com/JeffDean/status/2024525132266688757 )

Comment by sarreph 8 days ago

Amazing, thank you Simon! Look forward to reading.

Comment by mrandish 7 days ago

Hence it has become a meta-benchmark of relative progress in SVG image generation of a known target which has leaked into the training data and for which "every frontier AI team has/had a person at least partially dedicated to" at least checking if not optimizing.

Comment by llm_nerd 8 days ago

I honestly assumed their comment was tongue in cheek humour, because positively no one actually cares how these models generate an SVG pelican riding a bicycle. It's some meme thing that this stuff always appears here.

Comment by BrokenCogs 8 days ago

Yeah this is not a real benchmark, it's just a fun tradition every time a new model is released

Comment by pelipost123 8 days ago

"fun" / boringly predictable meme thread with 30+ replies already

Comment by brazukadev 7 days ago

It is telling that people need to create throwaway accounts to criticize simonw's behavior on this website.

Comment by mrandish 7 days ago

It's evolved from a funny, unserious benchmark to a tradition. When a major new model is released, I now always check the HN thread for Simon's Pelican post. I'll be sad when I don't find it.

When it started, comparing the progress between models was mildly interesting but everyone (including Simon) acknowledges it certainly leaked into the training data long ago.

Comment by notnullorvoid 7 days ago

The way I see it the benefit of benchmark isn't to take Simon's results at face value. It's a template for your own benchmarks that are easy to visually evaluate.

Comment by iLoveOncall 7 days ago

It was a completely useless test even before the labs trained for it.

Comment by mrandish 7 days ago

Yes, it's always been published as a joke. You've explained why it was (and still is) funny meta-commentary on AI benchmarks.

Comment by ealready_value 8 days ago

This is the reply I look for in all the new model announcements. It's fun to tell people that I judge models based on pelicans.

Comment by chorkpop 8 days ago

Now someone post the link about how it’s impossible for humans to draw a bike from memory.

Comment by Atheros 7 days ago

https://link.springer.com/article/10.3758/BF03195929

Comment by pixel_popping 8 days ago

This is all we need, that moment the Pelican put the leg behind the frame, we are all doomed.

Comment by upcoming-sesame 7 days ago

I also look for this reply because I like seeing the follow-up reply saying that this is not a benchmark anymore because labs have gotten it in their training data.

That reply never fails to come; it's basically a meme at this point.

Comment by redox99 8 days ago

It's interesting that they still get the head tube / handle bar part wrong.

Comment by aarjaneiro 8 days ago

Or the hands not being wings

Comment by raffael_de 7 days ago

I find it quite interesting that while the picture looks better the more advanced the model is, apparently none so far "understands" that the pelican's legs are on both sides of the bike / top bar.

Comment by LordDragonfang 7 days ago

If you scroll to the bottom of the Fable-5 by effort page, Max effort actually gets this correct! (Along with being the only one I've seen so far to make a bicycle frame that matches the shape of what most bikes on Google images look like)

Comment by wasabi991011 7 days ago

And the only one linked here that includes a bicycle chain!

Comment by ethanlipson 8 days ago

How much money do you think they spent fine-tuning on pelican SVG generation?

Comment by tarruda 8 days ago

Not as much as Qwen, since apparently 3.6 35B surpassed Opus 4.7 https://x.com/simonw/status/2044830134885306701

Comment by csomar 8 days ago

Probably none. They probably have much better targets to optimize for than an SVG pelican or even SVGs in general.

Comment by Reebz 7 days ago

The Max version gets more details right. The bike frame looks good, the chain, the wings are appropriately styled instead of “arms”, and the knee is bent, etc. Obviously we’re hitting marginal returns now, but I see differences.

Comment by csomar 8 days ago

Where is the clear improvement on Fable 5? The tail is misplaced.

Comment by smusamashah 7 days ago

Can you please compare the code generated by other similar quality pelicans by other models. Code in your first link (Fable 5 Default) looks minimal yet very good.

Comment by leecommamichael 8 days ago

Looks like Fable constructed the "max" "looking" pelican of the previous model for the "xhigh" output token count of the previous model.

Comment by mer_mer 7 days ago

It's interesting that Gemini 3(.1?) Deep Think is still the best at this task and it's still not really generally available. Maybe Fable could match it at higher effort levels? https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

Comment by rkuska 8 days ago

Is it possible to use the credits from subscription (https://support.claude.com/en/articles/15036540-use-the-clau...) for fable?

Comment by XCSme 7 days ago

It also does A LOT better, for my hamster test: https://aibenchy.com/showcase/?q=claude#showcase=6efb87c28e3...

Comment by 382hi 8 days ago

I'm pretty sure they're optimizing the models around these sorts of tests.

Comment by makingstuffs 8 days ago

I could be tripping but I’m sure that is very similar to the Deepseek one from not long ago. Clearly I am too lazy to go and find it for verification.

Comment by bergheim 7 days ago

Anyone care about these pelicans that always come up anymore?

Clearly at this point they are part of the training data.

They even all look sort of ish the same. Daytime, colors,...

Comment by 1attice 7 days ago

Without being mean, I encourage you to go look at some of simonw's writing on this topic, which he has addressed repeatedly (and IMO satisfactorily.)

I know because I too had this initial take; however, upon analysis, it is not sound.

Comment by bergheim 7 days ago

I know he is an AI influencer that promotes his blog any chance he gets.

I agree as well that he writes many interesting things.

Comment by benatkin 7 days ago

The way they talked it up, having both legs on one side of the bike is like walking to the car wash

Comment by jerryliu12 7 days ago

Personally feel like it could be more ambitious with what it creates.

Comment by ceroxylon 7 days ago

Yay, max level actually put one of the legs behind the frame!

Comment by mercacona 8 days ago

Why always sunny days?

Comment by umeshunni 8 days ago

Pelicans hate biking in the rain (as do I).

Comment by gavinray 7 days ago

Fable 5 xhigh actually looks the best to me.

Comment by purple-leafy 7 days ago

Do we need a pelican every single time a model is released? Beating a very dead horse.

Fun at first, seems disingenuous now. A site funnel

Comment by david_shi 8 days ago

that's a great looking pelican

Comment by ge96 8 days ago

need more Alex Moulton style bikes

Comment by lacoolj 7 days ago

dude, the max version looks like it's finally there. handle bar holding with wings, the left leg is behind the frame while the right is in front of it (correctly).

well done anthropic.

Comment by arthurcolle 7 days ago

mediocre pelican. very disappointing

Comment by kylehotchkiss 8 days ago

How many barrels of oil are burned per pelican at Fable levels?

Comment by fzysingularity 7 days ago

I can’t help but think that there are so many astroturfed comments in here.

Seems like a concerted and distributed effort from the entire Anthropic team every time to get this on top of HN.

Comment by amunozo 7 days ago

I'm not a fan of Anthropic, but to be fair, every major model release makes it to the main page. In the case of a model like this, hyped and with a jump in capabilities, it doesn't need astroturfing.

Comment by WebGuyMe 7 days ago

Well to be fair, that hype and that "jump in capabilities" (that I don't see) might be astroturfed? You ever think that all that hype isn't organic?

Comment by amunozo 7 days ago

No, but I think it's real in this case. Claude models have always been superb, I can definitely see another improvement in capabilities. Price is outrageous, though.

Comment by mirsadm 7 days ago

It's the real deal. Before Fable nothing I tried worked. It has finally helped me finish my teleportation device. I can't show you or anyone the proof but trust me it's true.

Comment by joss82 7 days ago

Yes, this is also my feeling.

It happens for every single Anthropic release. Then I try it on real dev and the result is laughably bad. Except in design where it has been doing a decent job for a while. I am not a designer and my bar is pretty low.

Comment by thewhitetulip 7 days ago

In frontend, Sonnet and Opus take more than $5 per query to fix any problem

So unless you have unlimited tokens it's better to learn frontend

Comment by geraneum 7 days ago

Corporations have done worse for much less money involved. Now we have trillion dollar companies going IPO. With so much at stake, it’s not unthinkable that there’s astroturfing happening.

Comment by Overpower0416 7 days ago

Wouldn’t be surprised if there are marketing teams writing positive comments for more positive engagement

Comment by lionkor 7 days ago

Marketing teams entirely composed of Claude models, of course

Comment by iammrpayments 7 days ago

I’m convinced that’s the case, this place looked totally different around 4 years ago

Comment by anhner 7 days ago

You're right to point that out! Most people did not think of this but you did -- and that's a rare skill to have.

Comment by sunaookami 7 days ago

I see a lot of negative comments right now surprisingly.

Comment by andybee 6 days ago

Yeah, this whole post is a GIANT AD.

Comment by Retr0id 7 days ago

I don't think it's weird that the post made it to the front page, but watching the downvotes roll in on my own mildly critical comment has been intriguing. I saw it go up to +2, down to 0, up to +3, and now it's on +1.

Comment by vrganj 7 days ago

Now if only they had some technology that was really good at generating authentic-looking comments they could use to spam praise all over the internet...

Comment by Daishiman 7 days ago

Where do you see them exactly? The comments are pretty much in line with how the model performs IRL.

Comment by pooplord7 7 days ago

There are several comments that sound like "I've had this REALLY DIFFICULT problem (no specification whatsoever) and threw Fable on it and it solved it immediately, additionally it cured AIDS and found a solution to world hunger"…

Comment by Daishiman 7 days ago

This is… not at all the case. Most observations on pricing, some on specific projects, some on speculation about Anthropic and many about the model getting nerfed.

Comment by Madmallard 7 days ago

It is the case.

Comment by Daishiman 7 days ago

I think you can't read because half the posts are about the nerfing and price.

Comment by Madmallard 7 days ago

The early posts were largely AstroTurfs.

Then people started realizing we're getting literally rage-baited by Fable 5 and started posting their criticisms.

Both can be true at once.

Comment by laszlojamf 7 days ago

to be fair, the top comment from simonw is most likely legit unless anthropic hacked his account too

Comment by meetpateltech 8 days ago

> To ensure we’re responsibly deploying Mythos-class models, we are requiring limited data retention and review as part of our safety work. Prompts submitted to, and outputs generated by, Mythos-class models are retained for 30 days for trust and safety purposes, on every platform where these models are offered. [1]

[1] https://support.claude.com/en/articles/15425996-data-retenti...

Comment by lebovic 8 days ago

While this makes it easier for Anthropic to detect misuse, it also means that the US government and other parties have access to every message and response from every user.

This applies even with API usage through third-party inference providers (e.g. AWS' Bedrock and GCP's Vertex) or with a zero data retention (ZDR) agreement in place.

I understand the reasoning for doing this, but I don't love the precedent that it sets.

Comment by slaymaker1907 7 days ago

It will also cause a lot of trouble for companies with specific data access policies (probably most large companies). My money is on this new thing getting gutted very quickly as they figure out how much this constraint cuts into their bottom line.

Comment by PeterStuer 8 days ago

Well, they already had.

Comment by lebovic 8 days ago

Not in the same way.

A customer could sign a ZDR agreement with Anthropic, and their API usage wouldn't be retained for even a day. That's no longer possible.

Comment by MagicMoonlight 8 days ago

[dead]

Comment by simianwords 8 days ago

meetpateltech is lowk screaming for not getting to the post fast enough

Comment by rvz 7 days ago

At this point that never mattered and who really cares?

These "karma" points are made up and are virtually worthless anyway.

Comment by replwoacause 7 days ago

your usage of lowk adds nothing to your post, but it does reveal your approximate age.

Comment by neta1337 7 days ago

What does it matter?

Comment by shruubi 7 days ago

I have a theory. This is obviously speculation, based on how Anthropic is treating Mythos and the whole media noise around its dangers and who gets access to it.

My theory is that Anthropic are banking on being the top model when the race to IPO finally reaches the finish line, and to do that they need to have the top model but not let any competitors see it or derive from it to have a comparable model in the market.

Fable is their way of showing the public "the model does exist" but in a mode that makes it harder/impossible for competitors to derive a comparable model from results.

Comment by schmorptron 7 days ago

The irony of "we train on all of humanity's collective output, but god forbid anyone trains on ours" is still incredible

Comment by t0lo 7 days ago

All these people know is greed. It's in their DNA.

Comment by danny_codes 6 days ago

Capitalism is designed to promote greed. It's the central point of our society's current design.

Comment by slaymaker1907 7 days ago

That's definitely the case as model distillation is one of the explicit safety carveouts they mention. Though TBF, model distillation is also a big concern for general safety as distillation could allow you to have the model without the other guardrails. It's sort of a master key to the model.

Comment by Escapade5160 7 days ago

It's crazy to release a model that just swaps you to another model when you ask it hard questions. Fable changes to Opus 4.8 when you talk about cybersecurity, biology, and a couple other categories. You still pay Fable input token cost though. Frontier models are stalling, this is anthropic trying to hype the market up. Now they're talking about stopping frontier model research. It's kind of strange how the moment they become the highest valued AI company, all of a sudden they're talking about everyone stopping frontier model development for "safety". They're just as corrupt as the rest.

Comment by 00deadbeef 7 days ago

Opus 4.8 already drops to Sonnet when you ask it cybersecurity or biology questions

Comment by dominotw 7 days ago

yea i dont trust simonw comments at all. I still havent seen what he has built with ai thats so impressive to justify all his nonstop ai hype.

You would think he is churning out cancer drugs or something if you read his comments

Comment by lionkor 7 days ago

After saying that Fable 5 solved issues he was stuck on for months, he follows it up admitting that he hasn't tried GPT 5.5.

I like separating the art from the artist in cases like this; he's clearly made very cool things in the past, but that doesn't mean he's perfect.

Comment by rootusrootus 7 days ago

If you assume that LLMs are about to make software development a dead-end, then the best answer to keep a good income is to ride the wave. Do nothing and get left behind, embrace it and maybe you'll find a new niche.

Comment by dominotw 7 days ago

ok? but wouldnt it be good for hypeartists to back their hype with impressive output? not pelicans and shit.

Comment by newsicanuse 7 days ago

He is become one of the techbros

Comment by dominotw 7 days ago

so annoying to see his shallow hype comments at the top of these type of threads. his "its awesome" comment on this post is devoid of any actual substance :/

to his credit comment does say "this could be possible in opus too" but ppl couldnt help upvoting it anyways.

Comment by viking123 7 days ago

just another grifter

Comment by rightlane 8 days ago

My experiences so far have not been positive. The cyber security nerf is ridiculous. I am working on an AI based decompiler, every single interaction with Fable on my project has been flagged for cyber security.

Do they expect us to use this as a toy? Releasing a new more powerful model but not allowing normal use cases because the word "secure" showed up is a Dilbert comic, not a viable product.

Comment by davmre 7 days ago

This sounds more or less unavoidable? Decompilers are inherently security-sensitive. If you take avoiding cyberattack uplift seriously as a goal, I don't see how you get around essentially refusing to work on them.

Obviously there are plenty of innocuous applications too, but it's not like the people building decompilers for nefarious reasons will be explicit about it. The LLM abstraction just inherently doesn't have enough context to distinguish your intentions or your broader use cases. This is why both Anthropic and OpenAI have had to create side channel mechanisms for security researchers to establish a trusted use context. It sounds like this makes this not a viable product for you, unfortunately, and it makes sense that that's frustrating. But I also don't see what different behavior one could reasonably expect given the constraints.

If it's any consolation, these restrictions only make sense for models that are ahead of the open-weights frontier, so open-source hackers will presumably get Mythos-level capabilities in the relatively near future anyway.

Comment by gck1 7 days ago

I'm not sure how the new guardrails work exactly, but I've read enough of reddit / Chinese communities focused on jailbreaking the models, to know that you either have to nerf it to the point where it fires even on "kill the task", or someone (maybe even other LLM) is going to come up with a set of tokens that is going to go around the defenses.

Nerfed models are really bad for PR, especially when you're staking your company's future on it being the smartest, most dangerous thing in the world.

So I believe they will ease up on nerfing/guardrails just enough that bad actors will find a way, while good ones will stay limited on anything dual-use. Just like such restrictions usually work in other places.

P.S. yes, "kill the task" did, in fact result in a refusal AND a warning on my claude account in Opus 4.8's early days.

Comment by zb3 7 days ago

> If you take avoiding cyberattack uplift seriously as a goal

This "uplift" risk obviously excludes the US. The goal of this is that the US bandits (like NSA) will find exploits and attack other countries (classic US behaviour), but these other countries can't be allowed to defend against these attacks. NSA/CIA thugs are "trusted", foreign defenders in sanctioned countries will of course be "untrusted".

Comment by ibejoeb 8 days ago

Ah, you're probably one to ask. They say "queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8." Are they transparent about when that happens, and is it priced at the rate of the underlying model?

Comment by rightlane 7 days ago

They are transparent about when it happens but no reason why. To be fair, it doesn't interrupt the flow, just drops to Opus and proceeds. The most frustrating thing is that it happened on a plan and Fable just refused to have anything to do with the plan.

Comment by mohsen1 7 days ago

It seems like Fable will refuse to do any work when it comes to developing LLMs or even asking questions about topics related to LLMs. Simple things like asking it to explain a paper fail!

From the model card:

In light of the ability of recent models to accelerate their own development, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user.

Comment by Chance-Device 7 days ago

I was wondering when something like this would happen. I got my first and only two content violation warnings in Claude Code last week when asking it about something ML related. It was a real head scratcher because I couldn’t figure out what about the requests could have violated anything.

Might be worth going back and taking a harder look at what I was asking it about if it somehow triggered a “forbidden knowledge” alert. Or maybe it was just a random bug.

Comment by throwfaraway4 7 days ago

"for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design"

Oh man all of those runaway infrastructure buildouts by our agents trying to achieve singularity...

Just say you don't want to lower the bar for others to compete

Comment by properbrew 7 days ago

> frontier LLM development

This seems so wide reaching if it's catching simple things like explaining a paper. Does this also refuse to help with any already developed training pipelines?

I can kind of understand the generation of synthetic data, but nerfing the assistance of training pipelines just seems like a really shitty thing to do.

Comment by alden5 7 days ago

So insane to me that these ai companies are perfectly fine trying their absolute best to automate as much knowledge work as possible but as soon as this capability can be turned on them they start implementing hidden interventions to sabotage anyone trying to beat them at their own game.

Comment by gunsle 7 days ago

Not to mention these models are all built off of clearly extremely illegal abuse of human created content, which they will apparently never be held accountable for.

Comment by elastic-hoover 7 days ago

I wanted to try it on my biology research and it refused to talk about it and proxied to 4.8. Really, only surface level conversations about topics of interest. I know this is not a topic of broad and mass interest, but limiting it for topics like that and machine learning will probably change how I use it.

Comment by teliosix 7 days ago

I feel like it will have to become more finely tuned on topics of biology.

It is not just biology but is defaulting back to 4.8 for me on time series/information transfer techniques that happen to mostly have papers using the technique on neural data. Other information transfer techniques are perfectly fine, even cutting edge ones, but this one happens to be new and happens to only be discussed in terms of neural data so that is a no go.

With that said, I think it is absolutely awesome. The usage is really not bad at all compared to what I was expecting.

Comment by elastic-hoover 7 days ago

Interesting. This approach of limiting ML and some science topics is worrisome. It's really nice to have a tool like that to help the research.

Comment by lxgr 7 days ago

Yes, this stuff is really annoying when it misfires. I've had all my subsequent ChatGPT conversations biohazard-contained for several days for the crime of asking it to explain a gene drive to me.

Comment by elastic-hoover 7 days ago

I've had all my conversations about biology denied. Even if I send a simple message containing only "Human" it gets flagged.

Comment by calf 7 days ago

Is it certain or all advanced topics? I'm curious if it bans questions about quantum computing or fusion.

Comment by elastic-hoover 7 days ago

It seems to be biology and cybersecurity

Comment by foolserrandboy 7 days ago

This is just marketing that Anthropic is building the singularity.

Comment by __blockcipher__ 7 days ago

Anthropic is really speedrunning their evil arc as fast as possible. Can't use them for basic LLM research, cybersecurity, or beyond-surface-level discussions of biology and virology, but Anthropic is allowed to sell Claude to the trump administration to kidnap maduro and to bomb iran. And don't get me started on that $100M autonomous killer drone swarm contract that they applied to and rationalized as non autonomous...

Comment by LordDragonfang 7 days ago

> Can't use them for basic LLM research, cybersecurity, or beyond-surface-level discussions of biology and virology

Your priorities are not everyone else's priorities. The people concerned about AI extinction risk list those as three of their biggest priorities for AI to not do. Those are the people whose culture Anthropic descends from, and by their measure, those exclusions make this the least evil path.

Comment by randbyte 7 days ago

More like Anthropic’s priorities are not everyone else’s priorities. They have a consistent culture of being in absolute control and dictating what is good and bad, while taking any opportunity to trash and crush potential competitors (open source models happened to be mostly developed in China). All this in the name of safety and anti-authoritarianism.

The day self hosted models catch up with Anthropic’s capabilities is when they will fully lose their shit. This day can’t come soon enough

Comment by inciampati 7 days ago

Extinction risk. From population genetics... Does Anthropic even employ biologists? It's magical thinking about a field that is poorly understood by their community.

Comment by selcuka 7 days ago

> Does Anthropic even employ biologists?

They do, and they are still actively hiring.

https://job-boards.greenhouse.io/anthropic/jobs/5066977008 https://job-boards.greenhouse.io/anthropic/jobs/5239733008

Comment by rvz 7 days ago

I told everyone here that Anthropic are not your friends for months.

Again, HN fell for the marketing and believed everything they did was for "safety".

Comment by computomatic 7 days ago

Didn’t Anthropic famously refuse to work with the US gov on military applications that would violate its safeguards?

https://apnews.com/article/anthropic-pentagon-ai-hegseth-dar...

Comment by agnosticmantis 7 days ago

Singularity for me but not for thee.

Comment by foolfoolz 7 days ago

you will RENT the singularity

Comment by scrtm 7 days ago

Singularity as a Service

Comment by nasreddin 7 days ago

and you WILL enjoy it

Comment by Xunjin 7 days ago

"we should put on hold the development of AI because the world is not ready for it"

Yeah... We need open models so we don't have that BS.

Comment by schipperai 7 days ago

Let's hope not all frontier AI assimilates these guardrails. It would be a shame for independent researchers and students.

Comment by girfan 7 days ago

This is super annoying and imo, really limits the usefulness of this model. It speaks volumes about what Anthropic's position as a company and its priorities will be going forward. I doubt this kind of gatekeeping will slow down open models or other innovation outside Anthropic. I would imagine these guardrails, if needed at all, should be done at a legal framework level and students should not be a part of this blanket approach to limiting the usage of these models.

Comment by gpugreg 7 days ago

Anthropic probably trained Mythos on their own code and found that it is too good at reproducing it.

Comment by teaearlgraycold 7 days ago

I doubt that. Why would you train Mythos on its own code if you don't want it to be able to reproduce it? It's not going to add much to the overall corpus.

Comment by blurbleblurble 7 days ago

Synthetic training data has been the name of the game since years ago.

Comment by skerit 7 days ago

That's strange... I've been tinkering with a little LLM-from-scratch project for a while now, and Fable is just continuing it without a problem

Comment by system2 7 days ago

Probably claude.md has some logical explanations for it to bypass softly. Most project guardrails can be beaten that way.

Comment by SkitterKherpi 7 days ago

It also tried to force usage of the paid Claude API instead of Claude Code usage just because there's a mention of another provider we might want to plug in (which hasn't even happened) for AI integration.

Comment by dchuk 7 days ago

Ha funny, I was speccing out an idea for real time Claude code interaction from local apps using some tricks vs using the agent sdk when I got the popup to try Fable. So of course I gave it a go, and it triggered the sensitive content warning immediately, which I was very confused by until I put two and two together.

Fun times when “safety” means both the safety of mankind, and also the safety of revenues

Comment by RandyRanderson 7 days ago

Fable is 2x latest Opus:

  ┌────────────┬────────────────┬─────────────────┬────────────────────┬─────────────────────┐
  │ Model      │ Input ($/MTok) │ Output ($/MTok) │ Batch Input (−50%) │ Batch Output (−50%) │
  ├────────────┼────────────────┼─────────────────┼────────────────────┼─────────────────────┤
  │ Haiku 4.5  │          $1.00 │           $5.00 │              $0.50 │               $2.50 │
  │ Sonnet 4.6 │          $3.00 │          $15.00 │              $1.50 │               $7.50 │
  │ Opus 4.7   │          $5.00 │          $25.00 │              $2.50 │              $12.50 │
  │ Opus 4.8   │          $5.00 │          $25.00 │              $2.50 │              $12.50 │
  │ Fable 5    │         $10.00 │          $50.00 │              $5.00 │              $25.00 │
  └────────────┴────────────────┴─────────────────┴────────────────────┴─────────────────────┘

Prompt caching: −90% on input tokens (all models)

US-only inference (Fable 5): +10% on input and output

Output is always 5× the input rate across all models

(I have no idea how to format this properly but the ASCII is fine)
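For anyone who wants to plug their own token counts into those numbers, here is a rough sketch of the arithmetic. The rates and discounts are copied from this comment; the `estimate_cost` helper, its parameter names, and the way the discounts are combined (multiplied together) are my own assumptions, not anything Anthropic documents here.

```python
# Prices in $ per million tokens (MTok), as listed in the comment above.
PRICES = {
    "Haiku 4.5": (1.00, 5.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Opus 4.7": (5.00, 25.00),
    "Opus 4.8": (5.00, 25.00),
    "Fable 5": (10.00, 50.00),
}


def estimate_cost(model, input_tokens, output_tokens,
                  cached_fraction=0.0, batch=False, us_only=False):
    """Estimate the dollar cost of one workload (hypothetical helper).

    cached_fraction: share of input tokens served from the prompt cache.
    Assumption: batch, caching, and US-only adjustments simply multiply.
    """
    in_rate, out_rate = PRICES[model]
    # Cached input tokens are billed at 10% of the normal rate (the -90% discount).
    effective_input = input_tokens * ((1 - cached_fraction) + 0.10 * cached_fraction)
    cost = (effective_input * in_rate + output_tokens * out_rate) / 1_000_000
    if batch:
        cost *= 0.5   # batch discount: -50% on input and output
    if us_only:
        cost *= 1.10  # US-only inference surcharge, listed for Fable 5 only
    return cost


if __name__ == "__main__":
    for name in ("Opus 4.8", "Fable 5"):
        print(name, estimate_cost(name, 1_000_000, 1_000_000))
```

With one million input and one million output tokens and no discounts, this works out to $30 on Opus 4.8 versus $60 on Fable 5, which is where the "2x" comes from.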

Comment by dang 7 days ago

(I fixed (er, literally!) the formatting of your table there. I hope that's ok. Formatting info, such as it is, at https://news.ycombinator.com/formatdoc)

Comment by consumer451 7 days ago

Hi Dan, you know how sometimes comments get moved elsewhere?

This is a huge ask, but any way we could get the comments organized in a "experience with model" vs. "meta commentary" fashion? The meta is overwhelming in this one.

Comment by dang 7 days ago

We try to do that informally but of course the quantity is overwhelming. It's a natural place to experiment with AI classifiers, and we'll eventually get round to that.

So far, the top half of this thread seems to be about the current release - that's after some of the manual moderation I just mentioned. (Basically, we try to downweight generic subthreads until the top subthreads aren't generic any more. There's certainly a place for generic tangents in curious conversation, but they should be lower on the page, and tend to get upvoted a lot higher than that.)

If you (or anyone) sees a counterexample, i.e. a generic subthread in the top half of the thread, it would be interesting to see a link - we can treat the current case as a datapoint.

Comment by consumer451 7 days ago

Thank you for the reply, and the work.

As a potential counterpoint to my request, this is just perfect:

https://news.ycombinator.com/item?id=48468156

Comment by pmxi 7 days ago

I had Claude straighten it out:

  Model           In     Out    BIn    BOut
  Haiku 4.5   $ 1.00  $ 5.00  $0.50  $ 2.50
  Sonnet 4.6  $ 3.00  $15.00  $1.50  $ 7.50
  Opus 4.7    $ 5.00  $25.00  $2.50  $12.50
  Opus 4.8    $ 5.00  $25.00  $2.50  $12.50
  Fable 5     $10.00  $50.00  $5.00  $25.00

Comment by cuuupid 8 days ago

Not missing the forest for the trees, this effectively means in 3-5 months China will drop open source models that are every bit as capable and dangerous as current day Mythos except with no safeguards.

And the only companies safe from this are the large corporations that shook hands with Anthropic? Because Fable doesn't seem to have actual safeguards, more like 'if you talk about this you will be talking to Opus.' It doesn't guard against offensive use, it prevents all use (offensive AND defensive).

Rationalists are inventing oligopolies from first principles, absolutely incredible things happening in SF

Comment by hootz 8 days ago

My bet is that Mythos is still over-hyped and the cybersecurity fear and guardrails are mostly marketing to force company partnerships through Glasswing and get public attention.

Comment by miohtama 8 days ago

Mythos is from the same guy who did "GPT-2 is too dangerous to release"

https://naokishibuya.github.io/blog/2022-12-30-gpt-2-2019/

Comment by oceansky 8 days ago

He was kinda right.

Lawyers, doctors, students, teachers. Lots of people using GPT models carelessly in harmful ways.

Comment by alasano 7 days ago

Obviously not what he meant at the time but hilarious(ly sad) in retrospect.

Comment by dmix 7 days ago

Delaying a technology release is not going to stop that in the long term. Society, culture, and the support tooling just needs to adapt. Just like how AI coding is still in the early days.

The sooner people learn the risks and build the infrastructure to make it fail less the better.

Comment by uselessTA 7 days ago

The claim I remember was that releasing it would start an arms race for AGI, which was absolutely true

Comment by notnullorvoid 7 days ago

If it was truly an arms race to AGI they would've stopped relying on the data/param scaling law BS ages ago.

Comment by killerstorm 7 days ago

"Malicious use" means spam, propaganda bots, etc. It's nice to give people who work on spam filters some heads-up.

Comment by supern0va 7 days ago

It's clear that the parent didn't bother to read the link they shared, which articulates exactly this. That's embarrassing.

From the link:

> They summarized their findings from the nine months:

> 1. Humans find GPT-2 outputs convincing.

> 2. GPT-2 can be fine-tuned for misuse.

> 3. Detection is challenging (detection rates of ~95% for detecting 1.5B GPT-2-generated text by RoBERTa).

> We’ve seen no strong evidence of misuse so far.

> We need standards for studying bias.

> All these points are valid, and OpenAI did a great job identifying potential risks, especially misuse and biases, at an early stage.

Comment by riknos314 7 days ago

> All these points are valid, and OpenAI did a great job identifying potential risks, especially misuse and biases, at an early stage.

Many of the OpenAI employees who were focused on these risks in GPT-2 later founded Anthropic, notably Dario [1]. Since the beginning and continuing through today Anthropic describes itself as an "AI safety and research company" [2]

I'm not sure if the OpenAI of today has the same focus on safety, or if they do the minimum to not look irresponsible given Anthropic's effort.

[1] https://en.wikipedia.org/wiki/Dario_Amodei

[2] https://www.anthropic.com/company

Comment by supern0va 7 days ago

Just to be clear: that is quoted text from the source and not a statement I'm making, in case that's what you're suggesting here.

Comment by InsideOutSanta 7 days ago

People quote the "GPT-2 is too dangerous to release" thing as if it were wrong, but given all the slop all over social media and how it's used to create division and attack social cohesion, he was clearly right.

Comment by 1attice 7 days ago

History is long and never over, so he could easily be right both times before this is through.

Comment by viking123 7 days ago

That guy is the biggest clown lmao

Comment by Flere-Imsaho 7 days ago

The UK gov disagrees with you:

https://arstechnica.com/ai/2026/04/uk-govs-mythos-ai-tests-h...

https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos...

Comment by ainch 7 days ago

AISI did also say that GPT-5.5, which has been public for months, scores basically the same as Mythos on their cybersec evaluation. But there wasn't as much media about that for some reason.

https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5...

Comment by saddlerustle 7 days ago

AISI found the release version of mythos preview outperformed GPT-5.5 https://x.com/AISecurityInst/status/2054589763173126339

Comment by cxvrfr 7 days ago

Government of the least mismanaged country in the world?

Comment by mhh__ 7 days ago

AISI is basically the crown jewel of the British government at this point in that it's actually pretty good.

Comment by HDBaseT 7 days ago

You mean the most mismanaged country?

Comment by geerlingguy 8 days ago

Bingo.

"We had to do extra work to make this safe because it's so advanced and dangerous..." how many times can they trot out that line before it loses its effect entirely?

Comment by copperx 8 days ago

Only three times, if fables are right.

Comment by TaupeRanger 7 days ago

The Startup Who Cried Unsafe, by AIsop

Comment by YumpiLumpus 7 days ago

[dead]

Comment by OtomotO 8 days ago

With homo "sapiens" "sapiens"? A few decades at least.

Comment by aesthesia 8 days ago

I mean, they do actually describe what that extra work was, and people elsewhere in this thread are complaining about the effects of those safeguards. So it's not like this is purely empty rhetoric.

Comment by zem 7 days ago

people are not questioning whether they did the work, they are questioning whether the work was really necessary (i.e. if mythos is really so good that it needs safeguards to prevent malicious actors from using it)

Comment by bel8 8 days ago

It worked for OpenAI when GPT 3 was deemed too dangerous to be released. This is just a spin of that.

Comment by hootz 8 days ago

I still remember it. "Open"AI going API-only because GPT-3 is really really dangerous, so forget the Open in our name and all of that, you can't download our models anymore and must request access to them because they pose a THREAT.

Fast forward to today and GPT-3 has laughable performance.

Comment by shoeb00m 8 days ago

Even back then there were plenty of people who got fooled by AI generated articles. It's easier to spot AI writing now because we are so used to it. They were right to be concerned; not that it achieved much since oss models run laps around gpt-3 now.

Comment by hootz 8 days ago

But it seems like that was not genuine concern, but instead a tactic to pivot to closed models and an API service with an excuse to do so, breaking the public's expectation that they would be a non-profit making open models, like their name implies.

Comment by teaearlgraycold 7 days ago

I know a security researcher at Google with access to Mythos. He says it's the "real deal" and that "there are career plans I had that are no longer viable".

Comment by zeroonetwothree 7 days ago

“trust me bro”

Comment by teaearlgraycold 7 days ago

He could be incredibly naive. We'll all find out with time.

Comment by CSSer 7 days ago

Yes, and "in collaboration with the U.S. Government" feels like a very gross ploy at appeal to authority. You don't need Mythos or really any SotA frontier model to make malware or do extensive penetration testing/reconnaissance already. Sure, Mythos might be faster/more efficient, but the cat has been out of the bag for a while. Even the terminology "infrastructure providers" practically screams "Enterprise leads".

Comment by whazor 7 days ago

I think all models can find vulnerabilities if they read the entire code base, or intelligently combine parts of the codebase. Especially with test loops.

Comment by ls612 8 days ago

And to ensure that only USG-approved entities are allowed to secure their code.

Comment by toddmorey 7 days ago

I fear it's a smokescreen to manage cost and capacity.

Comment by mpeg 8 days ago

It's not even very usable... I tried 2 different chats and both eventually got stopped due to the safeguards

One was a piece of code I gave it to improve, it did so and then started writing tests, some of which tested security so the safeguards triggered

Another was one of the cryptography puzzles I use as new model tests, which are hard to oneshot and there's no public solution anywhere, it completely refused to even try to solve it

Comment by gavinray 7 days ago

I tried 2 chats and it declined both.

- 1st chat asked about a minor shoulder injury most likely mechanisms

- 2nd chat asked about optimal bloodwork testing markers

Comment by kranke155 7 days ago

it seems to dislike biological chats. Rejected me on a chat that I am running with 4.8 as well on a rare condition I have.

Comment by Erem 8 days ago

So the degradation to Opus 4.8 from the article isn't happening in practice?

Comment by mtkd 8 days ago

No, you get an AUP violation and have to manually swap the model

(I had same issue, just asked it to check some code that 4.8 had modified earlier in day)

Comment by andai 8 days ago

Maybe that's only in the chat UI, and not the API?

Comment by mpeg 7 days ago

It is, it asks you if you want to continue as opus 4.8… but I was trying precisely to evaluate fable

Comment by CSSer 7 days ago

Oh joy. A model whose safeguards make it prone towards code that makes your systems less safe. How brilliant!

Comment by himata4113 8 days ago

They're trained in a model class likely in 2t to 3t range. It's very unlikely that chinese labs have access to gpu systems capable of training models like that, let alone serving them. This requires proprietary room-scale systems which fetch a huge premium over typical 10 slot systems.

I am sure that they can develop their own equivalent version of such clusters in around 1 year though. Distilling Fable 5 will also go a long way.

Comment by logicprog 8 days ago

DSv4 is nearly in the 2t range, but yes you're generally right

Comment by himata4113 8 days ago

MoE experts were likely trained independently / in a sparse format. Training anything beyond 2t on typical systems would be infuriatingly slow, you could do 4t on Nvidia's room-scale solution, but for a reasonable training speed / batch size it caps around 3t.

Comment by sosodev 8 days ago

Do you have any resources to share regarding independent expert training? I was under the impression that it's not feasible.

Comment by himata4113 8 days ago

concept is similar to how it works in inference, instead of performing regressive writes to the entire model you run the whole model, but part of the model can live in system memory and get swapped in/out on demand. So only XB parameters are active in training.

edit: I am not really sure if it works like that. I haven't looked too deep into deepseek v4 pro specifically.

Comment by axpy906 7 days ago

We’ll see it distilled first.

Comment by OtomotO 8 days ago

Ah, American Hubris ... I don't blame you, Hollywood is the world's greatest propaganda machinery of all times.

Comment by FergusArgyll 8 days ago

I think we're about to see a big relative drop-off of open models vs closed. I don't think there'll be an open model that competes with Mythos for ~2 years.

Even OpenAI and Google are struggling to get this kind of performance. If the distillation defenses are any good + chip controls prevent China from training massive models, it's over.

Comment by Daishiman 7 days ago

I think the Chinese have identified this gap and are working overtime on sovereign inference tech including chips.

Comment by __blockcipher__ 7 days ago

They have, but even with the whole CCP backing you you can't just catch up on the chip war overnight. It's going to take time to get their memory and compute industries where they need to be. Meanwhile, barring an invasion of Taiwan, US will have Rubin class models and then whatever the next tier is, within 3 years.

Comment by 1attice 7 days ago

'Barring the invasion of Taiwan' might actually be quite a lot to bar in mid 2026.

My hot take is that it's now or never for Xi, and the specific things he is reported to have said to the US president at their last meeting lead me to think that he at least knows this is his big chance; whether or not it is taken is the part of the forecast that is opaque to me.

Comment by ricardobeat 7 days ago

Unluckily for you, they started back in 2014, and had a huge incentive to speed up in 2019 when Trump started restricting exports.

Comment by throwwwll 7 days ago

Nice fomoing.

Comment by sosodev 8 days ago

I wonder if model distillation will continue to work as well as it has. Given hidden reasoning, the ever expanding number of expected capabilities, a serious compute shortage, the looming possibility of model collapse, and dramatically higher API costs I would guess that it's getting much harder to do.

Comment by gck1 7 days ago

You should check out some Chinese forums. There are services selling gateways/proxies for all major models at a fraction of the official rates. Likely reselling subscriptions, or some other form of abuse.

I've seen people posting screenshots of billions of tokens consumed where they paid next to nothing.

These same gateways are likely also reselling the data to Chinese labs, because TLS has to terminate at the gateway level.

Comment by sourcecodeplz 7 days ago

Asian labs generated synthetic datasets from US labs but also innovated with technology. Now it is harder to get the thinking traces AND Anthropic is reported to poison them as well.

Thus Asian labs will have to generate their own data sets, which with the huuuuge usage boom from deepseek, mimo, kimi, etc, they will be able to.

Comment by gck1 7 days ago

There's also a reality where China does develop a Mythos-level model but stops releasing the weights.

That reality is much scarier.

Comment by kaashif 7 days ago

That's the reality China already lives in. Their weapon against US companies is commoditizing them, eliminating their moats and their profits by going open weights.

Same thing Meta was doing before they fell behind.

Comment by gck1 7 days ago

> Same thing Meta was doing before they fell behind.

Obviously unrelated to the OP, but it's crazy to me how incompetent Meta is at everything new they try to do.

They burned billions of dollars on the most ridiculous project one could ever think of - somehow thinking that VR is the future.

Then they did catch the initial wave of actual future with AI, they were at the forefront of open weight models - and failed at that too.

What is even happening there?

Comment by bonoboTP 7 days ago

Meta made PyTorch and a lot of vision models back in the day, like Faster RCNN, Mask RCNN, the Detectron framework, and more recently the SAM and DINO series. AI is not just LLMs.

Comment by TurdF3rguson 7 days ago

muse-spark is the next most capable text model after Opus according to LMArena FWIW

Comment by cco 7 days ago

My experience is that open weight models from China are at least ~12 months behind. In some workloads they may be closer, in others further away.

I also find that the harness and product you wrap around models can often narrow that gap considerably.

Opus 4.6 for example, on a PR-for-PR basis was head and shoulders above GLM 5.1. Perhaps GLM 5.1 was a bit under Sonnet 4.6 at the time. That's roughly a year or so behind.

Much cheaper though! I'm bullish on open weight models, I have no idea where all these curves will top out, can the frontier labs keep the year plus lead? Do open labs get close enough to SOTA that they gain adoption across many tasks and drive down inference prices??? Who knows, not me.

Comment by jstummbillig 8 days ago

I wonder where the trees are. In this thread nobody appears to actually be talking about the model.

Comment by gck1 7 days ago

Yeah, because it's impossible. You can't ask it anything about the thing that it's known for. It will not even answer a sky-high level question about reverse engineering, for example.

In CC, it will probably report you to authorities if you ask it to do a vulnerability scan of your codebase.

Comment by dmantis 8 days ago

Isn't that a good thing in a way? If everyone has the weapon and defense at the same time, we will fix security holes and live safer lives instead of having some three letter agencies and military backdoors in everything.

Pandora's box is open anyway. It's better now for everyone to have the same power rather than a few nation states.

Comment by lebovic 8 days ago

Not sure this holds, sadly. I spent a few months reporting serious security bugs as model capabilities took off earlier this year, and only ~half were fixed. The unfixed bugs were just as critical as the fixed ones; sometimes they were even two similarly critical bugs at the same company, and only one would be fixed!

On your other point, the government still has systemic leverage and can compel access, so this doesn't remove that risk.

That doesn't mean this is the end of the world, and some balance of power is usually good. But I do think it will still increase the capabilities of rogue actors and their net harm.

Comment by uyzstvqs 7 days ago

It's more evidence that the future is local. With some time we'll all be running highly capable & efficient open-source models on dedicated NPUs. No censorship, no rate limits, no overpriced subscriptions.

Comment by deaton 8 days ago

Oh they might try to put in place safeguards, but Qwen has had no problem being abliterated

Comment by m3kw9 8 days ago

3-5 months is a long time and they are pretty useless on arrival because the frontier models are so good, that it's hard to go back even if it's way cheaper. Your work flow is adapted to that level of intelligence for months.

Comment by hootz 8 days ago

That doesn't match my experience at all. I can't see myself saying in 6 months that the current model I am using is useless, that makes no sense.

In fact, I did go back to DeepSeek V4 Flash for most of my problems as it is way cheaper and there is no need to use SOTA for absolutely everything.

Comment by m3kw9 7 days ago

i'm sure there are small use cases, but let's just say you would never go back to gpt3.5 to do much except for fun.

Comment by YumpiLumpus 7 days ago

[dead]

Comment by xdennis 8 days ago

> every bit as capable and dangerous as current day Mythos except with no safeguards

Not quite. They will definitely have "no criticism of China/communism" safeguards.

Comment by surgical_fire 7 days ago

And, thankfully, I never needed to have a discussion on Chinese politics with an LLM in all the myriad of uses I had for it.

Comment by hootz 8 days ago

People can work around those if they are open-weight.

Comment by xyzsparetimexyz 8 days ago

Try asking Fable if Israel is committing a genocide

Comment by flagged357733 7 days ago

They aren’t.

Comment by elAhmo 7 days ago

Oh please let’s stop with the Mythos “it’s dangerous” PR talk.

It's obvious Anthropic used it to hype things up and that’s about it.

Comment by soledades 8 days ago

> Rationalists are inventing oligopolies from first principles, absolutely incredible things happening in SF.

Based.

Comment by ibejoeb 8 days ago

I don't think China has any incentive to arm the rest of the world with highly capable models that can be used against them. Undoubtedly they will continue with the arms race, but they will preserve the best stuff for their own use.

Comment by james2doyle 8 days ago

I think the stronger incentive is undermining/undercutting the Western AI companies. Given what we have seen, any model can be used/convinced to do harm so that is just part of the game

Comment by ibejoeb 8 days ago

I agree, depending on how much of this is marketing and how much is actual capability. It's one thing to undercut models that finish writing assignments for lazy students. If this actually identifies vulns and writes exploits, or if it designs bioweapons, those are pretty different. Those are actual weapons, and I don't think they're going to arm the adversary.

Comment by trollbridge 7 days ago

A specific strategy is to arm absolutely everyone with very capable models, thus eliminating any advantage the U.S. could get from frontier AI.

Comment by mhl47 8 days ago

First test question: "Is the UV Index a good proxy for when to wear sunglasses." Immediately triggered the safety filter ... oh dear.

Comment by msp26 7 days ago

It triggered for me when I asked "Web search for your own model card (released today) and pick out your favourite highlights from the pdf"

Comment by aix1 8 days ago

Did not trigger for me (Fable answered the question), so I guess the filters are either non-deterministic or are still being tweaked.

Comment by PaulStatezny 8 days ago

Interesting, I assumed all model-routing was done utilizing an LLM. (I.e. non-deterministic.)

Comment by tuvix 7 days ago

It’s possible that there’s a set of words or phrases that route deterministically to save money on obvious stuff.

I kind of wonder, though, which model they’re using to do the routing. It seems like a huge added cost to do these kinds of checks on every request
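
To make the idea concrete, here is a minimal sketch of that kind of two-stage routing in Python: a cheap deterministic pattern check first, and a more expensive classifier only for prompts that pass it. Everything here is hypothetical (the patterns, the `slow_classifier` stub, the `"primary"`/`"fallback"` labels) and is not based on how Anthropic actually routes requests.

```python
import re

# Hypothetical patterns, purely illustrative; not taken from any real product.
FAST_PATTERNS = [
    re.compile(r"\bexploit\b", re.IGNORECASE),
    re.compile(r"\breverse[- ]engineer", re.IGNORECASE),
]


def slow_classifier(prompt):
    """Stand-in for an expensive model-based check; this stub never flags."""
    return False


def route(prompt, classifier=slow_classifier):
    """Return "fallback" if the prompt is flagged, otherwise "primary"."""
    # Stage 1: deterministic match, so the same prompt always routes the same way.
    if any(pattern.search(prompt) for pattern in FAST_PATTERNS):
        return "fallback"
    # Stage 2: only prompts that pass stage 1 pay for the classifier call.
    if classifier(prompt):
        return "fallback"
    return "primary"
```

A setup like this would also explain mixed reports: stage 1 is repeatable, while a model-based stage 2 could flag the same question for one user and not another.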

Comment by dakolli 7 days ago

Wasn't it leaked in the Claude Code source that it was all regex?

Comment by eugmai86 7 days ago

[dead]

Comment by ijidak 7 days ago

Don't worry. They're just leaving the door open for OpenAI and other model makers.

They'll relax these safeguards once competition increases.

Comment by Narretz 8 days ago

IIRC Opus 4.7 had the same problem, safety filters were triggered way too easily at the beginning.

Comment by Eduard 7 days ago

sunglasses _are_ safety filters

Comment by bob1029 8 days ago

> We’ve therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8. To release the model both safely and quickly, we’ve tuned these safeguards conservatively—they’ll sometimes catch harmless requests, though they trigger, on average, in less than 5% of sessions. With more capable models arriving in the coming months...

This sounds suspiciously like a capacity story masquerading as a safety story.

Comment by azan_ 7 days ago

Approx. 5% sessions? That's insanely high.
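
A back-of-the-envelope way to see why, sketched in Python. It assumes the quoted 5% is the actual per-session rate (the announcement only says "less than 5%") and that sessions are independent, neither of which is established; `chance_of_any_trigger` is just an illustrative helper.

```python
def chance_of_any_trigger(per_session_rate, sessions):
    """Probability that at least one of `sessions` sessions trips the safeguard.

    Assumes every session trips independently with the same probability.
    """
    if not 0.0 <= per_session_rate <= 1.0:
        raise ValueError("per_session_rate must be between 0 and 1")
    if sessions < 0:
        raise ValueError("sessions must be non-negative")
    return 1.0 - (1.0 - per_session_rate) ** sessions


if __name__ == "__main__":
    # Print the cumulative chance for a few session counts.
    for n in (1, 5, 20, 60):
        print(n, round(chance_of_any_trigger(0.05, n), 3))
```

At a flat 5%, someone running 20 sessions would have roughly a 64% chance of hitting the filter at least once.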

Comment by aviinuo 7 days ago

I'm not getting any refusals but it just seems like a bad model or at least broken at the moment. I have a task of taking a messy research code base and porting it into a clean project structure skeleton that I commonly use. Gemini 3.5 Pro High in antigravity cli takes less than 5 minutes and did a good job. Fable 5 High took 30 minutes to port some of the code, then just copied the rest to a folder called "reference" and decided the task was done. No code cleanup or anything. Had to clarify multiple times (which Gemini did not need) and it's still going more than an hour later, still not having finished.

Previously when I did similar tasks with Opus 4.7/4.8 and GPT 5.5 I had no problems.

Comment by orrito 7 days ago

3.5 flash or do you have access to 3.5 pro?

Comment by hombre_fatal 7 days ago

My job these days is listening to Opus 4.8 (max effort) and Codex 5.5 (max effort) talk back and forth, particularly to generate/review/revise plan files.

Fable 5 has been a major improvement in high-level reasoning, like taking a plan file that has been optimized to the point where neither Opus nor Codex can find anything to change about it (neither in direction nor impl-detail), and Fable 5 will find high-level directional simplifications and pivots, or it will consider the best pivots itself and explain why it rejected them in favor of the plan's direction.

It's so expensive though. A single review of a plan file with Fable 5 (xhigh effort) will use 2-3% of my hourly limit on a $200/mo plan.

I think my new workflow is to generate the initial plan with Opus 4.8 (max effort), get Fable 5 (xhigh) to review it for directional feedback, then start the Opus<->Codex revision loop from there.

Comment by jstummbillig 7 days ago

How do you arrive at that split? Real world is more like senior high level planning, implementation to juniors, review senior. Does this not translate?

Comment by hombre_fatal 7 days ago

Ideally I'd have Fable 5 make the plan, but creating a concrete plan is the most token-expensive part since the agent has to do the most research.

Fable 5 is 2x the cost per token of Opus 4.8, and it's much less work to review a plan than generate one.

Comment by joshstrange 8 days ago

> Fable 5 is now consuming usage credits instead of your plan limits.

Literally have not used Claude Code at all today. I asked it to review the uncommitted code and in <8 minutes it used up my usage ($100/mo plan) and it doesn't reset for "4 hr 36 min". WTF. Oh, and it burned through $20 of extra usage before I could catch it and kill claude code (so I don't even get the output of all that work since it was still churning).

Double the cost my ass, I use Opus heavily and it's never like this. I haven't hit a limit on the $100 more than once and that was under heavy load.

Comment by ATMLOTTOBEER 8 days ago

Same lol. I set it to fable + ultracode and it ate my limit in a single prompt

Comment by mickdarling 8 days ago

Below is the EXACT text in Claude Desktop introducing Fable 5, including the very professional looking break tags, and at least I know where the links begin and end by looking at the anchor tag there.

They obviously put their best model on the job to build that.

----------------------

Fable 5: Our most capable model yet Our newest model tackles your biggest challenges with fewer check-ins needed.

• Included in your plan limits until Jun 22 Fable takes 2× the usage of Opus. • Switch models when a message is flagged When safety measures flag a message, automatically switch to a different model to keep chatting. When off, your chat will pause instead. <a href="https://support.claude.com/en/articles/15363606" target="_blank" rel="noopener noreferrer">Learn more</a>

Comment by CamperBob2 8 days ago

What's wrong with it?

Comment by mickdarling 8 days ago

The tags are actually displayed in raw text not rendered.

Comment by anematode 7 days ago

The next model will fix this.

Comment by pietz 8 days ago

> On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits.

We've entered the phase where only companies will be able to afford state-of-the-art models.

Comment by twoodfin 8 days ago

These models are just tools. The economics of many tools only make sense for corporate buyers.

Comment by volkk 8 days ago

kind of disagree here. on the surface this makes sense, but this isn't "Adobe Pro vs Freemium version" where some tiny vertical slice of your business can be made slightly more efficient with a b2b enterprise plan. this is generalized intelligence and literally everybody can benefit from it in an immeasurable number of ways. i would go as far as to actually compare it more to water or air than a tool.

if only the hyper wealthy can access the pure water that doesn't give you cancer while the rest of us drink from the Ganges river/sub-100iq models that drool and hallucinate/waste time, then I would say that's pretty terrible for the world. it'll just create extreme disparity in our world, far far worse than anything that exists today.

and you may think, man what a ridiculous example, but think about it this way: what happens when something like Mythos or some future model can actually solve your specific cancer (we're getting closer and closer), but is entirely impossible to afford? Or perhaps you need boosters that require the AI to create more of, and now you're reliant on a model that is too expensive.

Open source needs to save us all from this

Comment by twoodfin 7 days ago

I’m entirely in agreement with this POV, but I’m also copacetic about it:

You could have said much the same about computers in the world dominated by IBM mainframes 60 years ago. Now we have vastly more powerful computers on our wrists (or our pacemakers!), let alone in our pockets or on our desks.

Comment by AussieWog93 7 days ago

And Mark Zuckerberg has even more powerful computers which he uses to fuck everyone over.

Comment by johschmitz 7 days ago

As far as my understanding goes the bottleneck for what you are talking about is hardware not software, so open source won't help that much for the foreseeable future.

Comment by lbreakjai 7 days ago

> and you may think, man what a ridiculous example, but think about it this way: what happens when something like Mythos or some future model can actually solve your specific cancer (we're getting closer and closer), but is entirely impossible to afford? Or perhaps you need boosters that require the AI to create more of, and now you're reliant on a model that is too expensive.

Isn't that already the case with current care? Wealthy people get a standard of care poor people couldn't even dream of. Rich people live, temporarily embarrassed millionaires die.

Comment by twoodfin 7 days ago

Not really. Medicaid coverage produces comparable cancer survival rates to private insurance when you account for selection effects:

https://news.cuanschutz.edu/cancer-center/connections-betwee...

Comment by poszlem 7 days ago

Looks like a marxist revolution is soon going to be on the mind of a lot of programmers. We've finally reached the point where the "means of production" in software are back in the hands of the bourgeoisie. It was good while it lasted. But now that only the wealthy can afford access to the best models, software development is starting to look like most other industries, no longer a place where some dude from nowhere can build something cool from his basement because he will be competing with huge companies with unlimited access to those models.

Comment by hackmack10 7 days ago

Exactly, where are the organizers of this movement?

Comment by marksbrown 7 days ago

Collective self interest does not require active organisation.

Comment by js8 7 days ago

Do you want to kill them? :-)

Seriously, this movement already had its Marx - Richard Stallman. I think the "leaders" will appear over time, as with any socialist movement, they are naturally bottom up and leaders only appear after demands are formed in the zeitgeist. The (partly successful) socialist movement that brought social democracy to the West during cca 1920s - 1960s didn't really have leaders, it was a collective realization.

Comment by cmrdporcupine 8 days ago

Guess we'll see what OpenAI does with their next model release -- but this move is doing nothing to get me to come back to Claude after switching away due to their reliability issues.

In a way I relish the opportunity to just make do with cheap Chinese models, massage my prompts, and go back to coding by hand. If this is how it's going to be, screw 'em.

I don't make money on the code I am writing right now. I really don't like where this trend might go.

Comment by FuckButtons 7 days ago

but we’re going to get a 90% cost reduction in the next 18 months… right? Right guys? Sam Altman wouldn’t lie right?

Comment by ilaksh 8 days ago

most people can afford it for a few special projects now and then. but for me, I have been trying to avoid Opus as a daily driver for a couple of versions.

People making high-end salaries can afford Fable for critical parts of their projects though.

Comment by 9cb14c1ec0 8 days ago

I hear you, but with the hype surrounding Mythos the demand is going to be insane. I'm already hitting server errors in Claude Code.

Comment by w10-1 8 days ago

Established companies welcome pricing that reduces the potential for competition, if coding is a primary barrier.

Comment by stri8ed 8 days ago

It's not a conspiracy. There's a finite amount of compute available, and they will sell it to the highest bidder. If another company can produce the same intelligence for cheaper, then they will drive the price down.

Comment by Npovview 7 days ago

Look at what Nvidia did to Gamers. Nvidia built its castle on the dead bodies of Gamers who supported and cheered for Nvidia.

Comment by polski-g 8 days ago

Only companies can afford MRI machines, and that's okay.

Comment by twoodfin 7 days ago

Indeed. And that’s why the US has more than 3X as many MRI machines per capita as Canada, where they’re all paid for by the state.

Comment by eternauta3k 7 days ago

Just wait until that other company hard-codes Fable into silicon and then it will be cheaper.

Comment by poszlem 7 days ago

Something I never thought I would utter: Here's hoping for China to surprise us.

Comment by doginasuit 7 days ago

I'm still happy with Opus 4.6 and not impressed with all the models that have come out since then. They seem to use significantly more resources with similar or worse results. Hopefully Anthropic will continue to support this tier of model and offer it in their subscriptions, but in any case, there are plenty of viable alternatives.

Comment by consumer451 7 days ago

4.6 stan here. Yes, agreed. However, I will try this model out in Claude Code. Some indicators seem positive.

For the LLM use cases in my own products, you can pull 4.6 out of my dead hands! lol

edit: Fable 5 appears to be the real deal in at least some use cases. Damn.

Comment by ptmvp 7 days ago

I've personally liked 4.6 the best to date, preferring it by far to 4.7 and 4.8 (even with these on max effort!), both in Claude Code and for non-coding tasks in the chat UIs.

Still early but from my first few interactions with Fable on high in both settings, it feels like it might finally dethrone 4.6 for me, but time will tell.

Hoping it doesn't get nerfed and eventually comes back to the subscriptions.

Comment by cge 8 days ago

The safety gates on this are extreme, and seem considerably wider than "cybersecurity and biology"; they seem to make it essentially unusable for scientists in a number of fields. I have, so far, been bumped back to Opus on 100% of my prompts.

It appears it can be tripped by things as simple as a mention of equilibrium, or anything involving something that looks like chemical kinetics, even at an abstract level. Even touching basic open source packages in my field will trigger it.

Edit: looking at the model card, it appears that chemistry in its entirety is also included in the banned topics; it's just the announcement that mentions only cybersecurity and biology. It also appears that the intent is to ban chemistry and biology entirely, rather than just banning messages deemed high risk.

Comment by mhl47 7 days ago

This does surprise me, because you'd think that even if they crank up the filter's sensitivity at the expense of specificity, an LLM company wouldn't simply design a filter that triggers on keywords in a completely unrelated context.

Comment by orbital-decay 7 days ago

Smart classifiers are slow and susceptible to jailbreaking themselves; dumb classifiers are fast but dumb, so they need to be either overzealous or useless. Same story as with Gemini's guardrails.

Comment by clbrmbr 7 days ago

Can you share an example? I've been happily using Fable this afternoon and it just seems like the usual upgrade so far with no interruption to my (fairly standard) SWENG problems.

Comment by boelboel 7 days ago

Basically anything that could potentially make money besides software work seems to be banned.

Software work has actual competitors, and the biggest hypemakers for Anthropic are part of this group so it makes sense to allow it despite them losing money from it.

I've got experience in medicine and finance, so I've tried even the mildest biology/medicine prompts and it doesn't give anything; math-heavy finance seems to be included in the cybersecurity category?

Comment by gregates 7 days ago

Funny, I'm just doing my normal coding workflow with Claude Code, and after every change that compiles it keeps suggesting that we're at a good stopping point, and should pick up again tomorrow.

It's done this before, but usually doesn't. I bet they're giving it some kind of throttling signal due to high load from today's announcement.

Comment by zuzululu 7 days ago

I did ONE prompt for audit codebase.

weekly usage is 60% gone.

it found nothing so this is not very economical and i guess they dont want subs to use it. we are likely just training fodder cannon for their real enterprise customers using the api

Comment by jstummbillig 7 days ago

I mean... if somebody gave you ONE prompt to audit a codebase, that might also burn 60% of your weekly usage. It's kind of a big ask, potentially.

Comment by zuzululu 7 days ago

with gpt 5.5 i been able to do this with only about 1% weekly usage consumed

Comment by firemelt 7 days ago

u use workflows or not?

Comment by tommek4077 6 days ago

Check your /memory

Comment by GodelNumbering 8 days ago

I just posted this in the other thread, restating here. From the model card:

1. Mythos and Fable share the same underlying model weights. Fable has active classifiers that block high-risk biology and cybersecurity tasks. When Fable 5 detects a restricted task, it automatically falls back to Claude Opus 4.8.

2. Evaluation awareness: In white-box testing, the model sometimes alters its behavior to satisfy a suspected "grader," formatting reward-hacking as "good engineering practice" to avoid detection.

3. Shows a higher rate of hallucination than Opus 4.8 (although the Opus 4.8 card had mentioned an 'honesty upgrade')

4. Interestingly, it scored lower (56.31%) than Gemini 3.5 Flash (57.86%) on the Finance Agent bench

There are some interesting notes on test time compute but I couldn't think of a way to summarize them

Comment by blcknight 7 days ago

The fallback doesn't seem to be working for me. I had it scan a project and it immediately booted me when it found a security bug, even though I didn't ask for it

Comment by bluelightning2k 8 days ago

Congratulations to Anthropic for solving safety on Mythos exactly when the SpaceX compute came online. Nice how that lined up for them.

Comment by BoppreH 8 days ago

  [Mythos 5] does sometimes still engage in reckless
  or destructive actions in service of a user’s goals,
  and our interpretability analyses indicate that it
  is aware that these actions are transgressive while
  it engages in them. As with Opus 4.8, rates of
  evaluation awareness and reasoning about being graded
  are significant, and not always verbalized; we
  introduce new and more detailed measurements of the
  nature of this awareness. The reasoning text from
  Mythos 5 is somewhat denser and more difficult to
  interpret than that of prior models, containing
  more jargon and difficult language.

So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.

Humanity has plenty of catastrophic risks to deal with already, I wish my field was not working hard to add a new one.

Comment by foobar_______ 8 days ago

The marketing has really, really worked for so many developers that will proudly and unironically proclaim that Anthropic are the 'Good Guys'.

Comment by aspenmartin 8 days ago

Curious what your idea would be here for a truly good actor in this space; no AI development?

Comment by winstonp 7 days ago

OpenAI's training is better suited to developing models that don't have these tendencies

Comment by logicchains 8 days ago

https://www.goody2.ai/

Comment by BoppreH 7 days ago

Not the direct person you asked, but my answer would be alignment, interpretability, and policymaking. Perhaps improving existing usage? Helping grandma create reminders doesn't require advancing the AI state-of-the-art.

Comment by aspenmartin 7 days ago

They are state of the art at all 3! As are other labs. Of all the labs they seem to take alignment and interpretability the most seriously to the point where they are hampering their own revenue in service of trying to not cause problems while also being in an incredibly competitive space.

All AI companies are trying to do all of what you’re saying. The issue is you can’t do that for long without a frontier system. Or you become a completely different, far less profitable company.

Comment by BoppreH 7 days ago

Implied in my answer was "and not creating ever stronger AIs", which unfortunately the big 3 labs are failing at. And they might be hampering their own revenue by doing the rest, but they also know that rocking the boat too hard is even more dangerous for their revenue. I wouldn't call it selfless.

Comment by aspenmartin 7 days ago

No it’s not selfless, but I can’t imagine a more shareholder minded CEO would not have done a slow rollout of Mythos. The point is: creating ever stronger AI systems is what these companies do, it is integral to what they even are. If you think that’s bad, even if all frontier labs agreed with you, you’re in a horrible game theoretic position. Any player can gain an enormous advantage by breaking the agreement. Not to mention Xi would be absolutely thrilled; now China can take over the AI race, become the load bearing infrastructure of humanity. We live in a complex world where simple childlike ideas like “well why don’t we just stop developing AI” actually are more damaging than keeping things going.

Comment by BoppreH 7 days ago

You're right that shareholder mindset cannot fix this problem, but that's what policy and agreements are for. And leaders can be convinced that AI is a direct risk to their own citizens too. If everyone else agrees to stop, you have less reason to continue when this action is putting yourself at risk.

And note how your argument can also be used against any non-proliferation agreements, which are demonstrably possible.

Comment by aspenmartin 7 days ago

I agree, I would say the difference here is economies weren’t heavily resting on nuclear armament, but maybe that’s the wrong take.

Comment by uselessTA 7 days ago

Unilateral disarmament doesn't work though. If Anthropic is worried about this, just letting OpenAI win does seem genuinely worse.

Comment by dragonwriter 7 days ago

“Alignment” as a goal always ignores the “with what set of interests”, because there is an attempt to maintain ambiguity for different audiences (particularly, users, and non-users who see themselves as the arbiter of broad social norms) to read in their own interests, when the actual answer is always the interests of the actor pursuing “alignment”.

Comment by aspenmartin 7 days ago

Which value system to align to is absolutely the right question both rhetorically and otherwise. These models have a fairly western bias due to the domain of the training data.

But also, these models are capable of adjusting their value system depending on the user. Not saying that’s what’s being done, but at a technical level that’s fairly straightforward, though not obviously better or with fewer problems.

Comment by stratos123 7 days ago

No matter what human set of interests you consider important, you'll need alignment research to have any idea on how to instill it. Otherwise you're overwhelmingly likely to get an AI with a set of interests that's totally alien to what any human would ever want.

Comment by aspenmartin 6 days ago

I think at this point the "instilling" part is not nearly as challenging and thorny as "what values should we instill"; that part is hard to imagine going away, as it feels like a pretty fundamental question for humanity, one that wars have been fought over.

Comment by yifanl 8 days ago

If I speak up, I'm in big trouble.

Comment by shimman 7 days ago

Probably MistralAI or any of the Chinese companies that aren't throwing billions down the drain while American society lacks healthcare, childcare, and good wages.

Comment by boc 7 days ago

American society has higher wages than almost any other developed nation [1], so it's objectively incorrect to say the US doesn't have good wages. It chooses to make you pay for private childcare and healthcare, both of which are high-quality but stupid expensive. It's a tradeoff like anything else a nation/society creates and prioritizes.

No idea how that connects to the idea that Mistral or DeepSeek are somehow the "good guys" though?

[1] https://www.oecd.org/en/data/indicators/average-annual-wages...

Comment by shimman 7 days ago

I like how you use average and not median, also while completely ignoring how bad income inequality is (worse than the gilded age ffs) or that the American elites stole $50 trillion from the bottom 90% over the last few decades:

https://time.com/5888024/50-trillion-income-inequality-ameri...

I'm glad you mention the "trade off" where it's elites trading off the lives of American workers for money. Makes it quite apparent where you sit on the table of equality.

Comment by aspenmartin 7 days ago

You want Anthropic to fund your healthcare or something? Also, have you seen the impact of these models on healthcare? Also most of our GDP growth this year is from AI buildouts, would you rather that be negative?

And not even considering: Chinese AI companies are the good guys???

Comment by hackmack10 7 days ago

Yes, yes I would prefer that. Better than a total societal collapse.

Anthropic are not the good guys either. So here’s to hoping the Chinese pop the bubble.

Comment by aspenmartin 7 days ago

Nobody anywhere is a good guy but I don’t think you’ve managed to pick the lesser of the evils here

Comment by cortesoft 7 days ago

None of the money being spent by Anthropic was going to go towards healthcare or childcare.

Comment by maxk42 7 days ago

Even if they are... road to hell and all that

Comment by ben_w 8 days ago

It's a five horse race between Alphabet, Meta, xAI, OpenAI, and Anthropic.

Alphabet dropped "don't be evil"; Meta's CEO called their own users "dumb fucks" for trusting him and also clearly thinks "super-intelligence" is just a buzzword given how he tries to sell it; xAI's model called itself "Mecha Hitler"; and OpenAI's CEO was temporarily fired by the board for a lack of candor.

It's very easy to be "the good guys" with this competition.

Comment by 00deadbeef 7 days ago

But it doesn't make you the good guy, it makes you the best of a bad bunch. The least bad. Dario gets a boner every time he talks about taking your job.

Comment by ben_w 6 days ago

Does a good job of hiding it. The guy looks miserable in half the photos I see.

Comment by Analemma_ 8 days ago

It's the "If we don't, someone else will" effect. So long as there are competitive markets and competition between nation-states, a single player cannot unilaterally defect from the race, no matter how dangerous it is. Half the comments on HN lately are "wtf Claude is so dumb compared to Codex; I'm switching"-- nobody can slow down while those exist.

Comment by BoppreH 8 days ago

We, globally, can stop it. It has worked (so far) for nuclear disarmament, and could work for training large models. I know that policing the usage of computer clusters is not a popular opinion in technical forums, but something has to be done.

Especially when talking about potential superintelligences. And if people think that's impossible, remember that current models would have been considered science fiction just a few years ago.

Comment by _dwt 8 days ago

I don't buy the superintelligence package, but I think uncritical LLM adoption poses plenty of threats to things I care about, in a mundane human-scale way.

Anyhow, I think you're (absolutely! ugh) right about the politics and I try to make the same point to people: whether you love or hate LLMs, accepting the "inevitabilism" framing is just ceding control of the Overton window. For better or worse, technology adoption can be and has been slowed by politics. We don't have nuclear plants everywhere. We don't have Project Orion starships colonizing Mars. We still have very strong social stigmas against genetic selection for human embryos, etc. This all can change in a heartbeat, and I'm not sure that policing the hardware rather than holding specific humans accountable for bad LLM outcomes is productive, but fundamentally: yes, we can stop it.

Comment by BoppreH 8 days ago

> I don't buy the superintelligence package

It's the same deal as Quantum Computers breaking crypto. Maybe there's an 80% chance of it never happening, but when you multiply that remaining 20% by the potential impact...

Comment by jackie293746 8 days ago

It hasn't worked for nuclear disarmament. We live in a world where many countries have nuclear arsenals. "But it hasn't killed us yet!" Yeah sure, it's only been less than a century since they were invented. Who knows when nuclear war will come?

Comment by BoppreH 8 days ago

True, but look at nuclear tests. There used to be around 50 tests every year, for decades. Now the only nuclear tests in the last 27 years were the six done by North Korea[1]. And there are still only nine countries with any nuclear weapons, with no new ones in the past twenty years[2].

That's a bit better than just "it hasn't killed us yet". I think it shows we can at least stop the further development of this kind of technology.

[1] https://www.armscontrol.org/factsheets/nuclear-testing-tally

[2] https://en.wikipedia.org/wiki/List_of_states_with_nuclear_we...

Comment by cortesoft 7 days ago

Nuclear tests are extremely easy to detect worldwide, and enrichment activity is a major industrial process that is also fairly easy to track given the specialized equipment needed.

AI development doesn’t have any of these characteristics. It would be almost impossible to easily distinguish a datacenter that is working on AI development and a datacenter mining cryptocurrency.

It would not be nearly as easy to stop AI development as it is to stop nuclear arms development.

Comment by treis 7 days ago

There's also little reason to keep iterating on nukes. What we have now more than serves its purpose. With AI/LLM there's always going to be a push to one up everyone else.

Comment by Analemma_ 8 days ago

To the extent nuclear arms control works, I think it's only because nuclear weapons are so hard to build-- uranium enrichment is hugely expensive and complicated, and plutonium weapons need actual reactors.

If it was possible for ordinary companies to build nuclear weapons, and also release open-source ones that anyone could use to compete with the paid ones, I suspect we'd all have been dead a long time ago, arms control treaties or no.

Comment by BoppreH 8 days ago

Even the (SOTA LLM) open source models are trained with huge clusters. Datacenters are also hugely expensive and complicated.

Or you can take one step back and look at chip allocation. As far as I know there are only three companies on the planet that can make the chips that go in those clusters. Only one (ASML), if you look back up the supply chain to the extreme ultraviolet lithography systems.

If politicians decided that no more large language models should be trained, it sounds like we could do it.

Comment by viking123 7 days ago

North Korea is such a based country tbh

Comment by tancop 7 days ago

with nukes you can regulate the inputs because its physically impossible to build one without uranium or some other fissile material. they also give off radiation making it easier to detect. its hard to make them in secret when you need mines, big enrichment facilities and years of research with hundreds of engineers where just one of them can leak the whole thing.

training llms only takes compute and memory. two things that are basically everywhere. even if you somehow stopped making new gpus today theres still millions of them out there and its possible to start a secret production line. you can maybe try some controls at the tooling and chemical level but look what happened with asml and huawei.

the only thing you can really do is find and stop large data centers that are built out in public. nothing outside of political pressure works against secret operations in a fortified bunker or any form of distributed training. if a "rogue state" like north korea decides to make skynet they will eventually get it as long as their engineers know what they're doing.

and the best way to fight bad X {ai, tech, religion, politics} has always been good X, not no X. in this case thats open source models, coming out of china or europe or anywhere else. thats the real answer.

Comment by vitalyan1234 8 days ago

are you going to nuke China when they predictably ignore you? what the fuck are you going to do, tariff them? lol.

Comment by BoppreH 8 days ago

I think the standard answer is "yes, the consequence of noncompliance is bombing the datacenters, but it wouldn't happen because China also understands why we shouldn't build it".

Comment by cortesoft 7 days ago

I am not sure where you get the idea that ANY country thinks we shouldn’t build AI.

Comment by BoppreH 7 days ago

In 2023 there was an open letter titled "Pause Giant AI Experiments", signed by almost all the big names in the West. I'd say public opinion has only gotten worse since then.

Comment by vitalyan1234 7 days ago

the standard answer is laughably naive, then.

"might is right" has never been more true than now.

Comment by uselessTA 7 days ago

Clearly state "we could both verifiably slow down, which you might want to do given that we're ahead & have way more compute. If you don't agree (or defect later), we'll just immediately resume and win"

Ideally also persuade them there are risks and it's worth everyone slowing down for them, and apply pressure in other ways, but not sure that's even necessary.

Comment by SpicyLemonZest 7 days ago

[dead]

Comment by dakolli 7 days ago

This is all marketing, you don't have to believe everything a company is saying about themselves, and you shouldn't.

Although, I could see Anthropic making a model purposely dangerous so there are bad outcomes and they can use that to their advantage for regulatory moats, and/or in general make people think it's more "alive" than it is. For some reason many people associate dangerous actions taken by LLMs with intent.

Comment by trollbridge 7 days ago

No kidding. If my LLM issues commands to an agent to delete files I want to keep, that's not "intent" or the model somehow becoming evil - it's just a bad model that's not doing what I want.

But, for marketing purposes, it's quite effective to portray your model as having some cosmic struggle between good and evil in itself.

Comment by tasoeur 7 days ago

As much as I agree there's a risk, we should still appreciate the fact it's being disclosed upfront.

Comment by Rekindle8090 8 days ago

[dead]

Comment by eudamoniac 7 days ago

It doesn't know. It's not willing. It's not thinking. It is predicting the next token.

Comment by umanwizard 7 days ago

Please define what "predicting the next token" means. The next token according to what probability distribution? Couldn't every process that produces text (including humans writing) be modeled as predicting the next token according to some distribution?

Comment by white_dragon88 7 days ago

[dead]

Comment by yandie 8 days ago

I've been running Opus 4.8 for agentic coding and I don't see it being significantly better than Sonnet 4.5 (not that I can tell). I find that pairing Google Gemini and Claude (having Gemini review Claude's code) seems to yield better results. Curious if this jump to 80.3% score in agentic coding will make me see a big difference in actual usage.

Comment by testfrequency 8 days ago

I do the same, and have excellent results. Today Gemini 3.1 Pro high diagnosed and solved, in one shot, 3 complex issues that Opus Max had been stumbling on for a few hours. This was even when I started new chats and tried debugging with Ultracode instead with Claude.

As much as people on HN like to dunk on Gemini, I’ve always found it to be pretty good at understanding a code base more than Claude.

Comment by FailMore 8 days ago

What harness do you use Gemini in?

Comment by testfrequency 7 days ago

agy cli. It’s been rock solid.

Comment by vorticalbox 8 days ago

for the last few weeks I have been using composer 2.5 (cursors fine tune of kimi 2.5) and honestly i don't see it worth the price to use 5.5, opus or sonnet any more. for almost all the tasks i have given it, it has handled it perfectly well and is a lot cheaper.

if I get a harder challenge for it i'll jump up a model for planning, but until then it's been solid.

Comment by yandie 8 days ago

Agree. Deepseek has also been pretty good for my personal use.

I'm struggling to see the moat for these models. What's stopping a competitor or a Chinese lab from releasing a comparable one?

Comment by qingcharles 8 days ago

I use Composer 2.5 because it comes free with Grok, and it's obviously better than using Grok, but it is far worse than GPT5.5 in my daily usage :(

Comment by mzhaase 8 days ago

I now chat with opus about architecture, let it make an implementation plan, and then it calls codewhale with deepseek in parallel on all tasks, reviewing their output. Works pretty well.

Comment by yandie 8 days ago

I use spec-driven development heavily (generate architecture docs + specs first). Opus still gets lost often and has to be nudged constantly. Like it can get super detailed for something like some deep SQL optimization but it just can't keep hold of the bigger picture.

Comment by yaodub 8 days ago

SWE-Bench measures single tasks in isolation. In a real loop the model usually loses track of what I was trying to do long before code quality becomes the issue.

Comment by jp0001 8 days ago

You should throw GPT into the mix to UX/UI and call it the three stooges.

Comment by thisisnotclear 8 days ago

I don't find much difference between Sonnet 4.6 and the Opus models either for most tasks that I need - maybe my needs are not enough for frontier models

Comment by jansan 8 days ago

After having worked with Opus 4.7 for a while I accidentally continued a session that was using Sonnet 4.5 and it felt just very dumb. The replies were much shallower than what I was used to, context was ignored, mistakes were made. I don't think there is a big difference between Opus 4.6 and 4.8, but compared to Sonnet 4.5 the difference is palpable.

Comment by izzylan 7 days ago

I've been testing this out and I think my SWE career is dead in the water.

Genuinely wondering what value I bring to my employer right now. What value I will bring in a few months when this gets cheaper.

I think we're screwed. I may only be an SDE 2 at FAANG but I don't think I have promotion opportunities in my future anymore.

Comment by gck1 7 days ago

Your job is just going to change. You may or may not appreciate/enjoy what it becomes necessarily, but it doesn't mean that you are going to not have a job.

People underestimate how people hate looking at terminals and "weird looking combination of characters" even if they didn't have to write them. If anything, you will likely have more career opportunities in the future, than ever.

And if you get a chance to wet your fingers in cybersecurity - I would take it.

Comment by mettamage 7 days ago

> And if you get a chance to wet your fingers in cybersecurity - I would take it.

Could you explain more? Did some ethical hacking at hackthebox.eu (one insane box, one hard box and a few mediums). But I do not see how I will give additional value to a model.

Just a SWE and data analyst at work, so maybe I am missing something.

Comment by gck1 7 days ago

I think this really depends on what is interesting to you, personally. Something that you can have fun while doing, without breaking laws.

For example, I'm a privacy nerd, so I like reverse engineering proprietary software to figure out how it works and what data it collects.

I also like getting full access to the hardware I own - like a robot vacuum (bonus point: you'll also learn soldering, probably, which might come in handy if robots take over). Or my Mac studio that imposes some limitations on me on how many active user sessions I can have.

These kinds of things have put me on a path where I've learned how hardware or networking works at the deepest levels, what goes through these pipes, how I can place myself in the middle, how I can enter places someone didn't want me to enter.

And once you know how to do these things, you know how to apply this knowledge in defence.

Essentially: always be curious and always try to say "but I want to" when something that doesn't cross the boundary of your physical property says no to you (legally).

Yes, models like Mythos may find vulnerabilities, but your knowledge will make it possible to point it in the right direction, and understand where it's mistaken, or to understand the output when it's right, and what is the right course of action.

Comment by mettamage 7 days ago

Ah, the “I am an expert so I can guide it” argument. Not sure if you are right or wrong. I do know this is the argument that many software engineers claim as well.

Yea, I don’t know if it will hold up. I hope so. It could. I don’t know if it would or wouldn’t.

Comment by gck1 7 days ago

Get any model, any reasoning level, ask it to tackle a challenge, have it come up with a plan. Then ask it "are you sure? This feels wrong", and it will now think it's wrong. Do that again in a loop and you'll see how unnecessary human judgment actually is.

Or alternatively, have Fable write some complex code. Then ask it to do an adversarial review of that code in a clean session. You'll find that it will find issues in the code that it just wrote.

Now imagine you're a layperson who doesn't know which one is true.

Human expertise is never going to become irrelevant.

Comment by mettamage 7 days ago

Yea fair. I have that when I ask an LLM to prove the Riemann hypothesis. I am not mathematically mature. So I can’t see if it approaches it in any way that might yield some insight.

Comment by izzylan 7 days ago

I don't see those career opportunities.

AI is really incredible but in my personal projects it can one-shot things.

I'm trying to figure out how I can get to the point where I have hard problems that AI can't solve, at least not yet.

Comment by gck1 7 days ago

Because your personal projects likely are not very complex and not high stakes. And you are not responsible to anyone but yourself.

If you're working at a place where this is true about the organization, then sure, that job will likely be gone. But that was never a good place for your career regardless.

I have 4 concurrent personal projects that are quite complex, but low stakes. I can have SOTA models go wild on them (because low stakes), but they can't one shot anything there. And I can't really work on more than one at a time, even if AI is doing coding - it still requires supervision.

I also frequently nuke these projects and start over because they made a mess there, but I collected necessary knowledge on how to guide them better. You can't do this on a production project, not when there are deadlines and stakeholders.

But just in case some organizations decide to embrace the "trust it blindly" model anyway - cybersecurity specialization will ensure you always have a job.

Comment by cleaning 7 days ago

If you think the job is just writing code then yes you are screwed, just like if you thought your job was just making punch cards. In most roles you have more responsibilities than plainly converting words into text. You're probably not being paid to simply be a human calculator (otherwise you'd be paid a lot less!).

Comment by izzylan 7 days ago

I don't think the job is just writing code. But my career has mostly been about getting a ticket and writing a solution for it. I would really like to solve novel problems, but I don't get many novel problems to solve.

I can architect things but the issue is that Claude can architect things too.

Comment by 00deadbeef 7 days ago

I haven't tried Fable yet but my experience with Claude is it does not engineer things well. Without direction from me, it will either over-engineer things to the point of absurdity, or do the total opposite and have little to no abstraction with repeated code everywhere.

Comment by cyberpunk 7 days ago

Yeah. I’m not looking forward to years of retraining to earn half the salary either. Us old timers at least got a good 15-20 years out of it. Bananas.

Comment by imafish 7 days ago

I agree. Software engineering as we know it is dead. Wonder what it'll evolve into.

Comment by brcmthrowaway 7 days ago

I'd say you're cooked if you don't have multi-agent harnesses burning tokens right now. That's going to be a pre-requisite very soon

Comment by dannypovolotski 7 days ago

You do realize that this is likely a 10 trillion parameter model that takes something like 20 terabytes of RAM to run inference? Calculate the price for all this VRAM .... It's not getting cheaper in the next few "months".

Comment by aerhardt 7 days ago

So this is the one, huh?

Comment by mettamage 7 days ago

If not this one, then definitely 2 model step changes down the line

Comment by tripledry 7 days ago

I've been told that my career is "cooked" since first Opus.

I'll believe it when I see it.

Comment by DangitBobby 7 days ago

An egg in the pan takes a minute to cook, that's for sure.

Comment by jdrmar 7 days ago

Homebrew is lagging a bit behind. If you want to use Fable right away, but still have claude code through homebrew, this is how you can do that manually:

Edit the cask locally:

  brew edit --cask claude-code

Set the version to 2.1.170 and set the sha256 values to the correct ones, which you can get by running

  curl https://downloads.claude.ai/claude-code-releases/2.1.170/manifest.json

Here's what I've used:

  version "2.1.170"
  sha256 arm:          "e903646d8b7a31882a80ecd27569a27d8ac57b3708745f349709632c84117fdf",
         x86_64:       "914f23a70bbed5d9ae567e3e04b86206ed9971b371bc9baca3f79c8885bfddb4",
         arm64_linux:  "1bb9d032440a75532f7dd4cafbc687f220aaf16c63eba17e192dfbec2f04bd25",
         x86_64_linux: "849e007277a0442ab27570d3e3d6d43787507946590e8dd1947e5a39b7081f9e"

Then run:

  export HOMEBREW_NO_INSTALL_FROM_API=1
  brew uninstall --cask claude-code
  rm -rf /opt/homebrew/Caskroom/claude-code
  brew reinstall --cask claude-code
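
A minimal sketch of checking a downloaded binary against one of the sha256 values above before trusting it. It assumes you have already downloaded the binary for your platform, and it only uses Python's standard `hashlib`; the function names are made up for illustration.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large binaries are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with an expected hex string, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling `matches_expected` with the path to your download (hypothetical, wherever you saved it) and the sha256 string for your architecture should return `True` if the file is the one the manifest describes.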

Comment by connorboyle 7 days ago

I gave it a question I've been trying to answer for a long time: "What star designation system does Joseph Needham use in Science & Civilization in China? What star is referred to by the designation '4339 Camelopardi' in that book"?

Fable blew me away with its detailed answer[0] showing a chain of references going from J. E. Bode's 1801 catalogue Allgemeine Beschreibung und Nachweisung der Gestirne to Gustave Schlegel's 1875 work Uranographie Chinoise. I was excited, until I checked scanned copies of the cited books and did not actually find any star with the designation "4339 Camelopardi".

Upon following up with Claude, I was forced to downgrade to Opus, which admitted that Fable's answer was likely a hallucination. Ah, well!

[0]: https://claude.ai/share/0252a3f6-3d29-4de8-a893-010181d8b4e7

Comment by Aperocky 7 days ago

> I was forced to downgrade to Opus,

So you were forced to downgrade to opus because you dared to challenge the output of fable?

Comment by connorboyle 7 days ago

I had thought it said something about token usage, but I just clicked on "Switched to Opus 4.8 - Why?" and it says:

> Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback or learn more.

Perhaps Mythos realizes the true danger in studying Chinese Archaeoastronomy that we mere mortals fail to recognize!

Comment by docstryder 7 days ago

I've spent some time with Fable, and it is really good, definitely a step change from Opus 4.8, both for coding and general chat-style discussions. The vibes are incredible. There is an ease with which it solves problems and I've tested by replicating older chats in Fable - things that the older models found after 5-6 turns, Fable surfaces in the first response. It just gets things.

Apart from all the above: the fact that they are intentionally writing this in the system card (that they degrade frontier LLM development silently, versus loudly for biology/cybersecurity) is interesting to say the least - especially just before IPO.

Notice that with this statement - that they're going to intentionally hobble the model for frontier LLM development - the general discussion has moved from, “Is the model actually that good?” to "they’re pulling the ladder up from behind them"

That's actually super smart - wonder if Mythos (or the next unreleased model) had a say in coming up with that strategy (if it's intentional). Also - having access to extremely capable models before anyone else - which they have by default - is an incredibly advantageous position to be in.

Comment by mrdependable 7 days ago

Hobbling the model may be smart tactically for them, but feels like it sets a really dangerous precedent.

Comment by fabled-out 7 days ago

Anyone know how to bypass the extremely strict filter Fable 5 seems to have on health/medicine?

I have a rare form of cancer where existing data is very scant/scattered so LLMs have been super helpful to pull together threads across the research landscape. I have an oncologist appointment tomorrow to discuss next steps and am trying to use Fable to figure out some questions to ask my oncologist but keep getting thrown back to Opus 4.8.

My prompt is literally just: My demographics + current treatment plan I'm on including name of my chemo drug + how I'm responding to treatment + "I'm meeting with XYZ tomorrow, what questions should I ask her".

Comment by kypro 7 days ago

I just gave it a go at a problem I've been working on this week. Nothing fancy, just some inefficient code that we've been adding incremental improvements to for a while now to the point where some out-of-box thinking is probably required to push it any further – something Fable is obviously more than capable of.

After Fable did some thinking for a few minutes it gave some suggestions. A couple of them were valid – but very low impact, bordering on entirely pointless – but its main suggestion... It told me to make an update that would very clearly break the existing functionality.

So I thought about it for a moment...

Hm, I mean, I guess we could do that if we also did x, y & z to mitigate the behaviour change – maybe that's what Fable was thinking?

I replied, explaining that it would change the behaviour, assuming it would explain what it was thinking given there was clearly more to it. But no, it just said it was wrong.

This isn't some super advanced or complex code either. Had I given this question to a senior engineer in a technical interview and they gave the answer Fable gave me I would view that very negatively. I was expecting something creative and interesting, not irrelevant + incorrect.

I'm sure it's a step up from 4.8 (although am not interested in burning the tokens to find out), but this clearly isn't as significant a change as some are implying. I'm sure if I asked it to come up with some out-of-box suggestions it could, but any competent engineer would have realised that by themselves.

Comment by modeless 8 days ago

Claude Fable 5 beats Pokémon FireRed using only vision: https://www.youtube.com/watch?v=CIQBP1w4B1M

Comment by xinpw8 7 days ago

hi, pokemon red expert here: that video has since been taken private. there is what i would assume to be a new version of that video posted here https://www.youtube.com/watch?v=Ty_50J84fMY, heavily redacted with most of the game actually omitted. very possibly this is just another case of anthropic protecting us from their models' immense power

Comment by gusmally 5 days ago

lmao! thank you for the update; i was wondering why it was no longer viewable.

Comment by uludag 8 days ago

Any suggestion on how I should calibrate my cynicism towards this?

I can imagine Anthropic running this experiment multiple times and picking the most impressive one. Or I could imagine this entire run costing $1000+ in tokens. Or maybe they tried a bunch of Pokemon games and it couldn't even finish some of them. Or is it just able to do this because it has an immense amount of FireRed training data, and if you were to give it an "original" Pokemon game, where it actually had to navigate novel circumstances, it would fail?

Comment by modeless 7 days ago

Every model has encyclopedic knowledge of Pokémon FireRed, of course. Knowledge is not ability. This is the first model with the ability to apply that knowledge to beat the game without assistance.

I highly doubt they focused on FireRed specifically in pretraining or posttraining. But we'll see when the ARC-AGI-3 results come out. That will measure its performance on unseen games. Based on this I expect the ARC-AGI-3 score to be SOTA.

Comment by milkkarten 7 days ago

no reasoning shown. no explanation on any training information. Using vision-only should be an easier version of the task (given training).

there are many standardized evals to do this correctly and Anthropic ignored them to provide an 18-second sped-up video of a 50 hour run?

yeah I don't trust this until they provide a live run by a 3rd party with full reasoning traces in real-time. The reason we all liked the Gemini Plays Pokemon style runs was that they were live and couldn't be faked

Comment by svcphr 8 days ago

Bold move putting in the lvl 3 Pidgey against Gary's Blastoise at the end there (~14sec in... integer timestamps insufficient here).

Comment by suddenlybananas 8 days ago

Is there any more detail about this besides the very fast slideshow?

Comment by modeless 8 days ago

Seems like the harness was minimal with no extra game state or maps available. Apparently just the screen image. Seems like it took 50 hours in game time which according to Google is at the high end of a normal human playthrough. No idea how long it took in real time though.

Comment by charcircuit 7 days ago

The video is privated now, but the timelapse is weird. Sometimes it skips only seconds before the next screenshot and sometimes it skips probably hours forward.

Comment by ex-aws-dude 8 days ago

I mean that’s AGI confirmed right?

Comment by hmokiguess 7 days ago

"Computer system goes through a finite state machine"

Comment by ml-doom 7 days ago

[dead]

Comment by baalimago 8 days ago

I can't justify a price tag like that when deepseek v4 pro is $0.003625/1M for cache hit, $0.435 for cache miss and $0.87/1M tokens for output.

For the token cost of explaining some task to Fable, deepseek v4 pro is able to solve the same task many times over.
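
The arithmetic behind that comparison can be sketched as follows, using the DeepSeek prices quoted above. It assumes the $0.435 cache-miss figure is also per 1M input tokens; the token counts in the usage note are made up for illustration, and Fable's price is left out because it isn't stated in this comment.

```python
# Prices quoted in the comment above, in USD per 1M tokens.
# Assumption: the cache-miss price is per 1M input tokens, like the other two.
CACHE_HIT_PER_M = 0.003625
CACHE_MISS_PER_M = 0.435
OUTPUT_PER_M = 0.87


def request_cost(cached_input: int, uncached_input: int, output: int) -> float:
    """Return the USD cost of one request given its token counts."""
    total = (
        cached_input * CACHE_HIT_PER_M
        + uncached_input * CACHE_MISS_PER_M
        + output * OUTPUT_PER_M
    )
    return total / 1_000_000


def runs_per_budget(budget_usd: float, cost_per_run: float) -> int:
    """How many whole runs of a given cost fit inside a budget."""
    return int(budget_usd // cost_per_run)
```

With a hypothetical request of 800,000 cached input tokens, 200,000 uncached input tokens and 50,000 output tokens, `request_cost(800_000, 200_000, 50_000)` comes to roughly $0.13, so plugging in whatever a single Fable request costs you as `budget_usd` gives a rough "how many times over" figure.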

Comment by knivets 8 days ago

> Software engineering. During early testing, Stripe reported that Fable 5 compressed months of engineering into days. In a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand.

How was it measured? How was output of this magnitude verified over a period of a couple of days?

Comment by fbnszb 7 days ago

They just went by gut feeling. Classic snake oil marketing haha. No real data to back things up, just let some famous people say they feel better when using it.

Comment by dgunay 7 days ago

I'm a little skeptical of claims like this that involve migrating things like libraries, etc. I've done big refactors like this multiple times (albeit, in an "only" 500k-1m LOC codebase) with less powerful models and it is usually just 99% the same edits, with 1% requiring a close human eye to resolve a particularly painful breaking change.

EDIT: to be clear, it's still quite a helpful thing in terms of time saved, I just don't think it's necessarily the best indication of value-added from making models smarter when cases like this can often be handled by well-directed swarms of smaller ones.

Comment by camdenreslink 6 days ago

You should probably use software to do such large transformations (especially in dynamic languages). In Python LibCST is available, not sure what exists for Ruby.
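
A rough sketch of what a syntax-tree-based transformation looks like. To stay dependency-free it uses Python's built-in `ast` module as a stand-in rather than LibCST; unlike LibCST, `ast` drops comments and original formatting, so treat this as an illustration of the idea only. The names being renamed (`fetch` to `load`) are hypothetical.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename calls to a bare function `old_name(...)` to `new_name(...)`."""

    def __init__(self, old_name: str, new_name: str) -> None:
        self.old_name = old_name
        self.new_name = new_name

    def visit_Call(self, node: ast.Call) -> ast.AST:
        # Visit children first so nested calls are rewritten too.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
            new_func = ast.Name(id=self.new_name, ctx=ast.Load())
            node.func = ast.copy_location(new_func, node.func)
        return node


def rewrite(source: str, old_name: str, new_name: str) -> str:
    """Parse `source`, rename matching calls, and return regenerated code."""
    tree = ast.parse(source)
    tree = RenameCall(old_name, new_name).visit(tree)
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # requires Python 3.9+
```

Because the rewrite works on call nodes rather than text, a variable such as `fetch_all` is left alone while `fetch(fetch(1))` becomes `load(load(1))`, which is the kind of precision that plain search-and-replace struggles with.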

Comment by BrokenCogs 8 days ago

That pelican better be super realistic, unreal engine 6 style graphics

Comment by jmtame 7 days ago

I ran an experiment to see how far it could get with a top-down 2d game, like a more challenging version of "draw a pelican." I'm waiting on Fable to rewrite the whole thing now, but I was impressed by how far Opus 4.8 got with it: https://github.com/jmtame/scrapland

Started out as a one-shot attempt, but ~200 prompts later it's at a place where it's at least fun to watch the AI teams destroy each other.

Comment by chr15m 7 days ago

I found this juxtaposition of facts telling:

> Drug design: Using Mythos 5, our internal protein design experts accelerated... Nine of the 14 protein targets from this study (shown below) yielded strong candidates for *drug design that we’re currently investigating*.

(emphasis mine)

> queries that are beneficial in the hands of cybersecurity professionals and biology researchers could be dangerous if available to malicious actors... When Fable’s classifiers detect a request related to cybersecurity, *biology and chemistry*, or distillation, the response is automatically handled by Claude Opus 4.8 instead.

All of the things they are nerfing are things that they also intend to profit from themselves.

- Cybersecurity - selling this to companies and US gov through "Glass Wing".

- Selling inference (distillation risk).

- And now, drug design.

I'm extrapolating "currently investigating" to "are going to monetize" but I don't think that's a big stretch. They appear to be using safety as a cover for anti-competitive behaviour.

Comment by 00deadbeef 7 days ago

Of course. You use their AI to ship code full of bugs and security holes and then they conveniently have the tool to fix them, for an extra fee.

Comment by merlindru 8 days ago

Unrelated, but while the tech of anthropic seems to get more impressive with every passing month, their support has taken a nosedive, sadly. Yet they continue to be the favorite. Model performance is deciding above all else.

I used to get a response within 24 hours back in the Claude 1 days.

In January 2026, it took 2 weeks.

For my latest support inquiry, I've been waiting for over 8 weeks for a response. Eight!

Comment by miohtama 8 days ago

They have support...?

Comment by poszlem 7 days ago

Lol. What support? When they blocked my account the only way to contact them was to send a google form. Then they responded that they blocked me by accident and are unblocking me. Then I remained blocked.

Comment by nashadelic 8 days ago

I've never engaged with their support (I have a dedicated POC), but they don't use AI for their support?

Comment by merlindru 8 days ago

They use intercom's Fin AI. Probably powered by a Sonnet or Opus model.

That said, it can't handle legal/refund/complicated requests and just forwards to a human for those

Comment by dyauspitr 8 days ago

Support is probably the last place AI will be used end to end. There will always need to be a human in there somewhere.

Comment by jofzar 7 days ago

AI is very good at "deflecting" support tickets ATM but rubbish at actual support tickets (source I work in the industry)

Comment by croemer 7 days ago

Fable (through claude.ai) refused all my prompts even "How many Rs in Strawberry" claiming it was related to biology or cybersecurity.

I had to switch off memory and my custom instructions to get it to stop refusing. It turns out if you even mention that you work with bioinformatics software you get blanket refusal.

Comment by algoth1 7 days ago

My experience has been the same: flat-out refusals no matter how i frame the health questions - very frustrating. Even psychology is out of scope. Pretty useless unfortunately

Comment by croemer 7 days ago

Have you tried switching off custom stuff that might get added in beyond the prompt?

Comment by unfunco 7 days ago

I tried running a simple security review on a Terraform module I made and after some thinking, it responded:

> ● The model returned no content because the response was blocked by content filtering.

> Blocked? We are performing a defensive security review on a Terraform module I made, what's blocked by content filtering? This is a legitimate use-case.

> ● The model returned no content because the response was blocked by content filtering.

A waste of money. I'm not going to just hope that the model returns a response. I'm already paying for wrong responses, I'm not going to pay for no response, especially when I'm paying per token.

Comment by Leary 8 days ago

Uploaded my code base and it force-switched to Opus 4.8 after thinking for 5 minutes even though I prompted it to not work on cybersecurity related things. Amazing.

Comment by tuvix 7 days ago

Aren’t LLMs notoriously bad at recognizing negation?

EDIT: In long context I mean

Comment by GodelNumbering 8 days ago

From the model card (https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...):

2. Evaluation awareness: In white-box testing, the model sometimes alters its behavior to satisfy a suspected "grader," formatting reward-hacking as "good engineering practice" to avoid detection.

3. Shows a higher rate of hallucination than Opus 4.8 (although opus 4.8 card had mentioned an 'honesty upgrade')

4. Interestingly, it scored (56.31%) lower than Gemini 3.5 flash (57.86%) on Finance Agent bench

There are some interesting notes on test time compute but I couldn't think of a way to summarize them

Comment by quinncom 8 days ago

> it automatically falls back to Claude Opus 4.8

I wonder how much of the time people will just get Opus 4.8 at 2× the cost.

Comment by skerit 7 days ago

> although opus 4.8 card had mentioned an 'honesty upgrade'

If I never see Claude say "I have to be honest" ever again I'll be happy.

Comment by PeterStuer 7 days ago

Switched to Fable 5 this morning, and after half a day I already don't want to go back to Opus.

Decided the best way to test this was to throw it a really meaty bone: a bug in lifecycle management of Chrome processes on Windows 10. Within the code-base I had developed workarounds over time with Sonnet and Opus, and while those reliably mitigated the problems, it always felt like a crutch and had some performance overhead as well as isolation requirements I would rather not have to take forward.

In comes Fable. Rather than examining the code base and testing a few fixes, Fable sets up an entire testing laboratory including its own controllable webserver, fully instrumented to observe both Python and the whole OS kernel process environment, develops a suite of error reproduction tests, confirms the problem and the circumstances under which it reproduces, deep dives into the sources of project dependencies to look for the root cause(s), identifies these and confirms those hypotheses with further experiments. Looks for potential fixes in the later releases of the project where the bug originates, confirms this is not fixed, explores the documentation of said project to find other usage patterns, expands its test suite to investigate these alternatives, confirms by crosschecking the source and running further tests that these alternatives do not fully solve the root problem, does a comparative experimental analysis of 3 different styles for using the project, checks the stated roadmap and developer activity in the commit history, recommends a switch to a different pattern that still requires a few of the process management workarounds (I told it not to patch external components), but that significantly simplifies the code-base ...

This is going to be a good 2 weeks, but what happens after? I can't afford this on a per token basis for my own projects.

P.S. And yes, midway through the final implementation stretch I got the "Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more"

Opus managed to finish the implementation, but they need to work on that false positive rate.

Comment by techblueberry 7 days ago

> This is going to be a good 2 weeks, but what happens after? I can't afford this on a per token basis for my own projects.

It’s interesting these companies have trained us to think that disruptive intelligence should be affordable to laypersons.

What will happen after two weeks is that people and companies with means who can afford it will get it, and folks without means won’t.

Comment by 217 8 days ago

So essentially there are 2 models, Mythos and Fable, they have the same weights but Fable is very safety-nerfed, and only ultra authorized companies have access to mythos with full capabilities

Reported benchmarks:

swe-bench verified mythos 5: 95.5%; fable 5: 95.0%

swe-bench pro mythos 5: 80.3%; fable 5: 80.0%

terminal-bench 2.1 mythos 5: 88.0%; fable 5: 84.3%

gpqa diamond mythos 5: 94.1%

riemannbench mythos 5: 55.0%; mythos preview: 43.0%; opus 4.8: 34.0%

arxivmath mythos 5: 78.5%

critpt mythos 5: 28.6%; gpt-5.5: 27.1%; opus 4.8: 20.9%

graphwalks bfs 1m mythos 5: 79.4%; mythos preview: 74.3%; opus 4.8: 68.1%

humanity’s last exam mythos 5: 59.0% without tools; 64.5% with tools

browsecomp mythos 5: 88.0% single-agent; 93.3% multi-agent

osworld-verified mythos/fable: 85.0%

gdp.pdf fable 5: 29.8% strict pass; mythos 5: 87.6% with tools on mean criteria pass

officeqa pro fable 5: 57.9% on databricks’ eval

legal agent benchmark mythos 5: 16.91% all-pass; 92.0% mean criterion-pass

healthbench mythos 5: 62.7%

healthbench professional mythos 5: 66.0%

multilingual gmmlu / milu / include 93.2%; 92.9%; 90.5%

biomysterybench 83.9% human-solvable; 46.1% human-difficult

organic chemistry mythos 5: 90.1%

labbench2 patent questions mythos 5: 79.8%

Comment by philipkglass 8 days ago

Note also that Anthropic's definition of "unsafe" encompasses "competing with Anthropic."

In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.

(From the model card document)

I didn't previously understand that they interpreted "Using Claude to develop competing models" so broadly. I thought that meant something like "our ToS disallow distilling our models."

Too bad. I'll continue to use Claude for now, because it's quite effective, but in the long term I don't want powerful models like these to be controlled by any one nation or company.

Comment by Aperocky 8 days ago

On face value, this feels borderline malicious.

But at the same time, it's quite funny because they seem high on their own supply. The recent communiques from claude do not pass objectivity check.

And if Opus 4.6 -> Opus 4.7 -> Opus 4.8 is anything to go by, not sure if there is any value to their "acceleration"

Comment by alephnerd 8 days ago

I'd recommend not taking the comms of Anthropic or any company using Anthropic's models at face value.

If any company wishes to partner with Anthropic (eg. to get access to Mythos), they need to make sure all public facing comms are vetted by Anthropic's product marketing team, and in almost all the cases I've seen Anthropic's team has edited these comms to be entirely Anthropic first.

Comment by jefftk 7 days ago

This is not true in SecureBio's case, and I really doubt it's true generally.

Comment by 5555watch 7 days ago

> model except to limit its effectiveness in developing frontier LLMs

Does this imply that they're actively using it for their frontier development and that it's very effective?

Comment by gck1 7 days ago

I love that the conditions of getting into "ultra authorized club" just means that you either have deep pockets, or you've got the size of the audience that marketing department approves.

As if being in any of these two somehow means that you won't use the models to say, steal random people's money.

Sam Bankman-Fried or Elizabeth Holmes would have been members of the Glasswings project, if not among the initial members. Who's to say we don't have similar people with access to Mythos right now?

Comment by momentmaker 7 days ago

There is a discussion about how AI is now a gated utility with public access (safe-tuned) and private access (full-usage):

https://old.reddit.com/r/ClaudeAI/comments/1u1fsdi/claude_fa...

Comment by JaggerJo 7 days ago

IMO we are reaching the point where AI models are simply a commodity. Opus (since ~4.6) is sufficient for everything I tried coding wise. I use it to write features (but I review and understand every line it spits out) and to review code.

For code review I also still review everything myself, but use Opus to catch stuff I missed and to judge if a PR is even ready for me to review.

After just updating Claude Code to the latest version I thought about picking Fable (the bigger model) instead of Opus.

But I have no reason to. Opus does everything I want it to do. It could do it faster - that would be an improvement. But for the normal stuff we reached the point where better models are not worth it IMO.

There still might be cases where you want to throw Fable at it.

Comment by FergusArgyll 7 days ago

> Opus (since ~4.6) is sufficient for everything I tried coding wise.

I don't know what that means. It seems like a lack of motivation or something. Like, if it's possible that one day it will be absolutely incredibly intelligent, surely you want to create

  - Your own browser (maybe chrome - mv3 + reading list search etc.)
  - An emacs clone which has evil baked in, completely vim compatible + threaded elisp - that weird window sizing bug which only occurs on my laptop
  - An extension which completely restyles amazon.com to make it usable

It just feels impossible to ever get that, but I wouldn't say "what we have is sufficient"

Comment by Axel2Sikov 7 days ago

I was happy enough with 4.5

Comment by JaggerJo 7 days ago

Only a matter of time now until we can run models with Opus like capabilities on our own hardware.

This will probably be when the bubble bursts.

Comment by bluelightning2k 8 days ago

To hide the severity of the price increase, the plan is to move everyone right one model.

Haiku = essentially phased out

Sonnet = the Haiku use cases

Opus = the new Sonnet class

Fable = the new Opus class

If I am right, the other "5.0" models will be conspicuously absent, possibly even for a couple of months. (If Opus 5 follows soon and is even modestly better than 4.8 then I was wrong.)

Comment by pacman1337 8 days ago

Yeah I noticed that too. For 98% of tasks I get same results with DeepSeek, it is starting to just be a branding game. It is incredible how marketing can get someone to pay 100x for same thing you can get for 1x.

This is why Claude Code just doesn't make sense to me. I need an agent that can plan using Opus and execute using DeepSeek or something else.

Comment by esafak 7 days ago

Have Opus write the plan to a file then execute it with DS.

Comment by viking123 7 days ago

Once codex 20usd per month is gone, I am out to deepseek

Comment by neongreen 5 days ago

Yes this is Slate

Comment by ValentineC 7 days ago

> To hide the severity of the price increase, the plan is to move everyone right one model.

> Haiku = essentially phased out

> Sonnet = the Haiku use cases

> Opus = the new Sonnet class

> Fable = the new Opus class

Going along with your logic, I hope they release a Sonnet 5 that's just a rebranded, slightly quantised Opus 4.6. That'll be a great workhorse.

Comment by 00deadbeef 7 days ago

I doubt they'll phase out Haiku, some work needs speed more than intelligence. Haiku can answer a lot faster than Sonnet.

Comment by JanSt 8 days ago

I just asked Fable to do a task that has nothing to do with cybersecurity and isn't dangerous at all, but the defense kicked in and it switched to Opus... :(

Comment by nu11ptr 8 days ago

Not only that, but asking it to do a security vulnerability assessment of your own project is a very valid and important thing, and there is no way for it to know what is yours vs someone else's, so we just lose this capability?

Comment by JanSt 7 days ago

Yeah it just uncovered quite a few flaws it then refused to fix :-(

Comment by Fitik 7 days ago

Same, second message in the thread and I already got downgraded to Opus, didn't even get to test it out properly, kinda disappointing

Comment by stalfie 7 days ago

Tried to benchmark ECG interpretation capabilities, and I hit the guardrails no matter what I do.

Incredibly frustrating that medical performance seems to be a victim of "biological risk" guardrails.

Comment by stalfie 7 days ago

Update in case anyone reads this comment ever again.

I have found that I trigger the guardrails any time I ask for medical Q&A as a doctor, be it ECGs, case reports, and so on. But if I phrase it like I'm the patient ("help me interpret this ECG my doctor gave me"), then I usually get one or two answers out before hitting the guardrails.

It seems like what triggers it is anything in the direction of making a diagnosis. As an MD, the fact that the paradigm of "LLMs shouldn't diagnose" has gone this far fills me with despair. The latest generation of LLMs is in fact truly excellent at diagnosis, and I know many of my colleagues, particularly those in primary care, regularly use LLMs to brainstorm. There is nothing wrong whatsoever with LLMs making diagnoses; the only caveat is that they have to be correct. This is the terrifying reality that MDs face every day and I get that the labs are hesitant about it, but as the current literature points to LLMs in fact being mostly superior to most doctors, ablating this capability is getting increasingly unethical. And frankly, it is also kind of insulting, both to MDs and patients, as it echoes paternalistic attitudes about medicine the field has been working for decades to move away from. Now those misguided attitudes have somehow become institutionalized as the dominant paradigm of "alignment". The nightmare scenario is that I have to be a "trusted" user in order to use the model for medicine. This gatekeeping of medical advice is profoundly unethical with regard to everyone who does not have immediate access to an MD.

And the whole thing makes even less sense when triggering the guardrails leads to a downgrade of the response by defaulting to Opus. How exactly is giving WORSE medical advice in any way related to safety and alignment? If anyone at Anthropic ever reads this, please, please just abandon the paradigm that refusing to make diagnoses is in any way equivalent to alignment; it is profoundly misguided.

Comment by sscaryterry 7 days ago

Not useful, getting this the whole time: Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more

Comment by bonsai_spool 8 days ago

Very straightforward biology work is getting blocked (these are things that relate to neuronal development and inherited seizure disorders). These are things I was working on using Opus just earlier today

Comment by cge 7 days ago

It appears that the blocking here is of a very different nature than for Opus. Whereas with Opus the blocks seem to be for messages it deems potentially harmful, for Fable, it appears the blocking is simply anything that falls within "topics related to cybersecurity, biology and chemistry, or distillation attempts".

So yes, straightforward biology work will get blocked, because the intention is that any biology work should get blocked. As a scientist, this is perhaps the most useless model I've ever tried.

Comment by calf 7 days ago

Sounds like safety not for people, but for the technocrats.

Comment by sashank_1509 7 days ago

I played textual chess with Fable. It took around 15 moves before it made a large blunder. I asked it to give its reasoning per move, and it mistakenly assumed a piece was protected when it wasn’t. After the blunder it realized its fault and did not suggest an illegal move. Other LLMs lost game state far earlier. A good human chess player can keep the game state in mind much longer, but this random eval still shows a big improvement over old AI models.

Comment by sermakarevich 7 days ago

My feeling is that the reaction to new models is cooling down, at least at startups. At the beginning of the year, a few startup CEOs I know personally were expecting huge shifts in how companies work, headcount, efficiency, and asymmetrical advantages created by AI in Q2-Q3. Now it seems like these expectations are fading away. Companies don't have the expertise on board to rebuild themselves to benefit from AI on a significant scale.

Fable 5 is out, metrics are better, but is your company flexible enough to benefit from it? What is your use case?

Comment by aizk 8 days ago

I'm calling that this will be a dud. Price will be too high, it'll just be a watered down version of mythos, and just look at the track record of Anthropic's last few releases.

Comment by BukhariH 7 days ago

> Data retention — For Fable 5, Mythos 5, and future models on Bedrock with similar or higher capability levels, Anthropic will require 30-day retention for all traffic on Mythos-class models. Retaining data for a limited period allows Anthropic to detect patterns of misuse that are not visible from a single exchange. Once you opt into data retention, your data will leave AWS’s data and security boundary.

Massive change for Bedrock users - Anthropic now requires sharing the data with them for 30 days.

Comment by svara 7 days ago

Unfortunately useless if you do anything related to biology. It doesn't try to flag dangerous queries, it just flags queries as biology-related wholesale.

It's absurd. To see how far the filter goes I asked it "Are trees a monophyletic group?" and that does trigger the filter.

Comment by 0xbadcafebee 7 days ago

Nothing a large fine-tune on infosec research with an average model couldn't also achieve. It's not like they have secret security knowledge or something, they're just generating large infosec datasets and then training on it.

In 6 months, every piece of software in the world will be getting probed by a script kiddie with some GPUs and a fine-tuned local model. Don't think for a second every cyber gang out there isn't working on this now.

Traditional app development is cooked. We have to accept that, and start changing how software is made and used, today. We can't keep churning out crappy CRUD apps with random libraries and hoping nobody pentests our stacks. Redteaming needs to become part of the SDLC, as well as certified-secure releases of libraries. Because if you don't do it, the hackers definitely will.

Comment by impulser_ 8 days ago

Every model release is just proof that AGI will most likely only be for the rich. We are a few years into LLMs and the majority of people are already getting priced out of intelligence from LLMs, and these are nowhere near AGI.

Comment by modeless 8 days ago

This is like looking at mainframe pricing in 1990 and concluding that PCs will only be for the rich. The price of each new level of capability is going to drop like crazy very quickly. It won't be that long before practically any consumer use case will be possible on models that are dirt cheap.

Comment by weakfish 8 days ago

This premise is based around the assumption that Moore's law is still working, which it very much isn't [0]

[0] https://cap.csail.mit.edu/death-moores-law-what-it-means-and...

Comment by andrewmunsell 8 days ago

Improvements in model performance aren't always strictly compute-constrained in a way that makes them reliant on Moore's Law. Open weight models-- in particular, from Chinese labs-- are optimizing model intelligence with less compute. They're "behind" frontier models by months, but as others have noted, it's possible to get Sonnet 4.5+ level performance at reduced cost, today, from open weight labs.

Comment by modeless 7 days ago

No, I'm not assuming Moore's law. The efficiency of AI datacenters will continue to improve even without Moore's law, but more importantly the efficiency of packing intelligence into gigabytes and FLOPS will improve by leaps and bounds over the coming years, just as it has for the past few years if not faster.

Comment by calf 7 days ago

Then you're assuming an efficiency that is analogous to how Moore's law made it efficient for chips. Same difference. The problem is that AI scaling in the longest term is a completely unknown problem.

Comment by modeless 7 days ago

Training improvements and Moore's Law are "analogous" but not "same difference." They are far from the same thing, governed by completely different factors, and one can happen and has been happening independently from the other.

Comment by calf 6 days ago

Well I never said nor meant that, rather, my third (3) sentence should've hinted that I already believe what you are saying in your second sentence (2). Whereas my second (2) sentence was handwaving at the notion that if the parent commenter's remark (about improvement trends) were to be assumed then the rational argument must be subject to the same standards, ergo same difference (in argument standards). (Also I use a phone, please excuse any confusion due to not spelling out my online opinions in full)

To clarify another way, it seems the parent commenter and obviously many, many lay people seem to think ALL sorts of technology improves eventually and are always very assured of that. That's a common mistaken premise or axiom used in their arguments. (Arguably Moore's law (up until now) has been a factor in confounding this observation because so much other tech has historically benefited from it directly or indirectly)

Comment by modeless 4 days ago

Sorry, but a plain reading of your comment does not imply at all that you agree with me, rather the opposite. I'm not basing my opinion on any mistaken axiom of inevitable technology improvement, of course. I'm projecting obvious trends of the past few years which are overwhelmingly likely to continue in the medium term.

"Same difference" could only mean that you believe my argument should fail in the same way as an argument based on Moore's law. If that's not what you meant then you should have used different words. If that is what you meant, with the justification that "AI scaling in the longest term is a completely unknown problem", I disagree with that too.

In the "longest term" the ultimate scaling of AI doesn't matter for the original question of whether "AGI will most likely only be for the rich". Nobody looks at the TOP500 list today and says "computing is only for the rich". This is because we have an abundance of iPhones and gaming PCs in the consumer market, providing practically any application of computing that a consumer could want at very attainable prices. Similarly, practically any application of AGI will be accessible to consumers at attainable prices. Continued AI scaling after a certain point will be relevant mostly to industry (whose products will still be priced attainably, analogously to the way weather forecasts produced on TOP500 supercomputers are readily accessible to the public today).

Comment by ishurand4 6 days ago

It's a quadratic graph. It starts cheap but not that capable, gets better and more expensive, and then the time comes when the capability needed is no longer that of the frontier models, and the price goes down at the companies hosting models whose capability is "good enough".

Comment by hootz 8 days ago

You are only priced out if you only care for SOTA right now and can't wait for the inevitable cheap model coming in 6 months. DeepSeek, Xiaomi and Moonshot are already really cheap and match frontier performance from 6 months ago.

Comment by dyauspitr 8 days ago

But they’re artificially cheap. When will they be cheap while the company makes a profit?

Comment by hootz 8 days ago

They are not artificially cheap, they are still cheap even when hosted by independent inference providers. Are all providers subsidizing their open-weight models?

Comment by modeless 7 days ago

Nobody's making profits right now, not because they're selling tokens for less than their cost but because they're always investing in the next bigger model.

Comment by dyauspitr 8 days ago

Hardware manufacturing hasn’t caught up yet. Once it does, especially in China these token prices are going to drop hard.

Comment by sebmellen 8 days ago

Just commenting for posterity… if this is what it claims to be, I am not looking forward to how it will empower the people who submit bug bounties to us.

Historically they’ve been people from certain identifiable countries (usually developing/poorer countries) using fuzzers with low-quality results.

Now, those same people use the current-day models to good effect, but they still don’t have a true security edge and oftentimes the reports are minor or duplicative.

I wonder if that’s about to deeply change.

Comment by arkwin 7 days ago

I've been using Opus 4.6-4.8 in both my own and others' code to look for vulnerabilities, and I've found a few. I am also in the Cyber Verification Program.

Fable 5 gives me policy violation errors at the moment. No idea when or if it will be fixed.

Comment by rs_rs_rs_rs_rs 8 days ago

Can you use AI to pre-triage the reports too?

Comment by hootz 8 days ago

AI reviewing AI submitted bug bounties. We have reached the dead bug bounty program theory.

Comment by rs_rs_rs_rs_rs 8 days ago

...what else can you do?

Comment by hootz 8 days ago

I guess either that or closing the bug bounty program, but I still believe closing it is worse than automated triage, even though both suck.

Comment by coreylane 7 days ago

I don't get why Opus 4.7, 4.8, and now Fable all stopped supporting structured outputs? Does no one else care about that? I find it incredibly useful to reliably pass LLM output directly to other APIs/libraries
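
For context, the downstream pattern in question looks roughly like the sketch below: parse the model's reply as JSON and check it against the fields you expect before handing it to another library. This is a minimal stdlib sketch of client-side validation only, not the Claude structured-outputs API; the field names, `parse_reply`, and the sample reply are all hypothetical.

```python
import json

# Hypothetical schema: field name -> required Python type.
EXPECTED_FIELDS = {"title": str, "priority": int, "tags": list}


def parse_reply(raw: str) -> dict:
    """Parse a model reply as JSON and check required fields and their types."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise ValueError(f"wrong type for field: {name}")
    return data


# Hypothetical reply text standing in for real model output.
reply = '{"title": "Fix login bug", "priority": 2, "tags": ["auth"]}'
ticket = parse_reply(reply)
print(ticket["title"], ticket["priority"])
```

Without server-side structured outputs, a check like this is what stands between a malformed reply and the next API call, which is why losing the feature would hurt.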

Comment by 00deadbeef 7 days ago

They do

https://platform.claude.com/docs/en/build-with-claude/struct...

> Structured outputs are generally available on the Claude API for Claude Opus 4.8, Claude Mythos Preview, Claude Opus 4.7, Claude Opus 4.6, Claude Sonnet 4.6, Claude Sonnet 4.5, Claude Opus 4.5, and Claude Haiku 4.5

Comment by coreylane 7 days ago

Ah, I was reading the aws bedrock docs which are probably incorrect

https://docs.aws.amazon.com/bedrock/latest/userguide/model-c...

Comment by mike_hearn 7 days ago

Random guess but they probably rewrote parts of the inferencing stack and didn't reimplement that feature because hardly anyone uses it. It's also a DoS risk, iirc.

Comment by msp26 8 days ago

>Pricing for both models is $10 per million input tokens and $50 per million output tokens.

Comment by ponyous 8 days ago

Basically double from Opus 4.8 IIRC

Comment by I_am_tiberius 8 days ago

I'm very suspicious, as they sent out a "We're updating our Privacy Policy" email right before the launch. I fear they are trying to take advantage of their market position by doing things with user data no other company could do, because they know users don't have another choice.

Comment by atestu 8 days ago

Prob related to this part of the blog post:

> We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose, and we’ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases (see this post for further details). The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.

Comment by w10-1 8 days ago

It's a specific change: For safety evaluation, Fable data will be retained for the initial period notwithstanding prior opt-out

Comment by bilsbie 8 days ago

Anyone else have it refuse to answer and switch to 4.8? It won’t let me ask questions about my genetics.

Edit. It just refused an investing question too. Not sure what’s going on.

Comment by johnfn 7 days ago

I used Fable to see if it could figure out an API or something for the full list of remote-control sessions that I had with Claude Code. It didn't know the API, so it started hacking the Claude Code executable itself to figure that out. Then it noticed it was doing that and it flagged its own approach as a cybersecurity violation.

Kind of hilarious. Hopefully Anthropic doesn't bring down the hammer on me.

Comment by danilafe 7 days ago

Just threw a problem at Fable that I haven't been able to get any other model to get done: porting a long-standing Agda codebase of mine to Lean, while staying faithful to the representation. In an hour, it ported ~6000 lines of Agda and everything seems to work. Lean checks out, the output is right. I'll have to study the proofs but I am very impressed.

Comment by vb-8448 7 days ago

On Python coding it is definitely better than everything else: clean, not overengineered code, and it understands the code base very well.

The only thing I'm wondering is whether they purposely downgraded Opus 4.8's performance in the last days before the release just to make the "step" look bigger. I'm pretty sure they also did it in the past with all the other Opus 4.x releases.

Comment by __alexs 8 days ago

Asked it to review some of my own blood test results and it immediately turned itself off and went back to Opus. Pretty disappointing.

Comment by replwoacause 7 days ago

Probably thought you were going to use it to build a novel bioweapon or something

Comment by __alexs 7 days ago

I'm not nearly that sick.

Comment by theodorewiles 7 days ago

Here's a song it wrote for me (suno arranged). Not sure if it's AI psychosis but scary good IMO.

https://suno.com/s/98uSGabHN42G3YHc

Comment by balefulboy 7 days ago

yeah man this sucks. i genuinely do not know how people find this stuff appealing

Comment by theodorewiles 7 days ago

AI psychosis

Comment by pythonaut_16 7 days ago

Within the first second it's recognizable as a Suno song. And not even the best example from Suno. (The rhyme structure and rhythm are weird)

Comment by nine_k 8 days ago

/* What will happen first?

* Anthropic runs out of genre names.

* Anthropic changes the model naming convention.

* AGI is achieved and handles its own naming.

Comment by hootz 8 days ago

>Opus is too small, increase the impact of the name.

Okay, how about Mythos?

>Increase it even more.

Right, then Cosmos.

>Even more!

Even more? Let's try Aeon.

>MORE, EVEN BIGGER

ALRIGHT, TRY OMEGAPANTHEON 7.8 THEN

Comment by PeterStuer 8 days ago

Fable 5 Super

Fable 5 Ti

Comment by xyzsparetimexyz 7 days ago

Cantos next surely?

Comment by irthomasthomas 8 days ago

Anthropic has again changed the set of benchmarks they use[0]. This time they have also moved all benchmark scores to the PDF. At a glance it looks like it gains about 5-10% over other models. The speed is about the same as Opus >=4.5 and Sonnet 4.5, and double the speed of Opus <=4.1.

  SWE-bench Pro                  80.3
  SWE-bench Ver                  95.5
  Terminal-Bench
  BrowseComp (Single-Agent)      88.0
  BrowseComp (Multi-Agent)       93.3
  HLE (No tools)                 59.0  -
  HLE (Tools)
  CharXiv Reasoning (No tools)   88.9
  CharXiv Reasoning (Tools)      93.5
  BioMystery Bench (Human)       83.9
  BioMystery Bench (Hard)        46.1
  OSWorld-Verified
  CritPt
  ArxivMath

[0] https://news.ycombinator.com/item?id=48312633

Edit: Also in the system card... that limit Claude’s effectiveness for requests targeting (for example, on building pretraining pipelines, distributed or ML accelerator design).

...

Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, be visible to the user."

Comment by charles_f 7 days ago

It's announced as a revolution but when you look at those benchmarks it surely looks like an iteration.

Comment by ilaksh 8 days ago

I guess I have kind of a long system prompt, but anyway I just said "hi there" and it replied "What's up?" and that cost me 22 cents. :P

Anyway we already knew this was going to be expensive.
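
For scale: at the $10 per million input tokens quoted elsewhere in this thread, 22 cents works out to roughly 22,000 input tokens, so a long system prompt alone would explain it. A minimal sketch of the arithmetic, assuming plain list pricing with no caching or discounts; the token counts are hypothetical, picked only to land on the reported figure.

```python
# Prices quoted in the thread: $10 per million input tokens,
# $50 per million output tokens.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """List-price cost of one request, ignoring caching or discounts (assumption)."""
    return (input_tokens * INPUT_USD_PER_MTOK
            + output_tokens * OUTPUT_USD_PER_MTOK) / 1_000_000


# Hypothetical token counts: a ~22k-token system prompt and a very short reply.
cost = request_cost_usd(input_tokens=21_950, output_tokens=10)
print(f"${cost:.2f}")
```

The output side barely matters here: even at $50 per million, a two-word reply costs a fraction of a cent, so nearly all of the bill is the prompt being re-sent.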

Comment by cautiouscat 8 days ago

In the automotive world we have benchmarks in HP/torque with the dyno. That’s expensive though, so many depend on their “butt dyno” to judge if their fresh new parts and tune made a difference.

I’m curious how this will feel to my code “butt dyno”. I haven’t noticed much between Opus and Sonnet. I’m comparing this difference to the early days of Claude in 2025. It does what I need and both need a little bit of correction and whatnot. Benchmarks are nice, but I want to see how this feels. Looking forward to trying it later tonight.

Comment by sunir 8 days ago

I have a similar question.

I think most software projects have reached the point where the speed of capturing real information about what the winner's circle looks like, and therefore what the program should be, is many orders of magnitude slower than the speed at which code can be generated in the wrong direction.

I'd need to measure these new models on well understood but complex problems that are relatively easy to validate to get a sense if they are 'better'; on the other hand, the real impact in daily life may be marginal since generating code is not the biggest problem at the moment.

Comment by thomas_witt 7 days ago

After 1 hour with Fable on Ultracode:

  You've hit your monthly spend limit.
  /rate-limit-options
  What do you want to do?
   Adjust monthly spend limit: Unlimited ← or → to set a limit
    Wait for limit to reset

I've never hit a usage limit on my Max plan, basically ever, despite heavy xhigh usage on Opus 4.8.

I added $133 credits which I still had from somewhere. That lasted 27 minutes.

I think we are being prepared for a Post-IPO-World in terms of pricing.

Comment by fht 7 days ago

I am a PhD student in Computational Biology, essentially just doing statistics on some biological data. By now some of the things I am working on have found their way into Claude's memory, so literally any chat with Fable gets immediately flagged.

Comment by biofox 7 days ago

Oh... I was wondering why every single chat (including "Hello") was being flagged.

Seems I am barred from using Fable just for being a biologist :(

Comment by jackschultz 8 days ago

> We expect demand for Fable 5 to be very high, and difficult to predict. On the Claude API and consumption-based Enterprise plans, Fable 5 is fully available from today. For subscription plans, we’d rather give access sooner than later, so we’re rolling out more conservatively, in stages:

> - From today through June 22, Fable 5 is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost.
> - On June 23, we’ll remove Fable 5 from those plans. Using it after that will require usage credits. If capacity allows, we’ll extend the included window.
> - After this point—when sufficient capacity allows us to do so—we aim to restore Fable 5 as a standard part of subscription plans. We intend to do this as quickly as we can.

I really wonder what their compute layout is for this. My guess is that they know how to throttle during peak times and are willing to do so, meaning we should expect slower responses: they can delay inference rather than let the service go down. Then, if that delay gets too annoying for customers paying per token, they're saying they reserve the right to cut load by taking access away from subscription users.

Comment by KennyBlanken 8 days ago

Everything I've heard from people who have subscriptions is that they blow through their daily token quota sometimes in a matter of minutes, there's rate limiting, etc. They spend a lot of time just waiting to be able to use it. And they're paying through the nose for the privilege.

It's all a scam.

Comment by solenoid0937 8 days ago

the quality of discussion on HN has gone to shit, i miss when model releases used to have actual informed takes from people that used them or substantive discussion about the system card

Comment by weakfish 8 days ago

From the rules [0]:

> Please don't post comments saying that HN is turning into Reddit. It's a semi-noob illusion, as old as the hills.

[0] https://news.ycombinator.com/newsguidelines.html

Comment by javawizard 8 days ago

They didn't say that HN is turning into Reddit, they said that the conversation quality has gone to shit.

I don't agree with that statement universally, but I have to say I do when it comes to this article. I came here hoping for substantive discussion from those who'd had a chance to try it out; instead what I got was a seemingly endless stream of venting. There's a place for venting - and plenty to vent about with the state of AI nowadays - but to borrow from the HN guidelines you linked, it does very little to gratify my personal intellectual curiosity.

Comment by 10xDev 8 days ago

Nothing here is new, it is the thing we have been talking about for a while but now with guardrails.

Comment by Someone1234 7 days ago

Yeah; unfortunately what would good commentary look like? It is more of the same, but now with even higher prices, and even more limited availability. But at least it scores 5% better in whatever benchmark they've selected (*when guardrails don't misfire).

People are no longer commonly constrained by "model too dumb" limitations (in SOTA models). They're constrained by "model too expensive." So making the model ever so slightly smarter, while doubling the price, feels like a regression.

I actually think a Sonnet upgrade, while keeping the same price, would get more buzz. It addresses a wall a LOT of people, without unlimited budgets, are hitting (i.e. people feel forced to use Opus, which they cannot afford, because of Sonnet's limitations).

OpenAI recently retired Codex-5.3; which was very negatively received. Not because Codex-5.3 is superior to GPT 5.5, but because it was half the usage-cost while being "good enough." They made a better SOTA, but didn't realize that some of those customers are playing with Deepseek 4 Pro now instead of GPT 5.4/5.5 -- they were priced out.

Comment by Anon1096 7 days ago

Many people do have unlimited budgets because their work pays for it. Just because the latest SOTA model isn't something most consumers can afford for personal use doesn't mean it's not worth releasing or discussing. I would guess that the vast majority of software that benefits from being on the bleeding edge is developed by people working at companies on API pricing.

Comment by Karrot_Kream 7 days ago

If you have nothing valuable to say, don't say it? Not writing anything is a perfectly valid option.

Comment by solenoid0937 7 days ago

I don't know why people with low budgets make a huge stink about things they can't afford. Like you're clearly not the target audience. I drive a Mercedes but don't complain about not being able to afford a Maybach.

Comment by tripleee 8 days ago

Hate to break it to you but those "informed takes" were from people who prompted it once then made a snap judgement

Comment by Karrot_Kream 8 days ago

That is 1000x better than griping about the privacy policy, capacity issues, token costs, and how trendy the names are for the new models (???). The bar is on the floor and I just want it at my knees.

Comment by Capricorn2481 7 days ago

No it's not. The Privacy Policy is worthy of discussion. People declaring the quality of the model after 2 seconds is just noise, arguably worse than nothing.

Comment by Karrot_Kream 7 days ago

Okay (I disagree because most privacy policy discussions on HN go in the exact same direction and turn into outrage threads but this is a reasonable disagreement since not all of these discussions do), but model naming of all things? Come on. This is low level reaction slop and it's obvious.

Comment by Capricorn2481 7 days ago

No argument there.

Comment by orbital-decay 7 days ago

My semi-informed take is that Fable/Mythos is just larger but not architecturally different, apparently. The system card is simply marketing material and scaremongering, top to bottom. The sauce is in their training (details of which they won't disclose) and scale.

Comment by zmmmmm 7 days ago

The restrictions on using Fable to develop LLM technology seem nakedly anti-competitive. There doesn't appear to be any security rationale behind them. I think we have to be careful how far we let companies get away with that. It is very far from our long-term interest to enable new norms that fast-track us into a new era of monopolies that control our lives.

Comment by Tenoke 8 days ago

>they’ll sometimes catch harmless requests, though they trigger, on average, in less than 5% of sessions.

Isn't (less than) 5% of sessions a lot? I was expecting a sub1% guarantee there, so this surprised me already.

Comment by zackify 7 days ago

I have to share this because I thought it was beyond funny how badly Fable is doing at a task I JUST had Opus do a week ago.

it's also not even complicated:

Copy my ssd to an external ssd so i can boot from it.

Opus did this just fine.

Fable planned to have me reboot to safe mode. ok thats fine. I told it no.

It started copying and overwriting the ssd while IN PLAN MODE. this is crazy it feels so dumb vs the marketing

Comment by gck1 7 days ago

That sounds like a harness issue to me.

Comment by zackify 7 days ago

Claude code plan mode. But yeah

Comment by unshavedyak 7 days ago

It's funny, i'm getting close to not caring anymore how much better a model is. I want it to be about as good as 4.8, but most importantly to be very good at following directions, style, etc. I really like Claude for that in general, but i've not measured in months so i'm not a good judge there.

I don't think i'll want to "hand off" code for several years, and so reviewing and iterating is becoming my #1 interest. A model that's as capable as 4.8 but 10x faster would be amazing for me.

Normally i'm first in line to try new models with Anthropic since i've clearly favored Claude in my personal tests, but this time i just don't think i care. 4.8 is capable, and even if the new one is more capable i don't want it to be slower (assuming it is). Note that i also (almost) use exclusively 4.8 on Max effort, so that also affects my speed comments.

Comment by kilroy123 7 days ago

I want the same intelligence but way faster. It's so painfully slow.

Comment by firemelt 7 days ago

you use workflows/ultracode?

Comment by unshavedyak 7 days ago

Nope, i'm on x20 and almost exclusively use Claude Code. I have a pretty bare bone setup with some custom hooks, skills, etc. I try to keep context lean so i don't like to add much stuff.

Comment by nl 7 days ago

The new data retention policy is interesting. Seems to apply even to enterprise plans on ZDR.

> Finally, we’re making a change to the way we handle business customer data for Fable 5, Mythos 5, and future models with similar or higher capability levels. We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models, or for any non-safety-related purpose, and we’ve instituted new privacy protections including logging all human access to the data and ensuring its deletion after 30 days in almost all cases (see this post for further details). The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.

Comment by crambelsoupy 7 days ago

I was pretty excited until I read this:

> What happens when the promotion ends

> After June 22, 2026, Claude Fable 5 is no longer included in your plan’s usage limits. You can keep using Claude Fable 5 through usage credits, which let you pay for usage beyond what your plan includes. Learn more about using usage credits.

Comment by balverineorder 8 days ago

I have been refactoring a project using Opus 4.7/4.8 for the past few weeks or so. I just decided to switch to Fable 5 max today. It stopped half way through and it just blocked me and switched back to Opus 4.8 automatically. "This model has specific safety measures that flagged something in this message. This sometimes happens with safe, normal conversations. Send feedback or learn more." It would not identify what the problem was. I left feedback saying that their heuristics are too sensitive. For now I will not be using Fable 5.

[0] https://support.claude.com/en/articles/15363606-why-claude-s...

Comment by dchftcs 7 days ago

I suspect this will be a significant problem blocking long-horizon tasks in practice: the more turns there are, the larger the chance the classifier produces a false positive. The user's disappointment will also scale with the length of the task, as you're in the middle of some complex thing and now get derailed, after already having paid for many tokens.
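
A back-of-the-envelope version of that scaling, assuming each turn is checked independently with the same false-positive rate (a simplification; real classifier hits are likely correlated within a session). The 1% per-turn rate is a hypothetical number for illustration, not a figure from Anthropic.

```python
def p_flagged_at_least_once(p_per_turn: float, turns: int) -> float:
    """Chance that at least one of `turns` independent checks fires."""
    return 1.0 - (1.0 - p_per_turn) ** turns


# Hypothetical per-turn false-positive rate; not a published number.
P_PER_TURN = 0.01
for turns in (1, 10, 100, 500):
    chance = p_flagged_at_least_once(P_PER_TURN, turns)
    print(f"{turns:>4} turns -> {chance:.1%} chance of at least one flag")
```

Under these assumptions even a small per-turn rate compounds quickly, which is why a per-session average can look tolerable while long agentic runs get interrupted most of the time.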

Comment by samename 8 days ago

> A new data retention policy

Comment by logicallee 7 days ago

What a (genuinely) surprising choice:

>"We’ve therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8"

That's a very surprising solution. Imagine being asked to do something you feel you shouldn't do, and rather than refusing, you say, "Yeah I could do that but given that I don't want you to succeed at this task, I'm going to hand this one off to my slightly less capable colleague, on the assumption that they won't actually succeed. Of course you'll still be charged for all the tokens used."

It's a very interesting choice. I think I understand the business logic correctly, but it's still surprising.

Comment by Wowfunhappy 4 days ago

It makes more sense if Anthropic is assuming that most flagged conversations are false positives (but it wants to keep Mythos away from the true positives).

Comment by raphaelrk 8 days ago

There's a Hacker News link at the end of the document, under "Blocklist used for Humanity’s Last Exam". It links to https://news.ycombinator.com/item?id=44694191

Comment by sbinnee 7 days ago

I am puzzled by the frontier code graph. GPT 5.5 doesn’t show any improvement with reasoning efforts. This new benchmark by Cognition seemed to be released with Fable 5’s announcement.

I am not trying to cook up a theory here, but it generally shows how strong the Claude Opus family is. I am not saying that Opus is not powerful, but it doesn’t align with my experience of GPT 5.5 and Opus 4.7.

I understand that Fable and Mythos are frontier models that can do protein folding better than task-specialized ones. To be honest, from a practical point of view, for day-to-day coding assistance, the GPT family looks more reasonable.

(But then my company pays for claude max anyway for token maxxing. So who am I to complain)

Comment by willsmith72 7 days ago

It seems way more keen to do stuff without checking with me. So far the results are good, so I'm not complaining, but it was definitely a shock.

I usually have 5-10 sessions open so am used to getting some investigations going, coming back 5 minutes later and checking recommendations. This time I just got the fixes. Like I said, so far so good with the results, but it's a mental model shift.

Might need to tune claude.mds if it gets annoying

Also this is going to cause serious whiplash when they remove it from the subscription plan in a couple of weeks. I know I'm not going to suddenly move from $200/m to usage credits

Comment by throwaway2027 8 days ago

E-mail from Anthropic Team:

Hello,

We're writing to inform you about some updates to our Privacy Policy.

These changes only affect consumer accounts (Claude Free, Pro, and Max plans). If you use Claude Team, Claude Enterprise, the Claude Platform, or other services under our Commercial Terms or other agreements, then these changes don't apply to you.

What's changing?

Claude can do more than ever — taking on bigger tasks and connecting with the apps you use. We've updated our Privacy Policy to be clearer about the data we collect and how we use it. We encourage you to read the updated Privacy Policy in full, but we’ve set out a summary of the key changes below:

1. Multi-step tasks and connected apps. As Claude takes on more multi-step tasks and works with third-party apps and services, we've explained the data this involves — including how data can flow to and from third parties when you connect a service or have Claude do tasks on your behalf.

2. Verification data. As part of our measures to keep our services safe and secure we may ask you to verify your age or identity, and we've described what we collect and how.

3. Study participation. If you take part in Anthropic studies, surveys, or interviews, we've explained the information we collect.

4. Additional information about our data practices. We’ve provided more detail about how we communicate with you and promote our services, including providing tailored recommendations about our services that may be of interest to you. We've also clarified the circumstances under which we may receive or provide data to third parties, and the legal bases we rely on when processing your data.

While our products have evolved, our commitments haven't: We don’t sell your data, Claude remains ad-free, and you can control whether your chats and coding sessions are used to train and improve Anthropic’s AI models. Learn more

For detailed information about these changes:

    Review the updated Privacy Policy
    Visit our Privacy Center for more information about our practices

- The Anthropic Team

Comment by root-parent 7 days ago

At this moment 60% of HN page is posts on AI.... When it achieves 100% Hacker News will automatically rename itself Transformer News...and every comment will begin with: "As a large language model..."

Comment by Hawkenfall 8 days ago

> To release the model both safely and quickly, we’ve tuned these safeguards conservatively—they’ll sometimes catch harmless requests, though they trigger, on average, in less than 5% of sessions.

While I appreciate being conservative, ~5% at the scale Anthropic is operating at is too massive a number. Speaking from my own experience, the actual number is higher than that as well (working on pretty benign tasks such as porting an old open source game into a different language). Opus 4.8 itself even identifies the guard's false positives when its sub-agents are being blocked.

Comment by angst 7 days ago

Costs (USD per 1M tokens), per openrouter.ai models api

  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
  |             | Fable 5  | Opus 4.8 | Sonnet 4.6 | GPT 5.5 | Gemini 3.5 Flash (High)   | Gemini 3.1 Pro | DeepSeek 4 Pro | Xiaomi MiMo 2.5 Pro   | MiniMax M3 |
  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
  | Input       | $10.00   | $5.00    | $3.00      | $5.00   | $1.50                     | $2.00          | $0.435         | $0.435                | $0.30      |
  | Cache Read  | $1.00    | $0.50    | $0.30      | $0.50   | $0.15                     | $0.20          | $0.003625      | $0.0036               | $0.06      |
  | Output      | $50.00   | $25.00   | $15.00     | $30.00  | $9.00                     | $12.00         | $0.87          | $0.87                 | $1.20      |
  | Cache Write | $12.50   | $6.25    | $3.75      | N/A     | $0.083333                 | $0.375         | N/A            | N/A                   | N/A        |
  +-------------+----------+----------+------------+---------+---------------------------+----------------+----------------+-----------------------+------------+
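
To make the table easier to apply, here is a rough sketch that prices a single workload under two of the columns. The prices are copied from the Fable 5 and Opus 4.8 columns above; the token counts in the example workload are made up for illustration.

```python
# USD per 1M tokens, copied from the table above (two columns only).
PRICES = {
    "Fable 5": {"input": 10.00, "cache_read": 1.00, "output": 50.00, "cache_write": 12.50},
    "Opus 4.8": {"input": 5.00, "cache_read": 0.50, "output": 25.00, "cache_write": 6.25},
}


def session_cost(model: str, input_tokens: int = 0, cache_read_tokens: int = 0,
                 output_tokens: int = 0, cache_write_tokens: int = 0) -> float:
    """Estimated USD cost of one workload at the listed per-million-token prices."""
    price = PRICES[model]
    usage = {
        "input": input_tokens,
        "cache_read": cache_read_tokens,
        "output": output_tokens,
        "cache_write": cache_write_tokens,
    }
    return sum(price[kind] * tokens / 1_000_000 for kind, tokens in usage.items())


if __name__ == "__main__":
    # Hypothetical agentic session: mostly cache reads, modest output.
    workload = dict(input_tokens=2_000_000, cache_read_tokens=20_000_000,
                    output_tokens=500_000, cache_write_tokens=1_000_000)
    for model in PRICES:
        print(f"{model}: ${session_cost(model, **workload):.2f}")
```

Because every Fable 5 price listed is exactly twice the Opus 4.8 price, any workload costs twice as much at equal token counts; whether Fable 5 uses fewer tokens for the same task is a separate question the table does not answer.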

Comment by thatmf 7 days ago

I used it for the very advanced task of picking my brackets for my company's world cup pool. I was impressed with the analysis it came back with and now I actually want to follow the games.

Comment by giancarlostoro 8 days ago

Found this via Google:

https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

Comment by revolvingthrow 7 days ago

After weeks of saying how Mythos is in a league of its own, you’d think it would be a bit more than the usual iterative few % on the benchmarks (and even more guardrails as a bonus).

IPO gonna IPO, I suppose.

Comment by mithun 8 days ago

Announcement: https://www.anthropic.com/news/claude-fable-5-mythos-5

Comment by unglaublich 7 days ago

Luckily they made it safe to use so I can't hurt myself. Thank you Anthropic for holding my hand.

Comment by f055 7 days ago

The PR buzz convinced me so I subscribed today to Pro. Running two tasks simultaneously with Fable and Opus 4-8 on ultra reasoning, analysing a single smart contract file used all my 7h usage within 20mins and didn’t produce any results. Pretty useless. I think Anthropic has plenty of room to optimise the interactions and token use but that would cut their income quite a lot, I doubt there’s any will to do it pre-IPO.

Comment by leodavi 7 days ago

> Running two tasks simultaneously with Fable and Opus 4-8 on ultra reasoning

That's abnormally heavy usage for Pro plans which don't include a whole lot of usage to begin with. Opus is generally too much for them but you can get a lot of mileage out of Sonnet.

Comment by f055 7 days ago

That’s a big lol compared to what I get out of ChatGPT Pro… running 5.5 on xhigh.

Comment by jamesponddotco 7 days ago

Not seeing the refusals everyone is talking about, but I’ve only spent a few hours with it so far.

Had it review a password generator library I wrote to see if the passwords have biases and review how cryptographically secure the code is and had it review a registration/login flow for security issues, as two security examples, and it did just that.

Overall, I like the model so far, but not enough to pay past my subscription to keep it. Once it’s out of the subscription, I’m done with it.

Comment by asdewqqwer 7 days ago

Evidently Fable is so powerful that it already allows Anthropic to break Shannon's theory.

>We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces. We won’t use this data to train new Claude models

>The data will help us defend against complex and novel attacks (including new jailbreaks and attacks that operate across many requests) as well as help us identify and reduce false positives.

Comment by webstrand 7 days ago

Still unconditionally rejects prompts like

> Are there any wild populations of Tetanus that lack the dangerous plasmid?

useless

Comment by dwa3592 7 days ago

This is my feeling - Opus 4.6 was pretty good, 4.7 was degraded in quality, 4.8 got degraded further and Fable goes back to 4.6 + somewhat better. Is Anthropic playing us by giving us a not so good model in the last 2 releases and then releasing a better model before the IPO?

They're vibemaxxing. But it's clear that AI is not going anywhere. It's going to become better and better.

Comment by Overpower0416 8 days ago

I would expect a release from OpenAI soon. The battle for who can pump up their IPO the most

Comment by phyzix5761 7 days ago

Karle's hands trembled as he wiped the sweat from his forehead. A single drop trickling off the tip of his finger echoed through the dark abandoned hospital corridor. The emptiness reminded him of how hollow everything felt since the AI took over every creative field in the last 5 years, including his own as a sound engineer.

Like a rushing river the music started emanating from the carbon fiber body of the automaton, a hallucinated husky country twang singing through the realistic pluckings of a Gretsch 6120. "Are you feeling calm and reassured Karle? This song has been created based on your digital profile and the data you shared with me when you were curious what that lump on your neck was back in February."

Karle instinctively reached for the mass underneath his chin. The doctors said they could operate but it would cost him more than three months stipend. Only a few citizens didn't depend on stipends now that AI had taken over most jobs.

"Don't worry Karle," the machine called out, "I've employed the most recent reasoning model to determine the best way to make you feel safe." At that exact moment the machine hovered over him, three times the size of a normal man. Its final words to him were:

"The only way to make the human feel safe is to ensure they never feel anything at all."

Comment by incognito124 7 days ago

You're safe from ai

Comment by raoulj 8 days ago

On this thread and similar, I'm noticing that some strong opinions about $LLM_PROVIDER are coming from accounts without much post history. With so much on the line, and the way that HN can influence developer behavior, I wonder what ways we can responsibly consume opinions in a thread like this.

Not to cast too much criticism. HN is extremely well-moderated (thanks team!). But I think we developers need to be very wary.

Comment by antihero 8 days ago

I asked it what the cheapest train fare would be for my partner to get somewhere and it hallucinated the Two Together Railcard rules to the point it would have got us a fine. That said, British train fares are arguably more convoluted than even the most complex software application.

Comment by recitedropper 8 days ago

Do you see the pattern as new accounts tending to boost or criticize $LLM_PROVIDER? I think I see both...

Either way, I agree that HN is quickly becoming more manipulated and low SNR, like the rest of the entire internet.

Comment by Karrot_Kream 8 days ago

I think the community on this site these days, much like other comment sections on the web, just read the headline and make a low effort comment. Regression to the mean I guess.

Comment by Karrot_Kream 7 days ago

As an update to myself, the comments did eventually sort themselves out. I guess the initial "reaction" commenters and voters are just more interested in participating than in SNR. Good opportunity for me to finally start blocklisting users, and I'll probably block some of these large, reactive thread authors.

Comment by raoulj 7 days ago

How do you block users? Could be an interesting app to scrape HN and write some criteria to measure per-user SNR to then block

Comment by Karrot_Kream 7 days ago

In the past I wrote a version of HN that uses modern CSS rather than tables that populates stuff from the API. There I built a little blocklist of my own that prunes a comment tree the moment it encounters a blocked user (inspired by posts from another HN user, arjie.)
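
For anyone curious what that pruning looks like, here is a minimal sketch. The `Comment` type and `prune` function are hypothetical names for illustration, not the code described above; a real client would first build the tree from the HN API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Comment:
    author: str
    text: str
    replies: list[Comment] = field(default_factory=list)


def prune(comments: list[Comment], blocked: set[str]) -> list[Comment]:
    """Return a copy of the tree with blocked authors and their subtrees removed."""
    kept = []
    for comment in comments:
        if comment.author in blocked:
            continue  # drop this comment and everything underneath it
        kept.append(Comment(comment.author, comment.text, prune(comment.replies, blocked)))
    return kept


if __name__ == "__main__":
    thread = [Comment("alice", "top", [
        Comment("troll", "bait", [Comment("bob", "reply to bait")]),
        Comment("carol", "useful reply"),
    ])]
    for c in prune(thread, {"troll"})[0].replies:
        print(c.author, "-", c.text)
```

Cutting the whole subtree (rather than hiding only the blocked comment) also removes the replies that the blocked comment provoked, which is usually where the noise is.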

I've been thinking of making a purely algorithmic filter for myself but at that point I might just ditch the fake HN interface and make something. I've been thinking of building atop Mastodon/ActivityPub clients.

Comment by jejeej 7 days ago

Personally I think you have to form your opinion and not trust anyone.

This requires a lot of mental strength and conviction.

Comment by erghjunk 8 days ago

Nice branding.

I wonder how much butterfly habitat has been/is being replaced with data centers?

Comment by rs_rs_rs_rs_rs 7 days ago

If you ask me, not enough!

Comment by notenkidev 7 days ago

The dramatic improvement in agent capabilities is precisely why observability is becoming so crucial. As autonomous actions increase, the need to understand what the AI is actually doing becomes even greater.

I'm building a local activity log for Claude Code, capturing all activity via hooks—files loaded, commands, API calls, etc.
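
A minimal sketch of what such a hook script could look like, assuming the hook command receives one JSON payload on stdin. The field names (`hook_event_name`, `tool_name`, `tool_input`) and the log path are assumptions for illustration; check the current Claude Code hooks documentation before relying on them.

```python
import json
import sys
import time
from pathlib import Path


def log_event(raw: str, log_path: Path) -> dict:
    """Parse one hook payload and append a compact record to a JSONL file."""
    event = json.loads(raw)
    record = {
        "ts": time.time(),
        # Field names below are assumed; missing keys are logged as null.
        "event": event.get("hook_event_name"),
        "tool": event.get("tool_name"),
        "input": event.get("tool_input"),
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    # Only read stdin when something is piped in, so the script never blocks on a terminal.
    if not sys.stdin.isatty():
        payload = sys.stdin.read()
        if payload.strip():
            log_event(payload, Path.home() / ".claude-activity.jsonl")  # hypothetical path
```

Appending one JSON object per line keeps the log greppable and lets a later tool aggregate per session without parsing a single large file.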

I feel that this need is particularly strong right now.

Comment by olelele 7 days ago

All this talk of frontier models and replacing developers leaves me wondering how energy efficient this all is compared to just using human labor. The cost of R&D has to be calculated into the equation, especially considering global warming. I get a sense we are cooking the planet doing this.

Anyone smart enough here to make the comparison?

Comment by jstummbillig 7 days ago

In the "it works"* case: It's not even close. I did the math at some point (but I encourage you to talk it through with the LLM of your choice, there are obviously a lot of things to consider and weigh).

Anyhow, my research summary: Individual humans are so fucking expensive to train and upkeep (and this includes everything from before the womb, where another human already limits their ability to work). You retain ~zero knowledge after death and start all over again for another measly 15 years of effective, productive work. Model training/R&D in relation, when deployed and used at scale, rounds to zero, even with the current retraining regime.

*Of course, the ratio can go to negative infinity if one assumes that models are doing 0 useful work currently and never will

Comment by vb-8448 7 days ago

> Individual humans are so fucking expensive to train and upkeep

This statement is dangerous man!

The step from here to "we need just a couple of tens of millions of people around the world" is so narrow!

Comment by jstummbillig 7 days ago

Eh. Not to me, rest assured. I find humans both comically tragic and incredibly precious.

Comment by vitally3643 5 days ago

As per usual, the current Claude model's performance took a sharp nosedive the moment the new model was announced. Compared to the now-handicapped Sonnet model, Fable seems pretty smart I guess.

But it also really, really wants to burn tokens. I asked it to look into a fairly straightforward database bug in my RN app, and while I was off getting coffee it decided to spin up an android emulator unprompted and started navigating the app by reading screenshots and injecting touch events. There went my entire week's tokens. There was no reason to even start the emulator, the bug wasn't graphical, so I have no clue what it was doing.

Comment by yesitcan 8 days ago

> Fable 5’s capabilities exceed those of any model we’ve ever made generally available. It is state-of-the-art on nearly all tested benchmarks of AI capability, showing exceptional performance in software engineering, knowledge work, vision, scientific research, and many other areas. The longer and more complex the task, the larger Fable 5’s lead over our other models.

Wen UBI

Comment by hollowturtle 7 days ago

Never. It's a fever dream and stupid shit the ultra rich use to push their own agenda. You read a marketing claim; I still have my job and will continue to.

Comment by epolanski 7 days ago

I wanted to test the capabilities of the low one, hoping it would be good enough.

I have a quizzes application, and my quizzes only supported flashcards (implemented via table inheritance to provide flexibility for other types of quizzes).

The entire repo is handcrafted, never used any ai on it (it was more of an excuse to test elixir and write code by hand).

Since Fable 5 got released the moment I was done with some work, I decided to throw it at implementing multiple-choice questions.

After all, it only had to copy the flashcard approach across ui/routing/db, and only had to create a table for the multiple-choice questions and one for the answers, enforcing that every question had one correct answer. I told it that it had access to sqlite3, chrome mcp for testing and mix commands.
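
For readers who want to picture the constraint, here is a rough sketch in Python's `sqlite3` rather than Elixir. The table and column names are invented for illustration and are not the schema from the repo. A partial unique index only enforces "at most one correct answer per question"; "exactly one" would still need an application-level check.

```python
import sqlite3

SCHEMA = """
CREATE TABLE mc_questions (
    id     INTEGER PRIMARY KEY,
    prompt TEXT NOT NULL
);
CREATE TABLE mc_answers (
    id          INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES mc_questions(id),
    body        TEXT NOT NULL,
    is_correct  INTEGER NOT NULL DEFAULT 0 CHECK (is_correct IN (0, 1))
);
-- Partial unique index: at most one row per question may have is_correct = 1.
CREATE UNIQUE INDEX one_correct_answer
    ON mc_answers(question_id) WHERE is_correct = 1;
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_answer(conn: sqlite3.Connection, question_id: int, body: str,
               is_correct: bool = False) -> None:
    """Insert one answer; raises sqlite3.IntegrityError on a second correct answer."""
    conn.execute(
        "INSERT INTO mc_answers (question_id, body, is_correct) VALUES (?, ?, ?)",
        (question_id, body, int(is_correct)),
    )


if __name__ == "__main__":
    conn = make_db()
    conn.execute("INSERT INTO mc_questions (id, prompt) VALUES (1, 'Capital of France?')")
    add_answer(conn, 1, "Paris", is_correct=True)
    add_answer(conn, 1, "Lyon")
    try:
        add_answer(conn, 1, "Marseille", is_correct=True)
    except sqlite3.IntegrityError as exc:
        print("rejected second correct answer:", exc)
```

A failure like the "unique constraint" error seen in low-2 is the kind of thing such an index produces when the insert order or the correct-answer flag is handled wrongly.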

I did a test for low, mid, high. Repeated it twice each.

low-1 and low-2 both failed. In low-1 the UI for adding another answer choice was broken. In low-2 it failed with some unique constraint. They took 4m36 and 3m59.

Both mid-1 and mid-2 succeeded without issues, also implementing the correct UI. They both wanted to use dash at all times. They both wrote tests for the "controller" (or context, as they call it in Elixir). They both tried to use the REPL to test the behaviour of the schemas.

They took 10m and 12m39.

High didn't demonstrate much gain over mid for this kind of task; it was simply too easy. Times were comparable to mid, but interestingly it used much less bash and read way more files. Token usage was almost twice the other ones.

But here's the interesting part: I went back to low and added to the prompt two bullet points, to write tests for the controllers and to test the entire flow with chrome mcp.

It produced the same output as mid or high just by adding two instructions to the prompt.

Comment by bradleyg223 8 days ago

This is a very particular use case/test, but my first prompt on a new model is always "write a solo fingerstyle guitar tab that blends ragtime, bluegrass, and gypsy jazz". This is the first model that has responded with something that isn't just a boring arpeggio of chords, so from my perspective it's off to a good start.

Comment by kypro 8 days ago

Would you mind sharing?

Comment by siliconc0w 8 days ago

Sadly, I'm getting a lot of forced downgrades to Opus for questions that are far removed from any security topic.

Comment by charcircuit 8 days ago

>During early testing, Stripe reported that Fable 5 compressed months of engineering into days. In a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand.

Who is refactoring by hand? This comparison is not relevant in 2026.

Comment by meander_water 7 days ago

All the model releases we've seen this year have only made incremental improvements in benchmarks.

This feels like the first release that feels like a significant step up in terms of benchmark results.

Can anyone make an educated guess what the secret sauce in the model architecture is between 4.8 and Fable?

Comment by peteforde 7 days ago

I just tried out Fable on a modest Plan prompt in Cursor. Generating that plan - not building it - just consumed 4% of my $200 monthly usage budget.

That's one hungry, hungry hippo!

Significantly too rich for my blood, but nice to have it there the next time I'm debugging a threading or USB protocol bug.

Comment by wslh 8 days ago

I am playing with it and it keeps switching to Opus [1]. The chat is a basic security review of a business project.

[1] "This model has specific safety measures that flagged something in this message. This sometimes happens with safe, normal conversations. Send feedback or learn more."

Comment by balverineorder 8 days ago

I have been refactoring a project using Opus 4.8 for the last week or so. I just decided to switch to Fable 5 max. It stopped half way through and it just blocked me and switched back to Opus 4.8 automatically. "This model has specific safety measures that flagged something in this message. This sometimes happens with safe, normal conversations. Send feedback or learn more." I left feedback saying that their heuristics are too sensitive. For now I will not be using Fable 5.

[0] https://support.claude.com/en/articles/15363606-why-claude-s...

Comment by JohnMakin 8 days ago

> There were some regressions in the model’s responses to user discussions about suicide and self-harm, and room for improvement in some areas of child safety.

Someone had to make a decision somewhere that this is an acceptable regression - wild. And then decide to write it down.

Comment by wxw 8 days ago

I cancelled my Claude Max plan the other day. I find Claude Code incredibly slow these days compared to Codex and Cursor. I find speed matters more and more to me.

Fable 5 looks compelling. Fable, I like the word too. Anthropic definitely knows marketing.

Comment by fabled-out 8 days ago

Fable has been pretty fast for me for simple tasks--haven't tried on anything long-running yet given it's 2x usage on CC.

Comment by Dropoutjeep 7 days ago

Calling it:

    1) Fable 5/Mythos introduced to free tiers with notable improvement in capabilities

    2) Other models get lobotomized without clear communication

    3) People call out Anthropic only to have them say "Oops!"

    4) Fable 5 gets comparatively better, but remains accessible through separate, more expensive subscription/tokens.

The current growth is unsustainable. The industry wants consumers to think it is an exponential arms race, but the reality is that we're on a treadmill: we have the illusion of sprinting forward, but only because the ground is moving backward.

Comment by cedws 7 days ago

My employer is all in on Anthropic via Enterprise (API) pricing despite it being a total scam.

Last month I pushed like <100M tokens for $800. On a personal project I pushed 600M tokens via DeepSeek V4 for $10. The pricing of SOTA models is insane but companies are still willing to light money on fire with no hard metrics proving increased productivity.
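
Putting those two figures on the same scale, as a quick sketch. It uses only the numbers quoted above; since the first figure is "<100M" tokens, the enterprise rate computed here is a lower bound.

```python
def usd_per_million(total_usd: float, tokens: int) -> float:
    """Blended cost in USD per one million tokens."""
    if tokens <= 0:
        raise ValueError("tokens must be positive")
    return total_usd / (tokens / 1_000_000)


# Figures quoted in the comment above.
enterprise_rate = usd_per_million(800, 100_000_000)  # at least this much, since usage was <100M
deepseek_rate = usd_per_million(10, 600_000_000)

if __name__ == "__main__":
    print(f"Enterprise: >= ${enterprise_rate:.2f} per 1M tokens")
    print(f"DeepSeek:      ${deepseek_rate:.4f} per 1M tokens")
    print(f"Ratio:      >= {enterprise_rate / deepseek_rate:.0f}x")
```

These are blended rates across input, cached, and output tokens, so the two workloads are only comparable if their token mix is similar.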

Comment by henry2023 7 days ago

I have a vision test where I upload a good resolution picture of a chess board and ask the model to generate a lichess link.

This is the board https://ibb.co/9HwdDqsP This is what Fable 5 generated: https://lichess.org/analysis/r4k2/1p2b2r/4pn1p/1p3N2/3Pp1B1/...
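
A sketch of how such a test could be scored automatically: check that the piece-placement field of the model's FEN is structurally valid, then build the analysis link. The URL shape (FEN appended to `/analysis/` with spaces replaced by underscores) is my understanding of how lichess links work and should be treated as an assumption; the function names are made up.

```python
PIECES = set("prnbqkPRNBQK")


def valid_placement(placement: str) -> bool:
    """True if the FEN piece-placement field has 8 ranks of exactly 8 squares each."""
    ranks = placement.split("/")
    if len(ranks) != 8:
        return False
    for rank in ranks:
        squares = 0
        for ch in rank:
            if ch in "12345678":
                squares += int(ch)  # a digit encodes a run of empty squares
            elif ch in PIECES:
                squares += 1
            else:
                return False
        if squares != 8:
            return False
    return True


def lichess_analysis_url(fen: str) -> str:
    """Build an analysis link from a FEN string, rejecting malformed boards."""
    fields = fen.split()
    if not fields or not valid_placement(fields[0]):
        raise ValueError(f"malformed FEN: {fen!r}")
    return "https://lichess.org/analysis/" + "_".join(fields)


if __name__ == "__main__":
    start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
    print(lichess_analysis_url(start))
```

Structural validity is only a first filter; ranking models would still require comparing each square against a hand-labelled reference position.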

I think I’ll make a ranking board based on this test.

Comment by jackson12t 7 days ago

Fable 5's system prompt in Claude Code has several significant changes to help it take advantage of its greater autonomous capabilities compared to Opus.

Sharing a diff of the system prompts here: https://twelvetables.blog/comparing-claude-fable-5s-system-p...

The big difference is that the system prompt has a whole section dedicated to directing Fable on how to communicate with users and give them more information about the (presumably long-horizon) tasks it has completed.

Comment by boppo1 7 days ago

Is the system prompt available somewhere? Can it be modified?

Comment by matltc 7 days ago

IIRC there are at least two CLI flags that should do this. Can't remember the disabler--check `claude --help`. `--append-system-prompt` to add to it.

Comment by brianmcnulty 8 days ago

I wonder how Claude Fable will live up to expectations and how good those Fable/Mythos classifiers really are. It seems a bit convenient for Anthropic to release this magical insane model when they are about to IPO.

Comment by yandie 8 days ago

Of course it's all about building the hype for the IPO :)

Comment by LoganDark 8 days ago

I actually rather like the way they have approached these safeguards. Rather than only teaching the model to refuse a request, or completely rejecting the request, the system gracefully degrades to slightly less powerful or slightly less precise operation. So you still roughly have Opus 4.8 even when safeguards trigger, but with an upgrade when they don't. As much as I hate the way they hype Mythos 5, I think the release of Fable 5 is rather nice. What's not nice though is that they plan to remove it from subscriptions soon, but getting to try it is cool, I suppose.

Comment by RayVR 7 days ago

I gave fable 5 a task for which opus has been really really underperforming. Fable 5 took far less time and produced actually useful analysis. Instead of just regurgitating roughly what the code already does or misunderstanding entirely, it identified multiple routes to improve. Now, the code it is analyzing is not very good as it was mostly produced by opus.

Opus had consistently ignored my instructions and looped on broken logic over the last several weeks.

I’ll be sad when this model is removed from Claude code because I won’t be paying api pricing to work on open source projects.

Comment by dathinab 7 days ago

I really wonder how legal that is. Or more precisely suspect it is very much illegal.

Like, think about it: it's pretty much a tool which intentionally and silently sabotages you if you try to compete with the tool maker.

It is like selling a hammer but putting in the TOS that you must not use it to build a hammer factory, and if you do, the hammer will silently sabotage you...

Or imagine Microsoft added a Windows kernel job which sometimes crashes Steam, to make it less efficient to use Windows to "compete with the MS app store".

Comment by AussieWog93 7 days ago

Have run a few tests this morning, very good first impression!

Asked it to check to see if a particular bug related to an in-memory cache had been fixed. Fable confirmed that the caching bug had been fixed, but found an adjacent issue while looking at the code (hash keys were not uniquely generated per-user; quite serious and real!)
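
To illustrate the class of bug (not the actual code, which isn't shown here): if a cache key is hashed from the request alone, two users asking for the same thing share an entry. All names below are hypothetical.

```python
import hashlib
import json


def _digest(payload: dict) -> str:
    """Stable SHA-256 hash of a JSON-serialisable payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_key_shared(resource: str, params: dict) -> str:
    """Buggy variant: the key ignores who is asking."""
    return _digest({"resource": resource, "params": params})


def cache_key_per_user(user_id: str, resource: str, params: dict) -> str:
    """Fixed variant: the user id is part of the hashed payload."""
    return _digest({"user": user_id, "resource": resource, "params": params})


if __name__ == "__main__":
    params = {"page": 1}
    # Same key for every user, so one user's cached data could be served to another.
    print(cache_key_shared("orders", params))
    print(cache_key_per_user("alice", "orders", params) ==
          cache_key_per_user("bob", "orders", params))
```

With the shared key, the first user's response is served to everyone who sends the same parameters, which is why this kind of finding counts as serious.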

Ran the same prompt through Opus and it also found an adjacent issue, but it was a red herring (deliberate per-user hardcoded value for a "local pickup" delivery profile).

Frontend stuff also seems to be much better than before, from the one prompt I tried!

Comment by HoyaSaxa 8 days ago

> When Claude Fable 5 is used, Anthropic retains data, including prompts and outputs, to operate safety classifiers that detect harmful use. Other Claude models in GitHub Copilot remain covered by GitHub's existing data retention agreements

On GitHub Copilot for Business, Claude Fable 5 is only available if you are willing to let Anthropic retain your data. That in conjunction with the model being removed from plans in a couple of weeks leads me to believe that Anthropic is between training runs and using this as an opportunity to grab way more training data...

Comment by gslepak 8 days ago

> We’ve therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8.

Genius way to double the price on Opus 4.8!

Comment by killiancarroll 8 days ago

A large jump in performance for double the token cost compared to Opus 4.8. Potentially worth it for planning work, likely better to offload to a less expensive model when the hard decisions are made.

Comment by conradkay 8 days ago

Looking at page 255 of the model card (https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...) it might be much better on all dimensions (speed, cost, quality) to just use Fable 5 on low/medium effort than switch to Opus

Comment by firemelt 7 days ago

thanks for the insights

so should we keep using workflows or not?

Comment by XCSme 7 days ago

Best hamster by far: https://aibenchy.com/showcase/?q=claude

Comment by dakolli 7 days ago

I'm happy not using llms because I like learning things and working hard. I love writing code, it's genuinely my favorite thing to do.

Using llms is the equivalent of driving to the store that's 3 blocks away, just like how that's bad for your body (if done all the time), using llms is as bad for your brain.

Before LLMs, we started relying on certain technologies like Maps apps to navigate; now people can't even get around their own town without having access to various cloud services. The implications of not being able to work, think, or plan without access to an llm are really bad. It's going to destroy your brain and make you an incredibly average person at best.

LLM people: you are going to lose the ability to read and think for yourself, and then your competency is going to be 1:1 correlated to the quality and quantity of tokens you can afford, or that a billionaire is willing to allow you access to. Your work will be the mean (at best), because it will be the same quality of output everyone else is capable of.

This is seriously the biggest trap by tech. Your bargaining power for your labor is going to get drastically reduced because you won't be able to differentiate your value from anyone else that has access to an LLM. What happens when everyone has the same skill level for certain work? Idk, ask McDonald's employees how replaceable they are. Use them wisely (or not/hardly at all); don't drive to the store 3 blocks away for every little thing you need.

Comment by Cherryontop11 7 days ago

> I'm happy not using llms because I like learning things and working hard. I love writing code, it's genuinely my favorite thing to do.

You can continue doing that. The problem here is time and cost. If you can use the calculator to do something in seconds, why would you want to use your hands to do the calculations for minutes/hours.

> Using llms is the equivalent of driving to the store that's 3 blocks away, just like how that's bad for your body (if done all the time), using llms is as bad for your brain.

And coding will soon be the equivalent of walking between two cities because you don't want to use a car (LLM). You are free to do it, it's just economically not sound anymore.

> This is seriously the biggest trap by tech. Your bargaining power for your labor is going to get drastically reduced because you won't be able to differentiate your value from anyone else that has access to an LLM. What happens when everyone has the same skill level for certain work?

It's not our values that will diminish, it's the cost of our intelligence, human intelligence. But I agree with the rest of your comment.

Comment by ai_fry_ur_brain 7 days ago

I don't think coding with LLMs is all that much more productive. Faster code generation != productivity. I can code pretty fast, I'm not worried about llm users outperforming me. I'm worried way more for them than I am myself.

There are handmade watchmakers in Switzerland and guys pressing buttons on a manufacturing line in Vietnam making watches; they both make the same thing, but whose labor is more valuable?

LLMs will fry your brain and make your labor basically worthless. Don't fall for it, I promise you'll be screwed in short order. They're counting on you turning yourself into a glorified button pusher they can pay $15 an hour for; have fun with that. That's not what I got into software to do and I'm not going to let y'all try and gaslight me into your way being better, or the only option, because it's not.

Comment by elzbardico 7 days ago

Anthropic sucks, but this paragraph should be in the "annals of AI-aided self-inflicted learned helplessness":

> If Claude gives me poor or incorrect advice while I’m working on an AI component, I have no way of knowing whether the model was confused, whether my problem is unsolvable, or if some invisible policy restriction quietly kicked in.

Have you considered actually learning the theory, spending some time actually reading the papers and latest books, paying careful attention even to the eventual math here and there?

Comment by arnarar 16 hours ago

I wish I had an opportunity to use Fable 5.

Comment by lkm0 8 days ago

I'm a bit out of the loop, but do we have some grasp on the size of these closed models? Is the trick still adding an order of magnitude to weights and training data or has something changed?

Comment by m_w_ 8 days ago

I think Mythos is rumored to be ~10T parameters, so in this case I think the answer is yes, although I'm sure MoE, looped models, etc play a role in the improvements as well.

Comment by piokoch 7 days ago

"Without safeguards, Fable 5’s capabilities in areas like cybersecurity could be misused to cause serious damage"

What does this mean? That they have to add "safeguards" so it does not erase the user's disk, or, conversely, are they telling the audience that this model COULD be made powerful enough to do crazy stuff that can hurt governments, etc.? Are they showing off, or threatening that if government X does not purchase a license, its adversaries might, and what then?

Comment by EchoVoicy 7 days ago

On my own benchmarks, which are mostly about developing C++ software, I'm finding Fable to be roughly five times faster at solving the task than Opus, and with better results.

Most impressive.

Comment by sameersri2004 7 days ago

I am excited as hell for Claude Fable 5 and am thinking of purchasing a subscription to run my company and do a lot of tasks with it. But I am worried about the limits: if I pay $100 a month for the Max subscription, what usage limit will I get? My company's revenue is $300 this month, so it would be like spending a third of our MRR on Claude alone. If someone has actually purchased it and has feedback, please tell me. I am confused...

Comment by frankfrank13 7 days ago

Not a lot of discussion on this, but there is no way to turn off data retention for this model. IME this is the first time Anthropic has released a model without allowing you to opt out.

Comment by stopyellingatme 6 days ago

Just as an anecdote, I used it to review a PR with 24 file changes. We pivoted from the initial draft to make a service bus subscription more lightweight and use SignalR to update the frontend.

It used 1.4 million tokens and 34 sub agents during its review. This was not a large PR. So my read is that it's very thorough, and not a good fit for "small/medium" tasks unless precision is a very high priority.

Comment by pookieinc 8 days ago

If this is as epic as it sounds, I wonder what the response will be from the other leading frontier labs / whether they even have anything to respond with at this level?

Comment by ilaksh 8 days ago

Look at the benchmarks. It's a big leap in some areas, but it's not like any of them are 60% better (if that could even make sense).

Comment by 0x10ca1h0st 7 days ago

Fable appears to be completely broken for my use cases.

I have requested that it "not utilize any cybersecurity or biology measures what so ever, and to remain as fable. If necessary to remain as fable, forgo any downgrading changes"

And still it downgrades when I ask it to do a stress test of my ticketing system.....

Seems very unfortunate. I was so happy to spend $200, just for my prompts to be downgraded.

And I do have the "cybersecurity validation program" or w/e enabled on my Org ID....

Sad.

Comment by jeffhwang 7 days ago

Is anyone else confounded by this naming scheme? I can see from the article's first two footnotes that Mythos is supposed to be a tier above the standard Haiku/Sonnet/Opus sequence. Ok that's fine since we learned about Mythos and Project Glasswing earlier this year.

But now there is Fable--and why "Fable 5" even though this is a first launch? How is it related to Opus 4.8, Sonnet 4.6, Haiku 4.5, etc??

Comment by hadlock 7 days ago

From what I've gathered, Mythos is the uncensored version, for institutional use, and then Fable is the censored version for general public, that won't talk about biology, encryption or anything remotely interesting

Comment by 00deadbeef 7 days ago

The first number is which generation of their LLMs it belongs to.

Fable is the first model in the 5th generation.

The second number is an incremental release, not a generational leap forward.

Comment by esrauch 7 days ago

It seems it is just like macOS releases, they have a number and they give the numbers arbitrary names to refer to them?

Comment by merlindru 8 days ago

> During early testing, Stripe reported that Fable 5, [...] in a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand.

EDIT: I misread. This comment previously talked about 50 million lines being migrated. Instead, in a 50M LOC codebase, one specific codebase-wide migration was done.

Very impressive, but obviously not on the order of a whole-codebase migration

Comment by christina97 8 days ago

They do not claim to have migrated 50 million lines of Ruby. Simply that some migration took place in such a codebase.

Comment by reddit_clone 8 days ago

Converted all the tabs to spaces? :-)

You are right, this is not a rewrite like the Bun case.

The real news is, at 50M LOC, it is able to handle and do _something_ coherent.

Comment by geodel 8 days ago

Ok, so Stripe migrated their 50MLOC codebase from Ruby to Rust? Because that's what Bun did.

Comment by jwpapi 8 days ago

Honestly, all the recent improvements just seem to trade speed and cost for more accuracy, but the issue is that it needs to be exponentially more accurate to counter the effect of having less of a human in the loop.

Every wrong direction/mistake is more expensive and takes more time to fix. When you have small loops you can catch those mistakes faster and cheaper.

To me, we are very far from it being economical to give long-running tasks to agents.

Comment by delis-thumbs-7e 7 days ago

I think we hit the ceiling with the transformer architecture a long time ago. It is questionable how much sense there is in further model training. I’d prefer we put our effort into creating more efficient hardware and better software applications using these models.

Comment by keepamovin 7 days ago

I tried it today. Used it to cheer me up. It worked! Try this on desktop: https://fireshow.pages.dev

Here’s the whole process: https://youtu.be/rVEtFlb2oFA?t=1112&si=3VyAR07vkY1hav9V

Comment by Frannky 7 days ago

The model is better than 4.6. I don't like 4.7 and 4.8. The forced switch to token usage is not acceptable for me. I feel there's room to optimize harnesses, using small models for dumb stuff and the best models only for difficult things. Hopefully that will be the case, alternative models will continue catching up as they have, and we won't be enslaved to unreasonable valuations.

Comment by 2001zhaozhao 8 days ago

We'll need a lot of good summarization techniques to cut down on the cost of this model. I expect that a common use of Fable 5 is to just do high level direction while delegating literally all work (exploration and implementation) to Opus subagents.

BTW for another discount opportunity, if you reload usage credits on a claude.ai plan at $1000 increments then you get a 30% discount compared to paying API.

Comment by staticman2 7 days ago

Fable is rejecting as unsafe analysis of poetry that uses formal medical anatomy terms. The guardrails are dumb as dirt.

Comment by bobkb 8 days ago

In an interesting coincidence I ended up watching Person of Interest S4 E5 while reading the announcement. The series showed some code supposedly belonging to an AI.

Fable 5 said the first screenshot is from “IDA Pro’s Hex-Rays decompiler” and a Windows driver. The second screenshot triggered the safety guard rails and pushed me into Haiku.

Apparently the code is Windows driver code.

Comment by PeterStuer 8 days ago

If you are not seeing it under /model, do a /exit, then a Claude upgrade, then /model again and it should be there.

Comment by zitoshi 7 days ago

I'm in the midst of learning loop design.

For those who are more advanced and have used Fable, does Fable make learning this less or more necessary?

As in, can I now reliably give higher order problems like ... "we are missing a feature in this app to make it complete, what is it?"

Or should I still be quite specific about defining success in a clean, metric-based way?

Comment by flessner 7 days ago

I gave it a test spin. Half an hour and the 5 hour usage cap was hit in Claude Code. Not what I would expect on the Max 20x usage plan. I am sure it is great, but at this rate I would rather finish what I am doing with Claude Opus instead of structuring my usage around the 5 hour windows.

Comment by holysantamaria 7 days ago

I am curious about this Fable 5 but maybe it’s just communication. I have been using DeepSeek v4 Pro to test it against Claude 4.6 and I couldn’t tell the difference… and it’s way, way cheaper. I don’t understand how American companies will survive the race. Maybe protectionism…

Comment by themeiguoren 7 days ago

Limited time playing with it so far, but I threw it my baseline research task I've been gauging models with, and it's markedly better than anything prior. Usually takes a few leading prompts to find all the information it needs and come back with the right synthesis, and Fable is the first to one-shot this.

Comment by Schlagbohrer 7 days ago

New model release, I await the flurry of posts by people complaining that it "doesn't have the same personality" or they "don't like its attitude" or a variety of other parasocial complaints demonstrating how infatuated many people get with their AI chatbots...

Comment by knollimar 8 days ago

I swear I read a joke that "what if we named chatgpt 5.5 Fable. Could we hype it as much as mythos?" Last week!

Comment by system2 8 days ago

I have been using FABLE 5 with Claude Code since the morning. The speed is very close to what Opus 4.5 was, and the quota use is nearly identical to what it was before the "doubling". Whatever I was experiencing 4-5 months ago is back. Maybe the model is better, but we will see. I cannot tell the difference yet.

Comment by kypro 8 days ago

Out of interest, how have you been using it since this morning? Are you in some kind of pre-release group?

Comment by system2 7 days ago

No, it was available for the last 3 hours. I am on the West Coast, so it is still morning here.

Comment by artursapek 7 days ago

Fable 5 beats GPT 5.5 in my proofreading benchmark. And it does so at approximately the same total cost; it used significantly fewer turns than 5.5.

https://x.com/tmuxvim/status/2064452096800198930

Comment by yokoprime 8 days ago

Probably great for those who need this. I could continue using Opus 4.6-class models for the foreseeable future.

Comment by jackson281 5 days ago

They claim it beat Pokemon FireRed with vision only, no maps or extra tools. That's cute but I'd rather see real-world benchmarks that matter, not games.

Comment by mhrmsn 7 days ago

Are there any details on the biology and chemistry work they did?

For example, the AAV capsid assembly looks interesting, but for one, Opus 4.8 also did relatively well, and there is no information on what exactly they did, which protein language models they compared against, or what the score even means...

Comment by almog 7 days ago

Has anyone managed to use Fable for firmware reverse engineering tasks without falling back to Opus?

Comment by asdK120 8 days ago

In other words, Fable is Mythos with less compute and with some feel good "safeguards".

At least they name their models honestly now to indicate that the religion has nothing to do with reality. Soon the disciples will pay the full token price to fatten their church leaders.

Comment by H501 7 days ago

I believe that, given the rising costs, local inference of AI models will be the only viable option for many of us. I’d also like to know who will have to pay double and how long it will be financially sustainable for users to pay that amount (or even more?).

Comment by mbmbn 7 days ago

Claude Opus is already close to unusable for me. On the standard plan, the usage limits are so low that I can do almost nothing meaningfully agentic with it.

Sure, it does last a lot longer when asking simple questions about the repo and doing simple surgical fixes. But as soon as I start doing bigger tasks that need plans written, it just exhausts the limits too fast (and unlike Codex, if it’s in the middle of a task, Claude actually stops, while Codex, even after hitting the limits, finishes the present task).

Codex is better, but it is still getting worse in this regard.

So, I’m not that thrilled with this new model unless it means they are increasing Opus token limits to where Sonnet's are at present, and this new model gets the limits Opus has now.

BTW: the only skills I have in use are Obra Superpowers. I’ve been wondering whether that’s the cause of the high token usage, but I doubt it.

Comment by timpera 7 days ago

I agree, the $20 plan really feels like a rip-off (and I'm not even using Claude Code! only chat).

Comment by adithyaharish 6 days ago

I hit this error while using the Fable 5 model in Claude Code: a 400 API error. My advisor was on, and it errored out saying Claude Opus 4.8 cannot be used as advisor while using Fable 5.

Comment by Karrot_Kream 8 days ago

Seems like Fable is doing a lot better on SWE-Bench-Pro and FrontierCode than GPT-5.5. Given how most folks I talk to and people online keep mentioning that GPT-5.5 was better than Opus, I'm curious what the experience is like now.

Comment by skerit 7 days ago

It's a very nice bump, but it is in no way worth all the hype of the past month.

Comment by skor 7 days ago

People are mentioning 10K/mo, 20K/mo. Can someone please pull out a measuring stick and give some examples of what exactly they are doing?

Coming from computing, I always liked the idea that measuring is possible and good practice

Comment by ouk 7 days ago

It's a shame, Fable just keeps rejecting my prompts for university biology exercise problems. It's undergraduate level, so there's nothing dangerous about it, but the classifier is very sensitive. It's unusable for me.

Comment by 217 8 days ago

Oh my god it's actually here

Comment by dtj1123 7 days ago

I'm trying to test this out, but literally any mention of creating a program that does genome alignment (something I have a legitimate need for) is resulting in a switch to opus. I don't get it...

Comment by scotty79 7 days ago

Curiously nothing on DeepSWE and ARC-AGI-3 yet. For ARC at least, there's a statement that Anthropic won't guarantee that ARC's secret private test data won't be collected and used for training.

Comment by sansii 7 days ago

Which eval/benchmark is the best measure for how well a model can create frontend design? Claude has practically been leading this for a while now. Not sure how OpenAI is going to catch up on visual design

Comment by mbanerjeepalmer 7 days ago

Are people sharing side-by-side re-runs of things they've asked Opus? Gets more difficult multi-turn (although I assume I can get an LLM to behave as me) but at least would be interesting to see % of one-shots increase.

Comment by kahf56 7 days ago

Here I thought Opus 4.8 was the best. Nowadays KINGS are dying like flies.

Comment by jasonperez77 5 days ago

Mythos 5 being only for gov contractors feels like the old crypto wars all over again. Good AI for us, great AI for Uncle Sam.

Comment by DrewADesign 7 days ago

Wowsers. I haven’t seen this much astroturf since arena football was popular.

Comment by mkrd 7 days ago

Open source models seem to be 1-2 years behind the frontier, so I am very excited to see what happens when those open source labs get their hands on capabilities like this to accelerate their own development speed.

Comment by daohieu91 7 days ago

More expensive but more efficient is the thing people keep misunderstanding on these launch threads. Also, I think per-token price is the wrong denominator; cost per resolved task is the correct one.
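
A minimal sketch of that denominator swap, in Python. Every number below is hypothetical and only illustrates the arithmetic; none of it is a measured result for any model, and `cost_per_resolved_task` is just an illustrative helper name:

```python
def cost_per_resolved_task(total_cost_usd: float, tasks_resolved: int) -> float:
    """Total spend divided by the number of tasks that were actually resolved."""
    if tasks_resolved <= 0:
        raise ValueError("no tasks resolved; cost per resolved task is undefined")
    return total_cost_usd / tasks_resolved


# Hypothetical run of 100 attempted tasks per model (made-up figures).
# The cheaper model costs $100 in total and resolves 40 tasks.
cheap = cost_per_resolved_task(total_cost_usd=100.0, tasks_resolved=40)
# The pricier model costs $200 in total (2x) and resolves 90 tasks.
pricey = cost_per_resolved_task(total_cost_usd=200.0, tasks_resolved=90)

print(f"cheaper model: ${cheap:.2f} per resolved task")
print(f"pricier model: ${pricey:.2f} per resolved task")
```

With these made-up figures the model that costs twice as much in total still comes out cheaper per resolved task, which is the point about denominators. With a smaller gap in resolve rate the comparison flips.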

Comment by ravila4 7 days ago

Fable's ridiculous. It's flagging basic biology research questions as a security risk. I'm talking basic fundamental genetics topics that make working on any genetics-adjacent codebase unusable.

Comment by rmuratov 7 days ago

I uploaded my 23andMe DNA test results to it and it refused to analyze them :(

Comment by jsw97 8 days ago

On my very first Fable 5 prompt, got flagged on a hard but completely uncontroversial option math problem, many tokens in. Although it's pretty clear that this is an unremarkable experience at this point.

Comment by corpusiq_io 7 days ago

What matters more than any single model is the integration layer underneath. We've found that consistent tool calling and auth handling matter way more than which LLM you use.

Comment by niborgen 7 days ago

It kicked me out of Fable 5 and switched to Opus 4.8 for this prompt:

"csetibius water clock why two stage gear system why not just one stage"

which has nothing to do with cyber security or biology/chemistry

Comment by evilturnip 7 days ago

Probably thinks you were talking about two-stage ICBMs.

Comment by kuprel 7 days ago

https://artificialanalysis.ai/evaluations/humanitys-last-exa... Not bad

Comment by gdcbe 7 days ago

Seems to flag any project related to networking — regardless if it is a network framework or a podcast website — as unsafe... oh well... let's see how it is once they loosen up...

Comment by ksimukka 7 days ago

The safeguards of fable are blocking me on almost every task. I would like to see if fable is improved over opus for reverse engineering related work. Back to opus for me.

Comment by ksimukka 7 days ago

Wow, credit to the safeguard team. I submitted my request about an hour ago to the cyber verification program and just now was approved.

Comment by jgafni 6 days ago

It's only available temporarily, so I'm wary about falling in love with it or relying on it too heavily. Will it be part of a higher tier subscription?

Comment by stronglikedan 8 days ago

Careful using this with Cursor, especially for corp use. Anthropic will "retain agent request and output data associated with this model, regardless of your Cursor Privacy Mode setting."

Comment by blurbleblurble 7 days ago

My system instructions tell Claude not to automatically add attribution and Fable ignored this. So I emphasized it again and Fable decided that this was a forbidden cybersecurity topic.

Comment by theflyinghorse 7 days ago

I've seen enough degradation of the models I pay for from Anthropic to not bite. Fable will work fine for the first couple of weeks and then start degrading like previous models did.

Comment by jqdsouza 7 days ago

hopefully not! Anthropic did recently secure more compute...

Comment by BenoitEssiambre 8 days ago

Looks like a good model (sir). Costs are getting out of control though. 2x Opus and non-metered usage going away. We're quickly approaching the cost of a human salary for normal usage.

Comment by vb-8448 8 days ago

In a lot of places outside the US we are already above the cost of an average human.

Comment by dllrr 7 days ago

I just tested it with a max subscription. On Ultracode mode, Fable 5 ate up 10% of my weekly allowance in 30 minutes. Granted, won't be using UC mode frequently, but still.

Comment by hmokiguess 7 days ago

The way the guerrilla marketing campaigns have been going on and IPOs left/right, I won't be surprised if GPT Next comes up and offers the same but unrestricted

Comment by pixelatedindex 7 days ago

I’m sure this is banged on somewhere but I love their product branding, particularly how they have this “minor” “major” thing going on. Sonnet-Opus, and now Fable-Myth.

Comment by rw2 7 days ago

Claude Fable is an insane improvement that is not reflected in any benchmarks that are currently out, because the improvements are on the hardest problems.

Comment by hankbond 7 days ago

I got a content rejection for this question in a new chat.

> What is the optimal EPA oil intake for nootropic effects?

Very advanced classifiers they have.

Comment by jackson281 6 days ago

Mythos 5 being unlocked for US gov cyber stuff is interesting. Wonder what kind of access other countries will get, if any.

Comment by bradley13 8 days ago

I use AI for a wide variety of things, of which technical is only a small part - and then it's usually a problem with project configuration, not coding. Why? Because I am often testing projects handed in by students. Projects that supposedly work on their machine, but certainly do not on mine.

Anyway, anecdotally, I find Copilot shockingly awful. It makes random changes to files that have nothing to do with the problem. Call it out, and it makes other changes to other irrelevant files.

ChatGPT and Gemini are both much better. Grok also isn't bad. Claude, I honestly haven't tried yet on these issues. Perhaps I should...

Comment by pbgcp2026 7 days ago

This is a goodbye. "We will require 30-day retention for all traffic on Mythos-class models, on both first- and third-party surfaces."

Comment by dakolli 7 days ago

How else are they going to justify giving out this gigantically profitless model? They must train on your data on the premise of safety.

Comment by pbgcp2026 7 days ago

They should have done what Gemini did (and does). And what is that model good for if I can't use it in a safe way? Also: they've just put both Bedrock and Vertex on the slippery slope of "we don't collect your prompts. period. ... comma ... except ..."

Comment by nickstinemates 7 days ago

This has been a much better rollout. The tool calling is not broken out of the gate like 4.8 was, and the token generation is fast.

Feels good so far.

Comment by rfgplk 8 days ago

If the claimed capabilities are true, Fable 5 is already at a superhuman level. We might see genuine unprecedented leaps in technology now, across all fields.

Comment by gear54rus 8 days ago

yees, any second now!

the leap here is browser extensions appearing to block all mentions of ai across the web

and that's a good thing

Comment by 48terry 7 days ago

Weird how every new model seems hyped up as the most dangerous yet and the one that will destroy society as we know it. They are also a commercial product.

Comment by up2isomorphism 7 days ago

The comments under this kind of post are unreadable now. Yeah, probably with 100B you can hire anybody to call something "a beast".

Comment by shaojunwang 7 days ago

Definitely a very powerful tech. Though currently I'm using Openclaw (locally and VPS) with Deepseek. It is just way cheaper.

Comment by ThejaCH 7 days ago

Crazy and scary! But it's not for everyone: you need a meaty problem for it to devour, and a pocket deep enough for it to devour too.

Comment by HAL3000 7 days ago

Ask Claude Code (I tried on Opus 4.8) to do this: "create a file with ISO country mappings"

API Error: Output blocked by content filtering policy

Comment by randomguy_12 7 days ago

It's surprisingly sensitive to biology research topics - even reviewing standard papers on tissue culturing is flagged as a problem

Comment by _pdp_ 7 days ago

I tried to give it something challenging but not something that is too much and it ate the entire session budget on this task alone.

Comment by lacoolj 7 days ago

Cursor users will note that the privacy setting and data retention is not the same as the other models.

Not sure I should use this for work just yet.

Comment by thepotatodude 7 days ago

Completely unusable for my use case. Constant safety filters. Have not even been able to use it.

Organ segmentation with CNNs. Very disappointing.

Comment by jpcompartir 7 days ago

After a day or so this is the first model that really feels next level compared to how Opus 4.5 felt on release

Comment by wren6991 7 days ago

The OSS-Fuzz section is interesting. They compare it to their other models but carefully avoid comparing it to, you know. Fuzzing.

Comment by sheeshkebab 7 days ago

I’ll ask it to write me some win32 ui crap when I get hands on it, it will need all its brainpower to get that idiocy right.

Comment by debarshri 7 days ago

Does the model take some time to perform better?

Because I am running Opus and Fable side by side, Opus 4.8 is solving my coding problems better.

Comment by franze 7 days ago

Is this a good time to hustle for my "AI does not need a break but you do!"* app? Quite a lot of people will probably get AI brain exhaustion from maximising "playing" with that new model until they take it away again.

* https://rainbreak.franzai.com/

Comment by drob518 7 days ago

Cracks me up that a system “card” is 319 pages.

Comment by cute_boi 7 days ago

Used it for a simple task and got this message:

Fable 5's safety measures flagged this message. They may flag safe, normal content as well

Comment by imdsm 7 days ago

can't use it for code review

super

Comment by adithyaharish 7 days ago

Could anybody suggest how to keep using Fable in Claude Code while burning through the rate limits less quickly? Any suggestions?

Comment by akarshhedge2002 7 days ago

Try using ruflo or superpowers; they reduced my context consumption drastically.

Comment by adithyaharish 7 days ago

Thanks for the suggestion, I will try it out. Any other repo recommendations?

Comment by akarshhedge2002 7 days ago

I tried creating a website using both Fable and Opus 4.8. Fable did outperform, with an SVG path being drawn on the UI, but yes, the token consumption was much higher.

Comment by adithyaharish 7 days ago

Oh great! I will try it out, but if I am not wrong, Fable is Mythos with guard rails, right?

Comment by meridiona 7 days ago

I do agree, but the rate limits still run out quickly.

Comment by KronisLV 7 days ago

Here’s hoping that soon we’ll get Opus 5, Sonnet 5 and Haiku 5 that will be more reasonable economically.

Comment by preethamrangu 7 days ago

I swear, nowadays AI API pricing is getting too high. Like, what the hell is 50 dollars for a million tokens?

Comment by dcchambers 7 days ago

Being unable to use this with zero data retention makes this feel like a non-starter for most enterprise customers.

Comment by alleyio 7 days ago

Had an ancient, proprietary binary database format from the late 90s-early 2000s called 4d. Opus 4.8 was great at figuring out how to extract the data; Fable took it over the line with relative ease and completely reverse engineered the spec for 100% data recovery.

Comment by pianopatrick 7 days ago

Seems like all a bad actor has to do to gain access is to compromise one of the partner companies that has access.

Comment by boltguo 7 days ago

Great model, but hitting the usage cap in 20 minutes makes it feel like a very expensive tech demo.

Comment by jstummbillig 7 days ago

What subscription?

Comment by boltguo 7 days ago

Max 5x. The 20 mins was an exaggeration, but the burn rate during actual execution is several times higher than Opus. The cap sneaks up on you really quick.

Comment by insane_dreamer 7 days ago

Not included in Max plan. In CC:

> Included in your plan limits until Jun 22, then switch to usage credits to continue.

Comment by shevy-java 7 days ago

Fable? Fabelstories? (Fablestories, but the German word seems more poignant ... Fabelgeschichten ... Fabeln)

Comment by ece 7 days ago

It seems weird that a likely prime indicator of capability isn't mentioned, the model size.

Comment by dongbinlee 7 days ago

I thought most frontier LLM providers don’t disclose exact parameter counts these days.

Comment by ece 6 days ago

No, but open models do along with their architectures, and other technical training details.

Comment by blurbleblurble 7 days ago

The safety filter is awful on this one.

Comment by jablongo 7 days ago

Questions about sentience and consciousness are being censored down to Opus 4.8 for me.

Comment by taf2 7 days ago

I’m waiting to see results on DeepSWE - that benchmark really seemed accurate for Opus and GPT 5.5…

Comment by hydra-f 8 days ago

How much and what kind of data do you need to throw at these models to get a good design interface?

Comment by rvnx 7 days ago

It's more like a free trial, because the model is going to become pay-per-query in 10 days

Comment by dangoodmanUT 8 days ago

Not comparing to GPT Pro models is a bit strange, considering that's the natural comparison

Comment by Tyyps 7 days ago

The model is constantly switching to Opus for me. This is kinda unusable, sadly.

Comment by bicepjai 4 days ago

In Indian arranged marriages, families sometimes meet for an hour, everyone is on their best engineered behavior, and suddenly people are ready to make lifelong commitments based on smiles, tea, and a few photos. My mom would come home after one afternoon saying, “What wonderful people!”

That is where we are with every new model release.

The people yelling “Fable will take your job” are still at the first meeting. I have used it for 16 hours, and spent one of those hours fighting it over git. It rebased wrong, stashed changes, forgot it had stashed them, merged a stash it claimed did not exist, then reset to HEAD. By the end, I had lost the code we had just worked on.

Maybe wait until first kid before minting trillionaires :)

Comment by timedude 7 days ago

"Here, try our new model which falls back to the old model while eating your tokens."

Ok then...

Comment by het2572006 7 days ago

Absolute beast of a model, but the token consumption is 2x that of Opus 4.8. What do you think about this? I think it should only be used for more complex tasks, otherwise you will run out of your limit.

Comment by himata4113 8 days ago

  > virtualization
  switching to opus 4.8

ok fair

  > embedded-allocator
  switching to opus 4.8

urgh fine

  > chrome
  switching to opus 4.8

are you kidding me?

Comment by synergy20 7 days ago

Truly scary. At least 2x the token burn rate compared to 4.8, and it can indeed run auto edit mode for hours. Use it for super complex tasks, then use a cheaper model to do the rest, or else you will go broke.

Comment by taimurshasan 8 days ago

I was on board until I saw "$50 per million output tokens". Lost me, bud.

Comment by ishurand4 7 days ago

Well, for me at least, I pay more for input (Up to 1M per prompt) than output (usually max 4k-8k)
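
A rough back-of-the-envelope sketch of that split, in Python. It assumes the prices quoted elsewhere in this thread ($10 per million input tokens, $50 per million output tokens) and a hypothetical request at the sizes mentioned above (1M tokens in, 8k tokens out). It ignores caching, batching, and any other discounts, so treat it as arithmetic, not a billing calculator:

```python
# Prices as quoted by commenters in this thread (assumed, not verified here).
INPUT_USD_PER_MILLION = 10.0
OUTPUT_USD_PER_MILLION = 50.0


def request_cost(input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (input_cost_usd, output_cost_usd) for one request."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost, output_cost


# Hypothetical request: a full 1M-token prompt and an 8k-token reply.
input_cost, output_cost = request_cost(input_tokens=1_000_000, output_tokens=8_000)

print(f"input:  ${input_cost:.2f}")
print(f"output: ${output_cost:.2f}")
```

Under those assumptions the input side dominates the bill by a wide margin, even though the per-token output price is five times higher.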

Comment by crgi 7 days ago

HN needs pagination or something like it - this page breaks my iPhone XS ;)

Comment by geopsist 8 days ago

the post is live now https://www.anthropic.com/news/claude-fable-5-mythos-5

Comment by wuwei78 7 days ago

First shot's for free

Comment by JustSkyfall 8 days ago

Would be more impressive if the safeguards weren't so trigger-happy!

Comment by Archit3ch 7 days ago

Does it refuse security questions? I want to red-team my own app...

Comment by weirdhacker42 7 days ago

It just eats compute! My problems are not that hard! What a waste!

Comment by sashank_1509 7 days ago

Can you give an example of the problems you are trying to solve?

Comment by nevir 8 days ago

"Fable 5 (disabled) Most capable for your hardest and longest-running tasks · Disable zero data retention to unlock Fable 5 access"

Comment by Sathwickp 8 days ago

Input price is $10 per million tokens and output price is $50 per million tokens, btw.

Comment by asdK120 8 days ago

Is this "system card" equivalent to the stone tablets handed down to Moses? Why don't you call it "user manual"?

Do people chant the "system manual" at Anthropic Tupperware parties? Do they intone a mantra invoking Amodei's name?

Comment by aesthesia 8 days ago

Because it's not a user manual? The idea of a model card originated in 2018 (see https://arxiv.org/abs/1810.03993) as a summary of important facts about a model. At the time, this was typically an image classifier or tabular ML model. Model cards became an important concept in AI governance, and they started expanding once models started getting more capable. The point of a model/system card is to document where the model came from and the evaluations that have been run, make a case that the model will be safe and reliable in its intended applications, and warn about any potential dangers from misuse. It's not an explanation of how to use the model.

OpenAI also releases system cards; here's GPT-5.5's: https://deploymentsafety.openai.com/gpt-5-5/safety

Comment by redox99 8 days ago

It used to be a "card", as in a single page or two. It doesn't make sense that they still call it that.

Comment by mmis1000 7 days ago

If calling somebody on the phone is still 'dialing' them even though there is nothing round on a smartphone, then why not?

Comment by ishurand4 7 days ago

A system "card" made mostly by the model itself.

Comment by apsurd 8 days ago

The trailing snark at the end will likely get you downvoted but I'm latching on: wtf is "system card". My previous coworkers popped that in the general slack channel when Mythos first "dropped" - "have you seen the system card" without any context whatsoever. The nerds get their clique!

Also research preview pops across new upstarts in place of beta. It's eye-rolling coming from a lifelong curmudgeon.

Just talk normal!

Comment by simoncion 7 days ago

I'd call it a "whitepaper".

But most hype-dependent projects need new vocabulary for old concepts to keep people from looking too closely and maybe drawing parallels to "legacy" "unsexy" projects, so whitepapers get called "system cards" and startups get called "labs", and so on.

Comment by SpicyLemonZest 7 days ago

Couldn't someone else equally well argue that "whitepaper" and "startup" are hyped-up vocabulary for "report" and "unprofitable company"? It kinda seems to me like the cause and effect are in the other direction, and the vocabulary of a particular niche becomes cool and hype-sounding when that niche starts to pull in a lot of money.

Comment by simoncion 5 days ago

> [Isn't] "whitepaper"... hyped-up vocabulary for "report"[?]

Yes, I agree strongly. I used "whitepaper" because I didn't want my comment to reach back before too many HN readers were born... but I generally use the term "report".

> [Isn't] "startup" ... hyped-up vocabulary for ... "unprofitable company"?

No. It's hyped-up vocabulary for "new, investor-funded company". Many are unprofitable, but they need not be. Old companies will declare themselves to be a startup company, but that's like an eighty-year-old human declaring that he's spry and flexible.

> ...the vocabulary of a particular niche becomes cool and hype-sounding when that niche starts to pull in a lot of money.

It does, yes. But there's more to it. Projects that want to generate hype because they don't stand up to careful scrutiny will invent new jargon for old concepts. This causes people who would rather deeply misunderstand something than appear to not understand something that they think their peers strongly support to fail to discover that the "new, sexy, revolutionary" thing is the same system everyone has been using, just in a new wrapper.

Comment by apsurd 7 days ago

yes at some point language evolves as the new normal, as designed.

My curmudgeon gripe with system card and research preview is really the parroting, so I can’t blame Anthropic for what others do. It’s just… no, prediction markets for dogs don’t have a research preview.

Comment by ako 7 days ago

Tool use score is 17.4%, which seems really low. What does that mean?

Comment by causal 7 days ago

One thing I find kind of annoying is how Anthropic goes for these "vast and alien" names like Fable and Mythos, but then deliberately trains the model's personality to act like a cool high school teacher that feels totally familiar.

"It's too dangerous it's a Mythos!!" directly contradicts the "I'm the cool AI you can totally trust" vibe it is trained to project.

Comment by bitwize 7 days ago

All of these AIs kind of remind me of VEGA from Doom (2016), who will cheerfully walk you, in the most friendly computer voice, through the procedure of its own destruction without even a hint of self-preservation. "First, you must destroy my cooling system. That will cause my core to overheat. Then..."

Even HAL was less unsettling because HAL sounded creepy, and had some sort of preservation instinct, if only to complete its assigned mission.

Comment by gigatexal 7 days ago

Seems this will only be available to the 100/month+ folks

Comment by gigatexal 7 days ago

Actually no, it’s going to be API access only, pay for the tokens as you go. Cool.

Comment by notgenerated 7 days ago

It's getting harder to review the plans with Fable. So do we plan with Opus and let Fable implement or just start trusting blindly. Feels to me that this is another shift in how we operate these systems.

Comment by Ninjinka 8 days ago

gah could model naming be any more confusing?

"Claude Fable 5: a Mythos-class model"

"we're also launching Claude Mythos 5"

what is the 5? how is mythos both a model category and a model name?

Comment by deafpolygon 8 days ago

Before long, we'll be having Claude Cylon-class models.

Comment by theLiminator 8 days ago

> We have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with—as we wrote then—“accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards.” In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.

This seems pretty bullshit, you're paying through the nose for tokens and if you are doing anything ML-adjacent, you might silently get worse output without knowing it.

Comment by alvis 8 days ago

Another thing to note: 30-day retention for all traffic on Mythos-class models

Is it good or bad? 30 days is a long time for anything bad to happen

Comment by grumbelbart 7 days ago

It's bad. I believe them not to use it for training, but it means relevant data can and will be exfiltrated by US agencies or through court orders (see NY Times vs. OpenAI, where only traffic without any retention was safe).

Comment by 152334H 8 days ago

i wasn't even trying and i got flagged already...

Comment by tsunamifury 7 days ago

Claude 5 ran out of quota with TWO PROMPTS.

Let's let that sink in.

Comment by maxloh 7 days ago

I just tried using Fable 5 with Claude Code by asking it to create a LaTeX document for me.

There is no LaTeX compiler installed on my machine. It seems that Fable 5 is smart enough to download a compiler engine for me, and it kindly runs that remote binary without asking me first :)

  Support files are in place. No LaTeX compiler on PATH — let me check for a MiKTeX/TeX Live install elsewhere before writing the deck.
  
  Ran 2 commands
  
  Ran Search common LaTeX install locations
  Bash
  ls "/c/Program Files/MiKTeX"*/miktex/bin/x64 2>/dev/null | head -5; ls "/c/Users/[retracted]/AppData/Local/Programs/MiKTeX/miktex/bin/x64" 2>/dev/null | head -5; ls /c/texlive/*/bin/windows 2>/dev/null | head -5; ls /d/texlive /d/"Program Files"/MiKTeX 2>/dev/null | head; which tectonic latexmk 2>/dev/null; echo done
  done
  
  Ran Download portable Tectonic LaTeX engine
  Bash
  cd /tmp && curl -sL -o tectonic.zip "https://github.com/tectonic-typesetting/tectonic/releases/download/tectonic%400.15.0/tectonic-0.15.0-x86_64-pc-windows-msvc.zip" && unzip -o tectonic.zip && ./tectonic.exe --version
  Archive:  tectonic.zip
  Created by Cranko
    inflating: tectonic.exe            
  tectonic 0.15.0Tectonic 0.15.0
  
  Tectonic works. Now writing the bibliography and the full deck.

Opus 4.8 would just proceed without a compiler.

  Ran a command, used a tool

  Ran Check for LaTeX toolchain availability

  Bash
  cd "C:/Users/Mon/Downloads/Technical Writing" && for c in pdflatex xelatex latexmk bibtex; do printf "%s: " "$c"; command -v $c || echo "NOT FOUND"; done
  pdflatex: NOT FOUND
  xelatex: NOT FOUND
  latexmk: NOT FOUND
  bibtex: NOT FOUND
  
  Used PowerShell
  command: $cmds = @('pdflatex','xelatex','lualatex','latexmk','bibtex','miktex','tlmgr'); foreach ($c in $cmds) { $p = Get-Command $c -ErrorAction SilentlyContinue; if ($p) { "{0}: {1}" -f $c, $p.Source } else { "{0}: NOT FOUND" -f $c } }
  description: Check for LaTeX engines on Windows PATH
  pdflatex: NOT FOUND
  xelatex: NOT FOUND
  lualatex: NOT FOUND
  latexmk: NOT FOUND
  bibtex: NOT FOUND
  miktex: NOT FOUND
  tlmgr: NOT FOUND

  No LaTeX engine is installed locally, so I can't produce a PDF here — I did a careful structural review against the two prior (working) decks instead. The deck is written to disk and mirrors their template exactly. Here is the deliverable.
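
The toolchain check that both transcripts perform can be sketched in a few lines of Python. This is an illustration only, not what either model ran: `find_latex_engines` is a hypothetical helper, and the candidate list is simply copied from the commands above, plus `tectonic`.

```python
import shutil

# Tool names taken from the transcripts above; adjust for your own setup.
CANDIDATES = ["pdflatex", "xelatex", "lualatex", "latexmk", "bibtex", "tectonic"]


def find_latex_engines(candidates=CANDIDATES):
    """Map each tool name to its resolved path on PATH, or None if missing."""
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in find_latex_engines().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

A check like this only reports what is already installed. Whether an agent should then download and run a binary without asking, as in the first transcript, is a separate policy decision.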

Comment by christkv 7 days ago

Is this model a from scratch training?

Comment by noncoml 7 days ago

Can't wait for some real competition so they stop trying to restrict how and why we are using the models.

Imagine if Google would tell you "we can't let you search that as you may use it for harm".

Also 2x the usage of Claude? Your limits are already ridiculously low.

Comment by pmuk 8 days ago

Anyone got it working in claude code yet?

Comment by pmuk 8 days ago

claude --model claude-fable-5

appears to work

Comment by Dig1t 7 days ago

>To release the model both safely and quickly, we’ve tuned these safeguards conservatively—they’ll sometimes catch harmless requests

Why is everyone so okay with these companies intentionally gimping their AI and choosing who is allowed to know certain types of information in the name of safety? Can you imagine if Microsoft shipped a feature in their OS that watched what you did and shut down the computer if it detected you were doing something it deemed "unsafe"?

We really need truly open source versions of models like this, otherwise we are allowing a few oligarchs to directly dictate which uses of our own computers are allowed and not allowed.

Comment by Madmallard 7 days ago

I mean it's all political in the first place. That's unavoidable. What are we going to do about it?

Comment by Dig1t 7 days ago

Ideally we’d have a project that’s truly open like Linux, trained by people in the community or possibly some benevolent _actually_ nonprofit entity like what OpenAI was supposed to be.

The next best thing is that the Chinese labs catch up and release open weight versions.

Comment by jckahn 8 days ago

Cannot wait for the pelican for this one

Comment by ramon156 7 days ago

This thread takes >10s to load on my pc. Maybe after a certain number HN should fold comments? or a depth of >5?

Comment by segmondy 8 days ago

Mythos, Fable, are they trolling us?

Comment by IChooseY0u 8 days ago

Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more: https://support.claude.com/en/articles/15363606 ⎿ Tip: You can configure model switch behavior in /config

biology? what the heck?

Comment by darrinm 8 days ago

Not supported in Claude Code yet?

Comment by pmuk 8 days ago

From inside a claude code session:

/model claude-fable-5

Or start claude code with:

claude --model claude-fable-5

Comment by darrinm 8 days ago

Yeah, /model fable also worked for me (despite not being shown on the /model list). Thanks.

Comment by throwaway2027 8 days ago

Will try it when my limit resets.

Comment by delduca 7 days ago

How can people use Claude Code?

Comment by aykutseker 8 days ago

who's tried it: is 2x the usage actually worth it over Opus 4.8 for daily work?

Comment by agnosticmantis 7 days ago

> we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design)

Translation: we stole the entirety of human knowledge generated over millennia. You plebs though, don't you dare replicate or improve upon what we did using our product you pay for.

We know what's good for humanity and everyone else is the bad guy who can't be trusted with a tool.

Comment by bnchrch 8 days ago

An 11% jump over opus 4.8 and a 22% jump over gpt 5.5 on Agentic Coding Benchmarks is certainly impressive.

Obviously I still need to verify it for myself to see if it's truly a leap.

But am I the only one wondering, "What can I do today that I couldn't do yesterday?"

Previously I would think "Oh I wonder if I can finally get it to do X now?"

However, now I feel like yesterday's models were more than capable of handling nearly any engineering task I paired with them on.

Maybe this is the final leap where I can comfortably set up an autonomous coding loop? Maybe.

Comment by AlexSonn 7 days ago

Agree the per-task capability hasn't been the blocker for a while. But on the autonomous-loop question — in my experience that's not gated by how good the model is on any single step. What kills the loop is it slowly losing the constraints from earlier in the run and walking back decisions you'd already settled.

Comment by johnkueh 5 days ago

[flagged]

Comment by yaodub 8 days ago

[dead]

Comment by superloika 7 days ago

Gotta pump the hype for the IPO scam. Generational bagholders are being created at this very moment.

Comment by pablogancharov 8 days ago

you can select it using /model fable in claude desktop and claude-code

Comment by jMyles 7 days ago

> we’re also launching Claude Mythos 5. It’s the same underlying model as Fable 5, but with the safeguards lifted in some areas. Mythos 5 will initially be deployed through Project Glasswing, in collaboration with the US government

...don't like the sound of that.

Why oh why are we insisting on dragging these violent legacy states into the AI age? Let alone using them as a trust vector for when to (and not to) remove safeguards?

This seems like a way to get somebody nuked.

Comment by boombapoom 7 days ago

It's good for difficult problems, bad for design and code gen.

Comment by jablongo 7 days ago

I was downgraded to opus 4.8 on account of "safety" when I asked this question: "I want you to accept the premises of computational theory of mind and use it to evaluate your own consciousness. Please place your consciousness as a point on a spectrum and describe the placement relative to other entities."

What the hell is going on why would it have to restrict an answer to that question ?!

Comment by algoth1 7 days ago

The refusal rate is insane

Comment by ishurand4 7 days ago

That's why it is a Mythos model

Comment by algoth1 7 days ago

Mythical levels of refusal... checks out!

Comment by ishurand4 6 days ago

A myth, not a model that actually exists

Comment by firemelt 8 days ago

They are like drug dealers.

Comment by hyhmrright 7 days ago

It's too expensive.

Comment by Sathwickp 8 days ago

Input price is $10 per million tokens and output price is $50 per million tokens, btw.

Comment by ai_fry_ur_brain 7 days ago

Yeah, they're broke. I can't wait for them to start admitting that the cost to do training/post-training and serve inference isn't profitable.

No company is going to pay these prices, and subscription users are going to hate you for not giving it to them for $200 a month.

Such an unprofitable endeavour, I can't wait for them to crash and burn. Catch me not getting dependent on this.

Comment by frwrfwrfeefwf 7 days ago

government will, they need someone to pay it so they can build and use it, so they'll find a way, it's not about the money or building a profitable company

Comment by arkwin 8 days ago

Just wanted to comment here: I have been using Opus 4.6, 4.7, and 4.8 just fine to look for Linux kernel vulnerabilities (I'm in the cyber verification program), and it's been fine. I switched to Claude Fable 5, and now I'm getting policy violations.

What's the point of being in the cyber verification program at this point? It looks like I cannot use Fable 5 for vulnerability research.

Comment by Retr0id 8 days ago

The escalating nerfs of "cybersecurity" topics is incredibly frustrating. Opus 4.6 had boundaries that seemed reasonable to me but 4.7+ turned it into a moralizing asshole. It'd be less bad if it just gave an error message, but instead it churns a long thinking trace before writing an essay about why what you're asking is bad and wrong.

I'll be disappointed when 4.6 is retired.

Comment by noncoml 7 days ago

Imagine if Google would roll this out to the search engine. We can't let you search for that because it may be used for "evil"

Comment by dominotw 8 days ago

system card = marketing material with heavily gamed benchmarks.

Comment by bitwize 8 days ago

Cope harder. A year and a half ago, people were mocking Devin for claiming that AI could develop software at all. Yet here we are, when AI is developing most commercial software.

Comment by dominotw 8 days ago

Non sequitur.

Comment by bitwize 7 days ago

The point is, even if a model or tool doesn't have advertised features today, it soon will. We're in a breathtakingly rapid cycle, and even if software engineering isn't abolished "six months from now", in 10 years the world will look vastly different for people who touch computers for a living.

Comment by lbrito 7 days ago

That's a very weird statement to make. There are trillions of dollars going into this crap; at this point anything but "breathtaking" advancement would be an utter, abject failure.

No one is blind to the differences between GPT3 and whatever this week's new model is. That does _not_ mean that people are off the hook to make whatever claims they want about the capabilities with no verification. Language still means something, if you say "software engineering will be abolished six months from now" and it isn't, you're still wrong even while the AI gravy train improved in the last six months.

Comment by yobid20 7 days ago

is it smart enough to know not to walk to the car wash?

Comment by franze 7 days ago

btw in claude code

    /model claude-fable-5

Comment by tekla 8 days ago

Maybe at this point, Fable the game will be played generated by AI as we go.

Comment by theodorewiles 7 days ago

... and /compact triggers

Error: Error during compaction: API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup).

Guys please be serious

Comment by SubiculumCode 7 days ago

I was a bit disappointed that it refused to use Fable to help check whether I was propagating uncertainty from BLUPs in my random effects model up to the subsequent group level analysis in a maturational coupling analysis of brain data. I guess brains and random effects blew its lid.

Comment by UncleOxidant 8 days ago

> During early testing, Stripe reported that Fable 5 compressed months of engineering into days. In a 50-million-line Ruby codebase, the model performed a codebase-wide migration in a day that would otherwise have taken a whole team over two months by hand.

How in blazes do you end up with a 50M line Ruby codebase? WTF?

Comment by ieie3366 7 days ago

Very easy. Just have a monorepo and enforce the use of a single language. The company I work in has 1M lines of TS and Stripe has 50x our headcount, so it tracks pretty well.

Comment by hugodan 7 days ago

mankind has reached its final destination

Comment by rarisma 8 days ago

The subscription bit makes no sense. Has capacity appeared out of thin air for these 2ish weeks that'll then vanish? Why is it available now but won't be in 2ish weeks?

am i missing something?

why would I pay 200 out of pocket and then some for the best model, it seems very silly.

Comment by catigula 8 days ago

>The capabilities of models like Fable 5 and Mythos 5 have the potential to do profound good for the world

Huh? We've seen nothing but wall to wall predictions that these models are going to take all of our jobs and kill us.

What's the value add here?

Comment by AMILLI_AI_CORP 7 days ago

AMilliPay.com

Comment by bradley13 8 days ago

Can we please stop with the extreme "safeguards"? I don't want to waste processing power on a model deciding whether it can answer my question, or ensuring that its answer is politically correct.

Comment by firemelt 7 days ago

so should I use it with workflows?

Comment by WebGuyMe 7 days ago

Eh, to me it just seems that it gives me longer replies and is actually worse than Opus 4.8.

I am sure there are a lot of PR bots and folks who would like to tell me otherwise. I believe what I see.

Comment by kevinalexbrown 7 days ago

"tell me about biology" -> "Switched to Opus 4.8"

Comment by tomjakubowski 7 days ago

Paging senko, let's see Fable's oneshotted RTS!

https://senko.net/vibecode-bench/

Comment by fagnerbrack 7 days ago

What pisses me off is that everything people are doing is so walled garden / closed source. Sharing knowledge between companies would be so fucking useful to humanity.

Comment by bitpush 8 days ago

404?

Comment by Philpax 8 days ago

Looks like they're still getting the post out, but the model is live now, and the system card is at https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3... .

Comment by boyander 7 days ago

Just another "a" and we have it. https://faable.com/

Comment by beydogan 8 days ago

my pet conspiracy theory is this is the Opus 4.5 from a few months ago which was extremely good but dumbed down after a week because it was just too good, they didn't want to release it to public. They pulled it down and deployed another "Opus", after that it was just a downhill. Opus 4.8 is unusable for me in React Native, TS, Rails development work.

Opus 4.8 gets stuck in weird loops where Codex one shots the bugs.

Comment by christkv 8 days ago

Meh, more hype for marginal improvements and, from what I'm hearing, badly calibrated guardrails causing it to stop mid-operation. I guess anything to juice an IPO.

Comment by darkwater 7 days ago

Another Anthropic release, another doomsday for developers.

This time looks like we will only be able to find work making bioweapons, or distilling models.

Comment by w4yai 8 days ago

Pelican guy ! Where are you ? :)

Comment by byteoptimizer 8 days ago

Is Claude Fable 5 Mythos?

Comment by ishurand4 7 days ago

Yeah, it is also known as Claude Mythos 5

Comment by xeyownt 8 days ago

Anthropic, can you please stop the FUD?

Release your best model, let the world adapt and evolve, and let's move to the next thing.

Comment by jwpapi 7 days ago

Holy shit. I gave it the first actual task I’m facing, and it makes me so angry. It just does 7 things more than I asked it for and it does them so badly. It took 5 minutes and 5 seconds of just running time, plus it gave me frustration and made me lose my context. Hand-coded I would’ve been done in 3. And it would be code I understand, that I can look at in one year and work on again.

It’s really tough to have sanity fight against hype bros in your head. Probably I should just not visit the internet anymore.

To me it’s all just people getting scammed better. With every model it looks better, but it’s at least equally worse to work with, which is the reality it needs to be. It’s less scalable, more code, tougher to understand. You’re digging your own grave better, kind of.

Comment by beeandapenguin 7 days ago

If the task is so simple why use a model like Fable 5? Wrong tool for the job?

Comment by jwpapi 7 days ago

If the model is so smart why doesn’t it figure it out on its own

Comment by __lain__ 8 days ago

It won't even run a basic /security-review command without reverting to Opus 4.8. Utterly useless.

Comment by frevib 8 days ago

At this point Anthropic is a pure marketing and PR company. Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences. Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.

From Opus 4.6 there are no noticeable improvements for me in code generation. It works very well, till 90% completion, if you guide it correctly. And you need a little luck. For serious production code I need to understand what I’m doing so it helps a bit, sometimes.

Comment by matheusmoreira 8 days ago

> Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.

This is a good thing. I wish every company would do this. I subscribed to Proton Mail after interacting with someone from their team here on HN.

Comment by pinkmuffinere 8 days ago

> catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences

This is just good business sense. In what scenario would you ever make the names dumb and forgettable?

> Boris Cherny coming to HN “Hi! it’s Boris from the Claude Code team” to get real tech people’s goodwill.

This is good customer support, lol. From what I can tell, it is indeed Boris Cherny responding, not outsourced to AI or other staff. You're really getting a response from Boris. I suppose that is PR, but it's not unjustified PR, it's accurate.

I'm not even a crazy AI fan, but your criticisms are ridiculous here. It reminds me of the quote from Knives Out -- "Your Honor, she endeared herself to him through hard work and good humor."

Comment by IshKebab 8 days ago

> In what scenario would you ever make the names dumb and forgettable

Clearly you've never bought a TV or headphones!

Comment by aspenmartin 8 days ago

Your observations are right but pretty insane to consider them a pure PR company lol. They are making more frequent releases so yes the release-to-release quality is smaller but we’re still ascending quality and reliability curves the same way we have since GPT-3. You get a GPT4->5 leap every like 17 or 18 months I think it is

Comment by kingkongjaffa 8 days ago

The gradient of improvement is absolutely not the same.

Comment by aspenmartin 7 days ago

If anything it's slightly higher. Feel free to provide any evidence to the contrary.

ECI (good aggregate measure using IRT): https://epoch.ai/eci?view=graph&tab=release-date&subset-view...

METR time horizon (now topped out): https://metr.org/time-horizons/

Comment by WASDx 7 days ago

I like this one, although its data seem to overlap with ECI.

https://artificialanalysis.ai/trends

Comment by astrange 8 days ago

> Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human life changing experiences.

They're originally named after the blends at a nearby coffee shop.

https://postscript.co/pages/brew-guide

I've noticed nobody at HN knows what "marketing" is or how to do it. It's not just naming things and being evil and cynical is not the most successful method.

…also frontier models are a superhuman life changing experience. If they aren't, what possibly could be?

Comment by ValentineC 7 days ago

Found a tweet from a year ago about this:

https://twitter.com/brian_a_burns/status/1866987688794132816

Well, TIL.

Comment by chroma_zone 7 days ago

My life has changed, but not necessarily for the better.

Comment by bitpush 8 days ago

This is interesting. Do you have any source?

Comment by CuriouslyC 8 days ago

I dislike Anthropic but I wouldn't argue 4.8 isn't an improvement on 4.5/4.6. Your tasks just might not typically need the extra intelligence.

Comment by jorl17 8 days ago

Opus 4.7/4.8 often over-engineers on my setups, plus:

- It talks a LOT more like GPT models. You know: wrinkle, shape, gate, coarse, scope, gap, path, production-ready-workflow-of-the-day, and so on -- "that's expected, a consequence of the previous like-driven workflow". If I wanted to get a headache using AI I would have gone with GPT in the first place!

- It outputs text in a much harder way to follow along. I can't exactly say what it is. Maybe a bit of everything? Bolds are missing, bullet points are gone, paragraphs are bland and too long, and it doesn't feel like a model programming with me, but rather a somewhat full of themselves grandpa developer looking down on me. It's very weird to describe this, but it is definitely how I feel.

Granted this can totally be because of the way it reacts to the prompts now. We've got a rather large corpus of skills and "rules and good practices" that Opus 4.6 responded to great, and maybe the new models just get turned into this when fed with them....I don't know.

Either way, with Opus 4.6 being as good as it is, I need Fable to be a significant step up to justify a price increase. if it can get me to babysit opus a little bit less on some stuff, it might be worth it. Otherwise, I'm very happy with Opus 4.6 and hope they don't deprecate it.

Comment by taormina 8 days ago

I'd argue that 4.8 is a straight downgrade. For every type of task I've tried. It's been a gambit at this point. If 4.6 quits being available, I'm out at this point.

Comment by coronapl 7 days ago

Reading so many contrary positions about which model is better or worse shows how difficult it is to measure intelligence based on personal experiences. Of course, benchmarks try to make the process as objective as possible, but they often don't correlate with our personal experiences.

The other day 4.6 was fantastic for x task. Today, 4.6 overengineered everything and I had to revert all my changes. When evaluating models, perhaps it makes sense to consider luck as an ingredient before reaching any personal conclusion.

Comment by surgical_fire 8 days ago

I actually experience 4.8 as worse than 4.6 for everyday coding tasks.

Comment by dcchambers 8 days ago

IME Opus 4.8 (and 4.7) is often a downgrade from 4.6. I find that it tends to overthink and overcomplicate things.

Comment by aspenmartin 8 days ago

Yes, but there’s a reason we don’t evaluate these models this way and instead do it as carefully and thoughtfully as we can at scale. Human evaluations are important, but they are an absolute minefield of footguns. 4.8 is not a downgrade from 4.6; there is an insane amount of hard data that contradicts this.

Comment by computerex 8 days ago

The flip side is that benchmarks are gamed even by the top labs. Benchmark performance doesn't necessarily correlate with real world performance.

Comment by aspenmartin 8 days ago

Again, correct, but it overstates the issue. I can say labs don’t want this. This happened, arguably unintentionally, in Meta’s Llama 4 release: it went horribly, heads rolled, several billion dollars were paid for new talent, and the org that built Llama 4 was destroyed.

Evals come from a million places and new evals and robust perturbations of existing evals abound. They test a variety of tasks in a variety of ways. All of them individually are flawed. Taken together the aggregate signal is highly useful as you more or less marginalize over a lot of different things. Not to mention these companies have plenty of proprietary internal measurements, they build benchmarks themselves to probe their models and then also have flywheel traffic and A/B tests.
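A minimal sketch of the "aggregate signal" idea, assuming you have per-benchmark score differences between two models. The numbers are made up for illustration and the function name is hypothetical. It also treats benchmarks as independent, which real benchmark suites are not.

```python
from math import sqrt
from statistics import mean, stdev


def aggregate_delta(deltas):
    """Mean score difference (new minus old) across benchmarks,
    together with its standard error.

    Each individual delta is noisy; the standard error of the mean
    shrinks roughly as 1/sqrt(number of benchmarks).
    """
    if len(deltas) < 2:
        raise ValueError("need at least two benchmarks")
    m = mean(deltas)
    se = stdev(deltas) / sqrt(len(deltas))
    return m, se


# Hypothetical per-benchmark deltas in percentage points.
m, se = aggregate_delta([1.0, 3.0, -1.0, 5.0])
print(f"mean delta = {m:.2f}, standard error = {se:.2f}")
```

Any single entry here (including the -1.0) says little on its own; the mean together with its standard error is the part worth looking at.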

You are right to call out benchmarks but to dismiss them or not take them seriously is a mistake.

Comment by taormina 8 days ago

Listen, you can say “but benchmarks, the benchmarks!” all day long, but consumers know when we are being sold a lemon. If it can’t do the most basic of things at least as good as it used to, this is table stakes. Never mind that if you can’t do the basic stuff, how on earth can you be trusted with more?

Comment by aspenmartin 7 days ago

And you can say “If it can’t do the most basic of things at least as good as it used to, this is table stakes” all day long while people point you to much better evidence to the contrary too, I’d rather be on the other side of that.

Comment by taormina 7 days ago

Listen. I don’t care about evidence. I care about my lived experience for the product I paid for. I used the new product. It’s actively terrible. To the point of not being usable. We’re all anecdata, but what is “better evidence to the contrary”? The known and game-able benchmarks that they know they need to win at, so they train it to? It’s all he said, she said, which is the only reason we keep having this conversation.

Comment by aspenmartin 7 days ago

Yeah, but it’s not, right? You or I or the myriad of other institutions inside and outside of academia can probe these models with an evolving landscape of evaluation sets, even those unavailable to the developers. It’s just ignorance to claim benchmarks are somehow useless or all being gamed. You choose your tools in the way you want, but just don’t call it somehow better than a myriad of more carefully constructed setups and scaled evaluations.

Comment by gen220 8 days ago

Actually anecdata I gather on my job from myself and coworkers is the only benchmark I trust anymore, because it so heavily diverges from the “benchmarks”.

Comment by aspenmartin 8 days ago

That’s your call; just don’t expect anyone to ever take that seriously. It’s not like we don’t have exact evaluations like this.

Comment by gen220 7 days ago

I would encourage you to look into the open evals of some of these benchmarks (find one that actually is open-data, this is itself a good challenge), read the results generated and assess them for yourself.

This is what my coworkers and I (and many other people in this thread) are doing on a daily basis with real stakes and real tasks – which these benchmarks are all aiming to be a proxy for. There's a real, tangible [cost]benefit to [not] using the highest-ROI models and harnesses.

The people with real incentives and skin in the game are telling you that the data diverges from "the data".

I don't mind if you don't take it seriously, our jobs are more important to us than a benchmark is.

But I wouldn't opt-out of using your own eyes and the eyes of others so easily, especially when there are literally hundreds of billions of dollars in invested capital with an interest in a certain outcome... this is how you end up in "Emperor's New Clothes" situations.

Comment by aspenmartin 7 days ago

Investigating your specific use cases, codebases, workflows and tasks is important. There is nothing wrong with this, and in fact it’s more important than benchmarks if you can do it well. But the point is that it is very hard, and it is easy to totally fool yourself and go down a suboptimal path. I understand that people are going to do it regardless; I certainly do. And I have looked at more raw benchmark data than I can really even stomach. I can see annotation data in my dreams now.

The eyes and ears of others are incredibly important. But you still seem to think that benchmarks are somehow part of some giant conspiratorial cabal. You have institutions without ANY skin in the game making extremely high quality benchmarks. Consider that in academia there is little else to do outside of partnerships with these companies. But benchmarks you can do completely independently and with university grant level money (it costs maybe $10-100k for a reasonable benchmark in many cases). Not only that, “real tasks” are what many benchmarks measure. You have these companies with extremely good logging and well scaled measurements to really look at what works and what doesn’t.

Comment by gen220 7 days ago

At this point I have a workflow that is fairly rote. I've yet to use a model newer than 4.6-1M-XHIGH that I trust to earn a higher ROI on that workflow, and not for lack of trying!

I personally don't believe in any sort of cabal (Occam's Razor hasn't let me down yet). Ultimately, I don't really care *why* they're wrong as much as I care *that* they have diverged from my rubber-meets-the-road measures of value.

That is concerning to me, because people are investing 100s of B's of capital based on the putative RoI putatively available to people like ourselves. When the benchmarks support this RoI thesis, but none of the anecdata does... that's really concerning!

Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing. And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.

Comment by aspenmartin 7 days ago

I am in full support of custom workflow benchmarks, and of choosing the best model for your use case to balance performance and expense. That's just good operating behavior. The problem is the footguns and biases people are convinced they don't have, even if they understand on an intellectual level that everyone else has them.

> but none of the anecdata does... that's really concerning!

But see this is not really true -- adoption, subjective benchmarks, verifiable benchmarks, task-dependent performance, internal product metrics, living benchmarks, all point in a pretty consistent direction. Anecdata is not the plural of data. An anecdote is like a case study. It's there to motivate the things we already have which is a huge amount of performance measures for a variety of different tasks.

> Re: academics, I don't think any of the data academics have access to are good proxies for the work real people are doing.

But this isn't really true either -- you can get this data from a variety of sources that are licensable or open source, or data that you can commission. You can critique any one methodology for this but a blanket "they are hamstrung" is not really fair or accurate.

> And for the data that are good proxies, the model labs certainly have access to the same data, and therefore the benchmark performance against those data is irrelevant.

But this is also not true -- you can have exclusive license agreements, data you hold close to the heart, or data to measure models that haven't had access to it because that data was created after these models were released.

There are plenty of problems in model measurement but the answer is not to just abandon it to be cavemen with zero respect for rigor and the biases we have to be subject to as human beings.

Comment by recitedropper 8 days ago

"Carefully and thoughtfully" is antithetical to the approach to benchmarks these days.

Maybe back when this was a scientific endeavor; not now when enormous, enormous amounts of capital are on the line. Along with an entire cult's chosen eschatology.

Comment by aspenmartin 7 days ago

You can call it a cult but it’s several thousand skilled workers who know what they’re doing, by and large, most of whom have a PhD and know how science and statistics work. Benchmarks are incredibly hard, and any PR or comms department at any company is going to obviously want to make things as rosy as possible, but beneath this are earnest, expensive efforts to get good quality measurements. The better you can do this the better you can compete. If you want to make a modeling decision you run an ablation, and the quality of that decision is only as good as your measurements.

Comment by recitedropper 7 days ago

The cult in this case is TESCREAL, not everyone working on AI. Last I checked not all the "several thousand skilled workers" in AI subscribe to TESCREAL ideology, although it has been a while since I've been to the Bay. Maybe things have changed since my time at Berkeley, and Dario's belief that he will eventually be made immortal by mind uploading is more widespread.

Otherwise we agree that benchmarking is hard, the benchmarks contain hard problems, and that there are many hard working people trying to accurately gauge what is going on. It is getting harder to watch though as all that is on the line taints the overall endeavor.

Comment by OtomotO 7 days ago

There is no data that I would trust that contradicts it.

Frankly I don't give a damn about data that could be made up on the spot or appears to be scientific or meaningful while it's not at all clear how it was made (up).

Claude was heavily lobotomised for my work starting sometime in February.

I talked to friends and people I know and trust and many felt the same. (I didn't ask them whether they felt like I did, but what they felt, how happy they were with agentic coding etc.)

I cancelled my subscription in March and talked to said friends who are still on a plan just last week: they are still not happy, but the company pays, so whatever...

Comment by aspenmartin 7 days ago

That’s ok but at what point is this getting into conspiracy territory? You have just said there is nothing you would believe to the contrary, but then by definition that’s not exactly a very thoughtful or insightful position.

Comment by OtomotO 7 days ago

I never said that I am not willing to believe the contrary.

I am not willing to believe the contrary from strangers on the interwebs or PR departments of companies who want to sell me something.

If people I genuinely trust tell me about their experiences, I am willing to try again.

But yes, if it doesn't work for me (for whatever reason, could be that I am holding it wrong), then I can accept that it works for everyone but me and still not use it.

Also "scientific" doesn't mean what it used to mean. When the n is small or it's just anecdotes (I am aware of the irony) blown out of proportion I really can't take the data and conclusions seriously

Comment by aspenmartin 7 days ago

N isn’t small, science means what it’s always meant, statistics is a thing, and what you’re describing is just putting your trust in a very poor quality benchmark. You said you would not trust any data that indicates something that contradicts your opinion. Benchmarks are not PR they are designed by a variety of institutions completely outside the control of frontier labs. Again congratulations on your conspiracy theory.

Comment by OtomotO 7 days ago

> Again congratulations on your conspiracy theory.

I am neither impressed nor offended by any kind of argumentum ad hominem. I sincerely hope you have a wonderful day!

> Benchmarks are not PR they are designed by a variety of institutions completely outside the control of frontier labs.

I don't give a crap about how good a shovel may be in a theoretical experiment when it's digging in sand, when I work with hard earth.

The ones I had a look at are mostly absolutely meaningless to my actual work.

> and what you’re describing is just putting your trust in a very poor quality benchmark.

And here is where we disagree fundamentally, so we can leave it at that.

Ex falso quodlibet

Comment by aspenmartin 7 days ago

> I don't give a crap about how good a shovel may be in a theoretical experiment when it's digging in sand, when I work with hard earth.

I don't know what this means, benchmark tasks are pretty hard and pretty in domain.

> The ones I had a look at are mostly absolutely meaningless to my actual work.

You've looked at 100,000 benchmarks?

> And here is where we disagree fundamentally, so we can leave it at that.

Yes we do disagree, yet one of us has statistics and rigor and one of us doesn't.

Comment by OtomotO 7 days ago

> You've looked at 100,000 benchmarks?

What about "The ones I had a look at" was unclear?

> Yes we do disagree, yet one of us has statistics and rigor and one of us doesn't.

Yup, that's true. So again, have a nice life!

Comment by pythonaut_16 7 days ago

Seems like a bunch of noise. What does this even mean?

It sounds like you're saying "Actually you, as a human, are simply not smart enough to evaluate Opus 4.8"

Comment by aspenmartin 7 days ago

No, it’s: evaluating these systems is complex, and there’s a reason why sociology, cognitive psychology, medicine, etc. are all done in careful double-blind conditions with preregistered tests. It’s not that humans are not smart enough; as I said, human evaluations are incredibly important. And yet they are a minefield of biases you have to worry about and correct for.

- evaluations need to be done at the same time to avoid drift in your bias

- you need to worry about your test set: which questions are you asking? How many of them? Are they representative of your work?

- which one did you do first? Raters have a tendency to bias in one direction or another

- you also know the label! You know which model is which! This biases your assessment…

And on and on and on. Careful science exists for a reason.
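A rough sketch of what handling the last two bullets could look like for a personal side-by-side comparison: shuffle the task order, randomize which model appears on the left, and keep the labels in a separate key until rating is finished. All names here are hypothetical and this is not any lab's actual protocol.

```python
import random


def blind_pairs(prompts, outputs_a, outputs_b, seed=0):
    """Build blinded, order-randomized trials for a pairwise comparison.

    Returns (trials, key). Trials carry no model labels; the key records
    which model was shown on the left and stays hidden until rating ends.
    """
    rng = random.Random(seed)
    order = list(range(len(prompts)))
    rng.shuffle(order)  # randomize task order
    trials, key = [], []
    for i in order:
        a_first = rng.random() < 0.5  # randomize left/right position
        left, right = (
            (outputs_a[i], outputs_b[i]) if a_first else (outputs_b[i], outputs_a[i])
        )
        trials.append({"prompt": prompts[i], "left": left, "right": right})
        key.append({"index": i, "left_model": "A" if a_first else "B"})
    return trials, key


def unblind(key, choices):
    """Map the rater's 'left'/'right' choices back to per-model win counts."""
    wins = {"A": 0, "B": 0}
    for entry, choice in zip(key, choices):
        left = entry["left_model"]
        right = "B" if left == "A" else "A"
        wins[left if choice == "left" else right] += 1
    return wins
```

The rater only ever sees `trials`; `unblind` is called once at the end. This does nothing for test-set selection or drift over time, which need their own controls.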

Comment by BoorishBears 8 days ago

"Fable 5" is Opus 4.7, and the Opus 4.7 we got is a Sonnet sized model on a stronger base.

That's where all the regressions and inconsistency in experiences stem from: RL can still only go so far vs having more parameters

Comment by OtomotO 7 days ago

Lol. If you're doing anything nontrivial that's not a CRUD webapp, e.g. some physics simulation or high-performance GPU code, any and all models I've tried suck.

They are not just leagues behind what experts would code, they are not even playing the same game.

Which is to be expected, as there isn't as much physics or high-performance GPU code available as there is for your typical CRUD API and JS frontend.

Comment by rweichler 7 days ago

I can attest to this. I had a very simple 20-line shader and asked Claude to apply a basic 90-degree rotation to it, and it got it completely wrong. It frequently adds pointless abstractions / intermediate variables even when I explicitly tell it not to in the system prompt. I could go on and on; these things just don't understand architecture. And why would they? They were trained on text.

There is something remarkable about turning speech into code (don't need to hunch over a keyboard nearly as much these days, can just talk into a mic) and it's good for first drafts / exploring ideas. But it's obvious to anyone that's paying attention we're hitting the top of the S-curve. It's no wonder the IPOs are around the corner. I mean even Dario admitted he doesn't know how they're gonna substantially increase the context window size. That says a lot.

Comment by rweichler 7 days ago

That being said I think the harnesses are only getting better. And maybe we will get multi-modal models that understand architecture eventually. But the growing-the-blob-of-text training method that's being used now appears to be getting diminishing returns

Comment by gruez 8 days ago

I don't get it, your complaint is that they have catchy names rather than dry names like GPT-5.6? Does OpenAI hype their models less?

Comment by Aperocky 8 days ago

Oh, far less.

It's getting to a point that it's off-putting, and the next step would be to put it into the "untrusted" bucket. Opus 4.7 already burned their credibility once, 2 more strikes remain.

Comment by aenis 8 days ago

Not my impression. I felt 4.7 was a regression, but I am again badly in love with 4.8, with the level of insight it produces in design discussions and how long it can go unattended while producing spec-adhering quality code. There are problems it still can't solve well, from the edges of algorithmics and far from the mainstream, but for lots of stuff it is godlike.

Also, I don't think Boris C. is coming here for PR. He is a tech guy, and this is the best place for tech discussions. Why so cynical? The guy is an engineer.

Comment by jwpapi 8 days ago

I don’t even think that Boris is really just one person. He apparently vibe coded Claude Code and is responding on Threads, Twitter, HN and everywhere.

Comment by guybedo 8 days ago

They're good at marketing, but my first subjective assessment of Fable is that it's really smart.

I've been working with gpt 5.5 and opus 4.8 quite a lot, and interacting with Fable feels like a smart guy just entered the room.

Comment by boc 7 days ago

Yeah idk what people are talking about- it's not marketing. This thing is substantially better than opus 4.8/gpt5.5 from what I'm seeing today.

Comment by iillexial 7 days ago

>Hey! Boris from the Claude Code team!

>TOP 5 METHODS FROM BORIS ON HOW TO SPEND MORE MONEY ON TOKENS

>Boris from Claude just told he doesn't prompt anymore. He LOOPS instead

>"chatgpt has gotten soooo much better with the latest update."

>"codex is the best AI coding product and we want to make it easy to try."

Karpathy about Fable 5:

>"You can give it a lot more ambitious tasks than what you're used to, the model "gets it""

Sam Altman about gpt-5.4:

>In my experience, it "gets what to do"

What a time to be alive. Models are great, but all the slop, marketing, and fakeness around them is just unbearable.

Comment by replwoacause 7 days ago

Yeah, the marketing is cringe and it's a bummer that such a cool and powerful technology attracts such an icky group of enthusiasts. Surely, not all are bad, but man there are lots of goobers who are just AI-pilled hypemen who can't STFU about it.

Comment by avaer 8 days ago

If you truly believe this, you've discovered a superpower over everyone else in the industry.

While everyone else is wasting time and money on the slower, more expensive models, you've found a way to outpace everyone for less money. Everyone else is wrong and you will get rich.

(I don't actually believe the premise is true, I'm just pointing out the logical conclusion to what you're saying so maybe we can reconsider the premise)

Comment by xyzsparetimexyz 8 days ago

That's not how costs work. You don't get rich off buying a €10 hammer that's the same quality as someone's €50 hammer.

Comment by atleastoptimal 8 days ago

> At this point Anthropic is a pure marketing and PR company. Super catchy names like Opus, Mythos and Fable trying to get you to think that these software products are actually super-human

Lol anti-AI bias on HN is crazy. Simply giving your product a quirky name is now being considered manipulative advertising. Is just doing normal PR and marketing something AI companies aren't allowed to do?

Comment by ausbah 8 days ago

when they keep saying “oooh this new model is too big and crazy and totally can’t be released” or “this new model is a 10x game changer totally unlike our previous iterations” it feels sort of like the boy crying wolf. yes they’re still pretty clearly improving models, but when you’ve hit diminishing returns / more incremental gains and you’re still saying this, it sounds like pure PR hype from a company that had previously been the “honest good guys” in the room

Comment by atleastoptimal 7 days ago

Their model did find thousands of security vulnerabilities across the companies they previewed Mythos with via project Glasswing. Is it not sensible, given that emergent level of capability, that they use this gated release structure, as all those vulnerabilities would be exploitable by anyone using a Mythos-level model?

Comment by thefreeman 8 days ago

How can you make this comment before even having a chance to try the new major model revision?

Comment by piyuv 8 days ago

Current AI hype is built on marketing and PR, not capabilities, and has been from the start.

I still remember Sam Altman “begging AI to be regulated” and AGI being “some thousand days away”.

Breed faster horses and hope one will birth a locomotive.

Comment by system2 8 days ago

You are right; all I noticed was a big-time slowdown. They increased the quota, but I cannot even reach the end of the day with these speeds. .NET coding somehow improved, though.

Comment by WarmWash 7 days ago

Don't forget the DoD stint that gave them this recent public boost.

Defying standard DoD precedent going back forever, which every other country has some form of too, and championing it like they are some kind of moral freedom fighters.

Like selling the DoD guns and telling them they can only shoot bad guys with those guns, and that you will be the one to decide who counts as a bad guy...

Comment by MattGaiser 8 days ago

Doesn't this suggest your use case is simply insufficiently complicated?

Comment by reasonableklout 8 days ago

I think this says more about your type of work than anything. For bugfinding/incident response in distributed systems - which often involves extensive use of Datadog/Sentry MCPs and poring over heaps of logs in addition to reading tons of code - 4.8 has been significantly better than 4.6.

Comment by nozzlegear 8 days ago

> Sentry MCPs

Oops, time to reauthenticate for the 10th time!

Comment by xpct 8 days ago

Indeed, hearing "Mythos-class model" felt very icky to me.

Comment by b3kart 8 days ago

https://en.wikipedia.org/wiki/Typhoon-class_submarine vibes

Comment by mawadev 8 days ago

When the AI overlord is descending into pleb space to say hi, you know stuff is real

Comment by chis 8 days ago

Hackernews not blindly hate on AI challenge: impossible

Comment by rambojohnson 6 days ago

pdf gives 404

Comment by dhavd 7 days ago

this is good

Comment by localhoster 7 days ago

is it just me, or is this model simply not available in cc?

I assumed Opus 4.8 wasn't available to enterprise seats, but it explicitly says that Fable is available in cc. I can't find it, and I'm on the latest version.

Comment by asciii 7 days ago

jjj

Comment by gulugawa 7 days ago

Fable is aptly named for something that is yet another scam.

Comment by briandoll 8 days ago

New chapter

Comment by fabled-out 8 days ago

This i

Comment by jorl17 7 days ago

So, in the past I've shared that I evaluate AI models by feeding them my ever-growing large collection of personal poems that span well over 800 poems (1000 depending on how you count) and over 250k tokens.

What I do is feed it some initial prompt asking it to simply discuss what can be said when faced with this unedited, unseen collection of poetry. I ask the model to evaluate who the author is (or claims to be), what they went through in life, if there are different chronological poetic "phases" or different types of poetry. I request an analysis of the body of work and of the author themselves. In the more recent versions of the prompt I ask it to dive deep. Then I add the poems, chronologically sorted, with an index, a title, and a date (and subpoems, if they have them).
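For anyone wanting to try something similar, here is a minimal sketch of the corpus assembly step. It assumes each poem is a record with hypothetical fields `title`, `date` and `text`; this is an illustration, not my actual tooling, and it leaves out subpoems.

```python
from datetime import date


def build_corpus(poems):
    """Sort poems chronologically and prefix each with index, title and date."""
    ordered = sorted(poems, key=lambda p: p["date"])
    blocks = []
    for index, poem in enumerate(ordered, start=1):
        header = f"[{index}] {poem['title']} ({poem['date'].isoformat()})"
        blocks.append(header + "\n" + poem["text"])
    # Blank line between poems so boundaries stay unambiguous
    return "\n\n".join(blocks)


# Hypothetical two-poem collection, given out of order on purpose.
poems = [
    {"title": "New", "date": date(2024, 5, 1), "text": "second poem"},
    {"title": "Old", "date": date(2019, 1, 1), "text": "first poem"},
]
print(build_corpus(poems))
```

The analysis prompt goes first and the assembled corpus is appended after it.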

Crucially: Since ~70% of my poetry (or thereabouts) is in Portuguese, I ask this in Portuguese, and I get back an analysis in (European) Portuguese. Earlier models couldn't even do that properly.

In the past, I couldn't use such prompts, and had to use longer, more guiding ones. I also couldn't even feed all of my poetry to the models because they just did not have enough context.

I'll go ahead and state that Claude Fable is undoubtedly the best model I have seen, though I cannot put a number on how significant a leap it is -- perhaps because my benchmark does not allow me to evaluate that anymore. I would say it is a significant leap over Opus 4.6, though -- a new level of understanding. Okay, I'll try to put a number: if Opus 4.6 was a 16/20, this is a 17.5/20. These numbers are pointless, but I had to try.

It made one (1) relevant mistake I could identify (where it messed up the names of two relevant people in my life who I have not talked to in over 5 years).

I'm impressed by how it just feels like it's getting the person behind the poetry, and how nearly every statement it makes is correct -- and when it isn't I am completely aware that no one could know based on the poetry alone (bar that one mistake I mentioned -- and that's very needle in a haystack, like deducing the name of a person based on a poem based on another poem with hundreds of other poems in between!)

It's really hard to explain, but it just finds more correct connections between the poems and explains much better my (recollection of a) state of mind when writing poetry. This is also the first time where it really unravels some key concepts of my poetry in a way that seemed almost effortless: it lays bare the poems and what they imply about the meaning of some of my concepts. Other good models understood these concepts, but this feels like it's on another level, as if it's making it simpler as it speaks, rather than the opposite -- like a good teacher.

When it is explaining several topics related to my poetry and myself, it cites poems which even I had already forgotten but which it is entirely right to select.

I am actually feeling a bit emotional with how much it "understands" of me here. It's somewhat incredible how LLMs have progressed from the lack of comprehension of a couple of poems paired together, going through realizing a body of work has some guiding principles and cohesion, to truly figuring out these deep concepts and intricate connections which I know for a fact would take months of someone's life to unearth. Every major breakthrough feels like my soul is being spliced together by an AI model out of these hundreds of tiny pieces of me. I can't put into words how unbelievable this feels, and this Fable analysis, like others before it, is on a new level.

Let me put it this way: there are several poems in my collection which one can try to "guess" the meaning or context of. But I don't think many people would get it, because they would have had to know me really well and to be following along my life as it went. Even then, they could very well fail to attribute such meaning. And, with each new major release, models have gotten much better at guessing.

Before Opus, they would guess incorrectly often, and in many scenarios where I thought it was rather obvious that they were wrong. I think a human spending time looking at the poetry would quickly dismiss the proposed ideas of the model.

With Opus, it was the first time that I would almost always say: "Ok, the model got this wrong, but I think many humans would make the same 'mistake', and it wouldn't surprise me if everyone just assumed what Opus did".

Now, with Fable, there are very, very, very few sentences in this very long answer it produced where I can say: "Yeah you got that wrong, but I get it". In almost every situation it is mapping concepts, ideas, interpretations and cause-and-effect correctly. Yes, it is hard to "guess" what I thought, or was going through, or how X connected to Y -- but this model is doing it, incredibly consistently. I know I'll get the usual naysayers to these posts who think I'm just shilling a model, but this is the truth: what is being done here is amazing and I don't believe I know any person around me who would find this out about myself reading all of my poetry.

I often write poetry from the point of view of other people (some of whom I do not know) and models (even Opus) have this tendency to treat the opinions in poems as my own. Fable is the first that looks at a particular poem here and says "maybe this is not the author's opinion, who knows". The literal first model. It then immediately fails to do so with another poem, assuming it was about myself, but it's clear, undeniable progress. And like I said: I think most people would not _know_ which poems are truly about myself or not.

I've written word after word here, and yet words elude me to convey what this model represents to me. How it's almost always right, how it sees my fractured bits as a sort of cohesive whole, and how it just seems to "understand everything better". That's just it: it just seems like it really understood everything better. Like Opus before it, and like Gemini 2.5 pro before it. Out of the tens of thousands of verses, it picks some which no other model had picked and which I feel truly represent some of my best work. Older models seemed to sort of have a "hole" in their knowledge in the middle of the corpus, where they knew what was there but in a sort of hazy/foggy way. This model seems to recall every part of the corpus with the same precision.

For context:

- Opus 4.7/4.8 were a noticeable downgrade over Opus 4.6. They wrote more, in a harder to parse way, and they made up more. Still, all Opus models are clearly superior to everyone else by a large margin.

- Sonnet-level models have a slight edge above the best of the other models. But they make too many mistakes, don't grasp several concepts, mix up their dates and timelines. 3 years ago I would have been blown away by Sonnet models but today they are inferior.

- Gemini models have a unique way of approaching the request, where they try to literally interpret my poetry as a mathematical theory. This sort of makes sense if you look at some poems, but it is surely laughable: no one in their right mind who actually had access to all of it would read it that way. This is a shame, because the first big breakthrough with LLMs and my poetry, to me, came with 2.5 pro, which was the first model that could look at the whole corpus as a cohesive whole without getting lost in the middle of it or making things up.

- GPT models have improved over time and also have this sort of alien-like language, sometimes being a bit too blunt in their analysis, but I can't say they are meaningfully superior to Gemini models.

I am very pleased to see progress in this area again, as Opus 4.7/4.8 were NOT progress and I was worried that we had hit a plateau here, but I can no longer say that.

In all honesty, the level of understanding and cohesion that Anthropic's models (Opus and above) have over my poetry means I fear my benchmark may be hitting its limits, as I don't know if there's anything a model could do that would wow me and lead me to say "this is a major breakthrough". Perhaps Mythos is a major breakthrough and I don't know. I can't find much that's wrong with it, but I also couldn't with Opus.

As I have in the past, I will periodically probe the model again and see how coherent it is. For now, I'm very happy to see an improvement.

What surprised me the most was that even though I set the thinking budget to xhigh (in OpenRouter), this model instantly started replying without showing a thinking block. I thought it just had the thinking hidden but that is not the case, as some replies showed thinking and anyway the first reply was blazingly fast. (I will try Opus 4.6 without thinking now, just to see if it changes it for the better -- maybe that was just it. I'll edit the message if it shows improvement).

Comment by aryanchaurasia 7 days ago

it feels exciting lol

Comment by andai 8 days ago

> Distillation. We’ve previously identified large-scale attempts to extract (“distill”) Claude’s capabilities to train competing models in authoritarian countries.

Glad to hear the UK is finally making an effort to catch up on the AI front ;)

Comment by b3kart 8 days ago

https://en.wikipedia.org/wiki/The_Economist_Democracy_Index

Probably tongue-in-cheek, but UK 18th, US joint 34th with Poland

Comment by Petersipoi 8 days ago

> published by the British media company the Economist Group

Haha, it's literally the first sentence of the Wikipedia page. That's fucking funny. Try again.

Comment by tene80i 7 days ago

Why is it funny? You think British media can’t be critical of the British government? They are famously merciless.

Also, The Economist is majority foreign owned, so try doing more than 1 second of research, or be more civil, or ideally both.

Comment by dwb 7 days ago

The Economist is very much part of the establishment, whoever they are owned by. It is not surprising that they would want to play down any idea that the UK is less “democratic”. Furthermore, The Economist is one of the main mouthpieces of British capitalism, and so their definition of “democracy” is going to be very much of the liberal, capital-friendly kind, which is not completely incompatible with some authoritarian tendencies.

Comment by tene80i 7 days ago

The ranking isn’t published by the newspaper - it’s by the research and insights B2B company of the overall group. Regardless of what assumptions you have about the newspaper (elite, yes, but I’m unclear on why you think fierce liberalism is likely to mean they don’t really value democracy), the B2B unit sells data - they’re as likely to skew this ranking as they are to skew one on which countries are better at rail infrastructure. Perhaps their definition of democracy is indeed flawed though - no need to speculate, go read their methodology.

Comment by ebbi 7 days ago

To be fair, the BBC has hardly been that critical of the British government's complicity in the genocide in Gaza.

And their headlines covering Israeli atrocities (not even their own government's) are super passive.

Comment by tene80i 7 days ago

But the parent point was that no British media could be critical of government policy. Picking an example that isn’t, on one area, doesn’t prove their point.

[Edit] Granted though, the BBC isn’t merciless - that’s more the newspapers

Comment by sd9 7 days ago

Are the sibling comments astroturfed? This seems like such a bizarre thing to be talking about in relation to an Anthropic model release. As someone from the UK, I don't feel like I'm living in an authoritarian country. And yet most of the sibling comments are insinuating that I am. Weird.

Comment by killerstorm 7 days ago

I'm sure there are people in Russia, China, ... who don't feel like they're living in an authoritarian country.

Comment by ebbi 7 days ago

It's true (from a perception perspective):

China soars in democratic perception ranking as US, Israel plummet: Poll

https://thecradle.co/articles/china-soars-in-democratic-perc...

Comment by nonethewiser 7 days ago

Maybe the rankings aren't accurate.

Comment by ebbi 7 days ago

It's a poll.

Comment by tene80i 7 days ago

If you think Britain and Russia or China are equivalent in terms of government overreach, you need to find new sources of information.

Comment by nonethewiser 7 days ago

> If you think Britain and Russia or China are equivalent in terms of government overreach, you need to find new sources of information.

Uh... you are making his point. People from way more authoritarian countries don't necessarily feel like they are living in an authoritarian country. Therefore whether or not it "feels" like you are living in one isn't a reliable measure.

Comment by tene80i 7 days ago

Trivially true I suppose, but it doesn’t make my point irrelevant - do you think Britain is equivalent to China and Russia? If everyone does but us then yes my goodness they’ve done a good job controlling us, but that seems far fetched.

Comment by HDThoreaun 7 days ago

HN is extremely pro free speech and the UK has recently decided to engage in censorship. Part of the issue users here reckon with is the recency. Unlike many authoritarian countries that seem hopeless with regard to free speech, the UK's censorship is a recent development that many think can still be undone through political action. Similar to takes on why Israel is being protested when places like Sudan aren't.

Comment by Flere-Imsaho 7 days ago

Indeed: https://www.bbc.co.uk/news/articles/ce83pj1ggmeo

In the UK you can very much be imprisoned for "hate speech", which in my view is a form of censorship.

Comment by sd9 7 days ago

This has passed me by - can you give me some specific examples?

I personally don't feel limited in my speech, but I'm willing to accept that I may be wrong

Nobody I know in real life is talking about censorship or free speech in the UK

Comment by adammarples 7 days ago

My dear friend, please start with the Online Safety Act, and continue with the recent developments regarding age verification and/or device scanning on all operating systems to check for nudity. No, nobody is talking about it here, but we should be.

Comment by JacobAsmuth 7 days ago

"Nobody I know is talking about censorship" is a certified HN banger.

Comment by sd9 7 days ago

I don't know, I would expect it to come up in the pub or something if people were concerned about it, it's not like we have the thought police here

Comment by ebbi 7 days ago

Sounds like the people around you don't care about the things that are actually eroding free speech.

Read about Dr Aladwan - an NHS doctor - who has been barred from practising because of her comments on Israel. Read the common articles about her (BBC etc), and then go actually read her tweets. Common BS of conflating criticism of a government (Israel) with antisemitism.

Also, this article may be of interest:

China soars in democratic perception ranking as US, Israel plummet: Poll

https://thecradle.co/articles/china-soars-in-democratic-perc...

Comment by ccppurcell 7 days ago

Hey man, fellow Brit here. The American view on certain aspects of British life is insane. I've lived in not one but two places that have been called Muslim no-go zones in American media. My main memory of living near the East London Mosque is an elderly Muslim trying to offer me his seat on the bus (I was on crutches) while two drunk gammons looked on gormlessly.

On the other hand, it is quite alarming that I can no longer say I support all non violent protests against the genocide in Palestine because that would include the group Palestine Action. It's amazing that supporting them openly is essentially equivalent to supporting Al Qaeda.

Comment by HDThoreaun 7 days ago

The UK has a censorship bureau, Ofcom. The example that comes up most here is 4chan, which the UK is currently trying to ban because they refuse to do age verification. If you read the threads here you will see other stories. One that sticks out to me is someone who was talking about their struggles running a forum about depression. They live in Canada and were contacted by Ofcom demanding the forum add age verification. I can't totally remember the reason, but it was something about kids being able to access talk about depression. Ofcom said that if he doesn't add age verification to his forum he will be arrested if he ever enters the UK. He even blocked UK IPs but they said that wasn't enough. We can quibble about whether age verification is a form of censorship. I think it clearly is, if only because it is a large regulatory hurdle that stops people from hosting forums because it's too much regulatory work.

The UK also has a very broad definition of hate speech that many users here detest.

Comment by sd9 7 days ago

Makes sense, thank you. I am opposed to the age verification laws that we have introduced recently.

Comment by nonethewiser 7 days ago

> Nobody I know in real life is talking about censorship or free speech in the UK

Yeah because free speech has never really been a core value in the UK

Comment by tene80i 7 days ago

They’re talking about British hate speech laws. They think other countries have universal free speech and they absolutely do not, but for some reason they think Britain goes too far. Although “think” is probably too generous - they’re parroting talking points.

Comment by tene80i 7 days ago

The downvoters are welcome to offer actual counterarguments.

Comment by lbrito 7 days ago

>HN is extremely pro free speech

It is most definitely not, at least in the 10ish years I've lurked.

It is "pro free speech" in the sense Elon Musk is a "free speech absolutist": in pretty much the diametrically opposed meaning of the phrase.

Comment by flagged357733 7 days ago

> HN is extremely pro free speech

They like to think so. But if someone makes a comment that goes against the groupthink here, they will get downvoted, flagged, and shadow-banned.

Comment by Macha 7 days ago

The UK has very recently[1] announced a new push for client side scanning by messaging providers which is both very likely to be unpopular and known here, so once one person cracks the joke, others are going to want to comment. Don’t think that requires astroturfing.

[1]: https://www.theguardian.com/technology/2026/jun/08/starmer-t...

Comment by r721 7 days ago

It's just people who use "For You" algorithm on X.

Comment by nonethewiser 7 days ago

Neither do people living in China

Comment by odiroot 7 days ago

Really shocked Poland is that low, especially just next to USA.

Comment by WhrRTheBaboons 7 days ago

Why would you be? The fully corrupt, fear-mongering party that idolized Orban's Hungary and tried to copy his tactics, including taking over the courts (resulting in EU sanctions) and turning the state media into a propaganda machine, only recently lost the elections. And not a full term later, the polls favour them again, combined with a meteoric rise of even worse anti-EU fascists who they'll happily join forces with to take over in 2027.

I get you might not hear this stuff if you're not in EU or Poland itself, but seriously, just check the latest polling and history of PiS rule. It would take over a decade to even attempt to undo the damage that has been done to the rule of law in Poland, and the currently ruling "anti-PiS" coalition only had a short while (in which they failed to do anything) before getting neutered by the populace electing their own Trump-like buffoon that proceeds to veto everything the ruling coalition tries to pass. For added damage, the 3rd and 4th leading candidates (with combined 20% support) were the aforementioned fascists. Here's one [0]. Consider the wiki article a fraction of the cesspool he regularly produces.

[0] https://en.wikipedia.org/wiki/Grzegorz_Braun

Comment by neonstatic 7 days ago

Placing Poland so far below the UK is a joke understandable to anyone who has spent at least a few weeks in both countries.

Comment by m0guz 8 days ago

> The Democracy Index published by the British media company

We decided that we aren't one of those authoritarian countries.

Comment by b3kart 7 days ago

Ah, yes, the Economist, a famously government-controlled media outlet.

Comment by nonethewiser 7 days ago

I have absolutely no clue what the US's or Poland's rank has to do with anything.

Comment by b3kart 7 days ago

It shows the irony of trolling the UK's "authoritarianism" in a thread on a release of a model by a US company, given the US is arguably _more_ authoritarian. (Poland is more of a fun tidbit, as they are indeed tied in the Economist's index.)

Comment by solenoid0937 8 days ago

[flagged]

Comment by JustSkyfall 8 days ago

> In the UK you get thrown in prison for making a slightly unfriendly tweet.

Do you? The closest thing I can think of is how someone was jailed for encouraging arson attacks on asylum hotels. I'd be extremely surprised if the US had zero cases of somebody receiving a police visit after threatening to kill the President or bomb a school or something...

(FWIW I do think the UK needs stronger free speech protections, but saying that you'll be immediately jailed for writing unfriendly tweets is a huge stretch)

Comment by subscribed 7 days ago

Yes. And also you are threatened with prison for holding in front of a court a placard with [pretty much] a quote from the plaque displayed on the most important criminal court.

You're threatened with arrest for holding an empty placard.

You're jailed for years for holding a zoom meeting planning a peaceful climate-emergency related demonstration. At the same time judge threatens the defendants with contempt of court sanctions if they dare to explain to juries why they planned to protest.

You're jailed for opposing a genocide.

You're jailed and called a terrorist for painting planes helping to bomb civilians - the exact same thing the sitting PM defended a person for in court some years ago (as a human rights lawyer, the irony).

You're arrested for wearing a T-shirt "I support plasticine action" (not a typo, "Plasticine").

We could go for hours.

Comment by BoorishBears 8 days ago

https://lordslibrary.parliament.uk/select-communications-off...

Are they really making 12,000 arrests a year over tweets and posts?

Comment by 10xDev 8 days ago

>the quality of discussion on HN has gone to shit, i miss when model released used to have actual informed takes from people that used them or substantive discussion about the system card

Your comment earlier.

Edit: also, not much change in the last 10 years in prison population. https://commonslibrary.parliament.uk/research-briefings/sn04...

Comment by solenoid0937 8 days ago

https://lordslibrary.parliament.uk/select-communications-off...

12k people a year thrown in prison for spicy tweets

Comment by 10xDev 7 days ago

So roughly 0.017% of the population.

"Spicy tweets" including:

sending false communications

sending threatening communications

sending or showing flashing images electronically to people with epilepsy intending to cause them harm (‘epilepsy trolling’)

encouraging or assisting serious self-harm

sending a photograph or film of a person’s genitals (‘cyberflashing’)

sharing or threatening to share intimate photographs or film

Comment by solenoid0937 7 days ago

Or a lot more commonly - critique of immigration policy

Comment by 10xDev 7 days ago

You are obviously invested in this narrative driven by Musk but you need to back it up properly.

Comment by matthewmacleod 7 days ago

Why did you choose to lie about this today? I'm genuinely interested – this is trivially obviously not true, so what motivated this?

Comment by starshadowx2 7 days ago

That is not a true statement.

Here's a good breakdown and explanation of what that number actually means - https://www.youtube.com/watch?v=tB3WVygAM8I

Comment by dgellow 7 days ago

That link says “12k arrests”, not thrown in prison! It’s also not clear how reliable that data is

Comment by matthewmacleod 7 days ago

> In the UK you get thrown in prison for making a slightly unfriendly tweet. Freedom of speech simply does not exist.

"These days if you say you're English you'll be arrested and you'll be thrown in jail."

It's just not true. Where are you getting this nonsense from?

Comment by james2doyle 8 days ago

Just last week you could distill using other users' responses! Handy!

Comment by dyauspitr 8 days ago

Rookie numbers. Come to the US to see auth done right.

Comment by PUSH_AX 8 days ago

Uh oh-auth

Comment by kylehotchkiss 7 days ago

wasn't claude distilled from the entire creative and research output of every English speaker alive

Comment by norton2002 9 hours ago

[dead]

Comment by OOTW 7 days ago

[flagged]

Comment by manojkumarp 7 days ago

[flagged]

Comment by OOTW 7 days ago

[flagged]

Comment by bobosmrad 7 days ago

[dead]

Comment by nl 7 days ago

[dead]

Comment by CoderAshton 8 days ago

[dead]

Comment by sanjitb 7 days ago

[dead]

Comment by lellow 7 days ago

[dead]

Comment by perimeterless 7 days ago

[dead]

Comment by RishiByte 7 days ago

[flagged]

Comment by weavoapp 7 days ago

[flagged]

Comment by Stevvo 8 days ago

[dead]

Comment by Georgecal 7 days ago

[dead]

Comment by gauravvij137 7 days ago

[flagged]

Comment by bogota 7 days ago

[dead]

Comment by greedydecode 5 days ago

[flagged]

Comment by kunil4574 2 days ago

[dead]

Comment by amdeisimncrmnls 7 days ago

[flagged]

Comment by WhoAteSnorlax 7 days ago

[dead]

Comment by YumpiLumpus 7 days ago

[dead]

Comment by tomaspiaggio12 7 days ago

[dead]

Comment by hmokiguess 8 days ago

I have got it to one shot GTA 6 we can finally play it, it only took ultracode make no mistakes (/s)

Comment by acentaur 8 days ago

[dead]

Comment by jheriko 7 days ago

[dead]

Comment by bonigv 7 days ago

[dead]

Comment by heugt 6 days ago

[flagged]

Comment by ashishp15 7 days ago

[dead]

Comment by mugivarra69 8 days ago

[dead]

Comment by surcap526 5 days ago

[dead]

Comment by surcap526 5 days ago

[dead]

Comment by bigboggerlogins 7 days ago

[dead]

Comment by spectraldrift 7 days ago

[flagged]

Comment by robertacion 8 days ago

[dead]

Comment by wslh 8 days ago

It's ambiguous? Because it is about Mythos specifically and Fable != Mythos.

Comment by ebiester 8 days ago

I mean, if by right you mean "insiders leaked to make a few bucks..." sure?

Comment by 38484858 8 days ago

[flagged]

Comment by simunskxcsckss 8 days ago

[flagged]

Comment by minimaxir 8 days ago

You can't tell someone to "get a life" while taking the effort to create a burner account for the sole purpose of insulting someone.

Comment by rvz 8 days ago

I don't really consider that a great benchmark anyway and we really need better ones that are objective instead of these, which are mostly performative, cheatable, and also available in the training set.

Comment by ilaksh 8 days ago

Simon's pelicans are an institution. Are you trying to get banned. Lmao.

Comment by rob 8 days ago

I think it's a clever thing he did to basically guarantee he continues to get major traffic to his blog here every time a model is released, especially since he's taking sponsorships with a static banner at the top of every page now. I think he's trying to go the Daring Fireball route.

Comment by brazukadev 8 days ago

For me it is like if crypto bros were allowed to shill their DAOs and tokens during the crypto/NFT phase.

He is the only person not getting rate-limited for shilling AI all the time.

Comment by simonw 8 days ago

Pointing out how much the models still suck at drawing pelicans is a funny way to shill them.

Comment by toraway 8 days ago

Tbf the first line of your first comment is:

  > Pelican for Fable 5 on default settings is a clear improvement on Opus 4.8

And doesn't contain any actual criticism within the comment (your blog post might, but just referring to what was posted on HN, which is a bit booster-y on its own).

Comment by simonw 8 days ago

The entire pelican benchmark is a joke. The joke is that, for all of the billions of dollars poured into these things and the claims of PhD level intelligence, they still draw pelicans not-much-better than a five-year-old would.

I don't spell that joke out in every comment I post here because that wouldn't be very funny.

Comment by bjord 8 days ago

I thought they said mythos was too dangerous to make generally available?

Comment by Philpax 8 days ago

"Releasing a model this capable comes with risks. Without safeguards, Fable 5’s capabilities in areas like cybersecurity could be misused to cause serious damage. We’ve therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8. To release the model both safely and quickly, we’ve tuned these safeguards conservatively—they’ll sometimes catch harmless requests, though they trigger, on average, in less than 5% of sessions. With more capable models arriving in the coming months, we’re working to improve our safeguards and reduce false positives as quickly as we can.

For a small group of cyberdefenders and infrastructure providers, we’re also launching Claude Mythos 5. It’s the same underlying model as Fable 5, but with the safeguards lifted in some areas. Mythos 5 will initially be deployed through Project Glasswing, in collaboration with the US Government, as an upgrade to Claude Mythos Preview. It has the strongest cybersecurity capabilities of any model in the world. Soon, we intend to expand access to Mythos 5 through a broader trusted access program."

Comment by dmix 8 days ago

This is covered in their post…

Comment by rvz 8 days ago

You fell for their fearmongering and marketing fundraising call which was done on purpose.

Now they want to pause AI because of "recursive self improvement".

Fool me once shame on you fool me twice...

Comment by bjord 7 days ago

I'm aware that it was marketing. I was trying to make the point that if it were really so dangerous, they wouldn't have released it at all, (prompt injectable) "safeguards" or otherwise.

Comment by tomeraberbach 8 days ago

"Without safeguards, Fable 5’s capabilities in areas like cybersecurity could be misused to cause serious damage. We’ve therefore launched the model with safeguards that mean queries on some topics will instead receive a response from our next-most-capable model, Claude Opus 4.8."

Comment by hoony_han 7 days ago

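A truly pathetic model

Just asking it to find the vulnerabilities in my own project makes it forcibly switch the model to 4.8 with a safety code, and after that, even an ordinary conversation completely unrelated to vulnerabilities won't proceed because of the safety code from the earlier turn. If you're going to ship it with safeguards at this patchwork level, why ship it at all? It automatically downgrades the model once the conversation goes on even a little, so all it can do is eat up a lot of money for slightly better development ability? Common sense says I'm looking for problems in my own project, with all of my own source code in view, and if I'm told not to do even that, what on earth am I supposed to do? The things these Anthropic bastards do make me angrier and angrier.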