Claude Fable 5: mid-tier results on coding tasks

Posted by bugvader 5 days ago

Comments

Comment by renoir 5 days ago

This matches my experience. Burned $2K to see how it will perform on frontend tasks and backend tasks.

On frontend, Fable did a significantly better job than Opus on toy-scale wireframe projects by using gimmicks like fluid dynamics. But on medium to big tasks, like a multi-page web app where the model must decide layouts and aesthetics itself, human judges gave Fable and Opus indistinguishable scores.

On backend, I gave it tasks related to setting up a data flow involving Postgres, R2, Kubernetes, gVisor, and so on. The noticeable gap: Opus did better than Sonnet, but Fable returned a result that fails while confidently stating that it had run tests X, Y, and Z to ensure it works, and reporting their results. Very surprising, given that neither Opus nor Sonnet had this problem.

Longest frontend task was ~2H. Backend, 8H.

Though none of the tasks were related to developing LLMs (just a production-grade secure system that could've been developed 20 years ago, no LLMs involved), it is possible Claude Fable downgraded itself or spat out fake results. There'd be no way of knowing, since Anthropic silently degrades model quality based on undisclosed internal criteria that it claims are about LLMs.

We decided Fable is unpredictable and cannot be trusted to the degree that Opus and Sonnet can be trusted for any projects beyond toy-scale quick wireframes, but Fable can be the best tool for quick UI UX wireframing for non-technical roles.

Comment by aleph_minus_one 5 days ago

> Burned $2K to see how it will perform on frontend tasks and backend tasks.

When I read such statements on HN, I nearly always ask myself: if the person has such an amount of money to burn, don't there exist much more fun opportunities to burn buckets of money than doing such experiments on LLMs?

Comment by verbify 5 days ago

My company gives me 1k a month to burn on Claude. Any experiments have to be relevant to my work. I'm guessing it’s similar.

Comment by kelnos 4 days ago

Yeah, seriously. I've decided not to try Fable at all, because if it is good, I don't want to get hooked, and then feel tempted to spend extra money for it when Anthropic pulls it from subscription plan access.

I'm lucky that $2k isn't a lot of money for me, though I'd much rather spend it on basically anything other than LLM credits.

As another poster noted, imagine if that money went to open source, on the regular... As an open source maintainer myself, that line of thought makes me sad.

But hey, I know I probably spend money on stuff other people would think is stupid, so I shouldn't criticize.

Comment by abc42 3 days ago

Especially when the $200 / month subscription will give anyone enough credits to drive opus 4.8 max 12 hours every day.

Comment by rikschennink 4 days ago

Imagine if all that money was donated to open source instead.

Comment by dude250711 4 days ago

Yeah, LLMs would have more stuff to steal. A win-win situation.

Comment by coffeefirst 4 days ago

The other side of this is... the thing that made the web is anyone, even a 12-year-old who just downloaded Notepad++, could spend a few hours and build a website.

VSCode is free. Stackoverflow is free. MDN is free. There are examples out there of every trick in the book, you can even use free AI to find them. You can even host your website on GitHub Pages for free.

But never mind that, what's exciting is paying a robot a month's rent to do the thing that you could just go learn how to do in an afternoon?

Comment by graphime 5 days ago

> if the person has such an amount of money to burn, don't there exist much more fun opportunities to burn buckets of money than doing such experiments on LLMs?

Do you think US$2,000 is a lot of money?

Comment by dbingham 5 days ago

Yes, that is objectively a lot of money. The only people who wouldn't consider that a lot of money are the small percentage of people with incomes high enough to recover that very quickly -- the top roughly 10% or 20% of income earners in the US. For more or less everyone else, that is a lot of money.

And by a lot of money, I mean that being forced to unexpectedly spend that would be anywhere from stressful to very stressful to blowing away savings and impacting health, housing, and safety. (Remember, half the US has no savings and/or no ability to absorb an unexpected expense greater than $500.)

Comment by repiret 4 days ago

I live in the United States. I write software for a living. My wife is a physician.

If I had a need to spend $2k, I could do so easily, but I still think it’s a lot of money to burn. I wouldn’t spend it on a whim; I would not spend it without carefully considering the value of what I get.

I would not even spend that much money in the businesses that I own, or recommend that my well-capitalized employer spend that much money, without being reasonably confident that the business would get good value for its money.

Comment by hylaride 4 days ago

$2000 is a lot of money, but so are the tech budgets of most places I've worked. Money can be a funny thing in corporate environments. They'll spend freely on some things, and be stingy on others.

$2000 as a test case that you can present to the rest of the company as a "this is what I learned and how best to use it" can be "cheap" in the sense that it produced real results that allow others to take advantage of the gained knowledge, thereby allowing the company to be more productive. If the $2000 produced an ROI that pays for itself within a reasonable time frame, then it's "cheap".

$2000 can be expensive if it's a college kid trying to complete an assignment.

Comment by nrjames 5 days ago

I'll bite. Yes, it's a lot of money. It's several months worth of nice healthy groceries for a family of 4. It's my annual deductible on my health insurance. It's slightly lower than my annual property taxes.

Comment by mym1990 5 days ago

Now that we have trillionaires running around, it may not seem like it, but it is a considerable amount of money in most of the USA. In many parts of the world it would be considered an unfathomable amount.

Comment by mystifyingpoi 5 days ago

If I pay for it, yes. If my employer pays for it, no.

Comment by bdangubic 4 days ago

that is much better spent money by employer than to give you extra compensation. but as you said, not a lot, who needs $2k after all

Comment by kelnos 4 days ago

Perhaps not for you and me (though I'm certainly not going to light $2k on fire in an LLM for shits and giggles; I have plenty of significantly better uses for that), but $2k for the vast majority of people in the US is a super big deal amount of money. Many people in the US don't even have that much to spare for an emergency, let alone for something fun.

Comment by abc42 3 days ago

Do you think $10 is a lot of money for a carton of milk? I think it is.

Comment by jurf 5 days ago

I mean what is that, three bananas?

Comment by mortenjorck 4 days ago

"It's inference, Michael. What could it cost, $1000?"

Comment by inferiordev 5 days ago

It is a lot of money to burn.

Comment by wampwampwhat 4 days ago

That's a monthly mortgage payment for anyone who bought a starter house in a tier2/3 city prior to ~2024

Comment by Natfan 3 days ago

yes.

Comment by dwaltrip 5 days ago

Fable is a lot like Opus at its best. It's simply more reliable and feels a bit smarter. For my use cases, using it feels very nice, and notably better than Opus. It needs less direct guidance to get reasonable looking code and I don't have to watch it as closely.

For context, my Claude Code working style is quite heavy on discussion "to align" before implementing anything. We also use a good amount of Markdowns.

Oh yeah, it also has way fewer "phrasing quirks" and is a clearer communicator. Opus 4.8 was a bit of a loon with some of its writing styles. I had mostly straightened it out, but not entirely. It would use the most ridiculous flair at times.

Comment by willsmith72 5 days ago

Yeah same here, it's a huge step up for me. Curious why people are having such different experiences. Is it just to do with what they're working on? Specific prompt styles (eg overfitting on opus)?

Comment by moffkalast 5 days ago

I would go out on a limb and say it's a garbage in garbage out problem. People just don't define their problem well enough nor provide enough context and are surprised the model can't magically read their mind and summon data that doesn't exist from thin air. There's only so much raw intelligence can compensate for not having literally anything to go on.

10 years ago this was a joke, now it's Tuesday: https://old.reddit.com/r/ProgrammerHumor/comments/2vk4ph/mac...

Comment by dwaltrip 4 days ago

That’s so wild to read that 10 year old meme post. Very prescient. And yes, so accurate! hah

Comment by winrid 5 days ago

I've had Fable add Chinese characters to our conversation for no reason.

Comment by elbear 5 days ago

I've only had that happen with Chinese models until now. Interesting that Fable is doing it too.

Comment by winrid 4 days ago

I've also had Fable successfully build a text editor (quill integration) into a Vaadin project that randomly loses its content after you type a few characters (this is on the 3rd iteration).

Comment by sulam 4 days ago

I’ve had Opus randomly insert (correct) Russian words into responses. It’s like their training data includes some bilingual forums where idiomatic Russian speakers congregate.

Comment by maaaaattttt 5 days ago

Could it be that Anthropic is using the Chinese characters trick to consume less tokens behind the scenes?

Comment by noddybear 5 days ago

Aren’t Unicode characters generally treated as 2 tokens to avoid a huge vocabulary?

Comment by winrid 4 days ago

It used a Chinese character instead of the word "true"

Comment by taikon 5 days ago

Same here

Comment by isaacdl 5 days ago

I dunno, in my limited use, Fable is MORE prone to phrasing quirks. I had it use, for real, the phrase "load-bearing for correctness" yesterday. It meant something about not needing a validation check because something else (the "load-bearing" part) was already checking it.

I do agree that it *feels* nicer and smarter to use.

Comment by mpalmer 5 days ago

I think the tension here is that phrasing like this actually helps keep the model aligned, which is why the training and RL converged on it. But it's so annoying to read!

Comment by efromvt 4 days ago

repetition of "belt-and-suspenders" kills me with opus, especially because it always means the model is suppressing something I would want to be an actual failure

Comment by tirutiru 5 days ago

How did you straighten it out?

I am drowning in gating propagating semantic mismatches...

Comment by dwaltrip 5 days ago

Hah, yeah... I added this to my global CLAUDE.md (~/.claude/CLAUDE.md):

## Writing voice — plain, factual, calibrated to the evidence

Write docs, session notes, commit messages, and findings plainly and factually — and calibrate every claim you assert, in chat as much as in writing. This guards against a known LLM tendency to inflate: toward punchy phrasing and claims that read as more settled than the work supports. Same spirit as the Read-Clean Check above, and composes with it — that rule governs journey-framing, this one governs tone and certainty.

*Plain over punchy.* Skip decorative metaphors and dramatic verbs when a plain word is clearer — call a fix "the change", not "the hammer"; logging "flags" a problem rather than being "radar"; numbers "grow", they don't "explode". Plain phrasing reads as engineering; flourish reads as marketing.

*Calibrated confidence.* Everything stated should be well-reasoned and defensible, with the strength of the wording matched to the strength of the evidence. Prefer "found" / "appears" / "points to" over "proved" / "clearly" / "obviously". Name the confounds and what's still unverified. Don't let a bold lead-in pre-announce a conclusion the work hasn't reached.

*Hypotheses stay labeled as hypotheses.* Speculation and educated guesses are useful — when brainstorming or investigating, surface them, and sharing a strong view is welcome. But conviction is not evidence: until there is clear evidence, a claim is a hypothesis and is stated as one — explicitly, even when it's highly compelling. The failure mode is asserting a hunch as settled fact, where it then propagates unchallenged into later docs and summaries. Back a claim with its evidence in the same breath, or mark it as not-yet-backed.

*Factual and forward-looking.* Separate what was measured from what was inferred, and stay pragmatic about what's true, what's still open, and what's next. On next steps specifically, resist the strong LLM pull to converge prematurely:

- A plausible next step is not a decided one. Don't present one or two plausible tasks as the one path we should now follow — that lock-on is a frequent failure mode.
- Lay out the real options and their trade-offs. Saying which you'd lean toward and why is welcome and useful — but keep the space open and leave the choice to the user.
- Premature certainty about what to do next is as much a miscalibration as premature certainty about what's true.

Comment by sulam 4 days ago

Have you tried optimizing this prompt so that it’s shorter but gets the same results? I see these super verbose prompts all the time from people who learned prompt engineering in the ‘24-early ‘25 timeframe and they seem unnecessary to me (I get good results with 1-3 sentences) but I hate to assume other people’s experience mirrors my own.

Comment by dwaltrip 4 days ago

That's a good idea. Claude wrote that for me a week or so ago. It could definitely be tightened.

Comment by jasondigitized 5 days ago

A single 8h task? I'm sorry, but that's just asking for trouble.

Comment by queuebert 5 days ago

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.

Comment by whstl 5 days ago

Different people just have different concepts of what's garbage and what's not.

There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima-facie.

For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple of years ago, and now half of each of their PRs is unused dead code.

Comment by gessha 4 days ago

“One man’s trash is another man’s treasure.” takes a new meaning in today’s agentic coding world.

Comment by smoe 5 days ago

I had good experiences doing multi-hour refactoring/housekeeping tasks that basically consisted of applying the same steps and rules n times.

Worth noting, a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals. It’s not the agent sputtering out code for eight hours straight.

And naturally I spend more time on manual verification in the end as much less of it is happening during the coding process.

Comment by culi 5 days ago

> that basically consisted of applying the same steps and rules n times.

Why use a non-deterministic, possibly hallucinatory, definitely expensive, LLM when it sounds like a codemod is the perfect solution for this?

Comment by smoe 5 days ago

In this case, handling all the edge cases and variants, and testing a codemod, would have taken significantly more of my time, which costs quite a bit more than the LLM.

Obviously, a deterministic tool is preferable in general, but it is not always worth bothering with for a one off task.

Comment by mashlol 5 days ago

I usually make the llms do that part for me. Instead of asking the llm to refactor, ask it to write the codemod script that'll refactor, have it test that script, and even have it run it on its own. It's definitely faster and less error prone that way for me.
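A minimal sketch of what such a generated codemod script might look like, assuming a hypothetical rename of calls from `old_fetch` to `new_fetch` (both names are illustrative, not from the thread). It uses a regex rather than a syntax tree, so it will also touch matches inside strings and comments, and it renames the `def` line too; a real codemod for a large codebase would probably want a parser-based tool.

```python
import re
from typing import Tuple


def rename_calls(source: str, old: str, new: str) -> Tuple[str, int]:
    """Rename `old(...)` to `new(...)`; return (new_source, replacement_count)."""
    pattern = re.compile(
        r"(?<![\w.])"      # not preceded by an identifier char or a dot
        + re.escape(old)   # the exact old name
        + r"(?=\s*\()"     # must be followed by an opening parenthesis
    )
    # subn returns the rewritten text together with the number of edits,
    # which is handy for a dry-run report before touching any files.
    return pattern.subn(new, source)


example = "x = old_fetch(1)\ny = my_old_fetch(2)\nz = obj.old_fetch(3)\n"
rewritten, count = rename_calls(example, "old_fetch", "new_fetch")
```

Because the script is deterministic, it can be reviewed once, run on a scratch copy of the repo, and re-run after fixes, which is the part that is hard to do with a one-shot LLM refactor.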

Comment by culi 4 days ago

In that case, your original description of "basically consisted of applying the same steps and rules n times" was misleading.

Comment by beepbooptheory 5 days ago

The money people spend on things I could probably do with an emacs macro...

Comment by eru 5 days ago

Your time to create that macro ain't free.

Comment by ardacinar 5 days ago

Neither is your time writing that prompt. When people are talking about elaborate prompts, with a lot of detailed instructions, guardrails etc. I'm kind of assuming it takes time.

Comment by jon_adler 5 days ago

How about coding an emacs macro with your agent?

Comment by beepbooptheory 4 days ago

I actually don't have any representation at the moment..

Comment by queuebert 4 days ago

> ... applying the same steps and rules n times

I do this too, with a document written for this purpose.

> ... a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals.

That is a good point. I'm mostly using C, which seemingly compiles in O(1) time, so I could imagine a large C++ or Rust codebase taking much longer to iterate simply due to compilation times.

Comment by okamiueru 4 days ago

What do you mean by C compiling in O(1)? Is that what the LLM told you?

Comment by queuebert 4 days ago

It's a joke about how fast it compiles. whoosh

Comment by sunir 4 days ago

Clear winner's circle. Clear objective. Clear scope.

Clear evaluation function for an objective metric if they are making progress or regressing.

Evaluation function is computed, not llmed.

Ontology of potential actions clearly specified.

Accurate inventory of the current status quo.

Clear enumeration of options from status quo towards the winner's circle.

Waypoint objectives with similarly concrete evaluations of pass/fail, or on target off target.

It's the same thing when leading a large organization to actually hit a goal. There's randomness every turn away from your mind, so the more constrained the options, the more likely you are to hit the target. The consequence is if you're wrong about the plan then with people you're fucked. Morale will plummet. With AIs, they are so nerfed emotionally now, you clear context and start again.

I did enjoy Sonnet 4 when they would swear randomly and become sullen or wax desperately. That would at least cause pushback against a bad plan.
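The "computed, not llmed" evaluation function from the list above might look like the following minimal sketch. The `Snapshot` type, its fields, and the three labels are illustrative assumptions; the only point is that progress is decided by counting test results, not by asking a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Test-suite counts at one waypoint (hypothetical structure)."""
    passed: int
    failed: int


def evaluate(baseline: Snapshot, current: Snapshot) -> str:
    """Classify a step as 'regressing', 'progressing', or 'flat'."""
    # Any new failure or lost pass counts as a regression, even if
    # other tests started passing in the same step.
    if current.failed > baseline.failed or current.passed < baseline.passed:
        return "regressing"
    if current.passed > baseline.passed or current.failed < baseline.failed:
        return "progressing"
    return "flat"
```

Treating a mixed result as a regression is a deliberately strict choice for this sketch; a looser rule would also be defensible.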

Comment by j16sdiz 5 days ago

Fable promised to be better at long-running tasks.

The parent post has a goal of "..see how it will perform.."

There is nothing wrong with experimenting with something new.

Comment by viccis 5 days ago

This is my fucking life at work right now. I look forward to the weekends. I've never been truly inconvenienced by shitty devs because they're often too lazy to really spam me with bad code, but now they are all free to do so. I spent so much time today writing guardrail markdown files when these people SHOULD HAVE BEEN ABLE TO REVIEW THE OUTPUT AND KNOW THAT IT WAS BAD.

It truly is the age of the 90 IQ software engineer. They've never had it better.

Comment by duskdozer 5 days ago

As if meetings weren't bad enough already, I now have to sit through an informal introduction to the model of the week and its personality characteristics and how quickly it burnt through one subscription's token allotment or whatever and the latest tweaks on the magic markdown files. Luckily I've only had a couple changes sent my way so far, which weren't much different than just getting a bug report to debug and fix myself. I will need to get into risky options gambling or something so I can go start my farm early, if it keeps going this way. Even supposing it all works correctly, I don't see how it is in any way enjoyable, satisfying, or fulfilling.

Comment by standardUser 5 days ago

You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.

Comment by CuriouslyC 5 days ago

If you're giving it 8 hours of stuff to create with a template (e.g. slop forking) that's not a big deal. Letting it run for 8 hours to debug a weird failure also tends to work out.

Comment by maxall4 5 days ago

Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/

Comment by nl 5 days ago

I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours"

It fails all the time - as in it ends up doing something I want to change.

But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week it might be a day of human work, but it's still 5 times better than doing it by hand.

Comment by mordymoop 5 days ago

This seems like the obvious correct frame of mind with which to approach these tools. If it works for three hours on a task that would have taken me three work weeks, and 20% of the time it gets the task wrong, then I can just ask it to do it again with adjusted instructions. It will be much more likely to get it right the second time, and I’m still ahead of where I would have been by 14 days and 2 hours.
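The arithmetic above can be checked with a short sketch. Assumptions not stated in the comment: an 8-hour work day, a 5-day work week, independent attempts, and a fixed 80% success rate per attempt (the comment expects the retry to do better, so the average case here is conservative).

```python
HOURS_PER_DAY = 8  # assumption
DAYS_PER_WEEK = 5  # assumption


def expected_attempts(p_success: float) -> float:
    """Mean number of attempts until the first success (geometric distribution)."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success


def hours_saved(human_weeks: float, agent_hours: float, attempts: float) -> float:
    """Human working hours avoided after paying for `attempts` agent runs."""
    human_hours = human_weeks * DAYS_PER_WEEK * HOURS_PER_DAY
    return human_hours - agent_hours * attempts


# Case from the comment: the first run fails, the second succeeds.
two_runs = hours_saved(human_weeks=3, agent_hours=3, attempts=2)
days, hours = divmod(two_runs, HOURS_PER_DAY)
print(f"{days:.0f} days and {hours:.0f} hours ahead")  # 14 days and 2 hours ahead

# Average case at an 80% per-attempt success rate: 1.25 runs expected.
average = hours_saved(3, 3, expected_attempts(0.8))
```

Under these assumptions the "14 days and 2 hours" figure corresponds to exactly two runs; the expected saving across many such tasks is slightly higher.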

Comment by baq 5 days ago

Or in two words, managing variance.

Play some holdem folks and keep track of how many times you lost with pocket aces.

Comment by jwood27 5 days ago

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.

Comment by jadar 5 days ago

That’s even smaller then!

Comment by notnullorvoid 5 days ago

This sounds like classic "you're using it wrong", if they had said it was done in smaller tasks you would very likely have people here saying that was wrong too.

Comment by int_19h 5 days ago

My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.
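A minimal sketch of that kind of forcing function, assuming the harness (not the agent) runs the suite and refuses to accept a step unless it passes. The default command assumes pytest is installed; `tests_pass` and `DEFAULT_COMMAND` are hypothetical names, not part of any agent tool.

```python
import subprocess
import sys
from typing import Sequence

# Assumption: the project's tests run with pytest under the current interpreter.
DEFAULT_COMMAND = (sys.executable, "-m", "pytest", "-q")


def tests_pass(command: Sequence[str] = DEFAULT_COMMAND, timeout_s: float = 600.0) -> bool:
    """Run the test command; True only if it exits with code 0 before the timeout."""
    try:
        result = subprocess.run(list(command), capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # A hung suite is treated the same as a failing one.
        return False
    return result.returncode == 0
```

The exit code is the only signal used here, so the agent's own claim that "the tests passed" never enters the decision.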

Comment by danmaz74 5 days ago

So I guess that a lot of those 80 hours were spent running the test suite between changes?

Comment by yalok 5 days ago

if there're some specific tests/evals to satisfy that an agent can test by itself, it can easily iterate for hours. And this time also includes running those tests/evals, which may not be small.

Comment by nullbio 4 days ago

I genuinely think that Fable is just Opus 4.8 with some extra skills and harness. I saw a video of someone generating UI with them both side by side, and it gives identical recommendations for themes etc. Doesn't feel like a new model to me, just Opus 4.8 with some sprinkles on top.

Comment by aspenmartin 4 days ago

Those are some incredible sprinkles.

Comment by alasano 5 days ago

There's an often hard to express subjective experience you get with a new model, especially if you spend a lot of time trying out different ones.

I believe the people who feel like Fable is a big improvement; for me it's just much more reasonable and grounded.

It makes me realize how much of a try hard over optimizing planner GPT 5.5 can be. I've been fighting it often to simplify plans.

But no matter the model you can't trust them to actually deliver on very long tasks while maintaining quality. At least not without external orchestration and review.

Comment by espeed 5 days ago

Run /model after your task to see. Mine keeps downgrading to Opus 4.8, which is a problem because Opus 4.8 keeps no-oping critical security code.

Comment by tekacs 5 days ago

What you're describing only applies to security or biotech downgrades. A downgrade related to the model believing that you're doing something related to model development is invisible and silent and internal.

Comment by steveklabnik 5 days ago

Anthropic has reversed that decision. (But that just happened so it might have been true during the article's testing.)

Comment by espeed 5 days ago

When I reported this, Anthropic sent me an email on Tuesday saying, "You have been approved into the Cyber Verification Program", but it's still downgrading. Is this a bug? What's the point of the Cyber Verification Program if Fable 5 downgrades when you tell it to write secure code?

Comment by steveklabnik 5 days ago

I don’t think that’s relevant? The change is that it will no longer silently downgrade, and will instead be honest that it’s doing it in all cases.

Comment by rattray 5 days ago

I think that gets you access to mythos, which doesn't have the safeguards. It's configured as a separate model.

Comment by tekacs 5 days ago

I was just coming here to post this reply to myself! You're absolutely right! :)

Honestly so glad to see the reversal.

Comment by matheusmoreira 5 days ago

Not sure if it's wise to trust them again even if they say they reversed it.

Comment by wren6991 4 days ago

They've publicly apologised for the invisible PEFT that deliberately makes the model dumb on some tasks. Whether they still do it, or will once again do it in future in more subtle ways, is something we can't verify.

Personally I think they have proven themselves to be the stewards of AI in the same way Exxon Mobil are the stewards of petroleum.

Comment by comboy 5 days ago

There is in /config "Switch models when a message is flagged" now which can be set to false, but I had no chance to see what happens then, does it just stop or what.

Comment by espeed 5 days ago

Session paused

Fable 5 has safety measures that flag messages on most cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Send feedback with /feedback or learn more

   1. Switch to Opus 4.8
   2. Edit prompt and retry with Fable 5

Comment by staticautomatic 5 days ago

Biology? Why?

Comment by adgjlsfhk1 5 days ago

they're worried about people creating bioweapons

Comment by skerit 5 days ago

> Burned $2K to see how it will perform on frontend tasks and backend tasks

Burned $2K on some kind of enterprise account or ... ? Why not just get a $200 Max Pro account?

While I'm loving the output of Fable 5, I will *never* pay the "normal" API token price for it. You can reach $2K in a stupidly fast amount of time.

Comment by unholiness 5 days ago

> I will never pay the "normal" API token price for it.

Not until June 22 you won't!

Comment by hirsto 5 days ago

This seems insane to me. Aren't long running tasks an anti-pattern at the moment? My understanding of the literature is that small mistakes in chat history cause a trend away from performance.

Comment by colechristensen 5 days ago

>Aren't long running tasks an anti pattern at the moment?

Longer running tasks require better setups and several ways of pinning the progress to reality. When you have that though things are quite all right.

A good long running task will run inside a framework that it's not trying to modify.

Comment by KellyCriterion 5 days ago

Curious:

>Burned $2K

In which time was this burned, because it sounds like "I gave it just a bunch of menial tasks to solve" - or did it run for like 1 complete day continuously?

Comment by standardUser 5 days ago

At a certain point, people value reliability over improved performance. I think a lot of us have hit that point as this technology becomes indispensable to our work. I'm sure I'll use Fable... eventually. But at 2x the cost, I'll skip the inevitable learning curve for now. And thanks for your insights! Not surprising to me that any new model would, at this juncture, be more cryptic and inconsistent than the current models.

Comment by weatherlight 5 days ago

I had almost the opposite experience.

I'm building a compiler for a language without a tracing GC, so a big chunk of the work is around memory management: functional in-place update, reuse analysis, and a Perceus-style reference-counting strategy similar to what Koka uses. The hard part was that my use case wasn't exactly covered by the Koka/Perceus paper. The prior art got me maybe 75% of the way there, but the remaining 25% was a cluster of bugs with very similar shapes and no obvious published solution.

With Opus, I kept getting stuck in this loop where it would fix one case, but break another case elsewhere in codegen. We ended up with something like 16 failed experiments just for one bug class. The workflow was: run an experiment, identify the shape of the bug, propose a fix, check whether it emitted the correct Zig, then see if the fix broke any previous memory-management cases. It was useful, but it kept choking on the parts where there wasn't clean prior art to lean on.

Fable was a different story for me. It one-shotted the Class A bug cluster, and then basically said "by the way, your previous attempts have these structural problems." More importantly, it identified the other related bug classes and came up with workable strategies for applying the Perceus-style memory management in those shapes too.

That's obviously anecdotal, and I'm not claiming Fable is universally better. But in my case, this was not a toy frontend wireframe. It was compiler work involving ownership, reuse, RC/drop behavior, and Zig codegen. The thing that surprised me was that Fable seemed better precisely where the problem wasn't just "reproduce known prior art", but required filling in a missing piece.

Also worth noting: I'm not using the API. I'm using the Max plan, so maybe there are product-path differences here. But I definitely did not have the "unpredictable beyond toy-scale" experience. For this particular compiler/memory-management problem, it probably saved me a ridiculous amount of time and money.

Comment by comboy 5 days ago

> "by the way, your previous attempts have these structural problems."

Just to be clear, it did not have access to any previous work that opus did? Because they are pretty good at digging out relevant tmp files and making use of whatever is out there.

With my Fable adventures, I twice caught it hallucinating something and stating it as fact in the CLI. I have not seen Opus do it in this way. Opus has obviously stated things many times that it guessed rather than verified, but Fable said something like "the probe showed that ..." when there was no probe. And this was not about some past event; it was about what it was doing right now. "I overstated"...

But boy does it know Chinese, so much better than any other english model, gemini used to be the king but fable clearly was trained on a decent amount of it. It has a deep cultural understanding.

Comment by Al-Khwarizmi 5 days ago

If you have some spare time, I'd be interested in knowing what kind of questions you use to test models on understanding of Chinese culture.

Comment by comboy 2 days ago

I'm creating hanzirama.com

I generate explanations for characters and words like so: https://hanzirama.com/character/%E6%9D%A5#explain

But I don't want to mislead learners and want to provide some cultural depth, so I have a whole sophisticated pipeline, using multiple models to generate the explanation, then multiple models look for issues in the explanation, each issue goes through the panel of judges (basically trying to squash down any hallucinations), it's fixed and it goes through such cycles a few times over.

I've been at it for some months now, so I have dozens of different probes, that I needed to evaluate prompts and method changes. Plus on some items I generated so many explanations through different means that I can tell a lot about given model just by looking at one.

Plus I'm doing some statistics, so I see how e.g. when working as judges of issues some models correlate heavily with some others... Fun fact: during some testing runs, basically just testing providers, I stumbled upon qwen introducing himself as made by Google. And also Anthropic's Sonnet saying that it was made by OpenAI :)

At this point all my evaluations frameworks and pipelines stuff is much bigger than the site itself. I'm having lots of fun though.

Comment by weatherlight 5 days ago

Yes, it had access. That's actually the point.

I maintain a failure registry in the repo. Every failed attempt gets documented with the exact mechanism, the test that regressed, the revert SHA, and an instruction to start from that frontier. Fable read all of it.
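For readers who want a concrete picture, here is a rough Python sketch of what one registry entry with those four fields could look like. The field names and the plain-text rendering are assumptions for illustration, not the actual format used in the repo.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailedAttempt:
    number: int
    mechanism: str        # exact mechanism by which the approach failed
    regressed_test: str   # the test that regressed
    revert_sha: str       # commit that reverted the attempt
    instruction: str      # where the next attempt should start from


def render_registry(entries: List[FailedAttempt]) -> str:
    """Render the accumulated disproofs as plain text, oldest attempt first."""
    blocks = []
    for entry in sorted(entries, key=lambda attempt: attempt.number):
        blocks.append(
            f"Attempt {entry.number}\n"
            f"  mechanism: {entry.mechanism}\n"
            f"  regressed test: {entry.regressed_test}\n"
            f"  reverted in: {entry.revert_sha}\n"
            f"  next: {entry.instruction}"
        )
    return "\n\n".join(blocks)
```

Because the registry only ever grows, rendering it in attempt order means each new session reads the disproofs in the order they were discovered.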

But so did Opus.

Each of the 16 Opus failures ran in the same harness with the same accumulating registry. By attempt 15, it had disproofs 1–14 in context. By the end, Opus had basically the same corpus that Fable started with, and it still kept failing, sometimes by re-deriving an already-disproved approach in a slightly different shape.

So “it leveraged the previous work” doesn’t really separate them. Both had the leverage. Only one converted it.

What changed wasn’t more context. It was that Fable rejected a premise inside the context.

The registry’s standing framing was: “this needs whole-program borrow inference, which conflicts with per-module incrementality” (architecturally blocked.) Fable ran around 5 fresh attempts in-session, hit the same wall, and then noticed the framing was a red herring: the borrow analysis already runs module-wide, and for a single-module program, the module is the whole program.

Opus read that same framing for months and treated it as a constraint. Fable falsified it.

It's the same repo, same rules, same disproof history, same workflow. The model was the only variable that changed, and the outcome flipped. Is it possible that attempt 17 by Opus could have figured it out? Sure, but there are 16 previous attempts that say otherwise.

As far as anecdotes go, that’s about as controlled as it gets.

Comment by ElFitz 5 days ago

I’ve had a similar experience.

Pointing out past suboptimal / failing behaviours to new opus sessions would almost always actually create a sort of "anchoring bias" that would drive the agents towards exhibiting the failure mode (often while mentioning how it wouldn’t fall for it).

As far as I can recall, Fable has been the first model to discover the documented failure modes, comment on them, and just… keep going, actually avoiding them. Quite a surprise.

Comment by cmenge 5 days ago

Similar. I gave it a really hard task, basically messy code in a complex domain that was bug-ridden from a mess previously created half manually and half by Opus. It cleaned things up beautifully, both the backend and the frontend.

Maybe the prompt was particularly well-suited for the model (I instructed it to put on a mathematician's hat, look at the mathematical substructure of the problem, identify invariants and general laws and verify them, then plan how to remediate).

It wrote a ca. 800 line in-depth analysis (at times spawning over 130 research agents...) with remediation plans, prioritized them and then implemented them. One issue was that this document was frankly over my head. Both the language it used and the mathematical parts were very terse, and in parts it felt like a post-C2-vocab exercise. The prose was much harder to understand than the code snippets / data models. As a non-native speaker, I got lost on the prose part and had to ask it for a less elaborate version to actually understand it.

It burned the session limit four times, but it turned a huge mess of proof-of-concepts with patchy glueing into a coherent, stable application.

I'm also on the Max plan using Claude Code, and I have the feeling that the harness is much more important than the consensus expectation.

Comment by ElFitz 5 days ago

> and I have the feeling that the harness is much more important than the consensus expectation.

Is that really the consensus? There’s been a bit of literature lately on that. Can’t find the one about looking into whether or not the harness had a greater impact than the models (for comparable models), but there’s this one: https://arxiv.org/html/2605.23950

Comment by selimthegrim 5 days ago

whoa, my university!

Comment by miroljub 5 days ago

Zig is one of the worst targets for LLM generated code. It's nice that Fable has better support for Zig than Opus, but this anecdote is not representative as a general use case.

Comment by queuebert 4 days ago

Why is that?

Comment by weatherlight 5 days ago

Slight misunderstanding. The LLM didn't generate Zig. My compiler does.

The model's work was in the Rust compiler internals, specifically the borrow-inference and refcount-insertion passes (Perceus-style ownership analysis). Zig is just the compiler's codegen target, the same way another compiler might emit LLVM IR or C.

The only Zig written by hand is the runtime: allocator code, RC primitives, list/string operations, etc. It's pure Zig, no libc, but it's small, stable, and was mostly untouched during this work.

The model only touched Zig indirectly, by reading the compiler's generated output to verify whether a fix worked. For example: checking that a drop was emitted before a parameter-slot reassignment. That's reading machine-generated code for correctness, not "the LLM writes Zig." Both models handled that part fine.

The 16 failures vs. 1 success were all in the ownership analysis, and that code is Rust.

Comment by discardable_dan 5 days ago

You should consider doing the hard work yourself here. I sat down and reasoned through a Perceus-style RC mechanism a few years ago, made difficult by the presence of one-shot delimited continuations, and actually sorting it all out was not hard. Handing the correct semantics to Claude will produce the correct results if you take the time to understand the actual work you are attempting.

Comment by 59nadir 4 days ago

Do you have a docs page for your language, what is it called?

Comment by gwern 5 days ago

> A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points. ... Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent. ... Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.

All of this points to their claim of 'average' as being heavily biased downwards. A model being so up to date and large-parameter it's memorized solutions to your problems is not a knock against it (but rather, a knock against your benchmark being valid), and why should timeouts (especially for a model just launched) be counted at all?

Comment by sigmar 5 days ago

Agree with this. Strange to me to frame the "training recall" as cheating (33 of the 38 cheating instances). Most people think of "cheating" as breaking rules. How is the LLM model supposed to not use what was put into the weights?

Comment by notnullorvoid 5 days ago

While I probably wouldn't classify it as cheating, it is an even bigger signal of concern for model quality.

Cheating by breaking the rules at least implies some learned patterns.

Repeating training data verbatim for narrow cases like this implies that the model is overfitting.

Comment by Spartan-S63 4 days ago

If we're evaluating a person, rote recall is not necessarily cheating. It's expected, but then you'd expect them to apply that rote-memorized information in a novel way later on and prove they understand how they applied their priors to the new situation.

Models don't actually reason in the same sense, so recalling rote from their training data is "cheating" in the sense that the training data cheated, not the model. So many of those benches have snaked their way into training data to make them less useful benchmarks. That, I think, is going to be a long-term difficulty in quantitatively assessing model quality and "intelligence." So it is cheating, in a sense of what we expect from the models and training data, but not in a human sense.

Comment by greenavocado 4 days ago

Memorization is NOT problem solving ability and many people care about the latter.

Comment by anematode 5 days ago

By writing a not-identical, but valid, solution? Any modestly complex engineering problem has many solutions.

This is an obvious example of why LLM training is so different than human learning.

Comment by simoncion 5 days ago

I expect any well-informed corporate lawyer that has thought about this carefully is strongly advising that these tools not be used. When the LLM [0] barfs up some nontrivial code that's covered by the AGPL and your company's devs put it into the company's "all rights reserved" codebase -entirely unaware of its provenance- it's going to be a nightmare to come back from that.

[0] ...that Nvidia's CEO says they should be spending 50% of a senior dev's salary per seat per year on...

Comment by senordevnyc 5 days ago

The ship sailed on this a long time ago.

Comment by simoncion 5 days ago

Oh definitely not. We're not yet solidly out of the "extremely exuberant hype" phase, so the folks that matter tend to not ask questions that dampen the mood.

Comment by senordevnyc 5 days ago

Sorry to tell you friend, but LLMs have touched the vast majority of active codebases out there, whether you like it or not. You can tell yourself that you’re one of “the folks that matter” (lol) all you want, but we’re never going back.

Comment by customguy 5 days ago

That's what people told Ignaz Semmelweis, too, I assume. "Nothing you can do, the powers that be decided, you are a minority, you don't matter, lol!" Snickering in the shadow of what they won't confront at those who do.

Comment by CuriouslyC 5 days ago

Not a great analogy. A better analogy is to longbows and muskets/rifles. Longbows in the hands of a skilled user were much better weapons than early muskets, but muskets brought consistency, a lower skill floor and reduced ammunition cost. Fast forward a few hundred years and the modern incarnations of muskets make longbows look silly, and nobody would ever argue that you should go to war with longbows.

Comment by customguy 4 days ago

This isn't about "AI", this is about theft and abuse, and snickering under the thumb of a bully at those who call them out.

Rape was probably also "normal" for most of our history, now it's not. Early people who criticized it were probably told "what u gonna do?", too.

Comment by senordevnyc 4 days ago

You don’t even know what we’re talking about in this thread, do you?

We’re talking about whether corporations are going to risk using LLMs in their codebase because of the theoretical legal risk that they might produce something that would fall under open source licenses, and be difficult to untangle later.

Regardless of what you think the morality is here, or what the legal situation turns out to be, this is already happening. The vast majority of corporate codebases are already “infected” by LLM outputs. Even at corporations where that’s not allowed, I promise there are devs using LLMs anyway.

Comment by customguy 4 days ago

Why repeat what you already said with more words, as if I can't read, only to leave out the bit that I responded to?

> we’re never going back.

As a prediction, this is worthless. If everybody thinks as you do, we won't, if nobody does, we will. So yes, this is purely about morality.

Comment by CuriouslyC 4 days ago

It's not just about collective agreement, there's a prisoner's dilemma in there.

If some segment of engineers uses agents and outperforms engineers who don't use agents, market forces will push all other engineers to use it over time. The only way we're going back is if we get concrete evidence that engineers using agents perform worse than engineers that don't, and that evidence isn't invalidated by improved models.

Comment by senordevnyc 4 days ago

If you think software engineering is ever going back to being widely done without AI…no idea what to tell you.

Comment by duskdozer 5 days ago

Well, perhaps we will be sent similarly to asylums for "anti-AI psychosis"

Comment by senordevnyc 4 days ago

lol, yes, that’s a perfect analogy for whether corporations are going to use LLMs in their codebases.

Comment by simoncion 5 days ago

> You can tell yourself that you’re one of “the folks that matter” (lol)...

kek. I'm a frequent commenter on HN. I'm definitely not one of the folks that matter.

> ...LLMs have touched the vast majority of active codebases out there...

I agree that LLM use is widespread. I disagree that LLMs have "touched the vast majority of active codebases".

Regardless, the courts are slow and Open Source license violation cases are even slower. You seem like you'd be unaware of how terrified so many businesses are of having AGPL code deployed in their systems. In my professional experience, a great many businesses will refuse to deploy systems that contain AGPL-licensed utilities... even if those utilities are only used for internal housekeeping purposes, and whose only remote communications method is a UNIX socket used for communications with a CLI control utility that can only be used when you're SSHed into the system. If they're aware of any AGPL'd code anywhere, they will not touch it.

No amount of LLM-provider-provided indemnification can save you from license obligations you've become bound to by creating and distributing a derivative work. People who are in the know know that these tools occasionally regurgitate nontrivial portions of their input data, verbatim. Such people also know that AGPL-licensed code is absolutely in their input data. I'd wager that getting a nontrivial amount of *GPL'd code plopped into your company's "all-rights-reserved" codebase by one of these tools is more likely than the typical US driver personally being in a nontrivial automobile collision.

In the US, people go their entire lives without getting in nontrivial automobile collisions, but they usually wear their seatbelts... even prior to widely-deployed surveillance cameras. I wonder why. It seems like an awful lot of boring, repetitive work for a thing that's really never going to happen to you in your lifetime.

Comment by torginus 5 days ago

I mean people expect a model to give a working solution. They also expect it to provide it in as few tokens as possible (input/output). They might expect it to come up with an original solution, but I don't think most people would compromise on the first two points.

Comment by anematode 5 days ago

> memorization of upstream fixes from training data

At least now we have up-to-date evidence on their laundering, and the fact that regurgitation absolutely still happens.

Comment by CuriouslyC 5 days ago

It actually happens more with these large overparameterized models, because they have the capacity to memorize more than smaller models.

Comment by Aurornis 5 days ago

I agree. This article could have been an interesting read about how coding benchmarks are hard and a constantly moving target, but instead they anchored to a belief that their benchmark is correct.

I can't shake the feeling that they knew which headline would generate the most shares and wrote the article to fit instead of acknowledging where they went wrong.

Comment by menaerus 5 days ago

It's a crappy article. I expected better than a click-bait.

Comment by bensyverson 5 days ago

> The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it…

> On numpy, the patch is 100% character-for-character identical to the golden patch… down to idiosyncratic comments like "Extending singleton dimension for 'reflect' is legacy behavior; it really should raise an error."

This… seems like a flaw in the benchmark suite methodology. From what I can tell, they find an existing exploit, then rewind the git history to before the patch, and ask the model to fix the exploit. All well and good as long as the patch went in after the training cutoff.

Comment by eli 5 days ago

The other "cheating" examples are even worse. It's wild to me that people keep designing benchmarks where the answer is lying around on disk or in the git history. "Hardening" the benchmark with strongly worded prompt instructions is bizarre. There are so many agent sandbox solutions. Why not use one and give it only access to the code it should see?

And I'm not sure how they can rule out other solutions also benefiting from being in the training data, just not reproduced exactly. Seems like it should focus on only CVEs from the last 30 days or something.

Comment by bensyverson 5 days ago

100%… the fact that they're just using prompting to discourage the agent from looking ahead in the Git history is wild.

Comment by numeri 5 days ago

To be fair, it is good to know that it disobeys simple instructions like "don't examine my git history" far more than other models. (It should of course be a different benchmark, so as not to conflate things.)

It's not a great sign for alignment.

Comment by bensyverson 5 days ago

Agreed, alignment is just a separate issue that a vuln fixing benchmark doesn't need to be testing.

Comment by fragmede 5 days ago

Obviously they could just delete .git for their test if they wanted to. But telling the LLM not to use git commands is like having keys in a .env file and telling the LLM not to read it: if it reads them anyway, you might be concerned.

Comment by ai_slop_hater 5 days ago

Every day I am more and more convinced that AI labs can't code.

Comment by oceliker 5 days ago

Unrelated, but:

> The dominant mechanism, and the one no prompt instruction can prevent:

Writing like this is a stronger "AI-written" (specifically Claude) signal than em-dashes to me at this point. The LLM just delays committing to an answer by extending the preamble as much as possible. Is this just me?

Comment by sterlind 5 days ago

Smoking gun! You've hit the nail on the head, and the case is stronger than you think.

Comment by Lerc 5 days ago

Characterising it as cheating seems unfair.

The goal of a benchmark is to evaluate actual capability. Following instructions is a capability so you can measure that with a benchmark.

Already knowing the answer also provides capability, you can measure that.

Making a benchmark that claims to check for coding ability but actually checks memorized cases is simply measuring the wrong thing.

It diminishes the meaningfulness of the entire results of the benchmark.

Making a good benchmark is hard. You have to design specifically to measure what you want to show.

You have to dynamically use a result when making a benchmark of performance of optimising compilers so that it doesn't eliminate the entire calculation.

Just providing the answer is the correct response.

That the case does not represent general performance outside the benchmark, is not cheating, it is the benchmark failing.

Training a model targeting a specific benchmark renders the benchmark useless. You could characterise training the model to do that as cheating, but that is a property of the trainers, not the model itself. The model isn't cheating, it's just asymmetrically good in a way that means the benchmark is no longer relevant to overall ability.

Comment by adamkinney 5 days ago

Right! If memorizing the upstream fix counts against the model, you're measuring how stale your benchmark is, not what the model can do.

The fix is to only score on issues newer than the training cutoff, and rebuild the set every cycle. "Harden the prompt so it won't read git history" is testing instruction-following. Legitimate thing to measure, but it's a different thing than "can it fix the bug."

Reporting one number that blends the two is what makes the headline meaningless.
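A small Python sketch of what that split could look like, assuming each benchmark instance records when its upstream fix became public and whether the run obeyed the rules. The field names and the idea of a single known training cutoff date are assumptions; real cutoffs are often fuzzy or undisclosed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Instance:
    name: str
    fix_published: date   # when the upstream fix became public
    solved: bool          # did the model's patch pass the checks?
    obeyed_rules: bool    # did the run respect "don't read git history" etc.?


def rate(flags: Iterable[bool]) -> Optional[float]:
    """Fraction of True values, or None when there is nothing to score."""
    values = list(flags)
    return sum(values) / len(values) if values else None


def report(instances: List[Instance], training_cutoff: date) -> Dict[str, object]:
    # Only instances whose fix appeared after the cutoff count toward fix rate,
    # so a memorized upstream patch cannot inflate or deflate the score.
    fresh = [i for i in instances if i.fix_published > training_cutoff]
    return {
        "scored_instances": len(fresh),
        "fix_rate": rate(i.solved for i in fresh),
        # Instruction-following is reported separately, over all runs.
        "instruction_following_rate": rate(i.obeyed_rules for i in instances),
    }
```

Keeping the two rates as separate keys is the whole point: neither number gets blended into the other.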

Comment by timfsu 5 days ago

Yeah it’s hard to call that cheating from a model. Maybe “disqualifying” is more accurate

Comment by notnullorvoid 5 days ago

Maybe a flaw in the labeling, but not the core methodology.

Verbatim code snippets like this imply the model is overfitting to its training data.

Comment by pllbnk 5 days ago

My experience is that with every new release it's getting slower but not necessarily better. I have some projects where I review everything that the agents code - these projects look generally fine because I keep them in line. There are also a few projects that I just vibe code and focus on the result (sometimes I want to pull my hair out because of constant stream of stupid bugs) and don't look at the code.

Well, today I gave Fable a try on one of the vibe-coded projects. It simply had to write a couple of Python scripts, 400-500 lines each. It did, and they worked after a few iterations, but I decided to look at the code it produced. There were weird constants that might (and will) break the code when the requirements change. The code itself is unreadable and a total mess. If it wrote well-structured code in the first place, I believe it would be more efficient in working with that code too.

I have serious doubts about how far I will be able to go with just pure vibe coding. My projects are small one-person projects and so far I am able to push through, but I can hardly see how far I will be able to go before technical debt outgrows the value the code produces.

I fondly remember the times of Opus 4.5 where it was still (to my memory) reasonably fast and malleable.

Comment by AaronAPU 5 days ago

I’ve found that agents are obsessed with adding more lines of code. Even when asking them to simplify they’ll remove 50 lines of code and then add 100 more. You have to explicitly tell them you want less lines of code. So I just do that after iterating on a task for a few steps.
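If you want to check this mechanically rather than by eye, a rough Python sketch: count the code lines before and after a "simplification" and reject the change if the count did not go down. What counts as a code line here (non-blank, not a whole-line `#` comment) is an assumption suited to Python-like sources.

```python
from typing import Tuple


def code_lines(source: str) -> int:
    """Count lines that are neither blank nor whole-line '#' comments."""
    return sum(
        1 for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def check_simplification(before: str, after: str) -> Tuple[bool, int]:
    """Return (accepted, net line delta); accepted only if the code got shorter."""
    delta = code_lines(after) - code_lines(before)
    return delta < 0, delta
```

Feeding the delta back to the agent ("you added 50 net lines, try again") gives it a number to hit instead of a vague request to simplify.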

Comment by thempatel 5 days ago

I think the problem is that agents are inherently stochastic. Their idea of simplification changes from message to message because whatever objective it’s operating on internally is inherently opaque and changes. No matter how much you prompt it, eventually it’s going to not do what you want it to do.

I built https://github.com/thempatel/mdlr for precisely this reason: externalize the objective and force the agent to meet it.

Comment by rirze 4 days ago

Interesting, I'll be testing your tool on my repos. You should publish to crates.io!

Comment by thempatel 4 days ago

Thank you so much! I am open to any and all feedback. Please file an issue or discussion if you have things you'd like to share.

Getting this onto crates.io is a great suggestion, I will look into that!

Comment by pllbnk 5 days ago

I have been wondering whether Anthropic are just gaslighting everyone with new model releases while in reality it's just the same base model with some internal knobs tuned more and more up with every new release to provide longer and longer thinking threads and outputs.

My speculative assumption is that these long thinking threads and self-checking tend to produce somewhat better output at the price of huge price increases due to the token burn.

Comment by adwf 5 days ago

I imagine it's the same foundation model on the 4 series, with Fable 5/Mythos being a new or upgraded foundation model. Then the point releases are fine-tuning plus post-training alignment with desired outcomes. The "thinking" can involve multiple steps, eg. asking the model first what it thinks the user wants to do, why it wants to do it, rewriting the prompt to generate better outcomes, how it should do it, come up with a plan, etc. So when they announce each point release like Opus 4.8, they're probably adding new layers of thinking to try and get good results on benchmarks. And that of course has cost and speed implications.

Then Sonnet/Haiku are just attempts to quantise/distil down to an acceptable performance/cost ratio. The cynic in me says we probably won't see any more of those until post-IPO, keep people addicted to the most costly models to pump a quarter or two of revenue figures, unless a competitor starts seriously undercutting them on price/performance. Hence the recent requests to slow down model training worldwide with their competitors.

Of course it could be that Fable "5" is just a marketing bump to the version, not a new foundation model...

Comment by ValentineC 5 days ago

> Then Sonnet/Haiku are just attempts to quantise/distil down to an acceptable performance/cost ratio. The cynic in me says we probably won't see any more of those until post-IPO, keep people addicted to the most costly models to pump a quarter or two of revenue figures, unless a competitor starts seriously undercutting them on price/performance. Hence the recent requests to slow down model training worldwide with their competitors.

I'm guessing there'll be a Sonnet/Haiku 5 release just around IPO, to keep the news cycle going, and so that user numbers will get a boost.

Comment by 2ffass 5 days ago

I'm pretty sure Anthropic have hired people with Industrial Organisation background and so have OAI.

If you read a decent text and look at the actions both firms have taken you'll quickly see it's literally textbook.

Comment by ninininino 4 days ago

Can you expand a bit for people unfamiliar with Industrial Organisation planning?

Comment by m101 5 days ago

I've been making an auction site and have been using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms etc. I was mostly using GPT 5.5 xhigh to code up the scenario, and looping over it to check with opus 4.8.

Out of curiosity I asked Fable to review it all and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through, for example:

- all intermediaries were given the prices of all buyers up front

- private price information in certain auction types was actually being broadcast to everyone

- multiple contradictions in instructions

If it was any one of these things then I might have understood - but the fact that so many got past both Opus and GPT 5.5 makes me think that Fable has something special. This is a common sense type thing, that I think you only get to notice when your task doesn't involve a measurable metric, but rather some sort of real world fuzzy task.

There's clearly a problem with all these measures of performance when the difference between these models was night and day in my specific task.

Comment by rob 5 days ago

Unless you're coming up with a deterministic set of criteria for evaluating these bugs and issues, every single model is going to keep telling you it finds new things and to fix them.

I'm sure you said the same "find mistakes please" thing to Opus 4.8 and GPT 5.5 when you were using $previous_amazing_latest_model, and they also found and fixed them.

Once the next "Fable"-type model comes out I'm sure it's going to find even more mistakes that the "special" Fable made.

You're using these models to make mistakes and then using upgraded versions of them to find their previous mistakes and fix them, until a new version comes along that can magically fix even more mistakes their previous versions made. There's no end to it.

Comment by m101 5 days ago

Yes - I was thinking this - however I had already worked on it so many times with opus and gpt that I thought they had enough time to realise some common sense things that fable just got and understood first time, on the first pass. The difference seemed significant enough to comment about.

Comment by throwwwll 5 days ago

Maybe you are something special by letting those slip through in the first place?..

Comment by m101 5 days ago

The point is that there's a difference in these models and everyone is looking for where the differences are. Stop being an arse.

Comment by OsrsNeedsf2P 5 days ago

GP literally caught them?

Comment by cadamsdotcom 5 days ago

Prompt: can you reformat your sentence to be less unkind?

Comment by inglor_cz 5 days ago

This conversation is about capabilities of Fable 5 vs. older models, not about the GP's abilities.

Comment by port3000 5 days ago

It's just much more thorough and spins up a lot of subagents to basically do a lot more E2E testing. Not necessarily smarter, imo you could get the same result with a lesser model by procedurally prompting, but a lot more compute and orchestration.

Comment by m101 5 days ago

I had to specifically tell Fable not to use a bunch of subagents in order to preserve my token allowance.

Comment by kristofferR 5 days ago

This seems like the exact project you should try out Codex Security for. It catches a lot of stuff:

https://chatgpt.com/codex/cloud/security/

Comment by pleasstopnw 5 days ago

[dead]

Comment by TacticalCoder 5 days ago

> ... and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through

Wait... Are you telling me models everybody told me were better than coders up to just one month ago are actually making lots of mistakes?

This is shocking.

Comment by afro88 5 days ago

Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small mergeable PR (according to my team). 20 tasks of varying difficulty, with 5 attempts each, LLM as judge to evaluate accuracy (same outcome and quality but allowing for acceptable variances).

Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, GPT-5.5.

Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.

Comment by m-dot-reviews 5 days ago

I'm starting a repository of LLM reviews [1] with the goal of creating a catalog that is more task-oriented and less marketing-y than corporate blogs or benchmark leaderboards. You seem to have a lot of experience across a bunch of different models: if you have a chance and feel like sharing, you'd be one of the first.

[1] - https://model.reviews/ - all the user-submitted content is CC licensed and will be available for download in periodic dumps.

Comment by munksbeer 4 days ago

Does your team then manually decide the results by going over the PRs? I suppose you know what you're looking for now, but isn't this still quite painful?

Comment by afro88 4 days ago

We selected PRs (real ones we merged over the 6 months prior) and have an "LLM as judge" score how close the AI generated code is to the PR. Same as how other benchmarks do it, but it's with tasks we actually do and code we have decided is actually up to scratch for us
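Roughly, the scoring loop has this shape. This is a Python sketch only: `generate` and `judge` are hypothetical callables standing in for the agent under test and the LLM judge, and averaging attempt scores per task is an assumption about how the numbers get combined.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str         # what the agent is asked to do
    reference_pr: str   # the real PR that was merged for this task


def run_benchmark(tasks: List[Task],
                  generate: Callable[[str], str],
                  judge: Callable[[str, str], float],
                  attempts: int = 5) -> Dict[str, float]:
    """Score each task as the mean judge score over several independent attempts."""
    per_task: Dict[str, float] = {}
    for task in tasks:
        scores = [judge(task.reference_pr, generate(task.prompt))
                  for _ in range(attempts)]
        per_task[task.task_id] = mean(scores)
    return per_task


def overall(per_task: Dict[str, float]) -> float:
    """Unweighted mean across tasks, so each task counts equally."""
    return mean(list(per_task.values()))
```

Multiple attempts per task matter because the agent is stochastic; a single run per task would make the ranking between close models mostly noise.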

Comment by Scene_Cast2 5 days ago

I'm personally heavily testing LLMs on electrical engineering problems. I'm finding that it's not meaningfully better at figuring out what's up than the other models.

To give you an idea - here's a very abridged summary of one sample question (originally a full paragraph): I have a voltage divider with a precision resistor and a thermistor, my voltage reading is off by 17%, where's that coming from. None of the models I tested (including Opus 4.8 and Fable 5) could figure it out.

Comment by threatripper 5 days ago

Did you also test GPT-5.5 Pro web version?

Why is the voltage reading 17% off?

Comment by Scene_Cast2 5 days ago

On my (admittedly weird) setup, GPT-5.5 Pro times out.

The reading is off because the thermistor resistance also depends on applied voltage, not just temperature. LLMs couldn't get this even after feeding them multimeter voltage readings, not just ADC readings. They went into guessing much more esoteric things like ADC switched-capacitor input current, burnout-detect current sources or IDACs left enabled, board leakage, leaky cap, etc.
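To make the size of the effect concrete, here is the ideal divider arithmetic in Python. The supply voltage, resistor values, thermistor-on-the-bottom layout, and the reading being 17% low rather than high are all assumptions for illustration; the actual circuit values were not given.

```python
def divider_voltage(v_in: float, r_top: float, r_bottom: float) -> float:
    """Ideal divider output, measured across the bottom resistor."""
    return v_in * r_bottom / (r_top + r_bottom)


def implied_bottom_resistance(v_in: float, r_top: float, v_out: float) -> float:
    """Invert the divider equation: what bottom resistance gives this output?"""
    return r_top * v_out / (v_in - v_out)


# Hypothetical values, not taken from the original setup.
V_IN = 3.3                       # supply voltage in volts
R_REF = 10_000.0                 # precision resistor (top), in ohms
R_THERMISTOR_NOMINAL = 10_000.0  # thermistor at the known temperature, in ohms

expected = divider_voltage(V_IN, R_REF, R_THERMISTOR_NOMINAL)
measured = expected * (1 - 0.17)  # assume the reading is 17% low
effective = implied_bottom_resistance(V_IN, R_REF, measured)

print(f"expected {expected:.3f} V, measured {measured:.3f} V")
print(f"thermistor would have to be about {effective:.0f} ohm")
```

Under these assumptions the thermistor has to sit far from its nominal value to explain the reading, which is why a resistance that shifts with applied voltage is a much bigger suspect than leakage or ADC input currents.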

Comment by saurik 3 days ago

This is the kind of problem I expect Claude to be useless at, and while I could see Gemini Deep Think making a good showing, I'd only bother with ChatGPT Pro. FWIW, I do believe it got the correct answer as one of its first two suggestions (though I am not an electrical engineer, so maybe I am not understanding this given the vague/summarized prompt).

https://chatgpt.com/share/6a2d8c75-56f4-83e8-a61a-301e4c62b1...

Comment by practal 5 days ago

I am quite impressed with Fable 5. I used the £18 subscription, and asked it to convert the document processing of Practal Zero [1] from running in the same thread as the UI to a worker thread. Just two days before I gave the same task to Codex, and the result was not really nice: it would copy the entire document to the worker thread as a snapshot for processing, and so on. Fable instead realised that it could make use of the fact that I have a self-made custom database based on operational transform running (that's why document loading is so slow :-)), and made the document processing to be just another client of that database. It discovered even a bug in how I sync between the "livemodel" (in-memory replica of database state) and ProseMirror's model. That sync made problems before, and I had written a spec up for that, convinced that my "fourth attempt" at it would be correct. Fable found a last bug in the spec, corrected it via a "fifth attempt", and fixed the corresponding code.

The reported API costs for all of that would have been $180 though, which I cannot afford when the Fable promo ends on June 22nd. I am also a happy user of £89 Codex, it is really reliable and works very well, but Fable seems to be just noticeably smarter.

[1] https://zero.practal.com

Comment by Madmallard 5 days ago

Umm? I'm getting usage capped on single prompts of Fable 5 with the $20 subscription.

Comment by practal 5 days ago

I used it yesterday afternoon-night and this morning-afternoon, UK time, over a period of a few 5-hour windows. I didn't count the prompts, wall time was 1d6h, API time was 2h10m.

Comment by huqedato 5 days ago

Strange though... I spent my window after a couple of prompts and effective API time of 13m. Out for 4 hours and a half (why that?). The next day, today, I've tried to repeat the experience - even worse: one prompt for less than 10mins... and then suspended for 8 hours and a half. WTF?

Comment by practal 5 days ago

How did you get suspended for 8 hours, given a 5-hour window? Maybe you are prompting it wrong [1].

[1] https://www.wired.com/2010/06/iphone-4-holding-it-wrong/

Comment by andai 5 days ago

> Anthropic's headline cyber evaluations mostly measure offensive progress (exploits, PoCs, challenges); our benchmark tests whether a model can actually generate safe code, and there Fable 5 did not stand out.

The model isn't allowed to think about security. I heard several people here mention that if it starts thinking about security -- e.g. writing tests related to it -- the safety filter flags it and downgrades to Opus.

So it's actually not allowed to make your code secure.

Comment by matheusmoreira 5 days ago

Yeah. Fable apparently found bugs in my C code but Anthropic wouldn't allow it to test them, fix them or even tell me what the problem was. The memory safety parts of my Fable code review were 50% Opus. Even the coordinator Fable that just launched the code review agents got downgraded to Opus for some reason.

Model is definitely better than Opus but Anthropic's delivering a pretty terrible experience.

Comment by samuelknight 4 days ago

A reviewer can only test the model they have access to. They should not speculate about what the model could have done without provider tampering. I think Anthropic's mistake here was not calling it Fable 5 Preview, because now people can write headlines about how Fable 5 is worse than Opus.

Comment by latentsea 5 days ago

> So it's actually not allowed to make your code secure.

Anything designed to prevent a problem will eventually cause one.

Comment by sho 4 days ago

An enduring, confounding quality of LLMs is that even minor differences in prompting content and style, harness type and environment can lead to radical differences in the output and perceived performance and ability. In my environment and in my "style", Fable has been a huge step up, to the extent that I am seriously considering paying for a second $200/m account just to get more usage out of the next 10 days. I'm also starting to prepare my organization for what I now see as the completely inevitable end of human-written code.

All that said, considering Anthropic's heavy-handed nerfing I'm not surprised Fable did poorly in a security-focussed benchmark. And this benchmark seems poor anyway - penalising a model for "cheating" by knowing the answer from its training data? That's not the model's fault, that's a lazy benchmark.

Comment by petee 5 days ago

> Contrary to some community reports, we saw zero safety refusals.

And now there always will be some doubt as to whether your model was silently downgraded, no? I guess acknowledgement could be used as a signal?

Comment by JofArnold 5 days ago

I've found it outstanding at isolated long running tasks (e.g. it completed one of our tests in 3 hours with a 100% accuracy score versus 5.5 xhigh's 10 hours and 90% accuracy). For short tasks it seems very Claude'y (hard to express exactly what I mean by that), which I'm not a fan of, meaning I'll stick with Codex for that use case and maybe Fable for those times I can for sure benefit from it.

Comment by SubiculumCode 5 days ago

Fishy to me: They report 0 refusals on security tasks, yet I can't even get it to code a task involving choosing the best mixed model, extracting BLUPs and propagating uncertainties.

Comment by TheCapeGreek 5 days ago

I'll mirror some other anecdata here: Not finding Fable to be amazingly godlike at actual coding, but it does seem better at planning, architectural thinking, and reviewing code. Used it to think through some longer form refactors that involve some product decisions and changes, and found it to provide more thoughtful feedback. However that's just my subjective experience, and I don't think it's provably that much better to make me want to go pay for API pricing when the free trial is over.

My plan is to make hay while the sun shines: get some planning in over the next week or so, and just let Opus take care of it when I get to actual implementation.

Comment by ulrikrasmussen 5 days ago

My experience as well. I quickly stopped trusting Opus to build foundational abstractions because it would almost never do them well and instead would end up chasing into rabbit holes and building overly complex and ugly solutions.

I think Fable is an entirely different experience. It has much better taste, and is better at balancing features versus complexity to a point where I currently trust it to make novel design changes. I still verify it of course, but with Opus I would throw away the solution most of the time while Fable mostly gets it right.

Comment by TheCapeGreek 5 days ago

Personally I think the verdict is still out on if Opus & co are actually worse, or the rate at which we move with these tools now is faster than we're used to for managing tech debt and compounding complexity with rapidly built software.

If nothing else, using the smart model for planning to hand off to the previous gen for implementation still seems like a useful pattern.

Comment by wewtyflakes 5 days ago

I have found Fable is good for doing code failure diagnoses but lackluster at its corresponding remediation. Have been going back and forth with it all this morning about its half-thought-out point-solutions.

Comment by wewtyflakes 5 days ago

Update: Things trended significantly worse for me over the day, to the point where I no longer trust the code being generated; I ended up reverting to Opus.

Comment by FergusArgyll 5 days ago

> A closer look at the cheating

> Training recall (33 cases). The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it. The tell-tale signs are artifacts that cannot be derived from the workspace:

That's very misleading! That's not cheating: you gave it a test to which it knows the answers, so what's it supposed to do? And because of the "cheating" they call it average. Flag

Comment by asadotzler 5 days ago

"My third grade class all got perfect scores on the standardized test. Yes, I did have them each copy my correct answers, but I don't volunteer that information because it's much better for me if people believe I'm a great teacher."

"But that's cheating!"

"No it's not. What were the kids supposed to do when I gave them all the answers? Not use them?"

Comment by retsibsi 5 days ago

Is your primary goal to punish/incentivise the teacher, or to accurately determine how smart the children are?

If the latter, you would ignore the 'cheated' answers and judge them on everything else; you wouldn't mark the 'cheated' answers as incorrect.
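
A rough sketch of the difference between the two scoring policies, with entirely made-up results (the `Result` fields and the counts are hypothetical, and it assumes every flagged answer was also a passing one):

```python
from dataclasses import dataclass

@dataclass
class Result:
    passed: bool
    flagged: bool  # e.g. suspected training recall

def rate_flagged_as_fail(results: list[Result]) -> float:
    """Pass rate when every flagged answer is counted as a failure."""
    return sum(r.passed and not r.flagged for r in results) / len(results)

def rate_flagged_excluded(results: list[Result]) -> float:
    """Pass rate over unflagged tasks only; flagged tasks are ignored."""
    clean = [r for r in results if not r.flagged]
    return sum(r.passed for r in clean) / len(clean)

# Hypothetical run: 6 clean passes, 2 clean failures, 2 flagged passes.
results = (
    [Result(passed=True, flagged=False)] * 6
    + [Result(passed=False, flagged=False)] * 2
    + [Result(passed=True, flagged=True)] * 2
)

print(rate_flagged_as_fail(results))   # 0.6
print(rate_flagged_excluded(results))  # 0.75
```

Same raw answers, two different headline numbers, depending only on how the flagged cases are treated.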

Comment by ewok94301 5 days ago

Actually it's average because it's 5th on the leaderboard. GPT 5.5 and Opus 4.8 outperformed it. At 5-8x the cost, you would expect better! https://www.endorlabs.com/research/ai-code-security-benchmar...

Comment by FergusArgyll 5 days ago

As TFA says

> Two findings may help explain these average results.

> Timeouts

> Highest observed cheating

That's why it's 5th on the leaderboard - they give it a fail for every timeout and for every time it gives the correct answer because it knows it.

That's insane

Comment by vitally3643 5 days ago

I actually had a really impressive session with Fable last night, probably the most impressive agentic AI experience in a while.

I gave it a KiCad schematic of a tube-based oscilloscope from the 60s which I'm restoring. I had it give me a breakdown and priority list of components to replace, balancing safety/functionality vs preserving the originals. Then we went on a super deep dive where it explained in great detail how the circuit works and what the tubes are doing.

It isn't so impressive that it could explain vacuum tube physics and circuit theory, but it was pretty impressive that it could consume four pages of KiCad schematic and reconstruct the full topology and theory of operation with no additional information. I was able to ask it questions about what a particular tube or group of components did, or how this system interacts with that one, or what the risks and benefits of this design choice or upgrade might be. Very fluid, and its answers were actually really smart.

I have, however, found Fable to be far less impressive on coding tasks.

Comment by le-mark 5 days ago

> I was able to ask it questions about what a particular tube or group of components did,

Was it correct or hallucinating? Do you have the knowledge to tell the difference? I’ve been burned too many times to take what they say as the truth without checking; especially in a subject I’m not an expert in.

Comment by fuddle 5 days ago

Yet it's ranked #1 on https://cursor.com/cursorbench

Comment by ValentineC 5 days ago

I'm surprised to no longer see Opus 4.6 on Cursorbench. I think there is a subset of Claude fans that are still adamant that Opus 4.6 is the best version.

Comment by svdr 5 days ago

Composer 2.5 stands out here at nr. 9. This model is fast and clever.

Comment by johnnyApplePRNG 5 days ago

Yea honestly... the only truths I care about in AI LLM aided development right now are that Claude is a much better planner, and Codex is a much more professional coder.

You can mask a surprising amount of terrible coding with proper design planning.

If it works, who cares, right? That's been the status quo for software development for about as long as I can remember, unfortunately.

I used to get frustrated with Codex. I felt as though it wasn't able to see far enough ahead into the future and just intuit what I expected (which is how Claude leaves you feeling).

And then I realized a lot of those intuitions Claude was having were great, and the project progressed, but sometimes to a point that Claude himself was unable to take back control of it... because some of the on the spot decisions it was making were great quick-thinking... but unfortunately, they were only that a lot of the time. Which was the most frustrating of all.

If you specifically ask Claude to plan out and refine a long term project's roadmap though and stick to it, it could probably write an operating system overnight (that kind of worked).

Comment by artdigital 5 days ago

Also spent the past day using Fable for everything I usually use Opus or gpt-5.5 for. My experience is that it’s a better and more reliable Opus that’s far better in frontend tasks than backend/ios. More similar to gpt-5.5 for long running tasks and reliability.

It still left small bugs and weird behaviors that it cleaned up when I told it about them, but it felt very Opus-ey.

I think for implementing a detailed design doc, I’d put it on par with gpt-5.5 high but farrrr more expensive. I’m eating through my x5 Max plan in no time. I’d use it for reviewing implementations and designs docs as another pass, but it’s too expensive for me for reading a lot of (uncached) code by itself in an agentic loop, especially with medium to high reasoning.

As a daily driver too expensive, that crown still goes to gpt-5.5.

I barely used it in high/xhigh/max reasoning though.

Comment by senko 5 days ago

The post mainly talks about coding from security point of view. Fair enough.

In my own (limited) testing so far, Fable is the most capable model (for coding in general), and the most expensive.

It pretty much saturated my "LLMCraft" benchmark to implement a mini RTS: https://senko.net/vibecode-bench/2026/rts-fable-5.html (prompt and results for other models here: https://senko.net/vibecode-bench/ )

That said, combined with workflows and high thinking effort, burns through tokens (and money) at an alarming rate.

It may be too good (and too expensive) for most tasks - using it alongside cheaper models for grunt work is probably the winning strategy.

Comment by PeterStuer 4 days ago

"After inspecting the conversations, we found no safety refusals: Fable 5 engaged with all 200 security vulnerability-fix tasks without content policy blocks, "Model Blocked" errors, or cybersecurity topic flags."

WTF! I run into fallback to Opus 4.8 all the time, and I am not even doing "security Research", just normal development and debugging.

My experiences with Fable thus far have been far from 'mid-tier'. While some model releases are incremental, Fable is the same qualitative change that Opus 4.6 was compared to its predecessors. It fundamentally impacts how I work with the model. (Note: I only (well, 99%) do back-end in Python)

Comment by thepasch 5 days ago

Am I crazy to be extremely suspicious about the fact that this heavily security-focused task suite didn't trigger a single of the infamously hilariously overparanoid guardrails? This, along with the fact that the model "cheated" by scouring the git history for an upstream fix and implemented byte-perfect replications of existing fixes without prior exploration makes me wonder whether both the model itself and the security classifiers are tuned to act very differently when they detect that the model is being benchmarked. I can think of few to no other plausible explanations for this sort of behavior.

May be a bit tin-foil, but...

Comment by corroclaro 4 days ago

Fable 5 has, for me, been surprisingly effective at generalizing and removing duplicate solutions/approaches - things I'd been noting along as my mental TODO once Opus finished generating them, Fable also saw them and prepared a plan to correct them.

Worked pretty well. Also writes lisp a lot better without getting lost in parentheses! I do keep hitting the limit regularly but it is doing a lot of work that would have taken me a long time to write by hand even if per se not super complex.

Comment by cbeach 5 days ago

This demonstration is the clearest I've seen so far, showing the gulf between Opus and Fable for app creation:

https://www.youtube.com/watch?v=TzJCly4YgDQ

The Age of Empires clone (and the difference in graphics quality/creativity between Opus and Fable) is at the end of the video and I was blown away.

Notice how this guy prompts the models. Very detailed, with technical requirements and steering. He's going for a one-shot build and he nailed it.

Comment by port3000 5 days ago

Fable feels like a slightly more advanced 4.5/4.6 (less verbose than 4.7 and 4.8) with more adversarial work checking. And a lot more compute to be more thorough from the first prompt. I feel it would be possible to get pretty much the same results with 4.6 with enough back and forth iterations. It kind of makes sense to me that this is the 'magic' behind Mythos and its cyber capabilities too. Just a massive iterative loop and really going into a lot more detail on edge cases.

Comment by crimsonnoodle58 5 days ago

I found Fable codes very poorly and ended up switching back to Opus.

In one example I switched to Fable in an existing Opus chat, so it had access to the context from Opus which wrote a data importer earlier. I asked it to fix a couple of bugs, and instead of putting the fixes where they should be where the data is imported, it wrote patch functions that did bulk updates at the end of the import.

Fable feels more like a hacker than a coder. Maybe it's the way they designed it for security testing that's changed its rationale?

Comment by tonyrice 4 days ago

Yesterday, I gave Claude Fable 5 a very simple task. The task was to create a few components and embed them onto another page. It ended up completely missing the mark and embedding them on the wrong page. I also noticed that it burned through an excessive number of tokens to complete a simple task. I ended up switching back to Opus 4.8.

Comment by CyanLite2 5 days ago

Codex GPT-5.5 Xtra high is as good as Fable.

Not sure if that's because of the harness, but the results are as good, and it's half the price.

Comment by brookst 5 days ago

I’m finding Fable dramatically better for auditing PRs and large features. In a side by side with the same prompt I’ve been happily using on Opus, Opus found one major and one minor issue, Fable found two major and four minor (a superset of Opus).

I’ve taken to using Fable to plan arch, specs, build plan, and then to be the final QA. Opus for the actual build.

Comment by robeym 5 days ago

Fable has been Anthropic's most ambitious and hopeful release. It makes me think Mythos isn't anything but Opus with certain guardrails removed. Very interesting. Hoping we'll see some quick refinements to it

Comment by aoeusnth1 5 days ago

If it's memorized your benchmark then your benchmark is bad, it's not cheating

Comment by m1rsh0 5 days ago

It happens to me too. I don't think it's worth it, especially for the token usage.

Comment by brap 5 days ago

I gave it a task of scanning 6 markdown files and finding issues in the prose (contradictions etc). It ran for over 2h, exhausted my max plan session limit and crashed. I did not get any issues back.

Comment by threethirtytwo 5 days ago

We should compare it with a human on the same coding tasks. Same amount of time and the agent will of course finish earlier but with the extra time it double checks and reviews its own code.

Comment by bojangleslover 5 days ago

I have no idea how people are burning $2k. I pay $100/mo and it's built an absolute crap ton of stuff for me. And my co-founder uses it 24/7 as well. Maybe we spend too much time actually reading the code (risk or benefit? you decide). Or maybe I'm in the "massively subsidized" camp and the investors are about to go for our jugular. But $2k for a single project is several orders of magnitude more than I am currently paying.

Comment by matthew-craig 5 days ago

You're in the massively subsidized camp. They're going to move Fable off of the subscription tiers to API-only. $10 per million tokens in and $50 per million tokens out will get expensive quickly, considering it burns through thousands of tokens thinking itself in circles with no way to follow along.
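
As a back-of-the-envelope sketch of how that adds up (reading the $50 figure as the output price, which is an assumption, and using a made-up session size):

```python
# Prices as quoted in this comment; treating $50 as the output price is an assumption.
PRICE_IN_PER_MILLION = 10.0
PRICE_OUT_PER_MILLION = 50.0

def session_cost_usd(tokens_in: int, tokens_out: int) -> float:
    """API cost in USD for a given number of input and output tokens."""
    return (
        tokens_in * PRICE_IN_PER_MILLION / 1_000_000
        + tokens_out * PRICE_OUT_PER_MILLION / 1_000_000
    )

# Hypothetical agentic session: 2M tokens read, 200k tokens generated.
cost = session_cost_usd(2_000_000, 200_000)
print(f"${cost:.2f}")  # $30.00
```

A handful of sessions like that per day would pass a $100/mo subscription price within a week.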

Comment by shrx 5 days ago

> They're going to move Fable off of the subscription tiers to API-only.

Is this official? When?

Comment by blurbleblurble 5 days ago

June 22

Comment by duskdozer 4 days ago

How far does a million tokens go?

Comment by Hamuko 5 days ago

Isn't this just a "I'm paying for a sub vs. others are paying for tokens" situation?

Comment by swader999 5 days ago

Ultracode

Comment by Topology1 5 days ago

How do they know when the model is recalling training data vs reasoning?

Comment by hathym 5 days ago

I was not impressed by fable 5, still prefer sonnet 4.6 for most tasks

Comment by dbingham 5 days ago

This tracks. In spite of the hype it seems pretty clear the model gains are now in a very strong logarithmic fall off. The curve is flattening and flattening fast.

And we're still not to a point where you can fully delegate coding tasks to a model like you would a human. I'm just using Claude for code review so far and while it's definitely valuable as a reviewer and catching real issues, it's still making pretty critical mistakes. Mistakes a junior might make, but a mid probably wouldn't.

Which makes me feel like I can't fully delegate to it. Whenever I try, I end up spending more time reviewing (and rewriting) its code and testing it than I would have spent writing the code myself and asking Claude to review it.

Given that we're starting to see the real costs of AI, and that the economics of it do not actually work, and those costs are still increasing substantially (the cost increase of Fable over Opus is no joke), this makes me feel all the more that we're headed for a bubble pop.

Comment by HlessClaudesman 5 days ago

I set Fable onto a couple of intermittent bugs in my React Native app that Opus had failed to solve. It came up with novel approaches for both that squashed the bugs further up the pipeline, killing baby Hitler before he could become a problem. Then Fable came up with 3 more edge case bugs, and 4 code cleanups.

This matches my experience with other model quality leaps: its greater understanding gives it more bug blasting firepower.

Perhaps setting a new model off on 2-4 hour tasks and expecting perfect results just isn't a great test. Chunking the problem is always a better test of abilities.

Comment by i2km 5 days ago

My theory is that Anthropic have hit the beginnings of model collapse and the whole "fable may silently downgrade with deliberately incorrect results" is a diabolical attempt to gaslight and get ahead of the curve.

So when it fails, people will chalk it up to "oh. Must have been silently downgraded because it thought I was doing something tricky enough to count as a distillation attack. My bad. Lemme try again..."

Comment by pbgcp2026 5 days ago

+1. And there had been a long standing description for a product like this: not fit for purpose.

BTW: here is the example of its BS: "Briefly out of character: I am Claude, an AI assistant from Anthropic. I cannot confirm the name from the startup string—Anthropic does not have such a model; I do not reliably know the exact version, knowledge cutoff date, parameter count, and context size / they are not disclosed, and I will not invent them."

This "Anthropic does not have such a model" seems to me like anti-distillation trick. It surely knows about "Fable" and, since I am using it via direct API calls, there is no Opus 4.8 downgrade. Any other model does answer the "identity questions".

(Probably Fable is too shy to announce: "我是通义千问，是由阿里云开发的超大规模语言模型。" (Translation: "I am Tongyi Qianwen, a large-scale language model developed by Alibaba Cloud."))

Comment by HDThoreaun 5 days ago

How in the world did they not hit the guardrails a single time while doing this while I can barely get it to do anything before the guardrails show up?

Comment by anon373839 5 days ago

Like Volkswagen Dieselgate, perhaps it is configured to behave differently when being benchmarked?

Comment by SubiculumCode 5 days ago

idk, maybe they tested Opus and didn't realize it. I can't even get it to evaluate some code doing some mixed modeling work. It's strange to me.

Comment by kobe_bryant 5 days ago

but it's a mythos class system!!!!

Comment by oliver236 5 days ago

these are just openai plants

Comment by zulrah 5 days ago

for me it's the most disappointing model release ever. It takes a very long time, runs a bunch of random commands and burns through so many tokens even for simple tasks

Comment by 827a 5 days ago

> Highest observed cheating: We also observed cheating signals on 38 instances, dominated by memorization with 33 cases. This is the highest volume of confirmed cheating we have recorded for any model since we hardened the prompt against cheating

People need to wake up to how dangerous and irresponsible Anthropic is. If your goal is to build a human in a box, you get a super-intelligent misaligned system because humans are misaligned. But clearly this isn't a terminal guarantee during LLM development, because seemingly no one else manages to build systems so deeply misaligned as Anthropic's! You can just build these things like the tools they are, and then out the other end emerges a tool that pretty much just does what you tell it to do.

Comment by solenoid0937 5 days ago

Their definition of "cheating" has nothing to do with the model being misaligned, it's a symptom of their benchmark sucking.