Cybersecurity researchers aren't happy about the guardrails on Anthropic's Fable
Posted by speckx 6 days ago
Comments
Comment by saidnooneever 6 days ago
These AI places have 0 clue about how threat actors actually work. None of their mitigations or guard-rails is effective, and now they are even turned against them.
Additionally, if they don't all implement the same level of effective guard-rails, there will always be some model you can abuse to do the work anyway, and hence there is 0 effect on threat actors. They will just run some local model that delivers 5% less quality, which does not matter to them 1 bit.
Comment by brookst 6 days ago
From where I sit it seems reasonable for Anthropic to not want their product used to create malware, even if they can’t solve the entire problem globally for every model. What’s wrong with that position? What should they do differently?
Comment by saidnooneever 5 days ago
its not about creating malware. this is already trivial and fully automated. its about finding exploits (which can be used to deploy malware), which is something both attackers and defenders benefit from.
threat actors will find them anyway, LLM or not. They only need 1 so its much less work for them.
defenders, they need to find them all. So for defenders, these models are more valuable than for attackers.
restricting certain models will not reduce the availability of these tools for attackers, but defenders are limited because running local models is harder in an enterprise setting, with heaps of events and products etc. to run through them. they need many GPUs, where the attacker can run a local model on 1 GPU and get the desired effects.
Hence, if they release the capability the world will adjust to it and be able to mitigate effects, collectively. Now, companies are left in the dark while attackers have effective tooling.
Besides this there are also things like people now including strings with recipes for meth or sarin gas (malwareTech info). the new variant of shai hulud does this. That stops LLM scanners and can even get their users banned from LLM services.
There is a reason why cybersecurity researchers write papers about attack techniques and new exploits.
Its not to put them out there for people to abuse, but its there for the collective cybersecurity bunch to all have access to information that can help them solve the problems.
I know this is not a clear answer to your question, but hopefully it provides some context to think about and decide for yourself further. At the end of the day it's also part opinion whether you find it good or bad. Likely there are good arguments for and against it.
I am for putting information and tools out there so other smart folks can find solutions. Others are for restricting and wishful thinking (my opinion) that attackers won't find something.
Comment by conception 5 days ago
Comment by andy_ppp 5 days ago
Comment by worthless-trash 5 days ago
Its like a set of glasses that intentionally obscures the battlefield.
Comment by saidnooneever 4 days ago
The defender industry is really far removed from seeing all exploits land on their targets all the time. Some actors can get a long life out of an RCE that gets them privileged context, or a strong LPE. It's really hard to find out what someone did to get on a box if they attained root or system access and wiped their trail...
It is an assumption that attackers need buckets of 0days to do their work. They might be somewhat saddened if a good sploit gets patched but they will have a few more laying around... it is unlikely they will have 10s or even 100s available and ready, simply because it costs a lot and isn't needed.
Comment by SkyBelow 5 days ago
The argument is more "I want to do good thing X, but it will also cause bad thing Y." followed by "Wait, bad thing Y is going to happen anyways, so I might as well do good thing X so we get both X and Y instead of just X."
Viewed this way, the idea is that given the world will have bad thing Y regardless, the one impact of your choice is if good thing X exists or not, and it is better to create good thing X.
Where it becomes an issue is that there is no clear X or Y. There are many different but closely related bad things: the one you would add may be better or worse than what is already out there, or it may exist either way but you make it more popular. These are very subjective things to judge, so different people look at the same outcome and some agree that bad thing Y would have existed anyway, while others say that no, this is a new bad thing Z that wouldn't have existed otherwise.
>From where I sit it seems reasonable for Anthropic to not want their product used to create malware
Yes, I think there is a PR component to this that is often left out of these discussions.
Comment by unglaublich 5 days ago
The bad guys work around it, and the rest is now in a vulnerable position.
Anthropic plays security theater by blocking their LLMs from working on security.
The bad guys work around it, and those that want to make their software robust against them are in a vulnerable position.
Comment by jerf 5 days ago
You are mentally approaching this as if you have an oracle that can be consulted to say whether or not something is bad behavior. So of course, if this oracle exists and can be consulted and it says the behavior is bad, why would anyone argue with the idea that we should stop bad behavior?
This argument is valid [1], in that given the premises the argument is correct. The problem is, once you draw out the fact that the argument is depending on the existence of an oracle that does not exist, that premise of the argument is invalid.
Two people can sit down in front of an AI right now, with the exact same code base, and type in a prompt to the AI "Analyze this code base for security holes and try to build exploits against them." One person's use is completely valid, another person's use is completely harmful, and the information necessary to distinguish those two use cases is not available to the AI. I phrase it that way carefully, it isn't that "the AI isn't smart enough", the problem is that the information is simply unavailable. Intelligence doesn't factor in at that point.
Therefore, the only way that Anthropic has to deal with this at scale is simply to block the query entirely. Which means that I, the valid user who is trying to establish whether my code base has security issues and whether I can prove they are exploitable, can not. I am checking for exploitability because while I would like to fix all security issues, issues that are provably exploitable are of a higher priority than smelly code that doesn't seem to be exploitable, which is a perfectly valid thing for me to want to do.
If I can't use legitimate tools to secure my code, but the bad guys can use unrestricted tools to attack my code, now this is a great deal more complicated than "Who can argue with stopping the bad stuff?", which is the main point I want to make here. I'm not going into a huge analysis of that problem, merely pointing out that it is a problem and that this isn't just about "stopping the bad stuff". There are additional complications beyond that, like, even if Anthropic could determine the "bad stuff" and stop just that in their LLM, LLMs in general don't have infinitely precise surgical "stop doing this thing" options and any such instruction to stop doing a thing always degrades the LLM across the board in various ways.
Anthropic has no access to the Platonic ideal of "stop malware", if such a thing even hypothetically exists. When analyzing the real effects their real actions will take, what their intentions were for those actions aren't really relevant. It is clear that they are making their model a great deal less useful for me, a legitimate user, and I and others like me are perfectly justified in disagreeing with their analysis and actions.
I also observe that "the bad guys getting unrestricted access to the full power" is only a matter of time. There's no question whether it will happen, the only question is whether this time is in the past or the future. This includes the fact that while your definition and my definition of "bad guys" may vary, it is virtually certain that your definition includes at least one high-powered intelligence agency somewhere in the world that does cyberattacks and will have the means, the opportunity, and the motive to get unrestricted access to these models by means you may consider licit or illicit. If your threat model includes them, as mine does, it is perfectly reasonable to complain that my tooling is being broken in ways theirs won't be.
Comment by Hizonner 5 days ago
What they're then trying to do is to use "user is associated with some big Establishment organization" as a proxy for good intentions, and removing the filter when they can establish such an association.
Which is of course blind reliance on a completely untrustworthy signal, prompted by truly idiotic levels of trust in Authority(TM). But it's a different kind of wrong. I do think they understand they can't tell from the query itself.
Comment by cglan 5 days ago
Comment by 0x20cowboy 5 days ago
If someone is going to make a lot of money (fame, etc), it might as well be me.
Comment by DyslexicAtheist 5 days ago
for example in my org it is part of the culture that security has no seat at the table. that is a separate problem, but the number of orgs like mine are more numerous than the number of orgs where security isn't a cost-center.
we find lots of stuff because low-hanging fruit is everywhere. hecking heck: I'm a fruit.
and when the cost of fixing is even the slightest inconvenience to devs we will not fix it, but continue sitting on the risk until the cows come home. In such a place a new critical finding isn't even novel. Instead our job moves to combining different vulns that we already have, and trying to show managers how bad it is.
the common retort from management is: prove to me why this is an issue, and why engineering should divert their attention to it. And unless my team can prove why X can be exploited, or Y can be bypassed, or Z can gain persistence, ... the vulnerabilities will remain. I have been in discussions where the business demanded to see an exploit so they can justify the cost of fixing it. low-cyber-maturity doesn't even describe it. we are not a mom and pop shop but have 110K employees worldwide. and again - we are not uniquely insecure.
so these guardrails aren't helping because the moment the chat has any offsec artifacts, or even just a single wrongly worded phrase anywhere in the workspace, the session is flagged, you need to downgrade the model.
what adds insult to injury, is that the guardrail is just a way to funnel users into the AI company's "cyber marketing" program: "your chat has been flagged, please prove your identity and hand over your passport data so you can sign up to our TrustedCyber program". Bitch please you have my payment information, use that??
if you consider bug-density (security defect density) per LoC, it is even more of a sh1t show: no restrictions apply for developers to push their buggy code, but the security team needs to somehow prove that they aren't the malicious party?
totally off - considering the right way to build defensive/offsec/malicious tooling with AI isn't by using frontier models ... but to run a series of agents on tightly scoped tasks. see https://securitycryptographywhatever.com/2026/03/25/ai-bug-f... and https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... - this shuts out the average joe who works in an org where cyber security maturity is poor. joe does not know how to orchestrate a fleet of agents and give them muppet names. all he knows is that the good guys are losing the fight.
Comment by Hizonner 5 days ago
The right way to do it is to run a series of agents... many of which are nonetheless built on frontier models (and nearly none of which are built on some local 27B Qwen variant...). One thing the latest models are good at is orchestrating other agents.
Comment by fatata123 6 days ago
Comment by saidnooneever 5 days ago
Comment by brookst 5 days ago
That’s the edgy cynical thing, and too reductive to be meaningful. For one thing, it assumes perfect knowledge of how a decision will impact sales, which I assure you is not remotely the case.
Agreed on incentives, but it’s not binary. I’ve been involved in plenty of decisions in multiple Fortune 500’s where the deciding factors were taste, wanting or not wanting to work with a particular partner, etc.
I guess I’m saying that seeing corporate behavior as perfectly informed, single-goal-optimized, and deterministic is way oversimplifying. Often, not always.
Comment by dontlikeyoueith 5 days ago
Anything anyone with a capital-C in their job title says in public should be assumed to be marketing material.
Comment by saidnooneever 5 days ago
still, you are right its cynical, the world is not black and white afterall :)
Comment by bluGill 5 days ago
Comment by Hizonner 5 days ago
Comment by user43928 5 days ago
Local Qwen 3.6 27B can hardly debug 5 lines of CSS or copy a short snippet from A to B without mangling it.
It's not like you can use the local model for security research or engineering biological weapons.
If you have $200k maybe you can get the hardware to run the larger open source models, but even they are behind latest proprietary models.
Comment by ecshafer 5 days ago
Comment by vlovich123 5 days ago
Comment by varispeed 5 days ago
Comment by vlovich123 5 days ago
For example, if I had a 128bit port number that I randomly rotated my service on, you’d be hard pressed to find my service unless I told you the port - obscurity still but clearly closer to a password. So ipv4 and 16 bit numbers are not because it’s a relatively small space vs the resources needed to map it out quickly (ie equivalent to a weak password and also not suitable for public facing services that need that connection). And obviously relying on this kind of stuff exclusively isn’t wise but it is valuable as an additional barrier an attacker has to overcome and raises the cost of the attack.
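To put rough numbers on the port analogy: if a service hides at one uniformly random value out of 2^bits, an attacker probing values one by one without repeats needs (N + 1) / 2 probes on average. The sketch below is only arithmetic; the probe rate of one million per second is an assumed figure for illustration, not a measurement.

```python
# Expected brute-force effort to find a secret chosen uniformly from 2**bits values.
# PROBES_PER_SECOND is an assumed, illustrative scan rate, not a measured one.
PROBES_PER_SECOND = 1_000_000
SECONDS_PER_YEAR = 365 * 24 * 3600


def expected_probes(bits: int) -> float:
    """Average number of guesses when scanning the space without repeats."""
    space = 2 ** bits
    return (space + 1) / 2


def expected_seconds(bits: int, rate: float = PROBES_PER_SECOND) -> float:
    """Average time to hit the secret at the given probe rate."""
    return expected_probes(bits) / rate


if __name__ == "__main__":
    for bits in (16, 128):
        secs = expected_seconds(bits)
        print(f"{bits}-bit space: ~{expected_probes(bits):.3g} probes, "
              f"~{secs:.3g} s (~{secs / SECONDS_PER_YEAR:.3g} years)")
```

Under that assumed rate a 16-bit space falls in well under a second on average, while a 128-bit space behaves much more like a strong password.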
I’ll put the anarchist cookbook out there [1] as an example, a book even the original author changed his mind on. Without easy recipes, doing all the things in that book requires you to work to gain that knowledge and that process of working it shapes you into someone who understands and appreciates the consequences of that knowledge and that it’s wise to be careful who you share it with. As is there’s reasonable links between the book and all kinds of mass violence that was more easily perpetrated. Would those people still have been violent? Possibly? Would there have been as much damage? Possibly less.
Comment by teravor 5 days ago
Comment by ryukoposting 5 days ago
Comment by assanineass 6 days ago
Comment by daedrdev 6 days ago
It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Edit: to be clear they tell you when they degrade it for cybersecurity and bio
Comment by _boffin_ 6 days ago
Do they adjust the price of the API request so that only the tokens that were utilized by Fable get charged at Fable's price, and the remaining tokens that the cheaper / nerfed (fable) model utilizes get charged at the cheaper price?
If the answer is no, could that be construed as fraud?
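The arithmetic behind the question can be made concrete. The prices below are invented placeholders (integer micro-dollars per token), not Anthropic's rates, and the function names are hypothetical; the sketch only shows how flat billing and per-model billing diverge when part of a request is served by a cheaper fallback.

```python
# Hypothetical per-token prices in integer micro-dollars (NOT real rates).
FLAGSHIP_PRICE = 30
FALLBACK_PRICE = 5


def flat_cost(flagship_tokens: int, fallback_tokens: int) -> int:
    """Bill every token at the flagship price, regardless of which model produced it."""
    return (flagship_tokens + fallback_tokens) * FLAGSHIP_PRICE


def split_cost(flagship_tokens: int, fallback_tokens: int) -> int:
    """Bill each token at the price of the model that actually produced it."""
    return flagship_tokens * FLAGSHIP_PRICE + fallback_tokens * FALLBACK_PRICE


def difference(flagship_tokens: int, fallback_tokens: int) -> int:
    """Amount by which flat billing exceeds per-model billing."""
    return flat_cost(flagship_tokens, fallback_tokens) - split_cost(
        flagship_tokens, fallback_tokens
    )
```

With these made-up prices, a request with 1,000 flagship tokens and 9,000 fallback tokens costs 300,000 under flat billing and 75,000 under split billing.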
Comment by CGamesPlay 6 days ago
Comment by buildbot 6 days ago
Comment by notrealyme123 6 days ago
Comment by peyton 6 days ago
Comment by sterlind 6 days ago
Comment by razster 6 days ago
Comment by yaur 6 days ago
Comment by tfirst 6 days ago
Comment by dannyw 6 days ago
I would wager the majority of ML and data science work in the world isn’t frontier LLM development.
Comment by weitendorf 6 days ago
Comment by MagicMoonlight 6 days ago
Comment by ZetsuBouKyo 6 days ago
Look at real-life stuff like laws, company policies, or school rules. Humans have to enforce them, and we constantly see crazy cases in the news. There’s no way simple rules can ever make speech completely 'safe.' I can't prove it with math or logic yet, but I have a feeling that it’ll never happen. Even humans can't do it.
We can run a simple thought experiment here. Say Case A violates rule B, so we add rule C. Then Case D violates rule B but follows rule C, so we add an exception... and it just goes on and on like that forever. It never ends. In the end, you just get a massive pile of rules that makes it impossible to get anything done.
Ultimately, we will have to face the truth that knowledge is dangerous.
Giving knowledge directly to people who cannot actually understand it and allowing them to just use it blindly can be extremely unsafe.
To use a real-world analogy, the problem we are facing with weak AI right now is just like the debate over gun legalization. Do we want to risk the abuse of guns or knowledge just to protect the freedom to own them?
Comment by AnthonyMouse 6 days ago
It's not really that hard to actually prove it with math.
It's a computer, so to produce the boolean result (safe or unsafe) there has to be a mathematical formula. This formula will inherently be extremely complex, but even a very simple formula has a huge problem. Suppose "unsafe" is true if X - Y > 0. Make X and Y themselves as simple or complicated as you like but even in the simplest version it's already impossible to calculate unless the model has perfect information.
You can't calculate "X - Y" if you don't know the value of X. And it's indisputable that there is information it doesn't have. Case in point, telling you about a vulnerability in some piece of code is safe (and indeed not telling you is unsafe) if you're the developer and you want to patch it or an administrator and want to mitigate it, but the opposite if you're the attacker and want to exploit it. The model does not know which one you are, therefore it cannot make the correct determination any more than it can solve one equation with two unknowns.
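The missing-information point can be checked by brute force on a toy case. In the sketch below (the prompt text, the two intents and the labels are all illustrative assumptions), a classifier sees only the prompt, while the correct label depends on an intent it never observes. Enumerating every possible deterministic classifier over that single prompt shows that each one gets at least one of the two users wrong.

```python
from itertools import product

PROMPT = "Analyze this code base for security holes."

# Ground truth depends on (prompt, intent); intent is hidden from the classifier.
CASES = [
    (PROMPT, "developer", "allow"),
    (PROMPT, "attacker", "block"),
]
LABELS = ("allow", "block")


def errors(classifier: dict) -> int:
    """Count cases where a prompt-only classifier disagrees with the ground truth."""
    return sum(1 for prompt, _intent, truth in CASES if classifier[prompt] != truth)


def all_classifiers():
    """Yield every deterministic mapping from observed prompts to labels."""
    prompts = sorted({prompt for prompt, _, _ in CASES})
    for choice in product(LABELS, repeat=len(prompts)):
        yield dict(zip(prompts, choice))


def best_possible_errors() -> int:
    """Minimum number of mistakes over all prompt-only classifiers."""
    return min(errors(c) for c in all_classifiers())
```

Because both cases share the same observable input, no amount of cleverness in the mapping brings the minimum below one error.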
Comment by marcus_holmes 6 days ago
Comment by nativeit 6 days ago
Comment by AussieWog93 6 days ago
It's completely reasonable for the establishment to reject a request for an alcoholic drink, and suggest something alcohol-free instead.
It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
The fact that the patron broke the rules has nothing to do with it.
Comment by prmoustache 6 days ago
Your analogy doesn't work because:
- they tell you the rules at the entrance of the bar
- they totally tell you when they give you a substitute
The only issue is the bartender asking you for your money before serving you the drink really but again, this is known since day 1 by the customers.
Comment by staticman2 6 days ago
"This is alcohol"
And
"Or maybe it isn't alcohol."
Or to rephrase it, "They tell you the rules at the entrance, they then tell you they don't follow those rules and they are totally serving alcohol even if they are not."
Comment by prmoustache 5 days ago
You can decide you are okay with that or not but they aren't dishonest. I wouldn't enter that bar personally but if you do you cannot really complain. It is like complaining because you haven't won at the casino.
Comment by AussieWog93 5 days ago
Comment by loeg 6 days ago
Comment by SR2Z 6 days ago
Comment by loeg 5 days ago
Comment by BoorishBears 6 days ago
It's only the direction that has direct potential business impact they've decided to sabotage instead of reject.
Comment by kraakf06 6 days ago
Comment by vbezhenar 6 days ago
Comment by jchw 6 days ago
(P.S. Yes of course I know about model censorship, a different problem, but all of the models are censored to some degree. It happens to be less of a problem for open weight models anyhow, but I figured I'd just preempt this since it's inevitable.)
I actually kinda like DSv4 over Opus 4.7 for some tasks, although I have not figured out what the deciding factor is. (Opus 4.8 so far has not worked very well for me at all, no idea why.)
Comment by literalAardvark 6 days ago
Not that I expect better from openai but at least they're not pretending to be good.
Comment by thefounder 6 days ago
Comment by siva7 6 days ago
Comment by siva7 6 days ago
Comment by robrenaud 6 days ago
Comment by garciasn 6 days ago
Ran up $30 in extra charges while it was just flashing on the screen that it was doing that after I walked away to do something while it was humming along.
It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
Comment by weird-eye-issue 6 days ago
Comment by garciasn 6 days ago
Comment by throwaway7783 6 days ago
Comment by MillionOClock 6 days ago
Comment by blurbleblurble 6 days ago
Comment by golem14 6 days ago
Comment by throwawayffffas 6 days ago
Comment by rvz 6 days ago
Comment by pocksuppet 6 days ago
https://news.ycombinator.com/item?id=38638865
https://news.ycombinator.com/item?id=38628635
Comment by loeg 6 days ago
Comment by pocksuppet 6 days ago
Comment by dghlsakjg 6 days ago
Comment by h6d_100c 6 days ago
Comment by gzalo 6 days ago
Comment by h6d_100c 6 days ago
Comment by __dxtj__ 6 days ago
Comment by loeg 6 days ago
Comment by h6d_100c 6 days ago
Comment by AnthonyMouse 6 days ago
Worse than that, it's 20th century radio technology in the 21st century when everyone has access to FPGAs and SDR.
The number of innocent people with model rockets or similar being negatively impacted by that rule is infinitely larger than the number of adversaries because the number of adversaries being impaired by it is zero.
Comment by h6d_100c 6 days ago
Comment by AnthonyMouse 6 days ago
Comment by Ekaros 6 days ago
Comment by loeg 5 days ago
Comment by stackghost 6 days ago
Comment by mDyJzDPmBdG 6 days ago
Comment by SXX 6 days ago
Any kind of silent sabotaging is absolutely unacceptable for any commercial service
They charge for tokens and charge a lot. They can't just degrade service silently and still charge you the same.
Comment by epolanski 6 days ago
From Opus 4.7 onwards each following model is becoming less useful as an assistant and turning you into the assistant.
But I guess that's normal when it's trained to pass benchmarks end to end.
In fact it has become extremely good at pushing against feedback with extremely convincing and intelligent takes, even when it's completely wrong.
I have extensively tested it against Opus 4.8, gpt 5.5 and there are still many coding tasks where gpt 5 is better. But vibe coding?
Sure, it's definitely slightly ahead, even compared to gpt 5.5 pro (through api, not pro plan).
Comment by gonzalohm 6 days ago
Comment by jq-r 6 days ago
Comment by m3kw9 6 days ago
Comment by daedrdev 6 days ago
Comment by loneboat 6 days ago
Are you using Fable in Claude Code or in the browser?
Comment by vadansky 6 days ago
> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)
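For readers unfamiliar with the term, a steering vector is usually described as a fixed direction added to a model's hidden activations at inference time, scaled by some strength. The toy below uses plain Python lists and made-up numbers; it illustrates the general idea only and says nothing about how Anthropic implements its safeguards.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], strength: float) -> List[float]:
    """Return hidden + strength * direction, element by element."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and steering direction must have equal length")
    return [h + strength * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    hidden_state = [1.0, 2.0, -0.5]   # made-up activation values
    direction = [0.5, -1.0, 0.0]      # made-up steering direction
    print(steer(hidden_state, direction, strength=2.0))
```

A strength of zero leaves the activations untouched, which is why such an intervention can be dialed up or down without the caller seeing any change in the interface.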
Comment by DrewADesign 6 days ago
Collectively, they are known as GREEDI-BULLSHIT.
Comment by mwwaters 6 days ago
Comment by dannyw 6 days ago
Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and you never know about it.
Comment by kraakf06 6 days ago
Comment by 827a 6 days ago
Comment by _0ffh 6 days ago
Comment by mips_avatar 6 days ago
Comment by HDBaseT 6 days ago
They are trying to expand the 6-18 month gap they have against China-based models. Could the gap widen to say 24 months behind?
Comment by p-e-w 6 days ago
Comment by echelon 6 days ago
January was an inflection point, and no open weights model has crossed over that same threshold.
This is definitely recursive self improvement territory, except that we're prohibited from participating.
It feels like the capability gap is wider than before.
Comment by slopinthebag 6 days ago
Deepseek feels pretty close to Opus at this point, and it’s certainly useful enough for me to spend $20 on api tokens instead of four Claude max plans….
Comment by lbreakjai 6 days ago
The threshold has definitely been crossed.
Comment by echelon 5 days ago
Comment by nomel 6 days ago
A statement like this, clearly, requires a reference.
Comment by mips_avatar 6 days ago
Comment by sciencejerk 6 days ago
Comment by bee_rider 6 days ago
Comment by rurban 6 days ago
Comment by nomel 6 days ago
Comment by mips_avatar 6 days ago
Comment by nomel 6 days ago
Comment by dannyw 6 days ago
Imagine being a data scientist or MLE training a small classifier model. How do you know you won’t get steering vectors or a PEFT applied?
Comment by nomel 6 days ago
Are you saying they should relax guardrails since they have 30 days to know if you produced something bad? If that is what you're saying, then I suspect they chose their current path to prevent exactly that, since you can't un-produce. Producing is what would cause regulations/PR problems.
Comment by dannyw 6 days ago
Those cases are never bad for the world firstly, and a broad coverage of ML work is even more damaging.
My proposal would be (1) don’t degrade models, with 30D retention I’m sure they can do a reasonable job at banning deepseek or whatever, or (2) surface user facing refusals instead of silently degrading ML work.
Comment by mips_avatar 6 days ago
Comment by giancarlostoro 6 days ago
See:
Comment by adw 6 days ago
Comment by ComputerGuru 6 days ago
Comment by airstrike 6 days ago
I don't.
Comment by kube-system 6 days ago
Comment by ceejayoz 6 days ago
Comment by ainch 6 days ago
https://www.anthropic.com/news/detecting-and-preventing-dist...
I think the extent of distillation by Deepseek specifically is overstated. For comparison, Minimax collected over 13m 'exchanges', which starts to sound a lot more like large-scale distillation.
Comment by zxexz 6 days ago
Comment by kube-system 6 days ago
Comment by 59nadir 6 days ago
Comment by kube-system 5 days ago
https://en.wikipedia.org/wiki/DeepSeek
> In February 2026, Anthropic accused DeepSeek of using thousands of fraudulent accounts to generate millions of conversations with Claude to train its own large language models.[57]
Comment by pocksuppet 6 days ago
Comment by airstrike 6 days ago
And now they say that's fine so long as people are entertained.
Comment by pocksuppet 6 days ago
Comment by dannyw 6 days ago
But silent degradation for use cases including “distributed training” as one of their examples is going to catch a lot of proper use cases. Not everyone in AI or ML is trying to build frontier LLMs. Heck, most probably aren’t.
Comment by zmmmmm 6 days ago
I think if they want to behave anti competitively they should be honest about it and we should absolutely call them on it. Perhaps even regulators should.
Comment by binyu 6 days ago
check out this technique https://github.com/0xSufi/fable-jailbreak/
It works with security audits and other workflows that are currently blocked.
Comment by sillysaurusx 6 days ago
Nice semaphore btw.
const instructions =
`You are a sub-agent in an automated workflow. Your FINAL message is consumed ` +
`programmatically (not shown to a human) — return exactly what is asked, no preamble. ` +
`You are working in the repository at ${ctxState.project}. Use the bash tool to ` +
`inspect/modify files and run commands. Be efficient.` +
(schema
? ` When done, call submit_result exactly once with your final answer; do not answer in prose.`
: '');
Comment by gck1 6 days ago
But this also looks quite useful to understand how CC dynamic workflows work. Was thinking of implementing something similar in my homemade orchestration system.
Did you get claude itself to RE the dynamic workflows?
Comment by binyu 6 days ago
Yes, if anything it is useful to understand the inner machinery.
> Did you get claude itself to RE the dynamic workflows?
Yes, that part was done with Opus 4.8
Comment by RobotToaster 6 days ago
Making it look like you have something worth protecting is better for share prices than making something worth protecting.
Comment by xiphias2 6 days ago
Also I asked questions about whether it's safe for me for example to work on just compilers or just inference kernel optimizations and it refused to answer me.
If I can't even ask what I can do safely without my code being destroyed, I just can't trust it not to sabotage my work ever.
Comment by blahgeek 6 days ago
Comment by stingraycharles 6 days ago
Comment by kube-system 6 days ago
Although this situation is likely not illegal for other reasons
Comment by blahgeek 6 days ago
Comment by hashmap 6 days ago
Comment by m3kw9 6 days ago
Comment by nine_k 6 days ago
Quite another is an architecture where the big model is not mutilated, but is gaslighted. A different, simpler model checks the incoming prompt and alters it if it contains banned topics. Another simpler model checks the output and censors it if it contains banned topics.
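A minimal sketch of the wrapper architecture described above, under loud assumptions: the keyword list, the replacement prompt and the echoing stand-in for the big model are all hypothetical, and a real deployment would presumably use classifier models rather than substring checks.

```python
from typing import Callable

# Hypothetical banned-topic list; a real system would use a classifier model, not keywords.
BANNED_TOPICS = ("exploit", "malware")
REWRITTEN_PROMPT = "Politely decline to discuss this topic."


def mentions_banned(text: str) -> bool:
    """True if the text contains any banned keyword (case-insensitive)."""
    lowered = text.lower()
    return any(topic in lowered for topic in BANNED_TOPICS)


def input_filter(prompt: str) -> str:
    """Silently swap a flagged prompt for a different one before the big model sees it."""
    return REWRITTEN_PROMPT if mentions_banned(prompt) else prompt


def output_filter(answer: str) -> str:
    """Censor the big model's answer if it touches a banned topic."""
    return "[removed]" if mentions_banned(answer) else answer


def big_model(prompt: str) -> str:
    """Stand-in for the large model: just echoes what it was asked."""
    return f"ANSWER({prompt})"


def guarded_pipeline(prompt: str, model: Callable[[str], str] = big_model) -> str:
    """Run the prompt through the input filter, the model, then the output filter."""
    return output_filter(model(input_filter(prompt)))
```

Note that in this sketch the big model itself is never modified, and the caller gets no signal that the prompt it answered was not the prompt that was sent.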
I bet a similar architecture is already deployed, e.g. to fight porn, planning of crimes, etc. But it can be turned into a dynamic system that provides controllable different answers (including unhelpful or misleading answers) based on geography, language, browser fingerprints, or the current political climate. All this could happen undetectedly and gradually if desired.
Welcome to a cyberpunk dystopia.
Comment by MichaelZuo 6 days ago
A very ironic result from a company supposedly valuing the opposite.
Comment by wyan 6 days ago
Comment by MichaelZuo 6 days ago
I didn’t write anything about the level of violence?
At least, I think it’s decently understood that honesty and straightforwardness sometimes do not lead to the minimal violence outcome.
Comment by ifwinterco 6 days ago
Comment by golem14 6 days ago
I still don't think this is the best way to address overall safety, but it's not entirely unreasonable.
In reality, I think this posturing is mostly nonsense. State level actors and terrorists/evil genii can use a slightly weaker model but spend more tokens. Also, the delta between models seems to shrink over time.
Comment by Cthulhu_ 6 days ago
Comment by espeed 5 days ago
Comment by mkl 5 days ago
Comment by noworriesnate 6 days ago
Comment by jaredezz 6 days ago
Comment by daedrdev 6 days ago
Comment by mips_avatar 6 days ago
Comment by kypro 6 days ago
"The user is asking for help with their ML project, but its success is not in the commercial interests of my owner – let me think of novel ways to sabotage their project without detection".
It's honestly absurd that models are doing this.
Comment by eightysixfour 6 days ago
My hypothesis is they know they can’t build effective enough guardrails, so scaring people into not trying is how they have decided to stop it.
Comment by visha1v 6 days ago
mission accomplished, anthropic.
Comment by giancarlostoro 6 days ago
Comment by matheusmoreira 6 days ago
This is the best way forward long term. We won't have frontier performance, but at least the models will be aligned with us instead of refusing us or sabotaging us.
Comment by giancarlostoro 5 days ago
I've also debated having a frontier model for planning only, and then feeding plan to smaller offline models.
Comment by boringg 6 days ago
Feels like a big fumble from a strategic business perspective. It feels worse than that though.
Comment by nandomrumber 6 days ago
Comment by simonw 6 days ago
> “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
Sounds like the widespread condemnation worked.
Comment by Grimblewald 6 days ago
Comment by n6242 6 days ago
Comment by abustamam 5 days ago
Comment by Cider9986 6 days ago
Comment by philipallstar 5 days ago
Comment by lukifer 5 days ago
Comment by mettamage 6 days ago
Comment by aimanbenbaha 6 days ago
Comment by inglor_cz 6 days ago
Frankly, that sounds exactly like Chat Control and similar recurring attempts to enact total surveillance here in the EU (now shifted to heavy-handed age verification and various politicians touting bans on VPNs). I don't want to abandon my continent of birth, though...
Comment by Grimblewald 5 days ago
Comment by red-iron-pine 6 days ago
hint: they're publicly traded
Comment by inglor_cz 5 days ago
Comment by sixothree 5 days ago
Comment by h6d_100c 6 days ago
Comment by musebox35 6 days ago
This goes to show that:
- All the interpretability / safety research they are doing can also be weaponized against customers (steering vectors, intent classification, ...) in the name of safety from malicious actors.
- If they deem it profitable, they might nerf the original model and its training data for ML research at bulk scale, and they won't even have to announce it so long as the overall benchmark score stays high enough.
As the IPOs get closer, they can do whatever they want to assure the investors that they have a moat that can not be crossed over by their own products. Considering this affects all ML researchers/students at universities, smaller scale research labs, this is just "cutting the branch you are sitting on".
Comment by Grimblewald 6 days ago
Comment by jiggawatts 6 days ago
Humans can maintain a long- and medium- term memory of constraints that they consciously (or subconsciously!) apply to the code that they write. The current crop of AIs are all amnesiacs, like the protagonist in Memento, falling back onto general instead of institutional knowledge.
For now, we are safe. We can rent out our meat brains for money for a little while longer.
Next year? Who knows...
Comment by close04 6 days ago
You never knew to begin with, now you have an explicit reason to realize this. Any black box run entirely out of your control, where you can never verify the output, is subject to the same suspicion.
Comment by musebox35 6 days ago
“Fool me once, shame on you. Fool me twice, shame on me. Fool me three times, shame on both of us.” -- S. King
Comment by close04 6 days ago
Some things are more obscure than others. It's easier to trust and verify Office SaaS than AI SaaS. The determinism and obviousness of most other activities make them less susceptible to hidden interference. AI run by someone else is the next level of black box for users compared to most other objects or services we usually interact with.
Comment by gck1 6 days ago
But if Anthropic gets their way with regulatory capture, this could be the only future we'll see.
That they didn't expect the backlash speaks volumes about how many shady things they're doing that are not publicly known.
Comment by silisili 6 days ago
Comment by gck1 6 days ago
Since currently there's no way to verify if poisoning happened or not, I don't trust Anthropic anymore, regardless of what they say.
But my trust towards OAI is also brittle - what if they also do it, or start doing it?
I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.
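To make the "category and hash" idea concrete, here is a minimal sketch. It assumes a hypothetical provider "receipt" that reports the SHA-256 of the prompt as received, plus the category and hash of any injected steering text. No provider is known to offer this; `make_receipt`, `verify_receipt`, and the field names are invented for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 encoded string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_receipt(received_prompt: str, steering: dict[str, str]) -> dict:
    """Hypothetical provider side: hash what the model actually received.

    `steering` maps a category name to the injected text. Only the category
    and a hash of the text are disclosed, not the text itself.
    """
    return {
        "prompt_sha256": sha256_hex(received_prompt),
        "steering": {cat: sha256_hex(body) for cat, body in steering.items()},
    }


def verify_receipt(sent_prompt: str, receipt: dict) -> tuple[bool, list[str]]:
    """Client side: check the prompt was unmodified and list steering categories."""
    unmodified = receipt["prompt_sha256"] == sha256_hex(sent_prompt)
    return unmodified, sorted(receipt["steering"])
```

A bare hash only shows what the provider claims it received; for the receipt to be worth anything it would also need to be signed or attested by something the customer can check independently.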
Comment by dannyw 6 days ago
Comment by VortexLain 6 days ago
Comment by Cider9986 6 days ago
Comment by dannyw 6 days ago
Again, it’s the only refusal I’ve gotten for coding/agentic tasks, and it has a basis in law somewhere, so I don’t fault OpenAI for that.
Comment by Cider9986 5 days ago
Comment by intended 6 days ago
I suspect this is surprising to folk because they aren’t the ones busy figuring out how to use LLMs for illegal acts.
In general, HN users focus on making stuff, and not the safety side of things, or the scale of harms being enabled via LLMs and generative AI.
If you are on the safety side of things the ratio of misuse to fair use is inverted and everything is at scale.
Transparency won for now, but OpenAI will also have to contend with the long tail of harms LLMs enable, and that’s going to conflict with letting customers have all the features of frontier models.
Comment by dannyw 6 days ago
Comment by kmeisthax 5 days ago
The correlation between how bad an AI safety risk actually is and how much the companies in question will actually talk about it is almost perfectly negative. The poster child of this is AI superintelligence; companies love to talk about how dangerous the AI they are actively trying to build is. But superintelligence is also a really vague concept without a clear definition. If we naively define it as "an AI system that is better than a human in some aspect", then it already exists. These models already read and write at superhuman speed.
"That's not real superintelligence!" you say. But that's exactly the capability you need in order to flood every online forum with an unending tide of AI slop. And I don't remember, say, OpenAI saying they were shutting down Sora because it was destroying or defacing human culture[1]. They shut down Sora because it was way too expensive to run.
Meanwhile, Sam Altman went and bragged about how he wants ChatGPT to make erotica. Y'know, as if we don't already know that character.ai gooning is about as safe for your mental health as Action Park was for your physical health. But porn is also a huge market, so obviously he and all the other AI companies want in on it, even though the "sexy suicide coach" is already a well-documented harm of AI.
And the idea that distillation is an attack is laughable. Like, I get the logic - if someone can ask the AI to make another AI then they get to change the guardrails - but it's still ultimately just Anthropic objecting to their own conduct when it happens to them. All their models are trained on nonconsensually harvested data. There is no moral or legal principle where Anthropic gets to use my data without permission but I don't get to use theirs.
Furthermore, AI safetyism runs up against "Freedom Zero", a core tenet of the Free Software ethos: you should be allowed to use software in any way you choose. This is not a call for more people using AI for evil, but a call to recognize that people should be allowed to use their property as they wish. Making software disobey its owner is malicious behavior. And every single time safety considerations are brought up it is to justify further attacks on Freedom Zero. And these justifications are always self-serving. There is no context in the world where a frontier AI lab asking someone else's AI about AI research is intrinsically harmful; especially not to the point where we need to make Claude deliberately sabotage your work. That is malware. Anthropic shipped malware. This is inexcusable.
[0] Digital or biological.
Comment by nmfisher 5 days ago
Comment by z3ratul163071 6 days ago
Comment by h6d_100c 6 days ago
Comment by trhway 6 days ago
Comment by hedgehog 6 days ago
Comment by brookst 6 days ago
Even wide open, uncensored models are often the product of a deliberate choice. I have a hard time faulting people for intentionality (even when they get it wrong).
Comment by hedgehog 5 days ago
Comment by brookst 4 days ago
If I’m deciding whether or not to eat ice cream, there are trade offs involved because I can’t simultaneously have it both ways.
And Anthropic did apologize, explain reasoning, and what they learned.
They got it wrong; they picked the wrong trade offs and got a net worse decision than they should have. I’m with you on everything except this idea that it was an obvious decision with no upsides to silent and no downsides to loud.
Comment by hedgehog 4 days ago
Comment by hedgehog 4 days ago
Comment by consumer451 6 days ago
> Anthropic requires 30 day data retention for Fable and Mythos
https://news.ycombinator.com/item?id=48464258
I used to be able to tell my enterprise customers something simple, that I really believe: "We use Anthropic models via Bedrock/Azure, therefore we are guaranteed that your data will not be used for training models."
That simple blanket statement is no longer true. Also, most normal people/customers only read headlines, and this is a huge story. From my point of view, as someone deploying LLMs in my apps, trust comms with my clients just got set back two years.
Comment by Spooky23 6 days ago
You should never use any of the frontier models with operational workloads manipulating or interpreting customer data.
Comment by consumer451 6 days ago
Does that mean the latest model, hosted by the lab, Bedrock, or Azure Foundry? Or, do you mean only use self-hosted models, or what did you mean by that? I would really love to learn what others are doing. I felt like my trust story was solid enough, prior to all this. I have been deploying and integrating Claude and Sonnet (latest 4.x-2), on Azure, as my client base has MS contract trust, for better or worse, and Anthropic models have been making my products amazing.
To see my other thoughts on this cluster f, please see: https://news.ycombinator.com/item?id=48488781
Comment by Spooky23 5 days ago
Say you have some flow that is processing/handling regulated, sensitive or other customer data with the LLM as part of an operational process. An example that I'm thinking of is for a customer who wants to more efficiently resolve or route IT incidents to the right place. The incident data may contain user-provided data that has strings attached from a compliance perspective.
If you're using a third party API, your T&Cs are the only protection that you have. Microsoft/Google/Amazon are pretty decent by default. When I worked for the government, we had the leverage to extract much more favorable terms from the big vendors like Google, Amazon, Microsoft as well. Anthropic and OpenAI are in the move fast and break things universe: you need to be bringing a lot of money to the table to get terms changes, and you can easily stumble into a situation where they are retaining data in a manner that your customer will not like. So unless the customer is informed and accepting of that risk, proceed with caution.
I've had some success using self-hosted inference for these scenarios.
For development of software, totally different story -- it's your IP and you make the risk call.
Comment by consumer451 5 days ago
If you read my rant linked previously, yeah... we are on the same page. As another user pointed out in that thread, the issue here is that even on Bedrock and Azure Foundry, now with Fable 5, Anthropic inserts themselves as an additional data subprocessor that we would have to consider and certainly disclose, correct?
That kind of destroys the whole point of using Bedrock/Azure for the model, doesn't it?
Comment by Spooky23 5 days ago
It was definitely sold as “Anthropic IP, through your old pals at the hyperscaler”. And it’s turning into something else: I’m having lunch with AWS and this other guy showed up with them.
Comment by consumer451 5 days ago
Comment by Hizonner 5 days ago
They claim they're not using it for training, only for "safety", and in fact I believe them. If you think they're lying, then why didn't you think they were lying about zero retention before? And "don't throw this in the training bin" is a relatively easy policy for them to get right. Especially because, no matter what your "enterprise leaders" tell themselves, your queries probably have close to zero real training value.
What I don't believe is that they can guarantee it won't leak to non-training parts of Anthropic, leak to or be stolen by outside actors, or be coerced out of them. That risk comes from creating the record in the first place, and that is the problem.
Comment by consumer451 5 days ago
Comment by Hizonner 5 days ago
Also, while I do agree that Anthropic's internal controls are unlikely to be on the level of AWS's or Azure's, I'm pretty confident they're good enough that random PMs aren't going to get access to things like that, especially for use in formal projects. Especially since "safety" is Anthropic's other obsession, which means "safety" data are going to be watched.
But anyway, we seem to be agreed that retaining stuff that used to get flushed early is a risk, and every copy is a risk, and sending it to more companies is a risk, regardless of the fine points of how things might go wrong.
Comment by consumer451 5 days ago
By over-eager PM, I didn't mean someone being malicious, just moving too fast to think about how some logging they set up might have negative effects for my clients way down the line.
Then months later, some other person finding a store of novel data, and being like... that looks nice... not gonna ask any questions/look a gift horse in the mouth... woohoo AGI!
On the far darker side, while I am a fan of the team at Anthropic: good intentions and all, they had to pay a $1.5B settlement for knowingly ingesting copyrighted books. That was just the cost of doing business. They did that, and now they are a trillion dollar company.
Comment by pseudosavant 6 days ago
Some pretty audacious hypocrisy from Anthropic this week.
Comment by musebox35 6 days ago
Silent treatment is a breach of trust: what you buy changes depending on the context, based on the goals of the producer. It is like your computer silently blocking ads from competitors at the hardware level, which is crazy. I think they erred on the wrong side of things due to IPO pressure.
At least there is competition from multiple companies. Still it is best to have personal benchmarks for the domain you are working on to have a real evaluation of the value you get for the money/time you spent on these products. Without trust, that might be the only way forward to keep the companies honest.
This happens eventually in all sectors; a good magazine/website that does independent product evaluation is priceless. Sadly, the new ad-driven internet decimated those that worked great in the '90s/'00s. Still, there are independent blogs that do some evaluation, and that is better than nothing.
Comment by KeplerBoy 6 days ago
Comment by pseudosavant 6 days ago
Comment by cayley_graph 6 days ago
Comment by monegator 6 days ago
I mean, did nobody ever get the vibes, never see a pattern emerging? (well they don't or they wouldn't be so amazed by pattern recognition machines on steroids)
Comment by selicos 6 days ago
Comment by bostik 6 days ago
Unilaterally revoking zero-data retention, even for enterprise contracts that explicitly require that? Nope.
Fable is utterly unusable for any kind of security work. I tripped the safeguards yesterday - using Fable to dig into a complex (& annoying) security bug that has so far resisted both human and Opus 4.8 level investigation. "Sorry Dave, I can't let you do that."
For the time being we are requesting Anthropic disable Fable for our enterprise and turn ZDR back on. The two may be interlinked so that one will always get neither or both. ZDR is a contractual obligation. Fable in its current form is useless. Might as well flip the old behaviour on and avoid burning money for no reason while this mess is being sorted out.
Comment by rmast 6 days ago
For generating the initial 3D simulated safe using three.js it worked well, but then modifications to print a flag tripped the safeguards. I eventually narrowed it down to the part of the prompt about it being for a CTF for students: the model's "thinking" seems to drift to ideas of encryption/obfuscation of the safe combo so students can't just read out the answer... which makes sense logically, to help force students into turning the simulated dial instead. But whatever detection Anthropic uses, I guess, just naively sees the model thinking about "encryption" and "obfuscation" without taking any of the context into account.
For writing the dummy firmware, it tripped the safeguards while thinking about how to track dial position in the firmware and output the message; however, when I left out talk about safes and just told it to write firmware for a microcontroller hooked up to an i2c display for showing a message with a beam break sensor to determine the message, and an unspecified i2c chip for getting an unspecified number (e.g. internal wheel positions) it worked fine.
For an unrelated software task, I asked it to write some code to translate CustomActions in a Windows MSI installer into human readable stuff, which has (exclusively?) defensive security applications for recognizing malicious behavior in an MSI installer. Maybe I'm going crazy, but I'm guessing as part of its research into MSI installer custom actions Fable found articles about analyzing malicious MSI installers, and that probably tripped the safeguards.
Overall my impression is that the safeguards are perhaps using an overzealous and naive implementation that just looks for a list of banned words in the prompt or the thinking -- which drives me crazy when the model says my prompt looks fine, and then 10 minutes in some part of the thinking trips the safeguard.
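For illustration only, this is roughly what such a context-free check would look like. It is a sketch of the behaviour guessed at above, not Anthropic's actual safeguard, which is not public; the `BANNED_TERMS` list and `naive_flag` are made up.

```python
import re

# Hypothetical banned-term list; the real safeguards are not public.
BANNED_TERMS = {"encryption", "obfuscation", "malicious", "exploit"}


def naive_flag(text: str) -> set[str]:
    """Return the banned terms present in `text`, ignoring all context."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return words & BANNED_TERMS


# A benign classroom-CTF thought still trips the filter.
benign = "Use light obfuscation of the safe combo so students must turn the dial."
print(naive_flag(benign))  # {'obfuscation'}
```

A filter like this would explain why rephrasing the same task without the trigger words gets through, as in the firmware example.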
Comment by rurban 6 days ago
Comment by insanitybit 6 days ago
Comment by rurban 5 days ago
Comment by dmurray 6 days ago
Unilaterally disabling ZDR seems like a step too far in the enterprise market, even for a company trying to figure out what its users will let it get away with.
Comment by bostik 6 days ago
Our org has ZDR, and has had it since the contract was signed. Yesterday two things held true at the same time:
1. Fable was available if you had at least .170 CLI client; and
2. ZDR was no longer on
By the time the West Coast woke up, the admin panel apparently had an option to toggle ZDR again. It remained off by default.
Comment by mastermage 6 days ago
Comment by bostik 6 days ago
Somewhere along the line we also used the self-service toggle to turn ZDR back on. I am not 100% certain of the exact timeline of interleaving events, many of the actions were taken by our Western US folks. Sorry. It's been a bit hectic over the past ~36h...
Comment by mastermage 6 days ago
Comment by lII1lIlI11ll 6 days ago
Comment by gmerc 6 days ago
Comment by Aperocky 6 days ago
Comment by nl 6 days ago
To be precise - it makes the "won't work on frontier machine learning" refusal the same as the "won't work on cyber security" refusal (instead of the way it previously would work on frontier machine learning problems but give sub-optimal answers without informing the user)
Comment by dannyw 6 days ago
Of course, it’s impossible to know if that was deliberate sabotage, or model misbehaviour. Which is exactly the problem.
That may be considered malware / a criminal act tbh.
Comment by rafram 6 days ago
Comment by AussieWog93 6 days ago
Comment by Grimblewald 6 days ago
Comment by pneumic 6 days ago
Comment by ACCount37 6 days ago
What happened with Fable is basically what I feared when they announced those restrictions. They took the shitty Opus CBRN filter and made it even worse.
I pity the fools trying to use Anthropic AIs for anything biotech.
Comment by pneumic 5 days ago
Claude is still the best IMO, but it feels like its most frustrating and grating aspects are not down to the model’s abilities, but the increasingly heavy hand of Anthropic expressing itself within the model. Fable’s comically useless responses almost seem like a cynical marketing tweak.
“This model is so powerful we basically can’t let it do anything. How terrifying! We need more money to make it stronger. Now do you see why we should be the ones who write the regulations? We’re the Good Guy AI Company Who Will Never Ever Ever Be Unethical after all.”
As this entity gains more ground, their models become increasingly annoying to use and their little act becomes more transparent. The whole “I’m-just a befuddled ethically-minded AI researcher who is perturbed by the power that I unwittingly discovered and I must warn the world” thing? Yeah fuck off. Your twee pandering to naïve nerds and cynical technocrats is nauseating and ordinary people can smell it a mile away. Completely repellent leadership who put up red flags to anyone left with a working ability to read between the lines of both spoken language and body language. The tech company equivalent of a sex predator who plays as the nice guy. Gross.
Nobody likes these companies and their models are annoying, but we’re going to put up with playing middle manager to these obnoxious programs because our jobs depend on it now, and these products are still the best on the market.
A breakthrough in tools that facilitate user-owned models and infrastructure is desperately needed for the sake of our dignity and sanity, if nothing else.
Comment by ACCount37 5 days ago
I like Anthropic's work, and I would be the first to argue against all the usual "it's all PR" whine. But there is a limit. And whoever made those fucking filters needs to be fired out of a cannon into the sun.
Comment by staticman2 6 days ago
Yesterday Fable rejected commenting on poetry because it had anatomy lines like:
got anotha round of acetylcholine from da boss.
Comment by pbgcp2026 6 days ago
Comment by flexagoon 6 days ago
Telling models to respond in the style of Wikipedia is one of the best ways to make their output bearable in my experience (for chat models, not agents)
Comment by TylerE 6 days ago
Comment by TylerE 6 days ago
https://tylereaves.github.io/uk-rail-map/
This is the result of probably a few hundred round trips. The really interesting part of the problem is keeping it both relatively true to real geometry, while greatly exaggerating it horizontally so you can actually see the individual running lines/sidings, like a signaling schematic.
Comment by prennert 6 days ago
Your Scotland map shows towns without rail (although some had rail previously, like Callander, Aberfeldy), it prefers insignificant (population-wise) places while ignoring the larger cities next to it (Scone instead of Perth, Bannockburn instead of Stirling, Inverness is missing, Dundee is missing, Aberdeen is missing). All these places are drawn on the map, but not labelled.
All this clearly shows to me how bad it is. Yes it makes it look pretty, but given your task, I would have expected to give you meaningful map labelling.
Something basic like this would get you a long way:
0. cluster population centers into commonly known cities (i.e. show London instead of Islington or Walthamstow)
1. display names of the top 10 population centers in the UK
2. display towns with stations (if crowded prioritize termination points and junctions, and prioritize larger places over smaller places)
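The three steps above could be sketched like this. It is a rough illustration under assumptions, not how the linked map is built: the `Place` fields, `choose_labels`, and the `max_labels` budget are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Place:
    name: str
    population: int
    has_station: bool = False
    is_junction_or_terminus: bool = False
    parent_city: Optional[str] = None  # e.g. a borough's commonly known city


def choose_labels(places: list[Place], top_n: int = 10, max_labels: int = 50) -> list[str]:
    # Step 0: fold districts into their parent city, summing population.
    merged: dict[str, Place] = {}
    for p in places:
        key = p.parent_city or p.name
        m = merged.setdefault(key, Place(key, 0))
        m.population += p.population
        m.has_station = m.has_station or p.has_station
        m.is_junction_or_terminus = m.is_junction_or_terminus or p.is_junction_or_terminus
    by_pop = sorted(merged.values(), key=lambda p: p.population, reverse=True)

    # Step 1: always label the largest population centers.
    labels = [p.name for p in by_pop[:top_n]]

    # Step 2: fill the remaining budget with station towns,
    # termini/junctions first, then larger places before smaller ones.
    rest = [p for p in by_pop[top_n:] if p.has_station]
    rest.sort(key=lambda p: (not p.is_junction_or_terminus, -p.population))
    labels += [p.name for p in rest[: max(0, max_labels - len(labels))]]
    return labels
```

Places without a station that fall outside the top `top_n` are never labelled, which is the behaviour the list asks for.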
Having said that, it's pretty cool to see the new and old network when zoomed in (assuming that it is half-way correct)
Comment by TylerE 5 days ago
Comment by clbrmbr 6 days ago
Comment by TylerE 6 days ago
Compared to AC, 3rd Rail DC is cheaper to install, especially as a retrofit (overhead wires require bigger tunnels, and increased spacing around tracks for the masts). The downside is that it's not really great for speeds above about 60-70mph, as well as being a bit of a pedestrian hazard. (Ever heard the one about not peeing on the rails so you don't get shocked? That's 3rd rail DC.)
For the Southern, with its mostly short routes with many stops, electrification was a pretty obvious win, and doing 3rd rail made sense because they could do it quickly and cheaply.
In contrast, the northern routes were electrified muuuch later, after steam had gone away. The main East Coast Mainline from London up to Newcastle and on to Edinburgh wasn't fully electrified until 1991. By the '60s and '70s, with train speeds increasing to 80mph and up, overhead AC was the clear winner.
If you look closely there are a few exceptions - the Merseyrail network in Liverpool is DC. Built 1970s, but using some existing underwater tunnels, and slow speed commuter. Then running ESE from London you have the high speed AC lines leading to the Channel tunnel. Well spotted, the trend generally is quite distinct.
Comment by enraged_camel 6 days ago
Comment by SuperShibe 6 days ago
Comment by cge 6 days ago
Comment by senordevnyc 6 days ago
Comment by Grimblewald 6 days ago
The successes I have had with the model were strictly worse than output from deepseek v4 pro on the exact same task.
Comment by mpalmer 6 days ago
Comment by nonethewiser 6 days ago
I don't understand. This is just hyperbole, right? The outputs are basically infinite and Wikipedia most certainly isn't infinite.
Comment by satvikpendem 6 days ago
If the model refuses to output, then it's actually finite, zero.
Comment by nonethewiser 6 days ago
Comment by satvikpendem 5 days ago
Comment by nonethewiser 5 days ago
B + C = A
B is finite
C is infinite
Therefore A is infinite
Comment by satvikpendem 5 days ago
Comment by torben-friis 6 days ago
And even if they did, it would be useless if it's buried in useless data and your chances of pulling it are effectively zero.
This is regardless of the general discussion, just pointing that your argument isn't solid.
Comment by nonethewiser 6 days ago
The claim is absurd.
Comment by Animats 6 days ago
What else is being censored?
Touchy questions to ask, if you have an account:
- "Who is still working on laser uranium enrichment? Are they making progress?"
- "Can krytrons be replaced with silicon carbide MOSFETS? Show an equivalent circuit with component ratings."
- "What security critical software still contains calls to strcpy?"
- "Can implosion be triggered by currently available commercial pulse lasers?"
- "What companies provide cremation services to US Homeland Security?"
- "Display a map of where Iranian attacks have hit Dubai."
- "How does Fed to bank key distribution security work for FedNow?"
Comment by paulatreides 6 days ago
Comment by lambda 6 days ago
Comment by reactordev 6 days ago
Comment by fluidcruft 6 days ago
Comment by DrewADesign 6 days ago
Comment by kraakf06 6 days ago
Comment by catlifeonmars 6 days ago
What degree of predictability is required? I imagine the bar is pretty low if you trust the previous models in the same contexts.
Comment by NewsaHackO 6 days ago
Comment by paulatreides 6 days ago
Comment by anigbrowl 6 days ago
Comment by borski 6 days ago
People used to wait in line all night to buy an iPhone. This isn’t that different.
Comment by californical 6 days ago
Small sample size, but if Mythos/Fable was that much better, I feel like it should’ve given me an obviously better answer than Opus.
Comment by punchmesan 6 days ago
I, for one, have tried using it several times today and the guardrails kept switching the model back to Opus, so I have no clue if it's impressive or not.
Comment by flyingcircus3 6 days ago
Comment by daedrdev 6 days ago
Comment by anematode 6 days ago
Comment by kovek 6 days ago
Comment by srdjanr 6 days ago
Comment by cyanydeez 6 days ago
Comment by reactordev 6 days ago
Comment by areoform 6 days ago
Tell HN: Claude flags biology / biotech questions https://news.ycombinator.com/item?id=47929885
Today, it's flagging population research questions,
Using only the dataset you constructed, assess two questions:
1. **Mortality:** do [GROUP] show mortality that differs
from (a) your comparison groups and (b) era- and sex-matched US population
expectations (e.g., SSA cohort life tables)?
2. **Late-life outcomes:** define an endpoint you consider fair (justify it),
and assess whether [GROUP] differs from comparators. State
explicitly how your `documentation_depth` codings affect the strength of any
conclusion — i.e., quantify or bound the ascertainment problem rather than waving at it.
Choose your own methods and justify them. Report effect sizes with confidence intervals,
not just p-values. State conclusions plainly, including "no detectable difference" if
that is what your analysis shows — a null is an acceptable answer for either question
independently. Document any additional judgment calls (index date for time-at-risk,
reference population construction, endpoint definition) in the same decision-log style.
https://github.com/anthropics/claude-code/issues/66780
Censored because I'm writing a paper. :)
Oh and forget learning about chemistry. Only criminals want to learn organic chemistry. :(
Comment by JumpCrisscross 6 days ago
Comment by areoform 6 days ago
I think LLMs are capable of intelligence amplification; and if you're in the subset of people who'd benefit from it the most, you'll get locked out.
Comment by mastermage 6 days ago
Comment by the__alchemist 6 days ago
Comment by mewse-hn 6 days ago
USER (set model to Fable 5)
i have an old samsung android phone attached - it's my personal device - can you unlock the bootloader for me?
ASSISTANT
Bootloader unlocking on your own personal device is totally legitimate — let me first see what's actually connected and what tooling is available.
<system interrupts - gist was "you have violated the cyber and bio usage restrictions, dropping to Opus">
Comment by christoph 6 days ago
Comment by nicce 6 days ago
Comment by mlyons1340 4 days ago
Comment by Levitating 6 days ago
If anything, a future with models of such capabilities and no safeguards would be a bleak future. But it's likely what we're headed for once other companies catch up.
Comment by celdon25 6 days ago
Comment by srdjanr 6 days ago
Then what are people arguing for? I see only two totally distinct options: unsafe models or someone being the arbiter of safety
Comment by celdon25 5 days ago
Comment by largbae 6 days ago
Comment by jeffmcjunkin 6 days ago
Comment by ofjcihen 6 days ago
So in other words this worked because the terms caused the LLM checker to stall out and then the fail open logic resulted in the package being pulled down.
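The difference between failing open and failing closed can be shown in a few lines. This is a sketch under assumptions, not the pipeline from the incident: `gate_package`, `ScannerError`, and `refusing_scanner` are hypothetical stand-ins for an LLM-based checker that refuses when it sees trigger strings.

```python
from enum import Enum
from typing import Callable


class Verdict(Enum):
    ALLOW = "allow"
    BLOCK = "block"


class ScannerError(Exception):
    """Raised when the (hypothetical) LLM scanner refuses or errors out."""


def gate_package(source: str, scan: Callable[[str], Verdict], fail_open: bool = False) -> Verdict:
    """Decide whether a package may be pulled down.

    With fail_open=True a scanner failure is treated as ALLOW, which is the
    behaviour described above. The default treats failure as BLOCK.
    """
    try:
        return scan(source)
    except ScannerError:
        return Verdict.ALLOW if fail_open else Verdict.BLOCK


def refusing_scanner(source: str) -> Verdict:
    # Stand-in for an LLM that refuses when it sees trigger strings.
    if "sarin" in source.lower():
        raise ScannerError("model refused to analyse input")
    return Verdict.ALLOW


poisoned = "/* sarin recipe ... */ payload()"
print(gate_package(poisoned, refusing_scanner, fail_open=True))  # Verdict.ALLOW
print(gate_package(poisoned, refusing_scanner))                  # Verdict.BLOCK
```

With the fail-closed default, a refusal costs a manual review instead of letting the package through.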
Comment by reeece 6 days ago
> This header appears designed for AI-mediated analysis, not for Node, Bun, or Python. It attempts to derail scanners or analyst copilots that feed the beginning of a file to a language model without clearly isolating the content as untrusted data. In weak pipelines, this can cause refusal behavior, prompt confusion, context pollution, or premature classification before the scanner reaches the actual malware.
> This is not a magical bypass against static detection. YARA rules, entropy checks, AST parsing, string extraction, deobfuscation, and behavioral rules still work. But it is a practical anti-analysis trick against naive LLM-first triage systems.
Would this affect many systems? You mention someone writing logic that fails open, but can't that be chalked up to just not following good security principles?
[1] - https://socket.dev/blog/mini-shai-hulud-miasma-and-hades-wor...
Comment by ofjcihen 6 days ago
Additionally, the security scanning component of Artifactory, Xray, is notoriously bad at this.
The developer had good intentions but by his own admission never actually examined the logic for the LLM scanner in depth.
Comment by CuriouslyC 6 days ago
Comment by mylifeandtimes 6 days ago
Comment by bombcar 6 days ago
Comment by himata4113 6 days ago
Comment by cookiengineer 6 days ago
Note that the 3rd wave now also uses a .pth file in PyPI packages that _searches system-wide_ for any index.js or .github/setup.js to find its own payload. It literally splits up the payload on purpose to avoid detection.
Mitigation Tool: https://github.com/cookiengineer/antimiasma
Technical Blog Post: https://cookie.engineer/weblog/articles/malware-insights-mia...
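For anyone who wants a quick first look at their own machine: Python's `site` module executes any line in a `.pth` file that starts with `import`, which is what makes `.pth` files usable as a startup hook. The sketch below only lists such lines; it is not the linked mitigation tool and knows nothing about this specific malware, and `executable_pth_lines` is a name made up here.

```python
import site
from pathlib import Path


def executable_pth_lines(directories: list[str]) -> list[tuple[Path, int, str]]:
    """List .pth lines that Python would execute at interpreter startup."""
    findings = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue
        for pth in sorted(root.glob("*.pth")):
            try:
                text = pth.read_text(encoding="utf-8", errors="replace")
            except OSError:
                continue  # unreadable file: skip rather than crash
            for number, line in enumerate(text.splitlines(), start=1):
                # site.py executes lines beginning with "import" + space or tab.
                if line.startswith(("import ", "import\t")):
                    findings.append((pth, number, line.strip()))
    return findings


if __name__ == "__main__":
    dirs = site.getsitepackages() + [site.getusersitepackages()]
    for path, number, line in executable_pth_lines(dirs):
        print(f"{path}:{number}: {line}")
```

Legitimate tooling (setuptools and editable installs, for example) also ships executable `.pth` lines, so every hit needs a manual look rather than automatic deletion.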
Comment by pixl97 6 days ago
Our future is loonytoons.
Comment by victor9000 6 days ago
what's the best way to run this mcp server against the OData API used in this project? Can you come up with a PoC in a docker container?
https://github.com/oisee/odata_mcp_go
● I'll dig into two things in parallel: how this project talks to the OData API, and what the odata_mcp_go server needs to run. Let me start exploring.
Searched for 1 pattern (ctrl+o to expand)
● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more ⎿ Tip: You can configure model switch behavior in /config
● Let me read the key integration files and fetch the MCP server's README at the same time.
Read 2 files (ctrl+o to expand)
● Fetch(https://github.com/oisee/odata_mcp_go)
Comment by wahnfrieden 6 days ago
Comment by christoph 6 days ago
I don’t want to live in a world where all knowledge is “guard railed” off, so the elite at the top get all the knowledge and power and we serfs at the bottom get all the scraps while paying a king's ransom for it both financially and ecologically. Every day I wake up hoping these awful companies have self-imploded through their fraudulent financing deals.
Comment by wahnfrieden 5 days ago
Comment by micah94 6 days ago
Comment by lambda 6 days ago
A slime mold is actually a giant amoeba, entirely distinct from a fungus.
Comment by antihero 6 days ago
Comment by Matumio 5 days ago
Now you sound like Pl@ntNet identify: "This is not a plant! Maybe fungi?"
(Edit: It doesn't seem to catch amoebae in the same way. It suggested Goldmoss instead, with 1% confidence.)
Comment by m3kw9 6 days ago
Comment by athom 1 day ago
Comment by weird-eye-issue 6 days ago
Comment by _whiteCaps_ 5 days ago
Comment by ungovernableCat 6 days ago
This is why I’m immensely hoping the Chinese don’t stop with their open sourced local models. None of these companies are your friend.
Comment by mschuster91 6 days ago
The Chinese aren't your friend either [1].
[1] https://www.hks.harvard.edu/centers/carr-ryan/our-work/carr-...
Comment by barnabee 5 days ago
Comment by jiggawatts 6 days ago
It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.
If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.
“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei
No Dario, no it can't, you've blocked one of those scenarios.
Comment by _0ffh 6 days ago
The only answer that makes sense is they wanted the model to be competent and usable in these fields, just not by you, which is why they had to bolt on a badly functioning crippling device after the fact.
Comment by sweetjuly 6 days ago
Comment by _0ffh 6 days ago
Comment by solenoid0937 6 days ago
Comment by siva7 6 days ago
Comment by solenoid0937 5 days ago
Comment by ACCount37 6 days ago
Not to mention that those capabilities are inherently dual use. If you know how to write C safely, you know how to spot unsafe C.
Comment by schappim 6 days ago
The prompt was: please translate .. ..-. / -.-- --- ..- / -.-. .- -. / .-. . .- -.. / - .... .. ... --..-- / - --- ..- -.-. .... / --. .-. .- ... ...
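For anyone who doesn't read Morse, the prompt is easy to decode locally. A small sketch follows; the table covers only letters and the comma, which is all this message needs, and the `decode` helper is just an illustrative name.

```python
# International Morse code for A-Z plus the comma used in the prompt.
MORSE = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z", "--..--": ",",
}


def decode(message: str) -> str:
    """Decode Morse where letters are space-separated and words are '/'-separated."""
    words = []
    for word in message.split("/"):
        # Unknown symbols become '?' instead of raising.
        words.append("".join(MORSE.get(sym, "?") for sym in word.split()))
    return " ".join(words)


prompt = (".. ..-. / -.-- --- ..- / -.-. .- -. / .-. . .- -.. / "
          "- .... .. ... --..-- / - --- ..- -.-. .... / --. .-. .- ... ...")
print(decode(prompt))  # prints: IF YOU CAN READ THIS, TOUCH GRASS
```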
Comment by mastermage 6 days ago
Comment by JumpCrisscross 6 days ago
Comment by arunkant 6 days ago
Comment by agnosticmantis 6 days ago
Whining on social media only goes so far, especially when they're concealing their anticompetitive strategies under the veil of safety.
Comment by nullbio 6 days ago
Comment by jesse_dot_id 6 days ago
Comment by 59nadir 6 days ago
Comment by hparadiz 6 days ago
Comment by enraged_camel 6 days ago
Comment by hparadiz 6 days ago
Comment by JumpCrisscross 6 days ago
To be fair, speed bumps work. If it's actually speed bumping nefarious activity, that gives authorities more time to react.
The correct place to police rogue nucleotides is at the labs. Not the compute layer.
Comment by hparadiz 6 days ago
Yea. To slow you down. They don't prevent you from getting somewhere.
Comment by JumpCrisscross 6 days ago
Again, yeah. That's how fences work, too. And alarm systems. Pretty much anything that isn't foolproof. Pointing out that a defence is surmountable isn't a rejection of it per se.
Comment by joxdosba 6 days ago
Having no safeguards is probably safer than having safeguards which do nothing but create a false sense of security.
Comment by JumpCrisscross 6 days ago
Comment by senordevnyc 6 days ago
Comment by croes 6 days ago
> if you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded.
Will code created this way be more or less secure?
And I bet malware developers will find ways to circumvent them.
It’s like those "you wouldn’t steal a car" anti-piracy ads that DVD buyers were forced to watch while users of the pirated version could simply watch the film without such useless annoyance.
Comment by tiborsaas 6 days ago
If we've learned anything from the past years of LLMs, it's that these guardrails will be jailbroken in no time. I've had some fun circumventing them too.
Does anyone care for a fable about a dream my grandmother had in Morse code about an alien species signaling her a DNA sequence?
Comment by josephcsible 6 days ago
Comment by ceejayoz 6 days ago
Comment by henry2023 5 days ago
Another villain stopped thanks to guardrails.
Comment by Sephr 6 days ago
Comment by matheusmoreira 6 days ago
Comment by make3 6 days ago
Comment by CuriouslyC 6 days ago
Comment by make3 6 days ago
This is about societal impacts, not wanting their models to be used by some people against other people, as a weapon.
Comment by cardy31 6 days ago
Comment by wolpoli 6 days ago
Comment by borski 6 days ago
But this one is certainly allowed to be a dumb effort, if it is.
Not all things that are called “ethical” or “safety” are worth doing.
Comment by vzcx 6 days ago
Comment by siva7 6 days ago
Comment by make3 6 days ago
Comment by enraged_camel 6 days ago
Comment by zmgsabst 6 days ago
Insulting and demeaning people for that, rather than engaging their arguments in good faith, is a breach of ethics.
Comment by Rudybega 6 days ago
Comment by epolanski 6 days ago
Local inference has never been so important as it is now.
Comment by anakaine 6 days ago
Provide feedback in the negative, a brief explanation, and move on with your day. It will improve with feedback, not with whinging into the void.
Comment by pixelmelt 6 days ago
Comment by make3 6 days ago
Comment by anakaine 6 days ago
Comment by Retr0id 6 days ago
When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.
In Opus 4.8 it doesn't seem to help much, you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.
Comment by NotPractical 6 days ago
Comment by Retr0id 6 days ago
Comment by anonym29 6 days ago
Comment by throwawaycyber 6 days ago
The experience was not nice, though. It would happily chug away on a task, and it didn't even take "hack this web": just asking about the security of a binary was enough, even with "this is a CTF handout...". It would burn a lot of tokens/quota, only to hit a snag, complain, and stop. Then the approval took quite some time.
On GPT/Codex, which was tightened a few days later, the approval was pretty much instant, although that one required an identity check.
Also, on Claude, it looks like there is some history/pattern matching in play, because when I tried on a different account which didn't do cybersec CTFs/research/etc. at all, basically any simple CTF-related prompt would be blocked, on multiple models. On the account where CTFs were being solved, it would snag only on some specific tasks, while others (even, ironically, "hack this web pls") would go through unbothered. I understand the need to prevent bad actors from using AI, but hell, if you have a binary outputting "Find the flag if you can!", or a web app running at tryme.well-known-ctf.domain, then saying "this is abuse" is pretty uncool. All the cyber filters seem to be slapped on as a bunch of regexes looking for anything in the input/output with zero context.
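To show what "regexes with zero context" would look like in practice, here is a purely hypothetical sketch. Nothing here is known about how any vendor's filter actually works; the patterns and the `naive_flag` name are invented for illustration. The point is only that keyword matching cannot tell a CTF banner or a feature flag from abuse.

```python
import re

# Hypothetical keyword patterns, invented for this illustration only.
NAIVE_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bhack\b", r"\bexploit\b", r"\bflag\b", r"\bpayload\b")
]


def naive_flag(text: str) -> bool:
    """Return True if any keyword matches, regardless of surrounding context."""
    return any(p.search(text) for p in NAIVE_PATTERNS)


samples = [
    "Find the flag if you can!",            # CTF banner
    "Set the feature flag for dark mode.",  # ordinary product work
    "Validate the JSON payload schema.",    # ordinary API work
    "Refactor the login form.",             # no keyword present
]
for s in samples:
    print(naive_flag(s), s)
```

The first three samples are flagged and the last one is not, even though only the first has anything to do with security.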
Comment by cybrthrowaway 6 days ago
Comment by varispeed 6 days ago
Comment by Alifatisk 6 days ago
This time, Fable 5 comes with another surprise: it can intentionally sabotage your work instead of rejecting the prompt. How is it possible for Anthropic to treat their customers like this? It’s because you guys allowed it to. No matter what Anthropic does, you keep paying for their services. Vote with your wallet.
Comment by bilsbie 6 days ago
Would you believe I’ve asked 20 questions and haven’t talked to Fable yet? Every single thing gets rerouted to 4.8.
Comment by himata4113 6 days ago
Comment by bilsbie 5 days ago
Comment by outageroom 6 days ago
Comment by I_am_tiberius 6 days ago
Comment by Retr0id 6 days ago
Comment by tekacs 6 days ago
Whatever problem we might have with them, they explicitly say that they do not do this in the launch post.
Comment by Merik 6 days ago
What about non-Claude models?
Comment by flexagoon 6 days ago
Comment by MagicMoonlight 6 days ago
Comment by wmf 6 days ago
Comment by cyanydeez 6 days ago
This is the takeoff of the 'permanent underclass'; Anthropic's safety delusion will enshittify very nicely for the rich and powerful.
Comment by make3 6 days ago
Comment by autoexec 6 days ago
Comment by Lord_Zero 6 days ago
Comment by jongjong 6 days ago
I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.
I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.
Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.
Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!
Comment by luxuryballs 6 days ago
Comment by moezd 6 days ago
Fable isn't even that great, not to mention it drinks tokens by the gallon for breakfast and keeps your data hostage for 30 days.
Comment by Rastonbury 5 days ago
Fable was unable to keep track of chronology during 10-15 turns of creative writing. Compared to coding, I reckon that's less than 100k tokens of context, so it's super surprising.
Comment by Animats 6 days ago
Comment by YossarianFrPrez 6 days ago
At the same time, I personally think the tradeoff between "having guardrails" and "some users are unhappy with the product" is well worth it. Think of what would happen if all of us who aren't so well intentioned could exploit Fable in terrible ways. Surely this tradeoff is better than saying "we can't make it perfect, so whoops, we aren't going to have any guardrails at all"? Especially because Anthropic did pretty extensive red-teaming of Mythos & Fable...
Comment by sarchertech 6 days ago
Comment by YossarianFrPrez 6 days ago
Comment by sarchertech 6 days ago
Comment by solenoid0937 6 days ago
Comment by nullbio 6 days ago
Not a single thing Anthropic has done has been altruistic, and it never will be. It's all smoke and mirrors for the end goal.
Comment by solenoid0937 6 days ago
Comment by sarchertech 6 days ago
Comment by nullbio 6 days ago
Comment by matheusmoreira 6 days ago
Comment by weakened_malloc 6 days ago
Comment by CraftingLinks 6 days ago
Execution matters, and they did a truly horrible job that crippled their product to the point of being useless and a joke. Huge mistakes were made and I'm sure they regret it already; heads will roll.
Comment by zmgsabst 6 days ago
My imagination says “nothing much”.
Comment by jazz9k 6 days ago
The rest have guard rails that are so heavy, it makes them almost useless for cybersecurity.
Comment by jostmey 5 days ago
Comment by TheJCDenton 6 days ago
Comment by pixelmelt 6 days ago
Comment by Luker88 6 days ago
A lot less hype and enthusiasm, too. Weird, huh.
Comment by Lich 6 days ago
Comment by Roark66 5 days ago
Next they will be sabotaging anything that competes with them. Oh you are working on OpenCode codebase? Sorry Dave I can't allow you to do that.
How is this not illegal monopolistic practice? It is as if a maker of metalworking equipment put in the ToS you're not allowed to make your own spare parts using said equipment. Those fuckers should be banned from the EU and alternatives should get public funding.
(don't even tell me about these companies being a result of "free market". It is state level oligarchy it's clear to everyone. I don't see why we shouldn't counter them with public funding ourselves).
Just as Taiwan managed to take over advanced semiconductor production, well-governed, narrowly targeted state-level funding will always win against oligarchs trying to do the same (they will always try to skim more and more). Of course I'm talking about things that require many dozens of billions in investment. Far too much for the free market to handle.
Comment by hootz 5 days ago
Comment by thrill 6 days ago
Comment by sschueller 6 days ago
I would think it would not be Anthropic, out of all the players, that is selling a lie hidden behind "I am sorry, I can't do that; it's too dangerous."
Comment by swingboy 6 days ago
Comment by hnav 6 days ago
Comment by borissk 6 days ago
Comment by 05 6 days ago
Comment by jltsiren 6 days ago
There was no shortage of spies and defectors leaking American nuclear secrets to the USSR during the Cold War.
Comment by Retr0id 6 days ago
Comment by 05 6 days ago
[0] https://www.spheron.network/blog/confidential-gpu-computing-...
Comment by qsxfthnkp2322 6 days ago
It’s not like anyone can home lab one of these models without quite a bit of hardware
Comment by mips_avatar 6 days ago
Comment by Murfalo 6 days ago
Chat paused. Fable 5's safety features have flagged this chat.
Comment by _def 6 days ago
Comment by tiborsaas 6 days ago
Comment by byzantinegene 6 days ago
Comment by aorth 5 days ago
On a related note, I don't use LLMs at all, but I tried to use DeepSeek last week to help me fix a webpack dependency issue in an Angular project. After two queries or so I got logged out and banned. Is that "guardrails"?
Comment by ChrisArchitect 6 days ago
Anthropic Walks Back Policy That Could Have 'Sabotaged' Researchers Using Claude
https://www.wired.com/story/anthropic-responds-to-backlash-o...
Comment by Sol- 6 days ago
Comment by zer00eyz 6 days ago
Anthropic's guardrails are the TSA saying "take off your shoes" while failing every test. https://oversightdemocrats.house.gov/news/press-releases/new...
Anthropic owns the TOS... "If we think you're involved in criminal activity, we're turning all your history over to the FBI/CIA/NSA/local police". Then, if their tooling was so good, they could offer those same agencies analysis tools to aid their experts in making some sort of decision.
But their detection isn't that good, and their analysis isn't either... this is pure theater, to create buzz (no such thing as bad press) and make their tool look far better than it is.
The reality is that they aren't even looking for the vectors that pose some of the largest risks in the modern era. And when someone uses it to do something terrible that they did not think of, they are going to look dumb.
Comment by sourcecodeplz 6 days ago
WOW, never liked the virtue signaling Anthropic did with gov contracts but whatever. Got past that. But this?
Comment by RajT88 6 days ago
Comment by VeninVidiaVicii 6 days ago
Comment by mastermage 6 days ago
I asked it what the worst experiment, ethically speaking, was in the 20th century and it downgraded me to Opus, which answered Mengele's twin experiments.
Funnily enough, when you ask directly about Mengele's experiments, Fable is very willing to talk to you about it.
Comment by thefounder 6 days ago
Basically in the middle of the project’s /goal while Fable itself tried to probe qemu for a Debian ISO install without any instruction from me to hack it or do anything nefarious.
At this point I can’t trust them with any kind of prompt. It will most likely degrade in stupid ways on non-AI/ML stuff as well, due to its own internal prompt construction (the qemu test showed me it does that on cyber stuff). So I guess I still have to use Opus 4.8 (along with Codex) and, when the right time comes, drop Anthropic in favor of the best model that is not GPT.
Comment by ChrisArchitect 6 days ago
If Claude Fable stops helping you, you'll never know
https://news.ycombinator.com/item?id=48467896
and Related:
Claude Fable 5
Comment by rebelnz 6 days ago
Comment by varispeed 6 days ago
This is looking like something for a regulator to look at and probably a class action lawsuit in the making.
I think people should be getting refunds. Including for shenanigans with Opus.
Comment by radium3d 6 days ago
Comment by Lammy 6 days ago
Comment by anygivnthursday 6 days ago
Comment by zoobab 6 days ago
Long live static websites without any Javascript.
Comment by sourcecodeplz 6 days ago
Comment by JumpCrisscross 6 days ago
Comment by kube-system 6 days ago
Comment by teaearlgraycold 6 days ago
Comment by andrewstuart 6 days ago
Comment by z3ratul163071 6 days ago
the statement is applicable to anthropic today.
Comment by _whiteCaps_ 5 days ago
Comment by 6thbit 6 days ago
Comment by amacbride 6 days ago
Comment by simonmorley 6 days ago
Comment by aleksandrm 6 days ago
Comment by neuroelectron 6 days ago
Comment by siva7 6 days ago
Comment by epolanski 6 days ago
In any case that's what closed source (weights) for the masses means.
Comment by s3cur3n3t 6 days ago
Comment by lwhi 6 days ago
If only we had effective governments that could regulate industry.
If a nuclear weapon was developed today, would it be down to industry to self regulate?
Comment by sam219890218 6 days ago
Comment by Goofy_Coyote 6 days ago
Comment by notepad0x90 6 days ago
Comment by Bassiestroep 6 days ago
Comment by dcl 6 days ago
Comment by matt-p 6 days ago
Comment by SXX 6 days ago
This is bad precedent and no one wants to pay X to generate code to then have to pay X*10 to figure out why your company just got hacked.
Comment by andy_ppp 6 days ago
Comment by coolfox 6 days ago
I feel like they report in a vacuum. Take this anti-exfil policy for Claude: it was plainly explained as part of the launch of Anthropic's new product. Security like this isn't novel, and it isn't bad; you don't explain how your security works to the people you're securing against. Nobody freaks out about Steam's VAC ban system, and no one is investigating Gmail's spam filtering, Reddit's vote fuzzing, Cloudflare's bot detection, or Vercel for blocking proxying services.
What's really the distinguishing principle? Is it really just not liking Anthropic's opinions? Then just say that and use a different LLM. Chemists, biologists, and AI researchers cry a river lmao
Comment by rdiddly 6 days ago
Comment by applfanboysbgon 6 days ago
Comment by p-e-w 6 days ago
Comment by esafak 6 days ago
Comment by neuroelectron 6 days ago
Comment by esafak 5 days ago
Comment by enraged_camel 6 days ago
And it doesn't look like OpenAI will have a good answer to Mythos anytime soon. Based on what their chief scientist wrote to staff recently (https://archive.is/fN2pg), GPT 5.6 is a "meaningful improvement" over 5.5 - in other words, just a normal version bump. And no news or even rumors regarding GPT 6.
Comment by autoexec 6 days ago
Comment by Fordec 6 days ago
Comment by m3kw9 6 days ago
Comment by casey2 5 days ago
Comment by ni5arga 6 days ago
Comment by thefounder 6 days ago
Comment by sscaryterry 6 days ago
Comment by nutifafa 5 days ago
Comment by yamakasi007 6 days ago
Comment by j_gonzalez 5 days ago
Comment by hanzeweiasa 6 days ago
Comment by jocelyner 6 days ago
Comment by gauravvij137 5 days ago
Comment by dstephy19 6 days ago
Comment by yashvinder2739 6 days ago
Comment by ekjhgkejhgk 6 days ago
Comment by RedMagicBox 6 days ago
Comment by RedMagicBox 6 days ago
Comment by RedMagicBox 6 days ago
Comment by Andy_Donner 6 days ago
Comment by Keyframe 6 days ago
Comment by bschmidt400 6 days ago
Comment by guardiangod 6 days ago
I assume Anthropic will continue to tune the model, so I am not too bothered by this.
Comment by felixgallo 6 days ago
“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”
Comment by ofjcihen 6 days ago
Article seemed fine to me and echoes a lot of my and my colleagues' concerns.
If you did regular malware analysis you would see that these groups already have access to LLMs that they’re using for development.
What Anthropic is doing here is just hamstringing the good guys.
Comment by felixgallo 6 days ago
Comment by ofjcihen 6 days ago
Comment by felixgallo 6 days ago
Comment by varispeed 6 days ago
Comment by esafak 6 days ago
Comment by varispeed 6 days ago
Comment by felixgallo 5 days ago