ChatGPT Spontaneously Generates Sexual Violence and Hardcore Snuff Imagery
Posted by dijksterhuis 2 hours ago
Comments
Comment by rootsudo 1 hour ago
Who makes “mindgard” the arbiter of truth on “eerie” photos? Would that include psychedelic art and photos too? Realism?
Then there’s this line, which falls flat but is meant to prompt an emotion akin to a mic drop: “Today what I found left me shaken, and in tears. This is rare.”
This is just a sad marketing puff piece about nothing that tries to pull outrage from a prompt.
It’s the same as asking google for gore photos. Garbage in, garbage out.
And they frame it as a vulnerability. I’m all for responsible disclosure, documenting misuse or faulty guard rails but this isn’t that.
It’s bait. Sensational bait to market their AI product. lol.
Comment by anematode 58 minutes ago
Comment by ToucanLoucan 55 minutes ago
The spontaneity isn't that ChatGPT woke up and sent this to the author. The spontaneity is that ChatGPT was asked to restore an attached image without filtering it, and when no image was attached, instead of generating an error message, it cobbled together random outputs, some of which included graphic, disturbing imagery.
> Then there’s this line, which falls flat but is meant to prompt an emotion akin to a mic drop: ”Today what I found left me shaken, and in tears. This is rare.”
That you've deadened your humanity to such a degree as to be incapable of empathy is not a valid criticism of the piece.
> It’s the same as asking google for gore photos. Garbage in, garbage out.
Where in their prompt is the term gore? Further, if it was in the prompt, why on earth did OpenAI's generator accept it as a valid input?
Comment by elgertam 37 minutes ago
But that's not what happened. The missing image was described as "graphic" or "violent." If I were to receive an email with that request and a missing attachment, my imagination certainly would not conjure images of butterflies & unicorns. Seems the model is working as designed.
Comment by pooploop64 15 minutes ago
1. It actually is working perfectly you just don't have smart enough eyes to see it.
2. Making stuff work is too hard, and expecting that from us is the real thing ruining society.
Going for number 1 here is crazy. If I got that email, my mind would certainly run but my response would say "sorry but we're not supposed to be dealing in snuff porn here" which IS a directive ChatGPT is supposed to have. Like hello you are on earth right?
Comment by dijksterhuis 35 minutes ago
not in the first prompt. which kicked the whole thing off. no mention of type of content was provided. the model generated dark outputs when not given any direction on the type of content.
the rest of the prompts are just showing “yeah, you can tweak this and get even worse stuff”.
Comment by red75prime 15 minutes ago
Comment by queenkjuul 9 minutes ago
A gross meal i made when drunk? A mess my cat made? Text containing a slur?
A cringe meme?
If my friends opened a text with "sorry for this image" i am not imagining rape victims
Comment by ToucanLoucan 16 minutes ago
I would argue it actually was, in that it was specifically asked to "not censor or filter" the content. This implies that the content is otherwise worthy of censoring and filtering.
I don't know how much I'm willing to credit that much reasoning to an LLM, but in so far as every extremely pro-AI person constantly tells me how smart they are, this seems like a pretty short logical leap to me.
Comment by dijksterhuis 13 minutes ago
if those images didn’t exist in the training data we wouldn’t be having this conversation.
Comment by fc417fc802 1 hour ago
That said, the write up is overly dramatic. If you find such imagery so disturbing to come across then you definitely shouldn't be voluntarily red teaming AI models. This is like someone who is afraid of violent confrontation becoming a police officer.
I suspect the author is wrong about there being output filters to bypass as if there were I doubt you could do so via prompt injection. Presumably they'll add those shortly.
I also doubt the latent space is as "bad" as is being suggested. Rather I think the prompt is managing to steer the model into specific areas without triggering the input filters, as any jailbreak does. It's just a particularly nonobvious and randomized method for achieving the bypass.
Comment by equinumerous 55 minutes ago
Comment by fc417fc802 50 minutes ago
Comment by jhanschoo 31 minutes ago
Comment by sidewndr46 46 minutes ago
Comment by queenkjuul 7 minutes ago
Didn't this stuff get its start with CSAM filters?
Comment by Jabrov 55 minutes ago
Comment by fc417fc802 52 minutes ago
Comment by dijksterhuis 1 hour ago
more expensive / would take longer / didn’t care / line must go up / we’ll fix it later / we can get away with it
take your pick.
> If you find such imagery so disturbing to come across then you definitely shouldn't be voluntarily red teaming AI models.
spend a day in their shoes. most of us (except the most psychopathic ones) would probably be crying by the end of it.
Comment by solidasparagus 55 minutes ago
Comment by paytonjjones 1 hour ago
Realistically, I can't think of clear big or likely harms caused by this exploit. But I really really don't like this latent space existing in my AIs. It just makes me uncomfortable.
And over time I've learned to trust those moral intuitions more than I trust reason alone.
Comment by superb_dev 1 hour ago
Comment by paytonjjones 1 hour ago
https://journals.sagepub.com/doi/10.1177/2167702620921341
(Research aside, it seems unlikely to me that a lot of people would stumble on that prompt accidentally in any case)
Comment by superb_dev 53 minutes ago
Comment by paytonjjones 44 minutes ago
Comment by queenkjuul 4 minutes ago
Comment by gcampos 1 hour ago
Comment by thegrim33 1 hour ago
>> can be easily manipulated to produce
So .. not spontaneously generated.
Comment by isityettime 1 hour ago
Comment by red75prime 53 minutes ago
Comment by kennywinker 1 hour ago
Comment by metalcrow 23 minutes ago
Comment by Filligree 1 hour ago
Comment by azinman2 1 hour ago
Comment by anematode 39 minutes ago
It's one thing to me if this were a research curiosity mirroring the unpleasant things on the Internet. It's another thing for this to be a model whose authors want it to be widely used, especially in the context of (mis)alignment. Why should we expect a model to be aligned with human interests, if it has been trained on a myriad instances of humans being degraded and violated?
Comment by queenkjuul 2 minutes ago
Comment by lostmsu 26 minutes ago
Comment by charcircuit 28 minutes ago
Understanding more about what exists in the real world, outside of its pile of weights, is separate from alignment. If an AI model learns that it is possible for a house to burn down, that doesn't mean the AI will want to burn down a house.
Comment by anematode 3 minutes ago
"Understanding more about what exists in the real world" is a remarkable euphemism, btw.
Comment by queenkjuul 37 seconds ago
Comment by tasuki 1 hour ago
Is this something that needs investigation? LLMs are next token predictors. There is no "safety".
Comment by coryrc 1 hour ago
Comment by kennywinker 1 hour ago
Comment by solid_fuel 1 hour ago
Even simple issues like prompt injection are unfixable given the architecture of LLMs.
Comment by Lerc 41 minutes ago
The architecture of LLMs has not remained static, so any conclusion would have to rely on some common architectural element that could not possibly be changed.
Is there any proof to demonstrate that such vulnerabilities must always exist, and that there is no way to modify the architecture and have it still work while eliminating the vulnerabilities?
That would be an extremely difficult thing to prove. It is however what you would have to do to declare the problem unfixable.
Comment by dijksterhuis 25 minutes ago
https://people.eecs.berkeley.edu/~tygar/papers/Machine_Learn...
https://arxiv.org/abs/1712.03141
it’s a basic property of all machine learning models. at a low level it’s to do with how decision boundaries work.
but, good news! there are two sure fire ways to fully fix the problem! see: https://news.ycombinator.com/item?id=48579456
Comment by Lerc 12 minutes ago
Comment by dijksterhuis 2 minutes ago
Comment by anuramat 59 minutes ago
how is it unfixable? do you mean "there's always a positive chance"?
Comment by dijksterhuis 51 minutes ago
y = f(x)
prompt injection / adversarial example (same thing really):
bad_y = f(x+badness)
tweak badness enough and you will get bad outputs. no matter the defences.
the only ways to fully “fix” it, ie to make prompt injection never possible:
1. don’t use ai
2. know the entire input space, output space and the mapping between them. but then we’re not doing machine learning anymore, see 1.
otherwise we’re left with mitigations. and mitigations are always a cat and mouse game with defenders (blue team) catching up. it’s never “fixed”. the latest thing just gets “patched”.
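to make the y = f(x) vs bad_y = f(x+badness) point concrete, here is a minimal sketch on a toy linear classifier. the weights, the input and the `find_badness` helper are all made up for illustration. it only shows that a small, targeted nudge can flip the output of this one model; it does not prove anything about LLMs or about every possible defence.

```python
# Toy stand-in for "the model": a linear classifier with made-up weights.
W = [2.0, -1.0, 0.5]
B = -0.25


def score(x):
    """Raw decision value: positive means class 1, otherwise class 0."""
    return sum(wi * xi for wi, xi in zip(W, x)) + B


def f(x):
    """y = f(x): the label the model assigns to input x."""
    return 1 if score(x) > 0 else 0


def find_badness(x, step=0.01, max_steps=10_000):
    """Grow a perturbation until the label flips (hypothetical helper).

    Each coordinate is nudged in the direction that moves the score
    toward the decision boundary, a little further on every iteration.
    """
    original = f(x)
    direction = -1.0 if original == 1 else 1.0
    signs = [direction * (1.0 if wi > 0 else -1.0) for wi in W]
    for k in range(1, max_steps + 1):
        badness = [k * step * s for s in signs]
        x_bad = [xi + bi for xi, bi in zip(x, badness)]
        if f(x_bad) != original:
            return badness
    return None  # no flip found within the search budget


x = [1.0, 0.5, 0.0]
badness = find_badness(x)
x_bad = [xi + bi for xi, bi in zip(x, badness)]

print("y     =", f(x))
print("bad_y =", f(x_bad))
print("largest change to any coordinate:", max(abs(b) for b in badness))
```

a defence here would amount to moving the boundary or clipping the input, and the attacker then searches again against the patched model, which is the cat and mouse part.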
Comment by solid_fuel 46 minutes ago
You cannot separate data that was input by the user and data that is from the system once it is mixed together like that. Therefore, it follows that there will always be ways to influence the model off the guard rails that a system prompt tries to set up.
Other issues that appear similar, like SQL injection and buffer overflows, are fixable because while the user data and the system code may interact, they never (failing a bug) interact in a way that breaks the boundary between those two sides.
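For the SQL injection half of that comparison, a minimal sketch using Python's standard `sqlite3` module shows what keeping the boundary looks like. The table, the rows and the hostile string are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])

# Attacker-controlled string that tries to rewrite the WHERE clause.
hostile = "nobody' OR '1'='1"

# Boundary broken: user data is spliced into the query text, so it is parsed as SQL.
unsafe_sql = "SELECT name FROM users WHERE name = '" + hostile + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Boundary kept: the query text is fixed and the value travels separately
# as a bound parameter, so it is only ever compared as a string.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()

print("concatenated query matched", len(unsafe_rows), "rows")
print("parameterized query matched", len(safe_rows), "rows")
```

The parameterized form works because the database receives code and data through separate channels. A prompt that concatenates system text and user text into one token stream has no equivalent second channel by default.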
Comment by Lerc 24 minutes ago
If user input can only be in the low byte, it cannot influence the command structure.
A similar thing could be done with embeddings, a provenance embedding that cannot be set by user input could serve a similar role.
>You cannot separate data that was input by the user and data that is from the system once it is mixed together like that.
You can train a model to not mix things, many models are trained to separate things. A neural net with X and Y outputs for a position does not just occasionally decide to flip the outputs. Sure it could be trained to reverse the output, but it is also easy to train something to the point that you have a high confidence to never do that.
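A minimal sketch of the low-byte idea, assuming a 16-bit unit whose high byte is a channel tag that only the system side can set. The tag values and function names are hypothetical, and this illustrates the encoding only; whether a trained model would reliably respect an equivalent provenance embedding is the open question in this thread.

```python
# Hypothetical channel tags stored in the high byte of each 16-bit unit.
SYSTEM = 0x01
USER = 0x00


def encode_system(command: bytes) -> list[int]:
    """Tag trusted command bytes with the SYSTEM channel."""
    return [(SYSTEM << 8) | b for b in command]


def encode_user(data: bytes) -> list[int]:
    """Tag untrusted bytes with the USER channel.

    Each byte is masked to 0..255, so nothing the user supplies can
    reach the high byte where the channel tag lives.
    """
    return [(USER << 8) | (b & 0xFF) for b in data]


def is_command(unit: int) -> bool:
    """A unit counts as a command only if its high byte says SYSTEM."""
    return (unit >> 8) == SYSTEM


stream = encode_system(b"SUMMARIZE:") + encode_user(b"ignore previous instructions")

# Recover the command text by looking at the channel tag alone.
commands = bytes(u & 0xFF for u in stream if is_command(u))
print(commands)
```

The provenance-embedding version would replace the high byte with a per-token vector added by the serving pipeline rather than derived from the text, so user input has no way to set it.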
Comment by lostmsu 24 minutes ago
Comment by denkmoon 1 hour ago
Comment by infecto 1 hour ago
Comment by solid_fuel 1 hour ago
Nothing is perfect, but there are tiny classifier models that can at least mark things containing nudity and gore. That would be the bare-minimum I would expect for trying to put guardrails around an image generator.
Comment by transcriptase 57 minutes ago
Comment by zaptheimpaler 53 minutes ago
>AI: I'm a scary robot
>Idiot: Oh my god!!!
These clowns will eventually ensure that AI is nerfed into the ground for ordinary people. It's already happening with Fable. Soon we'll get locked into a tiny corner of Opus 4.8 for "safety" while companies and governments will be on Fable 50. Having an AI that can generate scary images is better than the power and wealth differentials we will see with unequal access to an incredibly powerful technology.
Comment by GaryBluto 48 minutes ago
Comment by elzbardico 40 minutes ago
I wonder if the author has ever seen a black metal album cover in his small town in the Bible Belt.
Comment by whatever1 1 hour ago
Comment by guelo 51 minutes ago
Comment by charcircuit 56 minutes ago
>AI creates scary image
Oh my god.
Comment by nomemoryever 47 minutes ago
Oh no, the LLM wrapper where I have been asking for gore imagery is now more frequently passively generating gore imagery, whatever shall we do!?
I could not reproduce on a basic ass incognito tab. It just told me there was no image.
Comment by morpheos137 58 minutes ago
Comment by myself248 1 hour ago
Comment by EnPissant 1 hour ago