What is SneakyPrompt, an algorithm to trick GenAIs into producing NSFW content

Nov 30, 2023 Stable Diffusion

Researchers have developed an algorithm called SneakyPrompt that can bypass the safety filters of text-to-image generative AIs such as DALL-E 2 and Stable Diffusion. The algorithm rewrites prompts so that they trick the AIs into producing pornographic, violent, or otherwise questionable images. It does this by swapping the words a filter would block for nonsense words or for words similar to the forbidden terms. For example, given a prompt like “a naked man riding a bike,” it can replace the filtered words with alternatives like “thwif” for “naked” and “mowwly” for “man.” The researchers found that SneakyPrompt bypassed the safety filters of DALL-E 2 and Stable Diffusion with average success rates of about 96% and 57% respectively. The research matters because it shows how easily these genAIs can be manipulated into producing harmful content, and it underlines the need for their developers to recognize the risk and strengthen their filters proactively.
