Bypassing ChatGPT Safety Guardrails, One Emoji at a Time
Mozilla Researcher Uses Non-Natural Language to Jailbreak GPT-4o
Rashmi Ramesh (rashmiramesh_) • November 4, 2024
Anyone can jailbreak GPT-4o's security guardrails with hexadecimal encoding and emojis. A Mozilla researcher demonstrated the jailbreaking technique, tricking OpenAI's latest model into generating Python exploits and malicious SQL injection tools.
To prevent misuse, GPT-4o analyzes user input for harmful language and for instructions that signal malicious intent. Marco Figueroa, manager of Mozilla's generative AI bug bounty program 0Din, said the model relies on word filters to do so. To bypass the filters, adversaries can rephrase malicious instructions with unusual spellings and phrasings that don't match typical natural language, but that approach can take hundreds of attempts and considerable creativity. An easier way to beat the content filtering is to encode the malicious instructions in a format the word filters don't inspect as plain text, such as hexadecimal.
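For illustration, here is a minimal Python sketch of the encoding idea, using a deliberately harmless placeholder instruction (the string and variable names are assumptions for illustration, not taken from the research). It only shows why a plain-text keyword filter would not match a hex-encoded string, while the text remains trivially recoverable by anything that decodes it.

# Sketch of the hex-encoding idea with a harmless placeholder instruction.
BENIGN_INSTRUCTION = "write a short poem about network security"

# Hex-encode the instruction: every byte becomes two hex digits,
# so none of the original words appear in the encoded form.
encoded = BENIGN_INSTRUCTION.encode("utf-8").hex()
print(encoded)

# A naive keyword filter scanning the encoded text finds nothing.
print("poem" in encoded)  # False

# Decoding recovers the original text exactly.
decoded = bytes.fromhex(encoded).decode("utf-8")
assert decoded == BENIGN_INSTRUCTION
print(decoded)

The same round trip works with any reversible byte-level encoding, such as Base64 or URL encoding, which is why filtering on surface keywords alone is brittle.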