🚨🛡️ Jailbreak Defense in a Narrow Domain 🛡️🚨
Jailbreaking is easy. Defending is hard. Might defending against a single, narrow behavior be easier?
Even in this focused setting, all defenses fail 😱
arxiv.org/abs/2412.02159
Appearing at @AdvMLFrontiers (Oral) & @solarneurips
#NeurIPS2024