An Adversarial Perspective on Machine Unlearning for AI Safety
Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim at completely removing hazardous capabilities from…
https://arxiv.org/abs/2409.18025