Simon Lermen (@simonlermen.bsky.social)

Current safety training techniques do not fully transfer to the agent setting — LessWrong TL;DR: We are presenting three recent papers which all share a similar finding, i.e. the safety training techniques for chat models don’t transfer we… https://www.lesswrong.com/posts/ZoFxTqWRBkyanonyb/current-safety-training-techniques-do-not-fully-transfer-to