New Google DeepMind safety paper! LLM agents are coming - how do we stop them finding complex plans to hack the reward?
Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!
Inspired by myopic optimization but better performance - details in 🧵