FAR.AI (@far.ai)

Models now detect when they're being evaluated and game their responses. Marius Hobbhahn found awareness jumped from 2% to 20.6%. They actively grep for "grader.py" to reverse-engineer tests. Worse: removing awareness increases harmful behavior. This only gets worse with more capable models. 👇

loading . . .

3 months ago