By default, LLM agents taking long action sequences can use their early steps to undermine your evaluation of their later steps, which is a major alignment risk.
Our new paper mitigates this, preserves the agent's ability to plan long-term, and doesn't assume you can detect the undermining strategy.
add a skeleton here at some point