First paper since switching to the AI safety team 🎉
We looked at problems that could be solved if the model behaved consistently across a set of prompts, and tried training for that consistency both in output space and in internal activations. Both approaches were effective. See the thread or paper for details.
about 2 months ago