Cassidy Laidlaw
@cassidylaidlaw.bsky.social
📤 767
📥 58
📝 20
PhD student at UC Berkeley studying RL and AI safety.
https://cassidylaidlaw.com
We built an AI assistant that plays Minecraft with you. Start building a house—it figures out what you’re doing and jumps in to help. This assistant *wasn't* trained with RLHF. Instead, it's powered by *assistance games*, a better path forward for building AI assistants. 🧵
5 months ago
When RLHFed models engage in “reward hacking” it can lead to unsafe/unwanted behavior. But there isn’t a good formal definition of what this means! Our new paper provides a definition AND a method that provably prevents reward hacking in realistic settings, including RLHF. 🧵
9 months ago
reposted by Cassidy Laidlaw
Eugene Vinitsky 🍒
10 months ago
Kind of a broken record here but proceedings.neurips.cc/paper_files/... is totally fascinating in that it postulates two underlying, measurable structures that you can use to assess if RL will be easy or hard in an environment