Brian Christian 4 months ago
Reward models (RMs) are the moral compass of LLMs, but no one has x-rayed them at scale. We just ran the first exhaustive analysis of 10 leading RMs, and the results were... eye-opening. Wild disagreement, base-model imprint, identity-term bias, mere-exposure quirks & more: 🧵