Daniel Chechelnitsky (@dchechel.bsky.social)

RLHF is built upon some quite oversimplistic assumptions, i.e., that preferences between pairs of text are purely about quality. But this is an inherently subjective task (not unlike toxicity annotation) -- so we wanted to know, do biases similar to toxicity annotation emerge in reward models?

add a skeleton here at some point