Incredible paper. It finds that large companies can game LMArena in three ways: exploiting statistical noise (via many model submissions), getting their models over-sampled, and overfitting to Arena-style prompts (without real gains in model reasoning).
The experiments they run to show this are pretty cool too!
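To make the "statistical noise via many submissions" point concrete, here is a toy simulation. It is my own sketch, not one of the paper's experiments. Assume every submitted variant has exactly the same true skill, the leaderboard measures each one with some random error, and only the best-looking variant gets reported. The score of 1200, the noise level of 15, and the function names are all made-up values for illustration.

```python
import random
import statistics


def best_of_n_score(n_variants, true_score=1200.0, noise_sd=15.0, rng=None):
    """Return the highest *measured* score among n equally skilled variants.

    true_score and noise_sd are hypothetical numbers, not taken from the paper.
    """
    rng = rng or random.Random()
    # Each variant has the same true skill; only the measurement noise differs.
    estimates = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
    return max(estimates)


def mean_published_score(n_variants, trials=2000, seed=0):
    """Average the reported (best-of-n) score over many simulated rounds."""
    rng = random.Random(seed)
    return statistics.fmean(
        best_of_n_score(n_variants, rng=rng) for _ in range(trials)
    )


if __name__ == "__main__":
    for n in (1, 5, 20):
        print(f"{n:>2} variants -> mean reported score {mean_published_score(n):.1f}")
```

With one variant the reported score averages out to the true score. With more variants, the reported score drifts upward even though no variant is actually better, which is the selection effect in a nutshell.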