Evaluating language models is tricky: how do we know whether our results are real or just random chance?
We answer this with two simple metrics: signal, a benchmark’s ability to separate models, and noise, a benchmark’s random variability between training steps 🧵
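Here is a minimal sketch of the intuition in Python. The function names, the exact definitions (spread of per-model scores for signal, checkpoint-to-checkpoint standard deviation for noise), and all numbers are illustrative assumptions, not the precise formulas from the analysis:

```python
import numpy as np

def signal(model_scores):
    # Illustrative signal: how well the benchmark separates models,
    # measured here as the spread (max - min) of per-model scores.
    scores = np.asarray(model_scores, dtype=float)
    return scores.max() - scores.min()

def noise(checkpoint_scores):
    # Illustrative noise: random variability of one model's score
    # across nearby training checkpoints, as a sample standard deviation.
    scores = np.asarray(checkpoint_scores, dtype=float)
    return scores.std(ddof=1)

# Hypothetical benchmark scores for a handful of models...
model_scores = [0.42, 0.47, 0.55, 0.61]
# ...and one model's scores over its last few training checkpoints.
checkpoint_scores = [0.548, 0.552, 0.545, 0.556]

snr = signal(model_scores) / noise(checkpoint_scores)
print(f"signal={signal(model_scores):.3f} "
      f"noise={noise(checkpoint_scores):.4f} SNR={snr:.1f}")
```

A benchmark with a high signal-to-noise ratio separates models by much more than it fluctuates between training steps, so differences you measure on it are more likely to be real.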