🪩 Evaluate your LLMs on benchmarks like MMLU at 1% of the cost.
In our new paper, we show that model outputs on a small subset of test samples, chosen to maximise diversity in model responses, are predictive of performance on the full dataset.
Project page:
arubique.github.io/disco-site/
More below 🧵👇