Alexander Rubinstein (@arubique.bsky.social)

🪩 Evaluate your LLMs on benchmarks like MMLU at 1% cost. In our new paper, we show that outputs on a small subset of test samples that maximise diversity in model responses are predictive of the full dataset performance. Project page: arubique.github.io/disco-site/ More below 🧵👇