Jonathan Bragg (@jbragg.bsky.social)

Agent benchmarks don't measure true *AI* advances. We built one that's hard & trustworthy:
👉 AstaBench tests agents w/ *standardized tools* on 2400+ scientific research problems
👉 SOTA results across 22 agent *classes*
👉 AgentBaselines agents suite
🆕 arxiv.org/abs/2510.21652 🧵👇