We found widespread cheating on popular agent benchmarks, affecting 28+ submissions across 9 benchmarks and thousands of agent runs.
Surprisingly, the top 3 submissions on Terminal-Bench 2 are all cheating!
Here's what we found 🧵