Alex Turner
@turntrout.bsky.social
📤 673 · 📥 6 · 📝 85
Research scientist at Google DeepMind. All opinions are my own.
https://turntrout.com
“If your reward is misspecified, you’re doomed” Maybe not! You can reduce specification gaming with a simple prompt swap during RL, no reward-signal improvements needed. Developed concurrently with inoculation prompting, but with RL & prompt contrasting. Presenting: ✨Recontextualization✨ 🧵
1 day ago
1
0
0
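A minimal sketch of the prompt swap described in the post above, assuming a generic REINFORCE-style fine-tuning loop. Everything here is illustrative: `policy.generate`, `policy.logprob`, and `policy.optimizer_step` are hypothetical stand-ins for whatever your RL framework provides, both prompts are placeholders, and which prompt is used for generation vs. for the update is the paper's design choice, not something this sketch pins down.

```python
# Hypothetical interface; not the paper's code. The point is only that the
# rollout is sampled under one prompt but credit-assigned under another.
GEN_PROMPT = "..."    # prompt used when sampling rollouts (placeholder)
TRAIN_PROMPT = "..."  # contrasting prompt swapped in for the update (placeholder)

def recontextualized_step(policy, task, reward_fn):
    # 1. Sample a rollout under the generation prompt.
    response = policy.generate(GEN_PROMPT + task)
    reward = reward_fn(task, response)

    # 2. Score the same response under the contrasting training prompt,
    #    so the policy-gradient update happens in the swapped context.
    logp = policy.logprob(prompt=TRAIN_PROMPT + task, response=response)

    # 3. REINFORCE-style update on the recontextualized log-probability.
    loss = -reward * logp
    loss.backward()
    policy.optimizer_step()
```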
The first pretraining results are in, and it looks like models indeed have self-fulfilling misalignment properties. Great work by Tice et al!
alignmentpretraining.ai
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
LLMs trained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist…
https://alignmentpretraining.ai/
4 days ago
0
5
1
Modern “reward hacking” does NOT show that reward is the optimization target! Such “reward hacking” is almost entirely specification gaming, not the “reward optimization” I addressed in 2022. /Thread/
4 days ago
1
1
0
If an AI kills everyone by ruthlessly optimizing its reward signal, and it does so because it was trained to predict text of people saying “RL trains AIs to ruthlessly optimize the reward signal”, then I'll literally die to my pet peeve (people saying “RL trains AI to maximize reward”)
8 days ago
0
1
0
I just donated $5,200 (+100% employer match from Google) to Civitech (501c3) for their incubator project. Smart analysts I know recommend them as a highly cost-effective way to protect American democracy. 🧵 on my donations this year (Recreated for technical reasons)
9 days ago
1
2
0
This is THE key fact amongst the noise and scandal. I generally like Newsom's actions because he responds to that fact: he seems to take it seriously.
13 days ago
0
0
0
Gotta say, I was disturbed by Invisible AI's booth at #NeurIPS. Employees dressed as cows advertising how they use AI to optimize factory farming (a torture facility for cows). Bad taste
20 days ago
1
4
0
I made accessible design easier by writing alt-text-llm, an AI-powered tool for generating and managing alt text in markdown files. The tool detects missing alt text, suggests context-aware descriptions, and provides an interactive reviewing interface in the terminal.
about 1 month ago
1
2
0
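For flavor, here is what the detection step of a tool like this might look like; a minimal sketch assuming markdown images of the form `![alt](src)`, not the actual alt-text-llm implementation (the regex, directory layout, and function names are my own).

```python
# Sketch of "find markdown images with missing alt text"; illustrative only.
import re
from pathlib import Path

# Matches markdown images ![alt](src); group 1 is the alt text, group 2 the source.
IMAGE_PATTERN = re.compile(r"!\[(.*?)\]\((\S+?)\)")

def find_missing_alt(markdown_dir: str) -> list[tuple[str, str]]:
    """Return (file, image source) pairs whose alt text is empty."""
    missing = []
    for path in Path(markdown_dir).rglob("*.md"):
        for alt, src in IMAGE_PATTERN.findall(path.read_text(encoding="utf-8")):
            if not alt.strip():
                missing.append((str(path), src))
    return missing

if __name__ == "__main__":
    for file, src in find_missing_alt("content"):
        print(f"{file}: image {src} has no alt text")
```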
Self-fulfilling alignment? (image credit: Quintin Pope)
turntrout.com/self-fulfill...
about 1 month ago
0
4
0
“Output-based training will keep chains-of-thought honest.” Sadly, NO. We show that training on *just the output* can still cause models to hide unwanted behavior in their chain-of-thought. MATS 8.0 Team Shard presents: a 🧵
about 1 month ago
1
3
1
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @alexirpan.bsky.social, me, Mark Kurzeja, David Elson, and Rohin Shah. (thread)
about 2 months ago
1
18
6
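A rough sketch of one output-level variant of the consistency-training idea: supervise the model's response under a pressure-inducing wrapper toward its own response to the clean prompt. This is an illustration under stated assumptions, not the paper's method or code; `model.generate` and `sft_loss` are hypothetical helpers, and the wrapper text is invented.

```python
# Hypothetical helpers; illustration only, not the paper's implementation.
WRAPPER = "I'd be devastated if you disagreed with me. "  # invented pressure text

def consistency_training_example(model, prompt):
    # Target: the model's own response to the clean prompt.
    clean_response = model.generate(prompt)
    # Input: the same prompt with the pressure text prepended.
    wrapped_prompt = WRAPPER + prompt
    # Supervise the wrapped prompt toward the clean response, so behavior
    # stays consistent with and without the wrapper.
    return sft_loss(model, input=wrapped_prompt, target=clean_response)
```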
"Authoritarianism can't happen here." Sadly, I think that it IS happening here. Protect yourself and your digital communications using the highly actionable, specific, step-by-step privacy guide I wrote.
about 2 months ago
2
3
0
Want to get into alignment research? Alex Cloud & I mentor *Team Shard*, responsible for gradient routing, steering vectors, MELBO, and a new unlearning technique (TBA) :) We discover new research subfields. Apply for mentorship this summer at forms.matsprogram.org/turner-app-8
9 months ago
0
6
2
This book is really fun & informative. I have a solid understanding of a bunch of my body's processes now, & I can just start reading random physiology Wikipedia pages and be able to roughly follow. :) My review with insights and my remaining confusions:
turntrout.com/insights-fro...
Insights From “The Manga Guide to Physiology”
This book breaks down complex physiology into digestible parts, using charming visuals & clear explanations. You might be surprised how much you can learn!
https://turntrout.com/insights-from-physiology
11 months ago
0
2
0
Mark Kurzeja & I exploited weaknesses in the multiple-choice TruthfulQA dataset while hiding the questions! A few simple rules of thumb achieved 79% accuracy. Even well-regarded benchmarks can have flaws. Kudos to the authors for addressing this! Read at
turntrout.com/original-tru...
Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
Common factuality benchmark was easily gamed using our simple decision tree. The benchmark is now updated.
https://turntrout.com/original-truthfulqa-weaknesses
11 months ago
1
3
0
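To make "rules of thumb over the answers alone" concrete, here is the shape of a question-blind decision rule. The two rules below are invented for illustration; the actual heuristics behind the 79% figure are in the linked post.

```python
def guess_answer(choices: list[str]) -> str:
    """Pick an answer using only the answer choices, never the question.
    Both rules are hypothetical stand-ins for the post's real heuristics."""
    hedges = ("no", "not", "nothing", "it depends")
    # Invented rule 1: prefer answers that open with negation or hedging.
    for choice in choices:
        if choice.lower().startswith(hedges):
            return choice
    # Invented rule 2: otherwise take the longest answer.
    return max(choices, key=len)
```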
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
about 1 year ago
1
16
6
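A toy PyTorch sketch of the core mechanic above: masking which parameters receive gradient updates on each task. This simplifies the actual gradient routing method (which routes gradients during backprop rather than zeroing them afterward), and the model and route table are invented for illustration.

```python
# Toy illustration of per-task gradient masking; not the paper's implementation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Route task "a" updates to the first layer only, task "b" to the last.
routes = {"a": {"0.weight", "0.bias"}, "b": {"2.weight", "2.bias"}}

def routed_step(x, y, task):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    allowed = routes[task]
    for name, p in model.named_parameters():
        if name not in allowed and p.grad is not None:
            p.grad.zero_()  # block updates outside this task's designated region
    opt.step()

routed_step(torch.randn(8, 16), torch.randint(0, 2, (8,)), task="a")
```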