Very happy this paper got accepted to NeurIPS 2025 as a Spotlight! 😁
Main takeaway: In mechanistic interpretability, we need assumptions about how DNNs encode concepts in their representations (e.g., the linear representation hypothesis). Without them, we can claim any DNN implements any algorithm!
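A toy sketch of the argument (my own illustration, with made-up data, not code from the paper): if the "interpretation" mapping from hidden states to algorithm states is unrestricted, e.g. an arbitrary lookup table, it can always be constructed for any network and any target algorithm, so the implementation claim is vacuous. Restricting the map to a fixed class such as linear probes (the linear representation hypothesis) makes the claim falsifiable.

```python
# Toy illustration only: the network, the "algorithm", and the probe are
# all assumptions for this sketch, not the paper's setup.
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "network": random 2-d hidden states for 8 distinct inputs.
hidden_states = rng.normal(size=(8, 2))

# Arbitrary "algorithm": any sequence of abstract states we like.
target_states = ["s0", "s1", "s1", "s2", "s0", "s2", "s1", "s0"]

# Unrestricted interpretation: a lookup table from hidden state to
# algorithm state. It always "succeeds", whatever the network or algorithm.
lookup = {tuple(h.round(6)): s for h, s in zip(hidden_states, target_states)}
assert all(
    lookup[tuple(h.round(6))] == s
    for h, s in zip(hidden_states, target_states)
)  # trivially true -- carries no explanatory content

# Restricted interpretation: a linear probe must recover the algorithm's
# states from the hidden states. This can fail, so it carries content.
labels = np.array([int(s[1]) for s in target_states])
X = np.hstack([hidden_states, np.ones((len(hidden_states), 1))])  # add bias
onehot = np.eye(3)[labels]
W, *_ = np.linalg.lstsq(X, onehot, rcond=None)  # least-squares linear probe
accuracy = (np.argmax(X @ W, axis=1) == labels).mean()
print(f"linear-probe accuracy: {accuracy:.2f}")  # may well be below 1.0
```

The contrast is the point: the lookup-table "interpretation" is guaranteed to work, while the linear probe can genuinely fail, which is what gives an interpretability claim empirical bite.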