Adam Gleave (@gleave.me)

NeurIPS Poster InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques
Abstract: Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the algorithm is unknown. This work…
https://neurips.cc/virtual/2024/poster/97689