New paper with
@coryshain.bsky.social: We develop "perturbation" - a simple approach to finding shared representations (common abstractions) in LMs. We perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples.