(@jsrozner.bsky.social)

New paper with @coryshain.bsky.social: We develop "perturbation", a simple approach to finding shared representations (common abstractions) in LMs. We perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples.
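
To make the perturb-then-measure idea concrete, here is a toy sketch. It is not the paper's implementation: a tiny logistic model stands in for the LM, one gradient step stands in for fine-tuning, and the change in log-probability on probe examples stands in for "infection". All names here (`log_prob`, `perturb`, `infection`) and all numbers are hypothetical and chosen only for illustration.

```python
import math


def log_prob(w, x, y):
    """Log-probability a logistic model with weights w assigns to label y (0 or 1)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    s = z if y == 1 else -z
    return -math.log1p(math.exp(-s))  # log sigmoid(s)


def perturb(w, x_adv, y_adv, lr=1.0):
    """One gradient-ascent step on log p(y_adv | x_adv).

    Assumption: a single step is a stand-in for fine-tuning on one example.
    """
    z = sum(wi * xi for wi, xi in zip(w, x_adv))
    p1 = 1.0 / (1.0 + math.exp(-z))
    grad_scale = y_adv - p1  # derivative of log p(y_adv | x_adv) w.r.t. z
    return [wi + lr * grad_scale * xi for wi, xi in zip(w, x_adv)]


def infection(w, probes, x_adv, y_adv, lr=1.0):
    """Change in each probe's log-probability caused by the perturbation."""
    w_new = perturb(w, x_adv, y_adv, lr)
    return [log_prob(w_new, x, y) - log_prob(w, x, y) for x, y in probes]


# Toy setup: three features, all weights start at 1.0.
w = [1.0, 1.0, 1.0]
x_adv, y_adv = [1.0, 0.0, 1.0], 0      # adversarial target: the label the model disfavors
probes = [
    ([1.0, 1.0, 0.0], 1),              # shares feature 0 with the adversarial example
    ([0.0, 1.0, 0.0], 1),              # shares no features with it
]
shifts = infection(w, probes, x_adv, y_adv)
```

In this toy, the probe that shares a feature with the adversarial example loses log-probability, while the probe that shares nothing is unaffected. That is the sense in which the size of the shift can be read as a signal of shared representation.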