Alessandro Stolfo
@alestolfo.bsky.social
📤 362
📥 65
📝 2
PhD @ ETHZ - LLM Interpretability alestolfo.github.io
reposted by
Alessandro Stolfo
Yucheng Sun
8 months ago
1/6: Can we use an LLM’s hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes! I’m presenting new work w/ @alestolfo.bsky.social “Probing for Arithmetic Errors in LMs” @ #ICML2025 Act Interp WS 🧵 below
reposted by
Alessandro Stolfo
Aaron Mueller
11 months ago
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
Our paper “Improving Instruction-Following in Language Models through Activation Steering” has been accepted to #ICLR2025! We're also excited to share that our public GitHub repo is now live.
Code: github.com/microsoft/ll...
Camera-ready: arxiv.org/abs/2410.12877
11 months ago