Alessandro Stolfo
@alestolfo.bsky.social
📤 348
📥 65
📝 2
PhD @ ETHZ - LLM Interpretability alestolfo.github.io
reposted by
Alessandro Stolfo
Yucheng Sun
2 months ago
1/6: Can we use an LLM’s hidden activations to predict and prevent wrong predictions? When it comes to arithmetic, yes! I’m presenting new work w/ @alestolfo.bsky.social “Probing for Arithmetic Errors in LMs” @ #ICML2025 Act Interp WS 🧵 below
reposted by
Alessandro Stolfo
Aaron Mueller
5 months ago
Lots of progress in mech interp (MI) lately! But how can we measure when new mech interp methods yield real improvements over prior work? We propose 😎 𝗠𝗜𝗕: a 𝗠echanistic 𝗜nterpretability 𝗕enchmark!
Our paper “Improving Instruction-Following in Language Models through Activation Steering” has been accepted to #ICLR2025! We're also excited to share that our public GitHub repo is now live.
Code: github.com/microsoft/ll...
Camera-ready: arxiv.org/abs/2410.12877
5 months ago