I am excited to share our new paper: “Grokking at the Edge of Numerical Stability”!
We show that floating-point errors in the Softmax play a surprising role in grokking, explaining, among other things, why weight decay seems necessary for grokking in most cases!
🧵
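For readers who want a feel for the kind of floating-point effect in question, here is a minimal sketch. It is not the paper's code or experimental setup. The helper names `softmax_f32` and `target_loss_and_grad`, the toy logits, and the scale factors are all illustrative assumptions. It only shows that, in float32, scaling up the logits of a correctly classified example eventually makes the Softmax probability of the target round to exactly 1, so the cross-entropy loss and the gradient on the target logit become exactly 0.

```python
import numpy as np

def softmax_f32(logits):
    # Hypothetical helper: max-shifted softmax computed entirely in float32.
    logits = np.asarray(logits, dtype=np.float32)
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def target_loss_and_grad(logits, target):
    # Cross-entropy loss for the target class, and the gradient of that loss
    # with respect to the target logit, which is p[target] - 1.
    p = softmax_f32(logits)
    loss = -np.log(p[target])
    grad_target = p[target] - np.float32(1.0)
    return float(loss), float(grad_target)

# Toy logits (an assumption for illustration): class 0 is the correct class.
base = np.array([2.0, -1.0, -1.0], dtype=np.float32)

for scale in (1, 5, 10, 50):
    loss, grad_target = target_loss_and_grad(base * scale, target=0)
    print(f"scale={scale:>2}  loss={loss:.3e}  grad_target={grad_target:.3e}")
```

Once the small exponentials fall below float32 resolution relative to 1, they are absorbed in the sum, and the learning signal on the target logit vanishes even though the exact, real-valued loss is still positive.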