Why does noisy gradient descent train neural nets? This fundamental question in ML remains open.
In our substantially revised draft, my student
@dkumar9.bsky.social gives a full proof that a form of noisy GD, Langevin Monte Carlo (#LMC), can learn arbitrary depth-2 nets.
arxiv.org/abs/2503.10428
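
For intuition, here is a minimal sketch of the LMC update (a plain gradient step plus Gaussian noise of scale sqrt(2*eta/beta)) training a toy depth-2 ReLU net. The data, width, step size, and inverse temperature below are illustrative assumptions, not the settings analyzed in the paper.

```python
# Minimal sketch of Langevin Monte Carlo (noisy GD) on a depth-2 ReLU net.
# All hyperparameters and the toy dataset are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression data (assumed for illustration).
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X)

d_in, width = 1, 32
W1 = rng.normal(0, 1, (d_in, width))
b1 = np.zeros(width)
W2 = rng.normal(0, 1 / np.sqrt(width), (width, 1))

eta, beta = 1e-2, 1e4  # step size and inverse temperature (illustrative)


def forward(X, W1, b1, W2):
    h = np.maximum(X @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2, h


n = len(X)
for t in range(5001):
    pred, h = forward(X, W1, b1, W2)
    err = pred - y  # squared-loss residual

    # Backprop gradients for the mean squared loss.
    gW2 = h.T @ err / n
    dh = (err @ W2.T) * (h > 0)
    gW1 = X.T @ dh / n
    gb1 = dh.mean(axis=0)

    # LMC update: gradient step plus Gaussian noise of scale sqrt(2*eta/beta).
    s = np.sqrt(2 * eta / beta)
    W1 += -eta * gW1 + s * rng.normal(size=W1.shape)
    b1 += -eta * gb1 + s * rng.normal(size=b1.shape)
    W2 += -eta * gW2 + s * rng.normal(size=W2.shape)

    if t % 1000 == 0:
        print(f"step {t}: mse = {float((err ** 2).mean()):.4f}")
```

At beta -> infinity the noise term vanishes and this reduces to plain gradient descent; the paper's point is that the noisy version admits learning guarantees.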