Samira
@samiraabnar.bsky.social
pinned post!
🚨 One question that has always intrigued me: what is the role of the different ways to increase a model's capacity, i.e., parameters, parallelizable compute, or sequential compute? We explored this through the lens of MoEs:
8 months ago
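For context on why MoEs are a natural lens here: in a sparse MoE, the total parameter count and the per-token (parallelizable) compute are decoupled, which is exactly the knob the question above is about. A minimal, self-contained toy sketch of that decoupling (illustrative numpy only, not the setup used in the paper):

```python
# Toy sparse MoE layer: total parameters scale with num_experts, while
# per-token compute only touches the top_k selected experts.
import numpy as np

rng = np.random.default_rng(0)
d_model, num_experts, top_k = 16, 8, 2

# Expert weights: total parameter count grows with num_experts...
experts = rng.normal(size=(num_experts, d_model, d_model))
router = rng.normal(size=(d_model, num_experts))

def moe_forward(x):
    """Route one token (d_model,) to its top_k experts and mix their outputs."""
    logits = x @ router
    top = np.argsort(logits)[-top_k:]            # indices of selected experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                 # softmax over selected experts
    # ...but only the top_k experts are evaluated for this token.
    return sum(wi * (x @ experts[i]) for wi, i in zip(w, top))

y = moe_forward(rng.normal(size=d_model))
print("total params:", experts.size,
      "| active params per token:", top_k * d_model * d_model)
```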
reposted by Samira
Pierre Ablin
8 months ago
Excited to share Soup-of-Experts, a new neural network architecture that, for any given task, can instantiate in a flash a small model that performs well on it. Made with ❤️ at Apple. Thanks to my co-authors David Grangier, Angelos Katharopoulos, and Skyler Seto!
arxiv.org/abs/2502.01804
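For readers skimming past the abstract, the core mechanic being announced is a parameter soup: a bank of expert weights collapsed into one small model using task-dependent mixing weights. A hedged sketch of that general idea only; the names and shapes below are illustrative, not the paper's actual Soup-of-Experts architecture:

```python
# Generic "parameter soup": instantiate a small model by linearly combining
# a bank of expert parameter sets with task-dependent weights.
# Illustrative sketch only; see arxiv.org/abs/2502.01804 for the real thing.
import numpy as np

rng = np.random.default_rng(0)
num_experts, d_in, d_out = 4, 8, 8

# Bank of expert parameters (random placeholders standing in for trained weights).
expert_params = rng.normal(size=(num_experts, d_in, d_out))

def instantiate_model(domain_weights):
    """Collapse the expert bank into a single small weight matrix."""
    domain_weights = np.asarray(domain_weights, dtype=float)
    domain_weights /= domain_weights.sum()
    return np.einsum("e,eio->io", domain_weights, expert_params)

# A task described mostly by expert 2 with a bit of expert 0.
W_task = instantiate_model([0.2, 0.0, 0.8, 0.0])
print(W_task.shape)  # (8, 8): one small model, no expert bank needed at inference
```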
reposted by Samira
Dan Busbridge
8 months ago
Reading "Distilling Knowledge in a Neural Network" left me fascinated and wondering: "If I want a small, capable model, should I distill from a more powerful model, or train from scratch?" Our distillation scaling law shows, well, it's complicated... 🧵
arxiv.org/abs/2502.08606
Distillation Scaling Laws
We provide a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings reduce the risks associated ...
https://arxiv.org/abs/2502.08606
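For background, the classic distillation objective the thread builds on: the student matches the teacher's temperature-softened outputs alongside the usual hard-label loss. A minimal numpy sketch of that standard objective (the paper's contribution, the scaling law relating student quality to the teacher/student compute split, sits on top of this and is not reproduced here):

```python
# Standard knowledge-distillation loss (Hinton et al.): KL between
# temperature-softened teacher and student predictions, mixed with the
# ordinary hard-label cross-entropy.
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha balances soft (teacher) and hard (label) targets."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    # KL(teacher || student) at temperature T, rescaled by T^2 as usual.
    soft = np.mean(np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)),
                          axis=-1)) * T * T
    # Hard-label cross-entropy at temperature 1.
    hard = np.mean(-np.log(softmax(student_logits)[np.arange(len(labels)), labels]))
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
loss = distillation_loss(rng.normal(size=(4, 10)), rng.normal(size=(4, 10)),
                         labels=np.array([1, 3, 5, 7]))
print(round(loss, 3))
```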
reposted by Samira
Preetum Nakkiran
8 months ago
Paper🧵 (cross-posted at X): When does composition of diffusion models "work"? Intuitively, the reason dog+hat works and dog+horse doesn’t has something to do with independence between the concepts being composed. The tricky part is to formalize exactly what this means. 1/
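For background on what "composition" means operationally here: a common recipe is to add the two models' scores, which targets the product of their densities. A toy sketch of that mechanism with 1-D Gaussians standing in for the two concepts (illustrative only; formalizing when such composition actually works is what the paper is about):

```python
# Compose two "concept" models by summing their score functions and sampling
# with Langevin dynamics, which targets the product of the two densities.
# Toy 1-D Gaussian example, where the product is exact.
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score function d/dx log N(x; mu, sigma^2)."""
    return -(x - mu) / sigma**2

def composed_score(x):
    # Concept A ~ N(-2, 1), concept B ~ N(2, 1); compose by adding scores.
    return gaussian_score(x, -2.0, 1.0) + gaussian_score(x, 2.0, 1.0)

rng = np.random.default_rng(0)
x, step = rng.normal(size=2000), 0.05
for _ in range(500):
    x = x + step * composed_score(x) + np.sqrt(2 * step) * rng.normal(size=x.shape)

# The product of the two Gaussians is N(0, 1/2), so samples should sit near
# mean 0 with std around 0.7.
print(round(x.mean(), 2), round(x.std(), 2))
```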