Pierre Ablin
@pierreablin.bsky.social
Research scientist at Apple | machine learning, optimization, language modeling | pierreablin.com
reposted by
Pierre Ablin
Preetum Nakkiran
about 1 year ago
Paper 🧵 (cross-posted at X): When does composition of diffusion models "work"? Intuitively, the reason dog+hat works and dog+horse doesn't has something to do with independence between the concepts being composed. The tricky part is to formalize exactly what this means. 1/
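As a rough illustration (not the paper's formal criterion), composition is often implemented by combining the noise predictions of two pretrained diffusion models at each sampling step; `eps_dog` and `eps_hat` below are hypothetical denoisers for the two concepts.

```python
# Rough sketch of compositional sampling: combine the noise predictions of two
# pretrained denoisers at each step. Names and weights here are illustrative.
def composed_eps(eps_dog, eps_hat, x_t, t, w1=1.0, w2=1.0):
    # Summing scores corresponds to multiplying the two (unnormalized) densities;
    # intuitively this behaves well only when the concepts are close to independent.
    return w1 * eps_dog(x_t, t) + w2 * eps_hat(x_t, t)
```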
reposted by
Pierre Ablin
Fabian Schaipp
about 1 year ago
Learning rate schedules seem mysterious? Why is the loss going down so fast during cooldown? Turns out that this behaviour can be described with a bound from *convex, nonsmooth* optimization. A short thread on our latest paper:
arxiv.org/abs/2501.18965
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training
We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedul...
https://arxiv.org/abs/2501.18965
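For context, a minimal sketch of the kind of constant-plus-cooldown schedule the paper studies; the phase fractions and names below are illustrative, not the paper's exact settings.

```python
# Sketch of a warmup / constant / linear-cooldown learning-rate schedule.
def wsd_schedule(step, total_steps, base_lr=1e-3, warmup=0.01, cooldown=0.2):
    warmup_steps = int(warmup * total_steps)
    cooldown_start = int((1 - cooldown) * total_steps)
    if step < warmup_steps:                      # linear warmup
        return base_lr * step / max(warmup_steps, 1)
    if step < cooldown_start:                    # constant phase
        return base_lr
    # linear cooldown to zero: the phase where the loss drops fast
    return base_lr * (total_steps - step) / max(total_steps - cooldown_start, 1)

lrs = [wsd_schedule(s, 10_000) for s in range(10_000)]
```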
Excited to share Soup-of-Experts, a new neural network architecture that, for any given specific task, can instantiate in a flash a small model that is very good on it. Made with ❤️ at Apple. Thanks to my co-authors David Grangier, Angelos Katharopoulos, and Skyler Seto!
arxiv.org/abs/2502.01804
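A hypothetical sketch of the "soup" idea: a bank of expert parameter tensors is linearly combined with task-specific coefficients to instantiate one small model. The names (`expert_bank`, `task_coeffs`) are illustrative, not the paper's API.

```python
import torch

def instantiate_model(expert_bank, task_coeffs):
    # expert_bank: dict name -> tensor of shape (n_experts, *param_shape)
    # task_coeffs: tensor of shape (n_experts,), e.g. predicted from domain weights
    # Returns one small model's parameters as a weighted sum of the experts.
    return {name: torch.einsum("e,e...->...", task_coeffs, bank)
            for name, bank in expert_bank.items()}
```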
about 1 year ago
reposted by
Pierre Ablin
Mathieu Blondel
about 1 year ago
Really proud of these two companion papers by our team at GDM: 1) Joint Learning of Energy-based Models and their Partition Function
arxiv.org/abs/2501.18528
2) Loss Functions and Operators Generated by f-Divergences
arxiv.org/abs/2501.18537
A thread.
reposted by
Pierre Ablin
Valérie Castin
about 1 year ago
How do tokens evolve as they are processed by a deep Transformer? With José A. Carrillo,
@gabrielpeyre.bsky.social
and
@pierreablin.bsky.social
, we tackle this in our new preprint: A Unified Perspective on the Dynamics of Deep Transformers
arxiv.org/abs/2501.18322
ML and PDE lovers, check it out!
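As a generic illustration of the "tokens as interacting particles" viewpoint (not the specific model analyzed in the preprint), here is an Euler discretization of attention-driven dynamics where each token moves toward an attention-weighted average of the others.

```python
import numpy as np

def attention_step(X, beta=1.0, dt=0.1):
    # X: (n_tokens, d) token positions
    logits = beta * X @ X.T                         # pairwise similarities
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)               # softmax attention weights
    return X + dt * (A @ X - X)                     # move toward attended mean

X = np.random.randn(16, 8)
for _ in range(100):
    X = attention_step(X)                           # tokens tend to cluster
```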
reposted by
Pierre Ablin
Samuel Vaiter
about 1 year ago
Byte Pair Encoding is a tokenization method that starts with all characters as initial tokens. It iteratively merges the most frequent adjacent byte pairs in the text, adding new tokens to the vocabulary until reaching a predefined size. The output is a sequence of tokens.
https://buff.ly/42oG80f
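A minimal, from-scratch sketch of the BPE training loop described above (illustrative only; real tokenizers add byte-level handling and pre-tokenization).

```python
from collections import Counter

def train_bpe(text, vocab_size):
    tokens = list(text)                        # start from individual characters
    merges = []
    while len(set(tokens)) < vocab_size:
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:                          # nothing left to merge
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent adjacent pair
        merges.append((a, b))
        new_tokens, i = [], 0
        while i < len(tokens):                 # replace every occurrence of (a, b)
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                new_tokens.append(a + b)
                i += 2
            else:
                new_tokens.append(tokens[i])
                i += 1
        tokens = new_tokens
    return merges, tokens

merges, tokens = train_bpe("low lower lowest", vocab_size=20)
```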
reposted by
Pierre Ablin
Gaël Varoquaux
about 1 year ago
We are opening post-doc positions at the intersection of AI, data science, and medicine: • Large Language Models for French medical texts • Evaluating digital medical devices: statistics and causal inference
Mixtures of experts are all the rage when it comes to shipping low-latency LLMs. Check out this awesome work by Samira et al. about scaling laws for mixtures of experts!
about 1 year ago
reposted by
Pierre Ablin
Samira
about 1 year ago
One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, parallelizable compute, or sequential compute? We explored this through the lens of MoEs:
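For readers unfamiliar with the MoE lens, a minimal top-k routed layer (illustrative, not the paper's setup): total parameters grow with the number of experts, while per-token compute grows only with k.

```python
import torch
import torch.nn as nn

class MoE(nn.Module):
    def __init__(self, d, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                        # x: (n_tokens, d)
        scores = self.router(x)
        weights, idx = scores.topk(self.k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for j in range(self.k):                  # route each token to k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, j] == e
                if mask.any():
                    out[mask] += weights[mask, j].unsqueeze(-1) * expert(x[mask])
        return out
```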
reposted by
Pierre Ablin
Pau Rodriguez
about 1 year ago
Thrilled to share the latest work from our team at @Apple where we achieve interpretable and fine-grained control of LLMs and Diffusion models via Activation Transport. Paper:
arxiv.org/abs/2410.23054
Code:
github.com/apple/ml-act
0/9 🧵
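A much-simplified sketch of steering via an intervention on hidden activations (a shift fitted between "source" and "target" activation statistics). This is only an illustration of the general idea; see github.com/apple/ml-act for the actual method and API. The model and layer names in the usage line are hypothetical.

```python
import torch

def fit_mean_shift(acts_src, acts_tgt):
    # acts_*: (n_samples, d) activations collected on source/target prompts
    return acts_tgt.mean(0) - acts_src.mean(0)

def make_hook(shift, strength=1.0):
    def hook(module, inputs, output):
        return output + strength * shift      # shift activations at this layer
    return hook

# usage (hypothetical model/layer names):
# handle = model.layers[10].register_forward_hook(make_hook(shift))
```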
Excited to see Sigmoid Attention accepted at ICLR 2025!! Make attention ~18% faster with a drop-in replacement. Code:
github.com/apple/ml-sig...
Paper:
arxiv.org/abs/2409.04431
Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as...
https://arxiv.org/abs/2409.04431
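A minimal sketch of the idea: replace the row-wise softmax with an elementwise sigmoid plus a sequence-length-dependent bias. The -log(n) bias follows the paper's recommendation; everything else here is illustrative rather than the released kernel.

```python
import math
import torch

def sigmoid_attention(q, k, v):
    # q, k, v: (..., n, d)
    n, d = q.shape[-2], q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d**0.5
    weights = torch.sigmoid(logits - math.log(n))   # elementwise, no row normalization
    return weights @ v
```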
about 1 year ago
reposted by
Pierre Ablin
Marco Cuturi
about 1 year ago
The Apple Machine Learning Research (MLR) team in Paris has openings for both FTE roles and a short-term post-doc position to contribute to our team's research agenda. Researchers at Apple's MLR (led by Samy Bengio) target impactful publications in top-tier ML venues and OSS.
Congratulations on these new models!!
about 1 year ago
reposted by
Pierre Ablin
Alaa El-Nouby
about 1 year ago
Does autoregressive pre-training work for vision? 🤔 Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵 paper:
arxiv.org/abs/2411.14402
code:
github.com/apple/ml-aim
HF:
huggingface.co/collections/...
reposted by
Pierre Ablin
Gaël Varoquaux
about 1 year ago
Great video explaining a clever vectorization for learning on strings and dirty categories: the MinHashEncoder is fast, stateless, and excellent with tree-based learners. It's in
@skrub-data.bsky.social
youtu.be/ZMQrNFef8fg
Why the MinHashEncoder is great for boosted trees
YouTube video by probabl
https://youtu.be/ZMQrNFef8fg
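A from-scratch sketch of the MinHash idea behind the MinHashEncoder (illustrative only, not skrub's implementation): hash character n-grams and keep the minimum hash per seed, giving a stateless fixed-size vector per string that tree-based learners can split on.

```python
import numpy as np

def minhash_encode(s, n_components=8, ngram=3):
    # Character n-grams of the string, then one min-hash per random seed.
    grams = {s[i:i + ngram] for i in range(max(len(s) - ngram + 1, 1))}
    return np.array([
        min(hash((seed, g)) for g in grams)
        for seed in range(n_components)
    ], dtype=np.int64)

X = np.stack([minhash_encode(s) for s in ["engineer", "senior engineer", "nurse"]])
```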