Arthur Douillard
@douillard.bsky.social
📤 1442
📥 155
📝 78
distributed (diloco) + modularity (dipaco) + llm @ deepmind | continual learning phd @ sorbonne
one more step towards decentralized learning: Eager Updates. can we overlap communication with computation over hundreds of steps? -- yes, we can. in this work led by @SatyenKale, we improve DiLoCo and use 1177x less bandwidth than data-parallel
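For intuition, here is a toy numpy sketch of the overlap (the toy loss, the one-round delay model, and all constants are mine; the paper's exact eager rule differs in details): the all-reduce of outer gradients runs in the background and its result only lands one outer round later, so replicas never stall, and each replica uses its own fresh delta right away in place of the not-yet-arrived contribution from the others.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, outer_steps, inner_steps = 10, 4, 30, 25
inner_lr, outer_lr = 0.05, 0.7
target = rng.normal(size=dim)                 # optimum of a toy quadratic loss

thetas = [np.zeros(dim) for _ in range(n_workers)]
in_flight = None   # per-replica average of the *other* replicas' deltas, still "on the wire"

for t in range(outer_steps):
    deltas = []
    for w in range(n_workers):
        local = thetas[w].copy()
        for _ in range(inner_steps):          # computation never blocks on the network
            noise = rng.normal(scale=0.1, size=dim)
            local -= inner_lr * (local - (target + noise))
        deltas.append(thetas[w] - local)      # this replica's outer gradient
    for w in range(n_workers):
        own = deltas[w]                                        # available immediately ("eager")
        others = own if in_flight is None else in_flight[w]    # last round's average, one step late
        thetas[w] -= outer_lr * (own + (n_workers - 1) * others) / n_workers
    # simulate the background all-reduce: the averaged deltas of the other
    # replicas only become visible at the next outer step
    in_flight = [np.mean([deltas[v] for v in range(n_workers) if v != w], axis=0)
                 for w in range(n_workers)]

print("mean distance to optimum:",
      np.mean([np.linalg.norm(th - target) for th in thetas]))
```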
8 months ago
1
7
1
from Jeff Dean on the Dwarkesh podcast: "asynchronous training where each copy of the model does local computation [...] it makes people uncomfortable [...] but it actually works" yep, i can confirm, it does work for real. see
arxiv.org/abs/2501.18512
8 months ago
0
5
0
reposted by
Arthur Douillard
Marco Ciccone
8 months ago
We received outstanding interest in our
#ICLR2025
@iclr-conf.bsky.social
workshop on modularity! Please sign up to serve as a reviewer if you are interested in Model Merging, MoEs, and Routing, for Decentralized and Collaborative Learning
t.co/HIsZKWNaOx
0
1
1
We release today the next step for distributed training: --> Streaming DiLoCo with Overlapping Communication. TL;DR: train data-parallel across the world with low bandwidth for the same performance: 400x fewer bits exchanged & huge latency tolerance
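A toy numpy sketch of the streaming part of the idea (fragment count, schedule, and the simplified merge rule are illustrative, not the paper's exact recipe): the model is split into fragments and only one fragment's outer gradient is synchronized per round, which spreads communication out and lowers peak bandwidth. The 400x figure also relies on quantized communication and overlap, not on fragments alone.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, n_fragments, outer_steps, inner_steps = 30, 4, 3, 12, 20
inner_lr, outer_lr = 0.05, 0.7
fragments = np.array_split(np.arange(dim), n_fragments)   # contiguous parameter fragments

target = rng.normal(size=dim)                        # optimum of a toy quadratic loss
thetas = [np.zeros(dim) for _ in range(n_workers)]   # each replica keeps its own copy

for t in range(outer_steps):
    frag = fragments[t % n_fragments]                # only this fragment is synced this round
    starts = [th.copy() for th in thetas]
    for w in range(n_workers):                       # local inner steps, no communication
        for _ in range(inner_steps):
            noise = rng.normal(scale=0.1, size=dim)
            thetas[w] -= inner_lr * (thetas[w] - (target + noise))
    # only the chosen fragment's outer gradient is all-reduced; the other fragments
    # keep their purely local values until their turn in the schedule comes round
    outer_grad = np.mean([starts[w][frag] - thetas[w][frag] for w in range(n_workers)], axis=0)
    synced = np.mean([starts[w][frag] for w in range(n_workers)], axis=0) - outer_lr * outer_grad
    for w in range(n_workers):                       # simplified merge of the synced fragment
        thetas[w][frag] = synced

print("distance to optimum:", np.linalg.norm(np.mean(thetas, axis=0) - target))
```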
8 months ago
1
8
2
reposted by
Arthur Douillard
MuJoCo.org
9 months ago
Introducing
playground.mujoco.org
Combining MuJoCo’s rich and thriving ecosystem, massively parallel GPU-accelerated simulation, and real-world results across a diverse range of robot platforms: quadrupeds, humanoids, dexterous hands, and arms. Get started today: pip install playground
MuJoCo Playground
An open-source framework for GPU-accelerated robot learning and sim-to-real transfer
https://playground.mujoco.org
1
74
23
reposted by
Arthur Douillard
Marc Lanctot
9 months ago
In December, I posted about our new paper on mastering board games using internal + external planning. 👇 Here's a talk about it, now on YouTube, given by my awesome colleague John Schultz!
www.youtube.com/watch?v=JyxE...
1
35
11
reposted by
Arthur Douillard
Wanru Zhao
9 months ago
🚀Excited to co-organize the
#ICLR2025
Workshop on Modularity for Collaborative, Decentralized, and Continual Deep Learning (MCDC
@iclr-conf.bsky.social
). 📃 Submission Portal:
openreview.net/group?id=ICL...
🤗See you in Singapore! For more details, check out the original thread ↪️🧵
0
2
1
Workshop alert 🚨 We'll host a workshop at ICLR 2025 (late April) on modularity, encompassing collaborative + decentralized + continual learning. Those topics are on the critical path to building better AIs. Interested? Submit a paper and join us in Singapore!
sites.google.com/corp/view/mc...
9 months ago
1
8
4
openreview.net/forum?id=QdE...
👀
Modular, Collaborative and Decentralized Deep Learning
The increasing complexity of modern machine learning models exposes the limitations of the traditional, monolithic approach to their development, raising concerns about cost and...
https://openreview.net/forum?id=QdETnsJ77V
10 months ago
0
7
1
PrimeIntellect have released their tech report on INTELLECT-1:
t.co/8hnoTILaL3
The first open-source worldwide training of a 10B model. The underlying distributed ML algo is DiLoCo (
arxiv.org/abs/2311.08105
) but they also built tons of engineering on top of it to make it scalable.
10 months ago
0
12
1
Awesome video on speculations for test-time scaling (O1 👀 ):
Speculations on Test-Time Scaling (o1)
Tutorial on the technical background behind OpenAI o1. Talk written with Daniel Ritter.Slides: https://github.com/srush/awesome-o1Talk: The “large” in LLM is...
https://www.youtube.com/watch?v=6PEJ96k1kiw
10 months ago
0
7
0
Excellent explanation of RoPE embedding, from scratch with all the math needed:
https://fleetwood.dev/posts/you-could-have-designed-SOTA-positional-encoding
And with beautiful 3blue1brown-style animations:
https://github.com/3b1b/manim
. Original RoPE paper:
arxiv.org/abs/2104.09864
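For reference, a compact numpy sketch of the core operation (head dim, base, and the "rotate-half" pairing are the usual conventions, chosen here for illustration): pairs of feature dimensions are rotated by position-dependent angles, so a query/key dot product depends only on the relative position.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per dimension pair
    angles = positions[:, None] * freqs[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]                # "rotate-half": pair dim i with dim i + half
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.default_rng(0).normal(size=(8, 64))
k = np.random.default_rng(1).normal(size=(8, 64))
pos = np.arange(8, dtype=np.float64)
# shifting both positions by the same offset leaves q·k unchanged (relative encoding)
a = rope(q, pos) @ rope(k, pos).T
b = rope(q, pos + 5) @ rope(k, pos + 5).T
print(np.allclose(a, b))   # True
```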
11 months ago
0
53
10
I need to take holidays to read all my saved arXivs
11 months ago
0
11
0
Distributed Decentralized Training of Neural Networks: A Primer:
towardsdatascience.com/distributed-decentralized-training-of-neural-networks-a-primer-21e5e961fce1
DP's AllReduce, variants thereof + advanced methods such as SWARM (
arxiv.org/abs/2301.11913
) and DiLoCo (
arxiv.org/abs/2311.08105
)
Distributed Decentralized Training of Neural Networks: A Primer
Data Parallelism, Butterfly All-Reduce, Gossiping and More…
https://towardsdatascience.com/distributed-decentralized-training-of-neural-networks-a-primer-21e5e961fce1
11 months ago
0
3
0
LLMs Know More Than They Show:
arxiv.org/abs/2410.02707
* Adding a truth-seeking classifier probe on the token embeddings can yield better performance than the actual generation
* Is something going wrong in the decoding part?
* Those error detectors don't generalize across datasets
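A minimal sketch of that kind of probe on synthetic data (the fake "hidden states", dimensions, and the sklearn classifier are stand-ins, not the paper's exact protocol): train a small classifier on an answer token's representation to predict whether the generated answer was correct.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_answers, dim = 2000, 256

# stand-in for the hidden state of the answer token; in the paper these come
# from the LLM itself, here we plant a weak linear "truthfulness" signal instead
direction = rng.normal(size=dim)
hidden = rng.normal(size=(n_answers, dim))
is_correct = (hidden @ direction + rng.normal(scale=2.0, size=n_answers)) > 0

X_train, X_test, y_train, y_test = train_test_split(hidden, is_correct, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out answers:", probe.score(X_test, y_test))
```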
11 months ago
0
8
0
reposted by
Arthur Douillard
joao
11 months ago
New essay by DeepMind about AI for scientific discovery, there are a lot of interesting ideas and citations to others' work here
deepmind.google/public-polic...
A new golden age of discovery
In this essay, we take a tour of how AI is transforming scientific disciplines from genomics to computer science to weather forecasting. Some scientists are training their own AI models, while...
https://deepmind.google/public-policy/ai-for-science/
1
65
18
Secret Collusion among Generative AI Agents:
arxiv.org/abs/2402.07510
LLMs are prone to collude when they know they cannot be "caught"
11 months ago
2
9
2
reposted by
Arthur Douillard
Jon Barron
11 months ago
Our group at Google DeepMind is now accepting intern applications for summer 2025. Attached is the official "call for interns" email; the links and email aliases that got lost in the screenshot are below.
3
95
27
reposted by
Arthur Douillard
Christian Wolf
11 months ago
There are now several benchmarks testing spatial reasoning and agent capabilities of LLMs and VLMs:
- arxiv.org/abs/2410.06468 (does spatial cognition ...)
- arxiv.org/abs/2307.06281 (MMBench)
- arxiv.org/abs/2411.13543 (BALROG) - additional points for the LOTR ref.
2
115
14
reposted by
Arthur Douillard
11 months ago
Great thread summarizing the history of distributed learning, from Federated Learning to Prime Intellect's OpenDiLoCo
0
2
2
distributed learning for LLMs? recently,
@primeintellect.bsky.social
have announced finishing the distributed training of their 10B model, trained across the world. what is it exactly? 🧵
11 months ago
1
23
8
Adaptive Decoding via Latent Preference Optimization:
arxiv.org/abs/2411.09661
* Add a small MLP + classifier which predicts a temperature per token
* They train the MLP with a variant of DPO (
arxiv.org/abs/2305.18290
) with the temperatures as latents
* low temp for math, high for creative tasks
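An inference-time sketch of the mechanism in numpy (the tiny head's weights are random here; in the paper that head is trained with their DPO variant): a small MLP reads the current hidden state, predicts a temperature in a bounded range, and the next token is sampled with that temperature.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, vocab = 64, 100

# toy temperature head: hidden state -> temperature in (t_min, t_max)
W1, b1 = rng.normal(size=(hidden_dim, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, 1)) * 0.1, np.zeros(1)
t_min, t_max = 0.1, 1.5

def predict_temperature(h):
    z = np.tanh(h @ W1 + b1)
    gate = 1.0 / (1.0 + np.exp(-(z @ W2 + b2)))   # sigmoid, in (0, 1)
    return t_min + (t_max - t_min) * gate.item()

def sample_next_token(h, logits):
    temp = predict_temperature(h)                  # low temp -> near-greedy (math),
    scaled = logits / temp                         # high temp -> more diverse (creative tasks)
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(vocab, p=probs), temp

h = rng.normal(size=hidden_dim)
logits = rng.normal(size=vocab)
print(sample_next_token(h, logits))
```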
11 months ago
0
11
1
reposted by
Arthur Douillard
William Isaac
11 months ago
We have multiple roles now open in my Responsible Research Group! Research Scientist on the HEART team led by Iason Gabriel:
boards.greenhouse.io/deepmind/job...
Research Engineer on the SAMBA team led by Kristian Lum:
boards.greenhouse.io/deepmind/job...
DeepMind
https://boards.greenhouse.io/deepmind/jobs/6351433
2
48
10
reposted by
Arthur Douillard
Tim Rocktäschel
11 months ago
Excited to announce "BALROG: a Benchmark for Agentic LLM and VLM Reasoning On Games" led by UCL DARK's
@dpaglieri.bsky.social
! Douwe Kiela's plot below is maybe the scariest one for AI progress — LLM benchmarks are saturating at an accelerating rate. BALROG to the rescue. This will keep us busy for years.
3
125
16
Top-nσ:
arxiv.org/abs/2411.07641
Similar to min-p (
arxiv.org/abs/2407.01082
), aims to cut overly low probs before sampling. while min-p is based on a % threshold of the max prob, Top-nσ notes that logits roughly follow a gaussian, and cuts any logit further than n sigmas below the max
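A small numpy sketch of that idea (function name and constants are mine; see the paper for the exact recipe): keep only tokens whose logit is within n standard deviations of the max logit, drop the rest, and sample from the renormalized distribution.

```python
import numpy as np

def top_n_sigma_sample(logits, n=1.0, temperature=1.0, rng=np.random.default_rng()):
    """Keep only tokens whose logit is within n standard deviations of the max logit."""
    cutoff = logits.max() - n * logits.std()
    masked = np.where(logits >= cutoff, logits / temperature, -np.inf)  # drop the tail entirely
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

logits = np.random.default_rng(0).normal(scale=2.0, size=50)
logits[7] += 8.0          # one clearly dominant token
print(top_n_sigma_sample(logits, n=1.0))
```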
11 months ago
1
14
0
reposted by
Arthur Douillard
Lucas Beyer (bl16)
11 months ago
French startup H Company released their first model/product today: Runner H, a web agent that's a 3B VLM. The startup was co-founded by Charles Kantor and 4 DeepMinders: Laurent Sifre, Karl Tuyls, Daan Wierstra and Julien Perolat in May. The latter 3 left in August.
www.hcompany.ai/blog/a-resea...
1
28
4
DeepSeek released DeepSeek-R1, an "equivalent" to OpenAI's o1:
api-docs.deepseek.com/news/news1120
Given that DeepSeek has been very open in the past (e.g.
github.com/deepseek-ai
), I'm very hopeful they will disclose more details about R1 too
11 months ago
2
10
1
Min-p Sampling:
arxiv.org/abs/2407.01082
1. Get the max prob
2. Compute the min prob: a threshold in [0, 1] times that max prob
3. Keep only the tokens whose prob is above that min prob
4. Sample from that pool, according to the renormalized probs
More robust to changes in temperature!
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs
Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. However, popular sampling methods like top-p (nucleus…
https://arxiv.org/abs/2407.01082
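A minimal numpy sketch of those four steps (function name and example values are mine):

```python
import numpy as np

def min_p_sample(logits, p_base=0.1, temperature=1.0, rng=np.random.default_rng()):
    """Min-p sampling: keep tokens whose prob is >= p_base * max prob, renormalize, sample."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()                              # softmax
    threshold = p_base * probs.max()                  # steps 1-2: min prob scales with the max prob
    kept = np.where(probs >= threshold, probs, 0.0)   # step 3: drop the low-prob tail
    kept /= kept.sum()                                # step 4: renormalize...
    return rng.choice(len(logits), p=kept)            # ...and sample from the remaining pool

logits = np.array([5.0, 4.2, 1.0, -2.0, -3.0])
print(min_p_sample(logits, p_base=0.1))
```

Because the cutoff scales with the max probability, the kept pool automatically shrinks when the model is confident and widens when it is not, which is why it holds up better when the temperature changes.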
11 months ago
0
17
5
This graph from SemiAnalysis (
semianalysis.com/2024/03/13/ai-datacenter-energy-dilemma-race/
) is quite crazy. We better start building SMR factories asap
11 months ago
1
9
1
DiLoCo (
arxiv.org/abs/2311.08105
) is a distributed algorithm allowing us to do a kind of data-parallelism while communicating hundreds of times less. PrimeIntellect's folks made an open-source version and are currently training a 10B model across the world with DiLoCo. They're almost finished, at 90%!
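For readers new to it, a minimal numpy sketch of DiLoCo's two-level optimization (the toy quadratic loss and all constants are illustrative; the paper uses AdamW inside and Nesterov momentum outside): each replica takes many local inner steps, and only the resulting deltas are averaged and fed to an outer optimizer, so communication happens once every H inner steps instead of every step.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, outer_steps, inner_steps = 10, 4, 20, 50
inner_lr, outer_lr, momentum = 0.05, 0.7, 0.9

target = rng.normal(size=dim)          # optimum of a toy quadratic loss
theta = np.zeros(dim)                  # parameters shared by all replicas
velocity = np.zeros(dim)               # state of the outer optimizer

for t in range(outer_steps):
    deltas = []
    for w in range(n_workers):         # in practice each replica runs in parallel, far apart
        local = theta.copy()
        for _ in range(inner_steps):   # H local inner steps, zero communication
            noise = rng.normal(scale=0.1, size=dim)          # each worker sees slightly different data
            local -= inner_lr * (local - (target + noise))   # plain SGD on the toy loss
        deltas.append(theta - local)   # "outer gradient": how far this replica moved
    outer_grad = np.mean(deltas, axis=0)   # the only communication: one all-reduce per outer step
    velocity = momentum * velocity + outer_grad
    theta -= outer_lr * (outer_grad + momentum * velocity)   # Nesterov-style outer update

print("distance to optimum:", np.linalg.norm(theta - target))
```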
11 months ago
0
9
2
reposted by
Arthur Douillard
Lucas Beyer (bl16)
11 months ago
They really missed an opportunity for "Willkommen in La Forêt" here. That being said, nice to see Europe going strong!
2
40
1