Leandro von Werra
@lvwerra.bsky.social
1146
51
11
Research @ Hugging Face
Distributed training is notoriously hard to learn - knowledge is scattered across papers and complex codebases. Enter picotron: implementing all 4D parallelism concepts in separate, readable files totaling just 1988 LoC!
9 months ago
2
3
0
reposted by
Leandro von Werra
merve
9 months ago
supercharge your LLM apps with smolagents 🔥 however cool your LLM is, without being agentic it can only go so far. enter smolagents: a new agent library by
@hf.co
to make the LLM write code, do analysis and automate boring stuff!
huggingface.co/blog/smolage...
2
88
20
reposted by
Leandro von Werra
Anton
9 months ago
Introducing FineMath: the best open math pre-training dataset with 50B+ tokens! Math remains challenging for LLMs and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
huggingface.co/datasets/Hug...
Here's a breakdown 🧵
2
45
16
Releasing Jupyter Agents - LLMs running data analysis directly in a notebook! The agent can load data, execute code, plot results, and follow your guidance and ideas! A very natural way to collaborate with an LLM over data, and it just scratches the surface of what will soon be possible!
9 months ago
1
13
4
reposted by
Leandro von Werra
Lewis Tunstall
9 months ago
We outperform Llama 70B with Llama 3B on hard math by scaling test-time compute 🔥 How? By combining step-wise reward models with tree search algorithms :) We're open sourcing the full recipe and sharing a detailed blog post
4
109
22
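The post above does not spell out the recipe, so what follows is only a toy sketch of the general idea it names: a tree search (here, beam search) that uses a step-wise reward to prune partial solutions. All names (`beam_search`, `toy_propose`, `toy_reward`, `TARGET`) are hypothetical. In a real setup the proposal function would presumably sample next reasoning steps from an LLM and the reward would come from a learned step-wise reward model; here both are replaced by trivial arithmetic stand-ins.

```python
from typing import Callable, List

Path = List[int]  # a partial solution: the sequence of steps taken so far


def beam_search(
    propose_steps: Callable[[Path], List[int]],
    step_reward: Callable[[Path], float],
    beam_width: int,
    max_steps: int,
) -> Path:
    """Keep the `beam_width` best-scoring partial solutions after every step."""
    beams: List[Path] = [[]]
    for _ in range(max_steps):
        # Expand every surviving partial solution by one more step.
        candidates = [path + [step] for path in beams for step in propose_steps(path)]
        if not candidates:
            break
        # Score each partial solution and prune to the top `beam_width`.
        candidates.sort(key=step_reward, reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=step_reward)


# Toy stand-ins (assumptions, not the authors' method):
# each "step" adds 1, 2 or 3, and the reward prefers sums close to TARGET.
TARGET = 7


def toy_propose(path: Path) -> List[int]:
    return [1, 2, 3]


def toy_reward(path: Path) -> float:
    return -abs(TARGET - sum(path))


best = beam_search(toy_propose, toy_reward, beam_width=2, max_steps=3)
print(best, sum(best))
```

Raising `beam_width` or `max_steps` spends more compute at inference time in exchange for a wider search, which is the trade-off "scaling test-time compute" refers to.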
reposted by
Leandro von Werra
Entalpic
10 months ago
Big News in AI4Science! ✨ We are thrilled to launch LeMaterial, an open-source project in collaboration with
@hf.co
to accelerate materials discovery. Discover LeMat-Bulk: a 6.7M-entry dataset standardizing and unifying Materials Project, Alexandria and OQMD
2
11
7
reposted by
Leandro von Werra
Guilherme Penedo
10 months ago
Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages. We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages. 🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.
1
75
19
reposted by
Leandro von Werra
Thomas Wolf
10 months ago
The FineWeb team is happy to finally release "FineWeb2" 🥂🥳 FineWeb 2 extends the data-driven approach to pre-training dataset design that was introduced in FineWeb 1 to now cover 1893 languages/scripts. Details:
huggingface.co/datasets/Hug...
A detailed open-science tech report is coming soon
3
105
15
There are not many opportunities out there to build open LLMs and make them state-of-the-art, too! This is one of them.
10 months ago
0
16
1
reposted by
Leandro von Werra
Xenova
10 months ago
WOW! 🤯 Language models are becoming smaller and more capable than ever! Here's SmolLM2 running 100% locally in-browser w/ WebGPU on a 6-year-old GPU. Just look at that speed! ⚡️ Powered by 🤗 Transformers.js and ONNX Runtime Web! How many tokens/second do you get? Let me know!
2
46
13
Some people are pushing models to the top right of the plot following the scaling laws, others push them to the top left and make them faster and cheaper! We need both!
10 months ago
1
11
1
reposted by
Leandro von Werra
Anton
10 months ago
Check out how easy it is to do LLM evals with LightEval!
* any dataset on the 🤗 Hub can become an eval task in a few lines of code: customize the prompt, metrics, parsing, few-shots, everything!
* model- and data-parallel inference
* auto batching with the new vLLM backend
2
78
11
reposted by
Leandro von Werra
Thomas Wolf
10 months ago
It's Sunday morning, so taking a minute for a nerdy thread (on math, tokenizers and LLMs) about the work of our intern Garreth. By adding a few lines of code to the base Llama 3 tokenizer, he got a free boost in arithmetic performance [thread]
5
272
39
What's the secret sauce that lets SmolLM2 beat LLM titans like Llama3.2 and Qwen2.5? Unsurprisingly: data, data, data! The SmolTalk dataset is open and available here:
huggingface.co/datasets/Hug...
10 months ago
2
62
8
All the things you need to know to pretrain an LLM at home*! Gave a workshop at Uni Bern: it starts with scaling laws, moves on to web-scale data processing, and finishes with training using 4D parallelism and ZeRO. *assuming your home includes an H100 cluster
10 months ago
5
77
9