Costa Huang
@vwxyzjn.bsky.social
📤 476
📥 129
📝 48
RL + LLM
@ai2.bsky.social
; main dev of
https://cleanrl.dev/
🥘 Excited to share our latest OLMo 1B models! Almost summer RL time. We did another two-stage RL:
* The first RLVR run uses allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
* The final RLVR run uses allenai/RLVR-MATH for targeted MATH improvement
Short 🧵
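For reference, a minimal sketch of pulling those two RLVR mixes with the Hugging Face datasets library (my own snippet, not from the release; the "train" split name is an assumption):

```python
from datasets import load_dataset

# Stage 1: the mixed GSM/MATH/IF constraints set; Stage 2: MATH only.
stage1 = load_dataset("allenai/RLVR-GSM-MATH-IF-Mixed-Constraints", split="train")
stage2 = load_dataset("allenai/RLVR-MATH", split="train")

print(stage1)     # schema / number of rows
print(stage2[0])  # one prompt + ground-truth record for the verifier
```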
5 months ago
1
3
0
Introducing OLMo-2-0325-32B-Instruct! It's spring RL curve time. This time, we used GRPO for RLVR and trained a pretty nice fully open-source model!
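For context, GRPO drops the learned value baseline and instead standardizes each prompt's group of sampled rewards; a minimal sketch of the group-relative advantage (my own illustration, not the training code):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, G) verifiable rewards for G completions
    per prompt. Advantage = reward standardized within its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# 2 prompts, 4 sampled completions each, binary verifier rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```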
7 months ago
1
12
1
🔥 allenai/Llama-3.1-Tulu-3-8B (trained with PPO) -> allenai/Llama-3.1-Tulu-3.1-8B (trained with GRPO) We are happy to "quietly" release our latest GRPO-trained Tulu 3.1 model, which is considerably better in MATH and GSM8K!
8 months ago
1
22
7
🤯 Check out our new iOS OLMoE app that runs the model on-device! We also trained a new OLMoE-1B-7B-0125, this time using the Tulu 3 recipe. Very exciting that RLVR improved GSM8K by almost 10 points for OLMoE 🔥 A quick 🧵
8 months ago
1
5
1
I nerd-sniped myself over the
@deepseek.bsky.social
GRPO's use of John Schulman's kl3 estimator. I can now see why: when directly minimizing the KL loss, kl3 appears much more numerically stable. And the ≥0 guarantee here is also really nice (kl1 can go negative).
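A tiny numerical check of the two estimators from Schulman's "Approximating KL Divergence" note (my own sketch; the log-probs are random stand-ins): with r = p_ref/p_policy and samples drawn from the policy, k1 = -log r is unbiased but can be negative per-sample, while k3 = (r - 1) - log r is also unbiased and always ≥ 0, since r - 1 ≥ log r for all r > 0.

```python
import torch

torch.manual_seed(0)
# Per-token log-probs under the sampling (policy) and reference models.
logp_policy = torch.randn(8)
logp_ref = logp_policy - 0.3 * torch.randn(8)

logr = logp_ref - logp_policy    # log r
k1 = -logr                       # can dip below zero on individual samples
k3 = torch.exp(logr) - logr - 1  # never negative: r - 1 >= log r

print(k1.mean().item(), k3.mean().item())    # both estimate KL(policy || ref)
print((k1 < 0).any().item(), (k3 < 0).any().item())
```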
8 months ago
1
5
0
We released the OLMo 2 report! Ready for some more RL curves? 😏 This time, we applied RLVR iteratively! Our initial RLVR checkpoint on the RLVR dataset mix showed a low GSM8K score, so we ran another RLVR pass on GSM8K only, then another on MATH only 😆. And it works! A thread 🧵 1/N
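A rough sketch of that iterative loop (the mix and MATH dataset names come from the earlier posts; the GSM-only name and the rlvr_train helper are hypothetical stand-ins):

```python
def rlvr_train(checkpoint: str, dataset: str) -> str:
    """Hypothetical stand-in for one RLVR stage: a real run trains the
    policy against verifiable rewards and returns a new checkpoint."""
    print(f"RLVR stage: {checkpoint} on {dataset}")
    return f"{checkpoint}+rlvr"

stages = [
    "allenai/RLVR-GSM-MATH-IF-Mixed-Constraints",  # full mix first
    "allenai/RLVR-GSM",                            # assumed GSM8K-only name
    "allenai/RLVR-MATH",                           # then MATH only
]

checkpoint = "OLMo-2-DPO"  # placeholder starting point
for dataset in stages:     # each stage resumes from the previous one
    checkpoint = rlvr_train(checkpoint, dataset)
```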
9 months ago
1
12
6
reposted by
Costa Huang
Nathan Lambert
10 months ago
Maybe my favorite unexpected crossover model from the community. SmolLM from HuggingFace trained with part of the Tülu 3 recipe we released a month ago :D. Cool numerical explorations of post-training stuff. Nothing crazy. SultanR/SmolTulu-1.7b-Instruct
https://buff.ly/41Epauv
0
37
3
reposted by
Costa Huang
Valentina Pyatkin
10 months ago
I'll be attending NeurIPS in Vancouver next week! I would love to chat about LLM post-training research, the faculty job market, and anything in between - ping me if you'd like to meet up! You can also find me at the following:
4
47
2
So happy OLMo 2 is out! We applied the same Tülu 3 RLVR recipe and it worked very nicely for our final 13B instruct model. Here are the gains/losses of allenai/OLMo-2-1124-13B-Instruct (the RLVR checkpoint) over allenai/OLMo-2-1124-13B-DPO. More to share soon!
10 months ago
1
14
0
reposted by
Costa Huang
Ai2
10 months ago
Meet OLMo 2, the best fully open language model to date, including a family of 7B and 13B models trained up to 5T tokens. OLMo 2 outperforms other fully open models and competes with open-weight models like Llama 3.1 8B — As always, we released our data, code, recipes and more 🎁
5
151
48
reposted by
Costa Huang
vmoens
10 months ago
One of my fav projects: LeanRL, a simple RL library that provides recipes for fast RL training using torch.compile and cudagraphs. Using these, we got >6x speed-ups compared to the original CleanRL implementations.
github.com/pytorch-labs...
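The core of that speed-up, as I understand it (a minimal sketch, not LeanRL's actual code): compile the update step with mode="reduce-overhead", which captures it into CUDA graphs and removes per-step launch overhead; keeping shapes static lets the captured graph be replayed.

```python
import torch

policy = torch.nn.Linear(64, 4).cuda()
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

@torch.compile(mode="reduce-overhead")  # enables CUDA graph capture
def update(obs: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    loss = torch.nn.functional.mse_loss(policy(obs), target)
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss

obs = torch.randn(256, 64, device="cuda")
target = torch.randn(256, 4, device="cuda")
for _ in range(10):  # static shapes keep the graph reusable
    update(obs, target)
```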
2
32
5
👀
@araffin.bsky.social
says he can’t @ me. What’s going on?
@bsky.app
10 months ago
1
0
0
reposted by
Costa Huang
Nathan Lambert
10 months ago
I've spent the last two years scouring all available resources on RLHF specifically and post training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier model post training recipe. We beat Llama 3.1 Instruct. Thread.
8
212
52