Costa Huang
@vwxyzjn.bsky.social
📤 493 · 📥 129 · 📝 49
RL + LLM @ai2.bsky.social; main dev of https://cleanrl.dev/
Congrats on the launch!
2 months ago
🥘 Excited to share our latest OLMo 1B models! Almost summer RL time. We did another two-stage RL:
* The first RLVR run uses allenai/RLVR-GSM-MATH-IF-Mixed-Constraints
* The final RLVR run uses allenai/RLVR-MATH for targeted MATH improvement
Short 🧵
7 months ago
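For context, RLVR (reinforcement learning with verifiable rewards) scores each completion with a programmatic check instead of a learned reward model. A minimal sketch of what such a check could look like for GSM8K-style math problems; the function name and parsing convention are my assumptions, not the verifier used for these runs:

```python
import re

def gsm8k_reward(completion: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the last number in the model's
    completion matches the gold answer, else 0.0. Hypothetical sketch;
    the actual RLVR verifier may parse answers differently."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == gold_answer.strip() else 0.0
```

Because the reward is an exact-match check rather than a model score, a targeted stage like the allenai/RLVR-MATH run can push one benchmark directly.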
Introducing OLMo-2-0325-32B-Instruct! It's spring RL curve time. This time we used GRPO for RLVR and trained a pretty nice, fully open-source model!
9 months ago
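The core difference between GRPO and PPO is the baseline: GRPO samples a group of completions per prompt and standardizes their rewards within the group, so no value network is needed. A minimal sketch of that advantage computation; the shapes and names are my assumptions:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards, one row per prompt.
    GRPO's group-relative advantage standardizes each completion's reward
    against the mean/std of its own group."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)
```

Completions that beat their siblings in the group get positive advantages and the rest get negative ones, which is all the PPO-style policy-gradient step needs.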
🔥 allenai/Llama-3.1-Tulu-3-8B (trained with PPO) -> allenai/Llama-3.1-Tulu-3.1-8B (trained with GRPO) We are happy to "quietly" release our latest GRPO-trained Tulu 3.1 model, which is considerably better on MATH and GSM8K!
10 months ago
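The model id above is on the Hugging Face Hub, so a standard transformers snippet is enough to try the GRPO-trained checkpoint; the prompt and generation settings here are placeholder choices:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/Llama-3.1-Tulu-3.1-8B"  # from the post above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "What is 15% of 240?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```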
🤯 Check out our new iOS OLMoE app that runs the model on-device! We also trained new OLMoE-1B-7B-0125 this time using the Tulu 3 recipe. Very exciting that RLVR improved gsm8k by almost 10 points for OLMoE 🔥 A quick 🧵
10 months ago
I nerd-sniped myself over @deepseek.bsky.social GRPO's usage of John Schulman's kl3 estimator. I can now see why: when directly minimizing the KL loss, kl3 just appears much more numerically stable. And the ≥ 0 guarantee here is also really nice (kl1 can go negative).
10 months ago
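For reference, with tokens sampled from the policy π_θ and ratio r = π_ref(x)/π_θ(x), Schulman's estimators are k1 = -log r and k3 = (r - 1) - log r. Both are unbiased for KL(π_θ ‖ π_ref), but k3 is guaranteed non-negative since e^x ≥ 1 + x. A minimal PyTorch sketch; the variable names are mine:

```python
import torch

def kl_estimators(logprobs: torch.Tensor, ref_logprobs: torch.Tensor):
    """Per-token KL(pi_theta || pi_ref) estimators for tokens sampled
    from pi_theta. log_ratio = log(pi_ref / pi_theta)."""
    log_ratio = ref_logprobs - logprobs
    k1 = -log_ratio                          # unbiased, but can go negative
    k3 = torch.expm1(log_ratio) - log_ratio  # (r - 1) - log r: unbiased, >= 0
    return k1, k3
```

Using expm1 computes e^x - 1 without cancellation for small x, which matches the numerical-stability point above.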
We released the OLMo 2 report! Ready for some more RL curves? 😏 This time, we applied RLVR iteratively: our initial RLVR checkpoint on the RLVR dataset mix showed a low GSM8K score, so we did another RLVR run on GSM8K only and another on MATH only 😆. And it works! A thread 🧵 1/N
11 months ago
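The iterative procedure reads as plain sequential stages, each RLVR run initialized from the previous checkpoint. A hypothetical sketch of that schedule; run_rlvr, the stage names, and the checkpoint names are stand-ins, not the actual training entry points:

```python
def run_rlvr(start_from: str, dataset: str) -> str:
    """Hypothetical stand-in for one full RLVR training run; returns the
    name of the checkpoint it produces."""
    return f"{start_from}+rlvr({dataset})"

checkpoint = "olmo-2-dpo"  # placeholder for the DPO starting point
for dataset in ["rlvr-mix", "gsm8k-only", "math-only"]:  # stages per the post
    checkpoint = run_rlvr(start_from=checkpoint, dataset=dataset)
```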
Reposted by Costa Huang
Nathan Lambert · 12 months ago
Maybe my favorite unexpected crossover model from the community. SmolLM from Hugging Face trained with part of the Tülu 3 recipe we released a month ago :D. Cool numerical explorations of post-training stuff. Nothing crazy. SultanR/SmolTulu-1.7b-Instruct
https://buff.ly/41Epauv
Reposted by Costa Huang
Valentina Pyatkin · about 1 year ago
I'll be attending NeurIPS in Vancouver next week! I would love to chat about LLM post-training research, the faculty job market, and anything in between - ping me if you'd like to meet up! You can also find me at the following:
So happy OLMo 2 is out! We applied the same Tülu 3 RLVR recipe and it worked very nicely for our final 13B instruct model. Here are the gains/losses of allenai/OLMo-2-1124-13B-Instruct (the RLVR checkpoint) over allenai/OLMo-2-1124-13B-DPO. More to share soon!
about 1 year ago
Reposted by Costa Huang
Ai2 · about 1 year ago
Meet OLMo 2, the best fully open language model to date, including a family of 7B and 13B models trained on up to 5T tokens. OLMo 2 outperforms other fully open models and competes with open-weight models like Llama 3.1 8B — As always, we released our data, code, recipes and more 🎁
Reposted by Costa Huang
vmoens · about 1 year ago
One of my fav projects: LeanRL, a simple RL library that provides recipes for fast RL training using torch.compile and cudagraphs. Using these, we got >6x speed-ups compared to the original CleanRL implementations.
github.com/pytorch-labs...
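The speed-up mechanism is worth a look: torch.compile with mode="reduce-overhead" captures CUDA graphs, so a training step replays as one pre-recorded launch sequence instead of many small kernel launches. A minimal sketch of that pattern, not LeanRL's actual recipe; the toy policy and shapes are my assumptions:

```python
import torch
import torch.nn.functional as F

policy = torch.nn.Sequential(
    torch.nn.Linear(8, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
).cuda()
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

@torch.compile(mode="reduce-overhead")  # CUDA-graph capture cuts launch overhead
def loss_fn(obs, target):
    return F.mse_loss(policy(obs), target)

for _ in range(1000):
    # Keep batch shapes static so the captured graph can be replayed.
    obs = torch.randn(256, 8, device="cuda")
    target = torch.randn(256, 2, device="cuda")
    loss = loss_fn(obs, target)
    optim.zero_grad()
    loss.backward()
    optim.step()
```

Small-network RL like CleanRL's MLP agents is typically launch-bound rather than compute-bound, which is exactly where CUDA-graph replay helps most.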
👀 @araffin.bsky.social says he can't @ me. What's going on? @bsky.app
about 1 year ago
Reposted by Costa Huang
Nathan Lambert · about 1 year ago
I've spent the last two years scouring all available resources on RLHF specifically and post-training broadly. Today, with the help of a totally cracked team, we bring you the fruits of that labor — Tülu 3, an entirely open frontier-model post-training recipe. We beat Llama 3.1 Instruct. Thread.