I keep thinking about this demo of Llama 3.1 8B served at 15k-20k tok/s. I wouldn't have believed it if I hadn't seen it.
For reference, GPT 5.4 is currently being served at ~44 tok/s, and the highly optimized deployment of Qwen3-8B powering RStudio's Next Edit Suggestions is ~1,300 tok/s.
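To put the gap in perspective, the quoted figures work out to roughly the following ratios (a back-of-the-envelope sketch; all throughput numbers are the ones quoted above):

```python
# Rough speedup ratios implied by the quoted throughput numbers.
demo_tok_s = (15_000, 20_000)   # Llama 3.1 8B demo range
gpt_tok_s = 44                  # GPT 5.4 serving speed quoted above
qwen_tok_s = 1_300              # Qwen3-8B Next Edit Suggestions deployment

for t in demo_tok_s:
    # Simple ratios: how many times faster the demo is than each baseline.
    print(f"{t} tok/s is ~{t / gpt_tok_s:.0f}x GPT 5.4 "
          f"and ~{t / qwen_tok_s:.1f}x the Qwen3-8B deployment")
```

So even against the fastest baseline mentioned, the demo is an order of magnitude faster.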