Hung-Yueh Chiang 9 months ago
We're excited to pre-release our latest work: Quamba2
🔧 Supports W4A8 / W4A16 / W4AX / W8A8 for Mamba1 and Mamba2
🚀 Achieves a 4× memory reduction and a 3× generation speedup (back-of-envelope math below)
⚡️ Enables 8B model inference on Orin Nano 8G at 13 tokens/sec
🔥 Outperforms W4A8KV4 Llama3-8B in both speed and quality
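
The memory numbers are easy to sanity-check. Here's a rough back-of-envelope sketch (ours, not from the Quamba2 paper): it counts only weight storage for an 8B-parameter model at each bit width, ignoring activations, SSM states, and runtime overhead.

```python
# Back-of-envelope weight-memory footprint for an 8B-parameter model
# at different weight precisions. Illustration only: real usage also
# includes activations, SSM states, and framework overhead.

def weight_memory_gb(n_params: float, weight_bits: int) -> float:
    """Weight storage in GB (1 GB = 2**30 bytes) for a given bit width."""
    return n_params * weight_bits / 8 / 2**30

N = 8e9  # 8B parameters

for name, bits in [("FP16", 16), ("W8A8", 8), ("W4A8 / W4A16", 4)]:
    print(f"{name:>14}: {weight_memory_gb(N, bits):5.1f} GB")
```

Going from FP16 (~14.9 GB of weights) to 4-bit (~3.7 GB) is the quoted 4× reduction, and it's why an 8B model can fit on an 8 GB Orin Nano when FP16 weights alone would not.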