Announcing 🥂 FineWeb2: A sparkling update with 1000s of 🗣️ languages.
We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.
🥂 FineWeb2 has 8TB of compressed text data and outperforms other datasets.
10 months ago