Sergei Vassilvitskii
@vsergei.bsky.social
Algorithms, predictions, privacy.
https://theory.stanford.edu/~sergei/
Synthetic Data is all the rage in LLM training, but why does it work? In arxiv.org/abs/2502.08924 we show how to analyze this question through the lens of boosting. Unlike boosting, however, our assumptions on the data and the learning method are inverted.
Escaping Collapse: The Strength of Weak Data for Large Language Model Training
Synthetically-generated data plays an increasingly larger role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper...
https://arxiv.org/abs/2502.08924
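
A toy sketch of the boosting analogy (my own illustration under strong simplifying assumptions, not the paper's algorithm): treat the model as a single Bernoulli parameter p, the probability of emitting a "good" sample, and let a weak filter accept good samples only slightly more often than bad ones. The 0.6/0.4 acceptance rates below are made-up stand-ins for a better-than-chance verifier.

import random

def generate(p, n):
    # Sample n synthetic examples from the current model: True = good, False = bad.
    return [random.random() < p for _ in range(n)]

def weak_filter(example):
    # Weak verifier: accepts good examples with prob 0.6, bad ones with prob 0.4 (made-up rates).
    return random.random() < (0.6 if example else 0.4)

def retrain(examples):
    # "Training" here just refits p to the curated synthetic batch.
    return sum(examples) / len(examples) if examples else 0.0

p = 0.5  # initial model quality
for round_ in range(20):
    curated = [x for x in generate(p, 10_000) if weak_filter(x)]
    p = retrain(curated)  # without the filter, retraining would only echo the model's own distribution
    print(f"round {round_:2d}: model quality p = {p:.3f}")

In this toy, the curated fraction of good samples is 0.6p / (0.6p + 0.4(1 - p)), which exceeds p whenever 0 < p < 1, so even a weak filter improves the model round over round; dropping the filter leaves p drifting around its old value, roughly the collapse setting the paper's title alludes to.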
12 months ago