Adam Gaber (@adam-gaber.bsky.social)

Leshem (Legend) Choshen @EMNLP about 2 months ago

LLMs can't generalize across 2 copies of the same language!🫠 Usually we blame multilinguality issues on syntax, tokenizer fragmentation, or data disproportion, so what if we eliminate these factors? In ongoing work, Adam Jaber, Bobby Cheng, and I ran a pretraining experiment doing just that.