Wessel Poelman (@wpoelman.bsky.social)

New EACL paper (with @mdlhx.bsky.social)! We tested if comparing perplexity of parallel data across languages is fair. Turns out: it depends. We show the choice of test set (even with consistent meaning) can flip conclusions about which language is easier to model. Paper: arxiv.org/abs/2601.10580

Form and Meaning in Intrinsic Multilingual Evaluations
Intrinsic evaluation metrics for conditional language models, such as perplexity or bits-per-character, are widely used in both mono- and multilingual settings. These metrics are rather straightforwar...
https://arxiv.org/abs/2601.10580

6 months ago