New EACL paper (with @mdlhx.bsky.social)! We tested whether comparing perplexity on parallel data across languages is fair. Turns out: it depends. We show that the choice of test set, even with meaning held constant, can flip conclusions about which language is easier to model.
Paper: arxiv.org/abs/2601.10580