Tired of messy, non-replicable multilingual LLM evaluation? So were we.
In our new paper, we experimentally illustrate common evaluation issues and show how structured evaluation design, transparent reporting, and meta-evaluation can help us build stronger models.
7 months ago