We pitted LLMs against themselves in different languages across games 🎮
Usually when an LLM is worse at a task in some language, we go "it's a language issue". But isn't that circular?
If the task is made of language, how do you tell if the model lacks skills, or just struggles to read the question? 🧵