Jindřich Libovický
@jlibovicky.bsky.social
📤 544
📥 248
📝 40
Researcher at Charles University | multilingual natural language processing, machine translation
3rd run teaching ML to 250+ bachelor students (with great materials originally by
@straka-milan.bsky.social
). Core philosophy: explain the math, implement algorithms from scratch, Kaggle-style competitions, all auto-graded.
ufal.mff.cuni.cz/courses/npfl...
But look what LLMs did to the course 👇
Introduction to Machine Learning with Python | ÚFAL
https://ufal.mff.cuni.cz/courses/npfl129/2526-winter
about 19 hours ago
1
2
0
Spent time making AI-generated images of Bayes' Rule, Laplace Smoothing, Markov Chains & Shannon Entropy for class today 🎨🤖 Even though the images are objectively hilarious, none of the 50 students in the room laughed. Or even smiled. 💀
6 days ago
1
4
1
I reviewed papers evaluating LLM values using sociology questionnaires. Different methods, different results. Didn't trust them, so I tested it myself. Methodology matters. Short answers vs CoT, squared err vs KL div.: each changes which populations an LLM "aligns" with.
www.arxiv.org/pdf/2602.04033
https://www.arxiv.org/pdf/2602.04033
21 days ago
1
4
0
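A minimal sketch of why the choice between squared error and KL divergence in the post above can change which population a model appears closest to. All numbers are invented for illustration and the names `model`, `populations`, and `closest` are hypothetical; this is not the setup or data from the linked paper.

```python
import math

def squared_error(p, q):
    """Sum of squared differences between two answer distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

# Hypothetical answer distributions over three questionnaire options.
model = [0.90, 0.09, 0.01]
populations = {
    "A": [0.80, 0.05, 0.15],  # puts mass on an option the model almost never picks
    "B": [0.70, 0.29, 0.01],  # differs more overall, but not on the rare option
}

def closest(metric):
    """Name of the population whose distribution is nearest to the model."""
    return min(populations, key=lambda name: metric(populations[name], model))

print(closest(squared_error))  # A
print(closest(kl_divergence))  # B
```

Squared error treats every option equally, while KL divergence heavily penalizes the model for assigning almost no probability to an option that population A picks fairly often, so the two metrics disagree on this toy data.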
We have updated the pre-print on CUS-QA, a benchmark for regional knowledge about Czechia, Slovakia, and Ukraine
arxiv.org/abs/2507.22752
Now, there are results of retrieval-augmented generation and more detailed analysis of model performance depending on the topic of the question or visual context.
about 1 month ago
0
7
1
We (= mostly
@abyste.bsky.social
) developed a way to evaluate how morphological a
#tokenization
is w/o gold segmentation labels.
arxiv.org/abs/2601.18536
The key: align subword tokens with morphological features from UniMorph using IBM Model 1. To appear in EACL 2026 Findings.
Evaluating Morphological Plausibility of Subword Tokenization via Statistical Alignment with Morpho-Syntactic Features
We present a novel metric for the evaluation of the morphological plausibility of subword segmentation. Unlike the typically used morpheme boundary or retrieval F-score, which requires gold segmentati...
https://arxiv.org/abs/2601.18536
about 1 month ago
1
9
2
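A rough sketch of the idea in the post above: IBM Model 1 trained with EM to align subword tokens with morphological features. The toy segmentations, the feature tags, and the function names `train_ibm1` and `align` are assumptions made for illustration. This is not the authors' implementation, it omits the NULL token of the full model, and it does not compute the metric from the paper.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate t(feature | subword) with EM, as in IBM Model 1 (no NULL token)."""
    subwords = {s for subs, _ in pairs for s in subs}
    features = {f for _, feats in pairs for f in feats}
    # Uniform initialization of the lexical table.
    t = {(f, s): 1.0 / len(features) for f in features for s in subwords}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each feature over the subwords of its word.
        for subs, feats in pairs:
            for f in feats:
                norm = sum(t[(f, s)] for s in subs)
                for s in subs:
                    posterior = t[(f, s)] / norm
                    count[(f, s)] += posterior
                    total[s] += posterior
        # M-step: renormalize the expected counts per subword.
        t = {(f, s): count[(f, s)] / total[s] for (f, s) in t}
    return t

def align(subs, feats, t):
    """Map each feature to the subword that most likely generated it."""
    return {f: max(subs, key=lambda s: t[(f, s)]) for f in feats}

# Invented toy data: (subword segmentation, UniMorph-style feature tags).
pairs = [
    (["walk", "ed"], ["V", "PST"]),
    (["walk", "s"], ["V", "PRS"]),
    (["jump", "ed"], ["V", "PST"]),
    (["jump", "s"], ["V", "PRS"]),
]

t = train_ibm1(pairs)
print(align(["walk", "ed"], ["V", "PST"], t))
```

On this toy data the suffix `ed` attracts `PST` and the stem attracts `V`, which is the kind of signal that needs no gold segmentation labels.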
Happy holidays! 🎄🎅🤩🎁
3 months ago
0
3
0
Attenzione! 🇮🇹 Know Piedmontese or Neapolitan speakers?
@gianlucavico.bsky.social
is collecting crowd-sourced translations to evaluate LLM performance on these regional languages. Partecipate!
4 months ago
0
2
1
With
@andrei-a-manea.bsky.social
, we posted a survey on multilingual vision-language models 👉
arxiv.org/pdf/2509.22123
We reviewed 31 models+21 benchmarks. There's a tension between language neutrality (same results across languages) & cultural awareness (context matters differently across cultures)
https://arxiv.org/pdf/2509.22123
5 months ago
1
3
2
So proud of my PhD student
@andrei-a-manea.bsky.social
for his first first-author publication! 🎉 He presented this work last week at TSD. Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders
arxiv.org/pdf/2504.21681
6 months ago
1
6
0
🧵 We're releasing CUS-QA - a new benchmark for testing LLMs on regional knowledge! Find out what your model knows about Czechia 🇨🇿, Slovakia 🇸🇰, and Ukraine 🇺🇦! 👉 Textual and visual questions, answers, and human judgment on model outputs!
huggingface.co/datasets/ufa...
www.arxiv.org/abs/2507.22752
ufal/cus-qa · Datasets at Hugging Face
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
https://huggingface.co/datasets/ufal/cus-qa
7 months ago
1
16
6
Stay tuned, we will release the dataset soon...
7 months ago
0
2
0
reposted by
Jindřich Libovický
Jindra Helcl
7 months ago
We need to have poster fights at the end of every conference.
0
3
1
Just presented MAGBIG, a new dataset and evaluation methodology for gender bias in multilingual text-to-image generation. Grammatical gender matters when studying these biases across languages! Thanks to Felix Friedrich,
@kathaem.bsky.social
and all co-authors - it was fun to work on this together!
7 months ago
0
2
0
This week I am at
#ACL2025NLP
in Vienna 🎡🇦🇹. Find me 🕵️ or message 💌 me if you want to chat about multilinguality or tokenization. Stop 🛑 by our poster on gender bias in text-to-image generation on Monday
aclanthology.org/2025.acl-lon...
8 months ago
0
7
0
reposted by
Jindřich Libovický
Tokenization Workshop (TokShop) @ICML2025
9 months ago
TokShop @
#ICML2025
got way more submissions than expected! 📈 We could really use a few more reviewers to help out. If you have the capacity to review a
#tokenization
paper by Saturday, please fill out this form:
forms.gle/32A6sQHQrMSb...
🙏
TokShop 2025
Registering interest in all things tokenization at TokShop @ ICML 2025 (July 18) Consider joining the Google group for future updates! https://groups.google.com/g/tokshop
https://forms.gle/32A6sQHQrMSb6hpE9
0
0
6
reposted by
Jindřich Libovický
Tokenization Workshop (TokShop) @ICML2025
10 months ago
📣 Call for Paper Alert: TokShop @ ICML 2025 TokShop explores tokenization across all data modalities. Topics include: subword NLP techniques, multimodal approaches, multilingual challenges, post-training modification, alternative representations, and statistical perspectives.
ICML 2025 Workshop TokShop
Welcome to the OpenReview homepage for ICML 2025 Workshop TokShop
https://openreview.net/group?id=ICML.cc/2025/Workshop/TokShop
1
18
14
reposted by
Jindřich Libovický
Tokenization Workshop (TokShop) @ICML2025
10 months ago
Got a tokenization paper that just didn't make the cut for ICML? Submit it to the Tokenization Workshop TokShop at
#ICML2025
-- we'd love to see it there!
tokenization-workshop.github.io
Tokenization Workshop @ ICML 2025
https://tokenization-workshop.github.io/
0
7
6
Attending
#NAACL2025
virtually. Since 2022, I've been training a classifier on papers I read to tackle the arXiv madness. Ran it on the NAACL proceedings for my personalized watch list. 🤓📺 However, it's far from perfect: Multilingual cultural awareness is great, but where is tokenization? 🤷
10 months ago
2
2
0
We're organizing ✨Tokenization Workshop✨ TokShop❗ Join us at
@icmlconf.bsky.social
in July in Vancouver 🇨🇦. Follow
@tokshop.bsky.social
for updates! Submit your paper by May 30.
11 months ago
0
4
0
Random take on the
#TuringTest
: Rather than testing machine intelligence, it can be a measure of societal awareness about
#AI
capabilities. The real objective isn't creating a machine that passes but educating people to think critically and avoid being deceived, so the machines do not pass the test.
11 months ago
0
4
0
Summaries of pre-prints that I noticed and liked on arXiv in March are now on my blog
jlibovicky.github.io//2025/04/02/...
Highlights from Machine Translation and Multilinguality in March 2025
EuroBERT: Scaling Multilingual Encoders for European Languages
https://jlibovicky.github.io//2025/04/02/MTML-Highlights-March.html
11 months ago
0
4
0
Our paper 'Beyond Literal Token Overlap: Token Alignability for Multilinguality' will be at
#NAACL2025
! We show that token alignability is a stronger predictor of cross-lingual transfer than literal token overlap. Read it here:
arxiv.org/abs/2502.06468
12 months ago
0
6
2
Short notes about what pre-prints I noticed in December and January are now on my blog:
jlibovicky.github.io/2025/02/07/M...
Highlights from Machine Translation and Multilinguality in December 2024 and January 2025
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost
https://jlibovicky.github.io/2025/02/07/MTML-Highlights-December-and-January.html
about 1 year ago
0
3
0
Join Mu-SHROOM 🍄, a SemEval 2025 shared task on detecting hallucination spans in multilingual LLM outputs! 🌍 Includes Czech with regional Czech questions 🇨🇿. Do you think you can spot when something isn’t true? 🤔 Try it out! 👉
helsinki-nlp.github.io/shroom
#SemEval2025
#NLP
Welcome to SemEval-2025 Task-3 — Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
https://helsinki-nlp.github.io/shroom
about 1 year ago
0
4
1
Happy holidays! 🎄🎅🤩🎁
about 1 year ago
0
8
0
Highlights from multilingual
#NLP
and machine translation papers I found on arXiv in November are now on my blog:
jlibovicky.github.io/2024/12/05/M...
Highlights from Machine Translation and Multilinguality in November 2024
Mitigating Metric Bias in Minimum Bayes Risk Decoding
https://jlibovicky.github.io/2024/12/05/MTML-Highlights-November.html
over 1 year ago
0
14
0
This is going to be fun! 🤓 We have three years to spend 6.5M CZK on improving multilingual tokenization. The goal is to make subwords more alignable across languages and help languages that suffer from over-segmentation with current models.
over 1 year ago
2
11
1
Just shared my takeaways from
#EMNLP2024
on my blog:
jlibovicky.github.io//2024/11/21/...
Notes from EMNLP 2024
Last week, I was at EMNLP in Miami, and here are a few notes about what I saw at the conference.
https://jlibovicky.github.io//2024/11/21/Notes-from-EMNLP-2024.html
over 1 year ago
4
39
4
reposted by
Jindřich Libovický
Institute of Formal and Applied Linguistics
over 1 year ago
Hello Blue Sky! 👋 This is the official account of the Institute of Formal and Applied Linguistics (ÚFAL for short) at the Faculty of Mathematics and Physics, Charles University in Prague 🇨🇿. Here, we will share news from the life of the institute and our members.
0
16
5
Hello, Blue Sky 👋 🦋 Looking forward to reading and sometimes also posting about
#NLP
and related stuff! 🤓
over 1 year ago
0
6
0