Florian Dorner
@flodorner.bsky.social
📤 80
📥 278
📝 43
PhD student in CS @ ETHZ / MPI-IS Theory of ML evaluation
https://flodorner.github.io/
pinned post!
At ICLR and interested in theory for LLMs? Join us at our poster to learn more about the (im)possibility of scaling laws for test-time scaling methods like Best-of-N when verification is imperfect!
about 1 month ago
1
3
2
Meet me at the Benchmarking workshop (sites.google.com/view/benchma...) at EurIPS on Saturday: We’ll present two works on errors in LLM-as-Judge and their impacts on benchmarking and test-time-scaling:
6 months ago
1
7
3
reposted by
Florian Dorner
Yatong Chen
6 months ago
I'll be @neuripsconf.bsky.social presenting Strategic Hypothesis Testing (spotlight!) tldr: Many high-stakes decisions (e.g., drug approval) rely on p-values, but people submitting evidence respond strategically even w/o p-hacking. Can we characterize this behavior & how policy shapes it? 1/n
1
17
3
reposted by
Florian Dorner
Tübingen AI Center
7 months ago
Congratulations also to Vivian Nastl (supervised by Moritz Hardt) and Ricardo Dominguez-Olmedo (Moritz Hardt and Bernhard Schölkopf) for winning 2025 Global Google PhD fellowships. Find out more about their work here:
is.mpg.de/en/news/vivi...
@maxplanckcampus.bsky.social
@unituebingen.bsky.social
Vivian Nastl and Ricardo Dominguez-Olmedo receive 2025 Google Ph.D. Fellowship
Program supports exceptional graduate students working on innovative research in computer science and related fields
https://is.mpg.de/en/news/vivian-nastl-and-ricardo-dominguez-olmedo-receive-2025-google-ph-d-fellowship
0
5
2
reposted by
Florian Dorner
Michael Saxon
7 months ago
The viral "Definition of AGI" paper tells you to read fake references which do not exist! Proof: different articles present at the specified journal/volume/page number, and their titles exist nowhere on any searchable repository. Take this as a warning to not use LMs to generate your references!
6
156
51
reposted by
Florian Dorner
Yatong Chen
8 months ago
We (w/ Moritz Hardt, Olawale Salaudeen and @joavanschoren.bsky.social) are organizing the Workshop on the Science of Benchmarking & Evaluating AI @euripsconf.bsky.social 2025 in Copenhagen! 📢 Call for Posters: rb.gy/kyid4f 📅 Deadline: Oct 10, 2025 (AoE) 🔗 More info: rebrand.ly/bg931sf
1
21
7
reposted by
Florian Dorner
Millicent Li
8 months ago
Wouldn’t it be great to have questions about LM internals answered in plain English? That’s the promise of verbalization interpretability. Unfortunately, our new paper shows that evaluating these methods is nuanced—and verbalizers might not tell us what we hope they do. 🧵👇1/8
1
26
9
Does anyone have background on this plot, compared to the 32% performance for o3-mini-high with tool use claimed by OpenAI in January?
#GPT5
#GPT-5
openai.com/index/introd...
openai.com/index/openai...
10 months ago
0
1
0
New blogpost by my colleague Ricardo, arguing that instead of limiting data collection from big labs, LMArena should publicly release all data for everyone.
ricardodominguez.github.io/blogs/arena....
How to Fix the Chatbot Arena? Release All Data
https://ricardodominguez.github.io/blogs/arena.html
about 1 year ago
1
1
0
In Singapore for #ICLR2025 and excited for two oral presentations on work I have contributed to! 🎉
about 1 year ago
1
0
0
Starting to believe @natolambert.bsky.social's take that the o1 plots are misleading [1] (in the sense that OpenAI cannot fully control test compute at inference time). In particular, it seems like scaling up test compute might require extensive retraining. [1] www.interconnects.ai/p/openais-o1...
over 1 year ago
0
2
0