Jonathan Berant
@jonathanberant.bsky.social
380 followers · 101 following · 11 posts
NLP at Tel Aviv Uni and Google DeepMind
reposted by
Jonathan Berant
Jacob Eisenstein
3 months ago
With GDM friends Adam Fisch,
@jonathanberant.bsky.social
, Alekh Agarwal, and special guest Anastasios Angelopoulos.
reposted by
Jonathan Berant
Jacob Eisenstein
3 months ago
We offer cost-optimal policies for selecting which rater should annotate which examples; these policies link the cost, the annotation noise, and the *uncertainty* of the cheaper rater.
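To make the idea concrete, here is a minimal sketch of such a routing policy, assuming a simple linear trade-off (the function, constants, and threshold form are illustrative, not the paper's actual estimator):

```python
import numpy as np

def route_to_expensive(cheap_uncertainty, cost_ratio, noise_penalty):
    """Illustrative routing rule: upgrade an example to the expensive rater
    only when the cheap rater's uncertainty is high enough that the expected
    reduction in annotation noise outweighs the extra cost.

    cheap_uncertainty: per-example uncertainty of the cheap rater, in [0, 1]
    cost_ratio: cost(expensive rater) / cost(cheap rater)
    noise_penalty: assumed price of one unit of cheap-rater noise
    """
    expected_benefit = noise_penalty * cheap_uncertainty
    return expected_benefit > cost_ratio  # True -> send to expensive rater

# Toy usage: five examples, a 10x cost ratio.
u = np.array([0.05, 0.2, 0.5, 0.8, 0.95])
print(route_to_expensive(u, cost_ratio=10.0, noise_penalty=25.0))
# [False False  True  True  True]
```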
reposted by
Jonathan Berant
Jacob Eisenstein
3 months ago
Cheap but noisy? Or accurate but expensive? How to split a limited annotation budget between different types of judges? 👩‍⚖️🤖🦧
www.arxiv.org/abs/2506.07949
Cost-Optimal Active AI Model Evaluation
The development lifecycle of generative AI systems requires continual evaluation, data acquisition, and annotation, which is costly in both resources and time. In practice, rapid iteration often makes...
http://www.arxiv.org/abs/2506.07949
reposted by
Jonathan Berant
Jacob Eisenstein
6 months ago
An ablation reveals the importance of mechanism design: when the helper identities are known to the asker during training (CSP-DeAnon), calibrated hedging is no longer learned.
reposted by
Jonathan Berant
Jacob Eisenstein
6 months ago
In practice, collaborative self-play + reinforced self-training (ReST) lead to improved task performance, better calibration of confidence markers, and more efficient tool use.
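For context, ReST is a grow/filter/improve loop; here is a schematic of how it would wrap the self-play games (policy.sample, policy.finetune, and reward_fn are hypothetical stand-ins for real generation and training code):

```python
def rest_loop(policy, prompts, reward_fn, n_iters=3, n_samples=8, tau=0.7):
    # Schematic ReST outer loop: sample rollouts, keep the good ones,
    # fine-tune on them, repeat.
    for _ in range(n_iters):
        # Grow: sample rollouts from the current policy.
        rollouts = [(p, r) for p in prompts for r in policy.sample(p, n_samples)]
        # Filter: keep rollouts whose (effort-penalized) reward clears tau.
        kept = [(p, r) for p, r in rollouts if reward_fn(p, r) >= tau]
        # Improve: fine-tune on the surviving rollouts.
        policy.finetune(kept)
    return policy
```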
reposted by
Jonathan Berant
Jacob Eisenstein
6 months ago
A bit of game theory can help explain when this can work: we model the setup as a game of public utility provision, where the public utility is the extra information provided by the costly retrieval action. The game has a unique equilibrium when the tools are sufficiently distinct (or both bad).
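A toy version of that game makes the claim concrete. Below, each helper decides whether to pay a cost c to retrieve, both players enjoy the best information retrieved, and we enumerate pure-strategy equilibria (the payoff structure and numbers are my illustration, not the paper's exact model):

```python
import itertools

def payoffs(a1, a2, v1, v2, c):
    # a_i = 1 if helper i pays cost c to invoke its retrieval tool.
    # The "public utility" is the best information anyone retrieved,
    # and both players benefit from it regardless of who paid.
    info = max(a1 * v1, a2 * v2)
    return info - a1 * c, info - a2 * c

def pure_equilibria(v1, v2, c):
    eqs = []
    for a1, a2 in itertools.product([0, 1], repeat=2):
        u1, u2 = payoffs(a1, a2, v1, v2, c)
        # Equilibrium: neither player gains by unilaterally flipping.
        if (u1 >= payoffs(1 - a1, a2, v1, v2, c)[0]
                and u2 >= payoffs(a1, 1 - a2, v1, v2, c)[1]):
            eqs.append((a1, a2))
    return eqs

print(pure_equilibria(v1=1.0, v2=0.2, c=0.5))  # distinct tools -> [(1, 0)], unique
print(pure_equilibria(v1=1.0, v2=1.0, c=0.5))  # similar tools -> [(0, 1), (1, 0)]
print(pure_equilibria(v1=0.3, v2=0.2, c=0.5))  # both bad -> [(0, 0)], unique
```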
reposted by
Jonathan Berant
Jacob Eisenstein
6 months ago
Because the identity of each helper is hidden from the asker, it is forced to rely on confidence signals when faced with incompatible answers from the helpers. Maximizing effort-penalized accuracy of the full rollout can teach the LLM to use these confidence markers correctly.
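A sketch of what an effort-penalized rollout reward with a hedging escape hatch could look like (the constants and structure are my own illustration, not the paper's exact objective):

```python
def rollout_reward(final_answer, gold, tool_calls, hedged,
                   tool_cost=0.1, confident_error_penalty=1.0):
    # Pay for correctness, charge for every costly retrieval call,
    # and punish confident wrong answers hardest: hedging a wrong
    # answer softens the penalty, so calibrated hedging pays off.
    accuracy = 1.0 if final_answer == gold else 0.0
    effort = tool_cost * tool_calls
    if accuracy == 0.0 and not hedged:
        return -confident_error_penalty - effort
    return accuracy - effort

print(rollout_reward("Paris", "Rome", tool_calls=1, hedged=True))   # -0.1
print(rollout_reward("Paris", "Rome", tool_calls=1, hedged=False))  # -1.1
print(rollout_reward("Rome", "Rome", tool_calls=1, hedged=False))   #  0.9
```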
reposted by
Jonathan Berant
Jacob Eisenstein
6 months ago
We focus on two capabilities: knowing when to use a costly retrieval tool, and hedging non-confident answers. To teach these capabilities, we create a small multi-agent society, in which two "helpers" can use specialized retrieval tools to pass information back to an "asker".
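One rollout of that society, schematically (all method names are hypothetical stand-ins for the actual agent implementations; note the shuffle that hides helper identities, which the thread discusses above):

```python
import random

def collaborative_rollout(asker, helpers, question):
    messages = []
    for helper in helpers:
        # Each helper decides whether to pay for its specialized retrieval tool.
        evidence = helper.retrieve(question) if helper.should_retrieve(question) else None
        # Helper answers may carry confidence markers ("I'm fairly sure...").
        messages.append(helper.answer(question, evidence))
    # Shuffle so the asker cannot tell which helper produced which answer.
    random.shuffle(messages)
    return asker.decide(question, messages)
```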
reposted by
Jonathan Berant
Jacob Eisenstein
6 months ago
We all want LLMs to collaborate with humans to help them achieve their goals. But LLMs are not trained to collaborate; they are trained to imitate. Can we teach LM agents to help humans by first making them help each other?
arxiv.org/abs/2503.14481
Don't lie to your friends: Learning what you know from collaborative self-play
To be helpful assistants, AI agents must be aware of their own capabilities and limitations. This includes knowing when to answer from parametric knowledge versus using tools, when to trust tool outpu...
https://arxiv.org/abs/2503.14481
reposted by
Jonathan Berant
Ted Underwood
6 months ago
A way to help models "be aware of their own capabilities and limitations" from
@jacobeisenstein.bsky.social
et al:
arxiv.org/abs/2503.14481
#MLSky
Fun work led by
@amouyalsamuel.bsky.social
and with Aya. Coming in, I didn't think LLMs would have difficulty answering questions about some of the GP sentences we used, but it turns out they did! See Samuel's thread for more info...
6 months ago
reposted by
Jonathan Berant
6 months ago
I had a lot of fun working on this with Aya Meltzer-Asscher and
@jonathanberant.bsky.social
. We will soon release our materials, human results, LLM results, and all the cool images the models produced for our sentences.
arxiv.org/abs/2502.09307
When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models
Modern Large Language Models (LLMs) have shown human-like abilities in many language tasks, sparking interest in comparing LLMs' and humans' language processing. In this paper, we conduct a detailed c...
https://arxiv.org/abs/2502.09307
reposted by
Jonathan Berant
6 months ago
One intriguing follow-up: some component of the cognitive model of sentence understanding fails on GP sentences. Is this component also present in LLMs? If not, why are so many LLMs influenced by our manipulations in the same way humans are?
reposted by
Jonathan Berant
6 months ago
There are many more cool insights you can find in our paper. One takeaway for the psycholinguistics community: run your reading comprehension experiment on LLMs first. You might get a general idea of the human results. (Last image, I swear)
reposted by
Jonathan Berant
6 months ago
These experiments replicated the results of the sentence comprehension experiment: our manipulations had the same effect on paraphrase and drawing correctness as they had on sentence comprehension. In this image: While the teacher taught the puppies looked at the board.
reposted by
Jonathan Berant
6 months ago
We also ran two additional experiments with LLMs that are challenging to run on humans: 1. We asked the LLM to paraphrase our sentences. 2. We asked text-to-image models to draw the sentences. In this image: While the horse pulled the submarine moved silently.
reposted by
Jonathan Berant
6 months ago
To answer our second question, we ran the same sentence comprehension experiment we ran on humans on over 60 LLMs. We found that LLMs also struggle with GP sentences and that, interestingly, the manipulations we devised to test our hypotheses affected LLMs just as they did humans.
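The LLM arm of such an experiment can be a short loop; here is a sketch assuming a generic query_fn and yes/no comprehension questions (field names and prompt format are illustrative, not the paper's exact protocol):

```python
def comprehension_accuracy(models, items, query_fn):
    # items: [{"sentence": ..., "question": ..., "gold": "yes"/"no"}, ...]
    # query_fn(model, prompt) -> model's text reply (stand-in for a real API).
    results = {}
    for model in models:
        correct = 0
        for item in items:
            prompt = (f"Read this sentence: {item['sentence']}\n"
                      f"Question: {item['question']}\n"
                      f"Answer yes or no.")
            reply = query_fn(model, prompt).strip().lower()
            correct += reply.startswith(item["gold"])
        results[model] = correct / len(items)
    return results

# e.g. items = [{"sentence": "While the horse pulled the submarine moved silently.",
#                "question": "Did the horse pull the submarine?", "gold": "no"}]
```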
reposted by
Jonathan Berant
6 months ago
In our latest paper with Aya Meltzer-Asscher and
@jonathanberant.bsky.social
, we try to answer both these questions. We devise hypotheses explaining why GP sentences are harder to process and test them. Human subjects answered a reading comprehension question about a sentence they read.
reposted by
Jonathan Berant
6 months ago
The old man the boat. You probably had to read that sentence twice. That's because it's a garden path (GP) sentence. GP sentences are read more slowly and often misunderstood. This raises two questions: 1. Why are these sentences harder to process? 2. How do LLMs deal with them?
reposted by
Jonathan Berant
Ziteng Sun
7 months ago
Inference-time procedures (e.g. Best-of-N, CoT) have been instrumental to the recent development of LLMs. Standard RLHF focuses only on improving the trained model. This creates a train/inference mismatch. Can we align our model to better suit a given inference-time procedure? Check out below.
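As a reminder of what the inference side looks like, here is Best-of-N in a few lines (a generic sketch; the helper names are hypothetical):

```python
def best_of_n(sample_fn, reward_fn, prompt, n=8):
    # Draw n candidates from the policy and return the one the reward
    # model scores highest. Standard RLHF optimizes single-sample
    # quality, yet this is what actually runs at inference time:
    # that's the train/inference mismatch.
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))
```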
reposted by
Jonathan Berant
Ahmad Beirami
9 months ago
Excited to share InfAlign! The alignment optimization objective implicitly assumes sampling from the resulting aligned model. But we are increasingly using different, and sometimes sophisticated, inference-time compute algorithms. How do we resolve this discrepancy? 🧵
reposted by
Jonathan Berant
Alexandre Lacoste
9 months ago
We're really excited to release this large collaborative work unifying web agent benchmarks under the same roof. In this TMLR paper, we dive in depth into
#BrowserGym
and
#AgentLab
. We also present some unexpected performance results from Claude 3.5-Sonnet.
I will also be at NeurIPS! Happy to chat about post-training, reasoning, and interesting ways you use multiple agents for things.
10 months ago
reposted by
Jonathan Berant
Alexandre Lacoste
10 months ago
🧵 1/ We are thrilled to release
#AgentLab
, a new open-source package for developing and evaluating web agents. This builds on the new
#BrowserGym
package which supports 10 different benchmarks, including
#WebArena
.
reposted by
Jonathan Berant
Yoav Artzi
10 months ago
I am seriously behind on uploading Learning Machines videos, but I did want to get
@jonathanberant.bsky.social
's out sooner rather than later. It's not only a great talk; it also gives a remarkably broad overview and contextualization, so it's an excellent way to ramp up on post-training.
youtu.be/2AthqCX3h8U
Jonathan Berant (Tel Aviv University / Google) / Towards Robust Language Model Post-training
YouTube video by Yoav Artzi
https://youtu.be/2AthqCX3h8U
reposted by
Jonathan Berant
Marc Lanctot
10 months ago
Student Researcher positions in EMEA now accepting applications! Please repost.
www.google.com/about/career...
Student Researcher, 2025 – Google Careers
https://www.google.com/about/careers/applications/jobs/results/139039912904008390-student-researcher-2025