Xing Han Lu
@xhluca.bsky.social
📤 681
📥 164
📝 57
👨🍳 Web Agents
@mila-quebec.bsky.social
🎒
@mcgill-nlp.bsky.social
reposted by
Xing Han Lu
Gaurav Kamath
4 months ago
Our new paper in
#PNAS
(
bit.ly/4fcWfma
) presents a surprising finding—when words change meaning, older speakers rapidly adopt the new usage; inter-generational differences are often minor. w/ Michelle Yang, @sivareddyg.bsky.social ,
@msonderegger.bsky.social
and
@dallascard.bsky.social
👇(1/12)
3
34
19
reposted by
Xing Han Lu
Cesare
5 months ago
A blizzard is raging through Montreal when your friend says “Looks like Florida out there!” Humans easily interpret irony, while LLMs struggle with it. We propose a 𝘳𝘩𝘦𝘵𝘰𝘳𝘪𝘤𝘢𝘭-𝘴𝘵𝘳𝘢𝘵𝘦𝘨𝘺-𝘢𝘸𝘢𝘳𝘦 probabilistic framework as a solution. Paper:
arxiv.org/abs/2506.09301
to appear @
#ACL2025
(Main)
1
15
11
"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).
arxiv.org/abs/2506.10953
5 months ago
0
6
4
reposted by
Xing Han Lu
Benno Krojer
5 months ago
Excited to share the results of my recent internship! We ask 🤔 What subtle shortcuts are VideoLLMs taking on spatio-temporal questions? And how can we instead curate shortcut-robust examples at a large scale? We release MVPBench. Details 👇🔬
1
16
5
reposted by
Xing Han Lu
Ziling Cheng
5 months ago
Do LLMs hallucinate randomly? Not quite. Our
#ACL2025
(Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably. 📎 Paper:
arxiv.org/abs/2505.22630
1/n
1
46
21
reposted by
Xing Han Lu
Mila - Institut québécois d'IA
7 months ago
Congratulations to Mila members
@adadtur.bsky.social
, Gaurav Kamath and
@sivareddyg.bsky.social
for their SAC award at NAACL! Check out Ada's talk in Session I: Oral/Poster 6. Paper:
arxiv.org/abs/2502.05670
0
13
10
reposted by
Xing Han Lu
Karolina Stańczak
7 months ago
Exciting release! AgentRewardBench offers that much-needed closer look at evaluating agent capabilities: automatic vs. human eval. Important findings here, especially on the popular LLM judges. Amazing work by
@xhluca.bsky.social
& team!
1
3
1
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories. We are releasing the first benchmark to measure how well automatic evaluators, such as LLM judges, can assess web agent trajectories.
7 months ago
1
7
5
reposted by
Xing Han Lu
Sara Vera Marjanovic
7 months ago
And thoughtology is now on Arxiv! Read more about R1 reasoning 🐋💭 across visual, cultural and psycholinguistic tasks at the link below: 🔗
arxiv.org/abs/2504.07128
0
5
1
DeepSeek-R1 Thoughtology: Let's <think> about LLM reasoning. A 142-page report diving into the reasoning chains of R1. It spans 9 unique axes: safety, world modeling, faithfulness, long context, etc. Now on arXiv:
arxiv.org/abs/2504.07128
7 months ago
1
6
1
reposted by
Xing Han Lu
Siva Reddy
8 months ago
Introducing the DeepSeek-R1 Thoughtology -- the most comprehensive study of R1 reasoning chains/thoughts ✨. Probably everything you need to know about R1 thoughts. If we missed something, please let us know.
0
17
5
reposted by
Xing Han Lu
Sara Vera Marjanovic
8 months ago
Models like DeepSeek-R1 🐋 mark a fundamental shift in how LLMs approach complex problems. In our preprint on R1 Thoughtology, we study R1’s reasoning chains across a variety of tasks; investigating its capabilities, limitations, and behaviour. 🔗:
mcgill-nlp.github.io/thoughtology/
1
52
25
reposted by
Xing Han Lu
Marius Mosbach
8 months ago
Check out our new workshop on Actionable Interpretability @ ICML 2025. We are also looking forward to submissions that take a position on the future of interpretability research more broadly. 👇
0
9
1
reposted by
Xing Han Lu
VLMs4All - CVPR 2025 Workshop
8 months ago
📢Excited to announce our upcoming workshop - Vision Language Models For All: Building Geo-Diverse and Culturally Aware Vision-Language Models (VLMs-4-All) @CVPR 2025! 🌐
sites.google.com/view/vlms4all
1
17
15
reposted by
Xing Han Lu
Parishad BehnamGhader
8 months ago
Instruction-following retrievers can efficiently and accurately search for harmful and sensitive information on the internet! 🌐💣 Retrievers need to be aligned too! 🚨🚨🚨 Work done with the wonderful Nick and
@sivareddyg.bsky.social
🔗
mcgill-nlp.github.io/malicious-ir/
Thread: 🧵👇
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader, Nicholas Meade, Siva Reddy
https://mcgill-nlp.github.io/malicious-ir/
1
12
8
reposted by
Xing Han Lu
Spandana Gella
8 months ago
Web agents powered by LLMs can solve complex tasks, but our analysis shows that they can also be easily misused to automate harmful tasks. See the thread below for more details on our new web agent safety benchmark: SafeArena and Agent Risk Assessment framework (ARIA).
0
5
2
reposted by
Xing Han Lu
Karolina Stańczak
8 months ago
The potential for malicious misuse of LLM agents is a serious threat. That's why we created SafeArena, a safety benchmark for web agents. See the thread and our paper for details:
arxiv.org/abs/2503.04957
👇
0
9
2
reposted by
Xing Han Lu
Arkil Patel
8 months ago
Llamas browsing the web look cute, but they are capable of causing a lot of harm! Check out our new Web Agents ∩ Safety benchmark: SafeArena! Paper:
arxiv.org/abs/2503.04957
0
9
3
Agents like OpenAI Operator can solve complex computer tasks, but what happens when users direct them to cause harm, e.g. to spread misinformation? To find out, we introduce SafeArena (
safearena.github.io
), a benchmark to assess the capabilities of web agents to complete harmful web tasks. A thread 👇
8 months ago
1
17
12
reposted by
Xing Han Lu
Karolina Stańczak
9 months ago
📢New Paper Alert!🚀 Human alignment balances social expectations, economic incentives, and legal frameworks. What if LLM alignment worked the same way?🤔 Our latest work explores how social, economic, and contractual alignment can address incomplete contracts in LLM alignment🧵
2
28
16
reposted by
Xing Han Lu
Vaibhav
9 months ago
Check out the new MMTEB benchmark🙌 if you are looking for an extensive, reproducible and open-source evaluation of text embedders!
0
3
1
I'm fortunate to have collaborated with a team of brilliant researchers on this colossal project 🎊 Among the tasks I contributed, I'm most excited about the contextual web element retrieval task derived from WebLINX, which I think is a crucial component for building web agents!
9 months ago
0
2
0
reposted by
Xing Han Lu
Arkil Patel
9 months ago
Presenting ✨ 𝐂𝐇𝐀𝐒𝐄: 𝐆𝐞𝐧𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐜𝐡𝐚𝐥𝐥𝐞𝐧𝐠𝐢𝐧𝐠 𝐬𝐲𝐧𝐭𝐡𝐞𝐭𝐢𝐜 𝐝𝐚𝐭𝐚 𝐟𝐨𝐫 𝐞𝐯𝐚𝐥𝐮𝐚𝐭𝐢𝐨𝐧 ✨ Work w/ fantastic advisors Dima Bahdanau and
@sivareddyg.bsky.social
Thread 🧵:
1
17
9
reposted by
Xing Han Lu
Kenneth Enevoldsen
9 months ago
I am delighted to announce that we have released 🎊 MMTEB 🎊, a large-scale collaboration working on efficient multilingual evaluation of embedding models. This work implements >500 evaluation tasks across >1000 languages and covers a wide range of use cases and domains🩺👩💻⚖️
1
27
8
reposted by
Xing Han Lu
Nouha Dziri
10 months ago
Interested in knowing more about LLM agents and in contributing to this topic?🚀 📢We're thrilled to announce REALM: The first Workshop for Research on Agent Language Models 🤖
#ACL2025NLP
in Vienna 🎻 We have an exciting lineup of speakers 🗓️ Submit your work by *March 1st*
@aclmeeting.bsky.social
1
13
5
Glad to see BM25S (
bm25s.github.io
) has been downloaded 1M times on PyPI 🎉 Numbers aside, it makes me happy to hear about the positive experiences of friends working on retrieval. It's good to know that people near me are enjoying it! Discussion:
github.com/xhluca/bm25s/discussions
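For readers who have not tried the library, here is a minimal sketch of the typical BM25S flow, loosely following the project's README; the exact helper names (bm25s.tokenize, BM25.index, BM25.retrieve) and their defaults should be checked against the current documentation.

```python
import bm25s  # pip install bm25s

# Toy corpus; in practice this is your document collection.
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
    "a bird is a beautiful animal that can fly",
]

# Tokenize the corpus, build the index, then retrieve the top-k documents for a query.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

query_tokens = bm25s.tokenize("does the cat purr?")
results, scores = retriever.retrieve(query_tokens, k=2)
print(results, scores)  # top-2 document indices and their BM25 scores
```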
10 months ago
1
13
1
Retrieval seems to be a rather challenging problem even in the era of LLMs: a lot of benchmarks do not seem to be saturated yet, e.g. the best score on a 7-year-old benchmark like DBpedia is around 0.53 NDCG@10. I wonder if it's a lack of focus or if these are truly challenging problems to solve...
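For context on the NDCG@10 figure above, here is a small self-contained sketch of the metric using its standard definition, not tied to any particular benchmark harness. For simplicity it normalizes against the retrieved list only, whereas real evaluation normalizes against all judged relevant documents for the query.

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k: DCG of the ranked list divided by the DCG of its ideal reordering."""
    def dcg(rels):
        # Graded relevance discounted by log2(rank + 1), with ranks starting at 1.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance labels for the top 10 retrieved documents.
print(round(ndcg_at_k([3, 2, 0, 1, 0, 0, 2, 0, 0, 1]), 3))
```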
11 months ago
1
1
0
reposted by
Xing Han Lu
Jeremy Howard
11 months ago
I'll get straight to the point. We trained 2 new models. Like BERT, but modern. ModernBERT. Not some hypey GenAI thing, but a proper workhorse model, for retrieval, classification, etc. Real practical stuff. It's much faster, more accurate, longer context, and more useful. 🧵
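As a quick illustration of the "workhorse" framing, encoder models in this family are typically loadable through Hugging Face transformers. The sketch below assumes a checkpoint ID like answerdotai/ModernBERT-base and a transformers release recent enough to include the architecture; check the announcement for the exact names.

```python
from transformers import pipeline

# Assumed checkpoint name; requires a transformers version that ships ModernBERT support.
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
print(fill_mask("Paris is the [MASK] of France.")[0]["token_str"])
```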
19
620
181
Really glad that this work is out! AgentLab and BrowserGym will, in my opinion, be very important components of web agent research and play an important role in the toolkit of most web agent researchers. Read the paper if you are interested in learning more about what the platform covers!
11 months ago
0
2
0
reposted by
Xing Han Lu
11 months ago
Glad to be part of this great collaborative effort 😊
0
1
1
reposted by
Xing Han Lu
Alexandre Lacoste
11 months ago
We’re really excited to release this large collaborative work for unifying web agent benchmarks under the same roof. In this TMLR paper, we dive in-depth into
#BrowserGym
and
#AgentLab
. We also present some unexpected performances from Claude 3.5-Sonnet
1
20
13
reposted by
Xing Han Lu
Benno Krojer
11 months ago
Finally, it's handy that all my Twitter posts got migrated here to bsky: I'll be presenting AURORA at
@neuripsconf.bsky.social
on Wednesday! Come by to discuss text-guided editing (and why imo it is more interesting than image generation), world modeling, evals and vision-and-language reasoning
1
24
2
reposted by
Xing Han Lu
Oscar Mañas
11 months ago
Tomorrow at 3:15pm I'll be presenting my work at
@mila-quebec.bsky.social
's booth (#104) at
@neuripsconf.bsky.social
. Come to learn more about controlling multimodal LLMs via reward-guided decoding! 🔗
openreview.net/forum?id=VWJ...
Controlling Multimodal LLMs via Reward-guided Decoding
As Multimodal Large Language Models (MLLMs) gain widespread applicability, it is becoming increasingly desirable to adapt them for diverse user needs. In this paper, we study the adaptation of...
https://openreview.net/forum?id=VWJWqKdeld#discussion
0
11
3
reposted by
Xing Han Lu
Alexandre Lacoste
12 months ago
Awesome Starter Pack. Thanks
@xhluca.bsky.social
0
3
1
I've created a starter pack of researchers working on digital agents (focusing on web, mobile and OS agents). I am missing a lot, and many are not on bsky yet, so if I missed you or someone you know, please send me a DM with the link to a relevant paper and I will update the starter pack!
12 months ago
1
10
3
reposted by
Xing Han Lu
Alexandre Lacoste
12 months ago
🧵-1 We are thrilled to release
#AgentLab
, a new open-source package for developing and evaluating web agents. This builds on the new
#BrowserGym
package which supports 10 different benchmarks, including
#WebArena
.
2
18
15
reposted by
Xing Han Lu
Stella Biderman
12 months ago
A dataset of 1 million or 2 million Bluesky posts is completely irrelevant to training large language models. The primary usecase for the datasets that people are losing their shit over isn't ChatGPT, it's social science research and developing systems that improve Bluesky.
8
252
45
reposted by
Xing Han Lu
Omar Sanseviero
12 months ago
I'm disheartened by how toxic and violent some responses were here. There was a mistake, a quick follow-up to mitigate, and an apology. I worked with Daniel for years, and he is one of the people most concerned with the ethical implications of AI. Some replies are Reddit-toxic level. We need empathy.
29
334
46
reposted by
Xing Han Lu
merve
12 months ago
Small yet mighty! 💫 We are releasing SmolVLM: a new 2B small vision language model made for on-device use, fine-tunable on a consumer GPU, and immensely memory efficient 🤠 We release three checkpoints under Apache 2.0: SmolVLM-Instruct, SmolVLM-Synthetic and SmolVLM-Base
huggingface.co/collections/...
11
159
31
reposted by
Xing Han Lu
Rupali Bhati
12 months ago
Mila is such a large community. One starter pack just isn’t enough! After
@josephdviviano.bsky.social
’s Mila list filled up, I decided to make another one. Will continue to add members until this one is full too.
go.bsky.app/9nXTDHo
4
33
9
reposted by
Xing Han Lu
Kyle Lo
12 months ago
Excited to share OLMo 2! 🐟 7B and 13B weights, trained up to 4-5T tokens, fully open data, code, etc 🐠 better architecture and recipe for training stability 🐡 staged training, with new data mix Dolmino🍕 added during annealing 🦈 state-of-the-art OLMo 2 Instruct models
#nlp
#mlsky
links below👇
1
68
13
reposted by
Xing Han Lu
McGill NLP
12 months ago
It turns out we had even more papers at EMNLP! Let's complete the list with three more🧵
1
14
5
reposted by
Xing Han Lu
McGill NLP
12 months ago
Our lab members recently presented 3 papers at
@emnlpmeeting.bsky.social
in Miami ☀️ 📜 From interpretability to bias/fairness and cultural understanding -> 🧵
1
19
8
csv.DictReader is such an underrated tool. It's really neat that you can just load CSV rows as dictionaries in pure Python.
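A minimal illustration of the point above; the file name and column names are made up for the example.

```python
import csv

# Hypothetical file results.csv with the header row "model,score".
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # Each row is a plain dict keyed by the header fields.
        print(row["model"], float(row["score"]))
```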
12 months ago
0
2
0