Tyler Chang
@tylerachang.bsky.social
PhD student at UC San Diego. He/him/his.
https://tylerachang.github.io/
pinned post!
We scaled training data attribution (TDA) methods ~1000x to find influential pretraining examples for thousands of queries in an 8B-parameter LLM over the entire 160B-token C4 corpus!
medium.com/people-ai-re...
11 months ago
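(Not from the post itself: a minimal sketch of the gradient-dot-product influence scoring that training data attribution methods of this kind build on, in the spirit of TracIn. The toy linear model, synthetic data, and the `flat_grad` helper are illustrative assumptions, not the method from the linked write-up.)

```python
# Minimal sketch of gradient-based training data attribution, assuming a
# TracIn-style gradient dot product as the influence score. The toy model,
# synthetic data, and flat_grad helper are illustrative only; the linked
# write-up describes how scoring like this is scaled to an 8B-parameter LLM
# over the 160B-token C4 corpus.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 2)        # stand-in for the language model
loss_fn = nn.CrossEntropyLoss()

def flat_grad(x, y):
    """Flattened loss gradient w.r.t. all model parameters for one example."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

# Synthetic "pretraining corpus" and one query example.
train_x = torch.randn(100, 16)
train_y = torch.randint(0, 2, (100,))
query_x, query_y = torch.randn(16), torch.tensor(1)

# Influence of each training example on the query: gradient dot product.
query_grad = flat_grad(query_x, query_y)
scores = torch.stack([flat_grad(x, y) @ query_grad
                      for x, y in zip(train_x, train_y)])

# Indices of the most influential (proponent) training examples.
print(torch.topk(scores, k=5).indices.tolist())
```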
Very very excited that Global PIQA is out! This was an incredible effort by 300+ researchers from 65 countries. The resulting dataset is a high-quality, participatory, and culturally-specific benchmark for over 100 languages.
14 days ago
Reposted by Tyler Chang
Catherine Arnett
about 2 months ago
Did you know?
- 77% of language models on @hf.co are not tagged for any language
- For 95% of languages, most models are multilingual
- 88% of models with tags are trained on English
In a new blog post, @tylerachang.bsky.social and I dig into these trends and why they matter!
Reposted by Tyler Chang
Multilingual Representation Workshop @ EMNLP 2025
3 months ago
We have over 200 volunteers now for 90+ languages! We are hoping to expand the diversity of our language coverage and are still looking for participants who speak these languages. Check out how to get involved below, and please help us spread the word!
Reposted by Tyler Chang
Multilingual Representation Workshop @ EMNLP 2025
3 months ago
With six weeks left before the deadline, we have had over 50 volunteers sign up to contribute for over 30 languages. If you don't see your language represented on the map, this is your sign to get involved!
We're organizing a shared task to develop a multilingual physical commonsense reasoning evaluation dataset! Details on how to submit are at:
sigtyp.github.io/st2025-mrl.h...
5 months ago
Presenting our work on training data attribution for pretraining this morning:
iclr.cc/virtual/2025...
-- come stop by Hall 2/3 #526 if you're here at ICLR!
7 months ago
Reposted by Tyler Chang
Catherine Arnett
12 months ago
The Goldfish models were trained on byte-premium-scaled dataset sizes: if a language needs more bytes to encode a given amount of information, we scaled up its dataset according to the byte premium. Read about how we (@tylerachang.bsky.social) trained the models:
arxiv.org/pdf/2408.10441
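(Not from the post: a minimal sketch of byte-premium scaling under the assumption that each language's byte budget is a base budget multiplied by its byte premium. The premium values and base budget below are made-up placeholders; see the linked paper for the actual premiums and training setup.)

```python
# Minimal sketch of byte-premium-scaled dataset sizing, as described in the
# post above. The byte premiums and base budget are hypothetical placeholders,
# not values from the Goldfish paper.
BASE_BUDGET_BYTES = 1_000_000_000  # hypothetical per-language byte budget

# Hypothetical byte premiums: bytes a language needs, relative to English,
# to encode a comparable amount of information.
BYTE_PREMIUMS = {"eng": 1.0, "rus": 1.8, "khm": 3.5}

def scaled_dataset_bytes(lang: str) -> int:
    """Scale the base byte budget by the language's byte premium."""
    return int(BASE_BUDGET_BYTES * BYTE_PREMIUMS[lang])

for lang in BYTE_PREMIUMS:
    print(lang, scaled_dataset_bytes(lang))
```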
Reposted by Tyler Chang
Catherine Arnett
12 months ago
Tyler Chang's and my paper was awarded Outstanding Paper at #EMNLP2024! Thanks to the award committee for the recognition!