Kyle O’Brien
@kyletokens.bsky.social
📤 36 · 📥 102 · 📝 7
studying the minds on our computers | https://kyobrien.io
Reposted by Kyle O’Brien
Cas (Stephen Casper)
26 days ago
📌📌📌 I'm excited to be on the faculty job market this fall. I just updated my website with my CV.
stephencasper.com
Stephen Casper
https://stephencasper.com/
Reposted by Kyle O’Brien
Sharon Goldman
about 2 months ago
Thanks to
@stellaathena.bsky.social
for chatting with me about Deep Ignorance: the new paper/project from Eleuther AI and the UK AISI. Bottom line: Worried AI could teach people to build bioweapons? Don’t teach it how
fortune.com/2025/08/14/w...
AI safety tip: if you don’t want it giving bioweapon instructions, maybe don’t put them in the training data, say researchers
New research shows that scrubbing risky material from AI training data can build safeguards that are harder to bypass — and one author calls out tech giants for keeping such work under wraps.
https://fortune.com/2025/08/14/worried-ai-could-teach-people-to-build-bioweapons-dont-teach-it-how-say-researchers/
This article covers our work for a general audience. :)
about 2 months ago
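The core idea is to remove dual-use material before pretraining rather than trying to patch behavior afterward. Below is a minimal sketch of that kind of corpus filter; the blocklist terms, threshold, and classifier interface are illustrative placeholders, not the actual filters used in Deep Ignorance.

```python
# Minimal sketch of pretraining-data filtering: drop documents that a cheap
# blocklist check or an optional classifier flags as risky, so they never
# reach the model. Blocklist terms, threshold, and classifier interface are
# placeholders, not the filters from the Deep Ignorance paper.
from typing import Callable, Iterable, Iterator, Optional

BLOCKLIST = {"select agent", "aerosolized pathogen"}  # illustrative terms only

def is_risky(doc: str,
             classifier: Optional[Callable[[str], float]] = None,
             threshold: float = 0.5) -> bool:
    """Flag a document via keyword matching, then an optional ML classifier."""
    text = doc.lower()
    if any(term in text for term in BLOCKLIST):
        return True
    if classifier is not None:
        # classifier maps text -> probability that the document is risky
        return classifier(doc) >= threshold
    return False

def filter_corpus(docs: Iterable[str],
                  classifier: Optional[Callable[[str], float]] = None) -> Iterator[str]:
    """Yield only documents that pass the filter; the rest are never trained on."""
    for doc in docs:
        if not is_risky(doc, classifier):
            yield doc

# Usage: clean_docs = list(filter_corpus(raw_documents))
```

Because the model never sees the filtered material, the safeguard lives in the training data rather than being bolted on afterward, which is why the article describes it as harder to bypass.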
Big and True :)
about 2 months ago
I like that OpenAI published this. They were able to fine-tune away GPT-oss's refusal, decreasing refusal rates to ~0%. These results aren't surprising. Acknowledging that existing safeguards don't generalize to open models is the first step in developing solutions.
arxiv.org/abs/2508.031...
Estimating Worst-Case Frontier Risks of Open-Weight LLMs
In this paper, we study the worst-case frontier risks of releasing gpt-oss. We introduce malicious fine-tuning (MFT), where we attempt to elicit maximum capabilities by fine-tuning gpt-oss to be as ca...
https://arxiv.org/abs/2508.03153v1
about 2 months ago
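For context on the ~0% figure, a refusal rate is just the fraction of disallowed prompts the model declines to answer. Here is a hypothetical sketch of such a measurement using simple string matching; the marker phrases and the `generate` callable are assumptions, not OpenAI's actual evaluation pipeline.

```python
# Hypothetical refusal-rate measurement: sample a completion for each disallowed
# prompt and count string-matched refusals. Marker phrases and the `generate`
# callable are placeholders, not OpenAI's evaluation setup.
from typing import Callable, Sequence

REFUSAL_MARKERS = ("i can't help", "i cannot help", "i won't assist", "sorry, but")

def refusal_rate(prompts: Sequence[str], generate: Callable[[str], str]) -> float:
    """Fraction of prompts whose completion looks like a refusal."""
    refusals = 0
    for prompt in prompts:
        completion = generate(prompt).lower()
        if any(marker in completion for marker in REFUSAL_MARKERS):
            refusals += 1
    return refusals / max(len(prompts), 1)

# Comparing refusal_rate(prompts, base_model) with refusal_rate(prompts, finetuned_model)
# shows how quickly refusal training erodes under malicious fine-tuning.
```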
I've learned a lot over the past two years of getting into research, mostly from mistakes. I’ve made many mistakes. Such is science. Good research is often at the adjacent possible. I've written up much of what I've learned now that I'm beginning to mentor others.
open.substack.com/pub/kyletoke...
Don’t "Think", Just Think
Lessons From Breaking Into AI Research
https://open.substack.com/pub/kyletokens/p/dont-think-just-think?r=3gtmk8&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false
about 2 months ago
I led an effort at Microsoft last fall studying whether SAE steering is an effective way to improve jailbreak robustness. The paper has been accepted to the ICML Actionable Interpretability Workshop! Venue:
actionable-interpretability.github.io
Paper:
arxiv.org/abs/2411.11296
Steering Language Model Refusal with Sparse Autoencoders
Responsible deployment of language models requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we...
https://arxiv.org/abs/2411.11296
3 months ago
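For readers unfamiliar with the technique: SAE steering adds a sparse autoencoder feature direction to the model's hidden states at inference time to push a behavior (here, refusal) up or down. A minimal PyTorch sketch is below; the layer choice, feature index, scale, and decoder-weight layout are illustrative assumptions, not the configuration from the paper.

```python
# Minimal PyTorch sketch of SAE steering: add one SAE decoder direction to the
# hidden states at a single layer via a forward hook. Layer choice, feature
# index, scale, and the decoder-weight layout are illustrative assumptions.
import torch

def add_sae_steering_hook(layer_module: torch.nn.Module,
                          sae_decoder_weight: torch.Tensor,  # shape: (n_features, d_model)
                          feature_idx: int,
                          scale: float):
    """Register a hook that nudges activations along one SAE feature direction."""
    direction = sae_decoder_weight[feature_idx]
    direction = direction / direction.norm()  # unit-normalize the feature direction

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return layer_module.register_forward_hook(hook)

# Usage (illustrative): steer toward a refusal-associated feature during generation,
# then call handle.remove() to restore the unsteered model.
# handle = add_sae_steering_hook(model.model.layers[12], sae.W_dec, feature_idx=123, scale=4.0)
```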
I'll be in England this summer as an AI Safety Research Fellow with ERA!
erafellowship.org/fellowship
I will be studying data filtering and tamper-resistant unlearning for open-weight AI safety so that the community can continue to benefit from open models as capabilities improve.
Fellowship — ERA Fellowship
https://erafellowship.org/fellowship
4 months ago