Alex Turner
@turntrout.bsky.social
📤 673 · 📥 6 · 📝 85
Research scientist at Google DeepMind. All opinions are my own.
https://turntrout.com
“If your reward is misspecified, you’re doomed” Maybe not! You can reduce specification gaming with a simple prompt swap during RL, no reward-signal improvements needed. Developed concurrently with inoculation prompting, but with RL & prompt contrasting. Presenting: ✨Recontextualization✨ 🧵
1 day ago
1
0
0
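A minimal sketch of the prompt swap described in the post above, assuming a generic REINFORCE-style fine-tuning loop. Everything here is illustrative: `policy.generate`, `policy.logprob`, and `policy.optimizer_step` are hypothetical stand-ins for whatever your RL framework provides, both prompts are placeholders, and which prompt is used for generation vs. for the update is the paper's design choice, not something this sketch pins down.

```python
# Hypothetical interface; not the paper's code. The point is only that the
# rollout is sampled under one prompt but credit-assigned under another.
GEN_PROMPT = "..."    # prompt used when sampling rollouts (placeholder)
TRAIN_PROMPT = "..."  # contrasting prompt swapped in for the update (placeholder)

def recontextualized_step(policy, task, reward_fn):
    # 1. Sample a rollout under the generation prompt.
    response = policy.generate(GEN_PROMPT + task)
    reward = reward_fn(task, response)

    # 2. Score the same response under the contrasting training prompt,
    #    so the policy-gradient update happens in the swapped context.
    logp = policy.logprob(prompt=TRAIN_PROMPT + task, response=response)

    # 3. REINFORCE-style update on the recontextualized log-probability.
    loss = -reward * logp
    loss.backward()
    policy.optimizer_step()
```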
The first pretraining results are in, and it looks like models indeed have self-fulfilling misalignment properties. Great work by Tice et al!
alignmentpretraining.ai
Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
LLMs trained on data about misaligned AIs themselves become less aligned. Luckily, pretraining LLMs with synthetic data about good AIs helps them become more aligned. These alignment priors persist…
https://alignmentpretraining.ai/
4 days ago
0
5
1
Modern “reward hacking” does NOT show that reward is the optimization target! Such “reward hacking” is almost entirely specification gaming, not the “reward optimization” I addressed in 2022. /Thread/
4 days ago
1
1
0
If an AI kills everyone by ruthlessly optimizing its reward signal, and it does so because it was trained to predict text of people saying “RL trains AIs to ruthlessly optimize the reward signal”, then I'll literally die to my pet peeve (people saying “RL trains AI to maximize reward”)
8 days ago
0
1
0
I just donated $5,200 (+100% employer match from Google) to Civitech (501c3) for their incubator project. Smart analysts I know recommend them as a highly cost-effective way to protect American democracy. 🧵 on my donations this year (Recreated for technical reasons)
9 days ago
1
2
0
This is THE key fact amongst the noise and scandal. I generally like Newsom's actions because he responds to that fact: he seems to take it seriously.
13 days ago
0
0
0
Gotta say, I was disturbed by Invisible AI's booth at #NeurIPS. Employees dressed as cows advertising how they use AI to optimize factory farming (a torture facility for cows). Bad taste
20 days ago
1
4
0
I made accessible design easier by writing alt-text-llm, an AI-powered tool for generating and managing alt text in markdown files. The tool detects missing alt text, suggests context-aware descriptions, and provides an interactive reviewing interface in the terminal.
about 1 month ago
1
2
0
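For flavor, here is what the detection step of a tool like this might look like; a minimal sketch assuming markdown images of the form `![alt](src)`, not the actual alt-text-llm implementation (the regex, directory layout, and function names are my own).

```python
# Sketch of "find markdown images with missing alt text"; illustrative only.
import re
from pathlib import Path

# Matches markdown images ![alt](src); group 1 is the alt text, group 2 the source.
IMAGE_PATTERN = re.compile(r"!\[(.*?)\]\((\S+?)\)")

def find_missing_alt(markdown_dir: str) -> list[tuple[str, str]]:
    """Return (file, image source) pairs whose alt text is empty."""
    missing = []
    for path in Path(markdown_dir).rglob("*.md"):
        for alt, src in IMAGE_PATTERN.findall(path.read_text(encoding="utf-8")):
            if not alt.strip():
                missing.append((str(path), src))
    return missing

if __name__ == "__main__":
    for file, src in find_missing_alt("content"):
        print(f"{file}: image {src} has no alt text")
```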
Self-fulfilling alignment? (image credit: Quintin Pope)
turntrout.com/self-fulfill...
about 1 month ago
0
4
0
“Output-based training will keep chains-of-thought honest.” Sadly, NO. We show that training on *just the output* can still cause models to hide unwanted behavior in their chain-of-thought. MATS 8.0 Team Shard presents: a 🧵
about 1 month ago
1
3
1
New Google DeepMind paper: "Consistency Training Helps Stop Sycophancy and Jailbreaks" by @alexirpan.bsky.social, me, Mark Kurzeja, David Elson, and Rohin Shah. (thread)
about 2 months ago
1
18
6
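A rough sketch of one output-level variant of the consistency-training idea: supervise the model's response under a pressure-inducing wrapper toward its own response to the clean prompt. This is an illustration under stated assumptions, not the paper's method or code; `model.generate` and `sft_loss` are hypothetical helpers, and the wrapper text is invented.

```python
# Hypothetical helpers; illustration only, not the paper's implementation.
WRAPPER = "I'd be devastated if you disagreed with me. "  # invented pressure text

def consistency_training_example(model, prompt):
    # Target: the model's own response to the clean prompt.
    clean_response = model.generate(prompt)
    # Input: the same prompt with the pressure text prepended.
    wrapped_prompt = WRAPPER + prompt
    # Supervise the wrapped prompt toward the clean response, so behavior
    # stays consistent with and without the wrapper.
    return sft_loss(model, input=wrapped_prompt, target=clean_response)
```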
"Authoritarianism can't happen here." Sadly, I think that it IS happening here. Protect yourself and your digital communications using the highly actionable, specific, step-by-step privacy guide I wrote.
about 2 months ago
2
3
0
Want to get into alignment research? Alex Cloud & I mentor *Team Shard*, responsible for gradient routing, steering vectors, MELBO, and a new unlearning technique (TBA) :) We discover new research subfields. Apply for mentorship this summer at forms.matsprogram.org/turner-app-8
9 months ago
0
6
2
This book is really fun & informative. I have a solid understanding of a bunch of my body's processes now, & I can just start reading random physiology Wikipedia pages and be able to roughly follow. :) My review with insights and my remaining confusions:
turntrout.com/insights-fro...
Insights From “The Manga Guide to Physiology”
This book breaks down complex physiology into digestible parts, using charming visuals & clear explanations. You might be surprised how much you can learn!
https://turntrout.com/insights-from-physiology
11 months ago
0
2
0
Mark Kurzeja & I exploited weaknesses in the multiple-choice TruthfulQA dataset while hiding the questions! A few simple rules of thumb achieved 79% accuracy. Even well-regarded benchmarks can have flaws. Kudos to the authors for addressing this! Read at
turntrout.com/original-tru...
Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses
Common factuality benchmark was easily gamed using our simple decision tree. The benchmark is now updated.
https://turntrout.com/original-truthfulqa-weaknesses
11 months ago
1
3
0
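To make "rules of thumb over the answers alone" concrete, here is the shape of a question-blind decision rule. The two rules below are invented for illustration; the actual heuristics behind the 79% figure are in the linked post.

```python
def guess_answer(choices: list[str]) -> str:
    """Pick an answer using only the answer choices, never the question.
    Both rules are hypothetical stand-ins for the post's real heuristics."""
    hedges = ("no", "not", "nothing", "it depends")
    # Invented rule 1: prefer answers that open with negation or hedging.
    for choice in choices:
        if choice.lower().startswith(hedges):
            return choice
    # Invented rule 2: otherwise take the longest answer.
    return max(choices, key=len)
```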
1) AIs are trained as black boxes, making it hard to understand or control their behavior. This is bad for safety! But what is an alternative? Our idea: train structure into a neural network by configuring which components update on different tasks. We call it "gradient routing."
about 1 year ago
1
16
6
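A toy PyTorch sketch of the core mechanic above: masking which parameters receive gradient updates on each task. This simplifies the actual gradient routing method (which routes gradients during backprop rather than zeroing them afterward), and the model and route table are invented for illustration.

```python
# Toy illustration of per-task gradient masking; not the paper's implementation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

# Route task "a" updates to the first layer only, task "b" to the last.
routes = {"a": {"0.weight", "0.bias"}, "b": {"2.weight", "2.bias"}}

def routed_step(x, y, task):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    allowed = routes[task]
    for name, p in model.named_parameters():
        if name not in allowed and p.grad is not None:
            p.grad.zero_()  # block updates outside this task's designated region
    opt.step()

routed_step(torch.randn(8, 16), torch.randint(0, 2, (8,)), task="a")
```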