I have been thinking about this in light of Anthropic’s recent verbalization interp paper. It had no evidence convincing me that their verbalizations are faithful, but they are convincingly useful. Even wrong output can stimulate human creativity and increase the entropy of exploration.
add a skeleton here at some point
12 days ago