Been really enjoying unfaithful CoT research with collaborators recently. Two observations:
1) It quickly becomes clear that models sneak reasoning in without verbalising where it comes from (e.g. writing down an equation that gets the correct answer, but whose form appears out of thin air)
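A toy illustration of that pattern: scan the earlier reasoning steps for numeric constants, then flag any constant in the final equation that was never mentioned before. This is a hypothetical sketch, not any group's actual methodology; `unverbalized_constants` and the example transcript are made up for illustration.

```python
import re

NUM = re.compile(r"-?\d+(?:\.\d+)?")

def unverbalized_constants(cot_steps, final_equation):
    """Flag numeric constants in the final equation that never
    appeared anywhere in the earlier chain-of-thought steps."""
    seen = set()
    for step in cot_steps:
        seen.update(NUM.findall(step))
    return [c for c in NUM.findall(final_equation) if c not in seen]

# Hypothetical transcript: the price 12 is never stated in the CoT,
# yet the final equation uses it (and the answer built from it).
steps = [
    "We need the total cost of 6 tickets.",
    "Each ticket costs the same amount.",
]
print(unverbalized_constants(steps, "total = 6 * 12 = 72"))  # → ['12', '72']
```

A real analysis would need far more care (units, rephrased numbers, arithmetic closure over stated values), but even this crude check surfaces equations that lean on quantities the model never verbalised.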
about 1 year ago