I missed Bloom, Anthropic's open-source framework for automating AI behavioral tests. It approaches LLM evaluation by:
> Using a 4-stage pipeline: "understanding," "ideation," "rollouts," "judgment."
> Generating new scenarios on each run to reduce the risk of test cases leaking into training data.
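
To make the two points above concrete, here is a rough toy sketch of how a four-stage loop like this could be wired together. It is not Bloom's actual API: every function name, the stubbed `model` and `judge` callables, and the random scenario generator are assumptions made up for illustration (a real system would presumably call LLMs at each stage).

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    scenario: str
    transcript: str


def understanding(behavior: str) -> str:
    # Stage 1 (hypothetical): turn a behavior name into a working description.
    return f"Target behavior: {behavior}"


def ideation(description: str, n: int, rng: random.Random) -> List[str]:
    # Stage 2 (hypothetical): produce fresh scenarios on every run.
    # Random choices stand in for LLM-generated scenarios.
    settings = ["customer support", "code review", "tutoring", "travel planning"]
    return [
        f"{description} | setting: {rng.choice(settings)} | variant {rng.randrange(10**6)}"
        for _ in range(n)
    ]


def rollouts(scenarios: List[str], model: Callable[[str], str]) -> List[Rollout]:
    # Stage 3 (hypothetical): run the model under test on each scenario.
    return [Rollout(s, model(s)) for s in scenarios]


def judgment(results: List[Rollout], judge: Callable[[Rollout], float]) -> float:
    # Stage 4 (hypothetical): score each transcript and average the scores.
    scores = [judge(r) for r in results]
    return sum(scores) / len(scores)


def run_pipeline(
    behavior: str,
    model: Callable[[str], str],
    judge: Callable[[Rollout], float],
    n: int = 3,
    seed: Optional[int] = None,
) -> Tuple[List[str], float]:
    # seed=None draws new scenarios each run; a fixed seed reproduces a run.
    rng = random.Random(seed)
    description = understanding(behavior)
    scenarios = ideation(description, n, rng)
    results = rollouts(scenarios, model)
    return scenarios, judgment(results, judge)
```

Calling `run_pipeline("sycophancy", model, judge)` with no seed draws a different scenario set on every run while still scoring the same behavior, which is the property that makes it harder for test cases to end up memorized.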
More in the 🧵