How do we reliably judge if AI companions are performing well on subjective, context-dependent, and deeply human tasks?
Excited to share the first paper from my postdoc (!!) investigating when LLMs are reliable judges - with empathic communication as a case study
🧵