How good are LLMs at 🔭 scientific computing and visualization 🔭?
AstroVisBench tests how well LLMs implement scientific workflows in astronomy and visualize results.
SOTA models like Gemini 2.5 Pro & Claude 4 Opus only match ground truth scientific utility 16% of the time. 🧵