By running weekly experiments over a year, without specifying a model, or what tools were available to the model, strict replication is impossible.
The rubric developed could allow independent verification. Unfortunately, the "human in the loop" eval requires access to SciPak writers to verify.
add a skeleton here at some point
28 days ago