Evals
Introducing COMPOSITE-STEM: 70 expert-curated agentic tasks across physics, biology, chemistry, and math, compatible with the Harbor Framework.
In our recent paper, we introduce AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals.