Mono by KUSA Projects
2 Posts

Evals

Latest posts (2)

Introducing COMPOSITE-STEM: an open source benchmark testing agents on frontier scientific tasks

Introducing COMPOSITE-STEM, 70 expert-curated agentic tasks across Physics, Biology, Chemistry, and Math, compatible with the Harbor Framework.

by Kyle Waters, Lucas Nuzzi & Tadhg Looram, Apr 15, 2026

Operationalizing Expert Preferences for Model and Agent Evals

In our recent paper, we introduce AsymmetryZero, a framework for operationalizing human expert preferences as semantic evals.

by Tadhg Looram, Lucas Nuzzi & Kyle Waters, Mar 27, 2026
