Building agents in production

Eval harnesses are graph nodes, not weekend notebooks

On this page

    Why evaluators are graph nodes, not weekend notebooks

    This claim has teethThe 2025 incident at Multimarkts where a notebook eval missed a multilingual regression because it ran a stale prompt template — production caught it three weeks later. .

    The contract: golden datasets + drift policy

    The promotion gate

    What goes wrong when evaluators are weekend notebooks



    Essays by email