Building agents in production
Eval harnesses are graph nodes, not weekend notebooks
On this page
Why evaluators are graph nodes, not weekend notebooks
This claim has teethThe 2025 incident at Multimarkts where a notebook eval missed a multilingual regression because it ran a stale prompt template — production caught it three weeks later. .