Agentic workflows: the ones that work and the ones that blow up

Multi-agent in a real product. The failure modes. The wins. The things you only learn by running it.

We've been running agentic workflows in production for over a year. Not demos. Not internal experiments. Workflows that run on customer data with real consequences if they fail.

Here's what I've actually learned.

The workflows that work

Document processing pipelines. Input a document, extract structured information, validate against a schema, route the result. The loop is short, the output is verifiable, and the failure mode is loud (validation fails, pipeline halts). These work reliably.
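The shape of that loop can be sketched in a few lines. This is a minimal illustration, not our actual pipeline: the field names, the routing rule, and the `extract` callable (an LLM call in practice) are all placeholders.

```python
# Hypothetical extract -> validate -> route loop with a loud failure mode.
REQUIRED_FIELDS = {"invoice_id": str, "total": float, "currency": str}

class ValidationError(Exception):
    pass

def validate(record: dict) -> dict:
    """Fail loudly: any schema violation halts the pipeline here."""
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in record:
            raise ValidationError(f"missing field: {field}")
        if not isinstance(record[field], ftype):
            raise ValidationError(f"bad type for {field}: {type(record[field]).__name__}")
    return record

def route(record: dict) -> str:
    # Illustrative routing rule only.
    return "review_queue" if record["total"] > 10_000 else "auto_approve"

def process(document_text: str, extract) -> str:
    record = extract(document_text)  # the agent/LLM call in a real pipeline
    return route(validate(record))
```

The point is structural: the agent sits behind a validator that either passes a well-formed record or raises, so a bad extraction can never silently reach the routing step.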

Monitoring and alerting enrichment. We send raw alert context to an agent that summarizes the incident, identifies related prior incidents, and drafts the first message for the incident channel. The output is low-stakes enough that a wrong answer is annoying but not catastrophic. The agent doesn't take action - it produces context for the human who does.

Code generation with constraints. An agent that writes boilerplate given a template and a spec. The output always goes through code review before merge. The agent can't ship directly. This constraint is load-bearing.

The workflows that blew up

Autonomous remediation. We tried an agent that would detect certain infrastructure anomalies and attempt automated fixes: restart a service, scale a resource. The agent was correct about 85% of the time. The 15% of the time it was wrong, it made things worse. An 85% correct rate sounds good until the 15% is a production incident that requires manual cleanup at 2am.

We shut it down. We now use the same detection logic to generate runbook-style suggested actions for a human to execute.

Long chains without human checkpoints. An agent with twelve steps and no human in the loop until the end produces outputs that are confidently wrong in ways that are hard to trace back. The error accumulates. Step 3's bad assumption propagates through steps 4 through 12 without anyone catching it.

The fix is checkpoints. Short chains. Humans in the loop at the right points, not at the end.
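One way to make "humans at the right points" concrete is to build the checkpoint into the chain runner itself, so a rejected intermediate state stops the run instead of propagating. A minimal sketch, with illustrative names (the real checkpoint would be an async approval flow, not a synchronous callback):

```python
# Sketch: a short chain that pauses for human approval after high-stakes steps.
def run_chain(steps, checkpoint_after, approve):
    """Run steps in order; `checkpoint_after` holds indices of steps whose
    output a human must approve before the chain continues."""
    state = {}
    for i, step in enumerate(steps):
        state = step(state)
        if i in checkpoint_after and not approve(i, state):
            # Stop here: a bad assumption at step i never reaches step i+1.
            raise RuntimeError(f"checkpoint rejected after step {i}")
    return state
```

The value is that a step-3 mistake surfaces at step 3, while the context needed to diagnose it still exists, rather than at step 12.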

The failure mode nobody talks about

Confident degradation. The agent keeps running, keeps producing output, and the output is getting worse. There's no error. There's no alert. The pipeline is technically healthy. The output is quietly less useful than it was six months ago - maybe a model version changed, maybe a prompt is hitting an edge case more often now, maybe the input data distribution shifted.

This is the hardest failure mode to catch because there's no signal. You catch it only if you're measuring output quality continuously, not just pipeline health.

We built an evaluation harness that runs weekly against a fixed test set. It's not comprehensive. It's enough to catch drift before it becomes a problem.
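The core of such a harness is small. A hedged sketch, assuming you have a scoring function for your domain; the tolerance value and function names are illustrative, not our actual configuration:

```python
# Minimal drift check: score a fixed test set, compare to a stored baseline.
def evaluate(agent, test_set, score):
    """Average score of the agent over a fixed (input, expected) test set."""
    return sum(score(agent(x), expected) for x, expected in test_set) / len(test_set)

def check_drift(current, baseline, tolerance=0.05):
    """True if quality is within tolerance of baseline; False means alert.
    The 0.05 tolerance is an illustrative assumption."""
    return current >= baseline - tolerance
```

Run this on a schedule, persist the scores, and alert on `check_drift` failures. It won't tell you *why* quality dropped, but it turns silent degradation into a signal.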

What makes an agentic workflow safe to run

Short loops. Long loops accumulate errors.

Verifiable outputs. If you can't tell whether the output is correct without deep inspection, you can't run it reliably at scale.

Graceful degradation. What happens when the agent fails? If the answer is "the production system breaks," that workflow is not ready.

Human in the loop at high-stakes decision points. Not throughout - that eliminates the value. At the points where a wrong decision is expensive.
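The graceful-degradation property in particular can be enforced at the call site rather than hoped for. A minimal sketch, assuming the fallback is a cheap deterministic path (the names are hypothetical):

```python
# Sketch: wrap the agent call so failure degrades to a safe default
# instead of breaking the production path.
def with_fallback(agent_call, fallback):
    def wrapped(*args, **kwargs):
        try:
            return agent_call(*args, **kwargs)
        except Exception:
            # Agent failure is absorbed; the caller gets the safe default.
            return fallback(*args, **kwargs)
    return wrapped
```

If you can't write a sensible `fallback` for a workflow, that's a strong hint the workflow isn't ready to run unattended.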

The broader lesson

Agents are good at automating the mechanical parts of human judgment. They're not good at replacing the parts of human judgment that require understanding why something matters.

The workflows that work are the ones that automate the mechanical parts and route the rest back to a human. The workflows that blow up are the ones that assume the mechanical-looking problem doesn't have a judgment component.

Most problems have a judgment component.

With gusto, Fatih.