The term "agentic engineering" gets used for everything from autocomplete to fully autonomous coding systems. I want to describe what it actually looks like in our day-to-day work, specifically where in the development loop agents add value and where they don't.
The loop
A feature or fix moves through roughly five stages: spec, design, implementation, review, deployment. We have agents in stages 1, 3, and 4. Stages 2 and 5 are human-owned.
Stage 1: Spec
We use an LLM to draft technical specs from rough notes. An engineer writes two paragraphs describing the problem and the desired outcome. The agent produces a structured spec: problem statement, proposed solution, success criteria, known risks, open questions.
The output is not final. It's a starting point, roughly 60-70% accurate, that an engineer refines. The value is that the spec now exists and is structured. The engineer catches the wrong assumptions, fills the gaps, and ends up with a tighter document faster than writing from scratch.
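The prompt that produces the structured draft can be sketched as a small template function. The section list and the helper name here are illustrative, not our exact prompt:

```python
# Illustrative sketch: the five sections the spec draft is asked to contain.
SPEC_SECTIONS = [
    "Problem statement",
    "Proposed solution",
    "Success criteria",
    "Known risks",
    "Open questions",
]

def build_spec_prompt(rough_notes: str) -> str:
    """Wrap an engineer's rough notes in a prompt asking the model
    for a spec draft with a fixed set of sections (hypothetical helper)."""
    sections = "\n".join(f"- {s}" for s in SPEC_SECTIONS)
    return (
        "Draft a technical spec from the notes below. "
        "Use exactly these sections, in this order:\n"
        f"{sections}\n\n"
        "Mark anything you had to assume as an open question.\n\n"
        f"Notes:\n{rough_notes}"
    )
```

Fixing the section list in the prompt is what makes every draft come back in the same shape, so engineers know where to look for the gaps.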
This has had an unexpected secondary benefit: it forces engineers to write down the rough description in the first place. Before this workflow, specs were often mental models that never got written down.
Stage 2: Design (human)
Architecture decisions stay human. What services are involved, how data flows, where the edge cases live. The model's suggestions here are plausible but frequently wrong in domain-specific ways. We use it for rubber-ducking - I'll explain a design to Claude and ask it to identify what assumptions I'm making - but the decision is always human.
Stage 3: Implementation
Copilot handles in-editor completion. Claude handles larger generation tasks: write a function given a description and a type signature, implement a test suite for this module, translate this pseudocode into working Python.
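A typical "function from a description and a type signature" task looks like this: the engineer supplies the signature and docstring, the agent fills in the body, and the engineer reviews the result. The example is illustrative, not from our codebase:

```python
def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping closed intervals; return them sorted by start.

    The signature and docstring above are the human-written input;
    the body below is the kind of output the agent produces for review.
    """
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend its right edge.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

The tight contract is the point: the narrower the description and signature, the less room the model has to guess wrong.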
We do not have an agent that makes commits or opens pull requests. The engineer reviews, edits, and commits. The agent is a tool the engineer uses, not an autonomous participant in the loop.
Stage 4: Review
We have an automated review agent that runs on every PR. It checks for a fixed set of things: potential null pointer exceptions, inconsistent error handling patterns, missing input validation, obvious logic errors, type mismatches.
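One of the cheaper checks in that set, sketched as an AST pass. This is a simplified illustration of the kind of rule the agent applies, not our implementation:

```python
import ast

def find_bare_excepts(source: str) -> list[int]:
    """Return line numbers of bare `except:` clauses, one of the
    inconsistent-error-handling patterns flagged on a diff (sketch)."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.ExceptHandler) and node.type is None
    ]
```

Each flagged line becomes an inline comment on the diff; the engineer decides whether it's a real problem or a false positive.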
The agent posts comments inline on the diff. The engineer reviews those comments before the human code review. Some comments are useful catches. Some are false positives. The engineer marks them resolved either way.
Human review still happens on every PR. The agent does not replace it. What it does is raise the floor - humans can focus on design and intent rather than catching the class of mechanical errors the agent handles.
Stage 5: Deployment (human)
Deployment decisions are human. Which revision to promote, whether to run the staged rollout, when to abort. We have monitoring and alerting that gives us the data. A human makes the call.
We experimented with automated rollback on certain alert thresholds and kept it in only one narrow case: if P99 latency exceeds a threshold within five minutes of a new deployment, Cloud Run automatically rolls back. This is the only case where an automated system takes a deployment action, because the evidence is clear and the cost of a wrong rollback is low.
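The rollback rule itself is simple enough to state as a pure function. The threshold and window values here are placeholders, not our production numbers:

```python
def should_roll_back(
    p99_latency_ms: float,
    seconds_since_deploy: float,
    threshold_ms: float = 500.0,   # placeholder threshold
    window_s: float = 300.0,       # five-minute post-deploy window
) -> bool:
    """Roll back only when P99 latency breaches the threshold
    inside the window right after a deployment (illustrative sketch)."""
    return seconds_since_deploy <= window_s and p99_latency_ms > threshold_ms
```

The narrowness is deliberate: outside the five-minute window, a latency spike pages a human rather than triggering an automated action.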
What the loop gains
Throughput. The same team handles more surface area than it did two years ago. Not because the agents are doing the work - they're handling the mechanical parts - but because the humans can focus on the parts that require judgment.
Documentation coverage. When specs are generated automatically, specs exist for everything. Before this workflow, specs existed for big features. Small features and fixes were underdocumented.
What it doesn't change
The cost of bad decisions. If the design is wrong, the agents will implement the wrong thing efficiently. Speed of execution amplifies both good and bad judgment.
The need for engineers who understand the system. Agents generate code. Engineers still need to review it, understand it, and own the consequences. If no one on the team can read the generated code critically, you're accumulating technical debt at scale.
With gusto, Fatih.