
The Claude API in production: a working engineer's notes

Why I use it, what it's better at, and what I still reach for other tools for.

I've been using the Claude API in production since mid-2023. Not for experiments. For things that break if they don't work.

Here's what I've actually learned.

Why I switched

The short version: instruction following. GPT-4 is a capable model that occasionally decides it knows better than your prompt. Claude follows instructions with more consistency. When you're building internal tooling where the prompt is a specification and a deviation is a bug, this matters more than benchmark scores.

The longer version involves a specific incident: a GPT-4-based summarization pipeline that started silently adding caveats we hadn't asked for, changing the meaning of outputs. We caught it in code review. It had been running for three weeks. That was when I started evaluating alternatives seriously.

What it's better at

Long context. We feed it full engineering specs, full conversation histories, full error logs. The model holds context well. It doesn't lose track of the beginning of a document by the time it reaches the end, the way earlier models I tested did.

Structured output compliance. When you tell Claude to return JSON with a specific schema, it returns that schema. Reliably. When you need this to work in a pipeline that downstream code is parsing, reliability is the entire point.
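Reliable or not, we still validate before anything downstream parses the output. A minimal sketch of that check in Python; the field names and types here are illustrative, not our actual production contract:

```python
import json

# Hypothetical schema for a summarization call: these fields and
# types are illustrative assumptions, not a real pipeline contract.
REQUIRED_FIELDS = {"summary": str, "confidence": float, "topics": list}

def validate_response(raw: str) -> dict:
    """Parse model output and fail loudly if the schema doesn't match.

    A malformed response should raise here, not propagate silently
    into downstream code.
    """
    data = json.loads(raw)  # raises ValueError on non-JSON output
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(
                f"wrong type for {field}: {type(data[field]).__name__}"
            )
    return data

# A well-formed response passes; a malformed one raises immediately.
ok = validate_response(
    '{"summary": "db failover", "confidence": 0.9, "topics": ["ops"]}'
)
```

The point is where the failure surfaces: at the pipeline boundary, loudly, instead of three weeks later in code review.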

Code explanation. Not generation - explanation. We use it to document existing functions and write internal runbooks. For non-standard code patterns, the output is readable prose far more often than the confident-but-wrong answers I got elsewhere.

What I still reach for other tools for

Code generation at the function level. GitHub Copilot inside the editor is faster for completion-style generation. Claude isn't worse - it's just not the right interface for that workflow.

Web search. Claude doesn't have real-time access. Anything time-sensitive - checking a library version, verifying an API hasn't changed, looking up a status page - goes elsewhere.

Customer-facing content. We write that ourselves. Not because the model output is bad, but because it's faster to write it right than to edit something that's 80% right.

On context window size

We use 100K context regularly. It's slower than short-context calls and it costs more. But there are workflows where it's the only option: reviewing a full codebase change, analyzing a long incident timeline, synthesizing a week of engineering notes into a status update.

The failure mode I've seen in other teams: using long context as a crutch instead of structuring your prompts properly. Long context doesn't mean you can skip the work of telling the model what to focus on. It just means you can include more material. You still need to be specific about what you want.
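In practice that means the long material goes in as reference and the instruction stays narrow. A sketch of how a prompt like that can be structured; the wording and the delimiter convention are illustrative assumptions, not a prescribed format:

```python
def build_prompt(incident_log: str, question: str) -> str:
    """Wrap a long document with an explicit, narrow instruction.

    The log can be very long; the instruction tells the model exactly
    what to extract from it. The <timeline> delimiters are an
    illustrative convention, not a required format.
    """
    return (
        "Here is a full incident timeline:\n"
        "<timeline>\n"
        f"{incident_log}\n"
        "</timeline>\n\n"
        "Answer only the following question, citing timestamps from "
        f"the timeline: {question}\n"
        "If the timeline does not contain the answer, say so."
    )

prompt = build_prompt(
    "02:14 db failover\n02:31 alerts cleared",
    "When did alerts clear?",
)
```

The material is included in full, but the model is still told what the output should be and what to do when the answer isn't there.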

Costs

We run approximately 50,000 API calls per month across internal tooling. Not cheap. We pay for it because the productivity delta is measurable: the workflows it powers would otherwise require headcount we don't have.

The mistake I see teams make is running cost analysis against "nothing" - calculating what the API costs without calculating what the alternative costs. The right comparison is API cost versus engineer hours doing the same task.
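A back-of-the-envelope version of that comparison. Every number below is an illustrative assumption for the sketch, not our actual spend:

```python
# Illustrative break-even calculation: compare monthly API spend to
# the engineer hours the same work would take by hand. All figures
# are assumptions, not real costs.
calls_per_month = 50_000
cost_per_call = 0.03          # assumed average cost per call, dollars
minutes_saved_per_call = 2    # assumed manual effort replaced per call
loaded_hourly_rate = 100      # assumed fully loaded engineer cost, dollars

api_cost = calls_per_month * cost_per_call
hours_saved = calls_per_month * minutes_saved_per_call / 60
labor_cost = hours_saved * loaded_hourly_rate

print(f"API: ${api_cost:,.0f}/mo vs labor: ${labor_cost:,.0f}/mo")
```

Under these assumptions the API side is a rounding error next to the labor side, which is the shape of the result even if your per-call numbers differ substantially.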

What I'd tell someone starting now

Start with the use cases where correctness is verifiable. Code review comments, structured data extraction, documentation drafts. These are the cases where you'll develop real intuitions about where the model is reliable and where it isn't.

Don't start with use cases where wrong answers are invisible. That's how you end up with pipelines that have been quietly degraded for three weeks.

With gusto, Fatih.