99.9% uptime with a six-person team

Operational discipline > headcount. The specific practices.

We hit 99.9% uptime last year with a team of six engineers. No dedicated SRE. No overnight on-call rotation with a team of twenty. Six people, all of whom also write product code.

This isn't a brag. It's a description of the practices that make it possible.

The architecture decisions that reduce incidents

Stateless services. Every service we run on Cloud Run is stateless. No local disk, no in-memory session state, no sticky routing. This means any request can be handled by any instance. Scaling events don't cause routing problems. Failed instances are replaced without consequence.
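
The pattern can be sketched in a few lines. `ExternalStore` here is a hypothetical stand-in for whatever shared backend holds the state (Redis, Firestore); the point is that the handler never keeps anything in instance memory between requests:

```python
# Sketch: keep per-user state in an external store so any instance
# can serve any request. ExternalStore is a stand-in for a real
# shared backend (Redis, Firestore, etc.).
import json

class ExternalStore:
    """Illustrative stand-in for a shared key-value store."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def set(self, key, value):
        self._data[key] = value

store = ExternalStore()

def handle_request(session_id: str, event: str) -> dict:
    # Load session state from the shared store, never from a module-level cache.
    raw = store.get(f"session:{session_id}")
    session = json.loads(raw) if raw else {"events": []}
    session["events"].append(event)
    # Write back immediately; the next request may land on another instance.
    store.set(f"session:{session_id}", json.dumps(session))
    return session
```

Because every request round-trips through the store, a replaced or scaled-down instance loses nothing.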

Idempotent operations. Wherever possible, operations are designed to be safe to retry. Processing a message twice produces the same result as processing it once. This makes the operational response to "something went wrong mid-operation" much simpler: retry and verify, rather than investigate and potentially manually repair state.
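
A minimal sketch of the idea, using a processed-ID set as a stand-in for what would be a transactional check against the database in practice (all names here are illustrative):

```python
# Sketch: idempotent message processing keyed on a message ID.
# In production the "seen this ID?" check would be transactional
# with the write, not an in-memory set.

processed_ids: set[str] = set()
balances: dict[str, int] = {}

def apply_credit(message_id: str, account: str, amount: int) -> None:
    # Processing the same message twice must equal processing it once.
    if message_id in processed_ids:
        return  # already applied; a retry is a no-op
    balances[account] = balances.get(account, 0) + amount
    processed_ids.add(message_id)
```

With this shape, "something went wrong mid-operation" really does reduce to retry and verify: redelivering the message cannot double-apply it.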

Explicit dependency health checks. Every service exposes a health endpoint that verifies not just that the service is up but that its dependencies are reachable. The health check calls the database. The health check calls the ML model endpoint. If either is unavailable, the service reports unhealthy and gets rerouted. We find out about dependency failures from health checks, not from customer reports.
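
A dependency-aware health check might look roughly like this. `check_db` and `check_model_endpoint` are hypothetical probes standing in for real calls with short timeouts; the platform's routing of unhealthy instances is separate from this logic:

```python
# Sketch: a health endpoint that verifies dependencies, not just liveness.
# Both probe functions are illustrative stand-ins.

def check_db() -> bool:
    # In a real service: run "SELECT 1" against the database with a short timeout.
    return True

def check_model_endpoint() -> bool:
    # In a real service: HTTP GET to the model endpoint's health URL, short timeout.
    return True

def health() -> tuple[int, dict]:
    """Return (HTTP status, per-dependency results) for the /healthz handler."""
    checks = {"db": check_db(), "model": check_model_endpoint()}
    # Any failed dependency makes the whole service report unhealthy (503).
    status = 200 if all(checks.values()) else 503
    return status, checks
```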

The operational practices that keep incidents short

Runbooks for everything that's happened before. After every incident, we write a runbook: what happened, how we detected it, what we did, how we confirmed it was resolved. The next time the same class of incident occurs - and it will - the engineer on call doesn't have to figure it out from scratch.

The runbooks live in a searchable internal wiki. They're not perfect. They're good enough to cut mean-time-to-resolution significantly for repeat incident types.

Deploy small and often. We deploy multiple times per week. Small deploys are faster to roll back than large deploys. When something breaks after a deployment, the blast radius is smaller and the investigation is easier because fewer things changed.

We use Cloud Run's traffic splitting to route a percentage of traffic to a new revision before cutting over fully. For anything non-trivial, 10% traffic for fifteen minutes before full rollout. This catches issues that don't appear in staging.
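
The promote-or-rollback decision during the 10% phase can be sketched as a small gate. The traffic-split mechanics belong to Cloud Run itself; this only shows the decision logic, with an illustrative threshold that mirrors our error-rate alert:

```python
# Sketch: gate a canary rollout on the error rate observed while the
# new revision serves 10% of traffic. Threshold is illustrative.

CANARY_ERROR_RATE_LIMIT = 0.01  # 1%, matching the alerting threshold

def canary_decision(requests: int, errors: int) -> str:
    """Decide what to do after the canary window closes."""
    if requests == 0:
        return "hold"  # not enough traffic to judge yet
    error_rate = errors / requests
    return "promote" if error_rate <= CANARY_ERROR_RATE_LIMIT else "rollback"
```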

The monitoring setup

We alert on symptoms, not causes. Not "CPU above 80%" but "request error rate above 1% for five minutes" and "P99 latency above 400ms." The first set of alerts tells you something is happening at the infrastructure level. The second set tells you users are experiencing problems. We care about the second set.
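
The "error rate above 1% for five minutes" condition is something a monitoring system evaluates for you, but the logic is simple enough to sketch. Window size and threshold mirror the numbers above; the function name is illustrative:

```python
# Sketch: fire only when the symptom persists. Each sample is the
# error rate for one minute; alert when the last 5 minutes all exceed 1%.

from collections import deque

WINDOW_MINUTES = 5
ERROR_RATE_THRESHOLD = 0.01

recent = deque(maxlen=WINDOW_MINUTES)

def record_minute(error_rate: float) -> bool:
    """Return True if the alert should fire after this sample."""
    recent.append(error_rate)
    # A single bad minute never pages anyone; five in a row does.
    return len(recent) == WINDOW_MINUTES and all(
        r > ERROR_RATE_THRESHOLD for r in recent
    )
```

Requiring the full window to breach the threshold is what keeps transient blips from paging anyone.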

Every alert is actionable. If an alert fires and the correct response is "check the dashboard and it's probably fine," that alert is deleted or changed. Alerts that cry wolf train engineers to ignore alerts.

On the on-call burden

Six engineers. We rotate on-call weekly. One person is primary for the week. Pages outside working hours go to that person first.

The goal is a rotation where on-call rarely gets paged outside business hours. When it does, the incident resolves quickly because of the runbooks and architecture decisions above. We had eleven off-hours pages last year. That's manageable.

The way to make on-call sustainable is to treat every off-hours page as a bug in your operational setup. Something that pages you at 2am should either stop happening or stop requiring a human to handle it. If it's a recurring alert, write a runbook. If the remediation is scriptable, automate it. If it's genuinely unpredictable, accept the cost but don't let it accumulate.

What this requires

Time. Every one of these practices takes time to build and maintain. Runbooks go stale. Architecture decisions get worked around. Alerting thresholds need tuning.

The tradeoff is that time invested in operational discipline compounds. The alternative - heroics and headcount - doesn't.

With gusto, Fatih.