We've been running computer vision in production since 2021. Four years in, I've watched a lot of evaluation frameworks break against reality.
Here's what we actually measure and why.
The metrics that feel good but aren't
Accuracy on the validation set. It's necessary. It's not sufficient. The validation set was assembled by humans, labeled under controlled conditions, and does not reflect the full distribution of what comes in from the field. A model that hits 98% on the val set and 91% in production is not a failure of the model - it's a failure of the evaluation setup.
Confusion matrices on balanced classes. Our production data is not balanced. False negatives on a certain class cost ten times what a false positive costs. An average metric that treats all classes equally will tell you the model is doing well while the expensive errors are quietly accumulating.
Aggregate latency. P50 latency is not the number that matters. P99 latency is the number that matters. In a logistics pipeline where a conveyor belt is running, a 500ms tail request at P99 means physical holdups. We track P50 as a sanity check. We alert on P99.
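A minimal sketch of what that alerting check looks like, assuming a window of recent request latencies. The 500ms budget below is the illustrative number from above, not a universal threshold:

```python
# Sketch: nearest-rank percentiles over a window of latency samples.
# The 500ms P99 budget is illustrative, not a universal number.
def latency_percentiles(samples_ms):
    """Return (p50, p99) from a list of latency samples in milliseconds."""
    ordered = sorted(samples_ms)
    def pct(p):
        # nearest-rank percentile: clamp the rank into the valid index range
        idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]
    return pct(50), pct(99)

def should_alert(samples_ms, p99_budget_ms=500):
    # We page on the tail, not the median.
    _, p99 = latency_percentiles(samples_ms)
    return p99 > p99_budget_ms
```

The point of the structure: P50 is computed but only the P99 comparison gates the alert.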
The metrics that actually matter
Business error rate. We define this precisely: the subset of model errors that would have caused a real downstream problem - a shipment incorrectly authenticated, a flagged item incorrectly passed. This requires instrumentation at the pipeline level, not just the model level. It's harder to compute. It's the only number our clients care about.
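As a sketch of the idea, assuming each pipeline event records both whether the model was right and whether the error reached a client-visible outcome (field names here are hypothetical):

```python
# Hypothetical event schema: 'model_correct' (bool) and 'downstream_impact'
# (bool: did the error cause a real downstream problem?).
def business_error_rate(events):
    """Fraction of events where a model error actually hurt the business."""
    total = errors = 0
    for e in events:
        total += 1
        # Only count errors that propagated; harmless errors don't count here.
        if not e["model_correct"] and e["downstream_impact"]:
            errors += 1
    return errors / total if total else 0.0
```

The design choice is in the `and`: a model error caught downstream (by a human, a retry, a second check) does not count against the business error rate.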
Drift over time. We run a weekly evaluation job against a fixed test set that was assembled eighteen months ago. Not to track absolute performance, but to detect when performance on a known distribution changes. Drift is the signal that something about the deployment environment has shifted: lighting changed at a warehouse, a product line changed packaging, a firmware update changed camera exposure.
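The check itself is simple once the frozen test set exists. A sketch, with an illustrative tolerance (not our actual alert threshold):

```python
# Sketch: flag weeks where accuracy on the frozen test set moves away from
# the baseline by more than a tolerance. Tolerance value is illustrative.
def detect_drift(weekly_accuracies, baseline, tolerance=0.02):
    """Return indices of weekly runs that deviate from baseline."""
    return [i for i, acc in enumerate(weekly_accuracies)
            if abs(acc - baseline) > tolerance]
```

Note the absolute value: a sudden *improvement* on a fixed distribution is also a signal that something in the environment or pipeline changed.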
Latency under load. We run synthetic load tests weekly at 2x current production volume. The model's latency profile under realistic concurrent load is different from single-request latency. You discover this before production or during an incident. I prefer before.
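A minimal sketch of such a load test, with a stand-in for the real inference call. The key detail is that the clock starts at submission, so queueing delay under concurrency is counted as part of latency:

```python
# Sketch of a synthetic load test. `call_model` is a stand-in for a real
# inference request; timing starts at submit so queueing counts.
import concurrent.futures
import time

def call_model(payload):
    # Stand-in for a real inference request.
    time.sleep(0.001)
    return "ok"

def load_test(n_requests, concurrency):
    latencies = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = []
        for i in range(n_requests):
            start = time.perf_counter()
            fut = pool.submit(call_model, i)
            # Record wall time from submission to completion.
            fut.add_done_callback(
                lambda f, s=start: latencies.append(time.perf_counter() - s))
            futures.append(fut)
        concurrent.futures.wait(futures)
    latencies.sort()
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[min(len(latencies) - 1, int(len(latencies) * 0.99))]
    return p50, p99
```

Run it with `concurrency` set well below `n_requests` and the P99 pulls away from the P50 in a way single-request benchmarks never show.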
On ground truth
The hardest part of production evaluation is ground truth. You can evaluate model predictions against human labels at a low sample rate. You can flag predictions below a confidence threshold for human review and use those reviews as ground truth. You cannot label everything that comes in.
We use confidence thresholding aggressively. Anything below 0.85 confidence goes to a review queue. The review queue generates ground truth. The ground truth goes into the next training run. This loop is the primary driver of model improvement in production.
It's slow. The alternative is faster but produces a model that improves in the lab and drifts in the field.
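The routing logic at the core of that loop fits in a few lines. A sketch, using the 0.85 threshold from above:

```python
# Sketch of the confidence-gated routing described above. The review queue
# is where ground truth comes from; accepted predictions flow straight through.
REVIEW_THRESHOLD = 0.85

def route(prediction, confidence, review_queue, accepted):
    if confidence < REVIEW_THRESHOLD:
        review_queue.append(prediction)   # human review -> ground truth -> training
    else:
        accepted.append(prediction)       # auto-accepted, sampled occasionally
```

Everything interesting happens downstream of `review_queue`: the reviews become labels, and the labels become the next training run.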
On evaluation for LLM-based components
We have LLMs in the loop for certain document processing tasks. Evaluating LLM output is harder than evaluating classification output because there's no single correct answer.
We use a rubric-based approach: define the properties the output must have (is the relevant information present? is any information fabricated? is the format correct?) and score against those properties. A human spot-checks 5% of outputs weekly. The rubric evolves as we discover new failure modes.
This is more work than running evals against a benchmark. It's the only thing that tells us whether the model is doing what we actually need.
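The mechanical part of the rubric is just named predicates over the output. A hypothetical sketch - these three checks are illustrative, not our production rubric:

```python
# Hypothetical rubric scorer: each property is a boolean check over the
# output string; the score is the fraction of properties satisfied.
def score_against_rubric(output, rubric):
    """rubric: dict mapping property name -> predicate over the output."""
    results = {name: check(output) for name, check in rubric.items()}
    return results, sum(results.values()) / len(results)

# Illustrative checks for a document-processing output.
rubric = {
    "has_invoice_number": lambda o: "invoice" in o.lower(),
    "no_fabricated_total": lambda o: "$" not in o or "TOTAL" in o,
    "is_single_line": lambda o: "\n" not in o,
}
```

The per-property `results` dict matters as much as the score: when a new failure mode shows up in the 5% spot-check, it becomes one more named predicate.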
The principle underneath all of this
Evaluation is only useful if it's measuring the thing you actually care about, on the distribution you actually see, with enough frequency to catch drift before it becomes a problem.
Most teams measure the thing that's easy to measure. These are not the same.
With gusto, Fatih.