
What I'm watching in 2026: reasoning models, long-context, and the next shift

The things that look small now but probably aren't.

I've been wrong before about where AI goes next. I was wrong about how fast LLMs would get good at code. I was wrong about how quickly enterprises would start buying rather than building. I try to hold my predictions loosely.

With that caveat: here's what I'm watching in 2026.

Reasoning models in production

The gap between reasoning models (o-series, Claude's extended thinking) and standard models on complex tasks is real. I've measured it. For structured analysis tasks - evaluating an engineering design, synthesising information from multiple documents, debugging a non-obvious system behaviour - the reasoning models produce better output.

What I don't know yet: whether the latency cost is acceptable for the use cases where the quality difference matters most. Reasoning takes longer. In an interactive workflow where someone is waiting for output, the threshold for acceptable latency is different than in an async pipeline.
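The trade-off above can be made concrete as a routing decision. This is a minimal sketch, not a real API: the model names and latency figures are assumed placeholders, and a production router would measure latency rather than hard-code it.

```python
# Hypothetical sketch: route a task to a reasoning model or a standard
# model based on the caller's latency budget. Names and numbers are
# illustrative assumptions, not measurements.

REASONING_MODEL = "reasoning-large"  # slower, better on complex analysis
STANDARD_MODEL = "standard-fast"     # faster, fine for routine tasks

# Assumed rough p95 latency per model, in seconds.
EXPECTED_LATENCY_S = {REASONING_MODEL: 45.0, STANDARD_MODEL: 4.0}

def pick_model(latency_budget_s: float, needs_deep_reasoning: bool) -> str:
    """Prefer the reasoning model only when the latency budget allows it."""
    if needs_deep_reasoning and EXPECTED_LATENCY_S[REASONING_MODEL] <= latency_budget_s:
        return REASONING_MODEL
    return STANDARD_MODEL
```

An interactive chat gives you a tight budget and falls back to the fast model; an async pipeline has a generous budget and can take the quality win every time.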

I expect reasoning-model latency to come down in 2026 as inference gets faster. When it does, a class of tasks that currently goes to humans because models aren't reliable enough will shift to the models.

Long-context getting reliable

We use 100K context today. The failure mode with very long context is that models lose focus: they process the full context but weight recent content more heavily and drop important context from early in the document.

The models are getting better at this. I expect full-context reliability to improve meaningfully this year. When it does, the workflows that currently require careful chunking and retrieval engineering become significantly simpler.

The interesting consequence: a lot of current RAG infrastructure exists to work around context limitations. Some of that infrastructure will become unnecessary. Building complex retrieval systems on top of a problem that's about to be solved at the model layer is a mistake I'm trying to avoid.
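The decision that much of that retrieval infrastructure encodes can be sketched in a few lines: if the corpus fits in the window, send it all; only fall back to retrieval when it doesn't. This is an illustrative sketch, assuming a crude whitespace-based token estimate (a real system would use the model's tokenizer) and a hypothetical `retrieve` callable standing in for an existing RAG pipeline.

```python
# Sketch of the decision a RAG layer works around: full context when it
# fits, retrieval only when it doesn't. Token counting is a deliberate
# rough approximation; `retrieve` is a hypothetical placeholder.

CONTEXT_WINDOW_TOKENS = 100_000  # the ~100K window mentioned above

def rough_token_count(text: str) -> int:
    # Very rough: ~1.3 tokens per whitespace-separated word.
    return int(len(text.split()) * 1.3)

def build_prompt(question: str, documents: list[str], retrieve) -> str:
    """Use full context when it fits; fall back to retrieval otherwise."""
    corpus = "\n\n".join(documents)
    if rough_token_count(corpus) + rough_token_count(question) <= CONTEXT_WINDOW_TOKENS:
        return f"{corpus}\n\nQuestion: {question}"    # simple full-context path
    selected = retrieve(question, documents)          # legacy retrieval path
    return "\n\n".join(selected) + f"\n\nQuestion: {question}"
```

As full-context reliability improves, the second branch handles a shrinking share of traffic, which is exactly why I'm wary of investing heavily in it now.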

Edge inference catching up

We run YOLO on physical infrastructure at Wasteer. Cameras in the field, local inference, results sent to the cloud. The constraint has always been what you can run at the edge: the model has to be small enough to fit on the hardware, fast enough to meet latency requirements.

The edge inference hardware is getting meaningfully faster. What ran on a cloud GPU last year runs on a Jetson-class device today. This opens CV use cases that weren't previously viable - more complex models, higher accuracy, richer output - without changing the fundamental deployment architecture.
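The deployment pattern above is simple to sketch: inference stays local, and only a compact result travels. In this sketch the model and uplink are stubbed out; `detect`, the example label, and the payload fields are assumptions for illustration, not Wasteer's actual schema.

```python
# Sketch of the edge pattern: local inference per frame, compact JSON
# results shipped to the cloud, raw frames never leave the device.
# `detect` is a placeholder for on-device inference (e.g. a compiled
# detection model); the payload fields are assumed for illustration.
import json
import time

def detect(frame) -> list[dict]:
    """Placeholder for on-device inference on one frame."""
    return [{"label": "container", "confidence": 0.91}]

def to_payload(camera_id: str, detections: list[dict]) -> str:
    """Serialise only the results: results travel, frames stay local."""
    return json.dumps({
        "camera": camera_id,
        "ts": time.time(),
        "detections": detections,
    })
```

The point of the architecture is that a faster edge device only changes what `detect` can afford to run; nothing downstream has to change.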

The thing I'm less sure about

Multi-agent systems at scale. We run these in production. I've written about what works and what doesn't. The failure modes are real and not fully solved. I've seen a lot of framework development in this space but not a lot of evidence that the fundamental orchestration problems - reliability, observability, graceful degradation - are solved.

My current position: multi-agent is valuable for certain tasks (parallel document processing, complex code review, test generation) and not ready for general automation. I'll update this if I see production evidence that changes it.
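The one pattern I do trust, parallel processing of independent documents, needs very little orchestration machinery. A minimal sketch with a bounded worker pool, where `analyse` is a hypothetical stand-in for a per-document model call:

```python
# Minimal sketch of the multi-agent pattern that works: fan out over
# independent documents with a bounded pool. `analyse` is a placeholder
# for a single-agent call; no cross-agent coordination is needed.
from concurrent.futures import ThreadPoolExecutor

def analyse(doc: str) -> dict:
    """Placeholder for one agent's work on one document."""
    return {"doc": doc, "words": len(doc.split())}

def process_documents(docs: list[str], max_workers: int = 4) -> list[dict]:
    """Fan out; Executor.map preserves input order in the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyse, docs))
```

The reason this pattern is safe is structural: the documents are independent, so the hard orchestration problems (shared state, inter-agent handoffs, partial failure across a dependency chain) never arise.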

What stays constant

Deployment is harder than development. The gap between a benchmark result and a production result will still exist. Organisations that don't have operational discipline will not get value from better models.

The teams that do well in 2026 will be the ones that were already doing well in 2025 - building things that work, measuring what matters, and not confusing capability with product.

With gusto, Fatih.