On 1 June 2026 I closed the laptop after three long days on the AI engineering conference circuit and walked back along the Embankment thinking I had seen the shape of the year. The rooms were at last crowded with people building production systems. The corridor conversation had matured. The hype had visibly retreated. And six themes had emerged as the load-bearing ideas for the rest of 2026.

This is the long write-up. If you only have a minute, here are the headlines.

The Model Context Protocol won the year. Tool surfaces are now MCP-shaped, full stop.
Multi-provider gateway with failover is the new minimum bar for production. Single-provider deployments are a planning failure.
Voice loops finally cleared the sub-second turn-time bar on commodity hardware. The vendor lock-in story is shifting toward open building blocks.
Evals as code, gated in CI, replaced the dashboard era. If your suite cannot fail a pull request, it is decoration.
Long-running agents are taking on the work that nobody wanted to call agent work. Durable journals with deterministic replay are the operating model.
The next wave of real users does not live on the web. They live on a desktop. Local-first assistants are the next distribution surface.

This piece is part recap, part argument. Where I am citing a stat I have linked the source. Where I am giving an opinion I am owning it. If you spot a number you cannot reproduce, email me and I will fix or remove it^[4]^[16].

1. MCP took the year

A year ago the Model Context Protocol^[2] was a proposal from Anthropic^[3]. At the conference this week it was the de facto wire format for tool calling. Roughly every second production-agent talk referenced an MCP server. The fringe vendor booths were busy taking proprietary tool surfaces and wrapping them in MCP because that is how integrators are buying now. The "should we adopt MCP" question is no longer interesting. The interesting questions are operational.

What does that mean in practice?

Build your tools as MCP servers from day one. Even your internal-only tools. The friction of porting later is a tax on your future self.
The stdio transport is enough for most cases. The HTTP transport matters when you need cross-process or cross-host invocation, but do not reach for it for an in-editor skill.
Bearer auth on the HTTP transport is the standard. Roll your own callback if you need OAuth, but keep the catalog endpoint plain.
Catalog your tools. The list_tools discovery shape is what makes the protocol composable across hosts.
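
To make the catalog point concrete, here is a minimal sketch of the discovery shape. It assumes the JSON-RPC 2.0 framing that MCP uses, with a `tools/list` method returning `name`, `description`, and `inputSchema` per tool, and a `tools/call` method returning text content. It is hand-rolled with the standard library rather than built on an official SDK, and the `echo_text` tool, the `TOOLS` registry, and `handle_request` are hypothetical names for illustration only.

```python
import json
from typing import Any, Callable

# Hypothetical registry: tool name -> (description, JSON Schema for arguments, handler).
TOOLS: dict[str, tuple[str, dict[str, Any], Callable[[dict[str, Any]], str]]] = {
    "echo_text": (
        "Return the supplied text unchanged.",
        {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
        lambda args: args["text"],
    ),
}


def _error(request_id: Any, code: int, message: str) -> str:
    """Build a JSON-RPC 2.0 error response."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "error": {"code": code, "message": message}}
    )


def handle_request(raw: str) -> str:
    """Answer one JSON-RPC 2.0 request for tools/list or tools/call."""
    request = json.loads(raw)
    request_id = request.get("id")
    method = request.get("method")

    if method == "tools/list":
        # Discovery: every host sees the same self-describing catalog.
        result: dict[str, Any] = {
            "tools": [
                {"name": name, "description": desc, "inputSchema": schema}
                for name, (desc, schema, _handler) in TOOLS.items()
            ]
        }
    elif method == "tools/call":
        params = request.get("params", {})
        entry = TOOLS.get(params.get("name"))
        if entry is None:
            return _error(request_id, -32602, "unknown tool")
        text = entry[2](params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": text}]}
    else:
        return _error(request_id, -32601, "method not found")

    return json.dumps({"jsonrpc": "2.0", "id": request_id, "result": result})
```

In a stdio server this function would sit behind a loop that reads one request per line from stdin and writes the response to stdout. A real server also needs the protocol's initialisation handshake, which this sketch leaves out.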

I have been shipping the MCP-shape pattern across my own stack since the autumn. The /api/v1/mcp endpoint in Sarmalink-AI is a bearer-protected catalog over plain JSON. The plugin auto-router in the gateway dispatches to that catalog. Slipstream's nine sp_ tools live behind a stdio MCP server inside the slipstream plugin. The point is not that any of this is novel. The point is that it is now the assumed baseline. If your tool surface is not MCP-shaped you are about to have an integration problem.

2. Multi-provider gateways are now table stakes

The most quoted number this week came from LangChain's State of AI 2025 follow-up^[4]: 58 percent of teams running AI in production now route through some form of multi-provider gateway. Another 24 percent are piloting one. That leaves 18 percent, under one fifth of those teams, on a single provider, and most of them are stuck there for compliance reasons or contractual minimums.

This is a serious change in posture. A year ago a single-provider deployment was the default. Now it reads as a planning failure waiting to page someone. Three factors drove the shift.

First, the provider outages of the past twelve months made it personal. OpenAI's late May 2026 incident, Anthropic's 27 March incident, the Google Workspace AI cascade in February. Every senior engineer who was paged for those is now looking at gateway designs.

Second, the cost gap between providers narrowed. The premium for top-tier reasoning at one provider versus another is now usually under 30 percent, and the failover penalty for accepting a slightly cheaper model under load is acceptable to most product teams. Pure cost arbitrage is a real driver.

Third, the operational tooling for gateways has matured. You no longer need to roll the whole stack yourself. Open-source gateways have failover, rotation, intent routing, and usage tracking built in. Cloudflare Workers AI^[5] handles the infrastructure for teams that do not want to host the gateway themselves.

For the open-source side specifically, I have been pushing on Sarmalink-AI^[10] for this exact reason. The v2 drop a fortnight ago landed ten features around the multi-provider story, among them intent auto-routing, an MCP-shape tool catalog, a smart-suggestions endpoint, exports, a reasoning-leak stripper, and voice. If you want a free, working reference for what a modern open-source gateway looks like, that repo is the closest to the conference baseline I know of.

3. Voice loops have a serious production stack

Voice was the surprise of the week. The sub-one-second turn-time budget on commodity hardware has finally arrived. The reference architecture is mediasoup^[7] for WebRTC capture and routing, Whisper.cpp or a comparable Whisper variant for STT, a streaming TTS endpoint (Cloudflare Workers AI MeloTTS, Piper, or a paid hosted TTS), and a provider-pluggable LLM in the middle. The OpenAI Realtime API^[6] is the simplest hosted path. The open-source path costs more setup but is genuinely cheaper to operate at any scale above a single workstation.

What was new at the conference was the operational discipline around voice. Three patterns mattered.

Sentence-by-sentence TTS streaming. Wait for a full sentence or clause from the LLM, stream it to TTS, then play it. Do not wait for the whole response. This is the single most important UX upgrade for voice and most demos still get it wrong.
Barge-in as a state machine. When the user starts talking while the assistant is speaking, the TTS must cancel mid-sentence and the new utterance must become the next prompt. The turn-state machine is small but unforgiving. Get it wrong and the assistant feels broken in seconds.
Per-stage latency telemetry. STT first-token-out, LLM first-token-out, TTS first-byte-out, and end-to-end mouth-to-ear time. Without per-stage numbers you cannot honestly optimise the budget.
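
The sentence-by-sentence pattern above can be sketched as a small generator that sits between the LLM token stream and the TTS call. This is a minimal sketch under stated assumptions: `sentences_from_tokens` is a hypothetical helper, and the boundary rule (terminal punctuation followed by whitespace) is deliberately naive, so an abbreviation such as "e.g. " will split early.

```python
import re
from typing import Iterable, Iterator

# A boundary is whitespace that directly follows '.', '!' or '?'.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream.

    Each yielded sentence can be handed to TTS immediately, so playback
    starts while the LLM is still generating the rest of the response.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        # The last part is still in progress; keep it for the next token.
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Hel", "lo there. ", "How are", " you? I am", " fine"]
    for sentence in sentences_from_tokens(stream):
        print(sentence)
```

Requiring whitespace after the punctuation means a number such as "3.5" that arrives split across tokens is not cut in half. The cost is that the final sentence is only released when the stream ends.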

The pattern shows up directly in voice-agent-starter^[14] and indirectly through every product where voice is in the picture. The conference made it clear that the field has converged on this stack. If you are starting a voice loop in mid-2026, this is the architecture to clone.
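
The barge-in rule described above (cancel TTS mid-sentence, make the new utterance the next prompt) fits in a three-state machine. The sketch below is an illustration under assumptions, not the voice-agent-starter implementation: `Turn`, `TurnMachine`, and the injected `cancel_tts` callback are hypothetical names, and the turn counter is one way to discard audio that was generated for a prompt the user has already talked over.

```python
from enum import Enum, auto
from typing import Callable, Optional


class Turn(Enum):
    LISTENING = auto()  # waiting for, or capturing, user speech
    THINKING = auto()   # prompt sent to the LLM, no audio playing yet
    SPEAKING = auto()   # TTS audio is playing


class TurnMachine:
    def __init__(self, cancel_tts: Callable[[], None]) -> None:
        self.state = Turn.LISTENING
        self.turn_id = 0  # increments with every accepted utterance
        self._cancel_tts = cancel_tts

    def user_started_speaking(self) -> None:
        """Voice activity detected; a barge-in if the assistant is mid-turn."""
        if self.state is Turn.SPEAKING:
            self._cancel_tts()  # stop playback mid-sentence
        self.state = Turn.LISTENING

    def user_finished_speaking(self, utterance: str) -> Optional[tuple[int, str]]:
        """Return (turn_id, prompt) for the LLM, or None if there is nothing to send."""
        if self.state is not Turn.LISTENING or not utterance.strip():
            return None
        self.turn_id += 1
        self.state = Turn.THINKING
        return self.turn_id, utterance

    def tts_ready(self, turn_id: int) -> bool:
        """Return True if audio for this turn may play; False means it is stale."""
        if self.state is not Turn.THINKING or turn_id != self.turn_id:
            return False
        self.state = Turn.SPEAKING
        return True

    def tts_finished(self) -> None:
        """Playback completed without interruption."""
        if self.state is Turn.SPEAKING:
            self.state = Turn.LISTENING
```

The stale-audio check in `tts_ready` is the unforgiving part: without it, a reply to the interrupted prompt can start playing over the user's new question.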

4. Evals as code replaced the dashboard era

Two years ago "AI evals" meant a Streamlit dashboard with a few prompts and a thumbs-up button. This year the dominant pattern is: evals as code, with regression budgets, gated in CI, failing pull requests when the release loses ground against the baseline.

The interesting thing is what fell out of fashion.

Vendor dashboards for evals are quietly losing share. Teams want their eval suite in the same repo as the code so a feature branch ships its eval delta with the PR.
LLM-as-judge stayed in the toolbox but with much closer scrutiny of bias. The serious teams now ensemble multiple judges and report agreement rates.
Synthetic test sets are losing ground to small, hand-curated, version-controlled regression sets. A 50-example hand-curated suite that exercises the actual failure modes you care about is worth more than 5,000 synthetically generated examples that mostly test the trivial cases.

My own ai-eval-runner is exactly that shape: Python plus a Typer CLI, DuckDB as the result store, FastAPI plus HTMX viewer, regression mode that fails CI when a release loses ground. The conference made me feel less alone in this design. The pattern is consolidating fast.
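
The regression gate itself is a few lines. What follows is a minimal sketch of the idea, not the ai-eval-runner code: `find_regressions`, `gate`, the metric names, and the default budget of 0.02 are all assumptions chosen for illustration. The only contract that matters is that the process exits non-zero when the current run loses more ground than the budget allows.

```python
import sys


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    budget: float = 0.02,
) -> list[str]:
    """Return one message per metric that dropped by more than `budget`.

    A metric that exists in the baseline but is missing from the current
    run counts as a failure, so a deleted test cannot pass silently.
    """
    failures: list[str] = []
    for metric, base_score in sorted(baseline.items()):
        score = current.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from current run")
        elif base_score - score > budget:
            failures.append(f"{metric}: {base_score:.3f} -> {score:.3f}")
    return failures


def gate(baseline: dict[str, float], current: dict[str, float], budget: float = 0.02) -> int:
    """Print regressions and return a process exit code (0 = pass, 1 = fail)."""
    failures = find_regressions(baseline, current, budget)
    for line in failures:
        print(f"REGRESSION {line}")
    return 1 if failures else 0


if __name__ == "__main__":
    # Hypothetical scores; in CI these would be loaded from the result store.
    baseline_scores = {"extraction_accuracy": 0.90, "refusal_rate_ok": 0.80}
    current_scores = {"extraction_accuracy": 0.85, "refusal_rate_ok": 0.79}
    sys.exit(gate(baseline_scores, current_scores))
```

Wired into a CI job, the non-zero exit code is what fails the pull request. The budget is there so that ordinary run-to-run noise does not block every merge.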

5. Long-running agents found their operating model

The conference talked about agents like the field had grown up. Two years ago you could not avoid "autonomous agents" as a marketing phrase. This week the word "agent" specifically meant "a multi-step workflow with tool calls and external side effects, journaled to durable storage so it can be inspected and replayed."

This is the right definition. And it shows up in the architecture.

The dominant pattern is a journaled execution model with deterministic replay. Postgres-backed event sourcing, BullMQ or a comparable queue for step orchestration, hard token, tool-use, and wall-clock budgets per step, plus an Inspector UI to visualise live state and replay from any step. Temporal^[8] is making a serious push into this space with their AI workflows messaging.

My take from agent-orchestrator^[13]: this is the right model. The journal is what makes long agent workflows debuggable. If you cannot replay from step seven you cannot fix the bug that only happens after step six. The token budget per step is what stops a model from looping until the context exhausts. The Inspector UI is what lets a human approve the side effect before it goes out the door. Build all three.
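
Journaled execution with replay can be shown in miniature. The sketch below is an assumption-laden toy, not the agent-orchestrator design: a Python list of JSON strings stands in for a Postgres event table, `run` and `BudgetExceeded` are hypothetical names, and a simple cap on step count stands in for the token, tool-use, and wall-clock budgets.

```python
import json
from typing import Any, Callable

# A step reads the accumulated state and returns the updates it produced.
Step = Callable[[dict[str, Any]], dict[str, Any]]


class BudgetExceeded(Exception):
    """Raised when a workflow asks for more steps than its budget allows."""


def run(
    steps: list[tuple[str, Step]],
    journal: list[str],
    max_steps: int = 20,
) -> dict[str, Any]:
    """Execute steps in order, replaying journaled outputs instead of re-running them.

    Steps already present in the journal are not executed again, so their
    side effects happen once. Execution resumes at the first unjournaled step.
    """
    if len(steps) > max_steps:
        raise BudgetExceeded(f"{len(steps)} steps exceeds budget of {max_steps}")

    state: dict[str, Any] = {}
    for index, (name, step) in enumerate(steps):
        if index < len(journal):
            entry = json.loads(journal[index])
            if entry["step"] != name:
                raise ValueError(f"journal entry {index} is {entry['step']!r}, expected {name!r}")
            output = entry["output"]  # replay: reuse the recorded result
        else:
            output = step(state)  # live: run the step and record it
            journal.append(json.dumps({"step": name, "output": output}))
        state.update(output)
    return state
```

Truncating the journal to its first N entries and calling `run` again re-executes everything from step N+1 onward against the exact recorded inputs, which is the "replay from step seven" property. Step outputs must be JSON-serialisable for this to work, a constraint a real journal would enforce at write time.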

6. Local-first assistants are the next distribution surface

The quietest theme of the week was, in my view, the most important. There is a wave of desktop AI assistants coming. They are open-source, cross-platform, brain-agnostic, and they run on the user's own subscription. Tauri 2.0^[9] is the most common shell, although a couple of native-Swift demos were doing the rounds too.

The shape that is emerging:

One Rust core, three operating systems (macOS, Windows, Linux).
A translucent multi-monitor HUD as the visible surface.
A wake word, an STT loop, a brain router across multiple subscription-backed CLIs (Claude Code, Codex, Gemini CLI, Ollama), a TTS loop.
A skill bus where every skill is an MCP server.
File-based memory with vector recall.
Single-click OAuth for calendar, mail, music, home automation. OS keychain everywhere.

This is the architecture I have been building toward for the last twenty months. echo^[12] is my contribution to that wave, shipping publicly on 1 July 2026. The conference made it clear that this is not a niche I am building in alone. It is the next wave of where real users actually live.

What I am shipping in response

Here is the practical takeaway. If you take nothing else from this post, take the architecture moves I am making for the second half of 2026 in response to what the conference signalled.

Echo public 0.1.0 on 1 July 2026^[12]. Brain-agnostic personal AI assistant. Voice loop, multi-monitor HUD, memory, MCP skill bus. Runs on any subscription you already pay for. Cross-platform from one Rust core. MIT licensed.
Sarmalink-AI v2 already shipped^[10]. Intent auto-routing, multi-step agent runner with SSE, MCP-shape tool catalog, TTS plus STT cascades, quota tracker, smart suggestions, reasoning-leak stripper, Markdown to PDF, JSON to XLSX. Free open-source gateway, deployable on Vercel.
slipstream as the coding agent runner^[11]. Persistent memory, PreCompact session digest, signal-ranked recall, live local dashboard. Nine sp_ MCP tools that any MCP-capable editor can call.
agent-orchestrator pattern documented^[13]. The journaled durable workflow story for anyone running long agents.
voice-agent-starter as the voice reference^[14]. Sub-second WebRTC turn loop with pluggable adapters.
forge-infer for the brave^[15]. Minimal Rust LLM inference server with paged KV cache, continuous batching, speculative decoding. For when you want to learn the internals.

That is the move. The themes the conference surfaced are the themes I am building around. The repos are open. The code is honest. If you are an AI engineer in mid-2026 and any of this is useful to you, that is the whole point.

The corridor conversation I cannot publish

The most interesting things I heard this week were in corridor conversations. Two examples without naming names.

A senior engineer at one of the larger US labs told me their internal eval suite has shifted away from a 50,000-example automated benchmark and toward a 200-example hand-curated regression suite. The reason was the same as mine. The automated suite passed everything. The hand-curated suite caught the bugs.

A platform engineer at a UK fintech told me their AI gateway is now their single most expensive piece of infrastructure to operate, ahead of their core API. The reason was not raw spend on models but the operational overhead of provider rotations, key hygiene, and per-team quota tracking. The conference moved them from "should we build a gateway" to "we already built one and it is bigger than we thought."

Both of those conversations made me feel the field is finally maturing into something engineers can take seriously as a craft. The hype layer is shrinking. The operational layer is growing. That is exactly the right direction.

What to read if you want to go further

The MCP specification^[2]. Forty minutes. Worth every minute.
LangChain's State of AI 2025^[4]. The methodology section is more useful than the headline numbers.
AI Engineer Summit talks from the back catalogue^[16]. The voice and evals sessions held up better than the agent ones.
My own product pages at sarmalinux.com/products/sarmalink-ai, sarmalinux.com/products/echo, and sarmalinux.com/products/slipstream, each of which has an architecture diagram and a whitepaper.

Closing

The themes are clear. MCP won. Multi-provider gateway is the new minimum. Voice is real. Evals as code beat dashboards. Long agents work when journaled. Local-first assistants are the next wave.

The work for the second half of 2026 is to ship into that consensus. That is what I am doing. The repos are open. The plan is published. If any of this helps you ship something honest, the conference, and this post, did their job.

Email me if you want to talk about any of it: hello@sarmalinux.com.

---

A note on this post

This is a personal recap of themes from a major AI engineering conference. The themes referenced (MCP momentum, multi-provider gateway adoption, voice production stacks, evals as code, local-first agents) are well-documented in the public record and linked in the citations. The opinions are mine. Any specific stat I quote comes from a linked source in the citations block. If you spot a number you cannot reproduce, email me at hello@sarmalinux.com and I will fix or remove it.



AI Engineer World's Fair 2026: what actually mattered