AI agents in 2026: where the actual ground is
A grounded look at where general-purpose agents actually deliver in 2026, with benchmark numbers and a strong opinion on which products are real and which are demo-ware.
A year ago everyone was building agents. By May 2026 most of those startups have folded into Cursor, Replit, or Manus. The market consolidated. The benchmarks consolidated. Here is what actually works.
The four categories that matter
Agent capability is not one thing. It is at least four:
1. Coding — generating, modifying, and debugging code in a real repo. Measured by SWE-bench Verified[^1].
2. Research — multi-step gathering and synthesising of information. Measured by GAIA Level 3[^3].
3. Browsing — navigating real websites, filling forms, reading dynamic UIs. Measured by WebArena[^4].
4. Desktop tasks — driving real applications via OS-level UI. Measured by OSWorld[^2].
Most products are good at one. The general-purpose ones try to do all four and fail in different ways.
```chart::capabilities ```
What the chart says
Cursor wins coding, full stop. Its SWE-bench Verified score is 78 percent as of May 2026; Devin is at 71, Manus at 62. The gap is real and it is widening. Cursor's tight IDE integration, codebase indexing, and model fine-tuning give it a measurable lead.
Manus wins everything else. Research, browsing, desktop tasks, long-running execution. The general-purpose agent design wins where the task does not have a fixed shape.
Devin sits awkwardly between them. It is the priciest of the three, the slowest, and rarely the best at any single category. Cognition's product strategy seems to be evolving toward agent-as-engineer replacement, a role the underlying models are not yet capable of filling.
What the benchmark scores hide
Benchmarks reward solvable problems. Real agent work has unsolvable problems woven in. Examples:
- The browsing target requires CAPTCHAs every fifth page
- The codebase has a build system the agent has never seen
- The user is impatient and changes the task halfway through
- The PDF extraction returns garbage because the document is scanned
Benchmarks score on success rate. Real users score on confidence. They are different things.
> Manus's research win comes mostly from being willing to say "I could not find a definitive answer" instead of guessing. The other agents pretend.
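The distinction is easy to state in code. Below is a toy scorer contrasting the benchmark view (plain success rate) with a confidence-oriented view that rewards an honest abstention over a confident wrong answer. The penalty weights are illustrative assumptions, not taken from any published benchmark.

```python
# Toy contrast: benchmark scoring vs. user-confidence scoring.
# An outcome is one of: "correct", "wrong", "abstained".
# The weights below are illustrative assumptions, not from any benchmark.

def benchmark_score(outcomes: list[str]) -> float:
    """What leaderboards measure: fraction of tasks solved."""
    return outcomes.count("correct") / len(outcomes)

def confidence_score(outcomes: list[str]) -> float:
    """What users feel: a confident wrong answer costs more than an honest abstention."""
    weights = {"correct": 1.0, "abstained": 0.3, "wrong": -1.0}
    return sum(weights[o] for o in outcomes) / len(outcomes)

guesser   = ["correct"] * 6 + ["wrong"] * 4                    # never abstains
abstainer = ["correct"] * 5 + ["abstained"] * 4 + ["wrong"]    # admits ignorance

print(benchmark_score(guesser), benchmark_score(abstainer))    # 0.60 vs 0.50
print(confidence_score(guesser), confidence_score(abstainer))  # 0.20 vs 0.52
```

The guesser wins the leaderboard and loses the user. That is the Manus research result in miniature.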
Cost as a capability
Cost-efficiency is on the radar chart for a reason. Devin charges roughly $20 per task as of May 2026[^5]. Manus charges roughly $0.40-2.00 per task. Cursor's coding mode runs on a flat subscription.
For a developer doing 50 small tasks a day, that is the difference between roughly $20 and well over $1,000 in monthly spend.
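To make the arithmetic concrete, here is a back-of-the-envelope sketch using the per-task prices quoted above. The 50-task workload, the 21 working days per month, and the ~$20/month flat rate for Cursor are my assumptions, not vendor figures.

```python
# Back-of-the-envelope monthly agent spend at the prices quoted above.
# Assumptions (mine, not any vendor's): 50 small tasks/day, 21 working
# days/month, and ~$20/month for Cursor's flat coding subscription.

TASKS_PER_DAY = 50
WORKDAYS_PER_MONTH = 21

def per_task_monthly(price_per_task: float) -> float:
    """Monthly cost when every task is billed individually."""
    return price_per_task * TASKS_PER_DAY * WORKDAYS_PER_MONTH

print("Cursor (flat):       $20/month")
print(f"Manus ($0.40/task):  ${per_task_monthly(0.40):,.0f}/month")
print(f"Manus ($2.00/task):  ${per_task_monthly(2.00):,.0f}/month")
print(f"Devin ($20/task):    ${per_task_monthly(20.00):,.0f}/month")
```

At the quoted rates the Manus range lands between roughly $400 and $2,100 a month, and Devin's per-task price pushes the same workload past $20,000. That is why the flat-subscription model matters so much for high-volume coding work.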
The price gap has narrowed because Anthropic and OpenAI both released cheaper agent-tier pricing in early 2026[^5][^6]. But the gap is still material.
What I actually use
For coding inside an IDE: Cursor Agent. The integration is the product.
For ad-hoc research, browsing, and desktop work: Manus. The general-purpose interface is faster than configuring a specialist tool. They hand out free credits on signup [via this link](https://manus.im/invitation/AIRTDVWVEWKCK4R), which is enough to evaluate it on real tasks.
For long-running orchestrated workflows where I need replay and budgets: my own [agent-orchestrator](https://github.com/sarmakska/agent-orchestrator). Not because it is better; because I trust the durability semantics.
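To be clear about what I mean by durability semantics, here is a minimal sketch, not agent-orchestrator's actual API: every step is journaled before the run proceeds, and a hard budget is checked before each step, so a crashed or cancelled run resumes from the log instead of re-spending.

```python
# Minimal sketch of budget-capped, replayable agent execution.
# Hypothetical interfaces throughout -- not agent-orchestrator's real API.
import json
from pathlib import Path
from typing import Callable

def run_with_budget(
    steps: list[Callable[[], dict]],  # each step returns {"result": ..., "cost_usd": float}
    journal_path: Path,
    budget_usd: float,
) -> None:
    journal = []
    if journal_path.exists():  # a previous run exists: load its journal
        journal = [json.loads(line) for line in journal_path.read_text().splitlines()]
    spent = sum(entry["cost_usd"] for entry in journal)
    with journal_path.open("a") as f:
        for i, step in enumerate(steps):
            if i < len(journal):
                continue  # already executed in a previous run; replayed from the log
            if spent >= budget_usd:
                raise RuntimeError(f"budget exhausted at step {i}: ${spent:.2f}")
            entry = step()                     # do the work
            f.write(json.dumps(entry) + "\n")  # journal before moving on
            f.flush()
            spent += entry["cost_usd"]
```

The ordering is the point: journal first, then proceed. Mid-run cost visibility falls out of it for free, which connects to the cost-transparency complaint below.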
For multi-document QA and grounding: I still self-host [rag-over-pdf](https://github.com/sarmakska/rag-over-pdf) when the documents are sensitive. The general agents are great at general questions; specific document grounding is still better with explicit retrieval.
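By "explicit retrieval" I mean the boring version: fetch the relevant passages first, answer only from them, and surface the chunk IDs so the answer is checkable. A toy keyword-overlap retriever below stands in for the embedding search a real pipeline such as rag-over-pdf would use (an assumption; the repo's internals may differ).

```python
# Toy explicit-retrieval grounding: answer only from retrieved chunks and
# surface which chunks were used. Keyword overlap stands in for the
# embedding search a real pipeline would use.

def retrieve(query: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Return the ids of the k chunks sharing the most words with the query."""
    q = set(query.lower().split())
    ranked = sorted(chunks, key=lambda cid: -len(q & set(chunks[cid].lower().split())))
    return ranked[:k]

chunks = {
    "p3": "The warranty covers parts for 24 months from purchase.",
    "p7": "Labour costs are excluded from the warranty after 12 months.",
    "p9": "Returns must be initiated within 30 days.",
}
hits = retrieve("how long does the warranty cover parts", chunks)
context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in hits)
# The model prompt is then constrained to this context, and the answer
# carries the [p3]-style citations so a human can verify the grounding.
print(context)
```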
Where the agents fail in 2026
Long-tail desktop applications. Agents can drive Chrome and VS Code. They struggle with line-of-business desktop apps that have unusual rendering or accessibility-tree gaps.
Recovering from confidently wrong actions. When the agent thinks the task is done but it is not, the recovery loop is bad. This is true across every product I have tested.
Cost transparency mid-run. Most products show the bill at the end. By the time you see it, you cannot stop it. Manus is best here; Devin is worst.
Refusal calibration. Agents either refuse too often (Anthropic's Claude 4 skews this way[^5]) or refuse too rarely (Manus's free tier skews this way). Neither is calibrated for real workflows.
The real prediction for 2027
Agents will get better at coding. They will not get better at strategy. Strategy requires sustained context across weeks, the ability to admit ignorance, and the willingness to ask a human. Agents in 2026 do not have any of those.
The acquisition wave is coming. Several of the standalone agent companies will be folded into model labs by end of 2026. Manus, with a working consumer product and strong revenue, is the most likely target.
Read the actual benchmarks
- [SWE-bench Verified leaderboard](https://www.swebench.com/) — coding
- [OSWorld leaderboard](https://os-world.github.io/) — desktop
- [GAIA paper](https://arxiv.org/abs/2311.12983) — research
- [WebArena](https://webarena.dev/) — browsing
If you want to write your own evals, [ai-eval-runner](https://github.com/sarmakska/ai-eval-runner) is the simplest way to plug a model in and measure it on your data.
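If even that feels heavy, the core loop fits in a page. This sketch is not ai-eval-runner's actual interface, and `call_agent` is a placeholder for whatever product you are measuring, not a real API.

```python
# Bare-minimum eval loop: run every case, check the output, report a rate.
# `call_agent` is a placeholder, not a real API -- wire in your own client.
import json
from pathlib import Path

def call_agent(prompt: str) -> str:
    raise NotImplementedError("plug in the product you are evaluating")

def run_eval(dataset_path: Path) -> float:
    # Each JSONL line looks like: {"prompt": ..., "expected": ...}
    cases = [json.loads(line) for line in dataset_path.read_text().splitlines()]
    passed = 0
    for case in cases:
        output = call_agent(case["prompt"])
        if case["expected"].lower() in output.lower():  # crude containment check
            passed += 1
    return passed / len(cases)

if __name__ == "__main__":
    print(f"pass rate: {run_eval(Path('my_tasks.jsonl')):.1%}")
```

The containment check is deliberately crude; the lesson of this whole post is that the grader matters more than the harness, so replace it with whatever "correct" means for your tasks.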