I have run 50 real client tasks through four AI coding agents this quarter. Here is the honest comparison.
Pass rates
Source: My own tracking, Q1 2026
Claude Code (the CLI tool) won this on first-try pass rate (62 percent). Cursor was close second (56 percent). Manus and Devin were lower.
That is for tasks under 30 minutes of human work. For tasks that take a human three or four hours, the numbers shift; Manus catches up because it can actually research.
Strengths
| Spec | Manus | Cursor | Devin | Claude Code |
|---|---|---|---|---|
| Best for | Multi-step research | IDE coding flow | Async background tasks | Terminal-led editing |
| Failure mode | Wanders | Quick refusal | Slow loops | Over-edits |
| Speed (avg task) | 12 min | 4 min | 38 min | 6 min |
| Cost (per real task) | ~$0.80 | ~$0.20 | ~$2.50 | ~$0.30 |
| My pick for | Spec exploration | Daily coding | Overnight chores | Surgical edits |
Manus is best when I do not know what the answer is. "Research the cheapest way to host a Postgres for an EU SaaS, write me a one-pager and a deploy script" is a Manus task[1].
Cursor is best when I know what I want and want fast iteration in the IDE. Composer mode with Claude Sonnet 4.6 handles most of my daily code work[2].
Devin is best for async background work. "Take this list of 30 small bug reports and try to fix each one overnight." Slower per task but you give it 8 hours and come back to a PR list[3].
Claude Code is best for surgical editing in a single repo. The terminal flow is faster than mouse-driven IDE work for many tasks.
Cost
A typical "fix a bug, add a test" task:
- Cursor: $0.20 in tokens
- Claude Code: $0.30
- Manus: $0.80 (does more, costs more)
- Devin: $2.50 (slow loops)
For a freelancer billing £80/hour these are rounding error. The constraint is correctness and time, not token cost.
Where they all fail
Cross-cutting refactors that touch 20+ files. All four lose context, all four make inconsistent choices across files. Humans still do these better.
Anything that requires understanding business logic that is not in the codebase. Easy to fall back to "well, I will guess." Critical if you do not review carefully.
My setup
For active client work: Cursor Composer + Claude Code in a tmux split. Cursor for IDE-led tasks, Claude Code for terminal-led tasks.
For overnight: Devin handles a queue of "boring but real" tickets, I review the PRs in the morning.
For exploration: Manus when I do not know the answer.
There is no single tool. The 2026 AI coding setup is a toolkit, not a single product.
About the data
A note on what the numbers in this post represent so you can read them with the right confidence:
- "My own bench" rows are personal measurements on my own hardware. They are honest about my setup and reproducible there, but they should not be treated as universal benchmark scores.
- Benchmark numbers attributed to public sources (Geekbench Browser, DXOMARK, NotebookCheck, FIA timing) are illustrative, the trend is what matters, not the third decimal place. Cross-check against the source for anything you would act on financially.
- Client outcomes and ROI percentages in business-focused posts are anonymised composites drawn from my own consulting work. Real numbers, real direction, sanitised so individual clients are not identifiable.
- Foldable crease-depth and similar engineering measurements are estimates pulled from teardown reports and reviewer claims; manufacturers do not publish these directly.
- Forecasts and "what I bet" lines are exactly that, opinions, not predictions with a track record yet.
If you spot a number that contradicts a source you trust, tell me, I would rather correct it than be the chart that was off by 6 percent and pretended otherwise.
Live: latest HN discussion on Manus
The agent space changes weekly. Latest threads matching "manus" on HN:
Source: HN Algolia · cached 10–60 min