A year ago everyone was building agents. By May 2026, most of those startups have folded into Cursor, Replit, or Manus. The market consolidated. The benchmarks consolidated. Here is what actually works.
The four categories that matter
Agent capability is not one thing. It is at least four:
- Coding — generating, modifying, debugging code in a real repo. Measured by SWE-bench Verified[1].
- Research — multi-step information gathering and synthesis. Measured by GAIA Level 3[3].
- Browsing — navigating real websites, filling forms, reading dynamic UIs. Measured by WebArena[4].
- Desktop tasks — driving real applications via OS-level UI. Measured by OSWorld[2].
Most products are good at one. The general-purpose ones try to do all four and fail in different ways.
Source: Self-tested via published benchmarks plus my own ten-task private suite
What the chart says
Cursor wins coding, full stop. Its SWE-bench Verified score sits at 78 percent as of May 2026; Devin is at 71, Manus at 62. The gap is real and it is widening. Cursor's tight IDE integration, codebase indexing, and model fine-tuning give it a measurable lead.
Manus wins everything else. Research, browsing, desktop tasks, long-running execution. The general-purpose agent design wins where the task does not have a fixed shape.
Devin sits awkwardly between them. It is the priciest of the three, the slowest, and rarely the best in any single category. Cognition's product strategy seems to be evolving toward agent-as-engineer-replacement, which the underlying model cannot yet deliver.
What the benchmark scores hide
Benchmarks reward solvable problems. Real agent work has unsolvable problems woven in. Examples:
- The browsing target requires CAPTCHAs every fifth page
- The codebase has a build system the agent has never seen
- The user is impatient and changes the task halfway through
- The PDF extraction returns garbage because the document is scanned
Benchmarks score on success rate. Real users score on confidence. They are different things.
Manus's research win comes mostly from being willing to say "I could not find a definitive answer" instead of guessing. The other agents pretend.
Cost as a capability
Cost-efficiency is on the radar chart for a reason. Devin charges roughly $20 per task in May 2026[5]. Manus charges roughly $0.40-2.00. Cursor's coding mode runs on a flat subscription.
For a developer running 50 small tasks a day (roughly 1,000 a month), that is the difference between about $20 on Cursor's flat subscription, on the order of $1,000 on Manus's per-task pricing, and around $20,000 on Devin's.
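The arithmetic is worth making explicit. This sketch uses the per-task prices quoted above; the 20 working days per month and the flat-subscription figure are my assumptions, not published numbers:

```python
# Monthly agent spend under the pricing quoted in this post.
# Assumptions: 50 tasks/day, ~20 working days/month, $20/mo flat sub.
TASKS_PER_DAY = 50
WORK_DAYS = 20
tasks_per_month = TASKS_PER_DAY * WORK_DAYS  # 1,000 tasks

devin_cost = tasks_per_month * 20.00   # ~$20 per task
manus_low = tasks_per_month * 0.40     # $0.40-2.00 per task range
manus_high = tasks_per_month * 2.00
cursor_cost = 20.00                    # flat subscription (assumed price)

print(f"Devin:  ${devin_cost:,.0f}")
print(f"Manus:  ${manus_low:,.0f}-${manus_high:,.0f}")
print(f"Cursor: ${cursor_cost:,.0f}")
```

At even the low end of Manus's range, the per-task products cost more per month than a flat subscription costs per year.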
The price gap has narrowed because Anthropic and OpenAI both released cheaper agent-tier pricing in early 2026[5][6]. But the gap is still material.
What I actually use
For coding inside an IDE: Cursor Agent. The integration is the product.
For ad-hoc research, browsing, desktop work: Manus. The general-purpose interface is faster than configuring a specialist tool. They hand out free credits on signup, which is enough to evaluate it on real tasks.
For long-running orchestrated workflows where I need replay and budgets: my own agent-orchestrator. Not because it is better; because I trust the durability semantics.
For multi-document QA and grounding: I still self-host rag-over-pdf when the documents are sensitive. The general agents are great at general questions; specific document grounding is still better with explicit retrieval.
Where the agents fail in 2026
Long-tail desktop applications. Agents can drive Chrome and VS Code. They struggle with line-of-business desktop apps that have unusual rendering or accessibility-tree gaps.
Recovering from confidently wrong actions. When the agent thinks the task is done but it is not, the recovery loop is bad. This is true across every product I have tested.
Cost transparency mid-run. Most products show the bill at the end. By the time you see it, you cannot stop it. Manus is best here; Devin is worst.
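Mid-run cost visibility is mostly a product decision, not a hard problem. A minimal budget guard, assuming your agent loop exposes a per-step cost (the numbers and the `charge` interface here are hypothetical), looks like this:

```python
class BudgetExceeded(Exception):
    pass

class BudgetGuard:
    """Abort an agent run the moment cumulative spend crosses a cap."""

    def __init__(self, cap_usd: float):
        self.cap = cap_usd
        self.spent = 0.0

    def charge(self, step_cost_usd: float) -> None:
        # Record the step's cost, then fail fast if over budget.
        self.spent += step_cost_usd
        if self.spent > self.cap:
            raise BudgetExceeded(
                f"spent ${self.spent:.2f} of ${self.cap:.2f} cap"
            )

guard = BudgetGuard(cap_usd=1.00)
for cost in [0.30, 0.30, 0.30, 0.30]:  # illustrative per-step costs
    try:
        guard.charge(cost)
    except BudgetExceeded as exc:
        print("stopped mid-run:", exc)
        break
```

The point is that the check runs per step, so the user finds out before the fourth step completes, not when the invoice arrives.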
Refusal calibration. Agents either refuse too often (Anthropic's Claude 4 skews this way[5]) or refuse too rarely (Manus's free tier skews this way). Neither is calibrated for real workflows.
The real prediction for 2027
Agents will get better at coding. They will not get better at strategy. Strategy requires sustained context across weeks, the ability to admit ignorance, and the willingness to ask a human. Agents in 2026 do not have any of those.
The acquisition wave is coming. Several of the standalone agent companies will be folded into model labs by end of 2026. Manus, with a working consumer product and strong revenue, is the most likely target.
Read the actual benchmarks
- SWE-bench Verified leaderboard — coding
- OSWorld leaderboard — desktop
- GAIA paper — research
- WebArena — browsing
If you want to write your own evals, ai-eval-runner is the simplest way to plug a model in and measure it on your data.
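If you roll your own instead, the core of an eval runner is small. This is a generic sketch, not ai-eval-runner's actual API; `run_agent` is a stand-in for whatever model call you plug in, and the toy cases are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str
    expected: str

def run_suite(cases: list[EvalCase],
              run_agent: Callable[[str], str]) -> float:
    """Run every case through the agent and return the success rate."""
    passed = sum(
        1 for c in cases if run_agent(c.prompt).strip() == c.expected
    )
    return passed / len(cases)

# A toy agent that only knows one answer, to show the shape of the loop.
cases = [EvalCase("2+2", "4"), EvalCase("capital of France", "Paris")]
rate = run_suite(cases, run_agent=lambda prompt: "4")
print(f"success rate: {rate:.0%}")
```

Everything else in a real harness (retries, timeouts, per-case cost tracking) hangs off this loop.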
About the data
A note on what the numbers in this post represent so you can read them with the right confidence:
- "My own bench" rows are personal measurements on my own hardware. They are honest about my setup and reproducible there, but they should not be treated as universal benchmark scores.
- Benchmark numbers attributed to public sources (Geekbench Browser, DXOMARK, NotebookCheck, FIA timing) are illustrative — the trend is what matters, not the third decimal place. Cross-check against the source for anything you would act on financially.
- Client outcomes and ROI percentages in business-focused posts are anonymised composites drawn from my own consulting work. Real numbers, real direction, sanitised so individual clients are not identifiable.
- Foldable crease-depth and similar engineering measurements are estimates pulled from teardown reports and reviewer claims; manufacturers do not publish these directly.
- Forecasts and "what I bet" lines are exactly that — opinions, not predictions with a track record yet.
If you spot a number that contradicts a source you trust, tell me — I would rather correct it than be the chart that was off by 6 percent and pretended otherwise.