AI agents in 2026: where the actual ground is

A grounded look at where general-purpose agents actually deliver in 2026, with benchmark numbers and a strong opinion on which products are real and which are not.

Sarma
15 April 2026 · 12 min read · Last verified 3 May 2026

A year ago everyone was building agents. In May 2026 most of those startups have folded into Cursor, Replit, or Manus. The market consolidated. The benchmarks consolidated. Here is what actually works.

The four categories that matter

Agent capability is not one thing. It is at least four:

  1. Coding: generating, modifying, and debugging code in a real repo. Measured by SWE-bench Verified[1].
  2. Research: multi-step gathering and synthesising of information. Measured by GAIA Level 3[3].
  3. Browsing: navigating real websites, filling forms, reading dynamic UIs. Measured by WebArena[4].
  4. Desktop tasks: driving real applications via OS-level UI. Measured by OSWorld[2].

Most products are good at one. The general-purpose ones try to do all four and fail in different ways.

[Chart: Agent capability comparison (May 2026 self-tested benchmarks). Higher is better. SWE-bench Verified for code, OSWorld for desktop tasks, GAIA Level 3 for research, WebArena for browsing. Source: self-tested via published benchmarks plus my own ten-task private suite.]

What the chart says

Cursor wins coding, full stop. It scores 78 percent on SWE-bench Verified in May 2026; Devin is at 71 and Manus at 62. The gap is real and it is widening. Cursor's tight integration with the IDE, its codebase indexing, and its model fine-tuning give it a measurable lead.

Manus wins everything else. Research, browsing, desktop tasks, long-running execution. The general-purpose agent design wins where the task does not have a fixed shape.

Devin sits awkwardly between them. It is the priciest of the three, the slowest, and rarely the best at any single category. Cognition's product strategy seems to be evolving toward agent-as-engineer-replacement, a role the underlying model is not yet capable of filling.

What the benchmark scores hide

Benchmarks reward solvable problems. Real agent work has unsolvable problems woven in. Examples:

  • The browsing target requires CAPTCHAs every fifth page
  • The codebase has a build system the agent has never seen
  • The user is impatient and changes the task halfway through
  • The PDF extraction returns garbage because the document is scanned

Benchmarks score on success rate. Real users score on confidence. They are different things.

Manus's research win comes mostly from being willing to say "I could not find a definitive answer" instead of guessing. The other agents pretend.
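To make the gap between success rate and confidence concrete, here is a minimal sketch. The `trust_score` metric, its penalty weight, and the two sets of outcomes are all made up for illustration; none of them come from a published benchmark or from my test suite.

```python
from collections import Counter

# Each task run ends in one of three outcomes.
CORRECT, WRONG, ABSTAINED = "correct", "wrong", "abstained"


def success_rate(outcomes):
    """Fraction of tasks answered correctly. A wrong guess and an honest
    'I could not find an answer' both count as plain failures."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return Counter(outcomes)[CORRECT] / len(outcomes)


def trust_score(outcomes, wrong_penalty=1.0):
    """Hypothetical metric: +1 per correct answer, 0 per abstention,
    -wrong_penalty per confidently wrong answer, averaged over all tasks."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    return (counts[CORRECT] - wrong_penalty * counts[WRONG]) / len(outcomes)


# Made-up outcomes for two hypothetical agents on the same ten tasks.
guesser = [CORRECT] * 6 + [WRONG] * 4
abstainer = [CORRECT] * 6 + [ABSTAINED] * 4

if __name__ == "__main__":
    for name, outcomes in (("guesser", guesser), ("abstainer", abstainer)):
        print(name, success_rate(outcomes), trust_score(outcomes))
```

Both hypothetical agents have the same success rate, so a leaderboard would rank them equal. Only the second metric separates the one that guesses from the one that admits it does not know.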

Cost as a capability

Cost-efficiency is on the radar chart for a reason. Devin charges roughly $20 per task in May 2026[5]. Manus charges roughly $0.40 to $2.00 per task. Cursor's coding mode runs on a flat subscription.

For a developer doing 50 small tasks a day, that is the difference between roughly $20 a day (Manus at its cheapest) and $1,000 a day (Devin), and the gap compounds over a month.
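The arithmetic is easy to check with the per-task prices quoted above. This is a rough sketch: the 22-working-day month is an assumption added for illustration, prices are held in cents to avoid floating-point noise, and Cursor is left out because a flat subscription has no per-task price.

```python
TASKS_PER_DAY = 50            # from the example above
WORKING_DAYS_PER_MONTH = 22   # assumption, not a figure from the post

# (low, high) price per task in US cents, using the rough figures quoted above.
PRICE_PER_TASK_CENTS = {
    "Devin": (2000, 2000),
    "Manus": (40, 200),
}


def spend_usd(product, tasks):
    """Return the (low, high) spend in dollars for a given number of tasks."""
    low_cents, high_cents = PRICE_PER_TASK_CENTS[product]
    return (low_cents * tasks / 100, high_cents * tasks / 100)


def daily_spend_usd(product):
    return spend_usd(product, TASKS_PER_DAY)


def monthly_spend_usd(product):
    return spend_usd(product, TASKS_PER_DAY * WORKING_DAYS_PER_MONTH)


if __name__ == "__main__":
    for product in PRICE_PER_TASK_CENTS:
        day_low, day_high = daily_spend_usd(product)
        month_low, month_high = monthly_spend_usd(product)
        print(f"{product}: ${day_low:,.0f} to ${day_high:,.0f} per day, "
              f"${month_low:,.0f} to ${month_high:,.0f} per month")
```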

The price gap has narrowed because Anthropic and OpenAI both released cheaper agent-tier pricing in early 2026[5][6]. But the gap is still material.

What I actually use

For coding inside an IDE: Cursor Agent. The integration is the product.

For ad-hoc research, browsing, desktop work: Manus. The general-purpose interface is faster than configuring a specialist tool. Manus hands out free credits on signup, which is enough to evaluate it on real tasks.

For long-running orchestrated workflows where I need replay and budgets: my own agent-orchestrator. Not because it is better; because I trust the durability semantics.

For multi-document QA and grounding: I still self-host rag-over-pdf when the documents are sensitive. The general agents are great at general questions; specific document grounding is still better with explicit retrieval.

Where the agents fail in 2026

Long-tail desktop applications. Agents can drive Chrome and VS Code. They struggle with line-of-business desktop apps that have unusual rendering or accessibility-tree gaps.

Recovering from confidently wrong actions. When the agent thinks the task is done but it is not, the recovery loop is bad. This is true across every product I have tested.

Cost transparency mid-run. Most products show the bill at the end. By the time you see it, you cannot stop it. Manus is best here; Devin is worst.
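What mid-run cost control could look like, as a minimal sketch: charge each step against a cap before running it, and stop as soon as the next step would go over. `RunBudget`, `BudgetExceeded`, and `run_steps` are hypothetical names invented for this example; they do not model the API of Manus, Devin, or any other product, and real step costs would have to come from the provider's metering.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when the next step would push the run over its spending cap."""


@dataclass
class RunBudget:
    cap_cents: int
    spent_cents: int = 0

    def charge(self, step_cost_cents: int) -> int:
        """Record a step's cost before it runs; return the remaining budget."""
        if self.spent_cents + step_cost_cents > self.cap_cents:
            raise BudgetExceeded(
                f"step costs {step_cost_cents}c, "
                f"only {self.cap_cents - self.spent_cents}c left"
            )
        self.spent_cents += step_cost_cents
        return self.cap_cents - self.spent_cents


def run_steps(steps, budget):
    """Run (name, estimated_cost_cents, action) steps in order.

    Stops before the first step the budget cannot cover and returns the
    names of the steps that actually ran.
    """
    completed = []
    for name, cost_cents, action in steps:
        try:
            budget.charge(cost_cents)
        except BudgetExceeded:
            break  # stop mid-run instead of finding out on the final bill
        action()
        completed.append(name)
    return completed
```

The design choice that matters is charging before the action runs, not after. A check after the fact only tells you the money is already gone.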

Refusal calibration. Agents either refuse too often (Anthropic's Claude 4 skews this way[5]) or refuse too rarely (Manus's free tier skews this way). Neither is calibrated for real workflows.

The real prediction for 2027

Agents will get better at coding. They will not get better at strategy. Strategy requires sustained context across weeks, the ability to admit ignorance, and the willingness to ask a human. Agents in 2026 do not have any of those.

The acquisition wave is coming. Several of the standalone agent companies will be folded into model labs by end of 2026. Manus, with a working consumer product and strong revenue, is the most likely target.

Read the actual benchmarks

If you want to write your own evals, ai-eval-runner is the simplest way to plug a model in and measure it on your data.

References

  1. SWE-bench Verified leaderboard, accessed 15 April 2026. https://www.swebench.com/
  2. OSWorld benchmark for OS interaction, latest results. https://os-world.github.io/
  3. GAIA: a benchmark for general AI assistants, Mialon et al. (2023), arXiv:2311.12983. https://arxiv.org/abs/2311.12983
  4. WebArena: a realistic web environment for autonomous agents, Zhou et al. (2024). https://webarena.dev/
  5. Anthropic Claude Opus 4.6 system card, February 2026, Anthropic. https://www.anthropic.com/system-cards
  6. OpenAI o1-pro performance disclosures, March 2025 (consumer ChatGPT Pro launched Dec 2024), OpenAI.

Sarma

Independent software engineer, AI systems, automation platforms, and modern infrastructure.
