An independent Agentic.ai decision report. Every tool is ranked by our 9-dimension Agenticness rubric, not by who pays us (nobody does).

Agentic.ai · Decision Report · June 2026

The best AI tools for coding agents

Our pick

Claude Code — 21/36 · Level 3

Shares the top score in our independent /36 ranking for coding agents with Cursor (both 21/36); strongest on action and autonomy. Subscription.

Every tool is scored on our 9-dimension Agenticness rubric (out of 36) and ranked by that score, not by who pays us (nobody does). If the best option for your situation were something we couldn't profit from, this report would still say so; the independence is the point. Capability, pricing, and evidence are pulled from structured data; reliability is graded against cited, primary-sourced evidence (see below); adoption is measured as real outbound clicks from the directory. The full scorecard, the evidence behind each score, and our scoring method are below.

The shortlist

1. Claude Code — 21/36 · L3 · Subscription · 134 clicks/30d (↓ cooling)
Claude Code is Anthropic's agentic coding tool that lives in your terminal. It understands your entire codebase, makes multi-file edits, runs commands, manages git workflows, and uses MCP for tool integration. Built with a Unix philosophy — it reads, plans, edits, and verifies in a loop. The fastest-growing product in the coding agent category.
Strongest: action + autonomy. Weakest: safety (not yet evidenced).

2. Cursor — 21/36 · L3 · Freemium · 148 clicks/30d (→ steady)
Cursor is a developer-focused AI environment that adds agents, context, and automation around your repositories. It combines an editor-like interface, a CLI, and a cloud agent API to automate code review, bug fixing, CI hygiene, and more. Designed for individual developers and engineering teams who want AI to take real actions in their code and infrastructure, not just chat.
Strongest: action + autonomy. Weakest: continuity (1/4).

3. Cline — 19/36 · L3 · Free / open-source · 614 clicks/30d (↑ rising)
Cline is an AI coding assistant for VS Code that can inspect your project, edit files, run terminal commands, and use a browser while asking for permission at each step. It is aimed at developers working on real codebases who want more than code completion.
Strongest: action + planning. Weakest: reliability (not yet evidenced).

9-dimension scorecard

Tool	Action	Autonomy	Planning	Reliability	Safety	Continuity	Adaptation	Interop	Sovereignty	/36
Claude Code	3	3	3	3	0	3	2	2	2	21
Cursor	3	3	3	3	2	1	2	2	2	21
Cline	3	2	3	0	2	1	3	2	3	19
Devin Desktop	3	3	2	0	1	3	3	2	2	19
OpenHands	3	3	3	2	0	1	3	1	2	18
Factory AI	3	3	2	0	2	3	2	2	1	18

Each dimension is scored 0–4, for a maximum of 36. A score of 3 or higher is flagged as strong. A 0 means "not yet evidenced" (a sourcing stance), not "broken"; see the reliability section below.
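As a quick check on the arithmetic, the totals and ordering in the scorecard can be reproduced with a short sketch. The scores are transcribed from the table above; the function names and the tie-break rule (ties keep table order) are assumptions for illustration, since the report does not state how ties are broken.

```python
# Scores transcribed from the scorecard above, in column order.
DIMENSIONS = [
    "Action", "Autonomy", "Planning", "Reliability", "Safety",
    "Continuity", "Adaptation", "Interop", "Sovereignty",
]
MAX_PER_DIMENSION = 4  # each dimension is scored 0-4, so the ceiling is 9 * 4 = 36

SCORECARD = {
    "Claude Code":   [3, 3, 3, 3, 0, 3, 2, 2, 2],
    "Cursor":        [3, 3, 3, 3, 2, 1, 2, 2, 2],
    "Cline":         [3, 2, 3, 0, 2, 1, 3, 2, 3],
    "Devin Desktop": [3, 3, 2, 0, 1, 3, 3, 2, 2],
    "OpenHands":     [3, 3, 3, 2, 0, 1, 3, 1, 2],
    "Factory AI":    [3, 3, 2, 0, 2, 3, 2, 2, 1],
}


def total(scores):
    """Sum one tool's nine dimension scores after validating their range."""
    if len(scores) != len(DIMENSIONS):
        raise ValueError("expected one score per dimension")
    if any(not 0 <= s <= MAX_PER_DIMENSION for s in scores):
        raise ValueError("each score must be between 0 and 4")
    return sum(scores)


def ranked(scorecard):
    """Sort tools by total, highest first.

    Assumption: ties keep the table's order (sorted() is stable); the
    report does not state its tie-break rule.
    """
    totals = [(name, total(scores)) for name, scores in scorecard.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


def not_yet_evidenced(scores):
    """List the dimensions scored 0, which the report reads as 'not yet evidenced'."""
    return [dim for dim, s in zip(DIMENSIONS, scores) if s == 0]


for name, score in ranked(SCORECARD):
    print(f"{name}: {score}/{len(DIMENSIONS) * MAX_PER_DIMENSION}")
```

Calling `not_yet_evidenced(SCORECARD["Cline"])` returns the dimensions where a 0 reflects missing evidence rather than a failing grade, which is the distinction the note above draws.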

Reliability — the evidence

Reliability splits into two sub-signals: (a) harness-linked benchmark — does the tool's own agent harness post a verifiable score? — and (b) real-world incident / workflow history — how does it behave in production? A "0 — not yet evidenced" is a sourcing-integrity stance, not "broken": we won't award points to a number we can't trace to a primary source, and we never backfill with model marketing. We anchor on contamination-resistant benchmarks (Terminal-Bench 2.1, SWE-bench Pro, the Artificial Analysis Coding Agent Index) and documented incidents, not leaderboard-top numbers.

Claude Code — 3/4. Benchmark: SWE-bench Verified on Anthropic's own bash + file-edit scaffold — Opus 4.6 at 80.8% (primary-sourced); independent Terminal-Bench 2.1 78.9% (#2, tbench.ai, Jun 18 2026). Contamination caveat: leaderboard-top SWE-bench entries cite unverifiable model names, and the audited SWE-bench Pro sits far lower (~69%) — we score off the verifiable figures, not the leaderboard peak. Real-world: Severe March-2026 rate-limit / prompt-caching incident (5-hour budgets draining in ~20–70 min) plus a ~6-week quality regression — but an unusually transparent Apr-23 postmortem, fixed in v2.1.116, and the raw API was unaffected throughout. Mitigation: pin Claude Code versions in CI.

Cursor — 3/4. Benchmark: Cursor publishes harness-linked Composer scores — Composer 2.5 (May 18 2026): Terminal-Bench 2.0 69.3, SWE-bench Multilingual 79.8 (Cursor blog + arXiv 2603.24477). Caveat: SWE-bench Multilingual ≠ Verified, and competitor scores are measured in Cursor's own harness — so this is strong but vendor-harness-measured. Real-world: Mixed-to-negative user reports: the Cursor 2.1 release reportedly corrupted chat histories and worktrees; broken multi-file edits and unrelated-file changes recur. Line-by-line review of agent output is the consistently recommended mitigation. Roadmap risk (reported, unverified): Cursor's parent Anysphere was reported (Jun 16 2026, secondary sources only) to be acquired by SpaceX at a ~$60B valuation — verify before relying on it.

Cline — 0/4. Benchmark: No published benchmark for Cline itself — as a bring-your-own-model harness, its effective score tracks whichever model you point it at. Harness level: not yet evidenced. Real-world: The best-documented harness failure mode in the set (diff-edit / SEARCH-REPLACE failures), but openly tracked and actively mitigated — Cline shipped an order-invariant diff-apply algorithm reporting +~25% diff-edit success on Claude 3.5 Sonnet, with an open-sourced eval. The transparency is the credit here, not a benchmark.

Devin Desktop — 0/4. (Formerly Windsurf; Cognition rebranded it as Devin Desktop on Jun 2 2026. Same product, new owner.) Benchmark: Cognition benchmarks its in-house SWE-1.x models on SWE-bench Pro ('near-SOTA'; SWE-1.6 ~11% above SWE-1.5) but has published no precise percentage in its primary blogs, and there is no SWE-bench Verified score for the Cascade / Devin Local harness. Harness level: partial / not yet quantitatively evidenced. Real-world: Operability fixes dominate the April-2026 changelog (Cascade crash on conversation switch, Devin Cloud auth/start failures, MCP OAuth regressions, Windows updater), alongside independent user reports of crashes and RAM spikes. This evidence is citable, but weaker than a benchmark.

OpenHands — 2/4. Benchmark: The strongest open-source benchmark trail: SWE-bench Verified ~72% with Sonnet 4.5 via the OpenHands Software Agent SDK (arXiv 2511.03690), plus the continuously-updated OpenHands Index (an open, reproducible harness). Credible — but not top-tier on independent cross-tool comparisons. Real-world: Self-hosted, so you own the failure modes (sandbox/dependency stalls; the root-equivalent Docker socket needs hardening); the public PR stream shows steady reliability plumbing (sandbox timeouts, validation, health checks).
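The Claude Code entry above recommends pinning versions in CI. One way to enforce a pin is a small guard that fails the build when the installed version drifts. The sketch below is illustrative only: it assumes the `claude` CLI is on the PATH and prints a dotted version string for `--version`, it uses v2.1.116 (the fixed release named above) as the pin, and the function names are hypothetical.

```python
import re
import subprocess
import sys

# Assumption: pin to the release the report names as containing the fix.
# In practice, pin whichever version your team has validated.
PINNED_VERSION = "2.1.116"


def extract_version(text):
    """Return the first x.y.z version number found in text, or None."""
    match = re.search(r"(\d+\.\d+\.\d+)", text)
    return match.group(1) if match else None


def matches_pin(version_output, pinned=PINNED_VERSION):
    """True only when the reported version equals the pinned one exactly."""
    return extract_version(version_output) == pinned


def installed_version_output(command=("claude", "--version")):
    """Run the CLI and return its raw version output.

    Assumption: the command exists on PATH and exits 0; otherwise
    subprocess raises and the CI step fails, which is the desired outcome.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def main():
    """Exit non-zero when the installed version differs from the pin."""
    output = installed_version_output()
    if not matches_pin(output):
        sys.exit(f"version drift: expected {PINNED_VERSION}, got {output.strip()!r}")
    print(f"version pinned at {PINNED_VERSION}")
```

Calling `main()` from a CI step before the agent runs would turn a silent upgrade into a visible build failure. The same pattern applies to any tool in this list that ships a versioned CLI.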

Capability & fit

Tool	License	MCP	Self-host	Own model	Autonomy
Claude Code	◐ Source-avail	✅	✅	❌	Semi-autonomous
Cursor	❌ Proprietary	✅	✅	✅	Semi-autonomous
Cline	✅ Open	✅	✅	✅	Semi-autonomous
Devin Desktop	❌ Proprietary	✅	✅	❌	Semi-autonomous
OpenHands	✅ Open	❌	✅	✅	—
Factory AI	❌ Proprietary	✅	❌	✅	Semi-autonomous

Pricing & true cost

Tool	Model	Tiers
Claude Code	Subscription	Pro $20/month · Max $100/month · API
Cursor	Freemium	Hobby · Pro $20/month · Business $40/month
Cline	Free / open-source	Full functionality at no cost; you pay only for your own model API usage.
Devin Desktop	Freemium	Free $0 · Pro $20/mo · Max $200/mo · Teams $40/seat/mo (Mar 18 2026 reset: credits → quota + overage). In-house SWE-1.x models consume 0 quota; frontier models bill against the quota.
OpenHands	Free / open-source	Open-source (MIT) — free to self-host; you pay only model inference at provider API rates (no markup). OpenHands Cloud is usage-based with $20 in free credits; no public per-seat price.
Factory AI	Hybrid	Free: BYOK; $0/month · Pro ($20/mo): 20 million Standard Tokens per month · Max ($200/mo): 200 million Standard Tokens per month
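Token-metered tiers are easier to compare as an effective unit price. The sketch below works that out for the two paid Factory AI tiers listed above. The function name is hypothetical, and it treats "Standard Tokens" as an opaque vendor unit, since the table does not say how they map to actual model tokens.

```python
def usd_per_million_tokens(monthly_usd, tokens_per_month):
    """Effective price per one million included tokens for a flat monthly tier."""
    if tokens_per_month <= 0:
        raise ValueError("tokens_per_month must be positive")
    return monthly_usd / (tokens_per_month / 1_000_000)


# Tier prices and token allowances as listed in the pricing table above.
FACTORY_AI_TIERS = {
    "Pro": (20, 20_000_000),
    "Max": (200, 200_000_000),
}

for tier, (price, tokens) in FACTORY_AI_TIERS.items():
    rate = usd_per_million_tokens(price, tokens)
    print(f"{tier}: ${rate:.2f} per million Standard Tokens")
```

Both tiers come out at the same unit rate, so on these figures the Max tier buys more volume rather than a discount. Overage and free-tier usage are not modeled here.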

Independent evidence & momentum

Tool	Harness benchmark	GitHub	Last commit	Clicks/30d	Trend
Claude Code	SWE-bench Verified 80.8%	133,767 ★	1d ago	134	↓ cooling
Cursor	SWE-bench Multilingual 79.8%	—	—	148	→ steady
Cline	not yet evidenced	63,658 ★	today	614	↑ rising
Devin Desktop	SWE-bench Pro (no public %)	—	—	269	↑ rising
OpenHands	SWE-bench Verified ~72%	78,006 ★	today	159	↑ rising
Factory AI	—	4,112 ★	18d ago	59	↑ rising

"Harness benchmark" = a score tied to the tool's own agent harness, with the variant named. Variants are not directly comparable; "not yet evidenced" = no verifiable harness score (the underlying model may still score well). GitHub stars/recency are live from the GitHub API.

Sources & methodology: Reliability is graded on two sub-signals, (a) harness-linked benchmark and (b) real-world incident/workflow history, anchored on contamination-resistant benchmarks (Terminal-Bench 2.1, SWE-bench Pro, the Artificial Analysis Coding Agent Index) and documented incidents, never model marketing. Benchmark variants (SWE-bench Verified / Multilingual / Pro) are not directly comparable, and several scores are model-in-vendor-harness rather than tool-isolated. Pricing is volatile; re-verify each vendor's live pricing page on the day of publication.

Primary sources by tool:
Claude Code: anthropic.com/research/swe-bench-sonnet; tbench.ai (Terminal-Bench 2.1); github.com/anthropics/claude-code/issues/41930 + /41788; the Apr-23 2026 Claude Code postmortem.
Cursor: cursor.com/blog/composer-2; arxiv.org/abs/2603.24477; checkthat.ai (Cursor reviews); artificialanalysis.ai/agents/coding-agents.
Cline: github.com/cline/cline (issues #4384, #1195, #2909); cline.bot/blog/improving-diff-edits-by-10.
Devin Desktop: docs.devin.ai/desktop/devin-desktop-faq (rebrand); cognition.ai/blog/swe-1-6-preview; the Windsurf April-2026 changelog; cloudzero.com/blog/windsurf-pricing.
OpenHands: arxiv.org/abs/2511.03690 (SDK paper); index.openhands.dev; openhands.dev/pricing; github.com/OpenHands/OpenHands.

Methodology: the 9-dimension Agenticness rubric (v3.1, /36). Reliability is scored conservatively on two sub-signals — a harness-linked benchmark and real-world incident/workflow history — and only against evidence we can trace to a primary source; "not yet evidenced" is a sourcing stance, never a verdict that a tool is unreliable. Scoring is independent of any commercial relationship; the rubric is the brand. — Agentic.ai
