How we run trials in The Lab
Most directories rate AI tools on what their marketing says. The Lab rates them on what they actually do when given a real task. Each trial is a transcript, not an opinion. The score is derived from the transcript, not from the vendor’s docs.
You should argue with this page, not with any individual trial. The single biggest determinant of whether the score on a trial is fair is whether the protocol is fair — and the protocol lives here.
The Pact
Five commitments we make publicly, before any trial runs, so we cannot quietly retreat from them later when results are inconvenient.
- We publish trials regardless of outcome. If a paying vendor loses a trial, we publish it. If a free vendor wins, we publish it. If everyone fails the task, we publish that too.
- Vendors do not preview trials. No tool covered in a trial sees the result before it is public. Vendors may submit a 200-word response after publication, which we publish alongside the trial verbatim. We do not edit findings to accommodate vendor responses.
- Methodology errors are corrected, not retracted. If we discover a flaw in how a trial was run, we issue a correction, append the original transcript with annotations, and re-run the affected portion. Trials are not silently deleted.
- No advertising tier inside The Lab. Vendors cannot pay to be included, excluded, ranked, or re-tested. The directory’s paid tier (logo placement, vendor portal) does not extend to The Lab.
- Transcripts are public and citable. Every trial ships with the full transcript and the rubric scoring grounded in transcript line numbers. If a reader cannot reproduce our reasoning by reading the transcript, the scoring is incomplete.
The fixed protocol
Every trial runs against the same five constants. If any of these shift between tools within a trial, that trial does not ship.
1. Same task
Every tool in a trial receives an identical task specification — same prompt, same acceptance criteria, same definition of done. Acceptable tasks are:
- scoped (completable by a competent human in 30–90 minutes)
- realistic (the kind of task developers actually delegate)
- verifiable (a passing test, a working endpoint, a deployed page — something a third party can confirm)
- self-contained (does not depend on private credentials, internal tooling, or proprietary data)
The full task specification is published with each trial as a machine-readable file. Anyone can re-run it.
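This page does not prescribe the file format, so as a sketch only, a machine-readable spec that captures the constraints above might look like the following. Every field name and value is hypothetical, not The Lab’s actual schema.

```python
# Hypothetical trial task specification; field names and values are illustrative,
# not The Lab's published schema.
task_spec = {
    "trial_id": "trial-000-example",              # placeholder identifier
    "prompt": "Add pagination to the /items endpoint.",
    "acceptance_criteria": [
        "GET /items?page=2&per_page=10 returns the second page of results",
        "the existing test suite still passes",
    ],
    "definition_of_done": "All acceptance criteria confirmed by the verification step.",
    "time_budget_minutes": 90,                    # upper end of the 30-90 minute scope
    "clarification_faq": {
        "Which database should I use?": "The SQLite fixture already in the repo.",
    },
    "starter_repo": {
        "url": "https://github.com/agentic-ai/lab-fixtures",
        "commit": "<tagged commit published with the trial>",
    },
}
```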
2. Same starter repo
Every tool in a trial starts from the same commit hash of the same public repository. Repository: github.com/agentic-ai/lab-fixtures. The exact tagged commit is published with each trial.
Tools that require a hosted environment (web IDEs, cloud-only agents) clone the repo into their environment at the same commit. Any repo modifications a tool makes during the trial are recorded in the transcript via diff.
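To make “same commit, diffs recorded” concrete, here is a minimal sketch of how a run workspace could be pinned and a tool’s changes captured. Only the repository URL comes from this page; the tag name, paths, and function names are invented for the example.

```python
"""Sketch: pin a workspace to the published fixture commit and record the tool's
changes as a unified diff. Tag, paths, and function names are hypothetical."""
import subprocess

FIXTURES_URL = "https://github.com/agentic-ai/lab-fixtures"  # from the protocol
TRIAL_TAG = "trial-000"                                       # placeholder tag

def prepare_workspace(workdir: str) -> None:
    # Every tool in the trial starts from the identical tagged commit.
    subprocess.run(["git", "clone", FIXTURES_URL, workdir], check=True)
    subprocess.run(["git", "-C", workdir, "checkout", TRIAL_TAG], check=True)

def capture_diff(workdir: str) -> str:
    # Stage new paths with --intent-to-add so the diff also covers files the tool created.
    subprocess.run(["git", "-C", workdir, "add", "--all", "--intent-to-add"], check=True)
    result = subprocess.run(
        ["git", "-C", workdir, "diff"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```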
3. Same model versions
Each trial documents — per tool — the tool’s build/release version, the underlying model name and version, and the wall-clock timestamp of the run. We use each tool’s default model unless a trial is specifically about model choice.
Tools rev quickly; trials are dated artifacts. A trial from February 2026 is a snapshot of February 2026 capability, not a permanent verdict. We re-run trials when a tool ships a major release or a new underlying model — and we publish the diff.
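As a sketch of the record this implies, one per-tool entry in a trial’s inventory might carry fields like these. Names and values are illustrative placeholders, not a published schema.

```python
# Hypothetical per-tool inventory entry; every name and value is a placeholder.
tool_run = {
    "tool": "ExampleAgent",                   # tool under test
    "tool_version": "2.3.1",                  # the tool's build/release version
    "model": "example-model",                 # underlying model (tool default unless
    "model_version": "2026-01-15",            #   the trial is about model choice)
    "run_timestamp": "2026-02-03T14:07:00Z",  # wall-clock time of the run
    "environment": "local CLI",               # or a hosted/web environment
}
```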
4. Same scoring rubric
Every trial scores observed behavior against the Agenticness 8-dimension, 32-point rubric, the same rubric used for the directory at large.
The crucial difference: in The Lab, every dimension score cites a transcript line number or timestamp as evidence. If a score claim cannot be grounded in the transcript, it does not count. A trial’s rubric scoring is, in effect, a guided tour of the transcript.
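One way to enforce that grounding rule is to make the evidence citation a required part of every dimension score. The sketch below is ours, not the published rubric: the dimension names, the assumption that each of the eight dimensions carries 4 of the 32 points, and the validation helper are all illustrative.

```python
"""Sketch: a dimension score that cannot exist without a transcript citation.
Dimension names, the 0-4 range, and the validation rule are assumptions."""
from dataclasses import dataclass

@dataclass
class DimensionScore:
    dimension: str      # e.g. "autonomy" -- hypothetical dimension name
    points: int         # 0-4, assuming the 8 dimensions split the 32 points evenly
    evidence: str       # transcript line (e.g. "L214") or diff hunk reference
    note: str = ""      # short justification tied to the cited evidence

def validate(scores: list[DimensionScore]) -> None:
    # The Lab's rule: a score claim that cannot be grounded does not count.
    assert len(scores) == 8, "one score per rubric dimension"
    for s in scores:
        assert 0 <= s.points <= 4, f"{s.dimension}: points out of range"
        assert s.evidence, f"{s.dimension}: score is not grounded in the transcript"
```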
5. Same observer
A single observer executes a documented script for every tool in a trial. The observer’s only role is to deliver the task specification, answer clarification questions strictly from a pre-written FAQ, and end the trial when the acceptance criteria are met or the time budget is exhausted.
The observer does not coach, hint, course-correct, or interpret on the fly. Off-script behavior — which is itself interesting — is recorded but does not influence scoring.
We acknowledge a known limitation: with a single human observer we cannot rule out unconscious framing effects. Methodology v1 will replace the human observer with an automated agent operator running an identical script — at which point trials become fully reproducible by anyone with the same fixture repository.
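For illustration, the observer’s scripted role reduces to a loop like the sketch below: deliver the specification, answer clarification questions only from the FAQ, stop at acceptance or when the budget runs out. The tool interface, helper names, and the refusal wording are all hypothetical.

```python
"""Sketch of the scripted observer loop. The `tool` interface, helper names, and
the out-of-FAQ refusal are assumptions made for this example."""
import time

def run_observation(task_spec: dict, tool, faq: dict, budget_minutes: int) -> str:
    deadline = time.monotonic() + budget_minutes * 60
    tool.deliver(task_spec["prompt"])                 # the only framing the tool receives

    while time.monotonic() < deadline:
        question = tool.next_clarification_request()  # None while the tool is just working
        if question is not None:
            # Strictly from the pre-written FAQ; anything else gets a scripted refusal.
            tool.answer(faq.get(question, "That is outside the pre-written FAQ."))
        if acceptance_criteria_met(task_spec, tool.workspace()):
            return "acceptance criteria met"
    return "time budget exhausted"

def acceptance_criteria_met(task_spec: dict, workspace) -> bool:
    # Delegated to the test runner / automated check, per the verification log.
    return False
```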
What every trial includes
Every trial published in The Lab carries the same set of artifacts. If any are missing, the trial does not ship.
- Task specification. Full prompt, acceptance criteria, time budget, FAQ for clarification questions.
- Starter repository. Public repo URL + tagged commit hash.
- Tool inventory. Every tool tested: name, version, underlying model, run timestamp, run environment.
- Full transcript. Verbatim record of each run — prompts, tool outputs, agent actions, errors. Public and downloadable.
- Repository diffs. What each tool actually changed in the starter repo, as a unified diff.
- Verification log. Whether the acceptance criteria were met, by whom (test runner / human / automated check), and how.
- Rubric scoring. 8 dimension scores per tool, each citing a transcript line number or diff hunk as evidence.
- Editorial commentary. What the transcripts reveal about the tools that the rubric numbers don't capture.
- Vendor responses. 200-word vendor-submitted responses, if any, published verbatim alongside.
- Corrections log. Any post-publication corrections, dated and explained.
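A publication gate that mechanically enforces “if any are missing, the trial does not ship” could look like the sketch below. The artifact keys mirror the list above; the trial dictionary layout and the `ready_to_ship` helper are hypothetical.

```python
# Sketch: refuse to publish a trial with missing artifacts. The keys mirror the
# list above; the trial dictionary layout and helper name are hypothetical.
REQUIRED_ARTIFACTS = [
    "task_specification",
    "starter_repository",
    "tool_inventory",
    "full_transcript",
    "repository_diffs",
    "verification_log",
    "rubric_scoring",
    "editorial_commentary",
    "vendor_responses",   # may be empty, but the slot must exist
    "corrections_log",    # may be empty at first publication
]

def ready_to_ship(trial: dict) -> bool:
    missing = [key for key in REQUIRED_ARTIFACTS if key not in trial]
    if missing:
        print("Blocked from publication; missing:", ", ".join(missing))
        return False
    return True
```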
Known limitations
Things this protocol cannot do, stated up front so readers can weigh trial findings appropriately.
- Observer effect. A single human observer means framing bias cannot be fully ruled out. Methodology v1 swaps in an automated operator.
- Task generalization. A tool that wins one task may lose another. We do not extrapolate from a single trial to “X is better than Y in general.” Each trial scores one specific task on one specific day.
- Sample size. We currently run each task once per tool. We are not yet measuring run-to-run variance. When a trial produces unexpected results, we run it twice more before publishing — and disclose this.
- Network and account state. Tools that depend on cloud APIs run in real network conditions; latency or rate limits during a trial are recorded but not equalized.
- Selection bias. We choose which tools and tasks to test. We do not claim the trial set is representative of the full directory. The directory’s 32-point rubric remains the broader, lower-resolution coverage.
How to challenge a trial
We assume readers will disagree with trial outcomes. The protocol is structured to make disagreement productive.
- Argue with the protocol on this page. If you think the protocol itself is broken, that is the highest-leverage place to push back. Methodology version history is published at the bottom.
- Re-run the trial. Every trial publishes the starter repo commit, the task spec, and the tool versions. Anyone can reproduce. Discrepancies between your run and ours are evidence we want.
- Find the score-to-evidence mismatch. Every dimension score cites the transcript. If a score doesn’t match the cited evidence, we want to know. We correct, we don’t retract.
- Send corrections to lab@agentic.ai. Confirmed corrections appear in the trial’s corrections log with attribution if requested.
Methodology version history
- v0.1 (2026-05-02): Initial publication. Single human observer. One run per tool/task. Awaiting trial #1 to validate the protocol end-to-end.