Cross-posted from a14y.dev

Summary

Ship a markdown mirror and a real meta description first. Skip the agent-skills directory for now.

We took a14y.dev, toggled each of its 11 agent-readiness features on and off, and measured what each one is worth to an AI agent doing a real retrieval task. For Claude, the full discovery layer cuts token use about 47%, and a markdown mirror alone matches that (−48.6%). But the features do not simply add up: some substitute for each other, one (the agent-skills directory) actively makes things worse in a full site, and a judge re-grade found the answers stayed essentially as accurate. The features make the agent faster and more efficient, not smarter. We ran the same probe through Codex and Cursor too, but only Claude’s runs were cleanly instrumented enough to quantify, which is a finding in itself (more on that below).

Metric | Result
Full layer (Claude) | −47% tokens
Best single feature | −48.6% (md-mirrors)
Answer quality | flat, ~75/100 across all variants

The question

The a14y scorecard rewards a long list of agent-readiness features. A fair question for any site owner with finite time: which of them actually pay off, and do they stack or just overlap? So we ablated them. Starting from a stripped site, we added each feature alone to see what it buys (call it sparse). Then, starting from a fully featured site, we removed each feature one at a time to see what it costs (dense). The gap between those two numbers is where the interesting behavior lives.

One caveat belongs up front, not in a footnote: this is a black box, and it is moving fast. None of these signals is a ratified web standard. llms.txt, for instance, is a good idea, not a spec, and it does not look like most agents go hunting for it on their own. But if an agent does encounter it, because you linked it, referenced it in robots.txt, or told the agent to look, it can help a lot. So read these results as a snapshot of how today’s agents behave with today’s signals, not as laws of nature. They will shift as the agents do.

Methodology

We ran a focused retrieval probe against 24 builds of the same site, each exposing a different subset of a14y’s features, across four coding agents (one of which, Gemini, was dropped).

Field | Value
Probe | 5 fact-retrieval questions about a14y.dev (a scorecard version, specific check IDs, a threshold). The site’s discovery structure decides how much the agent has to crawl to answer.
Features | 11 toggles: llms-txt, robots-txt, sitemap-xml, sitemap-md, agents-md, agent-skills, md-mirrors, canonical-link, meta-description, og-tags, json-ld
Variants | baseline (all off), only-X (each feature alone), all-except-X (each removed from full), and all = 24 builds
Sparse vs dense | Sparse = only-X vs baseline (what a feature buys from zero). Dense = all-except-X vs all (what removing it costs from a full site).
Runs | 24 variants × 3 repeats = 72 runs
Agents | Claude Code, Codex CLI, Cursor, and Gemini. Only Claude is reported (see “Why this is Claude only”).
Metrics | Tokens per run (efficiency) and a judge re-grade of answer quality against canonical answers (effectiveness)
Date | June 4, 2026

The matrix below is Claude. We also ran Codex and Cursor, but only Claude’s runs could be cleanly tied to the build we served, so every number on this page is Claude’s. The “Why this is Claude only” section explains what went wrong with the other two.
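As a rough illustration of how the 24 builds fall out of the 11 toggles, here is a minimal Python sketch. The feature names come from the table above; the function name `build_variants` and the dictionary layout are assumptions made for this example, not the study’s actual build code.

```python
# Feature toggles as listed in the methodology table.
FEATURES = [
    "llms-txt", "robots-txt", "sitemap-xml", "sitemap-md", "agents-md",
    "agent-skills", "md-mirrors", "canonical-link", "meta-description",
    "og-tags", "json-ld",
]


def build_variants(features):
    """Return {variant_name: set of enabled features} for the ablation.

    Hypothetical helper: baseline (all off), all (all on), plus one
    only-X and one all-except-X variant per feature.
    """
    everything = set(features)
    variants = {"baseline": set(), "all": set(everything)}
    for feature in features:
        variants[f"only-{feature}"] = {feature}
        variants[f"all-except-{feature}"] = everything - {feature}
    return variants


variants = build_variants(FEATURES)
print(len(variants))  # 1 + 1 + 11 + 11 = 24
```

Each only-X build is compared with baseline, and each all-except-X build with all, which is where the sparse and dense numbers come from.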

What each feature is worth (Claude)

Token change for each feature, alone and removed-from-full. Negative is fewer tokens. The full layer saves 47%, but markdown mirrors alone save 48.6%, so the discovery features hit diminishing returns fast.

Feature | Alone | Removed | Reading
md-mirrors | −48.6% | +10.3% | Biggest win alone; partly substitutable in a full site
agents-md | −45.5% | −1.9% | Huge alone, redundant once llms.txt + mirrors exist
meta-description | −43.0% | +43.3% | Irreplaceable; nothing else substitutes for it
canonical-link | −40.4% | +13.0% | Moderate marginal value
og-tags | −40.0% | +13.2% | Moderate marginal value
agent-skills | −35.1% | −7.8% | Helps alone, distracts when stacked
llms-txt | −31.6% | +25.4% | Meaningful marginal value
sitemap-md | −28.6% | +12.2% | Moderate marginal value
robots-txt | −25.7% | +23.7% | Meaningful marginal value
json-ld | −25.3% | +6.5% | Small marginal value
sitemap-xml | −16.3% | +3.8% | Smallest effect

“Alone” is the feature added to a stripped site (sparse); “Removed” is the feature taken out of a full site (dense). A negative Alone value means the feature saves tokens. A positive Removed value means the full site needs more tokens without the feature; a negative Removed value means the full site is cheaper without it.
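To make the sign conventions concrete, here is a small Python sketch of the two comparisons. It assumes each delta is a plain percent change against its reference build, which is the natural reading of the definitions above but is not spelled out in the study. The token counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements.

```python
def pct_change(tokens, reference_tokens):
    """Percent change in tokens relative to a reference build."""
    return 100.0 * (tokens - reference_tokens) / reference_tokens


def sparse_delta(only_x_tokens, baseline_tokens):
    # "Alone": only-X compared with the stripped baseline.
    return pct_change(only_x_tokens, baseline_tokens)


def dense_delta(all_except_x_tokens, all_tokens):
    # "Removed": all-except-X compared with the full site.
    return pct_change(all_except_x_tokens, all_tokens)


# Hypothetical token counts, NOT figures from the study.
baseline_tokens = 200_000
all_tokens = 100_000
only_x_tokens = 120_000
all_except_x_tokens = 125_000

print(f"Alone:   {sparse_delta(only_x_tokens, baseline_tokens):+.1f}%")    # -40.0%
print(f"Removed: {dense_delta(all_except_x_tokens, all_tokens):+.1f}%")    # +25.0%
```

Note that the two columns use different reference builds, so an Alone percentage and a Removed percentage are not measured on the same scale.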

Four findings

Markdown mirrors are the biggest single win

Alone, a markdown mirror cuts Claude’s tokens 48.6%, more than any other feature and more than the full layer combined. But remove it from a full site and the cost is only 10.3%, because llms.txt, sitemap.md, and a meta description can partly cover for it. Starting from zero, though, a mirror is the highest-impact thing you can ship.

A meta description is uniquely irreplaceable

It cuts 43.0% alone and costs 43.3% to remove. Nothing else in the matrix is that symmetric. Every other feature substitutes for something; the meta description does a job none of the others can. The likely reason: it short-circuits the “what is this page even about” step an agent does early in its navigation. Surfacing that one line cheaply seems to matter more than any amount of structured discovery.

AGENTS.md is redundant once you have the basics

Alone it is huge (−45.5%), but removing it from a full site costs nothing (−1.9%, within noise). The combination of llms.txt and markdown mirrors already carries the information it would. So AGENTS.md is a fine choice if it is the easiest file for you to maintain, but it is not pulling independent weight when the rest of the layer is present.

The agent-skills directory is actively distracting

It helps alone (−35.1%), but in a full site, removing it makes Claude cheaper to run (−7.8%). The .well-known/agent-skills/ index appears to pull Claude into a wrong-direction exploration when other signals are already present. It is the only clear “stop doing this” in the study: small in absolute tokens, but consistent.
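One rough way to see these four patterns side by side is to compare each feature’s Removed cost with its Alone saving. The Python sketch below uses the percentages from the matrix above; the ratio itself is a heuristic introduced for this example, not a metric from the study, and because the two columns use different reference builds it is only good for a rough ordering.

```python
# (alone %, removed %) per feature, copied from the matrix above.
MATRIX = {
    "md-mirrors": (-48.6, 10.3),
    "agents-md": (-45.5, -1.9),
    "meta-description": (-43.0, 43.3),
    "canonical-link": (-40.4, 13.0),
    "og-tags": (-40.0, 13.2),
    "agent-skills": (-35.1, -7.8),
    "llms-txt": (-31.6, 25.4),
    "sitemap-md": (-28.6, 12.2),
    "robots-txt": (-25.7, 23.7),
    "json-ld": (-25.3, 6.5),
    "sitemap-xml": (-16.3, 3.8),
}


def removal_ratio(alone, removed):
    """Removed cost as a fraction of the Alone saving (illustrative heuristic).

    Near 1: little else covers for the feature. Near 0: other features
    substitute for it. Below 0: the full site is cheaper without it.
    """
    return removed / abs(alone)


ranked = sorted(MATRIX, key=lambda name: removal_ratio(*MATRIX[name]), reverse=True)
for name in ranked:
    alone, removed = MATRIX[name]
    ratio = removal_ratio(alone, removed)
    print(f"{name:17s} alone {alone:+6.1f}%  removed {removed:+6.1f}%  ratio {ratio:+.2f}")

# Features whose removal lowered token use in the full site. The text treats
# the agents-md value (-1.9%) as within noise, so only agent-skills is a
# clear signal.
negative_removed = [name for name, (_, removed) in MATRIX.items() if removed < 0]
```

On these numbers the meta description sorts to the top and the agent-skills directory to the bottom, matching the second and fourth findings.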

Faster and more efficient, not smarter

Efficiency is only half the question. We re-graded every answer with a judge against the canonical answers. Claude’s answer quality is essentially flat at about 75 of 100 across all 24 variants, while token use swings 2×. The discovery layer makes Claude reach the answer with less work; it does not change whether the answer is right. That is the clean version of the win: the same answers, reached with far less effort.

The re-grade also corrected the record. Our first pass used a strict substring rubric, and one question’s canonical answer (“5%”) matched inside a wrong answer (“15%”), so every agent “passed” a question they all actually got wrong. The judge caught it. Pass-rate numbers from the first draft were inflated by that bug; the quality figures here come from the re-grade.
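The substring failure is easy to reproduce. The Python sketch below shows the bug and one possible tightening that rejects a match glued to a neighbouring digit. It is an illustration only: the study fixed the problem with a judge re-grade, not with this regex, and the example sentences and function names are made up.

```python
import re


def substring_pass(canonical, answer):
    """Strict substring rubric: passes if the canonical text appears anywhere."""
    return canonical in answer


def bounded_pass(canonical, answer):
    """Pass only if the canonical text is not attached to an adjacent digit
    or decimal point (hypothetical stricter rubric)."""
    pattern = r"(?<![\d.])" + re.escape(canonical) + r"(?!\d)"
    return re.search(pattern, answer) is not None


# Made-up answers for illustration.
wrong_answer = "The threshold is 15%."
right_answer = "The threshold is 5%."

print(substring_pass("5%", wrong_answer))  # True: the bug, "5%" sits inside "15%"
print(bounded_pass("5%", wrong_answer))    # False
print(bounded_pass("5%", right_answer))    # True
```

A boundary check like this still only catches one class of false positive, which is one reason a judge-based re-grade is the more robust choice.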

Why this is Claude only

We started this audit with four agents: Claude, Codex, Cursor, and Gemini. Today’s agent CLIs are still hard to benchmark in a controlled way, though, and only Claude could be measured cleanly.

Codex calls a hosted web_search tool that cannot be turned off, so it reaches the open web (the live a14y.dev included), not just the controlled build. Its token counts measure that off-target exploration, not the variant in front of it. Cursor, in headless mode, does not surface its fetches at all, so we have no record of what it actually read; its answers looked reasonable, but we cannot attribute its cost to any one feature. Gemini’s CLI could not fetch our target and was dropped.

That left Claude as the only agent whose runs we can tie to the exact build it was given, which is why every number on this page is Claude’s. The finding underneath stands on its own: agents differ a great deal in how they read a site, and the tooling to measure them rigorously is still catching up. We will fold other agents back in as their CLIs expose what they fetch. The caveats below list the specific instrumentation gaps.

What to ship

For site owners

Assuming Claude-class agents are your main readers: ship a markdown mirror first; starting from zero, it is the largest single win. Always include a meta description; it is cheap and uniquely irreplaceable. Add llms.txt and sitemap.md as a pair (they back each other up). Treat AGENTS.md as optional once you have mirrors and llms.txt. And skip .well-known/agent-skills/ unless you have a specific reason to publish skills.
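Before shipping, it can help to confirm that a page actually carries its head signals. The following is a minimal Python sketch that pulls the meta description and canonical link out of an HTML string using only the standard library. The class and function names and the sample page are hypothetical, and this is not how the a14y scorecard itself checks these features.

```python
from html.parser import HTMLParser


class HeadSignals(HTMLParser):
    """Collect the meta description and canonical URL from an HTML document."""

    def __init__(self):
        super().__init__()
        self.description = None
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        rel = (attributes.get("rel") or "").lower().split()
        if tag == "meta" and name == "description":
            self.description = attributes.get("content")
        elif tag == "link" and "canonical" in rel:
            self.canonical = attributes.get("href")


def read_head_signals(html):
    """Return (description, canonical); either is None when missing."""
    parser = HeadSignals()
    parser.feed(html)
    return parser.description, parser.canonical


# Made-up sample page for illustration.
sample = """<html><head>
<title>Example</title>
<meta name="description" content="A one-line summary of what this page covers.">
<link rel="canonical" href="https://example.com/page">
</head><body><p>Body text.</p></body></html>"""

description, canonical = read_head_signals(sample)
print(description)
print(canonical)
```

A `None` for the description is the case the findings above suggest fixing first.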

For agent and model builders

A discovery file can be read as a jump list or as a reading list, and those have very different cost profiles for the same content (we saw both behaviors across the agents we ran). The outsized pull of the meta description suggests agents spend real effort early on “what is this page about”; surfacing that cheaply may matter more than richer discovery. And the agent-skills directory needs clearer semantics: more than one agent failed to benefit from it, and Claude was slowed by it.

For everyone, while this is still a black box

These are not standards, and the agents are changing under us. Linking your agent files so they get found matters as much as having them. The safe bets are the ones that held across agents and map to plain readability: a clean mirror and a real description. We will re-run this as the agents evolve.

Caveats

Where to read these numbers with care.

  • One task. A 5-question retrieval probe. Long-form summarization, code generation, or multi-hop reasoning could rank the features very differently. The defensible generalization is “for focused retrieval.”
  • One site. Every variant serves the same a14y.dev content; only the discovery scaffolding changes. Whether the deltas hold on other sites is untested.
  • Only Claude is reported. We ran Codex and Cursor too, but Codex’s hosted web_search cannot be disabled and reaches the open web, and Cursor’s headless mode does not surface its fetches, so neither could be cleanly tied to the served build. Gemini was dropped (its CLI could not fetch). See “Why this is Claude only.”
  • Rubric bug. The original strict-substring rubric scored a universally-wrong question as passing (“5%” inside “15%”). Quality figures here use the judge re-grade; the first-draft pass rates were inflated.
  • Mirror quality is a lower bound. The mirrors used v0.2.0-style output. v0.3.0-draft tightens mirror checks, so a fully compliant mirror could save more than 48.6%.

Reproduce this study

Score any single site with the same tool the probe drove:

npx a14y https://example.com

The ablation builds the same site 24 ways by toggling each feature, then runs an identical retrieval probe against each build and aggregates tokens and judge-graded quality. Source data: the 2026-06-04 per-feature ablation.
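The aggregation step can be sketched in a few lines of Python. This assumes a simple mean over the repeats of each variant; the record layout, field names, and every number below are hypothetical and stand in for the real per-run data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records; NOT data from the study.
runs = [
    {"variant": "baseline", "repeat": 1, "tokens": 210_000, "quality": 74},
    {"variant": "baseline", "repeat": 2, "tokens": 190_000, "quality": 76},
    {"variant": "baseline", "repeat": 3, "tokens": 200_000, "quality": 75},
    {"variant": "all", "repeat": 1, "tokens": 100_000, "quality": 75},
    {"variant": "all", "repeat": 2, "tokens": 110_000, "quality": 76},
    {"variant": "all", "repeat": 3, "tokens": 90_000, "quality": 74},
]


def aggregate(run_records):
    """Group runs by variant and average tokens and judge-graded quality."""
    grouped = defaultdict(list)
    for run in run_records:
        grouped[run["variant"]].append(run)
    return {
        variant: {
            "mean_tokens": mean([r["tokens"] for r in records]),
            "mean_quality": mean([r["quality"] for r in records]),
            "repeats": len(records),
        }
        for variant, records in grouped.items()
    }


summary = aggregate(runs)
for variant, stats in summary.items():
    print(variant, stats)
```

With three repeats per variant the means are noisy, so small differences between variants are best read as within noise.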