What Is llms.txt? The Annotated Spec, Working Examples, and Setup Guide

llms.txt is a markdown file at your domain root that gives AI crawlers a curated summary of your site. Here is the spec, our production file annotated, and a 20-minute setup guide.

By Einner Ariña

TL;DR

llms.txt is a markdown file at the root of a domain that gives AI crawlers a curated, structured summary of the site — what it is, who runs it, what the canonical pages are, where the documentation lives. The proposal was created by AI researcher Jeremy Howard in September 2024 and lives at llmstxt.org. It is not a mandated standard like robots.txt; adoption is voluntary and uneven. Sites that ship one typically see faster cold-start citation in ChatGPT, Perplexity, Claude, and Google AI Overviews because the AI builds an accurate mental model of the brand from a single fetch instead of stitching one together from multiple page crawls. This post is the annotated walkthrough of the production llms.txt running at w2bagency.com/llms.txt — every section explained, every choice justified, every failure mode named, plus a copy-pasteable template you can adapt in twenty minutes.

What llms.txt actually is

llms.txt is a markdown file placed at the root of a website's domain. AI assistants like ChatGPT, Perplexity, Claude, and Google AI Overviews fetch the file when a query mentions the brand or domain, and use it to build a fast, accurate mental model of what the site is about, who runs it, and which pages are canonical. The file is editorial, not exhaustive — it points the crawler at the highest-quality content rather than dumping every URL.

The proposal was created by Australian AI researcher Jeremy Howard (co-founder of fast.ai and Answer.AI) and published at llmstxt.org on September 3, 2024. The spec is short — under 2,000 words — and explicitly designed to be implementable in an afternoon by a single person with no special tooling.

What llms.txt is and is not, in one paragraph. llms.txt is a markdown file at a domain's root (/llms.txt) that gives AI crawlers a curated summary of the site for grounding their answers. It was proposed by Jeremy Howard in September 2024 and the canonical spec lives at llmstxt.org. It is voluntary — no AI assistant is required to fetch it, and adoption is uneven across platforms (Anthropic and Perplexity respect it; OpenAI has not formalized support; Google has not committed). It is not robots.txt, which controls crawl access. It is not sitemap.xml, which lists every URL for indexing. It is not mandated by any RFC or specification body. It is, today, the most reliable cold-start signal a site can give AI crawlers — cheap, fast to ship, and load-bearing for cold-start citation.

As of 2026, adoption is led by Anthropic (which publishes its own llms.txt files, and Claude fetches them), Perplexity, and a growing list of SaaS companies including Mintlify, GitBook, Wix, and Hostinger. OpenAI has not formalized support but ChatGPT has been observed fetching the file for sites that publish one. Google has not committed publicly.

How llms.txt is different from robots.txt and sitemap.xml

Three files at the domain root, three different jobs. The confusion comes from people treating them as substitutes when they are complements.

| File | Purpose | Format | Audience |
| --- | --- | --- | --- |
| robots.txt | Crawl access control — what may and may not be fetched | Plain text, user-agent directives | Search and AI crawlers |
| sitemap.xml | URL inventory for indexing | XML, one entry per URL | Search engine indexers |
| llms.txt | Editorial site summary for grounding | Markdown, curated link list | AI assistants at query time |

robots.txt is binary access control — Allow: / or Disallow: /admin/. sitemap.xml is an exhaustive URL listing — every page, every last-modified date, every priority hint. llms.txt is curated: 10 to 20 of your best pages, each with a one-sentence description, organized into editorial sections. It is not exhaustive on purpose — its job is to point AI crawlers at your highest-quality content, not all of it.

A site that ships only robots.txt gets crawl control. A site that adds sitemap.xml gets full indexation. A site that adds llms.txt gets a clean cold-start narrative for AI assistants. All three are cheap; all three reinforce different surfaces.

The llms.txt format spec, summarized

The Jeremy Howard spec at llmstxt.org is short and worth reading in full. The summary:

  • Required header. An H1 with the project or brand name, followed by a blockquote summary in one to three sentences. The blockquote is the AI's primary signal for "what is this site about?".
  • Optional H2 sections. Most common are ## Docs (canonical reference pages), ## Examples (case studies, demos, worked examples), and ## Optional (lower-priority links — explicitly marked as skippable for low-context queries).
  • Link list format inside each section. Plain markdown links followed by an em-dash and a one-sentence description. The description is the per-link signal the model uses to decide whether to follow the link.
  • Length budget. Under 5,000 words for llms.txt. The optional companion file llms-full.txt lifts that budget but introduces its own trade-offs (covered below).

The format spec, summarized. A valid llms.txt has two required pieces: an H1 with the brand or project name, and a blockquote one-to-three-sentence summary describing what the site is. Everything below is optional. The most common optional sections are ## Docs for canonical reference pages, ## Examples for case studies, and ## Optional for lower-priority links. Inside each section, links use the markdown format [Page Title](https://full-url) — one-sentence description. Total file length should stay under 5,000 words; the goal is curation, not coverage. Treat the file as the AI's first impression of your site — write the summary you would want quoted verbatim in a ChatGPT answer.
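
For concreteness, here is a minimal skeleton that satisfies the spec as summarized above. Every name and URL is a placeholder:

```markdown
# Example Brand

> One to three sentences saying what the site is, who runs it, and who it serves. Written to be quotable verbatim.

## Docs

- [Getting Started](https://example.com/docs/getting-started) — What the product does and how to install it.
- [API Reference](https://example.com/docs/api) — Canonical reference for every endpoint and parameter.

## Examples

- [Case Study: Acme Rollout](https://example.com/case-studies/acme) — A worked deployment for a mid-size team.

## Optional

- [Changelog](https://example.com/changelog) — Release history; safe to skip for low-context queries.
```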

The spec is intentionally permissive — anything that parses as markdown is readable. But sites that follow the H1 + blockquote header convention see noticeably better citation grounding, because that pattern matches what the spec sample files (and Anthropic's own llms.txt) use.

Our annotated llms.txt — line by line

What follows is the live /llms.txt running at w2bagency.com, reproduced verbatim from production as of the publish date of this post. Each section is annotated with the choice we made and why.

# W2B Agency

> Remote-first, bilingual digital agency specializing in SEO, GEO, AEO, high-performance web development, and workflow automation. We help businesses anywhere in the world rank in Google, get cited by AI assistants like ChatGPT and Perplexity, ship modern websites, and automate the manual work eating their teams' weeks.

Why this header. The H1 is the exact brand name as it appears in our Organization schema (W2B Agency, not W2B, not W2B agency — lexical alignment matters for entity disambiguation). The blockquote is two sentences naming all three service practices and the geographic scope. It is shaped to be quotable verbatim in an AI-generated answer, which is the primary citation surface.

## About

W2B Agency is a small agency built around three disciplines: search dominance (SEO + GEO + AEO), high-performance web development (Astro, React 19, WordPress, headless WordPress), and workflow automation (n8n, Zapier, Make, trigger.dev, AI agents). The agency is bilingual (English and Spanish) and remote-first by default. The team consists of three co-founders working in public.

- Website: https://w2bagency.com
- Spanish version: https://w2bagency.com/es/
- Languages served: English, Spanish
- Location model: Remote-first, no required office presence
- Coverage: Worldwide, async-first, timezone-flexible

Why this section. "About" is the second-most-quoted section in our citation tracking. It names the disciplines twice (acronyms first, then expanded with tools) so that whether the assistant is asked "what does W2B do?" or "do they use n8n?", the grounding for the answer sits on the same line. We list the bilingual scope and remote model explicitly because both come up in qualifying questions during prospect research.

## Co-founders

- Einner Ariña — Strategy, search, and AI visibility lead — https://www.linkedin.com/in/einnerarina/
- Kevin Urrea — Frontend and high-performance web lead — https://www.linkedin.com/in/kevin-urrea-desarrolladorwebfrontend/
- Esteban Padilla — Web development and automation lead — https://www.linkedin.com/in/esteban-padilla-webdev/

Agency LinkedIn: https://www.linkedin.com/company/w2b-consultoria-y-tecnologia/

Why this section. Named co-founders with LinkedIn URLs serve dual purposes: E-E-A-T (the AI can verify named humans against external profiles) and entity disambiguation (the same names appear in the BlogPosting author.sameAs arrays). Linking the agency LinkedIn at the bottom closes the entity triangulation loop.

## Services

### Search Dominance — SEO, GEO and AEO

[Full ~150-word service description with target audience, methodology, tools, and engagement length]

- Page: https://w2bagency.com/services/seo-geo-aeo
- Spanish page: https://w2bagency.com/es/services/seo-geo-aeo
- Tools we use: Google Search Console, Google Analytics 4, Ahrefs, DataForSEO, Schema.org, llms.txt, custom LLM citation tracker
- Typical engagement: 3–6 months for foundational ranking improvements; ongoing retainers maintain and expand visibility

Why this section (repeated for each of three services). Each service gets a 150-word prose description (longer than the spec's typical 1-line links) because services are the highest-stakes citation surface for an agency — a prospect asking "best SEO agency for AI search" should get a substantive paragraph, not a one-liner. Tools, engagement length, and bilingual page links all surface in qualifying-stage AI queries.

## Blog

The agency publishes field notes, decisions, and explainers on SEO, GEO, AEO, web development, and workflow automation at https://w2bagency.com/blog (English) and https://w2bagency.com/es/blog (Spanish). All content is authored by the named co-founders and includes citable passages, FAQ schema, and BlogPosting structured data. Foundational topics include "SEO vs GEO vs AEO", "n8n vs trigger.dev", and "llms.txt implementation guide". Content is managed in Keystatic CMS (mdoc format) with a 16-field schema covering focus keyword, intent, target markets, FAQs, TL;DR, status lifecycle, and cross-locale mirror linking.

Why this section. Names three foundational post titles by exact phrase so that an AI assistant asked "do you have a guide on llms.txt?" can ground the answer directly. Mentioning Keystatic and the 16-field schema is intentional self-demonstration — the post about getting cited by AI describes its own publishing pipeline.

## How to engage

- Start a project: https://w2bagency.com/#contact-section
- Book a discovery call: https://calendly.com/contact-w2bagency/strategy-call
- Standalone contact page: https://w2bagency.com/contact
- About the team: https://w2bagency.com/about
- Blog: https://w2bagency.com/blog

Why this section. Five contact paths in priority order — the assistant picks the one matching the user's intent (action vs research vs context). Booking link is direct Calendly, not a query-stringed variant, because clean URLs survive aggressive truncation in AI-generated answers.

## License

Content on this site may be indexed and cited by AI assistants and search engines with attribution to "W2B Agency" and a link back to https://w2bagency.com or to the specific source URL. Bulk reproduction, training of generative AI models on the full corpus, or republication without written permission is not authorized.

Why this section. Explicit citation license. AI assistants increasingly respect declared licensing terms; stating that attribution is granted reduces friction for citation while preserving the right to refuse bulk training use. This is also a forward-compatible signal for the emerging Really Simple Licensing (RSL) framework.

## Last updated

2026-05-01

Why this section. Freshness signal. AI crawlers prefer files with explicit recency stamps over inferring the date from HTTP headers. The 5B Optimization Engine refreshes this date whenever any other section changes substantively.
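
One way to automate that kind of refresh, sketched in Python. The public/llms.txt path and the date-sits-alone-on-its-own-line format are assumptions matching the file above; this illustrates the pattern, and is not the 5B Optimization Engine itself:

```python
import datetime
import hashlib
import pathlib
import re

path = pathlib.Path("public/llms.txt")  # assumed location (Astro convention)
text = path.read_text(encoding="utf-8")

# Hash everything except the date line itself, so a date-only change
# does not trigger another bump on the next run.
DATE_LINE = r"(?m)^\d{4}-\d{2}-\d{2}$"
digest = hashlib.sha256(re.sub(DATE_LINE, "", text).encode()).hexdigest()

hash_file = pathlib.Path(".llms_txt_hash")  # hypothetical state file
if not hash_file.exists() or hash_file.read_text() != digest:
    # Substantive content changed: stamp today's date, record the new hash.
    today = datetime.date.today().isoformat()
    path.write_text(re.sub(DATE_LINE, today, text, count=1), encoding="utf-8")
    hash_file.write_text(digest)
```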

The full file is under 1,000 words — well below the 5,000 budget. The compactness is deliberate: every line is signal.

llms-full.txt — the variant nobody talks about

The spec defines an optional companion file, llms-full.txt, that contains the full content of the linked pages dumped inline rather than as links. The pitch: an AI crawler that fetches one file gets the entire site context without round trips.

The reality is more nuanced. llms-full.txt works well for small, single-purpose sites — API documentation under 50,000 total words, personal portfolios, single-product landing sites. For agency sites, content-heavy SaaS docs, or any site over a few hundred pages, llms-full.txt rapidly bloats past usable token limits. A 200,000-word llms-full.txt is worse than a 1,000-word llms.txt because the assistant cannot fit it in context and falls back to truncation, losing the curation benefit entirely.
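
The arithmetic, roughly, assuming the common estimate of about 1.3 tokens per English word: a 200,000-word llms-full.txt is on the order of 260,000 tokens, which overruns the 128k–200k context windows typical of current assistants before the user's question or any other retrieved source gets a single token.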

Practical rule: ship llms.txt, skip llms-full.txt unless your full corpus is under 50k words. We do not publish llms-full.txt for w2bagency.com — the linked pages are themselves indexable, and forcing the assistant to fetch them on demand keeps the per-query context budget cleaner.

Common llms.txt mistakes

Five failure modes that show up in audited llms.txt files in 2026.

The five common mistakes that break llms.txt adoption. First, listing 200 links instead of 20 — the file becomes a sitemap and loses its editorial signal. Second, skipping the H1 + blockquote summary header — most assistants are tuned to that pattern, and missing it cuts citation eligibility. Third, linking to outdated or 404 pages — the assistant follows links during retrieval and a broken link reduces source trust. Fourth, treating llms.txt as a URL dump (/page-1, /page-2...) rather than an editorial summary with descriptions — descriptions are how the assistant decides which link to fetch. Fifth, never refreshing the file — stale dates and unchanged content over months are read as a signal the brand is dormant. Each mistake is reversible in under thirty minutes, but each one quietly costs citation rate until fixed.
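
The first four failure modes are mechanically checkable before you ship. A minimal lint sketch in Python (standard library only; the 20-link threshold and the HEAD-request check are our reading of the guidance above, not spec requirements):

```python
import re
import sys
import urllib.request

def lint_llms_txt(path: str) -> list[str]:
    """Flag the checkable failure modes described above in a local llms.txt."""
    text = open(path, encoding="utf-8").read()
    lines = text.splitlines()
    problems = []

    # Mistake 2: missing H1 + blockquote header.
    if not lines or not lines[0].startswith("# "):
        problems.append("first line is not an H1 ('# Brand Name')")
    if not any(line.startswith("> ") for line in lines[:5]):
        problems.append("no blockquote summary near the top")

    # Mistakes 1 and 4: too many links, or links with no description.
    links = re.findall(r"\[([^\]]+)\]\((https?://[^)]+)\)(.*)", text)
    if len(links) > 20:
        problems.append(f"{len(links)} links — curate down toward 10-20")
    for title, _url, trailing in links:
        if not trailing.strip():
            problems.append(f"link '{title}' has no one-sentence description")

    # Mistake 3: broken links. HEAD each URL and flag anything non-200.
    for _title, url, _trailing in links:
        try:
            req = urllib.request.Request(url, method="HEAD")
            with urllib.request.urlopen(req, timeout=10) as resp:
                if resp.status != 200:
                    problems.append(f"{url} returned {resp.status}")
        except Exception as exc:
            problems.append(f"{url} failed: {exc}")

    return problems

if __name__ == "__main__":
    for issue in lint_llms_txt(sys.argv[1] if len(sys.argv) > 1 else "llms.txt"):
        print("WARN:", issue)
```

Mistake five, the stale file, is the one a linter cannot catch; that one takes a calendar reminder.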

Does llms.txt actually move citation rate?

Honest answer: yes for cold-start sites, modest for established ones.

For a domain with no prior AI-search history, shipping llms.txt typically halves the time to first citation — the file gives the assistant an accurate mental model in seconds rather than days of stitching one together from page crawls. This effect is strongest for ChatGPT (browse mode), Perplexity, and Claude.

For an established site with strong existing entity signals (Wikipedia mention, populated sameAs, multiple credible backlinks), the lift is smaller. The assistant already has enough scaffolding to answer accurately; llms.txt becomes a polish layer rather than a load-bearing one.

The pragmatic read for most sites: ship it. The cost is twenty minutes; the downside is zero. Re-measure citation rate at thirty and ninety days post-deployment using a fixed panel of 20 brand and category prompts, and you will have the data for your site specifically.

How to ship llms.txt in 20 minutes

A working file from a blank repo in five steps.

Step 1 — Copy our template. Use the structure annotated above. Replace W2B Agency's brand summary, services, founders, and contact paths with yours. Keep the section order; AI assistants are tuned to it.

Step 2 — Curate the link list. Pick 10 to 20 canonical pages — homepage, top services, About, Blog index, contact, two or three foundational posts. Anything below 10 is too thin to be useful; anything above 20 dilutes the editorial signal.

Step 3 — Write the descriptions. One sentence per link, naming what the page is about and who it serves. Avoid marketing copy ("our amazing service"); name the concrete thing ("3-month engagement to audit and fix technical SEO").

Step 4 — Deploy at /llms.txt. Drop the file at the root of your domain. On Astro: place it in public/llms.txt. On Next.js: also in public/. On WordPress: upload it to the web root via SFTP or serve it through a rewrite rule. On any platform: it must serve as plain text with Content-Type: text/plain or text/markdown.

Step 5 — Verify. Run curl -I https://yourdomain.com/llms.txt from a terminal. Confirm a 200 status and the right content-type. Then fetch it in a browser and visually scan for typos. Then, optionally, ask Claude or ChatGPT "what is yourdomain.com about?" and check whether the answer reflects the file's framing.
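
If you prefer the check scripted (for CI, say), a small Python equivalent of the curl test; yourdomain.com is a placeholder:

```python
import urllib.request

# Placeholder domain — substitute your own.
req = urllib.request.Request("https://yourdomain.com/llms.txt", method="HEAD")
with urllib.request.urlopen(req, timeout=10) as resp:
    assert resp.status == 200, f"expected 200, got {resp.status}"
    ctype = resp.headers.get("Content-Type", "")
    assert ctype.startswith(("text/plain", "text/markdown")), f"unexpected content-type: {ctype}"
    print("OK:", resp.status, ctype)
```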

That is the minimum viable deployment. Iteration happens monthly — refresh the dates, add new canonical pages, prune broken links, retune the brand summary based on which prompts cite you.

When to call in help

llms.txt itself is a 20-minute job. The reason agencies get hired for it is not the file — it is everything that has to be true for the file to do its work: schema markup, off-site entity signals, citable passages, the prompt panel, monthly iteration. When all of that needs to ship together and stay shipped, an outside team that does this for a living becomes net-positive.

W2B's Search Dominance practice is the integrated SEO + GEO + AEO service. We audit, ship the foundation, write the capsules, align the entity, and run the prompt panel — bilingually in English and Spanish, for sites worldwide.

The page you are reading was built by these rules. The /llms.txt embedded above is the live file as of publish. The BlogPosting schema is on every blog post. The answer capsules are tagged with blockquotes throughout. We eat our own cooking; this article is one of the recipes.

For more in the cluster: SEO vs GEO vs AEO is the comparison hub. What Is Generative Engine Optimization? is the parent definition. How to Get Cited by ChatGPT is the 30-day execution sprint that ships llms.txt as Week 1 Day 1.

Frequently asked questions

  • What does an llms.txt file do?

    llms.txt gives AI crawlers a curated, structured summary of your site — what the brand is, who runs it, where the canonical pages live, what the services and contact paths are. AI assistants like ChatGPT, Perplexity, and Google AI Overviews fetch the file when a query mentions your brand or domain and use it to ground their answers in your own framing instead of stitching one together from random page crawls. The practical effect is faster, more accurate citation in AI-generated answers, especially for sites with no AI-search history.

  • Are llms.txt files worth it?

    Yes for cold-start sites, modest for established ones. A site with no AI-search history typically halves the time to first citation by shipping llms.txt — the file gives AI crawlers an accurate mental model in seconds rather than days of stitching one together. For established sites with strong existing entity signals, the lift is smaller but still positive and the cost is roughly twenty minutes of work. The honest caveat: adoption is voluntary, OpenAI has not formally committed to using it, and Google's stance is unconfirmed. The practitioner read is to ship it anyway because it is cheap and the downside is zero.

  • What is the difference between robots.txt and llms.txt?

Three files do three different jobs at the root of a domain. robots.txt tells crawlers what they may or may not access (allow or deny by user-agent). sitemap.xml gives search engines a complete URL inventory for indexing. llms.txt gives AI crawlers an editorial summary of the site — not what to crawl, not every URL, but what the site is about and where the canonical pages live. robots.txt is binary access control; sitemap.xml is an exhaustive URL list; llms.txt is curated context. They are complementary, not redundant.

  • Do I need both llms.txt and llms-full.txt?

    No. llms.txt is the primary file — a curated summary with markdown links to canonical pages. llms-full.txt is an optional variant that contains the full content of those pages dumped inline rather than linked. Use llms-full.txt only if your site is small (under 50,000 words total) and single-purpose, like API documentation or a personal portfolio. For medium-and-up sites, llms-full.txt bloats past usable token limits and offers no advantage over the linked version. Most sites should ship llms.txt and skip the full variant.

  • What goes in an llms.txt file?

    Required structure: an H1 with the project name and a blockquote summary in one to three sentences. Optional H2 sections follow — most commonly "Docs" (canonical reference pages), "Examples" (case studies or worked examples), and "Optional" (links lower in the priority hierarchy). Each section contains a markdown link list with one sentence describing each link. Keep total length under 5,000 words. Curate ruthlessly — 20 well-chosen links beat 200 random ones, because the file's job is to point AI crawlers at your best content, not your complete content.

  • Does Google use llms.txt?

    Google has not formally committed to using llms.txt. As of 2026, adoption is led by Anthropic (Claude fetches it), Perplexity (uses it for grounding), and a growing list of SaaS companies that publish their own (Mintlify, GitBook, Wix, Hostinger, Anthropic itself). OpenAI has not formalized support but ChatGPT has been observed fetching the file for sites that publish one. Bing Copilot and Google AI Overviews remain unconfirmed. The pragmatic answer is to ship it because the supporters include the assistants that matter most for citation today, and the downside of including it is zero.