llms.txt Parser

A zero-dependency parser for the llms.txt specification: it parses llms.txt files into structured data and discovers per-page markdown URLs.

Features

Spec-compliant parsing: H1 title, blockquote description, detail paragraphs, H2 sections with link entries
Optional section handling: ## Optional entries separated automatically per spec semantics
Site-level discovery: discover() probes any URL's root for /llms.txt and /llms-full.txt
Unified URL discovery: find_candidates() searches parsed llms.txt entries with heuristic fallback
Lenient parsing: only H1 title is required; everything else gracefully degrades
Immutable results: frozen dataclasses for parse output
Zero dependencies: stdlib only (re, urllib.parse, urllib.request, dataclasses)

Quick Start

from llmstxt import parse, find_candidates, discover

# Discover llms.txt from any URL
result = discover("https://example.com/docs/guide")
content = result.llms_full_txt or result.llms_txt
if content:
    doc = parse(content)
    print(doc.title)

# Parse an llms.txt file
doc = parse("""# My Project

> A brief description of the project.

Some extra detail here.

## Docs

- [Guide](https://example.com/guide.md): The main guide
- [API](https://example.com/api.md): API reference

## Optional

- [Advanced](https://example.com/advanced.md): Advanced topics
""")

print(doc.title)        # 'My Project'
print(doc.description)  # 'A brief description of the project.'
print(doc.details)      # 'Some extra detail here.'
print(doc.sections)     # {'Docs': [FileEntry(name='Guide', ...), ...]}
print(doc.optional)     # [FileEntry(name='Advanced', ...)]

API

`parse(text)`

Parse llms.txt content into a structured LlmsTxt object.

Parameter	Type	Description
`text`	`str`	Raw text content of an llms.txt file

Returns: LlmsTxt — parsed result.

Raises: LlmsTxtError — if the required H1 title is missing or input is empty.
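
To make the parsing rules concrete, here is a minimal sketch of a lenient parser written against the behaviour described above. It is not the library's implementation: `sketch_parse`, `SketchEntry`, `SketchError`, and the link regular expression are hypothetical names and choices made for illustration, and the result is a plain dict rather than an LlmsTxt object.

```python
import re
from dataclasses import dataclass


class SketchError(ValueError):
    """Hypothetical stand-in for LlmsTxtError."""


@dataclass(frozen=True)
class SketchEntry:
    """Hypothetical stand-in for FileEntry."""
    name: str
    url: str
    notes: str = ""


# Assumed link shape: "- [name](url)" with an optional ": notes" tail.
LINK_RE = re.compile(
    r"^-\s*\[(?P<name>[^\]]+)\]\((?P<url>[^)\s]+)\)(?:\s*:\s*(?P<notes>.*))?$"
)


def sketch_parse(text):
    """Parse llms.txt-style text into a plain dict (illustrative only)."""
    title = None
    description = ""
    details = []
    sections = {}
    optional = []
    current = None  # name of the H2 section being read, if any

    for raw in text.replace("\r\n", "\n").split("\n"):
        line = raw.strip()
        if not line:
            continue
        if title is None:
            # The first non-blank line must be the H1 title.
            if not line.startswith("# "):
                raise SketchError("missing H1 title")
            title = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            if current != "Optional":  # case-sensitive on purpose
                sections.setdefault(current, [])
        elif current is None:
            # Before the first H2: one blockquote line, then detail text.
            if line.startswith(">") and not description and not details:
                description = line[1:].strip()
            else:
                details.append(line)
        else:
            match = LINK_RE.match(line)
            if match:  # non-link lines inside a section are skipped
                entry = SketchEntry(match["name"], match["url"], match["notes"] or "")
                target = optional if current == "Optional" else sections[current]
                target.append(entry)

    if title is None:
        raise SketchError("empty input")
    return {
        "title": title,
        "description": description,
        "details": "\n".join(details),
        "sections": sections,
        "optional": optional,
    }


sample = (
    "# My Project\r\n\r\n> A brief description.\r\n\r\nExtra detail.\r\n\r\n"
    "## Docs\r\n\r\n- [Guide](https://example.com/guide.md): The main guide\r\n\r\n"
    "## Optional\r\n\r\n- [Advanced](https://example.com/advanced.md)\r\n"
)
doc = sketch_parse(sample)
print(doc["title"])             # My Project
print(list(doc["sections"]))    # ['Docs']
print(doc["optional"][0].name)  # Advanced
```

The sketch skips anything it does not recognise instead of raising, which is one way to read "gracefully degrades"; the real parser may treat such lines differently.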

`find_candidates(url, doc=None)`

Find candidate markdown resources for a given URL.

When doc is provided, searches all sections and optional entries for URL matches (exact > extension variation > path prefix). If no match is found (or doc is None), falls back to heuristic URL generation.

Parameter	Type	Description
`url`	`str`	The page URL to look up
`doc`	`LlmsTxt | None`	Optional parsed llms.txt to search in

Returns: list[FileEntry] — candidates ordered by match quality.

`discover(url, *, timeout=10)`

Probe a site for /llms.txt and /llms-full.txt.

Given any URL, extracts the root ({scheme}://{netloc}) and attempts to fetch both files. If the input URL already points to one of these files, it is still fetched along with its sibling.

Parameter	Type	Description
`url`	`str`	Any URL belonging to the target site
`timeout`	`int`	HTTP request timeout in seconds (per request), default `10`

Returns: DiscoveryResult — raw content of whichever files were found.
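
The root extraction described above can be sketched with `urllib.parse` alone. `probe_urls` is a hypothetical helper, not part of the package, and it only builds the two URLs; it performs no HTTP requests.

```python
from urllib.parse import urlsplit


def probe_urls(url):
    """Return (root, llms_txt_url, llms_full_txt_url) for any URL on a site."""
    parts = urlsplit(url)
    # Path, query string and fragment are dropped; only scheme and host remain.
    root = f"{parts.scheme}://{parts.netloc}"
    return root, f"{root}/llms.txt", f"{root}/llms-full.txt"


root, index_url, full_url = probe_urls("https://example.com/docs/guide?x=1#top")
print(root)       # https://example.com
print(index_url)  # https://example.com/llms.txt
print(full_url)   # https://example.com/llms-full.txt
```

Because the root is derived from the scheme and netloc only, a URL that already points at /llms.txt yields the same pair of probe URLs as any other page on the site.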

`LlmsTxt`

Frozen dataclass representing a parsed llms.txt file.

Field	Type	Description
`title`	`str`	H1 heading (project/site name)
`description`	`str`	Blockquote summary, or `""`
`details`	`str`	Paragraphs between blockquote and first H2, or `""`
`sections`	`dict[str, list[FileEntry]]`	H2 section name → entries (excludes "Optional")
`optional`	`list[FileEntry]`	Entries from the `## Optional` section
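
As an illustration of how the two entry fields relate, the following sketch flattens a document into a single list, with regular sections first and optional entries last. `DocSketch` and `EntrySketch` are hypothetical stand-ins that mirror the field tables in this document; the defaults and the helper functions are assumptions, not the package's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EntrySketch:
    """Hypothetical stand-in for FileEntry."""
    name: str
    url: str
    notes: str = ""


@dataclass(frozen=True)
class DocSketch:
    """Hypothetical stand-in for LlmsTxt, mirroring the field table."""
    title: str
    description: str = ""
    details: str = ""
    sections: dict = field(default_factory=dict)
    optional: list = field(default_factory=list)


def required_entries(doc):
    """Entries from every H2 section except Optional, in section order."""
    return [entry for entries in doc.sections.values() for entry in entries]


def all_entries(doc):
    """Required entries first, then the optional ones."""
    return required_entries(doc) + list(doc.optional)


doc = DocSketch(
    title="My Project",
    sections={"Docs": [EntrySketch("Guide", "https://example.com/guide.md")]},
    optional=[EntrySketch("Advanced", "https://example.com/advanced.md")],
)
print([entry.name for entry in all_entries(doc)])  # ['Guide', 'Advanced']
```

Keeping optional entries in a separate field lets a caller with a tight context budget use `required_entries` alone. Note that a frozen dataclass only blocks attribute reassignment: if the fields hold ordinary dict and list objects, as the type column suggests, their contents can still be mutated.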

`DiscoveryResult`

Frozen dataclass returned by discover().

Field	Type	Description
`llms_txt`	`str | None`	Raw content of `/llms.txt`, or `None` if not found
`llms_full_txt`	`str | None`	Raw content of `/llms-full.txt`, or `None` if not found
`source_url`	`str`	The root URL (`{scheme}://{netloc}`) that was probed

`FileEntry`

Frozen dataclass representing a single link entry.

Field	Type	Description
`name`	`str`	Display name of the link
`url`	`str`	URL of the linked resource
`notes`	`str`	Text after `:` separator, or `""`
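
What "frozen" means in practice can be shown with the standard library. `EntrySketch` below is a hypothetical stand-in with the same three fields as FileEntry; whether the real class gives `notes` a default value is an assumption.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class EntrySketch:
    """Hypothetical stand-in with the same fields as FileEntry."""
    name: str
    url: str
    notes: str = ""


entry = EntrySketch("Guide", "https://example.com/guide.md", "The main guide")

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    entry.url = "https://example.com/other.md"
    mutated = True
except FrozenInstanceError:
    mutated = False

# dataclasses.replace builds a modified copy and leaves the original untouched.
renamed = replace(entry, name="User Guide")

print(mutated)       # False
print(renamed.name)  # User Guide
print(entry.name)    # Guide
```

Frozen dataclasses with only string fields are also hashable, so entries of this shape can be de-duplicated with a set or used as dict keys.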

`LlmsTxtError`

Exception raised when parsing fails (missing H1 title or empty input).

Usage Examples

Search parsed llms.txt for a URL

from llmstxt import parse, find_candidates

doc = parse(llms_txt_content)

# Exact and extension-based matching
results = find_candidates(
    "https://docs.example.com/guide.html", doc=doc
)
# Finds entries like guide.html.md, guide.md, etc.

# Prefix matching — find all entries under a path
results = find_candidates(
    "https://docs.example.com/tutorials", doc=doc
)
# Returns all entries whose URL path starts with /tutorials/

Heuristic fallback (no llms.txt available)

from llmstxt import find_candidates

# No doc provided — generates candidate .md URLs by convention
results = find_candidates("https://example.com/docs/guide")
for entry in results:
    print(entry.url)
# https://example.com/docs/guide.md
# https://example.com/docs/guide/index.md
# https://example.com/docs/guide/index.html.md
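
The three URLs listed above follow a simple pattern that can be reproduced in a few lines. This is a sketch of the convention only, not the package's code: `heuristic_urls` is a hypothetical name, and both the trailing-slash handling and the special case for a bare site root are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def heuristic_urls(url):
    """Generate conventional markdown URLs for a page (illustrative only)."""
    parts = urlsplit(url)
    # Assumption: a trailing slash is dropped before suffixes are appended.
    path = parts.path.rstrip("/")
    # Empty query and fragment components remove them from the rebuilt URL.
    base = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
    candidates = [f"{base}/index.md", f"{base}/index.html.md"]
    if path:
        # Assumption: skip the "{url}.md" form for a bare site root,
        # where it would produce a host name ending in ".md".
        candidates.insert(0, f"{base}.md")
    return candidates


for candidate in heuristic_urls("https://example.com/docs/guide/?ref=nav#top"):
    print(candidate)
# https://example.com/docs/guide.md
# https://example.com/docs/guide/index.md
# https://example.com/docs/guide/index.html.md
```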

Typical discovery workflow

from llmstxt import discover, parse, find_candidates

page_url = "https://example.com/docs/guide"

# Discover and fetch llms.txt / llms-full.txt from the site
result = discover(page_url)
content = result.llms_full_txt or result.llms_txt

if content:
    doc = parse(content)
    # Search for per-page markdown matching the original URL
    candidates = find_candidates(page_url, doc=doc)
else:
    # No llms.txt found — fall back to heuristic URL generation
    candidates = find_candidates(page_url)

for entry in candidates:
    print(f"Try: {entry.url}")

Matching Strategy

find_candidates() returns results ordered by match quality:

1. Exact match: entry URL equals input URL (after stripping query/fragment)
2. Extension variation: guide.html ↔ guide.html.md, guide ↔ guide.md
3. Path prefix: input path is a prefix of entry path or vice versa
4. Heuristic fallback: if no llms.txt entry matches (or no doc is given), generates {url}.md, {url}/index.md, {url}/index.html.md
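
The ordering above can be sketched as a small ranking function. This is one plausible reading of the three matching tiers, not the package's algorithm: `match_rank`, `rank_entries`, the suffix list, and the rule that a prefix match must end at a path segment boundary are all assumptions.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed set of interchangeable suffixes; the longest is checked first.
SUFFIXES = (".html.md", ".md", ".html")


def strip_url(url):
    """Drop query string, fragment and any trailing slash."""
    p = urlsplit(url)
    return urlunsplit((p.scheme, p.netloc, p.path.rstrip("/"), "", ""))


def stem(url):
    """Remove one known suffix so guide, guide.md and guide.html compare equal."""
    for suffix in SUFFIXES:
        if url.endswith(suffix):
            return url[: -len(suffix)]
    return url


def match_rank(page_url, entry_url):
    """Return 0 (exact), 1 (extension variation), 2 (path prefix) or None."""
    a, b = strip_url(page_url), strip_url(entry_url)
    if a == b:
        return 0
    if stem(a) == stem(b):
        return 1
    # Require a "/" boundary so /guide does not match /guideline.md.
    if b.startswith(a + "/") or a.startswith(b + "/"):
        return 2
    return None


def rank_entries(page_url, entry_urls):
    """Keep matching URLs, best tier first, original order within a tier."""
    scored = [(match_rank(page_url, u), i, u) for i, u in enumerate(entry_urls)]
    return [u for _, _, u in sorted(s for s in scored if s[0] is not None)]


ranked = rank_entries(
    "https://docs.example.com/guide?v=2#intro",
    [
        "https://docs.example.com/guide/advanced.md",
        "https://docs.example.com/guide.md",
        "https://docs.example.com/guide",
        "https://docs.example.com/other.md",
    ],
)
for url in ranked:
    print(url)
# https://docs.example.com/guide
# https://docs.example.com/guide.md
# https://docs.example.com/guide/advanced.md
```

An empty result from a function like `rank_entries` is the point at which the heuristic fallback would take over.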

Notes

The parser is lenient: only the H1 title is required. A missing blockquote or details paragraph yields an empty string, and missing sections yield an empty dict (sections) or an empty list (optional).
The ## Optional section is treated specially per the spec — its entries go to LlmsTxt.optional, not sections.
Section-name matching is case-sensitive: only an exact ## Optional heading (capital O) triggers special handling.
Windows line endings (\r\n) are normalized automatically.
find_candidates() strips query strings and fragments from URLs before matching.