Readability API Reference

Auto-generated API documentation for the readability module.

readability

HTML readability content extractor — zero-dep, stdlib only, Python 3.10+.

Part of zerodep: https://github.com/Oaklight/zerodep
Copyright (c) 2026 Peng Ding. MIT License.

Extracts the main article content from arbitrary web pages using a scoring algorithm inspired by Mozilla's Readability.js (Firefox Reader View). Built on top of zerodep/soup for HTML parsing; no external dependencies.

Algorithm overview:

  1. Pre-clean the DOM (remove scripts, styles, etc.)
  2. Extract metadata (JSON-LD, <meta> tags, <title>)
  3. Remove unlikely candidate nodes (sidebars, footers, ads …)
  4. Transform mis-used <div> elements into <p> paragraphs
  5. Score every <p>/<pre>/<td> node based on comma count, text length and class/id weight; propagate scores to parent & grandparent
  6. Pick the highest-scoring container and include qualifying siblings
  7. Sanitize the extracted article (remove forms, low-quality headers …)
  8. If the result is too short, retry with relaxed heuristics
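Step 5 can be illustrated with a small sketch. The constants and class/id patterns below are assumptions modelled on Mozilla's Readability.js (one base point, one point per comma, one point per 100 characters capped at 3, full score to the parent and half to the grandparent); the module's actual values are not shown on this page. `Node` is a toy stand-in, not the zerodep/soup tag type.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical class/id patterns; the module's real patterns may differ.
POSITIVE_RE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE_RE = re.compile(r"comment|footer|sidebar|sponsor|widget", re.I)


def class_weight(class_id: str) -> float:
    """Return +25 / -25 when the class/id string matches a pattern."""
    weight = 0.0
    if POSITIVE_RE.search(class_id):
        weight += 25.0
    if NEGATIVE_RE.search(class_id):
        weight -= 25.0
    return weight


@dataclass
class Node:
    """Toy stand-in for a parsed element."""

    tag: str
    text: str = ""
    class_id: str = ""
    parent: Node | None = None
    score: float = 0.0

    def __post_init__(self) -> None:
        # Seed every node's score with its class/id weight.
        self.score = class_weight(self.class_id)


def paragraph_score(text: str) -> float:
    """Base point + one per comma + one per 100 characters (capped at 3)."""
    return 1.0 + text.count(",") + min(len(text) // 100, 3)


def score_paragraph(node: Node) -> None:
    """Add the paragraph's score to its parent (full) and grandparent (half)."""
    points = paragraph_score(node.text)
    parent = node.parent
    if parent is None:
        return
    parent.score += points
    if parent.parent is not None:
        parent.parent.score += points / 2


article = Node("div", class_id="post-content")
section = Node("div", parent=article)
para = Node("p", text="One, two, three. " * 10, parent=section)
score_paragraph(para)
print(section.score, article.score)
```

The container with the highest accumulated score would then be the candidate picked in step 6.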

Example::

from readability import extract, is_probably_readable

with open("article.html", encoding="utf-8") as f:
    html = f.read()
if is_probably_readable(html):
    result = extract(html)
    print(result.title)
    print(result.text[:200])

References
  • Mozilla Readability.js: https://github.com/mozilla/readability
  • python-readability: https://github.com/buriy/python-readability

ReadabilityResult dataclass

Container for extracted article data.

Attributes:

Name            Type        Description
title           str         Article title (refined from <title> or headings).
content         str         Cleaned HTML of the main content.
text            str         Plain-text rendering of the main content.
author          str | None  Author name, or None.
excerpt         str | None  Article excerpt / description, or None.
site_name       str | None  Site name (e.g. from og:site_name), or None.
published_time  str | None  Publication timestamp string, or None.
lang            str | None  Language code from <html lang="...">, or None.
dir             str | None  Text direction ("ltr" / "rtl"), or None.
length          int         Character count of text.

Source code in readability/readability.py
@dataclass(frozen=True)
class ReadabilityResult:
    """Container for extracted article data.

    Attributes:
        title: Article title (refined from ``<title>`` or headings).
        content: Cleaned HTML of the main content.
        text: Plain-text rendering of the main content.
        author: Author name, or ``None``.
        excerpt: Article excerpt / description, or ``None``.
        site_name: Site name (e.g. from ``og:site_name``), or ``None``.
        published_time: Publication timestamp string, or ``None``.
        lang: Language code from ``<html lang="...">``, or ``None``.
        dir: Text direction (``"ltr"`` / ``"rtl"``), or ``None``.
        length: Character count of *text*.
    """

    title: str
    content: str
    text: str
    author: str | None = None
    excerpt: str | None = None
    site_name: str | None = None
    published_time: str | None = None
    lang: str | None = None
    dir: str | None = None
    length: int = 0
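
Because the dataclass is frozen, instances cannot be modified after construction. The sketch below shows the standard-library ways of working with such a value: `dataclasses.replace()` for a modified copy and `dataclasses.asdict()` for serialisation. The class is re-declared from the listing above only so the snippet runs on its own; real code would presumably import it from `readability`. The sample field values are made up.

```python
from __future__ import annotations

import json
from dataclasses import FrozenInstanceError, asdict, dataclass, replace


# Re-declared from the listing above so this snippet is standalone.
@dataclass(frozen=True)
class ReadabilityResult:
    title: str
    content: str
    text: str
    author: str | None = None
    excerpt: str | None = None
    site_name: str | None = None
    published_time: str | None = None
    lang: str | None = None
    dir: str | None = None
    length: int = 0


# Illustrative values, not output of a real extraction.
result = ReadabilityResult(
    title="Hello",
    content="<p>Hello, world.</p>",
    text="Hello, world.",
    length=len("Hello, world."),
)

# Frozen instances reject attribute assignment.
try:
    result.title = "Changed"
except FrozenInstanceError:
    print("result is immutable")

# replace() returns a modified copy and leaves the original untouched.
titled = replace(result, title="Hello again")

# asdict() yields a plain dict that serialises directly to JSON.
payload = json.dumps(asdict(result), ensure_ascii=False)
print(payload)
```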

extract(html, url=None)

Extract the main article content from an HTML string.

Parameters:

Name  Type        Description                                                                      Default
html  str         The full HTML document as a decoded string.                                      required
url   str | None  Optional base URL (currently unused; reserved for future link absolutisation).  None

Returns:

Type               Description
ReadabilityResult  A ReadabilityResult with the extracted content and metadata.

Source code in readability/readability.py
def extract(html: str, url: str | None = None) -> ReadabilityResult:
    """Extract the main article content from an HTML string.

    Args:
        html: The full HTML document as a decoded string.
        url: Optional base URL (currently unused; reserved for future
            link absolutisation).

    Returns:
        A ``ReadabilityResult`` with the extracted content and metadata.
    """
    Soup, _Tag = _load_soup()
    reader = _Readability(html, Soup, _Tag)
    return reader.parse()
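
Since `extract` expects an already decoded string, bytes fetched from disk or the network need decoding first. The helper below is a hypothetical sketch, not part of the readability module: it reads the charset from a `Content-Type` header value using the standard library's `email.message.Message` and falls back to UTF-8 with replacement characters.

```python
from __future__ import annotations

from email.message import Message


def decode_html(raw: bytes, content_type: str | None = None) -> str:
    """Decode raw page bytes using the charset from a Content-Type header.

    Hypothetical helper. Falls back to UTF-8 with replacement characters
    when no usable charset is declared.
    """
    charset = None
    if content_type:
        header = Message()
        header["Content-Type"] = content_type
        charset = header.get_content_charset()
    try:
        return raw.decode(charset or "utf-8", errors="replace")
    except LookupError:
        # Unknown codec name in the header: fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")


raw = "<p>Caf\u00e9</p>".encode("latin-1")
html = decode_html(raw, "text/html; charset=ISO-8859-1")
# The decoded string can then be handed on, e.g. result = extract(html)
```

A stricter caller might prefer `errors="strict"` and handle the `UnicodeDecodeError` itself; replacement characters are used here so that a mislabelled page still yields text.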

is_probably_readable(html, min_score=20.0, min_content_length=140)

Quick heuristic check whether html likely contains a readable article.

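Uses the same approach as Mozilla's isProbablyReaderable: accumulate sqrt(text_len - min_content_length) over qualifying <p>/<pre>/<article> nodes and return True as soon as the running score reaches min_score. Nodes whose class/id string matches the unlikely-candidate pattern (and not the ok-maybe pattern) are skipped, as are nodes with fewer than min_content_length characters of text.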

Parameters:

Name                Type   Description                                     Default
html                str    The HTML document string.                       required
min_score           float  Minimum cumulative score to consider readable.  20.0
min_content_length  int    Minimum text length for a node to contribute.   140

Returns:

Type  Description
bool  True if the page is probably an article.

Source code in readability/readability.py
def is_probably_readable(
    html: str,
    min_score: float = 20.0,
    min_content_length: int = 140,
) -> bool:
    """Quick heuristic check whether *html* likely contains a readable article.

    Uses the same approach as Mozilla's ``isProbablyReaderable``: accumulate
    ``sqrt(textLen - threshold)`` over qualifying ``<p>``/``<pre>``/``<article>``
    nodes and return ``True`` once the score reaches *min_score*.

    Args:
        html: The HTML document string.
        min_score: Minimum cumulative score to consider readable.
        min_content_length: Minimum text length for a node to contribute.

    Returns:
        ``True`` if the page is probably an article.
    """
    Soup, _Tag = _load_soup()
    soup = Soup(html)

    score = 0.0
    for tag in soup.find_all(["p", "pre", "article"]):
        # Skip unlikely candidates
        class_id = _get_class_id_string(tag)
        if len(class_id) > 1:
            is_unlikely = UNLIKELY_CANDIDATES_RE.search(class_id)
            is_ok = OK_MAYBE_CANDIDATE_RE.search(class_id)
            if is_unlikely and not is_ok:
                continue

        text = tag.get_text(strip=True)
        text_len = len(text)
        if text_len < min_content_length:
            continue

        score += math.sqrt(text_len - min_content_length)
        if score >= min_score:
            return True

    return False
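
The arithmetic of the loop above can be checked in isolation. The function below is an illustrative sketch (the name `readable_score_reached` is made up) that replays the same accumulation on a list of precomputed text lengths, leaving out the HTML parsing and the class/id filter.

```python
import math
from collections.abc import Iterable


def readable_score_reached(
    text_lengths: Iterable[int],
    min_score: float = 20.0,
    min_content_length: int = 140,
) -> bool:
    """Replay the accumulation loop on precomputed text lengths."""
    score = 0.0
    for text_len in text_lengths:
        if text_len < min_content_length:
            continue  # too short to contribute
        score += math.sqrt(text_len - min_content_length)
        if score >= min_score:
            return True
    return False


print(readable_score_reached([540]))       # sqrt(400) = 20       -> True
print(readable_score_reached([240, 240]))  # 10 + 10 = 20         -> True
print(readable_score_reached([239, 240]))  # sqrt(99) + 10 < 20   -> False
print(readable_score_reached([139] * 50))  # every node skipped   -> False
```

With the defaults, a single 540-character paragraph is enough, while any number of paragraphs shorter than 140 characters never contributes.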