Readability API Reference

Auto-generated API documentation for the readability module.

`readability`

HTML readability content extractor — zero-dep, stdlib only, Python 3.10+.

Extracts the main article content from arbitrary web pages using a scoring algorithm inspired by Mozilla's Readability.js (Firefox Reader View). Built on top of zerodep/soup for HTML parsing; no external dependencies.

Algorithm overview:

1. Pre-clean the DOM (remove scripts, styles, etc.)
2. Extract metadata (JSON-LD, `<meta>` tags, `<title>`)
3. Remove unlikely candidate nodes (sidebars, footers, ads …)
4. Transform misused `<div>` elements into `<p>` paragraphs
5. Score every `<p>`/`<pre>`/`<td>` node based on comma count, text length and class/id weight; propagate scores to the parent and grandparent
6. Pick the highest-scoring container and include qualifying siblings
7. Sanitize the extracted article (remove forms, low-quality headers …)
8. If the result is too short, retry with relaxed heuristics
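
The scoring and propagation step can be illustrated with a small self-contained sketch. The weights used here (a base score of 1, one point per comma, up to 3 points for length, the full score to the parent and half to the grandparent) are assumptions modelled on how Readability.js is commonly described; this module's actual constants may differ. `Node`, `score_paragraph` and `propagate` are hypothetical names, and the class/id weight is left out for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Node:
    """Toy DOM node; a hypothetical stand-in for a parsed tag."""

    tag: str
    text: str = ""
    parent: Node | None = None
    score: float = 0.0


def score_paragraph(text: str) -> float:
    """Score one text block (assumed weights, not the module's own)."""
    score = 1.0                          # base point for being a candidate
    score += text.count(",")             # one point per comma
    score += min(len(text) // 100, 3)    # up to 3 points for length
    return score


def propagate(node: Node) -> None:
    """Add the node's score to its parent (full) and grandparent (half)."""
    points = score_paragraph(node.text)
    parent = node.parent
    if parent is not None:
        parent.score += points
        if parent.parent is not None:
            parent.parent.score += points / 2


body = Node("body")
article = Node("div", parent=body)
para = Node("p", text="One, two, three. " * 10, parent=article)
propagate(para)
print(article.score, body.score)  # 22.0 11.0
```

The paragraph has 20 commas and 170 characters, so it earns 1 + 20 + 1 = 22 points; the enclosing `div` receives all 22 and `body` receives 11.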

Example:

from readability import extract, is_probably_readable

with open("article.html", encoding="utf-8") as f:  # encoding is an assumption
    html = f.read()
if is_probably_readable(html):
    result = extract(html)
    print(result.title)
    print(result.text[:200])

References

Mozilla Readability.js: https://github.com/mozilla/readability
python-readability: https://github.com/buriy/python-readability

`ReadabilityResult` `dataclass`

Container for extracted article data.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `title` | `str` | Article title (refined from `<title>` or headings). |
| `content` | `str` | Cleaned HTML of the main content. |
| `text` | `str` | Plain-text rendering of the main content. |
| `author` | `str \| None` | Author name, or `None`. |
| `excerpt` | `str \| None` | Article excerpt / description, or `None`. |
| `site_name` | `str \| None` | Site name (e.g. from `og:site_name`), or `None`. |
| `published_time` | `str \| None` | Publication timestamp string, or `None`. |
| `lang` | `str \| None` | Language code from `<html lang="...">`, or `None`. |
| `dir` | `str \| None` | Text direction (`"ltr"` / `"rtl"`), or `None`. |
| `length` | `int` | Character count of `text`. |
| `score` | `float` | Readability score of the best candidate container. Higher values indicate stronger confidence that the extracted content is a real article rather than navigation / boilerplate. Zero when no scored candidate was found (body fallback). |

Source code in readability/readability.py

@dataclass(frozen=True)
class ReadabilityResult:
    """Container for extracted article data.

    Attributes:
        title: Article title (refined from ``<title>`` or headings).
        content: Cleaned HTML of the main content.
        text: Plain-text rendering of the main content.
        author: Author name, or ``None``.
        excerpt: Article excerpt / description, or ``None``.
        site_name: Site name (e.g. from ``og:site_name``), or ``None``.
        published_time: Publication timestamp string, or ``None``.
        lang: Language code from ``<html lang="...">``, or ``None``.
        dir: Text direction (``"ltr"`` / ``"rtl"``), or ``None``.
        length: Character count of *text*.
        score: Readability score of the best candidate container.  Higher
            values indicate stronger confidence that the extracted content
            is a real article rather than navigation / boilerplate.  Zero
            when no scored candidate was found (body fallback).
    """

    title: str
    content: str
    text: str
    author: str | None = None
    excerpt: str | None = None
    site_name: str | None = None
    published_time: str | None = None
    lang: str | None = None
    dir: str | None = None
    length: int = 0
    score: float = 0.0
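
Because `score` is zero on the body fallback and `length` counts the characters of `text`, a caller may want to check both before trusting a result. The helper below is only a sketch: `looks_like_article` and its thresholds are hypothetical and not part of the module, and it is exercised with stand-in objects rather than real `ReadabilityResult` instances.

```python
from types import SimpleNamespace


def looks_like_article(result, min_score: float = 20.0, min_length: int = 500) -> bool:
    """Return True when a result has a scored candidate and enough text.

    The thresholds are illustrative assumptions; tune them for your corpus.
    """
    if result.score == 0:  # documented body fallback: no scored candidate
        return False
    return result.score >= min_score and result.length >= min_length


# Stand-ins carrying only the two fields the helper reads.
fallback = SimpleNamespace(score=0.0, length=4000)
article = SimpleNamespace(score=57.5, length=3200)
print(looks_like_article(fallback), looks_like_article(article))  # False True
```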

`extract(html, url=None)`

Extract the main article content from an HTML string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `html` | `str` | The full HTML document as a decoded string. | required |
| `url` | `str \| None` | Optional base URL (currently unused; reserved for future link absolutisation). | `None` |

Returns:

| Type | Description |
| --- | --- |
| `ReadabilityResult` | A `ReadabilityResult` with the extracted content and metadata. |

Source code in readability/readability.py

def extract(html: str, url: str | None = None) -> ReadabilityResult:
    """Extract the main article content from an HTML string.

    Args:
        html: The full HTML document as a decoded string.
        url: Optional base URL (currently unused; reserved for future
            link absolutisation).

    Returns:
        A ``ReadabilityResult`` with the extracted content and metadata.
    """
    Soup, _Tag = _load_soup()
    reader = _Readability(html, Soup, _Tag)
    return reader.parse()
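
Since `extract` expects a decoded string, raw bytes read from disk or the network have to be decoded first. The following is a minimal sketch, assuming the caller already knows a declared charset (for example from a `Content-Type` header); `decode_html` is a hypothetical helper and not part of the module.

```python
from __future__ import annotations


def decode_html(raw: bytes, declared: str | None = None) -> str:
    """Decode raw HTML bytes, trying the declared charset, then UTF-8."""
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or invalid bytes: try the next one
    # Last resort: never fail, replace undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")


html = decode_html("<p>café</p>".encode("latin-1"), declared="latin-1")
print(html)  # <p>café</p>
```

The resulting string can then be passed to `extract(html)`.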

`is_probably_readable(html, min_score=20.0, min_content_length=140)`

Quick heuristic check of whether `html` likely contains a readable article.

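Uses the same approach as Mozilla's `isProbablyReaderable`: accumulate `sqrt(text_len - min_content_length)` over qualifying `<p>`/`<pre>`/`<article>` nodes and return `True` as soon as the running score reaches `min_score`. Nodes shorter than `min_content_length` are skipped, as are nodes whose class/id string matches the unlikely-candidate pattern without also matching the "ok, maybe a candidate" pattern.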

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `html` | `str` | The HTML document string. | required |
| `min_score` | `float` | Minimum cumulative score to consider readable. | `20.0` |
| `min_content_length` | `int` | Minimum text length for a node to contribute. | `140` |

Returns:

| Type | Description |
| --- | --- |
| `bool` | `True` if the page is probably an article. |

Source code in readability/readability.py

def is_probably_readable(
    html: str,
    min_score: float = 20.0,
    min_content_length: int = 140,
) -> bool:
    """Quick heuristic check whether *html* likely contains a readable article.

    Uses the same approach as Mozilla's ``isProbablyReaderable``: accumulate
    ``sqrt(textLen - threshold)`` over qualifying ``<p>``/``<pre>``/``<article>``
    nodes and return ``True`` once the score reaches *min_score*.

    Args:
        html: The HTML document string.
        min_score: Minimum cumulative score to consider readable.
        min_content_length: Minimum text length for a node to contribute.

    Returns:
        ``True`` if the page is probably an article.
    """
    Soup, _Tag = _load_soup()
    soup = Soup(html)

    score = 0.0
    for tag in soup.find_all(["p", "pre", "article"]):
        # Skip unlikely candidates
        class_id = _get_class_id_string(tag)
        if len(class_id) > 1:
            is_unlikely = UNLIKELY_CANDIDATES_RE.search(class_id)
            is_ok = OK_MAYBE_CANDIDATE_RE.search(class_id)
            if is_unlikely and not is_ok:
                continue

        text = tag.get_text(strip=True)
        text_len = len(text)
        if text_len < min_content_length:
            continue

        score += math.sqrt(text_len - min_content_length)
        if score >= min_score:
            return True

    return False
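
The accumulation in the source above can be reproduced on plain text lengths to see how the defaults behave. This sketch mirrors only the arithmetic, with no HTML parsing and no class/id filtering; `readable_from_lengths` is a hypothetical name.

```python
import math


def readable_from_lengths(
    lengths: list[int],
    min_score: float = 20.0,
    min_content_length: int = 140,
) -> bool:
    """Apply the sqrt(text_len - min_content_length) rule to node lengths."""
    score = 0.0
    for text_len in lengths:
        if text_len < min_content_length:
            continue  # too short to contribute
        score += math.sqrt(text_len - min_content_length)
        if score >= min_score:
            return True  # early exit once the threshold is reached
    return False


print(readable_from_lengths([540]))       # True: sqrt(400) = 20
print(readable_from_lengths([240, 240]))  # True: 10 + 10 = 20
print(readable_from_lengths([139] * 50))  # False: every node is under 140
```

With the defaults, a single node needs at least 540 characters to pass on its own, while many nodes just under 140 characters never add to the score.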
