Readability API Reference
Auto-generated API documentation for the readability module.
readability
HTML readability content extractor — zero-dep, stdlib only, Python 3.10+.
Part of zerodep: https://github.com/Oaklight/zerodep Copyright (c) 2026 Peng Ding. MIT License.
Extracts the main article content from arbitrary web pages using a scoring
algorithm inspired by Mozilla's Readability.js (Firefox Reader View). Built
on top of zerodep/soup for HTML parsing; no external dependencies.
Algorithm overview:
- Pre-clean the DOM (remove scripts, styles, etc.)
- Extract metadata (JSON-LD, <meta> tags, <title>)
- Remove unlikely candidate nodes (sidebars, footers, ads …)
- Transform mis-used <div> elements into <p> paragraphs
- Score every <p>/<pre>/<td> node based on comma count, text length and class/id weight; propagate scores to parent & grandparent
- Pick the highest-scoring container and include qualifying siblings
- Sanitize the extracted article (remove forms, low-quality headers …)
- If the result is too short, retry with relaxed heuristics
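
The scoring and propagation step can be sketched as follows. This is an illustrative sketch only, not the module's implementation: the `Node` class, the hint patterns, and the numeric weights (one point per comma, up to three points for length, plus or minus 25 for class/id hints, half credit for the grandparent) are assumptions loosely modelled on Readability.js.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical class/id hint patterns, chosen only for illustration.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|sidebar|sponsor|promo", re.I)


@dataclass
class Node:
    """Minimal stand-in for a DOM element (not the zerodep/soup node type)."""
    tag: str
    text: str = ""
    class_id: str = ""  # class and id attribute values joined together
    parent: Node | None = None
    score: float = 0.0


def class_weight(node: Node) -> float:
    """Return an assumed +25 / -25 bonus based on class/id hints."""
    weight = 0.0
    if POSITIVE.search(node.class_id):
        weight += 25.0
    if NEGATIVE.search(node.class_id):
        weight -= 25.0
    return weight


def paragraph_points(node: Node) -> float:
    """One base point, one point per comma, up to 3 points for text length."""
    text = node.text.strip()
    return 1.0 + text.count(",") + min(len(text) // 100, 3)


def score_paragraphs(paragraphs: list[Node]) -> None:
    """Propagate each paragraph's points to its parent and grandparent."""
    seen: set[int] = set()
    for para in paragraphs:
        points = paragraph_points(para)
        ancestors = [(para.parent, 1.0)]  # parent receives full credit
        if para.parent is not None:
            ancestors.append((para.parent.parent, 0.5))  # grandparent: half
        for ancestor, factor in ancestors:
            if ancestor is None:
                continue
            if id(ancestor) not in seen:
                # First visit: seed the container with its class/id weight.
                seen.add(id(ancestor))
                ancestor.score += class_weight(ancestor)
            ancestor.score += points * factor
```

After scoring, the container with the highest `score` would be the candidate picked in the next step of the overview.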
Example::

    from readability import extract, is_probably_readable

    # extract() expects a decoded string, so read the file as text.
    with open("article.html", encoding="utf-8") as f:
        html = f.read()

    if is_probably_readable(html):
        result = extract(html)
        print(result.title)
        print(result.text[:200])

References
- Mozilla Readability.js: https://github.com/mozilla/readability
- python-readability: https://github.com/buriy/python-readability
ReadabilityResult
dataclass
Container for extracted article data.
Attributes:
| Name | Type | Description |
|---|---|---|
| title | str | Article title (refined from …). |
| content | str | Cleaned HTML of the main content. |
| text | str | Plain-text rendering of the main content. |
| author | str \| None | Author name, or None. |
| excerpt | str \| None | Article excerpt / description, or None. |
| site_name | str \| None | Site name (e.g. from …). |
| published_time | str \| None | Publication timestamp string, or None. |
| lang | str \| None | Language code from …. |
| dir | str \| None | Text direction (…). |
| length | int | Character count of text. |
Source code in readability/readability.py
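
Because most attributes are optional, callers should expect None for any metadata field. The sketch below shows one way to consume a result. `ResultSketch` is a stand-in that only mirrors the documented attribute names (its defaults are assumptions), and `byline` is a hypothetical helper, not part of the readability module.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ResultSketch:
    """Stand-in with the documented attribute names; not the library class."""
    title: str
    content: str
    text: str
    author: str | None = None          # defaults here are assumed
    excerpt: str | None = None
    site_name: str | None = None
    published_time: str | None = None
    lang: str | None = None
    dir: str | None = None
    length: int = 0


def byline(result) -> str:
    """Build a one-line header, skipping optional fields that are None."""
    parts = [result.title]
    if result.author:
        parts.append(f"by {result.author}")
    if result.site_name:
        parts.append(f"({result.site_name})")
    return " ".join(parts)
```

Since `byline` only reads attributes, it should work equally on any object exposing `title`, `author` and `site_name`.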
extract(html, url=None)
Extract the main article content from an HTML string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| html | str | The full HTML document as a decoded string. | required |
| url | str \| None | Optional base URL (currently unused; reserved for future link absolutisation). | None |
Returns:
| Type | Description |
|---|---|
| ReadabilityResult | A ReadabilityResult holding the extracted article data. |
Source code in readability/readability.py
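
Since html must already be a decoded string, callers holding raw bytes (for example from a network response) need to decode them first. A minimal sketch, assuming a hypothetical helper `decode_html` and an assumed fallback order (declared charset, then UTF-8, then Latin-1); none of this is part of the readability module.

```python
from __future__ import annotations


def decode_html(raw: bytes, declared: str | None = None) -> str:
    """Decode raw bytes to str, trying the declared charset, then UTF-8."""
    candidates = [declared] if declared else []
    candidates.append("utf-8")
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or invalid bytes: try the next one
    # Last resort: Latin-1 maps every byte value, so this cannot raise.
    return raw.decode("latin-1")
```

The decoded string can then be passed on, e.g. `extract(decode_html(raw_bytes))`.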
is_probably_readable(html, min_score=20.0, min_content_length=140)
Quick heuristic check whether html likely contains a readable article.
Uses the same approach as Mozilla's isProbablyReaderable: accumulate
sqrt(textLen - threshold) over qualifying <p>/<pre>/<article>
nodes and return True once the score exceeds min_score.
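
The accumulation can be sketched as below. This is an illustration under stated assumptions, not the module's code: the hypothetical `probably_readable` takes pre-computed node text lengths instead of parsing HTML, treats min_content_length as the threshold, and skips nodes shorter than it.

```python
import math


def probably_readable(text_lengths, min_score=20.0, min_content_length=140):
    """Accumulate sqrt(length - threshold) and stop once min_score is exceeded."""
    score = 0.0
    for length in text_lengths:
        if length < min_content_length:
            continue  # node too short to contribute (assumed qualifying rule)
        score += math.sqrt(length - min_content_length)
        if score > min_score:
            return True  # early exit: no need to inspect remaining nodes
    return False
```

With the defaults, a single node needs more than 540 characters to pass on its own, because sqrt(540 - 140) is exactly 20 and the score must exceed min_score.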
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| html | str | The HTML document string. | required |
| min_score | float | Minimum cumulative score to consider readable. | 20.0 |
| min_content_length | int | Minimum text length for a node to contribute. | 140 |
Returns:
| Type | Description |
|---|---|
| bool | True if the cumulative score exceeds min_score, otherwise False. |