HTML Parser (Soup)¶

Zero-dependency HTML parser with BeautifulSoup-like API -- stdlib only, Python 3.10+.

Replaces: beautifulsoup4 (common subset)

Overview¶

The Soup module provides a lightweight DOM tree built on top of html.parser.HTMLParser. It supports find, find_all, select, select_one, get_text, decompose, and find_parent, along with basic tree mutation and HTML serialization -- the subset of the BeautifulSoup API that typical scraping scripts rely on.

File	Description	Dependencies
`soup.py`	Pure Python implementation	None (stdlib only: `re`, `html.parser`)
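
To make "a DOM tree built on top of html.parser.HTMLParser" concrete, here is a minimal sketch of the general technique: subclass `HTMLParser` and keep a stack of open elements. The `Node`, `TreeBuilder`, `build_tree`, and `VOID_TAGS` names are hypothetical and chosen for illustration; this is not the code in `soup.py`, which may handle many more cases.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not pushed on the stack.
# (Illustrative subset, not a complete list.)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """Hypothetical element node: name, attributes, children, parent link."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = attrs or {}
        self.children = []  # child Node objects or plain str text
        self.parent = parent


class TreeBuilder(HTMLParser):
    """Builds a Node tree by tracking the currently open elements on a stack."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attr_dict = {}
        for key, value in attrs:
            # Store class as a list of names, mirroring the documented .attrs.
            attr_dict[key] = (value or "").split() if key == "class" else value
        parent = self.stack[-1]
        node = Node(tag, attr_dict, parent)
        parent.children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def build_tree(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


root = build_tree("<p class='msg note'>Hello <b>world</b></p>")
p = root.children[0]
print(p.name, p.attrs["class"])  # p ['msg', 'note']
print(repr(p.children[0]))       # 'Hello '
print(p.children[1].name)        # b
```

Because text is stored as plain `str` children, a tree like this needs no separate string node type, which matches the "NavigableString: No (plain `str`)" design described in this document.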

How to Use in Your Project¶

Just copy the single .py file into your project:

cp soup/soup.py your_project/

Then import directly:

from soup import Soup

Usage Examples¶

Basic Parsing¶

from soup import Soup

html = "<html><body><p class='msg'>Hello <b>world</b></p></body></html>"
soup = Soup(html)
print(soup.find("p", class_="msg").text)
# Hello world

find and find_all¶

html = """
<ul>
  <li class="item">Apple</li>
  <li class="item">Banana</li>
  <li class="item special">Cherry</li>
</ul>
"""
soup = Soup(html)

# Find first match
first = soup.find("li")
print(first.text)  # Apple

# Find all matches
items = soup.find_all("li", class_="item")
print([i.text for i in items])  # ['Apple', 'Banana', 'Cherry']

CSS Selectors¶

html = """
<div id="main">
  <p class="intro">Welcome</p>
  <p class="body">Content</p>
</div>
"""
soup = Soup(html)

# Select by ID
main = soup.select_one("#main")

# Select by class
intro = soup.select_one(".intro")
print(intro.text)  # Welcome

# Select by tag
paragraphs = soup.select("p")
print(len(paragraphs))  # 2

# Descendant selector
body_p = soup.select_one("div p.body")
print(body_p.text)  # Content

# Child selector
children = soup.select("div > p")
print(len(children))  # 2

# Attribute selector
soup2 = Soup('<a href="/home">Home</a><a href="/about">About</a>')
links = soup2.select("a[href]")
print(len(links))  # 2

# Pseudo-selectors
html3 = """
<ul>
  <li class="a">First</li>
  <li class="b">Second</li>
  <li class="c">Third</li>
</ul>
"""
soup3 = Soup(html3)

# :first-child / :last-child
print(soup3.select_one("li:first-child").text)  # First
print(soup3.select_one("li:last-child").text)   # Third

# :only-child
soup4 = Soup("<div><span>Alone</span></div>")
print(soup4.select_one("span:only-child").text)  # Alone

# :not(selector)
others = soup3.select("li:not(.a)")
print([li.text for li in others])  # ['Second', 'Third']

get_text¶

soup = Soup("<p>Hello <b>world</b></p>")
print(soup.get_text())          # Hello world
print(soup.get_text(separator=" | "))  # Hello  | world
print(soup.get_text(strip=True))  # Helloworld
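
The doubled space in `Hello  | world` and the missing space in `Helloworld` both follow from treating the markup as two text nodes, `"Hello "` and `"world"`. The sketch below reproduces those outputs with a hypothetical `join_text` helper. It assumes the BeautifulSoup convention of stripping each text node individually and dropping empty ones, which is consistent with the outputs above but is not confirmed as the module's exact logic.

```python
def join_text(pieces, separator="", strip=False):
    """Join text nodes (assumed semantics, modeled on the examples above)."""
    if strip:
        # Strip every piece on its own, then drop pieces that became empty.
        pieces = [p.strip() for p in pieces]
        pieces = [p for p in pieces if p]
    return separator.join(pieces)


# "<p>Hello <b>world</b></p>" holds two text nodes: "Hello " and "world".
pieces = ["Hello ", "world"]
print(join_text(pieces))                             # Hello world
print(join_text(pieces, separator=" | "))            # Hello  | world
print(join_text(pieces, strip=True))                 # Helloworld
print(join_text(pieces, separator=" ", strip=True))  # Hello world
```

Under these assumptions, combining `separator=" "` with `strip=True` is the usual way to get single-spaced text.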

Attributes¶

soup = Soup('<a href="/page" class="nav active" id="link1">Click</a>')
a = soup.find("a")

# Access attributes
print(a["href"])         # /page
print(a["id"])           # link1
print(a.get("class"))    # ['nav', 'active']
print(a.get("missing", "default"))  # default

Tree Mutation¶

from soup import Soup

html = "<ul><li>A</li><li>B</li></ul>"
soup = Soup(html)
ul = soup.find("ul")

# append -- add a child at the end
new_li = soup.new_tag("li", {"class": "new"})
new_li.children.append("C")
ul.append(new_li)

# insert -- add a child at a specific position
another = soup.new_tag("li")
another.children.append("Z")
ul.insert(0, another)

# extract -- remove from parent but keep the node intact
removed = ul.children[1].extract()

# replace_with -- swap one node for another
replacement = soup.new_tag("li")
replacement.children.append("X")
ul.children[0].replace_with(replacement)

# unwrap -- remove a tag but keep its children
soup2 = Soup("<div><b>bold text</b></div>")
soup2.find("b").unwrap()
print(soup2.get_text())  # bold text

HTML Serialization¶

soup = Soup('<div class="box"><p>Hello <b>world</b></p></div>')
div = soup.find("div")
print(div.to_html())
# <div class="box"><p>Hello <b>world</b></p></div>

# str() also produces HTML
print(str(div))
# <div class="box"><p>Hello <b>world</b></p></div>

Attribute Setting¶

soup = Soup('<a href="/old">Link</a>')
a = soup.find("a")

# Set attribute
a["href"] = "/new"
a["class"] = ["nav", "active"]

# Delete attribute
del a["class"]
print(a.to_html())  # <a href="/new">Link</a>

Creating New Tags¶

soup = Soup("<div></div>")
tag = soup.new_tag("span", {"id": "greeting"})
tag.children.append("Hello!")
soup.find("div").append(tag)
print(soup.find("div").to_html())
# <div><span id="greeting">Hello!</span></div>

decompose (Remove Elements)¶

html = "<div><p>Keep</p><script>remove me</script></div>"
soup = Soup(html)
for script in soup.find_all("script"):
    script.decompose()
print(soup.get_text())  # Keep

find_parent¶

soup = Soup("<div><ul><li>Item</li></ul></div>")
li = soup.find("li")
print(li.find_parent("div").name)  # div
print(li.find_parent().name)       # ul
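
The lookup above amounts to walking the chain of `.parent` links until a name matches. A minimal sketch, using a hypothetical stand-in `Node` class rather than the module's `Tag`:

```python
class Node:
    """Hypothetical stand-in for Tag: only a name and a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def find_parent(self, name=None):
        # Walk upward; with name=None the immediate parent is returned.
        node = self.parent
        while node is not None:
            if name is None or node.name == name:
                return node
            node = node.parent
        return None


div = Node("div")
ul = Node("ul", parent=div)
li = Node("li", parent=ul)

print(li.find_parent("div").name)  # div
print(li.find_parent().name)       # ul
print(li.find_parent("table"))     # None
```

In this sketch a failed lookup returns `None`, so chained access such as `.name` would need a guard. Whether `soup.py` behaves the same way on a miss is an assumption.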

Supported CSS Selectors¶

Selector	Example	Description
Tag	`p`	Match by tag name
Class	`.intro`	Match by class name
ID	`#main`	Match by ID
Attribute	`[href]`	Match elements with attribute
Attribute value	`[href="/home"]`	Match attribute value
Descendant	`div p`	Match `p` inside `div`
Child	`div > p`	Match direct children
Compound	`p.intro`	Match `p` with class `intro`
:first-child	`li:first-child`	Match first child element
:last-child	`li:last-child`	Match last child element
:only-child	`span:only-child`	Match element that is the only child
:not()	`li:not(.active)`	Match elements that do NOT match the inner selector
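
As an illustration of how a compound selector such as `p.intro` can be handled with the stdlib `re` module, the sketch below splits one compound selector into tag, id, classes, and attribute tests, then checks it against an element. `parse_compound` and `matches` are hypothetical names; combinators and pseudo-selectors are deliberately left out, and the actual parser in `soup.py` may work differently.

```python
import re

# One token: an optional "#" or "." prefix plus a name, or [attr] / [attr="value"].
TOKEN = re.compile(r"""([#.]?)([\w-]+)|\[([\w-]+)(?:=["']?([^"'\]]*)["']?)?\]""")


def parse_compound(selector):
    """Split a compound selector such as 'p.intro' into its parts."""
    parts = {"tag": None, "id": None, "classes": [], "attrs": {}}
    pos = 0
    while pos < len(selector):
        m = TOKEN.match(selector, pos)
        if m is None:
            raise ValueError(f"unsupported selector syntax at {selector[pos:]!r}")
        prefix, name, attr, value = m.groups()
        if attr is not None:
            parts["attrs"][attr] = value  # None means "attribute is present"
        elif prefix == "#":
            parts["id"] = name
        elif prefix == ".":
            parts["classes"].append(name)
        else:
            parts["tag"] = name
        pos = m.end()
    return parts


def matches(name, attrs, parts):
    """Check an element (tag name plus attribute dict) against parsed parts."""
    if parts["tag"] and parts["tag"] != name:
        return False
    if parts["id"] and attrs.get("id") != parts["id"]:
        return False
    classes = attrs.get("class", [])
    if any(c not in classes for c in parts["classes"]):
        return False
    for attr, value in parts["attrs"].items():
        if attr not in attrs or (value is not None and attrs[attr] != value):
            return False
    return True


parts = parse_compound("p.intro")
print(parts["tag"], parts["classes"])                     # p ['intro']
print(matches("p", {"class": ["intro", "lead"]}, parts))  # True
print(matches("p", {"class": ["body"]}, parts))           # False
```

Descendant (`div p`) and child (`div > p`) selectors can then be layered on top by matching each compound part against an element's ancestors or its direct parent.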

API Reference¶

`Soup(markup, parser="html.parser")`¶

Parse an HTML document and provide a BeautifulSoup-like API.

Parameters:

Name	Type	Default	Description
`markup`	`str`	--	The HTML string to parse.
`parser`	`str`	`"html.parser"`	Ignored (present only for API compatibility with BS4).

`Tag` Methods¶

Method	Description
`find(name, class_, **attrs)`	Find the first matching descendant element.
`find_all(name, class_, **attrs)`	Find all matching descendant elements.
`select(selector)`	Find all elements matching a CSS selector.
`select_one(selector)`	Find first element matching a CSS selector.
`get_text(separator="", strip=False)`	Get all text content.
`decompose()`	Remove this element from its parent.
`find_parent(name=None)`	Find the nearest parent, optionally by tag name.
`get(attr, default=None)`	Get attribute value.
`append(child)`	Append a child node (Tag or str).
`insert(index, child)`	Insert a child at the given position.
`extract()`	Remove from parent, return self (keeps children).
`replace_with(new_node)`	Replace this node with another in parent's children.
`unwrap()`	Remove this tag but keep its children (reparent to parent).
`to_html()`	Serialize this element to an HTML string.

`Tag` Properties¶

Property	Type	Description
`.text`	`str`	All text content (shortcut for `get_text()`).
`.name`	`str`	Tag name (e.g. `"div"`).
`.attrs`	`dict`	Attribute dictionary. `class` is stored as a list.
`.children`	`list`	Child nodes (`Tag` or `str`).
`.parent`	`Tag | None`	Parent element.

`Soup` Factory Methods¶

Method	Description
`new_tag(name, attrs=None)`	Create a new detached Tag node.

Comparison with BeautifulSoup¶

Feature	zerodep soup	BeautifulSoup
Dependencies	None (stdlib only)	`soupsieve`, optional `lxml`/`html5lib`
Files	Single file	Package (multiple files)
Parser backends	`html.parser` only	`html.parser`, `lxml`, `html5lib`
find / find_all	Yes	Yes
CSS selectors	Tag, class, id, attr, combinators, pseudo-selectors	Full (via soupsieve)
Tree mutation (append/insert/extract)	Yes	Yes
HTML serialization (to_html)	Yes	Yes (`str()`, `prettify()`)
NavigableString	No (plain `str`)	Yes
Parse speed (small)	149 us	446 us (2.99x slower)
Parse speed (large)	12.7 ms	37.1 ms (2.93x slower)

When to use zerodep: You need basic HTML parsing (find, select, get_text) with zero dependencies and fast performance.

When to use BeautifulSoup: You need the full CSS selector spec (:nth-child(), :has(), etc.), multiple parser backends, or NavigableString features.

Benchmark¶

Benchmarked against beautifulsoup4 across small, medium, and large HTML documents.

See Soup Benchmark for detailed results.
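
For readers who want to take comparable measurements themselves, the following is a rough sketch of a timing harness built on the stdlib `timeit` module. `time_parse` and `stdlib_parse` are hypothetical helpers, and the workload shown only exercises the bare stdlib parser. It is not the benchmark script behind the figures in this document and will not reproduce them.

```python
import timeit
from html.parser import HTMLParser


def time_parse(parse, markup, number=200, repeat=5):
    """Return the best per-call time in microseconds for parse(markup)."""
    timer = timeit.Timer(lambda: parse(markup))
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6


def stdlib_parse(markup):
    # Stand-in workload: feed the markup through the bare stdlib parser.
    parser = HTMLParser()
    parser.feed(markup)
    parser.close()


small = "<ul>" + "<li class='item'>x</li>" * 20 + "</ul>"
print(f"{time_parse(stdlib_parse, small):.1f} us per parse")
```

To compare libraries, the same harness could be called with `Soup` and with a `BeautifulSoup(markup, "html.parser")` wrapper as the `parse` callable, assuming both packages are importable in your environment.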