HTML Parser (Soup)

A zero-dependency HTML parser with a BeautifulSoup-like API -- stdlib only, Python 3.10+.

Replaces: beautifulsoup4 (common subset)

Overview

The Soup module provides a lightweight DOM tree built on top of html.parser.HTMLParser. It supports find, find_all, select, select_one, get_text, decompose, and find_parent, plus basic tree mutation and HTML serialization. This is the subset of BeautifulSoup that typical scraping scripts rely on.
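
To make the "DOM tree on top of html.parser.HTMLParser" idea concrete, here is a minimal sketch of how such a tree could be built from parser callbacks. It is not the code in soup.py: the names Node, TreeBuilder, parse, and VOID_TAGS are hypothetical, and the handling of stray closing tags is an assumption.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical element node: name, attributes, children, parent link."""

    def __init__(self, name, attrs=None, parent=None):
        self.name = name
        self.attrs = dict(attrs or {})
        self.children = []  # child Nodes or plain str text
        self.parent = parent


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTMLParser start/end/data callbacks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("[document]")
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._current)
        self._current.children.append(node)
        if tag not in VOID_TAGS:
            self._current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name.
        # Assumption: a closing tag with no matching opener is ignored.
        node = self._current
        while node is not self.root and node.name != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def handle_data(self, data):
        self._current.children.append(data)


def parse(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


root = parse("<html><body><p class='msg'>Hello <b>world</b></p></body></html>")
p = root.children[0].children[0].children[0]  # html -> body -> p
print(p.name, p.attrs, p.children[0])
```

Text nodes stay plain str values here, which mirrors the "NavigableString: No (plain str)" row in the comparison table.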

| File | Description | Dependencies |
|------|-------------|--------------|
| soup.py | Pure Python implementation | None (stdlib only: re, html.parser) |

How to Use in Your Project

Just copy the single .py file into your project:

cp soup/soup.py your_project/

Then import directly:

from soup import Soup

Usage Examples

Basic Parsing

from soup import Soup

html = "<html><body><p class='msg'>Hello <b>world</b></p></body></html>"
soup = Soup(html)
print(soup.find("p", class_="msg").text)
# Hello world

find and find_all

html = """
<ul>
  <li class="item">Apple</li>
  <li class="item">Banana</li>
  <li class="item special">Cherry</li>
</ul>
"""
soup = Soup(html)

# Find first match
first = soup.find("li")
print(first.text)  # Apple

# Find all matches
items = soup.find_all("li", class_="item")
print([i.text for i in items])  # ['Apple', 'Banana', 'Cherry']

CSS Selectors

html = """
<div id="main">
  <p class="intro">Welcome</p>
  <p class="body">Content</p>
</div>
"""
soup = Soup(html)

# Select by ID
main = soup.select_one("#main")

# Select by class
intro = soup.select_one(".intro")
print(intro.text)  # Welcome

# Select by tag
paragraphs = soup.select("p")
print(len(paragraphs))  # 2

# Descendant selector
body_p = soup.select_one("div p.body")
print(body_p.text)  # Content

# Child selector
children = soup.select("div > p")
print(len(children))  # 2

# Attribute selector
soup2 = Soup('<a href="/home">Home</a><a href="/about">About</a>')
links = soup2.select("a[href]")
print(len(links))  # 2

# Pseudo-selectors
html3 = """
<ul>
  <li class="a">First</li>
  <li class="b">Second</li>
  <li class="c">Third</li>
</ul>
"""
soup3 = Soup(html3)

# :first-child / :last-child
print(soup3.select_one("li:first-child").text)  # First
print(soup3.select_one("li:last-child").text)   # Third

# :only-child
soup4 = Soup("<div><span>Alone</span></div>")
print(soup4.select_one("span:only-child").text)  # Alone

# :not(selector)
others = soup3.select("li:not(.a)")
print([li.text for li in others])  # ['Second', 'Third']
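
In the example above, "div p" and "div > p" return the same two elements because both paragraphs are direct children of the div. The sketch below shows where the two combinators differ. It is a standalone illustration using a hypothetical Node class with parent links, not the matching code in soup.py.

```python
class Node:
    """Hypothetical element with just a tag name and a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def ancestors(node):
    """Yield the parent, grandparent, and so on up to the root."""
    current = node.parent
    while current is not None:
        yield current
        current = current.parent


def matches_descendant(node, ancestor_name, name):
    """'ancestor_name name': some ancestor of node has that tag name."""
    return node.name == name and any(a.name == ancestor_name for a in ancestors(node))


def matches_child(node, parent_name, name):
    """'parent_name > name': the immediate parent of node has that tag name."""
    return (node.name == name
            and node.parent is not None
            and node.parent.name == parent_name)


# Models <div><p>direct</p><section><p>nested</p></section></div>
div = Node("div")
direct = Node("p", div)
section = Node("section", div)
nested = Node("p", section)

print(matches_descendant(direct, "div", "p"), matches_descendant(nested, "div", "p"))  # True True
print(matches_child(direct, "div", "p"), matches_child(nested, "div", "p"))            # True False
```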

get_text

soup = Soup("<p>Hello <b>world</b></p>")
print(soup.get_text())          # Hello world
print(soup.get_text(separator=" | "))  # Hello  | world
print(soup.get_text(strip=True))  # Helloworld
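
The outputs above follow from treating the document as a sequence of text fragments ("Hello " and "world") that are joined with the separator. The following sketch reproduces that behavior on a nested-list stand-in for the tree. The list representation and the helper iter_text are assumptions made for illustration, not the module's internals.

```python
def iter_text(children):
    """Yield text fragments depth-first from nested lists of str / list nodes."""
    for child in children:
        if isinstance(child, str):
            yield child
        else:
            yield from iter_text(child)


def get_text(children, separator="", strip=False):
    fragments = iter_text(children)
    if strip:
        # Trim each fragment, then drop fragments that became empty.
        fragments = (s.strip() for s in fragments)
        fragments = (s for s in fragments if s)
    return separator.join(fragments)


# <p>Hello <b>world</b></p> modeled as nested lists (hypothetical representation)
tree = [["Hello ", ["world"]]]

print(get_text(tree))                   # Hello world
print(get_text(tree, separator=" | "))  # Hello  | world
print(get_text(tree, strip=True))       # Helloworld
```

The double space in the second output comes from the trailing space in the "Hello " fragment, and the missing space in the third comes from stripping each fragment before joining with an empty separator.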

Attributes

soup = Soup('<a href="/page" class="nav active" id="link1">Click</a>')
a = soup.find("a")

# Access attributes
print(a["href"])         # /page
print(a["id"])           # link1
print(a.get("class"))    # ['nav', 'active']
print(a.get("missing", "default"))  # default
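
html.parser hands attributes to handle_starttag as a list of (name, value) pairs, so a class list like the one above has to be produced by splitting the raw string. The sketch below shows one way that normalization could look. normalize_attrs and AttrGrabber are hypothetical names, and mapping a valueless attribute to an empty string is an assumption.

```python
from html.parser import HTMLParser


def normalize_attrs(pairs):
    """Turn HTMLParser's [(name, value), ...] into a dict; class becomes a list."""
    attrs = {}
    for name, value in pairs:
        if name == "class":
            attrs[name] = (value or "").split()
        else:
            # Assumption: a bare attribute such as <input disabled> maps to "".
            attrs[name] = value if value is not None else ""
    return attrs


class AttrGrabber(HTMLParser):
    """Records the normalized attributes of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, normalize_attrs(attrs)))


grabber = AttrGrabber()
grabber.feed('<a href="/page" class="nav active" id="link1">Click</a>')
tag, attrs = grabber.seen[0]
print(attrs["class"])  # ['nav', 'active']
```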

Tree Mutation

from soup import Soup

html = "<ul><li>A</li><li>B</li></ul>"
soup = Soup(html)
ul = soup.find("ul")

# append -- add a child at the end
new_li = soup.new_tag("li", {"class": "new"})
new_li.children.append("C")
ul.append(new_li)

# insert -- add a child at a specific position
another = soup.new_tag("li")
another.children.append("Z")
ul.insert(0, another)

# extract -- remove from parent but keep the node intact
removed = ul.children[1].extract()  # children were Z, A, B, C; removes A

# replace_with -- swap one node for another
replacement = soup.new_tag("li")
replacement.children.append("X")
ul.children[0].replace_with(replacement)  # Z is replaced; children are now X, B, C

# unwrap -- remove a tag but keep its children
soup2 = Soup("<div><b>bold text</b></div>")
soup2.find("b").unwrap()
print(soup2.get_text())  # bold text
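
Since .children is a plain list and each node keeps a .parent reference, extract and unwrap reduce to list edits plus parent bookkeeping. Here is a rough standalone sketch under that assumption. The Node class below is hypothetical and much smaller than the module's Tag.

```python
class Node:
    """Hypothetical element: a name, a parent link, and a list of children."""

    def __init__(self, name, children=None):
        self.name = name
        self.parent = None
        self.children = []
        for child in children or []:
            self.append(child)

    def append(self, child):
        if isinstance(child, Node):
            child.parent = self
        self.children.append(child)

    def _index_in_parent(self):
        # Compare by identity so equal-looking siblings are not confused.
        return next(i for i, c in enumerate(self.parent.children) if c is self)

    def extract(self):
        """Detach from the parent; the subtree under this node stays intact."""
        if self.parent is not None:
            del self.parent.children[self._index_in_parent()]
            self.parent = None
        return self

    def unwrap(self):
        """Splice this node's children into the parent in its place."""
        parent = self.parent
        idx = self._index_in_parent()
        parent.children[idx:idx + 1] = self.children
        for child in self.children:
            if isinstance(child, Node):
                child.parent = parent
        self.parent = None
        self.children = []
        return self


# unwrap: <div><b>bold text</b></div> becomes <div>bold text</div>
div = Node("div", [Node("b", ["bold text"])])
div.children[0].unwrap()
print(div.children)  # ['bold text']

# extract: the removed <li> keeps its own children
ul = Node("ul", [Node("li", ["A"]), Node("li", ["B"])])
removed = ul.children[0].extract()
print(removed.children, len(ul.children))  # ['A'] 1
```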

HTML Serialization

soup = Soup('<div class="box"><p>Hello <b>world</b></p></div>')
div = soup.find("div")
print(div.to_html())
# <div class="box"><p>Hello <b>world</b></p></div>

# str() also produces HTML
print(str(div))
# <div class="box"><p>Hello <b>world</b></p></div>
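
Serialization of this kind is typically a recursive walk that emits the start tag, the serialized children, and the end tag. The sketch below models nodes as (name, attrs, children) tuples, which is a hypothetical representation. Escaping text and attribute values with html.escape is an assumption about reasonable behavior, not a statement about what soup.py does, and void elements such as br are not handled.

```python
from html import escape


def to_html(node):
    """Serialize a (name, attrs, children) tuple or a str text node to HTML."""
    if isinstance(node, str):
        return escape(node, quote=False)
    name, attrs, children = node
    parts = [name]
    for key, value in attrs.items():
        if isinstance(value, list):  # e.g. class stored as a list
            value = " ".join(value)
        parts.append(f'{key}="{escape(value, quote=True)}"')
    inner = "".join(to_html(child) for child in children)
    return f"<{' '.join(parts)}>{inner}</{name}>"


tree = ("div", {"class": ["box"]}, [
    ("p", {}, ["Hello ", ("b", {}, ["world"])]),
])
print(to_html(tree))
# <div class="box"><p>Hello <b>world</b></p></div>
```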

Attribute Setting

soup = Soup('<a href="/old">Link</a>')
a = soup.find("a")

# Set attribute
a["href"] = "/new"
a["class"] = ["nav", "active"]

# Delete attribute
del a["class"]
print(a.to_html())  # <a href="/new">Link</a>

Creating New Tags

soup = Soup("<div></div>")
tag = soup.new_tag("span", {"id": "greeting"})
tag.children.append("Hello!")
soup.find("div").append(tag)
print(soup.find("div").to_html())
# <div><span id="greeting">Hello!</span></div>

decompose (Remove Elements)

html = "<div><p>Keep</p><script>remove me</script></div>"
soup = Soup(html)
for script in soup.find_all("script"):
    script.decompose()
print(soup.get_text())  # Keep

find_parent

soup = Soup("<div><ul><li>Item</li></ul></div>")
li = soup.find("li")
print(li.find_parent("div").name)  # div
print(li.find_parent().name)       # ul
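
The behavior shown above amounts to walking .parent links until a node matches. A minimal standalone sketch, using a hypothetical Node class and a free function rather than the module's Tag method:

```python
class Node:
    """Hypothetical element with a tag name and a parent link."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent


def find_parent(node, name=None):
    """Return the nearest ancestor, or the nearest one with the given tag name."""
    current = node.parent
    while current is not None:
        if name is None or current.name == name:
            return current
        current = current.parent
    return None  # reached the root without a match


# Models <div><ul><li>Item</li></ul></div>
div = Node("div")
ul = Node("ul", div)
li = Node("li", ul)

print(find_parent(li, "div").name)  # div
print(find_parent(li).name)         # ul
print(find_parent(li, "table"))     # None
```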

Supported CSS Selectors

| Selector | Example | Description |
|----------|---------|-------------|
| Tag | p | Match by tag name |
| Class | .intro | Match by class name |
| ID | #main | Match by ID |
| Attribute | [href] | Match elements with attribute |
| Attribute value | [href="/home"] | Match attribute value |
| Descendant | div p | Match p inside div |
| Child | div > p | Match direct children |
| Compound | p.intro | Match p with class intro |
| :first-child | li:first-child | Match first child element |
| :last-child | li:last-child | Match last child element |
| :only-child | span:only-child | Match element that is the only child |
| :not() | li:not(.active) | Match elements that do NOT match the inner selector |
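
As an illustration of how the tag, class, ID, and compound rows can be handled with the re module, the sketch below splits a compound selector into its parts and checks them against an element's name and attributes. parse_compound and matches are hypothetical helpers. They cover only those four rows, so attribute selectors, combinators, and pseudo-selectors raise ValueError here.

```python
import re

# One simple-selector token: "#id", ".class", or a bare tag name.
_TOKEN = re.compile(r"([#.]?)([A-Za-z_][\w-]*)")


def parse_compound(selector):
    """Split a compound selector such as 'p.intro' into tag, id, and classes."""
    parts = {"tag": None, "id": None, "classes": []}
    pos = 0
    while pos < len(selector):
        m = _TOKEN.match(selector, pos)
        if m is None:
            raise ValueError(f"unsupported selector syntax at {selector[pos:]!r}")
        prefix, name = m.groups()
        if prefix == "#":
            parts["id"] = name
        elif prefix == ".":
            parts["classes"].append(name)
        else:
            parts["tag"] = name
        pos = m.end()
    return parts


def matches(parts, name, attrs):
    """Check an element's tag name and attribute dict against parsed parts."""
    if parts["tag"] is not None and parts["tag"] != name:
        return False
    if parts["id"] is not None and attrs.get("id") != parts["id"]:
        return False
    element_classes = attrs.get("class", [])  # class assumed stored as a list
    return all(cls in element_classes for cls in parts["classes"])


print(parse_compound("p.intro"))
# {'tag': 'p', 'id': None, 'classes': ['intro']}
print(matches(parse_compound("li.item"), "li", {"class": ["item", "special"]}))  # True
print(matches(parse_compound("#main"), "div", {"id": "main"}))                   # True
```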

API Reference

Soup(markup, parser="html.parser")

Parse an HTML document and provide a BeautifulSoup-like API.

Parameters:

| Name | Type | Default | Description |
|------|------|---------|-------------|
| markup | str | -- | The HTML string to parse. |
| parser | str | "html.parser" | Ignored (present only for API compatibility with BS4). |

Tag Methods

| Method | Description |
|--------|-------------|
| find(name, class_, **attrs) | Find the first matching descendant element. |
| find_all(name, class_, **attrs) | Find all matching descendant elements. |
| select(selector) | Find all elements matching a CSS selector. |
| select_one(selector) | Find the first element matching a CSS selector. |
| get_text(separator="", strip=False) | Get all text content. |
| decompose() | Remove this element from its parent. |
| find_parent(name=None) | Find the nearest ancestor, optionally filtered by tag name. |
| get(attr, default=None) | Get an attribute value, or default if the attribute is missing. |
| append(child) | Append a child node (Tag or str). |
| insert(index, child) | Insert a child at the given position. |
| extract() | Remove from parent and return self (keeps children). |
| replace_with(new_node) | Replace this node with another in the parent's children. |
| unwrap() | Remove this tag but keep its children (reparented to the parent). |
| to_html() | Serialize this element to an HTML string. |

Tag Properties

| Property | Type | Description |
|----------|------|-------------|
| .text | str | All text content (shortcut for get_text()). |
| .name | str | Tag name (e.g. "div"). |
| .attrs | dict | Attribute dictionary. class is stored as a list. |
| .children | list | Child nodes (Tag or str). |
| .parent | Tag \| None | Parent element. |

Soup Factory Methods

| Method | Description |
|--------|-------------|
| new_tag(name, attrs=None) | Create a new detached Tag node. |

Comparison with BeautifulSoup

| Feature | zerodep soup | BeautifulSoup |
|---------|--------------|---------------|
| Dependencies | None (stdlib only) | soupsieve, optional lxml/html5lib |
| Files | Single file | Package (multiple files) |
| Parser backends | html.parser only | html.parser, lxml, html5lib |
| find / find_all | Yes | Yes |
| CSS selectors | Tag, class, id, attr, combinators, pseudo-selectors | Full (via soupsieve) |
| Tree mutation (append/insert/extract) | Yes | Yes |
| HTML serialization (to_html) | Yes | Yes (prettify) |
| NavigableString | No (plain str) | Yes |
| Parse speed (small) | 149 us | 446 us (2.99x slower) |
| Parse speed (large) | 12.7 ms | 37.1 ms (2.93x slower) |

When to use zerodep: You need basic HTML parsing (find, select, get_text) with zero dependencies and fast performance.

When to use BeautifulSoup: You need the full CSS selector spec (:nth-child(), :has(), etc.), multiple parser backends, or NavigableString features.

Benchmark

Benchmarked against beautifulsoup4 across small, medium, and large HTML documents.

See Soup Benchmark for detailed results.