XML API Reference

Auto-generated API documentation for the XML module.

xml

XML ↔ dict converter with fault-tolerant LLM tag extraction — zero-dep, stdlib only, Python 3.10+.

Part of zerodep: https://github.com/Oaklight/zerodep
Copyright (c) 2026 Peng Ding. MIT License.

Provides xmltodict-compatible parse / unparse for bidirectional XML ↔ dict conversion, plus extract_tags for fault-tolerant extraction of XML-like tags from LLM output (unclosed tags, malformed nesting, streaming truncation).

Standard layer (xmltodict-compatible)::

    d = parse('<root><name>Alice</name><age>30</age></root>')
    # {'root': {'name': 'Alice', 'age': '30'}}

    xml_str = unparse({'root': {'name': 'Alice', 'age': '30'}})
    # '<?xml version="1.0" encoding="utf-8"?>\n<root><name>Alice</name><age>30</age></root>'

Lenient layer (LLM tag extraction)::

    tags = extract_tags('<answer>42</answer>', 'answer')
    # [ExtractedTag(tag='answer', content='42', attrs={}, is_closed=True)]

    tags = extract_tags('Here is my thinking <thinking>let me reason')
    # [ExtractedTag(tag='thinking', content='let me reason', attrs={}, is_closed=False)]

XMLError

Bases: Exception

Raised when XML parsing or serialization fails.
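
The parse() source later on this page turns low-level expat failures into XMLError by re-raising with `from exc`. The following is a minimal sketch of that pattern, not the module's own code: the XMLError class here is a local stand-in and `check_well_formed` is a hypothetical helper, both defined only so the example runs with the standard library alone.

```python
from xml.parsers import expat


class XMLError(Exception):
    """Local stand-in for the module's XMLError, defined here for the sketch."""


def check_well_formed(xml_text: str) -> None:
    """Hypothetical helper: raise XMLError if xml_text is not well-formed."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except expat.ExpatError as exc:
        # Chain the original error so the expat details stay reachable
        # through __cause__.
        raise XMLError(str(exc)) from exc


try:
    check_well_formed("<a><b></a>")  # mismatched close tag
    caught = None
except XMLError as err:
    caught = err
```

Callers can then catch a single exception type while still inspecting `caught.__cause__` when the parser's own diagnostics are needed.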

Source code in xml/xml.py
class XMLError(Exception):
    """Raised when XML parsing or serialization fails."""

ParsingInterrupted

Bases: Exception

Raised to interrupt a streaming parse (reserved for future use).

Source code in xml/xml.py
class ParsingInterrupted(Exception):
    """Raised to interrupt a streaming parse (reserved for future use)."""

ExtractedTag dataclass

A tag extracted from text (possibly malformed XML).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| tag | str | Tag name (e.g. "answer"). |
| content | str | Text content between open and close tags. |
| attrs | dict[str, str] | Dictionary of attributes on the opening tag. |
| is_closed | bool | True if a matching close tag was found. |

Source code in xml/xml.py
@dataclasses.dataclass(frozen=True, slots=True)
class ExtractedTag:
    """A tag extracted from text (possibly malformed XML).

    Attributes:
        tag: Tag name (e.g. ``"answer"``).
        content: Text content between open and close tags.
        attrs: Dictionary of attributes on the opening tag.
        is_closed: True if a matching close tag was found.
    """

    tag: str
    content: str
    attrs: dict[str, str]
    is_closed: bool

parse(xml_input, *, encoding=None, process_namespaces=False, namespace_separator=':', disable_entities=True, process_comments=False, xml_attribs=True, attr_prefix='@', cdata_key='#text', force_cdata=False, cdata_separator='', postprocessor=None, dict_constructor=dict, strip_whitespace=True, force_list=None, comment_key='#comment')

Parse an XML document into a Python dict.

Compatible with xmltodict.parse(). Attributes are prefixed with attr_prefix (default "@"), text content is stored under cdata_key (default "#text"), and same-name siblings auto-coalesce into lists.
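
To make the three conventions above concrete, here is a deliberately simplified sketch built directly on the standard library's expat bindings. `tiny_parse` is a hypothetical stand-in, not the module's implementation: it handles only attribute prefixing, text under the text key, and list coalescing, and it ignores namespaces, comments, and every other option.

```python
from xml.parsers import expat


def tiny_parse(xml_text, attr_prefix="@", cdata_key="#text"):
    """Simplified stand-in for parse(): attributes, text, and sibling lists only."""
    # Each stack entry is (element name, dict of attrs/children, text chunks).
    stack = [("", {}, [])]

    def start(name, attrs):
        item = {attr_prefix + k: v for k, v in attrs.items()}
        stack.append((name, item, []))

    def chars(data):
        stack[-1][2].append(data)

    def end(name):
        _, item, chunks = stack.pop()
        text = "".join(chunks).strip() or None
        if item:
            # Element has attributes or children: text goes under cdata_key.
            if text is not None:
                item[cdata_key] = text
            value = item
        else:
            value = text
        parent = stack[-1][1]
        if name in parent:
            # Same-name sibling: promote to a list, or extend the existing one.
            existing = parent[name]
            if isinstance(existing, list):
                existing.append(value)
            else:
                parent[name] = [existing, value]
        else:
            parent[name] = value

    parser = expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return stack[0][1]


result = tiny_parse('<root><item id="1">a</item><item>b</item></root>')
# result == {'root': {'item': [{'@id': '1', '#text': 'a'}, 'b']}}
```

The first `item` carries an attribute, so it becomes a dict with `@id` and `#text`; the second has only text, so it stays a plain string; and because both share a name they end up in one list.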

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| xml_input | str \| bytes \| IO[bytes] | XML string, bytes, or file-like object. | required |
| encoding | str \| None | Character encoding override. | None |
| process_namespaces | bool | Expand namespace URIs in element names. | False |
| namespace_separator | str | Separator between namespace and local name. | ':' |
| disable_entities | bool | Block entity declarations for security (XXE). | True |
| process_comments | bool | Include XML comments in the output. | False |
| xml_attribs | bool | Include element attributes in the output. | True |
| attr_prefix | str | Prefix for attribute keys in the output dict. | '@' |
| cdata_key | str | Key for text content in the output dict. | '#text' |
| force_cdata | bool | Always wrap text content in a dict with cdata_key. | False |
| cdata_separator | str | Separator for joining multiple text nodes. | '' |
| postprocessor | Callable \| None | Callable (path, key, value) -> (key, value) or None. | None |
| dict_constructor | type | Dict class to use (default dict). | dict |
| strip_whitespace | bool | Strip whitespace from text nodes. | True |
| force_list | bool \| tuple[str, ...] \| Callable \| None | Force list creation: bool, tuple of tag names, or callable. | None |
| comment_key | str | Key for XML comments in the output dict. | '#comment' |
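
The force_list option accepts three different shapes, which is easiest to see as a small dispatch function. The sketch below is one plausible way such an option could be resolved; `should_force_list` is a hypothetical helper, and the `(path, key, value)` argument list for the callable form is an assumption borrowed from xmltodict, since this page does not spell it out.

```python
def should_force_list(force_list, path, key, value):
    """Hypothetical resolver for a force_list-style option."""
    if force_list is None:
        return False
    if isinstance(force_list, bool):
        # True forces every element into a list, False forces none.
        return force_list
    if callable(force_list):
        # Assumed signature (path, key, value), as in xmltodict.
        return bool(force_list(path, key, value))
    # Otherwise treat it as a collection of tag names.
    return key in force_list


by_name = should_force_list(("item",), [], "item", "a")
by_flag = should_force_list(True, [], "anything", "a")
by_rule = should_force_list(lambda path, key, value: key.endswith("s"), [], "rows", "a")
```

The practical reason for such an option is that a lone `<item>` child would otherwise come back as a scalar while two siblings come back as a list, which forces callers to handle both shapes.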

Returns:

| Type | Description |
| --- | --- |
| dict \| None | Parsed dict, or None for empty documents. |

Raises:

| Type | Description |
| --- | --- |
| XMLError | If the XML is malformed. |
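
Since parse() returns every text value as a string, a postprocessor is the natural place for type conversion. The function below is a sketch of a callable matching the documented `(path, key, value) -> (key, value)` shape; the name `int_postprocessor` is made up for illustration, and because this page does not describe what `path` contains, the demonstration calls pass an empty list as a placeholder.

```python
def int_postprocessor(path, key, value):
    """Convert ASCII digit-only text values to int; leave everything else alone."""
    if isinstance(value, str) and value.isascii() and value.isdigit():
        return key, int(value)
    return key, value


# Called directly here; in real use it would presumably be passed as
# parse(xml_input, postprocessor=int_postprocessor).
converted = int_postprocessor([], "age", "30")
unchanged = int_postprocessor([], "name", "Alice")
```

The `isascii()` guard is there because `str.isdigit()` also accepts characters such as superscript digits, which `int()` rejects.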

Source code in xml/xml.py
def parse(
    xml_input: str | bytes | IO[bytes],
    *,
    encoding: str | None = None,
    process_namespaces: bool = False,
    namespace_separator: str = ":",
    disable_entities: bool = True,
    process_comments: bool = False,
    xml_attribs: bool = True,
    attr_prefix: str = "@",
    cdata_key: str = "#text",
    force_cdata: bool = False,
    cdata_separator: str = "",
    postprocessor: Callable | None = None,
    dict_constructor: type = dict,
    strip_whitespace: bool = True,
    force_list: bool | tuple[str, ...] | Callable | None = None,
    comment_key: str = "#comment",
) -> dict | None:
    """Parse an XML document into a Python dict.

    Compatible with ``xmltodict.parse()``.  Attributes are prefixed with
    *attr_prefix* (default ``"@"``), text content is stored under *cdata_key*
    (default ``"#text"``), and same-name siblings auto-coalesce into lists.

    Args:
        xml_input: XML string, bytes, or file-like object.
        encoding: Character encoding override.
        process_namespaces: Expand namespace URIs in element names.
        namespace_separator: Separator between namespace and local name.
        disable_entities: Block entity declarations for security (XXE).
        process_comments: Include XML comments in the output.
        xml_attribs: Include element attributes in the output.
        attr_prefix: Prefix for attribute keys in the output dict.
        cdata_key: Key for text content in the output dict.
        force_cdata: Always wrap text content in a dict with *cdata_key*.
        cdata_separator: Separator for joining multiple text nodes.
        postprocessor: Callable ``(path, key, value) -> (key, value)`` or None.
        dict_constructor: Dict class to use (default ``dict``).
        strip_whitespace: Strip whitespace from text nodes.
        force_list: Force list creation — bool, tuple of tag names, or callable.
        comment_key: Key for XML comments in the output dict.

    Returns:
        Parsed dict, or None for empty documents.

    Raises:
        XMLError: If the XML is malformed.
    """
    handler = _DictSAXHandler(
        xml_attribs=xml_attribs,
        attr_prefix=attr_prefix,
        cdata_key=cdata_key,
        force_cdata=force_cdata,
        cdata_separator=cdata_separator,
        postprocessor=postprocessor,
        dict_constructor=dict_constructor,
        strip_whitespace=strip_whitespace,
        namespace_separator=namespace_separator,
        force_list=force_list,
        comment_key=comment_key,
    )

    if process_namespaces:
        handler.namespaces = {}  # will be populated by namespace decl events

    parser = _expat.ParserCreate(
        encoding,
        namespace_separator if process_namespaces else None,
    )

    if disable_entities:
        try:
            parser.UseForeignDTD(True)
        except AttributeError:
            pass

        def _entity_decl_handler(*_args: Any) -> None:
            raise ValueError(
                "Entities are disabled (disable_entities=True). "
                "Set disable_entities=False to allow entity declarations."
            )

        parser.EntityDeclHandler = _entity_decl_handler

        try:
            feature_external_ges = _expat.XML_PARAM_ENTITY_PARSING_NEVER  # type: ignore[attr-defined]
            parser.SetParamEntityParsing(feature_external_ges)
        except AttributeError:
            pass

    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = lambda name: handler.end_element(name)
    parser.CharacterDataHandler = handler.characters

    if process_namespaces:
        parser.StartNamespaceDeclHandler = handler.start_namespace_decl

    if process_comments:
        parser.CommentHandler = handler.comments

    try:
        if isinstance(xml_input, str):
            parser.Parse(xml_input, True)
        elif isinstance(xml_input, bytes):
            parser.Parse(xml_input, True)
        else:
            # File-like object — read in chunks
            while True:
                chunk = xml_input.read(65536)
                if not chunk:
                    parser.Parse(b"", True)
                    break
                parser.Parse(chunk, False)
    except _expat.ExpatError as exc:
        raise XMLError(str(exc)) from exc
    except ValueError as exc:
        raise XMLError(str(exc)) from exc

    return handler.item
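
The disable_entities branch above works by installing an entity declaration handler that raises, so the parse aborts as soon as a DTD tries to define an entity. The sketch below isolates that one mechanism using expat directly; `parse_without_entities` is a hypothetical helper, and it omits the `UseForeignDTD` and `SetParamEntityParsing` calls from the real code.

```python
from xml.parsers import expat


def parse_without_entities(xml_text: str) -> bool:
    """Hypothetical helper: parse xml_text, refusing any entity declaration."""
    parser = expat.ParserCreate()

    def reject_entity(*_args):
        # Raising inside a handler propagates out of parser.Parse().
        raise ValueError("entity declarations are disabled")

    parser.EntityDeclHandler = reject_entity
    parser.Parse(xml_text, True)
    return True


doc_with_entity = '<!DOCTYPE r [<!ENTITY a "aaaa">]><r>&a;</r>'
try:
    parse_without_entities(doc_with_entity)
    blocked = False
except ValueError:
    blocked = True

plain_ok = parse_without_entities("<r>ok</r>")
```

Rejecting the declaration itself, rather than the later `&a;` reference, stops entity-expansion attacks before any expansion can happen.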

unparse(input_dict, *, output=None, encoding='utf-8', full_document=True, short_empty_elements=False, pretty=False, indent='\t', newl='\n', attr_prefix='@', cdata_key='#text', preprocessor=None, namespace_separator=':', namespaces=None, comment_key='#comment')

Convert a Python dict into an XML string.

Compatible with xmltodict.unparse(). Keys prefixed with attr_prefix (default "@") become element attributes, cdata_key (default "#text") values become text content.
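
The reverse mapping can be sketched with the standard library's `XMLGenerator`. This is an illustrative simplification, not the module's `_emit`: the `emit` function below is hypothetical and covers only attribute keys, the text key, nested elements, and lists of same-name siblings.

```python
import io
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def emit(gen, key, value, attr_prefix="@", cdata_key="#text"):
    """Hypothetical emitter: write one dict entry as an XML element."""
    if isinstance(value, list):
        # A list means repeated sibling elements with the same name.
        for entry in value:
            emit(gen, key, entry, attr_prefix, cdata_key)
        return
    if not isinstance(value, dict):
        # Scalars are shorthand for an element with text only.
        value = {cdata_key: value}
    attrs = {
        k[len(attr_prefix):]: str(v)
        for k, v in value.items()
        if k.startswith(attr_prefix)
    }
    gen.startElement(key, AttributesImpl(attrs))
    for k, v in value.items():
        if k.startswith(attr_prefix):
            continue
        if k == cdata_key:
            if v is not None:
                gen.characters(str(v))
        else:
            emit(gen, k, v, attr_prefix, cdata_key)
    gen.endElement(key)


out = io.StringIO()
gen = XMLGenerator(out, encoding="utf-8", short_empty_elements=False)
emit(gen, "root", {"item": [{"@id": "1", "#text": "a"}, "b"]})
xml_str = out.getvalue()
# xml_str == '<root><item id="1">a</item><item>b</item></root>'
```

No XML declaration appears because the sketch never calls `startDocument()`, which is the step the full_document option controls in the real function.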

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| input_dict | dict | Dictionary to convert. | required |
| output | IO[str] \| None | File-like object to write to. If None, return a string. | None |
| encoding | str | Output encoding (used in XML declaration). | 'utf-8' |
| full_document | bool | Include <?xml …?> declaration. | True |
| short_empty_elements | bool | Use <tag/> for empty elements. | False |
| pretty | bool | Pretty-print with indentation. | False |
| indent | str | Indentation string (used when pretty is True). | '\t' |
| newl | str | Newline string (used when pretty is True). | '\n' |
| attr_prefix | str | Prefix for attribute keys in the input dict. | '@' |
| cdata_key | str | Key for text content in the input dict. | '#text' |
| preprocessor | Callable \| None | Callable (key, value) -> (key, value) or None. | None |
| namespace_separator | str | Separator between namespace prefix and local name. | ':' |
| namespaces | dict[str, str] \| None | Dict mapping namespace URIs to prefixes. | None |
| comment_key | str | Key for XML comments in the input dict. | '#comment' |

Returns:

| Type | Description |
| --- | --- |
| str \| None | XML string if output is None, otherwise None. |

Raises:

| Type | Description |
| --- | --- |
| XMLError | If the dict cannot be serialized. |

Source code in xml/xml.py
def unparse(
    input_dict: dict,
    *,
    output: IO[str] | None = None,
    encoding: str = "utf-8",
    full_document: bool = True,
    short_empty_elements: bool = False,
    pretty: bool = False,
    indent: str = "\t",
    newl: str = "\n",
    attr_prefix: str = "@",
    cdata_key: str = "#text",
    preprocessor: Callable | None = None,
    namespace_separator: str = ":",
    namespaces: dict[str, str] | None = None,
    comment_key: str = "#comment",
) -> str | None:
    """Convert a Python dict into an XML string.

    Compatible with ``xmltodict.unparse()``.  Keys prefixed with *attr_prefix*
    (default ``"@"``) become element attributes, *cdata_key* (default
    ``"#text"``) values become text content.

    Args:
        input_dict: Dictionary to convert.
        output: File-like object to write to.  If None, return a string.
        encoding: Output encoding (used in XML declaration).
        full_document: Include ``<?xml …?>`` declaration.
        short_empty_elements: Use ``<tag/>`` for empty elements.
        pretty: Pretty-print with indentation.
        indent: Indentation string (used when *pretty* is True).
        newl: Newline string (used when *pretty* is True).
        attr_prefix: Prefix for attribute keys in the input dict.
        cdata_key: Key for text content in the input dict.
        preprocessor: Callable ``(key, value) -> (key, value)`` or None.
        namespace_separator: Separator between namespace prefix and local name.
        namespaces: Dict mapping namespace URIs to prefixes.
        comment_key: Key for XML comments in the input dict.

    Returns:
        XML string if *output* is None, otherwise None.

    Raises:
        XMLError: If the dict cannot be serialized.
    """
    if not isinstance(input_dict, dict):
        raise XMLError(f"Expected dict, got {type(input_dict).__name__}")

    # Must have exactly one root key (excluding comment/attr keys)
    root_keys = [
        k for k in input_dict if not k.startswith(attr_prefix) and k != comment_key
    ]
    if len(root_keys) != 1:
        raise XMLError(
            f"Expected exactly one root element, got {len(root_keys)}: {root_keys}"
        )

    must_return = output is None
    if must_return:
        output = io.StringIO()

    content_handler = _XMLGen(output, encoding, short_empty_elements)
    if full_document:
        content_handler.startDocument()
    if pretty:
        content_handler.ignorableWhitespace(newl)

    opts = _EmitOpts(
        attr_prefix=attr_prefix,
        cdata_key=cdata_key,
        preprocessor=preprocessor,
        pretty=pretty,
        newl=newl,
        indent=indent,
        namespace_separator=namespace_separator,
        namespaces=namespaces,
        full_document=full_document,
        comment_key=comment_key,
    )

    root_key = root_keys[0]
    _emit(root_key, input_dict[root_key], content_handler, opts=opts, depth=0)

    if must_return:
        return output.getvalue()  # type: ignore[union-attr]
    return None
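
One rule is visible only in the code above: the input must contain exactly one root key once attribute-prefixed keys and the comment key are set aside. The sketch below restates that check in isolation. `single_root_key` is a hypothetical helper, and it raises ValueError as a stand-in for the module's XMLError.

```python
def single_root_key(input_dict, attr_prefix="@", comment_key="#comment"):
    """Hypothetical helper mirroring the root-key check in unparse()."""
    candidates = [
        k for k in input_dict
        if not k.startswith(attr_prefix) and k != comment_key
    ]
    if len(candidates) != 1:
        # The real function raises XMLError here; ValueError is a stand-in.
        raise ValueError(
            f"expected exactly one root element, got {len(candidates)}: {candidates}"
        )
    return candidates[0]


root = single_root_key({"#comment": "generated", "root": {"name": "Alice"}})

try:
    single_root_key({"a": 1, "b": 2})
    two_roots_rejected = False
except ValueError:
    two_roots_rejected = True
```

A top-level comment next to the root is therefore accepted, while two sibling elements at the top level are not, matching the XML requirement of a single document element.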

extract_tags(text, tag=None, *, first_only=False)

Extract XML-like tags from text, tolerating malformed XML.

Designed for extracting structured tags from LLM output where the XML may be incomplete, improperly nested, or truncated mid-stream.
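
When the text comes from a stream that may have been cut off, the is_closed flag tells the caller which results are final. The sketch below shows one way a caller might separate the two groups. The ExtractedTag class is a local stand-in, the sample list is handwritten in the shape this page documents, and `split_by_completeness` is a hypothetical helper.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class ExtractedTag:
    """Local stand-in with the same fields as the module's ExtractedTag."""
    tag: str
    content: str
    attrs: dict
    is_closed: bool


def split_by_completeness(tags):
    """Hypothetical helper: separate finished tags from truncated ones."""
    complete = [t for t in tags if t.is_closed]
    partial = [t for t in tags if not t.is_closed]
    return complete, partial


# Handwritten sample standing in for an extract_tags() result.
sample = [
    ExtractedTag(tag="answer", content="42", attrs={}, is_closed=True),
    ExtractedTag(tag="thinking", content="let me reason", attrs={}, is_closed=False),
]
complete, partial = split_by_completeness(sample)
```

A streaming consumer could act on `complete` immediately and re-run extraction on the growing buffer until `partial` is empty.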

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| text | str | Raw text containing XML-like tags. | required |
| tag | str \| None | If provided, only extract tags with this name. If None, extract all top-level tags found. | None |
| first_only | bool | If True, return after finding the first match. | False |

Returns:

| Type | Description |
| --- | --- |
| list[ExtractedTag] | List of ExtractedTag objects. |

Example::

    >>> extract_tags('<answer>42</answer>', 'answer')
    [ExtractedTag(tag='answer', content='42', attrs={}, is_closed=True)]

    >>> extract_tags('Thinking... <thought>hmm')
    [ExtractedTag(tag='thought', content='hmm', attrs={}, is_closed=False)]

Source code in xml/xml.py
def extract_tags(
    text: str,
    tag: str | None = None,
    *,
    first_only: bool = False,
) -> list[ExtractedTag]:
    """Extract XML-like tags from text, tolerating malformed XML.

    Designed for extracting structured tags from LLM output where the XML
    may be incomplete, improperly nested, or truncated mid-stream.

    Args:
        text: Raw text containing XML-like tags.
        tag: If provided, only extract tags with this name.  If None,
            extract all top-level tags found.
        first_only: If True, return after finding the first match.

    Returns:
        List of ``ExtractedTag`` objects.

    Example::

        >>> extract_tags('<answer>42</answer>', 'answer')
        [ExtractedTag(tag='answer', content='42', attrs={}, is_closed=True)]

        >>> extract_tags('Thinking... <thought>hmm')
        [ExtractedTag(tag='thought', content='hmm', attrs={}, is_closed=False)]
    """
    results: list[ExtractedTag] = []

    for m in _OPEN_TAG_RE.finditer(text):
        tag_name = m.group(1)
        attr_str = m.group(2)
        self_closing = m.group(3) == "/"

        if tag is not None and tag_name != tag:
            continue

        attrs = _parse_attrs(attr_str) if attr_str.strip() else {}

        if self_closing:
            results.append(
                ExtractedTag(
                    tag=tag_name,
                    content="",
                    attrs=attrs,
                    is_closed=True,
                )
            )
        else:
            # Search for matching closing tag with depth counting
            content_start = m.end()
            content, is_closed = _find_closing(text, content_start, tag_name)
            results.append(
                ExtractedTag(
                    tag=tag_name,
                    content=content,
                    attrs=attrs,
                    is_closed=is_closed,
                )
            )

        if first_only:
            break

    return results
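
The helper `_find_closing` is not shown on this page; the code above only says it searches for the matching closing tag with depth counting. The sketch below is a guess at what such a search could look like, written from that comment alone. `find_closing` and its regular expression are assumptions for illustration and may differ from the module's actual helper.

```python
import re


def find_closing(text, start, name):
    """Hypothetical depth-counting search; returns (content, is_closed)."""
    # Matches <name ...>, </name>, and <name ... /> for this tag name only.
    token = re.compile(rf"<(/?){re.escape(name)}(?:\s[^<>]*?)?(/?)>")
    depth = 1
    for m in token.finditer(text, start):
        if m.group(2) == "/":
            continue  # self-closing tag: depth is unchanged
        if m.group(1) == "/":
            depth -= 1
            if depth == 0:
                return text[start:m.start()], True
        else:
            depth += 1
    # Ran out of text before the tag was closed (e.g. truncated output).
    return text[start:], False


nested = find_closing("<a>x<a>y</a>z</a> tail", 3, "a")
truncated = find_closing("<thought>hmm", 9, "thought")
```

Counting depth is what lets the outer `<a>` keep its inner `<a>y</a>` as content instead of stopping at the first `</a>`, and falling through to the end of the text is what yields `is_closed=False` for truncated output.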