XML API Reference

Auto-generated API documentation for the XML module.

`xml`

XML ↔ dict converter with fault-tolerant LLM tag extraction — zero-dep, stdlib only, Python 3.10+.

Provides xmltodict-compatible `parse` / `unparse` for bidirectional XML ↔ dict conversion, plus `extract_tags` for fault-tolerant extraction of XML-like tags from LLM output (unclosed tags, malformed nesting, streaming truncation).

Standard layer (xmltodict-compatible)::

d = parse('<root><name>Alice</name><age>30</age></root>')
# {'root': {'name': 'Alice', 'age': '30'}}

xml_str = unparse({'root': {'name': 'Alice', 'age': '30'}})
# '<?xml version="1.0" encoding="utf-8"?>\n<root><name>Alice</name><age>30</age></root>'

Lenient layer (LLM tag extraction)::

tags = extract_tags('<answer>42</answer>', 'answer')
# [ExtractedTag(tag='answer', content='42', attrs={}, is_closed=True)]

tags = extract_tags('Here is my thinking <thinking>let me reason')
# [ExtractedTag(tag='thinking', content='let me reason', attrs={}, is_closed=False)]

`XMLError`

Bases: Exception

Raised when XML parsing or serialization fails.
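
The wrapping pattern used by this module (catch a low-level error, re-raise it as `XMLError` with the original attached as the cause) can be sketched as follows. This is a minimal illustration, not the module's code: `check_well_formed` is a hypothetical helper, and `XMLError` is redefined locally so the snippet runs on its own.

```python
from xml.parsers import expat


class XMLError(Exception):
    """Local stand-in for the module's XMLError."""


def check_well_formed(document: str) -> None:
    """Hypothetical helper: raise XMLError if the document is not well-formed."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(document, True)
    except expat.ExpatError as exc:
        # "from exc" keeps the expat error reachable as __cause__.
        raise XMLError(str(exc)) from exc


try:
    check_well_formed("<a><b></a>")
except XMLError as err:
    original = err.__cause__  # the underlying expat.ExpatError
```

Callers then only need to catch one exception type, while the underlying parser error stays available for debugging.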

Source code in xml/xml.py

class XMLError(Exception):
    """Raised when XML parsing or serialization fails."""

`ParsingInterrupted`

Bases: Exception

Raised to interrupt streaming parse (future use).

Source code in xml/xml.py

class ParsingInterrupted(Exception):
    """Raised to interrupt streaming parse (future use)."""

`ExtractedTag` `dataclass`

A tag extracted from text (possibly malformed XML).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `tag` | `str` | Tag name (e.g. `"answer"`). |
| `content` | `str` | Text content between open and close tags. |
| `attrs` | `dict[str, str]` | Dictionary of attributes on the opening tag. |
| `is_closed` | `bool` | True if a matching close tag was found. |

Source code in xml/xml.py

@dataclasses.dataclass(frozen=True, slots=True)
class ExtractedTag:
    """A tag extracted from text (possibly malformed XML).

    Attributes:
        tag: Tag name (e.g. ``"answer"``).
        content: Text content between open and close tags.
        attrs: Dictionary of attributes on the opening tag.
        is_closed: True if a matching close tag was found.
    """

    tag: str
    content: str
    attrs: dict[str, str]
    is_closed: bool

`parse(xml_input, *, encoding=None, process_namespaces=False, namespace_separator=':', disable_entities=True, process_comments=False, xml_attribs=True, attr_prefix='@', cdata_key='#text', force_cdata=False, cdata_separator='', postprocessor=None, dict_constructor=dict, strip_whitespace=True, force_list=None, comment_key='#comment')`

Parse an XML document into a Python dict.

Compatible with `xmltodict.parse()`. Attributes are prefixed with `attr_prefix` (default `"@"`), text content is stored under `cdata_key` (default `"#text"`), and same-name siblings are coalesced into lists.
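
To make the output shape concrete, here is a minimal sketch of the same conventions built on the standard library's `xml.etree.ElementTree`. It is an independent illustration, not the SAX-based implementation listed further down; `to_dict` and `element_to_value` are hypothetical names, and options such as namespaces, comments, and `force_list` are left out.

```python
import xml.etree.ElementTree as ET


def element_to_value(elem, attr_prefix="@", cdata_key="#text"):
    """Convert one element to a str, None, or dict using the documented keys."""
    # Attributes become prefixed keys.
    node = {attr_prefix + name: value for name, value in elem.attrib.items()}

    for child in elem:
        value = element_to_value(child, attr_prefix, cdata_key)
        if child.tag in node:
            # Same-name siblings are coalesced into a list.
            existing = node[child.tag]
            if not isinstance(existing, list):
                node[child.tag] = existing = [existing]
            existing.append(value)
        else:
            node[child.tag] = value

    # Only leading text is considered; tail text is ignored in this sketch.
    text = (elem.text or "").strip()
    if not node:
        return text or None  # plain text element, or None when empty
    if text:
        node[cdata_key] = text
    return node


def to_dict(xml_text):
    root = ET.fromstring(xml_text)
    return {root.tag: element_to_value(root)}


simple = to_dict("<root><name>Alice</name><age>30</age></root>")
mixed = to_dict('<a id="1">hi<b>x</b><b>y</b></a>')
```

Here `simple` has the same shape as the first example in the module overview, and `mixed` shows the attribute prefix, the text key, and list coalescing together.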

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `xml_input` | `str \| bytes \| IO[bytes]` | XML string, bytes, or file-like object. | required |
| `encoding` | `str \| None` | Character encoding override. | `None` |
| `process_namespaces` | `bool` | Expand namespace URIs in element names. | `False` |
| `namespace_separator` | `str` | Separator between namespace and local name. | `':'` |
| `disable_entities` | `bool` | Block entity declarations for security (XXE). | `True` |
| `process_comments` | `bool` | Include XML comments in the output. | `False` |
| `xml_attribs` | `bool` | Include element attributes in the output. | `True` |
| `attr_prefix` | `str` | Prefix for attribute keys in the output dict. | `'@'` |
| `cdata_key` | `str` | Key for text content in the output dict. | `'#text'` |
| `force_cdata` | `bool` | Always wrap text content in a dict with cdata_key. | `False` |
| `cdata_separator` | `str` | Separator for joining multiple text nodes. | `''` |
| `postprocessor` | `Callable \| None` | Callable `(path, key, value) -> (key, value)` or None. | `None` |
| `dict_constructor` | `type` | Dict class to use (default `dict`). | `dict` |
| `strip_whitespace` | `bool` | Strip whitespace from text nodes. | `True` |
| `force_list` | `bool \| tuple[str, ...] \| Callable \| None` | Force list creation: a bool, a tuple of tag names, or a callable. | `None` |
| `comment_key` | `str` | Key for XML comments in the output dict. | `'#comment'` |
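
As an illustration of the `postprocessor` signature in the table above, the sketch below converts integer-looking text to `int`. The function name `coerce_numbers` is hypothetical, and the exact contents of `path` are not described in this reference, so the sketch ignores that argument.

```python
def coerce_numbers(path, key, value):
    """Hypothetical postprocessor: turn integer-looking strings into ints."""
    if isinstance(value, str):
        try:
            return key, int(value)
        except ValueError:
            pass  # not an integer; fall through and keep the string
    return key, value


# The callable can be exercised on its own, without the parser:
converted = coerce_numbers([], "age", "30")
untouched = coerce_numbers([], "name", "Alice")
```

It would presumably be passed as `parse(..., postprocessor=coerce_numbers)`. How and when `parse` invokes it is determined by the handler class, which this page does not show.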

Returns:

| Type | Description |
| --- | --- |
| `dict \| None` | Parsed dict, or None for empty documents. |

Raises:

| Type | Description |
| --- | --- |
| `XMLError` | If the XML is malformed, or if it declares an entity while `disable_entities` is True. |
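
The entity-blocking technique behind `disable_entities` can be sketched with the standard library's expat bindings: install an `EntityDeclHandler` that raises, so any document that declares an entity is rejected before the entity can be expanded. This is a simplified illustration; `parse_without_entities` is a hypothetical name and the sketch omits the other hardening steps in the listing below.

```python
from xml.parsers import expat


def parse_without_entities(document: str) -> bool:
    """Hypothetical helper: parse a document, rejecting entity declarations."""
    parser = expat.ParserCreate()

    def reject(*_args):
        # Called by expat for every <!ENTITY ...> declaration.
        raise ValueError("entity declarations are disabled")

    parser.EntityDeclHandler = reject
    parser.Parse(document, True)
    return True


safe = parse_without_entities("<r>ok</r>")

try:
    parse_without_entities('<!DOCTYPE r [<!ENTITY x "boom">]><r>&x;</r>')
    blocked = False
except ValueError:
    blocked = True  # the handler's exception propagates out of Parse()
```

In `parse` itself the `ValueError` is caught and re-raised as `XMLError`, which is why callers only see that one exception type.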

Source code in xml/xml.py

def parse(
    xml_input: str | bytes | IO[bytes],
    *,
    encoding: str | None = None,
    process_namespaces: bool = False,
    namespace_separator: str = ":",
    disable_entities: bool = True,
    process_comments: bool = False,
    xml_attribs: bool = True,
    attr_prefix: str = "@",
    cdata_key: str = "#text",
    force_cdata: bool = False,
    cdata_separator: str = "",
    postprocessor: Callable | None = None,
    dict_constructor: type = dict,
    strip_whitespace: bool = True,
    force_list: bool | tuple[str, ...] | Callable | None = None,
    comment_key: str = "#comment",
) -> dict | None:
    """Parse an XML document into a Python dict.

    Compatible with ``xmltodict.parse()``.  Attributes are prefixed with
    *attr_prefix* (default ``"@"``), text content is stored under *cdata_key*
    (default ``"#text"``), and same-name siblings auto-coalesce into lists.

    Args:
        xml_input: XML string, bytes, or file-like object.
        encoding: Character encoding override.
        process_namespaces: Expand namespace URIs in element names.
        namespace_separator: Separator between namespace and local name.
        disable_entities: Block entity declarations for security (XXE).
        process_comments: Include XML comments in the output.
        xml_attribs: Include element attributes in the output.
        attr_prefix: Prefix for attribute keys in the output dict.
        cdata_key: Key for text content in the output dict.
        force_cdata: Always wrap text content in a dict with *cdata_key*.
        cdata_separator: Separator for joining multiple text nodes.
        postprocessor: Callable ``(path, key, value) -> (key, value)`` or None.
        dict_constructor: Dict class to use (default ``dict``).
        strip_whitespace: Strip whitespace from text nodes.
        force_list: Force list creation — bool, tuple of tag names, or callable.
        comment_key: Key for XML comments in the output dict.

    Returns:
        Parsed dict, or None for empty documents.

    Raises:
        XMLError: If the XML is malformed, or if it declares an entity
            while *disable_entities* is True.
    """
    handler = _DictSAXHandler(
        xml_attribs=xml_attribs,
        attr_prefix=attr_prefix,
        cdata_key=cdata_key,
        force_cdata=force_cdata,
        cdata_separator=cdata_separator,
        postprocessor=postprocessor,
        dict_constructor=dict_constructor,
        strip_whitespace=strip_whitespace,
        namespace_separator=namespace_separator,
        force_list=force_list,
        comment_key=comment_key,
    )

    if process_namespaces:
        handler.namespaces = {}  # will be populated by namespace decl events

    parser = _expat.ParserCreate(
        encoding,
        namespace_separator if process_namespaces else None,
    )

    if disable_entities:
        try:
            parser.UseForeignDTD(True)
        except AttributeError:
            pass

        def _entity_decl_handler(*_args: Any) -> None:
            raise ValueError(
                "Entities are disabled (disable_entities=True). "
                "Set disable_entities=False to allow entity declarations."
            )

        parser.EntityDeclHandler = _entity_decl_handler

        try:
            feature_external_ges = _expat.XML_PARAM_ENTITY_PARSING_NEVER  # type: ignore[attr-defined]
            parser.SetParamEntityParsing(feature_external_ges)
        except AttributeError:
            pass

    parser.StartElementHandler = handler.start_element
    parser.EndElementHandler = lambda name: handler.end_element(name)
    parser.CharacterDataHandler = handler.characters

    if process_namespaces:
        parser.StartNamespaceDeclHandler = handler.start_namespace_decl

    if process_comments:
        parser.CommentHandler = handler.comments

    try:
        if isinstance(xml_input, (str, bytes)):
            # str and bytes are both parsed in one final call
            parser.Parse(xml_input, True)
        else:
            # File-like object — read in chunks
            while True:
                chunk = xml_input.read(65536)
                if not chunk:
                    parser.Parse(b"", True)
                    break
                parser.Parse(chunk, False)
    except _expat.ExpatError as exc:
        raise XMLError(str(exc)) from exc
    except ValueError as exc:
        raise XMLError(str(exc)) from exc

    return handler.item
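
The chunked-read loop for file-like input can be tried in isolation. The sketch below feeds an in-memory stream to expat in small pieces and records element names; `element_names` and the tiny chunk size are illustrative choices, not part of the module.

```python
import io
from xml.parsers import expat


def element_names(stream, chunk_size=4):
    """Hypothetical helper: parse a binary stream incrementally."""
    parser = expat.ParserCreate()
    names = []
    parser.StartElementHandler = lambda name, attrs: names.append(name)

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty final call tells expat the document is complete.
            parser.Parse(b"", True)
            break
        parser.Parse(chunk, False)
    return names


names = element_names(io.BytesIO(b"<root><a/><b><c/></b></root>"))
```

Chunk boundaries may fall in the middle of a tag. Expat buffers incomplete input between calls, so handlers still fire once per element.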

`unparse(input_dict, *, output=None, encoding='utf-8', full_document=True, short_empty_elements=False, pretty=False, indent='\t', newl='\n', attr_prefix='@', cdata_key='#text', preprocessor=None, namespace_separator=':', namespaces=None, comment_key='#comment')`

Convert a Python dict into an XML string.

Compatible with `xmltodict.unparse()`. Keys prefixed with `attr_prefix` (default `"@"`) become element attributes, and `cdata_key` (default `"#text"`) values become text content.
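
The key conventions can be illustrated with a small emitter built on the standard library's `xml.sax.saxutils.XMLGenerator`. This is a sketch under simplifying assumptions (no pretty-printing, namespaces, comments, or XML declaration); `to_xml` and `emit` are hypothetical names and do not reproduce the module's private `_emit`.

```python
import io
from xml.sax.saxutils import XMLGenerator


def emit(key, value, gen, attr_prefix="@", cdata_key="#text"):
    """Write one dict entry as an element, recursing into children."""
    if isinstance(value, list):
        # A list produces one sibling element per item.
        for item in value:
            emit(key, item, gen, attr_prefix, cdata_key)
        return
    if not isinstance(value, dict):
        value = {cdata_key: value}  # treat scalars as text-only elements

    attrs = {
        k[len(attr_prefix):]: str(v)
        for k, v in value.items()
        if k.startswith(attr_prefix)
    }
    gen.startElement(key, attrs)
    for k, v in value.items():
        if k.startswith(attr_prefix):
            continue
        if k == cdata_key:
            if v is not None:
                gen.characters(str(v))  # escapes &, <, >
        else:
            emit(k, v, gen, attr_prefix, cdata_key)
    gen.endElement(key)


def to_xml(single_root_dict):
    out = io.StringIO()
    gen = XMLGenerator(out, "utf-8")
    ((root_key, root_value),) = single_root_dict.items()
    emit(root_key, root_value, gen)
    return out.getvalue()


xml_text = to_xml({"root": {"@id": "1", "name": "Alice", "tag": ["a", "b"]}})
```

With this input, `xml_text` is `<root id="1"><name>Alice</name><tag>a</tag><tag>b</tag></root>`.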

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `input_dict` | `dict` | Dictionary to convert. | required |
| `output` | `IO[str] \| None` | File-like object to write to. If None, return a string. | `None` |
| `encoding` | `str` | Output encoding (used in XML declaration). | `'utf-8'` |
| `full_document` | `bool` | Include `<?xml …?>` declaration. | `True` |
| `short_empty_elements` | `bool` | Use `<tag/>` for empty elements. | `False` |
| `pretty` | `bool` | Pretty-print with indentation. | `False` |
| `indent` | `str` | Indentation string (used when pretty is True). | `'\t'` |
| `newl` | `str` | Newline string (used when pretty is True). | `'\n'` |
| `attr_prefix` | `str` | Prefix for attribute keys in the input dict. | `'@'` |
| `cdata_key` | `str` | Key for text content in the input dict. | `'#text'` |
| `preprocessor` | `Callable \| None` | Callable `(key, value) -> (key, value)` or None. | `None` |
| `namespace_separator` | `str` | Separator between namespace prefix and local name. | `':'` |
| `namespaces` | `dict[str, str] \| None` | Dict mapping namespace URIs to prefixes. | `None` |
| `comment_key` | `str` | Key for XML comments in the input dict. | `'#comment'` |
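
A `preprocessor` with the `(key, value) -> (key, value)` signature from the table might look like the sketch below, which converts booleans and numbers to XML-friendly strings before they are written. The name `stringify_scalars` is hypothetical, and how `unparse` applies the callable to nested values is not shown on this page.

```python
def stringify_scalars(key, value):
    """Hypothetical preprocessor: render bools and numbers as strings."""
    if isinstance(value, bool):
        # Checked before int, because bool is a subclass of int.
        return key, "true" if value else "false"
    if isinstance(value, (int, float)):
        return key, str(value)
    return key, value


flag = stringify_scalars("active", True)
number = stringify_scalars("age", 30)
```

It would presumably be supplied as `unparse(..., preprocessor=stringify_scalars)`.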

Returns:

| Type | Description |
| --- | --- |
| `str \| None` | XML string if output is None, otherwise None. |

Raises:

| Type | Description |
| --- | --- |
| `XMLError` | If the dict cannot be serialized. |
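
One case in which serialization fails is an input without exactly one root element. The check in the listing below can be reproduced as a standalone sketch; `find_root_key` is a hypothetical name, and it raises `ValueError` where the module raises `XMLError`.

```python
def find_root_key(input_dict, attr_prefix="@", comment_key="#comment"):
    """Hypothetical helper: return the single root key or raise ValueError."""
    if not isinstance(input_dict, dict):
        raise ValueError(f"Expected dict, got {type(input_dict).__name__}")

    # Attribute keys and the comment key do not count as root elements.
    root_keys = [
        k for k in input_dict if not k.startswith(attr_prefix) and k != comment_key
    ]
    if len(root_keys) != 1:
        raise ValueError(
            f"Expected exactly one root element, got {len(root_keys)}: {root_keys}"
        )
    return root_keys[0]


root = find_root_key({"#comment": "generated", "root": {"name": "Alice"}})
```

A dict such as `{"a": 1, "b": 2}` has two candidate roots and is rejected, as is an empty dict.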

Source code in xml/xml.py

def unparse(
    input_dict: dict,
    *,
    output: IO[str] | None = None,
    encoding: str = "utf-8",
    full_document: bool = True,
    short_empty_elements: bool = False,
    pretty: bool = False,
    indent: str = "\t",
    newl: str = "\n",
    attr_prefix: str = "@",
    cdata_key: str = "#text",
    preprocessor: Callable | None = None,
    namespace_separator: str = ":",
    namespaces: dict[str, str] | None = None,
    comment_key: str = "#comment",
) -> str | None:
    """Convert a Python dict into an XML string.

    Compatible with ``xmltodict.unparse()``.  Keys prefixed with *attr_prefix*
    (default ``"@"``) become element attributes, *cdata_key* (default
    ``"#text"``) values become text content.

    Args:
        input_dict: Dictionary to convert.
        output: File-like object to write to.  If None, return a string.
        encoding: Output encoding (used in XML declaration).
        full_document: Include ``<?xml …?>`` declaration.
        short_empty_elements: Use ``<tag/>`` for empty elements.
        pretty: Pretty-print with indentation.
        indent: Indentation string (used when *pretty* is True).
        newl: Newline string (used when *pretty* is True).
        attr_prefix: Prefix for attribute keys in the input dict.
        cdata_key: Key for text content in the input dict.
        preprocessor: Callable ``(key, value) -> (key, value)`` or None.
        namespace_separator: Separator between namespace prefix and local name.
        namespaces: Dict mapping namespace URIs to prefixes.
        comment_key: Key for XML comments in the input dict.

    Returns:
        XML string if *output* is None, otherwise None.

    Raises:
        XMLError: If the dict cannot be serialized.
    """
    if not isinstance(input_dict, dict):
        raise XMLError(f"Expected dict, got {type(input_dict).__name__}")

    # Must have exactly one root key (excluding comment/attr keys)
    root_keys = [
        k for k in input_dict if not k.startswith(attr_prefix) and k != comment_key
    ]
    if len(root_keys) != 1:
        raise XMLError(
            f"Expected exactly one root element, got {len(root_keys)}: {root_keys}"
        )

    must_return = output is None
    if must_return:
        output = io.StringIO()

    content_handler = _XMLGen(output, encoding, short_empty_elements)
    if full_document:
        content_handler.startDocument()
    if pretty:
        content_handler.ignorableWhitespace(newl)

    opts = _EmitOpts(
        attr_prefix=attr_prefix,
        cdata_key=cdata_key,
        preprocessor=preprocessor,
        pretty=pretty,
        newl=newl,
        indent=indent,
        namespace_separator=namespace_separator,
        namespaces=namespaces,
        full_document=full_document,
        comment_key=comment_key,
    )

    root_key = root_keys[0]
    _emit(root_key, input_dict[root_key], content_handler, opts=opts, depth=0)

    if must_return:
        return output.getvalue()  # type: ignore[union-attr]
    return None

`extract_tags(text, tag=None, *, first_only=False)`

Extract XML-like tags from text, tolerating malformed XML.

Designed for extracting structured tags from LLM output where the XML may be incomplete, improperly nested, or truncated mid-stream.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `text` | `str` | Raw text containing XML-like tags. | required |
| `tag` | `str \| None` | If provided, only extract tags with this name. If None, extract all top-level tags found. | `None` |
| `first_only` | `bool` | If True, return after finding the first match. | `False` |

Returns:

| Type | Description |
| --- | --- |
| `list[ExtractedTag]` | List of `ExtractedTag` objects. |

Example::

>>> extract_tags('<answer>42</answer>', 'answer')
[ExtractedTag(tag='answer', content='42', attrs={}, is_closed=True)]

>>> extract_tags('Thinking... <thought>hmm')
[ExtractedTag(tag='thought', content='hmm', attrs={}, is_closed=False)]
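
The lenient approach (find an opening tag with a regular expression, then scan forward for the matching close while counting nested tags of the same name) can be sketched as follows. The module's private `_OPEN_TAG_RE`, `_find_closing`, and `_parse_attrs` are not shown in this reference, so the patterns and the names `OPEN_TAG`, `find_closing`, and `extract` below are assumptions made for illustration. Attributes are ignored and results are plain tuples.

```python
import re

# Assumed pattern: tag name, optional attribute text, optional "/" before ">".
OPEN_TAG = re.compile(r"<([A-Za-z_][\w.-]*)((?:\s[^<>]*?)?)(/?)>")


def find_closing(text, start, tag):
    """Return (content, is_closed) for the element whose body begins at start."""
    pattern = re.compile(rf"<(/?){re.escape(tag)}(?:\s[^<>]*?)?(/?)>")
    depth = 1
    for m in pattern.finditer(text, start):
        if m.group(1):  # closing tag
            depth -= 1
            if depth == 0:
                return text[start:m.start()], True
        elif not m.group(2):  # nested opening tag that is not self-closing
            depth += 1
    # No matching close: treat the rest of the text as the content.
    return text[start:], False


def extract(text, tag=None):
    """Return (name, content, is_closed) tuples for each opening tag found."""
    results = []
    for m in OPEN_TAG.finditer(text):
        name = m.group(1)
        if tag is not None and name != tag:
            continue
        if m.group(3) == "/":
            results.append((name, "", True))  # self-closing tag
        else:
            content, closed = find_closing(text, m.end(), name)
            results.append((name, content, closed))
    return results


closed = extract("<answer>42</answer>", "answer")
truncated = extract("Thinking... <thought>hmm")
```

An unclosed tag simply consumes the remainder of the text, which is what makes this style of extraction usable on output that was cut off mid-stream.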

Source code in xml/xml.py

def extract_tags(
    text: str,
    tag: str | None = None,
    *,
    first_only: bool = False,
) -> list[ExtractedTag]:
    """Extract XML-like tags from text, tolerating malformed XML.

    Designed for extracting structured tags from LLM output where the XML
    may be incomplete, improperly nested, or truncated mid-stream.

    Args:
        text: Raw text containing XML-like tags.
        tag: If provided, only extract tags with this name.  If None,
            extract all top-level tags found.
        first_only: If True, return after finding the first match.

    Returns:
        List of ``ExtractedTag`` objects.

    Example::

        >>> extract_tags('<answer>42</answer>', 'answer')
        [ExtractedTag(tag='answer', content='42', attrs={}, is_closed=True)]

        >>> extract_tags('Thinking... <thought>hmm')
        [ExtractedTag(tag='thought', content='hmm', attrs={}, is_closed=False)]
    """
    results: list[ExtractedTag] = []

    for m in _OPEN_TAG_RE.finditer(text):
        tag_name = m.group(1)
        attr_str = m.group(2)
        self_closing = m.group(3) == "/"

        if tag is not None and tag_name != tag:
            continue

        attrs = _parse_attrs(attr_str) if attr_str.strip() else {}

        if self_closing:
            results.append(
                ExtractedTag(
                    tag=tag_name,
                    content="",
                    attrs=attrs,
                    is_closed=True,
                )
            )
        else:
            # Search for matching closing tag with depth counting
            content_start = m.end()
            content, is_closed = _find_closing(text, content_start, tag_name)
            results.append(
                ExtractedTag(
                    tag=tag_name,
                    content=content,
                    attrs=attrs,
                    is_closed=is_closed,
                )
            )

        if first_only:
            break

    return results
