@rgrove/parse-xml

rgrove | 444.6k | ISC | 4.2.0 | TypeScript support: included

xml, xml parser, parse-xml, parse xml, parse, parser

readme

parse-xml

A fast, safe, compliant XML parser for Node.js and browsers.

Installation

npm install @rgrove/parse-xml

Or, if you like living dangerously, you can load the minified bundle in a browser via unpkg and use the parseXml global.

Features

  • Returns a convenient object tree representing an XML document.

  • Works great in Node.js and browsers.

  • Provides helpful, detailed error messages with context when a document is not well-formed.

  • Mostly conforms to XML 1.0 (Fifth Edition) as a non-validating parser (see below for details).

  • Passes all relevant tests in the XML Conformance Test Suite.

  • Written in TypeScript and compiled to ES2020 JavaScript for Node.js and ES2017 JavaScript for browsers. The browser build is also optimized for minification.

  • Extremely fast and surprisingly small.

  • Zero dependencies.

Not Features

While this parser is capable of parsing document type declarations (<!DOCTYPE ... >) and including them in the node tree, it doesn't actually do anything with them. External document type definitions won't be loaded, and the parser won't validate the document against a DTD or resolve custom entity references defined in a DTD.

In addition, the only supported character encoding is UTF-8 because it's not feasible (or useful) to support other character encodings in JavaScript.
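
If your XML arrives as raw bytes rather than as a string (from a file or a network response, say), it has to be decoded before it's handed to the parser. Here's a minimal sketch of one way to do that. Note that decodeUtf8 is a hypothetical helper, not part of parse-xml; it relies on the standard TextDecoder API and assumes the bytes are meant to be UTF-8.

```javascript
// Hypothetical helper (not part of parse-xml): decode raw bytes as UTF-8,
// throwing instead of silently substituting U+FFFD for invalid sequences.
function decodeUtf8(bytes) {
  const decoder = new TextDecoder('utf-8', { fatal: true });
  return decoder.decode(bytes);
}

// '<a/>' encoded as UTF-8 bytes.
const xmlText = decodeUtf8(new Uint8Array([0x3c, 0x61, 0x2f, 0x3e]));
console.log(xmlText); // prints "<a/>"
```

The resulting string can then be passed to parseXml() as usual.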

Examples

Basic Usage

ESM

import { parseXml } from '@rgrove/parse-xml';
parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');

CommonJS

const { parseXml } = require('@rgrove/parse-xml');
parseXml('<kittens fuzzy="yes">I like fuzzy kittens.</kittens>');

The result is an XmlDocument instance containing the parsed document, with a structure that looks like this (some properties and methods are excluded for clarity; see the API docs for details):

{
  type: 'document',
  children: [
    {
      type: 'element',
      name: 'kittens',
      attributes: {
        fuzzy: 'yes'
      },
      children: [
        {
          type: 'text',
          text: 'I like fuzzy kittens.'
        }
      ],
      parent: { ... },
      isRootNode: true
    }
  ]
}

All parse-xml objects have toJSON() methods that return JSON-serializable objects, so you can easily convert an XML document to JSON:

let json = JSON.stringify(parseXml(xml));
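
Since the tree is just nested objects with type and children properties, walking it takes only a few lines. The following is a rough sketch that assumes nothing beyond the structure shown above; collectText is a hypothetical helper, not something parse-xml exports, and it works on a plain-object copy of the example tree so that it runs on its own.

```javascript
// Hypothetical helper (not part of parse-xml): concatenate the text of every
// 'text' node beneath the given node, depth-first.
function collectText(node) {
  if (node.type === 'text') {
    return node.text;
  }
  // Documents and elements carry a `children` array; other nodes may not.
  return (node.children || []).map(collectText).join('');
}

// A plain-object tree shaped like the example output above.
const tree = {
  type: 'document',
  children: [
    {
      type: 'element',
      name: 'kittens',
      attributes: { fuzzy: 'yes' },
      children: [{ type: 'text', text: 'I like fuzzy kittens.' }]
    }
  ]
};

console.log(collectText(tree)); // prints "I like fuzzy kittens."
```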

Friendly Errors

When something goes wrong, parse-xml throws an error that tells you exactly what happened and shows you where the problem is so you can fix it.

parseXml('<foo><bar>baz</foo>');

Output

Error: Missing end tag for element bar (line 1, column 14)
  <foo><bar>baz</foo>
               ^

In addition to a helpful message, error objects have the following properties:

  • column Number

    Column where the error occurred (1-based).

  • excerpt String

    Excerpt from the input string that contains the problem.

  • line Number

    Line where the error occurred (1-based).

  • pos Number

    Character position where the error occurred relative to the beginning of the input (0-based).
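
These properties make it easy to turn a parse failure into structured data instead of a thrown exception. Here's one possible sketch. tryParse is a hypothetical wrapper, not part of parse-xml, and the parsing function is passed in as an argument (you would pass parseXml) so the example stays self-contained.

```javascript
// Hypothetical wrapper (not part of parse-xml). `parse` is whatever parsing
// function you use, e.g. parseXml.
function tryParse(parse, xml) {
  try {
    return { ok: true, document: parse(xml) };
  } catch (err) {
    // Anything without line/column info isn't a positioned parse error,
    // so let it propagate unchanged.
    if (!err || typeof err.line !== 'number' || typeof err.column !== 'number') {
      throw err;
    }
    return {
      ok: false,
      message: err.message,
      location: `${err.line}:${err.column}`, // 1-based line:column
      pos: err.pos,                          // 0-based offset into the input
      excerpt: err.excerpt
    };
  }
}
```

Usage would look like tryParse(parseXml, '<foo><bar>baz</foo>'), after which you can check the ok property before touching the document.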

Why another XML parser?

There are many XML parsers for Node, and some of them are good. However, most of them suffer from one or more of the following shortcomings:

  • Native dependencies.

  • Loose, non-standard parsing behavior that can lead to unexpected or even unsafe results when given input the author didn't anticipate.

  • Kitchen sink APIs that tightly couple a parser with DOM manipulation functions, a stringifier, or other tooling that isn't directly related to parsing and consuming XML.

  • Stream-based parsing. This is great in the rare case that you need to parse truly enormous documents, but can be a pain to work with when all you want is a node tree.

  • Poor error handling.

  • Too big or too Node-specific to work well in browsers.

parse-xml's goal is to be a small, fast, safe, compliant, non-streaming, non-validating, browser-friendly parser, because I think this is an under-served niche.

I think parse-xml demonstrates that it's not necessary to jettison the spec entirely or to write complex code in order to implement a small, fast XML parser.

Also, it was fun.

Benchmark

Here's how parse-xml's performance stacks up against a few comparable libraries:

While libxmljs2 is faster at parsing medium and large documents, its performance comes at the expense of a large C dependency, no browser support, and a history of security vulnerabilities in the underlying libxml2 library.

In these results, "ops/s" refers to operations per second. Higher is faster.

Node.js v22.10.0 / Darwin arm64
Apple M1 Max

Running "Small document (291 bytes)" suite...
Progress: 100%

  @rgrove/parse-xml 4.2.0:
    253 082 ops/s, ±0.16%   | fastest

  fast-xml-parser 4.5.0:
    127 232 ops/s, ±0.44%   | 49.73% slower

  libxmljs2 0.35.0 (native):
    68 709 ops/s, ±2.77%    | slowest, 72.85% slower

  xmldoc 1.3.0 (sax-js):
    122 345 ops/s, ±0.15%   | 51.66% slower

Finished 4 cases!
  Fastest: @rgrove/parse-xml 4.2.0
  Slowest: libxmljs2 0.35.0 (native)

Running "Medium document (72081 bytes)" suite...
Progress: 100%

  @rgrove/parse-xml 4.2.0:
    1 350 ops/s, ±0.18%   | 29.5% slower

  fast-xml-parser 4.5.0:
    560 ops/s, ±0.48%     | slowest, 70.76% slower

  libxmljs2 0.35.0 (native):
    1 915 ops/s, ±2.64%   | fastest

  xmldoc 1.3.0 (sax-js):
    824 ops/s, ±0.20%     | 56.97% slower

Finished 4 cases!
  Fastest: libxmljs2 0.35.0 (native)
  Slowest: fast-xml-parser 4.5.0

Running "Large document (1162464 bytes)" suite...
Progress: 100%

  @rgrove/parse-xml 4.2.0:
    109 ops/s, ±0.17%   | 40.11% slower

  fast-xml-parser 4.5.0:
    48 ops/s, ±0.55%    | slowest, 73.63% slower

  libxmljs2 0.35.0 (native):
    182 ops/s, ±1.16%   | fastest

  xmldoc 1.3.0 (sax-js):
    73 ops/s, ±0.50%    | 59.89% slower

Finished 4 cases!
  Fastest: libxmljs2 0.35.0 (native)
  Slowest: fast-xml-parser 4.5.0

See the parse-xml-benchmark repo for instructions on how to run this benchmark yourself.
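
The "slower" percentages in the output above appear to be measured relative to the fastest case in each suite. As a sanity check, here's a small sketch of that arithmetic. percentSlower is a hypothetical helper written for this illustration, not something the benchmark tool is known to export, and since the printed ops/s values are rounded, recomputed figures may differ in the last digit.

```javascript
// Hypothetical helper: how much slower `opsPerSec` is than the fastest case,
// as a percentage of the fastest case's throughput.
function percentSlower(opsPerSec, fastestOpsPerSec) {
  return (1 - opsPerSec / fastestOpsPerSec) * 100;
}

// Small document suite: fast-xml-parser vs. @rgrove/parse-xml.
console.log(percentSlower(127232, 253082).toFixed(2)); // prints "49.73"
```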

License

ISC License

changelog

parse-xml changelog

All notable changes to parse-xml are documented in this file. The format is based on Keep a Changelog. This project adheres to Semantic Versioning.

4.2.0 (2024-10-24)

Faster! Smaller! Better in ways you can't even see and probably don't care about! And still completely backwards compatible.

Improved

  • Parsing performance in Node.js 22 is up to 28% faster than version 4.1.0. Note that the performance gain will vary depending on the document being parsed.

  • The minified bundle size has been reduced by a mind-blowing 87 bytes (uncompressed).

Changed

  • Moved initial parsing steps out of the Parser constructor and into a new parse() method. #35

    This change is an internal refactoring that doesn't affect the public API, but may make error handling easier for people who like living dangerously and are using parse-xml internals in interesting ways.

Fixed

  • The parser now throws an error when it encounters an invalid character in an encoding declaration.

4.1.0 (2023-02-04)

Added

  • Added a new includeOffsets parser option. #25

    When true, the starting and ending byte offsets of each node in the input string will be made available via start and end properties on the node. The default is false.

    This option is useful if you want to preserve the original source text of each node when later serializing a document back to XML. Previously, the original source text was always discarded, which meant that if you parsed a document and then serialized it, the original source text would be lost.

    const { parseXml } = require('@rgrove/parse-xml');
    
    let xml = '<root><child /></root>';
    let doc = parseXml(xml, { includeOffsets: true });
    
    console.log(doc.root.toJSON());
    // => { type: 'element', name: 'root', start: 0, end: 22, ... }
    
    console.log(doc.root.children[0].toJSON());
    // => { type: 'element', name: 'child', start: 6, end: 15, ... }

  • Added a new preserveXmlDeclaration parser option. #31

    When true, an XmlDeclaration node representing the XML declaration (if there is one) will be included in the parsed document. When false, the XML declaration will be discarded. The default is false, which matches the behavior of previous versions.

    This option is useful if you want to preserve the XML declaration when later serializing a document back to XML. Previously, the XML declaration was always discarded, which meant that if you parsed a document with an XML declaration and then serialized it, the original XML declaration would be lost.

    const { parseXml } = require('@rgrove/parse-xml');
    
    let xml = '<?xml version="1.0" encoding="UTF-8"?><root />';
    let doc = parseXml(xml, { preserveXmlDeclaration: true });
    
    console.log(doc.children[0].toJSON());
    // => { type: 'xmldecl', version: '1.0', encoding: 'UTF-8' }

  • Added a new preserveDocumentType parser option. #32

    When true, an XmlDocumentType node representing a document type declaration (if there is one) will be included in the parsed document. When false, any document type declaration encountered will be discarded. The default is false, which matches the behavior of previous versions.

    Note that the parser only includes the document type declaration in the node tree; it doesn't actually validate the document against the DTD, load external DTDs, or resolve custom entity references.

    This option is useful if you want to preserve the document type declaration when later serializing a document back to XML. Previously, the document type declaration was always discarded, which meant that if you parsed a document with a document type declaration and then serialized it, the original document type declaration would be lost.

    const { parseXml } = require('@rgrove/parse-xml');
    
    let xml = '<!DOCTYPE root SYSTEM "root.dtd"><root />';
    let doc = parseXml(xml, { preserveDocumentType: true });
    
    console.log(doc.children[0].toJSON());
    // => { type: 'doctype', name: 'root', systemId: 'root.dtd' }
    
    xml = '<!DOCTYPE kittens [<!ELEMENT kittens (#PCDATA)>]><kittens />';
    doc = parseXml(xml, { preserveDocumentType: true });
    
    console.log(doc.children[0].toJSON());
    // => {
    //   type: 'doctype',
    //   name: 'kittens',
    //   internalSubset: '<!ELEMENT kittens (#PCDATA)>'
    // }
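
To illustrate how the offsets from includeOffsets can be used to recover a node's original source text, here's a minimal sketch. sourceOf is a hypothetical helper, not part of parse-xml, and it assumes the start and end values can be used directly as string indexes, which holds for the ASCII-only example above. The node is a plain object carrying the offsets from that example so the snippet runs on its own.

```javascript
// Hypothetical helper (not part of parse-xml): recover a node's original
// source text from its `start` and `end` offsets.
// Assumption: the offsets can be used directly as string indexes, as they
// can for the ASCII-only example above.
function sourceOf(xml, node) {
  return xml.slice(node.start, node.end);
}

const sourceXml = '<root><child /></root>';
// Offsets copied from the includeOffsets example output above.
const childNode = { type: 'element', name: 'child', start: 6, end: 15 };

console.log(sourceOf(sourceXml, childNode)); // prints "<child />"
```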

Changed

  • Errors thrown by the parser are now instances of a new XmlError class, which extends Error. These errors still have all the same properties as before, but now with improved type definitions. #27

Fixed

  • Leading and trailing whitespace in comment content is no longer trimmed. This issue only affected parsing when the preserveComments parser option was enabled. #28

  • Text content following a CDATA section is no longer appended to the preceding XmlCdata node. This issue only affected parsing when the preserveCdata parser option was enabled. #29

4.0.1 (2022-10-17)

Fixed

  • The parseXml() function's options argument is now correctly marked as optional. #23

4.0.0 (2022-09-25)

parse-xml has been rewritten in TypeScript. The API is unchanged, but the parseXml() function is now a named export rather than a default export, which will require a small change to how you import it. See below for details.

This release also contains major performance improvements. Parsing is now 1.4x to 2.5x as fast as it was in 3.0.0, depending on the document being parsed.

Breaking Changes

  • The parseXml() function is now a named export rather than the default export. Please update your import and require statements accordingly:

    ESM

    -import parseXml from '@rgrove/parse-xml';
    +import { parseXml } from '@rgrove/parse-xml';

    CommonJS

    -const parseXml = require('@rgrove/parse-xml');
    +const { parseXml } = require('@rgrove/parse-xml');

  • XML node classes (XmlNode, XmlDocument, XmlElement, etc.) are now named exports of the @rgrove/parse-xml package rather than properties on the parseXml() function. This is unlikely to affect most people since there aren't many reasons to use these classes directly.

    ESM

    -import parseXml from '@rgrove/parse-xml';
    -const { XmlNode } = parseXml;
    +import { parseXml, XmlNode } from '@rgrove/parse-xml';

    CommonJS

    -const parseXml = require('@rgrove/parse-xml');
    -const { XmlNode } = parseXml;
    +const { parseXml, XmlNode } = require('@rgrove/parse-xml');

  • The minified browser-ready global bundle, which was previously located at dist/umd/parse-xml.min.js, is now located at dist/global.min.js. This is unlikely to affect most people because this file isn't used by Node.js or by browser bundlers like webpack. It's only a convenience for people who want to load parse-xml directly from a CDN like unpkg with a <script> element and use it via a parseXml() global.

    -<script src="https://unpkg.com/@rgrove/parse-xml@3.0.0/dist/umd/parse-xml.min.js"></script>
    +<script src="https://unpkg.com/@rgrove/parse-xml@4.0.0/dist/global.min.js"></script>

  • Node.js 12 is no longer supported. Node.js 14 is now the minimum supported version.
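
If some of your code has to run against both the old default export and the new named export during a migration, a small shim can paper over the difference. This is only a sketch based on the import changes described above. resolveParseXml is a hypothetical helper, not part of parse-xml, and it takes the already-loaded module value as an argument so it can be shown on its own.

```javascript
// Hypothetical shim (not part of parse-xml): accept either export shape.
// `mod` is the value returned by require('@rgrove/parse-xml').
function resolveParseXml(mod) {
  // 4.0.0 and later: parseXml is a named export.
  if (mod && typeof mod.parseXml === 'function') {
    return mod.parseXml;
  }
  // Before 4.0.0: the module itself is the parseXml function.
  if (typeof mod === 'function') {
    return mod;
  }
  throw new TypeError('Unrecognized @rgrove/parse-xml export shape');
}
```

In CommonJS you would call it as resolveParseXml(require('@rgrove/parse-xml')). Once everything is on 4.x, the shim can be deleted in favor of the plain named import.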

Other Changes

  • Parsing performance has been improved significantly, and is now 1.4 to 2.5 times as fast as it was in 3.0.0, depending on the document being parsed.

  • The package now includes a browser-specific entry point that's optimized for minification. Using parse-xml with a minifying bundler like webpack or esbuild should now result in a smaller bundle.

3.0.0 (2021-01-23)

This release includes significant changes under the hood (such as a brand new parser!), but backwards compatibility has been a high priority. Most users should be able to upgrade without needing to make any changes (or with only minimal changes).

Added

  • XML processing instructions are now included in parsed documents as XmlProcessingInstruction nodes (with the type value "pi"). Previously they were discarded.

  • A new sortAttributes option. When true, attributes will be sorted in alphabetical order in an element's attributes object (which is no longer the default behavior).

  • TypeScript type definitions. While parse-xml is still written in JavaScript, it now has TypeScript-friendly JSDoc comments throughout, with strict type checking enabled. These comments are now used to generate type definitions at build time.
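
Because processing instructions now show up in the tree with the type value "pi", they can be picked out with an ordinary tree walk. Here's a small sketch of that idea. findByType is a hypothetical helper, not part of parse-xml, and the sample tree uses bare plain objects with only the type and children properties filled in.

```javascript
// Hypothetical helper (not part of parse-xml): depth-first search for all
// nodes with a given `type` value.
function findByType(node, type, found = []) {
  if (node.type === type) {
    found.push(node);
  }
  for (const child of node.children || []) {
    findByType(child, type, found);
  }
  return found;
}

// Minimal stand-in tree; all other node properties are omitted.
const sampleDoc = {
  type: 'document',
  children: [
    { type: 'pi' },
    { type: 'element', children: [{ type: 'text' }, { type: 'pi' }] }
  ]
};

console.log(findByType(sampleDoc, 'pi').length); // prints 2
```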

Changed

  • The minimum supported Node.js version is now 12.x, and the minimum supported ECMAScript version is ES2017. Extremely old browsers (like IE11) are no longer supported out of the box, but you can still transpile parse-xml yourself if you need to support old browsers.

  • The XML parser has been completely rewritten with the primary goals of improving robustness and safety.

    While the previous parser was good, it relied heavily on complex regular expressions. This helped keep it extremely small, but also left it open to the possibility of regex denial of service bugs when parsing unusual or maliciously crafted input.

    The new parser uses a less interesting but overall safer approach, and employs regular expressions only sparingly and in ways that aren't risky (they're now only used as performance optimizations rather than as the basis for the entire parser).

  • The parseXml() function now returns an XmlDocument instance instead of a plain object. Its properties are backwards compatible.

  • Other node types (elements, text nodes, CDATA nodes, and comments) are also now represented by class instances (XmlElement, XmlText, XmlCdata, and XmlComment) rather than plain objects. Their properties are all backwards compatible.

  • Attributes are no longer sorted alphabetically by name in an element's attributes object by default. They're now defined in the same order that they're encountered in the document being parsed, unless the sortAttributes parser option is true.

  • If the value returned by an optional resolveUndefinedEntity function is not a string, null, or undefined, a TypeError will now be thrown. If you don't pass a custom resolveUndefinedEntity function to parseXml(), then this change won't affect you.

  • Some error messages have been changed to improve clarity, and more helpful errors have been added in some scenarios that previously would have resulted in generic or less helpful errors.

  • The browser field in package.json has been removed and the main field now points both Node.js and browser bundlers to the same untranspiled CommonJS source.

    When bundled using your favorite bundler, parse-xml will work great in all modern browsers with no transpilation needed. If you don't want to use a bundler, you can still use the prepackaged UMD bundle at dist/umd/parse-xml.min.js, which provides a parseXml global.
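
To make the attribute ordering change described above a bit more concrete, here's a sketch of what sorting an attributes object by name looks like. sortAttributeNames is a hypothetical helper, not part of parse-xml, and it uses JavaScript's default string sort, which may differ in detail from the ordering the parser applies when sortAttributes is true.

```javascript
// Hypothetical helper (not part of parse-xml): return a copy of an
// attributes object with its keys in sorted order.
function sortAttributeNames(attributes) {
  const sorted = {};
  for (const name of Object.keys(attributes).sort()) {
    sorted[name] = attributes[name];
  }
  return sorted;
}

// Document order, as in <kitten name="Mittens" fuzzy="yes" age="2" />.
const inDocumentOrder = { name: 'Mittens', fuzzy: 'yes', age: '2' };

console.log(Object.keys(sortAttributeNames(inDocumentOrder)));
// prints [ 'age', 'fuzzy', 'name' ]
```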

2.0.4 (2020-05-01)

Fixed

  • Extremely long attribute values no longer cause the parser to throw a "Maximum call stack size exceeded" RangeError. #13 (@rossj)

2.0.3 (2020-04-20)

Fixed

  • Attribute values with many consecutive character references (such as &lt;) no longer cause the parser to hang. #10 (@rossj)

2.0.2 (2020-01-10)

Fixed

  • Whitespace in attribute values is now normalized correctly. #7

    Previously, attribute values were normalized according to the rules for non-CDATA attributes, but this was incorrect and based on a misreading of the spec.

    Attribute values are now correctly parsed as CDATA, meaning that whitespace is not collapsed or trimmed and whitespace character entities are resolved to their respective characters rather than being normalized to spaces (which was incorrect even by the non-CDATA rules!).
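
The CDATA rules described above can be sketched in a few lines. This is a simplified illustration of the behavior, not the parser's actual implementation. normalizeAttributeValue is a hypothetical helper that assumes line endings have already been normalized to line feeds, and it handles only numeric character references, not named entities like &lt;.

```javascript
// Hypothetical helper (not part of parse-xml): simplified CDATA attribute
// value normalization. Literal tabs, line feeds, and carriage returns become
// spaces; numeric character references resolve to the character they name.
// Nothing is collapsed or trimmed.
function normalizeAttributeValue(raw) {
  const pattern = /[\t\n\r]|&#(?:x([0-9a-fA-F]+)|([0-9]+));/g;
  return raw.replace(pattern, (match, hex, dec) => {
    if (hex !== undefined) {
      return String.fromCodePoint(parseInt(hex, 16));
    }
    if (dec !== undefined) {
      return String.fromCodePoint(parseInt(dec, 10));
    }
    return ' '; // literal whitespace character
  });
}

// The literal tab becomes a space, the double space is kept, and the
// character reference &#10; stays a real line feed.
console.log(JSON.stringify(normalizeAttributeValue('a\tb  c&#10;d')));
// prints "a b  c\nd"
```

Doing both substitutions in a single pass matters: a line feed produced by a character reference must not be turned into a space afterwards.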

2.0.1 (2019-04-09)

Fixed

  • A carriage return (\r) character that isn't followed by a line feed (\n) character is now correctly normalized to a line feed before parsing.
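
For reference, this kind of line ending normalization can be expressed as a single replacement. The following is a sketch of the rule rather than the parser's own code, and normalizeLineEndings is a hypothetical name.

```javascript
// Hypothetical helper (not part of parse-xml): convert both \r\n pairs and
// lone \r characters to a single \n.
function normalizeLineEndings(text) {
  return text.replace(/\r\n?/g, '\n');
}

console.log(JSON.stringify(normalizeLineEndings('a\r\nb\rc\nd')));
// prints "a\nb\nc\nd"
```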

2.0.0 (2019-01-20)

Added

  • There's a new minified UMD bundle at dist/umd/parse-xml.min.js in the npm package. This may be useful if you want to load parse-xml directly in a browser using a service like unpkg or jsDelivr.

Changed

  • parse-xml no longer depends on CoreJS polyfills or the Babel runtime, which reduces the browser bundle size significantly. If you need to support older browsers, you should provide your own polyfills for Object.assign(), Object.freeze(), and String.fromCodePoint().

  • The browser-friendly CommonJS build has moved from dist/ to dist/commonjs/ in the npm package.

1.1.1 (2017-09-20)

Fixed

  • Attribute values are no longer truncated at the first = character.

1.1.0 (2017-09-10)

Added

1.0.0 (2017-06-04)

  • Initial release.