word-extractor

Read data from a Word document (.doc or .docx) using Node.js

Why use this module?

A fair number of npm packages can extract text from Word .doc files, but they often appear to require an external helper program, and either spawn a process or communicate with a persistent one. That adds to the installation and deployment burden as well as the runtime cost.

This module is intended to provide a much faster way of reading the text from a Word file, without leaving the Node.js environment.

This means you do not need to install Word, Office, or anything else, and the module will work on all platforms, without any native binary code requirements.

As of version 1.0, this module supports both traditional OLE-based Word files (usually .doc) and modern Office Open XML (ECMA-376) Word files (usually .docx). It can be used both with files and with file contents in a Node.js Buffer.

How do I install this module?

yarn add word-extractor

# Or using npm... 
npm install word-extractor

How do I use this module?

const WordExtractor = require("word-extractor"); 
const extractor = new WordExtractor();
const extracted = extractor.extract("file.doc");

extracted.then(function(doc) { console.log(doc.getBody()); });

The object returned from the extract() method is a promise that resolves to a document object, which then provides several views onto different parts of the document contents.
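The same promise can also be consumed with async/await. The following is a minimal sketch rather than part of the module: the helper names summarizeDocument and readWordFile are hypothetical, and the extractor is passed in as an argument so the helpers can be exercised without a real file. Only methods described under Methods are called.

```javascript
// Hypothetical helper: gathers several views of an already-extracted document.
// `doc` is the object that the extract() promise resolves to.
function summarizeDocument(doc) {
  return {
    body: doc.getBody(),
    footnotes: doc.getFootnotes(),
    endnotes: doc.getEndnotes(),
    annotations: doc.getAnnotations(),
  };
}

// Hypothetical wrapper: waits for extract() and returns the collected views.
// `source` may be a filename or a Buffer, as accepted by extract().
async function readWordFile(extractor, source) {
  const doc = await extractor.extract(source);
  return summarizeDocument(doc);
}

// Usage (assumes word-extractor is installed and file.doc exists):
//   const WordExtractor = require("word-extractor");
//   readWordFile(new WordExtractor(), "file.doc").then(console.log, console.error);
```

Because extract() returns a promise, a failed read surfaces as a rejection, so the second argument to then() (or a try/catch around await) is where errors would be handled.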

Methods

WordExtractor#extract(<filename> | <Buffer>)

Main method to open a Word file and retrieve the data. Returns a promise which resolves to a Document. If a Buffer is passed instead of a filename, then the buffer is used directly, instead of reading a file from disk.
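As an illustration of the Buffer form, here is a rough sketch that passes in-memory contents to extract(). The helper names extractBodyFromBuffer and extractBodyFromPath are hypothetical, and reading the file with Node's own fs module is just one assumed way of obtaining a Buffer; it could equally come from an upload or a network response.

```javascript
const { readFile } = require("fs").promises;

// Hypothetical helper: passes in-memory file contents straight to extract().
async function extractBodyFromBuffer(extractor, buffer) {
  if (!Buffer.isBuffer(buffer)) {
    throw new TypeError("expected a Buffer");
  }
  const doc = await extractor.extract(buffer);
  return doc.getBody();
}

// Hypothetical helper: loads the bytes ourselves, then hands them over.
// readFile() without an encoding argument resolves to a Buffer.
async function extractBodyFromPath(extractor, path) {
  const buffer = await readFile(path);
  return extractBodyFromBuffer(extractor, buffer);
}

// Usage (assumes word-extractor is installed and file.docx exists):
//   const WordExtractor = require("word-extractor");
//   extractBodyFromPath(new WordExtractor(), "file.docx").then(console.log);
```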

Document#getBody()

Retrieves the content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getFootnotes()

Retrieves the footnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getEndnotes()

Retrieves the endnote text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getHeaders(options?)

Retrieves the header and footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Note that by default, getHeaders() returns one string, containing all headers and footers. This is compatible with previous versions. If you want to separate headers and footers, use getHeaders({includeFooters: false}), to return only the headers, and the new method getFooters() (from version 1.0.1) to return the footers separately.
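A short sketch of the separated form described above, assuming version 1.0.1 or later. The helper name splitHeadersAndFooters is hypothetical; it takes a document object that extract() has already resolved to.

```javascript
// Hypothetical helper: returns headers and footers as two separate strings.
function splitHeadersAndFooters(doc) {
  return {
    // Exclude footers so only header text is returned.
    headers: doc.getHeaders({ includeFooters: false }),
    // getFooters() is available from version 1.0.1.
    footers: doc.getFooters(),
  };
}

// Usage (assumes word-extractor is installed and file.doc exists):
//   const WordExtractor = require("word-extractor");
//   new WordExtractor().extract("file.doc")
//     .then((doc) => console.log(splitHeadersAndFooters(doc)));
```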

Document#getFooters()

From version 1.0.1. Retrieves the footer text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getAnnotations()

Retrieves the comment bubble text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Document#getTextboxes(options?)

Retrieves the textbox content text from a Word document. This will handle UNICODE characters correctly, so if there are accented or non-Latin-1 characters present in the document, they'll show as is in the returned string.

Note that by default, getTextboxes() returns one string, containing all textbox content from both main document and the headers and footers. You can control what gets included by using the options includeHeadersAndFooters (which defaults to true) and includeBody (also defaults to true). So, as an example, if you only want the body text box content, use: doc.getTextboxes({includeHeadersAndFooters: false}).
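The two options can be combined to pull the textbox content apart by location. This is a sketch under the option names given above; the helper name textboxesByLocation is hypothetical.

```javascript
// Hypothetical helper: separates textbox content by where it appears.
function textboxesByLocation(doc) {
  return {
    // Body only: switch off the header and footer textboxes.
    body: doc.getTextboxes({ includeHeadersAndFooters: false }),
    // Headers and footers only: switch off the body textboxes.
    headersAndFooters: doc.getTextboxes({ includeBody: false }),
    // Default: both options are true, so everything is included.
    all: doc.getTextboxes(),
  };
}

// Usage (assumes word-extractor is installed and file.docx exists):
//   const WordExtractor = require("word-extractor");
//   new WordExtractor().extract("file.docx")
//     .then((doc) => console.log(textboxesByLocation(doc)));
```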

License

Licensed under the MIT License.

Change log

1.0.4 / 26th July 2021

Fixed issue with missing content from LibreOffice files. See #40
Fixed order of entry reading from LibreOffice OOXML files. See #41

1.0.3 / 17th June 2021

Fixed issues with long attribute values (> 65k) in OOXML. See #37
Propagate errors from XML failures into promise rejections. See #38
Changed the XML parser dependency for maintenance and fixes. See #39

1.0.2 / 28th May 2021

Added a new method for reading textbox content. See #35

1.0.1 / 24th May 2021

Added separation between headers and footers. See #34

1.0.0 / 16th May 2021

Major refactoring of the OLE code to use promises internally
Added support for Office Open XML-based (.docx) Word files. See #1
Added support for reading direct from a Buffer. See #11
Removed event-stream dependency. See #19
Fixed an issue with not closing files properly. See #23
Corrected handling of extracting files with files. See #31
Corrected handling of extracting files with deleted text. See #32
Fixed issues with extracting multiple rows of table data. See #33

This is a major release. Although there are no incompatible API changes, it seemed best to bump the major version so that dependents do not pick up the update automatically. Existing applications should not require any code changes to use this version.

0.3.0 / 18th February 2019

Re-fixed the bad loop in the OLE code. See #15, #18
A few errors that were previously rejected as strings are now rejected as errors
Updated dependencies to safe versions. See #20

0.2.2 / 23rd January 2019

Fixed the bad dependency on event-stream

0.2.1 / 21st January 2019

Added a new getEndnotes method. See #16
Fixed a bad loop in the OLE code

0.2.0 / 31st October 2018

Removed coffeescript and mocha, now using jest and plain ES6
Removed partial work on .docx (for now)

0.1.4 / 25th March 2017

Fixed a documentation issue. extract returns a Promise. See #6
Corrected table cell delimiters to be tabs. See #9
Fixed an issue where replacements weren't being applied right.

0.1.3 / 6th July 2016

Added the missing lib folder
Added a missing dependency to package.json

0.1.1 / 17th January 2016

Fixed a bug with text boundary calculations
Added endpoints getHeaders, getFootnotes, getAnnotations

0.1.0 / 14th January 2016

Initial release to npm
