@ridi/epub-parser

ridi · 242 · MIT · 0.7.4-alpha.1 · TypeScript support: included

Common EPUB2 data parser for Ridibooks services

EPUB, EPUB2, parser, serialize, deserialize, unzip, read, crypto

Features

  • [x] EPUB2 parsing
  • [ ] EPUB3 parsing
  • [x] Package validation with option
  • [x] Unzip epub file when parsing with options
  • [x] Read files
    • [x] Extract inner HTML of body in Spine with option
    • [x] Change base path of Spine, CSS and Inline style with option
    • [x] Customize CSS, Inline Style with options
    • [ ] Truncate inner HTML of body in Spine with options
    • [ ] Minify HTML, CSS, Inline Style with options
  • [x] Encrypt and decrypt function when parsing or reading or unzipping
  • [ ] More spec
    • [ ] encryption.xml
    • [ ] manifest.xml
    • [ ] metadata.xml
    • [ ] rights.xml
    • [ ] signatures.xml
  • [ ] Debug mode
  • [ ] Environment
    • [x] Node
    • [ ] CLI
    • [ ] Browser
  • [ ] Online demo

Install

npm install @ridi/epub-parser

Usage

Basic:

import { EpubParser } from '@ridi/epub-parser';
// or const { EpubParser } = require('@ridi/epub-parser');

// Pass either an EPUB file path or the path of an already unzipped directory (e.g. './unzippedPath').
const parser = new EpubParser('./foo/bar.epub');
parser.parse(/* { parseOptions } */).then((book) => {
  parser.readItems(book.spines/*, { readOptions } */).then((results) => {
    // ...
  });
  // ...
});
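
The same flow can also be written with async/await. The sketch below takes the parser as an argument, so it relies only on the parse and readItems methods shown above. The helper name readAllSpines is hypothetical and not part of the library.

```javascript
// Hypothetical helper: works with any object exposing parse() and readItems()
// as documented above, such as an EpubParser instance.
async function readAllSpines(parser, parseOptions = {}, readOptions = {}) {
  // parse() resolves to a book whose `spines` list holds the spine items.
  const book = await parser.parse(parseOptions);
  // readItems() resolves to one string or Buffer per requested item.
  const contents = await parser.readItems(book.spines, readOptions);
  return { book, contents };
}

// Usage with a real parser would look like:
//   const parser = new EpubParser('./foo/bar.epub');
//   readAllSpines(parser).then(({ book, contents }) => { /* ... */ });
```

Errors from either step surface as a single rejected promise, so one catch (or try/catch around an await) covers both.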

with AesCryptor:

import { CryptoProvider, AesCryptor } from '@ridi/epub-parser';
// or const { CryptoProvider, AesCryptor } = require('@ridi/epub-parser');

const { Purpose } = CryptoProvider;
const { Mode, Padding } = AesCryptor;

class ContentCryptoProvider extends CryptoProvider {
  constructor(key) {
    super();
    this.cryptor = new AesCryptor(Mode.ECB, { key });
  }

  getCryptor(filePath, purpose) {
    return this.cryptor;
  }

  // If used as follows:
  // const provider = new ContentCryptoProvider(...);
  // const parser = new EpubParser('encrypted.epub', provider);
  // const book = await parser.parse({ unzipPath: ... });
  // const firstSpine = await parser.readItem(book.spines[0]);
  //
  // It will be called as follows:
  // 1. run(data, 'encrypted.epub', Purpose.READ_IN_DIR)
  // 2. run(data, 'META-INF/container.xml', Purpose.READ_IN_ZIP)
  // 3. run(data, 'OEBPS/content.opf', Purpose.READ_IN_ZIP)
  // ...
  // 4. run(data, 'mimetype', Purpose.WRITE)
  // ...
  // 5. run(data, 'OEBPS/Text/Section0001.xhtml', Purpose.READ_IN_DIR)
  //
  run(data, filePath, purpose) {
    const cryptor = this.getCryptor(filePath, purpose);
    const padding = Padding.AUTO;
    if (purpose === Purpose.READ_IN_DIR) {
      return cryptor.decrypt(data, { padding });
    } else if (purpose === Purpose.WRITE) {
      return cryptor.encrypt(data, { padding });
    }
    return data;
  }
}

const cryptoProvider = new ContentCryptoProvider(key);
// Pass either the encrypted EPUB file or the path of an unzipped directory (e.g. './unzippedPath').
const parser = new EpubParser('./encrypted.epub', cryptoProvider);
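
To illustrate what the run method above does for each purpose, here is a standalone sketch that uses Node's built-in crypto module rather than the library's AesCryptor. The Purpose object and the run function are local stand-ins (the string values are assumptions), and AES-128-ECB with Node's default padding is used only as a rough analogue of Mode.ECB and Padding.AUTO.

```javascript
const crypto = require('crypto');

// Local stand-ins for CryptoProvider.Purpose; the string values are assumptions.
const Purpose = { READ_IN_ZIP: 'read_in_zip', READ_IN_DIR: 'read_in_dir', WRITE: 'write' };

// Encrypt or decrypt a Buffer with AES-128-ECB (ECB mode takes no IV).
function aesEcb(data, key, encrypt) {
  const cipher = encrypt
    ? crypto.createCipheriv('aes-128-ecb', key, null)
    : crypto.createDecipheriv('aes-128-ecb', key, null);
  return Buffer.concat([cipher.update(data), cipher.final()]);
}

// Same dispatch as the provider above: decrypt when reading from the unzipped
// directory, encrypt when writing, and pass data through otherwise.
function run(data, key, purpose) {
  if (purpose === Purpose.READ_IN_DIR) return aesEcb(data, key, false);
  if (purpose === Purpose.WRITE) return aesEcb(data, key, true);
  return data;
}

const key = Buffer.from('0123456789abcdef'); // 16 bytes for AES-128 (example only)
const written = run(Buffer.from('application/epub+zip'), key, Purpose.WRITE);
const readBack = run(written, key, Purpose.READ_IN_DIR);
console.log(readBack.toString()); // the original plaintext
```

Data read straight from the zip (READ_IN_ZIP) passes through unchanged, which matches the case where the archive itself is not encrypted but the unzipped files are.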

Log level setting:

import { EpubParser, LogLevel } from '@ridi/epub-parser';
// Constructor forms: new EpubParser(path, cryptoProvider, logLevel)
//                or: new EpubParser(path, logLevel)
const parser = new EpubParser('./foo/bar.epub', LogLevel.WARN);
parser.logger.logLevel = LogLevel.VERBOSE; // SILENT, ERROR, WARN (default), INFO, DEBUG, VERBOSE

API

parse(parseOptions)

Returns Promise<EpubBook> with:

  • EpubBook: Instance with metadata, spine list, table of contents, etc.

Or throws an exception.

parseOptions: ?object


readItem(item, readOptions)

Returns a Promise that resolves to a string or Buffer, or throws an exception.

item: Item (see: Item Types)

readOptions: ?object


readItems(items, readOptions)

Returns a Promise that resolves to string[] or Buffer[], or throws an exception.

items: Item[] (see: Item Types)

readOptions: ?object


unzip(unzipPath, overwrite)

Returns Promise<boolean> with:

  • If result is true, unzip is successful or has already been unzipped.

Or throws an exception.

unzipPath: string

overwrite: boolean


onProgress = callback(step, totalStep, action)

Reports the parser's progress through a callback.

const { Action } = EpubParser; // PARSE, READ_ITEMS
parser.onProgress = (step, totalStep, action) => {
  console.log(`[${action}] ${step} / ${totalStep}`);
}
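
As an illustration of a slightly richer callback, the sketch below turns step and totalStep into a percentage. The factory name createProgressHandler and the 'parse' string in the usage lines are hypothetical; real action values come from EpubParser.Action.

```javascript
// Hypothetical factory that builds an onProgress-compatible callback.
// `log` receives one formatted line per progress event.
function createProgressHandler(log = console.log) {
  return (step, totalStep, action) => {
    // Guard against division by zero when totalStep is 0 or missing.
    const percent = totalStep > 0 ? Math.round((step / totalStep) * 100) : 0;
    log(`[${action}] ${step} / ${totalStep} (${percent}%)`);
  };
}

// Usage with a real parser would look like:
//   parser.onProgress = createProgressHandler();
const lines = [];
const handler = createProgressHandler((line) => lines.push(line));
handler(1, 4, 'parse'); // 'parse' is a placeholder action value
```

Injecting the log function keeps the handler easy to redirect, for example to a progress bar instead of the console.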

Model

EpubBook

Author

  • name: ?string
  • fileAs: ?string
  • role: string (Default: Author.Roles.UNDEFINED)
  • toRaw(): object

Author.Roles

Type Value
UNDEFINED undefined
UNKNOWN unknown
ADAPTER adp
ANNOTATOR ann
ARRANGER arr
ARTIST art
ASSOCIATEDNAME asn
AUTHOR aut
AUTHOR_IN_QUOTATIONS_OR_TEXT_EXTRACTS aqt
AUTHOR_OF_AFTER_WORD_OR_COLOPHON_OR_ETC aft
AUTHOR_OF_INTRODUCTIONOR_ETC aui
BIBLIOGRAPHIC_ANTECEDENT ant
BOOK_PRODUCER bkp
COLLABORATOR clb
COMMENTATOR cmm
DESIGNER dsr
EDITOR edt
ILLUSTRATOR ill
LYRICIST lyr
METADATA_CONTACT mdc
MUSICIAN mus
NARRATOR nrt
OTHER oth
PHOTOGRAPHER pht
PRINTER prt
REDACTOR red
REVIEWER rev
SPONSOR spn
THESIS_ADVISOR ths
TRANSCRIBER trc
TRANSLATOR trl

DateTime

  • value: ?string
  • event: string (Default: DateTime.Events.UNDEFINED)
  • toRaw(): object

DateTime.Events

Type Value
UNDEFINED undefined
UNKNOWN unknown
CREATION creation
MODIFICATION modification
PUBLICATION publication

Identifier

  • value: ?string
  • scheme: string (Default: Identifier.Schemes.UNDEFINED)
  • toRaw(): object

Identifier.Schemes

Type Value
UNDEFINED undefined
UNKNOWN unknown
DOI doi
ISBN isbn
ISBN13 isbn13
ISBN10 isbn10
ISSN issn
UUID uuid
URI uri

Meta

  • name: ?string
  • content: ?string
  • toRaw(): object

Guide

  • title: ?string
  • type: string (Default: Guide.Types.UNDEFINED)
  • href: ?string
  • item: ?Item
  • toRaw(): object

Guide.Types

Type Value
UNDEFINED undefined
UNKNOWN unknown
COVER cover
TITLE_PAGE title-page
TOC toc
INDEX index
GLOSSARY glossary
ACKNOWLEDGEMENTS acknowledgements
BIBLIOGRAPHY bibliography
COLOPHON colophon
COPYRIGHT_PAGE copyright-page
DEDICATION dedication
EPIGRAPH epigraph
FOREWORD foreword
LOI loi
LOT lot
NOTES notes
PREFACE preface
TEXT text

Item Types

Item

  • id: ?string
  • href: ?string
  • mediaType: ?string
  • size: ?number
  • isFileExists: boolean (size !== undefined)
  • toRaw(): object

SpineItem (extend Item)

NcxItem (extend Item)

CssItem (extend Item)

  • namespace: string

InlineCssItem (extend CssItem)

  • style: string (Default: '')

ImageItem (extend Item)

  • isCover: boolean (Default: false)

SvgItem (extend ImageItem)

FontItem (extend Item)

DeadItem (extend Item)

  • reason: string (Default: DeadItem.Reason.UNDEFINED)

DeadItem.Reason

Type Value
UNDEFINED undefined
UNKNOWN unknown
NOT_EXISTS not_exists
NOT_SPINE not_spine
NOT_NCX not_ncx
NOT_SUPPORT_TYPE not_support_type

NavPoint

  • id: ?string
  • label: ?string
  • src: ?string
  • anchor: ?string
  • depth: number (Default: 0)
  • children: NavPoint[]
  • spine: ?SpineItem
  • toRaw(): object

Version

  • major: number
  • minor: number
  • patch: number
  • toString(): string

Parse Options


validatePackage: boolean

If true, validate the package against the IDPF specifications listed below.

Used only if the input is an EPUB file.

  • The ZIP header must not be corrupt.
  • The mimetype file must be the first file in the archive.
  • The mimetype file must not be compressed.
  • The mimetype file must contain only the string application/epub+zip.
  • The mimetype file must not use the extra field feature of the ZIP format.

Default: false
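
The mimetype rules above can be checked directly against the first local file header of the ZIP archive. The following is a minimal sketch of such a check, not the library's implementation; the function name checkMimetypeEntry is hypothetical, and it assumes the entry sizes are stored in the local header (no data descriptor).

```javascript
const MIMETYPE = 'application/epub+zip';

// Returns a list of problems found in the first ZIP entry of `buf` (a Buffer).
// An empty list means the mimetype checks listed above pass.
function checkMimetypeEntry(buf) {
  const problems = [];
  // Local file header signature "PK\x03\x04", read as little-endian.
  if (buf.length < 30 || buf.readUInt32LE(0) !== 0x04034b50) {
    return ['ZIP header is missing or corrupt'];
  }
  const method = buf.readUInt16LE(8); // 0 means stored (uncompressed)
  const compressedSize = buf.readUInt32LE(18);
  const nameLength = buf.readUInt16LE(26);
  const extraLength = buf.readUInt16LE(28);
  const name = buf.toString('latin1', 30, 30 + nameLength);
  if (name !== 'mimetype') problems.push('mimetype must be the first file');
  if (method !== 0) problems.push('mimetype must not be compressed');
  if (extraLength !== 0) problems.push('mimetype must not use an extra field');
  // File data starts right after the name and the extra field.
  const start = 30 + nameLength + extraLength;
  const content = buf.toString('latin1', start, start + compressedSize);
  if (method === 0 && content !== MIMETYPE) {
    problems.push(`mimetype must contain only ${MIMETYPE}`);
  }
  return problems;
}

// Usage: checkMimetypeEntry(require('fs').readFileSync('./foo/bar.epub'))
```

Only the first 30 bytes plus the name, extra field, and data of the first entry are inspected, so the check is cheap even for large archives.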


allowNcxFileMissing: boolean

If false, stop parsing when the NCX file does not exist.

Default: true


unzipPath: ?string

If specified, unzip to that path.

Used only if the input is an EPUB file.

Default: undefined


overwrite: boolean

If true, overwrite existing files in unzipPath when unzipping.

Used only if unzipPath is specified.

Default: true


parseStyle: boolean

If true, the styles used by each spine are described, and one namespace is assigned per CSS file or inline style.

Otherwise, CssItem.namespace and SpineItem.styles are undefined.

In any list, InlineCssItem is always positioned after CssItem. (EpubBook.styles, EpubBook.items, SpineItem.styles, ...)

Default: true


styleNamespacePrefix: string

Prepend the given string to each namespace for identification.

Available only if parseStyle is true.

Default: 'ridi_style'


additionalInlineStyle: ?string

If specified, add the given inline style to all spines.

Available only if parseStyle is true.

Default: undefined

Read Options


force: boolean

If true, ignore any exceptions that occur within the parser.

Default: false


basePath: ?string

If specified, change the base path of the paths used by spines and CSS.

HTML: SpineItem

...
  <!-- Before -->
  <div>
    <img src="../Images/cover.jpg">
  </div>
  <!-- After -->
  <div>
    <img src="{basePath}/OEBPS/Images/cover.jpg">
  </div>
...

CSS: CssItem, InlineCssItem

/* Before */
@font-face {
  font-family: NotoSansRegular;
  src: url("../Fonts/NotoSans-Regular.ttf");
}
/* After */
@font-face {
  font-family: NotoSansRegular;
  src: url("{basePath}/OEBPS/Fonts/NotoSans-Regular.ttf");
}

Default: undefined
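
The Before/After pairs above amount to resolving each relative reference against the file that contains it and then prefixing the result with basePath. The sketch below shows that path arithmetic with Node's path module; it is not the library's code, the function name rebase is hypothetical, and the containing file OEBPS/Text/Section0001.xhtml and base path /books/foo are example values.

```javascript
const path = require('path');

// Rewrites a relative reference found inside an EPUB item so that it is rooted
// at basePath. `itemHref` is the path of the containing file inside the EPUB.
function rebase(reference, itemHref, basePath) {
  // Resolve the reference against the directory of the file that contains it.
  const insideEpub = path.posix.join(path.posix.dirname(itemHref), reference);
  // Prefix the EPUB-internal path with the base path.
  return path.posix.join(basePath, insideEpub);
}

console.log(rebase('../Images/cover.jpg', 'OEBPS/Text/Section0001.xhtml', '/books/foo'));
// prints "/books/foo/OEBPS/Images/cover.jpg"
```

Note that path.posix.join collapses repeated slashes, so this sketch suits file-system base paths only; a URL base path would need URL-aware joining.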


extractBody: boolean|function

If true, extract the body. Otherwise, return the full string. If a function is specified instead of true, that function is used to transform the body.

false:

'<!doctype><html>\n<head>\n</head>\n<body style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</body>\n</html>'

true:

'<body style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</body>'

function:

readOptions.extractBody = (innerHTML, attrs) => {
  // Serialize the body attributes as space-separated key="value" pairs.
  const string = attrs.map((attr) => `${attr.key}="${attr.value}"`).join(' ');
  return `<article ${string}>${innerHTML}</article>`;
};
// Result:
// '<article style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</article>'

Default: false


serializedAnchor: boolean

If true, replace the file path of each anchor in a spine with the index of the spine it points to.

...
<spine toc="ncx">
  <itemref idref="Section0001.xhtml"/> <!-- index: 0 -->
  <itemref idref="Section0002.xhtml"/> <!-- index: 1 -->
  <itemref idref="Section0003.xhtml"/> <!-- index: 2 -->
  ...
</spine>
...
<!-- Before -->
<a href="./Text/Section0002.xhtml#title">Chapter 2</a>
<!-- After -->
<a href="1#title">Chapter 2</a>

Default: false
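
The replacement above can be pictured as a lookup of the link target in the ordered spine list. Below is a minimal sketch of that idea, not the library's implementation; the function name serializeAnchor, the spine paths, and the OEBPS base directory are assumptions chosen to match the example.

```javascript
const path = require('path');

// spineHrefs: EPUB-internal paths of the spine items, in spine order.
// baseDir: directory against which the anchor's href is resolved.
function serializeAnchor(href, spineHrefs, baseDir) {
  // Split the href into its file part and its optional "#fragment".
  const hashAt = href.indexOf('#');
  const file = hashAt === -1 ? href : href.slice(0, hashAt);
  const fragment = hashAt === -1 ? '' : href.slice(hashAt);
  const index = spineHrefs.indexOf(path.posix.join(baseDir, file));
  // Leave the link untouched when it does not point at a spine item.
  return index === -1 ? href : `${index}${fragment}`;
}

const spine = [
  'OEBPS/Text/Section0001.xhtml',
  'OEBPS/Text/Section0002.xhtml',
  'OEBPS/Text/Section0003.xhtml',
];
console.log(serializeAnchor('./Text/Section0002.xhtml#title', spine, 'OEBPS'));
// prints "1#title"
```

Links that do not resolve to a spine item, such as external URLs, are returned unchanged in this sketch.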


ignoreScript: boolean

If true, ignore all scripts within the HTML.

Default: false


removeAtrules: string[]

Remove the specified at-rules.

Default: []


removeTagSelector: string[]

Remove selectors that point to the specified tags.

Default: []


removeIdSelector: string[]

Remove selectors that point to the specified ids.

Default: []


removeClassSelector: string[]

Remove selectors that point to the specified classes.

Default: []
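
Putting the read options together, a readOptions object might look like the sketch below. All values are examples only, and the exact string format the remove* options expect (for instance, at-rule names without a leading '@') is an assumption that should be checked against the library.

```javascript
// Example values only; adjust to the EPUBs being read.
const readOptions = {
  force: false,                         // do not swallow parser exceptions
  basePath: '/books/foo',               // prefix for paths used by spines and CSS
  extractBody: true,                    // return only the body element
  serializedAnchor: false,              // keep file paths in anchors
  ignoreScript: true,                   // drop scripts from the HTML
  removeAtrules: ['charset', 'import'], // assumed format: names without '@'
  removeTagSelector: ['html', 'body'],  // assumed format: bare tag names
  removeIdSelector: [],
  removeClassSelector: [],
};

// With a parser instance: parser.readItems(book.spines, readOptions)
```

Options that are left out fall back to the defaults listed above.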

License

MIT

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

0.7.3 (2021-06-02)

Changed

  • Optimizing bundle size.
  • Ignore script in event attribute with ignoreScript option.

0.7.2 (2021-06-02)

Changed

  • Improve PDF parsing speed.
  • Bump PDF.js version to 2.5.207.

Fixed

  • Fix an issue where title in OPF could not be read when it was identified by id.
  • Fix an issue where inline styles could not be read.

0.7.1 (2020-11-13)

Added

  • Types for TypeScript support.
  • Documentation.

0.7.0 (2020-10-31)

Changed

  • Replace unzipper with adm-zip.
  • Changed Parsers to accept async CryptoProvider methods.
  • Added an option to the CryptoProvider to not handle stream in chunks.

Fixed

  • Fix a bug where path has an additional slash in Windows.

0.6.15 (2020-09-05)

Changed

  • Dependencies and babel updates. (on babel 7)

Fixed

  • Fix an issue where scheme was broken when using URL as basePath.
  • Fix an issue where order of spines in OPF is mixed when order of spines does not match manifest in OPF.

0.6.14 (2020-03-16)

Changed

  • Add feature to ignore percent encoding when matching files and items.

0.6.13 (2020-03-04)

Changed

  • Fix to ignore NavPoint if it cannot find SpineItem that maps to NavPoint.

0.6.12 (2020-03-03)

Fixed

  • Fix an issue where the first page was always calculated as undefined when calculating pages for a PDF outline.

0.6.11 (2020-02-26)

Fixed

  • Fix an issue where the parser crashed when parsing spines without an html or body tag.

0.6.10 (2020-01-22)

0.6.9 (2020-01-22)

Fixed

  • Fix an issue that could crash during PDF outline parsing.

0.6.8 (2020-01-21)

Fixed

  • Fix an issue that could crash during EPUB metadata parsing.

0.6.7 (2019-11-12)

Changed

  • Revert "Dependencies and babel updates".
  • Specify cross-dependency version numbers exactly.

0.6.6 (2019-11-05)

0.6.5 (2019-11-04)

Changed

  • Dependencies and babel updates.

0.6.4 (2019-10-04)

Fixed

  • Fix an issue where the hash function cannot be used externally.

0.6.3 (2019-10-03)

Added

  • Add hash function.

Fixed

  • Fix an issue where the parser raised an error for an outline whose page could not be inferred.

0.6.2 (2019-09-10)

Added

  • Add PdfParser.parseOptions.fakeWorker option. (default: false)

0.6.1 (2019-08-09)

Changed

  • Replace html and body styles with namespace when EpubParser.parseOptions.parseStyle and EpubParser.readOptions.extractBody are used together.

Fixed

  • Fix an issue where encrypted zip file could not be opened.
  • Fix an issue where the unzipping process terminated if CryptoProvider.bufferSize was larger than the size of the file being unzipped.

0.6.0 (2019-08-04)

Added

  • Add pdf-parser package.
  • Add EpubParser.parseOptions.additionalInlineStyle option. (default: undefined)
  • Add CryptoProvider.bufferSize property.

Changed

  • Remove Version.isValid property.
  • Improve encryption and decryption performance.

0.5.8 (2019-07-03)

Added

  • Add Parser.unzip(unzipPath, overwrite) method.

Changed

  • Implement Parser.parseOptions.overwrite option.

0.5.7 (2019-07-03)

Added

  • Add EpubParser.readOptions.ignoreScript option. (default: false)

Changed

  • Rename EpubParser.readOptions.removeTags to .removeTagSelector.
  • Rename EpubParser.readOptions.removeIds to .removeIdSelector.
  • Rename EpubParser.readOptions.removeClasses to .removeClassSelector.

0.5.6 (2019-06-12)

Fixed

  • Fix an issue where an invalid path was generated when a URI contained unusable characters.

0.5.5 (2019-05-15)

Fixed

  • Fix a malfunction when parsing corrupted CSS.
  • Fix an issue where EpubParser.parseOptions.basePath option is not reflected in image for svg.

0.5.4 (2019-05-14)

Changed

  • Rename Cryptor to AesCryptor.

0.5.3 (2019-04-01)

Changed

  • Change the language field to accept multiple values.

Fixed

  • Fix an issue where an EBADF error occurred intermittently when unzipping.

0.5.2 (2019-02-18)

Fixed

  • Fix an issue where the directory cache file was not overwritten.

0.5.1 (2019-02-14)

Fixed

  • Fix an issue where cache values were broken when saving characters outside the ASCII range.

0.5.0 (2019-02-13)

Added

  • Add LogLevel.DEBUG and debug log in Parser.
  • Add logLevel parameter for Parser.constructor.
  • Add error code for Cryptor internal error.

Changed

  • Change Logger.logLevel default. (error => warning)
  • Rename LogLevel.WARNING to LogLevel.WARN.

Fixed

  • Fix an issue where subpath sort was not natural.

0.4.1 (2019-02-12)

Changed

  • Improve performance of parsing.

0.4.0 (2019-02-12)

Added

  • Add ComicParser.parseOptions.parseImageSize option.
  • Add ComicBook.Item.width and ComicBook.Item.height.

Changed

  • Rename ComicBook.Item.size to ComicBook.Item.fileSize.

0.3.1 (2019-01-31)

Fixed

  • Fix an issue where JSON parsing errors occurred in directory cache data when attempting to read items from the same Book on multiple processes.

0.3.0 (2019-01-27)

Added

  • Add comic-parser, parser-core and content-parser.
  • Add Logger that can control all console logs and log execution time for each method in Parser.
  • Add Parser.onProgress property.
  • Add Parser.readOptions.force option.

Changed

  • Configure multi-packages environment using Lerna.
  • CryptoProvider refactoring.
  • Remove EpubParser.parseOptions.ignoreLinear option.
  • Cache to subdirectory parsing result.

Fixed

  • Fix an issue where spine was always undefined for a NavPoint that has an anchor or is two levels deep.
  • Fix an issue where a string was broken at 16,384-byte intervals when encrypting or decrypting.
  • Fix an issue where unzipping failed under certain conditions.
  • Fix bad file descriptor error on unzipping.

0.2.0 (2018-11-19)

Added

  • Add EpubParser.readOptions.serializedAnchor option.
  • Add Author.fileAs property.
  • Add encrypt and decrypt function.

Changed

  • Change EpubParser.parseOptions.ignoreLinear option default. (true => false)
  • Change EpubParser.parseOptions.useStyleNamespace option default. (false => true)
  • Change EpubParser.readOptions structure.
  • Remove EpubParser.readOptions.usingCssOptions and EpubParser.parseOptions.validateXml option.
  • Rename useStyleNamespace to parseStyle in EpubParser.parseOptions.
  • Rename SpineItem.spineIndex to SpineItem.index.

Fixed

  • Fix an issue where ncx could not be found in opf, and EpubParser.parseOptions.allowNcxFileMissing was false, but no exception was thrown.
  • Fix an issue where Book.spines order does not match spine order of OPF.

0.1.1 (2018-10-08)

Fixed

  • Fix invalid class name for style namespace.

0.1.0 (2018-09-12)

Added

  • Add overwrite option.
  • Add spine.uesCssOptions option.

Changed

  • Remove spine.extractAdapter option.
  • Remove createIntermediateDirectories and removePreviousFile options. (replaced by overwrite option)
  • Change css.removeAtrules option default.
  • Improve parsing of epub version.
  • Simplify the return type of readItem and readItems.

Fixed

  • Fix an issue where cssParser could not handle URLs that are not wrapped in a string.
  • Fix an issue where cssParser did not ignore the :not(x) function.

0.0.2 (2018-09-11)

Fixed

  • Fix broken export/import.

0.0.1 (2018-08-30)

  • First release.