@ridi/epub-parser

ridi | 242 | MIT | 0.7.4-alpha.1

Common EPUB2 data parser for Ridibooks services

EPUB, EPUB2, parser, serialize, deserialize, unzip, read, crypto

Features

- [x] EPUB2 parsing
- [ ] EPUB3 parsing
- [x] Package validation with option
- [x] Unzip epub file when parsing with options
- [x] Read files
  - [x] Extract inner HTML of body in Spine with option
  - [x] Change base path of Spine, CSS and Inline style with option
  - [x] Customize CSS, Inline Style with options
  - [ ] Truncate inner HTML of body in Spine with options
  - [ ] Minify HTML, CSS, Inline Style with options
- [x] Encrypt and decrypt function when parsing or reading or unzipping
- [ ] More spec
  - [ ] encryption.xml
  - [ ] manifest.xml
  - [ ] metadata.xml
  - [ ] rights.xml
  - [ ] signatures.xml
- [ ] Debug mode
- [ ] Environment
  - [x] Node
  - [ ] CLI
  - [ ] Browser
- [ ] Online demo

Install

npm install @ridi/epub-parser

Usage

Basic:

import { EpubParser } from '@ridi/epub-parser';
// or const { EpubParser } = require('@ridi/epub-parser');

const parser = new EpubParser('./foo/bar.epub'); // or the path of an unzipped directory: './unzippedPath'
parser.parse(/* { parseOptions } */).then((book) => {
  parser.readItems(book.spines/*, { readOptions } */).then((results) => {
    // ...
  });
  // ...
});
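
The same flow can also be written with async/await. The following is a minimal sketch, assuming `parser` is an EpubParser instance created as above. `readSpines` is a hypothetical helper name (not part of the library), and the sketch assumes that `readItems` returns results in the same order as the items passed in.

```javascript
// Hypothetical helper: parse the book, then read every spine in one call.
// `parser` is assumed to expose parse() and readItems() as described in this README.
async function readSpines(parser, parseOptions = {}, readOptions = {}) {
  const book = await parser.parse(parseOptions);
  const results = await parser.readItems(book.spines, readOptions);
  // Pair each spine with its content (assumes results keep the order of book.spines).
  return book.spines.map((spine, i) => ({ href: spine.href, content: results[i] }));
}
```

Because both calls return Promises, any exception raised while parsing or reading surfaces as a rejection of the Promise returned by `readSpines`.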

with AesCryptor:

import { EpubParser, CryptoProvider, AesCryptor } from '@ridi/epub-parser';
// or const { EpubParser, CryptoProvider, AesCryptor } = require('@ridi/epub-parser');

const { Purpose } = CryptoProvider;
const { Mode, Padding } = AesCryptor;

class ContentCryptoProvider extends CryptoProvider {
  constructor(key) {
    super();
    this.cryptor = new AesCryptor(Mode.ECB, { key });
  }

  getCryptor(filePath, purpose) {
    return this.cryptor;
  }

  // If used as follows:
  // const provider = new ContentCryptoProvider(...);
  // const parser = new EpubParser('encrypted.epub', provider);
  // const book = await parser.parse({ unzipPath: ... });
  // const firstSpine = await parser.readItem(book.spines[0]);
  //
  // run() will be called as follows:
  // 1. run(data, 'encrypted.epub', Purpose.READ_IN_DIR)
  // 2. run(data, 'META-INF/container.xml', Purpose.READ_IN_ZIP)
  // 3. run(data, 'OEBPS/content.opf', Purpose.READ_IN_ZIP)
  // ...
  // 4. run(data, 'mimetype', Purpose.WRITE)
  // ...
  // 5. run(data, 'OEBPS/Text/Section0001.xhtml', Purpose.READ_IN_DIR)
  //
  run(data, filePath, purpose) {
    const cryptor = this.getCryptor(filePath, purpose);
    const padding = Padding.AUTO;
    if (purpose === Purpose.READ_IN_DIR) {
      return cryptor.decrypt(data, { padding });
    } else if (purpose === Purpose.WRITE) {
      return cryptor.encrypt(data, { padding });
    }
    return data;
  }
}

const cryptoProvider = new ContentCryptoProvider(key);
const parser = new EpubParser('./encrypted.epub', cryptoProvider); // or the path of an unzipped directory: './unzippedPath'

Log level setting:

import { EpubParser, LogLevel } from '@ridi/epub-parser';
const parser = new EpubParser(path, cryptoProvider, logLevel);
// or const parser = new EpubParser(path, logLevel);
parser.logger.logLevel = LogLevel.VERBOSE; // SILENT, ERROR, WARN (default), INFO, DEBUG, VERBOSE

API

parse(parseOptions)

Returns Promise<EpubBook>.

EpubBook: Instance with metadata, spine list, table of contents, etc.

Throws an exception if parsing fails.

parseOptions: `?object`

readItem(item, readOptions)

Returns a Promise that resolves to a string or a Buffer, depending on the item type:

SpineItem, CssItem, InlineCssItem, NcxItem, SvgItem:
- string
Other items:
- Buffer

Throws an exception if reading fails.

item: `Item` (see: Item Types)

readOptions: `?object`

readItems(items, readOptions)

Returns a Promise that resolves to string[] or Buffer[], depending on the item type:

SpineItem, CssItem, InlineCssItem, NcxItem, SvgItem:
- string[]
Other items:
- Buffer[]

Throws an exception if reading fails.

items: `Item[]` (see: Item Types)

readOptions: `?object`

unzip(unzipPath, overwrite)

Returns Promise<boolean>.

If the result is true, the unzip succeeded or the file had already been unzipped.

Throws an exception if unzipping fails.

unzipPath: `string`

overwrite: `boolean`

onProgress = callback(step, totalStep, action)

Reports the parser's progress through a callback.

const { Action } = EpubParser; // PARSE, READ_ITEMS
parser.onProgress = (step, totalStep, action) => {
  console.log(`[${action}] ${step} / ${totalStep}`);
};
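
A callback with this signature can also report a percentage. This is a minimal sketch; `makeProgressReporter` is a hypothetical factory, not part of the library, and the action string passed in the usage note is a placeholder because the actual `Action` values are not listed here.

```javascript
// Hypothetical factory: builds an onProgress-compatible callback that adds a percentage.
// `log` defaults to console.log and can be swapped for any function taking a string.
function makeProgressReporter(log = console.log) {
  return (step, totalStep, action) => {
    // Guard against a zero total to avoid dividing by zero.
    const percent = totalStep > 0 ? Math.round((step / totalStep) * 100) : 0;
    log(`[${action}] ${step} / ${totalStep} (${percent}%)`);
  };
}
```

It would be attached as `parser.onProgress = makeProgressReporter();`.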

Model

EpubBook

titles: string[]
creators: Author[]
subjects: string[]
description: ?string
publisher: ?string
contributors: Author[]
dates: DateTime[]
type: ?string
format: ?string
identifiers: Identifier[]
source: ?string
languages: string[]
relation: ?string
rights: ?string
version: Version
metas: Meta[]
items: Item[]
spines: SpineItem[]
ncx: ?NcxItem
fonts: FontItem[]
cover: ?ImageItem
images: ImageItem[]
styles: CssItem[]
guides: Guide[]
deadItems: DeadItem[]
toRaw(): object

Author

name: ?string
fileAs: ?string
role: string (Default: Author.Roles.UNDEFINED)
toRaw(): object

Author.Roles

| Type | Value |
| --- | --- |
| UNDEFINED | undefined |
| UNKNOWN | unknown |
| ADAPTER | adp |
| ANNOTATOR | ann |
| ARRANGER | arr |
| ARTIST | art |
| ASSOCIATEDNAME | asn |
| AUTHOR | aut |
| AUTHOR_IN_QUOTATIONS_OR_TEXT_EXTRACTS | aqt |
| AUTHOR_OF_AFTER_WORD_OR_COLOPHON_OR_ETC | aft |
| AUTHOR_OF_INTRODUCTIONOR_ETC | aui |
| BIBLIOGRAPHIC_ANTECEDENT | ant |
| BOOK_PRODUCER | bkp |
| COLLABORATOR | clb |
| COMMENTATOR | cmm |
| DESIGNER | dsr |
| EDITOR | edt |
| ILLUSTRATOR | ill |
| LYRICIST | lyr |
| METADATA_CONTACT | mdc |
| MUSICIAN | mus |
| NARRATOR | nrt |
| OTHER | oth |
| PHOTOGRAPHER | pht |
| PRINTER | prt |
| REDACTOR | red |
| REVIEWER | rev |
| SPONSOR | spn |
| THESIS_ADVISOR | ths |
| TRANSCRIBER | trc |
| TRANSLATOR | trl |

DateTime

value: ?string
event: string (Default: DateTime.Events.UNDEFINED)
toRaw(): object

DateTime.Events

| Type | Value |
| --- | --- |
| UNDEFINED | undefined |
| UNKNOWN | unknown |
| CREATION | creation |
| MODIFICATION | modification |
| PUBLICATION | publication |

Identifier

value: ?string
scheme: string (Default: Identifier.Schemes.UNDEFINED)
toRaw(): object

Identifier.Schemes

| Type | Value |
| --- | --- |
| UNDEFINED | undefined |
| UNKNOWN | unknown |
| DOI | doi |
| ISBN | isbn |
| ISBN13 | isbn13 |
| ISBN10 | isbn10 |
| ISSN | issn |
| UUID | uuid |
| URI | uri |
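
As an illustration of how the scheme values might be used, the sketch below picks an ISBN out of `EpubBook.identifiers`. `findIsbn` is a hypothetical helper, and it assumes that `Identifier.scheme` holds the strings from the Value column above.

```javascript
// Scheme strings taken from the Identifier.Schemes table.
const ISBN_SCHEMES = ['isbn', 'isbn13', 'isbn10'];

// Hypothetical helper: return the first non-empty identifier value with an ISBN scheme.
function findIsbn(identifiers) {
  const match = identifiers.find((id) => ISBN_SCHEMES.includes(id.scheme) && id.value);
  return match ? match.value : undefined;
}
```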

Guide

title: ?string
type: string (Default: Guide.Types.UNDEFINED)
href: ?string
item: ?Item
toRaw(): object

Guide.Types

| Type | Value |
| --- | --- |
| UNDEFINED | undefined |
| UNKNOWN | unknown |
| COVER | cover |
| TITLE_PAGE | title-page |
| TOC | toc |
| INDEX | index |
| GLOSSARY | glossary |
| ACKNOWLEDGEMENTS | acknowledgements |
| BIBLIOGRAPHY | bibliography |
| COLOPHON | colophon |
| COPYRIGHT_PAGE | copyright-page |
| DEDICATION | dedication |
| EPIGRAPH | epigraph |
| FOREWORD | foreword |
| LOI | loi |
| LOT | lot |
| NOTES | notes |
| PREFACE | preface |
| TEXT | text |

Item Types

Item

id: ?string
href: ?string
mediaType: ?string
size: ?number
isFileExists: boolean (size !== undefined)
toRaw(): object

SpineItem (extend Item)

index: number (Default: undefined)
isLinear: boolean (Default: true)
styles: ?CssItem[]
first: ?SpineItem
prev: ?SpineItem
next: ?SpineItem
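
Since each SpineItem links to its neighbours, the reading order can be walked without indexing into `EpubBook.spines`. The sketch below is an illustration only: `linearHrefs` is a hypothetical helper, and it assumes that `next` is undefined (or null) on the last spine.

```javascript
// Hypothetical helper: follow `next` from a starting spine and collect the
// hrefs of the spines marked as linear.
function linearHrefs(firstSpine) {
  const hrefs = [];
  for (let spine = firstSpine; spine; spine = spine.next) {
    if (spine.isLinear) hrefs.push(spine.href);
  }
  return hrefs;
}
```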

NcxItem (extend Item)

navPoints: NavPoint[]

CssItem (extend Item)

namespace: string

InlineCssItem (extend CssItem)

style: string (Default: '')

ImageItem (extend Item)

isCover: boolean (Default: false)

SvgItem (extend ImageItem)

FontItem (extend Item)

DeadItem (extend Item)

reason: string (Default: DeadItem.Reason.UNDEFINED)

DeadItem.Reason

| Type | Value |
| --- | --- |
| UNDEFINED | undefined |
| UNKNOWN | unknown |
| NOT_EXISTS | not_exists |
| NOT_SPINE | not_spine |
| NOT_NCX | not_ncx |
| NOT_SUPPORT_TYPE | not_support_type |
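
When diagnosing a problematic book, it can help to group `EpubBook.deadItems` by reason. A minimal sketch, assuming `DeadItem.reason` holds the strings from the Value column above; `groupDeadItems` is a hypothetical helper.

```javascript
// Hypothetical helper: map each reason string to the hrefs of the dead items that carry it.
function groupDeadItems(deadItems) {
  const groups = {};
  for (const item of deadItems) {
    if (!groups[item.reason]) groups[item.reason] = [];
    groups[item.reason].push(item.href);
  }
  return groups;
}
```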

NavPoint

id: ?string
label: ?string
src: ?string
anchor: ?string
depth: number (Default: 0)
children: NavPoint[]
spine: ?SpineItem
toRaw(): object
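
NavPoints form a tree through `children`. The sketch below flattens that tree into a list in document order, which is convenient for rendering a table of contents. `flattenNavPoints` is a hypothetical helper, not part of the library.

```javascript
// Hypothetical helper: depth-first walk over the NavPoint tree.
// Each entry keeps the depth so a renderer can indent it.
function flattenNavPoints(navPoints, out = []) {
  for (const navPoint of navPoints) {
    out.push({ depth: navPoint.depth, label: navPoint.label, src: navPoint.src });
    flattenNavPoints(navPoint.children || [], out);
  }
  return out;
}
```

Since `EpubBook.ncx` may be missing, a caller would typically pass `book.ncx ? book.ncx.navPoints : []`.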

Version

major: number
minor: number
patch: number
toString(): string

Parse Options

validatePackage
allowNcxFileMissing
unzipPath
overwrite
parseStyle
styleNamespacePrefix
additionalInlineStyle

validatePackage: `boolean`

If true, validate the package against the IDPF specifications listed below.

Used only if the input is an EPUB file.

- The ZIP header should not be corrupted.
- The mimetype file must be the first file in the archive.
- The mimetype file should not be compressed.
- The mimetype file should contain only the string application/epub+zip.
- The mimetype file should not use the extra field feature of the ZIP format.

Default: false
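
To make the rules above concrete, the following sketch applies them to the first local file header of a ZIP archive held in a Buffer. It is an illustration based on the standard ZIP local file header layout, not the library's implementation. `checkMimetypeEntry` and the problem messages are hypothetical, and entries that use a data descriptor are not handled.

```javascript
const LOCAL_FILE_HEADER = 0x04034b50; // bytes "PK\x03\x04" read as little-endian
const EXPECTED_MIMETYPE = 'application/epub+zip';

// Returns a list of rule violations; an empty list means every check passed.
function checkMimetypeEntry(buffer) {
  if (buffer.length < 30 || buffer.readUInt32LE(0) !== LOCAL_FILE_HEADER) {
    return ['ZIP header is corrupted'];
  }
  const problems = [];
  const compression = buffer.readUInt16LE(8); // 0 means "stored" (uncompressed)
  const compressedSize = buffer.readUInt32LE(18);
  const nameLength = buffer.readUInt16LE(26);
  const extraLength = buffer.readUInt16LE(28);
  const name = buffer.toString('ascii', 30, 30 + nameLength);
  if (name !== 'mimetype') problems.push('mimetype is not the first file');
  if (compression !== 0) problems.push('mimetype is compressed');
  if (extraLength !== 0) problems.push('mimetype uses the extra field');
  // The file data starts right after the name and the extra field.
  const start = 30 + nameLength + extraLength;
  const content = buffer.toString('ascii', start, start + compressedSize);
  if (content !== EXPECTED_MIMETYPE) problems.push('mimetype content is not application/epub+zip');
  return problems;
}
```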

allowNcxFileMissing: `boolean`

If false, stop parsing when the NCX file does not exist.

Default: true

unzipPath: `?string`

If specified, unzip to that path.

Used only if the input is an EPUB file.

Default: undefined

overwrite: `boolean`

If true, overwrite the contents of unzipPath when unzipping.

Used only if unzipPath is specified.

Default: true

parseStyle: `boolean`

If true, the styles used by each spine are described, and one namespace is assigned per CSS file or inline style.

Otherwise, CssItem.namespace and SpineItem.styles are undefined.

In any list, InlineCssItem is always positioned after CssItem. (EpubBook.styles, EpubBook.items, SpineItem.styles, ...)

Default: true

styleNamespacePrefix: `string`

Prepend the given string to each namespace for identification.

Available only if parseStyle is true.

Default: 'ridi_style'

additionalInlineStyle: `?string`

If specified, add the given inline style to all spines.

Available only if parseStyle is true.

Default: undefined
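
For reference, the documented defaults can be collected into one object and overridden selectively. This is only a sketch: `defaultParseOptions` is a hypothetical name, and the `./unzipped` path is a placeholder.

```javascript
// Defaults as documented above (not read from the library itself).
const defaultParseOptions = {
  validatePackage: false,
  allowNcxFileMissing: true,
  unzipPath: undefined,
  overwrite: true,
  parseStyle: true,
  styleNamespacePrefix: 'ridi_style',
  additionalInlineStyle: undefined,
};

// Override only what differs from the defaults.
const parseOptions = {
  ...defaultParseOptions,
  validatePackage: true,
  unzipPath: './unzipped', // placeholder path
};
```

The resulting object would be passed as `parser.parse(parseOptions)`.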

Read Options

force
basePath
extractBody
serializedAnchor
ignoreScript
removeAtrules
removeTagSelector
removeIdSelector
removeClassSelector

force: `boolean`

If true, ignore any exceptions that occur within the parser.

Default: false

basePath: `?string`

If specified, change the base path of the paths used by spines and CSS.

HTML: SpineItem

...
  <!-- Before -->
  <div>
    <img src="../Images/cover.jpg">
  </div>
  <!-- After -->
  <div>
    <img src="{basePath}/OEBPS/Images/cover.jpg">
  </div>
...

CSS: CssItem, InlineCssItem

/* Before */
@font-face {
  font-family: NotoSansRegular;
  src: url("../Fonts/NotoSans-Regular.ttf");
}
/* After */
@font-face {
  font-family: NotoSansRegular;
  src: url("{basePath}/OEBPS/Fonts/NotoSans-Regular.ttf");
}

Default: undefined
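
The Before/After pairs above amount to resolving each relative reference against the location of the file that contains it, then prefixing the result with basePath. The sketch below illustrates that idea with Node's `path.posix`. It is not the library's implementation: `rebase` is a hypothetical helper, the spine location used in the usage note is assumed, and only filesystem-style base paths are handled (a URL used as basePath would need separate treatment).

```javascript
const path = require('path');

// Illustration only: resolve `reference` relative to the directory of
// `ownerHref` (the spine or CSS file that contains it), under `basePath`.
function rebase(basePath, ownerHref, reference) {
  const ownerDir = path.posix.dirname(ownerHref);
  // join() also normalizes the "../" segments.
  return path.posix.join(basePath, ownerDir, reference);
}
```

For a spine assumed to live at `OEBPS/Text/Section0001.xhtml`, `rebase('/books/1', 'OEBPS/Text/Section0001.xhtml', '../Images/cover.jpg')` yields `/books/1/OEBPS/Images/cover.jpg`.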

extractBody: `boolean|function`

If true, extract the body. Otherwise, the full string is returned. If a function is specified instead of true, that function is used to transform the body.

false:

'<!doctype><html>\n<head>\n</head>\n<body style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</body>\n</html>'

true:

'<body style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</body>'

function:

readOptions.extractBody = (innerHTML, attrs) => {
  // Serialize the body attributes, e.g. style="background-color: #000000;"
  const string = attrs.map((attr) => `${attr.key}="${attr.value}"`).join(' ');
  return `<article ${string}>${innerHTML}</article>`;
};

'<article style="background-color: #000000;">\n  <p>Extract style</p>\n  <img src=\"../Images/api-map.jpg\"/>\n</article>'

Default: false

serializedAnchor: `boolean`

If true, replace the file path of each anchor in a spine with the index of the spine it points to.

...
<spine toc="ncx">
  <itemref idref="Section0001.xhtml"/> <!-- index: 0 -->
  <itemref idref="Section0002.xhtml"/> <!-- index: 1 -->
  <itemref idref="Section0003.xhtml"/> <!-- index: 2 -->
  ...
</spine>
...

<!-- Before -->
<a href="./Text/Section0002.xhtml#title">Chapter 2</a>
<!-- After -->
<a href="1#title">Chapter 2</a>

Default: false
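
The replacement above can be pictured as a lookup of the resolved anchor target in the spine order. The sketch below is an illustration, not the library's implementation: `serializeAnchor` is a hypothetical helper, and the spine hrefs and the location of the file containing the anchor are assumed.

```javascript
const path = require('path');

// Illustration only: rewrite `href` to "<spine index>#<fragment>" when its
// target is one of `spineHrefs`; otherwise return it unchanged.
function serializeAnchor(href, ownerHref, spineHrefs) {
  const [file, fragment] = href.split('#');
  // Resolve the target relative to the file that contains the anchor.
  const target = path.posix.join(path.posix.dirname(ownerHref), file);
  const index = spineHrefs.indexOf(target);
  if (index < 0) return href; // not a spine: leave untouched
  return fragment === undefined ? `${index}` : `${index}#${fragment}`;
}
```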

ignoreScript: `boolean`

If true, ignore all scripts within the HTML.

Default: false

removeAtrules: `string[]`

Remove the specified at-rules.

Default: []

removeTagSelector: `string[]`

Remove selectors that point to the specified tags.

Default: []

removeIdSelector: `string[]`

Remove selectors that point to the specified ids.

Default: []

removeClassSelector: `string[]`

Remove selectors that point to the specified classes.

Default: []
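
As with the parse options, the documented read option defaults can be gathered into one object and overridden as needed. A minimal sketch; `defaultReadOptions` is a hypothetical name and the values are taken from the descriptions above, not read from the library.

```javascript
// Defaults as documented above.
const defaultReadOptions = {
  force: false,
  basePath: undefined,
  extractBody: false,
  serializedAnchor: false,
  ignoreScript: false,
  removeAtrules: [],
  removeTagSelector: [],
  removeIdSelector: [],
  removeClassSelector: [],
};

// Example override: return only the body and drop scripts.
const readOptions = {
  ...defaultReadOptions,
  extractBody: true,
  ignoreScript: true,
};
```

The resulting object would be passed as the second argument of `readItem` or `readItems`.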

License

MIT

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

Unreleased

0.7.3 (2021-06-02)

Changed

Optimize bundle size.
Ignore script in event attribute with ignoreScript option.

0.7.2 (2021-06-02)

Changed

Improve PDF parsing speed.
Bump PDF.js version to 2.5.207.

Fixed

Fix an issue where title in OPF could not be read when it was identified by id.
Fix an issue where inline styles could not be read.

0.7.1 (2020-11-13)

Added

Types for Typescript support.
Documentation.

0.7.0 (2020-10-31)

Changed

Replace unzipper with adm-zip.
Changed Parsers to accept async CryptoProvider methods.
Added an option to the CryptoProvider to not handle stream in chunks.

Fixed

Fix a bug where path has an additional slash in Windows.

0.6.15 (2020-09-05)

Changed

Dependencies and babel updates. (on babel 7)

Fixed

Fix an issue where scheme was broken when using URL as basePath.
Fix an issue where order of spines in OPF is mixed when order of spines does not match manifest in OPF.

0.6.14 (2020-03-16)

Changed

Add feature to ignore percent encoding when matching files and items.

0.6.13 (2020-03-04)

Changed

Fix to ignore NavPoint if it cannot find SpineItem that maps to NavPoint.

0.6.12 (2020-03-03)

Fixed

Fix an issue where the first page was always calculated as undefined when calculating pages for a PDF outline.

0.6.11 (2020-02-26)

Fixed

Fix an issue where the parser crashed when parsing spines without an html or body tag.

0.6.10 (2020-01-22)

0.6.9 (2020-01-22)

Fixed

Fix an issue that could crash during PDF outline parsing.

0.6.8 (2020-01-21)

Fixed

Fix an issue that could crash during EPUB metadata parsing.

0.6.7 (2019-11-12)

Changed

Revert "Dependencies and babel updates".
Specify cross-dependency version numbers exactly.

0.6.6 (2019-11-05)

0.6.5 (2019-11-04)

Changed

Dependencies and babel updates.

0.6.4 (2019-10-04)

Fixed

Fix an issue where the hash function cannot be used externally.

0.6.3 (2019-10-03)

Added

Add hash function.

Fixed

Fix an issue where the parser raised an error for an outline whose page could not be inferred.

0.6.2 (2019-09-10)

Added

Add PdfParser.parseOptions.fakeWorker option. (default: false)

0.6.1 (2019-08-09)

Changed

Replace html and body styles with namespace when EpubParser.parseOptions.parseStyle and EpubParser.readOptions.extractBody are used together.

Fixed

Fix an issue where encrypted zip file could not be opened.
Fix an issue where the unzipping process terminated if CryptoProvider.bufferSize was larger than the size of the file being unzipped.

0.6.0 (2019-08-04)

Added

Add pdf-parser package.
Add EpubParser.parseOptions.additionalInlineStyle option. (default: undefined)
Add CryptoProvider.bufferSize property.

Changed

Remove Version.isValid property.
Improve encryption and decryption performance.

0.5.8 (2019-07-03)

Added

Add Parser.unzip(unzipPath, overwrite) method.

Changed

Implement Parser.parseOptions.overwrite option.

0.5.7 (2019-07-03)

Added

Add EpubParser.readOptions.ignoreScript option. (default: false)

Changed

Rename EpubParser.readOptions.removeTags to .removeTagSelector.
Rename EpubParser.readOptions.removeIds to .removeIdSelector.
Rename EpubParser.readOptions.removeClasses to .removeClassSelector.

0.5.6 (2019-06-12)

Fixed

Fix an issue where an invalid path was generated when a URI contained unusable characters.

0.5.5 (2019-05-15)

Fixed

Fix a malfunction when parsing corrupted CSS.
Fix an issue where EpubParser.parseOptions.basePath option is not reflected in image for svg.

0.5.4 (2019-05-14)

Changed

Rename Cryptor to AesCryptor.

0.5.3 (2019-04-01)

Changed

Change the language field to accept multiple values.

Fixed

Fix an issue where intermittently EBADF error occurred when unzipping.

0.5.2 (2019-02-18)

Fixed

Fix an issue where the directory cache file was not overwritten.

0.5.1 (2019-02-14)

Fixed

Fix an issue where cache values were broken when the saved data contained characters outside the ASCII range.

0.5.0 (2019-02-13)

Added

Add LogLevel.DEBUG and debug log in Parser.
Add logLevel parameter for Parser.constructor.
Add error code for Cryptor internal error.

Changed

Change Logger.logLevel default. (error => warning)
Rename LogLevel.WARNING to LogLevel.WARN.

Fixed

Fix an issue where subpath sort was not natural.

0.4.1 (2019-02-12)

Changed

Improve performance of parsing.

0.4.0 (2019-02-12)

Added

Add ComicParser.parseOptions.parseImageSize option.
Add ComicBook.Item.width and ComicBook.Item.height.

Changed

Rename ComicBook.Item.size to ComicBook.Item.fileSize.

0.3.1 (2019-01-31)

Fixed

Fix an issue where JSON parsing errors occurred in directory cache data when attempting to read items from the same Book in multiple processes.

0.3.0 (2019-01-27)

Added

Add comic-parser, parser-core and content-parser.
Add Logger that can control all console logs and log execution time for each method in Parser.
Add Parser.onProgress property.
Add Parser.readOptions.force option.

Changed

Configure multi-packages environment using Lerna.
CryptoProvider refactoring.
Remove EpubParser.parseOptions.ignoreLinear option.
Cache to subdirectory parsing result.

Fixed

Fix an issue where spine was always undefined for a NavPoint that has an anchor or a depth of two.
Fix an issue where string is broken at 16,384 byte intervals when en/decrypting.
Fix an issue where files could not be unzipped under certain conditions.
Fix bad file descriptor error on unzipping.

0.2.0 (2018-11-19)

Added

Add EpubParser.readOptions.serializedAnchor option.
Add Author.fileAs property.
Add encrypt and decrypt function.

Changed

Change EpubParser.parseOptions.ignoreLinear option default. (true => false)
Change EpubParser.parseOptions.useStyleNamespace option default. (false => true)
Change EpubParser.readOptions structure.
Remove EpubParser.readOptions.usingCssOptions and EpubParser.parseOptions.validateXml option.
Rename useStyleNamespace to parseStyle in EpubParser.parseOptions.
Rename SpineItem.spineIndex to SpineItem.index.

Fixed

Fix an issue where ncx could not be found in opf, and EpubParser.parseOptions.allowNcxFileMissing was false, but no exception was thrown.
Fix an issue where Book.spines order does not match spine order of OPF.

0.1.1 (2018-10-08)

Fixed

Fix invalid class name for style namespace.

0.1.0 (2018-09-12)

Added

Add overwrite option.
Add spine.uesCssOptions option.

Changed

Remove spine.extractAdapter option.
Remove createIntermediateDirectories and removePreviousFile options. (replaced by overwrite option)
Change css.removeAtrules option default.
Improve parsing of epub version.
Simplify return type of readItem or readItems.

Fixed

Fix an issue where cssParser cannot handle URLs that are not wrapped in a string.
Fix an issue where cssParser does not ignore :not(x) function.

0.0.2 (2018-09-11)

Fixed

Fix broken export/import.

0.0.1 (2018-08-30)

First release.
