Package detail

gutenbergscraper

whitzscott10 · ISC · 1.0.3

A scraper for Project Gutenberg that lets you scrape data into datasets; highly customizable and friendly

gutenberg, scraper, node, typescript, web scraping, gutenberg scraper, gutenberg downloader, book scraper, book downloader, gutenberg api, node.js, node scraper, http request, parallel scraping, web scraping node, gutenberg books, open source, project gutenberg, scrape books, gutenberg downloader node, scraping library, data extraction, html parser, axios, cheerio, npm scraper, scrape data, gutenberg project, scraper tool, node scraping library, csv output, json output, txt output, book metadata, ebook downloader, book format, scraper framework, gutenberg text extraction, node scraping tool, nodejs scraper, typescript scraper, books in csv, books in json, ebooks in txt, book extraction, scrape project gutenberg, gutenberg content, scrape Gutenberg project, web crawler, data scraper, automated scraping, scraping framework, nodejs web scraping, html to csv, html to json, scrape web content, book content extraction, scraping tool, extracting book data, nodejs scraping, text extraction, book data scraper, scraping project gutenberg, scrape ebooks, gutenberg library, ebook extraction, scraper node package, scraper for books, book data exporter, book web scraper, typescript web scraper, node parallelism, scraping parallel, nodejs parallel scraping, data extraction tool, scraping framework node, async scraper, npm scraper, npm scraping library, npm scraper tool, scrape from gutenberg, book scraper node, nodejs downloader, npm scraper project, node scraper typescript, parallel request scraper, scraper with retries, scrape with retries, scraping with retries, web scraping package, nodejs web scraper, scraping npm package, parallel scraping npm, scrape ebooks node, scraper npm, npm web scraper, async scraping, html to book, web scraper npm, scraper parallel, gutenberg ebooks, open books, text scraping, nodejs scraping tool, typescript web scraping, web scraping tools, scrape html content, scraping data framework, scrape content nodejs, scrape books project gutenberg, data extraction nodejs, scraper nodejs tool, web scraper typescript, gutenberg book extractor, books from gutenberg, scraping package, parallel requests scraping, scraping tools nodejs, scraping with nodejs, scraping html to csv, scraping html to json, scraping in nodejs, gutenberg book downloader, scraper for gutenberg, book scraper npm, html web scraper, gutenberg html scraper, scraping library nodejs, scraper retry nodejs, gutenberg node scraper, data extraction tool nodejs, scrape html books, scrape gutenberg content, scrape books into csv, scraper javascript, scraper for nodejs, data scraping tool, gutenberg library scrape, book download tool, gutenberg web scraping, nodejs scraper tool, scraping project gutenberg books, scraping with axios, scraping with cheerio, scraper nodejs project, scraper nodejs npm, parallel data scraping, scrape books json, scraper retry, scrape books retry, scraper csv json, scraper typescript node, scrape gutenberg books, books scraper, scraper node npm, gutenberg scraper npm, scrape from gutenberg nodejs, gutenberg content extraction, scrape gutenberg books nodejs, npm book scraper, scrape gutenberg project, scraping books nodejs, gutenberg content extractor, scraper for gutenberg books, scrape gutenberg with nodejs, scraper with axios cheerio, scraper npm package, gutenberg html data, scraper nodejs npm, scraper nodejs parallel, scraping with cheerio npm, scraping books npm, scrape book text, books from gutenberg scraper, scraper books nodejs, html extraction nodejs, scrape gutenberg library, book data extraction, scraping books to csv, scraper npm nodejs, scraping books text, nodejs scraping library, gutenberg project scraper, book data scraper node, nodejs text scraping, gutenberg scraping tool, scrape html nodejs, gutenberg metadata scraper, books scraper npm, scrape to csv, scraper for ebooks, project gutenberg scraping, scrape gutenberg text, scraper for gutenberg project, nodejs scraper npm, scraper html data, book data nodejs, scraper parallel request, scraper library node, web scraping tools npm, gutenberg text scraper, scrape gutenberg project data, scrape nodejs, scraper project gutenberg, nodejs project gutenberg, scrape books from gutenberg, scraper text extraction, html book scraper, scraper gutenberg html, scraper parallel processing, scraper nodejs retry, scrape gutenberg books json, scraper nodejs csv, scraper with cheerio html, scraping gutenberg with node, scrape from gutenberg csv, gutenberg nodejs scraper, html scraping nodejs, book extraction npm, scraping books json, scraping with axios cheerio, scrape nodejs books, scrape gutenberg html nodejs, scraper project gutenberg npm, scraping gutenberg books, scraper for books project gutenberg, scrape books text nodejs, scraper npm project, scraper for gutenberg project books, scraper books project, scraper for html to json, scraping books in nodejs, scraping to json npm, scrape html books nodejs, scrape books nodejs npm, gutenberg text extraction scraper, gutenberg books json scraper, scraper books text extraction, scraper books data, gutenberg scrape npm, scraper text to csv, gutenberg node scraper, scraping books npm, scraper gutenberg nodejs

readme

Gutenberg Scraper

The Gutenberg Scraper is a tool designed to scrape content from Project Gutenberg. But how does it work?

The Gutenberg Scraper uses parallel requests, retries, and other techniques to speed up the scraping process in Node.js applications. It is written primarily in TypeScript.

If you'd like to use this scraper, here's an example of how to set it up:

You’ll likely notice a file named index.ts. This is where you can begin. By default, it will contain some example code, such as:

import { Scraper } from './Scraper';

const scraper = new Scraper({
  useBooknum: [12, 50],  // Scrape books from 12 to 50
  FormatOutput: 'csv',   // Output format will be CSV
  userAgent: 'Mozilla/5.0',
  timeout: 5000          // Set a timeout for requests
}, 10, 3); // Scrape 10 books at once and retry 3 times in case of failure

scraper.scrape();

In this example:

  • useBooknum: [12, 50] specifies the range of books to scrape, from book number 12 to 50.
  • FormatOutput: 'csv' indicates that the output will be in CSV format. You can also choose other formats, such as TXT or JSON.
  • userAgent: 'Mozilla/5.0' sets a custom user-agent to help prevent the scraper from being blocked by the website.
  • timeout: 5000 sets the timeout for each request to 5000 milliseconds (5 seconds).

The second part of the constructor, 10 and 3, represents:

  • 10: The number of parallel requests to make at once. This allows the scraper to scrape multiple books simultaneously, speeding up the process.
  • 3: The number of retry attempts in case a request fails. If a book fails to scrape, the scraper will retry up to 3 times before giving up (a conceptual sketch of this pattern follows).
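To make the behavior of these two numbers concrete, here is a minimal, self-contained TypeScript sketch of the parallel-with-retries pattern. It illustrates the general technique only, not the package's actual internals; the function names (fetchWithRetry, scrapeInBatches) and the Gutenberg URL are assumptions made for this example, and it relies on the global fetch available in Node.js 18+:

// Illustrative sketch of "N parallel requests, M retries" — not the package's internals.
async function fetchWithRetry(bookNum: number, retries: number): Promise<string> {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      // Hypothetical target URL for a Project Gutenberg book page.
      const res = await fetch(`https://www.gutenberg.org/ebooks/${bookNum}`);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (attempt === retries) throw err; // all retries exhausted — give up on this book
    }
  }
  throw new Error('unreachable'); // the loop above always returns or throws
}

async function scrapeInBatches(bookNums: number[], parallel: number, retries: number): Promise<void> {
  for (let i = 0; i < bookNums.length; i += parallel) {
    const batch = bookNums.slice(i, i + parallel);
    // All requests in a batch run concurrently; the next batch starts once this one settles.
    await Promise.all(batch.map((n) => fetchWithRetry(n, retries)));
  }
}

With parallel = 10 and retries = 3, this processes ten books at a time and gives each failed request up to three attempts, which matches the behavior described above.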

Once you've set this up, calling scraper.scrape() starts the scraping process based on the provided configuration, writing the results in whichever format (CSV, JSON, or TXT) you chose.
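For example, assuming the same constructor shape shown above, switching the output to JSON might look like this (the book range, batch size, and retry count below are illustrative values, not defaults):

import { Scraper } from './Scraper';

const scraper = new Scraper({
  useBooknum: [1, 5],    // Scrape books 1 through 5
  FormatOutput: 'json',  // Write the results as JSON instead of CSV
  userAgent: 'Mozilla/5.0',
  timeout: 5000          // 5-second request timeout
}, 5, 2); // 5 parallel requests, 2 retries per failed request

scraper.scrape();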

To use it, first install the package by running npm i gutenbergscraper. Once that's done, you can type npm i and then npm run start directly in Command Prompt or PowerShell, and you're done!
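Put together, the commands look like this:

npm i gutenbergscraper
npm i
npm run start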