@us3r-network/metadata-scraper

BetaHuhn · 2 · MIT · 0.3.3 · TypeScript support: included

A JavaScript library for scraping/parsing metadata from a web page.

metadata, meta-tags, metatags, open-graph, html-scraper, metadata-extraction

readme

metadata-scraper

👋 Introduction

metadata-scraper is a JavaScript library that scrapes and parses metadata from web pages. Supply a URL or an HTML string, and it applies a set of rules to find the most relevant metadata, such as:

  • Title
  • Description
  • Favicons/Images
  • Language
  • Keywords
  • Author
  • and more (full list below)

🚀 Get started

Install metadata-scraper via npm:

npm install metadata-scraper

📚 Usage

Import metadata-scraper and pass it a URL or options object:

const getMetaData = require('metadata-scraper')

const url = 'https://github.com/BetaHuhn/metadata-scraper'

getMetaData(url).then((data) => {
    console.log(data)
})

Or with async/await:

const getMetaData = require('metadata-scraper')

async function run() {
    const url = 'https://github.com/BetaHuhn/metadata-scraper'
    const data = await getMetaData(url)
    console.log(data)
}

run()

This will return:

{
    title: 'BetaHuhn/metadata-scraper',
    description: 'A Javascript library for scraping/parsing metadata from a web page.',
    language: 'en',
    url: 'https://github.com/BetaHuhn/metadata-scraper',
    provider: 'GitHub',
    twitter: '@github',
    image: 'https://avatars1.githubusercontent.com/u/51766171?s=400&v=4',
    icon: 'https://github.githubassets.com/favicons/favicon.svg'
}

A list of all metadata that metadata-scraper tries to scrape appears below.

⚙️ Configuration

You can change the behaviour of metadata-scraper by passing an options object:

const getMetaData = require('metadata-scraper')

const options = {
    url: 'https://github.com/BetaHuhn/metadata-scraper', // URL of web page
    maxRedirects: 0, // Maximum number of redirects to follow (default: 5)
    ua: 'MyApp', // Specify User-Agent header
    lang: 'de-CH', // Specify Accept-Language header
    timeout: 1000, // Request timeout in milliseconds (default: 10000ms)
    forceImageHttps: false, // Force all image URLs to use https (default: true)
    customRules: {} // more info below
}

getMetaData(options).then((data) => {
    console.log(data)
})

You can specify the URL by either passing it as the first parameter, or by setting it in the options object.
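
One way to picture how the two call styles could end up as a single settings object is sketched below. `normalizeInput` and `DEFAULTS` are hypothetical names used only for illustration and are not part of metadata-scraper's API; the default values are the ones stated in the comments of the options example above.

```javascript
// Hypothetical helper, not part of metadata-scraper: shows how a URL string
// or an options object could be reduced to one settings object.
const DEFAULTS = {
    maxRedirects: 5,       // documented default
    timeout: 10000,        // documented default, in milliseconds
    forceImageHttps: true  // documented default
}

function normalizeInput(input) {
    // A plain string is treated as the URL; anything else as an options object.
    const fromInput = typeof input === 'string' ? { url: input } : input
    // Later spreads win, so user-supplied values override the defaults.
    return { ...DEFAULTS, ...fromInput }
}

console.log(normalizeInput('https://example.com'))
console.log(normalizeInput({ url: 'https://example.com', timeout: 1000 }))
```

Both calls produce an object with a url property; only the second overrides the timeout.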

📖 Examples

Here are some examples of how to use metadata-scraper:

Basic

Pass a URL as the first parameter and metadata-scraper automatically scrapes it and returns everything it finds:

const getMetaData = require('metadata-scraper')
const data = await getMetaData('https://github.com/BetaHuhn/metadata-scraper')

Example file located at examples/basic.js.


HTML String

If you already have an HTML string and don't want metadata-scraper to make an HTTP request, specify it in the options object:

const getMetaData = require('metadata-scraper')

const html = `
    <meta name="og:title" content="Example">
    <meta name="og:description" content="This is an example.">
`

const options = {
    html: html,
    url: 'https://example.com' // Optional URL to make relative image paths absolute
}

const data = await getMetaData(options)

Example file located at examples/html.js.
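
The url option matters in this case because a relative path such as /images/logo.png cannot be resolved without a base. A minimal sketch of that resolution using the standard URL class follows. `toAbsolute` is a hypothetical helper, and metadata-scraper's internal logic may differ.

```javascript
// Hypothetical helper illustrating relative-to-absolute URL resolution.
function toAbsolute(path, baseUrl) {
    try {
        // new URL(input, base) resolves input against base;
        // an input that is already absolute ignores the base.
        return new URL(path, baseUrl).href
    } catch (err) {
        // Without a usable base, a relative path cannot be resolved,
        // so it is returned unchanged.
        return path
    }
}

console.log(toAbsolute('/images/logo.png', 'https://example.com/blog/post'))
console.log(toAbsolute('https://cdn.example.com/a.png', 'https://example.com'))
console.log(toAbsolute('/images/logo.png', undefined))
```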


Custom Rules

See the rules.ts file in the src directory for all of the built-in rules.

You can expand metadata-scraper easily by specifying custom rules:

const getMetaData = require('metadata-scraper')

const options = {
    url: 'https://github.com/BetaHuhn/metadata-scraper',
    customRules: {
        name: {
            rules: [
                [ 'meta[name="customName"][content]', (element) => element.getAttribute('content') ]
            ],
            processor: (text) => text.toLowerCase()
        }
    }
}

const data = await getMetaData(options)

customRules needs to contain one or more objects, where the key (name above) will identify the value in the returned data.

You can then specify different rules for each item in the rules array.

The first item is the selector passed to querySelector, and the second item is a function that receives the matched HTML element and returns the value:

[ 'querySelector', (element) => element.innerText ]

You can also specify a processor function which will process/transform the result of one of the matched rules:

{
    processor: (text) => text.toLowerCase()
}

If you find a useful rule, let me know and I will add it (or create a PR yourself).

Example file located at examples/custom.js.
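
To make the rule format concrete, here is a rough sketch of how an object of this shape could be evaluated: try each [selector, function] pair in order, take the first non-empty result, then run the processor on it. `applyRuleSet` and the stub document are assumptions made for illustration; the library's real implementation may score or combine rules differently.

```javascript
// Hypothetical evaluator for a { rules, processor } object.
function applyRuleSet(doc, ruleSet) {
    for (const [selector, extract] of ruleSet.rules) {
        const element = doc.querySelector(selector)
        if (!element) continue // selector matched nothing, try the next rule
        const value = extract(element)
        if (value) {
            // Run the optional processor on the first non-empty value.
            return ruleSet.processor ? ruleSet.processor(value) : value
        }
    }
    return undefined // no rule produced a value
}

// Stub standing in for a parsed HTML document; it only knows one selector.
const stubDoc = {
    querySelector(selector) {
        if (selector === 'meta[name="customName"][content]') {
            return { getAttribute: (attr) => (attr === 'content' ? 'My Site' : null) }
        }
        return null
    }
}

const nameRule = {
    rules: [
        ['meta[name="missing"][content]', (el) => el.getAttribute('content')],
        ['meta[name="customName"][content]', (el) => el.getAttribute('content')]
    ],
    processor: (text) => text.toLowerCase()
}

console.log(applyRuleSet(stubDoc, nameRule)) // my site
```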

📇 All metadata

Here's what metadata-scraper currently tries to scrape:

{
    title: 'Title of page or article',
    description: 'Description of page or article',
    language: 'Language of page or article',
    type: 'Page type',
    url: 'URL of page',
    provider: 'Page provider',
    keywords: ['array', 'of', 'keywords'],
    section: 'Section/Category of page',
    author: 'Article author',
    published: 1605221765, // Date the article was published
    modified: 1605221765, // Date the article was modified
    robots: ['array', 'for', 'robots'],
    copyright: 'Page copyright',
    email: 'Contact email',
    twitter: 'Twitter handle',
    facebook: 'Facebook account id',
    image: 'Image URL',
    icon: 'Favicon URL',
    video: 'Video URL',
    audio: 'Audio URL'
}

If you find a useful metatag, let me know and I will add it (or create a PR yourself).
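
The published and modified values in the listing above look like Unix timestamps in seconds. Assuming that is the case (the README does not state the unit), they can be turned into JavaScript Date objects as sketched here. `toDate` is a hypothetical helper, not something the library exports.

```javascript
// Hypothetical helper; assumes the timestamp is in seconds, not milliseconds.
function toDate(unixSeconds) {
    // Not every page provides every field, so tolerate a missing value.
    if (typeof unixSeconds !== 'number') return undefined
    return new Date(unixSeconds * 1000) // Date expects milliseconds
}

const data = { published: 1605221765 }
console.log(toDate(data.published).toISOString()) // 2020-11-12T22:56:05.000Z
console.log(toDate(data.modified)) // undefined
```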

💻 Development

Issues and PRs are very welcome!

Please check out the contributing guide before you start.

This project adheres to Semantic Versioning. To see differences with previous versions refer to the CHANGELOG.

❔ About

This library was developed by me (@betahuhn) in my free time. If you want to support me:

Donate via PayPal

Credits

This library is based on Mozilla's page-metadata-parser. I converted it to TypeScript, implemented a few new features, and added more rules.

License

Copyright 2020 Maximilian Schiller

This project is licensed under the MIT License - see the LICENSE file for details.

changelog

[v0.2.61] - 2022-10-03

Dependency updates

  • a13cc15 Bump typescript from 4.8.3 to 4.8.4

[v0.2.60] - 2022-09-12

Dependency updates

  • 3b3bafa Bump typescript from 4.8.2 to 4.8.3

[v0.2.59] - 2022-08-29

Dependency updates

  • d0f9d2b Bump typescript from 4.7.4 to 4.8.2

[v0.2.58] - 2022-08-13

Updates

[v0.2.57] - 2022-06-27

Dependency updates

  • 6daf099 Bump typescript from 4.7.3 to 4.7.4

[v0.2.56] - 2022-06-13

Dependency updates

  • 9a0b101 Bump typescript from 4.7.2 to 4.7.3

[v0.2.55] - 2022-05-30

Dependency updates

  • 5c8f6a1 Bump typescript from 4.6.4 to 4.7.2

[v0.2.54] - 2022-05-02

Dependency updates

  • 9b406c5 Bump typescript from 4.6.3 to 4.6.4

[v0.2.53] - 2022-03-28

Dependency updates

  • 8ad3b01 Bump typescript from 4.6.2 to 4.6.3

[v0.2.52] - 2022-03-21

Dependency updates

  • 8f1e821 Bump tsc-watch from 4.6.1 to 4.6.2

[v0.2.51] - 2022-03-14

Dependency updates

  • 450acaa Bump tsc-watch from 4.6.0 to 4.6.1

[v0.2.50] - 2022-03-07

Dependency updates

  • 4263952 Bump typescript from 4.5.5 to 4.6.2

[v0.2.49] - 2022-01-24

Dependency updates

  • d7c0b19 Bump typescript from 4.5.4 to 4.5.5

[v0.2.48] - 2021-12-27

Dependency updates

  • e431a74 Bump tsc-watch from 4.5.0 to 4.6.0

[v0.2.47] - 2021-12-20

Dependency updates

  • fb35a00 Bump typescript from 4.5.3 to 4.5.4

[v0.2.46] - 2021-12-13

Dependency updates

  • 2eb2682 Bump typescript from 4.5.2 to 4.5.3

[v0.2.45] - 2021-11-22

Dependency updates

  • 6064436 Bump typescript from 4.4.4 to 4.5.2
  • bc7e638 Bump got from 11.8.2 to 11.8.3

[v0.2.44] - 2021-10-18

Dependency updates

  • cd0ba7b Bump typescript from 4.4.3 to 4.4.4

[v0.2.43] - 2021-10-11

Dependency updates

  • 9221686 Bump @typescript-eslint/eslint-plugin from 4.32.0 to 4.33.0
  • d37b7e5 Bump @typescript-eslint/parser from 4.32.0 to 4.33.0

[v0.2.42] - 2021-10-04

Dependency updates

  • 2ceb809 Bump @typescript-eslint/eslint-plugin from 4.31.2 to 4.32.0
  • 4796691 Bump @typescript-eslint/parser from 4.31.2 to 4.32.0

[v0.2.41] - 2021-09-27

Dependency updates

  • 73702a9 Bump @typescript-eslint/eslint-plugin from 4.31.1 to 4.31.2
  • 0715a36 Bump @typescript-eslint/parser from 4.31.1 to 4.31.2

[v0.2.40] - 2021-09-20

Dependency updates

  • f82fa40 Bump typescript from 4.4.2 to 4.4.3
  • 159be07 Bump @typescript-eslint/eslint-plugin from 4.31.0 to 4.31.1
  • a19215e Bump @typescript-eslint/parser from 4.31.0 to 4.31.1

[v0.2.39] - 2021-09-13

Dependency updates

  • fe19543 Bump @typescript-eslint/parser from 4.30.0 to 4.31.0
  • 4953468 Bump @typescript-eslint/eslint-plugin from 4.30.0 to 4.31.0
  • 22d5a28 Bump @betahuhn/config from 1.1.0 to 1.2.0

[v0.2.38] - 2021-09-06

Dependency updates

  • a2f9f4c Bump @typescript-eslint/eslint-plugin from 4.29.3 to 4.30.0
  • 71e3a21 Bump @typescript-eslint/parser from 4.29.3 to 4.30.0

[v0.2.37] - 2021-08-30

Dependency updates

  • fc93cb0 Bump typescript from 4.3.5 to 4.4.2

[v0.2.36] - 2021-08-24

Bug fixes

  • 921be7a Make default icon URL absolute

Dependency updates

  • c6a4e03 Bump @typescript-eslint/parser from 4.29.2 to 4.29.3
  • 1d5b450 Bump @typescript-eslint/eslint-plugin from 4.29.2 to 4.29.3

[v0.2.35] - 2021-08-23

Dependency updates

  • 674ecd2 Bump @typescript-eslint/parser from 4.29.1 to 4.29.2
  • 51e6582 Bump @typescript-eslint/eslint-plugin from 4.29.1 to 4.29.2
  • bc9d265 Bump tsc-watch from 4.4.0 to 4.5.0

[v0.2.34] - 2021-08-16

Dependency updates

  • 35e7b4d Bump @typescript-eslint/eslint-plugin from 4.29.0 to 4.29.1
  • daceb76 Bump @typescript-eslint/parser from 4.29.0 to 4.29.1

[v0.2.33] - 2021-08-09

Dependency updates

  • b1bf00f Bump eslint from 7.31.0 to 7.32.0
  • dffc2a5 Bump @typescript-eslint/eslint-plugin from 4.28.5 to 4.29.0
  • fd2bea8 Bump @typescript-eslint/parser from 4.28.5 to 4.29.0

[v0.2.32] - 2021-08-02

Dependency updates

  • 334fe30 Bump @typescript-eslint/parser from 4.28.4 to 4.28.5
  • b8e8860 Bump @typescript-eslint/eslint-plugin from 4.28.4 to 4.28.5

[v0.2.31] - 2021-07-26

Dependency updates

  • 1588bdd Bump eslint from 7.30.0 to 7.31.0
  • 453ac7d Bump @typescript-eslint/eslint-plugin from 4.28.3 to 4.28.4
  • 44d56e1 Bump @typescript-eslint/parser from 4.28.3 to 4.28.4

[v0.2.30] - 2021-07-19

Dependency updates

  • c78efbd Bump @typescript-eslint/eslint-plugin from 4.28.2 to 4.28.3
  • 0295003 Bump @typescript-eslint/parser from 4.28.2 to 4.28.3

[v0.2.29] - 2021-07-12

Dependency updates

  • 5508909 Bump eslint from 7.29.0 to 7.30.0
  • 4821c13 Bump @typescript-eslint/eslint-plugin from 4.28.1 to 4.28.2
  • 960cafc Bump @typescript-eslint/parser from 4.28.1 to 4.28.2

[v0.2.28] - 2021-07-05

Dependency updates

  • b93275e Bump @typescript-eslint/parser from 4.28.0 to 4.28.1
  • 414234e Bump @typescript-eslint/eslint-plugin from 4.28.0 to 4.28.1
  • d22c00e Bump typescript from 4.3.4 to 4.3.5

[v0.2.27] - 2021-06-28

Dependency updates

  • 89f12b2 Bump eslint from 7.28.0 to 7.29.0
  • cd71e35 Bump @typescript-eslint/parser from 4.27.0 to 4.28.0
  • 222db07 Bump @typescript-eslint/eslint-plugin from 4.27.0 to 4.28.0

[v0.2.26] - 2021-06-21

Dependency updates

  • 946f8db Bump @typescript-eslint/eslint-plugin from 4.26.1 to 4.27.0
  • 3a566fa Bump @typescript-eslint/parser from 4.26.1 to 4.27.0
  • 044996d Bump typescript from 4.3.2 to 4.3.3
  • 16744e6 Bump typescript from 4.3.3 to 4.3.4

[v0.2.25] - 2021-06-14

Dependency updates

  • 3c3ec87 Bump eslint from 7.27.0 to 7.28.0
  • 5ad8158 Bump @typescript-eslint/eslint-plugin from 4.26.0 to 4.26.1
  • 4348e18 Bump @typescript-eslint/parser from 4.26.0 to 4.26.1

[v0.2.24] - 2021-06-07

Dependency updates

  • 0a0801f Bump @typescript-eslint/parser from 4.25.0 to 4.26.0
  • a1e447c Bump @typescript-eslint/eslint-plugin from 4.25.0 to 4.26.0

[v0.2.23] - 2021-05-31

Dependency updates

  • a87fa04 Bump eslint from 7.26.0 to 7.27.0
  • 888ee22 Bump @typescript-eslint/parser from 4.24.0 to 4.25.0
  • 518e881 Bump @typescript-eslint/eslint-plugin from 4.24.0 to 4.25.0
  • 711b9fc Bump typescript from 4.2.4 to 4.3.2
  • e359275 Bump tsc-watch from 4.2.9 to 4.4.0

[v0.2.22] - 2021-05-24

Dependency updates

  • a113138 Bump @typescript-eslint/eslint-plugin from 4.23.0 to 4.24.0
  • 5371cb3 Bump @typescript-eslint/parser from 4.23.0 to 4.24.0

[v0.2.21] - 2021-05-17

Dependency updates

  • 8f676af Bump eslint from 7.25.0 to 7.26.0
  • f192c80 Bump @typescript-eslint/parser from 4.22.1 to 4.23.0
  • 3a84b8b Bump @typescript-eslint/eslint-plugin from 4.22.1 to 4.23.0

[v0.2.20] - 2021-05-10

Dependency updates

  • 7d7f744 Bump @typescript-eslint/parser from 4.22.0 to 4.22.1
  • f6d62c7 Bump @typescript-eslint/eslint-plugin from 4.22.0 to 4.22.1

[v0.2.19] - 2021-05-03

Dependency updates

  • 491b51a Bump eslint from 7.24.0 to 7.25.0

[v0.2.18] - 2021-04-19

Dependency updates

  • 1740820 Bump eslint from 7.23.0 to 7.24.0
  • 3266b57 Bump @typescript-eslint/parser from 4.21.0 to 4.22.0
  • 5bd9cec Bump @typescript-eslint/eslint-plugin from 4.21.0 to 4.22.0
  • cff3683 Bump @betahuhn/config from 1.0.2 to 1.1.0

[v0.2.17] - 2021-04-12

Dependency updates

  • 3316c8d Bump @typescript-eslint/eslint-plugin from 4.20.0 to 4.21.0
  • 4c7d39a Bump @typescript-eslint/parser from 4.20.0 to 4.21.0
  • 95ad78d Bump typescript from 4.2.3 to 4.2.4

[v0.2.16] - 2021-04-05

Dependency updates

  • 6f8ee8d Bump eslint from 7.22.0 to 7.23.0
  • c1a2c49 Bump @typescript-eslint/eslint-plugin from 4.19.0 to 4.20.0
  • e3e5f7b Bump @typescript-eslint/parser from 4.19.0 to 4.20.0

[v0.2.15] - 2021-03-29

Dependency updates

  • 70af276 Bump @typescript-eslint/parser from 4.18.0 to 4.19.0
  • ba8fa40 Bump @typescript-eslint/eslint-plugin from 4.18.0 to 4.19.0

[v0.2.14] - 2021-03-22

Dependency updates

  • 846ee60 Bump eslint from 7.21.0 to 7.22.0
  • d707f08 Bump @typescript-eslint/eslint-plugin from 4.17.0 to 4.18.0
  • 94b8a3c Bump @typescript-eslint/parser from 4.17.0 to 4.18.0

[v0.2.13] - 2021-03-11

Dependency updates

  • dc4a9f4 Bump @typescript-eslint/eslint-plugin from 4.16.1 to 4.17.0 (#62) (Issues: #62)
  • 1e134e6 Bump @typescript-eslint/parser from 4.16.1 to 4.17.0 (#61) (Issues: #61)

[v0.2.12] - 2021-03-10

[v0.2.11] - 2021-03-08

Dependency updates

  • 16106c1 Bump got from 11.8.1 to 11.8.2
  • 679914b Bump eslint from 7.20.0 to 7.21.0
  • 16126bf Bump @typescript-eslint/parser from 4.15.2 to 4.16.1
  • fffe655 Bump @typescript-eslint/eslint-plugin from 4.15.2 to 4.16.1
  • 632cce7 Bump typescript from 4.2.2 to 4.2.3
  • 2293c12 Bump stefanzweifel/git-auto-commit-action

[v0.2.10] - 2021-03-01

Dependency updates

  • 122a6b6 Bump stefanzweifel/git-auto-commit-action
  • dbeb927 Bump @typescript-eslint/parser from 4.15.1 to 4.15.2
  • bad6e7e Bump @typescript-eslint/eslint-plugin from 4.15.1 to 4.15.2
  • 9e57df1 Bump typescript from 4.1.5 to 4.2.2
  • 1e78730 Bump stefanzweifel/git-auto-commit-action

[v0.2.9] - 2021-02-22

Dependency updates

  • e05f96a Bump eslint from 7.19.0 to 7.20.0
  • 96fbe9c Bump pascalgn/automerge-action from v0.13.0 to v0.13.1
  • 058b7aa Bump @typescript-eslint/eslint-plugin from 4.15.0 to 4.15.1
  • a2ec2df Bump @typescript-eslint/parser from 4.15.0 to 4.15.1

[v0.2.8] - 2021-02-15

Dependency updates

  • bf1c220 Bump @typescript-eslint/parser from 4.14.2 to 4.15.0
  • 30618e3 Bump @typescript-eslint/eslint-plugin from 4.14.2 to 4.15.0
  • 456b76c Bump typescript from 4.1.3 to 4.1.4
  • db5ad00 Bump typescript from 4.1.4 to 4.1.5

[v0.2.7] - 2021-02-08

Dependency updates

  • ba13de2 Bump eslint from 7.18.0 to 7.19.0
  • ba66209 Bump @typescript-eslint/parser from 4.14.1 to 4.14.2
  • e47c8c9 Bump @typescript-eslint/eslint-plugin from 4.14.1 to 4.14.2

[v0.2.6] - 2021-02-01

Dependency updates

  • b3f0079 Bump @typescript-eslint/eslint-plugin from 4.14.0 to 4.14.1
  • cf11c6c Bump @typescript-eslint/parser from 4.14.0 to 4.14.1

[v0.2.5] - 2021-01-25

Dependency updates

  • 6adb3b1 Bump eslint from 7.17.0 to 7.18.0
  • da98c57 Bump @typescript-eslint/parser from 4.13.0 to 4.14.0
  • 0ed1a68 Bump @typescript-eslint/eslint-plugin from 4.13.0 to 4.14.0

[v0.2.4] - 2021-01-18

[v0.2.3] - 2021-01-11

[v0.2.2] - 2021-01-01

[v0.2.1] - 2021-01-01

Updates

[v0.2.0] - 2020-12-19

Added

  • lang option which sets the Accept-Language header used for the request #12

Changed

  • update dependencies

[v0.1.2] - 2020-12-08

Changed

  • update dependencies

[v0.1.1] - 2020-11-14

Fixed

  • added export function to fix 'expression is not callable' error

[v0.1.0] - 2020-11-12

Added

  • Initial commit