x-ray

matthewmueller | 8.8k | MIT | 2.3.4

structure any website

api, cheerio, scrape, scraper, structure, web

readme

x-ray

var Xray = require('x-ray')
var x = Xray()

x('https://blog.ycombinator.com/', '.post', [
  {
    title: 'h1 a',
    link: '.article-title@href'
  }
])
  .paginate('.nav-previous a@href')
  .limit(3)
  .write('results.json')

Installation

npm install x-ray

Features

Flexible schema: Supports strings, arrays, arrays of objects, and nested object structures. The schema is not tied to the structure of the page you're scraping, allowing you to pull the data in the structure of your choosing.
Composable: The API is entirely composable, giving you great flexibility in how you scrape each page.
Pagination support: Paginate through websites, scraping each page. X-ray also supports a request delay and a pagination limit. Scraped pages can be streamed to a file, so if there's an error on one page, you won't lose what you've already scraped.
Crawler support: Start on one page and move to the next easily. The flow is predictable, following a breadth-first crawl through each of the pages.
Responsible: X-ray has support for concurrency, throttles, delays, timeouts and limits to help you scrape any page responsibly.
Pluggable drivers: Swap in different scrapers depending on your needs. Currently supports HTTP and PhantomJS drivers. In the future, I'd like to see a Tor driver for requesting pages through the Tor network.
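
The breadth-first flow mentioned under "Crawler support" can be pictured with a small in-memory sketch. This is not x-ray's implementation: the links table and the crawlOrder helper are hypothetical, and skipping already-seen pages is an assumption made only to keep the example finite.

```javascript
// Hypothetical link graph: each page lists the pages it links to.
var links = {
  '/': ['/a', '/b'],
  '/a': ['/c'],
  '/b': ['/c', '/d'],
  '/c': [],
  '/d': []
}

// Visit pages level by level, starting from `start`.
function crawlOrder(start) {
  var queue = [start]
  var seen = {}
  var order = []
  seen[start] = true
  while (queue.length > 0) {
    var url = queue.shift() // oldest entry first, which makes it breadth-first
    var next = links[url] || []
    order.push(url)
    next.forEach(function(link) {
      if (!seen[link]) {
        seen[link] = true
        queue.push(link)
      }
    })
  }
  return order
}

console.log(crawlOrder('/')) // [ '/', '/a', '/b', '/c', '/d' ]
```

Both pages linked from the start page are visited before anything they link to, which is what makes the order predictable.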

Selector API

xray(url, selector)(fn)

Scrape the url for the following selector, returning an object in the callback fn. The selector takes an enhanced jQuery-like string that is also able to select on attributes. The syntax for selecting on attributes is selector@attribute. If you do not supply an attribute, the default is selecting the innerText.

Here are a few examples:

Scrape a single tag

xray('http://google.com', 'title')(function(err, title) {
  console.log(title) // Google
})

Scrape a single class

xray('http://reddit.com', '.content')(fn)

Scrape an attribute

xray('http://techcrunch.com', 'img.logo@src')(fn)

Scrape innerHTML

xray('http://news.ycombinator.com', 'body@html')(fn)
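
To make the selector@attribute syntax concrete, here is a rough sketch of how such a string could be split into its two parts. The splitSelector helper is hypothetical and is not how x-ray parses selectors internally; the 'text' label for the default case is also just an illustrative name.

```javascript
// Hypothetical helper, not part of x-ray: split 'selector@attribute'
// into its parts, falling back to the element text when no attribute is given.
function splitSelector(str) {
  var at = str.lastIndexOf('@')
  if (at === -1) {
    return { selector: str.trim(), attribute: 'text' }
  }
  return {
    selector: str.slice(0, at).trim(),
    attribute: str.slice(at + 1).trim()
  }
}

console.log(splitSelector('title'))        // { selector: 'title', attribute: 'text' }
console.log(splitSelector('img.logo@src')) // { selector: 'img.logo', attribute: 'src' }
console.log(splitSelector('body@html'))    // { selector: 'body', attribute: 'html' }
```

A real parser would need to handle an @ that appears inside an attribute selector; this sketch deliberately ignores that case.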

xray(url, scope, selector)

You can also supply a scope to each selector. In jQuery, this would look something like this: $(scope).find(selector).

xray(html, scope, selector)

Instead of a url, you can also supply raw HTML and all the same semantics apply.

var html = '<body><h2>Pear</h2></body>'
x(html, 'body', 'h2')(function(err, header) {
  header // => Pear
})

API

xray.driver(driver)

Specify a driver to make requests through. Available drivers include:

request - A simple driver built around request. Use this to set headers, cookies or http methods.
phantom - A high-level browser automation library. Use this to render pages, when elements need to be interacted with, or when elements are created dynamically using JavaScript (e.g. by Ajax calls).

xray.stream()

Returns a Readable Stream of the data. This makes it easy to build APIs around x-ray. Here's an example with Express:

var app = require('express')()
var x = require('x-ray')()

app.get('/', function(req, res) {
  var stream = x('http://google.com', 'title').stream()
  stream.pipe(res)
})

xray.write([path])

Stream the results to a path.

If no path is provided, then the behavior is the same as .stream().

xray.then(cb)

Constructs a Promise object and invokes its then function with a callback cb. Be sure to invoke then() as the last step of xray method chaining, since the other methods are not promisified.

x('https://dribbble.com', 'li.group', [
  {
    title: '.dribbble-img strong',
    image: '.dribbble-img [data-src]@data-src'
  }
])
  .paginate('.next_page@href')
  .limit(3)
  .then(function(res) {
    console.log(res[0]) // prints first result
  })
  .catch(function(err) {
    console.log(err) // handle error in promise
  })

xray.paginate(selector)

Select a url from a selector and visit that page.

xray.limit(n)

Limit the amount of pagination to n requests.
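
How paginate and limit interact can be sketched without any network access. The pages table and the paginate function below are hypothetical stand-ins, and the sketch assumes the first page counts toward the limit.

```javascript
// Hypothetical in-memory stand-in for a paginated site.
var pages = {
  '/page/1': { items: ['a', 'b'], next: '/page/2' },
  '/page/2': { items: ['c'], next: '/page/3' },
  '/page/3': { items: ['d', 'e'], next: '/page/4' },
  '/page/4': { items: ['f'], next: null }
}

// Follow "next" links from `start`, visiting at most `limit` pages.
function paginate(start, limit) {
  var results = []
  var url = start
  var visited = 0
  while (url && visited < limit) {
    var page = pages[url]
    if (!page) break // unknown URL: nothing more to follow
    results = results.concat(page.items)
    url = page.next
    visited++
  }
  return results
}

console.log(paginate('/page/1', 3)) // [ 'a', 'b', 'c', 'd', 'e' ]
```

With a limit of 3, the fourth page is never requested even though the third page links to it.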

xray.abort(validator)

Abort pagination if the validator function returns true. The validator function receives two arguments:

result: The scrape result object for the current page.
nextUrl: The URL of the next page to scrape.
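
A validator might look like the following sketch. The shouldAbort name, the /archive/ path and the assumption that result is an array are all illustrative; adapt the checks to whatever shape your schema produces.

```javascript
// Validator sketch: stop when the current page produced no results,
// or when the next URL leaves the (hypothetical) /archive/ section.
function shouldAbort(result, nextUrl) {
  var isEmpty = Array.isArray(result) && result.length === 0
  var leavesArchive =
    typeof nextUrl === 'string' && nextUrl.indexOf('/archive/') === -1
  return isEmpty || leavesArchive
}

// Intended use, following the signature described above:
//   x(url, scope, [schema]).paginate(selector).abort(shouldAbort)

console.log(shouldAbort([], 'https://example.com/archive/2'))                  // true
console.log(shouldAbort([{ title: 'Post' }], 'https://example.com/archive/2')) // false
console.log(shouldAbort([{ title: 'Post' }], 'https://example.com/about'))     // true
```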

xray.delay(from, [to])

Delay the next request between from and to milliseconds. If only from is specified, delay exactly from milliseconds.
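
One way to picture the from/to behavior is the sketch below. The pickDelay helper is hypothetical, and the uniform, inclusive range is an assumption rather than a documented guarantee. The random source is injectable so the result can be checked deterministically.

```javascript
// Hypothetical helper: choose a wait time in milliseconds for delay(from, [to]).
function pickDelay(from, to, random) {
  if (to === undefined) return from // only `from` given: fixed delay
  var r = (random || Math.random)()
  return from + Math.floor(r * (to - from + 1))
}

console.log(pickDelay(500))                                     // 500
console.log(pickDelay(500, 1000, function() { return 0 }))      // 500
console.log(pickDelay(500, 1000, function() { return 0.999 }))  // 1000
```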

xray.concurrency(n)

Set the request concurrency to n. Defaults to Infinity.

xray.throttle(n, ms)

Throttle the requests to n requests per ms milliseconds.
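
As a rough mental model, throttling to n requests per ms milliseconds means requests go out in batches. The earliestStart function below is a hypothetical, simplified model (fixed windows, one batch per window); the actual scheduling inside x-ray may differ.

```javascript
// Hypothetical model of throttle(n, ms): request number `index` (0-based)
// may start no earlier than the beginning of its window.
function earliestStart(index, n, ms) {
  return Math.floor(index / n) * ms
}

// Five requests throttled to 2 per 1000 ms.
var starts = [0, 1, 2, 3, 4].map(function(i) {
  return earliestStart(i, 2, 1000)
})
console.log(starts) // [ 0, 0, 1000, 1000, 2000 ]
```

Under this model, five requests at 2 per second take at least two seconds before the last one can start.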

xray.timeout(ms)

Specify a timeout of ms milliseconds for each request.

Collections

X-ray also has support for selecting collections of tags. While x('ul', 'li') will only select the first list item in an unordered list, x('ul', ['li']) will select all of them.

Additionally, X-ray supports "collections of collections" allowing you to smartly select all list items in all lists with a command like this: x(['ul'], ['li']).

Composition

X-ray becomes more powerful when you start composing instances together. Here are a few possibilities:

Crawling to another site

var Xray = require('x-ray')
var x = Xray()

x('http://google.com', {
  main: 'title',
  image: x('#gbar a@href', 'title') // follow link to google images
})(function(err, obj) {
  /*
  {
    main: 'Google',
    image: 'Google Images'
  }
*/
})

Scoping a selection

var Xray = require('x-ray')
var x = Xray()

x('http://mat.io', {
  title: 'title',
  items: x('.item', [
    {
      title: '.item-content h2',
      description: '.item-content section'
    }
  ])
})(function(err, obj) {
  /*
  {
    title: 'mat.io',
    items: [
      {
        title: 'The 100 Best Children\'s Books of All Time',
        description: 'Relive your childhood with TIME\'s list...'
      }
    ]
  }
*/
})

Filters

Filters can be specified when creating a new Xray instance. To apply filters to a value, append them to the selector using |.

var Xray = require('x-ray')
var x = Xray({
  filters: {
    trim: function(value) {
      return typeof value === 'string' ? value.trim() : value
    },
    reverse: function(value) {
      return typeof value === 'string'
        ? value
            .split('')
            .reverse()
            .join('')
        : value
    },
    slice: function(value, start, end) {
      return typeof value === 'string' ? value.slice(start, end) : value
    }
  }
})

x('http://mat.io', {
  title: 'title | trim | reverse | slice:0,2'
})(function(err, obj) {
  /*
  {
    title: 'oi'
  }
*/
})
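
The way a filter chain is applied left to right can be sketched in plain JavaScript. The applyFilters helper is hypothetical and is not x-ray's parser; in particular, converting the arguments after the colon to numbers is an assumption of this sketch.

```javascript
var filters = {
  trim: function(value) {
    return typeof value === 'string' ? value.trim() : value
  },
  reverse: function(value) {
    return typeof value === 'string' ? value.split('').reverse().join('') : value
  },
  slice: function(value, start, end) {
    return typeof value === 'string' ? value.slice(start, end) : value
  }
}

// Hypothetical helper: apply 'name | name:arg,arg' to a value, left to right.
function applyFilters(value, spec, available) {
  var parts = spec.split('|').map(function(part) {
    return part.trim()
  })
  return parts.filter(Boolean).reduce(function(current, part) {
    var pieces = part.split(':')
    var name = pieces[0].trim()
    var args = pieces[1] ? pieces[1].split(',').map(Number) : []
    if (typeof available[name] !== 'function') {
      throw new Error('Unknown filter: ' + name)
    }
    // Each filter receives the previous filter's output as its first argument.
    return available[name].apply(null, [current].concat(args))
  }, value)
}

console.log(applyFilters('  Pear  ', 'trim | reverse', filters))             // raeP
console.log(applyFilters('  Pear  ', 'trim | reverse | slice:0,2', filters)) // ra
```

Order matters: each filter sees the output of the one before it, so reversing before slicing gives a different result than slicing before reversing.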

Examples

selector: simple string selector
collections: selects an object
arrays: selects an array
collections of collections: selects an array of objects
array of arrays: selects an array of arrays

In the Wild

Levered Returns: Uses x-ray to pull together financial data from various unstructured sources around the web.

Resources

Video: https://egghead.io/lessons/node-js-intro-to-web-scraping-with-node-and-x-ray

Backers

Support us with a monthly donation and help us continue our activities.

License

MIT

changelog

2.3.2 (2018-06-22)

removed bumped
Merge pull request #283 from matthewmueller/renovate/pin-dependencies
Pin Dependencies
Merge pull request #282 from matthewmueller/renovate/configure
Add renovate.json
Merge pull request #241 from kengz/master
Merge pull request #232 from dfcowell/master
Merge pull request #218 from jmichelin/master
Merge pull request #200 from thangngoc89/master
Merge branch 'master' into master
Merge pull request #253 from linguistbreaker/master
Merge pull request #250 from mandeldl/master
add job board
fixing path for require() for XRay and a typo in path.resolve
fix example typos and relative paths
add json parsing error handler to promisify
add readme for promisify
add promisify option for xray.then(); add tests
Update readme.
Remove test constraints.
Add abort function to allow for ending pagination based on returned results/next url.
Merge pull request #1 from jmichelin/jmichelin-patch-1
bug fix typo in selector example
Fix Composition and collection error

2.3.2 (2017-02-24)

Merge pull request #221 from cvan/patch-1
Merge pull request #233 from sagivo/patch-1
Merge pull request #246 from paulbarrett/patch-1
fix broken code sample using alt test site
fix broken tests caused by removal of dribbble pagination
modify test spec to account for dribble dom mods
fix broken selector in the dribbble sample code
remove commented code
include both hostname + port when absolute-izing URLs

2.3.1 (2016-10-13)

Merge pull request #181 from Muscot/master
Merge pull request #209 from Crazometer/readme-driver-fix
Clarification suggested by @gebrits
Merge pull request #216 from 600rr/patch-1
Update index.js
Moved driver list to correct place and added request driver.
Merge pull request #179 from fatman-/patch-1
Merge pull request #191 from piamancini/patch-1
Merge pull request #192 from GeoffreyEmery/patch-1
update to readme
Update Read.me
Added backers and sponsors from OpenCollective
Merge pull request #190 from davidarnarsson/master
Fixed spaces
Fixed coding style
Fixes a problem when encountering malformed URLs while "absolutizing" urls in a document.
Merge pull request #186 from wayneashleyberry/patch-1
Update Readme.md
Moved example follow.js to a folder crawler that test to follow many links and retrieve information. Changed to standard js.
Nested crawling broken on 'master'. When to merge 'bugfix/nested-crawling' #111. Needed to exit this without calling next; the problem was that it returned to the "finished" callback before it had retrieved all pending requests. It should wait for "return next(null, compact(out))".
Fix typo in index.js in the collections example

2.3.0 (2016-04-26)

fix README code in 'Collections' (3a8470c)
Support x-ray-parse filters (0fa5093)
Update Readme and contributors (faabc13)

2.2.0 (2016-03-30)

fix error when html is null (f73990f)
Fix style (636d4f0)
run pretest on travis (21eabfb)

2.1.1 (2016-03-29)

delete unnecessary fixture (85a42a7)
Extra util.isObject method (cd56198)
Fix #162 (7484e20), closes #162
fix example links in readme (29ab95e)
Setup correctly init state (0c3f421)
upgrade rimraf (43c2eb9)

2.1.0 (2016-03-25)

Add .stream section in docs (3dbaa61)
Add badges (5ca8cbd)
Add bumped integration (70c39b0)
Add cheerio dev dependency (c34fed4)
Add coveralls integration (fc9b6d5)
Add missing fields (1d2f669)
Add node.stream (c573cda)
Add phantom setup (c116752)
Add travis (32c74c6)
Change to documentation of support for raw HTML (959f791)
Cleanup dependencies (27933d8)
Delete commented code (b788d93)
Disable security in testing (a26f8e5)
doc - API usage correction (324291b)
Ensure to have last mocha version (96264af)
Extract stream helpers and WalkHTML methods (6f73cac)
Extract util methods (0781058)
Fix #148 (d79f94e), closes #148
Fix examples (a9c9c5f)
Fix package (a0e8be7)
Fix potential issue when selector is a function following comments on commit e7318d591908d04b439c1ff (9327cfa)
Fix some linter messages (50c00cc)
Fix test (2b2f027)
Fix test (1adf200)
Fix test (760b615)
Fix travis integration with phantom (0966ef1)
Fix typo (03d044c)
Generate (again) changelog (8610ff5)
Handle invalid URL and invalid HTML input (61738eb)
little refactor (78cf007)
Move xray phantom out of xray dep (0cf2be3)
Prevent lint phantom stuff (58cb404)
Remove testing section, just look badges! (978306f)
Rename testing files. More easy to search (0a269fd)
Setup node version (6f77635)
Sort requires (9b204dd)
total refactor (6d30434)
Update travis phantom build (cd37e33)
Update travis phantom pattern (db9e63b)
Upgrade dependency (3aeea03)
upgrade deps (cef188e)
WIP (1f0bc2f)
WIP (6fece6c)
xit tests that doesn't work well. (095ad5f)

2.0.3 (2016-01-06)

Added support for <base> tag when generating absolute URLs. (11dd8f7)
added video (46abded)
Do no error on trying to follow a non-url (030e5ea)
Emit error on correct stream (488d5dd), closes #98
Fix issue when one of the selector is null (e7318d5)
Fix potential issue when selector is a function following comments on commit e7318d591908d04b439c1ff (9327cfa)
Paginating => Crawling (6750a38)
Release 2.0.3 (8c3b6e3)

2.0.2 (2015-07-04)

add test dependency (de3a3f0)
handle the empty result set case (a257ef7)
Release 2.0.2 (c179fbb)

2.0.1 (2015-06-24)

Added running test note (f27ebc1)
basic repl working (9228ee0)
build out documentation (2603c0b)
bump x-ray-select (eb766b0)
cleanup (31a2490)
cleanup and start format (7709e08)
fix make test (ad54ad4)
fix convert URLs (246915c)
fix example (3b3338c)
fixed the double call of the callback function (d1a955c)
ignored webstorm specific files (7494d9a)
more API docs (4e77f5b)
new x-ray (8fe7ae7)
phantomjs is required to run the tests (ba47969)
Release 2.0.0 (80414cb)
Release 2.0.1 (3c6f6db)
remove coverage from git (4a71923)
remove old tests (3820c2e)
small cleanup (8ce66a6)
update absolutes (f246874)
update example (b0529b7)
update iamge (a1865c1)
update image (6aca14d)
Update index.js (5a8f05f)
update readme (2636061)
Update Readme.md (c9f06e6)

1.0.5 (2015-03-10)

fix readme (9e51e82)
Release 1.0.5 (f827a90)
x-ray: Add #ua() (43a2824), closes #4

1.0.4 (2015-03-08)

Add img[src] to absolute urls (d9090d2)
add test (2d9edc6)
Added Gitter badge (c8a09d9)
Delete dribbble-search.js (109da37)
Release 1.0.4 (524ed37)
update list of attribute affected by absolute urls (bae1926)

1.0.3 (2015-02-07)

add xray.prepare(str, fn) for custom 'title | uppercase' filters (4cf1897)
better spot for badges (d62f2a0)
Release 1.0.3 (d9d93aa)

1.0.2 (2015-02-06)

fix bold (e1416a7)
fix url (thanks Jonah) (d63a15a)
more linking (74a08df)
Release 1.0.2 (b5f9043)
Update Readme.md (f673628)
added: gittask (c3a1f8c)

1.0.1 (2015-02-05)

add xray.format(fn) (bfea5f0)
cleanup (a1289d9)
more testing and finished readme (0aa0fcc)
Release 1.0.1 (871c833)

1.0.0 (2015-02-04)

all new x-ray (6377062)
Initial commit (ea51a7f)
Release 1.0.0 (0989cdc)
remove node_modules (92e6085)
updates (9c4e329)
written, needs to be tested better. (427e7f2)
added: gitignore (da14f58)