wtf_wikipedia

spencermountain · 13.6k · MIT · 10.4.0 · TypeScript support: included

parse wikiscript into json

wikipedia, wikimedia, wikipedia markup, wikiscript

readme

wtf_wikipedia
parse data from wikipedia
npm install wtf_wikipedia
it is very, very hard. we're not joking.
why do we always do this?
we put our information where we can't take it out.

import wtf from 'wtf_wikipedia'

let doc = await wtf.fetch('Toronto Raptors')
let coach = doc.infobox().get('coach')
coach.text() //'Darko Rajaković'

.text()

get clean plaintext:

let str = `[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall. <ref>Field of our Fathers: By Richard Johnson</ref>`
wtf(str).text()
// "Boston's baseball field has a 37ft wall."

let doc = await wtf.fetch('Glastonbury', 'en')
doc.sentences()[0].text()
// 'Glastonbury is a town and civil parish in Somerset, England, situated at a dry point ...'

.json()

get all the data from a page:

let doc = await wtf.fetch('Whistling')

doc.json()
// { categories: ['Oral communication', 'Vocal skills'], sections: [{ title: 'Techniques' }], ...}

the default .json() output is really verbose, but you can cherry-pick data by poking-around like this:

// get just the links:
doc.links().map((link) => link.json())
//[{ page: 'Theatrical superstitions', text: 'supersitions' }]

// just the images:
doc.images()[0].json()
// { file: 'Image:Duveneck Whistling Boy.jpg', url: 'https://commons.wiki...' }

// json for a particular section:
doc.section('see also').links()[0].json()
// { page: 'Slide Whistle' }

run it on the client-side:

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
  wtf.fetch('Radiohead', { 'Api-User-Agent': 'Name your script here' }, function (err, doc) {
    let members = doc.infobox().get('current members')
    members.links().map((l) => l.page())
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  })
</script>

or the server-side:

import wtf from 'wtf_wikipedia'
// or,
const wtf = require('wtf_wikipedia')

full wikipedia dumps

With this library, in conjunction with dumpster-dive, you can parse the whole english wikipedia in an afternoon.

npm install -g dumpster-dive

Ok first, 🛀

Wikitext is no small thing.

Consider:

this library supports many recursive shenanigans, deprecated and obscure template variants, and illicit wiki-shorthands.

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parse infoboxes into a formatted key-value object
  • Handles recursive templates and links, like [[.. [[...]] ]]
  • Per-sentence plaintext and link resolution
  • Parse and format internal links
  • creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolve dynamic templates like {{CURRENTMONTH}} and {{CONVERT ..}}
  • Parse images, headings, and categories
  • converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • parse and combine citation and reference metadata
  • Eliminate xml, latex, css, and table-sorting cruft
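
To make the DMS conversion in the list above a bit more concrete, here is a rough standalone sketch of the arithmetic. `dmsToDecimal` is a made-up helper for illustration, not part of this library's API, and the real parser deals with many more coordinate formats than this one regex does.

```javascript
// hypothetical helper (not the library's implementation):
// turn one DMS-formatted value like 59°12'7.7"N into decimal degrees
function dmsToDecimal(str) {
  const reg = /(\d+(?:\.\d+)?)°\s*(?:(\d+(?:\.\d+)?)['′]\s*)?(?:(\d+(?:\.\d+)?)["″]\s*)?([NSEW])/i
  const m = str.match(reg)
  if (!m) {
    return null
  }
  const deg = parseFloat(m[1])
  const min = m[2] ? parseFloat(m[2]) : 0
  const sec = m[3] ? parseFloat(m[3]) : 0
  // minutes are 1/60 of a degree, seconds are 1/3600
  let dec = deg + min / 60 + sec / 3600
  // south and west are negative
  if (/[SW]/i.test(m[4])) {
    dec = -dec
  }
  // round to 5 decimal places
  return Math.round(dec * 1e5) / 1e5
}

dmsToDecimal(`59°12'7.7"N`) // 59.20214
```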

What it doesn't do:

  • external 'transcluded' page data [1]
  • AST output
  • smart (or 'pretty') formatting of html in infoboxes or galleries [1]
  • maintain perfect page order [1]
  • per-sentence references (by 'section' element instead)
  • maintain template or infobox css styling
  • large tables that span different sections [1]

It is built to be as flexible as possible. In all cases, it tries to fail in considerate ways.

How about html scraping..?

Wikimedia's official parser turns wikitext ➔ HTML.

if you prefer this screen-scraping workflow, you can pluck at parts of a page like that.

that's cool!

getting structured data this way is still a complex, weird process. Manually spelunking the html is sometimes just as tricky and error-prone as scanning the wikitext itself.

The contributors to this library have come to that conclusion, as many others have.

This library is grateful to the Parsoid contributors.

okay,

flip your wikitext into a Doc object

import wtf from 'wtf_wikipedia'

let txt = `
==Wood in Popular Culture==
* Harry Potter's wand
* The Simpson's fence
`
wtf(txt)
// Document {text(), json(), lists()...}

txt = `Whistling is featured in a number of television shows, such as [[Lassie (1954 TV series)|''Lassie'']], and the title theme for ''[[The X-Files]]''.`
wtf(txt)
  .links()
  .map((l) => l.page())
// [ 'Lassie (1954 TV series)',  'The X-Files' ]
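
For a feel of what link-parsing involves in the easy case, here is a naive standalone sketch. `simpleLinks` is a hypothetical function, not something this library exports, and it deliberately ignores nested links, files, interwiki prefixes, and everything else that makes wikitext hard.

```javascript
// hypothetical, naive link-finder: only handles flat [[page]] and [[page|text]]
function simpleLinks(wikitext) {
  const reg = /\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]/g
  return [...wikitext.matchAll(reg)].map((m) => {
    // with no '|text' part, the link text is just the page name
    return { page: m[1], text: m[2] || m[1] }
  })
}

const sample = `such as [[Lassie (1954 TV series)|''Lassie'']], and the title theme for ''[[The X-Files]]''.`
simpleLinks(sample).map((l) => l.page)
// [ 'Lassie (1954 TV series)', 'The X-Files' ]
```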

doc.text()

returns nice plain-text of the article

let txt =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>"
wtf(txt).text()
//"Boston's baseball field has a 37ft wall."

doc.sections():

a section is a heading '==Like This=='

wtf(page).sections()[1].children() //traverse nested sections
wtf(page).section('see also').remove() //delete one

doc.sentences()

let s = wtf(page).sentences()[4]
s.links()
s.bolds()
s.italics()
s.text()
s.wikitext()

doc.categories()

let doc = await wtf.fetch('Whistling')
doc.categories()
//['Oral communication', 'Vocal music', 'Vocal skills']

doc.images()

let img = wtf(page).images()[0]
img.url() // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format() // jpg, png, ..

Fetch

You can grab and parse articles from [any wiki api](https://www.mediawiki.org/wiki/API:Mainpage). This includes any language, any wiki-project, and most 3rd-party wikis.

// 3rd-party wiki
let doc = await wtf.fetch('https://muppet.fandom.com/wiki/Miss_Piggy')

// wikipedia français
doc = await wtf.fetch('Tony Hawk', 'fr')
doc.sentence().text() // 'Tony Hawk est un skateboarder professionnel et un acteur ...'

// accept an array, or wikimedia pageIDs
let docs = await wtf.fetch(['Whistling', 2983], { follow_redirects: false })

// article from german wikivoyage
wtf.fetch('Toronto', { lang: 'de', wiki: 'wikivoyage' }).then((doc) => {
  console.log(doc.sentences()[0].text()) // 'Toronto ist die Hauptstadt der Provinz Ontario'
})

you may also pass the wikipedia page id as parameter instead of the page title:

let doc = await wtf.fetch(64646, 'de')

the fetch method follows redirects.

API plugin

wtf.getCategoryPages(title, [options])

retrieves all pages and sub-categories belonging to a given category:

wtf.extend(require('wtf-plugin-api'))
let result = await wtf.getCategoryPages('Category:Politicians_from_Paris')
/*
{
  [
    {"pageid":52502362,"ns":0,"title":"William Abitbol"},
    {"pageid":50101413,"ns":0,"title":"Marie-Joseph Charles des Acres de L'Aigle"}
    ...
    {"pageid":62721979,"ns":14,"title":"Category:Councillors of Paris"},
    {"pageid":856891,"ns":14,"title":"Category:Mayors of Paris"}
  ]
}
*/
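
Since pages and sub-categories come back mixed together, you may want to split them. A small sketch, assuming the result is a plain array of {pageid, ns, title} objects as the output above suggests (ns 0 is the main article namespace and ns 14 is the Category namespace in MediaWiki); `splitByNamespace` is a hypothetical helper:

```javascript
// sample data, copied from the example output above
const sampleResult = [
  { pageid: 52502362, ns: 0, title: 'William Abitbol' },
  { pageid: 62721979, ns: 14, title: 'Category:Councillors of Paris' },
  { pageid: 856891, ns: 14, title: 'Category:Mayors of Paris' },
]

// hypothetical helper: separate articles from sub-categories
function splitByNamespace(list) {
  return {
    pages: list.filter((o) => o.ns === 0).map((o) => o.title),
    categories: list.filter((o) => o.ns === 14).map((o) => o.title),
  }
}

splitByNamespace(sampleResult)
// { pages: ['William Abitbol'], categories: ['Category:Councillors of Paris', 'Category:Mayors of Paris'] }
```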

wtf.random([options])

fetches a random wikipedia article, from a given language or domain

wtf.extend(require('wtf-plugin-api'))
wtf.random().then((doc) => {
  console.log(doc.title(), doc.categories())
  //'Whistling'  ['Oral communication', 'Vocal skills']
})

see wtf-plugin-api

Tutorials

Plugins

these add all sorts of new functionality:

wtf.extend(require('wtf-plugin-classify'))
let doc = await wtf.fetch('Toronto Raptors')
doc.classify()
// 'Organization/SportsTeam'

wtf.extend(require('wtf-plugin-summary'))
doc = await wtf.fetch('Pulp Fiction')
doc.summary()
// 'a 1994 American crime film'

wtf.extend(require('wtf-plugin-person'))
doc = await wtf.fetch('David Bowie')
doc.birthDate()
// {year:1947, date:8, month:1}

wtf.extend(require('wtf-plugin-i18n'))
doc = await wtf.fetch('Ziggy Stardust', 'fr')
doc.infobox().json()
// {nom:{text:"Ziggy Stardust"}, oeuvre:{text:"The Rise and Fall of Ziggy Stardust"}}

Plugin

  • classify - person/place/thing
  • summary - short description text
  • person - birth/death information
  • api - fetch more data from the API
  • i18n - improves multilingual template coverage
  • wtf-mlb - fetch baseball data
  • wtf-nhl - fetch hockey data
  • nsfw - flag sexual/graphic/adult articles
  • image - additional methods for .images()
  • html - output html
  • wikitext - output wikitext
  • markdown - output markdown
  • latex - output latex

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily:

  • pass an Api-User-Agent, so they can easily identify and throttle bad scripts
  • bundle multiple pages into one request as an array (say, groups of 5?)
  • run it serially, or at least, slowly.

wtf
  .fetch(['Royal Cinema', 'Aldous Huxley'], {
    lang: 'en',
    'Api-User-Agent': 'spencermountain@gmail.com',
  })
  .then((docList) => {
    let links = docList.map((doc) => doc.links())
    console.log(links)
  })
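
If you have a long list of titles, the three recommendations can be combined. This is only a sketch: `chunk` and `fetchSerially` are hypothetical helpers, the group size and delay are arbitrary, and the actual request is passed in as `fetchFn` so you can plug in wtf.fetch with your own options.

```javascript
// split a list of titles into small groups
function chunk(list, size = 5) {
  const groups = []
  for (let i = 0; i < list.length; i += size) {
    groups.push(list.slice(i, i + size))
  }
  return groups
}

// request one group at a time, with a pause after each request
// fetchFn is supplied by the caller, and should return a promise
async function fetchSerially(titles, fetchFn, { size = 5, delay = 500 } = {}) {
  const docs = []
  for (const group of chunk(titles, size)) {
    const res = await fetchFn(group)
    // accept a single result or an array, and skip empty ones
    docs.push(...[].concat(res).filter(Boolean))
    await new Promise((resolve) => setTimeout(resolve, delay))
  }
  return docs
}

// usage, assuming wtf is already imported:
// fetchSerially(titles, (group) => wtf.fetch(group, { 'Api-User-Agent': 'Name your script here' }))
```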

Full API

  • .title() - get/set the title of the page from the first-sentence
  • .pageID() - get/set the wikimedia id of the page, if we have it.
  • .wikidata() - get/set the wikidata id of the page, if we have it.
  • .domain() - get/set the domain of the wiki we're on, if we have it.
  • .url() - (try to) generate the url for the current article
  • .lang() - get/set the current language (used for url method)
  • .namespace() - get/set the wikimedia namespace of the page, if we have it
  • .isRedirect() - if the page is just a redirect to another page
  • .redirectTo() - the page this redirects to
  • .isDisambiguation() - is this a placeholder page to direct you to one-of-many possible pages
  • .isStub() - if the page is flagged as incomplete
  • .categories() - return all categories of the document
  • .sections() - return a list of the Document's sections
  • .paragraphs() - return a list of Paragraphs, in all sections
  • .sentences() - return a list of all sentences in the document
  • .images() - return all images found in the document
  • .links() - return a list of all links, in all parts of the document
  • .lists() - sections in a page where each line begins with a bullet point
  • .tables() - return a list of all structured tables in the document
  • .templates() - any type of structured-data elements, typically wrapped in like {{this}}
  • .infoboxes() - specific type of template, that appear on the top-right of the page
  • .references() - return a list of 'citations' in the document
  • .coordinates() - geo-locations that appear on the page
  • .text() - plaintext, human-readable output for the page
  • .json() - a 'stringifyable' output of the page's main data
  • .wikitext() - original wiki markup
  • .description() - get/set the page's short description, if we have one.
  • .pageImage() - get/set the page's representative image, if we have one.
  • .revisionID() - get/set the latest edit id of the page, if we have it.
  • .timestamp() - get/set the time of the most recent edit of the page, if we have it.

Section

  • .title() - the name of the section, between ==these tags==
  • .index() - which number section is this, in the whole document.
  • .indentation() - how many steps deep into the table of contents it is
  • .sentences() - return a list of sentences in this section
  • .paragraphs() - return a list of paragraphs in this section
  • .links() - list of all links, in all paragraphs and templates
  • .tables() - list of all html tables
  • .templates() - list of all templates in this section
  • .infoboxes() - list of all infoboxes found in this section
  • .coordinates() - list of all coordinate templates found in this section
  • .lists() - list of all lists in this section
  • .interwiki() - any links to other language wikis
  • .images() - return a list of any images in this section
  • .references() - return a list of 'citations' in this section
  • .remove() - remove the current section from the document
  • .nextSibling() - a section following this one, under the current parent: eg. 1920s → 1930s
  • .lastSibling() - a section before this one, under the current parent: eg. 1930s → 1920s
  • .children() - any sections more specific than this one: eg. History → [PreHistory, 1920s, 1930s]
  • .parent() - the section, broader than this one: eg. 1920s → History
  • .text() - readable plaintext for this section
  • .json() - return all section data
  • .wikitext() - original wiki markup

Paragraph

  • .sentences() - return a list of sentence objects in this paragraph
  • .references() - any citations, or references in all sentences
  • .lists() - any lists found in this paragraph
  • .images() - any images found in this paragraph
  • .links() - list of all links in all sentences
  • .interwiki() - any links to other language wikis
  • .text() - generate readable plaintext for this paragraph
  • .json() - generate some generic data for this paragraph in JSON format
  • .wikitext() - original wiki markup

Sentence

  • .links() - list of all links
  • .bolds() - list of all bold texts
  • .italics() - list of all italic formatted text
  • .text() - generate readable plaintext
  • .json() - return all sentence data
  • .wikitext() - original wiki markup

Image

  • .url() - return url to full size image
  • .thumbnail() - return url to thumbnail (pass size to customize)
  • .links() - any links from the caption (if present)
  • .format() - get file format (e.g. jpg)
  • .text() - does nothing
  • .json() - return some generic metadata for this image
  • .wikitext() - original wiki markup

Template

  • .text() - does this template generate any readable plaintext?
  • .json() - get all the data for this template
  • .wikitext() - original wiki markup

Infobox

  • .links() - any internal or external links in this infobox
  • .keyValue() - generate simple key:value strings from this infobox
  • .image() - grab the main image from this infobox
  • .get() - lookup properties from their key
  • .template() - which infobox, eg 'Infobox Person'
  • .text() - generate readable plaintext for this infobox
  • .json() - generate some generic 'stringifyable' data for this infobox
  • .wikitext() - original wiki markup

List

  • .lines() - get an array of each member of the list
  • .links() - get all links mentioned in this list
  • .text() - generate readable plaintext for this list
  • .json() - generate some generic easily-parsable data for this list
  • .wikitext() - original wiki markup

Reference

  • .title() - generate human-facing text for this reference
  • .links() - get any links mentioned in this reference
  • .text() - returns nothing
  • .json() - generate some generic metadata data for this reference
  • .wikitext() - original wiki markup

Table

  • .links() - get any links mentioned in this table
  • .keyValue() - generate a simple list of key:value objects for this table
  • .text() - returns nothing
  • .json() - generate some useful metadata data for this table
  • .wikitext() - original wiki markup

Configuration

Adding new methods:

you can add new methods to any class of the library, with wtf.extend()

wtf.extend((models) => {
  // throw this method in there...
  models.Doc.prototype.isPerson = function () {
    return this.categories().find((cat) => cat.match(/people/))
  }
})

let doc = await wtf.fetch('Stephen Harper')
doc.isPerson()

Adding new templates:

does your wiki use a {{foo}} template? Add a custom parser for it:

wtf.extend((models, templates) => {
  // create a custom parser function
  templates.foo = (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return 'new-text'
  }

  // array-syntax allows easy-labeling of parameters
  // (a different name here, so it doesn't overwrite the function above)
  templates.bar = ['a', 'b', 'c']

  // number-syntax for returning by param # '{{name|zero|one|two}}'
  templates.baz = 0

  // replace the template with a string '{{asterisk}}' -> '*'
  templates.asterisk = '*'
})
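
As a rough picture of what the array-syntax is for: it gives names to a template's positional parameters. The sketch below is a standalone illustration under that assumption, not the library's parser; `labelParams` is a hypothetical function, and it ignores named (key=value) parameters and nested templates.

```javascript
// hypothetical illustration: label positional parameters with the given names
function labelParams(wikitext, names) {
  // strip the surrounding {{ }}
  const inner = wikitext.replace(/^\{\{/, '').replace(/\}\}$/, '')
  const [template, ...params] = inner.split('|').map((s) => s.trim())
  const obj = { template: template.toLowerCase() }
  params.forEach((val, i) => {
    // parameters beyond the list of names are dropped in this sketch
    if (names[i] !== undefined) {
      obj[names[i]] = val
    }
  })
  return obj
}

labelParams('{{thing|x|y|z}}', ['a', 'b', 'c'])
// { template: 'thing', a: 'x', b: 'y', c: 'z' }
```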

by default, if there's no parser for a template, it is simply ignored and produces an empty string. However, it's possible to configure a fallback parser function to handle these templates:

wtf('some {{weird_template}} here', {
  templateFallbackFn: (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return '[unsupported template]' // or return null to ignore this template
  },
})

you can determine which templates are understood to be 'infoboxes' with the 3rd parameter:

wtf.extend((models, templates, infoboxes) => {
  Object.assign(infoboxes, { person: true, place: true, thing: true })
})

Notes:

3rd-party wikis

by default, a public API is provided by an installed mediawiki application. This means that most wikis have an open api, even if they don't realize it. Some wikis may turn this feature off.

It can usually be found by visiting http://mywiki.com/api.php

to fetch pages from a 3rd-party wiki:

wtf.fetch('Kermit', { domain: 'muppet.fandom.com' }).then((doc) => {
  console.log(doc.text())
})

some wikis will change the path of their API, from ./api.php to elsewhere. If your api has a different path, you can set it like so:

wtf.fetch('2016-06-04_-_J.Fernandes_@_FIL,_Lisbon', { domain: 'www.mixesdb.com', path: 'db/api.php' }).then((doc) => {
  console.log(doc.template('player').json())
})
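
To see how domain and path fit together, here is a sketch of building a request url by hand. `apiUrl` is a hypothetical helper, and the query parameters are standard MediaWiki Action API ones chosen for illustration; they are not necessarily the exact ones this library sends.

```javascript
// hypothetical helper: build an api url from a domain and an api path
function apiUrl(title, { domain, path = 'api.php' } = {}) {
  const params = new URLSearchParams({
    action: 'query',
    prop: 'revisions',
    rvprop: 'content',
    rvslots: 'main',
    titles: title,
    format: 'json',
  })
  return `https://${domain}/${path}?${params}`
}

apiUrl('Kermit', { domain: 'muppet.fandom.com' })
// 'https://muppet.fandom.com/api.php?action=query&prop=revisions&rvprop=content&rvslots=main&titles=Kermit&format=json'
```

Opening a url like this in a browser is a quick way to check whether a wiki's api is reachable at the path you expect.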

for image-urls to work properly, the wiki should also have Special:Redirect enabled. Some wikis (like wikia) have intentionally disabled this.
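
Special:Redirect matters because an image url can be built from just the filename, without knowing where the file is actually stored. A minimal sketch of that idea; `redirectUrl` is a hypothetical helper, and the default domain is only an assumption for the example:

```javascript
// hypothetical helper: build a Special:Redirect url from a filename
function redirectUrl(file, { domain = 'wikipedia.org', width } = {}) {
  // drop the 'File:' or 'Image:' prefix, and use underscores like wiki titles do
  const name = file
    .replace(/^(file|image):/i, '')
    .trim()
    .replace(/ /g, '_')
  let url = `https://${domain}/wiki/Special:Redirect/file/${encodeURIComponent(name)}`
  // an optional width asks the wiki for a scaled-down thumbnail
  if (width) {
    url += `?width=${width}`
  }
  return url
}

redirectUrl('Image:Duveneck Whistling Boy.jpg', { width: 300 })
// 'https://wikipedia.org/wiki/Special:Redirect/file/Duveneck_Whistling_Boy.jpg?width=300'
```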

i18n and multi-language:

wikitext is (amazingly) used across all languages, wikis, and even in right-to-left languages. This parser actually does an okay job at it too.

Wikipedia i18n language information for Redirects, Infoboxes, Categories, and Images is included in the library, with pretty-decent coverage.

To improve coverage of i18n templates, use wtf-plugin-i18n

Please make a PR if you see something missing for your language.

Builds:

this library ships separate client-side and server-side builds, to preserve filesize.

  • ./builds/wtf_wikipedia-client.mjs - as es-module (or Deno)
  • ./builds/wtf_wikipedia-client.min.js - for production
  • ./builds/wtf_wikipedia.cjs - node commonjs build
  • ./builds/wtf_wikipedia.mjs - node/deno/typescript esm build

the browser version uses fetch() and the server version uses require('https').

Performance:

It is not the fastest parser, and is very unlikely to beat a single-pass parser in C or Java.

Using dumpster-dive, this library can parse a full english wikipedia in around 4 hours on a macbook.

That's about 100 pages/second, per thread.

See also:

Other alternative javascript parsers:

and many more!

MIT

changelog

10.4.0 [Feb 2025]

  • [update] - export esm/require syntax
  • [update] - #581 - use export default to fix ESM incompatibility
  • [new] - #585 Support 'as of' template
  • [fix] - readme typos
  • [update] - dependencies

10.3.2 [Jul 2024]

  • [new] - support many new inline templates
  • [new] - support recursive i18n category queries
  • [update] - dependencies
  • [update] - i18n and api plugins

10.3.1 [May 2024]

  • [fix] - unicode glitch token #573
  • [fix] - retire mixesdb wiki test
  • [fix] - support birthdate template aliases #537
  • [update] - dependencies

10.3.0 [Dec 2023]

  • [new] - fallbackTemplateFn handler #509
  • [new] - more i18n redirects and templates
  • [new] - metadata methods .revisionID(), .description(), .timestamp(), .pageImage()
  • [new] - i18n .isStub() method
  • [new] - debug plugin for finding parsing errors

10.2.1 [Nov 2023]

  • [change] - support more templates
  • [change] - support multiple citations inside a ref tag
  • [new] - add revisionID() - thanks Dag-Inge! #568

10.2.0 [Oct 2023]

  • [change] - typescript export helpers

10.1.7 [Sep 2023]

  • [fix] - don't crash on huge geojson blob #555

10.1.6 [Sep 2023]

  • [change] - handle fetch data errors
  • [fix] - template runtime error #550
  • [update] - deps

10.1.5 [May 2023]

  • [fix] - support inline templates
  • [change] - dont overwrite duplicate props in infobox #530
  • [update] - deps

10.1.4 [Apr 2023]

  • [fix] - #528 template runtime errors
  • [fix] - remove stray console.log (thank you @mxunknown)
  • [update] - some work on gamelog template

10.1.3 [Mar 2023]

  • [fix] - #519 date parsing issue
  • [fix] - #518 support slash in infobox property
  • [fix] - #516 better support {{br}} template

10.1.2 [Jan 2023]

  • [fix] - #514 runtime error
  • [update] - dependencies

10.1.1 [Jan 2023]

  • [change] - support many more inline templates
  • [fix] - wikitext newline join issue
  • [update] - dependencies

10.1.0 [Dec 2022]

  • [fix] - extra dots in interwiki links #510
  • [new] - configure unsupported template behaviour - templateFallbackFn #509
  • [update] - dependencies
  • [change] - allow embedded infoboxes #506
  • [change] - support :File and :Category syntax for #308
  • [new] - support {{medalcount}} template #428

10.0.5 [Dec 2022]

  • [fix] - broken cli script #504

10.0.4 [Dec 2022]

  • [fix] - mangled interwiki link #502
  • [fix] - tabs in infoboxes #435
  • [update] - dependencies

10.0.3 [Oct 2022]

  • [fix] - improved i18n infobox classification
  • [update] - dependencies

10.0.2 [Jul 2022]

  • [fix] - multiple inline templates in a heading #489
  • [fix] - non-i18n list templates #475
  • [fix] - don't print hatnotes in .text()
  • [update] - api, i18n, sports plugins

10.0.1 [May 2022]

  • [fix] - runtime error #484
  • [new] - wtf-plugin-sports for tricky nhl and mlb templates
  • [change] - .random() in api-plugin parses document
  • [change] - update dependencies

10.0.0 [April 2022]

  • [breaking] - drop IE11 support - target evergreen browsers
  • [change] - convert to esmodules internally
  • [change] - add blockquote template
  • [change] - update dependencies

9.1.0 [March 2022]

  • [change] - support inline templates inside section titles
  • [change] - xml parsing fix
  • [change] - increase arbitrary char limit on bold & italics
  • [change] - improve parsing for Image and File names
  • [new] - add .license() method for image plugin
  • [fix] - table parsing bugs
  • [fix] - typescript fixes
  • update deps

huge thank you to @FFatur !!

9.0.3

  • [fix] - typescript error
  • [change] - update demos

9.0.1

  • [fix] - runtime error in cli (thanks maxlath!)
  • [fix] - linter fixes for regexes
  • update deps

9.0.0

Tldr:

  • .templates() now return Template objects, instead of json.
  • cool new http library for .fetch()
  • custom templates receive pre-parsed json
  • more development of plugins

detail:

  • [breaking] - .templates() now returns Template objects, like other methods (call .json())
  • [breaking] - change interpretation of reversed params in .fetch() method (thanks wouter!)
  • [breaking] - change params for custom templates
  • [breaking] - move .random() and .category() to plugin-api
  • [breaking] - always return an array for plural methods, even with number param, like .links(3)
  • [possibly-breaking] - cleanup null|undefined responses from methods
  • [possibly-breaking] - remove .dates() method (prev deprecated)
  • [possibly-breaking] - require node 10, ie > 11
  • [change] - normalize table rows
  • [change] - move wiktionary templates to wtf-plugin-wiktionary
  • [change] - Link.text() now returns page
  • [change] - improvements to 'soft' isDisambiguation detection
  • [change] - deprecate wtf-plugin-category (move to wtf-plugin-api)
  • [new] - api plugin
  • [new] - disambig plugin
  • [new] - person plugin
  • [new] - Table.get() method
  • [new] - set new infoboxes using .extend()

  • plugin-api 0.1.0
  • plugin-classify 1.0.0
  • plugin-disambig 0.0.1
  • plugin-image 0.3.0
  • plugin-person 0.2.0
  • plugin-summary 0.3.0
  • plugin-wikitext 1.1.0
  • plugin-wikinews 0.0.1
  • plugin-wikivoyage 0.0.1
  • plugin-wiktionary 0.0.1

8.5.1

  • fix reference json encoding for mongodb

8.5.0

  • fix for cross-domain 3rd-party wikis
  • improved support for fetching non-wikipedia domains

8.4.0

  • new wikidata() method
  • new domain() method
  • support image urls from 3rd-party wikis
  • support for some html formatting tags #374
  • support for sub and sup templates
  • [fix] for link-parsing bug #375

8.3.0

  • adds some wikivoyage templates
  • fix cli help options
  • change covid template again

8.2.0

  • export http lib for plugin in .extend()
  • stop exporting (huge) mapfile in builds
  • deprecate .dates() from sentence class (didn't work)
  • stop ignoring ref-list template, keep otherwise empty ==References== sections

8.1.2

  • another fix for covid templates

8.1.1

  • fix for covid templates

8.1.0

  • [major] fix Link json object in .json() result
  • [major] fix inconsistent response for singular method aliases like .template('foo')
  • [major] change in rowspan behaviour to support covid table
  • support <noinclude>
  • add .url() and .language() methods
    • support setters on Link methods
    • add Link.href() method
    • support proper urls for interwiki links
  • replicate wikipedia behaviour for apostrophe-s after link
  • new plugins summary, classify, category, and i18n.
  • Link hrefs are not titlecased anymore by default

8.0.0

  • [breaking] move .html(), .latex(), and .markdown() to their respective plugins
    • drop header/footer boilerplate from outputs
  • [breaking] .templates() and .links() return Template and Link objects, and not bare JSON (use .map(l=> l.json()))
  • [breaking] refactor inputs for .fetch()
    • no longer support 'enwikiquote' etc format as input
    • use 'wiki' instead of undocumented 'wikiUrl' param
    • no more automatic throttling/rate-limiting
  • [breaking] remove Image.exists() method to plugin
  • [major] create separate client/server-side build formats (use native fetch/node lib)
  • [major] support deep (infinite) recursion in templates
  • [major] much-stronger i18n support
  • no-longer automatically titlecase links
  • support adding template parsers through plugins in .extend()
    • support array, number, and string shorthand for template parsers
  • deprecate .plaintext() in favour of .text()

7.8.0

  • add .extend() method for authoring plugins

7.7.2

  • bugfixes by suntala

7.6.0

  • use rollup for builds, publish esm module

7.3.0

  • more unicode support

7.2.10

  • improved unicode support for sentence/paragraph splitting
  • supporting more formatting templates, like Mono
  • more flexible reference support in .json()

7.2.9

  • few more sports templates,
  • rowspan parsing fix
  • no-longer include package.json in builds
  • use full template-parser for image captions
  • support manually setting doc.title()

7.2.0

  • improved date templates, bugfixes

7.1.1

  • support population, weatherbox templates

7.1.0

  • some template fixes
  • add a 'number' field in sentence json, when it looks like a number
  • slight change in coordinate result format, support inline coordinate text
  • handle fetching a large list of titles in sequence

7.0.0 🚨

  • change result-format in a lot of templates, for more consistency.
    • notably: reference format, see also, IPA, main
  • support colspan/rowspan in tables (a little!)
  • support implicit first-row headers for some tables
  • return templates even if they have no data
  • begin support for some well-used {{foo start}}...{{foo end}} templates
  • remove empty [] for some more section properties in .json() response

6.3.0

  • support way (+20%?) more templates.

6.2.0

  • support categories in redirects
  • add mongo-encoding from dumpster-dive

6.1.0

  • titlecase internal link destinations #192

6.0.0 🚨

  • support .paragraphs()
  • :warning: major changes to output of .json(). cleaning-up redundant data.:warning:
    • remove top-level templates data (found in section) - resume it with {templates:true}
    • remove top-level coordinates data (found in templates) - resume it with {coordinates:true}
    • remove top-level citations data (found in section) - resume it with {citations:true}
  • return empty arrays in .json() again ¯_(:/)_ /¯
  • remove h1 title on html output
  • change ambiguous options.title for sections to options.headers
  • support lists of 1
  • begin removing empty references section by default
  • begin support for rendering citations at the bottom of documents
  • begin first-class references-parsing as objects at paragraph-level
    • use this: .citations() --> .citations().map(c => c.json());
  • remove .wikitext() and .reparse() methods - keeping wikitext stateful caused too many issues
  • turn Image.file into a function
  • include interwiki() results in .links()
  • support follow_redirects option to fetch
  • hide object data in console.logs
  • move ALL image urls from upload.wikimedia.org/wikipedia/commons to wikipedia.org/wiki/Special:Redirect/file/ via 86
  • image captions are now Sentence objects
  • rename citation → reference internally, and in json output
  • remove references inside section titles

5.3.0

  • add infobox html back into html output (tentative)
  • redirect support in .json(), .html() output
  • remove empty [] properties in .json() results (saves disk space!)
  • keep # anchor data in .links()
  • show links default-on in latex output, like in md and html
  • render html/latex/json 'soft redirect', instead of blank pages

5.2.0

  • make .json() results return proper json for tables

5.1.0

  • improved support for gallery tag
  • more support for wiktionary grammar templates
  • tweak some regexes

5.0.0

  • new Table class and List classes
  • improved table-parser - generate name col1 instead of col-0
  • support options.verbose_template for debugging
  • support recursive tables

4.6.0

  • <gallery> tag support in .images()
  • support pageids again in .fetch()
  • better disambiguation-page detection in english
  • remove wikitext from caption titles
  • support 3-level templates (whew!)

4.5.0

  • support section(0).wikitext()
  • support inline {{marriage}} template
  • dangling semi-colons in first-sentence parentheses

4.2.2

  • support dollar templates

4.2.0

  • return a result or undefined for sentences.bolds(0), and the like

4.1.0

  • remove repeated/redundant text in .links() results
  • don't automatically titlecase link srcs anymore

4.0.0

  • 🚨 non-api changing, but large result-format change
  • add .wikitext() method to Document, Section, Sentence (thanks @niebert)
  • move infobox, citation parser/data to Section class
  • .templates() are now an ordered array, instead of an object, and include infoboxes and citations
  • add (early) support for 'generic' key-value template parsing
  • normalize/lowercase template/infobox properties - add loose .get('key') method to Infobox class
  • mess-around with citation-template formatting
  • beginning to support unknown template forms
  • move date data from Sentence to Section object.
  • rollback of awkward+undocumented options param in parser (but keep options param for output methods)
  • add support for about a hundred new templates
  • templates, including citations, try to be flat-text, and no-longer return Sentence objects

3.1.0

  • improved .json() results
  • guess a page's title based on bold formatting in first sentence
  • make section.title a function

3.0.0

  • BIG API RE-WRITE!
  • move .parse() to main wtf() method
  • allow repeated processes without a pre-parse of the document
  • wtf.fetch() uses promises, and native fetch() method (when available)
  • allow per-section images, lists, tables + templates
  • section depth values now start at 0
  • infobox values now return sentence objects
  • latex output (thanks @niebert!)
  • refactor shell scripts to wtf_wikipedia Toronto --plaintext
  • use babel-preset-env cause it's new-new
  • update deps

2.6.1

  • better html output tables/infoboxes

2.6.0

  • support for markdown output
  • support for html output
  • add page 'title' to response, where possible.
  • better support for capturing the [[link]]'s syntax
  • opt-out of citation, infobox, image ... parsing
  • support a whack of date/time/age templates

2.5.0

  • co-ordinate parsing fix
  • support longer ref tags
  • smarter disambiguation for interwiki links vs pages containing ':'
  • more support for various list syntaxes

2.2.0

  • support for {{coords}} geo-coordinate parsing+conversion
  • early-support for custom template-parsing

2.1.0

  • support table '! row' row heading syntax, and other forms

2.0.0

  • move possibly-repeatable data into the sections object, list 'lists' and 'tables'
  • change library export name to wtf
  • turn infobox into 'infoboxes' array
  • moved 'infobox_template' to infobox.type
  • change initial depth to 0
  • change 'translations' property to 'interwiki'
  • support {{main}} and {{wide image|}} templates

1.0.0

  • make sections into an ordered array, instead of an es6 Map thing. - add 'depth' too