js-tiktoken

Owner: dqbd · Downloads: 4.2m · License: MIT · Version: 1.0.20 · TypeScript support: included

JavaScript port of tiktoken

readme

⏳ js-tiktoken

tiktoken is a BPE tokeniser for use with OpenAI's models. This is a pure JS port of the original tiktoken library.

Install the library from NPM:

npm install js-tiktoken
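
Or, when using Yarn (as this page documents), the equivalent command is:

yarn add js-tiktoken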

Lite

You can load only the ranks you need, which significantly reduces the bundle size:

import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

const enc = new Tiktoken(o200k_base);
assert(enc.decode(enc.encode("hello world")) === "hello world");
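
The encoder returns an array of numeric token IDs, so counting tokens is just taking the array's length. A small self-contained sketch (the prompt string and variable names are only illustrative):

import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

// Build the encoder from the o200k_base ranks, as in the snippet above.
const enc = new Tiktoken(o200k_base);

// encode() yields an array of integer token IDs; its length is the token count.
const tokens = enc.encode("How long is this prompt?");
console.log(tokens.length);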

Alternatively, encodings can be loaded dynamically from our CDN hosted on Cloudflare Pages.

import { Tiktoken } from "js-tiktoken/lite";

const res = await fetch(`https://tiktoken.pages.dev/js/o200k_base.json`);
const o200k_base = await res.json();

const enc = new Tiktoken(o200k_base);
assert(enc.decode(enc.encode("hello world")) === "hello world");
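
If you load ranks over the network like this, it may be worth caching the result so each encoding is fetched at most once. A minimal sketch assuming the same CDN URL scheme; the cache and the getLiteEncoding helper below are illustrative, not part of js-tiktoken:

import { Tiktoken } from "js-tiktoken/lite";

// Cache one in-flight/resolved encoder per encoding name.
const encoderCache = new Map<string, Promise<Tiktoken>>();

function getLiteEncoding(name: string): Promise<Tiktoken> {
  let cached = encoderCache.get(name);
  if (!cached) {
    cached = fetch(`https://tiktoken.pages.dev/js/${name}.json`)
      .then((res) => res.json())
      .then((ranks) => new Tiktoken(ranks));
    encoderCache.set(name, cached);
  }
  return cached;
}

const enc = await getLiteEncoding("o200k_base");
console.log(enc.encode("hello world").length);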

Full usage

If you need all the OpenAI tokenizers, you can import the entire library:

[!CAUTION] This will include all the OpenAI tokenizers, which may significantly increase the bundle size. See the Lite section above for a smaller alternative.

import assert from "node:assert";
import { getEncoding, encodingForModel } from "js-tiktoken";

const enc = getEncoding("gpt2");
assert(enc.decode(enc.encode("hello world")) === "hello world");
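
encodingForModel is imported above but not used in the snippet; it resolves the tokenizer from a model name rather than an encoding name. A short sketch (which model names are recognized depends on the installed version):

import assert from "node:assert";
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4");
assert(enc.decode(enc.encode("hello world")) === "hello world");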

changelog

Changelog

This is the changelog for the open source version of tiktoken.

[v0.5.1]

  • Add encoding_name_for_model, undo some renames to variables that are implementation details

[v0.5.0]

  • Add tiktoken._educational submodule to better document how byte pair encoding works
  • Ensure encoding_for_model knows about several new models
  • Add decode_with_offsets
  • Better error for failures with the plugin mechanism
  • Make more tests public
  • Update versions of dependencies

[v0.4.0]

  • Add decode_batch and decode_bytes_batch
  • Improve error messages and handling

[v0.3.3]

  • tiktoken will now make a best effort attempt to replace surrogate pairs with the corresponding Unicode character and will replace lone surrogates with the Unicode replacement character.

[v0.3.2]

  • Add encoding for GPT-4

[v0.3.1]

  • Build aarch64 wheels
  • Make blobfile an optional dependency

Thank you to @messense for the environment variable that makes cargo not OOM under emulation!

[v0.3.0]

  • Improve performance by 5-20%; thank you to @nistath!
  • Add gpt-3.5-turbo models to encoding_for_model
  • Add prefix matching to encoding_for_model to better support future model versions
  • Fix a bug in the README instructions on extending tiktoken
  • Update the set of available encodings
  • Add packaging metadata

[v0.2.0]

  • Add tiktoken.encoding_for_model to get the encoding for a specific model
  • Improve portability of caching logic

Thank you to @fritzo, @arvid220u, @khanhvu207, @henriktorget for various small corrections.

[v0.1.2]

  • Avoid use of blobfile for public files
  • Add support for Python 3.8
  • Add py.typed
  • Improve the public tests

[v0.1.1]

  • Initial release