js-tiktoken

Owner: dqbd · Downloads: 4.2m · License: MIT · Version: 1.0.20 · TypeScript support: included

JavaScript port of tiktoken

readme

⏳ js-tiktoken

tiktoken is a BPE tokeniser for use with OpenAI's models. This is a pure JS port of the original tiktoken library.

Install the library from NPM:

npm install js-tiktoken
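
Or, when using Yarn (as this page documents), the equivalent command is:

yarn add js-tiktoken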

Lite

You can load only the ranks you need, which significantly reduces the bundle size:

import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

const enc = new Tiktoken(o200k_base);
assert(enc.decode(enc.encode("hello world")) === "hello world");
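
The encoder returns an array of numeric token IDs, so counting tokens is just taking the array's length. A small self-contained sketch (the prompt string and variable names are only illustrative):

import { Tiktoken } from "js-tiktoken/lite";
import o200k_base from "js-tiktoken/ranks/o200k_base";

// Build the encoder from the o200k_base ranks, as in the snippet above.
const enc = new Tiktoken(o200k_base);

// encode() yields an array of integer token IDs; its length is the token count.
const tokens = enc.encode("How long is this prompt?");
console.log(tokens.length);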

Alternatively, encodings can be loaded dynamically from our CDN hosted on Cloudflare Pages.

import { Tiktoken } from "js-tiktoken/lite";

const res = await fetch(`https://tiktoken.pages.dev/js/o200k_base.json`);
const o200k_base = await res.json();

const enc = new Tiktoken(o200k_base);
assert(enc.decode(enc.encode("hello world")) === "hello world");
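
If you load ranks over the network like this, it may be worth caching the result so each encoding is fetched at most once. A minimal sketch assuming the same CDN URL scheme; the cache and the getLiteEncoding helper below are illustrative, not part of js-tiktoken:

import { Tiktoken } from "js-tiktoken/lite";

// Cache one in-flight/resolved encoder per encoding name.
const encoderCache = new Map<string, Promise<Tiktoken>>();

function getLiteEncoding(name: string): Promise<Tiktoken> {
  let cached = encoderCache.get(name);
  if (!cached) {
    cached = fetch(`https://tiktoken.pages.dev/js/${name}.json`)
      .then((res) => res.json())
      .then((ranks) => new Tiktoken(ranks));
    encoderCache.set(name, cached);
  }
  return cached;
}

const enc = await getLiteEncoding("o200k_base");
console.log(enc.encode("hello world").length);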

Full usage

If you need all the OpenAI tokenizers, you can import the entire library:

[!CAUTION] This will include all the OpenAI tokenizers, which may significantly increase the bundle size. See the Lite section above for a smaller alternative.

import assert from "node:assert";
import { getEncoding, encodingForModel } from "js-tiktoken";

const enc = getEncoding("gpt2");
assert(enc.decode(enc.encode("hello world")) === "hello world");
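
encodingForModel is imported above but not used in the snippet; it resolves the tokenizer from a model name rather than an encoding name. A short sketch (which model names are recognized depends on the installed version):

import assert from "node:assert";
import { encodingForModel } from "js-tiktoken";

const enc = encodingForModel("gpt-4");
assert(enc.decode(enc.encode("hello world")) === "hello world");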

changelog

Changelog

This is the changelog for the open source version of tiktoken.

[v0.5.1]

  • Add encoding_name_for_model, undo some renames to variables that are implementation details

[v0.5.0]

  • Add tiktoken._educational submodule to better document how byte pair encoding works
  • Ensure encoding_for_model knows about several new models
  • Add decode_with_offsets
  • Better error for failures with the plugin mechanism
  • Make more tests public
  • Update versions of dependencies

[v0.4.0]

  • Add decode_batch and decode_bytes_batch
  • Improve error messages and handling

[v0.3.3]

  • tiktoken will now make a best effort attempt to replace surrogate pairs with the corresponding Unicode character and will replace lone surrogates with the Unicode replacement character.

[v0.3.2]

  • Add encoding for GPT-4

[v0.3.1]

  • Build aarch64 wheels
  • Make blobfile an optional dependency

Thank you to @messense for the environment variable that makes cargo not OOM under emulation!

[v0.3.0]

  • Improve performance by 5-20%; thank you to @nistath!
  • Add gpt-3.5-turbo models to encoding_for_model
  • Add prefix matching to encoding_for_model to better support future model versions
  • Fix a bug in the README instructions on extending tiktoken
  • Update the set of available encodings
  • Add packaging metadata

[v0.2.0]

  • Add tiktoken.encoding_for_model to get the encoding for a specific model
  • Improve portability of caching logic

Thank you to @fritzo, @arvid220u, @khanhvu207, @henriktorget for various small corrections.

[v0.1.2]

  • Avoid use of blobfile for public files
  • Add support for Python 3.8
  • Add py.typed
  • Improve the public tests

[v0.1.1]

  • Initial release