Important: This documentation covers Yarn 1 (Classic).
For Yarn 2+ docs and migration guide, see yarnpkg.com.

Package detail

cvm-lib

havelessbemore30MIT0.1.1TypeScript support: included

Estimate the number of distinct values in a set using the simple and space-efficient CVM algorithm

algorithm, cardinality estimation, count distinct, cvm, distinct, estimate, estimator, sampling, streaming, unique

readme

CVM Library

Estimate the number of distinct values in a set using the simple and space-efficient CVM algorithm.

Version JSR Maintenance License codecov npm bundle size

Getting Started

Install

NPM:

npm install cvm-lib

Yarn:

yarn add cvm-lib

JSR:

jsr add @rojas/cvm

Examples

See the examples/ directory for all examples.

Hamlet

Estimate unique words in Shakespeare's Hamlet:

node ./examples/hamlet/index.js
  • Total words: 31991
  • CVM capacity: 2161
  • Expected uniques: 4762 ± 10.00%
  • Estimated uniques: 4728 (-0.71%)

1M

Estimate unique integers among 1 million random integers.

node ./examples/hamlet/index.js
  • Total values: 1000000
  • CVM capacity: 10631
  • Expected uniques: 994384 ± 5.00%
  • Estimated uniques: 996480 (0.21%)

API

Functions

calculateCapacity(n, epsilon?, delta?)

Calculates the space required to estimate the number of distinct values in a set with a given accuracy and confidence.

  • n: The total number of values in the set, or an estimate if unknown. Must be a positive number.
  • epsilon (optional): An estimate's relative error. Controls accuracy. Must be between 0 and 1. Defaults to 0.05.
  • delta (optional): The probability an estimate is not accurate. Controls confidence. Must be between 0 and 1. Defaults to 0.01.

Classes

Estimator<T>

Estimates the number of distinct values in a set using the CVM algorithm.

  • Constructors

    • new (capacity): Create an instance with a given capacity. Must be a positive integer.
    • new (config): Create an instance using a config object.
  • Properties

    • capacity: Gets the maximum number of samples in memory.
    • randomFn: Gets or sets the random number generator function (e.g. Math.random).
    • sampleRate Gets the base sample rate (e.g. 0.5).
    • size: Gets the number of samples in memory.
  • Methods

    • add(value): Adds a value.
    • clear(): Clears/resets the instance.
    • estimate(): Gets the estimated number of distinct values.

Interfaces

EstimatorConfig<T>

A configuration object used to create Estimator instances.

  • capacity: The maximum number of samples in memory. Must be a positive integer.
  • randomFn (optional): The random number generator function. Should return random or pseudorandom values between 0 and 1.
  • sampleRate (optional): The sampling rate for managing samples. Must be between 0 and 1.
    • Note: Custom values may negatively affect accuracy. In general, the further from 0.5, the more it's affected. If capacity was calculated via calculateCapacity, expected accuracy / confidence may be invalidated.
  • storage (optional): An object that implements SampleSet for storing samples.

SampleSet<T>

Represents a generic set for storing samples.

  • size: The number of values in the set.
  • add(value): Adds a value to the set.
  • clear(): Clears all values from the set.
  • delete(value): Removes a specified value from the set.
  • [Symbol.iterator](): Iterates through the set's values.

Community and Support

Contributions are welcome!

  • Questions / Dicussions: Please contact us via GitHub discussions.

  • Bug Reports: Please use the GitHub issue tracker to report any bugs. Include a detailed description and any relevant code snippets or logs.

  • Feature Requests: Please submit feature requests as issues, clearly describing the feature and its potential benefits.

  • Pull Requests: Please ensure your code adheres to the existing style of the project and include any necessary tests and documentation.

For more information, check out the contributor's guide.

Build

  1. Clone the project from github
git clone git@github.com:havelessbemore/cvm-lib.git
cd cvm-lib
  1. Install dependencies
npm install
  1. Build the project
npm run build

This will output ECMAScript (.mjs) and CommonJS (.cjs) modules in the dist/ directory.

Format

To run the code linter:

npm run lint

To automatically fix linting issues, run:

npm run format

Test

To run tests:

npm test

To run tests with a coverage report:

npm run test:coverage

A coverage report is generated at ./coverage/index.html.

References

  1. Source paper: Chakraborty, S., Vinodchandran, N. V., & Meel, K. S. (2023). Distinct Elements in Streams: An Algorithm for the (Text) Book
  2. Notes by Donald Knuth: Knuth, D. E. (2023). The CVM Algorithm for Estimating Distinct Elements in Streams. Stanford Computer Science Department.
  3. Wikipedia: CVM Algorithm.
  4. High-level summary: Nadis, S. (2024, May 16). Computer Scientists Invent an Efficient New Way to Count. Quanta Magazine..

changelog

Change Log

0.1.1 (2024-06-23)

  • Update the CommonJS build file's extension from .js to .cjs
    • Should mitigate issues importing the package in CommonJS environments.
  • Update dev dependencies

0.1.0 (2024-06-14)

What's New

  • Adds the SampleSet interface.
  • Adds an optional storage property in EstimatorConfig to enable providing a custom object for storing samples.

0.0.5 (2024-06-07)

What's Changed

  • Updates package entry points.
  • Fixes broken link in README.md.

0.0.4 (2024-06-06)

What's Changed

  • Reduces package size.
  • Updates documentation.

0.0.3 (2024-06-05)

What's Changed

  • Fixes incorrect JSR links in README.md
  • Updates README.md

0.0.2 (2024-06-05)

First release via CI/CD

0.0.1 (2024-06-05)

Initial release