DistanceSensitiveHash - Lightning-Fast Euclidean Distance Search

Find similar items in microseconds. This library compresses high-dimensional vectors (embeddings, features, user preferences) into tiny fingerprints that let you estimate Euclidean similarity 9× faster while using 94% less memory than naive Euclidean distance comparison. Perfect for rapidly narrowing a set of options, and edge deployments.

What's the Problem?

You have 100,000 high-dimensional vectors and need to find similar ones fast. Works for anything vector-shaped:

Product feature vectors
Image / text embeddings
User preference vectors
Audio fingerprints
Time series data

How It Works (30-Second Version)

Feed your vector into encodeE2LSH() → get a compact integer signature (16-512 integers)
Compare signatures with estimateSimilarity() → Euclidean similarity in microseconds
Profit

Trade-off: 5-15% error for orders of magnitude speed-up and massive memory savings.

Bundle Size & Performance

Tiny footprint: Core library is only ~8KB (gzipped: ~3KB).

Works everywhere:

✅ Node.js (v14+)
✅ Browser (ES modules)
✅ Web Workers (thread-safe)

Quick Start

npm install distance-sensitive-hash

import { encodeE2LSH, estimateSimilarity } from 'distance-sensitive-hash';

const productA = [0.8, 0.2, 0.9, /*...512 dims...*/ 0.1, 0.7];
const productB = [0.7, 0.3, 0.8, /*...512 dims...*/ 0.2, 0.6];

// 1. Hash once (64 integers)
const fpA = encodeE2LSH(productA, 64);
const fpB = encodeE2LSH(productB, 64);

// 2. Compare in <1 μs
const sim = estimateSimilarity(fpA, fpB);
console.log('Similarity ≈', sim); // 0.0 – 1.0

Two Algorithms, One API

Name	Function	When to Pick
CS-E2LSH (default)	`encodeCSE2LSH()`	3× faster encoding (recommended)
HCS-E2LSH	`encodeHCSE2LSH()`	Better accuracy for medium distances

Pick based on your use case - CS for speed, HCS for better distance estimation.

API Cheatsheet

encodeE2LSH(vector, bits, seed = 42, variant = 'cs', w = 4.0, order = 2)
// vector: number[]           – your data
// bits: int                  – signature size (16-512 recommended)
// seed: int                  – reproducibility
// variant: string            – 'cs' or 'hcs'
// w: float                   – width parameter (4.0 default)
// order: int                 – tensor order for HCS (2-4)
// returns: Int32Array        – compact fingerprint

estimateSimilarity(fpA, fpB, w = 4)
// fpA, fpB: Int32Array       – fingerprints
// w: float                   – width parameter used in encoding
// returns: float 0-1         – Euclidean similarity estimate

Benchmarks

Task	Ops/sec	vs Vanilla	Accuracy
CS-E2LSH encode (512d→64-int)	66,701 sigs/sec	—	85-95%
CS-E2LSH compare	8.7M ops/sec	9.5× faster	±0.15 @ ≥0.7
HCS-E2LSH encode (512d→64-int)	7,857 sigs/sec	—	85-95%
HCS-E2LSH compare	8.8M ops/sec	9.6× faster	±0.13 @ ≥0.7
Vanilla Euclidean	915k ops/sec	baseline	exact

Performance by Signature Size

Size	Generation	Comparison	Memory	Accuracy
16	50.8k/s	9.3M/s	64B	±0.10
32	81.5k/s	14.6M/s	128B	±0.13
64	66.7k/s	8.7M/s	256B	±0.15
128	46.4k/s	4.0M/s	512B	±0.13
256	15.5k/s	2.0M/s	1KB	±0.09

Distance Accuracy Examples

True Distance	CS-E2LSH Error	HCS-E2LSH Error
0.377 (close)	±0.064 (17%)	±0.171 (45%)
1.405 (medium)	±0.543 (39%)	±0.318 (23%)
3.443 (far)	±0.466 (14%)	±0.231 (7%)

Memory Savings

Massive reduction compared to full vector storage:

Vector Size	Full Vector	64-int Signature	Reduction
512-d float	2,048 bytes	256 bytes	8× smaller
1024-d float	4,096 bytes	256 bytes	16× smaller
2048-d float	8,192 bytes	256 bytes	32× smaller

Real-world impact:

1M product vectors (512-d): 2GB → 256MB (87.5% reduction)
10M user vectors (1024-d): 40GB → 2.5GB (93.75% reduction)
100M image vectors (2048-d): 800GB → 25GB (96.9% reduction)

When to Use / Not Use

✅ Use for:

Large datasets (≥10k items)
Real-time recommendations
Memory-constrained environments
Approximate nearest neighbor search
Pre-filtering before exact comparisons

❌ Don't use for:

Exact similarity requirements
Very small datasets (<100 items)
Cosine similarity (use SimilaritySensitiveHash)
Jaccard similarity (use Grouped-OPH)

Academic Roots

Implements Count-Sketch E2LSH (CS-E2LSH) and Higher-Order Count-Sketch E2LSH (HCS-E2LSH) from:

Verma & Pratap, Improving E2LSH with Count-Sketch and Higher-Order Count-Sketch, arXiv:2503.06737

Built on the classic LSH framework (Gionis, Indyk, Motwani, VLDB 1999) and Euclidean LSH (Datar et al., SoCG 2004).

License

MIT

Package detail