Intelligent Text Chunking
A TypeScript library that splits text into chunks along document structure. It recognizes headings and sections, tracks page numbers for PDF text, and keeps chunks aligned to sentence, paragraph, and heading boundaries.
Features
- 🧠 Intelligent Structure Recognition: Automatically detects headings, sections, and document patterns
- 📄 PDF Support: Page-aware chunking with page number metadata
- 🎯 Semantic Boundaries: Respects sentence, paragraph, and heading boundaries
- 📊 Rich Metadata: Comprehensive chunk information including headings, sections, and statistics
- 🔧 Flexible Configuration: Customizable chunk sizes, overlap, and boundary preferences
- 📚 Multiple Document Types: Supports academic papers, legal documents, technical docs, and more
Installation
npm install intelligent-text-chunking
Quick Start
Basic Usage
import { chunkTextIntelligently, ChunkingOptions } from 'intelligent-text-chunking';
const text = `
# Introduction
This is the introduction section with some content.
## Methodology
Here we describe our methodology in detail.
### Data Collection
We collected data from various sources.
## Results
Our results show significant improvements.
`;
const options: ChunkingOptions = {
  maxChunkSize: 500,
  overlapSize: 50,
  respectHeadingBoundaries: true
};
const chunks = chunkTextIntelligently(text, options);
console.log(`Generated ${chunks.length} chunks`);
chunks.forEach((chunk, index) => {
  console.log(`Chunk ${index + 1}:`);
  console.log(` Heading: ${chunk.metadata.heading || 'None'}`);
  console.log(` Level: ${chunk.metadata.headingLevel || 'N/A'}`);
  console.log(` Words: ${chunk.metadata.wordCount}`);
  console.log(` Text: ${chunk.text.substring(0, 100)}...`);
});
PDF-Specific Chunking
import { chunkPDFTextIntelligently } from 'intelligent-text-chunking';
// Extract text from PDF (using pdf2json or similar)
const pdfText = "Your PDF text content...";
const pageBreaks = [1000, 2000, 3000]; // Character positions of page breaks
const chunks = chunkPDFTextIntelligently(pdfText, pageBreaks, {
  maxChunkSize: 800,
  respectParagraphBoundaries: true
});
chunks.forEach(chunk => {
  console.log(`Page ${chunk.metadata.pageNumber}: ${chunk.text.substring(0, 50)}...`);
});
Advanced Configuration
import { IntelligentChunker, ChunkingOptions } from 'intelligent-text-chunking';
const options: ChunkingOptions = {
  maxChunkSize: 1000,               // Maximum characters per chunk
  minChunkSize: 200,                // Minimum characters per chunk
  overlapSize: 100,                 // Overlap between chunks
  respectSentenceBoundaries: true,  // Don't break mid-sentence
  respectParagraphBoundaries: true, // Don't break mid-paragraph
  respectHeadingBoundaries: true,   // Don't break across headings
  preserveHeadingHierarchy: true,   // Maintain heading structure
  maxHeadingLevel: 6                // Maximum heading level to recognize
};
const chunker = new IntelligentChunker(options);
const chunks = chunker.chunkText(yourText);
API Reference
Types
ChunkingOptions
interface ChunkingOptions {
  maxChunkSize?: number;                // Default: 1000
  minChunkSize?: number;                // Default: 200
  overlapSize?: number;                 // Default: 100
  respectSentenceBoundaries?: boolean;  // Default: true
  respectParagraphBoundaries?: boolean; // Default: true
  respectHeadingBoundaries?: boolean;   // Default: true
  preserveHeadingHierarchy?: boolean;   // Default: true
  maxHeadingLevel?: number;             // Default: 6
}
IntelligentChunk
interface IntelligentChunk {
  text: string;
  metadata: ChunkMetadata;
}
interface ChunkMetadata {
  heading?: string;      // Detected heading text
  headingLevel?: number; // Heading level (1-6)
  section?: string;      // Section name
  pageNumber?: number;   // Page number (for PDFs)
  chunkIndex: number;    // Index of this chunk
  totalChunks: number;   // Total number of chunks
  wordCount: number;     // Word count in chunk
  charCount: number;     // Character count in chunk
  startPosition: number; // Start position in original text
  endPosition: number;   // End position in original text
}
Functions
chunkTextIntelligently(text: string, options?: ChunkingOptions): IntelligentChunk[]
Chunks regular text intelligently based on document structure.
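To make the size and overlap options concrete, here is a minimal sketch of sentence-aware chunking with overlap. It illustrates what `maxChunkSize`, `overlapSize`, and `respectSentenceBoundaries` plausibly control; it is not the library's implementation. `sketchChunk` is a hypothetical helper, and it treats only ". ", "! ", and "? " as sentence ends.

```typescript
// Illustrative only: NOT the library's implementation.
// Produces chunks of at most maxChunkSize characters, preferring to end each
// chunk at a sentence boundary. Each new chunk starts overlapSize characters
// before the previous chunk ended.
function sketchChunk(text: string, maxChunkSize = 1000, overlapSize = 100): string[] {
  if (overlapSize >= maxChunkSize) {
    throw new Error("overlapSize must be smaller than maxChunkSize");
  }
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + maxChunkSize, text.length);
    if (end < text.length) {
      // Find the last sentence terminator inside the candidate span.
      const candidate = text.slice(start, end);
      const lastStop = Math.max(
        candidate.lastIndexOf(". "),
        candidate.lastIndexOf("! "),
        candidate.lastIndexOf("? ")
      );
      // Cut at the sentence end only if it lies beyond the overlap region,
      // so that the next start position always moves forward.
      if (lastStop + 1 > overlapSize) {
        end = start + lastStop + 1;
      }
    }
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlapSize;
  }
  return chunks;
}
```

In this sketch the overlap is counted in characters, so a chunk may begin mid-word. A fuller version would snap the overlap start to a word or sentence boundary as well.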
chunkPDFTextIntelligently(text: string, pageBreaks?: number[], options?: ChunkingOptions): IntelligentChunk[]
Chunks PDF text with page awareness and page number metadata.
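The sketch below shows one way to produce `pageBreaks` from per-page text and to map a character position back to a page number. It rests on an assumption the README does not spell out: that each entry in `pageBreaks` is the character offset at which the next page begins, and that pages are numbered from 1. `joinPages` and `pageNumberAt` are hypothetical helpers, not part of the library.

```typescript
// Assumption: pageBreaks[i] is the offset at which page i + 2 begins,
// so every offset before pageBreaks[0] is on page 1.

// Concatenates per-page strings and records where each later page starts.
function joinPages(pages: string[]): { text: string; pageBreaks: number[] } {
  const pageBreaks: number[] = [];
  let text = "";
  pages.forEach((page, i) => {
    if (i > 0) pageBreaks.push(text.length);
    text += page;
  });
  return { text, pageBreaks };
}

// Returns the 1-based page number containing `position`.
// pageBreaks must be sorted in ascending order.
function pageNumberAt(position: number, pageBreaks: number[]): number {
  // Binary search for the count of breaks at or before `position`.
  let lo = 0;
  let hi = pageBreaks.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (pageBreaks[mid] <= position) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo + 1;
}
```

With text extracted page by page (for example via pdf2json), the result of `joinPages` could be passed as the first two arguments of `chunkPDFTextIntelligently`, provided the assumption about offsets holds.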
IntelligentChunker
Main class for advanced chunking operations.
Supported Document Patterns
The library recognizes various document structures:
Academic Papers
- Abstract, Introduction, Conclusion
- References, Bibliography
- Numbered sections (1., 1.1, 1.1.1)
Legal Documents
- Articles, Sections, Chapters
- Roman numerals (I., II., III.)
- Lettered sections (A., B., C.)
Technical Documentation
- Overview, Implementation
- API Reference, Configuration
- Markdown headings (# ## ###)
General Documents
- All caps headings
- Title case with colons
- Table of contents patterns
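As a rough illustration of how patterns like these can be detected, the sketch below classifies a single line as a Markdown, numbered, Roman-numeral, or all-caps heading. The regular expressions and the `detectHeading` helper are assumptions made for this example; the library's actual rules are not documented here and are likely more careful.

```typescript
interface DetectedHeading {
  kind: "markdown" | "numbered" | "roman" | "allCaps";
  level: number;
  title: string;
}

// Hypothetical helper: returns heading info for a line, or null if the line
// does not look like a heading under these simplified rules.
function detectHeading(line: string, maxHeadingLevel = 6): DetectedHeading | null {
  const trimmed = line.trim();

  // Markdown: one to six '#' characters followed by whitespace.
  const md = /^(#{1,6})\s+(.+)$/.exec(trimmed);
  if (md) {
    const level = md[1].length;
    return level <= maxHeadingLevel ? { kind: "markdown", level, title: md[2] } : null;
  }

  // Numbered sections such as "1.", "1.1", "1.1.1".
  // The level is the number of dot-separated components.
  const num = /^(\d+(?:\.\d+)*)\.?\s+(\S.*)$/.exec(trimmed);
  if (num) {
    const level = num[1].split(".").length;
    return level <= maxHeadingLevel ? { kind: "numbered", level, title: num[2] } : null;
  }

  // Roman numerals such as "I.", "II.", "III." (treated as top level).
  const roman = /^([IVXLCDM]+)\.\s+(\S.*)$/.exec(trimmed);
  if (roman) {
    return { kind: "roman", level: 1, title: roman[2] };
  }

  // All-caps lines of at least three characters (treated as top level).
  if (/^[A-Z][A-Z0-9 ]{2,}$/.test(trimmed)) {
    return { kind: "allCaps", level: 1, title: trimmed };
  }

  return null;
}
```

These rules are deliberately loose: a sentence that starts with a number ("3 apples were eaten") would be misread as a numbered heading, and lettered sections are left out because "C." is ambiguous with a Roman numeral. Real detection would also need context such as line length and surrounding blank lines.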
Use Cases
- RAG Systems: Create semantic chunks for retrieval-augmented generation
- Document Analysis: Process and analyze structured documents
- Search Systems: Build searchable document chunks with metadata
- Content Management: Organize and structure document content
- AI Training: Prepare text data for machine learning models
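For the RAG and search use cases, chunk metadata can be turned into index records. The sketch below redeclares the documented chunk shape locally (trimmed to the fields it uses) so that it stands alone. `SearchRecord` and `toSearchRecords` are hypothetical names, and the sketch assumes `chunkIndex` is zero-based, which the README does not state.

```typescript
// Local copies of the shapes from the API reference, reduced to the fields used here.
interface ChunkMetadata {
  heading?: string;
  pageNumber?: number;
  chunkIndex: number; // assumed zero-based
  totalChunks: number;
  wordCount: number;
}
interface IntelligentChunk {
  text: string;
  metadata: ChunkMetadata;
}

// Hypothetical record shape for a search or vector index.
interface SearchRecord {
  id: string;
  content: string;
  source: string;
}

// Drops very short chunks, prefixes each chunk with its heading so the
// indexed text carries its context, and builds a human-readable source label.
function toSearchRecords(
  docId: string,
  chunks: IntelligentChunk[],
  minWords = 5
): SearchRecord[] {
  return chunks
    .filter((c) => c.metadata.wordCount >= minWords)
    .map((c) => {
      const { heading, pageNumber, chunkIndex, totalChunks } = c.metadata;
      const where = pageNumber !== undefined ? `, page ${pageNumber}` : "";
      return {
        id: `${docId}#${chunkIndex}`,
        content: heading ? `${heading}\n\n${c.text}` : c.text,
        source: `${docId} (chunk ${chunkIndex + 1} of ${totalChunks}${where})`,
      };
    });
}
```

Prefixing the heading is a design choice: it costs a few characters per record but lets a retriever match on section titles even when the chunk body never repeats them.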
Examples
Academic Paper Processing
import { chunkTextIntelligently } from 'intelligent-text-chunking';
const academicText = `
Abstract
This paper presents a novel approach to text processing.
1. Introduction
Text processing is a fundamental task in NLP.
1.1 Background
Previous work has shown...
2. Methodology
We propose a new algorithm...
3. Results
Our experiments demonstrate...
References
[1] Smith, J. (2023). Text Processing...
`;
const chunks = chunkTextIntelligently(academicText);
// Automatically detects Abstract, Introduction, Methodology, Results, References
Legal Document Processing
import { chunkTextIntelligently } from 'intelligent-text-chunking';
const legalText = `
Article 1. Definitions
For the purposes of this agreement...
Section 2.1. Rights and Obligations
Each party shall have the right to...
Chapter III. Termination
This agreement may be terminated...
`;
const chunks = chunkTextIntelligently(legalText);
// Recognizes Article, Section, Chapter structure
Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
License
MIT License - see LICENSE file for details.
Changelog
1.0.0
- Initial release
- Intelligent text chunking with structure recognition
- PDF support with page awareness
- Comprehensive metadata support
- TypeScript definitions included