extract-zhongwen
extract-zhongwen is a small utility that extracts Chinese characters from a given string based on Unicode ranges.
Installation
npm install extract-zhongwen
Features
- Extracts Chinese characters from an input string.
- Supports Unicode normalization (NFKC) with optional preservation of specified characters.
- Allows whitelisting or blacklisting of specific characters.
- Option to remove duplicate characters.
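To illustrate what range-based extraction looks like, here is a minimal sketch. It is not the package's source: the four CJK blocks below are an assumption (the ranges the package actually checks may differ), and `HAN_RANGES`, `isHan`, and `extractHan` are hypothetical names used only for this example.

```typescript
// Assumed ranges: four standard CJK ideograph blocks. The package may check others.
const HAN_RANGES: ReadonlyArray<readonly [number, number]> = [
  [0x3400, 0x4dbf], // CJK Unified Ideographs Extension A
  [0x4e00, 0x9fff], // CJK Unified Ideographs
  [0xf900, 0xfaff], // CJK Compatibility Ideographs
  [0x20000, 0x2a6df], // CJK Unified Ideographs Extension B
];

// Hypothetical helper: true if the first code point of `ch` falls in an assumed range.
function isHan(ch: string): boolean {
  const cp = ch.codePointAt(0);
  if (cp === undefined) return false;
  return HAN_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Hypothetical helper: keep only the code points that fall in the assumed ranges.
function extractHan(input: string): string {
  // Array.from splits by code point, so characters outside the BMP stay intact.
  return Array.from(input).filter((ch) => isHan(ch)).join("");
}

console.log(extractHan("中文字符 English Characters")); // "中文字符"
```

Comparing code points rather than UTF-16 units matters here, because Extension B characters are encoded as surrogate pairs in JavaScript strings.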
Function Signature
const extract = (
  input: string,
  options?: Options
): string
Parameters
input (string)
The input string from which Chinese characters will be extracted.
options (Options)
An object containing configuration options.
Option | Type | Default | Description |
---|---|---|---|
normalizeUnicode | boolean | true | If true, normalizes Unicode characters to NFKC form, while preserving whitelisted characters. |
removeDuplicates | boolean | true | If true, removes duplicate Chinese characters in the output. |
includeCharacters | string | "" | A string of characters to explicitly include in the extracted output, even if they don't match general Chinese character ranges. |
excludeCharacters | string | "" | A string of characters to exclude from the extracted output, even if they match Chinese character ranges. |
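The options and defaults listed above could be modeled roughly as follows. This is a sketch reconstructed from the table only: the package's exported `Options` type may be declared differently, and `DEFAULTS` and `resolveOptions` are hypothetical names that are not part of its API.

```typescript
// Assumed shape, reconstructed from the documented options; every field is optional.
interface Options {
  normalizeUnicode?: boolean;
  removeDuplicates?: boolean;
  includeCharacters?: string;
  excludeCharacters?: string;
}

// Documented defaults, gathered in one place (hypothetical constant).
const DEFAULTS: Required<Options> = {
  normalizeUnicode: true,
  removeDuplicates: true,
  includeCharacters: "",
  excludeCharacters: "",
};

// Hypothetical helper: fill in any option the caller left out.
function resolveOptions(options: Options = {}): Required<Options> {
  return { ...DEFAULTS, ...options };
}

const resolved = resolveOptions({ removeDuplicates: false });
console.log(resolved.removeDuplicates); // false
console.log(resolved.normalizeUnicode); // true
```

Because every field is optional, a caller only needs to pass the options that differ from the defaults.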
Notes
- Whitelisted characters in includeCharacters will not be normalized if present.
- includeCharacters and excludeCharacters treat each character individually. This means that it is not possible to whitelist or blacklist specific words or phrases.
- If includeCharacters and excludeCharacters contain overlapping characters, the overlapping characters will be filtered out.
- Whitespace, punctuation, and any non-Chinese characters are filtered out by default.
- Duplicate characters are removed at the very end. This means that unnormalized characters and their normalized counterparts may be considered duplicates if the normalizeUnicode option is enabled, even if their Unicode values are technically different.
- If normalizeUnicode is disabled, characters with different Unicode representations might not be merged correctly.
- Performance may vary for very large input strings, especially when removeDuplicates is enabled, since it requires additional processing.
- If no Chinese characters are found in the input, an empty string will be returned.
- The function does not differentiate between Simplified and Traditional Chinese.
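The ordering described in these notes (whitelist handling, per-character filtering, then duplicate removal last) can be sketched as below. This is one reading of the notes, not the package's implementation: `sketchExtract`, `SketchOptions`, and `inHanBlocks` are hypothetical names, the three code point ranges are an assumption, and comparing duplicates by their NFKC form is an interpretation of the note about normalized counterparts.

```typescript
interface SketchOptions {
  include?: string;
  exclude?: string;
  normalize?: boolean;
  dedupe?: boolean;
}

// Assumed ranges for this sketch only: Extension A, the main block, compatibility ideographs.
function inHanBlocks(ch: string): boolean {
  const cp = ch.codePointAt(0) ?? 0;
  return (
    (cp >= 0x3400 && cp <= 0x4dbf) ||
    (cp >= 0x4e00 && cp <= 0x9fff) ||
    (cp >= 0xf900 && cp <= 0xfaff)
  );
}

function sketchExtract(input: string, opts: SketchOptions = {}): string {
  const { include = "", exclude = "", normalize = true, dedupe = true } = opts;
  const excluded = new Set(Array.from(exclude));
  // Overlap rule: a character listed in both strings ends up filtered out.
  const included = new Set(Array.from(include).filter((ch) => !excluded.has(ch)));

  const kept: string[] = [];
  for (const ch of Array.from(input)) {
    // Whitelisted characters skip normalization.
    const norm = normalize && !included.has(ch) ? ch.normalize("NFKC") : ch;
    for (const out of Array.from(norm)) {
      if (excluded.has(out)) continue;
      if (included.has(out) || inHanBlocks(out)) kept.push(out);
    }
  }
  if (!dedupe) return kept.join("");

  // Duplicates are removed last. Assumption: with normalization on, characters are
  // compared by their NFKC form, so a preserved variant and its normalized
  // counterpart count as the same character.
  const seen = new Set<string>();
  const result: string[] = [];
  for (const ch of kept) {
    const key = normalize ? ch.normalize("NFKC") : ch;
    if (seen.has(key)) continue;
    seen.add(key);
    result.push(ch);
  }
  return result.join("");
}

console.log(sketchExtract("你好 你好 世界 世界")); // "你好世界"
// U+FA4C is a compatibility form of U+793E; with normalization on they merge.
console.log(sketchExtract("\uFA4C\u793E") === "\u793E"); // true
// With normalization off, both code points survive as distinct characters.
console.log(sketchExtract("\uFA4C\u793E", { normalize: false }).length); // 2
```

Running duplicate removal after normalization is what lets compatibility variants collapse into one character; doing it earlier would compare raw code points and miss them.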
Example Usage
import { extract } from "extract-zhongwen";
console.log(extract("中文字符 English Characters"));
// Output: "中文字符"
// Example with normalization (NFKC)
console.log(extract("社 社 祖 租", { normalizeUnicode: true }));
// Output: "社祖租"
// Example with duplicate removal
console.log(extract("你好 你好 世界 世界", { removeDuplicates: true }));
// Output: "你好世界"
// Example with duplicate removal disabled
console.log(extract("你好 你好 世界 世界", { removeDuplicates: false }));
// Output: "你好你好世界世界"
// Example including specific characters
console.log(
  extract("Hello 你好,世界!", { includeCharacters: "l,! " })
);
// Output: "ll 你好,世界!"
// Example excluding a specific character
console.log(extract("Hello 你好,世界!", { excludeCharacters: "世" }));
// Output: "你好界"
// Example including and excluding characters
console.log(
  extract(
    "那座山,正当顶上,有一块仙石 On the summit of the mountain was a mythical stone",
    {
      includeCharacters: "On the summit of the mountain",
      excludeCharacters: "那座山",
    }
  )
);
// Output: "正当顶上有一块仙石 On the summit of the mountain"