
rdfjs-c14n

Implementation in Typescript of the RDF Canonicalization Algorithm RDFC-1.0, on top of the RDF/JS interfaces


RDF Dataset Canonicalization in TypeScript

This is an implementation of the RDF Dataset Canonicalization algorithm, also referred to as RDFC-1.0. The algorithm has been published by the W3C RDF Dataset Canonicalization and Hash Working Group.

Requirements

RDF packages and references

The implementation depends on the interfaces defined by the RDF/JS Data model specification for RDF terms (named and blank nodes) and quads. It also depends on an instance of an RDF Data Factory, specified by the same document. For TypeScript, the necessary type specifications are available through the @rdfjs/types package; an implementation of the RDF Data Factory is provided by, for example, the n3 package, which also provides a Turtle/TriG parser and serializer.

By default (i.e., if not explicitly specified) the Data Factory of the n3 package is used.

Crypto

The implementation relies on the Web Cryptography API as implemented by modern browsers, deno (version 1.38.2 or higher), or node.js (version 21 or higher). A side effect of using Web Crypto is that the canonicalization and hashing interface entries are asynchronous, returning Promises, and must be used, for example, through the await idiom of JavaScript/TypeScript.

Usage

An input RDF Dataset may be represented by any object that can be iterated over, yielding quad instances (e.g., an array of quads, a set of quads, or any specialized object storing quads, like RDF DatasetCore implementations), or by a string representing an N-Quads, Turtle, or TriG document. Formally, the input type is:

Iterable<rdf.Quad> | string

The canonicalization process can be invoked by

  • the canonicalize method, which returns an N-Quads document containing the (sorted) quads of the dataset, using the canonical blank node IDs;
  • the c14n method (called canonicalizeDetailed in earlier versions), which returns an Object of the form:
    • canonicalized_dataset: an RDF DatasetCore instance using the canonical blank node IDs
    • canonical_form: an N-Quads document containing the (sorted) quads of the dataset, using the canonical blank node IDs
    • issued_identifier_map: a Map object, mapping the original blank node IDs (as used in the input) to their canonical equivalents
    • bnode_identifier_map: a Map object, mapping each blank node to its (canonical) blank node ID

Copying the input quads

The Iterable<rdf.Quad> input instance is expected to be a set of quads, i.e., it should not include repeated entries. This is not checked by the process. Usually, the input quads are copied into an internal store, thereby de-duplicating them. Because this can be a costly operation for large datasets, it can be controlled through an additional, optional, boolean parameter copy. The effects are as follows:

  • If copy is set to true, the input quads are copied to an internal store; if it is set to false, the quads are used directly.
  • If copy is not set, the input is copied to an internal store unless the object implements the RDF DatasetCore interface.

If the input is a string serializing a Dataset in Turtle/TriG format, the input is parsed, and duplicate quads are filtered out automatically.

Note that the value of copy must not be set to false if the input is a generator function (even if the generator function avoids duplicate quads).

The separate testing folder includes a tiny application that runs some local tests, and can be used as an example for the additional packages that are required. See also the separate tester repository that runs the official test suite set up by the W3C Working Group.

All the examples below ignore the copy argument.

Installation

For node.js, the usual npm installation can be used:

npm install rdfjs-c14n

The package has been written in TypeScript but is distributed in JavaScript; the type definition (i.e., index.d.ts) is included in the distribution.

Using appropriate tools (e.g., esbuild) the package can be included into a module to be loaded into a browser.

For deno a simple

import { RDFC10, Quads, InputQuads } from "npm:rdfjs-c14n"

will do.

Usage Examples

More detailed documentation of the classes and types is available on GitHub. Basic usage may look as follows:

import * as n3  from 'n3';
import * as rdf from '@rdfjs/types';
// The definitions used here:
// export type Quads = rdf.DatasetCore; 
// export type InputQuads = Iterable<rdf.Quad>;
import { RDFC10, Quads, InputQuads } from 'rdfjs-c14n';

async function main(): Promise<void> {
    // Any implementation of the data factory will do in the call below.
    // By default, the Data Factory of the n3 package is used, i.e., the
    // argument in the call below is not strictly necessary.
    const rdfc10 = new RDFC10(n3.DataFactory);  

    const input: InputQuads = createYourQuads();

    // "normalized" is a dataset of quads with canonical blank node labels
    // per the specification. 
    // Alternatively, "input" could also be a string for a Turtle/TriG document
    const normalized: Quads = (await rdfc10.c14n(input)).canonicalized_dataset;

    // If you care only for the N-Quads results, you can make it simpler
    const normalized_N_Quads: string = (await rdfc10.c14n(input)).canonical_form;

    // Or even simpler, using a shortcut:
    const normalized_N_Quads_bis: string = await rdfc10.canonicalize(input);

    // "hash" is the hash value of the canonical dataset, per specification
    const hash: string = await rdfc10.hash(normalized);
}

Additional features

Choice of hash

The RDFC 1.0 algorithm makes extensive use of hashing. By default, as required by the specification, the hash function is sha256. This default hash function can be changed via the

    rdfc10.hash_algorithm = algorithm;

attribute, where algorithm can be any supported hash function identifier, such as sha256 or sha512. The list of available hash algorithms can be retrieved as:

    rdfc10.available_hash_algorithms;

which corresponds to the values defined by the Web Cryptography API specification as of December 2013, namely sha1, sha256, sha384, and sha512. Future revisions of the specification may add more.

Controlling the complexity level

On rare occasions, the RDFC 1.0 algorithm has to go through complex cycles that may also involve recursive steps. In even more extreme situations, this could result in an unreasonably long canonicalization process. Although this rarely happens in practice, attackers may use "poison graphs" to create such situations (see the security considerations section of the specification).

As specified by the standard, this implementation sets a maximum complexity level (currently 50); this level can be queried via the

    rdfc10.maximum_allowed_complexity_number;

(read-only) attribute. This number can be lowered by setting the

    rdfc10.maximum_complexity_number

attribute. The value of this attribute cannot exceed the system-wide maximum level.


Maintainer: @iherman

changelog

Version 3.1.3

  • More careful initial choice on whether the input dataset must be copied to an internal store or not (triggered by a bug whereby a dataset provided through a generator function would not have worked)

Version 3.1.1

  • On the advice of @jeswr, the Turtle parser is now based on the streaming parser of n3, avoiding an unnecessary "buffer"-like array. The same is true for the test runner.
  • N3's Writer object is used to generate N-Quads, instead of relying on yet another external package (n3 is used anyway, so it is unnecessary to involve yet another module).
  • Added a deduplicate flag to the canonicalization arguments, to set whether duplicate quads should be removed from the input dataset. (This required some changes to the documentation.)

Version 3.1.0

  • The type of input to the algorithm has been changed to Iterable<rdf.Quad> | string. This provides extra flexibility and makes the code clearer. (Proposed by @tpluscode, see github comment)
  • The Turtle/N-Quads parsers have been modified to ensure uniqueness of terms. See github issue.

Version 3.0.1

  • There was a bug in the config file; the key SHA-512 was mapped on SHA-256.
  • Updated the dependency to @rdfjs/types

Version 3.0.0

  • As crypto-js package has been discontinued, switching to the WebCrypto API for hashing (available in node.js for versions 21 and upwards). This is a backward incompatible change, because hashing in WebCrypto is an asynchronous function, and this "bubbles up" to the generic interface as well.

Version 2.0.4

  • Added SHA-384 to the list of available hash functions (missed it the last time)

Version 2.0.3

  • The library has been made "node-independent", i.e., all dependencies have been removed that meant the library could only be used on a node.js platform (as opposed to, say, a browser). This means:
    • Instead of using the built-in crypto module for hashing, the crypto-js library is used. Although it has a slightly smaller set of available hashing functions, that is not really important for RDFC10 (which, formally, is based on sha256, and everything else is just a cherry on the cake)
    • The function that allows the user to set some configuration data via environment variables and/or configuration files has been removed from the core. There is now a separate 'extras' directory in the repository which has this function as a callback example that the application developer can use, and the general structure only relies on callbacks. The callback itself is node.js based; others may want to come up with alternatives for, e.g., deno or a browser.

Version 2.0.2

  • The return structure uses bona fide Map-s for the additional mapping information, instead of a bespoke structure. This makes the usage more natural to end-users.
  • Handling of poison graphs has changed: instead of looking at the recursion level, it looks at the number of calls to "hash n degree quads".
  • The default hash function and complexity levels can also be set via environment variables, and the system also looks for a .rdfjs_c14n.json configuration file in the local directory as well as the user's home directory for further values. These are merged using the usual priority (HOME < Local Dir < environment variables).
  • The canonicalizeDetailed entry point has been renamed c14n...
  • Code improvement: it is simpler to use Set<rdf.Quad> everywhere rather than using a 'Shell' to cover for a Set or an Array. (It may be a bit slower that way, but the complication may not be worth it.)

Version 2.0.1

  • The variable names used in the return structure of the core algorithm have been aligned with the latest version of the spec text

Version 2.0.0

  • Minor editorial changes to be in line with the newest version of eslint
  • The name of the hash algorithm, if set explicitly, is checked against a list of names as returned from openssl list -digest-commands and implemented by node.js. Also, setting the algorithm is now done via accessors, with an extra accessor giving access to the list of available names.
  • Removed the references to dataset factories. It is not a widely implemented feature, and just creates extra complications (e.g., is it an Iterable?). The simple union of Set or Array of quads is, in this respect, much more usable.
  • Changed the way Loggers are used. Instead of letting the user pass in a class instance as a logger (which may be seen as a security risk), the current approach is to use a (text) id to choose among the loggers included in the implementation. Developers can add their own as part of the library, but the lambda user of the library cannot.
  • Synching with the latest version of the official draft (June 2023), namely:
    • The input to the algorithm can be either a Quads object (i.e., a Set or Array of Quads, or an RDF Dataset) or a string, i.e., an N-Quads document. If the latter, it is parsed into a Quads object, keeping the BNode identifiers as used in the N-Quads source.
    • The simple output of the algorithm is an N-Quads document; alternatively, the detailed output is a structure containing the N-Quads and rdf.Quads versions of the data, as well as a mappings of blank nodes and their identifiers.
    • The hash function can use the same type of input as the core input.
    • The name of the algorithm is officially RDFC 1.0. As a consequence, the name of the main entry point has been changed from RDFCanon to RDFC10. The documentation and the tests have also been changed to reflect the new name.
    • The maximum level of recursion in the "Hash N Degree Quads" has a default (currently 50), which can also be set to a lower level by the user.

Version 1.0.4

  • The section number references in the logs have been synced with the official spec version of February 2023

Version 1.0.3

  • The logging system has been changed:
    • it has been more closely integrated to the core library (in line with the evolution of the specification)
    • a logger producing a YAML version of the logs has been added to the core library
  • Minor changes to the interface class:
    • default value for a data factory has been added to simplify usage
    • the core interface class includes reference to serialization and hash beyond the canonicalization itself

Version 1.0.2

  • First version of the documentation completed