Vocabulary specification for backward-compatible JSON-LD 1.1 extensions targeting AI/ML data exchange
Status: Draft Specification v0.1.0
Date: 2026-02-12
Part of: JSON-LD Extensions for AI/ML (jsonld-ex)
This document specifies the transport extensions for jsonld-ex. Two modules — CBOR-LD serialization and MQTT transport optimization — enable efficient transmission of JSON-LD documents over bandwidth-constrained and publish/subscribe networks, with particular attention to IoT and sensor data pipelines.
The CBOR-LD module provides binary serialization with context compression, reducing payload size for repeated context references. The MQTT module builds on CBOR-LD to provide automatic topic derivation, quality-of-service mapping from confidence metadata, and MQTT 5.0 PUBLISH property generation — bridging JSON-LD semantics and MQTT transport semantics.
IoT and edge computing environments impose constraints that standard JSON-LD processing does not address:
@type, @id) enables semantic routing without application-level dispatching.
@validUntil) provide a natural source for this metadata; a message whose assertion has expired is not worth delivering.
The key words “MUST”, “MUST NOT”, “SHOULD”, and “MAY” in this document are to be interpreted as described in RFC 2119.
CBOR-LD serialization encodes JSON-LD documents as CBOR binary data, with optional context compression that replaces repeated context URLs with short integer identifiers.
To serialize a JSON-LD document to CBOR:
Input: A JSON-LD document (dict) and an optional context registry (dict mapping context URL strings to integer IDs).
Procedure:
@context value:
a. If the value is a string and appears as a key in the registry, replace it with the corresponding integer ID.
b. If the value is an array, apply this replacement to each string element.
c. If the value is an inline context object, preserve it without compression.
All jsonld-ex extension keywords (@confidence, @source, @validFrom, etc.) are preserved as-is in the CBOR encoding. They are standard JSON-LD properties and serialize naturally to CBOR.
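The context replacement in steps a to c can be sketched in a few lines. This is a minimal illustration, not the reference implementation: the function name `compress_context` and the one-entry `REGISTRY` are hypothetical, and the CBOR encoding step itself is left out so the example runs with the standard library alone.

```python
# Hypothetical one-entry registry, for illustration only.
REGISTRY = {"https://schema.org/": 1}


def compress_context(doc, registry):
    """Return a shallow copy of doc with registered @context URLs
    replaced by their integer IDs. The input document is not mutated."""
    out = dict(doc)
    ctx = out.get("@context")
    if isinstance(ctx, str):
        # Step a: a single context URL.
        out["@context"] = registry.get(ctx, ctx)
    elif isinstance(ctx, list):
        # Step b: replace each string element; other elements pass through.
        out["@context"] = [
            registry.get(item, item) if isinstance(item, str) else item
            for item in ctx
        ]
    # Step c: an inline context object (dict) is left untouched.
    return out


doc = {"@context": "https://schema.org/", "@type": "SensorReading"}
print(compress_context(doc, REGISTRY))
```

The compressed structure would then be handed to a CBOR encoder (for example `cbor2.dumps`) to produce the final bytes.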
To deserialize CBOR bytes back to a JSON-LD document:
Input: CBOR-encoded bytes and an optional context registry (the same registry used during encoding).
Procedure:
@context value:
a. If the value is an integer and appears in the reverse registry, replace it with the corresponding URL string.
b. If the value is an array, apply this replacement to each integer element.
c. String and object values are preserved unchanged.
The reference implementation provides a default registry for well-known contexts:
| Context URL | Integer ID |
|---|---|
| http://schema.org/ | 1 |
| https://schema.org/ | 1 |
| https://www.w3.org/ns/activitystreams | 2 |
| https://w3id.org/security/v2 | 3 |
| https://www.w3.org/2018/credentials/v1 | 4 |
| http://www.w3.org/ns/prov# | 5 |
Note that both http:// and https:// variants of schema.org map to the same ID, reflecting real-world usage where both schemes are encountered.
Processors MAY extend or replace this registry with application-specific mappings. Sender and receiver MUST agree on the registry in use — a mismatch will produce incorrect context restoration.
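One consequence of mapping two URLs to the same ID is that the reverse lookup used during decoding has to pick one of them. The sketch below illustrates this with a three-entry excerpt of the table; the helper names are hypothetical, and the choice to keep the first URL registered for each ID is an assumption of this example, not documented behaviour of the reference implementation.

```python
# Excerpt of the default registry table above.
REGISTRY = {
    "http://schema.org/": 1,
    "https://schema.org/": 1,
    "https://www.w3.org/ns/activitystreams": 2,
}


def build_reverse_registry(registry):
    """Invert URL -> ID into ID -> URL.
    Assumption: when several URLs share an ID, the first one wins."""
    reverse = {}
    for url, ident in registry.items():
        reverse.setdefault(ident, url)
    return reverse


def _is_id(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def restore_context(doc, registry):
    """Return a shallow copy of doc with integer context IDs expanded to URLs."""
    reverse = build_reverse_registry(registry)
    out = dict(doc)
    ctx = out.get("@context")
    if _is_id(ctx):
        out["@context"] = reverse.get(ctx, ctx)
    elif isinstance(ctx, list):
        out["@context"] = [reverse.get(c, c) if _is_id(c) else c for c in ctx]
    # Strings and inline context objects are preserved unchanged.
    return out


print(restore_context({"@context": 1, "@type": "Person"}, REGISTRY))
```

Under this assumption, a document sent with `https://schema.org/` comes back with `http://schema.org/`, so the round trip preserves the vocabulary but not necessarily the URL scheme.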
The payload_stats function computes four serialization sizes for a document, enabling empirical comparison:
| Metric | Description |
|---|---|
| json_bytes | Compact JSON serialization (no whitespace) |
| cbor_bytes | CBOR with context compression |
| gzip_json_bytes | Gzip-compressed JSON |
| gzip_cbor_bytes | Gzip-compressed CBOR |
Two derived ratios are provided:
cbor_ratio = cbor_bytes / json_bytes, which measures CBOR compression alone.
gzip_cbor_ratio = gzip_cbor_bytes / json_bytes, the headline number for maximum compression (CBOR + gzip vs. raw JSON).
Example:
{
"@context": "https://schema.org/",
"@type": "SensorReading",
"temperature": {
"@value": 36.7,
"@confidence": 0.95,
"@unit": "celsius"
}
}
Typical results for a document like this: JSON ~180 bytes, CBOR ~120 bytes (0.67 ratio), gzip CBOR ~95 bytes (0.53 ratio).
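The four sizes and two ratios can be computed along the following lines. This is a sketch rather than the actual `payload_stats` implementation: the function name and the `cbor_encode` parameter are hypothetical, and the CBOR encoder is passed in as a callable so the example does not depend on `cbor2` being installed.

```python
import gzip
import json


def payload_stats_sketch(doc, cbor_encode=None):
    """Compute serialization sizes for doc.

    cbor_encode is an optional callable (for example cbor2.dumps) that
    returns bytes; when omitted, only the JSON metrics are reported.
    """
    json_raw = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    stats = {
        "json_bytes": len(json_raw),
        "gzip_json_bytes": len(gzip.compress(json_raw)),
    }
    if cbor_encode is not None:
        cbor_raw = cbor_encode(doc)
        stats["cbor_bytes"] = len(cbor_raw)
        stats["gzip_cbor_bytes"] = len(gzip.compress(cbor_raw))
        # Both ratios are relative to the uncompressed compact JSON size.
        stats["cbor_ratio"] = stats["cbor_bytes"] / stats["json_bytes"]
        stats["gzip_cbor_ratio"] = stats["gzip_cbor_bytes"] / stats["json_bytes"]
    return stats


doc = {
    "@context": "https://schema.org/",
    "@type": "SensorReading",
    "temperature": {"@value": 36.7, "@confidence": 0.95, "@unit": "celsius"},
}
print(payload_stats_sketch(doc))
```

Because gzip adds a fixed header and trailer, the gzip figures for very small documents can exceed the uncompressed ones, so the ratios are most informative on realistic payloads.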
The MQTT payload serialization functions wrap CBOR-LD (or JSON) encoding with payload size enforcement suitable for MQTT transmission.
Input: A JSON-LD document (dict), a compression flag (boolean, default true), a maximum payload size (integer, default 256,000 bytes), and an optional context registry.
Procedure:
cbor2 library is not available, raise an error indicating the dependency requirement.
max_payload, raise a size error.
The default maximum payload of 256,000 bytes (256 KB) is a conservative application-level default, not a protocol limit: the MQTT Remaining Length field allows packets of up to 268,435,455 bytes (about 256 MB) in both v3.1.1 and v5.0, and brokers typically enforce smaller, configurable limits. Processors MAY adjust this limit based on broker configuration; MQTT v5.0 brokers can advertise their maximum packet size via the CONNACK Maximum Packet Size property.
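The dependency check and the size check might look like the sketch below. The function name is hypothetical, the context compression step is omitted for brevity, and two details are assumptions of this example: the size error is modelled as a `ValueError`, and a payload exactly equal to the limit is accepted.

```python
import json

DEFAULT_MAX_PAYLOAD = 256_000  # bytes, the default stated above


def to_payload_sketch(doc, compress=True, max_payload=DEFAULT_MAX_PAYLOAD):
    """Serialize doc to bytes and enforce a maximum payload size."""
    if compress:
        try:
            import cbor2  # optional third-party dependency
        except ImportError as exc:
            raise ImportError(
                "CBOR encoding requires cbor2; "
                "install it with: pip install jsonld-ex[iot]"
            ) from exc
        payload = cbor2.dumps(doc)
    else:
        # Compact JSON: no whitespace between tokens.
        payload = json.dumps(doc, separators=(",", ":")).encode("utf-8")
    if len(payload) > max_payload:
        raise ValueError(
            f"payload is {len(payload)} bytes, exceeding the limit of {max_payload}"
        )
    return payload


print(to_payload_sketch({"@type": "SensorReading"}, compress=False))
```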
Input: Raw bytes received from MQTT, an optional @context to reattach, a compression flag (boolean, default true), and an optional context registry.
Procedure:
@context value is provided and the decoded document does not already contain @context, attach the provided context.
The context reattachment in step 3 enables a common optimization: when sender and receiver agree on a shared context, the sender strips @context before transmission to save bytes, and the receiver reattaches it on receipt.
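The strip-and-reattach optimization can be illustrated with a round trip over the JSON path. The helper names below are hypothetical and CBOR decoding is left out; only the reattachment rule itself (attach the provided context when the decoded document has none) comes from the procedure above.

```python
import json

SHARED_CONTEXT = "https://schema.org/"  # agreed out of band by both sides


def strip_context(doc):
    """Sender side: drop @context before transmission to save bytes."""
    return {k: v for k, v in doc.items() if k != "@context"}


def from_payload_sketch(payload, context=None):
    """Receiver side: decode JSON bytes and reattach the shared context
    only when the decoded document does not already carry one."""
    doc = json.loads(payload.decode("utf-8"))
    if context is not None and "@context" not in doc:
        doc["@context"] = context
    return doc


original = {"@context": SHARED_CONTEXT, "@type": "SensorReading", "value": 36.7}
wire = json.dumps(strip_context(original), separators=(",", ":")).encode("utf-8")
restored = from_payload_sketch(wire, context=SHARED_CONTEXT)
print(restored == original)
```

A document that arrives with its own @context keeps it, so a receiver can pass the shared context unconditionally without overwriting sender-supplied values.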
MQTT topics determine message routing. The topic derivation function generates structured, semantically meaningful topics from JSON-LD metadata.
{prefix}/{@type}/{@id_fragment}
| Segment | Source | Default |
|---|---|---|
| prefix | Caller-provided | "ld" |
| @type | Document @type property, local name extracted | "unknown" |
| @id_fragment | Document @id property, last path/fragment segment | "unknown" |
Examples:
| Document | Derived Topic |
|---|---|
| {"@type": "SensorReading", "@id": "urn:sensor:imu-001"} | ld/SensorReading/imu-001 |
| {"@type": "https://schema.org/Person", "@id": "https://example.org/people/alice"} | ld/Person/alice |
| {"@type": ["SensorReading", "Observation"], "@id": "urn:obs:42"} | ld/SensorReading/42 |
| {} | ld/unknown/unknown |
When @type or @id contains a full IRI, the local name is extracted:
#, return the substring after the last #.
/, return the substring after the last /.
: (e.g., a URN), return the substring after the last :.
When @type is an array, the first element is used.
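The extraction rules can be sketched as a single function. The name `local_name` is hypothetical, and two points are assumptions of this example: the separators are tried in the order listed above (`#`, then `/`, then `:`), and an empty result falls back to "unknown".

```python
def local_name(value):
    """Extract the local name from an IRI, URN, or plain term."""
    if isinstance(value, list):
        # When @type is an array, the first element is used.
        value = value[0] if value else None
    if not isinstance(value, str) or not value:
        return "unknown"
    for separator in ("#", "/", ":"):
        if separator in value:
            # Keep only what follows the last occurrence of the separator.
            value = value.rsplit(separator, 1)[1]
            break
    return value or "unknown"


for example in (
    "SensorReading",
    "https://schema.org/Person",
    "https://example.org/people/alice",
    "urn:sensor:imu-001",
    ["SensorReading", "Observation"],
    None,
):
    print(example, "->", local_name(example))
```

Checking `/` before `:` matters for HTTP IRIs, which contain both characters: `https://example.org/people/alice` yields `alice` rather than the text after `https:`.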
Per MQTT specification (v3.1.1 §4.7, v5.0 §4.7), PUBLISH topic names have the following constraints:
| Character | Constraint | Handling |
|---|---|---|
| # | Wildcard, forbidden in PUBLISH topics | Replaced with _ |
| + | Wildcard, forbidden in PUBLISH topics | Replaced with _ |
| \x00 (null) | Forbidden | Replaced with _ |
| $ (leading) | Reserved for broker system topics (e.g., $SYS/) | Stripped from start of segment |
After sanitization, if a segment is empty, it is replaced with "unknown".
The MQTT specification limits topic names to 65,535 bytes when UTF-8 encoded. If the generated topic exceeds this limit, the derivation function MUST raise an error. In practice, topics derived from JSON-LD metadata are far shorter than this limit.
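Sanitization and the length check can be combined as in the following sketch. The function names are hypothetical, the segments are assumed to have already been reduced to local names, and the error type (`ValueError`) is an assumption of this example; the source only requires that an error be raised.

```python
MAX_TOPIC_BYTES = 65_535  # limit on the UTF-8 encoded topic name


def sanitize_segment(segment):
    """Make one topic segment safe for use in a PUBLISH topic name."""
    # A leading "$" is reserved for broker system topics.
    segment = segment.lstrip("$")
    # Wildcards and the null character are forbidden in PUBLISH topics.
    for forbidden in ("#", "+", "\x00"):
        segment = segment.replace(forbidden, "_")
    # An empty segment after sanitization becomes "unknown".
    return segment or "unknown"


def build_topic(type_name, id_fragment, prefix="ld"):
    """Assemble {prefix}/{@type}/{@id_fragment} and enforce the size limit."""
    topic = "/".join(
        [prefix, sanitize_segment(type_name), sanitize_segment(id_fragment)]
    )
    if len(topic.encode("utf-8")) > MAX_TOPIC_BYTES:
        raise ValueError("derived topic exceeds 65,535 bytes when UTF-8 encoded")
    return topic


print(build_topic("SensorReading", "imu-001"))
print(build_topic("$SYS", "a+b#c"))
```

The limit is measured on the encoded bytes rather than on the character count, since non-ASCII characters occupy more than one byte in UTF-8.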
The QoS derivation function maps jsonld-ex confidence metadata to MQTT Quality of Service levels, enabling automatic resource allocation based on data reliability.
| Condition | QoS Level | MQTT Semantics |
|---|---|---|
| @humanVerified = true | 2 | Exactly once |
| @confidence ≥ 0.9 | 2 | Exactly once |
| 0.5 ≤ @confidence < 0.9 | 1 | At least once |
| @confidence < 0.5 | 0 | At most once |
| No confidence metadata | 1 | At least once (default) |
@humanVerified at the document level. If true, return QoS 2.
@confidence at the document level. If present, apply the mapping table.
@-prefixed keys):
a. For each property whose value is a dict, check for @humanVerified. If true, return QoS 2.
b. For each property whose value is a dict, check for @confidence. If found, apply the mapping table using the first match and stop scanning.
The mapping reflects a practical heuristic for IoT data pipelines:
The derive_mqtt_qos_detailed function returns additional diagnostic information alongside the QoS level:
| Field | Type | Description |
|---|---|---|
| qos | Integer (0, 1, 2) | The derived QoS level |
| reasoning | String | Human-readable explanation of the derivation |
| confidence_used | Float or null | The confidence value that drove the decision |
| human_verified | Boolean | Whether @humanVerified was the deciding factor |
Example response:
{
"qos": 2,
"reasoning": "@confidence=0.95 ≥ 0.9 (document-level) → QoS 2 (exactly once)",
"confidence_used": 0.95,
"human_verified": false
}
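The derivation order and the detailed result can be sketched together. This is an illustration under stated assumptions, not the `derive_mqtt_qos_detailed` implementation: the function names are hypothetical, the reasoning strings are invented for the example, and the property scan is done in two passes (all @humanVerified checks before any @confidence check), which is one possible reading of the procedure.

```python
def qos_from_confidence(confidence):
    """Apply the mapping table: >= 0.9 -> 2, >= 0.5 -> 1, otherwise 0."""
    if confidence >= 0.9:
        return 2
    if confidence >= 0.5:
        return 1
    return 0


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def derive_qos_sketch(doc):
    def result(qos, reasoning, confidence=None, human=False):
        return {"qos": qos, "reasoning": reasoning,
                "confidence_used": confidence, "human_verified": human}

    # Document-level checks come first.
    if doc.get("@humanVerified") is True:
        return result(2, "@humanVerified (document-level)", human=True)
    confidence = doc.get("@confidence")
    if _is_number(confidence):
        return result(qos_from_confidence(confidence),
                      f"@confidence={confidence} (document-level)", confidence)

    # Property-level scan, skipping @-prefixed keys.
    nested = [v for k, v in doc.items()
              if not k.startswith("@") and isinstance(v, dict)]
    for value in nested:
        if value.get("@humanVerified") is True:
            return result(2, "@humanVerified (property-level)", human=True)
    for value in nested:
        confidence = value.get("@confidence")
        if _is_number(confidence):
            # The first match wins and the scan stops.
            return result(qos_from_confidence(confidence),
                          f"@confidence={confidence} (property-level)", confidence)

    return result(1, "no confidence metadata, default")


print(derive_qos_sketch({"temperature": {"@value": 36.7, "@confidence": 0.95}}))
```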
MQTT 5.0 (OASIS Standard, §3.3.2.3) introduced structured PUBLISH packet properties. The jsonld-ex transport module derives these properties from JSON-LD document metadata.
| MQTT 5.0 Property | MQTT Spec Section | Source | Value |
|---|---|---|---|
| Payload Format Indicator | §3.3.2.3.2 | Compression flag | 0 (unspecified bytes) for CBOR, 1 (UTF-8) for JSON |
| Content Type | §3.3.2.3.9 | Compression flag | "application/cbor" for CBOR, "application/ld+json" for JSON |
| Message Expiry Interval | §3.3.2.3.3 | @validUntil | Seconds remaining until expiry (see §6.2) |
| User Properties | §3.3.2.3.7 | Document metadata | Key-value pairs (see §6.3) |
The Message Expiry Interval is derived from the @validUntil temporal annotation:
Procedure:
@validUntil at the document level.
@-prefixed keys) for the first dict containing @validUntil.
@validUntil is found, omit the Message Expiry Interval (the message does not expire).
@validUntil value as an ISO 8601 datetime.
ceil(validUntil - now).
This mapping means that MQTT brokers will automatically discard messages whose temporal assertions have expired, without requiring application-level expiry logic.
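The expiry computation can be sketched with the standard library. The function names are hypothetical, and three behaviours are assumptions of this example: an already-expired timestamp yields 0, a timestamp without a UTC offset is treated as UTC, and the result is clamped to the uint32 range that the Message Expiry Interval field can carry.

```python
import math
from datetime import datetime, timezone

UINT32_MAX = 2**32 - 1


def find_valid_until(doc):
    """Document-level @validUntil first, then the first dict-valued
    property (skipping @-prefixed keys) that contains one."""
    if "@validUntil" in doc:
        return doc["@validUntil"]
    for key, value in doc.items():
        if not key.startswith("@") and isinstance(value, dict) and "@validUntil" in value:
            return value["@validUntil"]
    return None


def parse_iso8601(text):
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    return parsed


def expiry_interval(doc, now=None):
    """Return the expiry in whole seconds, or None when the message does not expire."""
    raw = find_valid_until(doc)
    if raw is None:
        return None
    if now is None:
        now = datetime.now(timezone.utc)
    seconds = math.ceil((parse_iso8601(raw) - now).total_seconds())
    return max(0, min(seconds, UINT32_MAX))


now = datetime(2026, 2, 12, 17, 0, 0, tzinfo=timezone.utc)
doc = {"temperature": {"@value": 36.7, "@validUntil": "2026-02-12T18:00:00Z"}}
print(expiry_interval(doc, now=now))
```

Passing `now` explicitly keeps the function deterministic, which makes the rounding behaviour easy to test.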
User Properties are key-value string pairs attached to MQTT 5.0 PUBLISH packets. The following JSON-LD metadata fields are extracted when present:
| User Property Key | Source | Description |
|---|---|---|
| jsonld_type | @type | The document type (first element if array) |
| jsonld_confidence | @confidence | Confidence score as string |
| jsonld_source | @source | Source/model IRI |
| jsonld_id | @id | Document identifier |
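User Properties enable MQTT 5.0 subscribers to filter or route messages based on JSON-LD metadata without deserializing the payload. For example, a subscriber could discard every message whose jsonld_type is not "CriticalAlert" by inspecting the PUBLISH properties alone. Standard MQTT 5.0 subscriptions match on topic only, so applying such a filter at the broker level depends on broker-specific extensions.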
Given the document:
{
"@type": "SensorReading",
"@id": "urn:sensor:imu-001",
"@confidence": 0.95,
"@source": "https://model.example.org/imu-classifier-v2",
"temperature": {
"@value": 36.7,
"@validUntil": "2026-02-12T18:00:00Z"
}
}
The derived MQTT 5.0 properties (assuming CBOR compression and a current time of 2026-02-12T17:00:00Z, one hour before @validUntil):
{
"payload_format_indicator": 0,
"content_type": "application/cbor",
"message_expiry_interval": 3600,
"user_properties": [
["jsonld_type", "SensorReading"],
["jsonld_confidence", "0.95"],
["jsonld_source", "https://model.example.org/imu-classifier-v2"],
["jsonld_id", "urn:sensor:imu-001"]
]
}
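The example above can be reproduced with a short sketch. The function names are hypothetical and this is not the `derive_mqtt5_properties` implementation; the expiry interval is passed in as a precomputed value to keep the block short, and emitting an empty `user_properties` list when no metadata is present is an assumption of this example.

```python
def user_properties(doc):
    """Collect key-value string pairs from the metadata fields that are present."""
    doc_type = doc.get("@type")
    if isinstance(doc_type, list):
        # First element if @type is an array.
        doc_type = doc_type[0] if doc_type else None
    candidates = (
        ("jsonld_type", doc_type),
        ("jsonld_confidence", doc.get("@confidence")),
        ("jsonld_source", doc.get("@source")),
        ("jsonld_id", doc.get("@id")),
    )
    # User Property values are strings, so numbers are converted.
    return [[key, str(value)] for key, value in candidates if value is not None]


def mqtt5_properties(doc, compress=True, expiry_seconds=None):
    """Assemble PUBLISH properties from the compression flag and document metadata."""
    props = {
        # 0 = unspecified bytes (CBOR), 1 = UTF-8 encoded data (JSON).
        "payload_format_indicator": 0 if compress else 1,
        "content_type": "application/cbor" if compress else "application/ld+json",
    }
    if expiry_seconds is not None:
        # Omitted entirely when the message does not expire.
        props["message_expiry_interval"] = expiry_seconds
    props["user_properties"] = user_properties(doc)
    return props


doc = {
    "@type": "SensorReading",
    "@id": "urn:sensor:imu-001",
    "@confidence": 0.95,
    "@source": "https://model.example.org/imu-classifier-v2",
    "temperature": {"@value": 36.7, "@validUntil": "2026-02-12T18:00:00Z"},
}
print(mqtt5_properties(doc, compress=True, expiry_seconds=3600))
```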
The transport extensions are entirely processor-side functionality. They do not introduce new keywords into JSON-LD documents. A standard JSON-LD 1.1 processor can:
The transport extensions support both MQTT v3.1.1 and MQTT v5.0:
| Feature | MQTT v3.1.1 | MQTT v5.0 |
|---|---|---|
| Payload serialization | ✓ (CBOR or JSON) | ✓ (CBOR or JSON) |
| Topic derivation | ✓ | ✓ |
| QoS derivation | ✓ | ✓ |
| PUBLISH properties | Not available | ✓ (§6) |
| Message expiry | Not available | ✓ (§6.2) |
When operating with an MQTT v3.1.1 broker, the MQTT 5.0 properties are simply not used. All other features remain functional.
The CBOR-LD serialization requires the cbor2 Python package. When cbor2 is not installed:
compress=True) raise an ImportError with installation instructions.
compress=False) work without any additional dependencies.
cbor2.
The cbor2 dependency is available via the iot optional extra: pip install jsonld-ex[iot].
The MQTT transport extensions operate within the constraints of the MQTT specification:
| Constraint | MQTT Spec Reference | jsonld-ex Compliance |
|---|---|---|
| Topic max length: 65,535 bytes | v3.1.1 §4.7, v5.0 §4.7 | Enforced in topic derivation (§4.4) |
| No wildcards (#, +) in PUBLISH topics | v3.1.1 §4.7, v5.0 §4.7 | Sanitized (§4.3) |
| No null character in topics | v3.1.1 §4.7, v5.0 §4.7 | Sanitized (§4.3) |
| $-prefixed topics reserved for broker | v3.1.1 §4.7, v5.0 §4.7 | Leading $ stripped (§4.3) |
| QoS levels 0, 1, 2 | v3.1.1 §4.3, v5.0 §4.3 | Mapped from confidence (§5) |
| Payload Format Indicator values 0, 1 | v5.0 §3.3.2.3.2 | Set from compression mode (§6.1) |
| Message Expiry Interval: uint32 seconds | v5.0 §3.3.2.3.3 | Clamped to uint32 max (§6.2) |
The CBOR-LD serialization uses standard CBOR encoding as defined in RFC 8949. The context compression scheme (replacing URL strings with integer IDs) is applied before CBOR encoding — the CBOR layer receives a standard JSON-compatible data structure and encodes it without modification. This means any compliant CBOR decoder can read the bytes, though context restoration requires knowledge of the registry.
The W3C CBOR-LD draft specification defines a more comprehensive compression scheme that assigns integer codes to JSON-LD keywords and vocabulary terms. The jsonld-ex CBOR-LD module implements a simplified subset focused on context URL compression. The registry-based approach is compatible with the CBOR-LD draft’s context compression mechanism, but does not implement full term-level compression.
When using JSON serialization (not CBOR), the Content Type is application/ld+json as defined in the JSON-LD 1.1 specification §8 (IANA Considerations). When using CBOR serialization, the Content Type is application/cbor as registered in the IANA media type registry.
The reference implementation spans two modules in the jsonld-ex Python package.
jsonld_ex.cbor_ld
| Function | Spec Section |
|---|---|
| to_cbor(doc, context_registry) | §2.1 |
| from_cbor(data, context_registry) | §2.2 |
| payload_stats(doc, context_registry) | §2.4 |
| DEFAULT_CONTEXT_REGISTRY | §2.3 |
jsonld_ex.mqtt
| Function | Spec Section |
|---|---|
| to_mqtt_payload(doc, compress, max_payload, context_registry) | §3.1 |
| from_mqtt_payload(payload, context, compressed, context_registry) | §3.2 |
| derive_mqtt_topic(doc, prefix) | §4 |
| derive_mqtt_qos(doc) | §5 |
| derive_mqtt_qos_detailed(doc) | §5.4 |
| derive_mqtt5_properties(doc, compress) | §6 |