The Underlay Protocol
A content-addressed protocol for versioned structured data.
Overview
Underlay is a protocol for publishing, versioning, and collaborating on structured data. Every piece of content (records, schemas, and files) is identified by its SHA-256 hash. Versions are manifests that reference these hashes. This means storage is deduplicated globally, transfers only move data the other side doesn't have, and provenance is built in: any record can be traced back to every collection and version that includes it.
Data model
The protocol has four primitives:
- Record: A JSON object with an id, a type, and a data payload. Records are the rows of your dataset. Each record is content-addressed by the SHA-256 hash of its canonical JSON representation.
- Schema: A JSON Schema document that describes the structure of a record type. Schemas are also content-addressed. They define validation rules, mark private fields, and annotate cross-record references.
- Version: An immutable snapshot: a manifest of record hashes, schema hashes, file hashes, and a metadata bag. Versions are identified by semver (e.g. v1.2.0).
- File: A binary blob (PDF, image, etc.) stored by SHA-256 hash. Records reference files with the {"$file": "sha256:..."} convention.
Record identity
A record's identity is the SHA-256 hash of its canonical JSON. The canonical form is:
canonical = JSON.stringify({ id: "pub-001", type: "Publication", data: { ... } })
hash = SHA256(canonical) // hex-encoded
The private flag is not part of the hash. Two records with identical id, type, and data but different privacy flags produce the same hash. This is intentional: the record's content identity doesn't change when you change who can see it.
A record whose type declares private fields has a second address: its public record hash, the SHA-256 of the same canonical form with the private fields stripped. Public manifests list records by their public hash, and the record endpoints resolve either address, so a public reader can always verify that hashing the document they received reproduces the address they requested. When a type has no private fields the two addresses coincide.
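The two addresses described above can be sketched in Python. This is an illustration under stated assumptions, not the reference implementation: canonical_json approximates JSON.stringify with compact separators and insertion-ordered keys (the two serializers can disagree on some inputs, for example a float such as 1.0), and private_fields is assumed to be a set of top-level keys of data taken from the type's schema. All function names are hypothetical.

```python
import hashlib
import json


def canonical_json(obj):
    # Approximation of JSON.stringify: no whitespace, keys in insertion order,
    # non-ASCII characters left unescaped.
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


def record_hash(record):
    # Only id, type, and data are hashed; a privacy flag on the record is ignored.
    canonical = canonical_json(
        {"id": record["id"], "type": record["type"], "data": record["data"]}
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def public_record_hash(record, private_fields):
    # Assumption: private fields are top-level keys of data.
    public_data = {k: v for k, v in record["data"].items() if k not in private_fields}
    return record_hash({"id": record["id"], "type": record["type"], "data": public_data})


record = {
    "id": "pub-001",
    "type": "Publication",
    "data": {"title": "Example", "internalNotes": "draft"},
}
full = record_hash(record)
public = public_record_hash(record, {"internalNotes"})
```

With an empty private_fields set the public hash equals the full hash, which matches the statement that the two addresses coincide for types without private fields.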
Wire format is JSONL: one record per line, independently hashable and streamable:
{"id":"pub-001","type":"Publication","data":{"title":"The Structure of Scientific Revolutions","doi":"10.1234/example"}}
Version identity
A version's hash is the SHA-256 of a canonical JSON object containing sorted hashes:
canonical = JSON.stringify({
schemas: { "Publication": "abc123...", "Author": "def456..." }, // sorted by slug
records: ["0a1b2c...", "3d4e5f...", ...], // sorted, hex SHA-256
files: ["7a8b9c...", ...], // sorted, hex SHA-256
metadata: { "license": "CC-BY-4.0", "readme": "# My Collection\n..." } // canonicalized JSON
})
hash = "private:" + SHA256(canonical)
Two versions with the same content produce the same hash, regardless of when or where they were created. The server rejects pushes that would create a duplicate hash.
A separate public: hash covers only non-private types and fields, with private fields stripped before re-hashing. This lets external verifiers confirm the public content without access to private data.
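A minimal Python sketch of the version hash, to show why it is independent of insertion order. The version_hash function and its prefix parameter are hypothetical, json.dumps stands in for JSON.stringify, and sorting the top-level metadata keys is only an assumed reading of "canonicalized JSON"; nested metadata objects are not reordered here.

```python
import hashlib
import json


def canonical_json(obj):
    # Compact JSON with keys in insertion order, approximating JSON.stringify.
    return json.dumps(obj, separators=(",", ":"), ensure_ascii=False)


def version_hash(schemas, records, files, metadata, prefix="private:"):
    canonical = canonical_json({
        "schemas": {slug: schemas[slug] for slug in sorted(schemas)},  # sorted by slug
        "records": sorted(records),
        "files": sorted(files),
        # Assumption: sorting top-level keys stands in for canonicalization.
        "metadata": {key: metadata[key] for key in sorted(metadata)},
    })
    return prefix + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = version_hash(
    schemas={"Publication": "abc123", "Author": "def456"},
    records=["3d4e5f", "0a1b2c"],
    files=["7a8b9c"],
    metadata={"license": "CC-BY-4.0"},
)
b = version_hash(
    schemas={"Author": "def456", "Publication": "abc123"},
    records=["0a1b2c", "3d4e5f"],
    files=["7a8b9c"],
    metadata={"license": "CC-BY-4.0"},
)
print(a == b)  # True: same content, different input order
```

The same function could presumably produce the public: hash by passing prefix="public:" together with the stripped, non-private inputs, but how the server assembles those inputs is not specified here.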
Semver semantics
Versions are identified by semver strings (e.g. v1.2.0). The server auto-derives the next version based on what changed:
- Major bump: a schema changed (e.g. v1.2.0 -> v2.0.0)
- Minor bump: records or files changed (e.g. v1.2.0 -> v1.3.0)
- Patch bump: metadata-only change such as readme or license (e.g. v1.2.0 -> v1.2.1)
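The bump rules can be sketched as a small Python function. The name next_version and its boolean parameters are hypothetical, and the precedence (schema change wins over record or file changes, which win over metadata changes) is an assumption consistent with the list above rather than documented server behavior. The first push, which has no previous version, is not covered.

```python
import re


def next_version(current, schema_changed, records_or_files_changed, metadata_changed):
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", current)
    if match is None:
        raise ValueError(f"not a semver string: {current!r}")
    major, minor, patch = (int(part) for part in match.groups())
    if schema_changed:
        return f"v{major + 1}.0.0"  # major bump resets minor and patch
    if records_or_files_changed:
        return f"v{major}.{minor + 1}.0"  # minor bump resets patch
    if metadata_changed:
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError("nothing changed; no new version to derive")


print(next_version("v1.2.0", True, True, False))    # v2.0.0
print(next_version("v1.2.0", False, True, True))    # v1.3.0
print(next_version("v1.2.0", False, False, True))   # v1.2.1
```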
Push
All pushes use the negotiate protocol, a three-request flow (negotiate, upload, commit) similar to git's pack negotiation. The client sends a manifest of record hashes and the server replies with the hashes it needs; the client then sends those records in one or more batches and commits.
# 1. Client sends manifest of record hashes
POST /api/collections/:owner/:slug/versions/negotiate
{
"base_version": "v1.1.0",
"schemas": { "Publication": { ... } },
"manifest": [
{ "id": "pub-001", "type": "Publication", "hash": "abc123..." },
{ "id": "pub-002", "type": "Publication", "hash": "def456..." }
],
"files": ["7a8b9c..."],
"message": "Add new publication"
}
# 2. Server responds with what it needs
{
"session_id": "...",
"needed_records": ["def456..."],
"needed_files": [],
"total_records": 2,
"already_have_records": 1
}
# 3. Client sends only the missing records as JSONL (repeatable for large batches)
POST /api/collections/:owner/:slug/versions/negotiate/:sessionId/records
Content-Type: application/x-ndjson
{"id":"pub-002","type":"Publication","data":{"title":"...","doi":"..."}}
# -> { "received": 1, "remaining": 0, "total_needed": 1 }
# 4. Client commits — server validates schemas, creates version
POST /api/collections/:owner/:slug/versions/negotiate/:sessionId/commit
# -> { "semver": "v1.2.0", "hash": "...", "recordCount": 2, "fileCount": 1 }
The negotiate step checks every record and file hash against the server's global store. If 100,000 records already exist and only 5 are new, only those 5 are transferred.
For large pushes, the /records endpoint can be called multiple times (up to 10,000 records per batch). The server tracks which records have been received. Once all needed records are submitted, commit to finalize the version. Sessions expire after 10 minutes.
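The hash negotiation and batching described above can be simulated without a network. In this Python sketch, negotiate plays the server's role against an in-memory set of known hashes and jsonl_batches builds the client's JSONL request bodies; both names are hypothetical, and session handling, file negotiation, and expiry are left out.

```python
import json


def negotiate(manifest, server_hashes):
    # Server side: report the manifest hashes that are not yet in the store.
    needed = [entry["hash"] for entry in manifest if entry["hash"] not in server_hashes]
    return {
        "needed_records": needed,
        "total_records": len(manifest),
        "already_have_records": len(manifest) - len(needed),
    }


def jsonl_batches(records_by_hash, needed_hashes, batch_size=10_000):
    # Client side: yield one JSONL body (one record per line) per batch.
    for start in range(0, len(needed_hashes), batch_size):
        chunk = needed_hashes[start:start + batch_size]
        yield "\n".join(
            json.dumps(records_by_hash[h], separators=(",", ":")) for h in chunk
        )


manifest = [
    {"id": "pub-001", "type": "Publication", "hash": "abc123"},
    {"id": "pub-002", "type": "Publication", "hash": "def456"},
]
response = negotiate(manifest, server_hashes={"abc123"})
records_by_hash = {
    "def456": {"id": "pub-002", "type": "Publication", "data": {"title": "T"}},
}
bodies = list(jsonl_batches(records_by_hash, response["needed_records"]))
```

Each element of bodies would be sent as one POST to the /records endpoint before the commit request.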
Pull
Clients can fetch a full manifest or a delta between two versions. Combined with the batch records endpoint, this enables efficient pull synchronization.
# Full manifest
GET /api/collections/:owner/:slug/versions/v2.0.0/manifest
# Delta since a previous version
GET /api/collections/:owner/:slug/versions/v2.0.0/manifest?since=v1.1.0
{
"version": "v2.0.0",
"since": "v1.1.0",
"delta": {
"added": [{ "id": "pub-004", "type": "Publication", "hash": "..." }],
"updated": [{ "id": "pub-001", "type": "Publication", "hash": "...", "previousHash": "..." }],
"removed": [{ "id": "pub-003", "type": "Publication", "hash": "..." }]
}
}
# Fetch only the records you need
POST /api/records/batch
{ "hashes": ["abc123...", "def456..."] }
# Returns JSONL stream
Schema semantics
Schemas are JSON Schema documents with a few protocol-level extensions:
{
"type": "object",
"properties": {
"title": { "type": "string" },
"doi": { "type": "string" },
"authors": {
"type": "array",
"items": { "type": "string", "x-ref-type": "Author" }
},
"pdf": { "type": "object" },
"internalNotes": { "type": "string", "private": true }
}
}
- "private": true on a property: the field is stripped from public views and excluded from the public hash.
- "private": true on the schema root: the entire type is hidden from public views.
- "x-ref-type": "Author": marks a field as a reference to another record type (advisory, not enforced).
Schemas are content-addressed by their SHA-256 hash. Two collections that use an identical Author schema share the same underlying schema object, with zero duplication. Schema changes trigger a major semver bump.
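How a reader might apply the private and x-ref-type extensions is sketched below in Python. The helper names public_view and reference_fields are hypothetical, only top-level properties are inspected, and x-ref-type is looked up on the property itself or on its array items; the protocol text does not say how nested properties are treated.

```python
def public_view(schema, data):
    # A private schema root hides the whole type from public views.
    if schema.get("private") is True:
        return None
    properties = schema.get("properties", {})
    return {
        key: value
        for key, value in data.items()
        if properties.get(key, {}).get("private") is not True
    }


def reference_fields(schema):
    # Map each property name to the record type named by x-ref-type.
    refs = {}
    for name, prop in schema.get("properties", {}).items():
        items = prop.get("items")
        target = prop.get("x-ref-type")
        if target is None and isinstance(items, dict):
            target = items.get("x-ref-type")
        if target is not None:
            refs[name] = target
    return refs


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "authors": {"type": "array", "items": {"type": "string", "x-ref-type": "Author"}},
        "internalNotes": {"type": "string", "private": True},
    },
}
data = {"title": "Example", "authors": ["author-001"], "internalNotes": "draft"}
visible = public_view(schema, data)
refs = reference_fields(schema)
```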
Unknown field handling
When records contain fields not defined in the schema, the server rejects the push with a 422 response listing the extra fields per record. This protects against accidentally storing data outside the schema contract.
To accept stripping, set "strip_unknown_fields": true in the negotiate request. The server strips the extra fields before hashing and storing, so the stored records match the schema exactly. Hashes are recomputed after stripping.
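The reject-or-strip behavior can be illustrated with a short Python sketch. The function names are hypothetical, a ValueError stands in for the 422 response, and only top-level keys of data are compared against the schema's properties, which is an assumption about how "fields not defined in the schema" is evaluated.

```python
def unknown_fields(schema, data):
    # Fields present in the record's data but absent from the schema.
    known = schema.get("properties", {})
    return sorted(key for key in data if key not in known)


def apply_unknown_field_policy(schema, record, strip_unknown_fields=False):
    extra = unknown_fields(schema, record["data"])
    if not extra:
        return record
    if not strip_unknown_fields:
        # Stand-in for the 422 response listing the extra fields.
        raise ValueError(f"record {record['id']} has fields not in schema: {extra}")
    kept = {k: v for k, v in record["data"].items() if k not in extra}
    # The stripped record must be re-hashed before it is stored.
    return {**record, "data": kept}


schema = {"type": "object", "properties": {"title": {"type": "string"}}}
record = {"id": "pub-001", "type": "Publication", "data": {"title": "T", "mood": "happy"}}
stripped = apply_unknown_field_policy(schema, record, strip_unknown_fields=True)
```

Because stripping changes the canonical JSON, the hash a client computed before the push will not match the stored record's hash when fields were removed.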
Files
Files are binary blobs stored by SHA-256 hash. Upload a file, then reference it from a record:
# Upload (content-addressed by SHA-256)
PUT /api/collections/:owner/:slug/files/sha256:a1b2c3...
Content-Type: application/pdf
<binary data>
# Reference in a record
{ "pdf": { "$file": "sha256:a1b2c3..." } }
Files are verified on upload (the server recomputes the hash and rejects mismatches). Like records and schemas, files are globally deduplicated. The same PDF in ten collections is stored once.
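A client can compute a file's address before uploading, and the upload check amounts to recomputing it. The Python sketch below assumes the address is the prefix sha256: followed by the hex digest, as the examples above suggest; the function names are hypothetical.

```python
import hashlib


def file_address(content: bytes) -> str:
    # Content address in the sha256:<hex> form used in upload URLs and $file references.
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify_upload(claimed_address: str, content: bytes) -> bool:
    # Mirrors the upload check: recompute the hash and compare with the claimed address.
    return file_address(content) == claimed_address


def file_reference(content: bytes) -> dict:
    # Value to place in a record field that points at the file.
    return {"$file": file_address(content)}


pdf_bytes = b"%PDF-1.7 example bytes"
address = file_address(pdf_bytes)
record_data = {"pdf": file_reference(pdf_bytes)}
```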
Provenance
Because records are content-addressed, every record hash can be traced back to every version and collection that includes it. The provenance endpoint returns this lineage:
GET /api/records/:hash/provenance
{
"hash": "abc123...",
"recordId": "pub-001",
"type": "Publication",
"firstSeen": "2026-01-15T...",
"references": [
{ "owner": "alice", "collection": "papers", "version": "v1.2.0" },
{ "owner": "bob", "collection": "reading-list", "version": "v1.0.0" }
]
}
firstSeen is the earliest version creation date across all references, the record's birthday on this server. This enables citation-like provenance: "this record first appeared in alice/papers v1.2.0 on 2026-01-15."
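One way to picture how such a response could be assembled is the Python sketch below. The version_index structure, its createdAt field, and the provenance function are all hypothetical; the sketch also assumes every timestamp is an ISO 8601 string in the same format and time zone, so that string comparison orders them correctly.

```python
def provenance(record_hash, record, version_index):
    # version_index: hypothetical list of versions, each with owner, collection,
    # version, createdAt, and the set of record hashes in its manifest.
    hits = [v for v in version_index if record_hash in v["records"]]
    if not hits:
        return None
    return {
        "hash": record_hash,
        "recordId": record["id"],
        "type": record["type"],
        # Earliest creation date across all versions that include the record.
        "firstSeen": min(v["createdAt"] for v in hits),
        "references": [
            {"owner": v["owner"], "collection": v["collection"], "version": v["version"]}
            for v in hits
        ],
    }


version_index = [
    {"owner": "bob", "collection": "reading-list", "version": "v1.0.0",
     "createdAt": "2026-02-01T09:00:00Z", "records": {"abc123"}},
    {"owner": "alice", "collection": "papers", "version": "v1.2.0",
     "createdAt": "2026-01-15T12:00:00Z", "records": {"abc123", "def456"}},
]
result = provenance("abc123", {"id": "pub-001", "type": "Publication"}, version_index)
```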
Collaboration
Underlay supports collaboration through a small set of primitives:
- Versioning. Every push creates a new immutable version. The full history is always available. Versions are identified by semver strings and use optimistic locking: base_version (a semver string, or null for the first push) must match the current latest, or the push is rejected with a 409 conflict.
- Diffing. Any two versions of a collection can be diffed (GET .../versions/v2.0.0/diff?from=v1.1.0), returning added, updated, and removed records with hash-level comparison.
- Cross-collection references. Records reference each other by ID. Because record hashes are global, the same record appearing in two collections can be identified as identical content.
- Mirroring. Any Underlay instance can pull from another, using hash negotiation to transfer only new data. Mirrors maintain verified, independent copies.
- Forking. POST .../fork creates a new collection under your org with the source's latest version. Because records, schemas, and files are content-addressed, forking copies only the manifest; zero additional storage. The fork tracks its origin via forkedFrom.
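Two of these primitives, diffing and optimistic locking, reduce to small comparisons over manifests. The Python sketch below is illustrative only: diff_manifests and check_base_version are hypothetical names, and matching records by the pair (type, id) is an assumption, since the text does not say whether ids are unique across types.

```python
def diff_manifests(old, new):
    # Index both manifests by (type, id), then compare hashes.
    old_by_key = {(e["type"], e["id"]): e for e in old}
    new_by_key = {(e["type"], e["id"]): e for e in new}
    added = [e for key, e in new_by_key.items() if key not in old_by_key]
    removed = [e for key, e in old_by_key.items() if key not in new_by_key]
    updated = [
        {**e, "previousHash": old_by_key[key]["hash"]}
        for key, e in new_by_key.items()
        if key in old_by_key and old_by_key[key]["hash"] != e["hash"]
    ]
    return {"added": added, "updated": updated, "removed": removed}


def check_base_version(base_version, current_latest):
    # Optimistic locking: a stale base_version yields a 409; None means "accept".
    return None if base_version == current_latest else 409


v1 = [
    {"id": "pub-001", "type": "Publication", "hash": "aaa"},
    {"id": "pub-003", "type": "Publication", "hash": "ccc"},
]
v2 = [
    {"id": "pub-001", "type": "Publication", "hash": "aab"},
    {"id": "pub-004", "type": "Publication", "hash": "ddd"},
]
delta = diff_manifests(v1, v2)
```

For a first push, base_version and the current latest are both null (None here), so the check passes.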
Errors
All error responses return JSON with an error field and an HTTP status code:
- 400: Bad request (missing fields, invalid JSONL, hash mismatch)
- 404: Collection, version, or record not found
- 409: Version conflict (base_version doesn't match) or duplicate content
- 422: Schema validation failed, missing schemas/files, or records contain fields not defined in the schema (set strip_unknown_fields to accept stripping)
- 429: Rate limited (includes Retry-After header)
The protocol is stewarded by Knowledge Futures.