Add SnapshotHash::Canonical machinery (no behavior change)#7419
Conversation
Foundation for the config-in-database work: introduce a content-stable
structural hash (`SnapshotHash::Canonical`) that hashes `StoredConfig`
via canonical JSON encoding, alongside the existing canonical-TOML-bytes
hash (now tagged `LegacyToml`). Pure addition — no DB columns, no wire
format changes for legacy hashes, nothing on the inference path is
affected. Subsequent PRs introduce the storage columns, wire-up, and
default switch.
## Why
The legacy hash is `blake3` over canonical-TOML bytes. That's fragile in
ways that matter when config rows are persisted across:
- TOML crate version bumps (e.g. float reformatting `0.7` →
`0.6999999988…`)
- Serializer / canonicalization rule changes
- Round-trips through the planned `config_jsonb` JSONB column (where
the input is a `serde_json::Value`, not TOML bytes)
Any of these can drift the hash and silently invalidate every
`inferences.snapshot_hash` reference — which would break every
historical-config view in the UI and every audit query. The structural
hash decouples identity from textual encoding: same logical config →
same hash, regardless of which serializer round-trips it last.
## What this PR adds
- `SnapshotHashScheme { LegacyToml, Canonical }` enum.
- `SnapshotHash` carries a scheme tag; `Display` / `FromStr` / serde
use a self-describing wire form: `can:DECIMAL` for canonical, bare
decimal for legacy. Bare-decimal is the *unchanged* legacy wire form,
so existing `inferences.snapshot_hash` rows, autopilot tags, log
lines, and URLs continue to parse without modification.
- `from_biguint_canonical` / `from_canonical_bytes` constructors.
- `serialize_hash_bare_decimal` / `serialize_optional_hash_bare_decimal`
helpers for DB-row structs whose backing storage cannot accept the
prefix (CH `UInt256`, PG `NUMERIC(78,0)`). Defined here, applied by a
later PR.
- `crates/tensorzero-core/src/config/snapshot/canonical_hash.rs` —
the structural-hash implementation. Walks `serde_json::Value` with a
self-describing tagged encoding (1-byte type tag + length-prefixed
strings/arrays/objects + sorted object keys + IEEE 754 BE bit pattern
for numbers).
- `StoredConfig::canonical_hash() -> Result<SnapshotHash>` — the entry
point. Public but currently unused at runtime; tests are the only
callers. The first runtime call site lands in the follow-up PR that
adds the `canonical_hash` column to `config_snapshots`.
- 11 inline unit tests covering determinism, JSON / TOML / TOML→JSON
roundtrip stability, key-order independence, type discrimination
(string vs number), array-concat collision resistance, and divergence
from the legacy hash.
- Fixture-based round-trip tests (`fixtures_tests.rs` + 6 TOML
fixtures) that lock the canonical hash + serialization round-trip
against committed configs (kitchen sink, multi-variant, models with
multiple providers, tools + metrics, etc.). Any future change to
canonicalization rules will fail these fixtures rather than silently
drifting the persisted hash.
## What it does NOT do
- Does not change `Config.hash` (still legacy TOML-bytes).
- Does not change `ConfigSnapshot::hash` (still legacy).
- Does not add or modify any DB column.
- Does not change how any inference / feedback / datapoint row is
written.
- Does not apply the bare-decimal serializers to any DB-row struct.
- Does not change any wire API for inference.
In production, after merging this PR, no code path produces a
`Canonical`-scheme hash. The `Display` impl still emits bare decimal
for every legacy hash that exists today — so nothing downstream
observes a behavior change.
## How to test
```
cargo test -p tensorzero-types snapshot::
cargo test -p tensorzero-core config::snapshot::
```
51 tests pass with no DB needed.
## Tiny supporting fix
Two call sites in `db/clickhouse/workflow_evaluation_queries.rs`
relied on the now-removed `Deref<Target=str>` impl on `SnapshotHash`
(`&**snapshot_hash`). They're updated to use the explicit
`to_decimal_string()` accessor — same byte output, no scheme prefix
because the surrounding `toUInt256OrNull(...)` literal couldn't accept
one anyway.
## Roadmap context
Part 1 of the config-in-database series:
- **PR 1 (this one)**: canonical hash machinery
- PR 2: PG `config_snapshots` JSONB + `canonical_hash` columns
- PR 3: CH parity for the `canonical_hash` column
- PR 4: `Config.hash` → canonical by default; inference / feedback /
datapoint rows reference canonical
- PR 5a: Per-object metadata table + writers
- PR 5b: REST CRUD endpoints over per-object tables
- PR 6: Config-in-database boot mode + zero-config bootstrap + new CI
lanes
- PR 7: Functions UI overhaul + ModelPicker + TOML-editor cleanup
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…zers
Review feedback on the canonical-hash machinery:
- Fix doc comments that claimed `v1:` / `v2:` Display prefixes — the
actual wire form is bare decimal (legacy) and `can:` (canonical).
Updated `SnapshotHashScheme` variant docs and the `SnapshotHash`
module-level doc to match what `prefix()` returns.
- Replace `(x.len() as u64).to_be_bytes()` length prefixes in the
canonical encoding with an explicit `len_u64` helper that uses
`u64::try_from(...).expect(...)`. The canonical hash bytes are the
persisted identity of every config snapshot, so a silently-truncating
cast on a future >64-bit platform would be unrecoverable. Annotated
with `#[expect(clippy::expect_used)]` and a reason explaining why
the panic is the safe choice.
- Hand-roll `PartialEq`, `Eq`, and `Hash` on `SnapshotHash` so
identity is `(scheme, bytes)` only. `decimal_str` is a derived cache
of `bytes`, so including it was both ~78 bytes of redundant hashing
per call and an implicit invariant that the cache and bytes agree
(which the constructors enforce by convention but nothing else
prevents). Added a regression test that constructs the same hash
via two different paths and asserts both `eq` and `hash` agree.
- Group the two `#[serde(serialize_with = ...)]` opt-out helpers into
`pub mod serializers` (`bare_decimal` / `optional_bare_decimal`).
Removes two free functions from the crate root and gives the helpers
a discoverable home with collective module docs. Updated the inline
test's `serialize_with = "..."` path to match.
- Replace `Deserialize`'s `String::deserialize(...)` with a
`Visitor<'_>` implementing `visit_str` (with `visit_string` as
fallback). Borrowed deserializers (serde_json, sqlx) now skip the
per-row `String` allocation; owning deserializers still work
unchanged via `visit_string`. This is the per-inference-row read
path, so the saving compounds at scale.
- Drop the unused `SnapshotHashScheme::name()`. Doc-commented use case
("log/error messages") has no callers in this PR — `pub` API on a
published crate that has no users ages poorly. Will reintroduce in a
later PR if a real call-site needs it.
- Bump `from_biguint_with_scheme` and `from_bytes_with_scheme` from
module-private (`fn`) to `pub(crate)`. The intra-module test calls
them today; raising visibility makes the testing intent explicit
and lets future integration tests under `tests/` use them without
another visibility bump.
- Tiny: `hex::encode(&*self.bytes)` → `hex::encode(self.as_bytes())`,
`entries.sort_by(|(a,_),(b,_)| a.cmp(b))` → `sort_unstable_by_key`.
8 SnapshotHash type tests + 51 snapshot module tests pass; clippy clean
under workspace lints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds the new scheme-aware SnapshotHash foundation for config-in-database work by introducing a canonical structural hash alongside the existing legacy TOML-byte hash, while keeping current runtime behavior on the legacy path.
Changes:
- Added
SnapshotHashSchemeand updatedSnapshotHashparsing/formatting/serde to distinguish legacy vs canonical hashes. - Introduced canonical structural hashing for
StoredConfigviaserde_json::Valuetraversal plus new unit tests and fixture-based round-trip tests. - Updated two ClickHouse workflow-evaluation query call sites to use bare decimal accessors after removing
Deref<Target = str>fromSnapshotHash.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
crates/tensorzero-types/src/snapshot.rs |
Adds scheme-aware snapshot hashes, new constructors/accessors, serde/display parsing changes, and type tests. |
crates/tensorzero-core/src/db/clickhouse/workflow_evaluation_queries.rs |
Replaces implicit string deref usage with to_decimal_string() in ClickHouse parameter binding. |
crates/tensorzero-core/src/config/snapshot/mod.rs |
Wires in the new canonical hash module and fixture test module. |
crates/tensorzero-core/src/config/snapshot/fixtures/tools_and_metrics.toml |
Adds fixture coverage for tools and metrics config shapes. |
crates/tensorzero-core/src/config/snapshot/fixtures/multi_variant_types.toml |
Adds fixture coverage for multiple function variant types. |
crates/tensorzero-core/src/config/snapshot/fixtures/models_multi_provider.toml |
Adds fixture coverage for multiple model and embedding providers. |
crates/tensorzero-core/src/config/snapshot/fixtures/kitchen_sink.toml |
Adds broad mixed-section fixture coverage. |
crates/tensorzero-core/src/config/snapshot/fixtures/empty.toml |
Adds empty-config baseline fixture. |
crates/tensorzero-core/src/config/snapshot/fixtures/chat_function_unversioned.toml |
Adds baseline chat-function fixture for stable round-tripping. |
crates/tensorzero-core/src/config/snapshot/fixtures_tests.rs |
Adds fixture-driven round-trip assertions over representative stored configs. |
crates/tensorzero-core/src/config/snapshot/canonical_hash.rs |
Implements canonical structural hashing for StoredConfig and adds focused unit tests. |
…tale comments Three Copilot review findings on PR #7419, all real: 1. **`u64` values above 2^53 silently aliased through `as_f64()`**. `Number::as_f64()` is lossy for any integer beyond f64's 53-bit mantissa, so two distinct `u64` values that share an `f64` representation hashed to the same canonical bytes. `StoredConfig` already has `u64` fields (timeouts, capacities), so this would have collided in production once a value crossed 9 quadrillion. Fix: encode integers exactly. The number sub-tag layout is now: - `NUMBER_KIND_INT` + 8 bytes BE for `as_i64()` results - `NUMBER_KIND_UINT` + 8 bytes BE for `as_u64()` results that didn't fit `i64` (the upper half of the unsigned range) - `NUMBER_KIND_FLOAT` + 8 bytes BE for fractional `f64` values - `NUMBER_KIND_STRING` + length + bytes for serde_json's `arbitrary_precision` fallback (we don't enable it; the encoding stays well-defined if anyone does) Each kind carries its own sub-tag, so an `i64`-shaped 1 and an `f64`-shaped 1.0 hash differently — that's correct, they're different types in `StoredConfig`. Two new regression tests: - `large_u64_values_do_not_collide_via_f64_truncation` (uses `2^53` and `2^53 + 1`, which `as_f64()` collapses) - `integer_and_float_with_same_magnitude_hash_differently` 2. **Fixture suite never asserted the hash bytes**, so a canonical encoding change that preserved JSON shape would silently drift every persisted hash. Added `assert_canonical_hash_matches` and committed expected hex values for all 6 fixtures (empty, chat_function_unversioned, multi_variant_types, models_multi_provider, tools_and_metrics, kitchen_sink). Any future change to canonicalization rules in `canonical_hash.rs` now fails these assertions explicitly, forcing the author to either revert the encoding change or update the expected hexes deliberately (and review the inference-row impact). 3. **Stale `vN:DECIMAL` comments in `workflow_evaluation_queries.rs`**. The actual `Display` impl emits `can:DECIMAL` for canonical hashes; the `vN:` form was never shipped. Updated both call-sites' comments to match, mirroring the snapshot.rs doc-comment fix already in the polish commit. 53 snapshot tests pass (was 51 before the regression tests). Clippy clean under workspace lints; fmt clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
can you add some documentation over how this is expected to handle schema changes? |
…inators The new PG e2e tests inherited TOML fixtures with `version = N` lines (both function- and variant-level). `version` doesn't exist on `UninitializedFunctionConfig` / `UninitializedVariantConfig` until the per-object metadata PR (#5a) adds it, so these fixtures fail to parse on this branch with `unknown field 'version'`. Fixes: - `config_jsonb_column_is_populated_on_write_postgres` — use `description = "v4"` on the function as the JSONB-roundtrip discriminator instead of `version = 4`. Drop the variant-level `version` assertion (variants don't have a `description` field, and this test only needs to verify "JSONB column matches in-memory config"; the path-level assertion is enough). - `snapshots_with_path_value_finds_matching_function_version` — renamed to `..._function_description`. Replaces the `version` discriminator with `description` and updates the JSONPath query accordingly. Same shape, same coverage, same edge cases (multiple matches, single match, no matches, decoy on a different function name). - `snapshots_containing_finds_variant_version_via_explicit_fragment` — switches variants from `version` to `temperature` (an `f32` field that already exists on `UninitializedVariantConfig`). The JSONB containment query semantics are identical for floats vs ints. - Decorative `version = N` lines in 4 other tests just go (no assertion referenced them). Also folds in the merge of PR #7419's tip (which itself merged main's Mistral-test disable from PR #7418), so this branch picks up the upstream-test fix that unblocks the CH live-tests rollup. `cargo clippy --all-targets --all-features -- -D warnings` clean; `cargo fmt --all -- --check` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per @virajmehta's review request: spell out exactly what drifts the canonical hash, what doesn't, what drift means in production, and the rules for changing `StoredConfig` safely. Two main sections added to `canonical_hash.rs`'s module docs: 1. **Drift table.** Categorizes every kind of change to `StoredConfig` (add Option, add non-Option, rename, retype, etc.) as drifting or not drifting the hash, with one-line rationales. Calls out the non-obvious cases (key reorder doesn't drift because we sort; float format doesn't drift because we use IEEE bits; `Option<T>` with `skip_serializing_if` and a `None` default doesn't drift). 2. **What drift means in production.** - Old `config_snapshots` rows are unaffected — their persisted `canonical_hash` column was written by the old serializer and is preserved verbatim. The read path (`from_stored_with_hash`) pairs row + stored hash; it does NOT recompute. - Old rows look up by their old hash, not by recomputation. - New writes get the new hash; old + new are different rows for the "same" logical config — fine, they reflect what was seen. - JSONB containment queries operate over the stored shape, so historical search across a cutover requires care. Plus a "Rules for changing `StoredConfig`" subsection that points at `stored/AGENTS.md`, recommends purely-additive changes, and reminds PRs to run the fixture suite (which now locks the hash bytes — any unintended drift fails CI). No code change. Documentation-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Good call — added schema-change docs in 61652f3. Covers the drift table (additive Option fields don't drift; renames / retypes / non-Option additions / mapping changes do), what drift actually means in production (old rows unaffected because the read path pairs row + stored hash via |
My understanding was that the config-snapshot writing and the |
The opening claimed that hash drift "invalidates every
inferences.snapshot_hash reference" — that's wrong. As Aaron pointed
out: drift just writes a new snapshot row, equivalent to the user
editing the config file. Old rows keep their old hashes, old
inference references keep resolving via legacy-hash lookup, nothing
breaks.
Rewrite the opening to lead with what drift actually costs:
1. Content-addressed dedupe — two gateway versions running the same
logical config should write one row, not one per toml-crate
version.
2. Multi-gateway consistency — two gateways with different transitive
dependencies running the same config against the same DB should
agree on the hash.
3. JSONB roundtrip identity — once `config_jsonb` is in the schema,
re-deriving the hash from the stored JSON must match the stored
hash, otherwise verification flows can't tell whether they got
the same content back.
The structural hash addresses all three by operating on logical
content rather than serialized text.
The "what drift means in production" section below already had this
right (points 1-3 explicitly cover "old rows unaffected, new writes
get new hash, both live as separate rows for the same logical
config"); the opening just had a misleading lead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@Aaron1011 you're right — drift just writes a new row, equivalent to the user editing the config; doesn't invalidate anything. Updated the doc framing in The real motivation is upstream of drift. Once we settled on storing the config as a structured JSON document ( So drift-handling is a side benefit, not the motivation. The motivation is "we're storing structure, so we should hash structure." |
CI reports the empty-fixture canonical hash drifts on Linux x86_64 but matches on macOS (the hex was committed from a Mac dev machine). The expected drift sources (Cargo.lock, StoredConfig shape, canonical encoding rules) are all identical between local and CI. Push an eprintln of the actual `serde_json::to_value(&empty StoredConfig)` so the CI failure shows what Linux is producing — then I can reverse the divergence vs the Mac dump. Will revert once the cause is found. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ical-hash-snapshot
…dition PR #7413 (`Propagate status code in FatalStreamError`) landed on main and added `global_outbound_http_intra_stream_read_timeout_ms: Option<u64>` to `StoredGatewayConfig`. With `Option::None` serializing as `null` (no `skip_serializing_if`), the empty/default StoredConfig JSON gains a new field — and every fixture's canonical hash shifts. This is the fixture suite **working as designed**: a real schema change on main drifted the hash, and CI's hash-stability assertions caught it explicitly rather than letting it land silently. The dual-dispatch read path means old `inferences.snapshot_hash` references continue to resolve via the legacy hash column unchanged; only NEW canonical-hash references will use the post-drift values. Refreshed all 6 expected hashes to match the current `StoredConfig` shape (locally + matches CI Linux Docker output bit-for-bit). Also reverts the TEMP `EMPTY_JSON=…` eprintln debug in `fixture_empty`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
OK LGTM |
Summary
Foundation for the config-in-database work: introduce a content-stable structural hash (
SnapshotHash::Canonical) that hashesStoredConfigvia canonical JSON encoding, alongside the existing canonical-TOML-bytes hash (now taggedLegacyToml).Pure addition. No DB columns, no wire format changes for legacy hashes, nothing on the inference path is affected. Subsequent PRs introduce the storage columns, wire-up, and default switch.
Why — context for the config-in-database work
The legacy
SnapshotHashisblake3over canonical-TOML bytes. That works fine when configs only ever live as TOML files on disk, but it's fragile in the world we're moving into where config rows are persisted across:0.7→0.6999999988…)config_jsonbJSONB column (where the input is aserde_json::Value, not TOML bytes)Any of these can drift the hash and silently invalidate every
inferences.snapshot_hashreference — which would break every historical-config view in the UI and every audit query that joins inferences back to the config they ran against.The structural hash decouples identity from textual encoding: the same logical config produces the same hash regardless of which serializer round-trips it last. That property is what makes the JSONB column safe to add (PR 2) and what lets us flip
Config.hashto canonical without breaking historical references (PR 4).What this PR adds
Type-level changes (
tensorzero-types/src/snapshot.rs)SnapshotHashScheme { LegacyToml, Canonical }enum.SnapshotHashcarries a scheme tag;Display/FromStr/ serde use a self-describing wire form:can:DECIMALfor canonical, bare decimal for legacy. The legacy bare-decimal form is unchanged from main, so existinginferences.snapshot_hashrows, autopilot tags, log lines, and URLs continue to parse without modification.from_biguint_canonical/from_canonical_bytesconstructors.serialize_hash_bare_decimal/serialize_optional_hash_bare_decimalhelpers for DB-row structs whose backing storage cannot accept the prefix (CHUInt256, PGNUMERIC(78,0)). Defined here, applied in PR 4.The hash itself (
config/snapshot/canonical_hash.rs)Walks
serde_json::Valuewith a self-describing tagged encoding:"42"cannot collide with number42.["ab"]cannot collide with["a","b"].serde_json::Numbertext round-trips that may pick different decimal representations.StoredConfig::canonical_hash() -> Result<SnapshotHash>is the entry point. Public but unused at runtime; tests are the only callers. The first runtime caller lands in the follow-up PR that adds thecanonical_hashcolumn toconfig_snapshots.Tests
canonical_hash.rs: determinism, JSON/TOML/TOML→JSON roundtrip stability, key-order independence, type discrimination (string vs number), array-concat collision resistance, and divergence from the legacy hash.fixtures_tests.rs+ 6 TOML fixtures): lock the canonical hash + serialization round-trip against committed configs (kitchen sink, multi-variant, models with multiple providers, tools + metrics, etc.). Any future change to canonicalization rules will fail these fixtures rather than silently drifting the persisted hash.tensorzero-types/src/snapshot.rs: scheme distinguishability, FromStr round-trip, bare-decimal helper output, Display/serde scheme preservation.51 tests pass total. No DB needed.
What this PR does NOT do
Config.hash(still legacy TOML-bytes).ConfigSnapshot::hash(still legacy).In production, after merging this PR, no code path produces a
Canonical-scheme hash. TheDisplayimpl still emits bare decimal for every legacy hash that exists today — so nothing downstream observes a behavior change.Tiny supporting fix
Two call sites in
db/clickhouse/workflow_evaluation_queries.rsrelied on the now-removedDeref<Target=str>impl onSnapshotHash(&**snapshot_hash). They're updated to use the explicitto_decimal_string()accessor — same byte output, no scheme prefix because the surroundingtoUInt256OrNull(...)literal couldn't accept one anyway.Roadmap context
Part 1 of the config-in-database series. The remaining PRs build directly on this foundation:
config_snapshotsJSONB +canonical_hashcolumnscanonical_hashcolumnConfig.hash→ canonical by default; inference / feedback / datapoint rows reference canonicalEach PR ships an independently testable deliverable. This one is fully covered by unit tests; later PRs add e2e coverage as they introduce DB and HTTP surfaces.
Test plan
cargo test -p tensorzero-types snapshot::— 7 tests on the typecargo test -p tensorzero-core config::snapshot::— 51 tests including the new fixture suitecargo clippy --all-targets --all-features -- -D warnings— cleancargo fmt --all -- --check— cleanCanonical-scheme hash; legacy hash wire form bit-identical to main🤖 Generated with Claude Code