CH ConfigSnapshot.canonical_hash column + dispatch + backfill#7424
Open
AntoineToussaint wants to merge 7 commits into
Open
CH ConfigSnapshot.canonical_hash column + dispatch + backfill#7424AntoineToussaint wants to merge 7 commits into
ConfigSnapshot.canonical_hash column + dispatch + backfill#7424AntoineToussaint wants to merge 7 commits into
Conversation
…me backfill Step 3 of the canonical-hash stack (depends on PR #7422). Brings ClickHouse to parity with Postgres: `ConfigSnapshot` gains a `canonical_hash` column, `get_config_snapshot` dispatches on `SnapshotHashScheme`, every new write populates both columns, and a boot-time backfill rewrites legacy rows with the structural hash. ## Why PR #7419 introduced `SnapshotHash::Canonical`. PR #7422 wired Postgres to persist both legacy and canonical hashes and dispatch lookups by scheme. CH was untouched until now: its single `hash` column held the canonical-TOML-bytes value, and any caller that tried to look up a canonical hash there would silently fail. This PR mirrors PG's approach: add a separate `canonical_hash UInt256` column, populate both columns on every new write, dispatch reads by the hash's scheme tag. Same identity contract on both backends going forward. ## What's in this PR ### Migration `migration_0054` Pure additive `ALTER TABLE ConfigSnapshot ADD COLUMN IF NOT EXISTS canonical_hash UInt256 DEFAULT 0`. The default is `0` because ClickHouse `UInt256` has no NULL — it doubles as the "needs backfill" sentinel for the boot-time backfill. The Blake3 collision space makes a real config canonicalizing to literal `0` astronomically unlikely. Wired into `make_all_migrations` and `NUM_MIGRATIONS` (47 → 48). Rollback test parameters updated to include index 47. ### Read-path scheme dispatch (`config_queries.rs`) ```rust match snapshot_hash.scheme() { SnapshotHashScheme::LegacyToml => /* WHERE hash = $1 */, SnapshotHashScheme::Canonical => /* WHERE canonical_hash = $1 */, } ``` Same shape as the PG dispatcher in PR #7422. Old `inferences.snapshot_hash` references resolve via the `hash` column; new ones resolve via `canonical_hash`. Every URL / log line / autopilot tag continues to parse as before. ### Write path populates both columns `write_config_snapshot` now computes the canonical hash from the in-memory `StoredConfig` alongside the legacy hash already on `snapshot.hash`, and inserts both as `UInt256` literals via `toUInt256(...)`. The `external_data` schema gains the new column declaration. ### Boot-time backfill (`backfill_config_snapshot_canonical_hash`) Mirrors PR #7422's PG backfill. Runs after CH migrations on every boot, idempotent (filters by `canonical_hash = 0`), cooperative (no locks). Per-row failures log at `error!` level and skip — startup continues. A skipped row stays readable via the legacy `hash` lookup; only canonical-hash lookups against it return not-found, which is the same behavior as if the backfill had not run at all. Implementation notes: - ClickHouse's `ReplacingMergeTree` engine deduplicates by the `(hash)` ORDER BY key on background merges. We re-INSERT each legacy row with `canonical_hash` populated; reads use `FINAL` so the new version is observed immediately. - `created_at` is preserved from the original row; `last_used` is refreshed to `now64()`, matching the convention from `write_config_snapshot`. ## What this PR does NOT do - Does not change `Config.hash` (still the legacy hash). PR 4 flips that. - Does not write canonical hashes to inference / feedback / datapoint rows. PR 4. - Does not change the multi-backend `make_db_test!` suite. Those tests will flip to canonical lookups in PR 4 once both backends support them. ## Test plan - [ ] `cargo test --features e2e_tests -p tensorzero-core --test e2e clickhouse::test_rollback_up_to_migration_index_47` (with CH up) - [ ] `cargo test --features e2e_tests -p tensorzero-core --test e2e config::snapshot` (with CH+PG up) - [ ] Manually verify: boot a clean CH, write a snapshot, confirm both `hash` and `canonical_hash` are populated; restart; confirm backfill is no-op - [ ] Manually verify: pre-migration row with `canonical_hash = 0` gets backfilled on next boot ## Roadmap context | PR | Scope | Status | |----|-------|--------| | #7419 | `SnapshotHash::Canonical` machinery | Open | | #7422 | PG `config_jsonb` + `canonical_hash` columns | Open | | **This PR** | **CH `canonical_hash` column + dispatch + backfill** | **Open (this)** | | 4 | Flip `Config.hash` default to canonical | Next | | 5a/b | Per-object metadata + REST CRUD | After PR 4 | | 6 | Config-in-database boot mode + CI lanes | After PR 5 | | 7 | UI overhaul + TOML editor cleanup | After PR 6 | Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kfill test Two test changes: 1. **New CH backfill test**: `backfill_populates_canonical_hash_for_legacy_clickhouse_rows`. Mirrors PR #7422's PG `backfill_populates_config_jsonb_for_legacy_rows`. - Builds a snapshot, then INSERTs it directly with `canonical_hash = 0` to simulate a row written before migration_0054 ran. (Bypasses `write_config_snapshot`, which now populates both columns.) - Asserts pre-backfill state: row exists with `canonical_hash = 0`. - Runs `backfill_config_snapshot_canonical_hash`. - Asserts post-backfill: `canonical_hash` matches `StoredConfig::canonical_hash()` and a canonical-hash lookup via `get_config_snapshot` resolves the row. - Re-runs the backfill; asserts the column is unchanged (idempotency). 2. **Fix `backfill_populates_config_jsonb_for_legacy_rows` assertion**: that test (inherited from PR #7422) referenced `{"version": 2}` in a JSONB containment query, but I'd already stripped the `version = 2` line from the fixture in PR #7422 to work around `UninitializedFunctionConfig` not having a `version` field yet. Replace the version-based containment fragment with one that matches a field actually present in the fixture (`{"type": "chat"}`). This unblocks the test on the CI run that includes PR #7422's tip. `cargo clippy --all-targets --all-features -- -D warnings` clean; `cargo fmt --all -- --check` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m:tensorzero/tensorzero into pr/canonical-hash-snapshot-step3-ch-parity
Mirrors the PG backfill test additions in #7422. Adds 4 new CH tests plus a shared `insert_legacy_clickhouse_snapshot_row` helper: 1. `backfill_skips_unparseable_legacy_rows_clickhouse` — inserts a well-formed legacy row alongside one with garbage TOML. After backfill: good row's canonical_hash matches `StoredConfig::canonical_hash`; bad row stays at the `canonical_hash = 0` sentinel. Locks the fail-soft contract. 2. `backfill_leaves_already_populated_rows_unchanged_clickhouse` — writes a snapshot via `write_config_snapshot` (both columns populated), captures `canonical_hash`, runs backfill, asserts unchanged. Locks the `WHERE canonical_hash = 0` filter. 3. `backfill_no_op_on_empty_state_clickhouse` — backfill on a `ConfigSnapshot` table with no sentinel-marked rows is a no-op. 4. `backfill_preserves_created_at_clickhouse` — CH-specific. The backfill re-INSERTs each row with the new column populated; ReplacingMergeTree dedupes by ORDER BY (hash) and reads use FINAL. The re-INSERT preserves `created_at` from the original row and refreshes `last_used`. This test waits 1.5s between the legacy insert and the backfill so a wrong implementation that stamps `now64()` on `created_at` would surface a detectably different timestamp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m:tensorzero/tensorzero into pr/canonical-hash-snapshot-step3-ch-parity
…m:tensorzero/tensorzero into pr/canonical-hash-snapshot-step3-ch-parity
…`to_canonical_form`
Previously, PR 3 only added a `canonical_hash` column to CH
`ConfigSnapshot`, leaving the legacy `config` (TOML text) column as
the sole content storage. PG's PR 2 added BOTH `canonical_hash` AND
`config_jsonb`. Asymmetric — and missing the actual content-side
benefit of the canonical-form work: a stored representation that
doesn't require reparsing TOML on every read.
Bringing CH to true parity:
1. **Migration 0054 also adds `config_json String DEFAULT ''`.**
String type because CH's native JSON type isn't widely available
yet pre-25.x; storing canonical JSON as compact text is broadly
compatible. `''` is the new "needs backfill" sentinel — empty
string can't collide with real canonical content (the canonical
form of any non-empty `StoredConfig` is at least `"{}"`).
2. **`write_config_snapshot` uses `snapshot.to_canonical_form()`.**
Same helper PG uses; both backends now compute the canonical-form
pair (`json` + `hash`) in one walk, and neither knows whether
the underlying mechanism is `serde_json::to_value` or anything
else — that's encapsulated in `canonical_hash.rs`. The CH row
structure now carries `config`, `config_json`, `hash`,
`canonical_hash` together.
3. **Backfill populates BOTH columns.** Scan filter is now
`WHERE canonical_hash = 0 OR config_json = ''` so a row missing
either column gets reprocessed. Per-row work uses
`to_canonical_form` to get JSON and hash from one walk, then
re-INSERTs via external data so single-quotes / backslashes
inside the JSON can't break the SQL.
4. **Tests extended to cover `config_json`:**
- Happy-path (`backfill_populates_canonical_hash_for_legacy_clickhouse_rows`):
now asserts both columns post-backfill, both columns idempotent
under second run.
- Fail-soft (`backfill_skips_unparseable_legacy_rows_clickhouse`):
asserts both columns stay at sentinel on the bad row, both are
populated on the good row.
This brings the backfill story to: PG and CH both populate two
columns (hash + content), both have idempotent + fail-soft + locked
behavior, both pass through the same `to_canonical_form` helper,
and both are tested end-to-end.
Picks up PR #7422's tip via merge to get the `to_canonical_form`
helper that lives there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Step 3 of the canonical-hash stack. Brings ClickHouse to parity with Postgres:
ConfigSnapshotgains acanonical_hash UInt256column,get_config_snapshotdispatches onSnapshotHashScheme, every new write populates both columns, and a boot-time backfill rewrites legacy rows with the structural hash.Why — context
PR #7419 introduced
SnapshotHash::Canonical. PR #7422 wired Postgres to persist both legacy and canonical hashes and dispatch lookups by scheme. CH was untouched until now: its singlehashcolumn held the canonical-TOML-bytes value, and any caller that tried to look up a canonical hash against CH would silently fail.This PR mirrors PG's approach: add a separate
canonical_hashcolumn, populate both columns on every write, dispatch reads by the hash's scheme tag. Same identity contract on both backends going forward.What's in this PR
Migration
migration_0054Pure additive
ALTER TABLE ConfigSnapshot ADD COLUMN IF NOT EXISTS canonical_hash UInt256 DEFAULT 0. The default is0because ClickHouseUInt256has no NULL — it doubles as the "needs backfill" sentinel for the boot-time backfill. The Blake3 collision space makes a real config canonicalizing to literal0astronomically unlikely.Wired into
make_all_migrationsandNUM_MIGRATIONS(47 → 48). Rollback test parameters updated to include index 47.Read-path scheme dispatch
Same shape as the PG dispatcher in #7422. Old
inferences.snapshot_hashreferences resolve via thehashcolumn; new ones resolve viacanonical_hash. Every URL / log line / autopilot tag continues to parse as before.Write path populates both columns
write_config_snapshotnow computes the canonical hash from the in-memoryStoredConfigalongside the legacy hash already onsnapshot.hash, and inserts both asUInt256literals viatoUInt256(...). Theexternal_dataschema gains the new column declaration.Boot-time backfill (
backfill_config_snapshot_canonical_hash)Mirrors #7422's PG backfill. Runs after CH migrations on every boot, idempotent (filters by
canonical_hash = 0), cooperative (no locks). Per-row failures log aterror!level and skip — startup continues. A skipped row stays readable via the legacyhashlookup; only canonical-hash lookups against it return not-found, which is the same behavior as if the backfill had not run at all.Implementation notes:
ReplacingMergeTreeengine deduplicates by the(hash)ORDER BY key on background merges. We re-INSERT each legacy row withcanonical_hashpopulated; reads useFINALso the new version is observed immediately.created_atis preserved from the original row;last_usedis refreshed tonow64(), matching the convention fromwrite_config_snapshot.What this PR does NOT do
Config.hash(still the legacy hash). PR 4 flips that.make_db_test!suite. Those tests will flip to canonical lookups in PR 4 once both backends support them.Test plan
cargo test --features e2e_tests -p tensorzero-core --test e2e clickhouse::test_rollback_up_to_migration_index_47(with CH up)cargo test --features e2e_tests -p tensorzero-core --test e2e config::snapshot(with CH+PG up)hashandcanonical_hashare populated; restart; confirm backfill is no-opcanonical_hash = 0gets backfilled on next bootRoadmap context
SnapshotHash::Canonicalmachineryconfig_jsonb+canonical_hashcolumnscanonical_hashcolumn + dispatch + backfillConfig.hashdefault to canonical🤖 Generated with Claude Code