Skip to content

CH ConfigSnapshot.canonical_hash column + dispatch + backfill#7424

Open
AntoineToussaint wants to merge 7 commits into
pr/canonical-hash-snapshot-step2-pg-jsonbfrom
pr/canonical-hash-snapshot-step3-ch-parity
Open

CH ConfigSnapshot.canonical_hash column + dispatch + backfill#7424
AntoineToussaint wants to merge 7 commits into
pr/canonical-hash-snapshot-step2-pg-jsonbfrom
pr/canonical-hash-snapshot-step3-ch-parity

Conversation

@AntoineToussaint
Copy link
Copy Markdown
Member

🥞 Stacked on #7422 (PG canonical_hash columns), which is stacked on #7419 (canonical-hash machinery). Review #7419#7422 → this PR in order.

Summary

Step 3 of the canonical-hash stack. Brings ClickHouse to parity with Postgres: ConfigSnapshot gains a canonical_hash UInt256 column, get_config_snapshot dispatches on SnapshotHashScheme, every new write populates both columns, and a boot-time backfill rewrites legacy rows with the structural hash.

Why — context

PR #7419 introduced SnapshotHash::Canonical. PR #7422 wired Postgres to persist both legacy and canonical hashes and dispatch lookups by scheme. CH was untouched until now: its single hash column held the canonical-TOML-bytes value, and any caller that tried to look up a canonical hash against CH would silently fail.

This PR mirrors PG's approach: add a separate canonical_hash column, populate both columns on every write, dispatch reads by the hash's scheme tag. Same identity contract on both backends going forward.

What's in this PR

Migration migration_0054

Pure additive ALTER TABLE ConfigSnapshot ADD COLUMN IF NOT EXISTS canonical_hash UInt256 DEFAULT 0. The default is 0 because ClickHouse UInt256 has no NULL — it doubles as the "needs backfill" sentinel for the boot-time backfill. The Blake3 collision space makes a real config canonicalizing to literal 0 astronomically unlikely.

Wired into make_all_migrations and NUM_MIGRATIONS (47 → 48). Rollback test parameters updated to include index 47.

Read-path scheme dispatch

match snapshot_hash.scheme() {
    SnapshotHashScheme::LegacyToml => /* WHERE hash = $1 */,
    SnapshotHashScheme::Canonical  => /* WHERE canonical_hash = $1 */,
}

Same shape as the PG dispatcher in #7422. Old inferences.snapshot_hash references resolve via the hash column; new ones resolve via canonical_hash. Every URL / log line / autopilot tag continues to parse as before.

Write path populates both columns

write_config_snapshot now computes the canonical hash from the in-memory StoredConfig alongside the legacy hash already on snapshot.hash, and inserts both as UInt256 literals via toUInt256(...). The external_data schema gains the new column declaration.

Boot-time backfill (backfill_config_snapshot_canonical_hash)

Mirrors #7422's PG backfill. Runs after CH migrations on every boot, idempotent (filters by canonical_hash = 0), cooperative (no locks). Per-row failures log at error! level and skip — startup continues. A skipped row stays readable via the legacy hash lookup; only canonical-hash lookups against it return not-found, which is the same behavior as if the backfill had not run at all.

Implementation notes:

  • ClickHouse's ReplacingMergeTree engine deduplicates by the (hash) ORDER BY key on background merges. We re-INSERT each legacy row with canonical_hash populated; reads use FINAL so the new version is observed immediately.
  • created_at is preserved from the original row; last_used is refreshed to now64(), matching the convention from write_config_snapshot.

What this PR does NOT do

  • Does not change Config.hash (still the legacy hash). PR 4 flips that.
  • Does not write canonical hashes to inference / feedback / datapoint rows. PR 4.
  • Does not change the multi-backend make_db_test! suite. Those tests will flip to canonical lookups in PR 4 once both backends support them.

Test plan

  • cargo test --features e2e_tests -p tensorzero-core --test e2e clickhouse::test_rollback_up_to_migration_index_47 (with CH up)
  • cargo test --features e2e_tests -p tensorzero-core --test e2e config::snapshot (with CH+PG up)
  • Manually verify: boot a clean CH, write a snapshot, confirm both hash and canonical_hash are populated; restart; confirm backfill is no-op
  • Manually verify: pre-migration row with canonical_hash = 0 gets backfilled on next boot

Roadmap context

PR Scope Status
#7419 SnapshotHash::Canonical machinery Open
#7422 PG config_jsonb + canonical_hash columns Open
This PR CH canonical_hash column + dispatch + backfill Open (this)
4 Flip Config.hash default to canonical Next
5a/b Per-object metadata + REST CRUD After PR 4
6 Config-in-database boot mode + CI lanes After PR 5
7 UI overhaul + TOML editor cleanup After PR 6

🤖 Generated with Claude Code

…me backfill

Step 3 of the canonical-hash stack (depends on PR #7422). Brings
ClickHouse to parity with Postgres: `ConfigSnapshot` gains a
`canonical_hash` column, `get_config_snapshot` dispatches on
`SnapshotHashScheme`, every new write populates both columns, and a
boot-time backfill rewrites legacy rows with the structural hash.

## Why

PR #7419 introduced `SnapshotHash::Canonical`. PR #7422 wired Postgres
to persist both legacy and canonical hashes and dispatch lookups by
scheme. CH was untouched until now: its single `hash` column held the
canonical-TOML-bytes value, and any caller that tried to look up a
canonical hash there would silently fail.

This PR mirrors PG's approach: add a separate `canonical_hash UInt256`
column, populate both columns on every new write, dispatch reads by
the hash's scheme tag. Same identity contract on both backends going
forward.

## What's in this PR

### Migration `migration_0054`

Pure additive `ALTER TABLE ConfigSnapshot ADD COLUMN IF NOT EXISTS
canonical_hash UInt256 DEFAULT 0`. The default is `0` because
ClickHouse `UInt256` has no NULL — it doubles as the "needs backfill"
sentinel for the boot-time backfill. The Blake3 collision space makes
a real config canonicalizing to literal `0` astronomically unlikely.

Wired into `make_all_migrations` and `NUM_MIGRATIONS` (47 → 48).
Rollback test parameters updated to include index 47.

### Read-path scheme dispatch (`config_queries.rs`)

```rust
match snapshot_hash.scheme() {
    SnapshotHashScheme::LegacyToml => /* WHERE hash = $1 */,
    SnapshotHashScheme::Canonical  => /* WHERE canonical_hash = $1 */,
}
```

Same shape as the PG dispatcher in PR #7422. Old `inferences.snapshot_hash`
references resolve via the `hash` column; new ones resolve via
`canonical_hash`. Every URL / log line / autopilot tag continues to
parse as before.

### Write path populates both columns

`write_config_snapshot` now computes the canonical hash from the
in-memory `StoredConfig` alongside the legacy hash already on
`snapshot.hash`, and inserts both as `UInt256` literals via
`toUInt256(...)`. The `external_data` schema gains the new column
declaration.

### Boot-time backfill (`backfill_config_snapshot_canonical_hash`)

Mirrors PR #7422's PG backfill. Runs after CH migrations on every
boot, idempotent (filters by `canonical_hash = 0`), cooperative (no
locks). Per-row failures log at `error!` level and skip — startup
continues. A skipped row stays readable via the legacy `hash` lookup;
only canonical-hash lookups against it return not-found, which is
the same behavior as if the backfill had not run at all.

Implementation notes:
- ClickHouse's `ReplacingMergeTree` engine deduplicates by the
  `(hash)` ORDER BY key on background merges. We re-INSERT each
  legacy row with `canonical_hash` populated; reads use `FINAL` so
  the new version is observed immediately.
- `created_at` is preserved from the original row; `last_used` is
  refreshed to `now64()`, matching the convention from
  `write_config_snapshot`.

## What this PR does NOT do

- Does not change `Config.hash` (still the legacy hash). PR 4 flips
  that.
- Does not write canonical hashes to inference / feedback / datapoint
  rows. PR 4.
- Does not change the multi-backend `make_db_test!` suite. Those
  tests will flip to canonical lookups in PR 4 once both backends
  support them.

## Test plan

- [ ] `cargo test --features e2e_tests -p tensorzero-core --test e2e clickhouse::test_rollback_up_to_migration_index_47` (with CH up)
- [ ] `cargo test --features e2e_tests -p tensorzero-core --test e2e config::snapshot` (with CH+PG up)
- [ ] Manually verify: boot a clean CH, write a snapshot, confirm both `hash` and `canonical_hash` are populated; restart; confirm backfill is no-op
- [ ] Manually verify: pre-migration row with `canonical_hash = 0` gets backfilled on next boot

## Roadmap context

| PR | Scope | Status |
|----|-------|--------|
| #7419 | `SnapshotHash::Canonical` machinery | Open |
| #7422 | PG `config_jsonb` + `canonical_hash` columns | Open |
| **This PR** | **CH `canonical_hash` column + dispatch + backfill** | **Open (this)** |
| 4 | Flip `Config.hash` default to canonical | Next |
| 5a/b | Per-object metadata + REST CRUD | After PR 4 |
| 6 | Config-in-database boot mode + CI lanes | After PR 5 |
| 7 | UI overhaul + TOML editor cleanup | After PR 6 |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AntoineToussaint and others added 6 commits May 4, 2026 15:54
…kfill test

Two test changes:

1. **New CH backfill test**:
   `backfill_populates_canonical_hash_for_legacy_clickhouse_rows`.
   Mirrors PR #7422's PG `backfill_populates_config_jsonb_for_legacy_rows`.

   - Builds a snapshot, then INSERTs it directly with
     `canonical_hash = 0` to simulate a row written before
     migration_0054 ran. (Bypasses `write_config_snapshot`, which now
     populates both columns.)
   - Asserts pre-backfill state: row exists with `canonical_hash = 0`.
   - Runs `backfill_config_snapshot_canonical_hash`.
   - Asserts post-backfill: `canonical_hash` matches
     `StoredConfig::canonical_hash()` and a canonical-hash lookup via
     `get_config_snapshot` resolves the row.
   - Re-runs the backfill; asserts the column is unchanged
     (idempotency).

2. **Fix `backfill_populates_config_jsonb_for_legacy_rows` assertion**:
   that test (inherited from PR #7422) referenced
   `{"version": 2}` in a JSONB containment query, but I'd already
   stripped the `version = 2` line from the fixture in PR #7422 to
   work around `UninitializedFunctionConfig` not having a `version`
   field yet. Replace the version-based containment fragment with
   one that matches a field actually present in the fixture
   (`{"type": "chat"}`). This unblocks the test on the CI run that
   includes PR #7422's tip.

`cargo clippy --all-targets --all-features -- -D warnings` clean;
`cargo fmt --all -- --check` clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m:tensorzero/tensorzero into pr/canonical-hash-snapshot-step3-ch-parity
Mirrors the PG backfill test additions in #7422. Adds 4 new CH tests
plus a shared `insert_legacy_clickhouse_snapshot_row` helper:

1. `backfill_skips_unparseable_legacy_rows_clickhouse` — inserts a
   well-formed legacy row alongside one with garbage TOML. After
   backfill: good row's canonical_hash matches
   `StoredConfig::canonical_hash`; bad row stays at the
   `canonical_hash = 0` sentinel. Locks the fail-soft contract.

2. `backfill_leaves_already_populated_rows_unchanged_clickhouse` —
   writes a snapshot via `write_config_snapshot` (both columns
   populated), captures `canonical_hash`, runs backfill, asserts
   unchanged. Locks the `WHERE canonical_hash = 0` filter.

3. `backfill_no_op_on_empty_state_clickhouse` — backfill on a
   `ConfigSnapshot` table with no sentinel-marked rows is a no-op.

4. `backfill_preserves_created_at_clickhouse` — CH-specific. The
   backfill re-INSERTs each row with the new column populated;
   ReplacingMergeTree dedupes by ORDER BY (hash) and reads use
   FINAL. The re-INSERT preserves `created_at` from the original
   row and refreshes `last_used`. This test waits 1.5s between the
   legacy insert and the backfill so a wrong implementation that
   stamps `now64()` on `created_at` would surface a detectably
   different timestamp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…m:tensorzero/tensorzero into pr/canonical-hash-snapshot-step3-ch-parity
…m:tensorzero/tensorzero into pr/canonical-hash-snapshot-step3-ch-parity
…`to_canonical_form`

Previously, PR 3 only added a `canonical_hash` column to CH
`ConfigSnapshot`, leaving the legacy `config` (TOML text) column as
the sole content storage. PG's PR 2 added BOTH `canonical_hash` AND
`config_jsonb`. Asymmetric — and missing the actual content-side
benefit of the canonical-form work: a stored representation that
doesn't require reparsing TOML on every read.

Bringing CH to true parity:

1. **Migration 0054 also adds `config_json String DEFAULT ''`.**
   String type because CH's native JSON type isn't widely available
   yet pre-25.x; storing canonical JSON as compact text is broadly
   compatible. `''` is the new "needs backfill" sentinel — empty
   string can't collide with real canonical content (the canonical
   form of any non-empty `StoredConfig` is at least `"{}"`).

2. **`write_config_snapshot` uses `snapshot.to_canonical_form()`.**
   Same helper PG uses; both backends now compute the canonical-form
   pair (`json` + `hash`) in one walk, and neither knows whether
   the underlying mechanism is `serde_json::to_value` or anything
   else — that's encapsulated in `canonical_hash.rs`. The CH row
   structure now carries `config`, `config_json`, `hash`,
   `canonical_hash` together.

3. **Backfill populates BOTH columns.** Scan filter is now
   `WHERE canonical_hash = 0 OR config_json = ''` so a row missing
   either column gets reprocessed. Per-row work uses
   `to_canonical_form` to get JSON and hash from one walk, then
   re-INSERTs via external data so single-quotes / backslashes
   inside the JSON can't break the SQL.

4. **Tests extended to cover `config_json`:**
   - Happy-path (`backfill_populates_canonical_hash_for_legacy_clickhouse_rows`):
     now asserts both columns post-backfill, both columns idempotent
     under second run.
   - Fail-soft (`backfill_skips_unparseable_legacy_rows_clickhouse`):
     asserts both columns stay at sentinel on the bad row, both are
     populated on the good row.

This brings the backfill story to: PG and CH both populate two
columns (hash + content), both have idempotent + fail-soft + locked
behavior, both pass through the same `to_canonical_form` helper,
and both are tested end-to-end.

Picks up PR #7422's tip via merge to get the `to_canonical_form`
helper that lives there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant