Gate /internal/config_toml on runtime DB mode + add zero-config E2E #7362
Closed
AntoineToussaint wants to merge 11 commits into
When `--config-file` and `--default-config` are both absent, the gateway now falls through to the DB-authoritative load path whenever `TENSORZERO_POSTGRES_URL` is set. Previously this required explicit opt-in via the `ENABLE_CONFIG_IN_DATABASE` feature flag.
An empty database is a valid starting point: every singleton falls back to its default and every collection is empty, so the gateway serves a functional runtime with zero user config.
This is the first step toward a "zero-config deploy": the operator provides a database URL and populates functions, variants, and models through REST endpoints.
Also adds an empty-database smoke test for `load_config_from_db`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply review feedback on the startup-config-from-Postgres fallback:
- Treat an empty `TENSORZERO_POSTGRES_URL` as absent so a shell/compose misconfiguration produces the clear "no config source" error instead of an opaque sqlx dial failure.
- Read the env var once and thread the `Option<String>` into `load_startup_config_from_database`, eliminating the double read.
- Log a prominent `WARN` when falling through to the implicit DB path (env var set, no feature flag, no `--config-file`) so operators see the fallback in startup logs. Many deployments set the env var for observability/rate-limiting without intending DB-config boot.
- Replace the positional `(…, …, bool /* config_in_database */)` tuple with a `StartupConfig` struct so callers don't rely on an inline-comment-documented bool.
- Introduce a `TENSORZERO_POSTGRES_URL_ENV` constant for the two new call sites in this file.
- Rewrite the empty-DB smoke test with `expect_that!` + `matches_pattern!` per `AGENTS.md` guidance, giving per-field failure diagnostics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
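The fallback rules in this commit can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the gateway's actual code: the fields of `StartupConfig`, the helper names `normalize_postgres_url`, `read_postgres_url_env`, and `startup_config_without_file`, and the wording of the error and warning messages are all assumptions made for the example.

```rust
use std::env;

/// One constant for the env var name, as the commit describes.
const TENSORZERO_POSTGRES_URL_ENV: &str = "TENSORZERO_POSTGRES_URL";

/// Hypothetical stand-in for `StartupConfig`; the real fields are not shown in the source.
#[derive(Debug, PartialEq)]
struct StartupConfig {
    postgres_url: String,
    config_in_database: bool,
    /// Assumed field: true when the DB path was chosen without the feature flag.
    implicit_fallback: bool,
}

/// Treat an unset or empty value as absent.
fn normalize_postgres_url(raw: Option<String>) -> Option<String> {
    raw.filter(|url| !url.is_empty())
}

/// Read the env var exactly once; callers thread the result onward.
fn read_postgres_url_env() -> Option<String> {
    normalize_postgres_url(env::var(TENSORZERO_POSTGRES_URL_ENV).ok())
}

/// Decide the startup source when neither `--config-file` nor
/// `--default-config` was given (hypothetical helper).
fn startup_config_without_file(
    postgres_url: Option<String>,
    flag_enabled: bool,
) -> Result<StartupConfig, String> {
    let Some(url) = postgres_url else {
        // Clear "no config source" error instead of a database dial failure.
        return Err(format!(
            "no config source: pass --config-file or --default-config, or set {TENSORZERO_POSTGRES_URL_ENV}"
        ));
    };
    if !flag_enabled {
        // Implicit DB path: make the fallback visible in startup logs.
        eprintln!("WARN: no config file given; loading config from Postgres");
    }
    Ok(StartupConfig {
        postgres_url: url,
        config_in_database: true,
        implicit_fallback: !flag_enabled,
    })
}
```

A caller would pass `read_postgres_url_env()` as the first argument, so the environment is consulted once and an empty value behaves exactly like an unset one.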
`expect_that!` needs a `#[gtest]` test context to collect failures; the `#[sqlx::test]` macro doesn't provide one, so using it here panics with "No test context found" instead of running the assertion.
Switch to `assert_that!`, which works without the gtest context and matches the convention used by every other test in this file.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawns the actual gateway binary with `TENSORZERO_POSTGRES_URL` set against a migrated Postgres and nothing else (no `--config-file`, no `--default-config`, no `ENABLE_CONFIG_IN_DATABASE` feature flag) and verifies the gateway binds a port, serves a healthy `/health`, and returns a well-formed `StatusResponse` from `/status`.
This is the end-to-end counterpart to the unit-level empty-DB test on `load_config_from_db`: that one proves the loader returns defaults, this one proves the full binary actually reaches listening state and answers HTTP with that defaulted config.
Also factors the "wait for listening + parse bound addr + build ChildData" tail of `start_gateway_impl` into a shared `await_gateway_listening` helper so the new `start_gateway_from_db_url_on_random_port` helper doesn't duplicate it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the new integration test from "does the gateway serve /health" to the full config-in-database scenario the UI will build on top of: migrated Postgres, no config rows, no `--config-file`, feature flag on, then assert:
- `/health` 200
- `/status` returns `ok` + a non-empty `config_hash`
- `/internal/config_toml` returns a default editable TOML whose hash matches `/status`, and whose `path_contents` is empty (no user-provided templates)
- The TOML body parses as a valid TOML table
The helper `start_gateway_from_db_url_on_random_port` now takes an `extra_env` slice so callers can either exercise the implicit-opt-in path (env var only) or the full config-in-database scenario (feature flag on) without duplicating the subprocess plumbing.
Adds `toml` to gateway dev-dependencies for assertion-side parsing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
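As a rough illustration of the cross-endpoint checks listed in this commit, the sketch below compares an already-decoded `/status` response with an already-decoded `/internal/config_toml` response. The struct shapes are assumptions that carry only the fields the commit message names, `check_empty_db_boot` is a hypothetical helper, and the "parses as a TOML table" check is left out because it needs the `toml` crate.

```rust
use std::collections::BTreeMap;

/// Hypothetical subset of `StatusResponse`: only the fields named in the commit.
struct StatusResponse {
    status: String,
    config_hash: String,
}

/// Hypothetical subset of the `/internal/config_toml` response body.
struct ConfigTomlResponse {
    hash: String,
    path_contents: BTreeMap<String, String>,
}

/// Check the invariants expected from a gateway booted against an empty database.
fn check_empty_db_boot(
    status: &StatusResponse,
    config: &ConfigTomlResponse,
) -> Result<(), String> {
    if status.status != "ok" {
        return Err(format!("unexpected status: {}", status.status));
    }
    if status.config_hash.is_empty() {
        return Err("config_hash is empty".to_string());
    }
    if config.hash != status.config_hash {
        return Err(format!(
            "hash mismatch: /status has {}, /internal/config_toml has {}",
            status.config_hash, config.hash
        ));
    }
    if !config.path_contents.is_empty() {
        // An empty database references no template files.
        return Err("path_contents should be empty".to_string());
    }
    Ok(())
}
```

Returning the first violated invariant as an `Err` keeps the sketch free of test-framework dependencies; the real tests use googletest matchers instead.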
Adds a dedicated e2e scenario, parallel to `live-tests`, `live-tests-config-in-database`, `evaluation-tests`, and the existing live flavors: gateway booting from a migrated-but-empty Postgres + ClickHouse stack with `ENABLE_CONFIG_IN_DATABASE=true` and no `--config-file`. This is the deploy shape the configure-via-UI story builds on: schema present, no config rows, no files on disk.
New pieces, all mirroring the existing config-in-database pattern:
- `crates/tensorzero-core/tests/e2e/docker-compose.db-only-boot.yml`: override of `docker-compose.live.yml` that drops `gateway-migrate-config`, flips the feature flag, clears `--config-file`, and uses `!override` on `volumes` to remove every inherited bind mount (config TOMLs, fixtures, credentials), so the gateway literally has nothing on disk to read.
- `crates/tensorzero-core/tests/e2e/db_only_boot/mod.rs`: two `#[gtest] #[tokio::test]` Rust tests that run inside the live-tests container and hit the gateway over the compose network: one asserts `/status` reports the default config and a non-empty hash, the other asserts `/internal/config_toml` returns the same hash with empty `path_contents` and a TOML body that parses back as a valid table.
- `crates/.config/nextest.toml`: new `db-only-boot` profile filtering to `db_only_boot::` tests, and `e2e`'s `default-filter` excludes them so they only run in their own CI job.
- `.github/workflows/db-only-boot-e2e.yml`: new reusable workflow standing up the stack, running the profile inside `live-tests`, and asserting the gateway logs show the DB-authoritative boot banner.
- `.github/workflows/general.yml`: wires the new job behind `detect-changes.outputs.code`; `ci/check-all-general-jobs-passed.sh` adds it to `ALLOWED_SKIP` so the merge queue tolerates skipped runs.
Also drops the subprocess-spawning `crates/gateway/tests/boot_from_empty_db.rs` and its helper additions in `gateway/tests/common/mod.rs` and `gateway/Cargo.toml`, which are superseded by the in-container Rust test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four small cleanups from the branch review:
- `load_startup_config_from_database` takes `Option<&str>` instead of
`Option<String>` — the function never owned the url; caller now
passes `postgres_url.as_deref()`.
- Consolidate `UnwrittenConfig` import into the existing
`use tensorzero_core::config::{...}` block and drop the two inline
long-form paths, per AGENTS.md.
- Fold the three separate `expect_that!` calls on `StatusResponse`
into a single `matches_pattern!` — if the struct gains a field, the
test now makes a conscious choice instead of silently ignoring it.
- Replace `toml::from_str(...).unwrap_or_else(|e| panic!(...))` with
`assert_that!(parsed, ok(predicate(toml::Value::is_table)))` so
success + the "is-a-table" check collapse to one googletest
assertion.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
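The first cleanup, borrowing the URL instead of taking ownership, might look like the sketch below. The function body and return type are placeholders invented for the example; only the `Option<&str>` parameter and the `as_deref()` call at the call site come from the commit message.

```rust
/// Simplified, hypothetical version of the loader: it only needs to read the
/// URL, so it borrows it instead of taking an owned `String`.
fn load_startup_config_from_database(postgres_url: Option<&str>) -> Result<String, String> {
    let url = postgres_url.ok_or_else(|| "no Postgres URL configured".to_string())?;
    // Placeholder for the real load; returns a description instead of a config.
    Ok(format!("would load config from {url}"))
}

/// Caller keeps ownership of the `Option<String>` and lends it out.
fn boot(postgres_url: Option<String>) -> (Result<String, String>, Option<String>) {
    let loaded = load_startup_config_from_database(postgres_url.as_deref());
    // `postgres_url` is still usable here because it was only borrowed.
    (loaded, postgres_url)
}
```

`Option::as_deref` turns `&Option<String>` into `Option<&str>` without cloning, so the caller can keep using the owned value after the load call.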
Two mistakes in the initial push of the new job:
- The workflow pulls `tensorzero/live-tests:sha-$SHA` but only declared `build-gateway-container` in `needs:`. Adds `build-live-tests-container` and `build-fixtures-container` to the dependency list, matching `live-tests-config-in-database`. Also gates the job on the same fork/dependabot condition the sibling jobs use.
- `pre-commit`'s `check-yaml` can't parse Compose's `!override` custom tag, so `validate` failed on the new compose file. Excludes that single file from `check-yaml`; Docker Compose still validates it at stack-up time.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CI job failed because `docker compose run live-tests` started its full `depends_on` graph, including `fixtures-postgres`, which exits 1 when loading fixtures against a migrated-but-empty DB. The whole point of this scenario is an empty DB, so fixture loading is a semantic mismatch.
Override `live-tests.depends_on` with `!override` to keep only the infra + gateway + migrations services and drop `fixtures` and `fixtures-postgres`. The `up --wait gateway` and the subsequent `run --rm live-tests` both pass locally after this change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run the zero-config boot scenario in both observability modes:
- Postgres-config + ClickHouse-data (default TOML-config deploy shape)
- Postgres-config + Postgres-data (single-datastore deploy)
Matches the `live-tests` workflow's `database: [clickhouse, postgres]`
matrix. When `matrix.database == postgres`, sets
`TENSORZERO_INTERNAL_TEST_OBSERVABILITY_BACKEND=postgres` so the gateway
uses Postgres as the primary observability backend and exercises its
pgcron/pgvector/trigram extension checks.
The `check-all-general-jobs-passed.sh` ALLOWED_SKIP entry
(`db-only-boot-e2e`) already covers matrix-suffixed job names via the
existing `"entry ("` prefix match — no change needed there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: the three `/internal/config_toml` routes were mounted only when `ENABLE_CONFIG_IN_DATABASE` was set at router-build time. That ties the endpoints to a process-wide flag and means a gateway booted implicitly from a Postgres URL (no flag, no `--config-file`) would 404 the UI's config bootstrap even though the stored-config tables are already the source of truth.
Now: the routes are always mounted, and `get` / `apply` gate themselves at request time on `app_state.config_in_database`, the same bit the boot logic records when it decides which source to load from. The `require_config_in_database(bool, &str)` helper keeps the 501 response consistent across endpoints.
`validate` is intentionally ungated: it is stateless (parse + run the shared load pipeline, no DB reads or writes), and the UI needs to lint editable TOML even against a file-backed gateway. The docstring now says so explicitly.
Also extend `db_only_boot_returns_default_config_via_config_toml_endpoint` to assert the collection tables (functions/models/tools/metrics) are absent or empty, and that `base_signature` is populated so callers can use it as a CAS token on the first apply.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
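The request-time gating described in this commit can be sketched as below. It is a framework-free approximation, not the gateway's handlers: `ApiError`, `AppState`, the three handler functions, and the message text are hypothetical stand-ins, while the `require_config_in_database(bool, &str)` shape, the 501 status, and the ungated `validate` follow the commit message.

```rust
/// Hypothetical stand-in for the gateway's error type.
#[derive(Debug, PartialEq)]
enum ApiError {
    NotImplemented { message: String },
}

impl ApiError {
    /// HTTP status the error maps to (501 per the commit message).
    fn status_code(&self) -> u16 {
        match self {
            ApiError::NotImplemented { .. } => 501,
        }
    }
}

/// Hypothetical slice of application state: the bit recorded at boot.
struct AppState {
    config_in_database: bool,
}

/// Shared gate so every endpoint returns the same 501 shape.
fn require_config_in_database(config_in_database: bool, endpoint: &str) -> Result<(), ApiError> {
    if config_in_database {
        Ok(())
    } else {
        Err(ApiError::NotImplemented {
            // Assumed wording; the real message is not shown in the source.
            message: format!("`{endpoint}` requires a gateway booted from the database"),
        })
    }
}

/// Gated: reads stored config (placeholder body).
fn get_config_toml(state: &AppState) -> Result<String, ApiError> {
    require_config_in_database(state.config_in_database, "GET /internal/config_toml")?;
    Ok("[gateway]\n".to_string())
}

/// Gated: writes stored config (placeholder body).
fn apply_config_toml(state: &AppState, _toml: &str) -> Result<(), ApiError> {
    require_config_in_database(state.config_in_database, "POST /internal/config_toml/apply")?;
    Ok(())
}

/// Ungated: takes no state, so it works against file-backed gateways too.
/// The real parse and load pipeline is elided here.
fn validate_config_toml(_toml: &str) -> Result<(), ApiError> {
    Ok(())
}
```

Because the routes are always mounted, a file-backed gateway answers `get` and `apply` with a structured 501 rather than a 404, and `validate` stays usable in both modes.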
dd7d304 to 32fefa2
Stacked on top of #7361.
Motivation
After #7361 makes the gateway boot from a Postgres URL alone, the natural next question is "what does the operator do next?" Today, the answer is "nothing useful": the three `/internal/config_toml*` endpoints, which the UI uses to read and write config, are gated on the `ENABLE_CONFIG_IN_DATABASE` feature flag at the route registration level. So even after a successful zero-config boot, those endpoints return 404 unless the flag is set.
This PR fixes that and adds the first end-to-end coverage of the zero-config boot path: prove the gateway runs, prove the config endpoint returns the right shape against an empty DB.
What this PR does
1. Gate `/internal/config_toml*` on runtime DB mode, not feature flag
File: `crates/gateway/src/routes/internal.rs`
- Removes the `if feature_flags::ENABLE_CONFIG_IN_DATABASE.get() { ... }` wrapper around route registration.
- The three routes (`GET /internal/config_toml`, `POST /internal/config_toml/apply`, `POST /internal/config_toml/validate`) are now always mounted.
File: `crates/tensorzero-core/src/endpoints/internal/config_toml.rs`
- Gates `get_latest_config_toml_handler` with a runtime check on `app_state.config_in_database`.
- Applies the same gating to `apply_config_toml_handler`.
- `require_config_in_database(app_state, endpoint)` returns `ErrorDetails::NotImplemented` with a clear message pointing at the env var.
- File-backed gateways (`config_in_database = false`) still get a structured error explaining how to enable the endpoint.
2. Zero-config E2E test suite
New file: `crates/tensorzero-core/tests/e2e/zero_config/mod.rs`. Two tests:
- `zero_config_health_returns_ok`: `/health` returns 200, header `x-tensorzero-gateway-version` matches `TENSORZERO_VERSION`, and the JSON body has `gateway: ok` and `clickhouse: ok`.
- `zero_config_get_config_toml_returns_defaults`: `GET /internal/config_toml` returns 200 with:
  - `path_contents` is empty (no referenced template files when DB is empty)
  - `toml` is non-empty (default singletons emit `[gateway]`, `[clickhouse]`, etc.)
  - `toml` does NOT contain `[functions.`, `[models.`, `[tools.`, or `[metrics.` (no user-defined entries)
  - `hash` and `base_signature` are non-empty (callers can chain into `/apply`)
3. Test infrastructure
File: `crates/.config/nextest.toml`
- Adds `test(zero_config::)` to the `e2e` profile's exclusion list, so the main suite doesn't try to run these tests against its config-laden gateway.
- New `[profile.zero-config]`: `default-filter = 'binary(e2e) and test(zero_config::)'`, `test-threads = 1` (these tests mutate gateway-wide state in follow-up work), moderate retries.
New file: `ui/fixtures/docker-compose.zero-config.yml`
- Override of `docker-compose.e2e.yml`.
- Clears the gateway's `--config-file` command (`command: []`) so it falls through to the DB load path.
- Overrides `depends_on` to include only `clickhouse` and `gateway-postgres-migrations`, which drops the dependency on the fixtures loader (no fixtures are wanted; the DB must stay empty).
- Marks the `fixtures` service as `profiles: ["never"]` so it doesn't get pulled into the dependency graph.
File: `crates/tensorzero-core/tests/e2e/tests.rs`
- Registers the `zero_config` module.
Why this approach
`config_in_database`." Adds complexity. The feature flag's job was always to decide whether to use the DB-config code path; once `config_in_database` carries that information at runtime, the flag is redundant for endpoint gating.
How to verify locally
Test plan
- `cargo check -p gateway -p tensorzero-core` clean
- `cargo check --package tensorzero-core --test e2e --features e2e_tests` clean
- `cargo fmt --check` clean
- Run the `zero-config` profile against the new docker-compose stack (locally + CI when wired)
- Existing `config_editing` tests still pass (they rely on `/internal/config_toml/apply` working under the new handler-level gating)
- File-backed gateways get a `NotImplemented` error from `/internal/config_toml*` instead of 404
Stack context
Stacked on #7361. The CI workflow that actually runs the new `zero-config` profile is not yet pushed; it'll land as PR #4 in the stack (drafted in worktree).
- Gate `/internal/config_toml` on runtime DB mode + zero-config E2E
🤖 Generated with Claude Code