Gate /internal/config_toml on runtime DB mode + add zero-config E2E #7362

Closed
AntoineToussaint wants to merge 11 commits into main from db-config-rest-zero-config

AntoineToussaint (Member) commented Apr 23, 2026

Stacked on top of #7361.

Motivation

After #7361 makes the gateway boot from a Postgres URL alone, the natural next question is "what does the operator do next?" Today, the answer is "nothing useful": the three /internal/config_toml* endpoints — which the UI uses to read and write config — are gated on the ENABLE_CONFIG_IN_DATABASE feature flag at the route registration level. So even after a successful zero-config boot, those endpoints return 404 unless the flag is set.

This PR fixes that and adds the first end-to-end coverage of the zero-config boot path: prove the gateway runs, prove the config endpoint returns the right shape against an empty DB.

What this PR does

1. Gate /internal/config_toml* on runtime DB mode, not feature flag

File: crates/gateway/src/routes/internal.rs

  • Removed the if feature_flags::ENABLE_CONFIG_IN_DATABASE.get() { ... } wrapper around route registration.
  • All three endpoints (GET /internal/config_toml, POST /internal/config_toml/apply, POST /internal/config_toml/validate) are now always mounted.

File: crates/tensorzero-core/src/endpoints/internal/config_toml.rs

  • Replaced the feature-flag check inside get_latest_config_toml_handler with a runtime check on app_state.config_in_database.
  • Added the same check to apply_config_toml_handler.
  • New helper require_config_in_database(app_state, endpoint) returns ErrorDetails::NotImplemented with a clear message pointing at the env var.
  • File-backed gateways (config_in_database = false) still get a structured error explaining how to enable the endpoint.
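
A minimal sketch of the handler-level gate described above, for readers who want to see its shape. This is not the PR's code: the `AppState` struct, the `GatewayError` enum, and the message wording are stand-ins invented for illustration. Only the helper name `require_config_in_database`, the `config_in_database` field, and the "not implemented" outcome come from the description.

```rust
/// Stand-in for the gateway's shared state; only this one field matters here.
struct AppState {
    config_in_database: bool,
}

/// Stand-in error type. The PR reports `ErrorDetails::NotImplemented`;
/// this enum only imitates that variant.
#[derive(Debug, PartialEq)]
enum GatewayError {
    NotImplemented { message: String },
}

/// Returns `Ok(())` when the gateway loaded its config from the database,
/// otherwise a structured error that names the endpoint that was called.
fn require_config_in_database(app_state: &AppState, endpoint: &str) -> Result<(), GatewayError> {
    if app_state.config_in_database {
        Ok(())
    } else {
        Err(GatewayError::NotImplemented {
            // Assumed wording; the real message points at the relevant env var.
            message: format!("`{endpoint}` requires a gateway whose config is stored in the database"),
        })
    }
}
```

A handler would call this first and return early on `Err`, so the route stays mounted while file-backed gateways get a structured error instead of a 404.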

2. Zero-config E2E test suite

New file: crates/tensorzero-core/tests/e2e/zero_config/mod.rs. Two tests:

  • zero_config_health_returns_ok: /health returns 200, the x-tensorzero-gateway-version header matches TENSORZERO_VERSION, and the JSON body has gateway: ok and clickhouse: ok.
  • zero_config_get_config_toml_returns_defaults: GET /internal/config_toml returns 200 with:
    • path_contents is empty (no referenced template files when DB is empty)
    • toml is non-empty (default singletons emit [gateway], [clickhouse], etc.)
    • toml does NOT contain [functions., [models., [tools., or [metrics. (no user-defined entries)
    • hash and base_signature are non-empty (callers can chain into /apply)
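
The four checks in the second test can be sketched as a single predicate over the response. This is an illustration under assumptions, not the test from the PR: the `ConfigTomlResponse` struct and its field types are hypothetical mirrors of the field names listed above, and `check_empty_db_defaults` is an invented name.

```rust
use std::collections::BTreeMap;

/// Hypothetical mirror of the fields named in the test description.
/// The real response type and its field types may differ.
struct ConfigTomlResponse {
    toml: String,
    path_contents: BTreeMap<String, String>,
    hash: String,
    base_signature: String,
}

/// Table-header prefixes that would indicate user-defined entries.
const USER_DEFINED_PREFIXES: [&str; 4] = ["[functions.", "[models.", "[tools.", "[metrics."];

/// Checks the shape expected from an empty database: defaults only,
/// no referenced files, and tokens present for a follow-up apply.
fn check_empty_db_defaults(response: &ConfigTomlResponse) -> Result<(), String> {
    if !response.path_contents.is_empty() {
        return Err("path_contents should be empty".to_string());
    }
    if response.toml.is_empty() {
        return Err("toml should contain the default singletons".to_string());
    }
    if let Some(prefix) = USER_DEFINED_PREFIXES
        .iter()
        .find(|prefix| response.toml.contains(**prefix))
    {
        return Err(format!("unexpected user-defined table starting with {prefix}"));
    }
    if response.hash.is_empty() || response.base_signature.is_empty() {
        return Err("hash and base_signature should be non-empty".to_string());
    }
    Ok(())
}
```

Matching on the `[functions.` style prefixes, rather than parsing the TOML, keeps the check independent of which default singleton sections the gateway happens to emit.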

3. Test infrastructure

File: crates/.config/nextest.toml

  • Added test(zero_config::) to the e2e profile's exclusion list, so the main suite doesn't try to run these tests against its config-laden gateway.
  • New [profile.zero-config]: default-filter = 'binary(e2e) and test(zero_config::)', test-threads = 1 (these tests mutate gateway-wide state in follow-up work), moderate retries.

New file: ui/fixtures/docker-compose.zero-config.yml

  • Override layered on top of docker-compose.e2e.yml.
  • Clears the gateway's --config-file command (command: []) so it falls through to the DB load path.
  • Re-declares depends_on to include only clickhouse and gateway-postgres-migrations, dropping the dependency on the fixtures loader (this scenario needs an empty DB, so no fixtures are loaded).
  • Marks the fixtures service as profiles: ["never"] so it doesn't get pulled into the dependency graph.

File: crates/tensorzero-core/tests/e2e/tests.rs

  • Registered the new zero_config module.

Why this approach

  • Rejected: keeping the feature flag and adding a fallback "if no flag, check config_in_database." Adds complexity. The feature flag's job was always to decide whether to use the DB-config code path; once config_in_database carries that information at runtime, the flag is redundant for endpoint gating.
  • Rejected: deleting the feature flag entirely in this PR. Keeping the flag, which still serves as an explicit CLI-level opt-in for operators, limits the blast radius of this change. It can be removed in a follow-up.
  • Rejected: a bigger E2E suite in this PR. Two tests are enough to establish "gateway is up + config endpoint works." Phase 2A will build out behavioural coverage.

How to verify locally

docker compose \
  -f ui/fixtures/docker-compose.e2e.yml \
  -f ui/fixtures/docker-compose.zero-config.yml \
  up -d gateway

# Gateway should be reachable on :3000 with no config
curl -s http://localhost:3000/health | jq
# Should show gateway: ok, clickhouse: ok

curl -s http://localhost:3000/internal/config_toml | jq
# Should return non-empty toml, empty path_contents, non-empty hash/base_signature

# Run the new tests
cargo nextest run --profile zero-config -E 'binary(e2e) and test(zero_config::)' --features e2e_tests

Test plan

  • cargo check -p gateway -p tensorzero-core clean
  • cargo check --package tensorzero-core --test e2e --features e2e_tests clean
  • cargo fmt --check clean
  • Run new zero-config profile against the new docker-compose stack (locally + CI when wired)
  • Existing config_editing tests still pass (they rely on /internal/config_toml/apply working under the new handler-level gating)
  • File-backed gateways return structured NotImplemented error from /internal/config_toml* instead of 404

Stack context

Stacked on #7361. The CI workflow that actually runs the new zero-config profile is not yet pushed; it will land as the fourth PR in the stack (drafted in a worktree).

  1. Allow gateway to boot with only a Postgres connection #7361 — gateway boots with only a Postgres connection
  2. Gate /internal/config_toml on runtime DB mode + add zero-config E2E #7362 (this PR) — gate /internal/config_toml on runtime DB mode + zero-config E2E
  3. Add REST config bootstrap helper for zero-config E2E tests #7363 — REST config bootstrap helper for zero-config E2E tests

🤖 Generated with Claude Code

When `--config-file` and `--default-config` are both absent, the gateway
now falls through to the DB-authoritative load path whenever
`TENSORZERO_POSTGRES_URL` is set. Previously this required explicit
opt-in via the `ENABLE_CONFIG_IN_DATABASE` feature flag.

An empty database is a valid starting point: every singleton falls back
to its default and every collection is empty, so the gateway serves a
functional runtime with zero user config. This is the first step toward
a "zero-config deploy": the operator provides a database URL and
populates functions, variants, and models through REST endpoints.

Also adds an empty-database smoke test for `load_config_from_db`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AntoineToussaint and others added 5 commits April 23, 2026 14:45
Apply review feedback on the startup-config-from-Postgres fallback:

- Treat an empty `TENSORZERO_POSTGRES_URL` as absent so a shell/compose
  misconfiguration produces the clear "no config source" error instead
  of an opaque sqlx dial failure.
- Read the env var once and thread the `Option<String>` into
  `load_startup_config_from_database`, eliminating the double read.
- Log a prominent `WARN` when falling through to the implicit DB path
  (env var set, no feature flag, no `--config-file`) so operators see
  the fallback in startup logs. Many deployments set the env var for
  observability/rate-limiting without intending DB-config boot.
- Replace the positional `(…, …, bool /* config_in_database */)` tuple
  with a `StartupConfig` struct so callers don't rely on an
  inline-comment-documented bool.
- Introduce a `TENSORZERO_POSTGRES_URL_ENV` constant for the two new
  call sites in this file.
- Rewrite the empty-DB smoke test with `expect_that!` + `matches_pattern!`
  per `AGENTS.md` guidance, giving per-field failure diagnostics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
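
The fallback rules from the two commit messages above (empty URL treated as absent, single read of the env var, warning on the implicit path) can be sketched as one pure function. This is a hedged illustration only: `DbBoot`, `resolve_db_boot`, and the error text are invented names, and the sketch covers just the case where neither `--config-file` nor `--default-config` was given.

```rust
/// Hypothetical result of deciding to boot from the database.
#[derive(Debug, PartialEq)]
struct DbBoot {
    postgres_url: String,
    /// True when the feature flag was not set, i.e. the caller should log
    /// a WARN so operators notice the implicit fallback.
    implicit: bool,
}

/// Decides whether the gateway can boot from the database.
/// `raw_postgres_url` is the value read once from the environment.
fn resolve_db_boot(
    raw_postgres_url: Option<String>,
    feature_flag_enabled: bool,
) -> Result<DbBoot, String> {
    // An empty value is treated the same as an unset variable, so a
    // misconfigured shell or compose file yields the clear error below.
    let postgres_url = raw_postgres_url.filter(|url| !url.is_empty());
    match postgres_url {
        Some(postgres_url) => Ok(DbBoot {
            postgres_url,
            implicit: !feature_flag_enabled,
        }),
        // Assumed wording for the "no config source" error.
        None => Err("no config source: pass a config file or set a Postgres URL".to_string()),
    }
}
```

Taking the already-read value as an argument is what removes the double read: the caller does the single `std::env::var` lookup and passes the result down.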
`expect_that!` needs a `#[gtest]` test context to collect failures; the
`#[sqlx::test]` macro doesn't provide one, so using it here panics with
"No test context found" instead of running the assertion. Switch to
`assert_that!`, which works without the gtest context and matches the
convention used by every other test in this file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Spawns the actual gateway binary with `TENSORZERO_POSTGRES_URL` set
against a migrated Postgres and nothing else (no `--config-file`, no
`--default-config`, no `ENABLE_CONFIG_IN_DATABASE` feature flag) and
verifies the gateway binds a port, serves a healthy `/health`, and
returns a well-formed `StatusResponse` from `/status`. This is the
end-to-end counterpart to the unit-level empty-DB test on
`load_config_from_db`: that one proves the loader returns defaults,
this one proves the full binary actually reaches listening state and
answers HTTP with that defaulted config.

Also factors the "wait for listening + parse bound addr + build
ChildData" tail of `start_gateway_impl` into a shared
`await_gateway_listening` helper so the new
`start_gateway_from_db_url_on_random_port` helper doesn't duplicate it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extend the new integration test from "does the gateway serve /health"
to the full config-in-database scenario the UI will build on top of:
migrated Postgres, no config rows, no `--config-file`, feature flag
on, then assert:

- `/health` 200
- `/status` returns `ok` + a non-empty `config_hash`
- `/internal/config_toml` returns a default editable TOML whose hash
  matches `/status`, and whose `path_contents` is empty (no
  user-provided templates)
- The TOML body parses as a valid TOML table

The helper `start_gateway_from_db_url_on_random_port` now takes an
`extra_env` slice so callers can either exercise the implicit-opt-in
path (env var only) or the full config-in-database scenario (feature
flag on) without duplicating the subprocess plumbing. Adds `toml` to
gateway dev-dependencies for assertion-side parsing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a dedicated e2e scenario, parallel to `live-tests`,
`live-tests-config-in-database`, `evaluation-tests`, and the existing
live flavors: gateway booting from a migrated-but-empty
Postgres + ClickHouse stack with `ENABLE_CONFIG_IN_DATABASE=true` and
no `--config-file`. This is the deploy shape the configure-via-UI
story builds on — schema present, no config rows, no files on disk.

New pieces, all mirroring the existing config-in-database pattern:

- `crates/tensorzero-core/tests/e2e/docker-compose.db-only-boot.yml`:
  override of `docker-compose.live.yml` that drops
  `gateway-migrate-config`, flips the feature flag, clears
  `--config-file`, and uses `!override` on `volumes` to remove every
  inherited bind mount (config TOMLs, fixtures, credentials) — so the
  gateway literally has nothing on disk to read.
- `crates/tensorzero-core/tests/e2e/db_only_boot/mod.rs`: two
  `#[gtest] #[tokio::test]` Rust tests that run inside the live-tests
  container and hit the gateway over the compose network: one asserts
  `/status` reports the default config and a non-empty hash, the other
  asserts `/internal/config_toml` returns the same hash with empty
  `path_contents` and a TOML body that parses back as a valid table.
- `crates/.config/nextest.toml`: new `db-only-boot` profile filtering
  to `db_only_boot::` tests, and `e2e`'s `default-filter` excludes
  them so they only run in their own CI job.
- `.github/workflows/db-only-boot-e2e.yml`: new reusable workflow
  standing up the stack, running the profile inside `live-tests`, and
  asserting the gateway logs show the DB-authoritative boot banner.
- `.github/workflows/general.yml`: wires the new job behind
  `detect-changes.outputs.code`; `ci/check-all-general-jobs-passed.sh`
  adds it to `ALLOWED_SKIP` so the merge queue tolerates skipped runs.

Also drops the subprocess-spawning `crates/gateway/tests/boot_from_empty_db.rs`
and its helper additions in `gateway/tests/common/mod.rs` and
`gateway/Cargo.toml` — superseded by the in-container Rust test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AntoineToussaint and others added 5 commits April 23, 2026 19:34
Four small cleanups from the branch review:

- `load_startup_config_from_database` takes `Option<&str>` instead of
  `Option<String>` — the function never owned the url; caller now
  passes `postgres_url.as_deref()`.
- Consolidate `UnwrittenConfig` import into the existing
  `use tensorzero_core::config::{...}` block and drop the two inline
  long-form paths, per AGENTS.md.
- Fold the three separate `expect_that!` calls on `StatusResponse`
  into a single `matches_pattern!` — if the struct gains a field, the
  test now makes a conscious choice instead of silently ignoring it.
- Replace `toml::from_str(...).unwrap_or_else(|e| panic!(...))` with
  `assert_that!(parsed, ok(predicate(toml::Value::is_table)))` so
  success + the "is-a-table" check collapse to one googletest
  assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
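
The first cleanup above is the standard `Option<String>` to `Option<&str>` borrow. A small sketch of the idiom, using an invented `describe_source` function in place of `load_startup_config_from_database`:

```rust
/// Stand-in for a function that only needs to read the URL.
/// Borrowing `Option<&str>` means the caller keeps ownership.
fn describe_source(postgres_url: Option<&str>) -> String {
    match postgres_url {
        Some(url) => format!("database at {url}"),
        None => "no database".to_string(),
    }
}

/// Caller side: `as_deref()` turns `&Option<String>` into `Option<&str>`
/// without cloning, and `postgres_url` stays usable afterwards.
fn describe_twice(postgres_url: Option<String>) -> (String, String) {
    let first = describe_source(postgres_url.as_deref());
    let second = describe_source(postgres_url.as_deref());
    (first, second)
}
```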
Two mistakes in the initial push of the new job:

- The workflow pulls `tensorzero/live-tests:sha-$SHA` but only declared
  `build-gateway-container` in `needs:`. Adds
  `build-live-tests-container` and `build-fixtures-container` to the
  dependency list, matching `live-tests-config-in-database`. Also
  gates the job on the same fork/dependabot condition the sibling jobs
  use.
- `pre-commit`'s `check-yaml` can't parse Compose's `!override` custom
  tag, so `validate` failed on the new compose file. Excludes that
  single file from `check-yaml`; Docker Compose still validates it at
  stack-up time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CI job failed because `docker compose run live-tests` started its
full `depends_on` graph — including `fixtures-postgres`, which exits
1 when loading fixtures against a migrated-but-empty DB. The whole
point of this scenario is an empty DB, so fixture loading is a
semantic mismatch.

Override `live-tests.depends_on` with `!override` to keep only the
infra + gateway + migrations services and drop `fixtures` and
`fixtures-postgres`. The `up --wait gateway` and the subsequent
`run --rm live-tests` both pass locally after this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run the zero-config boot scenario in both observability modes:

- Postgres-config + ClickHouse-data (default TOML-config deploy shape)
- Postgres-config + Postgres-data (single-datastore deploy)

Matches the `live-tests` workflow's `database: [clickhouse, postgres]`
matrix. When `matrix.database == postgres`, sets
`TENSORZERO_INTERNAL_TEST_OBSERVABILITY_BACKEND=postgres` so the gateway
uses Postgres as the primary observability backend and exercises its
pgcron/pgvector/trigram extension checks.

The `check-all-general-jobs-passed.sh` ALLOWED_SKIP entry
(`db-only-boot-e2e`) already covers matrix-suffixed job names via the
existing `"entry ("` prefix match — no change needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before: the three `/internal/config_toml` routes were mounted only when
`ENABLE_CONFIG_IN_DATABASE` was set at router-build time. That ties the
endpoints to a process-wide flag and means a gateway booted implicitly
from a Postgres URL (no flag, no `--config-file`) would 404 the UI's
config bootstrap even though the stored-config tables are already the
source of truth.

Now: the routes are always mounted, and `get` / `apply` gate themselves
at request time on `app_state.config_in_database` — the same bit the
boot logic records when it decides which source to load from. The
`require_config_in_database(bool, &str)` helper keeps the 501 response
consistent across endpoints.

`validate` is intentionally ungated: it is stateless (parse + run the
shared load pipeline, no DB reads or writes), and the UI needs to lint
editable TOML even against a file-backed gateway. The docstring now
says so explicitly.

Also extend `db_only_boot_returns_default_config_via_config_toml_endpoint`
to assert the collection tables (functions/models/tools/metrics) are
absent or empty, and that `base_signature` is populated so callers can
use it as a CAS token on the first apply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
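
To illustrate what "use it as a CAS token" means in the commit message above, here is a generic compare-and-swap sketch. It is not the gateway's apply logic: `StoredConfig`, `ApplyError`, and the way signatures are supplied are all assumptions made for the example, and how the real gateway computes `base_signature` is not shown in this PR.

```rust
/// Hypothetical error returned when the caller's token is out of date.
#[derive(Debug, PartialEq)]
enum ApplyError {
    StaleBaseSignature { current: String },
}

/// Hypothetical in-memory stand-in for the stored config.
struct StoredConfig {
    toml: String,
    signature: String,
}

impl StoredConfig {
    /// Applies `new_toml` only if `base_signature` matches the signature of
    /// the config currently stored. `new_signature` is passed in because
    /// this sketch does not model how signatures are derived.
    fn apply(
        &mut self,
        base_signature: &str,
        new_toml: &str,
        new_signature: &str,
    ) -> Result<(), ApplyError> {
        if base_signature != self.signature {
            // Someone else changed the config since the caller read it.
            return Err(ApplyError::StaleBaseSignature {
                current: self.signature.clone(),
            });
        }
        self.toml = new_toml.to_string();
        self.signature = new_signature.to_string();
        Ok(())
    }
}
```

Under this model, a client reads the config, keeps the returned signature, and sends it back with its edit; a concurrent edit in between makes the second write fail instead of silently overwriting the first.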