
refactor(workers): make the queue scale to zero when idle#7

Merged: jaredLunde merged 1 commit into main from queue-scale-to-zero on Jun 27, 2026

@jaredLunde

What

Makes the queue server's delivery and schedule workers event-driven instead of fixed-interval pollers, so an idle queue generates zero Postgres traffic and the VM (and Postgres) can scale to zero.

Why

The delivery and schedule workers polled Postgres every 1s, and a 15s depth scrape ran unconditionally. That continuous traffic kept the queue VM's TAP device busy — and, because the queries hit Postgres, kept the Postgres VM busy too — so neither ever reached instd's idle threshold. Of an app's 5 primitive VMs, queue and postgres never scaled to zero even for a completely idle app. (SCHEDULES.md previously documented this as intentional; that stance is reversed here.)

Changes

  • Delivery worker (src/ops/delivery.rs): drains due rows, then sleeps until the earliest pending next_attempt_at (or parks if empty), woken in-process when a publish inserts deliveries.
  • Schedule worker (src/ops/schedule_worker.rs): fires what's due, then sleeps until the earliest active next_fire_at, capped at KEEPALIVE_CAP (240s, under the 300s light-sleep window so the VM stays awake and fires on time while any schedule is active), woken in-process by /schedules mutations.
  • Depth scrape: background loop removed; computed lazily in the /metrics handler.
  • Wiring: route handlers poke a tokio::sync::Notify. Primitives run single-instance (max=1), so an in-process signal suffices — no LISTEN/NOTIFY, no extension change, no Postgres republish.
  • Wake resilience (src/db.rs): connect and first-query retry with backoff, plus a 30s acquire timeout (mirrors auth's connect_with_retry), so the first query after Postgres has deep-slept waits for it to restore from S3 instead of failing.
  • Docs (SCHEDULES.md, ARCHITECTURE.md) updated to the event-driven model.
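
The wait rules for the two workers can be sketched as a pair of pure functions. This is a minimal illustration, not the code in src/ops: the `Wait` enum and both function names are hypothetical, `Instant` stands in for the database timestamps, and parking the schedule worker when no schedule is active is an assumption.

```rust
use std::time::{Duration, Instant};

/// Cap from the PR description: 240s, under the 300s light-sleep window.
const KEEPALIVE_CAP: Duration = Duration::from_secs(240);

/// What a worker does after handling everything that is currently due.
/// (Hypothetical type, used only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
enum Wait {
    /// Nothing pending: block until an in-process wake-up arrives.
    Park,
    /// Sleep for at most this long; a wake-up may still end it early.
    Sleep(Duration),
}

/// Delivery worker: sleep until the earliest pending `next_attempt_at`,
/// or park when no delivery is pending.
fn delivery_wait(now: Instant, earliest_attempt: Option<Instant>) -> Wait {
    match earliest_attempt {
        None => Wait::Park,
        // saturating_duration_since yields zero for timestamps already past.
        Some(at) => Wait::Sleep(at.saturating_duration_since(now)),
    }
}

/// Schedule worker: same idea, but never sleep longer than KEEPALIVE_CAP
/// while a schedule is active. Parking when none is active is an assumption.
fn schedule_wait(now: Instant, earliest_fire: Option<Instant>) -> Wait {
    match earliest_fire {
        None => Wait::Park,
        Some(at) => Wait::Sleep(at.saturating_duration_since(now).min(KEEPALIVE_CAP)),
    }
}
```

With a schedule due in an hour, `schedule_wait` returns `Wait::Sleep(KEEPALIVE_CAP)`, so the worker wakes every 240s. That is how an active schedule keeps the VM inside the light-sleep window.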

Verification

  • 33 unit + 80 integration tests pass against real Postgres.
  • Reworked unsubscribe_cancels_pending to assert the real guarantee (CASCADE-delete of a pending delivery) rather than relying on poll latency — delivery is now immediate.
  • Statement-level trace: an idle queue issues 0 Postgres queries over a 12s window (was ~2/s).
  • cargo clippy -D warnings and dprint clean; .sqlx cache regenerated.

🤖 Generated with Claude Code

@jaredLunde jaredLunde force-pushed the queue-scale-to-zero branch 2 times, most recently from 905cd8a to c60b002 (June 27, 2026 19:11)
The delivery and schedule workers polled Postgres on a fixed 1s cadence and a
15s queue-depth scrape ran unconditionally. That continuous traffic kept the
queue VM's TAP device busy — and, because the queries hit Postgres, kept the
Postgres VM busy too — so neither could ever reach instd's idle threshold.
Of an app's 5 primitive VMs, queue and postgres never scaled to zero.

Make the workers event-driven instead of polling:

- Delivery worker drains due rows, then sleeps until the earliest pending
  `next_attempt_at` (or parks if the table is empty), woken in-process when a
  publish inserts new deliveries.
- Schedule worker fires what's due, then sleeps until the earliest active
  `next_fire_at` capped at KEEPALIVE_CAP (240s, under the 300s light-sleep
  window so the VM stays awake and fires on time while any schedule is active),
  woken in-process by `/schedules` mutations.
- Depth-scrape background loop removed; computed lazily in the `/metrics`
  handler so an unscraped (sleeping) VM emits nothing.
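
The lazy depth computation can be illustrated with a stand-in that counts queries. Everything here is hypothetical (the `FakeDb` type, the `queue_depth` method and the metric name); the real handler would run a SQL query against the pool. The sketch only shows that no query happens until a scrape asks for the value.

```rust
use std::cell::Cell;

/// Stand-in for the Postgres pool; it only counts how often it is queried.
struct FakeDb {
    queries: Cell<u32>,
    pending_rows: u64,
}

impl FakeDb {
    /// Hypothetical depth query; a real one would count pending rows in SQL.
    fn queue_depth(&self) -> u64 {
        self.queries.set(self.queries.get() + 1);
        self.pending_rows
    }
}

/// Body of a `/metrics` response, computed only when a scrape arrives.
/// The metric name `queue_depth` is an assumption of this sketch.
fn render_metrics(db: &FakeDb) -> String {
    format!("queue_depth {}\n", db.queue_depth())
}
```

Because nothing calls `render_metrics` on a timer, a VM that is never scraped never touches the database for depth.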

Route handlers (publish, schedule create/upsert/patch/run/delete) poke the
relevant `tokio::sync::Notify`. Primitives run single-instance (max=1), so an
in-process signal is sufficient — no LISTEN/NOTIFY, no extension change.
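
The poke-and-wait pattern can be sketched with the standard library alone. This is a blocking analogue, not the actual wiring: a capacity-1 `std::sync::mpsc` channel stands in for `tokio::sync::Notify`, and `wake_channel`, `poke`, `wait_for_work` and `Woke` are hypothetical names. As with a stored `Notify` permit, a poke sent before the worker starts waiting is not lost, and a burst of pokes collapses into a single wake-up.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, SyncSender};
use std::time::Duration;

/// Why the worker stopped waiting. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq, Eq)]
enum Woke {
    Poked,
    Deadline,
    Shutdown,
}

/// One buffered slot: at most one wake-up is ever queued.
fn wake_channel() -> (SyncSender<()>, Receiver<()>) {
    mpsc::sync_channel(1)
}

/// Called by a route handler after it inserts or changes rows.
fn poke(tx: &SyncSender<()>) {
    // try_send fails with Full when a wake-up is already queued; that is fine.
    let _ = tx.try_send(());
}

/// Called by the worker: park (None) or sleep up to `deadline`,
/// returning early if a handler pokes it.
fn wait_for_work(rx: &Receiver<()>, deadline: Option<Duration>) -> Woke {
    match deadline {
        None => match rx.recv() {
            Ok(()) => Woke::Poked,
            Err(_) => Woke::Shutdown,
        },
        Some(d) => match rx.recv_timeout(d) {
            Ok(()) => Woke::Poked,
            Err(RecvTimeoutError::Timeout) => Woke::Deadline,
            Err(RecvTimeoutError::Disconnected) => Woke::Shutdown,
        },
    }
}
```

The worker re-queries for due rows after either `Poked` or `Deadline`, so a spurious or coalesced wake-up costs one extra query at most.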

DB connect and first-query now retry with backoff and use a longer acquire
timeout (mirrors auth's connect_with_retry), so the first query after an idle
Postgres has deep-slept waits for the restore from S3 instead of failing at 5s.
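
Retry with capped exponential backoff can be sketched as below. This is a blocking, std-only illustration rather than the code in src/db.rs or auth's connect_with_retry: the function name, its parameters and the doubling policy are all assumptions.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` until it succeeds or `max_attempts` is reached, sleeping between
/// attempts. The delay doubles after each failure and is capped at `max_delay`.
/// `op` always runs at least once.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    max_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(delay);
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
        }
    }
}
```

A caller would wrap the connect or first query in the closure, for example `retry_with_backoff(5, Duration::from_millis(200), Duration::from_secs(5), || try_connect())`, where `try_connect` is a placeholder for whatever performs the attempt.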

Result: an idle app generates zero queue→Postgres traffic, so queue and
postgres both sleep; verified by statement-level tracing (0 queries when idle)
and the full integration suite. Apps with active schedules keep both awake
(correct) at ~250x less idle DB traffic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jaredLunde jaredLunde force-pushed the queue-scale-to-zero branch from c60b002 to afab29c (June 27, 2026 19:24)
@jaredLunde jaredLunde merged commit 731d603 into main Jun 27, 2026
6 checks passed
@jaredLunde jaredLunde deleted the queue-scale-to-zero branch June 27, 2026 19:31