refactor(workers): make the queue scale to zero when idle #7
Merged
The delivery and schedule workers polled Postgres on a fixed 1s cadence and a 15s queue-depth scrape ran unconditionally. That continuous traffic kept the queue VM's TAP device busy and, because the queries hit Postgres, kept the Postgres VM busy too, so neither could ever reach instd's idle threshold. Of an app's 5 primitive VMs, queue and postgres never scaled to zero.

Make the workers event-driven instead of polling:

- Delivery worker drains due rows, then sleeps until the earliest pending `next_attempt_at` (or parks if the table is empty), woken in-process when a publish inserts new deliveries.
- Schedule worker fires what's due, then sleeps until the earliest active `next_fire_at` capped at KEEPALIVE_CAP (240s, under the 300s light-sleep window so the VM stays awake and fires on time while any schedule is active), woken in-process by `/schedules` mutations.
- Depth-scrape background loop removed; computed lazily in the `/metrics` handler so an unscraped (sleeping) VM emits nothing.

Route handlers (publish, schedule create/upsert/patch/run/delete) poke the relevant `tokio::sync::Notify`. Primitives run single-instance (max=1), so an in-process signal is sufficient: no LISTEN/NOTIFY, no extension change.

DB connect/first-query now retry with backoff and a longer acquire timeout (mirrors auth's connect_with_retry) so the first query after an idle Postgres deep-sleeps holds while it restores from S3 instead of failing at 5s.

Result: an idle app generates zero queue→Postgres traffic, so queue and postgres both sleep; verified by statement-level tracing (0 queries when idle) and the full integration suite. Apps with active schedules keep both awake (correct) at ~250x less idle DB traffic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
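The wait each worker computes after draining due work could look roughly like the sketch below. It is an illustration, not the code in this PR: `Wait` and `next_wait` are hypothetical names, timestamps are plain seconds, and parking the schedule worker when no schedule is active is an assumption inferred from the "zero traffic when idle" result. Only the 240s `KEEPALIVE_CAP` value comes from the commit message.

```rust
use std::time::Duration;

/// Keepalive cap stated in the commit message: 240s, under the 300s light-sleep window.
const KEEPALIVE_CAP: Duration = Duration::from_secs(240);

/// What a worker does once everything currently due has been handled.
#[derive(Debug, PartialEq)]
enum Wait {
    /// Nothing pending: block until an in-process wake-up arrives.
    Park,
    /// Sleep at most this long (a wake-up may cut it short).
    Sleep(Duration),
}

/// Hypothetical helper. `now` and `earliest` are seconds since an arbitrary epoch;
/// `earliest` is the smallest pending timestamp (e.g. the minimum `next_fire_at`), if any.
fn next_wait(now: u64, earliest: Option<u64>, cap: Option<Duration>) -> Wait {
    match earliest {
        None => Wait::Park,
        Some(at) => {
            // A timestamp in the past yields a zero sleep, so the loop drains again at once.
            let until = Duration::from_secs(at.saturating_sub(now));
            Wait::Sleep(match cap {
                Some(c) => until.min(c),
                None => until,
            })
        }
    }
}
```

Under these assumptions the delivery worker would call `next_wait(now, earliest, None)` and the schedule worker `next_wait(now, earliest, Some(KEEPALIVE_CAP))`, so a schedule due in an hour still wakes the worker every 240s.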
What
Makes the queue server's delivery and schedule workers event-driven instead of fixed-interval pollers, so an idle queue generates zero Postgres traffic and the VM (and Postgres) can scale to zero.
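To illustrate the "sleep until a deadline or until woken" pattern that replaces the fixed-interval poll, here is a minimal std-only sketch. It deliberately swaps in `Mutex` + `Condvar` for the `tokio::sync::Notify` the PR uses, so it runs without an async runtime; `Wake`, `notify`, and `wait` are hypothetical names, not identifiers from this change.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Std-only stand-in for a notify handle: a "wake-up pending" flag plus a condition variable.
#[derive(Default)]
struct Wake {
    pending: Mutex<bool>,
    cv: Condvar,
}

impl Wake {
    /// Called by a route handler after it has committed new work.
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.cv.notify_one();
    }

    /// Called by the worker. `None` parks indefinitely; `Some(t)` waits at most `t`.
    /// Returns true if woken by `notify`, false if the timeout elapsed.
    fn wait(&self, timeout: Option<Duration>) -> bool {
        let mut pending = self.pending.lock().unwrap();
        match timeout {
            None => {
                pending = self.cv.wait_while(pending, |p| !*p).unwrap();
            }
            Some(t) => {
                pending = self.cv.wait_timeout_while(pending, t, |p| !*p).unwrap().0;
            }
        }
        // Consume the stored wake-up so the next wait blocks again.
        std::mem::replace(&mut *pending, false)
    }
}
```

Because the flag is stored, a `notify` that lands while the worker is still busy draining is not lost: the next `wait` returns immediately. A single in-process handle like this only works because, as the PR notes, primitives run single-instance.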
Why
The delivery and schedule workers polled Postgres every 1s, and a 15s depth scrape ran unconditionally. That continuous traffic kept the queue VM's TAP device busy and, because the queries hit Postgres, kept the Postgres VM busy too, so neither ever reached instd's idle threshold. Of an app's 5 primitive VMs, queue and postgres never scaled to zero even for a completely idle app. (`SCHEDULES.md` previously documented this as intentional; that stance is reversed here.)
Changes
- Delivery worker (`src/ops/delivery.rs`): drains due rows, then sleeps until the earliest pending `next_attempt_at` (or parks if empty), woken in-process when a publish inserts deliveries.
- Schedule worker (`src/ops/schedule_worker.rs`): fires what's due, then sleeps until the earliest active `next_fire_at`, capped at `KEEPALIVE_CAP` (240s, under the 300s light-sleep window so the VM stays awake and fires on time while any schedule is active), woken in-process by `/schedules` mutations.
- Depth scrape: background loop removed; depth is computed lazily in the `/metrics` handler.
- Route handlers (publish, schedule create/upsert/patch/run/delete) poke the relevant `tokio::sync::Notify`. Primitives run single-instance (max=1), so an in-process signal suffices: no `LISTEN/NOTIFY`, no extension change, no Postgres republish.
- DB (`src/db.rs`): connect/first-query retry with backoff + 30s acquire timeout (mirrors auth's `connect_with_retry`) so the first query after Postgres deep-sleeps holds while it restores from S3.
- Docs (`SCHEDULES.md`, `ARCHITECTURE.md`) updated to the event-driven model.
Verification
- `unsubscribe_cancels_pending` now asserts the real guarantee (CASCADE-delete of a pending delivery) rather than relying on poll latency, since delivery is now immediate.
- `cargo clippy -D warnings` and `dprint` clean; `.sqlx` cache regenerated.

🤖 Generated with Claude Code
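For readers unfamiliar with the retry-with-backoff pattern behind the `src/db.rs` change, a generic sketch follows. It is not the PR's code and not auth's `connect_with_retry`: `retry_with_backoff`, its parameters, and the doubling policy are assumptions made for illustration. The sleep function is passed in so the example can be exercised without real delays; in practice one might pass `std::thread::sleep`.

```rust
use std::time::Duration;

/// Hypothetical helper: calls `op` (at least once) until it succeeds or
/// `max_attempts` is reached, doubling the delay after each failure up to `max_delay`.
/// `op` receives the 1-based attempt number.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    max_delay: Duration,
    mut sleep: impl FnMut(Duration),
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(v) => return Ok(v),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt >= max_attempts => return Err(e),
            Err(_) => {
                sleep(delay);
                delay = (delay * 2).min(max_delay);
                attempt += 1;
            }
        }
    }
}
```

The point of wrapping the first query this way is that a database still restoring from deep sleep looks like a transient failure, so the caller waits and retries instead of giving up on the first error.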