Skip to content

fix: surface CI ingest failures in the PR comment#156

Open
veksen wants to merge 2 commits into
mainfrom
fix-surface-ci-ingest-failure
Open

fix: surface CI ingest failures in the PR comment#156
veksen wants to merge 2 commits into
mainfrom
fix-surface-ci-ingest-failure

Conversation

@veksen

@veksen veksen commented Jun 29, 2026

Copy link
Copy Markdown
Member

Why

When POST /ci/runs fails, the run silently vanishes. postToSiteApi logged a warning, returned null, and carried on — so the PR comment rendered with no dashboard link and a degraded "empty" look, the Actions check stayed green, and nothing was saved. The only trace was a line buried in the job log.

This actually happened: a Postgres-18 schema dump produced a not_null constraint type the Site API's ingest schema didn't accept, so it returned 400. 68 queries were analyzed, but the run showed up nowhere and the comment lost its link.

What a rejected ingest is

The analyzer did its whole job — captured, analyzed, dumped schema — and only the final persist step was refused. That's data loss, not a soft caveat, so it's an error worth telling the user. But the right severity and recipient depend on why it was rejected:

Kind Cause Recoverable Who fixes Behavior
transient network / timeout / 5xx yes (re-run) nobody core.warning, don't block
auth 401 / 403 no the user (token) core.setFailed — CI isn't running
rejected other 4xx (contract skew) no us, not them core.error (red, loud), non-blocking by default

Blocking a user's PR over our contract bug would be wrong, so rejected stays non-blocking unless a team opts in with FAIL_ON_INGEST_ERROR=true. Auth always fails (it's theirs to fix); transient never does.

Changes

  • postToSiteApi returns a discriminated PostRunOutcome carrying HTTP status + a length-capped body, instead of bare null.
  • classifyIngestFailure(status)transient | auth | rejected; main.ts picks the Actions severity from it and records ingestError (with kind) on the report context.
  • The comment renders a > [!CAUTION] banner with kind-specific copy (retry / fix token / out-of-sync) and a collapsible Details block (the API's error).
  • FAIL_ON_INGEST_ERROR env flag (default off) to escalate rejected to a hard check failure. Flows through action.yaml's inherited env; can be promoted to a formal action input later.

Tests

  • classifyIngestFailure: transient (null/5xx), auth (401/403), rejected (400/422).
  • postToSiteApi: ok on 2xx; failure with status + body on non-2xx; status: null when the request never completes.
  • comment banner: rejected (status + "re-running won't help" + details), auth ("Set a valid TOKEN"), transient ("re-run to retry"), and omitted on success.
  • Full suite: 211+ pass, tsc --noEmit + build green. (5 suites fail to load locally on a missing pg_restore binary — pre-existing/environment, identical on a clean tree.)

Companion

Server side, handled in the Site repo: report all exceptions (incl. handled 4xx) to Sentry so this class of failure pages the operator (Query-Doctor/Site#3426). The constraint-type enum that caused the original 400 is a separate root-cause follow-up.

🤖 Generated with Claude Code

When POST /ci/runs failed (e.g. the Site API rejected a payload), the
analyzer logged a warning, returned null, and carried on — producing a
degraded PR comment with no dashboard link that looked like a normal empty
result, plus a green check. The failure was effectively invisible.

postToSiteApi now returns a discriminated outcome carrying the HTTP status
and a capped response body on failure. On a failed ingest, main.ts records
it on the report context and emits a GitHub Actions warning annotation, and
the comment renders a prominent CAUTION banner (with a collapsible details
block) so the run no longer silently vanishes. Not setFailed — a Site API
hiccup shouldn't block merging; this is a "you should know" signal.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query Doctor Analysis

Caution

Query Doctor couldn't record this run. The API rejected the submission (HTTP 400), so it wasn't saved to the dashboard and the results below may be incomplete. This usually means the analyzer and Query Doctor are out of sync — re-running won't help; if it persists, it's a bug on our side.

Details
{"name":"ZodError","message":"[\n  {\n    \"code\": \"invalid_value\",\n    \"values\": [\n      \"check\",\n      \"foreign_key\",\n      \"primary_key\",\n      \"unique\",\n      \"exclusion\"\n    ],\n    \"path\": [\n      \"schema\",\n      \"constraints\",\n      0,\n      \"constraintType\"\n    ],\n    \"message\": \"Invalid option: expected one of \\\"check\\\"|\\\"foreign_key\\\"|\\\"primary_key\\\"|\\\"unique\\\"|\\\"exclusion\\\"\"\n  },\n  {\n    \"code\": \"invalid_value\",\n    \"values\": [\n      \"check\",\n      \"foreign_key\",\n      \"primary_key\",\n      \"unique\",\n …

3 queries analyzed

2 pre-existing issues
  • SELECT "guests"."id", "guests"."session_id", "guests"."username", "guests"."avatar_path", "guests"."color", "guests"."side", "guests"."audio_recording_path", "guests"."audio_recording_public", "gue...
    index assets(event_id, inserted_at desc)
    cost 31,003,449 → 1,498 (100% reduction)
  • SELECT * FROM guest_ip_addresses WHERE ip_address = '127.0.0.1';
    index guest_ip_addresses(ip_address)
    cost 154,402 → 8 (100% reduction)

Using assumed statistics (10000000 rows/table). For better results, sync production stats.

A rejected ingest isn't one thing — it's a computed run that couldn't be
saved, and what to do about it depends on why. Replace the single
core.warning with a classification:

- transient (network/timeout/5xx): recoverable → core.warning, "re-run".
- auth (401/403): the user's to fix and means CI isn't running →
  core.setFailed.
- rejected (other 4xx, e.g. analyzer/API contract skew): our bug, not the
  user's → core.error (red, loud) but non-blocking by default, since failing
  their PR over our fault is wrong. Opt in to block via FAIL_ON_INGEST_ERROR.

The PR-comment banner now tailors its copy per kind (retry / fix token /
out-of-sync). It's an error worth telling the user about — but named, scoped,
and routed to who can actually act.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@veksen veksen marked this pull request as ready for review June 29, 2026 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant