Add model provider timeouts and clearer offline/provider error feedback #4520

@KyleAMathews


Summary

When the dev UI/app can reach the local Agents server but the model provider is unreachable (for example, no Internet during a demo, Anthropic is down, or a provider request hangs), the UI can stay in a long-running "thinking" state with no indication of what went wrong.

We should add default model-provider timeout/error handling in the Pi adapter/runtime path and surface a clear durable error to the UI.

Current behavior / code paths

Electric Agents uses Pi in:

  • packages/agents-runtime/src/pi-adapter.ts
    • constructs new Agent(...) from @mariozechner/pi-agent-core
    • resolves models via getModel(...) from @mariozechner/pi-ai
    • subscribes to Pi events and maps them to Electric runtime events
  • packages/agents-runtime/src/context-factory.ts
    • calls handle.run(runInput, config.runSignal) inside ctx.agent.run()
  • packages/agents-runtime/src/outbound-bridge.ts
    • writes runs, steps, texts, toolCalls
    • maps finishReason === 'error' to run status: 'failed'
  • packages/agents-runtime/src/process-wake.ts
    • catches handler failures and writes an errors row with error_code: 'HANDLER_FAILED'
  • packages/agents-server-ui/src/components/AgentResponse.tsx
    • already renders run/errors rows inline

The UI can already render failures once the runtime writes them. The likely missing piece is making provider hangs fail fast enough and classifying failures into useful messages.

Upstream Pi research

Current upstream Pi repo:

Relevant upstream files:

  • packages/ai/src/types.ts
    • defines stream options including signal?: AbortSignal, timeoutMs?: number, maxRetries?: number, maxRetryDelayMs?: number
  • packages/ai/src/stream.ts
    • passes stream options through to providers
  • packages/ai/src/providers/anthropic.ts
    • passes signal to SDK request options
    • maps timeoutMs to Anthropic SDK timeout
    • supports maxRetries
    • emits terminal error/aborted stream events
  • packages/agent/src/types.ts
    • stream contract expects failures to be encoded as stream protocol events and final assistant messages with stopReason: 'error' | 'aborted' and errorMessage
  • packages/agent/src/agent.ts
    • agent has an internal AbortController per run and stores run failures in state/error messages
  • packages/agent/src/agent-loop.ts
    • passes the active run signal into streamSimple/custom stream functions
  • packages/agent/src/harness/agent-harness.ts
    • shows wrapper streamFn injecting timeoutMs, retry settings, auth, headers, signal, etc.

Related upstream issues:

Takeaway: upstream Pi has useful primitives (timeoutMs, AbortSignal, retry settings, terminal error events), but does not appear to provide a rich normalized taxonomy like offline | timeout | auth | rate_limit | provider_unavailable. Electric should use Pi's primitives and add Electric-specific classification/messages at the adapter/runtime boundary.

Goals

  1. Provider calls should not leave the UI hanging indefinitely.

  2. If a model provider is unreachable/offline/timed out, the run should settle as failed.

  3. The UI should show a clear message, e.g.:

    Could not reach Anthropic. Check your Internet connection or Anthropic status.

  4. Preserve the original provider error details for debugging/logs.

  5. Keep behavior configurable for development and future production use.

Non-goals

  • Do not implement a browser-side Internet/offline detector as the main solution.
  • Do not make the UI guess provider state from client connectivity.
  • Do not replace Pi's stream/error contract.
  • Do not hide provider error details entirely.

The server/runtime is the right place to know whether model calls are timing out or failing.

Proposed implementation

1. Add default model provider timeout/retry settings

In packages/agents-runtime/src/pi-adapter.ts, ensure the Pi stream path receives defaults such as:

const DEFAULT_MODEL_TIMEOUT_MS = 30_000
const DEFAULT_MODEL_MAX_RETRIES = 0

Use upstream Pi options where available:

timeoutMs: DEFAULT_MODEL_TIMEOUT_MS,
maxRetries: DEFAULT_MODEL_MAX_RETRIES,
signal: abortSignal,

If the currently installed @mariozechner/pi-ai / @mariozechner/pi-agent-core version does not expose timeoutMs, fall back to composing an AbortController timeout around the existing run signal.
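
The fallback could look roughly like the sketch below. `withModelTimeout` and `TimeoutSignal` are hypothetical names, not existing Pi or Electric APIs; the helper only assumes a standard `AbortController` and an optional run signal.

```typescript
// Hypothetical helper; the name and shape are illustrative only.
interface TimeoutSignal {
  signal: AbortSignal
  /** Clears the timer and the parent listener. Call once the model call settles. */
  dispose: () => void
}

function withModelTimeout(
  runSignal: AbortSignal | undefined,
  timeoutMs: number
): TimeoutSignal {
  const controller = new AbortController()

  // Forward an explicit abort (user cancel, SIGINT) together with its reason.
  const onParentAbort = (): void => controller.abort(runSignal?.reason)

  // Abort with a TimeoutError so downstream code can tell a timeout from a cancel.
  const timer = setTimeout(() => {
    const err = new Error(`Model provider call timed out after ${timeoutMs}ms`)
    err.name = 'TimeoutError'
    controller.abort(err)
  }, timeoutMs)

  const dispose = (): void => {
    clearTimeout(timer)
    runSignal?.removeEventListener('abort', onParentAbort)
  }

  // Whichever side fires first, stop the timer so it cannot keep the process alive.
  controller.signal.addEventListener('abort', dispose, { once: true })

  if (runSignal?.aborted) {
    onParentAbort()
  } else {
    runSignal?.addEventListener('abort', onParentAbort, { once: true })
  }

  return { signal: controller.signal, dispose }
}
```

The composed `signal` would be passed wherever the run signal goes today, with `dispose()` called in a `finally`. On runtimes that provide them, `AbortSignal.any([runSignal, AbortSignal.timeout(ms)])` is a shorter equivalent.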

Possible env/config knobs:

ELECTRIC_AGENTS_MODEL_TIMEOUT_MS=30000
ELECTRIC_AGENTS_MODEL_MAX_RETRIES=0

Open question: should these live in AgentConfig, runtime config, env vars, or all of the above?
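
One possible answer, sketched under the assumption that per-agent overrides win over env vars, which win over defaults. The env var names and defaults come from this issue; `resolveModelCallSettings`, `parseIntAtLeast`, and `ModelCallSettings` are hypothetical.

```typescript
const DEFAULT_MODEL_TIMEOUT_MS = 30_000
const DEFAULT_MODEL_MAX_RETRIES = 0

interface ModelCallSettings {
  timeoutMs: number
  maxRetries: number
}

// Parse an integer env value, falling back when it is unset, malformed, or below `min`.
function parseIntAtLeast(
  raw: string | undefined,
  min: number,
  fallback: number
): number {
  if (raw === undefined || raw.trim() === '') return fallback
  const value = Number(raw)
  return Number.isInteger(value) && value >= min ? value : fallback
}

// Precedence (an assumption): explicit overrides > env vars > defaults.
function resolveModelCallSettings(
  env: Record<string, string | undefined>,
  overrides: Partial<ModelCallSettings> = {}
): ModelCallSettings {
  return {
    timeoutMs:
      overrides.timeoutMs ??
      parseIntAtLeast(env['ELECTRIC_AGENTS_MODEL_TIMEOUT_MS'], 1, DEFAULT_MODEL_TIMEOUT_MS),
    maxRetries:
      overrides.maxRetries ??
      parseIntAtLeast(env['ELECTRIC_AGENTS_MODEL_MAX_RETRIES'], 0, DEFAULT_MODEL_MAX_RETRIES),
  }
}
```

Taking `env` as a parameter rather than reading `process.env` directly keeps the function easy to unit test.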

2. Ensure provider errors terminate the run

pi-adapter.ts already handles message_end with:

const isError =
  msg?.stopReason === `error` ||
  (!!msg?.errorMessage && msg.stopReason !== `aborted`)

and throws:

throw new Error(
  `pi-agent message_end error: ${msg.errorMessage ?? `unknown error`} ...`
)

Verify that provider timeout/offline failures reliably produce one of:

  • message_end with stopReason: 'error' and errorMessage
  • rejected agent.prompt(...) / agent.continue() promise
  • abort path via timeout signal

In all cases, the run should call bridge.onRunEnd({ finishReason: 'error' }) or bridge.onRunEnd({ finishReason: 'aborted' }) and not stay streaming forever.
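
A minimal sketch of that invariant, assuming the adapter can wrap the Pi call in a single function. `runAndSettle`, `RunEndSink`, and the `'stop'` success value are hypothetical; only `onRunEnd({ finishReason })` with `'error'` and `'aborted'` comes from the code paths above. It also assumes a timeout aborts the signal with a reason named `TimeoutError`.

```typescript
// 'stop' is a placeholder for whatever success value the bridge actually uses.
type FinishReason = 'stop' | 'error' | 'aborted'

interface RunEndSink {
  onRunEnd(info: { finishReason: FinishReason }): void
}

// True when the signal was aborted by a timeout rather than an explicit cancel.
function isTimeoutAbort(signal: AbortSignal): boolean {
  const reason: unknown = signal.reason
  return (
    typeof reason === 'object' &&
    reason !== null &&
    (reason as { name?: unknown }).name === 'TimeoutError'
  )
}

async function runAndSettle(
  run: (signal: AbortSignal) => Promise<void>,
  signal: AbortSignal,
  bridge: RunEndSink
): Promise<void> {
  let finishReason: FinishReason = 'stop'
  try {
    await run(signal)
    // Some stream implementations resolve quietly on abort instead of rejecting.
    if (signal.aborted) {
      if (isTimeoutAbort(signal)) throw signal.reason
      finishReason = 'aborted'
    }
  } catch (err) {
    const explicitAbort = signal.aborted && !isTimeoutAbort(signal)
    finishReason = explicitAbort ? 'aborted' : 'error'
    // Provider failures propagate so the wake handler can record an errors row.
    if (!explicitAbort) throw err
  } finally {
    // Runs exactly once on every path, so the run never stays in a streaming state.
    bridge.onRunEnd({ finishReason })
  }
}
```

Putting the `onRunEnd` call in `finally` is the design choice that matters here: message-level errors, rejected promises, and aborts all pass through the same exit.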

3. Add Electric-specific error classification

Add a small classifier near the adapter/runtime boundary, for example in pi-adapter.ts or a new runtime utility:

type ModelProviderErrorCode =
  | 'MODEL_PROVIDER_TIMEOUT'
  | 'MODEL_PROVIDER_UNREACHABLE'
  | 'MODEL_PROVIDER_AUTH_FAILED'
  | 'MODEL_PROVIDER_RATE_LIMITED'
  | 'MODEL_PROVIDER_UNAVAILABLE'
  | 'MODEL_PROVIDER_ERROR'

Classification can start out as simple matching on error names, codes, and message strings:

  • timeout:
    • AbortError, TimeoutError, timeout, timed out
  • offline/network:
    • ENOTFOUND, ECONNREFUSED, ECONNRESET, EAI_AGAIN, fetch failed, network, Failed to fetch
  • auth:
    • 401, invalid api key, authentication, unauthorized
  • rate limit:
    • 429, rate limit
  • provider unavailable:
    • 502, 503, 504, overloaded, unavailable
  • fallback:
    • MODEL_PROVIDER_ERROR
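
The list above could be implemented as an ordered rule table. This is a first-pass sketch: the patterns mirror the bullets, while the rule order (HTTP status first, then timeout, then network) and the `describeError` helper are assumptions.

```typescript
type ModelProviderErrorCode =
  | 'MODEL_PROVIDER_TIMEOUT'
  | 'MODEL_PROVIDER_UNREACHABLE'
  | 'MODEL_PROVIDER_AUTH_FAILED'
  | 'MODEL_PROVIDER_RATE_LIMITED'
  | 'MODEL_PROVIDER_UNAVAILABLE'
  | 'MODEL_PROVIDER_ERROR'

// Ordered rules: the first matching pattern wins.
const RULES: ReadonlyArray<readonly [ModelProviderErrorCode, RegExp]> = [
  ['MODEL_PROVIDER_AUTH_FAILED', /\b401\b|invalid api key|authentication|unauthorized/i],
  ['MODEL_PROVIDER_RATE_LIMITED', /\b429\b|rate limit/i],
  ['MODEL_PROVIDER_UNAVAILABLE', /\b50[234]\b|overloaded|unavailable/i],
  ['MODEL_PROVIDER_TIMEOUT', /aborterror|timeout|timed out/i],
  [
    'MODEL_PROVIDER_UNREACHABLE',
    /enotfound|econnrefused|econnreset|eai_again|fetch failed|failed to fetch|network/i,
  ],
]

// Flatten name, code, message, and up to two levels of `cause` into one string,
// since Node's fetch reports "fetch failed" with the real reason on `cause`.
function describeError(error: unknown, depth = 0): string {
  if (!(error instanceof Error)) return String(error)
  const extra = error as Error & { code?: unknown; cause?: unknown }
  const parts = [error.name, typeof extra.code === 'string' ? extra.code : '', error.message]
  if (extra.cause !== undefined && depth < 2) {
    parts.push(describeError(extra.cause, depth + 1))
  }
  return parts.join(' ')
}

function classifyModelProviderError(error: unknown): ModelProviderErrorCode {
  const text = describeError(error)
  for (const [code, pattern] of RULES) {
    if (pattern.test(text)) return code
  }
  return 'MODEL_PROVIDER_ERROR'
}
```

Because `AbortError` is also what an explicit cancel produces, callers would probably want to check the run signal before classifying, so user aborts are not reported as timeouts.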

4. Surface a clearer durable error

Currently process-wake.ts catches handler errors and writes:

error_code: `HANDLER_FAILED`,
message: errMsg,

We should preserve compatibility but expose model-provider errors more clearly.

Options:

Option A: throw a classified error and let process-wake.ts map it

Create a runtime error class:

class ModelProviderError extends Error {
  code: ModelProviderErrorCode
  provider?: string
  model?: string
  cause?: unknown
}

Then process-wake.ts can write:

error_code: error instanceof ModelProviderError
  ? error.code
  : 'HANDLER_FAILED',
message: error.message,

Option B: write an error row directly from the adapter

This is probably less clean because pi-adapter.ts currently writes run/step/text/tool events through OutboundBridge, not generic runtime errors.

Recommendation: Option A.
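
A runnable sketch of Option A end to end. The class fields follow the shape above; the constructor signature, the `ErrorRow` type (reduced here to the two fields this issue mentions), and `toErrorRow` are assumptions about how `process-wake.ts` might do the mapping.

```typescript
type ModelProviderErrorCode =
  | 'MODEL_PROVIDER_TIMEOUT'
  | 'MODEL_PROVIDER_UNREACHABLE'
  | 'MODEL_PROVIDER_AUTH_FAILED'
  | 'MODEL_PROVIDER_RATE_LIMITED'
  | 'MODEL_PROVIDER_UNAVAILABLE'
  | 'MODEL_PROVIDER_ERROR'

class ModelProviderError extends Error {
  readonly code: ModelProviderErrorCode
  readonly provider?: string
  readonly model?: string
  readonly cause?: unknown

  constructor(
    code: ModelProviderErrorCode,
    message: string,
    details: { provider?: string; model?: string; cause?: unknown } = {}
  ) {
    super(message)
    // Keeps `instanceof` working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype)
    this.name = 'ModelProviderError'
    this.code = code
    this.provider = details.provider
    this.model = details.model
    // The original provider error stays attached for logs and debugging.
    this.cause = details.cause
  }
}

// Hypothetical subset of the errors row; the real row likely has more fields.
interface ErrorRow {
  error_code: string
  message: string
}

// Unclassified failures keep today's HANDLER_FAILED code for compatibility.
function toErrorRow(error: unknown): ErrorRow {
  return {
    error_code: error instanceof ModelProviderError ? error.code : 'HANDLER_FAILED',
    message: error instanceof Error ? error.message : String(error),
  }
}
```

For example, the adapter could throw `new ModelProviderError('MODEL_PROVIDER_UNREACHABLE', 'Could not reach Anthropic.', { provider: 'anthropic', cause: err })` and the wake handler would pass whatever it catches through `toErrorRow`.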

5. Make UI messaging friendly

AgentResponse.tsx already renders:

 {error_code}: {message}

A minimal first slice can rely on this.

A follow-up could map specific error codes to friendlier copy or hide noisy internals. Example:

Could not reach Anthropic. Check your Internet connection or Anthropic status.

Instead of:

MODEL_PROVIDER_UNREACHABLE: fetch failed ENOTFOUND api.anthropic.com
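
The follow-up mapping might look like this sketch. Only the "Could not reach" wording comes from this issue; `friendlyProviderMessage` is a hypothetical helper, the other strings are placeholder copy, and it assumes the UI can obtain a display name for the provider.

```typescript
// Returns friendlier copy for known provider error codes, or undefined so the
// caller can fall back to the existing `{error_code}: {message}` rendering.
function friendlyProviderMessage(
  errorCode: string,
  providerName: string
): string | undefined {
  switch (errorCode) {
    case 'MODEL_PROVIDER_UNREACHABLE':
      return `Could not reach ${providerName}. Check your Internet connection or ${providerName} status.`
    case 'MODEL_PROVIDER_TIMEOUT':
      return `${providerName} did not respond in time. Try again in a moment.`
    case 'MODEL_PROVIDER_AUTH_FAILED':
      return `${providerName} rejected the API key. Check your credentials.`
    case 'MODEL_PROVIDER_RATE_LIMITED':
      return `${providerName} is rate limiting requests. Try again shortly.`
    case 'MODEL_PROVIDER_UNAVAILABLE':
      return `${providerName} is temporarily unavailable. Check ${providerName} status.`
    default:
      return undefined
  }
}
```

Returning `undefined` for unknown codes keeps `HANDLER_FAILED` and other existing rows rendering exactly as they do today.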

Example desired behavior

If the dev app is running locally but the machine has no Internet:

  1. User sends a message to Horton.
  2. UI shows thinking.
  3. Runtime starts model call with timeout.
  4. Provider call fails/times out.
  5. Runtime marks the run failed.
  6. UI exits thinking state and shows:

    Could not reach Anthropic. Check your Internet connection or Anthropic status.

The wake should close normally after recording the error.

Testing plan

Unit tests

Add tests around error classification:

classifyModelProviderError(new Error('fetch failed'))
// MODEL_PROVIDER_UNREACHABLE

classifyModelProviderError(new Error('timeout'))
// MODEL_PROVIDER_TIMEOUT

classifyModelProviderError(new Error('401 invalid api key'))
// MODEL_PROVIDER_AUTH_FAILED

classifyModelProviderError(new Error('429 rate limit'))
// MODEL_PROVIDER_RATE_LIMITED

Adapter tests

Mock/override streamFn or Pi agent stream behavior to simulate:

  • no response until timeout
  • rejected provider promise
  • terminal message_end with stopReason: 'error'
  • aborted run

Assert:

  • run is marked failed for provider errors
  • run is marked completed/aborted for explicit aborts as appropriate
  • no indefinite pending state
  • classified error is written or thrown through the process-wake path

UI smoke test

Create an entity run that writes a classified error and verify AgentResponse.tsx renders it.

Open questions

  1. What should the default timeout be?

    • 30s is demo-friendly.
    • Longer may be safer for slow providers/models.
  2. Should timeout be per model/provider?

    • Some reasoning models may legitimately take longer before first token.
  3. Should timeout mean time-to-first-event or total model call duration?

    • For the demo offline case, time-to-first-event timeout is probably sufficient.
    • A separate max total run duration could be useful later.
  4. Should retries default to 0?

    • Upstream Anthropic provider appears to default retries to 0 in current code.
    • For demos/offline handling, retries can make failures feel like hangs.
  5. Should classified provider errors be represented in runs, steps, errors, or all of the above?

    • Today the UI can read run errors, but the best durable shape still needs to be confirmed.
  6. Should this be configured through AgentConfig?

    Example:

    ctx.useAgent({
      model,
      provider,
      modelTimeoutMs: 30_000,
      modelMaxRetries: 0,
    })

    Or keep as runtime/env config first.
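
For question 3, a time-to-first-event timeout could be sketched as a watchdog that is disarmed by the first stream event. Everything here is hypothetical (`createFirstEventWatchdog`, `Schedule`, `Cancel`); the injectable scheduler exists only so tests can trigger the timeout without waiting.

```typescript
type Cancel = () => void
type Schedule = (fn: () => void, ms: number) => Cancel

// Default scheduler backed by real timers.
const realSchedule: Schedule = (fn, ms) => {
  const timer = setTimeout(fn, ms)
  return () => clearTimeout(timer)
}

function createFirstEventWatchdog(
  firstEventTimeoutMs: number,
  schedule: Schedule = realSchedule
) {
  const controller = new AbortController()

  // Armed immediately; aborts with a TimeoutError if no event arrives in time.
  let cancel: Cancel | undefined = schedule(() => {
    const err = new Error(`No provider event within ${firstEventTimeoutMs}ms`)
    err.name = 'TimeoutError'
    controller.abort(err)
  }, firstEventTimeoutMs)

  return {
    signal: controller.signal,
    /** Call on every stream event; only the first call has any effect. */
    onEvent(): void {
      cancel?.()
      cancel = undefined
    },
  }
}
```

The adapter's Pi event subscription would call `onEvent()` for each event and pass `signal` into the model call. A slow reasoning model that streams anything at all is then never cut off, while an offline provider fails after the first-event window.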

Acceptance criteria

  • Model-provider calls have a default timeout or equivalent abort mechanism.
  • Anthropic/OpenAI unreachable/offline failures do not leave the UI thinking indefinitely.
  • A failed provider call settles the run and closes the wake.
  • The entity timeline contains a clear durable error code/message.
  • The UI shows a useful error without requiring a page refresh.
  • Existing explicit abort/SIGINT behavior remains correct.
