Add model provider timeouts and clearer offline/provider error feedback #4520

## Summary

When the dev UI/app is connected to the local Agents server but the model provider is unreachable (for example, no Internet during a demo, Anthropic is unreachable, or a provider request hangs), the UI can stay in a long-running "thinking" state with little feedback about what went wrong.

We should add default model-provider timeout/error handling in the Pi adapter/runtime path and surface a clear, durable error to the UI.

## Current behavior / code paths

Electric Agents uses Pi in:

- `packages/agents-runtime/src/pi-adapter.ts`
  - constructs `new Agent(...)` from `@mariozechner/pi-agent-core`
  - resolves models via `getModel(...)` from `@mariozechner/pi-ai`
  - subscribes to Pi events and maps them to Electric runtime events
- `packages/agents-runtime/src/context-factory.ts`
  - calls `handle.run(runInput, config.runSignal)` inside `ctx.agent.run()`
- `packages/agents-runtime/src/outbound-bridge.ts`
  - writes `runs`, `steps`, `texts`, `toolCalls`
  - maps `finishReason === 'error'` to run `status: 'failed'`
- `packages/agents-runtime/src/process-wake.ts`
  - catches handler failures and writes an `errors` row with `error_code: 'HANDLER_FAILED'`
- `packages/agents-server-ui/src/components/AgentResponse.tsx`
  - already renders run/errors rows inline

The UI can already render failures once the runtime writes them. The likely missing piece is making provider hangs fail fast enough and classifying failures into useful messages.

## Upstream Pi research

Current upstream Pi repo:

- https://github.com/earendil-works/pi

Relevant upstream files:

- `packages/ai/src/types.ts`
  - defines stream options including `signal?: AbortSignal`, `timeoutMs?: number`, `maxRetries?: number`, `maxRetryDelayMs?: number`
- `packages/ai/src/stream.ts`
  - passes stream options through to providers
- `packages/ai/src/providers/anthropic.ts`
  - passes `signal` to SDK request options
  - maps `timeoutMs` to Anthropic SDK `timeout`
  - supports `maxRetries`
  - emits terminal error/aborted stream events
- `packages/agent/src/types.ts`
  - stream contract expects failures to be encoded as stream protocol events and final assistant messages with `stopReason: 'error' | 'aborted'` and `errorMessage`
- `packages/agent/src/agent.ts`
  - agent has an internal `AbortController` per run and stores run failures in state/error messages
- `packages/agent/src/agent-loop.ts`
  - passes the active run `signal` into `streamSimple`/custom stream functions
- `packages/agent/src/harness/agent-harness.ts`
  - shows wrapper `streamFn` injecting `timeoutMs`, retry settings, auth, headers, signal, etc.

Related upstream issues:

- https://github.com/earendil-works/pi/issues/2498
- https://github.com/earendil-works/pi/issues/3627
- https://github.com/earendil-works/pi/issues/2381
- https://github.com/earendil-works/pi/issues/4666
- https://github.com/earendil-works/pi/issues/3258

Takeaway: upstream Pi has useful primitives (`timeoutMs`, `AbortSignal`, retry settings, terminal error events), but does not appear to provide a rich normalized taxonomy like `offline | timeout | auth | rate_limit | provider_unavailable`. Electric should use Pi's primitives and add Electric-specific classification/messages at the adapter/runtime boundary.

## Goals

1. Provider calls should not leave the UI hanging indefinitely.
2. If a model provider is unreachable/offline/timed out, the run should settle as failed.
3. The UI should show a clear message, e.g.:

   ```txt
   Could not reach Anthropic. Check your Internet connection or Anthropic status.
   ```

4. Preserve the original provider error details for debugging/logs.
5. Keep behavior configurable for development and future production use.

## Non-goals

- Do not implement a browser-side Internet/offline detector as the main solution.
- Do not make the UI guess provider state from client connectivity.
- Do not replace Pi's stream/error contract.
- Do not hide provider error details entirely.

The server/runtime is the right place to know whether model calls are timing out or failing.

## Proposed implementation

### 1. Add default model provider timeout/retry settings

In `packages/agents-runtime/src/pi-adapter.ts`, ensure the Pi stream path receives defaults such as:

```ts
const DEFAULT_MODEL_TIMEOUT_MS = 30_000
const DEFAULT_MODEL_MAX_RETRIES = 0
```

Use upstream Pi options where available:

```ts
timeoutMs: DEFAULT_MODEL_TIMEOUT_MS,
maxRetries: DEFAULT_MODEL_MAX_RETRIES,
signal: abortSignal,
```

If the currently installed `@mariozechner/pi-ai` / `@mariozechner/pi-agent-core` version does not expose `timeoutMs`, fall back to composing an `AbortController` timeout around the existing run signal.
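
A minimal sketch of that fallback, using only the standard `AbortController`/`AbortSignal` APIs. The helper name `withTimeout` and its return shape are hypothetical and do not exist in Pi or the Electric runtime. It derives a signal that aborts when either the run signal aborts or the timer fires, and names the timeout reason `TimeoutError` so later classification can tell it apart from an explicit abort.

```ts
// Hypothetical helper; not an existing Pi or Electric API.
function withTimeout(
  parent: AbortSignal | undefined,
  timeoutMs: number
): { signal: AbortSignal; dispose: () => void } {
  const controller = new AbortController()

  // Abort with a reason named TimeoutError so a timeout can be told
  // apart from an explicit abort of the run.
  const timer = setTimeout(() => {
    const reason = new Error(`Model call exceeded ${timeoutMs}ms`)
    reason.name = 'TimeoutError'
    controller.abort(reason)
  }, timeoutMs)

  // Forward an explicit abort of the run, preserving its reason.
  const onParentAbort = () => controller.abort(parent?.reason)

  const dispose = () => {
    clearTimeout(timer)
    parent?.removeEventListener('abort', onParentAbort)
  }
  // Clean up as soon as the derived signal fires for either cause.
  controller.signal.addEventListener('abort', dispose, { once: true })

  if (parent?.aborted) {
    onParentAbort()
  } else {
    parent?.addEventListener('abort', onParentAbort, { once: true })
  }

  return { signal: controller.signal, dispose }
}
```

The caller would pass `signal` to the model call in place of the raw run signal and call `dispose()` once the call settles, so the timer does not outlive the run.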

Possible env/config knobs:

```sh
ELECTRIC_AGENTS_MODEL_TIMEOUT_MS=30000
ELECTRIC_AGENTS_MODEL_MAX_RETRIES=0
```
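
One way these knobs could be read, as a sketch only: the function names below are hypothetical, and the env var names are the ones proposed above. Invalid or missing values fall back to the defaults rather than throwing, which seems like the safer choice for a dev server.

```ts
const DEFAULT_MODEL_TIMEOUT_MS = 30_000
const DEFAULT_MODEL_MAX_RETRIES = 0

// Parse an integer >= min, falling back when the value is missing or invalid.
function readInt(raw: string | undefined, fallback: number, min: number): number {
  if (raw === undefined || raw.trim() === '') return fallback
  const value = Number(raw)
  return Number.isInteger(value) && value >= min ? value : fallback
}

// Hypothetical resolver; takes the env as an argument so it is easy to test.
function resolveModelCallConfig(env: Record<string, string | undefined>): {
  timeoutMs: number
  maxRetries: number
} {
  return {
    // A timeout of 0 would abort immediately, so require at least 1ms.
    timeoutMs: readInt(env.ELECTRIC_AGENTS_MODEL_TIMEOUT_MS, DEFAULT_MODEL_TIMEOUT_MS, 1),
    maxRetries: readInt(env.ELECTRIC_AGENTS_MODEL_MAX_RETRIES, DEFAULT_MODEL_MAX_RETRIES, 0),
  }
}
```

In the runtime this would be called as `resolveModelCallConfig(process.env)`.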

Open question: should these live in `AgentConfig`, runtime config, env vars, or all of the above?

### 2. Ensure provider errors terminate the run

`pi-adapter.ts` already handles `message_end` with:

```ts
const isError =
  msg?.stopReason === `error` ||
  (!!msg?.errorMessage && msg.stopReason !== `aborted`)
```

and throws:

```ts
throw new Error(
  `pi-agent message_end error: ${msg.errorMessage ?? `unknown error`} ...`
)
```

Verify that provider timeout/offline failures reliably produce one of:

- `message_end` with `stopReason: 'error'` and `errorMessage`
- rejected `agent.prompt(...)` / `agent.continue()` promise
- abort path via timeout signal

In all cases, the adapter should call `bridge.onRunEnd({ finishReason: 'error' })` or `bridge.onRunEnd({ finishReason: 'aborted' })` so that the run settles instead of streaming forever.
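
The "always settles" requirement can be sketched with a `try`/`finally` wrapper. This is an illustration under assumptions: the `RunBridge` interface only mirrors the `onRunEnd({ finishReason })` calls quoted above, while the `'stop'` value, `runAndSettle`, and `isTimeoutReason` are hypothetical names. It treats a timeout as an error and an explicit abort as aborted.

```ts
// Assumed shapes; the real OutboundBridge type may differ.
type FinishReason = 'stop' | 'error' | 'aborted'

interface RunBridge {
  onRunEnd(event: { finishReason: FinishReason }): void
}

function isTimeoutReason(reason: unknown): boolean {
  return reason instanceof Error && reason.name === 'TimeoutError'
}

async function runAndSettle(
  bridge: RunBridge,
  signal: AbortSignal,
  run: () => Promise<void>
): Promise<void> {
  let finishReason: FinishReason = 'stop'
  try {
    await run()
    // A run that returns quietly after an abort is still aborted or timed out.
    if (signal.aborted) {
      finishReason = isTimeoutReason(signal.reason) ? 'error' : 'aborted'
    }
  } catch (err) {
    // Explicit aborts stay 'aborted'; timeouts and provider failures are errors.
    const explicitAbort = signal.aborted && !isTimeoutReason(signal.reason)
    finishReason = explicitAbort ? 'aborted' : 'error'
    throw err
  } finally {
    // Runs on every path, so the run never stays in a streaming state.
    bridge.onRunEnd({ finishReason })
  }
}
```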

### 3. Add Electric-specific error classification

Add a small classifier near the adapter/runtime boundary, for example in `pi-adapter.ts` or a new runtime utility:

```ts
type ModelProviderErrorCode =
  | 'MODEL_PROVIDER_TIMEOUT'
  | 'MODEL_PROVIDER_UNREACHABLE'
  | 'MODEL_PROVIDER_AUTH_FAILED'
  | 'MODEL_PROVIDER_RATE_LIMITED'
  | 'MODEL_PROVIDER_UNAVAILABLE'
  | 'MODEL_PROVIDER_ERROR'
```

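Classification can start with simple matching on the error name and message:

- timeout:
  - `AbortError` (only when the abort came from the timeout rather than an explicit user abort), `TimeoutError`, `timeout`, `timed out`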
- offline/network:
  - `ENOTFOUND`, `ECONNREFUSED`, `ECONNRESET`, `EAI_AGAIN`, `fetch failed`, `network`, `Failed to fetch`
- auth:
  - `401`, `invalid api key`, `authentication`, `unauthorized`
- rate limit:
  - `429`, `rate limit`
- provider unavailable:
  - `502`, `503`, `504`, `overloaded`, `unavailable`
- fallback:
  - `MODEL_PROVIDER_ERROR`
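
The rules above could be sketched as an ordered list of patterns where the first match wins. This is a starting point, not a definitive classifier: the `RULES` table and the cause-chain walk are assumptions, and broad keywords such as `network` may need tightening. Walking `cause` helps because Node's `fetch` typically reports `fetch failed` with the system error (for example `ENOTFOUND`) attached as the cause.

```ts
type ModelProviderErrorCode =
  | 'MODEL_PROVIDER_TIMEOUT'
  | 'MODEL_PROVIDER_UNREACHABLE'
  | 'MODEL_PROVIDER_AUTH_FAILED'
  | 'MODEL_PROVIDER_RATE_LIMITED'
  | 'MODEL_PROVIDER_UNAVAILABLE'
  | 'MODEL_PROVIDER_ERROR'

// Order matters: the first matching rule wins.
const RULES: Array<{ code: ModelProviderErrorCode; pattern: RegExp }> = [
  { code: 'MODEL_PROVIDER_TIMEOUT', pattern: /aborterror|timeouterror|timeout|timed out/i },
  {
    code: 'MODEL_PROVIDER_UNREACHABLE',
    pattern: /enotfound|econnrefused|econnreset|eai_again|fetch failed|failed to fetch|network/i,
  },
  { code: 'MODEL_PROVIDER_AUTH_FAILED', pattern: /\b401\b|invalid api key|authentication|unauthorized/i },
  { code: 'MODEL_PROVIDER_RATE_LIMITED', pattern: /\b429\b|rate limit/i },
  { code: 'MODEL_PROVIDER_UNAVAILABLE', pattern: /\b50[234]\b|overloaded|unavailable/i },
]

function classifyModelProviderError(error: unknown): ModelProviderErrorCode {
  // Collect name, message, and code from the error and its causes.
  const parts: string[] = []
  let current: unknown = error
  for (let depth = 0; current != null && depth < 5; depth++) {
    if (!(current instanceof Error)) {
      parts.push(String(current))
      break
    }
    parts.push(current.name, current.message)
    const code = (current as { code?: unknown }).code
    if (typeof code === 'string') parts.push(code)
    current = (current as { cause?: unknown }).cause
  }
  const haystack = parts.join(' ')
  return RULES.find((rule) => rule.pattern.test(haystack))?.code ?? 'MODEL_PROVIDER_ERROR'
}
```

Note that an explicit user abort also surfaces as `AbortError`, so the caller should check the run signal before classifying an abort as a timeout.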

### 4. Surface a clearer durable error

Currently `process-wake.ts` catches handler errors and writes:

```ts
error_code: `HANDLER_FAILED`,
message: errMsg,
```

We should preserve compatibility but expose model-provider errors more clearly.

Options:

#### Option A: throw a classified error and let `process-wake.ts` map it

Create a runtime error class:

```ts
class ModelProviderError extends Error {
  code: ModelProviderErrorCode
  provider?: string
  model?: string
  cause?: unknown
}
```

Then `process-wake.ts` can write:

```ts
error_code: error instanceof ModelProviderError
  ? error.code
  : 'HANDLER_FAILED',
message: error.message,
```
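
Put together, Option A might look like the following sketch. The constructor signature and the `toErrorRow` helper are hypothetical; the row here only carries the two fields shown above, and the real `errors` row in `process-wake.ts` may have more. The original provider error is kept on `cause` so it stays available for logs.

```ts
type ModelProviderErrorCode =
  | 'MODEL_PROVIDER_TIMEOUT'
  | 'MODEL_PROVIDER_UNREACHABLE'
  | 'MODEL_PROVIDER_AUTH_FAILED'
  | 'MODEL_PROVIDER_RATE_LIMITED'
  | 'MODEL_PROVIDER_UNAVAILABLE'
  | 'MODEL_PROVIDER_ERROR'

class ModelProviderError extends Error {
  readonly code: ModelProviderErrorCode
  readonly provider?: string
  readonly model?: string
  readonly cause?: unknown

  constructor(
    code: ModelProviderErrorCode,
    message: string,
    details: { provider?: string; model?: string; cause?: unknown } = {}
  ) {
    super(message)
    this.name = 'ModelProviderError'
    this.code = code
    this.provider = details.provider
    this.model = details.model
    // Preserve the original provider error for debugging.
    this.cause = details.cause
  }
}

// Hypothetical mapping step for the catch block in process-wake.ts.
function toErrorRow(error: unknown): { error_code: string; message: string } {
  return {
    // Unclassified failures keep the existing code for compatibility.
    error_code: error instanceof ModelProviderError ? error.code : 'HANDLER_FAILED',
    message: error instanceof Error ? error.message : String(error),
  }
}
```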

#### Option B: write an error row directly from the adapter

This is probably less clean because `pi-adapter.ts` currently writes run/step/text/tool events through `OutboundBridge`, not generic runtime errors.

Recommendation: Option A.

### 5. Make UI messaging friendly

`AgentResponse.tsx` already renders:

```tsx
✗ {error_code}: {message}
```

A minimal first slice can rely on this.

A follow-up could map specific error codes to friendlier copy or hide noisy internals. Example:

```txt
Could not reach Anthropic. Check your Internet connection or Anthropic status.
```

Instead of:

```txt
MODEL_PROVIDER_UNREACHABLE: fetch failed ENOTFOUND api.anthropic.com
```
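
A possible shape for that follow-up, as a sketch: only the "unreachable" wording comes from this issue; the other strings are placeholder copy, and `friendlyErrorMessage` is a hypothetical name. Returning `undefined` for unknown codes lets the UI fall back to the existing `{error_code}: {message}` rendering.

```ts
// Map of error code to user-facing copy. Placeholder wording except for
// MODEL_PROVIDER_UNREACHABLE, which is taken from the example above.
const FRIENDLY_COPY = new Map<string, (provider: string) => string>([
  [
    'MODEL_PROVIDER_UNREACHABLE',
    (p) => `Could not reach ${p}. Check your Internet connection or ${p} status.`,
  ],
  ['MODEL_PROVIDER_TIMEOUT', (p) => `${p} did not respond in time. Try again.`],
  ['MODEL_PROVIDER_AUTH_FAILED', (p) => `${p} rejected the API key. Check your credentials.`],
  ['MODEL_PROVIDER_RATE_LIMITED', (p) => `${p} is rate limiting requests. Try again shortly.`],
  ['MODEL_PROVIDER_UNAVAILABLE', (p) => `${p} is temporarily unavailable. Try again shortly.`],
])

function friendlyErrorMessage(
  errorCode: string,
  provider = 'the model provider'
): string | undefined {
  return FRIENDLY_COPY.get(errorCode)?.(provider)
}
```

A component could then render `friendlyErrorMessage(error_code, 'Anthropic') ?? message` and keep the raw message available in a details view.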

## Example desired behavior

If the dev app is running locally but the machine has no Internet:

1. User sends a message to Horton.
2. UI shows thinking.
3. Runtime starts model call with timeout.
4. Provider call fails/times out.
5. Runtime marks the run failed.
6. UI exits thinking state and shows:

```txt
Could not reach Anthropic. Check your Internet connection or Anthropic status.
```

The wake should close normally after recording the error.

## Testing plan

### Unit tests

Add tests around error classification:

```ts
classifyModelProviderError(new Error('fetch failed'))
// MODEL_PROVIDER_UNREACHABLE

classifyModelProviderError(new Error('timeout'))
// MODEL_PROVIDER_TIMEOUT

classifyModelProviderError(new Error('401 invalid api key'))
// MODEL_PROVIDER_AUTH_FAILED

classifyModelProviderError(new Error('429 rate limit'))
// MODEL_PROVIDER_RATE_LIMITED
```

### Adapter tests

Mock/override `streamFn` or Pi agent stream behavior to simulate:

- no response until timeout
- rejected provider promise
- terminal `message_end` with `stopReason: 'error'`
- aborted run

Assert:

- run is marked `failed` for provider errors
- run is marked completed/aborted for explicit aborts as appropriate
- no indefinite pending state
- classified error is written or thrown through the process-wake path

### UI smoke test

Create an entity run that writes a classified error and verify `AgentResponse.tsx` renders it.

## Open questions

1. What should the default timeout be?
   - 30s is demo-friendly.
   - Longer may be safer for slow providers/models.

2. Should timeout be per model/provider?
   - Some reasoning models may legitimately take longer before first token.

3. Should timeout mean time-to-first-event or total model call duration?
   - For the demo offline case, time-to-first-event timeout is probably sufficient.
   - A separate max total run duration could be useful later.

4. Should retries default to 0?
   - Upstream Anthropic provider appears to default retries to 0 in current code.
   - For demos/offline handling, retries can make failures feel like hangs.

5. Should classified provider errors be represented in `runs`, `steps`, `errors`, or all of the above?
   - Today the UI can read run errors, but we still need to confirm the best durable shape.

6. Should this be configured through `AgentConfig`?

   Example:

   ```ts
   ctx.useAgent({
     model,
     provider,
     modelTimeoutMs: 30_000,
     modelMaxRetries: 0,
   })
   ```

   Or keep this as runtime/env config first.
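
To make open question 3 concrete, a time-to-first-event timeout could be sketched as a wrapper around an async event stream. This assumes the stream can be treated as a plain `AsyncIterable`, which may not match Pi's actual stream type, and `withFirstEventTimeout` is a hypothetical name. Only the wait for the first event is bounded; later events pass through with no deadline.

```ts
// Hypothetical wrapper; illustrates the idea rather than Pi's stream API.
async function* withFirstEventTimeout<T>(
  source: AsyncIterable<T>,
  timeoutMs: number
): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]()
  let timer: ReturnType<typeof setTimeout> | undefined

  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      const error = new Error(`No model event within ${timeoutMs}ms`)
      error.name = 'TimeoutError'
      reject(error)
    }, timeoutMs)
  })

  try {
    // Only the wait for the first event races against the timer.
    const first = await Promise.race([iterator.next(), timeout]).finally(() =>
      clearTimeout(timer)
    )
    if (first.done) return
    yield first.value

    // Later events are passed through with no deadline.
    while (true) {
      const next = await iterator.next()
      if (next.done) return
      yield next.value
    }
  } finally {
    // Not awaited: return() on a generator stuck in a hung request would
    // itself never resolve. A real adapter would also abort the request.
    void iterator.return?.()?.catch(() => {})
  }
}
```

A total-duration limit would be a separate timer around the whole loop, which is why the two are worth configuring independently.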

## Acceptance criteria

- Model-provider calls have a default timeout or equivalent abort mechanism.
- When a provider such as Anthropic or OpenAI is unreachable, or the machine is offline, the UI does not stay in the thinking state indefinitely.
- A failed provider call settles the run and closes the wake.
- The entity timeline contains a clear durable error code/message.
- The UI shows a useful error without requiring a page refresh.
- Existing explicit abort/SIGINT behavior remains correct.

