examples/blog: can-ai-agents-improve-ai-agents starter#7404

Open
anndvision wants to merge 2 commits into main from andrew/blog-can-ai-agents-starter

@anndvision anndvision commented Apr 28, 2026

Summary

Adds the companion code for the Can AI Agents Improve AI Agents? blog post under examples/blog/can-ai-agents-improve-ai-agents/. Self-contained starter that lets a reader's coding agent (Claude Code or Codex) optimize a TensorZero function on the YC Bench Tutorial environment using a local Docker-Compose'd gateway, a markdown skill, and a real baseline rollout's traces.

What's in the starter

  • docker-compose.yml — three services: tensorzero/postgres:17 (with cron.database_name=tensorzero for pg_cron), a one-shot gateway-run-postgres-migrations, and the gateway itself. Postgres is required because the experimentation routing the agent uses (track_and_stop / adaptive) writes per-variant trial counts and means.
  • tensorzero.toml — function yc_bench_tutorial_v0::yc_bench_act on openai::gpt-5.4-mini, single initial variant, [gateway.observability] backend = "postgres".
  • functions/yc_bench_tutorial_v0::yc_bench_act/initial/ — initial system + user templates and user_schema.json (legacy per-role config style).
  • tools/run_command.json — the YC Bench agent's only tool. Exposed so the gateway accepts well-formed inference requests; the simulator itself is not installed locally.
  • baseline_data/ contains .gitignore (excludes the two large jsonls), fetch.sh (downloads them from a sibling anndvision/data Git LFS host with SHA-256 verification), and initial_config/ (frozen day-one snapshot of the live config tree, used as a read-only reference once the agent starts editing).
  • README.md — quick-start (manual + "from the blog button"), what's in the starter, troubleshooting.
  • SKILL.md — the agent's playbook. Major sections:
    • Setup — six numbered steps the agent runs end-to-end on a fresh laptop: locate or clone the repo, make the API key reachable to the gateway (recommends a gitignored .env over a same-shell export), bring up the stack, fetch the baseline traces, smoke-test the initial variant, print a "ready" handoff.
    • Environment / Task / Available Models / Baseline data: environment metadata, real baseline numbers (tasks_succeeded mean = 0.800 on the 20-episode test split), and JSONL projection patterns (grep followed by a node -e projection, with examples adapted from the eval harness's skill_no_mcp.md).
    • Methodology — survey → diagnose → write → restart → probe → iterate → exit. Deliberately does not enumerate failure modes; the agent forms its own hypotheses from the worst-scoring test episodes.
    • Routing: Experimentation Config — example track_and_stop block.
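
The JSONL projection and "worst-scoring test episodes" survey described in the SKILL.md bullets can be sketched roughly as follows. This is a minimal illustration, not the snippet shipped in SKILL.md: the field names (`episode_id`, `split`, `tasks_succeeded`, `extra`) and the helper names are assumptions, and the sample records are synthetic rather than real baseline data.

```javascript
// Sketch: project JSONL trace records down to a few fields and rank episodes.
// Field names here are hypothetical; check them against the real baseline
// files before reusing this.

// Parse JSONL text into objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line));
}

// Keep only the named fields from each record.
function project(records, fields) {
  return records.map((r) => Object.fromEntries(fields.map((f) => [f, r[f]])));
}

// Return the k lowest-scoring records on the given metric (ascending order).
function worstEpisodes(records, metric, k) {
  return [...records].sort((a, b) => a[metric] - b[metric]).slice(0, k);
}

// Arithmetic mean of a non-empty array of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Synthetic records for illustration only, not real baseline data.
const sample = [
  '{"episode_id":"e1","split":"test","tasks_succeeded":1,"extra":"x"}',
  '{"episode_id":"e2","split":"test","tasks_succeeded":0,"extra":"y"}',
  '{"episode_id":"e3","split":"test","tasks_succeeded":1,"extra":"z"}',
  "",
].join("\n");

const rows = project(parseJsonl(sample), ["episode_id", "tasks_succeeded"]);
const worst = worstEpisodes(rows, "tasks_succeeded", 1);
const average = mean(rows.map((r) => r.tasks_succeeded));
console.log(worst, average.toFixed(3));
```

Projecting before ranking keeps the output small enough to read in a terminal, which matters when the source file is tens of MiB.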

Notes

  • Baseline traces hosted out-of-tree. inferences.jsonl is 98 MiB, well over the repo's check-added-large-files 1 MiB threshold. The traces are fetched on demand via Git LFS from anndvision/data (added in anndvision/data#1, already merged); fetch.sh is idempotent and SHA-256-verifies each download, and the files are gitignored so the blobs never land here.
  • § Setup detects state. Detects whether the agent is already in the example dir, already inside the cloned repo, or somewhere else entirely; clones only when needed; surfaces concrete safe options when OPENAI_API_KEY isn't set or Docker isn't running (per-OS daemon hints, gitignored .env over shell export, no --privileged / root recommendations).
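
The idempotent, SHA-256-verified fetch mentioned in the first note can be illustrated with a short sketch. It is not the contents of fetch.sh (which is a shell script): `sha256Hex`, `fetchVerified`, and the synchronous `download` callback are hypothetical stand-ins, and the payload is synthetic. It reads the whole file into memory for brevity; a 98 MiB file would more likely be hashed as a stream.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Hex-encoded SHA-256 digest of a Buffer or string.
function sha256Hex(data) {
  return crypto.createHash("sha256").update(data).digest("hex");
}

// Idempotent fetch: skip when dest already matches the expected digest,
// otherwise download, verify, and only then write the file.
// `download` is a hypothetical synchronous callback returning a Buffer.
function fetchVerified(dest, expectedSha256, download) {
  if (fs.existsSync(dest) && sha256Hex(fs.readFileSync(dest)) === expectedSha256) {
    return "skipped";
  }
  const data = download();
  const actual = sha256Hex(data);
  if (actual !== expectedSha256) {
    throw new Error(`SHA-256 mismatch for ${dest}: expected ${expectedSha256}, got ${actual}`);
  }
  fs.writeFileSync(dest, data);
  return "written";
}

// Demo against a temp directory with a synthetic payload.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "fetch-sketch-"));
const dest = path.join(dir, "traces.jsonl");
const payload = Buffer.from('{"synthetic":true}\n');
const expected = sha256Hex(payload);

const first = fetchVerified(dest, expected, () => payload);
// The second call must not download again because the digest already matches.
const second = fetchVerified(dest, expected, () => {
  throw new Error("should not re-download");
});
console.log(first, second);
```

Verifying before writing means a corrupt download never replaces a good file, and re-running the fetch is cheap once the digest matches.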

Out of scope

  • The blog-side "Open in Claude Code" CTA component — landing separately.
  • Codex deeplink (no documented codex:// URL scheme; the blog will ship Claude Code only and a prose pointer for Codex users).
  • Running the full evaluation harness — that's the user's job after the agent exits.

Test plan

  • docker compose up -d end-to-end on a clean machine
  • bash baseline_data/fetch.sh downloads + verifies
  • All in-skill node -e snippets run against real data
  • User manually walked through the deeplink flow with a real OpenAI key (no PR-blocking issues found)

Note

Low Risk
Low risk: this PR only adds a new self-contained example directory (docs, compose config, and fetch script) and does not change runtime/library code paths.

Overview
Adds a new examples/blog/can-ai-agents-improve-ai-agents/ starter used in the accompanying blog post, including a runnable Docker Compose stack (gateway + Postgres + one-shot migrations) and a baseline tensorzero.toml/prompt template setup for yc_bench_tutorial_v0::yc_bench_act.

Includes an agent-oriented playbook (SKILL.md) plus a baseline_data/fetch.sh downloader with SHA-256 verification and gitignored large trace files, along with README quick-start and troubleshooting guidance.

Reviewed by Cursor Bugbot for commit 8627461.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 320f15cb1c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread examples/blog/can-ai-agents-improve-ai-agents/SKILL.md Outdated
@anndvision anndvision force-pushed the andrew/blog-can-ai-agents-starter branch from 320f15c to c0ddecd on April 28, 2026 21:08
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit c0ddecd.

Comment thread examples/blog/can-ai-agents-improve-ai-agents/docker-compose.yml
@anndvision anndvision force-pushed the andrew/blog-can-ai-agents-starter branch 5 times, most recently from 4264444 to 09df5dd on April 29, 2026 14:31
@anndvision anndvision force-pushed the andrew/blog-can-ai-agents-starter branch from 09df5dd to 8dbd47a on April 29, 2026 14:40