Reusable audio-driven avatar video pipeline for faceless explainer channels.
The workflow is:
- Generate your avatar expression clips once in Veo3 with a solid green background.
- Chroma-key those clips into reusable RGBA frames.
- Build a clip_library.json index with tags and metadata.
- For each new voiceover, analyze audio -> select clips -> smooth transitions -> composite final video.
No repeated Veo generation is required for each new video once the library is ready.
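
As a rough illustration of what a reusable clip index might look like, the sketch below builds a small JSON-serialisable structure and looks clips up by tag. The field names (`frames_dir`, `loopable`, `duration_s`) and the top-level layout are assumptions made for this example; the actual schema lives in clip_library.py and may differ.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ClipEntry:
    # Hypothetical fields; not the actual clip_library.py schema.
    name: str
    frames_dir: str
    tags: list = field(default_factory=list)
    loopable: bool = False
    duration_s: float = 0.0


def build_index(entries):
    """Return a JSON-serialisable index keyed by clip name."""
    return {"version": 1, "clips": {e.name: asdict(e) for e in entries}}


def clips_with_tag(index, tag):
    """Return the sorted names of all clips carrying the given tag."""
    return sorted(n for n, c in index["clips"].items() if tag in c["tags"])


index = build_index([
    ClipEntry("idle_01", "keyed_clips/idle_01", ["idle"], True, 3.0),
    ClipEntry("talk_low_01", "keyed_clips/talk_low_01", ["talk", "low"], True, 3.0),
    ClipEntry("wave_01", "keyed_clips/wave_01", ["reaction", "wave"], False, 2.0),
])
text = json.dumps(index, indent=2)  # what would be written to disk
print(clips_with_tag(json.loads(text), "talk"))
```

Because the index is plain JSON, it can be versioned alongside the clip pack and reloaded for every new voiceover without touching the clips themselves.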
- chroma_key.py: green-screen masking, edge feathering, spill suppression, RGBA output.
- clip_library.py: clip metadata schema, indexing, lookup, and summary.
- audio_analyzer.py: segment-level RMS/pitch/speech analysis and expression-tag mapping.
- transition_engine.py: best-clip selection and cached transition blend generation.
- compositor.py: alpha compositing of avatar over background plus audio mux.
- pipeline.py: CLI orchestration (setup, analyse, render).
- Python 3.12+
- System ffmpeg on PATH (required for final audio muxing)
- Python packages: opencv-python, numpy, librosa, soundfile
Install dependencies with uv:

    uv sync

If ffmpeg is missing on macOS:

    brew install ffmpeg
├── audio_analyzer.py
├── chroma_key.py
├── clip_library.py
├── compositor.py
├── pipeline.py
├── transition_engine.py
├── raw_clips/
├── keyed_clips/
├── transition_cache/
└── clip_library.json
Create 30-40 loopable expression clips with a bright solid green background and no camera movement.
Example prompt patterns:
- Idle: subtle breathing, neutral pose, loopable 3-second clip.
- Talk low/med/high: different energy talking gestures.
- Reaction clips: excited, thinking, nod, shrug, wave, point left/right/up, celebrate.
You can auto-generate 40 reusable raw scene clips directly from avatar.png using Google Veo:

    uv run python generate_avatar_scenes.py \
        --reference avatar.png \
        --output-dir raw_clips \
        --count 40

The script writes:
- raw_clips/<scene_name>.mp4
- raw_clips/veo_scene_manifest.json (prompt + status + resume metadata)
- raw_clips/generated_frames/<scene_name>.png (intermediate static scene image)
Auth options:
- Gemini API key mode:
  - set GOOGLE_API_KEY (or GEMINI_API_KEY) in your environment
- Vertex AI mode:
  - pass --use-vertex --project <gcp-project> --location us-central1
Useful flags:
- --dry-run to preview all prompts without calling Veo
- --continue-on-error to keep generating even if one scene fails
- --start-index 21 --count 20 to generate in batches
- --no-resume to force regenerate existing scene files
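
The batching and resume flags can be pictured with a small planning function. This is only a sketch: `plan_batch` is a hypothetical helper, the start index is assumed to be 1-based, and "already generated" is assumed to mean that `<scene_name>.mp4` exists in the output directory. The real script may track this through veo_scene_manifest.json instead.

```python
import tempfile
from pathlib import Path


def plan_batch(scene_names, start_index, count, output_dir, resume=True):
    """Pick the scenes a run would generate (hypothetical helper).

    Assumes start_index is 1-based and that a scene counts as done
    when <scene_name>.mp4 already exists in output_dir.
    """
    first = start_index - 1
    batch = scene_names[first:first + count]
    out = Path(output_dir)
    return [n for n in batch if not (resume and (out / f"{n}.mp4").exists())]


scenes = [f"scene_{i:02d}" for i in range(1, 41)]
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "scene_21.mp4").touch()  # pretend one clip already exists
    resumed = plan_batch(scenes, 21, 20, tmp)               # like default resume
    forced = plan_batch(scenes, 21, 20, tmp, resume=False)  # like --no-resume
print(len(resumed), len(forced))  # 19 20
```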
If your Google model rejects enhancePrompt, do not pass --enhance-prompt.
The script generates each scene in two stages:
- Generate a static scene image from your reference avatar (--image-model, default gemini-2.5-flash-image)
- Animate that generated image with Veo (--video-model, default veo-3.1-fast-generate-001)
Example (explicit models):

    uv run python generate_avatar_scenes.py \
        --reference avatar.png \
        --output-dir raw_clips \
        --count 40 \
        --image-model gemini-2.5-flash-image \
        --video-model veo-3.1-lite-generate-preview \
        --continue-on-error

Cost-first defaults:
- Image model default: gemini-2.5-flash-image (cheap and fast)
- Video model default: veo-3.1-lite-generate-preview (cheaper than fast/quality)
The script also auto-falls back to cheaper available models if your chosen model is unavailable.
If one Veo model returns an empty video payload for a scene, keep auto-fallback enabled
(--auto-video-fallback, on by default) so the script retries other available Veo
models automatically.
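
The fallback behaviour described above amounts to trying models in order until one returns a non-empty payload. The sketch below shows that pattern only; `generate_with_fallback` and the `generate` callable are hypothetical stand-ins, not the script's actual functions, and the model names are placeholders.

```python
def generate_with_fallback(scene, models, generate):
    """Try each model in order; return (model, payload) for the first
    non-empty result. `generate` is a caller-supplied callable
    (hypothetical here) that takes (model, scene) and returns bytes."""
    errors = []
    for model in models:
        try:
            payload = generate(model, scene)
        except Exception as exc:  # model unavailable, quota, etc.
            errors.append(f"{model}: {exc}")
            continue
        if payload:
            return model, payload
        errors.append(f"{model}: empty video payload")
    raise RuntimeError("all models failed: " + "; ".join(errors))


def fake_generate(model, scene):
    # Stand-in for a real API call: the first model returns nothing.
    return b"" if model == "model-a" else b"video-bytes"


model, payload = generate_with_fallback("wave", ["model-a", "model-b"], fake_generate)
print(model)  # model-b
```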
For full-body output, prompts are now constrained to keep head-to-toe framing in both the static frame stage and animation stage.
Place raw Veo MP4 clips in raw_clips/ and run:

    uv run python pipeline.py setup --raw_clips raw_clips/ --keyed_clips keyed_clips/ --library clip_library.json

This creates keyed PNG sequences and skips clips that were already processed.
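
To give a feel for what keying does per pixel, here is a minimal standard-library sketch of a hue-range mask plus a simple spill clamp. It assumes hue is measured on OpenCV's 0-179 scale (degrees divided by two), which is consistent with values such as 35 and 85 bracketing green but is not confirmed by the source. The saturation and value thresholds are invented for the example, and chroma_key.py itself works on whole frames and also feathers edges.

```python
import colorsys


def green_alpha(rgb, hue_low=35, hue_high=85, min_sat=0.3, min_val=0.2):
    """Return 0 (transparent) for background green, 255 (opaque) otherwise.

    Hue is compared on an assumed OpenCV-style 0-179 scale; min_sat and
    min_val are hypothetical thresholds to spare grey or dark pixels.
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hue_cv = h * 180.0
    is_green = hue_low <= hue_cv <= hue_high and s >= min_sat and v >= min_val
    return 0 if is_green else 255


def suppress_spill(rgb):
    """Clamp green to the larger of red and blue, a common way to
    remove green fringing on foreground pixels."""
    r, g, b = rgb
    return (r, min(g, max(r, b)), b)


print(green_alpha((0, 255, 0)))         # pure green background
print(green_alpha((224, 172, 105)))     # skin-like tone stays opaque
print(suppress_spill((200, 230, 180)))  # green pulled down to 200
```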
Optional chroma tuning at setup time:

    uv run python pipeline.py setup \
        --raw_clips raw_clips/ \
        --keyed_clips keyed_clips/ \
        --hue-low 35 --hue-high 85 --feather 3

After keying, map your clip names/tags in clip_library.py (default map provided), then build:

    uv run python clip_library.py --build --clips_dir keyed_clips/ --raw_dir raw_clips/ --out clip_library.json

Check library summary:

    uv run python clip_library.py --summary --out clip_library.json

Analyse a voiceover:

    uv run python pipeline.py analyse --audio voiceover.wav --output segments.json

This writes segment timing and tag suggestions.
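
Segment-level analysis can be approximated as: split the audio into fixed windows, compute RMS energy per window, and map energy to a tag. The sketch below does this in plain Python on a list of samples. The window length, thresholds, and tag names (`idle`, `talk_low`, `talk_med`, `talk_high`) are assumptions for illustration; audio_analyzer.py also uses pitch and speech detection, which are not shown.

```python
import math


def segment_rms(samples, sample_rate, segment_s=0.5):
    """Return (start_time_s, rms) for each fixed-length window."""
    size = max(1, int(sample_rate * segment_s))
    out = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        out.append((start / sample_rate, rms))
    return out


def energy_tag(rms, silence=0.02, low=0.1, high=0.3):
    """Map RMS energy to a tag; all thresholds are hypothetical."""
    if rms < silence:
        return "idle"
    if rms < low:
        return "talk_low"
    if rms < high:
        return "talk_med"
    return "talk_high"


# Tiny synthetic signal at 16 samples/s: silence, quiet, loud.
samples = [0.0] * 8 + [0.05] * 8 + [0.5] * 8
segments = segment_rms(samples, sample_rate=16)
tags = [energy_tag(rms) for _, rms in segments]
print(tags)  # ['idle', 'talk_low', 'talk_high']
```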
    uv run python pipeline.py render \
        --audio voiceover.wav \
        --background bg.mp4 \
        --output final.mp4 \
        --library clip_library.json \
        --scale 0.35 \
        --position bottom_right \
        --verbose

Without --background, the renderer uses a black fallback background.
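
The placement and blending behind --scale and --position can be sketched as follows. Two assumptions are made here: that scale is a fraction of the background height (it could equally be a multiplier on the avatar's own size), and that position names other than bottom_right exist in this form. `overlay_box` and `blend_pixel` are hypothetical helpers; compositor.py operates on full frames.

```python
def overlay_box(bg_size, avatar_size, scale=0.35, position="bottom_right", margin=0):
    """Return (x, y, width, height) of the avatar on the background.

    Assumes `scale` is a fraction of background height and keeps the
    avatar's aspect ratio. Position names besides bottom_right are guesses.
    """
    bg_w, bg_h = bg_size
    av_w, av_h = avatar_size
    new_h = round(bg_h * scale)
    new_w = round(av_w * new_h / av_h)
    x = margin if position.endswith("left") else bg_w - new_w - margin
    y = margin if position.startswith("top") else bg_h - new_h - margin
    return x, y, new_w, new_h


def blend_pixel(fg, alpha, bg):
    """Standard 'over' blend of one RGB pixel with 8-bit alpha."""
    a = alpha / 255.0
    return tuple(round(f * a + b * (1.0 - a)) for f, b in zip(fg, bg))


print(overlay_box((1920, 1080), (720, 1280)))     # (1707, 702, 213, 378)
print(blend_pixel((255, 0, 0), 255, (0, 0, 255))) # fully opaque: foreground
print(blend_pixel((255, 0, 0), 0, (0, 0, 255)))   # fully transparent: background
```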
- Segment-level clip selection picks nearest expression/energy fit.
- Loop-safe clips can repeat with minimal seam artifacts.
- Different adjacent clips get a cached 12-frame alpha transition (transition_cache/).
- clip_library.json is your reusable asset index. Keep it versioned with your clip pack.
- If a clip tag is missing, selection falls back to nearest speech clip or idle.
- If a transition was already generated once, future renders reuse the cached sequence.
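
The selection fallback and transition caching described in the list above can be sketched as below. Everything here is illustrative: the function names, the `talk` tag used as the "speech" fallback, the linear blend weights, and the in-memory dictionary standing in for the on-disk transition_cache/ directory are assumptions, not the behaviour of transition_engine.py.

```python
def select_clip(library, wanted_tag, speech=True):
    """Pick a clip by tag, falling back to a speech clip, then idle.

    `library` maps clip name -> list of tags (hypothetical shape).
    """
    fallbacks = [wanted_tag] + (["talk"] if speech else []) + ["idle"]
    for tag in fallbacks:
        matches = sorted(n for n, tags in library.items() if tag in tags)
        if matches:
            return matches[0]
    raise LookupError(f"no clip for {wanted_tag!r} and no fallback available")


def crossfade_weights(n_frames=12):
    """Blend weights for the incoming clip, strictly between 0 and 1."""
    return [(i + 1) / (n_frames + 1) for i in range(n_frames)]


_transition_cache = {}  # stand-in for files under transition_cache/


def get_transition(clip_a, clip_b, build):
    """Build the transition for a clip pair once, then reuse it."""
    key = (clip_a, clip_b)
    if key not in _transition_cache:
        _transition_cache[key] = build(clip_a, clip_b)
    return _transition_cache[key]


library = {"idle_01": ["idle"], "talk_med_01": ["talk", "med"]}
calls = []


def fake_build(a, b):
    calls.append((a, b))
    return crossfade_weights()


chosen = select_clip(library, "celebrate")  # tag missing, falls back
get_transition("idle_01", chosen, fake_build)
get_transition("idle_01", chosen, fake_build)  # served from the cache
print(chosen, len(calls))  # talk_med_01 1
```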