codingstark-dev/veo3-avatar-sync

Veo3 Avatar Sync Pipeline

Reusable audio-driven avatar video pipeline for faceless explainer channels.

The workflow is:

  1. Generate your avatar expression clips once in Veo3 with a solid green background.
  2. Chroma-key those clips into reusable RGBA frames.
  3. Build a clip_library.json index with tags and metadata.
  4. For each new voiceover, analyze audio -> select clips -> smooth transitions -> composite final video.

No repeated Veo generation is required for each new video once the library is ready.

Modules

  • chroma_key.py: green-screen masking, edge feathering, spill suppression, RGBA output.
  • clip_library.py: clip metadata schema, indexing, lookup, and summary.
  • audio_analyzer.py: segment-level RMS/pitch/speech analysis and expression-tag mapping.
  • transition_engine.py: best-clip selection and cached transition blend generation.
  • compositor.py: alpha compositing of avatar over background plus audio mux.
  • pipeline.py: CLI orchestration (setup, analyse, render).

Requirements

  • Python 3.12+
  • System ffmpeg on PATH (required for final audio muxing)
  • Python packages:
    • opencv-python
    • numpy
    • librosa
    • soundfile

Install dependencies with uv:

uv sync

If ffmpeg is missing on macOS:

brew install ffmpeg
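Since the final audio mux depends on the system ffmpeg binary, it can help to fail early when it is missing. The following is a minimal sketch, not code from compositor.py: the function names, the stream-copy strategy, and the AAC audio codec are all assumptions made for illustration.

```python
import shutil


def find_ffmpeg() -> str:
    """Return the path to ffmpeg, or raise if it is not on PATH."""
    path = shutil.which("ffmpeg")
    if path is None:
        raise RuntimeError("ffmpeg not found on PATH; install it before rendering")
    return path


def build_mux_command(video: str, audio: str, output: str,
                      ffmpeg: str = "ffmpeg") -> list[str]:
    """Build an ffmpeg command that adds the voiceover to a silent video.

    Assumption: the composited video is already encoded, so its stream is
    copied as-is and only the audio is encoded (AAC here).
    """
    return [
        ffmpeg, "-y",                        # overwrite the output if it exists
        "-i", video,                         # input 0: silent composite
        "-i", audio,                         # input 1: voiceover
        "-map", "0:v:0", "-map", "1:a:0",    # video from input 0, audio from input 1
        "-c:v", "copy",
        "-c:a", "aac",
        "-shortest",                         # stop at the shorter of the two streams
        output,
    ]
```

The command list can be passed to subprocess.run(cmd, check=True) once find_ffmpeg() has succeeded.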

Recommended project structure

.
├── audio_analyzer.py
├── chroma_key.py
├── clip_library.py
├── compositor.py
├── pipeline.py
├── transition_engine.py
├── raw_clips/
├── keyed_clips/
├── transition_cache/
└── clip_library.json

Step 1: Generate Veo3 clips (one-time)

Create 30-40 loopable expression clips with a bright solid green background and no camera movement.

Example prompt patterns:

  • Idle: subtle breathing, neutral pose, loopable 3-second clip.
  • Talk low/med/high: different energy talking gestures.
  • Reaction clips: excited, thinking, nod, shrug, wave, point left/right/up, celebrate.

Automated 40-scene generation from your avatar image

You can auto-generate 40 reusable raw scene clips directly from avatar.png using Google Veo:

uv run python generate_avatar_scenes.py \
  --reference avatar.png \
  --output-dir raw_clips \
  --count 40

The script writes:

  • raw_clips/<scene_name>.mp4
  • raw_clips/veo_scene_manifest.json (prompt + status + resume metadata)
  • raw_clips/generated_frames/<scene_name>.png (intermediate static scene image)

Auth options:

  • Gemini API key mode:
    • set GOOGLE_API_KEY (or GEMINI_API_KEY) in your environment
  • Vertex AI mode:
    • --use-vertex --project <gcp-project> --location us-central1
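The two auth modes above could be resolved into client settings roughly as follows. This is a sketch under assumptions: resolve_auth is a hypothetical helper, checking GOOGLE_API_KEY before GEMINI_API_KEY is a guess, and the returned keys mirror the keyword arguments of genai.Client in the google-genai SDK, which generate_avatar_scenes.py may or may not use.

```python
import os
from typing import Mapping, Optional


def resolve_auth(use_vertex: bool = False,
                 project: Optional[str] = None,
                 location: str = "us-central1",
                 env: Optional[Mapping[str, str]] = None) -> dict:
    """Return keyword arguments describing the chosen auth mode."""
    env = os.environ if env is None else env

    if use_vertex:
        # Vertex AI mode: credentials come from the GCP environment.
        if not project:
            raise ValueError("--use-vertex requires --project")
        return {"vertexai": True, "project": project, "location": location}

    # Gemini API key mode. Assumed precedence: GOOGLE_API_KEY first.
    api_key = env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY")
    if not api_key:
        raise ValueError("set GOOGLE_API_KEY or GEMINI_API_KEY, or pass --use-vertex")
    return {"api_key": api_key}
```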

Useful flags:

  • --dry-run to preview all prompts without calling Veo
  • --continue-on-error to keep generating even if one scene fails
  • --start-index 21 --count 20 to generate in batches
  • --no-resume to force regenerate existing scene files

If your Google model rejects enhancePrompt, do not pass --enhance-prompt.
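The batching and resume flags can be pictured as a filter over the scene list. The sketch below is illustrative only: scenes_to_generate is a hypothetical name, --start-index is assumed to be 1-based, and a scene is assumed to count as done when its .mp4 already exists in the output directory (the real script may consult veo_scene_manifest.json instead).

```python
from pathlib import Path


def scenes_to_generate(scene_names, output_dir, start_index=1, count=None, resume=True):
    """Return the scene names in the requested batch that still need generating."""
    first = start_index - 1                      # assumed 1-based index
    last = len(scene_names) if count is None else first + count
    batch = list(scene_names[first:last])

    if not resume:                               # --no-resume: regenerate everything
        return batch

    out = Path(output_dir)
    # Skip scenes whose clip file is already present.
    return [name for name in batch if not (out / f"{name}.mp4").exists()]
```

With 40 scenes, start_index=21 and count=20 select scenes 21 through 40, minus any that already have a clip on disk.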

Two-stage generation (static image -> Veo animation)

The script generates each clip in two stages:

  1. Generate a static scene image from your reference avatar (--image-model, default gemini-2.5-flash-image)
  2. Animate that generated image with Veo (--video-model, default veo-3.1-lite-generate-preview)

Example (explicit models):

uv run python generate_avatar_scenes.py \
  --reference avatar.png \
  --output-dir raw_clips \
  --count 40 \
  --image-model gemini-2.5-flash-image \
  --video-model veo-3.1-lite-generate-preview \
  --continue-on-error

Cost-first defaults:

  • Image model default: gemini-2.5-flash-image (cheap and fast)
  • Video model default: veo-3.1-lite-generate-preview (cheaper than fast/quality)

The script also auto-falls back to cheaper available models if your chosen model is unavailable.

If a Veo model returns an empty video payload for a scene, keep auto-fallback enabled (--auto-video-fallback, on by default) so the script automatically retries with the other available Veo models.
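The fallback behaviour described above amounts to trying models in order until one yields a usable payload. A minimal sketch, assuming a hypothetical generate(model, prompt) callable that stands in for the real Veo call and returns raw video bytes:

```python
from typing import Callable, Sequence


def generate_with_fallback(prompt: str,
                           models: Sequence[str],
                           generate: Callable[[str, str], bytes]) -> tuple[str, bytes]:
    """Try each model in order; return the first non-empty payload and its model."""
    errors = []
    for model in models:
        try:
            payload = generate(model, prompt)
        except Exception as exc:          # e.g. the model is unavailable
            errors.append(f"{model}: {exc}")
            continue
        if payload:                       # an empty payload counts as a failure
            return model, payload
        errors.append(f"{model}: empty video payload")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Ordering the models list from cheapest to most expensive would keep the cost-first intent while still recovering from per-scene failures.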

For full-body output, the prompts constrain framing to head-to-toe in both the static-frame stage and the animation stage.

Step 2: Key raw clips for reuse

Place raw Veo MP4 clips in raw_clips/ and run:

uv run python pipeline.py setup --raw_clips raw_clips/ --keyed_clips keyed_clips/ --library clip_library.json

This creates keyed PNG sequences and skips clips that were already processed.

Optional chroma tuning at setup time:

uv run python pipeline.py setup \
  --raw_clips raw_clips/ \
  --keyed_clips keyed_clips/ \
  --hue-low 35 --hue-high 85 --feather 3
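To show what the hue range controls, here is a per-pixel sketch of hue-based keying with simple spill suppression. It is not the code in chroma_key.py: it assumes the hue flags use OpenCV's 0-179 hue scale (where 35-85 covers green), the min_sat and min_val thresholds are hypothetical, and feathering, which is a spatial blur of the mask edge, is left out.

```python
import colorsys


def key_pixel(r, g, b, hue_low=35, hue_high=85, min_sat=0.3, min_val=0.2):
    """Return (r, g, b, a) for one 8-bit RGB pixel; green pixels get alpha 0."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    hue = h * 180                      # convert 0-1 hue to the assumed 0-179 scale

    # Treat the pixel as background only if it is green AND reasonably
    # saturated and bright, so grey or dark pixels are not keyed out.
    is_green = hue_low <= hue <= hue_high and s >= min_sat and v >= min_val
    if is_green:
        return (r, g, b, 0)

    # Spill suppression: cap green at the larger of red and blue so a
    # green cast reflected onto the subject is toned down.
    g = min(g, max(r, b))
    return (r, g, b, 255)
```

A production version would normally run the same logic on whole frames with vectorised array operations rather than pixel by pixel.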

Step 3: Build clip index

After keying, map your clip names/tags in clip_library.py (default map provided), then build:

uv run python clip_library.py --build --clips_dir keyed_clips/ --raw_dir raw_clips/ --out clip_library.json

Check library summary:

uv run python clip_library.py --summary --out clip_library.json
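As a rough picture of what such an index might contain, the sketch below models one clip entry and round-trips a library through JSON. Every field name here (frames_dir, frame_count, fps, loopable) and the top-level "clips" key are assumptions for illustration; the actual schema is defined in clip_library.py.

```python
import json
from collections import Counter
from dataclasses import asdict, dataclass, field


@dataclass
class ClipEntry:
    """One keyed clip in the library (hypothetical schema)."""
    name: str
    frames_dir: str
    tags: list = field(default_factory=list)
    frame_count: int = 0
    fps: float = 24.0
    loopable: bool = False


def to_json(clips):
    """Serialise the library to a JSON string."""
    return json.dumps({"clips": [asdict(c) for c in clips]}, indent=2)


def from_json(text):
    """Load the library back into ClipEntry objects."""
    return [ClipEntry(**item) for item in json.loads(text)["clips"]]


def find_by_tag(clips, tag):
    """Return every clip carrying the given tag."""
    return [c for c in clips if tag in c.tags]


def summarize(clips):
    """Count how many clips carry each tag."""
    return dict(Counter(tag for c in clips for tag in c.tags))
```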

Step 4: Analyze voiceover (optional)

uv run python pipeline.py analyse --audio voiceover.wav --output segments.json

This writes segment timing and tag suggestions.
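The energy part of that analysis can be sketched as windowed RMS followed by a threshold lookup. This is a simplified illustration, not audio_analyzer.py: the tag strings, the thresholds, and the fixed one-second window are assumptions, and pitch and speech detection are omitted.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of float samples in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tag_for_energy(level, silence=0.02, low=0.08, high=0.2):
    """Map an RMS level to a tag (hypothetical thresholds and tag names)."""
    if level < silence:
        return "idle"
    if level < low:
        return "talk_low"
    if level < high:
        return "talk_med"
    return "talk_high"


def analyse(samples, sample_rate, segment_seconds=1.0):
    """Split audio into fixed windows and suggest one tag per window."""
    size = max(1, int(sample_rate * segment_seconds))
    segments = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        level = rms(chunk)
        segments.append({
            "start": start / sample_rate,
            "end": (start + len(chunk)) / sample_rate,
            "rms": level,
            "tag": tag_for_energy(level),
        })
    return segments
```

The resulting list can be written out with json.dump to produce a file shaped like segments.json.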

Step 5: Render a new video

uv run python pipeline.py render \
  --audio voiceover.wav \
  --background bg.mp4 \
  --output final.mp4 \
  --library clip_library.json \
  --scale 0.35 \
  --position bottom_right \
  --verbose

Without --background, the renderer uses a black fallback background.
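The effect of --scale and --position, and the alpha compositing itself, can be sketched per pixel as below. The assumptions are worth stating: scale is taken to multiply the avatar frame's own dimensions, the 20-pixel margin and any position names other than bottom_right are hypothetical, and the real compositor.py presumably works on whole frames rather than single pixels.

```python
def scaled_size(avatar_size, scale):
    """Resize the avatar by a plain factor (assumed meaning of --scale)."""
    w, h = avatar_size
    return round(w * scale), round(h * scale)


def avatar_origin(bg_size, avatar_size, position="bottom_right", margin=20):
    """Top-left pixel at which the avatar is pasted onto the background."""
    bg_w, bg_h = bg_size
    av_w, av_h = avatar_size
    x = margin if position.endswith("left") else bg_w - av_w - margin
    y = margin if position.startswith("top") else bg_h - av_h - margin
    return x, y


def over(fg_rgba, bg_rgb):
    """Standard 'over' blend of one straight-alpha RGBA pixel onto an RGB pixel."""
    r, g, b, a = fg_rgba
    alpha = a / 255
    return tuple(round(f * alpha + c * (1 - alpha))
                 for f, c in zip((r, g, b), bg_rgb))
```

For a 1920x1080 background and a 1000x2000 avatar frame at scale 0.35, the avatar becomes 350x700 and is pasted at (1550, 360).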

How transitions stay smooth

  • Segment-level clip selection picks the clip whose expression and energy are the nearest fit.
  • Loop-safe clips can repeat with minimal seam artifacts.
  • Different adjacent clips get a cached 12-frame alpha transition (transition_cache/).
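One plausible reading of the 12-frame alpha transition is a linear crossfade between the last frame of the outgoing clip and the first frame of the incoming one. The sketch below illustrates that idea only; the linear weighting, the flat-list frame representation, and the cache naming scheme are assumptions, not the behaviour of transition_engine.py.

```python
def crossfade_weights(n_frames=12):
    """Weights of the incoming clip, evenly spaced strictly between 0 and 1."""
    return [(i + 1) / (n_frames + 1) for i in range(n_frames)]


def blend_pixel(out_rgba, in_rgba, w):
    """Linear mix of two RGBA pixels; w is the weight of the incoming pixel."""
    return tuple(round(a * (1 - w) + b * w) for a, b in zip(out_rgba, in_rgba))


def transition_frames(last_frame, first_frame, n_frames=12):
    """Build the in-between frames; each frame is a flat list of RGBA tuples."""
    return [
        [blend_pixel(a, b, w) for a, b in zip(last_frame, first_frame)]
        for w in crossfade_weights(n_frames)
    ]


def cache_dir_name(from_clip, to_clip):
    """Hypothetical cache key so each clip pair is blended only once."""
    return f"{from_clip}__to__{to_clip}"
```

Keeping the weights strictly between 0 and 1 avoids duplicating the two endpoint frames, which already exist in the clips on either side.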

Notes

  • clip_library.json is your reusable asset index. Keep it versioned with your clip pack.
  • If no clip carries the requested tag, selection falls back to the nearest speech clip, or to idle.
  • If a transition was already generated once, future renders reuse the cached sequence.
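The fallback order in the notes above can be sketched as a three-step lookup. The clip dictionaries, the numeric "energy" field, and the "talk" and "idle" tag names are assumptions for illustration rather than the structures used by transition_engine.py.

```python
def select_clip(clips, tag, energy):
    """Pick a clip for a segment: exact tag, then nearest speech clip, then idle.

    clips: list of dicts with "name", "tags" and "energy" (0-1), a
    hypothetical shape chosen for this example.
    """
    def nearest(candidates):
        # Closest energy wins among the candidates.
        return min(candidates, key=lambda c: abs(c["energy"] - energy))

    exact = [c for c in clips if tag in c["tags"]]
    if exact:
        return nearest(exact)

    speech = [c for c in clips if "talk" in c["tags"]]
    if speech:
        return nearest(speech)

    idle = [c for c in clips if "idle" in c["tags"]]
    if idle:
        return idle[0]

    raise LookupError("clip library has no speech or idle clips")
```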
