Multimodal SFT silently hangs forever on certain clips: load_audio (librosa→audioread) and decord video decode have no timeout (DataLoader deadlock) #9507

@LukeLIN-web

Describe the bug

During multimodal LoRA SFT of Qwen3-Omni-30B-A3B with audio-in-video (USE_AUDIO_IN_VIDEO=1), training silently hangs forever on certain media clips. In our run it froze at a single step for ~9 hours: the process stayed alive, GPUs at 0% util (memory still held), and logging.jsonl / the tqdm progress bar stopped advancing. No exception, no timeout, no log line — completely silent, so the run looks healthy while wasting GPUs.

The root cause is media decoding inside the data pipeline with no timeout, running native code that blocks the worker indefinitely:

1. Audio: swift/template/vision_utils.py::load_audio (v4.2.2, ~L299–312):

res = librosa.load(audio_io, sr=sampling_rate)   # PySoundFile fails on some containers
# fallback: audioread.ffdec.FFmpegAudioFile(...) ; librosa.load(...)

When soundfile can't read a container, librosa falls back to audioread.ffdec.FFmpegAudioFile, which spawns an ffmpeg subprocess plus a reader thread. On certain clips this deadlocks (the well-known audioread/ffmpeg pipe-starvation hang) and never returns. The last log line before the freeze was:

.../swift/template/vision_utils.py:310: UserWarning: PySoundFile failed. Trying audioread instead.
  res = librosa.load(audio_io, sr=sampling_rate)

2. Video: the load_video_* paths use decord.VideoReader(...).get_batch(...). decord's get_batch can also busy-loop or deadlock in native code on certain clips while holding the GIL, so even a Python-level SIGALRM timeout cannot interrupt it: a Python signal handler only runs once the native call returns control to the interpreter, which never happens here (we verified this independently; it spins at ~100% CPU on one core). Same class of failure, different decoder.

Because decode is in-loop with no bound, a single bad clip freezes the whole training run silently.
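
Until decode is bounded, one way to at least make the hang visible is the standard-library faulthandler watchdog, which runs in its own native thread and can dump stack traces even while the interpreter is stuck in a native call. The sketch below is only an illustration: hang_watchdog is a hypothetical helper name (not part of ms-swift), and it has to be armed in whichever process performs the decode, for example inside each DataLoader worker.

```python
import contextlib
import faulthandler
import sys


@contextlib.contextmanager
def hang_watchdog(timeout_s: float, file=None):
    """Dump all thread tracebacks every `timeout_s` seconds while the body runs.

    Hypothetical helper for diagnosis only: it reports a hang, it does not
    interrupt or recover from it.
    """
    out = sys.stderr if file is None else file  # must be a real file with fileno()
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=out)
    try:
        yield
    finally:
        # Disarm so a normally finishing decode prints nothing further.
        faulthandler.cancel_dump_traceback_later()


# Possible use around a per-sample load (names are placeholders):
# with hang_watchdog(300):
#     sample = dataset[i]
```

With this in place, a frozen step would periodically print the Python frame that entered the blocking decode, which points at the offending call instead of leaving the log silent.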

Environment

  • ms-swift 4.2.2
  • Qwen3-Omni-30B-A3B (audio-in-video, USE_AUDIO_IN_VIDEO=1), LoRA SFT, 2×80GB GPUs

Suggested fix

Make data-pipeline media decode bounded and recoverable so one bad clip can't hang the run:

  • Wrap load_audio / load_video_* decode in a hard wall-clock timeout via a short-lived subprocess that is killed on overrun. (A SIGALRM-based timeout is insufficient: decord's native loop holds the GIL and never returns to the interpreter, and audioread's hang sits on the ffmpeg subprocess pipe.)
  • For video, prefer torchcodec over decord (no equivalent native get_batch deadlock).
  • For audio, avoid the audioread.ffdec fallback — decode via a controlled ffmpeg subprocess (stderr drained, communicate(timeout=...)).
  • At minimum: on decode timeout, log + skip the offending sample (print its path) instead of blocking forever.
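
A minimal sketch of the bounded audio path described in the list above, assuming the ffmpeg CLI is on PATH. The helper names run_bounded and decode_audio_pcm, the 60-second default, and the mono float32 output format are assumptions made for illustration, not existing ms-swift APIs. It relies on subprocess.run killing the child when the timeout expires, and on discarding stderr so that pipe can never fill up.

```python
import logging
import subprocess
from typing import Optional, Sequence

logger = logging.getLogger(__name__)


def run_bounded(cmd: Sequence[str], timeout_s: float) -> Optional[bytes]:
    """Run `cmd` in a child process and return its stdout.

    Returns None if the child overruns `timeout_s` (it is killed),
    exits non-zero, or cannot be started.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,  # discard stderr so it cannot block the child
            timeout=timeout_s,          # run() kills the child on overrun
            check=True,
        )
    except subprocess.TimeoutExpired:
        logger.warning("decode timed out after %.1fs: %s", timeout_s, list(cmd))
        return None
    except (subprocess.CalledProcessError, OSError) as exc:
        logger.warning("decode failed (%s): %s", exc, list(cmd))
        return None
    return proc.stdout


def decode_audio_pcm(path: str, sampling_rate: int = 16000,
                     timeout_s: float = 60.0) -> Optional[bytes]:
    """Decode `path` to mono float32 little-endian PCM bytes, or None to skip."""
    cmd = [
        "ffmpeg", "-nostdin", "-v", "error",
        "-i", path,
        "-vn",                                   # ignore any video stream
        "-f", "f32le", "-ac", "1", "-ar", str(sampling_rate),
        "pipe:1",                                # write raw samples to stdout
    ]
    raw = run_bounded(cmd, timeout_s)
    if raw is None:
        # Name the offending sample so it can be found and excluded.
        logger.warning("skipping undecodable clip: %s", path)
    return raw
```

The returned bytes could be turned into an array with something like np.frombuffer(raw, dtype="<f4"), and a None result lets the dataset skip the sample instead of blocking. The same run_bounded wrapper could in principle launch a small Python decode script for video, at the cost of process start-up time per clip.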

Happy to share a minimal repro clip or a PR if useful.
