Describe the bug
During multimodal LoRA SFT of Qwen3-Omni-30B-A3B with audio-in-video (USE_AUDIO_IN_VIDEO=1), training silently hangs forever on certain media clips. In our run it froze at a single step for ~9 hours: the process stayed alive, GPUs at 0% util (memory still held), and logging.jsonl / the tqdm progress bar stopped advancing. No exception, no timeout, no log line — completely silent, so the run looks healthy while wasting GPUs.
The root cause is media decoding inside the data pipeline with no timeout, running native code that blocks the worker indefinitely:
1. Audio: swift/template/vision_utils.py::load_audio (v4.2.2, ~L299–312):
res = librosa.load(audio_io, sr=sampling_rate) # PySoundFile fails on some containers
# fallback: audioread.ffdec.FFmpegAudioFile(...) ; librosa.load(...)
When soundfile can't read a container, librosa falls back to audioread.ffdec.FFmpegAudioFile, which spawns an ffmpeg subprocess + reader thread and on certain clips deadlocks (the well-known audioread/ffmpeg pipe-starvation hang) and never returns. The last log line before the freeze was:
.../swift/template/vision_utils.py:310: UserWarning: PySoundFile failed. Trying audioread instead.
res = librosa.load(audio_io, sr=sampling_rate)
2. Video: the load_video_* paths use decord.VideoReader(...).get_batch(...). On certain clips, decord's get_batch can also busy-loop or deadlock in native code while holding the GIL. CPython only runs signal handlers once the interpreter regains control in the main thread, so even a Python-level SIGALRM timeout cannot interrupt it (we verified this independently; it spins at ~100% CPU on one core). Same class of failure, different decoder.
Because decode is in-loop with no bound, a single bad clip freezes the whole training run silently.
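Since the hang produces no output at all, a watchdog that dumps Python stacks after a stretch of no progress can at least make the stuck call visible. Below is a minimal sketch using the standard-library faulthandler module. The helper names arm_hang_watchdog / disarm_hang_watchdog and the 600 s budget are hypothetical, and it assumes the watchdog is armed inside whichever process actually runs the decode (for example a DataLoader worker):

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, stream=sys.stderr):
    """(Re)start a timer that dumps all thread stacks if it is not re-armed in time.

    Hypothetical helper, not part of ms-swift. faulthandler runs the timer in
    its own watchdog thread, so it should still fire while the main thread is
    stuck inside a native call. `stream` must be a real file (needs fileno()).
    """
    # Calling this again replaces the previous timer and resets the countdown.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=stream, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training finishes cleanly."""
    faulthandler.cancel_dump_traceback_later()
```

Re-arming once per training step means the dump only fires when a step stalls. This only diagnoses the hang (the dump would be expected to show the frame sitting in librosa.load or get_batch); it does not unblock it.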
Environment
- ms-swift 4.2.2
- Qwen3-Omni-30B-A3B (audio-in-video, USE_AUDIO_IN_VIDEO=1), LoRA SFT, 2×80GB.
Suggested fix
Make data-pipeline media decode bounded and recoverable so one bad clip can't hang the run:
- Wrap load_audio / load_video_* decode in a hard wall-clock timeout via a short-lived subprocess that is killed on overrun. (A SIGALRM-based timeout is insufficient: decord's native loop holds the GIL, and audioread's hang is a blocked read on the ffmpeg subprocess pipe.)
- For video, prefer torchcodec over decord (no equivalent native get_batch deadlock).
- For audio, avoid the audioread.ffdec fallback and decode via a controlled ffmpeg subprocess instead (stderr drained, communicate(timeout=...)).
- At minimum: on decode timeout, log + skip the offending sample (print its path) instead of blocking forever.
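The subprocess-based timeout and the log-and-skip behaviour from the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not ms-swift code: call_with_hard_timeout, decode_or_skip, DecodeTimeout, and the "module:function" target convention are all hypothetical names, and the decode function is assumed to take JSON-serialisable arguments and return a picklable result without writing anything else to stdout.

```python
import json
import pickle
import subprocess
import sys

# Code executed by the short-lived child interpreter (hypothetical protocol):
# argv[1] is "module:function", argv[2] is a JSON list of positional arguments.
_CHILD_CODE = """
import importlib, json, pickle, sys
mod_name, func_name = sys.argv[1].split(":")
func = getattr(importlib.import_module(mod_name), func_name)
result = func(*json.loads(sys.argv[2]))
sys.stdout.buffer.write(pickle.dumps(result))
"""


class DecodeTimeout(Exception):
    """Raised when the child process exceeds its wall-clock budget."""


def call_with_hard_timeout(target, args, timeout_s):
    """Run `target` ("module:function") in a fresh process, killed on overrun."""
    cmd = [sys.executable, "-c", _CHILD_CODE, target, json.dumps(list(args))]
    try:
        # subprocess.run kills the child itself when the timeout expires.
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        raise DecodeTimeout(f"{target} exceeded {timeout_s}s for args {args!r}") from exc
    if proc.returncode != 0:
        raise RuntimeError(proc.stderr.decode(errors="replace").strip())
    return pickle.loads(proc.stdout)


def decode_or_skip(target, path, timeout_s=60.0):
    """Return the decoded sample, or None after logging the offending path."""
    try:
        return call_with_hard_timeout(target, [path], timeout_s)
    except (DecodeTimeout, RuntimeError) as exc:
        print(f"[decode skipped] {path}: {exc}", file=sys.stderr)
        return None
```

Because the decoder runs in a separate process, a native loop that never releases the GIL only stalls the child, and killing that child does not depend on any Python code running inside it. Killing the child does not necessarily kill processes it started itself (such as an ffmpeg grandchild), so a real implementation may want to kill the whole process group. Per-sample process startup and result pickling add overhead, which is a trade-off to measure rather than a given.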
Happy to share a minimal repro clip or a PR if useful.
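For reference, the controlled ffmpeg audio decode mentioned in the suggested fix might look like the sketch below. It assumes an ffmpeg binary on PATH. build_ffmpeg_cmd and run_bounded are hypothetical helper names, and the output format (mono, float32 little-endian PCM at the requested rate) is an illustrative choice rather than what load_audio currently does.

```python
import subprocess


def build_ffmpeg_cmd(path, sampling_rate=16000, ffmpeg_bin="ffmpeg"):
    """Command that writes mono float32 little-endian PCM to stdout."""
    return [
        ffmpeg_bin, "-nostdin", "-v", "error",   # never read stdin, log errors only
        "-i", path,
        "-vn", "-ac", "1", "-ar", str(sampling_rate),  # drop video, mono, resample
        "-f", "f32le", "pipe:1",                 # raw samples to stdout
    ]


def run_bounded(cmd, timeout_s=60.0):
    """Run cmd with a wall-clock limit and return its stdout bytes."""
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        # communicate() reads stdout and stderr together, so a full stderr
        # pipe cannot stall the child while we wait on stdout.
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.communicate()  # reap the killed process and close the pipes
        raise
    if proc.returncode != 0:
        raise RuntimeError(err.decode(errors="replace").strip())
    return out
```

A caller would run something like pcm = run_bounded(build_ffmpeg_cmd(path, sampling_rate)) and, if NumPy is available, turn the bytes into samples with numpy.frombuffer(pcm, dtype="<f4"). On subprocess.TimeoutExpired the caller can log the path and skip the sample.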