Multimodal SFT silently hangs forever on certain clips: load_audio (librosa→audioread) and decord video decode have no timeout (DataLoader deadlock)

### Describe the bug

During multimodal LoRA SFT of **Qwen3-Omni-30B-A3B** with audio-in-video (`USE_AUDIO_IN_VIDEO=1`), training **silently hangs forever** on certain media clips. In our run it froze at a single step for **~9 hours**: the process stayed alive, GPUs at **0% util** (memory still held), and `logging.jsonl` / the tqdm progress bar stopped advancing. **No exception, no timeout, no log line** — completely silent, so the run *looks* healthy while wasting GPUs.

The root cause is **media decoding inside the data pipeline with no timeout**, running native code that blocks the worker indefinitely:

**1. Audio — `swift/template/vision_utils.py::load_audio` (v4.2.2, ~L299–312):**
```python
res = librosa.load(audio_io, sr=sampling_rate)   # PySoundFile fails on some containers
# fallback: audioread.ffdec.FFmpegAudioFile(...) ; librosa.load(...)
```
When `soundfile` can't read a container, librosa falls back to `audioread.ffdec.FFmpegAudioFile`, which spawns an `ffmpeg` subprocess + reader thread and on certain clips **deadlocks** (the well-known audioread/ffmpeg pipe-starvation hang) and never returns. The last log line before the freeze was:
```
.../swift/template/vision_utils.py:310: UserWarning: PySoundFile failed. Trying audioread instead.
  res = librosa.load(audio_io, sr=sampling_rate)
```

**2. Video: the `load_video_*` paths use `decord.VideoReader(...).get_batch(...)`.** decord's `get_batch` can also **busy-loop/deadlock in native C** on certain clips while holding the GIL. Because the call never returns to the interpreter, a Python-level `SIGALRM` handler never gets a chance to run, so a signal-based timeout cannot interrupt it (we verified this independently; it spins at ~100% CPU on one core). Same class of failure, different decoder.

Because decode is in-loop with no bound, **a single bad clip freezes the whole training run silently**.
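Until decode is bounded, the hang can at least be made visible. The following is a minimal diagnostic sketch, not part of ms-swift: it uses the standard-library `faulthandler` watchdog to dump every thread's Python stack once a step takes longer than a budget. The helper names `arm_hang_watchdog` / `disarm_hang_watchdog` and the 600-second budget are assumptions for illustration.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600, log_file=sys.stderr):
    """Dump all thread tracebacks to log_file if not re-armed within timeout_s.

    faulthandler runs the timer in its own watchdog thread, so the dump does
    not depend on the stuck thread returning to the interpreter.
    log_file must be a real file object (it needs a file descriptor).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file, exit=False)


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once training has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling `arm_hang_watchdog()` at the start of every training step resets the timer (a new call replaces the pending one), so a dump only appears when a step stalls. This only reports where the process is stuck; it does not recover the run.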

### Environment
- ms-swift **4.2.2**
- Qwen3-Omni-30B-A3B (audio-in-video, `USE_AUDIO_IN_VIDEO=1`), LoRA SFT, 2× 80 GB GPUs.

### Suggested fix
Make data-pipeline media decode **bounded and recoverable** so one bad clip can't hang the run:
- Wrap `load_audio` / `load_video_*` decode in a **hard wall-clock timeout via a short-lived subprocess that is killed on overrun**. (A `SIGALRM`-based timeout is insufficient: decord's native loop holds the GIL and never returns to the interpreter, and audioread's hang is a blocked read on its `ffmpeg` subprocess pipe.)
- For **video**, prefer **torchcodec** over decord (no equivalent native `get_batch` deadlock).
- For **audio**, avoid the `audioread.ffdec` fallback — decode via a controlled `ffmpeg` subprocess (stderr drained, `communicate(timeout=...)`).
- At minimum: on decode timeout, **log + skip** the offending sample (print its path) instead of blocking forever.
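The audio-side suggestions above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ms-swift implementation: the names `run_bounded`, `decode_audio_ffmpeg`, `load_or_skip`, `DecodeTimeout` and `DecodeError` are hypothetical, an `ffmpeg` binary is assumed to be on `PATH`, and the mono downmix (`-ac 1`) and 60-second budget are illustrative choices. It uses `subprocess` rather than `multiprocessing.Process` because daemonic processes (which is how PyTorch DataLoader workers are started) are not allowed to create `multiprocessing` children.

```python
import array
import subprocess
import sys


class DecodeTimeout(RuntimeError):
    """The decoder process exceeded its wall-clock budget and was killed."""


class DecodeError(RuntimeError):
    """The decoder process exited with a non-zero status."""


def run_bounded(cmd, timeout_s):
    """Run cmd and return its stdout bytes; kill it if it overruns timeout_s."""
    try:
        # subprocess.run drains stdout and stderr together and kills the child
        # when the timeout expires, so a full pipe cannot stall the caller.
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        raise DecodeTimeout(f"killed after {timeout_s}s: {cmd!r}") from exc
    if proc.returncode != 0:
        tail = proc.stderr.decode("utf-8", "replace")[-500:]
        raise DecodeError(f"exit status {proc.returncode}: {tail}")
    return proc.stdout


def pcm_f32le_to_floats(raw):
    """Convert little-endian float32 PCM bytes into an array of floats."""
    samples = array.array("f")
    samples.frombytes(raw[: len(raw) - len(raw) % 4])  # drop a partial sample
    if sys.byteorder == "big":
        samples.byteswap()
    return samples


def decode_audio_ffmpeg(path, sampling_rate=16000, timeout_s=60.0):
    """Decode the audio track of path to mono float32 PCM (assumed settings)."""
    cmd = [
        "ffmpeg", "-nostdin", "-v", "error",
        "-i", path, "-vn",
        "-ac", "1", "-ar", str(sampling_rate),
        "-f", "f32le", "pipe:1",
    ]
    return pcm_f32le_to_floats(run_bounded(cmd, timeout_s))


def load_or_skip(path, **kwargs):
    """Return decoded samples, or None after logging the offending path."""
    try:
        return decode_audio_ffmpeg(path, **kwargs)
    except (DecodeTimeout, DecodeError, OSError) as exc:
        print(f"[skip] {path}: {exc}", file=sys.stderr)
        return None
```

A caller would treat a `None` result as "drop this sample and continue". If a NumPy array is needed downstream, `numpy.frombuffer(raw, dtype="<f4")` on the bytes from `run_bounded` avoids the intermediate `array`. Note that the timeout kills only the direct child process; a decoder that spawns further processes would need process-group handling.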

Happy to share a minimal repro clip or a PR if useful.

