[https://nvbugs/6212252][fix] Select CUTLASS MoE backend on non-Blackwell SMs in TestQwen3_5_35B_A3B::test_fp8 #15081
…well SMs in TestQwen3_5_35B_A3B::test_fp8

DeepGEMM MoE kernels only support datacenter Blackwell (SM100/SM103). On Hopper (SM90) and consumer Blackwell (SM120/SM121) the unsupported kernel trips a scale-factor dtype assertion during autotuner warmup, so the test hard-selecting backend='DEEPGEMM' fails on those GPUs.

Pick the backend by SM version (DEEPGEMM on SM100/SM103, CUTLASS otherwise, which supports FP8 block scales) and drop the corresponding waive entry.

Signed-off-by: xxi <xxi@nvidia.com>
Walkthrough

This PR modifies the Qwen3.5 35B FP8 accuracy test to support multiple GPU architectures. The test's MoE backend configuration is now determined at runtime based on the detected SM version, replacing a hardcoded DEEPGEMM setting. A corresponding test waiver is removed, enabling the test to run on all supported SM versions.

Changes: Qwen3.5 35B MoE test architecture support
/bot run --disable-fail-fast

PR_Github #52672 [ run ] triggered by Bot. Commit:

PR_Github #52672 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #52702 [ run ] triggered by Bot. Commit:

PR_Github #52702 [ run ] completed with state
Description
`TestQwen3_5_35B_A3B::test_fp8` hard-selected `MoeConfig(backend='DEEPGEMM')`. The DeepGEMM MoE kernels only support datacenter Blackwell (SM100/SM103). On Hopper (SM90, e.g. H20) and consumer Blackwell (SM120/SM121) the unsupported kernel runs anyway and trips a scale-factor dtype assertion during autotuner warmup.

The DEEPGEMM dispatch branch in `create_moe.py` has no SM gate (unlike DENSEGEMM/CUTEDSL/TRTLLM, which fall back to CUTLASS), and `DeepGemmFusedMoE.can_implement` (restricted to SM {100, 103}) is never invoked from the dispatch path, so the test fails on GPUs other than datacenter Blackwell.
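
To make the missing gate concrete, here is a toy model of the two dispatch behaviors described above. It is only a sketch: `dispatch_ungated`, `dispatch_gated`, and `deepgemm_can_implement` are hypothetical stand-ins written for illustration, not the functions in `create_moe.py`, and this PR changes the test rather than the dispatch path.

```python
# Supported-SM set for DeepGEMM MoE, as stated in the description.
DEEPGEMM_SUPPORTED_SMS = frozenset({100, 103})


def deepgemm_can_implement(sm_version: int) -> bool:
    """Hypothetical stand-in for a capability check restricted to SM {100, 103}."""
    return sm_version in DEEPGEMM_SUPPORTED_SMS


def dispatch_ungated(backend: str, sm_version: int) -> str:
    """Models a dispatch branch with no SM gate: the request is honored as-is."""
    return backend


def dispatch_gated(backend: str, sm_version: int) -> str:
    """Models a dispatch branch that falls back to CUTLASS when unsupported."""
    if backend == "DEEPGEMM" and not deepgemm_can_implement(sm_version):
        return "CUTLASS"
    return backend


if __name__ == "__main__":
    for sm in (90, 100, 120):
        print(sm, dispatch_ungated("DEEPGEMM", sm), dispatch_gated("DEEPGEMM", sm))
```

In this model the ungated path hands `DEEPGEMM` to SM90 and SM120 unchanged, which corresponds to the case where the unsupported kernel runs and fails later during warmup.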
Fix: select the MoE backend by SM version in the test (`DEEPGEMM` on SM100/SM103, `CUTLASS` otherwise, since CUTLASS supports FP8 block scales) and remove the corresponding waive entry.
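
The selection rule can be sketched as a small pure function. The helper name `select_moe_backend` is an assumption made for illustration and does not appear in the PR; it takes the SM version as an argument so the snippet runs without a GPU, whereas the test would obtain it from `get_sm_version()`.

```python
def select_moe_backend(sm_version: int) -> str:
    """Return the MoE backend name for a given SM version.

    DEEPGEMM is chosen only on datacenter Blackwell (SM100/SM103);
    every other SM falls through to CUTLASS.
    """
    return "DEEPGEMM" if sm_version in (100, 103) else "CUTLASS"


if __name__ == "__main__":
    # In the test, the result would be passed on as MoeConfig(backend=...).
    for sm in (90, 100, 103, 120, 121):
        print(f"SM{sm}: {select_moe_backend(sm)}")
```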
The exact gate `get_sm_version() in (100, 103)` is used rather than the more common `>= 100` on purpose: this test also runs on rtx6k (RTX PRO 6000 Blackwell = SM120) in the QA lists. `>= 100` would wrongly pick `DEEPGEMM` on SM120/SM121 (also unsupported) and introduce a new failure, whereas `in (100, 103)` matches `DeepGemmFusedMoE`'s supported-SM set exactly.

Test Coverage

- `accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[enable_block_reuse=False]`
- `accuracy/test_llm_api_pytorch.py::TestQwen3_5_35B_A3B::test_fp8[enable_block_reuse=True]`

These run on B200 (SM100, `l0_b200.yml`) and rtx6k (SM120, `qa/llm_function_rtx6k.txt`, `qa/llm_function_core.txt`), exercising both the `DEEPGEMM` and `CUTLASS` branches. The previously failing case on Hopper (SM90) now selects the CUTLASS backend.
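
As a rough cross-check of the gate choice against the SM versions named in this description, the sketch below compares the exact gate with the looser `>= 100` gate and lists the SMs where each would select `DEEPGEMM` outside its supported set. The function names are illustrative assumptions, not code from the PR.

```python
# SM versions mentioned in the description and the DeepGEMM-supported subset.
SM_VERSIONS = (90, 100, 103, 120, 121)
DEEPGEMM_SUPPORTED = (100, 103)


def exact_gate(sm_version: int) -> str:
    """Gate used by the fix: membership in the supported-SM set."""
    return "DEEPGEMM" if sm_version in DEEPGEMM_SUPPORTED else "CUTLASS"


def loose_gate(sm_version: int) -> str:
    """The more common threshold gate that the fix deliberately avoids."""
    return "DEEPGEMM" if sm_version >= 100 else "CUTLASS"


def mis_selected(gate, sm_versions=SM_VERSIONS):
    """SMs where the gate picks DEEPGEMM although DeepGEMM does not support them."""
    return [
        sm
        for sm in sm_versions
        if gate(sm) == "DEEPGEMM" and sm not in DEEPGEMM_SUPPORTED
    ]


if __name__ == "__main__":
    print("exact gate mis-selects:", mis_selected(exact_gate))
    print("loose gate mis-selects:", mis_selected(loose_gate))
```

Under these assumptions the threshold gate mis-selects on SM120 and SM121, while the membership gate mis-selects nowhere.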
PR Checklist