Validate fp16 dynamic loss scaling parameters are positive by aryanputta · Pull Request #8050 · deepspeedai/DeepSpeed

aryanputta · 2026-06-06T19:20:51Z

What

fp16.loss_scale_window and fp16.min_loss_scale drive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training:

loss_scale_window is used as stable_interval % self.scale_window in DynamicLossScaler.update_scale (deepspeed/runtime/fp16/loss_scaler.py), so a value of 0 raises ZeroDivisionError mid-training.
min_loss_scale is the loss-scale floor (max(cur_scale / scale_factor, min_scale)); a value <= 0 collapses dynamic loss scaling.

This is the same class of silent-misconfiguration bug as fp16.loss_scale accepting inf, fixed in #7889.

Change

Add a single Pydantic mode="before" field validator on DeepSpeedFP16Config covering both fields. It rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive values, raising a clear ValidationError (e.g. fp16.loss_scale_window must be > 0). Following the #7889 review, mode="before" runs prior to type coercion (so True is rejected), and float() is wrapped in try/except so []/{} surface a clear ValidationError rather than a raw TypeError.

Tests

Adds tests/unit/runtime/test_precision_config_dynamic_scale.py, parametrized over both fields:

invalid: 0, -1, inf, nan, True, [], {} -> ValidationError
valid: 1, 1000, "2" -> accepted

pytest -q tests/unit/runtime/test_precision_config_dynamic_scale.py

The validator logic was verified against the full matrix locally; the import-level test runs under CI.

loss_scale_window and min_loss_scale drive dynamic loss scaling but are not validated, so invalid values silently initialize and fail later: - loss_scale_window is used as `stable_interval % scale_window` in DynamicLossScaler.update_scale, so a value of 0 raises ZeroDivisionError during training. - min_loss_scale is the loss-scale floor; a value <= 0 collapses dynamic loss scaling. Add a Pydantic `mode="before"` field validator to DeepSpeedFP16Config that rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive values for both fields, raising a clear ValidationError. This follows the same pattern as the fp16.loss_scale validation added in deepspeedai#7889. Add unit tests covering invalid values (0, -1, inf, nan, True, [], {}) and valid values for both fields. Signed-off-by: Aryan <aryansputta@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68eae02630

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-06T19:22:57Z

+        if number <= 0:
+            raise ValueError(f"fp16.{name} must be > 0")


Gate dynamic-scale validation on dynamic loss scaling

This rejects existing static-loss-scale configs that happen to carry loss_scale_window or min_loss_scale values such as 0, even though those fields are only used when dynamic loss scaling is enabled. I checked DeepSpeedEngine.dynamic_loss_scale() in deepspeed/runtime/engine.py, which returns true only when fp16.loss_scale == 0; with loss_scale > 0 the optimizer uses the static scale and these dynamic parameters are ignored, so failing config construction here is a compatibility regression for otherwise valid static fp16 setups.

Useful? React with 👍 / 👎.

Address review: loss_scale_window and min_loss_scale only take effect when dynamic loss scaling is active (fp16 enabled and loss_scale == 0, per DeepSpeedEngine.dynamic_loss_scale). Validating them unconditionally rejected otherwise-valid static-loss-scale configs that carry unused values like 0. Replace the per-field validator with a model_validator(mode="after") that checks loss_scale_window > 0 and min_loss_scale > 0 only when fp16 is enabled and loss_scale == 0. Update tests to cover the static and fp16-disabled cases where these fields are ignored. Signed-off-by: Aryan <aryansputta@gmail.com>

aryanputta · 2026-06-06T19:27:08Z

Addressed in 27af879: replaced the per-field validator with a model_validator(mode="after") that only enforces loss_scale_window > 0 and min_loss_scale > 0 when self.enabled and self.loss_scale == 0. Static configs (loss_scale > 0) and fp16-disabled configs now construct fine even if these fields carry 0. Added tests for both the static and fp16-disabled cases.

Thoughts... @tohtana

aryanputta requested review from loadams, tjruwase and tohtana as code owners June 6, 2026 19:20

chatgpt-codex-connector Bot reviewed Jun 6, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validate fp16 dynamic loss scaling parameters are positive#8050

Validate fp16 dynamic loss scaling parameters are positive#8050
aryanputta wants to merge 2 commits into
deepspeedai:masterfrom
aryanputta:validate-fp16-dynamic-loss-scale

aryanputta commented Jun 6, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Uh oh!

aryanputta commented Jun 6, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aryanputta commented Jun 6, 2026

What

Change

Tests

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

aryanputta commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

aryanputta commented Jun 6, 2026 •

edited

Loading