Skip to content

Validate fp16 dynamic loss scaling parameters are positive#8050

Open
aryanputta wants to merge 2 commits into
deepspeedai:masterfrom
aryanputta:validate-fp16-dynamic-loss-scale
Open

Validate fp16 dynamic loss scaling parameters are positive#8050
aryanputta wants to merge 2 commits into
deepspeedai:masterfrom
aryanputta:validate-fp16-dynamic-loss-scale

Conversation

@aryanputta
Copy link
Copy Markdown

What

fp16.loss_scale_window and fp16.min_loss_scale drive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training:

  • loss_scale_window is used as stable_interval % self.scale_window in DynamicLossScaler.update_scale (deepspeed/runtime/fp16/loss_scaler.py), so a value of 0 raises ZeroDivisionError mid-training.
  • min_loss_scale is the loss-scale floor (max(cur_scale / scale_factor, min_scale)); a value <= 0 collapses dynamic loss scaling.

This is the same class of silent-misconfiguration bug as fp16.loss_scale accepting inf, fixed in #7889.

Change

Add a single Pydantic mode="before" field validator on DeepSpeedFP16Config covering both fields. It rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive values, raising a clear ValidationError (e.g. fp16.loss_scale_window must be > 0). Following the #7889 review, mode="before" runs prior to type coercion (so True is rejected), and float() is wrapped in try/except so []/{} surface a clear ValidationError rather than a raw TypeError.

Tests

Adds tests/unit/runtime/test_precision_config_dynamic_scale.py, parametrized over both fields:

  • invalid: 0, -1, inf, nan, True, [], {} -> ValidationError
  • valid: 1, 1000, "2" -> accepted
pytest -q tests/unit/runtime/test_precision_config_dynamic_scale.py

The validator logic was verified against the full matrix locally; the import-level test runs under CI.

loss_scale_window and min_loss_scale drive dynamic loss scaling but are
not validated, so invalid values silently initialize and fail later:

- loss_scale_window is used as `stable_interval % scale_window` in
  DynamicLossScaler.update_scale, so a value of 0 raises ZeroDivisionError
  during training.
- min_loss_scale is the loss-scale floor; a value <= 0 collapses dynamic
  loss scaling.

Add a Pydantic `mode="before"` field validator to DeepSpeedFP16Config that
rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive
values for both fields, raising a clear ValidationError. This follows the
same pattern as the fp16.loss_scale validation added in deepspeedai#7889.

Add unit tests covering invalid values (0, -1, inf, nan, True, [], {}) and
valid values for both fields.

Signed-off-by: Aryan <aryansputta@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68eae02630

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread deepspeed/runtime/precision_config.py Outdated
Comment on lines +149 to +150
if number <= 0:
raise ValueError(f"fp16.{name} must be > 0")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Gate dynamic-scale validation on dynamic loss scaling

This rejects existing static-loss-scale configs that happen to carry loss_scale_window or min_loss_scale values such as 0, even though those fields are only used when dynamic loss scaling is enabled. I checked DeepSpeedEngine.dynamic_loss_scale() in deepspeed/runtime/engine.py, which returns true only when fp16.loss_scale == 0; with loss_scale > 0 the optimizer uses the static scale and these dynamic parameters are ignored, so failing config construction here is a compatibility regression for otherwise valid static fp16 setups.

Useful? React with 👍 / 👎.

Address review: loss_scale_window and min_loss_scale only take effect when
dynamic loss scaling is active (fp16 enabled and loss_scale == 0, per
DeepSpeedEngine.dynamic_loss_scale). Validating them unconditionally rejected
otherwise-valid static-loss-scale configs that carry unused values like 0.

Replace the per-field validator with a model_validator(mode="after") that
checks loss_scale_window > 0 and min_loss_scale > 0 only when fp16 is enabled
and loss_scale == 0. Update tests to cover the static and fp16-disabled cases
where these fields are ignored.

Signed-off-by: Aryan <aryansputta@gmail.com>
@aryanputta
Copy link
Copy Markdown
Author

aryanputta commented Jun 6, 2026

Addressed in 27af879: replaced the per-field validator with a model_validator(mode="after") that only enforces loss_scale_window > 0 and min_loss_scale > 0 when self.enabled and self.loss_scale == 0. Static configs (loss_scale > 0) and fp16-disabled configs now construct fine even if these fields carry 0. Added tests for both the static and fp16-disabled cases.

Thoughts... @tohtana

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant