Validate fp16 dynamic loss scaling parameters are positive#8050
Validate fp16 dynamic loss scaling parameters are positive#8050aryanputta wants to merge 2 commits into
Conversation
loss_scale_window and min_loss_scale drive dynamic loss scaling but are not validated, so invalid values silently initialize and fail later: - loss_scale_window is used as `stable_interval % scale_window` in DynamicLossScaler.update_scale, so a value of 0 raises ZeroDivisionError during training. - min_loss_scale is the loss-scale floor; a value <= 0 collapses dynamic loss scaling. Add a Pydantic `mode="before"` field validator to DeepSpeedFP16Config that rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive values for both fields, raising a clear ValidationError. This follows the same pattern as the fp16.loss_scale validation added in deepspeedai#7889. Add unit tests covering invalid values (0, -1, inf, nan, True, [], {}) and valid values for both fields. Signed-off-by: Aryan <aryansputta@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68eae02630
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if number <= 0: | ||
| raise ValueError(f"fp16.{name} must be > 0") |
There was a problem hiding this comment.
Gate dynamic-scale validation on dynamic loss scaling
This rejects existing static-loss-scale configs that happen to carry loss_scale_window or min_loss_scale values such as 0, even though those fields are only used when dynamic loss scaling is enabled. I checked DeepSpeedEngine.dynamic_loss_scale() in deepspeed/runtime/engine.py, which returns true only when fp16.loss_scale == 0; with loss_scale > 0 the optimizer uses the static scale and these dynamic parameters are ignored, so failing config construction here is a compatibility regression for otherwise valid static fp16 setups.
Useful? React with 👍 / 👎.
Address review: loss_scale_window and min_loss_scale only take effect when dynamic loss scaling is active (fp16 enabled and loss_scale == 0, per DeepSpeedEngine.dynamic_loss_scale). Validating them unconditionally rejected otherwise-valid static-loss-scale configs that carry unused values like 0. Replace the per-field validator with a model_validator(mode="after") that checks loss_scale_window > 0 and min_loss_scale > 0 only when fp16 is enabled and loss_scale == 0. Update tests to cover the static and fp16-disabled cases where these fields are ignored. Signed-off-by: Aryan <aryansputta@gmail.com>
|
Addressed in 27af879: replaced the per-field validator with a Thoughts... @tohtana |
What
fp16.loss_scale_windowandfp16.min_loss_scaledrive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training:loss_scale_windowis used asstable_interval % self.scale_windowinDynamicLossScaler.update_scale(deepspeed/runtime/fp16/loss_scaler.py), so a value of0raisesZeroDivisionErrormid-training.min_loss_scaleis the loss-scale floor (max(cur_scale / scale_factor, min_scale)); a value<= 0collapses dynamic loss scaling.This is the same class of silent-misconfiguration bug as
fp16.loss_scaleacceptinginf, fixed in #7889.Change
Add a single Pydantic
mode="before"field validator onDeepSpeedFP16Configcovering both fields. It rejectsbool, non-numeric, non-finite (inf/-inf/nan), and non-positive values, raising a clearValidationError(e.g.fp16.loss_scale_window must be > 0). Following the #7889 review,mode="before"runs prior to type coercion (soTrueis rejected), andfloat()is wrapped intry/exceptso[]/{}surface a clearValidationErrorrather than a rawTypeError.Tests
Adds
tests/unit/runtime/test_precision_config_dynamic_scale.py, parametrized over both fields:0, -1, inf, nan, True, [], {}->ValidationError1, 1000, "2"-> acceptedThe validator logic was verified against the full matrix locally; the import-level test runs under CI.