Floating point scale degrades FP16 MNIST training accuracy #60

Closed
balancap opened this issue Jan 2, 2024 · 3 comments

balancap commented Jan 2, 2024

The combination of FP16 training and an FP16/FP32 scale in `ScaledArray` degrades training accuracy:

  • TODO numbers!

After investigation, the issue turns out to be related to non-power-of-two scaling factors:

  • The first matmul introduces a sqrt(28) rescaling factor;
  • Propagating this scaling through the following ops (add, matmul, ...) leads to an accumulation of floating point error.

This accumulated floating point error degrades training accuracy.

From this simple MNIST training bug, it seems clear that using only power-of-two scaling is the right strategy to avoid introducing additional floating point error in AutoScale. This is tackled in PR xxx
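For illustration, here is a minimal NumPy sketch (not the AutoScale implementation) of why a non-power-of-two scale such as sqrt(28) hurts in FP16: the scale itself is rounded, and every rescaling multiply/divide rounds the mantissa again, whereas a power-of-two scale only shifts the exponent and round-trips exactly.

```python
# Illustrative sketch only, not AutoScale code: compare FP16 round-trip error
# for a non-power-of-two scale (sqrt(28)) vs. a power-of-two scale (4.0).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1024).astype(np.float16)

scale_sqrt = np.float16(np.sqrt(28))  # non-power-of-two scale, as introduced by the first matmul
scale_pow2 = np.float16(4.0)          # nearest power-of-two scale

# Scale then unscale, staying in FP16 throughout.
err_sqrt = np.max(np.abs((x * scale_sqrt) / scale_sqrt - x))
err_pow2 = np.max(np.abs((x * scale_pow2) / scale_pow2 - x))

print(err_sqrt)  # > 0: every rescaling op rounds the FP16 mantissa
print(err_pow2)  # == 0: a power-of-two scale only changes the exponent
```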

@balancap balancap added the bug Something isn't working label Jan 2, 2024
@balancap balancap self-assigned this Jan 2, 2024
balancap added a commit that referenced this issue Jan 2, 2024
As shown in issue #60, propagating non-power-of-two scaling factors can decrease training accuracy in low precision (typically FP16). The additional rescaling operations introduce non-negligible accumulated floating point errors.

This PR adds the option to round the scale to a power of two in scaled translations, currently supporting only rounding up and down. The rounding mode can be modified in the config dataclass `AutoScaleConfig`.
balancap added a commit that referenced this issue Jan 2, 2024
As shown in issue #60, propagating non-power-of-two scaling factors can decrease training accuracy in low precision (typically FP16). The additional rescaling operations introduce non-negligible accumulated floating point errors.

This PR adds the option to round the scale to a power of two in scaled translations, currently supporting only rounding up and down. The rounding mode can be modified in the config dataclass `AutoScaleConfig`. The updated scaled translations are: `dot_general`, `add`, `sub` and `reduce_sum`.
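As a rough sketch of the rounding described above (the helper names below are hypothetical, not the actual `AutoScaleConfig` API), rounding a scale down or up to a power of two amounts to flooring or ceiling its base-2 logarithm:

```python
# Hypothetical helpers illustrating power-of-two rounding of a scale;
# not the actual AutoScale implementation.
import numpy as np

def pow2_round_down(scale):
    """Round a positive scale down to the nearest power of two."""
    return np.exp2(np.floor(np.log2(scale)))

def pow2_round_up(scale):
    """Round a positive scale up to the nearest power of two."""
    return np.exp2(np.ceil(np.log2(scale)))

print(pow2_round_down(np.sqrt(28)))  # 4.0
print(pow2_round_up(np.sqrt(28)))    # 8.0
```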
balancap added a commit that referenced this issue Jan 3, 2024
…#65)

As shown in issue #60, propagating non-power-of-two scaling factors can decrease training accuracy in low precision (typically FP16). The additional rescaling operations introduce non-negligible accumulated floating point errors.

This PR adds the option to round the scale to a power of two in scaled translations, currently supporting only rounding up and down. The rounding mode can be modified in the config dataclass `AutoScaleConfig`. The updated scaled translations are: `dot_general`, `add`, `sub` and `reduce_sum`.

Finally, when implicitly converting scalars to scaled arrays, the method `make_scaled_scaled` now splits the input mantissa and exponent between data and scale.
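The mantissa/exponent split mentioned above can be illustrated with `np.frexp` (a conceptual sketch, not the actual `make_scaled_scaled` implementation): the exponent part becomes an exact power-of-two scale, and the mantissa part becomes the data.

```python
# Conceptual sketch of splitting a scalar into (data, power-of-two scale);
# not the actual make_scaled_scaled implementation.
import numpy as np

def split_mantissa_exponent(value):
    """Return (mantissa, scale) with value == mantissa * scale and
    scale an exact power of two."""
    mantissa, exponent = np.frexp(value)  # value == mantissa * 2**exponent, mantissa in [0.5, 1)
    return mantissa, np.exp2(exponent).astype(value.dtype)

data, scale = split_mantissa_exponent(np.float32(3.14))
print(data, scale)   # 0.785 4.0
print(data * scale)  # 3.14, reconstructed exactly
```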
balancap commented Jan 3, 2024

MNIST accuracy following PR #67 and power-of-two unit scaling rules.

| Config | Training acc. | Test acc. |
| --- | --- | --- |
| Base FP32 | 0.97117 | 0.94070 |
| AS FP32 | 0.97222 | 0.94030 |
| Base FP16 | 0.96815 | 0.93830 |
| AS FP16 | 0.96772 | 0.93770 |

TODO: extensive analysis on learning rate + weight initialization. cc @DouglasOrr @thecharlieblake

balancap commented Jan 3, 2024

Additional experiment: when using `param_scale=0.125`, the drop in accuracy is larger:

| Config | Training acc. | Test acc. |
| --- | --- | --- |
| AS FP32 | 0.91398 | 0.90490 |
| AS FP16 | 0.86487 | 0.86530 |

No difference between Normal and AutoScale modes.

balancap commented Feb 5, 2024

Closing, as the latest MNIST numbers in Normal and AutoScale modes match.

@balancap balancap closed this as completed Feb 5, 2024