Floating point scale degrades FP16 MNIST training accuracy #60
Labels: bug
Comments
This was referenced Jan 2, 2024
balancap added a commit that referenced this issue on Jan 2, 2024:

As shown in issue #60, propagating non-power-of-two scaling factors can decrease training accuracy in low precision (typically FP16): the additional rescaling operations introduce non-negligible accumulated floating point error. This PR adds the option to round the scale to a power of two in scaled translations. At the moment only rounding up and down are supported; the rounding mode can be changed in the config dataclass `AutoScaleConfig`.
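For illustration, a minimal NumPy sketch of the power-of-two rounding this PR describes. The helper names `pow2_round_down` and `pow2_round_up` are assumptions made for this example, not the library's actual API:

```python
import numpy as np

def pow2_round_down(scale):
    # Largest power of two <= scale, i.e. 2**floor(log2(scale)).
    return np.exp2(np.floor(np.log2(scale)))

def pow2_round_up(scale):
    # Smallest power of two >= scale, i.e. 2**ceil(log2(scale)).
    return np.exp2(np.ceil(np.log2(scale)))

scale = np.float32(np.sqrt(28))   # a typical non-power-of-two scale, ~5.29
print(pow2_round_down(scale))     # 4.0
print(pow2_round_up(scale))       # 8.0
```

With a power-of-two scale, multiplying or dividing by the scale is a pure exponent shift and therefore exact in floating point (barring overflow or underflow), which is the property the rounding option relies on.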
balancap added a commit that referenced this issue on Jan 3, 2024:

(…#65) As shown in issue #60, propagating non-power-of-two scaling factors can decrease training accuracy in low precision (typically FP16): the additional rescaling operations introduce non-negligible accumulated floating point error. This PR adds the option to round the scale to a power of two in scaled translations. At the moment only rounding up and down are supported; the rounding mode can be changed in the config dataclass `AutoScaleConfig`. The updated scaled translations are `dot_general`, `add`, `sub` and `reduce_sum`. Finally, when implicitly converting scalars to scaled arrays, the method `make_scaled_scaled` now splits the input mantissa and exponent between data and scale.
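To make the last point concrete, a hedged NumPy sketch of splitting a scalar's mantissa and exponent between data and scale; `split_scalar` and its signature are hypothetical, used only to illustrate the idea, and are not the library's `make_scaled_scaled` implementation:

```python
import numpy as np

def split_scalar(value, dtype=np.float16):
    # value == mantissa * 2**exponent, with mantissa in [0.5, 1).
    mantissa, exponent = np.frexp(np.float32(value))
    data = np.asarray(mantissa, dtype=dtype)   # mantissa kept in the low-precision data
    scale = np.exp2(np.float32(exponent))      # exact power-of-two scale factor
    return data, scale

data, scale = split_scalar(3.0)
print(data, scale, float(data) * float(scale))   # 0.75 4.0 3.0
```

Keeping only the mantissa in the data and the exponent in the scale makes the scale a power of two by construction, consistent with the rounding option above.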
MNIST accuracy following PR #67 and power-of-two unit scaling rules.
TODO: extensive analysis on learning rate + weight initialization. cc @DouglasOrr @thecharlieblake

Additional experiment: when using …
No difference between normal and AutoScale modes.

Closing as latest MNIST numbers in Normal and AutoScale mode match.
The combination of FP16 training and FP16/FP32 scale in `ScaledArray` is degrading training accuracy. After investigation, it happens to be related to non-power-of-two scaling factors:

- the `sqrt(28)` rescaling;
- propagating this non-power-of-two scale through scaled ops (`add`, `matmul`, ...) leads to accumulation of floating point error;
- the accumulation of floating point error leads to degradation in training accuracy.

From this simple MNIST training bug, it seems clear that only using power-of-two scaling is the right strategy in order to avoid introducing additional floating point error in `AutoScale`. This is tackled in PR xxx.
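A small NumPy experiment (not part of the original issue) illustrating the failure mode: rescaling by a power of two round-trips exactly in FP16, while rescaling by `sqrt(28)` introduces rounding error at every step:

```python
import numpy as np

rng = np.random.default_rng(0)
# Values of moderate magnitude, so power-of-two rescaling stays in FP16's normal range.
x = rng.uniform(0.25, 4.0, size=1024).astype(np.float16)

def max_roundtrip_error(x, scale):
    # Divide by the scale and multiply back, all in FP16, as repeated rescaling ops would.
    s = np.float16(scale)
    y = x / s   # FP16 division, rounded to nearest
    z = y * s   # FP16 multiplication, rounded to nearest
    return np.abs(z.astype(np.float32) - x.astype(np.float32)).max()

print("power-of-two scale 4.0:    ", max_roundtrip_error(x, 4.0))          # 0.0 (exact)
print("non-power-of-two sqrt(28): ", max_roundtrip_error(x, np.sqrt(28)))  # > 0 (rounding error)
```

Every scaled op repeats a rescaling of this kind, so the per-op errors accumulate over a training run, which matches the accuracy degradation described above.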