Use power-of-two scaling in autoscale scaled translation ops rules. #65
As shown in issue #60, propagating non power-of-two scaling factors can decrease training accuracy in low precision (typically FP16): the additional rescaling operations introduce non-negligible accumulated floating-point errors.
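For a quick illustration of the issue (my example, not from the PR): in FP16, rescaling by a power of two only shifts the exponent bits and is exact, while an arbitrary scale rounds the mantissa on every rescale:

```python
import numpy as np

x = np.float16(0.1)
# Power-of-two rescaling round-trips exactly.
print((x * np.float16(4.0)) / np.float16(4.0) == x)  # True: 4 == 2**2
# Non power-of-two rescaling rounds twice and loses bits.
print((x * np.float16(3.0)) / np.float16(3.0) == x)  # False
```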
This PR adds the option to round the scale to a power of two in scaled translations. Only rounding up and down are supported at the moment; the rounding mode can be modified in the `AutoScaleConfig` config dataclass. The scaled translations updated are: `dot_general`, `add`, `sub` and `reduce_sum`.
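As a rough sketch of what power-of-two rounding of a scale can look like (standalone NumPy; function names are hypothetical, not necessarily the PR's implementation):

```python
import numpy as np

def pow2_round_down(scale: np.ndarray) -> np.ndarray:
    # Largest power of two <= scale (assumes scale > 0).
    # frexp returns scale = mant * 2**exp with mant in [0.5, 1),
    # so 2**(exp - 1) rounds the scale down to a power of two.
    _, exp = np.frexp(scale)
    return np.ldexp(np.ones_like(scale), exp - 1)

def pow2_round_up(scale: np.ndarray) -> np.ndarray:
    # Smallest power of two >= scale (exact powers of two map to themselves).
    mant, exp = np.frexp(scale)
    exp = np.where(mant == 0.5, exp - 1, exp)
    return np.ldexp(np.ones_like(scale), exp)

s = np.float32(3.0)
print(pow2_round_down(s), pow2_round_up(s))  # 2.0 4.0
```

The rounding mode stored in `AutoScaleConfig` then selects which of the two directions is applied to the propagated scale.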
Finally, when implicitly converting scalars to scaled arrays, the method `make_scaled_scaled` now splits the input mantissa and exponent between data and scale.
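For intuition, a standalone sketch of such a mantissa/exponent split (assumed logic based on the description above, using `np.frexp`; the real method builds the library's scaled-array type from the pair):

```python
import numpy as np

def split_scalar(val: np.ndarray):
    # val = mantissa * 2**exponent, with mantissa in [0.5, 1).
    # The mantissa becomes the data, and the exponent becomes the scale,
    # which is therefore an exact power of two by construction.
    mantissa, exponent = np.frexp(val)
    scale = np.ldexp(np.ones_like(val), exponent)
    return mantissa, scale

data, scale = split_scalar(np.float16(3.0))
print(data, scale)  # 0.75 4.0  (0.75 * 4.0 == 3.0)
```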