Clarification for the precision part in fp8_primer #835
Inside the fp8_primer it says:

However, I do not understand the reason. Could someone help me understand which inputs are representable in FP8, and how they differ from inputs that are merely cast into FP8? And what exactly is the reason that the FP8-representable case shows a smaller gap compared with FP32, and how can that help with debugging?
Glad that it's of some help so far.
Re "both the input and linear weights are already fp8 representable": By "fp8 representable", to be clear, I mean you can cast it to fp8 then recast it to fp32 and get the same value, i.e. not lose any information from quantization. The input and linear weights that went into making out_fp32 do not appear to already be fp8 representable; they're randomly generated in the space of fp32's. Indeed, aggregating across a few cells, out_fp32 is calculated as:
Re "you als…