Support for reduced precision (#104) #317
Conversation
I'd rather not accept this until we have resolved the issue described here: #104 (comment). Moreover, we should have tests for reduced-precision model performance if it's the kind of thing that can be impacted by accident. The rest seems fine. Thanks for the contribution! To do list:
What is the dtype then, when model = HookedTransformer.from_pretrained("solu-1l", load_in_8bit=True, device_map="auto")?
Torch doesn't support 8-bit floating-point models, so there is no corresponding torch dtype, and 8-bit loading relies on separate libraries like bitsandbytes and accelerate. It is bug-prone, so I should probably add further checks. But it would perhaps be wiser to set whatever is not in 8 bits to bfloat16, since there is no float8 dtype. I'll try to figure it out and update the PR in the coming days.
Actually my previous answer was a bit inaccurate. Quantization doesn't use 8-bit floats, it uses 8-bit integers with a scaling factor. And there is an int8 dtype in torch, but it doesn't seem to be used by the transformers library; as I understand it, the hidden states are typically kept in half precision and only some operations are quantized. I recommend this article if you are interested in the details: https://huggingface.co/blog/hf-bitsandbytes-integration.
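To make the "8-bit integers with a scaling factor" idea concrete, here is a minimal absmax-quantization sketch in plain PyTorch. It only illustrates the core idea from the blog post; the actual bitsandbytes LLM.int8() scheme is more involved (vector-wise quantization plus a mixed-precision decomposition for outliers).

```python
import torch

def quantize_absmax(x: torch.Tensor):
    """Quantize a float tensor to int8 with a single absmax scaling factor."""
    scale = 127.0 / x.abs().max()
    x_int8 = (x * scale).round().clamp(-127, 127).to(torch.int8)
    return x_int8, scale

def dequantize(x_int8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Map the int8 values back to floats (here float16, as in half-precision models)."""
    return x_int8.to(torch.float16) / scale

x = torch.randn(4, 4)
x_int8, scale = quantize_absmax(x)
print((x - dequantize(x_int8, scale).float()).abs().max())  # small quantization error
```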
That's a great blogpost, thanks! Yeah, it looks like the HuggingFace bitsandbytes integration only quantises nn.Linear layers, and HookedTransformer has no nn.Linear layers.
The fact that we use einsum instead of nn.Linear seems like the main reason for the remaining differences between the predictions of TransformerLens and Hugging Face in half precision.
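As an illustration of why that can matter (my own sketch, not TransformerLens code): the same linear map computed via einsum and via nn.functional.linear may use different kernels and accumulation orders, so half-precision outputs need not match bit-for-bit.

```python
import torch

# Use the GPU if available; half-precision matmul kernels are not always available on CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

torch.manual_seed(0)
x = torch.randn(8, 512, device=device, dtype=dtype)
W = torch.randn(512, 512, device=device, dtype=dtype)

out_einsum = torch.einsum("bd,de->be", x, W)       # einsum path (as in TransformerLens)
out_linear = torch.nn.functional.linear(x, W.T)    # nn.Linear-style path (as in Hugging Face)
print((out_einsum - out_linear).abs().max())       # may be zero or a small nonzero value
```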
transformer_lens/components.py (Outdated)
@@ -377,7 +409,7 @@ def __init__(
        else:
            raise ValueError(f"Invalid attention type: {self.attn_type}")

-        self.register_buffer("IGNORE", torch.tensor(-1e5))
+        self.register_buffer("IGNORE", torch.tensor(torch.finfo(cfg.dtype).min))
It may be controversial, but I modified the value of the "IGNORE" buffer from -1e5 to torch.finfo(cfg.dtype).min. This shouldn't change much in practice, I just thought it was cleaner and closer to what Hugging Face does. I can revert that if needed.
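(Side note, not part of the original comment: the constant does matter at reduced precision, because -1e5 is outside the float16 range and overflows to -inf when cast, whereas torch.finfo(dtype).min is representable by construction.)

```python
import torch

print(torch.finfo(torch.float16).min)           # -65504.0
print(torch.tensor(-1e5, dtype=torch.float16))  # tensor(-inf): -1e5 overflows in float16
```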
I'll revert that since it is done in PR #319.
# If using 16 bits, increase the precision to avoid numerical instabilities
q = q.to(torch.float32)
k = k.to(torch.float32)
I chose to convert to float32 before the attention layer, instead of dividing sooner by self.attn_scale as proposed, mostly because that is what is done in the GPT-Neo architecture in def _attn, and it looked a bit more reliable.
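For illustration, here is a rough sketch of that pattern (modelled loosely on GPT-Neo's _attn; the shapes and helper names are my own, not the PR's code): upcast q and k to float32 for the score computation, then return the attention pattern in the original working dtype.

```python
import torch

def attention_pattern(q, k, attn_scale, mask_value):
    """Compute a causal attention pattern with a float32 upcast for stability."""
    orig_dtype = q.dtype
    # If using 16 bits, increase the precision to avoid numerical instabilities
    q = q.to(torch.float32)
    k = k.to(torch.float32)
    scores = torch.einsum("bqh,bkh->bqk", q, k) / attn_scale
    # Causal mask: positions attending to the future get a very negative score.
    causal = torch.tril(torch.ones(scores.shape[-2:], dtype=torch.bool))
    scores = torch.where(causal, scores, torch.tensor(mask_value, dtype=scores.dtype))
    return scores.softmax(dim=-1).to(orig_dtype)

q = torch.randn(1, 4, 8, dtype=torch.float16)
k = torch.randn(1, 4, 8, dtype=torch.float16)
pattern = attention_pattern(q, k, attn_scale=8 ** 0.5, mask_value=torch.finfo(torch.float32).min)
print(pattern.dtype, pattern.shape)  # torch.float16 torch.Size([1, 4, 4])
```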
By the way, it's a problem that we can't self-assign issues, because when we choose an issue we don't know if someone is already working on it. Perhaps the permission to self-assign issues could be made public, or we may have to systematically add a comment to indicate when we start working on an issue.
(Rebased on the main branch to resolve conflicts.)
@jbloomAus, @slavachalnev, thanks in advance if you can give a review.
I'm on holiday ATM but can review once @slavachalnev has taken a look, depending on what's needed. Thanks for doing this!
Looks good but need to fix a few things.

When I run tests on GPU, I get four failures:

FAILED test_hooked_transformer.py::test_dtypes[dtype0] - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking...
FAILED test_hooked_transformer.py::test_dtypes[dtype1] - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking...
FAILED test_hooked_transformer.py::test_half_precision[dtype0] - RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking...
FAILED test_hooked_transformer.py::test_half_precision[dtype1] - RuntimeError: "baddbmm_with_gemm" not implemented for 'Half'

The first two are easily fixed by moving the tokens to device in the test, e.g.

def check_performance(tl_model, hf_model, margin=0.01, device='cpu'):  # Added device here
    """
    Check that the TransformerLens model and the HuggingFace model have
    approximately the same confidence in the expected answer.
    """
    prompt = " Unable"
    tokens = tl_model.tokenizer(prompt, return_tensors="pt")["input_ids"].to(device)  # Added device here
    expected_token = tl_model.tokenizer.encode(" to")[
        0
    ]  # Assume this is the expected token to predict
    tl_logits = tl_model(tokens, prepend_bos=False)[0, -1].float()
    hf_logits = hf_model(tokens).logits[0, -1].float()
    tl_prob = torch.softmax(tl_logits, dim=-1)[expected_token].item()
    hf_prob = torch.softmax(hf_logits, dim=-1)[expected_token].item()
    assert tl_prob + margin > hf_prob

Then we are left with a bfloat16 precision error being higher than the set margin: assert (0.15567666292190552 + 0.005) > 0.17949192225933075. Probably acceptable to increase the margin, but not sure.

The final error is caused by a bunch of operations not being defined for half precision on CPU. In load_from_pretrained we do preprocessing on CPU, which causes an error. I think that when loading from pretrained, we need to use float32 and move the model to the target dtype after it's loaded. What do you think?

I would also modify the move_to_and_update_config function so that it updates the config when you move the model to a different dtype. The reason this matters is that you check the dtype in the config during the Attention forward pass, so it should be up to date.
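A hedged sketch of what that move_to_and_update_config change could look like (the function name is taken from the comment above; the body and signature here are my assumptions, not the actual TransformerLens code):

```python
import torch
import torch.nn as nn

def move_to_and_update_config(model, device_or_dtype):
    """Move the model and keep its config in sync with the new device or dtype."""
    if isinstance(device_or_dtype, torch.dtype):
        model.cfg.dtype = device_or_dtype        # proposed: also track dtype changes
    elif isinstance(device_or_dtype, (torch.device, str)):
        model.cfg.device = device_or_dtype       # existing behaviour: track device changes
    return nn.Module.to(model, device_or_dtype)
```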
My recommendation is to disable LayerNorm folding in float16 and make those models use loading_from_pretrained_no_processing. If you really care about memory you often want to avoid doing this post-processing anyway, and it's not essential, though it is nice. And a user who cares can manually do it on the GPU themselves, or do it on the CPU in fp32. But it's important that people can load large models in memory-constrained environments like a Colab notebook.
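For reference, a hedged example of that workflow (the exact keyword names are an assumption based on this discussion, not a guarantee of the final interface): skip weight processing such as LayerNorm folding and load directly in half precision.

```python
import torch
from transformer_lens import HookedTransformer

# Load without weight processing (no LayerNorm folding etc.) directly in float16,
# which keeps peak memory low in constrained environments such as a Colab notebook.
model = HookedTransformer.from_pretrained_no_processing(
    "gpt2",
    dtype=torch.float16,
    device="cuda",
)
```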
Also, thanks so much for making the PR! Real mixed precision support has been on my wishlist for ages.
Oops, sorry @slavachalnev, I missed the bugs induced by the rebase. I added a warning to advise using … Most of the time the results are pretty similar between TransformerLens and Hugging Face. But it's still a bit weird that for EleutherAI/pythia-70m with bfloat16, transformer_lens with …
Now getting an error in …
May want to convert probs to float32 before sampling. Otherwise, changes look good.
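A minimal sketch of the suggested fix (my own example, not the actual TransformerLens sampling code), assuming the sampling step uses torch.multinomial: upcast the probabilities to float32 first, since multinomial can fail or lose precision on half-precision inputs (e.g. on CPU).

```python
import torch

probs = torch.tensor([[0.1, 0.2, 0.7]], dtype=torch.bfloat16)           # pretend half-precision probs
next_token = torch.multinomial(probs.to(torch.float32), num_samples=1)  # upcast before sampling
print(next_token)
```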
@jbloomAus this looks good to me
Description
HookedTransformerKeyValueCacheEntry wasn't compatible with other dtypes, so the argument past_kv_cache_entry wasn't working.

There is also an optimization: instead of initializing the model in torch.float32 and then converting it to the desired dtype, we now directly initialize the model layers in the desired dtype.

The attribute dtype was added to the configuration with the default value torch.float32. I hope this is ok; it's practical to have access to it from anywhere.

Also added a test for 8-bit loading, which is skipped if there is no GPU.
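A hedged usage sketch of what the description implies (keyword names are assumptions drawn from this thread, not a documented interface): pass the desired dtype at load time, and the config records it.

```python
import torch
from transformer_lens import HookedTransformer

# Initialize the model directly in the requested dtype instead of converting from float32.
model = HookedTransformer.from_pretrained("gpt2", dtype=torch.bfloat16)
print(model.cfg.dtype)   # expected: torch.bfloat16

# 8-bit loading goes through bitsandbytes/accelerate and needs a GPU:
# model = HookedTransformer.from_pretrained("solu-1l", load_in_8bit=True, device_map="auto")
```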
Fixes #104