Error in Generation of Embeddings from list of sequences via ProtT5 #134

Open

amalislam675 opened this issue Nov 1, 2023 · 5 comments

@amalislam675

I am generating embeddings for my protein sequences with ProtT5 using the code below. I have 5000 protein sequences in total, which I pass as a list. I set the max_length parameter to 500, but I get an out-of-memory error. Can you help me fix this? I want to generate per-protein embeddings; the final output I need has shape (5000, 1024).

RuntimeError: CUDA out of memory. Tried to allocate 10.77 GiB

Code:
```python
p_sequence = list(p_sequence)

from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
if device.type == 'cpu':
    model.float()
else:
    model.half()

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in p_sequence]

# tokenize sequences and pad/truncate to max_length=500
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="max_length", truncation=True, max_length=500)

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for each sequence in the batch and remove padded and special tokens
emb_list = []
for i in range(len(sequence_examples)):
    seq_len = int(attention_mask[i].sum().item())              # real tokens incl. trailing </s>
    emb_i = embedding_repr.last_hidden_state[i, :seq_len - 1]  # drop padding and the </s> token
    emb_list.append(emb_i)

# take mean of embedding vectors for the entire protein
emb_per_protein_list = []
for emb in emb_list:
    emb_per_protein = torch.mean(emb, dim=0)
    emb_per_protein_list.append(emb_per_protein)
```

@amalislam675 amalislam675 changed the title Generation of Embeddings from list of sequences via ProtT5 Error in Generation of Embeddings from list of sequences via ProtT5 Nov 1, 2023
@mheinzinger
Collaborator

Seems like you are running out of vRAM.
Try to generate embeddings for each protein in your set individually (from the code above it looks like you are embedding all proteins simultaneously).
If this does not resolve your issue, you might have to lower max_length even further (though I suspect that switching to single-sequence processing instead of batching already solves the issue).
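
For what it's worth, here is a minimal sketch of that single-sequence approach, reusing `tokenizer`, `model`, `device`, and the cleaned `sequence_examples` from the code above (the final stacking step and the (5000, 1024) shape are assumptions based on the output described in the issue):

```python
# Per-protein embedding loop: one sequence per forward pass to keep vRAM usage low.
import torch

emb_per_protein_list = []
with torch.no_grad():
    for seq in sequence_examples:
        ids = tokenizer(seq, add_special_tokens=True, return_tensors="pt").to(device)
        out = model(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])
        residue_emb = out.last_hidden_state[0, :-1]    # drop the trailing </s> token
        emb_per_protein_list.append(residue_emb.mean(dim=0).cpu())

embeddings = torch.stack(emb_per_protein_list)         # (5000, 1024) for 5000 sequences
```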

@amalislam675
Author

@mheinzinger, thanks, I have solved my issue. Can you please tell me how to select the maximum residue length for our protein sequences? In the ProtT5 model, there is an option to set max_length for the residues.

@mheinzinger
Collaborator

I usually do not set the parameter at all. ProtT5 has learnt positional encodings and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger an out-of-memory error on my GPU; those proteins get removed from the dataset.
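
A hedged sketch of that strategy (no `max_length`; skip whatever does not fit in vRAM). Checking the RuntimeError message for "out of memory" is an assumption, but it is how CUDA OOM usually surfaces in PyTorch:

```python
# Embed full-length proteins and drop the ones that exceed GPU memory.
embeddings, skipped = {}, []
with torch.no_grad():
    for idx, seq in enumerate(sequence_examples):
        try:
            ids = tokenizer(seq, add_special_tokens=True, return_tensors="pt").to(device)
            out = model(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])
            embeddings[idx] = out.last_hidden_state[0, :-1].mean(dim=0).cpu()
        except RuntimeError as e:              # CUDA OOM surfaces as a RuntimeError
            if "out of memory" not in str(e):
                raise
            skipped.append(idx)                # remove this protein from the dataset
            torch.cuda.empty_cache()
```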

@amalislam675
Author

> I usually do not set the parameter at all. ProtT5 has learnt positional encodings and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger an out-of-memory error on my GPU; those proteins get removed from the dataset.

@mheinzinger, in my use case the protein sequences are not PDB chains; they are generated against proteins that belong to reviewed Swiss-Prot entries of UniProtKB. I want to do feature extraction for my protein sequences with the ProtT5 model. Can you tell me which code fits my use case better: the one in this link https://colab.research.google.com/drive/1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing , or the other one here: https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing ? Should I generate per-protein representations or per-residue representations? And if I pass a single protein sequence to ProtT5 instead of a batch, will the embedding it generates be the same as the one produced when sequences are provided in a batch, or do we get more optimal results by providing sequences in batches?

@mheinzinger
Collaborator

Providing sequences as a batch or processing them as single sequences should not make a difference (except for batching being faster).
Whether you want to generate per-residue or per-protein embeddings is completely up to your use case, so I cannot tell, sorry.
The first notebook provides an example of how to also run a predictor on top of the embeddings. In contrast, the second notebook only has the embedding-generation part. So if you are solely interested in generating embeddings without any prediction, the second link is probably easier (but the first one should give you the same plus more).
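
One practical caveat for getting identical results with batching: when sequences in a batch have different lengths, the padded positions must be excluded before mean-pooling, otherwise the per-protein vectors drift away from the single-sequence results. A minimal sketch, assuming `embedding_repr` and `attention_mask` as in the code from the first post:

```python
# Mask-aware per-protein pooling: average only over real residues,
# excluding padding and the trailing </s> token added by the tokenizer.
pooled = []
for i in range(attention_mask.shape[0]):
    seq_len = int(attention_mask[i].sum().item())   # real tokens incl. </s>
    pooled.append(embedding_repr.last_hidden_state[i, :seq_len - 1].mean(dim=0))
per_protein = torch.stack(pooled)                   # (batch_size, 1024)
```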
