Error in Generation of Embeddings from list of sequences via ProtT5 #134

Open

amalislam675 opened this issue Nov 1, 2023 · 5 comments

@amalislam675

I am generating embeddings for my protein sequences with ProtT5 using the code below. I have 5000 protein sequences in total, which I pass as a list. I set the max_length parameter to 500, but I get an out-of-memory error. Can you help me fix this? I want to generate per-protein embeddings; the final output I need has shape (5000, 1024).

RuntimeError: CUDA out of memory. Tried to allocate 10.77 GiB

Code:
```python
p_sequence = list(p_sequence)

from transformers import T5Tokenizer, T5EncoderModel
import torch
import re

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)

# only GPUs support half-precision currently; if you want to run on CPU use full-precision (not recommended, much slower)
if device.type == 'cpu':
    model.float()
else:
    model.half()

# replace all rare/ambiguous amino acids by X and introduce white-space between all amino acids
sequence_examples = [" ".join(list(re.sub(r"[UZOB]", "X", sequence))) for sequence in p_sequence]

# tokenize sequences and pad/truncate to max_length=500
ids = tokenizer(sequence_examples, add_special_tokens=True, padding="max_length", truncation=True, max_length=500)

input_ids = torch.tensor(ids['input_ids']).to(device)
attention_mask = torch.tensor(ids['attention_mask']).to(device)

# generate embeddings
with torch.no_grad():
    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)

# extract residue embeddings for each sequence in the batch and remove padded and special tokens
emb_list = []
for i in range(len(sequence_examples)):
    seq_len = int(attention_mask[i].sum().item())              # real tokens incl. trailing </s>
    emb_i = embedding_repr.last_hidden_state[i, :seq_len - 1]  # drop padding and the </s> token
    emb_list.append(emb_i)

# take mean of embedding vectors for the entire protein
emb_per_protein_list = []
for emb in emb_list:
    emb_per_protein = torch.mean(emb, dim=0)
    emb_per_protein_list.append(emb_per_protein)
```

@amalislam675 amalislam675 changed the title Generation of Embeddings from list of sequences via ProtT5 Error in Generation of Embeddings from list of sequences via ProtT5 Nov 1, 2023
@mheinzinger
Collaborator

Seems like you are running out of vRAM.
Try to generate embeddings for each protein in your set individually (from the code above it looks like you are embedding all proteins simultaneously).
If this does not resolve your issue, you might have to lower max_length even further (though I suspect that switching to single-sequence processing instead of batching already solves the issue).
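
For what it's worth, here is a minimal sketch of that single-sequence approach, reusing `tokenizer`, `model`, `device`, and the cleaned `sequence_examples` from the code above (the final stacking step and the (5000, 1024) shape are assumptions based on the output described in the issue):

```python
# Per-protein embedding loop: one sequence per forward pass to keep vRAM usage low.
import torch

emb_per_protein_list = []
with torch.no_grad():
    for seq in sequence_examples:
        ids = tokenizer(seq, add_special_tokens=True, return_tensors="pt").to(device)
        out = model(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])
        residue_emb = out.last_hidden_state[0, :-1]    # drop the trailing </s> token
        emb_per_protein_list.append(residue_emb.mean(dim=0).cpu())

embeddings = torch.stack(emb_per_protein_list)         # (5000, 1024) for 5000 sequences
```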

@amalislam675
Author

@mheinzinger, thanks, I have solved my issue. Can you please tell me how to select the maximum residue length for our protein sequences? In the ProtT5 model, there is an option to set max_length for the residues.

@mheinzinger
Collaborator

I usually do not set the parameter at all. ProtT5 has learnt positional encodings and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger an out-of-memory error on my GPU; those proteins get removed from the dataset.
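
A hedged sketch of that strategy (no `max_length`; skip whatever does not fit in vRAM). Checking the RuntimeError message for "out of memory" is an assumption, but it is how CUDA OOM usually surfaces in PyTorch:

```python
# Embed full-length proteins and drop the ones that exceed GPU memory.
embeddings, skipped = {}, []
with torch.no_grad():
    for idx, seq in enumerate(sequence_examples):
        try:
            ids = tokenizer(seq, add_special_tokens=True, return_tensors="pt").to(device)
            out = model(input_ids=ids["input_ids"], attention_mask=ids["attention_mask"])
            embeddings[idx] = out.last_hidden_state[0, :-1].mean(dim=0).cpu()
        except RuntimeError as e:              # CUDA OOM surfaces as a RuntimeError
            if "out of memory" not in str(e):
                raise
            skipped.append(idx)                # remove this protein from the dataset
            torch.cuda.empty_cache()
```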

@amalislam675
Author

> I usually do not set the parameter at all. ProtT5 has learnt positional encodings and can (to a certain extent) also embed protein sequences longer than the ones seen during training. I always embed full-length proteins up to the point where they trigger an out-of-memory error on my GPU; those proteins get removed from the dataset.

@mheinzinger, in my use case the protein sequences are not PDB chains; they are generated against proteins that belong to reviewed Swiss-Prot entries of UniProtKB. I want to do feature extraction for my protein sequences with the ProtT5 model. Can you tell me which code fits my use case better: the one in this link https://colab.research.google.com/drive/1TUj-ayG3WO52n5N50S7KH9vtt6zRkdmj?usp=sharing , or the other one here: https://colab.research.google.com/drive/1h7F5v5xkE_ly-1bTQSu-1xaLtTP2TnLF?usp=sharing ? Should I generate per-protein representations or per-residue representations? And if I pass a single protein sequence to ProtT5 instead of a batch, will the embedding it generates be the same as the one produced when sequences are provided in a batch, or do we get more optimal results by providing sequences in batches?

@mheinzinger
Collaborator

Providing sequences as a batch or processing them as single sequences should not make a difference (except for batching being faster).
Whether you want to generate per-residue or per-protein embeddings is completely up to your use case, so I cannot tell, sorry.
The first notebook provides an example of how to also run a predictor on top of the embeddings. In contrast, the second notebook only has the embedding-generation part. So if you are solely interested in generating embeddings without any prediction, the second link is probably easier (but the first one should give you the same plus more).
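
One practical caveat for getting identical results with batching: when sequences in a batch have different lengths, the padded positions must be excluded before mean-pooling, otherwise the per-protein vectors drift away from the single-sequence results. A minimal sketch, assuming `embedding_repr` and `attention_mask` as in the code from the first post:

```python
# Mask-aware per-protein pooling: average only over real residues,
# excluding padding and the trailing </s> token added by the tokenizer.
pooled = []
for i in range(attention_mask.shape[0]):
    seq_len = int(attention_mask[i].sum().item())   # real tokens incl. </s>
    pooled.append(embedding_repr.last_hidden_state[i, :seq_len - 1].mean(dim=0))
per_protein = torch.stack(pooled)                   # (batch_size, 1024)
```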
