
Generating embedding from finetuned model #146

Open
abelavit opened this issue Mar 25, 2024 · 3 comments

@abelavit

Hello,

I need help with how to go about generating embeddings after ProtT5 has been finetuned. I have carried out finetuning of the model using the sample code 'PT5_LoRA_Finetuning_per_residue_class.ipynb' on my own dataset, and I have the saved model called PT5_secstr_finetuned.pth. How do we now extract embeddings for new protein sequences such as sequence_examples = ["PRTEINO", "SEQWENCE"] using the finetuned model?

Thank you for your time.

@mheinzinger
Collaborator

The model outputs have a field called hidden_states, which contains the embeddings.
Something along these lines:
```python
embeddings = model(input_ids, attention_mask=attention_mask).hidden_states
```
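
A slightly fuller sketch of what that could look like (assuming `model` and `tokenizer` are the finetuned model and tokenizer you reloaded, and that the model forwards `output_hidden_states` like a standard Hugging Face model; the sequence preprocessing follows the usual ProtT5 convention):

```python
import re
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

sequence_examples = ["PRTEINO", "SEQWENCE"]
# ProtT5 expects space-separated residues; map rare amino acids to X
seqs = [" ".join(re.sub(r"[UZOB]", "X", s)) for s in sequence_examples]

batch = tokenizer(seqs, add_special_tokens=True, padding="longest",
                  return_tensors="pt").to(device)

model.eval()
with torch.no_grad():
    out = model(input_ids=batch.input_ids,
                attention_mask=batch.attention_mask,
                output_hidden_states=True)  # assumed to be accepted by the wrapped model

# hidden_states is a tuple with one tensor per layer; the last entry holds the
# final per-residue embeddings of shape (batch, seq_len, 1024).
# Per sequence you would still strip padding and the trailing special token.
embeddings = out.hidden_states[-1]
```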

@abelavit
Author

abelavit commented Mar 28, 2024

For loading the original pre-trained model, such as ProtT5, it can be done like so:

```python
from transformers import T5Tokenizer, T5EncoderModel

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained('Rostlab/prot_t5_xl_half_uniref50-enc', do_lower_case=False)

# Load the model
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_half_uniref50-enc").to(device)
```

To load the finetuned model from the PT5_LoRA_Finetuning_per_residue_class.ipynb script, the command seems to be:

```python
tokenizer, model_reload = load_model("./PT5_secstr_finetuned.pth", num_labels=3, mixed=False)
```

The load_model call above relies on other functions (e.g. the PT5_classification_model function), which leads to a chunky script. I am wondering if there is a simpler way to load the finetuned model and obtain embeddings for protein sequences, as is done for the original pre-trained model (ProtT5).

I am not sure if I am doing it right.

Thanks.

@mheinzinger
Collaborator

I see your point; however, currently we do not have the bandwidth to work on a nicer interface, sorry.
In case you find a nicer way, e.g. by using https://github.com/huggingface/peft, feel free to share it or to create a pull request :)
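
For anyone who wants to try, a rough sketch of what such a peft-based route could look like (purely illustrative, not code we ship; the target module names, LoRA hyperparameters and paths are assumptions you would need to adapt to your setup):

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer
from peft import LoraConfig, get_peft_model, PeftModel

base_name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(base_name, do_lower_case=False)

# Finetuning: wrap the encoder with a LoRA adapter
base = T5EncoderModel.from_pretrained(base_name)
lora_cfg = LoraConfig(r=4, lora_alpha=1, lora_dropout=0.05,
                      target_modules=["q", "k", "v", "o"])  # assumed T5 attention projections
model = get_peft_model(base, lora_cfg)
# ... train ...
model.save_pretrained("./prot_t5_lora_adapter")  # saves only the small adapter weights

# Later: reload the adapter, merge it into the base weights,
# and extract embeddings exactly as with the original encoder
base = T5EncoderModel.from_pretrained(base_name)
model = PeftModel.from_pretrained(base, "./prot_t5_lora_adapter").merge_and_unload()

with torch.no_grad():
    ids = tokenizer(["P R T E I N O"], return_tensors="pt")
    embeddings = model(**ids).last_hidden_state  # per-residue embeddings
```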
