FriedbergLab/Epictope

EpicTope

Software for predicting epitope tag insertion sites in proteins.

EpicTope is an R package/pipeline to identify epitope tag insertion sites for proteins of interest. It uses four features of protein structure (sequence conservation, secondary structure, disordered binding regions, and relative solvent accessibility) to predict suitable internal locations for tag insertion.

The primary score for EpicTope relies on a "least-worst" approach: a position is considered suitable for insertion when even its lowest-scoring feature is high. For each position, we sort the feature scores from lowest to highest and take the lowest score. We then plot this lowest score for each position; positions where it is highest are considered suitable for tagging.
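The "least-worst" idea can be sketched in a few lines. This is an illustrative Python sketch, not the package's R code; the function names, feature names, and example scores are all hypothetical.

```python
def least_worst_scores(feature_scores):
    """Return the minimum feature score at each position.

    feature_scores maps a feature name to a list of per-position scores
    (all lists the same length; higher means more suitable).
    """
    lengths = {len(values) for values in feature_scores.values()}
    if len(lengths) != 1:
        raise ValueError("all features must cover the same positions")
    # zip(*...) walks through the features position by position.
    return [min(column) for column in zip(*feature_scores.values())]


def best_positions(min_scores, top_n=3):
    """Return 1-based positions ranked by minimum score, highest first."""
    ranked = sorted(range(len(min_scores)), key=lambda i: min_scores[i], reverse=True)
    return [i + 1 for i in ranked[:top_n]]


# Hypothetical scores for a five-residue protein.
scores = {
    "conservation": [0.2, 0.9, 0.7, 0.4, 0.8],
    "secondary_structure": [0.0, 1.0, 0.5, 1.0, 1.0],
    "solvent_accessibility": [0.6, 0.8, 0.9, 0.3, 0.7],
    "disordered_binding": [0.9, 0.7, 0.8, 0.9, 0.6],
}
min_scores = least_worst_scores(scores)
print(min_scores)                           # [0.0, 0.7, 0.5, 0.3, 0.6]
print(best_positions(min_scores, top_n=2))  # [2, 5]
```

Position 1 scores well on disordered binding but sits in a helix (score 0), so its minimum is 0 and it is ranked last. That is the behaviour the least-worst approach is meant to produce.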

This repository contains the source code of the EpicTope R package, step-by-step R Markdown and Jupyter notebooks to run the complete workflow, wrapper scripts for simplified workflow execution, and instructions to adjust the weight and effect of each considered feature. The package requires local installations of BLAST, MUSCLE, and DSSP to run (these can be installed together with EpicTope). You will need at least 3 GB of disk space.

Methodology

EpicTope workflow. Starting from a protein of interest, the EpicTope workflow automates multiple sequence alignment, downloading of the predicted protein structure from AlphaFold2, and fetching of disordered binding regions from IUPred2A.

Sequence conservation

Sequence conservation is used to guide internal epitope-tagging approaches. Regions of relatively low conservation are unlikely to be involved in the critical function of the protein. To identify these regions for a protein of interest, we first BLAST the query protein against the proteomes of a diverse set of model organisms. By default, we compare the query sequence against the proteomes of Mus musculus (mouse), Bos taurus (cow), Canis lupus familiaris (dog), Gallus gallus (chicken), Homo sapiens (human), Takifugu rubripes (pufferfish), and Xenopus tropicalis (western clawed frog). Using BLAST, we identify the best match in each organism, taken as the hit with the lowest E-value. We then align the retrieved sequences with the query protein using MUSCLE, a multiple sequence alignment program, and calculate the Shannon entropy at each position. We use Shannon entropy as a simple measure of the variability of amino acids at each position in the alignment. A low Shannon entropy indicates low variability, or high sequence conservation, at the position, which should therefore be avoided for tag insertion. Conversely, a high Shannon entropy indicates a relatively low degree of sequence conservation and potential suitability for tagging.
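As a rough illustration of the per-position entropy calculation, here is a minimal Python sketch (not the package's R code). It assumes entropy in bits (log base 2) and treats the gap character as one more symbol; both choices are assumptions for illustration, and the toy alignment is made up.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the characters in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(aligned_sequences):
    """Entropy for every column of an alignment given as equal-length strings."""
    if len({len(seq) for seq in aligned_sequences}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*...) turns the rows (sequences) into columns (positions).
    return [column_entropy(col) for col in zip(*aligned_sequences)]


# Toy alignment: four sequences, three columns.
toy = ["MKA", "MKS", "MRT", "MR-"]
print(alignment_entropy(toy))  # [0.0, 1.0, 2.0]
```

The fully conserved first column has entropy 0 (unfavorable for tagging), while the fully variable third column has the highest entropy of the three.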

Example multiple sequence alignment for Tcf21 protein sequences from position 1 to 70. The protein of interest, or query, is the tcf21 protein from Danio rerio (zebrafish). Amino acids identical between all species are in red, non-identical between at least one species in blue, and gaps are highlighted in yellow. From this alignment, red regions would be unfavorable for tag insertion.

Solvent accessibility

Relative Solvent Accessibility (RSA) is a measure of the surface area of a folded protein that is accessible to a solvent, typically the cytoplasmic fluid. It is calculated by dividing the solvent accessible surface area (SASA) of an amino acid by the maximum possible solvent accessible surface area for that residue. SASA values are assigned with Define Secondary Structure of Proteins (DSSP). The DSSP program defines the secondary structure, geometrical features, and solvent exposure of proteins, given atomic coordinates in Protein Data Bank (PDB) format. Values used for the maximum possible solvent accessible surface area were taken from this study. We use the AlphaFold2 predicted structure from the European Bioinformatics Institute (EBI) as the source PDB for DSSP calculations.
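The RSA calculation itself is a per-residue division. The Python sketch below shows the arithmetic only; the maximum SASA values are made-up placeholders, not the values EpicTope takes from the study mentioned above, and the function name is hypothetical.

```python
def relative_solvent_accessibility(residues, sasa, max_sasa):
    """Divide each residue's SASA by the maximum SASA for its residue type."""
    if len(residues) != len(sasa):
        raise ValueError("need one SASA value per residue")
    return [area / max_sasa[res] for res, area in zip(residues, sasa)]


# Placeholder maxima in square angstroms, for illustration only.
example_max_sasa = {"A": 100.0, "G": 80.0, "K": 200.0}

# Three residues with hypothetical DSSP-style SASA values.
rsa = relative_solvent_accessibility("AGK", [25.0, 60.0, 150.0], example_max_sasa)
print(rsa)  # [0.25, 0.75, 0.75]
```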

Secondary structure

Secondary structure is the local spatial conformation of the polypeptide backbone for the protein of interest. Certain structures, such as alpha helices or beta sheets, are more defined, and disruption of these structures is likely to affect protein structure. As with solvent accessibility, we use DSSP to define the secondary structure of the protein from its PDB file. By default, we assign helices (G, H, I) and sheets (E) feature scores of 0; hydrogen-bonded turns (T), residues in isolated beta bridges (B), and bends (S) scores of 0.5; and coils scores of 1. For all features, higher values indicate greater suitability for tag insertion.
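The default mapping from DSSP codes to feature scores can be written as a lookup table. This Python sketch is illustrative rather than the package's R code; treating every code not listed in the table as coil is an assumption made here.

```python
# Default scores described above, keyed by DSSP secondary-structure code.
SS_SCORES = {
    "G": 0.0, "H": 0.0, "I": 0.0,  # helices
    "E": 0.0,                      # sheets
    "T": 0.5, "B": 0.5, "S": 0.5,  # turns, isolated beta bridges, bends
}
COIL_SCORE = 1.0  # assumption: any code not listed above counts as coil


def secondary_structure_scores(dssp_codes):
    """Map a string of per-residue DSSP codes to feature scores."""
    return [SS_SCORES.get(code, COIL_SCORE) for code in dssp_codes]


print(secondary_structure_scores("HHHTS-EE"))
# [0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 0.0, 0.0]
```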

Disordered binding

Disordered binding regions are sections of a protein that do not have a well-defined structure on their own, but can undergo a disorder-to-order transition when they bind to specific protein partners. To avoid these regions, we use ANCHOR2, a tool that analyzes an amino acid sequence and returns a score of intrinsic disorder depending on a model of the estimated energy potential for residue interactions. To maintain consistency with other features, the disordered binding feature score is taken as 1 minus the ANCHOR2 score.
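The inversion keeps the "higher is better" convention: a residue that ANCHOR2 scores as likely disordered binding receives a low feature score. A minimal Python sketch, with a hypothetical function name and made-up ANCHOR2 values:

```python
def disordered_binding_feature(anchor2_scores):
    """Convert per-residue ANCHOR2 scores into feature scores (1 - score)."""
    return [1.0 - score for score in anchor2_scores]


print(disordered_binding_feature([0.0, 0.25, 1.0]))  # [1.0, 0.75, 0.0]
```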

Installation

System requirements

Installing EpicTope and its dependencies requires at least 3 GB of disk space. Users should also be familiar with conda, a package manager for macOS, Linux, and Windows. Conda is not needed if users already have access to installations of BLAST, MUSCLE, and DSSP, either locally or in an HPC environment.

Software dependencies

To calculate the multiple sequence alignment and secondary characteristics, EpicTope relies on local installs of BLAST, MUSCLE, and DSSP. These packages can be installed using Conda, an open-source package and environment management system that runs on Windows, macOS, and Linux. Conda installers can be found at the Anaconda website. Once Conda is installed, you may run the following commands, which create a conda environment named EpicTope and install the requisite packages into that environment.

macOS/Linux installation

For macOS/Linux, commands are issued at the terminal. Dependencies can be installed using the following commands.

  1. Download and place the contents of the "install/mac_linux" folder into your project directory. In your terminal, type "ls" to verify the files are in the correct folder.
curl -o "epictope_install.sh" "https://raw.githubusercontent.com/FriedbergLab/EpicTope/main/install/mac_linux/epictope_install.sh"
ls
  2. Run the installation script in the terminal with the following commands.
chmod u+x epictope_install.sh
bash -i epictope_install.sh

Additional installation methods for Linux can be found on the Detailed Installation for Linux page.

Windows installation

BLAST and MUSCLE are not available for installation on Windows with conda and have to be installed separately. We provide a wrapper script to install these programs and the conda environment.

Open Anaconda Prompt, create a new project folder, and 'cd' (change directory) into it. Then follow these steps:

  1. In Anaconda Prompt, download the "epictope_install.bat" file into your project directory with curl. Type "dir" to verify the file is in the correct folder.
curl -o epictope_install.bat https://raw.githubusercontent.com/FriedbergLab/EpicTope/main/install/windows/epictope_install.bat
dir
  2. Run the installation script in Anaconda Prompt with the following command. If prompted by Conda, type "Y" for Yes and then press "Enter". Click "Yes" if a pop-up window asks whether you allow this app to make changes to your device.
epictope_install.bat

More detailed instructions for Windows can be found on the Detailed Installation for Windows page.

Usage

Here, we provide usage examples to demonstrate how to use EpicTope. Each example includes a brief description and code snippets or commands to showcase the function. These examples assume the installation steps have been followed.

Example 1A: Generating EpicTope predictions on macOS/Linux

For our example, we investigate the Smad5 gene in zebrafish. Searching for the protein in UniProt, we find that its UniProt ID is "Q9W7E7".

Run the EpicTope workflow with the following commands in the terminal.

conda activate epictope
Rscript install.R
Rscript single_score.R Q9W7E7

Example 1B: Generating EpicTope predictions on Windows

On Windows, the commands are the same as on macOS/Linux, except that Windows file paths use a backslash "\" instead of a forward slash "/". Run the EpicTope workflow with the following commands in Anaconda Prompt.

conda activate epictope
Rscript install.R
Rscript single_score.R Q9W7E7

Example 2: Viewing your results.

The EpicTope workflow generates a "<UniprotID>_score.csv" file (ex: Q9W7E7_score.csv), containing the individual feature scores for each position, the minimum score across features for each position, and a weighted sum score of all features. These values can be plotted in the data visualization tool of choice.

For convenience, we provide a "plot_scores.R" script that generates a plot of the minimum score for each position in the sequence, using a rolling average with a window size of 7.

The plot script can be run in the same way as previous commands.

Rscript plot_scores.R outputs/Q9W7E7_score.csv
"outputs/Q9W7E7.tiff"

Example minimum score plot for Q9W7E7. Values smoothed over a window size of 7.
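For readers who prefer to smooth the scores themselves, the rolling average can be approximated as follows. This is a Python sketch, not the code in plot_scores.R; it assumes a centred window that shrinks at the ends of the sequence, which may differ from how the provided script handles edges.

```python
def rolling_mean(values, window=7):
    """Centred rolling average; the window shrinks near the sequence ends."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        # Take up to `half` neighbours on each side of position i.
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical minimum scores with a single sharp peak.
print(rolling_mean([0, 0, 0, 7, 0, 0, 0], window=7)[3])  # 1.0
```

Smoothing spreads an isolated high score over its neighbours, so a region scores well only when several adjacent positions are suitable.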

Workflow notebooks

Example workflows with the EpicTope package are available in the vignettes folder, as both R Markdown documents and Jupyter notebooks. These go through the EpicTope workflow step by step in an interactive session or an IDE. Running them in an IDE requires that the IDE can access local installations of BLAST, MUSCLE, and DSSP.

Macro scripts

The scripts install.R and single_score.R are provided in the scripts folder of this repo to enable one-command operation. To run them, download the install.R and single_score.R scripts from this repository, either directly from the GitHub page or using git clone.

  • install.R

    • This script first downloads the proteomes for the species used in the multiple sequence alignment from the NCBI FTP page.
    • It then converts these sequences into usable files for BLAST.
    • This script needs to be re-run if the user changes the species considered in the multiple sequence alignment.
  • single_score.R

    • This script takes a UniprotID as input and performs the EpicTope workflow for that protein.
    • It first retrieves the amino acid sequence and AlphaFold2 predicted structure for the protein.
    • It then BLASTs the protein against the proteomes of the animals used in the multiple sequence alignment, retrieves the highest scoring match (score measured by the lowest E-value), and aligns the matched proteins along with the query in a multiple sequence alignment.
    • It then determines the secondary structure, solvent accessibility, and disordered binding regions for the protein.
    • It combines all feature scores into a summary dataframe.
    • The dataframe annotates each residue position with its feature scores and final tagging score.
    • This file is saved to an /outputs folder with the name of the protein followed by '_score.csv'.
    • For example, the protein used in the examples saves an "outputs/P57102_score.csv" file.

From the terminal, these scripts can be run as follows.

Rscript install.R 
Rscript single_score.R "P57102" # replace 'P57102' with the UniprotID for your protein of interest.

Each script can also be opened in an IDE such as RStudio and run interactively, line by line.
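The best-hit selection step described for single_score.R (one match per organism, chosen by lowest E-value) can be sketched as below. This is an illustrative Python sketch, not the script's R code; the dictionary keys and the subject identifiers are hypothetical.

```python
def best_hit_per_organism(hits):
    """Keep, for each organism, the hit with the lowest E-value."""
    best = {}
    for hit in hits:
        current = best.get(hit["organism"])
        if current is None or hit["evalue"] < current["evalue"]:
            best[hit["organism"]] = hit
    return best


# Hypothetical BLAST hits; subject IDs are placeholders.
hits = [
    {"organism": "Mus musculus", "subject": "hit_a", "evalue": 1e-50},
    {"organism": "Mus musculus", "subject": "hit_b", "evalue": 1e-80},
    {"organism": "Gallus gallus", "subject": "hit_c", "evalue": 3e-40},
]
selected = best_hit_per_organism(hits)
for organism, hit in selected.items():
    print(organism, hit["subject"], hit["evalue"])
```

The selected sequences, together with the query, would then be passed to the multiple sequence alignment.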

User configuration

A second scoring function used by EpicTope sums the calculated scores for the protein features, with equal weight assigned to secondary structure, disordered binding regions, and solvent accessibility. By default, sequence conservation carries a higher weight, at 1.5 times that of the other features.
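A minimal Python sketch of such a weighted sum follows. The feature names and the absolute weights (1.0 for the equally weighted features, 1.5 for conservation) are assumptions chosen to match the ratio described above; whether EpicTope rescales the sum is not covered here.

```python
# Assumed default weights: conservation counts 1.5 times the other features.
DEFAULT_WEIGHTS = {
    "conservation": 1.5,
    "secondary_structure": 1.0,
    "solvent_accessibility": 1.0,
    "disordered_binding": 1.0,
}


def weighted_sum(position_scores, weights=DEFAULT_WEIGHTS):
    """Weighted sum of the feature scores for a single position."""
    return sum(weights[name] * score for name, score in position_scores.items())


# Hypothetical feature scores for one residue.
position = {
    "conservation": 0.5,
    "secondary_structure": 1.0,
    "solvent_accessibility": 0.5,
    "disordered_binding": 0.25,
}
print(weighted_sum(position))  # 2.5
```

Passing a different weights dictionary shows how changing a feature's weight shifts the final score.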

Users can adjust the weight of each feature by modifying the "config_defaults.R" file. This file allows fine-tuning of parameters in EpicTope, including the weight of each feature, defining the species used in the multiple sequence alignment, scoring tag suitability for secondary structures, and determining maximum solvent accessibility values.

EpicTope searches for a "config.R" file in the working directory. If it doesn't find one, it uses default values. An example "config_defaults.R" file is provided in the scripts folder. To use it, edit the file, rename it to "config.R", and place it anywhere in your project directory.

Frequently Asked Questions

A growing FAQ can be found in our repository wiki page.

License

EpicTope is distributed open-source under the GPL3 license.