-
Notifications
You must be signed in to change notification settings - Fork 0
/
genomic_stability_codes
238 lines (166 loc) · 10.1 KB
/
genomic_stability_codes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
title: \"Genomic Stability and Genetic Defense Systems in *Dolosigranulum pigrum* a Candidate Beneficial Bacterium from the Human Microbiome\" Supplemental Methods
```{r message=FALSE, include=FALSE}
library(tidyverse)
library(knitr)
library(readr)
knitr::opts_chunk$set(eval = FALSE, message = FALSE)
```
# ANVI'O PANGENOME ANALYSIS
[Anvi'o v7, "hope"](https://github.com/merenlab/anvio/releases/tag/v7) [@eren2015; @eren2020; @Delmont2018b] was [installed](https://merenlab.org/2016/06/26/installation-v2/) in a Python environment called `anvio-7`.
We followed a pipeline modified from the tutorial: [Anvi'o workflow for microbial pangenomics](https://merenlab.org/2016/11/08/pangenomics-v2/ "An anvi'o workflow for microbial pangenomics").
## Reformatting genome fasta files
The NCBI submitted .gb genome files were converted to .fa using SnapGene and renamed as indicated on `GENOMES/GenomeNames.txt`. The renamed genome files are available in the `GENOMES/renamed` folder.
These .fa genomes files were reformatted to be compatible with Anvi'o v7. Flag `--seq-type NT` was added to avoid a Config Error indicating the sequence "contains characters that are not any of A, C, T, G, N, a, c, t, g, n."
```{bash}
#conda activate anvio-7
mkdir -p "analysis_Anvio7/Reformat_Dpig"
path_f="GENOMES/renamed"
path_o="analysis_Anvio7/Reformat_Dpig"
for file in $path_f/*.f*; do
FILENAME=`basename ${file%.*}`
anvi-script-reformat-fasta -o $path_o/$FILENAME.fa --min-len 0 --simplify-names $file --seq-type NT;
done
```
## Importing Prokka annotations into Anvi'o
The Prokka annotation was performed with [Prokka v1.14.6](https://github.com/tseemann/prokka/releases/tag/v1.14.6) [@seemann2014] with default parameters, including gene recognition and translation initiation site identification with Prodigal [@hyatt2010]. Prokka annotations were imported into Anvi'o following the recommended [Anvi'o pipeline for Prokka annotated genomes](https://merenlab.org/2017/05/18/working-with-prokka/) described in the next steps.
### Parsing .gff files
The Prokka annotated .gff files are parsed into two tab-delimited text files; one containing the Prodigal gene calls data and another with the functional annotations. This is done with the script `gff_parser.py`.
```{bash}
#conda activate anvio-7
mkdir -p "analysis_Anvio7/Parsed_prokka_Dpig"
path_f="GENOMES/Prokka_Out_Renamed"
path_o="analysis_Anvio7/Parsed_prokka_Dpig"
for file in $path_f/*.gff; do
FILENAME=`basename ${file%.*}`
python analysis_Anvio7/gff_parser.py --gene-calls $path_o/calls_$FILENAME.txt --annotation $path_o/annot_$FILENAME.txt $file;
done
```
### Generating contigs databases
In this step the gene calls from Prokka and the reformatted fasta files get imported to generate Anvi'o contigs databases. We add the `--ignore-internal-stop-codons` flag to avoid errors downstream in the pipeline. Only 1-2 genes with internal stop codons were identified in each genome.
```{bash}
#conda activate anvio-7
mkdir -p "analysis_Anvio7/Contigs_db_prokka_Dpig"
path_f="analysis_Anvio7/Reformat_Dpig"
path_o="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_e="analysis_Anvio7/Parsed_prokka_Dpig"
for file in $path_f/*.fa; do
FILENAME=`basename ${file%.*}`
anvi-gen-contigs-database -f $file \
-o $path_o/$FILENAME.db \
--external-gene-calls $path_e/calls_prokka_$FILENAME.txt \
--ignore-internal-stop-codons \
-n $FILENAME;
done
```
### Importing Prokka annotation
The functional annotation part of the parsed Prokka files gets added to the Anvi'o contigs databases (.db files)
```{bash}
#conda activate anvio-7
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
path_e="analysis_Anvio7/Parsed_prokka_Dpig"
for file in $path_f/*.db; do
FILENAME=`basename ${file%.*}`
anvi-import-functions -c $file \
-i $path_e/annot_prokka_$FILENAME.txt
done
```
## Importing other functional annotations
### COG annotation
The `anvi-run-ncbi-cogs` command was used to annotate the .db genome files against the NCBI's COGs database 2020 release [@tatusov1997; @galperin2021]. We used the DIAMOND [@buchfink2015] "sensitive" option instead of the default "fast" one.
```{bash}
#conda activate anvio-7
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
for file in $path_f/*.db; do
anvi-run-ncbi-cogs -T 4 --sensitive -c $file;
done
```
### KEGG annotation
The program `anvi-run-kegg-kofams` annotates the .db genomes with KEGG Orthology (KO) numbers via hits to the KEGG KOfam database [@Kanehisa2016; @Kanehisa2000].
```{bash}
#conda activate anvio-7
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
for file in $path_f/*.db; do
anvi-run-kegg-kofams -T 4 -c $file;
done
```
### PFAM annotation
The program `anvi-run-pfams` annotates the .db genomes with the PFAM database [@mistry2020].
```{bash}
#conda activate anvio-7
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
for file in $path_f/*.db; do
anvi-run-pfams -T 4 -c $file;
done
```
### Running HMMs
The program [`anvi-run-hmms`](https://merenlab.org/software/anvio/help/main/programs/anvi-run-hmms/) stores hmm-hits into the .db genomes files [@eddy2011].
```{bash}
#conda activate anvio-7
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
for file in $path_f/*.db; do
anvi-run-hmms -T 4 -c $file;
done
```
### Displaying functions
The program [`anvi-db-info`](https://merenlab.org/software/anvio/help/main/programs/anvi-db-info/) displays information about an Anvi'o database. The output of this command included in `analysis_Anvio7/AnvioGenomesSummaryFuntions.txt` has the details of all the functional annotations included in the .db files.
```{bash}
#conda activate anvio-7
path_f="analysis_Anvio7/Contigs_db_prokka_Dpig"
for file in $path_f/*.db; do
anvi-db-info $file;
done
```
## Generating an Anvi'o genomes storage
The program `anvi-gen-genomes-storage` requires you to provide names and locations of the genomes to be included in a genome storage in a .txt file. The `analysis_Anvio7/AnvioGenomesSummary003.txt` file includes the file names and file paths.
```{bash}
#conda activate anvio-7
anvi-gen-genomes-storage -e analysis_Anvio7/Contigs_db_prokka_Dpig/AnvioGenomesSummary003.txt -o analysis_Anvio7/Pangenomic_Results_Dpig/Anvio005-GENOMES.db --gene-caller Prodigal
```
## Running the pangenomic analysis
The program `anvi-pan-genome`:
1. Calculates similarities of each amino acid sequence in every genome against every other amino acid sequence. We used the more sensitive blastp search (`--use-ncbi-blast`) [@altschul1990] instead of the default DIAMOND fast option.
2. Uses [muscle](http://www.drive5.com/muscle) [@edgar2004] to align sequences.
3. Uses 'minbit heuristic' [@benedict2014] to filter weak hits based on the aligned fraction between the two reads. We used the default 0.5
4. Uses the MCL algorithm [@vandongen2011] to identify gene clusters in the remaining blastp search results. `--mcl-inflation` was set to 10, as recommended for very closely related genomes (i.e., 'strains' of the same 'species').
```{bash}
#conda activate anvio-7
anvi-pan-genome -g analysis_Anvio7/Pangenomic_Results_Dpig/Anvio005-GENOMES.db -o analysis_Anvio7/Pangenomic_Results_Dpig/ -n PAN_DPIG_prokka --use-ncbi-blast --mcl-inflation 10 --num-threads 4
```
## Displaying the pangenome
This will display the pangenome in your default internet browser. Once it opens in the browser, you can adjust image settings and click the "DRAW" icon to view it. For help using the interactive interface see <http://merenlab.org/2016/02/27/the-anvio-interactive-interface/>
```{bash}
#conda activate anvio-7
sudo anvi-display-pan -p analysis_Anvio7/Pangenomic_Results_Dpig/PAN_DPIG_prokka-PAN.db -g analysis_Anvio7/Pangenomic_Results_Dpig/Anvio005-GENOMES.db
```
## Creating bin collections (Core vs Accessory)
In order to define Core vs Accessory pangenome we use Anvi'o interactive interface. In the "Bins" tab we can create and name as many bins as we want and store then in a bin collection. In the "Search" tab we used "Search gene clusters using filters" and "Append splits to selected bin" to create the following bins:
- **Core**: 1298/2905 (`r round(100*(1298/2905),1)`%) gene clusters in all 28 genomes
- Min number of genomes gene cluster occurs = 28
- **SC Core**: 1111/2905 (`r round(100*(1111/2905),1)`%) gene clusters detected in single copy in all 28 genomes
- Min number of genomes gene cluster occurs = 28
- Max number of genes from each genome = 1
- **MC Core**: 1298-1111=187 (`r round(100*(187/2905),1)`%) detected in multiple copy in all 28 genomes
- **Soft Core**: 90/2905 (`r round(100*(90/2905),1)`%) gene clusters detected in 26-27 genomes
- Min number of genomes gene cluster occurs = 26
- Max number of genomes gene cluster occurs = 27
- **Shell**: 805/2905 (`r round(100*(805/2905),1)`%) gene clusters detected in 3-25 genomes
- Min number of genomes gene cluster occurs = 3
- Max number of genomes gene cluster occurs = 25
- **Cloud**: 712/2905 (`r round(100*(712/2905),1)`%) gene clusters detected in only 2 genomes or less
- Max number of genomes gene cluster occurs = 2
We stored the 5 bins in a new collection named **CorevsAccessory**
We can see the collections and bins created in a pangenomic database using the command `anvi-show-collections-and-bins`
```{bash}
#conda activate anvio-7
anvi-show-collections-and-bins --debug -p analysis_Anvio7/Pangenomic_Results_Dpig/PAN_DPIG_prokka-PAN.db
```
## Summarizing the pangenome
We summarized the pangenome using the program `anvi-summarize`:
```{bash}
#conda activate anvio-7
anvi-summarize -p analysis_Anvio7/Pangenomic_Results_Dpig/PAN_DPIG_prokka-PAN.db \
-g analysis_Anvio7/Pangenomic_Results_Dpig/Anvio005-GENOMES.db \
-C CorevsAccessory \
-o analysis_Anvio7/Pangenomic_Results_Dpig/Dpig-PAN-SUMMARY
```
The resulting summary folder contains the file `analysis_Anvio7/Pangenomic_Results_Dpig/Dpig-PAN-SUMMARY/PAN_DPIG_prokka_gene_clusters_summary.txt.gz` that links each gene to gene clusters, genomes, functions, and bins selected from the interface.