Integrative metagenomics of wild bees reveals urbanization tied to reduced genetic diversity and increased pathogen loads
This repository hosts the scripts and brief methods from our manuscript. This project focused on using mapped and unmapped reads from whole genome sequencing of the small carpenter bee (Ceratina calcarata) to obtain population genetic and metagenomic data, respectively. Our main goal was to understand how the genetic and metagenomic components of the bee are influenced by an urbanization gradient.
Authors:
Acknowledgements:
- Rehan Lab for constructive feedback on earlier versions of our manuscript
- Alessia Schembri, Phuong Nguyen, and Yousaf Malik for assistance with field collection of bee specimens.
- SHARCNET community for technical support on high-performance computing (HPC) clusters (Graham, Beluga, and Cedar)
- Jose Sergio Hleap for assistance with the HPC clusters and overall pipeline performance.
Funding:
- Mitacs Elevate Postdoctoral Fellowship to KDC
- NSF Biological Collections Postdoctoral Fellowship to EPK
- Foundation for Food and Agricultural Research Pollinator Health to SMR
- Weston Family Foundation Microbiome Initiative to SMR
- NSERC Discovery Grant, Supplement, and E.W.R. Steacie Memorial Fellowship to SMR
Contents:
- Install software
- Quality Check
- Separating Mapped and Unmapped Reads
- Population Genetics
- Metagenomics
- CoNet - Integration of Pop Gen and Metagenomics
We will be using several software packages to take our unmapped whole genome sequencing (WGS) reads and extract taxonomic information from them.
- FastQC (version 0.11.9)
- Trimmomatic (version 0.39)
- BWA (version 0.7.17)
- SAMtools (version 1.10)
- Sambamba (version 0.8.0)
- BAMtools (version 2.5.1)
- Bedtools (version 2.29.2)
- metaSPAdes (version 3.10.1)
- Kraken2 (version 2.1.1)
- Diamond (version 2.0.13)
- BLAST+ (version 2.12.0)
- Entrez Direct EUtilities (Efetch) (version 15.3)
- seqtk (version 1.3-r106)
- FragGeneScan (version 1.31)
- PAST (version 4.06)
- GhostKOALA
- eggNOG-mapper (version 2.1.6)
- R (version 4.2.0)
- VCFTools (version 0.1.16)
- GATK (version 4.2.2.0)
Note: Most of the above software is already available on the Compute Canada clusters and just needs to be loaded using "module load". For example:
module load fastqc # will load the latest version in the cluster
module load samtools/1.10 # specify the version to use a specific one
module load bwa/0.7.17
Module load commands like these will appear in some of our scripts.
Run FastQC on the raw reads to check their quality, then trim the adaptors and poor-quality bases. After the trim, run FastQC again to ensure adaptor removal and base quality are acceptable. FastQC script
The adaptors are present and there are some poor-quality bases, so we run Trimmomatic to improve quality. Trimmomatic script
After this trim, run FastQC again to check that all is fine, then move on to the read mapping. MultiQC can also be used here to get a comprehensive report of the trimmed results.
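A minimal sketch of this step, assuming paired-end files named sample_R1.fastq.gz and sample_R2.fastq.gz and a TruSeq adaptor file (both placeholders); the parameters we actually used are in the linked FastQC and Trimmomatic scripts:
module load fastqc trimmomatic
mkdir -p fastqc_raw fastqc_trimmed
fastqc -o fastqc_raw sample_R1.fastq.gz sample_R2.fastq.gz
# On Compute Canada the Trimmomatic jar is exposed via $EBROOTTRIMMOMATIC; adjust the path for your system
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE -threads 4 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.trim.fastq.gz sample_R1.unpaired.fastq.gz \
    sample_R2.trim.fastq.gz sample_R2.unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
fastqc -o fastqc_trimmed sample_R1.trim.fastq.gz sample_R2.trim.fastq.gz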
The following script uses BWA to map reads to a reference genome, and incorporates sambamba, samtools, and bedtools for extraction of mapped and unmapped reads. Mapping script.
We used the Ceratina calcarata (BioProject: PRJNA791561) genome to map the reads. The data will be held until December 31, 2023.
This separates the mapped and unmapped reads into FASTQ files, which are then used for the population genetics (section 4) and metagenomics (section 5) analyses, respectively.
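A minimal sketch of the mapping and read-separation logic, with placeholder sample and reference file names (Ccalc_genome.fasta); the full Mapping script also uses sambamba and should be consulted for the exact commands:
module load bwa/0.7.17 samtools/1.10 bedtools/2.29.2
bwa index Ccalc_genome.fasta
bwa mem -t 8 Ccalc_genome.fasta sample_R1.trim.fastq.gz sample_R2.trim.fastq.gz | \
    samtools sort -@ 8 -o sample_sorted.bam -
samtools index sample_sorted.bam
# Mapped reads (read and mate both mapped) for population genetics
samtools view -b -F 12 sample_sorted.bam > sample_mapped.bam
# Unmapped reads (read and mate both unmapped) for metagenomics
samtools view -b -f 12 sample_sorted.bam > sample_unmapped.bam
samtools sort -n -o sample_unmapped_namesorted.bam sample_unmapped.bam
bedtools bamtofastq -i sample_unmapped_namesorted.bam -fq sample_unmapped_R1.fastq -fq2 sample_unmapped_R2.fastq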
Two methods were used for variant calling: Genome Analysis Toolkit (GATK) and BCFtools. The resulting VCF files were filtered using GATK, BCFtools, and VCFtools, and the filtered VCFs from the two callers were then intersected using BCFtools. The code for this analysis can be found here: Variant calling script. Additional details on the population genetics statistics calculations and population structure analyses can be found in the Methods and Supplementary Materials.
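A minimal sketch of the two calling routes and the intersection, with placeholder file names and without the filtering steps; see the Variant calling script for the exact commands and filters:
# Assumes read groups are set and the reference has .fai and .dict indexes
module load gatk bcftools samtools   # module names/versions may differ on your cluster
# Route 1: GATK
gatk HaplotypeCaller -R Ccalc_genome.fasta -I sample_mapped.bam -O sample_gatk.vcf.gz
# Route 2: BCFtools
bcftools mpileup -f Ccalc_genome.fasta sample_mapped.bam | bcftools call -mv -Oz -o sample_bcftools.vcf.gz
# Index both VCFs and keep the sites called by both methods
bcftools index sample_gatk.vcf.gz
bcftools index sample_bcftools.vcf.gz
bcftools isec -n=2 -p isec_out sample_gatk.vcf.gz sample_bcftools.vcf.gz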
We used two approaches to calculate isolation by resistance: least cost path modelling and circuit theory. The code for this analysis can be found here: Isolation by resistance and environment script. Resistance values assigned to the different land use classes across the tested hypotheses can be found in Table 1. The generated raster files were also used in Circuitscape (version 5.0.0). Additionally, we calculated environmental distances using annual temperature and precipitation (see Supplementary Materials for details on how the climatic variables were obtained).
metaSPAdes (version 3.10.1) is another option for gathering information on the sequences. In this case, contigs are produced, which can then be BLASTed. Metaspades script.
Once the contigs.fasta files are obtained, we can BLAST the contigs to obtain taxonomic information. To do this, we use BLAST+ and the BLAST+ script.
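A minimal sketch of the assembly and BLAST steps, assuming paired unmapped reads per sample and a local copy of the NCBI nt database (with its taxdb files for the taxonomy columns); the exact parameters are in the Metaspades and BLAST+ scripts:
module load spades/3.10.1 blast+/2.12.0   # module names/versions may differ on your cluster
metaspades.py -1 sample_unmapped_R1.fastq -2 sample_unmapped_R2.fastq -t 16 -o sample_metaspades
blastn -query sample_metaspades/contigs.fasta -db nt \
    -outfmt "6 qseqid sseqid pident length evalue bitscore staxids sscinames" \
    -max_target_seqs 5 -evalue 1e-5 \
    -out sample_contigs_blast.tsv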
We ran a PERMANOVA via adonis2 using Bray-Curtis dissimilarities, followed by an ANOVA on beta dispersion via betadisper. If the PERMANOVA was significant and the beta dispersion was not, this was followed by Tukey's HSD via TukeyHSD. Diversity R Script.
SIMPER analyses were done in PAST by uploading a dataframe containing the taxa relative contig abundances and the environmental feature bin information. The dataframe should be set up with each sample as a row and each variable (landscape feature and family/genus) as a column. Then select SIMPER analysis while highlighting all the taxa and all the landscape features, select pool data, and use Bray-Curtis dissimilarities.
To test whether random forests would be worthwhile on these data, I used the following random forest R script to simply assess whether classification or regression random forest analyses would produce low out-of-bag errors or a high % variance explained.
The taxa dataframe of relative contig abundances is first normalized using the DESeq2 R script, and then the obtained normalized dataframe is used for WGCNA analysis using the WGCNA R script.
DESeq2 identifies the category in which taxa are overrepresented (positive, significant log2 fold change). If the log2 fold change is negative, that taxon is more abundant in the second, contrasted group.
WGCNA aims to find significant modules (clusters) of taxa with respect to each landscape variable, and from those clusters we can identify hub taxa (WGCNA's "hub genes") by highlighting those that have an absolute gene significance (GS) value ≥ 0.2 and an absolute module membership (MM) value ≥ 0.8 (this can be done in Excel).
For the contigs that correspond to our domains of interest (these sequences were isolated from the metaSPAdes contigs.fasta output files; a seqtk-based sketch of that extraction step is shown below), we run them through FragGeneScan, which finds fragmented genes in short reads.
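A minimal sketch of pulling the contigs of interest out of each assembly with seqtk; the per-sample ID list files (*_target_contig_ids.txt) and directory layout are placeholders:
# Assumes one text file of target contig IDs per sample
module load seqtk
for ids in *_target_contig_ids.txt; do
    name=${ids%%_*}
    seqtk subseq ${name}_metaspades/contigs.fasta $ids > ${name}_extracted_contigs.fasta
done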
# Run FragGeneScan on the cluster like this, or in an sbatch script
for file in *extracted_contigs.fasta; do name=${file%%_*}; /scratch/chauk/06-EXTRACTED_CONTIGS/FragGeneScan1.31/FragGeneScan -s /scratch/chauk/06-EXTRACTED_CONTIGS/${name}_extracted_contigs.fasta -o /scratch/chauk/06-EXTRACTED_CONTIGS/${name}_output -p 1 -w 0 -t illumina_5; echo "done for $file"; done
# -w 0 means fasta file has short sequence reads (-w 1 would mean complete genomic sequence)
# -p 1 means use 1 thread (default is already 1, can omit this)
# -t illumina_5 means train the data using Illumina sequencing reads with about 0.5% error rate.
This will produce three types of output files, but we will be using the output.faa files for further functional analysis.
The output.faa files were concatenated by bin for each of the variables and run through GhostKOALA. So, for example, for agricultural percent I concatenated all the output.faa samples that belonged to bin 1, then to bin 2, then bin 3, etc., and ended up with 5 .faa files. I ran these through GhostKOALA to get the KEGG reconstructed pathways list.
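A minimal sketch of the per-bin concatenation, assuming a plain-text mapping of sample names to bins (agri_bins.txt, two whitespace-separated columns: sample and bin); this helper file and its name are placeholders:
# Build one concatenated .faa per agricultural-percent bin from a hypothetical agri_bins.txt
for bin in 1 2 3 4 5; do
    > agri_bin${bin}.faa
    while read sample b; do
        if [ "$b" = "$bin" ]; then
            cat ${sample}_output.faa >> agri_bin${bin}.faa
        fi
    done < agri_bins.txt
done
# Each agri_bin*.faa file is then uploaded to GhostKOALA via its web interface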
eggNOG-mapper uses DIAMOND to search the fragmented sequences against its protein database and obtain protein information. The output.faa files are used. eggNOG-mapper needs to be manually installed on the cluster. After downloading the tar.gz file and uploading it to the cluster, run the following:
module load python
tar xvzf 2.1.6.tar.gz
rm 2.1.6.tar.gz
cd eggnog-mapper-2.1.6
pip install -r requirements.txt
python download_eggnog_data.py # requires the python module loaded beforehand; make sure Python 3.7.7 is the version actually used for the eggNOG-mapper analysis
Once that is complete, we can run eggNOG-mapper as an sbatch script, or in the terminal if the samples are small enough.
module load StdEnv/2020 gcc/9.3.0
module load hmmer diamond prodigal
module load python/3.7.7
module load mmseqs2/13-45111
for file in *_output.faa; do name=${file%%_*}; ../../08-EGGNOGGMAPPER/eggnog-mapper-2.1.6/emapper.py -i $file -o ../../08-EGGNOGGMAPPER/${name}; done
# Running emapper.py like this, with no other options, means it will use the default DIAMOND search mode
This will produce .emapper.hits, .emapper.annotations, and .emapper.seed_orthologs files. The KEGG ID information is found within .emapper.annotations. From this file, extract the KEGG IDs from the KEGG_ko column (they look something like "ko:K01100", sometimes with multiple IDs per entry). Once we have our list of KEGG IDs from eggNOG-mapper, we can input them into the KEGG Database to see which functional names are associated with these KEGG terms.
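A minimal sketch of extracting those KEGG IDs with standard shell tools; the column position is looked up from the header line rather than hard-coded, since it can differ between eggNOG-mapper versions:
for file in *.emapper.annotations; do
    name=${file%%.*}
    # find which tab-separated column of the header line is KEGG_ko
    col=$(grep -m1 "KEGG_ko" "$file" | tr '\t' '\n' | grep -n -x "KEGG_ko" | cut -d: -f1)
    # drop comment/header lines, pull that column, split comma-separated entries, keep the ko: IDs
    grep -v "^#" "$file" | cut -f"$col" | tr ',' '\n' | grep "^ko:" | sort -u > "${name}_kegg_ids.txt"
done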
Finally, to integrate the population genetic and metagenomic components and infer potential risks to wild bee health, we analyzed the correlation between population- and landscape-level data and potential bee and plant pathogen diversity.
We created a dataframe with samples as the rows and the following columns: pathogenicity (Bray-Curtis dissimilarity for just the identified bee and plant pathogens), Yang's relatedness, environmental distance, and resistance distance. We ran this with our CoNet script.