Integrative metagenomics of wild bees reveals urbanization tied to reduced genetic diversity and increased pathogen loads
This repository hosts the scripts and brief methods from our manuscript. This project focused on using mapped and unmapped reads from whole genome sequencing of the small carpenter bee (Ceratina calcarata) to obtain population genetic and metagenomic data, respectively. Our main goal was to understand how the genetic and metagenomic components of the bee are influenced by an urbanization gradient.
Authors:
Acknowledgements:
- Rehan Lab for constructive feedback on earlier versions of our manuscript
- Alessia Schembri, Phuong Nguyen, and Yousaf Malik for assistance with field collection of bee specimens.
- SHARCNET community for technical support on high-performance computing (HPC) clusters (Graham, Beluga, and Cedar)
- Jose Sergio Hleap for assistance with the HPC clusters and overall pipeline performance.
Funding:
- Mitacs Elevate Postdoctoral Fellowship to KDC
- NSF Biological Collections Postdoctoral Fellowship to EPK
- Foundation for Food and Agricultural Research Pollinator Health to SMR
- Weston Family Foundation Microbiome Initiative to SMR
- NSERC Discovery Grant, Supplement, and E.W.R. Steacie Memorial Fellowship to SMR
Contents:
- Install software
- Quality Check
- Separating Mapped and Unmapped Reads
- Population Genetics
- Metagenomics
- CoNet - Integration of Pop Gen and Metagenomics
We will be using several software packages to take our unmapped whole genome sequencing (WGS) reads and extract taxonomic information from them.
- FastQC (version 0.11.9)
- Trimmomatic (version 0.39)
- BWA (version 0.7.17)
- SAMtools (version 1.10)
- Sambamba (version 0.8.0)
- BAMtools (version 2.5.1)
- Bedtools (version 2.29.2)
- metaSPAdes (version 3.10.1)
- Kraken2 (version 2.1.1)
- Diamond (version 2.0.13)
- BLAST+ (version 2.12.0)
- Entrez Direct EUtilities (Efetch) (version 15.3)
- seqtk (version 1.3-r106)
- FragGeneScan (version 1.31)
- PAST (version 4.06)
- GhostKOALA
- eggNOG-mapper (version 2.1.6)
- R (version 4.2.0)
- VCFTools (version 0.1.16)
- GATK (version 4.2.2.0)
Note: Most of the above software is already available on the Compute Canada clusters and just needs to be loaded using "module load". For example:
module load fastqc # will load the latest version in the cluster
module load samtools/1.10 # specify the version to use a specific one
module load bwa/0.7.17
Module load commands like these will appear in some of our scripts.
Run FastQC on the raw reads to check their quality, then trim the adaptors and poor-quality bases. After the trim, run FastQC again to ensure adaptor removal and base quality are acceptable. FastQC script
The adaptors are present and there are some poor-quality bases, so we run Trimmomatic to improve quality. Trimmomatic script
After this trim, run FastQC again to check that all is fine, then move on to the read mapping. MultiQC can also be used here to get a comprehensive report of the trimmed results.
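A minimal sketch of this step, assuming paired-end files named sample_R1.fastq.gz and sample_R2.fastq.gz and a TruSeq adaptor file (both placeholders); the parameters we actually used are in the linked FastQC and Trimmomatic scripts:
module load fastqc trimmomatic
mkdir -p fastqc_raw fastqc_trimmed
fastqc -o fastqc_raw sample_R1.fastq.gz sample_R2.fastq.gz
# On Compute Canada the Trimmomatic jar is exposed via $EBROOTTRIMMOMATIC; adjust the path for your system
java -jar $EBROOTTRIMMOMATIC/trimmomatic-0.39.jar PE -threads 4 \
    sample_R1.fastq.gz sample_R2.fastq.gz \
    sample_R1.trim.fastq.gz sample_R1.unpaired.fastq.gz \
    sample_R2.trim.fastq.gz sample_R2.unpaired.fastq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
fastqc -o fastqc_trimmed sample_R1.trim.fastq.gz sample_R2.trim.fastq.gz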
The following script uses BWA to map reads to a reference genome, and incorporates sambamba, samtools, and bedtools for extraction of mapped and unmapped reads. Mapping script.
We used the Ceratina calcarata (BioProject: PRJNA791561) genome to map the reads. The data will be held until December 31, 2023.
This separates the mapped and unmapped reads into FASTQ files, which are then used for the population genetics (section 4) and metagenomics (section 5) analyses, respectively.
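A minimal sketch of the mapping and read-separation logic, with placeholder sample and reference file names (Ccalc_genome.fasta); the full Mapping script also uses sambamba and should be consulted for the exact commands:
module load bwa/0.7.17 samtools/1.10 bedtools/2.29.2
bwa index Ccalc_genome.fasta
bwa mem -t 8 Ccalc_genome.fasta sample_R1.trim.fastq.gz sample_R2.trim.fastq.gz | \
    samtools sort -@ 8 -o sample_sorted.bam -
samtools index sample_sorted.bam
# Mapped reads (read and mate both mapped) for population genetics
samtools view -b -F 12 sample_sorted.bam > sample_mapped.bam
# Unmapped reads (read and mate both unmapped) for metagenomics
samtools view -b -f 12 sample_sorted.bam > sample_unmapped.bam
samtools sort -n -o sample_unmapped_namesorted.bam sample_unmapped.bam
bedtools bamtofastq -i sample_unmapped_namesorted.bam -fq sample_unmapped_R1.fastq -fq2 sample_unmapped_R2.fastq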
Two methods were used for variant calling: Genome Analysis Toolkit (GATK) and BCFtools. The resulting VCF files were filtered using GATK, BCFtools, and VCFtools, and the filtered VCFs from the two callers were then intersected using BCFtools. The code for this analysis can be found here: Variant calling script. Additional details on the population genetics statistics calculations and population structure analyses can be found in the Methods and Supplementary Materials.
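A minimal sketch of the two calling routes and the intersection, with placeholder file names and without the filtering steps; see the Variant calling script for the exact commands and filters:
# Assumes read groups are set and the reference has .fai and .dict indexes
module load gatk bcftools samtools   # module names/versions may differ on your cluster
# Route 1: GATK
gatk HaplotypeCaller -R Ccalc_genome.fasta -I sample_mapped.bam -O sample_gatk.vcf.gz
# Route 2: BCFtools
bcftools mpileup -f Ccalc_genome.fasta sample_mapped.bam | bcftools call -mv -Oz -o sample_bcftools.vcf.gz
# Index both VCFs and keep the sites called by both methods
bcftools index sample_gatk.vcf.gz
bcftools index sample_bcftools.vcf.gz
bcftools isec -n=2 -p isec_out sample_gatk.vcf.gz sample_bcftools.vcf.gz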
We used two approaches to calculate isolation by resistance: least cost path modelling and circuit theory. The code for this analysis can be found here: Isolation by resistance and environment script. Resistance values assigned to the different land use classes across the tested hypotheses can be found in Table 1. The generated raster files were also used in Circuitscape (version 5.0.0). Additionally, we calculated environmental distances using annual temperature and precipitation (see Supplementary Materials for details on how the climatic variables were obtained).
metaSPAdes (version 3.10.1) is another option for gathering information on the sequences. In this case, contigs are produced, which can then be BLASTed. Metaspades script.
Once the contigs.fasta files are obtained, we can BLAST the contigs to obtain taxonomic information. To do this, we use BLAST+ and the BLAST+ script.
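A minimal sketch of the assembly and BLAST steps, assuming paired unmapped reads per sample and a local copy of the NCBI nt database (with its taxdb files for the taxonomy columns); the exact parameters are in the Metaspades and BLAST+ scripts:
module load spades/3.10.1 blast+/2.12.0   # module names/versions may differ on your cluster
metaspades.py -1 sample_unmapped_R1.fastq -2 sample_unmapped_R2.fastq -t 16 -o sample_metaspades
blastn -query sample_metaspades/contigs.fasta -db nt \
    -outfmt "6 qseqid sseqid pident length evalue bitscore staxids sscinames" \
    -max_target_seqs 5 -evalue 1e-5 \
    -out sample_contigs_blast.tsv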
We ran a PERMANOVA via adonis2 using Bray-Curtis dissimilarities, followed by an ANOVA on beta dispersion via betadisper. If the PERMANOVA was significant and the beta dispersion was not, this was followed by Tukey's HSD via TukeyHSD. Diversity R Script.
SIMPER analyses were done in PAST by uploading a dataframe containing the taxa relative contig abundances and the environmental feature bin information. The dataframe should be set up with each sample as a row and each variable (landscape feature and family/genus) as a column. Then select SIMPER analysis while highlighting all the taxa and all the landscape features, select pool data, and use Bray-Curtis dissimilarities.
To test whether random forests would be worthwhile on these data, I used the following random forest R script to simply assess whether classification or regression random forest analyses would produce low out-of-bag errors or a high % variance explained.
The taxa dataframe of relative contig abundances is first normalized using the DESeq2 R script, and then the obtained normalized dataframe is used for WGCNA analysis using the WGCNA R script.
DESeq2 identifies the category in which taxa are overrepresented (positive, significant log2 fold change). If the log2 fold change is negative, that taxon is more abundant in the second, contrasted group.
WGCNA aims to find significant modules (clusters) of taxa with respect to each landscape variable, and from those clusters we can identify hub taxa (WGCNA's "hub genes") by highlighting those that have an absolute gene significance (GS) value ≥ 0.2 and an absolute module membership (MM) value ≥ 0.8 (this can be done in Excel).
For the contigs that correspond to our domains of interest (these sequences were isolated from the metaSPAdes contigs.fasta output files; a seqtk-based sketch of that extraction step is shown below), we run them through FragGeneScan, which finds fragmented genes in short reads.
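A minimal sketch of pulling the contigs of interest out of each assembly with seqtk; the per-sample ID list files (*_target_contig_ids.txt) and directory layout are placeholders:
# Assumes one text file of target contig IDs per sample
module load seqtk
for ids in *_target_contig_ids.txt; do
    name=${ids%%_*}
    seqtk subseq ${name}_metaspades/contigs.fasta $ids > ${name}_extracted_contigs.fasta
done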
# Run FragGeneScan on the cluster like this, or in an sbatch script
for file in *extracted_contigs.fasta; do name=${file%%_*}; /scratch/chauk/06-EXTRACTED_CONTIGS/FragGeneScan1.31/FragGeneScan -s /scratch/chauk/06-EXTRACTED_CONTIGS/${name}_extracted_contigs.fasta -o /scratch/chauk/06-EXTRACTED_CONTIGS/${name}_output -p 1 -w 0 -t illumina_5; echo "done for $file"; done
# -w 0 means fasta file has short sequence reads (-w 1 would mean complete genomic sequence)
# -p 1 means use 1 thread (default is already 1, can omit this)
# -t illumina_5 means train the data using Illumina sequencing reads with about 0.5% error rate.
This will produce three types of output files, but we will be using the output.faa files for further functional analysis.
The output.faa files were concatenated by bin for each of the variables and run through GhostKOALA. So, for example, for agricultural percent I concatenated all the output.faa samples that belonged to bin 1, then to bin 2, then bin 3, etc., and ended up with 5 .faa files. I ran these through GhostKOALA to get the KEGG reconstructed pathways list.
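A minimal sketch of the per-bin concatenation, assuming a plain-text mapping of sample names to bins (agri_bins.txt, two whitespace-separated columns: sample and bin); this helper file and its name are placeholders:
# Build one concatenated .faa per agricultural-percent bin from a hypothetical agri_bins.txt
for bin in 1 2 3 4 5; do
    > agri_bin${bin}.faa
    while read sample b; do
        if [ "$b" = "$bin" ]; then
            cat ${sample}_output.faa >> agri_bin${bin}.faa
        fi
    done < agri_bins.txt
done
# Each agri_bin*.faa file is then uploaded to GhostKOALA via its web interface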
eggNOG-mapper uses DIAMOND to search the fragmented sequences against its protein database and obtain protein information. The output.faa files are used. eggNOG-mapper needs to be manually installed on the cluster. After downloading the tar.gz file and uploading it to the cluster, run the following:
module load python
tar xvzf 2.1.6.tar.gz
rm 2.1.6.tar.gz
cd eggnog-mapper-2.1.6
pip install -r requirements.txt
python download_eggnog_data.py # requires the python module loaded beforehand; make sure Python 3.7.7 is the version actually used for the eggNOG-mapper analysis
Once that is complete, we can run eggNOG-mapper as an sbatch script, or in the terminal if the samples are small enough.
module load StdEnv/2020 gcc/9.3.0
module load hmmer diamond prodigal
module load python/3.7.7
module load mmseqs2/13-45111
for file in *_output.faa; do name=${file%%_*}; ../../08-EGGNOGGMAPPER/eggnog-mapper-2.1.6/emapper.py -i $file -o ../../08-EGGNOGGMAPPER/${name}; done
# Running emapper.py like this, with no other options, means it will use the default DIAMOND search mode
This will produce .emapper.hits, .emapper.annotations, and .emapper.seed_orthologs files. The KEGG ID information is found within .emapper.annotations. From this file, extract the KEGG IDs from the KEGG_ko column (they look something like "ko:K01100", sometimes with multiple IDs per entry). Once we have our list of KEGG IDs from eggNOG-mapper, we can input them into the KEGG Database to see which functional names are associated with these KEGG terms.
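A minimal sketch of extracting those KEGG IDs with standard shell tools; the column position is looked up from the header line rather than hard-coded, since it can differ between eggNOG-mapper versions:
for file in *.emapper.annotations; do
    name=${file%%.*}
    # find which tab-separated column of the header line is KEGG_ko
    col=$(grep -m1 "KEGG_ko" "$file" | tr '\t' '\n' | grep -n -x "KEGG_ko" | cut -d: -f1)
    # drop comment/header lines, pull that column, split comma-separated entries, keep the ko: IDs
    grep -v "^#" "$file" | cut -f"$col" | tr ',' '\n' | grep "^ko:" | sort -u > "${name}_kegg_ids.txt"
done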
Finally, to integrate the population genetic and metagenomic components and infer potential risks to wild bee health, we analyzed the correlation between population- and landscape-level data and potential bee and plant pathogen diversity.
We created a dataframe with samples as the rows and the following columns: pathogenicity (Bray-Curtis dissimilarity for just the identified bee and plant pathogens), Yang's relatedness, environmental distance, and resistance distance. We ran this with our CoNet script.