In this project, we got the metagenomics data using Illumina MiSeq System. Based on the raw dataset, we are going to following works:
- Data Processing
- Build the complete OTU Table
- Get a fairly good idea of what factors are driving the fungal communities vs ectomycorrhizal fungal communities.
- Study the outcome of fungal communitites after forest burning.
- Do Statistcal Test.
Note: All of codes are made and tested under mac and linux.
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
We use FastQC to do see the quality of the raw dataset. Run "test_qc_one"
to see FastQC reports in the file "fq_report"
.
./test_qc_one
Cutadapt on primers and parts showing low quality for each sequence.
Once you had a closer look at the quality report you can realize that the data quality is not too bad, however we still might be able to improve the quality with a quality based trimming since the quality drops towards the end of the reads. Additionally, we want to remove primers before we do OTU study.
Run "code_trimme"
to do this step and we save trimmed sequences in file "fq_trimmed"
.
./code_trimme
In our processing, we set the parameter as following:
-Forward Primmer:
GATCTCTTGGNTCTNGCATCGATGAAGAACG
-Forward cut end length: 25
-Forward q = 20; e = 0.2
-Reverse Primmer: GGAAACCTTGTTACGACTTTTACTTCCTCTAAATGACCAA
-Reverse cut end length: 30
-Reverse q = 20; e = 0.2
In order to ccompare the different before and after Cutadapt, we run fastQC again on "fq_trimmed"
file. Then show the comparesion in Figure1. From the changes in sequences' length and number, we see cutadapt help us remove primers and sequences with low quality.
Use PIPITS to do OTU study based on trimmed sequences data.
After trimming and cleaning the raw sequences according FastQC report. We are going to use PIPITS to make OTU table.
We run "run_PIPITS"
to generate the OTU table.
./run_PIPITS
There is one error when we run PIPITS on cluster but won't happen on laptop. We have already reflect this problem to the writer.
When we test PIPITS on sample dataset, it works well and gives us the finally OTU table in reasiable time. But when we run the program on our dataset. We found the time cost is too large. We waited for about an hour but cannot get the result (Only 01W, 01S, 01N, 01E) comparing with Usearch which can finish in 30 mins for whole dataset (01-15W, 01-15S, 01-15N, 01-15E).
Maybe PIPITS is a good program for small size dataset but it should be considered when our dataset is large.
As the dataset we have is a large dataset, 32 bit version Usearch does not approve enough memory for our computation, there are three main strategies to solve this problem.
-
Buy 64 bit Usearch.
-
Use Vsearch to do analysis.
-
We are going to separate whole dataset into two group then run Usearch on each group, then we can use the commond
otutab2biom command
to transform.txt
into.json
file. Next we are going to try to run program merge_otu_tables.py (a module in QIIME) to merge two OTU tables together.
To increase merge rate, there are three approach:
- Cut more end bases
- Change the parameter
fastq_maxdiffs
andfastq_pctid
- Change the parameter
fastq_minmergelen
andfastq_maxmergelen
Then we set fastq_minmergelen 180
, fastq_maxmergelen 320
, fastq_maxdiffs 20
and fastq_pctid 20
and get high merge rate as following:
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.8Gb total), 32 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: lyu062@ucr.edu
Merging
Fwd ../fq_trimmed_test/01E_T0_R1.fastq
Rev ../fq_trimmed_test/01E_T0_R2.fastq
Relabel reads as 01E.#
00:00 72Mb FASTQ base 33 for file ../fq_trimmed_test/01E_T0_R1.fastq
00:00 72Mb CPU has 32 cores, defaulting to 10 threads
00:57 101Mb 100.0% 97.3% merged
Merging
Fwd ../fq_trimmed_test/01N_T0_R1.fastq
Rev ../fq_trimmed_test/01N_T0_R2.fastq
Relabel reads as 01N.#
01:48 101Mb 100.0% 97.1% merged
Merging
Fwd ../fq_trimmed_test/01S_T0_R1.fastq
Rev ../fq_trimmed_test/01S_T0_R2.fastq
Relabel reads as 01S.#
02:12 101Mb 100.0% 96.1% merged
Merging
Fwd ../fq_trimmed_test/01W_T0_R1.fastq
Rev ../fq_trimmed_test/01W_T0_R2.fastq
Relabel reads as 01W.#
03:17 101Mb 100.0% 96.5% merged
Totals:
1001786 Pairs (1.0M)
966903 Merged (966.9k, 96.52%)
549979 Alignments with zero diffs (54.90%)
9975 Too many diffs (> 15) (1.00%)
1 Fwd tails Q <= 2 trimmed (0.00%)
2 Rev tails Q <= 2 trimmed (0.00%)
891 Fwd too short (< 64) after tail trimming (0.09%)
719 Rev too short (< 64) after tail trimming (0.07%)
6932 No alignment found (0.69%)
0 Alignment too short (< 16) (0.00%)
3065 Merged too short (< 180)
13301 Merged too long (> 320)
85 Staggered pairs (0.01%) merged & trimmed
194.76 Mean alignment length
260.01 Mean merged length
0.71 Mean fwd expected errors
0.63 Mean rev expected errors
0.11 Mean merged expected errors
The output for .\filter
00:00 38Mb CPU has 32 cores, defaulting to 10 threads
04:07 33Mb 100.0% Filtering, 97.4% passed
966903 Reads (966.9k)
25090 Discarded reads with expected errs > 1.00
941813 Filtered reads (941.8k, 97.4%)
The output for ./uniques
00:07 317Mb 100.0% Reading ./out/filtered.fa
00:07 283Mb CPU has 32 cores, defaulting to 10 threads
00:13 502Mb 100.0% DF
00:13 513Mb 941813 seqs, 192986 uniques, 154096 singletons (79.8%)
00:13 513Mb Min size 1, median 1, max 47075, avg 4.88
00:18 352Mb 100.0% Writing ./out/uniques.fa
The output for ./otus
00:28 52Mb 100.0% 606 OTUs, 190 chimeras
The output for ./otutable
00:00 41Mb 100.0% Reading ./out/otus.fa
00:00 7.2Mb 100.0% Masking (fastnucleo)
00:00 8.0Mb 100.0% Word stats
00:00 8.0Mb 100.0% Alloc rows
00:00 8.6Mb 100.0% Build index
00:00 42Mb CPU has 32 cores, defaulting to 10 threads
04:38 81Mb 100.0% Searching merged.fq, 99.1% matched
957618 / 966903 mapped to OTUs (99.0%)
04:38 81Mb Writing ./out/otutable.txt
04:38 81Mb Writing ./out/otutable.txt ...done.
Parameter information for Usearch:
Usearch
-Merge:
fastq_maxdiffs: 15; -fastq_pctid: 15;
fastq_minmergelen 200; -fastq_maxmergelen 350
-filter:
fastq_maxee 0.5
-otus:
minsize 2
-otutable:
id 0.97
-prediction:
database: unitITS.udb
cutoff: 0.8
We can use Usearch to normalize and delete the negative contral
Based on the OTU table got from bioinformatical pipeline, we are going to do the statistical analysis in the following step.
As we have get the OTU table already, we are going to make the abundance curve for each sample. We use R language to do next works.