Nanopore #34

Open
devonorourke opened this issue Feb 18, 2018 · 11 comments

Comments

@devonorourke

Is there anything in the structure of amptk that would prohibit using Nanopore amplicon data as input?

Fragment lengths are ~1500 bp. Input fastq are already adapter and quality trimmed and demultiplexed. Curious about using amptk for some comparisons with the clustering and classification steps.

Sound crazy?

@nextgenusfs
Owner

nextgenusfs commented Feb 18, 2018

I was thinking the same thing when I saw your tweet this morning. A few mods should probably get it working, but I'm less sure it would work right out of the box unless I had some data to test with. Quality trimming with expected errors won't work as is because it is too stringent, and AMPtk currently uses it for all "clustering" steps. You could of course set the threshold really high to bypass quality trimming altogether. I thought I saw a paper on bioRxiv about a nanopore pipeline for 16S (this seems really rough around the edges though: https://github.com/umerijaz/nanopore), and I think Schloss recently put one out for PacBio reads, which means it's in mothur and is horrible to use....
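To illustrate why an expected-errors filter tuned for short reads is too stringent here, the sketch below sums the per-base error probabilities implied by Phred scores. The function name, the Phred+33 encoding, and the two example reads are assumptions made for illustration; this is not AMPtk's own code.

```python
def expected_errors(quality: str, phred_offset: int = 33) -> float:
    """Sum of per-base error probabilities, where P = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality)

# Hypothetical 1,500 bp read with a uniform Phred score of 10 ('+' in Phred+33).
nanopore_like = "+" * 1500
# Hypothetical 250 bp read with a uniform Phred score of 30 ('?' in Phred+33).
short_read_like = "?" * 250

print(round(expected_errors(nanopore_like), 2))    # 150.0
print(round(expected_errors(short_read_like), 2))  # 0.25
```

Under these assumed qualities, a cutoff of around 1 expected error would discard every long read, which is why a very high value (or no filter) is suggested above.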

I'm not sure how to deal with the "clustering", or whether you need to keep each sequence separate. But basically there's no reason you couldn't run normal clustering at, say, 97% and see what happens. The reference-based clustering in AMPtk could be useful as well. If you have some data and want to share, I can see if it works and whether I need to make some tweaks. It would probably only take a few hours to get something that works well enough to get through some testing.

@nextgenusfs
Owner

nextgenusfs commented Feb 18, 2018

In the past I've used Porechop (https://github.com/rrwick/Porechop) for demuxing and adapter trimming (Ryan writes really nice tools). Actually, I don't think I would quality trim the data at all; it would be better to leave the ends intact and make sure you can find primers (if there are any), so that you know what a full-length read looks like.

@nextgenusfs
Owner

If I remember correctly, both Porechop and now Albacore demux files into separate folders. So something like amptk illumina should be fairly easy to write: basically lift the sample name from the folder the read resides in. FASTQ headers need to have ;barcodelabel=sample_name; in them to get through the AMPtk steps properly, and that script could additionally look for primer sites to anchor to. After that, I think you could just run amptk cluster on that dataset and set a very high expected errors value, or I can write a switch to turn that off. Let me know if this is something that would be helpful; I'm always trying to make AMPtk as useful as possible.

@nextgenusfs
Owner

Actually I think this already exists, the amptk SRA demuxing script takes a folder as input and fastq files in that folder get processed, i.e. the file name is used as sample name.

myfolder:
    barcode01.fastq
    barcode02.fastq

So in the above folder, if you ran this command it should label everything:

amptk SRA -i myfolder -l 1500 --min_len 1200 -o output \
    -f ATACCGGGAGA -r AGAGATTAGAGAG --require_primer off 

This would then relabel all sequences in barcode01.fastq as ;barcodelabel=barcode01; and so on. It would find and trim primers (if needed), drop sequences shorter than 1200 bp, and trim reads longer than 1500 bp down to 1500 bp.
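For anyone who wants to see the relabeling and length handling spelled out, here is a minimal sketch of those two steps only (no primer search). The function names, the uncompressed four-line FASTQ assumption, and the exact header layout are illustrative assumptions; amptk SRA itself may format headers differently.

```python
from pathlib import Path
from typing import Iterator, Tuple

Record = Tuple[str, str, str]  # (header without '@', sequence, quality)

def read_fastq(path: Path) -> Iterator[Record]:
    """Yield records from an uncompressed, four-line-per-record FASTQ file."""
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header[1:], seq, qual

def relabel_folder(folder: str, out_path: str,
                   min_len: int = 1200, trim_len: int = 1500) -> int:
    """Write length-filtered, trimmed, relabeled reads; return how many were kept."""
    kept = 0
    with open(out_path, "w") as out:
        for fastq in sorted(Path(folder).glob("*.fastq")):
            sample = fastq.stem  # barcode01.fastq -> barcode01
            for header, seq, qual in read_fastq(fastq):
                if len(seq) < min_len:
                    continue  # drop reads shorter than min_len
                seq, qual = seq[:trim_len], qual[:trim_len]
                read_id = header.split()[0]  # keep only the ID before any whitespace
                out.write(f"@{read_id};barcodelabel={sample};\n{seq}\n+\n{qual}\n")
                kept += 1
    return kept
```

Calling `relabel_folder("myfolder", "relabeled.fq")` would process the folder shown above; the output file should live outside the input folder so it is not picked up as a sample.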

You could then take the resulting output.demux.fq.gz file into amptk cluster, i.e.:

amptk cluster -i output.demux.fq.gz --minsize 1 -o output -e 100 

Note this will keep singletons, which you probably want to do (default is --minsize 2). You could get a better idea of what expected error value to use by running the following command on your input reads and investigating a little bit:

vsearch --fastq_eestats2 test.fastq --output test.txt \
    --ee_cutoffs 1,2,5,10,50,100 --length_cutoffs 500,2000,100

This will tell you how many reads would be retained at various EE values and lengths.
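As a rough picture of what such a table contains, the sketch below counts, for each length cutoff, the reads that are at least that long and whose expected errors over the first N bases fall within each EE cutoff. It is an approximation written for illustration, not a reimplementation of vsearch; the function name, the Phred+33 encoding, and the example quality strings and cutoffs are all assumptions.

```python
from itertools import accumulate
from typing import Dict, Iterable, List, Tuple

def retention_table(
    qualities: Iterable[str],
    ee_cutoffs: List[float],
    length_cutoffs: List[int],
    phred_offset: int = 33,
) -> Dict[Tuple[int, float], int]:
    """Count reads that reach each length and stay within each EE cutoff when truncated to it."""
    counts = {(length, ee): 0 for length in length_cutoffs for ee in ee_cutoffs}
    for qual in qualities:
        # cumulative[i] = expected errors in the first i + 1 bases
        cumulative = list(
            accumulate(10 ** (-(ord(ch) - phred_offset) / 10) for ch in qual)
        )
        for length in length_cutoffs:
            if len(qual) < length:
                continue  # read is too short to be truncated to this length
            ee_at_length = cumulative[length - 1]
            for ee in ee_cutoffs:
                if ee_at_length <= ee:
                    counts[(length, ee)] += 1
    return counts

# Hypothetical quality strings: 1,500 bp at Q10, 1,500 bp at Q20, 600 bp at Q10.
reads = ["+" * 1500, "5" * 1500, "+" * 600]
table = retention_table(reads, ee_cutoffs=[10, 50, 100], length_cutoffs=[400, 1400])
for (length, ee), n in sorted(table.items()):
    print(f"length>={length}\tEE<={ee}\treads={n}")
```

Reading down a column of such a table shows how quickly long, noisy reads drop out as the EE cutoff tightens, which is the information needed to pick a value for -e.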

@devonorourke
Author

Thanks Jon,
I'm generating these data in the next couple of weeks. I'll let you know how testing goes soon. I'll send you some test data if you'd like?
Devon

@nextgenusfs
Owner

Yeah that would be great. I have a nanopore, but haven't used it for amplicons. I probably got one a little too early, when the data wasn't as good as what I read comes off it now. Seems like the kits/technology change overnight...

@druvus
Contributor

druvus commented May 21, 2018

@devonorourke @nextgenusfs Any updates on using amptk with nanopore?

I have been playing around a little with my nanopore amplicons (16S, 16S23S, ITS, 18S), but I am not able to get nice clustering, so I thought you might have some further recommendations.

@nextgenusfs
Owner

@druvus I haven't seen any data yet, so I haven't looked at it specifically. I should be able to come up with a method if a mock community was sequenced; does anybody know if that data is public somewhere? Reference-based clustering should work in theory, although the best aligner for that would probably be minimap2 (not currently in AMPtk). For a de novo approach, I would think something like: quality trimming (find forward/reverse primers, some sort of Q-filter), find uniques, then some sort of pre-clustering to find "centroids" (I would think there are too many errors for 100% dereplication to be very effective in determining quality), and then mapping to those sequences using minimap2?

@devonorourke
Author

I generated a tiny bit of 16S data a few months ago; it was a totally failed experiment I was trying as part of a one-week high school workshop (apparently bad reagents killed 3 flow cells in a day... ouch). It generated maybe 2000 total reads, so probably not enough to really flesh out how well amptk can handle these kinds of data.
@nextgenusfs - I'll be in London at the Nanopore conference starting Wednesday and can ask around for public data; nothing comes to mind at the moment.
I'm guessing the workflow could look something like you proposed:
sequence --> Albacore --> Porechop --> Minimap2.
In my tiny dataset I just used USEARCH to do everything and it seemed to generate an output that was expected (we swabbed the mouths of animals and got back bacteria commonly found in mouths of animals). I scribbled down that pretty standard code here.

If you wanted to do a de novo approach first, you could try miniasm on the front end. You won't be looking for forward/reverse primers, I don't think, will you? If you're base-calling nanopore data, your first task is converting the raw signal from a .fast5 file to a .fastq; at that point you'll have your demultiplexed dataset. Porechop will also demultiplex if you want it to; either way, you should probably have already split your reads before assembling or mapping.

Cheers,
Devon

@nextgenusfs
Owner

Well let me know if you find some data that has a mock community... While it certainly depends on your experimental goal, I'm assuming here that we are talking about PCR amplicons, and in that case I would very much use the forward/reverse conserved priming regions to enforce "full-length" sequences for "OTU-picking" (to use classical terminology). What length amplicons are you talking about here? If 1.5 kb or so, it should be easy for Nanopore to sequence across the entire length of these amplicons. While Porechop would be good for removing adapter sequence, I'm assuming that your initial PCR region-specific primers would still be intact, so then: 1) pick out only sequences that are full-length, 2) run dereplication, 3) cluster, 4) map reads to "OTUs" using minimap2. I would need to write a PAF/SAM to OTU_table script, but I wouldn't think that would be too difficult.
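A PAF to OTU table conversion along those lines might look like the sketch below. It assumes the standard 12-column PAF layout that minimap2 writes, that read names carry a ;barcodelabel=sample; tag, and that each read is assigned to the single target with the most matching bases. The function name, the "unlabeled" fallback, and the optional identity filter are illustrative assumptions, not part of AMPtk.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

LABEL = re.compile(r";barcodelabel=([^;]+);")

def paf_to_otu_table(paf_lines: Iterable[str],
                     min_identity: float = 0.0) -> Dict[str, Dict[str, int]]:
    """Return {otu: {sample: read_count}} using each read's best-matching target."""
    best = {}  # read name -> (matching bases, otu)
    for line in paf_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # skip blank or malformed lines
        read, otu = cols[0], cols[5]
        matches, block_len = int(cols[9]), int(cols[10])
        if block_len == 0 or matches / block_len < min_identity:
            continue
        if read not in best or matches > best[read][0]:
            best[read] = (matches, otu)
    table = defaultdict(lambda: defaultdict(int))
    for read, (_, otu) in best.items():
        found = LABEL.search(read)
        sample = found.group(1) if found else "unlabeled"
        table[otu][sample] += 1
    return {otu: dict(samples) for otu, samples in table.items()}
```

Passing an open PAF file handle as `paf_lines` would give a nested dictionary that can be written out as a tab-delimited table with OTUs as rows and samples as columns.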

I would be concerned about using something like miniasm, as there should be many sequences in a community that are 95-97% identical yet are unique OTUs, so I'm not sure collapsing/assembling those reads would yield the desired result.

@nextgenusfs
Owner

I should add -- we have a MinION, but I have only tried using it for long reads for genome assembly and have not run any of the PCR/amplicon procedures -- so I'm not very familiar with the adapters/primers/etc.
