Dicistro assemblies #239
All gene_clusters concatenated into a single file:
Diamond of all gene_clusters assemblies versus rdrp0_q_d:
Diamond of all gene_clusters assemblies versus dicistro.protref.aa: |
My analysis of the above data is posted on S3 here: s3://serratus-public/rce/quenya_analysis/
Selection criteria are: contig length >300 nt, RdRp/ORF1 identity in the range 50..75%.
Many assemblies have hits to multiple contigs, but many of these look like assembly problems where the same virus is assembled multiple times. Therefore, I select only the first matching hit from each SRA. As a result, these numbers may underestimate the true count in cases where there are multiple viruses in one SRA.
Each directory contains the following files:
first_hit_id50to75.fa -- the aligned query segment from the diamond output.
first_hit_id50to75_clustered_id70.fa -- first_hit_id50to75.fa clustered at 70% nt identity.
first_hit_id50to75.tsv -- tab-separated file with fields 1. SRA, 2. contig, 3. contig_length, 4. pctid, 5. database_label. |
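For anyone wanting to reproduce the selection step, here is a minimal Python sketch of the criteria described above (contig length >300 nt, identity in 50..75%, first matching hit per SRA). It is not the script that produced the S3 files. The function name `select_first_hits`, the inclusive identity bounds, and the accession and contig names in the usage example are all assumptions made for illustration; the five fields follow the order of first_hit_id50to75.tsv.

```python
def select_first_hits(rows, min_len=300, min_id=50.0, max_id=75.0):
    """Keep the first hit per SRA that passes the length and identity filters.

    rows: iterable of (sra, contig, contig_length, pctid, database_label),
    in the order the hits were reported. Bounds on pctid are treated as
    inclusive here, which is an assumption.
    """
    seen = set()      # SRA accessions that already have a selected hit
    selected = []
    for sra, contig, contig_length, pctid, label in rows:
        if sra in seen:
            continue  # only one hit per SRA is kept
        length = int(contig_length)
        identity = float(pctid)
        if length > min_len and min_id <= identity <= max_id:
            seen.add(sra)
            selected.append((sra, contig, length, identity, label))
    return selected


# Hypothetical rows, for illustration only.
example_rows = [
    ("SRR1", "c1", "250", "60.0", "rdrp0"),      # too short
    ("SRR1", "c2", "900", "62.5", "rdrp0"),      # first passing hit for SRR1
    ("SRR1", "c3", "1200", "70.0", "dicistro"),  # skipped, SRR1 already has a hit
    ("SRR2", "c1", "500", "80.0", "rdrp0"),      # identity above range
    ("SRR3", "c9", "301", "50.0", "dicistro"),   # passes
]
first_hits = select_first_hits(example_rows)
```

Because only one hit per SRA is kept, a second, genuinely different virus in the same library (like `c3` above) is dropped, which is the underestimate mentioned in the comment.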
all files are in (1) the a.a. translations of RdRp according to PR (PR applied to
(2) those RdRp a.a. sequences clustered at 97%id
(3) diamond search of the clustered sequences vs. dicistro.protref.aa
(4) diamond search of the clustered sequences vs. rdrp0.
cmdlines used: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/cluster.sh |
what did you mean by "I don't see any of these" in (4) above? |
Ah sorry :) was a copy-paste of your message that I forgot to erase! |
My analysis uploaded to: The RdRp a.a. sequences provided by Rayan above were clustered at 90% id ("species") and classified according to %id to rdrp0 and dicistro.protref.aa. Files are:
|
Pathracer analysis versus all assembly graphs (not the gene_clusters this time): all files are in s3://serratus-public/assemblies/dicistro/analysis/ (1) the a.a. translations of RdRp according to PR (PR applied to each assembly graph and the concatenation of RdRP_1+RdRP_2+RdRP_3+RdRP_4+RdRP_quenya)
(2) those RdRp a.a. sequences clustered at 97%id
(3) diamond search of the clustered sequences vs. dicistro.protref.aa
(4) diamond search of the clustered sequences vs. rdrp0.
scripts used: |
Results uploaded to
|
Monkey work checking
Error Type 1: Low complexity sequences
From:
Diamond finds no similarity to this in either the rdrp0 or dicistro databases, yet the HMM models hit. Traceback on the hit in pathracer; Anton is working on this. |
Yes, I found these a few days ago; there is a discussion with Anton on the Slack. These are filtered out by the diamond E-value, but we should keep this issue in mind. PR does not report an E-value, so any long enough ORF will induce an alignment. |
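As a rough illustration of the E-value filtering mentioned above, the sketch below keeps only those PathRacer hits whose sequence ID also appears as a query in a diamond tabular output with an acceptable E-value. It assumes diamond's default 12-column tabular format (query ID in column 1, E-value in column 11). The function names, the `1e-5` threshold, and the hit IDs are hypothetical and not taken from the scripts used in this thread.

```python
def queries_passing_evalue(diamond_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Assumes the default 12-column tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    passing = set()
    for line in diamond_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            passing.add(fields[0])
    return passing


def filter_pathracer_hits(hit_ids, diamond_lines, max_evalue=1e-5):
    """Keep PathRacer hit IDs supported by a diamond hit, preserving order."""
    passing = queries_passing_evalue(diamond_lines, max_evalue)
    return [hit_id for hit_id in hit_ids if hit_id in passing]


# Hypothetical diamond rows, for illustration only.
example_diamond = [
    "hitA\tref1\t62.0\t200\t70\t2\t1\t200\t5\t204\t1e-40\t250.0",
    "hitB\tref2\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t20.1",
]
kept_hits = filter_pathracer_hits(["hitA", "hitB", "hitC"], example_diamond)
```

A hit with no diamond match at all (`hitC` here) is dropped along with weak matches, which is how the low-complexity sequences end up excluded.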
Putative Error 2: Known polio, likely misassembly
From:
Blastp hit: 70.93% from polio
That library contains a good clean hit to polio, and the hit above is from "another assembly" I am guessing. From
From:
|
I've uploaded PR hits to
|
Further uploaded:
|
Here is the assembly graph from @asl for poliovirus:
The issue is that when a single assembly graph contains multiple paths that provide an RdRp match, each one is returned. This yields indels, as seen above, in slippage-prone viruses like polio (or other RNA viruses).
Proposed Solution:
1. For each RdRp output graph, allow each rdrp-containing edge (red highlight above) to be used at most ONCE across the reported sequences. The "top hit" will be selected by percent identity / top score to a known virus (thus assuming the hit is not novel).
The risk: if there is a novel virus in the same library as a known virus and they share homology over an edge of, say, 50 amino acids, then the novel virus would be excluded because the known virus takes priority.
The benefit: this will reduce intra-sample viral variants to the most conservative set.
One more caveat: assume rdrp is the red region in the graph above and the viral genome is green. If a sub-graph (blue) has a higher-identity match than the longest match (green), but does not contain an end-to-end RdRp domain, the longest match with variants should take priority. |
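To make the proposal concrete, here is a minimal Python sketch of the "each rdrp-containing edge used at most once" rule as a greedy selection. It is only one possible reading of the proposal, not PathRacer's implementation. The `Candidate` class, its fields, the edge labels, and the scores are hypothetical. Paths with an end-to-end RdRp domain are ranked ahead of partial ones regardless of score, following the caveat above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str          # label for the reported path (hypothetical)
    edges: frozenset   # IDs of rdrp-containing graph edges used by the path
    score: float       # percent identity / score to the closest known virus
    complete: bool     # True if the path spans an end-to-end RdRp domain


def select_paths(candidates):
    """Greedily keep paths so that each rdrp-containing edge is used at most once."""
    # Complete domains first, then highest score within each group.
    ranked = sorted(candidates, key=lambda c: (not c.complete, -c.score))
    used_edges = set()
    kept = []
    for cand in ranked:
        if used_edges.isdisjoint(cand.edges):
            kept.append(cand)
            used_edges.update(cand.edges)
    return kept


# Hypothetical candidates, for illustration only.
example_candidates = [
    Candidate("polio_main", frozenset({"e1", "e2", "e3"}), 98.0, True),
    Candidate("polio_indel", frozenset({"e1", "e2", "e4"}), 97.0, True),
    Candidate("partial_high", frozenset({"e2"}), 99.5, False),
    Candidate("other_virus", frozenset({"e7", "e8"}), 60.0, True),
]
kept_names = [c.name for c in select_paths(example_candidates)]
```

The sketch also shows the stated risk: any path sharing even one edge with a higher-ranked path is discarded, so a novel virus that overlaps a known one on a single edge would be lost.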
Analysis of high-trust scaffold sequences uploaded to:
|
I extracted the assembled scaffolds that contain all of Anton's centroids ( Results are in: generated using: https://gitlab.pasteur.fr/rchikhi_pasteur/serratus-batch-assembly/-/blob/master/quenya/pathracer/graph_get_scaffolds.py |
Coronaspades' gene_clusters files are here:
s3://serratus-public/assemblies/dicistro/gene_clusters/
Other coronaspades files:
s3://serratus-public/assemblies/dicistro/other/
4921 assemblies made out of 5442 accessions:
s3://serratus-public/assemblies/dicistro/analysis/list_assembled_dicistro.txt