Format of assembly report isn't clear #212
Comments
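It is indeed not the clearest part of the code and is pretty much absent from the documentation. The assembly report is expected to have 10 columns, and the content of columns 1, 5, 7 and 10 is recorded. Assembly reports that match this description can be found on the GenBank FTP site, like this one: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1/GCA_000002285.2_CanFam3.1_assembly_report.txt If the first column (CHROM) of the VCF and the first word (anything before the first whitespace) of the fasta header each match any of the synonyms found in those columns of the assembly report, then they are matched. I hope this helps.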
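A minimal sketch of the matching rule described in this thread, written in Python rather than the validator's own C++. It assumes the report is tab-separated with `#` comment lines, that `na` marks a missing value, and that CHROM and the fasta name must be synonyms on the same row; none of these details is confirmed in the thread, and the function names are made up for illustration. In NCBI assembly reports, columns 1, 5, 7 and 10 are Sequence-Name, GenBank-Accn, RefSeq-Accn and UCSC-style-name.

```python
# Columns 1, 5, 7 and 10 of the assembly report, as 0-based indices.
SYNONYM_COLUMNS = (0, 4, 6, 9)


def load_synonym_sets(lines):
    """Return one set of synonyms per data row of an assembly report."""
    synonym_sets = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # assumption: '#' lines are header/comments
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        # Assumption: 'na' means "no value" and is not a usable synonym.
        synonym_sets.append({fields[i] for i in SYNONYM_COLUMNS if fields[i] != "na"})
    return synonym_sets


def fasta_first_word(header):
    """Everything before the first whitespace, without the leading '>'."""
    return header.lstrip(">").split(None, 1)[0]


def is_match(chrom, fasta_header, synonym_sets):
    """True if CHROM and the fasta name are synonyms on the same report row."""
    contig = fasta_first_word(fasta_header)
    return any(chrom in names and contig in names for names in synonym_sets)
```

Under these assumptions, a CHROM found in column 5 and a fasta name found in column 1 of the same row would match, and so would the reverse.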
Cool, so (just to check I understand) if CHROM is in column 5 and 'the
first word' of the fasta header is in column 1, or the other way round, for
example, either would be a match?
Adding your text to the documentation would be enough I think.
I'm playing with an assembly mapping where the chromosome was initially
called 1, 2, 3, etc., then got renamed to chr1, chr2, chr3, etc. It could
be nice to add a 'chr stripped' (or 'chr prepended') ID to the list of
synonyms.
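To make the suggestion concrete, here is a rough Python sketch of what a 'chr stripped' / 'chr prepended' synonym could look like. This is only the idea proposed above, not an existing feature of the tool, and `chr_variants` is a hypothetical helper name.

```python
def chr_variants(name):
    """Return the name plus its 'chr'-stripped or 'chr'-prepended form."""
    variants = {name}
    if name.startswith("chr"):
        stripped = name[len("chr"):]
        if stripped:  # do not add an empty synonym for a bare "chr"
            variants.add(stripped)
    else:
        variants.add("chr" + name)
    return variants
```

It would probably make sense to apply this only to the sequence name column, since prepending `chr` to an accession is unlikely to be useful.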
BTW, since you're here ;-) Does the vcf_assembly_checker look for matching
sequence lengths to 'validate' the assembly report?
Also, I initially thought I should make the sequence.fna.fai using
`makeblastdb`, but then realised it was the samtools format fasta index...
Why do you build and then discard the fasta index? You mention it's
required and then silently create it (and then discard it) on the fly... I
was wondering why the tool was running so slow until I realised that
makeblastdb wasn't producing the .fai.
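For anyone else who trips over this: the samtools `.fai` index (as written by `samtools faidx sequence.fna`) is a plain tab-separated text file with one line per sequence, holding the name, length, byte offset, bases per line and bytes per line. A small Python sketch for reading names and lengths out of it; the function name is made up for illustration.

```python
def read_fai_lengths(lines):
    """Map sequence name -> length from the lines of a samtools .fai file."""
    lengths = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        # A FASTA .fai line has 5 columns:
        # NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH
        if len(fields) < 5:
            raise ValueError(f"not a .fai line: {line!r}")
        lengths[fields[0]] = int(fields[1])
    return lengths
```

Usage would be along the lines of `with open("sequence.fna.fai") as fh: lengths = read_fai_lengths(fh)`.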
Many thanks,
Dan.
On Thu, 24 Jun 2021 at 12:53, Timothee Cezard ***@***.***> wrote:
It is indeed not the clearest part of the code and is pretty much absent from
the documentation:
The assembly report is expected to have 10 columns
<https://github.com/EBIvariation/vcf-validator/blob/78cadd491d1d1e25fb5e8538072ba86c7272db2e/inc/assembly_report/assembly_report.hpp#L112>
and the content of columns 1, 5, 7 and 10 is recorded
<https://github.com/EBIvariation/vcf-validator/blob/78cadd491d1d1e25fb5e8538072ba86c7272db2e/inc/assembly_report/assembly_report.hpp#L181>
Assembly reports that match this description can be found on the GenBank
FTP site, like this one
<https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/002/285/GCA_000002285.2_CanFam3.1/GCA_000002285.2_CanFam3.1_assembly_report.txt>
If the first column (CHROM) of the VCF and the first word (anything before
the first whitespace) of the fasta header each match any of the synonyms
found in the columns mentioned above from the assembly report, then they are
matched.
I hope this helps.
Which columns of the assembly report are used by the assembly checker to define synonyms?
Enquiring minds demand to know! ;-)
Many thanks,
Dan.