Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX: update get-ncbi-genomes to use genome accession IDs as taxonomy IDs #193

Merged
merged 7 commits into from
Jun 24, 2024

Conversation

misialq
Copy link
Contributor

@misialq misialq commented Jun 11, 2024

This PR fixes the issue where the taxonomies generated by the get-ncbi-genomes action would contain IDs that do not match the IDs used in the corresponding FASTA file to represent genomic sequences.

  • Previously, genome assembly IDs (GCF*) were used in the taxonomy but genome accession numbers (e.g., NC*) were used in sequence headers obtained from NCBI.
  • The changes introduced here will allow finding the accession ID in the sequence metadata obtained with the sequences and use this one instead of the assembly ID.
  • An additional challenge which is solved by this approach is the potential existence of multiple chromosomes in a single assembly (hence, single ID would be have been used to represent multiple sequences): in case of bacterial, the typical scenario is existence of additional plasmid molecules; in eukaryotes it may be gDNA molecule comprising multiple chromosomes + mitochondrial/chloroplast DNA
  • Additionally, a new flag is being introduced: --p-only-genomic which will allow users to only keep the genomic DNA vs. keeping all the molecules.

@misialq misialq requested a review from nbokulich June 11, 2024 13:02
Copy link
Collaborator

@nbokulich nbokulich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @misialq ,

Looks good. Could you please remove the ncbi-datasets.zip and ncbi-dataset/README.md files if they are not needed?

I still want to pull down the code and test it out with a few different queries, but in the meantime I just have one small comment on the code.

rescript/ncbi_datasets.py Outdated Show resolved Hide resolved
@misialq
Copy link
Contributor Author

misialq commented Jun 21, 2024

Hey @nbokulich, thanks for review! I removed the README file - the zip one had already been removed (you meant the one in the tests/data dir, right?)

@nbokulich
Copy link
Collaborator

Hey @misialq I did a bit of user testing with my favorite taxa and all LGTM, thanks!

Yes you are right, I was not reading carefully enough. I now see that this PR removes the unwanted file.

@misialq
Copy link
Contributor Author

misialq commented Jun 24, 2024

Hey @nbokulich, thanks a lot for testing! In that case, I'm merging this with great excitement 😆

@misialq misialq merged commit 7d9ef27 into bokulich-lab:master Jun 24, 2024
4 checks passed
@misialq misialq deleted the id-fix branch June 24, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants