Inconclusive results with is.genome.available() #70

Closed
johanneswerner opened this issue Mar 4, 2021 · 3 comments

Comments

@johanneswerner

Hello,

I am currently trying to reproduce the behavior in #40, which showed a lot of memory peaks. I wanted to see whether those memory peaks also occur when calling getRNA() directly, but I did not manage to find an organism from Gammaproteobacteria to download. To rule out a misspelling, I checked the names with is.genome.available().

> is.genome.available(db = "genbank", "Pseudomonas", details = FALSE)
Unfortunatey, no entry for 'Pseudomonas' was found in the 'genbank' database. Please consider specifying 'db = refseq' or 'db = ensembl' or 'db = ensemblgenomes' or 'db = uniprot' to check whether 'Pseudomonas' is available in these databases.
[1] FALSE
> is.genome.available(db = "genbank", "Haloferax", details = FALSE)
A reference or representative genome assembly is available for 'Haloferax'.
More than one entry was found for 'Haloferax'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Haloferax, db = genbank, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE

How can there be no entry for Pseudomonas, one of those bacteria that seem to be almost everywhere? As another example, I chose Haloferax, a halophilic archaeon that has reference genomes available.

Any idea where I am going wrong?

@HajkD
Member

HajkD commented Mar 9, 2021

Hi Johannes,

Many thanks for looking into this.

I think specifying the entire scientific name should do the trick:

biomartr::is.genome.available(db = "genbank", "Pseudomonas syringae", details = FALSE)
A reference or representative genome assembly is available for 'Pseudomonas syringae'.
More than one entry was found for 'Pseudomonas syringae'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Pseudomonas syringae, db = genbank, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE
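
The message above points to `details = TRUE` as the next step. A minimal sketch of that follow-up is shown here. It needs the biomartr package and access to the NCBI servers, so it is guarded to run only in an interactive session. The column names `organism_name` and `assembly_accession` are an assumption about the returned table, based on the message text and the NCBI assembly summary layout, and `run_lookup` is a name introduced for this sketch.

```r
# Only run the lookup when biomartr is installed and the session is interactive,
# because the call downloads the NCBI assembly summary files.
run_lookup <- interactive() && requireNamespace("biomartr", quietly = TRUE)

if (run_lookup) {
  hits <- biomartr::is.genome.available(
    db = "genbank",
    organism = "Pseudomonas syringae",
    details = TRUE
  )
  # Assumption: the details table contains these two columns.
  # Listing them shows the exact spelling of every matching entry.
  print(unique(hits[, c("organism_name", "assembly_accession")]))
}
```

According to the message, one of the listed `assembly_accession` identifiers can then be passed to the get*() functions in place of the organism name.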

For haloferax volcanii it is indeed FALSE. Do you by any chance know another haloferax strain that we could test?

biomartr::is.genome.available(db = "genbank", "haloferax volcanii", details = FALSE)
> FALSE

Many thanks,
Hajk

@johanneswerner
Author

It seems I reported a result that is not reproducible.

> library(biomartr)
> biomartr::is.genome.available(db = "genbank", "Pseudomonas syringae", details = FALSE)
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/archaea/assembly_summary.txt'
Content type 'unknown' length 2162267 bytes (2.1 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt'
Content type 'unknown' length 275981598 bytes (263.2 MB)
==================================================
|===================================================================================================| 100% 263 MB
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/fungi/assembly_summary.txt'
Content type 'unknown' length 2269258 bytes (2.2 MB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/invertebrate/assembly_summary.txt'
Content type 'unknown' length 542137 bytes (529 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/plant/assembly_summary.txt'
Content type 'unknown' length 496253 bytes (484 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/protozoa/assembly_summary.txt'
Content type 'unknown' length 302395 bytes (295 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_mammalian/assembly_summary.txt'
Content type 'unknown' length 515798 bytes (503 KB)
==================================================
trying URL 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/vertebrate_other/assembly_summary.txt'
Content type 'unknown' length 532328 bytes (519 KB)
==================================================
A reference or representative genome assembly is available for 'Pseudomonas syringae'.
More than one entry was found for 'Pseudomonas syringae'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Pseudomonas syringae, db = genbank, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE
> biomartr::is.genome.available(db = "genbank", "Haloferax volcanii", details = FALSE)
|===================================================================================================| 100% 265 MB
A reference or representative genome assembly is available for 'Haloferax volcanii'.
More than one entry was found for 'Haloferax volcanii'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Haloferax volcanii, db = genbank, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE
> biomartr::is.genome.available(db = "genbank", "Haloferax", details = FALSE)
|===================================================================================================| 100% 265 MB
A reference or representative genome assembly is available for 'Haloferax'.
More than one entry was found for 'Haloferax'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Haloferax, db = genbank, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE
> biomartr::is.genome.available(db = "genbank", "Pseudomonas", details = FALSE)
|===================================================================================================| 100% 265 MB
A reference or representative genome assembly is available for 'Pseudomonas'.
More than one entry was found for 'Pseudomonas'. Please consider to run the function 'is.genome.available()' and specify 'is.genome.available(organism = Pseudomonas, db = genbank, details = TRUE)'. This will allow you to select the 'assembly_accession' identifier that can then be specified in all get*() functions.
[1] TRUE

I am entirely sure that the last time I tested, I got FALSE back for the last query (organism = Pseudomonas), but I cannot reproduce it anymore.

So it seems is.genome.available() works both with species and genus names. With that, I will close this issue.

Btw, the reason for your FALSE return value was the spelling of the organism name: it should be Haloferax volcanii, with a capital H.
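
The difference between 'haloferax volcanii' and 'Haloferax volcanii' can be illustrated offline. The sketch below is not biomartr's actual matching code. It assumes, for illustration only, that the query is matched case-sensitively as a substring of the organism names in the assembly summary. `organism_names` and `has_match()` are hypothetical.

```r
# Toy list of organism names as they might appear in an assembly summary.
organism_names <- c(
  "Haloferax volcanii DS2",
  "Haloferax mediterranei ATCC 33500",
  "Pseudomonas syringae pv. tomato str. DC3000"
)

# Hypothetical helper: TRUE if `query` occurs as a literal substring of any
# candidate name. Matching is case-sensitive unless ignore_case = TRUE.
has_match <- function(query, candidates, ignore_case = FALSE) {
  if (ignore_case) {
    query <- tolower(query)
    candidates <- tolower(candidates)
  }
  any(grepl(query, candidates, fixed = TRUE))
}

has_match("haloferax volcanii", organism_names)                      # FALSE
has_match("Haloferax volcanii", organism_names)                      # TRUE
has_match("Haloferax", organism_names)                               # TRUE
has_match("haloferax volcanii", organism_names, ignore_case = TRUE)  # TRUE
```

Under that assumption, a genus-only query such as "Haloferax" still matches, which is consistent with the output above.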

@HajkD
Member

HajkD commented Mar 10, 2021

Excellent! Thank you very much for this update!

It is very strange that it stopped working at some point. Maybe some file on the NCBI servers was updated incorrectly and has since been fixed? Since I parse the species names from files provided by NCBI, this is the only "naive" explanation I can come up with.
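
To make that explanation concrete: a lookup of this kind can only be as good as the summary file it parses at that moment. The following is a minimal sketch under strong assumptions. It uses a simplified two-column stand-in for assembly_summary.txt (the real file has many more columns) and is not the parsing code used in biomartr. `parse_summary()`, `genus_available()`, both sample texts, and the `EXAMPLE_` accessions are hypothetical placeholders.

```r
# Simplified stand-ins for a summary file: tab-separated, one header row.
complete_file <- paste(
  "assembly_accession\torganism_name",
  "EXAMPLE_0001\tPseudomonas syringae",
  "EXAMPLE_0002\tHaloferax volcanii",
  sep = "\n"
)
# Same file with the Pseudomonas row missing, as after a faulty update.
incomplete_file <- paste(
  "assembly_accession\torganism_name",
  "EXAMPLE_0002\tHaloferax volcanii",
  sep = "\n"
)

# Parse the tab-separated text into a data frame.
parse_summary <- function(txt) {
  read.table(text = txt, header = TRUE, sep = "\t", quote = "",
             comment.char = "", stringsAsFactors = FALSE)
}

# TRUE if any organism name starts with the given genus.
genus_available <- function(summary_tbl, genus) {
  any(startsWith(summary_tbl$organism_name, genus))
}

genus_available(parse_summary(complete_file), "Pseudomonas")    # TRUE
genus_available(parse_summary(incomplete_file), "Pseudomonas")  # FALSE
```

The same query gives different answers depending only on the file contents, which would match a FALSE that can no longer be reproduced once the file is corrected.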
