From 4c1811169b6556b4d028fedc25169f1c7044c41b Mon Sep 17 00:00:00 2001
From: HajkD
Date: Tue, 22 Feb 2022 16:24:03 +0100
Subject: [PATCH] fixing broken links in documentation

---
 DESCRIPTION                                   |  2 +-
 NEWS.md                                       | 23 +++---
 R/check_annotation_biomartr.R                 |  4 +-
 R/getGenome.R                                 |  5 +-
 README.md                                     | 26 +++----
 docs/articles/BioMart_Examples.html           |  8 +-
 docs/articles/Database_Retrieval.html         |  2 +-
 docs/articles/Functional_Annotation.html      | 16 ++--
 docs/articles/MetaGenome_Retrieval.html       |  4 +-
 docs/articles/Sequence_Retrieval.html         | 66 ++++++++---------
 docs/index.html                               | 34 ++++-----
 docs/pkgdown.yml                              |  2 +-
 docs/reference/check_annotation_biomartr.html |  2 +-
 docs/reference/getGenome.html                 |  3 +-
 man/check_annotation_biomartr.Rd              |  4 +-
 man/getGenome.Rd                              |  3 +-
 vignettes/BioMart_Examples.Rmd                |  8 +-
 vignettes/Database_Retrieval.Rmd              |  2 +-
 vignettes/Functional_Annotation.Rmd           | 18 ++---
 vignettes/MetaGenome_Retrieval.Rmd            |  4 +-
 vignettes/Sequence_Retrieval.Rmd              | 74 +++++++++----------
 21 files changed, 149 insertions(+), 161 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index 9a76c3b3..c31185cb 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -45,7 +45,7 @@ Suggests:
     magrittr
 License: GPL-2
 LazyData: true
-URL: https://docs.ropensci.org/biomartr, https://github.com/ropensci/biomartr
+URL: https://docs.ropensci.org/biomartr/, https://github.com/ropensci/biomartr
 BugReports: https://github.com/ropensci/biomartr/issues
 RoxygenNote: 7.1.2
 Encoding: UTF-8
diff --git a/NEWS.md b/NEWS.md
index f69c2422..1b109aa3 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -69,7 +69,7 @@ Please use `tibble::as_tibble()` instead.
-> adjusted `organismBM()` accordingly - Fixing a bug in `download.database.all()` where the lack of removing listed file `*-metadata.json` caused corruption of the download process (Many thanks to Jaruwatana Lotharukpong) -[biomartr 0.9.2](https://github.com/ropensci/biomartr/releases/tag/v0.9.1) +biomartr 0.9.2 - minor changes to comply with CRAN policy regarding Internet access failure -> Instead of using warnings or error messages, only gentle messages are allowed to be used @@ -79,7 +79,7 @@ Please use `tibble::as_tibble()` instead. -> adjusted `organismBM()` accordingly =========== __Please be aware that as of April 2019, ENSEMBLGENOMES -was retired ([see details here](http://www.ensembl.info/2019/03/08/joint-rest-server-for-ensembl-and-ensembl-genomes-in-ensembl-96/)). Hence, all `biomartr` functions were updated +was retired ([see details here](https://www.ensembl.info/2019/03/08/joint-rest-server-for-ensembl-and-ensembl-genomes-in-ensembl-96/)). Hence, all `biomartr` functions were updated and won't support data retrieval from `ENSEMBLGENOMES` servers anymore.__ ### New Functions @@ -133,14 +133,13 @@ protein sequences, gff files, etc for a particular species ### New Functionality of Existing Functions -- `getProteome()` can now retrieve proteomes from the [UniProt](http://www.uniprot.org/) database by specifying `getProteome(db = "uniprot")`. -An example can be found [here](https://github.com/ropensci/biomartr/blob/master/vignettes/Sequence_Retrieval.Rmd#example-retrieval-uniprot) +- `getProteome()` can now retrieve proteomes from the [UniProt](https://www.uniprot.org/) database by specifying `getProteome(db = "uniprot")`. 
- `is.genome.available()` now prints out more useful interactive messages when searching for available organisms - `is.genome.available()` can now handle `taxids` and `assembly_accession ids` in addition to the scientific name when specifying argument `organism` -An example can be found [here](https://github.com/ropensci/biomartr/blob/master/vignettes/Sequence_Retrieval.Rmd#example-ncbi-refseq) + - `is.genome.available()` can now check for organism availability in the UniProt database @@ -195,7 +194,7 @@ biomartr 0.5.1 ### Bug fixes -- fixing a bug in `exists.ftp.file()` and `getENSEMBLGENOMES.Seq()` that caused bacterial genome, proteome, etc retrieval to fail due to the wrong construction of a query ftp request https://github.com/HajkD/biomartr/issues/7 +- fixing a bug in `exists.ftp.file()` and `getENSEMBLGENOMES.Seq()` that caused bacterial genome, proteome, etc retrieval to fail due to the wrong construction of a query ftp request https://github.com/ropensci/biomartr/issues/7 (Many thanks to @dbsseven) - fix a major bug in which organisms having no representative genome would generate NULL paths that subsequently crashed the `meta.retrieval()` function when it tried to print out the result paths. @@ -247,7 +246,7 @@ biomartr 0.4.0 ### Bug fixes -- fixing a major bug https://github.com/HajkD/biomartr/issues/6 that caused that in all `get*()` (genome, proteome, gff, etc.) and `meta.retrieval*()` functions +- fixing a major bug https://github.com/ropensci/biomartr/issues/6 that caused that in all `get*()` (genome, proteome, gff, etc.) and `meta.retrieval*()` functions the meta retrieval process errored and terminated whenever NCBI or ENSEMBL didn't store all types of sequences for a particular organism: genome, proteome, cds, etc. This has been fixed now and function calls such as `meta.retrieval(kingdom = "bacteria", db = "genbank", type = "proteome")` should work properly now (Thanks to @ARamesh123 for making me aware if this bug). 
Hence, this bug affected all attempts to download all proteome sequences e.g. for bacteria and viruses, because NCBI does not store genome AND proteome information for all bacterial or viral species. @@ -291,15 +290,15 @@ biomartr 0.3.0 ### Bug fixes -- Fixing a bug https://github.com/HajkD/biomartr/issues/2 based on the [readr package](https://github.com/tidyverse/readr) that affected the `getSummaryFile()`, `getKingdomAssemblySummary()`, `getMetaGenomeSummary()`, +- Fixing a bug https://github.com/ropensci/biomartr/issues/2 based on the [readr package](https://github.com/tidyverse/readr) that affected the `getSummaryFile()`, `getKingdomAssemblySummary()`, `getMetaGenomeSummary()`, `getENSEMBL.Seq()` and `getENSEMBLGENOMES.Seq()` functions causing quoted lines in the `assembly_summary.txt` to be omitted when reading these files. This artefact caused that e.g. instead of information of 80,000 Bacteria genomes only 40,000 (which non-quotations) were read (Thanks to [Xin Wu](https://github.com/alartin)). biomartr 0.2.1 =========== -In this version of `biomartr` the `organism*()` functions were adapted to the new [ENSEMBL 87 release](http://www.ensembl.info/blog/2016/12/08/ensembl-87-has-been-released/) -in which organism name specification in the Biomart description column [was changed](https://github.com/HajkD/biomartr/issues/1) +In this version of `biomartr` the `organism*()` functions were adapted to the new [ENSEMBL 87 release](https://www.ensembl.info/2016/12/08/ensembl-87-has-been-released/) +in which organism name specification in the Biomart description column [was changed](https://github.com/ropensci/biomartr/issues/1) from a scientific name convention to a mix of common name and scientific name convention. 
- all `organism*()` functions have been adapted to the new ENSEMBL 87 release organism name notation that is used in the Biomart description @@ -310,7 +309,7 @@ biomartr 0.2.0 =========== In this version, `biomartr` was extended to now retrieve genome, proteome, CDS, GFF and meta-genome data -also from [ENSEMBL](http://www.ensembl.org/index.html) and [ENSEMLGENOMES](http://ensemblgenomes.org/). +also from [ENSEMBL](https://www.ensembl.org/index.html) and [ENSEMLGENOMES](https://ensemblgenomes.org/). Furthermore, all NCBI retrieval functions were updated to the new server folder structure standards of NCBI. @@ -367,7 +366,7 @@ into one big data.frame ### Function changes -- functions `getGenome()`, `getProteome()`, and `getCDS()` now can also in addition to NCBI retrieve genomes, proteomes or CDS from [ENSEMBL](http://www.ensembl.org/index.html) and [ENSEMLGENOMES](http://ensemblgenomes.org/) +- functions `getGenome()`, `getProteome()`, and `getCDS()` now can also in addition to NCBI retrieve genomes, proteomes or CDS from [ENSEMBL](https://www.ensembl.org/index.html) and [ENSEMLGENOMES](https://ensemblgenomes.org/) - the functions `getGenome()`, `getProteome()`, and `getCDS()` were completely re-written and now use the assembly_summary.txt files provided by NCBI to retrieve the download path to the corresponding genome. Furthermore, these functions now lost the `kingdom` argument. diff --git a/R/check_annotation_biomartr.R b/R/check_annotation_biomartr.R index fcb232d7..afe2b8dc 100644 --- a/R/check_annotation_biomartr.R +++ b/R/check_annotation_biomartr.R @@ -2,8 +2,8 @@ #' @description Some annotation files include lines with character lengths greater than 65000. This causes problems when trying to import such annotation files into R using \code{\link[rtracklayer]{import}}. #' To overcome this issue, this function screens for such lines #' in a given annotation file and removes these lines so that -#' \code{\link[rtracklayer]{import}} can handle the file. 
-#' @param annotation_file a file path tp the annotation file. +#' \code{import} can handle the file. +#' @param annotation_file a file path to the annotation file. #' @param remove_annotation_outliers shall outlier lines be removed from the input \code{annotation_file}? #' If yes, then the initial \code{annotation_file} will be overwritten and the removed outlier lines will be stored at \code{\link{tempdir}} #' for further exploration. diff --git a/R/getGenome.R b/R/getGenome.R index 4bd14c61..39be9250 100644 --- a/R/getGenome.R +++ b/R/getGenome.R @@ -73,8 +73,9 @@ getGenome <- release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "genomes"), - assembly_type = "toplevel", - kingdom_assembly_summary_file = NULL) { + assembly_type = "toplevel" + #kingdom_assembly_summary_file = NULL + ) { if (!is.element(db, c("refseq", "genbank", "ensembl"))) stop( diff --git a/README.md b/README.md index 330d6c96..5562c121 100644 --- a/README.md +++ b/README.md @@ -2,12 +2,12 @@ biomartr ======== -[![](https://badges.ropensci.org/93_status.svg)](https://github.com/ropensci/onboarding/issues/93) +[![](https://badges.ropensci.org/93_status.svg)](https://github.com/ropensci/software-review/issues/93) [![Travis-CI Build Status](https://travis-ci.org/ropensci/biomartr.svg?branch=master)](https://travis-ci.org/ropensci/biomartr) -[![rstudio mirror downloads](http://cranlogs.r-pkg.org/badges/biomartr)](https://github.com/metacran/cranlogs.app) -[![rstudio mirror downloads](http://cranlogs.r-pkg.org/badges/grand-total/biomartr)](https://github.com/metacran/cranlogs.app) +[![rstudio mirror downloads](https://cranlogs.r-pkg.org/badges/biomartr)](https://github.com/r-hub/cranlogs.app) +[![rstudio mirror downloads](https://cranlogs.r-pkg.org/badges/grand-total/biomartr)](https://github.com/r-hub/cranlogs.app) [![Paper link](https://img.shields.io/badge/Published%20in-Bioinformatics-126888.svg)](https://academic.oup.com/bioinformatics/article/33/8/1216/2931816) -[![install with 
bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/r-biomartr/README.html) +[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](https://bioconda.github.io/recipes/r-biomartr/README.html) ## Genomic Data Retrieval with R @@ -47,10 +47,10 @@ In detail, `biomartr` automates genome, proteome, CDS, RNA, Repeats, GFF/GTF (an - [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) - [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/) - [ENSEMBL](https://www.ensembl.org/index.html) -- [ENSEMBLGENOMES](http://ensemblgenomes.org) (as of April 2019 - `ENSEMBL` and `ENSEMBLGENOMES` were joined - see [details here](http://www.ensembl.info/2019/03/08/joint-rest-server-for-ensembl-and-ensembl-genomes-in-ensembl-96/)) -- [UniProt](http://www.uniprot.org) +- [ENSEMBLGENOMES](http://ensemblgenomes.org) (as of April 2019 - `ENSEMBL` and `ENSEMBLGENOMES` were joined - see [details here](https://www.ensembl.info/2019/03/08/joint-rest-server-for-ensembl-and-ensembl-genomes-in-ensembl-96/)) +- [UniProt](https://www.uniprot.org) -Furthermore, an interface to the `Ensembl Biomart` database allows users to retrieve functional annotation for genomic loci using a novel and organism centric search strategy. In addition, users can [download entire databases](https://github.com/HajkD/biomartr/blob/master/vignettes/Database_Retrieval.Rmd) such as +Furthermore, an interface to the `Ensembl Biomart` database allows users to retrieve functional annotation for genomic loci using a novel and organism centric search strategy. In addition, users can [download entire databases](https://github.com/ropensci/biomartr/blob/master/vignettes/Database_Retrieval.Rmd) such as - `NCBI RefSeq` - `NCBI nr` @@ -62,7 +62,7 @@ with only one command. 
### Similar Work -The main difference between the [BiomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package and the [biomartr](https://docs.ropensci.org/biomartr/) package is that `biomartr` extends the `functional annotation retrieval` procedure of `BiomaRt` and __in addition__ provides useful retrieval functions for genomes, proteomes, coding sequences, gff files, RNA sequences, Repeat Masker annotations files, and functions for the retrieval of entire databases such as `NCBI nr` etc. +The main difference between the [BiomaRt](https://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package and the [biomartr](https://docs.ropensci.org/biomartr/) package is that `biomartr` extends the `functional annotation retrieval` procedure of `BiomaRt` and __in addition__ provides useful retrieval functions for genomes, proteomes, coding sequences, gff files, RNA sequences, Repeat Masker annotations files, and functions for the retrieval of entire databases such as `NCBI nr` etc. Please consult the [Tutorials section](https://docs.ropensci.org/biomartr/#tutorials) for more details. @@ -99,7 +99,7 @@ install.packages("biomartr", dependencies = TRUE) ## Installation with Bioconda -With an activated Bioconda channel (see [2. Set up channels](http://bioconda.github.io/user/install.html#set-up-channels)), install with: +With an activated Bioconda channel (see [2. Set up channels](https://bioconda.github.io/user/install.html#set-up-channels)), install with: ``` conda install r-biomartr @@ -180,10 +180,6 @@ All geneomes are stored in the folder named according to the kingdom. In this case `vertebrate_mammalian`. Alternatively, users can specify the `out.folder` argument to define a custom output folder path. -### Platforms - -> Find `biomartr` also at [OmicTools](https://omictools.com/biomartr-tool). - ### Frequently Asked Questions (FAQs) Please find [all FAQs here](https://github.com/ropensci/biomartr/blob/master/FAQs.md). 
@@ -210,7 +206,7 @@ Getting Started with `biomartr`: - [BioMart Examples](https://docs.ropensci.org/biomartr/articles/BioMart_Examples.html) -Users can also read the tutorials within ([RStudio](http://www.rstudio.com/)) : +Users can also read the tutorials within ([RStudio](https://www.rstudio.com/)) : ```r # source the biomartr package @@ -328,7 +324,7 @@ library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library") ### Troubleshooting on Windows Machines -- Install `biomartr` on a Win 8 laptop: [solution](https://github.com/HajkD/orthologr/issues/1) ( Thanks to Andres Romanowski ) +- Install `biomartr` on a Win 8 laptop: [solution](https://github.com/drostlab/orthologr/issues/1) ( Thanks to Andres Romanowski ) # Code of conduct diff --git a/docs/articles/BioMart_Examples.html b/docs/articles/BioMart_Examples.html index bae0be52..3e2f9ba8 100644 --- a/docs/articles/BioMart_Examples.html +++ b/docs/articles/BioMart_Examples.html @@ -108,13 +108,13 @@

2022-02-22

Use Case #1: Functional Annotation of Genes Sharing a Common Evolutionary History

-

Evolutionary Transcriptomics aims to predict stages or periods of evolutionary conservation in biological processes on the transcriptome level. However, finding genes sharing a common evolutionary history could reveal how the the biological process might have evolved in the first place.

-

In this Use Case we will combine functional and biological annotation obtained with biomartr with enriched genes obtained with PlotEnrichment().

+

Evolutionary Transcriptomics aims to predict stages or periods of evolutionary conservation in biological processes on the transcriptome level. However, finding genes sharing a common evolutionary history could reveal how the the biological process might have evolved in the first place.

+

In this Use Case we will combine functional and biological annotation obtained with biomartr with enriched genes obtained with PlotEnrichment().

Step 1

-

For the following example we will use the dataset an enrichment analyses found in PlotEnrichment().

-

Install and load the myTAI package:

+

For the following example we will use the dataset an enrichment analyses found in PlotEnrichment().

+

Install and load the myTAI package:

 # install myTAI
 install.packages("myTAI")
diff --git a/docs/articles/Database_Retrieval.html b/docs/articles/Database_Retrieval.html
index 355d1181..f0f5a06d 100644
--- a/docs/articles/Database_Retrieval.html
+++ b/docs/articles/Database_Retrieval.html
@@ -116,7 +116,7 @@ 

Getting Started

-

NCBI stores a variety of specialized database such as Genbank, RefSeq, Taxonomy, SNP, etc. on their servers. The download.database() and download.database.all() functions implemented in biomartr allows users to download these databases from NCBI. This process might be very useful for downstream analyses such as sequence searches with e.g. BLAST. For this purpose see the R package metablastr which aims to seamlessly inegrate biomartr based genomic data retrieval with downsteam large-scale BLAST searches.

+

NCBI stores a variety of specialized database such as Genbank, RefSeq, Taxonomy, SNP, etc. on their servers. The download.database() and download.database.all() functions implemented in biomartr allows users to download these databases from NCBI. This process might be very useful for downstream analyses such as sequence searches with e.g. BLAST. For this purpose see the R package metablastr which aims to seamlessly integrate biomartr based genomic data retrieval with downstream large-scale BLAST searches.

  • 1. List available NCBI databases with listNCBIDatabases()
  • diff --git a/docs/articles/Functional_Annotation.html b/docs/articles/Functional_Annotation.html index 5accabff..debbce2a 100644 --- a/docs/articles/Functional_Annotation.html +++ b/docs/articles/Functional_Annotation.html @@ -117,22 +117,22 @@

    Getting Started

    -

    The Ensembl Biomart database enables users to retrieve a vast diversity of annotation data for specific organisms. Initially, Steffen Durinck and Wolfgang Huber provided a powerful interface between the R language and Ensembl Biomart by implementing the R package biomaRt.

    +

    The Ensembl Biomart database enables users to retrieve a vast diversity of annotation data for specific organisms. Initially, Steffen Durinck and Wolfgang Huber provided a powerful interface between the R language and Ensembl Biomart by implementing the R package biomaRt.

    The purpose of the biomaRt package was to mimic the ENSEMBL BioMart database structure to construct queries that can be sent to the Application Programming Interface (API) of BioMart. Although, this procedure was very useful in the past, it seems not intuitive from an organism centric point of view. Usually, users wish to download functional annotation for a particular organism of interest. However, the BioMart and thus the biomaRt package require that users already know in which mart and dataset the organism of interest will be found which requires significant efforts of searching and screening. In addition, once the mart and dataset of a particular organism of interest were found and specified the user must again learn which attribute has to be specified to retrieve the functional annotation information of interest.

    -

    The new functionality implemented in the biomartr package aims to overcome this search bottleneck by extending the functionality of the biomaRt package. The new biomartr package introduces a more organism cantered annotation retrieval concept which does not require to screen for marts, datasets, and attributes beforehand. With biomartr users only need to specify the scientific name of the organism of interest to then retrieve available marts, datasets, and attributes for the corresponding organism of interest.

    +

    The new functionality implemented in the biomartr package aims to overcome this search bottleneck by extending the functionality of the biomaRt package. The new biomartr package introduces a more organism cantered annotation retrieval concept which does not require to screen for marts, datasets, and attributes beforehand. With biomartr users only need to specify the scientific name of the organism of interest to then retrieve available marts, datasets, and attributes for the corresponding organism of interest.

    This paradigm shift enables users to quickly construct queries to the BioMart database without having to learn the particular database structure and organization of BioMart.

    -

    The following sections will introduce users to the functionality and data retrieval precedures of biomartr and will show how biomartr extends the functionality of the initial biomaRt package.

    +

    The following sections will introduce users to the functionality and data retrieval precedures of biomartr and will show how biomartr extends the functionality of the initial biomaRt package.

    The old biomaRt query methodology

    -

    The best way to get started with the old methodology presented by the established biomaRt package is to understand the workflow of its data retrieval process. The query logic of the biomaRt package derives from the database organization of Ensembl Biomart which stores a vast diversity of annotation data for specific organisms. In detail, the Ensembl Biomart database is organized into so called:
    marts, datasets, and attributes. Marts denote a higher level category of functional annotation such as SNP (e.g. for functional annotation of particular single nucleotide polymorphisms (SNPs)) or FUNCGEN (e.g. for functional annotation of regulatory regions or relationsships of genes). Datasets denote the particular species of interest for which functional annotation is available within this specific mart. It can happen that datasets (= particular species of interest) are available in one mart (= higher category of functional annotation) but not in an other mart. For the actual retrieval of functional annotation information users must then specify the type of functional annotation information they wish to retrieve. These types are called attributes in the biomaRt notation.

    +

    The best way to get started with the old methodology presented by the established biomaRt package is to understand the workflow of its data retrieval process. The query logic of the biomaRt package derives from the database organization of Ensembl Biomart which stores a vast diversity of annotation data for specific organisms. In detail, the Ensembl Biomart database is organized into so called:
    marts, datasets, and attributes. Marts denote a higher level category of functional annotation such as SNP (e.g. for functional annotation of particular single nucleotide polymorphisms (SNPs)) or FUNCGEN (e.g. for functional annotation of regulatory regions or relationsships of genes). Datasets denote the particular species of interest for which functional annotation is available within this specific mart. It can happen that datasets (= particular species of interest) are available in one mart (= higher category of functional annotation) but not in an other mart. For the actual retrieval of functional annotation information users must then specify the type of functional annotation information they wish to retrieve. These types are called attributes in the biomaRt notation.

    Hence, when users wish to retrieve information for a specific organism of interest, they first need to specify a particular mart and dataset in which the information of the corresponding organism of interest can be found. Subsequently they can specify the attributes argument to retrieve a particular type of functional annotation (e.g. Gene Ontology terms).

    The following section shall illustrate how marts, datasets, and attributes could be explored using biomaRt before the biomartr package existed.

    The availability of marts, datasets, and attributes can be checked by the following functions:

     # install the biomaRt package       
    -# source("http://bioconductor.org/biocLite.R")      
    +# source("https://bioconductor.org/biocLite.R")     
     # biocLite("biomaRt")       
     # load biomaRt      
     library(biomaRt)        
    @@ -205,7 +205,7 @@ 

    The getMarts() function allows users to list all available databases that can be accessed through BioMart interfaces.

     # load the biomartr package
    -library(biomartr)
    +library(biomartr)
     
     # list all available databases
     biomartr::getMarts()
    @@ -585,7 +585,7 @@

    Gene Ontology

    -

    The biomartr package also enables a fast and intuitive retrieval of GO terms and additional information via the getGO() function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the getGO() function allows GO information retrieval from the Ensembl Biomart database.

    +

    The biomartr package also enables a fast and intuitive retrieval of GO terms and additional information via the getGO() function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the getGO() function allows GO information retrieval from the Ensembl Biomart database.

    In this example we will retrieve GO information for a set of Homo sapiens genes stored as hgnc_symbol.

    @@ -600,7 +600,7 @@

    filters = "hgnc_symbol") GO_tbl

    -

    Hence, for each gene id the resulting table stores all annotated GO terms found in Ensembl Biomart.

    +

    Hence, for each gene id the resulting table stores all annotated GO terms found in Ensembl Biomart.

diff --git a/docs/articles/MetaGenome_Retrieval.html b/docs/articles/MetaGenome_Retrieval.html index 648361a5..ba44b52d 100644 --- a/docs/articles/MetaGenome_Retrieval.html +++ b/docs/articles/MetaGenome_Retrieval.html @@ -186,7 +186,7 @@

Getting Started

The meta.retrieval() and meta.retrieval.all() functions aim to simplify the genome retrieval and computational reproducibility process for meta-genomics studies. Both functions allow users to either download genomes, proteomes, CDS, etc for species within a specific kingdom or subgroup of life (meta.retrieval()) or of all species of all kingdoms (meta.retrieval.all()). Before biomartr users had to write shell scripts to download respective genomic data. However, since many meta-genomics packages exist for the R programming language, I implemented this functionality for easy integration into existing R workflows and for easier reproducibility.

-

For example, the pipeline logic of the magrittr package can be used with meta.retrieval() and meta.retrieval.all() as follows.

+

For example, the pipeline logic of the magrittr package can be used with meta.retrieval() and meta.retrieval.all() as follows.

 # download all vertebrate genomes, then apply ...
 meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome") %>% ...
@@ -487,7 +487,7 @@

Meta retrieval of genome assembly quality information

Although much effort is invested to increase the genome assembly quality when new genomes are published or new versions are released, the influence of genome assembly quality on downstream analyses cannot be neglected. A rule of thumb is, that the larger the genome the more prone it is to genome assembly errors and therefore, a reduction of assembly quality.

-

In Veeckman et al., 2016 the authors conclude:

+

In Veeckman et al., 2016 the authors conclude:

As yet, no uniform metrics or standards are in place to estimate the completeness of a genome assembly or the annotated gene space, despite their importance for downstream analyses

diff --git a/docs/articles/Sequence_Retrieval.html b/docs/articles/Sequence_Retrieval.html index ef274905..bfe17a16 100644 --- a/docs/articles/Sequence_Retrieval.html +++ b/docs/articles/Sequence_Retrieval.html @@ -427,7 +427,7 @@

Example UniProt (?is.genome.available):

-

Users can also check the availability of proteomes in the UniProt database by specifying:

+

Users can also check the availability of proteomes in the UniProt database by specifying:

 # retrieve information from UniProt
 is.genome.available(db = "uniprot", "Homo sapiens", details = FALSE)
@@ -708,24 +708,24 @@

Downloading Biological Sequences and Annotations

-

After checking for the availability of sequence information for an organism of interest, the next step is to download the corresponding genome, proteome, CDS, or GFF file. The following functions allow users to download proteomes, genomes, CDS and GFF files from several database resources such as: NCBI RefSeq, NCBI Genbank, ENSEMBL. When a corresponding proteome, genome, CDS or GFF file was loaded to your hard-drive, a documentation *.txt file is generated storing File Name, Organism, Database, URL, DATE, assembly_accession, bioproject, biosample, taxid, version_status, release_type, seq_rel_date etc. information of the retrieved file. This way a better reproducibility of proteome, genome, CDS and GFF versions used for subsequent data analyses can be achieved.

+

After checking for the availability of sequence information for an organism of interest, the next step is to download the corresponding genome, proteome, CDS, or GFF file. The following functions allow users to download proteomes, genomes, CDS and GFF files from several database resources such as: NCBI RefSeq, NCBI Genbank, ENSEMBL. When a corresponding proteome, genome, CDS or GFF file was loaded to your hard-drive, a documentation *.txt file is generated storing File Name, Organism, Database, URL, DATE, assembly_accession, bioproject, biosample, taxid, version_status, release_type, seq_rel_date etc. information of the retrieved file. This way a better reproducibility of proteome, genome, CDS and GFF versions used for subsequent data analyses can be achieved.

Genome Retrieval

The easiest way to download a genome is to use the getGenome() function.

In this example we will download the genome of Homo sapiens.

-

The getGenome() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL databases from which corresponding genomes can be retrieved.

+

The getGenome() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL databases from which corresponding genomes can be retrieved.

The db argument specifies from which database genome assemblies in *.fasta file format shall be retrieved.

Options are:

Furthermore, users need to specify the scientific name, the taxid (= NCBI Taxnonomy identifier), or the accession identifier of the organism of interest for which a genome assembly shall be downloaded, e.g. organism = "Homo sapiens" or organism = "9606" or organism = "GCF_000001405.37". Finally, the path argument specifies the folder path in which the corresponding assembly shall be locally stored. In case users would like to store the genome file at a different location, they can specify the path = file.path("put","your","path","here") argument (e.g. file.path("_ncbi_downloads","genomes")).
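
The three identifier styles and the custom `path` described above can be sketched as follows. This is a minimal illustration, assuming the `biomartr` package is installed; the `run_download` switch and the variable names are hypothetical helpers for this sketch and are not part of `biomartr`.

```r
# Three ways to identify the same organism (values taken from the text above).
organism_ids <- c(
  scientific_name = "Homo sapiens",
  taxid           = "9606",
  accession       = "GCF_000001405.37"
)

# Storage location for the assembly; file.path() joins the components with "/".
download_dir <- file.path("_ncbi_downloads", "genomes")

# Genome downloads are large, so this sketch only runs them when enabled.
# `run_download` is a hypothetical guard for illustration.
run_download <- FALSE

if (run_download && requireNamespace("biomartr", quietly = TRUE)) {
  # Each call should resolve to the same organism and store the
  # assembly file in `download_dir`.
  genome_files <- lapply(organism_ids, function(id) {
    biomartr::getGenome(db = "refseq", organism = id, path = download_dir)
  })
}
```

Setting `run_download <- TRUE` triggers the actual retrieval; each element of `genome_files` then holds whatever `getGenome()` returns for that identifier.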

@@ -746,7 +746,7 @@

# and store the corresponding genome file in '_ncbi_downloads/genomes' HS.genome.refseq <- getGenome( db = "refseq", organism = "Homo sapiens")

-

Subsequently, users can use the read_genome() function to import the genome into the R session. Users can choose to work with the genome sequence in R either as Biostrings object (obj.type = "Biostrings") or data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_genome() function.

+

Subsequently, users can use the read_genome() function to import the genome into the R session. Users can choose to work with the genome sequence in R either as Biostrings object (obj.type = "Biostrings") or data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_genome() function.

 # import downloaded genome as Biostrings object
 Human_Genome <- read_genome(file = HS.genome.refseq)
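
As a complement to the Biostrings import, the data.table variant mentioned in the text could look like the sketch below. It assumes the same RefSeq download as in the example; the name Human_Genome_dt is hypothetical and chosen only for this illustration.

```r
# Sketch only: downloads the human genome assembly if it is not yet present.
library(biomartr)

# Retrieve the assembly; the return value is the path to the downloaded file.
HS.genome.refseq <- getGenome(db = "refseq", organism = "Homo sapiens")

# Import the same file as a data.table object instead of a Biostrings object
# by setting the obj.type argument described above.
Human_Genome_dt <- read_genome(file = HS.genome.refseq, obj.type = "data.table")

# Inspect the structure of the imported object.
str(Human_Genome_dt)
```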
@@ -797,7 +797,7 @@

  • rel_gc: The relative frequency of GCs (over all chromosomes, scaffolds, or contigs) compared to the total number of nucleotides in the genome assembly file

  • In summary, the getGenome() and read_genome() functions allow users to retrieve genome assemblies by specifying the scientific name of the organism of interest and to import the retrieved genome assembly, e.g. as a Biostrings object. Users can then use Biostrings notation to work with downloaded genomes and can rely on the log file generated by getGenome() to better document the source and version of genome assemblies used for subsequent studies.

    -

    Alternatively, users can perform the pipeline logic of the magrittr package:

    +

    Alternatively, users can perform the pipeline logic of the magrittr package:

     # install.packages("magrittr")
     library(magrittr)
    @@ -1092,21 +1092,21 @@ 

    Proteome Retrieval

    -

    The getProteome() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL, and UniProt databases from which corresponding proteomes can be retrieved. It works analogously to getGenome().

    +

    The getProteome() function is an interface function to the NCBI RefSeq, NCBI Genbank, ENSEMBL, and UniProt databases from which corresponding proteomes can be retrieved. It works analogously to getGenome().

    The db argument specifies from which database proteomes in *.fasta file format shall be retrieved.

    Options are:

    Furthermore, users again need to specify the scientific name of the organism of interest for which a proteome shall be downloaded, e.g. organism = "Homo sapiens". Finally, the path argument specifies the folder in which the corresponding proteome is stored locally. To store the proteome file at a different location, users can specify the path = file.path("put","your","path","here") argument.

    @@ -1120,7 +1120,7 @@

    HS.proteome.refseq <- getProteome( db = "refseq", organism = "Homo sapiens", path = file.path("_ncbi_downloads","proteomes"))

    -

    In this example, getProteome() creates a directory named '_ncbi_downloads/proteomes' into which the corresponding proteome file named GCF_000001405.34_GRCh38.p8_protein.faa.gz is downloaded. The return value of getProteome() is the path to the downloaded proteome file, which can then be used as input to the read_proteome() function. The variable HS.proteome.refseq stores this path. Subsequently, users can use the read_proteome() function to import the proteome into the R session. Users can choose to work with the proteome sequences in R either as a Biostrings object (obj.type = "Biostrings") or as a data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_proteome() function.

    +

    In this example, getProteome() creates a directory named '_ncbi_downloads/proteomes' into which the corresponding proteome file named GCF_000001405.34_GRCh38.p8_protein.faa.gz is downloaded. The return value of getProteome() is the path to the downloaded proteome file, which can then be used as input to the read_proteome() function. The variable HS.proteome.refseq stores this path. Subsequently, users can use the read_proteome() function to import the proteome into the R session. Users can choose to work with the proteome sequences in R either as a Biostrings object (obj.type = "Biostrings") or as a data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_proteome() function.

     # import proteome as Biostrings object
     Human_Proteome <- read_proteome(file = HS.proteome.refseq)
    @@ -1139,7 +1139,7 @@

    [113618] 603 MTMHTTMTTLTLT...FFPLILTLLLIT YP_003024036.1 NA...
    [113619] 174 MMYALFLLSVGLV...GVYIVIEIARGN YP_003024037.1 NA...
    [113620] 380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy...
    -

    Alternatively, users can perform the pipeline logic of the magrittr package:

    +

    Alternatively, users can perform the pipeline logic of the magrittr package:

     # install.packages("magrittr")
     library(magrittr)
    @@ -1224,7 +1224,7 @@ 

    Example Retrieval Uniprot:

    -

    Another way of retrieving proteome sequences is from the UniProt database.

    +

    Another way of retrieving proteome sequences is from the UniProt database.

     # download the proteome of Mus musculus from UniProt
     # and store the corresponding proteome file in '_uniprot_downloads/proteomes'
    @@ -1267,18 +1267,18 @@ 

    CDS Retrieval

    -

    The getCDS() function is an interface function to the NCBI RefSeq, NCBI Genbank, and ENSEMBL databases from which corresponding CDS files can be retrieved. It works analogously to getGenome() and getProteome().

    +

    The getCDS() function is an interface function to the NCBI RefSeq, NCBI Genbank, and ENSEMBL databases from which corresponding CDS files can be retrieved. It works analogously to getGenome() and getProteome().

    The db argument specifies from which database CDS files in *.fasta file format shall be retrieved.

    Options are:

    Furthermore, users again need to specify the scientific name of the organism of interest for which a CDS file shall be downloaded, e.g. organism = "Homo sapiens". Finally, the path argument specifies the folder in which the corresponding CDS file is stored locally. To store the CDS file at a different location, users can specify the path = file.path("put","your","path","here") argument.

    @@ -1292,7 +1292,7 @@

    HS.cds.refseq <- getCDS( db = "refseq", organism = "Homo sapiens", path = file.path("_ncbi_downloads","CDS"))

    -

    In this example, getCDS() creates a directory named '_ncbi_downloads/CDS' into which the corresponding CDS file named Homo_sapiens_cds_from_genomic_refseq.fna.gz is downloaded. The return value of getCDS() is the path to the downloaded CDS file, which can then be used as input to the read_cds() function. The variable HS.cds.refseq stores this path. Subsequently, users can use the read_cds() function to import the coding sequences into the R session. Users can choose to work with the coding sequences in R either as a Biostrings object (obj.type = "Biostrings") or as a data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_cds() function.

    +

    In this example, getCDS() creates a directory named '_ncbi_downloads/CDS' into which the corresponding CDS file named Homo_sapiens_cds_from_genomic_refseq.fna.gz is downloaded. The return value of getCDS() is the path to the downloaded CDS file, which can then be used as input to the read_cds() function. The variable HS.cds.refseq stores this path. Subsequently, users can use the read_cds() function to import the coding sequences into the R session. Users can choose to work with the coding sequences in R either as a Biostrings object (obj.type = "Biostrings") or as a data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_cds() function.

     # import downloaded CDS as Biostrings object
     Human_CDS <- read_cds(file     = HS.cds.refseq, 
    @@ -1332,7 +1332,7 @@ 

    seq_rel_date: 2016-09-26
    submitter: Genome Reference Consortium

    In summary, the getCDS() and read_cds() functions allow users to retrieve CDS files by specifying the scientific name of the organism of interest and to import the retrieved CDS file, e.g. as a Biostrings object. Users can then use Biostrings notation to work with downloaded CDS and can rely on the log file generated by getCDS() to better document the source and version of CDS used for subsequent studies.

    -

    Alternatively, users can perform the pipeline logic of the magrittr package:

    +

    Alternatively, users can perform the pipeline logic of the magrittr package:

     # install.packages("magrittr")
     library(magrittr)
    @@ -1435,18 +1435,18 @@ 

    RNA Retrieval

    -

    The getRNA() function is an interface function to the NCBI RefSeq, NCBI Genbank, and ENSEMBL databases from which corresponding RNA files can be retrieved. It works analogously to getGenome(), getProteome(), and getCDS().

    +

    The getRNA() function is an interface function to the NCBI RefSeq, NCBI Genbank, and ENSEMBL databases from which corresponding RNA files can be retrieved. It works analogously to getGenome(), getProteome(), and getCDS().

    The db argument specifies from which database RNA files in *.fasta file format shall be retrieved.

    Options are:

    Furthermore, users again need to specify the scientific name of the organism of interest for which an RNA file shall be downloaded, e.g. organism = "Homo sapiens". Finally, the path argument specifies the folder in which the corresponding RNA file is stored locally. To store the RNA file at a different location, users can specify the path = file.path("put","your","path","here") argument.

    @@ -1460,7 +1460,7 @@

    HS.rna.refseq <- getRNA( db = "refseq", organism = "Homo sapiens", path = file.path("_ncbi_downloads","RNA"))

    -

    In this example, getRNA() creates a directory named '_ncbi_downloads/RNA' into which the corresponding RNA file named Homo_sapiens_rna_from_genomic_refseq.fna.gz is downloaded. The return value of getRNA() is the path to the downloaded RNA file, which can then be used as input to the read_rna() function. The variable HS.rna.refseq stores this path. Subsequently, users can use the read_rna() function to import the RNA sequences into the R session. Users can choose to work with the RNA sequences in R either as a Biostrings object (obj.type = "Biostrings") or as a data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_rna() function.

    +

    In this example, getRNA() creates a directory named '_ncbi_downloads/RNA' into which the corresponding RNA file named Homo_sapiens_rna_from_genomic_refseq.fna.gz is downloaded. The return value of getRNA() is the path to the downloaded RNA file, which can then be used as input to the read_rna() function. The variable HS.rna.refseq stores this path. Subsequently, users can use the read_rna() function to import the RNA sequences into the R session. Users can choose to work with the RNA sequences in R either as a Biostrings object (obj.type = "Biostrings") or as a data.table object (obj.type = "data.table") by specifying the obj.type argument of the read_rna() function.

     # import downloaded RNA as Biostrings object
     Human_rna <- read_rna(file     = HS.rna.refseq, 
    @@ -1499,7 +1499,7 @@ 

    seq_rel_date: 2017-01-06
    submitter: Genome Reference Consortium

    In summary, the getRNA() and read_rna() functions allow users to retrieve RNA files by specifying the scientific name of the organism of interest and to import the retrieved RNA file, e.g. as a Biostrings object. Users can then use Biostrings notation to work with downloaded RNA and can rely on the log file generated by getRNA() to better document the source and version of RNA used for subsequent studies.

    -

    Alternatively, users can perform the pipeline logic of the magrittr package:

    +

    Alternatively, users can perform the pipeline logic of the magrittr package:

     # install.packages("magrittr")
     library(magrittr)
    @@ -1634,7 +1634,7 @@ 

    Removing corrupt lines from downloaded GFF files

    -

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    +

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    Example:

     Ath_path <- biomartr::getGFF(organism = "Arabidopsis thaliana", remove_annotation_outliers = TRUE)
    @@ -1684,7 +1684,7 @@

    Removing corrupt lines from downloaded GFF files

    -

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    +

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    Example:

     Ath_path <- biomartr::getGFF(db = "genbank",
    @@ -1738,7 +1738,7 @@ 

    Removing corrupt lines from downloaded GFF files

    -

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    +

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    Alternatively for getGTF():

     # download the GTF file of Homo sapiens from ENSEMBL
    @@ -1763,13 +1763,13 @@ 

    # retrieve these three species from NCBI RefSeq
    biomartr::getGFFSet("refseq", organisms = download_species, path = "set_gff")

    If the download process was interrupted, users can re-run the function and it will only download the missing files. If users wish to download everything again and update the existing files, they may specify the argument update = TRUE.
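
A forced refresh of the same set could be sketched as shown below. The species names in download_species are an assumption made for this illustration (the excerpt does not show them), and the update argument is used exactly as described in the paragraph above.

```r
# Sketch only: requires network access to NCBI RefSeq.
library(biomartr)

# Assumed example species; replace with the organisms of interest.
download_species <- c("Arabidopsis thaliana",
                      "Arabidopsis lyrata",
                      "Capsella rubella")

# First run (or a re-run after an interruption): only missing files are fetched.
biomartr::getGFFSet("refseq", organisms = download_species, path = "set_gff")

# Forced refresh: download everything again and update the existing files.
biomartr::getGFFSet("refseq", organisms = download_species, path = "set_gff",
                    update = TRUE)
```

Keeping the plain re-run as the default and reserving update = TRUE for deliberate refreshes avoids repeating large downloads after a dropped connection.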

    -

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    +

    In some cases, GFF files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the remove_annotation_outliers = TRUE argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable.

    Repeat Masker Retrieval

    -

    Repeat Masker is a tool for screening DNA sequences for interspersed repeats and low complexity DNA sequences. NCBI stores Repeat Masker annotation files for several species in its databases; these files can be retrieved using getRepeatMasker() and imported via read_rm().

    +

    Repeat Masker is a tool for screening DNA sequences for interspersed repeats and low complexity DNA sequences. NCBI stores Repeat Masker annotation files for several species in its databases; these files can be retrieved using getRepeatMasker() and imported via read_rm().

    Example NCBI RefSeq:
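
A minimal sketch of the retrieval and import steps named above, assuming Homo sapiens as the organism and a made-up output folder (rm_dir); neither choice comes from this excerpt, and the variable names are hypothetical.

```r
# Sketch only: requires network access to NCBI RefSeq.
library(biomartr)

# Hypothetical output folder for the Repeat Masker annotation file.
rm_dir <- file.path("refseq", "TEannotation")

# Download the Repeat Masker annotation file; the return value is used
# below as the file path to import.
Hsapiens_rm_path <- getRepeatMasker(db = "refseq",
                                    organism = "Homo sapiens",
                                    path = rm_dir)

# Import the downloaded annotation into the R session.
Hsapiens_rm <- read_rm(Hsapiens_rm_path)

# Look at the first rows of the imported annotation.
head(Hsapiens_rm)
```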

    diff --git a/docs/index.html b/docs/index.html index c628c1c8..227fd42a 100644 --- a/docs/index.html +++ b/docs/index.html @@ -104,6 +104,7 @@
    +

    @@ -134,10 +135,10 @@

  • NCBI Genbank
  • ENSEMBL
  • -ENSEMBLGENOMES (as of April 2019 - ENSEMBL and ENSEMBLGENOMES were joined - see details here)
  • -
  • UniProt
  • +ENSEMBLGENOMES (as of April 2019 - ENSEMBL and ENSEMBLGENOMES were joined - see details here) +
  • UniProt
  • -

    Furthermore, an interface to the Ensembl Biomart database allows users to retrieve functional annotation for genomic loci using a novel and organism centric search strategy. In addition, users can download entire databases such as

    +

    Furthermore, an interface to the Ensembl Biomart database allows users to retrieve functional annotation for genomic loci using a novel and organism centric search strategy. In addition, users can download entire databases such as

    • NCBI RefSeq
    • NCBI nr
    • @@ -150,7 +151,7 @@

      Similar Work

      -

      The main difference between the BiomaRt package and the biomartr package is that biomartr extends the functional annotation retrieval procedure of BiomaRt and in addition provides useful retrieval functions for genomes, proteomes, coding sequences, gff files, RNA sequences, Repeat Masker annotations files, and functions for the retrieval of entire databases such as NCBI nr etc.

      +

      The main difference between the BiomaRt package and the biomartr package is that biomartr extends the functional annotation retrieval procedure of BiomaRt and in addition provides useful retrieval functions for genomes, proteomes, coding sequences, gff files, RNA sequences, Repeat Masker annotations files, and functions for the retrieval of entire databases such as NCBI nr etc.

      Please consult the Tutorials section for more details.

      In the context of functional annotation retrieval the biomartr package allows users to screen available marts using only the scientific name of an organism of interest instead of first searching for marts and datasets which support a particular organism of interest (which is required when using the BiomaRt package). Furthermore, biomartr allows you to search for particular topics when searching for attributes and filters. I am aware that the similar naming of the packages is unfortunate, but it arose due to historical reasons (please find a detailed explanation here: https://github.com/ropensci/biomartr/blob/master/FAQs.md and here #11).

      I also dedicated an entire vignette to comparing the BiomaRt and biomartr package functionality in the context of Functional Annotation (where their functionality overlaps, which comprises only about 20% of the overall functionality of the biomartr package).

      @@ -183,7 +184,7 @@

      Installation with Bioconda

      -

      With an activated Bioconda channel (see 2. Set up channels), install with:

      +

      With an activated Bioconda channel (see 2. Set up channels), install with:

      conda install r-biomartr

      and update with:

      conda update r-biomartr
      @@ -230,13 +231,6 @@

      meta.retrieval(kingdom = "vertebrate_mammalian", db = "refseq", type = "genome")

    All genomes are stored in a folder named according to the kingdom, in this case vertebrate_mammalian. Alternatively, users can specify the out.folder argument to define a custom output folder path.

    -
    -

    -Platforms

    -
    -

    Find biomartr also at OmicTools.

    -
    -

    Frequently Asked Questions (FAQs)

    @@ -261,10 +255,10 @@

  • Functional Annotation
  • BioMart Examples
  • -

    Users can also read the tutorials within RStudio:

    +

    Users can also read the tutorials within RStudio:

     # source the biomartr package
    -library(biomartr)
    +library(biomartr)
     
     # look for all tutorials (vignettes) available in the biomartr package
     # this will open your web browser
    @@ -446,12 +440,12 @@ 

    devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
    # and then call it from the library
    -library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")

    +library("biomartr", lib.loc = "C:/Program Files/R/R-3.1.1/library")

    Troubleshooting on Windows Machines

      -
    • Install biomartr on a Win 8 laptop: solution ( Thanks to Andres Romanowski )
    • +
    • Install biomartr on a Win 8 laptop: solution ( Thanks to Andres Romanowski )

    @@ -498,12 +492,12 @@

    Developers

    Dev status

      -
    • +
    • Travis-CI Build Status
    • -
    • rstudio mirror downloads
    • -
    • rstudio mirror downloads
    • +
    • rstudio mirror downloads
    • +
    • rstudio mirror downloads
    • Paper link
    • -
    • install with bioconda
    • +
    • install with bioconda

    diff --git a/docs/pkgdown.yml b/docs/pkgdown.yml index a6ba9124..47529dc0 100644 --- a/docs/pkgdown.yml +++ b/docs/pkgdown.yml @@ -7,5 +7,5 @@ articles: Functional_Annotation: Functional_Annotation.html MetaGenome_Retrieval: MetaGenome_Retrieval.html Sequence_Retrieval: Sequence_Retrieval.html -last_built: 2022-02-22T12:17Z +last_built: 2022-02-22T15:20Z diff --git a/docs/reference/check_annotation_biomartr.html b/docs/reference/check_annotation_biomartr.html index 3531595f..28dd191e 100644 --- a/docs/reference/check_annotation_biomartr.html +++ b/docs/reference/check_annotation_biomartr.html @@ -157,7 +157,7 @@

    Arg annotation_file -

    a file path tp the annotation file.

    +

    a file path to the annotation file.

    remove_annotation_outliers diff --git a/docs/reference/getGenome.html b/docs/reference/getGenome.html index edd6234c..660ac5ca 100644 --- a/docs/reference/getGenome.html +++ b/docs/reference/getGenome.html @@ -165,8 +165,7 @@

    Genome Retrieval

    release = NULL,
    gunzip = FALSE,
    path = file.path("_ncbi_downloads", "genomes"),
    - assembly_type = "toplevel",
    - kingdom_assembly_summary_file = NULL
    + assembly_type = "toplevel"
    )

    Arguments

    diff --git a/man/check_annotation_biomartr.Rd b/man/check_annotation_biomartr.Rd index e6061fe1..986241fb 100644 --- a/man/check_annotation_biomartr.Rd +++ b/man/check_annotation_biomartr.Rd @@ -7,7 +7,7 @@ check_annotation_biomartr(annotation_file, remove_annotation_outliers = FALSE) } \arguments{ -\item{annotation_file}{a file path tp the annotation file.} +\item{annotation_file}{a file path to the annotation file.} \item{remove_annotation_outliers}{shall outlier lines be removed from the input \code{annotation_file}? If yes, then the initial \code{annotation_file} will be overwritten and the removed outlier lines will be stored at \code{\link{tempdir}} @@ -17,7 +17,7 @@ for further exploration.} Some annotation files include lines with character lengths greater than 65000. This causes problems when trying to import such annotation files into R using \code{\link[rtracklayer]{import}}. To overcome this issue, this function screens for such lines in a given annotation file and removes these lines so that -\code{\link[rtracklayer]{import}} can handle the file. +\code{import} can handle the file. } \examples{ \dontrun{ diff --git a/man/getGenome.Rd b/man/getGenome.Rd index 3a4b2516..fb3120be 100644 --- a/man/getGenome.Rd +++ b/man/getGenome.Rd @@ -11,8 +11,7 @@ getGenome( release = NULL, gunzip = FALSE, path = file.path("_ncbi_downloads", "genomes"), - assembly_type = "toplevel", - kingdom_assembly_summary_file = NULL + assembly_type = "toplevel" ) } \arguments{ diff --git a/vignettes/BioMart_Examples.Rmd b/vignettes/BioMart_Examples.Rmd index 5a1c2af6..102b050a 100644 --- a/vignettes/BioMart_Examples.Rmd +++ b/vignettes/BioMart_Examples.Rmd @@ -19,15 +19,15 @@ knitr::opts_chunk$set( ## Use Case #1: Functional Annotation of Genes Sharing a Common Evolutionary History Evolutionary Transcriptomics aims to predict stages or periods of evolutionary conservation in -biological processes on the transcriptome level. 
However, finding genes sharing a [common evolutionary history](https://github.com/HajkD/myTAI/blob/master/vignettes/Enrichment.Rmd) could reveal how the the biological process might have evolved in the first place. +biological processes on the transcriptome level. However, finding genes sharing a [common evolutionary history](https://github.com/drostlab/myTAI/blob/master/vignettes/Enrichment.Rmd) could reveal how the the biological process might have evolved in the first place. -In this `Use Case` we will combine functional and biological annotation obtained with `biomartr` with enriched genes obtained with [PlotEnrichment()](https://github.com/HajkD/myTAI/blob/master/vignettes/Enrichment.Rmd). +In this `Use Case` we will combine functional and biological annotation obtained with `biomartr` with enriched genes obtained with [PlotEnrichment()](https://github.com/drostlab/myTAI/blob/master/vignettes/Enrichment.Rmd). ### Step 1 -For the following example we will use the dataset an enrichment analyses found in [PlotEnrichment()](https://github.com/HajkD/myTAI/blob/master/vignettes/Enrichment.Rmd). +For the following example we will use the dataset an enrichment analyses found in [PlotEnrichment()](https://github.com/drostlab/myTAI/blob/master/vignettes/Enrichment.Rmd). -Install and load the [myTAI](https://github.com/HajkD/myTAI) package: +Install and load the [myTAI](https://github.com/drostlab/myTAI) package: ```{r, eval=FALSE} # install myTAI diff --git a/vignettes/Database_Retrieval.Rmd b/vignettes/Database_Retrieval.Rmd index eae5f984..e3825b8a 100644 --- a/vignettes/Database_Retrieval.Rmd +++ b/vignettes/Database_Retrieval.Rmd @@ -27,7 +27,7 @@ options(timeout = 300000) ## Getting Started NCBI stores a variety of specialized database such as [Genbank, RefSeq, Taxonomy, SNP, etc.](https://www.ncbi.nlm.nih.gov/guide/data-software/) on their servers. 
The `download.database()` and `download.database.all()` functions implemented in `biomartr` allows users to download these databases from NCBI. -This process might be very useful for downstream analyses such as sequence searches with e.g. BLAST. For this purpose see the R package [metablastr](https://github.com/HajkD/metablastr) which aims to seamlessly inegrate `biomartr` based genomic data retrieval with downsteam large-scale BLAST searches. +This process might be very useful for downstream analyses such as sequence searches with e.g. BLAST. For this purpose see the R package [metablastr](https://github.com/drostlab/metablastr) which aims to seamlessly integrate `biomartr` based genomic data retrieval with downstream large-scale BLAST searches. * [1. List available NCBI databases with `listNCBIDatabases()`](#ist-available-databases) * [2. Download NCBI databases with `download.database.all()`](#download-ncbi-databases) diff --git a/vignettes/Functional_Annotation.Rmd b/vignettes/Functional_Annotation.Rmd index 7f31a5e2..4ffc89be 100644 --- a/vignettes/Functional_Annotation.Rmd +++ b/vignettes/Functional_Annotation.Rmd @@ -27,24 +27,24 @@ options(timeout = 30000) ### Getting Started -The [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) database enables users to retrieve a vast diversity of annotation data +The `Ensembl Biomart` database enables users to retrieve a vast diversity of annotation data for specific organisms. Initially, Steffen Durinck and Wolfgang Huber provided a powerful interface between -the R language and [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) by implementing the R package [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html). +the R language and `Ensembl Biomart` by implementing the R package [biomaRt](https://www.bioconductor.org/packages/release/bioc/html/biomaRt.html). 
The purpose of the `biomaRt` package was to mimic the ENSEMBL BioMart database structure to construct queries that can be sent to the Application Programming Interface (API) of BioMart. Although, this procedure was very useful in the past, it seems not intuitive from an organism centric point of view. Usually, users wish to download functional annotation for a particular organism of interest. However, the BioMart and thus the `biomaRt` package require that users already know in which `mart` and `dataset` the organism of interest will be found which requires significant efforts of searching and screening. In addition, once the `mart` and `dataset` of a particular organism of interest were found and specified the user must again learn which `attribute` has to be specified to retrieve the functional annotation information of interest. The new functionality implemented in the `biomartr` package aims to overcome this -search bottleneck by extending the functionality of the [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package. The new `biomartr` package introduces a more organism cantered annotation retrieval concept which does not require to screen for `marts`, `datasets`, and `attributes` beforehand. With `biomartr` users only need to specify the `scientific name` of the organism of interest to then retrieve available `marts`, `datasets`, and `attributes` for the corresponding organism of interest. +search bottleneck by extending the functionality of the [biomaRt](https://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package. The new `biomartr` package introduces a more organism cantered annotation retrieval concept which does not require to screen for `marts`, `datasets`, and `attributes` beforehand. With `biomartr` users only need to specify the `scientific name` of the organism of interest to then retrieve available `marts`, `datasets`, and `attributes` for the corresponding organism of interest. 
This paradigm shift enables users to quickly construct queries to the BioMart database without having to learn the particular database structure and organization of BioMart. The following sections will introduce users to the functionality and data retrieval precedures of `biomartr` and will show how `biomartr` -extends the functionality of the initial [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package. +extends the functionality of the initial [biomaRt](https://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package. ### The old `biomaRt` query methodology -The best way to get started with the _old_ methodology presented by the established [biomaRt](http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package is to understand the workflow of its data retrieval process. The query logic of the `biomaRt` package derives from the database organization of [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) which stores a vast diversity of annotation data -for specific organisms. In detail, the [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) database is organized into so called: +The best way to get started with the _old_ methodology presented by the established [biomaRt](https://www.bioconductor.org/packages/release/bioc/html/biomaRt.html) package is to understand the workflow of its data retrieval process. The query logic of the `biomaRt` package derives from the database organization of `Ensembl Biomart` which stores a vast diversity of annotation data +for specific organisms. In detail, the `Ensembl Biomart` database is organized into so called: `marts`, `datasets`, and `attributes`. `Marts` denote a higher level category of functional annotation such as `SNP` (e.g. for functional annotation of particular single nucleotide polymorphisms (SNPs)) or `FUNCGEN` (e.g. for functional annotation of regulatory regions or relationsships of genes). 
`Datasets` denote the particular species of interest for which functional annotation is available __within__ this specific `mart`. It can happen that `datasets` (= particular species of interest) are available in one `mart` (= higher category of functional annotation) but not in an other `mart`. @@ -60,7 +60,7 @@ The availability of `marts`, `datasets`, and `attributes` can be checked by the ```{r,eval=FALSE} # install the biomaRt package -# source("http://bioconductor.org/biocLite.R") +# source("https://bioconductor.org/biocLite.R") # biocLite("biomaRt") # load biomaRt library(biomaRt) @@ -651,7 +651,7 @@ The `biomart()` function takes as arguments a set of genes (gene ids specified i The `biomartr` package also enables a fast and intuitive retrieval of GO terms and additional information via the `getGO()` function. Several databases can be selected to retrieve GO annotation information for a set of query genes. So far, the `getGO()` -function allows GO information retrieval from the [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart) database. +function allows GO information retrieval from the `Ensembl Biomart` database. In this example we will retrieve GO information for a set of _Homo sapiens_ genes stored as `hgnc_symbol`. @@ -675,5 +675,5 @@ GO_tbl ``` -Hence, for each _gene id_ the resulting table stores all annotated GO terms found in [Ensembl Biomart](http://ensemblgenomes.org/info/access/biomart). +Hence, for each _gene id_ the resulting table stores all annotated GO terms found in `Ensembl Biomart`. diff --git a/vignettes/MetaGenome_Retrieval.Rmd b/vignettes/MetaGenome_Retrieval.Rmd index 106519b5..05d3064a 100644 --- a/vignettes/MetaGenome_Retrieval.Rmd +++ b/vignettes/MetaGenome_Retrieval.Rmd @@ -69,7 +69,7 @@ computational and scientific reproducibility of the meta-genomics study at hand. 
The `meta.retrieval()` and `meta.retrieval.all()` functions aim to simplify the genome retrieval and computational reproducibility process for meta-genomics studies. Both functions allow users to either download genomes, proteomes, CDS, etc for species within a specific kingdom or subgroup of life (`meta.retrieval()`) or of all species of all kingdoms (`meta.retrieval.all()`). Before `biomartr` users had to write `shell` scripts to download respective genomic data. However, since many meta-genomics packages exist for the R programming language, I implemented this functionality for easy integration into existing R workflows and for easier reproducibility. -For example, the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package can be used with +For example, the pipeline logic of the [magrittr](https://github.com/tidyverse/magrittr) package can be used with `meta.retrieval()` and `meta.retrieval.all()` as follows. ```{r,eval=FALSE} @@ -468,7 +468,7 @@ of genome assembly quality on downstream analyses cannot be neglected. A rule of thumb is, that the larger the genome the more prone it is to genome assembly errors and therefore, a reduction of assembly quality. 
-In [Veeckman et al., 2016](http://www.plantcell.org/content/28/8/1759.short) the authors conclude: +In [Veeckman et al., 2016](https://doi.org/10.1105/tpc.16.00349) the authors conclude: > As yet, no uniform metrics or standards are in place to estimate the completeness of a genome assembly or > the annotated gene space, despite their importance for downstream analyses diff --git a/vignettes/Sequence_Retrieval.Rmd b/vignettes/Sequence_Retrieval.Rmd index b0e94bc0..86ee2312 100644 --- a/vignettes/Sequence_Retrieval.Rmd +++ b/vignettes/Sequence_Retrieval.Rmd @@ -423,7 +423,7 @@ on `accession` ids.__ ### Example `UniProt` (?is.genome.available): -Users can also check the availability of proteomes in the [UniProt database](http://www.uniprot.org) +Users can also check the availability of proteomes in the [UniProt database](https://www.uniprot.org) by specifying: @@ -783,7 +783,7 @@ listGroups(db = "genbank") After checking for the availability of sequence information for an organism of interest, the next step is to download the corresponding genome, proteome, CDS, or GFF file. The following functions allow users to download proteomes, genomes, CDS and GFF files from several -database resources such as: [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), [ENSEMBL](http://www.ensembl.org/index.html). When a corresponding proteome, genome, CDS or GFF file was +database resources such as: [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/), [ENSEMBL](https://www.ensembl.org/index.html). When a corresponding proteome, genome, CDS or GFF file was loaded to your hard-drive, a documentation `*.txt` file is generated storing `File Name`, `Organism`, `Database`, `URL`, `DATE`, `assembly_accession`, `bioproject`, `biosample`, `taxid`, `version_status`, `release_type`, `seq_rel_date` etc. information of the retrieved file. 
This way a better reproducibility of proteome, genome, CDS and GFF versions @@ -796,17 +796,17 @@ The easiest way to download a genome is to use the `getGenome()` function. In this example we will download the genome of `Homo sapiens`. -The `getGenome()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), -[ENSEMBL](http://www.ensembl.org/index.html) databases from +The `getGenome()` function is an interface function to the [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/), +[ENSEMBL](https://www.ensembl.org/index.html) databases from which corresponding genomes can be retrieved. The `db` argument specifies from which database genome assemblies in `*.fasta` file format shall be retrieved. Options are: -- `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) -- `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) -- `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) +- `db = "refseq"` for retrieval from [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) +- `db = "genbank"` for retrieval from [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/) +- `db = "ensembl"` for retrieval from [ENSEMBL](https://www.ensembl.org/index.html) Furthermore, users need to specify the `scientific name`, the `taxid` (= [NCBI Taxnonomy](https://www.ncbi.nlm.nih.gov/taxonomy) identifier), or the `accession identifier` of the organism of interest for which a genome assembly shall be downloaded, e.g. `organism = "Homo sapiens"` or `organism = "9606"` or `organism = "GCF_000001405.37"`. @@ -836,7 +836,7 @@ HS.genome.refseq <- getGenome( db = "refseq", ``` -Subsequently, users can use the `read_genome()` function to import the genome into the R session. 
Users can choose to work with the genome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object +Subsequently, users can use the `read_genome()` function to import the genome into the R session. Users can choose to work with the genome sequence in R either as [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_genome()` function. ```{r,eval=FALSE} @@ -921,7 +921,7 @@ the scientific name of the organism of interest and allow them to import the ret Thus, users can then perform the `Biostrings notation` to work with downloaded genomes and can rely on the log file generated by `getGenome()` to better document the source and version of genome assemblies used for subsequent studies. -Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: +Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/tidyverse/magrittr) package: ```{r,eval=FALSE} # install.packages("magrittr") @@ -1331,18 +1331,18 @@ specify the argument `update = TRUE`. 
### Proteome Retrieval -The `getProteome()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), -[ENSEMBL](http://www.ensembl.org/index.html), and [UniProt](http://uniprot.org/) databases from +The `getProteome()` function is an interface function to the [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/), +[ENSEMBL](https://www.ensembl.org/index.html), and [UniProt](https://www.uniprot.org/) databases from which corresponding proteomes can be retrieved. It works analogous to `getGenome()`. The `db` argument specifies from which database proteomes in `*.fasta` file format shall be retrieved. Options are: -- `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) -- `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) -- `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) -- `db = "uniprot" ` for retrieval from [UniProt](http://uniprot.org/) +- `db = "refseq"` for retrieval from [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) +- `db = "genbank"` for retrieval from [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/) +- `db = "ensembl"` for retrieval from [ENSEMBL](https://www.ensembl.org/index.html) +- `db = "uniprot" ` for retrieval from [UniProt](https://www.uniprot.org/) Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. `organism = "Homo sapiens"`. @@ -1361,7 +1361,7 @@ HS.proteome.refseq <- getProteome( db = "refseq", In this example, `getProteome()` creates a directory named `'_ncbi_downloads/proteomes'` into which the corresponding genome named `GCF_000001405.34_GRCh38.p8_protein.faa.gz` is downloaded. 
The return value of `getProteome()` is the folder path to the downloaded proteome file that can then be used as input to the `read_proteome()` function. The variable `HS.proteome.refseq` stores the path to the downloaded proteome. -Subsequently, users can use the `read_proteome()` function to import the proteome into the R session. Users can choose to work with the proteome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object +Subsequently, users can use the `read_proteome()` function to import the proteome into the R session. Users can choose to work with the proteome sequence in R either as [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_proteome()` function. @@ -1390,7 +1390,7 @@ A AAStringSet instance of length 113620 [113620] 380 MTPMRKTNPLMKL...ISLIENKMLKWA YP_003024038.1 cy... ``` -Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: +Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/tidyverse/magrittr) package: ```{r,eval=FALSE} # install.packages("magrittr") @@ -1497,7 +1497,7 @@ Users can also specify the `release` argument which denotes the database release ### Example Retrieval `Uniprot`: -Another way of retrieving proteome sequences is from the [UniProt](http://www.uniprot.org/) database. +Another way of retrieving proteome sequences is from the [UniProt](https://www.uniprot.org/) database. ```{r,eval=FALSE} # download the proteome of Mus musculus from UniProt @@ -1558,17 +1558,17 @@ specify the argument `update = TRUE`. 
### CDS Retrieval -The `getCDS()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), -[ENSEMBL](http://www.ensembl.org/index.html) databases from +The `getCDS()` function is an interface function to the [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/), +[ENSEMBL](https://www.ensembl.org/index.html) databases from which corresponding CDS files can be retrieved. It works analogous to `getGenome()` and `getProteome()`. The `db` argument specifies from which database proteomes in `*.fasta` file format shall be retrieved. Options are: -- `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) -- `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) -- `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) +- `db = "refseq"` for retrieval from [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) +- `db = "genbank"` for retrieval from [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/) +- `db = "ensembl"` for retrieval from [ENSEMBL](https://www.ensembl.org/index.html) Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. `organism = "Homo sapiens"`. @@ -1588,7 +1588,7 @@ HS.cds.refseq <- getCDS( db = "refseq", In this example, `getCDS()` creates a directory named `'_ncbi_downloads/CDS'` into which the corresponding genome named `Homo_sapiens_cds_from_genomic_refseq.fna.gz` is downloaded. The return value of `getCDS()` is the folder path to the downloaded genome file that can then be used as input to the `read_cds()` function. The variable `HS.cds.refseq` stores the path to the downloaded CDS file. -Subsequently, users can use the `read_cds()` function to import the genome into the R session. 
Users can choose to work with the genome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object +Subsequently, users can use the `read_cds()` function to import the genome into the R session. Users can choose to work with the genome sequence in R either as [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_cds()` function. ```{r,eval=FALSE} @@ -1646,7 +1646,7 @@ the scientific name of the organism of interest and allow them to import the ret Thus, users can then perform the `Biostrings notation` to work with downloaded CDS and can rely on the log file generated by `getCDS()` to better document the source and version of CDS used for subsequent studies. -Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: +Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/tidyverse/magrittr) package: ```{r,eval=FALSE} # install.packages("magrittr") @@ -1781,17 +1781,17 @@ specify the argument `update = TRUE`. ### RNA Retrieval -The `getRNA()` function is an interface function to the [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/), -[ENSEMBL](http://www.ensembl.org/index.html) databases from +The `getRNA()` function is an interface function to the [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/), [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/), +[ENSEMBL](https://www.ensembl.org/index.html) databases from which corresponding RNA files can be retrieved. It works analogous to `getGenome()`, `getProteome()`, and `getCDS()`. 
The `db` argument specifies from which database proteomes in `*.fasta` file format shall be retrieved. Options are: -- `db = "refseq"` for retrieval from [NCBI RefSeq](http://www.ncbi.nlm.nih.gov/refseq/about/) -- `db = "genbank"` for retrieval from [NCBI Genbank](http://www.ncbi.nlm.nih.gov/genbank/about/) -- `db = "ensembl"` for retrieval from [ENSEMBL](http://www.ensembl.org/index.html) +- `db = "refseq"` for retrieval from [NCBI RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) +- `db = "genbank"` for retrieval from [NCBI Genbank](https://www.ncbi.nlm.nih.gov/genbank/about/) +- `db = "ensembl"` for retrieval from [ENSEMBL](https://www.ensembl.org/index.html) Furthermore, again users need to specify the scientific name of the organism of interest for which a proteomes shall be downloaded, e.g. `organism = "Homo sapiens"`. Finally, the `path` argument specifies the folder path in which the corresponding RNA file shall be locally stored. In case users would like to store the RNA file at a different location, @@ -1810,7 +1810,7 @@ HS.rna.refseq <- getRNA( db = "refseq", In this example, `getRNA()` creates a directory named `'_ncbi_downloads/RNA'` into which the corresponding RNA file named `Homo_sapiens_rna_from_genomic_refseq.fna.gz` is downloaded. The return value of `getRNA()` is the folder path to the downloaded genome file that can then be used as input to the `read_rna()` function. The variable `HS.rna.refseq` stores the path to the downloaded RNA file. -Subsequently, users can use the `read_cds()` function to import the genome into the R session. Users can choose to work with the genome sequence in R either as [Biostrings](http://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object +Subsequently, users can use the `read_cds()` function to import the genome into the R session. 
Users can choose to work with the genome sequence in R either as [Biostrings](https://bioconductor.org/packages/release/bioc/html/Biostrings.html) object (`obj.type = "Biostrings"`) or [data.table](https://github.com/Rdatatable/data.table/wiki) object (`obj.type = "data.table"`) by specifying the `obj.type` argument of the `read_rna()` function. ```{r,eval=FALSE} @@ -1867,7 +1867,7 @@ the scientific name of the organism of interest and allow them to import the ret Thus, users can then perform the `Biostrings notation` to work with downloaded RNA and can rely on the log file generated by `getRNA()` to better document the source and version of RNA used for subsequent studies. -Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/smbache/magrittr) package: +Alternatively, users can perform the pipeline logic of the [magrittr](https://github.com/tidyverse/magrittr) package: ```{r,eval=FALSE} # install.packages("magrittr") @@ -2042,7 +2042,7 @@ Human_GFF #### Removing corrupt lines from downloaded GFF files -In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) +In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the `remove_annotation_outliers = TRUE` argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. @@ -2108,7 +2108,7 @@ Human_GFF #### Removing corrupt lines from downloaded GFF files -In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. 
This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) +In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the `remove_annotation_outliers = TRUE` argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. @@ -2179,7 +2179,7 @@ Users can also specify the `release` argument which denotes the database release #### Removing corrupt lines from downloaded GFF files -In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) +In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. To overcome this issue users can specify the `remove_annotation_outliers = TRUE` argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. @@ -2222,14 +2222,14 @@ and it will only download missing genomes. In cases users wish to download everything again and updating existing genomes, they may specify the argument `update = TRUE`. -In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This [leads to problems](https://github.com/lawremi/rtracklayer/issues/15) +In some cases, `GFF` files stored at NCBI databases include corrupt lines that have more than 65000 characters. This leads to problems when trying to import such annotation files into R. 
To overcome this issue users can specify the `remove_annotation_outliers = TRUE` argument to remove such outlier lines and overwrite the downloaded annotation file. This will make any downstream analysis with this annotation file much more reliable. ## Repeat Masker Retrieval -[Repeat Masker](http://www.repeatmasker.org) is a tool for screening DNA sequences for interspersed repeats and low complexity DNA sequences. +[Repeat Masker](https://www.repeatmasker.org) is a tool for screening DNA sequences for interspersed repeats and low complexity DNA sequences. NCBI stores the `Repeat Masker` for sevel species in their databases and can be retrieved using `getRepeatMasker()` and imported via `read_rm()`.