mbarbini edited this page Aug 27, 2019 · 15 revisions

Table of Contents

  1. Dependencies
  2. How to source microbial data
  3. How to source metabolite data
  4. How to visualize data

Dependencies

Jupyter, RCurl, rjson, IRkernel, pheatmap, ggplot2, RColorBrewer, XML, foreach, parallel, doParallel, data.table, utils, rlist, crul, jsonlite, R.utils, rvest, colorspace, recommenderlab, RAM

Install the R packages using install.packages(c("RCurl", "rjson", "IRkernel", "pheatmap", "ggplot2", "RColorBrewer", "XML", "foreach", "doParallel", "data.table", "rlist", "crul", "jsonlite", "R.utils", "rvest", "colorspace", "recommenderlab", "RAM")). Package names must be passed as quoted strings. Jupyter is not an R package and must be installed separately; parallel and utils ship with R and do not need to be installed.
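To install only what is missing, something like the following minimal sketch may help. `missing_packages` is a hypothetical helper (not part of this repository), and the vector lists only the names from the dependency list that are installable R packages.

```r
# R packages from the dependency list (Jupyter is not an R package;
# parallel and utils ship with R).
required <- c("RCurl", "rjson", "IRkernel", "pheatmap", "ggplot2",
              "RColorBrewer", "XML", "foreach", "doParallel", "data.table",
              "rlist", "crul", "jsonlite", "R.utils", "rvest", "colorspace",
              "recommenderlab", "RAM")

# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from your CRAN mirror
}
```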

How to source microbial data

Files

BacDiveApiCrawler.R
BacMapCrawler.R
CleanProTrait.R
CombineData.R
CreateDataTable.R
ParseIJSEM.R
Utility.R

Functions


bacdive.crawler

Description

bacdive.crawler() retrieves information from the BacDive API, organizing it into a formatted table

Usage

bacdive.crawler(usrname, pass, num_requests = 10, save_file = TRUE)

Arguments

usrname the username for a verified BacDive account

pass the password for a corresponding BacDive account

num_requests the number of bacterial entries to asynchronously download

save_file if true, saves a .csv to the working directory containing the information extracted from the BacDive API

Details

Designed to traverse the API provided by BacDive. The BacDive API provides a database that can easily be queried, providing microbial physiology data in the JSON format. Each species has its own ‘page’, which details information such as taxonomy, morphology, strain information, and more. This script currently records only a selected set of traits, meaning that more data could be extracted if implemented.

Value

Returns a data.frame containing information extracted from BacDive

Warning

Because this traverses the site’s API, it is limited by internet speed and by the rate at which the site’s server responds, which can slow the script down. In addition, if num_requests is set too high, it may be too demanding for BacDive's server. Lastly, BacDive contains many more strains than species. All of these strains are collected even though this method was implemented to extract only species information.
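One way to picture the effect of num_requests is as a batch size: entries are grouped so that at most num_requests are requested at once. The sketch below only illustrates that grouping; `make_batches` is a hypothetical helper, the IDs are placeholders, and the crawler's actual scheduling may differ.

```r
# Hypothetical helper: split entry IDs into batches of at most `num_requests`.
make_batches <- function(ids, num_requests = 10) {
  stopifnot(num_requests >= 1)
  split(ids, ceiling(seq_along(ids) / num_requests))
}

ids <- 1:25                                   # placeholder IDs, not real BacDive IDs
batches <- make_batches(ids, num_requests = 10)
lengths(batches)                              # number of entries in each batch
```

A smaller num_requests yields more, smaller batches, which is gentler on the server at the cost of a longer total runtime.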

bacmap.crawler

Description

bacmap.crawler() scrapes the BacMap Database for microbial phenotypic information

Usage

bacmap.crawler(url = "http://bacmap.wishartlab.com/", num_requests = 10)

Arguments

url the page of the BacMap website from which to begin scraping. Must be a page containing a table of microbial entries. NOTE: The default homepage is the most stable starting point, and because the method is fast under 'normal' internet connections, changing the starting page will not have a notable effect on runtime.

num_requests the number of bacterial entries to asynchronously download

Details

Scrapes the BacMap website for microbial phenotype information according to the website's structure as of August 2019. Expects the starting page to contain a table from which all URLs of microbial phenotype pages can be determined. Expects every microbial phenotype page to also contain a table with the same phenotype entries. Condenses this information from every microbe into one table.

Value

Returns a data.frame containing information extracted from BacMap. Also saves a local copy of this table as a .csv
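The condensing step can be sketched without any scraping. The example assumes each phenotype page has already been read into a two-column field/value table; the field names, values, and the `to_row` helper are made up for illustration and are not taken from BacMap.

```r
# Made-up per-microbe tables standing in for scraped phenotype pages.
pages <- list(
  "Microbe A" = data.frame(field = c("gram", "shape"),
                           value = c("Negative", "Rod"),
                           stringsAsFactors = FALSE),
  "Microbe B" = data.frame(field = c("gram", "shape"),
                           value = c("Positive", "Coccus"),
                           stringsAsFactors = FALSE)
)

# Hypothetical helper: turn one field/value table into a one-row data.frame.
to_row <- function(tbl) {
  as.data.frame(as.list(setNames(tbl$value, tbl$field)),
                stringsAsFactors = FALSE)
}

# Stack the rows; this relies on every page having the same fields.
combined <- do.call(rbind, lapply(pages, to_row))
rownames(combined) <- names(pages)
combined
```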

clean.protrait

Description

clean.protrait() retrieves information from a file downloaded from the ProTrait Atlas, formatting it into a table

Usage

clean.protrait(save_file = TRUE)

Arguments

save_file if true, saves a .csv to the working directory containing the information extracted from the ProTrait Atlas

Details

Designed to extract information from a table created by ProTrait. The table lacks a format that generalizes traits, instead listing each type of trait (gram-positive, pathogenic in animals, aerobe, etc.) as its own column. Therefore, this script organizes the table into generalized traits, making it easy to use for purposes such as annotation. It will first check if the ProTrait file already exists in the working directory. If it does not, it will download the file to the working directory and start formatting it.

Value

Returns a data.frame containing information extracted from ProTrait
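The idea of generalizing traits can be sketched as follows. The column names, the 0/1 encoding, and the `generalize_traits` helper are assumptions made for illustration; the real ProTrait file and the script's grouping rules may differ.

```r
# Made-up table with one column per specific trait (1 = trait present).
protrait_like <- data.frame(
  organism      = c("Org1", "Org2"),
  gram_positive = c(1, 0),
  gram_negative = c(0, 1),
  aerobe        = c(1, 1),
  anaerobe      = c(0, 1),
  stringsAsFactors = FALSE
)

# Assumed grouping of specific trait columns into generalized traits.
trait_groups <- list(
  gram_stain         = c("gram_positive", "gram_negative"),
  oxygen_requirement = c("aerobe", "anaerobe")
)

# Hypothetical helper: one column per generalized trait, listing the
# specific traits that are present for each organism.
generalize_traits <- function(tbl, groups) {
  out <- data.frame(organism = tbl$organism, stringsAsFactors = FALSE)
  for (g in names(groups)) {
    cols <- groups[[g]]
    out[[g]] <- apply(tbl[, cols, drop = FALSE], 1, function(row) {
      hits <- cols[row == 1]
      if (length(hits) == 0) NA_character_ else paste(sort(hits), collapse = ", ")
    })
  }
  out
}

generalized <- generalize_traits(protrait_like, trait_groups)
generalized
```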

parse.ijsem

Description

parse.ijsem() is a method for parsing the International Journal of Systematic and Evolutionary Microbiology database, which contains phenotypic information about microbes

Usage

parse.ijsem()

Details

If a local copy of the raw IJSEM data table exists, this function will use it. Otherwise, it will download a copy and then begin parsing it

Value

Returns a data.frame containing information extracted from the metadata and saves a .csv locally
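The "use the local copy, otherwise download" behaviour can be sketched as below. `fetch_if_missing` is a hypothetical helper, the URL is a placeholder, and a stand-in downloader replaces the real download so the example runs offline.

```r
# Hypothetical helper: download `url` to `path` only if `path` does not exist.
# Returns TRUE if a download happened, FALSE if the local copy was reused.
fetch_if_missing <- function(path, url, downloader = utils::download.file) {
  if (file.exists(path)) {
    return(FALSE)
  }
  downloader(url, path)
  TRUE
}

# Stand-in downloader so that no network access is needed here.
fake_downloader <- function(url, destfile) writeLines("placeholder", destfile)

local_copy  <- tempfile(fileext = ".csv")
first_call  <- fetch_if_missing(local_copy, "https://example.org/data.csv", fake_downloader)
second_call <- fetch_if_missing(local_copy, "https://example.org/data.csv", fake_downloader)
```

The first call writes the file; the second finds it and skips the download.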

read.excel

Description

read.excel() is a method for reading an Excel spreadsheet and converting it into a data frame representation

Usage

read.excel(path)

Arguments

path the path to the desired excel spreadsheet

Details

This can be used to read an Excel spreadsheet containing additional user-supplied data about microbes. The result can then be passed to combine.data to merge it with the online databases and create a single, comprehensive data table.

Value

Returns a data frame representation

combine.data

Description

combine.data() combines the given tables into a single, formatted table

Usage

combine.data(data, save_file = TRUE)

Arguments

data a list of tables that will be combined

save_file if true, saves a .csv to the working directory containing the information resulting from the combined table

Details

CombineData.R is a script that merges given tables. This script is required because the column labels produced for each of these tables are different and there are different traits extracted in general. This script works to create one cohesive table. It also runs the following, additional methods for cleaning up the table: merging repetitive species entries, correcting traits that have synonyms, renaming nutrition requirements, and ordering each entry with multiple traits alphabetically.
In addition, this function requires an external file that contains the names of each column (each representing a type of trait such as oxygen requirement, gram stain, etc.). This defaults to an internal file. The supplied file must be a .csv with the following specifications: the first row must contain the desired column names of the final table, and the entries under each name are the column names of the data tables that are going to be merged. For example, if the tables to be combined both hold "Oxygen Requirement" information, one in a column called "Oxygen tolerance" and the other in a column called "Oxygen preference", create a column named "Oxygen Requirement" and list "Oxygen tolerance" and "Oxygen preference" as entries under it.

Value

Returns a data.frame containing information from the combined tables
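The column-name file described in the Details can be illustrated with a small sketch. The mapping content, the two input tables, and the `standardize_columns` helper are made up for the example; the merge shown here (rename, pad missing columns with NA, stack) is a simplification and not necessarily what CombineData.R does.

```r
# Mapping file content: first row = final column names, entries below =
# the source column names that should be renamed to them.
mapping_lines <- c("Oxygen Requirement,Gram Stain",
                   "Oxygen tolerance,Gram",
                   "Oxygen preference,gram stain")
mapping <- read.csv(text = mapping_lines, check.names = FALSE,
                    stringsAsFactors = FALSE)

# Hypothetical helper: rename any column listed under a final name.
standardize_columns <- function(tbl, mapping) {
  for (target in names(mapping)) {
    names(tbl)[names(tbl) %in% mapping[[target]]] <- target
  }
  tbl
}

table_a <- data.frame(Species = c("A", "B"),
                      "Oxygen tolerance" = c("aerobe", "anaerobe"),
                      check.names = FALSE, stringsAsFactors = FALSE)
table_b <- data.frame(Species = "C",
                      "Oxygen preference" = "aerobe",
                      Gram = "negative",
                      check.names = FALSE, stringsAsFactors = FALSE)

tables   <- lapply(list(table_a, table_b), standardize_columns, mapping = mapping)
all_cols <- unique(unlist(lapply(tables, names)))

# Pad each table with NA for columns it lacks, then stack the rows.
padded <- lapply(tables, function(tbl) {
  for (col in setdiff(all_cols, names(tbl))) tbl[[col]] <- NA
  tbl[all_cols]
})
combined <- do.call(rbind, padded)
combined
```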

How to source metabolite data

Files

ParseHMDB.R

Functions


parse.hmdb

Description

parse.hmdb() downloads all metabolite information from the HMDB website and parses through the XML file, converting it into a readable data table that is saved locally to a .csv.

Usage

parse.hmdb(file = 'hmdb_metabolites.xml', link = 'http://www.hmdb.ca/system/downloads/current/hmdb_metabolites.zip')

Arguments

file the name of the local file for the hmdb data table

link the web address from where to download the hmdb database

Details

Parses the HMDB data table, which is in the XML format, into a data frame containing metabolites as rows and traits as columns. In the original XML format, the information is nested and difficult to read. This method flattens out all the nested information and reformats it as a table. This method does not look for certain information, but keeps all categories found.

Value

Returns a data.frame containing information extracted from HMDB

Warning

This method currently does not keep multiple entries under the same trait category. For example, if the data has multiple entries for disease, it will only keep the first entry under 'disease'.
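The flattening, and the consequence of keeping only the first entry per category, can be sketched with a nested list standing in for one parsed XML record. The record content and the `flatten_entry` helper are made up for illustration; the real parser works on the HMDB XML and may name columns differently.

```r
# Hypothetical helper: flatten a nested list into a flat named list,
# keeping only the first value seen for each key.
flatten_entry <- function(node, prefix = "") {
  flat <- list()
  for (i in seq_along(node)) {
    key   <- paste0(prefix, names(node)[i])
    child <- node[[i]]
    if (is.list(child)) {
      nested <- flatten_entry(child, paste0(key, "."))
      for (k in names(nested)) {
        if (!(k %in% names(flat))) flat[[k]] <- nested[[k]]
      }
    } else if (!(key %in% names(flat))) {
      flat[[key]] <- child
    }
  }
  flat
}

# Made-up record with a nested category and a repeated 'disease' entry.
record <- list(
  accession = "EXAMPLE0001",
  name      = "Example metabolite",
  taxonomy  = list(kingdom = "Example kingdom"),
  disease   = "Disease A",
  disease   = "Disease B"
)

row <- as.data.frame(flatten_entry(record), stringsAsFactors = FALSE)
row
```

Only "Disease A" survives in the resulting row, which mirrors the limitation described in the warning.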

How to visualize data

Files

HeatMap.R

Functions


save.figure

Description

save.figure() is a method for saving ggplots to a png image

Usage

save.figure(figure, file_location = '', width = 5, height = 6)

Arguments

figure the ggplot to save

file_location a path directing where to save the figure

width the width of the saved image

height the height of the saved image

Value

Saves a png image to the given path

load.abundance.data

Description

load.abundance.data() is a method for loading an abundance table from a .csv file in the appropriate format for use with the heat map creating functions

Usage

load.abundance.data(path, column = 1)

Arguments

path the path from the working directory to the .csv file containing the abundance table

column the column number containing the feature names

Details

The abundance table needs to be loaded into R in such a way that the row names are the feature names, the sample names are the column names, and all its values are numerics.

Value

Returns a numerical matrix created from the abundance table
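The required layout (feature names as row names, sample names as column names, numeric values) can be sketched in base R. The CSV content and the `to_abundance_matrix` helper are made up for illustration and are not the function's actual implementation.

```r
# Made-up abundance table; the first column holds the feature names.
csv_lines <- c("feature,sample1,sample2",
               "Taxon A,10,0",
               "Taxon B,3,7")
raw <- read.csv(text = csv_lines, check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical helper: move the feature-name column into the row names
# and return the remaining columns as a numeric matrix.
to_abundance_matrix <- function(tbl, column = 1) {
  m <- as.matrix(tbl[, -column, drop = FALSE])
  storage.mode(m) <- "double"
  rownames(m) <- tbl[[column]]
  m
}

abundance <- to_abundance_matrix(raw, column = 1)
abundance
```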

load.meta.data

Description

load.meta.data() is a method for loading metadata in .csv files in the appropriate format for use with the heat map creating functions

Usage

load.meta.data(path, tax_column = 1)

Arguments

path the path from the working directory to the .csv file containing the metadata

tax_column the column number containing the taxonomical or sample (i.e. identifying) name for the metadata

Details

This can be used to load feature or sample metadata. Metadata needs to be loaded in such a way that the row names are the identifying names and the traits are the column names.

Value

Returns a data.frame containing information extracted from the metadata

Warning

This will eliminate all duplicate entries from the metadata without merging their data, resulting in potential data loss.
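The kind of data loss described here can be seen in a small sketch. It assumes the first occurrence of each identifying name is the one kept, which the documentation does not state; the metadata is made up.

```r
# Made-up metadata in which "Taxon A" appears twice with complementary data.
meta <- data.frame(
  taxon  = c("Taxon A", "Taxon A", "Taxon B"),
  oxygen = c("aerobe", NA, "anaerobe"),
  gram   = c(NA, "negative", "positive"),
  stringsAsFactors = FALSE
)

# Drop duplicates without merging (assumption: the first occurrence wins).
deduplicated <- meta[!duplicated(meta$taxon), ]
rownames(deduplicated) <- deduplicated$taxon
deduplicated$taxon <- NULL
deduplicated
```

The gram stain recorded in the second "Taxon A" row is lost, so it can be worth merging duplicate rows before loading.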

create.heatmap

Description

create.heatmap() creates a heat map based on relative abundance, with row and column dendrograms based on given metadata.

Usage

create.heatmap(data, sample_meta, feature_meta, percentile = 0.75, filter = '', show = FALSE, omit_na = TRUE, cluster_distance_method = 'euclidean')

Arguments

data abundance data in a numerical matrix

sample_meta a data.frame containing sample metadata

feature_meta a data.frame containing feature metadata

show if true, will display the graph upon completion

omit if true, will delete all microbial entries that are missing metadata

cluster_distance_method the distance method used to cluster samples and features (must be one of euclidean, maximum, manhattan, canberra, binary or minkowski)

unique_colors if true, will attempt to choose the most distinct colors; does not work well for continuous values

Details

The features need to be the rows of the abundance data.

Value

Returns a pheatmap with the following components: row hclusters, column hclusters, kmeans, and gtable
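The two ingredients named above, relative abundance and distance-based clustering, can be sketched with base R alone. This is not the function's implementation: the data is made up, `cluster_by` is a hypothetical helper, and the plotting itself (done with pheatmap) is left out.

```r
# Made-up abundance matrix: features in rows, samples in columns.
abundance <- matrix(c(10, 0, 5,
                       3, 7, 5,
                       1, 1, 8),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Taxon A", "Taxon B", "Taxon C"),
                                    c("s1", "s2", "s3")))

# Relative abundance: each sample (column) is scaled to sum to 1.
relative <- sweep(abundance, 2, colSums(abundance), "/")

# Hypothetical helper: hierarchical clustering of the rows of `m`
# using one of the distance methods accepted by stats::dist().
cluster_by <- function(m, method = "euclidean") {
  method <- match.arg(method, c("euclidean", "maximum", "manhattan",
                                "canberra", "binary", "minkowski"))
  hclust(dist(m, method = method))
}

feature_tree <- cluster_by(relative, "euclidean")     # clusters features
sample_tree  <- cluster_by(t(relative), "manhattan")  # clusters samples
```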

create.correlogram

Description

create.correlogram() creates a heat map based on the correlation of features given an abundance table and feature metadata.

Usage

create.correlogram(data, feature_meta, show = TRUE, omit = TRUE, cluster_distance_method = 'euclidean', unique_colors = TRUE)

Arguments

data abundance data in a numerical matrix

feature_meta a data.frame containing feature metadata

show if true, will display the graph upon completion

omit if true, will delete all microbial entries that are missing metadata

cluster_distance_method the distance method used to cluster samples and features (must be one of euclidean, maximum, manhattan, canberra, binary or minkowski)

unique_colors if true, will attempt to choose the most distinct colors; does not work well for continuous values

Details

The features need to be the rows of the abundance data.

Value

Returns a pheatmap with the following components: row hclusters, column hclusters, kmeans, and gtable
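The matrix behind a correlogram of features can be sketched as below. With features in rows, the table has to be transposed before calling cor(). The data is made up, and the use of cor()'s default Pearson correlation is an assumption, since the documentation does not say which correlation is used.

```r
# Made-up abundance matrix: features in rows, samples in columns.
abundance <- matrix(c(10, 2, 5, 8,
                       3, 7, 5, 1,
                       1, 1, 8, 4),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("Taxon A", "Taxon B", "Taxon C"),
                                    paste0("s", 1:4)))

# cor() correlates columns, so transpose to correlate features.
feature_cor <- cor(t(abundance))
round(feature_cor, 2)
```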

multi.correlogram

Description

multi.correlogram() creates multiple correlograms using multiple data sets

Usage

multi.correlogram(data_tables, sample_datas, omit = FALSE)

Arguments

data_tables a list of matrices (sample data like abundance tables, intensity values, etc.)

sample_datas a list of data frames with annotation data (microbial physiology data, metabolite physicochemical data, etc.)

omit whether to eliminate samples with no annotation data

Details

The features need to be the rows of the sample data.

Value

Returns a list of correlograms, one for each combination of sample data

one.v.all

Description

one.v.all() uses the create.heatmap function, but filters the metadata such that it labels only a single feature category and type, labeling all others as 'other'

Usage

one.v.all(data, sample_meta, feature_meta, which = 2, percentile = 0.75, show = FALSE, column, trait, cluster_distance_method = "euclidean")

Arguments

data abundance data in a numerical matrix

sample_meta a data.frame containing sample metadata

feature_meta a data.frame containing feature metadata

which a number representing whether to filter the sample(1) or feature(2) metadata

percentile a filter for displaying only entries with a threshold correlation

show if true, will display the graph upon completion

column the column number with the feature category

trait the specific feature type to use

cluster_distance_method the distance method used to cluster samples and features (must be one of euclidean, maximum, manhattan, canberra, binary or minkowski)

Details

Compares only one feature type against all others in a feature category (e.g. aerobic respiration vs. all other oxygen requirements). The features need to be the rows of the abundance data. Any number of feature categories can be supplied, but only one will be used.

Value

Returns a pheatmap with the following components: row hclusters, column hclusters, kmeans, and gtable
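The relabelling that gives this function its name can be sketched as follows. `label_one_vs_all` is a hypothetical helper and the metadata is made up; treating missing values as 'other' is an assumption.

```r
# Made-up feature metadata; row names are the feature names.
feature_meta <- data.frame(
  oxygen = c("aerobe", "anaerobe", "facultative", NA),
  row.names = c("Taxon A", "Taxon B", "Taxon C", "Taxon D"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep `trait` in the chosen column, label the rest 'other'.
label_one_vs_all <- function(meta, column, trait) {
  values <- as.character(meta[[column]])
  meta[[column]] <- ifelse(!is.na(values) & values == trait, trait, "other")
  meta
}

relabelled <- label_one_vs_all(feature_meta, column = 1, trait = "aerobe")
relabelled
```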

all.one.v.all

Description

all.one.v.all() uses the one.v.all function to create a heatmap for every feature type found

Usage

all.one.v.all(data, sample_meta, feature_meta, which = 2, percentile = 0.75, show = FALSE, column, directory = '', cluster_distance_method = "euclidean")

Arguments

data abundance data in a numerical matrix

sample_meta a data.frame containing sample metadata

feature_meta a data.frame containing feature metadata

which a number representing whether to filter the sample(1) or feature(2) metadata

percentile a filter for displaying only entries with a threshold correlation

show if true, will display the graph upon completion

column the column number with the feature category

directory the path from the working directory to where the file should be saved

cluster_distance_method the distance method used to cluster samples and features (must be one of euclidean, maximum, manhattan, canberra, binary or minkowski)

Details

Creates a heatmap for every feature type found (e.g. 3 forms of oxygen requirements). The features need to be the rows of the abundance data. Any number of feature categories can be supplied, but only one will be used. The files are named automatically based on the trait.
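The "one file per feature type" idea can be sketched as below. `trait_file_names` is a hypothetical helper, and both the .png extension and the naming scheme are assumptions made for illustration; the function's actual file names may differ.

```r
# Made-up feature metadata with one category column.
feature_meta <- data.frame(
  oxygen = c("aerobe", "anaerobe", "aerobe", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output path per distinct trait in `column`.
trait_file_names <- function(meta, column, directory = "") {
  values <- as.character(meta[[column]])
  traits <- sort(unique(values[!is.na(values)]))
  files  <- paste0(gsub("[^A-Za-z0-9]+", "_", traits), ".png")
  if (nzchar(directory)) files <- file.path(directory, files)
  setNames(files, traits)
}

files <- trait_file_names(feature_meta, column = 1, directory = "plots")
files
```

Each name in `files` is a trait that would get its own one-vs-all heatmap.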