---
title: "NMDC_Test"
author: "npc"
date: "`r Sys.Date()`"
output:
  github_document: default
  html_document: default
knit: (function(inputFile, encoding) {
  rmarkdown::render(inputFile,
                    encoding = encoding,
                    output_format = "all")
  })
---
# README
This repo contains answers to pre-interview questions assigned for an NMDC position.
```{r, include = TRUE, echo = TRUE}
# load in some libraries and set the WD
suppressMessages(library(utils))
suppressMessages(library(yaml))
suppressMessages(library(jsonlite))
suppressMessages(library(httr))
setwd("~/Repos/NMDC_Test")
source(file = "R/Biosample_Metadata.R")
```
## Question 1:
*Write a script to download the metadata from one of NMDC’s biosample API endpoints for biosample id nmdc:bsm-11-q84vp418 and return the values for id, habitat, and gold_biosample_identifiers.*
A minimal R function can pull data semi-flexibly from the NMDC API endpoints. It relies on only two packages outside base R, `httr` and `jsonlite`. The function lives in the `R` directory; a script that loads it and prints the requested metadata in a readable format lives in the `Scripts` directory. The code chunk below shows the function in use.
```{r, include = TRUE, echo = TRUE}
res1 <- Biosample_Metadata(metadata1 = "id",
                           annotations = "habitat",
                           alternative_IDs = "GOLD",
                           ID = "nmdc:bsm-11-q84vp418")
print(res1)
```
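The contents of `R/Biosample_Metadata.R` are not reproduced in this README. As a rough sketch of the same idea, and not the function used above, the request and the field selection could be split into two small helpers. The names `fetch_biosample` and `pick_fields` are hypothetical, and the base URL `https://api.microbiomedata.org/biosamples` is an assumption about the NMDC API that should be checked against its documentation.

```r
# Hypothetical helper: fetch one biosample record as a parsed list.
# The base URL is an assumption, not taken from this repo.
fetch_biosample <- function(id,
                            base_url = "https://api.microbiomedata.org/biosamples") {
  # encode reserved characters such as ':' in the identifier
  url <- paste0(base_url, "/", utils::URLencode(id, reserved = TRUE))
  resp <- httr::GET(url)
  # stop with an informative error on a non-2xx response
  httr::stop_for_status(resp)
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Hypothetical helper: pull named fields out of a parsed record.
# Fields that are absent from the record come back as NA.
pick_fields <- function(record, fields) {
  out <- lapply(fields, function(f) {
    if (is.null(record[[f]])) NA else record[[f]]
  })
  names(out) <- fields
  out
}

# Offline illustration with a made-up record (no network call)
mock <- list(id = "nmdc:bsm-00-example", habitat = "soil", extra = 1)
str(pick_fields(mock, c("id", "habitat", "gold_biosample_identifiers")))
```

Keeping the network call separate from the field selection means the selection logic can be exercised without access to the API.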
## Question 2:
*Review the existing docker containers available on dockerhub for SPAdes, a common assembler for metagenomic datasets, and describe how you would determine which, if any, of the existing containers you would use versus writing one from scratch.*
SPAdes has included its `--meta` mode for several releases, so any container that successfully installs a recent version of SPAdes should be acceptable for assembly alone. Pre- and post-assembly tasks such as binning and quality control require other tools, though. How that work is divided among containers is largely a choice that depends on available resources and tool dependencies. A Dockerfile that builds a custom container with common tools for read QC, assembly, and binning is in the `Container01` directory of this repo.
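One concrete screening step is to compare the version a candidate container reports against a chosen minimum. The sketch below only parses and compares version strings; the helper names, the banner strings, and the minimum version `3.13.0` are all made up for illustration, and the exact text printed by `spades.py --version` may differ between releases.

```r
# Extract the first dotted version number from a version banner
parse_version <- function(banner) {
  m <- regmatches(banner, regexpr("[0-9]+(\\.[0-9]+)+", banner))
  if (length(m) == 0) stop("no version found in: ", banner)
  numeric_version(m)
}

# TRUE when the reported version meets a chosen minimum
meets_minimum <- function(banner, minimum) {
  parse_version(banner) >= numeric_version(minimum)
}

# Made-up banner strings; the minimum is an arbitrary illustrative choice
meets_minimum("SPAdes genome assembler v3.15.5", "3.13.0")
meets_minimum("SPAdes v3.9.0", "3.13.0")
```

`numeric_version` compares each component as a number, so `3.9.0` sorts before `3.13.0`, which a plain string comparison would get wrong.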
## Question 3:
*Using a workflow management tool of your choice, write a portable workflow which 1) runs the SPAdes test dataset (spades.py --test) and 2) returns the version of SPAdes used.*
A CWL workflow for testing SPAdes is constructed below. A Dockerfile that builds the container these files were tested in is in the `Container01` directory of this repo, and the container can also be pulled locally for testing with `docker pull npcooley/nmdctest01:1.0`. The generated `.cwl` files are deposited in the `Scripts` folder of this repo. In brief, a single CWL workflow invokes two CWL processes that run `spades.py --test` and `spades.py --version` and capture their stdout. With all three `.cwl` files in the home directory and `spades.py` on the `$PATH`, `cwltool combine.cwl` should run to completion and return two text files: `res1.txt`, containing the stdout from the SPAdes test set, and `res2.txt`, containing the version info for the SPAdes installation.
```{r, include = TRUE, echo = TRUE}
suppressMessages(library(yaml))
# tool that runs the SPAdes test dataset and captures stdout in res1.txt
TEST <- as.yaml(list("cwlVersion" = "v1.0",
                     "class" = "CommandLineTool",
                     "baseCommand" = c("spades.py", "--test"),
                     "stdout" = "res1.txt",
                     "inputs" = list(),
                     "outputs" = list("example_out" = list("type" = "stdout"))))
# remove any single quotes from the generated YAML
TEST <- gsub(pattern = "'",
             replacement = "",
             x = TEST)
# cat(TEST)
# tool that reports the SPAdes version and captures stdout in res2.txt
VER <- as.yaml(list("cwlVersion" = "v1.0",
                    "class" = "CommandLineTool",
                    "baseCommand" = c("spades.py", "--version"),
                    "stdout" = "res2.txt",
                    "inputs" = list(),
                    "outputs" = list("example_out" = list("type" = "stdout"))))
VER <- gsub(pattern = "'",
            replacement = "",
            x = VER)
# cat(VER)
# workflow that runs both tools and exposes their stdout files as outputs
COMBINE <- as.yaml(list("cwlVersion" = "v1.0",
                        "class" = "Workflow",
                        "inputs" = list(),
                        "outputs" = list("out" = list("type" = "File",
                                                      "outputSource" = "step1/example_out"),
                                         "res" = list("type" = "File",
                                                      "outputSource" = "step2/example_out")),
                        "steps" = list("step1" = list("run" = "test.cwl",
                                                      "in" = list(),
                                                      "out" = "[example_out]"),
                                       "step2" = list("run" = "version.cwl",
                                                      "in" = list(),
                                                      "out" = "[example_out]"))))
# remove single quotes so that [example_out] is written as a YAML list, not a quoted string
COMBINE <- gsub(pattern = "'",
                replacement = "",
                x = COMBINE)
# cat(COMBINE)
# write the three CWL documents to the Scripts directory
writeLines(TEST, "Scripts/test.cwl")
writeLines(VER, "Scripts/version.cwl")
writeLines(COMBINE, "Scripts/combine.cwl")
```
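Once the three files are written, the `cwltool combine.cwl` run described above could also be driven and checked from R. The following is a minimal sketch under stated assumptions: `missing_outputs` and `run_combine` are hypothetical helpers that are not part of this repo, and `run_combine` assumes both `cwltool` and `spades.py` are on the `PATH`. Only the file check is exercised here, in a scratch directory.

```r
# Report which expected output files are absent from a directory
missing_outputs <- function(dir, expected = c("res1.txt", "res2.txt")) {
  expected[!file.exists(file.path(dir, expected))]
}

# Hypothetical wrapper: run cwltool inside `dir` and stop if anything is missing.
# Assumes cwltool and spades.py are both available on the PATH.
run_combine <- function(dir = ".", workflow = "combine.cwl") {
  if (!nzchar(Sys.which("cwltool"))) stop("cwltool not found on PATH")
  old <- setwd(dir)
  on.exit(setwd(old))
  status <- system2("cwltool", workflow)
  if (status != 0) stop("cwltool exited with status ", status)
  gone <- missing_outputs(".")
  if (length(gone) > 0) stop("missing outputs: ", paste(gone, collapse = ", "))
  invisible(TRUE)
}

# Offline illustration of the file check in a scratch directory
scratch <- tempfile("cwl_check_")
dir.create(scratch)
file.create(file.path(scratch, "res1.txt"))
missing_outputs(scratch)
```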
## Question 4:
*Clone the nmdc-schema repository or use the UI documentation to list the required slots for Class OmicsProcessing.*
A brief R workflow for scraping the required slots for the `OmicsProcessing` class is included in the R code chunk below:
```{r, include = TRUE, echo = TRUE}
suppressMessages(library(rvest))
# scrape the page
res2 <- read_html("https://microbiomedata.github.io/nmdc-schema/OmicsProcessing/")
# grab the tables
res3 <- html_table(res2)
# slots are the first table, print the first column:
print(res3[[1]]$Name)
# find out the required slots:
temp1 <- tempfile()
res3 <- html_text(html_elements(res2, "#induced+ details"))
writeLines(res3, temp1)
res4 <- read_yaml(temp1)
# the YAML contains no 'required: FALSE' entries, so the presence of a 'required' key is enough
res5 <- res4$slot_usage
req_by_slot <- sapply(X = res5,
                      FUN = function(x) {
                        "required" %in% names(x)
                      })
req_by_slot <- names(req_by_slot)[req_by_slot]
# print the slot_usage entries that are marked required
print(req_by_slot)
res6 <- res4$attributes
req_by_attribute <- sapply(X = res6,
                           FUN = function(x) {
                             "required" %in% names(x)
                           })
req_by_attribute <- names(req_by_attribute)[req_by_attribute]
# print the attributes that are marked required
print(req_by_attribute)
```
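The chunk above treats the mere presence of a `required` key as meaning the slot is required, which holds only while no entry sets it to false. A slightly stricter variant, sketched below with a hypothetical helper `required_names` and made-up slot definitions, checks the value itself so that the result stays correct if the schema later adds explicit `required: false` entries.

```r
# Return the names of entries whose 'required' field is exactly TRUE
required_names <- function(entries) {
  flags <- vapply(entries,
                  function(x) isTRUE(x[["required"]]),
                  logical(1))
  names(flags)[flags]
}

# Made-up slot definitions for illustration only
mock_slots <- list(id = list(required = TRUE),
                   name = list(required = FALSE),
                   description = list(range = "string"))
required_names(mock_slots)
```

Using `[[` rather than `$` avoids partial name matching, and `isTRUE` handles entries where the key is absent.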
## Session info
```{r, include = TRUE, echo = FALSE}
# session info
sessionInfo()
# r version
version
```