-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathREADME.Rmd
178 lines (116 loc) · 9.67 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# bedtorch
<!-- badges: start -->
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](https://cran.r-project.org/web/licenses/MIT) [![R-CMD-check](https://github.com/haizi-zh/bedtorch/workflows/R-CMD-check/badge.svg)](https://github.com/haizi-zh/bedtorch/actions) [![Codecov test coverage](https://codecov.io/gh/haizi-zh/bedtorch/branch/main/graph/badge.svg)](https://github.com/haizi-zh/bedtorch/actions) [![Lifecycle: experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](https://lifecycle.r-lib.org/articles/stages.html#experimental) [![CodeFactor](https://www.codefactor.io/repository/github/haizi-zh/bedtorch/badge)](https://www.codefactor.io/repository/github/haizi-zh/bedtorch)
<!-- badges: end -->
*For full documentation, refer to the [package documentation site](https://haizi-zh.github.io/bedtorch/).*
## Motivation
The goal of bedtorch is to provide a fast BED file manipulation tool suite native in R.
In bioinformatics studies, an important type of jobs is related to BED and BED-like file manipulation. To name a few example:
- Filtering cfDNA fragments based on length
- Identifying genomic regions overlapping with certain genes
- Filtering features based on epigenetic profiles such as histone modification levels
- Extracting SNVs in a certain genomic interval
Many of such tasks can be done by some highly-optimized tools, such as [bedtools](https://bedtools.readthedocs.io/en/latest/) and [bedops](https://bedops.readthedocs.io/en/latest/). However, these are binary tools which only works in shell. Sometimes, especially for developers using R, it's desirable to invoke these tools in R codes.
One solution is using functions that invokes system commands ([`system2`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/system2) or [`system`](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/system)). [`bedr`](https://cran.r-project.org/web/packages/bedr/vignettes/Using-bedr.html), a popular package, is one good example. It's a bridge linking between R codes and shell tools. Under the hood, it writes data into the temporary directory, calls certain command line tools such as `tabix`, `bedtools`, or `bedops`, parses the output and re-generate a data frame in R space.
Generally this approach is very helpful. However, it's worth noting some disadvantages.
- This workflow requires lots of disk IO, since data frames in R needs to be written to temporary directories before being fed to command line tools.
- Command line output is captured and transferred back to R s space. The data transfer is done by stdin/stdout pipes. For R, reading large amount of text content from stdin can be extremely slow. As a result, most of time when large dataset is involved, `bedr` is hardly usable.
- `bedr` captures command line output as plain text. Therefore, when the data is transferred back to R, most of the information about data types, index (in the case of `data.table`), column names, etc., are lost. It's the package user's responsibility to make sure data types are correct, column names are restored, column indices have been rebuilt (in the case of `data.table`, which can be very time consuming). Often this job is tedious, error-prone and introduces very large overheads.
- It requires the command line tools to be present in `PATH`. This depends on the end user's system environment and the R developer can do nothing about it. It puts additional burden to deployment, especially when the end user is not an IT professional.
Another approach is using native R data types to represent BED-like data. R's native `data.frame` seems to be a good model, but in many scenarios, it's performance is poor. More importantly, it lacks many advanced features compared with `GenomicRanges`, `dplyr` and `data.table`.
`GenomicRanges` is widely used in bioinformatics community. Although it provides essential building blocks for many tasks, such as `findOverlaps`, `reduce`, etc., it doesn't provide high-level features as `bedtools` does. Using `findOverlaps`, `reduce`, and other functions provided by `GenomicRanges`, users may implement operations such as map, merge, unique intersect, but this still requires lots of coding.
It would be desirable to have an R package, which represent BED-like data by `GenomicRanges` or `data.table` objects, and provide native support for common BED manipulations, such as merge, intersect, filter, etc.
## Details
`bedtorch` is an attempt for this. Under the hood, every BED dataset is represented as a `GenomicRanges` object (it also can represent the dataset as `data.table`, if preferred. But this is not recommended). Users can apply various operations on it, such as intersect, merge, etc.
In terms of the core computation, `bedtorch`'s performance is comparable to `bedtools`. However, since no disk IO and data conversion is needed, users can directly manipulate the dataset in memory, therefore in practice, via `bedtorch`, many tasks can be done about one magnitude faster.
Additionally, by linking to [htslib](https://github.com/samtools/htslib), `bedtorch` can write data frames directly to BGZIP-format files, and optionally create the tabix index. It can also directly load either local or remote BGZIP-format files at any particular genomic regions. This feature is done by htslib, therefore not requiring external command line tools such as `tabix` or `bgzip`. In the case of working with remote files, being capable of loading any arbitrary portion of the file will be very convenient.
[BGZIP](http://www.htslib.org/doc/bgzip.html) is a variant of gzip and is widely used in bioinformatics. It's compatible with gzip, with an important additional feature: allows a compressed file to be indexed for fast query. However, R is unaware of BGZIP. It considers all `.gz` files as gzip-compressed, thus doesn't take advantage of BED index, and cannot output BGZIP files.
For more details, refer to the [package reference page](https://haizi-zh.github.io/bedtorch/reference/index.html).
### Installation
```{r, eval=FALSE}
# install.packages("devtools")
devtools::install_github("haizi-zh/bedtorch")
```
### Example
Here are several basic examples which shows you how to solve a common problem:
Read a BED file from disk:
```{r example}
library(bedtorch)
## Load BED data
file_path <-
system.file("extdata", "example2.bed.gz", package = "bedtorch")
bedtbl <- read_bed(file_path, range = "1:3001-4000")
bedtbl
```
Read a remote BGZIP BED file for a certain region:
```{r}
# Load remote BGZIP files with tabix index specified
# Here we need to explicitly indicate `compression` as `bgzip` since we are
# using short URL for `tabix_index`, so that the function cannot guess
# compression type using the URL
read_bed(
"https://git.io/JYATB",
range = "22:20000001-30000001",
tabix_index = "https://git.io/JYAkT",
compression = "bgzip"
)
```
Write the previous data table to the temporary directory, and create the tabix index alongside:
```{r}
write_bed(bedtbl,
file_path = tempfile(fileext = ".bed.gz"),
tabix_index = TRUE)
```
Merge intervals, and take the mean of score in each merged group:
[<img src="https://bedtools.readthedocs.io/en/latest/_images/merge-glyph.png" alt="Image by bedtools" width="64%"/>](https://bedtools.readthedocs.io/en/latest/)
*Image by [bedtools](https://bedtools.readthedocs.io/en/latest/)*
```{r}
operation = list(mean_score = list(on = "score1", func = mean))
merged <- merge_bed(bedtbl, operation = operation)
head(merged)
```
Find intersections between two datasets:
[<img src="https://bedtools.readthedocs.io/en/latest/_images/intersect-glyph.png" alt="Image by bedtools" width="64%"/>](https://bedtools.readthedocs.io/en/latest/)
*Image by [bedtools](https://bedtools.readthedocs.io/en/latest/)*
```{r}
file_path1 <- system.file("extdata", "example_merge.bed", package = "bedtorch")
file_path2 <- system.file("extdata", "example_intersect_y.bed", package = "bedtorch")
tbl_x <- read_bed(file_path1, genome = "hs37-1kg")
tbl_y <- read_bed(file_path2, genome = "hs37-1kg")
intersect_bed(tbl_x, tbl_y)
```
Shuffle a BED data table across the genome:
[<img src="https://bedtools.readthedocs.io/en/latest/_images/shuffle-glyph.png" alt="Image by bedtools" width="64%"/>](https://bedtools.readthedocs.io/en/latest/)
*Image by [bedtools](https://bedtools.readthedocs.io/en/latest/)*
```{r}
shuffle_bed(tbl_x)
```
Calculate Jaccard statistics between the two BED data tables:
[<img src="https://bedtools.readthedocs.io/en/latest/_images/jaccard-glyph.png" alt="Image by bedtools" width="64%"/>](https://bedtools.readthedocs.io/en/latest/)
*Image by [bedtools](https://bedtools.readthedocs.io/en/latest/)*
```{r}
jaccard_bed(tbl_x, tbl_y)
```
### Performance
The following is the result of a simple benchmark for three common BED manipulations: merge, intersect and map.
Benchmark platform information:
- OS: macOS Big Sur 11.1
- CPU: 2.7 GHz Quad-Core Intel Core i7
- Memory: 16 GB LPDDR3
<img src="https://raw.githubusercontent.com/haizi-zh/bedtorch/main/data-raw/benchmark.png" width="64%/"/>
From the benchmark result, we can see that for all three tasks, `bedtorch` uses much less time than `bedtools`.
Note: this does not mean `bedtools` is indeed slower than `bedtorch` at performing the actual computation. Don't forget `bedtools` requires large amount of disk IO, while `bedtorch` does not.
## See also
Many features are inspired by bedtools. Thus, it's helpful to get familiar with bedtool's documentation: <https://bedtools.readthedocs.io/en/latest/index.html>