0.0.53: accept floating point intervals in FNCLS
endrebak committed Feb 25, 2020
1 parent 7692e67 commit f9894b0
Showing 5 changed files with 41 additions and 79 deletions.
13 changes: 9 additions & 4 deletions README.md
@@ -17,6 +17,12 @@ it available to the Python community as a stand-alone library. Enjoy.
Original Paper: https://academic.oup.com/bioinformatics/article/23/11/1386/199545
Cite: http://dx.doi.org/10.1093/bioinformatics/btz615

## Cite

If you use this library in published research cite

http://dx.doi.org/10.1093/bioinformatics/btz615

## Install

```
@@ -102,6 +108,9 @@ intervals
# [(0, 100, 0), (1, 101, 1), (2, 102, 2), (3, 103, 3), (4, 104, 4)]
```

There is also an experimental floating point version of the NCLS called FNCLS.
See the examples folder.

## Benchmark

Test file of 100 million intervals (created by subsetting gencode gtf with replacement):
@@ -116,10 +125,6 @@ Test file of 100 million intervals (created by subsetting gencode gtf with replacement):
Building is 50 times faster and overlap queries are 20 times faster. Memory
usage is one fifth and one ninth.

## Cite

http://dx.doi.org/10.1093/bioinformatics/btz615

## Original paper

> Alexander V. Alekseyenko, Christopher J. Lee; Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Bioinformatics, Volume 23, Issue 11, 1 June 2007, Pages 1386–1393, https://doi.org/10.1093/bioinformatics/btl647
32 changes: 32 additions & 0 deletions examples/test_fncls.py
@@ -0,0 +1,32 @@
from ncls import FNCLS
import numpy as np
np.random.seed(0)

import pandas as pd
size = int(1e4)

starts = np.random.randint(0, high=int(1e6), size=size) + np.random.random()
ends = starts + np.random.randint(0, high=1000, size=size)
df = pd.DataFrame(data={"Start": starts, "End": ends})

starts = np.random.randint(0, high=int(1e6), size=size) + np.random.random()
ends = starts + np.random.randint(0, high=1000, size=size)
df2 = pd.DataFrame(data={"Start": starts, "End": ends})

print(df)
print(df2)

from time import time

start = time()
fncls = FNCLS(df.Start.values, df.End.values, df.index.values)
end = time()
print("Time:", end - start)
start = time()
qx, sx = fncls.all_overlaps_both(df2.Start.values, df2.End.values, df2.index.values)
end = time()
print("Time:", end - start)
df2.columns = df2.columns + "_b"
j = pd.concat([df.reindex(sx).reset_index(drop=True), df2.reindex(qx).reset_index(drop=True)], axis=1)

print(j.sort_values("Start"))
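
For small inputs, the index pairs that `all_overlaps_both` returns in the script above could be cross-checked against a brute-force scan. The following is a minimal sketch in pure Python and is not part of the ncls library: `brute_force_overlaps` is a hypothetical helper, and it assumes half-open intervals `[start, end)`, which may differ from how FNCLS treats endpoints.

```python
from typing import List, Sequence, Tuple


def brute_force_overlaps(
    starts: Sequence[float],
    ends: Sequence[float],
    query_starts: Sequence[float],
    query_ends: Sequence[float],
) -> List[Tuple[int, int]]:
    """Return (query_index, subject_index) pairs whose intervals overlap.

    Assumption (not confirmed by the commit): intervals are half-open,
    [start, end), so intervals that only touch at an endpoint do not overlap.
    """
    pairs = []
    for qi, (qs, qe) in enumerate(zip(query_starts, query_ends)):
        for si, (s, e) in enumerate(zip(starts, ends)):
            # Two half-open intervals overlap when each starts before the other ends.
            if qs < e and s < qe:
                pairs.append((qi, si))
    return pairs


# Small floating point example (made-up values for illustration).
subject_starts = [0.5, 10.25, 20.0]
subject_ends = [5.5, 15.75, 25.0]
query_starts = [5.0, 15.75]
query_ends = [11.0, 21.0]

pairs = brute_force_overlaps(subject_starts, subject_ends, query_starts, query_ends)
print(pairs)  # [(0, 0), (0, 1), (1, 2)]
```

The scan is quadratic in the number of intervals, so it is only practical on small samples. Comparing its output with FNCLS would also require putting both sets of pairs in the same order, for example by sorting them.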
30 changes: 0 additions & 30 deletions examples/test_multiprocessing.py

This file was deleted.

25 changes: 0 additions & 25 deletions examples/test_pickle.py

This file was deleted.

20 changes: 0 additions & 20 deletions examples/test_read_write_binaries.py

This file was deleted.
