0.0.53: accept floating point intervals in FNCLS

pyranges · Feb 25, 2020 · f9894b0
1 parent 7692e67
Showing 5 changed files with 41 additions and 79 deletions.
diff --git a/README.md b/README.md
@@ -17,6 +17,12 @@ it available to the Python community as a stand-alone library. Enjoy.
 Original Paper: https://academic.oup.com/bioinformatics/article/23/11/1386/199545
 Cite: http://dx.doi.org/10.1093/bioinformatics/btz615
 
+## Cite
+
+If you use this library in published research, please cite:
+
+http://dx.doi.org/10.1093/bioinformatics/btz615
+
 ## Install
 
 ```
@@ -102,6 +108,9 @@ intervals
 # [(0, 100, 0), (1, 101, 1), (2, 102, 2), (3, 103, 3), (4, 104, 4)]
 ```
 
+There is also an experimental floating-point version of NCLS, called FNCLS.
+See `examples/test_fncls.py` for a usage example.
+
 ## Benchmark
 
 Test file of 100 million intervals (created by subsetting gencode gtf with replacement):
@@ -116,10 +125,6 @@ Test file of 100 million intervals (created by subsetting gencode gtf with repla
 Building is 50 times faster and overlap queries are 20 times faster. Memory
 usage is one fifth and one ninth.
 
-## Cite
-
-http://dx.doi.org/10.1093/bioinformatics/btz615
-
 ## Original paper
 
 > Alexander V. Alekseyenko, Christopher J. Lee; Nested Containment List (NCList): a new algorithm for accelerating interval query of genome alignment and interval databases, Bioinformatics, Volume 23, Issue 11, 1 June 2007, Pages 1386–1393, https://doi.org/10.1093/bioinformatics/btl647
diff --git a/examples/test_fncls.py b/examples/test_fncls.py
@@ -0,0 +1,32 @@
+from ncls import FNCLS
+import numpy as np
+np.random.seed(0)
+
+import pandas as pd
+size = int(1e4)
+
+starts = np.random.randint(0, high=int(1e6), size=size) + np.random.random(size)  # per-interval fractional offset
+ends = starts + np.random.randint(0, high=1000, size=size)
+df = pd.DataFrame(data={"Start": starts, "End": ends})
+
+starts = np.random.randint(0, high=int(1e6), size=size) + np.random.random(size)  # per-interval fractional offset
+ends = starts + np.random.randint(0, high=1000, size=size)
+df2 = pd.DataFrame(data={"Start": starts, "End": ends})
+
+print(df)
+print(df2)
+
+from time import time
+
+start = time()
+fncls = FNCLS(df.Start.values, df.End.values, df.index.values)
+end = time()
+print("Time:", end - start)
+start = time()
+qx, sx = fncls.all_overlaps_both(df2.Start.values, df2.End.values, df2.index.values)
+end = time()
+print("Time:", end - start)
+df2.columns = df2.columns + "_b"
+j = pd.concat([df.reindex(sx).reset_index(drop=True), df2.reindex(qx).reset_index(drop=True)], axis=1)
+
+print(j.sort_values("Start"))
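
Review note: the example above prints the joined overlaps but does not check them. A minimal sketch of a brute-force reference for small inputs follows. It is plain Python and does not call `ncls`. `brute_force_overlaps` is a hypothetical helper, and the half-open `[start, end)` overlap rule is an assumption rather than something this commit states. The `(query_idx, subject_idx)` pairs roughly mirror the `qx, sx` arrays returned by `all_overlaps_both` in the example.

```python
def brute_force_overlaps(subjects, queries):
    """Return (query_idx, subject_idx) pairs for every overlapping pair.

    Assumption: intervals are half-open, [start, end), so two intervals
    overlap when each one starts strictly before the other ends.
    """
    pairs = []
    for qi, (q_start, q_end) in enumerate(queries):
        for si, (s_start, s_end) in enumerate(subjects):
            if q_start < s_end and s_start < q_end:
                pairs.append((qi, si))
    return pairs


# Floating-point coordinates, as FNCLS is meant to accept.
subjects = [(0.5, 10.25), (10.25, 20.0), (15.75, 30.5)]
queries = [(9.9, 10.3), (30.5, 40.0)]

pairs = brute_force_overlaps(subjects, queries)
print(pairs)
```

The quadratic loop is only practical for small inputs. It could be used to cross-check a sample of FNCLS results before trusting output on larger data.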
diff --git a/examples/test_multiprocessing.py b/examples/test_multiprocessing.py
diff --git a/examples/test_pickle.py b/examples/test_pickle.py
diff --git a/examples/test_read_write_binaries.py b/examples/test_read_write_binaries.py