Numba Neighbors

Approximate port of scikit-learn neighbors using Numba.

Installation

If you want to install/modify (recommented at this point):

git clone https://github.com/jackd/numba-neighbors.git
pip install -e numba-neighbors

Quick-start:

pip install git+git://github.com/jackd/numba-neighbors.git

You may see performance benefits from fastmath by installing Intel's short vector math library (SVML).

conda install -c numba icc_rt

Benchmarks

Requires scikit-learn >= 0.24.2

python benchmarks/kd_tree/query_radius.py

sklearn: sklearn.neighbors._kd_tree.KDTree implementation
numba: numba_neighbors.kd_tree.KDTree implementation
numba3: numba_neighbors.kd_tree.KDTree3 implementation
bu: backup implementations (i.e. takes advantage of known leaf indices of query points)
pre: pre-allocated memory

Burning in sklearn...
Benchmarking sklearn...
Burning in numba_pre...
Benchmarking numba_pre...
Burning in numba...
Benchmarking numba...
Burning in numba_bu...
Benchmarking numba_bu...
Burning in numba_bu_pre...
Benchmarking numba_bu_pre...
Burning in numba3...
Benchmarking numba3...
Burning in numba3_bu...
Benchmarking numba3_bu...
Burning in numba3_bu_pre...
Benchmarking numba3_bu_pre...
numba3_bu           : 0.0001566232 (1.00)
numba3_bu_pre       : 0.0001664740 (1.06)
numba3              : 0.0001693816 (1.08)
numba_bu            : 0.0001695879 (1.08)
numba_bu_pre        : 0.0001740832 (1.11)
numba_pre           : 0.0001744671 (1.11)
numba               : 0.0001828313 (1.17)
sklearn             : 0.0008870633 (5.66)

Debugging

Debugging is often simpler without jitting. To disable numba,

export NUMBA_DISABLE_JIT=1

and re-enable with

export NUMBA_DISABLE_JIT=0

Be wary of using os.environ["NUMBA_DISABLE_JIT"] = "1" from python code - this must be set above imports.

Differences compared to Scikit-learn

All operations are done using reduced distances. E.g. provided KDTree implementations use squared distances rather than actual distances both for inputs and outputs.
query_radius-like functions must specify a maximum number of neighbors. Over-estimating this is fairly cheap - it just means we allocate more data than necessary - but if the limit is reached the first max_count neighbors that are found are returned. These aren't necessarily the closest max_count neighbors.
Query outputs aren't sorted, though can be using binary_tree.simultaneous_sort_partial.
Use of Intel's short vector math library (SVML) if instaled. This makes computation faster, but may result in very small errors.

TODO

Note: this package has served it's purpose for what I wrote it for. I'm unlikely to get to these unless there's serious interest and I get the spare time.

better benchmarks with google-benchmark
n_nodes is inconsistent with scikit-learn implementation... - scikit bug?
Port NeighborsHeap from scikit learn.
query implementations.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
benchmarks		benchmarks
example		example
numba_neighbors		numba_neighbors
.gitignore		.gitignore
.isort.cfg		.isort.cfg
.pre-commit-config.yaml		.pre-commit-config.yaml
.pylintrc		.pylintrc
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Numba Neighbors

Installation

Benchmarks

Debugging

Differences compared to Scikit-learn

TODO

About

Releases

Packages

Contributors 2

Languages

License

jackd/numba-neighbors

Folders and files

Latest commit

History

Repository files navigation

Numba Neighbors

Installation

Benchmarks

Debugging

Differences compared to Scikit-learn

TODO

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages