Parallel k-d tree

Parallel implementation of k-d tree using MPI, OpenMP and C++17.

Summary

A k-d tree is a data structure which can be used to represent k-dimensional points in a convenient way. This kind of representation can be used to implement algorithms like K-Nearest Neighbors (KNN) very efficiently [2].

Below you find two possible representations for a k-d tree (k=3):

Surfaces	Branches	Textual

The procedure employed to build such a structure is the following:

Choose one axis;
Determine the median point of the dataset with respect to the axis;
For each one of the two halves, go back to 1.

Note: The median point is the data point which splits the dataset in two halves along the chosen axis.

The algorithms stops when the dataset contains only one point. Below you find the progression of the algorithm at increasing depths of the tree, in the two visualizations provided by our visualizer tool (surfaces and branches):

Depth 0	Depth 1	Depth 2

A k-d tree may be represented with a binary tree, whose nodes are defined approximately in the following way:

struct Node {
    // k-dimensional value
    double *data;

    Node *left;
    Node *right;
};

We assume that our data can be represented using a double value.

Parallelization

MPI

The strategy employed for the parallelization with MPI consists in delegating one of the two branches resulting from a split to one of the available processes waiting. At the beginning we start with only one process (which we refer to as 'rank0') processing the data. After the split (step 2) rank0 produces two branches. The right branch (data greater than the median along the chosen axis) is assigned to a new process ('rank1'), which then carries out the computation independently. The left branch is kept by rank0.

This same strategy is then employed recursively by rank0 and rank1, until the tree is completed.

OpenMP

The strategy is very similar to the one presented in MPI, we use tasks to distribute the work easily.

Compilation

Clone the repository and navigate to the folder src. A Makefile is available along with the following recipes:

Recipe	Result
`output`	Show only the output (i.e. the textual representation of the tree).
`file`	Save the k-d tree as a CSV file, each node is saved in order of increasing level starting from the root (left to right).
`time`	Show only the time it took to build the k-d tree.
`debug`	The binary produced will show both debug messages and output.
`leaks`	Find memory leaks in the source code. The executable does not produce any output.
`mpidebug`	Produces an executable that can be debugged using gdb (the rank of the process to be controlled via gdb must be set via the environment variable `MPI_DEBUG_RANK`).

There are also some optional parameters which are used to select additional features:

Parameter	Values	Explanation	Default
`src`	`mpi`/`omp`/`none`	Framework used for the parallelization (`none` produced a serial version of the code).	`omp`
`prec`	`double`	Precision used to represents the values in memory (`double` leads to wore performance, in particular with MPI).	`float`
`write`	`alternative`/`default`	Enable alternative writing to fix false sharing (see the PDF report for more details).	`default`
`size`	`big`/`default`	Uses the biggest possible data type for the indexes of arrays used inside the code (may lead to worse performance overall).	`default`

For instance, the following produces the executable tree_mpi.x which prints the time needed to build the tree using MPI, and represents data as double values in memory:

make time src=mpi prec=double

Usage

Run the executable tree_omp.x (or tree_mpi.x) generated in Compile with:

# OpenMP
./tree_omp.x

or

# MPI
mpirun -np ... tree_mpi.x

Specifying a dataset

By default the dataset used is the file benchmark/benchmark1.csv, but you can specify your own dataset via a command line argument. Valid datasets are CSV files where each data point has the same number of components (one data point per row).

For example, the following command builds a k-d tree on the dataset inside the file foo.csv in the current directory:

# here we use MPI and 10 processors
# foo.csv must exist and be well-formatted
mpirun -np 10 tree_mpi.x foo.csv

Saving results

If you compiled with the recipe make file (see Compile) you can save the result to a CSV file, which allows you to do some more things on your k-d tree (see Visualization). You need to provide a path where the result can be saved (filename, not directory).

For example, this saves the tree constructed on the dataset in data.csv and saves it in output_tree.csv in another directory:

# the folder ../results should exist
./tree_omp.x data.csv ../results/output_tree.csv

Visualization

After constructing a k-d tree you may want to visualize it in some graphical way. In the folder visualization we provide an experimental tool to do just that.

Suppose there is a k-d tree stored in a CSV file called output.csv in the folder results, which has been generated using this library. This is very important, since using a CSV file which does not represent a k-d tree results in very bad-looking results.

Navigate to the folder visualization, we visualize the directory tree for clarity:

└──kd-tree
    ├──benchmark
    ├──src
    ├──results
    │   └──output.csv            # <- k-d tree to be visualized
    └──visualization             # (*)
        ├──converter.py
        └──visualize_kd_tree.py

Open an interactive Python interpreter and write the following commands:

>>> from visualize_kd_tree import KDTreeVisualization
>>> from converter import csv_to_tree

# we obtain a Python representation of the root of the k-d tree
>>> root = csv_to_tree('../results/output.csv')

# show the tree in an interactive Matplotlib window
>>> KDTreeVisualization().visualize(root)

6 Data Points	10 Data Points

There are some configurations available:

# save the image in 'output_figure.png'
# the figure is 30x30 (default is 20x20) and 200 dpi (default is 100)
>>> KDTreeVisualization().visualize(
...     root, figsize=(30, 30), dpi=200, filename="output_figure.png"
... )

# - decrease the opacity of the surface drawn to represent a split (default is
#   plane_alpha=0.4);
# - decrease the width of lines parallel to a split surface
#   (default is intersection_lines_width=3);
# - increase the size of split points in the 3D volume (default is
#   point_size=75)
>>> v = KDTreeVisualization(
...     plane_alpha=0.2, intersection_lines_width=1, point_size=100
... )
>>> v.visualize(root)

Roadmap

References

Friedman et al. 1977, "An algorithm for finding best matches in logarithmic expected time."
Wikipedia

Name		Name	Last commit message	Last commit date
Latest commit History 259 Commits
benchmark		benchmark
res		res
src		src
visualization		visualization
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
final_report.pdf		final_report.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parallel k-d tree

Summary

Parallelization

MPI

OpenMP

Compilation

Usage

Specifying a dataset

Saving results

Visualization

Roadmap

References

About

Languages

License

fandreuz/parallel-kd-tree

Folders and files

Latest commit

History

Repository files navigation

Parallel k-d tree

Summary

Parallelization

MPI

OpenMP

Compilation

Usage

Specifying a dataset

Saving results

Visualization

Roadmap

References

About

Topics

Resources

License

Stars

Watchers

Forks

Languages