GTS: GPU-based Tree Index for Fast Similarity Search

Introduction

GTS is a GPU-based tree index designed for parallel processing of concurrent similarity search in general metric spaces, where only the distance metric for measuring object similarity is known. The GTS index utilizes a pivot based-tree structure for efficient object pruning and employs list tables to facilitate GPU computing. To efficiently handle concurrent similarity queries with limited GPU memory, a two-stage search method that combines the batch processing and sequential strategies is developed to control memory usage. We also introduces an effective update strategy for GTS, covering streaming data updates and batch data updates.

Development

We implement our index using C++ and NVIDIA CUDA (a platform and programming model for parallel computing on GPUs). We use CMake to compile and build the project.

The code for this project is located in the "GTS" directory. The files in the "GTS" directory are organized as follows.

include: It encompasses all the functionalities, including file reading (file.cuh), GTS construction (tree.cuh), similarity search (search.cuh), and updates (update.cuh and search_naive.cuh).
src: It encompasses the main function file that constructs a GTS index based on input data and utilizes GTS for concurrent metric range queries and concurrent metric $k$NN queries.
CMakeLists.txt

The steps for building the project are as follows:

cd GTS
mkdir bin
mkdir build
cd build 
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j

The compiled executable file "gpu_tree" is located in the "GTS/bin" directory.

Examples

Index Construction and Similarity Search

Build GTS index based on the input data, and utilize GTS for similarity search.

GTS/bin/GTS [data_path] [query_path] [operation_type] [operation_para] [result_path]

The meanings of the parameters are as follows.

data_path, The file contains all the data, and the first line for datasets denotes the dimensionality, the size and the used distance metric.
query_path, The file contains the IDs for the quries and the first line denotes the number of concurrent queries.
operation_type, Type of operation, with 0 for $k$NN queries, 1 for range queries and 2 for updates.
operation_para, Parameter for the corresponding operation, representing '$k$' in $k$NN queries or the radius in range queries.
result_path, Result file path.

Here is a specific example for GTS index construction and similarity search.

cd GTS
bin/GTS ../Datasets/sbw_example.txt ../Datasets/sbw_example_qid.txt 0 8 cost_q_k.txt

Index Updating

When confronted with batch updates, we choose to directly reconstruct the index. Leveraging our efficient parallel index construction method, performing a direct batch reconstruction of the modified dataset proves both feasible and effective.

When data arrives in a sequential streaming fashion (stream data updates), we propose to use lazy strategy with a cache list to handle streaming data updates.

The example of the stream data updates is as follows. Range queries are mixed In the process of data insertion and deletion.

GTS/bin/GTS [data_path] [update_path] [operation_type] [operation_para] [result_path]

The meanings of the parameters are as follows.

data_path, The file contains all the data, and the first line for datasets denotes the dimensionality, the size and the used metric.
update_path, The file contains the stream data and corresponding operations. The first line denotes the size. Afterwards, each line consists of an operation and a data ID. Specifically, for each operation, 0 represents an insertion, 1 represents a deletion, and 2 represents a query.
operation_type, Type of operation, with 0 for $k$NN queries, 1 for range queries and 2 for updates.
operation_para, Representing the radius in range queries when data arrives in a sequential streaming fashion.
result_path, Result file path.

Here is a specific example for steam data updates.

cd GTS
bin/GTS ../Datasets/sbw_example.txt ../Datasets/sbw_example_u.txt 2 59 cost_u.txt

Baselines

Algorithm	Paper	Year
BST	A Data Structure and an Algorithm for the Nearest Point Problem	1983
EGNAT	Searching and Updating Metric Space Databases Using the Parallel EGNAT	2007
MVPT	Distance-Based Indexing for High-Dimensional Metric Spaces	1997
GPU-Table	- GPU-accelerated table-based methods	-
GPU-Tree	- GPU-accelerated tree-based methods	-
LBPG-Tree	Multi-GPU Efficient Indexing For Maximizing Parallelism of High Dimensional Range Query Services	2022
GANNS	GPU-accelerated Proximity Graph Approximate Nearest Neighbor Search and Construction	2022

We choose the CPU baselines according to the Chen et al.'s survey in CSUR 2022, where BST, EGNAT, and MVPT show superiority performance among all CPU-based approaches for exact similarity search. The source codes for BST, EGNAT and MVPT are available at the SISAP Metric Space Library.
GPU-Table computes the distances between query and all the objects to answer MRQ and leverage Dr. Top-$k$ algorithm to answer M$k$NNQ.
GPU-Tree implements the SOTA GPU-based tree index G-PICS strategy for general similarity search on single GPU by constructing multiple MVP-trees. The source code of our implementation for GPU-Tree is in the "GPU-Tree" folder, and its compilation and similarity search example are performed similarly to GTS. Here is a specific example for GPU-Tree index construction and similarity search.

cd GPU-Tree
bin/gpu_tree ../Datasets/sbw_example.txt ../Datasets/sbw_example_qid.txt 0 8

LBPG-Tree constructs R-Trees on GPU high dimensional range query services. Notably, we also implement the G-PICS's $k$ nearest neighbour search strategy for LBPG-Tree to support M$k$NNQ.
GANNS is a GPU-based graph method for vector similarity search.

Datasets

Each dataset can be obtained from the following links.

Dataset	Cardinality	Dimensionality	Distance Metric	Link
Words	611,756	1∼34	Edit distance	https://mobyproject.org
T-Loc	10,000,000	2	$L_2$-norm	http://www.vldb.org/pvldb/vol12/p99-ghosh.pdf
Vector	200,000	300	Word cosine distance	https://code.google.com/archive/p/word2vec
DNA	1,000,000	108	Edit distance	http://www.ncbi.nlm.nih.gov/genome
Color	5,000,000	282	$L_1$-norm	http://cophir.isti.cnr.it

The dataset file is in txt format. The example of dataset file is located in "Datasets Sample".

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Datasets Sample		Datasets Sample
Source Code		Source Code
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GTS: GPU-based Tree Index for Fast Similarity Search

Introduction

Development

Examples

Index Construction and Similarity Search

Index Updating

Baselines

Datasets

About

Releases

Packages

Languages

ZJU-DAILY/GTS

Folders and files

Latest commit

History

Repository files navigation

GTS: GPU-based Tree Index for Fast Similarity Search

Introduction

Development

Examples

Index Construction and Similarity Search

Index Updating

Baselines

Datasets

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages