DIMS is a distributed index for similarity search in metric spaces that supports efficient metric range queries and metric k-nearest-neighbor (kNN) queries. DIMS builds an effective global index, medium index, and local index with a three-stage partitioning strategy to capture the unique characteristics of various data and to ensure a balanced workload. In addition, we propose concurrent search methods that support efficient distributed similarity search and leverage filtering and validation techniques to avoid unnecessary distance computations. Furthermore, we devise a cost-based optimization model to strike a balance between computation cost and communication cost.
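For reference, the two query types have simple brute-force definitions. The sketch below is plain Scala and is not the DIMS implementation; the generic `dist` parameter stands for any metric distance function (e.g., edit distance or L2) and is an assumption for illustration only:

```scala
// Conceptual reference only: what a metric range query and a metric kNN query return.
// `dist` is a placeholder for an arbitrary metric distance function, not a DIMS API.
def rangeQuery[T](data: Seq[T], q: T, r: Double)(dist: (T, T) => Double): Seq[T] =
  data.filter(o => dist(q, o) <= r)        // every object within distance r of the query q

def knnQuery[T](data: Seq[T], q: T, k: Int)(dist: (T, T) => Double): Seq[T] =
  data.sortBy(o => dist(q, o)).take(k)     // the k objects closest to the query q
```

DIMS answers the same queries over distributed, partitioned data while pruning most of these distance computations.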
We recommend using IntelliJ for development and following the official guide for setting up the IDE. Here are the steps to follow:
- Add `hadoop-2.6`, `hive-provided`, `hive-thriftserver`, and `yarn` to Profiles in View > Tool Windows > Maven Projects (an equivalent command-line build is shown after this list).
- Select "Reimport All Maven Projects" and "Generate Sources and Update Folders For All Projects".
- Rebuild the entire project; the build will initially fail, but this step is still required before proceeding to the next steps.
- Marking the generated sources as source roots is also essential for successful builds; the next steps show which directories to mark under File > Project Structure > Project Settings > Modules.
- Go to File > Project Structure > Project Settings > Modules. Find `spark-streaming-flume-sink` and mark `target/scala-2.11/src_managed/main/compiled_avro` as a source root (click "Sources" at the top to mark it).
- Go to File > Project Structure > Project Settings > Modules. Find `spark-hive-thriftserver` and mark `src/gen/java` as a source root (click "Sources" at the top to mark it).
- Rebuild the entire project again, and it should work correctly. If there are any compilation errors because some classes cannot be found, go back to the previous step and mark the corresponding sources if they are not included.
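If you build outside the IDE, the same profiles can be enabled on the command line, e.g. `./build/mvn -Phadoop-2.6 -Phive-provided -Phive-thriftserver -Pyarn -DskipTests clean package` (a sketch assuming the standard Spark Maven wrapper under `build/`; adjust the profiles to your environment).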
See `DIMSDataFrameExample` for how to use the DataFrame API for metric similarity search.
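If this example is packaged like the other Spark examples, it can typically be launched with `./bin/run-example sql.dims.DIMSDataFrameExample`; the class path here is only inferred from the example directory listed below and may need to be adjusted.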
The main code for this project is located in the following directories:
- Spark SQL Catalyst: `/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/dims`
- Spark SQL Core: `/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/dims`
- Spark Example: `/spark/examples/src/main/scala/org/apache/spark/examples/sql/dims`
Algorithm | Paper | Year |
---|---|---|
M-tree | M-tree: An Efficient Access Method for Similarity Search in Metric Spaces | 1997 |
M-index | Metric Index: An efficient and scalable solution for precise and approximate similarity search | 2011 |
AMDS | Distributed k-Nearest Neighbor Queries in Metric Spaces | 2018 |
D-HNSW | Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs | 2020 |
Statistics of the datasets used.
Dataset | Card. | Dim. | Dis. Metric | Link |
---|---|---|---|---|
Words | 611,756 | 1~34 | Edit Dis | https://mobyproject.org |
T-Loc | 10M | 2 | | http://www.vldb.org/pvldb/vol12/p99-ghosh.pdf |
Vec | 0.1M | 300 | | https://www.kaggle.com/datasets/rtatman/pretrained-word-vectors-for-spanish |
Color | 1M | 282 | | http://cophir.isti.cnr.it/ |
Deep | 50M | 100 | | https://big-ann-benchmarks.com/neurips21.html |
The dataset files are in txt format and can be read either from HDFS in a cluster environment or from the local directory examples/src/main/resources/ in standalone mode. Each row in a dataset represents one object, and the dimensions within a row are separated by spaces. For example, see /spark/examples/src/main/resources/mpeg_example.txt.
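As a minimal sketch (not the DIMS loader), the space-separated format can be parsed with standard Spark APIs; this applies to the numeric vector datasets (the Words dataset contains strings), and the object name, app name, and example file below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

// Minimal loading sketch: parse the space-separated txt format into numeric vectors.
object LoadDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadDatasetSketch").getOrCreate()
    import spark.implicits._

    // One object per line; dimensions separated by spaces.
    // In cluster mode, point this at an hdfs:// path instead.
    val path = "examples/src/main/resources/mpeg_example.txt"
    val vectors = spark.read.textFile(path)
      .map(_.trim.split("\\s+").map(_.toDouble)) // each row becomes a Double vector

    println(s"Loaded ${vectors.count()} objects")
    spark.stop()
  }
}
```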