[PIXELS-580] implement using LSH index to perform approximate nearest neighbour search on vector column #88

Open
wants to merge 23 commits into
base: master

@TiannanSha TiannanSha commented Feb 18, 2024

(TODO: the ExactNNS PR will probably be merged first. For now, the first ~10 commits of this PR implement ExactNNS.)

Implement approximate nearest neighbour search on a vector column using an LSH index. This adds two functions for the user to call:

  • To build the LSH index in a distributed fashion across multiple nodes:
    select build_lsh(vec_col, numBits), where numBits specifies the number of bits in the hashed value. vec_col is divided into 2^numBits possible buckets.
  • To search the built LSH index. Each search uses only one node, but the queries are distributed evenly across all nodes:
    lsh_search_single_node(input_vec, distFunc, test_schema.test_arr_table.arr_col, k)
    e.g. lsh_search_single_node(array[3.5, 3.5], 'euc', 'test_schema.test_arr_table.arr_col', 3)
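
To illustrate how numBits maps to 2^numBits buckets, here is a minimal sketch of one common LSH scheme, random-hyperplane (sign) hashing. The description above does not say which hash family the PR uses, so this is an assumption, and the class name `LshSketch` and its methods are hypothetical rather than taken from the PR. Candidates from the query's bucket are re-ranked by Euclidean distance, loosely mirroring the 'euc' argument in the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative random-hyperplane LSH; an assumption, not the code from this PR.
final class LshSketch {
    private final double[][] hyperplanes; // one random hyperplane per hash bit
    private final Map<Integer, List<double[]>> buckets = new HashMap<>();

    LshSketch(int dim, int numBits, long seed) {
        if (numBits < 1 || numBits > 30) {
            throw new IllegalArgumentException("numBits must be in [1, 30]");
        }
        Random rnd = new Random(seed);
        hyperplanes = new double[numBits][dim];
        for (double[] plane : hyperplanes) {
            for (int i = 0; i < dim; i++) {
                plane[i] = rnd.nextGaussian();
            }
        }
    }

    // Bit b is set when v lies on the non-negative side of hyperplane b,
    // so the result is a bucket id in [0, 2^numBits).
    int hash(double[] v) {
        int code = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int i = 0; i < v.length; i++) {
                dot += hyperplanes[b][i] * v[i];
            }
            if (dot >= 0.0) {
                code |= 1 << b;
            }
        }
        return code;
    }

    // Index a vector by appending it to the bucket its hash selects.
    void add(double[] v) {
        buckets.computeIfAbsent(hash(v), key -> new ArrayList<>()).add(v);
    }

    // Approximate search: only vectors in the query's bucket are considered,
    // then the k closest by Euclidean distance are returned.
    List<double[]> search(double[] query, int k) {
        List<double[]> candidates =
                new ArrayList<>(buckets.getOrDefault(hash(query), new ArrayList<>()));
        candidates.sort(Comparator.comparingDouble(c -> squaredDistance(c, query)));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        LshSketch index = new LshSketch(2, 3, 42L);
        index.add(new double[] {3.5, 3.5});
        index.add(new double[] {3.0, 4.0});
        index.add(new double[] {-5.0, 1.0});
        double[] query = {3.5, 3.5};
        System.out.println("bucket = " + index.hash(query));
        System.out.println("candidates = " + index.search(query, 3).size());
    }
}
```

Because only one bucket is scanned, the result can contain fewer than k vectors and can miss true neighbours that hashed elsewhere. A larger numBits gives smaller buckets and faster searches at the cost of recall.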

@bianhq bianhq changed the title from "Implement using LSH index to perform approximate nearest neighbour search on vector column" to "[PIXELS-580] implement using LSH index to perform approximate nearest neighbour search on vector column" on May 11, 2024

@bianhq bianhq left a comment

Also resolve the conflicts.

<scope>test</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>

We should not use AWS Java SDK 1.x.
AWS Java SDK 2 is already included in the dependencies.

import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class CachedLSHIndex {

We should use the Storage API to access the underlying storage system such as S3, instead of directly using the S3 client.
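
A rough sketch of the idea, assuming a storage-neutral interface: the index code depends only on an abstraction, and the S3-specific client lives behind it. `BlobStore`, `InMemoryBlobStore`, and `IndexPersister` are hypothetical names made up for illustration; they are not the project's actual Storage API, whose signatures are not shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical storage abstraction, standing in for the project's Storage API.
interface BlobStore {
    void write(String path, byte[] data);
    byte[] read(String path);
}

// In-memory backend for illustration; an S3-backed class would implement
// the same interface and be the only place that touches an S3 client.
final class InMemoryBlobStore implements BlobStore {
    private final Map<String, byte[]> blobs = new HashMap<>();

    @Override
    public void write(String path, byte[] data) {
        blobs.put(path, data.clone());
    }

    @Override
    public byte[] read(String path) {
        byte[] data = blobs.get(path);
        if (data == null) {
            throw new NoSuchElementException("no such object: " + path);
        }
        return data.clone();
    }
}

// The index code sees only BlobStore, so it never imports a storage-specific client.
final class IndexPersister {
    private final BlobStore store;

    IndexPersister(BlobStore store) {
        this.store = store;
    }

    void saveBucket(String indexPath, int bucketId, byte[] serialized) {
        store.write(indexPath + "/bucket-" + bucketId, serialized);
    }

    byte[] loadBucket(String indexPath, int bucketId) {
        return store.read(indexPath + "/bucket-" + bucketId);
    }
}
```

With this shape, switching the backing store means swapping the `BlobStore` implementation passed to the constructor, without changes to the index class.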

@bianhq bianhq added the enhancement (New feature or request) label on Oct 16, 2024