[PIXELS-580] implement using LSH index to perform approximate nearest neighbour search on vector column #88
… support more than ten features
… work on a multinode environment
…and persist the mapping when JVM shuts down
Also resolve the conflicts.
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.amazonaws</groupId>
We should not use AWS Java SDK 1.x.
AWS Java SDK 2 is already included as a dependency.
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class CachedLSHIndex {
We should use the Storage API to access the underlying storage system such as S3, instead of directly using the S3 client.
(TODO: the ExactNNS PR will probably be merged first. For now, the first ~10 commits of this PR implement ExactNNS.)
Implement approximate nearest neighbour search on a vector column using an LSH index. This consists of two functions for the user to call:
select build_lsh(vec_col, numBits), where numBits specifies the number of bits in the hashed value. The vectors in vec_col are divided into 2^numBits possible buckets.
e.g. lsh_search_single_node(array[3.5, 3.5], 'euc', 'test_schema.test_arr_table.arr_col', 3)
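For reviewers less familiar with LSH, here is a minimal sketch of how a numBits-bit hash can split vectors into 2^numBits buckets and how a search can then look only inside the query's bucket. It assumes random-hyperplane (sign) hashing, which may not be the hash family this PR uses, and re-ranks candidates by Euclidean distance to mirror the 'euc' argument above. The class `LshSketch` and all of its methods are hypothetical names for illustration and are not part of the Pixels code base.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Illustrative only: random-hyperplane LSH is an assumption,
// not necessarily the scheme implemented in this PR.
class LshSketch {
    // One random hyperplane per hash bit (numBits x dimension); numBits must be < 32.
    final double[][] hyperplanes;
    // Bucket id -> vectors that hashed to that bucket.
    final Map<Integer, List<double[]>> buckets = new HashMap<>();

    LshSketch(int dimension, int numBits, long seed) {
        Random rng = new Random(seed);
        hyperplanes = new double[numBits][dimension];
        for (double[] plane : hyperplanes) {
            for (int d = 0; d < dimension; d++) {
                plane[d] = rng.nextGaussian();
            }
        }
    }

    // Bit b is set when the vector lies on the non-negative side of hyperplane b,
    // so the result is always in the range [0, 2^numBits).
    int hash(double[] v) {
        int code = 0;
        for (int b = 0; b < hyperplanes.length; b++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) {
                dot += hyperplanes[b][d] * v[d];
            }
            if (dot >= 0) {
                code |= (1 << b);
            }
        }
        return code;
    }

    // "Build" step: place a vector into its bucket.
    void add(double[] v) {
        buckets.computeIfAbsent(hash(v), key -> new ArrayList<>()).add(v);
    }

    // "Search" step: scan only the query's bucket, then keep the k closest vectors.
    List<double[]> search(double[] query, int k) {
        List<double[]> candidates =
                new ArrayList<>(buckets.getOrDefault(hash(query), Collections.emptyList()));
        candidates.sort(Comparator.comparingDouble(
                (double[] c) -> squaredEuclidean(c, query)));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    public static void main(String[] args) {
        LshSketch index = new LshSketch(2, 3, 42L); // 3 bits -> at most 8 buckets
        index.add(new double[]{3.5, 3.5});
        index.add(new double[]{-1.0, 2.0});
        double[] query = {3.5, 3.5};
        System.out.println("bucket id: " + index.hash(query));
        System.out.println("neighbours found: " + index.search(query, 3).size());
    }
}
```

Because only one bucket is scanned, the result is approximate: a true nearest neighbour that falls just across a hyperplane is missed. A larger numBits gives smaller buckets and faster scans, at the cost of more such misses.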