[PIXELS-580] implement using LSH index to perform approximate nearest neighbour search on vector column #88
Open: TiannanSha wants to merge 23 commits into pixelsdb:master from TiannanSha:lsh-nns
23 commits
f88dedb (TiannanSha) implement a simple trino udf to calcualte the sum of all the elements…
5a6d75c (TiannanSha) use array instead of features because apparently trino features don't…
df04bbf (TiannanSha) implement three types of distances: euclidean, dotproduct and cosine …
2300020 (TiannanSha) make select exactNNS() udf work in trino
1767b44 (TiannanSha) fix a test
d52a31e (TiannanSha) fix a comment
2201829 (TiannanSha) minor polish
420408a (TiannanSha) make pixels vector type support trino array type
abbbee7 (TiannanSha) implement a exact NNS that acts as an aggregation function and should…
4ca554d (TiannanSha) clean up
5bc07ba (TiannanSha) pretty much finished LSH build; LSH search wip
01a3903 (TiannanSha) wip: lsh search
19fedc8 (TiannanSha) lshSearch work in progress
3e89598 (TiannanSha) implement LSH search, including updating mapping from col to buckets …
0fc8cd1 (TiannanSha) fix the ser and deser for LSH index
9150fab (TiannanSha) auto decide s3dir for storing LSH buckets. clean up
dd9016d (TiannanSha) implement code for experiments; before fixing remote page too large
b843296 (TiannanSha) accidentally got ignored entire folder of lsh build
fad3f5d (TiannanSha) fix lsh_build
9eaf254 (TiannanSha) implement LSH load and adjust lsh search
09afb3f (TiannanSha) fix lsh_search() so that it works with lsh_load()
f231da9 (TiannanSha) minor changes on lsh_load()
bbbb793 (TiannanSha) add bucket write threshold to lsh_build() to avoid sending big messages
104 changes: 104 additions & 0 deletions
connector/src/main/java/io/pixelsdb/pixels/trino/vector/S3TestFileGenerator.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,104 @@ | ||
package io.pixelsdb.pixels.trino.vector; | ||
|
||
import io.pixelsdb.pixels.common.physical.Storage; | ||
import io.pixelsdb.pixels.common.physical.StorageFactory; | ||
import io.pixelsdb.pixels.core.PixelsWriter; | ||
import io.pixelsdb.pixels.core.PixelsWriterImpl; | ||
import io.pixelsdb.pixels.core.TypeDescription; | ||
import io.pixelsdb.pixels.core.encoding.EncodingLevel; | ||
import io.pixelsdb.pixels.core.exception.PixelsWriterException; | ||
import io.pixelsdb.pixels.core.vector.VectorColumnVector; | ||
import io.pixelsdb.pixels.core.vector.VectorizedRowBatch; | ||
|
||
import java.io.IOException; | ||
|
||
/** | ||
* This class writes test files containing vector columns to S3.
*/ | ||
public class S3TestFileGenerator { | ||
public static void main(String[] args) throws IOException | ||
{ | ||
writeVectorColumnToS3(getTestVectors(4,2), "exactNNS-test-file3.pxl"); | ||
writeVectorColumnToS3(getTestVectors(4,2), "exactNNS-test-file4.pxl"); | ||
// todo maybe add a large scale test | ||
} | ||
|
||
public static void writeVectorColumnToS3(double[][] vectorsToWrite, String s3File) | ||
{ | ||
// Nothing to write if the input is null, empty, or starts with a null vector | ||
if (vectorsToWrite == null || vectorsToWrite.length == 0 || vectorsToWrite[0] == null) { | ||
return; | ||
} | ||
int length = vectorsToWrite.length; | ||
int dimension = vectorsToWrite[0].length; | ||
// Note: you may need to restart IntelliJ so that it picks up the updated environment variable value | ||
// example path: s3://bucket-name/test-file.pxl | ||
try | ||
{ | ||
String pixelsFile = System.getenv("PIXELS_S3_TEST_BUCKET_PATH") + s3File; | ||
Storage storage = StorageFactory.Instance().getStorage("s3"); | ||
|
||
String schemaStr = String.format("struct<v:vector(%s)>", dimension); | ||
|
||
TypeDescription schema = TypeDescription.fromString(schemaStr); | ||
VectorizedRowBatch rowBatch = schema.createRowBatch(); | ||
VectorColumnVector v = (VectorColumnVector) rowBatch.cols[0]; | ||
|
||
PixelsWriter pixelsWriter = | ||
PixelsWriterImpl.newBuilder() | ||
.setSchema(schema) | ||
.setPixelStride(10000) | ||
.setRowGroupSize(64 * 1024 * 1024) | ||
.setStorage(storage) | ||
.setPath(pixelsFile) | ||
.setBlockSize(256 * 1024 * 1024) | ||
.setReplication((short) 3) | ||
.setBlockPadding(true) | ||
.setEncodingLevel(EncodingLevel.EL2) | ||
.setCompressionBlockSize(1) | ||
.build(); | ||
|
||
// Write every vector; index the source by i, because row restarts at 0 after each batch reset | ||
for (int i = 0; i < length; i++) | ||
{ | ||
int row = rowBatch.size++; | ||
v.vector[row] = new double[dimension]; | ||
System.arraycopy(vectorsToWrite[i], 0, v.vector[row], 0, dimension); | ||
v.isNull[row] = false; | ||
if (rowBatch.size == rowBatch.getMaxSize()) | ||
{ | ||
pixelsWriter.addRowBatch(rowBatch); | ||
rowBatch.reset(); | ||
} | ||
} | ||
|
||
if (rowBatch.size != 0) | ||
{ | ||
pixelsWriter.addRowBatch(rowBatch); | ||
System.out.println("A rowBatch of size " + rowBatch.size + " has been written to " + pixelsFile); | ||
rowBatch.reset(); | ||
} | ||
|
||
pixelsWriter.close(); | ||
} catch (IOException | PixelsWriterException e) | ||
{ | ||
e.printStackTrace(); | ||
} | ||
} | ||
|
||
/** | ||
* testVectors[i][j] = i + j*0.0001
* e.g. testVectors[0][500] = 0.05
* @param length number of vectors
* @param dimension dimension of each vector
* @return a length x dimension array of test vectors
*/ | ||
private static double[][] getTestVectors(int length, int dimension) | ||
{ | ||
double[][] testVecs = new double[length][dimension]; | ||
for (int i=0; i<length; i++) { | ||
for (int j=0; j<dimension; j++) { | ||
testVecs[i][j] = i + j*0.0001; | ||
} | ||
} | ||
return testVecs; | ||
} | ||
} |
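The batching loop in `writeVectorColumnToS3` can be sketched without any Pixels classes. The following is a minimal illustration, not the PR's implementation: `BatchingSketch`, `testVectors`, and `toBatches` are hypothetical names, and a plain list of arrays stands in for `VectorizedRowBatch` and `PixelsWriter`. It shows the two points that matter in the loop: the source array is indexed by the loop counter rather than by the position inside the batch, and a final partial batch is flushed after the loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the batching loop; no Pixels classes are used. */
class BatchingSketch {
    /** Same formula as getTestVectors in the PR: vecs[i][j] = i + j * 0.0001. */
    static double[][] testVectors(int length, int dimension) {
        double[][] vecs = new double[length][dimension];
        for (int i = 0; i < length; i++) {
            for (int j = 0; j < dimension; j++) {
                vecs[i][j] = i + j * 0.0001;
            }
        }
        return vecs;
    }

    /** Splits the input into batches of at most maxBatchSize rows, copying each row. */
    static List<double[][]> toBatches(double[][] vectors, int maxBatchSize) {
        List<double[][]> flushed = new ArrayList<>();
        if (vectors == null || vectors.length == 0 || maxBatchSize <= 0) {
            return flushed;
        }
        double[][] batch = new double[maxBatchSize][];
        int size = 0;
        for (int i = 0; i < vectors.length; i++) {
            // Source is indexed by i; the batch is indexed by its own fill level.
            batch[size++] = vectors[i].clone();
            if (size == maxBatchSize) {
                flushed.add(batch);               // full batch: "write" it and start a new one
                batch = new double[maxBatchSize][];
                size = 0;
            }
        }
        if (size != 0) {
            flushed.add(Arrays.copyOf(batch, size)); // flush the final partial batch
        }
        return flushed;
    }

    public static void main(String[] args) {
        List<double[][]> batches = toBatches(testVectors(5, 2), 2);
        int rows = 0;
        for (double[][] b : batches) {
            rows += b.length;
        }
        System.out.println(batches.size() + " batches, " + rows + " rows"); // prints "3 batches, 5 rows"
    }
}
```

With five vectors and a batch size of two, the sketch produces batches of 2, 2, and 1 rows, so no input vector is dropped or duplicated.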
71 changes: 71 additions & 0 deletions
connector/src/main/java/io/pixelsdb/pixels/trino/vector/VectorAggFuncUtil.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
package io.pixelsdb.pixels.trino.vector; | ||
|
||
import com.fasterxml.jackson.databind.ObjectMapper; | ||
import io.airlift.slice.Slice; | ||
import io.pixelsdb.pixels.core.TypeDescription; | ||
import io.pixelsdb.pixels.trino.PixelsColumnHandle; | ||
import io.pixelsdb.pixels.trino.vector.VectorDistFuncs; | ||
import io.trino.spi.block.Block; | ||
import io.trino.spi.type.ArrayType; | ||
|
||
import static io.trino.spi.type.DoubleType.DOUBLE; | ||
|
||
public class VectorAggFuncUtil { | ||
|
||
public static double[] blockToVec(Block block) { | ||
double[] inputVector; | ||
if (block == null) { | ||
return null; | ||
} | ||
// todo use offset here | ||
inputVector = new double[block.getPositionCount()]; | ||
for (int i = 0; i < block.getPositionCount(); i++) { | ||
inputVector[i] = DOUBLE.getDouble(block, i); | ||
} | ||
return inputVector; | ||
} | ||
|
||
public static double[] sliceToVec(Slice slice) { | ||
ObjectMapper objectMapper = new ObjectMapper(); | ||
try { | ||
double[] vec = objectMapper.readValue(slice.toStringUtf8(), double[].class); | ||
return vec; | ||
} catch (Exception e) { | ||
e.printStackTrace(); | ||
return null; | ||
} | ||
} | ||
|
||
public static VectorDistFuncs.DistFuncEnum sliceToDistFunc(Slice distFuncStr) { | ||
return switch (distFuncStr.toStringUtf8()) { | ||
case "euc" -> VectorDistFuncs.DistFuncEnum.EUCLIDEAN_DISTANCE; | ||
case "cos" -> VectorDistFuncs.DistFuncEnum.COSINE_SIMILARITY; | ||
case "dot" -> VectorDistFuncs.DistFuncEnum.DOT_PRODUCT; | ||
default -> null; | ||
}; | ||
} | ||
|
||
public static PixelsColumnHandle sliceToColumn(Slice columnStr) { | ||
String[] schemaTableCol = columnStr.toStringUtf8().split("\\."); | ||
if (schemaTableCol.length != 3) { | ||
throw new IllegalColumnException("column should be of form schema.table.column"); | ||
} | ||
return PixelsColumnHandle.builder() | ||
.setConnectorId("pixels") | ||
.setSchemaName(schemaTableCol[0]) | ||
.setTableName(schemaTableCol[1]) | ||
.setColumnName(schemaTableCol[2]) | ||
.setColumnAlias(schemaTableCol[2]) | ||
.setColumnType(new ArrayType(DOUBLE)) | ||
.setTypeCategory(TypeDescription.Category.VECTOR) | ||
.setLogicalOrdinal(0) | ||
.setColumnComment("") | ||
.build(); | ||
} | ||
|
||
public static class IllegalColumnException extends IllegalArgumentException { | ||
public IllegalColumnException(String msg) { | ||
super(msg); | ||
} | ||
} | ||
} |
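The argument parsing in `VectorAggFuncUtil` can be illustrated with plain strings. This is a sketch under stated assumptions: `ArgParsingSketch`, `Dist`, `parseDist`, `parseColumn`, and `parseVector` are hypothetical names, `String` stands in for Trino's `Slice`, and a hand-written bracket parser replaces the Jackson `ObjectMapper` call so the example needs only the standard library. Unlike the PR code, which returns `null` for an unknown distance name or an unparsable vector, the sketch throws `IllegalArgumentException`; that is a design choice of the sketch, not something the PR does.

```java
import java.util.Arrays;

/** Hypothetical, stdlib-only illustration of the UDF argument parsing. */
class ArgParsingSketch {
    enum Dist { EUCLIDEAN_DISTANCE, COSINE_SIMILARITY, DOT_PRODUCT }

    /** Maps the short names used in the PR ("euc", "cos", "dot") to an enum value. */
    static Dist parseDist(String name) {
        switch (name) {
            case "euc": return Dist.EUCLIDEAN_DISTANCE;
            case "cos": return Dist.COSINE_SIMILARITY;
            case "dot": return Dist.DOT_PRODUCT;
            default: throw new IllegalArgumentException("unknown distance function: " + name);
        }
    }

    /** Splits "schema.table.column" into its three non-empty parts. */
    static String[] parseColumn(String qualified) {
        String[] parts = qualified.split("\\.", -1); // -1 keeps trailing empty parts
        if (parts.length != 3 || Arrays.stream(parts).anyMatch(String::isEmpty)) {
            throw new IllegalArgumentException("column should be of form schema.table.column");
        }
        return parts;
    }

    /** Parses a flat numeric array literal such as "[1.0, 2.5]" (simplified stand-in for Jackson). */
    static double[] parseVector(String text) {
        String t = text.trim();
        if (t.length() < 2 || t.charAt(0) != '[' || t.charAt(t.length() - 1) != ']') {
            throw new IllegalArgumentException("vector should be of form [x1, x2, ...]");
        }
        String inner = t.substring(1, t.length() - 1).trim();
        if (inner.isEmpty()) {
            return new double[0];
        }
        String[] tokens = inner.split(",");
        double[] vec = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            vec[i] = Double.parseDouble(tokens[i].trim()); // NumberFormatException is an IllegalArgumentException
        }
        return vec;
    }

    public static void main(String[] args) {
        System.out.println(parseDist("cos"));
        System.out.println(Arrays.toString(parseColumn("myschema.mytable.embedding")));
        System.out.println(Arrays.toString(parseVector("[1.0, 2.5]")));
    }
}
```

Failing fast on malformed input keeps a bad argument from surfacing later as a `NullPointerException` inside the aggregation; whether that suits the PR's error-handling conventions is for the author to decide.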
7 changes: 7 additions & 0 deletions
connector/src/main/java/io/pixelsdb/pixels/trino/vector/VectorDistFunc.java
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
package io.pixelsdb.pixels.trino.vector; | ||
|
||
import io.trino.spi.block.Block; | ||
|
||
public interface VectorDistFunc { | ||
Double getDist(double[] vec1, double[] vec2); | ||
} |
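The commit list mentions three distance types (Euclidean, dot product, cosine) behind this interface. The sketch below shows one way such implementations could look; it is an assumption, not the code in this PR. `DistFunc` is a hypothetical copy of `VectorDistFunc` so the example compiles on its own, and details such as taking the square root for the Euclidean distance or rejecting vectors of different lengths are choices made here for illustration.

```java
/** Hypothetical copy of the PR's VectorDistFunc interface, repeated so the sketch compiles alone. */
interface DistFunc {
    Double getDist(double[] vec1, double[] vec2);
}

/** Illustrative implementations of the three distance types named in the commit list. */
class DistFuncSketch {
    /** Euclidean distance: square root of the sum of squared differences. */
    static final DistFunc EUCLIDEAN = (a, b) -> {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    /** Dot product: sum of element-wise products. */
    static final DistFunc DOT_PRODUCT = (a, b) -> {
        requireSameLength(a, b);
        double dot = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return dot;
    };

    /** Cosine similarity: dot product divided by the product of the norms (NaN for a zero vector). */
    static final DistFunc COSINE_SIMILARITY = (a, b) -> {
        requireSameLength(a, b);
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    };

    private static void requireSameLength(double[] a, double[] b) {
        if (a == null || b == null || a.length != b.length) {
            throw new IllegalArgumentException("vectors must be non-null and of equal dimension");
        }
    }

    public static void main(String[] args) {
        System.out.println(EUCLIDEAN.getDist(new double[]{0, 0}, new double[]{3, 4}));   // prints 5.0
        System.out.println(DOT_PRODUCT.getDist(new double[]{1, 2}, new double[]{3, 4})); // prints 11.0
    }
}
```

Note that cosine similarity and dot product grow with closeness while Euclidean distance shrinks, so a nearest-neighbour search built on this interface has to order results differently depending on which function is selected.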
We should not use AWS Java SDK 1.x.
AWS Java SDK 2 is already included as a dependency.