From bcf9ed7a2c1c0e4b6294d866c87d12727e8dbe00 Mon Sep 17 00:00:00 2001
From: Oliver Rice
Date: Fri, 15 Sep 2023 11:25:07 -0500
Subject: [PATCH] update docs for HNSW indexes

---
 README.md                |  4 ++--
 docs/api.md              | 39 +++++++++++++++++++------------
 docs/concepts_indexes.md | 50 ++++++++++++++++++++++++++++++++++------
 3 files changed, 69 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 31b5b83..d3056b0 100644
--- a/README.md
+++ b/README.md
@@ -42,7 +42,7 @@ pip install vecs
 
 ## Usage
 
-Visit the [quickstart guide](https://supabase.github.io/vecs/api/) for more complete info.
+Visit the [quickstart guide](https://supabase.github.io/vecs/latest/api) for more complete info.
 
 ```python
 import vecs
@@ -53,7 +53,7 @@ DB_CONNECTION = "postgresql://:@:/"
 vx = vecs.create_client(DB_CONNECTION)
 
 # create a collection of vectors with 3 dimensions
-docs = vx.create_collection(name="docs", dimension=3)
+docs = vx.get_or_create_collection(name="docs", dimension=3)
 
 # add records to the *docs* collection
 docs.upsert(
diff --git a/docs/api.md b/docs/api.md
index 9dcd9d9..b2eb957 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -71,27 +71,19 @@ docs.delete(ids=["vec0", "vec1"])
 ## Create an index
 
 Collections can be queried immediately after being created.
-However, for good performance, the collection should be indexed after records have been upserted.
-
-Indexes should be created __after__ the collection has been populated with records. Building an index
-on an empty collection will result in significantly reduced recall. Once the index has been created
-you can still upsert new documents into the collection but you should rebuild the index if the size of
-the collection more than doubles.
+However, for good throughput, the collection should be indexed after records have been upserted.
 
 Only one index may exist per-collection. By default, creating an index will replace any existing index.
 
 To create an index:
 
 ```python
-##
-# INSERT RECORDS HERE
-##
-
-# index the collection to be queried by cosine distance
-docs.create_index(measure=vecs.IndexMeasure.cosine_distance)
+docs.create_index()
 ```
 
-Available options for query `measure` are:
+You may optionally provide a distance measure and index method.
+
+Available options for distance `measure` are:
 
 - `vecs.IndexMeasure.cosine_distance`
 - `vecs.IndexMeasure.l2_distance`
@@ -99,10 +91,27 @@ Available options for query `measure` are:
 
 which correspond to different methods for comparing query vectors to the vectors in the database.
 
-If you aren't sure which to use, stick with the default (cosine_distance) by omitting the parameter i.e.
+If you aren't sure which to use, the default of cosine_distance is the most widely compatible with off-the-shelf embedding methods.
+
+Available options for index `method` are:
+
+- `vecs.IndexMethod.auto`
+- `vecs.IndexMethod.hnsw`
+- `vecs.IndexMethod.ivfflat`
+
+Where `auto` selects the best available index method, `hnsw` uses the [HNSW](https://github.com/pgvector/pgvector#hnsw) method and `ivfflat` uses [IVFFlat](https://github.com/pgvector/pgvector#ivfflat).
+
+When using IVFFlat indexes, the index must be created __after__ the collection has been populated with records. Building an IVFFlat index on an empty collection will result in significantly reduced recall. You can continue upserting new documents after the index has been created, but should rebuild the index if the size of the collection more than doubles since the last index operation.
+
+HNSW indexes can be created immediately after the collection is created, without populating it with records.
+
+To manually specify `method` and `measure`, pass them as arguments to `create_index`, for example:
 
 ```python
-docs.create_index()
+docs.create_index(
+    method=vecs.IndexMethod.hnsw,
+    measure=vecs.IndexMeasure.cosine_distance,
+)
 ```
 
 !!! note
diff --git a/docs/concepts_indexes.md b/docs/concepts_indexes.md
index 47de2d2..3256ddf 100644
--- a/docs/concepts_indexes.md
+++ b/docs/concepts_indexes.md
@@ -2,22 +2,58 @@
 
 Indexes are tools for optimizing query performance of a [collection](concepts_collections.md).
 
-Collections can be [queried](api.md/#query) without an index, but that will emit a python warning and should never be done in produciton.
+Collections can be [queried](api.md/#query) without an index, but that will emit a python warning and should never be done in production.
 
 ```text
 query does not have a covering index for cosine_similarity. See Collection.create_index
 ```
 
-as each query vector must be checked against every record in the collection. When the number of dimensions and/or number of records becomes large, that becomes extremely slow and computationally expensive.
+Without an index, each query vector must be checked against every record in the collection. When the number of dimensions and/or number of records becomes large, that becomes extremely slow and computationally expensive.
 
 An index is a heuristic datastructure that pre-computes distances among key points in the vector space. It is smaller and can be traversed more quickly than the whole collection enabling __much__ more performant seraching.
 
 Only one index may exist per-collection. An index optimizes a collection for searching according to a selected distance measure.
 
-Available options distance measure are:
+To create an index:
 
-- cosine distance
-- l2 distance
-- max inner product
+```python
+docs.create_index()
+```
+
+You may optionally provide a distance measure and index method.
+
+Available options for distance `measure` are:
+
+- `vecs.IndexMeasure.cosine_distance`
+- `vecs.IndexMeasure.l2_distance`
+- `vecs.IndexMeasure.max_inner_product`
+
+which correspond to different methods for comparing query vectors to the vectors in the database.
+
+If you aren't sure which to use, the default of cosine_distance is the most widely compatible with off-the-shelf embedding methods.
+
+Available options for index `method` are:
+
+- `vecs.IndexMethod.auto`
+- `vecs.IndexMethod.hnsw`
+- `vecs.IndexMethod.ivfflat`
+
+Where `auto` selects the best available index method, `hnsw` uses the [HNSW](https://github.com/pgvector/pgvector#hnsw) method and `ivfflat` uses [IVFFlat](https://github.com/pgvector/pgvector#ivfflat).
+
+When using IVFFlat indexes, the index must be created __after__ the collection has been populated with records. Building an IVFFlat index on an empty collection will result in significantly reduced recall. You can continue upserting new documents after the index has been created, but should rebuild the index if the size of the collection more than doubles since the last index operation.
+
+HNSW indexes can be created immediately after the collection is created, without populating it with records.
+
+To manually specify `method` and `measure`, pass them as arguments to `create_index`, for example:
+
+```python
+docs.create_index(
+    method=vecs.IndexMethod.hnsw,
+    measure=vecs.IndexMeasure.cosine_distance,
+)
+```
 
-If you aren't sure which to use, stick with the default (cosine_distance) by omitting the parameter when creating indexes and querying.
+!!! note
+    The time required to create an index grows with the number of records and size of vectors.
+    For a few thousand records, expect a response in under a minute. It may take a few
+    minutes for larger collections.