update docs for HNSW indexes

supabase · Sep 15, 2023 · bcf9ed7
1 parent a11ef43
commit bcf9ed7

Showing 3 changed files with 69 additions and 24 deletions.
diff --git a/README.md b/README.md
@@ -42,7 +42,7 @@ pip install vecs
 
 ## Usage
 
-Visit the [quickstart guide](https://supabase.github.io/vecs/api/) for more complete info.
+Visit the [quickstart guide](https://supabase.github.io/vecs/latest/api) for more complete info.
 
 ```python
 import vecs
@@ -53,7 +53,7 @@ DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"
 vx = vecs.create_client(DB_CONNECTION)
 
 # create a collection of vectors with 3 dimensions
-docs = vx.create_collection(name="docs", dimension=3)
+docs = vx.get_or_create_collection(name="docs", dimension=3)
 
 # add records to the *docs* collection
 docs.upsert(

diff --git a/docs/api.md b/docs/api.md
@@ -71,38 +71,47 @@ docs.delete(ids=["vec0", "vec1"])
 ## Create an index
 
 Collections can be queried immediately after being created.
-However, for good performance, the collection should be indexed after records have been upserted.
-
-Indexes should be created __after__ the collection has been populated with records. Building an index
-on an empty collection will result in significantly reduced recall. Once the index has been created
-you can still upsert new documents into the collection but you should rebuild the index if the size of
-the collection more than doubles.
+However, for good throughput, the collection should be indexed after records have been upserted.
 
 Only one index may exist per-collection. By default, creating an index will replace any existing index.
 
 To create an index:
 
 ```python
-##
-# INSERT RECORDS HERE
-##
-
-# index the collection to be queried by cosine distance
-docs.create_index(measure=vecs.IndexMeasure.cosine_distance)
+docs.create_index()
 ```
 
-Available options for query `measure` are:
+You may optionally provide a distance measure and index method.
+
+Available options for distance `measure` are:
 
 - `vecs.IndexMeasure.cosine_distance`
 - `vecs.IndexMeasure.l2_distance`
 - `vecs.IndexMeasure.max_inner_product`
 
 which correspond to different methods for comparing query vectors to the vectors in the database.
 
-If you aren't sure which to use, stick with the default (cosine_distance) by omitting the parameter i.e.
+If you aren't sure which to use, the default of cosine_distance is the most widely compatible with off-the-shelf embedding methods.
+
+Available options for index `method` are:
+
+- `vecs.IndexMethod.auto`
+- `vecs.IndexMethod.hnsw`
+- `vecs.IndexMethod.ivfflat`
+
+Where `auto` selects the best available index method, `hnsw` uses the [HNSW](https://github.com/pgvector/pgvector#hnsw) method and `ivfflat` uses [IVFFlat](https://github.com/pgvector/pgvector#ivfflat).
+
+When using IVFFlat indexes, the index must be created __after__ the collection has been populated with records. Building an IVFFlat index on an empty collection will result in significantly reduced recall. You can continue upserting new documents after the index has been created, but should rebuild the index if the size of the collection more than doubles since the last index operation.
+
+HNSW indexes can be created immediately after the collection without populating records.
+
+To manually specify `method` and `measure`, pass them as arguments to `create_index`, for example:
 
 ```python
-docs.create_index()
+docs.create_index(
+    method=vecs.IndexMethod.hnsw,
+    measure=vecs.IndexMeasure.cosine_distance,
+)
 ```
 
 !!! note

diff --git a/docs/concepts_indexes.md b/docs/concepts_indexes.md
@@ -2,22 +2,58 @@
 
 Indexes are tools for optimizing query performance of a [collection](concepts_collections.md).
 
-Collections can be [queried](api.md/#query) without an index, but that will emit a python warning and should never be done in produciton.
+Collections can be [queried](api.md/#query) without an index, but that will emit a python warning and should never be done in production.
 
 ```text
 query does not have a covering index for cosine_similarity. See Collection.create_index
 ```
 
-as each query vector must be checked against every record in the collection. When the number of dimensions and/or number of records becomes large, that becomes extremely slow and computationally expensive.
+This is because each query vector must be checked against every record in the collection. When the number of dimensions and/or number of records becomes large, that becomes extremely slow and computationally expensive.
 
 An index is a heuristic data structure that pre-computes distances among key points in the vector space. It is smaller and can be traversed more quickly than the whole collection, enabling __much__ more performant searching.
 
 Only one index may exist per-collection. An index optimizes a collection for searching according to a selected distance measure.
 
-Available options distance measure are:
+To create an index:
 
-- cosine distance
-- l2 distance
-- max inner product
+```python
+docs.create_index()
+```
+
+You may optionally provide a distance measure and index method.
+
+Available options for distance `measure` are:
+
+- `vecs.IndexMeasure.cosine_distance`
+- `vecs.IndexMeasure.l2_distance`
+- `vecs.IndexMeasure.max_inner_product`
+
+which correspond to different methods for comparing query vectors to the vectors in the database.
+
+If you aren't sure which to use, the default of cosine_distance is the most widely compatible with off-the-shelf embedding methods.
+
+Available options for index `method` are:
+
+- `vecs.IndexMethod.auto`
+- `vecs.IndexMethod.hnsw`
+- `vecs.IndexMethod.ivfflat`
+
+Where `auto` selects the best available index method, `hnsw` uses the [HNSW](https://github.com/pgvector/pgvector#hnsw) method and `ivfflat` uses [IVFFlat](https://github.com/pgvector/pgvector#ivfflat).
+
+When using IVFFlat indexes, the index must be created __after__ the collection has been populated with records. Building an IVFFlat index on an empty collection will result in significantly reduced recall. You can continue upserting new documents after the index has been created, but should rebuild the index if the size of the collection more than doubles since the last index operation.
+
+HNSW indexes can be created immediately after the collection without populating records.
+
+To manually specify `method` and `measure`, pass them as arguments to `create_index`, for example:
+
+```python
+docs.create_index(
+    method=vecs.IndexMethod.hnsw,
+    measure=vecs.IndexMeasure.cosine_distance,
+)
+```
 
-If you aren't sure which to use, stick with the default (cosine_distance) by omitting the parameter when creating indexes and querying.
+!!! note
+    The time required to create an index grows with the number of records and size of vectors.
+    For a few thousand records, expect a response in under a minute. It may take a few
+    minutes for larger collections.
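
The IVFFlat guidance added in this commit (rebuild once the collection has more than doubled since the last index operation) can be sketched as a small check. This is only a minimal sketch and not part of vecs: `needs_ivfflat_rebuild` and its parameters are hypothetical names, and the caller is assumed to record the collection size at the time of the last `create_index` call.

```python
def needs_ivfflat_rebuild(size_at_last_index: int, current_size: int) -> bool:
    """Return True when the collection has more than doubled since indexing.

    Hypothetical helper, not part of the vecs API. An index built on an
    empty collection (size 0) counts as stale as soon as any record exists.
    """
    if size_at_last_index < 0 or current_size < 0:
        raise ValueError("sizes must be non-negative")
    # "More than doubles" is a strict comparison: exactly 2x does not trigger.
    return current_size > 2 * size_at_last_index
```

When the check returns `True`, calling `docs.create_index(method=vecs.IndexMethod.ivfflat)` again should be enough, since the docs above state that creating an index replaces any existing one by default. HNSW indexes do not need this check according to the same docs.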