update docs for HNSW indexes
olirice committed Sep 15, 2023
1 parent a11ef43 commit bcf9ed7
Showing 3 changed files with 69 additions and 24 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -42,7 +42,7 @@ pip install vecs

## Usage

Visit the [quickstart guide](https://supabase.github.io/vecs/api/) for more complete info.
Visit the [quickstart guide](https://supabase.github.io/vecs/latest/api) for more complete info.

```python
import vecs
@@ -53,7 +53,7 @@ DB_CONNECTION = "postgresql://<user>:<password>@<host>:<port>/<db_name>"
vx = vecs.create_client(DB_CONNECTION)

# create a collection of vectors with 3 dimensions
docs = vx.create_collection(name="docs", dimension=3)
docs = vx.get_or_create_collection(name="docs", dimension=3)

# add records to the *docs* collection
docs.upsert(
```
39 changes: 24 additions & 15 deletions docs/api.md
@@ -71,38 +71,47 @@ docs.delete(ids=["vec0", "vec1"])
## Create an index

Collections can be queried immediately after being created.
However, for good performance, the collection should be indexed after records have been upserted.

Indexes should be created __after__ the collection has been populated with records. Building an index
on an empty collection will result in significantly reduced recall. Once the index has been created
you can still upsert new documents into the collection but you should rebuild the index if the size of
the collection more than doubles.
However, for good throughput, the collection should be indexed after records have been upserted.

Only one index may exist per collection. By default, creating an index will replace any existing index.

To create an index:

```python
##
# INSERT RECORDS HERE
##

# index the collection to be queried by cosine distance
docs.create_index(measure=vecs.IndexMeasure.cosine_distance)
docs.create_index()
```

Available options for query `measure` are:
You may optionally provide a distance measure and index method.

Available options for distance `measure` are:

- `vecs.IndexMeasure.cosine_distance`
- `vecs.IndexMeasure.l2_distance`
- `vecs.IndexMeasure.max_inner_product`

These correspond to different methods for comparing query vectors to the vectors in the database.

If you aren't sure which to use, stick with the default (cosine_distance) by omitting the parameter i.e.
If you aren't sure which to use, the default of cosine_distance is the most widely compatible with off-the-shelf embedding methods.

Available options for index `method` are:

- `vecs.IndexMethod.auto`
- `vecs.IndexMethod.hnsw`
- `vecs.IndexMethod.ivfflat`

`auto` selects the best available index method, `hnsw` uses the [HNSW](https://github.com/pgvector/pgvector#hnsw) method, and `ivfflat` uses [IVFFlat](https://github.com/pgvector/pgvector#ivfflat).

When using IVFFlat indexes, the index must be created __after__ the collection has been populated with records. Building an IVFFlat index on an empty collection will result in significantly reduced recall. You can continue upserting new documents after the index has been created, but should rebuild the index if the size of the collection more than doubles since the last index operation.

HNSW indexes can be created immediately after the collection is created, without first populating it with records.

To manually specify `method` and `measure`, pass them as arguments to `create_index`, for example:

```python
docs.create_index()
docs.create_index(
    method=vecs.IndexMethod.hnsw,
    measure=vecs.IndexMeasure.cosine_distance,
)
```

!!! note
50 changes: 43 additions & 7 deletions docs/concepts_indexes.md
@@ -2,22 +2,58 @@

Indexes are tools for optimizing query performance of a [collection](concepts_collections.md).

Collections can be [queried](api.md/#query) without an index, but that will emit a python warning and should never be done in produciton.
Collections can be [queried](api.md/#query) without an index, but that will emit a python warning and should never be done in production.

```text
query does not have a covering index for cosine_similarity. See Collection.create_index
```

as each query vector must be checked against every record in the collection. When the number of dimensions and/or number of records becomes large, that becomes extremely slow and computationally expensive.
Without an index, each query vector must be checked against every record in the collection. When the number of dimensions and/or number of records becomes large, that becomes extremely slow and computationally expensive.
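
To make that cost concrete, an unindexed query behaves roughly like the exhaustive scan sketched below. This is a plain-Python illustration only, not how vecs or pgvector execute queries; `brute_force_query` and the sample records are hypothetical.

```python
import heapq
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def brute_force_query(records, query, limit=1):
    """Hypothetical helper: compare the query against every record, keep the closest."""
    # One distance computation per record, so the cost grows with both
    # the number of records and the number of dimensions.
    closest = heapq.nsmallest(
        limit, records.items(), key=lambda item: cosine_distance(item[1], query)
    )
    return [record_id for record_id, _ in closest]


records = {
    "vec0": [0.1, 0.2, 0.3],
    "vec1": [0.7, 0.8, 0.9],
    "vec2": [-1.0, 0.0, 0.5],
}
nearest = brute_force_query(records, [0.1, 0.2, 0.3], limit=1)
```

Every call touches all of `records`, which is the work an index is meant to avoid.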

An index is a heuristic data structure that pre-computes distances among key points in the vector space. It is smaller and can be traversed more quickly than the whole collection, enabling __much__ more performant searching.

Only one index may exist per collection. An index optimizes a collection for searching according to a selected distance measure.

Available options distance measure are:
To create an index:

- cosine distance
- l2 distance
- max inner product
```python
docs.create_index()
```

You may optionally provide a distance measure and index method.

Available options for distance `measure` are:

- `vecs.IndexMeasure.cosine_distance`
- `vecs.IndexMeasure.l2_distance`
- `vecs.IndexMeasure.max_inner_product`

These correspond to different methods for comparing query vectors to the vectors in the database.
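
As a rough illustration of how the three measures can disagree, the sketch below ranks the same candidates under each one. The functions are plain-Python stand-ins written for this example, not vecs or pgvector code. The inner product is negated so that smaller always means closer, which mirrors pgvector's inner product operator; the candidate names and values are made up.

```python
import math


def l2_distance(a, b):
    """Euclidean (straight-line) distance between a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def negative_inner_product(a, b):
    """Negated dot product, so the largest inner product sorts first."""
    return -sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0]
candidates = {
    "short_aligned": [0.5, 0.0],  # same direction as the query, small magnitude
    "near_offaxis": [1.0, 0.3],   # close in space, slightly different direction
    "long_offaxis": [3.0, 3.0],   # far away, but large magnitude
}


def nearest(measure):
    """Return the id of the candidate that the given measure ranks closest."""
    return min(candidates, key=lambda name: measure(query, candidates[name]))


by_l2 = nearest(l2_distance)
by_cosine = nearest(cosine_distance)
by_inner_product = nearest(negative_inner_product)
```

Here each measure picks a different winner: L2 favours the spatially closest vector, cosine the one pointing the same way, and inner product the one with the largest projection onto the query.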

If you aren't sure which to use, the default of cosine_distance is the most widely compatible with off-the-shelf embedding methods.

Available options for index `method` are:

- `vecs.IndexMethod.auto`
- `vecs.IndexMethod.hnsw`
- `vecs.IndexMethod.ivfflat`

`auto` selects the best available index method, `hnsw` uses the [HNSW](https://github.com/pgvector/pgvector#hnsw) method, and `ivfflat` uses [IVFFlat](https://github.com/pgvector/pgvector#ivfflat).

When using IVFFlat indexes, the index must be created __after__ the collection has been populated with records. Building an IVFFlat index on an empty collection will result in significantly reduced recall. You can continue upserting new documents after the index has been created, but should rebuild the index if the size of the collection more than doubles since the last index operation.

HNSW indexes can be created immediately after the collection is created, without first populating it with records.
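
The IVFFlat rebuild guideline above can be tracked with a small check like the one below. This is a minimal sketch: `should_rebuild_ivfflat` is a hypothetical helper, not part of vecs, and it assumes the caller records the collection size each time the index is built. Treating an index built on an empty collection as immediately stale is also an assumption, drawn from the recall caveat above.

```python
def should_rebuild_ivfflat(size_at_last_index: int, current_size: int) -> bool:
    """Hypothetical helper: True once the collection has more than doubled."""
    if size_at_last_index <= 0:
        # Assumption: an index built on an empty collection should be
        # rebuilt as soon as any records exist.
        return current_size > 0
    return current_size > 2 * size_at_last_index


# Index was built at 1,000 records; the collection has since grown.
rebuild_at_2000 = should_rebuild_ivfflat(1000, 2000)  # exactly doubled
rebuild_at_2001 = should_rebuild_ivfflat(1000, 2001)  # more than doubled
```

When the check returns `True`, calling `docs.create_index(...)` again replaces the existing index, since only one index may exist per collection.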

To manually specify `method` and `measure`, pass them as arguments to `create_index`, for example:

```python
docs.create_index(
    method=vecs.IndexMethod.hnsw,
    measure=vecs.IndexMeasure.cosine_distance,
)
```

If you aren't sure which to use, stick with the default (cosine_distance) by omitting the parameter when creating indexes and querying.
!!! note
    The time required to create an index grows with the number of records and size of vectors.
    For a few thousand records, expect the index to build in under a minute. It may take a few
    minutes for larger collections.
