Proposal for vector db semantic convention #1231

Open
wants to merge 33 commits into
base: main
Changes from 3 commits
fa8ee30
Proposal for vector db semantic convention
ezimuel Jul 10, 2024
7068720
Merge + applied feedbacks #1231
ezimuel Jul 17, 2024
5e12a86
Removed allow_custom_values: true in db.yaml
ezimuel Jul 17, 2024
3b61784
Fixed merge
ezimuel Jul 18, 2024
828bacc
Merge branch 'main' into vector-db
ezimuel Jul 20, 2024
53d82d4
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Aug 5, 2024
a3330ff
Updated dimension_count and similarity_metric
ezimuel Aug 5, 2024
e5ff387
Merge remote-tracking branch 'origin/vector-db' into vector-db
ezimuel Aug 5, 2024
da6649b
Merge branch 'main' into vector-db
ezimuel Aug 7, 2024
d99ec10
Fix array attribute examples (#1325)
lmolkova Aug 8, 2024
61b0f2c
Add k8s.{pod,node}.cpu.{time,usage} metrics (#1320)
ChrsMark Aug 11, 2024
ceae2ca
Db metrics pending requests (#1290)
maryliag Aug 12, 2024
6db7ec5
Fix `process.args_count` attribute (#1331)
lmolkova Aug 12, 2024
e5e0d9d
Add k8s.volume.{name,type} attributes (#1251)
ChrsMark Aug 14, 2024
ae0e066
Add tests for rego policies (#1334)
MadVikingGod Aug 14, 2024
03b67bf
add `nodejs.eventloop.time` metric (#1259)
maryliag Aug 15, 2024
93d2cbe
chore: Remove support for the event `fields` referencing/inheriting d…
MSNev Aug 18, 2024
f411554
Attempt to optimise attribute name collision checks. (#1328)
jsuereth Aug 19, 2024
daa0a14
(chore) Add dependabot config to keep tooling up to date. (#1346)
jsuereth Aug 19, 2024
bc8a63c
Fix broken docker link (#1332)
ChrsMark Aug 19, 2024
a5f8661
Bump markdownlint-cli from 0.31.0 to 0.41.0 (#1349)
dependabot[bot] Aug 19, 2024
d996cd9
Bump go.opentelemetry.io/build-tools/chloggen from 0.12.0 to 0.14.0 i…
dependabot[bot] Aug 19, 2024
a10e75f
Bump gulp from 4.0.2 to 5.0.0 (#1348)
dependabot[bot] Aug 19, 2024
fd0f2e7
Fix link anchors (#1354)
lmolkova Aug 19, 2024
1c6bd00
chore: update ids (#1352)
maryliag Aug 20, 2024
9feb74d
Removed db.vector.id and added db.record.id, renamed db.vector.field_…
ezimuel Aug 20, 2024
2357766
Merge branch 'main' into vector-db
ezimuel Aug 20, 2024
81dca47
Merge from upstream/main
ezimuel Sep 25, 2024
ff03da1
Removed db.vector.model and moved db.vector.search.similarity_metric …
ezimuel Sep 25, 2024
523bcb9
Merge branch 'main' into vector-db
ezimuel Sep 30, 2024
fd891f6
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Nov 5, 2024
bc2ddb1
Merge remote-tracking branch 'upstream/main' into vector-db
ezimuel Nov 5, 2024
fc90f3f
Merge branch 'vector-db' of github.com:ezimuel/semantic-conventions i…
ezimuel Nov 5, 2024
33 changes: 33 additions & 0 deletions docs/attributes-registry/db.md
Expand Up @@ -6,12 +6,22 @@

# Db

<<<<<<< HEAD
- [Db](#db-attributes)
- [Db Cassandra](#db-cassandra-attributes)
- [Db Cosmosdb](#db-cosmosdb-attributes)
- [Db Deprecated](#db-deprecated-attributes)
- [Db Elasticsearch](#db-elasticsearch-attributes)
- [Db Metrics Deprecated](#db-metrics-deprecated-attributes)
- [Db Vector](#db-vector-attributes)
=======
- [General Database Attributes](#general-database-attributes)
- [Cassandra Attributes](#cassandra-attributes)
- [Azure Cosmos DB Attributes](#azure-cosmos-db-attributes)
- [Elasticsearch Attributes](#elasticsearch-attributes)
- [Deprecated Database Attributes](#deprecated-database-attributes)
- [Deprecated Database Metrics](#deprecated-database-metrics)
>>>>>>> upstream/main

## General Database Attributes

Expand Down Expand Up @@ -116,6 +126,7 @@ Even though parameterized query text can potentially have sensitive data, by usi
| `sybase` | Sybase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `teradata` | Teradata | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `trino` | Trino | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vector` | vector | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vertica` | Vertica | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

## Cassandra Attributes
Expand Down Expand Up @@ -244,3 +255,25 @@ This group defines attributes for Elasticsearch.
| ------ | ----------- | ---------------------------------------------------------------- |
| `idle` | idle | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `used` | used | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

## Db Vector Attributes

This group defines attributes for vector databases.

| Attribute | Type | Description | Examples | Stability |
| ---------------------- | -------- | ---------------------------------------------------- | -------------------------------------- | ---------------------------------------------------------------- |
| `db.vector.dimension` | int | The dimension of the vector. | `3` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.embeddings` | double[] | The values of the vector, the array of numbers. | `[0.9, 0.1, 0.1]` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.id` | string | The ID of vector. | `5c56c793-69f3-4fbf-87e6-c4bf54c28c26` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.model` | string | The model used for the embedding. | `text-embedding-3-small` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.name` | string | The name field as of the vector (e.g. a field name). | `vector` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db.vector.similarity` | string | The metric used in similarity search. | `cosine` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

`db.vector.similarity` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
| ----------- | ------------------------------ | ---------------------------------------------------------------- |
| `cosine` | The cosine metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `dot` | The dot product metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `euclidean` | The euclidean distance metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `manhattan` | The Manhattan distance metric. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
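
The four well-known values above name standard ways of comparing two vectors. As a minimal sketch (not part of the proposal), they could be computed as follows. The function names and the `SIMILARITY_METRICS` mapping are illustrative assumptions, and individual databases may define or normalize these metrics differently.

```python
import math
from typing import Callable, Dict, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    norm_product = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    if norm_product == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / norm_product


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Vector, b: Vector) -> float:
    """Manhattan (L1) distance: sum of absolute element-wise differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical lookup keyed by the well-known attribute values listed above.
SIMILARITY_METRICS: Dict[str, Callable[[Vector, Vector], float]] = {
    "cosine": cosine,
    "dot": dot,
    "euclidean": euclidean,
    "manhattan": manhattan,
}
```

Note that `cosine` and `dot` grow with similarity, while `euclidean` and `manhattan` are distances that shrink as vectors get closer.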
83 changes: 83 additions & 0 deletions docs/database/dynamodb.md
Expand Up @@ -9,7 +9,90 @@ linkTitle: AWS DynamoDB
The Semantic Conventions for [AWS DynamoDB](https://aws.amazon.com/dynamodb/) extend and override the general
[AWS SDK Semantic Conventions](/docs/cloud-providers/aws-sdk.md) and [Database Semantic Conventions](database-spans.md).

<<<<<<< HEAD
## Common Attributes

These attributes are filled in for all DynamoDB request types.

<!-- semconv dynamodb.all(full) -->
<!-- NOTE: THIS TEXT IS AUTOGENERATED. DO NOT EDIT BY HAND. -->
<!-- see templates/registry/markdown/snippet.md.j2 -->
<!-- prettier-ignore-start -->
<!-- markdownlint-capture -->
<!-- markdownlint-disable -->

| Attribute | Type | Description | Examples | [Requirement Level](https://opentelemetry.io/docs/specs/semconv/general/attribute-requirement-level/) | Stability |
|---|---|---|---|---|---|
| [`db.system`](/docs/attributes-registry/db.md) | string | The value `dynamodb`. [1] | `dynamodb` | `Required` | ![Experimental](https://img.shields.io/badge/-experimental-blue) |

**[1]:** The actual DBMS may differ from the one identified by the client. For example, when using PostgreSQL client libraries to connect to a CockroachDB, the `db.system` is set to `postgresql` based on the instrumentation's best knowledge.



`db.system` has the following list of well-known values. If one of them applies, then the respective value MUST be used; otherwise, a custom value MAY be used.

| Value | Description | Stability |
|---|---|---|
| `adabas` | Adabas (Adaptable Database System) | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `cassandra` | Apache Cassandra | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `clickhouse` | ClickHouse | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `cockroachdb` | CockroachDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `cosmosdb` | Microsoft Azure Cosmos DB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `couchbase` | Couchbase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `couchdb` | CouchDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `db2` | IBM Db2 | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `derby` | Apache Derby | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `dynamodb` | Amazon DynamoDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `edb` | EnterpriseDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `elasticsearch` | Elasticsearch | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `filemaker` | FileMaker | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `firebird` | Firebird | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `geode` | Apache Geode | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `h2` | H2 | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hanadb` | SAP HANA | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hbase` | Apache HBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hive` | Apache Hive | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `hsqldb` | HyperSQL DataBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `influxdb` | InfluxDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `informix` | Informix | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `ingres` | Ingres | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `instantdb` | InstantDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `interbase` | InterBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `intersystems_cache` | InterSystems Caché | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mariadb` | MariaDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `maxdb` | SAP MaxDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `memcached` | Memcached | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mongodb` | MongoDB | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mssql` | Microsoft SQL Server | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `mysql` | MySQL | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `neo4j` | Neo4j | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `netezza` | Netezza | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `opensearch` | OpenSearch | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `oracle` | Oracle Database | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `other_sql` | Some other SQL database. Fallback only. See notes. | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `pervasive` | Pervasive PSQL | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `pointbase` | PointBase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `postgresql` | PostgreSQL | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `progress` | Progress Database | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `redis` | Redis | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `redshift` | Amazon Redshift | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `spanner` | Cloud Spanner | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `sqlite` | SQLite | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `sybase` | Sybase | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `teradata` | Teradata | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `trino` | Trino | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vector` | vector | ![Experimental](https://img.shields.io/badge/-experimental-blue) |
| `vertica` | Vertica | ![Experimental](https://img.shields.io/badge/-experimental-blue) |



<!-- markdownlint-restore -->
<!-- prettier-ignore-end -->
<!-- END AUTOGENERATED TEXT -->
<!-- endsemconv -->
=======
`db.system` MUST be set to `"dynamodb"` and SHOULD be provided **at span creation time**.
>>>>>>> upstream/main

## DynamoDB.BatchGetItem

66 changes: 63 additions & 3 deletions model/registry/db.yaml
Expand Up @@ -92,7 +92,6 @@
For example, when using PostgreSQL client libraries to connect to a CockroachDB, the `db.system`
is set to `postgresql` based on the instrumentation's best knowledge.
type:
allow_custom_values: true
members:
- id: other_sql
value: 'other_sql'
Expand Down Expand Up @@ -319,7 +318,6 @@
- id: client.connection.state
stability: experimental
type:
allow_custom_values: true
members:
- id: idle
value: 'idle'
Expand Down Expand Up @@ -441,7 +439,6 @@
brief: Cosmos client connection mode.
- id: cosmosdb.operation_type
type:
allow_custom_values: true
members:
- id: invalid
value: 'Invalid'
Expand Down Expand Up @@ -533,3 +530,66 @@
reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json)
in order to map the path part values to their names.
examples: ['db.elasticsearch.path_parts.index=test-index', 'db.elasticsearch.path_parts.doc_id=123']
  - id: registry.db.vector
    prefix: db.vector
    type: attribute_group
    brief: >
      This group defines attributes for vector databases.
    attributes:
      - id: similarity
Contributor:

suggesting more explicit name like db.vector.search.similarity_metric

Contributor:

@ezimuel I think you also missed this change; it is marked as resolved without any updates :)

Author:

Fixed in a3330ff

Contributor:

BTW, who/how/when is going to populate it?

It seems it's only available at index creation time and not available at query or insert time. I.e. very rarely.

Author:

Correct, this is generally provided during index creation, but some databases also use it in the query (e.g. Qdrant).

I think we can also leverage the Conditionally Required level for some attributes that are not always available, like the db.vector.search.similarity_metric, WDYT?
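
A minimal sketch of what "Conditionally Required" could mean in practice: the instrumentation records the attribute only when the value is actually available for the operation. The helper name and its parameters are hypothetical, and the attribute names `db.vector.search.similarity_metric` and `db.vector.dimension_count` are the ones proposed in this thread, not finalized conventions.

```python
from typing import Dict, Optional, Union

AttributeValue = Union[str, int]


def vector_search_attributes(
    db_system: str,
    similarity_metric: Optional[str] = None,
    dimension_count: Optional[int] = None,
) -> Dict[str, AttributeValue]:
    """Build span attributes, skipping values the instrumentation cannot see."""
    attributes: Dict[str, AttributeValue] = {"db.system": db_system}
    # Only set when known, e.g. at index creation or when the query carries it.
    if similarity_metric is not None:
        attributes["db.vector.search.similarity_metric"] = similarity_metric
    if dimension_count is not None:
        attributes["db.vector.dimension_count"] = dimension_count
    return attributes
```

With this shape, a plain query span carries only `db.system`, while an index-creation span can also carry the metric and the dimension count.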

Author:

@lmolkova just a reminder for this, thanks.

Contributor:

Thanks, it makes sense!

BTW I think the attribute should be db.search.similarity_metric or similar - similarity is not always based on vectors and we don't need to limit this attribute.

        type:
          members:
            - id: cosine
              value: 'cosine'
              brief: >
                The cosine metric.
              stability: experimental
            - id: dot
              value: 'dot'
              brief: >
                The dot product metric.
              stability: experimental
            - id: euclidean
              value: 'euclidean'
              brief: >
                The euclidean distance metric.
              stability: experimental

Check failure on line 556 in model/registry/db.yaml (GitHub Actions / yamllint): [trailing-spaces] trailing spaces
            - id: manhattan
              value: 'manhattan'
              brief: >
                The Manhattan distance metric.
              stability: experimental
        stability: experimental
        brief: >
          The metric used in similarity search.
        examples: 'cosine'
      - id: id
        type: string
        stability: experimental
        brief: >
          The ID of vector.
        examples: '5c56c793-69f3-4fbf-87e6-c4bf54c28c26'
      - id: name
        type: string
        stability: experimental
        brief: >
          The name field as of the vector (e.g. a field name).
Contributor:

do we need both - the name and the id?

Author:

The id is the identifier of the vector; the name is the name of the field that contains the vector. Many databases use both, but some use only the field name. Maybe we should use better naming here.

Author:

@lmolkova what do you think about this? Thanks.

Contributor (lmolkova, Jul 26, 2024):

Could you provide some examples of id and name?

Let's pick a few (2-3) popular databases and explain what id and name would mean in their context.

E.g. I look into MongoDB and I don't understand what id or name would be.

Or I look into Azure Search and don't understand what should go into id.

I look into Pinecone and it does not talk about ids.

Also, perhaps by vector you mean index? Or, if it's about individual vectors, how would the index be represented?

Author:

@lmolkova Here are some examples:

  • MongoDB uses name for the field (i.e. path in search here and field-name in the definition here)
  • Azure Search uses name (here)
  • Pinecone uses id (here)
  • Qdrant uses id (Point ID, here)
  • Elasticsearch uses name (here)
  • Milvus uses id (here)
  • Chroma uses id (here)
  • pgvector for PostgreSQL uses name (i.e. the field name with type vector(x), here)
  • Redis uses name (i.e. the field name that will contain the vector values in fieldname_embedding, here)

By db.vector.name I mean the field name in the document that will be used for the embedding (most common for databases like MongoDB, Azure Search, Redis, Elasticsearch, and PostgreSQL), and by db.vector.id I mean the identifier of the vector (mostly used in native vector databases like Pinecone, Qdrant, Chroma, and Milvus).

Basically, these attributes, name and ID, are used to identify the vector. Maybe a better name for db.vector.name would be db.vector.field_name?

Contributor:

Thank you for the context!

name

I agree that db.vector.field_name would be more descriptive.
It could be difficult to collect, though: for Cosmos, Postgres, and many other databases it would require parsing queries (and specific parsing for vector search). I.e. it'd probably mean that very few generic DB instrumentations will do it.

Even the native instrumentation that we have in CosmosDB does not know whether a query supplied by the user is doing vector search.

So I wonder how critical this attribute is for observability purposes. Also, it'd be mostly static. Is it important enough to justify the additional cost of populating it on each span?

A typical semconv decision making hint is: if we're not sure we need it, let's not add it. Adding attributes to spans is easy, removing them is breaking and hard.

Contributor:

id

I still don't understand what's behind db.vector.id - it seems to be a generic record id and there is nothing vector-specific here.

I support adding generic attribute like db.record.id: string or db.record.ids: string[] (needs polishing)

Author:

The db.vector.id is the record id, but specific to the vector. The db.record.id will work as well, since it's just an identifier for the record (i.e. the vector).

I think we should add db.vector.field_name since it is used in many databases to specify which field has been used for the embedding.

        examples: 'vector'
      - id: dimension
        type: int
        stability: experimental
        brief: >
          The dimension of the vector.
Contributor:

is it the number of dimensions? let's call it something like db.vector.dimension_count and also update the brief

Contributor:

@ezimuel you marked this as resolved, but it looks like you missed this change; it still shows dimension instead of dimension_count

Author:

Fixed in a3330ff

        examples: [3]
      - id: model
Contributor:

this should be captured with gen_ai.request.model (and other gen-ai attributes) - it's ok to mix different attributes on the same telemetry item.

Author:

This is the model for the embedding and can be different from the gen_ai.request.model. Vector databases are not strictly related to GenAI. There are many use cases where a vector db can be used without GenAI (e.g. semantic search).

Contributor:

could you provide some examples of databases that can compute embeddings?

How would database instrumentation know which model was used to create embeddings if it only stores and queries them?

Author:

Some vendors that provide in-database embedding:

I think many other vendors will add this feature. Being able to generate embeddings in the database simplifies customer use cases.

Regarding the question of how database instrumentation knows which model was used to create an embedding: this information is specified when you create a collection or when you run a search (e.g. in PostgresML). Moreover, I think the instrumentation libraries can also leverage this model attribute since they know the embedding model used to generate the vectors.

Contributor:

It seems all of them use some GenAI model (in-process or external) and provide an integration layer with it.

I'm not sure what benefit defining a new attribute for databases brings. If embeddings become a cross-domain concern, let's find a generic attribute name for the vectorization model that will be reused between DB and GenAI. For now I strongly recommend not adding a new attribute.

Author:

Ok, I agree that we can wait to see how this topic progresses.

        type: string
        stability: experimental
        brief: >
          The model used for the embedding.
        examples: 'text-embedding-3-small'
      - id: query.top_k
Contributor (lmolkova, Jul 26, 2024):

I think we should come up with a more common attribute not specific to vector dbs.
Many databases allow limiting the number of returned rows:

Suggesting db.query.max_returned_items. The actual returned count could be even better - db.query.item_count could mean items inserted or returned depending on the operation.

Author:

@lmolkova I see the similarity here, but I think db.vector.query.top-k is more specific from a semantic point of view and more closely related to vectors, since it specifies the top k results in order, starting from the most similar. In semantic search, every result carries a similarity value, which a standard database does not have. The limit parameter of SQL returns the first k results, but not necessarily in order; that depends on how you build the query (e.g. using ORDER BY).
I personally think we should keep db.vector.query.top-k and potentially add a db.query.limit (or db.query.max_returned_items, as you suggested) in a separate PR.

Contributor:

Out of the few dbs I checked, they use limit in vector search

So we're saying that DB instrumentations will need to detect if query is related to vector search or not and depending on this populate top-k or limit. That's difficult or impossible, but most importantly inconsistent and depends on instrumentation capabilities.

I.e. instrumentations that don't have vector-db specifics and those that do will use different attributes for the same thing.

So, I'd still prefer db.query.limit or something similar (and it should be under the same condition as db.query.text - we cannot require instrumentations to do query parsing)

Author:

@lmolkova do you agree that top-k and limit are two different concepts, based on my previous comment? If they are I think we cannot use a single attribute (e.g. db.query.limit) to manage both.

Author:

@lmolkova just a reminder for this, thanks.

Contributor:

@ezimuel I see that databases use both terms to describe the same thing (see my comment above).

Let's say you have a Postgres query like SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5; - the general-purpose DB instrumentation can report limit. If you make it understand vector search syntax, it may be able to use top_k instead, but that would be inconsistent and unfamiliar for those who use vector search in Postgres.
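
To make the Postgres example concrete, here is a deliberately naive sketch of how a general-purpose instrumentation could read the limit from query text without any knowledge of vector search. The function name is hypothetical, `db.query.limit` is only the attribute name floated in this thread, and a regular expression is an assumption for illustration: it does not handle string literals, comments, or parameterized limits, which is part of why query parsing cannot be required.

```python
import re
from typing import Optional

# Matches "LIMIT <integer>" case-insensitively; not a real SQL parser.
_LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)\b", re.IGNORECASE)


def extract_limit(query_text: str) -> Optional[int]:
    """Return the last literal LIMIT value in the query text, if any."""
    matches = _LIMIT_RE.findall(query_text)
    # The outermost LIMIT is usually the last one written in the statement.
    return int(matches[-1]) if matches else None


query = "SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;"
limit = extract_limit(query)
# Hypothetical attribute name, set only when a limit was found.
attributes = {"db.query.limit": limit} if limit is not None else {}
```

The same code path works whether or not the `ORDER BY` clause uses a vector distance operator, which is the consistency argument made in the comment above.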

Author:

@lmolkova I see your point, but I think top-k has a different meaning from limit. If you are using a relational database as a vector db, limit is fine since you are building an SQL statement and you specify the order. But if you are using a native vector database (e.g. Qdrant), top-k is more relevant since "top" implies an order based on a similarity metric.

I think we should add both:

  • db.query.limit
  • db.vector.query.top-k
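
The distinction being argued here can be sketched in a few lines: top-k ranks records by similarity to a query vector before truncating, while a bare limit truncates in whatever order the records arrive. This is an illustrative sketch only; the function names, the in-memory record list, and the choice of cosine similarity are assumptions, not how any particular database implements search.

```python
import math
from typing import List, Sequence, Tuple

Record = Tuple[str, Sequence[float]]  # (record id, vector)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], records: Sequence[Record], k: int) -> List[Tuple[str, float]]:
    """Rank every record by similarity to the query, then keep the best k."""
    scored = [(record_id, cosine_similarity(query, vector)) for record_id, vector in records]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


def limit(records: Sequence[Record], k: int) -> List[Record]:
    """Keep the first k records in storage order, with no ranking."""
    return list(records[:k])


records = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
ranked_ids = [record_id for record_id, _ in top_k([1.0, 0.0], records, 2)]
limited_ids = [record_id for record_id, _ in limit(records, 2)]
```

In SQL the two coincide once the query orders by a distance expression before applying `LIMIT`, which is the case the reviewer's Postgres example describes.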

        type: int
        stability: experimental
        brief: >
          The top-k most similar vectors returned by a query.
        examples: [5]

Check failure on line 595 in model/registry/db.yaml (GitHub Actions / yamllint): [new-line-at-end-of-file] no new line character at the end of file