Proposal for vector db semantic convention #1231
model/registry/db.yaml
Outdated
brief: >
The dimension of the vector.
examples: [3]
- id: model
This should be captured with `gen_ai.request.model` (and other gen-ai attributes). It's OK to mix different attributes on the same telemetry item.
This is the `model` for the embedding, and it can differ from `gen_ai.request.model`. Vector databases are not strictly tied to GenAI: there are many use cases where a vector db can be used without GenAI (e.g. semantic search).
could you provide some examples of databases that can compute embeddings?
How would database instrumentation know which model was used to create embeddings if it only stores and queries them?
Some vendors that provide in-database embedding:
- Weaviate, which supports several provider integrations;
- Elasticsearch, using the Inference API;
- PostgresML, which provides in-database embedding generation;
- Vespa, which provides embedding through the embed() function;
- Supabase, which provides an embedding generator using edge functions.
I think many other vendors will add this feature. Being able to generate embeddings in the database simplifies customer use cases.
As for how database instrumentation knows which model was used to create an embedding: this information is specified when you create a collection or when you run a search (e.g. in PostgresML). Moreover, I think instrumentation libraries can also leverage this `model` attribute, since they know the embedding model used to generate the vectors.
It seems all of them use some GenAI model (in-process or external) and provide an integration layer with it.
I'm not sure what benefit defining a new attribute for databases brings. If embeddings become a cross-domain concern, let's find a generic attribute name for the vectorization model that can be reused between DB and GenAI. For now, I strongly recommend not adding a new attribute.
Ok, I agree that we can wait to see how this topic progresses.
model/registry/db.yaml
Outdated
type: int
stability: experimental
brief: >
The dimension of the vector.
Is it the number of dimensions? Let's call it something like `db.vector.dimension_count` and also update the brief.
@ezimuel you marked this as resolved, but it looks like you missed this change: it still shows `dimension` instead of `dimension_count`.
Fixed in a3330ff
model/registry/db.yaml
Outdated
type: string
stability: experimental
brief: >
The name field as of the vector (e.g. a field name).
do we need both - the name and the id?
The `id` is the identifier of the vector; the `name` is the name of the field that contains the vector. Many databases use both, but some use only the `name` of the field. Maybe we should use better naming here.
@lmolkova what do you think about this? Thanks.
Could you provide some examples of id and name?
Let's pick a few (2-3) popular databases and explain what id and name would mean in their context.
E.g. I look into MongoDB and I don't understand what id or name would be.
Or I look into Azure Search and don't understand what should go into id.
I look into pinecone and it does not talk about ids.
Also, perhaps by vector you mean index? Or if it's about individual vectors, then how the index would be represented?
@lmolkova Here are some examples:
- MongoDB uses name for the field (i.e. `path` in search here and `field-name` in the definition here)
- Azure Search uses name (here)
- Pinecone uses id (here)
- Qdrant uses id (Point ID, here)
- Elasticsearch uses name (here)
- Milvus uses id (here)
- Chroma uses id (here)
- pgvector for PostgreSQL uses name (i.e. the field name with type vector(x), here)
- Redis uses name (i.e. the field name that will contain the vector values in `fieldname_embedding`, here)
By `db.vector.name` I mean the field name in the document that will be used for embedding (most common for DBs like MongoDB, Azure Search, Redis, Elasticsearch, PostgreSQL) and by `db.vector.id` I mean the identifier of the vector (mostly used in native vector dbs like Pinecone, Qdrant, Chroma, Milvus).
Basically these attributes, name and ID, are used to identify the vector. Maybe a better name for `db.vector.name` would be `db.vector.field_name`?
Thank you for the context!
name
I agree that `db.vector.field_name` would be more descriptive.
It could be difficult to collect, though: for cosmos, postgres and many other dbs it would require parsing queries (and specific parsing for vector search). I.e. it'd probably mean that very few generic DB instrumentations will do it.
Even the native instrumentation that we have in CosmosDB does not know whether a query supplied by the user is doing vector search.
So I wonder how critical this attribute is for observability purposes. Also, it'd be mostly static: is it important enough to justify the additional cost of populating it on each span?
A typical semconv decision-making hint is: if we're not sure we need it, let's not add it. Adding attributes to spans is easy, removing them is breaking and hard.
id
I still don't understand what's behind `db.vector.id`: it seems to be a generic record id and there is nothing vector-specific here.
I support adding a generic attribute like `db.record.id: string` or `db.record.ids: string[]` (needs polishing).
The `db.vector.id` is the record id, but specific to the vector. `db.record.id` will work as well, since it's just an identifier for the record (i.e. the vector).
I think we should add `db.vector.field_name`, since many databases use it to specify which field holds the embedding.
model/registry/db.yaml
Outdated
brief: >
This group defines attributes for vector databases.
attributes:
- id: similarity
suggesting more explicit name like db.vector.search.similarity_metric
@ezimuel I think you also missed this change; it is marked as resolved without any updates :)
Fixed in a3330ff
BTW, who/how/when is going to populate it?
It seems it's only available at index creation time and not available at query or insert time. I.e. very rarely.
Correct, this is generally provided during index creation, but some databases also use it in the query (e.g. Qdrant).
I think we can also leverage the Conditionally Required level for attributes that are not always available, like `db.vector.search.similarity_metric`, WDYT?
@lmolkova just a reminder for this, thanks.
Thanks, it makes sense!
BTW I think the attribute should be `db.search.similarity_metric` or similar: similarity is not always based on vectors and we don't need to limit this attribute.
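To illustrate the Conditionally Required idea from this thread, here is a minimal sketch in plain Python, with no OpenTelemetry SDK involved. The helper name `build_span_attributes` and the metric value `"cosine"` are assumptions for illustration only; the attribute names are the ones proposed in this thread and may not match what is finally merged.

```python
from typing import Dict, Optional


def build_span_attributes(
    operation_name: str,
    similarity_metric: Optional[str] = None,
) -> Dict[str, str]:
    """Build span attributes, adding the similarity metric only when known.

    `build_span_attributes` is a hypothetical helper, not an OpenTelemetry API.
    """
    attributes = {"db.operation.name": operation_name}
    # The metric is usually known at index creation time and only sometimes
    # at query time, so it is recorded only when the caller can supply it.
    if similarity_metric is not None:
        attributes["db.search.similarity_metric"] = similarity_metric
    return attributes


# Index creation: the metric is known.
create_attrs = build_span_attributes("create_index", similarity_metric="cosine")
# Plain insert: the metric is not available, so the attribute is omitted.
insert_attrs = build_span_attributes("insert")
```

The point of the sketch is only that an unavailable attribute is left out rather than filled with a placeholder value.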
Thanks for creating this PR. A few additional attributes which we instrument today with our SDK that we have found useful are the following:
Thoughts on the ones listed above? cc @lmolkova
@karthikscale3 regarding the attributes that you proposed, some already exist:
The
Moreover, I found the OpenLLMetry project's proposal very interesting, especially the part about attributes for vector dbs, here:
@lmolkova I applied all the feedback, thanks for the review. @karthikscale3 I added the Summary of the changes:
Yea that sounds good! And yes, my intention was to reuse the existing ones. Wasn't sure if we needed them redefined for the sake of vector dbs or not. But it sounds like it's unnecessary.
Thank you! From my side, everything looks good. We discussed this PR in today's working group call and @nirga wanted to take a deeper look at it once again.
I fixed the merge issues. Thanks @trask
Thanks, that's a great start! I wonder if we want to add specific spans that use these attributes in this PR as well?
@nirga can you give me an example of a specific span? FYI, I'm going offline and I'll come back August 4 for further discussion.
Sorry, nvm I think this is already covered as part of the DB semconv
docs/attributes-registry/db.md
Outdated
@@ -199,6 +200,28 @@ This group defines attributes for Elasticsearch.

**[8]:** Many Elasticsearch url paths allow dynamic values. These SHOULD be recorded in span attributes in the format `db.elasticsearch.path_parts.<key>`, where `<key>` is the url path part name. The implementation SHOULD reference the [elasticsearch schema](https://raw.githubusercontent.com/elastic/elasticsearch-specification/main/output/schema/schema.json) in order to map the path part values to their names.

## Db Vector Attributes |
nit: Vector Database Attributes
@ezimuel looks like you missed some of the changes you marked as resolved:
model/registry/db.yaml
Outdated
brief: >
The model used for the embedding.
examples: 'text-embedding-3-small'
- id: query.top_k
I think we should come up with a more common attribute that is not specific to vector dbs.
Many databases allow limiting the number of returned rows:
- JDBC has Statement.setMaxRows,
- Mongo allows setting a limit
Suggesting `db.query.max_returned_items`. The actual returned count could be even better: `db.query.item_count` could mean items inserted or returned depending on the operation.
@lmolkova I see the similarity here, but I think `db.vector.query.top-k` is more specific from a semantic point of view and more related to vectors, since it specifies the top k results in order, starting from the most similar. In semantic search every result carries a similarity value, which a standard database does not have. The `limit` parameter of SQL returns the first k results, but not necessarily in order; that depends on how you build the query (e.g. using `ORDER BY`).
I personally think we should keep `db.vector.query.top-k` and potentially add a `db.query.limit` (or `db.query.max_returned_items` as you suggested) in a separate PR.
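To make the distinction argued above concrete, here is a small self-contained sketch in plain Python. It does not use any particular database's API; the function names `top_k` and `limit` and the sample records are assumptions for illustration only.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query: Sequence[float], records: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the ids of the k records most similar to the query, best first."""
    ranked = sorted(
        records,
        key=lambda record_id: cosine_similarity(query, records[record_id]),
        reverse=True,
    )
    return ranked[:k]


def limit(records: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the first k record ids in storage order, with no ranking."""
    return list(records)[:k]


# Hypothetical stored vectors, keyed by record id.
records = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
query = [1.0, 0.0]

most_similar = top_k(query, records, 2)  # ranked by similarity to the query
first_rows = limit(records, 2)           # whatever comes first in storage order
```

With this data the two calls select different records, which is the semantic difference being claimed. Whether that difference justifies two separate attributes is the open question in this thread.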
Of the few dbs I checked, all use `limit` in vector search:
- pgvector uses the traditional `limit`, same with cosmos and other sql databases
- mongo uses limit
- qdrant uses limit
So we're saying that DB instrumentations will need to detect whether a query is related to vector search and, depending on that, populate top-k or limit. That's difficult or impossible, but most importantly it's inconsistent and depends on instrumentation capabilities.
I.e. instrumentations that don't have vector-db specifics and those that do will use different attributes for the same thing.
So, I'd still prefer `db.query.limit` or something similar (and it should be under the same condition as `db.query.text`: we cannot require instrumentations to do query parsing).
@lmolkova do you agree that `top-k` and `limit` are two different concepts, based on my previous comment? If they are, I think we cannot use a single attribute (e.g. `db.query.limit`) to manage both.
@lmolkova just a reminder for this, thanks.
@ezimuel I see that databases use both terms to describe the same thing (see my comment above).
Let's say you have a postgres query like `SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;`. The general-purpose DB instrumentation can report `limit`. If you make it understand vector search syntax, it may be able to use `top_k` instead, but that would be inconsistent and unfamiliar for those who use vector search in postgres.
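As a rough illustration of what reporting `limit` from query text could involve, here is a deliberately naive sketch. The helper `extract_limit` is hypothetical, and the regex is an assumption rather than anything an existing instrumentation is known to do. It would be fooled by `LIMIT` inside string literals or comments, and it cannot see parameterized limits such as `LIMIT $1`, which is part of why query parsing cannot be required.

```python
import re
from typing import Optional

# Matches a literal integer LIMIT clause, case-insensitively.
_LIMIT_PATTERN = re.compile(r"\bLIMIT\s+(\d+)\b", re.IGNORECASE)


def extract_limit(query_text: str) -> Optional[int]:
    """Best-effort extraction of a literal LIMIT value from SQL text.

    Returns None when no literal LIMIT is found (for example, when the
    limit is passed as a bind parameter).
    """
    matches = _LIMIT_PATTERN.findall(query_text)
    # If several LIMIT clauses appear (e.g. in subqueries), take the last one,
    # which is usually the outermost clause in simple queries.
    return int(matches[-1]) if matches else None


vector_query = "SELECT * FROM items ORDER BY embedding <-> '[3,1,2]' LIMIT 5;"
reported_limit = extract_limit(vector_query)
```

Note that nothing in this extraction depends on whether the query is a vector search, so a generic instrumentation would report the same value for a plain `ORDER BY` query.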
@lmolkova I see your point, but I think `top-k` has a different meaning from `limit`. If you are using a relational database as a vector db, `limit` is fine, since you are building an SQL statement and you specify the order. But if you are using a native vector database (e.g. Qdrant), `top-k` is more relevant, since `top` implies an order based on a similarity metric.
I think we should add both:
- `db.query.limit`
- `db.vector.query.top-k`
We need to reference new attributes in the database spans conventions (see https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml), specifically on the conventions for the databases we have there which support vector search. We should describe how new attributes apply to them.
Co-authored-by: Liudmila Molkova <limolkova@microsoft.com>
@lmolkova I provided the following changes:
I think the only missing point is about
@ezimuel Can you please update this PR with the new semconv folder structure? Thanks
Thanks for working on this @ezimuel! Please make sure to update the actual database semantic conventions and reference the attributes you're adding under those that report them - #1231 (comment)
…in db.search.similarity_metric
@AlexanderWert I updated the PR and applied the suggestions from @lmolkova. I made the following changes:
The only open question is about
@lmolkova I didn't understand what I was supposed to do in this comment, since the link https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml does not work. Can you clarify? Thanks.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
Closed as inactive. Feel free to reopen if this PR is still being worked on.
@ezimuel I'm sorry I did not reply earlier. I'm swamped with some work at the moment and, unfortunately, it might take me some time to reply. If you look into https://github.com/open-telemetry/semantic-conventions/tree/main/docs/database you'll see that we have some docs for individual database systems: we reference attributes from the registry and explain how these attributes apply to each system (or don't apply). This is powered by the yaml in https://github.com/open-telemetry/semantic-conventions/tree/main/model/database. Please look at the existing database conventions and update them to include the new attributes you're adding.
@lmolkova thanks for the information, and sorry on my side too for the late reply, very busy period.
This is a proposal for vector db semantic convention (see #936). I tried to expand the `db` semantic convention by adding some `db.vector` attributes. I tried to focus on the basic needs of a general purpose vector database.
I proposed the following experimental attributes (updated with the feedback in this PR):
The operations performed in a vector db, such as `insert`, `update`, `search` and `delete`, can be recorded using the existing db.operation.name attribute.
Regarding the similarity search, we can use the db.query attributes, such as `db.query.parameter.<key>`.
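As a rough sketch of how an instrumentation might assemble these attributes for a similarity search, consider the plain-Python example below. The helper `vector_search_attributes` is hypothetical, `db.vector.dimension_count` is the name suggested during review and may differ from the final registry, and stringifying parameter values is an assumption of this sketch.

```python
from typing import Dict, Mapping, Sequence, Union

AttributeValue = Union[str, int]


def vector_search_attributes(
    operation_name: str,
    query_vector: Sequence[float],
    parameters: Mapping[str, object],
) -> Dict[str, AttributeValue]:
    """Assemble span attributes for a vector db operation (illustrative only)."""
    attributes: Dict[str, AttributeValue] = {
        # Existing attribute reused for insert/update/search/delete.
        "db.operation.name": operation_name,
        # Name proposed in this PR's review; treat it as provisional.
        "db.vector.dimension_count": len(query_vector),
    }
    # Each query parameter becomes its own db.query.parameter.<key> attribute.
    for key, value in parameters.items():
        attributes[f"db.query.parameter.{key}"] = str(value)
    return attributes


search_attributes = vector_search_attributes(
    "search",
    query_vector=[0.1, 0.2, 0.3],
    parameters={"limit": 5},
)
```

In a real instrumentation these key/value pairs would be set on the database client span; the dictionary here only shows the intended shape.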