Support similarity scores in Document API #1794

Open
wants to merge 1 commit into
base: main

ThomasVitale
Contributor

Document

  • Introduced a “score” attribute in the Document API. It stores the similarity score.
  • Consolidated the “distance” metadata for Documents. It stores the distance measurement.
  • Adopted a prefix-less naming convention in Document.Builder and deprecated the old methods.
  • Deprecated the many overloaded Document constructors in favour of Document.Builder.
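
For readers skimming the description, the shape of these Document changes can be sketched roughly as below. This is only an illustration under stated assumptions: `SketchDocument`, its fields, and its builder methods are hypothetical stand-ins, not the actual classes or signatures in this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a document type; NOT the real Document class.
final class SketchDocument {

    private final String id;
    private final String content;
    private final Map<String, Object> metadata;
    // Similarity score; null when the document did not come from a search.
    private final Double score;

    private SketchDocument(Builder builder) {
        this.id = builder.id;
        this.content = builder.content;
        this.metadata = new HashMap<>(builder.metadata);
        this.score = builder.score;
    }

    static Builder builder() {
        return new Builder();
    }

    String getId() { return this.id; }
    String getContent() { return this.content; }
    Map<String, Object> getMetadata() { return this.metadata; }
    Double getScore() { return this.score; }

    // Prefix-less builder methods: id(...) rather than withId(...).
    static final class Builder {
        private String id;
        private String content;
        private final Map<String, Object> metadata = new HashMap<>();
        private Double score;

        Builder id(String id) { this.id = id; return this; }
        Builder content(String content) { this.content = content; return this; }
        Builder metadata(String key, Object value) { this.metadata.put(key, value); return this; }
        Builder score(Double score) { this.score = score; return this; }

        SketchDocument build() { return new SketchDocument(this); }
    }
}
```

A caller would then write something like `SketchDocument.builder().id("1").content("hello").score(0.75).build()` instead of picking one of several overloaded constructors.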

Vector Stores

  • Every vector store implementation now sets a “score” attribute to the similarity score of the Document embedding. It also includes the “distance” metadata with the distance measurement.
  • Fixed an error in Elasticsearch where distance and similarity were mixed up.
  • Added missing integration tests for SimpleVectorStore.
  • The Azure Vector Store and HanaDB Vector Store do not include those measurements because the product documentation does not explain how the similarity score is returned and, without access to the cloud products, I could not verify it by debugging.
  • Improved tests to actually assert the result of the similarity search based on the returned score.

* @author Thomas Vitale
* @since 1.0.0
*/
public enum DocumentMetadata {
Contributor Author

The idea for this enum is to use it for other common metadata used in Documents, such as the "source file" or "page" when using a DocumentReader, helping the RAG flow traceability.
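
To make the idea concrete, here is a rough sketch of how such an enum could hold shared metadata keys. Only the "distance" key appears in this PR; the `SOURCE` and `PAGE` constants, the `SketchMetadataKey` name, and the helper class are hypothetical and shown purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical enum of common metadata keys; only "distance" comes from this PR.
enum SketchMetadataKey {

    DISTANCE("distance"),
    SOURCE("source"), // assumed key name, for illustration only
    PAGE("page");     // assumed key name, for illustration only

    private final String value;

    SketchMetadataKey(String value) {
        this.value = value;
    }

    String value() {
        return this.value;
    }
}

// Hypothetical helper showing how a reader could tag its documents.
final class MetadataKeyDemo {

    static Map<String, Object> describe(String sourceFile, int page) {
        Map<String, Object> metadata = new HashMap<>();
        metadata.put(SketchMetadataKey.SOURCE.value(), sourceFile);
        metadata.put(SketchMetadataKey.PAGE.value(), page);
        return metadata;
    }
}
```

Using one shared set of keys would let downstream code look up the source file or page the same way regardless of which reader produced the Document.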

* The lower the distance, the more they are similar.
* It's the opposite of the similarity score.
*/
DISTANCE("distance");
Contributor Author

I kept this metadata for backward compatibility, but we might consider removing it completely since we now have the "score" field in each Document (and "distance" is always the opposite value of "score").
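
As a sketch of what "opposite" could mean here, assuming cosine similarity and the common definition of cosine distance as one minus the similarity (other metrics or stores may normalize differently), the two values can be derived from each other. The class and method names are hypothetical.

```java
// Hypothetical helper; assumes distance = 1 - score, as with cosine distance.
final class ScoreDistanceSketch {

    private ScoreDistanceSketch() {
    }

    // Converts a similarity score into a distance under the stated assumption.
    static double distanceFromScore(double score) {
        return 1.0 - score;
    }

    // Converts a distance back into a similarity score under the same assumption.
    static double scoreFromDistance(double distance) {
        return 1.0 - distance;
    }
}
```

If this relationship holds for every store, keeping both values is redundant, which is the argument for eventually dropping the metadata entry.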

.filter(s -> s.score >= request.getSimilarityThreshold())
.sorted(Comparator.<Similarity>comparingDouble(s -> s.score).reversed())
.peek(document -> document
.setScore(EmbeddingMath.cosineSimilarity(userQueryEmbedding, document.getEmbedding())))
Contributor Author

@ThomasVitale Nov 21, 2024

If we remove the "embedding" field (see: #1781), the SimpleVectorStore will not work
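
The dependency is easier to see in a simplified, self-contained sketch of an in-memory search: the score can only be computed if each stored entry still carries its embedding. Everything below (class names, fields, the `search` signature) is hypothetical and is not the SimpleVectorStore code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory search sketch; not the actual SimpleVectorStore.
final class InMemorySearchSketch {

    static final class Entry {
        final String id;
        final float[] embedding; // required to compute the score at query time

        Entry(String id, float[] embedding) {
            this.id = id;
            this.embedding = embedding;
        }
    }

    static final class Scored {
        final String id;
        final double score;

        Scored(String id, double score) {
            this.id = id;
            this.score = score;
        }
    }

    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Vectors must not be all zeros");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Scores every entry, drops those below the threshold, returns the best topK.
    static List<Scored> search(float[] query, List<Entry> entries, double threshold, int topK) {
        List<Scored> scored = new ArrayList<>();
        for (Entry entry : entries) {
            double score = cosineSimilarity(query, entry.embedding);
            if (score >= threshold) {
                scored.add(new Scored(entry.id, score));
            }
        }
        scored.sort(Comparator.comparingDouble((Scored s) -> s.score).reversed());
        return new ArrayList<>(scored.subList(0, Math.min(topK, scored.size())));
    }
}
```

Without the `embedding` field on each entry, the `cosineSimilarity` call has nothing to compare the query against, so the store would need to keep the vectors somewhere else.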

// It must always be "latest" or else Azure locks the image after a while. See:
// https://github.com/Azure/azure-cosmos-db-emulator-docker/issues/60
public static final DockerImageName DEFAULT_IMAGE = DockerImageName
.parse("mcr.microsoft.com/cosmosdb/linux/azure-cosmos-emulator:latest");
Contributor Author

I hope we'll be able to have Testcontainers-based integration tests for CosmosDB in the future. For now, this image ships with the vector store-specific features disabled and there's no way to enable them, so it cannot be used.

@@ -298,6 +303,8 @@ public static final class PineconeVectorStoreConfig {

private final String contentFieldName;

// TODO: Why is this field configurable? Can we remove this after standardizing
Contributor Author

Would it be ok to remove this and keep the standard "distance" metadata? Having this configurable means we cannot use the metadata reliably across implementations.

Map<String, Object> metadata = this.config.metadataFields.stream()
.map(MetadataField::name)
.filter(doc::hasProperty)
.collect(Collectors.toMap(Function.identity(), doc::getString));
// TODO: this seems wrong. The key is named "vector_store", but the value is the
Contributor Author

Would it be ok to remove this and keep the standard "distance" metadata?

@ThomasVitale
Contributor Author

PR updated after #1822 was merged

Signed-off-by: Thomas Vitale <ThomasVitale@users.noreply.github.com>