remove the vectordb IP address fix for podman (seems to be fixed on podman side now)
sib-swiss · Sep 17, 2024 · 6b27545
1 parent 24901a8
commit 6b27545

Showing 6 changed files with 24 additions and 12 deletions.
diff --git a/README.md b/README.md
@@ -6,17 +6,19 @@
 
 </div>
 
-Reusable components and complete webapp to improve Large Language Models (LLMs) capabilities when generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema.
+Reusable components and complete web service to improve Large Language Models (LLMs) capabilities when generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoints schemas.
 
-The different components of the system can be used separately, or the whole chat system webapp can be deployed for a set of endpoints. It relies on the endpoint containing some descriptive metadata: [SPARQL query examples](https://github.com/sib-swiss/sparql-examples), and endpoint description using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can generated automatically using the [void-generator](https://github.com/JervenBolleman/void-generator).
+The different components of the system can be used separately, or the whole chat system can be deployed for a set of endpoints. It relies on the endpoint containing some descriptive metadata: [SPARQL query examples](https://github.com/sib-swiss/sparql-examples), and endpoint description using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can generated automatically using the [void-generator](https://github.com/JervenBolleman/void-generator).
 
 This repository contains:
 
 * Functions to extract and load relevant metadata from a SPARQL endpoints. Loaders are compatible with [LangChain](https://python.langchain.com), but they can also be used outside of LangChain as they just return a list of documents with metadata as JSON, which can then be loaded how you want in your vectorstore.
-* Function to automatically parse and validate SPARQL queries based on a endpoint VoID description.
+* Function to automatically parse and validate SPARQL queries based on an endpoint VoID description.
 * A complete reusable system to deploy a LLM chat system with web UI, API and vector database, designed to help users to write SPARQL queries for a given set of endpoints by exploiting metadata uploaded to the endpoints (WIP).
 * The deployment configuration for **[chat.expasy.org](https://chat.expasy.org)** the LLM chat system to help users accessing the endpoints maintained at the [SIB](https://www.sib.swiss/).
 
+> [!TIP]
+>
 > You can quickly check if an endpoint contains the expected metadata at [sib-swiss.github.io/sparql-editor/check](https://sib-swiss.github.io/sparql-editor/check)
 
 ## 🪄 Reusable components
@@ -61,6 +63,17 @@ print(len(docs))
 print(docs[0].metadata)
 ```
 
+> The generated shapes are well-suited for use with a LLM or a human, as they provide clear information about which predicates are available for a class, and the corresponding classes or datatypes those predicates point to. Each object property references a list of classes rather than another shape, making each shape self-contained and interpretable on its own, e.g. for a *Disease Annotation* in UniProt:
+>
+> ```turtle
+> up:Disease_Annotation {
+>   a [ up:Disease_Annotation ] ;
+>   up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
+>   rdfs:comment xsd:string ;
+>   up:disease IRI
+> }
+> ```
+
 ### Generate complete ShEx shapes from VoID description
 
 You can also generate the complete ShEx shapes for a SPARQL endpoint with:
@@ -142,7 +155,7 @@ issues = validate_sparql_with_void(sparql_query, "https://sparql.uniprot.org/spa
 print("\n".join(issues))
 ```
 
-## 🚀 Complete chat system 
+## 🚀 Complete chat system
 
 > [!WARNING]
 >

diff --git a/compose.dev.yml b/compose.dev.yml
@@ -14,8 +14,6 @@ services:
       service: api
     ports:
       - 8000:80
-    environment:
-      - VECTORDB_HOST=vectordb
     volumes:
       - ./src:/app/src
       - ./prestart.sh:/app/prestart.sh

diff --git a/src/sparql_llm/api.py b/src/sparql_llm/api.py
@@ -338,7 +338,7 @@ def chat_ui(request: Request) -> Any:
             "llm_model": llm_model,
             "description": """Assistant to navigate resources from the Swiss Institute of Bioinformatics. Particularly knowledgeable about UniProt, OMA, Bgee, RheaDB, and SwissLipids. But still learning.
 
-Contact kru@sib.swiss if you have any feedback or suggestions.
+Contact kru@sib.swiss if you have any feedback or suggestions. Questions asked here are stored for research purposes, see the [SIB privacy policy](https://www.sib.swiss/privacy-policy) for more information.
 """,
             "short_description": "Ask about SIB resources.",
             "repository_url": "https://github.com/sib-swiss/sparql-llm",

diff --git a/src/sparql_llm/config.py b/src/sparql_llm/config.py
@@ -34,8 +34,10 @@ class Settings(BaseSettings):
     ontology_chunk_size: int = 3000
     ontology_chunk_overlap: int = 200
 
-    # NOTE: Default is the IP address inside the podman network to solve a ridiculous bug from podman
-    vectordb_host: str = "10.89.0.2"
+    vectordb_host: str = "vectordb"
+    # NOTE: old hack to fix a bug with podman internal network, can be removed soon
+    # vectordb_host: str = "10.89.0.2"
+
     retrieved_queries_count: int = 20
     retrieved_docs_count: int = 15
     docs_collection_name: str = "expasy"
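
Note on the hunk above: pydantic's `BaseSettings` fills each field from an environment variable of the same name (matched case-insensitively by default) and falls back to the declared default otherwise. With the default now `"vectordb"`, the `VECTORDB_HOST=vectordb` entry removed from `compose.dev.yml` is presumably redundant. The lookup order can be sketched with the standard library alone; this is an illustration under that assumption, not the project's code, and `SketchSettings` and `_env` are hypothetical names:

```python
import os
from dataclasses import dataclass, field


def _env(name: str, default: str) -> str:
    """Return the environment variable `name`, or `default` when it is unset."""
    return os.environ.get(name, default)


@dataclass
class SketchSettings:
    # Hypothetical stand-in for the project's Settings class (stdlib only).
    # The environment variable wins; otherwise the service name is used.
    vectordb_host: str = field(
        default_factory=lambda: _env("VECTORDB_HOST", "vectordb")
    )


if __name__ == "__main__":
    print(SketchSettings().vectordb_host)
```

Setting `VECTORDB_HOST` in the environment (for example to the old podman IP) would still override the default, so the workaround remains available without a code change.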

diff --git a/src/sparql_llm/embed.py b/src/sparql_llm/embed.py
@@ -189,7 +189,7 @@ def init_vectordb(vectordb_host: str = settings.vectordb_host) -> None:
             collection_name=settings.docs_collection_name,
             vectors_config=VectorParams(size=settings.embedding_dimensions, distance=Distance.COSINE),
         )
-
+    print(f"Generating embeddings for {len(docs)} documents")
     embeddings = embedding_model.embed([q.page_content for q in docs])
     start_time = time.time()
     vectordb.upsert(
@@ -201,7 +201,7 @@ def init_vectordb(vectordb_host: str = settings.vectordb_host) -> None:
         ),
         # wait=False, # Waiting for indexing to finish or not
     )
-    print(f"Done generating and indexing documents into the vectordb in {time.time() - start_time} seconds")
+    print(f"Done generating and indexing {len(docs)} documents into the vectordb in {time.time() - start_time} seconds")
 
 
 if __name__ == "__main__":

diff --git a/src/sparql_llm/validate_sparql.py b/src/sparql_llm/validate_sparql.py
@@ -228,7 +228,6 @@ def validate_triple_pattern(
 
     query_dict = sparql_query_to_dict(query, endpoint_url)
     issues_msgs: set[str] = set()
-    # error_msgs = {}
 
     # Go through the query BGPs and check if they match the VoID description
     for endpoint, subj_dict in query_dict.items():