diff --git a/README.md b/README.md
index 72c4798..7cd54b7 100644
--- a/README.md
+++ b/README.md
@@ -6,17 +6,19 @@
-Reusable components and complete webapp to improve Large Language Models (LLMs) capabilities when generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema.
+Reusable components and complete web service to improve the capabilities of Large Language Models (LLMs) when generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation against the endpoints' schemas.
 
-The different components of the system can be used separately, or the whole chat system webapp can be deployed for a set of endpoints. It relies on the endpoint containing some descriptive metadata: [SPARQL query examples](https://github.com/sib-swiss/sparql-examples), and endpoint description using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can generated automatically using the [void-generator](https://github.com/JervenBolleman/void-generator).
+The different components of the system can be used separately, or the whole chat system can be deployed for a set of endpoints. It relies on the endpoints containing some descriptive metadata: [SPARQL query examples](https://github.com/sib-swiss/sparql-examples), and an endpoint description using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can be generated automatically using the [void-generator](https://github.com/JervenBolleman/void-generator).
 
 This repository contains:
 
 * Functions to extract and load relevant metadata from SPARQL endpoints. Loaders are compatible with [LangChain](https://python.langchain.com), but they can also be used outside of LangChain as they just return a list of documents with metadata as JSON, which can then be loaded however you want into your vectorstore.
-* Function to automatically parse and validate SPARQL queries based on a endpoint VoID description.
+* Function to automatically parse and validate SPARQL queries based on an endpoint VoID description.
 * A complete reusable system to deploy an LLM chat system with web UI, API and vector database, designed to help users write SPARQL queries for a given set of endpoints by exploiting metadata uploaded to the endpoints (WIP).
 * The deployment configuration for **[chat.expasy.org](https://chat.expasy.org)**, the LLM chat system to help users access the endpoints maintained at the [SIB](https://www.sib.swiss/).
 
+> [!TIP]
+>
+> You can quickly check if an endpoint contains the expected metadata at [sib-swiss.github.io/sparql-editor/check](https://sib-swiss.github.io/sparql-editor/check)
 
 ## 🪄 Reusable components
 
@@ -61,6 +63,17 @@
 print(len(docs))
 print(docs[0].metadata)
 ```
+
+> The generated shapes are well-suited for use with an LLM or a human, as they provide clear information about which predicates are available for a class, and the corresponding classes or datatypes those predicates point to. Each object property references a list of classes rather than another shape, making each shape self-contained and interpretable on its own, e.g. for a *Disease Annotation* in UniProt:
+>
+> ```turtle
+> up:Disease_Annotation {
+>   a [ up:Disease_Annotation ] ;
+>   up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
+>   rdfs:comment xsd:string ;
+>   up:disease IRI
+> }
+> ```
 
 ### Generate complete ShEx shapes from VoID description
 
 You can also generate the complete ShEx shapes for a SPARQL endpoint with:
 
@@ -142,7 +155,7 @@
 issues = validate_sparql_with_void(sparql_query, "https://sparql.uniprot.org/sparql")
 print("\n".join(issues))
 ```
 
-## 🚀 Complete chat system
+## 🚀 Complete chat system
 
 > [!WARNING]
 >
diff --git a/compose.dev.yml b/compose.dev.yml
index 5fa37b5..a55d2b5 100644
--- a/compose.dev.yml
+++ b/compose.dev.yml
@@ -14,8 +14,6 @@ services:
       service: api
     ports:
       - 8000:80
-    environment:
-      - VECTORDB_HOST=vectordb
     volumes:
       - ./src:/app/src
       - ./prestart.sh:/app/prestart.sh
diff --git a/src/sparql_llm/api.py b/src/sparql_llm/api.py
index 8e4531c..741fd0a 100644
--- a/src/sparql_llm/api.py
+++ b/src/sparql_llm/api.py
@@ -338,7 +338,7 @@ def chat_ui(request: Request) -> Any:
         "llm_model": llm_model,
         "description": """Assistant to navigate resources from the Swiss Institute of Bioinformatics. Particularly knowledgeable about UniProt, OMA, Bgee, RheaDB, and SwissLipids. But still learning.
-Contact kru@sib.swiss if you have any feedback or suggestions.
+Contact kru@sib.swiss if you have any feedback or suggestions. Questions asked here are stored for research purposes, see the [SIB privacy policy](https://www.sib.swiss/privacy-policy) for more information.
 """,
         "short_description": "Ask about SIB resources.",
         "repository_url": "https://github.com/sib-swiss/sparql-llm",
diff --git a/src/sparql_llm/config.py b/src/sparql_llm/config.py
index 6a6f783..022aad6 100644
--- a/src/sparql_llm/config.py
+++ b/src/sparql_llm/config.py
@@ -34,8 +34,10 @@ class Settings(BaseSettings):
     ontology_chunk_size: int = 3000
     ontology_chunk_overlap: int = 200
 
-    # NOTE: Default is the IP address inside the podman network to solve a ridiculous bug from podman
-    vectordb_host: str = "10.89.0.2"
+    vectordb_host: str = "vectordb"
+    # NOTE: old hack to fix a bug with podman internal network, can be removed soon
+    # vectordb_host: str = "10.89.0.2"
+
     retrieved_queries_count: int = 20
     retrieved_docs_count: int = 15
     docs_collection_name: str = "expasy"
diff --git a/src/sparql_llm/embed.py b/src/sparql_llm/embed.py
index 41c9df3..d3056b4 100644
--- a/src/sparql_llm/embed.py
+++ b/src/sparql_llm/embed.py
@@ -189,7 +189,7 @@ def init_vectordb(vectordb_host: str = settings.vectordb_host) -> None:
         collection_name=settings.docs_collection_name,
         vectors_config=VectorParams(size=settings.embedding_dimensions, distance=Distance.COSINE),
     )
-
+    print(f"Generating embeddings for {len(docs)} documents")
     embeddings = embedding_model.embed([q.page_content for q in docs])
     start_time = time.time()
     vectordb.upsert(
@@ -201,7 +201,7 @@ def init_vectordb(vectordb_host: str = settings.vectordb_host) -> None:
         ),
         # wait=False, # Waiting for indexing to finish or not
     )
-    print(f"Done generating and indexing documents into the vectordb in {time.time() - start_time} seconds")
+    print(f"Done generating and indexing {len(docs)} documents into the vectordb in {time.time() - start_time} seconds")
 
 
 if __name__ == "__main__":
diff --git a/src/sparql_llm/validate_sparql.py b/src/sparql_llm/validate_sparql.py
index 18891bf..09a778f 100644
--- a/src/sparql_llm/validate_sparql.py
+++ b/src/sparql_llm/validate_sparql.py
@@ -228,7 +228,6 @@ def validate_triple_pattern(
     query_dict = sparql_query_to_dict(query, endpoint_url)
     issues_msgs: set[str] = set()
 
-    # error_msgs = {}
     # Go through the query BGPs and check if they match the VoID description
     for endpoint, subj_dict in query_dict.items():
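
A note on the compose.dev.yml and config.py hunks above: dropping the `VECTORDB_HOST=vectordb` override is safe because the new `Settings` default is already the compose service name, while an explicitly exported environment variable would still take precedence (pydantic's `BaseSettings` reads environment variables before falling back to field defaults). A minimal stdlib sketch of that precedence; `resolve_vectordb_host` is a hypothetical stand-in for the real `Settings` class, not project code:

```python
import os

def resolve_vectordb_host(default: str = "vectordb") -> str:
    # Environment variable wins when set (as the old compose.dev.yml did),
    # otherwise fall back to the new "vectordb" default, which resolves
    # to the vectordb service on the compose network.
    return os.environ.get("VECTORDB_HOST", default)

# With the env var removed from compose.dev.yml, the default applies:
os.environ.pop("VECTORDB_HOST", None)
print(resolve_vectordb_host())  # → vectordb

# The old podman workaround can still be restored without code changes:
os.environ["VECTORDB_HOST"] = "10.89.0.2"
print(resolve_vectordb_host())  # → 10.89.0.2
```

This is why the commented-out `vectordb_host: str = "10.89.0.2"` line can indeed "be removed soon": anyone still hitting the podman networking bug can export the IP instead of patching config.py.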