Commit

remove the vectordb IP address fix for podman (seems to be fixed on podman side now)
vemonet committed Sep 17, 2024
1 parent 24901a8 commit 6b27545
Showing 6 changed files with 24 additions and 12 deletions.
21 changes: 17 additions & 4 deletions README.md
@@ -6,17 +6,19 @@

</div>

Reusable components and complete webapp to improve Large Language Models (LLMs) capabilities when generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoint schema.
Reusable components and complete web service to improve Large Language Models (LLMs) capabilities when generating [SPARQL](https://www.w3.org/TR/sparql11-overview/) queries for a given set of endpoints, using Retrieval-Augmented Generation (RAG) and SPARQL query validation from the endpoints schemas.

The different components of the system can be used separately, or the whole chat system webapp can be deployed for a set of endpoints. It relies on the endpoint containing some descriptive metadata: [SPARQL query examples](https://github.com/sib-swiss/sparql-examples), and endpoint description using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can generated automatically using the [void-generator](https://github.com/JervenBolleman/void-generator).
The different components of the system can be used separately, or the whole chat system can be deployed for a set of endpoints. It relies on the endpoint containing some descriptive metadata: [SPARQL query examples](https://github.com/sib-swiss/sparql-examples), and endpoint description using the [Vocabulary of Interlinked Datasets (VoID)](https://www.w3.org/TR/void/), which can generated automatically using the [void-generator](https://github.com/JervenBolleman/void-generator).

This repository contains:

* Functions to extract and load relevant metadata from a SPARQL endpoints. Loaders are compatible with [LangChain](https://python.langchain.com), but they can also be used outside of LangChain as they just return a list of documents with metadata as JSON, which can then be loaded how you want in your vectorstore.
* Function to automatically parse and validate SPARQL queries based on a endpoint VoID description.
* Function to automatically parse and validate SPARQL queries based on an endpoint VoID description.
* A complete reusable system to deploy a LLM chat system with web UI, API and vector database, designed to help users to write SPARQL queries for a given set of endpoints by exploiting metadata uploaded to the endpoints (WIP).
* The deployment configuration for **[chat.expasy.org](https://chat.expasy.org)** the LLM chat system to help users accessing the endpoints maintained at the [SIB](https://www.sib.swiss/).

> [!TIP]
>
> You can quickly check if an endpoint contains the expected metadata at [sib-swiss.github.io/sparql-editor/check](https://sib-swiss.github.io/sparql-editor/check)
## 🪄 Reusable components
@@ -61,6 +63,17 @@ print(len(docs))
print(docs[0].metadata)
```

> The generated shapes are well-suited for use with a LLM or a human, as they provide clear information about which predicates are available for a class, and the corresponding classes or datatypes those predicates point to. Each object property references a list of classes rather than another shape, making each shape self-contained and interpretable on its own, e.g. for a *Disease Annotation* in UniProt:
>
> ```turtle
> up:Disease_Annotation {
> a [ up:Disease_Annotation ] ;
> up:sequence [ up:Chain_Annotation up:Modified_Sequence ] ;
> rdfs:comment xsd:string ;
> up:disease IRI
> }
> ```
### Generate complete ShEx shapes from VoID description

You can also generate the complete ShEx shapes for a SPARQL endpoint with:
@@ -142,7 +155,7 @@ issues = validate_sparql_with_void(sparql_query, "https://sparql.uniprot.org/spa
print("\n".join(issues))
```

## 🚀 Complete chat system
## 🚀 Complete chat system

> [!WARNING]
>
2 changes: 0 additions & 2 deletions compose.dev.yml
@@ -14,8 +14,6 @@ services:
service: api
ports:
- 8000:80
environment:
- VECTORDB_HOST=vectordb
volumes:
- ./src:/app/src
- ./prestart.sh:/app/prestart.sh
2 changes: 1 addition & 1 deletion src/sparql_llm/api.py
@@ -338,7 +338,7 @@ def chat_ui(request: Request) -> Any:
"llm_model": llm_model,
"description": """Assistant to navigate resources from the Swiss Institute of Bioinformatics. Particularly knowledgeable about UniProt, OMA, Bgee, RheaDB, and SwissLipids. But still learning.
Contact kru@sib.swiss if you have any feedback or suggestions.
Contact kru@sib.swiss if you have any feedback or suggestions. Questions asked here are stored for research purposes, see the [SIB privacy policy](https://www.sib.swiss/privacy-policy) for more information.
""",
"short_description": "Ask about SIB resources.",
"repository_url": "https://github.com/sib-swiss/sparql-llm",
6 changes: 4 additions & 2 deletions src/sparql_llm/config.py
@@ -34,8 +34,10 @@ class Settings(BaseSettings):
ontology_chunk_size: int = 3000
ontology_chunk_overlap: int = 200

# NOTE: Default is the IP address inside the podman network to solve a ridiculous bug from podman
vectordb_host: str = "10.89.0.2"
vectordb_host: str = "vectordb"
# NOTE: old hack to fix a bug with podman internal network, can be removed soon
# vectordb_host: str = "10.89.0.2"

retrieved_queries_count: int = 20
retrieved_docs_count: int = 15
docs_collection_name: str = "expasy"
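
Read together with the `compose.dev.yml` hunk, this change makes the service name `vectordb` the default host, so the explicit `VECTORDB_HOST=vectordb` override in the dev compose file becomes redundant. A minimal sketch of that fallback behavior, assuming the setting can be overridden through a `VECTORDB_HOST` environment variable (as the removed compose entry suggests). The `resolve_vectordb_host` helper is hypothetical and uses only the standard library; it is not the project's `Settings` class:

```python
import os
from collections.abc import Mapping
from typing import Optional

# New default introduced by this commit: the compose service name.
DEFAULT_VECTORDB_HOST = "vectordb"


def resolve_vectordb_host(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the vector database host (hypothetical helper).

    Assumption: an explicit VECTORDB_HOST environment variable takes
    precedence, otherwise the default service name is used.
    """
    env = os.environ if env is None else env
    return env.get("VECTORDB_HOST", DEFAULT_VECTORDB_HOST)


# Without an override, the service name is used.
host_default = resolve_vectordb_host({})
# The old podman workaround can still be restored through the environment.
host_override = resolve_vectordb_host({"VECTORDB_HOST": "10.89.0.2"})
```

Under this assumption, a deployment that still hits the podman network issue could set `VECTORDB_HOST=10.89.0.2` in its environment instead of editing `config.py`.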
4 changes: 2 additions & 2 deletions src/sparql_llm/embed.py
@@ -189,7 +189,7 @@ def init_vectordb(vectordb_host: str = settings.vectordb_host) -> None:
collection_name=settings.docs_collection_name,
vectors_config=VectorParams(size=settings.embedding_dimensions, distance=Distance.COSINE),
)

print(f"Generating embeddings for {len(docs)} documents")
embeddings = embedding_model.embed([q.page_content for q in docs])
start_time = time.time()
vectordb.upsert(
@@ -201,7 +201,7 @@ def init_vectordb(vectordb_host: str = settings.vectordb_host) -> None:
),
# wait=False, # Waiting for indexing to finish or not
)
print(f"Done generating and indexing documents into the vectordb in {time.time() - start_time} seconds")
print(f"Done generating and indexing {len(docs)} documents into the vectordb in {time.time() - start_time} seconds")


if __name__ == "__main__":
1 change: 0 additions & 1 deletion src/sparql_llm/validate_sparql.py
@@ -228,7 +228,6 @@ def validate_triple_pattern(

query_dict = sparql_query_to_dict(query, endpoint_url)
issues_msgs: set[str] = set()
# error_msgs = {}

# Go through the query BGPs and check if they match the VoID description
for endpoint, subj_dict in query_dict.items():