# Chat With Pennsieve
This is a research project component developed under the guidance of Dr. Zachary Ives. The initial goal is to develop a graph layer on top of the Pennsieve database and to enable machine learning through effective extraction of medical data from complex and varied file formats. This component enables natural language interaction with the database.

Note: All methods were implemented on an underlying graph built on Neo4j using another repository, which will be linked once it is public. This project works out of the box; however, without the underlying graph filled in, you will not get any results.

## Project Structure

### `app/`
- `__init__.py`: Initializes the app package.
  - Purpose: Marks the directory as a Python package. Add package-level imports here if needed.
- `config.py`: Handles configuration and environment variables.
  - Purpose: Loads environment variables and defines configuration settings.
  - Enhancements: Implement error handling for missing environment variables if needed.
- `database.py`: Manages the Neo4j database connection.
  - Purpose: The function `setup_neo4j_graph()` returns a Neo4j graph configured with the URL, username, and password provided in the `.env` file.
  - Documentation: `setup_neo4j_graph()` returns the Langchain Neo4j database wrapper. Important methods used: `query()` and `refresh_schema()`. See the Langchain Neo4jGraph documentation. A minimal sketch follows below.
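  A minimal sketch of what `setup_neo4j_graph()` might look like (the environment variable names here are assumptions, not necessarily the ones in `env.sample`):

  ```python
  import os

  from dotenv import load_dotenv
  from langchain_community.graphs import Neo4jGraph

  def setup_neo4j_graph() -> Neo4jGraph:
      """Return the Langchain Neo4j wrapper configured from .env values."""
      load_dotenv()  # read connection details from the .env file
      return Neo4jGraph(
          url=os.environ["NEO4J_URL"],  # e.g. bolt://localhost:7687
          username=os.environ["NEO4J_USERNAME"],
          password=os.environ["NEO4J_PASSWORD"],
      )
  ```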
- `main.py`: Entry point of the application. It passes the user query and retrieves the result by calling `run_query(user_query: str)` from `qa_chain.py`, abstracting away all the complexity and providing a simple interface to interact with the system.
- `dataguide.py`: Extracts dataguide paths from the database and formats them into Cypher paths. A hypothetical sketch follows below.
  - Methods:
    - `extract_dataguide_paths(graph: Neo4jGraph)`: Extracts dataguide paths from root to leaf using a Cypher query.
    - `format_paths_for_llm(results: List[Dict[str, Any]])`: Formats results from `extract_dataguide_paths` into valid Cypher paths for MATCH queries.
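  A hypothetical sketch of the two methods (the node label `Dataguide` and the `name` property are assumptions about the underlying graph):

  ```python
  from typing import Any, Dict, List

  from langchain_community.graphs import Neo4jGraph

  def extract_dataguide_paths(graph: Neo4jGraph) -> List[Dict[str, Any]]:
      # Collect every root-to-leaf path in the dataguide tree.
      query = """
      MATCH path = (root:Dataguide)-[*]->(leaf:Dataguide)
      WHERE NOT ()-->(root) AND NOT (leaf)-->()
      RETURN [n IN nodes(path) | n.name] AS node_names
      """
      return graph.query(query)

  def format_paths_for_llm(results: List[Dict[str, Any]]) -> List[str]:
      # Turn each list of node names into a Cypher-style path string
      # that can be dropped into a MATCH clause.
      return ["(" + ")-->(".join(r["node_names"]) + ")" for r in results]
  ```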
- `test.py`: Tests the connection to the Neo4j graph, the extraction of dataguide paths, and their formatting. Outputs the time taken for each part.
  - Enhancements: Add unit tests or test other methods manually.
- `prompt_generator.py`: Creates and combines Langchain system and human prompts into a `langchain.prompts.ChatPromptTemplate`. This is a crucial part of the project, as it defines how the prompts are structured and used in the Langchain framework. An illustrative sketch follows below.
  - Methods:
    - `get_cypher_prompt_template()`: Returns the `ChatPromptTemplate` instance created in this file. It combines the system and human prompts into a single template used by `GraphCypherQAChain` in `qa_chain.py` to generate Cypher queries.
  - Documentation:
    - `PromptTemplate`: Defines the structure of the prompts. The primary parameters used are `input_variables`, which specifies the variables to be included in the prompt, and `template`, which defines the prompt's text.
    - `SystemMessagePromptTemplate`: Creates system messages in the prompt. The primary parameter used is `prompt`, which defines the system message's text.
    - `HumanMessagePromptTemplate`: Creates human messages in the prompt. The primary parameter used is `prompt`, which defines the human message's text.
    - `ChatPromptTemplate`: Combines system and human messages into a single chat prompt. The primary method used is `from_messages()`, which takes a list of message templates and combines them into a chat prompt.
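  An illustrative sketch of how the template might be assembled (the prompt text is a placeholder, not the project's actual system prompt; `schema` and `question` are the input variables `GraphCypherQAChain` expects for its Cypher prompt):

  ```python
  from langchain.prompts import (
      ChatPromptTemplate,
      HumanMessagePromptTemplate,
      PromptTemplate,
      SystemMessagePromptTemplate,
  )

  def get_cypher_prompt_template() -> ChatPromptTemplate:
      system_prompt = PromptTemplate(
          input_variables=["schema"],
          template=(
              "You translate natural language questions into Cypher queries.\n"
              "Graph schema:\n{schema}"
          ),
      )
      human_prompt = PromptTemplate(
          input_variables=["question"],
          template="{question}",
      )
      # Combine the system and human messages into one chat prompt.
      return ChatPromptTemplate.from_messages([
          SystemMessagePromptTemplate(prompt=system_prompt),
          HumanMessagePromptTemplate(prompt=human_prompt),
      ])
  ```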
- `qa_chain.py`: Defines the `run_query(user_query: str)` function, which integrates all project components and runs a `GraphCypherQAChain` on the user query. A simplified sketch follows below.
  - Documentation: `GraphCypherQAChain`, `ChatOpenAI`
  - Note: Replace `ChatOpenAI` with `AzureChatOpenAI` if needed.
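  A simplified sketch of `run_query()` (the model name and chain options are assumptions; newer Langchain versions also require passing `allow_dangerous_requests=True` to this chain):

  ```python
  from langchain.chains import GraphCypherQAChain
  from langchain_openai import ChatOpenAI

  from app.database import setup_neo4j_graph
  from app.prompt_generator import get_cypher_prompt_template

  def run_query(user_query: str) -> str:
      graph = setup_neo4j_graph()
      chain = GraphCypherQAChain.from_llm(
          llm=ChatOpenAI(temperature=0),  # swap for AzureChatOpenAI if needed
          graph=graph,
          cypher_prompt=get_cypher_prompt_template(),
          verbose=True,
      )
      return chain.invoke({"query": user_query})["result"]
  ```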
### `paths_vectorDB/`

- `__init__.py`: Initializes the `paths_vectorDB` package.
  - Purpose: Marks the directory as a Python package. Add package-level imports here if needed.
- `generate_descriptions.py`: Defines the system prompt used to generate LLM descriptions for Cypher paths. A minimal sketch of the embedding helper follows below.
  - Methods:
    - `generate_path_descriptions(all_paths: List[str])`: Generates descriptions for the given paths using the LLM. Outputs a list of descriptions.
    - `generate_embedding(path_description: str)`: Generates an embedding for the given path description using the OpenAI embeddings API.
  - Documentation: `OpenAIEmbeddings`
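  A minimal sketch of `generate_embedding()` (the embedding model name is an assumption):

  ```python
  from typing import List

  from langchain_openai import OpenAIEmbeddings

  def generate_embedding(path_description: str) -> List[float]:
      # Embed a single path description with the OpenAI embeddings API.
      embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
      return embeddings.embed_query(path_description)
  ```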
- `random_path_generator.py`: Provides methods to generate random paths from the database and format them into Cypher paths.
- `vectorDB_setup.py`: Provides methods to start the Milvus container, connect to it, define the collection schema, create the collection, insert data, and conduct vector similarity searches. A hedged sketch follows below.
  - Documentation: `pymilvus`
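  A hedged sketch of the pymilvus pieces described above (the collection name, schema fields, and index parameters are assumptions; 1536 is the dimensionality of OpenAI ada-002 embeddings):

  ```python
  from pymilvus import (
      Collection,
      CollectionSchema,
      DataType,
      FieldSchema,
      connections,
  )

  # Connect to the Milvus container started via docker-compose.
  connections.connect(host="localhost", port="19530")

  # Define the collection schema: an auto-generated id, the path text,
  # and the path-description embedding.
  fields = [
      FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
      FieldSchema(name="path", dtype=DataType.VARCHAR, max_length=2048),
      FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
  ]
  collection = Collection("dataguide_paths", CollectionSchema(fields))
  collection.create_index(
      field_name="embedding",
      index_params={"index_type": "IVF_FLAT", "metric_type": "IP", "params": {"nlist": 128}},
  )
  collection.load()

  def search_similar_vectors(query_embedding, top_k=10):
      # Vector similarity search over the stored path embeddings.
      return collection.search(
          data=[query_embedding],
          anns_field="embedding",
          param={"metric_type": "IP", "params": {"nprobe": 10}},
          limit=top_k,
          output_fields=["path"],
      )
  ```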
- `main.py`: Wrapper functions that combine all the functionality from this directory. For example, `get_similar_paths_from_milvus` is used in `app/qa_chain.py` to conduct vector similarity searches with user queries.
- `test.py`: Methods to test various functionalities. Currently commented out.
  - Enhancements: Add unit tests or test methods manually.
- `write_read_data.py`: Simple write and read methods to store Cypher paths and descriptions generated from API calls.
  - Purpose: Helps with analysis and saves on API costs. The method `fill_collection_with_random_paths` in `paths_vectorDB/main.py` writes the paths and descriptions generated from API calls into `data.txt`.
### Root directory

- `env.sample`: Make a copy of this in your project root directory, rename it to `.env`, and fill in the values. A hypothetical example follows below.
- `.gitignore`: Specifies files and directories to be ignored by Git.
- `README.md`: Project documentation.
- `docker-compose.yml`: Docker file for the Milvus DB. If there is a new version, replace this file. Ensure it is named `docker-compose.yml` and placed in the root directory.
- `requirements.txt`: Python dependencies and their compatible versions used for development. Note: the `requirements.txt` file was created through `pipenv`.
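A hypothetical `.env` might look like this (the variable names are assumptions based on the connection details described above; use whatever `env.sample` actually lists):

```
NEO4J_URL=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
OPENAI_API_KEY=sk-...
```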
## Prerequisites

- Python 3.8+
- Docker
- Neo4j Desktop and a Neo4j database filled with the graph and dataguide (code for this will be linked soon)
## Getting Started

Getting started with this project is simple. Follow the steps below:
1. Clone the repository:

   ```bash
   git clone https://github.com/hussainzs/chat-with-pennsieve.git
   cd chat-with-pennsieve
   ```

   Note: Make sure you are in the project root directory before proceeding with the next steps.
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```
3. Set up environment variables:
   - Copy `env.sample`, rename the file to `.env`, and fill in the required values.
4. Set up Neo4j Desktop:
   - Download and install Neo4j Desktop.
   - Note the URL, username, and password for the Neo4j database that contains the graph and dataguide.
   - Update the `.env` file with the Neo4j connection details (URL, username, password). Default values have been filled in.
5. Run `app/main.py`:
   - Navigate to the `app` directory and run `main.py`. Make sure your desired user query is passed as an argument to the `run_query(user_query)` function, as in the sketch below.
   - Make sure you have `docker-compose.yml` in the root directory. When you run `app/main.py`, the Milvus containers are started automatically via terminal commands. Check out `paths_vectorDB/vectorDB_setup.py` for more information.
   - Note: When the Milvus container is created for the first time, it downloads and creates a new folder in the root directory named `volumes`. The folder contains 3 subfolders: `milvus`, `minio`, and `etcd`.
   - For more information, check out: Run Milvus using Docker Compose
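   As a sketch, the bottom of `app/main.py` might look like this (the query string is just an example, not from the project):

   ```python
   from app.qa_chain import run_query

   if __name__ == "__main__":
       # Replace this with your own natural language question.
       result = run_query("Which files belong to the latest dataset?")
       print(result)
   ```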
Note: For further clarification of the expected output when you run `app/main.py`, two PDFs of output generated by the system are attached in the folder called `Expected Outputs`.

- The file named `first_output.pdf` shows what's expected when the user runs `app/main.py` for the first time in a new session with default values. (When you run it for the very first time, it may take a while to download everything.)
- `regular_output.pdf` shows what's expected when the user runs `app/main.py` in a regular session with default values.
## Future Enhancements

- Improve System Prompts: Enhancing the prompts in both `app` and `paths_vectorDB` can significantly improve LLM performance. I found that high-quality examples in the system prompt increase the quality of the generated path descriptions; the system prompt also significantly affects the LLM's final answer.
- Optimize Context for LLM: Instead of sending all dataguide paths, send the top 10 related paths from the Milvus vector DB to reduce API costs and potentially improve performance. Long system prompts can increase hallucination and confuse the LLM; see this paper for more information: Lost in the Middle: How Language Models Use Long Contexts.
- Update Milvus: Install the latest version of Milvus and change the similarity metric from "IP" (Inner Product) to "COSINE" in the `search_similar_vectors` method inside `paths_vectorDB/vectorDB_setup.py` for better results.
- Create a Chat UI: Use Streamlit or your favorite UI library to create a basic user interface for this project. You can use FastAPI to create a simple API for sending user queries to and receiving responses from `app/main.py`.
- Add Conversational Ability: Allow follow-up interactions to guide the LLM toward better path generation, although this may increase API costs. I noticed that when the LLM was wrong, it was often only slightly off in its path generation; someone with domain knowledge of the underlying graph can easily correct it with a basic follow-up.