
Take advantage of WME Structured-Contents APIs to seed an LLM RAG search engine


wikimedia-enterprise/Structured-Contents-LLM-RAG


Requirements:

  1. Download and install Ollama and follow the quick setup instructions: https://ollama.com/download

  2. Download the mxbai-embed-large and llama3 models. In a terminal, run the following (warning: llama3 is a 4.7GB download and mxbai-embed-large is 670MB):

ollama pull mxbai-embed-large
ollama run llama3

Notes:

  • As of March 2024, the mxbai-embed-large model achieves SOTA performance for BERT-large-sized models on the MTEB benchmark. It outperforms commercial models like OpenAI's text-embedding-3-large and matches the performance of models 20x its size.
  • Llama 3 (8B) instruction-tuned models are fine-tuned and optimized for dialogue/chat use cases and outperform many of the available open-source chat models on common benchmarks.
  • Bonus: if you have a powerful laptop/desktop, you might want to swap the Llama 3 8-billion-parameter model for the 70-billion-parameter version, which has better inference and more internal knowledge. To use 70B, run ollama run llama3:70b instead (note this is a 40GB download), and change the line of code in query.py that loads the llama3 model to: model="llama3:70b"
  3. Verify that Ollama is working and serving the model; the output should be a JSON object containing an embedding array of floating-point numbers:
curl http://localhost:11434/api/embeddings -d '{
  "model": "mxbai-embed-large",
  "prompt": "Summarize the features of Wikipedia in 5 bullet points"
}'
  4. Clone our demo repo to get started:
git clone https://github.com/wikimedia-enterprise/Structured-Contents-LLM-RAG.git
  5. Create a Python virtual environment, activate it, and install the packages in requirements.txt:
python3 -m venv myenv
source myenv/bin/activate
pip install -r requirements.txt
  6. Edit the environment variables file to add your Wikimedia Enterprise API credentials. Don't have an account yet? Sign up for free. Then rename sample.env to .env and add your Wikimedia Enterprise username and password:
WIKI_API_USERNAME=username
WIKI_API_PASSWORD=password
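
For reference, here is a minimal sketch (assuming python-dotenv and requests; the exact code in get_dataset.py may differ) of how the credentials in .env can be used to obtain a Wikimedia Enterprise access token. The login endpoint shown follows the WME API documentation; the token is then sent as a Bearer header on On-Demand API calls:

# Illustrative sketch only; see get_dataset.py for the code this repo actually uses.
import os
import requests
from dotenv import load_dotenv

load_dotenv()  # reads WIKI_API_USERNAME / WIKI_API_PASSWORD from .env

# Wikimedia Enterprise login endpoint (per the WME API docs)
AUTH_URL = "https://auth.enterprise.wikimedia.com/v1/login"

resp = requests.post(
    AUTH_URL,
    json={
        "username": os.environ["WIKI_API_USERNAME"],
        "password": os.environ["WIKI_API_PASSWORD"],
    },
    timeout=30,
)
resp.raise_for_status()
access_token = resp.json()["access_token"]

# Subsequent On-Demand API requests authenticate with a Bearer token
headers = {"Authorization": f"Bearer {access_token}"}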

Notes:

  • You can skip the next step if you have a slow internet connection; instead use the /dataset/en_sample.csv file, which already contains the structured Wikipedia data that Step 7 would download.
  7. Review the Python in get_dataset.py, which calls the Wikimedia Enterprise On-Demand API for 500 English articles. We're using our Structured Contents endpoint, which provides pre-parsed article body sections so you can cleanly obtain data without extra parsing. You can run that process with this command:
python get_dataset.py

Notes:

  • In get_dataset.py we use multithreading to download the dataset, using your CPU cores to send many requests at once. If you prefer to keep it simple, we have a less complex downloader that fetches the data in sequence, but it takes considerably longer. See the code in pipelineV1() and pipelineV2(): the first function runs sequentially, the second in parallel. Notice we use thread locking to guarantee that the shared array is appended to without a race condition; a minimal sketch of this pattern appears after these notes.
  • The script will first check whether you've downloaded new data and will fall back to the sample data if not.
  • One important function in this code is clean_text(), which parses the HTML tags and extracts the plain text that the LLM expects. Data tidying is a big part of the machine-learning workflow, so review clean_text() if you want to understand the text-cleaning steps.
  • Wikimedia Enterprise has a number of value-added APIs that give developers easier access to cleaned Wikimedia data. You don't need to be a data scientist or AI expert to integrate Wikipedia/Wikidata knowledge into your systems. Visit our developer documentation portal for more API info.
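
To make the notes above concrete, here is a minimal sketch of the parallel download with thread locking plus the HTML-to-plain-text cleaning step. The helper get_structured_contents() and the body_html field are placeholders, not the actual names used in get_dataset.py:

# Illustrative sketch of pipelineV2()-style parallel fetching; names are placeholders.
import threading
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

results = []                      # shared list of cleaned article rows
results_lock = threading.Lock()   # guards appends from worker threads

def clean_text(html: str) -> str:
    # strip HTML tags and collapse whitespace so the embedding model sees plain text
    text = BeautifulSoup(html, "html.parser").get_text(separator=" ")
    return " ".join(text.split())

def fetch_article(title: str) -> None:
    article = get_structured_contents(title)   # placeholder for the WME API call
    row = {"title": title, "text": clean_text(article["body_html"])}  # field name assumed
    with results_lock:                         # append without a race condition
        results.append(row)

def pipeline_parallel(titles: list[str]) -> None:
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(fetch_article, titles))  # drain the iterator to surface errors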
  8. Review the Python in import.py, which imports the CSV data from Step 7 and loads it into ChromaDB. Then run it:
python import.py
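
For orientation, here is a minimal sketch of what the embed-and-load step looks like: each CSV row is embedded with the same Ollama endpoint used in Step 3 and stored in a ChromaDB collection. The file path, collection name, and CSV column names are assumptions; import.py is the source of truth:

# Illustrative sketch only; see import.py for the real path, collection name, and columns.
import csv
import requests
import chromadb

OLLAMA_EMBED_URL = "http://localhost:11434/api/embeddings"

def embed(text: str) -> list[float]:
    resp = requests.post(
        OLLAMA_EMBED_URL,
        json={"model": "mxbai-embed-large", "prompt": text},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

client = chromadb.PersistentClient(path="./chroma")        # storage path is an assumption
collection = client.get_or_create_collection("wikipedia")  # collection name is an assumption

with open("dataset/en_sample.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f)):
        collection.add(
            ids=[str(i)],
            embeddings=[embed(row["text"])],               # "text"/"title" columns assumed
            documents=[row["text"]],
            metadatas=[{"title": row.get("title", "")}],
        )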
  9. Review the Python in query.py to input your query, query ChromaDB, retrieve the relevant articles, and pass them to Llama 3 to generate the response. Run the Streamlit web UI with:
streamlit run query.py
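
As a rough guide, query.py implements the standard retrieve-then-generate flow: embed the question, pull the closest articles out of ChromaDB, and hand them to Llama 3 as context. The sketch below shows that core flow without the Streamlit UI code; the storage path and collection name are assumptions:

# Illustrative sketch of the RAG query flow; see query.py for the Streamlit version.
import requests
import chromadb

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "mxbai-embed-large", "prompt": text}, timeout=120)
    r.raise_for_status()
    return r.json()["embedding"]

def answer(question: str) -> str:
    collection = chromadb.PersistentClient(path="./chroma").get_collection("wikipedia")
    hits = collection.query(query_embeddings=[embed(question)], n_results=3)
    context = "\n\n".join(hits["documents"][0])   # the most relevant article texts
    prompt = f"Using this context:\n{context}\n\nAnswer the question: {question}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": "llama3", "prompt": prompt, "stream": False},
                      timeout=600)
    r.raise_for_status()
    return r.json()["response"]

print(answer("Who is Joe that does icosathlon?"))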

Notes:

  • Optionally, you can also chat with your local Ollama models through the Open WebUI front end; the command below runs it in Docker and serves it at http://localhost:3000:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
  10. You can safely delete all the code and data in this project; there are no other dependencies. You may also wish to uninstall Ollama and remove the LLM models you downloaded. Remove the models with these commands:
ollama rm mxbai-embed-large
ollama rm llama3

Here are some example chats with RAG turned OFF and then ON:

  • "Joe that does icosathlon" - screenshots with RAG off vs. RAG on
  • "Wlassifoff" - screenshots with RAG off vs. RAG on
  • "Newala Town" - screenshots with RAG off vs. RAG on
  • "horse breeds of the British Isles" - screenshots with RAG off vs. RAG on
  • "Chow a shooter" - screenshots with RAG off vs. RAG on

Steps initially derived from https://ollama.com/blog/embedding-models
