Merge pull request #92 from marklogic/feature/remove-langchain-example
Updated example project to point to ai-examples repo
rjrudin authored Sep 25, 2024
2 parents 9e94b5e + 600fe1c commit 5e943fe
Showing 23 changed files with 1 addition and 1,044 deletions.
4 changes: 0 additions & 4 deletions examples/langchain/.gitignore

This file was deleted.

167 changes: 1 addition & 166 deletions examples/langchain/README.md
@@ -1,166 +1 @@
# Example langchain retriever

This project demonstrates one approach for implementing a
[langchain retriever](https://python.langchain.com/docs/modules/data_connection/)
that allows for
[Retrieval Augmented Generation (RAG)](https://python.langchain.com/docs/use_cases/question_answering/)
to be supported via MarkLogic and the MarkLogic Python Client. This example uses the same data as in
[the langchain RAG quickstart guide](https://python.langchain.com/docs/use_cases/question_answering/quickstart),
but with the data having first been loaded into MarkLogic.

**This is only intended as an example** of how easily a langchain retriever can be developed
using the MarkLogic Python Client. The queries in this example are deliberately simple and have
no knowledge of how your data is modeled in MarkLogic. You are encouraged to use this as a
starting point for developing your own retriever, which can build a query based on a question
submitted to langchain that fully leverages the indexes and data models in your MarkLogic
application. Additionally, please see the
[langchain documentation on splitting text](https://python.langchain.com/docs/modules/data_connection/document_transformers/).
You may need to restructure your data into a larger number of smaller documents so that you do
not exceed the limit that langchain imposes on how much data a retriever can return.
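
As a rough illustration of that splitting idea (this is not the approach used by this project;
langchain provides its own text splitters), a naive chunker might look like:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks so that each stored document
    stays well under the amount of data a retriever is expected to return."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

Each chunk would then be loaded into MarkLogic as its own document.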

# Setup

To try out this project, use [docker-compose](https://docs.docker.com/compose/) to instantiate a new MarkLogic
instance with port 8003 available (you can use your own MarkLogic instance instead, as long as
port 8003 is available):

docker-compose up -d --build

## Deploy With Gradle

Then deploy a small REST API application to MarkLogic, which includes a basic non-admin MarkLogic user
named `langchain-user`:

./gradlew -i mlDeploy

## Install Python Libraries

Next, create a new Python virtual environment - [pyenv](https://github.com/pyenv/pyenv) is recommended for this -
and install the
[langchain example dependencies](https://python.langchain.com/docs/use_cases/question_answering/quickstart#dependencies),
along with the MarkLogic Python Client:

pip install -U langchain langchain_openai langchain-community langchainhub openai chromadb bs4 marklogic_python_client

## Load Sample Data

Then run the following Python program to load text data from the langchain quickstart guide
into two different collections in the `langchain-test-content` database:

python load_data.py

## Create Python Environment File

Create a ".env" file to hold your AzureOpenAI environment values. It should look
something like this.
```
OPENAI_API_VERSION=2023-12-01-preview
AZURE_OPENAI_ENDPOINT=<Your Azure OpenAI Endpoint>
AZURE_OPENAI_API_KEY=<Your Azure OpenAI API Key>
AZURE_LLM_DEPLOYMENT_NAME=gpt-test1-gpt-35-turbo
AZURE_LLM_DEPLOYMENT_MODEL=gpt-35-turbo
```

# Testing the retriever

## Testing using a retriever with a basic query

You are now ready to test the example retriever. Run the following to ask a question
with the results augmented via the `marklogic_similar_query_retriever.py` module in this
project:

python ask_similar_query.py "What is task decomposition?" posts

The retriever uses a [cts.similarQuery](https://docs.marklogic.com/cts.similarQuery) to
select from the documents loaded via `load_data.py`. It defaults to a page length of 10.
You can change this by providing a command line argument - e.g.:

python ask_similar_query.py "What is task decomposition?" posts 15

Example of a question for the "sotu" (State of the Union speech) collection:

python ask_similar_query.py "What are economic sanctions?" sotu 20

To use a word query instead of a similar query, along with a set of drop words, specify
"word" as the 4th argument:

python ask_similar_query.py "What are economic sanctions?" sotu 20 word
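
The drop-word filtering for the word-query mode can be sketched as follows; the drop-word list
and function name here are illustrative, not the exact contents of
`marklogic_similar_query_retriever.py`:

```python
# Illustrative drop words; the module's actual list may differ.
DROP_WORDS = {"what", "is", "are", "the", "a", "an", "of", "in", "to", "and"}

def question_to_terms(question: str) -> list[str]:
    """Lower-case the question, strip punctuation, and drop common words
    so that only meaningful terms are passed to a cts word query."""
    words = (word.strip("?.,!;:").lower() for word in question.split())
    return [word for word in words if word and word not in DROP_WORDS]
```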

## Testing using a retriever with a contextual query

There may be times when your langchain application needs to use both a question and a
structured query during the document retrieval process. To see an example of this, run the
following command; the question is combined with a hard-coded structured query by the
`marklogic_contextual_query_retriever.py` module in this project:

python ask_contextual_query.py "What is task decomposition?" posts

This retriever builds a term-query using words from the question. Then the term-query is
added to the structured query and the merged query is used to select from the documents
loaded via `load_data.py`.
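
The merging step can be sketched as follows; the structured-query shape shown is an assumption
for illustration, not necessarily the exact format the module produces:

```python
import copy

def merge_term_query(structured_query: dict, question: str) -> dict:
    """Return a copy of the structured query with a term-query, built
    from the question's words, added to its list of sub-queries."""
    merged = copy.deepcopy(structured_query)
    terms = [word.strip("?.,!").lower() for word in question.split()]
    merged["query"]["queries"].append({"term-query": {"text": terms}})
    return merged
```

The merged query is then submitted to MarkLogic, so the results must satisfy both the original
structured query and the terms drawn from the question.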

## Testing using MarkLogic 12EA Vector Search

### MarkLogic 12EA Setup

To try this functionality out, you will need access to an instance of MarkLogic 12
(currently internal or Early Access only).
<TODO>Add info to get ML12</TODO>
You may use [docker-compose](https://docs.docker.com/compose/) to instantiate a new MarkLogic
instance with port 8003 available (you can use your own MarkLogic instance instead, as long as
port 8003 is available):

docker-compose -f docker-compose-12.yml up -d --build

### Deploy With Gradle

You will also need to deploy the application. However, for this example, you will need
to include an additional switch on the command line to deploy a TDE schema that takes
advantage of the vector capabilities in MarkLogic 12.

./gradlew -i mlDeploy -PmlSchemasPath=src/main/ml-schemas-12

### Install Python Libraries

As above, if you have not yet installed the Python libraries, install them with pip:
```
pip install -U langchain langchain_openai langchain-community langchainhub openai chromadb bs4 marklogic_python_client
```

### Create Python Environment File
The Python script for this example also generates LLM embeddings and includes them in
the documents stored in MarkLogic. In order to generate the embeddings, you'll need to
add the following environment variables (with your values) to the .env file created
above.

```
AZURE_EMBEDDING_DEPLOYMENT_NAME=text-test-embedding-ada-002
AZURE_EMBEDDING_DEPLOYMENT_MODEL=text-embedding-ada-002
```

### Load Sample Data

Then run the following Python program to load text data from the langchain quickstart
guide into two different collections in the `langchain-test-content` database. Note that
this script differs from the one in the earlier setup section and loads the data into
different collections.

```
python load_data_with_embeddings.py
```
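
The documents this script produces might resemble the following envelope, with the source text
stored alongside its embedding; the field names here are assumptions for illustration, so check
`load_data_with_embeddings.py` and the TDE schema for the real structure:

```python
def wrap_with_embedding(text: str, embedding: list[float]) -> dict:
    """Store the source text alongside its embedding so that a TDE view
    can expose the vector to MarkLogic 12's vector functions."""
    return {
        "envelope": {
            "text": text,
            "embedding": embedding,
        }
    }
```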

### Running the Vector Query

You are now ready to test the example vector retriever. Run the following to ask a
question with the results augmented via the `marklogic_vector_query_retriever.py` module
in this project:

python ask_vector_query.py "What is task decomposition?" posts_with_embeddings

This retriever searches MarkLogic for candidate documents and defaults to using the new
score-bm25 scoring method in MarkLogic 12EA; if preferred, you can adjust this to one of the
other scoring methods. After retrieving candidate documents via the CTS search, the retriever
uses the new vector functionality to sort the documents by cosine similarity to the user
question and then returns the top N documents to langchain.

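
The final re-ranking step amounts to the following pure-Python sketch of cosine-similarity
ordering; the real retriever performs this inside MarkLogic via the new vector functions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def top_n_documents(candidates: list[tuple[str, list[float]]],
                    question_embedding: list[float], n: int = 10) -> list[str]:
    """Sort (uri, embedding) candidates by similarity to the question
    embedding, descending, and keep the top n URIs."""
    ranked = sorted(candidates,
                    key=lambda doc: cosine_similarity(doc[1], question_embedding),
                    reverse=True)
    return [uri for uri, _ in ranked[:n]]
```
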
This example project has been moved to the [MarkLogic AI examples repository](https://github.com/marklogic/marklogic-ai-examples).
72 changes: 0 additions & 72 deletions examples/langchain/ask_contextual_query.py

This file was deleted.

48 changes: 0 additions & 48 deletions examples/langchain/ask_similar_query.py

This file was deleted.

53 changes: 0 additions & 53 deletions examples/langchain/ask_vector_query.py

This file was deleted.

4 changes: 0 additions & 4 deletions examples/langchain/build.gradle

This file was deleted.

17 changes: 0 additions & 17 deletions examples/langchain/docker-compose-12.yml

This file was deleted.

17 changes: 0 additions & 17 deletions examples/langchain/docker-compose.yml

This file was deleted.

4 changes: 0 additions & 4 deletions examples/langchain/gradle.properties

This file was deleted.

Binary file removed examples/langchain/gradle/wrapper/gradle-wrapper.jar
Binary file not shown.
