forked from langchain-ai/langchain
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
community[minor]: Add Apache Doris as vector store (langchain-ai#17527)
--------- Co-authored-by: Bagatur <baskaryan@gmail.com>
- Loading branch information
Showing
5 changed files
with
833 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
# Apache Doris | ||
|
||
>[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics. | ||
It delivers lightning-fast analytics on real-time data at scale. | ||
|
||
>Usually `Apache Doris` is categorized into OLAP, and it has showed excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it could also be used as a fast vectordb. | ||
## Installation and Setup | ||
|
||
|
||
```bash | ||
pip install pymysql | ||
``` | ||
|
||
## Vector Store | ||
|
||
See a [usage example](/docs/integrations/vectorstores/apache_doris). | ||
|
||
```python | ||
from langchain_community.vectorstores import ApacheDoris | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,322 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "84180ad0-66cd-43e5-b0b8-2067a29e16ba", | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"source": [ | ||
"# Apache Doris\n", | ||
"\n", | ||
">[Apache Doris](https://doris.apache.org/) is a modern data warehouse for real-time analytics.\n", | ||
"It delivers lightning-fast analytics on real-time data at scale.\n", | ||
"\n", | ||
">Usually `Apache Doris` is categorized into OLAP, and it has showed excellent performance in [ClickBench — a Benchmark For Analytical DBMS](https://benchmark.clickhouse.com/). Since it has a super-fast vectorized execution engine, it could also be used as a fast vectordb.\n", | ||
"\n", | ||
"Here we'll show how to use the Apache Doris Vector Store." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "1685854f", | ||
"metadata": {}, | ||
"source": [ | ||
"## Setup" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "311d44bb-4aca-4f3b-8f97-5e1f29238e40", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%pip install --upgrade --quiet pymysql" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2c891bba", | ||
"metadata": {}, | ||
"source": [ | ||
"Set `update_vectordb = False` at the beginning. If there is no docs updated, then we don't need to rebuild the embeddings of docs" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "f4e6ca20-79dd-482a-8f68-af9d7dd59c7c", | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install sqlalchemy\n", | ||
"!pip install langchain" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "96f7c7a2-4811-4fdf-87f5-c60772f51fe1", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-02-14T12:54:01.392500Z", | ||
"start_time": "2024-02-14T12:53:58.866615Z" | ||
}, | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from langchain.chains import RetrievalQA\n", | ||
"from langchain.text_splitter import TokenTextSplitter\n", | ||
"from langchain_community.document_loaders import (\n", | ||
" DirectoryLoader,\n", | ||
" UnstructuredMarkdownLoader,\n", | ||
")\n", | ||
"from langchain_community.vectorstores.apache_doris import (\n", | ||
" ApacheDoris,\n", | ||
" ApacheDorisSettings,\n", | ||
")\n", | ||
"from langchain_openai import OpenAI, OpenAIEmbeddings\n", | ||
"\n", | ||
"update_vectordb = False" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ee821c00", | ||
"metadata": {}, | ||
"source": [ | ||
"## Load docs and split them into tokens" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "34ba0cfd", | ||
"metadata": {}, | ||
"source": [ | ||
"Load all markdown files under the `docs` directory\n", | ||
"\n", | ||
"for Apache Doris documents, you can clone repo from https://github.com/apache/doris, and there is `docs` directory in it." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "799edf20-bcf4-4a65-bff7-b907f6bdba20", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-02-14T12:55:24.128917Z", | ||
"start_time": "2024-02-14T12:55:19.463831Z" | ||
}, | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"loader = DirectoryLoader(\n", | ||
" \"./docs\", glob=\"**/*.md\", loader_cls=UnstructuredMarkdownLoader\n", | ||
")\n", | ||
"documents = loader.load()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "b415fe2a", | ||
"metadata": {}, | ||
"source": [ | ||
"Split docs into tokens, and set `update_vectordb = True` because there are new docs/tokens." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "0dc5ba83-62ef-4f61-a443-e872f251e7da", | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"# load text splitter and split docs into snippets of text\n", | ||
"text_splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)\n", | ||
"split_docs = text_splitter.split_documents(documents)\n", | ||
"\n", | ||
"# tell vectordb to update text embeddings\n", | ||
"update_vectordb = True" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "46966e25-9449-4a36-87d1-c0b25dce2994", | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"source": [ | ||
"split_docs[-20]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "99422e95-b407-43eb-aa68-9a62363fc82f", | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"source": [ | ||
"print(\"# docs = %d, # splits = %d\" % (len(documents), len(split_docs)))" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "e780d77f-3f96-4690-a10f-f87566f7ccc6", | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"source": [ | ||
"## Create vectordb instance" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "15702d9c", | ||
"metadata": {}, | ||
"source": [ | ||
"### Use Apache Doris as vectordb" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"id": "ced7dbe1", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-02-14T12:55:39.508287Z", | ||
"start_time": "2024-02-14T12:55:39.500370Z" | ||
} | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"def gen_apache_doris(update_vectordb, embeddings, settings):\n", | ||
" if update_vectordb:\n", | ||
" docsearch = ApacheDoris.from_documents(split_docs, embeddings, config=settings)\n", | ||
" else:\n", | ||
" docsearch = ApacheDoris(embeddings, settings)\n", | ||
" return docsearch" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "15d86fda", | ||
"metadata": {}, | ||
"source": [ | ||
"## Convert tokens into embeddings and put them into vectordb" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "ff1322ea", | ||
"metadata": {}, | ||
"source": [ | ||
"Here we use Apache Doris as vectordb, you can configure Apache Doris instance via `ApacheDorisSettings`.\n", | ||
"\n", | ||
"Configuring Apache Doris instance is pretty much like configuring mysql instance. You need to specify:\n", | ||
"1. host/port\n", | ||
"2. username(default: 'root')\n", | ||
"3. password(default: '')\n", | ||
"4. database(default: 'default')\n", | ||
"5. table(default: 'langchain')" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"id": "b34f8c31-c173-4902-8168-2e838ddfb9e9", | ||
"metadata": { | ||
"ExecuteTime": { | ||
"end_time": "2024-02-14T12:56:02.671291Z", | ||
"start_time": "2024-02-14T12:55:48.350294Z" | ||
}, | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"import os\n", | ||
"from getpass import getpass\n", | ||
"\n", | ||
"os.environ[\"OPENAI_API_KEY\"] = getpass()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "c53ab3f2-9e34-4424-8b07-6292bde67e14", | ||
"metadata": { | ||
"collapsed": false | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"update_vectordb = True\n", | ||
"\n", | ||
"embeddings = OpenAIEmbeddings()\n", | ||
"\n", | ||
"# configure Apache Doris settings(host/port/user/pw/db)\n", | ||
"settings = ApacheDorisSettings()\n", | ||
"settings.port = 9030\n", | ||
"settings.host = \"172.30.34.130\"\n", | ||
"settings.username = \"root\"\n", | ||
"settings.password = \"\"\n", | ||
"settings.database = \"langchain\"\n", | ||
"docsearch = gen_apache_doris(update_vectordb, embeddings, settings)\n", | ||
"\n", | ||
"print(docsearch)\n", | ||
"\n", | ||
"update_vectordb = False" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "bde66626", | ||
"metadata": {}, | ||
"source": [ | ||
"## Build QA and ask question to it" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "84921814", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"llm = OpenAI()\n", | ||
"qa = RetrievalQA.from_chain_type(\n", | ||
" llm=llm, chain_type=\"stuff\", retriever=docsearch.as_retriever()\n", | ||
")\n", | ||
"query = \"what is apache doris\"\n", | ||
"resp = qa.run(query)\n", | ||
"print(resp)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.6" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.