We are constantly improving LangChain's self-query retriever. Some of the features are not merged yet.
Yet another chat-with-documents app, but one that supports queries over millions of files with MyScale and LangChain.
ChatData is a robust chat-with-documents application designed to extract information and provide answers by querying the MyScale free knowledge base or your uploaded documents.
Powered by the Retrieval Augmented Generation (RAG) framework, ChatData leverages millions of Wikipedia pages and arXiv papers as its external knowledge base, with MyScale managing all data hosting tasks. Simply input your questions in natural language, and ChatData takes care of generating SQL, querying the data, and presenting the results.
Enhancing your chat experience, ChatData introduces three key features. Let's delve into each of them in detail.
MyScale works closely with LangChain, providing the easiest interface for building complex queries with LLMs.
Self-querying retriever: MyScale augmented LangChain's Self Querying Retriever so that the LLM can use more data types, for instance timestamps and arrays of strings, to build filters for the query.
Vector SQL: SQL is powerful and can be used to construct complex search queries. Vector Structured Query Language (Vector SQL) is designed to teach LLMs how to query SQL vector databases. Besides the general data types and functions, Vector SQL contains extra functions like DISTANCE(column, query_vector) and NeuralArray(entity), which extend standard SQL for vector search.
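To make the role of NeuralArray(entity) more concrete, here is a minimal Python sketch of one way such a placeholder could be expanded into a vector literal before a query is sent to the database. The helper names (toy_embed, expand_neural_array), the regular expression, and the table name papers are assumptions made for illustration; this is not LangChain's or MyScale's actual implementation, and a real app would call an embedding model instead of the toy function.

```python
import re
from typing import Callable, List


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model (2 dimensions only)."""
    return [float(len(text)), float(sum(map(ord, text)) % 7)]


# Matches NeuralArray(<entity>) where the entity contains no parentheses.
_NEURAL_ARRAY = re.compile(r"NeuralArray\(([^()]*)\)")


def expand_neural_array(sql: str, embed: Callable[[str], List[float]]) -> str:
    """Replace each NeuralArray(entity) with a literal array of floats."""

    def _sub(match):
        # Drop surrounding whitespace and optional quotes around the entity.
        entity = match.group(1).strip().strip("'\"")
        vector = embed(entity)
        return "[" + ", ".join(repr(x) for x in vector) + "]"

    return _NEURAL_ARRAY.sub(_sub, sql)


# "papers" is a hypothetical table name used only for this example.
vector_sql = (
    "SELECT title FROM papers "
    "ORDER BY DISTANCE(vector, NeuralArray(graph neural networks)) LIMIT 3"
)
expanded = expand_neural_array(vector_sql, toy_embed)
print(expanded)
```

The point of the sketch is the division of labor: the LLM only has to write the entity it wants to search for, and the embedding step fills in the numbers afterwards.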
To enhance your experience and seamlessly continue interactions with existing sessions, ChatData has introduced the Session Management feature. You can easily customize your session ID and modify your prompt to guide ChatData in addressing your queries. With just a few clicks, you can enjoy smooth and personalized session interactions.
In addition to tapping into ChatData's external knowledge base powered by MyScale for answers, you also have the option to upload your own files and establish a personalized knowledge base. We've implemented the Unstructured API for this purpose, ensuring that only processed texts from your documents are stored, prioritizing your data privacy.
In conclusion, ChatData lets you navigate vast amounts of data and retrieve precisely what you need. Whether you're a researcher, a student, or a knowledge enthusiast, ChatData helps you explore academic papers and research documents through natural-language questions.
Dive in and experience ChatData on Hugging Face.
Database credentials:
MYSCALE_HOST = "msc-950b9f1f.us-east-1.aws.myscale.com"
MYSCALE_PORT = 443
MYSCALE_USER = "chatdata"
MYSCALE_PASSWORD = "myscale_rocks"
ChatData also gives you access to Wikipedia, a large knowledge base that contains about 36 million paragraphs from 5 million wiki pages. The knowledge base is a snapshot from December 2022.
You can query this table with the public account credentials listed above.
CREATE TABLE wiki.Wikipedia (
    -- Record ID
    `id` String,
    -- Title of the page this paragraph belongs to
    `title` String,
    -- Paragraph text
    `text` String,
    -- Page URL
    `url` String,
    -- Wiki page ID
    `wiki_id` UInt64,
    -- View statistics
    `views` Float32,
    -- Paragraph ID
    `paragraph_id` UInt64,
    -- Language ID
    `langs` UInt32,
    -- Feature vector of this paragraph
    `emb` Array(Float32),
    -- Vector index
    VECTOR INDEX emb_idx emb TYPE MSTG('metric_type=Cosine'),
    CONSTRAINT emb_len CHECK length(emb) = 768
) ENGINE = ReplacingMergeTree ORDER BY id SETTINGS index_granularity = 8192
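As a rough illustration of how this table could be searched, the Python sketch below builds a nearest-neighbor query over the emb column. It assumes MyScale's distance() function for vector search and a clickhouse_connect client for execution. The function names are hypothetical, and the source does not say which embedding model produced emb, so the query vector is left to the caller and must come from that same model.

```python
from typing import List, Sequence

EMB_DIM = 768  # matches the emb_len CHECK constraint in the schema above


def build_wiki_search_sql(query_vector: Sequence[float], top_k: int = 5) -> str:
    """Return a SQL string that ranks paragraphs by distance to query_vector."""
    if len(query_vector) != EMB_DIM:
        raise ValueError(f"expected {EMB_DIM} dimensions, got {len(query_vector)}")
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    vector_literal = "[" + ", ".join(str(float(x)) for x in query_vector) + "]"
    return (
        "SELECT title, text, url, "
        f"distance(emb, {vector_literal}) AS dist "
        "FROM wiki.Wikipedia "
        "ORDER BY dist ASC "
        f"LIMIT {int(top_k)}"
    )


def search_wiki(client, query_vector: Sequence[float], top_k: int = 5) -> List[tuple]:
    """Run the query with a clickhouse_connect client (needs network access)."""
    return client.query(build_wiki_search_sql(query_vector, top_k)).result_rows


# Build (but do not execute) a query for an all-zero placeholder vector.
sql = build_wiki_search_sql([0.0] * EMB_DIM, top_k=3)
print(sql[:60] + " ...")
```

Keeping the SQL builder separate from the network call makes it easy to inspect or log the generated statement before running it against the public account.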
ChatData brings millions of papers into your knowledge base. We imported 2.2 million arXiv papers along with their metadata. The table contains:
- id: paper's arXiv ID
- abstract: paper's abstract, used as the ranking criterion (with InstructorXL)
- vector: column that contains the vector array in Array(Float32)
- metadata: LangChain VectorStore-compatible columns
  - metadata.authors: paper's authors as a list of strings
  - metadata.abstract: paper's abstract, used as the ranking criterion (with InstructorXL)
  - metadata.titles: paper's title
  - metadata.categories: paper's categories as a list of strings like ["cs.CV"]
  - metadata.pubdate: paper's date of publication as an ISO 8601 formatted string
  - metadata.primary_category: paper's primary category as a string defined by arXiv
  - metadata.comment: additional comments on the paper
The columns below are native MyScale columns and can only be used through SQLDatabase:
- authors: paper's authors as a list of strings
- titles: paper's title
- categories: paper's categories as a list of strings like ["cs.CV"]
- pubdate: paper's date of publication in the Date32 data type (faster)
- primary_category: paper's primary category as a string defined by arXiv
- comment: additional comments on the paper
For the overall table schema, please refer to the table creation section in docs/self-query.md.
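To illustrate how the native columns can be combined with vector search in plain SQL, here is a small Python sketch that builds a filtered query. The table name is passed in by the caller because it is not given here (papers_example in the usage lines is hypothetical), the function name is made up for this example, and the sketch assumes MyScale's distance() function together with ClickHouse's has() and toDate32() functions.

```python
import re
from datetime import date
from typing import Optional, Sequence

# Loose check for arXiv-style category names such as "cs.CV" or "math-ph".
_CATEGORY = re.compile(r"^[A-Za-z\-]+(\.[A-Za-z\-]+)?$")


def build_arxiv_filter_sql(
    table: str,
    query_vector: Sequence[float],
    category: Optional[str] = None,
    published_after: Optional[date] = None,
    top_k: int = 5,
) -> str:
    """Build a vector search query filtered on the native columns."""
    conditions = []
    if category is not None:
        if not _CATEGORY.match(category):
            raise ValueError(f"unexpected category format: {category!r}")
        # categories is an array of strings, so test membership with has().
        conditions.append(f"has(categories, '{category}')")
    if published_after is not None:
        # pubdate is a Date32 column, so compare against a Date32 value.
        conditions.append(f"pubdate >= toDate32('{published_after.isoformat()}')")
    where = f"WHERE {' AND '.join(conditions)} " if conditions else ""
    vector_literal = "[" + ", ".join(str(float(x)) for x in query_vector) + "]"
    return (
        f"SELECT id, titles, pubdate, distance(vector, {vector_literal}) AS dist "
        f"FROM {table} {where}ORDER BY dist ASC LIMIT {int(top_k)}"
    )


# "papers_example" is a hypothetical table name; the vector is a placeholder.
filtered_sql = build_arxiv_filter_sql(
    "papers_example", [0.1, 0.2], category="cs.CV",
    published_after=date(2023, 1, 1), top_k=2,
)
print(filtered_sql)
```

A real query vector would need the same dimensionality and embedding model as the vector column; the two-element vector above only keeps the printed statement short.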
If you want to use this database with langchain.chains.sql_database.base.SQLDatabaseChain or langchain.retrievers.SQLDatabaseRetriever, please follow the guides in the data preparation section and the chain creation section of docs/vector-sql.md.
Or use the MyScale database directly as a service, for free:
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='msc-950b9f1f.us-east-1.aws.myscale.com',
    port=443,
    username='chatdata',
    password='myscale_rocks'
)
- Upload your documents and chat with your own knowledge bases with MyScale!
- Chat with RAG-enabled agents on both the arXiv and Wikipedia knowledge bases!
- Wikipedia is available as a knowledge base! Feel free to ask questions across 36 million paragraphs under 5 million titles!
- LLMs are now capable of writing Vector SQL, an extended SQL with vector search! Vector SQL gives you faster and more powerful access to MyScale. This will be added to LangChain soon! (PR 7454)
- Customized Retrieval QA Chain that gives you more information on each PDF and answers questions in your native language!
- Our contribution to LangChain that helps self-query retrievers filter with more types and functions
- We just opened a FREE pod hosting data for arXiv papers. Anyone can try their own SQL with vector search! Feel the power when SQL meets vector search! See how to access the pod here.
- We collected about 2 million papers from arXiv! We are collecting more and we need your advice!
- More coming...
- Enter the app/ directory
cd app/
- Create a virtual environment
python3 -m venv venv
source venv/bin/activate
- Install dependencies
python3 -m pip install -r requirements.txt
- Run the app!
# copy the example secrets file, then fill in your OpenAI key in .streamlit/secrets.toml
cp .streamlit/secrets.example.toml .streamlit/secrets.toml
# start the app
python3 -m streamlit run app.py
- Why Vector SQL?
- How did LangChain and MyScale convert natural language to structured filters?
- How to make chain execution more responsive in LangChain?
- How is this app built?
- What is the overall pipeline?
- Join our #ChatData channel on Discord to discuss anything about ChatData.
- Feel free to file an issue or open a PR against this repository.
- arXiv API for its open-access interoperability with preprint papers.
- InstructorXL for its promptable embeddings, which improve retrieval performance.
- LangChain for its easy-to-use and composable API designs and prompts.
- OpenChatPaper for its prompt design reference.
- The Alexandria Index for providing an arXiv data index to the public.