From 041e0aa4072fa907dc7621df7ad90660c46cc4dd Mon Sep 17 00:00:00 2001
From: Luca Foppiano
Date: Mon, 24 Jun 2024 23:16:30 +0900
Subject: [PATCH] prepare for the new release

---
 CHANGELOG.md     | 20 ++++++++++++++++++++
 README.md        | 32 ++++++++++++++++++++++++--------
 streamlit_app.py |  7 ++++---
 3 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 21cc72b..dd45a0e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,26 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [0.4.0] - 2024-06-24
+
+### Added
++ Added selection of embedding functions
++ Added text selection in the PDF viewer (provided by https://github.com/lfoppiano/streamlit-pdf-viewer)
++ Added an experimental feature that calculates a coefficient relating the question to the content retrieved from the embedding database
++ Added the data availability statement to the searchable text
+
+### Changed
++ Removed the obsolete, non-working models Zephyr and Mistral v0.1
++ Refactored the underlying library to make it easier to maintain
++ Removed the native PDF viewer
++ Updated langchain and streamlit to the latest versions
++ Removed the conversational memory, which caused more problems than it brought benefits
++ Rearranged the interface to gain more space
+
+### Fixed
++ Updated and removed models that were not working
++ Fixed problems with langchain and other libraries
+
 ## [0.3.4] - 2023-12-26
 
 ### Added
diff --git a/README.md b/README.md
index ce72730..23ccf5f 100644
--- a/README.md
+++ b/README.md
@@ -21,17 +21,14 @@ https://lfoppiano-document-qa.hf.space/
 ## Introduction
 
 Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, GPT4, GPT4-Turbo, Mistral-7b-instruct and Zephyr-7b-beta.
-The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
+The streamlit application demonstrates the implementation of RAG (Retrieval Augmented Generation) on scientific documents.
 **Different to most of the projects**, we focus on scientific articles and we extract text from a structured document. We target only the full-text using [Grobid](https://github.com/kermitt2/grobid) which provides cleaner results than the raw PDF2Text converter (which is comparable with most of other solutions).
 
 Additionally, this frontend provides the visualisation of named entities on LLM responses to extract physical quantities, measurements (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and materials mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
 
-The conversation is kept in memory by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".
-
 (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
-
 [](https://www.youtube.com/embed/M4UaYs5WKGs)
 
@@ -46,6 +43,23 @@ The conversation is kept in memory by a buffered sliding window memory (top 4 mo
 
 ## Documentation
 
+### Embedding selection
+In the latest version it is possible to select both the embedding function and the LLM. There are some limitations: OpenAI embeddings cannot be used with open-source models, and vice versa.
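+
+As a rough sketch (the function and model names below are illustrative, not the application's actual code), the selection logic pairs the two choices along these lines:
+
+```python
+from langchain_community.embeddings import HuggingFaceEmbeddings
+from langchain_openai import OpenAIEmbeddings
+
+def pick_embeddings(llm_name: str):
+    # OpenAI chat models pair with OpenAI embeddings (requires OPENAI_API_KEY);
+    # open-source LLMs fall back to a local sentence-transformers model.
+    if llm_name.startswith("gpt-"):
+        return OpenAIEmbeddings()
+    return HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
+```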
+
 ### Context size
 Allow to change the number of blocks from the original document that are considered for responding.
 The default size of each block is 250 tokens (which can be changed before uploading the first document).
@@ -61,8 +61,9 @@ Larger blocks will result in a larger context less constrained around the questi
 
 ### Query mode
 Indicates whether sending a question to the LLM (Language Model) or to the vector storage.
- - LLM (default) enables question/answering related to the document content.
- - Embeddings: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
+ - **LLM** (default) enables question/answering related to the document content.
+ - **Embeddings**: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
+ - **Question coefficient** (experimental): provides a coefficient indicating how close the question is to the retrieved context.
 
 ### NER (Named Entities Recognition)
 This feature is specifically crafted for people working with scientific documents in materials science.
@@ -102,8 +103,9 @@ To install the library with Pypi:
 
 ## Acknowledgement
 
-This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
-Contributed by [Pedro Ortiz Suarez](https://github.com/pjox), [Tomoya Mato](https://github.com/t29mato).
+The project was initiated at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan.
+Development is currently made possible thanks to [ScienciLAB](https://www.sciencialab.com).
+The project has received contributions from [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team), [Pedro Ortiz Suarez](https://github.com/pjox), and [Tomoya Mato](https://github.com/t29mato).
 
 Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).
 
diff --git a/streamlit_app.py b/streamlit_app.py
index 729f584..5512ca7 100644
--- a/streamlit_app.py
+++ b/streamlit_app.py
@@ -299,7 +299,7 @@ def play_old_messages(container):
     )
 
     placeholder = st.empty()
-    messages = st.container(height=300, border=False)
+    messages = st.container(height=300)
 
     question = st.chat_input(
         "Ask something about the article",
@@ -483,6 +483,7 @@ def generate_color_gradient(num_elements):
         input=st.session_state['binary'],
         annotation_outline_size=2,
         annotations=st.session_state['annotations'],
-        render_text=True,
-        height=700
+        render_text=True
     )
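+    # With no explicit height, streamlit-pdf-viewer falls back to its default
+    # sizing and renders the whole document rather than a fixed 700px frame.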