prepare for the new release
lfoppiano committed Jun 24, 2024
1 parent 6eee84d commit 041e0aa
Showing 3 changed files with 32 additions and 11 deletions.
20 changes: 20 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,26 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [0.4.0] - 2024-06-24

### Added
+ Added selection of embedding functions
+ Added selection of text from the PDF viewer (provided by https://github.com/lfoppiano/streamlit-pdf-viewer)
+ Added an experimental feature for calculating a coefficient that relates the question to the embedding database
+ Added the data availability statement in the searchable text

### Changed
+ Removed obsolete and non-working models zephyr and mistral v0.1
+ The underlying library was refactored to make it easier to maintain
+ Removed the native PDF viewer
+ Updated langchain and streamlit to the latest versions
+ Removed conversational memory which was causing more problems than bringing benefits
+ Rearranged the interface to get more space

### Fixed
+ Updated and removed models that were not working
+ Fixed problems with langchain and other libraries

## [0.3.4] - 2023-12-26

### Added
18 changes: 10 additions & 8 deletions README.md
@@ -21,17 +21,14 @@ https://lfoppiano-document-qa.hf.space/
## Introduction

Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, GPT4, GPT4-Turbo, Mistral-7b-instruct and Zephyr-7b-beta.
- The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
+ The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents.
**Different from most projects**, we focus on scientific articles and extract the text from structured documents.
We target only the full text using [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than raw PDF2Text conversion (the approach most other solutions rely on).

Additionally, this frontend provides the visualisation of named entities in LLM responses, extracting <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).

The conversation is kept in memory by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".

(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)


[<img src="https://img.youtube.com/vi/M4UaYs5WKGs/hqdefault.jpg" height="300" align="right"
/>](https://www.youtube.com/embed/M4UaYs5WKGs)

@@ -46,6 +43,9 @@ The conversation is kept in memory by a buffered sliding window memo

## Documentation

### Embedding selection
In the latest version it is possible to select both the embedding function and the LLM. There are some limitations: OpenAI embeddings cannot be used with open-source models, and vice versa.
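The pairing constraint described above can be sketched as follows; the model names and groupings here are illustrative placeholders, not the application's actual lists:

```python
# Hypothetical sketch of the embeddings/LLM pairing constraint: OpenAI LLMs
# pair only with OpenAI embeddings, open-source LLMs only with open-source
# embeddings. All names below are examples, not the app's real model lists.
OPENAI_LLMS = {"gpt-3.5-turbo", "gpt-4", "gpt-4-turbo"}
OPENAI_EMBEDDINGS = ["text-embedding-ada-002"]
OPEN_SOURCE_EMBEDDINGS = ["all-MiniLM-L6-v2", "intfloat/e5-base-v2"]

def compatible_embeddings(llm_name: str) -> list[str]:
    """Return the embedding functions that may be paired with the chosen LLM."""
    if llm_name in OPENAI_LLMS:
        return OPENAI_EMBEDDINGS
    return OPEN_SOURCE_EMBEDDINGS

print(compatible_embeddings("gpt-4"))  # only the OpenAI embedding is offered
```

A UI would use such a helper to filter the embedding selectbox after the user picks an LLM.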

### Context size
Allows changing the number of blocks from the original document that are considered for responding.
The default size of each block is 250 tokens (which can be changed before uploading the first document).
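A minimal sketch of the block splitting described above, assuming a simple whitespace tokenizer (the application's actual splitter may count tokens differently):

```python
# Split a document into fixed-size blocks of roughly `block_size` tokens,
# approximating tokens by whitespace-separated words. This is a sketch of
# the idea, not the app's real chunking code.
def split_into_blocks(text: str, block_size: int = 250) -> list[str]:
    tokens = text.split()
    return [
        " ".join(tokens[i:i + block_size])
        for i in range(0, len(tokens), block_size)
    ]

blocks = split_into_blocks("word " * 600)  # 600 tokens -> blocks of 250/250/100
```

The retriever then selects the top-N such blocks as context, so a larger block size or a larger N both enlarge the context handed to the LLM.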
@@ -61,8 +61,9 @@ Larger blocks will result in a larger context less constrained around the questi

### Query mode
Indicates whether to send the question to the LLM or to the vector storage.
- - LLM (default) enables question/answering related to the document content.
- - Embeddings: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
+ - **LLM** (default) enables question/answering related to the document content.
+ - **Embeddings**: the response consists of the raw text from the document related to the question (based on the embeddings). This mode helps to diagnose why the answers are sometimes unsatisfying or incomplete.
+ - **Question coefficient** (experimental): provides a coefficient indicating how close the question is to the retrieved context.
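Both the embeddings mode and the question coefficient rest on vector similarity between the question and stored blocks; a sketch using cosine similarity over toy vectors (the app's actual coefficient computation may differ, and real embeddings come from the selected embedding function):

```python
import math

# Illustrative "question coefficient": cosine similarity between the question
# embedding and a retrieved block's embedding. The vectors below are toy
# values chosen for the example, not real embedding outputs.
def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

question_vec = [0.2, 0.7, 0.1]
block_vec = [0.25, 0.65, 0.05]
coeff = cosine_similarity(question_vec, block_vec)
# A coefficient near 1.0 suggests the retrieved context matches the question
# well; a value near 0 suggests the retrieval drifted away from it.
```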

### NER (Named Entities Recognition)
This feature is specifically crafted for people working with scientific documents in materials science.
@@ -102,8 +103,9 @@ To install the library with Pypi:

## Acknowledgement

- This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
- Contributed by [Pedro Ortiz Suarez](https://github.com/pjox), [Tomoya Mato](https://github.com/t29mato).
+ The project was initiated at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan.
+ Currently, development is possible thanks to [ScienciLAB](https://www.sciencialab.com).
+ [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team), [Pedro Ortiz Suarez](https://github.com/pjox), and [Tomoya Mato](https://github.com/t29mato) contributed to the project.
Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).


5 changes: 2 additions & 3 deletions streamlit_app.py
@@ -299,7 +299,7 @@ def play_old_messages(container):
)

placeholder = st.empty()
- messages = st.container(height=300, border=False)
+ messages = st.container(height=300)

question = st.chat_input(
"Ask something about the article",
@@ -483,6 +483,5 @@ def generate_color_gradient(num_elements):
input=st.session_state['binary'],
annotation_outline_size=2,
annotations=st.session_state['annotations'],
- render_text=True,
- height=700
+ render_text=True
)
