Text Summarization Proof of Concept

This repository implements PDF document summarization using local Large Language Models (LLMs). Its purpose is to prototype simple methods for extracting and summarizing text from PDF documents (or other text-based documents converted to PDF).

The project is divided into two main components:

Text Extraction: Methods for extracting text from PDF files using Python libraries.
Summarization: A Podman container deploys an Ollama server, which the summarization notebooks use to summarize the extracted text.
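
The following is a minimal sketch of how a notebook might send extracted text to the Ollama server over its HTTP API. The model name `llama3`, the prompt wording, the helper names, and the assumption that the server is reachable on Ollama's default port (11434) on localhost are all illustrative; the notebooks in this repository may use a different client, model, or address.

```python
import json
import urllib.request

# Assumption: the Ollama server listens on its default port on localhost.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(text, model="llama3"):
    """Build a non-streaming generate request (model name is hypothetical)."""
    payload = {
        "model": model,
        "prompt": f"Summarize the following document:\n\n{text}",
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(text, model="llama3"):
    """Send the text to the server and return the generated summary."""
    with urllib.request.urlopen(build_request(text, model)) as response:
        return json.loads(response.read())["response"]
```

Calling `summarize(markdown_text)` requires a running server with the chosen model already pulled.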

Installation and Running

Requirements

Podman and podman-compose
GPU with CUDA support: To run the summarization processes on the GPU, you will need the NVIDIA and CUDA drivers as well as the NVIDIA Container Toolkit installed and up to date on your system.
Storage: expect to use approximately:
- ~12.5GB in container images
- ~40GB in LLM files

Install the repository as follows:

Clone the Repository:

```
git clone git@github.com:lfenzo/poc-text-summarization.git
cd poc-text-summarization
```

Start the Containers: Use Podman Compose to bring up all necessary containers:
```
podman compose up --build
```
Access the JupyterLab Instances: The previous command starts both the extraction and summarization containers. Because of the way Ollama and JupyterLab are launched in the summarization container, its JupyterLab URL is not printed to the console; run the following command to retrieve it:
```
podman exec poc-text-summarization_summarization_1 jupyter lab list
```
Since both JupyterLab and the Ollama server are spawned and managed by supervisord, the logs from these two services are not readily available from the podman compose up output. (If you know a better way to do this, please let me know by opening an issue in this repo.)

The extraction JupyterLab URL is shown in the logs and can be copied from there to access the server in the browser.

Uninstalling

To stop and remove the running containers, run:

```
podman compose down
```

All LLM files are stored in summarization/models/ so removing the repository as a whole will also free all space used by LLMs.

Project Overview

Text Extraction

This container extracts text from PDF files and produces a markdown file as output. The objective of this step is not only to extract the text but also to preserve the basic hierarchical structure of headers, titles and subtitles. The operations performed include:

Exclusion of headers and footers: a token frequency-based heuristic removes headers and footers from the document text. In summary, the heuristic compares the frequency of tokens in the upper and lower portions of the pages and excludes token sequences whose frequency across the analysed pages exceeds a predefined threshold. Check the notebook extractions/pdfplumber.ipynb for implementation details.
Exclusion of tables and diagrams: bounding boxes corresponding to images, tables and diagrams are excluded from text selection and extraction. Depending on the combination of text extraction libraries used, these bounding boxes are used to overwrite the text in those regions.
Font size frequency analysis: a font frequency analysis associates each title with the correct heading level. As a general rule, the most frequent font in the document is treated as the default (body) font. Fonts with a larger size are treated as header/title fonts and sorted by size to determine the heading levels.
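
The header/footer and font size heuristics above can be sketched roughly as follows. This is a simplified illustration, not the code in the notebooks: it compares whole lines rather than token sequences, assumes pages are already split into lists of lines, and weights font sizes by character count. The function names, the `band` and `threshold` parameters, and their defaults are all hypothetical.

```python
import re
from collections import Counter


def _normalize(line):
    # Collapse digits so "Page 1" and "Page 2" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def find_repeated_edge_lines(pages, band=2, threshold=0.6):
    """Return normalized lines found in the top/bottom band of many pages.

    pages: list of pages, each a list of text lines from top to bottom.
    band: number of lines inspected at the top and bottom of each page.
    threshold: fraction of pages a line must appear on to be flagged.
    """
    counts = Counter()
    for lines in pages:
        edge = lines[:band] + lines[-band:]
        # A set ensures each page contributes at most once per line.
        counts.update({_normalize(line) for line in edge})
    min_pages = threshold * len(pages)
    return {line for line, n in counts.items() if n >= min_pages}


def strip_headers_and_footers(pages, band=2, threshold=0.6):
    """Drop flagged lines, but only where they sit in the edge bands."""
    repeated = find_repeated_edge_lines(pages, band, threshold)
    cleaned = []
    for lines in pages:
        first_footer = len(lines) - band
        cleaned.append([
            line for i, line in enumerate(lines)
            if not ((i < band or i >= first_footer)
                    and _normalize(line) in repeated)
        ])
    return cleaned


def heading_levels(spans, max_levels=3):
    """Map font sizes larger than the body font to heading levels.

    spans: list of (text, font_size) pairs.
    The size covering the most characters is taken as the body font
    (an assumption; counting spans instead is another option).
    """
    chars_per_size = Counter()
    for text, size in spans:
        chars_per_size[round(size, 1)] += len(text)
    body_size = chars_per_size.most_common(1)[0][0]
    larger = sorted((s for s in chars_per_size if s > body_size), reverse=True)
    return {size: level for level, size in enumerate(larger[:max_levels], start=1)}
```

Restricting the removal to the top and bottom bands keeps a repeated phrase in the body text from being deleted just because it also appears in a header.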

Extractors implemented/used:

pdfplumber
pymupdf
pymupdf4llm
pdfplumber-based with a header/footer deletion heuristic and table/diagrams exclusion
pymupdf-based with table/images bounding box overwriting with pdfplumber.

As a sample, this repository contains a single PDF file in extraction/input/ (digital-thermometer-ds18b20.pdf), used during development. It has several formatting peculiarities such as double-column text, headers and footers, embedded diagrams, and single-column and double-column tables. The outputs for each method are stored in the outputs/ directory.