From 1488ed1665e1bee7b4d891a32baadb303656ca18 Mon Sep 17 00:00:00 2001
From: Tyler Crawford
Date: Sat, 7 Dec 2024 16:00:14 -0500
Subject: [PATCH] Remove llimage_mvp.md

---
 llimage_mvp.md | 423 -------------------------------------------------
 1 file changed, 423 deletions(-)
 delete mode 100644 llimage_mvp.md

diff --git a/llimage_mvp.md b/llimage_mvp.md
deleted file mode 100644
index e8f43a5..0000000
--- a/llimage_mvp.md
+++ /dev/null
@@ -1,423 +0,0 @@
-Below is the proposed MVP implementation plan along with example code and documentation. Everything is designed for local use, with a Flask backend, a simple HTML/JS front-end, Tesseract OCR for images, and minimal logic for describing charts. The documentation and code comments are aimed at helping both a human developer and an automated coding agent understand and extend the project later.
-
-Overview of the MVP
-
-Name: LLiMage
-License: MIT
-Primary Goals:
- • Take a PDF as input.
- • Extract text from the PDF.
- • Extract images from the PDF and run OCR to identify any text in those images.
- • Produce a simplistic description of charts/images (e.g., “This appears to be a chart or image. Unable to fully interpret.”) for now.
- • Return a single plain text file containing all extracted text and image descriptions.
- • Provide a simple web interface that supports drag-and-drop or file-browse upload of a PDF.
- • Run entirely locally, using Tesseract OCR.
- • Produce logs in both the console and a log file for troubleshooting.
- • Keep security, privacy, and FOSS principles in mind.
-
-Key Technologies:
- • Python 3.x for backend code.
- • Flask for the web server.
- • pdfplumber for PDF text extraction.
- • pdf2image or pikepdf + Pillow for extracting and converting PDF pages or images into images suitable for OCR.
- • pytesseract (Tesseract) for OCR.
- • HTML/CSS/JS for a basic front-end interface.
-
-Assumptions & Limitations:
- • MVP handles small PDFs (1-5 pages).
- • Limited chart/image interpretation: just describe the presence of images and any embedded text found via OCR.
- • English only.
- • If OCR or image analysis fails, we provide a generic fallback description.
- • Output is a single .txt file summarizing text and images in the order they appear.
- • Detailed logs written to logs/LLiMage.log and also shown in the console.
- • No progress bar or “view logs” button in this MVP, just a drag-and-drop and file download link.
-
-File/Directory Structure
-
-LLiMage/
-├─ app.py              # Flask application
-├─ static/
-│  ├─ style.css        # Basic CSS
-│  └─ script.js        # JS for handling drag-and-drop
-├─ templates/
-│  └─ index.html       # Main UI
-├─ requirements.txt    # Project dependencies
-├─ README.md           # Basic instructions
-└─ logs/
-   └─ LLiMage.log      # Log file (created at runtime)
-
-Example requirements.txt
-
-Flask==2.2.5
-pdfplumber==0.9.0
-pytesseract==0.3.10
-pillow==9.1.0
-pdf2image==1.16.3
-
-(Versions are examples; you may adjust as needed.)
-
-Note: Ensure Tesseract OCR is installed on the system (e.g., brew install tesseract on macOS, or equivalent on Linux/Windows).
-
-README.md (Initial Simple Version)
-
-# LLiMage (MVP)
-
-## Overview
-LLiMage is a local tool for extracting text and basic descriptions of images from PDFs, outputting the results as plain text. The MVP uses a simple web interface built with Flask and runs entirely on your machine.
-
-## Features
-- Drag-and-drop or file upload your PDF
-- Extracts all text from the PDF
-- Identifies images and runs OCR to extract any readable text from them
-- Provides a simplistic description for images (fallback if it’s unclear)
-- Outputs a single `.txt` file containing all extracted information
-- Fully local: no external data sharing unless you choose to add it later
-
-## Requirements
-- Python 3.x
-- Tesseract OCR installed locally (e.g., `brew install tesseract` on macOS)
-- Dependencies in `requirements.txt`
-
-## Installation
-1. Clone this repository.
-2. Install dependencies: `pip install -r requirements.txt`
-3. Ensure Tesseract is installed and accessible in your system’s PATH.
-
-## Usage
-1. Run the Flask app: `python app.py`
-2. Open `http://127.0.0.1:5000` in your browser.
-3. Drag and drop or select a PDF file.
-4. Once processed, a download link for the `.txt` file will be provided.
-
-## License
-MIT License. See `LICENSE` file for details.
-
-## Security and Privacy
-- All processing is done locally by default.
-- No external calls unless explicitly added later.
-- Logs are stored locally. Secure and monitor these logs if working with sensitive documents.
-
-## Future Plans
-- Support for more detailed chart descriptions.
-- Multiple-page PDFs with better performance.
-- Support for more output formats (JSON, structured data).
-- Internationalization and multi-language OCR.
-
-app.py (Main Flask Application)
-
-import os
-import logging
-from flask import Flask, render_template, request, send_file
-import pdfplumber
-from pdf2image import convert_from_path
-from PIL import Image
-import pytesseract
-import io
-import uuid
-
-# === Logging Setup ===
-if not os.path.exists('logs'):
-    os.makedirs('logs', exist_ok=True)
-
-logging.basicConfig(
-    filename='logs/LLiMage.log',
-    filemode='a',
-    format='%(asctime)s - %(levelname)s - %(message)s',
-    level=logging.DEBUG
-)
-
-console_handler = logging.StreamHandler()
-console_handler.setLevel(logging.DEBUG)
-logging.getLogger('').addHandler(console_handler)
-
-logging.info("LLiMage application started.")
-
-app = Flask(__name__)
-
-# === Configuration ===
-# In the future, we can allow config via environment variables or a config file
-# For MVP: just defaults
-TESSERACT_CMD = "tesseract"  # Ensure it's in PATH or update path if needed
-pytesseract.pytesseract.tesseract_cmd = TESSERACT_CMD
-
-@app.route('/', methods=['GET'])
-def index():
-    return render_template('index.html')
-
-@app.route('/process', methods=['POST'])
-def process_pdf():
-    # Check if a file is uploaded
-    if 'file' not in request.files:
-        logging.error("No file part in request.")
-        return "No file uploaded.", 400
-
-    pdf_file = request.files['file']
-    if pdf_file.filename == '':
-        logging.error("Empty filename received.")
-        return "No file selected.", 400
-
-    # Save PDF temporarily
-    pdf_filename = f"temp_{uuid.uuid4()}.pdf"
-    pdf_file.save(pdf_filename)
-    logging.info(f"PDF saved as {pdf_filename}")
-
-    # Extract text and images
-    try:
-        output_text = process_document(pdf_filename)
-    except Exception as e:
-        logging.exception("Error processing the PDF.")
-        return f"Error processing the PDF: {e}", 500
-    finally:
-        # Clean up temporary file
-        if os.path.exists(pdf_filename):
-            os.remove(pdf_filename)
-
-    # Create a text file to send back
-    result_filename = f"result_{uuid.uuid4()}.txt"
-    with open(result_filename, 'w', encoding='utf-8') as f:
-        f.write(output_text)
-
-    logging.info(f"Processing complete. Result saved in {result_filename}")
-
-    return send_file(result_filename, as_attachment=True, mimetype='text/plain', download_name='LLiMage_output.txt')
-
-def process_document(pdf_path):
-    """
-    Process the PDF:
-    1. Extract text from pages.
-    2. Extract images from pages.
-    3. For each image, run OCR and produce a basic description.
-    4. Combine all text + image descriptions into a single text output.
-    """
-    logging.debug("Starting document processing.")
-    text_output = []
-    with pdfplumber.open(pdf_path) as pdf:
-        num_pages = len(pdf.pages)
-        for i, page in enumerate(pdf.pages, start=1):
-            logging.debug(f"Processing page {i}/{num_pages}")
-            # Extract text
-            page_text = page.extract_text() or ""
-            text_output.append(f"=== Page {i} Text ===\n{page_text}\n")
-
-            # Convert page to image(s) - typically one image per page, but
-            # we may get multiple images if the PDF is complex.
-            # For the MVP, let's convert the whole page to an image and treat that as "the image".
-            # Future versions could extract individual images from the page.
-            page_images = convert_from_path(pdf_path, first_page=i, last_page=i)
-            # Usually one image per page in this scenario
-            for img_index, img in enumerate(page_images, start=1):
-                logging.debug(f"OCR on page image {i}-{img_index}")
-                # Convert to a pillow image in RGB
-                img = img.convert('RGB')
-
-                # OCR the image
-                ocr_text = pytesseract.image_to_string(img).strip()
-
-                if ocr_text:
-                    # Basic heuristic: If we got text, let's just describe it simply
-                    description = f"=== Page {i} Image {img_index} Description ===\nThis image contains text:\n{ocr_text}\n"
-                else:
-                    # If no text, fallback to a generic description
-                    description = f"=== Page {i} Image {img_index} Description ===\nThis appears to be an image or chart, but no readable text was found.\n"
-
-                text_output.append(description)
-
-    full_text = "\n".join(text_output)
-    logging.debug("Document processing complete.")
-    return full_text
-
-if __name__ == '__main__':
-    app.run(debug=True)
-
-templates/index.html
-
-
-
-
-
-
-LLiMage MVP
-
-
-
-

LLiMage MVP

-

Drag and drop your PDF here, or click to select a file.

-
-
- - -
-
-
-
-
-
-
-static/style.css
-
-body {
-    font-family: sans-serif;
-    margin: 20px;
-}
-
-#drop-area {
-    border: 2px dashed #ccc;
-    border-radius: 10px;
-    padding: 20px;
-    text-align: center;
-}
-
-#drop-area.hover {
-    border-color: #333;
-}
-
-.button {
-    padding: 10px 20px;
-    background: #4f8ff7;
-    color: #fff;
-    text-decoration: none;
-    border-radius: 5px;
-    cursor: pointer;
-    display: inline-block;
-}
-
-#message {
-    margin-top: 20px;
-}
-
-static/script.js
-
-const dropArea = document.getElementById("drop-area");
-const fileElem = document.getElementById("fileElem");
-const messageDiv = document.getElementById("message");
-
-['dragenter', 'dragover', 'dragleave', 'drop'].forEach(eventName => {
-    dropArea.addEventListener(eventName, preventDefaults, false)
-});
-
-function preventDefaults (e) {
-    e.preventDefault();
-    e.stopPropagation();
-}
-
-['dragenter', 'dragover'].forEach(eventName => {
-    dropArea.addEventListener(eventName, () => dropArea.classList.add('hover'), false)
-});
-
-['dragleave', 'drop'].forEach(eventName => {
-    dropArea.addEventListener(eventName, () => dropArea.classList.remove('hover'), false)
-});
-
-dropArea.addEventListener('drop', handleDrop, false);
-dropArea.addEventListener('click', () => fileElem.click());
-
-fileElem.addEventListener('change', handleFiles, false);
-
-function handleDrop(e) {
-    const dt = e.dataTransfer;
-    const files = dt.files;
-    handleFiles({target: {files: files}});
-}
-
-function handleFiles(e) {
-    const files = e.target.files;
-    if (files.length > 0) {
-        uploadFile(files[0]);
-    }
-}
-
-function uploadFile(file) {
-    messageDiv.innerText = "Processing...";
-    const formData = new FormData();
-    formData.append('file', file);
-
-    fetch('/process', {
-        method: 'POST',
-        body: formData
-    })
-    .then(response => {
-        if (!response.ok) {
-            return response.text().then(text => {throw new Error(text)});
-        }
-        return response.blob();
-    })
-    .then(blob => {
-        const url = window.URL.createObjectURL(blob);
-        const a = document.createElement('a');
-        a.href = url;
-        a.download = 'LLiMage_output.txt';
-        document.body.appendChild(a);
-        a.click();
-        a.remove();
-        messageDiv.innerText = "Download complete. Check your file.";
-    })
-    .catch(error => {
-        console.error(error);
-        messageDiv.innerText = "Error processing file. Check logs.";
-    });
-}
-
-Inline Developer Guidance
-
-In the Code Comments:
- • app.py includes comments explaining each step.
- • Mention where to improve OCR, image analysis.
- • Explain how to switch to JSON output later.
- • Mention adding a config file or environment variables in the future.
-
-In the Documentation (README):
- • Suggest potential improvements (chart detection, structured output).
- • Mention testing framework plans.
- • Indicate where security considerations apply (e.g., keep Tesseract updated, run on a secure machine, consider sandboxing).
-
-Basic Automated Testing:
-For the MVP, we can add a very simple test just to confirm the server runs and a sample PDF is processed:
-
-Create a tests/ directory with a basic test (this can be minimal):
-
-tests/test_basic.py
-
-import os
-import tempfile
-import pytest
-from app import app
-
-@pytest.fixture
-def client():
-    app.config['TESTING'] = True
-    with app.test_client() as client:
-        yield client
-
-def test_index_page(client):
-    rv = client.get('/')
-    assert rv.status_code == 200
-    assert b"LLiMage" in rv.data
-
-# Additional tests would require a sample PDF in the repo.
-# For MVP, we can add a note to test later with actual PDF files.
-
-Note: In a real scenario, you’d place a small sample PDF in tests/sample.pdf and write a test that POSTs it to the /process route, then checks the output. For the MVP, this is a placeholder.
-
-Security Considerations (MVP)
- • All processing is local by default. No network calls made automatically.
- • Validate file uploads are PDFs. (For MVP, we trust the user input, but later we can add MIME checks or PDF validation.)
- • Log files contain potentially sensitive text. Advise users to secure them.
- • Consider sandboxing Tesseract in future releases.
- • Consider adding HTTPS and user authentication for a deployed scenario. For local MVP, this may not be necessary but can be added later.
- • No external dependencies beyond mentioned libraries. Keep them updated and check for known vulnerabilities regularly.
-
-Future Roadmap (Post-MVP)
-
-Phase 2 and beyond:
- • More Detailed Image Descriptions:
-Implement heuristics or machine learning models to classify chart types (bar, line, pie) and extract axis labels, data points, etc.
- • Structured Output Formats:
-Add a JSON endpoint or a toggle to output in structured formats.
- • Internationalization:
-Add support for multiple OCR languages, possibly by allowing a configuration setting or environment variable.
- • UI Enhancements:
-Add progress bars, image previews, and display logs or results directly in the browser.
- • Scalability:
-Handle larger and more complex PDFs efficiently.
-
-This completes the MVP code and documentation.
\ No newline at end of file
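
Reviewer note: the security section of the removed llimage_mvp.md says uploads are trusted for the MVP and that "MIME checks or PDF validation" could be added later. A minimal sketch of such a check follows, in case the plan is revived. The helper name `looks_like_pdf` and the 1024-byte search window are assumptions of this sketch and are not part of the removed file; it only looks for the `%PDF-` header marker and does not prove the file is a well-formed PDF.

```python
import io

# Marker that opens a PDF file header, e.g. b"%PDF-1.7".
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(stream, window=1024):
    """Return True if the PDF header marker appears near the start of `stream`.

    `stream` is any seekable binary file-like object. The lenient `window`
    (rather than requiring the marker at byte 0) is a choice of this sketch.
    """
    start = stream.tell()
    head = stream.read(window)
    stream.seek(start)  # rewind so the caller can still save or parse the upload
    return PDF_MAGIC in head


if __name__ == "__main__":
    # In-memory stand-ins for uploaded files.
    print(looks_like_pdf(io.BytesIO(b"%PDF-1.7\n%binary data")))
    print(looks_like_pdf(io.BytesIO(b"GIF89a not a pdf")))
```

In the removed `process_pdf` route, a check like this could plausibly run on the upload's underlying stream before `pdf_file.save(...)`, returning a 400 response when it fails; that wiring is hypothetical and untested here.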