OCR and Document Search Web Application

The OCR and Document Search Web Application is designed to extract text from uploaded images using Optical Character Recognition (OCR) and provide a keyword-based search within the extracted text. This tool helps users quickly scan documents and search for specific information. This project leverages the Tesseract OCR engine to provide accurate text extraction capabilities, supporting multiple languages, including Hindi and English. The application is built to enhance image quality before processing, ensuring better recognition accuracy through DPI adjustments and format conversions.

Features

Upload images and extract text via OCR.
Perform keyword-based search in the extracted text.
Simple, user-friendly interface. -Copy extracted text to clipboard.

Technologies Used

Frontend: HTML, CSS (Bootstrap), JavaScript
Backend: Java (Spring Boot), OCR Library (Tesseract)
Libraries:
- Tesseract for OCR.
- Spring Boot for backend services.
- Apache PDFBox for handling image files and adjusting DPI.
- Lombok to reduce boilerplate code and improve readability.
- SLF4J for logging purposes.
- Thymeleaf for rendering frontend.
- IntelliJ IDEA for Development enviroment.

Prerequisites

Ensure you have the following installed:

Java 8 or higher, this is built on Java 17 and Spring boot version 3.3.4
Maven (for dependency management)
Tesseract OCR (Install locally for OCR functionality)

Sample Inputs and Outputs

Input 1: Image Upload (OCR)
- Uploaded Image: Image with text such as a scanned document.
- Extracted Text: The text extracted from the uploaded image will appear in the text area.
Input 2: Search Keyword
- Keyword: "Invoice"
- Search Result: Found || Not Found

Screenshots

Home Page
Upload Image and Extract Text, Extracted Text Display

Hindi
English

Setup Instructions

Clone the Repository

  git clone https://github.com/hey-its-d2t2/OCR_Reader.git
  cd OCR_Reader

Install Tesseract OCR

The first step is to download the Tess4J API from the link https://sourceforge.net/projects/tess4j/
Extract the Files from the downloaded file
Open your IDE and make a new project
Link the jar file with your project. Refer this link . https://www.edureka.co/community/4028/how-to-import-a-jar-file-in-eclipse
Please migrate via this path “..\Tess4J-3.4.8-src\Tess4J\dist”.

OR

Read the Article from link https://www.geeksforgeeks.org/tesseract-ocr-with-java-with-examples/

OR

Simple extract the folder
Go to this path C:\Program Files\Tess4J-3.4.8-src\Tess4J\tessdata" || slected folder tessdata /path_to_tessdata_folder
select "C:\Program Files\Tess4J-3.4.8-src\Tess4J\tessdata" || path_to_tessdata_folder

Set the TESSDATA_PREFIX Environment Variable

Ensure you set the TESSDATA_PREFIX to the directory containing the tessdata folder. Here’s how you can do it:

Windows Command Prompt:

 set TESSDATA_PREFIX=D:\path\to\tessdata\

Linux/Mac Terminal:

export TESSDATA_PREFIX=/path/to/tessdata/

List Available Languages

You can list the available languages supported by Tesseract using the following command in the terminal or command prompt:

 tesseract --list-langs

This will display the languages that are installed and recognized by Tesseract.

3.1 To set the TESSDATA_PREFIX environment variable on Windows, you can follow these steps:

Setting TESSDATA_PREFIX in Windows

Open System Properties:
- Right-click on the This PC or My Computer icon on your desktop or in File Explorer.
- Select Properties.
- Click on Advanced system settings on the left side.
- In the System Properties window, go to the Advanced tab.
- Environment Variables:
  - Click on the Environment Variables button at the bottom of the window.
  - Create a New System Variable:
    - In the System variables section, click on New.
    - In the Variable name field, enter TESSDATA_PREFIX.
    - In the Variable value field, enter the path to your tessdata directory. For example: "C:\Program Files\Tess4J-3.4.8-src\Tess4J\tessdata"
    - Click OK to save the new variable.
    - Close All Windows:
- Click OK in the Environment Variables window and in the System Properties window to apply your changes.
Verify the Variable:
Open a new Command Prompt window and type:
```
echo %TESSDATA_PREFIX%
```
- This should display the path you set.

Backend Setup

a. Update path in AppConstants.java according to your path of tessdata Example:

   public static final String PATH ="C:\\MyProj\\ORC-Reader\\Tess4J-3.4.8-src\\Tess4J\\tessdata"; //path_to_tessdata_folder

b. Install Java dependencies Navigate to the project directory and run:

  mvn clean install

c. Start the Spring Boot Application Run the application with:

  mvn spring-boot:run

d. Access the application by navigating to http://localhost:8080 in your web browser.

Use Cases

Upload Image and Extract Text Users can upload an image, and the OCR system will extract text from the image and display it in a text area.
Keyword Search in Extracted Text After the text is extracted, users can search for specific keywords in the extracted text. The matching words will be highlighted in the search results section.
Copy Extracted Text Users can click the “Copy” button to copy the entire extracted text to their clipboard for further use.

API Endpoints

Image Upload and OCR Extraction
- Method: POST
- Endpoint: /api/ocr/upload
- Request Body: Image file
- Response: Extracted text in JSON format.
Search in Extracted Text
- Method: POST
- Endpoint: /api/ocr/search
- Request Body: keyword (string) and extractedText (string)
- Response: Search results (text)

Troubleshooting

Issue: Tesseract is not found
- Ensure the Tesseract path is correctly set in the environment variables (TESSDATA_PREFIX and TESSERACT_PATH).
- Verify the Variable: Open a new Command Prompt window and type:
```
echo %TESSDATA_PREFIX%
```
Issue: Image is not uploading
- Check if the file size is within the allowed limits.
- Ensure the backend API /api/ocr/upload is running and accessible.
- Ensure file size is aligned according application.properties file check once and update accordingly.

Conclusion

This project demonstrates a complete OCR and Document Search web application that allows users to upload images, extract text using OCR, and perform keyword searches within the extracted text. The application is built with Spring Boot, Java, and Bootstrap for a clean and responsive frontend. By following the setup instructions and deploying the application using Docker, you can easily run the project in any environment with minimal configuration.

Contributions

We welcome contributions from the open-source community! If you'd like to improve this project, follow these steps:

Fork the repository from GitHub.
Clone your forked repository locally
Create a new feature branch
Commit your changes
Push the branch to your forked repo:
Create a Pull Request on the main repository.

All contributions—big or small—are highly appreciated! Feel free to improve the documentation, fix bugs, or add new features.

Getting Help

Feel free to reach out to Mail or via GitHub issues or discussions.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
.github/workflows		.github/workflows
.mvn/wrapper		.mvn/wrapper
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
README.md		README.md
mvnw		mvnw
mvnw.cmd		mvnw.cmd
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OCR and Document Search Web Application

Features

Technologies Used

Prerequisites

Sample Inputs and Outputs

Screenshots

Setup Instructions

OR

OR

Use Cases

API Endpoints

Troubleshooting

Conclusion

Contributions

Getting Help

Thank you for using and contributing to OCR and Document Search! Happy coding!

About

Releases

Packages

Languages

hey-its-d2t2/OCR_Reader

Folders and files

Latest commit

History

Repository files navigation

OCR and Document Search Web Application

Features

Technologies Used

Prerequisites

Sample Inputs and Outputs

Screenshots

Setup Instructions

OR

OR

Use Cases

API Endpoints

Troubleshooting

Conclusion

Contributions

Getting Help

Thank you for using and contributing to OCR and Document Search! Happy coding!

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages