The OCR and Document Search Web Application is designed to extract text from uploaded images using Optical Character Recognition (OCR) and provide a keyword-based search within the extracted text. This tool helps users quickly scan documents and search for specific information. This project leverages the Tesseract OCR engine to provide accurate text extraction capabilities, supporting multiple languages, including Hindi and English. The application is built to enhance image quality before processing, ensuring better recognition accuracy through DPI adjustments and format conversions.
- Upload images and extract text via OCR.
- Perform keyword-based search in the extracted text.
- Simple, user-friendly interface.
- Copy extracted text to clipboard.
- Frontend: HTML, CSS (Bootstrap), JavaScript
- Backend: Java (Spring Boot), OCR Library (Tesseract)
- Libraries:
  - Tesseract for OCR.
  - Spring Boot for backend services.
  - Apache PDFBox for handling image files and adjusting DPI.
  - Lombok to reduce boilerplate code and improve readability.
  - SLF4J for logging.
  - Thymeleaf for rendering the frontend.
- IntelliJ IDEA as the development environment.
Ensure you have the following installed:
- Java 17 or higher (the project is built on Java 17 and Spring Boot 3.3.4; Spring Boot 3.x requires Java 17 as a minimum)
- Maven (for dependency management)
- Tesseract OCR (Install locally for OCR functionality)
- Input 1: Image Upload (OCR)
  - Uploaded Image: Image with text such as a scanned document.
  - Extracted Text: The text extracted from the uploaded image will appear in the text area.
- Input 2: Search Keyword
  - Keyword: "Invoice"
  - Search Result: Found or Not Found
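The Found / Not Found result in the sample above could be produced by a check like the following. This is a minimal sketch, not the project's actual code: the class name `KeywordSearch`, the method `search`, and the choice of case-insensitive substring matching are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical helper (not taken from the repository): reports whether a
// keyword occurs anywhere in the OCR output, ignoring letter case.
final class KeywordSearch {

    static String search(String extractedText, String keyword) {
        // Treat missing or blank input as "nothing to find".
        if (extractedText == null || keyword == null || keyword.isBlank()) {
            return "Not Found";
        }
        // Locale.ROOT keeps lower-casing independent of the machine's locale.
        String haystack = extractedText.toLowerCase(Locale.ROOT);
        String needle = keyword.trim().toLowerCase(Locale.ROOT);
        return haystack.contains(needle) ? "Found" : "Not Found";
    }
}
```

For example, `KeywordSearch.search("INVOICE No. 42", "Invoice")` returns `Found`, while searching the same text for `Receipt` returns `Not Found`.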
- Clone the Repository
git clone https://github.com/hey-its-d2t2/OCR_Reader.git
cd OCR_Reader
- Install Tesseract OCR
  - Download the Tess4J API from:
    https://sourceforge.net/projects/tess4j/
  - Extract the files from the downloaded archive.
  - Open your IDE and create a new project.
  - Link the jar file with your project. For instructions on importing a jar file, see:
    https://www.edureka.co/community/4028/how-to-import-a-jar-file-in-eclipse
  - Navigate to this path: "..\Tess4J-3.4.8-src\Tess4J\dist".
  - For a walkthrough, read the article at:
    https://www.geeksforgeeks.org/tesseract-ocr-with-java-with-examples/
  - Simply extract the folder, then go to the tessdata folder inside it, for example:
    "C:\Program Files\Tess4J-3.4.8-src\Tess4J\tessdata"
  - This selected folder is the path_to_tessdata_folder referred to in the remaining steps.
- Set the TESSDATA_PREFIX Environment Variable
  Set TESSDATA_PREFIX so that Tesseract can locate the tessdata folder. The examples here point it at the tessdata folder itself; some older Tesseract releases expect the parent directory instead, so adjust the value if the language files are not found.
  - Windows Command Prompt:
    set TESSDATA_PREFIX=D:\path\to\tessdata\
  - Linux/Mac Terminal:
    export TESSDATA_PREFIX=/path/to/tessdata/
- List Available Languages
  - List the languages available to Tesseract by running the following command in the terminal or command prompt:
    tesseract --list-langs
  - This displays the languages that are installed and recognized by Tesseract.
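A similar check can be made from Java without calling the `tesseract` executable, because Tesseract language packs are stored as `<lang>.traineddata` files inside the tessdata folder. The sketch below lists them; the class name `TessdataCheck` and the method `installedLanguages` are hypothetical and not part of this project or of Tess4J.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (an assumption for illustration): lists the language
// codes found in a tessdata folder by looking for "*.traineddata" files.
final class TessdataCheck {

    private static final String SUFFIX = ".traineddata";

    static List<String> installedLanguages(Path tessdataDir) throws IOException {
        List<String> languages = new ArrayList<>();
        // A missing or non-directory path simply yields an empty list.
        if (!Files.isDirectory(tessdataDir)) {
            return languages;
        }
        try (DirectoryStream<Path> files = Files.newDirectoryStream(tessdataDir, "*" + SUFFIX)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                // "eng.traineddata" -> "eng"
                languages.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        Collections.sort(languages);
        return languages;
    }
}
```

With English and Hindi installed, the returned list would contain `eng` and `hin`. An empty list suggests the path is wrong or no language files are present. When the path comes from `System.getenv("TESSDATA_PREFIX")`, check for `null` first, since the variable may not be set.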
3.1 To set the TESSDATA_PREFIX environment variable on Windows, follow these steps:
Setting TESSDATA_PREFIX in Windows
- Open System Properties:
  - Right-click on the This PC or My Computer icon on your desktop or in File Explorer.
  - Select Properties.
  - Click on Advanced system settings on the left side.
  - In the System Properties window, go to the Advanced tab.
- Environment Variables:
  - Click on the Environment Variables button at the bottom of the window.
- Create a New System Variable:
  - In the System variables section, click on New.
  - In the Variable name field, enter TESSDATA_PREFIX.
  - In the Variable value field, enter the path to your tessdata directory. For example:
    "C:\Program Files\Tess4J-3.4.8-src\Tess4J\tessdata"
  - Click OK to save the new variable.
- Close All Windows:
  - Click OK in the Environment Variables window and in the System Properties window to apply your changes.
- Verify the Variable:
  - Open a new Command Prompt window and type:
    echo %TESSDATA_PREFIX%
  - This should display the path you set.
- Backend Setup
  a. Update the path in AppConstants.java to match the location of your tessdata folder. Example:
  public static final String PATH = "C:\\MyProj\\ORC-Reader\\Tess4J-3.4.8-src\\Tess4J\\tessdata"; // path_to_tessdata_folder
  b. Install Java dependencies. Navigate to the project directory and run:
  mvn clean install
  c. Start the Spring Boot application. Run the application with:
  mvn spring-boot:run
  d. Access the application by navigating to http://localhost:8080 in your web browser.
- Upload Image and Extract Text: Users can upload an image, and the OCR system will extract text from the image and display it in a text area.
- Keyword Search in Extracted Text: After the text is extracted, users can search for specific keywords in the extracted text. The matching words will be highlighted in the search results section.
- Copy Extracted Text: Users can click the "Copy" button to copy the entire extracted text to their clipboard for further use.
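The keyword highlighting described in the search feature above could be sketched as follows. This is an illustration under stated assumptions rather than the project's implementation: the class `KeywordHighlighter`, the method `highlight`, and the use of an HTML `<mark>` tag are all hypothetical choices.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not from the repository): wraps every
// case-insensitive occurrence of the keyword in a <mark> element.
final class KeywordHighlighter {

    static String highlight(String text, String keyword) {
        // Nothing to highlight when either input is missing or empty.
        if (text == null || keyword == null || keyword.isEmpty()) {
            return text;
        }
        // Pattern.quote treats the keyword as literal text, so characters
        // such as "+" or "." are not interpreted as regex syntax.
        Pattern pattern = Pattern.compile(
                Pattern.quote(keyword),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // "$0" is the matched text itself, so the original casing is kept.
        return pattern.matcher(text).replaceAll("<mark>$0</mark>");
    }
}
```

Calling `KeywordHighlighter.highlight("Invoice total. invoice paid.", "invoice")` yields `<mark>Invoice</mark> total. <mark>invoice</mark> paid.`. If the result is rendered as HTML, the extracted text should be HTML-escaped before the markers are added, since OCR output may contain characters such as `<` or `&`.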
- Image Upload and OCR Extraction
  - Method: POST
  - Endpoint: /api/ocr/upload
  - Request Body: Image file
  - Response: Extracted text in JSON format.
- Search in Extracted Text
  - Method: POST
  - Endpoint: /api/ocr/search
  - Request Body: keyword (string) and extractedText (string)
  - Response: Search results (text)
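A client could call the search endpoint with the standard `java.net.http` API (Java 11+). The sketch below only builds the request. It assumes the two parameters are sent as form-encoded fields; the endpoint list does not state the encoding, so check the controller before relying on it. The class `SearchRequestSketch` and its methods are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side sketch. ASSUMPTION: /api/ocr/search accepts
// "keyword" and "extractedText" as application/x-www-form-urlencoded fields.
final class SearchRequestSketch {

    // Builds the form body, URL-encoding both values.
    static String formBody(String keyword, String extractedText) {
        return "keyword=" + URLEncoder.encode(keyword, StandardCharsets.UTF_8)
                + "&extractedText=" + URLEncoder.encode(extractedText, StandardCharsets.UTF_8);
    }

    // Builds (but does not send) the POST request for the search endpoint.
    static HttpRequest searchRequest(String baseUrl, String keyword, String extractedText) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/api/ocr/search"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(keyword, extractedText)))
                .build();
    }
}
```

With the application running locally, the request returned by `SearchRequestSketch.searchRequest("http://localhost:8080", "Invoice", text)` can be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`.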
- Issue: Tesseract is not found
  - Ensure the Tesseract path is correctly set in the environment variables (TESSDATA_PREFIX and TESSERACT_PATH).
  - Verify the variable: open a new Command Prompt window and type:
    echo %TESSDATA_PREFIX%
- Issue: Image is not uploading
  - Check if the file size is within the allowed limits.
  - Ensure the backend API /api/ocr/upload is running and accessible.
  - Check the file size limits configured in the application.properties file and update them if needed.
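In Spring Boot, upload limits are usually set with the standard `spring.servlet.multipart.max-file-size` and `spring.servlet.multipart.max-request-size` properties, with values such as `10MB`. Whether this project sets them is an assumption worth verifying in its application.properties. The sketch below shows one way to compare a file's size against such a value before uploading. `UploadSizeCheck` is a hypothetical helper that handles only plain `B`, `KB`, `MB` and `GB` suffixes with binary units.

```java
import java.util.Locale;

// Hypothetical helper (not from the repository): compares a file size with
// a limit written in the "10MB" style used in application.properties.
final class UploadSizeCheck {

    // Parses "512KB", "10MB", "1GB", "2048B" or a bare byte count.
    // Binary units are assumed: 1KB = 1024 bytes, 1MB = 1024 * 1024 bytes.
    static long parseSize(String value) {
        String v = value.trim().toUpperCase(Locale.ROOT);
        long multiplier = 1L;
        if (v.endsWith("KB")) {
            multiplier = 1024L;
            v = v.substring(0, v.length() - 2);
        } else if (v.endsWith("MB")) {
            multiplier = 1024L * 1024L;
            v = v.substring(0, v.length() - 2);
        } else if (v.endsWith("GB")) {
            multiplier = 1024L * 1024L * 1024L;
            v = v.substring(0, v.length() - 2);
        } else if (v.endsWith("B")) {
            v = v.substring(0, v.length() - 1);
        }
        return Long.parseLong(v.trim()) * multiplier;
    }

    // True when a file of the given size fits within the configured limit.
    static boolean fitsWithin(long fileSizeBytes, String configuredLimit) {
        return fileSizeBytes <= parseSize(configuredLimit);
    }
}
```

For instance, `UploadSizeCheck.fitsWithin(2L * 1024 * 1024, "1MB")` is `false`, which would explain a rejected upload of a 2 MB image under a 1MB limit.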
This project demonstrates a complete OCR and Document Search web application that allows users to upload images, extract text using OCR, and perform keyword searches within the extracted text. The application is built with Spring Boot, Java, and Bootstrap for a clean and responsive frontend. By following the setup instructions and deploying the application using Docker, you can easily run the project in any environment with minimal configuration.
We welcome contributions from the open-source community! If you'd like to improve this project, follow these steps:
- Fork the repository from GitHub.
- Clone your forked repository locally.
- Create a new feature branch.
- Commit your changes.
- Push the branch to your forked repo.
- Create a Pull Request on the main repository.
All contributions—big or small—are highly appreciated! Feel free to improve the documentation, fix bugs, or add new features.
Feel free to reach out by email or via GitHub issues or discussions.