Merge pull request #228 from Mayank202004/main
Added project "Autofill Personal Info Using Adhaar Card Image" under automation category.
Showing 11 changed files with 1,215 additions and 0 deletions.
59 changes: 59 additions & 0 deletions
Automation_Tools/Autofill personal info using Aadhar Card Image/OCR ADHAAR API/app.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
from flask import Flask, request, jsonify | ||
import easyocr | ||
import re | ||
|
||
app = Flask(__name__) | ||
reader = easyocr.Reader(['en', 'hi'])  # Load EasyOCR with English and Hindi support | ||
|
||
def extract_info(ocr_result): | ||
    first_name = middle_name = last_name = None | ||
    gender = dob = year_of_birth = aadhaar_number = None | ||
|
||
    for item in ocr_result: | ||
        text = item[1]  # With paragraph=True, each item is [bounding_box, text] | ||
|
||
        # Check for gender and extract names | ||
        if re.search(r'Male|Female|पुरुष|महिला', text): | ||
            # Treat the first three Latin-script words as the name parts | ||
            name_match = re.findall(r'[A-Za-z]+', text) | ||
            if len(name_match) >= 3: | ||
                first_name, middle_name, last_name = name_match[:3] | ||
            gender = 'Male' if 'Male' in text or 'पुरुष' in text else 'Female' | ||
|
||
        # Extract DOB or Year of Birth | ||
        dob_match = re.search(r'\b(\d{2}/\d{2}/\d{4})\b', text) | ||
        if dob_match: | ||
            dob = dob_match.group(1) | ||
        elif 'Year of Birth' in text or 'जन्म वर्ष' in text: | ||
            yob_match = re.search(r'Year of Birth\s*:\s*(\d+)', text) | ||
            # Only overwrite when a year was found, so an earlier match is kept | ||
            if yob_match: | ||
                year_of_birth = yob_match.group(1) | ||
|
||
        # Extract Aadhaar number | ||
        aadhaar_match = re.search(r'\b\d{4}\s\d{4}\s\d{4}\b', text) | ||
        if aadhaar_match: | ||
            aadhaar_number = aadhaar_match.group(0) | ||
|
||
    return { | ||
        "First Name": first_name, | ||
        "Middle Name": middle_name, | ||
        "Last Name": last_name, | ||
        "Gender": gender, | ||
        "DOB": dob, | ||
        "Year of Birth": year_of_birth, | ||
        "Aadhaar Number": aadhaar_number | ||
    } | ||
|
||
@app.route('/extract', methods=['POST']) | ||
def extract_data(): | ||
    # silent=True returns None instead of raising on a missing or non-JSON body | ||
    data = request.get_json(silent=True) or {} | ||
    image_path = data.get('image_path') | ||
|
||
    if not image_path: | ||
        return jsonify({"error": "Image path is required"}), 400 | ||
|
||
    # Process the image with EasyOCR | ||
    ocr_result = reader.readtext(image_path, paragraph=True) | ||
    extracted_info = extract_info(ocr_result) | ||
|
||
    return jsonify(extracted_info) | ||
|
||
if __name__ == '__main__': | ||
    app.run(debug=True)  # Debug mode is for local development only |
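To see how the three regular expressions in `extract_info` behave, here is a small standalone sketch that needs neither Flask nor EasyOCR. The input strings are hypothetical OCR paragraphs built from the sample values shown in the README, not real OCR output.

```python
import re

# Hypothetical OCR paragraph assembled from the README's sample values.
sample_text = "Rahul Ramesh Gaikwad DOB: 23/08/1995 पुरुष / Male 2058 6470 5393"

# Same patterns as in extract_info.
latin_words = re.findall(r'[A-Za-z]+', sample_text)
first_name, middle_name, last_name = latin_words[:3]
dob = re.search(r'\b(\d{2}/\d{2}/\d{4})\b', sample_text).group(1)
aadhaar = re.search(r'\b\d{4}\s\d{4}\s\d{4}\b', sample_text).group(0)

# A two-word name shows a limitation: the third "name" becomes the next
# Latin-script word in the paragraph, here the "DOB" label.
two_word_case = re.findall(r'[A-Za-z]+', "Rahul Gaikwad DOB: 23/08/1995 Male")[:3]

print(first_name, middle_name, last_name, dob, aadhaar)
print(two_word_case)
```

The second case suggests that the name logic depends on the card carrying exactly three name parts ahead of any other Latin-script text in the same paragraph.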
136 changes: 136 additions & 0 deletions
Automation_Tools/Autofill personal info using Aadhar Card Image/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
# **Aadhaar Information Extraction Project** | ||
|
||
### 🎯 **Goal** | ||
|
||
The project aims to automate the extraction of relevant Aadhaar card information using Optical Character Recognition (OCR) techniques. The extracted details include: | ||
- First Name | ||
- Middle Name | ||
- Last Name | ||
- Gender | ||
- Date of Birth (DOB) | ||
- Aadhaar Number | ||
|
||
### 🧵 **Dataset** | ||
|
||
- No dataset is provided; the inputs are Aadhaar card images uploaded by the user. | ||
|
||
### 🧾 **Description** | ||
|
||
This project implements two approaches for extracting Aadhaar card information: | ||
1. **Tesseract OCR with Pre-processing**: Text extraction from greyscale images using Tesseract and post-processing via regular expressions. | ||
2. **EasyOCR with Multi-language Support**: Leveraging EasyOCR’s Hindi and English language support for more accurate text extraction. | ||
|
||
### 🧮 **What I have done!** | ||
|
||
1. Pre-processed the Aadhaar card images by converting them to greyscale. | ||
2. Implemented text extraction using Tesseract and EasyOCR. | ||
3. Processed the extracted text using regular expressions to retrieve critical Aadhaar details: | ||
- First Name | ||
- Middle Name | ||
- Last Name | ||
- Gender | ||
- Date of Birth (DOB) | ||
- Aadhaar Number (in `XXXX XXXX XXXX` format) | ||
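As a rough illustration of the last item, the sketch below normalises a digit string into the `XXXX XXXX XXXX` format. The helper name `format_aadhaar` is hypothetical and not part of the project; it only checks the digit count and does not verify that the number is genuine.

```python
import re


def format_aadhaar(raw):
    """Return the 12 digits in `raw` grouped as 'XXXX XXXX XXXX', or None."""
    # Drop everything except ASCII digits (spaces, hyphens, stray OCR marks).
    digits = re.sub(r'[^0-9]', '', raw)
    if len(digits) != 12:
        return None
    # Split into three groups of four digits.
    return ' '.join(digits[i:i + 4] for i in range(0, 12, 4))


print(format_aadhaar("205864705393"))
print(format_aadhaar("2058-6470-5393"))
```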
|
||
### 🚀 **Models Implemented** | ||
|
||
1. **Tesseract OCR**: Utilized for extracting text after image pre-processing. | ||
2. **EasyOCR**: Used for multi-language OCR (Hindi and English) to overcome limitations of Tesseract. | ||
|
||
- **Why these models?** | ||
- Tesseract is a commonly used open-source OCR tool, but its performance drops with complex fonts and mixed-language documents like Aadhaar cards. | ||
- EasyOCR supports multiple languages and handles complex document structures better than Tesseract. | ||
|
||
### 📚 **Libraries Needed** | ||
|
||
- Tesseract OCR | ||
- EasyOCR | ||
- OpenCV (`cv2`) for image pre-processing | ||
- Regular expressions (`re`) for text processing | ||
- Python Imaging Library (Pillow) | ||
|
||
### 📊 **Exploratory Data Analysis Results** | ||
|
||
This section compares the results of Aadhaar information extraction using the **Tesseract OCR** and **EasyOCR** approaches, with screenshots of each result and a comparison table. | ||
|
||
--- | ||
|
||
## **Result Comparison: Aadhaar Information Extraction** | ||
|
||
### 1. Tesseract OCR Approach | ||
|
||
In this approach, the Aadhaar card image is first converted to greyscale and then passed through the Tesseract OCR engine. Regular expressions (`re`) are used to extract key information such as names, gender, date of birth, and Aadhaar number from the extracted text. | ||
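A minimal sketch of this pipeline, assuming the `opencv-python` and `pytesseract` packages and a local Tesseract installation. The function names are illustrative and are not taken from the project's code.

```python
def read_text_greyscale(image_path, lang="eng"):
    """Convert the image to greyscale with OpenCV, then run Tesseract on it."""
    # Imported lazily so clean_lines() below works without the OCR stack.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        # cv2.imread returns None when the file is missing or unreadable.
        raise FileNotFoundError(f"Could not read image: {image_path}")
    grey = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(grey, lang=lang)


def clean_lines(ocr_text):
    """Split raw OCR text into stripped, non-empty lines for regex matching."""
    return [line.strip() for line in ocr_text.splitlines() if line.strip()]


print(clean_lines("  Rahul Ramesh Gaikwad \n\n DOB: 23/08/1995\n"))
```

The cleaned lines can then be passed to regular expressions for the date of birth and Aadhaar number.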
|
||
#### Screenshot for Tesseract OCR Result: | ||
![Tesseract OCR Result](assets/images/tesseract.png) | ||
|
||
#### Challenges: | ||
- **Accuracy**: Tesseract struggles with mixed-language documents (English + Hindi). | ||
- **Pre-processing Required**: The image needs to be pre-processed (converted to greyscale) to improve text extraction. | ||
- **Hindi Text**: Tesseract doesn't handle Hindi text well, which reduces its accuracy for Aadhaar cards that include Hindi. | ||
|
||
--- | ||
|
||
### 2. EasyOCR Approach | ||
|
||
The EasyOCR approach uses multi-language support for both Hindi and English, making it a better fit for Aadhaar card text recognition. The extracted text is processed using regular expressions to find relevant details. | ||
|
||
#### Output for EasyOCR: | ||
- **First Name**: `Rahul` | ||
- **Middle Name**: `Ramesh` | ||
- **Last Name**: `Gaikwad` | ||
- **Gender**: `Male` | ||
- **DOB**: `23/08/1995` | ||
- **Aadhaar Number**: `2058 6470 5393` | ||
|
||
#### Screenshot for EasyOCR Result: | ||
![EasyOCR Result](assets/images/easyocr.png) | ||
|
||
#### Screenshot after Extraction: | ||
![Processed EasyOCR Result](assets/images/Output.png) | ||
|
||
#### Advantages: | ||
- **Higher Accuracy**: EasyOCR performs significantly better with mixed-language documents, making it ideal for Aadhaar cards. | ||
- **Multi-language Support**: Supports both English and Hindi, improving text extraction accuracy. | ||
- **No Heavy Pre-processing**: Works well without needing extensive image manipulation. | ||
|
||
--- | ||
|
||
### **Comparison of Results** | ||
|
||
| Feature | Tesseract OCR | EasyOCR | | ||
|----------------------|------------------------------------|----------------------------------| | ||
| **Languages** | English only | English and Hindi support | | ||
| **Accuracy** | Low to Medium | High | | ||
| **Pre-processing** | Requires greyscale conversion | Minimal pre-processing needed | | ||
| **Performance** | Faster but less accurate | Slightly slower but more accurate | | ||
| **Aadhaar Extraction**| Struggles with Hindi and complex fonts | Handles both languages well | | ||
|
||
--- | ||
|
||
## **API Result Screenshot** | ||
|
||
Here is the expected result returned from the API after extracting information from the Aadhaar card image: | ||
|
||
![API Response Screenshot](assets/images/api_response.png) | ||
|
||
### **Input Body (JSON):** | ||
```json | ||
{ | ||
  "image_path": "C:\\Users\\mayan\\Downloads\\fs.jpeg" | ||
} | ||
``` | ||
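A request with this body could be sent from Python roughly as follows. This is a sketch using only the standard library; the base URL assumes the Flask development server's default address, and both function names are hypothetical.

```python
import json
import urllib.request


def build_extract_request(image_path, base_url="http://127.0.0.1:5000"):
    """Build a POST request for /extract with the JSON body shown above."""
    body = json.dumps({"image_path": image_path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_via_api(image_path):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(build_extract_request(image_path)) as response:
        return json.load(response)


request = build_extract_request(r"C:\Users\mayan\Downloads\fs.jpeg")
print(request.get_method(), request.full_url)
```

Note that `image_path` is read by the server process, so it must be a path that exists on the machine running the API.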
|
||
### 📈 **Performance of the Models based on the Accuracy Scores** | ||
|
||
In the samples tried, EasyOCR was more accurate than Tesseract. However, no large-scale accuracy testing has been done, due to the lack of a dataset. | ||
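If a set of hand-labelled cards were available, a field-level accuracy score could be computed along these lines. This is only a sketch of one possible metric; the function name and the toy records are illustrative, not measured results.

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of labelled fields whose predicted value matches exactly."""
    correct = 0
    total = 0
    for predicted, expected in zip(predictions, ground_truth):
        for field, expected_value in expected.items():
            total += 1
            if predicted.get(field) == expected_value:
                correct += 1
    return correct / total if total else 0.0


# Toy records for illustration only.
labels = [{"Gender": "Male", "DOB": "23/08/1995"}]
output = [{"Gender": "Male", "DOB": "23/08/1996"}]
print(field_accuracy(output, labels))  # 0.5
```

Running the same labelled set through both OCR approaches would put the accuracy comparison on a measurable footing.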
|
||
### 📢 **Conclusion** | ||
|
||
- The EasyOCR approach shows higher accuracy due to its ability to process both Hindi and English text on Aadhaar cards. Minimal pre-processing is required compared to Tesseract. | ||
- Based on these observations, EasyOCR is the preferred approach for extracting Aadhaar card information. | ||
|
||
### ✒️ **Your Signature** | ||
|
||
*Mayank Chougale* | ||
[GitHub](https://github.com/Mayank202004) | [LinkedIn](https://www.linkedin.com/in/mayank-chougale-4b12b4262/) |
85 changes: 85 additions & 0 deletions
Automation_Tools/Autofill personal info using Aadhar Card Image/RESULT.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
# Result Comparison: Aadhaar Information Extraction | ||
|
||
This document showcases the results of extracting Aadhaar card information using two different approaches: | ||
|
||
1. **Tesseract OCR with Image Pre-processing** | ||
2. **EasyOCR with Multi-language Support (English and Hindi)** | ||
|
||
Screenshots are provided for the extracted information, along with an API result screenshot. | ||
|
||
--- | ||
|
||
## 1. Tesseract OCR Approach | ||
|
||
In this approach, the Aadhaar card image is first converted to greyscale and then passed through the Tesseract OCR engine. Regular expressions (`re`) are used to extract key information such as names, gender, date of birth, and Aadhaar number from the extracted text. | ||
|
||
### Screenshot for Tesseract OCR Result: | ||
![Tesseract OCR Result](assets/images/tesseract.png) | ||
|
||
#### Challenges: | ||
- **Accuracy**: Tesseract struggles with mixed-language documents (English + Hindi). | ||
- **Pre-processing Required**: The image needs to be pre-processed (converted to greyscale) to improve text extraction. | ||
- **Hindi Text**: Tesseract doesn't handle Hindi text as well, which reduces its accuracy for Aadhaar cards that include Hindi. | ||
|
||
--- | ||
|
||
## 2. EasyOCR Approach | ||
|
||
The EasyOCR approach uses multi-language support for both Hindi and English, making it a better fit for Aadhaar card text recognition. The extracted text is processed using regular expressions to find relevant details. | ||
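A minimal sketch of this step, assuming the `easyocr` package is installed. The function names are illustrative; with `paragraph=True`, each result item is a `[bounding_box, text]` pair, which is why the text sits at index 1.

```python
def read_paragraphs(image_path, languages=("en", "hi")):
    """Run EasyOCR with English and Hindi and return paragraph-level results."""
    # Imported lazily so texts_from_result() works without EasyOCR installed.
    import easyocr

    reader = easyocr.Reader(list(languages))
    return reader.readtext(image_path, paragraph=True)


def texts_from_result(ocr_result):
    """Pull the text string out of each [bounding_box, text] item."""
    return [item[1] for item in ocr_result]


# Hand-made result in the paragraph=True shape, for illustration only.
box = [[0, 0], [10, 0], [10, 5], [0, 5]]
fake_result = [[box, "Rahul Ramesh Gaikwad"], [box, "2058 6470 5393"]]
print(texts_from_result(fake_result))
```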
|
||
### Output for EasyOCR: | ||
- **First Name**: `Rahul` | ||
- **Middle Name**: `Ramesh` | ||
- **Last Name**: `Gaikwad` | ||
- **Gender**: `Male` | ||
- **DOB**: `23/08/1995` | ||
- **Aadhaar Number**: `2058 6470 5393` | ||
|
||
### Screenshot for EasyOCR Result: | ||
![EasyOCR Result](assets/images/easyocr.png) | ||
|
||
### After Extraction | ||
![EasyOCR Result](assets/images/Output.png) | ||
|
||
#### Advantages: | ||
- **Higher Accuracy**: EasyOCR performs significantly better with mixed-language documents, making it ideal for Aadhaar cards. | ||
- **Multi-language Support**: Supports both English and Hindi, improving text extraction accuracy. | ||
- **No Heavy Pre-processing**: Works well without needing extensive image manipulation. | ||
|
||
--- | ||
|
||
## Comparison of Results | ||
|
||
| Feature | Tesseract OCR | EasyOCR | | ||
|----------------------|------------------------------------|----------------------------------| | ||
| **Languages** | English only | English and Hindi support | | ||
| **Accuracy** | Low to Medium | High | | ||
| **Pre-processing** | Requires greyscale conversion | Minimal pre-processing needed | | ||
| **Performance** | Faster but less accurate | Slightly slower but more accurate | | ||
| **Aadhaar Extraction**| Struggles with Hindi and complex fonts | Handles both languages well | | ||
|
||
--- | ||
|
||
## API Result Screenshot | ||
|
||
Here is the expected result returned from the API after extracting information from the Aadhaar card image: | ||
|
||
![EasyOCR Result](assets/images/api_response.png) | ||
|
||
### Input Body JSON: | ||
{ | ||
|
||
"image_path": "C:\\Users\\mayan\\Downloads\\fs.jpeg" | ||
|
||
} | ||
|
||
### API Response: | ||
```json | ||
{ | ||
"First Name": "Rahul", | ||
"Middle Name": "Ramesh", | ||
"Last Name": "Gaikwad", | ||
"Gender": "Male", | ||
"DOB": "23/08/1995", | ||
"Aadhaar Number": "2058 6470 5393" | ||
} | ||
``` |
Binary file added
BIN
+7.07 KB
...n_Tools/Autofill personal info using Aadhar Card Image/assets/images/Output.png
Binary file added
BIN
+53.1 KB
...s/Autofill personal info using Aadhar Card Image/assets/images/api_response.png
Binary file added
BIN
+27.7 KB
..._Tools/Autofill personal info using Aadhar Card Image/assets/images/easyocr.png
Binary file added
BIN
+16 KB
...ools/Autofill personal info using Aadhar Card Image/assets/images/tesseract.png