Merge pull request #228 from Mayank202004/main

Added project "Autofill Personal Info Using Adhaar Card Image" under automation category.
UTSAVS26 · Oct 8, 2024 · 2d4e04d
2 parents 46fae96 + 33143b9
commit 2d4e04d
Showing 11 changed files with 1,215 additions and 0 deletions.
diff --git a/Automation_Tools/Autofill personal info using Aadhar Card Image/OCR ADHAAR API/app.py b/Automation_Tools/Autofill personal info using Aadhar Card Image/OCR ADHAAR API/app.py
@@ -0,0 +1,59 @@
+from flask import Flask, request, jsonify
+import easyocr
+import re
+
+app = Flask(__name__)
+reader = easyocr.Reader(['en', 'hi'])  # Load EasyOCR with English and Hindi support
+
+def extract_info(ocr_result):
+    first_name = middle_name = last_name = gender = dob = year_of_birth = aadhaar_number = None
+
+    for item in ocr_result:
+        text = item[1]
+
+        # Check for gender and extract names
+        if re.search(r'Male|Female|पुरुष|महिला', text):
+            name_match = re.findall(r'[A-Za-z]+', text)
+            if len(name_match) >= 3:
+                first_name, middle_name, last_name = name_match[:3]
+            gender = 'Male' if 'Male' in text or 'पुरुष' in text else 'Female'
+
+            # Extract DOB or Year of Birth
+            dob_match = re.search(r'\b(\d{2}/\d{2}/\d{4})\b', text)
+            if dob_match:
+                dob = dob_match.group(1)
+            elif 'Year of Birth' in text or 'जन्म वर्ष' in text:
+                yob_match = re.search(r'(?:Year of Birth|जन्म वर्ष)\s*:?\s*(\d{4})', text)  # accept either label
+                year_of_birth = yob_match.group(1) if yob_match else year_of_birth
+
+        # Extract Aadhaar number
+        aadhaar_match = re.search(r'\b\d{4}\s\d{4}\s\d{4}\b', text)
+        if aadhaar_match:
+            aadhaar_number = aadhaar_match.group(0)
+
+    return {
+        "First Name": first_name,
+        "Middle Name": middle_name,
+        "Last Name": last_name,
+        "Gender": gender,
+        "DOB": dob,
+        "Year of Birth": year_of_birth,
+        "Aadhaar Number": aadhaar_number
+    }
+
+@app.route('/extract', methods=['POST'])
+def extract_data():
+    data = request.get_json(silent=True) or {}  # a missing or invalid JSON body becomes {}
+    image_path = data.get('image_path')
+
+    if not image_path:
+        return jsonify({"error": "Image path is required"}), 400
+
+    # Process the image with EasyOCR
+    ocr_result = reader.readtext(image_path, paragraph=True)
+    extracted_info = extract_info(ocr_result)
+
+    return jsonify(extracted_info)
+
+if __name__ == '__main__':
+    app.run(debug=True)  # debug mode is meant for local development only
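The regular expressions in `extract_info` can be tried without running OCR. The sketch below is not part of the PR: `parse_paragraph` is a hypothetical stand-alone helper that reuses the same patterns, and the sample paragraph is a made-up string built from the sample values shown in the README, not real EasyOCR output.

```python
import re

# Hypothetical stand-alone helper; the patterns are copied from app.py.
GENDER_RE = re.compile(r'Male|Female|पुरुष|महिला')
DOB_RE = re.compile(r'\b(\d{2}/\d{2}/\d{4})\b')
AADHAAR_RE = re.compile(r'\b\d{4}\s\d{4}\s\d{4}\b')


def parse_paragraph(text):
    """Extract name parts, gender, DOB and Aadhaar number from one OCR paragraph."""
    fields = {"First Name": None, "Middle Name": None, "Last Name": None,
              "Gender": None, "DOB": None, "Aadhaar Number": None}

    if GENDER_RE.search(text):
        # As in app.py, the first three alphabetic tokens are taken as the name.
        words = re.findall(r'[A-Za-z]+', text)
        if len(words) >= 3:
            fields["First Name"], fields["Middle Name"], fields["Last Name"] = words[:3]
        is_male = 'Male' in text or 'पुरुष' in text
        fields["Gender"] = 'Male' if is_male else 'Female'

        dob = DOB_RE.search(text)
        if dob:
            fields["DOB"] = dob.group(1)

    number = AADHAAR_RE.search(text)
    if number:
        fields["Aadhaar Number"] = number.group(0)
    return fields


# Assumed input: a made-up paragraph using the sample values from the README.
sample = "Rahul Ramesh Gaikwad DOB: 23/08/1995 Male 2058 6470 5393"
result = parse_paragraph(sample)
print(result)
```

Because the name is taken from the first three alphabetic tokens, a paragraph that begins with other Latin-script words (a printed header, for example) would put those words into the name fields.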
diff --git a/Automation_Tools/Autofill personal info using Aadhar Card Image/README.md b/Automation_Tools/Autofill personal info using Aadhar Card Image/README.md
@@ -0,0 +1,136 @@
+# **Aadhaar Information Extraction Project**
+
+### 🎯 **Goal**
+
+The project aims to automate the extraction of relevant Aadhaar card information using Optical Character Recognition (OCR) techniques. The extracted details include:
+- First Name
+- Middle Name
+- Last Name
+- Gender
+- Date of Birth (DOB)
+- Aadhaar Number
+
+### 🧵 **Dataset**
+
+- No dataset is provided; the inputs are Aadhaar card images supplied by the user.
+
+### 🧾 **Description**
+
+This project implements two approaches for extracting Aadhaar card information:
+1. **Tesseract OCR with Pre-processing**: Text extraction from greyscale images using Tesseract and post-processing via regular expressions.
+2. **EasyOCR with Multi-language Support**: Leveraging EasyOCR’s Hindi and English language support for more accurate text extraction.
+
+### 🧮 **What I have done!**
+
+1. Pre-processed the Aadhaar card images by converting them to greyscale.
+2. Implemented text extraction using Tesseract and EasyOCR.
+3. Processed the extracted text using regular expressions to retrieve critical Aadhaar details:
+   - First Name
+   - Middle Name
+   - Last Name
+   - Gender
+   - Date of Birth (DOB)
+   - Aadhaar Number (in `XXXX XXXX XXXX` format)
+
+### 🚀 **Models Implemented**
+
+1. **Tesseract OCR**: Utilized for extracting text after image pre-processing.
+2. **EasyOCR**: Used for multi-language OCR (Hindi and English) to overcome limitations of Tesseract.
+
+- **Why these models?**  
+  - Tesseract is a commonly used open-source OCR tool, but its performance drops with complex fonts and mixed-language documents like Aadhaar cards.
+  - EasyOCR supports multiple languages and handles complex document structures better than Tesseract.
+
+### 📚 **Libraries Needed**
+
+- Tesseract OCR
+- EasyOCR
+- OpenCV (`cv2`) for image pre-processing
+- Regular expressions (`re`) for text processing
+- Python Imaging Library (Pillow)
+
+### 📊 **Exploratory Data Analysis Results**
+
+This section compares the results of Aadhaar information extraction with **Tesseract OCR** and **EasyOCR**. Screenshots of each approach's output are included, followed by a side-by-side comparison.
+
+---
+
+## **Result Comparison: Aadhaar Information Extraction**
+
+### 1. Tesseract OCR Approach
+
+In this approach, the Aadhaar card image is first converted to greyscale and then passed through the Tesseract OCR engine. Regular expressions (`re`) are used to extract key information such as names, gender, date of birth, and Aadhaar number from the extracted text.
+
+#### Screenshot for Tesseract OCR Result:
+![Tesseract OCR Result](assets/images/tesseract.png)
+
+#### Challenges:
+- **Accuracy**: Tesseract struggles with mixed-language documents (English + Hindi).
+- **Pre-processing Required**: The image needs to be pre-processed (converted to greyscale) to improve text extraction.
+- **Hindi Text**: Tesseract doesn't handle Hindi text well, which reduces its accuracy for Aadhaar cards that include Hindi.
+
+---
+
+### 2. EasyOCR Approach
+
+The EasyOCR approach uses multi-language support for both Hindi and English, making it a better fit for Aadhaar card text recognition. The extracted text is processed using regular expressions to find relevant details.
+
+#### Output for EasyOCR:
+- **First Name**: `Rahul`
+- **Middle Name**: `Ramesh`
+- **Last Name**: `Gaikwad`
+- **Gender**: `Male`
+- **DOB**: `23/08/1995`
+- **Aadhaar Number**: `2058 6470 5393`
+
+#### Screenshot for EasyOCR Result:
+![EasyOCR Result](assets/images/easyocr.png)
+
+#### Screenshot after Extraction:
+![Processed EasyOCR Result](assets/images/Output.png)
+
+#### Advantages:
+- **Higher Accuracy**: EasyOCR performs significantly better with mixed-language documents, making it ideal for Aadhaar cards.
+- **Multi-language Support**: Supports both English and Hindi, improving text extraction accuracy.
+- **No Heavy Pre-processing**: Works well without needing extensive image manipulation.
+
+---
+
+### **Comparison of Results**
+
+| Feature                | Tesseract OCR                          | EasyOCR                           |
+|------------------------|----------------------------------------|-----------------------------------|
+| **Languages**          | English; Hindi handled poorly          | English and Hindi                 |
+| **Accuracy**           | Low to medium                          | High                              |
+| **Pre-processing**     | Requires greyscale conversion          | Minimal pre-processing needed     |
+| **Performance**        | Faster but less accurate               | Slightly slower but more accurate |
+| **Aadhaar Extraction** | Struggles with Hindi and complex fonts | Handles both languages well       |
+
+---
+
+## **API Result Screenshot**
+
+Here is the expected result returned from the API after extracting information from the Aadhaar card image:
+
+![API Response Screenshot](assets/images/api_response.png)
+
+### **Input Body (JSON):**
+```json
+{
+  "image_path": "C:\\Users\\mayan\\Downloads\\fs.jpeg"
+}
+```
+
+### 📈 **Performance of the Models based on the Accuracy Scores**
+
+In informal testing, EasyOCR was more accurate than Tesseract. No large-scale accuracy evaluation has been done because no dataset was available.
+
+### 📢 **Conclusion**
+
+- The EasyOCR approach shows higher accuracy due to its ability to process both Hindi and English text on Aadhaar cards. Minimal pre-processing is required compared to Tesseract.
+- Based on these observations, EasyOCR is the preferred model for extracting Aadhaar card information.
+
+### ✒️ **Your Signature**
+
+*Mayank Chougale*  
+[GitHub](https://github.com/Mayank202004) | [LinkedIn](https://www.linkedin.com/in/mayank-chougale-4b12b4262/)
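The README states that the Aadhaar number is reported in `XXXX XXXX XXXX` format, and the pattern in `app.py` only matches when the three groups are separated by a single whitespace character. As a sketch under the assumption that OCR output may sometimes use other separators or none at all, a hypothetical `normalize_aadhaar` helper (not part of the PR) could regroup the digits:

```python
import re


def normalize_aadhaar(raw):
    """Return the number as 'XXXX XXXX XXXX', or None unless it has exactly 12 digits.

    Hypothetical helper, not part of the PR: it only regroups digits and does
    not check that the result is a valid Aadhaar number.
    """
    digits = re.sub(r'[^0-9]', '', raw)  # drop spaces, hyphens and other separators
    if len(digits) != 12:
        return None
    return ' '.join(digits[i:i + 4] for i in range(0, 12, 4))


print(normalize_aadhaar("2058-6470-5393"))
print(normalize_aadhaar("205864705393"))
print(normalize_aadhaar("2058 6470"))
```

The helper returns `None` rather than raising, so a caller can leave the field empty in the same way `extract_info` does when nothing matches.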
diff --git a/Automation_Tools/Autofill personal info using Aadhar Card Image/RESULT.md b/Automation_Tools/Autofill personal info using Aadhar Card Image/RESULT.md
@@ -0,0 +1,85 @@
+# Result Comparison: Aadhaar Information Extraction
+
+This document showcases the results of extracting Aadhaar card information using two different approaches:
+
+1. **Tesseract OCR with Image Pre-processing**
+2. **EasyOCR with Multi-language Support (English and Hindi)**
+
+Screenshots are provided for the extracted information, along with an API result screenshot.
+
+---
+
+## 1. Tesseract OCR Approach
+
+In this approach, the Aadhaar card image is first converted to greyscale and then passed through the Tesseract OCR engine. Regular expressions (`re`) are used to extract key information such as names, gender, date of birth, and Aadhaar number from the extracted text.
+
+### Screenshot for Tesseract OCR Result:
+![Tesseract OCR Result](assets/images/tesseract.png)
+
+#### Challenges:
+- **Accuracy**: Tesseract struggles with mixed-language documents (English + Hindi).
+- **Pre-processing Required**: The image needs to be pre-processed (converted to greyscale) to improve text extraction.
+- **Hindi Text**: Tesseract doesn't handle Hindi text as well, which reduces its accuracy for Aadhaar cards that include Hindi.
+
+---
+
+## 2. EasyOCR Approach
+
+The EasyOCR approach uses multi-language support for both Hindi and English, making it a better fit for Aadhaar card text recognition. The extracted text is processed using regular expressions to find relevant details.
+
+### Output for EasyOCR:
+- **First Name**: `Rahul`
+- **Middle Name**: `Ramesh`
+- **Last Name**: `Gaikwad`
+- **Gender**: `Male`
+- **DOB**: `23/08/1995`
+- **Aadhaar Number**: `2058 6470 5393`
+
+### Screenshot for EasyOCR Result:
+![EasyOCR Result](assets/images/easyocr.png)
+
+### After Extraction
+![Processed EasyOCR Result](assets/images/Output.png)
+
+#### Advantages:
+- **Higher Accuracy**: EasyOCR performs significantly better with mixed-language documents, making it ideal for Aadhaar cards.
+- **Multi-language Support**: Supports both English and Hindi, improving text extraction accuracy.
+- **No Heavy Pre-processing**: Works well without needing extensive image manipulation.
+
+---
+
+## Comparison of Results
+
+| Feature                | Tesseract OCR                          | EasyOCR                           |
+|------------------------|----------------------------------------|-----------------------------------|
+| **Languages**          | English; Hindi handled poorly          | English and Hindi                 |
+| **Accuracy**           | Low to medium                          | High                              |
+| **Pre-processing**     | Requires greyscale conversion          | Minimal pre-processing needed     |
+| **Performance**        | Faster but less accurate               | Slightly slower but more accurate |
+| **Aadhaar Extraction** | Struggles with Hindi and complex fonts | Handles both languages well       |
+
+---
+
+## API Result Screenshot
+
+Here is the expected result returned from the API after extracting information from the Aadhaar card image:
+
+![API Response Screenshot](assets/images/api_response.png)
+
+### Input Body (JSON):
+{
+
+    "image_path": "C:\\Users\\mayan\\Downloads\\fs.jpeg"
+
+}
+
+### API Response:
+```json
+{
+  "First Name": "Rahul",
+  "Middle Name": "Ramesh",
+  "Last Name": "Gaikwad",
+  "Gender": "Male",
+  "DOB": "23/08/1995",
+  "Aadhaar Number": "2058 6470 5393"
+}
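The input body and response above correspond to a `POST` to the `/extract` route defined in `app.py`. A minimal client sketch using only the standard library might look like the following. `build_extract_request` is a hypothetical helper, and the base URL is an assumption: it is the default address of Flask's development server, which `app.run()` uses unless configured otherwise.

```python
import json
import urllib.request


def build_extract_request(image_path, base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a POST request for the /extract endpoint.

    base_url is an assumption: Flask's development server listens on
    127.0.0.1:5000 by default.
    """
    body = json.dumps({"image_path": image_path}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/extract",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_extract_request(r"C:\Users\mayan\Downloads\fs.jpeg")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To send it against a running server (not executed here):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Note that `image_path` is read by the server process, so the path has to exist on the machine running the Flask app, not on the client.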
diff --git a/...n_Tools/Autofill personal info using Aadhar Card Image/assets/images/Output.png b/...n_Tools/Autofill personal info using Aadhar Card Image/assets/images/Output.png
diff --git a/...s/Autofill personal info using Aadhar Card Image/assets/images/api_response.png b/...s/Autofill personal info using Aadhar Card Image/assets/images/api_response.png
diff --git a/..._Tools/Autofill personal info using Aadhar Card Image/assets/images/easyocr.png b/..._Tools/Autofill personal info using Aadhar Card Image/assets/images/easyocr.png
diff --git a/...ools/Autofill personal info using Aadhar Card Image/assets/images/tesseract.png b/...ools/Autofill personal info using Aadhar Card Image/assets/images/tesseract.png