SensiHidePDF is an end-to-end solution for redacting sensitive information from PDF files (especially resumes) in bulk. It uses the Google Cloud Data Loss Prevention API.

aditya-shrivastavv/sensihide-pdf

SensiHidePDF 🕵️‍♂️

An end-to-end solution to hide sensitive information in PDF files, primarily resumes.

[Figure: architecture diagram on Google Cloud]

Architecture 🏗️

The application is built natively on Google Cloud Platform, using Cloud Run, Cloud Workflows, Cloud Storage buckets, Eventarc, the Data Loss Prevention (DLP) API, and BigQuery.

The entire application can be provisioned using Terraform, making it easy to deploy and manage.

Here is how it works: 🤔

  • Whenever a PDF file is uploaded to the Cloud Storage bucket (input_bucket), an Eventarc event is triggered.
  • The event starts a Cloud Workflows workflow that runs the following steps in sequence:
    • The first Cloud Run service downloads the PDF file and extracts its text.
    • The second service sends that text to the Data Loss Prevention (DLP) API to detect sensitive information. (For now, it is hardcoded to detect EMAIL_ADDRESS and PHONE_NUMBER.)
    • The third service receives the response from the DLP API. It then downloads the PDF file and redacts the sensitive information from it. The redacted PDF is then uploaded to another Cloud Storage bucket (output_bucket).
    • Finally, the last service stores the detected sensitive information in BigQuery for further analysis.
  • That's it! 🎉
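
The detection step above can be sketched in Python. This is a minimal illustration, not the project's actual DLP Runner code: the function names, the `locations/global` parent, and the shape of the returned dictionaries are assumptions. It relies on the `google-cloud-dlp` client library and its `DlpServiceClient.inspect_content` method, and only the two info types named above are requested.

```python
from typing import Any

# The two info types the README says are currently hardcoded.
INFO_TYPES = ("EMAIL_ADDRESS", "PHONE_NUMBER")


def build_inspect_request(project_id: str, text: str) -> dict[str, Any]:
    """Build the request body for a DLP inspect_content call (hypothetical helper)."""
    return {
        # Assumption: the global location is used for inspection.
        "parent": f"projects/{project_id}/locations/global",
        "inspect_config": {
            "info_types": [{"name": name} for name in INFO_TYPES],
            # Ask DLP to return the matched text so it can be located in the PDF later.
            "include_quote": True,
        },
        "item": {"value": text},
    }


def inspect_text(project_id: str, text: str) -> list[dict[str, str]]:
    """Send extracted text to the DLP API and flatten the findings (hypothetical helper)."""
    # Imported lazily so the request builder stays usable without the client library.
    from google.cloud import dlp_v2  # requires the google-cloud-dlp package

    client = dlp_v2.DlpServiceClient()
    response = client.inspect_content(request=build_inspect_request(project_id, text))
    return [
        {
            "info_type": finding.info_type.name,
            "quote": finding.quote,
            "likelihood": finding.likelihood.name,
        }
        for finding in response.result.findings
    ]
```

Calling `inspect_text("my-project", extracted_text)` needs Google Cloud credentials and a project with the DLP API enabled. The flattened dictionaries are one possible shape for handing findings to the redaction step and to BigQuery.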

Services 🛠️

Service Name       Source Code    Infrastructure
PDF To Text        Code           Terraform
DLP Runner         Code           Terraform
Redactor           Code           Terraform
Findings Writer    Code           Terraform

Leave a ⭐ if you like this project!

Secret message

There are other solutions to the same problem, including one from GoogleCloudPlatform itself. But there is a big difference between my implementation and theirs. Their implementation converts the PDF into images and then has the DLP API redact those images. The drawback of this approach is that the PDF rebuilt from the redacted images is neither readable by screen readers nor searchable, which makes it less accessible at scale.

I took a different approach: I didn't run the DLP API on images. Instead, I ran it directly on the extracted text and, upon receiving the findings, performed the redactions myself using a Python library. This way the PDF remains searchable and ATS-friendly.
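
A text-level redaction of this kind might look like the sketch below. The README does not name the Python library it uses, so PyMuPDF (`fitz`) is an assumption here, and the helper names and the `quote` key are hypothetical. The sketch searches each page for the strings DLP reported, covers the matching rectangles with redaction annotations, and applies them, which removes only the matched text and leaves the rest of the page's text layer in place.

```python
def quotes_to_redact(findings: list[dict]) -> list[str]:
    """Collect unique, non-empty quoted strings from findings, preserving order."""
    seen: set[str] = set()
    quotes: list[str] = []
    for finding in findings:
        quote = finding.get("quote", "").strip()
        if quote and quote not in seen:
            seen.add(quote)
            quotes.append(quote)
    return quotes


def redact_pdf(pdf_in: str, pdf_out: str, findings: list[dict]) -> int:
    """Black out every occurrence of each finding's quote; return the number of hits."""
    # Assumption: PyMuPDF is the redaction library; the README does not name one.
    import fitz  # requires the PyMuPDF package

    quotes = quotes_to_redact(findings)
    hits = 0
    doc = fitz.open(pdf_in)
    for page in doc:
        for quote in quotes:
            # search_for returns one rectangle per occurrence on the page.
            for rect in page.search_for(quote):
                page.add_redact_annot(rect, fill=(0, 0, 0))
                hits += 1
        # Removes the text under the annotations; other text stays searchable.
        page.apply_redactions()
    doc.save(pdf_out)
    doc.close()
    return hits
```

Searching by quoted string is a simplification: it redacts every occurrence of a matched value on every page, and it can miss a value whose text is split across lines in the PDF.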

If you wish you can keep this a secret 🤫
