SensiHidePDF 🕵️‍♂️

An end-to-end solution to hide sensitive information in PDF files, primarily resumes.

Architecture 🏗️

The application is built natively on Google Cloud Platform, using Cloud Run, Cloud Workflows, Cloud Storage, Eventarc, the Data Loss Prevention (DLP) API, and BigQuery.

The entire application can be provisioned using Terraform, making it easy to deploy and manage.

Here is how it works: 🤔

Whenever a PDF file is uploaded to the input Cloud Storage bucket (input_bucket), an Eventarc event is triggered.
The event starts a Cloud Workflows workflow that runs the following steps in sequence:
- The first Cloud Run service downloads the PDF file and extracts its text.
- The second service sends that text to the Data Loss Prevention (DLP) API to detect sensitive information. (For now, it is hardcoded to detect EMAIL_ADDRESS and PHONE_NUMBER.)
- The third service receives the DLP API response, downloads the PDF file, and redacts the sensitive information from it. The redacted PDF is then uploaded to another Cloud Storage bucket (output_bucket).
- Finally, the last service stores the detected sensitive information in BigQuery for further analysis.
That's it! 🎉
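As a rough illustration of the detection step, the sketch below builds the kind of request body the DLP API expects when inspecting plain text for EMAIL_ADDRESS and PHONE_NUMBER. The function name `build_inspect_request` and the project ID are hypothetical and are not taken from this repository; the actual service may build its request differently.

```python
from typing import Any, Dict, Sequence

# The two info types this project currently detects (per the steps above).
INFO_TYPES = ("EMAIL_ADDRESS", "PHONE_NUMBER")


def build_inspect_request(
    project_id: str,
    text: str,
    info_types: Sequence[str] = INFO_TYPES,
) -> Dict[str, Any]:
    """Build a DLP inspect request for a block of extracted text.

    Hypothetical helper for illustration; not the repository's code.
    """
    return {
        # DLP requests are scoped to a parent resource such as a project.
        "parent": f"projects/{project_id}",
        "inspect_config": {
            "info_types": [{"name": name} for name in info_types],
            # Ask DLP to return the matched text so it can be redacted later.
            "include_quote": True,
        },
        # The content to inspect: plain text, not an image.
        "item": {"value": text},
    }


request = build_inspect_request("my-project", "Reach me at jane@example.com")
```

With the `google-cloud-dlp` client installed and credentials configured, a dictionary of this shape could be passed as `DlpServiceClient().inspect_content(request=request)`. That call is left out of the sketch so that it runs without network access.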

Services 🛠️

Service Name	Source Code	Infrastructure
PDF To Text	Code	Terraform
DLP Runner	Code	Terraform
Redactor	Code	Terraform
Findings Writer	Code	Terraform

Leave a ⭐ if you like this project!

Secret message

There are other solutions to the same problem, including one from GoogleCloudPlatform itself. But there is a big difference between my implementation and theirs. Their implementation converts the PDF into images and has the DLP API redact those images. The drawback of this approach is that the PDF rebuilt from the redacted images cannot be read by screen readers or searched, which makes it less accessible.

I took a different approach: instead of running the DLP API on images, I ran it directly on the extracted text. After receiving the findings, I performed the redactions myself using a Python library. This way the PDF remains searchable and ATS-friendly.
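The idea of redacting from text findings can be sketched in plain Python. This is a simplified stand-in that masks matched strings in extracted text; the real service redacts inside the PDF itself with a PDF library, which is not shown here. The helper names `find_spans` and `mask`, and the assumption that each finding is available as a quoted string, are illustrative only.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]


def find_spans(text: str, quotes: Sequence[str]) -> List[Span]:
    """Locate every occurrence of each quoted finding and merge overlaps."""
    spans: List[Span] = []
    for quote in quotes:
        if not quote:
            continue  # an empty quote would match everywhere
        start = text.find(quote)
        while start != -1:
            spans.append((start, start + len(quote)))
            start = text.find(quote, start + len(quote))
    spans.sort()
    merged: List[Span] = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching spans become one redaction region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def mask(text: str, spans: Sequence[Span], fill: str = "█") -> str:
    """Replace each span with fill characters, leaving other text intact."""
    parts: List[str] = []
    cursor = 0
    for start, end in spans:
        parts.append(text[cursor:start])
        parts.append(fill * (end - start))
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts)


text = "Mail jane@example.com or call 555-0100."
spans = find_spans(text, ["jane@example.com", "555-0100"])
redacted = mask(text, spans)
```

Because only the matched spans are replaced, the surrounding text stays as real text, which is the property that keeps the redacted document searchable.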

If you wish you can keep this a secret 🤫

aditya-shrivastavv/sensihide-pdf
