Pii removal with Multi-process support based on minmin's codes (intel#62)

* commit pii removal mp

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>

* Update README.md

* rename folder

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>

* fix

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>

* Update README.md

---------

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>
xuechendi authored Sep 20, 2023
1 parent 644fa01 commit f7a3126
Showing 16 changed files with 1,822 additions and 0 deletions.
27 changes: 27 additions & 0 deletions tools/pii_removal_for_contact_mp/README.md
@@ -0,0 +1,27 @@
# PII removal for contact info

## Intro

PII removal for contact info replaces personal information such as email addresses and phone numbers with random, meaningless strings to protect personal information.
This script uses multiprocessing to speed up PII removal.
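The core idea can be sketched as follows. This is a minimal illustration, not the actual pyrecdp/presidio implementation; `redact_emails` and `random_replacement` are hypothetical names:

```python
import random
import re
import string

# Simplified email pattern; the real pipeline uses presidio's analyzers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def random_replacement(length: int = 12) -> str:
    # A random, meaningless token so the original value is unrecoverable.
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

def redact_emails(text: str) -> str:
    # Swap every email match for a fresh random string.
    return EMAIL_RE.sub(lambda m: random_replacement(), text)
```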

## Expected Input and Output

Input format: a folder of *.parquet files; a 'text' column is required in the parquet schema.

Output format: a folder of *.parquet files; the 'text' column is processed and personal info is replaced.
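As a sketch of this I/O contract (assuming pandas; `redact_column` and `process_file` are illustrative names, not the script's actual API):

```python
import pandas as pd

def redact_column(df: pd.DataFrame, redact) -> pd.DataFrame:
    # Only the 'text' column is rewritten; all other columns pass through untouched.
    df["text"] = df["text"].map(redact)
    return df

def process_file(in_path: str, out_path: str, redact) -> None:
    # One parquet file in, one redacted parquet file out; the schema is preserved.
    df = pd.read_parquet(in_path)
    df = redact_column(df, redact)
    df.to_parquet(out_path, index=False)
```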

## How to RUN
```bash
conda create --name pyrecdp
conda activate pyrecdp
pip install pyrecdp --pre
pip install presidio_analyzer
python -m spacy download en_core_web_lg
python pii_redaction.py -d ../falcon-refinedweb -o ../falcon-refinedweb-pii_removal -mp 224
```

## NOTICE

We run with file-wise parallelism; a 300 MB file usually takes around 15-20 minutes to complete, so the progress bar will advance slowly.
To confirm the processes are active, use 'top' to check whether multiple Python processes are actively running.
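The file-wise parallelism described above can be sketched roughly like this (illustrative names only; the `-mp` flag above corresponds to `num_procs` here):

```python
import glob
import os
from multiprocessing import Pool

def redact_one_file(path: str) -> str:
    # Placeholder for the per-file redaction work; returns the path when done.
    return path

def run(input_dir: str, num_procs: int) -> list:
    files = sorted(glob.glob(os.path.join(input_dir, "*.parquet")))
    # Each worker claims a whole file, which is why the progress bar
    # only advances when an entire file finishes.
    with Pool(processes=num_procs) as pool:
        return pool.map(redact_one_file, files)
```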
34 changes: 34 additions & 0 deletions tools/pii_removal_for_contact_mp/pii_detection_redaction/README.md
@@ -0,0 +1,34 @@
# How to run PII-for-text pipeline

## Overview
The pipeline detects 5 different types of PII: 'PHONE_NUMBER', 'IP_ADDRESS', 'EMAIL', 'USER', 'KEY'. Detection is based on regular expressions using open-source packages including presidio and bigscience-pii. The detection precision/recall has been tuned for web-scrape-based datasets such as Falcon-RefinedWeb, SlimPajama-StackExchange, and PILE-Hackernews, but precision/recall is not as good for code data such as GitHub.
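A rough sketch of regex-based detection for three of the five categories; these patterns are simplified stand-ins, not the actual presidio/bigscience-pii rules:

```python
import re

# Simplified stand-ins for the presidio / bigscience-pii patterns.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def detect_pii(text: str) -> list:
    # Return (category, start, end) spans for every match.
    spans = []
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((category, m.start(), m.end()))
    return spans
```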

Two redaction methods have been implemented:
1. Replacement with random values
2. Replacement with tags such as [PHONE_NUMBER], [EMAIL], etc.

Currently, option 1) is used.
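The two redaction modes can be sketched like this (a minimal illustration; `redact` is a hypothetical helper operating on detected spans):

```python
import random
import string

def redact(text: str, spans, mode: str = "random") -> str:
    # spans: (category, start, end) tuples. Replace right-to-left so that
    # earlier offsets stay valid after each substitution.
    for category, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        if mode == "random":
            # Mode 1: a random value of the same length as the original.
            replacement = "".join(random.choices(string.ascii_lowercase, k=end - start))
        else:
            # Mode 2: a category tag such as [PHONE_NUMBER].
            replacement = f"[{category}]"
        text = text[:start] + replacement + text[end:]
    return text
```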


## How to run
### Step 1: Set up Env
Please follow [this guide](../workload_in_containers/README.md) to set up the container environment for this workload. When the containers are running, you can enter the container on the head node using the following command:
```bash
docker exec -it ray-leader bash
```

### Step 2: Run PII removal
Once you are inside the ray-leader container, go to the scripts folder. You can change `BATCHSIZE` and `CPUCORES` depending on the memory and the number of cores on your system. Then run the PII script, for example:
```bash
bash pii-refinedweb.sh
```

### Step 3: Validate outputs
We implemented 3 checks:
1. Check the schema and sample rows in the output parquets by loading the parquet with pandas.
2. Count the number of PIIs per category by sampling from the outputs. You can then estimate the total number of PIIs per category by multiplying the sampled counts by total_num_samples/sample_used_for_this_check.
3. Visually check a small sample by producing an HTML file with yellow highlights on the PIIs, annotated with the corresponding category (note that sometimes the highlights are not at the exact location, but they should be quite close).

```
# First change the path to the data files in the python script
python src/validate_ray_outputs.py
```
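For check 2, the extrapolation step amounts to simple scaling; a sketch with an illustrative function name:

```python
def estimate_total_piis(sample_counts: dict, samples_used: int, total_num_samples: int) -> dict:
    # Scale per-category counts from the inspected sample up to the full dataset.
    factor = total_num_samples / samples_used
    return {category: round(count * factor) for category, count in sample_counts.items()}
```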
@@ -0,0 +1,14 @@
#!/bin/bash
BATCHSIZE=1000
CPUCORES=48
DATA=pile_hn
OUTPUT_PREFIX=pile_hn
DATA_DIR=/home/user/local/PILE/hn

python ../src/pii_redaction_v2.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local
@@ -0,0 +1,17 @@
#!/bin/bash
BATCHSIZE=50000
CPUCORES=48
INPUT=togethercomputer/RedPajama-Data-1T-Sample
DATA=slimpajama
OUTPUT_PREFIX=redpajama
DATA_DIR=/home/user/local/dataset/RedPajama-Data-1T-Sample/

python ../src/pii_redaction.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--input $INPUT \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local
#--skip 500000
@@ -0,0 +1,14 @@
#!/bin/bash
BATCHSIZE=1000
CPUCORES=48
DATA=refinedweb
OUTPUT_PREFIX=pii_test_output
DATA_DIR=/home/user/local/refinedweb_samples

python ../src/pii_redaction_v2.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local
@@ -0,0 +1,16 @@
#!/bin/bash
BATCHSIZE=50000
CPUCORES=48
INPUT=cerebras/SlimPajama-627B
DATA=slimpajama
OUTPUT_PREFIX=pii_slimpajama_se
DATA_DIR=/home/user/local/

python ../src/pii_redaction_v2.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--input $INPUT \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local