Pii removal with Multi-process support based on minmin's codes (intel#62)

* commit pii removal mp

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>

* Update README.md

* rename folder

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>

* fix

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>

* Update README.md

---------

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>
xuechendi authored Sep 20, 2023
1 parent 644fa01 commit f7a3126
Showing 16 changed files with 1,822 additions and 0 deletions.
27 changes: 27 additions & 0 deletions tools/pii_removal_for_contact_mp/README.md
@@ -0,0 +1,27 @@
# PII removal for contact info

## Intro

PII removal for contact info replaces personal information such as email addresses and phone numbers with random, meaningless strings to protect personal information.
This script uses multiprocessing to speed up PII removal.
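The core idea can be sketched as follows. This is a minimal illustration, not the actual pyrecdp/presidio implementation; `redact_emails` and `random_replacement` are hypothetical names:

```python
import random
import re
import string

# Simplified email pattern; the real pipeline uses presidio's analyzers.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def random_replacement(length: int = 12) -> str:
    # A random, meaningless token so the original value is unrecoverable.
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))

def redact_emails(text: str) -> str:
    # Swap every email match for a fresh random string.
    return EMAIL_RE.sub(lambda m: random_replacement(), text)
```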

## Expected Input and Output

Input format: a folder of *.parquet files; a 'text' column is required in the parquet schema.

Output format: a folder of *.parquet files; the 'text' column is processed and personal info is replaced.
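As a sketch of this I/O contract (assuming pandas; `redact_column` and `process_file` are illustrative names, not the script's actual API):

```python
import pandas as pd

def redact_column(df: pd.DataFrame, redact) -> pd.DataFrame:
    # Only the 'text' column is rewritten; all other columns pass through untouched.
    df["text"] = df["text"].map(redact)
    return df

def process_file(in_path: str, out_path: str, redact) -> None:
    # One parquet file in, one redacted parquet file out; the schema is preserved.
    df = pd.read_parquet(in_path)
    df = redact_column(df, redact)
    df.to_parquet(out_path, index=False)
```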

## How to RUN
```bash
conda create --name pyrecdp
conda activate pyrecdp
pip install pyrecdp --pre
pip install presidio_analyzer
python -m spacy download en_core_web_lg
python pii_redaction.py -d ../falcon-refinedweb -o ../falcon-refinedweb-pii_removal -mp 224
```

## NOTICE

We run with file-wise parallelism; a 300 MB file usually takes around 15-20 minutes to complete, so the progress bar will advance slowly.
To confirm the processes are active, use 'top' to check whether multiple Python processes are actively running.
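The file-wise parallelism described above can be sketched roughly like this (illustrative names only; the `-mp` flag above corresponds to `num_procs` here):

```python
import glob
import os
from multiprocessing import Pool

def redact_one_file(path: str) -> str:
    # Placeholder for the per-file redaction work; returns the path when done.
    return path

def run(input_dir: str, num_procs: int) -> list:
    files = sorted(glob.glob(os.path.join(input_dir, "*.parquet")))
    # Each worker claims a whole file, which is why the progress bar
    # only advances when an entire file finishes.
    with Pool(processes=num_procs) as pool:
        return pool.map(redact_one_file, files)
```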
34 changes: 34 additions & 0 deletions tools/pii_removal_for_contact_mp/pii_detection_redaction/README.md
@@ -0,0 +1,34 @@
# How to run PII-for-text pipeline

## Overview
The pipeline detects 5 different types of PII: 'PHONE_NUMBER', 'IP_ADDRESS', 'EMAIL', 'USER', 'KEY'. Detection is based on regular expressions using open-source packages including presidio and bigscience-pii. The detection precision/recall has been tuned for web-scrape-based datasets such as Falcon-RefinedWeb, SlimPajama-StackExchange, and PILE-Hackernews, but precision/recall is not as good for code data such as GitHub.
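A rough sketch of regex-based detection for three of the five categories; these patterns are simplified stand-ins, not the actual presidio/bigscience-pii rules:

```python
import re

# Simplified stand-ins for the presidio / bigscience-pii patterns.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "PHONE_NUMBER": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def detect_pii(text: str) -> list:
    # Return (category, start, end) spans for every match.
    spans = []
    for category, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((category, m.start(), m.end()))
    return spans
```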

Two redaction methods have been implemented:
1. Replacement with random values
2. Replacement with tags such as [PHONE_NUMBER], [EMAIL], etc.

Currently, option 1) is used.
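The two redaction modes can be sketched like this (a minimal illustration; `redact` is a hypothetical helper operating on detected spans):

```python
import random
import string

def redact(text: str, spans, mode: str = "random") -> str:
    # spans: (category, start, end) tuples. Replace right-to-left so that
    # earlier offsets stay valid after each substitution.
    for category, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        if mode == "random":
            # Mode 1: a random value of the same length as the original.
            replacement = "".join(random.choices(string.ascii_lowercase, k=end - start))
        else:
            # Mode 2: a category tag such as [PHONE_NUMBER].
            replacement = f"[{category}]"
        text = text[:start] + replacement + text[end:]
    return text
```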


## How to run
### Step 1: Set up Env
Please follow [this guide](../workload_in_containers/README.md) to set up the container environment for this workload. When the containers are running, you can enter the container on the head node using the following command:
```bash
docker exec -it ray-leader bash
```

### Step 2: Run PII removal
Once you are inside the ray-leader container, go to the scripts folder. You can change `BATCHSIZE` and `CPUCORES` depending on the memory and the number of cores on your system. Then run the PII script, for example:
```bash
bash pii-refinedweb.sh
```

### Step 3: Validate outputs
We implemented 3 checks:
1. Check the schema and sample rows in the output parquets by loading the parquet with pandas.
2. Count the number of PIIs per category by sampling from the outputs. You can then estimate the total number of PIIs per category by multiplying the sampled counts by total_num_samples/sample_used_for_this_check.
3. Visually check a small sample by producing an HTML file with yellow highlights on the PIIs, annotated with the corresponding category (note that sometimes the highlights are not at the exact location, but they should be quite close).

```
# First change the path to the data files in the python script
python src/validate_ray_outputs.py
```
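For check 2, the extrapolation step amounts to simple scaling; a sketch with an illustrative function name:

```python
def estimate_total_piis(sample_counts: dict, samples_used: int, total_num_samples: int) -> dict:
    # Scale per-category counts from the inspected sample up to the full dataset.
    factor = total_num_samples / samples_used
    return {category: round(count * factor) for category, count in sample_counts.items()}
```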
@@ -0,0 +1,14 @@
#!/bin/bash
BATCHSIZE=1000
CPUCORES=48
DATA=pile_hn
OUTPUT_PREFIX=pile_hn
DATA_DIR=/home/user/local/PILE/hn

python ../src/pii_redaction_v2.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local
@@ -0,0 +1,17 @@
#!/bin/bash
BATCHSIZE=50000
CPUCORES=48
INPUT=togethercomputer/RedPajama-Data-1T-Sample
DATA=slimpajama
OUTPUT_PREFIX=redpajama
DATA_DIR=/home/user/local/dataset/RedPajama-Data-1T-Sample/

python ../src/pii_redaction.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--input $INPUT \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local
#--skip 500000
@@ -0,0 +1,14 @@
#!/bin/bash
BATCHSIZE=1000
CPUCORES=48
DATA=refinedweb
OUTPUT_PREFIX=pii_test_output
DATA_DIR=/home/user/local/refinedweb_samples

python ../src/pii_redaction_v2.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local
@@ -0,0 +1,16 @@
#!/bin/bash
BATCHSIZE=50000
CPUCORES=48
INPUT=cerebras/SlimPajama-627B
DATA=slimpajama
OUTPUT_PREFIX=pii_slimpajama_se
DATA_DIR=/home/user/local/

python ../src/pii_redaction_v2.py \
--load-batch-size $BATCHSIZE \
--cpu-per-worker $CPUCORES \
--input $INPUT \
--dataset-family $DATA \
--output-prefix $OUTPUT_PREFIX \
--data-dir $DATA_DIR \
--local