forked from intel/llm-on-ray
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Pii removal with Multi-process support based on minmin's codes (intel#62)

* commit pii removal mp
  Signed-off-by: Xue, Chendi <chendi.xue@intel.com>
* Update README.md
* rename folder
  Signed-off-by: Xue, Chendi <chendi.xue@intel.com>
* fix
  Signed-off-by: Xue, Chendi <chendi.xue@intel.com>
* Update README.md

---------

Signed-off-by: Xue, Chendi <chendi.xue@intel.com>
Showing 16 changed files with 1,822 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
# PII removal for contact info | ||
|
||
## Intro | ||
|
||
PII removal for contact info replaces personal information, such as email addresses and phone numbers, with random nonsense strings to protect personal information. | ||
This script uses multiprocessing to speed up PII removal. | ||
|
||
## Expected Input and Output | ||
|
||
Input format: a folder of *parquet files; a 'text' column is required in each file. | ||
|
||
Output format: a folder of *parquet files; the 'text' column is processed and personal info is replaced. | ||
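
The input contract above can be checked before a long run starts. The following is a minimal sketch, not part of the committed script: the helper names `list_parquet_files` and `has_text_column` are hypothetical, and the schema check assumes `pyarrow` is installed (it is imported lazily so the file listing works without it).

```python
from pathlib import Path


def list_parquet_files(folder):
    """Return the sorted *parquet files found directly inside `folder`."""
    return sorted(p for p in Path(folder).glob("*parquet") if p.is_file())


def has_text_column(parquet_path, column="text"):
    """Check the parquet schema for the required column without loading rows.

    Assumption: pyarrow is available in the environment.
    """
    import pyarrow.parquet as pq

    return column in pq.read_schema(str(parquet_path)).names
```

Calling `has_text_column` on each path returned by `list_parquet_files` lets you reject a folder early instead of failing partway through a multi-hour job.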
|
||
## How to Run | ||
``` | ||
conda create --name pyrecdp | ||
conda activate pyrecdp | ||
pip install pyrecdp --pre | ||
pip install presidio_analyzer | ||
python -m spacy download en_core_web_lg | ||
python pii_redaction.py -d ../falcon-refinedweb -o ../falcon-refinedweb-pii_removal -mp 224 | ||
``` | ||
|
||
## NOTICE | ||
|
||
Parallelism is file-wise, and a 300MB file usually takes around 15-20 minutes to complete, so the progress bar advances slowly. | ||
To confirm that the job is still active, use 'top' to check whether multiple Python processes are running. |
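
File-wise parallelism means each worker handles one whole file, which is why progress only moves when an entire file finishes. The sketch below illustrates that pattern with the standard library; it is an illustration under assumptions, not the code in `pii_redaction.py`. The function `run_filewise` and its parameters are hypothetical names.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def run_filewise(paths, process_one, max_workers=4, executor_cls=ProcessPoolExecutor):
    """Apply `process_one` to every path, submitting one task per file.

    Results are collected as files finish, so completion order is arbitrary
    and the progress counter only advances when a whole file is done.
    """
    results = {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for done, future in enumerate(as_completed(futures), start=1):
            path = futures[future]
            results[path] = future.result()  # re-raises any worker exception
            print(f"[{done}/{len(futures)}] finished {path}")
    return results
```

With the default process pool, `process_one` has to be a picklable top-level function, and the call should sit under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` is a convenient way to try the flow on small inputs.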
34 changes: 34 additions & 0 deletions
tools/pii_removal_for_contact_mp/pii_detection_redaction/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
# How to run the PII-for-text pipeline | ||
|
||
## Overview | ||
The pipeline detects 5 types of PII: 'PHONE_NUMBER', 'IP_ADDRESS', 'EMAIL', 'USER', 'KEY'. Detection is based on regular expressions from open-source packages, including presidio and bigscience-pii. Precision and recall have been tuned for web-scrape datasets such as Falcon-RefinedWeb, SlimPajama-StackExchange, and PILE-Hackernews, but they are not good for code data such as GitHub. | ||
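
To make the regex-based approach concrete, here is a minimal sketch covering two of the five categories. The patterns are deliberately simplified and are an assumption for illustration; they are not the expressions shipped in presidio or bigscience-pii, and `detect_pii` is a hypothetical name.

```python
import re

# Simplified illustrative patterns; NOT the ones used by presidio or bigscience-pii.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def _valid_ipv4(candidate):
    """Reject dotted quads with an octet above 255."""
    return all(int(part) <= 255 for part in candidate.split("."))


def detect_pii(text):
    """Return (start, end, category, value) tuples sorted by start offset."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            value = match.group()
            if category == "IP_ADDRESS" and not _valid_ipv4(value):
                continue
            found.append((match.start(), match.end(), category, value))
    return sorted(found)
```

The post-match validation step for IP addresses shows one way a regex detector can trade recall for precision, which is the kind of tuning the overview refers to.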
|
||
Two redaction methods have been implemented: | ||
1. Replacement with random values | ||
2. Replacement with tags such as [PHONE_NUMBER], [EMAIL], etc. | ||
|
||
Currently, option 1 is used. | ||
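
The two redaction methods can be sketched as follows. This is an illustration under assumptions rather than the pipeline's implementation: `redact` and `random_like` are hypothetical names, spans are assumed to be non-overlapping `(start, end, category)` tuples, and the random replacement here simply swaps letters and digits while keeping punctuation and length.

```python
import random
import string


def random_like(value, rng):
    """Replace letters and digits with random ones, keeping punctuation and length."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def redact(text, spans, method="random", seed=None):
    """Redact non-overlapping (start, end, category) spans in `text`.

    method="random" substitutes random characters (option 1);
    method="tag" inserts a [CATEGORY] placeholder (option 2).
    """
    if method not in ("random", "tag"):
        raise ValueError(f"unknown redaction method: {method!r}")
    rng = random.Random(seed)
    # Work right to left so earlier offsets stay valid after each replacement.
    for start, end, category in sorted(spans, reverse=True):
        if method == "tag":
            replacement = f"[{category}]"
        else:
            replacement = random_like(text[start:end], rng)
        text = text[:start] + replacement + text[end:]
    return text
```

Random replacement keeps the text looking natural for downstream training, while tags make redactions easy to count and audit afterwards.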
|
||
|
||
## How to run | ||
### Step 1: Set up Env | ||
Please follow [this guide](../workload_in_containers/README.md) to set up the container environment for this workload. When the containers are running, you can enter the container on the head node using the following command: | ||
```bash | ||
docker exec -it ray-leader bash | ||
``` | ||
|
||
### Step 2: Run PII removal | ||
Once you are inside the ray-leader container, go to the scripts folder. You can change `BATCHSIZE` and `CPUCORES` in the script depending on the memory and number of cores on your system. Then you can run the PII script, for example: | ||
```bash | ||
bash pii-refinedweb.sh | ||
``` | ||
|
||
### Step 3: Validate outputs | ||
We implemented 3 checks: | ||
1. Check the schema and sample rows in the output parquet files by loading them with pandas. | ||
2. Count the number of PIIs per category in a sample of the outputs. You can then estimate the total number of PIIs per category by multiplying the sampled count by total_num_samples/sample_used_for_this_check. | ||
3. Visually check a small sample by producing an HTML file that highlights the PIIs in yellow and annotates each with its category (note that the highlights are sometimes not at the exact location, but should be quite close). | ||
|
||
``` | ||
# First change the path to the data files in the python script | ||
python src/validate_ray_outputs.py | ||
``` |
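
Checks 2 and 3 above can be sketched in a few lines. This is a minimal illustration, not the contents of `validate_ray_outputs.py`; the function names `estimate_totals` and `highlight` are hypothetical, and spans are assumed to be non-overlapping `(start, end, category)` tuples.

```python
import html


def estimate_totals(sample_counts, sample_size, total_num_samples):
    """Scale per-category counts from a sample up to the full dataset."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    scale = total_num_samples / sample_size
    return {category: count * scale for category, count in sample_counts.items()}


def highlight(text, spans):
    """Render `text` as HTML with each (start, end, category) span marked in yellow."""
    parts, cursor = [], 0
    for start, end, category in sorted(spans):
        parts.append(html.escape(text[cursor:start]))
        parts.append(
            f'<mark style="background: yellow" title="{html.escape(category)}">'
            f"{html.escape(text[start:end])}</mark>"
        )
        cursor = end
    parts.append(html.escape(text[cursor:]))
    return "".join(parts)
```

For example, 12 emails found in a 1,000-row sample of a 250,000-row dataset scale to an estimate of 3,000 emails. The estimate is only as good as the sample is representative.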
14 changes: 14 additions & 0 deletions
tools/pii_removal_for_contact_mp/pii_detection_redaction/scripts/pii-pile-hn.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
#!/bin/bash | ||
BATCHSIZE=1000 | ||
CPUCORES=48 | ||
DATA=pile_hn | ||
OUTPUT_PREFIX=pile_hn | ||
DATA_DIR=/home/user/local/PILE/hn | ||
|
||
python ../src/pii_redaction_v2.py \ | ||
--load-batch-size $BATCHSIZE \ | ||
--cpu-per-worker $CPUCORES \ | ||
--dataset-family $DATA \ | ||
--output-prefix $OUTPUT_PREFIX \ | ||
--data-dir $DATA_DIR \ | ||
--local |
17 changes: 17 additions & 0 deletions
tools/pii_removal_for_contact_mp/pii_detection_redaction/scripts/pii-redpj.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
#!/bin/bash | ||
BATCHSIZE=50000 | ||
CPUCORES=48 | ||
INPUT=togethercomputer/RedPajama-Data-1T-Sample | ||
DATA=slimpajama | ||
OUTPUT_PREFIX=redpajama | ||
DATA_DIR=/home/user/local/dataset/RedPajama-Data-1T-Sample/ | ||
|
||
python ../src/pii_redaction.py \ | ||
--load-batch-size $BATCHSIZE \ | ||
--cpu-per-worker $CPUCORES \ | ||
--input $INPUT \ | ||
--dataset-family $DATA \ | ||
--output-prefix $OUTPUT_PREFIX \ | ||
--data-dir $DATA_DIR \ | ||
--local | ||
#--skip 500000 |
14 changes: 14 additions & 0 deletions
tools/pii_removal_for_contact_mp/pii_detection_redaction/scripts/pii-refinedweb.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
#!/bin/bash | ||
BATCHSIZE=1000 | ||
CPUCORES=48 | ||
DATA=refinedweb | ||
OUTPUT_PREFIX=pii_test_output | ||
DATA_DIR=/home/user/local/refinedweb_samples | ||
|
||
python ../src/pii_redaction_v2.py \ | ||
--load-batch-size $BATCHSIZE \ | ||
--cpu-per-worker $CPUCORES \ | ||
--dataset-family $DATA \ | ||
--output-prefix $OUTPUT_PREFIX \ | ||
--data-dir $DATA_DIR \ | ||
--local |
16 changes: 16 additions & 0 deletions
tools/pii_removal_for_contact_mp/pii_detection_redaction/scripts/pii-slimpj.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
#!/bin/bash | ||
BATCHSIZE=50000 | ||
CPUCORES=48 | ||
INPUT=cerebras/SlimPajama-627B | ||
DATA=slimpajama | ||
OUTPUT_PREFIX=pii_slimpajama_se | ||
DATA_DIR=/home/user/local/ | ||
|
||
python ../src/pii_redaction_v2.py \ | ||
--load-batch-size $BATCHSIZE \ | ||
--cpu-per-worker $CPUCORES \ | ||
--input $INPUT \ | ||
--dataset-family $DATA \ | ||
--output-prefix $OUTPUT_PREFIX \ | ||
--data-dir $DATA_DIR \ | ||
--local |
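
The launcher scripts above all pass the same family of flags. As a reading aid, here is a sketch of an `argparse` parser that would accept them. The flag names come from the scripts; everything else (types, defaults, which flags are required, the help text, and the name `build_parser`) is an assumption and may differ from the real `pii_redaction_v2.py`.

```python
import argparse


def build_parser():
    """Build a parser for the flags seen in the launcher scripts (types assumed)."""
    parser = argparse.ArgumentParser(description="PII detection and redaction (sketch)")
    parser.add_argument("--load-batch-size", type=int, default=1000,
                        help="rows loaded per batch (assumed meaning)")
    parser.add_argument("--cpu-per-worker", type=int, default=48,
                        help="CPU cores given to each worker (assumed meaning)")
    parser.add_argument("--input", default=None,
                        help="dataset identifier, e.g. cerebras/SlimPajama-627B")
    parser.add_argument("--dataset-family", required=True,
                        help="e.g. refinedweb, slimpajama, pile_hn")
    parser.add_argument("--output-prefix", required=True)
    parser.add_argument("--data-dir", required=True)
    parser.add_argument("--local", action="store_true",
                        help="read data from local disk (assumed meaning)")
    parser.add_argument("--skip", type=int, default=0,
                        help="number of rows to skip (assumed meaning)")
    return parser
```

Note that `argparse` converts dashes to underscores, so `--load-batch-size` becomes `args.load_batch_size`.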