forked from intel/llm-on-ray
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
minor updates for distributed tokenization (intel#64)
* slim dockerfile
* remove credentials
* rename
* add postfix
* update
* push 1 version of dp
* add new code
* remove the old code
* revert
* remove unused libs
* add new package
* add parquet support
* change name
* use output_dir instead of output_prefix
* merge
* remove unused file
* fix typo
* add more automation
* add dp config
* add saving csv
* add dp config yaml
* add stop containers
* add stop containers
* remove dp config
* tokenier as input
* add a file to count row numbers
* change dockerfile name
* some refinement
* add file numbers
* add real script
* add mulit-processing code
* add file name
* add pyrecdp
* refine
* remove
* remove developer name
* remove pyrecdp
* change name oder
* remove files
* add use-slow flag

---------

Co-authored-by: N <matrix.yao@intel.com>
1 parent f7a3126 commit 71cc3ce
Showing 5 changed files with 87 additions and 87 deletions.
@@ -1,23 +1,30 @@
-echo -e "\n distributed tokenization with ray"

+start=`date +%s`
+echo -e "\n distributed tokenization with ray for Book"
 python tokenize_and_save.py \
---input-dir /home/user/shared/PILE_dedup/EuroParl \
+--input-dir /home/user/shared/user/Book \
+--file-type parquet \
+--output-dir /home/user/shared/user/tokenized_Book \
 --data-field text \
 --tokenizer togethercomputer/LLaMA-2-7B-32K \
---output-dir /home/user/shared/EuroParl_tokenized \
---load-batch-size 1000 \
---cpu-per-node 90
+--load-batch-size 10000 \
+--cpu-per-node 220 \
+--use-slow
+end=`date +%s`
+echo "Execution Time is: $(($end-$start)) seconds" | tee tokenized_Book.log

-sleep 30
+sleep 10
 echo -e "\n merging multiple megatron data files.."
-python merge_datasets.py --input /home/user/shared/EuroParl_tokenized --output-prefix /home/user/shared/EuroParl_tokenized
+python merge_datasets.py --input /home/user/shared/user/tokenized_Book --output-prefix /home/user/shared/user/tokenized_Book >> tokenized_Book.log

-sleep 15
+sleep 5
 echo -e "\n removing multiple megatron files.."
-rm -fr /home/user/shared/EuroParl_tokenized
+rm -fr /home/user/shared/user/tokenized_Book

 sleep 5
 echo -e "\n counting token numbers.."
-python count_tokens.py /home/user/shared/EuroParl_tokenized /home/user/shared/EuroParl_tokenized.stat


+python count_tokens.py /home/user/shared/user/tokenized_Book /home/user/shared/user/tokenized_Book.stat >> tokenized_Book.log

+sleep 5
+mkdir /home/user/shared/user/tokenized_Book
+mv /home/user/shared/user/tokenized_Book.* /home/user/shared/user/tokenized_Book
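
The diff times only the tokenization step, by capturing `date +%s` before and after it and printing the difference. If the merge and count steps should be timed the same way, the pattern could be factored into a helper. The following is a minimal sketch, not part of the commit: the function name `time_step` and its argument order are hypothetical, and it assumes whole-second resolution is sufficient, as in the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): run a command, then append its
# wall-clock duration in whole seconds to a log file, echoing it as well.
# Usage: time_step LABEL LOGFILE COMMAND [ARGS...]
time_step() {
  local label=$1 log=$2
  shift 2
  local start end status
  start=$(date +%s)
  "$@"                      # run the wrapped command with its arguments intact
  status=$?
  end=$(date +%s)
  echo "${label}: Execution Time is: $((end - start)) seconds" | tee -a "$log"
  return "$status"          # propagate the wrapped command's exit status
}

# Example call, reusing the merge step from the script above (commented out
# because it needs the repository's scripts and data to run):
# time_step "merge" tokenized_Book.log \
#   python merge_datasets.py --input /home/user/shared/user/tokenized_Book \
#     --output-prefix /home/user/shared/user/tokenized_Book
```

Unlike the plain `tee` in the script, this sketch uses `tee -a`, so repeated calls append to the log instead of truncating it.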