Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for mcore T5 PEFT #9584

Closed
wants to merge 21 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
205 changes: 68 additions & 137 deletions .github/workflows/cicd-main.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2391,7 +2391,7 @@ jobs:

L2_Megatron_GPT_Pretraining_and_Resume_Training_TP2:
needs: [cicd-test-container-setup]
runs-on: self-hosted-azure-gpus-2-h100
runs-on: self-hosted-azure
timeout-minutes: 10
container:
image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
Expand All @@ -2403,21 +2403,6 @@ jobs:
--env TRANSFORMERS_OFFLINE=0
--env HYDRA_FULL_ERROR=1
--volume /mnt/datadrive/TestData:/home/TestData
env:
# This is to improve p2p overlap on H100
NVTE_FWD_LAYERNORM_SM_MARGIN: 8
NVTE_BWD_LAYERNORM_SM_MARGIN: 8
TORCH_NCCL_AVOID_RECORD_STREAMS: 1
NCCL_MIN_NCHANNELS: 4
# TP overlap is not supported in docker environment
#NVTE_UB_SPLIT_RS: 0
#NVTE_UB_ATOMIC_GEMM_RS: 1
#NVTE_RS_STRIDED_ATOMIC: 1
#NVTE_UB_FP8_RS: 1
# Increase p2p chunksize to 2MB
NCCL_P2P_NET_CHUNKSIZE: 2097152
# Disable gc when switching to/from validation steps
NEMO_MANUAL_GC_IN_VALIDATION: 0
steps:
- name: Checkout repository
uses: actions/checkout@v4
Expand All @@ -2432,17 +2417,8 @@ jobs:
trainer.max_steps=3 \
trainer.gradient_clip_val=1.0 \
exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \
++model.transformer_engine=True \
++model.fp8=True \
++model.fp8_hybrid=True \
++model.fp8_amax_history_len=1024 \
++model.fp8_amax_compute_algo=max \
++model.reduce_amax=True \
++model.use_te_rng_tracker=True \
++model.name=megatron_gpt_full_te_layer_autocast \
model.ub_tp_comm_overlap=False \
model.tensor_model_parallel_size=2 \
model.optim.name=distributed_fused_adam \
model.optim.name=fused_adam \
model.optim.lr=2e-4 \
model.optim.sched.warmup_steps=1 \
model.optim.sched.constant_steps=1 \
Expand Down Expand Up @@ -2476,17 +2452,8 @@ jobs:
trainer.gradient_clip_val=1.0 \
exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \
exp_manager.resume_if_exists=True \
++model.transformer_engine=True \
++model.fp8=True \
++model.fp8_hybrid=True \
++model.fp8_amax_history_len=1024 \
++model.fp8_amax_compute_algo=max \
++model.reduce_amax=True \
++model.use_te_rng_tracker=True \
++model.name=megatron_gpt_full_te_layer_autocast \
model.ub_tp_comm_overlap=False \
model.tensor_model_parallel_size=2 \
model.optim.name=distributed_fused_adam \
model.optim.name=fused_adam \
model.optim.lr=2e-4 \
model.optim.sched.warmup_steps=2 \
model.optim.sched.constant_steps=2 \
Expand Down Expand Up @@ -2978,11 +2945,10 @@ jobs:
needs: [cicd-test-container-setup]
uses: ./.github/workflows/_test_template.yml
with:
RUNNER: self-hosted-azure-gpus-2-h100
RUNNER: self-hosted-azure
SCRIPT: |
python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
trainer.devices=2 \
trainer.accelerator=gpu \
trainer.log_every_n_steps=1 \
trainer.val_check_interval=2 \
trainer.limit_val_batches=2 \
Expand All @@ -2991,15 +2957,6 @@ jobs:
trainer.precision=bf16 \
trainer.gradient_clip_val=1.0 \
exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \
++model.transformer_engine=True \
++model.fp8=True \
++model.fp8_hybrid=True \
++model.fp8_amax_history_len=1024 \
++model.fp8_amax_compute_algo=max \
++model.reduce_amax=True \
++model.use_te_rng_tracker=True \
++model.name=megatron_gpt_full_te_layer_autocast \
model.ub_tp_comm_overlap=False \
model.pipeline_model_parallel_size=2 \
model.tensor_model_parallel_size=1 \
model.mcore_gpt=True \
Expand All @@ -3024,15 +2981,12 @@ jobs:
model.hidden_size=256 \
model.num_attention_heads=8 \
model.activations_checkpoint_method=block \
model.activations_checkpoint_granularity=full \
model.activations_checkpoint_num_layers=1 \
model.data.validation_drop_last=False \
model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \
model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings

python examples/nlp/language_modeling/megatron_gpt_pretraining.py \
trainer.devices=2 \
trainer.accelerator=gpu \
trainer.log_every_n_steps=1 \
trainer.val_check_interval=2 \
trainer.limit_val_batches=2 \
Expand All @@ -3044,15 +2998,6 @@ jobs:
model.megatron_amp_O2=True \
exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \
exp_manager.resume_if_exists=True \
++model.transformer_engine=True \
++model.fp8=True \
++model.fp8_hybrid=True \
++model.fp8_amax_history_len=1024 \
++model.fp8_amax_compute_algo=max \
++model.reduce_amax=True \
++model.use_te_rng_tracker=True \
++model.name=megatron_gpt_full_te_layer_autocast \
model.ub_tp_comm_overlap=False \
model.pipeline_model_parallel_size=2 \
model.tensor_model_parallel_size=1 \
model.optim.name=distributed_fused_adam \
Expand All @@ -3075,9 +3020,7 @@ jobs:
model.hidden_size=256 \
model.num_attention_heads=8 \
model.activations_checkpoint_method=block \
model.activations_checkpoint_granularity=full \
model.activations_checkpoint_num_layers=1 \
model.data.validation_drop_last=False \
model.data.data_prefix=[.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document,.5,/home/TestData/nlp/megatron_gpt/data/gpt/simple_wiki_gpt_preproc_text_document] \
model.data.index_mapping_dir=examples/nlp/language_modeling/gpt_index_mappings
AFTER_SCRIPT: |
Expand Down Expand Up @@ -3586,80 +3529,6 @@ jobs:
rm -rf examples/nlp/language_modeling/t5_pretrain_results
rm -rf examples/nlp/language_modeling/t5_index_mappings

L2_Megatron_Core_T5_Pretraining_and_Resume_Training_TP2:
needs: [cicd-test-container-setup]
uses: ./.github/workflows/_test_template.yml
with:
RUNNER: self-hosted-azure
SCRIPT: |
NVTE_FUSED_ATTN=0 NVTE_FLASH_ATTN=0 python examples/nlp/language_modeling/megatron_t5_pretraining.py \
trainer.devices=2 \
trainer.log_every_n_steps=1 \
trainer.max_epochs=null \
trainer.max_steps=10 \
trainer.val_check_interval=10 \
trainer.accumulate_grad_batches=1 \
trainer.precision=bf16 \
model.megatron_amp_O2=True \
exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \
model.mcore_t5=True \
model.transformer_engine=True \
model.tensor_model_parallel_size=2 \
model.micro_batch_size=4 \
model.global_batch_size=4 \
model.seq_length=128 \
model.encoder.num_layers=4 \
model.encoder.hidden_size=64 \
model.encoder.num_attention_heads=8 \
model.decoder.num_layers=4 \
model.decoder.hidden_size=64 \
model.decoder.num_attention_heads=8 \
model.encoder.transformer_block_type='pre_ln' \
model.decoder.transformer_block_type='pre_ln' \
model.data.data_prefix=[.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src,.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref] \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \
model.data.data_impl=text_mmap \
+model.data.data_impl_kwargs.newline_int=10 \
+model.data.data_impl_kwargs.header_lines=0 \
+model.data.data_impl_kwargs.workers=null \
+model.data.data_impl_kwargs.sort_dataset_paths=False

NVTE_FUSED_ATTN=0 NVTE_FLASH_ATTN=0 python examples/nlp/language_modeling/megatron_t5_pretraining.py \
trainer.devices=2 \
trainer.log_every_n_steps=1 \
trainer.max_epochs=null \
trainer.max_steps=10 \
trainer.val_check_interval=10 \
trainer.accumulate_grad_batches=1 \
trainer.precision=bf16 \
model.megatron_amp_O2=True \
exp_manager.exp_dir=examples/nlp/language_modeling/t5_pretrain_results \
exp_manager.resume_if_exists=True \
model.mcore_t5=True \
model.transformer_engine=True \
model.tensor_model_parallel_size=2 \
model.micro_batch_size=4 \
model.global_batch_size=4 \
model.seq_length=128 \
model.encoder.num_layers=4 \
model.encoder.hidden_size=64 \
model.encoder.num_attention_heads=8 \
model.decoder.num_layers=4 \
model.decoder.hidden_size=64 \
model.decoder.num_attention_heads=8 \
model.encoder.transformer_block_type='pre_ln' \
model.decoder.transformer_block_type='pre_ln' \
model.data.data_prefix=[.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.src,.5,/home/TestData/nlp/nmt/toy_data/wmt14-de-en.ref] \
model.data.index_mapping_dir=examples/nlp/language_modeling/t5_index_mappings \
model.data.data_impl=text_mmap \
+model.data.data_impl_kwargs.newline_int=10 \
+model.data.data_impl_kwargs.header_lines=0 \
+model.data.data_impl_kwargs.workers=null \
+model.data.data_impl_kwargs.sort_dataset_paths=False
AFTER_SCRIPT: |
rm -rf examples/nlp/language_modeling/t5_pretrain_results
rm -rf examples/nlp/language_modeling/t5_index_mappings

L2_Megatron_T5_with_ALiBi_Pretraining_and_Resume_Training_TP2:
needs: [cicd-test-container-setup]
uses: ./.github/workflows/_test_template.yml
Expand Down Expand Up @@ -4046,6 +3915,17 @@ jobs:
--prompt 'How do I fix my GPU memory issue? I am seeing <mask> out of memory.' \
--tensor_model_parallel_size 1

L2_Megatron_Core_T5_Eval:
needs: [cicd-test-container-setup]
uses: ./.github/workflows/_test_template.yml
with:
RUNNER: self-hosted-azure
SCRIPT: |
NVTE_FLASH_ATTN=0 NVTE_FUSED_ATTN=0 python examples/nlp/language_modeling/megatron_t5_eval.py \
--model_file /home/TestData/nlp/megatron_t5/220m/megatron_mcore_t5_220m.nemo \
--prompt 'How do I fix my GPU memory issue? I am seeing <mask> out of memory.' \
--tensor_model_parallel_size 1

L2_Megatron_BART_Pretraining_and_Resume_Training_TP2:
needs: [cicd-test-container-setup]
uses: ./.github/workflows/_test_template.yml
Expand Down Expand Up @@ -4184,7 +4064,6 @@ jobs:
AFTER_SCRIPT: |
rm -rf examples/nlp/language_modeling/bart_pretrain_results


L2_Megatron_T5_PEFT_Lora_TP2:
needs: [cicd-test-container-setup]
uses: ./.github/workflows/_test_template.yml
Expand Down Expand Up @@ -4236,6 +4115,57 @@ jobs:
AFTER_SCRIPT: |
rm -rf /home/TestData/nlp/t5_lora_tuning_tp2

L2_Megatron_Core_T5_PEFT_Lora_TP2:
needs: [cicd-test-container-setup]
uses: ./.github/workflows/_test_template.yml
with:
RUNNER: self-hosted-azure
SCRIPT: |
rm -rf /home/TestData/nlp/mcore_t5_lora_tuning_tp2

NVTE_FUSED_ATTN=0 NVTE_FLASH_ATTN=0 python examples/nlp/language_modeling/tuning/megatron_t5_finetuning.py \
trainer.devices=2 \
trainer.log_every_n_steps=1 \
trainer.max_epochs=9999 \
trainer.max_steps=3 \
trainer.val_check_interval=3 \
++trainer.limit_val_batches=2 \
trainer.precision=16 \
exp_manager.exp_dir=/home/TestData/nlp/mcore_t5_lora_tuning_tp2 \
model.pipeline_model_parallel_size=1 \
model.tensor_model_parallel_size=2 \
model.restore_from_path=/home/TestData/nlp/megatron_t5/220m/megatron_mcore_t5_220m.nemo \
model.peft.peft_scheme=lora \
model.answer_only_loss=True \
model.micro_batch_size=1 \
model.global_batch_size=1 \
model.data.train_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl] \
model.data.train_ds.concat_sampling_probabilities=[1.0] \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.validation_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel.jsonl] \
model.data.validation_ds.names=[quarel]

NVTE_FUSED_ATTN=0 NVTE_FLASH_ATTN=0 python examples/nlp/language_modeling/tuning/megatron_t5_generate.py \
model.restore_from_path=/home/TestData/nlp/megatron_t5/220m/megatron_mcore_t5_220m.nemo \
model.peft.restore_from_path=/home/TestData/nlp/mcore_t5_lora_tuning_tp2/megatron_t5_peft_lora_tuning/checkpoints/megatron_t5_peft_lora_tuning.nemo \
model.peft.restore_from_ckpt_name=null \
model.peft.restore_from_hparams_path=null \
model.tensor_model_parallel_size=2 \
trainer.devices=2 \
model.data.test_ds.file_names=[/home/TestData/nlp/megatron_sft/quarel_4.jsonl] \
model.data.test_ds.names=[quarel4] \
model.global_batch_size=1 \
model.micro_batch_size=1 \
model.data.test_ds.tokens_to_generate=10 \
model.data.test_ds.write_predictions_to_file=True \
model.data.test_ds.output_file_path_prefix=/home/TestData/nlp/mcore_t5_lora_tuning_tp2/out \
inference.greedy=True \
inference.repetition_penalty=1.0 \
inference.outfile_path=/home/TestData/nlp/mcore_t5_lora_tuning_tp2/out.jsonl
AFTER_SCRIPT: |
rm -rf /home/TestData/nlp/mcore_t5_lora_tuning_tp2

# L2: Megatron Mock Data Generation
L2_Megatron_Mock_Data_Generation_MockGPTDataset:
needs: [cicd-test-container-setup]
Expand Down Expand Up @@ -4605,16 +4535,17 @@ jobs:
- L2_Megatron_Change_Partitions_Reduce_TP_Num_Partitions_-2_to_1-_and_PP_Num_Partitions_-1_to_2
- L2_Megatron_Change_Partitions_Increase_TP_Num_Partitions_-2_to_4-_and_PP_Num_Partitions_-1_to_2
- L2_Megatron_T5_Pretraining_and_Resume_Training_TP2
- L2_Megatron_Core_T5_Pretraining_and_Resume_Training_TP2
- L2_Megatron_T5_with_ALiBi_Pretraining_and_Resume_Training_TP2
- L2_Megatron_T5_with_KERPLE_Pretraining_and_Resume_Training_TP2
- L2_Megatron_T5_Pretraining_and_Resume_Training_PP2
- L2_Megatron_T5_w_Mixture_of_Expert_Pretraining
- L2_Megatron_UL2_Pretraining_and_Resume_Training_TP2
- L2_Megatron_T5_Eval
- L2_Megatron_Core_T5_Eval
- L2_Megatron_BART_Pretraining_and_Resume_Training_TP2
- L2_Megatron_BART_Pretraining_and_Resume_Training_PP2
- L2_Megatron_T5_PEFT_Lora_TP2
- L2_Megatron_Core_T5_PEFT_Lora_TP2
- L2_Megatron_Mock_Data_Generation_MockGPTDataset
- L2_Megatron_Mock_Data_Generation_MockT5Dataset
- L2_TTS_Fast_dev_runs_1_Tacotron_2
Expand Down
2 changes: 1 addition & 1 deletion Dockerfile.ci
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ WORKDIR /workspace
# Install NeMo requirements
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG MODELOPT_VERSION=0.13.0
ARG MCORE_TAG=0bc3547702464501feefeb5523b7a17e591b21fa
ARG MCORE_TAG=c7a1f82d761577e6ca0338d3521eac82f2aa0904
ARG APEX_TAG=810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c
RUN \
--mount=type=bind,source=requirements,target=requirements \
Expand Down
Loading
Loading