Commit

feat: Upgrading TRTLLM to v13 (#320)
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: NeMo-Aligner CI <nemo-aligner-ci@nvidia.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Oct 31, 2024
1 parent 77af2a8 commit c86c63c
Showing 39 changed files with 1,337 additions and 346 deletions.
19 changes: 11 additions & 8 deletions .github/workflows/cicd-main.yml
@@ -57,16 +57,22 @@ jobs:
uses: ./.github/workflows/_build_container.yml

Unit_Tests:
name: ${{ matrix.test_case }}
needs: [build-container, pre-flight]
uses: ./.github/workflows/_run_test.yml
if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'unit') || needs.pre-flight.outputs.all == 'true'
strategy:
matrix:
test_case:
- run_unit.sh
- run_mpi_unit.sh
with:
RUNNER: self-hosted-azure
TIMEOUT: 10
SCRIPT: |
nvidia-smi
cd ${ALIGNER_REPO_DIR}
- bash tests/run_unit.sh
+ bash tests/${{ matrix.test_case }}
Functional_Tests:
name: ${{ matrix.test_case }}
@@ -76,15 +82,12 @@ jobs:
strategy:
matrix:
test_case:
#- ppo-pp-llama3
- ppo-llama3-pp2-reshard
- dpo-llama3

with:
RUNNER: self-hosted-azure
# Fairly aggressive timeout that all functional tests should try to adhere to
- TIMEOUT: 10
+ TIMEOUT: 8
SCRIPT: |
export PYTHONPATH=${ALIGNER_REPO_DIR}:${PYTHONPATH:-}
nvidia-smi
git config --global --add safe.directory ${ALIGNER_REPO_DIR}
cd ${ALIGNER_REPO_DIR}
bash tests/functional/test_cases/${{ matrix.test_case }}.sh
bash /opt/NeMo-Aligner/tests/functional/test_cases/${{ matrix.test_case }}
29 changes: 28 additions & 1 deletion CHANGELOG.md
@@ -11,16 +11,42 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
### Breaking Changes
### Bug Fixes
### Deprecation Notices
-->

## [Next Version]

### New Features and Optimizations
- Added support for Megatron’s distributed optimizer, which can be configured using `++model.optim.name=mcore_distributed_optim`.
- Introduced `ScopedTimer` as a successor to `SyncedTimer`. `SyncedTimer` is marked for deprecation and will be removed in the next version.
```python
from nemo_aligner.utils.distributed import ScopedTimer
timer = ScopedTimer()

# All durations are logged in the timer
with timer("step_time"):
    with timer("fwd"):
        model.fwd()
    with timer("bwd"):
        model.bwd()

# Consume all durations and reset internal store
durations = timer.consume_durations()
```
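To make the nesting behaviour in the changelog example concrete, here is a minimal sketch of how a timer with this call pattern could be built on `contextlib`. It is not NeMo-Aligner's implementation: the class name `SketchScopedTimer` is hypothetical, the sketch is single-process only, and it omits any cross-rank synchronisation the real `ScopedTimer` may perform. Only the `timer(name)` and `consume_durations()` call shapes are taken from the changelog entry.

```python
import time
from contextlib import contextmanager


class SketchScopedTimer:
    """Hypothetical stand-in for ScopedTimer (assumption: single process, no reduction)."""

    def __init__(self):
        self._durations = {}

    @contextmanager
    def __call__(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed wall-clock seconds even if the timed body raises.
            self._durations[name] = time.perf_counter() - start

    def consume_durations(self):
        # Return everything recorded so far and reset the internal store.
        durations, self._durations = self._durations, {}
        return durations


timer = SketchScopedTimer()
with timer("step_time"):
    with timer("fwd"):
        time.sleep(0.001)  # placeholder for a forward pass
    with timer("bwd"):
        time.sleep(0.001)  # placeholder for a backward pass

durations = timer.consume_durations()
print(sorted(durations))
```

Because each scope records under its own name, the outer `step_time` entry covers both inner scopes, and a second call to `consume_durations()` returns an empty dictionary.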

### Breaking Changes
- Upgrade TRTLLM dependency from v0.10.0 to v0.13.0 and migrate from the `GPTSession` C++ runtime to the `ModelRunner` Python runtime. Please use the latest Dockerfile.
- Using the latest TransformerEngine versions may require `++model.dist_ckpt_load_strictness=log_all` when loading an older, pre-existing checkpoint to avoid errors.
- NeMo-Aligner now requires Megatron-LM==0.9.0 for the APIs to calculate the microbatch sizes (API introduced `megatron.core.num_microbatches_calculator.reconfigure_num_microbatch_calculator`).
- NeMo-Aligner now requires a version of NeMo with this change to how the MoE spec is handled: https://github.com/NVIDIA/NeMo/pull/9035 .

### Bug Fixes
- For stability, scripts that launch the PPO training loop must now set `export NCCL_ALGO=...`. See the [RLHF docs](./docs/user-guide/rlhf.rst) for details.

### Deprecation Notices
- `SyncedTimer` is marked for deprecation and will be removed in `0.7.0`. Please switch to `ScopedTimer`.
- `broadcast_2d_tensor` and `broadcast_2d_tensor_within_pp` are marked for deprecation and will be removed in `0.7.0`. Please switch to `broadcast_tensor` and `broadcast_tensor_within_pp`.

## NVIDIA NeMo-Aligner 0.5.0

@@ -32,6 +58,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)

### Bug Fixes
- Change `log_prob_forward_micro_batch_size` in DPO to mean the same as `micro_batch_size`: the number of samples (chosen and rejected included) processed at once.
- PPO TensorRT-LLM acceleration no longer errors when using a tokenizer without a `pad_id`, such as the Llama 3 and Llama 3.1 tokenizers from Hugging Face.

## NVIDIA NeMo-Aligner 0.4.0
- Implement reward-aware preference optimization.
@@ -51,7 +78,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
### Breaking Changes
- `inference.micro_batch_size` is now renamed to `inference.inference_micro_batch_size` when running reward model inference in `inference_rm.yaml`. This is to stay consistent with the naming scheme of the PPO critic.
- It is no longer possible to specify `add_EOS` when running reward model or critic inference.
- - NeMo-Aligner now requires Megatron-LM>=0.8.0 for the APIs to calculate the microbatch sizes.
+ - NeMo-Aligner now requires Megatron-LM==0.8.0 for the APIs to calculate the microbatch sizes (API introduced `megatron.core.num_microbatches_calculator.reconfigure_microbatch_calculator`).

### Bug Fixes
- Make `num_workers` for dataloaders 0 by default. This prevents issues when using MPI (with TRT-LLM) or more sophisticated launchers.
96 changes: 57 additions & 39 deletions Dockerfile
@@ -4,7 +4,7 @@
#
# To update NeMo-Aligner from a pre-built NeMo-Framework container:
#
- # docker buildx build --target=aligner-bump --build-arg=BASE_IMAGE=nvcr.io/nvidia/nemo:24.07 -t aligner:latest .
+ # docker buildx build --target=aligner-bump -t aligner:latest .
#

# Number of parallel threads for compute heavy build jobs
@@ -13,13 +13,12 @@ ARG MAX_JOBS=8
# Git refs for dependencies
ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
ARG PYTRITON_VERSION=0.5.10
- ARG NEMO_TAG=e033481e26e6ae32764d3e2b3f16afed00dc7218 # On: r2.0.0rc1
- ARG MLM_TAG=a3fe0c75df82218901fa2c3a7c9e389aa5f53182 # On: core_r0.8.0
+ ARG NEMO_TAG=19668e5320a2e2af0199b6d5e0b841993be3a634 # On: main
+ ARG MLM_TAG=25059d3bbf68be0751800f3644731df12a88f3f3 # On: main
ARG ALIGNER_COMMIT=main
- ARG TRTLLM_VERSION=v0.10.0
+ ARG TRTLLM_VERSION=v0.13.0
ARG PROTOBUF_VERSION=4.24.4

- ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3
+ ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.07-py3

FROM ${BASE_IMAGE} AS aligner-bump
ARG ALIGNER_COMMIT
@@ -36,13 +35,40 @@ git checkout -f $ALIGNER_COMMIT
# case 2: ALIGNER_COMMIT is a commit, so git-pull is expected to fail
git pull --rebase || true

- pip install --no-deps -e .
+ pip install --no-cache-dir --no-deps -e .
EOF

FROM ${BASE_IMAGE} as final
WORKDIR /opt
# needed in case git complains that it can't detect a valid email, this email is fake but works
RUN git config --global user.email "worker@nvidia.com"
# install latest apex
ARG APEX_TAG
RUN pip uninstall -y apex && \
git clone https://github.com/NVIDIA/apex && \
cd apex && \
if [ ! -z $APEX_TAG ]; then \
git fetch origin $APEX_TAG && \
git checkout FETCH_HEAD; \
fi && \
pip install -v --no-build-isolation --disable-pip-version-check --no-cache-dir --config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam" ./

# Git LFS
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
apt-get install git-lfs && \
git lfs install && \
apt-get clean

# TRTLLM
ARG TRTLLM_VERSION
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && \
cd TensorRT-LLM && \
git checkout ${TRTLLM_VERSION} && \
. docker/common/install_tensorrt.sh && \
python3 ./scripts/build_wheel.py --job_count $(nproc) --trt_root /usr/local/tensorrt --python_bindings --benchmarks && \
pip install -e .
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12/compat/lib.real/

# install TransformerEngine
ARG MAX_JOBS
ARG TE_TAG
@@ -56,17 +82,6 @@ RUN pip uninstall -y transformer-engine && \
git submodule init && git submodule update && \
NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .

# install latest apex
ARG APEX_TAG
RUN pip uninstall -y apex && \
git clone https://github.com/NVIDIA/apex && \
cd apex && \
if [ ! -z $APEX_TAG ]; then \
git fetch origin $APEX_TAG && \
git checkout FETCH_HEAD; \
fi && \
pip install -v --no-build-isolation --disable-pip-version-check --no-cache-dir --config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam" ./

# place any util pkgs here
ARG PYTRITON_VERSION
RUN pip install --upgrade-strategy only-if-needed nvidia-pytriton==$PYTRITON_VERSION
@@ -99,29 +114,32 @@ RUN pip uninstall -y megatron-core && \
fi && \
pip install -e .

# Git LFS
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
apt-get install git-lfs && \
git lfs install

COPY --from=aligner-bump /opt/NeMo-Aligner /opt/NeMo-Aligner
RUN cd /opt/NeMo-Aligner && \
pip install --no-deps -e .

# TRTLLM
ARG TRTLLM_VERSION
RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && \
cd TensorRT-LLM && \
git checkout ${TRTLLM_VERSION} && \
patch -p1 < ../NeMo-Aligner/setup/trtllm.patch && \
. docker/common/install_tensorrt.sh && \
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt

RUN cd TensorRT-LLM && \
pip install ./build/tensorrt_llm*.whl
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12/compat/lib.real/
RUN cd TensorRT-LLM && patch -p1 < ../NeMo-Aligner/setup/trtllm.patch

# WAR(0.4.0): The pin of NeMo requires a higher nvidia-modelopt version than
# TRT-LLM allows. This installation must follow TRT-LLM and is
# only necessary when NeMo 2.0.0rc1 is installed with TRT-LLM v10.
RUN pip install --upgrade-strategy only-if-needed nvidia-modelopt==0.13.0
# TODO(terryk): This layer should be deleted ASAP after NeMo is bumped to include all of these PRs
RUN <<"EOF" bash -exu
cd NeMo
# Ensures we don't cherry-pick "future" origin/main commits
git fetch -a
# 0c92fe17df4642ffc33d5d8c0c83fda729e3910c: [fix] Ensures disabling exp_manager with exp_manager=null does not error NeMo#10651
# 60e677423667c029dd05875da72bf0719774f844: [feat] Update get_model_parallel_src_rank to support tp-pp-dp ordering NeMo#10652
# 0deaf6716cb4f20766c995ce25d129795f1ae200: fix[export]: update API for disabling device reassignment in TRTLLM for Aligner NeMo#10863
# (superseded by 10863) 148543d6e9c66ff1f8562e84484448202249811d: feat: Migrate GPTSession refit path in Nemo export to ModelRunner for Aligner NeMo#10654
for pr_and_commit in \
"10651 0c92fe17df4642ffc33d5d8c0c83fda729e3910c" \
"10652 60e677423667c029dd05875da72bf0719774f844" \
"10863 0deaf6716cb4f20766c995ce25d129795f1ae200" \
; do
pr=$(cut -f1 -d' ' <<<"$pr_and_commit")
head_pr_commit=$(cut -f2 -d' ' <<<"$pr_and_commit")
git fetch origin $head_pr_commit:PR-${pr}
# cherry-picks all commits between main and the top of the PR
git cherry-pick --allow-empty $(git merge-base origin/main PR-${pr})..PR-${pr}
# Tag cherry-picks to help
git tag cherry-pick-PR-${pr}
done
EOF
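The loop in this layer pairs each NeMo PR number with the head commit of that PR, fetches the commit into a local `PR-<n>` branch, cherry-picks every commit between the merge base with `origin/main` and that branch, then tags the result. A dry-run sketch in Python of the commands issued per pair may make the flow easier to follow. The helper name `cherry_pick_commands` is hypothetical, and `<merge-base>` is a placeholder for the output of `git merge-base origin/main PR-<n>`, which depends on repository state and is not computed here.

```python
def cherry_pick_commands(pr_and_commit):
    """Return the git commands the Dockerfile loop runs for one 'PR COMMIT' pair."""
    # Mirrors `cut -f1 -d' '` and `cut -f2 -d' '` in the shell loop.
    pr, head_pr_commit = pr_and_commit.split(" ")
    branch = f"PR-{pr}"
    return [
        f"git fetch origin {head_pr_commit}:{branch}",
        # <merge-base> stands in for $(git merge-base origin/main PR-<n>).
        f"git cherry-pick --allow-empty <merge-base>..{branch}",
        f"git tag cherry-pick-{branch}",
    ]


pairs = [
    "10651 0c92fe17df4642ffc33d5d8c0c83fda729e3910c",
    "10652 60e677423667c029dd05875da72bf0719774f844",
    "10863 0deaf6716cb4f20766c995ce25d129795f1ae200",
]

for pair in pairs:
    for command in cherry_pick_commands(pair):
        print(command)
```

The sketch only builds command strings; it does not run git, so it is safe to execute outside a NeMo checkout.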
2 changes: 1 addition & 1 deletion docs/user-guide/dpo.rst
@@ -181,7 +181,7 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond
++model.dpo.ref_policy_kl_penalty=0.1
EOF
- srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+ srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x
The default DPO training tunes all parameters. To use LoRA, we can set ``model.peft.peft_scheme=lora`` and use different parameters in ``model.peft.lora_tuning``. Please check the parameters in `the config file <https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/gpt/conf/gpt_dpo.yaml>`__.
2 changes: 1 addition & 1 deletion docs/user-guide/draftp.rst
@@ -162,7 +162,7 @@ To launch reward model training, you must have checkpoints for `UNet <https://hu
exp_manager.wandb_logger_kwargs.project=${PROJECT}
EOF
- srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+ srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x
11 changes: 8 additions & 3 deletions docs/user-guide/rlhf.rst
@@ -136,7 +136,7 @@ To launch reward model training, you must start with a pretrained or SFT-trained
exp_manager.wandb_logger_kwargs.project=${PROJECT}
EOF
- srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+ srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x
@@ -252,6 +252,11 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH
#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 8 -A <<ACCOUNT>> -p <<PARTITION>> --job-name <<JOBNAME>> -t 4:00:00 --exclusive
+ # To ensure determinism when calculating log probabilities between two forward-passes with identical weights, it is strongly
+ # recommended to set NCCL_ALGO. See https://github.com/NVIDIA/Megatron-LM/blob/b3375a0e38c10e2300ef4be031f7dcabab52b448/megatron/training/arguments.py#L593-L595
+ # for options.
+ export NCCL_ALGO=Tree
NAME="2p_ppo"
# PARAMETERS
@@ -300,7 +305,7 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH
pretrained_checkpoint.restore_from_path=${RM_NEMO_FILE}
EOF
- srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
+ srun --no-container-mount-home --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
sleep 30
@@ -351,7 +356,7 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH
remote_critic_rm.critic.port=${CRITIC_PORT}
EOF
- srun --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_ppo}" &
+ srun --no-container-mount-home --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_ppo}" &
wait
4 changes: 2 additions & 2 deletions docs/user-guide/rs.rst
@@ -157,7 +157,7 @@ You can use slurm to launch the 2 jobs and get them to coordinate together in a
inference.port=${CRITIC_PORT}
EOF
- srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
+ srun --no-container-mount-home --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
sleep 30
@@ -213,7 +213,7 @@ You can use slurm to launch the 2 jobs and get them to coordinate together in a
model.rs.top_n_rollouts=1
EOF
- srun --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_rs}" &
+ srun --no-container-mount-home --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_rs}" &
wait
4 changes: 2 additions & 2 deletions docs/user-guide/sft.rst
@@ -220,7 +220,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner.
exp_manager.checkpoint_callback_params.monitor=validation_loss
EOF
- srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+ srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x
If using sequence packing, replace the data paths with the paths to your packed datasets. For each packed dataset, you should also set ``packed_sequence=True`` in the config:
@@ -379,7 +379,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. Compare
exp_manager.checkpoint_callback_params.monitor=validation_loss
EOF
- srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+ srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x
2 changes: 1 addition & 1 deletion docs/user-guide/spin.rst
@@ -162,7 +162,7 @@ For the below parameters, the ``model.spin.ref_policy_kl_penalty`` corresponds t
model.data.train_ds.max_seq_length=4096
EOF
- srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+ srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
set +x
During SPIN training, several metrics are recorded to WandB for monitoring, chiefly acc (the percentage of samples for which the model's implicit reward for the ground-truth response exceeds that for the response generated by the reference policy).
2 changes: 1 addition & 1 deletion examples/mm/stable_diffusion/train_sd_draftp.py
@@ -120,7 +120,7 @@ def main(cfg) -> None:
ptl_model.reward_model = reward_model

ckpt_callback = add_custom_checkpoint_callback(trainer, ptl_model)
- timer = Timer(cfg.exp_manager.get("max_time_per_run", "0:12:00:00"))
+ timer = Timer(cfg.exp_manager.get("max_time_per_run") if cfg.exp_manager else None)

draft_p_trainer = SupervisedTrainer(
cfg=cfg.trainer.draftp_sd,
2 changes: 1 addition & 1 deletion examples/mm/stable_diffusion/train_sdxl_draftp.py
@@ -243,7 +243,7 @@ def checkpoint_check_fn(module):
torch.distributed.barrier()

ckpt_callback = add_custom_checkpoint_callback(trainer, ptl_model)
- timer = Timer(cfg.exp_manager.get("max_time_per_run", "0:24:00:00"))
+ timer = Timer(cfg.exp_manager.get("max_time_per_run") if cfg.exp_manager else None)

draft_p_trainer = SupervisedTrainer(
cfg=cfg.trainer.draftp_sd,
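
The `Timer` change in both DRaFT+ scripts appears to guard against `exp_manager` being disabled (for example with `exp_manager=null`), in which case calling `.get` on it would fail. It also drops the hard-coded fallback duration, so `None` is passed when no limit is configured. A minimal sketch of the same guard follows, using plain dictionaries in place of the OmegaConf config object to keep the example self-contained; that substitution and the helper name `resolve_max_time` are assumptions for illustration.

```python
def resolve_max_time(cfg):
    """Return the configured max_time_per_run, or None if exp_manager is absent or disabled."""
    exp_manager = cfg.get("exp_manager")
    # Same shape as the guard in the diff: only call .get when exp_manager is set.
    return exp_manager.get("max_time_per_run") if exp_manager else None


configured = resolve_max_time({"exp_manager": {"max_time_per_run": "0:12:00:00"}})
unset = resolve_max_time({"exp_manager": {}})
disabled = resolve_max_time({"exp_manager": None})
print(configured, unset, disabled)
```

How `Timer` treats a `None` argument is not shown in this diff, so the sketch stops at resolving the value.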