
[misc] Layerwise profile updates #10242

Open
wants to merge 8 commits into base: main

Conversation

Contributor

@varun-sundar-rabindranath commented Nov 12, 2024

Layerwise profile - changes and updates:

  1. Add the ability to profile the scenario where the engine is processing requests of different output lengths.
    • Remove --output-length from the offline_profile.py CLI args.
    • Add sub-commands run_num_steps and run_to_completion in its place (a CLI sketch follows this list).
    • run_num_steps captures the user intent more clearly than --output-length, i.e. "profile n engine steps".
    • run_to_completion lets the user specify the number of requests the engine should complete at every decode step. This provides layer-wise profile information for a range of batch sizes.
  2. Split gemm_ops into cutlass-gemm-ops and gemm-ops.
  3. Fix the ProfileContext read in the visualization code.
  4. Bug fix: incorrect use of rstrip().
  5. Update layerwise_profile and the visualization code to store and display metadata.
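
A minimal sketch of how the two sub-commands could be wired with argparse sub-parsers; the defaults and long option names here (e.g. --num-steps) are assumptions for illustration, not necessarily the exact flags in examples/offline_profile.py:

    import argparse

    parser = argparse.ArgumentParser("offline_profile.py")
    parser.add_argument("--model", type=str, required=True)
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--prompt-len", type=int, default=64)

    subparsers = parser.add_subparsers(dest="cmd", required=True)

    # Profile exactly n engine steps.
    run_num_steps = subparsers.add_parser("run_num_steps")
    run_num_steps.add_argument("-n", "--num-steps", type=int, default=1,
                               help="Number of engine steps to profile")

    # Run until all requests finish, completing a fixed number of requests
    # at every decode step so the profile sweeps over batch sizes.
    run_to_completion = subparsers.add_parser("run_to_completion")
    run_to_completion.add_argument("-n", "--complete-num-requests-per-step",
                                   type=int, default=1,
                                   help="Requests to complete per decode step")

    args = parser.parse_args()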

Examples:
run_num_steps sub-command:

model=neuralmagic/Meta-Llama-3-8B-Instruct-FP8
python3 examples/offline_profile.py --model ${model} --batch-size 128 --prompt-len 64 --json llama-8b-fp8-num-steps --csv llama-8b-fp8-num-steps --enforce-eager run_num_steps -n 5

Graph command:

python3 tools/profiler/visualize_layerwise_profile.py --json-trace llama-8b-fp8-num-steps.json --output-directory profile-breakdown-num-steps --level kernel --plot-metric cuda_time_ms --step-plot-interval 1

(plots attached: prefill, decode_steps)

run_to_completion sub-command:

model=neuralmagic/Meta-Llama-3-8B-Instruct-FP8
python3 examples/offline_profile.py --model ${model} --batch-size 128 --prompt-len 64 --json llama-8b-fp8-to-completion --csv llama-8b-fp8-to-completion --enforce-eager run_to_completion -n 16

Graphs

python3 tools/profiler/visualize_layerwise_profile.py --json-trace llama-8b-fp8-to-completion.json --output-directory profile-breakdown-to-completion --level kernel --plot-metric cuda_time_ms --step-plot-interval 1

(plots attached: prefill, decode_steps)

Varun Sundar Rabindranath added 4 commits November 12, 2024 03:07

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@varun-sundar-rabindranath
Contributor Author

@LucasWilkinson I made some updates to layerwise-profile. Can you please take a look? Thanks!

parser.add_argument("--min-output-len",
                    type=int,
                    default=OUTPUT_LEN_DEFAULT,
                    help="Minimum output length of the requests")
Contributor


Could we maybe keep an --output-len option that's mutually exclusive with --max-output-len and --min-output-len? It seems a bit cumbersome, if I want all requests to have an output length of 8, to have to do --max-output-len 8 --min-output-len 8.
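
A minimal sketch of what the suggested option handling could look like (illustrative only; this exact validation is not part of the PR, which later switched to a num_steps-based CLI):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--output-len", type=int, default=None,
                        help="Fixed output length for all requests")
    parser.add_argument("--min-output-len", type=int, default=None,
                        help="Minimum output length of the requests")
    parser.add_argument("--max-output-len", type=int, default=None,
                        help="Maximum output length of the requests")
    args = parser.parse_args()

    # --output-len is shorthand for setting min and max to the same value,
    # so reject combining it with the explicit min/max pair.
    if args.output_len is not None and (args.min_output_len is not None
                                        or args.max_output_len is not None):
        parser.error("--output-len is mutually exclusive with "
                     "--min-output-len/--max-output-len")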

Contributor Author


Hey, I have updated the CLI args to pass in num_steps directly. I believe it captures the intent better. PTAL.

@@ -151,16 +151,18 @@ def is_quant(op_name: str):
            "scaled_int8_quant" in op_name:
        return True

def is_cutlass_gemm_op(op_name: str):
    return "void cutlass::Kernel" in op_name or \
           "void cutlass::device_kernel" in op_name
Contributor


Maybe we should check for "gemm" in the name too? Not that we use cutlass for anything else right now, but it might prevent future confusion if we use cutlass for convolution or something.

Contributor Author


Yeah, you are right. I'll add it.
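
A rough sketch of the suggestion (the actual change in the PR may differ; the lowercase "gemm" substring check is an assumption):

    def is_cutlass_gemm_op(op_name: str):
        # Also require "gemm" in the kernel name so that future cutlass
        # kernels (e.g. convolutions) are not misclassified as GEMMs.
        return "gemm" in op_name.lower() and \
            ("void cutlass::Kernel" in op_name or
             "void cutlass::device_kernel" in op_name)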

@LucasWilkinson
Contributor

Left a few comments, otherwise LGTM. Thanks, these seem like nice improvements!

Varun Sundar Rabindranath added 2 commits November 25, 2024 16:17
@varun-sundar-rabindranath
Contributor Author

Hi @LucasWilkinson - I've made some non-trivial changes. PTAL, thanks!

output lengths of the requests such that step_request is honoured.

Example:
if batch size = 32 and step_request = [128, 128, 96, 64, 32, 1]
Contributor


I'm confused by the "batch size = 32" when step_request[0] = 128; this doesn't seem to align with the '--complete-num-requests-per-step' comment, where it is batch size 128.

Contributor Author


My mistake. It should be "batch_size = 128" ... fixed now 👍
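
For context, a minimal sketch of how per-request output lengths could be derived from such a step_request list so the schedule is honoured; the function name and exact semantics are illustrative, not the PR's actual helper:

    def determine_output_lens(batch_size: int, step_requests: list[int]) -> list[int]:
        # step_requests[i] is how many requests should still be running at
        # decode step i, so a request needs an output length equal to the
        # number of steps during which it remains in the active set.
        assert batch_size == step_requests[0]
        return [
            sum(1 for num_active in step_requests if request_idx < num_active)
            for request_idx in range(batch_size)
        ]

    # With batch_size = 128 and step_request = [128, 128, 96, 64, 32, 1],
    # request 0 runs for 6 decode steps while requests 96..127 run for only 2.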
