[misc] Layerwise profile updates #10242
Conversation
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@LucasWilkinson I made some updates to layerwise-profile. Can you please take a look? Thanks!
examples/offline_profile.py (outdated)

```python
parser.add_argument("--min-output-len",
                    type=int,
                    default=OUTPUT_LEN_DEFAULT,
                    help="Minimum output length of the requests")
```
Could we maybe keep an `--output-len` option that's mutually exclusive with `--max-output-len` and `--min-output-len`? It seems a bit cumbersome, if I want all requests to have an output-len of 8, to have to do `--max-output-len 8 --min-output-len 8`.
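For reference, a minimal sketch of the suggested option layout, assuming plain argparse (this is illustrative only, not the PR's code). Since `--output-len` conflicts with a *pair* of flags, `add_mutually_exclusive_group` cannot express it directly, so the check is manual:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--output-len", type=int,
                    help="Fixed output length for all requests")
parser.add_argument("--max-output-len", type=int,
                    help="Maximum output length of the requests")
parser.add_argument("--min-output-len", type=int,
                    help="Minimum output length of the requests")
args = parser.parse_args()

# argparse's mutually-exclusive groups only handle pairwise exclusion
# within one group, so "--output-len vs. the min/max pair" is enforced
# by hand.
if args.output_len is not None and (args.max_output_len is not None
                                    or args.min_output_len is not None):
    parser.error("--output-len is mutually exclusive with "
                 "--max-output-len/--min-output-len")
```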
Hey. I have updated the CLI args to pass in `num_steps` directly. I believe it captures the intent better. PTAL.
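A rough sketch of what the subcommand-based CLI could look like. The `--num-steps` flag name here is an assumption; only the subcommand names and `--complete-num-requests-per-step` appear in this thread, so treat this as illustrative rather than the PR's exact interface:

```python
import argparse

parser = argparse.ArgumentParser(description="offline profiling")
subparsers = parser.add_subparsers(dest="cmd", required=True)

# Profile exactly n engine steps.
num_steps = subparsers.add_parser("run_num_steps")
num_steps.add_argument("--num-steps", type=int, required=True,
                       help="Number of engine steps to profile")

# Run every request to completion, retiring a fixed number per decode step.
to_completion = subparsers.add_parser("run_to_completion")
to_completion.add_argument("--complete-num-requests-per-step", type=int,
                           required=True,
                           help="Number of requests the engine should "
                                "complete each decode step")

args = parser.parse_args(["run_num_steps", "--num-steps", "8"])
print(args.cmd, args.num_steps)  # -> run_num_steps 8
```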
```
@@ -151,16 +151,18 @@ def is_quant(op_name: str):
         "scaled_int8_quant" in op_name:
        return True


def is_cutlass_gemm_op(op_name: str):
    return "void cutlass::Kernel" in op_name or \
        "void cutlass::device_kernel" in op_name
```
Maybe we should check for `gemm` in the name too? Not that we use CUTLASS for anything else right now, but it might prevent future confusion if we use CUTLASS for convolution or something.
Yeah, you are right. I'll add it.
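A sketch of the agreed change, reusing the kernel-name substrings from the diff above; requiring `gemm` in the lowercased op name keeps future non-GEMM CUTLASS kernels (e.g. convolutions) from being misclassified:

```python
def is_cutlass_gemm_op(op_name: str) -> bool:
    # Require both a CUTLASS kernel-entry marker and "gemm" in the name,
    # so CUTLASS convolutions etc. are not counted as GEMMs.
    return "gemm" in op_name.lower() and \
        ("void cutlass::Kernel" in op_name or
         "void cutlass::device_kernel" in op_name)
```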
Left a few comments, otherwise LGTM. Thanks, these seem like nice improvements!
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Hi @LucasWilkinson - I've made some non-trivial changes. PTAL, thanks!
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
examples/offline_profile.py (outdated)

```python
    output lengths of the requests such that step_request is honoured.

    Example:
        if batch size = 32 and step_request = [128, 128, 96, 64, 32, 1]
```
I'm confused by the "batch size = 32": when step_request[0] = 128, this doesn't seem to align with the `--complete-num-requests-per-step` comment, where it is batch size 128.
My mistake. It should be "batch_size = 128" ... fixed now 👍
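To make the `step_request` semantics concrete, here is a minimal sketch (not the PR's actual implementation) of deriving per-request output lengths from such a schedule, assuming `step_requests[i]` is the number of requests the engine runs at step `i`:

```python
from collections import Counter

def determine_output_lens(step_requests: list[int]) -> list[int]:
    output_lens: list[int] = []
    for step, running in enumerate(step_requests):
        running_next = (step_requests[step + 1]
                        if step + 1 < len(step_requests) else 0)
        # Requests that stop after this step were alive for step + 1
        # engine steps, so they produce step + 1 output tokens.
        output_lens.extend([step + 1] * (running - running_next))
    return output_lens

# batch_size = 128: 32 requests finish at each of steps 2-4, 31 at
# step 5, and the last request at step 6.
print(Counter(determine_output_lens([128, 128, 96, 64, 32, 1])))
# Counter({2: 32, 3: 32, 4: 32, 5: 31, 6: 1})
```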
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Layerwise profile - changes and updates:

- Remove `--output-length` from the offline_profile.py CLI args. Add `run_num_steps` and `run_to_completion` subcommands in its place.
  - `run_num_steps` captures the user intent more clearly than `--output-length`, i.e. "profile `n` engine steps".
  - `run_to_completion` lets the user specify the number of requests the engine should complete every decode step. This provides layer-wise profile information for a range of batch sizes.

Examples:

`run_num_steps` sub-command:
- command: [not captured in this rendering]
- Graph: [image not captured]

`run_to_completion` sub-command:
- command: [not captured in this rendering]
- Graphs: [images not captured]