[misc] Layerwise profile updates #10242
Conversation
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
@LucasWilkinson I made some updates to layerwise-profile. Can you please take a look? Thanks!
examples/offline_profile.py (outdated)

```python
parser.add_argument("--min-output-len",
                    type=int,
                    default=OUTPUT_LEN_DEFAULT,
                    help="Minimum output length of the requests")
```
Could we maybe keep an `--output-len` option that's mutually exclusive with `--max-output-len` and `--min-output-len`? It seems a bit cumbersome, if I want all requests to have an output-len of 8, to have to do `--max-output-len 8 --min-output-len 8`.
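For reference, a minimal sketch of the suggested option layout, assuming plain argparse (this is illustrative only, not the PR's code). Since `--output-len` conflicts with a *pair* of flags, `add_mutually_exclusive_group` cannot express it directly, so the check is manual:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--output-len", type=int,
                    help="Fixed output length for all requests")
parser.add_argument("--max-output-len", type=int,
                    help="Maximum output length of the requests")
parser.add_argument("--min-output-len", type=int,
                    help="Minimum output length of the requests")
args = parser.parse_args()

# argparse's mutually-exclusive groups only handle pairwise exclusion
# within one group, so "--output-len vs. the min/max pair" is enforced
# by hand.
if args.output_len is not None and (args.max_output_len is not None
                                    or args.min_output_len is not None):
    parser.error("--output-len is mutually exclusive with "
                 "--max-output-len/--min-output-len")
```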
Hey. I have updated the CLI args to pass in `num_steps` directly. I believe it captures the intent better. PTAL.
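A rough sketch of what the subcommand-based CLI could look like. The `--num-steps` flag name here is an assumption; only the subcommand names and `--complete-num-requests-per-step` appear in this thread, so treat this as illustrative rather than the PR's exact interface:

```python
import argparse

parser = argparse.ArgumentParser(description="offline profiling")
subparsers = parser.add_subparsers(dest="cmd", required=True)

# Profile exactly n engine steps.
num_steps = subparsers.add_parser("run_num_steps")
num_steps.add_argument("--num-steps", type=int, required=True,
                       help="Number of engine steps to profile")

# Run every request to completion, retiring a fixed number per decode step.
to_completion = subparsers.add_parser("run_to_completion")
to_completion.add_argument("--complete-num-requests-per-step", type=int,
                           required=True,
                           help="Number of requests the engine should "
                                "complete each decode step")

args = parser.parse_args(["run_num_steps", "--num-steps", "8"])
print(args.cmd, args.num_steps)  # -> run_num_steps 8
```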
```
@@ -151,16 +151,18 @@ def is_quant(op_name: str):
         "scaled_int8_quant" in op_name:
        return True


def is_cutlass_gemm_op(op_name: str):
    return "void cutlass::Kernel" in op_name or \
        "void cutlass::device_kernel" in op_name
```
Maybe we should check for `gemm` in the name too? Not that we use CUTLASS for anything else right now, but it might prevent future confusion if we use CUTLASS for convolution or something.
Yeah, you are right. I'll add it.
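A sketch of the agreed change, reusing the kernel-name substrings from the diff above; requiring `gemm` in the lowercased op name keeps future non-GEMM CUTLASS kernels (e.g. convolutions) from being misclassified:

```python
def is_cutlass_gemm_op(op_name: str) -> bool:
    # Require both a CUTLASS kernel-entry marker and "gemm" in the name,
    # so CUTLASS convolutions etc. are not counted as GEMMs.
    return "gemm" in op_name.lower() and \
        ("void cutlass::Kernel" in op_name or
         "void cutlass::device_kernel" in op_name)
```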
Left a few comments, otherwise LGTM. Thanks, these seem like nice improvements!
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Hi @LucasWilkinson - I've made some non-trivial changes. PTAL, thanks!
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
examples/offline_profile.py (outdated)

```python
    output lengths of the requests such that step_request is honoured.

    Example:
        if batch size = 32 and step_request = [128, 128, 96, 64, 32, 1]
```
I'm confused by the "batch size = 32": when step_request[0] = 128, this doesn't seem to align with the `--complete-num-requests-per-step` comment, where it is batch size 128.
My mistake. It should be "batch_size = 128" ... fixed now 👍
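To make the `step_request` semantics concrete, here is a minimal sketch (not the PR's actual implementation) of deriving per-request output lengths from such a schedule, assuming `step_requests[i]` is the number of requests the engine runs at step `i`:

```python
from collections import Counter

def determine_output_lens(step_requests: list[int]) -> list[int]:
    output_lens: list[int] = []
    for step, running in enumerate(step_requests):
        running_next = (step_requests[step + 1]
                        if step + 1 < len(step_requests) else 0)
        # Requests that stop after this step were alive for step + 1
        # engine steps, so they produce step + 1 output tokens.
        output_lens.extend([step + 1] * (running - running_next))
    return output_lens

# batch_size = 128: 32 requests finish at each of steps 2-4, 31 at
# step 5, and the last request at step 6.
print(Counter(determine_output_lens([128, 128, 96, 64, 32, 1])))
# Counter({2: 32, 3: 32, 4: 32, 5: 31, 6: 1})
```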
Signed-off-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
Layerwise profile - changes and updates:

- Remove `--output-length` from the offline_profile.py CLI args. Add `run_num_steps` and `run_to_completion` subcommands in its place.
  - `run_num_steps` captures the user intent more clearly than `--output-length`, i.e. "profile `n` engine steps".
  - `run_to_completion` lets the user specify the number of requests the engine should complete every decode step. This provides layer-wise profile information for a range of batch sizes.

Examples:

`run_num_steps` sub-command:
- command: [not captured in this rendering]
- Graph: [image not captured]

`run_to_completion` sub-command:
- command: [not captured in this rendering]
- Graphs: [images not captured]