How does the profile_run work? #10110

Answered by andoorve
zhuangqh asked this question in Q&A
Nov 7, 2024 · 1 comments · 7 replies
I don't think I quite got your question. The flow is kind of like this:

  1. profile_run: doesn't allocate any permanent memory for the KV cache. It only allocates temporary activations, so that peak memory usage can be measured.
  2. KV cache allocation: allocates the difference between total memory and the peak memory measured during the profile run to the KV cache tensors. Note that this is what makes nvidia-smi show memory as almost maxed out. Also, if there aren't enough KV cache blocks, it will error at this point.
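
To make the arithmetic in step 2 concrete, here is a rough sketch of the sizing logic, not the actual vLLM code. The function name `kv_cache_blocks`, its parameters, and the `utilization` fraction are all hypothetical names chosen for illustration; the numbers in the usage lines are made up.

```python
def kv_cache_blocks(total_bytes: int, peak_bytes: int, block_bytes: int,
                    utilization: float = 1.0) -> int:
    """Return how many KV cache blocks fit in the memory left after profiling.

    total_bytes:  total device memory (hypothetical input)
    peak_bytes:   peak memory observed during the profile run
    block_bytes:  size of one KV cache block
    utilization:  assumed fraction of total memory the engine may use
    """
    # Memory left for the KV cache is the usable total minus the profiled peak.
    budget = int(total_bytes * utilization) - peak_bytes
    num_blocks = max(budget // block_bytes, 0)
    if num_blocks == 0:
        # Mirrors the "errors at this point" behaviour described above.
        raise ValueError("not enough memory left for any KV cache blocks")
    return num_blocks


GiB = 2**30
MiB = 2**20
# 80 GiB device, 20 GiB profiled peak, 2 MiB per block.
print(kv_cache_blocks(80 * GiB, 20 * GiB, 2 * MiB))
```

Because nearly all of the remaining budget is handed to the KV cache up front, the device looks almost full from the outside even when no requests are running.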

Answer selected by zhuangqh