prereqs
- [ ] figure out the interface for specifying the cache management algorithm (trie or base). Ideally we can even hot-swap it by specifying it in the incoming HTTP request (see the sketch after this list).
- [ ] refactor CI so the same model artifacts can be reused between Base and Trie
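A minimal sketch of what a swappable cache-manager interface could look like, assuming a Python server. The class names, methods, and the `cache` request field are assumptions for illustration, not the project's actual API; the point is that both algorithms implement the same prefix-matching interface so the request handler can pick one per incoming HTTP request.

```python
# Sketch only: class names, methods, and the "cache" request field are assumptions,
# not the project's actual API.
from abc import ABC, abstractmethod


class CacheManager(ABC):
    """Shared interface so Base and Trie can be selected per request."""

    @abstractmethod
    def match(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached (prefill that can be skipped)."""

    @abstractmethod
    def insert(self, tokens: list[int]) -> None:
        """Record tokens so later requests can reuse the cached prefix."""


class BaseCacheManager(CacheManager):
    """Stand-in for the base algorithm: remembers only the most recent sequence."""

    def __init__(self) -> None:
        self.last: list[int] = []

    def match(self, tokens: list[int]) -> int:
        n = 0
        for cached, new in zip(self.last, tokens):
            if cached != new:
                break
            n += 1
        return n

    def insert(self, tokens: list[int]) -> None:
        self.last = list(tokens)


class TrieCacheManager(CacheManager):
    """Trie over token ids: matches the longest cached prefix across all past sequences."""

    def __init__(self) -> None:
        self.root: dict[int, dict] = {}

    def match(self, tokens: list[int]) -> int:
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})


# One long-lived instance per algorithm; the handler picks by name,
# e.g. from a "cache" field in the incoming HTTP request body.
CACHE_MANAGERS: dict[str, CacheManager] = {
    "base": BaseCacheManager(),
    "trie": TrieCacheManager(),
}


def cache_manager_for_request(body: dict) -> CacheManager:
    return CACHE_MANAGERS[body.get("cache", "trie")]
```

Keeping the selection down to a single string also makes it easy to run the same test sequence against both algorithms back to back.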
tests needed
test models
- start with the toy llama model that Rob has
test sequences
- repeat the same prompt 100x, run on both Base and Trie. Trie should be close to 100x faster by skipping prefill; if I screwed up the cache matching, Trie would instead be slower.
- prompts forking at various locations (see the sketch after this list)
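A rough sketch of how these test sequences could be generated and timed, assuming a Python test harness. `send_request` and its `cache=` parameter are hypothetical stand-ins for whatever client and endpoint the harness ends up using.

```python
# Sketch only: `send_request` and its `cache=` parameter are hypothetical stand-ins
# for whatever client/endpoint the test harness ends up using.
import time


def repeated_prompt_sequence(prompt: str, n: int = 100) -> list[str]:
    """The 100x-identical case: Trie should skip prefill on every request after the first."""
    return [prompt] * n


def forking_prompt_sequences(prefix_tokens: list[str], forks: list[str]) -> list[str]:
    """Prompts that share a prefix but fork at various locations (early, middle, end)."""
    prompts = []
    for cut in (1, len(prefix_tokens) // 2, len(prefix_tokens)):
        shared = " ".join(prefix_tokens[:cut])
        for fork in forks:
            prompts.append(f"{shared} {fork}")
    return prompts


def run_case(prompts: list[str], cache: str, send_request) -> tuple[list[str], float]:
    """Send one test case against one cache algorithm and time the whole batch."""
    start = time.monotonic()
    outputs = [send_request(p, cache=cache) for p in prompts]
    return outputs, time.monotonic() - start
```

In the 100x-repeated case we'd expect `run_case(..., "trie", ...)` to come in close to 100x faster than the base run; if it comes out slower, the cache matching is broken.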
things to track over all test cases
- output token consistency between Base and Trie
- performance comparison between Base and Trie
- total time between sending the first request and receiving the output of the last request
- timeline of sending & receiving requests; this should be helpful for tracking performance problems down the line (see the tracker sketch below)
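A minimal sketch of the tracking, assuming Python; all field and function names here are made up for illustration. It records the send/receive timeline per request, derives the first-send-to-last-receive total, and checks output-token consistency between the Base and Trie runs.

```python
# Sketch only: field and function names are made up for illustration.
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class RequestEvent:
    request_id: int
    cache: str                      # "base" or "trie"
    sent_at: float
    received_at: float | None = None
    output_tokens: list[int] = field(default_factory=list)


@dataclass
class Timeline:
    events: list[RequestEvent] = field(default_factory=list)

    def mark_sent(self, request_id: int, cache: str) -> RequestEvent:
        ev = RequestEvent(request_id, cache, sent_at=time.monotonic())
        self.events.append(ev)
        return ev

    def mark_received(self, ev: RequestEvent, output_tokens: list[int]) -> None:
        ev.received_at = time.monotonic()
        ev.output_tokens = output_tokens

    def total_time(self) -> float:
        """Time between sending the first request and receiving the last output."""
        first_sent = min(e.sent_at for e in self.events)
        last_received = max(e.received_at for e in self.events if e.received_at is not None)
        return last_received - first_sent

    def dump(self, path: str) -> None:
        """Persist the per-request timeline so slowdowns can be tracked down later."""
        with open(path, "w") as f:
            json.dump([asdict(e) for e in self.events], f, indent=2)


def outputs_consistent(base: Timeline, trie: Timeline) -> bool:
    """Output token consistency: same prompts in the same order must yield identical tokens."""
    return [e.output_tokens for e in base.events] == [e.output_tokens for e in trie.events]
```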
sharding
- GPU first
- sharding is not useful on CPU, and in the past we've encountered problems unique to CPU. If we're trying to make GPU work, there's not much reason to wade through those. If we are stuck on GPU-specific issues and there is no more important work to do, THEN we should try sharding on CPU.