prereqs
- [ ] figure out the interface for specifying the cache management algorithm (trie or base). Ideally we can even hot-swap it by specifying it in the incoming HTTP request (see the sketch after this list).
- [ ] refactor CI so the same model artifacts can be reused between Base and Trie
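A minimal sketch of what a swappable cache-manager interface could look like, assuming a Python server. The class names, methods, and the `cache` request field are assumptions for illustration, not the project's actual API; the point is that both algorithms implement the same prefix-matching interface so the request handler can pick one per incoming HTTP request.

```python
# Sketch only: class names, methods, and the "cache" request field are assumptions,
# not the project's actual API.
from abc import ABC, abstractmethod


class CacheManager(ABC):
    """Shared interface so Base and Trie can be selected per request."""

    @abstractmethod
    def match(self, tokens: list[int]) -> int:
        """Return how many leading tokens are already cached (prefill that can be skipped)."""

    @abstractmethod
    def insert(self, tokens: list[int]) -> None:
        """Record tokens so later requests can reuse the cached prefix."""


class BaseCacheManager(CacheManager):
    """Stand-in for the base algorithm: remembers only the most recent sequence."""

    def __init__(self) -> None:
        self.last: list[int] = []

    def match(self, tokens: list[int]) -> int:
        n = 0
        for cached, new in zip(self.last, tokens):
            if cached != new:
                break
            n += 1
        return n

    def insert(self, tokens: list[int]) -> None:
        self.last = list(tokens)


class TrieCacheManager(CacheManager):
    """Trie over token ids: matches the longest cached prefix across all past sequences."""

    def __init__(self) -> None:
        self.root: dict[int, dict] = {}

    def match(self, tokens: list[int]) -> int:
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})


# One long-lived instance per algorithm; the handler picks by name,
# e.g. from a "cache" field in the incoming HTTP request body.
CACHE_MANAGERS: dict[str, CacheManager] = {
    "base": BaseCacheManager(),
    "trie": TrieCacheManager(),
}


def cache_manager_for_request(body: dict) -> CacheManager:
    return CACHE_MANAGERS[body.get("cache", "trie")]
```

Keeping the selection down to a single string also makes it easy to run the same test sequence against both algorithms back to back.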
tests needed
test models
- start with the toy llama model that Rob has
test sequences
- repeat the same prompt 100x, run on both Base and Trie. Trie should be close to 100x faster by skipping prefill; if I screwed up the cache matching, Trie would instead be slower.
- prompts forking at various locations (see the sketch after this list)
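A rough sketch of how these test sequences could be generated and timed, assuming a Python test harness. `send_request` and its `cache=` parameter are hypothetical stand-ins for whatever client and endpoint the harness ends up using.

```python
# Sketch only: `send_request` and its `cache=` parameter are hypothetical stand-ins
# for whatever client/endpoint the test harness ends up using.
import time


def repeated_prompt_sequence(prompt: str, n: int = 100) -> list[str]:
    """The 100x-identical case: Trie should skip prefill on every request after the first."""
    return [prompt] * n


def forking_prompt_sequences(prefix_tokens: list[str], forks: list[str]) -> list[str]:
    """Prompts that share a prefix but fork at various locations (early, middle, end)."""
    prompts = []
    for cut in (1, len(prefix_tokens) // 2, len(prefix_tokens)):
        shared = " ".join(prefix_tokens[:cut])
        for fork in forks:
            prompts.append(f"{shared} {fork}")
    return prompts


def run_case(prompts: list[str], cache: str, send_request) -> tuple[list[str], float]:
    """Send one test case against one cache algorithm and time the whole batch."""
    start = time.monotonic()
    outputs = [send_request(p, cache=cache) for p in prompts]
    return outputs, time.monotonic() - start
```

In the 100x-repeated case we'd expect `run_case(..., "trie", ...)` to come in close to 100x faster than the base run; if it comes out slower, the cache matching is broken.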
things to track over all test cases
- output token consistency between Base and Trie
- performance comparison between Base and Trie
- total time between sending the first request and receiving the output of the last request
- timeline of sending & receiving requests; this should be helpful for tracking performance problems down the line (see the tracker sketch below)
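A minimal sketch of the tracking, assuming Python; all field and function names here are made up for illustration. It records the send/receive timeline per request, derives the first-send-to-last-receive total, and checks output-token consistency between the Base and Trie runs.

```python
# Sketch only: field and function names are made up for illustration.
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class RequestEvent:
    request_id: int
    cache: str                      # "base" or "trie"
    sent_at: float
    received_at: float | None = None
    output_tokens: list[int] = field(default_factory=list)


@dataclass
class Timeline:
    events: list[RequestEvent] = field(default_factory=list)

    def mark_sent(self, request_id: int, cache: str) -> RequestEvent:
        ev = RequestEvent(request_id, cache, sent_at=time.monotonic())
        self.events.append(ev)
        return ev

    def mark_received(self, ev: RequestEvent, output_tokens: list[int]) -> None:
        ev.received_at = time.monotonic()
        ev.output_tokens = output_tokens

    def total_time(self) -> float:
        """Time between sending the first request and receiving the last output."""
        first_sent = min(e.sent_at for e in self.events)
        last_received = max(e.received_at for e in self.events if e.received_at is not None)
        return last_received - first_sent

    def dump(self, path: str) -> None:
        """Persist the per-request timeline so slowdowns can be tracked down later."""
        with open(path, "w") as f:
            json.dump([asdict(e) for e in self.events], f, indent=2)


def outputs_consistent(base: Timeline, trie: Timeline) -> bool:
    """Output token consistency: same prompts in the same order must yield identical tokens."""
    return [e.output_tokens for e in base.events] == [e.output_tokens for e in trie.events]
```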
sharding
- GPU first
- sharding is not useful on CPU, and in the past we've encountered problems unique to CPU. If we're trying to make GPU work, there's not much reason to wade through those. If we are stuck on GPU-specific issues and there is no more important work to do, THEN we should try sharding on CPU.