
Remove concatenation in paged attention to help dispatch fusion #630

Open
sogartar opened this issue Dec 2, 2024 · 2 comments
@sogartar
Contributor

sogartar commented Dec 2, 2024

It may be worth trying to help IREE fuse some dispatches in paged attention.

When writing to the cache we first concatenate the K and V partitions and then scatter the result into the cache, but it may be faster to skip the concatenation and instead scatter the two pieces separately.
These concatenations 1, 2 prevent the scatter from being fused with its producer.

The fused kernel codegen may end up producing less efficient code due to the increased complexity, but saving on the kernel-launch overhead may still be a net win.
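A minimal NumPy sketch of the two write strategies, assuming a simplified flat cache layout where each page stores a K block followed by a V block (all names and shapes here are illustrative, not the actual sharktank code):

```python
import numpy as np

# Hypothetical sizes: 8 pages, each holding a K block and a V block of 4 elements.
page_count, block = 8, 4
cache = np.zeros(page_count * 2 * block)

k = np.arange(block, dtype=float)        # K partition to write
v = np.arange(block, dtype=float) + 100  # V partition to write
page = 3                                 # destination page index

# Strategy 1 (current): concatenate K and V, then one scatter.
# The concat materializes an intermediate tensor, which blocks
# fusing the scatter with the producer of K and V.
cache_concat = cache.copy()
dst = np.arange(2 * block) + page * 2 * block
np.put(cache_concat, dst, np.concatenate([k, v]))

# Strategy 2 (proposed): no concat; scatter K and V separately,
# so each scatter can fuse with its own producer.
cache_split = cache.copy()
k_dst = np.arange(block) + page * 2 * block
v_dst = k_dst + block
np.put(cache_split, k_dst, k)
np.put(cache_split, v_dst, v)

# Both strategies leave the cache in the same state.
assert np.array_equal(cache_concat, cache_split)
```

The trade-off is two scatter dispatches instead of one, but neither needs the intermediate concat buffer, and each can potentially fuse into the dispatch that computes its inputs.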

@sogartar
Contributor Author

sogartar commented Dec 2, 2024

FYI @qedawkins, @IanNod.

@sogartar
Contributor Author

sogartar commented Dec 2, 2024

@rsuderman
