It may be worth trying to help IREE fuse some dispatches in paged attention.
When writing to the cache, we first concatenate the K and V partitions and then scatter the result into the cache. It may be faster to skip the concatenation and scatter the two pieces separately.
These concatenations (1, 2) prevent the scatter from being fused with its producer.
Codegen for the fused kernel may produce less efficient code because of the increased complexity, but the savings in kernel call overhead may outweigh that.
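
To make the proposal concrete, here is a minimal pure-Python sketch of the two write paths. It only illustrates the semantics; it is not the actual paged attention code or anything IREE generates. The helper names (`scatter`, `write_concat_then_scatter`, `write_scatter_twice`) and the cache layout (page `p` keeps K in slot `2*p` and V in slot `2*p + 1`) are assumptions made up for this example.

```python
def scatter(cache, indices, updates):
    """Return a copy of `cache` where cache[indices[i]] is replaced by updates[i]."""
    out = list(cache)
    for idx, row in zip(indices, updates):
        out[idx] = row
    return out


def write_concat_then_scatter(cache, k_rows, v_rows, k_idx, v_idx):
    # Current approach: materialize the concatenated updates, then scatter once.
    updates = k_rows + v_rows
    indices = k_idx + v_idx
    return scatter(cache, indices, updates)


def write_scatter_twice(cache, k_rows, v_rows, k_idx, v_idx):
    # Proposed approach: no concatenation, one scatter per partition.
    cache = scatter(cache, k_idx, k_rows)
    return scatter(cache, v_idx, v_rows)


# Hypothetical layout: page p stores K in slot 2*p and V in slot 2*p + 1.
page_ids = [3, 0]
k_idx = [2 * p for p in page_ids]
v_idx = [2 * p + 1 for p in page_ids]

cache = [[0.0, 0.0] for _ in range(8)]
k_rows = [[1.0, 2.0], [3.0, 4.0]]
v_rows = [[5.0, 6.0], [7.0, 8.0]]

a = write_concat_then_scatter(cache, k_rows, v_rows, k_idx, v_idx)
b = write_scatter_twice(cache, k_rows, v_rows, k_idx, v_idx)
assert a == b  # both paths leave the cache in the same state
```

The two paths agree here because the K and V slots are disjoint, so the order of the two scatters does not matter. Whether the second form is actually faster depends on whether each scatter then fuses with its producer, which is the open question in this issue.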