
Remove concatenation in paged attention to help dispatch fusion #630

Open
sogartar opened this issue Dec 2, 2024 · 2 comments
@sogartar
Contributor

sogartar commented Dec 2, 2024

It may be worth trying to help IREE fuse some dispatches in paged attention.

When writing to the cache we first concatenate the K and V partitions and then scatter the result into the cache, but it may be faster to skip the concatenation and instead scatter the two pieces separately.
These concatenations 1, 2 prevent the scatter from being fused with its producer.

The fused kernel codegen may end up producing less efficient code due to the increased complexity, but saving on the kernel-launch overhead may still be a net win.
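A minimal NumPy sketch of the two write strategies, assuming a simplified flat cache layout where each page stores a K block followed by a V block (all names and shapes here are illustrative, not the actual sharktank code):

```python
import numpy as np

# Hypothetical sizes: 8 pages, each holding a K block and a V block of 4 elements.
page_count, block = 8, 4
cache = np.zeros(page_count * 2 * block)

k = np.arange(block, dtype=float)        # K partition to write
v = np.arange(block, dtype=float) + 100  # V partition to write
page = 3                                 # destination page index

# Strategy 1 (current): concatenate K and V, then one scatter.
# The concat materializes an intermediate tensor, which blocks
# fusing the scatter with the producer of K and V.
cache_concat = cache.copy()
dst = np.arange(2 * block) + page * 2 * block
np.put(cache_concat, dst, np.concatenate([k, v]))

# Strategy 2 (proposed): no concat; scatter K and V separately,
# so each scatter can fuse with its own producer.
cache_split = cache.copy()
k_dst = np.arange(block) + page * 2 * block
v_dst = k_dst + block
np.put(cache_split, k_dst, k)
np.put(cache_split, v_dst, v)

# Both strategies leave the cache in the same state.
assert np.array_equal(cache_concat, cache_split)
```

The trade-off is two scatter dispatches instead of one, but neither needs the intermediate concat buffer, and each can potentially fuse into the dispatch that computes its inputs.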

@sogartar
Contributor Author

sogartar commented Dec 2, 2024

FYI @qedawkins, @IanNod.

@sogartar
Contributor Author

sogartar commented Dec 2, 2024

@rsuderman
