Tensor parallel in distributed inference #10118
-
The documentation recommends using tensor parallelism when serving on a single node with multiple GPUs. My question is: why is tensor parallelism preferred over pipeline parallelism in this setup, even though tensor parallelism involves more communication? What are the specific advantages of using tensor parallelism here?
Replies: 1 comment 3 replies
-
Within a node, networking is generally fast. This means the added communication overhead of TP is much less of a concern than the benefits it brings: every GPU works on every token at once, and TP avoids the batching inefficiencies (pipeline bubbles) present in PP but not TP.
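To make the "all GPUs at once" point concrete, here is a toy NumPy sketch (not vLLM code) of the row-parallel linear layer that Megatron-style tensor parallelism is built on: the weight's input dimension is split across devices, each device computes a partial output, and an all-reduce (modeled here as a plain sum) combines them. All shapes and the `tp` degree are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # batch of activations
W = rng.standard_normal((8, 16))     # full weight matrix

tp = 2                               # hypothetical tensor-parallel degree
x_shards = np.split(x, tp, axis=1)   # each "device" holds a slice of x...
W_shards = np.split(W, tp, axis=0)   # ...and the matching rows of W

# Each device does a smaller matmul; the sum plays the role of all-reduce.
partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]
y = sum(partials)

assert np.allclose(y, x @ W)         # matches the unsharded computation
```

Each device touches only `1/tp` of the weight, which is why TP lets you pool the memory bandwidth of all GPUs for a single token, at the price of one all-reduce per sharded layer.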
Hey sure,
So with tensor parallelism, the downside is that communication cost is higher. However, this is typically not a concern on a single node with very good networking, provided we don't have a lot of long prefills. With TP you get very low latency because you can use the memory bandwidth of all your GPUs at once, and since decoding is memory-bound most of the time, this is preferable. PP is ideal when communication is expensive relative to that: poor interconnect, more communication volume from long prefills, or cross-node setups.
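A rough back-of-envelope sketch of the communication trade-off described above, under assumed numbers (a Llama-70B-like shape: 80 layers, hidden size 8192, fp16; Megatron-style TP with roughly two all-reduces per layer; ring all-reduce moving about `2*(n-1)/n` times the message size per GPU). These are illustrative assumptions, not measurements:

```python
# Assumed model shape and parallel degree (illustrative only).
hidden, layers, bytes_per = 8192, 80, 2   # fp16 = 2 bytes per element
n = 4                                     # parallel degree on one node

# Megatron-style TP: ~2 all-reduces per transformer layer per token;
# a ring all-reduce moves ~2*(n-1)/n times the message size per GPU.
tp_bytes = layers * 2 * (2 * (n - 1) / n) * hidden * bytes_per

# PP: one activation hand-off per stage boundary per token.
pp_bytes = (n - 1) * hidden * bytes_per

print(f"TP per token: {tp_bytes/1e6:.1f} MB, PP per token: {pp_bytes/1e6:.3f} MB")
# → TP per token: 3.9 MB, PP per token: 0.049 MB
```

So TP moves orders of magnitude more data per token, which is fine over NVLink-class intra-node links but hurts over slow interconnects or across nodes, matching the guidance above.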