Tensor parallel in distributed inference #10118
-
The documentation recommends using tensor parallelism when serving on a single node with multiple GPUs. My question is: why is tensor parallelism preferred over pipeline parallelism in this setup, even though tensor parallelism involves more communication? What are the specific advantages of using tensor parallelism here?
Replies: 1 comment 3 replies
-
Within a node, networking is generally fast. This means the added communication overhead of TP is much less of a concern than the benefits it brings: every GPU works on every token at once, and TP avoids the batching inefficiencies (pipeline bubbles) present in PP but not TP.
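To make the "all GPUs at once" point concrete, here is a toy NumPy sketch (not vLLM code) of the row-parallel linear layer that Megatron-style tensor parallelism is built on: the weight's input dimension is split across devices, each device computes a partial output, and an all-reduce (modeled here as a plain sum) combines them. All shapes and the `tp` degree are made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # batch of activations
W = rng.standard_normal((8, 16))     # full weight matrix

tp = 2                               # hypothetical tensor-parallel degree
x_shards = np.split(x, tp, axis=1)   # each "device" holds a slice of x...
W_shards = np.split(W, tp, axis=0)   # ...and the matching rows of W

# Each device does a smaller matmul; the sum plays the role of all-reduce.
partials = [xs @ ws for xs, ws in zip(x_shards, W_shards)]
y = sum(partials)

assert np.allclose(y, x @ W)         # matches the unsharded computation
```

Each device touches only `1/tp` of the weight, which is why TP lets you pool the memory bandwidth of all GPUs for a single token, at the price of one all-reduce per sharded layer.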
Hey sure,
So with tensor parallelism, the downside is that communication cost is higher. However, this is typically not a concern on a single node with very good networking, provided we don't have a lot of long prefills. With TP you get very low latency because you can use the memory bandwidth of all your GPUs at once, and since decoding is memory-bound most of the time, this is preferable. PP is ideal when communication is expensive relative to that: poor interconnect, more communication volume from long prefills, or cross-node setups.
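A rough back-of-envelope sketch of the communication trade-off described above, under assumed numbers (a Llama-70B-like shape: 80 layers, hidden size 8192, fp16; Megatron-style TP with roughly two all-reduces per layer; ring all-reduce moving about `2*(n-1)/n` times the message size per GPU). These are illustrative assumptions, not measurements:

```python
# Assumed model shape and parallel degree (illustrative only).
hidden, layers, bytes_per = 8192, 80, 2   # fp16 = 2 bytes per element
n = 4                                     # parallel degree on one node

# Megatron-style TP: ~2 all-reduces per transformer layer per token;
# a ring all-reduce moves ~2*(n-1)/n times the message size per GPU.
tp_bytes = layers * 2 * (2 * (n - 1) / n) * hidden * bytes_per

# PP: one activation hand-off per stage boundary per token.
pp_bytes = (n - 1) * hidden * bytes_per

print(f"TP per token: {tp_bytes/1e6:.1f} MB, PP per token: {pp_bytes/1e6:.3f} MB")
# → TP per token: 3.9 MB, PP per token: 0.049 MB
```

So TP moves orders of magnitude more data per token, which is fine over NVLink-class intra-node links but hurts over slow interconnects or across nodes, matching the guidance above.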