
Can allreduce (and other multi-GPU communications) work via SLI? #1

ai-bits opened this issue Mar 1, 2017 · 3 comments

ai-bits commented Mar 1, 2017

For quite some time I've been wondering how to build a "poor man's" 4-GPU-accelerated DL machine at the best price/performance ratio.

PCIe is said to be the communications bottleneck.
The DGX-1 at $125k is a no-go, and the DIGITS DevBox hardware is partly dated and not exactly a bargain at $15k...

Now there will be the GTX 1080 Ti at half the Titan price... What would a 4-way setup look like at a rock-bottom price/performance point?
Nvidia says SLI is fast, but not how fast. Is there any chance of using SLI as the communication link between the GPUs for allreduce?

Thanks
G.

shubho (Collaborator) commented Mar 1, 2017

For fast allreduce you would want all 4 GPUs on the same PCIe root complex. These motherboards are not always cheap; a typical 4-GPU motherboard will probably split the 4 GPUs across two root complexes.

We haven't tried SLI; we use OpenMPI as our transport layer. If OpenMPI supported SLI in its Byte Transfer Layer (BTL), then allreduce would automatically work across it.

Updating with what I learned from skimming the SLI docs: the communication is handled by the driver so that it can do alternate-frame rendering. So SLI is very much graphics-only, and only the driver (I am assuming) knows the communication protocol across the SLI link. I think you are out of luck trying to use SLI for allreduce.
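For reference, here is a minimal sketch (my own illustration, not this repo's API) of how an allreduce rides on OpenMPI regardless of the underlying transport. It assumes one MPI rank per GPU and a CUDA-aware OpenMPI build, so device pointers can be passed straight to MPI_Allreduce; the file name and buffer size are just for illustration.

```cpp
// allreduce_sketch.cpp -- minimal sketch, not baidu-allreduce's own API.
// Assumes one MPI rank per GPU and a CUDA-aware OpenMPI build, so device
// pointers are legal MPI buffers; OpenMPI picks the transport (shared memory,
// PCIe peer-to-peer, InfiniBand, ...) underneath.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank);                       // one rank per GPU

    const int n = 1 << 20;                     // e.g. one gradient chunk of 1M floats
    std::vector<float> host(n, 1.0f);
    float *send = nullptr, *recv = nullptr;
    cudaMalloc(&send, n * sizeof(float));
    cudaMalloc(&recv, n * sizeof(float));
    cudaMemcpy(send, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Sum across all ranks; with CUDA-aware MPI the buffers stay on the GPUs.
    MPI_Allreduce(send, recv, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaMemcpy(host.data(), recv, n * sizeof(float), cudaMemcpyDeviceToHost);
    // Every element of host is now 1.0f * size on every rank.

    cudaFree(send);
    cudaFree(recv);
    MPI_Finalize();
    return 0;
}
```

Launched the usual way, e.g. `mpirun -np 4 ./allreduce_sketch`, one process per GPU.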

gibiansky pushed a commit that referenced this issue Mar 1, 2017
Added caching of handles for libxsmm forward convolutions
ai-bits (Author) commented Mar 2, 2017

Thanks a ton for your take on this. I appreciate it!

I fear you are right that SLI is not general-purpose enough and is meant only for splitting up graphics rendering and physics calculations in games or VR, specifically prepared for that. And I could not get hold of any transfer-speed numbers anywhere.

Upon reading up more:
In contrast to their deep learning prowess, Nvidia's consumer docs on SLI are totally outdated (7 series!), e.g. implying you need dual-GPU graphics cards (not available for the 10 series) to get to 4-way SLI.
Only a closer study of the motherboard makers' pages reveals that they all ship several SLI bridges with their boards to get to 2-/3-/4-way setups.

The compromise with PCIe: due to CPU PCIe lane restrictions in consumer hardware, lanes are multiplexed, and four physical x16 PCIe slots mostly end up as four electrical x8 ones (or x16/x16 or x16/x8/x8 with fewer cards).
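For rough context (my own numbers, assuming PCIe 3.0 with 128b/130b encoding, i.e. roughly 0.985 GB/s per lane per direction):

```
x16 slot: 16 × 0.985 GB/s ≈ 15.8 GB/s per direction
x8  slot:  8 × 0.985 GB/s ≈  7.9 GB/s per direction
```

So an x8 link halves the per-GPU bandwidth available to the allreduce, before latency and protocol overhead are even considered.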

I guess a performance prognosis with all the different factors (transfer speed, latency, ...) would be very hard to make, so I'll simply take the plunge in the coming weeks.

Thanks again
G.

shubho (Collaborator) commented Mar 2, 2017

Your best bet is to buy the motherboard that has the maximum number of GPUs per PCIe root complex and that fits in your budget. You have to go through the specs of each motherboard to figure this out; it is not always spelled out clearly. Tyan or SuperMicro should have something.
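Once a box is assembled, one quick way to see what you actually got (my own suggestion, not from this thread) is `nvidia-smi topo -m`, or a small CUDA runtime check like the sketch below, which reports which GPU pairs can do direct peer-to-peer transfers; pairs that cannot typically sit under different root complexes.

```cpp
// peer_check.cpp -- illustrative sketch: report which GPU pairs support
// direct peer-to-peer access (build with nvcc).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            std::printf("GPU %d -> GPU %d: peer access %s\n", i, j, ok ? "yes" : "no");
        }
    }
    return 0;
}
```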
