
Can allreduce (and other multi-GPU communications) work via SLI? #1

ai-bits opened this issue Mar 1, 2017 · 3 comments

ai-bits commented Mar 1, 2017

For quite some time I've been wondering how to build a "poor man's" 4-GPU-accelerated DL machine at the best price/performance ratio.

PCIe is said to be the communications bottleneck.
The DGX-1 at $125k is a no-go, and the DIGITS DevBox hardware is partly dated and not exactly a bargain at $15k...

Now there will be the GTX 1080 Ti at half the Titan price... What would a 4-way setup look like at a rock-bottom price/performance point?
Nvidia says SLI is fast, but not how fast. Is there any chance of using SLI as the communication link between the GPUs for allreduce?

Thanks
G.

shubho (Collaborator) commented Mar 1, 2017

For fast allreduce you would want all 4 GPUs on the same PCIe root complex. These motherboards are not always cheap; a typical 4-GPU motherboard will probably split the 4 GPUs across two root complexes.

We haven't tried SLI; we use OpenMPI as our transport layer. If OpenMPI supported SLI in its Byte Transfer Layer (BTL), then allreduce would automatically work across it.

Updating with what I learned from skimming the SLI docs: the communication is handled by the driver so that it can do alternate-frame rendering. So SLI is very much graphics-only, and only the driver (I am assuming) knows the communication protocol across the SLI link. I think you are out of luck trying to use SLI for allreduce.
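For reference, here is a minimal sketch (my own illustration, not this repo's API) of how an allreduce rides on OpenMPI regardless of the underlying transport. It assumes one MPI rank per GPU and a CUDA-aware OpenMPI build, so device pointers can be passed straight to MPI_Allreduce; the file name and buffer size are just for illustration.

```cpp
// allreduce_sketch.cpp -- minimal sketch, not baidu-allreduce's own API.
// Assumes one MPI rank per GPU and a CUDA-aware OpenMPI build, so device
// pointers are legal MPI buffers; OpenMPI picks the transport (shared memory,
// PCIe peer-to-peer, InfiniBand, ...) underneath.
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    cudaSetDevice(rank);                       // one rank per GPU

    const int n = 1 << 20;                     // e.g. one gradient chunk of 1M floats
    std::vector<float> host(n, 1.0f);
    float *send = nullptr, *recv = nullptr;
    cudaMalloc(&send, n * sizeof(float));
    cudaMalloc(&recv, n * sizeof(float));
    cudaMemcpy(send, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Sum across all ranks; with CUDA-aware MPI the buffers stay on the GPUs.
    MPI_Allreduce(send, recv, n, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

    cudaMemcpy(host.data(), recv, n * sizeof(float), cudaMemcpyDeviceToHost);
    // Every element of host is now 1.0f * size on every rank.

    cudaFree(send);
    cudaFree(recv);
    MPI_Finalize();
    return 0;
}
```

Launched the usual way, e.g. `mpirun -np 4 ./allreduce_sketch`, one process per GPU.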

gibiansky pushed a commit that referenced this issue Mar 1, 2017
Added caching of handles for libxsmm forward convolutions
ai-bits (Author) commented Mar 2, 2017

Thanks a ton for your take on this. I appreciate it!

I fear you are right that SLI is not general-purpose enough and is meant only for splitting up graphics rendering and physics calculations in games or VR, specifically prepared for that. And I could not get hold of any transfer-speed numbers anywhere.

Upon reading up more:
In contrast to their deep learning prowess, Nvidia's consumer docs on SLI are totally outdated (7 series!), e.g. implying you need dual-GPU graphics cards (not available for the 10 series) to get to 4-way SLI.
Only a closer study of the motherboard makers' pages reveals that they all ship several SLI bridges with their boards to get to 2-/3-/4-way setups.

The compromise with PCIe: due to CPU PCIe lane restrictions in consumer hardware, lanes are multiplexed, and four physical x16 PCIe slots mostly end up as four electrical x8 ones (or x16/x16 or x16/x8/x8 with fewer cards).
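For rough context (my own numbers, assuming PCIe 3.0 with 128b/130b encoding, i.e. roughly 0.985 GB/s per lane per direction):

```
x16 slot: 16 × 0.985 GB/s ≈ 15.8 GB/s per direction
x8  slot:  8 × 0.985 GB/s ≈  7.9 GB/s per direction
```

So an x8 link halves the per-GPU bandwidth available to the allreduce, before latency and protocol overhead are even considered.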

I guess a performance prognosis with all the different factors (transfer speed, latency, ...) would be very hard to make, so I'll simply take the plunge in the coming weeks.

Thanks again
G.

shubho (Collaborator) commented Mar 2, 2017

Your best bet is to buy the motherboard that has the maximum number of GPUs per PCIe root complex and that fits in your budget. You have to go through the specs of each motherboard to figure this out; it is not always spelled out clearly. Tyan or SuperMicro should have something.
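Once a box is assembled, one quick way to see what you actually got (my own suggestion, not from this thread) is `nvidia-smi topo -m`, or a small CUDA runtime check like the sketch below, which reports which GPU pairs can do direct peer-to-peer transfers; pairs that cannot typically sit under different root complexes.

```cpp
// peer_check.cpp -- illustrative sketch: report which GPU pairs support
// direct peer-to-peer access (build with nvcc).
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int ok = 0;
            cudaDeviceCanAccessPeer(&ok, i, j);
            std::printf("GPU %d -> GPU %d: peer access %s\n", i, j, ok ? "yes" : "no");
        }
    }
    return 0;
}
```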
