
Fix memory leak for pipelined optimizer swapper #5700

Merged 1 commit into microsoft:master on Jul 15, 2024

Conversation

mauryaavinash95
Contributor

We identified a memory leak when training with NVMe-offloaded optimizer states. The issue occurs when pipeline_write=true: tensors that have been swapped out and written to NVMe are never deallocated, so host memory grows over the course of the run.

This PR resolves the issue by deallocating the now-unused tensors once they have been swapped out to NVMe.
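For readers unfamiliar with the swap-out path, here is a minimal, self-contained sketch of the pattern being fixed. The class, the method names, and the `_WriteHandle` stand-in are hypothetical illustrations, not the DeepSpeed API; only the final `tensor.data` reassignment reflects the kind of deallocation the PR adds.

```python
import torch

class _WriteHandle:
    """Stand-in for an asynchronous NVMe write handle (e.g. an AIO op)."""
    def wait(self):
        pass  # real code would block until the device write completes

class PipelinedSwapOutSketch:
    def __init__(self):
        self.inflight = []  # (handle, tensor) pairs for writes still in flight

    def _submit_async_write(self, tensor):
        # Real code would hand the pinned host buffer to the AIO engine.
        return _WriteHandle()

    def swap_out(self, tensor):
        # Queue the write and keep the tensor alive until it completes.
        self.inflight.append((self._submit_async_write(tensor), tensor))

    def complete_writes(self):
        for handle, tensor in self.inflight:
            handle.wait()  # the bytes are now safely on NVMe
            # The essence of the fix: release the host-side storage once the
            # write has landed. Without this, every swapped-out tensor stays
            # resident and memory grows for the rest of training.
            tensor.data = torch.empty(0, dtype=tensor.dtype, device=tensor.device)
        self.inflight.clear()
```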

@tjruwase
Contributor

@mauryaavinash95, thanks for fixing this leak. As this is not a commonly used feature, I am curious if you are seeing any benefits from the pipelined swapper?

@mauryaavinash95
Contributor Author

@tjruwase: We see a 20-30% improvement in the optimizer update phase when using the pipelined swapper, most likely because of the full-duplex PCIe links through which the NVMe drives are connected. We are running some extensive experiments on this and can share results in a few weeks if you're interested.
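To make the claimed benefit concrete, here is a toy illustration of why pipelining can pay off on a full-duplex link: the write-back of one partition overlaps the prefetch of the next, so each step costs roughly max(read, write) rather than read + write. All function names and timings below are hypothetical placeholders, not DeepSpeed code.

```python
import concurrent.futures as cf
import time

def nvme_read(i):
    time.sleep(0.1)   # stand-in for reading optimizer partition i from NVMe
    return i

def nvme_write(state):
    time.sleep(0.1)   # stand-in for writing the updated partition back

def optimizer_step(state):
    pass              # stand-in for the optimizer update on this partition

def pipelined_update(num_partitions):
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        pending_write = None
        state = nvme_read(0)
        for i in range(num_partitions):
            # Prefetch the next partition while this one is being processed.
            prefetch = pool.submit(nvme_read, i + 1) if i + 1 < num_partitions else None
            optimizer_step(state)
            if pending_write is not None:
                pending_write.result()  # previous write-back has finished
            # The write of partition i overlaps the read of partition i + 1,
            # which a full-duplex PCIe link can service simultaneously.
            pending_write = pool.submit(nvme_write, state)
            if prefetch is not None:
                state = prefetch.result()
        if pending_write is not None:
            pending_write.result()

pipelined_update(4)  # each step costs ~max(read, write) instead of read + write
```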

@tjruwase
Contributor

> We see a 20-30% improvement in the optimizer update phase when using the pipelined swapper, most likely because of the full-duplex PCIe links through which the NVMe drives are connected.

@mauryaavinash95, wow! It is great to learn that you are seeing speedups from the pipelined swapper. Thanks for sharing.

The back story is that we never got a chance to properly evaluate the perf benefits, but released the code anyway in the hope that somebody might someday find it useful :). So I am very curious about your work and results along these lines, if you don't mind sharing.

@mauryaavinash95
Contributor Author

Sorry, I accidentally made some other commits to my master branch, which caused this PR to get closed, so I had to open it again.

@adk9 adk9 added this pull request to the merge queue Jul 15, 2024
Merged via the queue into microsoft:master with commit db5a875 Jul 15, 2024
11 of 18 checks passed