This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Aborted (core dumped) #56

Open
vshkurin opened this issue Jan 3, 2022 · 3 comments

Comments

@vshkurin

vshkurin commented Jan 3, 2022

Hello! You are doing a very cool project that helps ordinary users work around limited GPU memory. Unfortunately, my neural network training ends with an error.

Sometimes this error is:

F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:136] Check failed: this->H2D_stream_->ok()
Aborted (core dumped)

Sometimes:

F tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:175] Check failed: this->D2H_stream_->ok()
Aborted (core dumped)

Sometimes the errors appear immediately, sometimes in the middle of an epoch.

I am using LMS with TensorFlow 2.2.0.
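In case it matters, this is roughly how I enable it. A minimal sketch: I am assuming the IBM TF 2.2 build where LMS is switched on through `tf.config.experimental.set_lms_enabled`; that call may not exist in other builds, and the model here is a placeholder, not my real one.

```python
import tensorflow as tf

# Assumption: the IBM TF 2.2 build exposes this global LMS switch.
# It must be called before any GPU work starts.
tf.config.experimental.set_lms_enabled(True)

# Placeholder model, just to show the switch precedes model construction.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, activation="relu", input_shape=(256, 256, 3)),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```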

@smatzek
Collaborator

smatzek commented Jan 4, 2022

Thanks for trying out LMS. LMS is not being actively maintained, hence the 2.2.0 version. It has been nearly two years since I ran a job with LMS, but I recall that those checks would fail during certain kinds of out-of-memory conditions.

Even with LMS enabled, it is possible to run out of memory on the GPU in certain cases, such as when the model requires too many active tensors, or when individual operations have input and output tensors so large they cannot fit.
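As a rough sketch of the second case (the layer sizes here are invented, not taken from your model): a single matmul whose weights and activations alone exceed the card's memory cannot be helped by swapping, because all of its inputs and outputs must be resident while the op runs.

```python
# Back-of-the-envelope memory for one matmul. Swapping cannot shrink the
# working set of a single op: its inputs and outputs are all on the GPU at once.
batch, in_dim, out_dim = 1024, 32768, 32768
bytes_per_float = 4

weights = in_dim * out_dim * bytes_per_float   # 4.00 GiB
inputs = batch * in_dim * bytes_per_float      # 0.125 GiB
outputs = batch * out_dim * bytes_per_float    # 0.125 GiB

total_gib = (weights + inputs + outputs) / 2**30
print(f"resident bytes for one matmul: {total_gib:.2f} GiB")  # ~4.25 GiB
```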

If you provide more error messages or output from your run, I may be able to help more.

@vshkurin
Author

vshkurin commented Jan 5, 2022

Thank you for the clarification! I am using a GeForce GTX 1060 3 GB, and this card probably has very little video memory. I found that if I reduce the batch size, the error disappears. That is all I can tell you; apart from the above, nothing appears in the console.
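For anyone who hits this: a sketch of the workaround. The model and data here are placeholders, not my real code; the point is only where the smaller batch size is passed to Keras.

```python
import numpy as np
import tensorflow as tf

# Placeholder model and synthetic data, just to show where batch_size goes.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

train_x = np.random.rand(1024, 32).astype("float32")
train_y = np.random.rand(1024, 1).astype("float32")

# Halving batch_size roughly halves per-step activation memory, which must
# fit on the GPU during each op even when LMS is swapping inactive tensors.
model.fit(train_x, train_y, batch_size=8, epochs=2)
```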

@HarshaanNiles010

Were you able to work out why the error sometimes appears at the very beginning of the process? Correct me if I am wrong: with a large batch size, the computation starts on the GPU, and when a tensor is no longer in use it should be swapped back to the host machine for storage, much like a paging algorithm, so large batch sizes should not be a problem to handle.
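A rough sketch of the part I might be missing (the layer shape is invented for illustration): the tensors consumed or produced by the op that is currently executing must stay resident on the GPU no matter what gets swapped out, and those activations scale linearly with batch size, so a 3 GB card still has a hard per-op floor.

```python
# Activation memory of a single conv layer output, which must be on the
# GPU while that conv executes, regardless of any swapping of idle tensors.
def conv_activation_bytes(batch, h, w, channels, bytes_per_float=4):
    return batch * h * w * channels * bytes_per_float

for batch in (8, 32, 128):
    gib = conv_activation_bytes(batch, 256, 256, 64) / 2**30
    print(f"batch {batch:>3}: {gib:.2f} GiB for one activation tensor")
# batch   8: 0.12 GiB
# batch  32: 0.50 GiB
# batch 128: 2.00 GiB  -- most of a 3 GB card for one tensor
```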
