Multiple 400MB processes on single GPU #20114

changspencer · 2024-07-22T13:29:26Z

Hello everyone!

I have a question about DP/DDP behavior that I've been wondering about. Occasionally, on my runs on a SLURM cluster, I see multiple small processes of about 400 MB each get placed on a single GPU. I assume this GPU hosts something like the "master" process, but I have no idea why the small processes are necessary, or whether they simply aren't cleaned up after a method finishes running.

What could be the reason these processes show up on a single GPU during training (even though all training processes have already been started)? Could this be a SLURM resource-management problem or a programming problem on my side?
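To narrow this down, it may help to see which PIDs hold memory on more than one GPU. Below is a minimal sketch, not a definitive tool: it parses the CSV output of `nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader,nounits` and flags PIDs that appear on several devices. The helper names (`parse_compute_apps`, `pids_on_multiple_gpus`, `query_compute_apps`) are my own and purely illustrative.

```python
import csv
import io
import subprocess
from collections import defaultdict

# Query used by query_compute_apps(); memory is reported in MiB with "nounits".
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=gpu_uuid,pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Group (pid, memory_mib) pairs by GPU UUID from nvidia-smi CSV text."""
    per_gpu = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        gpu_uuid, pid, mem = (field.strip() for field in row)
        # Memory can be unavailable on some setups; keep the PID with None.
        mem_mib = int(mem) if mem.isdigit() else None
        per_gpu[gpu_uuid].append((int(pid), mem_mib))
    return dict(per_gpu)


def pids_on_multiple_gpus(per_gpu):
    """Return {pid: [gpu_uuid, ...]} for PIDs with a context on several GPUs."""
    seen = defaultdict(set)
    for gpu_uuid, procs in per_gpu.items():
        for pid, _mem in procs:
            seen[pid].add(gpu_uuid)
    return {pid: sorted(gpus) for pid, gpus in seen.items() if len(gpus) > 1}


def query_compute_apps():
    """Run nvidia-smi on the current node and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)
```

Calling `pids_on_multiple_gpus(query_compute_apps())` on the compute node while the extra processes are visible would show whether the small allocations belong to training ranks that also own a large allocation on another GPU, or to separate PIDs altogether.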

Some quick notes:

- I'm working on semantic segmentation with a basic UNet-style architecture that has around 24-27M parameters and RGB images of size (512, 512).
- I've integrated the RAPIDS Agglomerative Clustering routine at one step in my deep learning pipeline.
- The processes can show up during training or validation steps, and it may be random whether they persist or not.
- I've had CUDA OOM problems in the past, but it's not clear that these small processes are the direct cause (a recent run shows that one of the other GPUs had an OOM before device 0).
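One hypothesis I can test, given the notes above, is that some step in each rank (possibly the clustering call) initializes a CUDA context on device 0 instead of on the rank's own GPU. The sketch below is an assumption-laden illustration rather than a confirmed fix: it derives a per-process device index from the `LOCAL_RANK` variable (set by common DDP launchers) or SLURM's `SLURM_LOCALID`, so the same index can be selected in every GPU library before any work starts. The function name `local_device_index` is hypothetical. Note that if SLURM already restricts each task to a single visible GPU, index 0 is the correct device for every task and this helper should not be used.

```python
import os


def local_device_index(env=None, default=0):
    """Return this process's local GPU index from launcher environment variables.

    Checks LOCAL_RANK first, then SLURM_LOCALID, and falls back to `default`.
    """
    env = os.environ if env is None else env
    for name in ("LOCAL_RANK", "SLURM_LOCALID"):
        value = env.get(name)
        if value is not None and value.strip().isdigit():
            return int(value)
    return default


# Possible usage at the start of each rank, before any GPU library touches
# the device (left as comments because they need a GPU and extra packages):
#
#   idx = local_device_index()
#   torch.cuda.set_device(idx)      # PyTorch
#   cupy.cuda.Device(idx).use()     # CuPy, if the clustering step is fed CuPy arrays
```

If the extra 400 MB processes disappear once every library is pinned to the same index, that would point to a programming issue rather than SLURM resource management.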

I can try to provide more details, but I wanted to see whether anyone else has experienced the same situation, possibly for different use cases.
