Parallel queue #311
base: main
Conversation
Any idea when this will get merged?
@sarthakpati, according to the discussion in #301, it's not clear that this implementation would generally be preferred, and I believe it's now in a prototyping state. You can always check out this branch and use it, if you like. I'll convert this PR to a draft, so it's clear that it's not just waiting to be reviewed.
@fepegar, would you consider providing a parallel queue? I'm thinking of sticking with the current code interface, but offering this as an option in a parallel environment.
Thanks! I will try this out and will be following the discussion closely. Cool work!
Unfortunately, I can't "provide" a parallel queue because I don't have the resources to work on it. I think nowadays the data loading tools available in modern versions of PyTorch would allow for a very nice implementation, but I really don't have the bandwidth. You're welcome to contribute!
@fepegar I have some possible improvements to make queues more parallel. Can you tell me whether my understanding of queues is correct and whether my improvement ideas are useful? Understanding how the queue works: I think this can be improved in two ways.
If I understood correctly, I think this is already the case. The data loader gets images from the dataset, where they are loaded and transformed in the subprocesses (see torchio/src/torchio/data/dataset.py, lines 82 to 100 at 0933def).
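A minimal sketch of that behaviour, assuming `subjects` is a list of `tio.Subject` instances (the identity `collate_fn` is only for illustration):

```python
import torch
import torchio as tio

# With num_workers > 0, each DataLoader worker subprocess calls
# SubjectsDataset.__getitem__, so loading and transforming already
# happen in parallel with the training loop.
dataset = tio.SubjectsDataset(subjects, transform=tio.RandomAffine())
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=2,
    num_workers=4,
    collate_fn=lambda batch: batch,  # subjects are not plain tensors
)

for batch in loader:  # batch is a list of transformed subjects
    pass
```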
This is indeed interesting. I think it's related to MONAI's
Hi, I wrote the patch a long time ago, and as @fepegar said: it was not clear whether it would improve speed.
Static transforms could be applied before the dataset, and non-static ones afterwards, something like this:
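For illustration only (the split into `static_transform` and `random_transform` is my naming for this sketch, not the original patch):

```python
import torchio as tio

# Deterministic preprocessing: safe to apply once and cache.
static_transform = tio.Compose([
    tio.ToCanonical(),
    tio.Resample(1),
])

# Random augmentation: must be reapplied on every access.
random_transform = tio.Compose([
    tio.RandomAffine(),
    tio.RandomNoise(),
])
```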
where `__getitem__(self, idx)` would be something like this:
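A rough sketch of that idea; the caching dictionary, the transform attributes, and the use of the internal `_subjects` list are assumptions for illustration, not existing TorchIO API:

```python
import copy
import torchio as tio

class CachedSubjectsDataset(tio.SubjectsDataset):
    def __init__(self, subjects, static_transform=None, random_transform=None):
        super().__init__(subjects)
        self.static_transform = static_transform
        self.random_transform = random_transform
        self._cache = {}  # idx -> statically preprocessed subject

    def __getitem__(self, idx):
        if idx not in self._cache:
            # Load and apply the expensive static transforms only once.
            subject = copy.deepcopy(self._subjects[idx])  # _subjects is internal
            subject.load()
            if self.static_transform is not None:
                subject = self.static_transform(subject)
            self._cache[idx] = subject
        # Reapply only the random transforms on every access.
        subject = copy.deepcopy(self._cache[idx])
        if self.random_transform is not None:
            subject = self.random_transform(subject)
        return subject
```

Note that with `num_workers > 0` each DataLoader worker would build its own copy of this cache, which is exactly the limitation discussed below.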
So what do you need? First, even if you have N worker processes, only one should create the cache. However, this design has a serious limitation: CUDA access. Workers don't have CUDA access. In remote sensing there is another issue, coordinate systems; while this is not directly applicable to medical imaging, the different modalities exhibit similar resampling challenges.

So, summarizing: I would guess the disk format is the most significant bottleneck, and the only reasonable way, I think, is to read the pixel data from a preprocessed file; as a first attempt I would go for tensorstore (a rough sketch is below).

Ugly but not completely out of the question: most medical datasets are small, so make an in-memory cache, load everything at the beginning of training, and take subsamples from it. (Unfortunately, we rarely get more than 100 GB of data, and 100 GB in a server is not too much.)

MONAI & co.: I would assume that everyone has the same problem, so sooner or later someone will solve it. But most frameworks have a relatively clear cut at the dataset level, so it is not unreasonable to combine one framework's datasets/loaders/samplers with another framework's transforms, models, losses, etc.
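A rough illustration of the tensorstore idea, using its zarr driver (path and shape are placeholders):

```python
import numpy as np
import tensorstore as ts

spec = {
    'driver': 'zarr',
    'kvstore': {'driver': 'file', 'path': '/tmp/preprocessed/subject_000'},
}

# Write the preprocessed volume once, e.g. after the static transforms.
store = ts.open(
    spec, create=True, dtype=ts.float32, shape=(1, 256, 256, 180)
).result()
store[...] = np.random.rand(1, 256, 256, 180).astype(np.float32)

# Later, read only the voxels of a patch instead of the whole file.
patch = store[0, 40:104, 40:104, 40:104].read().result()
```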
Thank you, @fepegar and @dvolgyes, for your kind responses. I mentioned improving the prefetch of the queue (a) and caching the pretransforms (b). I'll talk about those two first, and then about a broader speedup (c).

(a) Improving prefetch on the queue
(b) Caching pretransforms
(c) Using CUDA for transforms

Thanks again for your answers. I think there is still a lot of room for improvement in medical image pretransforms. If I think of more, I'll comment again.
Hi there, just my feedback about "pre-transform", because I do not think this is the major bottleneck. Of course, it may depend on the exact choice of transforms (depending on your application). In my case I use a RandomAffine and a RandomElastic... even worse, I also find RandomMotion very useful! Each of those transforms takes quite a lot of time, and I would love to see them implemented on the GPU (should that be a development objective of TorchIO?).

Unfortunately, the GPU node I can access does not have enough CPU memory, and reducing num_workers (to 6) leads to training times that are just too long (a factor of 5 to 10 or more compared with almost no augmentation). I ended up doing the augmentation on a CPU cluster and saving each iteration to disk, as sketched below. This is a quite insane solution, since I need a few terabytes of disk to save my few hundred thousand augmented examples, but it is effective and it greatly speeds up the training.
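A sketch of that workaround; the output path, the 'image' key, and the number of precomputed epochs are placeholders:

```python
import torchio as tio

augment = tio.Compose([
    tio.RandomAffine(),
    tio.RandomElasticDeformation(),
    tio.RandomMotion(),
])
dataset = tio.SubjectsDataset(subjects, transform=augment)

# Precompute a few augmented "epochs" on a CPU cluster and save them to disk;
# training then only reads the already augmented files.
for epoch in range(5):
    for i, subject in enumerate(dataset):
        out_path = f'/scratch/augmented/epoch{epoch:02d}_subject{i:05d}.nii.gz'
        subject['image'].save(out_path)
```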
The code for #301. It is way uglier than it should be, but maybe it is easier to understand this way. Tests, type hints, etc.: I will fix them once we agree on the details. But now it should work.

In my examples the network processes a patch as fast as I can extract it, so this will be a fundamental limit. But if I put a time.sleep(1) into my training loop, then I see that the queue preloads the volumes.

Recommendation: set the subject size to a small value, e.g. 3, the number of patches to a large number, like 64, and put a time.sleep(1) into the training loop (see the sketch below). Hopefully this will set up conditions where you can study how it works. (Assuming you have enough memory for that.)
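For reference, a rough version of that experiment written against the standard Queue interface (the exact arguments of the parallel queue in this branch may differ; `subjects` and the batch size are placeholders):

```python
import time
import torch
import torchio as tio

subjects_dataset = tio.SubjectsDataset(subjects[:3])  # only a few subjects
sampler = tio.UniformSampler(patch_size=64)
queue = tio.Queue(
    subjects_dataset,
    max_length=300,
    samples_per_volume=64,  # many patches per volume
    sampler=sampler,
    num_workers=4,
)
loader = torch.utils.data.DataLoader(queue, batch_size=8, num_workers=0)

for patches_batch in loader:
    time.sleep(1)  # artificially slow training step, so the preloading is visible
```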