Multi-core performance and memory consumption #41

lankiszhang · 2020-11-16T11:44:16Z

Dear all,

We have found nblast really helpful to our current project, especially when doing nblast against the FlyEM database.

On my laptop (6 cores, 12 threads), a one-against-all NBLAST takes about 4 min on a single core.

To reduce the time, I used doParallel to register a multi-core backend and ran NBLAST with .parallel = TRUE. I could confirm that all 12 logical cores were busy, but RAM consumption reached 100% and the same task took more than 10 min.

Then I tried running NBlast on only two cores to avoid the high memory consumption, and it took 5 min for the task.

Given the longer run time and the high memory consumption, I am a little confused about how exactly nblast uses .parallel. My server has 4 processors (40 cores, 80 threads) and 48 GB RAM, and my dps_flyEM is 2.32 GB. Would it be best to run NBLAST on only 16 cores rather than 80?

Best wishes,
Jiajun Zhang

jefferis · 2020-11-16T17:26:41Z

Dear Jiajun, as you have discovered, although NBLAST is in theory an embarrassingly parallel problem that could scale perfectly across cores, in practice non-CPU resources can be limiting. We normally find that memory is the limiting resource.

In terms of your server spec, you have a lot of cores for the amount of memory. I suspect that you will not want to run more than 10 jobs, but the only way to be sure is to test. Note that running 1 query neuron against the flyem set will be less efficient than running 10 neurons; the memory usage is likely to be similar for both.
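The "only way to be sure is to test" advice can be sketched as a small timing loop. This is a minimal sketch, not a tuned benchmark: it assumes the nat, nat.nblast and doParallel packages are installed, it uses the small kcs20 sample data shipped with nat as a stand-in for the real query and target sets, and the helper name time_nblast is hypothetical. On such a tiny dataset the parallel runs may well be slower than the serial one; the point is the pattern, which should be repeated with your own neuronlist and a realistic number of query neurons.

```r
library(nat)          # provides the kcs20 sample dotprops neuronlist
library(nat.nblast)   # provides nblast()
library(doParallel)   # provides registerDoParallel()

# Stand-ins (assumptions): replace with your own query neurons and target set.
query  <- kcs20[1:10]   # several query neurons in one call
target <- kcs20

# Hypothetical helper: time one NBLAST run with the given number of workers.
time_nblast <- function(workers) {
  if (workers > 1) {
    registerDoParallel(cores = workers)
    on.exit(stopImplicitCluster(), add = TRUE)
  }
  system.time(
    nblast(query, target, .parallel = workers > 1)
  )[["elapsed"]]
}

worker_counts <- c(1, 2, 4)
timings <- data.frame(
  workers = worker_counts,
  seconds = vapply(worker_counts, time_nblast, numeric(1))
)
print(timings)
```

Watching memory use (for example with top) while each setting runs shows whether the slowdown at higher worker counts coincides with RAM being exhausted.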

A few additional questions. Is that 2.3 GB the on disk or in memory footprint of your dotprops object? In either case it is bigger than ours, which either means that you have processed more neurons (which is good) or that you have more information in the dotprops objects than necessary. This will be a double hit against you since the searches will take more CPU (because there are more points) and you will be able to run fewer parallel jobs (because they take more memory). I would suggest running some basic stats on your dotprops object as I have done below and pasting in your results.

> load("fib.twigs5.dps.rda")
> object.size(fib.twigs5.dps)
1594104384 bytes
> length(fib.twigs5.dps)
[1] 27411
> stem(nvertices(fib.twigs5.dps))

  The decimal point is 3 digit(s) to the right of the |

   0 | 00000000000000000000000000000000000000000000000000000000000000000000+24861
   2 | 00000000000000000000000000000000000000000000000000000000000000000000+1810
   4 | 00000000000000000000000000000000000111111111111111111111111111122222+289
   6 | 00000000001111111111222222222333333344445555555556677778888890000000+27
   8 | 111122234444444455566677777788899900233333566788999
  10 | 1222223466677779901688
  12 | 0123445599935
  14 | 112490234
  16 | 3938
  18 | 24
  20 | 28
  22 | 
  24 | 
  26 | 
  28 | 
  30 | 
  32 | 
  34 | 
  36 | 
  38 | 3


> mean(nvertices(fib.twigs5.dps))
[1] 936.0071

Perhaps you could also add the output of the following command, to give a slightly more fine-grained view on this.

dput(hist(nvertices(fib.twigs5.dps), plot=FALSE, breaks=100))

structure(list(breaks = c(0, 500, 1000, 1500, 2000, 2500, 3000, 
3500, 4000, 4500, 5000, 5500, 6000, 6500, 7000, 7500, 8000, 8500, 
9000, 9500, 10000, 10500, 11000, 11500, 12000, 12500, 13000, 
13500, 14000, 14500, 15000, 15500, 16000, 16500, 17000, 17500, 
18000, 18500, 19000, 19500, 20000, 20500, 21000, 21500, 22000, 
22500, 23000, 23500, 24000, 24500, 25000, 25500, 26000, 26500, 
27000, 27500, 28000, 28500, 29000, 29500, 30000, 30500, 31000, 
31500, 32000, 32500, 33000, 33500, 34000, 34500, 35000, 35500, 
36000, 36500, 37000, 37500, 38000, 38500), counts = c(9967L, 
8933L, 4324L, 1825L, 846L, 459L, 299L, 198L, 139L, 94L, 70L, 
50L, 43L, 18L, 30L, 12L, 18L, 17L, 7L, 9L, 8L, 9L, 2L, 3L, 8L, 
3L, 2L, 0L, 4L, 2L, 3L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 2L, 0L, 0L, 
0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 1L), density = c(0.00072722629601255, 0.000651782131261173, 
0.000315493779869395, 0.000133158221152092, 6.17270438874904e-05, 
3.34902046623618e-05, 2.1816059246288e-05, 1.44467549523914e-05, 
1.01419138302141e-05, 6.85856043194338e-06, 5.1074386195323e-06, 
3.64817044252307e-06, 3.13742658056984e-06, 1.31334135930831e-06, 
2.18890226551384e-06, 8.75560906205538e-07, 1.31334135930831e-06, 
1.24037795045785e-06, 5.1074386195323e-07, 6.56670679654153e-07, 
5.83707270803692e-07, 6.56670679654153e-07, 1.45926817700923e-07, 
2.18890226551384e-07, 5.83707270803692e-07, 2.18890226551384e-07, 
1.45926817700923e-07, 0, 2.91853635401846e-07, 1.45926817700923e-07, 
2.18890226551384e-07, 0, 7.29634088504615e-08, 7.29634088504615e-08, 
7.29634088504615e-08, 7.29634088504615e-08, 0, 0, 1.45926817700923e-07, 
0, 0, 0, 7.29634088504615e-08, 7.29634088504615e-08, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 7.29634088504615e-08), mids = c(250, 
750, 1250, 1750, 2250, 2750, 3250, 3750, 4250, 4750, 5250, 5750, 
6250, 6750, 7250, 7750, 8250, 8750, 9250, 9750, 10250, 10750, 
11250, 11750, 12250, 12750, 13250, 13750, 14250, 14750, 15250, 
15750, 16250, 16750, 17250, 17750, 18250, 18750, 19250, 19750, 
20250, 20750, 21250, 21750, 22250, 22750, 23250, 23750, 24250, 
24750, 25250, 25750, 26250, 26750, 27250, 27750, 28250, 28750, 
29250, 29750, 30250, 30750, 31250, 31750, 32250, 32750, 33250, 
33750, 34250, 34750, 35250, 35750, 36250, 36750, 37250, 37750, 
38250), xname = "nvertices(fib.twigs5.dps)", equidist = TRUE), class = "histogram")
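As a rough worked example of why object size matters, the figures above can be turned into a per-vertex memory cost and compared with the 2.32 GB object. This is only back-of-the-envelope arithmetic under stated assumptions: that 2.32 GB is an in-memory size in decimal gigabytes (the thread does not confirm this), and that both objects store roughly the same fields per point.

```r
# Figures quoted above for the reference object
ref_bytes   <- 1594104384   # object.size(fib.twigs5.dps)
ref_neurons <- 27411        # length(fib.twigs5.dps)
ref_mean_nv <- 936.0071     # mean(nvertices(fib.twigs5.dps))

# Average memory cost per neuron and per vertex in the reference object
ref_bytes_per_neuron <- ref_bytes / ref_neurons
ref_bytes_per_vertex <- ref_bytes_per_neuron / ref_mean_nv

# ASSUMPTION: 2.32 GB is the in-memory footprint, in decimal GB
user_bytes <- 2.32e9

# How much larger the second object is, and how many points that would
# imply in total if the per-vertex cost were the same as in the reference
size_ratio             <- user_bytes / ref_bytes
implied_total_vertices <- user_bytes / ref_bytes_per_vertex

cat("bytes per vertex (reference):", round(ref_bytes_per_vertex, 1), "\n")
cat("size ratio (2.32 GB vs reference):", round(size_ratio, 2), "\n")
cat("implied total vertices:", round(implied_total_vertices), "\n")
```

If the larger object turns out to contain no more neurons than the reference, the extra size must come from more points per neuron or from extra fields per point, which is the case where both CPU time and memory per parallel job go up.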

jefferis · 2020-11-16T17:30:47Z

Finally it would be worth checking a few specific neurons (here identified by bodyid):

> n20=sample(names(fib.twigs5.dps), 20)
> nvertices(fib.twigs5.dps)[n20]
 794225197 1288172070  974456295  886643729 1170749805  297580512 1903866591 
      1942       1148       1007        922        195        903       1084 
 796401179 2073408606  642720190 5813039999  600011602 1890237470 5813000419 
      2071         70        756        648        680         13        914 
1007256095  639875892 1010338583 1471778506  796940742  541127846 
       940       1479        676        590        434       3786

You can do this by:

n20=c("794225197", "1288172070", "974456295", "886643729", "1170749805", 
"297580512", "1903866591", "796401179", "2073408606", "642720190", 
"5813039999", "600011602", "1890237470", "5813000419", "1007256095", 
"639875892", "1010338583", "1471778506", "796940742", "541127846")
nvertices(dps_flyEM[n20])

lankiszhang · 2020-11-17T08:59:40Z

Dear Greg,

Thank you for your fast and kind reply!

I ran some tests on our server yesterday. It seems that running on four cores was faster than running on either a single core or 16 cores.

Thank you for the important hint that my dotprops object is larger than necessary. I tried the test code but got an error from nvertices():
> nvertices(dps_JRC2018U)
Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent

Looking back at my dps object, it is a large list consisting of 21663 dotprops lists. Each dotprops list contains four elements: "points", "alpha", "vect" and "labels". "alpha" and "labels" have type "double [898]", while "points" and "vect" have type "double [898 x 3]".

It seems that I really need to regenerate my dps object. I will try to do it on our Ubuntu server with the Docker image or natmanager, as you suggested before. Would you mind giving me some suggestions for the regeneration, based on my current problematic dps object? Thank you in advance!

Best wishes,
Jiajun Zhang

jefferis · 2021-04-10T10:30:18Z

@lankiszhang I'm sorry I missed replying to this at the time. Your dotprops appear to be in a plain list object rather than a neuronlist. If you do as.neuronlist on it, then nvertices should work again.
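A minimal sketch of that conversion, assuming the nat package is installed. The kcs20 sample data is used here as a stand-in for dps_JRC2018U, and unclass() is only a way of simulating a plain list of dotprops for the example.

```r
library(nat)

# Simulate the situation: a plain list of dotprops without the neuronlist class.
plain <- unclass(kcs20)
stopifnot(!inherits(plain, "neuronlist"))

# Restore the neuronlist class so that neuronlist methods are dispatched again.
nl <- as.neuronlist(plain)
class(nl)

# Per-neuron vertex counts, as in the stats requested earlier in the thread.
nv <- nvertices(nl)
summary(nv)
```

With the real data this would be something like dps_JRC2018U <- as.neuronlist(dps_JRC2018U), followed by the same nvertices() summaries.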

As for your benchmarks, it is hard to interpret them without knowing how many neurons you are using or how many vertices there are in the dotprops objects. If you are using relatively few neurons, then the overhead of forking and setting up the parallel environment will be large compared with the time saved by running in parallel. This is likely why 4 cores already gives you only a 2x speed-up over one core but is actually slightly faster than 16. At the other end, as discussed earlier, if you have many neurons, then running on many cores will cause problems due to memory issues.
