As we move to larger simulations with Genesis, we need to think about improving the performance of the model and making it maintainable and useful into the future.
Performance
Our number one stumbling block here is domain decomposition. We currently parallelize the model using MPI alone, taking advantage of the forest information to split the computation across cores. This has some benefits:
- There is no need to transfer halos/galaxies between cores. We always know that all of the halos we need to evolve a galaxy are available locally.
- We don't need to worry about mutexes, cache collisions, and all those other fun thread-related issues when we are connecting up galaxies and halos (which requires lots of random RAM reads and writes).
However, it has one major drawback: our ability to evenly distribute work amongst cores is limited by the imbalance in the input forest sizes. For example, with Tiamat there is one forest which is >30x the size of the next largest. That means there is no benefit from running the model on more than ~32 cores: one core gets the largest forest and all the other cores end up waiting on it. We therefore have to rework the domain decomposition.
Domain decomposition options
A few ideas of how we could go about changing things:
- We could move to an OpenMP + MPI hybrid model. Our decomposition would then be much coarser (typically the number of nodes rather than the number of cores). The main problem here would be how we split up connecting galaxies and halos between threads. Perhaps we could do this by forest, but wouldn't we then run into a similar issue to the one we already have?
- We could dynamically load balance the forests every N snapshots. In Tiamat, the largest forest has lots of halos at high redshift and relatively few at low redshift. Conversely, there are forests with no halos at high redshift and lots at low redshift. Instead of decomposing by the total number of halos in each forest, we could decompose by the number of halos in each forest over the next, say, 10 snapshots. After those snapshots are up, we could reassess and shuffle the forests around between cores (see the sketch after this list).
- Others...?
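To make the windowed rebalancing idea concrete, here is a minimal greedy (longest-processing-time-first) sketch: estimate each forest's work over the next window of snapshots, then hand the heaviest forests to the least-loaded ranks. The `forest_t` layout, `WINDOW`, and `N_RANKS` are illustrative assumptions, not actual Meraxes structures:

```c
/* Windowed greedy load balancing sketch (LPT-first assignment).
 * All names here are illustrative, not actual Meraxes code. */
#include <stdlib.h>

#define N_RANKS 64
#define WINDOW  10   /* re-decompose every 10 snapshots */

typedef struct {
    int  id;         /* assumed to run 0..n_forests-1 */
    int  n_snaps;
    int *n_halos;    /* halo count of this forest at each snapshot */
    long cost;       /* scratch: work estimate for the current window */
} forest_t;

/* Work estimate for one forest over snapshots [snap, snap + WINDOW) */
static long window_cost(const forest_t *f, int snap)
{
    long cost = 0;
    for (int ii = snap; ii < snap + WINDOW && ii < f->n_snaps; ii++)
        cost += f->n_halos[ii];
    return cost;
}

static int cmp_cost_desc(const void *a, const void *b)
{
    long ca = ((const forest_t *)a)->cost;
    long cb = ((const forest_t *)b)->cost;
    return (ca < cb) - (ca > cb);   /* descending */
}

/* Assign each forest to the currently least-loaded rank, heaviest first.
 * rank_of[id] receives the rank owning that forest for this window. */
void rebalance(forest_t *forests, int n_forests, int snap, int *rank_of)
{
    long load[N_RANKS] = { 0 };

    for (int ii = 0; ii < n_forests; ii++)
        forests[ii].cost = window_cost(&forests[ii], snap);

    qsort(forests, n_forests, sizeof(forest_t), cmp_cost_desc);

    for (int ii = 0; ii < n_forests; ii++) {
        int best = 0;
        for (int jj = 1; jj < N_RANKS; jj++)
            if (load[jj] < load[best])
                best = jj;
        load[best] += forests[ii].cost;
        rank_of[forests[ii].id] = best;
    }
}
```

The naive least-loaded scan makes each rebalance O(n_forests × N_RANKS); a min-heap over rank loads would tighten that if it ever matters.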
Other areas of improvement
Memory model
Meraxes currently uses a linked list to store galaxies. This is very flexible and has all the benefits that come with a linked list. However, it also has all of the drawbacks, namely poor cache locality. Given that galaxies merge in a hierarchical fashion and that we are almost always jumping around from the 10th galaxy to the 10643rd etc., this may not be an issue. However, if we do consider moving to OpenMP + MPI, we might benefit from a better memory model. I am particularly fond of drawing inspiration from the gaming industry (which has many of the same issues) and using something like a slotmap for this. See here, for example.
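For illustration, here is a minimal sketch of the idea as a generational arena (a simplified slotmap: stable handles via generation counters, without the swap-with-last compaction of the full dense variant). The `galaxy_t` fields and the fixed capacity are placeholders:

```c
/* Simplified slotmap / generational arena sketch for galaxy storage.
 * Galaxies live in one contiguous array (cache-friendly sweeps), while
 * handles stay valid, or detectably stale, across removals. */
#include <assert.h>
#include <stdint.h>

#define MAX_GALAXIES 1024   /* placeholder capacity */

typedef struct { float stellar_mass; /* ... */ } galaxy_t;

typedef struct { uint32_t index; uint32_t generation; } handle_t;

typedef struct {
    galaxy_t items[MAX_GALAXIES];       /* dense storage */
    uint32_t generation[MAX_GALAXIES];  /* bumped on free */
    uint32_t free_list[MAX_GALAXIES];   /* stack of free slots */
    uint32_t n_free;
} slotmap_t;

void slotmap_init(slotmap_t *sm)
{
    sm->n_free = MAX_GALAXIES;
    for (uint32_t ii = 0; ii < MAX_GALAXIES; ii++) {
        sm->free_list[ii] = MAX_GALAXIES - 1 - ii;  /* pop low slots first */
        sm->generation[ii] = 0;
    }
}

handle_t slotmap_insert(slotmap_t *sm, galaxy_t gal)
{
    assert(sm->n_free > 0);             /* caller must respect capacity */
    uint32_t idx = sm->free_list[--sm->n_free];
    sm->items[idx] = gal;
    return (handle_t){ idx, sm->generation[idx] };
}

/* Returns NULL if the handle is stale (slot was freed and reused). */
galaxy_t *slotmap_get(slotmap_t *sm, handle_t h)
{
    if (sm->generation[h.index] != h.generation)
        return NULL;
    return &sm->items[h.index];
}

void slotmap_remove(slotmap_t *sm, handle_t h)
{
    if (sm->generation[h.index] != h.generation)
        return;                         /* already freed */
    sm->generation[h.index]++;          /* invalidate outstanding handles */
    sm->free_list[sm->n_free++] = h.index;
}
```

The win over a linked list is that iterating over all galaxies becomes a linear sweep of one array, while any handles held by halos either stay valid or fail loudly rather than dangling.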
GPU code
It's a mess. The current implementation works and provides some performance improvement, but it is far from ideal. There is lots of scope to overlap data transfer with computation, to remove transfers of data back and forth between the GPU and CPU (which are very expensive), and to simplify the error handling.
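As a sketch of what the overlap could look like, here is host-side C using CUDA streams to pipeline host-to-device copies, kernel execution, and device-to-host copies. `run_chunk_kernel` is a hypothetical wrapper for whatever kernel we actually launch, and the chunking assumes the work is independent per chunk:

```c
/* Overlapping H2D copies, kernel work, and D2H copies with CUDA streams.
 * run_chunk_kernel() is a hypothetical stand-in for the real kernel launch. */
#include <cuda_runtime.h>

#define N_STREAMS 4

/* Hypothetical wrapper: launches the physics kernel for one chunk on `stream` */
void run_chunk_kernel(float *d_buf, size_t n, cudaStream_t stream);

void process_overlapped(const float *h_in, float *h_out, size_t n_total)
{
    cudaStream_t streams[N_STREAMS];
    float *d_buf[N_STREAMS];
    size_t chunk = (n_total + N_STREAMS - 1) / N_STREAMS;

    for (int ii = 0; ii < N_STREAMS; ii++) {
        cudaStreamCreate(&streams[ii]);
        cudaMalloc((void **)&d_buf[ii], chunk * sizeof(float));
    }

    /* NB: copies only truly run async if h_in/h_out are pinned host memory
     * (allocated with cudaMallocHost). */
    for (int ii = 0; ii < N_STREAMS; ii++) {
        size_t offset = (size_t)ii * chunk;
        size_t count  = offset + chunk > n_total ? n_total - offset : chunk;
        cudaMemcpyAsync(d_buf[ii], h_in + offset, count * sizeof(float),
                        cudaMemcpyHostToDevice, streams[ii]);
        run_chunk_kernel(d_buf[ii], count, streams[ii]);
        cudaMemcpyAsync(h_out + offset, d_buf[ii], count * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[ii]);
    }

    for (int ii = 0; ii < N_STREAMS; ii++) {
        cudaStreamSynchronize(streams[ii]);
        cudaStreamDestroy(streams[ii]);
        cudaFree(d_buf[ii]);
    }
}
```

With four streams, chunk `k+1`'s upload can proceed while chunk `k` computes and chunk `k-1` downloads, hiding most of the transfer cost behind the kernels.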
Documentation
Balu has made a start on this, but we really, really need to document the code fully moving forward. It would make life easier for everyone (after the initial pain of writing the docs!).
Test suite
Testing a model like this is tough. There are lots of coupled parts, and any kind of unit testing would require a whole whack of mocking to go along with it. However, as a minimum, we should package up a small SWIFT+Velociraptor+TreeFrog simulation and use it for regression testing (a sketch of what such a check could look like is below).
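A regression check can be as simple as reading one galaxy property from a fresh run and a stored reference output and comparing within a tolerance. The file and dataset names below are illustrative; the real output layout may differ:

```c
/* Minimal HDF5 regression check: compare one property between a new run
 * and a stored reference.  File/dataset names are illustrative. */
#include <hdf5.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double *read_dataset(const char *fname, const char *dset, hsize_t *n)
{
    hid_t file  = H5Fopen(fname, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t ds    = H5Dopen2(file, dset, H5P_DEFAULT);
    hid_t space = H5Dget_space(ds);
    *n = (hsize_t)H5Sget_simple_extent_npoints(space);

    double *buf = malloc(*n * sizeof(double));
    H5Dread(ds, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Sclose(space);
    H5Dclose(ds);
    H5Fclose(file);
    return buf;
}

int main(void)
{
    hsize_t n_new, n_ref;
    double *new_vals = read_dataset("meraxes_new.hdf5", "StellarMass", &n_new);
    double *ref_vals = read_dataset("meraxes_ref.hdf5", "StellarMass", &n_ref);

    if (n_new != n_ref) {
        fprintf(stderr, "FAIL: galaxy counts differ (%llu vs %llu)\n",
                (unsigned long long)n_new, (unsigned long long)n_ref);
        return 1;
    }

    for (hsize_t ii = 0; ii < n_new; ii++) {
        /* relative tolerance, with a small floor so zeros compare sanely */
        if (fabs(new_vals[ii] - ref_vals[ii]) >
            1e-6 * fabs(ref_vals[ii]) + 1e-12) {
            fprintf(stderr, "FAIL: mismatch at galaxy %llu\n",
                    (unsigned long long)ii);
            return 1;
        }
    }

    printf("PASS: %llu values match within tolerance\n",
           (unsigned long long)n_new);
    free(new_vals);
    free(ref_vals);
    return 0;
}
```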
Miscellaneous
Last but not least, over the years Meraxes has grown into quite the little collection of hacks and patch jobs. Some of this has been by necessity (to ensure we can get on with the science of interest in a timely fashion, or due to evolving input merger trees, etc.); however, many of these kludges need to be fixed and made robust: e.g. passing around structs as raw bytes, outputting tables instead of separate datasets for each galaxy property, and using our own input file format instead of something like TOML.
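On the raw-bytes point specifically: where these structs cross MPI boundaries, an MPI derived datatype makes the layout explicit and portable instead of relying on identical padding everywhere. A sketch, with illustrative `galaxy_t` fields rather than the actual Meraxes struct:

```c
/* Building an MPI derived datatype for a struct, instead of shipping it
 * as raw bytes.  The galaxy_t fields here are illustrative. */
#include <mpi.h>
#include <stddef.h>

typedef struct {
    double stellar_mass;
    double cold_gas;
    int    halo_id;
} galaxy_t;

MPI_Datatype make_galaxy_type(void)
{
    MPI_Datatype type, resized;
    int          block_lens[3] = { 1, 1, 1 };
    MPI_Aint     offsets[3] = {
        offsetof(galaxy_t, stellar_mass),
        offsetof(galaxy_t, cold_gas),
        offsetof(galaxy_t, halo_id),
    };
    MPI_Datatype field_types[3] = { MPI_DOUBLE, MPI_DOUBLE, MPI_INT };

    MPI_Type_create_struct(3, block_lens, offsets, field_types, &type);

    /* Account for trailing padding so arrays of galaxy_t stride correctly. */
    MPI_Type_create_resized(type, 0, sizeof(galaxy_t), &resized);
    MPI_Type_free(&type);
    MPI_Type_commit(&resized);
    return resized;
}
```

An array of `n` galaxies can then be sent as `(buf, n, galaxy_type)` rather than `n * sizeof(galaxy_t)` worth of `MPI_BYTE`.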