Hi everyone! For several months (since November 2022) I have been trying to run FL experiments with OpenFL and SGX. I have 4 SGX machines with these specifics: 4x bare-metal 8380 ICX systems, Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz. I decided to use one of them as the Aggregator and the other 3 as Collaborators. I started by training a ResNet18 on MNIST. Everything worked and I completed the experiment, but there is a problem: training time increases round by round. The first round takes roughly 3 minutes; the time keeps growing round after round, and after 100 rounds a single round took about 30 minutes!
I thought the problem was in OpenFL, so I profiled it with a Python profiler and also measured the time of the important functions with my own scripts. However, I did not find any slowdown caused by OpenFL.
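For reference, this is roughly the kind of instrumentation I used to time each round; `train_fn` below is just a placeholder for whatever callable performs a round (in my case the training task I wrapped), not a real OpenFL API name:

```python
import cProfile
import pstats
import time

def profile_round(train_fn, *args, **kwargs):
    """Run one training round under cProfile and report the slowest call sites."""
    start = time.perf_counter()
    profiler = cProfile.Profile()
    profiler.enable()
    result = train_fn(*args, **kwargs)       # placeholder for the actual round
    profiler.disable()
    elapsed = time.perf_counter() - start

    print(f"round wall-clock time: {elapsed:.1f} s")
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats(20)                    # 20 slowest entries by cumulative time
    return result
```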
So I started thinking that the problem was SGX. However, I do not have enough knowledge of SGX and its architecture to understand what the problem might be.
Then I decided to run simpler experiments: typical centralized deep learning experiments with MNIST as the dataset and ResNet18 as the neural network, on one of the SGX machines mentioned above. I ran 3 types of experiments:
1. Typical training: `python3 mnist.py`
2. Non-SGX Gramine: `gramine-direct ./pytorch mnist.py`
3. SGX Gramine: `gramine-sgx ./pytorch mnist.py`
I have followed the steps described in this PyTorch Gramine guide to run my Python script.
Below you can find the charts showing how training time grows "linearly".
[Charts: training time per epoch for (1) typical training, (2) non-SGX Gramine, (3) SGX Gramine]
Here you can find my Python script: pastebin
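In short, the script does standard supervised training of a ResNet18 on MNIST and prints the wall-clock time of every epoch (that is what the charts above plot). A simplified sketch with illustrative hyperparameters; the pastebin version is the exact script:

```python
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# MNIST is grayscale, so replicate the channel to match ResNet18's 3-channel input.
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(200):
    start = time.perf_counter()
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    # The per-epoch time printed here is what the charts plot.
    print(f"epoch {epoch}: {time.perf_counter() - start:.1f} s")
```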
As you can see from these charts, the slowdown is always present. However, in the typical deep learning scenario it is negligible (less than 1 second of slowdown after 200 epochs), while with Gramine, even without SGX, the training time grows considerably epoch after epoch. So I think there is a problem between Gramine and PyTorch that needs to be fixed. I have already opened an issue on the official Gramine GitHub; here is the link.
I know that you are not the developers of Gramine, but my question is whether it is possible to investigate this strange behaviour further, and whether it is worth looking for libOS alternatives (that work well) that could replace Gramine.