Hi everyone! For several months (since November 2022) I have been trying to run FL experiments with OpenFL and SGX. I have 4 SGX machines with these specifics: 4x bare-metal 8380 ICX systems, Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz. I decided to use one of them as the Aggregator and the other 3 as Collaborators. I started by training a ResNet18 on MNIST. Everything worked and I completed the experiment, but there is a problem: training time increases round by round. The first round takes roughly 3 minutes; the time keeps growing round after round, and after 100 rounds a single round took about 30 minutes!
I thought the problem was in OpenFL, so I profiled it with a Python profiler and also measured the time of the important functions with my own scripts. However, I did not find any slowdown caused by OpenFL.
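For reference, this is roughly the kind of instrumentation I used to time each round; `train_fn` below is just a placeholder for whatever callable performs a round (in my case the training task I wrapped), not a real OpenFL API name:

```python
import cProfile
import pstats
import time

def profile_round(train_fn, *args, **kwargs):
    """Run one training round under cProfile and report the slowest call sites."""
    start = time.perf_counter()
    profiler = cProfile.Profile()
    profiler.enable()
    result = train_fn(*args, **kwargs)       # placeholder for the actual round
    profiler.disable()
    elapsed = time.perf_counter() - start

    print(f"round wall-clock time: {elapsed:.1f} s")
    stats = pstats.Stats(profiler).sort_stats("cumulative")
    stats.print_stats(20)                    # 20 slowest entries by cumulative time
    return result
```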
So I started thinking that the problem was SGX. However, I do not have enough knowledge of SGX and its architecture to understand what the problem might be.
Then I decided to run simpler experiments: typical centralized deep learning experiments with MNIST as the dataset and ResNet18 as the neural network, on one of the SGX machines mentioned above. I ran 3 types of experiments:
1. Typical training: `python3 mnist.py`
2. Non-SGX Gramine: `gramine-direct ./pytorch mnist.py`
3. SGX Gramine: `gramine-sgx ./pytorch mnist.py`
I have followed the steps described in this PyTorch Gramine guide to run my Python script.
Below you can find the charts showing how training time grows "linearly".
[Charts: training time per epoch for (1) typical training, (2) non-SGX Gramine, (3) SGX Gramine]
Here you can find my Python script: pastebin
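In short, the script does standard supervised training of a ResNet18 on MNIST and prints the wall-clock time of every epoch (that is what the charts above plot). A simplified sketch with illustrative hyperparameters; the pastebin version is the exact script:

```python
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# MNIST is grayscale, so replicate the channel to match ResNet18's 3-channel input.
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

model = models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(200):
    start = time.perf_counter()
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    # The per-epoch time printed here is what the charts plot.
    print(f"epoch {epoch}: {time.perf_counter() - start:.1f} s")
```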
As you can see from these charts, the slowdown is always present. However, in the typical deep learning scenario it is negligible (less than 1 second of slowdown after 200 epochs), while with Gramine, even without SGX, the training time grows considerably epoch after epoch. So I think there is a problem between Gramine and PyTorch that needs to be fixed. I have already opened an issue on the official Gramine GitHub; here is the link.
I know that you are not the developers of Gramine, but my question is whether it is possible to investigate this strange behaviour further, and whether it is worth looking for libOS alternatives (that work well) that could replace Gramine.