We provide the code and scripts to replicate the Lotus experiment results on the CloudLab testbed using a c4130 node available in the Wisconsin cluster.
The following experiments target an Intel processor with 4x V100 GPUs. The experiments are performed on the ImageNet dataset for the Image Classification task. We focus on a single configuration for the experiments below because the same process can be applied to each of the other configurations. Please note that the figures generated by the experiments below correspond to one configuration of the figures found in the paper.
We have set up the software dependencies (CUDA, cuDNN, Intel VTune, Anaconda) and the ImageNet dataset on the c4130 node. The instructions for doing so are in the SETUP.md file.
- Clone this repository

      # Clone in below work directory because some scripts have absolute paths
      cd /mydata/iiswc24
      # Below command will take some time
      git clone --depth 1 --recurse-submodules https://github.com/rajveerb/lotus.git -b iiswc24ae
      cd lotus
- Create a conda environment

      conda create -n lotus python=3.10 -y
      conda activate lotus
- Install itt-python using build instructions below:

      pushd code/itt-python
      export ITT_LIBRARY_DIR=/opt/intel/oneapi/vtune/latest/lib64/
      export ITT_INCLUDE_DIR=/opt/intel/oneapi/vtune/latest/include
      python setup.py install
      # Check if installed
      pip list | grep "itt"
      popd
- Install PyTorch (LotusTrace):

      sudo apt install -y g++
      bash install_lotustrace.sh
      # Sanity check
      pip list | grep "torch" | grep "2.0.0a0"
- Install torchvision:

      bash install_torchvision.sh
      # Sanity check
      pip list | grep "torchvision" | grep "0.15.1a0"
- Install below packages:

      conda install ipykernel pandas=2.0.3 -y
      pip install matplotlib==3.9.0 natsort==8.4.0 seaborn==0.13.2
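The sanity checks in the install steps above can also be run in one go from Python. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` is hypothetical, the version strings are copied from the commands above, and we assume that a prefix match is sufficient (mirroring the `grep` checks, since source builds may carry a suffix after the version).

```python
from importlib import metadata

# Version prefixes copied from the install steps in this guide.
EXPECTED = {
    "torch": "2.0.0a0",
    "torchvision": "0.15.1a0",
    "pandas": "2.0.3",
    "matplotlib": "3.9.0",
    "natsort": "8.4.0",
    "seaborn": "0.13.2",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: installed_version_or_None} for each package that is
    missing (None) or whose version does not start with the expected prefix."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            mismatches[name] = None
            continue
        if not installed.startswith(wanted):
            mismatches[name] = installed
    return mismatches


# Usage inside the activated lotus environment:
#   for name, found in find_mismatches(EXPECTED).items():
#       print(f"{name}: expected {EXPECTED[name]}*, found {found}")
```

An empty result means every listed package is installed with the expected version prefix.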
- Get the mapping logs for the preprocessing operations:

      # Activate VTune; this command fails with an error if VTune is already activated
      source /opt/intel/oneapi/setvars.sh
      # Sanity check
      vtune --version
      bash code/image_classification/LotusMap/Intel/LotusMap.sh
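Before starting the longer runs, it can help to confirm that the tools and directories this guide relies on are visible from the current shell. The sketch below is an illustration under assumptions: the helper `preflight` is hypothetical, the tool names and directories are taken from the commands in this guide, and we assume the script is launched from the same shell in which `setvars.sh` was sourced.

```python
import os
import shutil

# Commands and directories referenced in this guide.
REQUIRED_TOOLS = ["g++", "vtune", "vtune-backend"]
REQUIRED_DIRS = [
    "/opt/intel/oneapi/vtune/latest/lib64/",
    "/opt/intel/oneapi/vtune/latest/include",
]


def preflight(tools=REQUIRED_TOOLS, dirs=REQUIRED_DIRS,
              which=shutil.which, isdir=os.path.isdir):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = [f"command not on PATH: {t}" for t in tools if which(t) is None]
    problems += [f"directory not found: {d}" for d in dirs if not isdir(d)]
    return problems


# Usage:
#   for problem in preflight():
#       print(problem)
```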
- Generate JSON file with mapping info by running all cells in code/image_classification/LotusMap/Intel/logsToMapping.ipynb
- You have successfully obtained the mapping (mapping_funcs.json) using LotusMap (Table 1)!
- Run the Image Classification pipeline experiment, in which the batch size and the number of GPUs are varied and LotusTrace is enabled:

      bash scripts/cloudlab/LotusTrace_imagenet.sh

  Note: the number of DataLoader workers is equal to the number of GPUs in this experiment.
- Run the commands below for the observations in "High variance in Preprocessing Time" for Fig 4 (a) and the statistics:

      python code/image_classification/analysis/LotusTrace_imagenet_vary_batch_and_gpu/preprocessing_time_stats.py \
        --remove_outliers \
        --data_dir lotustrace_result/b512_gpu4/ \
        --output_file lotustrace_result/preprocessing_time_stats.log
      python code/image_classification/analysis/LotusTrace_imagenet_vary_batch_and_gpu/box_plot_preprocessing_time.py \
        --remove_outliers \
        --data_dir lotustrace_result/b512_gpu4 \
        --output_file lotustrace_result/box_plot_preprocessing_time.png
- Run the below commands for observations in "Significant wait time" for Fig 4 (b), (c) and the statistics:

      python code/image_classification/analysis/LotusTrace_imagenet_vary_batch_and_gpu/delay_and_wait_time_stats_and_plot.py \
        --sort_criteria duration \
        --data_dir lotustrace_result/b512_gpu4 \
        --fig_dir lotustrace_result/figures \
        --output_file lotustrace_result/delay_and_wait_time_stats_and_plot.log
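The result directories used in this guide (for example b512_gpu4 and b1024_gpu4_dataloader20) appear to encode the batch size, the number of GPUs, and optionally the number of DataLoader workers. The sketch below parses such a name so that a mistyped `--data_dir` (for instance a missing `b` prefix) is caught before a long analysis run. The naming pattern is inferred from the paths in this guide rather than from the scripts themselves, and `parse_config_dir` is a hypothetical helper.

```python
import os
import re

# Pattern inferred from directory names in this guide (an assumption):
#   b<batch size>_gpu<number of GPUs>[_dataloader<number of workers>]
_CONFIG_RE = re.compile(
    r"^b(?P<batch>\d+)_gpu(?P<gpus>\d+)(?:_dataloader(?P<workers>\d+))?$"
)


def parse_config_dir(path):
    """Split a result directory name into its configuration fields."""
    name = os.path.basename(os.path.normpath(path))
    match = _CONFIG_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected result directory name: {name!r}")
    workers = match.group("workers")
    return {
        "batch_size": int(match.group("batch")),
        "gpus": int(match.group("gpus")),
        "dataloader_workers": int(workers) if workers is not None else None,
    }


# Usage:
#   parse_config_dir("lotustrace_result/b512_gpu4/")
```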
- Run the visualization script for Fig 2:

      python code/visualize_LotusTrace/visualization_augmenter.py \
        --coarse \
        --lotustrace_trace_dir lotustrace_result/b512_gpu4 \
        --custom_log_prefix lotustrace_log \
        --output_lotustrace_viz_file lotustrace_result/viz_file.lotustrace
  Open the file in the Chrome trace viewer for visualization (navigate to the chrome://tracing URL in Google Chrome, upload the viz_file.lotustrace file, and visualize the trace).
- Run the command below for the Image Classification pipeline to generate hardware performance numbers for Fig 5:

      source /opt/intel/oneapi/setvars.sh
      bash scripts/cloudlab/LotusTrace_imagenet_vtune.sh
- Follow the steps below to get a CSV of hardware performance numbers (these steps have to be performed manually):

      # The command below prints a link; open it in a browser window and log in to the VTune GUI (set the password to anything you like)
      vtune-backend --web-port 8080 --data-directory ./vtune_mem_access_vary_dataloader/b1024_gpu4_dataloader20
  - Navigate to the Microarchitecture Exploration tab
  - Perform grouping by Source Function / Function / Call Stack
  - Select all cells, copy them, and paste them into a CSV file called code/image_classification/analysis/combine_lotus/lotustrace_uarch/b1024_gpu4_dataloader20.csv
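Because the CSV above is produced by a manual copy and paste, a quick structural check can catch a truncated or partially pasted table before plotting. The sketch below is an illustration only: `check_pasted_table` is a hypothetical helper, the delimiter is guessed from the header line (we do not know which delimiter the VTune GUI places on the clipboard), and the column names in the usage comment are made up.

```python
import csv


def check_pasted_table(text, candidates=("\t", ",", ";")):
    """Guess the delimiter from the header line and report rows whose
    column count differs from the header."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("table is empty")
    # Pick the candidate that occurs most often in the header line.
    delimiter = max(candidates, key=lines[0].count)
    rows = list(csv.reader(lines, delimiter=delimiter))
    header, body = rows[0], rows[1:]
    # Positions are 1-based among the non-blank lines; the header is line 1.
    ragged = [i for i, row in enumerate(body, start=2) if len(row) != len(header)]
    return {
        "delimiter": delimiter,
        "columns": header,
        "rows": len(body),
        "ragged_lines": ragged,
    }


# Usage (hypothetical column names):
#   check_pasted_table("Function\tCPU Time\nfoo\t1.5\n")
# or, for the file from the step above:
#   with open("code/image_classification/analysis/combine_lotus/"
#             "lotustrace_uarch/b1024_gpu4_dataloader20.csv") as f:
#       print(check_pasted_table(f.read()))
```

A non-empty `ragged_lines` list suggests that the selection was cut off or that a cell contains the delimiter character.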
- Plot Fig 5 (a) by running the code/image_classification/analysis/combine_lotus/elapsed_time_plot.ipynb notebook. Check out the plot at the bottom of the notebook.
- Plot Fig 5 (b) by running the code/image_classification/analysis/combine_lotus/per_python_func_plot_vary_dataloaders.ipynb notebook. Check out the plot at the bottom of the notebook.
- Plot Fig 5 (c) by running the command below:

      python code/image_classification/analysis/combine_lotus/hw_event_analyzer.py \
        --mapping_file code/image_classification/LotusMap/Intel/mapping_funcs.json \
        --uarch_dir code/image_classification/analysis/combine_lotus/lotustrace_uarch \
        --combined_hw_events code/image_classification/analysis/combine_lotus/combined_lotustrace_uarch.csv \
        --cpp_hw_events_plot_dir code/image_classification/analysis/combine_lotus/cpp_hw_events_figs

  Check out the code/image_classification/analysis/combine_lotus/cpp_hw_events_figs directory for the plots.
- Plot Fig 5 (e)-(h) by running the code/image_classification/analysis/combine_lotus/c_to_python_analyser.ipynb notebook. Check out the plots in the code/image_classification/analysis/combine_lotus/mapped_python_figs directory.
- That completes the experiment for LotusTrace on the ImageNet dataset for the Image Classification task!
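As a final check, the output files named in the steps above can be verified in one pass. The sketch below is a convenience under assumptions: `missing_outputs` is a hypothetical helper, the paths are copied from the commands in this guide, and we assume all of them are relative to the repository root.

```python
import os

# Output paths named in this guide, assumed relative to the repository root.
EXPECTED_OUTPUTS = [
    "code/image_classification/LotusMap/Intel/mapping_funcs.json",
    "lotustrace_result/preprocessing_time_stats.log",
    "lotustrace_result/box_plot_preprocessing_time.png",
    "lotustrace_result/delay_and_wait_time_stats_and_plot.log",
    "lotustrace_result/viz_file.lotustrace",
    "code/image_classification/analysis/combine_lotus/lotustrace_uarch/b1024_gpu4_dataloader20.csv",
    "code/image_classification/analysis/combine_lotus/combined_lotustrace_uarch.csv",
]


def missing_outputs(root, expected=EXPECTED_OUTPUTS, exists=os.path.exists):
    """Return the expected output paths that do not exist under root."""
    return [rel for rel in expected if not exists(os.path.join(root, rel))]


# Usage from the repository root:
#   for path in missing_outputs("."):
#       print("missing:", path)
```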