Runtime performance benchmarking results for various models on various hardware configurations.

Date Executed | Hardware | Training Set | Epoch | Precision | Batch Size | Max Source Length | Training Runtime (s) | Samples Per Second | Train Steps Per Second | Loss | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|
2023-09-05 | 1 x A100 80GB | Glue / RTE | 1 | bfloat16 | 6 | 4096 | 350 | 21.325 | 0.22 | 1.65 | 4096 is the context size for Llama2 |
2023-09-05 | 1 x A100 80GB | Glue / RTE | 1 | bfloat16 | 6 | 1024 | 350 | 21.333 | 0.22 | 1.65 | batch size of 7 fails with CUDA OOM |
2023-09-06 | 1 x A100 80GB | Glue / RTE | 1 | bfloat16 | 6 | 512 | 348 | 21.44 | 0.22 | 1.65 | batch size of 7 fails with CUDA OOM |
2023-09-05 | 1 x A100 80GB | Glue / RTE | 1 | bfloat16 | 8 | 256 | 356 | 20.939 | 0.16 | 1.70 | batch size of 9 fails with CUDA OOM |
2023-09-05 | 1 x A100 80GB | Glue / RTE | 1 | bfloat16 | 19 | 128 | 254 | 29.332 | 0.09 | 1.94 | batch size of 20 fails with CUDA OOM |
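
One quick sanity check on the table is to multiply Training Runtime by Samples Per Second for each row. The sketch below assumes that Samples Per Second is the total number of samples processed divided by the training runtime; the table does not state this definition, so treat it as an assumption. The helper name `samples_processed` is illustrative and not part of any benchmark script.

```python
# Rows copied from the table above:
# (training_runtime_s, samples_per_second, batch_size, max_source_length)
rows = [
    (350, 21.325, 6, 4096),
    (350, 21.333, 6, 1024),
    (348, 21.44, 6, 512),
    (356, 20.939, 8, 256),
    (254, 29.332, 19, 128),
]


def samples_processed(runtime_s, samples_per_second):
    """Approximate sample count, ASSUMING samples/s = samples / runtime."""
    return runtime_s * samples_per_second


for runtime_s, sps, batch_size, max_len in rows:
    total = samples_processed(runtime_s, sps)
    print(f"max_len={max_len:>4}  batch={batch_size:>2}  samples~{total:,.0f}")
```

Under that assumption every row works out to roughly 7,450 to 7,470 samples, which suggests the runs processed the same amount of data and that the runtime differences come from batch size and sequence length rather than from a different workload.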