Caikit NLP Runtime Performance Benchmarks

Runtime performance benchmark results for various models on various hardware configurations.

Llama2-7b

Date Executed	Hardware	Training Set	Epoch	Precision	Batch Size	Max Source Length	Training Runtime (s)	Samples Per Second	Train Steps Per Second	Loss	Notes
2023-09-05	1 x A100 80GB	Glue / RTE	1	bfloat16	6	4096	350	21.325	0.22	1.65	4096 is the context size for Llama2
2023-09-05	1 x A100 80GB	Glue / RTE	1	bfloat16	6	1024	350	21.333	0.22	1.65	batch size of 7 fails with CUDA OOM
2023-09-06	1 x A100 80GB	Glue / RTE	1	bfloat16	6	512	348	21.44	0.22	1.65	batch size of 7 fails with CUDA OOM
2023-09-05	1 x A100 80GB	Glue / RTE	1	bfloat16	8	256	356	20.939	0.16	1.70	batch size of 9 fails with CUDA OOM
2023-09-05	1 x A100 80GB	Glue / RTE	1	bfloat16	19	128	254	29.332	0.09	1.94	batch size of 20 fails with CUDA OOM
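
The runs in the Llama2-7b table can also be compared programmatically. The following is a minimal sketch and not part of Caikit NLP: the RUNS list hand-copies four columns from the table, and the helper names fastest_run and relative_throughput are hypothetical, chosen only for illustration.

```python
# Hand-copied from the Llama2-7b table above. Each tuple is:
# (max source length, batch size, training runtime in s, samples per second)
RUNS = [
    (4096, 6, 350, 21.325),
    (1024, 6, 350, 21.333),
    (512, 6, 348, 21.44),
    (256, 8, 356, 20.939),
    (128, 19, 254, 29.332),
]


def fastest_run(runs):
    """Return the run with the highest reported samples per second."""
    return max(runs, key=lambda run: run[3])


def relative_throughput(runs, baseline_index=0):
    """Return each run's samples per second divided by the baseline run's."""
    baseline = runs[baseline_index][3]
    return [round(run[3] / baseline, 2) for run in runs]


if __name__ == "__main__":
    max_len, batch_size, runtime_s, samples_per_s = fastest_run(RUNS)
    print(f"fastest: max source length {max_len}, batch size {batch_size}, "
          f"{samples_per_s} samples/s in {runtime_s} s")
    print("relative to the 4096 run:", relative_throughput(RUNS))
```

On these numbers, the run with a max source length of 128 and a batch size of 19 reports the highest samples per second (29.332), roughly 1.38 times the 4096 run. The other three runs stay within about 2% of that baseline.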