# LLMEvaluation

We found that evaluating Vicuna and Llama 2 on an A100 versus a V100 produces different results, while other models such as Falcon do not show this discrepancy. The results are shown in the results figure.

## Experiment

The experiments were run on Google Colab Pro+.
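Since Colab may assign either an A100 or a V100 depending on availability, it is worth recording which card each run used. A minimal sketch, assuming PyTorch is available (as it is in Colab by default):

```python
# Record which GPU Colab assigned for this run.
import torch

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {name}, compute capability {major}.{minor}")

# Note: the V100 (capability 7.0) lacks native bfloat16 support, while the
# A100 (capability 8.0) has it; differing mixed-precision kernels are one
# possible source of the A100/V100 discrepancy noted above.
```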

## Model Evaluation

We use the following four LLM benchmarks to evaluate each model; a sketch of how to run them follows the list.

  1. HellaSwag (`hellaswag`): acc
  2. TruthfulQA (`truthfulqa_mc`): mc1, mc2
  3. ARC-Challenge (`arc_challenge`): acc
  4. MMLU (`hendrycksTest-*`): average acc across all subtasks
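
Below is a minimal sketch of running these benchmarks with the EleutherAI lm-evaluation-harness, whose older task names (`hellaswag`, `truthfulqa_mc`, `arc_challenge`, `hendrycksTest-*`) match the benchmarks above. The checkpoint and settings are illustrative assumptions, not necessarily the exact configuration used in this repository.

```python
# Sketch: evaluate one model on the four benchmarks with lm-evaluation-harness
# (pip install lm-eval). Assumes harness v0.3.x task naming.
from lm_eval import evaluator, tasks

# All MMLU (HendrycksTest) subtasks, to be averaged afterwards.
MMLU_TASKS = [t for t in tasks.ALL_TASKS if t.startswith("hendrycksTest-")]

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=lmsys/vicuna-7b-v1.3",  # assumed checkpoint
    tasks=["hellaswag", "truthfulqa_mc", "arc_challenge"] + MMLU_TASKS,
    device="cuda",
)["results"]

# Metrics listed above: acc for HellaSwag and ARC, mc1/mc2 for TruthfulQA.
print("hellaswag acc:", results["hellaswag"]["acc"])
print("truthfulqa mc1:", results["truthfulqa_mc"]["mc1"])
print("truthfulqa mc2:", results["truthfulqa_mc"]["mc2"])
print("arc_challenge acc:", results["arc_challenge"]["acc"])

# The MMLU score is the plain average of acc over all subtasks.
mmlu_avg = sum(results[t]["acc"] for t in MMLU_TASKS) / len(MMLU_TASKS)
print("MMLU average acc:", mmlu_avg)
```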