# LLMEvaluation

We found that evaluating Vicuna and Llama 2 on an A100 versus a V100 produces different results, while other models such as Falcon do not show this discrepancy. The results are shown in the results figure.

## Experiment

The experiments were run on Google Colab Pro+.
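Since Colab may assign either an A100 or a V100 depending on availability, it is worth recording which card each run used. A minimal sketch, assuming PyTorch is available (as it is in Colab by default):

```python
# Record which GPU Colab assigned for this run.
import torch

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {name}, compute capability {major}.{minor}")

# Note: the V100 (capability 7.0) lacks native bfloat16 support, while the
# A100 (capability 8.0) has it; differing mixed-precision kernels are one
# possible source of the A100/V100 discrepancy noted above.
```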

## Model Evaluation

We use the following four LLM benchmarks to evaluate each model; a sketch of how to run them follows the list.

  1. HellaSwag (`hellaswag`): acc
  2. TruthfulQA (`truthfulqa_mc`): mc1, mc2
  3. ARC-Challenge (`arc_challenge`): acc
  4. MMLU (`hendrycksTest-*`): average acc across all subtasks
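
Below is a minimal sketch of running these benchmarks with the EleutherAI lm-evaluation-harness, whose older task names (`hellaswag`, `truthfulqa_mc`, `arc_challenge`, `hendrycksTest-*`) match the benchmarks above. The checkpoint and settings are illustrative assumptions, not necessarily the exact configuration used in this repository.

```python
# Sketch: evaluate one model on the four benchmarks with lm-evaluation-harness
# (pip install lm-eval). Assumes harness v0.3.x task naming.
from lm_eval import evaluator, tasks

# All MMLU (HendrycksTest) subtasks, to be averaged afterwards.
MMLU_TASKS = [t for t in tasks.ALL_TASKS if t.startswith("hendrycksTest-")]

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=lmsys/vicuna-7b-v1.3",  # assumed checkpoint
    tasks=["hellaswag", "truthfulqa_mc", "arc_challenge"] + MMLU_TASKS,
    device="cuda",
)["results"]

# Metrics listed above: acc for HellaSwag and ARC, mc1/mc2 for TruthfulQA.
print("hellaswag acc:", results["hellaswag"]["acc"])
print("truthfulqa mc1:", results["truthfulqa_mc"]["mc1"])
print("truthfulqa mc2:", results["truthfulqa_mc"]["mc2"])
print("arc_challenge acc:", results["arc_challenge"]["acc"])

# The MMLU score is the plain average of acc over all subtasks.
mmlu_avg = sum(results[t]["acc"] for t in MMLU_TASKS) / len(MMLU_TASKS)
print("MMLU average acc:", mmlu_avg)
```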