- On this page, press Ctrl+F (Windows) or Command+F (Mac).
- Enter the keyword you want to search for.
- Open each paper via its link.

Author | Title | Proceeding | Link |
---|---|---|---|
Jiatong Li, et al. | PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations | NeurIPS 2024 | https://arxiv.org/abs/2405.19740 |
Jingnan Zheng, et al. | ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2405.14125 |
Jinhao Duan, et al. | GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations | NeurIPS 2024 | https://arxiv.org/abs/2402.12348 |
Felipe Maia Polo, et al. | Efficient multi-prompt evaluation of LLMs | NeurIPS 2024 | https://arxiv.org/abs/2405.17202 |
Fan Lin, et al. | IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation | NeurIPS 2024 | https://arxiv.org/abs/2409.18892 |
Jinjie Ni, et al. | MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures | NeurIPS 2024 | https://nips.cc/virtual/2024/poster/96545 |
Percy Liang, et al. | Holistic Evaluation of Language Models | TMLR | https://arxiv.org/abs/2211.09110 |
Felipe Maia Polo, et al. | tinyBenchmarks: evaluating LLMs with fewer examples | ICML 2024 | https://openreview.net/forum?id=qAml3FpfhG |
Miltiadis Allamanis, et al. | Unsupervised Evaluation of Code LLMs with Round-Trip Correctness | ICML 2024 | https://icml.cc/virtual/2024/poster/33761 |
Wei-Lin Chiang, et al. | Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference | ICML 2024 | https://arxiv.org/abs/2403.04132 |
Yonatan Oren, et al. | Proving Test Set Contamination in Black-Box Language Models | ICLR 2024 | https://arxiv.org/abs/2310.17623 |
Kaijie Zhu, et al. | DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks | ICLR 2024 | https://arxiv.org/abs/2309.17167 |
Seonghyeon Ye, et al. | FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | ICLR 2024 | https://openreview.net/forum?id=CYmF38ysDa |
Shahriar Golchin, et al. | Time Travel in LLMs: Tracing Data Contamination in Large Language Models | ICLR 2024 | https://openreview.net/forum?id=2Rwq6c3tvr |
Gati Aher, et al. | Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies | ICML 2023 | https://proceedings.mlr.press/v202/aher23a/aher23a.pdf |

Author | Title | Proceeding | Link |
---|---|---|---|
Dan Hendrycks, et al. | Measuring Massive Multitask Language Understanding | ICLR 2021 | https://arxiv.org/abs/2009.03300 |
Yuzhen Huang, et al. | C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models | NeurIPS 2023 | https://arxiv.org/abs/2305.08322 |
Zhexin Zhang, et al. | SafetyBench: Evaluating the Safety of Large Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.830/ |
Haoran Li, et al. | PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models | ACL 2024 | https://aclanthology.org/2024.acl-long.4/ |

Author | Title | Proceeding | Link |
---|---|---|---|
Yupeng Chang, et al. | A Survey on Evaluation of Large Language Models | TIST | https://dl.acm.org/doi/full/10.1145/3641289 |
Zishan Guo, et al. | Evaluating Large Language Models: A Comprehensive Survey | Preprint (arXiv) | https://arxiv.org/abs/2310.19736 |
Zhuang Ziyu, et al. | Through the Lens of Core Competency: Survey on Evaluation of Large Language Models | CCL 2023 | https://aclanthology.org/2023.ccl-2.8/ |
Isabel O. Gallegos, et al. | Bias and Fairness in Large Language Models: A Survey | CL 2024 | https://aclanthology.org/2024.cl-3.8/ |