230807 AgentBench.md

https://arxiv.org/abs/2308.03688

AgentBench: Evaluating LLMs as Agents (Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang)

A benchmark has come out that tests LLMs on their ability to act as agents interacting with an environment. On this one, GPT-4 is simply dominant.

#benchmark