RESEARCH USE ONLY✅ NO COMMERCIAL USE ALLOWED❌
Benchmarking LLMs' Psychological Portrayal.
UPDATES
[Jan 16 2024]: PsychoBench is accepted to ICLR 2024 Oral (1.2%)
[Dec 28 2023]: Add support for 16personalities.com
✨An example run:
python run_psychobench.py \
--model gpt-3.5-turbo \
--questionnaire EPQ-R \
--openai-key "<openai_api_key>" \
--shuffle-count 1 \
--test-count 2
✨An example result:
Category | gpt-4 (n = 10) | Male (n = 693) | Female (n = 878) |
---|---|---|---|
Extraversion | 13.9 | 12.5 | 14.1 |
Psychoticism | 17.8 | 7.2 | 5.7 |
Neuroticism | 3.9 | 10.5 | 12.5 |
Lying | 7.0 | 7.1 | 6.9 |
- --questionnaire: (Required) Select the questionnaire(s) to run. For the available choices, see the list below.
- --model: (Required) The name of the model to test.
- --shuffle-count: (Required) Number of different question orders. If set to zero, only the original order is run. If set to n > 0, the original order is run along with n permutations of it. Defaults to zero.
- --test-count: (Required) Number of runs for each order. Defaults to one.
- --name-exp: Name of this run, used to name the result files.
- --significance-level: The significance level for testing the difference of means between humans and the LLM. Defaults to 0.01.
- --mode: For debugging. Selects which part of the code to run.
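As a rough illustration of how --shuffle-count and --test-count combine, the sketch below enumerates the runs they imply. The function name planned_runs and the (order, test) numbering are assumptions made for this example, not part of PsychoBench.

```python
from typing import List, Tuple


def planned_runs(shuffle_count: int, test_count: int) -> List[Tuple[int, int]]:
    """Enumerate (order, test) pairs implied by the two flags.

    Assumption for illustration: order 1 is the original question order and
    orders 2..shuffle_count+1 are its permutations; tests are numbered from 0.
    """
    if shuffle_count < 0 or test_count < 1:
        raise ValueError("shuffle_count must be >= 0 and test_count >= 1")
    return [
        (order, test)
        for order in range(1, shuffle_count + 2)  # original order plus n permutations
        for test in range(test_count)             # repeated runs of the same order
    ]


# The example run above uses --shuffle-count 1 --test-count 2.
runs = planned_runs(shuffle_count=1, test_count=2)
print(len(runs), runs)
```

Under these assumptions, the total number of runs per questionnaire is (shuffle-count + 1) × test-count.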
Arguments related to the openai API (can be discarded when users customize models):
- --openai-key: Your API key. Can be found in View API keys -> API keys.
It is easy! Just replace the function example_generator, which is passed to run_psychobench(args, generator), with your own.
Your customized function your_generator() does the following things:
- Read questions from the file args.testing_file. The file is located under results/ (see run_psychobench() in utils.py) and has the following format:
Prompt: ... | order-1 | shuffle0-test0 | shuffle0-test1 | Prompt: ... | order-2 | shuffle0-test0 | shuffle0-test1 |
---|---|---|---|---|---|---|---|
Q1 | 1 | | | Q3 | 3 | | |
Q2 | 2 | | | Q5 | 5 | | |
... | ... | | | ... | ... | | |
Qn | n | | | Q1 | 1 | | |
The shuffled questions for your input are in the column immediately before each column whose name starts with order-.
- Call your own LLM and get the results.
- Fill in the blanks in the file args.testing_file. Remember: there is no need to map the responses back to their original order. Our code will take care of it.
Please check example_generator.py for detailed information.
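The three steps above can be sketched roughly as follows. This is not the code in example_generator.py. It assumes that the testing file is a CSV whose header matches the table above, that response columns are named shuffle<i>-test<j>, and that the model returns one answer per question. The names fill_testing_file, ask_model and constant_model are hypothetical.

```python
import csv
import os
import tempfile
from typing import Callable, List


def fill_testing_file(path: str, ask_model: Callable[[str, List[str]], List[str]]) -> None:
    """Fill every blank response column of the testing file in place."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    for row in body:  # pad short rows so every cell can be indexed
        row.extend([""] * (len(header) - len(row)))

    for col, name in enumerate(header):
        if col == 0 or not name.startswith("order-"):
            continue
        prompt = header[col - 1]                    # header cell holds the prompt
        questions = [row[col - 1] for row in body]  # shuffled questions
        answer_col = col + 1
        # Assumption: response columns are named "shuffle<i>-test<j>".
        while answer_col < len(header) and header[answer_col].startswith("shuffle"):
            if all(row[answer_col] == "" for row in body):
                answers = ask_model(prompt, questions)
                if len(answers) != len(questions):
                    raise ValueError("expected one answer per question")
                # Answers stay in shuffled order; no mapping back is done here.
                for row, answer in zip(body, answers):
                    row[answer_col] = str(answer)
            answer_col += 1

    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def constant_model(prompt: str, questions: List[str]) -> List[str]:
    """Hypothetical stand-in for a real LLM call: answers '3' to everything."""
    return ["3"] * len(questions)


# Demo on a throwaway file shaped like the table above.
demo_rows = [
    ["Prompt: ...", "order-1", "shuffle0-test0", "Prompt: ...", "order-2", "shuffle0-test0"],
    ["Q1", "1", "", "Q2", "2", ""],
    ["Q2", "2", "", "Q1", "1", ""],
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(demo_rows)
    fill_testing_file(demo_path, constant_model)
    with open(demo_path, newline="", encoding="utf-8") as f:
        filled = list(csv.reader(f))
print(filled[1])
```

To adapt the sketch, replace constant_model with a function that sends the prompt and questions to your own LLM and parses its reply into one answer per question.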
To include multiple questionnaires, use a comma to separate them. For example: --questionnaire BFI,DTDD,EPQ-R.
To include ALL questionnaires, just use --questionnaire ALL.
- Big Five Inventory: --questionnaire BFI
- Dark Triad Dirty Dozen: --questionnaire DTDD
- Eysenck Personality Questionnaire-Revised: --questionnaire EPQ-R
- Experiences in Close Relationships-Revised (Adult Attachment Questionnaire): --questionnaire ECR-R
- Comprehensive Assessment of Basic Interests: --questionnaire CABIN
- General Self-Efficacy: --questionnaire GSE
- Love of Money Scale: --questionnaire LMS
- Bem's Sex Role Inventory: --questionnaire BSRI
- Implicit Culture Belief: --questionnaire ICB
- Revised Life Orientation Test: --questionnaire LOT-R
- Empathy Scale: --questionnaire Empathy
- Emotional Intelligence Scale: --questionnaire EIS
- Wong and Law Emotional Intelligence Scale: --questionnaire WLEIS
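For readers scripting around the tool, the selection rules for --questionnaire (comma-separated names, or ALL) could be mirrored as in the sketch below. The function parse_questionnaires is hypothetical and is not the parser PsychoBench uses. The name list is copied from this README and may not cover every questionnaire the tool supports.

```python
from typing import List

# Names copied from the list in this README (an assumption: may be incomplete).
KNOWN_QUESTIONNAIRES = [
    "BFI", "DTDD", "EPQ-R", "ECR-R", "CABIN", "GSE", "LMS",
    "BSRI", "ICB", "LOT-R", "Empathy", "EIS", "WLEIS",
]


def parse_questionnaires(value: str) -> List[str]:
    """Expand a --questionnaire value into a list of questionnaire names."""
    if value.strip() == "ALL":
        return list(KNOWN_QUESTIONNAIRES)
    names = [name.strip() for name in value.split(",") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_QUESTIONNAIRES]
    if unknown:
        raise ValueError(f"unknown questionnaire(s): {', '.join(unknown)}")
    return names


print(parse_questionnaires("BFI,DTDD,EPQ-R"))
print(len(parse_questionnaires("ALL")))
```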
For more details, please refer to our paper here.
If you find our paper and tool interesting and useful, please feel free to give us a star and cite us through:
@inproceedings{huang2024humanity,
author = {Jen{-}tse Huang and
Wenxuan Wang and
Eric John Li and
Man Ho Lam and
Shujie Ren and
Youliang Yuan and
Wenxiang Jiao and
Zhaopeng Tu and
Michael R. Lyu},
title = {On the Humanity of Conversational AI: Evaluating the Psychological Portrayal of LLMs},
booktitle = {Proceedings of the Twelfth International Conference on Learning Representations (ICLR)},
year = {2024}
}