To appear in the 4th edition of the International Conference on Pattern Recognition and Artificial Intelligence. 3-6 July 2024 in Jeju, Korea.
Paper link (ArXiv). Citations below
- BibTex citation
@inproceedings{sandler2024linguistic,
title={A Linguistic Comparison between Human and ChatGPT-Generated Conversations},
author={Sandler, Morgan and Choung, Hyesun and Ross, Arun and David, Prabu},
booktitle={4th International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI)},
year={2024},
organization={IAPR}
}
- M. Sandler, H. Choung, A. Ross, and P. David, “A Linguistic Comparison between Human and ChatGPT-Generated Conversations,” in the 4th International Conference on Pattern Recognition and Artificial Intelligence (ICPRAI), 2024.
The python/conda environment may be set up via:
conda env create -f environment.yml
To download the human-generated dialogues, refer to the original paper by Rashkin et al, 2019 and their corresponding code repository. The ChatGPT-generated (ChatGPT3.5) dialogues may be download via this link. Corresponding embeddings of the 2GPTEmpathicDialogues dataset can be downloaded here. These were used in the following visualization from the paper:
Proofread and run 2gpt_empathy_conv_gen.py. Requires an OpenAI API key. Note: the model used was gpt-3.5-turbo. At the time, that was the best available option. GPT-4 now has API key access with more affordable options. Don't forget to update that line in the code if you are intending to use GPT-4.
- To obtain the dialogue embeddings use compute_dialogue_embeddings.py. This code can be reused for the human-generated and ChatGPT-generated dialogues. See TODOs in the file for more.
- To visualize the 3-D UMAP viz of the dialogue embeddings and obtain the Dunn index, use vizualize_dialogue_embeddings.py
- Run the ValenceClassification.py file. Check the TODOs for the required embeddings file input. Note: this code is currently set up for valence classification of the ChatGPT-generated embeddings, but can be re-used for the human-generated embeddings as well (TODOs explain).
Note: separate statistical software was used for the linguistic analysis. Additionally, LIWC is a proprietary software and must be obtained by the appropriate means. See this website for more.
Summary statistics and statistical significance tests for all 118 linguistic categories from LIWC-22. Accessible here.