🎭 Official code and dataset for our CCGPK@COLING 2022 paper - "PersonaChatGen: Generating Personalized Dialogue using GPT-3"

passing2961/PersonaChatGen

🎭 PersonaChatGen

This is the official GitHub repository for PERSONACHATGEN: Generating Personalized Dialogues using GPT-3.

  • TL;DR: Many prior works have trained agents to generate more personalized and engaging responses using PersonaChat. However, because that dataset has been frozen since 2018, dialogue agents trained on it would not know how to interact with a human who loves “WandaVision.” One way to alleviate this problem is to create a new large-scale dataset. In this work, we introduce a pipeline for creating PersonaChatGen, which comprises three main components: creating (1) ProfileGen, (2) the Persona Set, and (3) PersonaChatGen. To better elicit GPT-3’s generation ability, we also define a taxonomy of hierarchical persona categories derived from a social profiling taxonomy. To create speaker-consistent persona sets, we propose a simple contradiction-based iterative sentence replacement algorithm, named CoNL. Moreover, to prevent GPT-3 from generating harmful content, we present two filtering pipelines, one each for ProfileGen and PersonaChatGen. Through an analysis of PersonaChatGen, we show that GPT-3 can generate personalized dialogue containing diverse personas. Furthermore, we show that the state-of-the-art Blender 90M model trained on our dataset achieves higher performance.

📜 Slide

๐Ÿ† PersonaChatGen won the Best Paper Award at CCGPK@COLING 2022!

Reference

Use the following to cite our paper:

@inproceedings{lee2022personachatgen,
  title={PERSONACHATGEN: Generating Personalized Dialogues using GPT-3},
  author={Lee, Young-Jun and Lim, Chae-Gyun and Choi, Yunsu and Lm, Ji-Hui and Choi, Ho-Jin},
  booktitle={Proceedings of the 1st Workshop on Customized Chat Grounding Persona and Knowledge},
  pages={29--48},
  year={2022}
}

🔎 ProfileGen

You can now download the ProfileGen dataset from Google Drive. We provide individual JSON files, each corresponding to a persona category (see Tables 15, 16, and 17 in our paper). Each file contains a list of profile-related sentences generated by GPT-3, where each element of the list consists of sentence, attr, value, and nli_score. See dataset/profile_sample.json for a sample.
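
As a rough illustration of how one of these files might be consumed, the sketch below parses a list of records carrying the four fields named above and keeps those at or above a score threshold. The inline sample records, the assumption that nli_score is numeric, the threshold value, and the helper name filter_by_nli are all hypothetical; check dataset/profile_sample.json for the actual layout.

```python
import json

# Hypothetical records that mimic the four documented fields. The values are
# invented for illustration and are not taken from the released dataset.
SAMPLE_JSON = """
[
  {"sentence": "I love watching WandaVision.", "attr": "show", "value": "WandaVision", "nli_score": 0.97},
  {"sentence": "I have a dog.", "attr": "pet", "value": "dog", "nli_score": 0.42}
]
"""

def filter_by_nli(records, threshold):
    """Keep records whose nli_score (assumed numeric) is at least `threshold`."""
    return [r for r in records if float(r["nli_score"]) >= threshold]

records = json.loads(SAMPLE_JSON)
kept = filter_by_nli(records, threshold=0.9)
for r in kept:
    print(r["attr"], "->", r["value"], "|", r["sentence"])
```

To read a downloaded file instead, the json.loads call could be swapped for json.load on an open file handle.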

🎭 PersonaChatGen

You can now download the PersonaChatGen dataset from Google Drive. We provide the train and validation sets in the format of the original PersonaChat dataset, as provided by the ParlAI framework. See dataset/chat_sample.txt for a sample.
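
For readers unfamiliar with that format, here is a minimal parsing sketch. It assumes the common PersonaChat text layout: every line starts with an index that resets to 1 at the start of each episode, persona lines carry a "your persona:" prefix, and dialogue lines hold a message and a reply separated by a tab. The sample lines and the function name parse_personachat are hypothetical; verify the layout against dataset/chat_sample.txt before relying on it.

```python
def parse_personachat(lines):
    """Group lines into episodes; a line index of 1 starts a new episode."""
    episodes = []
    prefix = "your persona:"
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        # Split the leading line index from the rest of the line.
        index, _, rest = line.partition(" ")
        if int(index) == 1 or not episodes:
            episodes.append({"persona": [], "turns": []})
        if rest.startswith(prefix):
            episodes[-1]["persona"].append(rest[len(prefix):].strip())
        else:
            # Assumed layout: message <TAB> reply [<TAB> further fields].
            fields = rest.split("\t")
            reply = fields[1] if len(fields) > 1 else ""
            episodes[-1]["turns"].append((fields[0], reply))
    return episodes

# Invented sample lines, not taken from the released dataset.
SAMPLE = [
    "1 your persona: i love watching wandavision.",
    "2 your persona: i have a dog.",
    "3 hi , what do you do for fun ?\ti watch wandavision with my dog .",
    "1 your persona: i am a nurse.",
    "2 hello !\thi , i just got off my shift .",
]
episodes = parse_personachat(SAMPLE)
print(len(episodes), episodes[0]["persona"])
```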

🤖 How to make PersonaChatGen using GPT-3?

To construct the PersonaChatGen dataset using GPT-3, we propose a pipeline consisting of three stages: (1) ProfileGen Creation, (2) Persona Set Creation, and (3) PersonaChatGen Creation. Details are given in our paper. Follow the instructions below step by step.

Preparation

Installation

Install the required set of libraries as follows:

pip install -r requirements.txt

Set up OpenAI API Key

Set your OpenAI API key and organization ID in the get_response() function in prompt_generator.py as follows:

openai.api_key = "<API_KEY>"
openai.organization = "<ORG_ID>"
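
Hardcoding credentials in a source file makes them easy to commit by accident. One possible alternative is to read them from environment variables, sketched below. The variable names OPENAI_API_KEY and OPENAI_ORG_ID and the helper load_openai_credentials are assumptions made for this sketch and are not part of this repository.

```python
import os

def load_openai_credentials(env=os.environ):
    """Return (api_key, organization) read from environment variables.

    The variable names are assumptions chosen for this sketch.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # The organization ID is treated as optional here.
    return api_key, env.get("OPENAI_ORG_ID")

# Inside get_response(), the two assignments could then become:
#   openai.api_key, openai.organization = load_openai_credentials()
```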

ProfileGen Creation

Generation

Run the command below to generate various profile-related sentences using GPT-3.

python profile_main.py

Filtering

Run the command below to filter out low-quality sentences using regex-based filtering, exact matching of the persona entity, persona category preservation, and duplication filtering.

python profile_filtering.py
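
To give a feel for what such a filter chain might look like, the sketch below applies three of the four checks to toy records. It is not the logic of profile_filtering.py: the first-person regex, the case-insensitive entity match, and the case-insensitive duplicate key are all assumed rules, and the persona category check is left out because its rules are described in the paper rather than here.

```python
import re

# Assumed rule: a profile sentence starts with "I" or "My" and ends with a period.
FIRST_PERSON = re.compile(r"^(I|My)\b.*\.$")

def filter_profiles(records):
    """Apply three illustrative checks and return the surviving records."""
    seen = set()
    kept = []
    for rec in records:
        sentence = rec["sentence"].strip()
        # 1) Regex-based check on the sentence shape (assumed rule).
        if not FIRST_PERSON.match(sentence):
            continue
        # 2) Exact match: the persona entity must appear in the sentence.
        if rec["value"].lower() not in sentence.lower():
            continue
        # 3) Duplication: drop sentences already seen, ignoring case.
        key = sentence.lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept

# Invented records for illustration only.
records = [
    {"sentence": "I love WandaVision.", "value": "WandaVision"},
    {"sentence": "i love wandavision.", "value": "WandaVision"},       # fails the regex
    {"sentence": "I love WandaVision.", "value": "WandaVision"},       # duplicate
    {"sentence": "I enjoy superhero shows.", "value": "WandaVision"},  # entity missing
    {"sentence": "Here is a sentence:", "value": "sentence"},          # fails the regex
]
kept = filter_profiles(records)
print([r["sentence"] for r in kept])
```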

Persona Set Creation

Run the command below to create persona sets using our proposed simple algorithm, namely CoNL (Contradiction-based Iterative Sentence Replacement).

🚨 Please note that this algorithm and the accompanying implementation can take a significant amount of time to create numerous persona sets. We encourage other contributors to improve it for greater efficiency.

python conl_main.py
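
The general idea of contradiction-based iterative replacement can be sketched as follows. This is not the authors' implementation (see conl_main.py and the paper for that): the contradicts function is a toy stand-in for an NLI model that treats two sentences as contradictory when they share an attribute but differ in value, and the names find_conflict, build_persona_set, and max_iters are hypothetical.

```python
import random

def contradicts(a, b):
    """Toy stand-in for an NLI model: same attribute, different value."""
    return a["attr"] == b["attr"] and a["value"] != b["value"]

def find_conflict(persona):
    """Return the index of a sentence that contradicts an earlier one, or None."""
    for j in range(len(persona)):
        for i in range(j):
            if contradicts(persona[i], persona[j]):
                return j
    return None

def build_persona_set(pool, size, rng, max_iters=1000):
    """Sample `size` sentences, then replace contradicting ones until consistent."""
    persona = rng.sample(pool, size)
    for _ in range(max_iters):
        j = find_conflict(persona)
        if j is None:
            return persona
        candidates = [s for s in pool if s not in persona]
        if not candidates:
            break
        # Swap the offending sentence for a random unused one and re-check.
        persona[j] = rng.choice(candidates)
    return None  # no consistent set found within the iteration budget

# Invented pool for illustration only.
pool = [
    {"sentence": "I have a dog.", "attr": "pet", "value": "dog"},
    {"sentence": "I have a cat.", "attr": "pet", "value": "cat"},
    {"sentence": "I am a nurse.", "attr": "job", "value": "nurse"},
    {"sentence": "I am a teacher.", "attr": "job", "value": "teacher"},
    {"sentence": "I love WandaVision.", "attr": "show", "value": "WandaVision"},
    {"sentence": "I live in Seoul.", "attr": "city", "value": "Seoul"},
]
persona = build_persona_set(pool, size=3, rng=random.Random(0))
print([s["sentence"] for s in persona])
```

Every replacement triggers a fresh pairwise check, so the cost grows quickly with the persona set size and the number of sets, which is consistent with the runtime warning above.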

PersonaChatGen Creation

Generation

Run the command below to generate the PersonaChatGen dataset using GPT-3.

python chat_main.py

Filtering

Run the command below to filter out low-quality dialogues using copy-paste, persona consistency, and toxicity filtering.

python chat_filtering.py
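
As an illustration of the copy-paste check only, the sketch below flags an utterance when a persona sentence appears inside it verbatim after light normalization. The normalization rule and the names normalize and is_copy_paste are assumptions, not the logic of chat_filtering.py. Persona consistency and toxicity filtering would need trained classifiers and are not sketched here.

```python
import re

def normalize(text):
    """Lowercase and drop punctuation so the comparison is word-based."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))

def is_copy_paste(utterance, persona_sentences):
    """True if any persona sentence appears verbatim inside the utterance."""
    # Pad with spaces so matches only start and end on word boundaries.
    padded = f" {normalize(utterance)} "
    for sentence in persona_sentences:
        norm = normalize(sentence)
        if norm and f" {norm} " in padded:
            return True
    return False

# Invented persona and dialogue for illustration only.
persona = ["I have a dog.", "I love WandaVision."]
dialogue = [
    "Hi! What do you do for fun?",
    "I love WandaVision.",
    "Nice, my dog and I watch it every week.",
]
flags = [is_copy_paste(u, persona) for u in dialogue]
print(flags)  # [False, True, False]
```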

Acknowledgements

This work was supported by the KT Corporation. We thank all KT researchers for helpful discussions.

Have any questions?

Please contact Young-Jun Lee at yj2961@kaist.ac.kr or passing2961@gmail.com.

License

This repository is MIT licensed. See the LICENSE file for details.
