BYU-PCCL/chitchat-dataset

chitchat-dataset

Open-domain conversational dataset from the BYU Perception, Control & Cognition lab's Chit-Chat Challenge.

install

pip3 install chitchat_dataset

or simply download the raw dataset:

curl -LO https://raw.githubusercontent.com/BYU-PCCL/chitchat-dataset/master/chitchat_dataset/dataset.json
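If you downloaded the raw file rather than installing the package, it can be read with the standard library alone. The following is a minimal sketch, assuming dataset.json sits in the current directory; `load_raw_dataset` is a hypothetical helper name and not part of the package.

```python
import json
from pathlib import Path


def load_raw_dataset(path="dataset.json"):
    """Load the raw JSON file into a plain dict (hypothetical helper)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    path = Path("dataset.json")
    # Only attempt the load if the file has actually been downloaded.
    if path.exists():
        dataset = load_raw_dataset(path)
        print(len(dataset), "conversations loaded")
```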

usage

Formal documentation is not yet available; for now, see chitchat_dataset/__init__.py for more options.

import chitchat_dataset as ccc

dataset = ccc.Dataset()

# Dataset is a subclass of dict()
for convo_id, convo in dataset.items():
    print(convo_id, convo)

Or get the messages in a flat list:

messages = list(ccc.MessageDataset())

See examples/ for other languages.

stats

  • 7,168 conversations
  • 258,145 utterances
  • 1,315 unique participants

format

The dataset is a mapping from conversation UUID to a conversation:

{
  "prompt": "What's the most interesting thing you've learned recently?",
  "ratings": { "witty": "1", "int": 5, "upbeat": 5 },
  "start": "2018-04-20T01:57:41",
  "messages": [
    [
      {
        "text": "Hello",
        "timestamp": "2018-04-19T19:57:51",
        "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
      }
    ],
    [
      {
        "text": "I learned that the Queen of England's last corgi died",
        "timestamp": "2018-04-19T19:58:14",
        "sender": "bebad07e-15df-48c3-a04f-67db828503e3"
      }
    ],
    [
      {
        "text": "Wow that sounds so sad",
        "timestamp": "2018-04-19T19:58:18",
        "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
      },
      {
        "text": "was it a cardigan welsh corgi",
        "timestamp": "2018-04-19T19:58:22",
        "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
      },
      {
        "text": "?",
        "timestamp": "2018-04-19T19:58:24",
        "sender": "22578ac2-6317-44d5-8052-0a59076e0b96"
      }
    ]
  ]
}

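Each entry in "messages" is a turn: a list of one or more consecutive messages from the same sender. Grouping messages this way makes it convenient to represent multi-message conversational turns while preserving the structure and flow of the conversation.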
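As an illustration of the nesting, the sketch below walks one conversation turn by turn and joins each turn's messages into a single string. It relies only on the field names shown in the example above; `collapse_turns` and the trimmed sample conversation are hypothetical and not part of the package.

```python
def collapse_turns(convo, sep=" "):
    """Return one (sender, text) pair per turn.

    Assumes every turn is a non-empty list of messages from one sender,
    as in the example conversation above.
    """
    collapsed = []
    for turn in convo["messages"]:
        sender = turn[0]["sender"]
        text = sep.join(message["text"] for message in turn)
        collapsed.append((sender, text))
    return collapsed


# Trimmed copy of the example conversation (timestamps omitted).
A = "22578ac2-6317-44d5-8052-0a59076e0b96"
B = "bebad07e-15df-48c3-a04f-67db828503e3"
sample = {
    "messages": [
        [{"text": "Hello", "sender": A}],
        [{"text": "I learned that the Queen of England's last corgi died", "sender": B}],
        [
            {"text": "Wow that sounds so sad", "sender": A},
            {"text": "was it a cardigan welsh corgi", "sender": A},
            {"text": "?", "sender": A},
        ],
    ]
}

for sender, text in collapse_turns(sample):
    print(sender[:8], text)
```

Joining with a space is only one choice; pass a different `sep` (for example a newline) if the boundaries between messages within a turn matter for your use case.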

how to cite

If you extend or use this work, please cite the paper where it was introduced:

@article{myers2020conversational,
  title={Conversational Scaffolding: An Analogy-Based Approach to Response Prioritization in Open-Domain Dialogs},
  author={Myers, Will and Etchart, Tyler and Fulda, Nancy},
  year={2020}
}