Work in progress.
- `src` ‒ main source code: model and dataset implementations, plus code to train, test, or run inference with the model.
- `notebooks` ‒ notebooks with experiments and visualizations.
- `scripts` ‒ assorted utility scripts, e.g. for printing dataset examples or evaluating existing models.
- `tests` ‒ unit tests.
Create a virtual environment with venv or conda and install the requirements:
pip install -r requirements.txt
For development and contributions, also install the dev requirements:
pip install -r requirements-dev.txt
We use the pushshift.io dataset of Reddit comments to pretrain our model. We have collected all the comments for 2019.

TODO: add preprocessing and filtering steps

In total there are 237,212,662 dialogs: 237,162,662 are used for the train split, and 25,000 each are used for the validation and test splits.
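A rough sketch of how splits of these sizes could be carved out of the collected dialogs is shown below. The file names and the JSON-lines layout are assumptions for illustration only; the real preprocessing pipeline is still the TODO above, and which dialogs end up in which split is not specified here.

```python
import itertools

# Hypothetical file name/format: one JSON dialog per line, 237,212,662 in total.
SOURCE = "reddit_2019_dialogs.jsonl"
VALID_SIZE = TEST_SIZE = 25_000

def write_split(lines, path):
    with open(path, "w", encoding="utf-8") as out:
        for line in lines:
            out.write(line)

with open(SOURCE, encoding="utf-8") as src:
    # Carve off the two small evaluation splits, then stream everything else to train.
    write_split(itertools.islice(src, VALID_SIZE), "valid.jsonl")
    write_split(itertools.islice(src, TEST_SIZE), "test.jsonl")
    write_split(src, "train.jsonl")  # the remaining 237,162,662 dialogs
```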
CommonSense Conversation from DiffuSeq
Token statistics were collected with the `facebook/blenderbot-400M-distill` tokenizer; see `scripts.cc_tokens_stats`.
Train
- 3,382,137 samples
- Context contains 81,772,641 tokens, per-sample lengths 2-84, average 24.178
- Target contains 80,812,361 tokens, per-sample lengths 1-84, average 23.894

Valid
- 2,047 samples
- Context contains 49,424 tokens, per-sample lengths 3-53, average 24.133
- Target contains 49,887 tokens, per-sample lengths 2-56, average 24.359

Test
- 9,999 samples
- Context contains 241,541 tokens, per-sample lengths 2-58, average 24.154
- Target contains 240,374 tokens, per-sample lengths 2-61, average 24.037
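Statistics like the above can be computed along the lines of the following sketch. The actual script is `scripts.cc_tokens_stats`; the `context`/`target` field names and the sample layout here are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/blenderbot-400M-distill")

def token_stats(texts):
    """Total, min, max, and average number of tokens over a list of strings."""
    lengths = [len(tokenizer.encode(t, add_special_tokens=False)) for t in texts]
    return {
        "total": sum(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "avg": sum(lengths) / len(lengths),
    }

# Hypothetical usage: each sample is assumed to expose "context" and "target" fields.
samples = [
    {"context": "do you like hiking ?", "target": "yes , i try to go every weekend ."},
]
print("context:", token_stats([s["context"] for s in samples]))
print("target:", token_stats([s["target"] for s in samples]))
```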
We use the ConvAI2 dataset, which contains dialogues between personas with different descriptive profiles. The dataset can be downloaded here.
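As an illustration only, a minimal parser for the plain-text ParlAI-style distribution of ConvAI2 (numbered lines, `your persona:` prefixes, tab-separated utterance/response pairs) might look like the sketch below; the exact file layout used in this repository is an assumption.

```python
def read_convai2(path):
    """Parse an assumed ParlAI-style ConvAI2 file: numbered lines, persona lines
    prefixed with 'your persona:', dialogue turns as tab-separated pairs."""
    dialogs, persona, turns = [], [], []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            number, _, line = raw.rstrip("\n").partition(" ")
            if number == "1" and (persona or turns):
                # Line numbering restarts at 1 for every episode: flush the previous one.
                dialogs.append({"persona": persona, "turns": turns})
                persona, turns = [], []
            if line.startswith("your persona:"):
                persona.append(line[len("your persona:"):].strip())
            else:
                utterance, response = line.split("\t")[:2]
                turns.append((utterance, response))
    if persona or turns:
        dialogs.append({"persona": persona, "turns": turns})
    return dialogs
```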