# WhenToTalk

Make the model decide when to speak in a conversation, which can make the interaction more engaging.

Model architecture:

1. GCN for predicting the timing of speaking (a minimal sketch follows this list)
   - Dialogue-sequence: sequence of the dialogue history
   - User-sequence: sequence of the user's utterances
   - PMI: context relationship
2. Seq2Seq / HRED for language generation
3. Multi-head attention over the dialogue context (using the GCN hidden states)
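
A minimal sketch of components 1 and 3 above, assuming PyG's `GCNConv` and PyTorch's `nn.MultiheadAttention`; the class name, dimensions, and two-layer depth are illustrative, not the repo's actual implementation:

```python
# Illustrative sketch, not the repo's code: a GCN over utterance nodes
# predicts the speak / stay-silent timing, and multi-head attention runs
# over the GCN hidden states to summarize the dialogue context.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TimingGCN(nn.Module):  # hypothetical name
    def __init__(self, in_dim=300, hidden_dim=256, n_heads=8):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, n_heads)
        self.classifier = nn.Linear(hidden_dim, 2)  # speak / stay silent

    def forward(self, x, edge_index):
        # x: [num_utterances, in_dim]; edge_index: [2, num_edges] built from
        # the dialogue-sequence, user-sequence, and PMI edges
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        h_seq = h.unsqueeze(1)                   # [seq_len, batch=1, hidden]
        ctx, _ = self.attn(h_seq, h_seq, h_seq)  # attention over the context
        return self.classifier(ctx.squeeze(1))   # per-utterance timing logits
```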

## Requirements

1. PyTorch 1.2
2. PyG (PyTorch Geometric)
3. numpy
4. tqdm
5. nltk: for word and sentence tokenization (`word_tokenize`, `sent_tokenize`)
6. BERTScore 0.2.1

## Dataset

Format:

1. The corpus folder contains many subfolders, each named after the turn length of its conversations.
2. Each subfolder contains many files, each holding one conversation.
3. Each conversation file is in TSV format; each line has four elements (see the example below):
   - time
   - poster
   - reader
   - utterance
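
For illustration, a conversation file might look like this (the values below are made up, not taken from any of the corpora):

```
0	user0	user1	hello , how are you ?
1	user1	user0	fine , thanks . and you ?
2	user0	user1	pretty good .
```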

### Create the dataset

```bash
# arguments: dataset (ubuntu / cornell) and mode (cf / ncf).
# the ubuntu-corpus folder will be created, with one subfolder per mode (cf / ncf)
./data/run.sh ubuntu cf
```

## Metrics

1. Language model: BLEU4, PPL, Distinct-1, Distinct-2 (see the sketch below)
2. Talk timing: F1, Acc
3. Human evaluation: engagingness
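
Distinct-1/2 measure response diversity as the ratio of unique n-grams to total n-grams; a minimal sketch (the helper below is ours, not the repo's evaluation code):

```python
# Distinct-n: unique n-grams divided by total n-grams over all responses.
def distinct_n(sentences, n):
    total, unique = 0, set()
    for tokens in sentences:  # each response is a list of tokens
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

# toy example with two generated responses
replies = [["i", "am", "fine"], ["i", "am", "good", "thanks"]]
print(distinct_n(replies, 1), distinct_n(replies, 2))  # 0.714..., 0.8
```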

## Baselines

### 1. Traditional methods

1. Seq2Seq
2. HRED / HRED + CF

### 2. Graph ablation learning

1. w/o BERT embedding cosine similarity
2. w/o User-sequence
3. w/o Dialogue-sequence

## How to use

Generate the graph of the context:

```bash
# generate the graph information of the train/test/dev datasets
./run.sh graph cornell when2talk 0
```

Analyze the graph context coverage information:

```bash
# average context coverage in the graph: 0.7935/0.7949/0.7794 (train/test/dev)
./run.sh stat cornell 0 0
```

Generate the vocab of the dataset:

```bash
./run.sh vocab ubuntu 0 0
```

Train the model (seq2seq / seq2seq-cf / hred / hred-cf):

```bash
# train the hred model on the 4th GPU
./run.sh train ubuntu hred 4
```

Translate the test dataset with the trained model:

```bash
# translate the test dataset with the hred model on the 4th GPU
./run.sh translate ubuntu hred 4
```

Evaluate the translated utterances:

```bash
# evaluate the model's translated results on the 4th GPU (BERTScore needs it)
./run.sh eval ubuntu hred 4
```

Generate the performance curve:

```bash
./run.sh curve dailydialog hred-cf 0
```

Chat with the model:

```bash
./run.sh chat dailydialog GatedGCN 0
```

## Experiment Result

To do:

1. add GatedGCN to all the graph-based methods
2. add BiGRU to all the graph-based methods
3. refer to DialogueGCN to construct the graph
   - complete graph within a window of size **p**
   - add one long edge beyond the window to capture long-distance context sentences
   - user embeddings as nodes for processing
4. analyze the layers of the GatedGCN in this repo and multi-turn modeling
Experimental setup and results:

1. Methods

   - Seq2Seq: seq2seq with attention
   - HRED: hierarchical context modeling
   - HRED-CF: HRED with a classifier for talk timing
   - When2Talk: GCN context modeling first, RNN context modeling later
   - W2T_RNN_First: BiRNN context modeling first, GCN context modeling later
   - GCNRNN: combines the gated GCN context and the RNN context (?)
   - GatedGCN: combines the gated GCN context and the RNN context
     1. BiRNN for background modeling
     2. Gated GCN for context modeling
     3. combine the GCN embedding and the BiRNN embedding into the final embedding
     4. low-turn examples are trained without GCNConv (only the BiRNN is used)
     5. separating the decision module from the generation module works better
   - W2T_GCNRNN: RNN + GCN combined with an RNN (W2T_RNN_First + GCNRNN)
2. Automatic evaluation

   - Compare the PPL, BLEU4, Distinct-1, and Distinct-2 scores of all the models.

     The proposed classification-based methods need to be cascaded to calculate BLEU4 and BERTScore (so their results have the same format as the traditional models' results).

     Dailydialog:

     | Model         | BLEU   | Dist-1 | Dist-2 | PPL     |
     | ------------- | ------ | ------ | ------ | ------- |
     | Seq2Seq       | 0.1038 | 0.0178 | 0.072  | 29.0640 |
     | HRED          | 0.1175 | 0.0176 | 0.0571 | 29.7402 |
     | HRED-CF       | 0.1268 | 0.0435 | 0.1567 | 29.0111 |
     | When2Talk     | 0.1226 | 0.0211 | 0.0608 | 24.0131 |
     | W2T_RNN_First | 0.1244 | 0.0268 | 0.0787 | 24.5056 |
     | GCNRNN        | 0.1250 | 0.0214 | 0.0624 | 25.8213 |
     | W2T_GCNRNN    | 0.1246 | 0.0152 | 0.0400 | 23.4434 |
     | GatedGCN      | 0.1231 | 0.0423 | 0.1609 | 27.1615 |

     Cornell:

     | Model         | BLEU   | Dist-1 | Dist-2 | PPL     |
     | ------------- | ------ | ------ | ------ | ------- |
     | Seq2Seq       | 0.0843 | 0.0052 | 0.0164 | 45.1504 |
     | HRED          | 0.0823 | 0.0227 | 0.0524 | 39.9009 |
     | HRED-CF       | 0.1132 | 0.0221 | 0.0691 | 38.5633 |
     | When2Talk     | 0.0996 | 0.0036 | 0.0073 | 32.9503 |
     | W2T_RNN_First | 0.1118 | 0.0065 | 0.0147 | 33.754  |
     | GCNRNN        | 0.1072 | 0.0077 | 0.0188 | 33.9572 |
     | W2T_GCNRNN    | 0.1107 | 0.0063 | 0.0142 | 34.4256 |
     | GatedGCN      | 0.1157 | 0.0261 | 0.0873 | 34.4256 |
   - F1 metric for measuring the accuracy of the speaking timing, only for the classification-based methods (hred-cf, ...). The dataset statistics show that the number of negative labels is about half the number of positive labels, so F1 combined with Acc is more suitable for measuring the results than F1 alone. In this setting we care more about the precision component of the F1 metric (a minimal computation sketch appears at the end of this section).

     | Model         | Dailydialog Acc | Dailydialog F1 | Cornell Acc | Cornell F1 |
     | ------------- | --------------- | -------------- | ----------- | ---------- |
     | HRED-CF       | 0.8272          | 0.8666         | 0.7708      | 0.8427     |
     | When2Talk     | 0.7992          | 0.8507         | 0.7616      | 0.8388     |
     | W2T_RNN_First | 0.8144          | 0.8584         | 0.7481      | 0.8312     |
     | GCNRNN        | 0.8176          | 0.8635         | 0.7598      | 0.8445     |
     | W2T_GCNRNN    | 0.7565          | 0.8434         | 0.7853      | 0.8466     |
     | GatedGCN      | 0.8226          | 0.8663         | 0.738       | 0.8181     |
3. Human judgments (engagingness, ...)

   Invite volunteers to chat with these models (seq2seq, hred, seq2seq-cf, hred-cf) and score each model's performance according to engagingness, fluency, ...

   - Dailydialog dataset

     | When2Talk vs. | win (%) | loss (%) | tie (%) | kappa |
     | ------------- | ------- | -------- | ------- | ----- |
     | Seq2Seq       |         |          |         |       |
     | HRED          |         |          |         |       |
     | HRED-CF       |         |          |         |       |

   - Cornell dataset

     | When2Talk vs. | win (%) | loss (%) | tie (%) | kappa |
     | ------------- | ------- | -------- | ------- | ----- |
     | Seq2Seq       |         |          |         |       |
     | HRED          |         |          |         |       |
     | HRED-CF       |         |          |         |       |
4. Graph ablation learning

   - F1 accuracy of predicting the speaking timing (hred-cf, ...)
   - BLEU4, BERTScore, Distinct-1, Distinct-2
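
For reference, the talk-timing Acc/F1/precision reported above can be computed along these lines (a sketch using scikit-learn, which is not listed in the requirements; the labels below are toy data, not real predictions):

```python
# Sketch: accuracy, F1, and precision for the binary speak / stay-silent decision.
from sklearn.metrics import accuracy_score, f1_score, precision_score

y_true = [1, 1, 0, 1, 0, 1]  # gold talk-timing labels (1 = speak), toy data
y_pred = [1, 0, 0, 1, 1, 1]  # model decisions, toy data

print("Acc: %.4f" % accuracy_score(y_true, y_pred))
print("F1: %.4f" % f1_score(y_true, y_pred))
# precision is the component we care most about in this setting
print("Precision: %.4f" % precision_score(y_true, y_pred))
```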