Performance on DailyDialog dataset #8

Open
aman229 opened this issue Sep 19, 2020 · 1 comment
aman229 commented Sep 19, 2020

Hi,
I tried running the Seq2Seq and HRED models on the DailyDialog dataset. Here are the results I got:

Seq2Seq results:
BLEU-1: 0.215
BLEU-2: 0.0986
BLEU-3: 0.057
BLEU-4: 0.0366
ROUGE: 0.0492
Distinct-1: 0.0268; Distinct-2: 0.131
Ref distinct-1: 0.0599; Ref distinct-2: 0.3644
BERTScore: 0.1414

HRED results:
BLEU-1: 0.2121
BLEU-2: 0.0961
BLEU-3: 0.0542
BLEU-4: 0.0331
ROUGE: 0.0502
Distinct-1: 0.0208; Distinct-2: 0.0992
Ref distinct-1: 0.0588; Ref distinct-2: 0.3619
BERTScore: 0.1436
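
For reference, Distinct-n here is the ratio of unique n-grams to total n-grams over all generated responses (the "Ref" lines apply the same metric to the ground-truth replies). A minimal sketch, assuming plain whitespace tokenization, which may differ from the tokenization used by the repo's evaluation script:

```python
# Distinct-n: unique n-grams / total n-grams across all responses.
# Whitespace tokenization is an assumption; the repo's evaluation may differ.
def distinct_n(sentences, n):
    ngrams = []
    for sent in sentences:
        tokens = sent.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)

# e.g. distinct_n(generated_responses, 1) -> Distinct-1,
#      distinct_n(reference_responses, 2) -> Ref distinct-2
```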

These results seem to be much lower than the ones reported in the DailyDialog paper: https://www.aclweb.org/anthology/I17-1099.pdf
Do you have any idea why that is the case?
Thanks!

gmftbyGMFTBY (Owner) commented
Hi, thanks for your interest in this repo. Compared with the results in the original DailyDialog paper, the BLEU-1/2 scores are lower, but the BLEU-3/4 scores are much better. In my opinion, BLEU-3/4 are more suitable metrics than BLEU-1/2, since matching higher-order n-grams indicates that the model generates more fluent responses. So I think these results are reasonable. If you are still confused about it, feel free to contact me.
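
For reference, corpus-level BLEU-1/2/3/4 differ only in the n-gram weights. A minimal sketch with NLTK, assuming whitespace tokenization and one reference per hypothesis; the actual evaluation script may use a different tokenizer and smoothing method, which noticeably affects BLEU-3/4:

```python
# Sketch of corpus BLEU-1..4 with NLTK; tokenization and smoothing are assumptions.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_1_to_4(references, hypotheses):
    refs = [[r.split()] for r in references]   # one reference per hypothesis
    hyps = [h.split() for h in hypotheses]
    smooth = SmoothingFunction().method1       # smoothing choice matters for BLEU-3/4
    weights = {
        "BLEU-1": (1.0, 0.0, 0.0, 0.0),
        "BLEU-2": (0.5, 0.5, 0.0, 0.0),
        "BLEU-3": (1/3, 1/3, 1/3, 0.0),
        "BLEU-4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(refs, hyps, weights=w, smoothing_function=smooth)
            for name, w in weights.items()}
```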
