This is the PyTorch implementation of the paper: Multimodal Dialogue Systems via Capturing Context-aware Dependencies of Semantic Elements. Weidong He, Zhi Li, Dongcai Lu, Enhong Chen, Tong Xu, Baoxing Huai, Nicholas Jing Yuan. ACM MM 2020. [PDF]
If you use any source code or datasets included in this toolkit in your work, please cite the following paper. The BibTeX is listed below:
@inproceedings{he2020multimodal,
  title={Multimodal Dialogue Systems via Capturing Context-aware Dependencies of Semantic Elements},
  author={He, Weidong and Li, Zhi and Lu, Dongcai and Chen, Enhong and Xu, Tong and Huai, Baoxing and Yuan, Nicholas Jing},
  booktitle={Proceedings of the 28th ACM International Conference on Multimedia},
  pages={2755--2764},
  year={2020}
}
Recently, multimodal dialogue systems have attracted increasing attention in several domains such as retail and travel. Despite the promising performance of pioneering works, existing studies usually focus on utterance-level semantic representations with hierarchical structures, which ignore the context-aware dependencies of multimodal semantic elements, i.e., words and images. Moreover, when integrating the visual content, they only consider images of the current turn, leaving out those of previous turns as well as their ordinal information. To address these issues, we propose a Multimodal diAlogue system with semanTic Elements, MATE for short. Specifically, we unfold the multimodal inputs and devise a Multimodal Element-level Encoder to obtain semantic representations at the element level. Besides, we take into consideration all images that might be relevant to the current turn and inject the sequential characteristics of images through position encoding. Finally, we conduct comprehensive experiments on a public multimodal dialogue dataset in the retail domain, and improve the BLEU-4 score by 9.49 and the NIST score by 1.8469 compared with state-of-the-art methods.
The architecture of the proposed MATE model includes two main components:
Multimodal Element-level Encoder: In this component, all images from the dialog history and the user query are organized as a dialog image memory. Then, we allocate related images to each turn and obtain image-enhanced text embeddings through an attention mechanism. Meanwhile, all images are integrated with the user query to get query-enhanced image embeddings. Finally, all embeddings are concatenated as multimodal semantic element embeddings.
Knowledge-aware Two-Stage Decoder: This is a variant of the transformer decoder for generating better responses. The first-stage decoder attends to the multimodal conversation context from the encoder, while the second-stage decoder takes the domain knowledge and the results of the first-stage decoder to further refine the responses.
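For intuition, here is a minimal, self-contained PyTorch sketch of the element-level fusion idea described above: word embeddings attend over the visual features of the images allocated to a turn, with a sinusoidal position encoding injecting the ordinal information of the image sequence. The module name, dimensions, and fusion details are illustrative assumptions, not the repository's actual implementation.

```python
import math
import torch
import torch.nn as nn

class ElementLevelFusion(nn.Module):
    """Illustrative fusion of word embeddings with a turn's image features
    via scaled dot-product attention (names and sizes are hypothetical)."""

    def __init__(self, d_model=512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.scale = math.sqrt(d_model)

    @staticmethod
    def image_position_encoding(num_images, d_model):
        # Sinusoidal encoding injects the ordinal information of the
        # image sequence, analogous to the transformer position encoding.
        pos = torch.arange(num_images, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe = torch.zeros(num_images, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe

    def forward(self, word_emb, image_feat):
        # word_emb:   (text_len, d_model)  token embeddings of one turn
        # image_feat: (num_imgs, d_model)  visual features of related images
        pe = self.image_position_encoding(
            image_feat.size(0), image_feat.size(1)).to(image_feat.device)
        image_feat = image_feat + pe
        q = self.q_proj(word_emb)                   # queries come from words
        k, v = self.k_proj(image_feat), self.v_proj(image_feat)
        attn = torch.softmax(q @ k.t() / self.scale, dim=-1)
        image_enhanced_text = attn @ v              # (text_len, d_model)
        # Concatenate with the original embeddings as element embeddings.
        return torch.cat([word_emb, image_enhanced_text], dim=-1)
```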
Check the required packages or simply run the following command (Python 3.7 is required).
❱❱❱ pip install -r requirement.txt
- Download the MMD dataset and unzip it. Note that we only use dataset.zip and image_annoy_index.zip. The data directory should look like this.
data
├── annoy.ann
├── ImageUrlToIndex.pkl
├── FileNameMapToIndex.pkl
├── styletips_synset.txt
├── celebrity_distribution.json
├── v1
│ ├── train
│ │ ├── *.json
│ │ └── ...
│ ├── valid
│ └── test
└── v2
├── train
├── valid
└── test
- Process the data with the following command, or just download the processed data from Google Drive.
❱❱❱ python3 generate_data.py --input_dir data/raw --out_dir data/processed
The final data directory should look like this.
data
├── raw
│ ├── annoy.ann
│ ├── ImageUrlToIndex.pkl
│ ├── FileNameMapToIndex.pkl
│ ├── styletips_synset.txt
│ ├── celebrity_distribution.json
│ ├── v1
│ │ ├── train
│ │ │ ├── *.json
│ │ │ └── ...
│ │ ├── valid
│ │ └── test
│ └── v2
│ ├── train
│ ├── valid
│ └── test
└── processed
├── knowledge.json
├── v1
│ ├── train.pkl
│ ├── valid.pkl
│ └── test.pkl
└── v2
├── train.pkl
├── valid.pkl
└── test.pkl
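After processing, a quick sanity check like the following can confirm that the files are readable. This is a hedged sketch: the exact record structure inside the pickles depends on generate_data.py and is not assumed here.

```python
import json
import pickle

# Load the processed knowledge file and one data split (paths follow the
# directory layout shown above).
with open("data/processed/knowledge.json") as f:
    knowledge = json.load(f)

with open("data/processed/v1/train.pkl", "rb") as f:
    train_examples = pickle.load(f)

print(type(knowledge), type(train_examples))
print("number of training examples:", len(train_examples))
```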
Training
❱❱❱ python3 train.py -g 0 --config_file_path config/mate_v1.json --model_path work_path/ --task text --version 1 --context_size 2 --batch_size 32
Note that the first run will generate the training data files under config["work_path"], so it may be slow.
Testing
❱❱❱ python3 translate.py -g 0 --config_file_path config/mate_v1.json --model_path work_path/ --checkpoint_file --out_file save_path/
We provide an example config file, mate_v1.json, which consists of three fields: "training", "data" and "model". The meanings of the parameters are as follows:
training: the parameters in this field are related to the training process.
- "seed": random seed
- "lr": learning rate
- "lr_decay": weight decay (L2 penalty)
- "max_gradient_norm": max norm of the gradients for clip
- "num_epochs": total epochs
- "log_batch": logging interval (num of batches)
- "evaluate_epoch": evaluation interval (num of epochs)
- "patience": patience for early stop
- "label_smoothing": if use label smoothing
data: the parameters in this field are related to data processing.
- "annoy_file": file path for "annoy.ann"
- "annoy_pkl": file path for "ImageUrlToIndex.pkl"
- "source_path": data source directory
- "work_path": work directory
- "knowledge_path": knowledge file path
- "context_text_cutoff": min word frequency
- "text_length": max text length
- "image_length": max image num
- "num_pos_images": max positive images num
- "num_neg_images": max negative images num
model: the parameters in this field are related to the model structure and usually do not need to be adjusted.
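For reference, the sketch below builds a config with the documented fields and writes it as JSON. Every value is a placeholder assumption and not the defaults shipped in config/mate_v1.json.

```python
import json

# Illustrative config with the documented fields; all values below are
# assumptions, not the defaults used by the repository.
config = {
    "training": {
        "seed": 42,
        "lr": 1e-4,
        "lr_decay": 1e-5,
        "max_gradient_norm": 5.0,
        "num_epochs": 30,
        "log_batch": 100,
        "evaluate_epoch": 1,
        "patience": 5,
        "label_smoothing": True,
    },
    "data": {
        "annoy_file": "data/raw/annoy.ann",
        "annoy_pkl": "data/raw/ImageUrlToIndex.pkl",
        "source_path": "data/processed/v1",
        "work_path": "work_path/",
        "knowledge_path": "data/processed/knowledge.json",
        "context_text_cutoff": 4,
        "text_length": 30,
        "image_length": 5,
        "num_pos_images": 5,
        "num_neg_images": 5,
    },
    # Model-structure parameters are left out here; in most cases they can
    # stay at the values provided in the example config.
    "model": {},
}

with open("config/my_config.json", "w") as f:
    json.dump(config, f, indent=2)
```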