Text generation is an interesting task: given a set of words W = {w1, w2, ..., wk}, the generator aims to produce a long text that carries the semantic information of those words. SC-LSTM (Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems, Wen et al., 2015), a best paper of EMNLP 2015, is a statistical language generator for response generation based on a semantically controlled Long Short-Term Memory structure. The authors incorporate a dialogue-act 1-hot vector into the original LSTM model, which enables the generator to output text related to the given act. We use this model directly for our task, but instead of the dialogue-act vector we input a 1-hot vector representing a set of words.
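For illustration, the word set can be encoded as a binary vector over the keyword vocabulary, with a 1 at the index of each given word. A minimal sketch (the helper name and the kwd_vocab mapping are hypothetical, not part of this repository):

def encode_keywords(word_set, kwd_vocab):
    # kwd_vocab: hypothetical dict mapping each keyword to its index.
    vec = [0.0] * len(kwd_vocab)
    for w in word_set:
        vec[kwd_vocab[w]] = 1.0
    return vec

# e.g. encode_keywords([u'FDA', u'menu'], kwd_vocab)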
The code in this repository is written in Python 2.7 / TensorFlow 0.12; if you use other versions of Python or TensorFlow, you may need to modify some code. Since SC-LSTM is based on the original LSTM, we modified the BasicLSTMCell class of TensorFlow to implement the SC-LSTM model (details in SC_LSTM_Model.py).
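For reference, here is a minimal numpy sketch of one SC-LSTM cell step as described in Wen et al. (2015); the weight names and the params layout are illustrative and do not match the code in SC_LSTM_Model.py:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sc_lstm_step(w, h_prev, c_prev, d_prev, params, alpha=0.5):
    # w: current word embedding; d_prev: keyword (dialogue-act) vector.
    # Biases are omitted for brevity.
    x = np.concatenate([w, h_prev])
    i = sigmoid(params['W_i'].dot(x))   # input gate
    f = sigmoid(params['W_f'].dot(x))   # forget gate
    o = sigmoid(params['W_o'].dot(x))   # output gate
    # Reading gate: controls how much of the keyword vector is retained.
    r = sigmoid(params['W_wr'].dot(w) + alpha * params['W_hr'].dot(h_prev))
    d = r * d_prev
    c_hat = np.tanh(params['W_c'].dot(x))
    # Keyword information is injected directly into the cell state.
    c = f * c_prev + i * c_hat + np.tanh(params['W_d'].dot(d))
    h = o * np.tanh(c)
    return h, c, d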
Training the SC-LSTM model requires text-word_set pairs, but to the best of our knowledge there is no public large-scale dataset, so we can only test this model on public small-scale data. We found a news article dataset annotated using AMT (more details about the corpus can be found in the paper).
The Data/ directory contains three files, TrainingData_Keywords.txt, TrainingData_Text.txt, and vec5.txt (word embeddings trained with word2vec), all created from the news article dataset mentioned above. Each line of TrainingData_Text.txt is a news title, which is treated as one text (one training example). Correspondingly, each line of TrainingData_Keywords.txt is the word set for that text. We use these text-words pairs to train the SC-LSTM model.
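As a rough illustration of how the two files line up (a sketch only, assuming the keywords on each line are whitespace-separated; this is not code from the repository):

with open('Data/TrainingData_Text.txt') as f_text, \
     open('Data/TrainingData_Keywords.txt') as f_kwd:
    # The i-th line of each file belongs to the same training example.
    pairs = [(text.strip(), kwds.strip().split())
             for text, kwds in zip(f_text, f_kwd)]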
Before training the model, set the model parameters in the Config.py file. Then run Preprocess.py to create the sclstm_data file (the training data converted to TensorFlow's binary format; more detail about this can be found in the blog), the word_vec.pkl file (word embeddings), the word_voc.pkl file (text vocabulary), and the kwd_voc.pkl file (keyword vocabulary). At the same time, set the total_step parameter in train.py to the value reported in the output of Preprocess.py.
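Run the preprocessing step:
$ python Preprocess.py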
Start training the model using train.py:
$ python train.py
After training the model, you can generate text controlled by a word set. Modify the generation.py file and set test_word to a set of words. Optionally, you can also set some generation parameters in the Config.py
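For example, assuming test_word takes a list of words, you could use the set sampled below:
test_word = [u'FDA', u'menu']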
file. Generate the text by running:
$ python generation.py
We randomly chose a set of words from the training data, [u'FDA', u'menu']. The training data is too small to get a desired result; some sample outputs are shown below:
Depp Calorie proposes the pregnancy END END PAD
carries FDA Have of pleading fracas END PAD
Privacy proposes FDA milk PAD END PAD PAD
If you have a large-scale dataset, I think you could get much better results.