How to reproduce the results of the paper? #9
Comments
Hi @pfllo, thanks for trying out the code! The sample code here demonstrates how we introduced attention to the tagging task. As mentioned in "generate_embedding_RNN_output.py", the published code does not include the modeling of slot label dependencies. The benefit of modeling label dependencies depends on the task and dataset. On ATIS, connecting emitted slot labels back to the RNN state improves the F1 score by a small margin. Alternatively, one may add a CRF layer on top of the RNN outputs for sequence-level optimization. Training the model for more epochs is also likely to improve the test F1 score and classification accuracy. Moreover, we also tried using pre-trained word embeddings. On ATIS, the improvement from pre-trained word embeddings is limited, but we did see improvements on some other datasets. Hope the above clarification helps.
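For readers asking what "a CRF layer on top of the RNN outputs" would look like, here is a minimal sketch assuming TensorFlow 1.x with tf.contrib.crf. The tensor names and shapes below are illustrative placeholders, not the repository's actual variables:

import tensorflow as tf

# Illustrative shapes; in the published code these would come from the
# bidirectional RNN + attention outputs and the ATIS slot label vocabulary.
num_tags, max_len = 127, 50
unary_scores = tf.placeholder(tf.float32, [None, max_len, num_tags])  # per-step tag scores
gold_tags = tf.placeholder(tf.int32, [None, max_len])                 # gold slot label ids
seq_lengths = tf.placeholder(tf.int32, [None])                        # true lengths before padding

# Sequence-level training objective from the CRF log-likelihood.
log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(
    unary_scores, gold_tags, seq_lengths)
crf_loss = tf.reduce_mean(-log_likelihood)

# Viterbi decoding with the learned transition matrix at test time.
decoded_tags, _ = tf.contrib.crf.crf_decode(
    unary_scores, transition_params, seq_lengths)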
Hi, I'm trying out this great code. As you say above, modeling the slot label dependencies benefits the F1 score, but the code produces a low intent classification accuracy, about 96.75, compared with what is reported in your paper.
Thanks for the interest in our work. I believe the intent classification accuracy is not likely to improve much with the modeling of label dependencies. The results posted above were obtained by directly running the published code once, and the classification accuracy achieved is around 98.10%, compared to the best value (98.21%) reported in the paper for "Attention BiRNN". The label dependency modeling code we used needs a lot of cleaning and formatting, and we don't have enough bandwidth to handle it at the moment. The implementation we have follows the TensorFlow seq2seq.py decoder example. Thanks.
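As a rough sketch (not the authors' code) of what "following the TensorFlow seq2seq.py decoder example" could mean for label dependency modeling, the previously emitted slot label can be embedded and fed back into the tagging RNN at the next step, similar in spirit to the legacy decoder's loop_function. All names and sizes below are illustrative:

import tensorflow as tf

batch_size, max_len, hidden, num_tags, label_dim = 32, 50, 128, 127, 64

encoder_outputs = tf.placeholder(tf.float32, [batch_size, max_len, hidden])
label_embedding = tf.get_variable("label_emb", [num_tags, label_dim])
cell = tf.nn.rnn_cell.GRUCell(hidden)

state = cell.zero_state(batch_size, tf.float32)
prev_label = tf.zeros([batch_size, label_dim])  # acts as a "GO" label at the first step
logits_per_step = []
for t in range(max_len):
    with tf.variable_scope("label_dep_decoder", reuse=(t > 0)):
        # Condition the current step on the encoder output and the label
        # emitted at the previous step.
        step_input = tf.concat([encoder_outputs[:, t, :], prev_label], axis=1)
        output, state = cell(step_input, state)
        logits = tf.layers.dense(output, num_tags, name="tag_proj")
        logits_per_step.append(logits)
        # Greedy feedback of the predicted label; teacher forcing with the
        # gold label is the usual alternative during training.
        prev_label = tf.nn.embedding_lookup(label_embedding, tf.argmax(logits, axis=1))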
I think you either did more tricks in the data preprocessing step or used the whole training set without leaving aside a validation set (since I don't see an eval result above). Your code contains a validation step, and I got reasonable results on the label tagging task, but I always fail to reproduce the result for the intent classification task. The results I got from running the code are as follows (the highest test accuracy is 96.98):
We used cross validation for hyper-parameter tuning. For the final model evaluation, we used the full training set (4978 training samples) from the original ATIS train/test split. During data preprocessing, we replaced all digits with "DIGIT"; see, e.g., the example in https://github.com/HadoopIt/rnn-nlu/blob/master/data/ATIS_samples/valid/valid.seq.in
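One plausible reading of that preprocessing step, sketched below, is that every digit character is replaced with the token "DIGIT"; the exact convention should be checked against the linked sample file:

import re

def normalize_digits(text):
    # Replace each digit character with "DIGIT",
    # so "1133" becomes "DIGITDIGITDIGITDIGIT" (illustrative, not the repo's script).
    return re.sub(r"\d", "DIGIT", text)

print(normalize_digits("show flights arriving before 1133 am"))
# show flights arriving before DIGITDIGITDIGITDIGIT am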
Thanks for your reply, but it seems that some utterances may have more than one intent (like flight+flight_fare). According to the paper you referred to (JOINT SEMANTIC UTTERANCE CLASSIFICATION AND SLOT FILLING WITH RECURSIVE NEURAL NETWORKS), results were evaluated on the 17 intents. But in your code, you seem to classify sequences whose intent labels contain multiple unit intents, which may hurt the classification performance. How do you deal with this?
This is a good point. We simply used the first intent label as the true label during data preprocessing, for both model training and testing. In total, 15 out of the 893 utterances in the test set have multiple intents. We processed the ATIS corpus from the original .tsv format, not from Vivian's https://github.com/yvchen/JointSLU/tree/master/data, although they are the same ATIS corpus.
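A minimal sketch of "use the first intent label" follows; the '+' separator comes from the flight+flight_fare example above, and the '#' variant is handled as well since other ATIS dumps (e.g., JointSLU) join intents that way:

def primary_intent(label):
    # Keep only the first intent when an utterance carries several,
    # e.g. "flight+flight_fare" -> "flight" (illustrative helper).
    return label.split("+")[0].split("#")[0]

assert primary_intent("flight+flight_fare") == "flight"
assert primary_intent("atis_flight#atis_airfare") == "atis_flight"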
@HadoopIt Hello, we are trying to reproduce the results of the paper and have a few questions.
@HadoopIt Hi, when doing the final model evaluation, you used the full training set (4978 samples). Can you tell me how you got your final result? Did you train the model once, or did you run it multiple times and choose the best result, or report the average result after the model converged?
Hi, thanks for the great code.
I tried running your code on the ATIS data in https://github.com/yvchen/JointSLU/tree/master/data, and got accuracy 96.75 and F1 94.42 after training for 8400 steps. (I replaced the digits in the text with digit*n, where n is the length of the digit sequence)
However, there is still a gap between this result and the published results.
My questions are:
Is this result reasonable for the published code?
What else should I do to reproduce the published results, other than implementing the tag dependency? Are there any important tricks?
Should I use the default hyper-parameters in the code, or another set of hyper-parameters?
Thanks a lot.