Source code for the EMNLP 2020 paper: "Cold-start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks", Chengyue Jiang, Yinggong Zhao, Shanbo Chu, Libin Shen, and Kewei Tu.
@inproceedings{jiang-etal-2020-cold,
    title = "Cold-start and Interpretability: Turning Regular Expressions into Trainable Recurrent Neural Networks",
    author = "Jiang, Chengyue and
      Zhao, Yinggong and
      Chu, Shanbo and
      Shen, Libin and
      Tu, Kewei",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.258",
    pages = "3193--3207",
}
- pytorch 1.3.1
- tensorly 0.5.0
- numpy
- tqdm
- automata-tools
- pyparsing
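Assuming a standard pip setup (the PyTorch package on PyPI is named torch), one way to install these is:

```bash
pip install torch==1.3.1 tensorly==0.5.0 numpy tqdm automata-tools pyparsing
```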
Raw dataset files, preprocessed dataset files, the GloVe word embedding matrix, rules for each dataset, and the decomposed automata files can be downloaded here: Google Drive, Tencent Drive.
Download and extract the zip file, and replace the original data directory. The directory structure should be:
.
├── data
│ ├── ATIS
│ │ ├── automata
│ │ ├── ....
│ ├── TREC
│ │ ├── automata
│ │ ├── ....
│ ├── ....
├── src
│ ├── ....
├── src_simple
│ ├── ....
├── model
│ ├── ....
├── imgs
│ ├── ....
Once you have done this, you can skip to the training part.
We provide the RE rules for three datasets: ATIS, QC (TREC-6), and SMS. Our REs are word-level, not character-level. The symbols and their meanings are shown in the following table.
Symbol | Meaning |
---|---|
$ | wildcard |
% | numbers, e.g. 5, 1996 |
& | punctuation |
? | 0 or 1 occurrence |
* | zero or more occurrences |
+ | one or more occurrences |
(a\|b) | a or b |
ATIS - abbreviation label.
[abbreviation]
( $ * ( mean | ( stand | stands ) for | code ) $ * ) | ( $ * what is $ EOS )
SMS - spam label.
[spam]
$ * dating & * $ * call $ *
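To make the word-level semantics concrete, here is a hypothetical sketch that translates such a rule into an ordinary character-level Python regex over a space-joined token string. The token classes and translation scheme are illustrative assumptions (alternation like (a|b) is omitted), not the repo's actual parser, which builds automata via automata-tools.

```python
import re

# Illustrative only: translate a word-level RE into a char-level regex.
# Alternation/grouping are omitted for brevity.
TOKEN_CLASSES = {"$": r"\S+", "%": r"\d+", "&": r"[^\w\s]+"}
QUANTIFIERS = {"?", "*", "+"}

def word_re_to_char_re(word_re):
    parts = []
    for sym in word_re.split():
        if sym in QUANTIFIERS:
            # a quantifier applies to the preceding token group
            parts[-1] = "(?:%s)%s" % (parts[-1], sym)
        else:
            parts.append("(?:%s )" % TOKEN_CLASSES.get(sym, re.escape(sym)))
    return "^" + "".join(parts) + "$"

pattern = word_re_to_char_re("$ * dating & * $ * call $ *")
sentence = "free dating service , just call 0906"
print(bool(re.match(pattern, sentence + " ")))  # True
```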
We show examples on the ATIS dataset; for the other datasets, simply change the --dataset option to TREC or SMS.
You first need to download the GloVe 6B embeddings and place the embedding files into data/emb/glove.6B/.
You can also prepare the dataset from the raw dataset files by running the following command.
python data.py --dataset ATIS
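If helpful, here is a minimal sketch of reading the GloVe text format into a word-to-vector dictionary (the 100d file name is an assumption; pick whichever dimensionality your setup uses):

```python
import numpy as np

# GloVe's text format is "word v1 v2 ... vd", one word per line.
embeddings = {}
with open("data/emb/glove.6B/glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        embeddings[word] = np.asarray(values, dtype=np.float32)

print(embeddings["flight"].shape)  # (100,)
```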
We turn the regular expressions into finite automata using our automata-tools package, implemented by @linonetwo. The tool is adapted from https://github.com/sdht0/automata-from-regex. The package requires the 'dot' command for drawing the automata.
Alternatively, run the following commands to convert the REs / reversed REs (for the backward direction) into FAs.
python create_automata.py --dataset ATIS --automata_name all --reversed 0
python create_automata.py --dataset ATIS --automata_name all --reversed 1
The regular expression for the ATIS - abbreviation label mentioned above can be represented by the following automaton.
The RE system's result is obtained by running the un-decomposed automaton you just created:
python main.py --model_type Onehot --dataset ATIS --only_probe 1 --wfa_type viterbi \
--epoch 0 --automata_path_forward all.1105155534-1604591734.6171093.split.pkl --seed 0
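For intuition, here is a minimal numpy sketch of the two wfa_type scoring modes, assuming the automaton is stored as a dense 0/1 tensor T of shape (vocab, states, states) with initial vector alpha and final vector omega (names and shapes are illustrative, not the repo's on-disk format): forward sums over all paths, viterbi keeps the best one.

```python
import numpy as np

# Illustrative only: score word_ids with a (W)FA given as T (V, S, S),
# initial vector alpha (S,), and final vector omega (S,).
def wfa_score(T, alpha, omega, word_ids, wfa_type="forward"):
    h = alpha
    for w in word_ids:
        if wfa_type == "forward":                 # sum over paths
            h = h @ T[w]
        else:                                     # "viterbi": max over paths
            h = (h[:, None] * T[w]).max(axis=0)
    return h @ omega if wfa_type == "forward" else (h * omega).max()

rng = np.random.default_rng(0)
T = (rng.random((5, 3, 3)) < 0.4).astype(float)   # toy 0/1 automaton
print(wfa_score(T, np.eye(3)[0], np.eye(3)[2], [1, 4, 2], "viterbi"))
```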
We decompose the FAs using tensor-rank decomposition.
Run the following command to decompose the automata.
python decompose_automata.py --dataset ATIS --automata_name automata_name --rank 150 --init svd
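A minimal tensorly sketch of what this step computes, on a random stand-in for the (V, S, S) transition tensor; decompose_automata.py's actual I/O and options differ:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Illustrative only: CP-decompose a (V, S, S) tensor into rank-R factors.
# In tensorly 0.5.0, parafac returns (weights, factors).
V, S, R = 50, 20, 10
T = tl.tensor(np.random.rand(V, S, S))
weights, (E_R, D1, D2) = parafac(T, rank=R, init="svd")  # (V,R), (S,R), (S,R)

# The relative reconstruction error shows how faithful rank R is.
T_hat = tl.cp_to_tensor((weights, [E_R, D1, D2]))
print(float(tl.norm(T - T_hat) / tl.norm(T)))
```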
To train the FA-RNNs on ATIS, SMS, and TREC, make sure you have finished the above steps. Then let's train an FA-RNN initialized from the decomposed automata. If you have downloaded the automata and placed them in the right location, you can run:
python main.py --dataset ATIS --run save_dir --model_type FSARNN --beta 0.9 \
--wfa_type forward --seed 0 --lr 0.001 --bidirection 0 --farnn 0 --random 0
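For intuition about what gets trained, here is a minimal PyTorch sketch of the FA-RNN recurrence under the CP factorization; the factor names (D1, D2, rank-space word embeddings E_R) and shapes are assumptions for illustration, not the repo's exact module:

```python
import torch

# Illustrative only: FA-RNN update h_t = h_{t-1} D1 diag(E_R[x_t]) D2^T,
# with D1, D2 of shape (S, R) and E_R of shape (V, R) taken from the
# decomposition, so the untrained model already simulates the automaton.
class TinyFARNN(torch.nn.Module):
    def __init__(self, D1, D2, E_R, alpha):
        super().__init__()
        self.D1 = torch.nn.Parameter(D1)    # (S, R)
        self.D2 = torch.nn.Parameter(D2)    # (S, R)
        self.E_R = torch.nn.Parameter(E_R)  # (V, R)
        self.register_buffer("h0", alpha)   # initial state vector, (S,)

    def forward(self, word_ids):
        h = self.h0
        for w in word_ids:
            v = self.E_R[w]                         # (R,) rank-space embedding
            h = ((h @ self.D1) * v) @ self.D2.t()   # h D1 diag(v) D2^T
        return h                                    # final state scores, (S,)
```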
Please check the function get_automata_from_seed in utils/utils.py to understand which automaton you are using. If you use newly decomposed automata, you need to specify the --automata_path_forward and --automata_path_backward options.
For example: FARNN
python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 0 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 0 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl
For example: BiFARNN
python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 1 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 0 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl \
--automata_path_backward automata.newrule.reversed.randomseed150.False.0.0735.0.pkl
For example: FAGRU
python main.py --dataset ATIS --run save_dir --model_type FSARNN --bidirection 0 \
--beta 0.9 --wfa_type forward --seed 0 --lr 0.001 --farnn 1 --normalize_automata l2 \
--automata_path_forward automata.newrule.split.randomseed150.False.0.0003.0.pkl
We also provide a cleaner version of the code in /src_simple, which removes some options and unimportant code and contains only the FA-RNN-related code.
As an example:
python main_min.py --dataset ATIS --run save_dir --model_type FSARNN --beta 0.3
You first need to download the FA-RNN models and config files here: Google Drive, Tencent Drive. Please place the files in the /model directory.
To see the logs and hyper-parameters of the provided models, simply use pickle to load the .res config files. For example, to obtain the hyper-parameters for the model D0.9739-T0.9653-DI0.8655-TI0.8645-1106095843-1604656723.555744-ATIS-0.model, you can run:
import pickle
pickle.load(open('1106095843-1604656723.5809364.res', 'rb'))
Note that some unused hyper-parameters in the config files were removed in the final/simple version, so a config file may not be directly usable; just filter out the unused hyper-parameters.
We provide several examples showing how to convert the trained model parameters back into WFAs and threshold them into NFAs. See jupyter/checkTrainedRules.ipynb.
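In spirit, the conversion recomposes the transition tensor from the trained factors and thresholds it into 0/1 transitions; here is a hedged numpy sketch (the factor names and the 0.3 cutoff are assumptions; see the notebook for the real procedure):

```python
import numpy as np

# Illustrative only: rebuild the (V, S, S) WFA tensor from CP factors,
# then threshold it into boolean NFA transitions.
def factors_to_nfa(E_R, D1, D2, threshold=0.3):
    T = np.einsum("vr,sr,tr->vst", E_R, D1, D2)  # recompose transition tensor
    return T > threshold

rng = np.random.default_rng(0)
nfa = factors_to_nfa(rng.random((50, 10)),
                     rng.random((20, 10)),
                     rng.random((20, 10)))
print(nfa.shape, nfa.dtype)  # (50, 20, 20) bool
```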