TopNCosSimAvg

Code for the RELATIONS 2019 Workshop paper
Semantic Matching of Documents from Heterogeneous Collections: A Simple and Transparent Method for Practical Applications (arXiv)

Installation

Create a new virtual environment with Python 3.6 first:

$ conda create --name topn python=3.6
$ conda activate topn

Clone this repository:

$ (topn) git clone https://github.com/nlpAThits/TopNCosSimAvg.git

Put the required files into the folders data, wombat-data, concept-project-mapping-dataset, and fastText (see the respective README.md files in these folders).

The code in this repository uses the WOMBAT-API, which can be installed as follows:

$ (topn) git clone https://github.com/nlpAThits/WOMBAT.git
$ (topn) cd WOMBAT
$ (topn) pip install .

Finally, install the following libraries:

$ (topn) conda install scipy scikit-learn gensim matplotlib colorama tqdm nltk==3.2.5

Tuning: AVG_COS_SIM

For the AVG_COS_SIM measure, tuning comprises a brute-force search for the optimal value for the sim_ts parameter (the minimum cosine similarity).
The start, end, and step values for sim_ts can be supplied like this:
--sim_ts start:end:step,
where start, end, and step must be floats.
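The brute-force search described above can be sketched as follows. This is a minimal illustration, not the code in perform-c-p-matching.py: the helper names (expand_range, f1_at_threshold, best_threshold) are hypothetical, including the end value in the range is an assumption, and F1 is assumed as the score to maximize.

```python
from decimal import Decimal


def expand_range(spec):
    """Expand 'start:end:step' into a list of floats.

    Assumption: the end value is included. Decimal avoids the rounding
    drift that repeated float addition would cause for steps like 0.005.
    """
    start, end, step = (Decimal(part) for part in spec.split(":"))
    if step <= 0:
        raise ValueError("step must be positive")
    values = []
    current = start
    while current <= end:
        values.append(float(current))
        current += step
    return values


def f1_at_threshold(scores, labels, threshold):
    """F1 when every pair with score >= threshold is classified as a match."""
    predicted = [score >= threshold for score in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels, spec):
    """Try every threshold in the range; keep the first one with the highest F1."""
    return max(expand_range(spec),
               key=lambda t: f1_at_threshold(scores, labels, t))
```

With this sketch, expand_range("0.3:1.0:0.005") yields 141 candidate values for sim_ts, each of which is evaluated once.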

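The following call searches the whole sim_ts range for the 'label' input and all four unit types. The unit types differ in whether term frequency (tf) and inverse document frequency (idf) weighting are switched on (+) or off (-):
types = -tf -idf
tokens = +tf -idf
idf_types = -tf +idf
idf_tokens = +tf +idf

Passing --plot_curves yes writes the result plots to the ./plots/ folder.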

$ (topn) python perform-c-p-matching.py --input label --embeddings google --measures avg_cos_sim \
  --sim_ts 0.3:1.0:0.005 --units types,tokens,idf_types,idf_tokens --mode dev --plot_curves yes
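One way to read the four unit types is as weighting schemes for averaging word vectors before the cosine similarity is computed. The sketch below rests on that assumption and is not taken from the repository: the names unit_weights, weighted_average, and cosine are hypothetical, embeddings are plain Python lists, and words without an idf value default to a weight factor of 1.0.

```python
import math
from collections import Counter

# Unit type -> (use term frequency, use idf), following the table above.
UNITS = {
    "types": (False, False),
    "tokens": (True, False),
    "idf_types": (False, True),
    "idf_tokens": (True, True),
}


def unit_weights(tokens, idf, use_tf, use_idf):
    """Weight per distinct word: its count if tf is on, times its idf if idf is on."""
    weights = {}
    for word, count in Counter(tokens).items():
        weight = count if use_tf else 1
        if use_idf:
            weight *= idf.get(word, 1.0)  # assumption: unknown words get idf 1.0
        weights[word] = weight
    return weights


def weighted_average(tokens, vectors, idf, unit):
    """Weighted mean of the vectors of all words that have an embedding."""
    use_tf, use_idf = UNITS[unit]
    weights = unit_weights(tokens, idf, use_tf, use_idf)
    known = {w: wt for w, wt in weights.items() if w in vectors}
    dim = len(next(iter(vectors.values())))
    total = sum(known.values())
    if total == 0:
        return [0.0] * dim
    return [sum(wt * vectors[w][i] for w, wt in known.items()) / total
            for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Under these assumptions, two texts would count as a match when the cosine similarity of their averaged vectors reaches sim_ts.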

[Figure: tuning results for avg_cos_sim]

Tuning: TOP_N_COS_SIM_AVG

For the TOP_N_COS_SIM_AVG measure, tuning comprises a brute-force search for the optimal value of the sim_ts parameter (the minimum cosine similarity, cf. above), plus a search for the optimal value of the top_n parameter.

The range of values for top_n to test can be supplied like this:
--top_n start:end:step,
where start, end, and step must be integers.
One row in the plot will be created for every value of top_n.

$ (topn) python perform-c-p-matching.py --input label --embeddings google \
  --measures top_n_cos_sim_avg --top_n 2:30:2 \
  --sim_ts 0.3:1.0:0.005 --units types,tokens,idf_types,idf_tokens --mode dev --plot_curves yes
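As the measure's name suggests, TOP_N_COS_SIM_AVG can be read as the average of the top_n highest pairwise cosine similarities between the words of two texts. The following is a minimal sketch under that reading, not the repository's implementation: the function names are hypothetical, unit weighting is left out, and sim_ts is assumed to act as a decision threshold on the final score.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_cos_sim_avg(vectors_a, vectors_b, top_n):
    """Average of the top_n highest similarities over all word pairs.

    If fewer than top_n pairs exist, all available pairs are averaged.
    """
    sims = sorted((cosine(a, b) for a in vectors_a for b in vectors_b),
                  reverse=True)
    top = sims[:top_n]
    return sum(top) / len(top) if top else 0.0


def is_match(score, sim_ts):
    """Assumption: a pair counts as a match if its score reaches sim_ts."""
    return score >= sim_ts
```

Tuning would then evaluate every combination of top_n and sim_ts from the two ranges, which is why the plot has one row per top_n value.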

[Figure: tuning results for top_n_cos_sim_avg]

Reproducing the published results

DEV results avg_cos_sim

The following call will reproduce the top avg_cos_sim result reached when only label information is used.

$ (topn) python perform-c-p-matching.py  --mode dev --input label       --sim_ts .430 --units idf_tokens \
    --embeddings google  --measures avg_cos_sim  --print_classifications yes

Since the top avg_cos_sim results all use essentially the same settings, change only the values of --input and --sim_ts to reproduce the other top baseline results.

$ (topn) python perform-c-p-matching.py  --mode dev --input description --sim_ts .530 --units idf_tokens \
    --embeddings google  --measures avg_cos_sim  --print_classifications yes
$ (topn) python perform-c-p-matching.py  --mode dev --input both        --sim_ts .545 --units idf_tokens \
    --embeddings google  --measures avg_cos_sim  --print_classifications yes

DEV results top_n_cos_sim_avg

Likewise, the following calls will reproduce the top three top_n_cos_sim_avg results:

$ (topn) python perform-c-p-matching.py  --mode dev --input label       --sim_ts .345 --units tokens \
    --embeddings google  --measures top_n_cos_sim_avg --top_n 22 --print_classifications yes
$ (topn) python perform-c-p-matching.py  --mode dev --input description --sim_ts .345 --units idf_tokens \
    --embeddings glove --measures top_n_cos_sim_avg --top_n 6 --print_classifications yes
$ (topn) python perform-c-p-matching.py  --mode dev --input both        --sim_ts .310 --units idf_tokens \
    --embeddings fasttext --measures top_n_cos_sim_avg --top_n 14 --print_classifications yes