
Linked Data Triples Enhance Document Relevance Classification

This is a TensorFlow-based implementation of the document relevance classification systems described in the paper Linked Data Triples Enhance Document Relevance Classification.

If you find this work useful, please cite our paper as:

@Article{app11146636,
AUTHOR = {Nagumothu, Dinesh and Eklund, Peter W. and Ofoghi, Bahadorreza and Bouadjenek, Mohamed Reda},
TITLE = {Linked Data Triples Enhance Document Relevance Classification},
JOURNAL = {Applied Sciences},
VOLUME = {11},
YEAR = {2021},
NUMBER = {14},
ARTICLE-NUMBER = {6636},
URL = {https://www.mdpi.com/2076-3417/11/14/6636},
ISSN = {2076-3417},
DOI = {10.3390/app11146636}
}

Instructions

  1. Download data

    Download the files from these links and copy them to the data directory (a gdown sketch for fetching them programmatically appears after the list of links).

    Energy Hub

    Energy Hub Training set - https://drive.google.com/file/d/1-2Rrr4lruYSXNx0r0DUpNzTyRITkFocp/view?usp=sharing
    
    Energy Hub Validation set - https://drive.google.com/file/d/1-AC0WW2FAjdM09YJ6V58R8u7CMiQn7AL/view?usp=sharing
    
    Energy Hub Test set - https://drive.google.com/file/d/1-CvtKz8oxtBW5s6xlt-w1icnEkWyafy9/view?usp=sharing
    

    Reuters

    Reuters Training set - https://drive.google.com/file/d/1-3c2Wqn3544AO2GMdHC6rOcAwakzznML/view?usp=sharing
    
    Reuters Validation set - https://drive.google.com/file/d/1FAruSND8Lh3IGuEpP2MI-OWzq1scRp9Q/view?usp=sharing
    
    Reuters Test set - https://drive.google.com/file/d/1kTks59QOpMu1e1AqcbpWnykFD37wl_hZ/view?usp=sharing
    

    20 News Groups

    20 News Groups Training set - https://drive.google.com/file/d/1--yVr6rj_F-brd0cPqOgBRVpnsOaQ8lj/view?usp=sharing
    
    20 News Groups Validation set - https://drive.google.com/file/d/1-6MrisNQ-aoXA2aHHPT4-OUqDddSG1Kx/view?usp=sharing
    
    20 News Groups Test set - https://drive.google.com/file/d/1-FCIq69HIfsPrXgdOcnRfzrI5wjjPNTR/view?usp=sharing
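
    If you prefer to fetch these files programmatically (for example on Google Colab), the sketch below uses the gdown package (pip install gdown). It is only an illustration: the IDs are the Energy Hub file IDs from the links above, and the same pattern applies to the Reuters and 20 News Groups links.

    import gdown

    # Hedged sketch: download the shared Google Drive files with gdown.
    # The IDs below are the Energy Hub IDs from the links above.
    energy_hub_ids = [
        "1-2Rrr4lruYSXNx0r0DUpNzTyRITkFocp",  # training set
        "1-AC0WW2FAjdM09YJ6V58R8u7CMiQn7AL",  # validation set
        "1-CvtKz8oxtBW5s6xlt-w1icnEkWyafy9",  # test set
    ]
    for file_id in energy_hub_ids:
        # With no output name given, gdown keeps the original filename;
        # move the downloaded files into the data directory afterwards.
        gdown.download(f"https://drive.google.com/uc?id={file_id}", quiet=False)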
    
  2. Download necessary packages
    • Download NLTK stopwords using
      import nltk
      
      nltk.download('stopwords')
      
    • Download Mallet (mallet-2.0.8.zip) from http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip. Unzip it and copy it to the project directory.

      If you use Google Colab:

      !wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
      !unzip mallet-2.0.8.zip
      
    • Download GloVe embeddings (glove.6B.zip) from https://nlp.stanford.edu/data/wordvecs/glove.6B.zip. Unzip it and copy it to the project directory.

      If you use Google Colab:

      !wget https://nlp.stanford.edu/data/wordvecs/glove.6B.zip
      !unzip glove*.zip
      
    • If you choose to build triples with Stanford OpenIE

      Install Stanza: pip install stanza

      Run the following code from a Python shell. To open one, launch Command Prompt or PowerShell on Windows (or a terminal on macOS), type python, and press Enter; alternatively, you can use a Jupyter notebook or Google Colab.

      Change corenlp_dir to a path on your machine; CoreNLP will be installed in that directory.

      import stanza
      
      # Download the Stanford CoreNLP package with Stanza's installation command
      # This will take several minutes, depending on network speed
      
      corenlp_dir = 'path/to/install/corenlp'
      stanza.install_corenlp(dir=corenlp_dir)
      
      # Set the CORENLP_HOME environment variable to point to the installation location
      
      import os
      os.environ["CORENLP_HOME"] = corenlp_dir
      
  3. Build Topic-Entity Triples

    This step involves

    • Training a topic modeler over the corpus
    • Extracting named entities using spaCy
    • Building triples using a dependency parser and POS tagger
    • Applying a topic-entity filter over these triples

    Run the following Python script:

    python data_preprocess.py <dataset>

    Change <dataset> to "energy hub", "reuters" or "20ng" to select the corpus.

  4. Optional step (if you choose to use sentence embeddings with InferSent, execute the following commands)
        !mkdir GloVe
        !curl -Lo GloVe/glove.840B.300d.zip http://nlp.stanford.edu/data/glove.840B.300d.zip
        !unzip GloVe/glove.840B.300d.zip -d GloVe/
        !mkdir fastText
        !curl -Lo fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
        !unzip fastText/crawl-300d-2M.vec.zip -d fastText/
      
        !mkdir encoder
        !curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl
        !curl -Lo encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl
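
    Once these files are in place, sentence embeddings are produced with the InferSent model class from the facebookresearch/InferSent repository (its models.py must be importable). The snippet below is a hedged sketch of that standard usage, with hyperparameters taken from the InferSent documentation rather than from this repository.

        import torch
        from models import InferSent  # models.py from facebookresearch/InferSent

        # Sketch of standard InferSent usage (version 2 with fastText vectors).
        MODEL_PATH = "encoder/infersent2.pkl"
        W2V_PATH = "fastText/crawl-300d-2M.vec"
        params_model = {"bsize": 64, "word_emb_dim": 300, "enc_lstm_dim": 2048,
                        "pool_type": "max", "dpout_model": 0.0, "version": 2}
        model = InferSent(params_model)
        model.load_state_dict(torch.load(MODEL_PATH))
        model.set_w2v_path(W2V_PATH)

        sentences = ["Linked data triples enhance document relevance classification."]
        model.build_vocab(sentences, tokenize=True)
        embeddings = model.encode(sentences, tokenize=True)  # shape: (1, 4096)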
    
  5. Train models

    Run the following Python script:

    python train.py --dataset=<dataset> --model=<model> --embedding=<embedding>

    Change <dataset> to "energy hub", "reuters" or "20ng" to select the corpus.

    Change <model> to one of the following options:

    • text - To use the GloVe-based text model
    • topics - To use topic distributions
    • entities - To use GloVe-enriched named entities
    • triples - To use GloVe-enriched triples
    • text_topics - To use text (GloVe) and topic distributions
    • text_entities - To use text (GloVe) and named entities (GloVe)
    • text_triples - To use text (GloVe) and triples (GloVe)
    • text_topics_entities - To use text (GloVe), topic distributions and named entities (GloVe)
    • text_topics_triples - To use text (GloVe), topic distributions and triples (GloVe)
    • text_entities_triples - To use text (GloVe), named entities (GloVe) and triples (GloVe)
    • text_topics_entities_triples - To use text (GloVe), topic distributions, named entities (GloVe) and triples (GloVe)

    Change <embedding> to "glove", "sentences" or "bert" to select the vector representation.
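
    For example, to train the combined text, topics and triples model on the Reuters corpus with GloVe representations:

    python train.py --dataset=reuters --model=text_topics_triples --embedding=glove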

Contact

Dinesh Nagumothu (dinesh.nagumothu@deakin.edu.au)