Skip to content

Latest commit

 

History

History
36 lines (30 loc) · 2.3 KB

README.md

File metadata and controls

36 lines (30 loc) · 2.3 KB

SDG Classifier

The github for the sustainability project for the course Text Mining Domains at Vrije Universiteit Amsterdam (2021). Students: Myrthe Buckens, Dyon van der Ende and Eva den Uijl

data

In the data folder, you will find the following files:

  • original data file provided by Aurora Universities Alliance: SDG-1-17-basic-excl-target13.0.csv
  • preprocessed training data with 2 SDG's and negative class: SDG-training.csv
  • preprocessed test data with 2 SDG's and negative class: SDG-test.csv
  • preprocessed training data with all SDG's: SDG-training-all.csv
  • preprocessed test data with all SDG's: SDG-test-all.csv
  • file with predictions for baseline: SDG-predictions-baseline.csv
  • file with predictions for RoBERTa on title: SDG-predictions-title.csv
  • file with predictions for RoBERTa on abstract: SDG-predictions-abstract.csv

requirements

The needed requirements can be found in requirements and installed by running pip install requirements from your terminal.

code

In the code folder, you will find the following scripts, to be run with the specified arguments:

  • baseline system: baseline.py <test data> <output location>
  • preprocessing script: preprocessing.py <data set>
  • basic statistics on the data: basic_stats.py <data set>
  • RoBERTa classifier with abstracts: RoBERTa_classifier_abstract.py <train data> <test data> <output location>
  • RoBERTa classifier with titles: RoBERTa_classifier_title.py <train data> <test data> <output location>
  • evaluation script: evaluation.py <classifier predictions>

For running the mBERT models, we used google colaboration. It is possible to download this code on top of the page as .ipynb or .py, but for speed and possible hardware limitations, we advise you run the code online with the GPU from google. The code can be found by following the links below: