This repository holds the code for the paper "H-FND: Hierarchical False-Negative Denoising for Distant Supervision Relation Extraction", to appear in the Findings of ACL 2021.
Preprocessing of the human-annotated datasets (SemEval, TACRED) and the distantly supervised dataset (NYT10) is handled in preprocess/preprocessor.py.
Each class in that file handles:
- The parsing of raw dataset inputs
- The generation of synthetic noise (for human-annotated datasets)
- The conversion of the dataset into a RelationDataset-compatible format
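The noise-generation step can be illustrated with a minimal sketch. It assumes the synthetic noise is false-negative noise, as the paper title suggests: positive relation labels are randomly replaced by the negative class. The function name `inject_false_negatives`, the label string `no_relation`, and the `noise_rate` parameter are hypothetical and are not taken from preprocess/preprocessor.py.

```python
import random

# Assumed name of the negative ("no relation") class; real datasets use their own label.
NA_LABEL = "no_relation"


def inject_false_negatives(labels, noise_rate, seed=0):
    """Return a copy of `labels` in which each positive label is replaced
    by NA_LABEL with probability `noise_rate`. Negative labels are kept."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    noisy = []
    for label in labels:
        if label != NA_LABEL and rng.random() < noise_rate:
            noisy.append(NA_LABEL)  # simulate a false negative
        else:
            noisy.append(label)
    return noisy


if __name__ == "__main__":
    clean = ["Cause-Effect", NA_LABEL, "Component-Whole", "Cause-Effect"]
    print(inject_false_negatives(clean, noise_rate=0.5, seed=1))
```

Fixing the seed keeps the noisy labels identical across runs, which makes comparisons between denoising methods easier to reproduce.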
Modify the paths that point to the raw data files in the code to fit your file locations.
Each model file contains the model definition, the trainer function, and the run function.
- base/ holds the base CNN/PCNN models and their training process.
- denoise/ contains the code of H-FND, as well as other denoising baselines:
  - denoise/rl.py: H-FND
  - denoise/cleanlab.py: Cleanlab
  - denoise/coteaching.py: Co-teaching
- util/ contains the utility functions that are shared among models:
  - data.py contains the definition of the class RelationDataset.
  - embedding.py contains the wrapper functions to retrieve spaCy embeddings.
  - measure.py defines the micro-F1 score and the accuracy measurements.
  - tokenizer.py handles the tokenization that is called during the construction of RelationDataset objects.
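For readers unfamiliar with the two measurements, the following is a minimal sketch of micro-F1 and accuracy for relation extraction. It assumes the common convention of excluding the negative class from precision and recall; the actual code in util/measure.py may differ. The function names and the `no_relation` label are hypothetical.

```python
def micro_f1(gold, pred, na_label="no_relation"):
    """Micro-averaged F1 over positive relations only (na_label is excluded)."""
    # A prediction counts as correct only if it matches gold and is a positive relation.
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != na_label)
    n_pred = sum(1 for p in pred if p != na_label)  # predicted positives
    n_gold = sum(1 for g in gold if g != na_label)  # gold positives
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    if not gold:
        return 0.0
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


if __name__ == "__main__":
    gold = ["A", "B", "no_relation", "A"]
    pred = ["A", "no_relation", "A", "A"]
    print(micro_f1(gold, pred), accuracy(gold, pred))
```

Excluding the negative class matters under false-negative noise: a model that predicts the negative class everywhere can still reach high accuracy, while its micro-F1 is zero.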
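To run the code, first prepare the raw datasets. Then update the paths in the code to match your file locations.
Next, change the hyperparameters and paths in each train_{dataset_name}_()
function, then run the module using python3 -m {path_to_.py_file}. Note that python3 -m expects a dotted module name rather than a file path, for example denoise.rl for denoise/rl.py when run from the repository root.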
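As a rough illustration of this workflow, a runnable module can keep its paths and hyperparameters as defaults of the run function and call it under a main guard, so that python3 -m executes it. Everything below is hypothetical: the function name `train_example_`, its parameters, and the placeholder path are stand-ins, not the repository's actual code.

```python
def train_example_(data_path="path/to/raw/data", learning_rate=1e-3, epochs=1):
    """Stand-in for a train_{dataset_name}_() run function.

    Edit the default path and hyperparameters here before running the module.
    """
    config = {
        "data_path": data_path,
        "learning_rate": learning_rate,
        "epochs": epochs,
    }
    # A real run function would build the dataset and the model here,
    # then hand both to the trainer function.
    return config


if __name__ == "__main__":
    # This branch runs when the module is started with `python3 -m <module>`.
    print(train_example_())
```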