Code and datasets used in our paper "Named Entity Recognition for Partially Annotated Datasets" (
- Install requirements in requirements.txt
- Create WEXEA dataset:
- Download CoreNLP (
- Adjust the upper-case variables in the Python files accordingly. We are using the
directory of the WEXEA output.
- Download hierarchy.json.tar.gz from
- Extract to data/.
- Run
and set the appropriate category from Wikipedia.
- Type 'y' to keep a category, 'n' to ignore it, and 's' to keep only its subcategories.
- Once all categories have been seen, the algorithm stops.
- Start server (
java -mx4g -cp "<path to CoreNLP>/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -ner.model 4class -ner.applyFineGrained false -ner.statisticalOnly true
- Run
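
To check that the CoreNLP server started above is reachable before running the remaining scripts, a small client can help. The following is a minimal sketch, not the code used in this repository: it assumes the server listens on `http://localhost:9000` as in the command above, and the helper names `build_request`, `extract_tags` and `tag` are hypothetical. It sends plain text to the server's HTTP endpoint and reads the per-token NER labels from the JSON response.

```python
import json
import urllib.parse
import urllib.request


def build_request(text, host="http://localhost:9000"):
    """Build a POST request asking the CoreNLP server for NER tags as JSON."""
    # Annotator list and output format are passed as a JSON "properties" query parameter.
    props = {"annotators": "tokenize,ssplit,pos,lemma,ner", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(props)})
    return urllib.request.Request(
        host + "/?" + query, data=text.encode("utf-8"), method="POST"
    )


def extract_tags(response):
    """Turn the server's JSON response into one list of (word, tag) pairs per sentence."""
    return [
        [(token["word"], token["ner"]) for token in sentence["tokens"]]
        for sentence in response["sentences"]
    ]


def tag(text, host="http://localhost:9000", timeout=15):
    """Send text to a running server (assumed host and port) and return its NER tags."""
    with urllib.request.urlopen(build_request(text, host), timeout=timeout) as resp:
        return extract_tags(json.loads(resp.read().decode("utf-8")))


# Example (requires the server to be running):
# print(tag("Barack Obama was born in Hawaii."))
```

The server-side options from the command line (such as `-ner.statisticalOnly true`) still apply, since the client only selects annotators and the output format.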
Wikipedia category hierarchies, related article names, sentences from Wikipedia and gold annotated datasets for Food and Drugs can be found in the data/ directory.
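
The category hierarchies are built through the interactive 'y'/'n'/'s' selection described in the steps above. The following is a minimal sketch of such a traversal, not the repository's implementation. It assumes the hierarchy is a dict mapping each category to a list of its subcategories (the actual layout of hierarchy.json may differ), that 'y' keeps a category and continues into its subcategories, that 'n' skips the category together with its subtree, and that 's' drops the category itself but still visits its subcategories. The function name `select_categories` is hypothetical.

```python
from collections import deque


def select_categories(hierarchy, root, ask=input):
    """Walk the hierarchy breadth-first and return the categories the user keeps.

    hierarchy: dict mapping a category name to a list of subcategory names (assumed layout).
    ask: callable that shows a prompt and returns the answer (defaults to input()).
    """
    kept = []
    seen = set()
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if category in seen:  # a category can be reachable from several parents
            continue
        seen.add(category)
        answer = ""
        while answer not in ("y", "n", "s"):
            answer = ask(f"{category} [y/n/s]: ").strip().lower()
        if answer == "n":  # ignore the category and everything below it
            continue
        if answer == "y":  # keep the category itself
            kept.append(category)
        # for both 'y' and 's', continue with the subcategories
        queue.extend(hierarchy.get(category, []))
    return kept  # the loop ends once every reachable category has been seen
```

Passing `ask` as a parameter keeps the traversal testable without a terminal, for example by feeding it answers from a list.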
Code for all three models used in the paper is in src/training. Adjust the paths to the datasets in src/training/config/config.json.
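
The dataset paths can be edited by hand, or updated with a small script. The sketch below assumes the config file is a flat JSON object; the key name `train_path` and the helper `update_config` are hypothetical, so check the actual keys in the config file before using it.

```python
import json
from pathlib import Path


def update_config(config_path, overrides):
    """Load a JSON config, overwrite the given top-level keys, and write it back."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config.update(overrides)  # only the keys in overrides change
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Example with a hypothetical key name:
# update_config("config.json", {"train_path": "/path/to/train.txt"})
```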