Code and datasets used in our paper "Named Entity Recognition for Partially Annotated Datasets" (https://arxiv.org/abs/2204.09081).
- Install the requirements in `requirements.txt` (e.g. via `pip install -r requirements.txt`).
- Create the WEXEA dataset: https://github.com/mjstrobl/WEXEA
- Download CoreNLP (https://stanfordnlp.github.io/CoreNLP/download.html).
- Adjust the upper-case variables in the Python files accordingly. We are using the `article_2` directory of the WEXEA output (see the sketch below).
- Download hierarchy.json.tar.gz from https://drive.google.com/file/d/1MIWoUikaRVxrZrR_WlKEYc8TyNXqri0q/view?usp=sharing
- Extract to data/.
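The exact variable names depend on the individual scripts; purely as a hypothetical sketch (none of these names are taken from the repo), the upper-case constants to adjust are module-level paths along these lines:

```python
# Hypothetical illustration only; the real constant names live in the repo's scripts.
WEXEA_ARTICLES_DIR = "/path/to/wexea/output/article_2"  # the WEXEA `article_2` output directory
DATA_DIR = "data/"                                      # where hierarchy.json.tar.gz was extracted
CORENLP_URL = "http://localhost:9000"                   # the CoreNLP server started below
```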
- Run `src/typed_hierarchy_creator.py` and set the appropriate category from Wikipedia.
- Type 'y' to keep a category, 'n' to ignore it, and 's' to keep its subcategories only.
- Once all categories have been seen, the algorithm stops (see the sketch below).
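For orientation, here is a minimal sketch of how such an interactive y/n/s traversal over a category hierarchy can work. It is not the repository's actual code, and the semantics of 'n' (skipping the whole subtree) is an assumption:

```python
from collections import deque

def select_categories(hierarchy, root):
    """Interactive y/n/s selection over a category hierarchy (hypothetical sketch).

    hierarchy: dict mapping a category name to a list of subcategory names.
    Returns the set of categories kept by the user.
    """
    kept, seen = set(), {root}
    queue = deque([root])
    while queue:  # stops once every reachable category has been seen
        category = queue.popleft()
        answer = ""
        while answer not in ("y", "n", "s"):
            answer = input(f"{category} [y=keep / n=ignore / s=subcategories only]: ").strip().lower()
        if answer == "y":
            kept.add(category)
        if answer in ("y", "s"):  # assumed: 'n' skips the whole subtree
            for sub in hierarchy.get(category, []):
                if sub not in seen:
                    seen.add(sub)
                    queue.append(sub)
    return kept
```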
- Start the CoreNLP server (https://stanfordnlp.github.io/CoreNLP/corenlp-server.html):

  `java -mx4g -cp "<path to CoreNLP>/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000 -ner.model 4class -ner.applyFineGrained false -ner.statisticalOnly true`
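To check that the server is reachable before running the pipeline, you can send it a quick query; a minimal sketch, assuming the default port 9000 used above and the third-party `requests` package:

```python
import json
import requests

# Ask the running CoreNLP server to tokenize and NER-tag a test sentence.
properties = {"annotators": "tokenize,ssplit,pos,lemma,ner", "outputFormat": "json"}
response = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(properties)},
    data="Barack Obama was born in Hawaii.".encode("utf-8"),
)
for sentence in response.json()["sentences"]:
    for token in sentence["tokens"]:
        print(token["word"], token["ner"])
```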
- Run `process_hierarchy.sh`.
Wikipedia category hierarchies, related article names, sentences from Wikipedia, and gold-annotated datasets for Food and Drugs can be found in the `data/` directory.
The code for all three models used in the paper can be found in `src/training`. Paths to the datasets need to be adjusted in `src/training/config/config.json`.
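The key names inside config.json are defined by the repo; purely as a hypothetical sketch (the keys `train_path` and `dev_path` below are invented for illustration), the adjustment can also be scripted:

```python
import json

CONFIG_PATH = "src/training/config/config.json"

# Load the training config and point the dataset paths at local copies.
with open(CONFIG_PATH) as f:
    config = json.load(f)

config["train_path"] = "data/food/train.txt"  # hypothetical key and path
config["dev_path"] = "data/food/dev.txt"      # hypothetical key and path

with open(CONFIG_PATH, "w") as f:
    json.dump(config, f, indent=2)
```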