hpi-dhc/ggponc

GGPONC - The German Clinical Guideline Corpus for Oncology

GGPONC Annotations in INCepTION

This repository collects resources related to GGPONC.

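It covers:

- Clinical Named Entity Recognition (data loading, nested NER with spaCy Spancat, flat NER)
- UMLS Entity Linking with xMEN
- Resolution of Coordination Ellipses
- Molecular Named Entities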

See also:

| Repository | Description |
| --- | --- |
| ggponc_annotation | GGPONC 2.0 Results and Gold Standard Annotations |
| ggponc_preprocessing | Pre-Processing Pipeline (Tokenization, POS Tagging) and GGPONC 1.0 Results |
| ggponc_ellipses | Resolving Elliptical Compounds in German Medical Text |
| ggponc_molecular | GGTWEAK - Gene Tagging with Weak Supervision for German Clinical Text |

Preparation

  1. Get access to GGPONC following the instructions on the project homepage and place the contents of the 2.0 release in the `data` folder. The data loading example in this README reads from `data/v2.0_2022_03_24`.
  2. Install the Python dependencies: `pip install -r requirements.txt`

Clinical Named Entity Recognition

Data Loading

A BigBIO-compatible data loader for the latest gold-standard annotations (GGPONC 2.0), intended for training NER models, is available through the Hugging Face Hub: https://huggingface.co/datasets/bigbio/ggponc2

```python
from datasets import load_dataset

dataset = load_dataset('bigbio/ggponc2', data_dir='data/v2.0_2022_03_24', name='ggponc2_fine_long_bigbio_kb')
```
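The configuration name ends in `_bigbio_kb`, so each loaded document should follow the BigBIO knowledge-base schema, in which entity mentions are stored under an `entities` key with `type`, `text`, and `offsets` fields. The following is a minimal sketch of flattening those mentions for inspection. `iter_mentions` and the toy record are illustrative and not part of this repository, and the entity label is a placeholder rather than a value taken from the corpus.

```python
from typing import Dict, Iterator, List, Tuple


def iter_mentions(doc: Dict) -> Iterator[Tuple[str, str, List[Tuple[int, int]]]]:
    """Yield (type, text, offsets) for every entity in one kb-schema document."""
    for entity in doc.get("entities", []):
        # "text" and "offsets" are lists because a mention may be discontinuous.
        text = " ".join(entity["text"])
        offsets = [tuple(span) for span in entity["offsets"]]
        yield entity["type"], text, offsets


# Toy record shaped like the kb schema (made-up content, placeholder label).
toy_doc = {
    "document_id": "toy-1",
    "entities": [
        {
            "id": "e1",
            "type": "PLACEHOLDER_LABEL",
            "text": ["Strahlentherapie"],
            "offsets": [[10, 26]],
            "normalized": [],
        },
    ],
}

mentions = list(iter_mentions(toy_doc))

# With the real corpus (requires the GGPONC release in data/), the same helper
# could be applied per document, assuming a "train" split exists:
#   for doc in dataset["train"]:
#       for label, text, offsets in iter_mentions(doc):
#           ...
```

Keeping the helper independent of `datasets` makes it easy to try on a single record before iterating over the full corpus.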

Nested NER with spaCy Spancat

A trained spaCy model for nested NER is available on Hugging Face: https://huggingface.co/phlobo/de_ggponc_medbertde

```shell
huggingface-cli download phlobo/de_ggponc_medbertde de_ggponc_medbertde-any-py3-none-any.whl --local-dir .
pip install -q de_ggponc_medbertde-any-py3-none-any.whl
```
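Once the wheel is installed, the pipeline can presumably be loaded by its package name with `spacy.load`. spaCy's span categorizer writes its predictions to `doc.spans` under a configurable key, which is `"sc"` by default. That this model uses the default key, and that the package name matches the wheel name, are assumptions in the sketch below. `tag_text` and `contained_pairs` are hypothetical helpers, and the toy labels are placeholders.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def tag_text(text: str, model_name: str = "de_ggponc_medbertde",
             spans_key: str = "sc") -> List[Span]:
    """Run the installed pipeline and return its span predictions as tuples."""
    import spacy  # imported lazily so contained_pairs works without spaCy

    nlp = spacy.load(model_name)
    doc = nlp(text)
    # spans_key="sc" is spaCy's default for spancat; assumed here for this model.
    return [(s.start_char, s.end_char, s.label_) for s in doc.spans[spans_key]]


def contained_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies strictly inside outer."""
    pairs = []
    for outer in spans:
        for inner in spans:
            inside = outer[0] <= inner[0] and inner[1] <= outer[1]
            shorter = (inner[1] - inner[0]) < (outer[1] - outer[0])
            if inside and shorter:
                pairs.append((outer, inner))
    return pairs


# Toy spans over "adjuvante Chemotherapie mit Cisplatin" (placeholder labels).
toy_spans = [(0, 37, "LABEL_A"), (28, 37, "LABEL_B")]
nested = contained_pairs(toy_spans)
```

With the model installed, `contained_pairs(tag_text("..."))` would list the nested mentions that a flat NER model cannot represent.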

See: 01_GGPONC_Nested_NER

Flat NER

Training and evaluation of the flat NER models described in Borchert et al. (2022) are covered in the GGPONC 2.0 repository (ggponc_annotation).

UMLS Entity Linking with xMEN

We use the xMEN toolkit with a pre-trained re-ranker to normalize identified entity mention spans to UMLS codes.
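Entity linking with a re-ranker is commonly organized in two stages: candidate generation, followed by re-ranking of those candidates. The sketch below is a toy illustration of that idea only. It does not use xMEN's API, the concept identifiers are invented placeholders rather than UMLS codes, and the character-trigram similarity and type bonus merely stand in for the trained components of a real system.

```python
from typing import Dict, List, Set, Tuple


def trigrams(s: str) -> Set[str]:
    """Character trigrams of a lower-cased, padded string."""
    padded = f"  {s.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def generate_candidates(mention: str, kb: Dict, k: int = 3) -> List[Tuple[str, float]]:
    """Stage 1: keep the k concepts whose best alias is most similar."""
    scored = [
        (max(similarity(mention, name) for name in entry["names"]), cid)
        for cid, entry in kb.items()
    ]
    scored.sort(reverse=True)
    return [(cid, score) for score, cid in scored[:k]]


def rerank(candidates: List[Tuple[str, float]], mention_label: str, kb: Dict,
           type_bonus: float = 0.2) -> List[str]:
    """Stage 2 (toy): boost candidates whose type matches the mention label."""
    rescored = [
        (score + (type_bonus if kb[cid]["type"] == mention_label else 0.0), cid)
        for cid, score in candidates
    ]
    return [cid for _, cid in sorted(rescored, reverse=True)]


# Hypothetical mini knowledge base; IDs and types are placeholders.
TOY_KB = {
    "TOY:0001": {"names": ["Strahlentherapie", "Radiotherapie"], "type": "TYPE_A"},
    "TOY:0002": {"names": ["Strahlenschaden"], "type": "TYPE_B"},
    "TOY:0003": {"names": ["Chemotherapie"], "type": "TYPE_A"},
}

ranked = rerank(generate_candidates("Strahlentherapie", TOY_KB), "TYPE_A", TOY_KB)
```

Separating the two stages keeps the cheap string matching over the whole vocabulary apart from the more expensive scoring, which only has to look at a handful of candidates.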

See: 02_GGPONC_UMLS_Linking

Resolution of Coordination Ellipses

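Application of our encoder-decoder model for resolving elliptical coordinated compound noun phrases (ECCNPs), i.e. rewriting a shortened coordination into its full form, for example "Chemo- und Strahlentherapie" -> "Chemotherapie und Strahlentherapie" ("chemotherapy and radiotherapy").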
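To make the input and output of this task concrete, here is a deliberately naive rule-based baseline. It is not the encoder-decoder model used in the notebook: it only handles the pattern "stem- und/oder compound" and relies on a hypothetical, hand-written list of compound heads (`TOY_HEADS`) to decide where the second word splits.

```python
import re
from typing import Iterable

# Matches "<stem>- und <compound>" or "<stem>- oder <compound>".
ELLIPSIS = re.compile(r"\b(\w+)-\s+(und|oder)\s+(\w+)\b")

# Hypothetical head lexicon; a learned model needs no such list.
TOY_HEADS = ("therapie", "karzinom")


def resolve_ellipsis(text: str, heads: Iterable[str] = TOY_HEADS) -> str:
    """Append the shared compound head to the truncated first conjunct."""
    ordered = sorted(heads, key=len, reverse=True)  # prefer the longest head

    def expand(match: "re.Match") -> str:
        stem, conjunction, compound = match.groups()
        for head in ordered:
            # The second word must be a real compound ending in a known head.
            if compound.lower().endswith(head) and len(compound) > len(head):
                return f"{stem}{head} {conjunction} {compound}"
        return match.group(0)  # unknown head: leave the text unchanged

    return ELLIPSIS.sub(expand, text)


resolved = resolve_ellipsis("Chemo- und Strahlentherapie")
```

Cases outside the lexicon, such as "Vor- und Nachteile", are returned unchanged, which is exactly the kind of coverage gap that motivates a learned model.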

See: 03_ECCNP_Analysis.ipynb

Molecular Named Entities

Training and evaluation of a nested NER model for gene/protein and variant mentions. The dataset (molecular_2024_04_03) is not yet published but is available upon request. Place the release in the `data` folder to run the notebook.

See: 04_Molecular.ipynb