OMOP Note -> Note_NLP

This is a simple script that takes your OMOP-compliant Note table and converts it into an OMOP Note_NLP table.

The goal of this project is to: "Enable NLP analysis for semi-technical analysts and programmers"

Installation

Clone this repository using git or the GitHub CLI

git clone https://github.com/UK-IPOP/omop-nlp.git
# or
gh repo clone UK-IPOP/omop-nlp

Set up Python and install dependencies. I recommend using conda to create a Python virtual environment.

NOTE: scispacy requires Python 3.9

Run the following inside the repository you cloned:

# pin Python 3.9 (see the note above); -y accepts defaults
conda create -n omop-nlp python=3.9 -y
conda activate omop-nlp
# install requirements (rich for logging, pandas for file reading, and scispacy -- which includes spacy itself)
pip install -r requirements.txt
# install scispacy model
# NOTE: this model is a few GB
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz
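Before running the script, it can help to confirm that the environment actually contains what the steps above installed. The following is a minimal sketch, not part of this repository: the function name missing_packages and the list of import names are assumptions based on the packages mentioned in this README, so adjust them to match requirements.txt.

```python
from importlib.util import find_spec

# Import names assumed from the install steps above (hypothetical list);
# adjust to match the contents of requirements.txt.
EXPECTED_PACKAGES = ["rich", "pandas", "spacy", "scispacy", "negspacy", "en_core_sci_lg"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(EXPECTED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All expected packages were found.")
```

Run it with the omop-nlp environment activated. It only checks that each package can be located, not that the model loads correctly.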

Usage

Once you have created the Python environment, run the script from inside the cloned directory:

python run_nlp.py <OMOP-DIR>

Replace <OMOP-DIR> with the directory of your OMOP files. For example:

python run_nlp.py ~/Documents/OMOP/

More CLI options can be discovered by running the help option:

python run_nlp.py --help

Things to Know

This project is intentionally minimal and only accomplishes the following:

Utilize the large scispacy model for named entity recognition of biomedical concepts
Utilize the scispacy EntityLinker to extract UMLS (https://uts.nlm.nih.gov/uts/umls/home) concepts
Utilize negspacy for concept negation
Convert relevant spacy outputs to their OMOP fields:
- note_nlp_id: a unique identifier
- note_id: a linked note id from the Note table
- lexical_variant: the entity we extracted
- note_nlp_source_concept_id: the CUI of the UMLS concept we linked to
- nlp_system: "scispacy"
- nlp_datetime: datetime the script was run
- nlp_date: date the script was run
- term_modifiers: "Negation=True/False" based on negspacy
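The field mapping above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in run_nlp.py: the function name build_note_nlp_rows, the sequential id scheme, and the input tuple shape are all hypothetical. In a scispacy/negspacy pipeline the tuple values would typically come from ent.text, the top candidate in ent._.kb_ents, and ent._.negex.

```python
from datetime import datetime

NLP_SYSTEM = "scispacy"


def build_note_nlp_rows(entities, run_time=None, start_id=1):
    """Turn (note_id, text, cui, negated) tuples into Note_NLP-style dicts.

    The tuple shape and the sequential id scheme are assumptions made
    for this sketch, not the repository's actual implementation.
    """
    run_time = run_time or datetime.now()
    rows = []
    for offset, (note_id, text, cui, negated) in enumerate(entities):
        rows.append({
            "note_nlp_id": start_id + offset,       # unique identifier
            "note_id": note_id,                     # id from the Note table
            "lexical_variant": text,                # the extracted entity text
            "note_nlp_source_concept_id": cui,      # linked UMLS CUI
            "nlp_system": NLP_SYSTEM,
            "nlp_datetime": run_time,               # when the script ran
            "nlp_date": run_time.date(),
            "term_modifiers": f"Negation={bool(negated)}",
        })
    return rows


# Illustrative input only; the note id and entities are made up.
sample = [(10, "chest pain", "C0008031", False), (10, "fever", "C0015967", True)]
for row in build_note_nlp_rows(sample, run_time=datetime(2024, 1, 2, 3, 4, 5)):
    print(row["note_nlp_id"], row["lexical_variant"], row["term_modifiers"])
```

A list of dicts like this can be passed directly to pandas.DataFrame to produce a table with one column per Note_NLP field.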

The fields listed above cover all of the required Note_NLP fields and some optional ones. The scope is intentionally narrow, both to avoid presenting potentially irrelevant information to users and to reduce the amount of code that must be maintained long-term. The following exercises are left to the user (pull requests or branches with feature implementations are welcome):

Linking UMLS CUIs to OMOP concept_ids (note_nlp_concept_id)
Extracting supplemental text information (offset or term_temporal)
Transitioning from directory/file based storage to alternative storage solutions. Potential options include:
- S3 data stores
- Databases/warehouses
- Note that because we use pandas, reading your Note table (and thus writing the Note_NLP table) is limited only by the data sources pandas supports, so the options above should be straightforward to implement
Dockerizing this process and publishing it as a service
- Note that the models use a significant amount of RAM, so provision your infrastructure and container accordingly
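As a starting point for the first exercise (linking CUIs to OMOP concept_ids), here is a minimal sketch. It assumes you have already built a CUI-to-concept_id lookup from your own vocabulary files; the function name add_concept_ids, the lookup dict, and the ids in the example are hypothetical placeholders. Unmapped CUIs fall back to 0, which OMOP uses for "no matching concept".

```python
def add_concept_ids(rows, cui_to_concept_id):
    """Return copies of Note_NLP-style rows with note_nlp_concept_id filled in.

    `cui_to_concept_id` is a hypothetical lookup (CUI string -> OMOP concept_id)
    that you would build yourself; this project does not provide one.
    """
    enriched = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        cui = row["note_nlp_source_concept_id"]
        # OMOP convention: concept_id 0 means no matching concept
        new_row["note_nlp_concept_id"] = cui_to_concept_id.get(cui, 0)
        enriched.append(new_row)
    return enriched


# Placeholder CUIs and concept_ids, for illustration only.
example_rows = [
    {"note_nlp_source_concept_id": "C0000001"},
    {"note_nlp_source_concept_id": "C0000002"},
]
example_lookup = {"C0000001": 1001}
for row in add_concept_ids(example_rows, example_lookup):
    print(row["note_nlp_source_concept_id"], row["note_nlp_concept_id"])
```

With pandas, the same idea can be expressed as a map over the note_nlp_source_concept_id column followed by fillna(0).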

Resources

scispacy
negspacy
spacy
pandas
UMLS

Contributions

Contributions are welcome. Please feel free to either create your own fork or submit pull requests with corrections or feature additions.
