This is a simple script that will take your OMOP compliant Note table and convert it into an OMOP Note_NLP table.
The goal of this project is to: "Enable NLP analysis for semi-technical analysts and programmers"
Clone this repository using git or the GitHub CLI
git clone https://github.com/UK-IPOP/omop-nlp.git
# or
gh repo clone UK-IPOP/omop-nlp
Set up Python and install dependencies. I recommend using conda to create a Python virtual environment.
NOTE: scispacy requires Python 3.9.
Run the following inside the repository you cloned:
# -y accept defaults
conda create -n omop-nlp python=3.9 -y
conda activate omop-nlp
# install requirements (rich for logging, pandas for file reading, and scispacy -- which includes spacy itself)
pip install -r requirements.txt
# install scispacy model
# NOTE: this model is a few GB
pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.4/en_core_sci_lg-0.5.4.tar.gz
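Before running the script, it can help to confirm that the environment actually contains everything installed above. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of this repository, and the package list is simply the set of packages mentioned in this README.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Packages named in this README; en_core_sci_lg is the scispacy model package.
    expected = ["rich", "pandas", "spacy", "scispacy", "negspacy", "en_core_sci_lg"]
    print("Python version:", sys.version.split()[0])
    missing = missing_packages(expected)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All expected packages were found.")
```

If anything is reported missing, re-run the install commands above inside the activated omop-nlp environment.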
Once you have created the relevant python environment, usage is simple, again inside the cloned directory:
python run_nlp.py <OMOP-DIR>
Replace <OMOP-DIR> with the directory of your OMOP files. For example:
python run_nlp.py ~/Documents/OMOP/
More CLI options can be discovered by running the help option:
python run_nlp.py --help
This project is intentionally minimal and only accomplishes the following:
- Utilize the large scispacy model for named entity recognition of biomedical concepts
- Utilize the scispacy EntityLinker to extract UMLS (https://uts.nlm.nih.gov/uts/umls/home) concepts
- Utilize negspacy for concept negation
- Convert relevant spacy outputs to their OMOP fields:
  - note_nlp_id: a unique identifier
  - note_id: a linked note id from the Note table
  - lexical_variant: the entity we extracted
  - note_nlp_source_concept_id: the CUI of the UMLS concept we linked to
  - nlp_system: "scispacy"
  - nlp_datetime: datetime the script was run
  - nlp_date: date the script was run
  - term_modifiers: "Negation=True/False" based on negspacy
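The field mapping above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in run_nlp.py: the `ExtractedEntity` container and the `to_note_nlp_rows` function are hypothetical names, the entities are assumed to have already been extracted by the NLP pipeline, the sequential id scheme and timestamp formats are assumptions, and the CUI in the usage example is a placeholder value.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import count
from typing import Dict, Iterable, List, Optional


@dataclass
class ExtractedEntity:
    """Hypothetical container for one entity produced by the NLP pipeline."""
    note_id: int    # id of the source row in the Note table
    text: str       # entity text as it appeared in the note
    cui: str        # UMLS CUI chosen by the entity linker
    negated: bool   # negation flag, e.g. from negspacy


def to_note_nlp_rows(
    entities: Iterable[ExtractedEntity],
    run_time: Optional[datetime] = None,
) -> List[Dict[str, object]]:
    """Map extracted entities onto the Note_NLP fields listed above."""
    run_time = run_time or datetime.now()
    ids = count(1)  # assumption: simple sequential ids starting at 1
    rows = []
    for ent in entities:
        rows.append({
            "note_nlp_id": next(ids),
            "note_id": ent.note_id,
            "lexical_variant": ent.text,
            "note_nlp_source_concept_id": ent.cui,
            "nlp_system": "scispacy",
            "nlp_datetime": run_time.isoformat(sep=" ", timespec="seconds"),
            "nlp_date": run_time.date().isoformat(),
            "term_modifiers": f"Negation={ent.negated}",
        })
    return rows


if __name__ == "__main__":
    # Placeholder entity; the CUI is illustrative, not a verified lookup.
    example = [ExtractedEntity(note_id=7, text="chest pain", cui="C0000000", negated=True)]
    for row in to_note_nlp_rows(example):
        print(row)
```

A list of dictionaries like this can be passed directly to `pandas.DataFrame` and written out with `DataFrame.to_csv`.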
This encompasses all of the required Note_NLP fields and some optional fields. This is intentional, both to limit potentially irrelevant information shown to a user and to decrease the surface area of code that must be maintained long-term. The following exercises are left to the user (pull requests or branches with feature implementations are welcome):
- Linking UMLS CUIs to OMOP concept_ids (note_nlp_concept_id)
- Extracting supplemental text information (offset or term_temporal)
- Transitioning from directory/file based storage to alternative storage solutions. Potential options include:
  - S3 data stores
  - Databases/warehouses
  - Note that because we use pandas, reading your Note table (and thus writing the Note_NLP table) is limited only by the data sources pandas can read from, so the above can be implemented with ease
- Dockerizing this process and publishing it as a service
  - Note here that the models use a significant amount of RAM, so provision your infrastructure and container accordingly
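As a starting point for the first exercise, linking CUIs to OMOP concept_ids can be sketched as a dictionary lookup. This is only a sketch: the function name `add_concept_ids` and the crosswalk dictionary are hypothetical, and building a real CUI-to-concept_id crosswalk from your vocabulary tables is the actual work. Unmapped CUIs fall back to 0, following the OMOP convention of using concept_id 0 when there is no matching concept.

```python
from typing import Dict, List

NO_MATCHING_CONCEPT = 0  # OMOP convention for "no matching concept"


def add_concept_ids(
    rows: List[Dict[str, object]],
    cui_to_concept_id: Dict[str, int],
) -> List[Dict[str, object]]:
    """Return copies of Note_NLP rows with note_nlp_concept_id filled in.

    `cui_to_concept_id` is a hypothetical crosswalk that you supply yourself;
    it is not provided by this project.
    """
    linked = []
    for row in rows:
        cui = str(row["note_nlp_source_concept_id"])
        concept_id = cui_to_concept_id.get(cui, NO_MATCHING_CONCEPT)
        # Build a new dict so the input rows are left unchanged.
        linked.append({**row, "note_nlp_concept_id": concept_id})
    return linked


if __name__ == "__main__":
    # Placeholder CUIs and concept_id, for illustration only.
    rows = [
        {"note_nlp_id": 1, "note_nlp_source_concept_id": "C0000001"},
        {"note_nlp_id": 2, "note_nlp_source_concept_id": "C0000002"},
    ]
    crosswalk = {"C0000001": 12345}
    for row in add_concept_ids(rows, crosswalk):
        print(row)
```

With a pandas DataFrame, the same lookup could be done with `Series.map` on the note_nlp_source_concept_id column followed by `fillna(0)`.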
Contributions are welcomed. Please feel free to either create your own fork or submit pull-requests with corrections or feature additions.