Spark NLP-based NLP Sandbox PHI Annotator

Introduction

NLPSandbox.io is an open platform for benchmarking modular natural language processing (NLP) tools on both public and private datasets. Academics, students, and industry professionals are invited to browse the available tasks and participate by developing and submitting an NLP Sandbox tool.

This repository provides an example implementation of the NLP Sandbox PHI Annotator API written in Python-Flask. An NLP Sandbox PHI annotator takes a clinical note (text) as input and outputs a list of predicted PHI annotations found in that note.

This NLP Sandbox tool uses a model from Spark NLP to annotate PHI in clinical notes. Because NLP Sandbox tools must run without access to the internet, this implementation installs and configures Spark NLP to run offline.

Models:

NER: ner_deid_large
Embeddings: embeddings_clinical
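
To make the annotator's input and output concrete, here is a minimal sketch of turning model-detected spans into a list of annotations for one note. The `PhiAnnotation` class, its field names, and the `to_annotations` helper are hypothetical and chosen for illustration only; the actual response format is defined by the NLP Sandbox schemas, and the spans below are hand-written rather than produced by a model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PhiAnnotation:
    """Hypothetical annotation record; field names are assumptions."""
    start: int        # character offset of the span in the note
    length: int       # number of characters covered by the span
    text: str         # the annotated substring
    phi_type: str     # label assigned to the span, e.g. "DATE"
    confidence: float # score passed through unchanged from the detector


def to_annotations(
    note_text: str, spans: List[Tuple[int, int, str, float]]
) -> List[PhiAnnotation]:
    """Convert (begin, end, label, score) tuples into annotation records.

    `end` is treated as exclusive, like a Python slice.
    """
    annotations = []
    for begin, end, label, score in spans:
        annotations.append(
            PhiAnnotation(
                start=begin,
                length=end - begin,
                text=note_text[begin:end],
                phi_type=label,
                confidence=score,
            )
        )
    return annotations


# Hand-written example input; a real tool would get spans from the NER model.
note = "Patient John Smith was admitted on 2021-03-14."
spans = [(8, 18, "NAME", 0.97), (35, 45, "DATE", 0.99)]

for annotation in to_annotations(note, spans):
    print(annotation)
```

Storing `start` and `length` rather than only the matched text keeps each annotation unambiguous when the same string occurs more than once in a note.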

Specification

Tool version: 0.2.3
NLP Sandbox schemas version: 1.2.0
Docker image: docker.synapse.org/syn22277124/phi-annotator-spark-nlp
Apache Spark version: 3.1.1
Spark NLP version: 3.1.1

Note: The Docker image includes models from Spark NLP for Healthcare, which require a trial or paid subscription. The Docker image therefore cannot be made publicly available.

Requirements

Docker Engine >=19.03.0

Usage

Creating the configuration file

Create the configuration file and update the configuration values.

cp .env.example .env
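
After copying the file, it can help to check that no value was left empty. The sketch below parses simple `KEY=VALUE` lines and reports keys without a value. It assumes the file uses plain `KEY=VALUE` syntax with `#` comments; the key names in the sample are hypothetical and are not taken from `.env.example`.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_values(values: Dict[str, str]) -> List[str]:
    """Return the sorted names of keys whose value is empty."""
    return sorted(key for key, value in values.items() if not value)


# Hypothetical file content, for illustration only.
sample = '# comment\nEXAMPLE_PORT=8080\nEXAMPLE_SECRET=\n\nEXAMPLE_NAME="demo"\n'
print(missing_values(parse_env(sample)))
```

To check the real file, pass its content instead of the sample, for example `parse_env(open(".env").read())`.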

Running with Docker

The command below starts this NLP Sandbox PHI annotator locally.

docker compose up --build

You can stop the running container with Ctrl+C, then run docker compose down to remove it.

Running with Python

Install Apache Spark and Spark NLP on your system. The Dockerfile included with this project shows how to install them on a Debian-based distribution.

Create a Conda environment.

conda create --name phi-annotator-spark-nlp python=3.9 -y
conda activate phi-annotator-spark-nlp

Install and start this NLP Sandbox PHI annotator.

cd server && pip install -r requirements.txt
python -m openapi_server

Accessing this NLP Sandbox tool User Interface

This NLP Sandbox tool provides a web interface that you can use to annotate clinical notes. This web client has been automatically generated by openapi-generator. To access the UI, open a new tab in your browser and navigate to one of the following addresses, depending on whether you are running the tool using Docker (production) or Python (development).

Using Docker: http://localhost/ui
Using Python: http://localhost:8080/ui

Development

Interested in creating your own NLP Sandbox tool based on this implementation? Start by creating a new GitHub repository based on this GitHub template.

This NLP Sandbox tool is based on the NLP Sandbox PHI Annotator example. Please refer to the Development section of this example tool for general information on how to develop an NLP Sandbox tool. The sections listed below provide additional information about the present tool.

Configuring your GitHub repository

After creating your GitHub repository based on this GitHub template, you must update your project so that the CI/CD workflow can automatically build your tool and push it as a Docker image to Synapse. You can then submit your tool to NLPSandbox.io.

Update the CI/CD workflow file .github/workflows/ci.yml
- Update the environment variable docker_repository

Add the following GitHub Secrets to your repository

Name	Description
`SPARK_LICENSE_SECRET`	Your Spark NLP license secret.
`SPARK_AWS_ACCESS_KEY_ID`	Your Spark NLP AWS access key ID.
`SPARK_AWS_SECRET_ACCESS_KEY`	Your Spark NLP AWS secret access key.
`SYNAPSE_USERNAME`	Your Synapse username.
`SYNAPSE_TOKEN`	A Synapse personal access token for your Synapse account.

Note: A trial license for Spark NLP works only with a specific version of Spark NLP. If your license is for a more recent version, you will need to update the Spark NLP version in the following files of this project: Dockerfile and .github/workflows/ci.yml.

Versioning

GitHub release tags

This repository uses semantic versioning to track the releases of this tool. This repository uses "non-moving" GitHub tags, that is, a tag will always point to the same git commit once it has been created.

Docker image tags

The artifact published by the CI/CD workflow of this GitHub repository is a Docker image pushed to the Synapse Docker Registry. This table lists the image tags pushed to the registry.

Tag name	Moving	Description
`latest`	Yes	Latest stable release.
`edge`	Yes	Latest commit made to the default branch.
`edge-<sha>`	No	Same as above with the reference to the git commit.
`<major>.<minor>.<patch>`	No	Stable release.

You should avoid using a moving tag like latest when deploying containers in production, because this makes it hard to track which version of the image is running and hard to roll back.
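
The table above can be turned into a small check, for example in a deployment script that should refuse moving tags. This is a sketch only: the `is_moving_tag` helper is hypothetical, and it assumes that `<sha>` is a lowercase hexadecimal string and that stable releases carry no pre-release suffix.

```python
import re

# Patterns for the two non-moving tag forms listed in the table.
STABLE_TAG = re.compile(r"^\d+\.\d+\.\d+$")     # <major>.<minor>.<patch>
EDGE_SHA_TAG = re.compile(r"^edge-[0-9a-f]+$")  # edge-<sha>, hex sha assumed


def is_moving_tag(tag: str) -> bool:
    """Return True for tags that can be re-pointed to a different image."""
    if tag in ("latest", "edge"):
        return True
    if STABLE_TAG.match(tag) or EDGE_SHA_TAG.match(tag):
        return False
    raise ValueError(f"Unrecognized image tag: {tag!r}")


for tag in ("latest", "edge", "edge-1a2b3c4", "0.2.3"):
    print(tag, "moving" if is_moving_tag(tag) else "fixed")
```

Raising an error for unrecognized tags, rather than guessing, keeps the check conservative if the tagging scheme changes.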

Benchmarking on NLPSandbox.io

Visit nlpsandbox.io for instructions on how to submit your NLP Sandbox tool and evaluate its performance.

Contributing

Thinking about contributing to this project? Get started by reading our contribution guide.

License

Apache License 2.0
