classifier-pipeline

A software suite that enables remote extraction, transformation and loading of data.

This repository is geared towards retrieving articles from PubMed, identifying those that contain information about biological pathways, and loading the resulting records into a data store.

Requirements

  • conda, used to create the Python 3.8 environment below
  • Poetry, used to install the Python dependencies
  • a RethinkDB instance for the pipeline to load records into

Access article feed via web service

Once the web server is running (see Web server below), the interactive API documentation is available:

  • Swagger documentation at /docs
  • Redoc documentation at /redoc

Usage

Create a conda environment, here named pipeline:

$ conda create --name pipeline python=3.8 --yes
$ conda activate pipeline

Clone the repository:

$ git clone https://github.com/jvwong/classifier-pipeline
$ cd classifier-pipeline

Install the dependencies:

$ poetry install
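
If Poetry is not already available in the pipeline environment, install it first (this step is an assumption; any standard Poetry installation method works):

$ pip install poetry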

Web server

To start up the server:

$ uvicorn classifier_pipeline.main:app --port 8000 --reload

  • uvicorn options
    • --reload: Enable auto-reload.
    • --port INTEGER: Bind socket to this port (default: 8000).

Now go to http://127.0.0.1:8000/redoc (swap out the port if necessary) to see the automatically generated documentation.
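
As a quick check from the command line, the documentation endpoints can also be requested with curl (assuming the default host and port shown above):

$ curl http://127.0.0.1:8000/docs
$ curl http://127.0.0.1:8000/redoc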

Pipeline

Install a scheduled (cron) job that launches a pipeline to process daily updates from PubMed and dump the RethinkDB database:

$ ./scripts/cron/install.sh
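
Assuming install.sh registers an entry in your crontab (an assumption based on its location under scripts/cron), the installed schedule can be inspected with:

$ crontab -l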

Elements of the 'Pipeline'

The pipeline

The scripts directory contains python files that chain functions in classifier_pipeline to:

  • read in data from a csv file (listing the update files or PubMed IDs to extract)
  • retrieve records/files from PubMed (pubmed_transformer)
  • apply various filters on the individual records (citation_pubtype_filter, citation_date_filter)
  • apply a deep-learning classifier to text fields (classification_transformer)
  • load the formatted data into a RethinkDB instance (db_loader)

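The db_loader step needs a reachable RethinkDB instance to write to. How that instance is provisioned is not prescribed here; one option (an assumption, not part of this repository) is the official Docker image, exposing the client driver port (28015) and the web UI (8080):

$ docker run -d --name rethinkdb -p 28015:28015 -p 8080:8080 rethinkdb
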
Launchers

  • Pipelines are launched through bash scripts that retrieve PubMed article records in two ways:
    • ./scripts/cron/cron.sh: retrieves all new content via the FTP file server
    • ./scripts/csv/pmids.sh: retrieves records for a given set of PubMed IDs using the NCBI E-Utilities
  • Variables (see the example below)
    • DATA_DIR: root directory where your data files exist
    • DATA_FILE: name of the csv file in your DATA_DIR
    • ARG_IDCOLUMN: the csv header column name containing either
      • a list of update files to extract (dailyupdates.sh)
      • a list of PubMed IDs to extract (pmids.sh)
    • JOB_NAME: the name of this pipeline job
    • CONDA_ENV: the environment name you declared in the first steps
    • ARG_TYPE
      • use fetch for downloading individual PubMed IDs
      • use download to retrieve FTP update files
    • ARG_MINYEAR: articles published before this year will be filtered out (optional)
    • ARG_TABLE: the name of the table to dump results into
    • ARG_THRESHOLD: the lowest probability at which an article is classified as 'positive' using pathway-abstract-classifier

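As a sketch of how these fit together, the variables can be supplied in the launcher's environment before it runs. The values below (column name, job name, table name, year, threshold) are placeholders, and whether your setup exports them like this or edits them inside the script is an assumption:

$ export DATA_DIR=/path/to/data \
         DATA_FILE=pmids.csv \
         ARG_IDCOLUMN=PMID \
         JOB_NAME=pmid-job \
         CONDA_ENV=pipeline \
         ARG_TYPE=fetch \
         ARG_MINYEAR=2019 \
         ARG_TABLE=documents \
         ARG_THRESHOLD=0.5
$ ./scripts/csv/pmids.sh
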
Testing

There is a convenience script that can be launched:

$ ./test.sh

This will run the tests in ./tests, lint with flake8 and type check with mypy.
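
Under the hood this is roughly equivalent to running the individual tools through Poetry (the exact flags and paths used by test.sh may differ):

$ poetry run pytest ./tests
$ poetry run flake8 .
$ poetry run mypy .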
