HDRUK Avoidable Admissions Analytics

HDRUK Data Science Collaboration on Avoidable Admissions in the NHS.

Please see https://mattstammers.github.io/hdruk_avoidable_admissions_collaboration_docs/ for more information.

Installation

To contribute to this repo, please see the Development Setup section below.

The following describes installation of the package within an existing environment. A separate virtual environment is recommended.

The package may be installed directly from GitHub.

pip install "avoidable_admissions @ git+https://github.com/LTHTR-DST/hdruk_avoidable_admissions.git@<release-name>"

Additional installation options are described in the documentation.

Replace <release-name> with the latest release version, e.g. v0.2.1-alpha. The list of releases is available at https://github.com/LTHTR-DST/hdruk_avoidable_admissions/releases.

Omit <release-name> to install the latest code in the repo.
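To confirm which release ended up in the environment, a minimal sketch along the following lines may help. It assumes the distribution is registered under the name `avoidable_admissions`, as in the pip command above; the helper `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="avoidable_admissions"):
    """Return the installed version string, or None if the distribution is absent.

    The default name is an assumption taken from the pip command above.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints the version if installed, otherwise a short notice.
    print(installed_version() or "avoidable_admissions is not installed")
```

If this prints an older version than expected, re-run the pip command with the intended <release-name>.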

Quickstart

Detailed instructions are available in the documentation including a complete pipeline example.

import pandas as pd
from avoidable_admissions.data.validate import (
    validate_dataframe,
    AdmittedCareEpisodeSchema,
    AdmittedCareFeatureSchema
)
from avoidable_admissions.features.build_features import (
    build_admitted_care_features
)


# Load raw data typically extracted using SQL from source database
df = pd.read_csv("../data/raw/admitted_care.csv")

# First validation step using Episode Schema
# Review, fix DQ issues and repeat this step until all data passes validation
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)

# Feature engineering using the _good_ dataframe
df_features = build_admitted_care_features(good)

# Second validation step using Feature Schema
# Review and fix DQ issues.
# This may require returning to the first validation step or even extraction.
good_f, bad_f = validate_dataframe(df_features, AdmittedCareFeatureSchema)

# Use the good_f dataframe for analysis as required by lead site

Development Setup

The project setup is based on an opinionated cookiecutter datascience project template. There are a few additional components to ease development and facilitate a collaborative workspace.

The setup has only been tested on Windows 10. Before setting this project up, the following requirements need to be met:

  • Anaconda or Miniconda installed and access to the Anaconda Powershell prompt
  • Mamba (conda install mamba -n base)
  • Git
  • Fork and clone this repo

Steps

  1. Start the Anaconda Powershell prompt and navigate to the root of this folder.
  2. Execute ./init.bat
  3. Activate the environment with conda activate hdruk_aa
  4. Start JupyterLab with jupyter-lab
  5. Alternatively, open an IDE (e.g. VS Code) and set the Python environment to hdruk_aa

Additional features

pre-commit

The project expects collaboration using git and GitHub and uses pre-commit git hooks to help identify and resolve code quality issues. See .pre-commit-config.yaml for the hooks that are enabled by default.

Development Containers

This repo supports full-featured development in development containers, either locally using Visual Studio Code or remotely using GitHub Codespaces.

Local Development: To enable containerised development locally, clone the repository and open it in VS Code. VS Code should automatically prompt to reopen the folder in a devcontainer. This requires Docker Desktop to be installed. It can take several minutes for the container to be created the first time while all required dependencies are installed. This removes the need for creating a new conda environment. Access to data should be configured as described at the end of the Project Organisation section.

Remote Development: Remote development is made easy using GitHub Codespaces with configurable compute. Compute instances cannot yet be deployed in the UK region, which raises data governance issues ⚠️. However, this option is useful for writing documentation and code that does not depend on data. For instance, updating Markdown cells in Jupyter notebooks and docstrings is entirely possible. Making API calls from this environment to generate code lists is also supported.

Important: ⛔ Patient level data must not be uploaded into a codespace even if it is excluded from version control.

Known issues

  • pandas-profiling is not compatible with Python 3.11 yet. If this is critical, the options are either to downgrade Python to 3.10 or to use a separate environment with Python<=3.10 to run pandas-profiling. As pandas-profiling will only be used infrequently, the latter may be a better option. Suggestions welcome.
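Until the incompatibility is resolved, code that optionally uses pandas-profiling could check the interpreter version first. The following is only a sketch; the helper name `profiling_supported` is hypothetical, and the 3.10 cut-off comes from the note above rather than from the pandas-profiling documentation.

```python
import sys


def profiling_supported(version_info=sys.version_info):
    """Return True when the interpreter is Python 3.10 or older.

    The cut-off is taken from the known issue above (assumption).
    """
    return tuple(version_info[:2]) <= (3, 10)


if __name__ == "__main__":
    if profiling_supported():
        print("pandas-profiling can be used in this environment")
    else:
        print("Use a separate environment with Python<=3.10 for pandas-profiling")
```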

Project Organisation

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- makes project pip installable (pip install -e .) so src can be imported
├── avoidable_admissions <- Source code for use in this project.
│   ├── __init__.py    <- Makes avoidable_admissions a Python module
│   │
│   ├── data           <- Scripts to download or generate and validate data
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │
│   ├── models         <- Scripts to train models and then use trained models to make predictions
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│
└── tox.ini            <- tox file with settings for running tox; see tox.readthedocs.io

📁 The following directories are not in version control and will need to be manually created by the user.

├── data
    ├── external       <- Data from third party sources.
    ├── interim        <- Intermediate data that has been transformed.
    ├── processed      <- The final, canonical data sets for modeling.
    └── raw            <- The original, immutable data dump.
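The directories listed above can be created by hand or with a short script. The sketch below uses only the standard library; the function name `create_data_dirs` is hypothetical and not part of this repo.

```python
from pathlib import Path

# Sub-directories of data/ that are excluded from version control.
DATA_SUBDIRS = ("external", "interim", "processed", "raw")


def create_data_dirs(project_root="."):
    """Create data/<subdir> under project_root and return the created paths."""
    root = Path(project_root)
    paths = []
    for name in DATA_SUBDIRS:
        path = root / "data" / name
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths


if __name__ == "__main__":
    for p in create_data_dirs():
        print(f"ready: {p}")
```

Run it from the root of the cloned repo so that the directories land next to README.md.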

Alternatively, create a .env file with database credentials, paths to data directories, etc. and load this using python-dotenv. See .env.sample and https://pypi.org/project/python-dotenv/ for how to do this. Avoid hardcoding local paths in notebooks or code to ensure reproducibility between collaborators.
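A minimal sketch of this approach is shown below. The variable name `RAW_DATA_DIR` and the helper `get_path_from_env` are hypothetical; check .env.sample for the names this project actually uses. The import is wrapped so that the sketch still runs where python-dotenv is not installed.

```python
import os
from pathlib import Path

try:
    # Provided by the python-dotenv package.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def get_path_from_env(name, default):
    """Return a path from environment variable `name`, or `default` if unset."""
    if load_dotenv is not None:
        # Reads a .env file if one is found; by default, variables already
        # set in the environment are not overridden.
        load_dotenv()
    return Path(os.environ.get(name, default))


if __name__ == "__main__":
    # RAW_DATA_DIR is a hypothetical variable name used for illustration.
    raw_dir = get_path_from_env("RAW_DATA_DIR", "data/raw")
    print(f"Reading raw data from {raw_dir}")
```

Each collaborator keeps a local .env file, so notebooks and code can refer to `raw_dir` rather than a machine-specific path.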

DO NOT COMMIT CREDENTIALS TO VERSION CONTROL!

ENSURE NO PII IS EXPOSED BEFORE COMMITTING TO VERSION CONTROL!


Project based on the cookiecutter data science project template. #cookiecutterdatascience