In this repository we are developing a pipeline mapping the ERKER Dataset to the Phenopackets format. The ERKER dataset is a collection of clinical data from the Charité Berlin. The Phenopackets format is a standard for the exchange of phenotypic and genomic data for patients with rare diseases. The goal of this project is to develop a general pipeline that can be used to map any clinical dataset to the Phenopackets format.
More information on Phenopackets:
Reliable exchange of medical data between medical facilities is essential for patients' medical care. Patients with rare diseases (RD) in particular can benefit from digital interoperability.
To capture and produce RD-specific and FAIR (Findability, Accessibility, Interoperability, Reusability) data, the ERKER (ERDRI-CDS kompatible Erfassung in REDCap / ERDRI-CDS compatible data capture in REDCap) was developed within our research projects CORD-MI, Screen4Care, Fair4Rare and Lab4Rare. In the subproject ERKER2Phenopackets we develop the transfer of ERKER data to Phenopackets, a computable representation of clinical data that enables deep phenotyping. MC4R deficiency, a rare genetic disease resulting in severe obesity, is used to develop the first example pipeline.
A set of 50 artificially created phenopackets is available in the ERKER2Phenopackets/data/out/phenopackets directory.
REDCap is a clinical electronic data capture system, for which many university hospitals have licenses. The ERKER version 1.5 form (ERDRI-CDS kompatible Erfassung in REDCap / ERDRI-CDS compatible data capture in REDCap) can be downloaded for free here: https://github.com/BIH-CEI/ERKER. For disease-specific RD data capture in .csv or Excel format, you can use the Python import template to capture the data semi-automatically.
For further use of Phenopackets please read: https://www.nature.com/articles/s41587-022-01357-4. The overall pipeline from the ERKER format to Phenopackets will be developed here.
We recommend installing Python with Anaconda.
To install Python using the standard installer from python.org, follow these steps:
- Check if Python is already installed: Open your terminal or command prompt and type python --version or python3 --version.
  a. If Python is installed, you'll see the version number. Please check that your version is 3.10 or higher. If it is not, install a newer version as described in the following steps (Python itself cannot be upgraded with pip).
  b. If not, proceed with the installation.
- Download Python: Visit the official Python website at python.org and download the latest version of Python for your operating system.
- Run the Installer: Double-click the downloaded installer and follow the on-screen instructions. Make sure to check the box that says "Add Python to PATH" during installation to easily run Python from the command line.
- Verify Installation: After installation, open your terminal or command prompt and type python --version or python3 --version to confirm that Python is installed and check the version.
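The version requirement can also be checked from inside Python. The following is a minimal sketch, not part of the repository; the helper name meets_minimum is hypothetical, and only the 3.10 threshold comes from the steps above.

```python
import sys

# Minimum version stated in the installation steps above.
MIN_VERSION = (3, 10)


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Report the result for the interpreter that runs this script.
current = ".".join(str(part) for part in sys.version_info[:3])
if meets_minimum():
    print(f"Python {current} is new enough.")
else:
    print(f"Python {current} is too old; 3.10 or higher is required.")
```

Running this with the same interpreter that will later run pip install . shows whether that specific installation meets the requirement, which matters when several Python versions are installed side by side.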
Anaconda is a popular distribution of Python that comes with many pre-installed data science libraries and tools. To install Python using Anaconda, follow these steps:
- Download Anaconda: Visit the Anaconda website at anaconda.com and download the Anaconda distribution for your operating system.
- Run the Installer: Double-click the downloaded Anaconda installer and follow the on-screen instructions. Anaconda provides an installer with a graphical user interface that makes it easy to customize your installation.
- Create an Environment (Optional, but highly recommended): Anaconda allows you to create isolated Python environments for different projects. You can use the Anaconda Navigator or the command line to create and manage environments. See the Anaconda Creating Environment Tutorial.
- Verify Installation: After installation, open your terminal or Anaconda Navigator and type python --version or python3 --version to confirm that Anaconda Python is installed and check the version.
Now, you have Python installed on your system, and you can start using it by running python in your terminal.
- Open a git CMD
- Navigate to the folder where you would like to install this git repository.
- Run: git clone https://github.com/BIH-CEI/ERKER2Phenopackets
- Open your Python CMD
- Navigate into the repository folder
- Run: pip install .
- (Optional): If you would like to install the required Python packages for testing, run: pip install .[test]
[Optional] 4. Installing phenopacket-tools
We can use phenopacket-tools to validate the created phenopackets. Unfortunately, at the time of writing, there is no Python version of phenopacket-tools available. Therefore, we rely on the CLI version of phenopacket-tools, which, if installed, is called automatically when executing the pipeline command.
Note: During development we used the phenopacket-tools version v1.0.0-RC3.
To install phenopacket-tools, follow these steps:
- Check if Java is already installed: Open your terminal or command prompt and type java --version.
  a. If Java is installed, you'll see the version number.
  b. If not, proceed with the installation.
- Download Java: Visit the official Java website at java.com and download the latest version of Java for your operating system. Follow the on-screen instructions.
- Download the phenopacket-tools CLI: Visit the official phenopacket-tools releases page and download the latest version of the phenopacket-tools CLI.
- Unzip the downloaded file and place the .jar file (e.g., phenopacket-tools-cli-1.0.0-RC3.jar) into the ERKER2Phenopackets/submodules/phenopacket-tools directory.
- If you are using a different version of phenopacket-tools, please also change the path to the .jar file in the config.cfg configuration file under the header Paths at jar_path.
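Before running the pipeline with validation, it can help to confirm that both prerequisites are in place. The sketch below is illustrative only: the configuration excerpt is a hypothetical example (only the Paths header and the jar_path key are taken from the steps above), and the helper names are not part of the repository.

```python
import configparser
import shutil
from pathlib import Path

# Hypothetical excerpt of config.cfg; the real file may contain more
# sections and a different value for jar_path.
EXAMPLE_CONFIG = """
[Paths]
jar_path = ERKER2Phenopackets/submodules/phenopacket-tools/phenopacket-tools-cli-1.0.0-RC3.jar
"""


def read_jar_path(config_text):
    """Return the jar_path entry of the Paths section as a Path."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return Path(parser["Paths"]["jar_path"])


def validation_prerequisites(config_text):
    """Report whether Java is on PATH and whether the .jar file exists."""
    jar_path = read_jar_path(config_text)
    return {
        "java_found": shutil.which("java") is not None,
        "jar_found": jar_path.is_file(),
        "jar_path": jar_path,
    }


status = validation_prerequisites(EXAMPLE_CONFIG)
print(status)
```

To check a real setup, read the text of config.cfg from disk and pass it to validation_prerequisites instead of the example string. The jar_found entry is only meaningful when the script runs from the directory that the configured path is relative to.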
Please follow the official MongoDB Installation Tutorial.
To run the pipeline, you require a .csv file in ERKER format with filled columns that allow Phenopacket creation from MC4R data.
- Follow the steps in the Installation section. (Especially important is the pip install . command.)
- Navigate to the root directory (top level ERKER2Phenopackets folder).
- Run: pipeline [-h] [-d | -t] [-p] [-v] data_path [out_dir_name]
  a. If you do not provide an output folder name, the output folder will be named according to the current date and time in the 'YYYY-MM-DD-hhmm' format.
  b. Running the command with the -v or --validate flag automatically calls validate on the created phenopackets. This is recommended, especially when using -p or --publish.
  c. To get more info on how to run this command, run pipeline -h or pipeline --help.
- You can find the created phenopackets in the ERKER2Phenopackets/data/out/ folder. Do not upload real patient data to GitHub.
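To illustrate how the usage line and the default folder name fit together, here is a minimal argparse sketch that mirrors the documented interface. It is not the repository's implementation: the meaning of -d and -t is not described here, so their help texts are placeholders, the long option spellings are assumptions, and the input path is hypothetical.

```python
import argparse
from datetime import datetime


def default_out_dir_name(now=None):
    """Format a timestamp as 'YYYY-MM-DD-hhmm' (24-hour clock assumed)."""
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d-%H%M")


def build_parser():
    """Build a parser with the same shape as the documented usage line."""
    parser = argparse.ArgumentParser(prog="pipeline")
    # -d and -t exclude each other, matching "[-d | -t]" in the usage line.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-d", action="store_true", help="see pipeline -h")
    group.add_argument("-t", action="store_true", help="see pipeline -h")
    parser.add_argument("-p", "--publish", action="store_true")
    parser.add_argument("-v", "--validate", action="store_true")
    parser.add_argument("data_path")
    # Optional positional argument; None means "derive the name from the clock".
    parser.add_argument("out_dir_name", nargs="?", default=None)
    return parser


# Hypothetical invocation: validate, no explicit output folder name.
args = build_parser().parse_args(["-v", "data/in/example.csv"])
out_dir = args.out_dir_name or default_out_dir_name()
print(out_dir)
```

With a fixed timestamp, default_out_dir_name(datetime(2023, 9, 10, 14, 5)) yields 2023-09-10-1405, so output folders sort chronologically by name.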
Run validate (optionally adding the path to a single phenopacket .json file or to a folder that contains phenopackets). By default, it validates the most recently created phenopackets.
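If phenopacket-tools is not installed, a quick structural sanity check of the .json files is still possible. The sketch below is not a replacement for validate and is not part of the repository; it only checks the two top-level fields (id and metaData) that the Phenopacket Schema v2 requires, and the function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Top-level fields that the Phenopacket Schema v2 marks as required.
REQUIRED_FIELDS = ("id", "metaData")


def missing_fields(phenopacket):
    """Return the required top-level fields absent from a parsed phenopacket."""
    return [field for field in REQUIRED_FIELDS if field not in phenopacket]


def check_path(path):
    """Map each .json file under `path` (a file or a folder) to its missing fields."""
    path = Path(path)
    files = [path] if path.is_file() else sorted(path.glob("*.json"))
    report = {}
    for file in files:
        with open(file, encoding="utf-8") as handle:
            report[file.name] = missing_fields(json.load(handle))
    return report


# Demonstration with two artificial files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ok.json").write_text(json.dumps({"id": "example-1", "metaData": {}}))
    Path(tmp, "broken.json").write_text(json.dumps({"id": "example-2"}))
    report = check_path(tmp)
print(report)
```

An empty list means both required fields are present. Full validation against the schema and the referenced ontologies still needs phenopacket-tools.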
- Human Phenotype Ontology (HP, version: 2023-06-06) 🔗
- Online Mendelian Inheritance in Man (OMIM, version: 2023-09-08) 🔗
- Orphanet Rare Disease Ontology (ORPHA, version: 2023-09-10) 🔗
- National Center for Biotechnology Information Taxonomy (NCBITaxon, version: 2023-02-21) 🔗
- Logical Observation Identifiers Names and Codes (LOINC, version: 2023-08-15) 🔗
- HUGO Gene Nomenclature Committee (HGNC, version: 2023-09-10) 🔗
- Genotype Ontology (GENO, version: 2023-07-27) 🔗
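In a phenopacket, the ontologies in use are typically declared as resource entries in the metaData block, so that term identifiers such as HP:0001513 can be resolved. The sketch below shows what such an entry could look like for the Human Phenotype Ontology, following the Resource element of the Phenopacket Schema. Whether the pipeline writes exactly these values is an assumption; only the version string comes from the list above, and expand_curie is a hypothetical helper.

```python
import json

# Resource entry for the Human Phenotype Ontology, using the field names
# of the Phenopacket Schema "Resource" element (JSON spelling).
hpo_resource = {
    "id": "hp",
    "name": "human phenotype ontology",
    "url": "http://purl.obolibrary.org/obo/hp.owl",
    "version": "2023-06-06",
    "namespacePrefix": "HP",
    "iriPrefix": "http://purl.obolibrary.org/obo/HP_",
}


def expand_curie(curie, resource):
    """Turn a CURIE such as 'HP:0001513' into a full IRI using a resource entry."""
    prefix, local_id = curie.split(":", 1)
    if prefix != resource["namespacePrefix"]:
        raise ValueError(f"{curie} does not belong to {resource['id']}")
    return resource["iriPrefix"] + local_id


print(json.dumps({"resources": [hpo_resource]}, indent=2))
print(expand_curie("HP:0001513", hpo_resource))
```

Recording the version with each resource makes it possible to tell later which ontology release a phenopacket's term identifiers refer to.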
This project is licensed under the terms of the MIT License.
We would like to extend our thanks to Daniel Danis for his support in the development of this project.
- Authors: