Package to serve public data from rare disease patients as found in publications and public resources. Most cases collected here have only phenotypic data, given as a list of HPO terms. The package offers 5 core modules:
- DiseaseAnnotations: Disease information.
- HPO: Symptom analysis through HPO.
- PatientSampler: Functionality to sample simulated patients based on the disease annotations and HPO.
- PhenotypicComparison: Functionality to plot phenotypic comparisons between two phenotypic profiles.
- PhenotypicDatabase: Local database to push available data to and pull data from. Publicly available data will be persisted here.
The 5 modules are covered in the Usage section below.
This package is in early development, so do not expect extensive docstrings or Sphinx documentation. At this point, this README is your best resource. If you have any questions, please open an issue and we'll answer as soon as possible.
If you are not a Python programmer but are interested in analyzing these data, perhaps to build a disease prediction algorithm, you will find the data in the resources directory. It contains all the nodes of the HPO ontology, the edges between them, and a JSON file with the disease information.
To install it simply run:
pip install rarecrowds
The PyPI project lives here: https://pypi.org/project/rarecrowds/.
Disease information is extracted from Orphanet's orphadata (product 4, product 9 (prevalence) and product 9 (ages)) and from the HPOA file created by the Monarch Initiative within the HPO project. By default, the Orphanet and OMIM phenotypic descriptions of a rare disease, as extracted from the HPOA file, are intersected. In principle, there is no need for you to parse the data provided by these institutions yourself.
In order to get information from a particular disease, use the following lines:
from rarecrowds import DiseaseAnnotations
dann = DiseaseAnnotations(mode='intersect')
data = dann.data['ORPHA:324']
This will output the information available about Fabry disease, whose Orphanet ID is ORPHA:324. To query disease information, please use Orphanet IDs. For further reference, visit www.orpha.net.
The following is an extract of the data returned by the lines above:
data = {
'ageDeath': ['adult'],
'ageOnset': ['Childhood'],
'group': 'Disorder',
'inheritance': ['X-linked recessive'],
'link': 'http://www.orpha.net/consor/cgi-bin/OC_Exp.php?lng=en&Expert=324',
'name': 'Fabry disease',
'phenotype': { 'HP:0000083': { 'frequency': 'HP:0040281',
'modifier': { 'diagnosticCriteria': True}},
'HP:0000091': { 'frequency': 'HP:0040282',
'modifier': { 'diagnosticCriteria': True}},
## Many other symptoms here
'HP:0100820': { 'frequency': 'HP:0040283',
'modifier': { 'diagnosticCriteria': True}}},
'prevalence': [ { 'class': '1-9 / 1 000 000',
'geographic': 'Europe',
'meanPrev': '0.22',
'qualification': 'Value and class',
'source': 'ORPHANET',
'type': 'Prevalence at birth',
'validation': {'status': 'Not yet validated'}},
## Other prevalence studies here
{ 'class': '1-9 / 100 000',
'geographic': 'Sweden',
'meanPrev': '1.11',
'qualification': 'Value and class',
'source': '25274184[PMID]',
'type': 'Prevalence at birth',
'validation': {'status': 'Validated'}}],
'source': {},
'type': 'Disease',
'validation': {'date': '2016-06-01 00:00:00.0', 'status': 'y'}
}
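As a minimal sketch of how a record with this structure can be consumed, the snippet below groups a disease's HPO terms by frequency label. The `sample` dictionary is a hand-made stand-in that copies a few fields from the extract above, and the helper `terms_by_frequency` is illustrative and not part of rarecrowds. With the package installed, `dann.data['ORPHA:324']` could presumably be passed instead, assuming it has the same shape.

```python
from collections import defaultdict

# Labels of the HPO frequency sub-ontology terms.
FREQUENCY_LABELS = {
    "HP:0040280": "Obligate",
    "HP:0040281": "Very frequent",
    "HP:0040282": "Frequent",
    "HP:0040283": "Occasional",
    "HP:0040284": "Very rare",
    "HP:0040285": "Excluded",
}

# Hand-made stand-in for one disease record (assumption: same shape
# as the extract shown above, reduced to the fields used here).
sample = {
    "name": "Fabry disease",
    "phenotype": {
        "HP:0000083": {"frequency": "HP:0040281"},
        "HP:0000091": {"frequency": "HP:0040282"},
        "HP:0100820": {"frequency": "HP:0040283"},
    },
}


def terms_by_frequency(disease):
    """Group the HPO terms of a disease record by frequency label."""
    grouped = defaultdict(list)
    for term, info in disease.get("phenotype", {}).items():
        label = FREQUENCY_LABELS.get(info.get("frequency"), "Unknown")
        grouped[label].append(term)
    return dict(grouped)


grouped = terms_by_frequency(sample)
for label, terms in grouped.items():
    print(f"{label}: {', '.join(terms)}")
```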
Based on this data, you can subset the annotations to obtain a list of diseases of interest, which is highly recommended at the start of developing a phenotype-based prediction algorithm:
# These lines build on the previous code
ann = dann.data
print(f'# total initial entities: {len(ann)}')
## Keep only disorders
for dis, val in list(ann.items()):
    if val['group'] != 'Disorder':
        del ann[dis]
print(f'# diseases: {len(ann)}')
## Keep only those with phenotypic information
for dis, val in list(ann.items()):
    if not val.get('phenotype'):
        del ann[dis]
print(f'# diseases with phenotype data: {len(ann)}')
## Remove clinical syndromes
for dis, val in list(ann.items()):
    if val['type'].lower() == 'clinical syndrome':
        del ann[dis]
print(f'# diseases w/o clinical syndromes: {len(ann)}')
## Keep only selected prevalences
valid_prev = ['>1 / 1000', '6-9 / 10 000', '1-5 / 10 000', '1-9 / 100 000', 'Unknown', 'Not yet documented']
for dis, val in list(ann.items()):
    if 'prevalence' in val:
        # Only point-prevalence entries are considered
        classes = [a['class'] for a in val['prevalence'] if a['type'] == 'Point prevalence']
        if not any(x in valid_prev for x in classes):
            del ann[dis]
    else:
        del ann[dis]
print(f'# diseases with valid prevalence: {len(ann)}')
As a result, the number of entities in the disease annotations object should be reduced as follows:
# total initial entities: 6930
# diseases: 5745
# diseases with phenotype data: 3649
# diseases w/o clinical syndromes: 3628
# diseases with valid prevalence: 1288
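The same four filters can also be written as a single predicate that leaves the original dictionary untouched, which may be convenient when experimenting with different criteria. This is only a sketch: `keep_disease` and the `toy` records (with made-up `TOY:n` IDs) are hypothetical, and the records are assumed to have the same fields as the extract shown earlier.

```python
# Prevalence classes accepted by the filter (copied from the code above).
VALID_PREVALENCE = {
    '>1 / 1000', '6-9 / 10 000', '1-5 / 10 000',
    '1-9 / 100 000', 'Unknown', 'Not yet documented',
}


def keep_disease(val, valid_prev=VALID_PREVALENCE):
    """Return True if a disease record passes all four filters."""
    if val.get('group') != 'Disorder':
        return False
    if not val.get('phenotype'):
        return False
    if val.get('type', '').lower() == 'clinical syndrome':
        return False
    # Only point-prevalence entries count; a missing list fails the filter.
    classes = [p.get('class') for p in val.get('prevalence', [])
               if p.get('type') == 'Point prevalence']
    return any(c in valid_prev for c in classes)


# Toy records with made-up IDs, one per filtering outcome.
ok_prev = [{'class': '1-9 / 100 000', 'type': 'Point prevalence'}]
toy = {
    'TOY:1': {'group': 'Disorder', 'type': 'Disease',
              'phenotype': {'HP:0000083': {}}, 'prevalence': ok_prev},
    'TOY:2': {'group': 'Group of disorders', 'type': 'Category',
              'phenotype': {'HP:0000083': {}}, 'prevalence': ok_prev},
    'TOY:3': {'group': 'Disorder', 'type': 'Disease',
              'phenotype': {}, 'prevalence': ok_prev},
    'TOY:4': {'group': 'Disorder', 'type': 'Clinical syndrome',
              'phenotype': {'HP:0000083': {}}, 'prevalence': ok_prev},
    'TOY:5': {'group': 'Disorder', 'type': 'Disease',
              'phenotype': {'HP:0000083': {}},
              'prevalence': [{'class': '<1 / 1 000 000',
                              'type': 'Point prevalence'}]},
    'TOY:6': {'group': 'Disorder', 'type': 'Disease',
              'phenotype': {'HP:0000083': {}}},
}

filtered = {dis: val for dis, val in toy.items() if keep_disease(val)}
print(sorted(filtered))
```

Because the comprehension builds a new dictionary, the full annotation set stays available for a second pass with different criteria.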
Symptoms are organized through the Human Phenotype Ontology (HPO). If you are not familiar with it, please visit the HPO website (https://hpo.jax.org/app/).
In order to get information on specific symptom IDs and other items included in the HPO ontology, such as the frequency subontology, RareCrowds includes the HPO module. This module allows you to get information about each term and their relationships.
In order to get information about a specific HPO term, run the following lines:
from rarecrowds import Hpo
hpo = Hpo()
hpo['HP:0001250'] ## Get information about 'seizures'
In order to see the successors or predecessors of a particular term, run any of the following lines:
hpo.successors(['HP:0001250'])
hpo.predecessors(['HP:0001250'])
In order to simplify a phenotypic profile, keeping only the most informative yet independent terms, run the following lines:
hpo.simplify(['HP:0001250', 'HP:0007359'])
Available methods (apologies for the lack of documentation):
- hpo.items(): Returns all items in HPO. Keep in mind that not all items are phenotypic abnormalities. If you want all symptoms, request ALL the successors of HP:0000118.
- hpo.save_json(filename): Saves the ontology as JSON.
- hpo.json(): Returns a JSON object of the ontology.
- hpo.json_adjacency(): Dumps the adjacency matrix as JSON.
- hpo.successors(ids, depth=1): Returns a list of successors. If depth = 0 it returns immediate successors.
- hpo.predecessors(ids, depth=1): Returns a list of predecessors. If depth = 0 it returns immediate predecessors.
- hpo.simplify(ids): Simplifies a phenotypic profile, keeping only the most informative terms.
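To make the idea of traversing the ontology and simplifying a profile more concrete, here is a small standalone sketch on a toy graph. It does not use rarecrowds and does not claim to reproduce how the Hpo module works internally: the `CHILDREN` edges are a hand-written illustrative chain, and `descendants` and `simplify` are hypothetical helpers. Simplification is interpreted here as dropping any term that is an ancestor of another term in the profile, which is one plausible reading of "most informative".

```python
from collections import deque

# Toy parent -> children edges (illustrative only; not guaranteed
# to match any particular HPO release).
CHILDREN = {
    "HP:0000118": ["HP:0000707"],
    "HP:0000707": ["HP:0012638"],
    "HP:0012638": ["HP:0001250"],
    "HP:0001250": ["HP:0007359"],
    "HP:0007359": [],
}


def descendants(term, children=CHILDREN, max_depth=None):
    """Breadth-first walk down the graph.

    max_depth=None visits every level; max_depth=1 returns only
    the direct children of `term`.
    """
    found, visited = [], {term}
    queue = deque([(term, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for child in children.get(node, []):
            if child not in visited:
                visited.add(child)
                found.append(child)
                queue.append((child, depth + 1))
    return found


def simplify(profile, children=CHILDREN):
    """Drop terms that are ancestors of another term in the profile."""
    profile = list(dict.fromkeys(profile))  # de-duplicate, keep order
    redundant = set()
    for term in profile:
        others = set(profile) - {term}
        if others & set(descendants(term, children)):
            redundant.add(term)
    return [t for t in profile if t not in redundant]


all_terms = descendants("HP:0000118")
direct = descendants("HP:0000118", max_depth=1)
simple = simplify(["HP:0001250", "HP:0007359"])
print(all_terms, direct, simple)
```

In this toy graph the more general term is removed because the more specific one already implies it.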
This module allows the creation of realistic patient profiles based on the disease annotations. The following steps are followed to sample a patient from a given disease:
- Sample symptoms using the symptom frequency.
- From the selected symptoms, sample imprecision as a Poisson process with a certain probability of getting a less specific term using the HPO ontology.
- Add random noise sampling random HPO terms. The amount of random noise is also a Poisson process, while the selection of the HPO terms to include is uniform across the phenotypic abnormality subontology (disregarding too uninformative terms).
- Sample patient age by assuming that it is close to the disease onset plus a delay of ~1 month.
- Sample patient sex taking into account the inheritance pattern of the disease.
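The first three steps above can be sketched in plain Python as follows. This is an illustration under several assumptions, not the PatientSampler implementation: the probabilities in `FREQ_PROB` are rough midpoints of the HPO frequency ranges, imprecision is read as "climb a Poisson-distributed number of levels towards the root", and `parents`, `background` and `sample_patient` are hypothetical names. Age and sex sampling are left out.

```python
import math
import random

# Assumed probability per HPO frequency term (rough range midpoints).
FREQ_PROB = {
    "HP:0040280": 1.0,    # Obligate
    "HP:0040281": 0.9,    # Very frequent
    "HP:0040282": 0.55,   # Frequent
    "HP:0040283": 0.17,   # Occasional
    "HP:0040284": 0.025,  # Very rare
    "HP:0040285": 0.0,    # Excluded
}


def poisson(lam, rng):
    """Draw from a Poisson distribution (Knuth's method, small lam)."""
    if lam <= 0:
        return 0
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def sample_patient(phenotype, parents, background,
                   imprecision=1, noise=0.25, rng=None):
    """Sample one simulated symptom list from a disease phenotype."""
    rng = rng or random.Random()
    terms = set()
    for term, info in phenotype.items():
        # Step 1: keep the symptom according to its frequency
        # (unknown frequencies default to 0.5, an assumption).
        if rng.random() < FREQ_PROB.get(info.get("frequency"), 0.5):
            # Step 2: imprecision, replace the term by an ancestor.
            for _ in range(poisson(imprecision, rng)):
                if term not in parents:
                    break
                term = parents[term]
            terms.add(term)
    # Step 3: noise, add uniformly chosen unrelated terms.
    for _ in range(poisson(noise, rng)):
        terms.add(rng.choice(background))
    return sorted(terms)


# Toy inputs: child -> parent map and a pool of noise terms.
parents = {"HP:0007359": "HP:0001250", "HP:0001250": "HP:0012638"}
background = ["HP:0000083", "HP:0100820"]
phenotype = {
    "HP:0007359": {"frequency": "HP:0040280"},
    "HP:0000091": {"frequency": "HP:0040285"},
}

ideal = sample_patient(phenotype, parents, background,
                       imprecision=0, noise=0, rng=random.Random(0))
noisy = sample_patient(phenotype, parents, background,
                       rng=random.Random(0))
print(ideal, noisy)
```

With imprecision and noise set to 0, only the frequency step remains, so an obligate symptom is always kept and an excluded one never is.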
In order to sample 5 patients from a disease, run the following lines:
from rarecrowds import PatientSampler
sampler = PatientSampler()
patients = sampler.sample(patient_params="default", N=5)
As a result, an object similar to the following would be generated:
patients = {
'ORPHA:324': {
'id': 'ORPHA:324',
'name': 'Fabry disease',
'phenotype': {
'HP:0000083': {'Frequency': 'HP:0040281'},
## Many other symptoms here
'HP:0100820': {'Frequency': 'HP:0040283'}},
'cohort': [ # As many items in the list as patients simulated
{
'ageOnset': None,
'phenotype': {
'HP:0025031': {},
## Other symptoms here
'HP:0100279': {}
}
}
]
}
}
You can configure the imprecision and noise levels used to sample patient symptoms:
'''
These are the options for patient simulation parameters
"default": {
"imprecision": 1,
"noise": 0.25,
"omit_frequency": False,
},
"ideal": {
"imprecision": 0,
"noise": 0,
"omit_frequency": True,
}, # For debugging. No noise. All patients = disease.
"freqs": {
"imprecision": 0,
"noise": 0,
"omit_frequency": False,
}, # For simple cases without noise. All patients = disease*frequencies.
"impre": {
"imprecision": 1,
"noise": 0,
"omit_frequency": False,
}, # Meant for patients without the most granular terms.
"impre2": {
"imprecision": 2,
"noise": 0,
"omit_frequency": False,
}
'''
Comparing phenotypic profiles is often tricky. Venn diagrams are helpful, but often fall short in cases with complicated symptom relations. This module offers a detailed view of the overlap between, at most, 2 phenotypic profiles. It plots the HPO ontology graph with nodes color coded marking the common nodes and the nodes belonging to each profile. The plots use Plotly, so an interactivity-enabled viewer is recommended (most notebooks support this).
If a single phenotypic profile is passed as argument, it will plot the symptoms:
from rarecrowds import PhenotypicComparison
fig = PhenotypicComparison(patient = patients['ORPHA:324']['cohort'][0]['phenotype'])
If two phenotypic profiles are passed as argument, it will plot a comparison:
fig = PhenotypicComparison(
patient = patients['ORPHA:324']['cohort'][0]['phenotype'],
disease = { # This entry may also be a list of HPO terms.
'name': patients['ORPHA:324']['name'],
'id': patients['ORPHA:324']['id'],
'phenotype': patients['ORPHA:324']['phenotype']})
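The three node groups that such a comparison distinguishes (common, patient only, disease only) can be approximated with plain set operations, as in the sketch below. Note the simplification: this compares exact HPO IDs only, whereas the module plots the ontology graph, so related but non-identical terms are not matched here. The `overlap` helper and the toy profiles are hypothetical.

```python
def overlap(patient, disease):
    """Split two phenotypic profiles into shared and exclusive terms.

    Both arguments may be any iterable of HPO IDs, including the
    keys of a phenotype dictionary.
    """
    p, d = set(patient), set(disease)
    return {
        "common": sorted(p & d),
        "patient_only": sorted(p - d),
        "disease_only": sorted(d - p),
    }


# Toy profiles shaped like the 'phenotype' dictionaries shown above
# (the shared term is added here for illustration).
patient = {"HP:0025031": {}, "HP:0100279": {}, "HP:0000083": {}}
disease = {
    "HP:0000083": {"Frequency": "HP:0040281"},
    "HP:0100820": {"Frequency": "HP:0040283"},
}

groups = overlap(patient, disease)
print(groups)
```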
Finally, you may use the PhenotypicDatabase module to pull data from public sources. Currently, these are the supported sources:
Publication | Edited | Source | N. cases |
---|---|---|---|
Stavropoulos, 2016 | No | Rao, 2018 | 28 |
Bone, 2016 | No | Rao, 2018 | 3 |
Stelzer, 2016 | No | Rao, 2018 | 2 |
Lee, 2014 | No | Rao, 2018 | 200 |
Kleyner, 2016 | Yes | Kleyner, 2016 | 1 |
Zemojtel, 2014 | Added disease ID | Supp. | 11 |
Cipriani, 2020 | Added disease ID | Supp. | 134 |
Tomar, 2019 | Added disease ID | Supp. | 50 |
Ebiki, 2019 | No | Supp. | 20 |
ClinVar | Subsampled | ClinVar | 68153 |
Robinson (Multiple publications) | No | Robinson | 384 |
Any publication or algorithm stemming from data from the sources above MUST cite the source properly. The onus is on the publisher to comply with this.
To get an instance of the PhenotypicDatabase:
from rarecrowds import PhenotypicDatabase
db = PhenotypicDatabase()
The PhenotypicDatabase instance manages your local database. You may add data to it by downloading available data or by generating it locally (via simulations or a local push). Available datasets are not in your local database until you explicitly download them. To check what datasets are available and load them for later usage run:
datasets = db.get_available_datasets()
db.load('some_dataset')
In order to dump data from your database, you can get either a pandas dataframe or a list of dictionaries. To get a dataframe of the data in the database:
df = db.generate_dataframe()
To get a list of dictionaries of the data in the database:
data = db.generate_list_of_dicts()
There are many publications exploring the prediction of a particular rare disease based on a patient's phenotype. The phenotype analysis piece, which may or may not be the central aspect of a publication, largely falls under two categories: ontology-based or representation-based algorithms. Ontology-based algorithms define a logic by which distances between terms are calculated based on their position within the ontology and on how common each of them is among the rare diseases (via the information content: IC = -log(p)). Representation-based algorithms compute term representations based on embeddings calculated over a specific dataset. Ideally, the dataset should consist of individual (anonymous) patients in order to gather the most granular information. In the absence of this option, it is recommended to simulate such a dataset.
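As a small worked example of the information content mentioned above, the sketch below computes IC = -log(p) for a term, where p is taken to be the fraction of diseases annotated with that term. The `annotations` data and the `information_content` helper are hypothetical, and each disease's term set is assumed to already include the ancestors of its annotated terms, so that general terms come out as common.

```python
import math


def information_content(term, annotations):
    """IC = -log(p), with p the fraction of diseases having `term`."""
    n_total = len(annotations)
    n_with = sum(1 for terms in annotations.values() if term in terms)
    if n_with == 0:
        return math.inf  # never observed: maximally informative
    # log(n_total / n_with) equals -log(n_with / n_total)
    return math.log(n_total / n_with)


# Toy disease -> HPO term sets (made-up disease IDs).
annotations = {
    "TOY:1": {"HP:0000118", "HP:0001250", "HP:0007359"},
    "TOY:2": {"HP:0000118", "HP:0001250"},
    "TOY:3": {"HP:0000118"},
    "TOY:4": {"HP:0000118"},
}

ic_root = information_content("HP:0000118", annotations)
ic_mid = information_content("HP:0001250", annotations)
ic_leaf = information_content("HP:0007359", annotations)
print(ic_root, ic_mid, ic_leaf)
```

A term present in every disease carries no information (IC = 0), while a term seen in only one of the four toy diseases has IC = log(4).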
- CADA: phenotype-driven gene prioritization based on a case-enriched knowledge graph, 2021, Peng et al. https://academic.oup.com/nargab/article/3/3/lqab078/6363753
- Disease Prediction via Graph Neural Networks, 2021, Sun et al. https://pubmed.ncbi.nlm.nih.gov/32749976/
- Graph Neural Network-Based Diagnosis Prediction, 2020, Li et al. https://pubmed.ncbi.nlm.nih.gov/32783631/
- Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization, 2019, Jagadeesh et al. https://www.nature.com/articles/s41436-018-0072-y
- HPO2Vec+: Leveraging heterogeneous knowledge resources to enrich node embeddings for the Human Phenotype Ontology, 2019, Shen et al. https://pubmed.ncbi.nlm.nih.gov/31255713/
- Phenotype-driven gene prioritization for rare diseases using graph convolution on heterogeneous networks, 2018, Rao et al. https://pubmed.ncbi.nlm.nih.gov/29980210/
- Phenotype-loci associations in networks of patients with rare disorders: application to assist in the diagnosis of novel clinical cases, 2018, Bueno et al. https://www.nature.com/articles/s41431-018-0139-x?platform=hootsuite
- Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome, 2014, Zemojtel et al. https://pubmed.ncbi.nlm.nih.gov/25186178/
- PhenoDigm: analyzing curated annotations to associate animal models with human diseases, 2013, Smedley et al. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3649640/
- Bayesian ontology querying for accurate and noise-tolerant semantic searches, 2012, Bauer et al. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3463114/
- Clinical Diagnostics in Human Genetics with Semantic Similarity Searches in Ontologies, 2009, Köhler et al. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2756558/
- OligoPVP: Phenotype-driven analysis of individual genomic information to prioritize oligogenic disease variants, 2018, Boudellioua et al. https://pubmed.ncbi.nlm.nih.gov/30279426/
- Phenotype-driven strategies for exome prioritization of human Mendelian disease genes, 2015, Smedley et al. https://pubmed.ncbi.nlm.nih.gov/26229552/
The following references need to be added:
- Reference: Pavan S et al. Clinical Practice Guidelines for Rare Diseases: The Orphanet Database. PLoS One. 2017 Jan 18;12(1):e0170365. doi: 10.1371/journal.pone.0170365. PMID: 28099516; PMCID: PMC5242437.
- Link: https://www.orpha.net/
- Reference: Sebastian Köhler et al. The Human Phenotype Ontology in 2021, Nucleic Acids Research, Volume 49, Issue D1, 8 January 2021, Pages D1207–D1217, https://doi.org/10.1093/nar/gkaa1043
- Link: https://hpo.jax.org/app/
- Reference: Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Jang W, Karapetyan K, Katz K, Liu C, Maddipatla Z, Malheiro A, McDaniel K, Ovetsky M, Riley G, Zhou G, Holmes JB, Kattman BL, Maglott DR. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018 Jan 4. PubMed PMID: 29165669
- Link: https://www.ncbi.nlm.nih.gov/clinvar/