Skip to content

Commit

Permalink
Update to BMC template
Browse files Browse the repository at this point in the history
  • Loading branch information
cthoyt committed Jun 6, 2019
1 parent 7268c30 commit 17bb03c
Show file tree
Hide file tree
Showing 19 changed files with 9,688 additions and 3,611 deletions.
13 changes: 13 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
.DS_Store
.idea
*.aux
*.log
*.out
*.bcf
*.run.xml
*.toc
*.bbl
*.fdb_latexmk
*.blg
*.fls
*.dvi
Binary file removed OUP_First_SBk_Bot_8401-eps-converted-to.pdf
Binary file not shown.
Binary file removed OUP_First_SBk_Bot_8401.eps
Binary file not shown.
7 changes: 7 additions & 0 deletions README.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
Bio2BEL Manuscript
==================
To build as a PDF, clone the repository and use the following command:

.. source-code:: bash

$ latexmk -pdf -pvc bmc_article
35 changes: 35 additions & 0 deletions background.tex
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
The integration of heterogeneous, multi-scale, and multi-modal biomedical data has become a cornerstone of modern computational investigation of the mechanisms and aetiologies underlying complex diseases (Iyappan et al., 2014; van Dam et al. 2014; Wanichthanarak et al., 2015; Himmelstein et al., 2017; Fan et al., 2019).
An overarching strategy was proposed by Davidson \textit{et al.} more than two decades ago that outlined the transformation of data into a common model, semantic alignment of related objects, integration of schemata, and federation of data (Davidson et al., 1995).
However, integration remains a challenging task that requires the identification and deep understanding of biological data sources and their respective formats, conversion, harmonization, and unification.

Initial interest in the semantic web and linked open data along with the adoption of RDF (Resource Description Framework) in the biomedical community led to the Bio2RDF project, in which pipelines for converting and serializing several biological data sources to RDF were developed (Belleau et al., 2008).
Several updates have been issued since its deployment such as the inclusion of chemical information systems (Chen et al., 2010). Further, it has also influenced in and has been adopted by subsequent projects such as Open PHACTS (Williams et al., 2012).
While RDF is highly expressive and each of these projects have developed and enforced well-defined schemata, the format is often not well-suited for downstream analyses and must first be queried with languages like SPARQL (SPARQL Query Language for RDF) and subsequently be transformed into appropriate formats with general-purpose programming languages.
Alternatives to RDF/SPARQL such as property graphs (e.g., Neo4j, OrientDB) are comparable (Alocci et al., 2015) but also necessitate similar post-processing.

Conversely, there have been several biologically meaningful integration efforts (e.g., STRING; Warde-Farley, et al. 2010, GeneMANIA; Szklarczyk et al., 2015, GeneCards; Stelzer et al., 2016).
However, most suffer from a lack of defined schemata or standardized data format that impede biological database interoperability.
As interoperability itself is a multifaceted concept, we would like to highlight three of its facets: first, data sources should refer to named entities using high-quality, publicly accessible terminologies as prescribed by the Minimal Information Requested in the Annotation of Biochemical Models standard (Laibe and Le Nov\`ere, 2007).
Second, data sources should additionally denote the ontological classes of named entities (e.g., gene, transcript, protein, pathway, disease) along with their reference using controlled vocabularies such as the Systems Biology Ontology (Courtot et al., 2011).
Some identifiers, such as those for genes, are often used to refer not only to the physical region of DNA within the genome, but also the corresponding RNA transcript(s) or protein product(s).
Unfortunately, many biological databases do not explicitly distinguish between these entity classes.
For example, the STRING database lists gene-centric homology relationships, transcript-centric co-expression relationships, and protein-centric protein-protein interactions using gene-centric nomenclature.
While it may be possible to identify the classes of entities based on their incident relationships, doing so requires specific knowledge of the database including the semantics of its relationships.
Third, resources should, at a minimum, map their relationships to controlled vocabularies such as the Relation Ontology, or further use standardized data formats with defined semantics (e.g., PSI MI-TAB) to minimize both the interpretation and implementation effort when combining them with other resources.

OmniPath (T\"urei et al., 2016) began to address these facets when it combined several signaling pathway and transcriptional regulation databases.
It achieved interoperability between several databases by normalizing the identifiers and relationships between entities from several databases describing the same phenomena (e.g., microRNA-target interactions, protein-protein interactions, etc.) and creating a unified network.
However, because it did not use a standard format or schema as mentioned in the third facet for interoperability, OmniPath itself cannot readily be directly integrated with other biological data sources.
Pathway Commons (Cerami et al., 2011) addressed this concern when combining several molecular pathway and interaction databases by translating the source databases into the BioPAX standard (Demir et al., 2010) using automated pipelines.
However, it suffers from low granularity and low recovery of information from some of its primary biological data sources which may be due to prioritization of software development time, data usage restrictions, or shortcomings in the BioPAX standard.
While BioPAX is well-suited for representing biological reactions and transformations, it is limited in its ability to represent correlative and associative relationships across multi-scale biology (e.g., at the levels of processes, phenotypes, and clinical observations).

As an alternative, we propose the use of Biological Expression Language (BEL; Slater, 2014) as an integration schema in order to overcome the limits faced by previous efforts and to simultaneously address all three facets of interoperability.
BEL has begun to prove itself as a robust format in the curation and integration of previously isolated biological data sources of high granular information on genetic variation (Naz et al., 2016), epigenetics (Irin et al., 2015), chemogenomics (Emon et al., 2017), and clinical biomarkers (Iyappan et al., 2017).
Its syntax and semantics are also appropriate for representing, for example, disease-disease similarities, disease-protein associations, chemical space networks, genome-wide association studies, and phenome-wide association studies.

With the same focus on reproducibility as Bio2RDF, OmniPath, and Pathway Commons as well as deference to software maintainability and the ease of development and inclusion of new biological data sources, we have developed a growing list of Bio2BEL packages, each capable of downloading, structuring, and serializing various biological data sources to BEL (Table 2).
Each can be found in the Bio2BEL GitHub organization (\url{https://github.com/bio2bel}) as an independent open-source Python package that can readily be installed with pip.
We have also developed and freely provided a framework (\url{https://github.com/bio2bel/bio2bel}) in the Python programming language to enable code reuse and the fast generation of additional Bio2BEL packages.
Notably, the list of Bio2BEL packages includes one for OmniPath as a proof of concept that authors of other resources can implement their own Bio2BEL packages.
In this article, we present the philosophy and implementation of Bio2BEL packages, a summary of past and future Bio2BEL packages, and finally, several case studies including the utility of Bio2BEL packages during curation of pathway mappings, in the analysis of cancer genome data, and for machine learning applications.
Loading

0 comments on commit 17bb03c

Please sign in to comment.