MetabolicNetworkX¶
A networkX extension allowing biological network analysis and visualization¶
Group 2: Francisco Costa PG32922 | Mário Varela PG37984 | Rui Pires PG38280 | Patrícia Machado PG30377¶
The MetabolicNetworkX class was built to allow compreensive and fast analysis of a wide range of biological networks. This report will go over multiple of the class function on a step by step basis
First we will import all the necessary libraries. Note the presence of NetworkX and Pyvis packages, the main building blocks for this class
In [2]:
import networkx as nx
import xml.etree.ElementTree as ET
from pyvis.network import Network
import matplotlib.pyplot as plt
from MetabolicNetworkX import MetabolicNetworkX
%matplotlib inline
We will now initiate the class
In [3]:
MN = MetabolicNetworkX()
Note that right now the class is only an empty nx.DiGraph(), the networkX data structure best suitted for biological network manipulation. We now turn to two main methods of the class loadXML() and load_from_file()
loadXML() loads a metabolic network of a XML-SBML file from BioModels - EMBL-EBI. This method was only tested on microorganism but allows virtually infinite networks to manipulate. Three models are available to test from Aspergillus niger (iMA871), Streptococcus thermophilus (iMP429) and Staphylococcus aureus (iSB619) however we will focus on S. thermophilus for demonstration purposes
load_from_file() loads metabolic network from TXT file with the following format: "REACTION: METABOLITE => PRODUCT". Reversible reactions are represented by '<=>', and multiple metabolites and products are separated by ' + '. For demonstrarion we will mainly focus on Escherichia coli K-12 (iJR904) used in "Bioinformatics Algorithms" by Miguel Rocha and Pedro Ferreira
In [4]:
ST = MetabolicNetworkX()
ST.loadXML('MODEL1507180063_url.xml')
EC = MetabolicNetworkX()
EC.load_from_file('ecoli.txt')
Both networks are now available and can be analised with class methods. summary() and print_network() are the best ways to check the loading results and preview your network
In [57]:
EC.summary()
print('#####')
#EC.print_network()
Network has: 1937 nodes
Network has: 5256 edges
#####
The print_network() method, while functional, might be more informative for smaller networks and was not displayed so as to not occupy most of this report's length
In [6]:
print('E.coli dim')
EC.summary()
print('Stermo dim')
ST.summary()
E.coli dim
Network has: 1937 nodes
Network has: 5256 edges
Stermo dim
Network has: 1267 nodes
Network has: 3212 edges
Before begining the analysis itself we'll discuss a few miscellaneous functions:
see_disconnected() importing networks leads to possible data fragments and isolated nodes or graphs, these fragments can be due to errors or leftovers in the original file or bugs on import. This method allows to see all subgraphs and isolated nodes in the data structure
remove_disconnected() Besides seeing we might need to remove these subgraphs as to not have any bias in our analysis
remove_reactions() This extension was built primarily for Metabolic Networks, specifically Metabolic-Reaction (M-R) networks however this method allows the conversion to a Metabolic-Metabolic (M-M) network, that can be both lighter on processing and more informative in some cases. Note this overrides the original network and is not revertable unless you reload the network as we've seen previously
In [4]:
# due to the irreversible nature of these methods we will use a different network
SA = MetabolicNetworkX()
SA.loadXML('MODEL1507180070_url.xml') #S.aureus
SA.summary()
SA.remove_disconnected()
print('After removal')
SA.summary()
SA.remove_reactions()
print('After Conversion to M-M')
SA.summary()
Network has: 1538 nodes
Network has: 4567 edges
After removal
Network has: 1529 nodes
Network has: 4559 edges
After Conversion to M-M
Network has: 648 nodes
Network has: 3262 edges
We can look at any of our current networks using the method see_graph2() where molecules are represented by yellow dots and reactions with green triangles
In [9]:
EC.see_graph2('ecolibasenetwork.html')
Note the physics options on the side. We begin the rendering process with the Barnes–Hut approximation algorithm since this tends to perform better on an already prolonged rendering however we recomend ForceAtlas2 Based for optimal viewing, with a low min/max velocity and timestep to prevent jittering.
Note: If you are running the code locally you'll have noticed a new file called ecolibasenetwork.html in your working directory.
The file can't be displayed in this report directly due to the way the method works but we can rebuild its core struture using its pyvis components.
In [5]:
g = Network( height="800px", width="800px", notebook=True)
g.barnes_hut()
g.from_nx(SA.mnetwork) #smaller network
g.show('core.html')
Out[5]:
Note the absence of reactions, removed previously, but also organizational and aesthetic differences. The see_graph2 method allows a comprehensive and dynamic representaction of the network with individual nodes and reactions and respective successors. This display will be integrated later with the topological analysis
Topological Analysis¶
Metrics to check properties of the network may lead to biological insight. For this report we focused on four different metrics, Degree Distribution, Shortest Path Analysis, Clustering Coefficient and Hubs and Centrality Measures.
Degree Distribution¶
For this metric we implemented the following metrics
- [in/out/all]_degree_list(): Lists all nodes and respective in, out or inout degrees
- all_degreetype_dict(): Similar to the previous method but in a dictionary struture
- mean_degree(): Calculates the Networks Mean Degree of a given degree type
- prob_degree(): Calculates the probability of degree value in the network
- plot_prob_degree(): Plots the previous method
In [33]:
#these methods would list all nodes and make the output relativly unreadable
#print(EC.all_degree_list())
#print(EC.all_degreetype_dict())
print('MeanDegree: ',EC.mean_degree())
print()
print(EC.prob_degree())
EC.plot_prob_degree()
MeanDegree: 5.4269488900361385
{6: 0.07072792978833248, 603: 0.0005162622612287042, 24: 0.0005162622612287042, 3: 0.1254517294785751, 87: 0.0005162622612287042, 92: 0.0005162622612287042, 2: 0.18172431595250388, 5: 0.1590087764584409, 4: 0.2973670624677336, 312: 0.0005162622612287042, 81: 0.0005162622612287042, 46: 0.0005162622612287042, 57: 0.0005162622612287042, 31: 0.0005162622612287042, 66: 0.0005162622612287042, 8: 0.006711409395973154, 11: 0.002581311306143521, 211: 0.0005162622612287042, 13: 0.0015487867836861124, 175: 0.0005162622612287042, 10: 0.005678884873515746, 18: 0.002065049044914817, 177: 0.0005162622612287042, 7: 0.049044914816726896, 16: 0.0015487867836861124, 1: 0.05162622612287042, 22: 0.0005162622612287042, 172: 0.0005162622612287042, 9: 0.021683014971605574, 15: 0.0015487867836861124, 62: 0.0005162622612287042, 64: 0.0005162622612287042, 37: 0.0005162622612287042, 58: 0.0005162622612287042, 17: 0.0015487867836861124, 12: 0.004646360351058337, 41: 0.0005162622612287042, 19: 0.0010325245224574084, 26: 0.0010325245224574084, 14: 0.0015487867836861124, 40: 0.0005162622612287042, 25: 0.0005162622612287042, 23: 0.0005162622612287042}
plot_prob_degree() uses matplotlib and can be further explored for better visualization
In [28]:
EC.plot_prob_degree(title='E. coli Probability Degree', xlabel='k', ylabel='P(k)', color = 'm', xlim=(0,50))
We can see a clear higher probability of nodes with low degrees consistent with a theoretical scale free network and can be seen in all other networks already covered
In [30]:
ST.plot_prob_degree(title='S. thermo Probability Degree', xlabel='k', ylabel='P(k)', color = 'c', xlim=(0,50))
A perk of our data structure is that we can override mnetwork to fit any networkX available graph types, as such it is possible to see what a theoretical scale free network and random network would look like
In [32]:
SF = MetabolicNetworkX()
SF.mnetwork = nx.scale_free_graph(1000, seed=2424)
R = MetabolicNetworkX()
R.mnetwork = nx.gnp_random_graph(1000, 0.5, seed=2424) # Erdös–Rényi (ER) model
SF.plot_prob_degree(title = 'Theoretical Scale Free', xlabel='k', ylabel='P(k)', color = 'r',)
R.plot_prob_degree(title = 'Theoretical Random', xlabel='k', ylabel='P(k)', color = 'k',)
We can see how similar the Theoretical Scale Free Network is to our real data and how the Theoretical Random aproaxh normal distributions, replicating the results found in literature (A-L.Barabási, 2004; M.Rocha, 2018)
Shortest Path Analysis¶
For this metric no methods were implemented thanks to networkX's nx.average_shortest_path_length() that allows us to determine directly the average shortest path of our network.
Once again we can compare these across our networks and even theoretical networks
In [35]:
ST.remove_disconnected() #this funtion can't perform with subgraphs present
print('Mean Distances for E.coli :', nx.average_shortest_path_length(EC.mnetwork))
print('Mean Distances for S.thermo :', nx.average_shortest_path_length(ST.mnetwork))
print('Mean Distances for Theoretical Random:', nx.average_shortest_path_length(R.mnetwork))
Mean Distances for E.coli : 4.96509496452297
Mean Distances for S.thermo : 4.791015840942451
Mean Distances for Theoretical Random: 1.4999099099099098
The results for metabolic networks show a low range of steps corroborating the scale free denotation however with the random network this result is even lower. This is because with the Erdös–Rényi model a node has a 50/50 change of connecting to any other node when generated making it almost a certanty that it will be able to reach any other node in the netwok in less than two steps or less.
Clustering Coefficient¶
For calculting the clustering coefficient two methods were implemented
- mean_clustering_perdegree() calculates all average clustering coeffient for each degree using networkX's nx.clustering()
- plot_clustering_perdegree allows ploting of the previous methods results. Similar in parameters to plot_prob_degree()
In [37]:
print('Average Clustering E.coli: ', nx.average_clustering(EC.mnetwork))
print()
print(EC.mean_clustering_perdegree())
Average Clustering E.coli: 0.0
{6: 0.0, 603: 0.0, 24: 0.0, 3: 0.0, 87: 0.0, 92: 0.0, 2: 0.0, 5: 0.0, 4: 0.0, 312: 0.0, 81: 0.0, 46: 0.0, 57: 0.0, 31: 0.0, 66: 0.0, 8: 0.0, 11: 0.0, 211: 0.0, 13: 0.0, 175: 0.0, 10: 0.0, 18: 0.0, 177: 0.0, 7: 0.0, 16: 0.0, 1: 0.0, 22: 0.0, 172: 0.0, 9: 0.0, 15: 0.0, 62: 0.0, 64: 0.0, 37: 0.0, 58: 0.0, 17: 0.0, 12: 0.0, 41: 0.0, 19: 0.0, 26: 0.0, 14: 0.0, 40: 0.0, 25: 0.0, 23: 0.0}
In [42]:
EC.plot_clustering_perdegree(title='E.coli Clustering Per Degree', xlabel='k', ylabel='C(k)', color = 'g')
In [41]:
#testing against random graph
R.plot_clustering_perdegree(title='Theoretical Random Clustering Per Degree', xlabel='k', ylabel='C(k)', color = 'k')
Hubs and Centrality Measures¶
This metric consists on identifying important nodes for the network are was, as such integrated with see_graph2() the methods implemeted were
- highest_degrees() using all_degree_list() lists nodes with highest degree centrality
- betweenness_centrality() calculates betweeness for all nodes
- highest_betweenness() using betweenness_centrality() lists nodes with highest betweeness
- closeness_centrality() calculates closeness for all nodes
- highest_closeness() using closeness_centrality() lists nodes with highest closeness
Highest Degree¶
In [43]:
print('E.coli nodes with Highest Degrees:', EC.highest_degrees())
E.coli nodes with Highest Degrees: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c']
This can be expanded to any number within range
In [46]:
print('E.coli nodes with Highest Degrees:', EC.highest_degrees(20))
E.coli nodes with Highest Degrees: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c', 'M_nadp_c', 'M_nadph_c', 'M_glu_DASH_L_c', 'M_pyr_c', 'M_coa_c', 'M_nh4_c', 'M_amp_c', 'M_akg_c', 'M_accoa_c', 'M_ACP_c']
Visualizing these nodes was represented by changing their default color to highlight this metric. Metabolite nodes, default yellow are converted to red and Reaction Nodes, default green turn blue. You will notice however none of the listed nodes are a reaction, theoretically it will be quite difficult to find a reaction on the top highest degree nodes in biological network however the code does support this possibility
In [48]:
EC.see_graph2(htmlname='ecolihighestdegree.html', highlightCentrality=True ,h_degree=True,
cent_between=False, cent_close=False, top_num=10)
You have selected to highlight nodes and edges based on their centrality measures
Highlighted Nodes based on highest degree:
(['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c'], [])
Highest Closeness¶
The method closeness_centrality() uses networkX's nx.algorithms.closeness_centrality(), highest_closeness() similarly to highest_degrees() returns a sorted top of the nodes. Closeness identifies nodes which are closest to other nodes, to visualize this we enlarged these nodes to highlight their centrality to all others of the network
In [50]:
print('E.coli nodes with Highest Closeness:', EC.highest_closeness())
#like before we can expand this to any number
print('E.coli nodes with Highest Closeness:', EC.highest_closeness(20))
E.coli nodes with Highest Closeness: ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']
E.coli nodes with Highest Closeness: ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA', 'R_PIt2r_b', 'R_ADD', 'R_DADA', 'R_GUAD', 'R_UGLYCH', 'R_CSND', 'R_DCYTD', 'R_DCTPD', 'R_CYTD', 'R_DHPPDA2']
In [54]:
EC.see_graph2(htmlname='ecolihighestcloseness.html', highlightCentrality=True ,h_degree=False,
cent_between=False, cent_close=True, top_num=10)
You have selected to highlight nodes and edges based on their centrality measures
Highlighted Nodes based on Closeness Centrality:
['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']
Highest Betweenness¶
The method betweenness_centrality() uses networkX's nx.algorithms.betweenness_centrality(), highest_betweenness() similarly to highest_degrees() and highest_closeness() returns a sorted top of the nodes. Betweenness is based on the shortest paths between nodes therefore identifying nodes with high betweenness is identifying sources/origins of edges that will be used more often by shortest paths as such, to visualize these paths we enlarge the edges of top nodes and color them with their origin's color.
In [52]:
print('E.coli nodes with Highest Betweenness:', EC.highest_betweenness())
#like before we can expand this to any number
print('E.coli nodes with Highest Betweenness:', EC.highest_betweenness(20))
E.coli nodes with Highest Betweenness: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']
E.coli nodes with Highest Betweenness: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r', 'M_pyr_c', 'M_co2_c', 'M_glu_DASH_L_c', 'M_nh4_c', 'M_akg_c', 'R_CBLAT_DELETE_b', 'R_CBIAT_DELETE_b', 'M_coa_c', 'R_GLUDy_b', 'R_ACCOACr_b']
In [55]:
EC.see_graph2(htmlname='ecolihighestbetweenness.html', highlightCentrality=True ,h_degree=False,
cent_between=True, cent_close=False, top_num=10)
You have selected to highlight nodes and edges based on their centrality measures
Highlighted Edges based on Betweenness Centrality:
['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']
Finally we can display all previous centrality measures combined for a trully informative graph visualization¶
In [56]:
EC.see_graph2('EcoliAllCentrality.html', highlightCentrality=True)
You have selected to highlight nodes and edges based on their centrality measures
Highlighted Nodes based on highest degree:
(['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c'], [])
Highlighted Nodes based on Closeness Centrality:
['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']
Highlighted Edges based on Betweenness Centrality:
['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']
Bulk Analysis¶
The final method implemeted in MetabolicNetworkX is topological_analysis_full() allowing us to perform all previous analysis in a single step
In [38]:
EC.topological_analysis_full()
###Network Topological Analysis###
1. Degree Distribution
Network Mean Degree:
Mean InDegree: 2.7134744450180692
Mean OutDegree: 2.7134744450180692
Mean (Inout)Degree: 5.4269488900361385
Network Probability (Inout)Degree:
{6: 0.07072792978833248, 603: 0.0005162622612287042, 24: 0.0005162622612287042, 3: 0.1254517294785751, 87: 0.0005162622612287042, 92: 0.0005162622612287042, 2: 0.18172431595250388, 5: 0.1590087764584409, 4: 0.2973670624677336, 312: 0.0005162622612287042, 81: 0.0005162622612287042, 46: 0.0005162622612287042, 57: 0.0005162622612287042, 31: 0.0005162622612287042, 66: 0.0005162622612287042, 8: 0.006711409395973154, 11: 0.002581311306143521, 211: 0.0005162622612287042, 13: 0.0015487867836861124, 175: 0.0005162622612287042, 10: 0.005678884873515746, 18: 0.002065049044914817, 177: 0.0005162622612287042, 7: 0.049044914816726896, 16: 0.0015487867836861124, 1: 0.05162622612287042, 22: 0.0005162622612287042, 172: 0.0005162622612287042, 9: 0.021683014971605574, 15: 0.0015487867836861124, 62: 0.0005162622612287042, 64: 0.0005162622612287042, 37: 0.0005162622612287042, 58: 0.0005162622612287042, 17: 0.0015487867836861124, 12: 0.004646360351058337, 41: 0.0005162622612287042, 19: 0.0010325245224574084, 26: 0.0010325245224574084, 14: 0.0015487867836861124, 40: 0.0005162622612287042, 25: 0.0005162622612287042, 23: 0.0005162622612287042}
2. Shortest Path Analysis
Mean Distances: 4.96509496452297
3. Clustering Coefficients
Average Clustering: 0.0
Mean Clustering per Degree:
{6: 0.0, 603: 0.0, 24: 0.0, 3: 0.0, 87: 0.0, 92: 0.0, 2: 0.0, 5: 0.0, 4: 0.0, 312: 0.0, 81: 0.0, 46: 0.0, 57: 0.0, 31: 0.0, 66: 0.0, 8: 0.0, 11: 0.0, 211: 0.0, 13: 0.0, 175: 0.0, 10: 0.0, 18: 0.0, 177: 0.0, 7: 0.0, 16: 0.0, 1: 0.0, 22: 0.0, 172: 0.0, 9: 0.0, 15: 0.0, 62: 0.0, 64: 0.0, 37: 0.0, 58: 0.0, 17: 0.0, 12: 0.0, 41: 0.0, 19: 0.0, 26: 0.0, 14: 0.0, 40: 0.0, 25: 0.0, 23: 0.0}
4. Hubs and Centrality Measures
Nodes w/ Highest Degrees: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c']
Nodes w/ Highest Betweenness: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']
Nodes w/ Highest Closeness: ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']
This method uses all methods discussed on the analysis section of this report replacing a degree of freedom for choosing parameters for a more confortable and quick way to display the relevant statistics in a single command