MetabolicNetworkX¶

A networkX extension allowing biological network analysis and visualization¶

Group 2: Francisco Costa PG32922 | Mário Varela PG37984 | Rui Pires PG38280 | Patrícia Machado PG30377¶

The MetabolicNetworkX class was built to allow compreensive and fast analysis of a wide range of biological networks. This report will go over multiple of the class function on a step by step basis

First we will import all the necessary libraries. Note the presence of NetworkX and Pyvis packages, the main building blocks for this class

In [2]:

import networkx as nx
import xml.etree.ElementTree as ET
from pyvis.network import Network
import matplotlib.pyplot as plt
from MetabolicNetworkX import MetabolicNetworkX
%matplotlib inline

We will now initiate the class

In [3]:

MN = MetabolicNetworkX()

Note that right now the class is only an empty nx.DiGraph(), the networkX data structure best suitted for biological network manipulation. We now turn to two main methods of the class loadXML() and load_from_file()

loadXML() loads a metabolic network of a XML-SBML file from BioModels - EMBL-EBI. This method was only tested on microorganism but allows virtually infinite networks to manipulate. Three models are available to test from Aspergillus niger (iMA871), Streptococcus thermophilus (iMP429) and Staphylococcus aureus (iSB619) however we will focus on S. thermophilus for demonstration purposes

load_from_file() loads metabolic network from TXT file with the following format: "REACTION: METABOLITE => PRODUCT". Reversible reactions are represented by '<=>', and multiple metabolites and products are separated by ' + '. For demonstrarion we will mainly focus on Escherichia coli K-12 (iJR904) used in "Bioinformatics Algorithms" by Miguel Rocha and Pedro Ferreira

In [4]:

ST = MetabolicNetworkX()
ST.loadXML('MODEL1507180063_url.xml')

EC = MetabolicNetworkX()
EC.load_from_file('ecoli.txt')

Both networks are now available and can be analised with class methods. summary() and print_network() are the best ways to check the loading results and preview your network

In [57]:

EC.summary()
print('#####')
#EC.print_network()

Network has: 1937 nodes
Network has: 5256 edges

#####

The print_network() method, while functional, might be more informative for smaller networks and was not displayed so as to not occupy most of this report's length

In [6]:

print('E.coli dim')
EC.summary()
print('Stermo dim')
ST.summary()

E.coli dim
Network has: 1937 nodes
Network has: 5256 edges

Stermo dim
Network has: 1267 nodes
Network has: 3212 edges

Before begining the analysis itself we'll discuss a few miscellaneous functions:

see_disconnected() importing networks leads to possible data fragments and isolated nodes or graphs, these fragments can be due to errors or leftovers in the original file or bugs on import. This method allows to see all subgraphs and isolated nodes in the data structure

remove_disconnected() Besides seeing we might need to remove these subgraphs as to not have any bias in our analysis

remove_reactions() This extension was built primarily for Metabolic Networks, specifically Metabolic-Reaction (M-R) networks however this method allows the conversion to a Metabolic-Metabolic (M-M) network, that can be both lighter on processing and more informative in some cases. Note this overrides the original network and is not revertable unless you reload the network as we've seen previously

In [4]:

# due to the irreversible nature of these methods we will use a different network
SA = MetabolicNetworkX()
SA.loadXML('MODEL1507180070_url.xml') #S.aureus
SA.summary()
SA.remove_disconnected()
print('After removal')
SA.summary()
SA.remove_reactions()
print('After Conversion to M-M')
SA.summary()

Network has: 1538 nodes
Network has: 4567 edges

After removal
Network has: 1529 nodes
Network has: 4559 edges

After Conversion to M-M
Network has: 648 nodes
Network has: 3262 edges

We can look at any of our current networks using the method see_graph2() where molecules are represented by yellow dots and reactions with green triangles

In [9]:

EC.see_graph2('ecolibasenetwork.html')

Note the physics options on the side. We begin the rendering process with the Barnes–Hut approximation algorithm since this tends to perform better on an already prolonged rendering however we recomend ForceAtlas2 Based for optimal viewing, with a low min/max velocity and timestep to prevent jittering.

Note: If you are running the code locally you'll have noticed a new file called ecolibasenetwork.html in your working directory.

The file can't be displayed in this report directly due to the way the method works but we can rebuild its core struture using its pyvis components.

In [5]:

g = Network( height="800px", width="800px", notebook=True)
g.barnes_hut()
g.from_nx(SA.mnetwork) #smaller network
g.show('core.html')

Out[5]:

Note the absence of reactions, removed previously, but also organizational and aesthetic differences. The see_graph2 method allows a comprehensive and dynamic representaction of the network with individual nodes and reactions and respective successors. This display will be integrated later with the topological analysis

Topological Analysis¶

Metrics to check properties of the network may lead to biological insight. For this report we focused on four different metrics, Degree Distribution, Shortest Path Analysis, Clustering Coefficient and Hubs and Centrality Measures.

Degree Distribution¶

For this metric we implemented the following metrics

[in/out/all]_degree_list(): Lists all nodes and respective in, out or inout degrees
all_degreetype_dict(): Similar to the previous method but in a dictionary struture
mean_degree(): Calculates the Networks Mean Degree of a given degree type
prob_degree(): Calculates the probability of degree value in the network
plot_prob_degree(): Plots the previous method

In [33]:

#these methods would list all nodes and make the output relativly unreadable
#print(EC.all_degree_list())
#print(EC.all_degreetype_dict())

print('MeanDegree: ',EC.mean_degree())
print()
print(EC.prob_degree())

EC.plot_prob_degree()

MeanDegree:  5.4269488900361385

{6: 0.07072792978833248, 603: 0.0005162622612287042, 24: 0.0005162622612287042, 3: 0.1254517294785751, 87: 0.0005162622612287042, 92: 0.0005162622612287042, 2: 0.18172431595250388, 5: 0.1590087764584409, 4: 0.2973670624677336, 312: 0.0005162622612287042, 81: 0.0005162622612287042, 46: 0.0005162622612287042, 57: 0.0005162622612287042, 31: 0.0005162622612287042, 66: 0.0005162622612287042, 8: 0.006711409395973154, 11: 0.002581311306143521, 211: 0.0005162622612287042, 13: 0.0015487867836861124, 175: 0.0005162622612287042, 10: 0.005678884873515746, 18: 0.002065049044914817, 177: 0.0005162622612287042, 7: 0.049044914816726896, 16: 0.0015487867836861124, 1: 0.05162622612287042, 22: 0.0005162622612287042, 172: 0.0005162622612287042, 9: 0.021683014971605574, 15: 0.0015487867836861124, 62: 0.0005162622612287042, 64: 0.0005162622612287042, 37: 0.0005162622612287042, 58: 0.0005162622612287042, 17: 0.0015487867836861124, 12: 0.004646360351058337, 41: 0.0005162622612287042, 19: 0.0010325245224574084, 26: 0.0010325245224574084, 14: 0.0015487867836861124, 40: 0.0005162622612287042, 25: 0.0005162622612287042, 23: 0.0005162622612287042}

plot_prob_degree() uses matplotlib and can be further explored for better visualization

In [28]:

EC.plot_prob_degree(title='E. coli Probability Degree', xlabel='k', ylabel='P(k)', color = 'm', xlim=(0,50))

We can see a clear higher probability of nodes with low degrees consistent with a theoretical scale free network and can be seen in all other networks already covered

In [30]:

ST.plot_prob_degree(title='S. thermo Probability Degree', xlabel='k', ylabel='P(k)', color = 'c', xlim=(0,50))

A perk of our data structure is that we can override mnetwork to fit any networkX available graph types, as such it is possible to see what a theoretical scale free network and random network would look like

In [32]:

SF = MetabolicNetworkX()
SF.mnetwork = nx.scale_free_graph(1000, seed=2424)

R = MetabolicNetworkX()
R.mnetwork = nx.gnp_random_graph(1000, 0.5, seed=2424)  # Erdös–Rényi (ER) model


SF.plot_prob_degree(title = 'Theoretical Scale Free', xlabel='k', ylabel='P(k)', color = 'r',)
R.plot_prob_degree(title = 'Theoretical Random', xlabel='k', ylabel='P(k)', color = 'k',)

We can see how similar the Theoretical Scale Free Network is to our real data and how the Theoretical Random aproaxh normal distributions, replicating the results found in literature (A-L.Barabási, 2004; M.Rocha, 2018)

Shortest Path Analysis¶

For this metric no methods were implemented thanks to networkX's nx.average_shortest_path_length() that allows us to determine directly the average shortest path of our network.

Once again we can compare these across our networks and even theoretical networks

In [35]:

ST.remove_disconnected() #this funtion can't perform with subgraphs present
print('Mean Distances for E.coli :', nx.average_shortest_path_length(EC.mnetwork))
print('Mean Distances for S.thermo :', nx.average_shortest_path_length(ST.mnetwork))
print('Mean Distances for Theoretical Random:', nx.average_shortest_path_length(R.mnetwork))

Mean Distances for E.coli : 4.96509496452297
Mean Distances for S.thermo : 4.791015840942451
Mean Distances for Theoretical Random: 1.4999099099099098

The results for metabolic networks show a low range of steps corroborating the scale free denotation however with the random network this result is even lower. This is because with the Erdös–Rényi model a node has a 50/50 change of connecting to any other node when generated making it almost a certanty that it will be able to reach any other node in the netwok in less than two steps or less.

Clustering Coefficient¶

For calculting the clustering coefficient two methods were implemented

mean_clustering_perdegree() calculates all average clustering coeffient for each degree using networkX's nx.clustering()
plot_clustering_perdegree allows ploting of the previous methods results. Similar in parameters to plot_prob_degree()

In [37]:

print('Average Clustering E.coli: ', nx.average_clustering(EC.mnetwork))
print()
print(EC.mean_clustering_perdegree())

Average Clustering E.coli:  0.0

{6: 0.0, 603: 0.0, 24: 0.0, 3: 0.0, 87: 0.0, 92: 0.0, 2: 0.0, 5: 0.0, 4: 0.0, 312: 0.0, 81: 0.0, 46: 0.0, 57: 0.0, 31: 0.0, 66: 0.0, 8: 0.0, 11: 0.0, 211: 0.0, 13: 0.0, 175: 0.0, 10: 0.0, 18: 0.0, 177: 0.0, 7: 0.0, 16: 0.0, 1: 0.0, 22: 0.0, 172: 0.0, 9: 0.0, 15: 0.0, 62: 0.0, 64: 0.0, 37: 0.0, 58: 0.0, 17: 0.0, 12: 0.0, 41: 0.0, 19: 0.0, 26: 0.0, 14: 0.0, 40: 0.0, 25: 0.0, 23: 0.0}

In [42]:

EC.plot_clustering_perdegree(title='E.coli Clustering Per Degree', xlabel='k', ylabel='C(k)', color = 'g')

In [41]:

#testing against random graph
R.plot_clustering_perdegree(title='Theoretical Random Clustering Per Degree', xlabel='k', ylabel='C(k)', color = 'k')

Hubs and Centrality Measures¶

This metric consists on identifying important nodes for the network are was, as such integrated with see_graph2() the methods implemeted were

highest_degrees() using all_degree_list() lists nodes with highest degree centrality
betweenness_centrality() calculates betweeness for all nodes
highest_betweenness() using betweenness_centrality() lists nodes with highest betweeness
closeness_centrality() calculates closeness for all nodes
highest_closeness() using closeness_centrality() lists nodes with highest closeness

Highest Degree¶

In [43]:

print('E.coli nodes with Highest Degrees:', EC.highest_degrees())

E.coli nodes with Highest Degrees: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c']

This can be expanded to any number within range

In [46]:

print('E.coli nodes with Highest Degrees:', EC.highest_degrees(20))

E.coli nodes with Highest Degrees: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c', 'M_nadp_c', 'M_nadph_c', 'M_glu_DASH_L_c', 'M_pyr_c', 'M_coa_c', 'M_nh4_c', 'M_amp_c', 'M_akg_c', 'M_accoa_c', 'M_ACP_c']

Visualizing these nodes was represented by changing their default color to highlight this metric. Metabolite nodes, default yellow are converted to red and Reaction Nodes, default green turn blue. You will notice however none of the listed nodes are a reaction, theoretically it will be quite difficult to find a reaction on the top highest degree nodes in biological network however the code does support this possibility

In [48]:

EC.see_graph2(htmlname='ecolihighestdegree.html', highlightCentrality=True ,h_degree=True,
              cent_between=False, cent_close=False, top_num=10)

You have selected to highlight nodes and edges based on their centrality measures

Highlighted Nodes based on highest degree:
 (['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c'], [])

Highest Closeness¶

The method closeness_centrality() uses networkX's nx.algorithms.closeness_centrality(), highest_closeness() similarly to highest_degrees() returns a sorted top of the nodes. Closeness identifies nodes which are closest to other nodes, to visualize this we enlarged these nodes to highlight their centrality to all others of the network

In [50]:

print('E.coli nodes with Highest Closeness:', EC.highest_closeness())
#like before we can expand this to any number
print('E.coli nodes with Highest Closeness:', EC.highest_closeness(20))

E.coli nodes with Highest Closeness: ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']
E.coli nodes with Highest Closeness: ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA', 'R_PIt2r_b', 'R_ADD', 'R_DADA', 'R_GUAD', 'R_UGLYCH', 'R_CSND', 'R_DCYTD', 'R_DCTPD', 'R_CYTD', 'R_DHPPDA2']

In [54]:

EC.see_graph2(htmlname='ecolihighestcloseness.html', highlightCentrality=True ,h_degree=False,
              cent_between=False, cent_close=True, top_num=10)

You have selected to highlight nodes and edges based on their centrality measures

Highlighted Nodes based on Closeness Centrality:
 ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']

Highest Betweenness¶

The method betweenness_centrality() uses networkX's nx.algorithms.betweenness_centrality(), highest_betweenness() similarly to highest_degrees() and highest_closeness() returns a sorted top of the nodes. Betweenness is based on the shortest paths between nodes therefore identifying nodes with high betweenness is identifying sources/origins of edges that will be used more often by shortest paths as such, to visualize these paths we enlarge the edges of top nodes and color them with their origin's color.

In [52]:

print('E.coli nodes with Highest Betweenness:', EC.highest_betweenness())
#like before we can expand this to any number
print('E.coli nodes with Highest Betweenness:', EC.highest_betweenness(20))

E.coli nodes with Highest Betweenness: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']
E.coli nodes with Highest Betweenness: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r', 'M_pyr_c', 'M_co2_c', 'M_glu_DASH_L_c', 'M_nh4_c', 'M_akg_c', 'R_CBLAT_DELETE_b', 'R_CBIAT_DELETE_b', 'M_coa_c', 'R_GLUDy_b', 'R_ACCOACr_b']

In [55]:

EC.see_graph2(htmlname='ecolihighestbetweenness.html', highlightCentrality=True ,h_degree=False,
              cent_between=True, cent_close=False, top_num=10)

You have selected to highlight nodes and edges based on their centrality measures

Highlighted Edges based on Betweenness Centrality:
 ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']

Finally we can display all previous centrality measures combined for a trully informative graph visualization¶

In [56]:

EC.see_graph2('EcoliAllCentrality.html', highlightCentrality=True)

You have selected to highlight nodes and edges based on their centrality measures

Highlighted Nodes based on highest degree:
 (['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c'], [])
Highlighted Nodes based on Closeness Centrality:
 ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']
Highlighted Edges based on Betweenness Centrality:
 ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']

Bulk Analysis¶

The final method implemeted in MetabolicNetworkX is topological_analysis_full() allowing us to perform all previous analysis in a single step

In [38]:

EC.topological_analysis_full()

###Network Topological Analysis###

1. Degree Distribution

    Network Mean Degree:
         Mean InDegree: 2.7134744450180692
         Mean OutDegree: 2.7134744450180692
         Mean (Inout)Degree: 5.4269488900361385

    Network Probability (Inout)Degree:
 {6: 0.07072792978833248, 603: 0.0005162622612287042, 24: 0.0005162622612287042, 3: 0.1254517294785751, 87: 0.0005162622612287042, 92: 0.0005162622612287042, 2: 0.18172431595250388, 5: 0.1590087764584409, 4: 0.2973670624677336, 312: 0.0005162622612287042, 81: 0.0005162622612287042, 46: 0.0005162622612287042, 57: 0.0005162622612287042, 31: 0.0005162622612287042, 66: 0.0005162622612287042, 8: 0.006711409395973154, 11: 0.002581311306143521, 211: 0.0005162622612287042, 13: 0.0015487867836861124, 175: 0.0005162622612287042, 10: 0.005678884873515746, 18: 0.002065049044914817, 177: 0.0005162622612287042, 7: 0.049044914816726896, 16: 0.0015487867836861124, 1: 0.05162622612287042, 22: 0.0005162622612287042, 172: 0.0005162622612287042, 9: 0.021683014971605574, 15: 0.0015487867836861124, 62: 0.0005162622612287042, 64: 0.0005162622612287042, 37: 0.0005162622612287042, 58: 0.0005162622612287042, 17: 0.0015487867836861124, 12: 0.004646360351058337, 41: 0.0005162622612287042, 19: 0.0010325245224574084, 26: 0.0010325245224574084, 14: 0.0015487867836861124, 40: 0.0005162622612287042, 25: 0.0005162622612287042, 23: 0.0005162622612287042}

2. Shortest Path Analysis
    Mean Distances: 4.96509496452297


3. Clustering Coefficients
    Average Clustering: 0.0
    Mean Clustering per Degree:
 {6: 0.0, 603: 0.0, 24: 0.0, 3: 0.0, 87: 0.0, 92: 0.0, 2: 0.0, 5: 0.0, 4: 0.0, 312: 0.0, 81: 0.0, 46: 0.0, 57: 0.0, 31: 0.0, 66: 0.0, 8: 0.0, 11: 0.0, 211: 0.0, 13: 0.0, 175: 0.0, 10: 0.0, 18: 0.0, 177: 0.0, 7: 0.0, 16: 0.0, 1: 0.0, 22: 0.0, 172: 0.0, 9: 0.0, 15: 0.0, 62: 0.0, 64: 0.0, 37: 0.0, 58: 0.0, 17: 0.0, 12: 0.0, 41: 0.0, 19: 0.0, 26: 0.0, 14: 0.0, 40: 0.0, 25: 0.0, 23: 0.0}

4. Hubs and Centrality Measures
    Nodes w/ Highest Degrees: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_pi_c', 'M_adp_c', 'M_h_e', 'M_nad_c', 'M_nadh_c', 'M_ppi_c', 'M_co2_c']
    Nodes w/ Highest Betweenness: ['M_h_c', 'M_h2o_c', 'M_atp_c', 'M_h_e', 'M_pi_c', 'M_adp_c', 'M_nad_c', 'M_ppi_c', 'M_nadp_c', 'R_ATPS4r']
    Nodes w/ Highest Closeness: ['M_h_c', 'M_pi_c', 'R_PRASCS_b', 'R_ACCOACr_b', 'R_DBTSr_b', 'R_PRAGSr_b', 'R_ALAALAr_b', 'R_ATPS4r_b', 'R_OCBT_b', 'R_ADA']

This method uses all methods discussed on the analysis section of this report replacing a degree of freedom for choosing parameters for a more confortable and quick way to display the relevant statistics in a single command

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
jupyter_generated_files		jupyter_generated_files
miscnetworks		miscnetworks
MODEL1507180047_url.xml		MODEL1507180047_url.xml
MODEL1507180063_url.xml		MODEL1507180063_url.xml
MODEL1507180070_url.xml		MODEL1507180070_url.xml
MetaNetXShowcase.html		MetaNetXShowcase.html
MetaNetXShowcase.ipynb		MetaNetXShowcase.ipynb
MetabolicNetworkX.py		MetabolicNetworkX.py
README.md		README.md
Report_PT.pdf		Report_PT.pdf
core.html		core.html
ecoli.txt		ecoli.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MetabolicNetworkX¶

A networkX extension allowing biological network analysis and visualization¶

Group 2: Francisco Costa PG32922 | Mário Varela PG37984 | Rui Pires PG38280 | Patrícia Machado PG30377¶

Topological Analysis¶

Degree Distribution¶

Shortest Path Analysis¶

Clustering Coefficient¶

Hubs and Centrality Measures¶

Highest Degree¶

Highest Closeness¶

Highest Betweenness¶

Finally we can display all previous centrality measures combined for a trully informative graph visualization¶

Bulk Analysis¶

About

Releases

Packages

Languages

frmachadoecosta/MetabolicNetworkX

Folders and files

Latest commit

History

Repository files navigation

MetabolicNetworkX¶

A networkX extension allowing biological network analysis and visualization¶

Group 2: Francisco Costa PG32922 | Mário Varela PG37984 | Rui Pires PG38280 | Patrícia Machado PG30377¶

Topological Analysis¶

Degree Distribution¶

Shortest Path Analysis¶

Clustering Coefficient¶

Hubs and Centrality Measures¶

Highest Degree¶

Highest Closeness¶

Highest Betweenness¶

Finally we can display all previous centrality measures combined for a trully informative graph visualization¶

Bulk Analysis¶

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages