didiervega/Multi-Centrality-Index

Multi-centrality index

How to use it and examples

This notebook shows how to use the code for the proposed Multi-Centrality Index method, which was employed for the analysis of keywords in [1]. The code is written in Python 3.

The following common packages are required to run the code:

  • numpy
  • pandas
  • networkx
  • sklearn

import numpy as np
import pandas as pd
import networkx as nx
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler

You can then run the example, which starts from a preloaded matrix of features (centralities) of a previously constructed graph-of-words (network) built with the co-occurrence approach.
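As a rough illustration of the co-occurrence approach, the sketch below builds a small graph-of-words in which two words are linked when they appear within a sliding window of tokens. The function name cooccurrence_graph, the window size, and the toy token list are assumptions made for this example; they are not taken from the repository code.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected co-occurrence graph as {word: {neighbor: count}}.

    Two words are linked when they appear within `window` consecutive
    tokens of each other (an assumed, illustrative definition).
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        # Pair the current word with the next (window - 1) tokens
        for other in tokens[i + 1:i + window]:
            if other != word:
                graph[word][other] += 1
                graph[other][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


# Toy token list, made up for this example
tokens = "david mamet play director play theater".split()
graph = cooccurrence_graph(tokens, window=2)

# Number of distinct neighbors of each word (unnormalized degree)
degree = {w: len(nbrs) for w, nbrs in graph.items()}
```

In this toy input, play has the most distinct neighbors, which is the kind of signal a degree-based centrality would pick up.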

The words of the graph are ranked by their Multi-centrality Index (MCI) value. In this example, the MCI combines the following centrality measures: ['Degree','Pagerank','Eigenvector','StructuralHoles']

Top words are considered the keywords of the text.

Run the code as follows:

python MultiCentralityIndex.py

           Word       MCI
0         MAMET  1.501088
1          PLAY  1.484412
2      DIRECTOR  0.968968
3     ANARCHIST  0.887786
4       THEATER  0.712872
5      PULITZER  0.647991
6        LONDON  0.635058
7         GOOLD  0.619282
8          YEAR  0.605723
9     GLENGARRY  0.572361
10        DEBUT  0.530749
11        DAVID  0.311259
12         YORK  0.225655
13  PRIZEWINNER  0.215347
14        PRIZE  0.208866
 ...

import MultiCentralityIndex as mc
mc.test()

You can also import the code and use it as needed. For example, let's calculate the MCI for the network of coauthorships in network science.

(Figure: the largest component of this network. Extracted from Prof. Newman's website.)

import MultiCentralityIndex as MCI
import networkx as nx

# Create the MCI object
mc = MCI.MCI()

# Load the netscience graph
G = nx.read_gml('netscience.gml', label='label')

# Node degrees, used below to color the nodes
node_degree = [float(G.degree(v)) for v in G]

# Draw the full network
nx.draw_networkx(G, arrows=True, node_size=20, node_color=node_degree, edge_color='grey', alpha=.5, with_labels=False)

Now, let's define the set of centrality measures to be calculated as

setCentralities = ['Degree','Pagerank','Eigenvector','StructuralHoles','Closeness', 'Betweenness']

In this example we are calculating the MCI for a single graph (network). For this, we just call the getMCI_PCA method.

mc.getMCI_PCA(G,setCetralities=setCentralities)[:10]

                Word       MCI
0          NEWMAN, M  1.550027
1        BARABASI, A  1.330304
2           JEONG, H  1.221634
3  PASTORSATORRAS, R  1.043823
4            SOLE, R  1.040874
5      BOCCALETTI, S  0.978564
6          MORENO, Y  0.903129
7           HOLME, P  0.871221
8      CALDARELLI, G  0.808137
9      VESPIGNANI, A  0.807601

Note: In Ref[1], the matrix of features was calculated from a collection of graphs-of-words in a repository, and the first Principal Component was then computed (getPC1 method) from the matrix of features of the entire repository.

Behind the scenes, getMCI_PCA calls the getPC1FromGraph method, which obtains the matrix of centrality measures of the graph (mtxDoc) and calls the getPC1 function to compute the first Principal Component (PC1) of the graph.

For illustration purposes, this is the matrix of features of the graph:

mtxDoc = mc.getMatrixFeaturesGraph(G,setCentralities)
display(mtxDoc)

                 Word  Degree  Pagerank  Eigenvector  StructuralHoles  Closeness  Betweenness
0         ABRAMSON, G  0.0588    0.1398    1.418e-15           0.8871     0.0231     0.000000
1         KUPERMAN, M  0.0882    0.2209    1.410e-15           0.5085     0.0309     0.000071
2          ACEBRON, J  0.1176    0.1425    1.394e-15           0.6565     0.0412     0.000000
3          BONILLA, L  0.1176    0.1425    1.396e-15           0.6565     0.0412     0.000000
4     PEREZVICENTE, C  0.1176    0.1425    1.419e-15           0.6565     0.0412     0.000000
...               ...     ...       ...          ...              ...        ...          ...
1584     MONDRAGON, R  0.0294    0.1425    1.403e-15           0.8805     0.0103     0.000000
1585           ZHU, H  0.0588    0.2196    1.405e-15           0.4027     0.0206     0.000035
1586         HUANG, Z  0.0294    0.1040    1.407e-15           0.8805     0.0137     0.000000
1587           ZHU, J  0.0294    0.1040    1.409e-15           0.8805     0.0137     0.000000
1588    ZIMMERMANN, M  0.0588    0.0731    1.422e-15           0.5524     0.1381     0.000000

1589 rows × 7 columns
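A matrix of features with these columns could be assembled along the following lines using standard networkx functions. This is only a sketch: the helper centrality_features is hypothetical, Burt's constraint (nx.constraint) is assumed to stand in for StructuralHoles, and no rescaling is applied, so the values will not match the table above.

```python
import networkx as nx


def centrality_features(G):
    """Return {node: {measure: value}} for six centrality measures."""
    measures = {
        "Degree": nx.degree_centrality(G),
        "Pagerank": nx.pagerank(G),
        "Eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
        # Assumption: Burt's constraint as the structural-holes measure
        "StructuralHoles": nx.constraint(G),
        "Closeness": nx.closeness_centrality(G),
        "Betweenness": nx.betweenness_centrality(G),
    }
    # Regroup the per-measure dictionaries into one row per node
    return {
        node: {name: values[node] for name, values in measures.items()}
        for node in G
    }


# Small built-in graph used for illustration (not the netscience network)
G_toy = nx.karate_club_graph()
features = centrality_features(G_toy)
```

Each entry of features is one row of the matrix, keyed by node, with one value per centrality measure.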

This is the first Principal Component (PC1) of the graph, computed over all columns (centralities) of the matrix of features:

PC1 = mc.getPC1(mtxDoc)
print(PC1)

     Degree  Pagerank  Eigenvector  StructuralHoles  Closeness  Betweenness
0  0.535797  0.434088     0.202644        -0.494563   0.329897     0.360556

Alternatively, you can restrict the computation to specific centralities:

PC1 = mc.getPC1(mtxDoc,setCentralities=['Degree', 'Pagerank', 'StructuralHoles'])
print(PC1)

     Degree  Pagerank  StructuralHoles
0  0.621633  0.532563        -0.574412
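For readers who want to see what a first principal component looks like outside the library, here is a minimal numpy sketch. It standardizes each column, takes the first right singular vector, and fixes the sign so that the loadings sum to a non-negative value. The function first_principal_component, the sign convention, and the toy matrix are assumptions for illustration and may differ from what getPC1 does internally.

```python
import numpy as np


def first_principal_component(X):
    """Return the PC1 loadings of the columns of X (rows are nodes).

    Assumes no column of X is constant, so the standard deviation is nonzero.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column to zero mean and unit variance
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    pc1 = Vt[0]
    # The sign of a principal axis is arbitrary; make the loadings sum >= 0
    return pc1 if pc1.sum() >= 0 else -pc1


# Toy matrix (made-up values): columns play the role of three centralities
X = np.array([
    [0.9, 0.8, 0.1],
    [0.7, 0.6, 0.3],
    [0.4, 0.5, 0.6],
    [0.1, 0.2, 0.9],
])
pc1 = first_principal_component(X)
```

In this toy matrix the first two columns rise together while the third falls, so the first two loadings come out positive and the third negative, the same kind of sign pattern seen for StructuralHoles in the output above.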

Now we can calculate the MCI of the graph from the previous PC1 by calling the function:

N = 10
centralNodes = mc.getMCI_PCA(G, PC1, N=N)

where N is the number of top-ranked nodes to return. If N = -1, all nodes are returned.

N = 10
MCI = mc.getMCI_PCA(G, PC1, N=N)
display(MCI)

                Word       MCI
0        BARABASI, A  1.091113
1          NEWMAN, M  1.026212
2           JEONG, H  0.829966
3           YOUNG, M  0.623524
4          OLTVAI, Z  0.603774
5      BOCCALETTI, S  0.601977
6            SOLE, R  0.567841
7          KURTHS, J  0.519397
8            ALON, U  0.510426
9  PASTORSATORRAS, R  0.492352
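The top-N convention described above can be sketched in plain Python. The example assumes that a node's score is a weighted sum of its centrality values, with one weight per centrality; the actual scoring inside getMCI_PCA may differ, and rank_nodes, the feature values, and the weights are all hypothetical.

```python
def rank_nodes(features, weights, N=10):
    """Rank nodes by a weighted sum of their centrality values.

    features: {node: {measure: value}}, weights: {measure: weight}.
    N = -1 returns every node; otherwise only the top N are returned.
    """
    scores = {
        node: sum(weights[m] * values[m] for m in weights)
        for node, values in features.items()
    }
    # Highest score first
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if N == -1 else ranked[:N]


# Made-up feature values for three hypothetical nodes
features = {
    "A": {"Degree": 0.9, "StructuralHoles": 0.2},
    "B": {"Degree": 0.5, "StructuralHoles": 0.1},
    "C": {"Degree": 0.2, "StructuralHoles": 0.9},
}
# Made-up weights; a negative weight penalizes high StructuralHoles values
weights = {"Degree": 0.6, "StructuralHoles": -0.5}

top2 = rank_nodes(features, weights, N=2)
everything = rank_nodes(features, weights, N=-1)
```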

The ranking changes with the selected centrality measures: with all six centralities NEWMAN, M ranks first, whereas with only Degree, Pagerank and StructuralHoles BARABASI, A does. For this reason, Ref[1] proposes finding the best subset of centralities for your supervised problem, for instance by applying feature selection methods or correlation analysis.

In unsupervised problems, a good approach could be to select the group of least correlated centrality measures.
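One possible way to pick weakly correlated centralities is a greedy search over the absolute correlation matrix, sketched below with numpy. The function least_correlated_subset, the greedy strategy, and the toy data are assumptions for illustration, not the selection procedure of Ref[1].

```python
import numpy as np


def least_correlated_subset(X, names, k):
    """Greedily pick k columns of X with low mutual absolute correlation."""
    corr = np.abs(np.corrcoef(np.asarray(X, dtype=float), rowvar=False))
    np.fill_diagonal(corr, 0.0)
    # Start from the column with the lowest mean correlation to the others
    chosen = [int(np.argmin(corr.mean(axis=0)))]
    while len(chosen) < k:
        remaining = [j for j in range(len(names)) if j not in chosen]
        # Add the column whose strongest correlation with the chosen set is weakest
        best = min(remaining, key=lambda j: corr[j, chosen].max())
        chosen.append(best)
    return [names[j] for j in chosen]


# Toy data (made up): Degree and Pagerank are nearly collinear
names = ["Degree", "Pagerank", "Closeness"]
X = np.array([
    [1.0, 1.1, 3.0],
    [2.0, 2.0, 1.0],
    [3.0, 3.2, 4.0],
    [4.0, 3.9, 1.0],
    [5.0, 5.1, 5.0],
    [6.0, 6.0, 2.0],
])
selected = least_correlated_subset(X, names, k=2)
```

With this toy data the search keeps only one of the two nearly collinear columns, so the redundant measure does not dominate the resulting index.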

TextMiner:

for keyword and keyphrase extraction

TextMiner is a module that uses the MCI to extract keywords and keyphrases from a single text or a collection of texts. It relies on the modules listed below. Of these, nltk and the spaCy model en_core_web_sm must be installed separately; the others are part of the Python standard library.

os, re, string, nltk, en_core_web_sm, itertools, collections

An example of use follows:

import TextMiner as tm
import en_core_web_sm

Miner = tm.TextMiner(punctuations=None,min_length_sent=7, nlp=en_core_web_sm.load())
Miner.candi_pos = ['NOUN', 'PROPN', 'ADJ']  

# number of returned keywords and keyphrases
N = 10

# Loading three examples of stories written by Edgar Allan Poe
# content[0] is the "The Black Cat" story
from data.content import content

#Case 1. Considering a single text    
keywords = Miner.get_keywords_MCI_from_text(content[0],numberKeyWords=N)


print('CASE 1: \n\t Keywords\n')
print(keywords)  
print('\n \t Keysentences \n')

for sentence in Miner.get_ranked_phrases()[:N]:
    print('----------\n',sentence)

print('\n\t\t================ || =================\n')   

#Case 2. Considering a collection of texts
mtxDoc = Miner.get_mtxDoc_from_collection(content,
                                          setCentralities=['Degree', 
                                                        'Pagerank', 
                                                        'StructuralHoles'])

keywords = Miner.get_keywords_MCI_from_text(content[0],mtxDoc=mtxDoc,
                                            numberKeyWords=N)
print('CASE 2: \n \t Keywords\n')
print(keywords)  
print('\n \t Keysentences \n')

for sentence in Miner.get_ranked_phrases()[:N]:
    print('----------\n',sentence)
print('-------')

CASE 1: 
	 Keywords

       Word       MCI
0      mere  1.933236
1      half  1.337548
2    horror  1.005507
3       cat  0.979326
4     other  0.973901
5     beast  0.971519
6      such  0.847697
7    reason  0.845035
8  terrible  0.844638
9  humanity  0.833198

 	 Keysentences 

----------
 half of horror and half of triumph
----------
 which goes directly to the heart of him who has had frequent occasion to test the paltry friendship and gossamer fidelity of mere man
----------
 i indeed wretched beyond the wretchedness of mere humanity
----------
 mournful and terrible engine of horror and of crime
----------
 and many persons seemed to be examining a particular portion of it with very minute and eager attention
----------
 i experienced a sentiment half of horror
----------
 my next step was to look for the beast which had been the cause of so much wretchedness
----------
 and which constituted the sole visible difference between the strange beast and the one i had destroyed
----------
 this dread was not exactly a dread of physical evil
----------
 that the terror and horror with which the animal inspired me

		================ || =================

CASE 2: 
 	 Keywords

    Word       MCI
0   mere  1.253141
1   half  0.947709
2  beast  0.819257
3    cat  0.787077
4  other  0.687237
5   many  0.650715
6   more  0.650715
7  white  0.638784
8   wall  0.626681
9  sense  0.613148

 	 Keysentences 

----------
 half of horror and half of triumph
----------
 which goes directly to the heart of him who has had frequent occasion to test the paltry friendship and gossamer fidelity of mere man
----------
 i indeed wretched beyond the wretchedness of mere humanity
----------
 this dread was not exactly a dread of physical evil
----------
 i experienced a sentiment half of horror
----------
 there came back into my spirit a half
----------
 that the terror and horror with which the animal inspired me
----------
 and many persons seemed to be examining a particular portion of it with very minute and eager attention
----------
 for no other reason than because he knows he should not
----------
 and which for a long time my reason struggled to reject as fanciful
-------

In CASE 1, the knowledge base is extracted from the text itself. The get_keywords_MCI_from_text method implicitly constructs the matrix of word features from content[0], which is then used to extract the MCI keywords.

In CASE 2, the matrix of word features is constructed from the entire collection of texts by using the get_mtxDoc_from_collection method. The matrix of word features (mtxDoc) is then passed as a parameter for finding the MCI keywords in content[0].

References

You can use this code as it is for academic purposes. If you find it useful for your research, we would appreciate a reference to our work:

[1] Didier A. Vega-Oliveros, Pedro Spoljaric Gomes, Evangelos E. Milios, Lilian Berton. A multi-centrality index for graph-based keyword extraction. Information Processing & Management, V. 56, I. 6, November 2019, 102063. https://doi.org/10.1016/j.ipm.2019.102063
