Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SINr implementation to compute node embeddings from communities #156

Merged

Conversation

nicolasdugue
Copy link
Contributor

@nicolasdugue nicolasdugue commented Feb 29, 2024

As fans of the karateclub package, we propose to add the implementation of SINr, a graph embedding approach based on community detection to the package.
SINr was first published in :
Thibault Prouteau, Victor Connes, Nicolas Dugué, Anthony Perez, Jean-Charles Lamirel, et al.. SINr: Fast Computing of Sparse Interpretable Node Representations is not a Sin!. Advances in Intelligent Data Analysis XIX, 19th International Symposium on Intelligent Data Analysis, IDA 2021. pp.325-337, ⟨10.1007/978-3-030-74251-5_26⟩. ⟨hal-03197434⟩

It allows to produce interpretable node representations and can also compute word embeddings on textual graph data :
Béranger, A., Dugué, N., Guillot, S., & Prouteau, T. (2023, November). Filtering communities in word co-occurrence networks to foster the emergence of meaning. In International Conference on Complex Networks and Their Applications (pp. 377-388).

A complete SINr implementation can be found here : https://github.com/SINr-Embeddings/sinr
But the code we provide in this pull request is enough to train efficiently node embeddings using networkx and scipy.
We would really appreciate our method to be included in the karateclub package that we use a lot !

@LucaCappelletti94
Copy link
Collaborator

Hi @nicolasdugue! Thank you for your pull request.

First of all, if I am not mistaken, erase_base_features is not defined and is never used across the model, what is up with that?
Same thing for workers.

Since this implementation as I understand it is primarily meant for didactical purposes, being it in NetworkX and all, could you kindly extend the comments and explain more extensively things such as:

  • What are the parameters for, especially erase_base_features and gamma and what do they do in your model? Are there optimal values? Is there any correlation between gamma and the number of nodes or the density of the graph (especially for gamma)?
  • How does the algorithm scale? What are the memory requirements for a graph with V nodes and E edges? Please provide both the computational and space complexities for the asymptotic worst case.

Another significant bit is testing. Please extend the test suite and test your implementation. Things such as the erase_base_features error would not have survived testing.

@nicolasdugue
Copy link
Contributor Author

Hi @LucaCappelletti94,
Thanks for your comment, I am glad you answered so fast !

I am sorry for the workers and erase_base_features mistakes !

Regarding your remarks/questions :

  • I tried to add some insights about the gamma parameter. Is it enough ?
  • I extended the test suite and SINr class now survives testing :)
  • Complexity of the algorithm is quasi-linear. It runs actually quite fast, and can scale easily to huge graphs. The only requirement is to be able to store the adjacency matrix and community membership matrix in ram.

I was unclear sorry, it is not included in networkx. But the implementation I provided here is based on networkx louvain community detection. In our package, we mostly rely on networkit.

Do you need some more information/tests ?

@codecov-commenter
Copy link

codecov-commenter commented Mar 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.44%. Comparing base (5e1fe9c) to head (e6064c1).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #156      +/-   ##
==========================================
+ Coverage   97.41%   97.44%   +0.03%     
==========================================
  Files          63       64       +1     
  Lines        2712     2744      +32     
==========================================
+ Hits         2642     2674      +32     
  Misses         70       70              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@LucaCappelletti94
Copy link
Collaborator

Everything seems okay, I see there are a few typos in the comments. If you have the time to fix them I would appreciate it, otherwise I will do it when I get the time.

@nicolasdugue
Copy link
Contributor Author

Hi @LucaCappelletti94,
I tried to fix the typos in the comments, but my english is not great. I did my best but it may not be enough, I am sorry :(
Thank you for your help, I am glad that SINr will be included in the package.

@nicolasdugue
Copy link
Contributor Author

Hi,
I asked someone to proofread me. Comments should now be okay :)

@LucaCappelletti94
Copy link
Collaborator

Thanks, I appreciate it. Later today (or at the latest tomorrow) I will go through the whole thing and I look forward to merge the whole thing.

@LucaCappelletti94 LucaCappelletti94 merged commit e6064c1 into benedekrozemberczki:master Mar 5, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants