Hybrid Girvan Newman


Hybrid Girvan Newman. Code for the paper "A Distributed Hybrid Community Detection Methodology for Social Networks."

The proposed methodology is an iterative, divisive community detection process. It combines two network topology features, loose similarity and local edge betweenness, with user content information in order to remove the inter-connection edges and thus reveal the underlying community structure. Although an iterative process may sound computationally demanding, it is not prohibitive in practice: according to the paper's experimental results, these measures are informative enough that only a few iterations are needed to converge to the final community hierarchy.
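To make the divisive idea concrete, here is a minimal single-machine sketch of one split of the classic Girvan-Newman algorithm, which repeatedly removes the edge with the highest edge betweenness until the graph breaks apart. It is not the hybrid method of the paper: it ignores loose similarity and user content, uses global rather than local betweenness, and does not use Spark. The function names and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict, deque


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = defaultdict(float)
    for source in adj:
        # BFS from source, counting shortest paths (sigma) and predecessors.
        dist = {source: 0}
        sigma = defaultdict(float)
        sigma[source] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path dependencies from the farthest nodes back to source.
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += share
                delta[v] += share
    # Every undirected pair was counted once from each endpoint.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Return the connected components as a list of node sets."""
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in group:
                    group.add(w)
                    queue.append(w)
        seen |= group
        result.append(group)
    return result


def split_once(adj):
    """Remove highest-betweenness edges until the component count grows."""
    adj = {v: set(neighbours) for v, neighbours in adj.items()}  # work on a copy
    start_count = len(components(adj))
    while len(components(adj)) == start_count:
        scores = edge_betweenness(adj)
        if not scores:  # no edges left to remove
            break
        v, w = tuple(max(scores, key=scores.get))  # assumes no self-loops
        adj[v].discard(w)
        adj[w].discard(v)
    return components(adj)


# Toy graph (hypothetical): two triangles joined by the bridge 3-4.
toy = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(split_once(toy))
```

On the toy graph the bridge carries every shortest path between the two triangles, so it is removed first and the two triangles become separate communities.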

Implementation last tested with Python 3.6, Apache Spark 2.4.5 and GraphFrames 0.8.0

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.


You need a machine with Python 3.6, Apache Spark 2.4.5, GraphFrames 0.8.0 and a Bash-compatible shell (e.g. zsh) installed. Apache Spark 2.4.5 also requires Java 8.

$ python3.6 -V
Python 3.6.9

$ echo $SHELL

Set the required environment variables

To run the code or the tests, set the following environment variables on your system (or in the spark.env file):

$ export SPARK_HOME="<Path to Spark Home>"
$ export PYSPARK_SUBMIT_ARGS="--packages graphframes:graphframes:0.8.0-spark2.4-s_2.11 pyspark-shell"
$ export JAVA_HOME="<Path to Java 8>"
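Before running anything, it can help to confirm that these three variables are actually visible to Python. The following is a minimal sketch, not part of the project; the function name `missing_env_vars` is an assumption made for illustration.

```python
import os

# The three variables listed above.
REQUIRED_VARS = ("SPARK_HOME", "PYSPARK_SUBMIT_ARGS", "JAVA_HOME")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Passing a plain dictionary as `environ` makes the check easy to try without touching the real environment.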


$ ./bin/pyspark --version

Installing, Testing, Building

All the installation steps are handled by the Makefile.

To skip the individual setup steps below and complete the installation and run the tests in one go, execute the following command:

$ make install server=local

If you executed the previous command, you can skip to the Running the code locally section.

Check the available make commands

$ make help

Clean any previous builds

Create a new venv and install the requirements

$ make create_venv server=local

Run the tests

The tests are located in the tests folder. To run all of them, execute the following command:

$ make run_tests server=local

Build the project locally

To build the project locally, execute the following command:

$ make setup server=local

Running the code locally

To run the code, place the graph whose communities you want to identify under data/input_graphs.
For any new graph, you only need to create a yml file before running the code.

Modifying the Configuration

There are two already configured yml files, confs/quakers.yml and confs/hamsterster.yml, with the following structure:

tag: dev  # Required
  - config:  # The spark settings
      spark.master: local[*]  # Required
      spark.submit.deployMode: client  # Required
      spark_warehouse_folder: data/spark-warehouse  # Required
      spark.ui.port: 4040
      spark.driver.cores: 5
      spark.driver.memory: 8g
      spark.driver.memoryOverhead: 4096
      spark.driver.maxResultSize: 0
      spark.executor.instances: 2
      spark.executor.cores: 3
      spark.executor.memory: 4g
      spark.executor.memoryOverhead: 4096
      spark.sql.broadcastTimeout: 3600
      spark.sql.autoBroadcastJoinThreshold: -1
      spark.sql.shuffle.partitions: 4
      spark.default.parallelism: 4 3600s
      df_data_folder: data/dataframes  # Folder to store the DataFrames as parquets
      spark_warehouse_folder: data/spark-warehouse
      checkpoints_folder: data/checkpoints
      communities_csv_folder: data/csv_data  # Folder to save the computed communities as csvs
  - config:  # All properties required
      name: Quakers
        path: data/input_graphs/Quakers/quakers_nodelist.csv2  # Path to the nodes file
        has_header: true  # Whether they have a header with the attribute names
        delimiter: ','
        encoding: ISO-8859-1
        feature_names:  # You can rename the attribute names (the number should be the same as the original)
          - id
          - Historical_Significance
          - Gender
          - Birthdate
          - Deathdate
          - internal_id
        path: data/input_graphs/Quakers/quakers_edgelist.csv2  # Path to the edges file
        has_header: true  # Whether they have a header with the source and dest
        has_weights: false  # Whether they have a weight column
        delimiter: ','
    type: local
run_options:  # All properties required
  - config:
      cached_init_step: false  # Whether the cosine similarities and edge_betweenness have already been computed
      # See the paper for info regarding the following attributes
      feature_min_avg: 0.33
      r_lvl1_thres: 0.50
      r_lvl2_thres: 0.85
      max_edge_weight: 0.50
      betweenness_thres: 10
      max_sp_length: 2
      min_comp_size: 2 
      max_steps: 30  # Max steps for the algorithm to run if it doesn't converge
      features_to_check:  # Which attributes to take into consideration for the cosine similarities
        - id
        - Gender
output:  # All properties required
  - config:
      logs_folder: data/logs
      save_communities_to_csvs: false  # Whether to save the computed communities in csvs or not
        dimensions: 3  # Dimensions of the scatter plot (2 or 3)
        save_img: true
        folder: data/plots
        steps:  # The steps to plot
          - 0   # The step before entering the main loop
          - -1  # The Last step
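The edge-file options in the configuration above (`has_header`, `has_weights`, `delimiter`) can be read as instructions for parsing a delimited text file. The sketch below shows one plausible interpretation using only the standard library. It is not the project's loader, which works with Spark. The function name `read_edges`, the column order (source, destination, optional weight) and the default weight of 1.0 are assumptions.

```python
import csv
import io


def read_edges(lines, has_header=True, has_weights=False, delimiter=","):
    """Parse an edge list into (source, destination, weight) tuples.

    Assumes the first two columns are source and destination and that the
    third column, when has_weights is true, holds a numeric weight.
    """
    reader = csv.reader(lines, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header row
    edges = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        weight = float(row[2]) if has_weights else 1.0  # assumed default weight
        edges.append((row[0], row[1], weight))
    return edges


# Hypothetical in-memory sample standing in for an edges file.
sample = io.StringIO("Source,Target\na,b\nb,c\n")
print(read_edges(sample, has_header=True, has_weights=False, delimiter=","))
```

To read a real file, an open file handle can be passed in place of the `io.StringIO` sample, opened with the `encoding` value from the configuration.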

The !ENV flag indicates that an environment variable follows. For example, you can set:
logs_folder: !ENV ${LOGS_FOLDER}
You can change the values and environment variable names as you wish. If a yaml variable name is changed, added or deleted, the corresponding changes should be reflected in the Configuration class and in yml_schema.json too.
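As an illustration of what the `${LOGS_FOLDER}` placeholder implies, the following sketch substitutes `${NAME}` patterns in a string with values from the environment. It is an assumption about the general technique, not the project's actual loader, and the function name `expand_env` is hypothetical.

```python
import os
import re

# Matches placeholders of the form ${NAME}.
_PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand_env(value, environ=None):
    """Replace every ${NAME} in value with the matching environment value."""
    environ = os.environ if environ is None else environ

    def replace(match):
        name = match.group(1)
        if name not in environ:
            # Fail loudly rather than silently leaving the placeholder in place.
            raise KeyError("Environment variable {} is not set".format(name))
        return environ[name]

    return _PLACEHOLDER.sub(replace, value)


# A plain dictionary stands in for the real environment here.
print(expand_env("${LOGS_FOLDER}", {"LOGS_FOLDER": "data/logs"}))
```

Raising an error for an unset variable is a design choice in this sketch; a loader could equally fall back to a default value.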

Execution Options

First, make sure you are in the created virtual environment:

$ source venv/bin/activate
$ which python

Now, to run the code, you can either call the main script directly or use the hgn console script.

$ python -h
usage: -c CONFIG_FILE [-d] [-h]

A Distributed Hybrid Community Detection Methodology for Social Networks.

Required Arguments:
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        The configuration yml file

Optional Arguments:
  -d, --debug           Enables the debug log messages
  -h, --help            Show this help message and exit

$ hgn -h
usage: hgn -c CONFIG_FILE [-d] [-h]

A Distributed Hybrid Community Detection Methodology for Social Networks.

Required Arguments:
  -c CONFIG_FILE, --config-file CONFIG_FILE
                        The configuration yml file

Optional Arguments:
  -d, --debug           Enables the debug log messages
  -h, --help            Show this help message and exit
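For readers who want to see how a command-line interface with this help layout can be produced, here is a minimal `argparse` sketch that yields the same two argument groups. It is reconstructed from the help text above and is not taken from the project's source; the function name `build_parser` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser whose help output mirrors the usage text shown above."""
    parser = argparse.ArgumentParser(
        prog="hgn",
        description="A Distributed Hybrid Community Detection Methodology "
                    "for Social Networks.",
        add_help=False,  # -h is re-added below so it lands in the optional group
    )
    required = parser.add_argument_group("Required Arguments")
    required.add_argument("-c", "--config-file", required=True,
                          help="The configuration yml file")
    optional = parser.add_argument_group("Optional Arguments")
    optional.add_argument("-d", "--debug", action="store_true",
                          help="Enables the debug log messages")
    optional.add_argument("-h", "--help", action="help",
                          help="Show this help message and exit")
    return parser


args = build_parser().parse_args(["-c", "confs/quakers.yml", "-d"])
print(args.config_file, args.debug)
```

Disabling the default help flag and re-adding it manually is what allows `-h` to appear under the custom "Optional Arguments" heading.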


It is recommended that you deploy the application to a Spark Cluster.
Please see:

Continuous Integration

CircleCI is used for continuous integration. For more information, check the setup guide.

As before, set the environment variables mentioned above; for any modifications, edit the CircleCI config.


Read the TODO to see the current task list.

Built With


This project is licensed under the GNU License - see the LICENSE file for details.