-
Notifications
You must be signed in to change notification settings - Fork 3
Benchmark to study communication and memory-access performance of graphs.
License
sg0/neve
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
*-*-*-*-*-*-*-*-* | | | | | | | | | *-*-*-neve-*-*-*- | | | | | | | | | *-*-*-*-*-*-*-*-* ***** About ***** `neve` is a microbenchmark that distributes a graph (real-world or synthetic) across processes and then performs variable-sized message exchanges (equivalent to the number of ghost vertices shared between processes) between process neighbors and reports bandwidth/latency. Further, neve allows options to shrink a real-world graph (by a percent of ghost vertices shared across processes), study performance of message exchanges for a specific process neighborhood, analyze the impact of an edge-balanced real-world graph distribution, and generate a graph in-memory. neve reports bandwidth/latency, and follows the best practices of existing MPI microbenchmarks. We require the total number of processes to be a power of 2 and total number of vertices to be perfectly divisible by the number of processes when parallel RGG generation options are used. This constraint does not apply to real world graphs passed to neve. neve has an in-memory random geometric graph generator (just like our other mini-application, miniVite[?]) that can be used for weak-scaling analysis. An n-D random geometric graph (RGG) is generated by randomly placing N vertices in an n-D space and connecting pairs of vertices whose Euclidean distance is less than or equal to d. We only consider 2D RGGs contained within a unit square, [0,1]^2. We distribute the domain such that each process receives N/p vertices (where p is the total number of processes). Each process owns (1 * 1/p) portion of the unit square and d is computed as (please refer to Section 4 of miniVite paper[?] for details): d = (dc + dt)/2; where, dc = sqrt(ln(N) / pi*N); dt = sqrt(2.0736 / pi*N) Therefore, the number of vertices (N) passed during execution on p processes must satisfy the condition -- 1/p > d. Unlike miniVite, the edge weights of the generated graph is always 1.0, and not the Euclidean distance between vertices (because we only perform communication in this microbenchmark, and no computation, so edge weights are irrelevant). [?] Ghosh, Sayan, Mahantesh Halappanavar, Antonio Tumeo, Ananth Kalyanaraman, and Assefaw H. Gebremedhin. "miniVite: A Graph Analytics Benchmarking Tool for Massively Parallel Systems." In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 51-56. IEEE, 2018. Please note, the default distribution of graph generated from the in-built random geometric graph generator causes a process to only communicate with its two immediate neighbors. If you want to increase the communication intensity for generated graphs, please use the "-p" option to specify an extra percentage of edges that will be generated, linking random vertices. As a side-effect, this option significantly increases the time required to generate the graph, therefore low values are preferred. The max number of edges that can be added randomly must be less than equal to INT_MAX, at present we don't handle cases in which "-p <percent>" resolves to extra edges more than INT_MAX. We also allow users to pass any real world graph as input. However, we expect an input graph to be in a certain binary format, which we have observed to be more efficient than reading ASCII format files. The code for binary conversion (from a variety of common graph formats) is packaged separately with another software called Vite, which is our distributed-memory implementation of graph community detection. Please follow instructions in Vite README for binary file conversion. Vite could be downloaded from (please don't use the past PNNL/PNL link to download Vite, the following GitHub link is the correct one): https://github.com/Exa-Graph/vite Recently, we have added a non-MPI shared-memory version of this microbenchmark, which is inspired from the STREAM benchmark, except arrays are replaced by graphs. The graph data structure is CSR, and there are two tests: `Neighborhood Scan' and `Neighborhood Sum', both of which scans graph neighborhood (an innate pattern for many graph workloads) and reports average bandwidth and latency. Like the MPI-based neve, these tests can handle both real-world (same binary format) and synthetic graphs (using the RGG generator as discussed above). Please contact the following for any queries or support: Sayan Ghosh, PNNL (sg0 at pnnl dot gov) Related paper (only covers the communication aspects): Ghosh S, Tallent N, Halappanavar M. Characterizing Performance of Graph Neighborhood Communication Patterns. IEEE Transactions on Parallel and Distributed Systems. 2021 Aug 2. https://ieeexplore.ieee.org/abstract/document/9503355 ******* Compile ******* neve is a C++11 header-only library and requires an MPI implementation. It uses MPI Send/Recv and collectives (for synthetic graph generation). Please update the Makefile with compiler flags and use a C++11 compliant compiler of your choice. Invoke `make clean; make` after setting paths to MPI for generating the binary. Use `mpirun` or `mpiexec` or `srun` to execute the code with specific runtime arguments mentioned in the next section. It is also possible to run neve on a network simulator such as SST-Macro, please enable the ENABLE_SSTMACRO shell variable in the Makefile (and pass the path to SST-Macro build), and consult the SST-Macro document for running MPI codes with parameter files. Pass -DPRINT_DIST_STATS while building for printing distributed graph characteristics. [Specific instructions for neve_threads] For the non-MPI build on real-world graphs, please pass a suitable value (equal to the #sockets or NUMA nodes on the system) to the GRAPH_FT_LOAD macro at compile-time, like -DGRAPH_FT_LOAD=4. This is very important for `first touch' purposes, but has no effect when RGG codepath is selected. We are yet to enable first touch When the synthetic graph option is selected (i.e., RGG codepath), since the graph generation part is serial. This would be enabled in a future commit. We have also enabled default parameters such that executing without any arguments will work. STREAM by default initiates an array of 10,000,000 elements, so we have set the default graph (one that would get generated if you just invoke ./neve_threads) to have dimensions such that it would generate about 10M edges. The graph generation part is serial, so large graphs can take significantly longer to generate (but we believe reasonable sized graphs can be generated fairly quickly). Also, our edge data structure consists of a vertex end-point and weight, optionally we support an edge data structure consisting of a head, tail and weight (vertex pair). In order to enable the edge data structure that stores a `vertex pair', -DEDGE_AS_VERTEX_PAIR must be passed (this is option also supports the binary file format of Grappolo*). It is better to keep the other options in the Makefile intact, only changing the OpenMP and other optimization flags depending on the compiler. Please check the code to review other macros of interest. [*] https://github.com/Exa-Graph/grappolo ***************** Execution options ***************** E.g.: mpiexec -n 2 bin/./neve_mpi -f karate.bin -w -t 0 -z 5 mpiexec -n 4 bin/./neve_mpi -f karate.bin -w -t 0 -s 2 -g 5 mpiexec -n 2 bin/./neve_mpi -l -n 100 -w mpiexec -n 2 bin/./neve_mpi -n 100 -t 0 mpiexec -n 2 bin/./neve_mpi -p 2 -n 100 -w bin/./neve_threads -n 64 bin/./neve_threads -f karate.bin bin/./neve_threads -p 2 -n 100 Possible options for MPI version, i.e., neve_mpi (can be combined): 1. -f <bin-file> : Specify input binary file after this argument. 2. -b : Only valid for real-world inputs. Attempts to distribute approximately equal number of edges among processes. Irregular number of vertices owned by a particular process. Increases the distributed graph creation time due to serial overheads, but may improve overall execution time. 3. -n <vertices> : Only valid for synthetically generated inputs. Pass total number of vertices of the generated graph. 4. -l : Use distributed LCG for randomly choosing edges. If this option is not used, we will use C++ random number generator (using std::default_random_engine). 5. -p <percent> : Only valid for synthetically generated inputs. Specify percent of overall edges to be randomly generated between processes. 6. -r <nranks> : This is used to control the number of aggregators in MPI I/O and is meaningful when an input binary graph file is passed with option "-f". naggr := (nranks > 1) ? (nprocs/nranks) : nranks; 7 -w : Report Bandwidth in MB/s[*]. 8. -t <0|1|2> : Report Latency in microseconds[*]. Option '0' uses nonblocking Send/Recv, Option '1' uses MPI_Neighbor_alltoall and Option '2' uses MPI_Neighbor_allgather. 9. -x <bytes> : Maximum data exchange size (in bytes). 10. -m <bytes> : Minimum data exchange size (in bytes). 11. -d <0|1> : Perform work in addition to message exchanges (for latency test, i.e., option `-t <>`) between the process neighbors. `-d 0` invokes kernel that computes max weight per vertex degree and `-d 1` invokes kernel that sums weight per vertex degree. Both the kernels can use OpenMP. 12. -s <PE> : Analyze a single process neighborhood. Performs bidirectional message exchanges between the process neighbors of a particular PE passed by the user. This transforms the neighbor subgraph of a particular PE as a fully connected graph. 13. -g <count> : Specify maximum number of ghosts shared between process neighbors of a particular PE. This option is only valid when -s <PE> is passed (see #11). 14. -z <percent> : Select a percentage of ghost vertices of a real-world graph distributed across processes as actual ghosts. For e.g., lets assume that a graph is distributed across 4 processes, and PE#0 has two neighbors, PE#1 and PE#3 with whom it shares a 100 ghost vertices each. In such a configuration, the b/w test would perform variable-sized message exchanges a 100 times between PE#0 and {PE#1, PE#3}. The -z <percent> option can limit the number of message exchanges by selecting a percentage of the actual the number of ghosts, but maintaining the overall structure of the original graph. Hence, this option can help 'shrink' a graph. Only valid for b/w test, because latency test will just perform message transfers between process neighborhoods. 15. -o <1-??> : Generates comma-separated (default) MPI rank order based on graph distribution across processes - regular (1), weighted (2), regular & ascending order of degrees (3), regular & descending order of degrees (4), weighted & ascending order of degrees (5), weighted & descending order of degrees (6), weighted & normal distribution of degrees (7), common neighbors & descending order of degrees (8), common neighbors & ascending order of degrees (9), common neighbors & normal distribution of degrees (10) and graph matching (>=11). Default is weighted & ascending order of degrees (5). 16. -i <0-??> : Dump out process graph (default: adjacency), if greater than 0 (e.g., -i 1) is passed, then graph in directed SANDIA Chaco format is returned (-i 3 is for weighted). 17. -a : Perform top-down BFS (similar to Graph500 reference implementation). Build with -DUSE_ALLREDUCE_FOR_EXIT, otherwise it hangs sometimes. 18. -j : Perform delta-stepping SSSP (similar to Graph500 reference implementation). [*]Note: Unless -w or -t <...> is passed, the code will just load the graph and create the graph data structure. This is deliberate, in case we want to measure the overhead of loading a graph and time only the file I/O part (like measuring the impact of different #aggregators through the -r <nranks> option). Possible options for non-MPI version, i.e., neve_threads (can be combined): 1. -f <bin-file> : Specify input binary file after this argument. 2. -n <vertices> : Only valid for synthetically generated inputs. Pass total number of vertices of the generated graph. 3. -l : Use distributed LCG for randomly choosing edges. If this option is not used, we will use C++ random number generator (using std::default_random_engine). 4. -p <percent> : Only valid for synthetically generated inputs. Specify percent of overall edges to be randomly generated between vertices. 5. -h : Print run options.
About
Benchmark to study communication and memory-access performance of graphs.
Topics
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published