Fall 2020
Visit https://masonleon.github.io/largescale-spark-graph-analytics/ for additional project information.
April Gustafson, Mason Leon, Matthew Sobkowski
These components are installed:
- OpenJDK 1.8.0_265
- Scala 2.11.12
- Hadoop 2.9.1
- Spark 2.3.1 (without bundled Hadoop)
- Maven 3.6.3
- AWS CLI (for EMR execution)
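Version mismatches between these components are a common source of build failures. A small sketch for comparing what each tool reports against the versions listed above — the `expect_version` helper is illustrative, not part of the repo:

```shell
#!/bin/sh
# expect_version: print OK/WARN depending on whether a tool's version
# output contains the version this project expects (illustrative helper).
expect_version() {
  tool="$1"; want="$2"; got="$3"
  case "$got" in
    *"$want"*) echo "OK $tool $want" ;;
    *)         echo "WARN $tool expected $want, got: $got" ;;
  esac
}

# On a configured host, feed it live output, e.g.:
# expect_version java   1.8.0_265 "$(java -version 2>&1 | head -n 1)"
# expect_version hadoop 2.9.1     "$(hadoop version | head -n 1)"
# expect_version spark  2.3.1     "$(spark-submit --version 2>&1)"
expect_version spark 2.3.1 "Spark version 2.3.1"   # demo with a canned string
```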
Dataset: https://snap.stanford.edu/data/soc-LiveJournal1.html
To download to the input directory:
```
bash ./data-download.sh
```
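For reference, roughly what `data-download.sh` automates — a sketch only; the exact filenames and steps in the actual script may differ:

```shell
#!/bin/sh
# Sketch of a manual download of the SNAP LiveJournal edge list into input/.
url=https://snap.stanford.edu/data/soc-LiveJournal1.txt.gz
mkdir -p input
archive="input/$(basename "$url")"
# The archive is large; uncomment to actually fetch and unpack:
# curl -L "$url" -o "$archive"
# gunzip -f "$archive"
echo "$archive"
```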
Example ~/.bash_aliases:
```
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=$HADOOP_HOME/hadoop/hadoop-2.9.1
export SCALA_HOME=$SCALA_HOME/scala/scala-2.11.12
export SPARK_HOME=$SPARK_HOME/spark/spark-2.3.1-bin-without-hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
```
Explicitly set JAVA_HOME in $HADOOP_HOME/etc/hadoop/hadoop-env.sh:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
[Optional] Setup Docker Environment
https://docs.docker.com/get-docker/
All of the build & execution commands are organized in the Makefile.
- Unzip the project file.
- Open a command prompt.
- Navigate to the directory where the project files were unzipped.
- Edit the Makefile to customize the environment at the top.
Sufficient for standalone: hadoop.root, jar.name, local.input. Other defaults are acceptable for running standalone.
- Standalone Hadoop:
make switch-standalone
-- set standalone Hadoop environment (execute once)
make local
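The `make local` target presumably wraps a `spark-submit` call along these lines. The jar name, main class, and paths below are illustrative assumptions, not values taken from the Makefile — check the variables at the top of the Makefile for the real ones:

```shell
#!/bin/sh
# Sketch of a local-standalone spark-submit invocation (illustrative values).
jar_name=largescale-spark-graph-analytics.jar   # hypothetical jar name
main_class=example.GraphMain                    # hypothetical main class
# spark-submit --master 'local[*]' --class "$main_class" \
#     "$jar_name" input output
echo "jar: $jar_name class: $main_class"
```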
- Pseudo-Distributed Hadoop: (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation)
make switch-pseudo
-- set pseudo-clustered Hadoop environment (execute once)
make pseudo
-- first execution
make pseudoq
-- later executions, since the namenode and datanode are already running
- AWS EMR Hadoop: (you must configure the emr.* config parameters at the top of the Makefile)
make upload-input-aws
-- only before first execution
make aws
-- check for successful execution with web interface (aws.amazon.com)
make download-output-aws
-- after successful execution & termination
- Docker Jupyter Scala/Spark Almond Notebook: (https://github.com/almond-sh/almond)
make run-container-spark-jupyter-almond
-- run docker container with Scala + Spark kernel for local standalone; copy the token from the terminal and paste it into the browser: http://127.0.0.1:8888/?token=<TOKEN_FROM_TERMINAL>
- Docker Standalone Hadoop/Spark:
make run-container-spark-jar-local
-- run docker container environment with compiled .jar app
make run-container-spark-jar-local 2>&1 | tee logs/logfile.log
-- run docker container environment with compiled .jar app, redirecting standard error and output to a log file
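Returning to the AWS EMR targets above: cluster state can also be checked from the AWS CLI instead of the web interface. A sketch, with the live calls commented out and a placeholder cluster id (not a real value from this project):

```shell
#!/bin/sh
# Sketch: query the EMR cluster state from the CLI.
cluster_id=j-XXXXXXXXXXXXX   # placeholder id
get_state() {
  # Real call (requires a configured AWS CLI):
  # aws emr describe-cluster --cluster-id "$1" \
  #     --query 'Cluster.Status.State' --output text
  echo TERMINATED            # canned value so the sketch runs offline
}
state=$(get_state "$cluster_id")
echo "cluster $cluster_id state: $state"
```

`aws emr list-clusters --active` similarly lists clusters that are still starting or running.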