This case study delivers a system built on Amazon Web Services using Apache Spark, Java, and Python. The system completes the objectives defined in the Objectives section, using the given data to generate loan reports.
- Make sure a default S3 bucket is created. For this case study, an S3 bucket named `loan-data-bucket-aws` was created.
- Make sure you are using the right region. For this case study, the `us-east-1` region is used. Note that if you use a different region, you must change the region name accordingly in the code and in the links given in this case study.
- Make sure Maven is installed and up to date on your computer. It is used to compile the Java code and build the JAR files.
- Make sure a Kinesis Data Firehose delivery stream is created and has access to your S3 bucket. For this case study, a delivery stream named `Loan-Data-Loader` was created. Its S3 compression is `GZIP` and its source is configured as `Direct PUT or other sources`.
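The console-created stream above can also be provisioned programmatically. Below is a minimal sketch of building the `create_delivery_stream` arguments with boto3; the IAM role ARN is a placeholder you would replace with a role that can write to the bucket:

```python
def delivery_stream_request(stream_name, bucket_arn, role_arn):
    """Build create_delivery_stream arguments matching the settings above:
    a Direct PUT source and GZIP compression into the S3 bucket."""
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # console: "Direct PUT or other sources"
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,          # placeholder: IAM role with S3 write access
            "BucketARN": bucket_arn,
            "CompressionFormat": "GZIP",  # matches the stream's S3 compression setting
        },
    }

# With valid credentials, the request can be sent as:
#   import boto3
#   firehose = boto3.client("firehose", region_name="us-east-1")
#   firehose.create_delivery_stream(**delivery_stream_request(
#       "Loan-Data-Loader",
#       "arn:aws:s3:::loan-data-bucket-aws",
#       "arn:aws:iam::123456789012:role/firehose-delivery-role"))  # placeholder role
```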
- Using the LendingClub public loan data, two reports are generated with AWS Kinesis Firehose, S3, and EMR.
- CSV data is sent to Kinesis Firehose by a Firehose client application and written to an S3 bucket in gzip format.
- A Spark application generates the two reports described below, in the desired format, in the same bucket under the `report_one` and `report_two` directories.
- The Spark application reads data from the S3 bucket and runs on an EMR cluster; the cluster is configured to auto-terminate after the Spark application finishes.
1. According to the given yearly income ranges, {<40k, 40-60k, 60-80k, 80-100k, >100k}, the application generates a report containing the average loan amount and the average loan term in months for each of these five income ranges. Each line of the result file looks like "income range, avg amount, avg term".
2. For loans that are fully funded and have loan amounts greater than $1000, the application extracts the fully paid amount rate for every borrower loan grade. Each line of the result file looks like "credit grade,fully paid amount rate", e.g. "A,%95".
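The bucketing behind report 1 can be sketched in plain Python (the actual aggregation runs in Spark). The boundary convention — which range an income of exactly 40k falls into — is an assumption, since the ranges above only state `<40k, 40-60k, …`:

```python
def income_range(annual_income):
    """Map a yearly income (in dollars) to one of the five report ranges.
    Boundaries are assumed half-open: [40k, 60k) -> "40-60k", etc."""
    if annual_income < 40_000:
        return "<40k"
    if annual_income < 60_000:
        return "40-60k"
    if annual_income < 80_000:
        return "60-80k"
    if annual_income < 100_000:
        return "80-100k"
    return ">100k"

def report_one(rows):
    """rows: iterable of (annual_income, loan_amount, term_months) tuples.
    Returns {income_range: (avg_amount, avg_term)}."""
    sums = {}
    for income, amount, term in rows:
        key = income_range(income)
        total_amount, total_term, n = sums.get(key, (0.0, 0.0, 0))
        sums[key] = (total_amount + amount, total_term + term, n + 1)
    return {k: (a / n, t / n) for k, (a, t, n) in sums.items()}
```

In Spark the same shape is a `groupBy` on the derived range column followed by two `avg` aggregations.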
- Python 3.6
- boto3 (the AWS SDK for Python), for creating the client and writing data to S3
- wget, for downloading csv data initially
- Get the following credentials from your AWS Account Console:
  - `aws_access_key_id`
  - `aws_secret_access_key`
  - `aws_session_token`
- Enter your credentials into `<PROJECT_FOLDER>/modules/AWS-Kinesis-Firehose-Client/credentials.json`.
- Build by running in a terminal: `$ <PROJECT_FOLDER>/modules/AWS-Kinesis-Firehose-Client/build.sh`
- Run in a terminal: `$ <PROJECT_FOLDER>/modules/AWS-Kinesis-Firehose-Client/run.sh`
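The client's core job is sending CSV lines to the `Loan-Data-Loader` stream. Firehose's `PutRecordBatch` accepts at most 500 records per call, so a batching helper is the natural shape; the sketch below is a hedged illustration of that idea, not the project's actual client code:

```python
def batch_records(lines, max_records=500):
    """Group CSV lines into PutRecordBatch-sized chunks of Firehose records.
    Each record's Data must be bytes; a trailing newline keeps rows separate
    inside the gzip objects Firehose writes to S3."""
    batch = []
    for line in lines:
        batch.append({"Data": (line.rstrip("\n") + "\n").encode("utf-8")})
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:
        yield batch

# With credentials configured, each batch can be sent as:
#   import boto3
#   firehose = boto3.client("firehose", region_name="us-east-1")
#   for batch in batch_records(open("loans.csv")):   # "loans.csv" is a placeholder
#       firehose.put_record_batch(DeliveryStreamName="Loan-Data-Loader", Records=batch)
```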
- Java 8 + Scala 2.11
- Spark 2.10
- Amazon S3 SDK for Java
- Java 8
- Go to the folder `<PROJECT_FOLDER>/modules/Spark-Loan-Processing/`.
- Compile by running in a terminal: `$ mvn install`
- After compilation you will see the `<PROJECT_FOLDER>/modules/Spark-Loan-Processing/target/` directory, containing the file `Loan-Data-Report-with-AWS-1.0-SNAPSHOT-jar-with-dependencies.jar`.
- Upload `Loan-Data-Report-with-AWS-1.0-SNAPSHOT-jar-with-dependencies.jar` to your S3 bucket.
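The upload can be done from the console or scripted with boto3's `upload_file`. A small sketch; the helper simply keeps the JAR's file name as the object key at the bucket root, where the EMR step configuration expects it:

```python
import os

def jar_upload_args(local_path, bucket):
    """Derive the (Filename, Bucket, Key) triple for s3_client.upload_file,
    using the JAR's base file name as the object key at the bucket root."""
    return local_path, bucket, os.path.basename(local_path)

# With credentials configured:
#   import boto3
#   args = jar_upload_args(
#       "modules/Spark-Loan-Processing/target/"
#       "Loan-Data-Report-with-AWS-1.0-SNAPSHOT-jar-with-dependencies.jar",
#       "loan-data-bucket-aws")
#   boto3.client("s3").upload_file(*args)
```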
- Get the following credentials from your AWS Account Console:
  - `aws_access_key_id`
  - `aws_secret_access_key`
  - `aws_session_token`
- Create the cluster:
  - Open the AWS Console.
  - Head to: https://console.aws.amazon.com/elasticmapreduce/home?region=us-east-1
  - Click "Create Cluster".
  - Cluster configurations:
    - [Logging]: Enabled
    - [Logging - S3 folder]: `s3://loan-data-bucket-aws/logs/`
    - [Launch mode]: Step execution
    - [Step type]: Spark Application
    - [Step type - Configure]:
      - [Spark-submit options]: `--class loanprocessing.LoanProcessor`
      - [Application location]: `s3://loan-data-bucket-aws/Loan-Data-Report-with-AWS-1.0-SNAPSHOT-jar-with-dependencies.jar`
      - [Arguments] (NOTE: make sure you put quotes around the parameters!): `"aws_access_key_id" "aws_secret_access_key" "aws_session_token"`
    - [Software configuration]: emr-5.31.0
    - [Hardware configuration]: m4.large, 3 instances
    - [Security and access]: DEFAULT
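The same console settings can be expressed as a boto3 `run_job_flow` request. The sketch below builds the request dict without calling AWS; the cluster name is an assumption, and the two default EMR roles stand in for the console's default "Security and access" choice:

```python
def emr_cluster_request(access_key, secret_key, session_token):
    """Build run_job_flow arguments mirroring the console settings above.
    The three credentials are passed through as the Spark step's arguments."""
    jar = ("s3://loan-data-bucket-aws/"
           "Loan-Data-Report-with-AWS-1.0-SNAPSHOT-jar-with-dependencies.jar")
    return {
        "Name": "Loan-Data-Processing",  # assumed cluster name
        "ReleaseLabel": "emr-5.31.0",
        "LogUri": "s3://loan-data-bucket-aws/logs/",
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the step
        },
        "Steps": [{
            "Name": "Spark application",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--class", "loanprocessing.LoanProcessor",
                         jar, access_key, secret_key, session_token],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # console's default security and access
        "ServiceRole": "EMR_DefaultRole",
    }

# With credentials configured:
#   import boto3
#   emr = boto3.client("emr", region_name="us-east-1")
#   emr.run_job_flow(**emr_cluster_request("key-id", "secret", "token"))
```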
- Check the S3 bucket:
  - Wait until the cluster is created and the Spark step finishes.
  - Head to: https://s3.console.aws.amazon.com/s3/buckets/loan-data-bucket-aws/?region=us-east-1
  - You will see the `report_one/` and `report_two/` directories.
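The same check can be scripted by listing the bucket's keys; the helper below is a small sketch that only inspects key prefixes:

```python
def report_prefixes_present(keys):
    """Given the bucket's object keys, check that both report directories exist."""
    return all(any(k.startswith(p) for k in keys)
               for p in ("report_one/", "report_two/"))

# With credentials configured:
#   import boto3
#   resp = boto3.client("s3").list_objects_v2(Bucket="loan-data-bucket-aws")
#   keys = [obj["Key"] for obj in resp.get("Contents", [])]
#   print(report_prefixes_present(keys))
```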
The dataset URL is already defined in the `DATA_URL` property in `<PROJECT_FOLDER>/modules/AWS-Kinesis-Firehose-Client/properties.json`. If the dataset URL is absent, you can find the dataset here; the zip archive contains the raw data in CSV format and an Excel dictionary file explaining the data fields.
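Reading that property might look like the sketch below. The file layout (`{"DATA_URL": "..."}`) is an assumption based only on the property name mentioned above:

```python
import json

def dataset_url(properties_path):
    """Read the DATA_URL property the client uses to fetch the CSV archive.
    Assumes properties.json is a flat JSON object keyed by property name."""
    with open(properties_path) as f:
        return json.load(f)["DATA_URL"]

# The client can then download the archive, e.g. with the wget package
# listed in the requirements:
#   import wget
#   wget.download(dataset_url("modules/AWS-Kinesis-Firehose-Client/properties.json"))
```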