Serverless Reference Architecture: MapReduce

This serverless MapReduce reference architecture demonstrates how to use AWS Lambda in conjunction with Amazon S3 to build a MapReduce framework that can process data stored in S3.

By leveraging this framework, you can build a cost-effective pipeline to run ad hoc MapReduce jobs. The price-per-query model and ease of use make it very suitable for data scientists and developers alike.

Features

  • Close to "zero" setup time
  • Pay per execution model for every job
  • Cheaper than other data processing solutions
  • Enables data processing within a VPC

Architecture

*Serverless MapReduce architecture diagram*
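The data flow can be seen in a tiny in-memory simulation, assuming the roles described above: the driver fans work out to mapper functions, each mapper writes a partial result, and a reducer coordinator fires the reducer once every mapper output has arrived. This is an illustrative sketch only — S3 and Lambda are replaced by a plain dict and direct calls so the control flow is visible; the real handlers live in `src/python`.

```python
# In-memory sketch of the serverless MapReduce data flow. A dict stands in
# for the S3 job bucket; function calls stand in for Lambda invocations.

bucket = {}  # stands in for the S3 job bucket

def mapper(task_id, lines):
    """Mapper: count occurrences of the first CSV field in this input split."""
    counts = {}
    for line in lines:
        key = line.split(",")[0]
        counts[key] = counts.get(key, 0) + 1
    bucket["task/%d" % task_id] = counts  # partial result -> "S3"

def coordinator(expected):
    """Reducer coordinator: in the real architecture this Lambda is triggered
    by S3 events and starts the reducer once all mapper outputs exist."""
    done = [k for k in bucket if k.startswith("task/")]
    if len(done) == expected:
        reducer(done)

def reducer(task_keys):
    """Reducer: merge all partial results into the final output object."""
    result = {}
    for k in task_keys:
        for key, n in bucket[k].items():
            result[key] = result.get(key, 0) + n
    bucket["result"] = result

splits = [["a,1", "b,2"], ["a,3"]]
for i, split in enumerate(splits):
    mapper(i, split)
    coordinator(expected=len(splits))
print(bucket["result"])  # {'a': 2, 'b': 1}
```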

IAM policies

  • Lambda execution role with
    • S3 read/write access
    • CloudWatch Logs access (logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents)
    • X-Ray write access (xray:PutTraceSegments, xray:PutTelemetryRecords)

See cf_template.yaml, which you can extend as needed.
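As an illustrative sketch (not the exact policy in cf_template.yaml), an inline policy covering the actions above might look like the following; `YOUR-JOB-BUCKET` is a placeholder to narrow the S3 resource:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR-JOB-BUCKET",
        "arn:aws:s3:::YOUR-JOB-BUCKET/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["xray:PutTraceSegments", "xray:PutTelemetryRecords"],
      "Resource": "*"
    }
  ]
}
```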

  • To execute the driver locally, make sure that you configure your AWS profile with the access described in the Quickstart below.

Quickstart: Step by Step

To run the example, you must have the AWS CLI set up. Your credentials must have permission to create and invoke Lambda functions, and to list, read, and write to an S3 bucket.

  1. Start the CloudFormation console and create a new stack using cf_template.yaml. CloudFormation will create:
  • an S3 bucket for the results,
  • the biglambda_role IAM role for AWS Lambda execution with an appropriate inline policy,
  • SSM Parameter Store parameters used by the Lambda functions,
  • (optionally) an AWS Cloud9 IDE environment.
  2. Run the AWS X-Ray daemon locally; otherwise you will not be able to see traces from the local driver in the AWS X-Ray console. (Traces from the Reducer Coordinator Lambda functions will be present either way.)

  3. Run the driver

    $ python driver.py
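Before invoking the mapper Lambdas, the driver groups the input objects in S3 into per-mapper batches. A simplified, illustrative version of that batching step follows — the real logic is in driver.py, and `batch_size` here stands in for the configured number of keys per mapper:

```python
def batch_keys(keys, batch_size):
    """Group S3 object keys into fixed-size batches, one batch per mapper Lambda."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

# e.g. 5 input objects with 2 keys per mapper -> 3 mapper invocations
keys = ["input/part-%04d" % i for i in range(5)]
batches = batch_keys(keys, 2)
print(len(batches))  # 3
```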

AWS Cloud9 IDE

You can select an AWS Cloud9 IDE instance type while creating the CloudFormation stack. By default it is set to "None" (no IDE is created). After the stack is created with an instance type selected, check the Outputs section of the stack description for the Cloud9 IDE URL. The code from this Git repository will already be cloned onto that instance. You will need to install Boto3 and the X-Ray Python SDK by running the following commands in the IDE Bash tab:

$ sudo python -m pip install boto3
$ sudo python -m pip install aws-xray-sdk

Navigate to the code location

$ cd lambda-refarch-mapreduce/src/python

Run the driver

$ python driver.py

If you'd like to run the code directly from the IDE, make sure to update the current working directory (CWD) in the default Runner or create a new Runner.

Note that deleting the CloudFormation stack will also delete the Cloud9 IDE created as part of it.

Modifying the Job

You can modify cf_template.yaml and update the CloudFormation stack.

Outputs

smallya$ aws s3 ls s3://JobBucket/py-bl-1node-2 --recursive --human-readable --summarize

2016-09-26 15:01:17   69 Bytes py-bl-1node-2/jobdata
2016-09-26 15:02:04   74 Bytes py-bl-1node-2/reducerstate.1
2016-09-26 15:03:21   51.6 MiB py-bl-1node-2/result 
2016-09-26 15:01:46   18.8 MiB py-bl-1node-2/task/
….

smallya$ head -n 3 result
67.23.87,5.874290244999999
30.94.22,96.25011190570001
25.77.91,14.262780186000002
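Each result line is plain CSV: a source-IP prefix and its summed ad revenue. As a hypothetical illustration, parsing and ranking the sample lines shown above looks like this (the real result object would be fetched from S3 first, e.g. with `aws s3 cp`):

```python
# Parse "prefix,total" lines like those in the result object above
# and sort by total, descending.
lines = [
    "67.23.87,5.874290244999999",
    "30.94.22,96.25011190570001",
    "25.77.91,14.262780186000002",
]
rows = []
for line in lines:
    prefix, total = line.split(",")
    rows.append((prefix, float(total)))
rows.sort(key=lambda r: r[1], reverse=True)
print(rows[0][0])  # 30.94.22 has the highest total
```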

Cleaning up the example resources

To remove all resources created by this example, do the following:

  1. Delete all objects created by the job from the S3 bucket listed in jobBucket.
  2. Delete the CloudFormation stack created for the job.
  3. Delete the CloudWatch log groups for each of the Lambda functions created by the job.

Languages

  • Python 2.7 (active development)
  • Node.js

The Python version is under active development and feature enhancement.

Benchmark

To compare this framework with other data processing frameworks, we ran a subset of the Amplab benchmark. The table below has the execution time for each workload in seconds:

Dataset

s3n://big-data-benchmark/pavlo/[text|text-deflate|sequence|sequence-snappy]/[suffix].

| S3 Suffix | Scale Factor | Rankings (rows) | Rankings (bytes) | UserVisits (rows) | UserVisits (bytes) | Documents (bytes) |
|-----------|--------------|-----------------|------------------|-------------------|--------------------|-------------------|
| /5nodes/  | 5            | 90 Million      | 6.38 GB          | 775 Million       | 126.8 GB           | 136.9 GB          |

Queries:

  • Scan query (90M rows, 6.36 GB of data)
    • SELECT pageURL, pageRank FROM rankings WHERE pageRank > X (X = {1000, 100, 10})
      • 1a) SELECT pageURL, pageRank FROM rankings WHERE pageRank > 1000
      • 1b) SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100
  • Aggregation query on UserVisits (775M rows, ~127 GB of data)
    • 2a) SELECT SUBSTR(sourceIP, 1, 8), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 8)
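In MapReduce terms, query 2a is a classic map-then-combine aggregation: each mapper emits (SUBSTR(sourceIP, 1, 8), adRevenue) pairs and the reducers sum per key. A small, self-contained sketch of that logic — the record layout here is simplified to two fields; the real UserVisits schema has more columns:

```python
def map_visit(source_ip, ad_revenue):
    # SUBSTR(sourceIP, 1, 8) -> the first 8 characters (SQL SUBSTR is 1-based)
    return source_ip[:8], ad_revenue

def reduce_sum(pairs):
    """Sum ad revenue per source-IP prefix, as the reducers do."""
    totals = {}
    for key, revenue in pairs:
        totals[key] = totals.get(key, 0.0) + revenue
    return totals

visits = [("67.23.87.12", 1.5), ("67.23.87.99", 2.0), ("30.94.22.1", 0.5)]
print(reduce_sum(map_visit(ip, rev) for ip, rev in visits))
# {'67.23.87': 3.5, '30.94.22': 0.5}
```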

NOTE: Only a subset of the queries could be run, as Lambda currently supports a maximum container size of 1536 MB. The benchmark is designed to increase the output size by an order of magnitude across the a, b, c iterations; because the larger outputs don't fit in Lambda memory, we currently can't compute the final output for those iterations.

| Technology            | Scan 1a | Scan 1b | Aggregate 2a |
|-----------------------|---------|---------|--------------|
| Amazon Redshift (HDD) | 2.49    | 2.61    | 25.46        |
| Impala - Disk - 1.2.3 | 12.015  | 12.015  | 113.72       |
| Impala - Mem - 1.2.3  | 2.17    | 3.01    | 84.35        |
| Shark - Disk - 0.8.1  | 6.6     | 7       | 151.4        |
| Shark - Mem - 0.8.1   | 1.7     | 1.8     | 83.7         |
| Hive - 0.12 YARN      | 50.49   | 59.93   | 730.62       |
| Tez - 0.2.0           | 28.22   | 36.35   | 377.48       |
| Serverless MapReduce  | 39      | 47      | 200          |

Serverless MapReduce Cost:

| Scan 1a | Scan 1b | Aggregate 2a |
|---------|---------|--------------|
| 0.00477 | 0.0055  | 0.1129      |

License

This reference architecture sample is licensed under the Amazon Software License.
