Building Serverless Data Lakes on AWS

Author: Unni Pillai | Amazon Web Services | Twitter | Linkedin

Learning outcomes from this workshop?

Design serverless data lake architecture
Build a data processing pipeline and Data Lake using Amazon S3 for storing data
Use Amazon Kinesis for real-time streaming data
Use AWS Glue to automatically catalog datasets
Run interactive ETL scripts in an Amazon SageMaker Jupyter notebook connected to an AWS Glue development endpoint
Query data using Amazon Athena & visualize it using Amazon QuickSight

Pre-requisites:

You need to have access to an AWS account with AdminstratorAccess
This lab should be executed in us-east-1 region
Best is to follow links from this guide & open them in new a tab
Run this lab in a modern browser

Part 1: Ingest and Storage

Create S3 Bucket

In this step, we will navigate to S3 Console and create the S3 bucket used throughout this demo.

Login to AWS Console: https://console.aws.amazon.com/console/home?region=us-east-1

Navigate to S3 Console & Create a new bucket in us-east-1 region :

GoTo : https://s3.console.aws.amazon.com/s3/home?region=us-east-1
Click - Create Bucket
- Bucket Name : yourname-datalake-demo-bucket
- Region : US EAST (N. Virginia)
- Click Create (bottom left)
Adding reference data
- Open - yourname-datalake-demo-bucket
  - Click - Create folder
    - New folder called : data
    - Click - Save
- Open - data
  - Click - Create folder
    - New folder : reference_data
    - Click - Save
- Open - reference_data
  - download this file locally : tracks_list.json
  - Click - Upload
    - Click Add files & upload the tracks_list.json file here
    - Click Upload (bottom left)

Create Kinesis Firehose

In this step we will create navigate to Kinesis Console & create a Kinesis Firehose delivery stream to ingest data & store in S3:

GoTo: https://console.aws.amazon.com/kinesis/home?region=us-east-1#/get-started
Click Create Delivery Stream
- Step 1: Name and source
  - Delivery stream name : sg-summit-demo-stream
  - Source : Direct PUT or other sources
  - Click Next
- Step 2: Transform records
  - Record transformation : Disabled
  - Record format conversion: Disabled
  - Click Next
- Step 3: Choose destination
  - Destination : Amazon S3
  - S3 bucket : yourname-datalake-demo-bucket
  - Prefix : data/raw/ (slash / after raw is important, if you miss it Firehose will copy the data in an undesired location)
  - Error prefix : Leave Blank
  - Click Next
- Step 4: Configure settings
  - S3 buffer conditions: Buffer size: 1 (MB)
  - S3 buffer conditions: Buffer interval: 60 (sec)
  - S3 compression : Disabled
  - S3 encryption : Disabled
  - Error logging : Enabled
  - Leave everything else default
  - IAM role : Click on Create new, or Choose
    - Open a new window, leave everything to default
    - Click - Allow
  - Click - Next
- Step 5: Review
  - Review the configuration & make sure its as mentioned above
  - Click - Create delivery stream

Generate Dummy Data

In this step we will configure Kinesis Data Generator to produce fake data and ingest it into Kinesis Firehose

Configure Amazon Cognito for Kinesis Data Generator - In this step we will launch a cloud formation stack that will configure Cognito. This cloudformation scripts launches in Oregon region (No need to change this region)
- Goto : https://console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/new?stackName=Kinesis-Data-Generator-Cognito-User&templateURL=https://s3-us-west-2.amazonaws.com/kinesis-helpers/cognito-setup.json
- Click - Next
- Specify Details:
  - Username - admin
  - Password - choose a alphanumeric password
  - Click - Next
- Options:
  - Click - Next
- Review:
  - Scroll down
  - I acknowledge that AWS CloudFormation might create IAM resources: Check
  - Click - Create
- Refresh your AWS Cloudformation Console
- Wait till the stack status changes to Create_Complete
  - Select the **Kinesis-Data-Generator-Cognito-User **stack
  - GoTo outputs tab : click on the link that says: KinesisDataGeneratorUrl - This will open your Kinesis Data Generator tool
On Amazon Kinesis Data Generator homepage
- Login with your username & password from previous step
- Region: us-east-1
- Stream/delivery stream : sg-summit-demo-stream
- Records per second : 2000
- **Record template : **In the big text area, add the following json template
- Click - Send Data - do not click without pasting the below bit of template in the big text area

Once the tools send ~ 100,000 messages, you can click on - Stop sending data to Kinesis


{
  "uuid": "{{random.uuid}}",
  "device_ts": "{{date.utc("YYYY-MM-DD HH:mm:ss.SSS")}}",
  "device_id": {{random.number(50)}},
  "device_temp": {{random.weightedArrayElement(
    {"weights":[0.30, 0.30, 0.20, 0.20],"data":[32, 34, 28, 40]}
  )}},
  "track_id": {{random.number(30)}},  
  "activity_type": {{random.weightedArrayElement(
        {
            "weights": [0.1, 0.2, 0.2, 0.3, 0.2],
            "data": ["\"Running\"", "\"Working\"", "\"Walking\"", "\"Traveling\"", "\"Sitting\""]
        }
    )}}
}

Validate that data has arrived in S3

After few moments GoTo S3 console:https://s3.console.aws.amazon.com/s3/home?region=us-east-1

Click - yourname-datalake-demo-bucket > Data
There should be a folder called raw created > Open it and keep navigating, you will notice that firehose has dumped the data in S3 using yyyy/mm/dd/hh partitioning

Part 2 : Catalog Data

Create IAM Role

In this step we will navigate to IAM Console & create a new Glue service role, this allows AWS Glue to access data sitting in S3 and create necessary entities in Glue catalog.

GoTo: https://console.aws.amazon.com/iam/home?region=us-east-1#/roles
Click - Create role
- Choose the service that will use this role: Glue
- Click - Next: Permissions
- Search for - AmazonS3FullAccess
  - Select Checkbox
- Search for - AWSGlueServiceRole
  - Select Checkbox
- Click - Next: Tags
  - Leave as default, no changes here
- Click - Next: Review
- Role name: AWSGlueServiceRoleDefault
- make sure that are two policies attached to this role (AmazonS3FullAccess, AWSGlueServiceRole)
- Click - Create role

Create AWS Glue Crawlers

In this step, we will navigate to AWS Glue Console & create glue crawlers to discovery the newly ingested data in S3.

GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1
On the left panel, click on Crawlers > Click on Add Crawler
- Crawler info
  - Crawler name: summitcrawler
  - Click - Next
- Data store
  - Data store: S3
  - Crawl data in: Specified path in my account
  - Include path: s3://yourname-datalake-demo-bucket/data/
  - Click - Next
  - Add another data store : No
  - Click - Next
- IAM Role
  - Choose: Create an IAM role
  - Role Name: AWSGlueServiceRoleDefault
  - Click - Next
- Schedule
  - Frequency: Run on demand
  - Click - Next
- Output
  - Click - Add database
  - Database name: summitdb
  - Click - Create
  - Click - Next
- Review all steps
  - Review the configuration & make sure its as mentioned above
  - Click - Finish
You should see a message : Crawler summitcrawler was created to run on demand.
- Click - Run on demand This will run the crawler
- Wait for few minutes

Verify newly created tables in catalog

Navigate to Glue Catalog & explore the crawled data:

GoTo : https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=databases
- Click - summitdb
Click - Tables in summitdb
- Click - raw
- Look around and explore the schema for your dataset
  - look for the averageRecordSize, recordCount, compressionType

Query ingested data using Amazon Athena

Lets query newly ingested data using Amazon Athena

GoTo : https://us-east-1.console.aws.amazon.com/athena/home?region=us-east-1#query
On the left panel (Database) drop down , select summitdb > select table raw
Click on 3 dots (3 vertical dots) > Select Preview Table
Review the output
In query editor, paste the following query > Click on Run Query

SELECT activity_type,
         count(activity_type)
FROM raw
GROUP BY  activity_type
ORDER BY  activity_type

Part 3 : Transform Data

Create Glue Development Endpoint

In this step you will be creating a glue endpoint to interactively develop Glue ETL scripts using PySpark

GoTo : https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=devEndpoints
Click - Add endpoint
- Development endpoint name - devendpoint1
  - IAM role - AWSGlueServiceRoleDefault
  - Expand - Security configuration.. parameters
    - Data processing units (DPUs): 2 (this affects the cost of the running this lab)
- Click - Next
- Networking screen :
  - Choose - Skip networking information
- Add an SSH public key (Optional)
  - Leave as defaults
  - Click: Next
- Review the settings
  - Click: Finish

It will take close to 10 mins for the new Glue console to spin up.

You have to wait for this step to complete before moving to next step.

Create SageMaker Notebooks (Jupyter) for Glue Dev Endpoints

GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=notebooks
Select tab : Sagemaker notebooks
Click: Create notebook
- Notebook name: notebook1
- Attach to development endpoint: devendpoint1
- Choose: Create an IAM role
- IAM Role: notebook1
- VPC (optional): Leave blank
- Encryption key (optional): Leave blank
- Click: Create Notebook

This will take few minutes, wait for this to finish

Launch Jupyter Notebook

Download and save this file locally on your laptop : summit-techfest-datalake-notebook.ipynb
GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=notebooks
Click - aws-glue-notebook1
Click - Open, This will open a new tab
On Sagemaker Jupyter Notebook
- Click - Upload (right top part of screen)
- Browse and upload summit-techfest-datalake-notebook.ipynb which you downloaded earlier
- Click - Upload to confirm the download
- Click on **summit-techfest-datalake-notebook.ipynb ** to open the notebook
- Make sure it says 'Sparkmagic (PySpark)' on top right part of the notebook, this is the name of the kernel Jupyter will use to execute code blocks in this notebook

Follow the instructions on the notebook - Read and understand the instructions, they explain important Glue concepts

Validate - Transformed / Processed data has arrived in S3

Once the ETL script has ran successfully. console:https://s3.console.aws.amazon.com/s3/home?region=us-east-1

Click - yourname-datalake-demo-bucket > data
There should be a folder called processed-data created here > Open it & ensure that .parquet files are created in this folder.

Part 4 : Analyze

Explore transformed data using Athena

In this step we will analyze the transformed data using Athena

Login to the Amazon Athena Console.

GoTo: https://console.aws.amazon.com/athena/home?region=us-east-1#query
As Athena uses the AWS Glue catalog for keeping track of data source, any S3 backed table in Glue will be visible to Athena.
On the left panel, select 'summitdb' from the dropdown
Run the following query :

SELECT artist_name,
         count(artist_name) AS count
FROM processed_data
GROUP BY  artist_name
ORDER BY  count desc

Explore the Athena UI and try running some queries
This query returns the list of tracks repeatedly played by devices , we will later visualize this using QuickSight

SELECT device_id,
         track_name,
         count(track_name) AS count
FROM processed_data
GROUP BY  device_id, track_name
ORDER BY  count desc

Part 4: Visualize

Setting Up QuickSight

In this step we will visualize it using QuickSight

Login to Amazon Quick Sight Console & complete the registration & sign-up

GoTo: https://us-east-1.quicksight.aws.amazon.com/sn/start
You should see quicksight hompage:

Setting QuickSight Permissions

GoTo: https://us-east-1.quicksight.aws.amazon.com/sn/console/resources?#
Select: Amazon Athena - this enables QuickSight access to Amazon Athena databases
Select: Amazon S3
- Select: yourname-datalake-demo-bucket
- Click: Select buckets
Click: Update

Adding a New Dataset

GoTo: https://us-east-1.quicksight.aws.amazon.com/sn/start
On top right, Click - Manage Data
- Click - New Data Set
- Click - Athena
- New Athena data source
  - Data source name: summitdemo
  - Click - Validate Connection
    - This will check if your QuickSight can access Athena
    - If validation fails - click on your username on top-right >
  - Click - Create data source
- Choose your table:
  - Database: contain sets of tables: select - summitdb
  - Tables: contain the data you can visualize : select - processed_data
  - Click - Select
- Finish data set creation:
  - Select - Directly query your data
  - Click Visualize

Using Amazon Quick Sight to Visualize Our Processed Data

Visualization 1: Heat map of users and tracks they are listening to

In this step, we will create a visualization that show us which users are listening to repetitive tracks

On the bottom-left panel - Visual types
- Hover on icon there to see names of the visualizations
- Click on - Heat Map
On top-left panel - Fields list
- Click - device_id
- Click - track_name
Just above the visualization you should see Field wells : [Rows - device_id] [Columns - track_name]

If you hover on dark blue patches on the heatmap you will see that those particular users are listening to same track repeatedly.

Visualization 2: Tree map of most played Artist Names

In this step we will create a visualization that shows who are the host played artists

On top-left - Click on '+ Add' > Add Visual, this will add a new panel to the right pane
On the bottom-left panel - Visual types
- Hover on icon there to see names of the visualizations
- Click on - Tree Map
On top-left panel - Fields list
- Click - artist_name

Play around and explore Amazon QuickSight Console. Try out filters, other visualization types, etc.

Clean Up

Failing to do this will result in incuring AWS usage charges.

Make sure you bring down / delete all resources created as part of this lab

Resources to delete

Kinesis Firehose Delivery Stream
- GoTo: https://console.aws.amazon.com/firehose/home?region=us-east-1#/
- Delete Firehose: sg-summit-demo-stream
Glue Database
- GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=databases
- Delete Database: summitdb
Glue Crawler
- GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1#catalog:tab=crawlers
- Delete Crawler: summitcrawler
Glue Dev Endpoint
- GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=devEndpoints
- Delete endpoint: devendpoint1
Sagemaker Notebook
- You may wish you download the notebook file locally on your laptop before deleting the notebook)
- GoTo: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=notebooks
- Delete Notebook: aws-glue-notebook1
Delete IAM Role
- GoTo: https://console.aws.amazon.com/iam/home?region=us-east-1#/roles
- Search for AWSGlueServiceRoleDefault
- Delete Role: AWSGlueServiceRoleDefault
Delete S3 bucket
- GoTo: https://s3.console.aws.amazon.com/s3/home?region=us-east-1
- Delete Bucket: yourname-datalake-demo-bucket
Delete Cognito Setup :
- Goto: https://us-west-2.console.aws.amazon.com/cloudformation/home?region=us-west-2#/stacks/
- Click: Kinesis-Data-Generator-Cognito-User
- Click: Actions > DeleteStack
- On confirmation screen: Click: Delete
Close QuickSight account
- GoTo: https://us-east-1.quicksight.aws.amazon.com/sn/admin#permissions
- Click: Unsubscribe
Cognito Userpool
- GoTo: https://us-west-2.console.aws.amazon.com/cognito/users/?region=us-west-2#/

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
img		img
README.md		README.md
summit-techfest-datalake-notebook.ipynb		summit-techfest-datalake-notebook.ipynb
tracks_list.json		tracks_list.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Building Serverless Data Lakes on AWS

Learning outcomes from this workshop?

Pre-requisites:

Part 1: Ingest and Storage

Create S3 Bucket

Create Kinesis Firehose

Generate Dummy Data

Validate that data has arrived in S3

Part 2 : Catalog Data

Create IAM Role

Create AWS Glue Crawlers

Verify newly created tables in catalog

Query ingested data using Amazon Athena

Part 3 : Transform Data

Create Glue Development Endpoint

Create SageMaker Notebooks (Jupyter) for Glue Dev Endpoints

Launch Jupyter Notebook

Validate - Transformed / Processed data has arrived in S3

Part 4 : Analyze

Explore transformed data using Athena

Part 4: Visualize

Setting Up QuickSight

Setting QuickSight Permissions

Adding a New Dataset

Using Amazon Quick Sight to Visualize Our Processed Data

Visualization 1: Heat map of users and tracks they are listening to

Visualization 2: Tree map of most played Artist Names

Clean Up

Resources to delete

About

Releases

Packages

Contributors 2

Languages

unnipillai/techfest-building-serverless-datalake-on-aws

Folders and files

Latest commit

History

Repository files navigation

Building Serverless Data Lakes on AWS

Learning outcomes from this workshop?

Pre-requisites:

Part 1: Ingest and Storage

Create S3 Bucket

Create Kinesis Firehose

Generate Dummy Data

Validate that data has arrived in S3

Part 2 : Catalog Data

Create IAM Role

Create AWS Glue Crawlers

Verify newly created tables in catalog

Query ingested data using Amazon Athena

Part 3 : Transform Data

Create Glue Development Endpoint

Create SageMaker Notebooks (Jupyter) for Glue Dev Endpoints

Launch Jupyter Notebook

Validate - Transformed / Processed data has arrived in S3

Part 4 : Analyze

Explore transformed data using Athena

Part 4: Visualize

Setting Up QuickSight

Setting QuickSight Permissions

Adding a New Dataset

Using Amazon Quick Sight to Visualize Our Processed Data

Visualization 1: Heat map of users and tracks they are listening to

Visualization 2: Tree map of most played Artist Names

Clean Up

Resources to delete

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages