
This technical workshop showcases how you can quickly take advantage of the Cloudera data services (CDF, CDE, CDW) and Data Visualization to build a modern data fabric.


Cloudera Technical Workshop


Version: 3.1.0 (26 June 2023)


Hands-On Lab (HOL)

Recording

The entire lab has been recorded; you can watch the recording to get a better understanding of the lab:
End-to-End Workshop (Recording).

Introduction

The purpose of this repository is to enable the easy and quick setup of the workshop. Cloudera Data Platform (CDP) has been built from the ground up to support hybrid, multi-cloud data management in support of a Data Fabric architecture. This workshop introduces CDP, with a focus on the data management capabilities that enable the Data Fabric and Data Lakehouse.

Overview

In this exercise, we will get stock data from Alpha Vantage, which offers free stock APIs in JSON and CSV formats for real-time and historical stock market data.

  • Data ingestion and streaming — provided by Cloudera Data Flow (CDF) and Cloudera Data Engineering (CDE).

  • Global data access and persistence — provided by Cloudera Data Warehouse (CDW).

  • Data visualization with CDP Data Visualization.

Cloudera DataFlow (CDF) offers a flow-based, low-code development paradigm that aligns with how developers design, develop, and test data distribution pipelines. With more than 450 connectors and processors across the ecosystem of hybrid cloud services—including data lakes, lakehouses, cloud warehouses, and on-premises sources—CDF-PC provides indiscriminate data distribution. These data distribution flows can then be version-controlled into a catalog where operators can self-serve deployments to different runtimes.

Cloudera Data Engineering (CDE) is the only cloud-native service purpose-built for enterprise data engineering teams. Building on Apache Spark, Data Engineering is an all-inclusive data engineering toolset that enables orchestration automation with Apache Airflow, advanced pipeline monitoring, visual troubleshooting, and comprehensive management tools to streamline ETL processes across enterprise analytics teams.

Cloudera Data Warehouse (CDW) is a cloud service for creating self-service data warehouses and the underlying compute clusters for teams of business analysts. Data Warehouse is an auto-scaling, highly concurrent, and cost-effective analytics service that ingests data at scale from anywhere, including structured, unstructured, and edge sources. It supports hybrid and multi-cloud infrastructure models by seamlessly moving workloads between on-premises and any cloud for reports, dashboards, ad-hoc and advanced analytics, including AI, with consistent security and governance.

CDP Data Visualization enables data engineers, business analysts, and data scientists to explore data quickly and easily, collaborate, and share insights across the data lifecycle, from data ingest to data insights and beyond. Delivered natively as part of Cloudera Data Platform (CDP), Data Visualization provides a consistent and easy-to-use data visualization experience with intuitive and accessible drag-and-drop dashboards and custom application creation.

architecture

High-Level Steps

Below are the high-level steps for what we will be doing in the workshop.
(1) Get an Alpha Vantage key to be used in Cloudera DataFlow (CDF) to collect stock data (IBM, AMZN, MSFT, GOOGL).
(2) Create the CDF flow and run it to ingest data into S3.
(3) Create an Iceberg table using Cloudera Data Warehouse (CDW/Hue).
(4) Create a CDE job and run it to ingest data into the Iceberg table.
(5) Use Cloudera Data Visualization to create a simple dashboard on the Iceberg table.
(6) Run the CDE job with an updated ticker list (NVDA added).
(7) Use and test Iceberg time-travel features.

Pre-requisites

  1. Laptop with a supported OS (Windows 7 not supported) or MacBook.

  2. A modern browser - Google Chrome (IE, Firefox, Safari not supported).

  3. Wi-Fi Internet connection.

  4. Git installed (optional).

Please complete only one of the two following steps: Step 1(a) or Step 1(b). Follow Step 1(a) if you have Git installed on your machine; otherwise, follow Step 1(b).

Step 0: Access Details

Your instructor will guide you through this.
(1) Credentials: Participants must enter their First Name, Last Name, and Company details, and make a note of the corresponding Workshop Login Username, Workshop Login Password, and CDP Workload User to be used in this workshop.
(2) Workshop login: Using the details from the previous step, make sure you are able to log in here.

Step 1(a): Get github project

You can get the workshop project by cloning this GitHub repository: e2e-cdp-alphavantage repository

git clone https://github.com/DashDipti/e2e-cdp-alphavantage

Step 1(b): Download repository using GUI

Scroll to the top of this page, click <> Code, and then choose the option Download ZIP.
1

Use any unzip utility to extract the contents of the e2e-cdp-alphavantage.zip file.
2

In the extracted content, make sure there is a file named Stocks_Intraday_Alpha_Template.json, which should be around 65 KB in size. You will need this file in a later step.
3

Step 2: Get Alpha Vantage Key

  1. Go to the Alpha Vantage website.

  2. Click the link GET YOUR FREE API KEY TODAY.

alphaVantagePortal

  1. Choose 'Student' for the question - Which of the following best describes you?

  2. Enter your organisation name for the question - Organization (e.g. company, university, etc.):

  3. Enter your email address for the question - Email: (Note: please enter a personal email address, not the workshop email address.)

  4. Click on GET FREE API KEY.

claimApiKey

You should see a message like: 'Welcome to Alpha Vantage! Your dedicated access key is: YXXXXXXXXXXXXXXE. Please record this API key in a safe place for future data access.'

getKey
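If you want to sanity-check your key before using it in CDF, the sketch below shows one way to call the Alpha Vantage intraday endpoint directly. This is not part of the workshop flow; the query parameters follow the public Alpha Vantage documentation for TIME_SERIES_INTRADAY, and the object name and key value are placeholders.

import scala.io.Source

object AlphaVantageCheck {
  def main(args: Array[String]): Unit = {
    // Placeholder key: substitute the key you just received from Alpha Vantage.
    val apiKey = "YOUR_ALPHA_VANTAGE_KEY"
    val url =
      "https://www.alphavantage.co/query?function=TIME_SERIES_INTRADAY" +
        s"&symbol=IBM&interval=1min&apikey=$apiKey"
    // The endpoint returns the intraday time series for the ticker as JSON.
    val response = Source.fromURL(url).mkString
    // Print just the start of the payload to confirm the key works.
    println(response.take(500))
  }
}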

Step 3: Access CDP Public Cloud Portal

Please use the login URL: Workshop login.
Enter the Workshop Login Username and Workshop Login Password that you obtained in Step 0.
(Note: your Workshop Login Username will be something like wuser00@workshop.com, not just wuser00.)

1

You should be able to get the following home page of CDP Public Cloud.

2

Step 4: Define Workload Password

You will need to define the workload password that will be used to access non-SSO interfaces. You can read more about it here: Non-SSO interfaces. Please keep it handy. If you forget it, you can repeat this process and define a new one.

  1. Click on your user name (Ex: wuser00@workshop.com) at the lower left corner.

  2. Click on the Profile option.

1

  1. Click option Set Workload Password.

  2. Enter a suitable Password and Confirm Password.

  3. Click button Set Workload Password.

2

3


Check that you get the message Workload password is currently set, or alternatively look for the note next to Workload Password that says (Workload password is currently set).

4

Step 5: Create the flow to ingest stock data via API to Object Storage

CDP Portal

Click on the Home option in the top left corner to go to the landing page.

1

Click on DataFlow icon as shown in the image below.

2

Create a new CDF Catalog

  1. On the left menu click on the option -> Catalog.

  2. On the top right corner click the button -> Import Flow Definition.

3

Fill in the following parameters:

Flow Name

(user)-stock-data

Depending upon your user name it should be something like - wuser00-stock-data.

NiFi Flow Configuration

Upload the file Stocks_Intraday_Alpha_Template.json
(Note: you downloaded this file in Step 1(a) or Step 1(b), depending on which option you chose.)

Click Import

4

The new flow definition has been added to the catalog. Type your flow name in the search box so that you only see the one you created, for example wuser00-stock-data.

5

Now let’s deploy it.

Deploy DataFlow

Click on the small arrow to the right of the catalog entry you just created, then click the Deploy button.

6
You will need to select the workshop environment meta-workshop.
Click on Continue →

7
Give a name to this dataflow.
Deployment Name

(user)_stock_dataflow

Depending on your user name it should be something like - wuser00_stock_dataflow.

Make sure that the right Target Environment is selected. Click Next.

8

Leave the settings on this page at their defaults. Click Next.

9

CDP_Password

Your CDP workload password (the one you set in Step 4).

CDP_User

Your workload user; depending on your username it should be something like wuser00.

S3 Path

stocks

api_alpha_key

Your Alpha Vantage key (from Step 2).

stock_list

IBM
GOOGL
AMZN
MSFT
Click Next →.

10
NiFi Node Sizing

Extra Small

Slide the button to the right to enable Auto Scaling, and leave the minimum nodes at 1 and the maximum nodes at 3.

Leave the other parameters at their defaults.

Click Next →.

11

You can define KPIs for what has been specified in your dataflow, but we will skip this for now. Click Next →.

12

Click Deploy to launch the deployment.

13

The deployment will be initiated. Monitor the deployment as it runs and look for the status Good Health.

14

15

The dataflow is up and running; you can confirm this by the green tick and the Good Health message against the dataflow name. It will take about 7 minutes before you see the green tick. Notice the Event History: approximately 8 steps happen after the flow deployment, and you may want to observe them.

15 1
16

After the successful deployment we will start receiving stock information in our bucket. If you want, you can check your bucket under the path s3a://stc2-eet1/user/(username)/stocks/new.
Note: You don’t have access to the S3 bucket. The instructor will confirm whether the data files have been received after your flow runs.
Note: A successful deployment DOESN’T mean that the flow logic ran successfully, so we still need to verify the flow execution.
Proceed to the next section to make sure the flow ran without any errors, and check with the instructor whether the data has been populated in the S3 bucket.

View NiFi DataFlow

Click on the blue arrow to the right of your deployed dataflow, e.g. wuser00_stock_dataflow.

16

Select Manage Deployment on top right corner.

17

On this window, choose Actions -> View in NiFi.

18

19

You can see the NiFi data flow that has been deployed from the JSON file. You can click each of the process groups to go inside and see the flow details. Make sure that there are no errors in the flow; if you see any, please let the instructor know.

20

At this stage you can suspend the dataflow: go back to Deployment Manager -> Actions -> Suspend flow. We will add a new stock ticker later and restart it.

21

When the pop-up appears, click Suspend Flow.

22

Confirm that the status is Suspended.

23

Step 6: Create Iceberg Table

Now we are going to create the Iceberg table. Click on the Home option in the top left corner to go to the landing page.

1

From the CDP Portal or CDP Menu choose Data Warehouse.

2

From the CDW Overview window, click the HUE button shown under Virtual Warehouses on the right.

3

You are now in the SQL editor called HUE (Hadoop User Experience).

4

Let’s select the Impala engine that you will use to interact with the database. In the top left corner, select </> and set the Editor to Impala.

Make sure that you can see Impala instead of Unified Analytics on top of the area where you would write queries.

5

Create a database using your login, for example wuser00. Replace <user> with your username in the command below.

CREATE DATABASE <user>_stocks;

Check the result for the message Database has been created.

6

After creating the database, create an Iceberg table. Replace <user> with your username in the command below.

CREATE TABLE IF NOT EXISTS <user>_stocks.stock_intraday_1min (
  interv STRING,
  output_size STRING,
  time_zone STRING,
  open DECIMAL(8,4),
  high DECIMAL(8,4),
  low DECIMAL(8,4),
  close DECIMAL(8,4),
  volume BIGINT)
PARTITIONED BY (
  ticker STRING,
  last_refreshed STRING,
  refreshed_at STRING)
STORED AS ICEBERG;

Check the result for the message Table has been created.

7

Let’s now create our engineering process.

Step 7: Process and Ingest Iceberg using CDE

Now we will use Cloudera Data Engineering to check the files in object storage that were populated by the DataFlow run above, determine whether they contain new data, and insert them into the Iceberg table.

Click on the Home option in the top left corner to go to the landing page.

1

From the CDP Portal or CDP Menu choose Data Engineering.

2

Let’s create a job. Click on Jobs and make sure that you can see meta-workshop-de at the top.
Then click the Create Job button on the right side of the screen.
Note: This page may look slightly different depending on whether other users have already created jobs.

3

Fill the following values carefully.

Job Type*

Choose Spark 3.2.3

Name*
Replace (user) with your username. For example: wuser00-StockIceberg.

(user)-StockIceberg

Make sure the Application File type selected is File, then choose the option Select from Resource.

Select stockdata-job -> stockdatabase_2.12-1.0.jar

4

Main Class

com.cloudera.cde.stocks.StockProcessIceberg

Fill in the arguments below, replacing (user) with your actual username: for example, wuser00_stocks for the first argument and wuser00 for the last one. Check the next screenshot to confirm.

Arguments

(user)_stocks
s3a://stc2-eet1/
stocks
(user)

5

Click the Create and Run button at the bottom (there is no screenshot for this step).
Note: It might take about 3 minutes, so it’s okay to wait until it’s done.

This application will:

  • Check for new files in the new directory;

  • Create a temporary table in Spark, cache it, and identify duplicate rows (in case NiFi loaded the same data again);

  • MERGE INTO the final table: INSERT new data, or UPDATE rows that already exist;

  • Archive the files in the bucket.

After execution, the processed files will be in your bucket under a directory named in the format processed<date>/.
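For reference, the sketch below illustrates the kind of Spark logic described above. It is a minimal sketch, not the actual source of com.cloudera.cde.stocks.StockProcessIceberg: the input path layout, merge keys, and column names are assumptions based on the NiFi output and the table created in Step 6, and it presumes an Iceberg-enabled Spark session.

import org.apache.spark.sql.SparkSession

object StockMergeSketch {
  def main(args: Array[String]): Unit = {
    // The CDE job receives four arguments: database, bucket, path, and user,
    // e.g. wuser00_stocks s3a://stc2-eet1/ stocks wuser00
    val Array(dbName, bucket, path, user) = args

    val spark = SparkSession.builder().appName("StockMergeSketch").getOrCreate()

    // 1. Read the new JSON files landed by the NiFi flow.
    val incoming = spark.read.json(s"$bucket/user/$user/$path/new/")

    // 2. Cache and de-duplicate, in case NiFi delivered the same data twice.
    val staged = incoming.dropDuplicates().cache()
    staged.createOrReplaceTempView("stock_staging")

    // 3. MERGE the staged rows into the Iceberg table:
    //    update rows that already exist, insert the rest.
    spark.sql(
      s"""MERGE INTO $dbName.stock_intraday_1min t
         |USING stock_staging s
         |ON t.ticker = s.ticker AND t.refreshed_at = s.refreshed_at
         |WHEN MATCHED THEN UPDATE SET *
         |WHEN NOT MATCHED THEN INSERT *""".stripMargin)

    // 4. Archiving the input files to a processed<date>/ directory would follow here.
    spark.stop()
  }
}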

6

You don’t have access to this location, but the instructor does. The next section is optional.

Step 7 (Optional): Checking Logs of CDE Job Run

Click on the Job Name - (user)-StockIceberg (for example, wuser00-StockIceberg). 7

Click on the Run Id. 8

You will reach the Trends option. 9

Click on Logs and go through the various tabs, such as stderr and stdout, to understand the run better. 10

Under the Logs tab, check for the following. In most cases, Processing temp dirs indicates that the job will complete successfully and is in its last stages. 11

Step 8: Create Dashboard using CDP DataViz

Note: Before moving ahead with this section, make sure that the CDE job ran successfully. Go to the Job Runs option in the left pane and look for the job that you just ran. It should have a green tick next to its name.

1

We will now create a simple dashboard using Cloudera Data Viz.

Click on the Home option in the top left corner to go to the landing page.

2

From the CDP Portal or CDP Menu choose Data Warehouse.

3

You will reach the Overview page.

4

In the menu on the left choose Data Visualization. Look for meta-workshop-dataviz. Then click the Data VIZ button on the right.

5

You will land on the following window. Choose DATA on the upper menu bar, next to HOME, SQL, and VISUALS.
6

Click the meta-workshop option in the left pane and then click the NEW DATASET option at the top.

7

Replace (user) with your username wherever it is applicable.
Dataset title

(user)_dataset

Dataset Source

From Table

Select Database

(user)_stocks

Select Table

stock_intraday_1min

Click CREATE.

8

Select "New Dashboard" -> 9 icon next to the Table that you created just now.

10

You’ll land in the following page. 11

From the DATA section on the right, under the Dashboard Designer, drag the following dimension and measure, and then click REFRESH VISUAL.

Dimensions -> ticker

Move it to Visuals -> Dimensions

Measures -> #volume

Move it to Visuals -> Measures

12

Then on 'VISUALS' choose Packed Bubbles.

13 Your visual could be slightly different from the image here.

Make it PUBLIC by changing the option from PRIVATE to PUBLIC, then save it by clicking the SAVE button at the top. You have succeeded in creating a simple dashboard. Now, let’s query our data and explore the time-travel and snapshot capabilities of Iceberg.

Step 9: Query Iceberg Tables in Hue and Cloudera Data Visualization

Step 9(a): For Reading only (Optional): Iceberg Architecture

Apache Iceberg is an open table format, originally designed at Netflix to overcome the challenges faced with existing data lake formats such as Apache Hive.

Its design differs from Apache Hive in that both the metadata layer and the data layer are managed and maintained on object storage such as HDFS or S3.

Iceberg uses a file structure (metadata and manifest files) that is managed in the metadata layer. Each commit is stored as an event in the data layer when data is added, and the metadata layer maintains the list of snapshots. It also integrates with multiple query engines.

Any update or delete to the data layer creates a new snapshot in the metadata layer from the previous latest snapshot and chains the snapshots together. This enables faster query processing, because user queries read data at the file level rather than at the partition level.


iceberg architecture
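For reference (not required for the workshop), the sketch below shows how the same snapshot metadata can be inspected and used for time travel from Spark. It is a minimal sketch under the assumption that an Iceberg-enabled Spark session and catalog are configured; the workshop itself uses the equivalent Impala statements (DESCRIBE HISTORY and FOR SYSTEM_VERSION AS OF) in Step 9(c).

import org.apache.spark.sql.SparkSession

object IcebergSnapshotSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("IcebergSnapshotSketch").getOrCreate()

    // Replace wuser00 with your own username.
    val table = "wuser00_stocks.stock_intraday_1min"

    // The metadata layer exposes the snapshot list as a metadata table.
    spark.sql(s"SELECT snapshot_id, committed_at, operation FROM $table.snapshots").show()

    // Time travel: read the table as of a specific snapshot id.
    val snapshotId = 1234567890L   // placeholder: use a real snapshot_id from the query above
    spark.read
      .option("snapshot-id", snapshotId)
      .format("iceberg")
      .load(table)
      .groupBy("ticker")
      .count()
      .show()
  }
}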

Our example loads the intraday stock data daily, since the free API does not give real-time data. However, we can change the Cloudera DataFlow parameter to add one more ticker, and the CDE process has been scheduled to run hourly. After this we will be able to see the new ticker’s information in the dashboard and also perform time travel using Iceberg!

Step 9(b): Logging into Hue

Click on the Home option in the top left corner to go to the landing page.

1

From the CDP Portal or CDP Menu choose Data Warehouse.

2

From the CDW Overview window, click the HUE button shown under Virtual Warehouses on the right. Make sure that the correct Virtual Warehouse is selected - in this case it is meta-workshop-ww.

3

You are now in the SQL editor, HUE.

4

Let’s select the Impala engine that you will use to interact with the database. In the top left corner, select </> and set the Editor to Impala.

Make sure that you can see Impala instead of Unified Analytics on top of the area where you would write queries.

5

Step 9(c): Iceberg snapshots

Let’s see the Iceberg table history. Replace <user> with your username. For example: wuser00.

DESCRIBE HISTORY <user>_stocks.stock_intraday_1min;


6


Copy the snapshot_id and use it in the following Impala query. Replace <user> with your username. For example: wuser00.

SELECT ticker, count(*)
FROM <user>_stocks.stock_intraday_1min
FOR SYSTEM_VERSION AS OF <snapshot_id>
GROUP BY ticker;


7


Step 9(d): Add a New stock (NVDA)

We will now load new data, this time including an additional stock ticker - NVDA. Go to CDF and find the dataflow that you created earlier. It should be in the Suspended state if you suspended it towards the end of the
Step 5: Create the flow to ingest stock data via API to Object Storage section of the workshop.

Go to the Cloudera DataFlow option and look for the flow that you created earlier based on your username, e.g. wuser00_stock_dataflow. Click on the arrow to the right of the flow and then click Manage Deployment.

8

9

Click on the Parameters tab and then scroll down to the text box where you had entered stock tickers (stock_list).

10

Add the ticker 'NVDA', then click Apply Changes. 11
12

Now, start the flow again by clicking Actions and then Start flow. 13
14
15

The S3 bucket gets updated with new data, this time including the new ticker 'NVDA' as well. You can see this in the S3 bucket as shown here. 16

Now go to Cloudera Data Engineering from the home page, then Jobs. Choose the CDE job that you created earlier with your username. 17

Click the three dots next to the job you created earlier and then click Run Now. 18
19

Click on Job Runs on the left to see the status of the job that you just initiated. It should succeed. 20
21


As CDF has ingested a new stock ticker and CDE has merged those values, a new Iceberg snapshot has been created. Copy the new snapshot_id and use it in the following Impala query.

Step 9(e): Check new snapshot history

Now let’s check the snapshot history again in Hue. Replace <user> with your username.

DESCRIBE HISTORY <user>_stocks.stock_intraday_1min;


22

SELECT ticker, count(*)
FROM <user>_stocks.stock_intraday_1min
FOR SYSTEM_VERSION AS OF <new_snapshot_id>
GROUP BY ticker;


23


Now we can see that this snapshot includes the count for the NVDA ticker that was added to the CDF stock_list parameter.

Show Data Files

Replace <user> with your username. For example: wuser00.

show files in <user>_stocks.stock_intraday_1min;


24


Check the Iceberg table. Replace <user> with your username. For example: wuser00.

describe formatted <user>_stocks.stock_intraday_1min;


25


Note: Please make sure that the dataflow you created is suspended; otherwise it will keep running continuously.
