The entire lab is recorded, and you can watch the recording for a better understanding of the lab.
End-to-End Workshop (Recording).
The purpose of this repository is to enable the easy and quick setup of the workshop. Cloudera Data Platform (CDP) has been built from the ground up to support hybrid, multi-cloud data management in support of a Data Fabric architecture. This workshop introduces CDP, with a focus on the data management capabilities that enable the Data Fabric and Data Lakehouse.
In this exercise, we will get stock data from Alpha Vantage, which offers free stock APIs in JSON and CSV formats for real-time and historical stock market data.
- Data ingestion and streaming, provided by Cloudera DataFlow (CDF) and Cloudera Data Engineering (CDE).
- Global data access and persistence, provided by Cloudera Data Warehouse (CDW).
- Data visualization with CDP Data Visualization.
Cloudera DataFlow (CDF) offers a flow-based, low-code development paradigm that aligns best with how developers design, develop, and test data distribution pipelines. With over 450 connectors and processors across the ecosystem of hybrid cloud services, including data lakes, lakehouses, cloud warehouses, and on-premises sources, CDF-PC provides universal data distribution. These data distribution flows can then be version-controlled into a catalog where operators can self-serve deployments to different runtimes.
Cloudera Data Engineering (CDE) is the only cloud-native service purpose-built for enterprise data engineering teams. Building on Apache Spark, Data Engineering is an all-inclusive data engineering toolset that enables orchestration automation with Apache Airflow, advanced pipeline monitoring, visual troubleshooting, and comprehensive management tools to streamline ETL processes across enterprise analytics teams.
Cloudera Data Warehouse (CDW) is a cloud service for creating self-service data warehouses and the underlying compute clusters for teams of business analysts. Data Warehouse is an auto-scaling, highly concurrent, and cost-effective analytics service that ingests data at scale from anywhere, including structured, unstructured, and edge sources. It supports hybrid and multi-cloud infrastructure models by seamlessly moving workloads between on-premises environments and any cloud for reports, dashboards, ad-hoc and advanced analytics, including AI, with consistent security and governance.
CDP Data Visualization enables data engineers, business analysts, and data scientists to explore data quickly and easily, collaborate, and share insights across the data lifecycle, from data ingest to data insights and beyond. Delivered natively as part of Cloudera Data Platform (CDP), Data Visualization provides a consistent and easy-to-use data visualization experience with intuitive and accessible drag-and-drop dashboards and custom application creation.
Below are the high-level steps for what we will be doing in the workshop.
(1) Get an Alpha Vantage key to be used in Cloudera DataFlow (CDF) to collect stock data (IBM, AMZN, MSFT, GOOGL).
(2) Create the CDF flow and run it to ingest data into S3.
(3) Create an Iceberg table using Cloudera Data Warehouse (CDW/Hue).
(4) Create a CDE job and run it to ingest data into the Iceberg table.
(5) Use Cloudera Data Viz to create a simple dashboard on the Iceberg table.
(6) Run the CDE job with an updated ticker list (NVDA).
(7) Use and test the Iceberg time travel features.
- Laptop with a supported OS (Windows 7 is not supported) or MacBook.
- A modern browser: Google Chrome (IE, Firefox, and Safari are not supported).
- Wi-Fi internet connection.
- Git installed (optional).
Please complete only one of the following two steps: Step 1(a) or Step 1(b). Follow Step 1(a) if you have Git installed on your machine; otherwise, follow Step 1(b).
Your instructor will guide you through this.
(1) Credentials: Participants must enter their First Name, Last Name, and Company details and make a note of the corresponding Workshop Login Username, Workshop Login Password, and CDP Workload User to be used in this workshop.
(2) Workshop login: Using the details from the previous step, make sure you are able to log in here.
You can get the workshop project by cloning this GitHub repository: e2e-cdp-alphavantage repository
git clone https://github.com/DashDipti/e2e-cdp-alphavantage
Scroll up the page here, click on <> Code, and then choose the option Download ZIP.
- Go to the Alpha Vantage website.
- Choose the link GET YOUR FREE API KEY TODAY.
- Choose 'Student' for the question 'Which of the following best describes you?'.
- Enter your own organisation name for the question 'Organization (e.g. company, university, etc.):'.
- Enter your email address for the question 'Email:' (Note: please enter a personal email id and not the workshop email id).
- Click on GET FREE API KEY.
You should see a message like: 'Welcome to Alpha Vantage! Your dedicated access key is: YXXXXXXXXXXXXXXE. Please record this API key at a safe place for future data access.'
Please use the login URL: Workshop login. Enter the Workshop Login Username and Workshop Login Password that you obtained as part of Step 0. (Note that your Workshop Login Username would be something like wuser00@workshop.com and not just wuser00.)
You should see the following home page of CDP Public Cloud.
You will need to define your workload password, which will be used to access non-SSO interfaces. You may read more about it here: Non-SSO interfaces. Please keep it with you. If you forget it, you can repeat this process and define another one.
- Click on your user name (e.g. wuser00@workshop.com) at the lower left corner.
- Click on the Profile option.
- Click the option Set Workload Password.
- Enter a suitable Password and Confirm Password.
- Click the button Set Workload Password.
Check that you got the message Workload password is currently set, or alternatively, look for a message next to Workload Password which says (Workload password is currently set).
Click on the Home option on the top left corner to go to the landing page.
Click on the DataFlow icon as shown in the image below.
- On the left menu, click the Catalog option.
- On the top right corner, click the Import Flow Definition button.
Fill in the following parameters:
- Flow Name: (user)-stock-data. Depending on your user name it should be something like wuser00-stock-data.
- NiFi Flow Configuration: upload the file Stocks_Intraday_Alpha_Template.json. (Note: you downloaded this file in Step 1(a) or Step 1(b), depending on what you chose initially.)
Click Import.
The new catalog entry has been added. Type its name into the filter so that you only see the one you created and not the others, for example wuser00-stock-data.
Now let’s deploy it.
Click on the small arrow towards the right of the catalog entry you just created, then click the Deploy button.
Name the deployment (user)_stock_dataflow. Depending on your user name it should be something like wuser00_stock_dataflow.
Make sure that the right Target Environment is selected. Click Next.
On the next screen, leave the settings at their defaults and click Next.
Fill in the flow parameters:
- CDP_Password: your CDP workload password.
- CDP_User: your user. Depending on your user name it should be something like wuser00.
- S3 Path: stocks
- api_alpha_key: your Alpha Vantage key.
- stock_list: IBM, GOOGL, AMZN, MSFT
Click Next.
Choose the Extra Small size. Slide the button to the right to Enable Auto scaling and let the min nodes be 1 and the max nodes be 3. Leave the remaining settings at their defaults. Click Next.
You can define KPIs with regard to what has been specified in your dataflow, but we will skip this for now. Click Next.
Click Deploy to launch the deployment.
The deployment will be initiated. Monitor the deployment and look for the status Good Health. The dataflow is up and running when you see the green tick and the message Good Health against the dataflow name. It will take about 7 minutes before you see the green tick. Notice the Event History; approximately 8 steps happen after the flow deployment, and you might want to observe those.
After the successful deployment we will start receiving stock information in our bucket. If you want, you can check your bucket under the path s3a://stc2-eet1/user/(username)/stocks/new.
Note: You don't have access to the S3 bucket. The instructor will confirm whether the data files have been received after your flow runs.
Note: A successful deployment DOESN'T mean that the flow logic ran successfully, so we need to make sure that the flow ran without errors.
Proceed to the next section to verify that the flow ran successfully without any errors, and check with the instructor whether the data has been populated in the S3 bucket.
Click on the blue arrow on the right of your deployed dataflow wuser00_stock_dataflow.
Select Manage Deployment on the top right corner.
In this window, choose Actions -> View in NiFi.
You can see the NiFi data flow that has been deployed from the JSON file. You can click each of the processor groups to go inside and see the flow details. Make sure that there are no errors in the flow. If you see any, please let the instructor know.
At this stage you can suspend this dataflow: go back to Deployment Manager -> Actions -> Suspend flow.
We will add a new stock later and restart it.
When the pop-up appears, click Suspend Flow.
Confirm that the status is Suspended.
Now we are going to create the Iceberg table.
Click on the Home option on the top left corner to go to the landing page.
From the CDP Portal or CDP Menu choose Data Warehouse.
From the CDW Overview window, click the HUE button shown under Virtual Warehouses on the right.
Now you are accessing the SQL editor called HUE (Hadoop User Experience).
Let's select the Impala engine that you will be using for interacting with the database.
On the top left corner select </> and set the Editor to Impala.
Make sure that you see Impala instead of Unified Analytics on top of the area where you write queries.
Create a database using your login, for example wuser00. Replace <user> with your username for the database creation in the command below.
CREATE DATABASE <user>_stocks;
Check the result and notice the message Database has been created.
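Optionally, you can verify the database with a quick standard Impala check in the same editor (replace <user> with your username):
-- Optional check: list databases matching your user's prefix
SHOW DATABASES LIKE '<user>_stocks';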
After creating the database, create an Iceberg table. Replace <user> with your username for the Iceberg table creation in the command below.
CREATE TABLE IF NOT EXISTS <user>_stocks.stock_intraday_1min (
interv STRING,
output_size STRING,
time_zone STRING,
open DECIMAL(8,4),
high DECIMAL(8,4),
low DECIMAL(8,4),
close DECIMAL(8,4),
volume BIGINT)
PARTITIONED BY (
ticker STRING,
last_refreshed STRING,
refreshed_at STRING)
STORED AS ICEBERG;
Check the result and notice the message Table has been created.
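If you want to double-check the table definition before any data is loaded, a quick optional check with standard Impala commands works from the same editor (again, replace <user> with your username):
-- Optional check: confirm the table exists and inspect its columns and partition spec
SHOW TABLES IN <user>_stocks;
DESCRIBE <user>_stocks.stock_intraday_1min;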
Let’s now create our engineering process.
Now we will use Cloudera Data Engineering to pick up the files in object storage that were populated by the DataFlow run above, check whether they contain new data, and insert them into the Iceberg table.
Click on the Home option on the top left corner to go to the landing page.
From the CDP Portal or CDP Menu choose Data Engineering.
Let’s create a job.
Click on Jobs. Make sure that you can see meta-workshop-de at the top. Then click the Create Job button on the right side of the screen.
Note: This page may differ a little bit depending on whether another user created a job before you.
Fill in the following values carefully.
- Job Type*: choose Spark 3.2.3.
- Name*: (user)-StockIceberg. Replace (user) with your username, for example wuser00-StockIceberg.
- Application File: make sure File is selected, then choose the option Select from Resource and select stockdata-job -> stockdatabase_2.12-1.0.jar.
- Main Class: com.cloudera.cde.stocks.StockProcessIceberg
- Arguments: make sure (user) is replaced with your actual username, for example wuser00_stocks for the first argument and wuser00 for the last one (check the next screenshot to confirm):
  (user)_stocks
  s3a://stc2-eet1/
  stocks
  (user)
Click the Create and Run button at the bottom. (There is no screenshot for this step.)
Note: It might take about 3 minutes, so it's okay to wait until it's done.
This application will:
- check for new files in the new directory;
- create a temp table in Spark, cache it, and identify duplicated rows (in case NiFi loaded the same data again);
- MERGE INTO the final table, inserting new data or updating rows that already exist (a sketch of this step follows below);
- archive the files in the bucket.
After execution, the processed files will be in your bucket under a path of the form processed<date>/.
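For reference, below is a rough sketch of what the merge step could look like if expressed in Spark SQL against the Iceberg table. The temporary view name (new_stock_data) and the join keys are assumptions made for illustration only; the actual logic is implemented inside stockdata-job -> stockdatabase_2.12-1.0.jar and may differ.
-- Hypothetical sketch of the merge performed by the CDE job (Spark SQL on Iceberg).
-- 'new_stock_data' is an assumed temporary view holding the deduplicated rows read from
-- the stocks/new path; the real view name, join keys, and columns live in the provided jar.
MERGE INTO <user>_stocks.stock_intraday_1min t
USING new_stock_data s
ON t.ticker = s.ticker AND t.refreshed_at = s.refreshed_at
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;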
You don't have access to this path, but the instructor does. The next section is optional.
Note: Before moving ahead with this section, make sure that the CDE job ran successfully. Go to the Job Runs option in the left pane and look for the job that you just ran. It should have a green tick next to its name.
We will now create a simple dashboard using Cloudera Data Viz.
Click on the Home option on the top left corner to go to the landing page.
From the CDP Portal or CDP Menu choose Data Warehouse.
You will reach the Overview page.
In the menu on the left choose Data Visualization.
Look for meta-workshop-dataviz, then click the Data VIZ button on the right.
You will see the following window. Choose DATA on the upper menu bar, next to the HOME, SQL, and VISUALS options.
Click the meta-workshop option in the left pane and then click the NEW DATASET option at the top.
Replace (user) with your username wherever it is applicable.
- Dataset title: (user)_dataset
- Dataset Source: From Table
- Select Database: (user)_stocks
- Select Table: stock_intraday_1min
Click CREATE.
Let's drag the following attribute and measure from the DATA section on the right into the Dashboard Designer, and then click 'REFRESH THE VISUAL':
- Dimensions -> ticker: move it to Visuals -> Dimensions.
- Measures -> #volume: move it to Visuals -> Measures.
Then under VISUALS choose Packed Bubbles.
Make it public by changing the option from PRIVATE to PUBLIC. Save it by clicking the SAVE button at the top. You have succeeded in creating a simple dashboard. Now, let's query our data and explore the time-travel and snapshot capabilities of Iceberg.
Apache Iceberg is an open table format, originally designed at Netflix to overcome the challenges faced with existing data lake table formats like Apache Hive.
The design of Apache Iceberg is different from that of Apache Hive: both the metadata layer and the data layer are managed and maintained directly on storage such as HDFS or S3.
Iceberg uses a file structure (metadata and manifest files) that is managed in the metadata layer. Every commit that adds data is recorded as a snapshot, and the metadata layer maintains the list of snapshots. Additionally, Iceberg supports integration with multiple query engines.
Any update or delete to the data layer creates a new snapshot in the metadata layer, chained from the previous latest snapshot. This enables faster query processing, because queries can prune data at the file level rather than at the partition level, and it is also what makes time travel possible.
Our example loads the intraday stock data daily, since the free API does not give real-time data. However, we can change the Cloudera DataFlow parameter to add one more ticker, and we have scheduled the CDE process to run hourly. After this we will be able to see the new ticker information in the dashboard and also perform time travel using Iceberg!
Click on the Home option on the top left corner to go to the landing page.
From the CDP Portal or CDP Menu choose Data Warehouse.
From the CDW Overview window, click the HUE button shown under Virtual Warehouses on the right. Make sure that the correct Virtual Warehouse is selected; in this case it is meta-workshop-ww.
Now you are accessing the SQL editor called HUE.
Let's select the Impala engine that you will be using for interacting with the database.
On the top left corner select </> and set the Editor to Impala.
Make sure that you see Impala instead of Unified Analytics on top of the area where you write queries.
Let's see the Iceberg table history. Replace <user> with your username, for example wuser00.
DESCRIBE HISTORY <user>_stocks.stock_intraday_1min;
Copy the snapshot_id and use it in the following Impala query. Replace <user> with your username, for example wuser00.
SELECT ticker, count(*)
FROM <user>_stocks.stock_intraday_1min
FOR SYSTEM_VERSION AS OF <snapshot_id>
GROUP BY ticker;
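Impala can also time travel by timestamp on Iceberg tables using FOR SYSTEM_TIME AS OF. As a minimal illustration, the query below uses now(), which simply returns the current state of the table; any timestamp at or after the first snapshot would work as well.
-- Time travel by timestamp instead of snapshot id (replace <user> with your username)
SELECT ticker, count(*)
FROM <user>_stocks.stock_intraday_1min
FOR SYSTEM_TIME AS OF now()
GROUP BY ticker;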
We shall load new data, and this time we will include an additional stock ticker: NVDA.
Go to Cloudera DataFlow and look for the dataflow that you created earlier based on your user name, for example wuser00_stock_dataflow. It should be in the suspended state if you suspended it towards the end of the 'Step 5: Create the flow to ingest stock data via API to Object Storage' section of the workshop. Click on the arrow towards the right side of the flow and then click Manage Deployment.
Click on the Parameters tab and then scroll down to the text box where you entered the stock tickers (stock_list). Add NVDA to the list, apply the changes, and resume the flow so that it picks up the new ticker.
The S3 bucket gets updated with new data, and this time it includes the new ticker NVDA as well. You can see this in the S3 bucket as shown here.
Now go to Cloudera Data Engineering from the home page and open Jobs. Choose the CDE job that you created earlier with your username and run it again.
Click on Job Runs on the left to see the status of the run you just initiated. It should succeed.
As CDF has ingested a new stock ticker and CDE has merged those values, a new Iceberg snapshot has been created. Let's check the snapshot history again in Hue, then copy the new snapshot_id and use it in the Impala query below.
DESCRIBE HISTORY <user>_stocks.stock_intraday_1min;
SELECT ticker, count(*)
FROM <user>_stocks.stock_intraday_1min
FOR SYSTEM_VERSION AS OF <new_snapshot_id>
GROUP BY ticker;
Now we can see that this snapshot retrieves the count for the NVDA stock that was added via the CDF stock_list parameter.
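If you would like to compare the two snapshots side by side, a sketch like the following should also work, since the time travel clause is attached to each table reference. Replace <user>, <snapshot_id>, and <new_snapshot_id> with your own values.
-- Sketch: per-ticker row counts in the old snapshot versus the new one
SELECT 'before' AS snapshot_label, ticker, count(*) AS row_cnt
FROM <user>_stocks.stock_intraday_1min
FOR SYSTEM_VERSION AS OF <snapshot_id>
GROUP BY ticker
UNION ALL
SELECT 'after' AS snapshot_label, ticker, count(*) AS row_cnt
FROM <user>_stocks.stock_intraday_1min
FOR SYSTEM_VERSION AS OF <new_snapshot_id>
GROUP BY ticker;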
Now list the data files of the Iceberg table. Replace <user> with your username, for example wuser00.
SHOW FILES IN <user>_stocks.stock_intraday_1min;
Check the Iceberg table metadata. Replace <user> with your username, for example wuser00.
DESCRIBE FORMATTED <user>_stocks.stock_intraday_1min;
Note: Please make sure that the dataflow you created is suspended, otherwise it will keep running continuously.