fix: Update demos and documentation #57

Merged 21 commits on Jul 16, 2024
Changes from 16 commits
Commits
f142863
update nifi-kafka-druid-water-level-data demo documentation and fix m…
xeniape Jun 24, 2024
74567ac
update stackablectl stacklet list
xeniape Jun 24, 2024
7af1f5a
update stackablectl stacklet list
xeniape Jun 24, 2024
ac98f4c
update nifi-kafka-druid-earthquake-data demo and adjustments for cons…
xeniape Jun 25, 2024
67a2939
update demo/stack resources, airflow-scheduled-job and hbase-hdfs-loa…
xeniape Jun 27, 2024
1876bad
update jupyterhub and logging demo documentation
xeniape Jun 27, 2024
bc0a848
update signal-processing demo and documentation
xeniape Jun 28, 2024
313de87
Update docs/modules/demos/pages/logging.adoc
xeniape Jul 3, 2024
ba11b63
Update docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detec…
xeniape Jul 3, 2024
5911d01
Update docs/modules/demos/pages/jupyterhub-pyspark-hdfs-anomaly-detec…
xeniape Jul 3, 2024
9e66a18
update spark-k8s-anomaly-detection-taxi-data demo documentation
xeniape Jul 4, 2024
8aa35ed
update trino-iceberg demo documentation
xeniape Jul 4, 2024
180ae78
Merge branch 'main' into fix/update-nifi-kafka-druid-documentation
xeniape Jul 4, 2024
249112c
update trino-taxi-data demo documentation
xeniape Jul 4, 2024
9d0297d
update data-lakehouse-iceberg-trino-spark demo and documentation
xeniape Jul 9, 2024
e63aeba
Merge branch 'main' into fix/update-nifi-kafka-druid-documentation
xeniape Jul 9, 2024
4267b72
Update docs/modules/demos/pages/trino-taxi-data.adoc
xeniape Jul 15, 2024
cf075ca
change doc ref
xeniape Jul 15, 2024
d412699
change doc ref
xeniape Jul 15, 2024
ba9940d
change doc ref
xeniape Jul 15, 2024
7e8a5a7
update pvc resources for data-lakehouse-iceberg-trino-spark demo
xeniape Jul 16, 2024
@@ -3857,7 +3857,7 @@
</entry>
<entry>
<key>Remote URL</key>
<value> https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations.json</value>
<value>https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations.json</value>
</entry>
<entry>
<key>disable-http2</key>
@@ -4790,7 +4790,7 @@
</entry>
<entry>
<key>Remote URL</key>
<value> https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations.json</value>
<value>https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations.json</value>
</entry>
<entry>
<key>disable-http2</key>
@@ -6496,7 +6496,7 @@
</entry>
<entry>
<key>Remote URL</key>
<value> https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations.json</value>
<value>https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations.json</value>
</entry>
<entry>
<key>disable-http2</key>
4 changes: 2 additions & 2 deletions demos/demos-v2.yaml
@@ -70,7 +70,7 @@ demos:
supportedNamespaces: ["default"]
resourceRequests:
cpu: 8700m
memory: 29746Mi
memory: 42034Mi
pvc: 75Gi # 30Gi for Kafka
nifi-kafka-druid-water-level-data:
description: Demo ingesting water level data into Kafka using NiFi, streaming it into Druid and creating a Superset dashboard
@@ -91,7 +91,7 @@
supportedNamespaces: ["default"]
resourceRequests:
cpu: 8900m
memory: 30042Mi
memory: 42330Mi
pvc: 75Gi # 30Gi for Kafka
spark-k8s-anomaly-detection-taxi-data:
description: Demo loading New York taxi data into an S3 bucket and carrying out an anomaly detection analysis on it
Binary file added docs/modules/demos/images/logging/tenant.png
4 changes: 2 additions & 2 deletions docs/modules/demos/pages/airflow-scheduled-job.adoc
@@ -98,7 +98,7 @@ continuously:

image::airflow-scheduled-job/airflow_7.png[]

Click on the `run_every_minute` box in the centre of the page and then select `Log`:
Click on the `run_every_minute` box in the centre of the page and then select `Logs`:

[WARNING]
====
@@ -118,7 +118,7 @@ image::airflow-scheduled-job/airflow_10.png[]

Go back to DAG overview screen. The `sparkapp_dag` job has a scheduled entry of `None` and a last-execution time
(`2022-09-19, 07:36:55`). This allows a DAG to be executed exactly once, with neither schedule-based runs nor any
https://airflow.apache.org/docs/apache-airflow/stable/dag-run.html?highlight=backfill#backfill[backfill]. The DAG can
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dag-run.html#backfill[backfill]. The DAG can
always be triggered manually again via REST or from within the Webserver UI.

image::airflow-scheduled-job/airflow_11.png[]
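For the REST route, a minimal sketch against Airflow's stable REST API is shown below. The webserver URL and the
`admin`/`adminadmin` credentials are assumptions for this demo setup; substitute the webserver endpoint from your
`stackablectl stacklet list` output.

[source,python]
----
import requests

# Assumptions: webserver endpoint and demo credentials; adjust to your cluster.
AIRFLOW_URL = "http://<airflow-webserver-endpoint>"
AUTH = ("admin", "adminadmin")

# Trigger one manual run of the sparkapp_dag via the stable REST API.
response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/sparkapp_dag/dagRuns",
    auth=AUTH,
    json={"conf": {}},
)
response.raise_for_status()
print(response.json()["dag_run_id"])
----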
89 changes: 47 additions & 42 deletions docs/modules/demos/pages/data-lakehouse-iceberg-trino-spark.adoc
@@ -34,7 +34,7 @@ $ stackablectl demo install data-lakehouse-iceberg-trino-spark
[#system-requirements]
== System requirements

The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GB RAM and 30GB HDD).
The demo was developed and tested on a kubernetes cluster with 10 nodes (4 cores (8 threads), 20GiB RAM and 30GB HDD).
Instance types that loosely correspond to this on the Hyperscalers are:

- *Google*: `e2-standard-8`
@@ -94,7 +94,7 @@ directly into S3 using the https://trino.io/docs/current/connector/hive.html[Hiv
below-mentioned mechanism.
* *Built-in compaction:* Running table maintenance functions such as compacting smaller files (including deleted files)
into larger files for best query performance is recommended. Iceberg offers out-of-the-box tools for this.
* *Hidden partitioning:* Image you have a table `sales (day varchar, ts timestamp)` partitioned by `day`. Lots of times,
* *Hidden partitioning:* Imagine you have a table `sales (day varchar, ts timestamp)` partitioned by `day`. Lots of times,
users would run a statement such as `select count(\*) where ts > now() - interval 1 day`, resulting in a full table
scan as the partition column `day` was not filtered in the query. Iceberg resolves this problem by using hidden
partitions. In Iceberg, your table would look like `sales (ts timestamp) with (partitioning = ARRAY['day(ts)'])`. The
@@ -112,35 +112,33 @@

To list the installed installed Stackable services run the following command:

// TODO(Techassi): Update console output below

[source,console]
----
$ stackablectl stacklet list
PRODUCT NAME NAMESPACE ENDPOINTS EXTRA INFOS

hive hive default hive 212.227.224.138:31022
metrics 212.227.224.138:30459

hive hive-iceberg default hive 212.227.233.131:31511
metrics 212.227.233.131:30003

kafka kafka default metrics 217.160.118.190:32160
kafka 217.160.118.190:31736

nifi nifi default https https://217.160.120.117:31499 Admin user: admin, password: adminadmin

opa opa default http http://217.160.222.211:31767

superset superset default external-superset http://212.227.233.47:32393 Admin user: admin, password: adminadmin

trino trino default coordinator-metrics 212.227.224.138:30610
coordinator-https https://212.227.224.138:30876

zookeeper zookeeper default zk 212.227.224.138:32321

minio minio default http http://217.160.222.211:32031 Third party service
console-http http://217.160.222.211:31429 Admin user: admin, password: adminadmin
┌───────────┬───────────────┬───────────┬────────────────────────────────────────────────────┬─────────────────────────────────┐
│ PRODUCT ┆ NAME ┆ NAMESPACE ┆ ENDPOINTS ┆ CONDITIONS │
╞═══════════╪═══════════════╪═══════════╪════════════════════════════════════════════════════╪═════════════════════════════════╡
│ hive ┆ hive ┆ default ┆ ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hive ┆ hive-iceberg ┆ default ┆ ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kafka ┆ kafka ┆ default ┆ metrics 217.160.99.235:31148 ┆ Available, Reconciling, Running │
│ ┆ ┆ ┆ kafka-tls 217.160.99.235:31202 ┆ │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ nifi ┆ nifi ┆ default ┆ https https://5.250.180.98:31825 ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ opa ┆ opa ┆ default ┆ ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ superset ┆ superset ┆ default ┆ external-http http://87.106.122.58:32452 ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ trino ┆ trino ┆ default ┆ coordinator-metrics 212.227.194.245:31920 ┆ Available, Reconciling, Running │
│ ┆ ┆ ┆ coordinator-https https://212.227.194.245:30841 ┆ │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ zookeeper ┆ zookeeper ┆ default ┆ ┆ Available, Reconciling, Running │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ minio ┆ minio-console ┆ default ┆ http http://217.160.99.235:30238 ┆ │
└───────────┴───────────────┴───────────┴────────────────────────────────────────────────────┴─────────────────────────────────┘
----

include::partial$instance-hint.adoc[]
@@ -150,7 +148,7 @@
=== Listing Buckets

The S3 provided by MinIO is used as persistent storage to store all the data used. Open the `minio` endpoint
`console-http` retrieved by the `stackablectl stacklet list` command in your browser (http://217.160.222.211:31429 in
`http` retrieved by the `stackablectl stacklet list` command in your browser (http://217.160.99.235:30238 in
this case).

image::data-lakehouse-iceberg-trino-spark/minio_1.png[]
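If you prefer a script over the console, the same buckets can be listed through the S3 API. This is only a sketch:
the endpoint and credentials are assumptions (in-cluster, MinIO is typically reachable as `http://minio:9000`, and the
demo uses `admin`/`adminadmin`).

[source,python]
----
import boto3

# Assumed in-cluster endpoint and demo credentials; adjust to your setup.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="admin",
    aws_secret_access_key="adminadmin",
)

# Print every bucket name; the lakehouse bucket should be among them.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
----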
@@ -168,7 +166,7 @@ Here, you can see the two buckets contained in the S3:

=== Inspecting Lakehouse

Click on the blue button `Browse` on the bucket `lakehouse`.
Click on the bucket `lakehouse`.

image::data-lakehouse-iceberg-trino-spark/minio_3.png[]

@@ -177,7 +175,7 @@ Multiple folders (called prefixes in S3), each containing a different dataset, a

image::data-lakehouse-iceberg-trino-spark/minio_4.png[]

As you can see, the table `house-sales` is partitioned by day. Go ahead and click on any folder.
As you can see, the table `house-sales` is partitioned by year. Go ahead and click on any folder.

image::data-lakehouse-iceberg-trino-spark/minio_5.png[]

@@ -199,7 +197,7 @@ sources are statically downloaded (e.g. as CSV), and others are fetched dynamica
=== View ingestion jobs

You can have a look at the ingestion job running in NiFi by opening the NiFi endpoint `https` from your
`stackablectl stacklet list` command output (https://217.160.120.117:31499 in this case).
`stackablectl stacklet list` command output (https://5.250.180.98:31825 in this case).

[NOTE]
====
@@ -215,17 +213,18 @@ Log in with the username `admin` and password `adminadmin`.
image::data-lakehouse-iceberg-trino-spark/nifi_2.png[]

As you can see, the NiFi workflow consists of lots of components. You can zoom in by using your mouse and mouse wheel.
On the left side are two strands, that
On the left side are three strands, that

. Fetch the list of known water-level stations and ingest them into Kafka.
. Continuously run a loop fetching the measurements of the last 30 for every measuring station and ingesting them into
. Fetch measurements of the last 30 days for every measuring station and ingest them into Kafka.
. Continuously run a loop fetching the measurements for every measuring station and ingesting them into
Kafka.

On the right side are three strands that
On the right side are three strands, that

. Fetch the current shared bike station information
. Fetch the current shared bike station status
. Fetch the current shared bike bike status
. Fetch the current shared bike status

For details on the NiFi workflow ingesting water-level data, please read the
xref:nifi-kafka-druid-water-level-data.adoc#_nifi[nifi-kafka-druid-water-level-data documentation on NiFi].
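To get a feel for what the first water-level strand does, the station list can also be fetched directly from the
public PEGELONLINE REST API (the same Remote URL configured in the NiFi processors). A minimal sketch:

[source,python]
----
import requests

# Station list endpoint used by the NiFi ingestion strand.
STATIONS_URL = "https://www.pegelonline.wsv.de/webservices/rest-api/v2/stations.json"

stations = requests.get(STATIONS_URL, timeout=30).json()
print(f"Fetched {len(stations)} stations")
print(stations[0])  # a single station record with its uuid, name and coordinates
----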
@@ -278,7 +277,7 @@ schema = StructType([ \
])
----

Afterwards, a streaming read from Kafka is started. It reads from our Kafka at `kafka:9090` with the topic
Afterwards, a streaming read from Kafka is started. It reads from our Kafka at `kafka:9093` with the topic
`water_levels_measurements`. When starting up, the job will read all the existing messages in Kafka (read from
earliest) and will process 50000000 records as a maximum in a single batch. As Kafka has retention set up, Kafka records
might age out of the topic before Spark has read the records, which can be the case when the Spark application wasn't
@@ -294,7 +293,7 @@ explanation.
spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka:9092") \
.options(**kafkaOptions) \
.option("subscribe", "water_levels_measurements") \
.option("startingOffsets", "earliest") \
.option("maxOffsetsPerTrigger", 50000000) \
@@ -470,7 +469,7 @@ Trino is used to enable SQL access to the data.
=== Accessing the web interface

Open up the Trino endpoint `coordinator-https` from your `stackablectl stacklet list` command output
(https://212.227.224.138:30876 in this case).
(https://212.227.194.245:30841 in this case).

image::data-lakehouse-iceberg-trino-spark/trino_1.png[]
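The web interface mainly shows cluster activity and running queries; to run SQL from a script instead, a sketch with
the Trino Python client is shown below (host, port and credentials are assumptions derived from the endpoint and the
demo defaults above).

[source,python]
----
import trino

# Assumed coordinator endpoint and demo credentials; adjust to your cluster.
conn = trino.dbapi.connect(
    host="212.227.194.245",
    port=30841,
    user="admin",
    http_scheme="https",
    auth=trino.auth.BasicAuthentication("admin", "adminadmin"),
    # Note: the demo's self-signed certificate may require relaxing TLS verification.
)

cursor = conn.cursor()
cursor.execute("SHOW CATALOGS")
print(cursor.fetchall())
----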

@@ -498,7 +497,7 @@ Here you can see all the available Trino catalogs.
== Superset

Superset provides the ability to execute SQL queries and build dashboards. Open the Superset endpoint
`external-superset` in your browser (http://212.227.233.47:32393 in this case).
`external-http` in your browser (http://87.106.122.58:32452 in this case).

image::data-lakehouse-iceberg-trino-spark/superset_1.png[]

@@ -526,16 +525,16 @@ Another dashboard to look at is `Taxi trips`.

image::data-lakehouse-iceberg-trino-spark/superset_6.png[]

There are multiple other dashboards you can explore on you own.
There are multiple other dashboards you can explore on your own.

=== Viewing Charts

The dashboards consist of multiple charts. To list the charts, select the `Charts` tab at the top.

=== Executing arbitrary SQL statements

Within Superset, you can create dashboards and run arbitrary SQL statements. On the top click on the tab `SQL Lab` ->
`SQL Editor`.
Within Superset, you can create dashboards and run arbitrary SQL statements. On the top click on the tab `SQL` ->
`SQL Lab`.

image::data-lakehouse-iceberg-trino-spark/superset_7.png[]

@@ -544,6 +543,12 @@ On the left, select the database `Trino lakehouse`, the schema `house_sales`, an

image::data-lakehouse-iceberg-trino-spark/superset_8.png[]

[NOTE]
====
This older screenshot shows what the table preview looks like. Currently, there is an https://github.com/apache/superset/issues/25307[open issue]
with previewing Trino tables using the Iceberg connector. This doesn't affect the execution of the following SQL statement.
====

In the right textbox, you can enter the desired SQL statement. If you want to avoid making one up, use the following:

[source,sql]