Many teams need insights as events happen rather than hours later. Typical use cases include SaaS operators exposing real-time metrics for their KPIs, and marketeers who need quick feedback on how their offers or website experiments are performing.
This solution will demonstrate how to build a real-time website analytics dashboard on GCP.
User events / Message bus: provides system decoupling. Pub/Sub is a fully managed message/event bus and provides an easy way to handle the fast click-stream generated by typical websites. The click-stream contains signals that can be processed to derive insights in real time.
Metrics processing pipeline: processes the click-stream from Pub/Sub into the metrics database. Dataflow, a serverless, fully managed processing service that supports real-time streaming jobs, is used for this.
Metrics database: needs to be an in-memory database to support real-time use cases. Common web analytics metrics include unique visitors, the number of active experiments, and the conversion rate of each experiment. The common theme is counting uniques, i.e. cardinality counting. From a marketeer's standpoint a good estimate is sufficient, and the HyperLogLog algorithm solves the count-unique problem efficiently by trading off some accuracy.
Cloud Memorystore (Redis) provides built-in commands for sets and cardinality measurement, which removes the need to implement them in application code.
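To make the accuracy trade-off concrete, here is a minimal, illustrative HyperLogLog sketch in pure Python. It is not the Redis implementation (Redis exposes its own through the `PFADD` and `PFCOUNT` commands). The names `TinyHyperLogLog` and `estimate_unique`, the use of SHA-1 as the hash, and the precision `p=12` are assumptions chosen for this example only.

```python
import hashlib
import math


class TinyHyperLogLog:
    """Illustrative HyperLogLog with 2**p registers (not Redis's implementation)."""

    def __init__(self, p: int = 12) -> None:
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        # Use the first 64 bits of a SHA-1 digest as the hash value.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)               # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # bias correction for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction (linear counting).
            return m * math.log(m / zeros)
        return raw


def estimate_unique(ids) -> float:
    """Feed every ID into a fresh sketch and return the estimated cardinality."""
    hll = TinyHyperLogLog()
    for item in ids:
        hll.add(item)
    return hll.count()
```

With `p=12` the sketch keeps 4,096 small registers no matter how many IDs are added, and the typical relative error is about 1.04/sqrt(4096), roughly 1.6%. Adding the same ID twice never changes the estimate.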
Analytics reporting and visualization: makes the reports easily available to the marketeer.
A Spring dashboard application is used for demo purposes only. The application uses the Jedis client to read metrics from Redis, using the `scard` and `sinterstore` commands to identify user overlap and other cardinality values. It then renders graphs in a JavaScript-based web UI using the Google Charts library.
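The set operations the dashboard relies on can be illustrated without a Redis server. The sketch below is a small in-memory stand-in that mimics the return values of `SADD`, `SCARD` and `SINTERSTORE`. The class `FakeRedisSets` and key names such as `experiment:A:users` are hypothetical and not taken from the repository; a real client such as Jedis would send these commands to Redis instead.

```python
from typing import Dict, Set


class FakeRedisSets:
    """In-memory stand-in mimicking three Redis set commands (illustration only)."""

    def __init__(self) -> None:
        self._sets: Dict[str, Set[str]] = {}

    def sadd(self, key: str, *members: str) -> int:
        """Add members to the set at `key`; return how many were new."""
        target = self._sets.setdefault(key, set())
        before = len(target)
        target.update(members)
        return len(target) - before

    def scard(self, key: str) -> int:
        """Return the cardinality of the set at `key` (0 if it does not exist)."""
        return len(self._sets.get(key, set()))

    def sinterstore(self, dest: str, *keys: str) -> int:
        """Store the intersection of the given sets at `dest`; return its size."""
        sets = [self._sets.get(k, set()) for k in keys]
        result = set.intersection(*sets) if sets else set()
        self._sets[dest] = result
        return len(result)


def user_overlap(store: FakeRedisSets, key_a: str, key_b: str) -> int:
    """Count users present in both sets, the way a dashboard query might."""
    return store.sinterstore("overlap:tmp", key_a, key_b)
```

For example, if users `u1, u2, u3` saw experiment A and `u2, u3, u4` saw experiment B, `user_overlap` reports two shared users while `scard` on either key reports three.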
- Clone this repository
git clone https://github.com/GoogleCloudPlatform/redis-dataflow-realtime-analytics.git
cd redis-dataflow-realtime-analytics
- Update and activate all environment variables in `set_variables.sh`
source set_variables.sh
- Enable required Cloud products
gcloud services enable \
  compute.googleapis.com \
  pubsub.googleapis.com \
  redis.googleapis.com \
  dataflow.googleapis.com \
  storage-component.googleapis.com
Pub/Sub is a global message bus that lets consumers read messages in a decoupled fashion. Create a Pub/Sub topic to receive application instrumentation messages:
gcloud pubsub topics create $APP_EVENTS_TOPIC --project $PROJECT_ID
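Pub/Sub carries each message as an opaque byte payload, so instrumentation events are usually serialized before publishing. The sketch below shows one way such an event could be encoded as JSON. The field names (`uid`, `experiment_id`, `variant`, `action`, `timestamp_ms`) and the helper names are assumptions for illustration; the repository's generator may use a different schema.

```python
import json
import time


def build_event(uid: str, experiment_id: str, variant: str, action: str) -> bytes:
    """Serialize one hypothetical click-stream event as a UTF-8 JSON payload."""
    event = {
        "uid": uid,                      # visitor identifier (assumed field)
        "experiment_id": experiment_id,  # experiment the visitor is part of
        "variant": variant,              # which variant was shown
        "action": action,                # e.g. "view" or "purchase"
        "timestamp_ms": int(time.time() * 1000),
    }
    # Pub/Sub message data must be bytes, so encode the JSON text.
    return json.dumps(event).encode("utf-8")


def parse_event(payload: bytes) -> dict:
    """Decode a payload produced by build_event back into a dict."""
    return json.loads(payload.decode("utf-8"))
```

A consumer such as the metrics pipeline would apply the inverse step, as `parse_event` does here, before updating any counters.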
Protecting the Redis instance is important because it does not provide any protection from external entities on its own.
- Create a separate VPC network. Blocking external ingress with a firewall provides basic security for the instance.
gcloud compute networks create $VPC_NETWORK_NAME \
  --subnet-mode=auto \
  --bgp-routing-mode=regional
- Create a firewall rule to allow SSH (and ICMP)
gcloud compute firewall-rules create allow-internal-ssh \
  --network $VPC_NETWORK_NAME \
  --allow tcp:22,icmp
Cloud Memorystore provides a fully managed Redis database. Redis is an in-memory NoSQL database that offers comprehensive built-in functions for set operations, including efficient HyperLogLog (HLL) operations for cardinality measurement.
- Create Redis instance in Memorystore.
gcloud redis instances create $REDIS_NAME \
  --size=1 \
  --region=$REGION_ID \
  --zone="$ZONE_ID" \
  --network=$VPC_NETWORK_NAME \
  --tier=standard
Be patient, this can take some time.
- Capture the instance's IP address to configure the Dataflow pipeline and the visualization application
export REDIS_IP="$(gcloud redis instances describe $REDIS_NAME --region=$REGION_ID \
  | grep host \
  | sed 's/host: //')"
The analytics metrics pipeline reads click-stream messages from Pub/Sub and updates metrics in the Redis database in real time. The visualization application can then use the Redis database for the dashboard.
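The kind of per-event update such a pipeline performs can be sketched as follows. This is a simplified, in-memory illustration, not the repository's Dataflow code: the `MetricsStore` class, the key layout (`visitors:<minute>`, `experiment:<id>:exposed`, `experiment:<id>:converted`), the event fields and the `"purchase"` action are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def minute_bucket(timestamp_ms: int) -> str:
    """Format a millisecond timestamp as a UTC minute bucket, e.g. 197001010001."""
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M")


class MetricsStore:
    """Keeps one set of user IDs per metric key (stand-in for Redis sets)."""

    def __init__(self) -> None:
        self.sets = defaultdict(set)

    def update(self, event: dict) -> None:
        uid = event["uid"]
        # Unique visitors per minute: set membership removes duplicates.
        self.sets["visitors:" + minute_bucket(event["timestamp_ms"])].add(uid)
        experiment = event.get("experiment_id")
        if experiment:
            self.sets[f"experiment:{experiment}:exposed"].add(uid)
            if event.get("action") == "purchase":
                self.sets[f"experiment:{experiment}:converted"].add(uid)

    def unique_count(self, key: str) -> int:
        return len(self.sets.get(key, ()))

    def conversion_rate(self, experiment: str) -> float:
        exposed = self.unique_count(f"experiment:{experiment}:exposed")
        converted = self.unique_count(f"experiment:{experiment}:converted")
        return converted / exposed if exposed else 0.0
```

Because every metric is a set of user IDs, replaying or duplicating an event leaves the counts unchanged, which suits a streaming source that may redeliver messages.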
- Create a Cloud Storage bucket for the pipeline's temporary and staging files
gsutil mb -l $REGION_ID -p $PROJECT_ID gs://$TEMP_GCS_BUCKET
- Launch the pipeline using Maven
cd processor
mvn clean compile exec:java \
  -Dexec.mainClass=com.google.cloud.solutions.realtimedash.pipeline.MetricsCalculationPipeline \
  -Dexec.cleanupDaemonThreads=false \
  -Dmaven.test.skip=true \
  -Dexec.args=" \
  --streaming \
  --project=$PROJECT_ID \
  --runner=DataflowRunner \
  --stagingLocation=gs://$TEMP_GCS_BUCKET/stage/ \
  --tempLocation=gs://$TEMP_GCS_BUCKET/temp/ \
  --inputTopic=projects/$PROJECT_ID/topics/$APP_EVENTS_TOPIC \
  --workerMachineType=n1-standard-4 \
  --region=$REGION_ID \
  --subnetwork=regions/$REGION_ID/subnetworks/$VPC_NETWORK_NAME \
  --redisHost=$REDIS_IP \
  --redisPort=6379"
The dummy event generator is a Python executable that needs to keep running. Launch it in a separate shell session so it stays active.
- Create and initialize a new Python 3 virtual environment (requires the `python3-venv` package)
python3 -m venv ~/generator-venv
source ~/generator-venv/bin/activate
pip install -r loggen/requirements.txt
- Run the logs generator
python loggen/message_generator.py \
  --topic $APP_EVENTS_TOPIC \
  --project-id $PROJECT_ID \
  --enable-log true
Use the simple reporting application located in the `dashboard/` folder, built using Spring Boot and a simple HTML+JS UI.
The application reads the metrics from the Redis database and makes them available to the dashboard UI. The application server needs to be on the same VPC network as the Redis server. Because the Cloud Shell VM is not on that network, this demo uses a proxy VM to tunnel the Redis port to Cloud Shell.
- Create a VM to act as proxy
gcloud compute instances create proxy-server \
  --zone $ZONE_ID \
  --image-family debian-10 \
  --image-project debian-cloud \
  --network $VPC_NETWORK_NAME
- Start SSH port forwarding
gcloud compute ssh proxy-server --zone $ZONE_ID -- -N -L 6379:$REDIS_IP:6379 -4 &
- Start the visualization Spring Boot application.
cd dashboard/
mvn clean compile package spring-boot:run
- Click the web preview icon in Cloud Shell to open the application's web UI in the browser.
  a. Click "Preview on port 8080"
  b. On the dashboard, click "Auto Update", which keeps the dashboard fresh.
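If the dashboard stays empty, one thing worth checking is whether the SSH tunnel from the port-forwarding step is still up. The helper below is a small sketch that only tests whether a TCP connection can be opened; the function name `can_connect` is hypothetical, and the host and port in the usage note assume the `-L 6379:$REDIS_IP:6379` forwarding shown above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (e.g. refused, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `can_connect("127.0.0.1", 6379)` from Cloud Shell should return `True` while the tunnel is running. A `False` result suggests the `gcloud compute ssh` background process has exited and needs to be restarted.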