OpenTSDB Rollups Spark job

Spark job that reads OpenTSDB data from an HBase snapshot and generates rollup data points.
This is the accompanying repository to the Skyscanner Engineering blog post on the same topic.

Deployment

We run the job on AWS using a data pipeline, which creates a Spark cluster and an HBase cluster for us. The input for the job is read from HBase snapshots that have been uploaded to S3. For a detailed description of the infrastructure, see the blog post. The job takes snapshots of the OpenTSDB tables for raw data points and UIDs as input (as defined in tsd.storage.hbase.data_table and tsd.storage.hbase.uid_table of the OpenTSDB configuration, respectively). The names used in this script are the defaults, tsdb and tsdb-uid.

The following assumptions are made:

  • Snapshots follow the naming convention <table_name>-YYYY-MM-DD.
  • Snapshots from the live cluster are exported using HBase's ExportSnapshot tool to s3a://${BackupBucket}/#{BackupWeekNumber}/${HBaseClusterColour}/<snapshot_name>/
  • The JAR that is built as part of this repo is published to an S3 bucket, to the path s3://${JobBucket}/rollups/${BuildId}/opentsdb-rollup-all.jar
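The naming and path conventions above can be sketched as a few small helpers. This is only an illustration of how the placeholders combine, not code from this repository; the function names and the example bucket names are hypothetical.

```python
from datetime import date


def snapshot_name(table_name: str, snapshot_date: date) -> str:
    """Build a snapshot name following <table_name>-YYYY-MM-DD."""
    return f"{table_name}-{snapshot_date.isoformat()}"


def snapshot_path(backup_bucket: str, week_number: int,
                  cluster_colour: str, name: str) -> str:
    """Build the S3 location where an exported snapshot is expected."""
    return f"s3a://{backup_bucket}/{week_number}/{cluster_colour}/{name}/"


def jar_path(job_bucket: str, build_id: str) -> str:
    """Build the S3 location of the published job JAR."""
    return f"s3://{job_bucket}/rollups/{build_id}/opentsdb-rollup-all.jar"


if __name__ == "__main__":
    # Example values are made up for illustration.
    name = snapshot_name("tsdb", date(2020, 1, 6))
    print(snapshot_path("my-backup-bucket", 2, "blue", name))
    print(jar_path("my-job-bucket", "build-123"))
```

Taking the date as a `date` object rather than a string keeps the zero-padded YYYY-MM-DD form consistent.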

Input parameters of the CloudFormation script

This describes the input parameters to the CloudFormation script in cloudformation/cloudformation.yaml that is used to create the data pipeline described above.

BuildId

An identifier generated by the build pipeline that builds and publishes the JAR file. It's used to generate the path in an S3 bucket (s3://${JobBucket}/rollups/${BuildId}) that holds resources like the JAR itself, config files and other scripts.

BackupWeekNumber

The number of the calendar week in which the input snapshots were taken. Used to generate the input path (s3a://${BackupBucket}/#{BackupWeekNumber}/${HBaseClusterColour}/<snapshot_name>).

SnapshotRestoreDate

The date of the snapshot in the format YYYY-MM-DD.
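As a rough sketch, both BackupWeekNumber and SnapshotRestoreDate could be derived from the day the snapshot was taken. This assumes that "calendar week" means the ISO 8601 week number, which this README does not confirm; the helper name is hypothetical.

```python
from datetime import date
from typing import Tuple


def snapshot_parameters(taken_on: date) -> Tuple[int, str]:
    """Return (BackupWeekNumber, SnapshotRestoreDate) for a snapshot date.

    Assumption: the week number is the ISO 8601 week. Adjust this if your
    backup process numbers weeks differently.
    """
    week_number = taken_on.isocalendar()[1]
    restore_date = taken_on.isoformat()  # YYYY-MM-DD
    return week_number, restore_date


if __name__ == "__main__":
    week, restore_date = snapshot_parameters(date(2020, 1, 6))
    print(week, restore_date)
```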

BackupBucket

Name of the S3 bucket that the snapshots are uploaded to. See the assumptions listed above for the exact path where the snapshots are expected.

BeforeTimestamp

Timestamp of the first data point we want to include in this run of the job. Filters out every point before the given timestamp. Unit: milliseconds. Must be less than AfterTimestamp.

AfterTimestamp

Timestamp of the first data point we want to exclude in this run of the job. Filters out every point at or after the given timestamp. Unit: milliseconds. Must be greater than BeforeTimestamp. Together, BeforeTimestamp and AfterTimestamp define the time range of the data points that are to be rolled up.
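Read together, the two descriptions suggest a half-open time range: a point is kept if BeforeTimestamp <= timestamp < AfterTimestamp. The sketch below illustrates that reading. It is not the job's actual filter code, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def to_millis(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    return int(moment.timestamp() * 1000)


def validate_range(before_ts_ms: int, after_ts_ms: int) -> None:
    """Reject ranges where BeforeTimestamp is not less than AfterTimestamp."""
    if before_ts_ms >= after_ts_ms:
        raise ValueError("BeforeTimestamp must be less than AfterTimestamp")


def is_rolled_up(point_ts_ms: int, before_ts_ms: int, after_ts_ms: int) -> bool:
    """Assumed semantics: lower bound inclusive, upper bound exclusive."""
    return before_ts_ms <= point_ts_ms < after_ts_ms


if __name__ == "__main__":
    # Example: roll up one day of data (dates chosen for illustration).
    before = to_millis(datetime(2020, 1, 1, tzinfo=timezone.utc))
    after = to_millis(datetime(2020, 1, 2, tzinfo=timezone.utc))
    validate_range(before, after)
    print(is_rolled_up(before, before, after))  # first included point
    print(is_rolled_up(after, before, after))   # first excluded point
```

With a half-open range, consecutive runs can reuse one run's AfterTimestamp as the next run's BeforeTimestamp without counting any point twice.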

HBaseClusterColour

Colour of the cluster. Useful when running an active/standby cluster setup. Used to construct the full path to the backup (see assumptions).

TerminateAfter

When to terminate the created EMR cluster at the very latest. Directly passed through to the EMR cluster config.

SSHKeypair

The SSH key pair to use for connecting to the EC2 instances that form the EMR cluster. Optional.

JobBucket

Name of an S3 bucket for supporting files.

TriggersAlert

Boolean value indicating whether or not to send an alert if the job fails.

VictorOpsIntegrationHook

VictorOps hook for the CloudWatch integration. Used to route the alerts on failure.

Development

The project uses Java 8, but it should be compatible with newer Java versions.

Developing locally

To build the job's fat jar, run

./gradlew build jar  

The output can be found in build/libs.

Tests can be run with

./gradlew test

Subprojects

Apart from the main RollupJob code, there is one subproject: serializer.

Serializer

The serializer project is needed to create a shaded JAR for serialising rollup schemas. Our schemas use proto3 while HBase still uses proto2. No additional steps are needed to update this code as the shaded JAR is automatically included in the main rollup job build.