RFC-005: Integrated offline jobs with unified metastore and command-line #713

tobegit3hub · 2021-11-17T09:54:55Z

tobegit3hub
Nov 17, 2021
Maintainer

Problem

Currently, users run online queries and offline batch jobs of OpenMLDB in different ways.

Take offline jobs as an example: we have to write PySpark or Spark applications in Scala/Java and use the "spark-submit" command to submit them. This is not integrated with the OpenMLDB CLI. Moreover, we have to register the databases and tables with their remote file paths every time before running SQL. Setting up the client environment is also complex, since users have to install the Java/Spark/Hadoop packages to connect to the Yarn cluster.

Solution 1: Use Hive Metastore

We will add a unified metadata service for online and offline storage and integrate the offline jobs with the OpenMLDB CLI.

The unified metadata service is based on Hive Metastore and provides APIs to manage the databases and tables. Here is the workflow of SQL DDL operations.

1. The client requests NameServer to create a table or perform another DDL operation.
2. NameServer registers the metadata, such as the table schema, in Hive Metastore.
3. Hive Metastore returns the result.
4. NameServer returns the result to the client.
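
The DDL workflow above can be sketched as follows. This is a minimal in-memory illustration, not OpenMLDB code: `InMemoryMetastore` stands in for Hive Metastore, and `NameServer.create_table`, `register_table`, and the result dictionary are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryMetastore:
    """Hypothetical stand-in for Hive Metastore; keeps schemas in a dict."""
    tables: dict = field(default_factory=dict)

    def register_table(self, db: str, table: str, schema: dict) -> bool:
        key = (db, table)
        if key in self.tables:
            return False  # table already registered
        self.tables[key] = dict(schema)
        return True


@dataclass
class NameServer:
    """Hypothetical NameServer facade that forwards DDL to the metastore."""
    metastore: InMemoryMetastore

    def create_table(self, db: str, table: str, schema: dict) -> dict:
        # Register the schema in the metastore.
        ok = self.metastore.register_table(db, table, schema)
        # Relay the metastore's result back to the client.
        return {"ok": ok, "msg": "created" if ok else "table already exists"}


# The client asks NameServer to create a table, then repeats the request.
ns = NameServer(InMemoryMetastore())
first = ns.create_table("db1", "t1", {"id": "int", "name": "string"})
second = ns.create_table("db1", "t1", {"id": "int"})
```

The second call fails because the metastore, not NameServer, is the source of truth for which tables exist.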

Here is the workflow of importing actual data to storage.

1. The client requests NameServer to import data into offline storage.
2. NameServer checks the metadata in Hive Metastore.
3. Hive Metastore returns the result, such as whether the table exists.
4. NameServer requests TaskManager to submit the data import job.
5. TaskManager submits the job to the Yarn cluster, and the job writes data to HDFS.
6. Once the job has been submitted, TaskManager gets the job id so that we can track this long-running job.
7. TaskManager returns the job status, including the job id, to NameServer.
8. Finally, NameServer returns the job status to the client.
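
A rough sketch of this import flow is shown below, assuming in-memory stand-ins for every component. The class and method names (`Metastore.table_exists`, `TaskManager.submit_import_job`, `NameServer.import_offline`), the job id format, and the HDFS paths are all hypothetical; a real TaskManager would submit to Yarn rather than record a dictionary entry.

```python
import itertools


class Metastore:
    """Hypothetical metastore stand-in that only answers existence checks."""

    def __init__(self, tables):
        self._tables = set(tables)

    def table_exists(self, db, table):
        return (db, table) in self._tables


class TaskManager:
    """Hypothetical TaskManager; records jobs instead of submitting to Yarn."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.jobs = {}

    def submit_import_job(self, db, table, path):
        job_id = f"job_{next(self._ids)}"  # id used to track the long-running job
        self.jobs[job_id] = {"source": path, "target": f"{db}.{table}"}
        return {"job_id": job_id, "state": "SUBMITTED"}


class NameServer:
    def __init__(self, metastore, task_manager):
        self.metastore = metastore
        self.task_manager = task_manager

    def import_offline(self, db, table, path):
        # Check the metadata before doing any work.
        if not self.metastore.table_exists(db, table):
            return {"job_id": None, "state": "FAILED", "msg": "table not found"}
        # Delegate to TaskManager and relay the job status to the client.
        return self.task_manager.submit_import_job(db, table, path)


ns = NameServer(Metastore([("db1", "t1")]), TaskManager())
ok = ns.import_offline("db1", "t1", "hdfs:///tmp/t1.parquet")
missing = ns.import_offline("db1", "nope", "hdfs:///tmp/x.parquet")
```

Returning the job id immediately, rather than waiting for completion, is what lets the client poll a long-running job later.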

The metadata service can be implemented in different ways; their pros and cons are compared below.

| Proposal | Pros | Cons |
| --- | --- | --- |
| Self-defined Metadata | Easy to implement | Need to modify Spark to read the custom metadata service; cannot support external systems like Trino |
| Hive Metastore | Supports Parquet/Iceberg/ORC and other data formats; supports external systems like Hive/Spark/Trino/Impala | Adds an external component; need to modify NameServer to read and write the metadata service |
| Iceberg Hadoop Catalog | No external dependency; easy to use | Not an industry-standard metadata service |

There are several candidate offline table formats; their pros and cons are compared below.

| Proposal | Pros | Cons |
| --- | --- | --- |
| Parquet | Column-based; popular | Cannot update |
| CSV | Easy to read; row-based | Cannot update |
| Iceberg with Parquet | Supports update/upsert/delete; column-based; storage with index | Needs the Iceberg package |
| Support multiple formats | Flexible | Needs extra operations before reading and writing data |

Solution 2: Use Self-defined Metadata

Another solution worth considering is a self-defined metadata service.

The catalog metadata can be stored in NameServer, which provides APIs for other systems to access it. Here is the workflow of SQL DDL operations.

1. The client requests NameServer to create a table.
2. NameServer stores the metadata in memory, persists it, and then returns the result to the client.
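
The "store in memory, persist, then reply" behaviour can be illustrated with a small sketch. The RFC does not say how NameServer persists its catalog, so the JSON snapshot file used here is purely an assumption for illustration, as are the `Catalog` class and its method names.

```python
import json
import os
import tempfile


class Catalog:
    """Hypothetical NameServer catalog: in-memory dict plus a JSON snapshot."""

    def __init__(self, path):
        self.path = path
        self.tables = {}
        if os.path.exists(path):
            with open(path) as f:
                self.tables = json.load(f)  # restore state after a restart

    def create_table(self, db, table, schema):
        key = f"{db}.{table}"
        if key in self.tables:
            return {"ok": False, "msg": "table already exists"}
        self.tables[key] = schema  # store in memory
        with open(self.path, "w") as f:
            json.dump(self.tables, f)  # persist before replying to the client
        return {"ok": True, "msg": "created"}


snapshot = os.path.join(tempfile.mkdtemp(), "catalog.json")
result = Catalog(snapshot).create_table("db1", "t1", {"id": "int"})
reloaded = Catalog(snapshot)  # simulates a NameServer restart
```

Persisting before replying means a client never sees a "created" result for a table that would be lost on restart.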

Here is the workflow of importing actual data to storage.

1. The client requests NameServer to import data.
2. NameServer checks the table info in its internal catalog and requests TaskManager to submit the job.
3. TaskManager submits the Spark job to Yarn or Kubernetes.
4. TaskManager gets the job status from Yarn or Kubernetes. At the same time, the Spark job reads the catalog metadata from NameServer if it needs to read registered tables.
5. TaskManager returns the job status to NameServer. At the same time, the Spark job can write data into the online tablet and offline HDFS.
6. NameServer returns the job status and result to the client.
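
The key difference from solution 1 is that the job resolves tables through NameServer itself. The sketch below illustrates that under heavy simplification: the "Spark job" is an ordinary function call, `offline_storage` stands in for HDFS, and all class names, method names, and paths are hypothetical.

```python
class NameServer:
    """Hypothetical NameServer that holds the catalog itself (no Hive Metastore)."""

    def __init__(self, task_manager):
        self.catalog = {}
        self.task_manager = task_manager

    def create_table(self, name, schema):
        self.catalog[name] = schema

    def get_table(self, name):
        # API that a Spark job would call to resolve registered tables.
        return self.catalog.get(name)

    def import_data(self, name, path):
        if name not in self.catalog:
            return {"job_id": None, "state": "FAILED"}
        return self.task_manager.submit(self, name, path)


class TaskManager:
    """Hypothetical TaskManager; the job runs inline instead of on Yarn/Kubernetes."""

    def __init__(self):
        self.offline_storage = {}  # stands in for HDFS
        self.next_id = 1

    def submit(self, name_server, name, path):
        job_id = self.next_id
        self.next_id += 1
        # The job reads the schema from NameServer rather than from a metastore.
        schema = name_server.get_table(name)
        self.offline_storage[name] = {"source": path, "columns": list(schema)}
        return {"job_id": job_id, "state": "FINISHED"}


tm = TaskManager()
ns = NameServer(tm)
ns.create_table("t1", {"id": "int", "name": "string"})
status = ns.import_data("t1", "hdfs:///tmp/t1.csv")
```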

Compared with solution 1, solution 2 is simpler and does not depend on an external service like Hive Metastore. Since NameServer already has APIs to access the catalog data, we only need to update the Spark batch jobs to query NameServer and register the tables when needed. The self-defined metadata service cannot support the Iceberg table format yet; to support Iceberg, we would have to extend the Iceberg catalog service.

Chosen Solution

For OpenMLDB 0.4.0, solution 2 is chosen and all the functions will be implemented in this release.

Changes and Additions to Public Interfaces

Since we will integrate offline jobs with the OpenMLDB CLI, we may add new commands for submitting offline jobs.

We may reuse the "use" command by adding a new config named "EXECUTE_MODE". The default value of "EXECUTE_MODE" is "online", which uses the C++ in-memory batch execution engine. If the value is "offline", the client will submit jobs to the cluster to run offline OpenMLDB batch SQL.

| Command | Parameter | Response | Notice |
| --- | --- | --- | --- |
| SET | EXECUTE_MODE=offline | | |
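
To make the intended behaviour concrete, here is a minimal sketch of how a client session might route statements by "EXECUTE_MODE". It is not the OpenMLDB CLI: the `CliSession` class, the accepted command syntax, and the reply strings are assumptions made for this example; only the config name, its two values, and the "online" default come from the proposal.

```python
import re


class CliSession:
    """Hypothetical CLI session; only the EXECUTE_MODE handling is modelled."""

    VALID_MODES = ("online", "offline")
    SET_PATTERN = re.compile(r"\s*SET\s+EXECUTE_MODE\s*=\s*(\w+)\s*;?\s*", re.IGNORECASE)

    def __init__(self):
        self.execute_mode = "online"  # default value named in the proposal

    def handle(self, command: str) -> str:
        match = self.SET_PATTERN.fullmatch(command)
        if match:
            mode = match.group(1).lower()
            if mode not in self.VALID_MODES:
                return f"error: unknown mode {mode}"
            self.execute_mode = mode
            return f"EXECUTE_MODE={mode}"
        # Anything else is treated as SQL and routed by the current mode.
        if self.execute_mode == "offline":
            return "submitted offline job"
        return "ran in online engine"


session = CliSession()
before = session.handle("SELECT * FROM t1;")
reply = session.handle("SET EXECUTE_MODE=offline")
after = session.handle("SELECT * FROM t1;")
```

Keeping the mode as session state means the same SQL text can be run against either engine without changing the statement itself.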

Performance Impact

This is a new component and will not affect current performance.

Backwards Compatibility and Upgrade Path

This is a new component, so it has no backwards compatibility or upgrade issues.
