RFC-005: Integrated offline jobs with unified metastore and command-line #713
tobegit3hub started this conversation in RFCs
Problem
Currently, users run online queries and offline batch jobs of OpenMLDB in different ways.
Take offline jobs as an example: users have to write PySpark applications, or Spark applications in Scala/Java, and submit them with the "spark-submit" command. This workflow is not integrated with the OpenMLDB CLI. Moreover, the databases and tables have to be registered with their remote file paths every time before running SQL. Setting up the client environment is also complex, since users have to install the Java/Spark/Hadoop packages to connect to the Yarn cluster.
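To make the per-job overhead concrete, here is a minimal sketch of the command a user assembles today for each offline job. The application name `feature_job.py`, the file paths, and the `--table name=path` arguments are hypothetical conventions of the user's own application; only `--master yarn` and `--deploy-mode cluster` are spark-submit options.

```python
from typing import Dict, List


def build_submit_command(app_path: str, tables: Dict[str, str]) -> List[str]:
    """Build the argv for submitting one offline job to a Yarn cluster.

    `tables` maps table names to remote file paths. The `--table` arguments
    are a hypothetical convention of the user's own application, not
    spark-submit options.
    """
    argv = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", app_path]
    # Every table the SQL touches has to be passed (and registered) again per run.
    for name, path in sorted(tables.items()):
        argv += ["--table", f"{name}={path}"]
    return argv


command = build_submit_command(
    "feature_job.py",  # hypothetical PySpark application
    {"t1": "hdfs:///data/db1/t1", "t2": "hdfs:///data/db1/t2"},
)
print(" ".join(command))
```

The table-to-path mapping is repeated for every submission, which is the bookkeeping a unified metastore is meant to remove.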
Solution 1: Use Hive Metastore
We will add a unified metadata service for online and offline storage and integrate offline jobs with the OpenMLDB CLI.
The unified metadata service is based on Hive Metastore and provides APIs to manage databases and tables. Here is the workflow of SQL DDL operations.
Here is the workflow of importing actual data to storage.
The metadata service can be implemented in different ways, and we will discuss their pros and cons.
There are several offline table formats, and we will discuss their pros and cons.
Solution 2: Use Self-defined Metadata
Another solution that may be considered is a self-defined metadata service.
The catalog metadata can be stored in NameServer, which provides APIs for other systems to access it. Here is the workflow of SQL DDL operations.
Here is the workflow of importing actual data to storage.
Compared with solution 1, solution 2 is simpler and does not depend on an external service such as Hive Metastore. Since NameServer already has APIs to access the catalog data, we need to update the Spark batch jobs to query NameServer and register the tables if needed. The self-defined metadata service cannot support the Iceberg table format yet. If we want to support Iceberg, we have to extend the Iceberg catalog service.
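The idea of solution 2 can be sketched as a small in-memory catalog. This is an illustration only, not the NameServer API: the names `TableInfo`, `Catalog`, `create_table`, `set_offline_path`, and `tables_to_register` are all hypothetical, and a real batch job would query NameServer over the network instead of a local object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TableInfo:
    """Catalog entry for one table (hypothetical fields)."""
    db: str
    name: str
    columns: List[Tuple[str, str]]  # (column name, type name)
    offline_path: str = ""          # empty until data has been imported


@dataclass
class Catalog:
    """In-memory stand-in for the catalog data held by NameServer."""
    tables: Dict[Tuple[str, str], TableInfo] = field(default_factory=dict)

    def create_table(self, db: str, name: str, columns: List[Tuple[str, str]]) -> TableInfo:
        # DDL step: record the schema once, for online and offline use alike.
        key = (db, name)
        if key in self.tables:
            raise ValueError(f"table {db}.{name} already exists")
        self.tables[key] = TableInfo(db, name, list(columns))
        return self.tables[key]

    def set_offline_path(self, db: str, name: str, path: str) -> None:
        # Import step: remember where the offline data was written.
        self.tables[(db, name)].offline_path = path

    def tables_to_register(self, db: str) -> List[TableInfo]:
        # What a batch job would register before running SQL.
        return [t for t in self.tables.values() if t.db == db and t.offline_path]


catalog = Catalog()
catalog.create_table("db1", "t1", [("id", "int"), ("name", "string")])
catalog.create_table("db1", "t2", [("id", "int")])
catalog.set_offline_path("db1", "t1", "hdfs:///tmp/db1/t1")
registered = [t.name for t in catalog.tables_to_register("db1")]
print(registered)
```

In this sketch only `t1` is returned, because `t2` has a schema but no imported offline data yet. The batch job gets table names and paths from the catalog, so users no longer supply them per run.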
Chosen Solution
For OpenMLDB 0.4.0, solution 2 is chosen and all the functions will be implemented in this release.
Changes and Additions to Public Interfaces
Since we will integrate offline jobs with the OpenMLDB CLI, we may add new commands for submitting offline jobs.
We may reuse the "use" command by adding a new config named "EXECUTE_MODE". The default value of "EXECUTE_MODE" is "online", which uses the C++ in-memory batch execution engine. If the value is "offline", the client will submit jobs to the cluster to run offline OpenMLDB batch SQL.
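The routing implied by "EXECUTE_MODE" can be sketched as follows. This is a hypothetical client-side model, not the CLI implementation: the `Session` class, its methods, and the returned action names are assumptions made for illustration; only the config name, its two values, and the "online" default come from this RFC.

```python
from typing import Tuple

VALID_MODES = ("online", "offline")


class Session:
    """Hypothetical CLI session that remembers the current EXECUTE_MODE."""

    def __init__(self) -> None:
        self.execute_mode = "online"  # default value stated in the RFC

    def set_execute_mode(self, value: str) -> None:
        mode = value.strip().lower()
        if mode not in VALID_MODES:
            raise ValueError(f"EXECUTE_MODE must be one of {VALID_MODES}, got {value!r}")
        self.execute_mode = mode

    def route(self, sql: str) -> Tuple[str, str]:
        # Offline: hand the SQL to the cluster as a batch job.
        if self.execute_mode == "offline":
            return ("submit_batch_job", sql)
        # Online: run on the in-memory batch execution engine.
        return ("run_in_memory", sql)


session = Session()
online_action = session.route("SELECT * FROM t1")
session.set_execute_mode("offline")
offline_action = session.route("SELECT * FROM t1")
print(online_action[0], offline_action[0])
```

Keeping the mode as session state means the same SQL text can be sent to either engine without changing the statement itself.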
Performance Impact
This is a new component and will not affect current performance.
Backwards Compatibility and Upgrade Path
This is a new component and does not have backwards compatibility or upgrade issues.
Related Works