Modern OLAPs are designed to perform analytic queries, with which we often need to do subqueries, joins, and aggregation. The problem is that the query optimizer needs metadata and statistics, such as data distribution and indexes to do a better job. To solve this problem, we need to build a catalog that serves as a “database of metadata for the database” that stores statistics from the execution engine and data discovery from I/O services, and that provides metadata to the planner and data location to schedule the query.
We will be exposing a REST API for interaction with other components. The inputs and results for the same are a part of the API Spec.
The components of our architecture include the rust service application and rocksDB as an embedded database.
A Rust application that exposes a REST API to modify database metadata of the OLAP database. The application consists of the following components:
- A data model that defines the structure and meaning of the metadata we store, such as schemas, tables, columns, measures, dimensions, hierarchies, and levels. The data model is represented by Rust structs that can be serialized and deserialized using Substrait.
- A database layer that interacts with RocksDB. The database layer provides methods for storing and retrieving the database metadata as key-value pairs using the RocksDB crate.
- A service layer that contains the business logic for the REST API, such as validating inputs, checking permissions, handling errors, etc. The service layer depends on the database layer and uses the data model to manipulate the database metadata.
- A controller layer that exposes the service methods as RESTful endpoints using a web framework, such as warp or axum. The controller layer uses the web framework’s features, such as filters, macros, and async functions, to parse the request parameters and format the response.
We choose RocksDB as the database in the catalog to store metadata. It is a fast and persistent key-value store that can be used as an embedded database for Rust applications.
Most design decisions were made with the assumption that we do not have any schema updates and writes are infrequent with bulk data
We contemplated two embedded database candidates for catalog service: SQLite and RocksDB. We chose RocksDB because
- Better concurrency control: SQLite locks the entire database when dealing with concurrency writing, whereas RocksDB supports snapshots.
- Flexibility: RocksDB provides more configuration options.
- Scalability: RocksDB stores data in different partitions, whereas SQLite stores data in one single file, which isn’t ideal for scalability.
- Based on [1], the catalog for an OLAP system behaves a lot like an OLTP database. They state how using a key-value store in the form of FoundationDB has proved to be beneficial based on performance and scalability. This includes supporting high-frequency reads, and support for dynamic metadata storage.
- [2] compares and benchmarks the performance of tabular storage vs hierarchical organization of metadata as seen in Iceberg and finds the single node processing in Iceberg performs better than the others for small tables but fails to scale. It concludes that the metadata access patterns have a significant impact on the performance of distributed query processing.
- Taking these factors into account, we have decided to go ahead with a key-value store for the simplicity and flexibility it provides along with the performance benefits.
After looking through several available options to use build APIs, such as Hyper and Actix, we have selected Axum.
- Axum framework is built on top of Hyper and Tokio and abstracts some of the low level details
- This, however, does not result in any significant performance overhead.
- Benchmarks for frameworks are listed in [3]
A detailed description of how you are going to determine that your implementation is both (1) correct and (2) performant. You should describe the short unit tests and long running regression tests. Some portion of your testing plan must also use your project's public API, thus you are allowed to share testing infrastructure with the other group implementing the same thing.
For the correctness of the catalog, we plan to conduct unit tests and regression tests. In unit testing, we will test key components and operations such as metadata retrieval, metadata storage, update, and snapshot isolation.
- Concurrency Workloads: Test the performance under concurrent access by simulating multiple clients performing various operations simultaneously (e.g., multiple clients creating tables, querying metadata, committing snapshots, etc.).
- Large Schemas and Datasets: Evaluate the performance with tables having a large number of columns (e.g., hundreds or thousands of columns) and large datasets with many partitions or snapshots.
- Mixed Workloads: Combine different types of operations in a single workload to simulate a more realistic scenario where various operations are performed concurrently.
-
The biggest trade-off being made in the current design is the absence of any optimizations for updates. Updates to any tables will result in the metadata of the tables stored to become stale. Efficiently updating these values is a design challenge. This has not been prioritized based on the assumption that updates in an OLAP system will be infrequent.
-
We chose RocksDB to store metadata, whereas Iceberg Catalog has its own metadata layer that includes metadata files, manifest lists, and manifests. Using RocksDB could be more straightforward to implement compared to building everything from scratch. The components in Iceberg Catalog are likely to be optimized for Iceberg Catalog, and they could outperform RocksDB, which is not dedicated to catalog service.
Our catalog service is designed to support parallelism to enhance performance. This is achieved through the following ways:
-
Concurrency Control in RocksDB: RocksDB, our chosen database, supports concurrent reads and writes. This allows multiple threads to read and write to the database simultaneously, improving the throughput of our service.
-
Asynchronous API: The REST API exposed by our Rust application is asynchronous, meaning it can handle multiple requests at the same time without blocking. This is particularly useful for operations that are I/O-bound, such as reading from or writing to the database.
Performance tuning is crucial for the efficiency and speed of our catalog service. Here's our plan:
-
API Optimization: We will monitor the performance of our API endpoints and optimize the slow ones. This will involve optimizing the database access methods and refactoring the code.
-
Load Testing: We will conduct load testing to understand how our service performs under heavy load. This will help us identify bottlenecks and areas for improvement.
✅ 75%: Basic API support
✅ 100%: Support for parallelism and performance tuning
✅ 125%: Performance testing against Iceberg Catalog
Command to run benchmark script
python3 benchmark_copy/bench.py -b catalog1 -d ../rocksdb -u http://localhost:3000 -n 100 -t 10000 -r 50 -p
[1] https://www.snowflake.com/blog/how-foundationdb-powers-snowflake-metadata-forward/ [2] https://15721.courses.cs.cmu.edu/spring2024/papers/18-databricks/p92-jain.pdf [3] https://github.com/programatik29/rust-web-benchmarks/blob/master/result/hello-world.md