Replies: 3 comments 4 replies
-
Thanks for your awesome design proposal.
For the second solution, it's a costless and consistent way to generate SST files, but I'm a bit worried about the performance, because every command needs to go through the network stack. As for the command part, it'd be better to use the RESP format if possible, so that we won't need to take care of a special format. Let's see if others have any thoughts on this topic. @torwig @PragmaTwice @mapleFU @enjoy-binbin @caipengbo @ShooterIT
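To make the RESP suggestion concrete: RESP frames every command as an array of bulk strings, so a bulk-load command would need no special wire format at all. A minimal Python sketch of the encoding (the INGEST command name and the file path are hypothetical, purely for illustration):

```python
def encode_resp(*args: str) -> bytes:
    """Encode a command as a RESP array of bulk strings."""
    parts = [f"*{len(args)}\r\n".encode()]
    for arg in args:
        data = arg.encode()
        parts.append(f"${len(data)}\r\n".encode() + data + b"\r\n")
    return b"".join(parts)

# A hypothetical bulk-load command, framed like any other command:
print(encode_resp("INGEST", "/tmp/exchange.json"))
```

Any existing RESP client or proxy would then be able to send the command without special-casing it.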
-
The amount of imported data is usually very large, and it may be inefficient to use kvrocks itself to do this. I think we should provide external tools or big-data programs to directly generate the final key-value data.
For importing information, I think you can use a command like:
The common use cases I can think of for bulk loading are cold starts, migrating to kvrocks from other databases, or periodically ingesting data. Blocking writes should be acceptable behavior for the user. In general, ingesting data should be fast (I think tens of seconds is enough for big data); what's slower is generating and downloading the data.
-
BulkLoad is nice: it lets us make kvrocks load batch-computed results. The data source can come from Spark ETL jobs and other pipelines. Generally, generating SST files is not hard; however, we need to consider the syntax for bulk load:
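As a starting point for that discussion (purely a sketch to anchor the syntax question, not a settled design), the command might take a path to a manifest plus flags for conflict handling and blocking behavior:

```
BULKLOAD <path-to-exchange-file> [REPLACE] [SYNC|ASYNC]
```

REPLACE would control whether imported keys overwrite existing ones, and SYNC/ASYNC whether the command blocks until ingestion finishes.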
-
Discussion for this issue #1301
Hello everyone, after some effort I now have the basic logic of bulk load working. However, there are still some aspects I am unsure of that require discussion with the community. Below, I will briefly explain the basic process of the bulk load function and raise some questions.
How can we implement bulk load
Bulk load uses the SST-ingestion feature of RocksDB (DB::IngestExternalFile), which allows us to quickly import pre-generated SST files into the kvrocks DB. Similar open-source projects that use RocksDB as their storage engine, such as TiDB and Pegasus, implement bulk load the same way. We can use a tool to generate SST files in the kvrocks database format, and then ingest those SST files into kvrocks.
Therefore, to implement the bulk load, we need to design three things:
1. a tool for generating SST files (we call it the make-sst tool),
2. an ingest command in kvrocks that can ingest SST files (we call it the ingest command),
3. a format for SST resources that the ingest command can accept, generated by the make-sst tool (we call it the exchange format file).

Make-sst tool
There are two ways to generate SST files:
1. Using SstFileWriter to generate SST files directly in the kvrocks SST file format.
2. Starting a kvrocks storage engine, inserting elements using its API, and finally taking the generated DB files.
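Method 1's main constraint is that SstFileWriter only accepts keys in ascending order, so the make-sst tool must sort and chunk the input first. A small Python sketch of that preprocessing step (the real SstFileWriter is part of RocksDB's C++ API, so this only plans the batches; the 128 MiB size cap is an arbitrary placeholder):

```python
from typing import Iterable, List, Tuple

TARGET_SST_BYTES = 128 * 1024 * 1024  # placeholder size cap per SST file


def plan_sst_files(pairs: Iterable[Tuple[bytes, bytes]],
                   cap: int = TARGET_SST_BYTES) -> List[List[Tuple[bytes, bytes]]]:
    """Sort key-value pairs and split them into batches, one per SST file.

    SstFileWriter requires keys in ascending order, so we sort globally
    first; each batch stays under the size cap.
    """
    batches: List[List[Tuple[bytes, bytes]]] = []
    current: List[Tuple[bytes, bytes]] = []
    size = 0
    for key, value in sorted(pairs, key=lambda kv: kv[0]):
        entry_size = len(key) + len(value)
        if current and size + entry_size > cap:
            batches.append(current)
            current, size = [], 0
        current.append((key, value))
        size += entry_size
    if current:
        batches.append(current)
    return batches
```

Each batch would then be streamed, in order, through one SstFileWriter instance to produce one SST file.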
For method 1, because SstFileWriter requires that writes be in key order, we need to sort the elements ourselves before writing them to the SST file. In addition, we also need to determine an appropriate size for each SST file.
For method 2, I am not sure whether the kvrocks storage engine can be started independently and whether the resulting DB SST files can be provided directly to the ingest command. If that entire process works, I think method 2 can reduce many compatibility risks for us.

Exchange format file
For the exchange format, we need to record which SST file should be merged into which column family, and we also need the checksum of each SST file to verify its integrity.
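For the integrity check, an ordinary per-file digest would be enough; a minimal sketch using SHA-256 (the choice of algorithm is open, this is only for illustration):

```python
import hashlib


def sst_checksum(path: str) -> str:
    """Return the hex SHA-256 digest of an SST file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()
```

On the ingest side, kvrocks would recompute the digest before handing the file to RocksDB and reject the file on a mismatch.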
I think we can use a JSON file to store this information. By providing this JSON file to the ingest command, we can import multiple SST files at once. In this schema, version specifies the version of the exchange format, files is an array of SST files to ingest, and each file object contains the path to the SST file, the column family to merge into, and the checksum of the file.

Ingest command
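The ingest command would consume the exchange format file described above. To make that concrete, one possible shape for the JSON file (field names and column-family values here are a sketch, not a settled design):

```json
{
  "version": 1,
  "files": [
    {
      "path": "part-000.sst",
      "column_family": "metadata",
      "checksum": "sha256:..."
    },
    {
      "path": "part-001.sst",
      "column_family": "default",
      "checksum": "sha256:..."
    }
  ]
}
```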
The Ingest command needs a way to choose whether the imported keys overwrite existing keys in the database, and we need to provide both sync and async versions of the command.

Summary
So, we need to talk about the following issues, for example: what fields should the exchange format have?

Reference
RocksDB docs
TiDB docs
TiDB docs
Pegasus blog (Chinese only)
https://rockset.com/blog/optimizing-bulk-load-in-rocksdb/
https://www.cockroachlabs.com/blog/bulk-ingest-from-csv/