Import HDFS data to OpenMLDB #83
-
About BulkLoad Way
First, why do we need a BulkLoad way? In our db's MemTable, the number of segments in one index is constant, and a MemTable doesn't need to do compaction, so OpenMLDB doesn't suffer the "growing pains" that make bulk load necessary elsewhere. All we want is a fast import.
Case 1 costs less than O(n log n), so for simplicity we only consider case 2. What do we save in case 2? We can load all the corresponding data into a skiplist in one step. So we can support a fast import path by adding a "multi put": a "multi put" on one segment reads multiple (unsorted) rows from a source and puts them into the skiplist in one step (see the sketch below). A serialized skiplist would be a further improvement, but it is not required.
Source
Talking about the source, we may have HDFS, Kafka, etc.
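To make the "multi put" above concrete, here is a minimal sketch, assuming a plain sorted linked list in place of the real two-level skiplist and a numeric key in place of a row; everything named here is illustrative, not OpenMLDB code. Sorting the batch once lets a single merge pass insert every row without restarting from the head.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a plain sorted linked list stands in for one skiplist
// of a Segment; a long key stands in for a row. Not OpenMLDB code.
public class MultiPutSketch {
    static final class Node {
        final long key;
        Node next;
        Node(long key, Node next) { this.key = key; this.next = next; }
    }

    Node head; // ascending order, smallest key first

    // "multi put": sort the unsorted batch once, then merge it in a single
    // pass. The cursor never restarts from the head, so the whole batch
    // costs O(m log m + n) instead of O(m * n) for m independent puts.
    void multiPut(List<Long> batch) {
        List<Long> sorted = new ArrayList<>(batch);
        Collections.sort(sorted);
        Node prev = null, cur = head;
        for (long key : sorted) {
            while (cur != null && cur.key < key) { prev = cur; cur = cur.next; }
            Node n = new Node(key, cur);
            if (prev == null) head = n; else prev.next = n;
            prev = n;
        }
    }

    public static void main(String[] args) {
        MultiPutSketch segment = new MultiPutSketch();
        segment.multiPut(List.of(5L, 1L, 9L, 3L)); // unsorted rows from the source
        for (Node n = segment.head; n != null; n = n.next) System.out.print(n.key + " ");
        // prints: 1 3 5 9
    }
}
```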
-
BulkLoad Way Case 1
Let's consider how to handle Case 1. The focus of this comment is how the Index Region's two-level skiplist is serialized/deserialized.
skiplist backward
redis does the same in its implementation: by saving the data in reverse order, each insert during load stops at the head instead of walking down the skiplist, so the skiplist can be loaded efficiently without storing much extra information. Mapping this onto OpenMLDB, when we pack Segment data externally, we only need to pack it in the same backward order.
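To see why the backward order makes loading cheap, here is a minimal sketch; a plain sorted linked list again stands in for the skiplist, and the save/insert helpers are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: saving a sorted list back-to-front means every insert
// during reload lands at the head, so no search path, level info, or other
// extra structure has to be serialized.
public class BackwardLoadSketch {
    static final class Node {
        final long key;
        Node next;
        Node(long key, Node next) { this.key = key; this.next = next; }
    }

    // Ordinary sorted insert: O(1) when the key belongs at the head,
    // a full walk otherwise.
    static Node insert(Node head, long key) {
        if (head == null || key <= head.key) return new Node(key, head); // head branch
        Node cur = head;
        while (cur.next != null && cur.next.key < key) cur = cur.next;
        cur.next = new Node(key, cur.next);
        return head;
    }

    // "Serialize" from tail to head: push while walking forward, so iterating
    // the deque yields the keys largest-first.
    static Deque<Long> saveBackward(Node head) {
        Deque<Long> stack = new ArrayDeque<>();
        for (Node n = head; n != null; n = n.next) stack.push(n.key);
        return stack;
    }

    public static void main(String[] args) {
        Node head = null;
        for (long k : new long[]{1, 3, 5, 7}) head = insert(head, k);
        // Reload: keys arrive as 7, 5, 3, 1 -- every insert takes the head branch.
        Node reloaded = null;
        for (long k : saveBackward(head)) reloaded = insert(reloaded, k);
        for (Node n = reloaded; n != null; n = n.next) System.out.print(n.key + " ");
        // prints: 1 3 5 7
    }
}
```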
Since each row has a different ttl, rows are freed at different times and are not suitable to share one block of memory. So the data block carried by a bulk load request must be copied once, i.e. copied out into individual DataBlocks. Given that, we can use baidu_std's request_attachment to carry the data and avoid extra serialization: the attachment is an IOBuf, which shouldn't be held for long anyway, so copying it out to create the individual DataBlocks is the right thing to do.
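A rough sketch of that copy step, with Java stand-ins for the C++ types (a ByteBuffer plays the IOBuf attachment, a byte[] plays a DataBlock; the length-prefixed layout is an assumption made for illustration):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split one contiguous attachment into per-row copies,
// since rows with different ttls must own their memory independently.
public class AttachmentSplitSketch {
    // Assumed wire layout: [int32 rowLen][rowLen bytes], repeated.
    static List<byte[]> toDataBlocks(ByteBuffer attachment) {
        List<byte[]> blocks = new ArrayList<>();
        while (attachment.remaining() >= Integer.BYTES) {
            int rowLen = attachment.getInt();
            byte[] block = new byte[rowLen];   // each row gets its own allocation
            attachment.get(block);             // copy out of the shared buffer
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        for (String row : new String[]{"row-a", "row-bb"}) {
            byte[] b = row.getBytes();
            buf.putInt(b.length).put(b);
        }
        buf.flip();
        toDataBlocks(buf).forEach(b -> System.out.println(new String(b)));
        // prints: row-a, then row-bb
    }
}
```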
-
BulkLoad Way Case 1 POC
Server Side
Let's first focus on the changes OpenMLDB needs to make. The key point is what the tablet server needs to support.
For simplicity, the BulkLoad rpc is defined to be sent directly by the importer.
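A sketch of the importer side under that definition; TabletStub, the request shape, and hash partitioning are all hypothetical stand-ins, since the actual rpc definition isn't reproduced in this thread:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the importer groups rows by partition and sends one
// BulkLoad request per tablet itself, bypassing the normal insert path.
public class ImporterSketch {
    // Hypothetical stand-in for the tablet server's BulkLoad rpc endpoint.
    interface TabletStub {
        void bulkLoad(int partitionId, List<String> encodedRows);
    }

    static void bulkLoad(List<String> rows, int partitions, TabletStub stub) {
        Map<Integer, List<String>> byPartition = new HashMap<>();
        for (String row : rows) {
            int pid = Math.floorMod(row.hashCode(), partitions); // assumed hash partitioning
            byPartition.computeIfAbsent(pid, k -> new ArrayList<>()).add(row);
        }
        byPartition.forEach(stub::bulkLoad); // one request per partition, no per-row rpc
    }

    public static void main(String[] args) {
        bulkLoad(List.of("a", "b", "c", "d"), 2,
                (pid, part) -> System.out.println("partition " + pid + " <- " + part));
    }
}
```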
-
Bulk Load Way Case 1 Test
The test set is NYCTaxiTripDuration train.csv, 192 MB, about 1.4M rows.
insert mode
bulk load mode (including binlog)
1 partition
- client total: 42790 ms
- real generate cost: 23515 ms
8 partitions - 8 threads, 1 thread per MemTable, each sending 1 bulk load request
- client total: 21978 ms
- 8 MemTable detail: real generate cost per MemTable: 3902 ms
-
Bulk Load Way Case 1 Test - Size limit version
default size limit: 32 MB
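Presumably the importer then has to flush a part before the accumulated rows exceed the limit; a minimal sketch of that chunking (the demo limit and byte counting are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: accumulate rows into one request until adding the next
// row would cross the size limit, then flush and start a new request.
public class SizeLimitSketch {
    interface Sender { void send(List<byte[]> part); }

    static void sendChunked(Iterable<byte[]> rows, long sizeLimit, Sender sender) {
        List<byte[]> part = new ArrayList<>();
        long bytes = 0;
        for (byte[] row : rows) {
            if (!part.isEmpty() && bytes + row.length > sizeLimit) {
                sender.send(part);        // flush before crossing the limit
                part = new ArrayList<>();
                bytes = 0;
            }
            part.add(row);
            bytes += row.length;
        }
        if (!part.isEmpty()) sender.send(part); // flush the tail
    }

    public static void main(String[] args) {
        List<byte[]> rows = List.of(new byte[6], new byte[6], new byte[6]);
        // With a 10-byte demo limit, each 6-byte row ends up in its own request.
        sendChunked(rows, 10, part -> System.out.println("request with " + part.size() + " row(s)"));
    }
}
```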
-
Overview
Related issue #75
We may have two ways to import big data into OpenMLDB.
SDK Way
Standalone Tool
Needs:
Reader
Need to support csv/parquet/orc.
After parsing,
csv => String[]
parquet => SimpleGroup
orc => Object
So we need a wrapper that exposes a uniform interface to the outside, and each format implements it.
Reader could support reading more than one file; a minimal sketch of the wrapper follows.
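One way such a wrapper could look; RowReader and the naive csv splitting below are assumptions, and the parquet/orc implementations would adapt SimpleGroup / Object rows to the same shape:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

// Illustrative only: one uniform row interface; each format implements it.
public class ReaderSketch {
    interface RowReader extends AutoCloseable {
        String[] next() throws IOException; // null when the source is exhausted
    }

    // csv => String[] directly; parquet/orc adapters would convert their
    // SimpleGroup / Object rows into the same String[] shape.
    static final class CsvRowReader implements RowReader {
        private final BufferedReader in;
        CsvRowReader(Reader source) { this.in = new BufferedReader(source); }
        public String[] next() throws IOException {
            String line = in.readLine();
            return line == null ? null : line.split(",", -1); // naive: no quoting rules
        }
        public void close() throws IOException { in.close(); }
    }

    public static void main(String[] args) throws Exception {
        try (RowReader r = new CsvRowReader(new StringReader("id,name\n1,alice"))) {
            for (String[] row; (row = r.next()) != null; )
                System.out.println(Arrays.toString(row));
        }
    }
}
```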
Writer
We cannot use SQLInsertRows to write multiple rows at a time, because when the batch is rejected on error we can't know which row failed to insert. So we must use SQLInsertRow. Here we also need to report status, e.g. latency, to let users know whether the import is working fine.
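A sketch of the per-row write loop; RowInserter is a hypothetical stand-in for the SDK's SQLInsertRow path, kept abstract so as not to misquote the real API:

```java
// Illustrative only: insert one row at a time so each failure is attributable,
// and surface running counters so users can see whether the import is healthy.
public class WriterSketch {
    // Hypothetical stand-in for the SDK: true on success, false on failure.
    interface RowInserter { boolean insert(String[] row); }

    static void write(Iterable<String[]> rows, RowInserter inserter) {
        long ok = 0, failed = 0;
        long start = System.nanoTime();
        for (String[] row : rows) {
            if (inserter.insert(row)) ok++; else failed++; // per-row status
            if ((ok + failed) % 100_000 == 0) {            // periodic progress report
                double secs = (System.nanoTime() - start) / 1e9;
                System.out.printf("%d ok, %d failed, %.0f rows/s%n",
                        ok, failed, (ok + failed) / secs);
            }
        }
        System.out.printf("done: %d ok, %d failed%n", ok, failed);
    }

    public static void main(String[] args) {
        write(java.util.List.of(new String[]{"1", "a"}, new String[]{"2", "b"}),
                row -> row.length == 2); // toy inserter: "succeed" on well-formed rows
    }
}
```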
Concurrency
We can split all the files into X parts, then create X readers and X writers, or just 1 or Y (< X) writers. One reader could handle several files, or even part of a file. Writers can use async requests to speed things up.
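A sketch of that split using a bounded queue for backpressure; X, Y, the part assignment, and the queue capacity are arbitrary illustrative choices:

```java
import java.util.List;
import java.util.concurrent.*;

// Illustrative only: X reader tasks feed a bounded queue, Y (< X) writer
// tasks drain it; backpressure comes from the queue capacity.
public class ConcurrencySketch {
    private static final String POISON = "\0"; // shutdown marker for writers

    public static void main(String[] args) throws Exception {
        List<List<String>> parts = List.of(          // files pre-split into X parts
                List.of("r1", "r2"), List.of("r3"), List.of("r4", "r5"));
        int x = parts.size(), y = 2;
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService readers = Executors.newFixedThreadPool(x);
        ExecutorService writers = Executors.newFixedThreadPool(y);

        for (int i = 0; i < y; i++)                  // writers: drain until poisoned
            writers.submit(() -> {
                for (String row; !(row = queue.take()).equals(POISON); )
                    System.out.println(Thread.currentThread().getName() + " wrote " + row);
                return null;
            });
        for (List<String> part : parts)              // readers: one part each
            readers.submit(() -> { for (String row : part) queue.put(row); return null; });

        readers.shutdown();
        readers.awaitTermination(1, TimeUnit.MINUTES);
        for (int i = 0; i < y; i++) queue.put(POISON); // one pill per writer
        writers.shutdown();
        writers.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```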
Error handling
We do not guarantee the atomicity of the import operation. We only support retrying, and retries are not guaranteed to succeed. We return statistics for the import, e.g. X rows succeeded / Y rows failed; users should handle the failures by themselves.
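A sketch of the retry-then-report behavior; maxRetries and the result format are illustrative choices, not a spec:

```java
// Illustrative only: retry each failed row a bounded number of times, then
// report totals; no atomicity -- already-inserted rows stay inserted.
public class RetrySketch {
    interface RowInserter { boolean insert(String[] row); }

    static String importAll(Iterable<String[]> rows, RowInserter inserter, int maxRetries) {
        long ok = 0, failed = 0;
        for (String[] row : rows) {
            boolean done = false;
            for (int attempt = 0; attempt <= maxRetries && !done; attempt++)
                done = inserter.insert(row);   // retry, but success is not guaranteed
            if (done) ok++; else failed++;
        }
        return ok + " rows succeeded / " + failed + " rows failed"; // user handles the rest
    }

    public static void main(String[] args) {
        System.out.println(importAll(
                java.util.List.of(new String[]{"good"}, new String[]{}),
                row -> row.length > 0, 2)); // toy inserter that always fails on empty rows
    }
}
```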
DataX
Further