Import HDFS data to OpenMLDB #83
-
About BulkLoad Way
First, why do we need a BulkLoad way? In our db's MemTable, the number of segments in one index is constant, and a MemTable doesn't need to do compaction, so OpenMLDB doesn't suffer the "growing pains" that make bulk load necessary elsewhere. All we want is a fast import.
Case 1 costs less than O(n log n), so for simplicity we only consider case 2. What do we save in case 2? We can load all the corresponding data into a skiplist in one step. So we can support a fast import path by adding a "multi put": a "multi put" on one segment reads multiple (unsorted) rows from a source and puts them into the skiplist in one step (see the sketch below). A serialized skiplist would be a further improvement, but it is not required.
Source
Talking about the source, we may have HDFS, Kafka, etc.
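To make the "multi put" above concrete, here is a minimal sketch, assuming a plain sorted linked list in place of the real two-level skiplist and a numeric key in place of a row; everything named here is illustrative, not OpenMLDB code. Sorting the batch once lets a single merge pass insert every row without restarting from the head.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative only: a plain sorted linked list stands in for one skiplist
// of a Segment; a long key stands in for a row. Not OpenMLDB code.
public class MultiPutSketch {
    static final class Node {
        final long key;
        Node next;
        Node(long key, Node next) { this.key = key; this.next = next; }
    }

    Node head; // ascending order, smallest key first

    // "multi put": sort the unsorted batch once, then merge it in a single
    // pass. The cursor never restarts from the head, so the whole batch
    // costs O(m log m + n) instead of O(m * n) for m independent puts.
    void multiPut(List<Long> batch) {
        List<Long> sorted = new ArrayList<>(batch);
        Collections.sort(sorted);
        Node prev = null, cur = head;
        for (long key : sorted) {
            while (cur != null && cur.key < key) { prev = cur; cur = cur.next; }
            Node n = new Node(key, cur);
            if (prev == null) head = n; else prev.next = n;
            prev = n;
        }
    }

    public static void main(String[] args) {
        MultiPutSketch segment = new MultiPutSketch();
        segment.multiPut(List.of(5L, 1L, 9L, 3L)); // unsorted rows from the source
        for (Node n = segment.head; n != null; n = n.next) System.out.print(n.key + " ");
        // prints: 1 3 5 9
    }
}
```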
-
BulkLoad Way Case 1
Let's consider how to handle Case 1. The focus of this comment is how the Index Region's two-level skiplist is serialized/deserialized.
skiplist backward
redis does the same in its implementation: by saving the data in reverse order, each insert during load stops at the head instead of walking down the skiplist, so the skiplist can be loaded efficiently without storing much extra information. Mapping this onto OpenMLDB, when we pack Segment data externally, we only need to pack it in the same backward order.
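To see why the backward order makes loading cheap, here is a minimal sketch; a plain sorted linked list again stands in for the skiplist, and the save/insert helpers are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative only: saving a sorted list back-to-front means every insert
// during reload lands at the head, so no search path, level info, or other
// extra structure has to be serialized.
public class BackwardLoadSketch {
    static final class Node {
        final long key;
        Node next;
        Node(long key, Node next) { this.key = key; this.next = next; }
    }

    // Ordinary sorted insert: O(1) when the key belongs at the head,
    // a full walk otherwise.
    static Node insert(Node head, long key) {
        if (head == null || key <= head.key) return new Node(key, head); // head branch
        Node cur = head;
        while (cur.next != null && cur.next.key < key) cur = cur.next;
        cur.next = new Node(key, cur.next);
        return head;
    }

    // "Serialize" from tail to head: push while walking forward, so iterating
    // the deque yields the keys largest-first.
    static Deque<Long> saveBackward(Node head) {
        Deque<Long> stack = new ArrayDeque<>();
        for (Node n = head; n != null; n = n.next) stack.push(n.key);
        return stack;
    }

    public static void main(String[] args) {
        Node head = null;
        for (long k : new long[]{1, 3, 5, 7}) head = insert(head, k);
        // Reload: keys arrive as 7, 5, 3, 1 -- every insert takes the head branch.
        Node reloaded = null;
        for (long k : saveBackward(head)) reloaded = insert(reloaded, k);
        for (Node n = reloaded; n != null; n = n.next) System.out.print(n.key + " ");
        // prints: 1 3 5 7
    }
}
```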
Since each row has a different ttl, rows are freed at different times and are not suitable to share one block of memory. So the data block carried by a bulk load request must be copied once, i.e. copied out into individual DataBlocks. Given that, we can use baidu_std's request_attachment to carry the data and avoid extra serialization: the attachment is an IOBuf, which shouldn't be held for long anyway, so copying it out to create the individual DataBlocks is the right thing to do.
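A rough sketch of that copy step, with Java stand-ins for the C++ types (a ByteBuffer plays the IOBuf attachment, a byte[] plays a DataBlock; the length-prefixed layout is an assumption made for illustration):

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: split one contiguous attachment into per-row copies,
// since rows with different ttls must own their memory independently.
public class AttachmentSplitSketch {
    // Assumed wire layout: [int32 rowLen][rowLen bytes], repeated.
    static List<byte[]> toDataBlocks(ByteBuffer attachment) {
        List<byte[]> blocks = new ArrayList<>();
        while (attachment.remaining() >= Integer.BYTES) {
            int rowLen = attachment.getInt();
            byte[] block = new byte[rowLen];   // each row gets its own allocation
            attachment.get(block);             // copy out of the shared buffer
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        for (String row : new String[]{"row-a", "row-bb"}) {
            byte[] b = row.getBytes();
            buf.putInt(b.length).put(b);
        }
        buf.flip();
        toDataBlocks(buf).forEach(b -> System.out.println(new String(b)));
        // prints: row-a, then row-bb
    }
}
```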
-
BulkLoad Way Case 1 POC
Server Side
Let's first focus on the changes OpenMLDB needs to make. The key point is what the tablet server needs to support.
For simplicity, the BulkLoad rpc is defined to be sent directly by the importer.
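A sketch of the importer side under that definition; TabletStub, the request shape, and hash partitioning are all hypothetical stand-ins, since the actual rpc definition isn't reproduced in this thread:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: the importer groups rows by partition and sends one
// BulkLoad request per tablet itself, bypassing the normal insert path.
public class ImporterSketch {
    // Hypothetical stand-in for the tablet server's BulkLoad rpc endpoint.
    interface TabletStub {
        void bulkLoad(int partitionId, List<String> encodedRows);
    }

    static void bulkLoad(List<String> rows, int partitions, TabletStub stub) {
        Map<Integer, List<String>> byPartition = new HashMap<>();
        for (String row : rows) {
            int pid = Math.floorMod(row.hashCode(), partitions); // assumed hash partitioning
            byPartition.computeIfAbsent(pid, k -> new ArrayList<>()).add(row);
        }
        byPartition.forEach(stub::bulkLoad); // one request per partition, no per-row rpc
    }

    public static void main(String[] args) {
        bulkLoad(List.of("a", "b", "c", "d"), 2,
                (pid, part) -> System.out.println("partition " + pid + " <- " + part));
    }
}
```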
-
Bulk Load Way Case 1 Test
The test set is NYCTaxiTripDuration train.csv, 192 MB, about 1.4M rows.
insert mode
bulk load mode (including binlog)
1 partition
- client total: 42790 ms
- real generate cost: 23515 ms
8 partitions - 8 threads, 1 thread per MemTable, each sending 1 bulk load request
- client total: 21978 ms
- 8 MemTable detail: real generate cost per MemTable: 3902 ms
-
Bulk Load Way Case 1 Test - Size limit version
default size limit: 32 MB
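Presumably the importer then has to flush a part before the accumulated rows exceed the limit; a minimal sketch of that chunking (the demo limit and byte counting are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: accumulate rows into one request until adding the next
// row would cross the size limit, then flush and start a new request.
public class SizeLimitSketch {
    interface Sender { void send(List<byte[]> part); }

    static void sendChunked(Iterable<byte[]> rows, long sizeLimit, Sender sender) {
        List<byte[]> part = new ArrayList<>();
        long bytes = 0;
        for (byte[] row : rows) {
            if (!part.isEmpty() && bytes + row.length > sizeLimit) {
                sender.send(part);        // flush before crossing the limit
                part = new ArrayList<>();
                bytes = 0;
            }
            part.add(row);
            bytes += row.length;
        }
        if (!part.isEmpty()) sender.send(part); // flush the tail
    }

    public static void main(String[] args) {
        List<byte[]> rows = List.of(new byte[6], new byte[6], new byte[6]);
        // With a 10-byte demo limit, each 6-byte row ends up in its own request.
        sendChunked(rows, 10, part -> System.out.println("request with " + part.size() + " row(s)"));
    }
}
```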
-
Overview
Related issue #75
We may have two ways to import big data into OpenMLDB.
SDK Way
Standalone Tool
Needs:
Reader
Need to support csv/parquet/orc.
After parsing,
csv => String[]
parquet => SimpleGroup
orc => Object
So we need a wrapper that exposes a uniform interface to the outside, and each format implements it.
Reader could support reading more than one file; a minimal sketch of the wrapper follows.
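One way such a wrapper could look; RowReader and the naive csv splitting below are assumptions, and the parquet/orc implementations would adapt SimpleGroup / Object rows to the same shape:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;

// Illustrative only: one uniform row interface; each format implements it.
public class ReaderSketch {
    interface RowReader extends AutoCloseable {
        String[] next() throws IOException; // null when the source is exhausted
    }

    // csv => String[] directly; parquet/orc adapters would convert their
    // SimpleGroup / Object rows into the same String[] shape.
    static final class CsvRowReader implements RowReader {
        private final BufferedReader in;
        CsvRowReader(Reader source) { this.in = new BufferedReader(source); }
        public String[] next() throws IOException {
            String line = in.readLine();
            return line == null ? null : line.split(",", -1); // naive: no quoting rules
        }
        public void close() throws IOException { in.close(); }
    }

    public static void main(String[] args) throws Exception {
        try (RowReader r = new CsvRowReader(new StringReader("id,name\n1,alice"))) {
            for (String[] row; (row = r.next()) != null; )
                System.out.println(Arrays.toString(row));
        }
    }
}
```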
Writer
We cannot use SQLInsertRows to write multiple rows at a time, because when the batch is rejected on error we can't know which row failed to insert. So we must use SQLInsertRow. Here we also need to report status, e.g. latency, to let users know whether the import is working fine.
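A sketch of the per-row write loop; RowInserter is a hypothetical stand-in for the SDK's SQLInsertRow path, kept abstract so as not to misquote the real API:

```java
// Illustrative only: insert one row at a time so each failure is attributable,
// and surface running counters so users can see whether the import is healthy.
public class WriterSketch {
    // Hypothetical stand-in for the SDK: true on success, false on failure.
    interface RowInserter { boolean insert(String[] row); }

    static void write(Iterable<String[]> rows, RowInserter inserter) {
        long ok = 0, failed = 0;
        long start = System.nanoTime();
        for (String[] row : rows) {
            if (inserter.insert(row)) ok++; else failed++; // per-row status
            if ((ok + failed) % 100_000 == 0) {            // periodic progress report
                double secs = (System.nanoTime() - start) / 1e9;
                System.out.printf("%d ok, %d failed, %.0f rows/s%n",
                        ok, failed, (ok + failed) / secs);
            }
        }
        System.out.printf("done: %d ok, %d failed%n", ok, failed);
    }

    public static void main(String[] args) {
        write(java.util.List.of(new String[]{"1", "a"}, new String[]{"2", "b"}),
                row -> row.length == 2); // toy inserter: "succeed" on well-formed rows
    }
}
```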
Concurrency
We can split all the files into X parts, then create X readers and X writers, or just 1 or Y (< X) writers. One reader could handle several files, or even part of a file. Writers can use async requests to speed things up.
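A sketch of that split using a bounded queue for backpressure; X, Y, the part assignment, and the queue capacity are arbitrary illustrative choices:

```java
import java.util.List;
import java.util.concurrent.*;

// Illustrative only: X reader tasks feed a bounded queue, Y (< X) writer
// tasks drain it; backpressure comes from the queue capacity.
public class ConcurrencySketch {
    private static final String POISON = "\0"; // shutdown marker for writers

    public static void main(String[] args) throws Exception {
        List<List<String>> parts = List.of(          // files pre-split into X parts
                List.of("r1", "r2"), List.of("r3"), List.of("r4", "r5"));
        int x = parts.size(), y = 2;
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        ExecutorService readers = Executors.newFixedThreadPool(x);
        ExecutorService writers = Executors.newFixedThreadPool(y);

        for (int i = 0; i < y; i++)                  // writers: drain until poisoned
            writers.submit(() -> {
                for (String row; !(row = queue.take()).equals(POISON); )
                    System.out.println(Thread.currentThread().getName() + " wrote " + row);
                return null;
            });
        for (List<String> part : parts)              // readers: one part each
            readers.submit(() -> { for (String row : part) queue.put(row); return null; });

        readers.shutdown();
        readers.awaitTermination(1, TimeUnit.MINUTES);
        for (int i = 0; i < y; i++) queue.put(POISON); // one pill per writer
        writers.shutdown();
        writers.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```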
Error handling
We do not guarantee the atomicity of the import operation. We only support retrying, and retries are not guaranteed to succeed. We return statistics for the import, e.g. X rows succeeded / Y rows failed; users should handle the failures by themselves.
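A sketch of the retry-then-report behavior; maxRetries and the result format are illustrative choices, not a spec:

```java
// Illustrative only: retry each failed row a bounded number of times, then
// report totals; no atomicity -- already-inserted rows stay inserted.
public class RetrySketch {
    interface RowInserter { boolean insert(String[] row); }

    static String importAll(Iterable<String[]> rows, RowInserter inserter, int maxRetries) {
        long ok = 0, failed = 0;
        for (String[] row : rows) {
            boolean done = false;
            for (int attempt = 0; attempt <= maxRetries && !done; attempt++)
                done = inserter.insert(row);   // retry, but success is not guaranteed
            if (done) ok++; else failed++;
        }
        return ok + " rows succeeded / " + failed + " rows failed"; // user handles the rest
    }

    public static void main(String[] args) {
        System.out.println(importAll(
                java.util.List.of(new String[]{"good"}, new String[]{}),
                row -> row.length > 0, 2)); // toy inserter that always fails on empty rows
    }
}
```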
DataX
Further