Skip to content

Commit

Permalink
docs: flow dev guide (#959)
Browse files Browse the repository at this point in the history
Co-authored-by: Yiran <cuiyiran3@gmail.com>
  • Loading branch information
discord9 and nicecui authored May 16, 2024
1 parent f5d1f07 commit 2f44146
Show file tree
Hide file tree
Showing 7 changed files with 90 additions and 0 deletions.
15 changes: 15 additions & 0 deletions docs/nightly/en/contributor-guide/flownode/arrangement.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
# Arrangement

Arrangement stores the state in the dataflow's process. It stores the streams of update flows for further querying and updating.

The arrangement essentially stores key-value pairs with timestamps to mark their change time.

Internally, the arrangement receives tuples like
`((Key Row, Value Row), timestamp, diff)` and stores them in memory. One can query key-value pairs at a certain time using the `get(now: Timestamp, key: Row)` method.
The arrangement also assumes that everything older than a certain time (also known as the low watermark) has already been ingested to the sink tables and does not keep a history for them.

:::tip NOTE

The arrangement allows for the removal of keys by setting the `diff` to -1 in incoming tuples. Moreover, if a row has been previously added to the arrangement and the same key is inserted with a different value, the original value is overwritten with the new value.

:::
12 changes: 12 additions & 0 deletions docs/nightly/en/contributor-guide/flownode/dataflow.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
# Dataflow

The `dataflow` module (see `flow::compute` module) is the core computing module of `flow`.
It takes a SQL query and transforms it into flow's internal execution plan.
This execution plan is then rendered into an actual dataflow, which is essentially a directed acyclic graph (DAG) of functions with input and output ports.
The dataflow is triggered to run when needed.

Currently, this dataflow only supports `map` and `reduce` operations. Support for `join` operations will be added in the future.

Internally, the dataflow handles data in row format, using a tuple `(row, time, diff)`. Here, `row` represents the actual data being passed, which may contain multiple `Value` objects.
`time` is the system time which tracks the progress of the dataflow, and `diff` typically represents the insertion or deletion of the row (+1 or -1).
Therefore, the tuple represents the insert/delete operation of the `row` at a given system `time`.
17 changes: 17 additions & 0 deletions docs/nightly/en/contributor-guide/flownode/overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Overview

## Introduction


`Flownode` provides a simple streaming process (known as `flow`) ability to the database.
`Flownode` manages `flows` which are tasks that receive data from the `source` and send data to the `sink`.

In current version, `Flownode` only supports standalone mode. In the future, we will support distributed mode.

## Components

A `Flownode` contains all the components needed for the streaming process of a flow. Here we list the vital parts:

- A `FlownodeManager` for receiving inserts forwarded from the `Frontend` and sending back results for the flow's sink table.
- A certain number of `FlowWorker` instances, each running in a separate thread. Currently for standalone mode, there is only one flow worker, but this may change in the future.
- A `Flow` is a task that actively receives data from the `source` and sends data to the `sink`. It is managed by the `FlownodeManager` and run by a `FlowWorker`.
4 changes: 4 additions & 0 deletions docs/nightly/en/summary.yml
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,10 @@
- admin-api
- dist-lock
- selector
- Flownode:
- overview
- dataflow
- arrangement
- Tests:
- overview
- unit-test
Expand Down
13 changes: 13 additions & 0 deletions docs/nightly/zh/contributor-guide/flownode/arrangement.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Arrangement

Arrangement 存储数据流进程中的状态,存储 flow 的更新流(stream)以供进一步查询和更新。

Arrangement 本质上存储的是带有时间戳的键值对。
在内部,Arrangement 接收类似 `((Key Row, Value Row), timestamp, diff)` 的 tuple,并将其存储在内存中。
你可以使用 `get(now: Timestamp, key: Row)` 查询某个时间的键值对。
Arrangement 假定早于某个时间(也称为 Low Watermark)的所有内容都已被写入到 sink 表中,不会为其保留历史记录。

:::tip 注意
Arrangement 允许通过将传入 tuple 的 `diff` 设置为 -1 来删除键。
此外,如果已将行数据添加到 Arrangement 并且使用不同的值插入相同的键,则原始值将被新值覆盖。
:::
13 changes: 13 additions & 0 deletions docs/nightly/zh/contributor-guide/flownode/dataflow.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# 数据流

Dataflow 模块(参见 `flow::compute` 模块)是 `flow` 的核心计算模块。
它接收 SQL 查询并将其转换为 `flow` 的内部执行计划。
然后,该执行计划被转化为实际的数据流,而数据流本质上是一个由带有输入和输出端口的函数组成的有向无环图(DAG)。
数据流会在需要时被触发运行。

目前该数据流只支持 `map``reduce` 操作,未来将添加对 `join` 等操作的支持。

在内部,数据流使用 `tuple(row, time, diff)` 以行格式处理数据。
这里 `row` 表示实际传递的数据,可能包含多个 `value` 对象。
`time` 是系统时间,用于跟踪数据流的进度,`diff` 通常表示行的插入或删除(+1 或 -1)。
因此,`tuple` 表示给定系统时间的 `row` 的插入/删除操作。
16 changes: 16 additions & 0 deletions docs/nightly/zh/contributor-guide/flownode/overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# 概述

## 简介

`Flownode` 为数据库提供了一种简单的流处理(称为 `flow`)能力。
`Flownode` 管理 `flow`,这些 `flow` 是从 `source` 接收数据并将数据发送到 `sink` 的任务。

在当前版本中,`Flownode` 仅在单机模式中支持,未来将支持分布式模式。

## 组件

`Flownode` 包含了 flow 流式处理的所有组件,以下是关键部分:

- `FlownodeManager`:用于接收从 `Frontend` 转发的插入数据并将结果发送回 flow 的 sink 表。
- 一定数量的 `FlowWorker` 实例,每个实例在单独的线程中运行。当前在单机模式中只有一个 flow worker,但这可能会在未来发生变化。
- `Flow` 是一个主动从 `source` 接收数据并将数据发送到 `sink` 的任务。由 `FlownodeManager` 管理并由 `FlowWorker` 运行。

0 comments on commit 2f44146

Please sign in to comment.