diff --git a/docs/nightly/en/summary.yml b/docs/nightly/en/summary.yml index 4e9625fbb..9d50987ad 100644 --- a/docs/nightly/en/summary.yml +++ b/docs/nightly/en/summary.yml @@ -52,6 +52,12 @@ - sql - promql - query-external-data + - Continuous-Aggregation: + - overview + - manage-flow + - query + - define-time-window + - expression - Client-Libraries: - overview - go diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md new file mode 100644 index 000000000..8400b190f --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -0,0 +1,40 @@ +# Define Time Window + +The time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. + +A time window corresponds to a range of time. Data from the source table is mapped to the corresponding window based on the time index column. The time window is also the scope of one calculation of an aggregation expression, so each time window results in one row in the result table. + +GreptimeDB provides two types of time windows: `hop` and `tumble`, also known as the "sliding window" and the "fixed window". You can specify the time window in the `GROUP BY` clause using the `hop()` or `tumble()` function respectively. These two functions are only supported in the `GROUP BY` clause of continuous aggregation queries. + +The following figure illustrates how the `hop()` and `tumble()` functions work: + +![Time Window](/time-window.svg) + +## Tumble + +`tumble()` defines fixed windows that do not overlap. + +``` +tumble(col, interval, start_time) +``` + +- `col` specifies which column to use to compute the time window. The provided column must have a timestamp type. +- `interval` specifies the size of the window. The `tumble` function divides the time column into fixed-size windows and aggregates the data in each window.
+- `start_time` specifies the start time of the first window. + + +## Hop (not supported yet) + +`hop` defines sliding windows that move forward by a fixed interval. This feature is not supported yet and is expected to be available in the near future. + + diff --git a/docs/nightly/en/user-guide/continuous-aggregation/expression.md b/docs/nightly/en/user-guide/continuous-aggregation/expression.md new file mode 100644 index 000000000..81b8b5ced --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/expression.md @@ -0,0 +1,9 @@ +# Expression + +This part lists all supported aggregate functions. + +- `count(column)`: count the number of rows. +- `sum(column)`: sum the values of the column. +- `avg(column)`: calculate the average value of the column. +- `min(column)`: find the minimum value of the column. +- `max(column)`: find the maximum value of the column. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md new file mode 100644 index 000000000..d8e486a62 --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -0,0 +1,80 @@ +# Manage Flows + +Each `flow` is a continuous aggregation query in GreptimeDB. +It continuously updates the aggregated data based on the incoming data. +This document describes how to create, update, and delete a flow. + +A `flow` has the following attributes: +- `name`: the name of the flow. It's a unique identifier at the catalog level. +- `source tables`: the tables that provide data for the flow. Each flow can have multiple source tables. +- `sink table`: the table that stores the materialized aggregated data. +- `comment`: the description of the flow. +- `SQL`: the continuous aggregation query that defines the flow. Refer to [Expression](./expression.md) for the available expressions.
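The attributes above map directly onto the clauses of a `CREATE FLOW` statement. As a rough sketch (the flow, table, and column names below are invented for illustration):

```sql
-- name: the flow's unique identifier within the catalog
CREATE FLOW IF NOT EXISTS temp_monitoring
-- sink table: receives the materialized aggregated data
OUTPUT TO temp_alerts
-- comment: a human-readable description of the flow
COMMENT = "Max temperature per 10-minute window"
AS
-- SQL: the continuous aggregation query; temp_sensor_data is the single source table
SELECT max(temperature)
FROM temp_sensor_data
GROUP BY tumble(ts, INTERVAL '10 minutes', '2024-01-01 00:00:00');
```

Here `temp_monitoring` is the `name`, `temp_alerts` the `sink table`, and `temp_sensor_data` the only `source table`; the `SQL` part is everything after `AS`.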
+ +## Create or update a flow + +The grammar to create a flow is: + + + +```sql +CREATE FLOW [ IF NOT EXISTS ] <flow-name> +OUTPUT TO <sink-table-name> +[ EXPIRE AFTER <interval> ] +[ COMMENT = "<comment>" ] +AS +<SQL>; +``` + + + +`sink-table-name` is the name of the table that stores the materialized aggregated data. It can be an existing table or a new one; the flow will create the sink table if it doesn't exist. If the table already exists, its schema must match the schema of the query result. + + + +The `SQL` part defines the continuous aggregation query. Refer to [Write a Query](./query.md) for the details. Generally speaking, the `SQL` part is just like a normal `SELECT` statement with a few differences. + +A simple example to create a flow: + + + +```sql +CREATE FLOW IF NOT EXISTS my_flow +OUTPUT TO my_sink_table +COMMENT = "My first flow in GreptimeDB" +AS +SELECT count(item) FROM my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes', '2024-05-20 00:00:00'); +``` + +The created flow computes `count(item)` for every 5-minute window and stores the result in `my_sink_table`. For the `tumble()` function, refer to the [Define Time Window](./define-time-window.md) section. + + + +## Delete a flow + +To delete a flow, use the following `DROP FLOW` clause: + +```sql +DROP FLOW [IF EXISTS] <flow-name> +``` + +For example: + +```sql +DROP FLOW IF EXISTS my_flow; +``` diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md new file mode 100644 index 000000000..abd27e4f6 --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -0,0 +1,106 @@ +# Overview + +GreptimeDB provides a continuous aggregation feature that allows you to aggregate data in real-time. This feature is useful when you need to calculate and query the sum, average, or other aggregations on the fly. The continuous aggregation feature is provided by the Flow engine. It continuously updates the aggregated data based on the incoming data and materializes it.
+When you insert data into the source table, the data is also sent to and stored in the Flow engine. +The Flow engine calculates the aggregation over time windows and stores the result in the sink table. +The entire process is illustrated in the following image: + +![Continuous Aggregation](/flow-ani.svg) + +## Quick start with an example + +Here is a complete example of what a continuous aggregation query looks like. + +First, create a source table `numbers_input` and a sink table `out_num_cnt` with the following clauses: + +```sql +CREATE TABLE numbers_input ( + number INT, + ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY(number), + TIME INDEX(ts) +); +``` + +```sql +CREATE TABLE out_num_cnt ( + sum_number BIGINT, + start_window TIMESTAMP TIME INDEX, + end_window TIMESTAMP, + update_at TIMESTAMP +); +``` + +Then create the flow `test_numbers` to aggregate the sum of the `number` column in the `numbers_input` table. The aggregation is calculated in 1-second fixed windows. + +```sql +CREATE FLOW test_numbers +SINK TO out_num_cnt +AS +SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second', '2021-07-01 00:00:00'); +``` + +To observe the outcome of the continuous aggregation in the `out_num_cnt` table, insert some data into the source table `numbers_input`.
+```sql +INSERT INTO numbers_input +VALUES + (20, "2021-07-01 00:00:00.200"), + (22, "2021-07-01 00:00:00.600"); +``` + +The sum of the `number` column is 42 (20+22), so the sink table `out_num_cnt` should have the following data: + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 +(1 row) +``` + +Try to insert more data into the `numbers_input` table: + +```sql +INSERT INTO numbers_input +VALUES + (23,"2021-07-01 00:00:01.000"), + (24,"2021-07-01 00:00:01.500"); +``` + +The sink table `out_num_cnt` now contains two rows, representing the sums 42 and 47 (23+24) for the two respective 1-second windows. + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 + 47 | 2021-07-01 00:00:01.000000 | 2021-07-01 00:00:02.000000 | 2024-05-17 08:33:10.048000 +(2 rows) +``` + +Here is the explanation of the columns in the `out_num_cnt` table: + +- `sum_number`: the sum of the `number` column in the window. +- `start_window`: the start time of the window. +- `end_window`: the end time of the window. +- `update_at`: the time when the row data is updated. + +The `start_window`, `end_window`, and `update_at` columns are automatically added by the time window functions of the Flow engine. + +## Next Steps + +Congratulations! You now have a preliminary understanding of the continuous aggregation feature. +Please refer to the following sections to learn more: + +- [Manage Flows](./manage-flow.md) describes how to create, update, and delete a flow.
Each of your continuous aggregation queries is a flow. +- [Write a Query](./query.md) describes how to write a continuous aggregation query. +- [Define Time Window](./define-time-window.md) describes how to define the time window for the continuous aggregation. The time window is an important attribute of your continuous aggregation query. It defines the time interval for the aggregation. +- [Expression](./expression.md) is a reference of the available expressions in the continuous aggregation query. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/query.md b/docs/nightly/en/user-guide/continuous-aggregation/query.md new file mode 100644 index 000000000..f941af1cd --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/query.md @@ -0,0 +1,19 @@ +# Write a Query + +This chapter describes how to write a continuous aggregation query in GreptimeDB. The query here should be a `SELECT` statement with either aggregate functions or non-aggregate functions (i.e., scalar functions). + +The grammar of the query is like the following: + +```sql +SELECT AGGR_FUNCTION(column1, column2,..) FROM <source_table> GROUP BY TIME_WINDOW_FUNCTION(); +``` + +Only two kinds of expressions are allowed after the `SELECT` keyword: +- Aggregate functions: see the reference in [Expression](./expression.md) for details. +- Scalar functions: like `col`, `to_lowercase(col)`, `col + 1`, etc. This part is the same as the normal `SELECT` clause in GreptimeDB. + +The query should have a `FROM` clause to identify the source table. As the join clause is currently not supported, the query can only aggregate columns from a single table. + +The `GROUP BY` clause works as in a normal query: it groups the data by the specified columns. One special case is the time window functions `hop()` and `tumble()` described in the [Define Time Window](./define-time-window.md) section. They are used in the `GROUP BY` clause to define the time window for the aggregation.
Other expressions in `GROUP BY` can be literals, columns, or scalar expressions. + +Other clauses such as `ORDER BY`, `LIMIT`, and `OFFSET` are not supported in continuous aggregation queries. diff --git a/docs/nightly/zh/summary-i18n.yml b/docs/nightly/zh/summary-i18n.yml index 0cb759fcd..97414a90b 100644 @@ -7,6 +7,7 @@ Clients: 客户端 Client-Libraries: 客户端库 Write-Data: 写入数据 Query-Data: 读取数据 +Continuous-Aggregation: 持续聚合 Python-Scripts: Python 脚本 Operations: 运维操作 Prometheus: Prometheus diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md new file mode 100644 index 000000000..ba56ab259 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md @@ -0,0 +1,46 @@ +# 定义时间窗口 + +时间窗口是连续聚合查询的重要属性。 +它定义了数据在流中的聚合方式。 + +时间窗口对应于时间范围。 +source 表中的数据将根据时间索引列映射到相应的窗口。 +时间窗口也是聚合表达式计算的范围,因此每个时间窗口将在结果表中生成一行。 + +GreptimeDB 提供两种时间窗口类型:`hop` 和 `tumble`,或者换句话说是滑动窗口和固定窗口。 +你可以在 `GROUP BY` 子句中使用 `hop()` 函数或 `tumble()` 函数指定时间窗口。 +这两个函数仅支持在连续聚合查询的 `GROUP BY` 位置使用。 + +下图展示了 `hop()` 和 `tumble()` 函数的工作方式: + +![Time Window](/time-window.svg) + +## Tumble + +`tumble()` 定义固定窗口,窗口之间不重叠。 + +``` +tumble(col, interval, start_time) +``` + +- `col` 指定使用哪一列计算时间窗口。提供的列必须是时间戳类型。 +- `interval` 指定窗口的大小。`tumble` 函数将时间列划分为固定大小的窗口,并在每个窗口中聚合数据。 +- `start_time` 用于指定第一个窗口的开始时间。 + + +## Hop(尚未支持) + +`hop` 定义滑动窗口,窗口按固定间隔向前移动。 +此功能尚未支持,预计将在不久的将来提供。 + + diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/expression.md b/docs/nightly/zh/user-guide/continuous-aggregation/expression.md new file mode 100644 index 000000000..e5a76ec67 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/expression.md @@ -0,0 +1,9 @@ +# 表达式 + +此处列出了所有支持的聚合函数。 + +- `count(column)`: 行数。 +- `sum(column)`: 列的和。 +- `avg(column)`: 列的平均值。 +- `min(column)`: 列的最小值。 +- `max(column)`: 列的最大值。 diff --git
a/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md new file mode 100644 index 000000000..8c77b6ab5 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md @@ -0,0 +1,87 @@ +# 管理 Flows + +每一个 `flow` 是 GreptimeDB 中的一个连续聚合查询。 +它根据传入的数据持续更新并聚合数据。 +本文档描述了如何创建、更新和删除一个 flow。 + +一个 `flow` 有以下属性: +- `name`: flow 的名称。在目录级别中是唯一的标识符。 +- `source tables`: 为 flow 提供数据的表。每个 flow 可以有多个 source 表。 +- `sink table`: 存储聚合数据的结果表。 +- `comment`: flow 的描述。 +- `SQL`: 定义 flow 的连续聚合查询。有关可用表达式,请参阅 [表达式](./expression.md)。 + +## 创建或更新 flow + +创建 flow 的语法是: + + + +```sql +CREATE FLOW [ IF NOT EXISTS ] <flow-name> +OUTPUT TO <sink-table-name> +[ EXPIRE AFTER <interval> ] +[ COMMENT = "<comment>" ] +AS +<SQL>; +``` + + + +`sink-table-name` 是存储聚合数据的表名。 +它可以是一个现有表或一个新表,`flow` 会在目标表不存在时创建它。 +但如果表已经存在,表的 schema 必须与查询结果的 schema 匹配。 + + + +`SQL` 部分定义了连续聚合查询。 +有关详细信息,请参阅 [编写查询](./query.md) 部分。 +一般来说,`SQL` 部分就像一个普通的 `SELECT` 子句,只是有一些不同。 + +一个创建 flow 的简单示例: + + + +```sql +CREATE FLOW IF NOT EXISTS my_flow +OUTPUT TO my_sink_table +COMMENT = "My first flow in GreptimeDB" +AS +SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes', '2024-05-20 00:00:00'); +``` + +该 Flow 将每 5 分钟计算 `count(item)` 并将结果存储在 `my_sink_table` 中。 +有关 `tumble()` 函数,请参考[定义时间窗口](./define-time-window.md) 部分。 + + + +## 删除 flow + +请使用以下 `DROP FLOW` 子句删除 flow: + +```sql +DROP FLOW [IF EXISTS] <flow-name> +``` + +例如: + +```sql +DROP FLOW IF EXISTS my_flow; +``` diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/overview.md b/docs/nightly/zh/user-guide/continuous-aggregation/overview.md new file mode 100644 index 000000000..ff80864f4 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/overview.md @@ -0,0 +1,110 @@ +# 概述 + +GreptimeDB 提供的连续聚合功能允许你实时聚合数据。 +当你需要实时计算和查询总和、平均值或其他聚合时,此功能非常有用。 +连续聚合功能由 Flow 引擎提供。 +它根据传入的数据不断更新聚合数据。 + +当你将数据插入 source 表时,数据也会被发送到 Flow 引擎并存储在其中。 +Flow
引擎通过时间窗口计算聚合并将结果存储在目标表中。 +整个过程如下图所示: + +![Continuous Aggregation](/flow-ani.svg) + +## 快速开始示例 + +以下是连续聚合查询的一个完整示例。 + +首先,使用以下语句创建一个 source 表 `numbers_input` 和一个 sink 表 `out_num_cnt`: + +```sql +CREATE TABLE numbers_input ( + number INT, + ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY(number), + TIME INDEX(ts) +); +``` + +```sql +CREATE TABLE out_num_cnt ( + sum_number BIGINT, + start_window TIMESTAMP TIME INDEX, + end_window TIMESTAMP, + update_at TIMESTAMP +); +``` + +然后创建名为 `test_numbers` 的 flow 来聚合 `numbers_input` 表中 `number` 列的总和,并在 1 秒固定窗口中聚合计算数据。 + +```sql +CREATE FLOW test_numbers +SINK TO out_num_cnt +AS +SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second', '2021-07-01 00:00:00'); +``` + +要观察 `out_num_cnt` 表中连续聚合的结果,向 source 表 `numbers_input` 插入一些数据。 + +```sql +INSERT INTO numbers_input +VALUES + (20, "2021-07-01 00:00:00.200"), + (22, "2021-07-01 00:00:00.600"); +``` + +`number` 列的总和为 42 (20+22),因此 sink 表 `out_num_cnt` 应该包含以下数据: + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 +(1 row) +``` + +尝试向 `numbers_input` 表中插入更多数据: + +```sql +INSERT INTO numbers_input +VALUES + (23,"2021-07-01 00:00:01.000"), + (24,"2021-07-01 00:00:01.500"); +``` + +sink 表 `out_num_cnt` 现在包含两行:分别表示两个 1 秒窗口的数据之和 42 和 47 (23+24)。 + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 + 47 | 2021-07-01 00:00:01.000000 | 2021-07-01 00:00:02.000000 | 2024-05-17 08:33:10.048000 +(2 rows) +``` + +`out_num_cnt` 表中的列解释如下: + +- `sum_number`:窗口中 `number` 列的总和。 +-
`start_window`:窗口的开始时间。 +- `end_window`:窗口的结束时间。 +- `update_at`:更新行数据的时间。 + +其中 `start_window`、`end_window` 和 `update_at` 列是 Flow 引擎的时间窗口函数自动添加的。 + +## 下一步 + +恭喜你已经初步了解了连续聚合功能。 +请参考以下章节了解更多: + +- [管理 Flow](./manage-flow.md) 描述了如何创建、更新和删除 flow。你的每个连续聚合查询都是一个 flow。 +- [编写查询语句](./query.md) 描述了如何编写连续聚合查询。 +- [定义时间窗口](./define-time-window.md) 描述了如何为连续聚合定义时间窗口。时间窗口是连续聚合查询的一个重要属性,它定义了聚合的时间间隔。 +- [表达式](./expression.md) 是连续聚合查询中可用表达式的参考。 + diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/query.md b/docs/nightly/zh/user-guide/continuous-aggregation/query.md new file mode 100644 index 000000000..5bd129b38 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/query.md @@ -0,0 +1,24 @@ +# 编写查询语句 + +本章节描述如何在 GreptimeDB 中编写连续聚合查询。 +查询语句应该是一个带有聚合函数或非聚合函数(即 scalar 函数)的 `SELECT` 语句。 + +查询的语法如下: + +```sql +SELECT AGGR_FUNCTION(column1, column2,..) FROM <source_table> GROUP BY TIME_WINDOW_FUNCTION(); +``` + +`SELECT` 关键字后只允许两种表达式: +- 聚合函数:详细信息请参考 [表达式](./expression.md) 部分。 +- 标量函数:如 `col`、`to_lowercase(col)`、`col + 1` 等。这部分与 GreptimeDB 中的普通 `SELECT` 子句相同。 + +查询应该有一个 `FROM` 子句来标识 source 表。由于不支持 join 子句,目前只能从单个表中聚合列。 + +`GROUP BY` 子句与普通查询中的工作方式相同。 +它根据指定的列对数据进行分组。 +`GROUP BY` 子句中使用的时间窗口函数 `hop()` 和 `tumble()` 在 [定义时间窗口](./define-time-window.md) 部分中有描述。 +它们用于在聚合中定义时间窗口。 +`GROUP BY` 中的其他表达式可以是 literal、列名或 scalar 表达式。 + +连续聚合查询不支持 `ORDER BY`、`LIMIT`、`OFFSET` 等其他操作。 diff --git a/docs/public/flow-ani.svg b/docs/public/flow-ani.svg new file mode 100644 index 000000000..77c3832e9 --- /dev/null +++ b/docs/public/flow-ani.svg @@ -0,0 +1,3 @@ + + +
[SVG markup stripped during extraction; diagram text labels: "Input Data", "Source Table", "Flow Engine", "Sink Table", "Sum(a)", "Max(b)", "..."]
\ No newline at end of file diff --git a/docs/public/time-window.svg b/docs/public/time-window.svg new file mode 100644 index 000000000..8040fc854 --- /dev/null +++ b/docs/public/time-window.svg @@ -0,0 +1,15 @@ [SVG markup stripped during extraction; diagram text labels: "hop", "tumble"]