From efb3145671d31f94b9d8a3fbc3be5d9d7534d2ef Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Tue, 14 May 2024 19:27:33 +0800 Subject: [PATCH 01/20] docs: add flow user guide Signed-off-by: Ruihang Xia --- docs/nightly/en/summary.yml | 6 +++ .../define-time-window.md | 29 +++++++++++ .../continuous-aggregation/expression.md | 9 ++++ .../continuous-aggregation/manage-flow.md | 51 +++++++++++++++++++ .../continuous-aggregation/overview.md | 8 +++ .../continuous-aggregation/query.md | 18 +++++++ 6 files changed, 121 insertions(+) create mode 100644 docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md create mode 100644 docs/nightly/en/user-guide/continuous-aggregation/expression.md create mode 100644 docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md create mode 100644 docs/nightly/en/user-guide/continuous-aggregation/overview.md create mode 100644 docs/nightly/en/user-guide/continuous-aggregation/query.md diff --git a/docs/nightly/en/summary.yml b/docs/nightly/en/summary.yml index c7352d020..c457ee913 100644 --- a/docs/nightly/en/summary.yml +++ b/docs/nightly/en/summary.yml @@ -52,6 +52,12 @@ - sql - promql - query-external-data + - Continuous-Aggregation: + - overview + - manage-flow + - define-time-window + - query + - expression - Client-Libraries: - overview - go diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md new file mode 100644 index 000000000..5f13d2838 --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -0,0 +1,29 @@ +# Define Time Window + +Time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. GreptimeDB provides two types of time windows: `hop` and `tumble`, or "sliding window" and "fixed window" in other words. You can specify the time window in the `GROUP BY` clause using `hop()` function or `tumble()` function respectively. + +# Tumble + +`tumble()` defines fixed windows that do not overlap. It signature is like the following: + +``` +tumble(col, interval, ) +``` + +Where `col` specifies use which column to compute the time window. The provided column must have a timestamp type. `interval` specifies the size of the window. The `tumble` function divides the time column into fixed-size windows and aggregates the data in each window. + +`start_time` is an optional parameter to specify the start time of the first window. If not provided, the start time will be aligned to calender. + +# Hop + +`hop` defines sliding window that moves forward by a fixed interval. It signature is like the following: + +``` +hop(col, size_interval, hop_interval, ) +``` + +Where `col` specifies use which column to compute the time window. The provided column must have a timestamp type. + +`size_interval` specifies the size of each window, while `hop_interval` specifies the delta interval between windows' start timestamp. You can think the `tumble()` function as a special case of `hop()` function where the `size_interval` and `hop_interval` are the same. + +`start_time` is an optional parameter to specify the start time of the first window. If not provided, the start time will be aligned to calender. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/expression.md b/docs/nightly/en/user-guide/continuous-aggregation/expression.md new file mode 100644 index 000000000..81b8b5ced --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/expression.md @@ -0,0 +1,9 @@ +# Expression + +This part list all supported aggregate functions. + +- `count(column)`: count the number of rows. +- `sum(column)`: sum the values of the column. +- `avg(column)`: calculate the average value of the column. +- `min(column)`: find the minimum value of the column. +- `max(column)`: find the maximum value of the column. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md new file mode 100644 index 000000000..0b8e27507 --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -0,0 +1,51 @@ +# Manage Flow + +Each `flow` is a continuous aggregation query in GreptimeDB. It is a query that continuously updates the aggregated data based on the incoming data and materializes the result. This document describes how to create, update, and delete a flow. + +A `flow` have those attributes: +- `name`: the name of the flow. It's an unique identifier in the catalog level. +- `source tables`: tables provide data for the flow. Each flow can have multiple source tables. +- `sink table`: the table to store the materialized aggregated data. +- `expire after`: the interval to expire the data from the source table. Data after the expiration time will not be used in the flow. +- `comment`: the description of the flow. +- `SQL`: the continuous aggregation query to define the flow. Refer to [Expression](./expression.md) for the available expressions. + +# Create flow or update flow + +The grammar to create a flow is: + +```sql +CREATE [ OR REPLACE ] FLOW [ IF NOT EXISTS ] +OUTPUT TO +[ EXPIRE AFTER ] +[ COMMENT = "" ] +AS +; +``` + +When `OR REPLACE` is specified, if a flow with the same name already exists, it will be updated to the new one. + +`sink-table-name` is the table name to store the materialized aggregated data. It can be an existing table or a new table, `flow` will create the sink table if it doesn't exist. But if the table already exists, the schema of the table must match the schema of the query result. + +`SQL` part defines the continuous aggregation query. Refer to [Query](./query.md) for the details. Generally speaking, the `SQL` part is just like a normal `SELECT` clause with a few difference. + +A simple example to create a flow: + +```sql +CREATE FLOW IF NOT EXISTS my_flow +OUTPUT TO my_sink_table +EXPIRE AFTER INTERVAL '1 hour' +COMMENT = "My first flow in GreptimeDB" +AS +SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes'); +``` + +The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. All data comes within 1 hour will be used in the flow. For the `tumble()` function, refer to [define time window](./define-time-window.md) part. + +# Delete flow + +To delete a flow, use the following `DROP FLOW` clause: + +```sql +DROP FLOW [IF EXISTS] +``` diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md new file mode 100644 index 000000000..3d079274b --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -0,0 +1,8 @@ +# Overview + +GreptimeDB provides a continuous aggregation feature that allows you to aggregate data in real-time. This feature is useful when you need to calculate and query the sum, average, or other aggregations on the fly. The continuous aggregation feature is provided by our `Flow` engine. It continuously updates the aggregated data based on the incoming data and materialize it. + +- [Manage Flow](./manage-flow.md) describes how to create, update, and delete a flow. Each of your continuous aggregation query is a flow. +- [Define time window](./define-time-window.md) describes how to define the time window for the continuous aggregation. Time window is an important attribute of your continuous aggregation query. It defines the time interval for the aggregation. +- [Query](./query.md) describes how to write a continuous aggregation query. +- [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/query.md b/docs/nightly/en/user-guide/continuous-aggregation/query.md new file mode 100644 index 000000000..5bc7e40b0 --- /dev/null +++ b/docs/nightly/en/user-guide/continuous-aggregation/query.md @@ -0,0 +1,18 @@ +# Query + +This chapter describes how to write a continuous aggregation query in GreptimeDB. Query here should be a `SELECT` statement with either aggregating functions or non-aggregating functions (i.e., scalar function). + +Only two kinds of expression are allowed after `SELECT` keyword: +- Aggregate functions: see the reference in [Expression](./expression.md) for detail. +- Scalar functions: like `col`, `to_lowercase(col)`, `col + 1`, etc. This part is the same as the normal `SELECT` clause in GreptimeDB. + +Then each query should have a `FROM` clause to specify the source table. The referenced source table should be one in the flow's source tables in `CREATE FLOW` clause. Join is currently not supported so each query can reference only one table. + +`GROUP BY` clause works as in a normal query. It groups the data by the specified columns. One special thing is the time window functions `hop()` and `tumble()` described in [Define Time Window](./define-time-window.md) part. They are used in the `GROUP BY` clause to define the time window for the aggregation. Other expressions in `GROUP BY` can be either literal, column or scalar expressions. + +Notice these two time window functions will add several columns to the output schema, you don't need to `SELECT` them. The columns are: +- `window_start`: the start time of the window. +- `window_end`: the end time of the window. +- `updated_at`: the time when the window is updated. + +Others things like `ORDER BY`, `LIMIT`, `OFFSET` are not supported in the continuous aggregation query. From 9de41ad96cf846701d8cca8ac25baef0efc6d182 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Wed, 15 May 2024 11:16:48 +0800 Subject: [PATCH 02/20] drop example Signed-off-by: Ruihang Xia --- .../en/user-guide/continuous-aggregation/manage-flow.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index 0b8e27507..ffe7f4305 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -49,3 +49,9 @@ To delete a flow, use the following `DROP FLOW` clause: ```sql DROP FLOW [IF EXISTS] ``` + +For example: + +```sql +DROP FLOW IF EXISTS my_flow; +``` From 61fc0fe05bd854550f724adffd7a5c3b4a710cd8 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Wed, 15 May 2024 20:18:37 +0800 Subject: [PATCH 03/20] add animated svg Signed-off-by: Ruihang Xia --- docs/nightly/en/user-guide/continuous-aggregation/overview.md | 2 ++ docs/public/flow-ani.svg | 3 +++ 2 files changed, 5 insertions(+) create mode 100644 docs/public/flow-ani.svg diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index 3d079274b..6c55cbcc0 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -6,3 +6,5 @@ GreptimeDB provides a continuous aggregation feature that allows you to aggregat - [Define time window](./define-time-window.md) describes how to define the time window for the continuous aggregation. Time window is an important attribute of your continuous aggregation query. It defines the time interval for the aggregation. - [Query](./query.md) describes how to write a continuous aggregation query. - [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. + +![Continuous Aggregation](./flow-ani.svg) \ No newline at end of file diff --git a/docs/public/flow-ani.svg b/docs/public/flow-ani.svg new file mode 100644 index 000000000..77c3832e9 --- /dev/null +++ b/docs/public/flow-ani.svg @@ -0,0 +1,3 @@ + + +
Input Data
Source Table
Flow Engine
Sink Table
Sum(a)
Max(b)
...
\ No newline at end of file From 800b3e6a6e77d09adbc620cb6c330f46c39644a4 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Wed, 15 May 2024 20:31:41 +0800 Subject: [PATCH 04/20] correct path Signed-off-by: Ruihang Xia --- .../nightly/en/user-guide/continuous-aggregation/manage-flow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index ffe7f4305..fe33b51ae 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -40,7 +40,7 @@ AS SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes'); ``` -The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. All data comes within 1 hour will be used in the flow. For the `tumble()` function, refer to [define time window](./define-time-window.md) part. +The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. All data comes within 1 hour will be used in the flow. For the `tumble()` function, refer to [define time window](/define-time-window.md) part. # Delete flow From 33ad034e8ae7e5d81c975355ab3f8d7e25e689e0 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Wed, 15 May 2024 20:34:26 +0800 Subject: [PATCH 05/20] fix path again Signed-off-by: Ruihang Xia --- .../nightly/en/user-guide/continuous-aggregation/manage-flow.md | 2 +- docs/nightly/en/user-guide/continuous-aggregation/overview.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index fe33b51ae..ffe7f4305 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -40,7 +40,7 @@ AS SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes'); ``` -The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. All data comes within 1 hour will be used in the flow. For the `tumble()` function, refer to [define time window](/define-time-window.md) part. +The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. All data comes within 1 hour will be used in the flow. For the `tumble()` function, refer to [define time window](./define-time-window.md) part. # Delete flow diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index 6c55cbcc0..add471c82 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -7,4 +7,4 @@ GreptimeDB provides a continuous aggregation feature that allows you to aggregat - [Query](./query.md) describes how to write a continuous aggregation query. - [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. -![Continuous Aggregation](./flow-ani.svg) \ No newline at end of file +![Continuous Aggregation](/flow-ani.svg) \ No newline at end of file From 401480c8898744df3ee2501fe81646591c0e976d Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Thu, 16 May 2024 11:17:11 +0800 Subject: [PATCH 06/20] apply review sugg Signed-off-by: Ruihang Xia --- .../en/user-guide/continuous-aggregation/manage-flow.md | 6 +++--- .../en/user-guide/continuous-aggregation/overview.md | 2 +- docs/nightly/en/user-guide/continuous-aggregation/query.md | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index ffe7f4305..7acfe2f18 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -1,4 +1,4 @@ -# Manage Flow +# Manage Flows Each `flow` is a continuous aggregation query in GreptimeDB. It is a query that continuously updates the aggregated data based on the incoming data and materializes the result. This document describes how to create, update, and delete a flow. @@ -10,7 +10,7 @@ A `flow` have those attributes: - `comment`: the description of the flow. - `SQL`: the continuous aggregation query to define the flow. Refer to [Expression](./expression.md) for the available expressions. -# Create flow or update flow +# Create or update a flow The grammar to create a flow is: @@ -42,7 +42,7 @@ SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. All data comes within 1 hour will be used in the flow. For the `tumble()` function, refer to [define time window](./define-time-window.md) part. -# Delete flow +# Delete a flow To delete a flow, use the following `DROP FLOW` clause: diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index add471c82..35ebac60f 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -3,7 +3,7 @@ GreptimeDB provides a continuous aggregation feature that allows you to aggregate data in real-time. This feature is useful when you need to calculate and query the sum, average, or other aggregations on the fly. The continuous aggregation feature is provided by our `Flow` engine. It continuously updates the aggregated data based on the incoming data and materialize it. - [Manage Flow](./manage-flow.md) describes how to create, update, and delete a flow. Each of your continuous aggregation query is a flow. -- [Define time window](./define-time-window.md) describes how to define the time window for the continuous aggregation. Time window is an important attribute of your continuous aggregation query. It defines the time interval for the aggregation. +- [Define Time Window](./define-time-window.md) describes how to define the time window for the continuous aggregation. Time window is an important attribute of your continuous aggregation query. It defines the time interval for the aggregation. - [Query](./query.md) describes how to write a continuous aggregation query. - [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/query.md b/docs/nightly/en/user-guide/continuous-aggregation/query.md index 5bc7e40b0..d66f0da18 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/query.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/query.md @@ -1,4 +1,4 @@ -# Query +# Write a Query This chapter describes how to write a continuous aggregation query in GreptimeDB. Query here should be a `SELECT` statement with either aggregating functions or non-aggregating functions (i.e., scalar function). @@ -10,7 +10,7 @@ Then each query should have a `FROM` clause to specify the source table. The ref `GROUP BY` clause works as in a normal query. It groups the data by the specified columns. One special thing is the time window functions `hop()` and `tumble()` described in [Define Time Window](./define-time-window.md) part. They are used in the `GROUP BY` clause to define the time window for the aggregation. Other expressions in `GROUP BY` can be either literal, column or scalar expressions. -Notice these two time window functions will add several columns to the output schema, you don't need to `SELECT` them. The columns are: +Notice the two time window functions will add several columns to the output schema. The columns are: - `window_start`: the start time of the window. - `window_end`: the end time of the window. - `updated_at`: the time when the window is updated. From 1f55d584653564ae6f3a692d1e32604b75fbc2e1 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Thu, 16 May 2024 17:38:25 +0800 Subject: [PATCH 07/20] state time window Signed-off-by: Ruihang Xia --- .../user-guide/continuous-aggregation/define-time-window.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index 5f13d2838..ef5a85036 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -1,6 +1,6 @@ # Define Time Window -Time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. GreptimeDB provides two types of time windows: `hop` and `tumble`, or "sliding window" and "fixed window" in other words. You can specify the time window in the `GROUP BY` clause using `hop()` function or `tumble()` function respectively. +Time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. GreptimeDB provides two types of time windows: `hop` and `tumble`, or "sliding window" and "fixed window" in other words. You can specify the time window in the `GROUP BY` clause using `hop()` function or `tumble()` function respectively. These two functions are only supported in continuous aggregate queries's `GROUP BY` position. # Tumble @@ -14,7 +14,7 @@ Where `col` specifies use which column to compute the time window. The provided `start_time` is an optional parameter to specify the start time of the first window. If not provided, the start time will be aligned to calender. -# Hop +# Hop (not supported yet) `hop` defines sliding window that moves forward by a fixed interval. It signature is like the following: @@ -24,6 +24,6 @@ hop(col, size_interval, hop_interval, ) Where `col` specifies use which column to compute the time window. The provided column must have a timestamp type. -`size_interval` specifies the size of each window, while `hop_interval` specifies the delta interval between windows' start timestamp. You can think the `tumble()` function as a special case of `hop()` function where the `size_interval` and `hop_interval` are the same. +`size_interval` specifies the size of each window, while `hop_interval` specifies the delta between two windows' start timestamp. You can think the `tumble()` function as a special case of `hop()` function where the `size_interval` and `hop_interval` are the same. `start_time` is an optional parameter to specify the start time of the first window. If not provided, the start time will be aligned to calender. From 0940119fcc7d80bea42ffb46ee30a56e7b918e7c Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Fri, 17 May 2024 16:22:55 +0800 Subject: [PATCH 08/20] illustrate time window Signed-off-by: Ruihang Xia --- .../continuous-aggregation/define-time-window.md | 4 ++++ docs/public/time-window.svg | 15 +++++++++++++++ 2 files changed, 19 insertions(+) create mode 100644 docs/public/time-window.svg diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index ef5a85036..35b22a0b9 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -2,6 +2,10 @@ Time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. GreptimeDB provides two types of time windows: `hop` and `tumble`, or "sliding window" and "fixed window" in other words. You can specify the time window in the `GROUP BY` clause using `hop()` function or `tumble()` function respectively. These two functions are only supported in continuous aggregate queries's `GROUP BY` position. +Here illustrates the `hop()` and `tumble()` functions works: + +![Time Window](/time-window.svg) + # Tumble `tumble()` defines fixed windows that do not overlap. It signature is like the following: diff --git a/docs/public/time-window.svg b/docs/public/time-window.svg new file mode 100644 index 000000000..8040fc854 --- /dev/null +++ b/docs/public/time-window.svg @@ -0,0 +1,15 @@ + + + +hop + + + +tumble + + + From 130c7881db2eaab0e415687aa1b568d6dcae6c9b Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Fri, 17 May 2024 17:09:26 +0800 Subject: [PATCH 09/20] add example Signed-off-by: Ruihang Xia --- .../define-time-window.md | 2 +- .../continuous-aggregation/overview.md | 80 ++++++++++++++++++- 2 files changed, 80 insertions(+), 2 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index 35b22a0b9..c6c5cf79e 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -2,7 +2,7 @@ Time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. GreptimeDB provides two types of time windows: `hop` and `tumble`, or "sliding window" and "fixed window" in other words. You can specify the time window in the `GROUP BY` clause using `hop()` function or `tumble()` function respectively. These two functions are only supported in continuous aggregate queries's `GROUP BY` position. -Here illustrates the `hop()` and `tumble()` functions works: +Here illustrates how the `hop()` and `tumble()` functions work: ![Time Window](/time-window.svg) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index 35ebac60f..18953a438 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -7,4 +7,82 @@ GreptimeDB provides a continuous aggregation feature that allows you to aggregat - [Query](./query.md) describes how to write a continuous aggregation query. - [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. -![Continuous Aggregation](/flow-ani.svg) \ No newline at end of file +![Continuous Aggregation](/flow-ani.svg) + +## Example + +Here is a complete example of how a continuous aggregation query looks like. + +First, we create a source table `numbers_input` and a sink table `out_num_cnt` with following clauses: + +```sql +CREATE TABLE numbers_input ( + number INT, + ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY(number), + TIME INDEX(ts) +); +``` + +```sql +CREATE TABLE out_num_cnt ( + sum_number BIGINT, + start_window TIMESTAMP TIME INDEX, + end_window TIMESTAMP, + update_at TIMESTAMP, +); +``` + +Then create our flow `test_numbers` to aggregate the sum of `number` column in `numbers_input` table. The aggregation is calculated in 1-second fixed windows. + +```sql +CREATE FLOW test_numbers +SINK TO out_num_cnt +AS +SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second'); +``` + +Now we can insert some data into `numbers_input` table and see the result in `out_num_cnt` table. + +```sql +INSERT INTO numbers_input +VALUES + (20,1625097600000), + (22,1625097600500); +``` + +The output table `out_num_cnt` should have the following data: + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 +(1 row) +``` + +Let's try to insert more data into `numbers_input` table. + +```sql +public=> INSERT INTO numbers_input +VALUES + (23,1625097601000), + (24,1625097601500); +``` + +Now the output table `out_num_cnt` should have the following data: + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 + 47 | 2021-07-01 00:00:01.000000 | 2021-07-01 00:00:02.000000 | 2024-05-17 08:33:10.048000 +(2 rows) +``` From 8833ca2e7aae582d1974fedb48de464be2e56a57 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Fri, 17 May 2024 21:37:13 +0800 Subject: [PATCH 10/20] comment out hop window Signed-off-by: Ruihang Xia --- .../user-guide/continuous-aggregation/define-time-window.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index c6c5cf79e..83438bd92 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -20,7 +20,9 @@ Where `col` specifies use which column to compute the time window. The provided # Hop (not supported yet) -`hop` defines sliding window that moves forward by a fixed interval. It signature is like the following: +`hop` defines sliding window that moves forward by a fixed interval. This feeaure is not supported yet and is expected to be available in the near future. + + From e0b5582dc384465a75981a0c7c164d36ac92912e Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Fri, 17 May 2024 21:48:20 +0800 Subject: [PATCH 11/20] explain time window and expire after Signed-off-by: Ruihang Xia --- .../user-guide/continuous-aggregation/define-time-window.md | 6 +++++- .../en/user-guide/continuous-aggregation/manage-flow.md | 2 ++ 2 files changed, 7 insertions(+), 1 deletion(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index 83438bd92..46de43295 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -1,6 +1,10 @@ # Define Time Window -Time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. GreptimeDB provides two types of time windows: `hop` and `tumble`, or "sliding window" and "fixed window" in other words. You can specify the time window in the `GROUP BY` clause using `hop()` function or `tumble()` function respectively. These two functions are only supported in continuous aggregate queries's `GROUP BY` position. +Time window is an important attribute of your continuous aggregation query. It defines how the data is aggregated in the flow. + +A time window corresponds to a range of time. Data from source table will be mapped to the corresponding window based on the time index column. Time window is also the scope of one calculation of an aggregation expression, so each time window will result in one row in the result table. + +GreptimeDB provides two types of time windows: `hop` and `tumble`, or "sliding window" and "fixed window" in other words. You can specify the time window in the `GROUP BY` clause using `hop()` function or `tumble()` function respectively. These two functions are only supported in continuous aggregate queries's `GROUP BY` position. Here illustrates how the `hop()` and `tumble()` functions work: diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index 7acfe2f18..ead2e65b3 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -27,6 +27,8 @@ When `OR REPLACE` is specified, if a flow with the same name already exists, it `sink-table-name` is the table name to store the materialized aggregated data. It can be an existing table or a new table, `flow` will create the sink table if it doesn't exist. But if the table already exists, the schema of the table must match the schema of the query result. +`expire after` is an optional inverval to expire the data from the source table. The expiration time is a relative time from the current time (by "current time" we means the physical time of the data arrive the Flow engine). For example, `INTERVAL '1 hour'` means the data after 1 hour from the current time will be expired. Expired will be dropped directly. + `SQL` part defines the continuous aggregation query. Refer to [Query](./query.md) for the details. Generally speaking, the `SQL` part is just like a normal `SELECT` clause with a few difference. A simple example to create a flow: From d56ae7b38aa1727ecc8464c94d298b58896b0969 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Fri, 17 May 2024 23:01:32 +0800 Subject: [PATCH 12/20] fix typo Signed-off-by: Ruihang Xia --- .../nightly/en/user-guide/continuous-aggregation/manage-flow.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index ead2e65b3..eb0134c9d 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -27,7 +27,7 @@ When `OR REPLACE` is specified, if a flow with the same name already exists, it `sink-table-name` is the table name to store the materialized aggregated data. It can be an existing table or a new table, `flow` will create the sink table if it doesn't exist. But if the table already exists, the schema of the table must match the schema of the query result. -`expire after` is an optional inverval to expire the data from the source table. The expiration time is a relative time from the current time (by "current time" we means the physical time of the data arrive the Flow engine). For example, `INTERVAL '1 hour'` means the data after 1 hour from the current time will be expired. Expired will be dropped directly. +`expire after` is an optional interval to expire the data from the source table. The expiration time is a relative time from the current time (by "current time" we means the physical time of the data arrive the Flow engine). For example, `INTERVAL '1 hour'` means the data after 1 hour from the current time will be expired. Expired will be dropped directly. `SQL` part defines the continuous aggregation query. Refer to [Query](./query.md) for the details. Generally speaking, the `SQL` part is just like a normal `SELECT` clause with a few difference. From 5193da77db590bae63c04393b6cb212083f73238 Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Mon, 20 May 2024 11:24:24 +0800 Subject: [PATCH 13/20] Apply suggestions from code review Co-authored-by: Yiran --- docs/nightly/en/user-guide/continuous-aggregation/overview.md | 2 +- docs/nightly/en/user-guide/continuous-aggregation/query.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index 18953a438..d58be0199 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -67,7 +67,7 @@ SELECT * FROM out_num_cnt; Let's try to insert more data into `numbers_input` table. ```sql -public=> INSERT INTO numbers_input +INSERT INTO numbers_input VALUES (23,1625097601000), (24,1625097601500); diff --git a/docs/nightly/en/user-guide/continuous-aggregation/query.md b/docs/nightly/en/user-guide/continuous-aggregation/query.md index d66f0da18..a04a3b3c8 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/query.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/query.md @@ -6,7 +6,7 @@ Only two kinds of expression are allowed after `SELECT` keyword: - Aggregate functions: see the reference in [Expression](./expression.md) for detail. - Scalar functions: like `col`, `to_lowercase(col)`, `col + 1`, etc. This part is the same as the normal `SELECT` clause in GreptimeDB. -Then each query should have a `FROM` clause to specify the source table. The referenced source table should be one in the flow's source tables in `CREATE FLOW` clause. Join is currently not supported so each query can reference only one table. +The query should have a `FROM` clause to identify the source table. As the join clause is currently not supported, the query can only aggregate columns from a single table. `GROUP BY` clause works as in a normal query. It groups the data by the specified columns. One special thing is the time window functions `hop()` and `tumble()` described in [Define Time Window](./define-time-window.md) part. They are used in the `GROUP BY` clause to define the time window for the aggregation. Other expressions in `GROUP BY` can be either literal, column or scalar expressions. From 22df07d5cd40341b3efaa3c044e62a3d5ab8ffab Mon Sep 17 00:00:00 2001 From: Ruihang Xia Date: Mon, 20 May 2024 11:28:04 +0800 Subject: [PATCH 14/20] explain OR REPLACE Signed-off-by: Ruihang Xia --- .../en/user-guide/continuous-aggregation/manage-flow.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index eb0134c9d..37686cc8c 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -23,11 +23,11 @@ AS ; ``` -When `OR REPLACE` is specified, if a flow with the same name already exists, it will be updated to the new one. +When `OR REPLACE` is specified, if a flow with the same name already exists, it will be updated to the new one. Notice that this only affects the flow task itself, and both source and sink tables will not be changed. `sink-table-name` is the table name to store the materialized aggregated data. It can be an existing table or a new table, `flow` will create the sink table if it doesn't exist. But if the table already exists, the schema of the table must match the schema of the query result. -`expire after` is an optional interval to expire the data from the source table. The expiration time is a relative time from the current time (by "current time" we means the physical time of the data arrive the Flow engine). For example, `INTERVAL '1 hour'` means the data after 1 hour from the current time will be expired. Expired will be dropped directly. +`expire after` is an optional interval to expire the data from the source table. The expiration time is a relative time from the current time (by "current time" we means the physical time of the data arrive the Flow engine). For example, `INTERVAL '1 hour'` means the data **older** than 1 hour from the current time will be expired. Expired data will be dropped directly. `SQL` part defines the continuous aggregation query. Refer to [Query](./query.md) for the details. Generally speaking, the `SQL` part is just like a normal `SELECT` clause with a few difference. From 293f90251558eb9b7100811ccf09bbf8f55f4977 Mon Sep 17 00:00:00 2001 From: Yiran Date: Mon, 20 May 2024 18:03:51 +0800 Subject: [PATCH 15/20] rewrite flow docs --- docs/nightly/en/summary.yml | 2 +- .../continuous-aggregation/manage-flow.md | 37 ++++++++++---- .../continuous-aggregation/overview.md | 50 +++++++++++++------ .../continuous-aggregation/query.md | 11 ++-- 4 files changed, 69 insertions(+), 31 deletions(-) diff --git a/docs/nightly/en/summary.yml b/docs/nightly/en/summary.yml index c457ee913..63c01bfb1 100644 --- a/docs/nightly/en/summary.yml +++ b/docs/nightly/en/summary.yml @@ -55,8 +55,8 @@ - Continuous-Aggregation: - overview - manage-flow - - define-time-window - query + - define-time-window - expression - Client-Libraries: - overview diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index 37686cc8c..c435b7854 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -6,45 +6,64 @@ A `flow` have those attributes: - `name`: the name of the flow. It's an unique identifier in the catalog level. - `source tables`: tables provide data for the flow. Each flow can have multiple source tables. - `sink table`: the table to store the materialized aggregated data. -- `expire after`: the interval to expire the data from the source table. Data after the expiration time will not be used in the flow. + - `comment`: the description of the flow. - `SQL`: the continuous aggregation query to define the flow. Refer to [Expression](./expression.md) for the available expressions. -# Create or update a flow +## Create or update a flow The grammar to create a flow is: -```sql + + +```sql +CREATE FLOW [ IF NOT EXISTS ] +OUTPUT TO +[ EXPIRE AFTER ] +[ COMMENT = "" ] +AS +; ``` -When `OR REPLACE` is specified, if a flow with the same name already exists, it will be updated to the new one. Notice that this only affects the flow task itself, and both source and sink tables will not be changed. + `sink-table-name` is the table name to store the materialized aggregated data. It can be an existing table or a new table, `flow` will create the sink table if it doesn't exist. But if the table already exists, the schema of the table must match the schema of the query result. -`expire after` is an optional interval to expire the data from the source table. The expiration time is a relative time from the current time (by "current time" we means the physical time of the data arrive the Flow engine). For example, `INTERVAL '1 hour'` means the data **older** than 1 hour from the current time will be expired. Expired data will be dropped directly. + -`SQL` part defines the continuous aggregation query. Refer to [Query](./query.md) for the details. Generally speaking, the `SQL` part is just like a normal `SELECT` clause with a few difference. +`SQL` part defines the continuous aggregation query. Refer to [Write a Query](./query.md) for the details. Generally speaking, the `SQL` part is just like a normal `SELECT` clause with a few difference. A simple example to create a flow: -```sql + + +```sql +CREATE FLOW IF NOT EXISTS my_flow +OUTPUT TO my_sink_table +COMMENT = "My first flow in GreptimeDB" +AS +SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes'); ``` -The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. All data comes within 1 hour will be used in the flow. For the `tumble()` function, refer to [define time window](./define-time-window.md) part. +The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. For the `tumble()` function, refer to [define time window](./define-time-window.md) part. + + -# Delete a flow +## Delete a flow To delete a flow, use the following `DROP FLOW` clause: diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index d58be0199..c529279a7 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -1,19 +1,18 @@ # Overview -GreptimeDB provides a continuous aggregation feature that allows you to aggregate data in real-time. This feature is useful when you need to calculate and query the sum, average, or other aggregations on the fly. The continuous aggregation feature is provided by our `Flow` engine. It continuously updates the aggregated data based on the incoming data and materialize it. +GreptimeDB provides a continuous aggregation feature that allows you to aggregate data in real-time. This feature is useful when you need to calculate and query the sum, average, or other aggregations on the fly. The continuous aggregation feature is provided by the Flow engine. It continuously updates the aggregated data based on the incoming data and materialize it. -- [Manage Flow](./manage-flow.md) describes how to create, update, and delete a flow. Each of your continuous aggregation query is a flow. -- [Define Time Window](./define-time-window.md) describes how to define the time window for the continuous aggregation. Time window is an important attribute of your continuous aggregation query. It defines the time interval for the aggregation. -- [Query](./query.md) describes how to write a continuous aggregation query. -- [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. +When you insert data into the source table, the data is also sent to and stored in the Flow engine. +The Flow engine calculate the aggregation by time windows and store the result in the sink table. +The entire process is illustrated in the following image: ![Continuous Aggregation](/flow-ani.svg) -## Example +## Quick start with an example Here is a complete example of how a continuous aggregation query looks like. -First, we create a source table `numbers_input` and a sink table `out_num_cnt` with following clauses: +First, create a source table `numbers_input` and a sink table `out_num_cnt` with following clauses: ```sql CREATE TABLE numbers_input ( @@ -33,7 +32,7 @@ CREATE TABLE out_num_cnt ( ); ``` -Then create our flow `test_numbers` to aggregate the sum of `number` column in `numbers_input` table. The aggregation is calculated in 1-second fixed windows. +Then create the flow `test_numbers` to aggregate the sum of `number` column in `numbers_input` table. The aggregation is calculated in 1-second fixed windows. ```sql CREATE FLOW test_numbers @@ -42,16 +41,16 @@ AS SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second'); ``` -Now we can insert some data into `numbers_input` table and see the result in `out_num_cnt` table. +To observe the outcome of the continuous aggregation in the `out_num_cnt` table, insert some data into the source table `numbers_input`. ```sql INSERT INTO numbers_input VALUES - (20,1625097600000), - (22,1625097600500); + (20, "2021-07-01 00:00:00.200"), + (22, "2021-07-01 00:00:00.600"); ``` -The output table `out_num_cnt` should have the following data: +The sum of the `number` column is 42 (20+22), so the sink table `out_num_cnt` should have the following data: ```sql SELECT * FROM out_num_cnt; @@ -64,16 +63,16 @@ SELECT * FROM out_num_cnt; (1 row) ``` -Let's try to insert more data into `numbers_input` table. +Try to insert more data into the `numbers_input` table: ```sql INSERT INTO numbers_input VALUES - (23,1625097601000), - (24,1625097601500); + (23,"2021-07-01 00:00:01.000"), + (24,"2021-07-01 00:00:01.500"); ``` -Now the output table `out_num_cnt` should have the following data: +The sink table `out_num_cnt` now contains two rows: representing the sums of 42 and 47 (23+24) for the two respective 1-second windows. ```sql SELECT * FROM out_num_cnt; @@ -86,3 +85,22 @@ SELECT * FROM out_num_cnt; 47 | 2021-07-01 00:00:01.000000 | 2021-07-01 00:00:02.000000 | 2024-05-17 08:33:10.048000 (2 rows) ``` + +Here is the explanation of the columns in the `out_num_cnt` table: + +- `sum_number`: the sum of the `number` column in the window. +- `start_window`: the start time of the window. +- `end_window`: the end time of the window. +- `update_at`: the time when the row data is updated. + +The `start_window`, `end_window`, and `update_at` columns are automatically added by the time window functions of Flow engine. + +## Next Steps + +Congratulations you already have a preliminary understanding of the continuous aggregation feature. +Please refer to the following sections to learn more: + +- [Manage Flows](./manage-flow.md) describes how to create, update, and delete a flow. Each of your continuous aggregation query is a flow. +- [Write a Query](./query.md) describes how to write a continuous aggregation query. +- [Define Time Window](./define-time-window.md) describes how to define the time window for the continuous aggregation. Time window is an important attribute of your continuous aggregation query. It defines the time interval for the aggregation. +- [Expression](./expression.md) is a reference of available expressions in the continuous aggregation query. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/query.md b/docs/nightly/en/user-guide/continuous-aggregation/query.md index a04a3b3c8..6b3a456f3 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/query.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/query.md @@ -2,6 +2,12 @@ This chapter describes how to write a continuous aggregation query in GreptimeDB. Query here should be a `SELECT` statement with either aggregating functions or non-aggregating functions (i.e., scalar function). +The grammar of continuous aggregation query is like the following: + +```sql +SELECT AGGR_FUNCTION(column1, column2,..) FROM source_table GROUP BY TIME_WINDOW_FUNCTION(); +``` + Only two kinds of expression are allowed after `SELECT` keyword: - Aggregate functions: see the reference in [Expression](./expression.md) for detail. - Scalar functions: like `col`, `to_lowercase(col)`, `col + 1`, etc. This part is the same as the normal `SELECT` clause in GreptimeDB. @@ -10,9 +16,4 @@ The query should have a `FROM` clause to identify the source table. As the join `GROUP BY` clause works as in a normal query. It groups the data by the specified columns. One special thing is the time window functions `hop()` and `tumble()` described in [Define Time Window](./define-time-window.md) part. They are used in the `GROUP BY` clause to define the time window for the aggregation. Other expressions in `GROUP BY` can be either literal, column or scalar expressions. -Notice the two time window functions will add several columns to the output schema. The columns are: -- `window_start`: the start time of the window. -- `window_end`: the end time of the window. -- `updated_at`: the time when the window is updated. - Others things like `ORDER BY`, `LIMIT`, `OFFSET` are not supported in the continuous aggregation query. From 77a7401bb00cbac5fe412369dbaaefe8a4d0a8d4 Mon Sep 17 00:00:00 2001 From: Yiran Date: Mon, 20 May 2024 18:29:21 +0800 Subject: [PATCH 16/20] zh docs --- .../define-time-window.md | 9 +- .../continuous-aggregation/manage-flow.md | 4 +- .../continuous-aggregation/overview.md | 2 +- .../continuous-aggregation/query.md | 4 +- docs/nightly/zh/summary-i18n.yml | 1 + .../define-time-window.md | 46 ++++++++ .../continuous-aggregation/expression.md | 9 ++ .../continuous-aggregation/manage-flow.md | 87 ++++++++++++++ .../continuous-aggregation/overview.md | 110 ++++++++++++++++++ .../continuous-aggregation/query.md | 24 ++++ 10 files changed, 288 insertions(+), 8 deletions(-) create mode 100644 docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md create mode 100644 docs/nightly/zh/user-guide/continuous-aggregation/expression.md create mode 100644 docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md create mode 100644 docs/nightly/zh/user-guide/continuous-aggregation/overview.md create mode 100644 docs/nightly/zh/user-guide/continuous-aggregation/query.md diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index 46de43295..c84494d26 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -12,15 +12,16 @@ Here illustrates how the `hop()` and `tumble()` functions work: # Tumble -`tumble()` defines fixed windows that do not overlap. It signature is like the following: +`tumble()` defines fixed windows that do not overlap. ``` tumble(col, interval, ) ``` -Where `col` specifies use which column to compute the time window. The provided column must have a timestamp type. `interval` specifies the size of the window. The `tumble` function divides the time column into fixed-size windows and aggregates the data in each window. - -`start_time` is an optional parameter to specify the start time of the first window. If not provided, the start time will be aligned to calender. +- `col` specifies use which column to compute the time window. The provided column must have a timestamp type. +- `interval` specifies the size of the window. The `tumble` function divides the time column into fixed-size windows and aggregates the data in each window. +- `start_time` specify the start time of the first window. + # Hop (not supported yet) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index c435b7854..aae172af2 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -1,6 +1,8 @@ # Manage Flows -Each `flow` is a continuous aggregation query in GreptimeDB. It is a query that continuously updates the aggregated data based on the incoming data and materializes the result. This document describes how to create, update, and delete a flow. +Each `flow` is a continuous aggregation query in GreptimeDB. +It continuously updates the aggregated data based on the incoming data. +This document describes how to create, update, and delete a flow. A `flow` have those attributes: - `name`: the name of the flow. It's an unique identifier in the catalog level. diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index c529279a7..53a5c6f29 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -72,7 +72,7 @@ VALUES (24,"2021-07-01 00:00:01.500"); ``` -The sink table `out_num_cnt` now contains two rows: representing the sums of 42 and 47 (23+24) for the two respective 1-second windows. +The sink table `out_num_cnt` now contains two rows: representing the sum data 42 and 47 (23+24) for the two respective 1-second windows. ```sql SELECT * FROM out_num_cnt; diff --git a/docs/nightly/en/user-guide/continuous-aggregation/query.md b/docs/nightly/en/user-guide/continuous-aggregation/query.md index 6b3a456f3..f941af1cd 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/query.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/query.md @@ -2,10 +2,10 @@ This chapter describes how to write a continuous aggregation query in GreptimeDB. Query here should be a `SELECT` statement with either aggregating functions or non-aggregating functions (i.e., scalar function). -The grammar of continuous aggregation query is like the following: +The grammar of the query is like the following: ```sql -SELECT AGGR_FUNCTION(column1, column2,..) FROM source_table GROUP BY TIME_WINDOW_FUNCTION(); +SELECT AGGR_FUNCTION(column1, column2,..) FROM GROUP BY TIME_WINDOW_FUNCTION(); ``` Only two kinds of expression are allowed after `SELECT` keyword: diff --git a/docs/nightly/zh/summary-i18n.yml b/docs/nightly/zh/summary-i18n.yml index a12a77d22..fde2a67a9 100644 --- a/docs/nightly/zh/summary-i18n.yml +++ b/docs/nightly/zh/summary-i18n.yml @@ -7,6 +7,7 @@ Clients: 客户端 Client-Libraries: 客户端库 Write-Data: 写入数据 Query-Data: 读取数据 +Continuous-Aggregation: 持续聚合 Python-Scripts: Python 脚本 Operations: 运维操作 Prometheus: Prometheus diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md new file mode 100644 index 000000000..61b3126e1 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md @@ -0,0 +1,46 @@ +# 定义时间窗口 + +时间窗口是连续聚合查询的重要属性。 +它定义了数据在流中的聚合方式。 + +时间窗口对应于时间范围。 +source 表中的数据将根据时间索引列映射到相应的窗口。 +时间窗口也是聚合表达式计算的范围,因此每个时间窗口将在结果表中生成一行。 + +GreptimeDB 提供两种时间窗口类型:`hop` 和 `tumble`,或者换句话说是滑动窗口和固定窗口。 +你可以在 `GROUP BY` 子句中使用 `hop()` 函数或 `tumble()` 函数指定时间窗口。 +这两个函数仅支持在连续聚合查询的 `GROUP BY` 位置使用。 + +下图展示了 `hop()` 和 `tumble()` 函数的工作方式: + +![Time Window](/time-window.svg) + +# Tumble + +`tumble()` 定义固定窗口,窗口之间不重叠。 + +``` +tumble(col, interval, ) +``` + +- `col` 指定使用哪一列计算时间窗口。提供的列必须是时间戳类型。 +- `interval` 指定窗口的大小。`tumble` 函数将时间列划分为固定大小的窗口,并在每个窗口中聚合数据。 +- `start_time` 用于指定第一个窗口的开始时间。 + + +# Hop(尚未支持) + +`hop` 定义滑动窗口,窗口按固定间隔向前移动。 +此功能尚未支持,预计将在不久的将来提供。 + + diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/expression.md b/docs/nightly/zh/user-guide/continuous-aggregation/expression.md new file mode 100644 index 000000000..e5a76ec67 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/expression.md @@ -0,0 +1,9 @@ +# 表达式 + +此处列出了所有支持的表达式。 + +- `count(column)`: 行数。 +- `sum(column)`: 列的和。 +- `avg(column)`: 列的平均值。 +- `min(column)`: 列的最小值。 +- `max(column)`: 列的最大值。 diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md new file mode 100644 index 000000000..ca2ea8deb --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md @@ -0,0 +1,87 @@ +# 管理 Flows + +每一个 `flow` 是 GreptimeDB 中的一个连续聚合查询。 +它根据传入的数据持续更新并聚合数据。 +本文档描述了如何创建、更新和删除一个 flow。 + +一个 `flow` 有以下属性: +- `name`: flow 的名称。在目录级别中是唯一的标识符。 +- `source tables`: 为 flow 提供数据的表。每个 flow 可以有多个 source 表。 +- `sink table`: 存储聚合数据的表。 + +- `comment`: flow 的描述。 +- `SQL`: 定义 flow 的连续聚合查询。有关可用表达式,请参阅 [表达式](./expression.md)。 + +## 创建或更新 flow + +创建 flow 的语法是: + + + +```sql +CREATE FLOW [ IF NOT EXISTS ] +OUTPUT TO +[ EXPIRE AFTER ] +[ COMMENT = "" ] +AS +; +``` + + + +`sink-table-name` 是存储聚合数据的表名。 +它可以是一个现有表或一个新表,`flow` 会在目标表不存在时创建它。 +但如果表已经存在,表的 schema 必须与查询结果的模式匹配。 + + + +`SQL` 部分定义了连续聚合查询。 +有关详细信息,请参阅 [编写查询](./query.md) 部分。 +一般来说,`SQL` 部分就像一个普通的 `SELECT` 子句,只是有一些不同。 + +一个创建 flow 的简单示例: + + + +```sql +CREATE FLOW IF NOT EXISTS my_flow +OUTPUT TO my_sink_table +COMMENT = "My first flow in GreptimeDB" +AS +SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes'); +``` + +该 Flow 将每 5 分钟计算 `count(item)` 并将结果存储在 `my_sink_table` 中。 +有关 `tumble()` 函数,请参考[定义时间窗口](./define-time-window.md) 部分。 + + + +## 删除 flow + +请使用以下 `DROP FLOW` 子句删除 flow: + +To delete a flow, use the following `DROP FLOW` clause: + +```sql +DROP FLOW [IF EXISTS] +``` + +例如: + +```sql +DROP FLOW IF EXISTS my_flow; +``` diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/overview.md b/docs/nightly/zh/user-guide/continuous-aggregation/overview.md new file mode 100644 index 000000000..0576abef1 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/overview.md @@ -0,0 +1,110 @@ +# 概述 + +GreptimeDB 提供连续聚合功能允许你实时聚合数据。 +当你需要实时计算和查询总和、平均值或其他聚合时,此功能非常有用。 +连续聚合功能由 Flow 引擎提供。 +它根据传入的数据不断更新聚合数据。 + +当你将数据插入 source 表时,数据也会被发送到 Flow 引擎并存储在其中。 +Flow 引擎通过时间窗口计算聚合并将结果存储在目标表中。 +整个过程如下图所示: + +![Continuous Aggregation](/flow-ani.svg) + +## 快速开始示例 + +以下是连续聚合查询的一个完整示例。 + +首先,使用以下语句创建一个 source 表 `numbers_input` 和一个 sink 表 `out_num_cnt`: + +```sql +CREATE TABLE numbers_input ( + number INT, + ts TIMESTAMP DEFAULT CURRENT_TIMESTAMP, + PRIMARY KEY(number), + TIME INDEX(ts) +); +``` + +```sql +CREATE TABLE out_num_cnt ( + sum_number BIGINT, + start_window TIMESTAMP TIME INDEX, + end_window TIMESTAMP, + update_at TIMESTAMP, +); +``` + +然后创建名为 `test_numbers` 的 flow 来聚合 `numbers_input` 表中 `number` 列的总和,并在 1 秒固定窗口中聚合计算数据。 + +```sql +CREATE FLOW test_numbers +SINK TO out_num_cnt +AS +SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second'); +``` + +要观察 `out_num_cnt` 表中连续聚合的结果,向 source 表 `numbers_input` 插入一些数据。 + +```sql +INSERT INTO numbers_input +VALUES + (20, "2021-07-01 00:00:00.200"), + (22, "2021-07-01 00:00:00.600"); +``` + +`number` 列的总和为 42 (20+22),因此 sink 表 `out_num_cnt` 应该包含以下数据: + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 +(1 row) +``` + +尝试向 `numbers_input` 表中插入更多数据: + +```sql +INSERT INTO numbers_input +VALUES + (23,"2021-07-01 00:00:01.000"), + (24,"2021-07-01 00:00:01.500"); +``` + +sink 表 `out_num_cnt` 现在包含两行:分别表示两个 1 秒窗口的数据之和 42 和 47 (23+24) 。 + +```sql +SELECT * FROM out_num_cnt; +``` + +```sql + sum_number | start_window | end_window | update_at +------------+----------------------------+----------------------------+---------------------------- + 42 | 2021-07-01 00:00:00.000000 | 2021-07-01 00:00:01.000000 | 2024-05-17 08:32:56.026000 + 47 | 2021-07-01 00:00:01.000000 | 2021-07-01 00:00:02.000000 | 2024-05-17 08:33:10.048000 +(2 rows) +``` + +`out_num_cnt` 表中的列解释如下: + +- `sum_number`:窗口中 `number` 列的总和。 +- `start_window`:窗口的开始时间。 +- `end_window`:窗口的结束时间。 +- `update_at`:更新行数据的时间。 + +其中 `start_window`、`end_window` 和 `update_at` 列是 Flow 引擎的时间窗口函数自动添加的。 + +## 下一步 + +恭喜你已经初步了解了连续聚合功能。 +请参考以下章节了解更多: + +- [管理 Flow](./manage-flow.md) 描述了如何创建、更新和删除 flow。你的每个连续聚合查询都是一个 flow。 +- [编写查询语句](./query.md) 描述了如何编写连续聚合查询。 +- [定义时间窗口](./define-time-window.md) 描述了如何为连续聚合定义时间窗口。时间窗口是连续聚合查询的一个重要属性,它定义了聚合的时间间隔。 +- [表达式](./expression.md) 是连续聚合查询中可用表达式。 + diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/query.md b/docs/nightly/zh/user-guide/continuous-aggregation/query.md new file mode 100644 index 000000000..5bd129b38 --- /dev/null +++ b/docs/nightly/zh/user-guide/continuous-aggregation/query.md @@ -0,0 +1,24 @@ +# 编写查询语句 + +本章节描述如何在 GreptimeDB 中编写连续聚合查询。 +查询语句应该是一个带有聚合函数或非聚合函数(即 scalar 函数)的 `SELECT` 语句。 + +查询的语法如下: + +```sql +SELECT AGGR_FUNCTION(column1, column2,..) FROM GROUP BY TIME_WINDOW_FUNCTION(); +``` + +`SELECT` 关键字后只允许两种表达式: +- 聚合函数:详细信息请参考 [表达式](./expression.md) 部分。 +- 标量函数:如 `col`、`to_lowercase(col)`、`col + 1` 等。这部分与 GreptimeDB 中的普通 `SELECT` 子句相同。 + +查询应该有一个 `FROM` 子句来标识 source 表。由于不支持 join 子句,目前只能从单个表中聚合列。 + +`GROUP BY` 子句与普通查询中的工作方式相同。 +它根据指定的列对数据进行分组。 +`GROUP BY` 子句中使用的时间窗口函数 `hop()` 和 `tumble()` 在 [定义时间窗口](./define-time-window.md) 部分中有描述。 +它们用于在聚合中定义时间窗口。 +`GROUP BY` 中的其他表达式可以是 literal、列名或 scalar 表达式。 + +连续聚合查询不支持 `ORDER BY`、`LIMIT`、`OFFSET` 等其他操作。 From d6628b2bb83601a03af3ee7f05f1c1cb2ea9f8a1 Mon Sep 17 00:00:00 2001 From: Yiran Date: Mon, 20 May 2024 18:32:29 +0800 Subject: [PATCH 17/20] title hierarchies --- .../user-guide/continuous-aggregation/define-time-window.md | 4 ++-- .../user-guide/continuous-aggregation/define-time-window.md | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index c84494d26..ba2e7c54f 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -10,7 +10,7 @@ Here illustrates how the `hop()` and `tumble()` functions work: ![Time Window](/time-window.svg) -# Tumble +## Tumble `tumble()` defines fixed windows that do not overlap. @@ -23,7 +23,7 @@ tumble(col, interval, ) - `start_time` specify the start time of the first window. -# Hop (not supported yet) +## Hop (not supported yet) `hop` defines sliding window that moves forward by a fixed interval. This feeaure is not supported yet and is expected to be available in the near future. diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md index 61b3126e1..74579e524 100644 --- a/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md @@ -15,7 +15,7 @@ GreptimeDB 提供两种时间窗口类型:`hop` 和 `tumble`,或者换句话 ![Time Window](/time-window.svg) -# Tumble +## Tumble `tumble()` 定义固定窗口,窗口之间不重叠。 @@ -28,7 +28,7 @@ tumble(col, interval, ) - `start_time` 用于指定第一个窗口的开始时间。 -# Hop(尚未支持) +## Hop(尚未支持) `hop` 定义滑动窗口,窗口按固定间隔向前移动。 此功能尚未支持,预计将在不久的将来提供。 From d25bec501091f452f9f6e2a504c7c8eeb3c1405a Mon Sep 17 00:00:00 2001 From: Yiran Date: Mon, 20 May 2024 18:42:04 +0800 Subject: [PATCH 18/20] fix sql in overview --- docs/nightly/en/user-guide/continuous-aggregation/overview.md | 2 +- docs/nightly/zh/user-guide/continuous-aggregation/overview.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/overview.md b/docs/nightly/en/user-guide/continuous-aggregation/overview.md index 53a5c6f29..abd27e4f6 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/overview.md @@ -38,7 +38,7 @@ Then create the flow `test_numbers` to aggregate the sum of `number` column in ` CREATE FLOW test_numbers SINK TO out_num_cnt AS -SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second'); +SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second', '2021-07-01 00:00:00'); ``` To observe the outcome of the continuous aggregation in the `out_num_cnt` table, insert some data into the source table `numbers_input`. diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/overview.md b/docs/nightly/zh/user-guide/continuous-aggregation/overview.md index 0576abef1..ff80864f4 100644 --- a/docs/nightly/zh/user-guide/continuous-aggregation/overview.md +++ b/docs/nightly/zh/user-guide/continuous-aggregation/overview.md @@ -41,7 +41,7 @@ CREATE TABLE out_num_cnt ( CREATE FLOW test_numbers SINK TO out_num_cnt AS -SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second'); +SELECT sum(number) FROM numbers_input GROUP BY tumble(ts, '1 second', '2021-07-01 00:00:00'); ``` 要观察 `out_num_cnt` 表中连续聚合的结果,向 source 表 `numbers_input` 插入一些数据。 From 0ccba22418351f3ea6bab5cbe45c64d89aefc7c9 Mon Sep 17 00:00:00 2001 From: Yiran Date: Mon, 20 May 2024 19:15:20 +0800 Subject: [PATCH 19/20] Apply suggestions from code review Co-authored-by: Jeremyhi --- .../en/user-guide/continuous-aggregation/define-time-window.md | 2 +- .../zh/user-guide/continuous-aggregation/define-time-window.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md index ba2e7c54f..8400b190f 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/define-time-window.md @@ -15,7 +15,7 @@ Here illustrates how the `hop()` and `tumble()` functions work: `tumble()` defines fixed windows that do not overlap. ``` -tumble(col, interval, ) +tumble(col, interval, start_time) ``` - `col` specifies use which column to compute the time window. The provided column must have a timestamp type. diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md index 74579e524..ba56ab259 100644 --- a/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md +++ b/docs/nightly/zh/user-guide/continuous-aggregation/define-time-window.md @@ -20,7 +20,7 @@ GreptimeDB 提供两种时间窗口类型:`hop` 和 `tumble`,或者换句话 `tumble()` 定义固定窗口,窗口之间不重叠。 ``` -tumble(col, interval, ) +tumble(col, interval, start_time) ``` - `col` 指定使用哪一列计算时间窗口。提供的列必须是时间戳类型。 From bb3fdacc48390e262487ed828579b7534f8c36ea Mon Sep 17 00:00:00 2001 From: Yiran Date: Mon, 20 May 2024 19:19:20 +0800 Subject: [PATCH 20/20] add start time to tumble function --- .../en/user-guide/continuous-aggregation/manage-flow.md | 2 +- .../zh/user-guide/continuous-aggregation/manage-flow.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md index aae172af2..d8e486a62 100644 --- a/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/en/user-guide/continuous-aggregation/manage-flow.md @@ -58,7 +58,7 @@ CREATE FLOW IF NOT EXISTS my_flow OUTPUT TO my_sink_table COMMENT = "My first flow in GreptimeDB" AS -SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes'); +SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes', '2024-05-20 00:00:00'); ``` The created flow will compute `count(item)` for every 5 minutes and store the result in `my_sink_table`. For the `tumble()` function, refer to [define time window](./define-time-window.md) part. diff --git a/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md b/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md index ca2ea8deb..8c77b6ab5 100644 --- a/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md +++ b/docs/nightly/zh/user-guide/continuous-aggregation/manage-flow.md @@ -7,7 +7,7 @@ 一个 `flow` 有以下属性: - `name`: flow 的名称。在目录级别中是唯一的标识符。 - `source tables`: 为 flow 提供数据的表。每个 flow 可以有多个 source 表。 -- `sink table`: 存储聚合数据的表。 +- `sink table`: 存储聚合数据的结果表。 - `comment`: flow 的描述。 - `SQL`: 定义 flow 的连续聚合查询。有关可用表达式,请参阅 [表达式](./expression.md)。 @@ -62,7 +62,7 @@ CREATE FLOW IF NOT EXISTS my_flow OUTPUT TO my_sink_table COMMENT = "My first flow in GreptimeDB" AS -SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes'); +SELECT count(item) from my_source_table GROUP BY tumble(time_index, INTERVAL '5 minutes', '2024-05-20 00:00:00'); ``` 该 Flow 将每 5 分钟计算 `count(item)` 并将结果存储在 `my_sink_table` 中。