Commit

[doc](statistics)Update docs for statistics (apache#29926)
Update docs for statistics
Jibing-Li authored Jan 16, 2024
1 parent 04701b6 commit df78d64
Showing 2 changed files with 16 additions and 4 deletions.
10 changes: 8 additions & 2 deletions docs/en/docs/query-acceleration/statistics.md
@@ -64,7 +64,13 @@ Where:
- `sync`: Collect statistics synchronously. Returns after collection. If not specified, it executes asynchronously and returns a JOB ID.
- `sample percent | rows`: Collect statistics with sampling. You can specify a sampling percentage or a number of sampling rows.

Here are some examples:
By default (when `WITH SAMPLE` is not specified), the table is analyzed in full. For relatively large tables (for example, above 5 GiB), we recommend sampled analysis to conserve system resources, with a sample of no fewer than 4 million rows. Here are some examples:

Collect statistics for a table with a full analysis:

```sql
ANALYZE TABLE lineitem;
```

Collect statistics for a table with a 10% sampling rate:

@@ -84,7 +90,7 @@ ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;

This feature has been officially supported since 2.0.3 and is enabled by default. The basic operation logic is described below. After each import transaction commit, Doris records the number of rows updated by the import transaction to estimate the health of the existing table's statistics data (for tables that have not collected statistics, their health is 0). When the health of a table is below 60 (adjustable through the `table_stats_health_threshold` parameter), Doris considers the statistics for that table outdated and triggers statistics collection jobs for that table in subsequent operations. For tables with a health value above 60, no repeated collection is performed.

The collection jobs for statistics themselves consume a certain amount of system resources. To minimize the overhead, for tables with a large amount of data (default 5 GiB, adjustable with the FE parameter `huge_table_lower_bound_size_in_bytes`), Doris automatically uses sampling to collect statistics. Automatic sampling defaults to sampling 4,194,304 (2^22) rows to reduce the system's burden and complete the collection job as quickly as possible. If you want to sample more rows to obtain a more accurate data distribution, you can increase the sampling row count by adjusting the `huge_table_default_sample_rows` parameter. In addition, for tables with data larger than `huge_table_lower_bound_size_in_bytes` * 5, Doris ensures that the collection time interval is not less than 12 hours (which can be controlled by adjusting the `huge_table_auto_analyze_interval_in_millis` parameter).
Statistics collection jobs themselves consume system resources. To minimize the overhead, Doris automatically uses sampling to collect statistics. Automatic sampling defaults to 4,194,304 (2^22) rows, which reduces the load on the system and lets the collection job finish as quickly as possible. If you want to sample more rows to obtain a more accurate data distribution, increase the sampling row count by adjusting the `huge_table_default_sample_rows` parameter. You can also control full collection for small tables and the collection interval for large tables through session variables. For detailed configuration, see [3.1](statistics.md#31-session-variables).
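
As a minimal sketch of the adjustment described above, the statement below doubles the automatic sample size. It assumes that `huge_table_default_sample_rows` is set with `SET GLOBAL` like other session variables (the reference to section 3.1 suggests this, but the diff does not show it), and the value 8388608 (2^23) is purely illustrative.

```sql
-- Assumption: the variable is settable with SET GLOBAL; confirm in section 3.1 for your version.
-- 8388608 = 2^23, twice the default of 4194304 (2^22). Illustrative value only.
SET GLOBAL huge_table_default_sample_rows = 8388608;
```

A larger sample trades longer collection jobs for a more accurate picture of the data distribution.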

If you are concerned about automatic collection jobs interfering with your business, you can specify a time frame for the automatic collection jobs to run during low business loads by setting the `auto_analyze_start_time` and `auto_analyze_end_time` parameters according to your needs. You can also completely disable this feature by setting the `enable_auto_analyze` parameter to `false`.
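
The settings in the paragraph above might be applied as sketched below. The `SET GLOBAL` form, the `'HH:mm:ss'` time format, and the 01:00 to 05:00 window are assumptions chosen for illustration; only the three parameter names come from the text.

```sql
-- Assumption: times are 'HH:mm:ss' strings and the variables are set with SET GLOBAL.
-- Restrict automatic collection to a hypothetical low-load window, 01:00 to 05:00.
SET GLOBAL auto_analyze_start_time = '01:00:00';
SET GLOBAL auto_analyze_end_time = '05:00:00';

-- Alternatively, turn automatic collection off entirely.
SET GLOBAL enable_auto_analyze = false;
```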

10 changes: 8 additions & 2 deletions docs/zh-CN/docs/query-acceleration/statistics.md
@@ -66,8 +66,14 @@ ANALYZE < TABLE table_name | DATABASE db_name >
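- sync: Collect statistics synchronously. Returns after collection completes. If not specified, it executes asynchronously and returns a JOB ID.
- sample percent | rows: Collect statistics with sampling. You can specify a sampling percentage or a number of rows to sample.

By default (when WITH SAMPLE is not specified), the table is analyzed in full. For relatively large tables (above 5 GiB), we generally recommend sampled collection to conserve cluster resources, with a sample of no fewer than 4 million rows. Here are some examples

Collect statistics for a table in full: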

```sql
ANALYZE TABLE lineitem;
```

Here are some examples

Collect statistics for a table with a 10% sampling rate:

@@ -87,7 +93,7 @@ ANALYZE TABLE lineitem WITH SAMPLE ROWS 100000;

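This feature has been officially supported since 2.0.3 and is enabled around the clock by default. Its basic operation is as follows. After each import transaction commits, Doris records the number of table rows updated by that transaction and uses it to estimate the health of the table's existing statistics (for tables whose statistics have never been collected, the health is 0). When a table's health falls below 60 (adjustable through the `table_stats_health_threshold` parameter), Doris considers the statistics for that table outdated and subsequently triggers a statistics collection job for it. For tables with a health value above 60, no repeated collection is performed.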

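Statistics collection jobs themselves consume system resources. To minimize the overhead, for tables with a large amount of data (5 GiB by default, adjustable with the FE parameter `huge_table_lower_bound_size_in_bytes`), Doris automatically collects statistics by sampling. Automatic sampling defaults to 4194304 (2^22) rows, which reduces the load on the system and lets the collection job finish as quickly as possible. If you want to sample more rows to obtain a more accurate data distribution, you can increase the sampling row count by adjusting the `huge_table_default_sample_rows` parameter. In addition, for tables with data larger than `huge_table_lower_bound_size_in_bytes` * 5, Doris ensures that the collection interval is no less than 12 hours (adjustable through the `huge_table_auto_analyze_interval_in_millis` parameter)
Statistics collection jobs themselves consume system resources. To minimize the overhead, Doris collects statistics by sampling. Automatic sampling defaults to 4194304 (2^22) rows, which reduces the load on the system and lets the collection job finish as quickly as possible. If you want to sample more rows to obtain a more accurate data distribution, you can increase the sampling row count by adjusting the `huge_table_default_sample_rows` parameter. You can also use parameters to control behaviors such as full collection for small tables and the collection interval for large tables. For detailed configuration, see [3.1](statistics.md#31-会话变量)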

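If you are concerned about automatic collection jobs interfering with your business, you can set the `auto_analyze_start_time` and `auto_analyze_end_time` parameters according to your needs so that the jobs run during periods of low business load. You can also disable this feature completely by setting the `enable_auto_analyze` parameter to `false`.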
