Merge pull request #2784 from emqx/20241219-r58-mailbox-alarms
feat: new mailbox alarms
Meggielqk authored Dec 25, 2024
2 parents b418c15 + 4e7e8e3 commit 0c54578
Showing 2 changed files with 28 additions and 24 deletions.
30 changes: 16 additions & 14 deletions en_US/observability/alarms.md
@@ -6,7 +6,7 @@ Alarm is an EMQX Enterprise feature.

:::

EMQX offers built-in monitoring and alarm functionality for tracking internal state changes, such as CPU usage, system and process memory usage, number of processes, rule engine resource status, and cluster partition and healing. EMQX triggers and records an alarm when one of these metrics exceeds a threshold or deviates from expectations, and removes the alarm from the list once the condition is restored.

This page introduces the alarm information EMQX provides, how to obtain and check the detailed alarm information, and how to configure the alarm settings and thresholds in EMQX. The monitoring and alarm function keeps you notified of potential problems during operation. By configuring alarms and setting appropriate thresholds, you can make sure that EMQX remains secure, stable, and reliable.

@@ -30,15 +30,17 @@ The levels are defined from development perspectives and are only for recommenda

**Alarm list for EMQX Open Source edition:**

| **Alarm** | Level | Description | **Details** | **Threshold** |
|:------------------------------------|----------|:-----------------------------------------------------------------------------------|:-----------------------------------------|:---------------------------------------------------------------------------|
| high_system_memory_usage | Warning | System memory usage is too high | "System memory usage is higher than ~p%" | `os_mon.sysmem_high_watermark = 70%` |
| high_process_memory_usage | Warning | Single Erlang process memory usage is too high (percentage of system memory usage) | Process memory usage is higher than ~p% | `os_mon.procmem_high_watermark = 5%` |
| high_cpu_usage | Warning | CPU usage is too high | ~p% cpu usage | `os_mon.cpu_high_watermark = 80%` `os_mon.cpu_low_watermark = 60%` |
| too_many_processes | Warning | Too many processes | ~p% process usage | `vm_mon.process_high_watermark = 80%` `vm_mon.process_low_watermark = 60%` |
| mnesia_transaction_manager_overload | Warning | mnesia overloaded; mailbox size: N | mailbox size = N | `sysmon.mnesia_tm_mailbox_threshold = 500` |
| broker_pool_overload | Warning | broker pool overloaded; mailbox size: N | mailbox size = N | `sysmon.broker_pool_mailbox_threshold = 500` |
| partition | Critical | Partition occurs at node | Partition occurs at node ~s | - |
| resource | Critical | Resource is disconnected | Resource ~s(~s) is down | - |
| conn_congestion | Critical | Connection process congestion | connection congested | - |
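
The two mailbox alarms in the table can be read as a plain threshold check: a warning is raised when the mailbox (message queue) of the monitored process grows beyond the configured size. The Python sketch below illustrates that reading. It is not EMQX's implementation; the function name `check_mailbox_alarm`, the dictionary layout, and the strict `>` comparison are assumptions made for illustration. Only the alarm names, the `mailbox size = N` detail text, and the default threshold of `500` come from the table.

```python
# Illustrative sketch only; not EMQX's implementation.
# Default thresholds as listed in the alarm table above.
MAILBOX_THRESHOLDS = {
    "mnesia_transaction_manager_overload": 500,  # sysmon.mnesia_tm_mailbox_threshold
    "broker_pool_overload": 500,                 # sysmon.broker_pool_mailbox_threshold
}


def check_mailbox_alarm(alarm_name, mailbox_size, thresholds=MAILBOX_THRESHOLDS):
    """Return an alarm dict if the mailbox size is above the threshold, else None.

    Assumption: the alarm fires only when the size is strictly greater
    than the configured threshold.
    """
    threshold = thresholds[alarm_name]
    if mailbox_size > threshold:
        return {
            "name": alarm_name,
            "level": "warning",
            "details": f"mailbox size = {mailbox_size}",
        }
    return None


if __name__ == "__main__":
    # A mailbox of 750 messages is above the default threshold of 500.
    print(check_mailbox_alarm("broker_pool_overload", 750))
```

Raising a threshold makes the alarm less sensitive to short bursts; lowering it surfaces backlog earlier at the cost of more frequent warnings.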

**Alarm list for EMQX Enterprise edition:**

@@ -56,7 +58,7 @@ The levels are defined from development perspectives and are only for recommenda

## Get Alarms

EMQX provides several ways to get alarms and check detailed alarm information. One way is to view them on the EMQX Dashboard, which lists active and historical alarms. The Dashboard, however, only offers a central overview of the alarms that have been triggered. Another way is to subscribe to system topics through MQTT to receive real-time alarm notifications with detailed information. Alarms can also be accessed from the log or via the REST API.

### View Alarms on Dashboard

@@ -116,8 +118,8 @@ The settings for alarms can only be configured by modifying the configuration it

Alarm thresholds can be configured on EMQX Dashboard. There are two ways to launch the **Monitoring** page for configuring the alarm thresholds:

1. On the **Alarms** page, click the **Setting** button and you will be led to the **Monitoring** page.
2. From the left navigation menu, click **Management** -> **Monitoring**.

On the **Monitoring** -> **System** tab, click the **Erlang VM** tab to configure the following items for the system performance of the Erlang Virtual Machine:

@@ -129,7 +131,7 @@ On the **Monitoring** -> **System** tab, click the **Erlang VM** tab, you can c
- **Process low watermark**: Specify the threshold, as a percentage, of processes that can simultaneously exist at the local node. When the percentage drops to the specified value, the alarm is cleared. The default value is `60` percent.

- **Enable Long GC monitoring**: Disabled by default. When enabled, a warning-level log `long_gc` is emitted and an MQTT message is published to the system topic `$SYS/sysmon/long_gc` when an Erlang process spends a long time performing garbage collection.
- **Enable Long Schedule monitoring**: Enabled by default, which means that when the Erlang VM detects a task that has been scheduled for too long, a warning-level log `long_schedule` is emitted. You can set the time limit for a scheduled task in the text box. The default value is `240` milliseconds.

- **Enable Large Heap monitoring**: Enabled by default, which means that when an Erlang process consumes a large amount of memory for its heap space, a warning-level log `large_heap` is emitted, and an MQTT message is published to the system topic `$SYS/sysmon/large_heap`. You can set the heap size limit in the text box. The default value is `32` MB.
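
The high and low watermark pair used for the process count (and for CPU usage in the alarm table) forms a hysteresis band: the alarm is raised once usage goes above the high watermark and is cleared only after usage falls back to the low watermark. A minimal Python sketch of that behaviour follows, using the default values of `80` and `60` percent. The class name `WatermarkAlarm` and the exact boundary comparisons (`>` to activate, `<=` to clear) are assumptions for illustration, not EMQX's implementation.

```python
class WatermarkAlarm:
    """Tracks one alarm with a high/low watermark pair (hysteresis)."""

    def __init__(self, high=80.0, low=60.0):
        if low > high:
            raise ValueError("low watermark must not exceed high watermark")
        self.high = high
        self.low = low
        self.active = False

    def update(self, usage_percent):
        """Feed one usage sample; return 'activated', 'deactivated', or None."""
        if not self.active and usage_percent > self.high:
            # Usage crossed the high watermark: raise the alarm.
            self.active = True
            return "activated"
        if self.active and usage_percent <= self.low:
            # Usage fell back to the low watermark: clear the alarm.
            self.active = False
            return "deactivated"
        # Inside the band (or unchanged state): nothing happens.
        return None


if __name__ == "__main__":
    alarm = WatermarkAlarm()
    for sample in (50, 85, 70, 59):
        print(sample, alarm.update(sample), alarm.active)
```

The gap between the two watermarks keeps the alarm from flapping when usage hovers around a single threshold: a sample of 70 percent neither raises an inactive alarm nor clears an active one.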

22 changes: 12 additions & 10 deletions zh_CN/observability/alarms.md
@@ -28,15 +28,17 @@ EMQX provides built-in monitoring and alarm functionality for monitoring internal state changes, such as

**Alarm list for EMQX Open Source edition:**

| **Alarm**                           | Level    | Description                                                                  | **Details**                              | **Threshold**                                                |
| ----------------------------------- | -------- | ---------------------------------------------------------------------------- | ---------------------------------------- | ------------------------------------------------------------ |
| high_system_memory_usage            | Warning  | System memory usage is too high                                              | "System memory usage is higher than ~p%" | `os_mon.sysmem_high_watermark = 70%`                         |
| high_process_memory_usage           | Warning  | Single Erlang process memory usage is too high (percentage of system memory) | Process memory usage is higher than ~p%  | `os_mon.procmem_high_watermark = 5%`                         |
| high_cpu_usage                      | Warning  | CPU usage is too high                                                        | ~p% CPU usage                            | `os_mon.cpu_high_watermark = 80%` `os_mon.cpu_low_watermark = 60%` |
| too_many_processes                  | Warning  | Too many processes                                                           | ~p% process usage                        | `vm_mon.process_high_watermark = 80%` `vm_mon.process_low_watermark = 60%` |
| mnesia_transaction_manager_overload | Warning  | mnesia transaction manager overloaded; mailbox message count: N              | mailbox size = N                         | `sysmon.mnesia_tm_mailbox_threshold = 500`                   |
| broker_pool_overload                | Warning  | broker message processing pool overloaded; mailbox message count: N          | mailbox size = N                         | `sysmon.broker_pool_mailbox_threshold = 500`                 |
| partition                           | Critical | Partition occurs at node                                                     | Partition occurs at node ~s              | -                                                            |
| resource                            | Critical | Resource is disconnected                                                     | Resource ~s(~s) is disconnected          | -                                                            |
| conn_congestion                     | Critical | Connection process congestion                                                | Connection congested                     | -                                                            |

**Alarm list for EMQX Enterprise edition:**

@@ -125,7 +127,7 @@ EMQX provides various ways to get alarms and check detailed information. One way is
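- **Process limit check interval**: Specify the time interval for periodically checking the process limit. The default value is `30` seconds.
- **Process high watermark**: Specify the threshold, as a percentage, of processes that can simultaneously exist at the local node. When the percentage exceeds the specified value, an alarm is triggered. The default value is `80` percent.
- **Process low watermark**: Specify the threshold, as a percentage, of processes that can simultaneously exist at the local node. When the percentage drops to the specified value, the alarm is cleared. The default value is `60` percent.
- **Enable Long GC monitoring**: Disabled by default. When enabled, a warning-level log `long_gc` is emitted and an MQTT message is published to the system topic `$SYS/sysmon/long_gc` when an Erlang process spends a long time performing garbage collection.
- **Enable Long Schedule monitoring**: Enabled by default, which means that when the Erlang VM detects a task that has been scheduled for too long, a warning-level log `long_schedule` is emitted. You can set the time limit for a scheduled task in the text box. The default value is `240` milliseconds.
- **Enable Large Heap monitoring**: Enabled by default, which means that when an Erlang process consumes a large amount of memory for its heap space, a warning-level log `large_heap` is emitted, and an MQTT message is published to the system topic `$SYS/sysmon/large_heap`. You can set the heap size limit in the text box. The default value is `32` MB.
- **Enable Busy Distributed Port monitoring**: Enabled by default, which means that when the remote procedure call (RPC) connection used to communicate with other nodes in the cluster is overloaded, a warning-level log `busy_dis_port` is emitted, and an MQTT message is published to the system topic `$SYS/sysmon/busy_dis_port`.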
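
Several of the monitors above publish to system topics under `$SYS/sysmon/`, so a client could receive all of them with one wildcard subscription such as `$SYS/sysmon/#`. The Python sketch below shows how standard MQTT topic-filter matching treats such a filter. It is a simplified illustration: the function name `topic_matches` is hypothetical, the filter is assumed to be well formed (`#` only as the last level), and no broker connection is made.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic filter matches a topic name.

    Implements the standard wildcard rules: '+' matches exactly one level,
    '#' matches the parent level and any number of child levels.
    """
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")

    # A filter that starts with a wildcard never matches '$'-prefixed topics.
    if topic.startswith("$") and filter_levels[0] in ("+", "#"):
        return False

    for index, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: everything from here on matches.
            return True
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False

    # Without '#', filter and topic must have the same number of levels.
    return len(filter_levels) == len(topic_levels)


if __name__ == "__main__":
    for name in ("long_gc", "large_heap", "busy_dis_port"):
        print(name, topic_matches("$SYS/sysmon/#", f"$SYS/sysmon/{name}"))
```

Note that a bare `#` subscription does not cover these topics, because wildcard filters that start at the first level exclude topics beginning with `$`; the filter has to name `$SYS` explicitly.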