❗❗ This repo is no longer maintained. Please visit https://github.com/milvus-io/bootcamp ❗❗
Milvus uses Prometheus to collect and store performance metrics, and uses Grafana, an open-source time-series analytics and visualization platform, to display them.
Milvus collects monitoring data and pushes it to Pushgateway. Meanwhile, Prometheus Server periodically pulls the data from Pushgateway and saves it to its time-series database (TSDB). When an alarm is triggered, Prometheus Server pushes the alarm information to Alertmanager. Grafana is used to visualize the collected data.
1. Prometheus
2. Alertmanager
3. Grafana
This guide first shows how to use Prometheus to collect Milvus monitoring metrics, then how to connect Alertmanager to Prometheus for the alarm mechanism, and finally how to use Grafana to visualize the data.
Download the Prometheus binary tarball, then extract it and enter its directory:
tar xvfz prometheus-*.tar.gz
cd prometheus-*
Download the Pushgateway binary tarball, then extract it and start Pushgateway:
tar xvfz pushgateway-*.tar.gz
cd pushgateway-*
./pushgateway
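Once Pushgateway is running, a quick way to confirm it accepts data is to push a test sample to it. The sketch below is one possible check, not part of the Milvus setup: it assumes Pushgateway listens on 127.0.0.1:9091, and the metric name `demo_metric` and the job name `pushgateway_smoke_test` are hypothetical placeholders.

```python
import urllib.request


def build_payload(samples):
    """Render {name: value} pairs in the Prometheus text exposition format."""
    # One "name value" line per sample; the body has to end with a newline.
    lines = [f"{name} {value}\n" for name, value in samples.items()]
    return "".join(lines).encode("utf-8")


def push_url(host, port, job):
    """Pushgateway groups pushed metrics under /metrics/job/<job>."""
    return f"http://{host}:{port}/metrics/job/{job}"


def push(samples, host="127.0.0.1", port=9091, job="pushgateway_smoke_test"):
    """POST the samples to Pushgateway and return the HTTP status code."""
    request = urllib.request.Request(
        push_url(host, port, job),
        data=build_payload(samples),
        method="POST",
    )
    # urlopen raises an HTTPError if Pushgateway rejects the payload.
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Calling `push({"demo_metric": 1.5})` should return a 2xx status if Pushgateway is reachable, and the sample should then be listed on the Pushgateway web page.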
Enable Prometheus monitoring in server_config.yaml and set the address and port number of Pushgateway:
metric:
  enable: true # Set the value to true to turn on Prometheus monitoring
  address: 127.0.0.1 # Set the IP address of Pushgateway
  port: 9091 # Set the port number of Pushgateway
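After Milvus restarts with this setting, the metrics it pushes should appear on the Pushgateway `/metrics` endpoint. The following sketch lists the metric names found there. It assumes the address and port configured above; the sample metric names in the usage note are hypothetical, since the actual Milvus metric names are not listed in this document.

```python
import urllib.request


def metric_names(exposition_text):
    """Return the sorted metric names found in Prometheus exposition text."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first '{' (labels) or at the first space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)


def fetch_pushgateway_metrics(host="127.0.0.1", port=9091):
    """Download the raw text served by Pushgateway at /metrics."""
    url = f"http://{host}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

For example, `metric_names(fetch_pushgateway_metrics())` returns every metric name Pushgateway currently holds, including its own internal metrics, so Milvus entries have to be picked out by name.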
Download the Prometheus configuration file provided by Milvus:
$ wget https://raw.githubusercontent.com/milvus-io/docs/master/v0.10.3/assets/monitoring/prometheus.yml -O prometheus.yml
Download the Milvus alarm rules file into the Prometheus root directory (the -P rules option saves it in a rules subdirectory):
$ wget -P rules https://raw.githubusercontent.com/milvus-io/docs/master/v0.10.3/assets/monitoring/alert_rules.yml
Edit the Prometheus configuration file according to your requirements:
- global: Configures parameters such as scrape_interval and evaluation_interval.
global:
  scrape_interval: 2s # Set the scrape interval to 2s
  evaluation_interval: 2s # Set the evaluation interval to 2s
- alerting: Sets the address and port of Alertmanager.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
- rule_files: Sets the alarm rule file. The path must point to where the rule file was saved.
rule_files:
  - "alert_rules.yml"
- scrape_configs: Sets information such as job_name and targets for scraping data.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']
Start Prometheus:
./prometheus --config.file=prometheus.yml
Open http://:9090 in a browser to reach the Prometheus web interface.
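Besides the web interface, the Prometheus HTTP API can report whether the scrape targets are reachable. The sketch below reads `/api/v1/targets` and groups target health by job. It assumes Prometheus runs on localhost:9090 and uses the job names `prometheus` and `pushgateway` from the scrape configuration above.

```python
import json
import urllib.request


def target_health(targets_response):
    """Map each scrape job to the health states of its active targets."""
    health = {}
    for target in targets_response["data"]["activeTargets"]:
        job = target["labels"].get("job", "unknown")
        health.setdefault(job, []).append(target["health"])
    return health


def fetch_targets(host="localhost", port=9090):
    """Fetch and decode the JSON served at /api/v1/targets."""
    url = f"http://{host}:{port}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)
```

With both jobs scraped successfully, `target_health(fetch_targets())` would be expected to report `up` for each of them.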
Alertmanager receives the alarm messages sent by Prometheus. The following events need alarm rules:
- Server down
Alarm rule: Send an alarm when the Milvus server goes down.
How to tell: When the Milvus server goes down, the indicators on the monitoring dashboard show No Data.
- The CPU/GPU is too hot
Alarm rule: Send an alarm when the CPU/GPU temperature exceeds 80 °C.
How to tell: Check CPU Temperature and GPU Temperature on the monitoring dashboard.
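The two rules above boil down to simple conditions. The sketch below only illustrates that logic in plain Python; in the actual setup the conditions are defined in the downloaded alert_rules.yml file and evaluated by Prometheus. All function names are hypothetical, and treating an empty sample window as "No Data" is an assumption.

```python
TEMPERATURE_LIMIT_CELSIUS = 80  # threshold stated in the alarm rule


def server_down(samples):
    """Treat an empty sample window as 'No Data', i.e. the server is down."""
    return len(samples) == 0


def too_hot(temperatures_celsius, limit=TEMPERATURE_LIMIT_CELSIUS):
    """Fire when any reading is strictly above the limit."""
    return any(reading > limit for reading in temperatures_celsius)


def evaluate(uptime_samples, cpu_temperatures, gpu_temperatures):
    """Return the names of the alarms that should fire for these readings."""
    alarms = []
    if server_down(uptime_samples):
        alarms.append("server down")
    if too_hot(cpu_temperatures):
        alarms.append("CPU too hot")
    if too_hot(gpu_temperatures):
        alarms.append("GPU too hot")
    return alarms
```

A reading of exactly 80 °C does not fire here because the rule says "exceeds"; adjust the comparison if the rule file uses a different one.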
Download the Alertmanager binary tarball, then extract it and enter its directory:
tar xvfz alertmanager-*.tar.gz
cd alertmanager-*
Create the configuration file alertmanager.yml as described in the Alertmanager configuration documentation, specify the email address that should receive alarm notifications, and place the file in the Alertmanager root directory.
Start the Alertmanager service and specify the configuration file:
./alertmanager --config.file=alertmanager.yml
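To see which alarms Alertmanager has received from Prometheus, its HTTP API can be queried. The sketch below is a minimal example that assumes Alertmanager runs on localhost:9093 and that the installed version exposes the `/api/v2/alerts` endpoint; the alert names in the test data are hypothetical.

```python
import json
import urllib.request


def summarize_alerts(alerts):
    """Reduce each alert object to an (alertname, state) pair."""
    return [
        (alert["labels"].get("alertname", "unnamed"), alert["status"]["state"])
        for alert in alerts
    ]


def fetch_alerts(host="localhost", port=9093):
    """Fetch the list of alerts currently known to Alertmanager."""
    url = f"http://{host}:{port}/api/v2/alerts"
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)
```

An empty list from `summarize_alerts(fetch_alerts())` means no alarm is currently active, which is the expected state for a healthy server.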
- Running Grafana:
docker run -i -p 3000:3000 grafana/grafana
Open http://:3000 in a browser and log in to the Grafana web interface.
In the Grafana web interface, click Configuration > Data Sources > Prometheus, and set the following data source properties:
Field | Definition |
---|---|
Name | Prometheus |
Default | True |
URL | http://:9090 |
Access | Browser |
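The same data source can also be created through the Grafana HTTP API instead of the web interface. The sketch below builds the request from the properties in the table. Several things are assumptions: Grafana at http://localhost:3000 with the default admin/admin credentials, the Prometheus URL passed in by the caller, and the mapping of the "Browser" access mode to the API value `direct`, which may not be available in newer Grafana versions.

```python
import base64
import json
import urllib.request


def datasource_payload(prometheus_url, browser_access=True):
    """Build the JSON body for POST /api/datasources from the table values."""
    return {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        # Assumption: "Browser" in the UI corresponds to "direct" in the API.
        "access": "direct" if browser_access else "proxy",
        "isDefault": True,
    }


def create_datasource(prometheus_url, grafana_url="http://localhost:3000",
                      user="admin", password="admin"):
    """Send the data source definition to Grafana using basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(datasource_payload(prometheus_url)).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)
```

Scripting this step is mainly useful when the Grafana container is recreated often, since the data source defined in the web interface is lost with the container unless its storage is persisted.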
- Download the Grafana configuration file.
- Import the configuration file into Grafana.
- Use the Grafana configuration file provided by Milvus to display the Milvus monitoring metrics, which are listed below.
Milvus performance metrics
Metric | Description |
---|---|
Insert per Second | The number of vectors inserted per second |
Queries per Minute | The number of queries run per minute |
Query Time per Vector | Average query time per vector = query time / number of vectors |
Query Service Level | Query service level = number of queries completed within a given time threshold / total number of queries |
Uptime | How long the Milvus server has been running (minutes) |
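The two derived metrics in the table, Query Time per Vector and Query Service Level, can be illustrated with a short calculation. This is only a sketch of the formulas as stated: the function names are hypothetical, and counting a query that takes exactly the threshold time as "within" the threshold is an assumption.

```python
def query_time_per_vector(query_time_seconds, vector_count):
    """Query time per vector = query time / number of vectors."""
    if vector_count <= 0:
        raise ValueError("vector_count must be positive")
    return query_time_seconds / vector_count


def query_service_level(query_durations_seconds, threshold_seconds):
    """Share of queries that finished within the time threshold."""
    if not query_durations_seconds:
        raise ValueError("at least one query duration is required")
    within = sum(
        1 for duration in query_durations_seconds
        if duration <= threshold_seconds
    )
    return within / len(query_durations_seconds)
```

For instance, a 2-second query over 100 vectors gives 0.02 seconds per vector, and if 3 of 4 queries finish within the threshold the service level is 0.75.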
System performance metrics
Metric | Description |
---|---|
GPU Utilization | GPU utilization rate (%) |
GPU Memory Usage | Amount of GPU memory (GB) currently used by Milvus |
CPU Utilization | CPU utilization (%) = server task execution time / total server elapsed time |
Memory Usage | Current amount of memory used by Milvus (GB) |
Cache Utilization | Cache utilization (%) |
Network IO | Read/write speed of network port (GB/s) |
Disk Read Speed | Disk read speed (GB/s) |
Disk Write Speed | Disk write speed (GB/s) |
Hardware storage metrics
Metric | Description |
---|---|
Data Size | Total amount of data stored by Milvus (GB) |
Total File | Total number of data files stored by Milvus |