This component has two goals:
- Monitor the OpenSearch clusters managed by the operator, as well as the operator itself.
- Supply an out-of-the-box monitoring and alerting solution, allowing the user to immediately view the metrics from the first goal and get built-in alerts based on them.
The most popular open source monitoring system today is Prometheus. It works by periodically scraping an HTTP endpoint that exposes metrics and their values in the Prometheus text-based exposition format. OpenSearch doesn't provide a native Prometheus endpoint (yet), therefore the OpenSearch Operator will install an OpenSearch plugin called opensearch-prometheus-exporter, which exposes node metrics on every node and, on one designated node, also exposes cluster and index metrics.
The operator is a Go-based process that will expose its own metrics (per component) in the Prometheus text-based exposition format on the default metrics endpoint.
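Assuming the operator is built on controller-runtime (as kubebuilder-based operators typically are), the manager already serves these metrics in the Prometheus text format. A minimal sketch follows; the `:8080` bind address is illustrative, and the `MetricsBindAddress` option name is from the controller-runtime versions current at the time of writing:

```go
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	ctrl.SetLogger(zap.New())

	// The manager serves the Prometheus text-based exposition format on
	// /metrics at the configured bind address (":8080" is illustrative).
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		MetricsBindAddress: ":8080",
	})
	if err != nil {
		os.Exit(1)
	}

	// The operator's controllers would be registered with mgr here.

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}
```

With this configuration the metrics are served at `/metrics` on port 8080 of the operator pod.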
The most popular open source solution for monitoring and alerting today is Prometheus, serving both as the time-series database the metrics are written to and read from (queried) and as the alerting engine that evaluates alerts. The most popular solution for viewing metrics on dashboards is Grafana. The OpenSearch Operator will install both, along with supporting components:
- Prometheus: Pulls metrics from OpenSearch and the OpenSearch Operator and writes them into its own time-series database.
- Alertmanager: A component from the Prometheus project that handles de-duplication and notifications.
- Grafana: The UI for viewing the metrics; it will come with pre-installed dashboards for those metrics.
Prometheus will be managed by the Prometheus Operator, which will be installed by the Helm charts of the OpenSearch Operator.
```mermaid
flowchart LR
subgraph opensearch-operator-ns [OpenSearch Operator Namespace]
direction BT
service-monitor-controller[[ServiceMonitor CRD]]-- references -->opensearch-controller[/OpenSearch Controller/]
end
subgraph opensearch-clusters
subgraph opensearch-cluster-1 [OpenSearch Cluster 1 Namespace]
direction BT
subgraph ss1 [OpenSearch Nodes StatefulSets]
subgraph node11 [OpenSearch Node]
plugin11[Prometheus Exporter Plugin]
end
subgraph node12 [OpenSearch Node]
plugin12[Prometheus Exporter Plugin]
end
end
service-monitor-ss-1[[ServiceMonitor CRD]]-- references -->plugin11
service-monitor-ss-1-- references -->plugin12
end
subgraph opensearch-cluster-2 [OpenSearch Cluster 2 Namespace]
direction BT
subgraph ss2 [OpenSearch Nodes StatefulSets]
subgraph node21 [OpenSearch Node]
plugin21[Prometheus Exporter Plugin]
end
subgraph node22 [OpenSearch Node]
plugin22[Prometheus Exporter Plugin]
end
end
service-monitor-ss-2[[ServiceMonitor CRD]]-- references -->plugin21
service-monitor-ss-2-- references -->plugin22
end
end
subgraph prometheus-operator-ns [Prometheus Operator Namespace]
direction TB
prometheus-controller[/Prometheus Controller/]
alert-manager[/Alert Manager/]
grafana[/Grafana/]-- Query -->prometheus[/Prometheus/]
prometheus-. Node Discovery .->service-monitor-ss-1
prometheus-. Node Discovery .->service-monitor-ss-2
prometheus-- Read Metrics -->plugin11
end
opensearch-controller-. Deploys .->service-monitor-ss-1
```
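As the diagram shows, the OpenSearch Controller deploys a ServiceMonitor per cluster so that Prometheus (via the Prometheus Operator) can discover the nodes to scrape. A minimal sketch of how such an object could be built with the Prometheus Operator API types follows; the names, labels and port name are illustrative, and `/_prometheus/metrics` is assumed to be the path exposed by the exporter plugin:

```go
package controllers

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// serviceMonitorFor builds a ServiceMonitor for one OpenSearch cluster.
// The label selector and port name are illustrative placeholders.
func serviceMonitorFor(clusterName, namespace string) *monitoringv1.ServiceMonitor {
	return &monitoringv1.ServiceMonitor{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-monitor",
			Namespace: namespace,
		},
		Spec: monitoringv1.ServiceMonitorSpec{
			// Select the service in front of the cluster's StatefulSets.
			Selector: metav1.LabelSelector{
				MatchLabels: map[string]string{"opensearch-cluster": clusterName},
			},
			NamespaceSelector: monitoringv1.NamespaceSelector{
				MatchNames: []string{namespace},
			},
			Endpoints: []monitoringv1.Endpoint{{
				Port: "http", // service port exposing the OpenSearch REST API
				Path: "/_prometheus/metrics",
			}},
		},
	}
}
```

The controller would then create or update this object with its Kubernetes client during reconciliation.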
- Permissions to view cluster metrics will be limited in Grafana
Default Prometheus metrics emitted by the Go application. Here is a list of these metrics:
```
go_gc_duration_seconds
go_gc_duration_seconds_sum
go_gc_duration_seconds_count
go_goroutines
go_info{version="go1.16.4"} 1
go_memstats_alloc_bytes
go_memstats_alloc_bytes_total
go_memstats_buck_hash_sys_bytes
go_memstats_frees_total
go_memstats_gc_cpu_fraction
go_memstats_gc_sys_bytes
go_memstats_heap_alloc_bytes
go_memstats_heap_idle_bytes
go_memstats_heap_inuse_bytes
go_memstats_heap_objects
go_memstats_heap_released_bytes
go_memstats_heap_sys_bytes
go_memstats_last_gc_time_seconds
go_memstats_lookups_total
go_memstats_mallocs_total
go_memstats_mcache_inuse_bytes
go_memstats_mcache_sys_bytes
go_memstats_mspan_inuse_bytes
go_memstats_mspan_sys_bytes
go_memstats_next_gc_bytes
go_memstats_other_sys_bytes
go_memstats_stack_inuse_bytes
go_memstats_stack_sys_bytes
go_memstats_sys_bytes
go_threads
promhttp_metric_handler_requests_in_flight 1
promhttp_metric_handler_requests_total{code="200"} 0
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
The simple way to enable them is described in [Instrumenting a Go application for Prometheus](https://prometheus.io/docs/guides/go-application/).
These metrics will be collected from the controller.
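Following the guide above, a minimal sketch of exposing these default metrics from a plain Go process (client_golang's default registry already includes the go_* and promhttp_* collectors listed earlier; the `:2112` port is just the guide's example):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler() serves client_golang's default registry, which
	// already includes the go_* and promhttp_* collectors listed above.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```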
The third group of metrics consists of custom metrics that could be collected from the controller. Suggested metrics are listed below; a sketch of how they could be registered follows the table.
| Metric | Description |
|---|---|
| os_restart_total | Number of times a node has restarted |
| os_cluster_management_state_info | Management state used by the cluster |
| os_storage_info | Number of nodes using ephemeral or persistent storage |
| os_redundancy_policy_info | Redundancy policy used by the cluster |
| os_index_retention_seconds | Number of seconds that documents are retained |
| os_defined_delete_namespaces_total | Number of defined namespaces deleted per index policy |
| os_misconfigured_memory_resources_info | Number of nodes with misconfigured memory resources |
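A minimal sketch of how two of the suggested metrics could be defined and registered on controller-runtime's global registry, so they are served from the same /metrics endpoint as the default metrics; the label names are illustrative:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Counter backing os_restart_total; the label names are illustrative.
	NodeRestartTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "os_restart_total",
			Help: "Number of times a node has restarted",
		},
		[]string{"cluster", "node"},
	)

	// Gauge backing os_misconfigured_memory_resources_info.
	MisconfiguredMemoryResources = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "os_misconfigured_memory_resources_info",
			Help: "Number of nodes with misconfigured memory resources",
		},
		[]string{"cluster"},
	)
)

func init() {
	// Everything registered on controller-runtime's global registry is
	// served from the manager's /metrics endpoint.
	ctrlmetrics.Registry.MustRegister(NodeRestartTotal, MisconfiguredMemoryResources)
}
```

The controller would then call, for example, `NodeRestartTotal.WithLabelValues("my-cluster", "node-0").Inc()` whenever it observes a restart.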
The opensearch-prometheus-exporter plugin exposes metrics for each node, as well as cluster-level and index-level metrics.
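These plugin metrics are what the built-in alerts (the second goal above) would be written against. A sketch of one such alert expressed as a PrometheusRule object follows; the metric name `opensearch_cluster_status` and its value mapping (2 meaning red) are assumptions about the exporter plugin's output, not confirmed names:

```go
package controllers

import (
	monitoringv1 "github.com/prometheus-operator/prometheus-operator/pkg/apis/monitoring/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// clusterHealthRule sketches one built-in alert on the plugin's cluster-level
// metrics; the rule, metric name and threshold are illustrative assumptions.
func clusterHealthRule(namespace string) *monitoringv1.PrometheusRule {
	return &monitoringv1.PrometheusRule{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "opensearch-cluster-health",
			Namespace: namespace,
		},
		Spec: monitoringv1.PrometheusRuleSpec{
			Groups: []monitoringv1.RuleGroup{{
				Name: "opensearch.cluster",
				Rules: []monitoringv1.Rule{{
					Alert:  "OpenSearchClusterRed",
					Expr:   intstr.FromString(`opensearch_cluster_status == 2`),
					Labels: map[string]string{"severity": "critical"},
					Annotations: map[string]string{
						"summary": "OpenSearch cluster health is red",
					},
				}},
			}},
		},
	}
}
```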
- Default metrics publication
- Prometheus-plugin integration
- OpenSearch controller metrics publication
- Deployment of the service-monitor for every cluster
- Deployment of the Prometheus-operator
- Prometheus-operator integration
- Grafana dashboards development
- Grafana permissions testing