prometheus logger: fix potential unlimited memory usage (#529)
* support golang-lru cache to avoid unlimited memory usage
* move counter to gauge
* rename metrics
* Update docs and fix tests
dmachard authored Jan 3, 2024
1 parent c08d285 commit 8cd4d0f
Showing 9 changed files with 456 additions and 467 deletions.
38 changes: 38 additions & 0 deletions config.yml
@@ -274,6 +274,44 @@ multiplexer:
# chan-buffer-size: 65535
# # compute histogram for qnames length, latencies, queries and replies size repartition
# histogram-metrics-enabled: false
# # compute requesters metrics - total and top requesters
# requesters-metrics-enabled: true
# # compute domains metrics - total and top domains
# domains-metrics-enabled: true
# # compute NOERROR domains metrics - total and top domains
# noerror-metrics-enabled: true
# # compute SERVFAIL domains metrics - total and top domains
# servfail-metrics-enabled: true
# # compute NXDOMAIN domains metrics - total and top domains
# nonexistent-metrics-enabled: true
# # compute TIMEOUT domains metrics - total and top domains
# timeout-metrics-enabled: true
# # prometheus-labels: (list of strings) labels to add to metrics. Currently supported labels: stream_id, resolver, stream_global
# prometheus-labels: ["stream_id"]
# # LRU (least-recently-used) cache size for observed DNS clients
# requesters-cache-size: 250000
# # maximum time (in seconds) before eviction from the LRU cache
# requesters-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed domains
# domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed NOERROR domains
# noerror-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# noerror-domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed SERVFAIL domains
# servfail-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# servfail-domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed NX domains
# nonexistent-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# nonexistent-domains-cache-ttl: 3600
# # LRU (least-recently-used) cache size for observed other domains (suspicious, tlds, ...)
# default-domains-cache-size: 500000
# # maximum time (in seconds) before eviction from the LRU cache
# default-domains-cache-ttl: 3600

# # write captured dns traffic to text or binary files with rotation and compression support
# logfile:
1 change: 1 addition & 0 deletions dnsutils/constant.go
@@ -4,6 +4,7 @@ const (
ProtoDoT = "DOT"
ProtoDoH = "DOH"

DNSRcodeNoError = "NOERROR"
DNSRcodeNXDomain = "NXDOMAIN"
DNSRcodeServFail = "SERVFAIL"
DNSRcodeTimeout = "TIMEOUT"
54 changes: 47 additions & 7 deletions docs/loggers/logger_prometheus.md
@@ -19,7 +19,17 @@ Options:
- `top-n`: (string) default number of items kept in each topN ranking
- `chan-buffer-size`: (integer) channel buffer size for incoming DNS messages, i.e. the number of messages buffered before new ones are dropped.
- `histogram-metrics-enabled`: (boolean) compute histogram for qnames length, latencies, queries and replies size repartition
- `prometheus-labels`: (list of strings) labels to add to metrics. Currently supported labels: `stream_id`, `resolver`
- `prometheus-labels`: (list of strings) labels to add to metrics. Currently supported labels: `stream_id` (default), `stream_global`, `resolver`
- `requesters-cache-size`: (integer) LRU (least-recently-used) cache size for observed DNS clients per stream
- `requesters-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed domains per stream
- `domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `noerror-domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed NOERROR domains per stream
- `noerror-domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `servfail-domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed SERVFAIL domains per stream
- `servfail-domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
- `nonexistent-domains-cache-size`: (integer) LRU (least-recently-used) cache size for observed NX domains per stream
- `nonexistent-domains-cache-ttl`: (integer) maximum time (in seconds) before eviction from the LRU cache
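
To illustrate how a size cap combined with a TTL keeps memory bounded, here is a minimal Go sketch of an LRU set built only on the standard library. It is not the collector's code: the commit message says the logger relies on golang-lru, and the names `boundedSet`, `Observe`, `Len` and `demo` are hypothetical. The value returned by `Len` is the kind of figure a `*_lru` gauge could plausibly report.

```go
package main

import (
	"container/list"
	"fmt"
	"time"
)

// entry is one observed key together with its expiry deadline.
type entry struct {
	key     string
	expires time.Time
}

// boundedSet is a hypothetical LRU set with a size cap and a per-entry TTL.
type boundedSet struct {
	size  int
	ttl   time.Duration
	order *list.List               // front = most recently observed
	items map[string]*list.Element // key -> node in order
}

func newBoundedSet(size int, ttl time.Duration) *boundedSet {
	if size < 1 {
		size = 1 // assumption: treat non-positive sizes as a cache of one
	}
	return &boundedSet{size: size, ttl: ttl, order: list.New(), items: make(map[string]*list.Element)}
}

// remove drops one node from both the list and the index.
func (s *boundedSet) remove(el *list.Element) {
	delete(s.items, el.Value.(*entry).key)
	s.order.Remove(el)
}

// Observe records key as seen at time now. When the set is full, the least
// recently observed key is evicted first, so memory never exceeds size entries.
func (s *boundedSet) Observe(key string, now time.Time) {
	if el, ok := s.items[key]; ok {
		el.Value.(*entry).expires = now.Add(s.ttl)
		s.order.MoveToFront(el)
		return
	}
	if s.order.Len() >= s.size {
		s.remove(s.order.Back())
	}
	s.items[key] = s.order.PushFront(&entry{key: key, expires: now.Add(s.ttl)})
}

// Len evicts expired entries and returns how many keys remain. Because every
// observation refreshes the deadline and moves the key to the front, the
// oldest deadline always sits at the back (assuming now never goes backwards).
func (s *boundedSet) Len(now time.Time) int {
	for el := s.order.Back(); el != nil && !now.Before(el.Value.(*entry).expires); el = s.order.Back() {
		s.remove(el)
	}
	return s.order.Len()
}

// demo observes three domains in a cache of two, then lets the TTL elapse.
func demo() (full, expired int) {
	start := time.Unix(0, 0)
	domains := newBoundedSet(2, time.Hour)
	for i, q := range []string{"a.example", "b.example", "c.example"} {
		domains.Observe(q, start.Add(time.Duration(i)*time.Minute))
	}
	return domains.Len(start.Add(3 * time.Minute)), domains.Len(start.Add(2 * time.Hour))
}

func main() {
	full, expired := demo()
	fmt.Println(full, expired) // prints "2 0"
}
```

With a cache size of 2, the third domain evicts the first, so the count never exceeds the cap. Once the TTL has elapsed without new observations, the count drops back to zero.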

Default values:

@@ -39,7 +49,25 @@ prometheus:
top-n: 10
chan-buffer-size: 65535
histogram-metrics-enabled: false
requesters-metrics-enabled: true
domains-metrics-enabled: true
noerror-domains-metrics-enabled: true
servfail-domains-metrics-enabled: true
nonexistent-domains-metrics-enabled: true
timeout-domains-metrics-enabled: true
prometheus-labels: ["stream_id"]
requesters-cache-size: 250000
requesters-cache-ttl: 3600
domains-cache-size: 500000
domains-cache-ttl: 3600
noerror-domains-cache-size: 100000
noerror-domains-cache-ttl: 3600
servfail-domains-cache-size: 10000
servfail-domains-cache-ttl: 3600
nonexistent-domains-cache-size: 10000
nonexistent-domains-cache-ttl: 3600
default-domains-cache-size: 1000
default-domains-cache-ttl: 3600
```
Scrape metric with curl:
@@ -55,9 +83,11 @@ The full metrics can be found [here](./../metrics.txt).
| Metric | Notes
|-------------------------------------------------|------------------------------------
| dnscollector_build_info | Build info
| dnscollector_requesters_total | The total number of requesters per stream identity
| dnscollector_nxdomains_total | The total number of NX domains per stream identity
| dnscollector_domains_total | The total number of domains per stream identity
| dnscollector_total_requesters_lru | Total number of DNS clients most recently observed per stream identity
| dnscollector_total_domains_lru | Total number of domains most recently observed per stream identity
| dnscollector_total_noerror_domains_lru | Total number of NOERROR domains most recently observed per stream identity
| dnscollector_total_servfail_domains_lru | Total number of SERVFAIL domains most recently observed per stream identity
| dnscollector_total_nonexistent_domains_lru | Total number of NX domains most recently observed per stream identity
| dnscollector_dnsmessage_total | Counter of total of DNS messages
| dnscollector_queries_total | Counter of total of queries
| dnscollector_replies_total | Counter of total of replies
@@ -77,15 +107,15 @@ The full metrics can be found [here](./../metrics.txt).
| dnscollector_reassembled_total | Total of reassembled DNS messages (TCP level)
| dnscollector_throughput_ops | Number of ops per second received, partitioned by stream
| dnscollector_throughput_ops_max | Max number of ops per second observed, partitioned by stream
| dnscollector_tlds_total | The total number of tld per stream identity
| dnscollector_total_tlds_lru | Total number of TLDs most recently observed per stream identity
| dnscollector_top_domains | Number of hits per domain topN, partitioned by stream and qname
| dnscollector_top_nxdomains | Number of hits per NX domain topN, partitioned by stream and qname
| dnscollector_top_sfdomains | Number of hits per SERVFAIL domain topN, partitioned by stream and qname
| dnscollector_top_requesters | Number of hits per requester topN, partitioned by client IP
| dnscollector_top_tlds | Number of hits per TLD - topN
| dnscollector_top_unanswered | Number of hits per unanswered domain - topN
| dnscollector_unanswered_total | The total number of unanswered domains per stream identity
| dnscollector_suspicious_total | The total number of unanswered domains per stream identity
| dnscollector_total_unanswered_lru | Total number of unanswered domains most recently observed per stream identity
| dnscollector_total_suspicious_lru | Total number of suspicious domains most recently observed per stream identity
| dnscollector_qnames_size_bytes_bucket | Histogram of the size of the qname in bytes
| dnscollector_queries_size_bytes_bucket | Histogram of the size of the queries in bytes.
| dnscollector_replies_size_bytes_bucket | Histogram of the size of the replies in bytes.
@@ -97,3 +127,13 @@ The following [built-in](https://grafana.com/grafana/dashboards/16630) dashboard
<p align="center">
<img src="../_images/dashboard_prometheus.png" alt="dnscollector"/>
</p>

# Merge streams for metrics computation

Use the following setting to consolidate all streams into one for metric computations.

```yaml
prometheus:
....
prometheus-labels: ["stream_global"]
```
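
As a rough illustration of what consolidating streams means for the computed totals, the Go sketch below counts messages per stream identity and collapses every identity into one key when `stream_global` is listed. This is an assumption about the behaviour, not the logger's implementation: `streamKey`, `countByStream` and the `"global"` key are hypothetical names chosen for the example.

```go
package main

import "fmt"

// globalKey is a hypothetical placeholder for the merged stream identity.
const globalKey = "global"

// streamKey returns the identity under which a message is counted:
// a single shared key when "stream_global" is configured, otherwise
// the message's own stream identity.
func streamKey(labels []string, streamID string) string {
	for _, l := range labels {
		if l == "stream_global" {
			return globalKey
		}
	}
	return streamID
}

// countByStream tallies messages per identity as chosen by streamKey.
func countByStream(labels []string, streamIDs []string) map[string]int {
	counts := make(map[string]int)
	for _, id := range streamIDs {
		counts[streamKey(labels, id)]++
	}
	return counts
}

func main() {
	msgs := []string{"tap1", "tap2", "tap1"}
	fmt.Println(countByStream([]string{"stream_id"}, msgs))     // map[tap1:2 tap2:1]
	fmt.Println(countByStream([]string{"stream_global"}, msgs)) // map[global:3]
}
```

With `stream_id`, each stream keeps its own totals. With `stream_global`, all messages land under one identity, so only one set of per-stream state has to be kept.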