Add more prometheus metrics #141
Conversation
if ok, key := parseMetadataPath(path); ok {
	return server.getMetadata(ctx, httpHeaders, key)

archive, handler = key, "tile"
status, headers, data = server.getTile(ctx, headers, key, z, x, y, ext)
One concern here is that someone could make a bunch of random requests to nonexistent archives, and all of the 404s for unique resources would blow up the metric response size. To avoid that, we could either:
- try to map all "archive does not exist" responses to handler="404" archive="" (but we don't want to miss logging legit failures about a bad connection to an archive you care about)
- or just recommend that applications put this behind a proxy that allows only requests to archives you know exist
Thoughts @bdon?
I would prefer 1). For 2), we would have to recommend putting the whole go-pmtiles application behind an allowlist proxy, right? That is too strong a limitation imho.
For 1), I think discovering failures about missing archives is outside the scope of performance metrics; I don't have a lot of experience with how people use Prometheus, though.
The case I want to catch is when people are reading from an archive and it suddenly goes missing, or the server starts returning 500 errors: we want to be able to see which archive is no longer working. We could limit the per-archive metrics to only archives for which we've gotten an OK response for the root directory, but that would miss the case where the server restarts and fails to read at startup. It would also mean two metrics you need to alert on to know if tiles are down.
Is there a way to limit the universe of stored tileset names to ~100?
https://prometheus.io/docs/practices/naming/#labels seems to recommend against unbounded cardinality... I feel like detecting a spike in 404s/500s and inspecting logs is good enough?
Sounds good; changed the bucket and overall request metrics to set archive to "" on not-found errors.
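A minimal sketch of that normalization, assuming a counter keyed on handler/archive/status; the metric and function names here (requestsTotal, observeRequest) are illustrative, not the PR's actual identifiers:

package main

import (
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative counter carrying the labels discussed above.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "pmtiles",
		Name:      "requests_total",
		Help:      "Requests served, by handler, archive, and status.",
	},
	[]string{"handler", "archive", "status"},
)

func init() { prometheus.MustRegister(requestsTotal) }

// observeRequest collapses the archive label to "" on 404s, so probes
// against nonexistent archives all share one series instead of growing
// the label set without bound.
func observeRequest(handler, archive string, status int) {
	if status == http.StatusNotFound {
		archive = ""
	}
	requestsTotal.WithLabelValues(handler, archive, strconv.Itoa(status)).Inc()
}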
func createMetrics(scope string, logger *log.Logger) *metrics {
	namespace := "pmtiles"
	durationBuckets := prometheus.DefBuckets
The default buckets are 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, and 10s.
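For reference, prometheus.DefBuckets in client_golang is []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}, which matches the list above. A minimal sketch of how a createMetrics-style constructor might use it; the field and metric names are illustrative, not the PR's exact definitions:

package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

type metrics struct {
	requestDuration *prometheus.HistogramVec
}

func createMetrics(scope string, logger *log.Logger) *metrics {
	namespace := "pmtiles"
	m := &metrics{
		requestDuration: prometheus.NewHistogramVec(
			prometheus.HistogramOpts{
				Namespace: namespace,
				Name:      "request_duration_seconds",
				Help:      "Request latency in seconds.",
				// DefBuckets covers 5ms through 10s, as noted above.
				Buckets: prometheus.DefBuckets,
			},
			[]string{"handler", "archive", "status"},
		),
	}
	prometheus.MustRegister(m.requestDuration)
	return m
}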
Add more Prometheus metrics to the pmtiles server.
Also fixed a few things I found while testing out the corner cases.
The impact on performance seems to be negligible: tilesiege with 1.25M requests ran at around 37k qps both before and after this change.
Example /metrics response after a bunch of requests to the "planet" archive (which exists) and the "plaet" archive (which does not).
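The response body itself isn't reproduced here. For reference, the usual way to expose such an endpoint with client_golang is promhttp; this is a sketch, and the PR's actual wiring and port may differ:

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler serves the default registry in the Prometheus
	// text exposition format at /metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}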