perf: cache stack hash to node in stacktraceTree #3827
Conversation
See the inline comment for more context. Profiling against my workload showed the compactor spending significant CPU time in `stacktraceTree.insert()`. Staring at the code, I realized it was performing a full stack walk to populate/resolve the leaf-most tree node representing the stack. Since many stacks are identical, we can benefit from caching the mapping from each unique stack back to its node index. The speedup is greater when many stacks repeat and when stacks are deep. My data set consists largely of JVM profiles, and Java is infamous for very deep stacks, so this yields a considerable speedup for me.
Can we add a benchmark to measure the difference?
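A micro-benchmark along these lines could quantify the effect: many deep, frequently repeated stacks, mimicking the compactor workload described above. This is a sketch only; newStacktraceTree and the package layout are assumptions based on the code shown elsewhere in this thread.

package symdb

import (
	"math/rand"
	"testing"
)

// Hypothetical micro-benchmark: insert deep, frequently repeated
// stacks so that the digest cache is exercised on the hot path.
func BenchmarkStacktraceTreeInsert(b *testing.B) {
	const (
		distinct = 1000 // number of unique stacks
		depth    = 128  // frames per stack (deep, JVM-like)
	)
	rng := rand.New(rand.NewSource(1))
	stacks := make([][]uint64, distinct)
	for i := range stacks {
		s := make([]uint64, depth)
		for j := range s {
			s[j] = rng.Uint64() % 4096 // bounded set of location IDs
		}
		stacks[i] = s
	}
	t := newStacktraceTree(0) // assumed constructor
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		t.insert(stacks[i%distinct]) // repeats should hit the cache
	}
}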
Btw, there should not be fully identical samples in the pprof converted from JFR due to this: https://github.com/grafana/jfr-parser/blob/a8d22a1cd731f0ef8b48f84c5bab58532a9af541/pprof/pprof.go#L66 How do you ingest your data? Do you ingest JFR to Pyroscope?
For single pprof profiles, yes, there should be one entry for each distinct stack (assuming the pprof generator deduplicated, which the JFR converting code does). But the compactor will see multiple occurrences of the same stack from different source blocks, no? (I found this hot code when looking at compactor performance.)
// Many stacks are repeating. So we benefit from an optimization that
// can quickly map the input sequence back to a node without
// walking the tree. We simply cache a map of stack digest back to the
// node index. If there's a hit, we avoid a stack walk to resolve
// the leaf node. If not, we pay a penalty for computing the hash
// and performing a map lookup.
digest := hashLocations(refs)
existing, ok := t.hashedStacks[digest]
if ok {
	return existing
}
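For context (this part is not in the hunk above), the change presumably also populates the cache on a miss after the usual walk. A minimal sketch of the whole method under that assumption; walkInsert is a hypothetical stand-in for the pre-existing full stack walk:

// insert resolves the leaf node for a stack, consulting the digest
// cache first and filling it on a miss.
func (t *stacktraceTree) insert(refs []uint64) uint32 {
	digest := hashLocations(refs)
	if existing, ok := t.hashedStacks[digest]; ok {
		return existing // cache hit: no stack walk needed
	}
	// Cache miss: fall back to the full walk, then remember the result
	// so the next occurrence of this stack resolves in O(1).
	id := t.walkInsert(refs) // hypothetical: the original walk logic
	t.hashedStacks[digest] = id
	return id
}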
This is a valid approach, and your reasoning is correct, if I understand it correctly.
However, we already have a caching layer on top of trees (so-called chunks), and we never access a tree directly, bypassing the cache in the write path (valid for both ingesters and compactors):
pyroscope/pkg/phlaredb/symdb/partition_memory.go
Lines 63 to 113 in 26950d5
func (p *stacktraces) append(dst []uint32, s []*schemav1.Stacktrace) {
	if len(s) == 0 {
		return
	}
	var (
		id     uint32
		found  bool
		misses int
	)
	p.m.RLock()
	for i, x := range s {
		if dst[i], found = p.hashToIdx[hashLocations(x.LocationIDs)]; !found {
			misses++
		}
	}
	p.m.RUnlock()
	if misses == 0 {
		return
	}
	// NOTE(kolesnikovae):
	//
	// Maybe we don't need this map at all: tree insertion might be
	// done in a thread safe fashion, and optimised to the extent
	// that its performance is comparable with:
	//   map_read + r_(un)lock + map_overhead +
	//   miss_rate * (map_write + w_(un)lock)
	//
	// Instead of inserting stacks one by one, it is better to
	// build a tree, and merge it to the existing one.
	p.m.Lock()
	defer p.m.Unlock()
	for i, v := range dst[:len(s)] {
		if v != 0 {
			// Already resolved. ID 0 is reserved
			// as it is the tree root.
			continue
		}
		x := s[i].LocationIDs
		// Tree insertion is idempotent,
		// we don't need to check the map.
		id = p.tree.insert(x)
		h := hashLocations(x)
		p.hashToIdx[h] = id
		dst[i] = id
	}
}
(if we revive chunking, the cache is to be shared across chunks)
And it works in the same way: locations hash => stack trace ID (leaf node index) lookup:
- There is a chance of hash collisions, and we accept this risk (see the hashing sketch after this list).
- We eliminate an extra map lookup by relying on insertion idempotence: since the stack traces we want to insert might be added by another goroutine between the read and write locks, we may write some of them again. Given that there are very few such writes, this is more efficient than re-checking each stack trace after acquiring the write lock.
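For reference, hashLocations reduces the whole location-ID sequence to a single 64-bit digest, which is where the accepted collision risk above comes from. A plausible shape, assuming a streaming xxhash over the IDs (the real implementation may differ):

import (
	"encoding/binary"

	"github.com/cespare/xxhash/v2"
)

// hashLocations folds a stack's location IDs into one 64-bit digest.
// Two different stacks can, with low probability, share a digest.
func hashLocations(locs []uint64) uint64 {
	d := xxhash.New()
	var buf [8]byte
	for _, l := range locs {
		binary.LittleEndian.PutUint64(buf[:], l)
		_, _ = d.Write(buf[:]) // Write on an xxhash digest never errors
	}
	return d.Sum64()
}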
Therefore, I'm wondering if we benefit from adding a cache at the tree level. I can say for sure that it will increase memory consumption significantly, while I can't see how it improves performance.
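For a rough sense of scale (illustrative numbers, not a measurement): each cached map[uint64]uint32 entry costs on the order of tens of bytes once Go map overhead is included, so caching tens of millions of unique stacks would add hundreds of MB of heap on top of the tree itself.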
This is how it looks in a loaded cluster (half of a 10G link of ingress, samples over 6 hours):
[profile screenshot]
In compactors:
[profile screenshot]
Note that there's a known issue with the compaction process that makes the cache less helpful; we're aware of it and are actively working on a solution.
Thank you for the contribution @airbnb-gps! I believe the proposed optimisation has already been implemented; please check my comment.
I agree with your assessment. I still don't have a great way to reproduce my performance tests, so my methodology was flawed, leading to faulty conclusions on my part. Sorry for the noise!
No worries at all! Thank you for a good PR, and I hope you can contribute something else another time :) |