ValueMap - separate HashMap for sorted attribs #2288

fraillt · 2024-11-08T10:03:09Z

Changes

Internal structure for ValueMap has changed, in order to achieve several things:

it fully solves Fix metrics dedup/sort bug #2093
it finally fixes collection phase by reducing interference with collection phase as much as possible
it is prerequisite for implementing sharding (I have measured locally 2x+ throughput by implementing sharding locally)
there's no significant difference in benchmark and stress test results (will link results in separate comment)

Merge requirement checklist

CONTRIBUTING guidelines followed
Unit tests added/updated (if applicable)
Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
Changes in public API reviewed (if applicable)

codecov · 2024-11-08T10:07:58Z

Codecov Report

Attention: Patch coverage is 93.84615% with 4 lines in your changes missing coverage. Please review.

Project coverage is 79.4%. Comparing base (a707bb9) to head (c9621ae).

Files with missing lines	Patch %	Lines
opentelemetry-sdk/src/metrics/internal/mod.rs	93.8%	4 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##            main   #2288     +/-   ##
=======================================
- Coverage   79.4%   79.4%   -0.1%     
=======================================
  Files        122     122             
  Lines      20783   20786      +3     
=======================================
+ Hits       16506   16507      +1     
- Misses      4277    4279      +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

fraillt · 2024-11-08T13:37:17Z

Benchmark results:
CPU model name : AMD Ryzen 5 3600 6-Core Processor
Rust version: 1.82

cargo bench --bench metric

Gnuplot not found, using plotters backend
Counter/AddNoAttrs      time:   [7.2384 ns 7.2490 ns 7.2608 ns]
                        change: [-0.3100% +0.1072% +0.5524%] (p = 0.67 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
Counter/AddOneAttr      time:   [45.796 ns 45.867 ns 45.938 ns]
                        change: [-0.5833% -0.2135% +0.0918%] (p = 0.24 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Counter/AddThreeAttr    time:   [99.452 ns 99.608 ns 99.761 ns]
                        change: [-9.9815% -9.7098% -9.4304%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
Counter/AddFiveAttr     time:   [148.26 ns 148.56 ns 148.92 ns]
                        change: [+4.8369% +5.1886% +5.5451%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
Counter/AddTenAttr      time:   [309.47 ns 310.21 ns 311.03 ns]
                        change: [+7.7586% +8.1184% +8.4869%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
Counter/AddOneTillMaxAttr
                        time:   [37.624 µs 37.705 µs 37.795 µs]
                        change: [+1.6454% +1.9274% +2.2247%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
Counter/AddMaxAttr      time:   [78.584 µs 78.793 µs 79.009 µs]
                        change: [+3.7150% +4.0650% +4.4258%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
Counter/AddInvalidAttr  time:   [73.262 ns 73.440 ns 73.638 ns]
                        change: [+2.3110% +2.7092% +3.1217%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  2 (2.00%) high mild
  3 (3.00%) high severe
Counter/AddSingleUseAttrs
                        time:   [304.18 ns 304.78 ns 305.37 ns]
                        change: [-3.0664% -1.5680% -0.0620%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) low severe
  4 (4.00%) low mild
Counter/AddSingleUseInvalid
                        time:   [413.16 ns 413.95 ns 414.79 ns]
                        change: [-6.9220% -6.5482% -6.1999%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Counter/AddSingleUseFiltered
                        time:   [396.07 ns 396.90 ns 397.65 ns]
                        change: [+0.1536% +1.7121% +3.2202%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) low severe
  4 (4.00%) low mild
Counter/CollectOneAttr  time:   [315.57 ns 316.02 ns 316.47 ns]
                        change: [-5.6803% -4.3837% -2.9715%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) low severe
  3 (3.00%) low mild
Counter/CollectTenAttrs time:   [472.16 ns 473.01 ns 473.87 ns]
                        change: [-0.5746% -0.2430% +0.0796%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Histogram/Record0Attrs10bounds
                        time:   [26.869 ns 26.915 ns 26.961 ns]
                        change: [-0.9337% -0.6483% -0.4099%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Histogram/Record3Attrs10bounds
                        time:   [178.07 ns 179.04 ns 180.05 ns]
                        change: [+1.5352% +2.4123% +3.2730%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
Histogram/Record5Attrs10bounds
                        time:   [234.77 ns 235.38 ns 235.96 ns]
                        change: [+0.4491% +1.2595% +2.0163%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high mild
Histogram/Record7Attrs10bounds
                        time:   [282.89 ns 283.98 ns 285.04 ns]
                        change: [+0.8349% +1.4837% +2.1590%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
Histogram/Record10Attrs10bounds
                        time:   [376.05 ns 377.20 ns 378.42 ns]
                        change: [+2.7093% +3.1356% +3.6163%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low severe
  1 (1.00%) high severe
Histogram/Record0Attrs49bounds
                        time:   [33.860 ns 33.907 ns 33.955 ns]
                        change: [+0.3106% +0.5685% +0.8228%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
Histogram/Record3Attrs49bounds
                        time:   [186.80 ns 187.70 ns 188.55 ns]
                        change: [+1.0925% +2.1158% +3.2827%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
Histogram/Record5Attrs49bounds
                        time:   [240.13 ns 241.55 ns 243.11 ns]
                        change: [+0.2945% +0.9409% +1.5861%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
Histogram/Record7Attrs49bounds
                        time:   [291.67 ns 292.72 ns 293.68 ns]
                        change: [-0.9525% -0.3652% +0.2304%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  7 (7.00%) low mild
  1 (1.00%) high mild
Histogram/Record10Attrs49bounds
                        time:   [390.41 ns 392.35 ns 394.46 ns]
                        change: [+5.4833% +5.9922% +6.5449%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
Histogram/Record0Attrs50bounds
                        time:   [35.360 ns 35.929 ns 36.593 ns]
                        change: [+5.8425% +7.6970% +9.3905%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  7 (7.00%) high mild
  10 (10.00%) high severe
Histogram/Record3Attrs50bounds
                        time:   [173.17 ns 177.38 ns 181.59 ns]
                        change: [+1.2474% +3.7926% +6.1711%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
Histogram/Record5Attrs50bounds
                        time:   [237.79 ns 238.73 ns 239.65 ns]
                        change: [+0.0990% +0.6121% +1.1306%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild
Histogram/Record7Attrs50bounds
                        time:   [296.32 ns 297.19 ns 298.01 ns]
                        change: [+1.3802% +1.9040% +2.4259%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
Histogram/Record10Attrs50bounds
                        time:   [384.34 ns 385.16 ns 385.98 ns]
                        change: [+3.9310% +4.3664% +4.8244%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
Histogram/Record0Attrs1000bounds
                        time:   [47.258 ns 47.310 ns 47.366 ns]
                        change: [-0.4105% -0.2422% -0.0622%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Histogram/Record3Attrs1000bounds
                        time:   [210.33 ns 211.80 ns 213.32 ns]
                        change: [+4.9472% +6.0007% +7.0373%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild
Histogram/Record5Attrs1000bounds
                        time:   [265.83 ns 267.22 ns 268.70 ns]
                        change: [+5.3381% +6.0642% +6.8171%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low severe
  1 (1.00%) high mild
  1 (1.00%) high severe
Histogram/Record7Attrs1000bounds
                        time:   [315.73 ns 316.93 ns 318.13 ns]
                        change: [+2.4567% +3.1600% +3.8292%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  1 (1.00%) high mild
Histogram/Record10Attrs1000bounds
                        time:   [405.73 ns 407.40 ns 408.94 ns]
                        change: [+2.8342% +3.3088% +3.7869%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
Benchmarking Histogram/CollectOne: Warming up for 3.0000 sthread 'main' panicked at opentelemetry-sdk/benches/metric.rs:349:36:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

error: bench failed, to rerun pass `-p opentelemetry_sdk --bench metric`

It looks like there's relation between number number of attributes, E.g. for 1 attribute, performance is increased, for 10 is decreased, but I cannot explain why this happens.

cargo run --release --package stress --bin metrics

Main branch: 
* metrics_histogram: 10.3M it/s
* metrics_counter: 17.6M it/s
This PR:
* metrics_histogram: 9.8M it/s
* metrics_counter: 18.8M it/s

Same theme as before, counter performance increase, histograms performance decrease.
Although I have experimenting with sharding locally, results is opposite: with sharding counter performance increase ~2x, with histograms ~3x.

fraillt · 2024-11-08T14:07:11Z

opentelemetry-sdk/src/metrics/internal/mod.rs

+        let trackers = match self.sorted_attribs.lock() {
+            Ok(mut trackers) => {                
+                let new = HashMap::with_capacity(trackers.len());
+                replace(trackers.deref_mut(), new)


We have briefly touched this situation in my previous PR. about unnecessary allocation, but we didn't discussed downsides of it.
Here's the table of possible solutions.

memory policy approach benefits drawbacks

concervative take(trackers) smallest memory footprint never grows more than needed lots of allocation every time new collection phase starts

balanced replace(trackers, HashMap with_capacity (previous_len)) also small memory footprint on stable loads might save a lot of unnecessary allocations, on spiky loads might either waist memory or do more allocations

liberal replace(trackers, previous_tracker with same_memory) no allocations at all memory only grow, which might be a problem if you have huge spike, allocated memory will stay there forever (though it's only for contents of "empty" buckets, maybe not that bad?). Also it's a bit more complexity in code, as you need to have extra variable (and lock) just for that, and finally this might be unexpected for end user, as end user might expect that delta temporality doesn't do any sort of caching

At the moment I chose "balanced approach", I found it most natural :) but happy to hear your opinion on this too.

utpilla · 2024-11-09T01:55:14Z

opentelemetry-sdk/src/metrics/internal/mod.rs

-        prepare_data(dest, self.count.load(Ordering::SeqCst));
-        if self.has_no_attribute_value.swap(false, Ordering::AcqRel) {
+        // reset sorted trackers so new attributes set will be written into new hashmap
+        let trackers = match self.sorted_attribs.lock() {


I believe this is making the contention between collect and update worse. Earlier, the only point of contention during collect was the write lock under which we replace the existing RwLock<Hashmap> with a default one. With the changes in this PR, we now have two points of contention during collect:

Locking the Mutex with self.sorted_attribs.lock() which would compete with updates trying to add newer attribute sets

Acquiring the write lock with self.all_attribs.write() which would compete with all the updates

Yes, I know, but I don't see other way if we want to implement sharding, so my goal with this PR was mostly try to preserve existing performance (as much as I could) while at the same time, make it easy to add sharding in the future.
However, I expect this extra lock to be very cheap in practice, for two reasons:

Mutex is generally fast

at the end of collection cycle, there should be quite low contention on inserting new attribute sets.

but I don't see other way if we want to implement sharding,

Why is that? Implementing sharding for the Hashmap should be only about replacing the existing HashMap type with the type that implements sharding (something like Dashmap). How is the lack of having a sorted and unsorted set of attributes trackers preventing us from trying a sharded Hashmap?

Everything could work with one hashmap, we could probably tolerate slower cumulative collection phase, but the real problem is with delta temporality collection.
Here's the problem:
We have hard requirement that attribute-set keys are order agnostic.
In practice this means, that we can have multiple attribute-sets in hashmap, which points to same tracker (hence, we have Arc<A>.).
With sharding this means, that same tracker can be in multiple shards, so the question is how do you reset hashmap?

we cannot simply "replace" hashmap with fresh instance.

we cannot iterate through each shard, deduplicate (using hashset to identify already processed attribute-sets), and reset/clear each shard while iterating. This doesn't work because, even if you remove attribute-set in one shard, on next tracker update same tracker instance will be put back, as long as sorted attribute-set instance lives in another shard.

The only option to actually clear sharded hashmap is to lock ALL shards, then iterate/clone and unlock.

So my approach is to have two hashmaps instead, because all things considered it has more benefits than drawbacks.

update existing tracker - no effect, because attribute set is found on first try in both cases (in case user never provides sorted attribute-sets, it might be even better, as first hashmap will be half as small, so less chance for collision).

inserting new tracker - probably not expected, but it is performance increase actually (with cargo bench --bench metrics_counter), probably due to less checks and fast lock with Mutex.

collection phase (both delta and cumulative) - huge benefits, short lock time, no need to have hashset in order to dedup, and solves Fix metrics dedup/sort bug #2093 (which would further slow down collection phase).

So basically this approach has bunch of benefits and 0 drawbacks, so why not try it :)

I understand that sharding has proven to be improving throughout significantly, but those tests were done at the time we took a Mutex lock in hotpath. Now we have replaced that with Read lock on hotpath, so there is no contention anymore. i.e if 10 threads wants to update(), they can all do so without being blocked by each other. With this existing design, how does sharding help improve performance?

I think it has to do with CPU cache contention.
There's still single lock instance (on happy path) that all cores/threads will fight for. It doesn't matter if it's Mutex/RwLock or any other primitive, the fact remains that when we "lock it" some state is updated under the hood.
Once one CPU updates a memory (there's no difference if it's atomic or not), it marks this L1 cache line as "invalid" on all other CPU cores, so when next core comes in, it needs to fetch it (might be from main memory or previous CPU cache).

So sharding solves exactly this problem:

it will have multiple lock instances (per shard), so that multiple threads don't fight that much anymore, and don't need to invalidate other CPU cache line every single time.

As I mentioned, I have done some tests locally, using naive implementation (I didn't used any padding between shards, to avoid false-sharing (when invalidating one shard, might also invalidate another if they happen to live on same L1 cache line)), and results are already ~3x improvement (it varies greatly between different CPUs, I saw 2x on one, and 5x on another in some cases). Proper implementation could be even more performant :)

fraillt · 2024-11-13T18:38:36Z

Some more benchmark results:

cargo bench --bench metrics_counter

Counter_Created
Counter_Add_Sorted      time:   [126.49 ns 126.74 ns 127.04 ns]
                        change: [+0.1623% +0.7014% +1.2831%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

Counter_Created
Counter_Add_Unsorted    time:   [125.91 ns 126.33 ns 127.12 ns]
                        change: [-2.8605% -2.2424% -1.6517%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  4 (4.00%) high severe

Counter_Created
Counter_Overflow        time:   [540.16 ns 540.42 ns 540.71 ns]
                        change: [-16.803% -16.718% -16.638%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

ThreadLocal_Random_Generator_5
                        time:   [32.559 ns 32.612 ns 32.721 ns]
                        change: [+0.0105% +0.2862% +0.5248%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

fraillt requested a review from a team as a code owner November 8, 2024 10:03

Separate HashMap for sorted attribs

c9621ae

fraillt force-pushed the value-map-separate-map-for-sorted-attribs branch from b4ee62f to c9621ae Compare November 8, 2024 10:04

fraillt commented Nov 8, 2024

View reviewed changes

utpilla reviewed Nov 9, 2024

View reviewed changes

fraillt mentioned this pull request Nov 11, 2024

Add simple sharding in ValueMap to boost metrics update performance under stress. #2297

Open

cijothomas mentioned this pull request Nov 14, 2024

Hashed utility for ValueMap to improve hashing performance #2296

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ValueMap - separate HashMap for sorted attribs #2288

ValueMap - separate HashMap for sorted attribs #2288

fraillt commented Nov 8, 2024

codecov bot commented Nov 8, 2024

fraillt commented Nov 8, 2024

fraillt Nov 8, 2024

utpilla Nov 9, 2024

fraillt Nov 9, 2024

utpilla Nov 13, 2024 •

edited

Loading

fraillt Nov 13, 2024

cijothomas Nov 14, 2024

fraillt Nov 14, 2024

fraillt commented Nov 13, 2024

memory policy	approach	benefits	drawbacks
concervative	take(trackers)	smallest memory footprint never grows more than needed	lots of allocation every time new collection phase starts
balanced	replace(trackers, HashMap with_capacity (previous_len))	also small memory footprint	on stable loads might save a lot of unnecessary allocations, on spiky loads might either waist memory or do more allocations
liberal	replace(trackers, previous_tracker with same_memory)	no allocations at all	memory only grow, which might be a problem if you have huge spike, allocated memory will stay there forever (though it's only for contents of "empty" buckets, maybe not that bad?). Also it's a bit more complexity in code, as you need to have extra variable (and lock) just for that, and finally this might be unexpected for end user, as end user might expect that delta temporality doesn't do any sort of caching

ValueMap - separate HashMap for sorted attribs #2288

Are you sure you want to change the base?

ValueMap - separate HashMap for sorted attribs #2288

Conversation

fraillt commented Nov 8, 2024

Changes

Merge requirement checklist

codecov bot commented Nov 8, 2024

Codecov Report

fraillt commented Nov 8, 2024

fraillt Nov 8, 2024

Choose a reason for hiding this comment

utpilla Nov 9, 2024

Choose a reason for hiding this comment

fraillt Nov 9, 2024

Choose a reason for hiding this comment

utpilla Nov 13, 2024 • edited Loading

Choose a reason for hiding this comment

fraillt Nov 13, 2024

Choose a reason for hiding this comment

cijothomas Nov 14, 2024

Choose a reason for hiding this comment

fraillt Nov 14, 2024

Choose a reason for hiding this comment

fraillt commented Nov 13, 2024

utpilla Nov 13, 2024 •

edited

Loading