Prometheus metric ceremony #4034

marcellorigotti · 2023-09-20T14:32:28Z

Pull Request

Closes: PRO-xxx

Checklist

Please conduct a thorough self-review before opening the PR.

I am confident that the code works.
I have updated documentation where appropriate.

Summary

This PR adds the following ceremony-related metrics:

CEREMONY_DURATION (ms)
CEREMONY_TIMEOUT_MISSING_MSG (number of messages still missing when the timeout is reached)
STAGE_DURATION (ms)
STAGE_FAILING (counts the stages that reach the timeout)
STAGE_COMPLETING (counts the stages that complete, i.e. the ones that receive all messages)

codecov · 2023-09-20T14:41:58Z

Codecov Report

Merging #4034 (d32ddf1) into main (178df88) will decrease coverage by 0%.
Report is 6 commits behind head on main.
The diff coverage is 82%.

@@          Coverage Diff           @@
##            main   #4034    +/-   ##
======================================
- Coverage     72%     72%    -0%     
======================================
  Files        368     369     +1     
  Lines      58484   59126   +642     
  Branches   58484   59126   +642     
======================================
+ Hits       42327   42658   +331     
- Misses     14060   14353   +293     
- Partials    2097    2115    +18

Files Changed	Coverage Δ
...ngine/multisig/src/client/common/ceremony_stage.rs	`100% <ø> (ø)`
engine/multisig/src/client/ceremony_runner.rs	`82% <73%> (+<1%)`	⬆️
utilities/src/with_std/metrics.rs	`81% <82%> (+1%)`	⬆️
engine/multisig/src/client/common/broadcast.rs	`87% <94%> (+<1%)`	⬆️

... and 25 files with indirect coverage changes


marcellorigotti · 2023-09-20T14:42:33Z

Possible modification:

CEREMONY_TIMEOUT_MISSING_MSG: do we also want the name of the stage here?

1. It could help us understand which stages completed even though some messages were missing, provided we add another true/false metric (encoded as 1/0) for each ceremony and stage that records whether it completed correctly or failed. For example, given CEREMONY_TIMEOUT_MISSING_MSG(Ceremony1, stage2, 3 missing msgs), the new metric CEREMONY_STAGE_RESULT(Ceremony1, stage2, true) would mean the stage completed despite the missing messages; otherwise we would end up with CEREMONY_STAGE_RESULT(Ceremony1, stage2, false).
2. It could help us better understand which stages are the most likely to fail.

We can add the stage name just for reason number (2).
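
A minimal sketch of the proposal above, assuming plain maps as a stand-in for the two labelled gauges. `StageMetrics`, `record` and `completed_despite_missing` are hypothetical names chosen for illustration and are not part of the PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for two labelled gauges, keyed by (ceremony_id, stage).
#[derive(Default)]
struct StageMetrics {
    /// CEREMONY_TIMEOUT_MISSING_MSG: messages still missing at the timeout.
    missing_msgs: HashMap<(u64, String), i64>,
    /// CEREMONY_STAGE_RESULT: 1 if the stage completed, 0 if it failed.
    stage_result: HashMap<(u64, String), i64>,
}

impl StageMetrics {
    fn record(&mut self, ceremony_id: u64, stage: &str, missing: i64, completed: bool) {
        let key = (ceremony_id, stage.to_string());
        self.missing_msgs.insert(key.clone(), missing);
        self.stage_result.insert(key, i64::from(completed));
    }

    /// Stages that completed even though some messages were missing.
    fn completed_despite_missing(&self) -> Vec<(u64, String)> {
        let mut out: Vec<_> = self
            .missing_msgs
            .iter()
            .filter(|(key, missing)| **missing > 0 && self.stage_result.get(*key) == Some(&1))
            .map(|(key, _)| key.clone())
            .collect();
        out.sort();
        out
    }
}
```

Joining the two metrics on the shared (ceremony, stage) labels is what answers reason (1); a dashboard query would do the same join on the exported series.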

msgmaxim · 2023-09-21T03:52:16Z

utilities/src/with_std/metrics.rs

+				non_const_labels: &[&str; { $labels.len() - $const_labels.len() }],
+			) {
+				if self.drop {
+					self.labels.insert(non_const_labels.map(|s| s.to_string()));


I don't understand why this appends to self.labels every time you increment some counter (and the same in all other functions).

When we need to drop the metric and it has some const_label, we have to specify const_label + non_const_label to delete it.
To do that we need to save the non_const_labels we use, so that on deletion we can reconstruct every combination used (const_labels + non_const_labels1, const_labels + non_const_labels2, etc.).

I will change the name of the field to something like non_const_label_used to make it clearer!
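
The bookkeeping described above can be sketched with std types only. `CounterFamily` is a hypothetical stand-in for a labelled counter that can only be removed by its full, ordered label values, and `ScopedCounter` is an illustrative wrapper; neither is the code from this PR, where the cleanup happens when the wrapper is dropped rather than through an explicit call.

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for a labelled counter family: values are keyed by the full,
/// ordered list of label values, and removal needs that exact list.
#[derive(Default)]
struct CounterFamily {
    values: HashMap<Vec<String>, u64>,
}

impl CounterFamily {
    fn inc(&mut self, labels: &[String]) {
        *self.values.entry(labels.to_vec()).or_insert(0) += 1;
    }

    /// Fails unless `labels` is a complete combination that is present.
    fn remove(&mut self, labels: &[String]) -> Result<(), String> {
        self.values
            .remove(labels)
            .map(|_| ())
            .ok_or_else(|| format!("missing label values: {labels:?}"))
    }
}

/// Wrapper that fixes the const labels once and records every non-const
/// combination it touches, so all of its series can be deleted later.
struct ScopedCounter {
    const_labels: Vec<String>,
    non_const_labels_used: HashSet<Vec<String>>,
}

impl ScopedCounter {
    fn new(const_labels: &[&str]) -> Self {
        Self {
            const_labels: const_labels.iter().map(|s| s.to_string()).collect(),
            non_const_labels_used: HashSet::new(),
        }
    }

    /// Const labels first, then the non-const ones, in a fixed order.
    fn full_labels(&self, non_const: &[String]) -> Vec<String> {
        self.const_labels.iter().chain(non_const).cloned().collect()
    }

    fn inc(&mut self, family: &mut CounterFamily, non_const: &[&str]) {
        let non_const: Vec<String> = non_const.iter().map(|s| s.to_string()).collect();
        family.inc(&self.full_labels(&non_const));
        // A set, so repeated increments of the same series are stored once.
        self.non_const_labels_used.insert(non_const);
    }

    /// Removes every series this wrapper created.
    fn delete_all(&mut self, family: &mut CounterFamily) {
        let used: Vec<Vec<String>> = self.non_const_labels_used.drain().collect();
        for non_const in used {
            let _ = family.remove(&self.full_labels(&non_const));
        }
    }
}
```

Passing only part of the labels to `remove` returns an error in this sketch, which mirrors the limitation being worked around.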

I see. You have to do this because there is no way to partially match labels when deleting (e.g. everything with a given ceremony_id) and we don't know all labels in advance? Seems like there is quite a bit of overhead/complexity just to increment an integer, and this feels error-prone.

I'm just thinking out loud, but do we really need to use things like IntCounterVec? It is not that difficult to produce the values in the format that prometheus expects. Say, if we stored these metrics in our own data structure, we could easily have things like "delete all entries for a given ceremony id", e.g. using a map with ceremony id as the key. I don't think there is anything else we'd need to worry about with regard to cleaning up. It just feels like most of the difficulty/complexity comes from the fact that we have to work around the library's limitations.
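
As a rough illustration of that idea, here is a sketch of a home-grown store keyed by ceremony id, assuming only gauges with a single `ceremony_id` label. `CeremonyGauges` and its methods are hypothetical names; HELP/TYPE lines, other labels and other metric types are left out.

```rust
use std::collections::BTreeMap;

type CeremonyId = u64;

/// Hypothetical home-grown store: ceremony id -> metric name -> value.
#[derive(Default)]
struct CeremonyGauges {
    by_ceremony: BTreeMap<CeremonyId, BTreeMap<&'static str, i64>>,
}

impl CeremonyGauges {
    fn set(&mut self, ceremony_id: CeremonyId, metric: &'static str, value: i64) {
        self.by_ceremony.entry(ceremony_id).or_default().insert(metric, value);
    }

    /// "Delete all entries for a given ceremony id" is a single map removal.
    fn remove_ceremony(&mut self, ceremony_id: CeremonyId) {
        self.by_ceremony.remove(&ceremony_id);
    }

    /// Renders the sample lines in the Prometheus text format.
    fn encode(&self) -> String {
        // Group by metric name first, since the text format expects all
        // samples of one metric to appear together.
        let mut grouped: BTreeMap<&str, Vec<(CeremonyId, i64)>> = BTreeMap::new();
        for (id, metrics) in &self.by_ceremony {
            for (name, value) in metrics {
                grouped.entry(*name).or_default().push((*id, *value));
            }
        }
        let mut out = String::new();
        for (name, samples) in grouped {
            for (id, value) in samples {
                out.push_str(&format!("{name}{{ceremony_id=\"{id}\"}} {value}\n"));
            }
        }
        out
    }
}
```

With this layout no per-series label bookkeeping is needed for cleanup, at the cost of maintaining the encoder ourselves.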

Exactly. Unfortunately it is not possible to specify just a single label when deleting; you need to provide the full combination of labels in the correct order (I think because behind the scenes IntCounterVec concatenates the strings and hashes them to obtain the key into the map containing the actual value).
And yeah, creating our own version of the prometheus client, tuned to our needs, would probably reduce complexity a lot! I guess we decided to go with the prometheus library at the beginning, when we didn't yet know all the metrics required and how to handle them.

This is definitely something worth keeping in mind for the future, but right now I don't think it is worth changing everything.

msgmaxim · 2023-09-22T02:49:52Z

utilities/src/with_std/metrics.rs

@@ -346,23 +351,28 @@ macro_rules! build_counter_vec_struct {
 		pub struct $struct_ident {
 			metric: &'static $metric_ident,
 			const_labels: [String; { $const_labels.len() }],
-			labels: HashSet<[String; { $labels.len() - $const_labels.len() }]>,
+			non_const_labels_used: HashSet<[String; { $labels.len() - $const_labels.len() }]>,


I imagine it is not too difficult to have the first part of this macro implemented using the second part of the macro with an empty const label array?

Yes, I think it is possible to do so, but wouldn't this add some needless complexity to the metrics which don't use const_labels? (Also, we would then have to pass an empty array every time we interact with those metrics, making the code a bit messy.)

There are probably ways to remove the empty array argument, but yeah, complicating the macro might not be worth it.
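
To make the trade-off concrete, here is a heavily simplified sketch of one macro arm forwarding to the other with an empty const label list. `build_label_struct` and the generated items are hypothetical and far smaller than the real `build_counter_vec_struct`; the sketch only shows the delegation pattern.

```rust
/// Hypothetical, simplified macro: the arm without const labels forwards
/// to the arm with const labels, passing an empty list.
macro_rules! build_label_struct {
    ($name:ident, [$($label:literal),*]) => {
        build_label_struct!($name, [$($label),*], []);
    };
    ($name:ident, [$($label:literal),*], [$($const_label:literal),*]) => {
        struct $name {
            const_labels: Vec<String>,
        }

        impl $name {
            const LABELS: &'static [&'static str] = &[$($label),*];
            const CONST_LABELS: &'static [&'static str] = &[$($const_label),*];

            fn new(const_values: &[&str]) -> Self {
                assert_eq!(const_values.len(), Self::CONST_LABELS.len());
                Self { const_labels: const_values.iter().map(|s| s.to_string()).collect() }
            }

            /// Number of label values the caller still supplies per update.
            fn non_const_len() -> usize {
                Self::LABELS.len() - Self::CONST_LABELS.len()
            }
        }
    };
}

// No const labels: handled by the first arm, which delegates to the second.
build_label_struct!(BadMsg, ["chain", "reason"]);
// Two const labels fixed at construction time.
build_label_struct!(StageDuration, ["chain", "ceremony_id", "stage"], ["chain", "ceremony_id"]);
```

In this sketch the downside raised above is visible at the call site: a metric without const labels is constructed as `BadMsg::new(&[])`.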

msgmaxim

(dummy message so I can submit review)

msgmaxim · 2023-09-25T02:36:39Z

utilities/src/with_std/metrics.rs

+				request_test("metrics", reqwest::StatusCode::OK, "# HELP ceremony_bad_msg Count all the bad msgs processed during a ceremony\n# TYPE ceremony_bad_msg counter\nceremony_bad_msg{chain=\"Chain1\",reason=\"AA\"} 1\n# HELP ceremony_duration Measure the duration of a ceremony in ms\n# TYPE ceremony_duration gauge\nceremony_duration{ceremony_id=\"7\",ceremony_type=\"Keygen\",chain=\"Chain1\"} 999\n# HELP ceremony_msg Count all the processed messages for a given ceremony\n# TYPE ceremony_msg counter\nceremony_msg{ceremony_id=\"7\",ceremony_type=\"Keygen\",chain=\"Chain1\"} 2\n# HELP ceremony_timeout_missing_msg Measure the number of missing messages when reaching timeout\n# TYPE ceremony_timeout_missing_msg gauge\nceremony_timeout_missing_msg{ceremony_id=\"7\",ceremony_type=\"Keygen\",chain=\"Chain1\",stage=\"stage1\"} 5\n# HELP stage_completing Count the number of stages which are completing succesfully by receiving all the messages\n# TYPE stage_completing counter\nstage_completing{chain=\"Chain1\",stage=\"stage1\"} 2\nstage_completing{chain=\"Chain1\",stage=\"stage2\"} 1\n# HELP stage_duration Measure the duration of a stage in ms\n# TYPE stage_duration gauge\nstage_duration{ceremony_id=\"7\",chain=\"Chain1\",phase=\"processing\",stage=\"stage1\"} 78\nstage_duration{ceremony_id=\"7\",chain=\"Chain1\",phase=\"receiving\",stage=\"stage1\"} 780\n# HELP stage_failing Count the number of stages which are failing with the cause of the failure attached\n# TYPE stage_failing counter\nstage_failing{chain=\"Chain1\",reason=\"NotEnoughMessages\",stage=\"stage3\"} 1\n").await;
+
+				//Second request we get only the metrics which don't depend on a specific label like ceremony_id
+				request_test("metrics", reqwest::StatusCode::OK, "# HELP ceremony_bad_msg Count all the bad msgs processed during a ceremony\n# TYPE ceremony_bad_msg counter\nceremony_bad_msg{chain=\"Chain1\",reason=\"AA\"} 1\n# HELP stage_completing Count the number of stages which are completing succesfully by receiving all the messages\n# TYPE stage_completing counter\nstage_completing{chain=\"Chain1\",stage=\"stage1\"} 2\nstage_completing{chain=\"Chain1\",stage=\"stage2\"} 1\n# HELP stage_failing Count the number of stages which are failing with the cause of the failure attached\n# TYPE stage_failing counter\nstage_failing{chain=\"Chain1\",reason=\"NotEnoughMessages\",stage=\"stage3\"} 1\n").await;


Just an observation, but it is nearly impossible to tell if this output is what we should expect (it is hard to parse and it contains a lot of noise, e.g. help messages). Makes me want to write our own metrics data structure even more (so we can look at it without encoding).

One thing you can do to improve readability somewhat is to use raw string literals:

request_test("metrics", reqwest::StatusCode::OK,
r#"# HELP ceremony_bad_msg Count all the bad msgs processed during a ceremony
# TYPE ceremony_bad_msg counter
ceremony_bad_msg{chain="Chain1",reason="AA"} 1
# HELP ceremony_duration Measure the duration of a ceremony in ms
# TYPE ceremony_duration gauge
ceremony_duration{ceremony_id="7",ceremony_type="Keygen",chain="Chain1"} 999
# HELP ceremony_msg Count all the processed messages for a given ceremony
# TYPE ceremony_msg counter
ceremony_msg{ceremony_id="7",ceremony_type="Keygen",chain="Chain1"} 2
# HELP ceremony_timeout_missing_msg Measure the number of missing messages when reaching timeout
# TYPE ceremony_timeout_missing_msg gauge
ceremony_timeout_missing_msg{ceremony_id="7",ceremony_type="Keygen",chain="Chain1",stage="stage1"} 5
# HELP stage_completing Count the number of stages which are completing succesfully by receiving all the messages
# TYPE stage_completing counter
stage_completing{chain="Chain1",stage="stage1"} 2
stage_completing{chain="Chain1",stage="stage2"} 1
# HELP stage_duration Measure the duration of a stage in ms
# TYPE stage_duration gauge
stage_duration{ceremony_id="7",chain="Chain1",phase="processing",stage="stage1"} 78
stage_duration{ceremony_id="7",chain="Chain1",phase="receiving",stage="stage1"} 780
# HELP stage_failing Count the number of stages which are failing with the cause of the failure attached
# TYPE stage_failing counter
stage_failing{chain="Chain1",reason="NotEnoughMessages",stage="stage3"} 1
"#
).await;

(Note that there should be no leading whitespace, otherwise the strings wouldn't be identical)

Otherwise, maybe we could use some simple string manipulation before comparing them, e.g.

[
	r#"# HELP ceremony_bad_msg Count all the bad msgs processed during a ceremony"#,
	r#"# TYPE ceremony_bad_msg counter"#,
].join("\n")

or even skipping the # HELP and # TYPE lines
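
One way to skip those lines is a small helper applied to both the expected and the actual body before comparing. This is only a sketch: `sample_lines` is a hypothetical helper, not something in the PR, and it assumes that comparing the sample lines alone is enough for the test.

```rust
/// Keeps only the sample lines of a Prometheus text payload, dropping
/// `# HELP` / `# TYPE` comment lines and blank lines.
fn sample_lines(body: &str) -> Vec<&str> {
    body.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .collect()
}
```

Because each line is trimmed, the expected string could also be indented in the test source without breaking the comparison.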

I know this is not ideal; that's why I added the check_deleted_metrics function to be sure everything was correctly deleted. Still, I wanted to check the final output of a request, which is why I kept this messy request_test() as well.
Anyway, I followed your first suggestion and used raw string literals; it is now much clearer what we are expecting.

msgmaxim · 2023-09-25T02:52:32Z

utilities/src/with_std/metrics.rs

+				}
+
+				//First request after the ceremony ended we get all the metrics (same as the request above), and after we delete the ones that have no more reason to exists
+				request_test("metrics", reqwest::StatusCode::OK, "# HELP ceremony_bad_msg Count all the bad msgs processed during a ceremony\n# TYPE ceremony_bad_msg counter\nceremony_bad_msg{chain=\"Chain1\",reason=\"AA\"} 1\n# HELP ceremony_duration Measure the duration of a ceremony in ms\n# TYPE ceremony_duration gauge\nceremony_duration{ceremony_id=\"7\",ceremony_type=\"Keygen\",chain=\"Chain1\"} 999\n# HELP ceremony_msg Count all the processed messages for a given ceremony\n# TYPE ceremony_msg counter\nceremony_msg{ceremony_id=\"7\",ceremony_type=\"Keygen\",chain=\"Chain1\"} 2\n# HELP ceremony_timeout_missing_msg Measure the number of missing messages when reaching timeout\n# TYPE ceremony_timeout_missing_msg gauge\nceremony_timeout_missing_msg{ceremony_id=\"7\",ceremony_type=\"Keygen\",chain=\"Chain1\",stage=\"stage1\"} 5\n# HELP stage_completing Count the number of stages which are completing succesfully by receiving all the messages\n# TYPE stage_completing counter\nstage_completing{chain=\"Chain1\",stage=\"stage1\"} 2\nstage_completing{chain=\"Chain1\",stage=\"stage2\"} 1\n# HELP stage_duration Measure the duration of a stage in ms\n# TYPE stage_duration gauge\nstage_duration{ceremony_id=\"7\",chain=\"Chain1\",phase=\"processing\",stage=\"stage1\"} 78\nstage_duration{ceremony_id=\"7\",chain=\"Chain1\",phase=\"receiving\",stage=\"stage1\"} 780\n# HELP stage_failing Count the number of stages which are failing with the cause of the failure attached\n# TYPE stage_failing counter\nstage_failing{chain=\"Chain1\",reason=\"NotEnoughMessages\",stage=\"stage3\"} 1\n").await;


succesfully -> successfully

* added macro to create gauges that get deleted
* added ceremony_duration metric
* fixed gauge to handle convertion to i64
* ceremony missing messages on timeout metric added
* added chain label to CEREMONY_PROCESSED_MSG, CEREMONY_DURATION, CEREMONY_TIMEOUT_MISSING_MSG
* modified macro to support drop (deletion of labels) on all the types of wrapper, passed as a parameter when calling the macro
* added STAGE_DURATION metric
* use collect_array
* added STAGE_COMPLETING/STAGE_FAILING metrics
* avoid saving labels already seen (add to the hashset) if we don't drop the metric
* fixed missing imports caused by rebasing
* fix double imports
* avoid using format! and to_string every time -> use clone()
* fixed typo
* addressed PR comments
* use Option for stage/ceremony _start do conversion inside the constructor add test to check deletion of metrics inside CeremonyMetrics strucs
* fixed test
* cargo fmt
* added manual deletion inside tests and make sure it returns an error -> deletion done before as expected
* address review comments

msgmaxim reviewed Sep 21, 2023

marcellorigotti added 15 commits September 21, 2023 15:55

added macro to create gauges that get deleted

cccc380

added ceremony_duration metric

8226df4

fixed gauge to handle convertion to i64

818d4a5

ceremony missing messages on timeout metric added

683cde8

added chain label to CEREMONY_PROCESSED_MSG, CEREMONY_DURATION, CEREMONY_TIMEOUT_MISSING_MSG

5bedc8c

modified macro to support drop (deletion of labels) on all the types of wrapper, passed as a parameter when calling the macro

97d849d

added STAGE_DURATION metric

ad9b386

use collect_array

1153b74

added STAGE_COMPLETING/STAGE_FAILING metrics

84181dc

avoid saving labels already seen (add to the hashset) if we don't drop the metric

205828a

fixed missing imports caused by rebasing

5083735

fix double imports

ab400b1

avoid using format! and to_string every time -> use clone()

f884ccf

fixed typo

823f27b

addressed PR comments

7e90d22

marcellorigotti force-pushed the prometheusMetricCeremony branch from e7c17ae to 7e90d22 Compare September 21, 2023 13:55

msgmaxim reviewed Sep 22, 2023

marcellorigotti added 3 commits September 22, 2023 12:02

use Option for stage/ceremony _start

b082d8f

do conversion inside the constructor add test to check deletion of metrics inside CeremonyMetrics strucs

fixed test

1b04fe1

cargo fmt

c73b55b

marcellorigotti marked this pull request as ready for review September 22, 2023 11:29

added manual deletion inside tests and make sure it returns an error -> deletion done before as expected

4a82892

msgmaxim reviewed Sep 25, 2023

address review comments

d32ddf1

msgmaxim approved these changes Sep 25, 2023

marcellorigotti merged commit 61dbd66 into main Sep 25, 2023
44 checks passed

marcellorigotti deleted the prometheusMetricCeremony branch September 25, 2023 09:49
