[ANE-1027] Reduce broker trace file size #146
src/debug.rs (outdated)
    .flatten_event(true)
    .with_thread_ids(true)
    .with_span_list(false)
    .with_span_events(FmtSpan::ACTIVE)
flatten_event [1]
- This is just a minor simplification of the JSON structure that should save a bit of space
with_span_list [2] and with_thread_ids [3]
- Idea here is that we don't need to add the span list to every trace if we include the thread id
- We can infer the span list by looking at the previously entered/exited spans on the same thread id for a given trace
with_span_events [4]
- Switching to ACTIVE, as that includes all enters/exits, which seem to me to be far more relevant for debugging than creates/drops
Footnotes
[1] https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.flatten_event
[2] https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_span_list
[3] https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_thread_ids
[4] https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_span_events
flatten_event
Do you have examples of what this looks like before/after?
with_span_list and with_thread_ids
Idea here is that we don't need to add the span list to every trace if we include the thread id
Is the idea that by recording thread IDs, we can retain the context across temporally intertwined span events without actually including them in the message?
If so, this won't actually do what you're after: tokio is multithreaded, yes, but only as an implementation detail; it is primarily a work-stealing executor of concurrent lightweight tasks.
In other words, the same thread can work on many disparate tasks that have no relation to one another. So unfortunately I don't think this'll avoid that problem.
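The point above can be sketched with a small std-only model. This is not tokio itself: the schedule, the task names, and the `Line` type are all made up for illustration, and real trace lines carry no task field. The sketch only shows that grouping lines by thread ID mixes unrelated tasks once a thread interleaves them.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// One line of a hypothetical trace: the worker thread that emitted it and
/// the logical task it belongs to (the task is known here only so the
/// inference can be checked).
#[derive(Debug, Clone, Copy)]
struct Line {
    thread_id: u32,
    task: &'static str,
}

/// A made-up schedule in which two worker threads interleave two tasks.
fn sample_schedule() -> Vec<Line> {
    vec![
        Line { thread_id: 1, task: "clone repo A" },
        Line { thread_id: 1, task: "upload scan B" },
        // Task A is resumed on a different worker thread.
        Line { thread_id: 2, task: "clone repo A" },
        Line { thread_id: 1, task: "upload scan B" },
        Line { thread_id: 2, task: "upload scan B" },
    ]
}

/// Groups lines by thread ID, as a reader would if the thread ID were the
/// only context available, and reports which tasks land in each group.
fn tasks_per_thread(lines: &[Line]) -> BTreeMap<u32, BTreeSet<&'static str>> {
    let mut grouped: BTreeMap<u32, BTreeSet<&'static str>> = BTreeMap::new();
    for line in lines {
        grouped.entry(line.thread_id).or_default().insert(line.task);
    }
    grouped
}
```

In this made-up schedule, `tasks_per_thread(&sample_schedule())` puts both tasks in the group for thread 1 and both in the group for thread 2, so the thread ID alone does not say which task a line belongs to.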
Do you have examples of what this looks like before/after?
Sure! Before:
{"timestamp":"2024-09-19T21:32:44.075971Z","level":"INFO","fields":{"message":"enter"},"target":"broker::db::sqlite","span":{"location":"\"/Users/nic/.config/fossa/broker/db.sqlite\"","name":"connect"},"threadId":"ThreadId(1)"}
After:
{"timestamp":"2024-09-19T19:38:11.111064Z","level":"INFO","message":"enter","target":"broker::db::sqlite","span":{"location":"\"/Users/nic/.config/fossa/broker/db.sqlite\"","name":"connect"},"threadId":"ThreadId(1)"}
If so, this won't actually do what you're after: tokio is multithreaded, yes, but only as an implementation detail; it is primarily a work-stealing executor of concurrent lightweight tasks.
In other words, the same thread can work on many disparate tasks that have no relation to one another. So unfortunately I don't think this'll avoid that problem.
Damn, okay, good to know. I was hoping that this would be a cheeky built-in way to accomplish what you mentioned in this thread but alas. I'll take another stab at it!
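For the flatten_event before/after shown above, the saving per line can be worked out directly: the "fields":{ wrapper and its closing brace go away. Below is a minimal sketch using shortened, illustrative lines (not real broker output); the constant and function names are hypothetical.

```rust
/// Shortened, illustrative trace lines (not real broker output).
/// `BEFORE` nests the message under "fields"; `AFTER` is the flattened form.
const BEFORE: &str =
    r#"{"level":"INFO","fields":{"message":"enter"},"target":"broker::db::sqlite"}"#;
const AFTER: &str =
    r#"{"level":"INFO","message":"enter","target":"broker::db::sqlite"}"#;

/// Bytes saved on a single line by flattening.
fn bytes_saved(before: &str, after: &str) -> usize {
    before.len().saturating_sub(after.len())
}

/// Bytes saved across `lines` events, assuming every line shrinks by the
/// same amount as the sample pair above.
fn total_bytes_saved(lines: usize) -> usize {
    lines * bytes_saved(BEFORE, AFTER)
}
```

For this pair the difference is 11 bytes per line, which fits the "save a bit of space" description: helpful, but small next to dropping whole events.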
Added comment above
We should definitely figure out what's going on here; if we don't understand, we should figure it out. At minimum this means our mental model for changes doesn't fully match reality.
- Don't show thread IDs
- Include the span list
- Only show NEW and CLOSE span events
@@ -342,7 +342,6 @@ fn parse_ls_remote(output: String) -> Vec<Reference> {
         .collect()
 }

-#[tracing::instrument]
 fn line_to_git_ref(line: &str) -> Option<Reference> {
This was causing A LOT of traces. The function above (parse_ls_remote) calls this function for every single reference in the repo, and that can be a lot (tensorflow has 44,567, for example). I don't think we need a span created for each of these.
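As a rough count of what that means in trace lines, here is a minimal sketch. It assumes each call is synchronous, so its span is created, entered, exited, and closed exactly once; the `Emitted` type and the function name are hypothetical and only model which of the four span events are written.

```rust
/// Which span lifecycle events are written to the trace (a hypothetical
/// stand-in for the subscriber's span-event setting).
#[derive(Clone, Copy)]
struct Emitted {
    new: bool,
    enter: bool,
    exit: bool,
    close: bool,
}

/// All four events, as with the previous FULL setting.
const FULL: Emitted = Emitted { new: true, enter: true, exit: true, close: true };
/// Only creation and close.
const NEW_AND_CLOSE: Emitted = Emitted { new: true, enter: false, exit: false, close: true };

/// Trace lines produced by span events alone for `calls` invocations of a
/// synchronous instrumented function (one span per call, entered once).
fn span_event_lines(calls: u64, emitted: Emitted) -> u64 {
    let per_span = [emitted.new, emitted.enter, emitted.exit, emitted.close]
        .iter()
        .filter(|on| **on)
        .count() as u64;
    calls * per_span
}
```

Under these assumptions, 44,567 references give 178,268 span-event lines with all four events enabled and 89,134 with only new and close; removing the instrumentation from the per-reference function removes them entirely.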
@@ -140,7 +140,8 @@ impl Config {
             .with(
                 tracing_subscriber::fmt::layer()
                     .json()
-                    .with_span_events(FmtSpan::FULL)
+                    .flatten_event(true)
+                    .with_span_events(FmtSpan::NEW | FmtSpan::CLOSE)
IMO, these are the span events we really care about. From the docs:
- FmtSpan::NEW: An event will be synthesized when spans are created.
- FmtSpan::ENTER: An event will be synthesized when spans are entered.
- FmtSpan::EXIT: An event will be synthesized when spans are exited.
- FmtSpan::CLOSE: An event will be synthesized when a span closes.
From my analysis, I noticed that ENTER is emitted way more often than the function is actually called. For example, I know that the main function in broker::cmd::run is only called once. During one sample execution, a NEW event was understandably emitted only once, but ENTER was emitted 426 times. I suspect this is because a thread is constantly exiting and re-entering spans as it context-switches between different processes.
I suspect this is because a thread is constantly exiting and re-entering spans as it context-switches between different processes.
Correct!
IMO, these are the span events we really care about.
Good insight, I agree completely. Really, the "new" and "close" events are all we need for latency measures.
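The gap between one NEW and hundreds of ENTER events can be reproduced with a std-only model. An instrumented future enters its span each time it is polled and exits when the poll returns, so a future that yields many times produces many enter/exit pairs. The sketch below is not the tracing crate: `SpanCounts`, `YieldsTimes`, and `run_instrumented` are hypothetical names, and the counters stand in for emitted events.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Counters standing in for the span events a subscriber would emit.
#[derive(Debug, Default, PartialEq)]
struct SpanCounts {
    new: u32,
    enter: u32,
    exit: u32,
    close: u32,
}

/// A future that returns `Pending` `remaining` times before completing.
struct YieldsTimes {
    remaining: u32,
}

impl Future for YieldsTimes {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.remaining == 0 {
            return Poll::Ready(());
        }
        self.remaining -= 1;
        cx.waker().wake_by_ref(); // ask to be polled again
        Poll::Pending
    }
}

/// A waker that does nothing; the loop below simply polls again.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the future to completion, "entering" the span around every poll.
fn run_instrumented(yields: u32) -> SpanCounts {
    // The span is created once, before the first poll.
    let mut counts = SpanCounts { new: 1, ..Default::default() };
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(YieldsTimes { remaining: yields });
    loop {
        counts.enter += 1; // entered for this poll
        let result = fut.as_mut().poll(&mut cx);
        counts.exit += 1; // exited when the poll returns
        if result.is_ready() {
            break;
        }
    }
    counts.close = 1; // closed once, after the future completes
    counts
}
```

In this model, a future that yields 425 times is polled 426 times, giving one new, 426 enter, 426 exit, and one close event. That is why new and close are enough for latency measurements while enter and exit dominate the file size.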
@jssblck I think I've come to a much better solution now. See my two new comments where I justify the changes. I've also updated the table in the PR description (way more significant file size reductions 🙌). Let me know what you think.
🔥 97% reductions with no real loss in available information are always great to see. Great work!
8b2d3c3 to 454a43c
Overview
Broker generates a lot of traces. This can result in very large trace file sizes. This PR aims to reduce file size by updating the trace subscriber layer to:
- new and close events
Does this even work?
I ran broker against a handful of repos independently. Here are the results:
As you can see: a significant improvement.
What else can we do?
The ticket points to two possible improvements that I think would be more consistent and substantial than what the changes here can accomplish.
The first is to compress the trace files:
The second is to serialize the traces in something more compact than json:
I think both of these are strong options, but I fear they may be too time-consuming for an inexperienced Rust developer like myself to do as part of this ticket. I am hoping that this improvement is satisfactory enough, but perhaps we should make follow-up tickets for either of these two options.
Note about unit tests
There doesn't seem to be a precedent for testing the traces, so I haven't added any. Let me know if there's a good way to add some.
Acceptance criteria
Testing plan
- broker.trace file size
- broker.trace file size to be smaller
Risks
We lose some trace information, such as when spans are entered/exited. We still have info on when spans are created and closed though, which I think is more relevant.
References
Checklist