
[ANE-1027] Reduce broker trace file size #146

Merged: 4 commits into main, Sep 23, 2024

Conversation


@nficca (Contributor) commented Sep 19, 2024

Overview

Broker generates a lot of traces, which can result in very large trace files. This PR aims to reduce the file size by updating the trace subscriber layer to:

  • Reduce trace verbosity
  • Log only span new and close events

Does this even work?

I ran broker against a handful of repos independently. Here are the results:

| Repo | Old trace file size | New trace file size | Diff |
| --- | --- | --- | --- |
| https://github.com/fossas/fossa-cli | 11.6MB | 1.6MB | -10MB (-86%) |
| https://github.com/tensorflow/tensorflow | 168.7MB | 2.4MB | -166.3MB (-99%) |
| https://github.com/kubernetes/kubernetes | 403.1MB | 2.5MB | -400.6MB (-99%) |
| https://github.com/facebook/react | 2.04GB | 62.3MB | -1.98GB (-97%) |

As you can see: a significant improvement.

What else can we do?

The ticket points to two possible improvements that I think would be more consistent and substantial than what the changes here can accomplish.

The first is to compress the trace files:

> We should also probably store the trace files compressed. Since compression algorithms rely on a dictionary of compression words, this would normally mean keeping the trace file open as a compression stream through the whole program lifetime, however we want to be as resilient as possible to program termination (modulo disk cache, can’t do much about that). So we’ll probably need to have a system where the primary trace file is written uncompressed, and a background job periodically compresses it into a larger trace file that spans the entire day.
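A background compaction job along those lines might look roughly like the following sketch, assuming the flate2 crate for gzip and hypothetical file paths (this is not from the PR; a real version would also have to guard the window between the read and the truncate against concurrent writes):

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

use flate2::write::GzEncoder;
use flate2::Compression;

/// Append the current uncompressed trace file to a daily compressed
/// archive, then truncate the live file. Broker keeps writing plain
/// text, so abrupt termination loses at most one window of traces.
fn compact_traces(live_path: &str, archive_path: &str) -> io::Result<()> {
    let contents = std::fs::read(live_path)?;
    if contents.is_empty() {
        return Ok(());
    }

    // Each compaction appends an independent gzip member; concatenated
    // gzip members still decompress as a single stream.
    let archive = OpenOptions::new()
        .create(true)
        .append(true)
        .open(archive_path)?;
    let mut encoder = GzEncoder::new(archive, Compression::default());
    encoder.write_all(&contents)?;
    encoder.finish()?;

    // Truncate the live file now that its contents are archived.
    File::create(live_path)?;
    Ok(())
}
```

Run on a timer, this keeps the live file small while the day's archive grows compressed.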

The second is to serialize the traces in something more compact than json:

> Lastly, it may be useful to record traces in a format other than JSON. We chose this format in the hopes that it’d be debuggable without dedicated tooling (until we get tooling built up), but if switching to a different format would save a ton of space it’s probably worth taking the time to build the tooling. After all, it’s not really realistic for a person to review gigabytes of JSON, much less actually debug it, without dedicated tooling either.
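For a rough sense of the potential savings, here is a sketch (not from this PR) comparing the same record in JSON versus a binary serde format; bincode is just one stand-in for "more compact than JSON", and the record shape is simplified from the samples later in this thread:

```rust
use serde::Serialize;

// Simplified stand-in for a single trace record.
#[derive(Serialize)]
struct TraceRecord<'a> {
    timestamp: &'a str,
    level: &'a str,
    message: &'a str,
    target: &'a str,
}

fn main() {
    let record = TraceRecord {
        timestamp: "2024-09-19T21:32:44.075971Z",
        level: "INFO",
        message: "enter",
        target: "broker::db::sqlite",
    };

    // JSON repeats every field name in every record; bincode writes
    // only lengths and values, at the cost of needing tooling to read.
    let as_json = serde_json::to_vec(&record).expect("serializable");
    let as_bin = bincode::serialize(&record).expect("serializable");
    println!("json: {} bytes, bincode: {} bytes", as_json.len(), as_bin.len());
}
```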

I think both of these are strong options, but I fear they may be too time-consuming for an inexperienced Rust developer like myself to tackle as part of this ticket. I am hoping this improvement is satisfactory on its own, but perhaps we should file follow-up tickets for these two options.

Note about unit tests

There doesn't seem to be a precedent for testing the traces, so I haven't added any. Let me know if there's a good way to add some.

Acceptance criteria

  • Broker trace file size should now be smaller

Testing plan

  1. Run broker for a period of time without these changes and observe the broker.trace file size.
  2. Move the trace file somewhere else, or wipe it completely.
  3. Run broker for the same period of time with these changes and confirm the broker.trace file is smaller.

Risks

We lose some trace information, such as when spans are entered/exited. We still have info on when spans are created and closed though, which I think is more relevant.

References

  • ANE-1027: Make Broker trace file size more manageable

Checklist

  • I added tests for this PR's change (or explained in the PR description why tests don't make sense).

src/debug.rs (outdated), comment on lines 143 to 146:
```rust
.flatten_event(true)
.with_thread_ids(true)
.with_span_list(false)
.with_span_events(FmtSpan::ACTIVE)
```
@nficca (Contributor, Author) commented:
  • `flatten_event`[^1]
    • This is just a minor simplification of the JSON structure that should save a bit of space.
  • `with_span_list`[^2] and `with_thread_ids`[^3]
    • The idea here is that we don't need to add the span list to every trace if we include the thread ID.
    • We can infer the span list by looking at the previously entered/exited spans on the same thread ID for a given trace.
  • `with_span_events`[^4]
    • Switching to ACTIVE, as that includes all enters/exits, which seem to me far more relevant for debugging than creates/drops. (A self-contained sketch of this configuration follows after the footnotes.)

Footnotes

[^1]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.flatten_event
[^2]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_span_list
[^3]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_thread_ids
[^4]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_span_events
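For context, the configuration under discussion, assembled into a self-contained sketch. This is illustrative only: Broker's actual setup in src/debug.rs wires the layer into a larger subscriber stack, and the registry/init scaffolding here is assumed, not taken from the PR.

```rust
use tracing_subscriber::fmt::format::FmtSpan;
use tracing_subscriber::prelude::*;

// Sketch of the first-iteration layer configuration discussed above
// (requires tracing-subscriber's "json" feature).
fn init_tracing() {
    tracing_subscriber::registry()
        .with(
            tracing_subscriber::fmt::layer()
                .json()                             // one JSON object per trace line
                .flatten_event(true)                // merge event fields into the top-level object
                .with_thread_ids(true)              // tag each trace with its thread ID
                .with_span_list(false)              // drop the per-trace span list
                .with_span_events(FmtSpan::ACTIVE), // synthesize enter/exit events
        )
        .init();
}

fn main() {
    init_tracing();
    tracing::info!("tracing initialized");
}
```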

@jssblck (Member) commented Sep 19, 2024:

> `flatten_event`

Do you have examples of what this looks like before/after?

> `with_span_list` and `with_thread_ids`
> Idea here is that we don't need to add the span list to every trace if we include the thread id

Is the idea that by recording thread IDs, we can retain the context across temporally intertwined span events without actually including them in the message?

If so, this won't actually do what you're after: tokio is multithreaded, yes, but only as an implementation detail; it is primarily a work-stealing executor of concurrent lightweight tasks.

In other words, the same thread can work on many disparate tasks that have no relation to one another. So unfortunately I don't think this'll avoid that problem.

@nficca (Contributor, Author) replied:

> Do you have examples of what this looks like before/after?

Sure! Before:

{"timestamp":"2024-09-19T21:32:44.075971Z","level":"INFO","fields":{"message":"enter"},"target":"broker::db::sqlite","span":{"location":"\"/Users/nic/.config/fossa/broker/db.sqlite\"","name":"connect"},"threadId":"ThreadId(1)"}

After:

{"timestamp":"2024-09-19T19:38:11.111064Z","level":"INFO","message":"enter","target":"broker::db::sqlite","span":{"location":"\"/Users/nic/.config/fossa/broker/db.sqlite\"","name":"connect"},"threadId":"ThreadId(1)"}

> If so, this won't actually do what you're after: tokio is multithreaded, yes, but only as an implementation detail; it is primarily a work-stealing executor of concurrent lightweight tasks.
>
> In other words, the same thread can work on many disparate tasks that have no relation to one another. So unfortunately I don't think this'll avoid that problem.

Damn, okay, good to know. I was hoping that this would be a cheeky built-in way to accomplish what you mentioned in this thread but alas. I'll take another stab at it!

@nficca nficca marked this pull request as ready for review September 19, 2024 21:22
@nficca nficca requested a review from a team as a code owner September 19, 2024 21:22
@nficca nficca requested a review from zlav September 19, 2024 21:22
@jssblck jssblck self-requested a review September 19, 2024 21:22
@zlav zlav removed their request for review September 19, 2024 21:24
@jssblck (Member) left a review comment:

Added comment above

@jssblck (Member) commented Sep 19, 2024:

> As you can see, the results vary. Most runs saw a significant improvement, but react actually got very slightly worse. I'm not totally sure why this is.

We should definitely figure out what's going on here: if we don't understand the result, we should dig in until we do. At minimum, it means our mental model for these changes doesn't fully match reality.

- Don't show thread IDs
- Include the span list
- Only show NEW and CLOSE span events
```diff
@@ -342,7 +342,6 @@ fn parse_ls_remote(output: String) -> Vec<Reference> {
         .collect()
 }
 
-#[tracing::instrument]
 fn line_to_git_ref(line: &str) -> Option<Reference> {
```
@nficca (Contributor, Author) commented:

This was causing A LOT of traces. The function above (parse_ls_remote) calls this function for every single reference in the repo, and that can be a lot (tensorflow has 44,567 for example). I don't think we need a span created for each of these.
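One alternative (a sketch, not what this PR does) would be to keep a single span on the caller and record the aggregate count as a field; the `ref_count` field is hypothetical, while `Reference` and `line_to_git_ref` come from the surrounding file:

```rust
use tracing::field::Empty;

// Sketch: declare an empty `ref_count` field on the parent span and
// fill it in once, instead of creating a span per reference.
#[tracing::instrument(skip(output), fields(ref_count = Empty))]
fn parse_ls_remote(output: String) -> Vec<Reference> {
    let refs: Vec<Reference> = output.lines().filter_map(line_to_git_ref).collect();
    tracing::Span::current().record("ref_count", refs.len() as u64);
    refs
}
```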

```diff
@@ -140,7 +140,8 @@ impl Config {
             .with(
                 tracing_subscriber::fmt::layer()
                     .json()
-                    .with_span_events(FmtSpan::FULL)
+                    .flatten_event(true)
+                    .with_span_events(FmtSpan::NEW | FmtSpan::CLOSE)
```
@nficca (Contributor, Author) commented:

IMO, these are the span events we really care about. From the docs:

> `FmtSpan::NEW`: An event will be synthesized when spans are created.
> `FmtSpan::ENTER`: An event will be synthesized when spans are entered.
> `FmtSpan::EXIT`: An event will be synthesized when spans are exited.
> `FmtSpan::CLOSE`: An event will be synthesized when a span closes.

From my analysis, I noticed that ENTER is emitted far more often than the instrumented function is actually called. For example, I know that the main function in broker::cmd::run is only called once. During one sample execution, a NEW event was (understandably) emitted only once, but ENTER was emitted 426 times. I suspect this is because a thread is constantly exiting and re-entering spans as it context-switches between different tasks.
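That lines up with how `tracing` instruments async code: an instrumented async fn exits its span whenever the future suspends and re-enters it on every poll. A minimal illustration (function names are invented, not Broker's):

```rust
use tracing::instrument;

// Each `.await` is a suspension point: the span is exited when the
// task yields and re-entered when the executor polls it again, so a
// single call can emit many ENTER/EXIT events but only one NEW and
// one CLOSE.
#[instrument]
async fn sync_repo(name: &str) {
    fetch_references(name).await; // span exits while suspended here...
    upload_results(name).await;   // ...and re-enters on each resume
}

async fn fetch_references(_name: &str) {}
async fn upload_results(_name: &str) {}
```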

@jssblck (Member) replied:

> I suspect this is because a thread is constantly exiting and re-entering spans as it context-switches between different tasks.

Correct!

> IMO, these are the span events we really care about.

Good insight, I agree completely. Really, the "new" and "close" events are all we need for latency measures.
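To make that concrete, here is a sketch (not from this PR) of deriving a span's latency from only its synthesized new/close timestamps, using chrono; the close timestamp below is hypothetical:

```rust
use chrono::{DateTime, Utc};

// A span's latency is simply close minus new; the ENTER and EXIT
// events in between are not needed for this measure.
fn span_latency(new_ts: &str, close_ts: &str) -> chrono::Duration {
    let new: DateTime<Utc> = new_ts.parse().expect("valid RFC 3339 timestamp");
    let close: DateTime<Utc> = close_ts.parse().expect("valid RFC 3339 timestamp");
    close - new
}

fn main() {
    let latency = span_latency(
        "2024-09-19T21:32:44.075971Z",
        "2024-09-19T21:32:44.581204Z", // hypothetical close timestamp
    );
    println!("span took {} ms", latency.num_milliseconds());
}
```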

@nficca nficca requested a review from jssblck September 20, 2024 17:40
@nficca (Contributor, Author) commented Sep 20, 2024:

@jssblck I think I've come to a much better solution now. See my two new comments where I justify the changes. I've also updated the table in the PR description (way more significant file size reductions 🙌). Let me know what you think.

@jssblck (Member) left a review:

🔥 97% reductions with no real loss in available information are always great to see. Great work!

@nficca nficca force-pushed the ANE-1027-reduce-broker-trace-file-size branch from 8b2d3c3 to 454a43c Compare September 23, 2024 17:20
@nficca nficca merged commit 97dd73b into main Sep 23, 2024
14 checks passed
@nficca nficca deleted the ANE-1027-reduce-broker-trace-file-size branch September 23, 2024 17:30