
[ANE-1027] Reduce broker trace file size #146

Merged: 4 commits into main, Sep 23, 2024

Conversation


@nficca (Contributor) commented Sep 19, 2024

Overview

Broker generates a lot of traces, which can result in very large trace files. This PR aims to reduce the file size by updating the trace subscriber layer to:

  • Reduce trace verbosity
  • Log only span new and close events

Does this even work?

I ran broker against a handful of repos independently. Here are the results:

| Repo | Old trace file size | New trace file size | Diff |
| --- | --- | --- | --- |
| https://github.com/fossas/fossa-cli | 11.6MB | 1.6MB | -10MB (-86%) |
| https://github.com/tensorflow/tensorflow | 168.7MB | 2.4MB | -166.3MB (-99%) |
| https://github.com/kubernetes/kubernetes | 403.1MB | 2.5MB | -400.6MB (-99%) |
| https://github.com/facebook/react | 2.04GB | 62.3MB | -1.98GB (-97%) |

As you can see: a significant improvement.

What else can we do?

The ticket points to two possible improvements that I think would be more consistent and substantial than what the changes here can accomplish.

The first is to compress the trace files:

> We should also probably store the trace files compressed. Since compression algorithms rely on a dictionary of compression words, this would normally mean keeping the trace file open as a compression stream through the whole program lifetime, however we want to be as resilient as possible to program termination (modulo disk cache, can’t do much about that). So we’ll probably need to have a system where the primary trace file is written uncompressed, and a background job periodically compresses it into a larger trace file that spans the entire day.
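A background compaction job along those lines might look roughly like the following sketch, assuming the flate2 crate for gzip and hypothetical file paths (this is not from the PR; a real version would also have to guard the window between the read and the truncate against concurrent writes):

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};

use flate2::write::GzEncoder;
use flate2::Compression;

/// Append the current uncompressed trace file to a daily compressed
/// archive, then truncate the live file. Broker keeps writing plain
/// text, so abrupt termination loses at most one window of traces.
fn compact_traces(live_path: &str, archive_path: &str) -> io::Result<()> {
    let contents = std::fs::read(live_path)?;
    if contents.is_empty() {
        return Ok(());
    }

    // Each compaction appends an independent gzip member; concatenated
    // gzip members still decompress as a single stream.
    let archive = OpenOptions::new()
        .create(true)
        .append(true)
        .open(archive_path)?;
    let mut encoder = GzEncoder::new(archive, Compression::default());
    encoder.write_all(&contents)?;
    encoder.finish()?;

    // Truncate the live file now that its contents are archived.
    File::create(live_path)?;
    Ok(())
}
```

Run on a timer, this keeps the live file small while the day's archive grows compressed.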

The second is to serialize the traces in something more compact than json:

> Lastly, it may be useful to record traces in a format other than JSON. We chose this format in the hopes that it’d be debuggable without dedicated tooling (until we get tooling built up), but if switching to a different format would save a ton of space it’s probably worth taking the time to build the tooling. After all, it’s not really realistic for a person to review gigabytes of JSON, much less actually debug it, without dedicated tooling either.
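For a rough sense of the potential savings, here is a sketch (not from this PR) comparing the same record in JSON versus a binary serde format; bincode is just one stand-in for "more compact than JSON", and the record shape is simplified from the samples later in this thread:

```rust
use serde::Serialize;

// Simplified stand-in for a single trace record.
#[derive(Serialize)]
struct TraceRecord<'a> {
    timestamp: &'a str,
    level: &'a str,
    message: &'a str,
    target: &'a str,
}

fn main() {
    let record = TraceRecord {
        timestamp: "2024-09-19T21:32:44.075971Z",
        level: "INFO",
        message: "enter",
        target: "broker::db::sqlite",
    };

    // JSON repeats every field name in every record; bincode writes
    // only lengths and values, at the cost of needing tooling to read.
    let as_json = serde_json::to_vec(&record).expect("serializable");
    let as_bin = bincode::serialize(&record).expect("serializable");
    println!("json: {} bytes, bincode: {} bytes", as_json.len(), as_bin.len());
}
```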

I think both of these are strong options, but I fear they may be too time-consuming for an inexperienced Rust developer like myself to tackle as part of this ticket. I am hoping this improvement is satisfactory on its own, but perhaps we should file follow-up tickets for these two options.

Note about unit tests

There doesn't seem to be a precedent for testing the traces, so I haven't added any. Let me know if there's a good way to add some.

Acceptance criteria

  • Broker trace file size should now be smaller

Testing plan

  1. Run broker for a period of time without these changes and observe the broker.trace file size.
  2. Move the trace file somewhere else, or wipe it completely.
  3. Run broker for the same period of time with these changes and confirm the broker.trace file is smaller.

Risks

We lose some trace information, such as when spans are entered/exited. We still have info on when spans are created and closed though, which I think is more relevant.

References

  • ANE-1027: Make Broker trace file size more manageable

Checklist

  • I added tests for this PR's change (or explained in the PR description why tests don't make sense).

src/debug.rs (outdated), comment on lines 143 to 146:
```rust
.flatten_event(true)
.with_thread_ids(true)
.with_span_list(false)
.with_span_events(FmtSpan::ACTIVE)
```
@nficca (Contributor, Author) commented:
  • `flatten_event`[^1]
    • This is just a minor simplification of the JSON structure that should save a bit of space.
  • `with_span_list`[^2] and `with_thread_ids`[^3]
    • The idea here is that we don't need to add the span list to every trace if we include the thread ID.
    • We can infer the span list by looking at the previously entered/exited spans on the same thread ID for a given trace.
  • `with_span_events`[^4]
    • Switching to ACTIVE, as that includes all enters/exits, which seem to me far more relevant for debugging than creates/drops. (A self-contained sketch of this configuration follows after the footnotes.)

Footnotes

[^1]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.flatten_event
[^2]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_span_list
[^3]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_thread_ids
[^4]: https://docs.rs/tracing-subscriber/latest/tracing_subscriber/fmt/struct.Layer.html#method.with_span_events
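For context, the configuration under discussion, assembled into a self-contained sketch. This is illustrative only: Broker's actual setup in src/debug.rs wires the layer into a larger subscriber stack, and the registry/init scaffolding here is assumed, not taken from the PR.

```rust
use tracing_subscriber::fmt::format::FmtSpan;
use tracing_subscriber::prelude::*;

// Sketch of the first-iteration layer configuration discussed above
// (requires tracing-subscriber's "json" feature).
fn init_tracing() {
    tracing_subscriber::registry()
        .with(
            tracing_subscriber::fmt::layer()
                .json()                             // one JSON object per trace line
                .flatten_event(true)                // merge event fields into the top-level object
                .with_thread_ids(true)              // tag each trace with its thread ID
                .with_span_list(false)              // drop the per-trace span list
                .with_span_events(FmtSpan::ACTIVE), // synthesize enter/exit events
        )
        .init();
}

fn main() {
    init_tracing();
    tracing::info!("tracing initialized");
}
```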

@jssblck (Member) commented Sep 19, 2024:

> `flatten_event`

Do you have examples of what this looks like before/after?

> `with_span_list` and `with_thread_ids`
> Idea here is that we don't need to add the span list to every trace if we include the thread id

Is the idea that by recording thread IDs, we can retain the context across temporally intertwined span events without actually including them in the message?

If so, this won't actually do what you're after: tokio is multithreaded, yes, but only as an implementation detail; it is primarily a work-stealing executor of concurrent lightweight tasks.

In other words, the same thread can work on many disparate tasks that have no relation to one another. So unfortunately I don't think this'll avoid that problem.

@nficca (Contributor, Author) replied:

> Do you have examples of what this looks like before/after?

Sure! Before:

{"timestamp":"2024-09-19T21:32:44.075971Z","level":"INFO","fields":{"message":"enter"},"target":"broker::db::sqlite","span":{"location":"\"/Users/nic/.config/fossa/broker/db.sqlite\"","name":"connect"},"threadId":"ThreadId(1)"}

After:

{"timestamp":"2024-09-19T19:38:11.111064Z","level":"INFO","message":"enter","target":"broker::db::sqlite","span":{"location":"\"/Users/nic/.config/fossa/broker/db.sqlite\"","name":"connect"},"threadId":"ThreadId(1)"}

> If so, this won't actually do what you're after: tokio is multithreaded, yes, but only as an implementation detail; it is primarily a work-stealing executor of concurrent lightweight tasks.
>
> In other words, the same thread can work on many disparate tasks that have no relation to one another. So unfortunately I don't think this'll avoid that problem.

Damn, okay, good to know. I was hoping that this would be a cheeky built-in way to accomplish what you mentioned in this thread but alas. I'll take another stab at it!

@nficca nficca marked this pull request as ready for review September 19, 2024 21:22
@nficca nficca requested a review from a team as a code owner September 19, 2024 21:22
@nficca nficca requested a review from zlav September 19, 2024 21:22
@jssblck jssblck self-requested a review September 19, 2024 21:22
@zlav zlav removed their request for review September 19, 2024 21:24
@jssblck (Member) left a review comment:

Added comment above

@jssblck (Member) commented Sep 19, 2024:

> As you can see, the results vary. Most runs saw a significant improvement, but react actually got very slightly worse. I'm not totally sure why this is.

We should definitely figure out what's going on here: if we don't understand the result, we should dig in until we do. At minimum, it means our mental model for these changes doesn't fully match reality.

- Don't show thread IDs
- Include the span list
- Only show NEW and CLOSE span events
```diff
@@ -342,7 +342,6 @@ fn parse_ls_remote(output: String) -> Vec<Reference> {
         .collect()
 }
 
-#[tracing::instrument]
 fn line_to_git_ref(line: &str) -> Option<Reference> {
```
@nficca (Contributor, Author) commented:

This was causing A LOT of traces. The function above (parse_ls_remote) calls this function for every single reference in the repo, and that can be a lot (tensorflow has 44,567 for example). I don't think we need a span created for each of these.
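One alternative (a sketch, not what this PR does) would be to keep a single span on the caller and record the aggregate count as a field; the `ref_count` field is hypothetical, while `Reference` and `line_to_git_ref` come from the surrounding file:

```rust
use tracing::field::Empty;

// Sketch: declare an empty `ref_count` field on the parent span and
// fill it in once, instead of creating a span per reference.
#[tracing::instrument(skip(output), fields(ref_count = Empty))]
fn parse_ls_remote(output: String) -> Vec<Reference> {
    let refs: Vec<Reference> = output.lines().filter_map(line_to_git_ref).collect();
    tracing::Span::current().record("ref_count", refs.len() as u64);
    refs
}
```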

```diff
@@ -140,7 +140,8 @@ impl Config {
             .with(
                 tracing_subscriber::fmt::layer()
                     .json()
-                    .with_span_events(FmtSpan::FULL)
+                    .flatten_event(true)
+                    .with_span_events(FmtSpan::NEW | FmtSpan::CLOSE)
```
@nficca (Contributor, Author) commented:

IMO, these are the span events we really care about. From the docs:

> `FmtSpan::NEW`: An event will be synthesized when spans are created.
> `FmtSpan::ENTER`: An event will be synthesized when spans are entered.
> `FmtSpan::EXIT`: An event will be synthesized when spans are exited.
> `FmtSpan::CLOSE`: An event will be synthesized when a span closes.

From my analysis, I noticed that ENTER is emitted far more often than the instrumented function is actually called. For example, I know that the main function in broker::cmd::run is only called once. During one sample execution, a NEW event was (understandably) emitted only once, but ENTER was emitted 426 times. I suspect this is because a thread is constantly exiting and re-entering spans as it context-switches between different tasks.
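That lines up with how `tracing` instruments async code: an instrumented async fn exits its span whenever the future suspends and re-enters it on every poll. A minimal illustration (function names are invented, not Broker's):

```rust
use tracing::instrument;

// Each `.await` is a suspension point: the span is exited when the
// task yields and re-entered when the executor polls it again, so a
// single call can emit many ENTER/EXIT events but only one NEW and
// one CLOSE.
#[instrument]
async fn sync_repo(name: &str) {
    fetch_references(name).await; // span exits while suspended here...
    upload_results(name).await;   // ...and re-enters on each resume
}

async fn fetch_references(_name: &str) {}
async fn upload_results(_name: &str) {}
```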

@jssblck (Member) replied:

> I suspect this is because a thread is constantly exiting and re-entering spans as it context-switches between different tasks.

Correct!

> IMO, these are the span events we really care about.

Good insight, I agree completely. Really, the "new" and "close" events are all we need for latency measures.
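To make that concrete, here is a sketch (not from this PR) of deriving a span's latency from only its synthesized new/close timestamps, using chrono; the close timestamp below is hypothetical:

```rust
use chrono::{DateTime, Utc};

// A span's latency is simply close minus new; the ENTER and EXIT
// events in between are not needed for this measure.
fn span_latency(new_ts: &str, close_ts: &str) -> chrono::Duration {
    let new: DateTime<Utc> = new_ts.parse().expect("valid RFC 3339 timestamp");
    let close: DateTime<Utc> = close_ts.parse().expect("valid RFC 3339 timestamp");
    close - new
}

fn main() {
    let latency = span_latency(
        "2024-09-19T21:32:44.075971Z",
        "2024-09-19T21:32:44.581204Z", // hypothetical close timestamp
    );
    println!("span took {} ms", latency.num_milliseconds());
}
```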

@nficca nficca requested a review from jssblck September 20, 2024 17:40
@nficca (Contributor, Author) commented Sep 20, 2024:

@jssblck I think I've come to a much better solution now. See my two new comments where I justify the changes. I've also updated the table in the PR description (way more significant file size reductions 🙌). Let me know what you think.

@jssblck (Member) left a review:

🔥 97% reductions with no real loss in available information are always great to see. Great work!

@nficca nficca force-pushed the ANE-1027-reduce-broker-trace-file-size branch from 8b2d3c3 to 454a43c Compare September 23, 2024 17:20
@nficca nficca merged commit 97dd73b into main Sep 23, 2024
14 checks passed
@nficca nficca deleted the ANE-1027-reduce-broker-trace-file-size branch September 23, 2024 17:30