
Preallocate p.stacks() before calling runtime.GoroutineProfile() #11

Closed
sumerc wants to merge 3 commits

Conversation

@sumerc sumerc commented Dec 10, 2020

Fixes #10

Owner

@felixge felixge left a comment


@sumerc thank you so much for submitting this and sorry for the late review.

PTAL at my comment & suggestion!

fgprof.go Outdated
nGoroutines := runtime.NumGoroutine()
if len(p.stacks) < nGoroutines {
	p.stacks = make([]runtime.StackRecord, int(float64(nGoroutines)*1.1))
}
Owner

@felixge felixge Dec 15, 2020


Hm, I was thinking about doing this outside the GoroutineProfile() func; otherwise it costs us extra cycles even after we've settled on a suitably large stacks size. (I wasn't able to see a significant effect from this in the benchmark, but I think it's theoretically clear why it has higher overhead.)

Maybe something like this? And then changing all the places where we initialize a &profiler{} struct right now?

// newProfiler returns a new profiler with a pre-allocated stacks field. This
// is slightly faster than discovering its size during the first
// GoroutineProfile() call.
func newProfiler() *profiler {
	n := runtime.NumGoroutine()
	return &profiler{
		stacks: make([]runtime.StackRecord, int(float64(n)*1.1)),
	}
}

Let me know what you think!
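To make the suggestion concrete, here's a self-contained sketch of the call-site change; the profiler struct below is a stub with just the stacks field, fgprof's real struct has more:

```go
package main

import (
	"fmt"
	"runtime"
)

// Stub of fgprof's profiler type; the real struct has more fields.
type profiler struct {
	stacks []runtime.StackRecord
}

// newProfiler returns a profiler with a pre-allocated stacks slice,
// sized with 10% headroom over the current goroutine count so the
// first GoroutineProfile() call usually succeeds without growing.
func newProfiler() *profiler {
	n := runtime.NumGoroutine()
	return &profiler{
		stacks: make([]runtime.StackRecord, int(float64(n)*1.1)),
	}
}

func main() {
	// Call sites change from &profiler{} to newProfiler().
	p := newProfiler()
	fmt.Println(len(p.stacks) >= 1)
}
```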

Author


Yes! This is the correct way.

I was trying to avoid make() with the len(p.stacks) < nGoroutines check, but as you indicated, runtime.NumGoroutine() will be called redundantly most of the time.
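For reference, the runtime documentation suggests a grow-and-retry pattern for runtime.GoroutineProfile: it returns ok=false when the slice is too small, along with the number of records needed. A minimal sketch of that loop with the 10% headroom from this PR (not fgprof's actual code):

```go
package main

import (
	"fmt"
	"runtime"
)

// goroutineProfile grows its buffer until runtime.GoroutineProfile
// reports success, then returns the filled records.
func goroutineProfile(stacks []runtime.StackRecord) []runtime.StackRecord {
	for {
		n, ok := runtime.GoroutineProfile(stacks)
		if ok {
			return stacks[:n]
		}
		// Grow with ~10% headroom so a few new goroutines don't
		// force another retry on the next call.
		stacks = make([]runtime.StackRecord, int(float64(n)*1.1)+1)
	}
}

func main() {
	records := goroutineProfile(nil)
	fmt.Println(len(records) >= 1) // at least the main goroutine
}
```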

Owner


Awesome! Do you have interest/time in modifying your PR as suggested?

Author

@sumerc sumerc Dec 15, 2020


Well, I have done some further tests on this.

The issue is: since runtime.NumGoroutine is faster than calling GoroutineProfile, I thought it would make sense to optimize for this when I opened the issue. But I did not know p.stacks is reused. My previous implementation was still not good enough, as it still adds extra cycles when the goroutine count does not change between calls to GoroutineProfile.

I pushed a sample implementation of the above, but it seems like it does not make much difference, because newProfiler is no longer on the hot path; it is only initialized once. It adds more complexity for too little gain. So, it is up to you:

  • We can close this PR as is and focus on maybe more important things in the hot path,
  • We can merge.

:)

Owner

@felixge felixge Dec 15, 2020


> Well, I have done some further tests on this.
>
> The issue is: since runtime.NumGoroutine is faster than calling GoroutineProfile, I thought it would make sense to optimize for this when I opened the issue. But I did not know p.stacks is reused. My previous implementation was still not good enough, as it still adds extra cycles when the goroutine count does not change between calls to GoroutineProfile.
>
> I pushed a sample implementation of the above, but it seems like it does not make much difference, because newProfiler is no longer on the hot path; it is only initialized once. It adds more complexity for too little gain.

Yeah, I agree that it won't have a big impact. So if you feel like this is a bad tradeoff, then I'd also prefer not to merge the patch.

> So, it is up to you:
>
>   • We can close this PR as is and focus on maybe more important things in the hot path,
>   • We can merge.

Ok, I think we should close this and look for other opportunities.

I think the main issue with fgprof right now is that it doesn't scale to programs that use tons of goroutines. I'm starting a new job at Datadog's Profiler team in January that might give me the opportunity to submit proposals and patches to the Go core for it.

But meanwhile it might be nice for the profiler to detect if it's too slow (and causing issues for the profiled program). When that happens we could either dynamically adjust the sampling rate, or fail with an error.

Let me know if you're interested in playing with this idea.

Also, if you change your mind and want to merge the PR, I'd be okay with that too. It's not a lot of complexity to add IMO.

Owner


My mental model is this: calling runtime.GoroutineProfile() calls stopTheWorld(), so for any time we spend inside of it, we assume the user's application is totally stopped and we're "stealing" its time for profiling. The goal is to not steal more than e.g. 1% of the program's wall clock time.

We don't really know how much work the program is doing when not profiling, but I don't think that's relevant to the goal of the profiler, which is to minimize interference with the program.

Author


Now I understand. I forgot about the STW in GoroutineProfile. I am still in the learning phase with the runtime. Now it also makes sense to look at wall time (instead of CPU time).

I will do some experiments myself and think a bit more...

Author

@sumerc sumerc Dec 24, 2020


I was too busy lately and could not find much time to look into this.

But I thought about the problem a bit more, and I agree that your approach of trying to bound profiler wall time to 1% seems both simple to implement and powerful enough, like you suggested.

A couple of issues came to my mind; maybe overthinking for the current phase, but I'm curious about your thoughts on these:

  1. While I have not tested it yet: I assume that when STW occurs, it might have some adverse effects on the system, as it will hurt the cache locality of some processors. One test we could do, either manually or automatically, is to have a large number of CPU-intensive goroutines executing and doing some work. Then we run this both with and without the profiler and see if the result also changes by about 1%. It would be nice to test this, as there might be other kinds of adverse effects as well.

  2. I think this is the bigger problem: what happens if GoroutineProfile itself actually takes more than 1%? Yes, we could simply skip some samples on the next iterations, but that might hurt accuracy. What if it took 30ms? 50ms? 100ms? Those numbers might just be too big for the application. Again, I have not tested it myself, but I think this is not uncommon if you have many goroutines: runtime: GoroutineProfile causes latency spikes in high-goroutine-count application golang/go#33250. One idea might be to estimate how long a GoroutineProfile will take based on previous data, but again, I don't know what we would do even with that estimate. Another idea: get some help from the runtime. What if GoroutineProfile returned some data even when len(p) < n? Even when we don't have all the goroutine data, we would at least have something (maybe some randomness could be applied in the runtime to return different goroutines every time?). So, instead of returning nothing, we would at least have something to profile. Otherwise it is completely binary, right? Any ideas on this?

Ah, one last thing: while I was re-reading your previous comments, I could not understand this one:

> Of course there is plenty of room to get fancy. E.g. using exponentially weighted averages for tracking the overhead or PID controllers for adapting the frequency.

Can you elaborate a bit on what this means?
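For what it's worth, here is a minimal sketch of the exponentially weighted average idea (the names are mine, not from fgprof): each new overhead sample is blended with the running average, so a single spike decays over time instead of permanently skewing the estimate.

```go
package main

import "fmt"

// ewma keeps an exponentially weighted moving average: recent samples
// dominate and old ones decay geometrically. alpha in (0,1] controls
// how quickly old observations are forgotten.
type ewma struct {
	alpha float64
	value float64
	init  bool
}

func (e *ewma) update(sample float64) float64 {
	if !e.init {
		e.value, e.init = sample, true
	} else {
		e.value = e.alpha*sample + (1-e.alpha)*e.value
	}
	return e.value
}

func main() {
	// Track per-sample profiler overhead (in ms, say). The 9ms spike
	// pulls the average up, then decays on the next sample.
	o := ewma{alpha: 0.5}
	for _, ms := range []float64{1, 1, 9, 1} {
		fmt.Printf("%.3f\n", o.update(ms))
	}
}
```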

Owner


@sumerc this PR is an odd place for this discussion, so I've created a new issue and will reply there: #12

If you're okay with it, I think we should close this PR for now?

Author


Sure.

Successfully merging this pull request may close these issues.

A small performance optimization
2 participants