
Preallocate p.stacks() before calling runtime.GoroutineProfile() #11

Closed
sumerc wants to merge 3 commits

Conversation

@sumerc sumerc commented Dec 10, 2020

Fixes #10

Owner

@felixge felixge left a comment


@sumerc thank you so much for submitting this and sorry for the late review.

PTAL at my comment & suggestion!

fgprof.go Outdated
nGoroutines := runtime.NumGoroutine()
if len(p.stacks) < nGoroutines {
	p.stacks = make([]runtime.StackRecord, int(float64(nGoroutines)*1.1))
}
Owner

@felixge felixge Dec 15, 2020


Hm, I was thinking about doing this outside the GoroutineProfile() func; otherwise it costs us extra cycles even after we've settled on a suitably large stacks size. (I wasn't able to see a significant effect from this in the benchmark, but I think it's theoretically clear why it has higher overhead.)

Maybe something like this? And then changing all the places where we initialize a &profiler{} struct right now?

// newProfiler returns a new profiler with a pre-allocated stacks field. This
// is slightly faster than discovering its size during the first
// GoroutineProfile() call.
func newProfiler() *profiler {
	n := runtime.NumGoroutine()
	return &profiler{
		stacks: make([]runtime.StackRecord, int(float64(n)*1.1)),
	}
}

Let me know what you think!
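To make the suggestion concrete, here's a self-contained sketch of the call-site change; the profiler struct below is a stub with just the stacks field, fgprof's real struct has more:

```go
package main

import (
	"fmt"
	"runtime"
)

// Stub of fgprof's profiler type; the real struct has more fields.
type profiler struct {
	stacks []runtime.StackRecord
}

// newProfiler returns a profiler with a pre-allocated stacks slice,
// sized with 10% headroom over the current goroutine count so the
// first GoroutineProfile() call usually succeeds without growing.
func newProfiler() *profiler {
	n := runtime.NumGoroutine()
	return &profiler{
		stacks: make([]runtime.StackRecord, int(float64(n)*1.1)),
	}
}

func main() {
	// Call sites change from &profiler{} to newProfiler().
	p := newProfiler()
	fmt.Println(len(p.stacks) >= 1)
}
```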

Author


Yes! This is the correct way.

I was trying to avoid make() with the len(p.stacks) < nGoroutines check, but as you indicated, runtime.NumGoroutine() will be called redundantly most of the time.
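For reference, the runtime documentation suggests a grow-and-retry pattern for runtime.GoroutineProfile: it returns ok=false when the slice is too small, along with the number of records needed. A minimal sketch of that loop with the 10% headroom from this PR (not fgprof's actual code):

```go
package main

import (
	"fmt"
	"runtime"
)

// goroutineProfile grows its buffer until runtime.GoroutineProfile
// reports success, then returns the filled records.
func goroutineProfile(stacks []runtime.StackRecord) []runtime.StackRecord {
	for {
		n, ok := runtime.GoroutineProfile(stacks)
		if ok {
			return stacks[:n]
		}
		// Grow with ~10% headroom so a few new goroutines don't
		// force another retry on the next call.
		stacks = make([]runtime.StackRecord, int(float64(n)*1.1)+1)
	}
}

func main() {
	records := goroutineProfile(nil)
	fmt.Println(len(records) >= 1) // at least the main goroutine
}
```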

Owner


Awesome! Do you have interest/time in modifying your PR as suggested?

Author

@sumerc sumerc Dec 15, 2020


Well, I have done some further tests on this.

The issue is: since runtime.NumGoroutine is faster than calling GoroutineProfile, I thought it would make sense to optimize for this when I opened the issue. But I did not know p.stacks is reused. My previous implementation was still not good enough, as it still adds extra cycles when the goroutine count does not change between calls to GoroutineProfile.

I pushed a sample implementation of the above, but it seems like it does not make much difference, because newProfiler is no longer on the hot path; it is only initialized once. It adds more complexity for too little gain. So, it is up to you:

  • We can close this PR as is and focus on maybe more important things in the hot path,
  • We can merge.

:)

Owner

@felixge felixge Dec 15, 2020


> Well, I have done some further tests on this.
>
> The issue is: since runtime.NumGoroutine is faster than calling GoroutineProfile, I thought it would make sense to optimize for this when I opened the issue. But I did not know p.stacks is reused. My previous implementation was still not good enough, as it still adds extra cycles when the goroutine count does not change between calls to GoroutineProfile.
>
> I pushed a sample implementation of the above, but it seems like it does not make much difference, because newProfiler is no longer on the hot path; it is only initialized once. It adds more complexity for too little gain.

Yeah, I agree that it won't have a big impact. So if you feel like this is a bad tradeoff, then I'd also prefer not to merge the patch.

> So, it is up to you:
>
>   • We can close this PR as is and focus on maybe more important things in the hot path,
>   • We can merge.

Ok, I think we should close this and look for other opportunities.

I think the main issue with fgprof right now is that it doesn't scale to programs that use tons of goroutines. I'm starting a new job at Datadog's Profiler team in January that might give me the opportunity to submit proposals and patches to the Go core for it.

But meanwhile it might be nice for the profiler to detect if it's too slow (and causing issues for the profiled program). When that happens we could either dynamically adjust the sampling rate, or fail with an error.

Let me know if you're interested in playing with this idea.

Also, if you change your mind and want to merge the PR, I'd be okay with that too. It's not a lot of complexity to add IMO.

Owner


My mental model is this: calling runtime.GoroutineProfile() calls stopTheWorld(), so for any time we spend inside of it, we assume the user's application is totally stopped and we're "stealing" its time for profiling. The goal is to not steal more than e.g. 1% of the program's wall clock time.

We don't really know how much work the program is doing when not profiling, but I don't think that's relevant to the goal of the profiler, which is to minimize interference with the program.

Author


Now I understand. I forgot about the STW in GoroutineProfile. I am still in the learning phase with the runtime. Now it also makes sense to look at wall time (instead of CPU time).

I will do some experiments myself and think a bit more...

Author

@sumerc sumerc Dec 24, 2020


I was too busy lately and could not find much time to look into this.

But I thought about the problem a bit more, and I agree that your approach of trying to bound profiler wall time to 1% seems both simple to implement and powerful enough, like you suggested.

A couple of issues came to my mind; maybe overthinking for the current phase, but I'm curious about your thoughts on these:

  1. While I have not tested it yet: I assume that when STW occurs, it might have some adverse effects on the system, as it will hurt the cache locality of some processors. One test we could do, either manually or automatically, is to have a large number of CPU-intensive goroutines executing and doing some work. Then we run this both with and without the profiler and see if the result also changes by about 1%. It would be nice to test this, as there might be other kinds of adverse effects as well.

  2. I think this is the bigger problem: what happens if GoroutineProfile itself actually takes more than 1%? Yes, we could simply skip some samples on the next iterations, but that might hurt accuracy. What if it took 30ms? 50ms? 100ms? Those numbers might just be too big for the application. Again, I have not tested it myself, but I think this is not uncommon if you have many goroutines: runtime: GoroutineProfile causes latency spikes in high-goroutine-count application golang/go#33250. One idea might be to estimate how long a GoroutineProfile will take based on previous data, but again, I don't know what we would do even with that estimate. Another idea: get some help from the runtime. What if GoroutineProfile returned some data even when len(p) < n? Even when we don't have all the goroutine data, we would at least have something (maybe some randomness could be applied in the runtime to return different goroutines every time?). So, instead of returning nothing, we would at least have something to profile. Otherwise it is completely binary, right? Any ideas on this?

Ah, one last thing: while I was re-reading your previous comments, I could not understand this one:

> Of course there is plenty of room to get fancy. E.g. using exponentially weighted averages for tracking the overhead or PID controllers for adapting the frequency.

Can you elaborate a bit on what this means?
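For what it's worth, here is a minimal sketch of the exponentially weighted average idea (the names are mine, not from fgprof): each new overhead sample is blended with the running average, so a single spike decays over time instead of permanently skewing the estimate.

```go
package main

import "fmt"

// ewma keeps an exponentially weighted moving average: recent samples
// dominate and old ones decay geometrically. alpha in (0,1] controls
// how quickly old observations are forgotten.
type ewma struct {
	alpha float64
	value float64
	init  bool
}

func (e *ewma) update(sample float64) float64 {
	if !e.init {
		e.value, e.init = sample, true
	} else {
		e.value = e.alpha*sample + (1-e.alpha)*e.value
	}
	return e.value
}

func main() {
	// Track per-sample profiler overhead (in ms, say). The 9ms spike
	// pulls the average up, then decays on the next sample.
	o := ewma{alpha: 0.5}
	for _, ms := range []float64{1, 1, 9, 1} {
		fmt.Printf("%.3f\n", o.update(ms))
	}
}
```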

Owner


@sumerc this PR is an odd place for this discussion, so I've created a new issue and will reply there: #12

If you're okay with it, I think we should close this PR for now?

Author


Sure.

Successfully merging this pull request may close these issues.

A small performance optimization
2 participants