
overhead negligible goroutine sampling #31

Open
zdyj3170101136 opened this issue Jun 27, 2024 · 8 comments

@zdyj3170101136

The main cost of goroutine profiling comes from stopping the world, and from walking each goroutine's stack with the expensive generic unwinder.

We can optimize this by using frame-pointer (fp) traceback. Even if we traverse thousands of goroutines, the cost is negligible, so the only remaining overhead is stopping and restarting the world.

see golang/go#66915
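For context, this is the operation the proposal wants to make cheap; a minimal sketch of taking a goroutine profile from user code today (the runtime briefly stops the world and unwinds every stack with the generic unwinder):

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Dump a profile of all goroutine stacks; debug=0 writes the
	// binary pprof format that go tool pprof understands.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 0)
}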

@felixge
Owner

felixge commented Jun 27, 2024

This will eventually be implemented upstream in Go. Once this is done, fgprof will automatically support it. So I think this issue can be closed?

@zdyj3170101136
Author

This will eventually be implemented upstream in Go. Once this is done, fgprof will automatically support it. So I think this issue can be closed?

Stopping and starting the world is so expensive that goroutine sampling cannot be left enabled continuously in production.

If we could instead use an eBPF uprobe on goroutine status changes and extract the trace id from the request (its offset is fixed),

we could analyze the off-CPU time (blocked or in a syscall) spent by each request.
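For illustration only (request and traceID are hypothetical names, not an existing API), the fixed-offset part could be checked from the Go side like this:

package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical request type; the trace id sits at a fixed, known offset
// that an eBPF program could read from the pointer captured at the uprobe.
type request struct {
	traceID uint64
	body    []byte
}

func main() {
	var r request
	fmt.Println("traceID offset:", unsafe.Offsetof(r.traceID))
}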

@felixge
Owner

felixge commented Jun 28, 2024

Stopping and starting the world is so expensive that goroutine sampling cannot be left enabled continuously in production.

Are you sure? It should not be too expensive.

If we could instead use an eBPF uprobe on goroutine status changes and extract the trace id from the request (its offset is fixed),

Uprobes are pretty expensive (2-3 µs each).
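(Rough, hedged math: at ~2.5 µs per hit, probing every goroutine status transition in a service doing on the order of 100k transitions per second would already burn ~0.25 CPU-seconds every second; that transition rate is an assumption, not a measurement.)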

Do you have benchmarks?

@zdyj3170101136
Author

This will eventually be implemented upstream in Go

This would only be supported in Go 1.23+.

But if we ship it as an importable Go module, fgprof could support this on lower Go versions.
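For illustration, a hedged sketch of how such a module could be wired the same way fgprof is today (fgprof.Handler is fgprof's existing API; the backported fp-traceback sampler itself is hypothetical):

package main

import (
	"net/http"

	"github.com/felixge/fgprof"
)

func main() {
	// fgprof's existing HTTP endpoint; an importable fp-traceback sampler
	// could expose a handler in the same way on pre-1.23 toolchains.
	http.Handle("/debug/fgprof", fgprof.Handler())
	http.ListenAndServe(":6060", nil)
}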

@zdyj3170101136
Author

zdyj3170101136 commented Jul 1, 2024

Benchmark.

I ran the benchmarks in src/encoding/json:

go version go1.20.12 linux/amd64

Bench code

The code below records a traceback of every goroutine at 100 Hz.

// Package clause and imports added for completeness; in the benchmark this
// presumably lived in a file compiled into the package under test.
package main

import (
	"sync/atomic"
	"time"
	"unsafe" // also needed by the //go:linkname directives below
)

type stack struct {
	lo uintptr
	hi uintptr
}

type gobuf struct {
	sp   uintptr
	pc   uintptr
	g    uintptr
	ctxt unsafe.Pointer
	ret  uintptr
	lr   uintptr
	bp   uintptr
}

type g struct {
	// Stack parameters.
	// stack describes the actual stack memory: [stack.lo, stack.hi).
	// stackguard0 is the stack pointer compared in the Go stack growth prologue.
	// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
	// stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
	// It is stack.lo+StackGuard on g0 and gsignal stacks.
	// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
	stack       stack   // offset known to runtime/cgo
	stackguard0 uintptr // offset known to liblink
	stackguard1 uintptr // offset known to liblink

	_panic    unsafe.Pointer // innermost panic - offset known to liblink
	_defer    unsafe.Pointer // innermost defer
	m         unsafe.Pointer // current m; offset known to arm liblink
	sched     gobuf
	syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
	syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
	stktopsp  uintptr // expected sp at top of stack, to check in traceback
	// param is a generic pointer parameter field used to pass
	// values in particular contexts where other storage for the
	// parameter would be difficult to find. It is currently used
	// in four ways:
	// 1. When a channel operation wakes up a blocked goroutine, it sets param to
	//    point to the sudog of the completed blocking operation.
	// 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
	//    the GC cycle. It is unsafe to do so in any other way, because the goroutine's
	//    stack may have moved in the meantime.
	// 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
	//    closure in the runtime is forbidden.
	// 4. When a panic is recovered and control returns to the respective frame,
	//    param may point to a savedOpenDeferState.
	param        unsafe.Pointer
	atomicstatus atomic.Uint32
}

//go:linkname forEachGRace runtime.forEachGRace
func forEachGRace(fn func(gp *g)) bool

type stwReason uint

type worldStop struct {
	reason stwReason
	start  int64
}

//go:linkname stopTheWorld runtime.stopTheWorld
func stopTheWorld(reason stwReason) worldStop

//go:linkname startTheWorld runtime.startTheWorld
func startTheWorld(w worldStop)

func init() {
	go func() {
		t := time.NewTicker(10 * time.Millisecond)
		for range t.C {
			// 4 is this runtime's internal stwReason constant for a
			// goroutine-profile stop-the-world.
			stw := stopTheWorld(4)
			forEachGRace(func(gp *g) {
				if gp.atomicstatus.Load() == 6 {
					// 6 == _Gdead: skip dead goroutines.
					return
				}
				myfpTracebackPCs(unsafe.Pointer(gp.sched.bp), make([]uintptr, 32))
			})
			startTheWorld(stw)
		}
	}()
}

func myfpTracebackPCs(fp unsafe.Pointer, pcBuf []uintptr) (i int) {
	for i = 0; i < len(pcBuf) && fp != nil; i++ {
		// return addr sits one word above the frame pointer
		pcBuf[i] = *(*uintptr)(unsafe.Pointer(uintptr(fp) + PtrSize))
		prevfp := fp
		// follow the frame pointer to the next one
		fp = unsafe.Pointer(*(*uintptr)(fp))
		if uintptr(fp) <= uintptr(prevfp) {
			break
		}
	}
	return i
}

const PtrSize = 4 << (^uintptr(0) >> 63)
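The comparison below was presumably produced with benchstat along these lines (run inside src/encoding/json; exact flags are a guess, count 10 matches the n=10 in the table):

go test -bench . -count 10 > old.txt   # baseline, without the sampler
go test -bench . -count 10 > new.txt   # with the 100 Hz sampling init() compiled in
benchstat old.txt new.txt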

Bench result

benchstat:

goos: linux
goarch: amd64
pkg: encoding/json
cpu: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
                                    │     old      │                 new                 │
                                    │    sec/op    │    sec/op     vs base               │
CodeEncoder-32                        508.1µ ±  3%   508.0µ ±  2%       ~ (p=0.853 n=10)
CodeEncoderError-32                   547.4µ ±  3%   550.1µ ±  3%       ~ (p=0.853 n=10)
CodeMarshal-32                        587.8µ ±  5%   590.4µ ±  4%       ~ (p=0.853 n=10)
CodeMarshalError-32                   686.8µ ±  3%   689.1µ ±  2%       ~ (p=0.739 n=10)
MarshalBytes/32-32                    249.7n ±  3%   245.1n ±  1%  -1.80% (p=0.011 n=10)
MarshalBytes/256-32                   761.4n ±  2%   754.6n ±  2%       ~ (p=0.247 n=10)
MarshalBytes/4096-32                  7.335µ ±  2%   7.331µ ±  1%       ~ (p=0.869 n=10)
MarshalBytesError/32-32               526.4µ ±  4%   533.2µ ±  3%       ~ (p=0.436 n=10)
MarshalBytesError/256-32              523.1µ ±  8%   521.2µ ±  3%       ~ (p=0.529 n=10)
MarshalBytesError/4096-32             538.1µ ±  3%   536.2µ ±  2%       ~ (p=1.000 n=10)
CodeDecoder-32                        2.491m ±  2%   2.441m ±  3%       ~ (p=0.143 n=10)
UnicodeDecoder-32                     289.5n ±  2%   291.7n ±  2%       ~ (p=0.393 n=10)
DecoderStream-32                      213.8n ±  2%   216.0n ±  2%       ~ (p=0.143 n=10)
CodeUnmarshal-32                      3.051m ±  5%   3.092m ±  3%       ~ (p=0.315 n=10)
CodeUnmarshalReuse-32                 2.690m ±  3%   2.677m ±  3%       ~ (p=0.912 n=10)
UnmarshalString-32                    61.87n ±  1%   62.16n ±  1%       ~ (p=0.342 n=10)
UnmarshalFloat64-32                   57.03n ±  1%   56.88n ±  1%       ~ (p=0.363 n=10)
UnmarshalInt64-32                     53.05n ±  7%   51.38n ±  3%  -3.16% (p=0.043 n=10)
Issue10335-32                         71.35n ±  1%   71.36n ±  1%       ~ (p=0.781 n=10)
Issue34127-32                         31.81n ±  2%   31.17n ±  2%       ~ (p=0.055 n=10)
Unmapped-32                           135.4n ±  2%   135.9n ±  3%       ~ (p=0.753 n=10)
TypeFieldsCache/MissTypes1-32         23.97µ ±  2%   23.81µ ±  2%       ~ (p=0.218 n=10)
TypeFieldsCache/MissTypes10-32        51.72µ ±  1%   51.61µ ±  2%       ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes100-32       262.2µ ±  3%   261.4µ ±  2%       ~ (p=0.912 n=10)
TypeFieldsCache/MissTypes1000-32      1.975m ±  5%   1.968m ±  4%       ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes10000-32     17.94m ±  2%   18.02m ±  1%       ~ (p=0.143 n=10)
TypeFieldsCache/MissTypes100000-32    176.6m ±  3%   173.7m ±  2%       ~ (p=0.280 n=10)
TypeFieldsCache/MissTypes1000000-32    1.935 ± 10%    1.992 ±  9%       ~ (p=0.481 n=10)
TypeFieldsCache/HitTypes1-32          1.704n ±  8%   1.650n ± 14%       ~ (p=0.325 n=10)
TypeFieldsCache/HitTypes10-32         1.639n ±  7%   1.627n ±  2%       ~ (p=0.541 n=10)
TypeFieldsCache/HitTypes100-32        1.658n ±  5%   1.659n ±  9%       ~ (p=0.837 n=10)
TypeFieldsCache/HitTypes1000-32       1.666n ±  5%   1.655n ±  5%       ~ (p=0.592 n=10)
TypeFieldsCache/HitTypes10000-32      1.622n ±  4%   1.631n ±  3%       ~ (p=0.986 n=10)
TypeFieldsCache/HitTypes100000-32     1.649n ±  8%   1.667n ±  8%       ~ (p=0.403 n=10)
TypeFieldsCache/HitTypes1000000-32                   1.627n ±  2%
geomean                               7.194µ         5.647µ        -0.22%

From the benchmark, stopping and starting the world seems to add little overhead to the program.

But a running goroutine has to be suspended and can only resume after the world is restarted, which can be expensive for RPC services.
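One hedged way to quantify that suspension cost directly is to time each stop/start pair in the loop above (same linknamed declarations plus a log import; the 100 µs threshold is arbitrary):

go func() {
	t := time.NewTicker(10 * time.Millisecond)
	for range t.C {
		begin := time.Now()
		stw := stopTheWorld(4)
		forEachGRace(func(gp *g) {
			if gp.atomicstatus.Load() == 6 { // _Gdead
				return
			}
			myfpTracebackPCs(unsafe.Pointer(gp.sched.bp), make([]uintptr, 32))
		})
		startTheWorld(stw)
		// Log how long the world was stopped for this sample.
		if pause := time.Since(begin); pause > 100*time.Microsecond {
			log.Printf("goroutine sample stopped the world for %v", pause)
		}
	}
}()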

@felixge
Owner

felixge commented Jul 1, 2024

I think you are mistakenly assuming that the goroutine profile does a STW while taking the stack trace of all goroutines.

That has not been the case for a while now. See https://go-review.googlesource.com/c/go/+/387415

@zdyj3170101136
Author

https://go-review.googlesource.com/c/go/+/387415

It still always needs a stop-the-world. If we run the goroutine profile once per minute, that is fine,

but running it at 100 Hz is unacceptable.

@felixge
Copy link
Owner

felixge commented Jul 3, 2024

but running it at 100 Hz is unacceptable.

Can you provide an execution trace (go test -trace go.trace -bench .) that shows the unacceptable STW overhead?

I'm also not sure how to read your benchmark. Is the "old" version without goroutine profiling and the "new" version with goroutine profiling? If yes, then it seems like the version with goroutine profiling is actually faster (!).
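(If it helps: the resulting trace can be opened with go tool trace go.trace; stop-the-world ranges should be visible in the trace viewer.)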
