
overhead negligible goroutine sampling #31

Open
zdyj3170101136 opened this issue Jun 27, 2024 · 8 comments

@zdyj3170101136

The main cost of goroutine profiling comes from stopping the world, and from walking each goroutine's stack with the expensive generic unwinder.

We can optimize this by using frame-pointer (fp) traceback. Even if we traverse thousands of goroutines, the cost is negligible, so the only remaining overhead is stopping and restarting the world.

see golang/go#66915
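For context, this is the operation the proposal wants to make cheap; a minimal sketch of taking a goroutine profile from user code today (the runtime briefly stops the world and unwinds every stack with the generic unwinder):

package main

import (
	"os"
	"runtime/pprof"
)

func main() {
	// Dump a profile of all goroutine stacks; debug=0 writes the
	// binary pprof format that go tool pprof understands.
	pprof.Lookup("goroutine").WriteTo(os.Stdout, 0)
}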

@felixge
Owner

felixge commented Jun 27, 2024

This will eventually be implemented upstream in Go. Once this is done, fgprof will automatically support it. So I think this issue can be closed?

@zdyj3170101136
Author

This will eventually be implemented upstream in Go. Once this is done, fgprof will automatically support it. So I think this issue can be closed?

Stopping and starting the world is so expensive that goroutine sampling cannot be left enabled continuously in production.

If we could instead use an eBPF uprobe on goroutine status changes and extract the trace id from the request (its offset is fixed),

we could analyze the off-CPU time (blocked or in a syscall) spent by each request.
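For illustration only (request and traceID are hypothetical names, not an existing API), the fixed-offset part could be checked from the Go side like this:

package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical request type; the trace id sits at a fixed, known offset
// that an eBPF program could read from the pointer captured at the uprobe.
type request struct {
	traceID uint64
	body    []byte
}

func main() {
	var r request
	fmt.Println("traceID offset:", unsafe.Offsetof(r.traceID))
}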

@felixge
Owner

felixge commented Jun 28, 2024

Stopping and starting the world is so expensive that goroutine sampling cannot be left enabled continuously in production.

Are you sure? It should not be too expensive.

If we could instead use an eBPF uprobe on goroutine status changes and extract the trace id from the request (its offset is fixed),

Uprobes are pretty expensive (2-3 µs each).
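(Rough, hedged math: at ~2.5 µs per hit, probing every goroutine status transition in a service doing on the order of 100k transitions per second would already burn ~0.25 CPU-seconds every second; that transition rate is an assumption, not a measurement.)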

Do you have benchmarks?

@zdyj3170101136
Author

This will eventually be implemented upstream in Go

This would only be supported in Go 1.23+.

But if we ship it as an importable Go module, fgprof could support this on lower Go versions.
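For illustration, a hedged sketch of how such a module could be wired the same way fgprof is today (fgprof.Handler is fgprof's existing API; the backported fp-traceback sampler itself is hypothetical):

package main

import (
	"net/http"

	"github.com/felixge/fgprof"
)

func main() {
	// fgprof's existing HTTP endpoint; an importable fp-traceback sampler
	// could expose a handler in the same way on pre-1.23 toolchains.
	http.Handle("/debug/fgprof", fgprof.Handler())
	http.ListenAndServe(":6060", nil)
}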

@zdyj3170101136
Author

zdyj3170101136 commented Jul 1, 2024

Benchmark.

I ran the benchmarks in src/encoding/json:

go version go1.20.12 linux/amd64

Bench code

The code below records a traceback of every goroutine at 100 Hz.

// Package clause and imports added for completeness; in the benchmark this
// presumably lived in a file compiled into the package under test.
package main

import (
	"sync/atomic"
	"time"
	"unsafe" // also needed by the //go:linkname directives below
)

type stack struct {
	lo uintptr
	hi uintptr
}

type gobuf struct {
	sp   uintptr
	pc   uintptr
	g    uintptr
	ctxt unsafe.Pointer
	ret  uintptr
	lr   uintptr
	bp   uintptr
}

type g struct {
	// Stack parameters.
	// stack describes the actual stack memory: [stack.lo, stack.hi).
	// stackguard0 is the stack pointer compared in the Go stack growth prologue.
	// It is stack.lo+StackGuard normally, but can be StackPreempt to trigger a preemption.
	// stackguard1 is the stack pointer compared in the //go:systemstack stack growth prologue.
	// It is stack.lo+StackGuard on g0 and gsignal stacks.
	// It is ~0 on other goroutine stacks, to trigger a call to morestackc (and crash).
	stack       stack   // offset known to runtime/cgo
	stackguard0 uintptr // offset known to liblink
	stackguard1 uintptr // offset known to liblink

	_panic    unsafe.Pointer // innermost panic - offset known to liblink
	_defer    unsafe.Pointer // innermost defer
	m         unsafe.Pointer // current m; offset known to arm liblink
	sched     gobuf
	syscallsp uintptr // if status==Gsyscall, syscallsp = sched.sp to use during gc
	syscallpc uintptr // if status==Gsyscall, syscallpc = sched.pc to use during gc
	stktopsp  uintptr // expected sp at top of stack, to check in traceback
	// param is a generic pointer parameter field used to pass
	// values in particular contexts where other storage for the
	// parameter would be difficult to find. It is currently used
	// in four ways:
	// 1. When a channel operation wakes up a blocked goroutine, it sets param to
	//    point to the sudog of the completed blocking operation.
	// 2. By gcAssistAlloc1 to signal back to its caller that the goroutine completed
	//    the GC cycle. It is unsafe to do so in any other way, because the goroutine's
	//    stack may have moved in the meantime.
	// 3. By debugCallWrap to pass parameters to a new goroutine because allocating a
	//    closure in the runtime is forbidden.
	// 4. When a panic is recovered and control returns to the respective frame,
	//    param may point to a savedOpenDeferState.
	param        unsafe.Pointer
	atomicstatus atomic.Uint32
}

//go:linkname forEachGRace runtime.forEachGRace
func forEachGRace(fn func(gp *g)) bool

type stwReason uint

type worldStop struct {
	reason stwReason
	start  int64
}

//go:linkname stopTheWorld runtime.stopTheWorld
func stopTheWorld(reason stwReason) worldStop

//go:linkname startTheWorld runtime.startTheWorld
func startTheWorld(w worldStop)

func init() {
	go func() {
		t := time.NewTicker(10 * time.Millisecond)
		for range t.C {
			// 4 is this runtime's internal stwReason constant for a
			// goroutine-profile stop-the-world.
			stw := stopTheWorld(4)
			forEachGRace(func(gp *g) {
				if gp.atomicstatus.Load() == 6 {
					// 6 == _Gdead: skip dead goroutines.
					return
				}
				myfpTracebackPCs(unsafe.Pointer(gp.sched.bp), make([]uintptr, 32))
			})
			startTheWorld(stw)
		}
	}()
}

func myfpTracebackPCs(fp unsafe.Pointer, pcBuf []uintptr) (i int) {
	for i = 0; i < len(pcBuf) && fp != nil; i++ {
		// return addr sits one word above the frame pointer
		pcBuf[i] = *(*uintptr)(unsafe.Pointer(uintptr(fp) + PtrSize))
		prevfp := fp
		// follow the frame pointer to the next one
		fp = unsafe.Pointer(*(*uintptr)(fp))
		if uintptr(fp) <= uintptr(prevfp) {
			break
		}
	}
	return i
}

const PtrSize = 4 << (^uintptr(0) >> 63)
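The comparison below was presumably produced with benchstat along these lines (run inside src/encoding/json; exact flags are a guess, count 10 matches the n=10 in the table):

go test -bench . -count 10 > old.txt   # baseline, without the sampler
go test -bench . -count 10 > new.txt   # with the 100 Hz sampling init() compiled in
benchstat old.txt new.txt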

Bench result

benchstat:

goos: linux
goarch: amd64
pkg: encoding/json
cpu: Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
                                    │     old      │                 new                 │
                                    │    sec/op    │    sec/op     vs base               │
CodeEncoder-32                        508.1µ ±  3%   508.0µ ±  2%       ~ (p=0.853 n=10)
CodeEncoderError-32                   547.4µ ±  3%   550.1µ ±  3%       ~ (p=0.853 n=10)
CodeMarshal-32                        587.8µ ±  5%   590.4µ ±  4%       ~ (p=0.853 n=10)
CodeMarshalError-32                   686.8µ ±  3%   689.1µ ±  2%       ~ (p=0.739 n=10)
MarshalBytes/32-32                    249.7n ±  3%   245.1n ±  1%  -1.80% (p=0.011 n=10)
MarshalBytes/256-32                   761.4n ±  2%   754.6n ±  2%       ~ (p=0.247 n=10)
MarshalBytes/4096-32                  7.335µ ±  2%   7.331µ ±  1%       ~ (p=0.869 n=10)
MarshalBytesError/32-32               526.4µ ±  4%   533.2µ ±  3%       ~ (p=0.436 n=10)
MarshalBytesError/256-32              523.1µ ±  8%   521.2µ ±  3%       ~ (p=0.529 n=10)
MarshalBytesError/4096-32             538.1µ ±  3%   536.2µ ±  2%       ~ (p=1.000 n=10)
CodeDecoder-32                        2.491m ±  2%   2.441m ±  3%       ~ (p=0.143 n=10)
UnicodeDecoder-32                     289.5n ±  2%   291.7n ±  2%       ~ (p=0.393 n=10)
DecoderStream-32                      213.8n ±  2%   216.0n ±  2%       ~ (p=0.143 n=10)
CodeUnmarshal-32                      3.051m ±  5%   3.092m ±  3%       ~ (p=0.315 n=10)
CodeUnmarshalReuse-32                 2.690m ±  3%   2.677m ±  3%       ~ (p=0.912 n=10)
UnmarshalString-32                    61.87n ±  1%   62.16n ±  1%       ~ (p=0.342 n=10)
UnmarshalFloat64-32                   57.03n ±  1%   56.88n ±  1%       ~ (p=0.363 n=10)
UnmarshalInt64-32                     53.05n ±  7%   51.38n ±  3%  -3.16% (p=0.043 n=10)
Issue10335-32                         71.35n ±  1%   71.36n ±  1%       ~ (p=0.781 n=10)
Issue34127-32                         31.81n ±  2%   31.17n ±  2%       ~ (p=0.055 n=10)
Unmapped-32                           135.4n ±  2%   135.9n ±  3%       ~ (p=0.753 n=10)
TypeFieldsCache/MissTypes1-32         23.97µ ±  2%   23.81µ ±  2%       ~ (p=0.218 n=10)
TypeFieldsCache/MissTypes10-32        51.72µ ±  1%   51.61µ ±  2%       ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes100-32       262.2µ ±  3%   261.4µ ±  2%       ~ (p=0.912 n=10)
TypeFieldsCache/MissTypes1000-32      1.975m ±  5%   1.968m ±  4%       ~ (p=0.971 n=10)
TypeFieldsCache/MissTypes10000-32     17.94m ±  2%   18.02m ±  1%       ~ (p=0.143 n=10)
TypeFieldsCache/MissTypes100000-32    176.6m ±  3%   173.7m ±  2%       ~ (p=0.280 n=10)
TypeFieldsCache/MissTypes1000000-32    1.935 ± 10%    1.992 ±  9%       ~ (p=0.481 n=10)
TypeFieldsCache/HitTypes1-32          1.704n ±  8%   1.650n ± 14%       ~ (p=0.325 n=10)
TypeFieldsCache/HitTypes10-32         1.639n ±  7%   1.627n ±  2%       ~ (p=0.541 n=10)
TypeFieldsCache/HitTypes100-32        1.658n ±  5%   1.659n ±  9%       ~ (p=0.837 n=10)
TypeFieldsCache/HitTypes1000-32       1.666n ±  5%   1.655n ±  5%       ~ (p=0.592 n=10)
TypeFieldsCache/HitTypes10000-32      1.622n ±  4%   1.631n ±  3%       ~ (p=0.986 n=10)
TypeFieldsCache/HitTypes100000-32     1.649n ±  8%   1.667n ±  8%       ~ (p=0.403 n=10)
TypeFieldsCache/HitTypes1000000-32                   1.627n ±  2%
geomean                               7.194µ         5.647µ        -0.22%

From the benchmark, stopping and starting the world seems to add little overhead to the program.

But a running goroutine has to be suspended and can only resume after the world is restarted, which can be expensive for RPC services.
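One hedged way to quantify that suspension cost directly is to time each stop/start pair in the loop above (same linknamed declarations plus a log import; the 100 µs threshold is arbitrary):

go func() {
	t := time.NewTicker(10 * time.Millisecond)
	for range t.C {
		begin := time.Now()
		stw := stopTheWorld(4)
		forEachGRace(func(gp *g) {
			if gp.atomicstatus.Load() == 6 { // _Gdead
				return
			}
			myfpTracebackPCs(unsafe.Pointer(gp.sched.bp), make([]uintptr, 32))
		})
		startTheWorld(stw)
		// Log how long the world was stopped for this sample.
		if pause := time.Since(begin); pause > 100*time.Microsecond {
			log.Printf("goroutine sample stopped the world for %v", pause)
		}
	}
}()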

@felixge
Owner

felixge commented Jul 1, 2024

I think you are mistakenly assuming that the goroutine profile does a STW while taking the stack trace of all goroutines.

That has not been the case for a while now. See https://go-review.googlesource.com/c/go/+/387415

@zdyj3170101136
Author

https://go-review.googlesource.com/c/go/+/387415

It still always needs a stop-the-world. If we run the goroutine profile once per minute, that is fine,

but running it at 100 Hz is unacceptable.

@felixge
Copy link
Owner

felixge commented Jul 3, 2024

but running it at 100 Hz is unacceptable.

Can you provide an execution trace (go test -trace go.trace -bench .) that shows the unacceptable STW overhead?

I'm also not sure how to read your benchmark. Is the "old" version without goroutine profiling and the "new" version with goroutine profiling? If yes, then it seems like the version with goroutine profiling is actually faster (!).
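(If it helps: the resulting trace can be opened with go tool trace go.trace; stop-the-world ranges should be visible in the trace viewer.)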
