add ConcurrencyLimit to worker to enable dynamic tuning of concurrencies #1410
Conversation
Force-pushed from 13fd5cf to c5f2d53.
internal/worker/resizable_permit.go (Outdated)
// AcquireChan creates a PermitChannel that can be used to wait for a permit
// After usage:
// 1. avoid goroutine leak by calling permitChannel.Close()
I suppose this is also "or cancel the context".
A cancel-helper does ensure waiting + needs fewer temp-vars though, so I kinda like it 👍. Also easier to ensure it happens, because it's special.
It's somewhat more common to return a <-chan, cancel func() tuple instead of an interface though (like context.WithCancel), and it's a bit more forget-resistant because you're forced to notice that there are two values, instead of one value with an unknown number of methods.
I don't feel too strongly, but if ^ that's convincing to you I'd be happy to re-stamp. Changes should be pretty simple.
Makes sense. I've changed it to use this approach and removed the interface.
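For reference, a minimal sketch of that tuple-returning shape, assuming the marusama/semaphore v2 API (Acquire(ctx, n) error, Release(n) int); names are illustrative, not necessarily the PR's final code:

// AcquireChan returns a channel that delivers at most one permit, plus a
// cancel func that must be called to stop waiting and avoid a goroutine leak.
func (p *resizablePermit) AcquireChan(ctx context.Context) (<-chan struct{}, func()) {
	ctx, cancel := context.WithCancel(ctx)
	ch := make(chan struct{})
	go func() {
		if err := p.sem.Acquire(ctx, 1); err != nil {
			return // canceled before a permit was granted
		}
		select {
		case ch <- struct{}{}: // reader took the permit and must Release() it later
		case <-ctx.Done():
			p.sem.Release(1) // nobody read it: hand the permit back
		}
	}()
	return ch, cancel
}

Callers then read the channel under a select, defer the cancel, and Release() only if the read succeeded.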
// Release releases one permit
func (p *resizablePermit) Release() {
	p.sem.Release(1)
How does the semaphore behave if Release() is called multiple times? Does it increase the capacity?
It'll (essentially) increase capacity; it's not a sync.Once or equivalent that can guarantee at-most-once-per-acquire.
When releases exceed acquires (the count would go negative, possibly/probably at some later time), it panics: https://github.com/marusama/semaphore/blob/master/semaphore.go#L170
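A tiny illustration of that failure mode (untested sketch; assumes the github.com/marusama/semaphore/v2 module path used by this permit):

package main

import (
	"context"

	"github.com/marusama/semaphore/v2"
)

func main() {
	sem := semaphore.New(2)                  // limit 2, count 0
	_ = sem.Acquire(context.Background(), 1) // count 1
	sem.Release(1)                           // back to 0: fine
	sem.Release(1)                           // releases now exceed acquires: panics (the linked line)
}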
There's a bit of a fundamental tradeoff between "ignore misuse, allow multiple calls" and "require exactly correct use, but it might be hard to find the cause".
Tbh personally I prefer "require exactly correct use" in most cases, because the alternative might be releasing too early.
Plus it's easy to convert "exactly correct" to "ignore misuse" with a sync.Once.Do(cancel) wrapper, e.g. for stuff like https://github.com/cadence-workflow/cadence/blob/master/common/clock/ratelimiter.go#L354 where it's convenient to use both defer and early-release to guarantee it happens in a func, and not have to worry about the many possible combinations.
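Such a wrapper might look like this (hypothetical helper, not from the PR):

import "sync"

// releaseOnce converts an exactly-once Release into a forget-tolerant one,
// safe to call from both a defer and an early-release fast path.
func releaseOnce(release func()) func() {
	var once sync.Once
	return func() { once.Do(release) }
}

// usage:
//	release := releaseOnce(permit.Release)
//	defer release() // guaranteed cleanup
//	...
//	release()       // optional early release; the deferred call becomes a no-op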
Correct. But due to the complexity of our poller task processing logic, we have to pass the issued permit from the poller goroutine to the task goroutine, which further complicates things with "require exactly correct use".
The flow is: acquire task permit -> poll goroutine finishes -> pass to task processing goroutine -> release task permit when the task goroutine finishes (see the sketch below).
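A hypothetical sketch of that handoff (pollForTask and process stand in for the real poller logic; the Permit shape matches the Acquire/Release calls seen elsewhere in this PR):

// Permit matches the acquire/release surface used in this PR.
type Permit interface {
	Acquire(ctx context.Context) error
	Release()
}

// pollLoop acquires a permit, holds it across the poll, then hands ownership
// to the task goroutine, which releases it when processing finishes.
func pollLoop(ctx context.Context, permit Permit, taskCh chan<- func()) {
	for ctx.Err() == nil {
		if err := permit.Acquire(ctx); err != nil {
			return // shutting down: nothing acquired, nothing to release
		}
		task := pollForTask()
		taskCh <- func() {
			defer permit.Release() // released only when the task goroutine finishes
			process(task)
		}
	}
}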
yea, there are a few unfortunately-complicated delayed-release things in the client :\ I'd love to get rid of as many as possible, or make them much clearer if not possible.
internal/worker/resizable_permit.go (Outdated)
// AcquireChan creates a PermitChannel that can be used to wait for a permit
// After usage:
// 1. avoid goroutine leak by calling permitChannel.Close()
// 2. release permit by calling permit.Release()
Suggested change:
- // AcquireChan creates a PermitChannel that can be used to wait for a permit
- // After usage:
- // 1. avoid goroutine leak by calling permitChannel.Close()
- // 2. release permit by calling permit.Release()
+ // AcquireChan creates a PermitChannel that can be used to wait for a single permit
+ // After usage:
+ //  1. avoid goroutine leak by calling permitChannel.Close()
+ //  2. if the read succeeded, release permit by calling permit.Release()
^ single-space indentation is so it's a list in godoc: https://tip.golang.org/doc/comment#lists (unordered lists use 2 spaces, dunno why it's different)
Addressed.
maxTestDuration: 250 * time.Millisecond, // need at least 100ms * 1000 / 200 = 500ms to acquire all permits
capacity: []int{200},
goroutines: 1000,
expectFailuresRange: []int{1, 999}, // should at least pass some acquires
Probably means we can make the bounds more precise, like 400,600? Loose bounds make sense to avoid noise, but it's targeting 500 and tends to be fairly close.
Or maybe just a "should be ~500" comment.
Makes sense. I've loosened the bound.
maxTestDuration: 250 * time.Millisecond,
capacity: []int{600, 400, 200},
goroutines: 1000,
expectFailuresRange: []int{1, 999},
Should expect...
- 0ms: 1k enter, 600 accepted, 400 waiting
- 50ms: resized down to 400, 400 still waiting because none have released
- 100ms: resized down to 200, some of the original 600 begin releasing (100-150ms)
- 150ms: all 600 released, 200 additional ones acquired (waiting until at least 200ms more before any release, from earliest at 100ms)
- 200ms: earliest 100+100ms start releasing (upper bound is 800+200=1000, if no jitter, and the last 200 are still holding when it times out)
- 250ms: times out, any with >100+100+(50ms total rand) time out and fail
So... expecting 800-1000 successes, ignoring jitter?
I'll just say at most 500 failures; jitter is making things more complicated.
"github.com/stretchr/testify/assert" | ||
"go.uber.org/atomic" | ||
) | ||
|
The simulation's great, but it does hide any bugs that'd occur from using only one or the other mode, and that's how this is going to be used AFAICT.
Seems worth some very simple tests for "can get" + "can fail from timeout/cancel", but I believe they'll all pass.
Makes sense. Added unit tests.
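Such basic tests might look like this (sketch, using the NewResizablePermit API from this PR):

func TestPermitBasic(t *testing.T) {
	t.Run("can acquire and release", func(t *testing.T) {
		permit := NewResizablePermit(1)
		assert.NoError(t, permit.Acquire(context.Background()))
		permit.Release()
		assert.Empty(t, permit.Count())
	})
	t.Run("fails on canceled context", func(t *testing.T) {
		permit := NewResizablePermit(1)
		ctx, cancel := context.WithCancel(context.Background())
		cancel()
		assert.Error(t, permit.Acquire(ctx))
	})
}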
Tests are failing (extra release), and I left minor comments (mostly on comments), but 👍 looks good AFAICT. Let me know if you tackle other tests / change the return format.
@@ -190,9 +193,9 @@ func Test_pollerAutoscaler(t *testing.T) {
	go func() {
		defer wg.Done()
		for pollResult := range pollChan {
-			pollerScaler.Acquire(1)
+			pollerScaler.permit.Acquire(context.Background())
Acquire returns an error which should be handled; otherwise Release might be called without the acquire having succeeded, which leads to a panic.
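I.e., the error needs to gate the later Release, something like (sketch; handle is a stand-in for the existing result handling):

for pollResult := range pollChan {
	if err := pollerScaler.permit.Acquire(context.Background()); err != nil {
		continue // acquire failed: skip processing, and skip the paired Release
	}
	handle(pollResult)            // stand-in for the existing handling
	pollerScaler.permit.Release() // paired with the successful Acquire above
}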
Good catch. I have been searching for the issue for hours.
t.Run("acquire timeout", func(t *testing.T) { | ||
permit := NewResizablePermit(1) | ||
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond) | ||
defer cancel() | ||
time.Sleep(100 * time.Millisecond) | ||
err := permit.Acquire(ctx) | ||
assert.ErrorContains(t, err, "context deadline exceeded") | ||
assert.Empty(t, permit.Count()) | ||
}) | ||
|
||
t.Run("cancel acquire", func(t *testing.T) { | ||
permit := NewResizablePermit(1) | ||
ctx, cancel := context.WithCancel(context.Background()) | ||
cancel() | ||
err := permit.Acquire(ctx) | ||
assert.ErrorContains(t, err, "canceled") | ||
assert.Empty(t, permit.Count()) | ||
}) |
These two tests are the same FWIW - you probably want to have one that does the following (see the sketch after this list):
- acquire (use up the whole semaphore)
- acquire again (blocks until timeout)
- make sure it didn't return immediately (elapsed time > ~5ms)
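A sketch of that shape:

t.Run("acquire blocks then times out", func(t *testing.T) {
	permit := NewResizablePermit(1)
	assert.NoError(t, permit.Acquire(context.Background())) // use up the whole semaphore

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	start := time.Now()
	err := permit.Acquire(ctx) // should block until the deadline
	assert.ErrorContains(t, err, "context deadline exceeded")
	assert.Greater(t, time.Since(start), 5*time.Millisecond) // did not return immediately
})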
permit := NewResizablePermit(1)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
defer cancel()
time.Sleep(100 * time.Millisecond)
Same as above - this makes it identical to the cancel case below (the chan is closed before it starts).
Minor test gap, but 👍 looks good.
What changed?
[High Risk]
[Low Risk]
Why?
Needed as a first step to enable dynamic tuning of poller and task concurrencies.
How did you test it?
Unit Test
[WIP] Canary Test + Bench Test
Potential risks