
add ConcurrencyLimit to worker to enable dynamic tuning of concurrencies #1410

Merged (5 commits into cadence-workflow:master, Dec 6, 2024)

Conversation

@shijiesheng (Member) commented Dec 5, 2024

What changed?

[High Risk]

  • replaced buffered channel with resizable semaphore to control task concurrency

[Low Risk]

  • added worker package for modularity
  • added ConcurrencyLimit entity to worker
  • removed unused methods of autoscaler interface

Why?

Needed as a first step to enable dynamic tuning of poller and task concurrencies.
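
To illustrate the core change, here is a minimal sketch only, assuming the marusama/semaphore package that is linked later in this review (import path and exact ConcurrencyLimit wiring are assumptions, not the PR's actual code): a buffered channel fixes its capacity at creation time, while a resizable semaphore keeps the same acquire/release contract but lets the limit be tuned while workers are running.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/marusama/semaphore/v2"
)

func main() {
	// Before: permits := make(chan struct{}, 10) // capacity can never change.
	// After: a semaphore whose limit can be resized at runtime.
	sem := semaphore.New(10)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := sem.Acquire(ctx, 1); err != nil {
		fmt.Println("no permit:", err)
		return
	}
	defer sem.Release(1)

	// Dynamic tuning: shrink (or grow) the limit without restarting workers.
	sem.SetLimit(5)
	fmt.Println("limit:", sem.GetLimit(), "in use:", sem.GetCount())
}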

How did you test it?

Unit Test
[WIP] Canary Test + Bench Test

Potential risks


// AcquireChan creates a PermitChannel that can be used to wait for a permit
// After usage:
// 1. avoid goroutine leak by calling permitChannel.Close()
@Groxx (Member) Dec 5, 2024

I suppose this is also "or cancel the context".
A cancel-helper does ensure waiting and needs fewer temp-vars though, so I kinda like it 👍. It's also easier to ensure it happens because it's special.


It's somewhat more common to return a <-chan, cancel func() tuple instead of an interface though (like context.WithCancel), and a bit more forget-resistant because you're forced to notice that there are two values instead of one value with an unknown number of methods.

I don't feel too strongly, but if ^ that's convincing to you I'd be happy to re-stamp. Changes should be pretty simple.
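
For illustration, a minimal sketch of what the suggested tuple shape could look like; this is hypothetical code, not the PR's implementation, and the resizablePermit/sem names are borrowed from the snippets quoted elsewhere in this review.

package worker

import (
	"context"

	"github.com/marusama/semaphore/v2"
)

// hypothetical shape, mirroring the p.sem field quoted later in this review
type resizablePermit struct {
	sem semaphore.Semaphore
}

// AcquireChan returns a channel that yields at most one permit, plus a cancel
// func that must always be called so the helper goroutine does not leak.
func (p *resizablePermit) AcquireChan(ctx context.Context) (<-chan struct{}, func()) {
	ctx, cancel := context.WithCancel(ctx)
	ch := make(chan struct{}) // unbuffered: a send only succeeds if the caller reads
	go func() {
		if err := p.sem.Acquire(ctx, 1); err != nil {
			return // canceled or timed out before a permit was available
		}
		select {
		case ch <- struct{}{}: // handed over; the caller is now responsible for Release
		case <-ctx.Done(): // caller canceled after acquisition; hand the permit back
			p.sem.Release(1)
		}
	}()
	return ch, cancel
}

// Caller side: the two return values make the cleanup hard to forget.
//
//	ch, cancel := permit.AcquireChan(ctx)
//	defer cancel()
//	select {
//	case <-ch:
//		defer permit.Release()
//		// ... use the permit ...
//	case <-shutdownC:
//	}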

@shijiesheng (Member Author)

Makes sense. I've changed it to use this approach and removed the interface.


// Release release one permit
func (p *resizablePermit) Release() {
p.sem.Release(1)
Member

How does the semaphore behave if Release() is called multiple times? Does it increase the capacity?

@Groxx (Member) Dec 5, 2024

It'll (essentially) increase capacity, it's not a sync.Once or equivalent that can guarantee at-most-once-per-acquire.

When it exceeds the limit by counting negative use (more releases than acquires, possibly/probably at some later time), it panics: https://github.com/marusama/semaphore/blob/master/semaphore.go#L170

@Groxx (Member) Dec 5, 2024

there's a bit of a fundamental tradeoff between "ignore misuse, allow multiple calls" and "require exactly correct use, but it might be hard to find the cause".

tbh personally I prefer "require exactly correct use" in most cases, because the alternative might be releasing too early.
plus it's easy to convert "exactly correct" to "ignore misuse" with a sync.Once.Do(cancel) wrapper, e.g. for stuff like https://github.com/cadence-workflow/cadence/blob/master/common/clock/ratelimiter.go#L354 where it's convenient to use both defer and early-release to guarantee it happens in a func, and not have to worry about the many combinations possible.
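
A small sketch of that wrapper idea (illustrative only; permitLike is a hypothetical stand-in whose method names come from the snippets quoted in this review):

package worker

import (
	"context"
	"sync"
)

// permitLike is a hypothetical stand-in for the permit used in this PR.
type permitLike interface {
	Acquire(ctx context.Context) error
	Release()
}

// releaseOnce turns "require exactly correct use" into "ignore misuse":
// the returned func releases at most once, so defer and early release can coexist.
func releaseOnce(p permitLike) func() {
	var once sync.Once
	return func() { once.Do(p.Release) }
}

func doWork(ctx context.Context, p permitLike) error {
	if err := p.Acquire(ctx); err != nil {
		return err
	}
	release := releaseOnce(p)
	defer release() // safety net for early returns

	// ... work that needs the permit ...

	release() // explicit early release; the deferred call becomes a no-op
	// ... work that no longer needs the permit ...
	return nil
}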

@shijiesheng (Member Author)

Correct. But due to the complexity of our poller task processing logic, we have to pass the issued permit from the poller goroutine to the task goroutine, which further complicates "require exactly correct use":

task permit acquire -> poll goroutine finish -> pass to task processing goroutine -> release task permit when task goroutine finish
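
A rough sketch of that hand-off, continuing the hypothetical permitLike from the sketch above; the names, channels, and stub functions here are illustrative, not the client's actual poller code.

// polledTask carries the permit's release func along with the task payload,
// so ownership moves from the poller goroutine to the task goroutine.
type polledTask struct {
	payload interface{}
	release func() // must be called exactly once when processing finishes
}

func pollLoop(ctx context.Context, p permitLike, taskC chan<- polledTask) {
	for ctx.Err() == nil {
		if err := p.Acquire(ctx); err != nil {
			return // shutting down
		}
		task, err := poll(ctx) // hypothetical poll call
		if err != nil {
			p.Release() // nothing to hand off, release immediately
			continue
		}
		taskC <- polledTask{payload: task, release: p.Release}
	}
}

func taskLoop(taskC <-chan polledTask) {
	for t := range taskC {
		processTask(t.payload) // hypothetical processing
		t.release()            // release only after the task goroutine is done
	}
}

func poll(ctx context.Context) (interface{}, error) { return "task", nil } // stub for the sketch
func processTask(task interface{})                  {}                     // stub for the sketch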

Member

yea, there are a few unfortunately-complicated delayed-release things in the client :\ I'd love to get rid of as many as possible, or make them much clearer if not possible.

Comment on lines 69 to 72
// AcquireChan creates a PermitChannel that can be used to wait for a permit
// After usage:
// 1. avoid goroutine leak by calling permitChannel.Close()
// 2. release permit by calling permit.Release()
Member

Suggested change
// AcquireChan creates a PermitChannel that can be used to wait for a permit
// After usage:
// 1. avoid goroutine leak by calling permitChannel.Close()
// 2. release permit by calling permit.Release()
// AcquireChan creates a PermitChannel that can be used to wait for a single permit
// After usage:
// 1. avoid goroutine leak by calling permitChannel.Close()
// 2. if the read succeeded, release permit by calling permit.Release()

^ single-space indentation is so it's a list in godoc: https://tip.golang.org/doc/comment#lists (unordered lists use 2 spaces, dunno why it's different)

@shijiesheng (Member Author)

Addressed.

maxTestDuration: 250 * time.Millisecond, // at least need 100ms * 1000 / 200 = 500ms to acquire all permit
capacity: []int{200},
goroutines: 1000,
expectFailuresRange: []int{1, 999}, // should at least pass some acquires
Member

probably means we can make the bounds more precise, like 400,600? Loose bounds make sense to avoid noise, but it's targeting 500 and tends to be fairly close.

or maybe just a "should be ~500" comment.

@shijiesheng (Member Author)

Makes sense. I've loosened the bound.

Comment on lines 73 to 76
maxTestDuration: 250 * time.Millisecond,
capacity: []int{600, 400, 200},
goroutines: 1000,
expectFailuresRange: []int{1, 999},
@Groxx (Member) Dec 5, 2024

should expect...

  • 0ms: 1k enter, 600 accepted, 400 waiting
  • 50ms: resized down to 400, 400 still waiting because none have released
  • 100ms: resized down to 200, some of the original 600 begin releasing (100-150ms)
  • 150ms: all 600 released, 200 additional ones acquired (waiting until at least 200ms more before any release, from earliest at 100ms)
  • 200ms: earliest 100+100ms start releasing (upper bound is 800+200=1000, if no jitter, and the last 200 are still holding when it times out)
  • 250ms: times out, any with >100+100+(50ms total rand) time out and fail

so... expecting 800-1000 success? ignoring jitter.

@shijiesheng (Member Author)

I'll just say at most 500 failures. Jitter is making things more complicated

"github.com/stretchr/testify/assert"
"go.uber.org/atomic"
)

Member

the simulation's great but it does hide any bugs that'd occur from using only one or the other mode, and that's how this is going to be used AFAICT.

seems worth some very simple tests for "can get" + "can fail from timeout/cancel", but I believe they'll all pass.

@shijiesheng (Member Author)

Makes sense. Added unit tests.

@Groxx (Member) left a comment

Tests are failing (extra release) and there are minor comments (mostly on comments), but 👍 looks good AFAICT; let me know if you tackle the other tests / change the return format.

@Groxx self-requested a review, December 5, 2024 22:40
@@ -190,9 +193,9 @@ func Test_pollerAutoscaler(t *testing.T) {
go func() {
defer wg.Done()
for pollResult := range pollChan {
pollerScaler.Acquire(1)
pollerScaler.permit.Acquire(context.Background())
Member

Acquire returns an error which should be handled; otherwise Release might be called without a successful acquire, which leads to a panic.
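
For example, the test loop quoted above could handle the error along these lines; this is a fragment sketching the shape only, and the actual fix in the PR may differ.

for pollResult := range pollChan {
	// Handle the Acquire error instead of assuming success; otherwise a later
	// Release without a matching successful Acquire can panic the semaphore.
	if err := pollerScaler.permit.Acquire(context.Background()); err != nil {
		t.Errorf("failed to acquire permit: %v", err)
		continue
	}
	// ... existing handling of pollResult ...
	pollerScaler.permit.Release()
}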

@shijiesheng (Member Author)

Good catch. I have been searching for the issue for hours.

codecov bot commented Dec 6, 2024

Codecov Report

Attention: Patch coverage is 98.55072% with 1 line in your changes missing coverage. Please review.

Project coverage is 82.58%. Comparing base (641e4a7) to head (6342ff8).
Report is 1 commit behind head on master.

Files with missing lines             Patch %    Lines
internal/internal_worker_base.go     95.45%     1 Missing ⚠️

Files with missing lines                  Coverage             Δ
internal/internal_poller_autoscaler.go    92.22% <100.00%>     (-0.49%) ⬇️
internal/worker/resizable_permit.go       100.00% <100.00%>    (ø)
internal/internal_worker_base.go          82.62% <95.45%>      (-0.79%) ⬇️


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 641e4a7...6342ff8.

@shijiesheng merged commit 9ffbb1f into cadence-workflow:master on Dec 6, 2024
10 checks passed
Comment on lines +159 to +176
t.Run("acquire timeout", func(t *testing.T) {
permit := NewResizablePermit(1)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
defer cancel()
time.Sleep(100 * time.Millisecond)
err := permit.Acquire(ctx)
assert.ErrorContains(t, err, "context deadline exceeded")
assert.Empty(t, permit.Count())
})

t.Run("cancel acquire", func(t *testing.T) {
permit := NewResizablePermit(1)
ctx, cancel := context.WithCancel(context.Background())
cancel()
err := permit.Acquire(ctx)
assert.ErrorContains(t, err, "canceled")
assert.Empty(t, permit.Count())
})
Member

these two tests are the same fwiw - you probably want to have one that does the following (a rough sketch appears after this list):

  • acquire (use up the whole semaphore)
  • acquire again (blocks until timeout)
  • make sure it didn't return immediately (elapsed time > like 5ms)
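
A rough sketch of such a test; the NewResizablePermit / Acquire / Count names come from the snippets quoted above, and the durations are illustrative rather than taken from the PR.

t.Run("acquire blocks until timeout", func(t *testing.T) {
	permit := NewResizablePermit(1)
	assert.NoError(t, permit.Acquire(context.Background())) // use up the whole semaphore

	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	start := time.Now()
	err := permit.Acquire(ctx) // should block until the deadline
	elapsed := time.Since(start)

	assert.ErrorContains(t, err, "context deadline exceeded")
	assert.True(t, elapsed >= 5*time.Millisecond, "acquire should not return immediately, took %v", elapsed)
	assert.EqualValues(t, 1, permit.Count(), "only the first permit should still be held")
})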

permit := NewResizablePermit(1)
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
defer cancel()
time.Sleep(100 * time.Millisecond)
Member

same as above, this makes it identical to the cancel case below (the chan is closed before it starts)

@Groxx (Member) commented Dec 7, 2024

Minor test gap, but 👍 looks good.
Did you find out where the earlier double-release failure was coming from, or was that maybe just a flaky test?
