ChildWorkflowFuture.Get hangs forever (and leaks a goroutine) if parent workflow times out #1407
Excellent find, thank you! We've been attempting to figure out a(?) leak for a while, but all the code it has been hitting internally has been incredibly complicated and nearly impossible to reproduce. I'll see what I can find here ASAP. Child workflow state handling is one of the more complicated and buggier sections of the core logic, so this seems plausible, and it's by far the best lead I've seen. Hopefully not too many distractions before I get to it :)
Ah. This one is basically known, and kinda sorta "intended". What's happening is:
The activity and child workflow are ending "normally", i.e. their functions are returning and they are returning values, so their call stacks / program counters / etc. have gone away naturally. The parent workflow, however, is timing out and ending server-side, which stops everything... so nothing ever actually communicates back to the worker that it's "done", and it never finishes. This workflow is still cached "normally" though, and when the "sticky workflow cache" runs out of space it'll evict in LRU order (see `internal/internal_workflow.go` lines 851 to 858 at `9ffbb1f`), which executes the eviction here (`internal/internal_workflow.go` lines 800 to 801 at `9ffbb1f`), and that'll tear down all of the workflow's goroutines (in this case only one) by calling `runtime.Goexit()`.
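For anyone unfamiliar with that mechanism, here is a minimal, self-contained illustration (plain Go, not cadence code) of what `runtime.Goexit` does: it runs the calling goroutine's deferred calls and then terminates just that goroutine, without returning to the caller or touching other goroutines.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()                                  // deferred calls still run
		defer fmt.Println("deferred cleanup still runs") // so cached state can be released
		runtime.Goexit()                                 // unwinds and ends this goroutine here
		fmt.Println("never reached")
	}()
	wg.Wait()
	fmt.Println("other goroutines are unaffected")
}
```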
You can simulate all this much more easily by setting a sticky cache size of like 2 (iirc it kinda sorta behaves like "N-1", so I just avoid 1), and using a single workflow like this:

```go
func work(ctx workflow.Context) error {
	workflow.GetSignalChannel(ctx, "ignored").Receive(ctx, nil) // and never send a signal
	return nil
}
```

and letting it time out. It only needs to be a couple seconds. Regardless of how many you start and with what timing, you shouldn't see the stuck goroutines exceed the cache size.

You can also (and probably usually want to) simply avoid this, by making sure your workflows end normally before their hard cutoff time. In this case that'd mean keeping your child timeout shorter than your remaining parent time. This is also necessary for having the parent record the child's end state (if that's useful, e.g. if you want to do something with it or report it to parent-workflow observers).

So both good and bad news, and both unavoidable details and things we can / should obviously improve:

- Good news: this doesn't actually block forever, just until your worker has handled tasks from (by default) 10,000 other workflows.
- Bad news: yes, that's often quite a while later. More than long enough for it to be confusing and/or cause problems, e.g. if each one holds a lot of memory.
- Unavoidable: even if we did schedule a decision task to "clean up", it wouldn't be guaranteed to clean up, because the worker may have lost its claim to the cache. This can happen if the worker is too slow to pick up the "next" task -> the server gives up on the sticky-cache attempt and sends it to any random worker -> now there's an "old" goroutine on the old worker that nothing remembers, and a "new" one somewhere else. We can't really "try to send cleanup tasks to every worker ever, forever", so at some point it has to give up and stuff can be left dangling.
- Could improve:
Sad news: unfortunately this wasn't the leak I was hoping for. We appear to have a leak somewhere else that is truly leaking, i.e. it can lead to more top-level goroutines hanging around than the sticky cache size allows. AFAICT this isn't related though.
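As a concrete aside on the simulation above: shrinking the sticky cache is a one-liner, assuming the `worker.SetStickyWorkflowCacheSize` knob from `go.uber.org/cadence/worker` (a sketch; it needs to run before any workers are started):

```go
package main

import "go.uber.org/cadence/worker"

func main() {
	// Shrink the sticky workflow cache so LRU evictions (and the resulting
	// goroutine teardown via runtime.Goexit) happen after only a couple of
	// timed-out workflows, instead of the default 10,000.
	worker.SetStickyWorkflowCacheSize(2)

	// ... build the service client and start your worker(s) as usual ...
}
```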
Thanks for the thorough explanation! The conclusion is a bit of a letdown though, I must say. We have some quite long-running parent workflows (they are normally limited to around a week) that we would like to guarantee cleanup for, and it would be ideal if that could happen in the workflow code itself - I was hoping to catch a `TimeoutError` in the parent workflow and react to it there.

I somehow feel like there is a lack of symmetry in the API semantics. "Everything" (caller, child workflow, activity) behaves as one would intuitively expect, except for the parent workflow, which just hangs...

So I'm wondering what we can do to work around this issue. From your reasoning it appears that the problem is that the parent workflow times out only on the Cadence server, and the worker never hears from the server again. So can we somehow force server communication to wake up the timed-out workflow? What about a timer, something like this:

```go
future := workflow.ExecuteChildWorkflow(childCtx, wf.childWorkflow.Run)

var wfErr error
sel := workflow.NewSelector(wfCtx)
sel.AddFuture(future, func(future workflow.Future) {
	// Workflow ended normally ...
})
timeout := time.Duration(workflow.GetInfo(wfCtx).ExecutionStartToCloseTimeoutSeconds)*time.Second - 1*time.Second
sel.AddFuture(workflow.NewTimer(wfCtx, timeout), func(f workflow.Future) {
	// Workflow timed out
	// ... handle cleanup with a detached context ...
})
sel.Select(wfCtx)
```

Or is that still not guaranteed to work? Could Cadence being under pressure, or the worker being late to pick up the timer result, still lead to everything timing out only at the server? Is there some prior art in this area? Perhaps even an idiomatic way of handling cleanup for a timed-out long-running workflow? (Ideally without resorting to some janitor goroutine that needs to query Cadence for workflows.)
Will attempt to add my 2 cents after reading this (quite fascinating) issue. Would it make sense to borrow the "Heartbeat" semantics for child workflows as well, in the context of the worker cache? I haven't looked at the code, but... could the worker associate a cache entry with a heartbeat for child workflows? Sorry in advance if this was a brain fart.
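(For context, the activity "Heartbeat" semantics being borrowed here look roughly like the sketch below, using the standard `activity.RecordHeartbeat` call; `processItem` is a hypothetical unit of work. If heartbeats stop arriving, the server times the activity out rather than waiting indefinitely - the suggestion above is to give cache entries a similar liveness signal.)

```go
func myActivity(ctx context.Context, items []string) error {
	for i, item := range items {
		if err := processItem(ctx, item); err != nil { // hypothetical unit of work
			return err
		}
		// Tell the server this activity is still alive (and record progress),
		// so a stalled worker is detected via a heartbeat timeout.
		activity.RecordHeartbeat(ctx, i)
	}
	return nil
}
```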
You can, you just can't do anything after your execution timeout - it's your hard cutoff, not a cooperative cancel / canceled context. Otherwise we wouldn't have a hard cutoff and everything could run for infinite time. I think we might be missing a `WithDeadline`-style convenience for this, but you can build the equivalent yourself:

```go
ctx, cancel := workflow.WithCancel(ctx)
workflow.Go(ctx, func(ctx workflow.Context) {
	workflow.Sleep(ctx, timeout) // timeout = your "soft" deadline
	cancel()
	// ignoring ^ sleep's err return is fine, it'll only err
	// if the ctx is canceled, and this is just canceling the ctx.
	// either way it's canceled at the right time or does nothing.
})
// this is essentially the same cost as a WithDeadline would be, it's totally fine
```

Basically, just set up your own "soft" deadline by canceling the context yourself, early enough that there's still time left before the hard cutoff.

When doing this though, I will broadly caution/remind that cancellation is cooperative. Missing a `ctx.Done()` case means nothing unblocks when you cancel:
```go
// normal go
func thing(ctx context.Context, doSomething, etc <-chan struct{}) {
	select {
	case <-doSomething:
	case <-etc:
	}
	// no `case <-ctx.Done()` so this does not unblock when canceled,
	// and might never unblock if other channels don't receive anything
}
```
```go
// cadence
func thing(ctx workflow.Context) {
	// cancel ctx after 3h (e.g. via the timer goroutine above)
	workflow.NewSelector(ctx).
		AddReceive(workflow.GetSignalChannel(ctx, "something"), func(c workflow.Channel, more bool) { /* ... */ }).
		AddReceive(workflow.GetSignalChannel(ctx, "etc"), func(c workflow.Channel, more bool) { /* ... */ }).
		Select(ctx)
	// same thing, nothing about ^ that says to unblock when canceled,
	// so it does not unblock, and probably locks up until ExecutionTimeout
}
```

This is a bit of an awkward corner with Go, because we need these selects to be explicitly cooperative: nothing unblocks on cancellation unless you select on it.
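For contrast, a hedged sketch of the cancellation-aware version of that cadence selector: adding `ctx.Done()` as a receive case is what makes it unblock when the context is canceled (the cleanup step is illustrative, not a prescribed pattern):

```go
func thing(ctx workflow.Context) error {
	canceled := false
	s := workflow.NewSelector(ctx)
	s.AddReceive(workflow.GetSignalChannel(ctx, "something"), func(c workflow.Channel, more bool) {
		c.Receive(ctx, nil) // drain and handle the signal as needed
	})
	s.AddReceive(ctx.Done(), func(c workflow.Channel, more bool) {
		canceled = true // ctx was canceled, e.g. by the soft-deadline goroutine above
	})
	s.Select(ctx)

	if canceled {
		// unblocked by cancellation: run cleanup here, while there is still
		// time left before the server-side ExecutionTimeout
		return ctx.Err()
	}
	return nil
}
```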
As long as it's in the workflow, no, it'll resume just like any other workflow code :) Hence why you can't use the hard cutoff / ExecutionTimeout for it. It'd be unreliable.
Not with ExecutionTimeout, but internally yes. Like that timer + select.
We can't guarantee there isn't some many-hours backlog or multi-day service outage (yours or Cadence's), or some kind of deadlock in your code, so in that sense no. That's why ExecutionTimeout exists: it stops work no matter what. If you need time for cleanup, you need to do it within that time, or expand the ExecutionTimeout enough to ensure you have the time you need to make that stuff happen.

As far as the implied "this is confusing, can it be improved" question... yes, probably :) Both a "soft timeout" and a "hard timeout" are useful and fairly common across languages (Go context deadlines and Java thread interrupts, vs. a timer for total process death), and it makes sense to streamline this. In Go that would probably just cancel your context at [a time], because that's the normal "stop work" signal for Go. Client-side-only that's not too hard to build, e.g. a workflow interceptor could ensure you have an hour to clean up on everything with the ^ sample above, and that'll work today (with versioning). I think it might make more sense to coordinate on a server-side equivalent, though.
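To make the "reserve time for cleanup" idea concrete, here is a hedged sketch combining the pieces above (not an official pattern): cancel the context an hour before the execution timeout, then run cleanup on a disconnected context so the cleanup itself isn't aborted by that cancellation. `doWork` and `cleanup` are hypothetical; the sketch assumes `workflow.NewDisconnectedContext` and the `ExecutionStartToCloseTimeoutSeconds` field shown earlier.

```go
func parent(ctx workflow.Context) error {
	// soft deadline: the total execution timeout minus an hour reserved for cleanup
	soft := time.Duration(workflow.GetInfo(ctx).ExecutionStartToCloseTimeoutSeconds)*time.Second - time.Hour

	ctx, cancel := workflow.WithCancel(ctx)
	workflow.Go(ctx, func(ctx workflow.Context) {
		_ = workflow.Sleep(ctx, soft) // an error here only means ctx was already canceled
		cancel()
	})

	workErr := doWork(ctx) // hypothetical main logic; it must respect cancellation

	if ctx.Err() != nil {
		// ctx was canceled by the soft deadline; cleanup needs a context that
		// is NOT canceled, so detach from the canceled one before running it
		cleanupCtx, _ := workflow.NewDisconnectedContext(ctx)
		if err := cleanup(cleanupCtx); err != nil { // hypothetical cleanup step
			return err
		}
	}
	return workErr
}
```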
Describe the bug
If a workflow times out while calling `Get` on a child workflow future (as illustrated below), the `Get` call never returns and a goroutine is leaked. Notably, all other involved "entities" behave the way (I think) they should:

- The caller (which started `MyWorkflow` above) correctly sees a `TimeoutType: START_TO_CLOSE`.
- The child workflow (`childWorkflow` above) correctly sees a `CanceledError`.
- The activity correctly sees a `context canceled`.
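The attached sample contains the actual code; as a stand-in for the parts referenced above, here is a minimal hedged sketch of the parent/child shape that exhibits the hang (names and timeouts are illustrative, not the sample's):

```go
func MyWorkflow(ctx workflow.Context) error {
	cwo := workflow.ChildWorkflowOptions{
		// deliberately longer than the parent's remaining time, so the parent
		// hits its own start-to-close timeout while still waiting on the child
		ExecutionStartToCloseTimeout: 10 * time.Minute,
	}
	childCtx := workflow.WithChildOptions(ctx, cwo)

	// The parent times out server-side while blocked here; on the worker this
	// Get never returns, and the goroutine stays parked until the workflow is
	// eventually evicted from the sticky cache.
	var result string
	return workflow.ExecuteChildWorkflow(childCtx, childWorkflow).Get(ctx, &result)
}

func childWorkflow(ctx workflow.Context) (string, error) {
	// ... long-running work, e.g. an activity that outlives the parent ...
	return "done", nil
}
```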
Steps to reproduce the behavior:
Find a sample application attached (sample.tar.gz) that can be used to reproduce and illustrate the issue. Assuming that a Cadence cluster is serving a `cadence-frontend` on `localhost:7833`, the following steps reproduce the issue: run the sample's worker, then start the parent workflow with a 60s start-to-close timeout.

After one minute the client will fail (as expected). Looking at the worker output, we will see that the child workflow failed (as expected), and similarly we will see that the activity timed out (as expected).

But, and here is the big BUT, looking at the debug output from pprof on http://localhost:6060/debug/pprof/goroutine?debug=1, we will see that the parent workflow still hangs on the `Get` call. That call never returns, which I think violates the intended semantics (in the documentation) and results in a resource leak.
Expected behavior
I would expect the `Get` call made by the (timed-out) parent workflow to eventually return with a `TimeoutError`.
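In other words, roughly this (a hedged sketch of the desired behavior; it assumes the `TimeoutError` type exposed by the `go.uber.org/cadence` package, and today the `Get` below simply never returns in this scenario):

```go
var result string
err := workflow.ExecuteChildWorkflow(childCtx, childWorkflow).Get(ctx, &result)

var timeoutErr *cadence.TimeoutError
if errors.As(err, &timeoutErr) {
	// desired: the timed-out parent's Get returns promptly with a TimeoutError,
	// so the workflow goroutine can unwind instead of leaking
}
```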