Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock when closing pipe #324

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

mat007
Copy link

@mat007 mat007 commented Dec 12, 2024

Hi!

This PR fixes a deadlock that we face every once in a while.

Thanks!

@mat007 mat007 requested a review from a team as a code owner December 12, 2024 16:46
@mat007 mat007 changed the title Fix race when closing pipe Fix deadlock when closing pipe Dec 12, 2024
@mat007 mat007 force-pushed the close-race branch 2 times, most recently from 90fdf20 to 70746b4 Compare December 13, 2024 06:39
@thaJeztah
Copy link
Contributor

@mat007 I think this PR fixes #257, correct? If so, can you add a fixes xxx to the PR description (not the commit message, as that tends to get noisy 😂)

@mat007
Copy link
Author

mat007 commented Dec 13, 2024

I think this PR fixes #257, correct?

I don’t know if it fixes that one, I was investigating something else. I’m really not sure it’s related. 🤔

@thaJeztah
Copy link
Contributor

Oh, sorry; someone pointed me to that ticket as the reason for maintaining a fork, so I thought this PR was related to that

@kevpar
Copy link
Member

kevpar commented Dec 18, 2024

@mat007 can you explain what the deadlock is that you're seeing?

@mat007
Copy link
Author

mat007 commented Dec 19, 2024

@mat007 can you explain what the deadlock is that you're seeing?

Sure!

The added test without the fix never terminates. Breaking under a debugger we see that the goroutine that called Close is blocked on

<-l.doneCh
while the ListenPipe one is waiting on

go-winio/pipe.go

Lines 462 to 463 in bdc6c11

select {
case <-l.closeCh:

The problem is that we have 2 readers for l.closeCh, with the other one being

case <-l.closeCh:
but we’re signalling it by writing a value to the channel, which can only be read by one reader.
This PR fixes this by closing the channel instead of writing to it.

@mat007
Copy link
Author

mat007 commented Jan 6, 2025

@kevpar what do you think of my explanation? Do you need more info?
Thanks!

@kevpar
Copy link
Member

kevpar commented Jan 6, 2025

@kevpar what do you think of my explanation? Do you need more info? Thanks!

Sorry, I just got back from holiday. I will take a look today and see if we can get this in!

@kevpar
Copy link
Member

kevpar commented Jan 10, 2025

I've been doing some testing/thinking on this. Overall I think it looks good, just have a few pieces of feedback.

  • Let's change closeCh and doneCh to type chan struct{}, since we don't actually care about the value being sent.
  • In the <-l.closeCh case of (*win32PipeListener).Close, I think it's best to still do a <-l.doneCh before returning. This will ensure that if there are multiple concurrent calls to Close, the later calls will not return until doneCh is closed.
  • In TestCloseRace, can we increase the number of iterations, perhaps to say, 1000? In my testing I sometimes found that it took until iteration 150+ before deadlocking, and we don't want to accidentally get a false negative from this test case if a deadlock is reintroduced.
  • Also in TestCloseRace, I would suggset adding a log.Print statement that prints the iteration number that is completed at the end of the loop. This could help tell when we are deadlocking from test logs if we hit a problem in the future.

Thanks!

@kevpar
Copy link
Member

kevpar commented Jan 10, 2025

Also, looks like there are some CI failures.

@thaJeztah
Copy link
Contributor

Had a quick peek; linting errors look unrelated to the PR;

Warning: redefines-builtin-id: redefinition of the built-in function max (revive)
Warning: redefines-builtin-id: redefinition of the built-in function max (revive)
Error: printf: non-constant format string in call to (testing.TB).Fatalf (govet)
Error: directive `//nolint:errorlint` is unused for linter "errorlint" (nolintlint)

The go-generate one is more interesting, as there's a panic;

go run github.com/Microsoft/go-winio/tools/mkwinsyscall -output zsyscall_windows.go socket.go
  go run github.com/Microsoft/go-winio/tools/mkwinsyscall -output zsyscall_windows.go ./bind_filter.go
  go run golang.org/x/tools/cmd/stringer -type=Level -trimprefix=Level
  go: downloading golang.org/x/tools v0.11.0
  go: downloading golang.org/x/mod v0.12.0
  panic: runtime error: invalid memory address or nil pointer dereference [recovered]
  	panic: runtime error: invalid memory address or nil pointer dereference
  [signal 0xc0000005 code=0x0 addr=0x0 pc=0xbff0af]

I see 070c828 changed CI to not used fixed versions of Go, and oldstable moved to go1.22; I wonder if the panic is related to that as well;

Run actions/setup-go@v5
  with:
    go-version: oldstable
    cache: false
    check-latest: false
    token: ***
  env:
    GO_VERSION: oldstable
    GOTESTSUM_VERSION: latest
    GOLANGCILINT_VERSION: latest
Setup go version spec oldstable
oldstable version resolved as 1.22.10
Found in cache @ C:\hostedtoolcache\windows\go\1.22.10\x64
Added go to the path
Successfully set up Go version oldstable
go version go1.22.10 windows/amd64

@thaJeztah thaJeztah mentioned this pull request Jan 10, 2025
@thaJeztah
Copy link
Contributor

Opened a PR for those linting issues;

(not sure about the panic, because I'm not on Windows, so maybe it was just a incident; we'll see on the other PR)

@mat007
Copy link
Author

mat007 commented Jan 13, 2025

I've been doing some testing/thinking on this. Overall I think it looks good, just have a few pieces of feedback.

Thanks! I’ll look into addressing these.

Signed-off-by: Mathieu Champlon <mathieu.champlon@docker.com>
Signed-off-by: Mathieu Champlon <mathieu.champlon@docker.com>
Signed-off-by: Mathieu Champlon <mathieu.champlon@docker.com>
200 was a little low to consistently trigger the issue.

Signed-off-by: Mathieu Champlon <mathieu.champlon@docker.com>
This should help tell when we are deadlocking from test logs if we hit a problem in the future.

Signed-off-by: Mathieu Champlon <mathieu.champlon@docker.com>
@mat007
Copy link
Author

mat007 commented Jan 13, 2025

@kevpar can you please take another look?
I haven’t tried to fix the lint and go generate failures as they don’t appear to be linked to the changes at hand, and @thaJeztah has opened a separate pull request (thanks @thaJeztah!).

@kevpar
Copy link
Member

kevpar commented Jan 13, 2025

@kevpar can you please take another look? I haven’t tried to fix the lint and go generate failures as they don’t appear to be linked to the changes at hand, and @thaJeztah has opened a separate pull request (thanks @thaJeztah!).

Changes look good, thanks! One thing that just occurred to me -- there is a chance of a panic if two goroutines try to close a listener at the same time (such that they both try to close closeCh). Should we wrap this in a sync.Once or something to prevent this?

Signed-off-by: Mathieu Champlon <mathieu.champlon@docker.com>
@mat007
Copy link
Author

mat007 commented Jan 13, 2025

One thing that just occurred to me -- there is a chance of a panic if two goroutines try to close a listener at the same time (such that they both try to close closeCh). Should we wrap this in a sync.Once or something to prevent this?

Good catch! This makes the implementation of Close much simpler as well.

Copy link
Member

@kevpar kevpar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for the change :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants