Add go1.23.0 to CI #7665
I believe the remaining unit test failure is caused by this change from the go 1.23 release notes:
Digging in to see what the right fix is.
I've modified
When using go1.23.0, the error has similar structure (same two outer layers, which are produced by boulder code), but a different innermost error:
It turns out that the go1.22.5 code is failing here, because it got a retryable error ( Lines 583 to 589 in 4bf6e2f
But the go1.23.0 code is failing here instead, because it thinks the error ( Lines 611 to 614 in 4bf6e2f
So, two questions:
Digging deeper.
Easy answer to question (1): Because

This now seems certain to be related to golang/go#59017 and https://go-review.googlesource.com/c/go/+/576555, which changed how net/http handles context cancellation (e.g. due to a timeout) during net.Dial, which is exactly what this test case is covering. Because the dial itself is now happening in a background goroutine, which has a much longer timeout than our parent context, we no longer get to see the actual underlying dial error. We only get our own context cancellation error.

Here's the fun part: the stdlib's

Which just so happens to satisfy the interface defined for net.Error:

Which explains why

And for extra fun, we discovered this previously! So that's cool.

Anyway, it is now the end of the day and I'm not sure what the best path forward is here. Options include:
Aaron and I discussed this on a call, posting our analysis for posterity:

The problems solved by the change in Go 1.23 likely impact requests made by some of our largest Subscribers. Each connection pool is per-IP, so accidental cross-cancellation would only occur between requests made by the same VA to the same IP. This is more likely to happen for orders with a large number of names, and thus a large number of challenges to be validated, which all resolve to the same IP address.

The changes made in Go 1.23 solve this issue, but at the cost of per-request deadlines no longer being respected at dial time: The original
The changes in Go 1.23 detach the cancellation (and deadline) before getting to
Thus, unless we're willing to extend the per-request timeout to one minute, any shorter timeout will result in a deadlineExceeded error, obscuring any underlying netErr or OpErr. By cancelling earlier than one minute, we stop listening on the results channel (which returns either a connection or an error) before the hard-coded one minute deadline is reached. Currently, in the VA, we use a hard-coded 10 second deadline, and extending this to 60 seconds to obtain more detailed error messages isn't practical in Production.

Next steps: Aaron plans to submit a pull request to the Go net/http library that could address this issue in Go 1.24 by harvesting the context deadline and applying it after detaching cancellation, pending acceptance. In the meantime, we might consider forking net/http if the PR is successful. Until these changes take effect, we should still consider upgrading to Go 1.23.

Optionally: To maintain the utility of error messages for troubleshooting, I proposed setting HTTP request timeouts in Staging to over one minute. In Production, we can continue with shorter timeouts, which, while resulting in less detailed error messages, will not affect performance. We could post an API announcement explaining that this is a limitation of the Go standard library that we have to live with for the time being.
Okay so I think I fully figured out what was going on here, and the solution does not require us to contribute anything upstream. The fundamental bug is in the test itself: Lines 1366 to 1371 in cac431c
We create a context with a 250ms timeout, and use that to control the timeout of the va.validateHTTP01 call that we make. But we also use an unmodified test VA, which has va.singleDialTimeout set to its default value of 10*time.Second. So now we have two timeouts fighting each other: the dial will fail itself after 10s, but the whole request (which includes the dial) will fail first after just 250ms.
Under go1.22, this works fine. Because the dial inherits the context timeout and cancellation from its parent, it times out after 250ms, the dial error gets surfaced, and the test passes. But under go1.23, because the dial now uses

Compare this to a different test function in the same file: Lines 59 to 67 in cac431c
Here we set the timeouts very low so that the test will be fast, but still ensure that the overall request timeout is greater than the dial timeout. This matches our actual prod configuration, where the request timeout is 15s (our default gRPC request timeout) while the dial timeout is 10s. This means that the dial timeout gets hit first, and we get the correct expected error. In fact, this other test function is so similar that it's testing the exact same thing as the problematic test case. So in the commit above I've removed the failing test, and improved this other test to make it much clearer what it's testing for.
This change looks great. I have just one last question; we're losing the support for formatting in probs and we're keeping that support for berrors? Is the rationale here that we hardly ever make use of it with probs and we're almost always making use of it with berrors?
Yep, that's exactly the reasoning. All of the places where I removed it are places where we simply never use it -- with one exception (a type where we used the args functionality exactly once, so I just turned that one into a fmt.Sprintf).
Begin testing on go1.23. To facilitate this, also update /x/net, golangci-lint, staticcheck, and pebble-challtestsrv to versions which support go1.23. As a result of these updates, also fix a handful of new lint findings, mostly regarding passing non-static (i.e. potentially user-controlled) format strings into Sprintf-style functions.
Additionally, delete one VA unittest that was duplicating the checks performed by a different VA unittest, but with a context timeout bug that caused it to break when go1.23 subtly changed DialContext behavior.