Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add go1.23.0 to CI #7665

Merged
merged 12 commits into from
Aug 23, 2024
Merged

Add go1.23.0 to CI #7665

merged 12 commits into from
Aug 23, 2024

Conversation

aarongable
Copy link
Contributor

@aarongable aarongable commented Aug 13, 2024

Begin testing on go1.23. To facilitate this, also update /x/net, golangci-lint, staticcheck, and pebble-challtestsrv to versions which support go1.23. As a result of these updates, also fix a handful of new lint findings, mostly regarding passing non-static (i.e. potentially user-controlled) format strings into Sprintf-style functions.

Additionally, delete one VA unittest that was duplicating the checks performed by a different VA unittest, but with a context timeout bug that caused it to break when go1.23 subtly changed DialContext behavior.

@aarongable
Copy link
Contributor Author

I believe the remaining unit test failure is caused by this change from the go 1.23 release notes:

The DNSError type now wraps errors caused by timeouts or cancellation. For example, errors.Is(someDNSErr, context.DeadlineExceedeed) will now report whether a DNS error was caused by a timeout.

Digging in to see what the right fix is.

@aarongable
Copy link
Contributor Author

I've modified detailedError to log what it's doing. When run under go1.22.5, the error returned in TestHTTPDialTimeout has this structure:

va.ipError{ip:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc6, 0x33, 0x64, 0x1}, err:(*url.Error)(0xc0000d23c0)}
unwrapping ipError
&url.Error{Op:"Get", URL:"http://unroutable.invalid/.well-known/acme-challenge/LoqXcYV8q5ONbJQxbmR7SCTNo3tiAXDfowyjxAjEuX0", Err:(*net.OpError)(0xc000444000)}
unwrapping ipError
&net.OpError{Op:"dial", Net:"tcp", Source:net.Addr(nil), Addr:(*net.TCPAddr)(0xc000486600), Err:(*poll.DeadlineExceededError)(0x142ff20)}

When using go1.23.0, the error has similar structure (same two outer layers, which are produced by boulder code), but a different innermost error:

va.ipError{ip:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc6, 0x33, 0x64, 0x1}, err:(*url.Error)(0xc000418060)}
unwrapping ipError
&url.Error{Op:"Get", URL:"http://unroutable.invalid/.well-known/acme-challenge/LoqXcYV8q5ONbJQxbmR7SCTNo3tiAXDfowyjxAjEuX0", Err:context.deadlineExceededError{}}
unwrapping ipError
context.deadlineExceededError{}
unwrapping net.Error
--- FAIL: TestHTTPDialTimeout (0.05s)

It turns out that the go1.22.5 code is failing here, because it got a retryable error (net.OpError(poll.DeadlineExceededError) but has no other IP addresses to fall back to:

boulder/va/http.go

Lines 583 to 589 in 4bf6e2f

if err != nil && fallbackErr(err) {
// Try to advance to another IP. If there was an error advancing we don't
// have a fallback address to use and must return the original error.
advanceTargetIPErr := target.nextIP()
if advanceTargetIPErr != nil {
return nil, records, newIPError(records[len(records)-1].AddressUsed, err)
}

But the go1.23.0 code is failing here instead, because it thinks the error (context.deadlineExceededError) is not retryable:

boulder/va/http.go

Lines 611 to 614 in 4bf6e2f

} else if err != nil {
// if the error was not a fallbackErr then return immediately.
return nil, records, newIPError(records[len(records)-1].AddressUsed, err)
}

So, two questions:

  1. why doesn't fallbackErr think that context.deadlineExceededError is retryable?
  2. why is the underlying error no longer net.OpError?

Digging deeper.

@aarongable
Copy link
Contributor Author

Easy answer to question (1): Because fallbackErr only thinks that certain kinds of net.OpError are retryable, and context.deadlineExceededError is obviously not one of those. Also, I've realized that question (1) isn't important: whether or not the error was retryable, it's still going to get wrapped in newIPError and returned to be processed by detailedError, whereupon we return to question (2): why is the underlying error type no longer a net.OpError.

This now seems certain to be related to golang/go#59017 and https://go-review.googlesource.com/c/go/+/576555, which changed how net/http handles context cancellation (e.g. due to a timeout) during net.Dial, which is exactly what this test case is covering.

Because the dial itself is now happening in a background goroutine, which has a much longer timeout than our parent context, we no longer get to see the actual underlying dial error. We only get our own context cancellation error.

Here's the fun part: the stdlib's context.deadlineExceededError is implemented like this:
https://github.com/golang/go/blob/4a9d98d49a4f49672521973852fe8983ba790c5f/src/context/context.go#L169-L173

Which just so happens to satisfy the interface defined for net.Error:
https://github.com/golang/go/blob/4a9d98d49a4f49672521973852fe8983ba790c5f/src/net/net.go#L414-L423

Which explains why fallbackErr is treating the context.deadlineExceededError as though it is a net.Error and returning the nearly correct "Timeout after connect" message instead of the expected "Timeout during connect" message.

And for extra fun, we discovered this previously! So that's cool.

Anyway, it is now the end of the day and I'm not sure what the best path forward is here. Options include:

  • Update the test to expect the "Timeout after connect" message, even though that doesn't accurately describe the point in the process at which we timed out
  • Give detailedErr explicit handling for context.DeadlineExceeded, even though the message it returns would have to be somewhat generic because it wouldn't know what step of the process hit the deadline
  • Make detailedErr's error message for net.Error more generic (e.g. "during or after connect")

@beautifulentropy
Copy link
Member

beautifulentropy commented Aug 23, 2024

Aaron and I discussed this on a call, posting our analysis for posterity:

The problems solved by the change in Go 1.23 likely impact requests made by some of our largest Subscribers. Each connection pool is per-IP so accidental cross-cancellation would only be occurring for requests made on the same VA made to the same IP. This is more likely to happen for orders with a large number names and thus a large number of challenges to be validated which all resolve to the same IP address.

The changes made in Go 1.23 solve this issue but at the cost of per-request deadlines no longer being respected at dial time:

The original func (t *Transport) dialConnFor(w *wantConn) { in src/net/http/transport.go used to set a 1 minute deadline only if the incoming context had no deadline or cancellation:

		// If there's no done channel (no deadline or cancellation
		// from the caller possible), at least set some (long)
		// timeout here. This will make sure we don't block forever
		// and leak a goroutine if the connection stops replying
		// after the TCP connect.
		connectCtx := ctx
		if ctx.Done() == nil {
			newCtx, cancel := context.WithTimeout(ctx, 1*time.Minute)
			defer cancel()
			connectCtx = newCtx
		}

The changes in Go 1.23 detach the cancellation (and deadline) before getting to dialConnFor() . So there’s no longer a deadline to be concerned with. Instead a hardcoded deadline of 1 minute is set on all dial requests:

		// Set a (long) timeout here to make sure we don't block forever
		// and leak a goroutine if the connection stops replying after
		// the TCP connect.
		connectCtx, cancel := testHookProxyConnectTimeout(ctx, 1*time.Minute)
		defer cancel()

Thus, unless we're willing to extend the per-request timeout to one minute, any shorter timeout will result in a deadlineExceeded error, obscuring any underlying netErr or OpErr. By cancelling earlier than one minute, we stop listening on the results channel (which returns either a connection or an error) before the hard-coded 1 minute deadline is reached. Currently, in the VA, we use hard-coded 10 second deadline and extending this to 60 seconds to obtain more detailed error messages isn't practical in Production.

Next steps: Aaron plans to submit a pull request to the Go net/http library that could address this issue in Go 1.24 by harvesting the context deadline and applying it after detaching cancellation, pending acceptance. In the meantime, we might consider forking net/http if the PR is successful. Until these changes take effect, we should still consider upgrading to Go 1.23.

Optionally: To maintain the utility of error messages for troubleshooting, I proposed setting HTTP request timeouts in Staging to over one minute. In Production, we can continue with shorter timeouts, which, while resulting in less detailed error messages, will not affect performance. We could post an API announcement explaining that this is a limitation of the Go standard library that we have to live with for the time being.

@aarongable
Copy link
Contributor Author

Okay so I think I fully figured out what was going on here, and the solution does not require us to contribute anything upstream.

The fundamental bug is in the test itself:

boulder/va/http_test.go

Lines 1366 to 1371 in cac431c

va, _ := setup(nil, 0, "", nil, nil)
started := time.Now()
timeout := 250 * time.Millisecond
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()

We create a context with a 250ms timeout, and use that to control the timeout of the va.validateHTTP01 call that we make. But we also use an unmodified test VA, which has va.singleDialTimeout set to its default value of 10*time.Second. So now we have two timeouts fighting each other: the dial will fail itself after 10s, but the whole request (which includes the dial) will fail first after just 250ms.

Under go1.22, this works fine. Because the dial inherits the context timeout and cancellation from its parent, it times out after 250ms, the dial error gets surfaced, and the test passes.

But under go1.23, because the dial now uses context.WithoutCancel to disconnect from the parent context, the dial gets to run all the way to its 10s timeout. But the request still times out after 250ms and we get back a simpler error causing the test to fail.

Compare this to a different test function in the same file:

boulder/va/http_test.go

Lines 59 to 67 in cac431c

func TestPreresolvedDialerTimeout(t *testing.T) {
va, _ := setup(nil, 0, "", nil, nil)
// Timeouts below 50ms tend to be flaky.
va.singleDialTimeout = 50 * time.Millisecond
// The context timeout needs to be larger than the singleDialTimeout
ctxTimeout := 500 * time.Millisecond
ctx, cancel := context.WithTimeout(context.Background(), ctxTimeout)
defer cancel()

Here we set the timeouts very low so that the test will be fast, but still ensure that the overall request timeout is still greater than the dial timeout. This matches our actual prod configuration where the request timeout is 15s (our default gRPC request timeout) while the dial timeout is 10s. This means that the dial timeout gets hit first, and we get the correct expected error.

In fact, this other test function is so similar that it's testing the exact same thing as the problematic test case. So in the commit above I've removed the failing test, and improved this other test to make it much clearer what it's testing for.

@aarongable aarongable marked this pull request as ready for review August 23, 2024 21:24
@aarongable aarongable requested a review from a team as a code owner August 23, 2024 21:24
va/va.go Outdated Show resolved Hide resolved
Copy link
Member

@beautifulentropy beautifulentropy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks great. I have just one last question; we're losing the support for formatting in probs and we're keeping that support for berrors? Is the rationale here that we hardly ever make use of it with probs and we're almost always making use of it with berrors?

@aarongable
Copy link
Contributor Author

Yep, that's exactly the reasoning. All of the places where I removed it are places where we simply never use it -- with one exception (a type where we used the args functionality exactly once, so I just turned that one into a fmt.Sprintf).

@aarongable aarongable merged commit da7865c into main Aug 23, 2024
19 checks passed
@aarongable aarongable deleted the go1.23 branch August 23, 2024 21:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants