Add go1.23.0 to CI #7665

aarongable · 2024-08-13T16:44:44Z

Begin testing on go1.23. To facilitate this, also update /x/net, golangci-lint, staticcheck, and pebble-challtestsrv to versions which support go1.23. As a result of these updates, also fix a handful of new lint findings, mostly regarding passing non-static (i.e. potentially user-controlled) format strings into Sprintf-style functions.

Additionally, delete one VA unittest that was duplicating the checks performed by a different VA unittest, but with a context timeout bug that caused it to break when go1.23 subtly changed DialContext behavior.

aarongable · 2024-08-21T23:18:45Z

I believe the remaining unit test failure is caused by this change from the go 1.23 release notes:

The DNSError type now wraps errors caused by timeouts or cancellation. For example, errors.Is(someDNSErr, context.DeadlineExceedeed) will now report whether a DNS error was caused by a timeout.

Digging in to see what the right fix is.

aarongable · 2024-08-21T23:36:21Z

I've modified detailedError to log what it's doing. When run under go1.22.5, the error returned in TestHTTPDialTimeout has this structure:

va.ipError{ip:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc6, 0x33, 0x64, 0x1}, err:(*url.Error)(0xc0000d23c0)}
unwrapping ipError
&url.Error{Op:"Get", URL:"http://unroutable.invalid/.well-known/acme-challenge/LoqXcYV8q5ONbJQxbmR7SCTNo3tiAXDfowyjxAjEuX0", Err:(*net.OpError)(0xc000444000)}
unwrapping ipError
&net.OpError{Op:"dial", Net:"tcp", Source:net.Addr(nil), Addr:(*net.TCPAddr)(0xc000486600), Err:(*poll.DeadlineExceededError)(0x142ff20)}

When using go1.23.0, the error has similar structure (same two outer layers, which are produced by boulder code), but a different innermost error:

va.ipError{ip:net.IP{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xff, 0xff, 0xc6, 0x33, 0x64, 0x1}, err:(*url.Error)(0xc000418060)}
unwrapping ipError
&url.Error{Op:"Get", URL:"http://unroutable.invalid/.well-known/acme-challenge/LoqXcYV8q5ONbJQxbmR7SCTNo3tiAXDfowyjxAjEuX0", Err:context.deadlineExceededError{}}
unwrapping ipError
context.deadlineExceededError{}
unwrapping net.Error
--- FAIL: TestHTTPDialTimeout (0.05s)

It turns out that the go1.22.5 code is failing here, because it got a retryable error (net.OpError(poll.DeadlineExceededError) but has no other IP addresses to fall back to:

boulder/va/http.go

Lines 583 to 589 in 4bf6e2f

    
           if err != nil && fallbackErr(err) { 
        
           	// Try to advance to another IP. If there was an error advancing we don't 
        
           	// have a fallback address to use and must return the original error. 
        
           	advanceTargetIPErr := target.nextIP() 
        
           	if advanceTargetIPErr != nil { 
        
           		return nil, records, newIPError(records[len(records)-1].AddressUsed, err) 
        
           	}

But the go1.23.0 code is failing here instead, because it thinks the error (context.deadlineExceededError) is not retryable:

boulder/va/http.go

Lines 611 to 614 in 4bf6e2f

    
           } else if err != nil { 
        
           	// if the error was not a fallbackErr then return immediately. 
        
           	return nil, records, newIPError(records[len(records)-1].AddressUsed, err) 
        
           }

So, two questions:

why doesn't fallbackErr think that context.deadlineExceededError is retryable?
why is the underlying error no longer net.OpError?

Digging deeper.

aarongable · 2024-08-22T00:57:36Z

Easy answer to question (1): Because fallbackErr only thinks that certain kinds of net.OpError are retryable, and context.deadlineExceededError is obviously not one of those. Also, I've realized that question (1) isn't important: whether or not the error was retryable, it's still going to get wrapped in newIPError and returned to be processed by detailedError, whereupon we return to question (2): why is the underlying error type no longer a net.OpError.

This now seems certain to be related to golang/go#59017 and https://go-review.googlesource.com/c/go/+/576555, which changed how net/http handles context cancellation (e.g. due to a timeout) during net.Dial, which is exactly what this test case is covering.

Because the dial itself is now happening in a background goroutine, which has a much longer timeout than our parent context, we no longer get to see the actual underlying dial error. We only get our own context cancellation error.

Here's the fun part: the stdlib's context.deadlineExceededError is implemented like this:
https://github.com/golang/go/blob/4a9d98d49a4f49672521973852fe8983ba790c5f/src/context/context.go#L169-L173

Which just so happens to satisfy the interface defined for net.Error:
https://github.com/golang/go/blob/4a9d98d49a4f49672521973852fe8983ba790c5f/src/net/net.go#L414-L423

Which explains why fallbackErr is treating the context.deadlineExceededError as though it is a net.Error and returning the nearly correct "Timeout after connect" message instead of the expected "Timeout during connect" message.

And for extra fun, we discovered this previously! So that's cool.

Anyway, it is now the end of the day and I'm not sure what the best path forward is here. Options include:

Update the test to expect the "Timeout after connect" message, even though that doesn't accurately describe the point in the process at which we timed out
Give detailedErr explicit handling for context.DeadlineExceeded, even though the message it returns would have to be somewhat generic because it wouldn't know what step of the process hit the deadline
Make detailedErr's error message for net.Error more generic (e.g. "during or after connect")

beautifulentropy · 2024-08-23T18:37:12Z

Aaron and I discussed this on a call, posting our analysis for posterity:

The problems solved by the change in Go 1.23 likely impact requests made by some of our largest Subscribers. Each connection pool is per-IP so accidental cross-cancellation would only be occurring for requests made on the same VA made to the same IP. This is more likely to happen for orders with a large number names and thus a large number of challenges to be validated which all resolve to the same IP address.

The changes made in Go 1.23 solve this issue but at the cost of per-request deadlines no longer being respected at dial time:

The original func (t *Transport) dialConnFor(w *wantConn) { in src/net/http/transport.go used to set a 1 minute deadline only if the incoming context had no deadline or cancellation:

		// If there's no done channel (no deadline or cancellation
		// from the caller possible), at least set some (long)
		// timeout here. This will make sure we don't block forever
		// and leak a goroutine if the connection stops replying
		// after the TCP connect.
		connectCtx := ctx
		if ctx.Done() == nil {
			newCtx, cancel := context.WithTimeout(ctx, 1*time.Minute)
			defer cancel()
			connectCtx = newCtx
		}

The changes in Go 1.23 detach the cancellation (and deadline) before getting to dialConnFor() . So there’s no longer a deadline to be concerned with. Instead a hardcoded deadline of 1 minute is set on all dial requests:

		// Set a (long) timeout here to make sure we don't block forever
		// and leak a goroutine if the connection stops replying after
		// the TCP connect.
		connectCtx, cancel := testHookProxyConnectTimeout(ctx, 1*time.Minute)
		defer cancel()

Thus, unless we're willing to extend the per-request timeout to one minute, any shorter timeout will result in a deadlineExceeded error, obscuring any underlying netErr or OpErr. By cancelling earlier than one minute, we stop listening on the results channel (which returns either a connection or an error) before the hard-coded 1 minute deadline is reached. Currently, in the VA, we use hard-coded 10 second deadline and extending this to 60 seconds to obtain more detailed error messages isn't practical in Production.

Next steps: Aaron plans to submit a pull request to the Go net/http library that could address this issue in Go 1.24 by harvesting the context deadline and applying it after detaching cancellation, pending acceptance. In the meantime, we might consider forking net/http if the PR is successful. Until these changes take effect, we should still consider upgrading to Go 1.23.

Optionally: To maintain the utility of error messages for troubleshooting, I proposed setting HTTP request timeouts in Staging to over one minute. In Production, we can continue with shorter timeouts, which, while resulting in less detailed error messages, will not affect performance. We could post an API announcement explaining that this is a limitation of the Go standard library that we have to live with for the time being.

aarongable · 2024-08-23T21:04:07Z

Okay so I think I fully figured out what was going on here, and the solution does not require us to contribute anything upstream.

The fundamental bug is in the test itself:

boulder/va/http_test.go

Lines 1366 to 1371 in cac431c

    
           va, _ := setup(nil, 0, "", nil, nil) 
        
           started := time.Now() 
        
           timeout := 250 * time.Millisecond 
        
           ctx, cancel := context.WithTimeout(context.Background(), timeout) 
        
           defer cancel()

We create a context with a 250ms timeout, and use that to control the timeout of the va.validateHTTP01 call that we make. But we also use an unmodified test VA, which has va.singleDialTimeout set to its default value of 10*time.Second. So now we have two timeouts fighting each other: the dial will fail itself after 10s, but the whole request (which includes the dial) will fail first after just 250ms.

Under go1.22, this works fine. Because the dial inherits the context timeout and cancellation from its parent, it times out after 250ms, the dial error gets surfaced, and the test passes.

But under go1.23, because the dial now uses context.WithoutCancel to disconnect from the parent context, the dial gets to run all the way to its 10s timeout. But the request still times out after 250ms and we get back a simpler error causing the test to fail.

Compare this to a different test function in the same file:

boulder/va/http_test.go

Lines 59 to 67 in cac431c

    
           func TestPreresolvedDialerTimeout(t *testing.T) { 
        
           	va, _ := setup(nil, 0, "", nil, nil) 
        
           	// Timeouts below 50ms tend to be flaky. 
        
           	va.singleDialTimeout = 50 * time.Millisecond 
        
           	// The context timeout needs to be larger than the singleDialTimeout 
        
           	ctxTimeout := 500 * time.Millisecond 
        
           	ctx, cancel := context.WithTimeout(context.Background(), ctxTimeout) 
        
           	defer cancel()

Here we set the timeouts very low so that the test will be fast, but still ensure that the overall request timeout is still greater than the dial timeout. This matches our actual prod configuration where the request timeout is 15s (our default gRPC request timeout) while the dial timeout is 10s. This means that the dial timeout gets hit first, and we get the correct expected error.

In fact, this other test function is so similar that it's testing the exact same thing as the problematic test case. So in the commit above I've removed the failing test, and improved this other test to make it much clearer what it's testing for.

va/va.go

beautifulentropy

This change looks great. I have just one last question; we're losing the support for formatting in probs and we're keeping that support for berrors? Is the rationale here that we hardly ever make use of it with probs and we're almost always making use of it with berrors?

aarongable · 2024-08-23T21:55:17Z

Yep, that's exactly the reasoning. All of the places where I removed it are places where we simply never use it -- with one exception (a type where we used the args functionality exactly once, so I just turned that one into a fmt.Sprintf).

aarongable added 7 commits August 13, 2024 09:44

Add go1.23.0 to CI

c9462ce

Update /x/net for compatibility with go1.23

116164a

Update version of pebble-challtestsrv installed in docker

25e3b38

Update golangci-lint to support go1.23

8aacc50

Update staticcheck to support go1.23

c6e74d8

Fix version in local docker-compose

6150a76

Fix new lint findings

3d4cd8b

FREEZE.unindexed

45321d3

Unify two http TestTimeout tests

1f5e839

aarongable added 2 commits August 23, 2024 14:10

Merge branch main into go1.23

e58b7e8

Fix two lints introduced by #7666

a1713af

aarongable marked this pull request as ready for review August 23, 2024 21:24

aarongable requested a review from a team as a code owner August 23, 2024 21:24

aarongable requested a review from beautifulentropy August 23, 2024 21:24

beautifulentropy requested changes Aug 23, 2024

View reviewed changes

va/va.go Outdated Show resolved Hide resolved

Remove debugging printfs

75264f9

beautifulentropy approved these changes Aug 23, 2024

View reviewed changes

aarongable merged commit da7865c into main Aug 23, 2024
19 checks passed

aarongable deleted the go1.23 branch August 23, 2024 21:56

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add go1.23.0 to CI #7665

Add go1.23.0 to CI #7665

aarongable commented Aug 13, 2024 •

edited

Loading

aarongable commented Aug 21, 2024

aarongable commented Aug 21, 2024

aarongable commented Aug 22, 2024

beautifulentropy commented Aug 23, 2024 •

edited

Loading

aarongable commented Aug 23, 2024

beautifulentropy left a comment

aarongable commented Aug 23, 2024

Add go1.23.0 to CI #7665

Add go1.23.0 to CI #7665

Conversation

aarongable commented Aug 13, 2024 • edited Loading

aarongable commented Aug 21, 2024

aarongable commented Aug 21, 2024

aarongable commented Aug 22, 2024

beautifulentropy commented Aug 23, 2024 • edited Loading

aarongable commented Aug 23, 2024

beautifulentropy left a comment

Choose a reason for hiding this comment

aarongable commented Aug 23, 2024

aarongable commented Aug 13, 2024 •

edited

Loading

beautifulentropy commented Aug 23, 2024 •

edited

Loading