Add test for elasticsearch re-connection after network error & allow graceful shutdown #40794
This pull request is now in conflicts. Could you fix it? 🙏
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
5e4d4de to 877dc31
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
LGTM, but I have a question. I'll approve once it's answered
1e9dcf4 to 33bcac0
// There are some cases where the connection is created but Connect
// is not called before it's used, so we populate reqsContext and cancelReqs
// here.
Can we just find the places where we don't call connect and fix them? This code is only called within Beats and we can search for uses of esleg.
Also, using context.Background() is a smell: the parent context should be an argument, which will have the compiler find all NewConnection uses for you so you can audit them to see if they use close inappropriately.
The connection here also doesn't really represent a connection at the network level; it looks like it is a convenience wrapper around an HTTP client. From that perspective, closing it or having a connection-level context doesn't make much sense; it is just a wrapper for closing idle connections.
Arguably the contexts should be set on a per-request basis in all the places that call execRequest, and then the cancellation should propagate through those individual calls.
Can we just find the places where we don't call connect and fix them? This code is only called within Beats and we can search for uses of esleg.
I went with this approach to be on the safe side and avoid introducing the possibility of a panic from calling a nil function. The panic I got when running the tests seems to come from a Connection that is never used but is closed. It is triggered by a test covering a failure scenario where the ES host is not reachable.
Anyway, I've been looking into that.
Arguably the contexts should be set on a per request basis in all the places that call execRequest and then the cancellation should propagate through those individual calls.
I'm not sure that would achieve the same effect we currently have. Currently reqsContext is used to cancel in-flight requests when the Connection needs to be shut down; that is done by the Close method, which is called by a different goroutine than the one waiting for the in-flight request(s) to finish.
The issues that are fixed by this new behaviour:
- Windows service for Beat does not stop when output is unreachable #40518
- Windows service for Beat does not stop gracefully #38666
And the PR fixing them:
Honestly, I'd rather have this PR merged as is, so the issues listed above and #40705 are correctly fixed on main. Then we can create a follow-up issue to refactor the code, ensuring that:
- Methods creating requests accept a context
- In-flight requests can be cancelled when the Connection is closed by a different goroutine.
I finally figured out the reason why Filebeat would panic when the connection was closed. It's an interesting corner case.
The root cause: if the publishing pipeline never tries to publish an event and Filebeat is shut down, the connection to ES is never used, yet it is still closed during the shutdown process, leading to a panic if cancelReqs is nil.
I could add some checks to ensure cancelReqs and reqsContext are not used if nil; however, it feels cleaner to keep the previous behaviour of NewConnection returning a Connection that is safe to use, without any change of behaviour in its methods.
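A small sketch of this corner case, assuming a simplified connection type (hypothetical, not the real one): closing a connection whose cancelReqs was never set panics, while a constructor that populates the fields makes Close safe even if Connect never ran. The sketch uses context.Background() only to mirror the interim approach discussed here.

```go
package main

import (
	"context"
	"fmt"
)

// connection is a simplified, hypothetical model of the discussed type.
type connection struct {
	reqsContext context.Context
	cancelReqs  context.CancelFunc
}

// newConnection populates the context fields up front, so Close is safe
// even when Connect is never called.
func newConnection() *connection {
	ctx, cancel := context.WithCancel(context.Background())
	return &connection{reqsContext: ctx, cancelReqs: cancel}
}

// Close cancels in-flight requests; it panics if cancelReqs is nil.
func (c *connection) Close() { c.cancelReqs() }

// closePanics reports whether calling Close on c panics.
func closePanics(c *connection) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	c.Close()
	return false
}

func main() {
	// Zero value: cancelReqs is nil, so Close panics.
	fmt.Println(closePanics(&connection{}))
	// Constructor-initialised: Close is safe on an unused connection.
	fmt.Println(closePanics(newConnection()))
}
```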
33bcac0 to f2718c3
// that is passed to this client is also used in a closure, we need
// to ensure both hold a reference to the same instance of the connection.
That you have to account for future usage at all here feels wrong too. We need to fix this in a way that does not require knowing future uses of the code. Make it impossible to have this bug again.
Why do we even need a closure? Was it there just because it was convenient? Can we drop the closure altogether? It looks like if we saved the captured onConnect, it could be a method on the client, which would also avoid capturing the connection.
beats/libbeat/outputs/elasticsearch/client.go
Lines 136 to 160 in 3c03c74
conn.OnConnectCallback = func() error {
	globalCallbackRegistry.mutex.Lock()
	defer globalCallbackRegistry.mutex.Unlock()
	for _, callback := range globalCallbackRegistry.callbacks {
		err := callback(conn)
		if err != nil {
			return err
		}
	}
	if onConnect != nil {
		onConnect.mutex.Lock()
		defer onConnect.mutex.Unlock()
		for _, callback := range onConnect.callbacks {
			err := callback(conn)
			if err != nil {
				return err
			}
		}
	}
	return nil
}
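One way to read the "method on the client" idea is sketched here. All types (conn, callbackRegistry, client) are hypothetical simplifications, and the global registry from the snippet above is left out for brevity. The point is only that a method reads the client's current connection on each call rather than one captured when a closure was built.

```go
package main

import (
	"fmt"
	"sync"
)

// conn, callbackRegistry and client are hypothetical stand-ins used only
// to illustrate the idea; they are not the real Beats types.
type conn struct{ name string }

type connectCallback func(*conn) error

type callbackRegistry struct {
	mutex     sync.Mutex
	callbacks []connectCallback
}

// run invokes every registered callback with c, stopping at the first error.
func (r *callbackRegistry) run(c *conn) error {
	r.mutex.Lock()
	defer r.mutex.Unlock()
	for _, cb := range r.callbacks {
		if err := cb(c); err != nil {
			return err
		}
	}
	return nil
}

type client struct {
	conn      *conn
	onConnect *callbackRegistry // may be nil
}

// runOnConnect is a method, so it always uses the client's current
// connection instead of a value captured by a closure.
func (cl *client) runOnConnect() error {
	if cl.onConnect == nil {
		return nil
	}
	return cl.onConnect.run(cl.conn)
}

// demo runs the callbacks, swaps the connection, and runs them again,
// returning the connection names the callback observed.
func demo() []string {
	var seen []string
	reg := &callbackRegistry{callbacks: []connectCallback{
		func(c *conn) error { seen = append(seen, c.name); return nil },
	}}
	cl := &client{conn: &conn{name: "first"}, onConnect: reg}
	_ = cl.runOnConnect()
	cl.conn = &conn{name: "second"}
	_ = cl.runOnConnect()
	return seen
}

func main() {
	fmt.Println(demo())
}
```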
I want to make sure I follow the lifetime of the connection properly. The
beats/libbeat/publisher/pipeline/client_worker.go
Lines 131 to 158 in 3c03c74
The close comes from here:
beats/libbeat/outputs/backoff.go
Lines 60 to 64 in 3c03c74
Can you make
This would require touching the interface of every output but it looks like the correct place for the lifetime of the context to be managed. It looks like we only have one other use of eslegclient for monitoring that is not a test.
beats/libbeat/monitoring/report/elasticsearch/elasticsearch.go
Lines 215 to 218 in 3c03c74
👍, and in addition, if this is done, it would require the rest of the outputs to honor that new context, too. IIRC each uses different cancellation mechanisms atm.
bfa6d6c to 2dca761
When the Elasticsearch client fails to publish events, it ends up calling `Close` on the connection (which is reused). To cancel the in-flight requests, the context is cancelled and a new one is created to be used in future requests. The callback that checks the version holds a reference to the connection via a closure; the Elasticsearch client now holds a pointer to that connection, so whenever Close is called, the callback can create a request with the new, not cancelled, context. An integration test is added to ensure the ES output can always recover from network errors.
This commit moves the creation of the request context to the connect method.
There are some cases where the Connection will be used without calling Connect, so we initialise reqsContext and cancelReqs in the NewConnection function to avoid panics.
Connection.Connect now accepts a context to control the life cycle of its requests.
Add a context to outputs.Connectable.Connect to correctly manage the life cycle of the connection and its requests.
ae49e18 to f76aeed
Proposed commit message
When the Elasticsearch client fails to publish events, it ends up calling `Close` on the connection (which is reused). To cancel the in-flight requests, the context is cancelled and a new one is created to be used in future requests. The callback that checks the version holds a reference to the connection via a closure; the Elasticsearch client now holds a pointer to that connection, so whenever Close is called, the callback can create a request with the new, not cancelled, context.
An integration test is added to ensure the ES output can always recover from network errors.
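A minimal sketch of why sharing one instance matters, under the assumption (taken from the description above) that Close cancels the request context and installs a new one. The connection type is a hypothetical simplification: a by-value copy keeps the cancelled context, while a pointer to the same instance sees the fresh one.

```go
package main

import (
	"context"
	"fmt"
)

// connection is a simplified, hypothetical model of the discussed type.
type connection struct {
	reqsContext context.Context
	cancelReqs  context.CancelFunc
}

func newConnection() connection {
	ctx, cancel := context.WithCancel(context.Background())
	return connection{reqsContext: ctx, cancelReqs: cancel}
}

// Close cancels in-flight requests, then installs a fresh context so the
// reused connection can serve future requests.
func (c *connection) Close() {
	c.cancelReqs()
	c.reqsContext, c.cancelReqs = context.WithCancel(context.Background())
}

// usable reports whether a new request could start on c.
func (c *connection) usable() bool { return c.reqsContext.Err() == nil }

// afterClose closes the owner and reports whether a by-value copy and a
// pointer to the same instance are still usable afterwards.
func afterClose() (copyUsable, sharedUsable bool) {
	owner := newConnection()
	copied := owner  // separate instance holding the old context
	shared := &owner // same instance as owner

	owner.Close()
	return copied.usable(), shared.usable()
}

func main() {
	copyUsable, sharedUsable := afterClose()
	fmt.Println(copyUsable)   // false: the copy still holds the cancelled context
	fmt.Println(sharedUsable) // true: the pointer sees the renewed context
}
```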
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Disruptive User Impact
It's a bug fix, there is no disruptive user impact
## Author's Checklist
How to test this PR locally
Related issues
## Use cases
## Screenshots
## Logs