-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elasticsearch output does not recover after connection failure #40705
Comments
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
This isn't a DNS error, so to me it looks like we did recover from the DNS failure but something else is wrong, perhaps a context that expired when it shouldn't have somewhere in one of the initial connection callbacks. The beats/libbeat/cmd/instance/beat.go Line 1099 in 8a56ef8
beats/libbeat/esleg/eslegclient/connection.go Lines 400 to 408 in 8a56ef8
beats/libbeat/esleg/eslegclient/connection.go Lines 326 to 332 in 8a56ef8
Perhaps we are closing the connection after the DNS failure instead of reusing it. |
That makes sense. I didn't investigate the issue, I only managed to reproduce and report it and the DNS error seemed to be the trigger for not being able to connect to ES again. |
DNS lookup failure
DNS lookup failure
DNS lookup failure
Potentially related: #40572 |
^Looking at the changes there I see: // Close closes a connection.
func (conn *Connection) Close() error {
conn.HTTP.CloseIdleConnections()
+ conn.cancelReqs()
return nil
} Creating a new context only happens when This change is only in 8.15.1 from what I can tell. |
Linking the 8.15.0 backport: b19844f
|
The first thing we need to do is write a test that reproduces this problem. |
I think we should revert #40572 once we double check removing it fixes the problem, then work on re-adding it back with a fix + test for this. This issue is more severe than the original problem that PR was trying to fix. |
FYI @marc-gr |
I've been testing and you're correct Craig the beats/libbeat/esleg/eslegclient/connection.go Lines 327 to 331 in cb57731
Did not use to cancel any in-flight requests nor it rendered the connection unusable. #40572 cannot be automatically reverted :/, I'll create a PR removing the culprit line, which effectively restores the old behaviour from the Elasticsearch client of not cancelling in-flight requests. Everything else added by #40572 should still work fine. |
This commit removes a call to conn.cancelReqs() that causes the Connection to be unusable after Close() is called, leading to the bug described by elastic#40705.
The "revert-ish" PR: #40769 |
Investigating more, I found the root cause of the issue. On error our beats/libbeat/outputs/backoff.go Lines 60 to 67 in cb57731
The Client then calls beats/libbeat/outputs/elasticsearch/client.go Lines 539 to 541 in cb57731
The connection beats/libbeat/esleg/eslegclient/connection.go Lines 327 to 331 in cb57731
When #40572 was merged, the call to beats/libbeat/esleg/eslegclient/connection.go Lines 187 to 196 in cb57731
that is used in every request (L404) beats/libbeat/esleg/eslegclient/connection.go Lines 400 to 413 in cb57731
and never recreated, which renders the whole |
This commit removes a call to conn.cancelReqs() that causes the Connection to be unusable after Close() is called, leading to the bug described by #40705.
@belimawr thanks for taking a look, would be enough to recreate the context on close to make the client reusable? I think just removing the call to cancelReqs might make the stop racey since IIRC this was the main reason publishers were not closed before on stop. |
Mostly, the same instance of the connection is also used in a callback, so I also made sure they both hold a pointer reference so when the context is recreated both can use the new one.
The PR removing the call to cancelReqs was just a quick patch to keep |
The problematic part of the change that introduced this was reverted in reverted in #40776 and is targeted for release in 8.15.2. Reverting this change will likely bring back these two bugs which were fixed in the change that introduced this problem:
The PR linked to this issue now #40794 introduces a regression test for this situation and also brings in a fix for the two problems above that handles interrupted connections properly. I have updated the description to note that this issue will not exist in 8.15.2. |
I am going to close this to indicate we have resolved the underlying issue in the upcoming 8.15.2 release and move tracking the remaining work separately. |
Follow up work now tracked in: |
For confirmed bugs, please report:
main
,v8.15.1
Update: The problematic part of the change was reverted in #40776 and is targeted for release in 8.15.2.
Steps to reproduce
The configuration I used:
The text was updated successfully, but these errors were encountered: