Elasticsearch output does not recover after connection failure #40705

belimawr · 2024-09-06T21:53:14Z

Relates Add known issue that Beats do not recover from transient network errors ingest-docs#1312
Relates Add known issue that Beats do not recover from transient network errors #40767

For confirmed bugs, please report:

Version: main, v8.15.1
Operating System: Linux

Update: The problematic part of the change was reverted in #40776 and is targeted for release in 8.15.2.

Steps to reproduce

Deploy a Filebeat sending data to a remote Elasticsearch using a domain name in the configuration
Confirm Filebeat is running and sending data
Disable the network

Wait for a DNS lookup error and the publisher errors:

{
  "log.level": "warn",
  "@timestamp": "2024-09-06T08:30:33.240-0400",
  "log.logger": "transport",
  "log.origin": {
    "function": "github.com/elastic/elastic-agent-libs/transport.TestNetDialer.func1",
    "file.name": "transport/tcp.go",
    "file.line": 53
  },
  "message": "DNS lookup failure \"remote-es.elastic.cloud\": lookup remote-es.elastic.cloud: Temporary failure in name resolution",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

{
  "log.level": "error",
  "@timestamp": "2024-09-06T15:50:15.140-0400",
  "log.logger": "publisher_pipeline_output",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run",
    "file.name": "pipeline/client_worker.go",
    "file.line": 148
  },
  "message": "Failed to connect to backoff(elasticsearch(https://remote-es.elastic.cloud:443)): Get \"https://remote-es.elastic.cloud:443\": context canceled",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}
{
  "log.level": "info",
  "@timestamp": "2024-09-06T15:50:15.140-0400",
  "log.logger": "publisher_pipeline_output",
  "log.origin": {
    "function": "github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*netClientWorker).run",
    "file.name": "pipeline/client_worker.go",
    "file.line": 139
  },
  "message": "Attempting to reconnect to backoff(elasticsearch(https://remote-es.elastic.cloud:443)) with 475 reconnect attempt(s)",
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

Enable the network
Ensure the machine can reach the internet and the remote Elasticsearch
Filebeat will keep logging the same publisher errors and not sending data

The configuration I used:

filebeat.inputs:
  - type: filestream
    id: my-filestream-id
    paths:
      - /tmp/some-logs.txt

output.elasticsearch:
  hosts: ["https://remote-es.elastic.cloud:443"]
  username: elastic
  password: some-very-secret-password

elasticmachine · 2024-09-06T21:53:16Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

cmacknz · 2024-09-09T13:41:56Z

Get "https://remote-es.elastic.cloud:443": context canceled

This isn't a DNS error, so it looks to me like we did recover from the DNS failure but something else is wrong, perhaps a context that expired when it shouldn't have somewhere in one of the initial connection callbacks. The Get "https://remote-es.elastic.cloud:443" request is likely the one that fetches the version and deployment type.

beats/libbeat/cmd/instance/beat.go

Line 1099 in 8a56ef8

esVersion := conn.GetVersion()

beats/libbeat/esleg/eslegclient/connection.go

Lines 400 to 408 in 8a56ef8

    
func (conn *Connection) execRequest(
	method, url string,
	body io.Reader,
) (int, []byte, error) {
	req, err := http.NewRequestWithContext(conn.reqsContext, method, url, body)
	if err != nil {
		conn.log.Warnf("Failed to create request %+v", err)
		return 0, nil, err
	}

beats/libbeat/esleg/eslegclient/connection.go

Lines 326 to 332 in 8a56ef8

    
// Close closes a connection.
func (conn *Connection) Close() error {
	conn.HTTP.CloseIdleConnections()
	conn.cancelReqs()
	return nil
}

Perhaps we are closing the connection after the DNS failure instead of reusing it.

belimawr · 2024-09-09T14:10:33Z

That makes sense. I didn't investigate the issue; I only managed to reproduce and report it, and the DNS error seemed to be the trigger for not being able to connect to ES again.

cmacknz · 2024-09-11T17:40:59Z

Potentially related: #40572

cmacknz · 2024-09-11T17:44:41Z

^Looking at the changes there I see:

// Close closes a connection.
func (conn *Connection) Close() error {
	conn.HTTP.CloseIdleConnections()
+	conn.cancelReqs()
	return nil
}

Creating a new context only happens when NewConnection is called and there must be code paths that do not do that properly after close is called.

This change is only in 8.15.1 from what I can tell.

cmacknz · 2024-09-11T17:57:45Z

Linking the 8.15.0 backport: b19844f

git tag --contains b19844ffdae6861feaf2ee02ce11936d80b243cb
v8.15.1

cmacknz · 2024-09-11T17:59:25Z

The first thing we need to do is write a test that reproduces this problem.

cmacknz · 2024-09-11T18:43:40Z

I think we should revert #40572 once we double check removing it fixes the problem, then work on re-adding it back with a fix + test for this.

This issue is more severe than the original problem that PR was trying to fix.

cmacknz · 2024-09-11T18:54:27Z

FYI @marc-gr

belimawr · 2024-09-11T19:55:17Z

I've been testing and you're correct, Craig: the conn.cancelReqs() call is the issue. The Close method from the connection

beats/libbeat/esleg/eslegclient/connection.go

Lines 327 to 331 in cb57731

    
func (conn *Connection) Close() error {
	conn.HTTP.CloseIdleConnections()
	conn.cancelReqs()
	return nil
}

did not use to cancel any in-flight requests, nor did it render the connection unusable.

#40572 cannot be automatically reverted :/, I'll create a PR removing the culprit line, which effectively restores the old behaviour from the Elasticsearch client of not cancelling in-flight requests. Everything else added by #40572 should still work fine.

This commit removes a call to conn.cancelReqs() that causes the Connection to be unusable after Close() is called, leading to the bug described by elastic#40705.

belimawr · 2024-09-11T20:05:27Z

The "revert-ish" PR: #40769

belimawr · 2024-09-11T20:25:05Z

Investigating more, I found the root cause of the issue. On error our backoffClient will call Close() on the client:

beats/libbeat/outputs/backoff.go

Lines 60 to 67 in cb57731

    
func (b *backoffClient) Publish(ctx context.Context, batch publisher.Batch) error {
	err := b.client.Publish(ctx, batch)
	if err != nil {
		b.client.Close()
	}
	backoff.WaitOnError(b.backoff, err)
	return err
}

The Client then calls Close in the connection:

beats/libbeat/outputs/elasticsearch/client.go

Lines 539 to 541 in cb57731

    
func (client *Client) Close() error {
	return client.conn.Close()
}

The connection Close method is

beats/libbeat/esleg/eslegclient/connection.go

Lines 327 to 331 in cb57731

    
func (conn *Connection) Close() error {
	conn.HTTP.CloseIdleConnections()
	conn.cancelReqs()
	return nil
}

When #40572 was merged, the call to conn.cancelReqs() was introduced, which cancels the context created by NewConnection

beats/libbeat/esleg/eslegclient/connection.go

Lines 187 to 196 in cb57731

    
ctx, cancelFunc := context.WithCancel(context.Background())
conn := Connection{
	ConnectionSettings: s,
	HTTP:               esClient,
	Encoder:            encoder,
	log:                logger,
	responseBuffer:     bytes.NewBuffer(nil),
	reqsContext:        ctx,
	cancelReqs:         cancelFunc,
}

that is used in every request (L404)

beats/libbeat/esleg/eslegclient/connection.go

Lines 400 to 413 in cb57731

    
func (conn *Connection) execRequest(
	method, url string,
	body io.Reader,
) (int, []byte, error) {
	req, err := http.NewRequestWithContext(conn.reqsContext, method, url, body)
	if err != nil {
		conn.log.Warnf("Failed to create request %+v", err)
		return 0, nil, err
	}
	if body != nil {
		conn.Encoder.AddHeader(&req.Header)
	}
	return conn.execHTTPRequest(req)
}

and never recreated, which renders the whole Connection unusable. That was not the old behaviour.

This commit removes a call to conn.cancelReqs() that causes the Connection to be unusable after Close() is called, leading to the bug described by #40705. (cherry picked from commit b0e4f85) Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>

marc-gr · 2024-09-12T14:30:17Z

@belimawr thanks for taking a look. Would it be enough to recreate the context on close to make the client reusable? I think just removing the call to cancelReqs might make the stop racy, since IIRC this was the main reason publishers were not closed before on stop.

belimawr · 2024-09-12T17:15:36Z

Would it be enough to recreate the context on close to make the client reusable?

Mostly. The same instance of the connection is also used in a callback, so I also made sure they both hold a pointer reference, so that when the context is recreated both use the new one.

I think just removing the call to cancelReqs might make the stop racy, since IIRC this was the main reason publishers were not closed before on stop.

The PR removing the call to cancelReqs was just a quick patch to keep main releasable; I've just created a new PR with the proper fix.
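As a rough illustration of the idea discussed here (recreating the context so the client stays reusable, with every user sharing one pointer), consider the sketch below. It is not the fix from the linked PR: the `conn` type, its `Connect` and `ctxErr` methods, and the mutex are assumptions made for the example.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// conn is a hypothetical connection whose request context can be renewed.
type conn struct {
	mu          sync.Mutex
	reqsContext context.Context
	cancelReqs  context.CancelFunc
}

func newConn() *conn {
	c := &conn{}
	c.Connect()
	return c
}

// Connect installs a fresh context if there is none yet or the
// previous one was cancelled by Close.
func (c *conn) Connect() {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.reqsContext == nil || c.reqsContext.Err() != nil {
		c.reqsContext, c.cancelReqs = context.WithCancel(context.Background())
	}
}

// Close cancels in-flight requests; the connection stays reusable
// because Connect can renew the context afterwards.
func (c *conn) Close() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cancelReqs()
}

// ctxErr reports the state of the current request context.
func (c *conn) ctxErr() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.reqsContext.Err()
}

func main() {
	c := newConn()

	// The callback captures the pointer, not a copy of the struct,
	// so it observes the renewed context after Connect.
	callback := func() error { return c.ctxErr() }

	c.Close()
	fmt.Println(callback()) // context canceled

	c.Connect()
	fmt.Println(callback()) // <nil>
}
```

Had the callback captured a copy of the struct instead of the pointer, it would keep seeing the cancelled context after `Connect`, which is the pitfall the pointer reference avoids.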

cmacknz · 2024-09-20T17:00:17Z

The problematic part of the change that introduced this was reverted in #40776 and is targeted for release in 8.15.2.

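Reverting this change will likely bring back these two bugs, which were fixed in the change that introduced this problem:

Windows service for Beat does not stop when output is unreachable #40518
Windows service for Beat does not stop gracefully #38666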

The PR now linked to this issue, #40794, introduces a regression test for this situation and also brings in a fix for the two problems above that handles interrupted connections properly.

I have updated the description to note that this issue will not exist in 8.15.2.

cmacknz · 2024-09-20T17:15:23Z

I am going to close this to indicate we have resolved the underlying issue in the upcoming 8.15.2 release and move tracking the remaining work separately.

cmacknz · 2024-09-20T17:25:36Z

Follow up work now tracked in:

Regression test for recovery after Elasticsearch output connection failure #40928

belimawr added the bug and Team:Elastic-Agent-Data-Plane labels Sep 6, 2024

cmacknz changed the title ~~Elasticsearch output does not recover from DNS lookup failure~~ Elasticsearch output does not recover after DNS lookup failure Sep 9, 2024

cmacknz changed the title ~~Elasticsearch output does not recover after DNS lookup failure~~ Elasticsearch output does not recover after connection failure Sep 11, 2024

jlind23 assigned belimawr Sep 11, 2024

This was referenced Sep 11, 2024

Add known issue that Beats do not recover from transient network errors elastic/ingest-docs#1312

Merged

Add known issue that Beats do not recover from transient network errors #40767

Merged

mergify bot mentioned this issue Sep 11, 2024

[8.15] Add known issue that Beats do not recover from transient network errors (backport #1312) elastic/ingest-docs#1313

Merged

belimawr added a commit to belimawr/beats that referenced this issue Sep 11, 2024

Do not cancel context on Connection.Close

5316387

This commit removes a call to conn.cancelReqs() that causes the Connection to be unusable after Close() is called, leading to the bug described by elastic#40705.

belimawr mentioned this issue Sep 11, 2024

Do not cancel context on Connection.Close #40769

Merged

6 tasks

belimawr added a commit that referenced this issue Sep 11, 2024

Do not cancel context on Connection.Close (#40769)

b0e4f85

This commit removes a call to conn.cancelReqs() that causes the Connection to be unusable after Close() is called, leading to the bug described by #40705.

mergify bot mentioned this issue Sep 11, 2024

[8.15](backport #40769) Do not cancel context on Connection.Close #40776

Merged

6 tasks

mergify bot mentioned this issue Sep 11, 2024

[8.x](backport #40769) Do not cancel context on Connection.Close #40777

Merged

6 tasks

belimawr mentioned this issue Sep 12, 2024

Add test for elasticsearch re-connection after network error & allow graceful shutdown #40794

Merged

4 tasks

lucabelluccini mentioned this issue Sep 17, 2024

Agent cannot connect after restart Fleet Server elastic/elastic-agent#5548

Closed

cmacknz closed this as completed Sep 20, 2024

This was referenced Sep 20, 2024

Windows service for Beat does not stop when output is unreachable #40518

Open

Windows service for Beat does not stop gracefully #38666

Open

Regression test for recovery after Elasticsearch output connection failure #40928

Closed

cmacknz mentioned this issue Sep 24, 2024

Add Fleet & Agent 8.15.2 Release Notes elastic/ingest-docs#1340

Merged

mergify bot mentioned this issue Oct 25, 2024

[8.x](backport #40794) Add test for elasticsearch re-connection after network error & allow graceful shutdown #41454

Merged

4 tasks
