Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

source-postgres: Replication stream getting closed repeatedly by Neon Postgres #2059

Open
willdonnelly opened this issue Oct 15, 2024 · 0 comments
Assignees
Labels
bug Something isn't working

Comments

@willdonnelly
Copy link
Member

While keeping an eye on production CDC task errors/warnings for an unrelated rollout, I noticed an interesting pattern. We have at least four production capture tasks from Neon Postgres instances which keep getting restarted and then quickly shutting down with the following pattern of logs:

2024-10-15T20:53:22Z INFO initializing connector {"address":"REDACTED.neon.tech:5432","database":"REDACTED","slot":"REDACTED","user":"REDACTED"}
2024-10-15T20:53:22Z INFO queried database version {REDACTED}
2024-10-15T20:53:23Z INFO discovering tables null
2024-10-15T20:53:23Z INFO starting replication {"publication":"flow_publication","slot":"flow_slot","startLSN":"REDACTED"}
2024-10-15T20:53:23Z INFO streaming to fence {"fenceAfter":"0s"}
2024-10-15T20:53:23Z INFO advancing confirmed LSN {"ackLSN":"REDACTED"}
2024-10-15T20:53:23Z INFO processed replication events {"events":4}
2024-10-15T20:53:23Z INFO streaming to fence {"fenceAfter":"1m0s"}
2024-10-15T20:53:33Z INFO advancing confirmed LSN {"ackLSN":"REDACTED"}
2024-10-15T20:53:53Z INFO processed replication events {"events":0}
2024-10-15T20:53:53Z WARN replication stream closed unexpectedly null
2024-10-15T20:53:53.755809536Z WARN connector exited with unconsumed input stream remainder {"module":"connector_init::rpc"}

We can see in the above logs that we connected and streamed to a fence (which in Postgres still implies a watermark write) at 2024-10-15T20:53:23Z and then went into an indefinite time-based streaming phase and got disconnected when the replication stream closed at 2024-10-15T20:53:53Z.

This appears to be "benign" in the sense that the captures are remaining connected so long as stuff is taking place and then shutting down after exactly 30 seconds while idle. I'm going to draw the obvious inference that Neon might be disconnecting idle connections really quickly here.

One odd thing, of course, is that we're supposed to be sending a "Standby Status Update" message on that replication stream every 10 seconds on the dot (regardless of whether the confirmed LSN has advanced -- we just don't log it the rest of the time to avoid spamming the logs even worse). So it shouldn't be a TCP-level connection timeout since that connection can't be idle for >10s at a stretch and we're getting disconnected only after 30s.

That's as much investigation and thought as I've given the matter at present, just filing an issue to look into this more later.

@willdonnelly willdonnelly added the bug Something isn't working label Oct 15, 2024
@willdonnelly willdonnelly self-assigned this Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant