Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix prepared statements for parallel queries #136

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

davoclavo
Copy link

@davoclavo davoclavo commented Jun 11, 2019

Hello!

We were running into a problem where many prepared queries happening in parallel were getting the wrong set of parameters assigned to them.

The unproven theory is that the prepared query name and portal values are currently set to "" potentially causing collisions. I changed all those hardcoded "" to a generated value called genName which contains an AtomicInt.incrementAndGet -- that seems to fix the bug in our set up: finagle-postgres 0.11 via quill-finagle-postgres 3.1.0

However, I wasn't able to replicate the bug in the specs. Perhaps the problem is that the queries that I am performing are fast enough to not cause a race condition? or maybe the test queries are somehow running in a serialized fashion anyways? not sure :( - I'd gladly re-write the test to verify this behavior successfully, and would love if someone is able to guide me in the right direction.


I'd like to add that we are getting several symptoms that all seem to be related with this issue:

  1. Sometimes we get ERROR: portal "" cannot be run when those parallel queries are getting executed
  2. Some of these parallel queries are left idle for a long time, causing us to have to use pg_terminate_backend regularly in order to cancel them
  3. In one of our setups we had two sets of queries, each with 10 and 12 different parameters, all running simultaneously. In some cases the query expecting 12 parameters was being sent 10, causing the following error:

bind message supplies 10 parameters, but prepared statement “...” requires 12

@vic vic force-pushed the fix-prepared-statements-for-parallel-queries branch from 78f6316 to fe9dac0 Compare September 3, 2019 18:59
@leonmaia
Copy link
Collaborator

Thanks for the contribution @davoclavo.

I would love to see some failing specs. But the changes look safe enough for me. Could you rebase?

@leonmaia leonmaia self-requested a review November 26, 2019 14:51
PostgreSQL prepared statements and portals are referenced by names, where
setting an empty string is ok, as it means it is unnamed prepared statement. [1]

However there is a weird issue happening under high loads where a different set
of parameters are being set to the wrong prepared statements:

`bind message supplies 10 parameters, but prepared statement “...” requires 12`

And also sporadically we get `ERROR: portal "" cannot be run` making it super
hard to troubleshoot as all portals currently share the same name.

Setting a unique name for each prepared statement has fixed the problem in our
production setting, however I am not able to pinpoint why the problem exists in
the first place, as "" seems to be a valid portal name.

[1] - https://www.postgresql.org/docs/11/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY
@davoclavo davoclavo force-pushed the fix-prepared-statements-for-parallel-queries branch from fe9dac0 to a8d5f78 Compare November 26, 2019 17:57
@davoclavo
Copy link
Author

@leonmaia

Thanks for picking up this stale pull request. I have already rebased with master.

I agree, I would also love to see failing specs but sadly wasn't able to replicate the problem after attempting a couple of times :( - If there is some peace of mind, we have been running this modified version in production for about 6 months without facing this problem.

@davoclavo davoclavo changed the title WIP: Fix prepared statements for parallel queries Fix prepared statements for parallel queries Nov 26, 2019
@dangerousben
Copy link
Collaborator

or maybe the test queries are somehow running in a serialized fashion anyways?

The integration tests all currently run with a pool size of 1, which might explain the inability to replicate.

@davoclavo
Copy link
Author

@dangerousben I noticed that and already tried setting a larger pool for concurrent connections without luck. Is there any other place I should make an adjustment?

See this particular line in my commit: a8d5f78#diff-3f8ad9e99aaed7920116b65d875d3736R49

@dangerousben
Copy link
Collaborator

@dangerousben I noticed that and already tried setting a larger pool for concurrent connections without luck. Is there any other place I should make an adjustment?

It looks ok, and quick bit of debugging shows that the statements are not run sequentially. Not sure why it's not reproducing the problem.

@davoclavo
Copy link
Author

davoclavo commented Jun 22, 2020

It looks ok, and quick bit of debugging shows that the statements are not run sequentially. Not sure why it's not reproducing the problem.

My theory is that the queries I am making are fast enough to not cause this race condition, but I am not entirely sure honestly. I will try to make other attempts to replicate this problem, if you have any ideas I am keen to try them out.

@davoclavo davoclavo closed this Jun 22, 2020
@davoclavo davoclavo reopened this Jun 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants