[PubSub] Messages stuck in the queue and not being pulled by listener #1005
Are you using our |
I have noticed this issue with |
Is it possible that the processed messages are just not being acknowledged? I believe we have auto-acking on in the Stream Binder though. |
Auto-acking is working fine in the stream binder and we are acking the message manually with pubsub-starter. I'm going to try to just listen and ack the messages to see if this happens as well. After some reading, I found there was an issue with long-lived connections in GKE, where connections are closed due to a TCP keep-alive timeout. There was a fix for that, though; more information can be found here: googleapis/nodejs-pubsub#11 |
I've been having the same issue. The app is on Spring Boot 2.0.4.RELEASE and spring-cloud-gcp 1.0.0.RELEASE, also on GKE, and I have 4 pods running, on a 3 node K8S cluster. The behavior I'm seeing is that on startup it processes old messages, and new ones as well, with minimal latency. However, after some time, it just stops processing messages. I can see that the messages are sitting there (sometimes for an hour or more) in the queue via I'm using the Spring Integration channel adapter. I tried removing Spring Integration and use The only thing I see weird is the |
Another thing I'd like to point out is that we have any number of short lived python processes that communicate via pubsub, and we haven't seen problems in those, so it definitely seems like an issue with long lived processes. |
@danvalencia per your post in GCP Slack, we observe exactly the same behavior with "bare-bones" Spring and What I'd suggest (if possible) is to use EDIT: just to reiterate what was mentioned above, the issue is not with ACKs, because we don't even receive the messages in the first place. Only restarting the subscriber regularly helps resolve this. |
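The "restart the subscriber regularly" workaround mentioned above can be narrowed to "restart only when nothing has arrived for a while". The following is a minimal sketch of that idea, not something taken from this thread's code: `IdleSubscriberWatchdog` and its `restartSubscriber` hook are hypothetical names, and the hook stands in for whatever stops and recreates the subscriber in your setup. Note that idle time alone cannot tell a stuck subscriber from a quiet topic, so the threshold is an assumption you would need to tune.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical watchdog: triggers a restart when no message has arrived for too long. */
class IdleSubscriberWatchdog {
    private final Duration maxIdle;
    // Hypothetical hook: stops and recreates the subscriber in your application.
    private final Runnable restartSubscriber;
    private final AtomicReference<Instant> lastReceived;

    IdleSubscriberWatchdog(Duration maxIdle, Runnable restartSubscriber, Instant startedAt) {
        this.maxIdle = maxIdle;
        this.restartSubscriber = restartSubscriber;
        this.lastReceived = new AtomicReference<>(startedAt);
    }

    /** Call from the message handler every time a message is delivered. */
    void recordMessage(Instant now) {
        lastReceived.set(now);
    }

    /** Call periodically, e.g. from a scheduler. Returns true if a restart was triggered. */
    boolean check(Instant now) {
        Duration idle = Duration.between(lastReceived.get(), now);
        if (idle.compareTo(maxIdle) <= 0) {
            return false; // a message arrived recently enough
        }
        restartSubscriber.run();
        lastReceived.set(now); // reset so the next tick does not restart again immediately
        return true;
    }
}
```

In practice `check` would be driven by something like a `ScheduledExecutorService`, and `recordMessage` would be the first call in the message receiver.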
@meltsufin I have changed the app to just acknowledge the messages without doing any processing and I witnessed the same behavior again. I now strongly believe that it's not linked to ack deadline extension. As @dinvlad said, we are not even receiving the messages. I will try |
Thanks for the additional information. We will investigate this further and get back to you. |
@ramzimaalej This might be a server side issue. Can you please submit a case in the support portal with the details (project number, subscription name, rough time range)? If you don't have a support plan, try cloud-pubsub@google.com |
@kir-titievsky I contacted cloud-pubsub@google.com 6 days ago, but I have not heard from them yet. I'm on the startup surge program, and I'm not sure I have access to the support portal. The issue has been happening consistently for the past week. Let me know if you need additional information. Thanks a lot for jumping in on that! |
We're facing the same issue as well! I have just put together a "bare-bones" app with Spring Integration connecting directly to the google-cloud-pubsub lib, bypassing spring-gcp-pubsub, and so far the messages are being picked up. I will give it more time before saying it might be a spring-gcp-pubsub issue, but with the previous setup (Spring Integration, spring-gcp-pubsub, google-cloud-pubsub) the issue appeared much faster (an hour in with the new setup, no issue so far). I've also contacted Google and they'll get back to us if it's a server-side issue. Our env: "org.springframework.cloud:spring-cloud-gcp-starter-pubsub:1.0.0.RELEASE", and I was thinking of upgrading to 1.1.0.M1 but that's older than current release. PS: We're also doing manual ack, and from what I see in the logs it doesn't look like messages are being lost; I've seen it process a message fine and then just stop picking up the next messages after some idle time. There might be a network disconnection during that time, and it would be worth finding out what's causing that between GKE and Pub/Sub, but the client should be able to recover from those, since such issues can still happen in regular scenarios. |
Can anyone share a simple app that reproduces the issue? |
I just created this basic app to recreate this scenario following https://spring.io/guides/gs/messaging-gcp-pubsub/ but with newer versions as given below. I suspect something might be wrong with the config so if we can eliminate this possibility then we can look further before spending time in deeper layers. I have deployed this on GKE connecting to two subscriptions on a single topic. It received messages from both subscriptions fine and then it received only on single subscription because I accidentally had this app also running locally where it was picking the other messages (as it should). But when I stopped local app, this GKE deployed app didn't pick messages. I will try a clean test without local app also running but those should have been independent connections anyway and this deployed version should start picking up all messages when other client was stopped. When I restarted this version it started picking up fine and I'll try to recreate the original scenario where it stops after a while. Meanwhile, here's the config and if this looks right and I can recreate the original issue, I'll share the app. UPDATE: After restart it stopped picking messages from both subscriptions. (I'm really not sure about this PublishSubscribeChannel getting hooked up with multiple PubsubInboundAdapters.)
Versions:
Also, I'm using single-threaded subscriber (need it that way for now). I've also tried prototype beans for pubsubTemplate and publisherFactory and subscriberFactory.
|
I could recreate it with single subscription and 2 subscribing threads. What might be interesting is that when I use gcloud cmd line to pull messages (without acking them), it circles between two messages on the subscription and sometimes even shows 0 messages available only to circle back and show other messages again in next commands. Maybe that's default/expected behavior of pubsub? Following is the new simple config that still runs into this issue. Meanwhile, my other app bypassing spring-gcp-pubsub (and using SI with google-cloud-pubsub jar) is working fine for several hours.
Following are the last logs where it received message "abc" successfully and the next message was sent ~15 minutes later that has not been picked up so far.
|
As a different data point, we saw a similar issue months ago using the Google support provided us with two solutions ...
or
Because of this issue, along with the way the |
I looked into that as well and went as far as creating a client thread using pull method (from pubsubTemplate). But decided against it since the maintenance of the thread (timeout, scalability etc.) would become our responsibility and we'll be reinventing just another client (that's what I'm really surprised they don't have keep-alive functionality built-in to Since it was working fine till few weeks ago, we know this setup should work so didn't want to invest too much time into writing a custom client. This is what currently works fine for us using
|
@qlodhi-clearlabs strange, this last method worked fine for us previously, but that's what we've been having issues with recently. Perhaps the Google team fixed it, if it was a server-side issue? |
@qlodhi-clearlabs I agree ... you shouldn't write your own client. (We did out of necessity.) Since the Spring If you can publish keep-alive messages to the topic / ignore them on the consumer side, it would be a good data point to see if it's truly a socket timeout / disconnect type of issue at the TCP/IP layer. We actually found the keep-alive messages useful to track end to end latency. |
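The keep-alive message idea above (publish heartbeat messages to the topic, ignore them on the consumer side, and use them to track end-to-end latency) might look roughly like the sketch below. It is an illustration under assumptions, not the commenter's implementation: the class `KeepAliveMessages` and the attribute names `type` and `sentAt` are hypothetical. It relies only on the fact that Pub/Sub messages carry string key/value attributes, so it works on a plain `Map<String, String>` and leaves the actual publish and receive calls to your client code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for keep-alive ("heartbeat") messages sent through the topic itself. */
class KeepAliveMessages {
    // Hypothetical attribute names and marker value; adapt to your own message schema.
    static final String TYPE_ATTR = "type";
    static final String SENT_AT_ATTR = "sentAt";
    static final String KEEP_ALIVE = "keep-alive";

    /** Attributes the publisher attaches to each keep-alive message. */
    static Map<String, String> newAttributes(Instant sentAt) {
        Map<String, String> attributes = new HashMap<>();
        attributes.put(TYPE_ATTR, KEEP_ALIVE);
        attributes.put(SENT_AT_ATTR, sentAt.toString()); // ISO-8601, parseable by Instant.parse
        return attributes;
    }

    /** True if the consumer should skip business processing (but still ack) this message. */
    static boolean isKeepAlive(Map<String, String> attributes) {
        return KEEP_ALIVE.equals(attributes.get(TYPE_ATTR));
    }

    /** End-to-end latency of a keep-alive message, or empty for any other message. */
    static Optional<Duration> latency(Map<String, String> attributes, Instant receivedAt) {
        if (!isKeepAlive(attributes) || !attributes.containsKey(SENT_AT_ATTR)) {
            return Optional.empty();
        }
        Instant sentAt = Instant.parse(attributes.get(SENT_AT_ATTR));
        return Optional.of(Duration.between(sentAt, receivedAt));
    }
}
```

If keep-alive messages published at a fixed interval stop arriving, or their latency jumps from seconds to many minutes, that points to the kind of silent disconnect described in this thread rather than to an empty topic.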
We encountered the same problem after upgrading to the new version:
Previously, we used a custom build of spring-cloud-gcp based on version 1.0.0.0.M2 for several months. We have never encountered a delay in retrieving messages. |
The version of |
We've noticed this with various versions of |
I'm not sure what our Spring integration with Pub/Sub could do. We're really just delegating the connection management and message retrieval to the |
@qlodhi-clearlabs which version of |
@ramzimaalej what works for me is direct instantiation of Subscriber from The config is given above (with When I put the same jar integrated with spring-gcp-pubsub and it runs into disconnectivity, it just gets stuck. With the direct connection, it must be doing something differently, so it's able to recover. @meltsufin that's the part that's a bit baffling, if spring-gcp-pubsub is also just creating a subscriber then both scenarios should have the same results, but apparently, something is different. |
Yes, I meant |
And we're using "bare-bones" Spring with |
I was able to reproduce this issue as well, spring boot app (2.0.4.RELEASE) with spring-cloud-gcp (1.0.0.RELEASE) on GKE. Here's the code: https://gist.github.com/danvalencia/6cb0235ec01952bbc54fdc91b106be70 |
@meltsufin I have used |
Thank you everyone involved here! Right now since we know that it works with the
That one does this:
Where the
All those options for the Please, give it a shot to analyze that class for the options which are different from those you use in case of plain |
@ramzimaalej Can you clarify which version of |
Just for information: since this morning approx. 5:30 CEST until now (the last 11 hours) we have not seen any further unpolled messages. |
So, Google came back saying that the issue we faced "was due to an internal problem" and we should continue to do things as we were and report any issues if they appear again. Our original code (using It is worrisome that code base using |
I'm glad to hear it's now resolved! |
@meltsufin I used this version |
@ramzimaalej Thanks, we try to always stay on the latest of |
@meltsufin it was indeed an underlying issue with the Pub/Sub environment, but obviously those using spring-gcp-pubsub were unluckier than those using I'd want to get back to using spring-gcp-pubsub ASAP because of the benefits of the wrapper (tweaking params by config instead of writing new code, e.g. for scalability etc.). I'll report here if I can find the difference in the next few days, but for now more monitoring on Pub/Sub :) Thanks a lot for your valuable insights, everyone! |
@qlodhi-clearlabs could you please clarify what the issue was? Was it on Google's side? Thanks |
@dinvlad yes, as it appeared, the underlying cause was some server-side issue on Google pubsub. They just said "internal problem" and I've asked for little more details so that we can determine client-side behavior based on that scenario but I think that's all we are going to get: an "internal problem". And I'm not sure how to categorize this, is it that server doesn't respond to a streamingPull request, or it gets stuck and doesn't respond with some other expected message in case of no data, or something else, I just do not know right now. |
@ramzimaalej Was the problem fixed for you? I'm experiencing the same issue with GAE and the Node.JS library |
@drag0s It was an internal issue on Google side. I advise you to contact them. Yes, the problem is fixed now. |
Is anyone experiencing these issues again? They seem to have come back for us |
We briefly switched back to the spring-gcp configuration a few days ago, and the issue started appearing again. Still didn't have time to dig into whether it's something with our config or an underlying Pub/Sub issue, but switching back to directly using |
same here. using 'springCloudVersion', 'Greenwich.SR1' |
We'd appreciate example code that can reliably reproduce the problem to debug further. |
Sorry to hear you're having issues. These are our current versions, which are working fine:
|
@qlodhi-clearlabs I think we found the underlying issue and it has to do with the keep-alive setting. See the PR above. |
Which PR are you speaking about? I experience the same issue in my Spring Boot application. I use springBootVersion = "2.2.0.RELEASE" In my case the application (library code) accepts the message, but inside com.google.cloud.pubsub.v1.MessageDispatcher#processOutstandingMessage line doesn't lead to |
I was referring to #1384 |
still happens in 1.2.5.RELEASE |
@brachipa Could you open a new issue describing your environment, and any relevant log data? It's somewhat unlikely to be the same exact issue; we'd have to re-investigate the cause. |
Yeah, we noticed this again a couple of weeks ago - undetected loss of connection, reconnected after 1.5 hours in one instance. Haven't noticed it since then, maybe it has happened again but we didn't really monitor it to be sure. We're in the process of upgrading libs. |
@qlodhi-clearlabs If the connection reestablished without a restart, then the client library has done the right thing. If you observe messages getting stuck requiring a restart, though, could you comment on #2552 (the new issue @brachipa created)? |
I have a Spring Boot application (v 2.0.3.RELEASE) that uses spring-cloud-pubsub (1.0.0.RELEASE) and it's deployed on GKE. I noticed that after some period of inactivity the listener no longer receives messages from Pub/Sub, and messages get stuck in the queue for a long time (sometimes more than an hour). I tried increasing the parallel-pull-count, but no luck. As soon as I deploy the service the messages get acknowledged, then they start piling up in the queue again. I have researched this problem, and some people think that the connection is dropped by GKE due to inactivity. I have also tried spring-cloud-stream-pubsub but I got the same behavior. Any ideas on how to get around this?