
Unable to find subscription with identifier: {"channel":"GraphqlChannel","channelId":"18bccb1dbd6"} #4702

Closed
Amnesthesia opened this issue Nov 15, 2023 · 14 comments

@Amnesthesia

Describe the bug

This is possibly not a bug in graphql-ruby, and I apologize in advance if that's the case, but I've run out of ideas about what it could be. I'm opening this report in case it's something anybody else has run into before, or a common problem that I just can't find much information on.

Basically, we see a lot of these errors in our AppSignal monitoring, but I can't reproduce the error locally. We don't know whether it happens when we deploy new versions while somebody is using the site (over now-disconnected websockets?), or what has really been going on.

Versions

graphql version: 2.1.1
rails (or other framework): 7.0.8
other applicable versions (graphql-batch, etc):

```
graphql-pro (1.24.11)
    graphql (2.1.1)
    graphql-metrics (5.0.7)
    graphql-rails_logger (1.2.4)
    rubocop-graphql (1.4.0)
graphql (= 2.1.1)
```

Steps to reproduce

I'm sorry, but honestly, no clue. We can't reproduce this locally; it only happens in our production and staging environments (on Amazon Fargate using ElastiCache Redis).

Expected behavior

Not to throw this error, or to find the subscription.

Actual behavior

What specifically went wrong?

Place full backtrace here (if a Ruby exception is involved):

Click to view exception backtrace

```
lib/action_cable/connection/subscriptions.rb:75 find
lib/action_cable/connection/subscriptions.rb:47 remove
lib/action_cable/connection/subscriptions.rb:18 execute_command
lib/action_cable/connection/base.rb:89 dispatch_websocket_message
lib/action_cable/server/worker.rb:59 block in invoke
lib/active_support/callbacks.rb:118 block in run_callbacks
lib/semantic_logger/base.rb:190 block in tagged
lib/semantic_logger/semantic_logger.rb:346 tagged
lib/semantic_logger/base.rb:202 tagged
lib/rails_semantic_logger/extensions/action_cable/tagged_logger_proxy.rb:8 tag
lib/action_cable/server/worker/active_record_connection_management.rb:16 with_database_connections
lib/active_support/callbacks.rb:127 block in run_callbacks
lib/action_cable/engine.rb:71 block (4 levels) in <class:Engine>
lib/active_support/execution_wrapper.rb:92 wrap
lib/action_cable/engine.rb:66 block (3 levels) in <class:Engine>
lib/active_support/callbacks.rb:127 instance_exec
lib/active_support/callbacks.rb:127 block in run_callbacks
lib/active_support/callbacks.rb:138 run_callbacks
lib/action_cable/server/worker.rb:42 work
lib/action_cable/server/worker.rb:58 invoke
lib/action_cable/server/worker.rb:53 block in async_invoke
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:352 run_task
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:343 block (3 levels) in create_worker
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334 loop
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:334 block (2 levels) in create_worker
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333 catch
lib/concurrent-ruby/concurrent/executor/ruby_thread_pool_executor.rb:333 block in create_worker
```
@Amnesthesia (Author)

Amnesthesia commented Nov 15, 2023

Actually, since this is raised in subscriptions.rb when the identifier can't be found in the @subscriptions array, could it be because we run multiple containers, and the user may be subscribed on another container?

@rmosolgo (Owner)

Hi! Sorry for the trouble and thanks for the detailed report. What I expect on re-deploy is for clients to reconnect, re-sending GraphQL subscriptions as needed. A couple of quick thoughts:

  • What caused this error to start occurring? Did you just add GraphQL subscriptions to your app, or did something else change that caused these to appear?
  • Reviewing the backtrace above, I don't see anything GraphQL-related. Is there any indication that this is GraphQL-specific? (Are there other uses of ActionCable that work properly in your application?)
  • You mention that you can't replicate the error locally. If you're using the app in production during a deploy, do you see anything happening in your browser?
  • In AppSignal, can you see any trends in the occurrences of this bug? (Maybe browser? I can imagine browsers handling this event in different ways...)

@Amnesthesia (Author)

@rmosolgo Right, so to address your questions (and what I think is going on here as well):

  1. This error has been occurring for a very long time, but it doesn't seem to cause any issues. We've had a lot of ideas about what it could be, and none of them have turned out to be it. We haven't really prioritized it because, like I said, it seems pretty harmless.
  2. You're right, and I realized after posting this that it may not be graphql-related and could be actioncable-related.
  3 & 4. We're not seeing any browser-type pattern for this; it seems to be a mixed bag.

I think it's that we have this in our GraphQLChannel:

```ruby
def subscribed
  @subscription_ids = []
end
```

This starts each instance with a blank slate of connections, so if we're scaling out and running 3 or 4 instances, they're all pushing to the same Redis. One instance may pick up a subscriptions.trigger(...) from Puma on another instance, but it doesn't have the gid in its @subscription_ids, because that list isn't shared.

Perhaps @subscription_ids should point to Redis instead, to maintain a shared list of subscribed clients?
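To illustrate that hypothesis, here is a minimal plain-JavaScript sketch (no ActionCable or Redis involved; the class and IDs are made up) of how per-instance, in-memory lists diverge when a trigger fans out to every instance:

```javascript
// Each server instance keeps its own in-memory subscription list, so a
// trigger fanned out via a shared backend reaches instances that never
// registered the subscription ID.

class FakeChannelInstance {
  constructor() {
    this.subscriptionIds = [] // per-instance state, like the `subscribed` hook above
  }
  subscribe(id) { this.subscriptionIds.push(id) }
  handles(id) { return this.subscriptionIds.includes(id) }
}

const instanceA = new FakeChannelInstance()
const instanceB = new FakeChannelInstance()

// The client is connected to instance A only:
instanceA.subscribe("18bccb1dbd6")

// A trigger fanned out to every instance:
console.log(instanceA.handles("18bccb1dbd6")) // true
console.log(instanceB.handles("18bccb1dbd6")) // false -- "unable to find subscription"
```

This only models the hypothesis in the comment above; as discussed later in the thread, ActionCable's stream_from mechanism is meant to prevent exactly this situation.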

@rmosolgo (Owner)

My intention is for @subscription_ids to be a per-instance list of subscribers to that instance. It works like this:

  • subscriptions.trigger is called in one Ruby process; it calls .broadcast to notify other processes of the trigger:

    ```ruby
    # An event was triggered; push the data over ActionCable.
    # Subscribers will re-evaluate locally.
    def execute_all(event, object)
      stream = stream_event_name(event)
      message = @serializer.dump(object)
      @action_cable.server.broadcast(stream, message)
    end
    ```

  • Meanwhile, any ActionCable process that received a subscription should have set up a stream_from for that subscriber:

    ```ruby
    channel.stream_from(stream_event_name(initial_event), coder: @action_cable_coder) do |message|
      # ...
    end
    ```

  • So, when a process receives a notification from a trigger, it runs that stream_from block and generates a payload. Then it broadcasts again, using the specific subscription ID, to update the client:

    ```ruby
    @action_cable.server.broadcast(stream_subscription_name(subscription_id), payload)
    ```

I hope that helps!
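The three steps above can be sketched in plain JavaScript, with a made-up `Bus` class standing in for the ActionCable/Redis pub-sub backend (the stream names only mimic the real ones and are illustrative):

```javascript
// A toy pub/sub bus: streamFrom registers a handler for a stream name,
// broadcast delivers a message to every handler on that stream.
class Bus {
  constructor() { this.streams = {} }
  streamFrom(name, handler) {
    if (!this.streams[name]) this.streams[name] = []
    this.streams[name].push(handler)
  }
  broadcast(name, message) {
    (this.streams[name] || []).forEach((h) => h(message))
  }
}

const bus = new Bus()
const delivered = []
const subscriptionId = "sub-123"

// 2. A process that received the subscription streams from the event name:
bus.streamFrom("graphql-event:postAdded", (message) => {
  // 3. Re-evaluate locally, then broadcast to the specific subscription ID:
  bus.broadcast(`graphql-subscription:${subscriptionId}`, { result: `post: ${message}` })
})

// The client-facing stream for this one subscriber:
bus.streamFrom(`graphql-subscription:${subscriptionId}`, (payload) => delivered.push(payload))

// 1. `subscriptions.trigger` in some process broadcasts the event:
bus.broadcast("graphql-event:postAdded", "hello")

console.log(delivered) // [ { result: 'post: hello' } ]
```

The key point this models: only processes that called streamFrom for the event ever re-evaluate it, so a process without the subscription never handles it.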

@Amnesthesia (Author)

@rmosolgo Thank you for taking the time to explain this a bit; I appreciate it! :) We still haven't worked this issue out. We realized we weren't running ActionCable standalone in production, and thought the problem might be solved when we moved ActionCable to its own standalone service, but (I presume) because we also scale that out, this "issue" still persists.

We don't really know to what extent it's an issue, or whether it's an issue at all. For example: are we losing a certain percentage of websocket pushes, or are these errors completely harmless?

> So, when a process receives a notification from a trigger, it runs that stream_from block and generates a payload. Then it broadcasts again, using the specific subscription ID, to update the client

In this sense, should an ActionCable process that does not have the subscription_id in its list of subscribers simply not pick this up? Do you see a way, in a multi-ActionCable setup, that ActionCable processes could routinely try to handle messages meant for subscriber_ids that they don't have?

@Amnesthesia (Author)

Amnesthesia commented Apr 4, 2024

@rmosolgo We've moved away from scaling ActionCable horizontally and have instead scaled it up, increasing the spec of the machine significantly and running only one machine. We are still seeing these errors every day, and we're quite lost as to where they come from.

We've also reduced our identified_by attributes to a single one, identified_by :_strategy, where _strategy is a custom class managing the JWT token and current user.

@rmosolgo (Owner)

rmosolgo commented Aug 6, 2024

I'm sorry I don't have anything better to offer on debugging this... To answer those very old questions:

> In this sense, should an ActionCable process that does not have the subscription_id in its list of subscribers simply not pick this up?

Yes, that sounds right to me... I think a process would only call stream_from during write_subscription, which is to say, processes without subscription IDs wouldn't call stream_from.

> Do you see a way that it would happen in a multi-ActionCable setup that ActionCable processes routinely try to handle messages meant for subscriber_ids that they don't have?

How I expect that to work is:

I notice that it sends more: false to the client there in delete ... maybe that's the problem here?

I pushed a branch which adds a rescue to that code, in case you want to try it: #5053

@rmosolgo (Owner)

Hey, I spent some more time with this and I was able to replicate the error in the test app in this repository. I could replicate this error by clicking "unsubscribe" twice. The first time, it unsubscribed (using ActionCable.Subscription.unsubscribe), but the second time, my client code called that function again and Rails logged the same error described here.

Here's a screencap:

[screencap: unsubscribe-error]

So, I don't think #5053 is related: this error is between the client and ActionCable, without any GraphQL-Ruby code involved.

In the test app, I solved this problem by adding an _unsubscribed flag to my unsubscribe function:

```javascript
unsubscribe() {
  // Don't re-call `unsubscribe` to avoid Rails logging an error
  if (!this._unsubscribed) {
    this._unsubscribed = true
    this.subscription.unsubscribe()
  }
}
```

Would an approach like that work for you?

Also, reviewing this issue, I don't see which ActionCable client you're using. Are you using one of the ones from graphql-ruby-client? If so, I'd be happy to investigate an improvement there to include a fix like the one above.

I'm going to close this issue since it's gone cold, but if you get back to me about the client code, I'll be happy to re-open it until the client is improved.

@steobrien (Contributor)

Hey @rmosolgo, thanks for the writeup and action. We're seeing this issue (constantly) in production. We're using the createRelaySubscriptionHandler and ActionCable, and have a pretty straightforward Relay implementation:

Is there somewhere you could include the idempotent unsubscribe fix above for us, or should we add it somewhere on our side?

@rmosolgo (Owner)

rmosolgo commented Sep 3, 2024

@steobrien, in my case, the error was coming from calling unsubscribe() too many times. Is there a chance that Relay calls .dispose() for you? If so, you should remove the call to it yourself. Taking a quick look at the docs, it doesn't look like Relay expects you to call that yourself, so maybe it'll do that itself. Want to give it a try?

If that doesn't work for you, I can try adding a "once and only once" check here:

```javascript
dispose: function() {
  channel.unsubscribe()
}
```
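For illustration only, such a once-only guard might look like the sketch below (`makeDisposable` is a hypothetical helper, not part of graphql-ruby-client; it mirrors the `_unsubscribed` flag shown earlier in the thread):

```javascript
// Illustrative only: wrap a channel in a Disposable whose dispose()
// forwards to unsubscribe() at most once, so repeated dispose calls
// never send a second "unsubscribe" to Rails.
function makeDisposable(channel) {
  let disposed = false
  return {
    dispose() {
      if (!disposed) {
        disposed = true
        channel.unsubscribe()
      }
    }
  }
}

// Usage with a fake channel that counts unsubscribe calls:
let calls = 0
const disposable = makeDisposable({ unsubscribe() { calls += 1 } })
disposable.dispose()
disposable.dispose() // no-op; only the first call reaches the channel
console.log(calls) // 1
```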

@steobrien (Contributor)

I don't think Relay can do it automatically for us in this context: requestSubscription is an effect, so we need to clean it up manually. I think that's why it returns a Disposable. If we didn't do that, I believe we'd end up with a bunch of duplicate subscriptions that would build up over time.

But having said that, our useEffect implementation should guarantee "once and only once" already. So let me see if I can figure out why it could be attempting to dispose multiple times.

@steobrien (Contributor)

I'm seeing everything behave sensibly in our implementation. Additionally, it looks like Relay's subscription dispose is already idempotent:

  1. https://github.com/facebook/relay/blob/c8e8ee105fb38781c9f29f9e13b846da911e631b/packages/relay-runtime/subscription/requestSubscription.js#L117
  2. https://github.com/facebook/relay/blob/main/packages/relay-runtime/network/RelayObservable.js#L510-L526

I confirmed this by calling dispose() numerous times; only the first sent a message on the websocket, resulting in the expected server log line:

```
GraphqlChannel stopped streaming from graphql-subscription:9c56ecc8-a313-4c65-8e9c-e8eb0d0f10da
```

Subsequent calls appeared to no-op.

So, it seems like something more mysterious is happening here.

@rmosolgo (Owner)

rmosolgo commented Sep 3, 2024

Ok, thanks for giving it a try. Could you open a new issue so we can keep investigating it? It'd be great to have:

  • The full error message and backtrace from Rails
  • Setup code you're using to hook up GraphQL-Ruby's JS to your Relay app
  • A description of what you do in the app to make the error message happen

Then we can dig in further!

@steobrien (Contributor)

Sure thing. To your last bullet point, though, I'd love to figure out a way to reproduce this. Right now we see it frequently in production (1000 times today), but I'm struggling to work out the pattern of events. I'll let you know if I can isolate it.
