NatsJSPublishNoResponseException and no message consumption #349

ValMati · 2024-01-23T14:07:47Z

I have an application, hosted in IIS on virtual machines, that uses NATS JetStream to queue and execute tasks that take quite some time to complete.

When publishing, we check that the publish succeeded:

PubAckResponse ack = await jetStreamContext.PublishAsync(subject, message, headers: natsHeaders, serializer: serializer);
ack.EnsureSuccess();

And the consumption of messages is done by pull:

var keepGoing = true;
Message<T> message = null!;
while (!cancellationToken.IsCancellationRequested && keepGoing)
{
    var next = await consumer.NextAsync<T>(serializer, cancellationToken: cancellationToken);
    if (next is { } msg)
    {
        keepGoing = false;
        message = new Message<T>(msg);
    }
}

return message;

The application works correctly except when for some reason the VM has to be restarted. Sometimes after restarting the VM I have two problems:

- The application stays indefinitely in the NextAsync method even if there are messages to consume.
- Trying to publish a new message throws NatsJSPublishNoResponseException.

Checking the logs of our application, it seems that the connection is successful:

[08:27:43 INF] Try to connect NATS nats://4.231.74.14:4222
...
[08:27:43 INF] Received server info: ServerInfo {... }

The NATS server logs also show that the publish attempt arrives, which confirms that the connection exists.

On the other hand, I have tried to simulate this behaviour by running several instances of the containerised application on our machines, all pointing at the same NATS cluster, and we have not been able to reproduce the problems we experienced on the VMs.

Any idea where the problem might be?

Regards and many thanks

mtmk · 2024-01-23T15:42:21Z

Maintainer

Sounds like a bug in next/fetch consumer. What are your client/server versions?

ValMati · 2024-01-24T07:53:53Z

Author

First of all, thank you very much for the quick responses.

On the client side I have version 2.0.1 and on the server side I currently have 2.10.7, although when I started with this problem I had an older one and have been updating as new versions have been released (sorry for not remembering the exact versions).

mtmk · 2024-01-24T10:23:31Z

Maintainer

Out of curiosity, what are your stream and consumer configurations? Do you ack the messages received?

2 replies

ValMati Jan 24, 2024
Author

This is the consumer configuration:

ConsumerConfig consumerConfig = new("ASYNCREQUESTS_PENDING")
{
    DurableName = "ASYNCREQUESTS_PENDING",
    FilterSubject = "MyService.AsyncRequests.Pending",
    AckPolicy = ConsumerConfigAckPolicy.Explicit,
    AckWait = TimeSpan.FromMilliseconds(60000),
    MaxDeliver = 4,
    DeliverPolicy = ConsumerConfigDeliverPolicy.All
};

And this for the stream:

StreamConfig streamConfig = new(streamConfiguration.Name, new List<string> { streamConfiguration.Subject })
{
    Storage = StreamConfigStorage.File,
    MaxBytes = 1073741824, // 1024 * 1024 * 1024 bytes = 1 GiB
    Retention = StreamConfigRetention.Limits,
    MaxAge = TimeSpan.FromDays(365)
};

About acking incoming messages: AckWait is 10 minutes, which is a lot, but as I said before, processing these messages takes a long time. Since processing can take even longer than 10 minutes, when the processing of each message starts we launch a Task that calls msg.AckProgressAsync() every 30 seconds, so that a message still being processed is not delivered again.
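The keep-alive pattern described above could be sketched roughly as follows. This is only an illustration, not the application's actual code: `AckHeartbeat`, `RunWithHeartbeatAsync` and the `sendProgress` delegate are hypothetical names, and the helper deliberately does not reference the NATS client so that it stays self-contained. The interval needs to stay comfortably below the consumer's AckWait for the in-progress acks to have any effect.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of NATS.Net): runs a long job while
// periodically invoking a "still working" callback.
public static class AckHeartbeat
{
    public static async Task RunWithHeartbeatAsync(
        Func<CancellationToken, Task> work,
        Func<CancellationToken, Task> sendProgress,
        TimeSpan interval,
        CancellationToken cancellationToken = default)
    {
        // Linked source so the heartbeat stops either when the work ends
        // or when the caller cancels.
        using var stopHeartbeat = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        var heartbeat = HeartbeatLoopAsync(sendProgress, interval, stopHeartbeat.Token);
        try
        {
            await work(cancellationToken);
        }
        finally
        {
            stopHeartbeat.Cancel();
            // Surfaces any non-cancellation failure raised by sendProgress.
            await heartbeat;
        }
    }

    private static async Task HeartbeatLoopAsync(
        Func<CancellationToken, Task> sendProgress,
        TimeSpan interval,
        CancellationToken stop)
    {
        try
        {
            while (true)
            {
                await Task.Delay(interval, stop);
                await sendProgress(stop);
            }
        }
        catch (OperationCanceledException) when (stop.IsCancellationRequested)
        {
            // Normal shutdown: the work finished or the caller cancelled.
        }
    }
}
```

With the NATS client, `sendProgress` would presumably be something like `_ => msg.AckProgressAsync().AsTask()` and `interval` would be `TimeSpan.FromSeconds(30)`; that wiring is an assumption and is not shown in the thread.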

mtmk Jan 24, 2024
Maintainer

Thanks, that's great 💯 I can plug those into my tests trying to repro.

mtmk · 2024-01-24T15:18:06Z

Maintainer

I have been running my test with constantly failing/recovering cluster nodes, but I can't reproduce the issue. Btw, I'm testing against the latest version, and one thing I noticed is that in 2.0.2 we introduced the no-responders feature. I also noticed that in your report above you mentioned NatsNoRespondersException, which was introduced in 2.0.2.

So on the publish side NatsNoRespondersException is expected; you should handle it and/or increase the retries (this is all in >=2.0.2, btw):

try
{
    var ack = await js.PublishAsync(subject, data, opts: new NatsJSPubOpts
    {
        RetryAttempts = 10,
        RetryWaitBetweenAttempts = TimeSpan.FromSeconds(1)
    });
    ...
}
catch (NatsNoRespondersException)
{
    // e.g. log warning
    await Task.Delay(3000); // back-off with increasing delay
}
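The "back-off with increasing delay" mentioned in the comment above could be sketched as a small generic helper. This is an illustration under assumptions rather than anything shipped with the client: `RetryWithBackoff`, `DelayFor` and `ExecuteAsync` are hypothetical names, the delay doubles on each attempt up to a cap, and the caller decides which exceptions count as transient.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of NATS.Net): retries an operation with an
// exponentially increasing, capped delay between attempts.
public static class RetryWithBackoff
{
    // Delay before the retry that follows attempt number `attempt` (1-based):
    // baseDelay * 2^(attempt - 1), capped at maxDelay.
    public static TimeSpan DelayFor(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        var factor = Math.Pow(2, attempt - 1);
        var ms = Math.Min(baseDelay.TotalMilliseconds * factor, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }

    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken = default)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation(cancellationToken);
            }
            // On the last attempt, or for non-transient errors, the filter is
            // false and the exception propagates to the caller.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(DelayFor(attempt, baseDelay, maxDelay), cancellationToken);
            }
        }
    }
}
```

For the publish case, `operation` would wrap the `js.PublishAsync(...)` call (converted with `.AsTask()`), and `isTransient` could be `ex => ex is NatsNoRespondersException`; both are assumptions about how the helper would be wired in.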

Now for consumers (i.e. NextAsync()): in my tests, after a failure the consumer recovers within about 30 seconds in the worst case, and I haven't seen it hang forever yet.

So in the meantime, my suggestion is to upgrade to the latest stable version of the client (which is 2.0.3 at the moment).

Edit: I've been running a test (over 24 hours now) against a cluster with nodes failing every 15 seconds. Unfortunately, I can't reproduce the issue of NextAsync() hanging. @ValMati let me know when you get a chance if you made any progress on this.

ValMati · 2024-01-25T12:53:43Z

Author

First of all, I apologise: I wrote the wrong exception (I have already fixed it in the title and my first post). The exception NATSNoRespondersException happens to me from time to time in an old version that uses NATS.Net v1. In the current version, which uses 2.0.1 and is the reason I opened this thread, the consumer "freezes" in NextAsync and PublishAsync throws NatsJSPublishNoResponseException.

I think I didn't make myself clear in my first post: the VMs that are restarted are the ones on which the application is running, not NATS.

Finally, following your advice I'm testing with version 2.0.3 and there doesn't seem to be any problem. I'll continue testing this afternoon to see if I can solve the problem by simply upgrading.

ValMati · 2024-02-01T09:22:48Z

Author

Over the last few days I have been trying to reproduce the error reliably, but I have not found anything conclusive.

The only thing I have seen is that when the application neither consumes nor publishes, this log appears on the NATS server. The third line only appears when there are problems:

However in the logs of the application everything seems correct until I try to publish a message....

3 replies

mtmk Feb 1, 2024
Maintainer

Thank you for testing this. Server logs look normal to me. Leader changes would happen if there were a node failure for example.

There are a few things to note about the client:

- Client doesn't connect until the first operation (e.g. publish) or until connect is called explicitly
- Server may send an INFO message when cluster roles change, for example
- Client consumers would react to cluster changes by either resetting internal state or reissuing requests to consume

So the above logs look normal to me. Did you have an issue publishing / consuming, or did you just want to understand the logs? I might be missing something as well, so please feel free to elaborate.

ValMati Feb 1, 2024
Author

In this case the connection is made when the application is started, a BackgroundService is launched and it checks that the stream exists and then subscribes to a subject of that stream.

Maybe I have not explained myself well. I showed this log because I have "discovered" that the message "Client connection closed: Read Error" seems to be given when my application neither consumes nor can publish messages.

I have also enabled tracing on the server and I have seen another "curious" thing. When messages are published successfully, the last two lines (MSG and HMSG) appear, but when my application is not able to publish, these lines do not appear.

I hope this can provide some insight, because I'm not able to get NATS to work reliably...

mtmk Feb 1, 2024
Maintainer

Ah ok, didn't realise that.

The last message in the log above looks like a fairly standard JetStream message delivered to the client. You should've seen your published message too, so there may be an issue there.

Could you post more complete code that reflects your use case so I can try to reproduce it on my end? Also, could you tell me about your cluster setup please? (I'm assuming a 3-node cluster, but just to be sure.)

As a side note, we're preparing to release 2.1 (which is in preview at the moment). May I ask you to try with the latest preview as well? We have fairly important changes around send buffers.

ValMati · 2024-02-14T12:21:34Z

Author

I have finally fixed the problems I was experiencing. The problem was neither in the NATS NuGet nor in my code.

I have NATS deployed as a 3-node cluster. For some reason one of the nodes was corrupted, and when my application connected to that node it could neither consume nor publish messages.

In the end I solved it by completely deleting the cluster, including the persistent volumes, and recreating it.

I haven't been able to identify when the node got corrupted, but I guess it happened during an update.

In any case, thank you very much @mtmk for your kind and quick replies.

1 reply

mtmk Feb 14, 2024
Maintainer

you're welcome 💯
