Send buffer changes #346
Conversation
Left some first-pass comments/questions, will try to review further and pull down for a look.
return ConnectStateMachineAsync(true, connectOpts, cancellationToken);
}

Interlocked.Add(ref _counter.PendingMessages, 1);
Question: Is there a reason to use this over Interlocked.Increment?
copy/paste error. (Was there an argument about Add() being faster?🤷♂️)
Fixed.
(Was there an argument about Add() being faster? 🤷♂️)

It might be?
As always, the devil is in the details. I can't comment for ARM, but for x86/x64 I did start to look at Agner Fog's instruction tables. However:
- It has been a minute since I have peeked at these tables.
- They do not list LOCK INC.
- It's entirely possible that the JIT decides to pick the best option regardless...
However, digging did lead me to this, which claims that INC doesn't update the carry flag, unlike ADD. Whether that still makes a difference on modern architectures, I can't say...
This almost certainly qualifies as a micro-optimization either way; .Increment()/.Decrement() feel more readable, but I'm happy with either option.
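(For reference, the two forms under discussion have the same effect for a ±1 delta; a tiny stand-alone sketch, using a local variable rather than the PR's counter field:)

using System.Threading;

long pending = 0;
Interlocked.Increment(ref pending);    // pending == 1
Interlocked.Add(ref pending, 1);       // pending == 2
Interlocked.Decrement(ref pending);    // pending == 1
Interlocked.Add(ref pending, -1);      // pending == 0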
}
finally
{
    _semLock.Release();
    Interlocked.Add(ref _counter.PendingMessages, -1);
Per above, also why not Interlocked.Decrement?
PipeWriter bw;
lock (_lock)
{
    EnqueueCommand(success);
    if (_pipeWriter == null)
        ThrowOnDisconnected();
    bw = _pipeWriter!;
}

_protocolWriter.WriteConnect(bw, connectOpts!);
await bw.FlushAsync(cancellationToken).ConfigureAwait(false);
This feels curious enough that a comment may be helpful?
Main question on first glance: what does the lock + capture buy us over the Semaphore usage?
_pipeWriter is replaced on reconnect in Reset(). I wanted to avoid any possible stale reads. AFAIK, in this case semaphore or channel 'async locks' wouldn't provide any memory consistency guarantees.
public ValueTask PublishAsync<T>(string subject, T? value, NatsHeaders? headers, string? replyTo, INatsSerialize<T> serializer, CancellationToken cancellationToken)
{
#pragma warning disable CA2016
#pragma warning disable VSTHRD103
    if (!_semLock.Wait(0))
#pragma warning restore VSTHRD103
#pragma warning restore CA2016
    NatsPooledBufferWriter<byte>? headersBuffer = null;
    if (headers != null)
    {
        return PublishStateMachineAsync(false, subject, value, headers, replyTo, serializer, cancellationToken);
        if (!_pool.TryRent(out headersBuffer))
            headersBuffer = new NatsPooledBufferWriter<byte>();
        _headerWriter.Write(headersBuffer, headers);
    }

    if (_flushTask is { IsCompletedSuccessfully: false })
    {
        return PublishStateMachineAsync(true, subject, value, headers, replyTo, serializer, cancellationToken);
    }
    NatsPooledBufferWriter<byte> payloadBuffer;
    if (!_pool.TryRent(out payloadBuffer!))
        payloadBuffer = new NatsPooledBufferWriter<byte>();
    if (value != null)
        serializer.Serialize(payloadBuffer, value);

    return PublishLockedAsync(subject, replyTo, payloadBuffer, headersBuffer, cancellationToken);
}
Interestingly (not sure if it was a factor in this change), it looks like this would more or less minimize the concerns I brought up in #318.
lock (_lock)
{
    EnqueueCommand(success);
    if (_pipeWriter == null)
        ThrowOnDisconnected();
    bw = _pipeWriter!;
}
Given the re-use of this pattern, is there any potential value in us trying to consolidate the shared logic of the lock/get (and possibly more) into a method call, to lower maintenance concerns? (Maybe not at this stage, but thought it was worth asking.)
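(For illustration only, a minimal sketch of what such a consolidated helper might look like; the name EnqueueAndGetWriter is made up, and it simply wraps the quoted lock/capture pattern:)

private PipeWriter EnqueueAndGetWriter(bool success)
{
    lock (_lock)
    {
        // Enqueue the command and capture the current PipeWriter under the lock,
        // throwing if the connection has been torn down.
        EnqueueCommand(success);
        if (_pipeWriter == null)
            ThrowOnDisconnected();
        return _pipeWriter!;
    }
}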
private async ValueTask PublishLockedAsync(string subject, string? replyTo, NatsPooledBufferWriter<byte> payloadBuffer, NatsPooledBufferWriter<byte>? headersBuffer, CancellationToken cancellationToken)
{
    // Interlocked.Add(ref _counter.PendingMessages, 1);
    await _semLock.WaitAsync(cancellationToken).ConfigureAwait(false);
Speaking of, putting the WaitAsync here is probably 'better' overall (the old Wait(0) -> state machine pattern has some thrashing concerns), but it does raise some interesting questions in my head again as to the driver.
(Below is a language-lawyer disclaimer and... well, hopefully we aren't worried about it, but if we are, I wanted to bring it up.)
This is going to be more fair than the old method (because it's always WaitAsync), but it is worth noting that SemaphoreSlim, like most locking primitives in .NET, does not guarantee fairness as to which thread gets released first (citation).
If we -do- need stronger ordering/fairness on these, we will need to consider an alternative... and also remember the caveats mentioned in the citation; they are certainly not 'absolutes', however they may require consideration in a solution (again, only if needed; this can be a rabbit hole).
Just swapped this out with a channel. WDYT? (@stebet also raised concerns about SemaphoreSlim before.)
// 8520 should fit into 6 packets on 1500 MTU TLS connection or 1 packet on 9000 MTU TLS connection
// assuming 40 bytes TCP overhead + 40 bytes TLS overhead per packet
const int maxSendMemoryLength = 8520;
var sendMemory = new Memory<byte>(new byte[maxSendMemoryLength]);
Question: Does this get re-started between connections? And if yes, is there a way to possibly reuse it?
It lives as long as the CommandWriter object (and in turn the NatsConnection object). On reconnect the pipeline is renewed (not pooled at the moment), but the read loop stays in place.
// The ArrayPool<T> class has a maximum threshold of 1024 * 1024 for the maximum length of
// pooled arrays, and once this is exceeded it will just allocate a new array every time
// of exactly the requested size. In that case, we manually round up the requested size to
// the nearest power of two, to ensure that repeated consecutive writes when the array in
// use is bigger than that threshold don't end up causing a resize every single time.
if (minimumSize > 1024 * 1024)
<3 the exposition here! Great breakdown of the why.
Not me, unfortunately 😅. It was originally taken from the Community Toolkit (as mentioned in the NatsBufferWriter.cs class). I'll make sure to put an attribution comment in later.
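(Side note for readers: a minimal sketch of the round-up the quoted comment describes, assuming System.Numerics.BitOperations is available; the actual code in the PR may differ.)

using System.Numerics;

static int RoundUpRequestSize(int minimumSize)
{
    const int PoolThreshold = 1024 * 1024;
    // Above the ArrayPool<T> threshold, round up to the next power of two so
    // repeated growth requests land on the same size instead of resizing each time.
    return minimumSize > PoolThreshold
        ? checked((int)BitOperations.RoundUpToPowerOf2((uint)minimumSize))
        : minimumSize;
}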
# Conflicts:
#   src/NATS.Client.Core/Commands/ProtocolWriter.cs
#   src/NATS.Client.Core/Internal/NatsPipeliningWriteProtocolProcessor.cs
#   tests/NATS.Client.TestUtilities/InMemoryTestLoggerFactory.cs
Left food for thought, primarily around the channel changes.
In summary:
As we are using the channel as a semaphore this way (i.e. using a full channel to signal waiting, as opposed to an empty channel to signal waiting), it is better to use WriteAsync throughout if we want better ordering fairness and to minimize thrashing on concurrent writes.
If we -do- need to 'prioritize' certain locks, we may want to consider how to do so (the same problem exists with a semaphore).
while (!_channelLock.Writer.TryWrite(1))
{
    await _channelLock.Writer.WaitToWriteAsync().ConfigureAwait(false);
}
Hmm. This inversion of channel usage does cause some challenges.
I think, -especially- in the dispose,

try
{
    await _channelLock.Writer.WriteAsync(1).ConfigureAwait(false);
}
catch (ChannelClosedException)
{
}

would normally be a bit less 'thrashy' in the case of a failed write.
It may be worth considering this in other cases as well where possible; however, I need to take a look at the overall usage to provide an informed statement.
}
finally
{
    _semLock.Release();
    while (!_channelLock.Reader.TryRead(out _))
If we aren't doing so elsewhere, we should probably .TryComplete() the channel here, so that if any other writers/readers happen to be waiting, they will return false from WaitToWrite/WaitToRead or throw on WriteAsync/ReadAsync.
Note: If we use TryComplete, we should make sure anything using WaitToXXXAsync bails out in an appropriate way when it returns false.
so would you say using ReadAsync() would be better?
🤔
- Need to think more about the .Dispose(), but probably yes.
- I think in general, ReadAsync should be used (unless a strong argument exists for thrashing on non-critical ops).
- I'll make a sample helper for the pattern I'm thinking of (or pull the branch and provide a commit link, or a draft PR against your branch, if that's OK and I get the opportunity to).
Where using a BoundedChannel vs a SemaphoreSlim can be challenging is in deciding, for lack of better words, 'the right pivot' on how to use the channel. Depending on the need, and on how the underlying synchronization abstraction is implemented, that choice can have a sizeable impact on performance.
Unless there is a compelling reason to do otherwise, I think I'm settled on .ReadAsync being the better idea for what we are doing here.
If we need more, we can go deeper; BoundedChannel is lower-alloc than SemaphoreSlim in many cases, though there may be future opportunities for a better pooled abstraction[0].
[0] - In general, if a channel has a waiting value on read, it's no-alloc. If it's the FIRST waiter and there's no cancellation token, it gets a pooled instance; if there's a token or the pooled waiter is already in use, it allocates. A little better than SemaphoreSlim, and in some ways it has more reliable semantics, but it can be tricky at first.
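(To make the pattern under discussion concrete, here is a rough, hypothetical sketch — not the PR's code, and the type/member names are invented — of a capacity-1 bounded channel used as an async lock: acquire with WriteAsync, release with TryRead, and TryComplete on dispose.)

using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

internal sealed class ChannelLock : IDisposable
{
    // Capacity 1: a full channel means "locked".
    private readonly Channel<int> _channel = Channel.CreateBounded<int>(1);

    public async ValueTask AcquireAsync(CancellationToken cancellationToken = default)
    {
        // WriteAsync queues contended callers roughly FIFO instead of letting them
        // thrash on TryWrite/WaitToWriteAsync loops; it throws ChannelClosedException
        // once the channel has been completed.
        await _channel.Writer.WriteAsync(1, cancellationToken).ConfigureAwait(false);
    }

    // Release by draining the single token.
    public void Release() => _channel.Reader.TryRead(out _);

    // Completing the channel wakes pending writers with ChannelClosedException.
    public void Dispose() => _channel.Writer.TryComplete();
}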
{
    Interlocked.Increment(ref _counter.PendingMessages);

    while (!_channelLock.Writer.TryWrite(1))
OK, yeah, I think this pattern (the .TryWrite loop), even for sub/publish, raises some fairness/behavioral consistency concerns.
In general, WriteAsync is going to provide a slightly 'more fair', ordered queue for callers than the .TryWrite() loop, because for a bounded channel, once an item is read, things happen in this order:
- If there are any pending WriteAsync calls in the queue, and any of them are not cancelled, the first one is marked as completed and its item is added to the queue (this is still inside the main read lock, so it's all thread safe) and the method returns.
- If the previous step didn't return, run through the entire set of waiters and wake them all up.
So, in general, WriteAsync will be a lot less thrashy (a queue vs a free-for-all when contention happens), and it's worth noting that on the happy path (no contention) it won't alloc.
But that order of operations is important to consider; for instance, in extreme cases one could use WriteAsync for 'priority' operations and take the thrashing pain on non-priority ops (or some sort of magic).
TBH I was aiming at high throughput on a tight loop, but that's just my test and I am actually not seeing much change between TryWrite/loop vs. WriteAsync. Happy to go with WriteAsync here.
TBH I was aiming at high throughput on a tight loop, but that's just my test and I am actually not seeing much change between TryWrite/loop vs. WriteAsync. Happy to go with WriteAsync here.
How hard would it be to add some concurrent tests where we see what happens when we have a lot of threads trying to do the needful?
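(As a rough illustration of the kind of test being suggested — not from the PR; it assumes a locally running NATS server, and the subject name and message counts are arbitrary:)

using System.Linq;
using System.Threading.Tasks;
using NATS.Client.Core;
using Xunit;

public class ParallelPublishTests
{
    [Fact]
    public async Task Many_tasks_publishing_concurrently_should_not_throw()
    {
        await using var nats = new NatsConnection();

        const int writers = 16;
        const int messagesPerWriter = 1_000;

        // Hammer PublishAsync from many tasks at once to exercise the
        // CommandWriter locking/buffering paths under contention.
        var tasks = Enumerable.Range(0, writers).Select(_ => Task.Run(async () =>
        {
            for (var i = 0; i < messagesPerWriter; i++)
            {
                await nats.PublishAsync("test.parallel.publish", i);
            }
        }));

        await Task.WhenAll(tasks);
    }
}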
}
internal sealed class NatsPooledBufferWriter<T> : IBufferWriter<T>, IObjectPoolNode<NatsPooledBufferWriter<T>>
{
    private const int DefaultInitialBufferSize = 256;
It would be nice in the future if there was a way to set this (i.e. cases where people know they have tiny messages in low-memory environments, or cases where people know they have 1K payloads as the norm).
OTOH, I don't think it's a big deal for the latter case (which is the only one I know of being 'real').
Can we derive a value from the write buffer size, for example? I'm thinking it might not be relevant if the implementation changes in the future.
I think a derived value would be fine if the math can work out.
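(A made-up example of what such a derivation could look like; the divisor and clamp bounds here are arbitrary assumptions, not anything from the PR:)

// Hypothetical: derive the initial pooled-buffer size from the configured
// write buffer instead of a hard-coded 256, with an arbitrary floor/ceiling.
static int DeriveInitialBufferSize(int writerBufferSize) =>
    Math.Clamp(writerBufferSize / 256, 64, 4096);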
namespace NATS.Client.Core.Commands;

internal struct PingCommand
internal class PingCommand : IValueTaskSource<TimeSpan>, IObjectPoolNode<PingCommand>
Nice, this will close #321! <3
Should we pull this into a separate PR? It's kind of irrelevant here, and this PR might linger a bit (if it ever goes in, that is).
A separate PR wouldn't hurt if it's not painful to do in the existing codebase; if we have benchmarks to be sure the alloc savings don't cost us, even better?
Ping command moved here: #358
Taking a look at this now
This might also be something to take a look at to make things simpler for locking: https://dotnet.github.io/dotNext/features/threading/exclusive.html
That library is now net8.0 only. I tried pulling the code in, but there is way too much dependency; I gave up at 6 KLOC.
I added a new benchmark to test Parallel Publish in #367 and here are the results:
[Benchmark results: This PR]
Looking great!
Signed-off-by: Caleb Lloyd <caleb@synadia.com>
LGTM 🎉
[Benchmark charts: This PR vs Main]