vfu_dma_read breaks if client sends message in between #279
Comments
There are currently only two paths that have this issue: vfu_dma_read() and vfu_dma_write(). vfu_irq_message() (what Thanos was referring to above) has been removed for now. |
is this issue still not resolved? |
Still not resolved. |
This issue has bitten me; here are some thoughts. For context, I've been using libvfio-user to hook up PCI device models to hardware PCIe root port implementations, basically by bridging the PCIe transaction layer to VFIO-user. You can think of this as plugging a software-emulated PCIe card into a PCIe hardware design, where the hardware side is augmented with a VFIO-user client that talks to qemu as the vfio-user server. In this scenario, I can't use memory-mapped DMA, because I need to inject DMA transactions back into PCIe hardware as memory read and write transaction layer packets. So I rely on VFIO-user's message-based DMA via VFIO_USER_DMA_READ/WRITE messages. Here are a couple avenues I can think of:
Assessing these options:
Overall, I like (3) best. I'm interested to hear other thoughts / opinions. Assuming we can reach consensus on how to move forward, I'll be willing to implement a solution. |
I do like idea (3) the most. I even think we've entertained this idea of having a separate socket for server -> client messages in the very early days, when we were implementing the protocol specification, but didn't have a solid use case at that point and decided not to bake it into the protocol. I don't think modifying the protocol is an issue; it's still experimental in QEMU (and we're still going to modify the protocol when we adopt migration v2). Let's hear from @jlevon and @jraman567. |
Thanks @tmakatos. If anyone sees issues with approach (3) please let me know - I hope to have some cycles next week to start working on this. |
Cool!
While that's true today, that wouldn't be true if we ever wanted to do other implementations, over TCP/IP to a remote host, for example. So I don't think it would make sense to build something specific to the DMA path. This problem is one of the reasons we didn't want to stabilize the library yet. I agree with the enumeration of the three possible approaches. I don't really like 1). I'd sort of be OK with 3), though I can't speak for the qemu or vfio-user client maintainers. It feels like a bit of a workaround of inadequate libvfio-user design though, to be honest. It would also mean that async consumers of the API still need to block during device->guest DMA transactions, which is an issue in itself. The separate socket of 3) doesn't help at all there. That is, the dma read/write APIs are fundamentally broken anyway.
|
I see what you're saying, but for clarity we should separate the discussion into two concerns:
The libvfio-user changes to go fully async are relatively straightforward, I agree. The challenge is for code that consumes the API, because consuming async APIs causes ripple effects. If you look at qemu for example, it relies on synchronous DMA accesses throughout its hardware models. If we assume that qemu hardware models will stick with a synchronous DMA API, then there are two paths:
Bottom line: I don't really see how a libvfio-user that doesn't provide a synchronous DMA API can be consumed by qemu. This arguably causes some design warts, but a middle ground design that allows both sync/async DMA doesn't seem such a bad way forward, and it also gives API consumers an upgrade path to async. |
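To illustrate the kind of consumer code in question, here's roughly what a synchronous DMA read looks like in a qemu device model today (a minimal sketch: MyDeviceState and the descriptor handling are hypothetical, pci_dma_read() is qemu's existing helper, and exact header locations vary across qemu versions):

```c
#include "qemu/osdep.h"
#include "hw/pci/pci.h"     /* pci_dma_read(); header layout varies by qemu version */

typedef struct MyDeviceState {
    PCIDevice parent_obj;   /* hypothetical device model state */
} MyDeviceState;

static void my_device_fetch_descriptor(MyDeviceState *s, dma_addr_t desc_addr)
{
    uint8_t desc[16];

    /* The device model blocks here until the DMA read completes; with
     * message-based vfio-user DMA, this is exactly where the synchronous
     * round trip to the client happens. */
    if (pci_dma_read(&s->parent_obj, desc_addr, desc, sizeof(desc)) != MEMTX_OK) {
        return;             /* a real device would latch an error instead */
    }

    /* ... parse the descriptor and continue processing synchronously ... */
}
```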
Sure, but one informs the other: right now we can't provide an async API, because the underlying implementation is sync, while it's easier to build a sync API on top of an async implementation.
Yes, I would expect there to be both sync and async APIs, precisely for the sort of reasons you are mentioning. It's possible the sync version could read then re-queue any intermediate messages, perhaps, but there are risks there in the cases of quiesce, migration, unmap, and so on. |
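For illustration, here is a minimal sketch of that read-then-re-queue idea, using a made-up message header and ignoring partial reads (an assumption-laden simplification, not actual libvfio-user code):

```c
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Simplified stand-in for the vfio-user message header (illustration only). */
struct vfio_msg {
    uint16_t msg_id;
    uint16_t is_reply;
    uint32_t payload;
};

struct msg_queue {
    struct vfio_msg *items;
    size_t len, cap;
};

static int queue_push(struct msg_queue *q, const struct vfio_msg *m)
{
    if (q->len == q->cap) {
        size_t cap = q->cap ? q->cap * 2 : 8;
        struct vfio_msg *p = realloc(q->items, cap * sizeof(*p));
        if (p == NULL) {
            return -1;
        }
        q->items = p;
        q->cap = cap;
    }
    q->items[q->len++] = *m;
    return 0;
}

/* Block until the reply to `want_id` arrives; park any intermediate
 * client-initiated commands so the main loop can process them later. */
static int wait_for_reply(int sock, uint16_t want_id,
                          struct msg_queue *deferred, struct vfio_msg *reply)
{
    for (;;) {
        struct vfio_msg m;
        if (read(sock, &m, sizeof(m)) != (ssize_t)sizeof(m)) {
            return -1;      /* error or short read; partial reads ignored here */
        }
        if (m.is_reply && m.msg_id == want_id) {
            *reply = m;     /* this is the reply we were blocking on */
            return 0;
        }
        if (queue_push(deferred, &m) < 0) {
            return -1;
        }
    }
}
```

The drawbacks of this pattern (unbounded queuing, interactions with quiesce, migration and unmap) are exactly what the surrounding comments point out.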
Plus, it's pretty heavy-weight, since at least in theory it needs to queue up an unbounded number of messages of arbitrary sizes. It works though; it's approach (1) in my post above, and I have a hack that I'm currently running with. It seems like we're all in agreement that we should have async DMA (in addition to sync) at the libvfio-user API level. Then the open question is whether we do the queuing approach, or we introduce separate socket(s) for the server->client direction. Trying to summarize pros/cons (please call out anything I have missed): queuing:
separate socket(s):
As is probably obvious by now, I'm leaning towards separate socket(s), mostly because I think it's simpler and makes vfio-user easier to implement, but I can totally relate to the single-socket solution feeling conceptually cleaner. In the end, I don't really feel strongly though, just hoping to reach closure so we can move forward. |
I'm not clear on how a separate socket helps at all with providing a sync API; since the reply arrives on the original socket still, you still need to somehow process other incoming messages. |
Ah, so we've been talking past each other; I'm not intending to send the DMA reply on the main socket. Let me recap the separate socket idea to clarify:
This enables the DMA operation to be handled "out of band" of the main socket. Thus, a synchronous write-DMA-command-then-read-back-reply sequence on the sockets is unproblematic (since commands on the main socket are only ever sent by the client, and commands on the DMA socket are only ever sent by the server, the condition that is the root cause of the bug never occurs). Furthermore, the server can choose to synchronously wait for the DMA operation to complete and ignore all other I/O on the main socket during that time (this is what I was referring to when I said this is simpler to implement). |
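To make the "out of band" handling concrete, here's a minimal sketch of a synchronous DMA read over a dedicated DMA socket, with a made-up command framing rather than the real vfio-user wire format:

```c
#include <stdint.h>
#include <stddef.h>
#include <unistd.h>

/* Made-up command framing for illustration only. */
struct dma_cmd {
    uint16_t op;            /* hypothetical opcode: 1 = DMA read */
    uint64_t addr;
    uint64_t len;
};

static int dma_read_sync(int dma_sock, uint64_t addr, void *buf, size_t len)
{
    struct dma_cmd cmd = { .op = 1, .addr = addr, .len = len };

    /* Only the server ever sends commands on dma_sock ... */
    if (write(dma_sock, &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd)) {
        return -1;
    }

    /* ... so whatever arrives next on dma_sock must be the reply; there is
     * no way to collide with a client-initiated command here. Client
     * commands keep flowing on the main socket, untouched by this call. */
    ssize_t n = read(dma_sock, buf, len);
    return n == (ssize_t)len ? 0 : -1;
}
```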
I think I read "messages" here instead of "commands", hence the confusion. It's bi-directional, in fact. So, yeah, that sounds reasonable. |
oops |
@mnissler-rivos I asked Jag about this on slack, if you'd like to join |
I've just started reading the rest of the discussion, but the following just occurred to me:
What's the point of the vfio-user server requiring a response from the client for a DMA command? If the client doesn't like the command it received, the server doesn't need to know about that specific fact; the client might deal with it any way it sees fit (e.g. reset the device). So if we remove this behavior there's no problem in the first place, right? |
DMA read requires a response with the data |
For a DMA read, it wants the data from the client. There is no conceptual reason why that couldn't be delivered asynchronously though. In the case of qemu, the device models often have code that performs DMA reads to get the next network packet and then processes it, for example, and that code is mostly synchronous right now. So, IMHO we should definitely provide an async API, but keep a synchronous version to avoid breaking more simplistic server designs. |
D'oh of course.
What if the client never responds to the DMA read? What if it decides to reset the device, so the server should actually be paying attention to the main socket? I guess the client will have to start making assumptions that the server won't be paying attention to the main socket so it first has to fail the DMA read? |
Yep - we'll need to clearly define exact behaviour in these sorts of cases |
I guess it boils down to what real HW would do: if it was in the process of doing DMA would it be handling MMIO in the first place? |
I don't think there's a universal answer on how hardware behaves. At the PCIe TLP level, all packets are tagged in order to match up completions, so there aren't any inherent restrictions. I'm also not aware of any requirements to that effect in the PCIe spec, but I'll ask a coworker who knows it much better than me to make sure. That said, the spec does include a completion timeout mechanism (section 2.8) to deal with situations where an expected reply never arrives. From a VFIO software perspective, robust clients and servers should ideally not make any assumptions about communication patterns, in my opinion. Having implemented a fully asynchronous client now, I am painfully aware though that this can be challenging in practice. Plus, you can't just mandate full async and ignore the practical constraints in existing software such as qemu. Note that separating the sockets offers increased flexibility in implementing a correct client: it can spawn a thread to handle DMA separately from the main socket, and unless there is problematic cross-thread synchronization, everything will be fully async safe even without switching to an event-based design. |
This change introduces a new mode for servers to access DMA regions in the client. It reuses the existing VFIO_USER_DMA_{READ,WRITE} message type and format, but messages are not passed across the main communication socket. Instead, communication happens via a separate socket that gets supplied by the client when invoking VFIO_USER_DMA_MAP. The benefit of the new mode is to avoid situations where both client and server send messages at the same time, then expect the response as the next message that is received. When socket-based DMA access mode is used, commands on sockets are only ever sent by one side, side-stepping the command collision problem. For more details and discussion, please see nutanix#279.
not sure I have an opinion on that |
Here's a fresh snapshot of what I have right now: mnissler-rivos/libvfio-user@master...mnissler-rivos:libvfio-user:socket_dma Includes test fixes and a pytest for vfu_sgl_{read,write}. Note that this is still passing the socket fd in DMA_MAP, since I had this almost ready when we started discussing this angle yesterday. I will investigate a solution that sets up a context-scoped server -> client socket next. Happy to hear any input you may have, but no pressure, I still got work to do :-) |
@mnissler-rivos I think it's mature enough to become a PR (so we don't have to re-review it, I've already reviewed part of it); passing the socket fd during negotiation can be fixed up by additional commits. |
@tmakatos I hear you - but please give me a day or so to investigate the context-scoped reverse socket approach. There's a chance that with the alternative implementation I can hide all the details of the reverse socket in the transport layer. If that succeeds, most of the changes I have on the branch won't be necessary. |
@mnissler-rivos Setting up the reverse communication channel during negotiation sounds good to me too. Thank you! |
This change adds support for a separate socket to carry commands in the server-to-client direction. It has proven problematic to send commands in both directions over a single socket, since matching replies to commands can become non-trivial when both sides send commands at the same time and adds significant complexity. See issue nutanix#279 for details. To set up the reverse communication channel, the client indicates support for it via a new capability flag in the version message. The server will then create a fresh pair of sockets and pass one end to the client in its version reply. When the server wishes to send commands to the client at a later point, it now uses its end of the new socket pair rather than the main socket.
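For readers unfamiliar with the mechanism the commit message describes, here is a minimal sketch of a server creating a socket pair and handing one end to the client over the main socket via SCM_RIGHTS (simplified; the actual libvfio-user code wraps this in its transport layer and ties it to the version reply):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

static int send_reverse_socket(int main_sock, int *server_end_out)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }

    char payload = 0;       /* stand-in for the version reply payload */
    struct iovec iov = { .iov_base = &payload, .iov_len = sizeof(payload) };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u = { 0 };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &sv[1], sizeof(int));   /* the client receives sv[1] */

    if (sendmsg(main_sock, &msg, 0) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }

    close(sv[1]);            /* our copy of the client end is no longer needed */
    *server_end_out = sv[0]; /* the server sends its commands on sv[0] */
    return 0;
}
```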
Here's a proof of concept for setting up a separate server->client command socket at negotiation time: mnissler-rivos/libvfio-user@master...mnissler-rivos:libvfio-user:cmd_socket It ended up cleaner than I had feared, and it's future-proof in the sense that it transparently moves all server->client commands over to the separate socket. I haven't finished a suitable pytest yet, but I have tested against my client and confirmed that it works. It'd be great if you could weigh in on whether this approach is indeed what you want to go with - I'll be happy to turn this into a proper PR then. |
That is indeed a lot less than I expected! I have some comments but looks broadly fine, please open 2 PRs for the separate changes, and don't forget to update the vfio-user spec too. Thanks! |
Sorry for the delay, I was on vacation. I've just created a pull request that sets up a separate socket pair at negotiation time, complete with tests and specification updates. @jlevon you asked for 2 PRs above - not sure whether you'd like me to split this one or see a separate PR for the alternative approach of setting up a socket FD for each DMA mapping? |
@mnissler-rivos close_safely() is unrelated to these changes so should be a separate PR (with the other on top I guess). |
Ah, makes sense. Here is a separate pull request for close_safely(). |
Use separate socket for server->client commands This change adds support for a separate socket to carry commands in the server-to-client direction. It has proven problematic to send commands in both directions over a single socket, since matching replies to commands can become non-trivial when both sides send commands at the same time and adds significant complexity. See issue #279 for details. To set up the reverse communication channel, the client indicates support for it via a new capability flag in the version message. The server will then create a fresh pair of sockets and pass one end to the client in its version reply. When the server wishes to send commands to the client at a later point, it now uses its end of the new socket pair rather than the main socket. Corresponding replies are also passed back over the new socket pair. Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
fixed via #762 |
Wire up support for DMA for the case where the vfio-user client does not provide mmap()-able file descriptors, but DMA requests must be performed via the VFIO-user protocol. This installs an indirect memory region, which already works for pci_dma_{read,write}, and pci_dma_map works thanks to the existing DMA bounce buffering support. Note that while simple scenarios work with this patch, there's a known race condition in libvfio-user that will mess up the communication channel. See nutanix/libvfio-user#279 for details as well as a proposed fix. Signed-off-by: Mattias Nissler <mnissler@rivosinc.com>
There are cases where a libvfio-user public function (e.g. `vfu_dma_read`) sends a message to the client and then synchronously waits for the response. A response from the client cannot be avoided in the DMA read case, as it contains the read data. The problem is that right after having sent the message to the client, the client might send its own message, which will be received by the blocking call of the server. However, the server is expecting a completely different message.

`vfu_dma_read` is one of the functions that has this problem; effectively every public function that uses `vfu_msg` has the same problem, except the version negotiation ones. `vfu_irq_trigger` has the same problem, since the server might send a trigger IRQ message and immediately receive a new command from the client while it's expecting to receive the response for the IRQ message it just sent.

A similar situation can occur when the client is waiting for a response from the server and the server happens to send it a `VFIO_USER_VM_INTERRUPT`. This is a valid use case though; the client should be expecting such messages.
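To make the failure mode concrete, here is a small self-contained demonstration of the collision over a plain socketpair, using a made-up message format rather than the real vfio-user headers:

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Stand-in for the vfio-user message header, for illustration only. */
struct msg { char kind[16]; };

int main(void)
{
    int fds[2];             /* fds[0] = "server" end, fds[1] = "client" end */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return 1;
    }

    /* The client happens to send its own command (e.g. a region access)
     * just before it would service the server's DMA read. */
    struct msg client_cmd = { .kind = "CLIENT_CMD" };
    write(fds[1], &client_cmd, sizeof(client_cmd));

    /* Server side of a synchronous DMA read: send the command ... */
    struct msg dma_read = { .kind = "DMA_READ" };
    write(fds[0], &dma_read, sizeof(dma_read));

    /* ... then block and treat the next incoming message as the reply. */
    struct msg reply;
    read(fds[0], &reply, sizeof(reply));
    printf("server expected a DMA_READ reply, got: %s\n", reply.kind);
    return 0;
}
```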