Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Efficient memory passing between WASM and host #314

Open
loganek opened this issue Feb 28, 2024 · 16 comments
Open

Efficient memory passing between WASM and host #314

loganek opened this issue Feb 28, 2024 · 16 comments

Comments

@loganek
Copy link

loganek commented Feb 28, 2024

Hi all,

I do apologize in advance if this topic was already discussed and documented somewhere - if that's the case, I'd be grateful for any pointers to those discussions.

I've been recently looking into some of the ways we can produce preview1-compatible WASM modules and at the same time not affecting the overall development of WASI. In our case, we run WASM runtime on customer devices, and the runtime in some cases can't be updated - so we're stuck with preview1 in the host for at least 4-5 years (and likely longer).

I've explored a few different options to stay up-to-date with the tooling (mainly, wasi-libc) and still remain compatible with the old runtimes we have in production:

  1. I considered having a bunch of ifdefs in the code; this is pretty much what @abrown proposed here: Document a plan for transitioning to preview 2 and beyond wasi-libc#476 and I think that's a good short-term plan (as mentioned in the doc, it's just for "transition"), I don't think that's going to be maintainable given that we need to keep preview1 around for quite some time
  2. Fork wasi-libc - that's another option we consider, but we'd rather avoid that given that backporting bugfixes and improvements from the upstream would become more and more difficult over time
  3. Implement wasi preview2 -> wasi preview 1 adapter

Whereas all three options are still on the table, I think the 3rd one is the least intrusive one, requires very little maintenance overhead (I think it's almost an one-time effort) and enables us to completely abandon preview1 references in the tooling.

I've started working on a prototype of the adapter. So far I've prototyped (it's very buggy and limited, don't use it yet :) ) clock and sockets APIs (using WAMR-specific preview1-like interfaces). I was able to run a simple tcp client/server application using @dicej 's wasi-libc branch. When implementing the streams API I realized that interfaces that return a list, e.g.

    read: func(
        len: u64
    ) -> list<u8>;

don't allow users to pass a pre-allocated buffer (e.g. on the stack); instead, host is expected to call WASM's malloc to allocate a memory for the return buffer. This is a problem for many embedded applications where dynamic allocations are not recommended or even not allowed.
In addition to that, for this to work with libc, the buffer allocated by the host must be then copied to a buffer provided as a parameter for the libc function, so we have something like (very high-level flow, missing lots of details but hopefully clearly explains the problem):

1. [wasm] user allocates a buffer (e.g. on the stack, char buf[1024])
2. [wasm] user calls wants to read data: recv(buf, 1024, ...other parameters)
3. [wasm] recv implementation calls host's read function
4. [host] read implementation calls WASM's malloc/realloc to allocate buffer for return value
5. [host] copies data into the newly allocated buffer
6. [host] returns a pointer back to WASM
7. [wasm] recv implementation copies data from the pointer returned from host into the user-provided buffer
8. [wasm] recv function frees the memory allocated by the host

So we not only spend some cycles to go from host back to WASM to call allocation function (which depending on the implementation might be slow) but also we only allocate a memory temporarily to copy the value back to a buffer already allocated by the user. This is how it works in wasi-libc and I'm not sure if other languages have similar problem - even if not, C and other languages relying on wasi-libc like C++ or Rust are probably popular enough to not neglect this problem.

I understand why list (and perhaps other data types) used as return values require allocating memory - WASM code doesn't know how big the return value is, so it's reasonable to let runtime do the allocation. However, there are some cases (like read function) where user already provides the maximum requested size, and in those cases they might want to also provide a buffer that's been already allocated.

I wonder if the problem of efficient data passing between host and WASM was discussed, and if so, what's the recommendation? I'm not sure whether this is a problem with a component model per se, or is it more a problem with a design of specific interfaces (or both)? For example, would it be possible to have a read function to be something like:

read: func(out_data: list<u8>) -> error

? So the host knows that the list already points to a buffer of a specific length, and it should fill it with the data? This is just one of the ideas, but I'm curious what others think.

Please note it's not just a problem for the adapter from preview2 to preview1, and it's not only about any specific interface (I can imagine lots of different proposals follow the same pattern). If the problem is not being addressed, I think it might be a blocker for some of the embedded usecases to onboard to preview2+ (those projects would either stick to preview1 in some form, or not use WASM at all).

@lukewagner
Copy link
Member

Being able to use caller-supplied memory buffers for dynamically-sized return values of import calls has been discussed and also filed in #175. Having thought about it a bit more in the interim, it does seem quite doable (even addressing the tricky cases mentioned in #175) to add as a new canonopt. One reason it hasn't been prioritized yet is that there is a cute little hack you can do that mostly achieves the same effect (and, iiuc, this is what is currently done by the preview1-to-2 adapter):

  1. before calling the import, store the caller-supplied buffer pointer and its size in two special globals
  2. use a special realloc function that returns the special global memory if the requested size is in bounds

If you're writing a preview2-to-1 adapter by hand (and thus you know the semantics of the call and thus that, in the function signature read: func(length: u64, offset: u64) -> result<tuple<list<u8>, bool>, error-code>, that length will be the length of the returned list<u8>, then you could:

  1. eagerly call cabi_realloc(0, 0, 1, length) to get a right-sized buffer
  2. call the preview 1 function, passing the realloc'd buffer as the caller-supplied buffer
  3. return the buffer and length as the preview2-Canonical ABI return value

and if the preview2 caller does the above cute-little-hack, then I think the whole path ends up forwarding the original caller-supplied buffer all the way to the preview1 call. There is the additional overhead of the realloc call, but perhaps it could be inlined.

@loganek
Copy link
Author

loganek commented Feb 29, 2024

Thanks @lukewagner , I'm glad to see that this scenario was taken into account already and there's a clean solution for that. I've created a simple example just to make sure I understand what you propose (there's lots of corner cases not being handled, it's just for demonstration purposes) - let me know if I got this right:

  1. I have very simple interface:
interface example {
    read: func(len: u64) -> list<u8>;
}
  1. I implemented a simple allocator based on your description (that uses two global variables to track the user-defined buffer and its size). The user_allocator_alloc function returns the buffer or calls malloc in case the buffer is not big enough (that shouldn't happen though): https://github.com/loganek/wasm-wsp2-efficient-memory/blob/main/user_allocator.c
  2. I have a read_data function that mimics read-like libc function. The function initializes the allocator and calls the read wit function. In case read function returns the user-provided buffer, I don't free the list, and I don't copy the data.
  3. In my adapter I call user_allocator_alloc to get the buffer and use it to store the data.

The implementation is slightly different to what you proposed as I don't override realloc implementation. By doing that, we could probably apply the same optimization for non-adapter usecases, but I'd rather keep the first iteration as non-invasive as possible (although this approach won't work if the interface returns more than one list, which I don't think is the case for the preview1 adapter). We can extend the approach as next step if needed (or just wait for the #175 to have a proper support for that).

Given that supporting the adapter would require changes in wasi-libc, I'm curious if the maintainers would be open for those modifications (@sunfishcode , @sbc100, @abrown ). I think we'd need to have a high-level agreement whether the adapter is a way to go to support preview1 in wasi-libc, but for now just testing the water - would that be something we could perhaps have in wasi-libc?
The other potential place for this change would be a wit-bindgen itself (which eventually generates a code for wasi-libc anyway) but given we just want it (for now) for the preview2->preview1 adapter, and there are some edge cases that won't work for any WIT file, perhaps better to start with wasi-libc.

@tschneidereit
Copy link
Member

3. Implement wasi preview2 -> wasi preview 1 adapter

The wasmtime-(compiled to wasm)-based adapter @cpetig has been working on might fit the bill here.

@loganek
Copy link
Author

loganek commented Feb 29, 2024

  1. Implement wasi preview2 -> wasi preview 1 adapter

The wasmtime-(compiled to wasm)-based adapter @cpetig has been working on might fit the bill here.

Thanks for sharing. I actually used some of the work done by @cpetig shared on Zulip to generate core modules with preview2 cannonical ABI to test my adapter :) (thanks @cpetig for the work BTW).

@lukewagner
Copy link
Member

@loganek I might be missing a constraint, but I think it should be possible to write this preview2-to-1 adapter without any changes to wasi-libc, since the adapter doesn't have to know about this realloc trick; it can just call realloc as a black box (and maybe a caller-supplied buffer is used, and maybe not). The trick is how to get line 22 of your adapter to be able to call the preview2-core-module-defined realloc function. If we're talking about vanilla wasi-sdk output, I think this is just cabi_realloc, and so the question is how to link to cabi_realloc from your adapter module (via static func indices or dynamic function table indices, etc).

@loganek
Copy link
Author

loganek commented Feb 29, 2024

Hmm, I can't imagine making the change without modifications of the wasi-libc; the wasi-libc code is the only place where the caller-supplied buffer is available. Once the preview2 function is being called, this information is lost. So I think that somewhere before the preview2 function is called, the wasi-libc must store the pointer of the caller-supplied buffer (for example, see here: https://github.com/WebAssembly/wasi-libc/pull/477/files#diff-36afeb04f659e0e36e627b55010e437bd0855e5a0b6edf82348cb7528b3c3d17R48).

Also, wasi-libc unconditionally copies the data from the buffer returned by preview2 function into a caller-supplied buffer, and frees the list (e.g. here https://github.com/WebAssembly/wasi-libc/pull/477/files#diff-36afeb04f659e0e36e627b55010e437bd0855e5a0b6edf82348cb7528b3c3d17R60-R61). If the buffer is the one that the caller provided, that function will try to release that buffer (which is wrong). So I don't really see how we could do that optimization without changing a bit wasi-libc code.

I think preview1->preview2 is a bit different - preview1 implementation calls preview2 function. Because preview1 has direct access to the buffer, it can do the "pseudo" allocation within the adapter.

Let me know if I'm missing anything obvious here. I'd love to encapsulate all the hacky logic within the adapter, but can't think of any good way to do that.

@lukewagner
Copy link
Member

Ah, I think the missing link is that while, iiuc, the preview1-to-2 adapter does do this caller-provided-buffer optimization, it looks like the native preview2 wasi-libc implementation does not (yet). But let's say that native preview2 wasi-libc added the caller-provided-buffer optimization (which would be well-motivated as a pure optimization): then this is not a detail that the preview2-to-1 adapter would have to care about: the preview2-to-1 adapter simply calls wasi-libc's cabi_realloc (as it would in any case), and it's just a wasi-libc impl detail that this cabi_realloc call is optimized to return the caller-provided buffer. Thus, while wasi-libc's preview2 support should probably do this optimization, it's not like it must or else the preview2-to-1 adapter breaks or anything like that.

Does that make sense? If so, then what's nice is that there is no extra ABI contract beyond what plain preview2 specifies, which should reduce coupling.

@loganek
Copy link
Author

loganek commented Mar 2, 2024

@lukewagner ok looks like we both talked about the same thing :) So we both agree that wasi-libc changes are necessary - I'll prepare a brief plan with changes and present it to the community. In my example repo I'm doing exactly that:

I implement it as a separate function because I simply didn't want to modify the auto-generated cabi_realloc - and that's what I meant when I said that perhaps wit-bindgen is the place where the change should happen. I guess we could either add an option to wit-bindgen to not generate cabi_realloc and let the consumer of those bindings provide their custom implementation, or already implement the allocator and optimizations there. I'll think a bit about it and propose a solution soon.

Let me know if you're aligned with above and if so, I think we can close the issue.

@lukewagner
Copy link
Member

Ah, ok, gotcha, good to hear. Overall, it seems like a good optimization to have, so I think we're aligned that it'd be a good idea to do somehow. I'm not sure where exactly in the toolchain is the right place to make those changes, but what you're saying sounds plausible, so I'll leave you to it to discuss with the folks working on the individual parts.

@xwang98
Copy link

xwang98 commented Mar 8, 2024

before calling the import, store the caller-supplied buffer pointer and its size in two special globals
use a special realloc function that returns the special global memory if the requested size is in bounds

Hi @lukewagner , could you please explain a bit more about how to use using caller-supplied buffer for avoiding allocation and copy during calling read() API. I made a diagram for the workingflow based on my understanding. It may be inaccurate. Could you please share how to apply this method as you mentioned?

preview2 excalidraw

@xwang98
Copy link

xwang98 commented Mar 8, 2024

Or here is another way to implement the read() API for libc with p2 in my guess. The right side flow is about how the data is returned from callee to the user caller. (Is the new wasi-libc source code that calls the p2 import functions available? I haven't found it. )

preview2 excalidraw

@xwang98
Copy link

xwang98 commented Mar 8, 2024

Talk it a bit with my colleague Liang He. We figure it might be something like below:

c__repos_xin-notes_preview2

This workflow only works for the native as the import provider. It implies the wasm runtime will do private handling for the native implementer of the WIT interface, like skipping all the memory reallocation and copying during the function call. And the native implementation of import and libc read() will share some special information.
@lukewagner we would appreciate your comments for clarification.

@loganek
Copy link
Author

loganek commented Mar 8, 2024

Hi @xwang98 I think there important thing to know is that there's a cabi_realloc method implemented in WASM code that must be imported by the runtime. The runtime calls this method whenever it needs to allocate a memory, and it's up to WASM code how the memory should be allocated.

The proposal here is that in the libc's read() function, we'll save a pointer to a caller-provided buffer; cabi_realloc method will then return this buffer when runtime needs to allocate the memory for the read buffer. So in a very high level, it'd be something like:

// This is WASI libc code
typedef struct {
    void* buffer;
    size_t len;
} user_allocator_t;

_Thread_local user_allocator_t user_allocator = {0};

void *cabi_realloc(void *ptr, size_t old_size, size_t align, size_t new_size) {
  if (user_allocator.buffer && user_allocator.len >= new_size) {
    return buffer;
  } else {
    // buffer wasn't provided, fallback to malloc
    return malloc(new_size);
  }
}

void read(const char* path, char *buf, size_t buf_size)
{
  user_allocator.buffer = buf;
  user_allocator.len = buf_size;
  // we don't pass the buffer, but cabi_realloc knows about it already
  void* ret = preview2_read(path, size); // call a function from the runtime
  if (ret != buf) {
    // something went wrong, and runtime couldn't use the caller-provided buffer
    memcpy(buf, ret, size);
    free(ret);
  }
  // reset caller-provided allocator
  user_allocator.buffer = NULL;
  user_allocator.len = 0;
}

// This is Runtime implementation
void* preview2_read(const char* path, size)
{
  void *buf = cabi_realloc(NULL, 0, 1, size);
  fread(buf, size, 1, file);
  return buf;
}

The implementation above just presents a high-level idea and obviously misses a lot of edge cases, but I hope that makes things clear. When it comes to preview2->preview1 adapter, the preview2_read() implementation in the adapter (so also in WASM code), and instead of calling fread(), we'll be calling WASI fd_read.

I've implemented the idea here: https://github.com/loganek/wasm-wsp2-efficient-memory/ (please note though that I don't use cabi_realloc as it is autogenerated for now, instead, I call user_allocator_alloc but that's pretty much the same). I'm also now working on a long-term proposal to support preview1 in the toolchaing using the adapter and that efficient memory passing will be part of the proposal. I hope to finish it in a few weeks and present it to the community for review and approval.

@xwang98
Copy link

xwang98 commented Mar 9, 2024

@loganek it looks like our lastest diagram pretty much follows what you explained. The following question is what if there are two list arguments in the WIT interface?

It looks like a potential need for WIT to support a dual-access memory concept. It is not preferred to introduce too many tricks in the libc implementation as it is so foundational.

@loganek
Copy link
Author

loganek commented Mar 9, 2024

@xwang98 I was mainly interested in the preview2 to preview1 adapter scenario and I don't think there's a need for more than one pointer (but didn't look at the all API s yet, so I might be wrong). One idea that comes to my mind though is to have a list/stack of buffer pointers in the allocator instead of just one, and consecutive calls to realloc would get/pop pointers from the stack. That'd however require a contract between the wash code and runtime code that defines the order of allocations.

@lum1n0us
Copy link

may or may not be helpful. just FYI: WebAssembly/WASI#594

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants