- Where: zoom.us
- When: October 1st, 4pm-5pm UTC (October 1st, 9am-10am Pacific Daylight Time)
- Location: link on calendar invite
None required if you've attended before. Send an email to the acting WebAssembly CG chair to sign up if it's your first time. The meeting is open to CG members only.
The meeting will be on a zoom.us video conference. Installation is required, see the calendar invite.
- Opening, welcome and roll call
- Opening of the meeting
- Introduction of attendees
- Find volunteers for note taking (acting chair to volunteer)
- Adoption of the agenda
- Proposals and discussions
- Review of action items from prior meeting.
- SIMD 64x2 instructions, poll for inclusion back into the spec (WebAssembly/simd#101)
- Trapping semantics of bulk memory operations (WebAssembly/bulk-memory-operations#111)
- Closure
- Continue discussion on Trapping semantics of bulk memory operations in a future meeting (see notes)
None
- Luke Wagner
- Derek Schuff
- Deepti Gandluri
- Thomas Lively
- Mingqiu Sun
- Conrad Watt
- Rick Battagline
- Alex Chricton
- TatWai Chong
- Adam Klein
- Jay Phelps
- Zhi An Ng
- Bill Ticehurst
- Alon Zakai
- Peter Jensen
- Jacob Gravelle
- Dan Gohman
- Paul Dworzanski
- Shravan Narayan
- Ms2ger
- Ross Tate
- Ryan Hunt
- Jakob Kummerow
- Francis McCabe
- Lars Hansen
- Alon Zakai
- Andreas Rossberg
- Petr Penzin
Include SIMD 64x2 instructions back into the spec (WebAssembly/simd#101)
Zhi An Ng: We discussed adding these ops back to the spec in linked github issue. Informal poll has had folks show interest in option 3 (f64x2 ops and common i64x2 ops). Not sure if I should take a poll for each, or what?
DS: Does anyone want to express opinions on the options here? ... [no response]
DS: maybe it's best to make a particular recommendation if this was discussed separately. It looks like option 3 was definitely what people wanted in the issue. So maybe we should poll for that?
TL we've seen benchmark numbers showing good results for most of these instructions, right?
ZN: yes
JP: it looks like there was a discussion about these ops before. Were any of the folks there also here? Why were they removed in the first place?
DG: we originally polled to remove them due to lack of benchmarks showing they were valuable. So it was uncontroversial. Also it was in line with our originally-agreed approach to only add operations with good benchmark evidence for inclusion Zhi's work has been to collect that info, and now we have good evidence that they perform well.
TL: what about more niche stuff like sqrt and min/max? Do we want those?
DG: you mean 64x2?
TL: yes
DG: for now we've only tried to include the ones we've seen actual usage/benefit for, but we haven't for those.
Poll: Include in the SIMD spec all the discussed f64x2 operations, and a subset of i64x2. Specifically:
- f64x2.{splat,extract_lane,replace_lane}
- f64x2.{add,sub,mul,neg}
- f64x2.{div,sqrt}
- f64x2.{shl,shr_s,shr_u}
- f64x2.{any_true,all_true}
- f64x2.{eq,le,lt,ge,gt}
- f64x2.{abs,min,max}
- f64x2.convert_{s,u}/i64x2
- i64x2.{splat,extract_lane,replace_lane}
- i64x2.{add,sub,mul,neg}
- i64x2.{shl,shr_s,shr_u}
SF: 6 F: 5 N: 5 A: 0 SA: 0
Trapping semantics of bulk memory operations (WebAssembly/bulk-memory-operations#111)
Ryan Hunt: presenting We've seen a performance regression since LLVM started including memory.copy instructions. The current Spidermonkey implementation is not very optimized, Started looking at optimizing it for this use case.
LH: You’re giving memory.copy load multiple/store multiple semantics without specifying the semantics, and letting the wasm compiler infer
AR: if you can’t fit everything into registers you'd still have to go low-to-high?
RH: In that case, we know the length is constant, we should know as we’re compiling whether we have enough space, if we don’t we have to fall back to the slow path which is bounds check + memmove which is as it is today. Testing is not exhaustive, but my tests haven’t needed more space than SSE right now
PP: some platforms have hardware support for move byte, could we use those
RH: like rep movsb? I did some benchmarking. We weren't seeing very good performance, but we might not have been doing it optimally
PP: What platform were you running on?
RH: Macbook pro x64, can copy disassembly if needed
RH: it’s public, there’s a patch on the bug
PP: I can take a look to see if I can help with anything, in theory you should be able to use byte moves, but they are not consistent across different x64 checks, didn’t get the part about alignment hints, can you elaborate?
PP: can you elaborate on the alignment hint alternative?
RH: if the instruction had an immediate which specifies the min alignment of the src and dest, similar to current loads and stores. We could have a branch that checks the alignment, and we could use loads and stores of that size to implement inline. I tried and it seems ok, but if LLVM emits just 1 because its unknown, it would fall back to just a byte copy.
PP: For example, the way LLVM would give you the hint is the alignment if 4 bytes?
RH: It’s not super well formed, I don’t know how LLVM would optimize this
TL: from the toolchain, when we implemented bulk mem, we just optimized for code size, since we didn't’ have benchmark data. We changed it back now that we have some info. The initial regression is gone now with the current trunk. So this is fixed in the toolchain now. SO it depends on what we think the memmove instruction is for. If it’s for constant small moves, then this spc change makes sense. But if that’s not what it’s for, and we shouldn’t need to optimize for this case, then there’s not really a problem?
I'm sympathetic to making it more useful, this seems like an extra complication, and unknowns for spec work, we've already gone back and forth on the instruction - if we do want an extra instructions, adding it to the spec in the form of a new multi load/store instruction would be more appropriate.
LH: we had these semantics already, right? We changed it to the current semantics to make space for memory.map/protect/grow. But there is some issue about whether those are plausible. So it’s possible that the motivation that caused us to create the current semantics was flawed?
Initially we talked about memory.copy for large chunks of memory, and now it’s tempting to use it for smaller chunks, we should make it possible to use for smaller chunks - there seems to be no reason to change it apart from the fact that Chrome already ships it?
AR: basically this assumes that memory accessibility stays monotonic forever, right? It seems a pretty strong assumption. Given that, I’m sympathetic to Thomas’ argument, why not have a dedicated instruction?
LH: Wouldn’t a store multiple have the same type of problems in terms of monotonicity?
AR: If we are given alignment, we can specify more of the details, but in general you are right
PP: You are using a register that you could be using for anything else, maybe there could be different instructions for doing the short version…
LH: The original version was to specify a constant length, and then the page size became large, so now we have bytes … Anyone working on a JS engine knows that people will use it and assume that it works fast
RT: one issue is that small is variable. Memcpy could have a flag to specify the checking semantics, strict vs lazy.
CW: Wouldn’t address the original issue why we include the semantics originally, the semantics will get non-deterministic if we introduce something in the future that relies on monotonicity
RT: non-monotonicity means you’d revoke/protect memory in the future?
CW: right. So it would be nondeterministic what gets written in that case where a write races with a protect
AR: That’s alsos semantics we could introduce now, but that seems not particularly attractive, trapping in-between a memory.copy/fill right now is undefined because the order is not defined
RT: this is only when we have parallelism and nondeterminism anyway? CW: right, this in the presence of shrink or protect racing with a copy
RT: Shrink/Protect in parallel with this operation?
AR: What about alignment hints? RH: It would need to be used TL: When LLVM emits small constant size loads, it does have alignment info, or we could easily get it
AR: We can always modify the proposal to add alignment hints right now, how hard would that be?
AZ: there’s a benchmark with inline/align guard. Could you do the check differently <e.g. separate the OOB from alignment?>
LH: Tried coming up with a good code sequence for that, it’s been difficult, there’s source, and destination alignments… it wasn’t obvious how to do that correctly, feedback welcome?
AZ: why does the alignment matter, compared to checking just OOB?
RH: if you know the alignment is 4 and you write 4 at a time, you won’t do a partial write, you'll do full-size writes up to the bound.
AZ: You would have a check inline, saying I’m doing a copy from x - x+ k, do a check if x+k is in bounds, if it’s out of bounds we can go into the case that handles all the things? Another way to phrase it is whether it’s going to trap, or whether it’s aligned? They seem different enough to handle separately? Why do they need to be intertwined?
CW: fundamentally if you had that fast path, if we end up implementing protect, that wouldn’t satisfy the byte-by-byte semantics anyway. So maybe this idea of byte-by-byte wouldn’t really help and we should have nondeterminism anyway?
AR: I have the same question as Alon, the proposal is to add for add bounds checks before the operation, why can’t we add that in the user space?
CW: Having explicit memory.length semantics will be expensive
AR: sorry the engine could implement that as the fast path
LH: Several constraints, minimize code size, alignment requirements, other constraints as well, so it’s hard to handle the combination of all these..
AR: Still don’t understand why alignment, or out of bounds matters
If you know you can do multibyte copies, and the alignment is right, it’s still faster
AR: for the OOB thing that seems to be unrelated to alignment. If you check OOB, then why do you still care about alignment?
LH: Suppose your destination pointer is not aligned, you will hit the end of memory before you write, you do the fast path in line, in response to Alon, when implementing it I found it to be a very complicated check for something that was only copying a few bytes
AR: I don't understand why it's more than 1 branch?
LH: we should take it offline, it's hard to discuss details without the code in front of me.
CW: even if you successfully did that, if we were to have memory protect later, you wouldn't conform. So maybe the byte-by-byte wasn't a good idea.
AR: The only answer would be completely non-deterministic?
CW: Correct, that’s just what we have to Spec
AR: why would you bound-check in the beginning, in that case?
CW: in the usual case, then nothing changes and we have determinism, only in races would there be nondeterminism.
AR: All options seem fairly unpleasant, I would punt to saying this is a toolchain issue - take LH’s point that this will just be an engine race
RT: can we discuss this offline on an issue?
PP: we should, it's hard to discuss these details in a meeting.
RH: Posted the github issue in the chat, it’s in the agenda, will link the benchmark, and the SpiderMonkey patch in the issue
AR: what is the sentiment… what’s the promise that C compilers have? Do they promise that it’s always fast?
PP: Not necessarily, it’s expected to be a call to a library routine
AR: Many compilers inline that check
DS: C compilers freely turn load/stores into memcpy and back, will do it if it thinks it will make it fast
PP: right, hw doesn’t have a bulk memory store. If you’re copying values that can be a register, that should become load/store.
LH: One point I made a couple of times, memory.copy is not memcpy, the semantics that the C compiler expects are not mapped to memcpy, it’s memmove + extra overhead, so the name is confusing
PP: yeah it's definitely true that it's not like anything in native
TL: the current state is that i reverted LLVM’s change to always use memory.copy. But when bulk memory is enabled, it uses memory.move anywhere there’s a memcpy intrinsic (e.g. if the user writes it or if the compiler detects it). But as lars said that’s technically different semantics than memcpy. So can the instruction be at least as fast as a wasm implementation of memcpy? Because that’s what it’s really replacing. So is that a good choice?
LH: Except for the very smallest sizes, yes.
RH: I'd guess yes, ive not done benchmarking in unrolled wasm load/store vs memory.copy. We still have some low-hanging fruit to optimized though.
TL: If this gets really complicated, we have the fallback of saying this is a tools problem, and we can solve it
DS: We don’t want to be in a bad situation where the LLVM toolchain is generating a bunch of wasm code, but engines are under pressure to optimize it anyway because of other toolchains.
AR: we should at least document what the expected use is.
TL: If we have dozens of implementations, then we can add a note to say don’t use it
LH it would still be nice to pick the best semantics we can for our constraints. So if the current semantics don’t help the memory.protect case and the old ones are still better for the cases we want, we should still consider changing
CW: Starting to think this as well
TL: Fine with me
DS: It sounds like that folks would continue to discuss this offline, should we bring it back if there’s more progress offline?
LH: Only concern I have is that Chrome has already shipped this, how long before it becomes a problem?
DS: The only problem is when you are trapping - when you are importing/exporting the memory and you trap (and then later read the memory e.g. from JS), that’s when this would come into the picture. Emscripten does technically support this, but seems like a niche use case. It’s definitely UB for C, it seems to be only a thing if you compile with a language that has more well defined semantics on top of LLVM, so not super concerned, does anyone else see other concerns?
AR: Meta concern of why have you shipped this already when this hasn’t been fully specified, seems against the process?
DS: It’s never happened outside of the threads proposal, (and bulk memory was pulled out of the threads proposal); without bulk memory you don’t have conditional segment initialization, so it made toolchain support problematic.
AK: We don’t want to be shipping things at the stage, but history of threads, and SABs has been complicated, I don’t think the intent to do this in general.
DS: One other general comment, we want to keep the door open to refine the proposal based on memory model changes; if we don’t say something in the spec about the map/protect/shrink idea, embedders will do it anyway via APIs, possibly in different or incompatible ways. I would prefer that we have non-determinism of the same type as the one we have for threads, compared to giving up on that completely.
AR: You make a good point that embedders will invent a way to do that anyway, very concretely this has been suggested in my company that we should have something like memory.shrink to reduce memory usage, but this would break compilers if they hoist out bounds checks etc. ?If the embedders start inventing stuff, it will be incompatible.. I would rather anticipate the possibility of features like this, just the fact that embedders might like to invent something is already at risk.. We should proactively handle that
DS: we are out of time. We can continue discussing the specifics on WebAssembly/bulk-memory-operations#111 and bring this back to a future meeting.