- Host: Google, Mountain View, CA
- Dates: Tuesday-Wednesday May 30-31, 2017
- Times:
  - Tuesday - 9:00am to 5:00pm
  - Wednesday - 9:30am to 5:00pm
- Location: 1881 Landings Dr, Mountain View, CA 94043
- Wifi: GoogleGuest (no password)
- Dinner: Tentative: Stein's Beer Garden, Wed May 31st, 5:30pm. Please sign up for dinner here.
- Contact:
  - Name: Brad Nelson
  - Phone: +1 650-214-2933
  - Email: bradnelson@google.com
- Please register before the event here.
- Where to park:
  - Free parking is available outside the building.
- How to access the building:
  - The morning of the event we'll meet you at the door.
  - At other times, please call or email the host.
- Technical presentation requirements (adapters, Google Hangouts/other accounts required, etc.):
  - Presentations will be done with a Google Hangout.
  - Contact the host if you need alternatives.
- Any other logistics required to participate in the meeting:
  - Please register before the event here, as we may not be able to accommodate late arrivals.
- Opening, welcome and roll call
- Opening of the meeting
- Introduction of attendees
- Host facilities, local logistics
- Find volunteers for note taking
- Adoption of the agenda
- Proposals and discussions
- Working Process
- Non-trapping float-to-int conversions [1,2] (Dan Gohman): 30 minutes
- Threads (Ben Smith)
- Proposal
- Brief overview of the proposal
- Topics for committee feedback
- i8 and i16 Value Types
- Encoding Scheme
- Matching SharedArrayBuffer/Atomics for v1
- Float atomics
- i64 atomics
- i64.wait
- Is the set of proposed instructions sufficient?
- Lock-free guarantees for 8- and 16-bit atomics #19
- Blocking main thread in web embedding
- Thread-local and Shared Global variables
- Consider supporting verifiably safe concurrency
- Native WebAssembly threads - needed for v1?
- Memory model standardization
- Memory orderings - sequential consistency only?
- How to test?
- SIMD (Jakob Stoklund Olesen)
- Proposal
- Brief overview of the proposal
- Topics for committee feedback
- Definition of performance benchmarks and acceptance criteria
- Working Group Formation + Working Process
- Draft charter
- Proposed working process
- Making progress online
- Closure
Dates | Location | Host |
---|---|---|
2017-07-18 to 2017-07-20 | Kirkland, WA | |
2017-11-06 to 2017-11-07 | Burlingame, CA | TPAC |
Tracking Issue: i8/i16 values with limited operators #1049
Poll: Add `i8` and `i16` value types.
Adding i8 and i16 value types can reduce the number of opcodes required to implement atomics. Because they were not considered for the initial WebAssembly design, adding them now introduces some potential inconsistencies.
Tracking issue: Encoding Scheme #12
Poll: Use a prefix byte followed by LEB128 for all new prefix-based opcodes.
This is currently irrelevant for the threads proposal, since the number of new opcodes is always less than 128. For SIMD, there are currently more than 128 new opcodes, so some new opcodes will require 3 bytes instead of 2 if this proposal is accepted. The benefit is that there will be more opcode space to use in the future, without having to use a new top-level prefix byte or add a secondary prefix byte.
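As a rough sketch of the trade-off (illustrative only; the actual prefix values were still under discussion at this meeting), the scheme is the prefix byte followed by the operation number as an unsigned LEB128 varint, so operations 0-127 cost two bytes and operations 128-16383 cost three:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the proposed encoding: a fixed prefix byte followed by the
// opcode number as an unsigned LEB128 varint. The prefix values used in
// the comments below are hypothetical.
std::vector<uint8_t> encodePrefixedOpcode(uint8_t prefix, uint32_t opcode) {
    std::vector<uint8_t> bytes{prefix};
    do {
        uint8_t b = opcode & 0x7f;   // low 7 bits
        opcode >>= 7;
        if (opcode != 0) b |= 0x80;  // continuation bit
        bytes.push_back(b);
    } while (opcode != 0);
    return bytes;
}

// encodePrefixedOpcode(0xf0, 0x05) -> {0xf0, 0x05}        (2 bytes)
// encodePrefixedOpcode(0xf1, 0x90) -> {0xf1, 0x90, 0x01}  (3 bytes)
```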
Poll:
- Use `0xf0` prefix byte for all threading opcodes.
- Use `0xff` prefix byte for all threading opcodes.
This should be coordinated with the SIMD proposal.
Poll:
- Encoding proposal #1: No new non-atomic operators. Add 107 new atomic operators.
- Encoding proposal #2: Use i8 and i16 value types. Add 12 new conversion operators. Add 40 new atomic operators.
- Encoding proposal #3: Add 5 new sign-extend operators. Add 67 new atomic operators.
See the full description of these proposals in the overview.
Tracking issue: Matching behavior of SharedArrayBuffers in ES spec #5
Poll: Include the same atomic and threading primitives as provided by the ES spec.
This includes:
- `is_lock_free`
- `wait` for 32-bit integers
- `wake`
- Atomic ops for 8-, 16-, and 32-bit signed and unsigned integers:
  - `load` and `store`
  - `add`, `sub`, `and`, `or`, `xor`, `xchg`
  - `cmpxchg`
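For reference, most of these primitives map directly onto C++11 `std::atomic` (a minimal illustrative sketch, not proposal text; `wait`/`wake` have no C++11 analogue and correspond to the futex-style `std::atomic::wait`/`notify_one` added in C++20):

```cpp
#include <atomic>
#include <cstdint>

// Illustrative C++11 analogues of the proposed wasm atomics. The 8- and
// 16-bit variants work the same way; only uint32_t is shown here.
void atomicOpsSketch(std::atomic<uint32_t>& a) {
    uint32_t v = a.load();          // atomic load
    a.store(v + 1);                 // atomic store
    a.fetch_add(1);                 // add
    a.fetch_sub(1);                 // sub
    a.fetch_and(0xff);              // and
    a.fetch_or(0x100);              // or
    a.fetch_xor(0x1);               // xor
    a.exchange(42);                 // xchg
    uint32_t expected = 42;
    a.compare_exchange_strong(expected, 43);  // cmpxchg
}
```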
Tracking issue: float atomics, i64 atomics, i64.wait #7
Poll: Include `f32` and `f64` atomic `load` and `store` operators.
These are not currently included in the ES spec.
C++11 supports the equivalent of `load`, `store`, `xchg`, and `cmpxchg`, without specialization of the `std::atomic` template. A C++20 proposal adds a template specialization for floating types, RMW `add` and `sub` operations, and other features not relevant here.
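As a concrete illustration of that C++11 subset (a sketch of standard C++, not proposal text), `std::atomic<float>` compiles without any specialization, while floating-point RMW operations only arrive with the C++20 specialization:

```cpp
#include <atomic>

// C++11: the primary std::atomic template works for floats, giving
// load/store/exchange/compare_exchange but no floating-point RMW add/sub.
void floatAtomicsSketch(std::atomic<float>& a) {
    float v = a.load();
    a.store(v * 2.0f);
    a.exchange(1.0f);
    float expected = 1.0f;
    a.compare_exchange_strong(expected, 2.0f);
    // a.fetch_add(1.0f);  // only valid with the C++20 floating-point
    //                     // specialization mentioned above
}
```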
Tracking issue: float atomics, i64 atomics, i64.wait #7
Poll: Include `i64` atomic `load`, `store`, and `rmw` operators.
These are not currently in the ES spec.
Tracking issue: float atomics, i64 atomics, i64.wait #7
Poll: Include the `i64.wait` operator.
This is not currently in the ES spec. It is included in the proposal because it is symmetric (`i32.wait` atomically loads a 32-bit value), but it is not clear whether it is useful.
Tracking issue: Is the set of proposed instructions sufficient? #11
Discussion: are there further atomic/threading operations that should be included?
- `pause` (spinloop relaxation instruction)
Tracking Issue: Lock-free guarantees for 8- and 16-bit atomics #19
Poll:
- Guarantee lock-free atomics for only 32-bit accesses
- Guarantee lock-free atomics for 8-, 16-, and 32-bit accesses
The current proposal matches the ECMAScript specification, which only guarantees that 32-bit atomic accesses are lock-free. It may be better to guarantee lock-free atomics for smaller sizes, as well:
- Clang already assumes that 32-bit lock-freedom implies 8- and 16-bit lock-freedom
- The reservations about specifying this for ECMAScript mostly hinge on uncertainty in the MIPS specification of 32-bit LL/SC, but Clang already relies on this behavior, as does WebKit
Tracking issue: Blocking the main thread in web embedding #6
Poll: `*.wait` instructions may trap when executed in certain contexts (e.g. main thread in web embedding).
This was required for the ES spec. The specific contexts depend on the `[[CanBlock]]` property of the agent.
Tracking issue: What are the plans for thread-local and shared global variables? #4
Discussion:
Globals currently cannot be imported or exported if they are mutable.
Globals could be used to implement the C/C++ stack pointer, but cannot be dynamically linked if mutable globals cannot be imported/exported. This can be worked around by using linear memory instead.
When linear memory is shared, there is no way to implement dynamically linked thread-local variables: globals cannot be used because they cannot be exported, and linear memory cannot be used because it is shared.
Should we implement mutable globals as thread-local values? Should the threading proposal depend on this?
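As a concrete (hypothetical) illustration of what needs lowering, each C/C++ `thread_local` declaration below requires per-thread storage that a WebAssembly toolchain must back with either a thread-local mutable global or a per-thread region of linear memory:

```cpp
#include <cstdint>

// Hypothetical illustration of the lowering problem, not proposal text.
// Each thread needs its own copy of these values; with shared linear
// memory and non-exportable mutable globals, there is no place for a
// dynamically linked module to put them.
thread_local int errno_like = 0;          // classic C TLS (e.g. errno)
thread_local uint8_t scratch_stack[1024]; // per-thread stack/scratch space

int* get_errno_location() {
    return &errno_like;  // in a dynamically linked module, this address
                         // must be computable across module boundaries
}
```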
Tracking issue: Consider supporting verifiably safe concurrency #3
Discussion: is there some subset of WebAssembly that can be provided for verifiably safe concurrency?
Tracking issue: Native WebAssembly threads #8
Poll: Add "native" threads to WebAssembly, in addition to embedder-created threads, for v1.
The current proposal has no mechanism for creating or joining threads, relying on the embedder to provide these operations. This allows for simple integration with SharedArrayBuffer, where a dedicated worker is used to create a new thread.
There is a drawback for WebAssembly: a worker is a relatively heavyweight construct, and a native thread would be considerably leaner.
Tracking issue: Memory model standardization #9
Poll: Is the memory model document required before merging the threads proposal?
It's assumed that the memory model document is needed; this question asks when.
Poll:
- Reference ES memory model, with WebAssembly-specific modifications.
- Create a shared memory model document that can be referenced by both ES and WebAssembly.
- Create an independent memory model document for WebAssembly.
Tracking issue: Memory ordering - Sequential Consistency only? #13
Poll: Provide only sequentially-consistent atomic operators for v1.
Tracking issue: How to test? #10
Discussion:
What tests will we want for WebAssembly threads?
- wait/wake
- Memory sharing
- Atomic API
- Atomicity?
- Ordering constraints?
Tracking issue: Initial Proposal #1.
The SIMD proposal uses a single `v128` value type to represent a 128-bit SIMD vector.
Poll: Adopt a `v128` value type along with basic operations `v128.const`, `v128.load`, `v128.store`, as well as bitwise logic instructions `v128.and`, `v128.or`, `v128.xor`, and `v128.not`.
Predicate vectors are represented with separate value types that don't have an in-memory representation. This allows implementations to choose their own representation. The `v*.select` operations are the primary consumers of predicate vectors.
Poll: Adopt the `b8x16`, `b16x8`, `b32x4`, and `b64x2` value types along with the basic operations `b*.const`, `b*.splat`, `b*.extract_lane`, `b*.replace_lane`, as well as the logical operations `b*.and`, `b*.or`, `b*.xor`, `b*.not`, and `v*.select`.
Floating-point arithmetic is supported by the `f32x4.*` and `f64x2.*` operations. The inclusion of `div` and `sqrt` will be discussed later, as will the handling of subnormal numbers.
ARMv7 NEON supports `f32x4` arithmetic with subnormal numbers flushed to zero only, and `f64x2` operations must be implemented with pairs of scalar VFP instructions.
Poll: Adopt the following `f32x4.*` operations:
- Basics: `f32x4.{splat,extract_lane,replace_lane}`.
- Comparisons: `f32x4.{eq,ne,lt,le,gt,ge}`.
- Arithmetic: `f32x4.{neg,abs,min,max,add,sub,mul}`.
Poll: Adopt the following `f64x2.*` operations:
- Basics: `f64x2.{splat,extract_lane,replace_lane}`.
- Comparisons: `f64x2.{eq,ne,lt,le,gt,ge}`.
- Arithmetic: `f64x2.{neg,abs,min,max,add,sub,mul}`.
Small integer arithmetic is provided by the `i8x16.*`, `i16x8.*`, and `i32x4.*` operations. These operations are widely supported by both Intel and ARM ISAs. Integer division is not included since it is not widely supported.
Poll: Adopt the following `i8x16.*`, `i16x8.*`, and `i32x4.*` operations:
- Basics: `{i8x16,i16x8,i32x4}.{splat,extract_lane*,replace_lane}`.
- Arithmetic: `{i8x16,i16x8,i32x4}.{add,sub,mul,neg}`.
- Shifts: `{i8x16,i16x8,i32x4}.{shl,shr_s,shr_u}`.
- Equalities: `{i8x16,i16x8,i32x4}.{eq,ne}`.
- Inequalities: `{i8x16,i16x8,i32x4}.{lt,le,gt,ge}_[su]`.
The `i64x2` operations are more limited, but they are still useful for bitwise manipulation of `f64x2` vectors. Division and multiplication are not widely available. Comparisons exist in SSE 4.2 and ARMv8 only.
Poll: Adopt the following `i64x2.*` operations:
- Basics: `{i64x2}.{splat,extract_lane*,replace_lane}`.
- Arithmetic: `{i64x2}.{add,sub,neg}`.
- Shifts: `{i64x2}.{shl,shr_s,shr_u}`.
64-bit comparisons will be slow on ARMv7 and Intel chips without SSE 4.2.
Poll: Adopt the following `i64x2.*` operations:
- Equalities: `{i64x2}.{eq,ne}`.
- Inequalities: `{i64x2}.{lt,le,gt,ge}_[su]`.
The 64-bit multiplication is implemented in AVX-512, but would be slower everywhere else. All ISAs have 32 x 32 -> 64 multiplies (`pmuludq`, `umull`) that can be used to implement this operation.
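For intuition, a scalar sketch (not an implementation) of how a 64-bit lane multiply decomposes into 32 x 32 -> 64 multiplies; this is the per-lane work an engine would emit on ISAs without a 64-bit vector multiply:

```cpp
#include <cstdint>

// Scalar sketch: 64x64 -> low-64 multiply built from 32x32 -> 64 multiplies
// (the pmuludq/umull building block). The high*high partial product only
// affects bits >= 64, so it can be dropped.
uint64_t mul64_from_32(uint64_t a, uint64_t b) {
    uint64_t aLo = a & 0xffffffff, aHi = a >> 32;
    uint64_t bLo = b & 0xffffffff, bHi = b >> 32;
    uint64_t lo  = aLo * bLo;               // 32x32 -> 64
    uint64_t mid = aLo * bHi + aHi * bLo;   // two more 32x32 -> 64
    return lo + (mid << 32);                // aHi*bHi overflows out entirely
}
```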
Poll: Adopt the `i64x2.mul` operation.
Saturating integer arithmetic is widely available for 8-bit and 16-bit lanes, but not for `i32x4`.
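To make the semantics concrete before the poll, a scalar sketch assuming the usual clamp-to-range definition of saturation (the vector forms apply this independently per lane):

```cpp
#include <cstdint>

// Scalar sketch of i8 saturating add: clamp to the lane's range rather
// than wrapping on overflow.
int8_t add_saturate_s8(int8_t a, int8_t b) {
    int sum = int(a) + int(b);   // widen so the add itself can't overflow
    if (sum < -128) sum = -128;  // clamp to the signed i8 range
    if (sum > 127) sum = 127;
    return int8_t(sum);
}

uint8_t add_saturate_u8(uint8_t a, uint8_t b) {
    int sum = int(a) + int(b);
    if (sum > 255) sum = 255;    // clamp to the unsigned i8 range
    return uint8_t(sum);
}
```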
Poll: Adopt the saturating integer arithmetic operations `{i8x16,i16x8}.{add,sub}_saturate_[su]`.
Integer-to-float conversions are always non-trapping.
Poll: Adopt the `f*.convert_[su]/i*` integer-to-float conversions.
When converting floating-point numbers to integers, we can choose to trap or saturate on overflow.
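A scalar sketch of the two lane-wise semantics (illustrative; the exact range checks and the NaN-to-zero rule are assumptions patterned on the scalar wasm conversions):

```cpp
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Scalar sketch of the two float->int conversion semantics (i32 lane
// shown). "Trap" is modeled as a thrown exception here; real wasm traps
// terminate execution.
int32_t trunc_trap(float f) {
    if (std::isnan(f) || f < -2147483648.0f || f >= 2147483648.0f)
        throw std::runtime_error("invalid conversion to integer");
    return int32_t(f);  // truncates toward zero
}

int32_t trunc_sat(float f) {
    if (std::isnan(f)) return 0;                 // assumed NaN rule
    if (f <= -2147483648.0f) return INT32_MIN;   // clamp on underflow
    if (f >= 2147483648.0f) return INT32_MAX;    // clamp on overflow
    return int32_t(f);
}
```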
Poll: Adopt the trapping `i*.trunc_[su]/f*` float-to-integer conversions.
Poll: Adopt the saturating `i*.trunc_[su]/f*:sat` float-to-integer conversions.
Tracking issue: SIMD binary encoding #14.
The current SIMD proposal defines 193 new operations, and it is likely that more SIMD operations will be added in the future. We need an extensible way of encoding the opcodes for these operations.
Poll: SIMD opcodes are encoded as `0xf1 varuint32`, where `0xf1` indicates a SIMD operation and the `varuint32` number identifies the specific operation.
This should be coordinated with the threads proposal, which currently uses a `0xf0` prefix byte followed by an opcode byte in the range `0x00`-`0x7a`.
The new value types also need assigned numbers.
Poll: Assign the following numbers to the new SIMD value types:
- 0x7b => v128
- 0x7a => b8x16
- 0x79 => b16x8
- 0x78 => b32x4
- 0x77 => b64x2
Tracking issue: Allow flushing of subnormals in floating point SIMD operations #2.
ARMv7 NEON devices support `f32x4` arithmetic, but only with subnormals flushed to zero. On these devices, `f64x2` arithmetic needs to be implemented with the slower VFP instruction set, which does support subnormals correctly, so the problem only applies to `f32x4` operations.
Poll:
Adopt the proposed specification text which allows SIMD floating-point instructions to flush subnormals to zero.
Tracking issues: Implementation-dependent reciprocal (sqrt) approximation instructions #3 and Eliminate divide, square root #13.
Floating-point divide and square root instructions are slow. They take many cycles to complete, and they are not fully pipelined, so they block the ALU while they are executing. Further, ARMv7 NEON does not provide vectorized floating-point divide and square root instructions, so they have to be implemented as a sequence of scalar VFP instructions.
All SIMD ISAs provide fast `f32x4` approximation instructions that compute 9-12 bits of 1/x or 1/sqrt(x), respectively. These instructions are fully pipelined and typically as fast as a floating-point addition. They don't provide the exact same approximation on different platforms, though.
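For background, a scalar sketch (not proposal text) of how such approximations are typically used: a coarse estimate refined by Newton-Raphson steps, each of which roughly doubles the bits of precision. The magic-constant seed below stands in for the hardware estimate instruction:

```cpp
#include <cstdint>
#include <cstring>

// Scalar sketch: a crude ~few-bit rsqrt seed plus Newton-Raphson
// refinement, analogous to how code built on rsqrt-approximation
// instructions recovers precision.
float rsqrt_seed(float x) {
    uint32_t i;
    std::memcpy(&i, &x, sizeof i);
    i = 0x5f3759df - (i >> 1);          // rough initial estimate of 1/sqrt(x)
    float y;
    std::memcpy(&y, &i, sizeof y);
    return y;
}

float rsqrt_refined(float x) {
    float y = rsqrt_seed(x);
    y = y * (1.5f - 0.5f * x * y * y);  // one Newton-Raphson step
    y = y * (1.5f - 0.5f * x * y * y);  // second step: near full f32 precision
    return y;
}
```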
Poll: Adopt `f32x4` 1/x and 1/sqrt(x) approximation instructions.
The `f64x2` versions of the approximation instructions are only available in AVX-512 and ARMv8's A64 mode.
Poll: Adopt `f64x2` 1/x and 1/sqrt(x) approximation instructions.
It has also been proposed to remove the exact division and square root instructions since they are slow:
Poll: Remove the `f32x4.div` and `f64x2.div` instructions.
Poll: Remove the `f32x4.sqrt` and `f64x2.sqrt` instructions.
Presumably, the answer to these questions should depend on whether the approximation instructions are adopted.
Tracking issue: Alternative to Swizzle / Shuffle #8.
The initial SIMD proposal contains fully general swizzle and shuffle instructions that take a sequence of lane indexes as immediate operands. An alternative proposal is to only define a subset of the possible swizzle and shuffle operations that are known to map to fast instructions.
The `v8x16.shuffle` instruction is pivotal since its functionality subsumes all the other swizzle and shuffle instructions. It can be implemented in terms of `pshufb` or `vtbl` instructions that are quite fast; see the GitHub issue for details.
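To pin down the semantics, a scalar reference sketch (an assumption based on the two-input form described in the proposal): each output byte is selected by an immediate lane index into the 32 bytes of the two concatenated inputs:

```cpp
#include <array>
#include <cstdint>

using V128 = std::array<uint8_t, 16>;

// Scalar reference sketch of a two-input byte shuffle: each of the 16
// immediate lane indexes (0-31) selects a byte from the concatenation of
// a and b. Engines lower this to pshufb/vtbl sequences.
V128 shuffle_bytes(const V128& a, const V128& b,
                   const std::array<uint8_t, 16>& imm) {
    V128 out{};
    for (int i = 0; i < 16; ++i)
        out[i] = imm[i] < 16 ? a[imm[i]] : b[imm[i] - 16];
    return out;
}
```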
Poll: Adopt the `v8x16.shuffle` instruction.
The `v8x16.shuffle` instruction is quite large, with its 16 bytes of immediate operands. Most of the time, shuffles with larger granularity suffice, and they can be encoded with smaller immediate operands (one byte per lane).
Poll: Adopt the `{v16x8,v32x4,v64x2}.shuffle` instructions.
The `v*.swizzle(x)` instructions pick lanes from a single input vector instead of two. Their functionality is subsumed by `v*.shuffle(x, x)`, which requires a `get_local`.
Poll: Adopt the `v*.swizzle` instructions.
SIMD ISAs have various shuffle instructions that are faster than the fully general `pshufb` and `vtbl` instructions. An implementation is free to use these instructions for the corresponding immediate operands on `v8x16.shuffle`.
The alternative proposal in the tracking GitHub issue is to define a set of
fixed shuffle instructions that are known to map to these fast instructions on
all ISAs.
Poll: Define a small set of primitive permutations that we know can be implemented efficiently.
Tracking issue: Consider adding Horizontal Add #20
Poll: Include `v*.addHoriz` instructions.
Packed horizontal additions are natively supported on ARMv8/SSE3. These would be useful for complex multiplications, and in the absence of the opcodes, would need to be a combination of shifts and adds. The issue proposes adding horizontal addition for f32x4, i32x4, i16x8, with a potential addendum for 64x2 types.
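For reference, a scalar sketch of packed horizontal add semantics (patterned on SSE3 `haddps` / ARMv8 `faddp`; illustrative, not proposal text):

```cpp
#include <array>

using F32x4 = std::array<float, 4>;

// Scalar sketch of f32x4 horizontal add: adjacent pairs within each input
// are summed, and the results from the two inputs fill the low and high
// halves of the output.
F32x4 addHoriz(const F32x4& a, const F32x4& b) {
    return {a[0] + a[1], a[2] + a[3],
            b[0] + b[1], b[2] + b[3]};
}
```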
Poll: Include instructions for horizontal operations.
Discuss the inclusion of horizontal instructions apart from horizontal addition mentioned above.
Tracking issue: Add opcodes for converting between integer vectors of different lane size. #21
Poll: Include integer narrowing and widening conversion instructions with signed and unsigned saturation.
Abbreviation | Name | Organization | CG member? | Note |
---|---|---|---|---|
KS | Karl Schimpf | Google, Inc. | ✅ | |
AG | Aseem Garg | Google, Inc. | ✅ | |
KC | Kevin Cheung | Autodesk, Inc. | ✅ | |
JG | Jacob Gravelle | Google, Inc. | ✅ | |
HA | Heejin Ahn | Google, Inc. | ✅ | |
RW | Richard Winterton | Intel Corporation | ❌ | |
CM | Cristian Mattarei | Stanford University | ❌ | |
MD | Mike Dunne | ACTIV Financial Systems, Inc. | ❌ | |
WS | Wink Saville | | ✅ | |
PG | Peter Grandmaison | Adobe Systems Inc. | ✅ | |
RG | Robert Goldberg | Adobe Systems Inc. | ✅ | |
LW | Luke Wagner | Mozilla Foundation | ✅ | |
JZ | James Zern | Google, Inc. | ✅ | |
JO | Jakob Olesen | Mozilla Foundation | ✅ | |
JM | Jordon Mears | Google, Inc. | ✅ | |
PJ | Peter Jensen | Intel Corporation | ✅ | |
VB | Vincent Belliard | ARM Limited | ✅ | |
MH | Michael Holman | Microsoft Corp. | ✅ | |
LZ | Limin Zhu | Microsoft Corp. | ✅ | |
AC | Aliaksei Chapyzhenka | Arteris, Inc. | ✅ | |
TL | Thomas Lively | Google, Inc. | ✅ | |
AL | Anthony LaForge | Google, Inc. | ✅ | |
MB | Martin Becze | Ethereum Foundation | ✅ | |
AD | Aaron Davis | ? | ❌ | |
DG | Deepti Gandluri | Google, Inc. | ✅ | |
BB | Bill Budge | Google, Inc. | ✅ | |
ST | Seth Thompson | Google, Inc. | ✅ | |
FP | Filip Pizlo | Apple, Inc. | ✅ | |
SB | Saam Barati | Apple, Inc. | ✅ | |
BN | Bradley Nelson | Google, Inc. | ✅ | Host |
DG2 | Dan Gohman | Mozilla Foundation | ✅ | |
DM | Denis Merigoux | Mozilla Foundation | ✅ | |
KM | Keith Miller | Apple, Inc. | ✅ | |
JF | JF Bastien | Apple, Inc. | ✅ | CG chair |
BS | Ben Smith | Google, Inc. | ✅ | |
BM | Bill Maddox | Adobe Systems Inc. | ❌ | |
EH | Eric Holk | Google, Inc. | ✅ |
- Two main presentations:
- Ben Smith - Threads
- Jakob - SIMD
- Consensus details
- 5-way polls
- Consensus rule: Chair decides whether there is consensus
- Neutral => status quo
- Anyone can vote. Non-registered members can’t bring substantial new ideas.
- Only experts call in.
- Agenda has been adopted.
- Dan Gohman presenting
- Aseem Garg taking notes
- Presentation
- No perf data
- Poll: New operators for Saturating conversion like Java?
- Discussion: NaN -> 0 makes no sense
- FP: x86 behavior? Not sure which saturating to pick and whether to add new instruction (can add both) or modify existing
- Options:
- 8 new opcodes.
- Poll: Backward compatible change (not keep backward compatible)?
- SF: 3
- F: 2
- N: 8
- A: 5
- SA: 1
- JF: Depending on language need to have branching anyways
- Poll: Do we care about trapping / non-trapping
- SF: 8
- F: 6
- N: 1
- A: 1
- SA: 1
- Consensus: neutral. Hence Status Quo
- FP: No perf data. Hence strongly against.
- Standard speed bump
- KM: seems like ergonomic problem
- MB: similar problem as trap handler. Think in that context.
- Ans: not in current agenda
- JF: trap handler good solution if exceptional path.
- JF: definitely ergonomic issue. If people hit the problem, they’ll go away.
- JG: Codegen we need to do on LLVM side, need to add WebAssembly bounds check. More WebAssembly code
- Microbenchmark has a 2-3x slowdown
- BananaBread program: 2-3% performance impact
- Poll: 8 new opcodes or prefix wise?
- JF: context for prefix - both for threads and SIMD encoding
- FP: what do prefix even mean?
- Going beyond 256 opcodes means a multi-byte scheme.
- DG2: This will be little different as it is modifier byte (can apply to simd too)
- Poll: Modifier byte:
- SF: 0
- F: 3
- N: 6
- A: 7
- SA: 1
- Consensus: against
- Poll: new prefix for scarcely used opcodes
- SF: 8
- F: 5
- N: 5
- A: 1
- SA: 1
- Consensus: agree
- FP: he would have been for if not called miscellaneous. Changes to agree if called Numeric (this is picked now).
- BS: 88 opcodes still available
- Poll: naming (as per slides)
- BS: Haven’t used this for anything else
- BN: change names for trap (cold response)
- BS: What about other mappings for NaN values with this scheme?
- JF: customize what happens
- Trap handler in future can do that.
- No poll actually done.
- AC: Is this optional implementation
- Broader qt.
- JF: Post mvp stuff can be added and users can test
- LW: webassembly.validate should be real stuff
- FP: not proposing adding optional space.
- JF: Ethereum may not want to implement threads.
- LW: permanently optional?
- BN: when do tools change?
- Developers test
- Automatic polyfill
- Timeline: up to a year
- Presented by Ben Smith
- Aseem Garg taking notes
- Slides
- Maximum size can be loosened in future
- Globals
- Explanation: no easy way to implement dynamically-linked thread local values (such as stack pointer) without import/export of mutable globals
- Globals are thread locals
- There is a question, do we want true globals?
- WS: Name suggestion: agent local
- FP: SeqCst means they fence non atomic operation?
- Yes
- FP: Cas loop, store, cas (x86, arm behavior)
- Will the behavior be same as c++ would have done
- Mfence will be more expensive
- BS: proposal currently matches ES
- JF: see C++ mapping
- BS: why not issue in ES?
- FP: ES has enough other overhead that it probably doesn’t matter
- JF: stanford folks (i.e., CM) are applying formal methods to the analysis of the ECMAScript memory model.
- FP: what SeqCst atomics mean for code-gen (arm v7)?
- Need heavy weight fences
- No one present who cares about arm v7.
- DG2: non-might be most important from portability perspective.
- KM: hard to tell perf numbers now
- BN: we will need to revisit
- FP: not clear WebAssembly causes fences for JS.
- JF: IRIW should not be observable from JS
- FP: JS should not worry about what happens to shared memory
- Wait
- FP: wake address operand is just identifier
- Wait times - 0, inf and nan
- Wake
- 0 means wake nothing
- LW: is there an extension byte to specify multiple memories
- BS: Yes. they all use a memarg (same as load/store) which has extension for adding memory index
- JF: module has to declare whether it will accept shared mem?
- Yes
- JS api changes:
- Currently only number
- Alternate proposal:
- Immutable globals come out as numbers
- Mutable come out as object
- Will it break things?
- JF: Dominic and Anne might have feedback on this design
- BN: why tragedy if only come out as mutable?
- Give it different name?
- KM: section still called global for agent local
- FP: whacky names
- LW: consistent with JS
- BN: should do the renaming now
- FP: add new section?
- LW: originally const which were globals
- MH: could use multiple memories for thread-locals instead
- Potentially separable proposal (useful for single thread too for SP)
- JF: we don’t have consensus on this. Will take up after lunch.
- MH: this changes sharedarraybuffer from grow memory perspective.
- Create in WebAssembly, grow in WebAssembly and use in JS
- LW: no.
- JF: create a SAB through WebAssembly, create a JS SAB, then grow the WebAssembly one and create another JS SAB. Now you have two JS SABs pointing to the same memory, but each with a different length.
- WS: 0 index question (update?)
- Every operation needs to specify
- JF: if we allow multiple memories using non-static indices to select which memory is being accessed, then we can’t statically prove that an atomic access is to a shared memory
- FP: don’t disallow atomic accesses to non-shared memories
- JF: true, it doesn’t matter because you can’t share it anyways
- LW: wake / wait get interesting, do nothing and deadlock
- WS: Is shared memory a separate memory area with a different index from other memory?
- In multiple memory world, yes. Currently must choose either shared memory or non-shared
- SB: For JS embeddings, what’s the plan for agents?
- Map to ES agents
- SB: want to embed multiple agents in same JS context?
- Native WebAssembly threads (there’s a poll for it)
- KM: imports check at runtime
- BS: can’t take up threads before this design.
- MH: Shared tables?
- No
- Poll: Add i8 and i16 value types. Will reduce atomic opcodes to 40. But adds new non-atomic ops. A bunch of questions come up. See doc
- FP: we don’t have 8->32 and 16->32 sign extend operators?
- No
- There is alternate proposal with sign extension operators.
- JF: If we add atomic operators without sign-extension, we’ve introduced an inconsistency
- FP: not really. For non-atomics there is likely a sign-extending instruction that we can use, for atomics there isn’t
- BN: stakes are low for adding ops since behind prefix byte.
- DG: will it also include all arithmetic ops for these types?
- BS: just looking into it still. If we do add, it will be useful for atomics too.
- MH: benefits SIMD too.
- FP: add more opcodes or more types?
- BS: new types mean more since they can be used in other contexts
- This makes more sense in SIMD context (extract lane)
- JF: rephrase this poll as Ben should do more work to explore more benefits rather than actually adopting.
- VB: also explore booleans.
- JF: add I1.
- SF: 3
- F: 2
- N: 3
- A: 6
- SA: 4
- Consensus: not do more work.
- Not doing more work on this for now.
- FP: 2 things can change, SIMD and GCed objects.
- Poll: Encoding 1 and 3
- For means 1 and against means 3
- SF: 0
- F: 2
- N: 2
- A: 6
- SA: 5
- Consensus: encoding 3 (sign extend operators and 67 atomic ops)
- Poll: How do we handle prefix byte. LEB128 after prefix
- SB: We could encode another way (e.g., not LEB128; we could use SQLite's encoding, or something else).
- JF: make opaque
- DG2: 3 byte simd will be rare
- FP: rare anyways. Fewer polls forever.
- SF:10
- F: 7
- N: 0
- A: 0
- SA: 0
- Consensus: For.
- Particular prefix byte.
- LW: count downward. FE Simd and so on.
- JF: lock prefix on x86. Cute. VEX prefix for simd is free too. Currently opcodes have some grouping.
- WS: FF reserved for future. And in downward scheme FE will be first.
- Poll: Choose 0xF0
- SF: 3
- F: 2
- N: 1
- A: 5
- SA: 4
- Consensus: against.
- Poll: Counting back from 0xFF for atomics
- SF: 8
- F: 2
- N: 4
- A: 1
- SA: 2
- Consensus: For.
- AC: How many buckets should we have before picking. Also, good to have something when we fail. We don’t have a limit when to stop.
- WS: reiterate reserve one for failure.
- KM: who opposes if we change to 0xFE
- No opposition.
- Consensus: we pick 0xFE
- Final consensus: count backwards for prefixes. 0xFE for atomics.
- Poll: 0xFF should be reserved prefix for when we run out:
- DG2: is this the right group?
- JF: none of these polls are binding. Everyone can edit the notes and after a week we’ll post the notes to Github and then people will have a week to object or support the polls and then there is a process then.
- SF: 4
- F: 7
- N: 4
- A: 0
- SA: 0
- Consensus: For.
- Poll: Include float atomics (load and store)
- Not having these will mean reinterpret cast.
- MH: why didn’t ES have it?
- Lars: created more bulk and there weren’t strong use cases.
- BS: in ES f64 was only 8 byte type
- JF: doesn’t cost anything. But Ben is supporting only load and store.
- LW: should either do everything (incl. RMW) or nothing.
- BN: who uses?
- JF: Dept. of Energy uses them extensively.
- JF: no cost of going the int route unless separate registers. Will change floating point environment
- FP: What CPUs support?
- JF: none. There were discussions to add these to unnamed ISAs.
- FP: fp load/store doesn’t have atomicity as defined here.
- FP: opposed since CPUs don’t support and benefits no one in room.
- Poll: Do we want any floating point atomics?
- SF: 0
- F: 0
- N: 3
- A: 10
- SA: 1
- Consensus: against.
- int 64 atomics
- JF: as spec they are not guaranteed to be lock free right now. In practice most recent CPUs support.
- FP: atomicity wrt other like operations or all operations. Hence, it will overlap with other non atomic operations (clarify?).
- JF: maybe discuss is lock free first?
- Poll: I64 atomics excluding wait (no guarantee of lock free):
- SF: 6
- F: 8
- N: 0
- A: 0
- SA: 0
- Consensus: For.
- Poll: should we have I64 wait.
- JF: wait doesn’t specify size.
- LW: address still 32 bit.
- SF: 2
- F: 0
- N: 8
- A: 1
- SA: 0
- PJ: see no use.
- Lars: seems useful as natural.
- JF: Linux futexes don’t have 64-bit.
- FP: overlooked, probably.
- FP: use case in conditional variable.
- Consensus: No strong consensus. PJ changed to neutral.
- Poll: Lock free guarantees.
- No CPUs have 32 bit but not 8, 16 bit.
- FP: 32 bit cas lock free implies smaller become lock free
- Lars: sub width store wouldn’t correctly validate.
- BN: memory model might allow for 8 bit store to not invalidate 32 bit CAS.
- BN: revisit in context of SAB memory model (clarify?)
- JF:
- are 32 bit lock free same way SABs are
- In WebAssembly are 8 and 16 bit also guaranteed to be lock free
- 64 bit lock free
- 32 bit lock free:
- Unanimous consensus.
- Poll: 8 and 16 bit
- SF: 6
- F: 11
- N: 0
- A: 0
- SA: 0
- Consensus: For. Question: how this will interact with ES.
- 64 bit lock free
- There is option to tell Ben do more work and come with better data
- WS: some CPUs would not support 64 bit lock free.
- FP: few years ago would have agreed.
- Lars: ARM has had it since ARMv6k
- BS: mips? Do we care?
- FP: 32 bit power will be hurt. Not sure though.
- JF: 2 polls assume lock free and if that doesn’t succeed, ask ben to come back with data.
- Poll: 64 bit lock free atomics:
- SF: 4
- F: 10
- N: 2
- A: 3
- SA: 0
- Lars: not sure of population of CPUs
- Consensus: change spec to go with lock free and people who are against who can come up with data to oppose.
- FP: term lock free non normative.
- JF: Non lock free can cause tearing
- FP: on architecture to work around to support.
- FP: lock free guarantee always contentious. On some x86 CPUs you can theoretically have no forward progress.
- Basically atomic operations are atomic.
- DG2: do we have fp guarantee.
- JF: Signal handlers can deadlock (as per c++ style).
- JF: Explain this one in detail?????
- JF: C++ always ensures 8, 16, 32, 64 are always lock free. Others not. In WebAssembly, 8, 16, 32, 64 bit will be lock free.
- FP: Actually atomic not lock free.
- WS: this means WebAssembly will be lock free but JS won’t.
- Actually WebAssembly only has guarantees as host allows.
- BN: any major archs where 32 bit and 64 bit atomics tear?
- JF: if not naturally aligned
- Lars: Mips doesn’t have double word atomics but arm, power and x86 should be fine.
- Poll: Get rid of the term isLockFree, and if we mention it, it’s non-normative (as well as require that all atomic sizes, regardless of whether they exactly overlap, do not tear, i.e. are genuinely + generally atomic).
- SF: 2
- F: 8
- N: 4
- A: 0
- SA: 0
- Consensus: for.
- Final consensus get rid of lock free (except non normative)
- Poll: Blocking the main thread (wait instruction may trap in certain contexts)
- JF: may trap in ES spec
- Lars: may trap is a property of host
- BS: property of host.
- There is some constant per-agent property (mirroring ES [[CanBlock]]) which specifies whether main thread can block.
- SF: 8
- F: 5
- N: 2
- A: 0
- SA: 0
- Consensus: for
- Poll: include same atomic and threading primitives as ES except is_lock_free
- FP: we omitted neg.
- Actually missed quite a few
- AC: don’t need sub if we have add
- FP: Actually minimum we need is cmpxchg
- Full consensus for it.
- Poll: Are these instructions sufficient. Other options include pause
- Lars: Pause - every arch has it. Too low level for ES but might make sense for WebAssembly.
- FP: used in Jikes RVM thin locks and in WebKit; in both cases it was either neutral or, in the case of Intel, a regression (performance considerably worse; didn’t check power). Once you’ve called yield you don’t need to do pause. NVIDIA had similar findings.
- WS: useful over spin lock.
- FP: in favor of yield over pause.
- JF: actually experiment done on x86. Might be win on some but WebAssembly tries to be portable.
- FP: this is more than NOP. So, wouldn’t just make some better.
- BS: pause could be instruction for our virtual ISA.
- JF: 2 things - do we want pause (which would probably be no) and do we want something like yield (feels should require more study).
- BS: what do we say it does
- WS: someone on my priority or above can run.
- FP: argument against coming back: difficult to match perf numbers without sched_yield.
- Final: Get a better proposal with performance and power numbers (on various archs) and check suggested Win32 impl (SwitchToThread() == sched_yield()?).
- Neg
- No
- Min/Max
- JF: rust has them. Would be good to know why c++ doesn’t.
- FP: mostly can’t observe any perf difference over cmpxchg.
- Leave for later
- Signed/Unsigned
- Leave for later
- Nand (JF)
- Leave for later
- Thread local and Shared Global
- BN: what do we do for things like stack top without this. If we don’t have this creates problems for tools.
- JF: we don’t have a concrete proposal, but tools people are saying the things we did for asm.js were ugly and can’t be done here. Do we want to solve this issue in V1 of threads or do we not care.
- BS: this proposal has some text but not very concrete.
- FP: When we do this, perhaps flip the discussion to how do we do it for SP, instead of generically for all thread-local values.
- LW: pointer to TLS array for dynamic linking.
- WS: errno is a TLS value.
- SP actually hotter.
- FP: pinned register for SP optimal
- LW: Should we have special case for SP? Not clear pin it to register or having it live in memory.
- JF: diving into implementation. Offline explore more.
- FP: do we have good thread-local proposal for V1. Don’t want some bad unofficial thing to get baked in.
- WS: can we have impl without thread-local
- BS: yes. Question comes down to dynamic-linking.
- BN: Google Earth folks would be using WebAssembly now if we had dynamic-linking + threads.
- Poll: someone (maybe BS) should explore this before next meeting for threads V1
- SF: 2
- F: 11
- N: 0
- A: 0
- SA: 0
- BN: we are in situation where we have bunch of features in ES and WebAssembly is not in feature parity. Doesn’t want threads blocking on further discussion.
- Consensus: for
- Native WebAssembly threads. Required for V1 or leave for later.
- WS: Agents which are WebAssembly only communicate within themselves efficiently.
- BS: current proposal allows through shared memory. But you need to rely on host to create thread (expensive).
- Google Earth: doesn’t spawn and kill a lot of threads.
- Code not sharing is an issue though.
- KC: How would WebAssembly threads share memory
- Not clear yet.
- SB: how does it work in Node.js
- MB: node has only processes.
- Poll: For => we want native threads in V1
- SF: 0
- F: 0
- N: 4
- A: 5
- SA: 5
- Consensus: against
- Provide only seq_cst atomic operators for v1
- SB: we can decide on seq_cst and still add others later.
- WS: are we proposing a parameter for ordering which is only seq_cst for now or no param.
- BS: we have LEB that can be used in future.
- AC: what do we need in WebAssembly to have seq_cst
- BS: implementation dependent.
- WS: will we need barriers which might be slow
- BS: yes
- JF: current SAB doesn’t have other orderings and having it in WebAssembly might be weird. Also, having them now without too much thought can increase work for later if issues come up. Barely any compilers come up with any of those optimizations right now.
- Poll: Provide only seq_cst atomic operators for v1
- SF: 7
- F: 8
- N: 1
- A: 0
- SA: 0
- Consensus: for
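As context for the poll above (an illustrative C++ sketch; under this decision wasm would expose no ordering immediate at all), sequential consistency is also C++'s default, and weaker orderings are the explicit opt-in being deferred:

```cpp
#include <atomic>

std::atomic<int> flag{0};

// Illustrative only: C++ defaults to seq_cst, mirroring what v1 wasm
// atomics would provide; the relaxed form is the kind of weaker ordering
// deliberately left out of v1.
void seqCstStore()  { flag.store(1); }   // memory_order_seq_cst (default)
int  seqCstLoad()   { return flag.load(); }
void relaxedStore() { flag.store(1, std::memory_order_relaxed); }
```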
- Memory model standardization. Do we want to block on a formal memory model
- JF: ES doesn’t have a formal memory model (it is spec'd).
- JF: arm, intel and power didn’t have ‘mathematicized’ memory model till recently. C++11 had but even that had some holes.
- BN: constrained by SAB. If we found mistake in a proposed memory model adopted, we’ll change memory model not the implementation.
- WS: will there be no stmt about it or is it TBD
- BN: argue for TBD
- JF: weakly specified proposed memory model. Specifies the tearing for non aligned cases.
- BN: what would it be.
- JF: WG chair decides.
- BN: What does blocking mean?
- Wouldn’t stop implementors from implementing atomics.
- BN: for SAB there is quite a bit of text around what the ops do. Do we need that?
- JF: give to WG.
- BN: someone eager to push for something more formal
- No response
- Poll: Memory Model prose spec is a V1 blocker
- SF: 0
- F: 4
- N: 2
- A: 4
- SA: 0
- Consensus: No consensus
- LW: Maybe we should just reference ES doc, and note differences
- Poll: reference ES doc, and note differences
- SF: 5
- F: 7
- N: 0
- A: 0
- SA: 0
- Consensus: for
- Consensus to create a doc which references ES.
- BN: Eventually shared doc between ES and WebAssembly.
- Testing - What are the things we need to test for WebAssembly to feel satisfied
- JF: do we believe ML interpreter can properly implement this feature.
- JF: derive mathematical model that give you exhaustive list of outputs and then have tests to check if we have bad output.
- JF: we could use tests from others but they are weaker. Handwritten will never be exhaustive.
- CM: the method that we are using for the formal analysis of the ECMAScript memory model consists of: 1) formalizing the prose definition of the memory model into First Order Logic (FOL), 2) combining it with a formal representation of an ES program that relies on SABs, and 3) automatically extracting all assignments to reads-from, happens-before, memory-order, and synchronizes-with relations by using an SMT solver. Each assignment represents a valid execution, and it can be used to identify one of the (valid) outputs of the program provided in 2). This technique allows us to generate a (simple) JavaScript program that contains an assertion checking whether the produced output matches one of the valid executions (as in 3)). This program represents a litmus test that is run multiple times (e.g., millions of times) against a JS engine in order to expose as many behaviours as possible.
- Runs for 1 program at a time
- JF: what set of tests do we want for next meeting? So the poll is come up with some hand written tests for V1 or not.
- DG2: useful to have API coverage.
- JF: 2 polls - API coverage tests, tests that mostly cover the surface area (based on literature).
- Poll: API coverage test and some basic interaction
- SF: 11
- F: 2
- N: 0
- A: 0
- SA: 0
- Consensus: for
- Poll: Some litmus tests for v1 (non comprehensive) based on literature
- SF: 1
- F: 4
- N: 3
- A: 2
- SA: 0
- Consensus: weak. We will try to do it but can skip
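As an illustration of the litmus-test style CM described above (a classic store-buffering test, sketched here in C++ with `std::atomic` rather than JS + SharedArrayBuffer):

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Store-buffering litmus test: with seq_cst operations the outcome
// r1 == 0 && r2 == 0 is forbidden; observing it would indicate a memory
// model violation. Real litmus harnesses run this loop millions of times.
int main() {
    for (int i = 0; i < 100000; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread t1([&] { x.store(1); r1 = y.load(); });
        std::thread t2([&] { y.store(1); r2 = x.load(); });
        t1.join(); t2.join();
        if (r1 == 0 && r2 == 0)
            std::printf("forbidden outcome observed at iteration %d\n", i);
    }
    return 0;
}
```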
- In future we will host CG and also WG maybe 1 day out of it.
- Discuss tomorrow what does it mean to toss things over to WG.
Brad Nelson taking notes.
Carryover polls to clarify yesterday:
Poll: The current proposal implies that grow_memory of a shared WebAssembly Memory can result in SharedArrayBuffers backed by the same memory, but with different sizes. This is fine.
- RESULT: No objections
- LW: The key breaking change is if we want exported Globals to change behavior.
- MH: We're not likely to be able to change this quickly.
- MH: If we rename set_global to set_thread_local, we might want to have some other name for set_global_atomic.
- LW: Because there's only one instance per agent right now, there's no interesting distinction between these.
- BS: We currently have globals, and they don't map to the conventional definition of a global.
- LW: They're global relative to something.
- LW: A renaming that makes sense with a complete proposal seems like it would be fine.
- Presentation on SIMD
- Brad Nelson taking notes.
- JO: All common computing platforms have 128-bit SIMD now. (SIMD Availability slide)
- JF: ARM also has the wide SIMD extensions
- JO: RISC-V does not currently have SIMD, but they have reserved space for it.
- JO: Intel intrinsics can be translated to something portable if you stick to a particular subset.
- JO: Arm has DSP ops missing from Intel, Intel has [something] missing from arm.
- JO: In the proposal most of the non-intersection ops that we have exist as an op on ARM, but can be emulated
- Discussion
- RW: I've looked at some of the instructions, and some require non-trivial work to emulate on Intel. Mostly
- JF: For us what matters isn't the implementation complexity. What we actually care about is getting portable performance. That's what we want to focus the objectives of the SIMD proposal on.
- RW: There are 256-bit ops that could be emulated fairly easily on ARM.
- JF: Worst case / scalar performance is the key thing.
Slides resume (Portable Subset)
- JZ: Entropy coding can benefit from get MSB / BSR.
- JO: We can add it, we need to investigate if it exists everywhere.
- JZ: BSR would be most useful of the bit counting set for video/image
- JO: Wouldn't variable shifts be sufficient for arithmetic coding?
- JZ: Not currently.
- JO: We shouldn't write these off.
Presentation resumes (Omitted)
- JB: We should be sure we have someone that wants floating point rounding if we're going to add it.
- JO: We should look at which arches have these; ARMv7 is missing some of them.
Presentation resumes (Candidates)
- RW: Have you looked at the most popular instructions? Horizontal adds might not be a perf win. I have permission to provide a list of most popular. Here are the ones typically used.
- JF: Can you provide us with static as well as dynamic instruction count? We care about different classes of applications too.
- RW: I need to look into what I can share.
- BN: Can you tell me where these counts are from?
- RW: That's sensitive. We can probably bin them.
- DG2: We gathered data primarily from porting games for SIMD.js
- RW: If folks want to email me, I can collate them.
- JF: If we could have a public list that has both static and dynamic count that'd be great.
- MB: How easily can this be extended to 256 / 512 bit SIMD?
- JO: I would say Intel is going that direction, unclear if others will follow. ARM has a scalable vector extension. It's a very different programming model.
- RW: If we chose particular operations that are "really good ones" we could emulate those on ARM.
- JO: My opinion is that longer vectors is something that primarily benefits Intel. The priority is to lift lower end devices.
- RW: Longer vectors benefit unrolling primarily.
- JF: Not sure I agree about wider vectors. Say you also target a CPU that doesn't have wider types, it seems like that wouldn't be so bad. But prior experience suggests that you change your algorithm to match the stride of the vector type. I worry people will tune on a laptop and see worse perf on a phone.
- RW: Maybe not data structures, but register pressure would be the problem.
- JF: Once WebAssembly has JIT capabilities, we'll see things like Halide be ported. That would be pretty neat. We should punt past v1. If you have numbers, willing to reconsider.
- BN: To clarify you'd be comfortable supporting arch specific SIMD if high level languages become available that can adaptively target each arch.
- JF: Because the user doesn't have control at the level they'd need they're likely to tune to one arch.
- BN: Looking for confidence that model handles higher level abstractions.
- JO: What is halide?
- JF: Project by MIT for image processing and computational photography
- DG2: We should add optional features gated on higher level abstractions.
- JF: 128-bit SIMD is kind of magical as it happens to be where everyone intersects.
- JZ: >128 is not critical for video codecs, but can help a little.
- JZ: Question about widening and narrowing: are there any that convert between signed and unsigned while widening, we’ve found these to be useful
- JO: Can be considered if useful
- SB: Gonna channel inner Fil :-) Thinking about this proposal, wondering what data has been gathered.
- JF: This item would be highly influenced by later items about performance. Though I think 128-bit SIMD is the right place to start. We need to show this is a gain on multiple arches.
- SB: I think the performance should guide what we build.
- JF, SB, JO: figuring out what perf measurement we should use.
- BN: we want to avoid microbenchmarks.
- KM: so we need to measure well.
- JF: Probably good to figure out what class of applications are faster with a certain subset of instructions - and it would be useful to have that data to make decisions on what should be standardized for SIMD
- BN: how do we know when we’re done? What do we consider? What’s the bar?
- JF: How many apps do we want to have before we can decide?
- JF: These instructions seem separable. What’s the minimum core that’s useful, but not just a joke?
- JO: We should have logical subsets of instructions we should include instead of scattered instructions
- JF, BN: We should have a poll to decide which sizes should be included: 128, 256?
- Discussion about having publicly available data
- RW: It would be nice to see a real venn diagram w/ actual instructions
- BN: trouble is that we don’t have enough portable SIMD examples
- PJ: It is hard to find portable benchmarks, but 128 bit SIMD is a no brainer because they are widely supported
- SB: disagree with the premise; if we can’t measure a speedup, what’s the point?
- SB, BN: Discussion about coverage, hand done applications, microbenchmarks being a valid performance measurement
- PJ: could rely on auto-vectorizers for portable examples
- DG2: The benefit is that they give a data point, we have data for auto-vectorizers that perform better than the scalar code
- SB: We would be interested in seeing that data - are these real programs?
- DG2: Yes, real programs. You kind of have to try hard to make simple SIMD examples not go faster
- JF: What kind of data do we really want to collect? What are blockers for V1?
- DG2: Is the fixed width SIMD the right way to go? There seems to be consensus, the second thing we need to agree on is the subset of operations
- JO: my goal, I want assigned instruction numbers -- something we can build and then use to benchmark. We don’t need the CG to accept SIMD today.
Break
- DG2: If we get speed ups (but no regression) is that ok?
- JF: Maybe not.
- WS: Are SIMD and threads separable?
- LW: Yes
- KM: They can be shipped independently.
Discussion about how to construct the poll.
Discussion about biases in performance data.
Poll: In order for SIMD to go forward as part of WebAssembly, distinct SIMD groups of instructions (justified by data) need to be identified and justified separately. Tentatively these categories may be similar to the groups on Jakob's slides. Instructions might be in multiple groups.
- SF: 14
- F: 3
- N: 1
- A: 0
- SA: 0
- Strong consensus is present.
Discussion about SIMD being a “foot-gun”.
Discussion about SIMD/Threads being features that developers write directly vs. compilation target
Poll: Performance wins should be positive across multiple relevant arches within an instruction group. Individual operations within an instruction group might be neutral / negative on some arches, but the aggregate group should be a win. The strength of performance evidence required increases when an op is clearly negative on some arch.
- SF: 3
- F: 14
- N: 2
- A: 2
- SA: 0
- Folks against: There are classes of applications that might only make sense on desktop. Doing better on Intel even if neutral here might be fine.
- I don't see the point in setting this requirement now.
- JO&JF: This is helpful to allow me to figure out what to focus on for V1.
- JO: For WebAssembly SIMD want to begin with a single 128-bit value type (there are boolean types).
- BN: Integers vs Floating point have different variants on Intel. Historically these had a perf difference. Do we care?
- JO: Earlier implementation (nehalem) had differences between andps vs andpd in perf. On newer CPUs the difference is small, 1 cycle. Not completely sure on perf difference. In my opinion the abstraction level in WebAssembly is too high to do that. Shuffles for example end up mixing int / float. I would argue the code that needs to select these is the engine compiler / not the format.
The SIMD proposal uses a single v128 value type to represent a 128-bit SIMD vector.
Poll: Adopt a v128 value type along with basic operations v128.const, v128.load, v128.store as well as bitwise logic instructions v128.and, v128.or, v128.xor, and v128.not. This would be the sole SIMD value type.
- SF: 7
- F: 9
- N: 1
- A: 0
- SA: 0
- Consensus for this.
Predicate vectors are represented with separate value types that don't have an in-memory representation. This allows implementations to choose their own representation. The `v*.select` operations are the primary consumers of predicate vectors.
- JO: AVX has an explicit concept of boolean vectors. (Thus this is future looking).
- JF: This is different than how WebAssembly does booleans now (without a true boolean type).
- JO: These types can never be load/stored to memory. Arm and Intel can represent these as 128-bit values to represent these.
- JF: This would be for local, globals, stack value, and function signatures.
- VB: Could we allow extract lane to return false as all zeros, and true as either lowest position one or all ones?
- MB: What is the default value?
- JO: Good point we should spec zero as the default.
- JF: Specifically to address VB's question could we pull that out into a separate poll?
- BN, DG2: discussion about extract lane for boolean vector; is it needed? Can use boolean vector select instead?
- LW: We could focus just on 128-bit now, and do something special later.
- PJ: Would it be useful to have a single boolean type that leaves the unused bits alone.
- JO: There would be undefined places that result from this.
- WS: Are these validatable?
- JO: These are type checked at compile time. He was suggesting a single boolean. That would result in undefined values.
- JO: We could have a b128 vector that represents booleans in a fashion that matches existing arches.
- LW: Is there any other portable definition that works across the two?
- JO: No, ARM has a bitwise select, whereas intel uses the most significant bit.
- LW: Hiding this is the main reason to have this, not the future facing thing.
- JO: We could do something that works without this but it would penalize either ARM or Intel (for a naive implementation). A simple peephole optimization would allow us to recover these.
- RW: This proposal has the advantage of covering the issue now, and is future looking.
- MH: I look forward and see a lot more types coming.
- JO: These types have no load and store operations.
- BS: There would be parameters, locals, globals.
- AG: How would compares work.
- JO: We would return a V128.
- JF: What would we explore without it?
- JO: Compare would return V128, select could be either a bitwise op like arm, or we could provide a select like Intel (top bit).
- JF: From the discussion we just had it seems like this won't get consensus. It may be AVX etc. might want something else.
- JO: The undefined behavior for having a single boolean vector type comes from what happens with the bit not used in the types with fewer lanes.
- BN: Is the pushback mainly about complexity?
- MH: Don’t want to add that many local types
Discussion: These are just general abstractions, complexity of validation
Poll: Adopt the b8x16, b16x8, b32x4, and [b64x2] boolean types along with the basic operations. Support `b*.const`, `b*.splat`, as well as the logical operations `b*.and`, `b*.or`, `b*.xor`, `b*.not`, and `v*.select`.
- SF: 3
- F: 4
- N: 6
- A: 1
- SA: 2
- No consensus.
WS: I voted against because the engines can typically optimize this regardless.
Poll: We should use a V128 type to represent boolean results such as compares using either the Arm or Intel convention (one at Jakob's discretion).
- SF: 2
- F: 3
- N: 6
- A: 3
- SA: 1
- No Consensus.
Floating-point arithmetic is supported by the `f32x4.*` and `f64x2.*` operations. The inclusion of div and sqrt will be discussed later, as will the handling of subnormal numbers.
ARMv7 NEON supports f32x4 arithmetic with subnormal numbers flushed to zero only, and f64x2 operations must be implemented with pairs of scalar VFP instructions.
Poll: Adopt the following `f32x4.*` operations:
- Basics: `f32x4.{splat,extract_lane,replace_lane}`.
- Comparisons: `f32x4.{eq,ne,lt,le,gt,ge}`.
- Arithmetic: `f32x4.{neg,abs,min,max,add,sub,mul}`.
Results:
- SF: 8
- F: 4
- N: 3
- A: 0
- SA: 0
- Solid support.
Poll: Adopt the following `f64x2.*` operations:
- Basics: `f64x2.{splat,extract_lane,replace_lane}`.
- Comparisons: `f64x2.{eq,ne,lt,le,gt,ge}`.
- Arithmetic: `f64x2.{neg,abs,min,max,add,sub,mul}`.
Results:
- SF: 3
- F: 4
- N: 9
- A: 1
- SA: 0
- Mild consensus to explore.
- Objection: This isn't the highest priority for v1. Solid numbers would help convince this is worth doing.
Small integer arithmetic is provided by the `i8x16.*`, `i16x8.*`, and `i32x4.*` operations. These operations are widely supported by both Intel and ARM ISAs. Integer division is not included since it is not widely supported.
- JF: Why did we add the narrower types?
- DG2: We started adding codecs as use cases.
Poll: Adopt the following `i8x16.*`, `i16x8.*`, and `i32x4.*` operations:
- Basics: `{i8x16,i16x8,i32x4}.{splat,extract_lane*,replace_lane}`.
- Arithmetic: `{i8x16,i16x8,i32x4}.{add,sub,mul,neg}`.
- Shifts: `{i8x16,i16x8,i32x4}.{shl,shr_s,shr_u}`.
- Equalities: `{i8x16,i16x8,i32x4}.{eq,ne}`.
- Inequalities: `{i8x16,i16x8,i32x4}.{lt,le,gt,ge}_[su]`.
Results:
- SF: 9
- F: 3
- N: 5
- A: 0
- SA: 0
- Consensus to explore.
The i64x2 operations are more limited, but they are still useful for bitwise manipulation of f64x2 vectors. Division and multiplication are not widely available. Comparisons exist in SSE 4.2 and ARMv8 only.
Poll: Adopt the following `i64x2.*` operations:
- Basics: `{i64x2}.{splat,extract_lane*,replace_lane}`.
- Arithmetic: `{i64x2}.{add,sub,neg}`.
- Shifts: `{i64x2}.{shl,shr_s,shr_u}`.
Results:
- SF: 3
- F: 3
- N: 9
- A: 1
- SA: 0
- Objection, added complexity not justified.
- Seems a lower priority for v1.
64-bit comparisons will be slow on ARMv7 and Intel chips without SSE 4.2.
Poll: Adopt the following `i64x2.*` operations:
- Equalities: `{i64x2}.{eq,ne}`.
- Inequalities: `{i64x2}.{lt,le,gt,ge}_[su]`.
The 64-bit multiplication is implemented in AVX-512, but would be slower everywhere else. All ISAs have 32 x 32 -> 64 multiplies (pmuludq, umull) that can be used to implement this operation.
Results:
- SF: 1
- F: 4
- N: 9
- A: 3
- SA: 0
- No consensus, but not objection either.
- JO: The case for this one is that without it, if you want to do a 64-bit multiply, you'd basically have to drop to scalar.
- MB: Would it make sense to have a wide multiply instruction?
- JO: Possibly.
- MB: Is wide multiplication more generic?
- JO: No.
- WS: Does this give a 128-bit result?
- JO: No (it's clipped to 64-bit) 2x64-bit.
- LW: Have we considered doing i32xi32 -> i64
- JO: For scalar that can be optimized easily.
- JO: Introducing widening would require us to come up with a way to express this in source. Whereas this op is more expressible in C code.
Poll: Adopt the `i64x2.mul` operation.
Results:
- SF: 0
- F: 0
- N: 10
- A: 3
- SA: 0
- Slightly negative consensus.
- WS: Older chips?
- JO: Yes, available everywhere.
Saturating integer arithmetic is widely available for 8-bit and 16-bit lanes, but not for i32x4.
Poll: Adopt the saturating integer arithmetic operations {i8x16,i16x8}.{add,sub}_saturate_[su].
Result:
- SF: 8
- F: 7
- N: 3
- A: 0
- SA: 0
- Solid consensus.
- JF: Are these same width?
- JO: Yes.
- WS: Is it universally available.
- JO: On arm yes. On intel signed only.
- BB: Rounding mode?
- JO: Round to zero (float to int), round to nearest (int to float).
Integer-to-float conversions are always non-trapping.
Poll: Adopt the `f*.convert_[su]/i*` integer-to-float conversions.
Result:
- SF: 5
- F: 5
- N: 6
- A: 0
- SA: 0
- Solid Consensus
When converting floating-point numbers to integers, we can choose to trap or saturate on overflow.
- BS: How would you trap?
- DG&JO: You'd add compares to decide when to trap. Intel would require more ops to do this.
- BS: Could this be used for debugging?
- JO: Maybe.
- DG2: This allow people to explicitly rule out a class of rounding mistakes.
- PJ: Does just one lane with an error trap?
- JO: Yes.
- WS: In the trap are we going to define what's available for inspection.
- JF: Trap would be the same as for the rest of WebAssembly.
- DG2: We might add, but not right now.
Poll: Adopt the trapping `i*.trunc_[su]/f*` float-to-integer conversions.
Results:
- SF: 0
- F: 1
- N: 8
- A: 4
- SA: 0
- Weak consensus against.
- WS: How useful is this?
- JO: Medium.
- RW: You'll hit it.
- DG2: We've seen it.
- JO: C defines this as undefined behavior, so you can't use it without intrinsics.
- BS: If you didn't have this you'd have to do an explicit check?
- Others: Yes.
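The "explicit check" BS mentions looks roughly like this scalar sketch, which guards the core (trapping) conversion; vector code would need an equivalent compare-and-select per lane:

```wat
(module
  ;; Range-check before a trapping conversion: NaN and
  ;; out-of-range inputs yield 0 instead of trapping.
  (func (export "checked_trunc") (param $x f32) (result i32)
    (if (result i32)
        (i32.and
          (f32.ge (local.get $x) (f32.const -2147483648))
          (f32.lt (local.get $x) (f32.const 2147483648)))
      (then (i32.trunc_f32_s (local.get $x)))
      (else (i32.const 0)))))
```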
Poll: Adopt the saturating `i*.trunc_[su]/f*:sat` float-to-integer conversions.
Results:
- SF: 3
- F: 8
- N: 3
- A: 0
- SA: 0
- Strong consensus.
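For comparison with the trapping form above, here is the saturating conversion as it was eventually standardized (`i32x4.trunc_sat_f32x4_s`; the proposal's spelling with `:sat` differs):

```wat
(module
  ;; Saturating float -> int: NaN lanes become 0 and
  ;; out-of-range lanes clamp to INT32_MIN/INT32_MAX.
  (func (export "to_int") (param $v v128) (result v128)
    (i32x4.trunc_sat_f32x4_s (local.get $v))))
```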
SIMD binary encoding. Tracking issue: SIMD binary encoding #14. The current SIMD proposal defines 193 new operations, and it is likely that more SIMD operations will be added in the future. We need an extensible way of encoding the opcodes for these operations.
Poll: SIMD opcodes are encoded as 0xfd varuint32, where 0xfd indicates a SIMD operation and the varuint32 number identifies the specific operation. This should be coordinated with the threads proposal, which currently uses a 0xfe prefix byte followed by an opcode byte in the range 0x00 - 0x7a.
Results:
- Unanimous consensus
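To make the varuint32 scheme concrete, here is a small, purely illustrative encoder in the text format (a hypothetical helper, not part of any proposal). Opcode numbers below 128 cost one byte after the 0xfd prefix; larger ones cost more, e.g. opcode 130 encodes as 0x82 0x01, so the full instruction would be the three bytes 0xfd 0x82 0x01.

```wat
(module
  (memory (export "mem") 1)
  ;; Encode $n as an unsigned LEB128 (varuint32) at offset $at;
  ;; returns the number of bytes written.
  (func (export "leb128") (param $n i32) (param $at i32) (result i32)
    (local $len i32)
    (loop $next
      ;; Emit the low 7 bits with the continuation bit set.
      (i32.store8 (i32.add (local.get $at) (local.get $len))
        (i32.or (i32.and (local.get $n) (i32.const 0x7f)) (i32.const 0x80)))
      (local.set $n (i32.shr_u (local.get $n) (i32.const 7)))
      (local.set $len (i32.add (local.get $len) (i32.const 1)))
      (br_if $next (local.get $n)))
    ;; Clear the continuation bit on the final byte.
    (i32.store8 (i32.add (local.get $at) (i32.sub (local.get $len) (i32.const 1)))
      (i32.and
        (i32.load8_u (i32.add (local.get $at) (i32.sub (local.get $len) (i32.const 1))))
        (i32.const 0x7f)))
    (local.get $len)))
```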
The new value types also need assigned numbers.
Poll: Assign the following numbers to the new SIMD value types:
- 0x7b => v128
- 0x7a => b8x16
- 0x79 => b16x8
- 0x78 => b32x4
- 0x77 => b64x2
Results:
- Unanimous consent
Allow flushing of subnormals in floating point SIMD operations
- JF: This could be done in a conformant way on ARMv7 by scalarizing (or JIT tricks).
- BN: Maybe.
- LW: Could we always flush to zero?
- JO: Not without affecting scalar.
- MB: Would there be a way to canonicalize these easily?
- JO & DG2: No. Not without scalarizing.
- DG2: We could support a flush-for-everything mode. Some sort of scoped construct.
- BN: Scoped seems nice.
- LW: Scoped lets you decide when to pay.
- JO: People typically want fast, not a particular behavior.
- JF: My view on subnormals is that nobody cares.
- DG2: I support non-determinism for this use case (for speed), because otherwise ARMv7 is unusable.
- JF: I'd want to see perf measures on how bad this is for ARMv7.
Tracking issue: Allow flushing of subnormals in floating point SIMD operations #2. ARMv7 NEON devices support f32x4 arithmetic, but only with subnormals flushed to zero. On these devices, f64x2 arithmetic needs to be implemented with the slower VFP instruction set which does support subnormals correctly, so the problem only applies to f32x4 operations.
Poll: Adopt the proposed specification text which allows SIMD floating-point instructions to flush subnormals to zero.
Results:
- Decided a poll would be non-informative.
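The poll was dropped, but the behavioral question is easy to state as code. A minimal probe one could run on real hardware, written with the standardized names (`f32.reinterpret_i32` builds the subnormal bit pattern):

```wat
(module
  ;; Bit pattern 0x00000001 is the smallest positive subnormal f32
  ;; (~1.4e-45). Under the proposed allowance, multiplying it by 1.0
  ;; in a SIMD lane may legally return either the subnormal itself
  ;; or +0.0, depending on the hardware.
  (func (export "probe") (result f32)
    (f32x4.extract_lane 0
      (f32x4.mul
        (f32x4.splat (f32.reinterpret_i32 (i32.const 1)))
        (f32x4.splat (f32.const 1.0))))))
```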
Reciprocal (sqrt) approximation instructions
- JF: Besides NaN, subnormals, and threads, this would be the first place we offer non-determinism.
- MH: How about if we specify an error bound?
- JF: I'd want to see a perf win.
- BN & MH & JF: Agree that non-determinism increases the perf bar to justify it.
Tracking issues: Implementation-dependent reciprocal (sqrt) approximation instructions #3 and Eliminate divide, square root #13. Floating-point divide and square root instructions are slow. They take many cycles to complete, and they are not fully pipelined, so they block the ALU while they are executing. Further, ARMv7 NEON does not provide vectorized floating-point divide and square root instructions, so they have to be implemented as a sequence of scalar VFP instructions. All SIMD ISAs provide fast f32x4 approximation instructions that compute 9-12 bits of 1/x or 1/sqrt(x) respectively. These instructions are fully pipelined and typically as fast as a floating-point addition. They don't provide the exact same approximation on different platforms, though.
Poll: Adopt f32x4 1/x and 1/sqrt(x) approximation instructions.
Results:
- SF: 3
- F: 5
- N: 3
- A: 4
- SA: 0
- Mild consensus, but a higher perf bar to justify it.
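For context on why 9-12 bits of approximation can still be useful: a common pattern is to refine the estimate with one Newton-Raphson step, roughly doubling the number of correct bits. A sketch in the text format, assuming a hypothetical `$r0` estimate that would come from the approximation instruction under discussion:

```wat
(module
  ;; One Newton-Raphson step for a reciprocal:
  ;; r1 = r0 * (2 - x * r0).
  (func (export "refine_recip") (param $x v128) (param $r0 v128) (result v128)
    (f32x4.mul (local.get $r0)
      (f32x4.sub
        (f32x4.splat (f32.const 2.0))
        (f32x4.mul (local.get $x) (local.get $r0))))))
```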
The f64x2 versions of the approximation instructions are only available in AVX-512 and ARMv8's A64 mode.
Poll: Adopt f64x2 1/x and 1/sqrt(x) approximation instructions.
Results:
- JF: We should have one if we have the other.
- No other champions. No poll.
- BB: Scheduled closely together, these force bad perf vs scalar.
- JF: Bad scheduling is not a good reason not to.
- PJ: We have evidence this gets used: Bullet physics, etc. See issue 13 on the SIMD tracker.
- DG2: It does seem like we should push folks towards fast things.
- DG2: We could offer an emscripten option.
- WS: Aren't there operations where people want more precision?
It has also been proposed to remove the exact division and square root instructions since they are slow.
Poll: Remove the f32x4.div and f64x2.div instructions.
Results:
- SF: 0
- F: 0
- N: 9
- A: 4
- SA: 2
- Weak consensus against.
- JO: ARMv8 has it; ARMv7 doesn't.
Presumably, the answer to these questions should depend on whether the approximation instructions are adopted.
Poll: Remove the f32x4.sqrt and f64x2.sqrt instructions.
Results:
- Consensus we'd have the same result as on div.
Alternative to Swizzle / Shuffle. Tracking issue: Alternative to Swizzle / Shuffle #8. The initial SIMD proposal contains fully general swizzle and shuffle instructions that take a sequence of lane indexes as immediate operands. An alternative proposal is to only define a subset of the possible swizzle and shuffle operations that are known to map to fast instructions. The v8x16.shuffle instruction is pivotal since its functionality subsumes all the other swizzle and shuffle instructions. It can be implemented in terms of pshufb or vtbl instructions that are quite fast, see the GitHub issue for details.
Poll: Adopt the v8x16.shuffle instruction.
Result: Discussed below; no vote.
The v8x16.shuffle instruction is quite large with its 16 bytes of immediate operands. Most of the time, shuffles with larger granularity suffice, and they can be encoded with smaller immediate operands (one byte per lane).
Poll: Adopt the {v16x8,v32x4,v64x2}.shuffle instructions.
Result: Discussed below; no vote.
The `v*.swizzle(x)` instructions pick lanes from a single input vector instead of two. Their functionality is subsumed by `v*.shuffle(x, x)`, which requires a `get_local`.
Poll: Adopt the `v*.swizzle` instructions.
Result: Discussed below; no vote.
SIMD ISAs have various shuffle instructions that are faster than the fully general pshufb and vtbl instructions. An implementation is free to use these instructions for the corresponding immediate operands on v8x16.shuffle. The alternative proposal in the tracking GitHub issue is to define a set of fixed shuffle instructions that are known to map to these fast instructions on all ISAs.
Poll: Define a small set of primitive permutations that we know can be implemented efficiently.
Result: Discussed below; no vote.
- WR: When are you seeing issues with several PSHUFBs?
- JZ: With several in rapid succession (on different ports). You hit the issue on lower-powered Intel chips.
- JZ: If your baseline has no optimizations you can fall short.
- JO: I imagine we'd start with mapping to a PSHUFB, but we can go back and optimize.
- BS: Would this encourage folks to use alternatives?
- BN: There aren't really any.
- JZ: Would people adapt what they do? Probably. I don't strictly speaking need a general shuffle, as long as I have unpacks and ORs.
- BN: Sounds like implementers should note this as an optimization point.
- JZ: Yes.
- LW: Add a non-normative note to encourage implementers to optimize a particular set of shuffle patterns.
- JF: Unless code size is an issue, I don't care.
Consensus: Do the general shuffle. We can come back and discuss macros for the common ones.
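A sketch of the general shuffle, using the immediate-operand form as it was later standardized (`i8x16.shuffle`; this proposal calls it `v8x16.shuffle`). This particular 16-byte pattern interleaves the low 8 bytes of the two inputs, exactly x86's punpcklbw: one of the "fast" fixed shuffles an engine can pattern-match from the immediates.

```wat
(module
  ;; Lane indices 0-15 select from $a, 16-31 from $b.
  (func (export "interleave_lo") (param $a v128) (param $b v128) (result v128)
    (i8x16.shuffle 0 16 1 17 2 18 3 19 4 20 5 21 6 22 7 23
      (local.get $a) (local.get $b))))
```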
Packed horizontal addition. Tracking issue: Consider adding Horizontal Add #20. Packed horizontal additions are natively supported on ARMv8/SSE3. They would be useful for complex multiplication; without these opcodes, the same result requires a combination of shifts and adds. The issue proposes adding horizontal addition for f32x4, i32x4, and i16x8, with a potential addendum for the 64x2 types.
Poll: Include `v*.addHoriz` instructions. (This is just the pairwise addition.)
Poll: Include instructions for horizontal operations apart from the horizontal addition mentioned above.
Results:
- Consensus: Do it later. Strongly neutral.
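For reference, pairwise addition did eventually land in the finished SIMD spec, in a widening form rather than the `v*.addHoriz` spelling polled here. A sketch using the standardized name:

```wat
(module
  ;; Adjacent signed i8 lanes are summed into widened i16 lanes
  ;; (cf. ARM's vpaddl).
  (func (export "pairwise") (param $v v128) (result v128)
    (i16x8.extadd_pairwise_i8x16_s (local.get $v))))
```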
Conversions between integer vectors of different lane size
- JO: This is good for codecs?
- JZ: Yes.
- BB: This was at one point in simd.js.
- DG2: Yes. It didn’t yet have all the saturating forms, but we were moving in that direction.
- JZ: Signed16 to Unsigned8 would be most useful.
- JO: This is only for ones of the same sign.
- JZ & WR: Instructions exist to do this.
- WR: One way is better for us (Intel).
- JZ: Minimally Unsigned8 to Signed16 and back.
- JZ: Equivalent to PACK and UNPACK on Intel, narrowing on ARM.
- WR: They're useful, but we want to limit it to the saturating ones.
- JZ: Having all of the PACKs is important. The others are for parity.
Tracking issue: Add opcodes for converting between integer vectors of different lane size. #21
Poll: Include integer narrowing, and widening conversion instructions with signed, and unsigned saturation.
Results:
- Bring this issue to GitHub.
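If the GitHub discussion follows the PACK direction JZ describes, the result might look like the narrowing instruction that was eventually standardized:

```wat
(module
  ;; Narrowing with signed saturation: each i16 lane is clamped to
  ;; [-128, 127] and the two inputs are packed into one i8x16 result
  ;; (cf. PACKSSWB on x86, SQXTN on ARM).
  (func (export "narrow") (param $a v128) (param $b v128) (result v128)
    (i8x16.narrow_i16x8_s (local.get $a) (local.get $b))))
```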
Talk about AllTrue, AnyTrue
- JO: These should have been part of the boolean types.
- JO: If we're going to do a v128 implementation, we need to decide how this gets done.
- DG2: It's fast on Intel, but requires a manual reduction on ARM.
- JF: Are these tied to the horizontal instructions?
- DG2: No, the use cases are different.
- BS: Searching for a null uses it.
- JO: Mandelbrot uses it.
Comment: These are strongly tied to whatever we do with boolean instructions. Add back into whatever proposal we end up with.
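For illustration, here is the null-search use BS mentions, written with the reduction as it was eventually standardized (`v128.any_true` on a plain v128 mask) rather than the boolean-typed form discussed here:

```wat
(module
  ;; A strlen-style scan tests 16 bytes per iteration; the loop
  ;; ends when any lane equals zero.
  (func (export "has_zero_byte") (param $chunk v128) (result i32)
    (v128.any_true
      (i8x16.eq (local.get $chunk) (i8x16.splat (i32.const 0))))))
```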
Discussion of upcoming meetings. Kirkland, TPAC, then maybe Europe.
Brad presents proposed CG/WG working process.
AI Brad Nelson: Convert the working process document into a pull request against the meetings repo, to discuss the issue of how the WG can give some acknowledgement of a proposal before browsers launch (or whether this should even be called out).