VP4DPWSSD
VP4DPWSSD ZMM{K}{Z},ZMM,M128
[AVX512_4VNNIW] Multiply signed words from source register block indicated by zmm2 by signed words from m128 and accumulate resulting signed dwords in zmm1.
This instruction computes 4 sequential register source-block dot-products of two signed word operands with doubleword accumulation; see Figure 5-1 below. The memory operand is sequentially selected in each of the four steps. In the above box, the notation “+3” denotes that the instruction accesses 4 source registers based on that operand; the sources are consecutive, start on a multiple-of-4 boundary, and contain the encoded register operand. This instruction supports memory fault suppression. The entire memory operand is loaded if any bit of the lowest 16 bits of the mask is set to 1 or if a “no masking” encoding is used. The tuple type T1_4X implies that four 32-bit elements (16 bytes) are referenced by the memory operation portion of this instruction.
src_reg_id is the 5-bit index of the vector register specified in the instruction as the src1 register.
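The source-block rule above (four consecutive registers, starting on a multiple-of-4 boundary, containing the encoded register) can be illustrated with a minimal Python sketch. The function name `source_block` is hypothetical and only models the index arithmetic, not the instruction encoding itself.

```python
N = 4  # registers per source block, as in the operation pseudocode


def source_block(src_reg_id: int) -> list[int]:
    """Return the indices of the four registers read for the src1 operand."""
    if not 0 <= src_reg_id < 32:
        raise ValueError("src_reg_id is a 5-bit index (0..31)")
    src_base = src_reg_id & ~(N - 1)  # clear the low two bits
    return [src_base + m for m in range(N)]


# Encoding zmm6 as src1 selects the block zmm4..zmm7.
print(source_block(6))
```

Any of zmm4, zmm5, zmm6 or zmm7 as src1 therefore names the same block in this model.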
VP4DPWSSD dest, src1, src2
(KL,VL) = (16,512)
N ← 4
ORIGDEST ← DEST
src_base ← src_reg_id & ~(N-1) // for src1 operand
FOR i ← 0 to KL-1:
    IF k1[i] or *no writemask*:
        FOR m ← 0 to N-1:
            t ← SRC2.dword[m]
            p1dword ← reg[src_base+m].word[2*i] * t.word[0]
            p2dword ← reg[src_base+m].word[2*i+1] * t.word[1]
            DEST.dword[i] ← DEST.dword[i] + p1dword + p2dword
    ELSE IF *zeroing*:
        DEST.dword[i] ← 0
    ELSE
        DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
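The operation above can be modelled in software roughly as follows. This is a sketch, not a reference implementation: the function `vp4dpwssd`, its argument layout (registers as lists of 32 signed words, SRC2 as 8 signed words) and the helper `to_i32` are all hypothetical, and the 32-bit wraparound on accumulation is an assumption, since the pseudocode does not say how overflow is handled.

```python
KL, N = 16, 4


def to_i32(x: int) -> int:
    """Wrap a Python int to a signed 32-bit value (assumed overflow behaviour)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def vp4dpwssd(dest, regs, src_reg_id, src2, k1=None, zeroing=False):
    """dest: 16 signed dwords; regs: 32 registers of 32 signed words each;
    src2: the m128 operand as 8 signed words; k1: 16-bit mask or None."""
    src_base = src_reg_id & ~(N - 1)
    out = list(dest)
    for i in range(KL):
        if k1 is None or (k1 >> i) & 1:
            acc = dest[i]
            for m in range(N):
                r = regs[src_base + m]
                # t = SRC2.dword[m] holds words 2*m (low) and 2*m+1 (high)
                acc += r[2 * i] * src2[2 * m] + r[2 * i + 1] * src2[2 * m + 1]
            out[i] = to_i32(acc)
        elif zeroing:
            out[i] = 0
        # otherwise merge-masking: out[i] keeps the original dest[i]
    return out


# Usage: zmm4..zmm7 filled with ones, memory operand holding 1..8.
regs = [[0] * 32 for _ in range(32)]
for m in range(4):
    regs[4 + m] = [1] * 32
src2 = [1, 2, 3, 4, 5, 6, 7, 8]
result = vp4dpwssd([0] * 16, regs, 5, src2)
print(result)
```

With all source words equal to 1, every destination lane accumulates the sum of the eight memory words.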
That boils down to 128 (16-bit) signed word multiplications and 128 (32-bit) signed dword additions:
reg[src_base+0] = [ a0, a1, ..., a31]
reg[src_base+1] = [ b0, b1, ..., b31]
reg[src_base+2] = [ c0, c1, ..., c31]
reg[src_base+3] = [ d0, d1, ..., d31]
SRC2 = [m0, m1, ..., m7]
DEST.i32[0] += a0 * m0 + a1 *m1 + b0 *m2 + b1 *m3 + c0 *m4 + c1 *m5 + d0 *m6 + d1 *m7
DEST.i32[1] += a2 * m0 + a3 *m1 + b2 *m2 + b3 *m3 + c2 *m4 + c3 *m5 + d2 *m6 + d3 *m7
DEST.i32[2] += a4 * m0 + a5 *m1 + b4 *m2 + b5 *m3 + c4 *m4 + c5 *m5 + d4 *m6 + d5 *m7
DEST.i32[3] += a6 * m0 + a7 *m1 + b6 *m2 + b7 *m3 + c6 *m4 + c7 *m5 + d6 *m6 + d7 *m7
DEST.i32[4] += a8 * m0 + a9 *m1 + b8 *m2 + b9 *m3 + c8 *m4 + c9 *m5 + d8 *m6 + d9 *m7
DEST.i32[5] += a10* m0 + a11*m1 + b10*m2 + b11*m3 + c10*m4 + c11*m5 + d10*m6 + d11*m7
DEST.i32[6] += a12* m0 + a13*m1 + b12*m2 + b13*m3 + c12*m4 + c13*m5 + d12*m6 + d13*m7
DEST.i32[7] += a14* m0 + a15*m1 + b14*m2 + b15*m3 + c14*m4 + c15*m5 + d14*m6 + d15*m7
DEST.i32[8] += a16* m0 + a17*m1 + b16*m2 + b17*m3 + c16*m4 + c17*m5 + d16*m6 + d17*m7
DEST.i32[9] += a18* m0 + a19*m1 + b18*m2 + b19*m3 + c18*m4 + c19*m5 + d18*m6 + d19*m7
DEST.i32[10] += a20* m0 + a21*m1 + b20*m2 + b21*m3 + c20*m4 + c21*m5 + d20*m6 + d21*m7
DEST.i32[11] += a22* m0 + a23*m1 + b22*m2 + b23*m3 + c22*m4 + c23*m5 + d22*m6 + d23*m7
DEST.i32[12] += a24* m0 + a25*m1 + b24*m2 + b25*m3 + c24*m4 + c25*m5 + d24*m6 + d25*m7
DEST.i32[13] += a26* m0 + a27*m1 + b26*m2 + b27*m3 + c26*m4 + c27*m5 + d26*m6 + d27*m7
DEST.i32[14] += a28* m0 + a29*m1 + b28*m2 + b29*m3 + c28*m4 + c29*m5 + d28*m6 + d29*m7
DEST.i32[15] += a30* m0 + a31*m1 + b30*m2 + b31*m3 + c30*m4 + c31*m5 + d30*m6 + d31*m7
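The expansion above follows a regular index pattern, which the short Python sketch below regenerates and uses to check the count of multiplications. The helper `lane_terms` and the letters a..d for the four source registers are illustrative names taken from the listing, not part of the instruction definition.

```python
KL, N = 16, 4
NAMES = "abcd"  # one letter per register in the source block


def lane_terms(i: int) -> list[str]:
    """Products accumulated into DEST.i32[i], in the order listed above."""
    return [f"{NAMES[m]}{2 * i + w}*m{2 * m + w}"
            for m in range(N) for w in range(2)]


for i in range(KL):
    print(f"DEST.i32[{i}] += " + " + ".join(lane_terms(i)))

# Each lane has 8 products, so 16 lanes give 128 multiplications;
# adding those 8 products onto the accumulator is 8 additions per lane.
multiplications = sum(len(lane_terms(i)) for i in range(KL))
additions = KL * 2 * N
print(multiplications, additions)
```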