VP4DPWSSD
VP4DPWSSD zmm1{k1}{z}, zmm2+3, m128
[AVX512_4VNNIW] Multiply signed words from source register block indicated by zmm2 by signed words from m128 and accumulate resulting signed dwords in zmm1.
This instruction computes 4 sequential register source-block dot-products of two signed word operands with doubleword accumulation. The memory operand is sequentially selected in each of the four steps. In the above box, the notation “+3” is used to denote that the instruction accesses 4 source registers based on that operand; the sources are consecutive, start on a multiple-of-4 boundary, and contain the encoded register operand. This instruction supports memory fault suppression. The entire memory operand is loaded if any bit of the lowest 16 bits of the mask is set to 1 or if a “no masking” encoding is used. The tuple type T1_4X implies that four 32-bit elements (16 bytes) are referenced by the memory operation portion of this instruction.
src_reg_id is the 5 bit index of the vector register specified in the instruction as the src1 register.
VP4DPWSSD dest, src1, src2
(KL, VL) = (16, 512)
N ← 4
ORIGDEST ← DEST
src_base ← src_reg_id & ~(N-1)   // for src1 operand
FOR i ← 0 to KL-1:
    IF k1[i] or *no writemask*:
        FOR m ← 0 to N-1:
            t ← SRC2.dword[m]
            p1dword ← reg[src_base+m].word[2*i]   * t.word[0]
            p2dword ← reg[src_base+m].word[2*i+1] * t.word[1]
            DEST.dword[i] ← DEST.dword[i] + p1dword + p2dword
    ELSE IF *zeroing*:
        DEST.dword[i] ← 0
    ELSE:
        DEST.dword[i] ← ORIGDEST.dword[i]
DEST[MAX_VL-1:VL] ← 0
That boils down to 128 signed 16-bit word multiplications and 128 signed 32-bit dword additions per instruction.
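As a minimal scalar sketch of the semantics above (not an official reference model; the function name, array layout, and parameters are illustrative), the per-lane accumulation can be written in C as follows. Passing k1 = 0xFFFF reproduces the unmasked form.

#include <stdint.h>

/* Scalar sketch of VP4DPWSSD zmm1{k1}{z}, zmm2+3, m128, following the
 * operation pseudocode above.  src1 stands for the block of four
 * consecutive source registers reg[src_base+0..3], each holding 32 signed
 * words; src2 is the 128-bit memory operand (8 signed words); dest holds
 * the 16 signed dword accumulators. */
static void vp4dpwssd_ref(int32_t dest[16],
                          const int16_t src1[4][32], /* reg[src_base+m]     */
                          const int16_t src2[8],     /* m128 operand        */
                          uint16_t k1, int zeroing)
{
    for (int i = 0; i < 16; i++) {                   /* KL = 16 dword lanes */
        if (k1 & (1u << i)) {
            uint32_t acc = (uint32_t)dest[i];
            for (int m = 0; m < 4; m++) {            /* N = 4 register steps */
                /* SRC2.dword[m] supplies the word pair (s[2m], s[2m+1]).   */
                acc += (uint32_t)((int32_t)src1[m][2*i]     * src2[2*m]);
                acc += (uint32_t)((int32_t)src1[m][2*i + 1] * src2[2*m + 1]);
            }
            dest[i] = (int32_t)acc;  /* non-saturating: wraps modulo 2^32   */
        } else if (zeroing) {
            dest[i] = 0;             /* {z}: masked lanes are zeroed        */
        }
        /* merge masking: masked lanes keep their previous contents         */
    }
}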
How to use this instruction to implement a neural network
reg[src_base+0] = [a0, a1, ..., a31]
reg[src_base+1] = [b0, b1, ..., b31]
reg[src_base+2] = [c0, c1, ..., c31]
reg[src_base+3] = [d0, d1, ..., d31]
SRC2 = [s0, s1, ..., s7]
DEST.i32[0] += (a0 *s0) + (a1 *s1) + (b0 *s2) + (b1 *s3) + (c0 *s4) + (c1 *s5) + (d0 *s6) + (d1 *s7)
DEST.i32[1] += (a2 *s0) + (a3 *s1) + (b2 *s2) + (b3 *s3) + (c2 *s4) + (c3 *s5) + (d2 *s6) + (d3 *s7)
DEST.i32[2] += (a4 *s0) + (a5 *s1) + (b4 *s2) + (b5 *s3) + (c4 *s4) + (c5 *s5) + (d4 *s6) + (d5 *s7)
DEST.i32[3] += (a6 *s0) + (a7 *s1) + (b6 *s2) + (b7 *s3) + (c6 *s4) + (c7 *s5) + (d6 *s6) + (d7 *s7)
DEST.i32[4] += (a8 *s0) + (a9 *s1) + (b8 *s2) + (b9 *s3) + (c8 *s4) + (c9 *s5) + (d8 *s6) + (d9 *s7)
DEST.i32[5] += (a10*s0) + (a11*s1) + (b10*s2) + (b11*s3) + (c10*s4) + (c11*s5) + (d10*s6) + (d11*s7)
DEST.i32[6] += (a12*s0) + (a13*s1) + (b12*s2) + (b13*s3) + (c12*s4) + (c13*s5) + (d12*s6) + (d13*s7)
DEST.i32[7] += (a14*s0) + (a15*s1) + (b14*s2) + (b15*s3) + (c14*s4) + (c15*s5) + (d14*s6) + (d15*s7)
DEST.i32[8] += (a16*s0) + (a17*s1) + (b16*s2) + (b17*s3) + (c16*s4) + (c17*s5) + (d16*s6) + (d17*s7)
DEST.i32[9] += (a18*s0) + (a19*s1) + (b18*s2) + (b19*s3) + (c18*s4) + (c19*s5) + (d18*s6) + (d19*s7)
DEST.i32[10] += (a20*s0) + (a21*s1) + (b20*s2) + (b21*s3) + (c20*s4) + (c21*s5) + (d20*s6) + (d21*s7)
DEST.i32[11] += (a22*s0) + (a23*s1) + (b22*s2) + (b23*s3) + (c22*s4) + (c23*s5) + (d22*s6) + (d23*s7)
DEST.i32[12] += (a24*s0) + (a25*s1) + (b24*s2) + (b25*s3) + (c24*s4) + (c25*s5) + (d24*s6) + (d25*s7)
DEST.i32[13] += (a26*s0) + (a27*s1) + (b26*s2) + (b27*s3) + (c26*s4) + (c27*s5) + (d26*s6) + (d27*s7)
DEST.i32[14] += (a28*s0) + (a29*s1) + (b28*s2) + (b29*s3) + (c28*s4) + (c29*s5) + (d28*s6) + (d29*s7)
DEST.i32[15] += (a30*s0) + (a31*s1) + (b30*s2) + (b31*s3) + (c30*s4) + (c31*s5) + (d30*s6) + (d31*s7)
If we take the total incoming signal of a neuron j to be the sum of the states of the afferent (incoming) neurons i times the efficacy (weight) of the pathway from i to j, a single VP4DPWSSD calculates 8 incoming pathways for each of 16 neurons in one step.
Interpret s0 to s7 as the states of the 8 incoming neurons, and a0, a1, b0, b1, c0, c1, d0, d1 as the weights into the first neuron: a0 is the weight of the pathway from s0, a1 from s1, b0 from s2, and so on, following the first row of the expansion above.
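As a hedged usage sketch, assuming the GCC/Clang AVX512_4VNNIW intrinsic _mm512_4dpwssd_epi32 with the prototype (src, a0, a1, a2, a3, __m128i*) and the -mavx5124vnniw target flag (verify both against your toolchain's intrinsics guide), a 16-neuron fully connected layer could accumulate 8 inputs per instruction like this; the function name and weight layout are illustrative.

#include <immintrin.h>
#include <stdint.h>

/* Sketch of a 16-neuron fully connected layer, 8 inputs per VP4DPWSSD.
 * The weight layout mirrors the expansion above: for input block k,
 * source register m holds the weights from inputs (k+2m, k+2m+1) to
 * neurons 0..15 as word pairs (2*i, 2*i+1).  n_in must be a multiple of 8. */
void dense16_vp4dpwssd(int32_t out[16],
                       const int16_t *inputs,   /* n_in int16 activations        */
                       const int16_t *weights,  /* (n_in/8) blocks of 4x32 words */
                       int n_in)
{
    __m512i acc = _mm512_setzero_si512();
    for (int k = 0; k < n_in; k += 8) {
        const int16_t *blk = weights + (size_t)(k / 8) * 4 * 32;
        __m512i w0 = _mm512_loadu_si512(blk +  0);   /* a0..a31 */
        __m512i w1 = _mm512_loadu_si512(blk + 32);   /* b0..b31 */
        __m512i w2 = _mm512_loadu_si512(blk + 64);   /* c0..c31 */
        __m512i w3 = _mm512_loadu_si512(blk + 96);   /* d0..d31 */
        /* The same 16-byte memory operand (s0..s7) is read in each of the
         * instruction's four internal steps. */
        acc = _mm512_4dpwssd_epi32(acc, w0, w1, w2, w3,
                                   (__m128i *)(inputs + k));
    }
    _mm512_storeu_si512(out, acc);   /* 16 dword sums of incoming signal */
}

In this sketch the weight matrix is pre-interleaved so that each group of four ZMM loads matches the a/b/c/d registers of the expansion above; in practice that interleaving would be done once when the weights are packed.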