Instruction | General theme | Optional special features |
---|---|---|
`ldx` | `x[i] = memory[i]` | Load pair |
`ldy` | `y[i] = memory[i]` | Load pair |
`ldz`, `ldzi` | `z[_][i] = memory[i]` | Load pair, interleaved Z |
`stx` | `memory[i] = x[i]` | Store pair |
`sty` | `memory[i] = y[i]` | Store pair |
`stz`, `stzi` | `memory[i] = z[_][i]` | Store pair, interleaved Z |
Bit | Width | Meaning | Notes |
---|---|---|---|
10 | 22 | A64 reserved instruction | Must be 0x201000 >> 10 |
5 | 5 | Instruction | 0 for `ldx`, 1 for `ldy`, 2 for `stx`, 3 for `sty`, 4 for `ldz`, 5 for `stz`, 6 for `ldzi`, 7 for `stzi` |
0 | 5 | 5-bit GPR index | See below for the meaning of the 64 bits in the GPR |
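
The field layout above can be sketched as a small encoder. This is a minimal illustration rather than code from the project: the enum and the function name `amx_ldst_encode` are hypothetical, and only the bit positions and opcode numbers come from the table.

```c
#include <stdint.h>

/* Opcode numbers for the 5-bit instruction field (bits 5..9), per the table. */
enum amx_ldst_op {
    AMX_OP_LDX = 0, AMX_OP_LDY = 1, AMX_OP_STX = 2, AMX_OP_STY = 3,
    AMX_OP_LDZ = 4, AMX_OP_STZ = 5, AMX_OP_LDZI = 6, AMX_OP_STZI = 7
};

/* Hypothetical helper: assemble the 32-bit instruction word.
   Bits 10..31 hold the fixed reserved-instruction pattern (0x201000 >> 10),
   bits 5..9 the opcode, bits 0..4 the index of the GPR holding the operand. */
static uint32_t amx_ldst_encode(enum amx_ldst_op op, uint32_t gpr) {
    return 0x00201000u | (((uint32_t)op & 31u) << 5) | (gpr & 31u);
}
```

For example, `amx_ldst_encode(AMX_OP_LDX, 0)` yields `0x00201000`, i.e. `ldx` with its operand in GPR 0.
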
For `ldx` / `ldy`:
Bit | Width | Meaning |
---|---|---|
63 | 1 | Ignored |
62 | 1 | Load multiple registers (`1`) or single register (`0`) |
61 | 1 | On M1/M2: Ignored (loads are always to consecutive registers). On M3: Load to non-consecutive registers (`1`) or to consecutive registers (`0`) |
60 | 1 | On M1: Ignored ("multiple" always means two registers). On M2/M3: "Multiple" means four registers (`1`) or two registers (`0`) |
59 | 1 | Ignored |
56 | 3 | X / Y register index |
0 | 56 | Pointer |
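
As a rough sketch of how a 64-bit operand for `ldx` / `ldy` could be put together from these fields: the flag names below mirror those used in the emulation sample, but their values are inferred from the table above, and the helper `ldxy_operand` is hypothetical.

```c
#include <stdint.h>

/* Flag values inferred from the bit table (assumption, not copied from ldst.c). */
#define LDST_MULTIPLE            (1ull << 62)
#define LDST_NON_CONSECUTIVE     (1ull << 61)  /* only honoured on M3 */
#define LDST_MULTIPLE_MEANS_FOUR (1ull << 60)  /* only honoured on M2/M3 */

/* Hypothetical helper: pack pointer, X/Y register index, and flags. */
static uint64_t ldxy_operand(const void* ptr, uint32_t reg, uint64_t flags) {
    uint64_t address = (uint64_t)(uintptr_t)ptr & ((1ull << 56) - 1); /* bits 0..55 */
    return flags | ((uint64_t)(reg & 7u) << 56) | address;           /* bits 56..58 */
}
```

Passing `LDST_MULTIPLE | LDST_MULTIPLE_MEANS_FOUR` as `flags` would request a four-register load on hardware that supports it.
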
For `stx` / `sty`:
Bit | Width | Meaning |
---|---|---|
63 | 1 | Ignored |
62 | 1 | Store pair of registers (`1`) or single register (`0`) |
59 | 3 | Ignored |
56 | 3 | X / Y register index |
0 | 56 | Pointer |
For `ldz` / `stz`:
Bit | Width | Meaning |
---|---|---|
63 | 1 | Ignored |
62 | 1 | Load / store pair of registers (`1`) or single register (`0`) |
56 | 6 | Z row |
0 | 56 | Pointer |
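
Going the other way, the `ldz` / `stz` operand can be split back into its fields. The struct and function names here are hypothetical; the shifts and masks follow the table above.

```c
#include <stdint.h>

/* Hypothetical container for the decoded ldz / stz operand fields. */
typedef struct {
    uint64_t pointer; /* bits 0..55 */
    uint32_t z_row;   /* bits 56..61, one of the 64 Z rows */
    int      pair;    /* bit 62: 1 = pair of registers, 0 = single register */
} ldstz_fields;

static ldstz_fields ldstz_decode(uint64_t operand) {
    ldstz_fields f;
    f.pointer = operand & ((1ull << 56) - 1);
    f.z_row   = (uint32_t)(operand >> 56) & 63u;
    f.pair    = (int)((operand >> 62) & 1u);
    return f; /* bit 63 is ignored */
}
```
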
For `ldzi` / `stzi`:
Bit | Width | Meaning |
---|---|---|
62 | 2 | Ignored |
57 | 5 | Z row (high 5 bits thereof) |
56 | 1 | Right hand half (`1`) or left hand half (`0`) of Z register pair |
0 | 56 | Pointer |
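
A possible decoding of the `ldzi` / `stzi` operand, assuming that "high 5 bits" means the register pair consists of Z rows `2k` and `2k+1`, where `k` is the 5-bit field. The names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical container for the decoded ldzi / stzi operand fields. */
typedef struct {
    uint64_t pointer;    /* bits 0..55 */
    uint32_t even_row;   /* first Z row of the pair (field value * 2) */
    uint32_t odd_row;    /* second Z row of the pair */
    int      right_half; /* bit 56: 1 = right hand half, 0 = left hand half */
} ldstzi_fields;

static ldstzi_fields ldstzi_decode(uint64_t operand) {
    ldstzi_fields f;
    f.pointer    = operand & ((1ull << 56) - 1);
    f.even_row   = ((uint32_t)(operand >> 57) & 31u) << 1; /* bits 57..61 */
    f.odd_row    = f.even_row + 1u;
    f.right_half = (int)((operand >> 56) & 1u);
    return f; /* bits 62..63 are ignored */
}
```
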
Move 64 bytes of data between memory (does not have to be aligned) and an AMX register, or move 128 bytes of data between memory (must be aligned to 128 bytes) and an adjacent pair of AMX registers. On M2/M3, can also move 256 bytes of data from memory to four consecutive X or Y registers. On M3, can move 128 or 256 bytes of data from memory to non-consecutive X or Y registers: if bit 61 is set, 128 bytes are moved to registers `n` and `(n+4)%8`, or 256 bytes are moved to registers `n`, `(n+2)%8`, `(n+4)%8`, `(n+6)%8`.
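
The register selection rules for `ldx` / `ldy` can be summarised in one function. This is a sketch under stated assumptions: the `chip_gen` enum and `ldxy_target` are hypothetical names, and consecutive registers are taken to wrap modulo 8, as the masking in the emulation sample suggests.

```c
#include <stdint.h>

typedef enum { CHIP_M1 = 1, CHIP_M2 = 2, CHIP_M3 = 3 } chip_gen; /* hypothetical */

/* Returns the X/Y register that receives the 64-byte chunk number `chunk`
   (0 = first 64 bytes at the pointer), or -1 if that chunk is not loaded. */
static int ldxy_target(uint64_t operand, chip_gen chip, int chunk) {
    uint32_t n   = (uint32_t)(operand >> 56) & 7u;
    int multiple = (int)((operand >> 62) & 1u);
    int four     = chip >= CHIP_M2 && ((operand >> 60) & 1u); /* ignored on M1 */
    int spread   = chip >= CHIP_M3 && ((operand >> 61) & 1u); /* ignored on M1/M2 */
    int count    = !multiple ? 1 : (four ? 4 : 2);
    /* Non-consecutive loads step by 4 (two registers) or 2 (four registers). */
    uint32_t stride = spread ? (four ? 2u : 4u) : 1u;
    if (chunk < 0 || chunk >= count) return -1;
    return (int)((n + stride * (uint32_t)chunk) & 7u);
}
```

With bits 62, 61 and 60 set and `n = 3`, the four chunks land in registers 3, 5, 7 and 1 on M3, in registers 3, 4, 5 and 6 on M2, and only two chunks are loaded (registers 3 and 4) on M1.
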
The `ldzi` and `stzi` instructions manipulate half of a pair of Z registers. Viewing the 64 bytes of memory and the 64 bytes of every Z register as vectors of i32 / u32 / f32, the mapping between memory and Z is:
Memory | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Z0 | 0 L | 2 L | 4 L | 6 L | 8 L | 10 L | 12 L | 14 L | 0 R | 2 R | 4 R | 6 R | 8 R | 10 R | 12 R | 14 R |
Z1 | 1 L | 3 L | 5 L | 7 L | 9 L | 11 L | 13 L | 15 L | 1 R | 3 R | 5 R | 7 R | 9 R | 11 R | 13 R | 15 R |
In other words, the even Z register contains the even lanes from memory, and the odd Z register contains the odd lanes from memory. By a happy coincidence, this matches up with the "interleaved pair" lane arrangements of mixed-width `mac16` / `fma16` / `fms16` instructions, and with the "interleaved pair" lane arrangements of other instructions when in a (16, 16, 32) arrangement.
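
The lane mapping in the table can be written out as index arithmetic. This is an illustrative sketch derived from the table only, not the project's emulation code; all three function names are hypothetical, and each register is treated as sixteen 32-bit lanes.

```c
#include <stdint.h>

/* Memory lane held in lane `z_lane` (0..15) of the even (odd = 0)
   or odd (odd = 1) Z register of the pair. */
static uint32_t zi_memory_lane(uint32_t odd, uint32_t z_lane) {
    return 2u * (z_lane & 7u) + (odd & 1u);
}

/* Which half a Z lane belongs to: 0 = left (lanes 0..7), 1 = right (lanes 8..15). */
static uint32_t zi_half(uint32_t z_lane) {
    return (z_lane >> 3) & 1u;
}

/* Scatter 16 memory lanes into one half of a Z register pair, as ldzi would.
   Lanes in the other half are left untouched. */
static void ldzi_scatter(uint32_t z_even[16], uint32_t z_odd[16],
                         const uint32_t memory[16], uint32_t right_half) {
    for (uint32_t lane = 0; lane < 16; lane++) {
        if (zi_half(lane) != (right_half & 1u)) continue;
        z_even[lane] = memory[zi_memory_lane(0, lane)];
        z_odd[lane]  = memory[zi_memory_lane(1, lane)];
    }
}
```

A `stzi` counterpart would presumably run the same loop with the assignments reversed, gathering from the Z pair back into memory.
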
See ldst.c.
A representative sample is:

    void emulate_AMX_LDX(amx_state* state, uint64_t operand) {
        ld_common(state->x, operand, 7); // 8 X registers, so the index mask is 7
    }

    void ld_common(amx_reg* regs, uint64_t operand, uint32_t regmask) {
        uint32_t rn = (operand >> 56) & regmask;
        // Low 56 bits of the operand are the pointer
        const uint8_t* src = (uint8_t*)((operand << 8) >> 8);
        memcpy(regs + rn, src, 64);
        if (operand & LDST_MULTIPLE) {
            uint32_t rs = 1; // register stride; 1 means consecutive registers
            // Non-consecutive loads exist only on M3, and only for X / Y
            if ((AMX_VER >= AMX_VER_M3) && (operand & LDST_NON_CONSECUTIVE) && (regmask <= 15)) {
                rs = (operand & LDST_MULTIPLE_MEANS_FOUR) ? 2 : 4;
            }
            memcpy(regs + ((rn + rs) & regmask), src + 64, 64);
            // Four-register loads exist only on M2/M3, and only for X / Y
            if ((AMX_VER >= AMX_VER_M2) && (operand & LDST_MULTIPLE_MEANS_FOUR) && (regmask <= 15)) {
                memcpy(regs + ((rn + rs*2) & regmask), src + 128, 64);
                memcpy(regs + ((rn + rs*3) & regmask), src + 192, 64);
            }
        }
    }