Golang ppc64x asm Reference
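- VADDCUQ: corresponds to VACCQ on s390x; computes the carry out of a 128-bit addition.
- VADDUQM: corresponds to VAQ on s390x; computes the low 128 bits of the sum of two numbers.
- VADDECUQ: corresponds to VACCCQ on s390x; add with carry-in, computes the carry out.
- VADDEUQM: corresponds to VACQ on s390x; add with carry-in, computes the low 128 bits of the sum of the two numbers and the carry-in.
(The Q in each mnemonic gives the operand width: a quadword, 128 bits.)
Adding two multi-word numbers therefore takes several of these instructions used together. The example below computes T2||T1||T0 = T1||T0 + RED2||RED1.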
VADDCUQ T0, RED1, CAR1 // VACCQ T0, RED1, CAR1
VADDUQM T0, RED1, T0 // VAQ T0, RED1, T0
VADDECUQ T1, RED2, CAR1, CAR2 // VACCCQ T1, RED2, CAR1, CAR2
VADDEUQM T1, RED2, CAR1, T1 // VACQ T1, RED2, CAR1, T1
VADDUQM T2, CAR2, T2 // VAQ T2, CAR2, T2
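The carry chain above can be modeled in plain Go to check the arithmetic. This is a minimal sketch, not the assembly itself: the type u128 and the helpers addQ and add256 are hypothetical names introduced here, and each 128-bit register is modeled as two uint64 halves.

```go
package main

import (
	"fmt"
	"math/bits"
)

// u128 models one 128-bit vector register as two 64-bit halves (hypothetical type).
type u128 struct{ hi, lo uint64 }

// addQ returns the low 128 bits of a+b+cin and the carry out (0 or 1).
// The sum models VADDUQM (cin = 0) or VADDEUQM; the carry models
// VADDCUQ (cin = 0) or VADDECUQ. Only bit 0 of cin is used.
func addQ(a, b u128, cin uint64) (sum u128, cout uint64) {
	var c uint64
	sum.lo, c = bits.Add64(a.lo, b.lo, cin&1)
	sum.hi, cout = bits.Add64(a.hi, b.hi, c)
	return sum, cout
}

// add256 follows the five-instruction sequence: it adds red2||red1 to
// t1||t0 and accumulates the final carry into t2.
func add256(t0, t1, t2, red1, red2 u128) (r0, r1, r2 u128) {
	_, car1 := addQ(t0, red1, 0)       // VADDCUQ  T0, RED1, CAR1
	r0, _ = addQ(t0, red1, 0)          // VADDUQM  T0, RED1, T0
	_, car2 := addQ(t1, red2, car1)    // VADDECUQ T1, RED2, CAR1, CAR2
	r1, _ = addQ(t1, red2, car1)       // VADDEUQM T1, RED2, CAR1, T1
	r2, _ = addQ(t2, u128{0, car2}, 0) // VADDUQM  T2, CAR2, T2
	return r0, r1, r2
}

func main() {
	allOnes := u128{^uint64(0), ^uint64(0)}
	// (2^256 - 1) + 1 carries through both 128-bit limbs into t2.
	r0, r1, r2 := add256(allOnes, allOnes, u128{}, u128{0, 1}, u128{})
	fmt.Printf("t2=%x t1=%x t0=%x\n", r2, r1, r0)
}
```

Note that the carry and the sum are computed by separate instructions on the same inputs, which is why the carry instruction must come before the sum overwrites T0 or T1.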
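- VSUBCUQ: corresponds to VSCBIQ on s390x; computes the borrow.
- VSUBUQM: corresponds to VSQ on s390x; computes the difference of two numbers, keeping the low 128 bits.
- VSUBECUQ: corresponds to VSBCBIQ on s390x; subtract with borrow-in, computes the borrow.
- VSUBEUQM: corresponds to VSBIQ on s390x; subtract with borrow-in, computes the low 128 bits of the result.
The example below computes T2||TT1||TT0 = T2||T1||T0 - ZERO||PH||PL. T2 holds the borrow.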
VSUBCUQ T0, PL, CAR1 // VSCBIQ PL, T0, CAR1
VSUBUQM T0, PL, TT0 // VSQ PL, T0, TT0
VSUBECUQ T1, PH, CAR1, CAR2 // VSBCBIQ T1, PH, CAR1, CAR2
VSUBEUQM T1, PH, CAR1, TT1 // VSBIQ T1, PH, CAR1, TT1
VSUBEUQM T2, ZER, CAR2, T2 // VSBIQ T2, ZER, CAR2, T2
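The borrow chain can be modeled the same way. The sketch below assumes the Power ISA definition of these instructions, in which subtraction is computed as a + NOT(b) + carry-in, so a carry value of 1 means "no borrow". The names u128, subQ and sub384 are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// u128 models one 128-bit vector register as two 64-bit halves (hypothetical type).
type u128 struct{ hi, lo uint64 }

// subQ returns the low 128 bits of a + ^b + cin and the carry out.
// With cin = 1 this is a - b (VSUBUQM / VSUBCUQ); with a carry from a
// lower limb it models VSUBEUQM / VSUBECUQ. Carry out 1 means no borrow.
func subQ(a, b u128, cin uint64) (diff u128, cout uint64) {
	var c uint64
	diff.lo, c = bits.Add64(a.lo, ^b.lo, cin&1)
	diff.hi, cout = bits.Add64(a.hi, ^b.hi, c)
	return diff, cout
}

// sub384 follows the five-instruction sequence for
// T2||TT1||TT0 = T2||T1||T0 - ZERO||PH||PL.
func sub384(t0, t1, t2, pl, ph u128) (tt0, tt1, t2out u128) {
	zero := u128{}
	_, car1 := subQ(t0, pl, 1)        // VSUBCUQ  T0, PL, CAR1
	tt0, _ = subQ(t0, pl, 1)          // VSUBUQM  T0, PL, TT0
	_, car2 := subQ(t1, ph, car1)     // VSUBECUQ T1, PH, CAR1, CAR2
	tt1, _ = subQ(t1, ph, car1)       // VSUBEUQM T1, PH, CAR1, TT1
	t2out, _ = subQ(t2, zero, car2)   // VSUBEUQM T2, ZER, CAR2, T2
	return tt0, tt1, t2out
}

func main() {
	// 2^256 - 1: the borrow ripples through both limbs and clears t2.
	tt0, tt1, t2 := sub384(u128{}, u128{}, u128{0, 1}, u128{0, 1}, u128{})
	fmt.Printf("t2=%x tt1=%x tt0=%x\n", t2, tt1, tt0)
}
```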
Multiplication is more complicated:
// The following macros are used to implement the ppc64le
// equivalent function from the corresponding s390x
// instruction for vector multiply high, low, and add,
// since there aren't exact equivalent instructions.
// The corresponding s390x instructions appear in the
// comments.
// Implementation for big endian would have to be
// investigated, I think it would be different.
//
//
// Vector multiply word
//
// VMLF x0, x1, out_low
// VMLHF x0, x1, out_hi
#define VMULT(x1, x2, out_low, out_hi) \
VMULEUW x1, x2, TMP1; \
VMULOUW x1, x2, TMP2; \
VMRGEW TMP1, TMP2, out_hi; \
VMRGOW TMP1, TMP2, out_low
//
// Vector multiply add word
//
// VMALF x0, x1, y, out_low
// VMALHF x0, x1, y, out_hi
#define VMULT_ADD(x1, x2, y, one, out_low, out_hi) \
VMULEUW y, one, TMP2; \
VMULOUW y, one, TMP1; \
VMULEUW x1, x2, out_low; \
VMULOUW x1, x2, out_hi; \
VADDUDM TMP2, out_low, TMP2; \
VADDUDM TMP1, out_hi, TMP1; \
VMRGOW TMP2, TMP1, out_low; \
VMRGEW TMP2, TMP1, out_hi
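To see why the even/odd multiplies followed by the word merges yield per-lane low and high halves, the macros can be modeled in Go. This is a sketch under stated assumptions: it uses the Power ISA's big-endian element numbering (word 0 is the leftmost word), and the names vec, mulEven, mulOdd, words, mergeEven, mergeOdd, vmult and vmultAdd are hypothetical.

```go
package main

import "fmt"

// vec models a 128-bit register as four 32-bit words; word 0 is leftmost.
type vec [4]uint32

// mulEven models VMULEUW: 64-bit products of words 0 and 2.
func mulEven(a, b vec) [2]uint64 {
	return [2]uint64{uint64(a[0]) * uint64(b[0]), uint64(a[2]) * uint64(b[2])}
}

// mulOdd models VMULOUW: 64-bit products of words 1 and 3.
func mulOdd(a, b vec) [2]uint64 {
	return [2]uint64{uint64(a[1]) * uint64(b[1]), uint64(a[3]) * uint64(b[3])}
}

// words reinterprets two doublewords as four words (high word first).
func words(d [2]uint64) vec {
	return vec{uint32(d[0] >> 32), uint32(d[0]), uint32(d[1] >> 32), uint32(d[1])}
}

// mergeEven models VMRGEW and mergeOdd models VMRGOW.
func mergeEven(a, b vec) vec { return vec{a[0], b[0], a[2], b[2]} }
func mergeOdd(a, b vec) vec  { return vec{a[1], b[1], a[3], b[3]} }

// vmult mirrors the VMULT macro: per-lane low and high 32 bits of x1*x2.
func vmult(x1, x2 vec) (outLow, outHi vec) {
	tmp1 := words(mulEven(x1, x2))
	tmp2 := words(mulOdd(x1, x2))
	return mergeOdd(tmp1, tmp2), mergeEven(tmp1, tmp2)
}

// vmultAdd mirrors the VMULT_ADD macro: per-lane halves of x1*x2 + y.
// Multiplying y by a vector of ones widens each word of y to 64 bits.
func vmultAdd(x1, x2, y, one vec) (outLow, outHi vec) {
	tmp2, tmp1 := mulEven(y, one), mulOdd(y, one)
	pe, po := mulEven(x1, x2), mulOdd(x1, x2)
	for i := range tmp2 { // VADDUDM: doubleword adds
		tmp2[i] += pe[i]
		tmp1[i] += po[i]
	}
	return mergeOdd(words(tmp2), words(tmp1)), mergeEven(words(tmp2), words(tmp1))
}

func main() {
	lo, hi := vmult(vec{0xffffffff, 2, 3, 4}, vec{0xffffffff, 5, 6, 7})
	fmt.Printf("low=%08x high=%08x\n", lo, hi)
}
```

The 64-bit additions in vmultAdd cannot overflow, since the largest product of two 32-bit words plus one more 32-bit word still fits in 64 bits.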
- VSPLTISB "Vector Splat Immediate Signed Byte". Fills every byte of the target vector register with a sign-extended 5-bit immediate (-16 to 15).
- VSPLTB "Vector Splat Byte". Takes the byte at the specified position in the source vector register and replicates it across all bytes of the target vector register.
- VSPLTISW "Vector Splat Immediate Signed Word". Fills every 32-bit word of the target vector register with a sign-extended 5-bit immediate (-16 to 15).
- VSPLTW "Vector Splat Word". Replicates the specified word (32-bit element) of the source vector register across all words of the target vector register.
- LVX "Load Vector Indexed". Loads a 16-byte vector from memory into a vector register. LVX on ppc64 requires the data to be 16-byte aligned. To avoid that requirement, data is loaded using LXVD2X, with VPERM to reorder the bytes correctly.
- LXVDSX "Load VSX Vector Doubleword and Splat Indexed". Loads a doubleword (64-bit element) from memory into the lower half of the target vector register (byte indexes 0-7) and copies the same value into the higher half (byte indexes 8-15).
- LXVD2X "Load VSX Vector Doubleword*2 Indexed". Loads two consecutive doublewords (64-bit elements) from memory into the target vector register.
- LXVW4X "Load VSX Vector Word*4 Indexed". Loads four words (16 bytes in total, 4 bytes per word) from memory into a vector register.
Loading two consecutive 64-bit integers with LXVD2X:
DATA ·mask+0x00(SB)/8, $0x0f0e0d0c0b0a0908 // Permute for vector doubleword endian swap
DATA ·mask+0x08(SB)/8, $0x0706050403020100
GLOBL ·mask(SB), RODATA, $16
MOVD $·mask(SB), R4
LXVD2X (R4), V0
Then V[0] = 0x0f0e0d0c, V[1] = 0x0b0a0908, V[2] = 0x07060504, V[3] = 0x03020100
Loading two consecutive 64-bit integers with LVX:
DATA ·mask+0x00(SB)/8, $0x0f0e0d0c0b0a0908 // Permute for vector doubleword endian swap
DATA ·mask+0x08(SB)/8, $0x0706050403020100
GLOBL ·mask(SB), RODATA, $16
MOVD $·mask(SB), R4
LVX (R4), V0
Then V[0] = 0x07060504, V[1] = 0x03020100, V[2] = 0x0f0e0d0c, V[3] = 0x0b0a0908
Loading four 32-bit integers with LXVD2X:
Suppose the four consecutive 32-bit integers are: [0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100]
Then V[0] = 0x0b0a0908, V[1] = 0x0f0e0d0c, V[2] = 0x03020100, V[3] = 0x07060504
Loading four 32-bit integers with LVX:
Suppose the four consecutive 32-bit integers are: [0x0f0e0d0c, 0x0b0a0908, 0x07060504, 0x03020100]
Then V[0] = 0x03020100, V[1] = 0x07060504, V[2] = 0x0b0a0908, V[3] = 0x0f0e0d0c
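The four layouts listed above can be reproduced with a small Go model. It is a sketch that assumes a little-endian (ppc64le) system, a 16-byte-aligned address, and that V[i] denotes word i in the Power ISA's big-endian element numbering. The names vec, lxvd2xLE, lvxLE, dwordsLE and wordsLE are hypothetical helpers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// vec models a vector register as four 32-bit words; word 0 is most significant.
type vec [4]uint32

// lxvd2xLE models LXVD2X in little-endian mode: each 8-byte half of
// memory is read as one little-endian doubleword, and the doubleword at
// the lower address becomes doubleword 0 of the register.
func lxvd2xLE(mem []byte) vec {
	d0 := binary.LittleEndian.Uint64(mem[0:8])
	d1 := binary.LittleEndian.Uint64(mem[8:16])
	return vec{uint32(d0 >> 32), uint32(d0), uint32(d1 >> 32), uint32(d1)}
}

// lvxLE models LVX in little-endian mode: the 16 bytes are read as one
// little-endian quadword, so the lower address holds the low doubleword.
func lvxLE(mem []byte) vec {
	lo := binary.LittleEndian.Uint64(mem[0:8])
	hi := binary.LittleEndian.Uint64(mem[8:16])
	return vec{uint32(hi >> 32), uint32(hi), uint32(lo >> 32), uint32(lo)}
}

// dwordsLE lays out two 64-bit integers in memory the way DATA .../8 does on ppc64le.
func dwordsLE(a, b uint64) []byte {
	m := make([]byte, 16)
	binary.LittleEndian.PutUint64(m[0:], a)
	binary.LittleEndian.PutUint64(m[8:], b)
	return m
}

// wordsLE lays out four consecutive 32-bit integers in little-endian memory.
func wordsLE(w [4]uint32) []byte {
	m := make([]byte, 16)
	for i, v := range w {
		binary.LittleEndian.PutUint32(m[4*i:], v)
	}
	return m
}

func main() {
	m := dwordsLE(0x0f0e0d0c0b0a0908, 0x0706050403020100)
	fmt.Printf("LXVD2X: %08x\nLVX:    %08x\n", lxvd2xLE(m), lvxLE(m))
}
```

The model makes the difference visible: LXVD2X keeps the doublewords in memory order, while LVX swaps them relative to LXVD2X, which is why a permute is needed when one is substituted for the other.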
- STVX "Store Vector Indexed". Stores a vector from a vector register into memory. STVX on ppc64 requires the data to be 16-byte aligned. To avoid that requirement, data is stored using STXVD2X, with VPERM to reorder the bytes correctly.
- STXVD2X "Store VSX Vector Doubleword*2 Indexed". Stores the two doublewords (64-bit elements) of a vector register into consecutive memory locations.
- STXVW4X "Store VSX Vector Word*4 Indexed". Stores four words (16 bytes in total, 4 bytes per word) from a vector register to memory.
- VCMPEQUD "Vector Compare Equal Unsigned Doubleword". Compares the corresponding doublewords (64-bit elements) of two vector registers. Where the doublewords are equal, the corresponding element of the result is set to all ones; otherwise it is set to all zeros.
- VCMPEQUDCC is the record form of VCMPEQUD (the CC suffix in Go assembly corresponds to the trailing dot in the ISA mnemonic). It produces the same per-element result and additionally sets condition register field CR6: one bit is set when all elements compared equal, and another when none did, so the outcome can be tested with a conditional branch.
- VSLDOI "Vector Shift Left Double by Octet Immediate". Concatenates two source vector registers into a 32-byte value, shifts it left by the specified number of octets (8-bit bytes), and keeps the leftmost 16 bytes.
- XXPERMDI "VSX Vector Permute Doubleword Immediate". Builds the result from one doubleword (64-bit element) of each of two source registers; a 2-bit immediate selects which doubleword is taken from each source.
- VPERM "Vector Permute". Builds the result byte by byte from the 32 bytes of two source vector registers; each byte of a third register, the permutation vector, selects which source byte to copy.
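The three byte-rearranging instructions can be sketched in Go as follows. The model uses the Power ISA's big-endian byte numbering (byte 0 is leftmost) and the ISA operand roles rather than Go assembler operand order; vec, concat, vsldoi, vperm and xxpermdi are hypothetical names.

```go
package main

import "fmt"

// vec models a vector register as 16 bytes; byte 0 is leftmost.
type vec [16]byte

// concat returns the 32-byte value a||b.
func concat(a, b vec) [32]byte {
	var cat [32]byte
	copy(cat[:16], a[:])
	copy(cat[16:], b[:])
	return cat
}

// vsldoi models VSLDOI: bytes shb..shb+15 of a||b.
func vsldoi(a, b vec, shb int) vec {
	cat := concat(a, b)
	s := shb & 15
	var r vec
	copy(r[:], cat[s:s+16])
	return r
}

// vperm models VPERM: the low 5 bits of each byte of c index into a||b.
func vperm(a, b, c vec) vec {
	cat := concat(a, b)
	var r vec
	for i := range r {
		r[i] = cat[c[i]&0x1f]
	}
	return r
}

// xxpermdi models XXPERMDI: the high bit of dm picks the doubleword
// taken from a, the low bit picks the doubleword taken from b.
func xxpermdi(a, b vec, dm int) vec {
	ai, bi := (dm>>1)&1, dm&1
	var r vec
	copy(r[:8], a[ai*8:ai*8+8])
	copy(r[8:], b[bi*8:bi*8+8])
	return r
}

func main() {
	var a, b, rev vec
	for i := range a {
		a[i], b[i], rev[i] = byte(i), byte(16+i), byte(15-i)
	}
	fmt.Printf("vsldoi 4:   %x\n", vsldoi(a, b, 4))
	fmt.Printf("vperm rev:  %x\n", vperm(a, b, rev))
	fmt.Printf("xxpermdi 2: %x\n", xxpermdi(a, a, 2)) // swaps the two doublewords
}
```

With both sources set to the same register, an immediate of 2 in xxpermdi swaps the two doublewords, which is the kind of doubleword endian swap the mask constant in the load examples performs through VPERM.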
The typical CPUs for ppc64 (PowerPC 64-bit Big Endian) and ppc64le (PowerPC 64-bit Little Endian) are IBM's POWER series of processors.
For ppc64, the IBM POWER5, POWER6, POWER7, and POWER8 processors are commonly used. These processors are often found in high-performance computing environments, enterprise servers, and similar applications.
For ppc64le, the IBM POWER8 and POWER9 processors are typically used. The switch to little-endian mode in these processors was made to improve compatibility with software written for x86_64, which also uses little-endian byte order. These processors are used in a variety of applications, from supercomputers to servers for cloud and data analytics workloads.
VSX (Vector-Scalar Extension) and AltiVec (also known as VMX - Vector Multimedia Extension) are both vector processing extensions to the PowerPC architecture, but they have different features and capabilities.
AltiVec (VMX):
- Introduction: AltiVec was developed by Apple, IBM, and Motorola (whose semiconductor business is now part of NXP) in the late 1990s.
- Registers: AltiVec provides 32 vector registers, each 128 bits wide.
- Data Types: AltiVec supports operations on various data types, including 8-bit, 16-bit, and 32-bit integers, as well as single-precision floating-point numbers.
- Instruction Set: AltiVec includes a wide range of instructions for arithmetic, logical, permute, and data movement operations.
- Usage: AltiVec is commonly used in multimedia, signal processing, and other applications that benefit from SIMD (Single Instruction, Multiple Data) parallelism.
VSX:
- Introduction: VSX was introduced by IBM in the Power ISA 2.06 specification, starting with the POWER7 processor.
- Registers: VSX extends the existing floating-point registers (FPRs) and vector registers (VRs) to create a unified set of 64 registers, each 128 bits wide. The first 32 registers overlap with the FPRs, and the second 32 registers overlap with the VRs.
- Data Types: VSX supports operations on a wider range of data types, including 8-bit, 16-bit, 32-bit, and 64-bit integers, as well as single-precision and double-precision floating-point numbers.
- Instruction Set: VSX includes an expanded set of instructions that build on the capabilities of AltiVec, adding more advanced arithmetic, logical, and data movement operations.
- Usage: VSX is designed to enhance the performance of scientific computing, cryptography, and other applications that require high-performance vector and scalar processing.
Key differences:
- Registers: AltiVec has 32 dedicated vector registers, while VSX extends the existing FPRs and VRs to create a unified set of 64 registers.
- Data Types: VSX supports a broader range of data types, including 64-bit integers and double-precision floating-point numbers, which are not supported by AltiVec.
- Instruction Set: VSX includes more advanced instructions compared to AltiVec, providing greater flexibility and performance for certain applications.
- Compatibility: VSX is backward compatible with AltiVec, meaning that code written for AltiVec can run on processors with VSX support, but the reverse is not necessarily true.
In summary, VSX is a more advanced and flexible extension compared to AltiVec, offering enhanced capabilities for high-performance computing applications.
What is the difference between defining a function in Go assembly as funcname<>(SB), funcname(SB), and ·funcname(SB)?
In Go assembly, the naming conventions for function symbols are as follows:
- funcname(SB): a global symbol named funcname with no package qualifier. SB is the static base pseudo-register, and name(SB) refers to the symbol as an address in memory. The symbol is visible to the linker across the whole program, but because it carries no package path it does not match a Go declaration such as func funcname() in the current package.
- funcname<>(SB): a file-local symbol. The <> restricts visibility to the source file that defines it, much like a static declaration in C, so other files in the same package cannot reference it directly.
- ·funcname(SB): a package-level symbol. The linker inserts the current package's path in front of the leading middle dot (·), so this is the form used to implement a function that is declared in Go code in the same package. It is not the same as funcname(SB), which has no package qualifier.
In Go assembly, a leading · marks a symbol that belongs to the current package. A symbol such as ·mask is qualified with the package path by the linker, so it must be unique within the package. If you define ·mask in more than one assembly file of the same package, you get a duplicate symbol definition error.
On the other hand, mask<> is a file-local symbol. Local symbols are only visible within the file that defines them. You can have a mask<> symbol in each assembly file in your program, and they won't conflict with each other because they are not visible outside their own files.
So the difference comes from the scope of the symbols. Package-level symbols like ·mask are visible outside the file and must be unique. Local symbols like mask<> are only visible within their own file and can be defined in each file without causing conflicts.