run: ./build.sh step7
cpu: Intel 8260
fma peak performance: 81 GFLOP/s
step0 naive                             :  0.607 GFLOP/s
step1 C code optimization               :  0.663 GFLOP/s
step2 kernel 8x8                        : 20.829 GFLOP/s
step3 Kc/Mc tiling                      : 21.718 GFLOP/s
step4 pack B                            : 21.569 GFLOP/s
step5 pack A                            : 48.245 GFLOP/s
step6 kernel 16x6                       : 53.913 GFLOP/s
step7 asm kernel 16x6 + aligned memory  : 67.108 GFLOP/s (82.8% of peak)
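
The percentage on the last line is the measured rate divided by the FMA peak (67.108 / 81). The following is a minimal Python sketch, not part of the build, that recomputes that ratio and the speedup over step0 for every step. The helper names are hypothetical, and `gemm_gflops` assumes the common 2*M*N*K operation count for a matrix multiply, which this file does not state.

```python
# Figures copied from the results list; PEAK_GFLOPS is the stated FMA peak.
PEAK_GFLOPS = 81.0

RESULTS = [
    ("step0 naive", 0.607),
    ("step1 C code optimization", 0.663),
    ("step2 kernel 8x8", 20.829),
    ("step3 Kc/Mc tiling", 21.718),
    ("step4 pack B", 21.569),
    ("step5 pack A", 48.245),
    ("step6 kernel 16x6", 53.913),
    ("step7 asm kernel 16x6 + aligned memory", 67.108),
]


def percent_of_peak(gflops, peak=PEAK_GFLOPS):
    """Fraction of the FMA peak reached, as a percentage."""
    return 100.0 * gflops / peak


def speedup(gflops, baseline):
    """How many times faster a step is than the baseline step."""
    return gflops / baseline


def gemm_gflops(m, n, k, seconds):
    """GFLOP/s for one M x N x K multiply.

    Assumption: one multiply and one add per inner-loop iteration,
    i.e. 2*m*n*k floating-point operations in total.
    """
    return 2.0 * m * n * k / seconds / 1e9


if __name__ == "__main__":
    baseline = RESULTS[0][1]
    for name, gflops in RESULTS:
        print(
            f"{name:<40} {gflops:7.3f} GFLOP/s  "
            f"{percent_of_peak(gflops):5.1f}% of peak  "
            f"{speedup(gflops, baseline):6.1f}x vs step0"
        )
```

Read this way, the two largest jumps are the 8x8 kernel (step2) and packing A (step5), while Kc/Mc tiling and packing B on their own leave the rate roughly unchanged.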