Commit fdb63be

Merge branch 'main' into gregory/windows-support

gshimansky committed Nov 12, 2024
2 parents f017395 + ee755e8
Showing 28 changed files with 336 additions and 178 deletions.
60 changes: 0 additions & 60 deletions README.md
@@ -6,7 +6,6 @@

This is the development repository of Intel® XPU Backend for Triton\*, a new [Triton](https://github.com/triton-lang/triton/) backend for Intel GPUs. Intel® XPU Backend for Triton\* is an out-of-tree backend module for [Triton](https://github.com/triton-lang/triton/) that provides best-in-class performance and productivity on Intel GPUs for [PyTorch](https://github.com/pytorch/pytorch) and standalone usage.

-<<<<<<< HEAD
# Compatibility

* Operating systems:
@@ -22,25 +21,11 @@ This is the development repository of Intel® XPU Backend for Triton\*, a new [T
* Latest [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)

Note that Intel® XPU Backend for Triton\* is not compatible with Intel® Extension for PyTorch\* and Intel® oneAPI Base Toolkit\*.
-=======
-| **`Documentation`** | **`Nightly Wheels`** |
-|-------------------- | -------------------- |
-| [![Documentation](https://github.com/triton-lang/triton/actions/workflows/documentation.yml/badge.svg)](https://triton-lang.org/) | [![Wheels](https://github.com/triton-lang/triton/actions/workflows/wheels.yml/badge.svg?branch=release/2.0.x)](https://github.com/triton-lang/triton/actions/workflows/wheels.yml) |
-
-# Triton
-
-This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.
-
-The foundations of this project are described in the following MAPL2019 publication: [Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations](http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf). Please consider citing this work if you use Triton!
-
-The [official documentation](https://triton-lang.org) contains installation instructions and tutorials. See also these third-party [Triton puzzles](https://github.com/srush/Triton-Puzzles), which can all be run using the Triton interpreter -- no GPU required.
->>>>>>> d6739d3c33dee481f2d4dee4f6ecd4123f671597

# Quick Installation

## Prerequisites

-<<<<<<< HEAD
1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver
2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
3. Latest release of [Profiling Tools Interfaces for Intel GPU (PTI for GPU)](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
@@ -55,35 +40,18 @@ Extract the archive and in the extracted directory execute:
```shell
pip install torch-*.whl triton-*.whl
```
-=======
-```shell
-pip install triton
-```
-
-Binary wheels are available for CPython 3.8-3.12 and PyPy 3.8-3.9.
->>>>>>> d6739d3c33dee481f2d4dee4f6ecd4123f671597

Before using Intel® XPU Backend for Triton\*, you need to initialize the toolchain.
The default location is `/opt/intel/oneapi` (if installed as the `root` user) or `~/intel/oneapi` (if installed as a regular user).

```shell
-<<<<<<< HEAD
# replace /opt/intel/oneapi with the actual location of PyTorch Prerequisites for Intel GPUs
source /opt/intel/oneapi/setvars.sh
-=======
-pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly
->>>>>>> d6739d3c33dee481f2d4dee4f6ecd4123f671597
```

# Install from source

-<<<<<<< HEAD
## Prerequisites
-=======
-```shell
-git clone https://github.com/triton-lang/triton.git;
-cd triton;
->>>>>>> d6739d3c33dee481f2d4dee4f6ecd4123f671597

1. Latest [Rolling Release](https://dgpu-docs.intel.com/driver/installation-rolling.html) or [Long Term Support Release](https://dgpu-docs.intel.com/driver/installation.html) of GPU driver
2. Latest release of [PyTorch Prerequisites for Intel GPUs](https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html)
@@ -104,14 +72,9 @@ source /opt/intel/oneapi/setvars.sh
Clone this repository:

```shell
-<<<<<<< HEAD
git clone https://github.com/intel/intel-xpu-backend-for-triton.git
cd intel-xpu-backend-for-triton
```
-=======
-git clone https://github.com/triton-lang/triton.git;
-cd triton;
->>>>>>> d6739d3c33dee481f2d4dee4f6ecd4123f671597

To avoid potential conflicts with installed packages, it is recommended to create and activate a new Python virtual environment:

@@ -242,7 +205,6 @@ For detailed instructions on how to debug Triton's frontend, please refer to thi
# Usage Guide
-<<<<<<< HEAD
## Code Modifications
Intel® XPU Backend for Triton\* requires a special version of PyTorch that can be built from sources or installed from nightly wheels.
@@ -346,14 +308,6 @@ Note that the user needs to explicitly set `TRITON_XPU_PROFILE=1` when the user
```Bash
export TRITON_XPU_PROFILE=1
```
-=======
-Version 2.0 is out! New features include:
-- Many, many bug fixes
-- Performance improvements
-- Backend rewritten to use MLIR
-- Support for kernels that contain back-to-back matmuls (e.g., flash attention)
->>>>>>> d6739d3c33dee481f2d4dee4f6ecd4123f671597
# Contributing
@@ -363,24 +317,10 @@ Community contributions are more than welcome, whether it be to fix bugs or to a

_MIT License_. As found in [LICENSE](https://github.com/intel/intel-xpu-backend-for-triton/blob/main/LICENSE) file.

-<<<<<<< HEAD

## Security

See Intel's [Security Center](https://www.intel.com/content/www/us/en/security-center/default.html)
for information on how to report a potential security issue or vulnerability.
See also: [Security Policy](security.md)
-=======
-# Compatibility
-Supported Platforms:
-- Linux
-Supported Hardware:
-- NVIDIA GPUs (Compute Capability 8.0+)
-- AMD GPUs (ROCm 5.2+)
-- Under development: CPUs
->>>>>>> d6739d3c33dee481f2d4dee4f6ecd4123f671597
6 changes: 3 additions & 3 deletions benchmarks/triton_kernels_benchmark/gemm_benchmark.py
@@ -129,8 +129,8 @@ def matmul_kernel_with_block_pointers_batched(
stride_cz: tl.constexpr, stride_cm: tl.constexpr, stride_cn: tl.constexpr,
# Meta-parameters
BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr, BLOCK_SIZE_K: tl.constexpr, GROUP_SIZE_M: tl.constexpr):
-    bid = tl.program_id(axis=0)
-    pid = tl.program_id(axis=1)
+    bid = tl.program_id(axis=1)
+    pid = tl.program_id(axis=0)
num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
num_pid_in_group = GROUP_SIZE_M * num_pid_n
@@ -186,8 +186,8 @@ def matmul(a, b, c, transpose_a=False, transpose_b=False):
B = a.shape[0]
# 1D launch kernel where each block gets its own program.
grid = lambda META: (
-        B,
        triton.cdiv(M, META['BLOCK_SIZE_M']) * triton.cdiv(N, META['BLOCK_SIZE_N']),
+        B,
)
matmul_kernel_with_block_pointers_batched[grid](
a, b, c, #
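The two hunks above swap which grid axis carries the batch index and which carries the tile index, so the launch tuple `(num_tiles, B)` now lines up with `tl.program_id(axis=0)` returning the tile id. A minimal sketch of that pairing (the kernel, shapes, and `device="xpu"` placement are invented for illustration, assuming an XPU-enabled PyTorch build):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def grid_demo_kernel(out_ptr, TILES: tl.constexpr):
    pid = tl.program_id(axis=0)  # tile index within one batch entry
    bid = tl.program_id(axis=1)  # batch index
    # Record which (batch, tile) pair this program instance handled.
    tl.store(out_ptr + bid * TILES + pid, pid)


B, TILES = 4, 8
out = torch.empty(B * TILES, dtype=torch.int32, device="xpu")
grid_demo_kernel[(TILES, B)](out, TILES=TILES)  # grid = (tiles, batch), matching the new order
```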
1 change: 1 addition & 0 deletions include/triton/Tools/Sys/GetEnv.hpp
@@ -34,6 +34,7 @@ inline const std::set<std::string> CACHE_INVALIDATING_ENV_VARS = {
"TRITON_INTEL_ADVANCED_PATH",
"TRITON_INTEL_AGGRESSIVE_DPAS_REUSE",
"TRITON_INTEL_DO_NOT_SINK_INSTR_ACROSS_RGN",
"TRITON_INTEL_DISABLE_LARGE_BLOCK_SIZE_IO_FOR_TRANS_DOT_B",
"TRITON_INTEL_ENABLE_ADDRESS_PAYLOAD_OPT",
"TRITON_INTEL_ENABLE_FIRST_LOAD_TO_SLM",
"TRITON_INTEL_ENABLE_INSTR_SCHED",
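Variables in `CACHE_INVALIDATING_ENV_VARS` feed the compilation cache key, so changing one (such as the newly added `TRITON_INTEL_DISABLE_LARGE_BLOCK_SIZE_IO_FOR_TRANS_DOT_B`) forces affected kernels to be recompiled rather than served from cache. A rough sketch of the idea (the function and hashing scheme here are invented for illustration, not Triton's actual cache-key code):

```python
import hashlib
import os


def cache_key(source: str, invalidating_vars: list[str]) -> str:
    # Fold the current values of cache-invalidating environment variables
    # into the key, so flipping any of them yields a different cache entry.
    h = hashlib.sha256(source.encode())
    for name in sorted(invalidating_vars):
        h.update(f"{name}={os.environ.get(name, '')}".encode())
    return h.hexdigest()


key = cache_key("...kernel IR...",
                ["TRITON_INTEL_DISABLE_LARGE_BLOCK_SIZE_IO_FOR_TRANS_DOT_B"])
```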
11 changes: 9 additions & 2 deletions lib/Conversion/TritonGPUToLLVM/AssertOpToLLVM.cpp
@@ -35,15 +35,21 @@ struct AssertOpConversion : public ConvertOpToLLVMPattern<triton::AssertOp> {
}
}
llAssert(op, condition, adaptor.getMessage(), rewriter);
+    if (isa<RankedTensorType>(op.getCondition().getType())) {
+      // Add a barrier to avoid a race condition in case an assert is followed
+      // by an op that may trap if the assert condition is true. Since the
+      // tensor in those two operations may have different layout we need to
+      // make sure all the threads are done executing the assert before going to
+      // the next op.
+      barrier();
+    }
rewriter.eraseOp(op);
return success();
}
// op: the op at which the assert is inserted. Unlike printf, we need to
// know about the op to split the block.
void llAssert(Operation *op, Value condition, StringRef message,
ConversionPatternRewriter &rewriter) const {
ConversionPatternRewriter::InsertionGuard guard(rewriter);

auto ctx = rewriter.getContext();
auto loc = op->getLoc();

@@ -79,6 +85,7 @@ struct AssertOpConversion : public ConvertOpToLLVMPattern<triton::AssertOp> {
rewriter.create<cf::BranchOp>(loc, thenBlock);
rewriter.setInsertionPointToEnd(prevBlock);
rewriter.create<cf::CondBranchOp>(loc, condition, ifBlock, thenBlock);
+    rewriter.setInsertionPointToStart(thenBlock);
}

protected:
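A hypothetical Triton kernel showing the hazard the new barrier guards against (the kernel and its names are invented; `tl.device_assert` is active only when the kernel is compiled in debug mode): without synchronization after the assert, some threads could reach a potentially trapping load while others are still evaluating the tensor-valued condition.

```python
import triton
import triton.language as tl


@triton.jit
def checked_gather(src_ptr, idx_ptr, out_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)
    idx = tl.load(idx_ptr + offs)
    # Tensor-valued condition: lowered through AssertOpConversion above.
    tl.device_assert(idx < N, "index out of bounds")
    # This load may trap exactly when the assert condition is violated,
    # which is why the lowering now inserts a barrier between the two ops.
    val = tl.load(src_ptr + idx)
    tl.store(out_ptr + offs, val)
```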
2 changes: 1 addition & 1 deletion lib/Target/SPIRV/spirv-llvm-translator.conf
@@ -1 +1 @@
-cf697333b60d2000509ab7e79869ecab5eda9e9c
+1a1bf17d9e8684cd826e4278e78f63aa80e2e2ca
4 changes: 0 additions & 4 deletions python/triton/language/semantic.py
@@ -1729,10 +1729,6 @@ def device_print(prefix: str, args: List[tl.tensor], hex: bool, builder: ir.buil
def device_assert(cond: tl.tensor, msg: str, builder: ir.builder) -> tl.tensor:
if not builder.options.debug:
return
-    cond_ty = cond.type
-    if not cond_ty.is_block():
-        cond_ty = tl.block_type(cond_ty.scalar, (1, ))
-        cond = tl.tensor(builder.create_splat(cond.handle, (1, )), cond_ty)
return tl.tensor(builder.create_assert(cond.handle, msg), tl.void)


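With these lines removed, a scalar condition is handed to `create_assert` directly instead of first being splatted to a one-element block. Language-level usage is unchanged; a small sketch (the kernel is hypothetical, and device asserts fire only when debug is enabled, e.g. via `TRITON_DEBUG=1`):

```python
import triton
import triton.language as tl


@triton.jit
def assert_demo(x_ptr, N: tl.constexpr):
    pid = tl.program_id(0)
    tl.device_assert(pid < 4096, "scalar condition, no splat needed")
    x = tl.load(x_ptr + tl.arange(0, N))
    tl.device_assert(x == x, "tensor condition (NaN check)")
```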
2 changes: 1 addition & 1 deletion scripts/compile-pytorch-ipex.sh
@@ -100,7 +100,7 @@ fi
# Configure, build and install PyTorch from source.

if [[ $BUILD_PYTORCH = true ]]; then
-PYTORCH_PROJ=$BASE/pytorch
+PYTORCH_PROJ=$BASE/pytorch-stonepia

echo "**** Cleaning $PYTORCH_PROJ before build ****"
rm -rf $PYTORCH_PROJ
2 changes: 2 additions & 0 deletions scripts/skiplist/a770/language.txt
@@ -1,5 +1,7 @@
# https://github.com/intel/intel-xpu-backend-for-triton/issues/1434
test/unit/language/test_core.py::test_precise_math[1-tl.math.sqrt_rn(x)-tl.math.sqrt(x.to(tl.float64)).to(tl.float32)]
+# https://github.com/intel/intel-xpu-backend-for-triton/issues/2662
+test/unit/language/test_core.py::test_scan_layouts[True-1-src_layout10-64-32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float16]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float32-float32]
2 changes: 2 additions & 0 deletions scripts/skiplist/conda/language.txt
@@ -115,6 +115,8 @@ test/unit/language/test_core.py::test_dot_max_num_imprecise_acc[64-float8e4b15-1
test/unit/language/test_core.py::test_dot_max_num_imprecise_acc[128-float8e5-128-256-128-128-256-256]
# https://github.com/intel/intel-xpu-backend-for-triton/issues/1434
test/unit/language/test_core.py::test_precise_math[1-tl.math.sqrt_rn(x)-tl.math.sqrt(x.to(tl.float64)).to(tl.float32)]
+# https://github.com/intel/intel-xpu-backend-for-triton/issues/2662
+test/unit/language/test_core.py::test_scan_layouts[True-1-src_layout10-64-32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float16]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float32-float32]
2 changes: 2 additions & 0 deletions scripts/skiplist/default/language.txt
@@ -1,2 +1,4 @@
# https://github.com/intel/intel-xpu-backend-for-triton/issues/1434
test/unit/language/test_core.py::test_precise_math[1-tl.math.sqrt_rn(x)-tl.math.sqrt(x.to(tl.float64)).to(tl.float32)]
+# https://github.com/intel/intel-xpu-backend-for-triton/issues/2662
+test/unit/language/test_core.py::test_scan_layouts[True-1-src_layout10-64-32]
2 changes: 2 additions & 0 deletions scripts/skiplist/lts/language.txt
@@ -115,6 +115,8 @@ test/unit/language/test_core.py::test_dot_max_num_imprecise_acc[64-float8e4b15-1
test/unit/language/test_core.py::test_dot_max_num_imprecise_acc[128-float8e5-128-256-128-128-256-256]
# https://github.com/intel/intel-xpu-backend-for-triton/issues/1434
test/unit/language/test_core.py::test_precise_math[1-tl.math.sqrt_rn(x)-tl.math.sqrt(x.to(tl.float64)).to(tl.float32)]
+# https://github.com/intel/intel-xpu-backend-for-triton/issues/2662
+test/unit/language/test_core.py::test_scan_layouts[True-1-src_layout10-64-32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float16]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float32-float32]
2 changes: 2 additions & 0 deletions scripts/skiplist/mtl/language.txt
@@ -1,5 +1,7 @@
# https://github.com/intel/intel-xpu-backend-for-triton/issues/1434
test/unit/language/test_core.py::test_precise_math[1-tl.math.sqrt_rn(x)-tl.math.sqrt(x.to(tl.float64)).to(tl.float32)]
+# https://github.com/intel/intel-xpu-backend-for-triton/issues/2662
+test/unit/language/test_core.py::test_scan_layouts[True-1-src_layout10-64-32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float16]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float32-float32]
2 changes: 2 additions & 0 deletions scripts/skiplist/xe2/language.txt
@@ -1,5 +1,7 @@
# https://github.com/intel/intel-xpu-backend-for-triton/issues/1434
test/unit/language/test_core.py::test_precise_math[1-tl.math.sqrt_rn(x)-tl.math.sqrt(x.to(tl.float64)).to(tl.float32)]
+# https://github.com/intel/intel-xpu-backend-for-triton/issues/2662
+test/unit/language/test_core.py::test_scan_layouts[True-1-src_layout10-64-32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float16]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float16-float32]
test/unit/language/test_core.py::test_dot3d[1-1-32-32-32-32-32-float32-float32]
10 changes: 5 additions & 5 deletions test/Analysis/test-liveness.mlir
@@ -19,11 +19,11 @@ module attributes {"triton_gpu.num-warps" = 8 : i32} {

// CHECK: scf.if
// CHECK-NEXT: LiveIntervals for block: ^bb0
-// CHECK-NEXT: [[[LOAD1:%.*]], [[LOAD1]]] for value: %arg0
-// CHECK-NEXT: [[[LOAD1]], scf.yield] for value: [[LOAD1]]
-// CHECK-NEXT: LiveIntervals for block: ^bb0
-// CHECK-NEXT: [[[LOAD2:%.*]], [[LOAD2]]] for value: %arg1
-// CHECK-NEXT: [[[LOAD2]], scf.yield] for value: [[LOAD2]]
+// CHECK-DAG: [[[LOAD1:%.*]], [[LOAD1]]] for value: %arg0
+// CHECK-DAG: [[[LOAD1]], scf.yield] for value: [[LOAD1]]
+// CHECK-DAG: LiveIntervals for block: ^bb0
+// CHECK-DAG: [[[LOAD2:%.*]], [[LOAD2]]] for value: %arg1
+// CHECK-DAG: [[[LOAD2]], scf.yield] for value: [[LOAD2]]

%c1024_i32 = arith.constant 1024 : i32
%c64_i32 = arith.constant 64 : i32
65 changes: 65 additions & 0 deletions test/Conversion/intel/intel-allocate-shared-memory.mlir
@@ -0,0 +1,65 @@
// RUN: triton-opt %s -split-input-file --intel-allocate-shared-memory | FileCheck %s

#blocked = #triton_gpu.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 1], order = [0, 1]}>

// Check no scratch memory is allocated for sub-group shuffle-like layout conversions.

// CHECK-LABEL: module attributes
// CHECK-SAME: triton_gpu.shared = 0 : i32
module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 1 : i32, "triton_gpu.threads-per-warp" = 16 : i32} {
// CHECK: tt.func @test_sub_group_shuffle
// CHECK-NOT: llvm.ptr<3>
tt.func @test_sub_group_shuffle(%arg0: tensor<16xf16, #triton_gpu.slice<{dim = 1, parent = #blocked}>>) -> tensor<16xf16, #triton_gpu.slice<{dim = 1, parent = #blocked1}>> {
%0 = triton_gpu.convert_layout %arg0 : tensor<16xf16, #triton_gpu.slice<{dim = 1, parent = #blocked}>> -> tensor<16xf16, #triton_gpu.slice<{dim = 1, parent = #blocked1}>>
tt.return %0 : tensor<16xf16, #triton_gpu.slice<{dim = 1, parent = #blocked1}>>
}
}

// -----

#blocked = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>

// Check scratch memory configuration for different sub-group transpose-like layout conversions.

// CHECK-LABEL: module attributes
// CHECK-SAME: triton_gpu.shared = 512 : i32
module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 1 : i32, "triton_gpu.threads-per-warp" = 16 : i32} {
tt.func @test_f16(%arg0: tensor<16x16xf16, #blocked>) -> tensor<16x16xf16, #blocked1> {
%0 = triton_gpu.convert_layout %arg0 : tensor<16x16xf16, #blocked> -> tensor<16x16xf16, #blocked1>
tt.return %0 : tensor<16x16xf16, #blocked1>
}
}

// -----

#blocked = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [1, 1], order = [0, 1]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [1, 1], order = [0, 1]}>

// Check scratch memory configuration for different sub-group transpose-like layout conversions.

// CHECK-LABEL: module attributes
// CHECK-SAME: triton_gpu.shared = 1024 : i32
module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 1 : i32, "triton_gpu.threads-per-warp" = 16 : i32} {
tt.func @test_f32(%arg0: tensor<16x16xf32, #blocked>) -> tensor<16x16xf32, #blocked1> {
%0 = triton_gpu.convert_layout %arg0 : tensor<16x16xf32, #blocked> -> tensor<16x16xf32, #blocked1>
tt.return %0 : tensor<16x16xf32, #blocked1>
}
}

// -----

#blocked = #triton_gpu.blocked<{sizePerThread = [16, 1], threadsPerWarp = [1, 16], warpsPerCTA = [4, 2], order = [0, 1]}>
#blocked1 = #triton_gpu.blocked<{sizePerThread = [1, 16], threadsPerWarp = [16, 1], warpsPerCTA = [4, 2], order = [0, 1]}>

// Check scratch memory configuration for different sub-group transpose-like layout conversions.

// CHECK-LABEL: module attributes
// CHECK-SAME: triton_gpu.shared = 32768 : i32
module attributes {"triton_gpu.num-ctas" = 1 : i32, "triton_gpu.num-warps" = 8 : i32, "triton_gpu.threads-per-warp" = 16 : i32} {
tt.func @test_f32(%arg0: tensor<128x64xf32, #blocked>) -> tensor<128x64xf32, #blocked1> {
%0 = triton_gpu.convert_layout %arg0 : tensor<128x64xf32, #blocked> -> tensor<128x64xf32, #blocked1>
tt.return %0 : tensor<128x64xf32, #blocked1>
}
}
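The three `triton_gpu.shared` sizes checked above are consistent with a simple model in which the transpose scratch buffer holds the full tensor once at its element width; this is an inference from the CHECK values, not necessarily the pass's exact formula:

```python
def scratch_bytes(rows: int, cols: int, elem_bytes: int) -> int:
    # Assumed model: the scratch buffer holds rows * cols elements once.
    return rows * cols * elem_bytes


assert scratch_bytes(16, 16, 2) == 512       # 16x16 f16 case
assert scratch_bytes(16, 16, 4) == 1024      # 16x16 f32 case
assert scratch_bytes(128, 64, 4) == 32768    # 128x64 f32, 8-warp case
```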