Add ONNXToLinalg Conversion #1848

Open
chentong319 opened this issue Nov 9, 2022 · 9 comments

@chentong319
Collaborator

I found the proposal from Microsoft to use Linalg worth exploring. Linalg has some good features, and there are existing optimization passes and backends for it.
I am working on a draft that adds passes to lower some ONNX ops to Linalg while keeping the current lowering to Krnl working. The passes will look like:

existing passes
1. ONNXToLinalg 
2. ONNXToKrnl
3. KrnlToAffine
4. LinalgToAffine
existing passes

The order of 1 and 2, and that of 3 and 4, may be swappable.
In my experiment, I will translate only one ONNX op to Linalg (currently ONNXMatMul, chosen for simplicity). I will use memref for the Linalg op. I feel it may be easier to reuse the ONNX shape inference results for allocation and the ONNXToKrnl conversion to lower to memref, instead of using the Linalg bufferization (detensoring) pass. Will this decision be a problem for future optimization?
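
As a rough sketch only (not actual pass output; SSA names, shapes, and the alignment attribute are illustrative), the memref-based lowering of a statically shaped onnx.MatMul could look like:

%alloc = memref.alloc() {alignment = 16 : i64} : memref<2x4xf32>
linalg.matmul ins(%arg0, %arg1 : memref<2x3xf32>, memref<3x4xf32>) outs(%alloc : memref<2x4xf32>)

Here the static shape known from ONNX shape inference drives the memref.alloc directly, so no separate bufferization step is needed for this op.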

Once the framework is set up, collaboration will be needed to implement conversions of more ONNX ops to Linalg. Which lowering is applied to each ONNX op may be controlled by options and restricted by the expressiveness of the dialects. If the conversion to Linalg is disabled, onnx-mlir works as it does now.

Comments are welcome.

@ashay
Contributor

ashay commented Nov 16, 2022

This is great! The fact that this translation enables the use of existing upstream passes is a huge plus.

But in the same spirit, how do you feel about lowering to linalg-on-tensor instead of memref, and then using the upstream bufferization passes to lower to memref? I worry that by trying to lower to memref directly, the ONNXToLinalg pass might get too complicated.
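
For comparison, a minimal tensor-level sketch of the matmul from the proposal above (SSA names illustrative) would use tensor.empty as the output initializer and leave all buffer allocation to the upstream bufferization passes:

%0 = tensor.empty() : tensor<2x4xf32>
%1 = linalg.matmul ins(%arg0, %arg1 : tensor<2x3xf32>, tensor<3x4xf32>) outs(%0 : tensor<2x4xf32>) -> tensor<2x4xf32>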

@chentong319
Collaborator Author

chentong319 commented Nov 17, 2022

I think it is doable to lower to linalg-on-tensor. There are two possible paths:

  1. If ONNX is lowered to Krnl before Linalg is bufferized, we just need to add bufferization.to_memref for an input to a Krnl op whenever that input comes from Linalg.
  2. If the Linalg bufferization runs before ONNX is lowered to Krnl, I think the bufferization pass will add bufferization.to_tensor for the ONNX ops automatically.

It seems to me that no extra work is needed for the second path, but we can try both. Adding to_memref and to_tensor around the Krnl ops is needed to handle IR that mixes tensor-level dialects; a sketch of the first path follows.
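
A minimal sketch of the first path (assuming the Linalg matmul is still on tensors while its consumer has already been lowered to Krnl; SSA names are illustrative, and %i, %j stand for Krnl induction variables):

%2 = tensor.empty() : tensor<2x4xf32>
%3 = linalg.matmul ins(%1, %0 : tensor<2x3xf32>, tensor<3x4xf32>) outs(%2 : tensor<2x4xf32>) -> tensor<2x4xf32>
// The Krnl-level consumer reads the Linalg result through an explicit buffer cast.
%4 = bufferization.to_memref %3 : memref<2x4xf32>
%5 = krnl.load %4[%i, %j] : memref<2x4xf32>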

@chentong319
Collaborator Author

chentong319 commented Dec 2, 2022

@sstamenova
I did an experiment (#1891) that lowers onnx.MatMulOp to Linalg.MatmulOp at the tensor level, then calls the Linalg bufferization pass to convert tensors to memrefs and the Linalg-to-affine pass to convert Linalg to affine. With some changes to the lowering of ONNX to Krnl, the compilation can reach the lowering-to-LLVM stage. I ran into one issue: I did not find an existing pass that lowers bufferization.alloc_tensor to memref.alloc. I do not think I should lower that op myself, although it would be straightforward. Does anyone know the solution?
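
For reference, this is the leftover pattern and the rewrite I would expect (illustrative only; the alignment attribute just mirrors the other allocations in the dump):

%0 = bufferization.alloc_tensor() : tensor<2x4xf32>
%1 = bufferization.to_memref %0 : memref<2x4xf32>
// expected to become something like:
%alloc = memref.alloc() {alignment = 128 : i64} : memref<2x4xf32>
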
Should we discuss the ONNXToLinalg conversion at the meeting next Tuesday?

@chentong319
Collaborator Author

I put some results of #1891 here.
Model:

func.func @matmul(%arg0 : tensor<2x3xf32>, %arg1 : tensor<3x4xf32>) -> tensor<4x2xf32> {
  %1 = "onnx.MatMul"(%arg0, %arg1) : (tensor<2x3xf32>, tensor<3x4xf32>) -> tensor<2x4xf32>
  %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<2x4xf32>) -> tensor<4x2xf32>
  return %2 : tensor<4x2xf32>
}

After lowering to Linalg:

// -----// IR Dump After onnx_mlir::ONNXToLinalgLoweringPass (convert-onnx-to-linalg) //----- //
module attributes {llvm.data_layout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-apple-darwin20.3.0"} {
  func.func @matmul(%arg0: tensor<2x3xf32>, %arg1: tensor<3x4xf32>) -> tensor<4x2xf32> {
    %0 = tensor.empty() : tensor<2x4xf32>
    %1 = linalg.matmul ins(%arg0, %arg1 : tensor<2x3xf32>, tensor<3x4xf32>) outs(%0 : tensor<2x4xf32>) -> tensor<2x4xf32>
    %2 = "onnx.Transpose"(%1) {perm = [1, 0]} : (tensor<2x4xf32>) -> tensor<4x2xf32>
    return %2 : tensor<4x2xf32>
  }
}

After lowering to Krnl:

// -----// IR Dump After onnx_mlir::FrontendToKrnlLoweringPass (convert-onnx-to-krnl) //----- //
module attributes {llvm.data_layout = "e-m:o-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-apple-darwin20.3.0"} {
  func.func @matmul(%arg0: memref<2x3xf32>, %arg1: memref<3x4xf32>) -> memref<4x2xf32> {
    %0 = bufferization.to_tensor %arg1 : memref<3x4xf32>
    %1 = bufferization.to_tensor %arg0 : memref<2x3xf32>
    %2 = tensor.empty() : tensor<2x4xf32>
    %3 = linalg.matmul ins(%1, %0 : tensor<2x3xf32>, tensor<3x4xf32>) outs(%2 : tensor<2x4xf32>) -> tensor<2x4xf32>
    %4 = bufferization.to_memref %3 : memref<2x4xf32>
    %c4 = arith.constant 4 : index
    %c2 = arith.constant 2 : index
    %alloc = memref.alloc() {alignment = 16 : i64} : memref<4x2xf32>
    %5:2 = krnl.define_loops 2
    %c0 = arith.constant 0 : index
    %c2_0 = arith.constant 2 : index
    %c4_1 = arith.constant 4 : index
    krnl.iterate(%5#0, %5#1) with (%5#0 -> %arg2 = 0 to 2, %5#1 -> %arg3 = 0 to 4){
      %6:2 = krnl.get_induction_var_value(%5#0, %5#1) : (!krnl.loop, !krnl.loop) -> (index, index)
      %7 = krnl.load %4[%6#0, %6#1] : memref<2x4xf32>
      krnl.store %7, %alloc[%6#1, %6#0] : memref<4x2xf32>
    }
    return %alloc : memref<4x2xf32>
  }
}

After Linalg bufferization:

// -----// IR Dump After LinalgBufferize (linalg-bufferize) //----- //
func.func @matmul(%arg0: memref<2x3xf32>, %arg1: memref<3x4xf32>) -> memref<4x2xf32> {
  %0 = bufferization.to_tensor %arg1 : memref<3x4xf32>
  %1 = bufferization.to_tensor %arg0 : memref<2x3xf32>
  %2 = tensor.empty() : tensor<2x4xf32>
  %3 = bufferization.to_memref %2 : memref<2x4xf32>
  %alloc = memref.alloc() {alignment = 128 : i64} : memref<2x4xf32>
  memref.copy %3, %alloc : memref<2x4xf32> to memref<2x4xf32>
  %4 = bufferization.to_tensor %alloc : memref<2x4xf32>
  linalg.matmul ins(%arg0, %arg1 : memref<2x3xf32>, memref<3x4xf32>) outs(%alloc : memref<2x4xf32>)
  %5 = bufferization.to_tensor %alloc : memref<2x4xf32>
  %c4 = arith.constant 4 : index
  %c2 = arith.constant 2 : index
  %alloc_0 = memref.alloc() {alignment = 16 : i64} : memref<4x2xf32>
  %6:2 = krnl.define_loops 2
  %c0 = arith.constant 0 : index
  %c2_1 = arith.constant 2 : index
  %c4_2 = arith.constant 4 : index
  krnl.iterate(%6#0, %6#1) with (%6#0 -> %arg2 = 0 to 2, %6#1 -> %arg3 = 0 to 4){
    %7:2 = krnl.get_induction_var_value(%6#0, %6#1) : (!krnl.loop, !krnl.loop) -> (index, index)
    %8 = krnl.load %alloc[%7#0, %7#1] : memref<2x4xf32>
    krnl.store %8, %alloc_0[%7#1, %7#0] : memref<4x2xf32>
  }
  return %alloc_0 : memref<4x2xf32>
}

After lowering both Linalg and Krnl to affine:

func.func @matmul(%arg0: memref<2x3xf32>, %arg1: memref<3x4xf32>) -> memref<4x2xf32> attributes {llvm.emit_c_interface} {
  %0 = bufferization.alloc_tensor() : tensor<2x4xf32>
  %1 = bufferization.to_memref %0 : memref<2x4xf32>
  %alloc = memref.alloc() {alignment = 128 : i64} : memref<2x4xf32>
  memref.copy %1, %alloc : memref<2x4xf32> to memref<2x4xf32>
  affine.for %arg2 = 0 to 2 {
    affine.for %arg3 = 0 to 4 {
      affine.for %arg4 = 0 to 3 {
        %2 = affine.load %arg0[%arg2, %arg4] : memref<2x3xf32>
        %3 = affine.load %arg1[%arg4, %arg3] : memref<3x4xf32>
        %4 = affine.load %alloc[%arg2, %arg3] : memref<2x4xf32>
        %5 = arith.mulf %2, %3 : f32
        %6 = arith.addf %4, %5 : f32
        affine.store %6, %alloc[%arg2, %arg3] : memref<2x4xf32>
      }
    }
  }
  %alloc_0 = memref.alloc() {alignment = 16 : i64} : memref<4x2xf32>
  affine.for %arg2 = 0 to 2 {
    affine.for %arg3 = 0 to 4 {
      %2 = affine.load %alloc[%arg2, %arg3] : memref<2x4xf32>
      affine.store %2, %alloc_0[%arg3, %arg2] : memref<4x2xf32>
    }
  }
  return %alloc_0 : memref<4x2xf32>
}

@sstamenova
Collaborator

We have a team event this Tuesday, so we won't be able to attend. However, we can do this the following Tuesday.

@chentong319
Collaborator Author

We have a team event this Tuesday, so we won't be able to attend. However, we can do this the following Tuesday.

Let's try Dec 13 (Tuesday).

@hunterzju

This is a good idea: if ONNX is lowered to Linalg, we can do tiling and packing at the Linalg level. What is the current progress?

@AlexandreEichenberger
Collaborator

Microsoft was looking into this; I have not heard much on this front in a while.

@ashay
Contributor

ashay commented Oct 25, 2023

I can't speak on behalf of the Microsoft folks, but it's now possible to convert ONNX to StableHLO and use the changes in openxla/stablehlo#1817 to lower to Linalg. The ONNX-to-StableHLO conversion still has some gaps, so it needs work, but most ONNX operations go through without issues.
