{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":322328576,"defaultBranch":"develop","name":"rocMLIR","ownerLogin":"sjw36","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2020-12-17T15:01:52.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/48454132?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1713807384.0","currentOid":""},"activityList":{"items":[{"before":"4a231163402f3bd34814542870dfaa6673fbae1b","after":"7f83159a3252c47931f119190f83de4a430df09f","ref":"refs/heads/develop","pushedAt":"2024-09-03T18:38:44.000Z","pushType":"push","commitsCount":49,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Reduced split-k range (#1616)\n\n* Enabled splitK tuning with convolution\r\n\r\n* Corrected tests\r\n\r\n* Redused splitK factor to 1 & 4\r\n\r\n* Update sdxl-conv-configs and skip tuning split-l for ConvBwdWeight (it does not work)\r\n\r\n* Addressing PR comments\r\n\r\n* Addressing PR comments\r\n\r\n---------\r\n\r\nCo-authored-by: Daniel Hernandez-Juarez ","shortMessageHtmlLink":"Reduced split-k range (ROCm#1616)"}},{"before":"964c66f5d2091f8a9dd975774d023859686f74c0","after":"4a231163402f3bd34814542870dfaa6673fbae1b","ref":"refs/heads/develop","pushedAt":"2024-08-07T14:07:25.000Z","pushType":"push","commitsCount":51,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Merge pull request #1584 from ROCm/int4-basic-test\n\n[DO NOT SQUASH!] Non-MIGraphX-related changes for int4 support","shortMessageHtmlLink":"Merge pull request ROCm#1584 from ROCm/int4-basic-test"}},{"before":"e50d72fc6ab9a7a792d92a1ba7db6db45e4c508c","after":"964c66f5d2091f8a9dd975774d023859686f74c0","ref":"refs/heads/develop","pushedAt":"2024-07-01T18:21:31.000Z","pushType":"push","commitsCount":78,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[Attention] Enable blockwise transposes/rotations to avoid bank conflicts. (#1526)\n\n* Refactored vectorization and lds layout knobs out of gemm\r\nlowering to be used with attention.\r\nRefactored the attention + gemm lowering to use those utils\r\nbut in attention I have disabled lds layout based optimizations for\r\nnow becuase they are not producing the correct results.\r\n\r\n* * disable lds layout optimizations when\r\n secondGemm LDS bypass happens\r\n\r\n* * clang format\r\n\r\n* * further cleanup gridwise lowering\r\n\r\n* clang format","shortMessageHtmlLink":"[Attention] Enable blockwise transposes/rotations to avoid bank confl…"}},{"before":"87a55290b7f89f9f8b97c611e0cf929399a9e0c5","after":"e50d72fc6ab9a7a792d92a1ba7db6db45e4c508c","ref":"refs/heads/develop","pushedAt":"2024-05-06T14:07:45.000Z","pushType":"push","commitsCount":88,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[HOTFIX] Work around busted bf16 tests from ext/trunc folding (#1502)\n\nThe x86 backend now merges together sequences of the form\r\n```llvm\r\n%1 = fptrunc float %0 to bfloat\r\n%2 = fpext bfloat %1 to float\r\n...(%2)\r\n```\r\nto just `%0`.\r\n\r\nNormally, this is fine, and just gives higher precision. However,\r\nin our test code, when we're verifying a GPU reference (that is, one\r\nwhere we did the computation in floats to test a lower-precision\r\ndatatype) we were relying on such trunc/ext pairs to lose the extra\r\nprecision (since the code under test returnd a memref<...xbfloat>, and\r\nwas extf'd to f32 for verification).\r\n\r\nThere is an arithmetic fence intrinsic that should let us prevent the\r\noptimization in that one spot, but supporting it for bfloat on x86 is\r\na currently open PR. Until then, I've picked decently complex noop that\r\nblocks this optimization.\r\n\r\nSpceifically, for bfloat, we do\r\n```mlir\r\n%v = load %gpuRefOutput[...] : float\r\n%t = arith.truncf %v : float to bfloat\r\n%isneginf = arith.cmpf olt, %t, -BFLOAT_MAX\r\n%t.clamp = arith.maximumf %t, -BFLOAT_MAX\r\n%fix.neginf = select %isneginf, %t, %t.clamp\r\n%vMasked = arith.extf %fix.neginf : bfloat to float\r\nstore %vMasked, %gpuRefOutput[...]\r\n```\r\n\r\nThis'll work as a nice hotfix from what I can tell.","shortMessageHtmlLink":"[HOTFIX] Work around busted bf16 tests from ext/trunc folding (ROCm#1502"}},{"before":"b9705ba85d82c308b9289276c6ee27f5cd214c3a","after":"08f85e953658440257dad0722d2f668997735c40","ref":"refs/heads/fix-div-0","pushedAt":"2024-04-23T00:56:14.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* format","shortMessageHtmlLink":"* format"}},{"before":null,"after":"b9705ba85d82c308b9289276c6ee27f5cd214c3a","ref":"refs/heads/fix-div-0","pushedAt":"2024-04-22T17:36:24.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"MLIR#1470: fix crash when blockPerCU == 0\n\nhttps://github.com/ROCm/rocMLIR-internal/issues/1470","shortMessageHtmlLink":"MLIR#1470: fix crash when blockPerCU == 0"}},{"before":"c6dafa1aaa93dc6711bea6028e20b2b94bc5d9ab","after":"87a55290b7f89f9f8b97c611e0cf929399a9e0c5","ref":"refs/heads/develop","pushedAt":"2024-04-22T16:52:42.000Z","pushType":"push","commitsCount":52,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Filter out invalid WMMA perf configs (ex. m/nPerWave < 16) (#1480)\n\nIf mPerWave or nPerWave are less than the input length of the WMMA,\r\nwe can't cohenetly codegen usages of those WMMAs.\r\n\r\nSimilarly, while it wasn't an issue in this bug, if mPerWave or\r\nnPerWave aren't evenly divisible by the wmma's input length, reject\r\nthat config as well.\r\n\r\nFixes https://github.com/ROCm/rocMLIR-internal/issues/1444\r\n\r\nMove the tests","shortMessageHtmlLink":"Filter out invalid WMMA perf configs (ex. m/nPerWave < 16) (ROCm#1480)"}},{"before":"3bb598089dd18f3de54c41d7b1348d66671e8798","after":"c6dafa1aaa93dc6711bea6028e20b2b94bc5d9ab","ref":"refs/heads/develop","pushedAt":"2024-03-18T20:40:18.000Z","pushType":"push","commitsCount":8,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Change vectorization and merge collapse APIs to take Value arguments (#1420)\n\n* Change vectorization and merge collapse APIs to take Value arguments\r\n\r\nThis commit removes getMaxVectorization[forDataType]() as a function\r\nthat takes a sequence of transform map attributes and replaces it with\r\na version that takes a transformed Value. Similarly,\r\ncollapseContiguousMerges() now takes the transformed Value and clones\r\nthe IR containing said contiguous merges in place instead of taking a\r\ntransform stack.\r\n\r\nFurthermore, getMaxVectorization() now returns a VectorizationResult\r\nstruct, which includes the maximum vectorization fot the underlying\r\nbuffer along with information on how said underlying buffer is\r\nnaturally vectorized (this is currently unused, but will be needed for\r\nrock.scalarize).\r\n\r\n(While I was at it, I made the 128 bits / element type bitwidth thing\r\nthe default behavior for vectorization, since we almost always use it.)\r\n\r\nThis change is nedeed:\r\n1. To enable getMaxVectorization() to reason about - and through - the\r\nupcoming rock.scalarize operation, which isn't a coordinate\r\ntransformation but sort of acts like one.\r\n2. To adapt the API for future dynamic shape cases, where a\r\nTransformMapAttr might have symbols and so will need to, inherently,\r\nbe part of a TransformOp.\r\n\r\nAs a side effect, the vectorization inferenece tests now have nice\r\nnames and are split up into individual functions.\r\n\r\n* Add a fusion-traversing vectorization\r\n\r\n* Address feedack, no contiguous merge test yet\r\n\r\n* Add a fail test for vectorization\r\n\r\n* Add merge collapse test pass\r\n\r\n* Remove old merge collaps unit tests\r\n\r\n* Add explicit API for isolating transform chains","shortMessageHtmlLink":"Change vectorization and merge collapse APIs to take Value arguments (R…"}},{"before":"31299e8476e6ea4ee0d9a169033dd771b39c13f1","after":"f5b62577f7610fd94303975c8d0b486baf24fa56","ref":"refs/heads/global-sync-2","pushedAt":"2024-03-07T21:54:51.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* changed global sync logic to similar to binary fusion impl\n* needs array of counters for each sync point","shortMessageHtmlLink":"* changed global sync logic to similar to binary fusion impl"}},{"before":"1326212996682a1f61c01d2797107ce2dea99973","after":"3bb598089dd18f3de54c41d7b1348d66671e8798","ref":"refs/heads/develop","pushedAt":"2024-03-07T21:54:33.000Z","pushType":"push","commitsCount":4,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Added CK-based split-k benchmark (#1448)","shortMessageHtmlLink":"Added CK-based split-k benchmark (ROCm#1448)"}},{"before":"d845a7403c5808e123a96744dd2dacb539d1aa32","after":"31299e8476e6ea4ee0d9a169033dd771b39c13f1","ref":"refs/heads/global-sync-2","pushedAt":"2024-03-07T21:54:19.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* changed global sync logic to similar to binary fusion impl\n* needs array of counters for each sync point","shortMessageHtmlLink":"* changed global sync logic to similar to binary fusion impl"}},{"before":"da3df73b13c69d6b1d5ffec9c8281c261a0df257","after":"1326212996682a1f61c01d2797107ce2dea99973","ref":"refs/heads/develop","pushedAt":"2024-03-04T18:50:13.000Z","pushType":"push","commitsCount":6,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[NightlyFix] Change missed cross attention problem key changes (#1438)\n\nChange missed cross attention problem key changes\r\n\r\nThis commit fixes a place where I missed to\r\nupdate the perfRunner when the problem key for\r\nattention changed to support cross attention.","shortMessageHtmlLink":"[NightlyFix] Change missed cross attention problem key changes (ROCm#…"}},{"before":"f42f16b60dda95f95bbe0c2941290d63f8982218","after":"d845a7403c5808e123a96744dd2dacb539d1aa32","ref":"refs/heads/global-sync-2","pushedAt":"2024-02-22T21:32:35.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* updated test to N=16","shortMessageHtmlLink":"* updated test to N=16"}},{"before":"5d75a6fa61b8b638e50133a7007800d108642493","after":"f42f16b60dda95f95bbe0c2941290d63f8982218","ref":"refs/heads/global-sync-2","pushedAt":"2024-02-22T18:28:36.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* message","shortMessageHtmlLink":"* message"}},{"before":"7861b5abe47990fd89862d2cc67b8bd0469faea2","after":"da3df73b13c69d6b1d5ffec9c8281c261a0df257","ref":"refs/heads/develop","pushedAt":"2024-02-21T16:29:41.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[Attention] Fix cross-attention padding (#1424)\n\nThe padding application was wrong for\r\nwhere gemm0 had : k0 x m * k0 x n0\r\nand gemm1 had : n0 x m * n0 x n1\r\n\r\nwhere m != n0 in cross-attention","shortMessageHtmlLink":"[Attention] Fix cross-attention padding (ROCm#1424)"}},{"before":"b2203d101eac8dc3114ff91392fe306f686abaf0","after":"7861b5abe47990fd89862d2cc67b8bd0469faea2","ref":"refs/heads/develop","pushedAt":"2024-02-20T17:18:51.000Z","pushType":"push","commitsCount":19,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Fix incorrect indexing inte Merge parameters in -rock-fold-broadcast (#1419)\n\nFixes #1417\r\n\r\nFoldBroadcast would crash when processing Merge operations like\r\n [\"a\", \"b\"] at [1, 2]>, as it would index\r\nthe merge parameters with 1 and 2 and not 0 and 1, causing out of\r\nbounds array accesses and thus assertion failures.\r\n\r\nThis commit resolves that issue.\r\n\r\nIn addition, while looking around nearby, I identifidied that the\r\nhandling for Broadcast{} operations that handled multiple\r\ndimensions (like, say,\r\n [\"a\", \"b\"] at [0, 1]> )\r\nwas incorrect because the worklist update logic always checked the\r\nlast underlying dimension for Broadcast{}. This commit fixes that\r\nissue.","shortMessageHtmlLink":"Fix incorrect indexing inte Merge parameters in -rock-fold-broadcast (R…"}},{"before":"f09dbb836584662f4be5fe04250431416c6f1f91","after":"5d75a6fa61b8b638e50133a7007800d108642493","ref":"refs/heads/global-sync-2","pushedAt":"2024-02-20T17:18:11.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* message","shortMessageHtmlLink":"* message"}},{"before":null,"after":"f3b2c4e528e66bf8162a8de3e756fe8ee80d87e2","ref":"refs/heads/gcc-compile","pushedAt":"2024-02-12T18:31:57.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* fixed gcc compile","shortMessageHtmlLink":"* fixed gcc compile"}},{"before":"2b654a16a82b879bd3f1cde2f76f32350805e026","after":"b2203d101eac8dc3114ff91392fe306f686abaf0","ref":"refs/heads/develop","pushedAt":"2024-02-05T18:57:32.000Z","pushType":"push","commitsCount":72,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Add mixed-type support in rocmlir-gen (#1394)\n\n* Add mixed-type support in rocmlir-gen\r\n* Disable acceleration for mixed types","shortMessageHtmlLink":"Add mixed-type support in rocmlir-gen (ROCm#1394)"}},{"before":"234389240a4ae9d06bd17adf6a00cd35c4320bda","after":"f09dbb836584662f4be5fe04250431416c6f1f91","ref":"refs/heads/global-sync-2","pushedAt":"2024-01-18T18:25:51.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* message","shortMessageHtmlLink":"* message"}},{"before":"907dc94ac0f948bc24989d6fd7a81b84254ef329","after":"2b654a16a82b879bd3f1cde2f76f32350805e026","ref":"refs/heads/develop","pushedAt":"2024-01-17T17:50:16.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Fix threadwise_read_into vector reads/writes (#1372)\n\nCurrently our threadwise_read_into lowering assumes that srcVecLen could only be >1 iff the source memref is vector typed.\r\nThis assumption is not correct because our vectorization widgets, will find infer a vector length for scalar memrefs.\r\n\r\nThis commit fixes this by correcting the bounds and amending the index to be as if a vector index.\r\nThen, it will find a common vector length to transfer from reads and stores if both source and dest uses vectors.","shortMessageHtmlLink":"Fix threadwise_read_into vector reads/writes (ROCm#1372)"}},{"before":null,"after":"a9e57a3ee6df3a0cddd0a04c80d2ab1165754c7d","ref":"refs/heads/codeowners","pushedAt":"2024-01-12T22:03:20.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[git] Add codeowners","shortMessageHtmlLink":"[git] Add codeowners"}},{"before":"81479caa48b33fc03d96e9795394ad1940652c9a","after":"a93e6e3d4642244fc36fc9dc9172b41073f0764b","ref":"refs/heads/navi-wgp","pushedAt":"2024-01-12T21:18:30.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[amdgpu] Fixed arch db settings for Navi in WGP mode (#1325)\n\n The compiler backend defaults to WGP mode on Navi arch. Updated the\n AmdArchDB settings accordingly.\n Also added a new field for `maxSharedMemPerWG` since in WGP mode\n a WG may only allocate up to half the shared mem size.\n\nhttps://github.com/ROCm/rocMLIR-internal/issues/1325","shortMessageHtmlLink":"[amdgpu] Fixed arch db settings for Navi in WGP mode (ROCm#1325)"}},{"before":"192515b04fcdcc7c60fc3a8ae8d2bc146ac178a9","after":"907dc94ac0f948bc24989d6fd7a81b84254ef329","ref":"refs/heads/develop","pushedAt":"2024-01-12T21:18:18.000Z","pushType":"push","commitsCount":5,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[mlir][AMDGPU] Update amdgpu.lds_barrier implementation (#823)\n\nOn some architectures (currently gfx90a, gfx94*, and gfx10**), we can\r\nimplement an LDS barrier using compiler intrinsics instead of inline\r\nassembly, improving optimization possibilities and decreasing the\r\nfragility of the underlying code.\r\n\r\nOther AMDGPU chipsets continue to require inline assembly to implement\r\nthis barrier, as, by the default, the LLVM backend will insert waits\r\non global memory (s_waintcnt vmcnt(0)) before barriers in order to\r\nensure memory watchpoints set by debuggers work correctly.\r\n\r\nUse of amdgpu.lds_barrier, on these architectures, imposes a tradeoff\r\nbetween debugability and performance. The documentation, as well as\r\nthe generated inline assembly, have been updated to explicitly call\r\nattention to this fact.\r\n\r\nFor chipsets that did not require the inline assembly hack, we move to\r\nthe s.waitcnt and s.barrier intrinsics, which have been added to the\r\nROCDL dialect. The magic constants used as an argument to the waitcnt\r\nintrinsic can be derived from\r\nllvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp","shortMessageHtmlLink":"[mlir][AMDGPU] Update amdgpu.lds_barrier implementation (ROCm#823)"}},{"before":null,"after":"81479caa48b33fc03d96e9795394ad1940652c9a","ref":"refs/heads/navi-wgp","pushedAt":"2024-01-12T19:17:41.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[amdgpu] Fixed arch db settings for Navi in WGP mode (#1325)\n\n The compiler backend defaults to WGP mode on Navi arch. Updated the\n AmdArchDB settings accordingly.\n Also added a new field for `maxSharedMemPerWG` since in WGP mode\n a WG may only allocate up to half the shared mem size.\n\nhttps://github.com/ROCm/rocMLIR-internal/issues/1325","shortMessageHtmlLink":"[amdgpu] Fixed arch db settings for Navi in WGP mode (ROCm#1325)"}},{"before":"caa6676f856ef2ccb2b13796d3473e3e3dc12b17","after":"192515b04fcdcc7c60fc3a8ae8d2bc146ac178a9","ref":"refs/heads/develop","pushedAt":"2024-01-08T18:06:12.000Z","pushType":"push","commitsCount":3,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[Attention] Replace 2nd blockwise_gemm with threadwise_gemm (#1362)\n\nThis commit replaces the second blockwise_gemm operation\r\nwith threadwise_gemm operation and friends.\r\n\r\nThis is a stepping stone to analyze the LDS traffic used to\r\nswizzle the gemm operands into accelerator layouts (MFMA/WMMA)\r\nand to potentially remove unnecessary LDS traffic in future.","shortMessageHtmlLink":"[Attention] Replace 2nd blockwise_gemm with threadwise_gemm (ROCm#1362)"}},{"before":"3202ef9b92a9278708b445a2611fff41750ffeea","after":"caa6676f856ef2ccb2b13796d3473e3e3dc12b17","ref":"refs/heads/develop","pushedAt":"2024-01-04T18:14:22.000Z","pushType":"push","commitsCount":5,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"Merge pull request #1351 from ROCm/group-conv2d\n\n[DO NOT SQUASH]Add support to group conv2d","shortMessageHtmlLink":"Merge pull request ROCm#1351 from ROCm/group-conv2d"}},{"before":"5da3448e426eb9359bd009522166c76cc5e87df0","after":"234389240a4ae9d06bd17adf6a00cd35c4320bda","ref":"refs/heads/global-sync-2","pushedAt":"2024-01-03T18:21:29.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* cleanup","shortMessageHtmlLink":"* cleanup"}},{"before":"ab609f35ab02ef9c1fd5e4f4924d2260c76d2545","after":"3202ef9b92a9278708b445a2611fff41750ffeea","ref":"refs/heads/develop","pushedAt":"2024-01-03T18:20:13.000Z","pushType":"push","commitsCount":10,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"[Attention] Use transposed/rotated attention kernel (#1354)\n\nBefore this commit, the attention kernel would look like as follows :\r\nout = softmax(Q x K) x V\r\n\r\nAfter this commit, it is now :\r\nout = transpose( transpose(V) x softmax( transpose(K) x transpose(Q) ) )\r\n\r\nThis layout is interesting because for two reasons:\r\n\r\nThe reductions axis of the softmax will be laid out in just two threads for MFMA (hence 1344 is important)\r\nIt introduces an opportunity (not done here) to avoid LDS-based swizzling to prepare the operands to MFMA as the transposed thread layout is what the output is produced after the first gemm.\r\nAdditionally, this PR also includes relaxation of blocksize >= 512 check just for attention because the performant tuning config for triton problems are just that and happens to work fine. In future we can investigate, whether that check is relevant now.","shortMessageHtmlLink":"[Attention] Use transposed/rotated attention kernel (ROCm#1354)"}},{"before":"1868313a8979d0871476954ce6ff1af81197048c","after":"5da3448e426eb9359bd009522166c76cc5e87df0","ref":"refs/heads/global-sync-2","pushedAt":"2024-01-02T15:23:34.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"sjw36","name":"SJW","path":"/sjw36","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/48454132?s=80&v=4"},"commit":{"message":"* cleanup","shortMessageHtmlLink":"* cleanup"}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"startCursor":"Y3Vyc29yOnYyOpK7MjAyNC0wOS0wM1QxODozODo0NC4wMDAwMDBazwAAAASr6aBp","endCursor":"Y3Vyc29yOnYyOpK7MjAyNC0wMS0wMlQxNToyMzozNC4wMDAwMDBazwAAAAPWGa0R"}},"title":"Activity · sjw36/rocMLIR"}