diff --git a/sycl/doc/extensions/deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc b/sycl/doc/extensions/deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc
deleted file mode 100644
index b632699a89a4c..0000000000000
--- a/sycl/doc/extensions/deprecated/sycl_ext_oneapi_matrix_no_use.asciidoc
+++ /dev/null
@@ -1,659 +0,0 @@
-# Matrix Programming Extension for DPC++: sycl_ext_oneapi_matrix
-:source-highlighter: coderay
-:coderay-linenums-mode: table
-:dpcpp: pass:[DPC++]
-
-// This section needs to be after the document title.
-:doctype: book
-:toc2:
-:toc: left
-:encoding: utf-8
-:lang: en
-
-:blank: pass:[ +]
-
-// Set the default source code type in this document to C++,
-// for syntax highlighting purposes.  This is needed because
-// docbook uses c++ and html5 uses cpp.
-:language: {basebackend@docbook:c++:cpp}
-
-
-== Notice
-
-Copyright (c) 2021-2021 Intel Corporation.  All rights reserved.
-
-NOTE: Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are
-trademarks of The Khronos Group Inc.  OpenCL(TM) is a trademark of Apple Inc.
-used by permission by Khronos.
-
-This extension is written against the SYCL 2020 revision 3 specification.  All
-references below to the "core SYCL specification" or to section numbers in the
-SYCL specification refer to that revision.
-
-
-**_NOTE:_** _This document describes the current design and API for the matrix
-extension to {dpcpp}. This is an initial experimental version to try out functionality
-and performance, and **future versions of this API may change in ways that are incompatible with this experimental version**. The current implementation provides support of the matrix interface on Intel(R) Advanced Matrix Extensions (AMX) and DPAS. We are going to work with the community on incrementally improving
-the API to bring it closer to standard C++ (aligned with the `std::mdspan` and `std::mdarray` proposals) and SYCL in the next several months._
-
-## Introduction
-This document presents ongoing work towards defining a unified matrix interface. This interface is intended to unify different tensor hardware: Intel AMX in CPUs, Habana Gaudi and Goya tensor and GEMM cores, Nvidia Tensor Cores, and IBM Power MMA. All of these hardware platforms provide low-level intrinsics or assembly to access and perform matrix operations. The goal is to provide a unified interface that is portable but also benefits from the maximum performance that this hardware can offer.
-
-## Feature test macro
-
-This extension provides a feature-test macro as described in the core SYCL
-specification section 6.3.3 "Feature test macros".  Therefore, an
-implementation supporting this extension must predefine the macro
-`SYCL_EXT_ONEAPI_MATRIX` to one of the values defined in the table below.
-Applications can test for the existence of this macro to determine if the
-implementation supports this feature, or applications can test the macro's
-value to determine which of the extension's APIs the implementation supports.
-
-[frame="none",options="header"]
-|======================
-|Value |Description
-|1     |Initial extension implementation on Intel AMX.  Base features are supported.
-|2     |Initial extension JIT implementation on Intel AMX and DPAS. load, store, mad, fill, piece-wise operations, and the query interface are supported 
-|======================
-
-## New `joint_matrix` class
-We introduce a new class called `joint_matrix`. The user needs to specify the type of the elements, the shape, the memory layout, and the memory scope of the matrix. This results in the following description:
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-template <typename T, size_t Rows=sycl::dynamic_extent, size_t Cols=sycl::dynamic_extent, 
-          matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
-struct joint_matrix {
-    joint_matrix(Group g) {}
-};
-}
-```
-
-
-#### Memory Scope
-In this experimental API version, we use the term `joint_matrix` instead of plain `matrix` to emphasize that the matrix is shared among a group of work items and is not private to each work item. The memory scope is added as an additional template parameter and is also part of the constructor arguments.
-
-IMPORTANT: In the current implementation, only the subgroup scope is supported
-
-When the group is a `sycl::sub_group`, a matrix is declared as follows:
-
-```c++
-joint_matrix<int8_t, tM, tN> tA(sg); 
-```
-
-#### Shape
-The same class `joint_matrix` should handle both the case where sizes are constant (GPU case) and the case where sizes are variable (CPU case). Note that an Intel AMX 2d tile register permits sizes up to 1024 bytes (16 rows x 64 bytes). The ability to define only one interface for both makes it possible to give the user a way to make use of the flexibility introduced by the CPU but at the same time save resources on the GPU. We use `sycl::dynamic_extent` to differentiate between static and dynamic sizes.
-
-IMPORTANT: In the current implementation, only the static extent is supported
-
-
-#### Layout
-Besides row major and column major layouts, `matrix_layout` is flexible enough to introduce custom layouts such as symmetric or tiled layouts.
-	
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-enum class matrix_layout {
-  row_major,
-  col_major,
-  packed_a,
-  packed_b
-};
-}
-```
-
-Intel AMX and DPAS hardware require the B matrix to be in VNNI or 32-bit packed layout. If we multiply matrices A (M, K) and B (K, N) into a matrix C (M, N), the logical sizes are M, K, N. However, the packed shape for the B tile uses the VNNI format, which is described below. The user must provide the `packed_b` layout information so that the implementation allocates the right shape. The layout information for Intel AMX should be specified in user code as follows:
-
-```c++
-joint_matrix<int8_t, K, N, matrix_layout::packed_b> tB(sg);
-```   
-IMPORTANT: In the current implementation, the layout only needs to be specified on matrix B, where it must be `packed_b`. The layout on other matrices is ignored.
-
-
-
-## Matrix Operations and their Execution Scope
-We define three new functions needed to perform the main and common operations on matrices namely, load, store, and the actual multiply and add operation. This set of functions can be easily extended if the tensor hardware implements new features.
-
-The base pointer determines the starting address of the matrix to be loaded/stored. `layout` determines whether the data are being read/written in a row major (`row_major`) or column major (`col_major`) fashion, or if the data has already been transformed into VNNI format (`packed_a`, `packed_b`). `stride` describes the number of elements between consecutive rows for row major and packed layouts, or between consecutive columns for column major layout.
-
-Note that for getting maximum performance on Intel AMX and DPAS, prepacking data in the memory is necessary. If users do not specify the packed layouts (`packed_a` when matrix `C` is column major, `packed_b` when matrix `C` is row major), transforms done by the implementation will be slow due to extra scatter/gather operations. Hence, we expose the layouts `packed_a` and `packed_b` so that the user can specify that A or B has already been converted to VNNI format. The packed or VNNI layout is introduced in the "VNNI/Packed Layout" section below.
-	
-IMPORTANT: In the current implementation, the layout in the load of matrix B must be `packed_b`.  Therefore, both the template parameter for the declaration of the B matrix and the call to `joint_matrix_load` for the B matrix must specify the `packed_b` layout.  The layout in the load of matrices A and C must be `row_major`, and the layout in the store of matrix C must also be `row_major`.
-
-Since the matrix functions are group operations (as defined in Section 4.17.3 of the SYCL specification), the matrix API has to be accessed by all the work-items in the group in a convergent control flow. The `Group` template argument can be a work-group or a subgroup. These functions will be called once by each work item in the group.
-
-To be aligned with the SYCL 2020 group algorithms, an additional group argument is added to the matrix operations to designate that these functions are collective operations. The {dpcpp} syntax is the following: 
-
-IMPORTANT: In the current implementation, only the subgroup scope is supported.  
-
-#### Load 
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-  template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout,
-          access::address_space Space,
-          access::decorated IsDecorated>
-  void joint_matrix_load(Group sg, joint_matrix<T, NumRows, NumCols, Layout, Group> &res,
-		    multi_ptr<T, Space, IsDecorated> src, size_t stride, matrix_layout MemLayout);
-}
-```
-This function loads data from memory to the 2d tiles/registers of Intel AMX/DPAS.
-
-
-#### Store 
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-  template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout L,
-          access::address_space Space,
-          access::decorated IsDecorated>
-  void joint_matrix_store(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &src,
-		     multi_ptr<T, Space, IsDecorated> dst, size_t stride, matrix_layout memL);
-}
-```
-This function stores the data from the 2d tiles back to memory.
-
-#### Multiply and Add
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-  template <typename Group, typename Ta, typename Tb, typename Tc,
-          std::size_t M, std::size_t K, std::size_t N,
-	  matrix_layout La, matrix_layout Lb,
-          matrix_layout Lc>
-  joint_matrix<Tc, M, N, Lc, Group> joint_matrix_mad(Group sg, joint_matrix<Ta, M, K, La, Group> A,
-               joint_matrix<Tb, K, N, Lb, Group> B, joint_matrix<Tc, M, N, Lc, Group> C);
-}
-```
-The matrix multiply and add function performs the multiply operation on the matrices `A` and `B`, accumulates the result with `C`, and returns the result.
-
-
-#### Matrix Initialization: `joint_matrix_fill`
-The current interface presented above assumes that all the matrices are directly loaded from memory. This new function called `joint_matrix_fill`  makes it possible to multiply a matrix which is not directly loaded from memory but rather initialized directly in the register. On Intel AMX, if the initialization constant is zero, this would map to `_tile_zero` intrinsic: 
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-  template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout L, typename Tv>
-  void joint_matrix_fill(Group sg, joint_matrix<T, NumRows, NumCols, L, Group> &m, Tv v);
-}
-```
-IMPORTANT: In the current implementation, only the subgroup scope is supported.  
-
-#### Element Indexing and Piece-Wise Operations 
-##### Background
-Besides matrix multiply and add, this extension aims to make it possible to perform piece-wise operations on matrices in a SPMD manner. The mechanisms that are recommended to perform such piece-wise operations depend upon which of the following classes the operation falls into:
-
-Class 1- Element-wise operations where the same operation is performed on every element of the matrix, such that the operation can be performed without knowledge of the position of the element within the matrix. Activation functions or adding a constant value to every element of the matrix are two examples.
-
-Class 2- Piece-wise operations where the operation depends on the element index of the matrix or the operation takes multiple elements as operands (such as a sum of all elements in a row for example). Quantization that is needed for conversion between low precision types like `int8_t` and `fp32` uses piece-wise operations.
-
-// We explored multiple options to enable this feature in the matrix interface: 1) Allowing non-restrictive element indexing on the matrix elements would result into slow indexing on the GPU, 2) Operator overloading can represent only element-wise operations and not the operations on pieces (row, column, diagonal, etc) of the matrix. 3) Providing specific functions for these piece-wise operations can resolve some of the functions we know of today like the ones involved in quantization but it is not general to any problem that may occur in the future. 
-
-##### Explicit conversion with mapping from SIMD to SPMD
-The data elements in a joint_matrix are distributed or shared across the work-items in the Group in an implementation-defined way. There is no fixed allocation of matrix elements owned by a `joint_matrix` instance to the WIs comprising the group used to instantiate it. For instance, the matrix is a shared entity among the work items in the case of the AMX backend because the AMX tile that holds the matrix data is a 2d register that is shared among the work items. Therefore the partitioning among the WIs is implementation defined. However, it is necessary to allocate WIs to specific elements of the matrix. In order to be able to perform piece-wise operations in a general and efficient way, we provide a conversion function from the joint_matrix domain that is owned by a group of work items to the portion that is owned by each work item. This enables the WI to perform piece-wise operations on the matrix within the SYCL SPMD programming model. 
-
-We introduce a new function `get_wi_data` that provides a view of the portion of the matrix that is owned by the current WI. Modifying `wi_data` therefore also modifies the corresponding elements of the joint matrix. The indexing provided inside the `wi_data` class accesses only the portion of the current WI and returns a `wi_element`. The latter holds a reference to the original `joint_matrix` that `wi_data` was constructed from. Users can use the `=` operator to update the element of the `joint_matrix` represented by the `wi_element` after the element-wise operation.
-
-Using `get_wi_data`, it is not possible to know which portions of data are owned by each thread in the group, as this is implementation defined and changes from one backend to another. For general piece-wise operations like the sum of the rows of a matrix, the mapping from WI data to joint matrix coordinates must be known to reason about the matrix view and extract the relevant piece. But for element-wise operations where the same operation is performed on all the elements of the matrix, having all the WIs in the group apply the operation inside a loop iterating over the `length` of `wi_data` guarantees that the operation is applied to the whole matrix.
-
-Therefore, this extension currently only supports class 1 operations because the mapping between `get_wi_data` and `joint_matrix` elements is not required to be known for these operations. However, general piece-wise operations will be supported in the future, as a new API will be provided to convey the mapping from the `joint_matrix` domain to the WI domain (see the section "WI data to joint matrix mapping coordinates information for piece-wise operations" for more information).
-
-Also, note that `get_wi_data` cannot return a fixed size array length because the length of the WI portion is a runtime variable for the following reasons:
-
-1- The main compilation mode of SYCL is JIT compilation and partitioning among WIs is implementation defined.
-
-2- SG size is not fixed (like in the CUDA backend where warp size is always 32).
-
-3- AMX has the flexibility of allowing variable sizes on the matrix (`dynamic_extent`).
-
-In the case of the CUDA backend, which is SYCL AOT compiled and has a known, fixed SG size of 32, an additional `marray` capability will be provided.
-
-The code listing below shows a synopsis of these new APIs.
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-template <typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout = matrix_layout::row_major,
-          typename Group = sycl::sub_group>
-struct joint_matrix {
-   wi_data<T, NumRows, NumCols, Layout, Group> get_wi_data();
-};
-template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
-class wi_data {
-  size_t length();
-  wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i);
-};
-template <typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout = matrix_layout::row_major,
-          typename Group = sycl::sub_group>
-class wi_element {
-  operator T();
-  wi_element &operator=(const T &rhs);
-…
-};
-}
-```
-
-In the following example `wi_data_c` is a reference to the WI owned portion of the joint matrix `matC`. As such `wi_data_c[i] OP rhs` updates the corresponding matrix element in the joint_matrix `matC`.
-Vectorization along the subgroup dimension will get enabled automatically to vectorize the contiguous portion of the matrix. 
-
-
-```c++
-auto wi_data_c = matC.get_wi_data();             
-for (int i = 0; i < wi_data_c.length(); i++)                
-        wi_data_c[i] *= alpha;    // Note that the indexing here "i" is in the vector owned by a WI, not in the matrix C        
-```
-
-IMPORTANT: In the current implementation, only the subgroup scope is supported.  
-
-IMPORTANT: The WI data to joint matrix mapping coordinates information is not implemented yet. 
-
-IMPORTANT: Since the current Tensor Cores implementation is AOT, it is possible to know how many elements are owned by each WI at compile time. In this case, `wi_data` can be of type `marray`. An additional interface will be provided for the Tensor Cores AOT backend.
-
-
-## VNNI/Packed Layout
-Intel AMX and DPAS compute assumes that the register for the B tile (src1) is in VNNI format, as they need 32 bits of K-data in A and B to be contiguous in memory.
-The VNNI blocking factor is 2 in the case of 16-bit types, and it is 4 in the case of 8-bit types. While the current implementation assumes that the matrix has been already packed by the user for performance reasons, the layout information is needed to inform the implementation about this transform.  The following example illustrates how a matrix in `row_major` layout is transformed into the `packed_b` layout for a 16-bit type.
-
-#### Example 1: 16-bit elements
-      // Example of a 4 row x 4 column matrix using a 16-bit data element, in row-major layout.
-      // Element a1 is contiguous in memory with element b1, etc.
-      // ---------------------------------
-      // a1, b1, c1, d1
-      // a2, b2, c2, d2
-      // a3, b3, c3, d3
-      // a4, b4, c4, d4
-      // ---------------------------------
-      // The same matrix reformatted in packed_b layout. 
-      // Here, packing of 2 elements is needed to form 32 bits.
-      // Element a1 is contiguous in memory with element a2, etc.
-      // ---------------------------------
-      // a1, a2, b1, b2, c1, c2, d1, d2
-      // a3, a4, b3, b4, c3, c4, d3, d4
-
-#### Example 2: 8-bit elements
-
-      // Example of a 4 row x 4 column matrix using an 8-bit data element, in row-major layout.
-      // Element a1 is contiguous in memory with element b1, etc.
-      // ---------------------------------
-      // a1, b1, c1, d1
-      // a2, b2, c2, d2
-      // a3, b3, c3, d3
-      // a4, b4, c4, d4
-      // ---------------------------------
-      // The same matrix reformatted in packed_b layout.  
-      // Here, packing of 4 elements is needed to form 32 bits.
-      // Elements a1, a2, a3, a4 are contiguous in memory, etc.
-      // ---------------------------------
-      // a1, a2, a3, a4, b1, b2, b3, b4, c1, c2, c3, c4, d1, d2, d3, d4
-
-
-## Example using int8_t type
-```c++
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-queue q;
-range<2> G = {M/tM, N};
-range<2> L = {1, SG_SIZE};
-int8_t *memA = malloc_shared<int8_t>(M*K, q);
-int8_t *memB = malloc_shared<int8_t>(K*N, q);
-int32_t *memC = malloc_shared<int32_t>(M*N, q);
-// Assuming memB has already been VNNIed
-q.parallel_for(nd_range<2>(G, L), [=](nd_item<2> item)                            
-  [[sycl::reqd_sub_group_size(SG_SIZE)]] {
-   const auto global_idx = item.get_global_id(0);
-   const auto global_idy = item.get_global_id(1);
-   const auto sg_startx = global_idx - item.get_local_id(0);
-   const auto sg_starty = global_idy - item.get_local_id(1);
-   sub_group sg = item.get_sub_group();
-   joint_matrix<int8_t, tM, tK> tA(sg);
-   // For B, since current implementation does not support non packed layout,
-   // users need to specify the packed_b layout
-   joint_matrix<int8_t, tK, tN, matrix_layout::packed_b> tB(sg);
-   joint_matrix<int32_t, tM, tN> tC(sg);
-   joint_matrix_fill(sg, tC, 0);
-   for (int k = 0; k < K; k += tK) {
-     joint_matrix_load(sg, tA, memA + sg_startx * tM * K + k, K, matrix_layout::row_major);
-     joint_matrix_load(sg, tB, memB + k * N + sg_starty/SG_SIZE*tN*4, N*4, matrix_layout::packed_b); // VNNI
-     tC = joint_matrix_mad(sg, tA, tB, tC);
-   }
-   auto wi_data_c = tC.get_wi_data();
-   for (int i = 0; i < wi_data_c.length(); i++)                
-     wi_data_c[i] *= alpha; // The indexing here "i" is in the vector owned by a WI, not in the matrix C
-   joint_matrix_store(sg, tC, memC + sg_startx * tM * N + sg_starty/SG_SIZE*tN, N, matrix_layout::row_major);
-}).wait();
-```
-
-== Query Interface
-Intel AMX, DPAS and Nvidia Tensor Cores support different sizes and types.
-The query interface is used to validate user code and inform them about supported types, sizes, scope, and layouts by the implementation.
-This also improves development and tuning productivity for both scientists and library developers. The query interface we are proposing here is a compile-time query,
-so there will be no runtime errors.   
-The query interface proposed here consists of three functionalities:
-
-- Validation: at compile time, the validation functionality informs the user whether a specific combination is valid or not. This takes place when the user specifies all template parameters.
-
-- Default values: this provides a default shape if the user does not provide a specific combination. In this case, aliases to the `joint_matrix` type can be used, namely `joint_matrix_a/b/c` where no additional argument is needed. This form happens when the user specifies all template parameters except the sizes of the matrices (`tiles`) M, N, and K.
-
-- General query: the general query interface provides information about sizes, types, static/dynamic, and scopes that are supported by a specific TPU implementation. This is needed to avoid padding by the user, for tuning, and efficient code generation if used by a library. The general query returns an array of `combinations` of `combination` type. Each combination includes the sizes and the types for the matrices A, B, and C. Note that for each TPU, the query returns `max_msize, max_nsize, max_ksize` or `msize, nsize, ksize` exclusively, depending on whether the implementation supports a continuous or discrete number of sizes. For example, the Intel AMX implementation supports a continuous number of sizes, so the `max_*` variant is applied and only the maximum number is returned. The DPAS implementation, on the other hand, supports a discrete list of numbers, so the `msize, nsize, ksize` variant is applied. This form takes place when users only specify the TPU they are interested in using.
-
-The table below provides a description for each of the member variables and type aliases in `tpu_params` class and the forms in which  they are defined.
-
-[frame="none",options="header"]
-|======================
-| Member/type alias in `tpu_params` | Forms they are defined in |Description
-|`type_a`| validation, default values|type alias for the type of matrix A
-|`type_b`|  validation, default values|type alias for the type of matrix B
-|`type_c`|  validation, default values|type alias for the type of matrix C
-|`defaultM`|  validation, default values|when no sizes are provided by the user, indicates the suggested default size for M; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value M that the user provides if M is supported by the implementation
-|`defaultN`|  validation, default values|when no sizes are provided by the user, indicates the suggested default size for N; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value N that the user provides if N is supported by the implementation
-|`defaultK`|  validation, default values|when no sizes are provided by the user, indicates the suggested default size for K; usually this corresponds to the maximum size the implementation supports. In validation mode, where the user does provide sizes, this is the same value K that the user provides if K is supported by the implementation
-|`joint_matrix_a`|  validation, default values|type alias for `joint_matrix` for matrix A
-|`joint_matrix_b`| validation, default values| type alias for `joint_matrix` for matrix B
-|`joint_matrix_c`|  validation, default values| type alias for `joint_matrix` for matrix C
-|`dynamic_p`| validation, default values, general query| a boolean that indicates whether the implementation supports dynamic sizes (true) or not (false)
-|`numtiles`|  validation, default values, general query|indicates number of tiles in Intel AMX (does not apply to DPAS)
-|`scope`| validation, default values, general query| indicates the memory and execution scope supported by the TPU implementation
-|`combination` |  validation, default values, general query|composes the types and sizes of A, B, C matrices allowed in one combination
-|`max_msize`, `max_nsize`, `max_ksize`|  validation, default values, general query| if the TPU implementation supports a continuous number of element sizes, each of these members is non-zero, and the TPU implementation supports all element sizes from 1 up to (and including) that number. By contrast, if the TPU implementation supports a discrete number of element sizes, each of these members has the value zero
-|`msize`, `nsize`, `ksize`|  validation, default values, general query| if the TPU implementation supports a discrete number of element sizes, each of these members is non-zero, and the value tells one of the supported element sizes. By contrast, if the TPU supports a continuous number of element sizes, each of these members has the value zero
-|`atype`, `btype`, `ctype`| validation, default values, general query| indicates the types supported in the combination
-|`combinations`    | validation, default values, general query| tells the set of supported matrix sizes and types according to the template parameters that are provided. In the "general query" form, the user provides only the TPU type, so the combinations array contains all supported tile sizes and element types for that TPU. In the "default values" form, the user provides the TPU type and element types, so the combinations array contains only those supported matrix sizes and element types that match those element types on that TPU. In the "validation" form, the user provides the TPU type, element types, and element sizes so only this specific combination is returned in the combinations array. 
-|`num_combinations`|  validation, default values, general query|indicates number of combinations supported by the TPU implementation which corresponds to the size of the `combinations` array
-|======================
-
-
-
-
-
-
-```c++
-namespace sycl::ext::oneapi::experimental::matrix {
-
-
-template<tpu u, typename Ta=void, typename Tb=void, typename Tc=void, int M=0, int N=0, int K=0, typename Enabled=void>
-struct tpu_params;
-
-// Validation form: Valid or not
-// Specialization when both types and sizes are given
-template <typename Ta, typename Tb, typename Tc, int M, int N, int K>
-struct tpu_params<
-    tpu::amx, Ta, Tb, Tc, M, N, K,
-    typename std::enable_if<(
-        !std::is_same_v<Ta, void> && !std::is_same_v<Tb, void> &&
-        !std::is_same_v<Tc, void> && M != 0 && N != 0 && K != 0)>::type> {
-  // Validate that parameters are supported
-  static_assert(
-      (M == 0 && N == 0 && K == 0) ||
-          (is_combination_valid_amx<Ta, Tb, Tc>(M, N, K)),
-      "Invalid parameters for Intel AMX, query valid types and maximum sizes "
-      "using: "
-      "tpu_params<tpu::amx> myparams; and then check out myparams.combinations array");
-
-
-  using type_a = Ta; // this type alias is not available in the current implementation 
-  using type_b = Tb; // this type alias is not available in the current implementation
-  using type_c = Tc; // this type alias is not available in the current implementation
-
-  // if combination is valid, construct the matrices
-
-  static constexpr std::size_t defaultM = (M != 0) ? M : 16;
-  static constexpr std::size_t defaultN = (N != 0) ? N : 16;
-  static constexpr std::size_t defaultK =
-      (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32);
-
-  template <matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
-  using joint_matrix_a = joint_matrix<Ta, defaultM, defaultK, Layout, Group>;
-  template <matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
-  using joint_matrix_b = joint_matrix<Tb, defaultK, defaultN, Layout, Group>;
-  template <matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
-  using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, Layout, Group>;
-
-  static constexpr bool dynamic_p = false; // should be true in future implementations
-                          // because Intel AMX hardware supports dynamic sizes
-  static constexpr uint32_t numtiles = 8;
-  static constexpr scope_t scope = scope_t::sub_group;
-  struct combination {
-    uint32_t max_msize;
-    uint32_t max_nsize;
-    uint32_t max_ksize;
-    uint32_t msize;
-    uint32_t nsize;
-    uint32_t ksize;
-    matrix_type atype;
-    matrix_type btype;
-    matrix_type ctype;
-  };
-  // In this case, the combinations array contains only the combination that the user provided
-  static constexpr combination combinations[] = {
-      {16, 16, (sizeof(Ta) == 1) ? 64 : 32, M, N, K}};
-  static constexpr int num_combinations =
-      sizeof(combinations) / sizeof(combination);
-};
-
-// Default values form: Sizes-only query
-// Specialization for when only types are given, need to query only sizes
-template <typename Ta, typename Tb, typename Tc>
-struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0,
-                  typename std::enable_if<(!std::is_same_v<Ta, void> &&
-                                           !std::is_same_v<Tb, void> &&
-                                           !std::is_same_v<Tc, void>)>::type> {
-  static_assert((are_types_valid_amx<Ta, Tb, Tc>()),
-                "Invalid types for Intel AMX, supported types are int8_t, uint8_t, "
-                "and bf16 (Note that unsigned short should be used in the "
-                "DPC++ code to implement bf16) ");
-  
-  using type_a = Ta; // this type alias is not available in the current implementation 
-  using type_b = Tb; // this type alias is not available in the current implementation
-  using type_c = Tc; // this type alias is not available in the current implementation
- 
-  // construct the matrices using the default sizes
-  static constexpr std::size_t defaultM = 16;
-  static constexpr std::size_t defaultN = 16;
-  static constexpr std::size_t defaultK = ((sizeof(Ta) == 1) ? 64 : 32);
-
-  template <matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
-  using joint_matrix_a = joint_matrix<Ta, defaultM, defaultK, Layout, Group>;
-  template <matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
-  using joint_matrix_b = joint_matrix<Tb, defaultK, defaultN, Layout, Group>;
-  template <matrix_layout Layout = matrix_layout::row_major, typename Group = sub_group>
-  using joint_matrix_c = joint_matrix<Tc, defaultM, defaultN, Layout, Group>;
-
-  static constexpr bool dynamic_p = false; // should be true in future implementations because
-                          // Intel AMX hardware supports dynamic sizes
-  static constexpr uint32_t numtiles = 8;
-  static constexpr scope_t scope = scope_t::sub_group;
-  struct combination {
-    uint32_t max_msize;
-    uint32_t max_nsize;
-    uint32_t max_ksize;
-    uint32_t msize;
-    uint32_t nsize;
-    uint32_t ksize;
-    matrix_type atype;
-    matrix_type btype;
-    matrix_type ctype;
-  };
-  // In this case, the combinations array contains only the combinations that correspond to the Ta, Tb, and Tc
-  // types that the user provided
-  static constexpr combination combinations[] = {
-      {16, 16, (sizeof(Ta) == 1) ? 64 : 32}};
-  static constexpr int num_combinations =
-      sizeof(combinations) / sizeof(combination);
-};
-
-// General query form:
-// types are not given, no default sizes and no implicit matrix construction
-template <int M, int N, int K>
-struct tpu_params<tpu::amx, void, void, void, M, N, K> {
-  static constexpr bool dynamic_p = false; // should be true in future implementations because
-                          // Intel AMX hardware supports dynamic sizes
-  static constexpr uint32_t numtiles = 8;
-  static constexpr scope_t scope = scope_t::sub_group;
-  struct combination {
-    uint32_t max_msize;
-    uint32_t max_nsize;
-    uint32_t max_ksize;
-    uint32_t msize;
-    uint32_t nsize;
-    uint32_t ksize;
-    matrix_type atype;
-    matrix_type btype;
-    matrix_type ctype;
-  };
-  
-  static constexpr combination combinations[] = {
-      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::sint8, matrix_type::sint32},
-      {16, 16, 64, 0, 0, 0, matrix_type::sint8, matrix_type::uint8, matrix_type::sint32},
-      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::sint8, matrix_type::sint32},
-      {16, 16, 64, 0, 0, 0, matrix_type::uint8, matrix_type::uint8, matrix_type::sint32},
-      {16, 16, 32, 0, 0, 0, matrix_type::bf16, matrix_type::bf16, matrix_type::fp32}};
-  static constexpr int num_combinations =
-      sizeof(combinations) / sizeof(combination);
-};
-
-
-enum class tpu {
-  dpas,
-  amx
-};
-
-enum class matrix_type {
-  bf16,
-  fp16,
-  fp19,  // tfloat32
-  fp32,
-  fp64,
-  sint2,
-  sint4,
-  sint8,
-  sint16,
-  sint32, 
-  sint64,
-  uint2,
-  uint4,
-  uint8,
-  uint16,
-  uint32,
-  uint64
-};
-
-enum class scope_t {
-  sub_group,
-  work_group
-};
-}
-```
-
-
-=== Validation Example:
-```c++
-// The user can provide sizes in addition to the types, and tpu_params asserts whether they are supported.
-// In this case, an assertion will happen because 16 is not a supported size for M
-using myparams = tpu_params<tpu::dpas, int8_t, int8_t, int, 16, 8, 32>;  
-size_t NDRangeM = M / myparams::defaultM;  //Assertion would happen at this line
-size_t NDRangeN = N / myparams::defaultN;
-```
-
-=== Default Values Example:
-```c++
-using myparams = tpu_params_both<tpu::dpas, int8_t, int8_t, int>;  
-// use this to construct the ranges on the host side  
-size_t NDRangeM = M / myparams::defaultM;  
-size_t NDRangeN = N / myparams::defaultN;
-// if M, N, K are not multiples of the default sizes, padding has to be done
-// device code: the matrices are constructed using the default dimensions  
-myparams::joint_matrix_a sub_a(sg);  
-myparams::joint_matrix_b<matrix_layout::packed_b> sub_b(sg);  
-myparams::joint_matrix_c sub_c(sg);
-
-```
-
-=== General Query Example:
-```c++
-constexpr int M = 1500; // with msize = 8 and msize = 4,
-          // M can be split into 125 8-sized ops (1000 rows) plus 125 4-sized ops for the remaining 500 rows
-tpu_params<tpu::dpas> params;
-constexpr int msize = break_dimension(params, M);
-constexpr int msize_remainder = break_dimension_remainder(params, M);
-constexpr int nsize = params.combinations[0].nsize;
-constexpr int ksize = params.combinations[0].ksize;
-// device code:
-joint_matrix<int8_t, msize, ksize> sub_a(sg);
-joint_matrix<int8_t, ksize, nsize, matrix_layout::packed_b> sub_b(sg);
-joint_matrix<int, msize, nsize> sub_c(sg);
-//Remainder handling
-```
-
-//No need to provide more details in this section because the query interface can serve this purpose.
-
-//## Implementation Status
-
-//### oneAPI 2022.0 release
-//For oneAPI 2022.0 release, a JIT implementation has been made available on both Intel AMX and DPAS hardware of the specific features discussed above. In this case, there is no need to specify any architectural options to the command line. The static query interface can be used to guide the usage of this API. 
-// The DPAS and Intel AMX implementations support the logical capability support of the HW
-
-
-
-
-## Future-looking API
-
-### Memory scope
-The current experimental API uses `joint_` semantics to define the memory scope of the matrix. The long term solution is to use the proposed link:../supported/sycl_ext_oneapi_local_memory.asciidoc[`group_local_memory` extension] to allocate the matrix in local memory associated with a SYCL group as shown in the example below.
-
-
-```c++
-multi_ptr<matrix<T>, address_space::local_space> tA_ptr = group_local_memory<matrix<sub_group, int8_t, tM, tN>>(sg);
-```
-We did not utilize this extension for this matrix API version because sub-group local memory is not yet well defined in {dpcpp}. Moreover, the representation of this notion in LLVM IR and SPIR-V is not clear yet. 
-
-### WI data to joint matrix mapping coordinates information for piece-wise operations
-The indexing provided inside the `wi_data` class accesses only the portion of the matrix owned by the current WI. It is not possible to know the location or coordinates of this portion in the original matrix. This coordinate mapping is implementation-defined and changes from one backend to another. For general piece-wise operations, such as the sum of the rows of a matrix, the mapping from WI data to joint matrix coordinates is needed to reason about the matrix view.
-With joint matrix, we want to write, as much as possible, a single code base that runs on different backends. If backend X states, for instance, that a WI owns exactly one row of the matrix, then the following code works only on that backend for that version of the hardware. Hardware and implementations change: for instance, the same WI may own only half of a row because the SG size or the number of hardware units increased.
-
-```c++
-auto data = C.get_wi_data();
-for (int i = 0; i < length; ++i) {
-  sum_of_local_rows[row] += data[i];
-}
-```
-
-
-
-We want to keep backward compatibility in the joint matrix code when implementations or hardware change. To that end, instead of hard-coding this mapping, we want to write general, backend- and target-agnostic code, especially in the JIT compilation mode of SYCL. This is possible by querying the mapping, so that code does not have to change from one version to another.
-
-Since this mapping is implementation-defined, one proposal is to add runtime functions such as:
-```c++
-auto data = C.get_wi_data();
-for (int i = 0; i < length; ++i) {
-  auto [row, col] = data[i].get_coord();
-  sum_of_local_rows[row] += data[i];
-}
-```
-
-
-## Open Questions
-- Besides the row-major, column-major, and packed (VNNI) layouts, which additional layouts should absolutely be added?
-- Are there alternative names for the `packed_a` and `packed_b` layouts that would more clearly distinguish between the VNNI layout in matrix A and the VNNI layout in matrix B of a matrix multiply and add operation on Intel AMX?
--- Yes, this will be addressed in the next revision, where a `use` argument will be introduced to distinguish between the right (B), left (A), and accumulator matrices.
-- Ronan Keryell: "It would be interesting to investigate whether providing also member functions would simplify the API. Provide both so it is possible to use the best one for each use case, while waiting for https://en.wikipedia.org/wiki/Uniform_Function_Call_Syntax to land into C++?"
-
-- In the future-looking APIs, `get_wi_data` (which is currently under design) returns an owned object. Should it instead return a view object, to make sure the original matrix C is changed after its slices are modified?
-
-## TODO List
-- Add WI data to joint matrix mapping coordinates information for piece-wise operations. This will be added as part of the query or new methods to the 'get_wi_data' class. 
-- Add 'matrix_use' parameter to the matrix to distinguish between matrix A, B, and matrix accumulator. This is necessary for supporting VNNI and transpose transform 
-- Change the names of the default sizes in the query from defaultM, defaultN, defaultK to M, N, K
-- Change the type of `scope` in the query interface to be able to return more than one value. This will be useful in the event we support other scopes like workgroup besides subgroups
-- Add a more realistic and complete example that shows the value of the general query
-
-
-## Revision History
-
-[frame="none",options="header"]
-|======================
-|Rev |Date       |Author     |Changes
-|1   |2021-04-13 |Dounia Khaldi |Initial public working draft.
-|2   |2021-10-05 |Dounia Khaldi |JIT implementation on both Intel AMX and DPAS
-|3   |2022-05-16 |Dounia Khaldi |Add matrix fill and piece-wise operations support
-|======================
diff --git a/sycl/include/CL/__spirv/spirv_ops.hpp b/sycl/include/CL/__spirv/spirv_ops.hpp
index 5db5d09efd335..6e5b518a7b9aa 100644
--- a/sycl/include/CL/__spirv/spirv_ops.hpp
+++ b/sycl/include/CL/__spirv/spirv_ops.hpp
@@ -24,7 +24,6 @@
 
 #ifdef __SYCL_DEVICE_ONLY__
 
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION > 1)
 extern __DPCPP_SYCL_EXTERNAL float __spirv_RoundFToTF32INTEL(float a);
 template <typename T, typename Tp, std::size_t R, std::size_t C,
           __spv::MatrixUse U,
@@ -139,96 +138,6 @@ template <typename Ts, typename T, std::size_t R, std::size_t C,
 extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<T, R, C, L, S, U> *
 __spirv_VectorInsertDynamic(__spv::__spirv_JointMatrixINTEL<T, R, C, L, S, U> *,
                             Ts val, size_t i);
-#else
-template <typename T, typename Tp, std::size_t R, std::size_t C,
-          __spv::MatrixLayout L = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<Tp, R, C, L, S> *
-__spirv_JointMatrixLoadINTEL(T *Ptr, std::size_t Stride,
-                             __spv::MatrixLayout Layout = L,
-                             __spv::Scope::Flag Sc = S, int MemOperand = 0);
-
-template <typename T, typename Tp, std::size_t R, std::size_t C,
-          __spv::MatrixLayout L = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL void __spirv_JointMatrixStoreINTEL(
-    T *Ptr, __spv::__spirv_JointMatrixINTEL<Tp, R, C, L, S> *Object,
-    std::size_t Stride, __spv::MatrixLayout Layout = L,
-    __spv::Scope::Flag Sc = S, int MemOperand = 0);
-
-template <typename T1, typename T2, std::size_t M, std::size_t K, std::size_t N,
-          __spv::MatrixLayout LA = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LB = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LC = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<T2, M, N, LC, S> *
-__spirv_JointMatrixMadINTEL(
-    __spv::__spirv_JointMatrixINTEL<T1, M, K, LA, S> *A,
-    __spv::__spirv_JointMatrixINTEL<T1, K, N, LB, S> *B,
-    __spv::__spirv_JointMatrixINTEL<T2, M, N, LC, S> *C,
-    __spv::Scope::Flag Sc = __spv::Scope::Flag::Subgroup);
-
-template <typename T1, typename T2, typename T3, std::size_t M, std::size_t K,
-          std::size_t N, __spv::MatrixLayout LA = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LB = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LC = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *
-__spirv_JointMatrixUUMadINTEL(
-    __spv::__spirv_JointMatrixINTEL<T1, M, K, LA, S> *A,
-    __spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
-    __spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
-    __spv::Scope::Flag Sc = __spv::Scope::Flag::Subgroup);
-
-template <typename T1, typename T2, typename T3, std::size_t M, std::size_t K,
-          std::size_t N, __spv::MatrixLayout LA = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LB = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LC = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *
-__spirv_JointMatrixUSMadINTEL(
-    __spv::__spirv_JointMatrixINTEL<T1, M, K, LA, S> *A,
-    __spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
-    __spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
-    __spv::Scope::Flag Sc = __spv::Scope::Flag::Subgroup);
-
-template <typename T1, typename T2, typename T3, std::size_t M, std::size_t K,
-          std::size_t N, __spv::MatrixLayout LA = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LB = __spv::MatrixLayout::RowMajor,
-          __spv::MatrixLayout LC = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *
-__spirv_JointMatrixSUMadINTEL(
-    __spv::__spirv_JointMatrixINTEL<T1, M, K, LA, S> *A,
-    __spv::__spirv_JointMatrixINTEL<T2, K, N, LB, S> *B,
-    __spv::__spirv_JointMatrixINTEL<T3, M, N, LC, S> *C,
-    __spv::Scope::Flag Sc = __spv::Scope::Flag::Subgroup);
-
-template <typename T, std::size_t R, std::size_t C,
-          __spv::MatrixLayout L = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<T, R, C, L, S> *
-__spirv_CompositeConstruct(const T v);
-
-template <typename T, std::size_t R, std::size_t C,
-          __spv::MatrixLayout L = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL size_t __spirv_JointMatrixWorkItemLengthINTEL(
-    __spv::__spirv_JointMatrixINTEL<T, R, C, L, S> *);
-
-template <typename T, std::size_t R, std::size_t C,
-          __spv::MatrixLayout L = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL T __spirv_VectorExtractDynamic(
-    __spv::__spirv_JointMatrixINTEL<T, R, C, L, S> *, size_t i);
-
-template <typename T, std::size_t R, std::size_t C,
-          __spv::MatrixLayout L = __spv::MatrixLayout::RowMajor,
-          __spv::Scope::Flag S = __spv::Scope::Flag::Subgroup>
-extern __DPCPP_SYCL_EXTERNAL __spv::__spirv_JointMatrixINTEL<T, R, C, L, S> *
-__spirv_VectorInsertDynamic(__spv::__spirv_JointMatrixINTEL<T, R, C, L, S> *,
-                            T val, size_t i);
-#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
 
 #ifndef __SPIRV_BUILTIN_DECLARATIONS__
 #error                                                                         \
@@ -1220,7 +1129,7 @@ extern __DPCPP_SYCL_EXTERNAL
     std::enable_if_t<std::is_integral_v<to> && std::is_unsigned_v<to>, to>
     __spirv_ConvertPtrToU(from val) noexcept;
 
-#else // if !__SYCL_DEVICE_ONLY__
+#else  // if !__SYCL_DEVICE_ONLY__
 
 template <typename dataT>
 __SYCL_CONVERGENT__ extern __ocl_event_t
diff --git a/sycl/include/CL/__spirv/spirv_types.hpp b/sycl/include/CL/__spirv/spirv_types.hpp
index 10880467b4563..6a2348c9b204e 100644
--- a/sycl/include/CL/__spirv/spirv_types.hpp
+++ b/sycl/include/CL/__spirv/spirv_types.hpp
@@ -8,7 +8,7 @@
 
 #pragma once
 
-#include <sycl/detail/defines.hpp> // for SYCL_EXT_ONEAPI_MATRIX_VERSION
+#include <sycl/detail/defines.hpp> // for __has_builtin
 #include <sycl/half_type.hpp>      // for half
 
 #include <cstddef> // for size_t
@@ -110,35 +110,19 @@ enum class GroupOperation : uint32_t {
   ClusteredReduce = 3,
 };
 
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION > 1)
 enum class MatrixLayout : uint32_t {
   RowMajor = 0,
   ColumnMajor = 1,
   Packed = 2,
   Dynamic = 3
 };
-#else
-enum class MatrixLayout : uint32_t {
-  RowMajor = 0,
-  ColumnMajor = 1,
-  PackedA = 2,
-  PackedB = 3,
-  Unused = 4
-};
-#endif
 
 enum class MatrixUse : uint32_t { MatrixA = 0, MatrixB = 1, Accumulator = 2 };
 
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION > 1)
 template <typename T, std::size_t R, std::size_t C, MatrixLayout L,
           Scope::Flag S = Scope::Flag::Subgroup,
           MatrixUse U = MatrixUse::MatrixA>
 struct __spirv_JointMatrixINTEL;
-#else
-template <typename T, std::size_t R, std::size_t C, MatrixLayout L,
-          Scope::Flag S = Scope::Flag::Subgroup>
-struct __spirv_JointMatrixINTEL;
-#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
 
 } // namespace __spv
 
@@ -176,8 +160,8 @@ template <int Bits> using ap_int = _BitInt(Bits);
 // SPIRV built-in functions.
 // Only in such cases the class is recognized as SPIRV type __ocl_event_t.
 #ifndef __SYCL_DEVICE_ONLY__
-typedef void* __ocl_event_t;
-typedef void* __ocl_sampler_t;
+typedef void *__ocl_event_t;
+typedef void *__ocl_sampler_t;
 // Adding only the datatypes that can be currently used in SYCL,
 // as per SYCL spec 1.2.1
 #define __SYCL_SPV_IMAGE_TYPE(NAME) typedef void *__ocl_##NAME##_t
diff --git a/sycl/include/sycl/detail/defines.hpp b/sycl/include/sycl/detail/defines.hpp
index de2de047528b1..a56fac997b30e 100644
--- a/sycl/include/sycl/detail/defines.hpp
+++ b/sycl/include/sycl/detail/defines.hpp
@@ -38,12 +38,3 @@
 #else
 #define __SYCL_TYPE(x)
 #endif
-
-// joint matrix should only be included by default for SPIR, NVPTX or HIP(GFX90A
-// only) backends
-#if defined __SPIR__ || defined __NVPTX__ || !defined __SYCL_DEVICE_ONLY__ ||  \
-    defined __gfx90a__
-#ifndef SYCL_EXT_ONEAPI_MATRIX_VERSION
-#define SYCL_EXT_ONEAPI_MATRIX_VERSION 4
-#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
-#endif // __SPIR__ || __NVPTX__ || !__SYCL_DEVICE_ONLY || __gfx90a__
diff --git a/sycl/include/sycl/ext/oneapi/matrix/matrix-jit.hpp b/sycl/include/sycl/ext/oneapi/matrix/matrix-jit.hpp
deleted file mode 100644
index e9f2ccadc0dba..0000000000000
--- a/sycl/include/sycl/ext/oneapi/matrix/matrix-jit.hpp
+++ /dev/null
@@ -1,647 +0,0 @@
-//==---------------- matrix-jit.hpp - SYCL matrix --------------*- C++ -*---==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-// ===--------------------------------------------------------------------=== //
-
-#pragma once
-
-#include "utils.hpp"
-#include <CL/__spirv/spirv_ops.hpp>
-#include <sycl/detail/defines_elementary.hpp>
-#include <sycl/ext/oneapi/bfloat16.hpp>
-#include <sycl/feature_test.hpp>
-
-namespace sycl {
-inline namespace _V1 {
-namespace ext::oneapi::experimental::matrix {
-
-enum class matrix_layout { row_major, col_major, packed_a, packed_b };
-
-template <matrix_layout Layout> struct spv_matrix_layout_traits {
-  static constexpr __spv::MatrixLayout value = __spv::MatrixLayout::RowMajor;
-};
-
-#define SPV_MATRIX_LAYOUT_TRAITS(LAYOUT, SPV_LAYOUT)                           \
-  template <> struct spv_matrix_layout_traits<LAYOUT> {                        \
-    static constexpr __spv::MatrixLayout value = SPV_LAYOUT;                   \
-  };
-
-SPV_MATRIX_LAYOUT_TRAITS(matrix_layout::row_major,
-                         __spv::MatrixLayout::RowMajor)
-SPV_MATRIX_LAYOUT_TRAITS(matrix_layout::col_major,
-                         __spv::MatrixLayout::ColumnMajor)
-SPV_MATRIX_LAYOUT_TRAITS(matrix_layout::packed_a, __spv::MatrixLayout::PackedA)
-SPV_MATRIX_LAYOUT_TRAITS(matrix_layout::packed_b, __spv::MatrixLayout::PackedB)
-
-template <typename G> struct spv_scope_traits {};
-template <> struct spv_scope_traits<sycl::sub_group> {
-  constexpr static auto value = __spv::Scope::Subgroup;
-};
-template <int D> struct spv_scope_traits<sycl::group<D>> {
-  constexpr static auto value = __spv::Scope::Workgroup;
-};
-
-template <typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout = matrix_layout::row_major,
-          typename Group = sycl::sub_group>
-class wi_data;
-
-template <typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout = matrix_layout::row_major,
-          typename Group = sycl::sub_group>
-struct joint_matrix {
-public:
-  __spv::__spirv_JointMatrixINTEL<
-      T, NumRows, NumCols, spv_matrix_layout_traits<Layout>::value,
-      spv_scope_traits<Group>::value> *spvm;
-  joint_matrix(Group sg) {
-#ifndef __SYCL_DEVICE_ONLY__
-    (void)sg;
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  inline __SYCL_ALWAYS_INLINE wi_data<T, NumRows, NumCols, Layout, Group>
-  get_wi_data() {
-    return wi_data<T, NumRows, NumCols, Layout, Group>(*this);
-  }
-
-#ifdef __SYCL_DEVICE_ONLY__
-#if defined(__SPIR__)
-  // Generate a non-trivial assignment operator and copy c'tor that prevents
-  // memcpy from being generated.
-  // TODO: to remove, when either IGC can handle alloca JointMatrix or
-  // combination of InstCombine + SROA + mem2reg can remove it
-  joint_matrix(const joint_matrix &other) {
-    // A constructor cannot return a value; only copy the handle here.
-    spvm = other.spvm;
-  }
-
-  joint_matrix &operator=(const joint_matrix &rhs) {
-    spvm = rhs.spvm;
-    return *this;
-  }
-#endif // defined(__SPIR__)
-#endif
-};
-
-template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout = matrix_layout::row_major,
-          access::address_space Space, access::decorated IsDecorated>
-inline __SYCL_ALWAYS_INLINE void joint_matrix_load(
-    Group sg, joint_matrix<T, NumRows, NumCols, Layout, Group> &res,
-    multi_ptr<T, Space, IsDecorated> src, size_t stride, matrix_layout MemL) {
-#ifdef __SYCL_DEVICE_ONLY__
-  static_assert(Space != access::address_space::private_space,
-                "Joint Matrix doesn't support load from private memory!");
-  using DecorT = typename sycl::detail::DecoratedType<T, Space>::type;
-  DecorT *Ptr = sycl::detail::getDecorated<DecorT>(src);
-  switch (MemL) {
-  default:
-    assert(false && "Invalid Memory Layout!");
-  case matrix_layout::row_major:
-    res.spvm =
-        __spirv_JointMatrixLoadINTEL<DecorT, T, NumRows, NumCols,
-                                     spv_matrix_layout_traits<Layout>::value>(
-            Ptr, stride, __spv::MatrixLayout::RowMajor,
-            spv_scope_traits<Group>::value);
-    break;
-  case matrix_layout::col_major:
-    res.spvm =
-        __spirv_JointMatrixLoadINTEL<DecorT, T, NumRows, NumCols,
-                                     spv_matrix_layout_traits<Layout>::value>(
-            Ptr, stride, __spv::MatrixLayout::ColumnMajor,
-            spv_scope_traits<Group>::value);
-    break;
-  case matrix_layout::packed_a:
-    res.spvm =
-        __spirv_JointMatrixLoadINTEL<DecorT, T, NumRows, NumCols,
-                                     spv_matrix_layout_traits<Layout>::value>(
-            Ptr, stride, __spv::MatrixLayout::PackedA,
-            spv_scope_traits<Group>::value);
-    break;
-  case matrix_layout::packed_b:
-    res.spvm =
-        __spirv_JointMatrixLoadINTEL<DecorT, T, NumRows, NumCols,
-                                     spv_matrix_layout_traits<Layout>::value>(
-            Ptr, stride, __spv::MatrixLayout::PackedB,
-            spv_scope_traits<Group>::value);
-    break;
-  }
-#else
-  (void)sg;
-  (void)res;
-  (void)src;
-  (void)stride;
-  (void)MemL;
-  throw runtime_error("joint matrix is not supported on host device.",
-                      PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-}
-
-template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout MatL = matrix_layout::row_major,
-          access::address_space Space, access::decorated IsDecorated>
-inline __SYCL_ALWAYS_INLINE void joint_matrix_store(
-    Group sg, joint_matrix<T, NumRows, NumCols, MatL, Group> &src,
-    multi_ptr<T, Space, IsDecorated> res, size_t stride, matrix_layout MemL) {
-#ifdef __SYCL_DEVICE_ONLY__
-  static_assert(Space != access::address_space::private_space,
-                "Joint Matrix doesn't support store to private memory!");
-  using DecorT = typename sycl::detail::DecoratedType<T, Space>::type;
-  DecorT *Ptr = sycl::detail::getDecorated<DecorT>(res);
-  switch (MemL) {
-  default:
-    assert(false && "Invalid Memory Layout!");
-  case matrix_layout::row_major:
-    __spirv_JointMatrixStoreINTEL<DecorT, T, NumRows, NumCols,
-                                  spv_matrix_layout_traits<MatL>::value>(
-        Ptr, src.spvm, stride, __spv::MatrixLayout::RowMajor,
-        spv_scope_traits<Group>::value);
-    break;
-  case matrix_layout::col_major:
-    __spirv_JointMatrixStoreINTEL<DecorT, T, NumRows, NumCols,
-                                  spv_matrix_layout_traits<MatL>::value>(
-        Ptr, src.spvm, stride, __spv::MatrixLayout::ColumnMajor,
-        spv_scope_traits<Group>::value);
-    break;
-  case matrix_layout::packed_a:
-    __spirv_JointMatrixStoreINTEL<DecorT, T, NumRows, NumCols,
-                                  spv_matrix_layout_traits<MatL>::value>(
-        Ptr, src.spvm, stride, __spv::MatrixLayout::PackedA,
-        spv_scope_traits<Group>::value);
-    break;
-  case matrix_layout::packed_b:
-    __spirv_JointMatrixStoreINTEL<DecorT, T, NumRows, NumCols,
-                                  spv_matrix_layout_traits<MatL>::value>(
-        Ptr, src.spvm, stride, __spv::MatrixLayout::PackedB,
-        spv_scope_traits<Group>::value);
-    break;
-  }
-#else
-  (void)sg;
-  (void)src;
-  (void)res;
-  (void)stride;
-  (void)MemL;
-  throw runtime_error("joint matrix is not supported on host device.",
-                      PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-}
-
-template <typename Group, typename T1, typename T2, typename T3, size_t M,
-          size_t K, size_t N, matrix_layout LayoutA, matrix_layout LayoutB,
-          matrix_layout LayoutC>
-inline __SYCL_ALWAYS_INLINE joint_matrix<T3, M, N, LayoutC, Group>
-joint_matrix_mad(Group sg, joint_matrix<T1, M, K, LayoutA, Group> &mA,
-                 joint_matrix<T2, K, N, LayoutB, Group> &mB,
-                 joint_matrix<T3, M, N, LayoutC, Group> &mC) {
-#ifdef __SYCL_DEVICE_ONLY__
-  joint_matrix<T3, M, N, LayoutC, Group> res(sg);
-  if constexpr (std::is_same<T1, uint16_t>::value &&
-                std::is_same<T2, uint16_t>::value &&
-                std::is_same<T3, float>::value)
-    res.spvm = __spirv_JointMatrixMadINTEL(mA.spvm, mB.spvm, mC.spvm);
-  else if constexpr (std::is_unsigned<T1>::value && std::is_unsigned<T2>::value)
-    res.spvm = __spirv_JointMatrixUUMadINTEL(mA.spvm, mB.spvm, mC.spvm);
-  else if constexpr (std::is_signed<T1>::value && std::is_unsigned<T2>::value)
-    res.spvm = __spirv_JointMatrixSUMadINTEL(mA.spvm, mB.spvm, mC.spvm);
-  else if constexpr (std::is_unsigned<T1>::value && std::is_signed<T2>::value)
-    res.spvm = __spirv_JointMatrixUSMadINTEL(mA.spvm, mB.spvm, mC.spvm);
-  else
-    res.spvm = __spirv_JointMatrixMadINTEL(mA.spvm, mB.spvm, mC.spvm);
-  return res;
-#else
-  (void)sg;
-  (void)mA;
-  (void)mB;
-  (void)mC;
-  throw runtime_error("joint matrix is not supported on host device.",
-                      PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-}
-
-template <typename Group, typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout, typename T2>
-inline __SYCL_ALWAYS_INLINE void
-joint_matrix_fill(Group sg,
-                  joint_matrix<T, NumRows, NumCols, Layout, Group> &res,
-                  const T2 v) {
-  // We kept the unused "sg" in joint_matrix_fill to match the other DPC++
-  // functions
-  (void)sg;
-#ifdef __SYCL_DEVICE_ONLY__
-  res.spvm =
-      __spirv_CompositeConstruct<T, NumRows, NumCols,
-                                 spv_matrix_layout_traits<Layout>::value>(
-          static_cast<T>(v));
-
-#else
-  (void)res;
-  (void)v;
-#endif // __SYCL_DEVICE_ONLY__
-}
-
-template <typename T, size_t NumRows, size_t NumCols,
-          matrix_layout Layout = matrix_layout::row_major,
-          typename Group = sycl::sub_group>
-class wi_element {
-  joint_matrix<T, NumRows, NumCols, Layout, Group> &M;
-  std::size_t idx;
-
-public:
-  wi_element(joint_matrix<T, NumRows, NumCols, Layout, Group> &Mat,
-             std::size_t i)
-      : M(Mat), idx(i) {}
-  operator T() {
-#ifdef __SYCL_DEVICE_ONLY__
-    return __spirv_VectorExtractDynamic(M.spvm, idx);
-#else
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  explicit operator bool() {
-#ifdef __SYCL_DEVICE_ONLY__
-    return __spirv_VectorExtractDynamic(M.spvm, idx) != static_cast<T>(0);
-#else
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  template <typename T2> wi_element &operator=(const T2 &rhs) {
-#ifdef __SYCL_DEVICE_ONLY__
-    M.spvm = __spirv_VectorInsertDynamic(M.spvm, static_cast<T>(rhs), idx);
-    return *this;
-#else
-    (void)rhs;
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  wi_element &
-  operator=(const wi_element<T, NumRows, NumCols, Layout, Group> &rhs) {
-#ifdef __SYCL_DEVICE_ONLY__
-    M.spvm = __spirv_VectorInsertDynamic(
-        M.spvm, __spirv_VectorExtractDynamic(rhs.M.spvm, rhs.idx), idx);
-    return *this;
-#else
-    (void)rhs;
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-#if __SYCL_DEVICE_ONLY__
-#define OP(op)                                                                 \
-  template <typename T2> wi_element &operator op##=(const T2 &rhs) {           \
-    M.spvm = __spirv_VectorInsertDynamic(                                      \
-        M.spvm,                                                                \
-        static_cast<T>(__spirv_VectorExtractDynamic(M.spvm, idx)               \
-                           op static_cast<T>(rhs)),                            \
-        idx);                                                                  \
-    return *this;                                                              \
-  }
-#else // __SYCL_DEVICE_ONLY__
-#define OP(op)                                                                 \
-  template <typename T2> wi_element &operator op##=(const T2 &rhs) {           \
-    (void)rhs;                                                                 \
-    throw runtime_error("joint matrix is not supported on host device.",       \
-                        PI_ERROR_INVALID_DEVICE);                              \
-  }
-#endif // __SYCL_DEVICE_ONLY__
-  OP(+)
-  OP(-)
-  OP(*)
-  OP(/)
-#undef OP
-};
-
-// Note that similarly to the other matrix functions, uint16_t is used here to
-// represent bf16 type. Since the AMX and DPAS implementations don't support
-// uint16_t, this interpretation is possible. This design choice was made before
-// the introduction of SYCL experimental bfloat16 type. Our plan is to move
-// towards using the SYCL bfloat16. But since it is still experimental, we will
-// probably keep both uint16 interpretation and SYCL bfloat16.
-template <size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
-class wi_element<uint16_t, NumRows, NumCols, Layout, Group> {
-  joint_matrix<uint16_t, NumRows, NumCols, Layout, Group> &M;
-  std::size_t idx;
-
-public:
-  wi_element(joint_matrix<uint16_t, NumRows, NumCols, Layout, Group> &Mat,
-             std::size_t i)
-      : M(Mat), idx(i) {}
-  operator uint16_t() {
-#ifdef __SYCL_DEVICE_ONLY__
-    return __spirv_VectorExtractDynamic(M.spvm, idx);
-#else
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  explicit operator bool() {
-#ifdef __SYCL_DEVICE_ONLY__
-    return sycl::fabs(make_fp32(__spirv_VectorExtractDynamic(M.spvm, idx))) >=
-           std::numeric_limits<float>::epsilon();
-#else
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  wi_element &operator=(const uint16_t &rhs) {
-#ifdef __SYCL_DEVICE_ONLY__
-    M.spvm = __spirv_VectorInsertDynamic(M.spvm, rhs, idx);
-    return *this;
-#else
-    (void)rhs;
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  wi_element &
-  operator=(const wi_element<uint16_t, NumRows, NumCols, Layout, Group> &rhs) {
-#ifdef __SYCL_DEVICE_ONLY__
-    M.spvm = __spirv_VectorInsertDynamic(
-        M.spvm, __spirv_VectorExtractDynamic(rhs.M.spvm, rhs.idx), idx);
-    return *this;
-#else
-    (void)rhs;
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  // We use here the following functions for conversion (bf16=>fp32 and
-  // fp32=>bf16). This is a workaround until we are able to use
-  // __spirv_ConvertFToBF16INTEL and __spirv_ConvertBF16ToFINTEL once these are
-  // supported in the CPU backend
-  static float make_fp32(uint16_t x) {
-    unsigned int y = x;
-    y = y << 16;
-    float *res = reinterpret_cast<float *>(&y);
-    return *res;
-  }
-
-  static uint16_t make_bf16(float x) {
-    int *res = reinterpret_cast<int *>(&x);
-    *res = *res >> 16;
-    return (uint16_t)*res;
-  }
-
-#if __SYCL_DEVICE_ONLY__
-#define OP(op)                                                                 \
-  wi_element &operator op##=(const uint16_t &rhs) {                            \
-    M.spvm = __spirv_VectorInsertDynamic(                                      \
-        M.spvm,                                                                \
-        make_bf16(make_fp32(__spirv_VectorExtractDynamic(M.spvm, idx)          \
-                                op make_fp32(rhs))),                           \
-        idx);                                                                  \
-    return *this;                                                              \
-  }
-#else // __SYCL_DEVICE_ONLY__
-#define OP(op)                                                                 \
-  wi_element &operator op##=(const uint16_t &rhs) {                            \
-    (void)rhs;                                                                 \
-    throw runtime_error("joint matrix is not supported on host device.",       \
-                        PI_ERROR_INVALID_DEVICE);                              \
-  }
-#endif // __SYCL_DEVICE_ONLY__
-  OP(+)
-  OP(-)
-  OP(*)
-  OP(/)
-#undef OP
-
-  template <typename T1, typename T2> struct Converter {
-    static T2 convert(const T1 &from) { return static_cast<T2>(from); }
-  };
-
-  template <typename T> struct Converter<T, uint16_t> {
-    static uint16_t convert(const T &from) { return make_bf16(from); }
-  };
-#if __SYCL_DEVICE_ONLY__
-#define OP(input_type, type, op)                                               \
-  friend type operator op(                                                     \
-      const wi_element<uint16_t, NumRows, NumCols, Layout, Group> &lhs,        \
-      const uint16_t &rhs) {                                                   \
-    return Converter<input_type, type>::convert(make_fp32(                     \
-        __spirv_VectorExtractDynamic(lhs.M.spvm, lhs.idx)) op make_fp32(rhs)); \
-  }                                                                            \
-  friend type operator op(                                                     \
-      const uint16_t &lhs,                                                     \
-      const wi_element<uint16_t, NumRows, NumCols, Layout, Group> &rhs) {      \
-    return Converter<input_type, type>::convert(make_fp32(                     \
-        __spirv_VectorExtractDynamic(rhs.M.spvm, rhs.idx)) op make_fp32(lhs)); \
-  }
-#else // __SYCL_DEVICE_ONLY__
-#define OP(input_type, type, op)                                               \
-  friend type operator op(                                                     \
-      const wi_element<uint16_t, NumRows, NumCols, Layout, Group> &lhs,        \
-      const uint16_t &rhs) {                                                   \
-    (void)lhs;                                                                 \
-    (void)rhs;                                                                 \
-    throw runtime_error("joint matrix is not supported on host device.",       \
-                        PI_ERROR_INVALID_DEVICE);                              \
-  }                                                                            \
-  friend type operator op(                                                     \
-      const uint16_t &lhs,                                                     \
-      const wi_element<uint16_t, NumRows, NumCols, Layout, Group> &rhs) {      \
-    (void)lhs;                                                                 \
-    (void)rhs;                                                                 \
-    throw runtime_error("joint matrix is not supported on host device.",       \
-                        PI_ERROR_INVALID_DEVICE);                              \
-  }
-#endif // __SYCL_DEVICE_ONLY__
-  OP(float, uint16_t, +)
-  OP(float, uint16_t, -)
-  OP(float, uint16_t, *)
-  OP(float, uint16_t, /)
-  OP(bool, bool, ==)
-  OP(bool, bool, !=)
-  OP(bool, bool, <)
-  OP(bool, bool, >)
-  OP(bool, bool, <=)
-  OP(bool, bool, >=)
-#undef OP
-};
-
-template <size_t NumRows, size_t NumCols, matrix_layout Layout, typename Group>
-class wi_element<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout, Group> {
-  joint_matrix<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout, Group> &M;
-  std::size_t idx;
-
-public:
-  wi_element(joint_matrix<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout,
-                          Group> &Mat,
-             std::size_t i)
-      : M(Mat), idx(i) {}
-  operator sycl::ext::oneapi::bfloat16() {
-#ifdef __SYCL_DEVICE_ONLY__
-    return __spirv_VectorExtractDynamic(M.spvm, idx);
-#else
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  explicit operator bool() {
-#ifdef __SYCL_DEVICE_ONLY__
-    return sycl::fabs(static_cast<float>(__spirv_VectorExtractDynamic(
-               M.spvm, idx))) >= std::numeric_limits<float>::epsilon();
-#else
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  wi_element &operator=(const sycl::ext::oneapi::bfloat16 &rhs) {
-#ifdef __SYCL_DEVICE_ONLY__
-    M.spvm = __spirv_VectorInsertDynamic(M.spvm, rhs, idx);
-    return *this;
-#else
-    (void)rhs;
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-  wi_element &operator=(const wi_element<sycl::ext::oneapi::bfloat16, NumRows,
-                                         NumCols, Layout, Group> &rhs) {
-#ifdef __SYCL_DEVICE_ONLY__
-    M.spvm = __spirv_VectorInsertDynamic(
-        M.spvm, __spirv_VectorExtractDynamic(rhs.M.spvm, rhs.idx), idx);
-    return *this;
-#else
-    (void)rhs;
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-
-#if __SYCL_DEVICE_ONLY__
-#define OP(opassign, op)                                                       \
-  wi_element &operator opassign(const sycl::ext::oneapi::bfloat16 &rhs) {      \
-    M.spvm = __spirv_VectorInsertDynamic(                                      \
-        M.spvm, __spirv_VectorExtractDynamic(M.spvm, idx) op rhs, idx);        \
-    return *this;                                                              \
-  }
-#else // __SYCL_DEVICE_ONLY__
-#define OP(opassign, op)                                                       \
-  wi_element &operator opassign(const sycl::ext::oneapi::bfloat16 &rhs) {      \
-    (void)rhs;                                                                 \
-    throw runtime_error("joint matrix is not supported on host device.",       \
-                        PI_ERROR_INVALID_DEVICE);                              \
-  }
-#endif // __SYCL_DEVICE_ONLY__
-  OP(+=, +)
-  OP(-=, -)
-  OP(*=, *)
-  OP(/=, /)
-#undef OP
-
-#if __SYCL_DEVICE_ONLY__
-#define OP(type, op)                                                           \
-  friend type operator op(                                                     \
-      const wi_element<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout,  \
-                       Group> &lhs,                                            \
-      const sycl::ext::oneapi::bfloat16 &rhs) {                                \
-    return __spirv_VectorExtractDynamic(lhs.M.spvm, lhs.idx) op rhs;           \
-  }                                                                            \
-  friend type operator op(                                                     \
-      const sycl::ext::oneapi::bfloat16 &lhs,                                  \
-      const wi_element<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout,  \
-                       Group> &rhs) {                                          \
-    return __spirv_VectorExtractDynamic(rhs.M.spvm, rhs.idx) op lhs;           \
-  }
-  OP(sycl::ext::oneapi::bfloat16, +)
-  OP(sycl::ext::oneapi::bfloat16, -)
-  OP(sycl::ext::oneapi::bfloat16, *)
-  OP(sycl::ext::oneapi::bfloat16, /)
-#undef OP
-#define OP(type, op)                                                           \
-  friend type operator op(                                                     \
-      const wi_element<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout,  \
-                       Group> &lhs,                                            \
-      const sycl::ext::oneapi::bfloat16 &rhs) {                                \
-    return type{static_cast<float>(__spirv_VectorExtractDynamic(               \
-        lhs.M.spvm, lhs.idx)) op static_cast<float>(rhs)};                     \
-  }                                                                            \
-  friend type operator op(                                                     \
-      const sycl::ext::oneapi::bfloat16 &lhs,                                  \
-      const wi_element<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout,  \
-                       Group> &rhs) {                                          \
-    return type{static_cast<float>(__spirv_VectorExtractDynamic(               \
-        rhs.M.spvm, rhs.idx)) op static_cast<float>(lhs)};                     \
-  }
-  OP(bool, ==)
-  OP(bool, !=)
-  OP(bool, <)
-  OP(bool, >)
-  OP(bool, <=)
-  OP(bool, >=)
-#undef OP
-#else // __SYCL_DEVICE_ONLY__
-#define OP(type, op)                                                           \
-  friend type operator op(const wi_element<sycl::ext::oneapi::bfloat16,        \
-                                           NumRows, NumCols, Layout, Group> &, \
-                          const sycl::ext::oneapi::bfloat16 &) {               \
-    throw runtime_error("joint matrix is not supported on host device.",       \
-                        PI_ERROR_INVALID_DEVICE);                              \
-  }                                                                            \
-  friend type operator op(                                                     \
-      const sycl::ext::oneapi::bfloat16 &,                                     \
-      const wi_element<sycl::ext::oneapi::bfloat16, NumRows, NumCols, Layout,  \
-                       Group> &) {                                             \
-    throw runtime_error("joint matrix is not supported on host device.",       \
-                        PI_ERROR_INVALID_DEVICE);                              \
-  }
-  OP(sycl::ext::oneapi::bfloat16, +)
-  OP(sycl::ext::oneapi::bfloat16, -)
-  OP(sycl::ext::oneapi::bfloat16, *)
-  OP(sycl::ext::oneapi::bfloat16, /)
-  OP(bool, ==)
-  OP(bool, !=)
-  OP(bool, <)
-  OP(bool, >)
-  OP(bool, <=)
-  OP(bool, >=)
-#undef OP
-#endif // __SYCL_DEVICE_ONLY__
-};
-
-template <typename T, size_t NumRows, size_t NumCols, matrix_layout Layout,
-          typename Group>
-class wi_data {
-  joint_matrix<T, NumRows, NumCols, Layout, Group> &M;
-
-public:
-  wi_data(joint_matrix<T, NumRows, NumCols, Layout, Group> &Mat) : M(Mat) {}
-  size_t length() {
-#ifdef __SYCL_DEVICE_ONLY__
-    return __spirv_JointMatrixWorkItemLengthINTEL(M.spvm);
-#else
-    throw runtime_error("joint matrix is not supported on host device.",
-                        PI_ERROR_INVALID_DEVICE);
-#endif // __SYCL_DEVICE_ONLY__
-  }
-  wi_element<T, NumRows, NumCols, Layout, Group> operator[](size_t i) {
-    return wi_element<T, NumRows, NumCols, Layout, Group>(M, i);
-  }
-};
-
-#undef SPV_MATRIX_LAYOUT_TRAITS
-
-} // namespace ext::oneapi::experimental::matrix
-} // namespace _V1
-} // namespace sycl
diff --git a/sycl/include/sycl/ext/oneapi/matrix/matrix.hpp b/sycl/include/sycl/ext/oneapi/matrix/matrix.hpp
index 77037885fc28b..7a9050980d4e9 100644
--- a/sycl/include/sycl/ext/oneapi/matrix/matrix.hpp
+++ b/sycl/include/sycl/ext/oneapi/matrix/matrix.hpp
@@ -16,11 +16,5 @@
 
 #include <sycl/detail/defines.hpp>
 
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 1)
-#include <sycl/ext/oneapi/matrix/matrix-jit.hpp>
-#include <sycl/ext/oneapi/matrix/static-query.hpp>
-#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 4)
 #include <sycl/ext/oneapi/matrix/matrix-unified.hpp>
 #include <sycl/ext/oneapi/matrix/static-query-use.hpp>
-#endif // SYCL_EXT_ONEAPI_MATRIX_VERSION
diff --git a/sycl/include/sycl/ext/oneapi/matrix/static-query.hpp b/sycl/include/sycl/ext/oneapi/matrix/static-query.hpp
deleted file mode 100644
index 965ee55bb591a..0000000000000
--- a/sycl/include/sycl/ext/oneapi/matrix/static-query.hpp
+++ /dev/null
@@ -1,417 +0,0 @@
-//===-------------- static-query.hpp - SYCL matrix ------------*- C++ -*---===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-// ===--------------------------------------------------------------------=== //
-// This file implements the static query interface for the joint_matrix
-// experimental extension. Intel(R) Advanced Matrix Extensions (Intel(R) AMX),
-// DPAS and different other TPUs support different logical sizes and types. The
-// query interface is used to validate user code and inform them about supported
-// types, sizes, scope, and layouts by the current implementation. Note that
-// this query interface is a compile-time query, so there will be no runtime
-// errors. The query interface provides three functionalities: 1- At compile
-// time, inform the user whether a specific combination is valid or not. 2-
-// Construct the matrices using a default shape if user does not provide a
-// combination 3- General query interface for sizes, types, static/dynamic,
-// scope. This is needed to void padding by the user, for tuning, and efficient
-// code generation if used by a library.
-
-#pragma once
-
-namespace sycl {
-inline namespace _V1 {
-namespace ext {
-namespace oneapi {
-namespace experimental {
-namespace matrix {
-
-enum class tpu {
-  dpas,
-  amx,
-};
-enum class matrix_type {
-  bf8,
-  bf16,
-  fp16,
-  fp19, // tfloat32
-  fp32,
-  fp64,
-  sint2,
-  sint4,
-  sint8,
-  sint16,
-  sint32,
-  sint64,
-  uint2,
-  uint4,
-  uint8,
-  uint16,
-  uint32,
-  uint64
-};
-
-enum class scope_t { sub_group, work_group };
-
-template <tpu u, typename Ta = void, typename Tb = void, typename Tc = void,
-          int M = 0, int N = 0, int K = 0, typename Enabled = void>
-struct tpu_params;
-
-template <typename Ta, typename Tb, typename Tc>
-constexpr bool is_combination_valid_amx(int M, int N, int K) {
-  // is_same_v is a C++17 feature
-  if ((std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int> && M <= 16 && N <= 16 && K <= 64) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int> && M <= 16 && N <= 16 && K <= 64) ||
-      (std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int> && M <= 16 && N <= 16 && K <= 64) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int> && M <= 16 && N <= 16 && K <= 64) ||
-      // bf16
-      (std::is_same_v<Ta, unsigned short> &&
-       std::is_same_v<Tb, unsigned short> && std::is_same_v<Tc, float> &&
-       M <= 16 && N <= 16 && K <= 32))
-    return true;
-  else
-    return false;
-}
-
-template <typename Ta, typename Tb, typename Tc>
-constexpr bool are_types_valid_amx() {
-  if ((std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, unsigned short> &&
-       std::is_same_v<Tb, unsigned short> && std::is_same_v<Tc, float>))
-    return true;
-  else
-    return false;
-}
-
-// General query:
-// types are not given, no default sizes and no implicit matrix construction
-template <int M, int N, int K>
-struct tpu_params<tpu::amx, void, void, void, M, N, K> {
-  static constexpr std::size_t defaultM = -1; // depends on the type
-  static constexpr std::size_t defaultN = -1;
-  static constexpr std::size_t defaultK = -1;
-
-  bool dynamic_p = false; // should be true in future implementations because
-                          // AMX hardware supports dynamic sizes
-  uint32_t numtiles = 8;
-  scope_t scope = scope_t::sub_group;
-  struct combination {
-    uint32_t max_msize;
-    uint32_t max_nsize;
-    uint32_t max_ksize;
-    matrix_type atype;
-    matrix_type btype;
-    matrix_type ctype;
-    uint32_t msize;
-    uint32_t nsize;
-    uint32_t ksize;
-  };
-  using mt = matrix_type;
-  static constexpr combination combinations[] = {
-      {16, 16, 64, mt::sint8, mt::sint8, mt::sint32},
-      {16, 16, 64, mt::sint8, mt::uint8, mt::sint32},
-      {16, 16, 64, mt::uint8, mt::sint8, mt::sint32},
-      {16, 16, 64, mt::uint8, mt::uint8, mt::sint32},
-      {16, 16, 32, mt::bf16, mt::bf16, mt::fp32}};
-  static constexpr int num_combinations =
-      sizeof(combinations) / sizeof(combination);
-};
-
-// Sizes-only query
-// Specialization for when only types are given, need to query only sizes
-template <typename Ta, typename Tb, typename Tc>
-struct tpu_params<tpu::amx, Ta, Tb, Tc, 0, 0, 0,
-                  typename std::enable_if<(!std::is_same_v<Ta, void> &&
-                                           !std::is_same_v<Tb, void> &&
-                                           !std::is_same_v<Tc, void>)>::type> {
-  static_assert((are_types_valid_amx<Ta, Tb, Tc>()),
-                "Invalid types for AMX, supported types are int8_t, uint8_t, "
-                "and bf16 (Note that unsigned short should be used in the"
-                "DPC++ code to implement bf16) ");
-
-  // construct the matrices using the default sizes
-  static constexpr std::size_t defaultM = 16;
-  static constexpr std::size_t defaultN = 16;
-  static constexpr std::size_t defaultK = ((sizeof(Ta) == 1) ? 64 : 32);
-
-  template <typename Group>
-  using joint_matrix_a =
-      joint_matrix<Ta, defaultM, defaultK, matrix_layout::row_major, Group>;
-  template <typename Group>
-  using joint_matrix_b =
-      joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b, Group>;
-  template <typename Group>
-  using joint_matrix_c =
-      joint_matrix<Tc, defaultM, defaultN, matrix_layout::row_major, Group>;
-
-  bool dynamic_p = false; // should be true in future implementations because
-                          // AMX hardware supports dynamic sizes
-  uint32_t numtiles = 8;
-  scope_t scope = scope_t::sub_group;
-  struct combination {
-    uint32_t max_msize;
-    uint32_t max_nsize;
-    uint32_t max_ksize;
-    matrix_type atype;
-    matrix_type btype;
-    matrix_type ctype;
-    uint32_t msize;
-    uint32_t nsize;
-    uint32_t ksize;
-  };
-  static constexpr combination combinations[] = {
-      {16, 16, (sizeof(Ta) == 1) ? 64 : 32}};
-  static constexpr int num_combinations =
-      sizeof(combinations) / sizeof(combination);
-};
-
-// Valid or not:
-// Specialization when both types and sizes are given
-template <typename Ta, typename Tb, typename Tc, int M, int N, int K>
-struct tpu_params<
-    tpu::amx, Ta, Tb, Tc, M, N, K,
-    typename std::enable_if<(
-        !std::is_same_v<Ta, void> && !std::is_same_v<Tb, void> &&
-        !std::is_same_v<Tc, void> && M != 0 && N != 0 && K != 0)>::type> {
-  // Validate that parameters are supported
-  static_assert(
-      (M == 0 && N == 0 && K == 0) ||
-          (is_combination_valid_amx<Ta, Tb, Tc>(M, N, K)),
-      "Invalid parameters for AMX, query valid types and maximum sizes "
-      "using: tpu_params<tpu::amx> myparams; and then check out "
-      "myparams.combinations array");
-
-  // if combination is valid, construct the matrices
-
-  static constexpr std::size_t defaultM = (M != 0) ? M : 16;
-  static constexpr std::size_t defaultN = (N != 0) ? N : 16;
-  static constexpr std::size_t defaultK =
-      (K != 0) ? K : ((sizeof(Ta) == 1) ? 64 : 32);
-
-  template <typename Group>
-  using joint_matrix_a =
-      joint_matrix<Ta, defaultM, defaultK, matrix_layout::row_major, Group>;
-  template <typename Group>
-  using joint_matrix_b =
-      joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b, Group>;
-  template <typename Group>
-  using joint_matrix_c =
-      joint_matrix<Tc, defaultM, defaultN, matrix_layout::row_major, Group>;
-
-  bool dynamic_p = false; // should be true in future implementations
-                          // because AMX hardware supports dynamic sizes
-  uint32_t numtiles = 8;
-  scope_t scope = scope_t::sub_group;
-};
-
-// DPAS case
-// The DPAS implementation supports the logical capability support of the HW
-// So in this case, M, N, K sizes returned by the query represent the logical
-// capabilities of the DPAS hardware.
-
-template <typename Ta, typename Tb, typename Tc>
-constexpr bool is_combination_valid_dpas(int M, int N, int K) {
-  if ((std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int> && (M == 1 || M == 2 || M == 4 || M == 8) &&
-       N == 8 && K == 32) ||
-      (std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int> && (M == 1 || M == 2 || M == 4 || M == 8) &&
-       N == 8 && K == 32) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int> && (M == 1 || M == 2 || M == 4 || M == 8) &&
-       N == 8 && K == 32) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int> && (M == 1 || M == 2 || M == 4 || M == 8) &&
-       N == 8 && K == 32) ||
-      (std::is_same_v<Ta, half> && std::is_same_v<Tb, half> &&
-       std::is_same_v<Tc, float> && (M == 1 || M == 2 || M == 4 || M == 8) &&
-       N == 8 && K == 16) ||
-      (std::is_same_v<Ta, unsigned short> &&
-       std::is_same_v<Tb, unsigned short> && std::is_same_v<Tc, float> &&
-       (M == 1 || M == 2 || M == 4 || M == 8) && N == 8 && K == 16))
-    return true;
-  else
-    return false;
-}
-
-template <typename Ta, typename Tb, typename Tc>
-constexpr bool are_types_valid_dpas() {
-  if ((std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, int8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, int8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, uint8_t> && std::is_same_v<Tb, uint8_t> &&
-       std::is_same_v<Tc, int>) ||
-      (std::is_same_v<Ta, half> && std::is_same_v<Tb, half> &&
-       std::is_same_v<Tc, float>) ||
-      (std::is_same_v<Ta, unsigned short> &&
-       std::is_same_v<Tb, unsigned short> && std::is_same_v<Tc, float>))
-    return true;
-  else
-    return false;
-}
-
-// General Query
-// specialization for when types are not given --> no default values
-template <int M, int N, int K>
-struct tpu_params<tpu::dpas, void, void, void, M, N, K> {
-  static constexpr std::size_t defaultM = -1; // depends on the type
-  static constexpr std::size_t defaultN = -1;
-  static constexpr std::size_t defaultK = -1;
-
-  bool dynamic_p = false; // no dynamic allocation on the GPU
-  uint32_t numtiles = -1; // does not apply for DPAS
-  scope_t scope = scope_t::sub_group;
-
-  struct combination {
-    uint32_t max_msize;
-    uint32_t max_nsize;
-    uint32_t max_ksize;
-    matrix_type atype;
-    matrix_type btype;
-    matrix_type ctype;
-    uint32_t msize;
-    uint32_t nsize;
-    uint32_t ksize;
-  };
-  using mt = matrix_type;
-  static constexpr combination combinations[] = {
-      {0, 0, 0, mt::sint8, mt::sint8, mt::sint32, 1, 8, 32},
-      {0, 0, 0, mt::sint8, mt::sint8, mt::sint32, 2, 8, 32},
-      {0, 0, 0, mt::sint8, mt::sint8, mt::sint32, 4, 8, 32},
-      {0, 0, 0, mt::sint8, mt::sint8, mt::sint32, 8, 8, 32},
-      {0, 0, 0, mt::sint8, mt::uint8, mt::sint32, 1, 8, 32},
-      {0, 0, 0, mt::sint8, mt::uint8, mt::sint32, 2, 8, 32},
-      {0, 0, 0, mt::sint8, mt::uint8, mt::sint32, 4, 8, 32},
-      {0, 0, 0, mt::sint8, mt::uint8, mt::sint32, 8, 8, 32},
-      {0, 0, 0, mt::uint8, mt::sint8, mt::sint32, 1, 8, 32},
-      {0, 0, 0, mt::uint8, mt::sint8, mt::sint32, 2, 8, 32},
-      {0, 0, 0, mt::uint8, mt::sint8, mt::sint32, 4, 8, 32},
-      {0, 0, 0, mt::uint8, mt::sint8, mt::sint32, 8, 8, 32},
-      {0, 0, 0, mt::uint8, mt::uint8, mt::sint32, 1, 8, 32},
-      {0, 0, 0, mt::uint8, mt::uint8, mt::sint32, 2, 8, 32},
-      {0, 0, 0, mt::uint8, mt::uint8, mt::sint32, 4, 8, 32},
-      {0, 0, 0, mt::uint8, mt::uint8, mt::sint32, 8, 8, 32},
-      {0, 0, 0, mt::fp16, mt::fp16, mt::fp32, 1, 8, 16},
-      {0, 0, 0, mt::fp16, mt::fp16, mt::fp32, 2, 8, 16},
-      {0, 0, 0, mt::fp16, mt::fp16, mt::fp32, 4, 8, 16},
-      {0, 0, 0, mt::fp16, mt::fp16, mt::fp32, 8, 8, 16},
-      {0, 0, 0, mt::bf16, mt::bf16, mt::fp32, 1, 8, 16},
-      {0, 0, 0, mt::bf16, mt::bf16, mt::fp32, 2, 8, 16},
-      {0, 0, 0, mt::bf16, mt::bf16, mt::fp32, 4, 8, 16},
-      {0, 0, 0, mt::bf16, mt::bf16, mt::fp32, 8, 8, 16},
-  };
-  static constexpr int num_combinations =
-      sizeof(combinations) / sizeof(combination);
-};
-
-// Sizes-only query:
-// Specialization for when only types are given, need to query only sizes
-
-template <typename Ta, typename Tb, typename Tc>
-struct tpu_params<tpu::dpas, Ta, Tb, Tc, 0, 0, 0,
-                  typename std::enable_if<(!std::is_same_v<Ta, void> &&
-                                           !std::is_same_v<Tb, void> &&
-                                           !std::is_same_v<Tc, void>)>::type> {
-  static_assert((are_types_valid_dpas<Ta, Tb, Tc>()),
-                "Invalid types for DPAS, supported types are int8_t, uint8_t, "
-                "half, and bf16 (Note that unsigned short should be used in the"
-                "DPC++ code to implement bf16)");
-
-  // construct the matrices using the default sizes
-
-  static constexpr std::size_t defaultM = 8;
-  static constexpr std::size_t defaultN = 8;
-  static constexpr std::size_t defaultK = ((sizeof(Ta) == 1) ? 32 : 16);
-
-  template <typename Group>
-  using joint_matrix_a =
-      joint_matrix<Ta, defaultM, defaultK, matrix_layout::row_major, Group>;
-  template <typename Group>
-  using joint_matrix_b =
-      joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b, Group>;
-  template <typename Group>
-  using joint_matrix_c =
-      joint_matrix<Tc, defaultM, defaultN, matrix_layout::row_major, Group>;
-
-  bool dynamic_p = false; // no dynamic allocation on the GPU
-  uint32_t numtiles = -1; // does not apply for DPAS
-  scope_t scope = scope_t::sub_group;
-  struct combination {
-    uint32_t max_msize;
-    uint32_t max_nsize;
-    uint32_t max_ksize;
-    matrix_type atype;
-    matrix_type btype;
-    matrix_type ctype;
-    uint32_t msize;
-    uint32_t nsize;
-    uint32_t ksize;
-  };
-  using mt = matrix_type;
-  static constexpr combination combinations[] = {
-      // The types used in the initialization below are fake and not used. In
-      // this case, users already chose the types, they are only looking for the
-      // sizes
-      {0, 0, 0, mt::bf8, mt::bf8, mt::bf8, 1, 8, (sizeof(Ta) == 1) ? 32 : 16},
-      {0, 0, 0, mt::bf8, mt::bf8, mt::bf8, 2, 8, (sizeof(Ta) == 1) ? 32 : 16},
-      {0, 0, 0, mt::bf8, mt::bf8, mt::bf8, 4, 8, (sizeof(Ta) == 1) ? 32 : 16},
-      {0, 0, 0, mt::bf8, mt::bf8, mt::bf8, 8, 8, (sizeof(Ta) == 1) ? 32 : 16},
-  };
-  static constexpr int num_combinations =
-      sizeof(combinations) / sizeof(combination);
-};
-
-// Valid or not:
-// Specialization when both types and sizes are given
-template <typename Ta, typename Tb, typename Tc, int M, int N, int K>
-struct tpu_params<
-    tpu::dpas, Ta, Tb, Tc, M, N, K,
-    typename std::enable_if<((!std::is_same_v<Ta, void> && M != 0))>::type> {
-  // Validate that parameters are supported
-  static_assert((M == 0 && N == 0 && K == 0) ||
-                    (is_combination_valid_dpas<Ta, Tb, Tc>(M, N, K)),
-                "Invalid parameters for DPAS, query valid combinations "
-                "using: tpu_params<tpu::dpas> myparams; and then check out "
-                "myparams.combinations array");
-
-  // if combination is valid, construct the matrices
-  static constexpr std::size_t defaultM = (M != 0) ? M : 8;
-  static constexpr std::size_t defaultN = (N != 0) ? N : 8;
-  static constexpr std::size_t defaultK =
-      (K != 0) ? K : ((sizeof(Ta) == 1) ? 32 : 16);
-
-  template <typename Group>
-  using joint_matrix_a =
-      joint_matrix<Ta, defaultM, defaultK, matrix_layout::row_major, Group>;
-  template <typename Group>
-  using joint_matrix_b =
-      joint_matrix<Tb, defaultK, defaultN, matrix_layout::packed_b, Group>;
-  template <typename Group>
-  using joint_matrix_c =
-      joint_matrix<Tc, defaultM, defaultN, matrix_layout::row_major, Group>;
-
-  bool dynamic_p = false; // no dynamic allocation on the GPU
-  uint32_t numtiles = -1; // does not apply for DPAS
-  scope_t scope = scope_t::sub_group;
-};
-} // namespace matrix
-} // namespace experimental
-} // namespace oneapi
-} // namespace ext
-} // namespace _V1
-} // namespace sycl
diff --git a/sycl/include/sycl/info/ext_oneapi_device_traits.def b/sycl/include/sycl/info/ext_oneapi_device_traits.def
index 6203904a20c9c..7a4668dbbdb6a 100644
--- a/sycl/include/sycl/info/ext_oneapi_device_traits.def
+++ b/sycl/include/sycl/info/ext_oneapi_device_traits.def
@@ -9,11 +9,10 @@ __SYCL_PARAM_TRAITS_TEMPLATE_SPEC(ext::oneapi::experimental,device, max_work_gro
 __SYCL_PARAM_TRAITS_SPEC(ext::oneapi::experimental, device, architecture,
                          ext::oneapi::experimental::architecture,
                          PI_EXT_ONEAPI_DEVICE_INFO_IP_VERSION)
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 4)
 __SYCL_PARAM_TRAITS_SPEC(ext::oneapi::experimental, device, matrix_combinations,
                          std::vector<ext::oneapi::experimental::matrix::combination>,
                          PI_EXT_ONEAPI_DEVICE_INFO_MATRIX_COMBINATIONS)
-#endif
+
 __SYCL_PARAM_TRAITS_SPEC(
     ext::oneapi::experimental, device, graph_support,
     ext::oneapi::experimental::graph_support_level,
diff --git a/sycl/include/sycl/info/info_desc.hpp b/sycl/include/sycl/info/info_desc.hpp
index 125efac92fb04..82cf18b5a30a6 100644
--- a/sycl/include/sycl/info/info_desc.hpp
+++ b/sycl/include/sycl/info/info_desc.hpp
@@ -17,9 +17,8 @@
 #include <sycl/aspects.hpp>
 #include <sycl/detail/type_traits.hpp>
 #include <sycl/ext/oneapi/experimental/device_architecture.hpp>
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 4)
 #include <sycl/ext/oneapi/matrix/query-types.hpp>
-#endif
+
 #include <sycl/range.hpp>
 
 namespace sycl {
diff --git a/sycl/source/detail/device_info.hpp b/sycl/source/detail/device_info.hpp
index a9a6d6b6730af..a7f9b2b8a29df 100644
--- a/sycl/source/detail/device_info.hpp
+++ b/sycl/source/detail/device_info.hpp
@@ -18,9 +18,7 @@
 #include <sycl/detail/pi.hpp>
 #include <sycl/device.hpp>
 #include <sycl/ext/oneapi/experimental/device_architecture.hpp>
-#if (SYCL_EXT_ONEAPI_MATRIX_VERSION == 4)
 #include <sycl/ext/oneapi/matrix/query-types.hpp>
-#endif
 #include <sycl/feature_test.hpp>
 #include <sycl/info/info_desc.hpp>
 #include <sycl/memory_enums.hpp>
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_bf16.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_bf16.cpp
deleted file mode 100644
index 7dd067c1d26e5..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_bf16.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==----------- element_wise_all_ops_bf16.cpp  - DPC++ joint_matrix---------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../element_wise_all_ops_bf16_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_half.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_half.cpp
deleted file mode 100644
index 8e9a326d93372..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_half.cpp
+++ /dev/null
@@ -1,25 +0,0 @@
-//==----------- element_wise_all_ops_half.cpp  - DPC++ joint_matrix---------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: aspect-fp16
-// REQUIRES: matrix-xmx8
-// REQUIRES: matrix-fp16
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../element_wise_all_ops_half_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_int8.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_int8.cpp
deleted file mode 100644
index 0e5b5ed814b50..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_int8.cpp
+++ /dev/null
@@ -1,24 +0,0 @@
-//==----------- element_wise_all_ops_int8.cpp  - DPC++ joint_matrix---------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-// REQUIRES: TEMPORARY_DISBLED
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../element_wise_all_ops_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_int8_packed.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_int8_packed.cpp
deleted file mode 100644
index a8ba3f6466de1..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_all_ops_int8_packed.cpp
+++ /dev/null
@@ -1,25 +0,0 @@
-//==------ element_wise_all_ops_int8_packed.cpp  - DPC++ joint_matrix-------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// XFAIL: gpu
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../element_wise_all_ops_int8_packed_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_irreg_sum_rows.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_irreg_sum_rows.cpp
deleted file mode 100644
index 3904514db79b2..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_irreg_sum_rows.cpp
+++ /dev/null
@@ -1,25 +0,0 @@
-//==-------- element_wise_irreg_sum_rows.cpp  - DPC++ joint_matrix----- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// this code calculates the sum of rows into a global array of number of rows
-// elements. First, partial reduction is computed inside each SG, then atomic
-// add is used to reduce between SG leaders
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../element_wise_irreg_sum_rows_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_ops.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_ops.cpp
deleted file mode 100644
index 80877d84c014e..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/element_wise_ops.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==----------- element_wise_ops.cpp  - DPC++ joint_matrix------------- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../element_wise_ops_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bf16.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bf16.cpp
deleted file mode 100644
index c4c6cef499f3f..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bf16.cpp
+++ /dev/null
@@ -1,20 +0,0 @@
-//==-------- joint_matrix_bf16.cpp  - DPC++ joint_matrix--------------- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include "../../common.hpp"
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_bf16_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bfloat16.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bfloat16.cpp
deleted file mode 100644
index 7533a8fa2c36a..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bfloat16.cpp
+++ /dev/null
@@ -1,20 +0,0 @@
-//==-------- joint_matrix_bfloat16.cpp  - DPC++ joint_matrix----------- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include "../../common.hpp"
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_bfloat16_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bfloat16_32x64.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bfloat16_32x64.cpp
deleted file mode 100644
index 985593ace1211..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_bfloat16_32x64.cpp
+++ /dev/null
@@ -1,22 +0,0 @@
-//==----- joint_matrix_bfloat16_32x64.cpp  - DPC++ joint_matrix-------------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// XFAIL: *
-
-#include "../../common.hpp"
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_bfloat16_32x64_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_half.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_half.cpp
deleted file mode 100644
index d2ab286ef4fb1..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_half.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==-------- joint_matrix_half.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: aspect-fp16
-// REQUIRES: matrix-xmx8
-// REQUIRES: matrix-fp16
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_half_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_int8_vnni.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_int8_vnni.cpp
deleted file mode 100644
index 3da4487cbd5e8..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_int8_vnni.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==-------- joint_matrix_bf16_vnni.cpp  - DPC++ joint_matrix---------------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// XFAIL: *
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_int8_vnni_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_ss_int8.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_ss_int8.cpp
deleted file mode 100644
index b2a35e09ffbb1..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_ss_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_ss_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_ss_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_su_int8.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_su_int8.cpp
deleted file mode 100644
index 2208a09aa4fb4..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_su_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_su_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_su_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_us_int8.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_us_int8.cpp
deleted file mode 100644
index b82a1d988c5d2..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_us_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_us_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_us_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_uu_int8.cpp b/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_uu_int8.cpp
deleted file mode 100644
index 0b5e8cf0316f3..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/XMX8/joint_matrix_uu_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_uu_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix-xmx8
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 8
-
-#include "../joint_matrix_uu_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_bf16.cpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_bf16.cpp
deleted file mode 100644
index 1afcbf421c37f..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_bf16.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==----------- element_wise_all_ops_bf16.cpp  - DPC++ joint_matrix---------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "element_wise_all_ops_bf16_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_bf16_impl.hpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_bf16_impl.hpp
deleted file mode 100644
index b5de917e3af7e..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_bf16_impl.hpp
+++ /dev/null
@@ -1,257 +0,0 @@
-
-#define TM 8
-#define TN SG_SZ
-#define TK 16
-
-static float make_fp32(uint16_t x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-static uint16_t make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (uint16_t)*res;
-}
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T, size_t M, size_t N>
-void assert_ops_ref(host_accessor<T, 2, access::mode::read> C,
-                    const float ref) {
-  for (size_t i = 0; i < M; i++)
-    for (size_t j = 0; j < N; j++) {
-      auto diff = make_fp32(C[i][j]) - ref;
-      assert(std::fabs(static_cast<float>(diff)) <
-             std::numeric_limits<float>::epsilon());
-    }
-}
-template <typename T, size_t M, size_t N>
-void matrix_verify_add(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class add_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, make_bf16(5.0));
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] + make_bf16(2);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_sub(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class sub_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, make_bf16(5.0));
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] - make_bf16(2);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_mul(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class mul_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, make_bf16(5.0));
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] * make_bf16(3.0);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_div(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class div_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, make_bf16(4.0));
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] / make_bf16(2.0);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_logic(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                         const float ref) {
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     cgh.parallel_for<class logic_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, make_bf16(5.0));
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             if (wi_slice_a[i]) {
-               if (wi_slice_a[i] > make_bf16(2.0) ||
-                   wi_slice_a[i] >= make_bf16(2.0) ||
-                   wi_slice_a[i] < make_bf16(2.0) ||
-                   wi_slice_a[i] <= make_bf16(2.0)) {
-                 T val = (wi_slice_a[i] != make_bf16(2.0)) ? wi_slice_a[i]
-                                                           : make_bf16(2.0);
-                 val = make_bf16(make_fp32(val) - static_cast<float>(1));
-                 val = make_bf16(make_fp32(val) + static_cast<float>(1));
-                 if (wi_slice_a[i] == make_bf16(2.0)) {
-                   val = make_bf16(make_fp32(val) - static_cast<float>(2));
-                   val = make_bf16(make_fp32(val) * static_cast<float>(3));
-                   val = make_bf16(make_fp32(val) / static_cast<float>(2));
-
-                 } else {
-                   val = make_bf16(make_fp32(val) + static_cast<float>(2));
-                 }
-                 wi_slice_a[i] = val;
-               }
-             }
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-unsigned short A[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-void matrix_ops_ref(float *D, int M, int N) {
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      *(D + m * N + n) = 0;
-      *(D + m * N + n) *= 2;
-    }
-}
-
-int main() {
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<unsigned short, MATRIX_M, MATRIX_N> MA((unsigned short *)&A);
-
-  size_t NDRangeM = MATRIX_M / TM;
-  size_t NDRangeN = MATRIX_N / TN;
-  queue q;
-  nd_range<2> r({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ});
-
-  matrix_verify_add<unsigned short, MATRIX_M, MATRIX_N>(q, MA, r, 7.0);
-  matrix_verify_sub<unsigned short, MATRIX_M, MATRIX_N>(q, MA, r, 3.0);
-  matrix_verify_mul<unsigned short, MATRIX_M, MATRIX_N>(q, MA, r, 15.0);
-  matrix_verify_div<unsigned short, MATRIX_M, MATRIX_N>(q, MA, r, 2.0);
-  matrix_verify_logic<unsigned short, MATRIX_M, MATRIX_N>(q, MA, r, 7.0);
-
-  return 0;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_half.cpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_half.cpp
deleted file mode 100644
index 8ac7e9048dfb6..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_half.cpp
+++ /dev/null
@@ -1,25 +0,0 @@
-//==----------- element_wise_all_ops_half.cpp  - DPC++ joint_matrix---------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: aspect-fp16
-// REQUIRES: matrix
-// REQUIRES: matrix-fp16
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "element_wise_all_ops_half_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_half_impl.hpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_half_impl.hpp
deleted file mode 100644
index c174e63a96026..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_half_impl.hpp
+++ /dev/null
@@ -1,244 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-private:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T, size_t M, size_t N>
-void assert_ops_ref(host_accessor<T, 2, access::mode::read> C,
-                    const float ref) {
-  for (size_t i = 0; i < M; i++)
-    for (size_t j = 0; j < N; j++) {
-      auto diff = C[i][j] - ref;
-      assert(std::fabs(static_cast<float>(diff)) <
-             std::numeric_limits<float>::epsilon());
-    }
-}
-template <typename T, size_t M, size_t N>
-void matrix_verify_add(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<half, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class add_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] + static_cast<half>(2);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_sub(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<half, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class sub_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] - static_cast<half>(2);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_mul(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<half, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class mul_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] * static_cast<half>(3.0);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_div(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const float ref) {
-  buffer<half, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class div_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 4);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] / static_cast<half>(2.0);
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_logic(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                         const float ref) {
-  buffer<half, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class logic_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             if (wi_slice_a[i]) {
-               if (wi_slice_a[i] > static_cast<half>(2.0) ||
-                   wi_slice_a[i] >= static_cast<half>(2.0) ||
-                   wi_slice_a[i] < static_cast<half>(2.0) ||
-                   wi_slice_a[i] <= static_cast<half>(2.0)) {
-                 T val = (wi_slice_a[i] != static_cast<half>(2.0))
-                             ? wi_slice_a[i]
-                             : static_cast<half>(2.0);
-                 val--;
-                 val++;
-                 if (wi_slice_a[i] == static_cast<half>(2.0)) {
-                   val -= 2;
-                   val *= 3;
-                   val /= 2;
-                 } else {
-                   val += 2;
-                 }
-                 wi_slice_a[i] = val;
-               }
-             }
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-half A[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-void matrix_ops_ref(float *D, int M, int N) {
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      *(D + m * N + n) = 0;
-      *(D + m * N + n) *= 2;
-    }
-}
-
-int main() {
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<half, MATRIX_M, MATRIX_N> MA((half *)&A);
-
-  size_t NDRangeM = MATRIX_M / TM;
-  size_t NDRangeN = MATRIX_N / TN;
-  queue q;
-  nd_range<2> r({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ});
-
-  matrix_verify_add<half, MATRIX_M, MATRIX_N>(q, MA, r, 7.0);
-  matrix_verify_sub<half, MATRIX_M, MATRIX_N>(q, MA, r, 3.0);
-  matrix_verify_mul<half, MATRIX_M, MATRIX_N>(q, MA, r, 15.0);
-  matrix_verify_div<half, MATRIX_M, MATRIX_N>(q, MA, r, 2.0);
-  matrix_verify_logic<half, MATRIX_M, MATRIX_N>(q, MA, r, 7.0);
-
-  return 0;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8.cpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8.cpp
deleted file mode 100644
index 0e790d07ecea8..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==----------- element_wise_all_ops_int8.cpp  - DPC++ joint_matrix---------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "element_wise_all_ops_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_impl.hpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_impl.hpp
deleted file mode 100644
index 29ae891104c67..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_impl.hpp
+++ /dev/null
@@ -1,231 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T, size_t M, size_t N>
-void assert_ops_ref(host_accessor<T, 2, access::mode::read> C, const int ref) {
-  for (size_t i = 0; i < M; i++)
-    for (size_t j = 0; j < N; j++) {
-      auto diff = C[i][j] - ref;
-      assert(std::fabs(static_cast<int>(diff)) <=
-             std::numeric_limits<int>::epsilon());
-    }
-}
-template <typename T, size_t M, size_t N>
-void matrix_verify_add(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class add_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] + 2;
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_sub(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class sub_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] - 2;
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_mul(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class mul_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] * 3;
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_div(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class div_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 4);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             wi_slice_a[i] = wi_slice_a[i] / 2;
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_logic(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                         const int ref) {
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class logic_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TM, TK> sub_a(sg);
-
-           joint_matrix_fill(sg, sub_a, 5);
-
-           auto wi_slice_a = sub_a.get_wi_data();
-           for (int i = 0; i < wi_slice_a.length(); i++) {
-             if (wi_slice_a[i]) {
-               if (wi_slice_a[i] > 2 || wi_slice_a[i] >= 2 ||
-                   wi_slice_a[i] < 2 || wi_slice_a[i] <= 2) {
-                 T val = (wi_slice_a[i] != 2) ? wi_slice_a[i] : 2;
-                 val--;
-                 val++;
-                 if (wi_slice_a[i] == 2) {
-                   val -= 2;
-                   val *= 3;
-                   val /= 2;
-                 } else {
-                   val += 2;
-                 }
-                 wi_slice_a[i] = val;
-               }
-             }
-           }
-           joint_matrix_store(
-               sg, sub_a,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufA.get_host_access(read_only), ref);
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-int8_t A[MATRIX_M][MATRIX_N];
-int D[MATRIX_M][MATRIX_N];
-
-int main() {
-
-  big_matrix<int, MATRIX_M, MATRIX_N> MD((int *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_N> MA((int8_t *)&A);
-
-  size_t NDRangeM = MATRIX_M / TM;
-  size_t NDRangeN = MATRIX_N / TN;
-  queue q;
-  nd_range<2> r({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ});
-
-  matrix_verify_add<int8_t, MATRIX_M, MATRIX_N>(q, MA, r, 7);
-  matrix_verify_sub<int8_t, MATRIX_M, MATRIX_N>(q, MA, r, 3);
-  matrix_verify_mul<int8_t, MATRIX_M, MATRIX_N>(q, MA, r, 15);
-  matrix_verify_div<int8_t, MATRIX_M, MATRIX_N>(q, MA, r, 2);
-  matrix_verify_logic<int8_t, MATRIX_M, MATRIX_N>(q, MA, r, 7);
-
-  return 0;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_packed.cpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_packed.cpp
deleted file mode 100644
index c74f266f45c63..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_packed.cpp
+++ /dev/null
@@ -1,25 +0,0 @@
-//==------ element_wise_all_ops_int8_packed.cpp  - DPC++ joint_matrix-------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// XFAIL: *
-
-#include <iostream>
-#include <random>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::intel;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "element_wise_all_ops_int8_packed_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_packed_impl.hpp b/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_packed_impl.hpp
deleted file mode 100644
index 54213937e7d6a..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_all_ops_int8_packed_impl.hpp
+++ /dev/null
@@ -1,231 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T, size_t M, size_t N>
-void assert_ops_ref(host_accessor<T, 2, access::mode::read> C, const int ref) {
-  for (size_t i = 0; i < M; i++)
-    for (size_t j = 0; j < N; j++) {
-      auto diff = C[i][j] - ref;
-      assert(std::fabs(static_cast<int>(diff)) <=
-             std::numeric_limits<int>::epsilon());
-    }
-}
-template <typename T, size_t M, size_t N>
-void matrix_verify_add(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufB(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class add_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<T, TK, TN, matrix_layout::packed_b> sub_b(sg);
-
-           joint_matrix_fill(sg, sub_b, 5);
-
-           auto wi_slice_b = sub_b.get_wi_data();
-           for (int i = 0; i < wi_slice_b.length(); i++) {
-             wi_slice_b[i] = wi_slice_b[i] + 2;
-           }
-           joint_matrix_store(
-               sg, sub_b,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N * 4 + sg_starty / SG_SZ * TN * 4,
-               N * 4, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufB.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_sub(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufB(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class sub_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-
-           joint_matrix_fill(sg, sub_b, 5);
-
-           auto wi_slice_b = sub_b.get_wi_data();
-           for (int i = 0; i < wi_slice_b.length(); i++) {
-             wi_slice_b[i] = wi_slice_b[i] - 2;
-           }
-           joint_matrix_store(
-               sg, sub_b,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N * 4 + sg_starty / SG_SZ * TN * 4,
-               N * 4, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufB.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_mul(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufB(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class mul_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-
-           joint_matrix_fill(sg, sub_b, 5);
-
-           auto wi_slice_b = sub_b.get_wi_data();
-           for (int i = 0; i < wi_slice_b.length(); i++) {
-             wi_slice_b[i] = wi_slice_b[i] * 3;
-           }
-           joint_matrix_store(
-               sg, sub_b,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N * 4 + sg_starty / SG_SZ * TN * 4,
-               N * 4, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufB.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_div(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                       const int ref) {
-  buffer<int8_t, 2> bufB(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class div_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-
-           joint_matrix_fill(sg, sub_b, 4);
-
-           auto wi_slice_b = sub_b.get_wi_data();
-           for (int i = 0; i < wi_slice_b.length(); i++) {
-             wi_slice_b[i] = wi_slice_b[i] / 2;
-           }
-           joint_matrix_store(
-               sg, sub_b,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N * 4 + sg_starty / SG_SZ * TN * 4,
-               N * 4, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufB.get_host_access(read_only), ref);
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_verify_logic(queue q, big_matrix<T, M, N> &A, nd_range<2> &r,
-                         const int ref) {
-  buffer<int8_t, 2> bufB(A.get_data(), range<2>(M, N));
-
-  q.submit([&](handler &cgh) {
-     auto accA = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class logic_matrix>(
-         r, [accA](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-
-           joint_matrix_fill(sg, sub_b, 5);
-
-           auto wi_slice_b = sub_b.get_wi_data();
-           for (int i = 0; i < wi_slice_b.length(); i++) {
-             if (wi_slice_b[i]) {
-               if (wi_slice_b[i] > 2 || wi_slice_b[i] >= 2 ||
-                   wi_slice_b[i] < 2 || wi_slice_b[i] <= 2) {
-                 T val = (wi_slice_b[i] != 2) ? wi_slice_b[i] : 2;
-                 val--;
-                 val++;
-                 if (wi_slice_b[i] == 2) {
-                   val -= 2;
-                   val *= 3;
-                   val /= 2;
-                 } else {
-                   val += 2;
-                 }
-                 wi_slice_b[i] = val;
-               }
-             }
-           }
-           joint_matrix_store(
-               sg, sub_b,
-               accA.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N * 4 + sg_starty / SG_SZ * TN * 4,
-               N * 4, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-  assert_ops_ref<T, M, N>(bufB.get_host_access(read_only), ref);
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-int8_t B[MATRIX_M][MATRIX_N];
-int D[MATRIX_M][MATRIX_N];
-
-int main() {
-
-  big_matrix<int, MATRIX_M, MATRIX_N> MD((int *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_N> MB((int8_t *)&B);
-
-  size_t NDRangeM = MATRIX_M / TM;
-  size_t NDRangeN = MATRIX_N / TN;
-  queue q;
-  nd_range<2> r({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ});
-
-  matrix_verify_add<int8_t, MATRIX_M, MATRIX_N>(q, MB, r, 7);
-  matrix_verify_sub<int8_t, MATRIX_M, MATRIX_N>(q, MB, r, 3);
-  matrix_verify_mul<int8_t, MATRIX_M, MATRIX_N>(q, MB, r, 15);
-  matrix_verify_div<int8_t, MATRIX_M, MATRIX_N>(q, MB, r, 2);
-  matrix_verify_logic<int8_t, MATRIX_M, MATRIX_N>(q, MB, r, 7);
-
-  return 0;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_irreg_sum_rows.cpp b/sycl/test-e2e/Matrix/Legacy/element_wise_irreg_sum_rows.cpp
deleted file mode 100644
index 7382371b16a95..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_irreg_sum_rows.cpp
+++ /dev/null
@@ -1,25 +0,0 @@
-//==-------- element_wise_irreg_sum_rows.cpp  - DPC++ joint_matrix----- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// This code sums each row of the matrix into a global array with one element
-// per row. First, a partial reduction is computed inside each sub-group (SG);
-// then an atomic add is used to reduce across the SG leaders.
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "element_wise_irreg_sum_rows_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_irreg_sum_rows_impl.hpp b/sycl/test-e2e/Matrix/Legacy/element_wise_irreg_sum_rows_impl.hpp
deleted file mode 100644
index 6a18fe3650f2c..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_irreg_sum_rows_impl.hpp
+++ /dev/null
@@ -1,105 +0,0 @@
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T, size_t M, size_t N>
-void sum_rows_ref(host_accessor<T, 2, access::mode::read> B,
-                  host_accessor<int, 1, access::mode::read> sum_rows) {
-  int sum_rows_ref[M] = {0};
-  for (size_t i = 0; i < M; i++) {
-    for (size_t j = 0; j < N; j++) {
-      sum_rows_ref[i] += B[i][j];
-    }
-    auto diff = sum_rows[i] - sum_rows_ref[i];
-    assert(std::fabs(static_cast<int>(diff)) <=
-           std::numeric_limits<int>::epsilon());
-  }
-}
-
-template <typename T, size_t M, size_t N>
-void matrix_sum_rows(queue q, big_matrix<T, M, N> &B, nd_range<2> &r) {
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(M, N));
-  // size of vector is known because SG size is set by the user in this case
-  int sum_rows[M] = {0};
-  buffer<int> sum_rows_v(sum_rows, M); // a total of TK/4 * 2 = 16 rows
-  q.submit([&](handler &cgh) {
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     auto v = sum_rows_v.get_access<access::mode::atomic>(cgh);
-
-     cgh.parallel_for<class add_matrix>(
-         r, [=](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-
-           joint_matrix<T, TK, TN, matrix_layout::packed_b> sub_b(sg);
-
-           joint_matrix_load(
-               sg, sub_b,
-               accB.template get_multi_ptr<access::decorated::no>() +
-                   (global_idx * (TK / 4) * N) + sg_starty / SG_SZ * TN * 4,
-               N, matrix_layout::packed_b);
-           // calculate sum of rows in sum_rows_v[8], there are 8 rows in sub_b
-           // (tK/4)
-           int32_t sum_local_rows[M] = {0}; // 8 local rows, M total
-           // sub_b has 32x8 elements, 32 elements per WI, 4 per WI per row
-           auto data = sub_b.get_wi_data();
-
-           // each WI calculates local sum of rows
-           for (int row = 0; row < TK / 4; row++) { // there are 8 rows
-             for (int i = 0; i < data.length() / (TK / 4); i++) { // 4 per row
-               // i*SG_SIZE index is found based on the round robin
-               // distribution we are using in the implementation
-               sum_local_rows[row + global_idx * (TK / 4)] += data[i + row * 4];
-             }
-             sum_local_rows[row + global_idx * (TK / 4)] = reduce_over_group(
-                 sg, sum_local_rows[row + global_idx * (TK / 4)],
-                 sycl::plus<>());
-
-             // only the sub-group leader performs the global reduction
-             if (global_idy % SG_SZ == 0) {
-               atomic_fetch_add(v[row + global_idx * (TK / 4)],
-                                sum_local_rows[row + global_idx * (TK / 4)]);
-             }
-           }
-         }); // parallel for
-   }).wait();
-  sum_rows_ref<T, M, N>(bufB.get_host_access(read_only),
-                        sum_rows_v.get_host_access(read_only));
-}
-
-static constexpr size_t MATRIX_K = TK / 4 * 2;
-static constexpr size_t MATRIX_N = TN * 4 * 2;
-int8_t B[MATRIX_K][MATRIX_N];
-
-int main() {
-  big_matrix<int8_t, MATRIX_K, MATRIX_N> MB((int8_t *)&B);
-
-  size_t NDRangeK = MATRIX_K / (TK / 4);
-  size_t NDRangeN = (MATRIX_N / 4) / TN;
-  queue q;
-  nd_range<2> r({NDRangeK, NDRangeN * SG_SZ}, {1, 1 * SG_SZ});
-
-  for (int i = 0; i < MATRIX_K; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      B[i][j] = i;
-    }
-  }
-
-  matrix_sum_rows<int8_t, MATRIX_K, MATRIX_N>(q, MB, r);
-
-  return 0;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_ops.cpp b/sycl/test-e2e/Matrix/Legacy/element_wise_ops.cpp
deleted file mode 100644
index c6f19e77f0097..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_ops.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==----------- element_wise_ops.cpp  - DPC++ joint_matrix------------- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "element_wise_ops_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/element_wise_ops_impl.hpp b/sycl/test-e2e/Matrix/Legacy/element_wise_ops_impl.hpp
deleted file mode 100644
index 8d15b78fd3198..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/element_wise_ops_impl.hpp
+++ /dev/null
@@ -1,159 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M, N
-  // stride should be X's cols, e.g., B's stride = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N,
-          K](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           // The submatrix API has to be accessed by all the work-items in a
-           // sub-group: these functions are called once per sub-group, with
-           // no code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           auto wi_slice_c = sub_c.get_wi_data();
-           for (int i = 0; i < wi_slice_c.length(); i++) {
-             wi_slice_c[i] *= 2;
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-int8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        char *va = (char *)(A_mem + m * K + k);
-        char *vb = (char *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (va[i] * vb[i]);
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-      *(C_mem + m * N + n) *= 2;
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1;
-      D[i][j] = 1;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/elemwise_irreg_size_ops_bf16.cpp b/sycl/test-e2e/Matrix/Legacy/elemwise_irreg_size_ops_bf16.cpp
deleted file mode 100644
index 0f57377c571ac..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/elemwise_irreg_size_ops_bf16.cpp
+++ /dev/null
@@ -1,196 +0,0 @@
-//==-------- elemwise_irreg_size_ops_bf16.cpp  - DPC++ joint_matrix---- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// This test covers element-wise operations when the matrix size is not a
-// multiple of the SG size. This corner case only applies to AMX. It also
-// tests the bf16 type. Only run this on AMX.
-// REQUIRES: cpu
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-// 10x12 is not a multiple of the SG size, so the slicing implementation has to
-// insert padding
-#define TM 10
-#define TN 12
-#define TK 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 2);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<unsigned short, 2> bufB(B.get_data(), range<2>(K / 2, N * 2));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-             [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the work-items in a
-           // sub-group: these functions are called once per sub-group, with
-           // no code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<unsigned short, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the packed_b layout.
-           // By default, the layout is row_major
-           joint_matrix<unsigned short, TK, TN, matrix_layout::packed_b> sub_b(
-               sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K; k += TK) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k,
-                 K, matrix_layout::row_major);
-             // Assume B is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k) * (N) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           auto wi_slice_c = sub_c.get_wi_data();
-           for (int i = 0; i < wi_slice_c.length(); i++) {
-             wi_slice_c[i] += 5.0;
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-unsigned short A[MATRIX_M][MATRIX_K];
-unsigned short B[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-float make_fp32(short x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-unsigned short make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (unsigned short)*res;
-}
-
-void matrix_multiply_ref(int *A_mem, int *B_mem, int *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        short *va = (short *)(A_mem + m * K + k);
-        short *vb = (short *)(B_mem + k * N + n);
-        float acc = *((float *)(C_mem + m * N + n));
-        // FIXME: Should we do reduce-add in another version?
-        for (int i = 0; i < 2; i++) {
-          acc += (make_fp32(va[i]) * make_fp32(vb[i]));
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-      *((float *)(C_mem + m * N + n)) += 5.0;
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = make_bf16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = make_bf16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<unsigned short, MATRIX_M, MATRIX_K> MA((unsigned short *)&A);
-  big_matrix<unsigned short, MATRIX_K / 2, MATRIX_N * 2> MB(
-      (unsigned short *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 2);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bf16.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bf16.cpp
deleted file mode 100644
index 6dd7921e51a64..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bf16.cpp
+++ /dev/null
@@ -1,20 +0,0 @@
-//==-------- joint_matrix_bf16.cpp  - DPC++ joint_matrix--------------- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include "../common.hpp"
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_bf16_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bf16_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bf16_impl.hpp
deleted file mode 100644
index 0f107a1a29515..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bf16_impl.hpp
+++ /dev/null
@@ -1,151 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 16
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 2);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<unsigned short, 2> bufB(B.get_data(), range<2>(K / 2, N * 2));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-             [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the work-items in a
-           // sub-group: these functions are called once per sub-group, with
-           // no code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<unsigned short, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the packed_b layout.
-           // By default, the layout is row_major
-           joint_matrix<unsigned short, TK, TN, matrix_layout::packed_b> sub_b(
-               sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K; k += TK) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k,
-                 K, matrix_layout::row_major);
-             // Assume B is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k) * (N) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-unsigned short A[MATRIX_M][MATRIX_K];
-unsigned short B[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-float make_fp32(short x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-unsigned short make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (unsigned short)*res;
-}
-
-void matrix_multiply_ref(int *A_mem, int *B_mem, int *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        short *va = (short *)(A_mem + m * K + k);
-        short *vb = (short *)(B_mem + k * N + n);
-        float acc = *((float *)(C_mem + m * N + n));
-        // FIXME: Should we do reduce-add in another version?
-        for (int i = 0; i < 2; i++) {
-          acc += (make_fp32(va[i]) * make_fp32(vb[i]));
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = make_bf16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = make_bf16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<unsigned short, MATRIX_M, MATRIX_K> MA((unsigned short *)&A);
-  big_matrix<unsigned short, MATRIX_K / 2, MATRIX_N * 2> MB(
-      (unsigned short *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 2);
-
-  bool res = matrix_compare(MATRIX_M, MATRIX_N, (float *)C, (float *)D);
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16.cpp
deleted file mode 100644
index c46a1746ee3fc..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16.cpp
+++ /dev/null
@@ -1,20 +0,0 @@
-//==-------- joint_matrix_bfloat16.cpp  - DPC++ joint_matrix----------- ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include "../common.hpp"
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_bfloat16_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_32x64.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_32x64.cpp
deleted file mode 100644
index fcfa621363290..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_32x64.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==----- joint_matrix_bfloat16_32x64.cpp  - DPC++ joint_matrix-------------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// XFAIL: *
-
-#include "../common.hpp"
-#include <iostream>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_bfloat16_32x64_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_32x64_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_32x64_impl.hpp
deleted file mode 100644
index 5199cf688cfbd..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_32x64_impl.hpp
+++ /dev/null
@@ -1,149 +0,0 @@
-#define TM 32
-#define TN 64
-#define TK 16
-
-template <typename T1, typename T2, size_t M, size_t N, size_t K>
-void matrix_multiply(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A,
-                     big_matrix<T2, K / 2, N * 2> &B) {
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<bfloat16, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<bfloat16, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [=](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the work-items in a
-           // sub-group: these functions are called once per sub-group, with
-           // no code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<bfloat16, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<bfloat16, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-bfloat16 A[MATRIX_M][MATRIX_K];
-bfloat16 B[MATRIX_K / 2][MATRIX_N * 2];
-unsigned short Aref[MATRIX_M][MATRIX_K];
-unsigned short Bref[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-float make_fp32(short x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-unsigned short make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (unsigned short)*res;
-}
-
-void matrix_multiply_ref(int *A_mem, int *B_mem, int *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        short *va = (short *)(A_mem + m * K + k);
-        short *vb = (short *)(B_mem + k * N + n);
-        float acc = *((float *)(C_mem + m * N + n));
-        // FIXME: Should we do reduce-add in another version?
-        for (int i = 0; i < 2; i++) {
-          acc += (make_fp32(va[i]) * make_fp32(vb[i]));
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      // bfloat16 is created using unsigned short since conversion from float to
-      // bfloat16 is not supported on the host side yet
-      A[i][j] = bfloat16(1.0f * (i + j));
-      Aref[i][j] = make_bf16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = bfloat16(2.0f * i + 3.0f * j);
-      Bref[i][j] = make_bf16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<bfloat16, MATRIX_M, MATRIX_K> MA((bfloat16 *)&A);
-  big_matrix<bfloat16, MATRIX_K / 2, MATRIX_N * 2> MB((bfloat16 *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)Aref, (int32_t *)Bref, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 2);
-
-  bool res = matrix_compare(MATRIX_M, MATRIX_N, (float *)C, (float *)D);
-
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_colmajorA_colmajorB.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
deleted file mode 100644
index 898bb19f801da..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
+++ /dev/null
@@ -1,25 +0,0 @@
-//==-- joint_matrix_bfloat16_colmajorA_colmajorB.cpp  - DPC++ joint_matrix--==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// This tests support of col major layout for matrix B which does transpose and
-// then VNNI transform. This is currently only available on AMX
-
-// XFAIL: gpu
-
-#include "../common.hpp"
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_bfloat16_colmajorA_colmajorB_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_colmajorA_colmajorB_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_colmajorA_colmajorB_impl.hpp
deleted file mode 100644
index 6ef7e56a3845b..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_colmajorA_colmajorB_impl.hpp
+++ /dev/null
@@ -1,110 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 16
-
-template <typename T1, typename T2, size_t M, size_t N, size_t K>
-void matrix_multiply(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A,
-                     big_matrix<T2, K, N> &B) {
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<bfloat16, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<bfloat16, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [=](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<bfloat16, TM, TK> sub_a(sg);
-           joint_matrix<bfloat16, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK) * M + sg_startx * TM,
-                 M, matrix_layout::col_major);
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (sg_starty / SG_SZ * TN) * K + k * TK,
-                 K, matrix_layout::col_major);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-bfloat16 A[MATRIX_K][MATRIX_M];
-bfloat16 B[MATRIX_N][MATRIX_K];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int M, int N, int K) {
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        D[m][n] += make_fp32(A[k][m]) * make_fp32(B[n][k]);
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_K; i++) {
-    for (int j = 0; j < MATRIX_M; j++) {
-      A[i][j] = bfloat16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_N; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      B[i][j] = bfloat16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<bfloat16, MATRIX_M, MATRIX_K> MA((bfloat16 *)&A);
-  big_matrix<bfloat16, MATRIX_K, MATRIX_N> MB((bfloat16 *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref(MATRIX_M, MATRIX_N, MATRIX_K);
-
-  bool res = matrix_compare(MATRIX_M, MATRIX_N, (float *)C, (float *)D);
-  std::cout << (res ? "passed" : "failed") << std::endl;
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_impl.hpp
deleted file mode 100644
index 60af725cff616..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_impl.hpp
+++ /dev/null
@@ -1,144 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 16
-
-template <typename T1, typename T2, size_t M, size_t N, size_t K>
-void matrix_multiply(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A,
-                     big_matrix<T2, K / 2, N * 2> &B) {
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<bfloat16, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<bfloat16, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [=](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<bfloat16, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<bfloat16, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-bfloat16 A[MATRIX_M][MATRIX_K];
-bfloat16 B[MATRIX_K / 2][MATRIX_N * 2];
-unsigned short Aref[MATRIX_M][MATRIX_K];
-unsigned short Bref[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-float make_fp32(short x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-unsigned short make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (unsigned short)*res;
-}
-
-void matrix_multiply_ref(int *A_mem, int *B_mem, int *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        short *va = (short *)(A_mem + m * K + k);
-        short *vb = (short *)(B_mem + k * N + n);
-        float acc = *((float *)(C_mem + m * N + n));
-        // FIXME: Should we do reduce-add in another version?
-        for (int i = 0; i < 2; i++) {
-          acc += (make_fp32(va[i]) * make_fp32(vb[i]));
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      // bfloat16 is created using unsigned short since conversion from float to
-      // bfloat16 is not supported on the host side yet
-      A[i][j] = bfloat16(1.0f * (i + j));
-      Aref[i][j] = make_bf16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = bfloat16(2.0f * i + 3.0f * j);
-      Bref[i][j] = make_bf16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<bfloat16, MATRIX_M, MATRIX_K> MA((bfloat16 *)&A);
-  big_matrix<bfloat16, MATRIX_K / 2, MATRIX_N * 2> MB((bfloat16 *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)Aref, (int32_t *)Bref, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 2);
-
-  bool res = matrix_compare(MATRIX_M, MATRIX_N, (float *)C, (float *)D);
-  std::cout << (res ? "passed" : "failed") << std::endl;
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
deleted file mode 100644
index bc3fa142bdccd..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==--joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp  - DPC++ joint_matrix---==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// This tests support of row major layout for matrix B which does automatic VNNI
-// transform. This is currently only available on AMX and XMX of PVC
-
-#include "../common.hpp"
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_bfloat16_rowmajorA_rowmajorB_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_rowmajorA_rowmajorB_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_rowmajorA_rowmajorB_impl.hpp
deleted file mode 100644
index 1f1b33d7561d0..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_bfloat16_rowmajorA_rowmajorB_impl.hpp
+++ /dev/null
@@ -1,110 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 16
-
-template <typename T1, typename T2, size_t M, size_t N, size_t K>
-void matrix_multiply(big_matrix<T1, M, N> &C, big_matrix<T2, M, K> &A,
-                     big_matrix<T2, K, N> &B) {
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<bfloat16, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<bfloat16, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [=](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<bfloat16, TM, TK> sub_a(sg);
-           joint_matrix<bfloat16, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK) * (N) + sg_starty / SG_SZ * TN,
-                 N, matrix_layout::row_major);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-bfloat16 A[MATRIX_M][MATRIX_K];
-bfloat16 B[MATRIX_K][MATRIX_N];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int M, int N, int K) {
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        D[m][n] += make_fp32(A[m][k]) * make_fp32(B[k][n]);
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = bfloat16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      B[i][j] = bfloat16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<bfloat16, MATRIX_M, MATRIX_K> MA((bfloat16 *)&A);
-  big_matrix<bfloat16, MATRIX_K, MATRIX_N> MB((bfloat16 *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref(MATRIX_M, MATRIX_N, MATRIX_K);
-
-  bool res = matrix_compare(MATRIX_M, MATRIX_N, (float *)C, (float *)D);
-  std::cout << (res ? "passed" : "failed") << std::endl;
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_half.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_half.cpp
deleted file mode 100644
index 7062382643cca..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_half.cpp
+++ /dev/null
@@ -1,23 +0,0 @@
-//==-------- joint_matrix_half.cpp  - DPC++ joint_matrix-----------------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: aspect-fp16
-// REQUIRES: matrix
-// REQUIRES: matrix-fp16
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_half_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_half_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_half_impl.hpp
deleted file mode 100644
index c663cc282c758..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_half_impl.hpp
+++ /dev/null
@@ -1,151 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 2);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<half, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<half, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<float, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, SG_SZ}),
-         [accA, accB, accC, M, N,
-          K](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<half, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<half, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-half A[MATRIX_M][MATRIX_K];
-half B[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(float *A_mem, float *B_mem, float *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        half *va = (half *)(A_mem + m * K + k);
-        half *vb = (half *)(B_mem + k * N + n);
-        float acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 2; i++) {
-          acc += ((float)va[i] * (float)vb[i]);
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<half, MATRIX_M, MATRIX_K> MA((half *)&A);
-  big_matrix<half, MATRIX_K / 2, MATRIX_N * 2> MB((half *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((float *)A, (float *)B, (float *)D, MATRIX_M, MATRIX_N,
-                      MATRIX_K / 2);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_colmajorA_colmajorB.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_colmajorA_colmajorB.cpp
deleted file mode 100644
index 619e9423d4671..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_colmajorA_colmajorB.cpp
+++ /dev/null
@@ -1,26 +0,0 @@
-//==----- joint_matrix_int8_colmajorA_colmajorB.cpp  - DPC++ joint_matrix---==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-// This tests support of col major layout for matrix B which does transpose and
-// then VNNI transform. This is currently only available on AMX
-
-// XFAIL: gpu
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_int8_colmajorA_colmajorB_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_colmajorA_colmajorB_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_colmajorA_colmajorB_impl.hpp
deleted file mode 100644
index 21347c80c083b..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_colmajorA_colmajorB_impl.hpp
+++ /dev/null
@@ -1,142 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M, N
-  // stride should be X's cols, e.g., B's stride = N*4
-  // assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles, 1 KB each; SMmax x SKmax = 16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_fill(sg, sub_c, 0);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK) * M + sg_startx * TM,
-                 M, matrix_layout::col_major);
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (sg_starty / SG_SZ * TN) * K + k * TK,
-                 K, matrix_layout::col_major);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM;
-static constexpr size_t MATRIX_N = TN;
-static constexpr size_t MATRIX_K = TK;
-int8_t A[MATRIX_K][MATRIX_M];
-int8_t Aref[MATRIX_K][MATRIX_M];
-int8_t B[MATRIX_N][MATRIX_K];
-int8_t Bref[MATRIX_N][MATRIX_K];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int M, int N, int K) {
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        D[m][n] += Aref[k][m] * Bref[n][k];
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_K; i++) {
-    for (int j = 0; j < MATRIX_M; j++) {
-      A[i][j] = 2 * i + j;
-      Aref[i][j] = 2 * i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_N; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      B[i][j] = i + 2 * j;
-      Bref[i][j] = i + 2 * j;
-    }
-  }
-
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 0;
-      D[i][j] = 0;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K, MATRIX_N> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref(MATRIX_M, MATRIX_N, MATRIX_K);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  std::cout << (res ? "passed" : "failed") << std::endl;
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_vnni.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_vnni.cpp
deleted file mode 100644
index a6d6250c0dff0..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_vnni.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_int8_vnni.cpp  - DPC++ joint_matrix---------------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_int8_vnni_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_vnni_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_vnni_impl.hpp
deleted file mode 100644
index 1948071dbf405..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_int8_vnni_impl.hpp
+++ /dev/null
@@ -1,160 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-             [[intel::reqd_sub_group_size(SG_SZ)]] {
-               // The submatrix API has to be accessed by all the work-items in
-               // a subgroup. These functions are called once per subgroup, with
-               // no code divergence between the work-items.
-               const auto global_idx = spmd_item.get_global_id(0);
-               const auto global_idy = spmd_item.get_global_id(1);
-               const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-               const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-               sub_group sg = spmd_item.get_sub_group();
-               joint_matrix<int8_t, TM, TK> sub_a(sg);
-               joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-               joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-               joint_matrix_fill(sg, sub_c, 0);
-               for (int k = 0; k < K / TK; k += 1) {
-                 joint_matrix_load(
-                     sg, sub_a,
-                     accA.template get_multi_ptr<access::decorated::no>() +
-                         (sg_startx * TM) * K + k * TK,
-                     K, matrix_layout::row_major);
-                 // VNNI transform is done automatically at this level
-                 joint_matrix_load(
-                     sg, sub_b,
-                     accB.template get_multi_ptr<access::decorated::no>() +
-                         (k * TK) * N + sg_starty / SG_SZ * TN,
-                     N, matrix_layout::row_major);
-                 sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-               }
-               joint_matrix_store(
-                   sg, sub_c,
-                   accC.template get_multi_ptr<access::decorated::no>() +
-                       (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-                   N, matrix_layout::row_major);
-             }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-int8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K][MATRIX_N];
-int8_t Bvnni[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void int8_row_vnni_reformat(int8_t *_in, int8_t *_out, int K, int N,
-                            int stride_in) {
-  // Find the old index and the new index, then copy the element.
-  // Shape: (K, N) => (K/4, N*4)
-  // 2D index: (i, j) => (i/4, j*4 + i%4)
-  // Linear index: i * stride_in + j => (i/4) * N*4 + j*4 + i%4
-  for (int i = 0; i < K; ++i) {
-    for (int j = 0; j < N; ++j) {
-      size_t oldindex = i * stride_in + j;
-      size_t newindex = (i / 4) * N * 4 + j * 4 + i % 4;
-      _out[newindex] = _in[oldindex];
-    }
-  }
-}
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        char *va = (char *)(A_mem + m * K + k);
-        char *vb = (char *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (va[i] * vb[i]);
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      B[i][j] = i + j * 2;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 0;
-      D[i][j] = 0;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  int8_row_vnni_reformat((int8_t *)B, (int8_t *)Bvnni, MATRIX_K, MATRIX_N,
-                         MATRIX_N);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)Bvnni, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_query_default.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_query_default.cpp
deleted file mode 100644
index 8aaf737a274a8..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_query_default.cpp
+++ /dev/null
@@ -1,173 +0,0 @@
-//==-------- joint_matrix_query_default.cpp  - DPC++ joint_matrix-----------==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// Needs AMX.
-// REQUIRES: cpu
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-
-  using myparams2 = tpu_params<tpu::amx, int8_t, int8_t, int>;
-  constexpr int TM = myparams2::defaultM;
-  constexpr int TN = myparams2::defaultN;
-  constexpr int TK = myparams2::defaultK;
-
-  std::cout << "AMX query sizes are: M " << TM << " N " << TN << " K " << TK
-            << std::endl;
-
-  constexpr int SG_SZ = TN;
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-             [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the work-items in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-
-           myparams2::joint_matrix_a<sub_group> sub_a(sg);
-           myparams2::joint_matrix_b<sub_group> sub_b(sg);
-           myparams2::joint_matrix_c<sub_group> sub_c(sg);
-
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = 128;
-static constexpr size_t MATRIX_N = 128;
-static constexpr size_t MATRIX_K = 128;
-int8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        char *va = (char *)(A_mem + m * K + k);
-        char *vb = (char *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (va[i] * vb[i]);
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1;
-      D[i][j] = 1;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  // Report the result and return a non-zero exit code on failure.
-  std::cout << (res ? "passed\n" : "failed\n");
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_ss_int8.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_ss_int8.cpp
deleted file mode 100644
index 6639f2a51478d..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_ss_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_ss_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_ss_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_ss_int8_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_ss_int8_impl.hpp
deleted file mode 100644
index a2436bc56e792..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_ss_int8_impl.hpp
+++ /dev/null
@@ -1,144 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M x N
-  // The stride is the column count of each matrix, e.g., B's stride = N*4.
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N,
-          K](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           // The submatrix API has to be accessed by all the work-items in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           joint_matrix_fill(sg, sub_c, 0);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-int8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        char *va = (char *)(A_mem + m * K + k);
-        char *vb = (char *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (va[i] * vb[i]);
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 0;
-      D[i][j] = 0;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  std::cout << (res ? "passed" : "failed") << std::endl;
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_su_int8.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_su_int8.cpp
deleted file mode 100644
index 6cf9fd88c924c..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_su_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_su_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_su_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_su_int8_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_su_int8_impl.hpp
deleted file mode 100644
index f0a9a7155fb0d..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_su_int8_impl.hpp
+++ /dev/null
@@ -1,154 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, typename T3, size_t NUM_ROWS_A,
-          size_t NUM_COLS_A, size_t NUM_ROWS_B, size_t NUM_COLS_B,
-          size_t NUM_ROWS_C, size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T3, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M x N
-  // The stride is the column count of each matrix, e.g., B's stride = N*4.
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<uint8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N,
-          K](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           // The submatrix API has to be accessed by all the work-items in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<uint8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles of 1 KB each; SMmax x SKmax = 16 x 64.
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4.
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-int8_t A[MATRIX_M][MATRIX_K];
-uint8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        int8_t *va = (int8_t *)(A_mem + m * K + k);
-        uint8_t *vb = (uint8_t *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (static_cast<int>(va[i]) * static_cast<int>(vb[i]));
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1;
-      D[i][j] = 1;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<uint8_t, MATRIX_K / 4, MATRIX_N * 4> MB((uint8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_us_int8.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_us_int8.cpp
deleted file mode 100644
index 290e1d9ae6471..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_us_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_us_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_us_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_us_int8_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_us_int8_impl.hpp
deleted file mode 100644
index 68cf40bb481b9..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_us_int8_impl.hpp
+++ /dev/null
@@ -1,156 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, typename T3, size_t NUM_ROWS_A,
-          size_t NUM_COLS_A, size_t NUM_ROWS_B, size_t NUM_COLS_B,
-          size_t NUM_ROWS_C, size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T3, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M x N
-  // The stride is the column count of each matrix, e.g., B's stride = N*4.
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<uint8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-             [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the work-items in a
-           // subgroup. These functions are called once per subgroup, with no
-           // code divergence between the work-items.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<uint8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles of 1 KB each; SMmax x SKmax = 16 x 64.
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4.
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-uint8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        uint8_t *va = (uint8_t *)(A_mem + m * K + k);
-        int8_t *vb = (int8_t *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (static_cast<int>(va[i]) * static_cast<int>(vb[i]));
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1;
-      D[i][j] = 1;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<uint8_t, MATRIX_M, MATRIX_K> MA((uint8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_uu_int8.cpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_uu_int8.cpp
deleted file mode 100644
index b48131d627349..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_uu_int8.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-//==-------- joint_matrix_uu_int8.cpp  - DPC++ joint_matrix------------ ----==//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-// REQUIRES: matrix
-
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1
-// RUN: %{run} %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define SG_SZ 16
-
-#include "joint_matrix_uu_int8_impl.hpp"
diff --git a/sycl/test-e2e/Matrix/Legacy/joint_matrix_uu_int8_impl.hpp b/sycl/test-e2e/Matrix/Legacy/joint_matrix_uu_int8_impl.hpp
deleted file mode 100644
index 14190434dd2b1..0000000000000
--- a/sycl/test-e2e/Matrix/Legacy/joint_matrix_uu_int8_impl.hpp
+++ /dev/null
@@ -1,154 +0,0 @@
-#define TM 8
-#define TN SG_SZ
-#define TK 32
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M x N
-  // stride should be X's cols, e.g., B's stride = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<uint8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<uint8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N,
-          K](nd_item<2> spmd_item) [[intel::reqd_sub_group_size(SG_SZ)]] {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup. These functions will be called once by the subgroup,
-           // with no code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<uint8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<uint8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-uint8_t A[MATRIX_M][MATRIX_K];
-uint8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        uint8_t *va = (uint8_t *)(A_mem + m * K + k);
-        uint8_t *vb = (uint8_t *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (static_cast<int>(va[i]) * static_cast<int>(vb[i]));
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1;
-      D[i][j] = 1;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<uint8_t, MATRIX_M, MATRIX_K> MA((uint8_t *)&A);
-  big_matrix<uint8_t, MATRIX_K / 4, MATRIX_N * 4> MB((uint8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-
-  return !res;
-}
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_abc.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_abc.cpp
index 6c781daba8890..bbaeec54ac149 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_abc.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_abc.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <cstddef>
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops.cpp
index 8992d16e56b8a..2d8bc00702a56 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_half.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_half.cpp
index b7ced88bfa997..a739c9afff3cb 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_half.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_half.cpp
@@ -9,7 +9,7 @@
 // REQUIRES: matrix,gpu
 // REQUIRES: matrix-fp16
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8.cpp
index c3a60c174a8ca..44852c44f23d5 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8_packed.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8_packed.cpp
index bd6f4784c1928..c1347e290cfa0 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8_packed.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_int8_packed.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This test stores the matrix B that is VNNIed (packed).
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_tf32.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_tf32.cpp
index d7230889df8f2..86cc5fecd1d81 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_tf32.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_all_ops_tf32.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-tf32
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_all_sizes.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_all_sizes.cpp
index db9c109071f57..6f0bde4d2724d 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_all_sizes.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_all_sizes.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/SG32/element_wise_ops.cpp b/sycl/test-e2e/Matrix/SG32/element_wise_ops.cpp
index 2ec61867884c6..b02d3ffd990fe 100644
--- a/sycl/test-e2e/Matrix/SG32/element_wise_ops.cpp
+++ b/sycl/test-e2e/Matrix/SG32/element_wise_ops.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/SG32/get_coord_float_matC.cpp b/sycl/test-e2e/Matrix/SG32/get_coord_float_matC.cpp
index 8804a3287c8a9..3091987260e0e 100644
--- a/sycl/test-e2e/Matrix/SG32/get_coord_float_matC.cpp
+++ b/sycl/test-e2e/Matrix/SG32/get_coord_float_matC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: cpu
 
diff --git a/sycl/test-e2e/Matrix/SG32/get_coord_int8_matA.cpp b/sycl/test-e2e/Matrix/SG32/get_coord_int8_matA.cpp
index a2c1956443105..05642b66f46b3 100644
--- a/sycl/test-e2e/Matrix/SG32/get_coord_int8_matA.cpp
+++ b/sycl/test-e2e/Matrix/SG32/get_coord_int8_matA.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: cpu
 
diff --git a/sycl/test-e2e/Matrix/SG32/get_coord_int8_matB.cpp b/sycl/test-e2e/Matrix/SG32/get_coord_int8_matB.cpp
index 0eb40cc44c32a..cfb92ba5fcbfe 100644
--- a/sycl/test-e2e/Matrix/SG32/get_coord_int8_matB.cpp
+++ b/sycl/test-e2e/Matrix/SG32/get_coord_int8_matB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: *
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_all_sizes.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_all_sizes.cpp
index a09cfce1d70c1..16e5f7904b400 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_all_sizes.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_all_sizes.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_apply_bf16.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_apply_bf16.cpp
index c87ebca5ee7ff..dad2d2d706f53 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_apply_bf16.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_apply_bf16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache.cpp
index 443b6d859fa97..21e2881524d87 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_init.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_init.cpp
index 7fabb51cddd72..c01806b9e407b 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_init.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_init.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix, gpu
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DINIT_LIST
+// RUN: %{build} -o %t.out -DINIT_LIST
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll.cpp
index 24a8c44074a80..1411120c3fbfd 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll.cpp
@@ -7,10 +7,10 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -mllvm -inline-threshold=5000 -o %t_gpu.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DMANUAL_UNROLL
+// RUN: %{build} -mllvm -inline-threshold=5000 -o %t_gpu.out -DMANUAL_UNROLL
 // RUN: %if gpu %{ %{run} %t_gpu.out %}
 
-// RUN: %{build} -mllvm -inline-threshold=5000 -o %t_cpu.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DMANUAL_UNROLL -DtM=16 -DtK=32 -DNCACHE1=32 -DKCACHE1=32
+// RUN: %{build} -mllvm -inline-threshold=5000 -o %t_cpu.out -DMANUAL_UNROLL -DtM=16 -DtK=32 -DNCACHE1=32 -DKCACHE1=32
 // RUN: %if cpu %{ %{run} %t_cpu.out %}
 
 // -mllvm -inline-threshold added as a workaround,
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll_init.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
index 77b304e5dc7ce..9a8bd7e1734d8 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix, gpu
 
-// RUN: %{build} -mllvm -inline-threshold=5000 -o %t_gpu.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DINIT_LIST -DMANUAL_UNROLL
+// RUN: %{build} -mllvm -inline-threshold=5000 -o %t_gpu.out -DINIT_LIST -DMANUAL_UNROLL
 // RUN: %{run} %t_gpu.out
 
 // -mllvm -inline-threshold added as a workaround,
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16.cpp
index dc3bc665ab88a..6c28be044a357 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_16x16x16.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_16x16x16.cpp
index 759129e536fc3..bc04083af2385 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_16x16x16.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_16x16x16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: gpu
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x16.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x16.cpp
index af22031fd9f02..d39fb996a0f1f 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x16.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: *
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x32.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x32.cpp
index fd3230a862571..70d1488a14f8c 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x32.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_32x64x32.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: *
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_array.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_array.cpp
index 54694baa8871a..7dc4cb0dbb3cc 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_array.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_array.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_colmajorA_colmajorB.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
index f7a713a26d2ec..23680c63c93ac 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This tests support of col major layout for matrix B which does transpose and
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
index 34699a67ce395..211aa67077db4 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This tests support of row major layout for matrix B which does automatic VNNI
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_colA_rowB_colC.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_colA_rowB_colC.cpp
index 25ce247fc67b6..f9d96d5a0b4c7 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_colA_rowB_colC.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_colA_rowB_colC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_down_convert.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_down_convert.cpp
index 258ec0737a896..05b5548198dcc 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_down_convert.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_down_convert.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_half.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_half.cpp
index 9dfe60c013de1..1ef2313a1f38c 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_half.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_half.cpp
@@ -9,7 +9,7 @@
 // REQUIRES: matrix
 // REQUIRES: matrix-fp16
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_colmajorA_colmajorB.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_colmajorA_colmajorB.cpp
index 2b3ced9fc36ff..d5b66f7fd3cdc 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_colmajorA_colmajorB.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_colmajorA_colmajorB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This tests support of col major layout for matrix B which does transpose and
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_vnni.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_vnni.cpp
index bac14b7b87dcb..8a19969c4726b 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_vnni.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_int8_vnni.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: gpu
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_out_bounds.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_out_bounds.cpp
index 20291836b4110..e6d42a416398e 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_out_bounds.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_out_bounds.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_ss_int8.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_ss_int8.cpp
index 56323e70272c9..27fb04882e9cd 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_ss_int8.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_ss_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_su_int8.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_su_int8.cpp
index b8b25542bd7e9..e66c75fc8412c 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_su_int8.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_su_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_tf32.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_tf32.cpp
index cae8af681ec4f..29e9a4f9d454b 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_tf32.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_tf32.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-tf32
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:cpu
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_transposeC.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_transposeC.cpp
index af34b5a1e827c..9ec30386a806f 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_transposeC.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_transposeC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: gpu
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_unaligned_k.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_unaligned_k.cpp
index 40bceb606652d..b6e1005d610c6 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_unaligned_k.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_unaligned_k.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_us_int8.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_us_int8.cpp
index 1779b5c204fc8..450427081c3ab 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_us_int8.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_us_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/SG32/joint_matrix_uu_int8.cpp b/sycl/test-e2e/Matrix/SG32/joint_matrix_uu_int8.cpp
index ce0b0e893e664..7d37a2aa4fe55 100644
--- a/sycl/test-e2e/Matrix/SG32/joint_matrix_uu_int8.cpp
+++ b/sycl/test-e2e/Matrix/SG32/joint_matrix_uu_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/element_wise_abc.cpp b/sycl/test-e2e/Matrix/XMX8/element_wise_abc.cpp
index bc91d0aa46066..3fcb1682c1a15 100644
--- a/sycl/test-e2e/Matrix/XMX8/element_wise_abc.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/element_wise_abc.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <cstddef>
diff --git a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops.cpp b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops.cpp
index acf32bc94feee..f1f7bf84899a4 100644
--- a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops.cpp
@@ -8,7 +8,7 @@
 // REQUIRES: matrix-xmx8
 // REQUIRES: TEMPORARY_DISBLED
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_half.cpp b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_half.cpp
index 3c32e720fcce2..7d7152f53f8ba 100644
--- a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_half.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_half.cpp
@@ -9,7 +9,7 @@
 // REQUIRES: matrix-xmx8
 // REQUIRES: matrix-fp16
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8.cpp b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8.cpp
index e672562dcf69f..f03bbc2f325ff 100644
--- a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8.cpp
@@ -8,7 +8,7 @@
 // REQUIRES: matrix-xmx8
 // REQUIRES: TEMPORARY_DISBLED
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8_packed.cpp b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8_packed.cpp
index 6bc655f9089fc..11428eda55ec9 100644
--- a/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8_packed.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/element_wise_all_ops_int8_packed.cpp
@@ -8,7 +8,7 @@
 // REQUIRES: matrix-xmx8
 // REQUIRES: TEMPORARY_DISBLED
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This test stores the matrix B that is VNNIed (packed).
diff --git a/sycl/test-e2e/Matrix/XMX8/element_wise_all_sizes.cpp b/sycl/test-e2e/Matrix/XMX8/element_wise_all_sizes.cpp
index 7b79d858bda2d..71a2bfb9580c9 100644
--- a/sycl/test-e2e/Matrix/XMX8/element_wise_all_sizes.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/element_wise_all_sizes.cpp
@@ -8,7 +8,7 @@
 // REQUIRES: matrix-xmx8
 // XFAIL: gpu-intel-dg2
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/XMX8/element_wise_ops.cpp b/sycl/test-e2e/Matrix/XMX8/element_wise_ops.cpp
index 87498caee3e5f..8fa0a2bf5094a 100644
--- a/sycl/test-e2e/Matrix/XMX8/element_wise_ops.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/element_wise_ops.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/XMX8/get_coord_float_matC.cpp b/sycl/test-e2e/Matrix/XMX8/get_coord_float_matC.cpp
index 3eabb2662270d..6a913ec88893e 100644
--- a/sycl/test-e2e/Matrix/XMX8/get_coord_float_matC.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/get_coord_float_matC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: cpu
 
diff --git a/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matA.cpp b/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matA.cpp
index f6c49cf0da022..a914eab57ba1b 100644
--- a/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matA.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matA.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: cpu
 
diff --git a/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matB.cpp b/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matB.cpp
index b55e01525abca..a84580c3f846c 100644
--- a/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matB.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/get_coord_int8_matB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: *
 
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_all_sizes.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_all_sizes.cpp
index 5095cad350bfd..be1ac0f24e88c 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_all_sizes.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_all_sizes.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_apply_bf16.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_apply_bf16.cpp
index f88b74af20df2..e7a510e41cbcf 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_apply_bf16.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_apply_bf16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache.cpp
index 957c7c4475bf3..8afcb804929e5 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_init.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_init.cpp
index 23bc1c9c2a9fc..7dd0351356bcc 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_init.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_init.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DINIT_LIST
+// RUN: %{build} -o %t.out -DINIT_LIST
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll.cpp
index 0d99fa0f86e4b..0c1dbb733c8ab 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -mllvm -inline-threshold=2000 -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DMANUAL_UNROLL
+// RUN: %{build} -mllvm -inline-threshold=2000 -o %t.out -DMANUAL_UNROLL
 // RUN: %{run} %t.out
 
 // -mllvm -inline-threshold=2000 added as a workaround,
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll_init.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
index 2f32bbad898d4..d97399b296266 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -mllvm -inline-threshold=2000 -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DINIT_LIST -DMANUAL_UNROLL
+// RUN: %{build} -mllvm -inline-threshold=2000 -o %t.out -DINIT_LIST -DMANUAL_UNROLL
 // RUN: %{run} %t.out
 
 // -mllvm -inline-threshold=2000 added as a workaround,
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16.cpp
index 7f5c190079175..008db77761e3d 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_32x64.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_32x64.cpp
index 1873f503a22ba..b72e2ed83841c 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_32x64.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_32x64.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: *
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_array.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_array.cpp
index 2feda785133b0..e6371806f3592 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_array.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_bfloat16_array.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_colA_rowB_colC.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_colA_rowB_colC.cpp
index 9e4bdc9472663..494a84c173edb 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_colA_rowB_colC.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_colA_rowB_colC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:gpu
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_half.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_half.cpp
index 91601dd608a6a..dbe060711b02a 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_half.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_half.cpp
@@ -9,7 +9,7 @@
 // REQUIRES: matrix-xmx8
 // REQUIRES: matrix-fp16
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_int8_vnni.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_int8_vnni.cpp
index 67abad7a91348..bcdd18571b4a9 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_int8_vnni.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_int8_vnni.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: *
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_out_bounds.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_out_bounds.cpp
index c5335b81929ab..944cccd310d3e 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_out_bounds.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_out_bounds.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_ss_int8.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_ss_int8.cpp
index ceb1b2e83d8be..4a3770be74f91 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_ss_int8.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_ss_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_su_int8.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_su_int8.cpp
index 71ba8145213fb..d5c7a74c20aff 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_su_int8.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_su_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_transposeC.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_transposeC.cpp
index b306bdc49ba0f..16487eeaa85f0 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_transposeC.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_transposeC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:gpu
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_unaligned_k.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_unaligned_k.cpp
index 96281e639a573..aa8e00c08b658 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_unaligned_k.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_unaligned_k.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_us_int8.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_us_int8.cpp
index 4f9ece1b2f228..56feaaec924ad 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_us_int8.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_us_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/XMX8/joint_matrix_uu_int8.cpp b/sycl/test-e2e/Matrix/XMX8/joint_matrix_uu_int8.cpp
index 054dce1d3d752..a1643332e489f 100644
--- a/sycl/test-e2e/Matrix/XMX8/joint_matrix_uu_int8.cpp
+++ b/sycl/test-e2e/Matrix/XMX8/joint_matrix_uu_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-xmx8
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "../common.hpp"
diff --git a/sycl/test-e2e/Matrix/element_wise_abc.cpp b/sycl/test-e2e/Matrix/element_wise_abc.cpp
index 40f3a44cd7235..961e6ca319b9d 100644
--- a/sycl/test-e2e/Matrix/element_wise_abc.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_abc.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <cstddef>
diff --git a/sycl/test-e2e/Matrix/element_wise_all_ops.cpp b/sycl/test-e2e/Matrix/element_wise_all_ops.cpp
index 0fd73525951b0..fd3648664a52c 100644
--- a/sycl/test-e2e/Matrix/element_wise_all_ops.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_all_ops.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/element_wise_all_ops_half.cpp b/sycl/test-e2e/Matrix/element_wise_all_ops_half.cpp
index 1717f33d1a6b5..9b23ef9fdb339 100644
--- a/sycl/test-e2e/Matrix/element_wise_all_ops_half.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_all_ops_half.cpp
@@ -9,7 +9,7 @@
 // REQUIRES: matrix
 // REQUIRES: matrix-fp16
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/element_wise_all_ops_int8.cpp b/sycl/test-e2e/Matrix/element_wise_all_ops_int8.cpp
index df3d107e647ea..862eadd49e446 100644
--- a/sycl/test-e2e/Matrix/element_wise_all_ops_int8.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_all_ops_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/element_wise_all_ops_int8_packed.cpp b/sycl/test-e2e/Matrix/element_wise_all_ops_int8_packed.cpp
index fde412f3c8386..4355fae1ccd69 100644
--- a/sycl/test-e2e/Matrix/element_wise_all_ops_int8_packed.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_all_ops_int8_packed.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This test stores the matrix B that is VNNIed (packed).
diff --git a/sycl/test-e2e/Matrix/element_wise_all_ops_tf32.cpp b/sycl/test-e2e/Matrix/element_wise_all_ops_tf32.cpp
index 7ade05f42133f..64df1ed9c7f1a 100644
--- a/sycl/test-e2e/Matrix/element_wise_all_ops_tf32.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_all_ops_tf32.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-tf32
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/element_wise_all_sizes.cpp b/sycl/test-e2e/Matrix/element_wise_all_sizes.cpp
index b6f2c7143ff34..6017284615d9a 100644
--- a/sycl/test-e2e/Matrix/element_wise_all_sizes.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_all_sizes.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/element_wise_ops.cpp b/sycl/test-e2e/Matrix/element_wise_ops.cpp
index 539709551c765..a87ad3ab17999 100644
--- a/sycl/test-e2e/Matrix/element_wise_ops.cpp
+++ b/sycl/test-e2e/Matrix/element_wise_ops.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/elemwise_irreg_size_ops_bf16.cpp b/sycl/test-e2e/Matrix/elemwise_irreg_size_ops_bf16.cpp
index cc8722467d262..a2b8ef5aa8b57 100644
--- a/sycl/test-e2e/Matrix/elemwise_irreg_size_ops_bf16.cpp
+++ b/sycl/test-e2e/Matrix/elemwise_irreg_size_ops_bf16.cpp
@@ -11,7 +11,7 @@
 // REQUIRES: cpu
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/get_coord_float_matC.cpp b/sycl/test-e2e/Matrix/get_coord_float_matC.cpp
index d53326ef0b88d..015dda8e75475 100644
--- a/sycl/test-e2e/Matrix/get_coord_float_matC.cpp
+++ b/sycl/test-e2e/Matrix/get_coord_float_matC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: cpu
 
diff --git a/sycl/test-e2e/Matrix/get_coord_int8_matA.cpp b/sycl/test-e2e/Matrix/get_coord_int8_matA.cpp
index 8a5a4584be7fe..567d0831d3d19 100644
--- a/sycl/test-e2e/Matrix/get_coord_int8_matA.cpp
+++ b/sycl/test-e2e/Matrix/get_coord_int8_matA.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: cpu
 
diff --git a/sycl/test-e2e/Matrix/get_coord_int8_matB.cpp b/sycl/test-e2e/Matrix/get_coord_int8_matB.cpp
index 5f0565dec1da1..798afde072dd3 100644
--- a/sycl/test-e2e/Matrix/get_coord_int8_matB.cpp
+++ b/sycl/test-e2e/Matrix/get_coord_int8_matB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 // XFAIL: *
 
diff --git a/sycl/test-e2e/Matrix/joint_matrix_all_sizes.cpp b/sycl/test-e2e/Matrix/joint_matrix_all_sizes.cpp
index 4e5fd0f2c8824..408a6087206ea 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_all_sizes.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_all_sizes.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_apply_bf16.cpp b/sycl/test-e2e/Matrix/joint_matrix_apply_bf16.cpp
index 74e412ae20ad4..1a4dcd086094f 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_apply_bf16.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_apply_bf16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache.cpp b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache.cpp
index 8e7b454df64fe..c26062c5a16bb 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_init.cpp b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_init.cpp
index 67221e9c34147..7b86902953355 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_init.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_init.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix, gpu
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DINIT_LIST
+// RUN: %{build} -o %t.out -DINIT_LIST
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll.cpp b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll.cpp
index 64b7b8d3a122b..1896cea224dc8 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll.cpp
@@ -7,10 +7,10 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -mllvm -inline-threshold=2000 -o %t_gpu.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DMANUAL_UNROLL
+// RUN: %{build} -mllvm -inline-threshold=2000 -o %t_gpu.out -DMANUAL_UNROLL
 // RUN: %if gpu %{ %{run} %t_gpu.out %}
 
-// RUN: %{build} -mllvm -inline-threshold=2000 -o %t_cpu.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DMANUAL_UNROLL -DtM=16 -DtK=32 -DNCACHE1=32 -DKCACHE1=32
+// RUN: %{build} -mllvm -inline-threshold=2000 -o %t_cpu.out -DMANUAL_UNROLL -DtM=16 -DtK=32 -DNCACHE1=32 -DKCACHE1=32
 // RUN: %if cpu %{ %{run} %t_cpu.out %}
 
 // -mllvm -inline-threshold=2000 added as a workaround,
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll_init.cpp b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
index b97c58dec63d0..271810bb70f8f 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bf16_fill_k_cache_unroll_init.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix, gpu
 
-// RUN: %{build} -mllvm -inline-threshold=2000 -o %t_gpu.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -DINIT_LIST -DMANUAL_UNROLL
+// RUN: %{build} -mllvm -inline-threshold=2000 -o %t_gpu.out -DINIT_LIST -DMANUAL_UNROLL
 // RUN: %{run} %t_gpu.out
 
 // -mllvm -inline-threshold=2000 added as a workaround,
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bfloat16.cpp b/sycl/test-e2e/Matrix/joint_matrix_bfloat16.cpp
index ebd04bb5c466a..d1410ac68276e 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bfloat16.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bfloat16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_16x16x16.cpp b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_16x16x16.cpp
index 55a3f50e44169..f367b9f725c72 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_16x16x16.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_16x16x16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x16.cpp b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x16.cpp
index 733b0c0e8bf48..8f179eb95b10e 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x16.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x16.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: cpu
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x32.cpp b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x32.cpp
index c18a838210511..a7746ab9cca17 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x32.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_32x64x32.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: *
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_array.cpp b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_array.cpp
index 1825c650bb887..80e1f310ce440 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_array.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_array.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_colmajorA_colmajorB.cpp b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
index 6d770ffc02ff0..9cd31a8c5178e 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_colmajorA_colmajorB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This tests support of col major layout for matrix B which does transpose and
diff --git a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
index dff3aa504ed15..f118a4a981a45 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_bfloat16_rowmajorA_rowmajorB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This tests support of row major layout for matrix B which does automatic VNNI
diff --git a/sycl/test-e2e/Matrix/joint_matrix_colA_rowB_colC.cpp b/sycl/test-e2e/Matrix/joint_matrix_colA_rowB_colC.cpp
index 7a6474f744103..7d114175dff13 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_colA_rowB_colC.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_colA_rowB_colC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/joint_matrix_down_convert.cpp b/sycl/test-e2e/Matrix/joint_matrix_down_convert.cpp
index d65161cbb6815..caea640677aa7 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_down_convert.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_down_convert.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_half.cpp b/sycl/test-e2e/Matrix/joint_matrix_half.cpp
index 1752192e66957..ac09361a0799c 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_half.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_half.cpp
@@ -9,7 +9,7 @@
 // REQUIRES: matrix
 // REQUIRES: matrix-fp16
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_int8_colmajorA_colmajorB.cpp b/sycl/test-e2e/Matrix/joint_matrix_int8_colmajorA_colmajorB.cpp
index 046f0fa03303e..33c00022a5a76 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_int8_colmajorA_colmajorB.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_int8_colmajorA_colmajorB.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // This tests support of col major layout for matrix B which does transpose and
diff --git a/sycl/test-e2e/Matrix/joint_matrix_int8_vnni.cpp b/sycl/test-e2e/Matrix/joint_matrix_int8_vnni.cpp
index 93e8e3564fe41..02813c6720deb 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_int8_vnni.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_int8_vnni.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_out_bounds.cpp b/sycl/test-e2e/Matrix/joint_matrix_out_bounds.cpp
index 843e4b5f653bc..854d3ccc85dce 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_out_bounds.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_out_bounds.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/joint_matrix_query_default.cpp b/sycl/test-e2e/Matrix/joint_matrix_query_default.cpp
index 760b9961050b0..fec56b9d36c67 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_query_default.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_query_default.cpp
@@ -9,7 +9,7 @@
 // REQUIRES: cpu
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <iostream>
diff --git a/sycl/test-e2e/Matrix/joint_matrix_ss_int8.cpp b/sycl/test-e2e/Matrix/joint_matrix_ss_int8.cpp
index 993d0544c99b1..e487b8cdcb41d 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_ss_int8.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_ss_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_su_int8.cpp b/sycl/test-e2e/Matrix/joint_matrix_su_int8.cpp
index 71d902b46cdf3..72910c4ed5446 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_su_int8.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_su_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_tf32.cpp b/sycl/test-e2e/Matrix/joint_matrix_tf32.cpp
index 0c80e90c7dba4..1c7ee5d1d6656 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_tf32.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_tf32.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix-tf32
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:cpu
diff --git a/sycl/test-e2e/Matrix/joint_matrix_transposeC.cpp b/sycl/test-e2e/Matrix/joint_matrix_transposeC.cpp
index 762f49c9639e4..a47d7987fc899 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_transposeC.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_transposeC.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL: gpu
diff --git a/sycl/test-e2e/Matrix/joint_matrix_unaligned_k.cpp b/sycl/test-e2e/Matrix/joint_matrix_unaligned_k.cpp
index 45f31103049cb..212ac34a3a640 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_unaligned_k.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_unaligned_k.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 // XFAIL:*
diff --git a/sycl/test-e2e/Matrix/joint_matrix_us_int8.cpp b/sycl/test-e2e/Matrix/joint_matrix_us_int8.cpp
index 3a1005c0af4cb..409b589904847 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_us_int8.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_us_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/joint_matrix_uu_int8.cpp b/sycl/test-e2e/Matrix/joint_matrix_uu_int8.cpp
index e65117b031240..59a47484a335c 100644
--- a/sycl/test-e2e/Matrix/joint_matrix_uu_int8.cpp
+++ b/sycl/test-e2e/Matrix/joint_matrix_uu_int8.cpp
@@ -7,7 +7,7 @@
 //===----------------------------------------------------------------------===//
 // REQUIRES: matrix
 
-// RUN: %{build} -o %t.out -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include "common.hpp"
diff --git a/sycl/test-e2e/Matrix/runtime_query_pvc.cpp b/sycl/test-e2e/Matrix/runtime_query_pvc.cpp
index 61ff2a19ed0b4..66a49b9f28aa7 100644
--- a/sycl/test-e2e/Matrix/runtime_query_pvc.cpp
+++ b/sycl/test-e2e/Matrix/runtime_query_pvc.cpp
@@ -1,5 +1,5 @@
 // REQUIRES: gpu-intel-pvc
-// RUN: %{build} -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -o %t.out
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <sycl/sycl.hpp>
diff --git a/sycl/test-e2e/Matrix/runtime_query_spr.cpp b/sycl/test-e2e/Matrix/runtime_query_spr.cpp
index 6806003b59ccc..fc0fd075b230c 100644
--- a/sycl/test-e2e/Matrix/runtime_query_spr.cpp
+++ b/sycl/test-e2e/Matrix/runtime_query_spr.cpp
@@ -1,5 +1,5 @@
 // REQUIRES: cpu
-// RUN: %{build} -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -o %t.out
+// RUN: %{build} -o %t.out
 // RUN: %{run} %t.out
 
 #include <sycl/sycl.hpp>
diff --git a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bfloat16-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bfloat16-test.cpp
index 309786a38003f..9f99cb6ea9457 100644
--- a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bfloat16-test.cpp
+++ b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-bfloat16-test.cpp
@@ -1,6 +1,6 @@
 // REQUIRES: cuda
 
-// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
+// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
 
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-double-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-double-test.cpp
index 16603407d74b1..f4a79d2756937 100644
--- a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-double-test.cpp
+++ b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-double-test.cpp
@@ -1,6 +1,6 @@
 // REQUIRES: cuda
 
-// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
+// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
 
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-float-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-float-test.cpp
index 47ddc0fb42f48..cb5b3da54b794 100644
--- a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-float-test.cpp
+++ b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-float-test.cpp
@@ -1,6 +1,6 @@
 // REQUIRES: cuda
 
-// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
+// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
 
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-half-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-half-test.cpp
index 0468f592b6427..feea65a79848b 100644
--- a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-half-test.cpp
+++ b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-half-half-test.cpp
@@ -1,6 +1,6 @@
 // REQUIRES: cuda
 
-// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
+// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_70 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
 
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-int8-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-int8-test.cpp
index 858c8625cc6e9..492313dbaf71d 100644
--- a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-int8-test.cpp
+++ b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-int8-test.cpp
@@ -1,6 +1,6 @@
 // REQUIRES: cuda
 
-// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_72 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
+// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_72 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
 
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-tf32-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-tf32-test.cpp
index e2ae423f04c7d..e9200d930de46 100644
--- a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-tf32-test.cpp
+++ b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-tf32-test.cpp
@@ -1,6 +1,6 @@
 // REQUIRES: cuda
 
-// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
+// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_80 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
 
 // IMPORTANT: before updating sm version support beyond sm_90 read the following
 // NOTE!
diff --git a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-uint8-test.cpp b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-uint8-test.cpp
index c6a1bda15cdcb..67d0dd5ea4728 100644
--- a/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-uint8-test.cpp
+++ b/sycl/test/check_device_code/cuda/matrix/matrix-nvptx-uint8-test.cpp
@@ -1,6 +1,6 @@
 // REQUIRES: cuda
 
-// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_72 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
+// RUN: %clangxx -fsycl-device-only -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_72 -S -Xclang -emit-llvm %s -o -| FileCheck %s --check-prefixes=CHECK-OPAQUE
 
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/check_device_code/matrix/matrix_load_store_as.cpp b/sycl/test/check_device_code/matrix/matrix_load_store_as.cpp
index 22c8203444ab4..36dbf89dc2661 100644
--- a/sycl/test/check_device_code/matrix/matrix_load_store_as.cpp
+++ b/sycl/test/check_device_code/matrix/matrix_load_store_as.cpp
@@ -4,7 +4,6 @@
 // CHECK-NOT: alloca target("spirv.JointMatrixINTEL"
 
 // check that correct address spaces are used to load from and store to
-#define SYCL_EXT_ONEAPI_MATRIX_VERSION 4
 #include <sycl/sycl.hpp>
 
 using namespace sycl;
diff --git a/sycl/test/check_device_code/matrix/matrix_load_store_as_legacy.cpp b/sycl/test/check_device_code/matrix/matrix_load_store_as_legacy.cpp
deleted file mode 100644
index bb18a21bc1002..0000000000000
--- a/sycl/test/check_device_code/matrix/matrix_load_store_as_legacy.cpp
+++ /dev/null
@@ -1,62 +0,0 @@
-// RUN: %clangxx -fsycl-device-only -S -emit-llvm -o - %s | FileCheck %s
-
-// Check that SROA and mem2reg won't leave alloca of matrix type in IR
-// CHECK-NOT: alloca target("spirv.JointMatrixINTEL"
-
-// check that correct address spaces are used to load from and store to
-#define SYCL_EXT_ONEAPI_MATRIX_VERSION 1
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-int main(void) {
-  queue q;
-  unsigned short *A = malloc_shared<unsigned short>(8 * 16, q);
-  unsigned short *B = malloc_shared<unsigned short>(16 * 16, q);
-  float *C = malloc_shared<float>(8 * 16, q);
-
-  auto pA = multi_ptr<unsigned short, access::address_space::global_space>(A);
-  auto pB = multi_ptr<unsigned short, access::address_space::global_space>(B);
-  auto pC = multi_ptr<float, access::address_space::global_space>(C);
-
-  q.submit([&](handler &h) {
-    local_accessor<unsigned short, 2> tileA{{8, 16}, h};
-
-    h.parallel_for(
-        nd_range<2>({1, 16}, {1, 16}),
-        [=](nd_item<2> it) [[intel::reqd_sub_group_size(16)]] {
-          sub_group sg = it.get_sub_group();
-
-          joint_matrix<unsigned short, 8, 16> tA(sg);
-          joint_matrix<unsigned short, 16, 16, matrix_layout::packed_b> tB(sg);
-          joint_matrix<float, 8, 16> tC(sg);
-
-          vec<unsigned short, 8> slmvec = sg.load<8>(pA);
-          sg.store<8>(
-              tileA.template get_multi_ptr<sycl::access::decorated::yes>(),
-              slmvec);
-          it.barrier(access::fence_space::local_space);
-
-          // A should load from local address space
-          // CHECK: %{{.*}} = tail call spir_func noundef target("spirv.JointMatrixINTEL", i16, 8, 16, 0, 3) @_Z[[#]]__spirv_JointMatrixLoadINTEL{{.*}}(ptr addrspace(3) noundef %{{.*}}, i64 noundef 16, i32 noundef 0, i32 noundef 3, i32 noundef 0) #{{.*}}
-          joint_matrix_load(
-              sg, tA,
-              tileA.template get_multi_ptr<sycl::access::decorated::yes>(), 16,
-              matrix_layout::row_major);
-          // B should load from global address space
-          // CHECK: %{{.*}} = tail call spir_func noundef target("spirv.JointMatrixINTEL", i16, 16, 16, 3, 3) @_Z[[#]]__spirv_JointMatrixLoadINTEL{{.*}}(ptr addrspace(1) noundef %{{.*}}, i64 noundef 32, i32 noundef [[#]], i32 noundef 3, i32 noundef 0) #{{.*}}
-          joint_matrix_load(sg, tB, pB, 32, matrix_layout::packed_b);
-          tC = joint_matrix_mad(sg, tA, tB, tC);
-          // C should store to global address space
-          // CHECK: tail call spir_func void @_Z[[#]]__spirv_JointMatrixStoreINTEL{{.*}}(ptr addrspace(1) noundef %{{.*}}, target("spirv.JointMatrixINTEL", float, 8, 16, 0, 3) noundef %{{.*}}, i64 noundef 16, i32 noundef 0, i32 noundef 3, i32 noundef 0) #{{.*}}
-          joint_matrix_store(sg, tC, pC, 16, matrix_layout::row_major);
-        });
-  });
-
-  free(A, q);
-  free(B, q);
-  free(C, q);
-
-  return 0;
-}
diff --git a/sycl/test/matrix/compile-query.cpp b/sycl/test/matrix/compile-query.cpp
index e110eef22a385..85dfefb6a2916 100644
--- a/sycl/test/matrix/compile-query.cpp
+++ b/sycl/test/matrix/compile-query.cpp
@@ -1,4 +1,4 @@
-// RUN: %clangxx -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -fsycl -o compile-query %s
+// RUN: %clangxx -fsycl -o compile-query %s
 #include <iostream>
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/matrix/legacy/matrix-bf16-test-SG-16.cpp b/sycl/test/matrix/legacy/matrix-bf16-test-SG-16.cpp
deleted file mode 100644
index 391a9be2197c6..0000000000000
--- a/sycl/test/matrix/legacy/matrix-bf16-test-SG-16.cpp
+++ /dev/null
@@ -1,191 +0,0 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1 %s -o %t.out
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define TILE_SZ 16
-#define TM (TILE_SZ - 1)
-#define TN (TILE_SZ - 1)
-#define TK (2 * TILE_SZ - 2)
-
-#define SG_SZ 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M, N
-  // stride should be X's cols, e.g., B's stirde = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 2);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<unsigned short, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-             [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup these functions will be called once by the subgroup no
-           // code divergence between the workitems
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<unsigned short, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<unsigned short, TK, TN, matrix_layout::packed_b> sub_b(
-               sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) { //
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<access::decorated::no>() +
-                     (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-unsigned short A[MATRIX_M][MATRIX_K];
-unsigned short B[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-float make_fp32(short x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-unsigned short make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (unsigned short)*res;
-}
-
-void matrix_multiply_ref(int *A_mem, int *B_mem, int *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        short *va = (short *)(A_mem + m * K + k);
-        short *vb = (short *)(B_mem + k * N + n);
-        float acc = *((float *)(C_mem + m * N + n));
-        // FIXME: Should we do reduce-add in another version?
-        for (int i = 0; i < 2; i++) {
-          acc += (make_fp32(va[i]) * make_fp32(vb[i]));
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = make_bf16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = make_bf16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<unsigned short, MATRIX_M, MATRIX_K> MA((unsigned short *)&A);
-  big_matrix<unsigned short, MATRIX_K / 2, MATRIX_N * 2> MB(
-      (unsigned short *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 2);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << C[i][j] << ", ";
-    std::cout << "\n";
-  }
-  std::cout << std::endl;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << D[i][j] << ", ";
-    std::cout << "\n";
-  }
-}
diff --git a/sycl/test/matrix/legacy/matrix-bf16-test.cpp b/sycl/test/matrix/legacy/matrix-bf16-test.cpp
deleted file mode 100644
index 6c6bfc1066f01..0000000000000
--- a/sycl/test/matrix/legacy/matrix-bf16-test.cpp
+++ /dev/null
@@ -1,190 +0,0 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1 %s -o %t.out
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define TILE_SZ 16
-#define TM (TILE_SZ - 1)
-#define TN (TILE_SZ - 1)
-#define TK (2 * TILE_SZ - 2)
-
-#define SG_SZ 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M, N
-  // stride should be X's cols, e.g., B's stirde = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 2);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<unsigned short, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<unsigned short, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<float, 2> bufC((float *)C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup these functions will be called once by the subgroup no
-           // code divergence between the workitems
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<unsigned short, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<unsigned short, TK, TN, matrix_layout::packed_b> sub_b(
-               sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) { //
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-unsigned short A[MATRIX_M][MATRIX_K];
-unsigned short B[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-float make_fp32(short x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-unsigned short make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (unsigned short)*res;
-}
-
-void matrix_multiply_ref(int *A_mem, int *B_mem, int *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        short *va = (short *)(A_mem + m * K + k);
-        short *vb = (short *)(B_mem + k * N + n);
-        float acc = *((float *)(C_mem + m * N + n));
-        // FIXME: Should we do reduce-add in another version?
-        for (int i = 0; i < 2; i++) {
-          acc += (make_fp32(va[i]) * make_fp32(vb[i]));
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = make_bf16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = make_bf16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<unsigned short, MATRIX_M, MATRIX_K> MA((unsigned short *)&A);
-  big_matrix<unsigned short, MATRIX_K / 2, MATRIX_N * 2> MB(
-      (unsigned short *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 2);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << C[i][j] << ", ";
-    std::cout << "\n";
-  }
-  std::cout << std::endl;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << D[i][j] << ", ";
-    std::cout << "\n";
-  }
-}
diff --git a/sycl/test/matrix/legacy/matrix-bfloat16-test.cpp b/sycl/test/matrix/legacy/matrix-bfloat16-test.cpp
deleted file mode 100644
index 022e69f9b75a2..0000000000000
--- a/sycl/test/matrix/legacy/matrix-bfloat16-test.cpp
+++ /dev/null
@@ -1,194 +0,0 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1 %s -o %t.out
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl::ext::oneapi::experimental::matrix;
-using bfloat16 = sycl::ext::oneapi::bfloat16;
-
-static constexpr auto TILE_SZ = 16;
-static constexpr auto TM = TILE_SZ - 1;
-static constexpr auto TN = TILE_SZ - 1;
-static constexpr auto TK = 2 * TILE_SZ - 2;
-
-static constexpr auto SG_SZ = 16;
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M, N
-  // stride should be X's cols, e.g., B's stirde = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 2);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  sycl::buffer<bfloat16, 2> bufA(A.get_data(), sycl::range<2>(M, K));
-  sycl::buffer<bfloat16, 2> bufB(B.get_data(), sycl::range<2>(K, N));
-  sycl::buffer<float, 2> bufC((float *)C.get_data(), sycl::range<2>(M, N));
-
-  sycl::queue q;
-  q.submit([&](sycl::handler &cgh) {
-     auto accC = bufC.get_access<sycl::access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<sycl::access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<sycl::access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         sycl::nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](sycl::nd_item<2> spmd_item)
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup these functions will be called once by the subgroup no
-           // code divergence between the workitems
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<bfloat16, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<bfloat16, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<float, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) { //
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (k * TK / 2) * (N * 2) + sg_starty / SG_SZ * TN * 2,
-                 N * 2, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-bfloat16 A[MATRIX_M][MATRIX_K];
-bfloat16 B[MATRIX_K / 2][MATRIX_N * 2];
-unsigned short Aref[MATRIX_M][MATRIX_K];
-unsigned short Bref[MATRIX_K / 2][MATRIX_N * 2];
-float C[MATRIX_M][MATRIX_N];
-float D[MATRIX_M][MATRIX_N];
-
-float make_fp32(short x) {
-  unsigned int y = x;
-  y = y << 16;
-  float *res = reinterpret_cast<float *>(&y);
-  return *res;
-}
-
-unsigned short make_bf16(float x) {
-  int *res = reinterpret_cast<int *>(&x);
-  *res = *res >> 16;
-  return (unsigned short)*res;
-}
-
-void matrix_multiply_ref(int *A_mem, int *B_mem, int *C_mem, int M, int N,
-                         int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        short *va = (short *)(A_mem + m * K + k);
-        short *vb = (short *)(B_mem + k * N + n);
-        float acc = *((float *)(C_mem + m * N + n));
-        // FIXME: Should we do reduce-add in another version?
-        for (int i = 0; i < 2; i++) {
-          acc += (make_fp32(va[i]) * make_fp32(vb[i]));
-        }
-        *((float *)(C_mem + m * N + n)) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      // We create bfloat16 from unsigned short since float-to-bfloat16
-      // conversion is not allowed.
-      A[i][j] = make_bf16(1.0f * (i + j));
-      Aref[i][j] = make_bf16(1.0f * (i + j));
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 2; i++) {
-    for (int j = 0; j < MATRIX_N * 2; j++) {
-      B[i][j] = make_bf16(2.0f * i + 3.0f * j);
-      Bref[i][j] = make_bf16(2.0f * i + 3.0f * j);
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1.0;
-      D[i][j] = 1.0;
-    }
-  }
-
-  big_matrix<float, MATRIX_M, MATRIX_N> MC((float *)&C);
-  big_matrix<float, MATRIX_M, MATRIX_N> MD((float *)&D);
-  big_matrix<bfloat16, MATRIX_M, MATRIX_K> MA((bfloat16 *)&A);
-  big_matrix<bfloat16, MATRIX_K / 2, MATRIX_N * 2> MB((bfloat16 *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)Aref, (int32_t *)Bref, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 2);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << C[i][j] << ", ";
-    std::cout << "\n";
-  }
-  std::cout << std::endl;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << D[i][j] << ", ";
-    std::cout << "\n";
-  }
-}
diff --git a/sycl/test/matrix/legacy/matrix-elemwise-ops.cpp b/sycl/test/matrix/legacy/matrix-elemwise-ops.cpp
deleted file mode 100644
index feddb05148c4e..0000000000000
--- a/sycl/test/matrix/legacy/matrix-elemwise-ops.cpp
+++ /dev/null
@@ -1,180 +0,0 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1 %s -o %t.out
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define TILE_SZ 16
-#define TM (TILE_SZ - 4)
-#define TN (TILE_SZ - 4)
-#define TK (4 * TILE_SZ - 16)
-
-#define SG_SZ 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M x N
-  // stride should be X's cols, e.g., B's stride = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup: these functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           auto wi_data_c = sub_c.get_wi_data();
-           for (int i = 0; i < wi_data_c.length(); i++) {
-             wi_data_c[i] *= 2;
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-int8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        char *va = (char *)(A_mem + m * K + k);
-        char *vb = (char *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (va[i] * vb[i]);
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-      *(C_mem + m * N + n) *= 2;
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1;
-      D[i][j] = 1;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << C[i][j] << ", ";
-    std::cout << "\n";
-  }
-  std::cout << std::endl;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << D[i][j] << ", ";
-    std::cout << "\n";
-  }
-}
diff --git a/sycl/test/matrix/legacy/matrix-int8-test-SG-16.cpp b/sycl/test/matrix/legacy/matrix-int8-test-SG-16.cpp
deleted file mode 100644
index 335529ad3120a..0000000000000
--- a/sycl/test/matrix/legacy/matrix-int8-test-SG-16.cpp
+++ /dev/null
@@ -1,175 +0,0 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1 %s -o %t.out
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define TILE_SZ 16
-#define TM (TILE_SZ - 5)
-#define TN (TILE_SZ - 6)
-#define TK (4 * TILE_SZ - 8)
-
-#define SG_SZ 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M x N
-  // stride should be X's cols, e.g., B's stride = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-             [[intel::reqd_sub_group_size(SG_SZ)]]
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup: these functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_load(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-int8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        char *va = (char *)(A_mem + m * K + k);
-        char *vb = (char *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (va[i] * vb[i]);
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 1;
-      D[i][j] = 1;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << C[i][j] << ", ";
-    std::cout << "\n";
-  }
-  std::cout << std::endl;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << D[i][j] << ", ";
-    std::cout << "\n";
-  }
-}
diff --git a/sycl/test/matrix/legacy/matrix-int8-test.cpp b/sycl/test/matrix/legacy/matrix-int8-test.cpp
deleted file mode 100644
index 77c57b4ef711e..0000000000000
--- a/sycl/test/matrix/legacy/matrix-int8-test.cpp
+++ /dev/null
@@ -1,175 +0,0 @@
-// RUN: %clangxx -fsycl -fsycl-device-only -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1 -S -emit-llvm -o - %s | FileCheck %s
-
-// CHECK-DAG: target("spirv.JointMatrixINTEL", i8, 12, 48, 0, 3)
-// CHECK-DAG: target("spirv.JointMatrixINTEL", i32, 12, 12, 0, 3)
-// CHECK-DAG: target("spirv.JointMatrixINTEL", i8, 48, 12, 3, 3)
-
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-#define TILE_SZ 16
-#define TM (TILE_SZ - 4)
-#define TN (TILE_SZ - 4)
-#define TK (4 * TILE_SZ - 16)
-
-#define SG_SZ 16
-
-template <typename T, size_t NUM_ROWS, size_t NUM_COLS> struct big_matrix {
-public:
-  T *mat;
-
-public:
-  T *get_data() { return mat; }
-  void set_data(T *data) { mat = data; }
-  big_matrix(T *data) : mat(data) {}
-};
-
-template <typename T1, typename T2, size_t NUM_ROWS_A, size_t NUM_COLS_A,
-          size_t NUM_ROWS_B, size_t NUM_COLS_B, size_t NUM_ROWS_C,
-          size_t NUM_COLS_C>
-void matrix_multiply(big_matrix<T1, NUM_ROWS_C, NUM_COLS_C> &C,
-                     big_matrix<T2, NUM_ROWS_A, NUM_COLS_A> &A,
-                     big_matrix<T2, NUM_ROWS_B, NUM_COLS_B> &B) {
-  size_t M = NUM_ROWS_C;
-  size_t N = NUM_COLS_C;
-  size_t K = NUM_COLS_A;
-  // B => K/4 x N*4, A => M x K, C => M x N
-  // stride should be X's cols, e.g., B's stride = N*4
-  assert(NUM_ROWS_C == NUM_ROWS_A && NUM_COLS_A == NUM_ROWS_B * 4);
-  size_t NDRangeM = M / TM;
-  size_t NDRangeN = N / TN;
-  buffer<int8_t, 2> bufA(A.get_data(), range<2>(M, K));
-  buffer<int8_t, 2> bufB(B.get_data(), range<2>(K, N));
-  buffer<int32_t, 2> bufC(C.get_data(), range<2>(M, N));
-
-  queue q;
-  q.submit([&](handler &cgh) {
-     auto accC = bufC.get_access<access::mode::read_write>(cgh);
-     auto accA = bufA.get_access<access::mode::read_write>(cgh);
-     auto accB = bufB.get_access<access::mode::read_write>(cgh);
-
-     cgh.parallel_for<class imatrix>(
-         nd_range<2>({NDRangeM, NDRangeN * SG_SZ}, {1, 1 * SG_SZ}),
-         [accA, accB, accC, M, N, K](nd_item<2> spmd_item)
-
-         {
-           // The submatrix API has to be accessed by all the workitems in a
-           // subgroup: these functions are called once per subgroup, with no
-           // code divergence between the workitems.
-           const auto global_idx = spmd_item.get_global_id(0);
-           const auto global_idy = spmd_item.get_global_id(1);
-           const auto sg_startx = global_idx - spmd_item.get_local_id(0);
-           const auto sg_starty = global_idy - spmd_item.get_local_id(1);
-
-           sycl::sub_group sg = spmd_item.get_sub_group();
-           joint_matrix<int8_t, TM, TK> sub_a(sg);
-           // For B, since current implementation does not support non-packed
-           // layout, users need to specify the updated VNNI sizes along with
-           // the packed_b layout. By default, the layout is row_major and size
-           // is (TK, TN).
-           joint_matrix<int8_t, TK, TN, matrix_layout::packed_b> sub_b(sg);
-           joint_matrix<int32_t, TM, TN> sub_c(sg);
-
-           // AMX: 8 register tiles : 1k byte size, SMmaxxSKmax =16x64
-           // strideX = X's cols, so strideC = N, strideA = K, strideB = N*4
-           joint_matrix_fill(sg, sub_c, 0);
-           for (int k = 0; k < K / TK; k += 1) {
-             joint_matrix_load(
-                 sg, sub_a,
-                 accA.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (sg_startx * TM) * K + k * TK,
-                 K, matrix_layout::row_major);
-             // Assuming B data is already in VNNI format.
-             joint_matrix_load(
-                 sg, sub_b,
-                 accB.template get_multi_ptr<sycl::access::decorated::no>() +
-                     (k * TK / 4) * (N * 4) + sg_starty / SG_SZ * TN * 4,
-                 N * 4, matrix_layout::packed_b);
-             sub_c = joint_matrix_mad(sg, sub_a, sub_b, sub_c);
-           }
-           joint_matrix_store(
-               sg, sub_c,
-               accC.template get_multi_ptr<sycl::access::decorated::no>() +
-                   (sg_startx * TM) * N + sg_starty / SG_SZ * TN,
-               N, matrix_layout::row_major);
-         }); // parallel for
-   }).wait();
-}
-
-static constexpr size_t MATRIX_M = TM * 2;
-static constexpr size_t MATRIX_N = TN * 2;
-static constexpr size_t MATRIX_K = TK * 2;
-int8_t A[MATRIX_M][MATRIX_K];
-int8_t B[MATRIX_K / 4][MATRIX_N * 4];
-int32_t C[MATRIX_M][MATRIX_N];
-int32_t D[MATRIX_M][MATRIX_N];
-
-void matrix_multiply_ref(int32_t *A_mem, int32_t *B_mem, int32_t *C_mem, int M,
-                         int N, int K) {
-  // tiling
-  for (int m = 0; m < M; m++)
-    for (int n = 0; n < N; n++) {
-      for (int k = 0; k < K; k++) {
-        char *va = (char *)(A_mem + m * K + k);
-        char *vb = (char *)(B_mem + k * N + n);
-        int acc = *(C_mem + m * N + n);
-        for (int i = 0; i < 4; i++) {
-          acc += (va[i] * vb[i]);
-        }
-        *(C_mem + m * N + n) = acc;
-      }
-    }
-}
-
-int main() {
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_K; j++) {
-      A[i][j] = i + 2 * j;
-    }
-  }
-  for (int i = 0; i < MATRIX_K / 4; i++) {
-    for (int j = 0; j < MATRIX_N * 4; j++) {
-      B[i][j] = i + j;
-    }
-  }
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      C[i][j] = 0;
-      D[i][j] = 0;
-    }
-  }
-
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MC((int32_t *)&C);
-  big_matrix<int32_t, MATRIX_M, MATRIX_N> MD((int32_t *)&D);
-  big_matrix<int8_t, MATRIX_M, MATRIX_K> MA((int8_t *)&A);
-  big_matrix<int8_t, MATRIX_K / 4, MATRIX_N * 4> MB((int8_t *)&B);
-  matrix_multiply(MC, MA, MB);
-  matrix_multiply_ref((int32_t *)A, (int32_t *)B, (int32_t *)D, MATRIX_M,
-                      MATRIX_N, MATRIX_K / 4);
-
-  bool res = true;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++) {
-      if (C[i][j] != D[i][j])
-        res = false;
-    }
-  }
-  if (res)
-    std::cout << "passed\n";
-  else
-    std::cout << "failed\n";
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << C[i][j] << ", ";
-    std::cout << "\n";
-  }
-  std::cout << std::endl;
-  for (int i = 0; i < MATRIX_M; i++) {
-    for (int j = 0; j < MATRIX_N; j++)
-      std::cout << D[i][j] << ", ";
-    std::cout << "\n";
-  }
-}
diff --git a/sycl/test/matrix/matrix-bfloat16-test-coord-basicB.cpp b/sycl/test/matrix/matrix-bfloat16-test-coord-basicB.cpp
index ee6d37654184e..02cfbc0f8b904 100644
--- a/sycl/test/matrix/matrix-bfloat16-test-coord-basicB.cpp
+++ b/sycl/test/matrix/matrix-bfloat16-test-coord-basicB.cpp
@@ -1,4 +1,4 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 %s -o %t.out
+// RUN: %clangxx -fsycl -O2 %s -o %t.out
 
 // Kernel B sum by col
 #include <cmath>
diff --git a/sycl/test/matrix/matrix-bfloat16-test.cpp b/sycl/test/matrix/matrix-bfloat16-test.cpp
index 37dc5a1607631..da35ac4ef150c 100644
--- a/sycl/test/matrix/matrix-bfloat16-test.cpp
+++ b/sycl/test/matrix/matrix-bfloat16-test.cpp
@@ -1,4 +1,4 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 %s -o %t.out
+// RUN: %clangxx -fsycl -O2 %s -o %t.out
 #include <iostream>
 #include <sycl/sycl.hpp>
 
diff --git a/sycl/test/matrix/matrix-check-types-in-attributes.cpp b/sycl/test/matrix/matrix-check-types-in-attributes.cpp
index 9b31885f79304..f7a0223adb24d 100644
--- a/sycl/test/matrix/matrix-check-types-in-attributes.cpp
+++ b/sycl/test/matrix/matrix-check-types-in-attributes.cpp
@@ -1,4 +1,4 @@
-// RUN: %clangxx -fsycl -fsycl-device-only -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -O2 -S -emit-llvm -o - %s | FileCheck %s
+// RUN: %clangxx -fsycl -fsycl-device-only -O2 -S -emit-llvm -o - %s | FileCheck %s
 
 // This test checks the correctness of matrix types converted into strings
 
diff --git a/sycl/test/matrix/matrix-elemwise-ops.cpp b/sycl/test/matrix/matrix-elemwise-ops.cpp
index 9621f570cf461..fc75cab241287 100644
--- a/sycl/test/matrix/matrix-elemwise-ops.cpp
+++ b/sycl/test/matrix/matrix-elemwise-ops.cpp
@@ -1,4 +1,4 @@
-// RUN: %clangxx -fsycl -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -O2 %s -o %t.out
+// RUN: %clangxx -fsycl -O2 %s -o %t.out
 
 #include <iostream>
 #include <sycl/sycl.hpp>
diff --git a/sycl/test/matrix/matrix-int8-test.cpp b/sycl/test/matrix/matrix-int8-test.cpp
index b13cb23ae73b0..41a8f78303fd3 100644
--- a/sycl/test/matrix/matrix-int8-test.cpp
+++ b/sycl/test/matrix/matrix-int8-test.cpp
@@ -1,4 +1,4 @@
-// RUN: %clangxx -fsycl -fsycl-device-only -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 -O2 -S -emit-llvm -o - %s | FileCheck %s
+// RUN: %clangxx -fsycl -fsycl-device-only -O2 -S -emit-llvm -o - %s | FileCheck %s
 
 // CHECK-DAG: target("spirv.JointMatrixINTEL", i8, 12, 48, 0, 3, 0)
 // CHECK-DAG: target("spirv.JointMatrixINTEL", i32, 12, 12, 3, 3, 2)
diff --git a/sycl/test/matrix/matrix-tf32-test.cpp b/sycl/test/matrix/matrix-tf32-test.cpp
index 496af7dabd335..fb2e7ab66492d 100644
--- a/sycl/test/matrix/matrix-tf32-test.cpp
+++ b/sycl/test/matrix/matrix-tf32-test.cpp
@@ -1,4 +1,4 @@
-// RUN: %clangxx -fsycl -O2 -DSYCL_EXT_ONEAPI_MATRIX_VERSION=4 %s -o %t.out
+// RUN: %clangxx -fsycl -O2 %s -o %t.out
 
 #include <iostream>
 #include <sycl/sycl.hpp>
diff --git a/sycl/test/matrix/query.cpp b/sycl/test/matrix/query.cpp
deleted file mode 100644
index b94a481e01a7a..0000000000000
--- a/sycl/test/matrix/query.cpp
+++ /dev/null
@@ -1,143 +0,0 @@
-// RUN: %clangxx -fsycl -DSYCL_EXT_ONEAPI_MATRIX_VERSION=1 -o query %s
-#include <iostream>
-#include <sycl/sycl.hpp>
-
-using namespace sycl;
-using namespace sycl::ext::oneapi::experimental::matrix;
-
-void query_amx() {
-
-  // generates combination assert
-  // using myparams = tpu_params<tpu::amx, int, int, int, 2, 8, 32>;
-
-  // generates types assert
-  // using myparams2 = tpu_params<tpu::amx, int, int, int>;
-
-  // tells whether a combination is valid or not, if valid, those will be set as
-  // default
-  using myparams = tpu_params<tpu::amx, int8_t, int8_t, int, 2, 8, 32>;
-
-  size_t dmsize = myparams::defaultM;
-  size_t dnsize = myparams::defaultN;
-  size_t dksize = myparams::defaultK;
-  std::cout << "sizes of AMX tpu_params chosen by the user are: M " << dmsize
-            << " N " << dnsize << " K " << dksize << std::endl;
-
-  // Sizes-only query: types are given, generate default sizes
-  using myparams2 = tpu_params<tpu::amx, int8_t, int8_t, int>;
-  myparams2 p;
-  dmsize = myparams2::defaultM;
-  dnsize = myparams2::defaultN;
-  dksize = myparams2::defaultK;
-  std::cout << "default AMX sizes tpu_params  are: M " << dmsize << " N "
-            << dnsize << " K " << dksize << "\n AMX int8 num combinations is "
-            << p.num_combinations << std::endl;
-
-  // general query: types are not given
-  tpu_params<tpu::amx> myparams3;
-
-  std::cout << "AMX query num combinations: " << myparams3.num_combinations
-            << std::endl;
-
-  if (myparams3.combinations[0].msize != 0) // this is a max params hardware
-    return;
-  constexpr int msize = myparams3.combinations[0].max_msize;
-  constexpr int nsize = myparams3.combinations[0].max_nsize;
-  constexpr int ksize = myparams3.combinations[0].max_ksize;
-  std::cout << "AMX query sizes are: M " << msize << " N " << nsize << " K "
-            << ksize << std::endl;
-
-  size_t NDRangeM = 1024 / msize;
-  size_t NDRangeN = 1024 / nsize;
-  queue q;
-  q.submit([&](handler &cgh) {
-    cgh.parallel_for<class imatrix>(
-        nd_range<2>({NDRangeM, NDRangeN}, {1, 1}),
-        [msize, ksize, nsize](nd_item<2> spmd_item) {
-          sub_group sg = spmd_item.get_sub_group();
-          myparams2::joint_matrix_a<sub_group> sub_a1(sg);
-          myparams2::joint_matrix_b<sub_group> sub_b1(sg);
-          myparams2::joint_matrix_c<sub_group> sub_c1(sg);
-
-          joint_matrix<unsigned short, msize, ksize> sub_a(sg);
-          joint_matrix<unsigned short, ksize, nsize> sub_b(sg);
-          joint_matrix<float, msize, nsize> sub_c(sg);
-        });
-  });
-}
-
-void query_dpas() {
-
-  // generates combination assert
-  // using myparams = tpu_params<tpu::dpas, int, int, int, 2, 8, 32>;
-
-  // generate combination of type assert
-  // using myparams = tpu_params<tpu::dpas, int, int, int>;
-
-  // tells whether a combination is valid or not, if valid, those will be set as
-  // default
-  using myparams = tpu_params<tpu::dpas, int8_t, int8_t, int, 2, 8, 32>;
-
-  size_t dmsize = myparams::defaultM;
-  size_t dnsize = myparams::defaultN;
-  size_t dksize = myparams::defaultK;
-  std::cout << "sizes of DPAS tpu_params chosen by the user are: M " << dmsize
-            << " N " << dnsize << " K " << dksize << std::endl;
-
-  // sizes-only query: types are given, generate default sizes
-  using myparams2 = tpu_params<tpu::dpas, int8_t, int8_t, int>;
-  myparams2 p;
-  dmsize = myparams2::defaultM;
-  dnsize = myparams2::defaultN;
-  dksize = myparams2::defaultK;
-  std::cout << "Default DPAS sizes  are: M " << dmsize << " N " << dnsize
-            << " K " << dksize << "\n DPAS int8 num combinations is "
-            << p.num_combinations << std::endl;
-
-  dmsize = myparams2::combinations[0].msize;
-  dnsize = myparams2::combinations[0].nsize;
-  dksize = myparams2::combinations[0].ksize;
-  std::cout << "one of DPAS combination sizes  is: M " << dmsize << " N "
-            << dnsize << " K " << dksize << std::endl;
-
-  // general query: types are not given
-  tpu_params<tpu::dpas> myparams3;
-  std::cout << "DPAS query num combinations: " << myparams3.num_combinations
-            << std::endl;
-
-  if (myparams3.combinations[0].msize == 0) // this is not a max params hardware
-    return;
-  constexpr int msize = myparams3.combinations[0].msize;
-  constexpr int nsize = myparams3.combinations[0].nsize;
-  constexpr int ksize = myparams3.combinations[0].ksize;
-  std::cout << "DPAS query sizes are: M " << msize << " N " << nsize << " K "
-            << ksize << std::endl;
-  std::cout << "DPAS query max sizes are: M "
-            << myparams3.combinations[0].max_msize << " N "
-            << myparams3.combinations[0].max_nsize << " K "
-            << myparams3.combinations[0].max_ksize << std::endl;
-
-  size_t NDRangeM = 1024 / msize;
-  size_t NDRangeN = 1024 / nsize;
-  queue q;
-  q.submit([&](handler &cgh) {
-    cgh.parallel_for<class dmatrix>(
-        nd_range<2>({NDRangeM, NDRangeN}, {1, 1}),
-        [msize, ksize, nsize](nd_item<2> spmd_item) {
-          sub_group sg = spmd_item.get_sub_group();
-          myparams2::joint_matrix_a<sub_group> sub_a1(sg);
-          myparams2::joint_matrix_b<sub_group> sub_b1(sg);
-          myparams2::joint_matrix_c<sub_group> sub_c1(sg);
-
-          joint_matrix<unsigned short, msize, ksize> sub_a(sg);
-          joint_matrix<unsigned short, ksize, nsize> sub_b(sg);
-          joint_matrix<float, msize, nsize> sub_c(sg);
-        });
-  });
-}
-
-int main() {
-  query_amx();
-  query_dpas();
-  return 0;
-}