struct linalg::CoopMat<T, MemoryScope S, int M, int N, linalg.CoopMatMatrixUse R>¶
Conditionally conforms to: IArithmetic
Description¶
Represents a cooperative matrix for efficient warp/subgroup-level matrix operations on GPU hardware. CoopMat enables high-performance matrix multiply-accumulate operations by distributing matrix fragments across threads within a warp or subgroup. This type leverages specialized hardware instructions such as CUDA’s WMMA (Warp Matrix Multiply-Accumulate) or Vulkan cooperative matrix extensions.
Generic Parameters¶
T: ICoopElement¶
The element type of the matrix. Must be a built-in arithmetic type.
S : MemoryScope¶
The memory scope defining the cooperative group (e.g., device, workgroup, subgroup).
M : int¶
The number of rows in the matrix fragment.
N : int¶
The number of columns in the matrix fragment.
R : linalg.CoopMatMatrixUse¶
The matrix use specifier indicating whether this is a Matrix A, Matrix B, or accumulator matrix.
Methods¶
Conditional Conformances¶
Conformance to IArithmetic¶
linalg::CoopMat<T, MemoryScope S, int M, int N, linalg.CoopMatMatrixUse R> additionally conforms to IArithmetic when the following conditions are met:
Conformance to IArithmetic¶
linalg::CoopMat<T, MemoryScope S, int M, int N, linalg.CoopMatMatrixUse R> additionally conforms to IArithmetic when the following conditions are met:
Remarks¶
The dimensions M and N must match hardware-supported fragment shapes. For CUDA (mma.sync.m16n8k16), only the m16n16k16 shape is supported:
Shape m16n16k16: Matrix A (164294967235429496719116), Matrix B (164294967235429496719116), Accumulator (164294967235429496719116)
Matrix A dimensions are (m42949672354294967191k), Matrix B dimensions are (k42949672354294967191n), and Accumulator dimensions are (m42949672354294967191n). For CUDA m16n16k16:
Matrix A and B support: half, BFloat16, int8_t, uint8_t, FloatE4M3, FloatE5M2 (A and B must share the same element type)
Accumulator (Matrix C) supports an element type that matches the input family:
half inputs -> half or float accumulator/output
BFloat16 inputs -> float accumulator/output
int8_t / uint8_t -> int (s32) accumulator/output (with optional .satfinite)
FloatE4M3 / FloatE5M2 inputs -> half or float accumulator/output (SM 8.9+)
All matrices involved in a multiply-accumulate operation must use the same shape combination. The actual physical layout and distribution of elements across threads is hardware-specific.
When targeting Vulkan/SPIR-V, this type uses the SPV_KHR_cooperative_matrix extension (and optionally SPV_NV_cooperative_matrix2 for advanced features like transpose, reductions, and per-element operations). Valid shape combinations for Vulkan cooperative matrices (example device properties):
With float16 elements (A/B/C element types):
Shape m16n16k16: Matrix A (164294967235429496719116), Matrix B (164294967235429496719116), Accumulator (164294967235429496719116) - half/half/half
Shape m16n8k16: Matrix A (164294967235429496719116), Matrix B (16429496723542949671918), Accumulator (16429496723542949671918) - half/half/half
Shape m16n8k8: Matrix A (16429496723542949671918), Matrix B (8429496723542949671918), Accumulator (16429496723542949671918) - half/half/half
Shape m16n16k16: Matrix A (164294967235429496719116), Matrix B (164294967235429496719116), Accumulator (164294967235429496719116) - half/half/float
Shape m16n8k16: Matrix A (164294967235429496719116), Matrix B (16429496723542949671918), Accumulator (16429496723542949671918) - half/half/float
Shape m16n8k8: Matrix A (16429496723542949671918), Matrix B (8429496723542949671918), Accumulator (16429496723542949671918) - half/half/float
With 8-bit integer elements (A/B/C element types):
Shape m16n16k32: Matrix A (164294967235429496719132), Matrix B (324294967235429496719116), Accumulator (164294967235429496719116) - uint8/uint8/uint32
Shape m16n16k32: Matrix A (164294967235429496719132), Matrix B (324294967235429496719116), Accumulator (164294967235429496719116) - int8/int8/int32
Shape m16n8k32: Matrix A (164294967235429496719132), Matrix B (32429496723542949671918), Accumulator (16429496723542949671918) - uint8/uint8/uint32
Shape m16n8k32: Matrix A (164294967235429496719132), Matrix B (32429496723542949671918), Accumulator (16429496723542949671918) - int8/int8/int32
Note: Vulkan’s supported shapes are device-specific and can be queried at runtime using VkPhysicalDeviceCooperativeMatrixPropertiesKHR. The above list represents common configurations but may vary by GPU vendor and driver. The element distribution across threads in a subgroup may differ between CUDA and Vulkan implementations, so code using the subscript operator should only perform uniform operations for portability. If your code specifies a combination that is not supported by the device, the behavior is undefined.
Additionally, while only MemoryScope.Subgroup (warp-level cooperation) is supported on CUDA, MemoryScope.Workgroup can be used when targeting Vulkan, allowing cooperation among threads within the entire workgroup. Whenever Workgroup scope is supported, it is recommended to use it instead of Subgroup scope for simplicity and performance.
When using MemoryScope.Workgroup, Slang will emit SPIR-V code that uses the SPV_NV_cooperative_matrix2 extension. A workgroup-scope cooperative matrix can use larger matrix shapes that are multiples of 16/32 depending on the target device. Workgroup-scope cooperative matrices requires a specific workgroup size setting (specified via [numthreads]). Use the Vulkan API to query the supported combinations of element type, matrix shape and workgroup size settings.
Metal backend (simdgroup_matrix):
Only 8x8 matrix dimensions
Only half and float element types
Supported operations: fill, Load, Store, coopMatMulAdd
No per-element access (subscript, GetLength, getCount)
No scalar multiply, copyFrom, transpose, reduce, or MapElement