
Conversation


@zcbenz zcbenz commented Dec 12, 2025

This PR implements gather_mm for the cases that can be translated into a grouped GEMM.

The grouped GEMM code uses the CUTLASS 2.x API, which allows us to choose kernels that do not require large alignment. The performance is not good yet: running benchmarks/python/gather_mm_bench.py shows it takes 7x the time of an equivalent matmul (the Metal kernel takes 1.5x).
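
Roughly, the mapping works like this: when the indices are sorted, every contiguous run of rows with the same index multiplies the same rhs matrix, so each run becomes one problem in the grouped GEMM. Below is a minimal host-side sketch of building the problem list, assuming sorted indices, an lhs of shape (N, K), and rhs matrices of shape (K, M); the names are illustrative, not the PR's actual variables:

#include <cstdint>
#include <vector>

#include <cutlass/gemm_coord.h>

// Hypothetical sketch: each contiguous run of equal indices turns into one
// (run_length x K) * (K x M) GEMM in the grouped problem list.
std::vector<cutlass::gemm::GemmCoord> make_problem_sizes(
    const int32_t* indices, int N, int M, int K) {
  std::vector<cutlass::gemm::GemmCoord> problem_sizes;
  for (int start = 0; start < N;) {
    int end = start;
    while (end < N && indices[end] == indices[start]) {
      ++end;  // extend the run of rows sharing one index
    }
    problem_sizes.emplace_back(/*m=*/end - start, /*n=*/M, /*k=*/K);
    start = end;
  }
  return problem_sizes;
}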

There are a lot of things remaining to be done:

  • Implement the cases where lhs indices are passed.
  • Implement the cases where indices are broadcast.
  • Implement the cases where indices are not sorted.
  • Tune the GEMM tile sizes.
  • Enable tensor cores for sm80 and later.
  • Pad the group sizes so we can use much faster kernels.

But the current work can serve as a baseline and a good foundation for progressive improvements.

@zcbenz zcbenz force-pushed the cuda-grouped-mm branch 2 times, most recently from 28e5ca7 to 3b2a857 on December 15, 2025 23:49
Comment on lines +327 to +339
array zero(0, a.dtype());
encoder.add_temporary(zero);
fill_gpu(zero, out, s);
Member

Just a note, not necessary for this PR, but we should probably do these with cudaMemsetAsync.

Member

It would need to go in the graph, but I think that should be fairly straightforward.

Collaborator Author

CUDA graphs have a memset node we can use; that would be much better than running a kernel.
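
For reference, a minimal sketch of what adding such a node could look like (add_zero_fill, out_ptr, and nbytes are illustrative names, not code from this PR):

#include <cuda_runtime.h>

// Hypothetical sketch: zero a buffer with a CUDA graph memset node instead
// of launching a fill kernel.
cudaGraphNode_t add_zero_fill(cudaGraph_t graph, void* out_ptr, size_t nbytes) {
  cudaMemsetParams params = {};
  params.dst = out_ptr;    // device pointer to clear
  params.value = 0;        // byte value written to every element
  params.elementSize = 1;  // memset supports 1-, 2-, or 4-byte elements
  params.width = nbytes;   // elements per row
  params.height = 1;       // flat 1-D fill
  cudaGraphNode_t node;
  cudaGraphAddMemsetNode(
      &node, graph, /*pDependencies=*/nullptr, /*numDependencies=*/0, &params);
  return node;
}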

Member

@angeloskath angeloskath left a comment

Awesome start!

I left a comment regarding a small bug in the current code. Very weird that no tests caught it. I verified with the following small test:

import mlx.core as mx
x = mx.random.normal((1024, 1, 1024))
w = mx.random.normal((16, 1024, 1024))
indices = mx.sort((mx.random.uniform(shape=(1024,)) * 16).astype(mx.int32))
y = mx.gather_mm(x, w.swapaxes(-1, -2), rhs_indices=indices, sorted_indices=True)
# Reference: multiply each row by its selected matrix, one at a time.
z = []
for i in range(1024):
    z.append(x[i] @ w[indices[i]].T)
z = mx.stack(z)
mx.eval(y, z)
# y and z should match (loose tolerances for float32 accumulation).
assert mx.allclose(y, z, rtol=1e-4, atol=1e-4)

cutlass::ComplexTransform::kNone,
kAlignment,
T,
cutlass::layout::RowMajor,
Member

This should depend on the passed-in matrix. Basically, when b_transposed in matmul.cpp is true, this should be ColumnMajor.
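
Something like the following compile-time dispatch would cover it (run_grouped_gemm is a hypothetical stand-in for the PR's launch helper, not its real name):

#include <cutlass/layout/matrix.h>

// Sketch: the CUTLASS layout is a template parameter, so the transposed case
// needs a compile-time branch. A row-major B passed in transposed is the same
// memory as a column-major B.
template <typename LayoutB>
void run_grouped_gemm(/* problem sizes, pointers, ... */);

void dispatch(bool b_transposed) {
  if (b_transposed) {
    run_grouped_gemm<cutlass::layout::ColumnMajor>(/* ... */);
  } else {
    run_grouped_gemm<cutlass::layout::RowMajor>(/* ... */);
  }
}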

Collaborator Author

Thanks for noticing this! I have fixed it and added tests.

Member

@angeloskath angeloskath left a comment

Nice! Thanks.

@zcbenz zcbenz merged commit 1d21d0e into ml-explore:main Dec 24, 2025
15 checks passed
@zcbenz zcbenz deleted the cuda-grouped-mm branch December 24, 2025 00:42