expand to arbitrary K size for MXFP4 by moizyousufi · Pull Request #10 · IST-DASLab/qutlass

moizyousufi · 2025-12-29T08:51:50Z

Description

The K size for MXFP4 was previously limited to 128. I noticed that the existing kernels can natively be extended to 256.

To go beyond 256, I created a v2 API: K ≤ 256 defaults to the legacy path and K > 256 defaults to v2.
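To make the default routing concrete, here is a minimal sketch of the rule described above. The names `select_api`, `LEGACY_MAX_K`, and the `api` override argument are hypothetical and are not taken from the QuTLASS code; only the 256 threshold comes from this description.

```python
LEGACY_MAX_K = 256  # largest K the legacy path handles, per the description above


def select_api(k: int, api: str = "auto") -> str:
    """Return which path a given K would take (hypothetical helper, not QuTLASS API)."""
    if k <= 0:
        raise ValueError(f"K must be positive, got {k}")
    if api == "auto":
        # Default rule: legacy for K <= 256, v2 for anything larger.
        return "legacy" if k <= LEGACY_MAX_K else "v2"
    if api == "legacy" and k > LEGACY_MAX_K:
        # The legacy path has no kernel for K above the threshold.
        raise ValueError(f"legacy path only supports K <= {LEGACY_MAX_K}, got {k}")
    if api not in ("legacy", "v2"):
        raise ValueError(f"unknown api {api!r}")
    return api
```

With this rule, `select_api(128)` and `select_api(256)` pick the legacy path, while `select_api(4096)` picks v2.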

The v2 API

Uses the CollectiveBuilder from CUTLASS 3.x/4.x instead of hardcoded templates
- The legacy API needed a separate function for each K size, with hardcoded tile shapes per K. That is no longer a concern.
- Compile-time tiling!
Separates the GEMM and quantization phases, because CUTLASS 3.x/4.x doesn't support E2M1 epilogue fusion. This still works out: the GEMM leverages CUTLASS, and the quantization can use native Blackwell PTX instructions.
The fun stuff: K can be an arbitrary size because the workspace is now allocated dynamically, sized via gemm_op.get_workspace_size. This lets QuTLASS scale K as high as you need.
- CUTLASS 3.x/4.x loops over K-tiles automatically. The legacy API required the WarpShape K dimension to match the actual K, whereas the v2 API fixes the tile K dimension at 64 at compile time and handles the actual K by iteration.
- You might wonder why v2 isn't slower than the legacy API despite the iteration:
  - StageCountAuto overlaps memory loads with computation
  - intermediate results stay in registers (no GMEM round trip)
  - each 64-element K-tile still fully utilizes the tensor cores
  - smaller tiles -> less SMEM usage -> more concurrent CTAs
- I decided to add a v2 API instead of replacing the legacy API because the legacy API seems to use CUTLASS 2.x, which has the baked-in assumption that WarpShape.N == K. Under that assumption there is no accumulation across K-tiles, and the scale factors are computed once per warp.
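The K-tile iteration described above can be illustrated with a plain-Python reference. This is only a sketch of the looping idea, not the CUTLASS mainloop: the function name `gemm_k_tiled` is hypothetical, the nested lists stand in for device tensors, and the `acc` list stands in for accumulators that the real kernel keeps in registers. Handling a final partial tile with `min` is a convenience of this sketch and says nothing about how the kernel treats K values that are not multiples of 64.

```python
TILE_K = 64  # tile K extent, fixed at compile time in v2 per the description


def gemm_k_tiled(a, b, tile_k=TILE_K):
    """Compute C = A @ B by iterating over the K dimension in fixed-size tiles."""
    m, k, n = len(a), len(a[0]), len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B do not match")
    # One accumulator per output element, kept live across all K-tiles.
    acc = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k, tile_k):        # one iteration per K-tile
        k1 = min(k0 + tile_k, k)          # last tile may be partial in this sketch
        for i in range(m):
            row = a[i]
            for j in range(n):
                partial = acc[i][j]
                for kk in range(k0, k1):  # work inside a single K-tile
                    partial += row[kk] * b[kk][j]
                acc[i][j] = partial       # carried forward, never written out mid-loop
    return acc
```

The tile size never changes with K; only the number of loop iterations does, which is why one compiled tile shape can serve every K.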
NOTE: This has only been tested on a B200 GPU. I cannot verify whether it works on other Blackwell GPUs such as the B300, RTX PRO 6000, or RTX 5090.
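For readers less familiar with the separate quantization phase, here is a rough host-side sketch of MXFP4 block quantization. It assumes the OCP Microscaling convention (blocks of 32 values sharing one power-of-two scale, elements stored as E2M1) and an absolute-maximum scale rule; it is not the QuTLASS kernel, which may pick scales differently and runs on the GPU. The names `quantize_block` and `dequantize_block` are hypothetical.

```python
import math

# Magnitudes representable in E2M1 (1 sign bit, 2 exponent bits, 1 mantissa bit).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32       # elements sharing one scale in MXFP4
E2M1_EMAX = 2    # exponent of the largest E2M1 binade (4.0 == 2**2)


def quantize_block(block):
    """Return (shared_exp, values): a power-of-two exponent and E2M1-valued elements."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # frexp gives amax = m * 2**e with 0.5 <= m < 1, so floor(log2(amax)) == e - 1.
    _, e = math.frexp(amax)
    shared_exp = (e - 1) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_VALUES[-1])  # saturate at 6.0
        # Nearest representable magnitude; ties go to the smaller one in this sketch.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest, x))
    return shared_exp, out


def dequantize_block(shared_exp, values):
    """Map E2M1-valued elements back to real values using the shared scale."""
    return [v * 2.0 ** shared_exp for v in values]
```

This sketch ignores the finite range of the E8M0 scale and does not pack the 4-bit codes, both of which a real implementation has to handle.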

Other Notes

I also added a conda environment so that testing is more consistent.

moizyousufi added 3 commits December 29, 2025 02:11

expand to arbitrary K size for MXFP4

194086c

Batched operations and skip rotation optimization

0e00411

fixed setup

8e024fe

moizyousufi wants to merge 3 commits into IST-DASLab:main from moizyousufi:main
