Fix LDS vectorization logic to allow for bank conflict resolution when kPack > kPerThread#2196
Open
umangyadav wants to merge 4 commits intodevelopfrom
Open
Fix LDS vectorization logic to allow for bank conflict resolution when kPack > kPerThread#2196umangyadav wants to merge 4 commits intodevelopfrom
umangyadav wants to merge 4 commits intodevelopfrom
Conversation
…efore it can't really be vectorized, in such cases we should rotate to avoid bank conflicts
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes the LDS vectorization logic for GEMM operations when kPack > kPerThread. The change ensures that bank conflict resolution is properly applied when vectorization is not actually possible due to non-contiguous memory writes.
- Adds a check to verify that
copyKPerThread >= kpackbefore allowing vectorization - Prevents false vectorization flags when consecutive threads cannot fill a complete
kPackvector - Improves LDS bank conflict rate from 0.49 to 0.07 in the example case
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
justinrosner
approved these changes
Jan 6, 2026
Contributor
|
please, can you provide a comparison of develop vs this branch? even if it is with |
dhernandez0
reviewed
Jan 7, 2026
| // bank conflicts resolution) | ||
| bool isPossibleToVectorizeD = (kpack < maxVlen && copyDPerThread > 1); | ||
| bool isPossibleToVectorizeD = | ||
| (kpack < maxVlen && copyDPerThread > 1) && (copyKPerThread >= kpack); |
Contributor
There was a problem hiding this comment.
can you add a comment to explain this?
Contributor
There was a problem hiding this comment.
nit: no need to have two parentheses.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Each thread is writing
dPerThread x kPerThreadorkPerThread x dPerThreadslice of threadTile.copyKPerThreaddoesn't necessarily have to be same askPack.LDS layout is
[kPackPerBlock, dPerblock, kPack].If
copyKPerThread < kPackthen consecutive threads will have to write enough values to fillkPack. in this caseDdimension is not being contiguously written and therefore can not be vectorized.Current logic prefers "vectorization" over bank conflict resolution. But in case of
copyKPerThread < kpackit is neither doing vectorization nor rotating to avoid bank conflicts.For example,
rocmlir-gen --arch gfx942 -t f16 -out_datatype f16 -g 1 -m 1500 -n 1152 -k 896 --perf_config="v4:96,256,8,96,32,32,4,1,2,0,0,4,1,1" --operation gemm --num_cu 80For this case, kPack = 4.
Each thread is loading
dxk = 3x2threadTile from global memory.Therefore in LDS, It appears like this following. This table shows only first 32 threads.
B0, B1..are bank mapping.Each consecutive 4 columns forms
kPack. This is only for representation purpose, inside LDS layout is actuallykPackPerBlock, mPerBlock, kPack].Notice that both thread1 and thread0 are writing to same
kPackvector.Existing logic will set vectorization flag but it can not happen in this case.
Data Table
Because each thread is writing 32 bits, compiler generates
ds_write2_b32. Thisds_writeinstruction happens in two phases.Phase 1: 0-31 threads
Phase 2 : 32-63 thread
Looking at bank mapping you can find that it has 8 way bank conflict : T0, T2, T4, T6, T8, T10, T12, T14 all are mapped to B0.
Measuring bank conflict with this rocprof-compute, this kernel shows LDS Bank conflicts of rate
0.49With this PR it drops to
0.07with this PR.This doesn't translate to better performance always though. I see compiler selecting single access
ds_write_b32instead ofds_write2_b32. But re-tuning selects some other perfConfig with better final performance, it goes from 88 TFLops to 97TFlops.