Skip to content

Potential Async Data Race in cp_async4_stream Usage - Request for Validation #42

@Zephyreeze

Description

@Zephyreeze

Issue Description

I've encountered an issue where values copied via cp_async4_stream are not visible in shared memory when accessed later in the kernel. Specifically:

In the original code sequence (L659-665 in marlin/marlin_cuda_kernel.cu):

// For per-column scales, we only fetch them here in the final step before write-out
if (group_blocks == -1 && last) {
  if (s_sh_wr_pred) // s_sh_wr_pred = threadIdx.x < s_sh_stride = 32 * thread_n_blocks / 8
    cp_async4_stream(&sh_s[s_sh_wr], &s[s_gl_rd]);
  cp_async_fence();
}
thread_block_reduce(); // Contains __syncthreads()

The shared memory values (sh_s) appear zeroed when accessed later, even though the source global memory (s) contains valid data.

However, when I modify the sequence to:

thread_block_reduce();  // Moved before async copy
if (group_blocks == -1 && last) {
  if (s_sh_wr_pred) // s_sh_wr_pred = threadIdx.x < s_sh_stride = 32 * thread_n_blocks / 8
    cp_async4_stream(&sh_s[s_sh_wr], &s[s_gl_rd]);
  cp_async_fence();
}

The values are correctly populated in shared memory. This suggests a potential data race or synchronization issue with the asynchronous copy.

Debug output before fix:

##### cp_async4_stream 
threadIdx.x=4 ## s_sh_wr=4, s_gl_rd=4 ##
sh_s[s_sh_wr].x=0, ...  // All zeros
s[s_gl_rd].x=15899, ... // We controlled that only one value in scales is non-zero

Questions for Maintainers

  1. Correctness Validation:

    • Is swapping the thread_block_reduce() and async operations safe?
    • Could this change break other execution paths or edge cases?
  2. Performance Impact:

    • How might this affect computation/communication overlap?
    • Does moving the synchronization earlier impact pipeline efficiency?

Environment

  • GPU: A100

This appears related to CUDA async operation scheduling where synchronization ops immediately following async launches might inhibit proper execution. Would appreciate your insight on the correct synchronization pattern for this case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions