[rocmlir-tuning-driver] Add rotating buffers and inline instruction cache flush#2188
[rocmlir-tuning-driver] Add rotating buffers and inline instruction cache flush#2188pabloantoniom wants to merge 7 commits intodevelopfrom
Conversation
…ly in non-benchmark mode (otherwise it kills the performance of small kernels!)
… not being invalidated with this
…ng a separate kernel
| static llvm::cl::opt<unsigned> numRotatingBuffers( | ||
| "num-rotating-buffers", | ||
| llvm::cl::desc("Number of rotating buffers to use for benchmarking"), | ||
| llvm::cl::value_desc("number of buffers"), llvm::cl::init(5)); |
There was a problem hiding this comment.
we could set the default to -1, and do this by default:
if bufferSize >= L2Size:
numRotatingBuffers = 2
else
numRotatingBuffers = 2*ceil(L2Size/bufferSize)
There was a problem hiding this comment.
where bufferSize is the total size of the tensors we are going to load from global memory.
|
|
||
| // Insert the inline assembly at the beginning of the entry block | ||
| builder.setInsertionPointToStart(entryBlock); | ||
|
|
There was a problem hiding this comment.
move foundAnyKernel = true to here? so we are sure it's a valid kernel
There was a problem hiding this comment.
in some cases, we might run out of gpu memory. Is there a way to restrict num_buffers by querying the gpu global memory size?
| } | ||
| } | ||
|
|
||
| // 3. Launch the actualkernels. |
There was a problem hiding this comment.
actualkernels -> actual kernels
There was a problem hiding this comment.
Pull request overview
This PR refactors the cache management strategy in the rocMLIR tuning driver to improve benchmarking accuracy and reduce overhead. It replaces explicit cache flush kernel launches with two optimizations: rotating buffers for L2 cache management and inline instruction cache flush assembly embedded directly in kernels.
Changes:
- Implements rotating buffer support to avoid L2 cache reuse between iterations (configurable via
--num-rotating-buffers, default 5) - Adds inline instruction cache flush assembly (
s_icache_inv+ nops) directly into kernel entry points to eliminate separate kernel launch overhead - Restructures the compilation pipeline to insert inline assembly at the LLVM IR stage before final binary compilation
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| llvm::errs() | ||
| << "numIterations must be greater than or equal to numBufferSets\n"; |
There was a problem hiding this comment.
The error message should be more descriptive. It should include the actual values to help with debugging, e.g., "numIterations (N) must be greater than or equal to numBufferSets (M)".
| llvm::errs() | |
| << "numIterations must be greater than or equal to numBufferSets\n"; | |
| llvm::errs() << "numIterations (" << iterations | |
| << ") must be greater than or equal to numBufferSets (" | |
| << numBufferSets << ")\n"; |
| std::vector<void *> &argPointers, | ||
| std::vector<double> &measurements, double &smallKernelCpuMs, | ||
| bool benchmarkMode) { | ||
| // Insert instruction cache flush assembly at the beginning of kernel functions |
There was a problem hiding this comment.
This comment could be more detailed. Consider adding documentation about what the function does, its parameters, and return value. For example: "Inserts inline assembly at the beginning of each kernel function to flush the instruction cache. Returns success() if at least one kernel was found and modified, failure() otherwise."
| // Insert instruction cache flush assembly at the beginning of kernel functions | |
| // Inserts inline assembly at the beginning of each kernel function in the | |
| // given module to flush the instruction cache. | |
| // | |
| // Parameters: | |
| // module - The MLIR module containing LLVM GPU kernel functions. Kernel | |
| // functions are identified by the "gpu.kernel" attribute and by | |
| // having a gpu::GPUModuleOp as a parent. | |
| // | |
| // Returns: | |
| // success() if at least one kernel function was found in the module and | |
| // modified, or failure() if no applicable kernel functions were found. |
| #include "mlir/Dialect/Rock/Tuning/RockTuning.h" | ||
| #include "mlir/Dialect/Rock/utility/fusionUtils.h" | ||
| #include "mlir/Dialect/Rock/utility/loweringUtils.h" | ||
| // #include "mlir/IR/IRBuilder.h" |
There was a problem hiding this comment.
This commented-out include should be removed. If it's not needed, it shouldn't be in the code.
| // #include "mlir/IR/IRBuilder.h" |
| } | ||
| } | ||
|
|
||
| // 3. Launch the actualkernels. |
There was a problem hiding this comment.
There's a typo in this comment: "actualkernels" should be "actual kernels" (with a space).
| // 3. Launch the actualkernels. | |
| // 3. Launch the actual kernels. |
| HIPCHECK(hipExtModuleLaunchKernel( | ||
| func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream, | ||
| argPointers.data(), nullptr, nullptr, nullptr)); | ||
| const_cast<void **>(argPointers.data()), nullptr, nullptr, nullptr)); |
There was a problem hiding this comment.
The const_cast is being added here, but the underlying issue is that argPointers contains pointers to non-const data. Consider whether the function signature should be changed or if there's a better way to handle this without using const_cast, as it can be error-prone and may hide const-correctness issues.
| argPointers.data(), nullptr, startEvent, stopEvent)); | ||
| HIPCHECK(hipExtModuleLaunchKernel(func, gridSize * blockSize, 1, 1, | ||
| blockSize, 1, 1, 0, stream, | ||
| const_cast<void **>(argPointers.data()), |
There was a problem hiding this comment.
The same const_cast issue appears here as well. Consider refactoring to avoid the need for const_cast throughout the codebase.
| HIPCHECK(hipExtModuleLaunchKernel( | ||
| func, gridSize * blockSize, 1, 1, blockSize, 1, 1, 0, stream, | ||
| argPointers.data(), nullptr, startEvent, stopEvent)); | ||
| const_cast<void **>(warmupArgPointers.data()), nullptr, startEvent, |
There was a problem hiding this comment.
The same const_cast pattern appears here. Consider whether the constness of warmupArgPointers is necessary or if the type should be adjusted to avoid the cast.
| } | ||
|
|
||
| // Insert instruction cache flush assembly at the beginning of kernel | ||
| // functions This must be done after backend pipeline generates LLVM IR |
There was a problem hiding this comment.
This comment has a spacing issue. "functions This" should be "functions. This" (period instead of space before "This").
| // functions This must be done after backend pipeline generates LLVM IR | |
| // functions. This must be done after backend pipeline generates LLVM IR |
Motivation
See https://github.com/ROCm/rocMLIR-internal/issues/2149 for motivation.
Technical Details
This PR removes calls to both
flushL2CacheandflushInstructionCache, replacing them with other approach:flushL2Cache: This PR implements rotating buffers. The idea is to allocate multiple buffers that are rotated on every iteration, with the goal of avoiding cache reuse from one iteration to another. We also have a new optionnum-rotating-buffersto control how many rotating buffers are used (default is 5).flushInstructionCache: This PR implementsinsertInstructionCacheFlush, which inserts the kernel containings_icache_invplusnops into the actual kernel that we want to executing, thus avoiding the launch overhead of theflushInstructionCachekernel. The compilation of the kernel is adapted to make a intermediate step at LLVM IR level, where we insert the assembly, which later get lowered to a binary.Test Plan
No new test was added.
Test Result
All test pass.
Submission Checklist