https://godbolt.org/z/MPPnvT5h8
A simple reproducer:
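    #include <bit>      // std::byteswap (C++23)
    #include <cstddef>  // size_t
    #include <cstdint>  // int64_t
    #include <span>     // std::span

    // Indexed loop over a raw pointer.
    void swap_ptr_impl(int64_t* ptr, size_t len) {
        for (size_t i = 0; i < len; i++) {
            ptr[i] = std::byteswap(ptr[i]);
        }
    }

    // Pointer-increment loop over the same range.
    void swap_ptr2_impl(int64_t* ptr, size_t len) {
        auto end = ptr + len;
        for (; ptr < end; ptr++) {
            *ptr = std::byteswap(*ptr);
        }
    }

    // Range-for over a span with dynamic extent.
    void swap_span_impl(std::span<int64_t> sp) {
        for (auto& x : sp) {
            x = std::byteswap(x);
        }
    }

    // Range-for over a span with a fixed extent of 1024.
    void swap_span_2(std::span<int64_t, 1024> sp) {
        for (auto& x : sp) {
            x = std::byteswap(x);
        }
    }

swap_ptr_impl is about 2x slower than swap_ptr2_impl and swap_span_impl on an i9-14900KF; on quickbench the gap is about 2.8x.
swap_span_2 (span length known at compile time) is also about 2x slower.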
    Run on (32 X 3187 MHz CPU s)
    CPU Caches:
      L1 Data 48 KiB (x16)
      L1 Instruction 32 KiB (x16)
      L2 Unified 2048 KiB (x16)
      L3 Unified 36864 KiB (x1)
    ----------------------------------------------
    Benchmark          Time       CPU   Iterations
    ----------------------------------------------
    swap_ptr         400 ns    390 ns      1723077
    swap_ptr2        184 ns    180 ns      4072727
    swap_span        176 ns    165 ns      4072727
    swap_span_2      403 ns    399 ns      1723077
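The harness behind these numbers is not reproduced in the text. As a rough standalone way to compare the indexed and pointer-increment loops, a minimal sketch follows. The names `bswap64`, `swap_indexed`, `swap_pointer`, and `ns_per_call` are hypothetical and not from the original benchmark. The `bswap64` helper is an assumption that falls back to the GCC/Clang builtin `__builtin_bswap64` when `std::byteswap` is unavailable.

```cpp
#include <bit>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for std::byteswap so the sketch also builds before C++23
// (assumption: a GCC- or Clang-compatible compiler for the fallback).
inline std::int64_t bswap64(std::int64_t v) {
#if defined(__cpp_lib_byteswap)
    return std::byteswap(v);
#else
    return static_cast<std::int64_t>(
        __builtin_bswap64(static_cast<std::uint64_t>(v)));
#endif
}

// Indexed loop, same shape as swap_ptr_impl.
void swap_indexed(std::int64_t* ptr, std::size_t len) {
    for (std::size_t i = 0; i < len; i++) {
        ptr[i] = bswap64(ptr[i]);
    }
}

// Pointer-increment loop, same shape as swap_ptr2_impl.
void swap_pointer(std::int64_t* ptr, std::size_t len) {
    for (auto* end = ptr + len; ptr < end; ptr++) {
        *ptr = bswap64(*ptr);
    }
}

using SwapFn = void (*)(std::int64_t*, std::size_t);

// Hypothetical helper: average wall-clock nanoseconds per call over `reps` calls.
double ns_per_call(SwapFn fn, std::vector<std::int64_t>& buf, int reps) {
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; r++) {
        fn(buf.data(), buf.size());
    }
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / reps;
}
```

Calling `ns_per_call(swap_indexed, buf, reps)` and `ns_per_call(swap_pointer, buf, reps)` on a 1024-element buffer gives a first impression only. A real harness such as Google Benchmark also keeps the optimizer from inlining or removing the measured work, which this sketch does not attempt.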
With -fno-vectorize, all four functions perform about the same:
    Run on (32 X 3187 MHz CPU s)
    CPU Caches:
      L1 Data 48 KiB (x16)
      L1 Instruction 32 KiB (x16)
      L2 Unified 2048 KiB (x16)
      L3 Unified 36864 KiB (x1)
    ----------------------------------------------
    Benchmark          Time       CPU   Iterations
    ----------------------------------------------
    swap_ptr         181 ns    184 ns      4072727
    swap_ptr2        181 ns    180 ns      3733333
    swap_span        173 ns    172 ns      3733333
    swap_span_2      175 ns    173 ns      4072727
So I assume something is wrong in the loop vectorizer. I have reproduced this with clang 17 and newer.
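For narrowing this down without turning off vectorization for the whole translation unit, clang's loop pragma can disable the loop vectorizer for a single loop. The sketch below applies it to the indexed variant. The function name `swap_ptr_novec` and the `bswap64` helper are hypothetical, and whether this matches the -fno-vectorize timings has not been measured here.

```cpp
#include <bit>
#include <cstddef>
#include <cstdint>

// Stand-in for std::byteswap so the sketch also builds before C++23
// (assumption: a GCC- or Clang-compatible compiler for the fallback).
inline std::int64_t bswap64(std::int64_t v) {
#if defined(__cpp_lib_byteswap)
    return std::byteswap(v);
#else
    return static_cast<std::int64_t>(
        __builtin_bswap64(static_cast<std::uint64_t>(v)));
#endif
}

// Same loop shape as swap_ptr_impl, with vectorization and interleaving
// disabled for this one loop when compiling with clang.
void swap_ptr_novec(std::int64_t* ptr, std::size_t len) {
#if defined(__clang__)
#pragma clang loop vectorize(disable) interleave(disable)
#endif
    for (std::size_t i = 0; i < len; i++) {
        ptr[i] = bswap64(ptr[i]);
    }
}
```

Comparing this function against the unmodified swap_ptr_impl in the same benchmark would show whether the slowdown comes from the vectorized form of this specific loop.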