Test vectorization #61

@tkoskela

There is a roughly 2.5x difference in the performance of Particle State Update (45.5 s vs. 17.8 s) between Haswell and Skylake processors with the same clock speed. One possible explanation is the use of AVX-512 vector instructions on Skylake. It would be interesting to verify whether this is the case.
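One way to check which vector instructions Julia actually emits is to look at the generated native code for the hot loop and search for the register names: `zmm` registers indicate 512-bit AVX-512 code, `ymm` registers indicate 256-bit AVX/AVX2 code. The following is a minimal sketch, not the TDAC code itself: `axpy_kernel!`, `native_asm` and `vector_registers` are hypothetical helpers introduced for illustration, and the kernel is only a stand-in for whatever loop dominates Particle State Update.

```julia
using InteractiveUtils  # provides code_native (loaded automatically in the REPL)

# Hypothetical stand-in for the inner loop of the particle state update.
function axpy_kernel!(y::Vector{Float64}, a::Float64, x::Vector{Float64})
    @inbounds @simd for i in eachindex(x, y)
        y[i] += a * x[i]
    end
    return y
end

# Capture the native assembly generated for the given argument types.
function native_asm(f, argtypes)
    io = IOBuffer()
    code_native(io, f, argtypes)
    return String(take!(io))
end

# Report which x86 vector register files appear in the assembly text.
function vector_registers(asm::AbstractString)
    return (xmm = occursin("xmm", asm),  # 128-bit SSE, or scalar AVX
            ymm = occursin("ymm", asm),  # 256-bit AVX/AVX2
            zmm = occursin("zmm", asm))  # 512-bit AVX-512
end

asm = native_asm(axpy_kernel!, (Vector{Float64}, Float64, Vector{Float64}))
println("CPU target: ", Sys.CPU_NAME)
println(vector_registers(asm))
```

Running this on both machines shows whether the compiler targets different vector widths on each. A complementary experiment, assuming the newer machine is available, is to start Julia there with `julia -C haswell` (the `--cpu-target` option) so that code is generated for the Haswell instruction set, and then repeat the timing. If the gap in Particle State Update persists, the vector instruction set is unlikely to be the main cause.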

Single thread Haswell Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz:

julia> tdac(TDAC.tdac_params(; nprt = 64, nobs = 64, enable_timers = true));
────────────────────────────────────────────────────────────────────────────────
                                         Time                   Allocations      
                                 ──────────────────────   ───────────────────────
        Tot / % measured:             77.2s / 100%            11.4GiB / 100%     
 Section                 ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────
 Particle State Update       20    45.5s  59.0%   2.28s   3.00MiB  0.03%   154KiB
 Process Noise            1.28k    27.8s  36.0%  21.7ms   10.7GiB  93.7%  8.55MiB
 Initialization               1    1.47s  1.90%   1.47s    698MiB  5.98%   698MiB
 True State Update           20    931ms  1.20%  46.5ms   42.8KiB  0.00%  2.14KiB
 Resample                    20    774ms  1.00%  38.7ms   12.2KiB  0.00%     624B
 Particle Variance           20    343ms  0.44%  17.2ms   36.6MiB  0.31%  1.83MiB
 Particle Mean               20    181ms  0.23%  9.05ms     0.00B  0.00%    0.00B
 State Copy                  20    126ms  0.16%  6.32ms      640B  0.00%    32.0B
 Weights                     20   20.8ms  0.03%  1.04ms   2.53MiB  0.02%   130KiB
 Observations             1.30k   15.7ms  0.02%  12.1μs    280KiB  0.00%     221B
 Observation Noise        1.28k   2.50ms  0.00%  1.96μs   60.0KiB  0.00%    48.0B
 ────────────────────────────────────────────────────────────────────────────────

Single thread Skylake Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz:

julia> tdac(TDAC.tdac_params(; nprt = 64, nobs = 64, enable_timers = true));
────────────────────────────────────────────────────────────────────────────────
                                         Time                   Allocations      
                                 ──────────────────────   ───────────────────────
        Tot / % measured:             46.2s / 100%            11.4GiB / 100%     
 Section                 ncalls     time   %tot     avg     alloc   %tot      avg
 ────────────────────────────────────────────────────────────────────────────────
 Process Noise            1.28k    25.1s  54.2%  19.6ms   10.7GiB  93.5%  8.55MiB
 Particle State Update       20    17.8s  38.5%   890ms   4.48MiB  0.04%   229KiB
 Initialization               1    2.13s  4.61%   2.13s    698MiB  5.97%   698MiB
 Resample                    20    382ms  0.83%  19.1ms   12.2KiB  0.00%     624B
 True State Update           20    300ms  0.65%  15.0ms   42.8KiB  0.00%  2.14KiB
 Particle Variance           20    208ms  0.45%  10.4ms   36.6MiB  0.31%  1.83MiB
 State Copy                  20    130ms  0.28%  6.48ms      640B  0.00%    32.0B
 Particle Mean               20   91.8ms  0.20%  4.59ms     0.00B  0.00%    0.00B
 Observations             1.30k   17.6ms  0.04%  13.5μs    280KiB  0.00%     221B
 Weights                     20   3.94ms  0.01%   197μs   2.53MiB  0.02%   130KiB
 Observation Noise        1.28k   2.73ms  0.01%  2.13μs   60.0KiB  0.00%    48.0B
 ────────────────────────────────────────────────────────────────────────────────
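To see which sections account for the gap, the per-section timings from the two tables above can be turned into speedup ratios. This is a small sketch: the numbers are transcribed from the tables (converted to seconds), and the variable names `haswell`, `skylake` and `speedup` are chosen here for illustration only.

```julia
# Section timings in seconds, transcribed from the two timer tables above.
haswell = Dict("Particle State Update" => 45.5, "Process Noise" => 27.8,
               "True State Update" => 0.931, "Resample" => 0.774,
               "Total" => 77.2)
skylake = Dict("Particle State Update" => 17.8, "Process Noise" => 25.1,
               "True State Update" => 0.300, "Resample" => 0.382,
               "Total" => 46.2)

# A ratio above 1 means the section ran faster on the Skylake machine.
speedup = Dict(name => haswell[name] / skylake[name] for name in keys(haswell))

# Print sections from largest to smallest speedup.
for (name, ratio) in sort(collect(speedup); by = last, rev = true)
    println(rpad(name, 24), round(ratio; digits = 2), "x")
end
```

Process Noise runs at nearly the same speed on both machines, while Particle State Update and True State Update differ by a much larger factor, so the effect appears specific to those sections rather than a uniform difference between the two systems. The ratios alone do not identify the cause.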

Labels: question (Further information is requested)
