[Type] Mat: better cache locality for operator*(Mat) #5921

fredroy · 2026-02-02T23:02:58Z

Change the access pattern in operator*(Mat) for better cache locality (suggested by an AI tool).

TL;DR:
the Mat<3,3> case does not change because it already has its own optimized specialization
the bigger the matrices, the bigger the gain (Mat24x24 in float is about 4x faster on Ubuntu/gcc12)
macOS has a weird quirk for Mat6x6 in double, which gets roughly 40% slower (11.4 us to 16.3 us) 🤔, maybe due to a failed vectorization or something similar
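
The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of the kind of loop reordering the title refers to, assuming row-major storage. `Matrix`, `multiplyIJK`, `multiplyIKJ` and `checkLoopOrdersAgree` are hypothetical names used for illustration; they are not the `sofa::type::Mat` API and may differ from what the PR actually does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a fixed-size, row-major matrix (not sofa::type::Mat).
template <std::size_t L, std::size_t C, typename T>
using Matrix = std::array<std::array<T, C>, L>;

// Textbook i-j-k order: the inner loop walks down a column of b,
// so consecutive reads of b are C elements apart in memory.
template <std::size_t L, std::size_t K, std::size_t C, typename T>
Matrix<L, C, T> multiplyIJK(const Matrix<L, K, T>& a, const Matrix<K, C, T>& b)
{
    Matrix<L, C, T> r{};
    for (std::size_t i = 0; i < L; ++i)
        for (std::size_t j = 0; j < C; ++j)
        {
            T sum{};
            for (std::size_t k = 0; k < K; ++k)
                sum += a[i][k] * b[k][j];
            r[i][j] = sum;
        }
    return r;
}

// i-k-j order: the inner loop walks along a row of b and a row of r,
// so both are accessed contiguously. r is only updated with +=,
// which is why it has to start from zero.
template <std::size_t L, std::size_t K, std::size_t C, typename T>
Matrix<L, C, T> multiplyIKJ(const Matrix<L, K, T>& a, const Matrix<K, C, T>& b)
{
    Matrix<L, C, T> r{}; // empty braces value-initialize every entry to zero
    for (std::size_t i = 0; i < L; ++i)
        for (std::size_t k = 0; k < K; ++k)
        {
            const T aik = a[i][k];
            for (std::size_t j = 0; j < C; ++j)
                r[i][j] += aik * b[k][j];
        }
    return r;
}

// Small usage check: both loop orders give the same product.
inline void checkLoopOrdersAgree()
{
    const Matrix<2, 3, double> a{{{1.0, 2.0, 3.0}, {4.0, 5.0, 6.0}}};
    const Matrix<3, 2, double> b{{{7.0, 8.0}, {9.0, 10.0}, {11.0, 12.0}}};
    assert(multiplyIJK(a, b) == multiplyIKJ(a, b));
}
```

Both functions add the terms for each entry in the same order (k = 0, 1, ...), so only the memory access pattern differs. Whether the reordered version is actually faster depends on the matrix size and on how the compiler vectorizes it, as the timings show.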

Timings:
Ubuntu 22.04, gcc12, lto, O3

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.53 us         1.53 us       457258
BM_Matrix_typemat_matmult<float, 3>/1024          3.07 us         3.07 us       227524
BM_Matrix_typemat_matmult<float, 3>/2048          6.16 us         6.16 us       112806
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       402135
BM_Matrix_typemat_matmult<double, 3>/1024         3.49 us         3.48 us       201140
BM_Matrix_typemat_matmult<double, 3>/2048         6.99 us         6.99 us        99944
BM_Matrix_typemat_matmult<float, 6>/512           23.8 us         23.8 us        29239
BM_Matrix_typemat_matmult<float, 6>/1024          47.7 us         47.7 us        14642
BM_Matrix_typemat_matmult<float, 6>/2048          95.8 us         95.8 us         7241
BM_Matrix_typemat_matmult<double, 6>/512          24.4 us         24.4 us        28460
BM_Matrix_typemat_matmult<double, 6>/1024         49.0 us         49.0 us        14222
BM_Matrix_typemat_matmult<double, 6>/2048         98.3 us         98.3 us         7058
BM_Matrix_typemat_matmult<float, 24>/512          2108 us         2108 us          331
BM_Matrix_typemat_matmult<float, 24>/1024         4234 us         4234 us          165
BM_Matrix_typemat_matmult<float, 24>/2048         8458 us         8457 us           80
BM_Matrix_typemat_matmult<double, 24>/512         1878 us         1878 us          372
BM_Matrix_typemat_matmult<double, 24>/1024        3773 us         3773 us          185
BM_Matrix_typemat_matmult<double, 24>/2048        7741 us         7741 us           89

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.54 us         1.54 us       453879
BM_Matrix_typemat_matmult<float, 3>/1024          3.09 us         3.09 us       226329
BM_Matrix_typemat_matmult<float, 3>/2048          6.17 us         6.16 us       113432
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       403088
BM_Matrix_typemat_matmult<double, 3>/1024         3.46 us         3.46 us       202741
BM_Matrix_typemat_matmult<double, 3>/2048         6.91 us         6.91 us       100423
BM_Matrix_typemat_matmult<float, 6>/512           22.4 us         22.4 us        31211
BM_Matrix_typemat_matmult<float, 6>/1024          44.4 us         44.4 us        15589
BM_Matrix_typemat_matmult<float, 6>/2048          89.2 us         89.2 us         7770
BM_Matrix_typemat_matmult<double, 6>/512          22.7 us         22.7 us        30714
BM_Matrix_typemat_matmult<double, 6>/1024         45.6 us         45.6 us        15286
BM_Matrix_typemat_matmult<double, 6>/2048         91.9 us         91.9 us         7593
BM_Matrix_typemat_matmult<float, 24>/512           522 us          522 us         1338
BM_Matrix_typemat_matmult<float, 24>/1024         1039 us         1039 us          672
BM_Matrix_typemat_matmult<float, 24>/2048         2090 us         2090 us          334
BM_Matrix_typemat_matmult<double, 24>/512          963 us          963 us          725
BM_Matrix_typemat_matmult<double, 24>/1024        1925 us         1925 us          362
BM_Matrix_typemat_matmult<double, 24>/2048        3929 us         3929 us          179

after (revised)
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.54 us         1.54 us       456346
BM_Matrix_typemat_matmult<float, 3>/1024          3.08 us         3.08 us       227839
BM_Matrix_typemat_matmult<float, 3>/2048          6.17 us         6.17 us       112654
BM_Matrix_typemat_matmult<double, 3>/512          1.73 us         1.73 us       399904
BM_Matrix_typemat_matmult<double, 3>/1024         3.46 us         3.46 us       201315
BM_Matrix_typemat_matmult<double, 3>/2048         6.92 us         6.92 us       100507
BM_Matrix_typemat_matmult<float, 6>/512           22.4 us         22.3 us        31397
BM_Matrix_typemat_matmult<float, 6>/1024          44.5 us         44.5 us        15630
BM_Matrix_typemat_matmult<float, 6>/2048          89.1 us         89.1 us         7768
BM_Matrix_typemat_matmult<double, 6>/512          22.8 us         22.8 us        30601
BM_Matrix_typemat_matmult<double, 6>/1024         45.7 us         45.7 us        15298
BM_Matrix_typemat_matmult<double, 6>/2048         91.9 us         91.8 us         7563
BM_Matrix_typemat_matmult<float, 24>/512           519 us          519 us         1356
BM_Matrix_typemat_matmult<float, 24>/1024         1045 us         1045 us          667
BM_Matrix_typemat_matmult<float, 24>/2048         2083 us         2082 us          334
BM_Matrix_typemat_matmult<double, 24>/512          959 us          958 us          723
BM_Matrix_typemat_matmult<double, 24>/1024        1938 us         1936 us          361
BM_Matrix_typemat_matmult<double, 24>/2048        3931 us         3927 us          178

Windows VS2026, release, lto

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           22.7 us         22.0 us        29867
BM_Matrix_typemat_matmult<float, 3>/1024          44.6 us         45.5 us        15448
BM_Matrix_typemat_matmult<float, 3>/2048          88.6 us         90.0 us         7467
BM_Matrix_typemat_matmult<double, 3>/512          18.0 us         18.0 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024         36.2 us         36.8 us        18667
BM_Matrix_typemat_matmult<double, 3>/2048         72.3 us         71.5 us         8960
BM_Matrix_typemat_matmult<float, 6>/512            457 us          450 us         1493
BM_Matrix_typemat_matmult<float, 6>/1024           922 us          920 us          747
BM_Matrix_typemat_matmult<float, 6>/2048          1825 us         1843 us          407
BM_Matrix_typemat_matmult<double, 6>/512           415 us          414 us         1659
BM_Matrix_typemat_matmult<double, 6>/1024          822 us          816 us          747
BM_Matrix_typemat_matmult<double, 6>/2048         1664 us         1651 us          407
BM_Matrix_typemat_matmult<float, 24>/512          3469 us         3446 us          195
BM_Matrix_typemat_matmult<float, 24>/1024         7058 us         7115 us          112
BM_Matrix_typemat_matmult<float, 24>/2048        14486 us        14375 us           50
BM_Matrix_typemat_matmult<double, 24>/512         3543 us         3526 us          195
BM_Matrix_typemat_matmult<double, 24>/1024        7035 us         6836 us          112
BM_Matrix_typemat_matmult<double, 24>/2048       14557 us        14375 us           50

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           21.9 us         22.0 us        32000
BM_Matrix_typemat_matmult<float, 3>/1024          45.2 us         44.9 us        16000
BM_Matrix_typemat_matmult<float, 3>/2048          87.5 us         87.9 us         7467
BM_Matrix_typemat_matmult<double, 3>/512          18.1 us         18.0 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024         36.9 us         36.9 us        19478
BM_Matrix_typemat_matmult<double, 3>/2048         72.7 us         71.5 us         8960
BM_Matrix_typemat_matmult<float, 6>/512            319 us          321 us         2240
BM_Matrix_typemat_matmult<float, 6>/1024           635 us          628 us         1120
BM_Matrix_typemat_matmult<float, 6>/2048          1303 us         1311 us          560
BM_Matrix_typemat_matmult<double, 6>/512           322 us          321 us         2240
BM_Matrix_typemat_matmult<double, 6>/1024          645 us          642 us         1120
BM_Matrix_typemat_matmult<double, 6>/2048         1286 us         1283 us          560
BM_Matrix_typemat_matmult<float, 24>/512          1715 us         1728 us          407
BM_Matrix_typemat_matmult<float, 24>/1024         3351 us         3294 us          204
BM_Matrix_typemat_matmult<float, 24>/2048         6725 us         6771 us           90
BM_Matrix_typemat_matmult<double, 24>/512         1766 us         1766 us          407
BM_Matrix_typemat_matmult<double, 24>/1024        3460 us         3446 us          195
BM_Matrix_typemat_matmult<double, 24>/2048        7244 us         7292 us           90

after (revised)
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512          22.5 us         22.5 us        32000
BM_Matrix_typemat_matmult<float, 3>/1024         44.3 us         44.9 us        16000
BM_Matrix_typemat_matmult<float, 3>/2048         87.9 us         87.9 us         7467
BM_Matrix_typemat_matmult<double, 3>/512         18.3 us         18.4 us        37333
BM_Matrix_typemat_matmult<double, 3>/1024        36.3 us         36.0 us        18667
BM_Matrix_typemat_matmult<double, 3>/2048        73.1 us         73.9 us        11200
BM_Matrix_typemat_matmult<float, 6>/512           322 us          322 us         2133
BM_Matrix_typemat_matmult<float, 6>/1024          645 us          645 us          896
BM_Matrix_typemat_matmult<float, 6>/2048         1304 us         1311 us          560
BM_Matrix_typemat_matmult<double, 6>/512          306 us          300 us         2240
BM_Matrix_typemat_matmult<double, 6>/1024         620 us          628 us         1120
BM_Matrix_typemat_matmult<double, 6>/2048        1247 us         1228 us          560
BM_Matrix_typemat_matmult<float, 24>/512         1674 us         1689 us          407
BM_Matrix_typemat_matmult<float, 24>/1024        3341 us         3374 us          213
BM_Matrix_typemat_matmult<float, 24>/2048        6723 us         6771 us           90
BM_Matrix_typemat_matmult<double, 24>/512        1752 us         1766 us          407
BM_Matrix_typemat_matmult<double, 24>/1024       3557 us         3526 us          195
BM_Matrix_typemat_matmult<double, 24>/2048       7238 us         7254 us          112

macOS, xcode 26, lto

before
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.06 us         1.06 us       652973
BM_Matrix_typemat_matmult<float, 3>/1024          2.10 us         2.10 us       335371
BM_Matrix_typemat_matmult<float, 3>/2048          4.20 us         4.20 us       164335
BM_Matrix_typemat_matmult<double, 3>/512          1.14 us         1.14 us       615249
BM_Matrix_typemat_matmult<double, 3>/1024         2.30 us         2.29 us       312962
BM_Matrix_typemat_matmult<double, 3>/2048         4.54 us         4.54 us       151194
BM_Matrix_typemat_matmult<float, 6>/512           6.41 us         6.41 us       109319
BM_Matrix_typemat_matmult<float, 6>/1024          12.8 us         12.8 us        54908
BM_Matrix_typemat_matmult<float, 6>/2048          25.2 us         25.1 us        27832
BM_Matrix_typemat_matmult<double, 6>/512          11.4 us         11.4 us        60546
BM_Matrix_typemat_matmult<double, 6>/1024         22.6 us         22.6 us        30222
BM_Matrix_typemat_matmult<double, 6>/2048         44.5 us         44.5 us        15488
BM_Matrix_typemat_matmult<float, 24>/512           294 us          294 us         2388
BM_Matrix_typemat_matmult<float, 24>/1024          588 us          588 us         1185
BM_Matrix_typemat_matmult<float, 24>/2048         1177 us         1177 us          598
BM_Matrix_typemat_matmult<double, 24>/512          604 us          604 us         1167
BM_Matrix_typemat_matmult<double, 24>/1024        1201 us         1201 us          582
BM_Matrix_typemat_matmult<double, 24>/2048        2416 us         2416 us          291

after
--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512           1.06 us         1.06 us       657339
BM_Matrix_typemat_matmult<float, 3>/1024          2.14 us         2.14 us       332844
BM_Matrix_typemat_matmult<float, 3>/2048          4.27 us         4.27 us       164750
BM_Matrix_typemat_matmult<double, 3>/512          1.13 us         1.13 us       610176
BM_Matrix_typemat_matmult<double, 3>/1024         2.30 us         2.30 us       311717
BM_Matrix_typemat_matmult<double, 3>/2048         4.50 us         4.50 us       157442
BM_Matrix_typemat_matmult<float, 6>/512           5.94 us         5.94 us       119149
BM_Matrix_typemat_matmult<float, 6>/1024          11.7 us         11.7 us        58265
BM_Matrix_typemat_matmult<float, 6>/2048          23.6 us         23.6 us        29901
BM_Matrix_typemat_matmult<double, 6>/512          16.3 us         16.3 us        42924
BM_Matrix_typemat_matmult<double, 6>/1024         32.5 us         32.5 us        21619
BM_Matrix_typemat_matmult<double, 6>/2048         64.5 us         64.5 us        10772
BM_Matrix_typemat_matmult<float, 24>/512           215 us          215 us         3213
BM_Matrix_typemat_matmult<float, 24>/1024          433 us          433 us         1616
BM_Matrix_typemat_matmult<float, 24>/2048          865 us          865 us          808
BM_Matrix_typemat_matmult<double, 24>/512          400 us          400 us         1753
BM_Matrix_typemat_matmult<double, 24>/1024         799 us          799 us          871
BM_Matrix_typemat_matmult<double, 24>/2048        1596 us         1596 us          438

after (revised)
-------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations
-------------------------------------------------------------------------------------
BM_Matrix_typemat_matmult<float, 3>/512          1.04 us         1.04 us       676786
BM_Matrix_typemat_matmult<float, 3>/1024         2.08 us         2.08 us       334509
BM_Matrix_typemat_matmult<float, 3>/2048         4.19 us         4.19 us       166882
BM_Matrix_typemat_matmult<double, 3>/512         1.11 us         1.11 us       625134
BM_Matrix_typemat_matmult<double, 3>/1024        2.28 us         2.28 us       307440
BM_Matrix_typemat_matmult<double, 3>/2048        4.43 us         4.43 us       157761
BM_Matrix_typemat_matmult<float, 6>/512          5.81 us         5.81 us       119244
BM_Matrix_typemat_matmult<float, 6>/1024         11.6 us         11.6 us        60613
BM_Matrix_typemat_matmult<float, 6>/2048         23.1 us         23.1 us        30321
BM_Matrix_typemat_matmult<double, 6>/512         16.0 us         16.0 us        43933
BM_Matrix_typemat_matmult<double, 6>/1024        31.7 us         31.7 us        22104
BM_Matrix_typemat_matmult<double, 6>/2048        63.9 us         63.8 us        11088
BM_Matrix_typemat_matmult<float, 24>/512          215 us          215 us         3266
BM_Matrix_typemat_matmult<float, 24>/1024         431 us          431 us         1624
BM_Matrix_typemat_matmult<float, 24>/2048         863 us          863 us          809
BM_Matrix_typemat_matmult<double, 24>/512         400 us          400 us         1743
BM_Matrix_typemat_matmult<double, 24>/1024        800 us          799 us          843
BM_Matrix_typemat_matmult<double, 24>/2048       1594 us         1594 us          429
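
To relate the tables to the TL;DR, the ratios can be recomputed from a few of the /512 rows. This is just a small arithmetic sketch; `speedup`, `timeChangePercent` and `printSummary` are hypothetical helper names, and the numbers are copied from the "before" and "after" tables above.

```cpp
#include <cassert>
#include <cstdio>

// Speedup factor: how many times faster the "after" run is (> 1 means faster).
constexpr double speedup(double beforeUs, double afterUs)
{
    return beforeUs / afterUs;
}

// Relative change of the run time in percent (positive means slower).
constexpr double timeChangePercent(double beforeUs, double afterUs)
{
    return (afterUs - beforeUs) / beforeUs * 100.0;
}

// Prints the ratios for a few /512 rows taken from the tables above.
inline void printSummary()
{
    std::printf("Ubuntu  <float, 24>:  x%.2f\n", speedup(2108.0, 522.0));
    std::printf("Ubuntu  <double, 24>: x%.2f\n", speedup(1878.0, 963.0));
    std::printf("Windows <float, 24>:  x%.2f\n", speedup(3469.0, 1715.0));
    std::printf("macOS   <float, 24>:  x%.2f\n", speedup(294.0, 215.0));
    std::printf("macOS   <double, 6>:  %+.0f%% run time\n", timeChangePercent(11.4, 16.3));
}
```

The Ubuntu float Mat24x24 case comes out at a factor of about 4, and the macOS double Mat6x6 case at a bit over 40% more run time.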

By submitting this pull request, I acknowledge that
I have read, understood, and agree to the SOFA Developer Certificate of Origin (DCO).

Reviewers will merge this pull-request only if

it builds with SUCCESS for all platforms on the CI.
it does not generate new warnings.
it does not generate new unit test failures.
it does not generate new scene test failures.
it does not break API compatibility.
it is more than 1 week old (or has fast-merge label).

alxbilger

You must initialize the result before calling the operator +=.

fredroy · 2026-02-04T03:14:43Z

You must initialize the result before calling the operator +=.

Done, and I re-ran the benchmarks (no change).
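
For context on the review remark: a product that is accumulated with `+=` only gives the right answer if the result starts from zero. The sketch below illustrates this under stated assumptions; `Matrix`, `accumulateProduct` and `multiply` are hypothetical names and not the code in Mat.h. A pre-filled buffer stands in for uninitialized memory, since actually reading uninitialized values would be undefined behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a fixed-size, row-major matrix (not sofa::type::Mat).
template <std::size_t L, std::size_t C, typename T>
using Matrix = std::array<std::array<T, C>, L>;

// Adds a*b on top of whatever r already contains.
// Precondition for r == a*b afterwards: r must hold zeros on entry.
template <std::size_t L, std::size_t K, std::size_t C, typename T>
void accumulateProduct(const Matrix<L, K, T>& a, const Matrix<K, C, T>& b,
                       Matrix<L, C, T>& r)
{
    for (std::size_t i = 0; i < L; ++i)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t j = 0; j < C; ++j)
                r[i][j] += a[i][k] * b[k][j];
}

// Safe wrapper: zero-initializes the result before accumulating into it.
template <std::size_t L, std::size_t K, std::size_t C, typename T>
Matrix<L, C, T> multiply(const Matrix<L, K, T>& a, const Matrix<K, C, T>& b)
{
    Matrix<L, C, T> r{}; // `{}` zero-fills; a plain `Matrix<L, C, T> r;` would not
    accumulateProduct(a, b, r);
    return r;
}
```

Calling `accumulateProduct` on a buffer that still holds old values adds the product on top of them, which is the failure mode the reviewer is pointing at.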

Sofa/framework/Type/src/sofa/type/Mat.h

fredroy added the "pr: enhancement" and "pr: status to review" labels Feb 2, 2026

alxbilger added the "pr: ai-generated" label (part or all of the PR has been generated with the help of an AI) Feb 3, 2026

alxbilger requested changes Feb 3, 2026

fredroy added 2 commits February 4, 2026 10:11

rewrite for better cache accesses

fa9a7a6

zero-initialize the result matrix

eb57d55

fredroy force-pushed the optim_mat_operator_mult branch from 0ca315f to eb57d55 February 4, 2026 01:11

fredroy requested a review from alxbilger February 4, 2026 02:24

bakpaul approved these changes Feb 4, 2026

Sofa/framework/Type/src/sofa/type/Mat.h (outdated, resolved)

really zero-init'ed

5672af4
