ch4/shm: fix performance degradation on Sapphire Rapids with Intel Compiler #7150

Open: wants to merge 2 commits into base: main

@yfguo yfguo commented Sep 24, 2024

Pull Request Description

@zhenggb72 reported a performance degradation in inter-NUMA SHM communication compared to v4.2.2. The issue was introduced in #7046: MPICH v4.2.2 gets ~14us latency for a 64KB message, but latency rises to ~23us after #7046. Setting MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE=true solves the problem.

The issue is caused by a change in the memcpy operation. v4.2.2 uses non-temporal stores for both intra-NUMA and inter-NUMA SHM communication. This was changed to a regular memcpy when topo-aware is disabled. The memcpy was changed because non-temporal stores have higher latency in intra-NUMA communication on some architectures (see the results on Milan below). Non-temporal stores also have higher latency for small inter-NUMA messages on other architectures (skylake, cascade, icelake).
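For readers unfamiliar with the distinction, here is a minimal sketch of a non-temporal (streaming-store) copy next to a regular memcpy. It is not the MPICH or MPL implementation. The names `copy_nontemporal` and `shm_copy` and the `topo_enabled` / `same_numa` parameters are hypothetical, and the selection policy reflects one reading of the description above (regular memcpy when topo-aware is disabled or the peers share a NUMA node, non-temporal copy for inter-NUMA otherwise).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: copy n bytes with streaming (non-temporal) stores
 * when SSE2 is available, falling back to memcpy otherwise. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* _mm_stream_si128 requires a 16-byte aligned destination, so copy
     * the unaligned head with regular stores first. */
    size_t head = (16 - ((uintptr_t) d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Bulk of the copy: unaligned loads, cache-bypassing stores. */
    for (; n >= 16; d += 16, s += 16, n -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *) s);
        _mm_stream_si128((__m128i *) d, v);
    }

    /* Order the streaming stores before any later flag or counter
     * update that tells the receiver the data is ready. */
    _mm_sfence();

    memcpy(d, s, n);    /* remaining tail, fewer than 16 bytes */
#else
    memcpy(dst, src, n);
#endif
}

/* Hypothetical policy (an assumption, not taken from the MPICH source):
 * use the streaming copy only for inter-NUMA transfers when topo-aware
 * is enabled. */
static void shm_copy(void *dst, const void *src, size_t n,
                     int topo_enabled, int same_numa)
{
    if (!topo_enabled || same_numa)
        memcpy(dst, src, n);
    else
        copy_nontemporal(dst, src, n);
}
```

Streaming stores avoid pulling the destination lines into the sender's cache, which tends to help large transfers but can add latency for small ones, consistent with the small-message penalty described above.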

After more comprehensive testing on broadwell, skylake, cascade, icelake, sapphire rapids, and milan, I think it is probably OK to enable topo-aware by default, which would yield better performance on sapphire rapids and milan. Detailed numbers can be found in the following comments.

This PR also addresses another source of performance degradation observed when building with the Intel compiler. PR#7074 consolidated the SSE2 and AVX related optimization options into MPL's configure because only MPL explicitly uses them.
This change showed no performance degradation with the GNU compiler. But with Intel compilers, it does result in some performance degradation (see below). Therefore, we should add them back to the main configure. Currently, the main configure checks for availability of SSE2, AVX and AVX512F, and adds them to CFLAGS. The MPL configure will further check for the specific instructions that are used in MPL.
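One way to confirm which of these instruction sets a given set of CFLAGS actually enabled for a translation unit is to inspect the compiler's predefined macros. The sketch below assumes a GCC-compatible compiler (GCC, Clang, icx), which defines `__SSE2__`, `__AVX__` and `__AVX512F__` when the corresponding target options are in effect. The helper names and the `ISA_*` constants are hypothetical, and the exact flags MPICH's configure adds are not shown in this PR.

```c
#include <stdio.h>

/* Hypothetical bit flags for this example only. */
enum { ISA_SSE2 = 1, ISA_AVX = 2, ISA_AVX512F = 4 };

/* Report which instruction sets the compiler was allowed to use for
 * this file, based on its predefined macros. */
static int compiled_isa_mask(void)
{
    int mask = 0;
#ifdef __SSE2__
    mask |= ISA_SSE2;
#endif
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __AVX512F__
    mask |= ISA_AVX512F;
#endif
    return mask;
}

/* Print a one-line summary, for example from a small test program. */
static void print_compiled_isa(void)
{
    int mask = compiled_isa_mask();
    printf("SSE2=%d AVX=%d AVX512F=%d\n",
           (mask & ISA_SSE2) != 0,
           (mask & ISA_AVX) != 0,
           (mask & ISA_AVX512F) != 0);
}
```

Compiling this file once with the CFLAGS of the main configure and once with those of the MPL configure would show whether code outside MPL is built with the wider instruction sets, which is the difference this change is about.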

All raw numbers:
2024-shm_bench-arch_comparison.xlsx

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

Fix the performance degradation on Intel Sapphire Rapids after
introducing topo-aware SHM. This problem only happens when building
with the Intel compiler. The problem was that topo-aware defaulted
to disabled, so regular memcpy is used for inter-NUMA messages, which
is different from v4.2.2 (which uses non-temporal copy).

Topo-aware was disabled by default because non-temporal copy
results in higher latency for small messages. After more testing
with different CPUs (broadwell, skylake, cascade, icelake, milan),
it seems only skylake, cascade and icelake have this issue with small
messages. It is probably OK to make topo-aware SHM default to enabled.

yfguo commented Sep 24, 2024

Inter-NUMA, Sunspot, icx

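| Message size (bytes) | main | stable422 | this PR |
| --- | --- | --- | --- |
| 1 | 1.52 | 1.46 | 1.46 |
| 2 | 1.52 | 1.46 | 1.46 |
| 4 | 1.52 | 1.46 | 1.46 |
| 8 | 1.52 | 1.46 | 1.45 |
| 16 | 1.52 | 1.46 | 1.46 |
| 32 | 1.52 | 1.46 | 1.46 |
| 64 | 1.69 | 1.66 | 1.68 |
| 128 | 1.71 | 1.67 | 1.75 |
| 256 | 1.74 | 1.7 | 1.79 |
| 512 | 1.77 | 1.84 | 1.92 |
| 1024 | 1.84 | 1.91 | 1.99 |
| 2048 | 2.2 | 2.02 | 2.14 |
| 4096 | 3.03 | 2.39 | 2.4 |
| 8192 | 4.42 | 3.13 | 2.9 |
| 16384 | 9.67 | 6.9 | 6.24 |
| 32768 | 13.99 | 9.49 | 8.29 |
| 65536 | 23.08 | 14.18 | 12.04 |
| 131072 | 39.61 | 23.52 | 19.64 |
| 262144 | 70.3 | 41.65 | 34.58 |
| 524288 | 128.28 | 85.07 | 64.73 |
| 1048576 | 247.39 | 213.87 | 170.08 |
| 2097152 | 524 | 438.87 | 294.68 |
| 4194304 | 1086.84 | 822.32 | 564.34 |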
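As a worked example of reading these numbers, the 64KB (65536-byte) row can be turned into relative changes. The helper name `pct_change` is hypothetical; the three latencies are copied from the inter-NUMA table above.

```c
#include <stdio.h>

/* Relative change of `after` versus `before`, in percent.
 * Negative values mean lower latency. */
static double pct_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* Print the 64KB inter-NUMA comparison using the reported latencies. */
static void print_64k_summary(void)
{
    const double main_us = 23.08;       /* main */
    const double stable_us = 14.18;     /* stable422 */
    const double pr_us = 12.04;         /* this PR */

    printf("main vs stable422:    %+.1f%%\n", pct_change(stable_us, main_us));
    printf("this PR vs main:      %+.1f%%\n", pct_change(main_us, pr_us));
    printf("this PR vs stable422: %+.1f%%\n", pct_change(stable_us, pr_us));
}
```

For this row, main is roughly 63% slower than stable422, and this PR is roughly 48% faster than main.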

yfguo commented Sep 24, 2024

Intra-NUMA, Sunspot, icx

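| Message size (bytes) | main | stable422 | this PR |
| --- | --- | --- | --- |
| 1 | 0.84 | 0.76 | 0.79 |
| 2 | 0.84 | 0.76 | 0.79 |
| 4 | 0.84 | 0.76 | 0.79 |
| 8 | 0.84 | 0.76 | 0.79 |
| 16 | 0.84 | 0.76 | 0.79 |
| 32 | 0.84 | 0.76 | 0.8 |
| 64 | 0.9 | 0.84 | 0.87 |
| 128 | 0.92 | 0.85 | 0.88 |
| 256 | 0.95 | 0.88 | 0.91 |
| 512 | 0.99 | 1.04 | 0.95 |
| 1024 | 1.05 | 1.09 | 1.01 |
| 2048 | 1.27 | 1.26 | 1.22 |
| 4096 | 1.75 | 1.59 | 1.69 |
| 8192 | 2.52 | 2.24 | 2.56 |
| 16384 | 5.49 | 4.75 | 5.75 |
| 32768 | 7.37 | 6.92 | 7.19 |
| 65536 | 11.5 | 11.08 | 11.1 |
| 131072 | 19.48 | 18.92 | 18.3 |
| 262144 | 33.49 | 33.32 | 30.51 |
| 524288 | 65.09 | 58.02 | 65.58 |
| 1048576 | 135.79 | 117.81 | 139.51 |
| 2097152 | 261.86 | 258.09 | 216.93 |
| 4194304 | 539.92 | 482.54 | 477.29 |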

yfguo commented Sep 24, 2024

AVX in MPICH configure vs AVX in MPL configure, intra-NUMA, Sunspot, icx

| Message size (bytes) | MPICH configure | MPL configure |
| --- | --- | --- |
| 1 | 0.84 | 0.62 |
| 2 | 0.84 | 0.6 |
| 4 | 0.84 | 0.6 |
| 8 | 0.84 | 0.6 |
| 16 | 0.84 | 0.59 |
| 32 | 0.84 | 0.6 |
| 64 | 0.9 | 0.61 |
| 128 | 0.92 | 0.68 |
| 256 | 0.95 | 0.72 |
| 512 | 0.99 | 0.8 |
| 1024 | 1.05 | 0.94 |
| 2048 | 1.27 | 1.11 |
| 4096 | 1.75 | 1.51 |
| 8192 | 2.52 | 2.22 |
| 16384 | 5.49 | 5.2 |
| 32768 | 7.37 | 7.06 |
| 65536 | 11.5 | 11.08 |
| 131072 | 19.48 | 18.96 |
| 262144 | 33.49 | 32.38 |
| 524288 | 65.09 | 55.74 |
| 1048576 | 135.79 | 108.36 |
| 2097152 | 261.86 | 236.55 |
| 4194304 | 539.92 | 506.49 |

Inter-NUMA numbers have similar differences.

@yfguo yfguo assigned hzhou and unassigned hzhou Sep 24, 2024
@yfguo yfguo requested a review from hzhou September 24, 2024 21:42
Previous PR#7074 consolidated the SSE2 and AVX related optimization
options into MPL's configure because only MPL explicitly uses them.
This change showed no performance degradation with the GNU compiler.
But with Intel compilers, it does result in some performance
degradation. Therefore, we should add them back to the main
configure. Currently, the main configure checks for availability
of SSE2, AVX and AVX512F, and adds them to CFLAGS. The MPL configure
will further check for the specific instructions that are used in MPL.

yfguo commented Sep 24, 2024

test:mpich/ch4/ofi

yfguo commented Sep 24, 2024

Inter-NUMA, TOPO enabled vs disabled, Intel Compiler.
Note the higher latency for (< 4KB) in skylake-icelake for TOPO disabled.

broadwell skylake cascade icelake sapphire rapids

| Message size (bytes) | topo enabled | topo disabled | topo enabled | topo disabled | topo enabled |
| --- | --- | --- | --- | --- | --- |
| 1 | 0.97 | 1.02 | 1.22 | 1.03 | 1.16 |
| 2 | 0.92 | 0.97 | 1.21 | 1.02 | 1.16 |
| 4 | 0.89 | 0.93 | 1.22 | 1.02 | 1.15 |
| 8 | 0.87 | 0.91 | 1.21 | 1.01 | 1.15 |
| 16 | 0.86 | 0.89 | 1.21 | 1.02 | 1.14 |
| 32 | 0.85 | 0.88 | 1.23 | 1.02 | 1.17 |
| 64 | 0.92 | 0.97 | 1.31 | 1.13 | 1.24 |
| 128 | 1 | 1 | 1.51 | 1.25 | 1.42 |
| 256 | 1.03 | 1.05 | 1.47 | 1.31 | 1.29 |
| 512 | 1.14 | 1.1 | 1.75 | 1.34 | 1.59 |
| 1024 | 1.24 | 1.24 | 1.85 | 1.41 | 1.63 |
| 2048 | 1.4 | 1.64 | 2.03 | 1.76 | 1.75 |
| 4096 | 1.7 | 2.2 | 2.19 | 2.4 | 1.94 |
| 8192 | 2.43 | 3.4 | 2.85 | 3.8 | 2.54 |
| 16384 | 5.53 | 8.27 | 6.34 | 8.33 | 5.85 |
| 32768 | 8.36 | 12.47 | 8.38 | 11.82 | 7.85 |
| 65536 | 13 | 20.79 | 12.4 | 19.35 | 11.64 |
| 131072 | 23.88 | 38.23 | 20.75 | 33.09 | 19.61 |
| 262144 | 44.44 | 72.53 | 37.51 | 58.67 | 35.56 |
| 524288 | 81.54 | 139.62 | 69.87 | 110.64 | 66.8 |
| 1048576 | 160.64 | 273.81 | 136.96 | 221.46 | 131.96 |
| 2097152 | 320.4 | 542.7 | 256.93 | 452.78 | 247.88 |
| 4194304 | 623.81 | 1080.62 | 500.76 | 916.99 | 493.26 |

@yfguo yfguo mentioned this pull request Sep 26, 2024