ch4/shm: fix performance degradation on Sapphire Rapids with Intel Compiler #7150

yfguo · 2024-09-24T21:34:09Z

Pull Request Description

@zhenggb72 reported a performance degradation in inter-NUMA SHM communication compared to v4.2.2. The issue was introduced in #7046. MPICH v4.2.2 achieved ~14us latency for a 64KB message, but latency rose to ~23us after #7046. Setting MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE=true solves the problem.

The issue is caused by a change in the memcpy operation. v4.2.2 uses non-temporal stores for both intra-NUMA and inter-NUMA SHM communication. This was changed to regular memcpy when topo-aware is disabled. The memcpy was changed because non-temporal stores have higher latency for intra-NUMA communication on some architectures (see the Milan results below). Non-temporal stores also have higher latency for small inter-NUMA messages on other architectures (skylake, cascade, icelake).
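For readers unfamiliar with the two copy paths, here is a minimal sketch of a non-temporal copy next to a regular memcpy. It is not the MPL implementation: `copy_nontemporal` and `copy_for_peer` are hypothetical names, the sketch uses plain SSE2 intrinsics only, and the selection rule in `copy_for_peer` (non-temporal only for inter-NUMA peers when topo-aware is enabled) is an assumption based on the description above.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy n bytes using streaming (non-temporal) stores,
 * which write to memory without filling the local cache. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* _mm_stream_si128 needs a 16-byte aligned destination, so copy the
     * unaligned head with a regular memcpy first. */
    size_t head = (16 - ((uintptr_t) d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    /* Main loop: unaligned load from the source, streaming store to dst. */
    for (; n >= 16; d += 16, s += 16, n -= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *) s);
        _mm_stream_si128((__m128i *) d, v);
    }

    /* Make the streaming stores visible before any later flag update. */
    _mm_sfence();

    /* Tail shorter than one vector. */
    memcpy(d, s, n);
}

/* Assumed selection rule, for illustration only: regular memcpy unless
 * topo-aware is enabled and the peer sits on a different NUMA node. */
static void copy_for_peer(void *dst, const void *src, size_t n,
                          int topo_enabled, int same_numa)
{
    if (topo_enabled && !same_numa)
        copy_nontemporal(dst, src, n);
    else
        memcpy(dst, src, n);
}
```

With topo-aware disabled, this sketch always takes the regular memcpy branch, which is the behavior that regressed inter-NUMA latency on Sapphire Rapids.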

After more comprehensive testing on broadwell, skylake, cascade, icelake, sapphire rapids, and milan, I think it is probably OK to enable topo-aware by default, which would yield better performance on sapphire rapids and milan. Detailed numbers can be found in the following comments.

This PR also addresses another source of performance degradation observed when building with the Intel compiler. PR#7074 consolidated the SSE2 and AVX related optimization options into MPL's configure because only MPL explicitly uses them.
This change showed no performance degradation with the GNU compiler. With Intel compilers, however, it does result in some performance degradation (see below). Therefore, we should add them back to the main configure. Currently, the main configure checks for the availability of SSE2, AVX and AVX512F, and adds them to CFLAGS. The MPL configure further checks for the specific instructions that are used in MPL.
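One way to see whether a given source file was actually built with these options is to look at the compiler's predefined macros. The sketch below is not part of the PR or of MPICH's configure; `compiled_isa_mask` and the `ISA_*` names are hypothetical. It relies only on `__SSE2__`, `__AVX__` and `__AVX512F__`, which GCC and Clang-based compilers define when the corresponding instruction set is enabled for the translation unit.

```c
/* Hypothetical bit flags for the instruction sets the main configure checks. */
enum { ISA_SSE2 = 1 << 0, ISA_AVX = 1 << 1, ISA_AVX512F = 1 << 2 };

/* Report which of the three instruction sets this translation unit was
 * compiled with, based on the compiler's predefined macros. */
static unsigned compiled_isa_mask(void)
{
    unsigned mask = 0;
#ifdef __SSE2__
    mask |= ISA_SSE2;
#endif
#ifdef __AVX__
    mask |= ISA_AVX;
#endif
#ifdef __AVX512F__
    mask |= ISA_AVX512F;
#endif
    return mask;
}
```

Calling this from a file in the main MPICH tree and from a file in MPL would show whether the two sets of CFLAGS differ, which is the situation described above.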

All raw numbers:
2024-shm_bench-arch_comparison.xlsx

Author Checklist

Provide Description
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits Follow Good Practice
Commits are self-contained and do not do two things at once.
Commit message is of the form: module: short description
Commit message explains what's in the commit.
Passes All Tests
Whitespace checker. Warnings test. Additional tests via comments.
Contribution Agreement
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.

Fix the performance degradation on Intel Sapphire Rapids after introducing topo-aware SHM. This problem only happens when building with the Intel compiler. The problem was that topo-aware defaulted to disabled, so regular memcpy was used for inter-NUMA messages, which differs from v4.2.2 (which uses non-temporal copy). Topo-aware was disabled by default because non-temporal copy results in higher latency for small messages. After more testing with different CPUs (broadwell, skylake, cascade, icelake, milan), it seems only skylake, cascade and icelake have this issue with small messages. It is probably OK to make topo-aware SHM default to enabled.

yfguo · 2024-09-24T21:36:21Z

Inter-NUMA, Sunspot, icx

Message size (bytes)	main (us)	stable422 (us)	this PR (us)
1	1.52	1.46	1.46
2	1.52	1.46	1.46
4	1.52	1.46	1.46
8	1.52	1.46	1.45
16	1.52	1.46	1.46
32	1.52	1.46	1.46
64	1.69	1.66	1.68
128	1.71	1.67	1.75
256	1.74	1.7	1.79
512	1.77	1.84	1.92
1024	1.84	1.91	1.99
2048	2.2	2.02	2.14
4096	3.03	2.39	2.4
8192	4.42	3.13	2.9
16384	9.67	6.9	6.24
32768	13.99	9.49	8.29
65536	23.08	14.18	12.04
131072	39.61	23.52	19.64
262144	70.3	41.65	34.58
524288	128.28	85.07	64.73
1048576	247.39	213.87	170.08
2097152	524	438.87	294.68
4194304	1086.84	822.32	564.34
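To read the table as relative changes, a small helper like the one below can be used. It is only an illustration of the arithmetic; `latency_reduction_pct` is a hypothetical name and is not part of the benchmark.

```c
/* Percentage by which new_us improves on baseline_us (positive = faster).
 * Assumes baseline_us > 0. */
static double latency_reduction_pct(double baseline_us, double new_us)
{
    return 100.0 * (baseline_us - new_us) / baseline_us;
}
```

For the 65536-byte row, `latency_reduction_pct(23.08, 12.04)` gives roughly a 48% reduction relative to main, and `latency_reduction_pct(14.18, 12.04)` gives roughly 15% relative to stable422.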

yfguo · 2024-09-24T21:37:16Z

Intra-NUMA, Sunspot, icx

Message size (bytes)	main (us)	stable422 (us)	this PR (us)
1	0.84	0.76	0.79
2	0.84	0.76	0.79
4	0.84	0.76	0.79
8	0.84	0.76	0.79
16	0.84	0.76	0.79
32	0.84	0.76	0.8
64	0.9	0.84	0.87
128	0.92	0.85	0.88
256	0.95	0.88	0.91
512	0.99	1.04	0.95
1024	1.05	1.09	1.01
2048	1.27	1.26	1.22
4096	1.75	1.59	1.69
8192	2.52	2.24	2.56
16384	5.49	4.75	5.75
32768	7.37	6.92	7.19
65536	11.5	11.08	11.1
131072	19.48	18.92	18.3
262144	33.49	33.32	30.51
524288	65.09	58.02	65.58
1048576	135.79	117.81	139.51
2097152	261.86	258.09	216.93
4194304	539.92	482.54	477.29

yfguo · 2024-09-24T21:38:53Z

AVX in MPICH configure vs AVX in MPL configure, intra-NUMA, Sunspot, icx

Message size (bytes)	MPICH configure (us)	MPL configure (us)
1	0.84	0.62
2	0.84	0.6
4	0.84	0.6
8	0.84	0.6
16	0.84	0.59
32	0.84	0.6
64	0.9	0.61
128	0.92	0.68
256	0.95	0.72
512	0.99	0.8
1024	1.05	0.94
2048	1.27	1.11
4096	1.75	1.51
8192	2.52	2.22
16384	5.49	5.2
32768	7.37	7.06
65536	11.5	11.08
131072	19.48	18.96
262144	33.49	32.38
524288	65.09	55.74
1048576	135.79	108.36
2097152	261.86	236.55
4194304	539.92	506.49

Inter-NUMA numbers have similar differences.

The previous PR#7074 consolidated the SSE2 and AVX related optimization options into MPL's configure because only MPL explicitly uses them. This change showed no performance degradation with the GNU compiler. With Intel compilers, however, it does result in some performance degradation. Therefore, we should add them back to the main configure. Currently, the main configure checks for the availability of SSE2, AVX and AVX512F, and adds them to CFLAGS. The MPL configure further checks for the specific instructions that are used in MPL.

yfguo · 2024-09-24T21:46:20Z

test:mpich/ch4/ofi

yfguo · 2024-09-24T21:54:28Z

Inter-NUMA, TOPO enabled vs disabled, Intel Compiler.
Note the higher latency for (< 4KB) in skylake-icelake for TOPO disabled.

	broadwell	skylake	cascade	icelake	sapphire rapids
	topo enabled	topo disabled	topo enabled	topo disabled	topo enabled
1	0.97	1.02	1.22	1.03	1.16
2	0.92	0.97	1.21	1.02	1.16
4	0.89	0.93	1.22	1.02	1.15
8	0.87	0.91	1.21	1.01	1.15
16	0.86	0.89	1.21	1.02	1.14
32	0.85	0.88	1.23	1.02	1.17
64	0.92	0.97	1.31	1.13	1.24
128	1	1	1.51	1.25	1.42
256	1.03	1.05	1.47	1.31	1.29
512	1.14	1.1	1.75	1.34	1.59
1024	1.24	1.24	1.85	1.41	1.63
2048	1.4	1.64	2.03	1.76	1.75
4096	1.7	2.2	2.19	2.4	1.94
8192	2.43	3.4	2.85	3.8	2.54
16384	5.53	8.27	6.34	8.33	5.85
32768	8.36	12.47	8.38	11.82	7.85
65536	13	20.79	12.4	19.35	11.64
131072	23.88	38.23	20.75	33.09	19.61
262144	44.44	72.53	37.51	58.67	35.56
524288	81.54	139.62	69.87	110.64	66.8
1048576	160.64	273.81	136.96	221.46	131.96
2097152	320.4	542.7	256.93	452.78	247.88
4194304	623.81	1080.62	500.76	916.99	493.26

yfguo assigned hzhou and unassigned hzhou Sep 24, 2024

yfguo requested a review from hzhou September 24, 2024 21:42

yfguo force-pushed the fix-shm-perf branch from 5eb6704 to e0710c7 Compare September 24, 2024 21:45

yfguo mentioned this pull request Sep 26, 2024

mpl: pass MPL's CFLAGS back to MPICH #7152

Merged
