I ran SMART's code on my testbed and it brought a huge performance boost. My testbed uses a ConnectX-6 NIC and two Intel(R) Xeon(R) Gold 5218 CPUs.
When I wanted to replicate the performance gains of owrs, I wrote my own test code, which simply posts depth WRs and then polls for all of their completions (a simplified sketch is below). But I couldn't get the same performance gain with my code.
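Roughly, the inner loop of my test looks like the sketch below (simplified, not my exact code; it assumes an already-connected RC QP qp, its send CQ cq, a registered buffer buf with MR mr, and the peer's remote_addr/rkey, and it only uses IBV_SEND_INLINE for payloads within the 64-byte max_inline_data from my config):

/* Sketch only: post `depth` signaled RDMA WRITEs, then poll until all of
 * them complete.  Error handling is reduced to returning -1. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_depth_and_poll_all(struct ibv_qp *qp, struct ibv_cq *cq,
                                   struct ibv_mr *mr, void *buf,
                                   uint32_t msg_size, uint64_t remote_addr,
                                   uint32_t rkey, int depth)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = msg_size,
        .lkey   = mr->lkey,
    };

    /* Post `depth` outstanding WRs on the same QP. */
    for (int i = 0; i < depth; i++) {
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = (uint64_t)i;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        if (msg_size <= 64)                 /* max_inline_data in my config */
            wr.send_flags |= IBV_SEND_INLINE;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;
    }

    /* Poll the CQ until every posted WR has completed. */
    int done = 0;
    struct ibv_wc wc[16];
    while (done < depth) {
        int n = ibv_poll_cq(cq, 16, wc);
        if (n < 0)
            return -1;
        for (int k = 0; k < n; k++)
            if (wc[k].status != IBV_WC_SUCCESS)
                return -1;
        done += n;
    }
    return 0;
}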
Here are the throughputs at 8 bytes using SMART and my test code.
I turned off all of SMART's optimization options except thread_aware_alloc, and I made my test code as close as possible to the QP optimization and owrs optimization that SMART uses. But no matter what I do, I can't get a similar performance improvement above 24 threads and at depths above 8. Can you give me some idea of where SMART's performance improvement comes from?
Also, my testing found that the QP allocation optimization on the doorbell registers is not applied above 12 (which is the actual driver limit; I had turned off the preload), but the SMART code still gets a higher performance boost with shared_uuar set higher than 12. This is something I can't understand either. Here is the smart_config I am using:
{
  "infiniband": {
    "name": "",
    "port": 1,
    "gid_idx": 1
  },
  "qp_param": {
    "max_cqe_size": 256,
    "max_wqe_size": 256,
    "max_sge_size": 1,
    "max_inline_data": 64
  },
  "max_nodes": 128,
  "initiator_cache_size": 4096,
  "use_thread_aware_alloc": true,
  "thread_aware_alloc": {
    "total_uuar": 100,
    "shared_uuar": 96,
    "shared_cq": true
  },
  "use_work_req_throt": false,
  "work_req_throt": {
    "initial_credit": 4,
    "max_credit": 12,
    "credit_step": 2,
    "execution_epochs": 60,
    "sample_cycles": 19200000,
    "inf_credit_weight": 1.05,
    "auto_tuning": false
  },
  "use_conflict_avoidance": false,
  "use_speculative_lookup": false,
  "experimental": {
    "qp_sharing": false
  }
}
For the second issue, SMART modifies rdma_core (see the patch/ directory) so that the number of shared_uuars can exceed 12 (i.e., it becomes exactly the input value MLX5_NUM_SHARED_UUARS). The application uses the modified libibverbs by setting the LD_PRELOAD environment variable. You can try to use it in your program.
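If you want to double-check that the preloaded library is really the one being used, one quick diagnostic (a generic sketch, not part of SMART; dladdr and RTLD_DEFAULT are glibc extensions, and older glibc needs -ldl at link time) is to resolve ibv_open_device at runtime and print which shared object provides it:

#define _GNU_SOURCE          /* for RTLD_DEFAULT and dladdr() */
#include <dlfcn.h>
#include <stdio.h>

/* Prints the path of the shared object that currently provides
 * ibv_open_device.  With LD_PRELOAD pointing at the patched libibverbs,
 * this should print the patched library's path; otherwise it prints the
 * system libibverbs. */
static void report_libibverbs_path(void)
{
    Dl_info info;
    void *sym = dlsym(RTLD_DEFAULT, "ibv_open_device");

    if (sym != NULL && dladdr(sym, &info) && info.dli_fname != NULL)
        printf("ibv_open_device resolved from: %s\n", info.dli_fname);
    else
        printf("could not resolve ibv_open_device\n");
}

Calling report_libibverbs_path() early in your benchmark, before creating any verbs resources, makes it easy to see whether the LD_PRELOAD took effect for that run.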