OWRS Performance Enhancement #3

Open
minxinhao opened this issue May 17, 2024 · 1 comment

Comments


minxinhao commented May 17, 2024

I ran smart's code on my testbed, and it delivered a huge performance boost. My testbed uses a ConnectX-6 NIC and two Intel(R) Xeon(R) Gold 5218 CPUs.
To replicate the performance gains of OWRS, I wrote my own test code, which simply posts depth work requests (WRs) and then polls for all of their completions. But I couldn't get the same performance gain with my code.
Here are the throughputs at 8 bytes using smart and my test code.
[Figure: smart throughput]
[Figure: post_and_poll throughput]

I turned off all of smart's optimization options except thread_aware_alloc, and made my test code match the QP optimization and OWRS optimization that smart uses as closely as possible. But no matter what I try, I can't get a similar performance improvement above 24 threads and above a depth of 8. Can you give me some idea of where the performance improvement in smart comes from?
Also, my testing found that the QP allocation optimization on the doorbell registers has no effect above 12 (which is the actual driver limit) when preload is turned off, yet the smart code still gets a larger performance boost with shared_uuar values above 12. I can't explain this either.
Here is the smart_config I am using.

{
  "infiniband": {
    "name": "",
    "port": 1,
    "gid_idx": 1
  },

  "qp_param": {
    "max_cqe_size": 256,
    "max_wqe_size": 256,
    "max_sge_size": 1,
    "max_inline_data": 64
  },

  "max_nodes": 128,
  "initiator_cache_size": 4096,

  "use_thread_aware_alloc": true,
  "thread_aware_alloc": {
    "total_uuar": 100,
    "shared_uuar": 96,
    "shared_cq": true
  },

  "use_work_req_throt": false,
  "work_req_throt": {
    "initial_credit": 4,
    "max_credit": 12,
    "credit_step": 2,
    "execution_epochs": 60,
    "sample_cycles": 19200000,
    "inf_credit_weight": 1.05,
    "auto_tuning": false
  },

  "use_conflict_avoidance": false,
  "use_speculative_lookup": false,

  "experimental": {
    "qp_sharing": false
  }
}
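
As a quick sanity check on a config like the one above, the following is a minimal sketch (not part of smart) that lists which top-level optimization flags are switched on and whether shared_uuar is above the limit of 12 mentioned in the post. The helper names enabled_features and exceeds_stock_limit are hypothetical, and the embedded text is only a subset of the posted config.

```python
import json

# Subset of the smart_config posted above (only the fields this check reads).
CONFIG_TEXT = """
{
  "use_thread_aware_alloc": true,
  "thread_aware_alloc": {"total_uuar": 100, "shared_uuar": 96, "shared_cq": true},
  "use_work_req_throt": false,
  "use_conflict_avoidance": false,
  "use_speculative_lookup": false,
  "experimental": {"qp_sharing": false}
}
"""

# Assumption: 12 is the stock driver limit quoted in the issue text.
STOCK_SHARED_UUAR_LIMIT = 12


def enabled_features(cfg):
    """Return the names of the top-level use_* flags that are set to true."""
    flags = (
        "use_thread_aware_alloc",
        "use_work_req_throt",
        "use_conflict_avoidance",
        "use_speculative_lookup",
    )
    return [name for name in flags if cfg.get(name, False)]


def exceeds_stock_limit(cfg, limit=STOCK_SHARED_UUAR_LIMIT):
    """True if shared_uuar asks for more than the given limit."""
    shared = cfg.get("thread_aware_alloc", {}).get("shared_uuar", 0)
    return shared > limit


cfg = json.loads(CONFIG_TEXT)
print("enabled:", enabled_features(cfg))
print("shared_uuar above stock limit:", exceeds_stock_limit(cfg))
```

With the posted values, only use_thread_aware_alloc is reported as enabled, and shared_uuar (96) is flagged as above 12.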

minxinhao changed the title from "OWRS性能提升" to "OWRS Performance Enhancement" on May 17, 2024
@alogfans
Collaborator

  • Have you enabled Hyper-Threading? Our tests assume that the thread count is no greater than the core count.
  • You can refer to https://github.com/madsys-dev/smart/blob/master/smart/initiator.cpp to see how we implemented it. You can simply skip the branches that are taken when TaskPool is enabled (i.e., read the code as if TaskPool::IsEnabled() == false).
  • For the second issue, SMART modifies rdma_core (see the patch/ directory) so that the number of shared_uuars can exceed 12 (i.e., it is exactly the input value MLX5_NUM_SHARED_UUARS). The application uses the modified libibverbs by specifying the LD_PRELOAD environment variable. You can try using it in your program.
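
The LD_PRELOAD step in the last point could be scripted roughly as follows. This is a minimal sketch under stated assumptions: the library path is a placeholder, the helper build_preload_env is hypothetical, and it assumes MLX5_NUM_SHARED_UUARS is read from the process environment, which the reply suggests but does not state outright.

```python
import os


def build_preload_env(lib_path, shared_uuars, base_env=None):
    """Return a copy of the environment with LD_PRELOAD and the UUAR count set."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # LD_PRELOAD accepts a colon-separated list; put the patched library first.
    env["LD_PRELOAD"] = lib_path if not existing else f"{lib_path}:{existing}"
    # Assumption: the patched library picks this value up from the environment.
    env["MLX5_NUM_SHARED_UUARS"] = str(shared_uuars)
    return env


# Placeholder path: replace with wherever the patched library was built.
PATCHED_LIB = "/path/to/patched/libibverbs.so"

env = build_preload_env(PATCHED_LIB, 96, base_env={})
print(env["LD_PRELOAD"])
print(env["MLX5_NUM_SHARED_UUARS"])
```

The resulting dictionary can be passed as the env argument of subprocess.run when launching the benchmark binary, so the preload applies only to that process.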
