Memcached Benchmark
The following describes the details of the Memcached benchmark so that it can be reproduced. Let us know if you find anything missing.
Raw data (operations per second):
CPU | Seastar Memcached with DPDK | Stock Memcached (multi-process) | Stock Memcached (multi-threaded) |
---|---|---|---|
2 | 553,175 | 350,844 | 321,287 |
4 | 1,021,918 | 615,270 | 573,149 |
6 | 1,703,790 | 857,428 | 709,502 |
8 | 2,149,162 | 1,102,417 | 741,356 |
10 | 2,629,885 | 1,335,069 | 608,014 |
12 | 2,870,919 | 1,528,598 | 608,968 |
14 | 3,217,044 | 1,726,642 | 440,658 |
16 | 3,460,167 | 1,887,060 | 603,479 |
18 | 4,049,397 | 2,167,573 | 902,192 |
20 | 4,426,457 | 2,281,064 | 1,128,469 |
As you can see, SeaStar's Memcache server is 4X faster than the stock threaded memcache. The latter suffers from various locking issues, especially the mutex_trylock busy-wait loop. In order to squeeze more performance out of stock memcache, we also ran it as multiple single-threaded processes that share nothing. This is not an entirely fair comparison, since memory is not shared and some responsibility and complexity is pushed onto the client; even with this approach, SeaStar outperforms stock memcache by 2X.
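To make the share-nothing configuration concrete, here is a minimal sketch of how such a setup can be launched; the port numbers, memory size, and core count are illustrative assumptions, not the exact invocation used in the benchmark:

```bash
# Assumed sketch: one single-threaded stock memcached per CPU, pinned to its
# own core with taskset and listening on its own port (11211, 11212, ...).
NCPU=8                                   # number of server cores in the run
for ((i = 0; i < NCPU; ++i)); do
    taskset -c $i memcached -u nobody -t 1 -m 1024 -p $((11211 + i)) &
done
```

The client must then distribute its keys across the per-core ports itself, which is the extra responsibility and complexity mentioned above.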
It is worth noting that SeaStar was designed for much more complex scenarios than memcache and should excel even more when a high level of parallelism is needed.
The stats were retrieved using graphite and the internal collectd client while running with 4 cores. The top right graph shows the packet coalescing rate: as the load increases (the top left graph shows the idle time shrinking to zero), each packet-processing round handles 30 packets.
The bottom right graph shows the number of tasks executed per core, 1,250,000/sec; remember this is the number of SeaStar tasks, not memcache requests. The bottom left graph shows the number of network packets each core handles (in this setup, between 200k/s and 250k/s).
Let's observe the difference between the various CPU-hogging functions using perf top. First, let's look at SeaStar's perf data with the native (DPDK) network stack. The kernel code is completely out of the picture. The most CPU-intensive function is the hash function, not surprisingly; memory deallocation and allocation come next.
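The profiles below were collected with perf top; an invocation along the following lines (an assumption, not necessarily the exact command used) attaches to the running server and samples the hottest on-CPU symbols, kernel included:

```bash
# Assumed invocation: sample the running memcached process in place and list
# the functions it spends the most CPU time in.
sudo perf top -p $(pidof memcached)
```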
Percent | Binary | Function |
---|---|---|
8.54% | memcached | boost::intrusive::hashtable_impl<boost::intrusive::mhtraits<memcache::item, |
4.67% | memcached | memory::cpu_pages::free |
3.74% | memcached | deleter::~deleter |
2.84% | memcached | promise<>::promise |
2.72% | memcached | rte_pktmbuf_alloc |
2.51% | memcached | ixgbe_xmit_pkts |
2.50% | memcached | memory::small_pool::allocate |
2.49% | libc-2.20.so | |
2.02% | memcached | dpdk::dpdk_qp::send |
1.95% | memcached | net::interface::dispatch_packet |
1.94% | memcached | promise<>::~promise |
1.82% | memcached | memcache_ascii_parser::parse |
1.74% | memcached | _ZNO6futureII11foreign_ptrIN5boost13intrusive_ptrIN8memcache4itemILb0EEEEEEEE6rescueIZN17smp_message_queue15async_work_itemIZN11distributedINS3_5cache |
1.71% | memcached | net::ipv4::get_packet |
1.33% | libc-2.20.so | |
1.33% | memcached | net::packet::packet |
1.24% | memcached | scattered_message::append_static<unsigned |
1.22% | memcached | memcache::ascii_protocol::handle |
1.22% | memcached | net::tcp<net::ipv4_traits>::tcb::output_one |
1.21% | memcached | future<temporary_buffer |
1.14% | memcached | net::packet::impl::allocate_if_needed |
1.10% | memcached | std::_Hashtable<net::l4connid<net::ipv4_traits>, |
1.09% | memcached | net::packet::share |
1.08% | memcached | smp_message_queue::process_queue<2ul, |
1.02% | memcached | memory::cpu_pages::allocate_small |
1.01% | memcached | net::packet::impl::allocate |
0.93% | memcached | memory::cpu_pages::translate |
0.91% | memcached | memcache::intrusive_ptr_release |
0.91% | memcached | dpdk::dpdk_qp::tx_buf_factory::get |
0.88% | memcached | _ZZN3net3tcpINS_11ipv4_traitsEEC4ERNS_7ipv4_l4ILNS_15ip_protocol_numE6EEEENUlvE_clEv |
0.82% | memcached | memory::allocate |
0.81% | memcached | memcache_ascii_parser::parse(char*, |
0.76% | memcached | promise<>::set_value |
0.74% | memcached | net::native_connected_socket_impl<net::tcp<net::ipv4_traits |
0.74% | memcached | net::ipv4::send(net::ipv4_address, |
0.70% | memcached | smp_message_queue::async_work_item<std::enable_if<is_future<future<foreign_ptr<boost::intrusive_ptr<memcache::item |
0.64% | memcached | net::native_connected_socket_impl<net::tcp<net::ipv4_traits |
0.61% | memcached | reactor::run_tasks |
0.61% | memcached | memory::free |
0.60% | memcached | _ZNSt17_Function_handlerIFNSt12experimental8optionalIN3net6packetEEEvEZNS2_9interfaceC4ESt10shared_ptrINS2_6deviceEEEUlvE0_E9_M_invokeERKSt9_Any_data |
0.60% | memcached | future_state<>::operator= |
0.55% | memcached | dpdk::dpdk_qp::tx_buf::reset_zc |
0.55% | memcached | net::tcp<net::ipv4_traits>::tcb::can_send |
0.53% | memcached | lw_shared_ptr<memcache::tcp_server::connection>::~lw_shared_ptr |
0.53% | memcached | _ZN6futureII11foreign_ptrIN5boost13intrusive_ptrIN8memcache4itemILb0EEEEEEEE4thenIZNS3_14ascii_protocolILb0EE10handle_getILb0EEES_IIEER13output_stream |
0.52% | memcached | net::tcp<net::ipv4_traits>::received |
0.52% | memcached | net::packet::allocate_headroom |
0.50% | memcached | do_until_continued<memcache::tcp_server::start()::{lambda()#1}::operator()() |
0.49% | memcached | _ZZN6futureIIEE4thenIZN3net28native_connected_socket_implINS2_3tcpINS2_11ipv4_traitsEEEE23native_data_source_impl3getEvEUlvE0_EENSt9result_ofIFT_vEE4t |
0.49% | memcached | foreign_ptr<boost::intrusive_ptr<memcache::item |
0.49% | memcached | net::tcp<net::ipv4_traits>::tcb::send |
Next is the perf top output of SeaStar memcached running on the POSIX (kernel) network stack. Now the kernel is all over the place. The hash function only shows up in 11th place!
Percent | Binary | Function |
---|---|---|
3.81% | [kernel] | ipt_do_table |
3.22% | [kernel] | copy_user_enhanced_fast_string |
2.74% | [kernel] | tcp_sendmsg |
2.34% | [kernel] | sock_poll |
2.25% | [kernel] | __nf_conntrack_find_get |
2.17% | [kernel] | _raw_spin_lock |
2.09% | [kernel] | __fget |
1.77% | [kernel] | _raw_spin_lock_irqsave |
1.67% | [kernel] | __skb_clone |
1.65% | memcached | memory::cpu_pages::free |
1.61% | memcached | boost::intrusive::hashtable_impl<boost::intrusive::mhtraits<memcache::item |
1.36% | [kernel] | reschedule_interrupt |
1.35% | [kernel] | nf_iterate |
1.32% | [kernel] | sock_has_perm |
1.18% | [kernel] | __ip_local_out |
1.17% | [kernel] | tcp_poll |
1.15% | memcached | memory::small_pool::allocate |
1.08% | [kernel] | tcp_packet |
1.06% | [kernel] | nf_conntrack_in |
1.04% | [kernel] | skb_entail |
1.04% | memcached | reactor::run_tasks |
1.03% | [kernel] | avc_has_perm |
1.03% | memcached | promise<temporary_buffer |
1.02% | [kernel] | sys_epoll_ctl |
0.97% | [kernel] | tcp_recvmsg |
0.91% | [kernel] | tcp_transmit_skb |
0.89% | libc-2.20.so | |
0.87% | [kernel] | __alloc_skb |
0.86% | memcached | memcache_ascii_parser::parse |
0.72% | memcached | promise<>::promise |
0.72% | [kernel] | _raw_spin_lock_bh |
0.70% | [kernel] | system_call_after_swapgs |
0.61% | [kernel] | tcp_rearm_rto |
0.60% | memcached | memory::cpu_pages::allocate_small |
0.57% | memcached | deleter::~deleter |
0.56% | [kernel] | __local_bh_enable_ip |
0.54% | [kernel] | selinux_file_permission |
0.54% | memcached | scattered_message::append_static<unsigned |
0.54% | [kernel] | tcp_v4_rcv |
0.51% | [kernel] | ip_queue_xmit |
0.51% | [kernel] | tcp_write_xmit |
0.50% | [kernel] | selinux_socket_sock_rcv_skb |
0.49% | memcached | reactor_backend_epoll::get_epoll_future |
0.49% | memcached | net::packet::packet |
0.48% | [kernel] | sock_def_readable |
0.46% | [kernel] | ip_finish_output |
0.46% | memcached | do_until_continued<memcache::tcp_server::start()::{lambda()#1}::operator()() |
0.45% | [kernel] | tcp_wfree |
0.44% | [kernel] | system_call |
Setup:
- Server 1: Memcache server
- Server 2: Memcache Client - memaslap
- Memcached version 1.4.17
- One single-threaded Memcached process per CPU
- Fetch dpdk from upstream (support for i40e is not sufficient in 1.8.0)
- update config/common_linuxapp:
  - set CONFIG_RTE_MBUF_REFCNT to 'n'
  - set CONFIG_RTE_MAX_MEMSEG=4096
- follow instructions from Seastar readme on DPDK installation for 1.8.0
- hugepages: define 2048,2048 pages (2048 per NUMA node; a sketch is shown after the server command below)
- compile seastar
- sudo build/release/apps/memcached/memcached --network-stack native --dpdk-pmd --dhcp 0 --host-ipv4-addr $seastar_ip --netmask-ipv4-addr 255.255.255.0 --collectd 0 --smp $cpu
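The hugepage reservation above can be done along these lines; this is a minimal sketch assuming two NUMA nodes, 2MB hugepages, and an illustrative mount point:

```bash
# Assumed sketch: reserve 2048 2MB hugepages on each NUMA node (the
# "2048,2048" above), then mount hugetlbfs so DPDK can use them.
echo 2048 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
echo 2048 | sudo tee /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
sudo mkdir -p /mnt/huge                      # mount point is illustrative
sudo mount -t hugetlbfs nodev /mnt/huge
```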
- memaslap from libmemcached-1.0.18
- Disable irqbalance
- Fix the irq smp_affinity of the 40Gb card so that each interrupt is handled by a single CPU (a sketch is shown after this list)
- for $cpu < 6 (server cores), run 12 memaslap instances:
  for ((i = 0; i < 12; ++i)); do taskset -c $i memaslap -s $seastar_ip:11211 -t 60s -T 1 -c 60 -X 64 & done
- for $cpu >= 6, run 52 memaslap instances:
  for ((i = 0; i < 52; ++i)); do taskset -c $i memaslap -s $seastar_ip:11211 -t 60s -T 1 -c 60 -X 64 & done
- verify there are no misses in each test - restart memcached for each test
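The irqbalance and IRQ-affinity step above can be scripted roughly as follows; the interface name is an assumption and the IRQ discovery shown is one common way to do it, not necessarily what was used here:

```bash
# Assumed sketch: stop irqbalance, then pin each interrupt of the 40Gb NIC
# to a single CPU, round-robin over the available cores.
sudo systemctl stop irqbalance

IFACE=enp2s0f0                               # assumed 40Gb interface name
NCPU=$(nproc)
cpu=0
for irq in $(grep "$IFACE" /proc/interrupts | awk -F: '{print $1}'); do
    # smp_affinity_list takes a CPU number; smp_affinity takes a hex mask.
    echo $cpu | sudo tee /proc/irq/$irq/smp_affinity_list > /dev/null
    cpu=$(( (cpu + 1) % NCPU ))
done
```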
Hardware: same as the HTTPD test.