[performance] Performance degradation over vitastor 32k #22
Comments
I don't think so... There is a special meaning for 128k: it's the block (object) size. I.e. writes smaller than 128k are first written to the journal (so they result in additional write amplification), and writes starting with 128k are written using redirect-write. |
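A toy model of the rule described in this comment, assuming the only thing that matters is the write size relative to the 128k object size. The function name `write_path` is hypothetical and this is not Vitastor's actual code, which may also take alignment and other conditions into account.

```python
OBJECT_SIZE = 128 * 1024  # block (object) size from the comment, in bytes


def write_path(size_bytes: int, object_size: int = OBJECT_SIZE) -> str:
    """Classify a write by size only (simplifying assumption).

    Writes smaller than the object size go through the journal first,
    which adds write amplification; writes of at least the object size
    use redirect-write.
    """
    if size_bytes <= 0:
        raise ValueError("write size must be positive")
    return "redirect-write" if size_bytes >= object_size else "journal"


for kib in (4, 32, 64, 128, 256):
    print(f"{kib}k -> {write_path(kib * 1024)}")
```

Under this model 32k and 64k writes take the same journal path as 4k writes, which is consistent with the reply that 32k itself has no special meaning.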
I have 3 physical machines, each with 1 NVMe drive, and the 3 machines also form an etcd cluster. OSD parameters:
fio parameter
|
ceph
vitastor
|
First of all, I think you should remove the --flusher_count option. In new versions it doesn't require manual tuning because it's auto-tuned between 1 and max_flusher_count, and --flusher_count actually sets max_flusher_count. |
Thank you, this result is actually the result of not adding the |
The following are the T1Q1 test results:
In the README your latency reaches 0.14 ms. How can I get the same result?
Is the difference between 0.32 ms and 0.14 ms related to RDMA? |
Best latency is achieved when you disable powersaving: cpupower idle-set -D 0 && cpupower frequency-set --governor performance. You can also pin each vitastor OSD to 1 core and disable powersaving only for these cores. If it doesn't help then your network may be slow. Regarding writes, I still think your performance is too low. What if you run 1 fio job with rw=write bs=4M iodepth=4? |
0.14 ms was achieved without RDMA. |
After switching the CPU to performance mode, the improvement is not obvious.
Your CPU is
|
Yes, 2 OSDs per NVMe should help bandwidth with large block sizes. Vitastor OSD is single-threaded, so you're probably just hitting the single-thread bandwidth limit due to memory copying. RDMA would also help because it eliminates 1 of the 2 extra copies. DPDK could also help, but I haven't found usable TCP (or something else...) implementations for it, so it's not possible to implement it yet. |
You can also check your network with sockperf. |
In fact, IOPS is 3 times higher with CPU performance mode turned on than with it turned off. I had already turned it on before, and both of the results above were measured with CPU performance mode on. What should I do about interrupts, NUMA, and enabling/disabling irqbalance? sockperf result:
|
You have 65 us one-way latency, so your roundtrip time is 130 us. Theoretical latency of vitastor for replicated writes is 2 RTT + 1 disk write = 260 us + disk. Disk is, I think, around 10-40 us. So around 270-300 us in total = 3300-3700 iops T1Q1... That's similar to what you get. :-) Regarding interrupts etc., here is a good article: https://blog.cloudflare.com/how-to-achieve-low-latency/ You can also try to disable offloads and interrupt coalescing (in case they're enabled), like
Great, I used tuned and installed irqbalance, which reduced the network latency between vitastor hosts to 22 usec.
Adjusting the network card interrupts: |
Latency from inside the VM doesn't matter because the vitastor client connects to OSDs from the host. |
Thanks, I get it. Then I am going to split an SSD into 2 OSDs to see if the latency will be reduced. |
Q1 latency won't be reduced with split, but throughput with large iodepth should improve, yeah :-) |
When
|
3300 -> 3600 iops isn't much of a latency reduction :) Something in the network is probably still slow... |
You can check what the OSDs say in their logs about average latency. |
So in your case 'write_stable op latency' is around 60us (disk), and 'write_stable subop latency' is 160-200us, so RTT is still 100-140us. Why is it so slow? What network do you use? |
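The latency arithmetic used in this thread can be sketched in a few lines of Python: the replicated-write estimate of 2 RTT + 1 disk write, the resulting T1Q1 IOPS, and the reverse step of reading the RTT off the OSD log as subop latency minus disk latency. The function names are hypothetical and the model is only the back-of-the-envelope estimate from the comments, not a measurement tool.

```python
def replicated_write_latency_us(rtt_us: float, disk_us: float) -> float:
    """Estimated replicated write latency: 2 network round trips + 1 disk write."""
    return 2 * rtt_us + disk_us


def t1q1_iops(latency_us: float) -> float:
    """With 1 thread and queue depth 1, IOPS is the inverse of per-op latency."""
    return 1_000_000 / latency_us


def implied_rtt_us(subop_latency_us: float, disk_us: float) -> float:
    """RTT implied by the OSD log: subop latency minus the disk part."""
    return subop_latency_us - disk_us


# Forward estimate from sockperf: 65 us one-way, so RTT = 130 us.
rtt = 2 * 65
for disk in (10, 40):
    latency = replicated_write_latency_us(rtt, disk)
    print(f"disk={disk}us -> {latency:.0f}us, {t1q1_iops(latency):.0f} iops")

# Reverse estimate from the OSD log: subop 160-200 us, disk about 60 us.
for subop in (160, 200):
    print(f"subop={subop}us -> rtt={implied_rtt_us(subop, 60):.0f}us")
```

With the numbers quoted in the thread this gives roughly 3300-3700 IOPS for the forward estimate and an implied RTT of 100-140 us for the reverse one.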
My configuration is
It is very strange that my network latency is so high. I don't have latency data for the network card and switch, so I am not sure how much latency the system itself adds. Does a ping entry in the OSD log mean that there was no I/O?
|
Idle pings may be slower than 'real' even when CPU powersave is turned off. I get 60 us idle ping time between 2 osds on localhost even with powersave disabled, but it drops to ~35 us when testing T1Q1 randread. But in your case 120 us seems similar to the "normal" ping, i.e. similar to just the network RTT in your case... how high is your ping latency when testing T1Q1 randread? |
For T1Q1 randread 4K, fio ran for 5 minutes, starting at 21:45:04. I have uploaded the log to: |
With randwrite 64k, the following appeared in the OSD log:
|
|
Yes, it seems ping time is still about 90-100us even when OSDs are active... So it's still the same: the network problem should be found and fixed to improve latency :-) I also tested vitastor with a single NVMe and yes, it seems it requires 2-3 OSDs per NVMe to fully utilize linear write speed, just because of memory copying. |
I am thinking about how to reduce network latency, but my hardware may have reached its limit. |
But at the same time you say sockperf shows ~22us? Can you recheck sockperf between OSD nodes and between OSD and client? |
I have 3 servers in a vitastor cluster, and a virtual machine runs on one of them. The following are the results of sockperf running between physical nodes:
The following is the result of sockperf run from inside the virtual machine to a physical node:
|
The performance of vitastor in T1Q1 is higher than ceph, but the performance of T16Q4 is not as high as ceph. Why is this? |
No idea. In my tests it's faster in all modes. T16Q4? How do you test ceph in T16Q4? fio -ioengine=rbd -numjobs=16 -rbdname=...? I'm asking because this case is particularly slow in ceph, i.e. -numjobs > 1 into the same RBD image leads to a ~10x performance drop in ceph. |
T1Q1 and T16Q4 were both tested in a virtual machine with 16 cores and 32GB memory. The virtual machine uses libaio to read and write virtual disks (vdb...). |
Ok, so how much do you get, from both vitastor and ceph? |
ceph
vitastor
|
I think that's due to the rather large block size.
|
Single-threaded TCP caps at roughly 1200 MB/s (depending on the exact CPU model), so it's a noticeable limit. And 90% of the CPU load is just memory copying. I'm even thinking about integrating VPP support to overcome it :-)) It may be interesting even if it just removes the need to copy memory, even without improving latency. |
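To see why a bandwidth cap matters more as the block size grows, here is a rough sketch that converts the ~1200 MB/s single-threaded TCP figure from this comment into an IOPS ceiling per block size. It assumes 1 MB = 10^6 bytes, block sizes in KiB, and a single connection handled by a single thread; `iops_ceiling` is a hypothetical helper, and the real limit depends on the CPU model, as noted above.

```python
def iops_ceiling(bandwidth_mb_s: float, block_size_kib: int) -> float:
    """Upper bound on IOPS when throughput is capped at bandwidth_mb_s.

    Assumes MB means 10**6 bytes and block sizes are given in KiB.
    """
    bytes_per_second = bandwidth_mb_s * 1_000_000
    return bytes_per_second / (block_size_kib * 1024)


TCP_CAP_MB_S = 1200  # approximate figure quoted in the comment

for bs in (4, 16, 32, 64, 128, 1024):
    print(f"bs={bs}k -> at most {iops_ceiling(TCP_CAP_MB_S, bs):.0f} iops")
```

At small block sizes the ceiling is far above what latency allows, so it is not the bottleneck. From about 32k upward the bandwidth cap becomes the tighter limit, which fits the pattern reported in this issue.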
That's very interesting. These are the performance results of the test. Same hardware; the virtual machine has 16 cores and 32GB memory, the virtual disk size is 100GB, the guest system is Ubuntu 18.04, fio T16Q4. vitastor
ceph
|
It seems I found a way to easily improve client throughput, it's io_uring async execution via worker threads. It makes latency slightly worse, but only slightly, i.e. it only adds 5-10us (a context switch). I'll do some more tests and then I think I'll add options to enable it :-) |
That's great. I've been dealing with some other things recently, but I believe I will be back soon. |
I finally found an idiotic problem which was more than halving the send speed. :-) |
Great, cool! This is already great! I will find time to test it. |
Hi @vitalif ,
I tested the performance of vitastor without RDMA.
Below 32k (4k, 8k, 16k), performance is higher than ceph and latency is lower than ceph.
At 32k and above (32k, 64k, 128k, 256k, 512k, 1m), performance is lower than ceph and latency is higher than ceph.
Both random and sequential workloads were tested, and the results are the same.
Does 32k have a special meaning? Can I adjust some parameters to improve the performance at 32k and above?