TPC-C benchmark CPU and memory optimizations to lower hardware requirements and better logic for initial data upload #145
Comments
Hi, about the memory optimization: I noticed that instead of storing individual latency records you use a bucket approach, where each bucket size is chosen in advance. But when computing the 99th or 95th percentile latency, as the bucket sizes grow there is a significant jump between consecutive buckets. For example, here the difference between the last two buckets is 10000 - 6000 = 4000 ms. Say my 99th percentile latency is 6001 ms; with this approach it is reported as 10000 ms. For the next transaction, say my latency is 9999 ms; the 99th percentile is still reported as 10000 ms. Given this large gap between consecutive buckets, how can we notice such a big latency jump accurately?
Hi Vasanth, I agree with you that the bucket granularity could be better. The TPC-C standard specifies latency constraints at the 90th percentile, in particular 5 seconds for New-Order, Payment, Order-Status, and Delivery, and 20 seconds for Stock-Level (clause 5.2.5.3). With buckets you lose some precision when calculating percentiles. However, if there is a bucket boundary at 5 seconds, you can precisely compute the percentile for the 5-second value and be 100% sure that the constraint is met. Usually the problem is with New-Order and Payment transaction latencies: if they are OK, the rest are OK because of the benchmark design. That is why I added the 5- and 6-second buckets but not 7-9, and there is a 10-second bucket just for outlier-like values (effectively infinity). But again, it would be nice to add more buckets (and it is practically free).
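To make the bucket discussion concrete, here is a minimal sketch in Java (not the actual benchmark code) of a bucketed latency histogram. The bucket bounds are illustrative; the point is that a percentile read from buckets is rounded up to a bucket's upper bound, while a constraint that coincides with a bucket boundary (such as the 5-second limit) can still be checked exactly.

```java
import java.util.concurrent.atomic.AtomicLongArray;

public class LatencyHistogram {
    // Upper bounds in milliseconds; hypothetical values, the last bucket catches outliers.
    private static final long[] BOUNDS_MS =
        {10, 50, 100, 500, 1000, 2000, 5000, 6000, 10_000, Long.MAX_VALUE};
    private final AtomicLongArray counts = new AtomicLongArray(BOUNDS_MS.length);

    public void record(long latencyMs) {
        int i = 0;
        while (latencyMs > BOUNDS_MS[i]) i++;   // first bucket that fits the value
        counts.incrementAndGet(i);
    }

    // Returns the upper bound of the bucket containing the p-th percentile,
    // i.e. the reported value is rounded up to the bucket boundary.
    public long percentileUpperBound(double p) {
        long total = 0;
        for (int i = 0; i < counts.length(); i++) total += counts.get(i);
        long threshold = (long) Math.ceil(total * p);
        long seen = 0;
        for (int i = 0; i < counts.length(); i++) {
            seen += counts.get(i);
            if (seen >= threshold) return BOUNDS_MS[i];
        }
        return BOUNDS_MS[BOUNDS_MS.length - 1];
    }

    // Exact check of a constraint that sits on a bucket boundary,
    // e.g. "90% of New-Order transactions complete within 5 seconds".
    public boolean fractionWithin(long boundMs, double requiredFraction) {
        long total = 0, within = 0;
        for (int i = 0; i < counts.length(); i++) {
            total += counts.get(i);
            if (BOUNDS_MS[i] <= boundMs) within += counts.get(i);
        }
        return total > 0 && (double) within / total >= requiredFraction;
    }
}
```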
@eivanov89 Thank you for submitting the issue. The virtual thread enhancement is something we would like to do on our end to reduce the client requirements; we are working on verifying the impact of the changes.

Regarding the second issue, LatencyRecord: the New-Order latencies we track are in the range of 50-150 ms (depending on the configuration). A 10-20% increase in latency is treated as a regression, so we have to be precise and cannot rely on a histogram. A better approach might be sampling, i.e. instead of collecting all latency records, collect say 1% or 5% of them; that way the memory requirements go down.

However, the documentation you quote is outdated. We can now run 2,500 warehouses on a single client with 16 GB of memory, so that calculation no longer applies. To check the memory requirements on our end, we commented out the execute phase of TPC-C and generated dummy LatencyRecord objects. We found that a 1,000-warehouse run for 30 minutes generates about 1 million LatencyRecord objects, and for 1 million objects the memory usage is around 140-170 MB.
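As an illustration of the sampling idea, here is a minimal sketch with hypothetical class and method names (not the project's LatencyRecord code) that keeps roughly a fixed fraction of latency values instead of all of them; memory drops in proportion to the sampling rate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class SampledLatencies {
    private final double sampleRate;               // e.g. 0.01 keeps ~1% of records
    private final List<Long> samplesMs = new ArrayList<>();

    public SampledLatencies(double sampleRate) {
        this.sampleRate = sampleRate;
    }

    public synchronized void record(long latencyMs) {
        // Keep the record only with probability sampleRate.
        if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
            samplesMs.add(latencyMs);
        }
    }

    public synchronized long percentile(double p) {
        // Exact percentile of the retained sample; an estimate of the true percentile.
        List<Long> sorted = new ArrayList<>(samplesMs);
        sorted.sort(null);
        int idx = (int) Math.ceil(sorted.size() * p) - 1;
        return sorted.get(Math.max(idx, 0));
    }
}
```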
@hbhanawat Hi Hemant, thank you for your reply.
Could you please elaborate a little more on this? You could use more granular buckets, couldn't you? Also, in my opinion you could use a switch to change modes: collect all results / buckets / sampling.
Sounds great! Then I will probably be able to give it a try.
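A possible shape for the mode switch mentioned above, reusing the two sketches from earlier; all names here are illustrative and not taken from the actual codebase.

```java
// One configuration value picks how latencies are accumulated.
public interface LatencyCollector {
    void record(long latencyMs);

    enum Mode { ALL_RECORDS, HISTOGRAM, SAMPLING }

    static LatencyCollector create(Mode mode) {
        switch (mode) {
            case HISTOGRAM: return new LatencyHistogram()::record;     // bucketed, constant memory
            case SAMPLING:  return new SampledLatencies(0.01)::record; // keep ~1% of records
            default:        return latencyMs -> {
                // Placeholder: the real collector would append a full LatencyRecord here.
            };
        }
    }
}
```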
Hi,
According to your documentation, "for 10k warehouses, you would need ten clients of type c5.4xlarge to drive the benchmark. For multiple clients, you need to perform three steps". In total, that is 160 CPU cores and 320 GiB of RAM.
At YDB we followed your path and also forked and adapted TPC-C from BenchBase, and we ended up with similarly high hardware requirements for the TPC-C clients. Fortunately, we found some very simple yet effective optimizations (and because you share the same codebase, you can easily employ them too). In this post we discuss our TPC-C implementation and later describe some pitfalls, which again can be easily fixed. Here is a summary of the changes:
After these changes, you will have the following requirements:
Now, to run 10K warehouses you need 10 cores and ~60 GiB of RAM. That means 2 c5.4xlarge instances instead of 10 (a significant cost reduction), or even a single memory-optimized instance.
Another issue that affects the cost of measurements is loading time. You specify ~5.5 hours for 10K warehouses on 30 cluster nodes of type c5d.4xlarge. If you kept the original code there, then you are probably using YSQL to upload the data. If you try YCQL instead, you can probably cut that time in half. Initially we needed 2.7 hours to load 15K warehouses, but we were able to change the TPC-C code to do it in 1.6 hours. We simply use bulk upserts, which are blind writes, instead of the default inserts.
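For illustration, here is a rough sketch of the batched blind-write loading idea, using plain JDBC with illustrative SQL and table/column names; the real loader and the exact upsert syntax depend on the API used (YSQL, YCQL, or YDB's bulk upsert).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkLoader {
    private static final int BATCH_SIZE = 1000;

    // Loads rows into a hypothetical "item" table in large batches.
    public static void loadItems(Connection conn, List<long[]> rows) throws SQLException {
        // Upsert semantics: the write does not need to check existing rows first,
        // which is cheaper than a plain insert during the initial data load.
        String sql = "INSERT INTO item (i_id, i_im_id) VALUES (?, ?) "
                   + "ON CONFLICT (i_id) DO NOTHING";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int inBatch = 0;
            for (long[] row : rows) {
                ps.setLong(1, row[0]);
                ps.setLong(2, row[1]);
                ps.addBatch();
                if (++inBatch == BATCH_SIZE) {   // flush a full batch per round trip
                    ps.executeBatch();
                    inBatch = 0;
                }
            }
            if (inBatch > 0) ps.executeBatch();  // flush the remaining rows
        }
    }
}
```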
We're very interested in trying YugabyteDB with a high number of warehouses (e.g. 40K and 80K), and these optimizations will help a lot to cut the spending.