I'm opening this issue to trigger the discussion on the best methods to analyze, interpret, and compare the performance results obtained by the methods decided in #6 .
Right now, buildfarm_perf_test does a very slim and probably inadequate processing of the results, basically showing mean and standard deviation without even looking at the probability distributions, which can lead to wrong assessments in, for instance, regression detection. Since Apex.AI's performance_test already produces a "means" entry every second, I thought about benefiting from the central limit theorem and performing Gaussian statistics over those distributions. Mind that having normally distributed measurements would ease the comparisons, since that enables tests such as Student's t-test, which can assess the significance of the difference between two experiments (see the sketch after the list below). However, I encountered some problems with this approach:
For low publication rates, we would need a vast number of samples for the per-second means to approach a normal distribution.
Some measurements do not seem to behave like a single random variable. I played with latency measurements in different publication modes. For synchronous publications I got fairly normal distributions of the means when publishing for 10 min at 1000 Hz. However, asynchronous publications showed trimodal distributions with three very clear modes, each of them occurring with a different probability.
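To make the idea above concrete, here is a minimal sketch (not actual buildfarm_perf_test code) of aggregating the per-second means of two runs and comparing them with a two-sample t-test. The CSV file names and the `latency_mean (ms)` column name are assumptions for illustration only:

```python
import pandas as pd
from scipy import stats

def load_per_second_means(csv_path, column="latency_mean (ms)"):
    """Load the per-second mean latency samples of one experiment run."""
    df = pd.read_csv(csv_path)
    return df[column].dropna().to_numpy()

# Two runs of the same experiment, e.g. last release vs. current nightly.
baseline = load_per_second_means("baseline_run.csv")
candidate = load_per_second_means("candidate_run.csv")

# Welch's t-test (no equal-variance assumption). Only meaningful if both
# samples of means are approximately normal, which the CLT suggests they
# should be given enough per-second entries.
t_stat, p_value = stats.ttest_ind(baseline, candidate, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant difference between runs -> flag for review")
```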
The above made me think that we should develop a system that can decide which statistical test to run between experiments so as to extract the most relevant information. Such a system could then be used to detect regressions in CI builds, and also to present performance results to end users in a form whose interpretation supports fair and relevant conclusions about the performance of the stack under different configurations.
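A rough sketch of what such a "test selector" could look like (thresholds and names are illustrative, not a proposal of the actual implementation): check the normality of both samples and fall back to a non-parametric test when the means are clearly not normally distributed, as in the asynchronous case above.

```python
import numpy as np
from scipy import stats

def compare_experiments(sample_a, sample_b, alpha=0.05):
    """Pick a significance test depending on whether both samples look normal."""
    _, p_norm_a = stats.shapiro(sample_a)
    _, p_norm_b = stats.shapiro(sample_b)
    if p_norm_a > alpha and p_norm_b > alpha:
        # Both samples compatible with normality -> parametric test.
        _, p = stats.ttest_ind(sample_a, sample_b, equal_var=False)
        return "Welch t-test", p
    # Otherwise use a non-parametric, distribution-free test.
    _, p = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
    return "Mann-Whitney U", p

# Synthetic data standing in for two sets of per-second latency means.
rng = np.random.default_rng(0)
name, p = compare_experiments(rng.normal(1.00, 0.05, 600), rng.normal(1.02, 0.05, 600))
print(f"{name}: p = {p:.4f} -> {'possible regression' if p < 0.05 else 'no significant change'}")
```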
Furthermore, the ROS 2 middlewares allow for a great number of configurations, many of which have an impact on the performance of the stack. I think it'd be very important to define testing profiles and publish results for each of them, so that end users can select the profile from which they will benefit the most.
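For instance, hypothetical profiles could look something like this (every profile name and value below is made up; the point is that each profile pins down the configuration knobs that affect performance):

```python
PROFILES = {
    "sensor_stream": {   # high-rate, small messages, losses tolerated
        "reliability": "best_effort",
        "history_depth": 1,
        "rate_hz": 1000,
        "payload_bytes": 256,
        "publish_mode": "asynchronous",
    },
    "control_loop": {    # moderate rate, small messages, delivery required
        "reliability": "reliable",
        "history_depth": 10,
        "rate_hz": 100,
        "payload_bytes": 64,
        "publish_mode": "synchronous",
    },
    "large_data": {      # e.g. images or point clouds
        "reliability": "reliable",
        "history_depth": 5,
        "rate_hz": 30,
        "payload_bytes": 4 * 1024 * 1024,
        "publish_mode": "asynchronous",
    },
}
```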
It would also be very helpful for users and developers to set performance requirements on the different profiles. In my opinion, we are sometimes comparing latency differences down to the microsecond, but I really don't think any robotic system cares about such a small difference, especially because the control system would never run at such high frequencies. From the users' perspective, I think it is not a question of who gives the very best performance, but rather who can meet their requirements. This approach would push development to meet all the requirements in every direction, improving the overall ROS 2 experience.
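One possible shape of such a requirement-based check, with placeholder thresholds rather than proposed values: each profile carries its requirements, and a run simply passes or fails them instead of being ranked by absolute latency.

```python
import numpy as np

# Placeholder thresholds, not proposed values.
REQUIREMENTS = {
    "control_loop": {"mean_latency_ms": 1.0, "p99_latency_ms": 5.0},
}

def meets_requirements(profile, latencies_ms):
    """Pass/fail check of one run against the profile's requirements."""
    req = REQUIREMENTS[profile]
    return (np.mean(latencies_ms) <= req["mean_latency_ms"]
            and np.percentile(latencies_ms, 99) <= req["p99_latency_ms"])

# Example with synthetic latency samples in milliseconds.
samples = np.random.default_rng(1).normal(0.6, 0.1, 10_000)
print(meets_requirements("control_loop", samples))
```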
Related to that, we measured latencies with ROS 2 Foxy while varying different parameters (payload, number of nodes, DDS middleware, frequency...) and submitted a paper. You can find the preprint on arXiv: https://arxiv.org/pdf/2101.02074.pdf
Thanks, @EduPonz for your comments. I will continue the discussion related to real-time statistics in this thread. I will list different options I'm aware of: