DAOS-14164 test: Have fault injection test slow down on loaded systems. #13645
Have the fault injection test play nicely with the CI system by decreasing the level of concurrency when the system is loaded.

Required-githooks: true
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Build #1 of this PR didn't run the FI test; build #2 did. The load average on the system was over 300 at points, so the build backed off as far as it would and took 5 hours compared to the normal 2. For build #3 I pushed in the morning UK time, when the system was otherwise idle, so the backoff never happened and the test completed in 1h40m because it was the only job running. I've set the backoff to trigger at a load average of 100. Once this feature is in the majority of builds they should all back off equally, so I don't expect this job to ever take five hours in practice. We can expect the run time to go up to some degree, but that is precisely when we're seeing occasional failures at the moment.
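For readers who want a concrete picture of the backoff described above, here is a minimal Python sketch. It is not the code from this PR: the function names, the `core_fraction` default, and the proportional scaling rule are assumptions made for illustration. Only the trigger point (a load average of 100) and the idea of capping concurrency at a share of the available cores come from the discussion.

```python
import os

# Load average at which the backoff starts, as stated in the comment above.
LOAD_THRESHOLD = 100.0


def max_concurrency(cpu_count, load_avg, core_fraction=0.5,
                    threshold=LOAD_THRESHOLD, floor=1):
    """Return how many fault injection runs to allow in parallel.

    core_fraction and the proportional scaling below are hypothetical
    choices; the PR's actual policy is not shown in this conversation.
    """
    # Baseline: a fixed share of the cores, never less than `floor`.
    base = max(floor, int(cpu_count * core_fraction))
    if load_avg <= threshold:
        # System is not considered loaded, so run at the baseline.
        return base
    # Loaded system: shrink the limit in proportion to the overload.
    scaled = int(base * threshold / load_avg)
    return max(floor, scaled)


def current_limit():
    """Sample the 1-minute load average and compute the limit for now."""
    load_1min, _, _ = os.getloadavg()
    return max_concurrency(os.cpu_count() or 1, load_1min)
```

Under these assumptions, a 96-core host at a load average of 300 would drop from 48 parallel runs to 16, while the same host at a load of 10 would stay at 48. Re-sampling the load each time a new run is launched lets the limit recover once the host quietens down.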
From a high level, I question whether this is the right approach. And with the VM instability lately, is it just an issue with VMs being oversubscribed?
So do I. I've asked for changes to the Jenkins configuration but that hasn't happened, although I also think this approach will allow more throughput on both loaded and unloaded systems. The code already has logic to limit the maximum concurrency to a percentage of available cores, but it doesn't take into consideration other activity on the same node, so this change isn't that much of a stretch.
It's the docker hosts, not VMs. The build slots have been reduced from 35 to 30 or thereabouts, whereas it should be closer to five. With non-pipeline Jenkins you could request multiple slots per job, so we could have configured the FI job to require six slots and limited it that way, but Jenkinsfile does not support such an option. The other approach would be to have different docker builders for builds vs fault injection testing, but that's doubling that part of the CI infrastructure, and this is something that has only caused problems on around a six-month cadence (the ticket is from August; the recent failures are from a slew of concurrent jobs after master re-opened last week).