
DAOS-14164 test: Have fault injection test slow down on loaded systems. #13645

Merged
merged 2 commits into master on Jan 29, 2024

Conversation

@ashleypittman (Contributor) commented Jan 22, 2024

Have the fault injection test play nicely with the CI system by
decreasing the level of concurrency when the system is loaded.

Signed-off-by: Ashley Pittman ashley.m.pittman@intel.com


Bug-tracker data:
Ticket title is 'Fault injection testing using NLT - DER_NOMEM(-1009): Out of memory'
Status is 'Reopened'
Labels: 'NLT-testing,ci_impact,triaged'
https://daosio.atlassian.net/browse/DAOS-14164

Have the fault injection test play nicely with the CI system by
decreasing the level of concurrency when the system is loaded.

Required-githooks: true
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
@ashleypittman ashleypittman marked this pull request as ready for review January 23, 2024 10:12
@ashleypittman (Contributor, Author) commented:

Build #1 of this PR didn't run the FI test; build #2 did. The load average on the system was over 300 at points, so the build backed off as far as it would go, and as a result it took 5 hours compared to the normal 2.

For build #3 of this PR I pushed in the morning UK time and the system was otherwise idle so the backoff never happened and the test completed in 1h40m because it was the only job running.

I've set the backoff to trigger at a load average of 100. Once this feature is in the majority of builds they should all back off equally, so I don't expect this job to ever take five hours in practice, although we can expect build times to go up to some degree. Those, however, are precisely the times when we're seeing occasional failures at the moment.
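The backoff described above can be sketched roughly as follows. This is a minimal illustration, not the actual DAOS patch: the helper name `concurrency_limit`, the proportional scaling rule, and the default core fraction are assumptions; only the load-average threshold of 100 comes from the discussion.

```python
import os

# Assumed threshold from the discussion above: back off once the
# 1-minute load average exceeds 100.
LOAD_THRESHOLD = 100


def concurrency_limit(max_jobs=None):
    """Return how many fault-injection jobs to run concurrently now.

    Hypothetical helper: scales the job count down proportionally as
    the system load average rises past LOAD_THRESHOLD.
    """
    if max_jobs is None:
        # Existing behaviour mentioned in the thread: cap concurrency
        # at a fraction of available cores (fraction is an assumption).
        max_jobs = max(1, os.cpu_count() // 2)
    load, _, _ = os.getloadavg()  # 1-, 5-, 15-minute averages
    if load <= LOAD_THRESHOLD:
        return max_jobs
    # Above the threshold, scale down in proportion to the overload,
    # never dropping below a single job.
    scale = LOAD_THRESHOLD / load
    return max(1, int(max_jobs * scale))
```

With this shape, an idle system runs at full concurrency, while a system at load 300 would run at roughly a third of it, which matches the "back off as far as it would go" behaviour seen in build #2.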

@daltonbohning (Contributor) left a comment:

From a high level, I question whether this is the right approach. And with the VM instability lately, is it just an issue with VMs being oversubscribed?

@ashleypittman (Contributor, Author) commented:

From a high level, I question whether this is the right approach.

So do I. I've asked for changes to the Jenkins configuration, but that hasn't happened, and I also think this approach will allow more throughput on both loaded and unloaded systems. The code already has logic to limit the maximum concurrency to a percentage of available cores, but it doesn't take into account other activity on the same node, so this change isn't much of a stretch.

And with the VM instability lately, is it just an issue with VMs being oversubscribed?

It's the docker hosts, not the VMs. The build slots have been reduced from 35 to 30 or thereabouts, whereas it should be closer to five. With non-pipeline Jenkins you could request multiple slots per job, so we could have configured the FI job to require six slots and limited it that way, but Jenkinsfile does not support such an option.

The other approach would be to have different docker builders for builds vs fault injection testing, but that would mean doubling that part of the CI infrastructure, and this is something that has only caused problems on roughly a six-month cadence (the ticket is from August; the recent failures came from a slew of concurrent jobs after master re-opened last week).

@ashleypittman ashleypittman requested a review from a team January 26, 2024 17:14
@ashleypittman ashleypittman merged commit d2a0047 into master Jan 29, 2024
48 checks passed
@ashleypittman ashleypittman deleted the amd/nlt-fi-backoff branch January 29, 2024 09:15