
DAOS-14164 test: Have fault injection test slow down on loaded systems. #13645

Merged
merged 2 commits into master on Jan 29, 2024

Conversation

@ashleypittman (Contributor) commented Jan 22, 2024

Have the fault injection test play nicely with the CI system by
decreasing the level of concurrency when the system is loaded.

Signed-off-by: Ashley Pittman ashley.m.pittman@intel.com


Bug-tracker data:
Ticket title is 'Fault injection testing using NLT - DER_NOMEM(-1009): Out of memory'
Status is 'Reopened'
Labels: 'NLT-testing,ci_impact,triaged'
https://daosio.atlassian.net/browse/DAOS-14164

Have the fault injection test play nicely with the CI system by
decreasing the level of concurrency when the system is loaded.

Required-githooks: true
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
@ashleypittman ashleypittman marked this pull request as ready for review January 23, 2024 10:12
@ashleypittman (Contributor, Author) commented:

Build #1 of this PR didn't run the FI test; build #2 did. The load average on the system was over 300 at points, so the build backed off as far as it would go, and as a result it took 5 hours compared to the normal 2.

For build #3 of this PR I pushed in the morning UK time and the system was otherwise idle so the backoff never happened and the test completed in 1h40m because it was the only job running.

I've set the backoff to trigger at a load average of 100. Once this feature is in the majority of builds they should all back off equally, so I don't expect this job to ever take five hours in practice, although we can expect build times to go up to some degree. Those, however, are precisely the times when we're seeing occasional failures at the moment.
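The backoff described above can be sketched roughly as follows. This is a minimal illustration, not the actual DAOS patch: the helper name `concurrency_limit`, the proportional scaling rule, and the default core fraction are assumptions; only the load-average threshold of 100 comes from the discussion.

```python
import os

# Assumed threshold from the discussion above: back off once the
# 1-minute load average exceeds 100.
LOAD_THRESHOLD = 100


def concurrency_limit(max_jobs=None):
    """Return how many fault-injection jobs to run concurrently now.

    Hypothetical helper: scales the job count down proportionally as
    the system load average rises past LOAD_THRESHOLD.
    """
    if max_jobs is None:
        # Existing behaviour mentioned in the thread: cap concurrency
        # at a fraction of available cores (fraction is an assumption).
        max_jobs = max(1, os.cpu_count() // 2)
    load, _, _ = os.getloadavg()  # 1-, 5-, 15-minute averages
    if load <= LOAD_THRESHOLD:
        return max_jobs
    # Above the threshold, scale down in proportion to the overload,
    # never dropping below a single job.
    scale = LOAD_THRESHOLD / load
    return max(1, int(max_jobs * scale))
```

With this shape, an idle system runs at full concurrency, while a system at load 300 would run at roughly a third of it, which matches the "back off as far as it would go" behaviour seen in build #2.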

@daltonbohning (Contributor) left a comment:

From a high level, I question whether this is the right approach. And with the VM instability lately, is it just an issue with VMs being oversubscribed?

@ashleypittman (Contributor, Author) commented:

From a high level, I question whether this is the right approach.

So do I. I've asked for changes to the Jenkins configuration, but that hasn't happened, and I also think this approach will allow more throughput on both loaded and unloaded systems. The code already has logic to limit the maximum concurrency to a percentage of available cores, but it doesn't take into account other activity on the same node, so this change isn't much of a stretch.

And with the VM instability lately, is it just an issue with VMs being oversubscribed?

It's the docker hosts, not the VMs. The build slots have been reduced from 35 to 30 or thereabouts, whereas it should be closer to five. With non-pipeline Jenkins you could request multiple slots per job, so we could have configured the FI job to require six slots and limited it that way, but Jenkinsfile does not support such an option.

The other approach would be to have different docker builders for builds vs fault injection testing, but that would mean doubling that part of the CI infrastructure, and this is something that has only caused problems on roughly a six-month cadence (the ticket is from August; the recent failures came from a slew of concurrent jobs after master re-opened last week).

@ashleypittman ashleypittman requested a review from a team January 26, 2024 17:14
@ashleypittman ashleypittman merged commit d2a0047 into master Jan 29, 2024
48 checks passed
@ashleypittman ashleypittman deleted the amd/nlt-fi-backoff branch January 29, 2024 09:15