-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-13672 control: Bump system_ram_reserved to reduce OOM occurrences #12430
Merged
Merged
Changes from 2 commits
Commits
Show all changes
7 commits
Select commit
Hold shift + click to select a range
4279202
DAOS-13672 control: Bump system_ram_reserved to reduce OOM occurrences
tanabarr a837b3f
increase system_ram_reserved Value for some tests experiencing insuff…
tanabarr 5ef4961
increase telemetry test timeout to resolve CI failures
tanabarr 5046edb
correct test timeout change required to resolve CI failures
tanabarr 3f3692d
increase system_ram_reserved default to 16 as per code review comment
tanabarr a8a280c
Merge remote-tracking branch 'origin/master' into tanabarr/control-sy…
tanabarr a25515c
attempt to resolve snapshot aggregation test failure
tanabarr File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please explain why there are instances where in the yaml file you set this to 6 and in others 16? (none of this is mentioned in the ticket description that says bump from 6->8.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The instances where the yaml file value have been reduced to 6 were as a result of the bump in the default value causing a memory check failure. This is likely to be in situations where the environment in which the CI test-in-question is memory constrained and so reserving 8gib rather than 6gib doesn't allow a minimum (4gib) RAM-disk to be allocated to the engine after taking into account other memory reservations in the calculation.
Some other tests in CI that are intermittently experiencing memory check failures (e.g. https://daosio.atlassian.net/browse/DAOS-13918) due to RAM usage spikes have been "fixed" in this PR by increasing the
system_ram_reserved
for hardware based tests to 16gib to give a larger buffer which should reduce if not remove the chance of intermittent failures related to this check.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
description updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment makes me wonder if we have learned that the default should actually be 16 rather than 8. If we're seeing intermittent failures on hardware tests with the default of 8, then it seems likely to me that the default of 8 is probably going to cause failures out in the field.
We knew from the start of this effort that it was going to involve some experimentation to find defaults that strike the right balance between safety and max utilization of the hardware. As such, I think the testing really should use whatever values we're expecting most customers to use. If those values result in lots of intermittent failures in CI, then that is a canary for what to expect in the field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've discussed this a lot and your advice was that increasing from 6 to a higher value was undesirable (previously I had suggested a larger value than 8) because we may be wasting/under-utilising resources. That said do you want me to change it to 16?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, what I said was that I didn't think that we should immediately jump to a large value just to ensure that the tests passed, because we could be leaving RAM on the table without knowing for sure that we absolutely need the extra safety margin. Perhaps I didn't make the point clearly enough: This is exploratory work. I don't think that there is a "right" value that is going to be perfect for all situations. The goal is to find a default that is good enough in most situations. Based on what you're saying here, it seems like 8 is not good enough to avoid frequent intermittent failures in CI, and therefore it is probably unsuitable to use as the default for production configurations as well.
My recommendation here is to test what is shipped, so if we need 16 to reliably pass CI tests, then that's what we should ship. Going forward, if we find that there are still intermittent failures due to OOM, then I think some investigation should be done to understand why those happen instead of reflexively increasing the reserved memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I will increase the default value to 16.