Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-13672 control: Bump system_ram_reserved to reduce OOM occurrences #12430

Merged
merged 7 commits into from
Aug 1, 2023

Conversation

tanabarr
Copy link
Contributor

@tanabarr tanabarr commented Jun 16, 2023

Attempt to reduce the chance of OOM killer terminating an engine
process when maximum pool space is allocated by slightly increasing
the system_ram_reserved default value from 6->16gib.

Some test yaml system_ram_reserved values have been reduced to 6 to
prevent the increase in the default from causing an available memory
check failure on engine start-up in memory constrained (VM)
environments. Increasing the default value should also resolve
DAOS-13918 by providing a larger memory buffer to reduce the chance
of intermittent failures related to this check.

Required-githooks: true

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@tanabarr tanabarr requested a review from a team as a code owner June 16, 2023 16:02
@tanabarr tanabarr requested review from mjmac and removed request for a team June 16, 2023 16:02
@tanabarr tanabarr self-assigned this Jun 16, 2023
@tanabarr tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Jun 16, 2023
@tanabarr tanabarr requested review from NiuYawei and a team June 16, 2023 16:03
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@github-actions
Copy link

Bug-tracker data:
Ticket title is 'Tighten validation rules for tmpfs memory usage'
Status is 'In Progress'
Labels: 'md_on_ssd2'
https://daosio.atlassian.net/browse/DAOS-13672

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12430/1/testReport/(root)/

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

NiuYawei
NiuYawei previously approved these changes Jun 19, 2023
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12430/2/execution/node/1095/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@daosbuild1
Copy link
Collaborator

Test stage Unit Test on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12430/3/testReport/

daltonbohning
daltonbohning previously approved these changes Jun 20, 2023
@tanabarr tanabarr requested a review from NiuYawei June 20, 2023 21:02
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-12430/4/testReport/

NiuYawei
NiuYawei previously approved these changes Jun 21, 2023
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@tanabarr tanabarr dismissed stale reviews from daltonbohning and NiuYawei via 6e16c1c June 21, 2023 17:11
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@tanabarr
Copy link
Contributor Author

tanabarr commented Jun 21, 2023

merged 6e16c1c into this PR so number of targets are taken into account when calculating engine memory reservation during scm auto sizing

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15.4 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12430/7/execution/node/353/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12430/7/execution/node/360/log

Copy link
Contributor

@kjacque kjacque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current changes look OK to me. I'll hold off on approval until it runs with the correct pragma.

Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
…srsvd-bump

Allow-unstable: true
Test-tag: pr daily_regression
Test-nvme: auto_md_on_ssd

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@tanabarr tanabarr removed the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jul 26, 2023
Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

knard-intel
knard-intel previously approved these changes Jul 27, 2023
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12430/19/execution/node/1315/log

@tanabarr
Copy link
Contributor Author

https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-12430/19/tests/

Shows the following identified failures:
2 * PoolCreateTests - DAOS-13967,DAOS-13916
DmgSystemReformatTest - DAOS-13613

And 2 new failures:
DmgPoolQueryTest - not specific to this PR, seen on https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/md-on-ssd-testing/63/tests/
SnapshotAggregation - DmgJsonCommandFailure('pool create failed: DER_NOSPACE(-1007): No space on storage target',) looks specific to this PR and possibly to do with the system_ram_reserved default increase to 16gib

Skip-func-test-vm: true
Test-tag: test_snapshot_aggregation
Test-nvme: auto_md_on_ssd

Required-githooks: true

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-12430/20/execution/node/328/log

Copy link
Collaborator

@daosbuild1 daosbuild1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. No errors found by checkpatch.

@tanabarr
Copy link
Contributor Author

reducing system_ram_reserved to try to make test_snapshot_aggregation pass

@tanabarr tanabarr added the forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed. label Jul 28, 2023
@tanabarr
Copy link
Contributor Author

snapshot aggregation test now passing as per https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-12430/21/testReport/FTEST_container/SnapshotAggregation/

combining runs 19 & 21 there are only expected failures

identified failures:
2 * PoolCreateTests - DAOS-13967,DAOS-13916
DmgSystemReformatTest - DAOS-13613
DmgPoolQueryTest - DAOS-14072 (PR just created, seen on https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/md-on-ssd-testing/63/tests/)

@phender can you take a look to see if we are at the stage where this could be force landed? obviously pending review approvals

@mchaarawi mchaarawi removed their request for review July 28, 2023 17:26
@tanabarr tanabarr requested a review from a team July 28, 2023 17:30
@phender phender merged commit f9f935e into master Aug 1, 2023
32 checks passed
@phender phender deleted the tanabarr/control-sysrsvd-bump branch August 1, 2023 01:42
@ashleypittman
Copy link
Contributor

ashleypittman commented Aug 1, 2023

The ARM CI testing on master failed for this PR, please fix insufficient ram to meet minimum requirements (mem stats: total 16 GiB (16739012608) - (hugepages 0 B + sys rsvd 16 GiB + (engine rsvd 1.0 GiB * nr engines 1). 4 tgts-per-engine))

https://github.com/daos-stack/daos/actions/runs/5721566985/job/15503353047

tanabarr added a commit that referenced this pull request Aug 1, 2023
…#12597)

During the evaluation of an optimum RAM-disk size, update per-engine
memory reservation per-engine calculation to take into account the
number of targets. Reserve 128mib of RAM per target.

control: Bump system_ram_reserved to reduce OOM occurrences (#12430)

Attempt to reduce the chance of OOM killer terminating an engine
process when maximum pool space is allocated by slightly increasing
the system_ram_reserved default value from 6->16gib.

Some test yaml system_ram_reserved values have been reduced to 6 to
prevent the increase in the default from causing an available memory
check failure on engine start-up in memory constrained (VM)
environments. Increasing the default value should also resolve
DAOS-13918 by providing a larger memory buffer to reduce the chance
of intermittent failures related to this check.

test: Reduce system_ram_reserved for GHA ARM build (#12746)

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
sylviachanoiyee pushed a commit that referenced this pull request Aug 2, 2023
…#12597) (#12755)

During the evaluation of an optimum RAM-disk size, update per-engine
memory reservation per-engine calculation to take into account the
number of targets. Reserve 128mib of RAM per target.

control: Bump system_ram_reserved to reduce OOM occurrences (#12430)

Attempt to reduce the chance of OOM killer terminating an engine
process when maximum pool space is allocated by slightly increasing
the system_ram_reserved default value from 6->16gib.

Some test yaml system_ram_reserved values have been reduced to 6 to
prevent the increase in the default from causing an available memory
check failure on engine start-up in memory constrained (VM)
environments. Increasing the default value should also resolve
DAOS-13918 by providing a larger memory buffer to reduce the chance
of intermittent failures related to this check.

test: Reduce system_ram_reserved for GHA ARM build (#12746)

Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
control-plane work on the management infrastructure of the DAOS Control Plane forced-landing The PR has known failures or has intentionally reduced testing, but should still be landed.
Development

Successfully merging this pull request may close these issues.

10 participants