DAOS-13672 control: Bump system_ram_reserved to reduce OOM occurrences #12430

Merged
merged 7 commits into from
Aug 1, 2023
8 changes: 4 additions & 4 deletions src/control/cmd/dmg/auto_test.go
@@ -141,17 +141,17 @@ func TestAuto_confGen(t *testing.T) {
Message: control.MockServerScanResp(t, "withSpaceUsage"),
}
storRespHighMem := control.MockServerScanResp(t, "withSpaceUsage")
- // Total mem to meet requirements 34GiB hugeMem, 2GiB per engine rsvd, 6GiB sys rsvd,
+ // Total mem to meet requirements 34GiB hugeMem, 2GiB per engine rsvd, 8GiB sys rsvd,
// 5GiB per engine for tmpfs.
- storRespHighMem.MemInfo.MemTotalKb = (humanize.GiByte * (34 + 4 + 6 + 10)) / humanize.KiByte
+ storRespHighMem.MemInfo.MemTotalKb = (humanize.GiByte * (34 + 4 + 8 + 10)) / humanize.KiByte
+ mockRamdiskSize := 5
storHostRespHighMem := &control.HostResponse{
Addr: "host1",
Message: storRespHighMem,
}
e0 := control.MockEngineCfg(0, 2, 4, 6, 8).WithHelperStreamCount(4)
e1 := control.MockEngineCfg(1, 1, 3, 5, 7).WithHelperStreamCount(4)
exmplEngineCfgs := []*engine.Config{e0, e1}
- mockRamdiskSize := 5 // RoundDownGiB(16*0.75/2)
metadataMountPath := "/mnt/daos_md"
controlMetadata := storage.ControlMetadata{
Path: metadataMountPath,
@@ -406,7 +406,7 @@ disable_vfio: false
disable_vmd: false
enable_hotplug: false
nr_hugepages: 6144
- system_ram_reserved: 6
+ system_ram_reserved: 8
disable_hugepages: false
control_log_mask: INFO
control_log_file: /tmp/daos_server.log
4 changes: 2 additions & 2 deletions src/control/lib/control/auto_test.go
@@ -1576,7 +1576,7 @@ func TestControl_AutoConfig_genConfig(t *testing.T) {
memTotal: (54 * humanize.GiByte) / humanize.KiByte,
expCfg: MockServerCfg(exmplEngineCfg0.Fabric.Provider,
[]*engine.Config{
- MockEngineCfgTmpfs(0, 5, /* tmpfs size in gib */
+ MockEngineCfgTmpfs(0, 4, /* tmpfs size in gib */
mockBdevTier(0, 0).WithBdevDeviceRoles(4),
mockBdevTier(0, 1, 2).WithBdevDeviceRoles(3)).
WithHelperStreamCount(0).
@@ -1585,7 +1585,7 @@ func TestControl_AutoConfig_genConfig(t *testing.T) {
filepath.Join(controlMetadata.EngineDirectory(0),
storage.BdevOutConfName),
),
- MockEngineCfgTmpfs(1, 5, /* tmpfs size in gib */
+ MockEngineCfgTmpfs(1, 4, /* tmpfs size in gib */
mockBdevTier(1, 3).WithBdevDeviceRoles(4),
mockBdevTier(1, 4, 5).WithBdevDeviceRoles(3)).
WithHelperStreamCount(0).
2 changes: 1 addition & 1 deletion src/control/server/storage/scm.go
@@ -51,7 +51,7 @@ const (

// Memory reservation constant defaults to be used when calculating RAM-disk size for DAOS I/O engine.
const (
- DefaultSysMemRsvd = humanize.GiByte * 6 // per-system
+ DefaultSysMemRsvd = humanize.GiByte * 8 // per-system
DefaultTgtMemRsvd = humanize.MiByte * 128 // per-engine-target
DefaultEngineMemRsvd = humanize.GiByte * 1 // per-engine
)
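
The expectations in scm_test.go (for example `(60 - (30+8+4)) / 2`) hint at how these constants feed the RAM-disk size calculation. What follows is a minimal Go sketch of that arithmetic, not the DAOS implementation: the function `ramdiskSize` and the lowercase constant names are hypothetical, and the rule that each engine reserves the larger of the per-engine value and targets times the per-target value is an assumption inferred from the test comments.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	miB uint64 = 1 << 20
	giB uint64 = 1 << 30

	// Hypothetical names mirroring the defaults shown in scm.go after this change.
	sysMemRsvd    = 8 * giB   // per-system
	tgtMemRsvd    = 128 * miB // per-engine-target
	engineMemRsvd = 1 * giB   // per-engine
)

// ramdiskSize returns the per-engine RAM-disk size in bytes.
// Assumption: each engine reserves max(engineMemRsvd, tgtCount*tgtMemRsvd).
func ramdiskSize(memTotal, memHuge, memSys uint64, tgtCount, engCount int) (uint64, error) {
	if engCount <= 0 || tgtCount < 0 {
		return 0, errors.New("engine count must be positive and target count non-negative")
	}
	memEng := uint64(tgtCount) * tgtMemRsvd
	if memEng < engineMemRsvd {
		memEng = engineMemRsvd
	}
	memRsvd := memHuge + memSys + memEng*uint64(engCount)
	if memTotal <= memRsvd {
		return 0, fmt.Errorf("insufficient memory: total %d GiB, reserved %d GiB",
			memTotal/giB, memRsvd/giB)
	}
	// Whatever is left after all reservations is split evenly between engines.
	return (memTotal - memRsvd) / uint64(engCount), nil
}

func main() {
	// Same inputs as the "default values" cases in Test_CalcRamdiskSize.
	for _, tgts := range []int{16, 1} {
		size, err := ramdiskSize(60*giB, 30*giB, sysMemRsvd, tgts, 2)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("targets=%d ramdisk=%d GiB\n", tgts, size/giB)
	}
	// prints:
	// targets=16 ramdisk=9 GiB
	// targets=1 ramdisk=10 GiB
}
```

Under these assumptions, raising the per-system reservation by 2 GiB removes 1 GiB from each of the two engines' RAM-disks, which matches the 10 to 9 and 11 to 10 changes in the test expectations.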
4 changes: 2 additions & 2 deletions src/control/server/storage/scm_test.go
@@ -52,15 +52,15 @@ func Test_CalcRamdiskSize(t *testing.T) {
memSys: DefaultSysMemRsvd,
tgtCount: 16,
engCount: 2,
- expSize: humanize.GiByte * 10, // (60 - (30+6+4)) / 2
+ expSize: humanize.GiByte * 9, // (60 - (30+8+4)) / 2
},
"default values; low nr targets": {
memTotal: humanize.GiByte * 60,
memHuge: humanize.GiByte * 30,
memSys: DefaultSysMemRsvd,
tgtCount: 1,
engCount: 2,
- expSize: humanize.GiByte * 11, // (60 - (30+6+2)) / 2
+ expSize: humanize.GiByte * 10, // (60 - (30+8+2)) / 2
},
"custom values; low sys reservation": {
memTotal: humanize.GiByte * 60,
1 change: 1 addition & 0 deletions src/tests/ftest/control/config_generate_run.yaml
@@ -2,6 +2,7 @@ hosts:
test_servers: 1
timeout: 250
server_config:
+ system_ram_reserved: 16
engines_per_host: 1
engines:
0:
1 change: 1 addition & 0 deletions src/tests/ftest/control/dmg_network_scan.yaml
@@ -8,6 +8,7 @@ server_config:
port: 10001
control_log_mask: TRACE
engines_per_host: 1
+ system_ram_reserved: 16
engines:
0:
storage:
1 change: 1 addition & 0 deletions src/tests/ftest/control/dmg_server_set_logmasks.yaml
@@ -4,6 +4,7 @@ timeout: 120
server_config:
name: daos_server
engines_per_host: 1
+ system_ram_reserved: 6
engines:
0:
storage:
1 change: 1 addition & 0 deletions src/tests/ftest/pool/create_all_vm.yaml
@@ -30,6 +30,7 @@ test_two_pools:
server_config:
name: daos_server
engines_per_host: 1
+ system_ram_reserved: 6
engines:
0:
targets: 5
1 change: 1 addition & 0 deletions src/tests/ftest/security/cont_overwrite_acl.yaml
@@ -7,6 +7,7 @@ timeout: 120
server_config:
name: daos_server
engines_per_host: 1
+ system_ram_reserved: 6
engines:
0:
targets: 4
1 change: 1 addition & 0 deletions src/tests/ftest/security/cont_update_acl.yaml
@@ -7,6 +7,7 @@ timeout: 120
server_config:
name: daos_server
engines_per_host: 1
+ system_ram_reserved: 6
engines:
0:
targets: 4
1 change: 1 addition & 0 deletions src/tests/ftest/server/daos_server_config.yaml
@@ -6,6 +6,7 @@ timeout: 130
server_config:
name: daos_server
engines_per_host: 1
+ system_ram_reserved: 16

@mchaarawi (Contributor), Jul 21, 2023:

Can you please explain why some of the YAML files set this to 6 and others to 16? None of this is mentioned in the ticket description, which only says to bump from 6 to 8.

Contributor Author:

The YAML values were reduced to 6 in cases where the bump in the default value caused a memory check failure. This most likely happens when the environment running the CI test in question is memory constrained, so reserving 8gib rather than 6gib doesn't leave enough for a minimum (4gib) RAM-disk to be allocated to the engine once the other memory reservations in the calculation are taken into account.

Some other CI tests that intermittently hit memory check failures due to RAM usage spikes (e.g. https://daosio.atlassian.net/browse/DAOS-13918) have been "fixed" in this PR by increasing system_ram_reserved to 16gib for hardware-based tests. The larger buffer should reduce, if not remove, the chance of intermittent failures related to this check.
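
To illustrate the memory check failure described in this reply, here is a small Go sketch. It is not DAOS code: `checkRamdiskFits` is a hypothetical helper, and the host figures (16 GiB of RAM, 4 GiB of hugepages, one engine reserving 1 GiB) are made-up values chosen only to show how a 2 GiB larger system reservation can push the RAM-disk below the 4gib minimum.

```go
package main

import "fmt"

const (
	giB            uint64 = 1 << 30
	minRamdiskSize        = 4 * giB // minimum RAM-disk size mentioned in the reply
)

// checkRamdiskFits is a hypothetical helper: it returns the per-engine
// RAM-disk size and an error if that size is below the minimum.
func checkRamdiskFits(memTotal, memHuge, memEngines, memSys uint64, engCount int) (uint64, error) {
	if engCount <= 0 {
		return 0, fmt.Errorf("engine count must be positive, got %d", engCount)
	}
	memRsvd := memHuge + memEngines + memSys
	if memTotal < memRsvd {
		return 0, fmt.Errorf("reservations (%d GiB) exceed total memory (%d GiB)",
			memRsvd/giB, memTotal/giB)
	}
	size := (memTotal - memRsvd) / uint64(engCount)
	if size < minRamdiskSize {
		return size, fmt.Errorf("RAM-disk size %d GiB is below the %d GiB minimum",
			size/giB, minRamdiskSize/giB)
	}
	return size, nil
}

func main() {
	// Made-up memory-constrained host: 16 GiB RAM, 4 GiB hugepages,
	// one engine with a 1 GiB engine reservation.
	for _, sysRsvdGiB := range []uint64{6, 8} {
		size, err := checkRamdiskFits(16*giB, 4*giB, 1*giB, sysRsvdGiB*giB, 1)
		if err != nil {
			fmt.Printf("system_ram_reserved=%d: check fails: %v\n", sysRsvdGiB, err)
			continue
		}
		fmt.Printf("system_ram_reserved=%d: ok, %d GiB RAM-disk\n", sysRsvdGiB, size/giB)
	}
}
```

With these illustrative numbers a reservation of 6 leaves 5 GiB for the RAM-disk, while a reservation of 8 leaves only 3 GiB and the check fails.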

Contributor Author:

description updated

Contributor:

This comment makes me wonder if we have learned that the default should actually be 16 rather than 8. If we're seeing intermittent failures on hardware tests with the default of 8, then it seems likely to me that the default of 8 is probably going to cause failures out in the field.

We knew from the start of this effort that it was going to involve some experimentation to find defaults that strike the right balance between safety and max utilization of the hardware. As such, I think the testing really should use whatever values we're expecting most customers to use. If those values result in lots of intermittent failures in CI, then that is a canary for what to expect in the field.

Contributor Author:

We've discussed this a lot, and your advice was that increasing from 6 to a higher value was undesirable (I had previously suggested a value larger than 8) because we may be wasting or under-utilising resources. That said, do you want me to change it to 16?

Contributor:

For the record, what I said was that I didn't think that we should immediately jump to a large value just to ensure that the tests passed, because we could be leaving RAM on the table without knowing for sure that we absolutely need the extra safety margin. Perhaps I didn't make the point clearly enough: This is exploratory work. I don't think that there is a "right" value that is going to be perfect for all situations. The goal is to find a default that is good enough in most situations. Based on what you're saying here, it seems like 8 is not good enough to avoid frequent intermittent failures in CI, and therefore it is probably unsuitable to use as the default for production configurations as well.

My recommendation here is to test what is shipped, so if we need 16 to reliably pass CI tests, then that's what we should ship. Going forward, if we find that there are still intermittent failures due to OOM, then I think some investigation should be done to understand why those happen instead of reflexively increasing the reserved memory.

Contributor Author:

ok, I will increase the default value to 16.

engines:
0:
storage:
5 changes: 3 additions & 2 deletions utils/config/daos_server.yml
@@ -235,9 +235,10 @@
## of RAM resulting in MemAvailable value being too low to support the calculated RAM-disk size
## increasing the value will reduce the calculate size. Alternatively in situations where total
## RAM is low, reducing the value may prevent problems where RAM-disk size calculated is below the
- ## minimum of 4gib.
+ ## minimum of 4gib. Increasing the value may help avoid the potential of OOM killer terminating
+ ## engine processes but could also result in stopping DAOS from using available memory resources.
#
- ## default: 6
+ ## default: 8
#system_ram_reserved: 5
#
#