Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-14408 common: enable NDCTL for DCPM #14371

Open
wants to merge 69 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 17 commits
Commits
Show all changes
69 commits
Select commit Hold shift + click to select a range
ee6d615
DAOS-14408 common: ensure NDCTL not used for storage class `ram`
grom72 Aug 8, 2024
f74f025
Ensure ABT_THREAD_STACKSIZE>=18432 for dcpm storage class
grom72 Aug 27, 2024
ceccaca
Merge remote-tracking branch 'origin/master' into grom72/ndctl-valida…
grom72 Aug 28, 2024
12bed56
Test with actual PMDK pkg.
grom72 Aug 28, 2024
44c0346
Repeate DaosBuild test on Medium HW
grom72 Aug 29, 2024
9598f9a
tgt_vos_create_one() ULT requires bigger stack size in some configura…
grom72 Aug 29, 2024
fc2c933
Final verification on master
grom72 Aug 29, 2024
1f3f723
ABT_THREAD_STACKSIZE is automatically set during engin startup
grom72 Aug 30, 2024
a9a4a5c
Remove obsolete test config
grom72 Aug 30, 2024
8f6e286
Revert "Remove obsolete test config"
grom72 Aug 30, 2024
884f96d
Automatically set the PMEMOBJ_CONF env var
grom72 Aug 30, 2024
5ec658a
Exclude PMEMOBJ_CONF env var as it is automatically set
grom72 Aug 30, 2024
2ca8e35
Fix tests
grom72 Aug 30, 2024
4528923
Fix tests
grom72 Aug 30, 2024
648d374
Fix more tests
grom72 Aug 30, 2024
0976c04
Restore NLT original configuration.
grom72 Aug 30, 2024
dfb9bf0
Restore NLT original configuration.
grom72 Aug 30, 2024
c1dba55
Full final validation
grom72 Aug 30, 2024
d6fd878
Validation with legacy PMDK pkg (no NDCTL enabled)
grom72 Aug 31, 2024
a9a12d5
Unit tests for PMEMOBJ environment variables configuration validation
grom72 Sep 2, 2024
118ad26
Fix unit tests
grom72 Sep 2, 2024
a33e5c0
Fix unit tests (2)
grom72 Sep 2, 2024
f8c108b
Fix unit tests (3)
grom72 Sep 2, 2024
bcdced7
Final unit tests tuning (and final validation)
grom72 Sep 2, 2024
193f4e6
Final validation (2nd)
grom72 Sep 3, 2024
560d9de
Changelog update
grom72 Sep 3, 2024
3212708
Fix changelog
grom72 Sep 3, 2024
f2a865e
Final validation (3rd)
grom72 Sep 3, 2024
5d4aaf2
Code update based on reviewers feedback
grom72 Sep 3, 2024
9815445
Fix source code format
grom72 Sep 3, 2024
8bd6325
Update code style and extend tests scenarious
grom72 Sep 4, 2024
cc7000b
Remove underscores from go variables names
grom72 Sep 4, 2024
e80ae21
Allign ULT default stack size with Linux page size
grom72 Sep 4, 2024
1ace438
Force build
grom72 Sep 5, 2024
05efd1e
Fix documentation 18KiB -> 20KiB
grom72 Sep 5, 2024
6e07425
Upgrade PMDK to version 2.1.0 to enable NDCTL for engines with DCPM
grom72 Sep 5, 2024
dbb626b
Fix: add dependencies required for PMDK w/ NDCTL
grom72 Sep 5, 2024
6dd7d34
Fix: add dependencies required for PMDK w/ NDCTL (2nd)
grom72 Sep 5, 2024
16c0519
Revert "Fix: add dependencies required for PMDK w/ NDCTL (2nd)"
grom72 Sep 5, 2024
ec653de
Revert "Fix: add dependencies required for PMDK w/ NDCTL"
grom72 Sep 5, 2024
b0842ed
Revert "Upgrade PMDK to version 2.1.0 to enable NDCTL for engines wit…
grom72 Sep 5, 2024
cf26dd1
Upgrade PMDK to version 2.1.0 to enable NDCTL for engines
grom72 Sep 5, 2024
8297d37
Fix typo
grom72 Sep 5, 2024
da701b4
Fix typo in pkg name
grom72 Sep 6, 2024
23f52e6
Fix GHA ARM build
grom72 Sep 6, 2024
83123e2
Force landing builds workflow when build.config is modified
grom72 Sep 6, 2024
88d5087
Revert "Fix typo in pkg name"
grom72 Sep 6, 2024
913b030
Reapply "Fix typo in pkg name"
grom72 Sep 6, 2024
ac3193a
Fix PMDK patch
grom72 Sep 6, 2024
4c5ed9d
Merge remote-tracking branch 'origin/master' into grom72/ndctl-valida…
grom72 Sep 6, 2024
eea85fb
Merge remote-tracking branch 'origin/master' into grom72/ndctl-valida…
grom72 Sep 9, 2024
ee99bf5
Changelog update
grom72 Sep 9, 2024
bd9100f
Adjust PMDK env var before engine run.
grom72 Sep 9, 2024
48f6285
UpdatePMDKEnvars() is moved to processConfig()
grom72 Sep 11, 2024
4f90e7a
Restore license date
grom72 Sep 11, 2024
f4c187d
Minor fixes to improve code clarity
grom72 Sep 11, 2024
8f8d846
Fix dfuse/build_test.py
grom72 Sep 12, 2024
c9f0c11
Fix dfuse/build_test.py (2nd)
grom72 Sep 12, 2024
e701bb8
Fix dfuse/build_test.py test (3rd)
grom72 Sep 13, 2024
e52972b
Fix: a simpler implementation of the unit tests setter.
grom72 Sep 13, 2024
e433d4c
Force full validation with the legacy PMDK
grom72 Sep 15, 2024
9175174
RPMs validation with PMDK 2.1.0 w NDCTL enabled
grom72 Sep 17, 2024
82e9e92
RPMs validation with PMDK 2.1.0 w NDCTL enabled (2nd)
grom72 Sep 24, 2024
bc08b29
RPMs validation with PMDK 2.1.0 w NDCTL enabled (2nd)
grom72 Sep 24, 2024
0565522
RPMs validation with PMDK 2.1.0 w NDCTL enabled (3nd)
grom72 Sep 25, 2024
c6f4853
Add libndctl-devel for leap test environment
grom72 Sep 25, 2024
817c616
Add libndctl-devel for to Ubuntu pkg spec
grom72 Sep 25, 2024
e444b1f
Final validation.
grom72 Sep 26, 2024
8d7da45
Final validation (no NDCTL).
grom72 Sep 26, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions src/control/cmd/dmg/auto_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -566,6 +566,8 @@ engines:
provider: ofi+verbs
fabric_iface: ib0
fabric_iface_port: 31416
env_vars:
- ABT_THREAD_STACKSIZE=18432
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should not be including this in the output of config gen considering we handle it automatically on reading back the config.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We compare raw expOut not config structure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to have - ABT_THREAD_STACKSIZE=18432 explicitly listed rather than making some tricks with reverting changes that are common for all configurations.

=== RUN   TestAuto_ConfigWrite
    auto_test.go:662: unexpected output config (-want, +got):
          (
          	"""
          	... // 23 identical lines
          	  fabric_iface: ib0
          	  fabric_iface_port: 31416
        + 	  env_vars:
        + 	  - ABT_THREAD_STACKSIZE=18432
          	  pinned_numa_node: 0
          	- targets: 6
          	... // 13 identical lines
          	  fabric_iface: ib1
          	  fabric_iface_port: 32416
        + 	  env_vars:
        + 	  - ABT_THREAD_STACKSIZE=18432
          	  pinned_numa_node: 1
          	disable_vfio: false
          	... // 16 identical lines
          	"""
          )
        
--- FAIL: TestAuto_ConfigWrite (0.01s)

pinned_numa_node: 0
- targets: 6
nr_xs_helpers: 0
Expand All @@ -583,6 +585,8 @@ engines:
provider: ofi+verbs
fabric_iface: ib1
fabric_iface_port: 32416
env_vars:
- ABT_THREAD_STACKSIZE=18432
pinned_numa_node: 1
disable_vfio: false
disable_vmd: false
Expand Down
8 changes: 7 additions & 1 deletion src/control/server/config/server_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -129,8 +129,14 @@ func mockConfigFromFile(t *testing.T, path string) (*Server, error) {
t.Helper()
c := DefaultServer()
c.Path = path
err := c.Load()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding env specific code here is a bad smell, we are doing something wrong if we have to do that. Maybe we can make it work nicer by creating config parameters for these variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an internal logic specific to PMDK w/ NDCTL. We do not need any parameters, as everything can be derived from the existing configuration. It is a permanent change for PMDK started from version 2.1.0-2 (daos-stack/pmdk#38). We do not want to bother the end-user with something that can be done automatically.

Copy link
Contributor Author

@grom72 grom72 Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change is fully transparent from an end-user perspective. No changes in configuration and no changes in behavior except additional NDCTL operations in pmemobj_created()/_opened()/_close(). The only thing that we have to adjust are tests as they did not assume such an operation during engine startup like automatic env variables adjustment.
Originally, we started with C code modification, moving to a solution by adjusting the default configuration and ending up with the concept of automatic variables update as the algorithm is always the same:
DCPM - bigger default thread stack size (ABT_THREAD_STACKSIZE)
RAM - skip ndctl - (PMEMOBJ_CONF=sds.at_create=0)
The rest is some sort of validation to avoid configuration that causes server crashes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, whilst the above highlighted change is a workaround that looks out of scope in the test code, it is only test code so I think it's okay.

if err == nil {
for i := 0; i < len(c.Engines); i++ {
c.Engines[i] = c.Engines[i].WithSDSForRam().WithStackSizeForDCPM()
}
}

return c, c.Load()
return c, err
}

func TestServerConfig_MarshalUnmarshal(t *testing.T) {
Expand Down
54 changes: 53 additions & 1 deletion src/control/server/engine/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,7 @@ package engine
import (
"fmt"
"os"
"strconv"
"strings"

"github.com/pkg/errors"
Expand All @@ -28,6 +29,8 @@ const (
envLogMasks = "D_LOG_MASK"
envLogDbgStreams = "DD_MASK"
envLogSubsystems = "DD_SUBSYS"

MIN_ABT_THREAD_STACKSIZE_FOR_DCPM = 18432
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go variable names, even Constants should be of the form minAbtThreadStackSizeDcpm or minABTThreadStackSizeDCPM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

)

// FabricConfig encapsulates networking fabric configuration.
Expand Down Expand Up @@ -343,6 +346,41 @@ func (c *Config) Validate() error {
return errors.Wrap(err, "validate engine log subsystems")
}

// Ensure proper environment variables for PMDK w/ NDCTL enabled.
pmemobj_conf_str, pmemobj_conf_err := c.GetEnvVar("PMEMOBJ_CONF")
if c.Storage.Tiers[0].Class == storage.ClassDcpm {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to have new unit test cases for this new block of logic. I see a few cases here:

  • env var not set with SCM (OK, env vars list updated)
  • env var not set without SCM (OK, no change)
  • env var contains non-integer junk (error)
  • stack size too small with SCM (error)
  • smaller stack size without SCM (OK)

It may be easiest to test by breaking the if-block out into a helper function and writing tests for that, rather than updating an existing unit test to check whether the Config was updated. Either way would be acceptable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far all existing tests have been updated.

// Ensure 18KiB ABT stack size for an engine with DCPM storage class.
stacksize_str, err := c.GetEnvVar("ABT_THREAD_STACKSIZE")
if err != nil {
fmt.Printf("env_var ABT_THREAD_STACKSIZE set to %d for `dcpm` storage class\n",
MIN_ABT_THREAD_STACKSIZE_FOR_DCPM)
c.EnvVars = append(c.EnvVars, fmt.Sprintf("ABT_THREAD_STACKSIZE=%d",
MIN_ABT_THREAD_STACKSIZE_FOR_DCPM))
} else {
stacksize_value, err := strconv.Atoi(stacksize_str)
if err != nil || stacksize_value < MIN_ABT_THREAD_STACKSIZE_FOR_DCPM {
return errors.New(fmt.Sprintf("env_var ABT_THREAD_STACKSIZE should be >= %d for `dcpm` storage class",
MIN_ABT_THREAD_STACKSIZE_FOR_DCPM))
}
}
// Ensure default handling of shutdown state (SDS) for DCPM storage class.
if pmemobj_conf_err != nil && strings.Contains(pmemobj_conf_str, "sds.at_create") {
return errors.New("env_var PMEMOBJ_CONF should NOT be set to sds.at_create=? for non-`dcpm` storage class")
}
} else {
grom72 marked this conversation as resolved.
Show resolved Hide resolved
// Disable shutdown state (SDS) (part of RAS) for RAM-based simulated SCM.
// RAM doesn't support this feature and trying to use
// it will fail the create/open operations.
if pmemobj_conf_err != nil {
fmt.Printf("env_var PMEMOBJ_CONF set to sds.at_create=0 for non-`dcpm` storage class\n")
c.EnvVars = append(c.EnvVars, "PMEMOBJ_CONF=sds.at_create=0")
} else if strings.Contains(pmemobj_conf_str, "sds.at_create") {
if strings.Contains(pmemobj_conf_str, "sds.at_create=1") {
return errors.New("env_var PMEMOBJ_CONF should be set to sds.at_create=0 for non-`dcpm` storage class")
}
}
}

return nil
}

Expand Down Expand Up @@ -480,7 +518,7 @@ func (c *Config) WithSystemName(name string) *Config {
func (c *Config) WithStorage(cfgs ...*storage.TierConfig) *Config {
c.Storage.Tiers = storage.TierConfigs{}
c.AppendStorage(cfgs...)
return c
return c.WithStackSizeForDCPM().WithSDSForRam()
}

// AppendStorage appends the given storage tier configurations to
Expand Down Expand Up @@ -690,3 +728,17 @@ func (c *Config) WithStorageIndex(i uint32) *Config {
c.Storage.EngineIdx = uint(i)
return c
}

func (c *Config) WithStackSizeForDCPM() *Config {
if len(c.Storage.Tiers) > 0 && c.Storage.Tiers[0].Class == storage.ClassDcpm {
c.EnvVars = append(c.EnvVars, fmt.Sprintf("ABT_THREAD_STACKSIZE=%d", MIN_ABT_THREAD_STACKSIZE_FOR_DCPM))
}
return c
}

func (c *Config) WithSDSForRam() *Config {
if len(c.Storage.Tiers) > 0 && c.Storage.Tiers[0].Class != storage.ClassDcpm {
c.EnvVars = append(c.EnvVars, "PMEMOBJ_CONF=sds.at_create=0")
}
return c
}
3 changes: 3 additions & 0 deletions src/control/server/engine/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -177,6 +177,8 @@ func TestConfig_Constructed(t *testing.T) {
t.Fatal(err)
}

fromDisk = fromDisk.WithSDSForRam().WithStackSizeForDCPM()

if diff := cmp.Diff(fromDisk, constructed, defConfigCmpOpts...); diff != "" {
t.Fatalf("(-want, +got):\n%s", diff)
}
Expand Down Expand Up @@ -730,6 +732,7 @@ func TestConfig_ToCmdVals(t *testing.T) {
"D_LOG_MASK=" + logMask,
"CRT_TIMEOUT=" + strconv.FormatUint(uint64(crtTimeout), 10),
"FI_OFI_RXM_USE_SRX=0",
"PMEMOBJ_CONF=sds.at_create=0",
}

gotArgs, err := cfg.CmdLineArgs()
Expand Down
1 change: 1 addition & 0 deletions src/control/server/engine/exec_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -184,6 +184,7 @@ func TestRunnerNormalExit(t *testing.T) {
"CRT_TIMEOUT=30",
"D_INTERFACE=qib0",
"D_LOG_MASK=DEBUG,MGMT=DEBUG,RPC=ERR,MEM=ERR",
"PMEMOBJ_CONF=sds.at_create=0",
allowedUserEnv + "=" + allowedUserVal,
}
sort.Strings(env)
Expand Down
2 changes: 1 addition & 1 deletion src/mgmt/srv_target.c
Original file line number Diff line number Diff line change
Expand Up @@ -1180,7 +1180,7 @@ ds_mgmt_hdlr_tgt_create(crt_rpc_t *tc_req)
/* A zero size accommodates the existing file */
vpa.vpa_scm_size = 0;
vpa.vpa_nvme_size = tc_in->tc_nvme_size / dss_tgt_nr;
rc = dss_thread_collective(tgt_vos_create_one, &vpa, 0);
rc = dss_thread_collective(tgt_vos_create_one, &vpa, DSS_ULT_DEEP_STACK);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change should probably be reviewed by someone familiar with the repercussions on the engine side. Maybe @NiuYawei ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks fine. It just uses deeper stack by default for creating the vos target.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we've already used deep stack size for regular pool open/create ULTs in other places, this is a missed case.

if (rc) {
D_ERROR(DF_UUID": thread collective tgt_vos_create_one failed, "DF_RC"\n",
DP_UUID(tc_in->tc_pool_uuid), DP_RC(rc));
Expand Down
Loading