Merge upstream/release/2.6 into upstream/google/2.6 #15415

juszhan1 · 2024-10-28T20:51:22Z

DAOS-16556 client: call fstat() before mmap() to update file status in kernel (#15304)
DAOS-16446 test: HDF5-VOL test - Set object class and container prope… (#15004) (#15098)
DAOS-16673 common: ignore Hadoop 3.4.0 related CVE (#15320)
DAOS-14408 common: ensure NDCTL not used for storage class ram (#15203)
DAOS-16653 pool: Batch crt events (#15230) (#15302)
DAOS-16720 cq: pin isort to v1.1.0 (#15338) (#15339)
DAOS-15852 test: more timing samples for co_op_dup_timing() (#14497) (#15324)
DAOS-16572 rebuild: properly assign global_dtx_resync_version in IV - b26 (#15186)
DAOS-16716 ci: Set reference build for PRs (#15337)
DAOS-16329 chk: maintenance mode after checking pool with dryrun - b26 (#14985)
DAOS-16265 test: Fix erasurecode/rebuild_fio.py out of space (#15020) (#15340)
DAOS-16693 telemetry: Avoid race between init/read (#15306) (#15322)
DAOS-16696 cart: Fix rc in error path (#15313) (#15357)
DAOS-16574 vos: shrink DTX table blob size - b26 (#15220) (#15221)
DAOS-16653 doc: Fix CRT_EVENT_DELAY description (#15351) (#15371)
DAOS-16650 control: dmg system exclude, update group version (#15288) (#15349)
DAOS-16488 chk: take sd_lock before accessing VOS sys_db - b26 (#15269)
DAOS-16469 dtx: optimize DTX CoS cache - b26 (#15085)
DAOS-14262 cart: add ability to select traffic class for SWIM context (#14893) (#14917)
DAOS-16469 container: Lower log level for cont_aggregate_interval (#15283)
DAOS-16716 ci: Set reference build for PRs (#15379)
DAOS-15914: crt_reply_send_input_free() (#14817)
DAOS-16721 object: fix coll RPC for obj with sparse layout - b26 (#15376)
DAOS-16687 control: Handle missing PCIe caps in storage query usage (#15296) (#15392)
DAOS-16722 client: to intercept PMPI_Init() in libpil4dfs (#15387)
DAOS-15943 test: Remove server logging from pre-teardown (#15282) (#15386)
Revert "DAOS-14262 cart: add ability to select traffic class for SWIM context (#14893) (#14917)"

…n kernel (#15304) Signed-off-by: Lei Huang <lei.huang@intel.com>

#15004) (#15098) In HDF5, DFS, MPIIO, or POSIX, object class and container properties are defined during container create. If it’s DFS, object class is also set to the IOR parameter. However, in HDF5-VOL, object class and container properties are defined with the following environment variables of mpirun: HDF5_DAOS_OBJ_CLASS (object class) and HDF5_DAOS_FILE_PROP (container properties). The infrastructure to set these variables is already there in run_ior_with_pool(). In file_count_test_base.py, pass in the env vars to run_ior_with_pool(env=env) as a dictionary. Object class is the oclass variable. Container properties can be obtained from self.container.properties.value. This fix is discussed in PR #14964. Signed-off-by: Makito Kano <makito.kano@intel.com>
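A minimal sketch of the dictionary described above, assuming both values are plain strings. The helper name `build_hdf5_vol_env` and the example values are hypothetical; only the two variable names and the `env=env` keyword of run_ior_with_pool() come from the commit message.

```python
def build_hdf5_vol_env(oclass, container_properties):
    """Return the mpirun environment for an HDF5-VOL run.

    Hypothetical helper: empty or None values are skipped so that the
    corresponding variable is simply left unset.
    """
    env = {}
    if oclass:
        env["HDF5_DAOS_OBJ_CLASS"] = oclass
    if container_properties:
        env["HDF5_DAOS_FILE_PROP"] = container_properties
    return env


# Placeholder values, not taken from the test configuration.
env = build_hdf5_vol_env("SX", "rd_fac:0")
print(env)
```

In the test itself, the resulting dictionary would be the value handed to run_ior_with_pool(env=env).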

Hadoop 3.4.0 has resolved a few CVE issues but introduces new ones. + enable Trivy scans on release branch + enable on-demand scan and scan on final PR merge. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

) * DAOS-14408 common: enable NDCTL for DCPM This PR prepares DAOS to be used with NDCTL enabled in PMDK, which means: - NDCTL must not be used when non-DCPM (simulated PMem) - `storage class: "ram"` is used: `PMEMOBJ_CONF=sds.at_create=0` env variable disables NDCTL features in the PMDK This change affects all tests run on simulated PMem (e.g. inside VMs). Some DAOS utility applications may also require `PMEMOBJ_CONF=sds.at_create=0` to be set. - The default ULT stack size must be at least 20KiB to avoid stack overuse by PMDK with NDCTL enabled and be aligned with Linux page size. `ABT_THREAD_STACKSIZE=20480` env variable is used to increase the default ULT stack size. This env variable is set by control/server module just before engine is started. Much bigger stack is used for pmempool open/create-related tasks e.g. `tgt_vos_create_one` to avoid stack overuse. This modification shall not affect md-on-ssd mode as long as `storage class: "ram"` is used for the first tier in the `storage` configuration. This change does not require any configuration changes to existing systems. The new PMDK package with NDCTL enabled (daos-stack/pmdk#38) will land as soon as this PR is merged. Signed-off-by: Jan Michalski <jan.michalski@intel.com>
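The two environment variables above can be illustrated with a short sketch. This is not the control/server code; `ult_stack_size` and `engine_env` are hypothetical helpers that only show the rule stated in the message: a stack of at least 20 KiB rounded up to the page size, and `PMEMOBJ_CONF=sds.at_create=0` when the storage class is "ram".

```python
import mmap

MIN_ULT_STACK = 20 * 1024  # 20 KiB minimum named in the commit message


def ult_stack_size(minimum=MIN_ULT_STACK, page_size=mmap.PAGESIZE):
    """Round the minimum stack size up to a whole number of pages."""
    pages = -(-minimum // page_size)  # ceiling division
    return pages * page_size


def engine_env(storage_class, base_env=None):
    """Build the engine environment (hypothetical helper, for illustration)."""
    env = dict(base_env or {})
    env["ABT_THREAD_STACKSIZE"] = str(ult_stack_size())
    if storage_class == "ram":
        # Simulated PMem: disable the NDCTL-backed features in PMDK.
        env["PMEMOBJ_CONF"] = "sds.at_create=0"
    return env


print(engine_env("ram"))
```

With 4 KiB pages the rounded value equals the 20480 used in the commit, since 20480 is exactly five pages.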

* DAOS-16653 pool: Batch crt events When multiple engines become unavailable around the same time, if a pool cannot tolerate the unavailability of those engines, it is sometimes desired that the pool would not exclude any of the engines. Hence, this patch introduces a CaRT event delay, tunable via the server-side environment variable, CRT_EVENT_DELAY, so that the events signaling the unavailability of those engines will be handled in hopefully one batch, giving pool_svc_update_map_internal a chance to reject the pool map update based on the RF check. When the RF check rejects a pool map change, we should revisit the corresponding events later, rather than simply throwing them away. This patch improves this case by returning the events to the event queue and pausing the queue handling until the next new event or pool map update. - Introduce event sets: pool_svc_event_set. Now the event queue can be simplified to just one event set. - Add the ability to pause and resume the event handling: pse_paused. - Track the time when the latest event was queued: pse_time. Signed-off-by: Li Wei <wei.g.li@intel.com>
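The batching, pause, and resume behaviour described above can be sketched as follows. This is an illustrative model, not the pool service code: the class and method names are hypothetical, the RF check is reduced to a rank count, and it assumes the delay is measured from the most recently queued event.

```python
class EventSet:
    """Toy model of a single event set with a delay and a pause flag."""

    def __init__(self, delay):
        self.delay = delay
        self.ranks = set()   # ranks reported unavailable
        self.time = None     # when the latest event was queued
        self.paused = False

    def queue(self, rank, now):
        # A new event joins the batch and restarts handling if paused.
        self.ranks.add(rank)
        self.time = now
        self.paused = False

    def resume(self):
        # Called on a pool map update in this model.
        self.paused = False

    def handle(self, now, try_update):
        """Return None if nothing was attempted, else the update result."""
        if self.paused or not self.ranks:
            return None
        if now - self.time < self.delay:
            return None  # keep waiting so that more events can join
        if try_update(frozenset(self.ranks)):
            self.ranks.clear()
            return True
        self.paused = True  # rejected: keep the events for a later retry
        return False


def rf_check(rf):
    # Stand-in for the RF check: accept at most rf unavailable ranks.
    return lambda ranks: len(ranks) <= rf


es = EventSet(delay=10)
es.queue(1, now=0)
es.queue(2, now=3)
print(es.handle(now=5, try_update=rf_check(1)))
print(es.handle(now=13, try_update=rf_check(1)))
```

The first call does nothing because the delay has not elapsed; the second attempts the update with both ranks in one batch, is rejected, and leaves the set paused with both events still queued.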

Pin isort to v1.1.0 to avoid surprise changes and because v1.1.1 is not working for us. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

…15324) Rarely, this test will produce timings that exceed the failure threshold. Local and PR/CI experiments have shown that increasing the test's NUM_OPS to more than 200 iterations greatly reduces or may eliminate such intermittent timing failures, by "spreading out" the magnitude of the time spent in the 3 main loops of the test (0% loops perform fault injections, 33%, and 50%). Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>

… b26 (#15186) In rebuild_iv_ent_refresh(), when refreshing the DTX resync version, rt_global_dtx_resync_version must be assigned before waking up the related rebuild_scan_leader. Signed-off-by: Fan Yong <fan.yong@intel.com>

Release branch PRs should use the release branch build instead of master branch build for NLT reference Signed-off-by: Jeff Olivier <jeffolivier@google.com>

#14985) Sometimes, after an unexpected system shutdown, users may want to check their critical data under some kind of maintenance mode. In such a mode, no user data can be modified, moved, or aggregated, which guarantees that no further damage (caused by DAOS logic) can happen during the check. For this purpose, we enhance the current DAOS CR logic with the --dryrun option to allow the pool (after check) to be opened as immutable, disabling mechanisms that may cause data modification or movement (such as rebuild or aggregation). In such a mode, a client that wants to connect to the pool must specify the read-only option. The same applies to opening a container in such a pool. Signed-off-by: Fan Yong <fan.yong@intel.com>

…#15340) Prevent accumulating large server log files caused by temporarily enabling the DEBUG log mask while creating or destroying pools. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

In rare cases, a reader may attempt to access a telemetry node after it has been added to the tree, but before it has been fully initialized. Use an atomic to prevent reads before the initialization has completed. Unlucky readers will get a -DER_AGAIN instead of crashing. Signed-off-by: Michael MacDonald <mjmac@google.com>
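The idea can be sketched in a few lines. This is an analogy in Python rather than the telemetry library itself: `threading.Event` stands in for the atomic flag, and the class name and the numeric value of `DER_AGAIN` are placeholders.

```python
import threading

DER_AGAIN = 1  # placeholder value, not the real DAOS error number


class TelemetryNode:
    """Node that is visible in the tree before it is fully initialized."""

    def __init__(self):
        self._ready = threading.Event()  # stands in for the atomic flag
        self.value = None

    def finish_init(self, value):
        # Publish the flag only after all fields have been set.
        self.value = value
        self._ready.set()

    def read(self):
        # An unlucky reader gets a retryable error instead of garbage.
        if not self._ready.is_set():
            return -DER_AGAIN, None
        return 0, self.value


node = TelemetryNode()
print(node.read())
node.finish_init(42)
print(node.read())
```

A caller that receives the retryable code is expected to try the read again later.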

- Fix rc in error path during ivo_on_update failure Required-githooks: true Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>

Use 4KB blob for committed DTX table and 16KB for active DTX table. This is more efficient for the lower-level allocator and reduces the possibility of space allocation failure under space pressure. Simplify vos_dtx_commit logic and code cleanup. Signed-off-by: Fan Yong <fan.yong@intel.com>

Fix the description of the CRT_EVENT_DELAY environment variable in docs/admin/env_variables.md. Signed-off-by: Li Wei <wei.g.li@intel.com>

…#15349) With this change, when a daos administrator runs dmg system exclude for a given set of engines, the system map version / cart primary group version will be updated. In turn, daos_engines will more immediately detect the "loss" of the administratively excluded engines, update pool maps and perform rebuild. This change supports a use case of a proactive exclusion of ranks that are expected to be impacted by planned maintenance that would cut off connectivity to certain engines. Signed-off-by: Kenneth Cain <kenneth.c.cain@intel.com>

The VOS sys_db may have multiple users, such as SMD and CHK. It is the caller's duty to take the lock on the VOS sys_db before accessing it to handle concurrent operations from multiple XS. Signed-off-by: Fan Yong <fan.yong@intel.com>

If there are a lot of committable DTX entries in the DTX CoS cache, it may be inefficient to locate a DTX entry in the CoS cache with a given oid + dkey_hash. That may happen when DTX batched commit is blocked (for example, because of network trouble), which triggers DTX refresh (for DTX cleanup) on other related engines. If that happens, it increases the system load on such an engine and slows down DTX commit further. The patch reduces unnecessary search operations inside the CoS cache. Other changes: 1. Metrics (io/dtx/async_cmt_lat/tgt_id) for DTX asynchronous commit latency (in ms). 2. Fix a bug in sched_ult2xs() with multiple NUMA sockets for the DSS_XS_OFFLOAD case. 3. Delay commit (or abort) of collective DTX on the leader target to handle resent race. 4. Avoid blocking dtx_req_wait() if chore failed to send out some DTX RPC. 5. Some cleanup for error handling. Signed-off-by: Fan Yong <fan.yong@intel.com>

…#14893) (#14917) Add SWIM_TRAFFIC_CLASS env var (default is unspec) Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>

…5283) To reduce the side-effect caused by frequent log with -DER_INPROGRESS. Signed-off-by: Fan Yong <fan.yong@intel.com>

Release branch PRs should use the release branch build instead of master branch build for Fault Injection reference Signed-off-by: Jeff Olivier <jeffolivier@google.com>

- New crt_reply_send_input_free() API added which releases input buffer right after HG_Respond() instead of waiting until the handle is destroyed. - srv_obj.c calls changed to use new crt_reply_send_input_free() - I/O context takes refcount on RPC - only release input buffer for target update Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com> Signed-off-by: Liang Zhen <liang.zhen@intel.com> Co-authored-by: Liang Zhen <liang.zhen@intel.com>

) The old implementation did not correctly calculate some collective object RPC sizes, which may cause trouble when bulk data transfer is needed for a large collective object RPC. It also potentially affects how collective RPCs are dispatched from the leader to other engines. The patch also adds more sanity checks for the coll-punch RPC to detect potential DRAM corruption. Signed-off-by: Fan Yong <fan.yong@intel.com>

…15296) (#15392) Missing PCIe capabilities when querying an NVMe SSD's configuration space is unusual but should be handled gracefully by the control-plane and shouldn't cause a failure to return usage statistics when calling dmg storage query usage. Update so that pciutils lib is only called when attempting to display health stats via dmg and not when fetching usage info. Improve clarity of workflow to ease maintenance and add test coverage for updates. Enable continued functionality when an NVMe device doesn't return any extended capabilities in PCIe configuration space data by adding a sentinel error to the library for such a case. Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>
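The sentinel-error pattern mentioned above might look roughly like this. The sketch is in Python for illustration and does not reflect the control-plane code; every name in it (`NoPCIeCapsError`, `parse_pcie_caps`, `health_stats`, `usage_stats`, and the dictionary keys) is hypothetical.

```python
class NoPCIeCapsError(Exception):
    """Sentinel: the config space data carries no extended capabilities."""


def parse_pcie_caps(config_space):
    # Hypothetical stand-in for the pciutils-backed parser.
    if not config_space.get("caps"):
        raise NoPCIeCapsError()
    return config_space["caps"]


def health_stats(device):
    """Health query: the only path that parses PCIe capabilities."""
    stats = {"temperature": device["temperature"]}
    try:
        stats["pcie"] = parse_pcie_caps(device["config_space"])
    except NoPCIeCapsError:
        stats["pcie"] = None  # report what is available instead of failing
    return stats


def usage_stats(device):
    """Usage query: never touches the PCIe parser."""
    return {"total_bytes": device["total_bytes"],
            "avail_bytes": device["avail_bytes"]}


dev = {"temperature": 310, "config_space": {}, "total_bytes": 100, "avail_bytes": 40}
print(health_stats(dev))
print(usage_stats(dev))
```

Because the caller matches on one specific error type, other parsing failures still propagate as real errors.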

Intercept PMPI_Init() to avoid calling daos_init() if MPI_Init() is intercepted by other library (like darshan and mpip). Signed-off-by: Lei Huang <lei.huang@intel.com>

…5386) Signed-off-by: Maureen Jean <maureen.jean@intel.com>

…/2.6 Required-githooks: true Change-Id: Idb4059d027c43b9d449d9d020e92618fc174c5b2

… context (#14893) (#14917)" This reverts commit 2b5620b.

daosbuild1 · 2024-10-28T20:54:36Z

Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15415/1/execution/node/365/log

daosbuild1 · 2024-10-28T20:54:59Z

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15415/1/execution/node/387/log

daosbuild1 · 2024-10-28T20:55:46Z

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15415/1/execution/node/362/log

daosbuild1 · 2024-10-28T20:55:55Z

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15415/1/execution/node/323/log

daosbuild1 · 2024-10-28T20:56:59Z

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15415/1/execution/node/273/log

daosbuild1 · 2024-10-28T21:00:29Z

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15415/1/execution/node/326/log

github-actions · 2024-10-28T21:01:16Z

Errors are component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
https://daosio.atlassian.net/browse/Merge

wiliamhuang and others added 28 commits October 15, 2024 10:18

DAOS-16556 client: call fstat() before mmap() to update file status i…

76cfb41

…n kernel (#15304) Signed-off-by: Lei Huang <lei.huang@intel.com>

DAOS-16673 common: ignore Hadoop 3.4.0 related CVE (#15320)

6e16c8e

Hadoop 3.4.0 has resolved a few CVE issues but introduces new ones. + enable Trivy scans on release branch + enable on-demand scan and scan on final PR merge. Signed-off-by: Tomasz Gromadzki <tomasz.gromadzki@intel.com>

DAOS-16720 cq: pin isort to v1.1.0 (#15338) (#15339)

e0f5883

Pin isort to v1.1.0 to avoid surprise changes and because v1.1.1 is not working for us. Signed-off-by: Dalton Bohning <dalton.bohning@intel.com>

DAOS-16716 ci: Set reference build for PRs (#15337)

81e57d0

Release branch PRs should use the release branch build instead of master branch build for NLT reference Signed-off-by: Jeff Olivier <jeffolivier@google.com>

DAOS-16265 test: Fix erasurecode/rebuild_fio.py out of space (#15020) (…

b913d3e

…#15340) Prevent accumulating large server log files caused by temporarily enabling the DEBUG log mask while creating or destroying pools. Signed-off-by: Phil Henderson <phillip.henderson@intel.com>

DAOS-16696 cart: Fix rc in error path (#15313) (#15357)

42a0d35

- Fix rc in error path during ivo_on_update failure Required-githooks: true Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>

DAOS-16653 doc: Fix CRT_EVENT_DELAY description (#15351) (#15371)

dcf8419

Fix the description of the CRT_EVENT_DELAY environment variable in docs/admin/env_variables.md. Signed-off-by: Li Wei <wei.g.li@intel.com>

DAOS-14262 cart: add ability to select traffic class for SWIM context (…

2b5620b

…#14893) (#14917) Add SWIM_TRAFFIC_CLASS env var (default is unspec) Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>

DAOS-16469 container: Lower log level for cont_aggregate_interval (#1…

70b12e3

…5283) To reduce the side-effect caused by frequent log with -DER_INPROGRESS. Signed-off-by: Fan Yong <fan.yong@intel.com>

DAOS-16716 ci: Set reference build for PRs (#15379)

2a1892f

Release branch PRs should use the release branch build instead of master branch build for Fault Injection reference Signed-off-by: Jeff Olivier <jeffolivier@google.com>

DAOS-16722 client: to intercept PMPI_Init() in libpil4dfs (#15387)

eb95b55

Intercept PMPI_Init() to avoid calling daos_init() if MPI_Init() is intercepted by other library (like darshan and mpip). Signed-off-by: Lei Huang <lei.huang@intel.com>

DAOS-15943 test: Remove server logging from pre-teardown (#15282) (#1…

bde13c3

…5386) Signed-off-by: Maureen Jean <maureen.jean@intel.com>

Merge remote-tracking branch 'origin/release/2.6' into juszhan/google…

326327e

…/2.6 Required-githooks: true Change-Id: Idb4059d027c43b9d449d9d020e92618fc174c5b2

Revert "DAOS-14262 cart: add ability to select traffic class for SWIM…

7941cb7

… context (#14893) (#14917)" This reverts commit 2b5620b.

juszhan1 closed this Nov 4, 2024

juszhan1 deleted the juszhan/google/2.6 branch November 4, 2024 18:24
