diff --git a/debian/changelog b/debian/changelog index da7efc0367b..73f8bead281 100644 --- a/debian/changelog +++ b/debian/changelog @@ -1,8 +1,20 @@ -daos (2.3.108-2) unstable; urgency=medium +daos (2.3.108-4) unstable; urgency=medium [Michael MacDonald] * Add golang-go as a tests dependency for dfuse/daos_build.py - -- Michael MacDonald Thu, 29 Jun 2023 10:10:00 -0400 + -- Michael MacDonald Mon, 17 Jul 2023 10:10:00 -0400 + +daos (2.3.108-3) unstable; urgency=medium + [ Wang Shilong ] + * Remove lmdb-devel for MD on SSD + + -- Wang Shilong Thu, 13 Jul 2023 22:44:00 +0800 + +daos (2.3.108-2) unstable; urgency=medium + [ Li Wei ] + * Update raft to 0.10.1-1408.g9524cdb + + -- Li Wei Wed, 28 Jun 2023 10:38:00 +0900 daos (2.3.108-1) unstable; urgency=medium [ Jeff Olivier ] diff --git a/debian/control b/debian/control index 84796139ba2..7e213538e8e 100644 --- a/debian/control +++ b/debian/control @@ -29,10 +29,9 @@ Build-Depends: debhelper (>= 10), libboost-dev, libspdk-dev (>= 22.01.2), libipmctl-dev, - libraft-dev (= 0.9.1-1401.gc18bcb8), + libraft-dev (= 0.10.1-1408.g9524cdb), python3-tabulate, liblz4-dev, - liblmdb-dev, libcapstone-dev Standards-Version: 4.1.2 Homepage: https://docs.daos.io/ diff --git a/docs/admin/env_variables.md b/docs/admin/env_variables.md index daac2c8e6a6..48ee91ba1b3 100644 --- a/docs/admin/env_variables.md +++ b/docs/admin/env_variables.md @@ -35,6 +35,8 @@ Environment variables in this section only apply to the server side. |----------------------|-----------| |RDB\_ELECTION\_TIMEOUT|Raft election timeout used by RDBs in milliseconds. INTEGER. Default to 7000 ms.| |RDB\_REQUEST\_TIMEOUT |Raft request timeout used by RDBs in milliseconds. INTEGER. Default to 3000 ms.| +|RDB_LEASE_MAINTENANCE_GRACE|Raft grace period of leadership lease maintenance used by RDBs in milliseconds. INTEGER. Default to 7000 ms. If a Raft leader is unable to maintain leadership leases from a majority for more than RDB_ELECTION_TIMEOUT + RDB_LEASE_MAINTENANCE_GRACE, it steps down voluntarily.| +|RDB_USE_LEASES|Whether RDBs shall use Raft leadership leases, instead of RPCs, to verify leadership. BOOL. Default to true. Rafts track leadership leases regardless; this environment variable essentially controls whether RDBs use Raft leadership leases to improve RDB TX performance.| |RDB\_COMPACT\_THRESHOLD|Raft log compaction threshold in applied entries. INTEGER. Default to 256 entries.| |RDB\_AE\_MAX\_ENTRIES |Maximum number of entries in a Raft AppendEntries request. INTEGER. Default to 32.| |RDB\_AE\_MAX\_SIZE |Maximum total size in bytes of all entries in a Raft AppendEntries request. INTEGER. Default to 1 MB.| diff --git a/docs/admin/hardware.md b/docs/admin/hardware.md index a97d25dbf5a..4fa3ac3c63a 100644 --- a/docs/admin/hardware.md +++ b/docs/admin/hardware.md @@ -35,9 +35,9 @@ validated on a regular basis. An RDMA-capable fabric is preferred for best performance. The DAOS data plane relies on [OFI libfabric](https://ofiwg.github.io/libfabric/) and supports OFI providers for Ethernet/tcp and InfiniBand/verbs. -Starting with a Technology Preview in DAOS 2.2, [UCX](https://www.openucx.org/) +[UCX](https://www.openucx.org/) is also supported as an alternative network stack for DAOS. -Refer to [UCX Fabric Support (DAOS 2.2 Technology Preview)](./ucx.md) +Refer to [UCX Fabric Support](./ucx.md) for details on setting up DAOS with UCX support. 
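Before settling on a provider, it can help to confirm what fabric software a node actually exposes. A minimal sketch using the standard inspection utilities (an assumption, not part of this change: `fi_info` ships with libfabric and `ucx_info` with UCX; the output depends on the installed hardware and drivers):

```
# List the libfabric providers available on this node (e.g. tcp, verbs).
fi_info -l

# On InfiniBand nodes with UCX installed, list UCX transports and devices.
ucx_info -d | grep -E 'Transport|Device'
```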
DAOS supports multiple network interfaces on the servers diff --git a/docs/admin/md-on-ssd.md b/docs/admin/md-on-ssd.md new file mode 100644 index 00000000000..6007c1abb78 --- /dev/null +++ b/docs/admin/md-on-ssd.md @@ -0,0 +1,18 @@ +# Metadata on SSD Phase1 (Technology Preview) + +DAOS Version 2.4 includes a Technology Preview of the +_Metadata-on-SSD (Phase1)_ +code path to support DAOS servers without Intel Optane Persistent Memory. + +Please refer to the DAOS Wiki articles on the +[Metadata-on-SSD Design](https://daosio.atlassian.net/wiki/spaces/DC/pages/11196923911/Metadata+on+SSDs) +and the +[WAL Detailed Design](https://daosio.atlassian.net/wiki/spaces/DC/pages/11215339529/WAL+Detailed+Design) +for more information. + +A presentation on this new code path, +including initial performance comparisons of DAOS Servers with and without PMem, +can be found in the presentation +[DAOS Beyond PMem](https://www.ixpug.org/images/docs/ISC23/DAOS_mhennecke.pptx) +from the +[ISC 2023 IXPUG Workshop](https://www.ixpug.org/events/isc23-ixpug-workshop). diff --git a/docs/admin/pool_operations.md b/docs/admin/pool_operations.md index 024b2a97fa0..8a69210bd72 100644 --- a/docs/admin/pool_operations.md +++ b/docs/admin/pool_operations.md @@ -486,7 +486,7 @@ To create a pool with a custom ACL: $ dmg pool create --size --acl-file ``` -The ACL file format is detailed in [here](https://docs.daos.io/v2.4/overview/security/#acl-file). +The ACL file format is detailed [here](https://docs.daos.io/v2.4/overview/security/#acl-file). ### Displaying ACL @@ -629,7 +629,7 @@ operation is ongoing. Drain additionally enables non-replicated data to be rebuilt onto another target whereas in a conventional failure scenario non-replicated data would not be integrated into a rebuild and would be lost. Drain operation is not allowed if there are other ongoing rebuild operations, otherwise -it will return -DER_BUSY. +it will return -DER\_BUSY. To drain a target from a pool: @@ -650,7 +650,7 @@ original state. The operator can either reintegrate specific targets for an engine rank by supplying a target idx list, or reintegrate an entire engine rank by omitting the list. Reintegrate operation is not allowed if there are other ongoing rebuild operations, -otherwise it will return -DER_BUSY. +otherwise it will return -DER\_BUSY. ``` $ dmg pool reintegrate $DAOS_POOL --rank=${rank} --target-idx=${idx1},${idx2},${idx3} @@ -702,7 +702,7 @@ pool. This will automatically trigger a server rebalance operation where objects within the extended pool will be rebalanced across the new storage. Extend operation is not allowed if there are other ongoing rebuild operations, -otherwise it will return -DER_BUSY. +otherwise it will return -DER\_BUSY. ``` $ dmg pool extend $DAOS_POOL --ranks=${rank1},${rank2}... @@ -717,14 +717,14 @@ small extensions. ### Resize -Support for quiescent pool resize (changing capacity used on each storage node -without adding new ones) is currently not supported and is under consideration. +Support for quiescent pool resize (changing capacity used on each storage engine +without adding new engines) is currently not supported and is under consideration. ## Pool Catastrophic Recovery A DAOS pool is instantiated on each target by a set of pmemobj files managed by PMDK and SPDK blobs on SSDs. Tools to verify and repair this -persistent data is scheduled for DAOS v2.4 and will be documented here +persistent data are scheduled for DAOS Version 2.6 and will be documented here once available. 
Meanwhile, PMDK provides a recovery tool (i.e., pmempool check) to verify diff --git a/docs/admin/troubleshooting.md b/docs/admin/troubleshooting.md index abdf2f9b62b..16a178c37b2 100644 --- a/docs/admin/troubleshooting.md +++ b/docs/admin/troubleshooting.md @@ -463,6 +463,15 @@ fabric_iface_port: 31316 # engine 1 fabric_iface_port: 31416 ``` +### daos_agent cache of engine URIs is stale + +The `daos_agent` cache may become invalid if `daos_engine` processes restart with different +configurations or IP addresses, or if the DAOS system is reformatted. +If this happens, the `daos` tool (as well as other I/O or `libdaos` operations) may return +`-DER_BAD_TARGET` (-1035) errors. + +To resolve the issue, a privileged user may send a `SIGUSR2` signal to the `daos_agent` process to +force an immediate cache refresh. ## Diagnostic and Recovery Tools diff --git a/docs/admin/ucx.md b/docs/admin/ucx.md index d03d45f5a6a..1f6c539e89c 100644 --- a/docs/admin/ucx.md +++ b/docs/admin/ucx.md @@ -1,19 +1,10 @@ -# UCX Fabric Support (DAOS 2.2 Technology Preview) +# UCX Fabric Support -DAOS 2.2 includes a technology preview of +DAOS 2.4 includes [UCX](https://www.openucx.org/) support for clusters using InfiniBand, as an alternative to the default [libfabric](https://ofiwg.github.io/libfabric/) network stack. -!!! note UCX support has been enabled for the DAOS builds on - EL8 and Leap15 only. It is not supported on CentOS7. - -The goal of this technology preview is to allow early -evaluation and testing. DAOS over UCX has not been fully -validated yet, and it is not recommended to use it in a -production environment with DAOS 2.2. -It is a roadmap item to fully support UCX in DAOS 2.4. - !!! note The network provider is an immutable property of a DAOS system. Changing the network provider to UCX requires that the DAOS storage is reformatted. @@ -77,7 +68,8 @@ the following steps are needed: zypper install mercury-ucx ``` -* To **update** from DAOS 2.0 (with libfabric) to DAOS 2.2 with +* To **update** from an earlier DAOS version (with libfabric) + to DAOS 2.4 with UCX, the recommended path is to first perform a standard DAOS RPM update (which will update the default `mercury` package). After the update, the `mercury` RPM package can be replaced by @@ -89,4 +81,3 @@ configuration file (`/etc/daos/daos_server.yml`). A sample YML file is available on [github](https://github.com/daos-stack/daos/blob/release/2.4/utils/config/examples/daos_server_ucx.yml). The recommended setting for UCX is `provider: ucx+dc_x`. 
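As a hypothetical check of that recommended setting (assuming the default config path `/etc/daos/daos_server.yml`; the expected output is shown as a comment):

```
# Confirm that the DAOS server is configured with the UCX provider.
grep -E 'provider:' /etc/daos/daos_server.yml
# Expected: provider: ucx+dc_x
```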
- diff --git a/docs/overview/terminology.md b/docs/overview/terminology.md index 1273225cf0a..bb9da4c30cd 100644 --- a/docs/overview/terminology.md +++ b/docs/overview/terminology.md @@ -43,6 +43,7 @@ |[SPDK](https://spdk.io/)|Storage Performance Development Kit| |SSD|Solid State Drive| |[SWIM](https://doi.org/10.1109/DSN.2002.1028914)|Scalable Weakly-consistent Infection-style process group Membership Protocol| +|[UCF](https://ucfconsortium.org/)|Unified Communication Framework (UCF Consortium)| |[UCP](https://www.openucx.org/)|Unified Communication Protocols (high-level API of UCX)| |[UCS](https://www.openucx.org/)|Unified Communication Transports (low-level API of UCX)| |[UCT](https://www.openucx.org/)|Unified Communication Services (common utilities of UCX)| diff --git a/docs/release/release_notes.md b/docs/release/release_notes.md index c3fb0a15e36..2eabd5c8c38 100644 --- a/docs/release/release_notes.md +++ b/docs/release/release_notes.md @@ -1,7 +1,232 @@ -# DAOS Version 2.4 Release Notes +# DAOS Version 2.4 Release Notes (DRAFT) -DAOS 2.4 is under active development and has not been released yet. -The release is planned for 2Q2023. -In the meantime, please refer to the support document for the -[latest](https://docs.daos.io/latest/release/release_notes/) -stable DAOS release. +!!! note + This document is a DRAFT of the DAOS Version 2.4 Release Notes. + Information in this document may change without notice before the + release of DAOS Version 2.4. + +We are pleased to announce the release of DAOS version 2.4. + + +## DAOS Version 2.4.0 (2023-xx-xx) + +### General Support + +DAOS Version 2.4.0 supports the following environments: + +Architecture Support: + +* DAOS 2.4.0 supports the x86\_64 architecture. + +Operating System Support: + +* SLES 15.4 and Leap 15.4 + +* EL8 (RHEL, Rocky Linux, Alma Linux): + + - EL8.6 (EUS) + - Validation of EL8.8 is in progress + +Fabric and Network Provider Support: + +* libfabric support for the following fabrics and providers: + + - `ofi+tcp` on all fabrics (without RXM) + - `ofi+tcp;ofi_rxm` on all fabrics (with RXM) + - `ofi+verbs` on InfiniBand fabrics and RoCE (with RXM) + - `ofi+cxi` on Slingshot fabrics (with HPE-provided libfabric) + - `ofi+opx` on Omni-Path fabrics (Technology Preview) + +* [UCX](https://docs.daos.io/v2.4/admin/ucx/) support on InfiniBand fabrics: + + - `ucx+dc_x` on InfiniBand fabrics + +Storage Class Memory Support: + +* DAOS Servers with 2nd gen Intel Xeon Scalable processors and + Intel Optane Persistent Memory 100 Series. + +* DAOS Servers with 3rd gen Intel Xeon Scalable processors and + Intel Optane Persistent Memory 200 Series. + +* DAOS Servers without Intel Optane Persistent Memory, + using the _Metadata-on-SSD_ (Phase1) code path (Technology Preview) + +For a complete list of supported hardware and software, refer to the +[Support Matrix](https://docs.daos.io/v2.4/release/support_matrix/). + + +### Key features and improvements + +#### Software Version Currency + +* See [above](#General-Support) for supported operating system levels. + +* Libfabric and MLNX\_OFED (including UCX) have been refreshed. + Refer to the + [Support Matrix](https://docs.daos.io/v2.4/release/support_matrix/) + for details. + +* The `ipmctl` tool to manage Intel Optane Persistent Memory + has been updated to Version 3 (provided by the OS distributions). 
+
+* The following prerequisite software packages that are included
+ in the DAOS RPM builds have been updated:
+
+ - Argobots has been updated to 1.1-2
+ - DPDK has been updated to 21.11.2-1
+ - Libfabric has been updated to 1.18.0-2 (TB8), going to 1.18.1rc1
+ - Mercury has been updated to 2.3.0-1
+ - Raft has been updated to 0.10.1-1408.g9524cdb
+ - SPDK has been updated to 22.01.2-3
+
+#### New Network Providers
+
+* UCX support on InfiniBand fabrics is now generally available
+ (it was a Technology Preview in DAOS 2.2).
+ Refer to [UCX](https://docs.daos.io/v2.4/admin/ucx/) for details.
+
+* Slingshot fabrics are now supported with the `ofi+cxi` provider.
+
+* Omni-Path Express is supported as a Technology Preview,
+ using the `ofi+opx` provider.
+ For production usage on Omni-Path fabrics,
+ please continue to use the `ofi+tcp` provider.
+
+#### New Features and Usability Improvements
+
+* The `daos_server scm prepare` command now supports the creation of
+ multiple SCM namespaces per CPU socket,
+ using the `--scm-ns-per-socket` option.
+ On DAOS servers with Intel Optane Persistent Memory modules,
+ this can be used to configure multiple DAOS engines per CPU socket
+ (to support multiple HPC fabric links per CPU socket).
+
+* DAOS Version 2.4 includes a Technology Preview of the
+ [Metadata-on-SSD (Phase1)](https://docs.daos.io/v2.4/admin/md-on-ssd/)
+ code path to support DAOS servers without Intel Optane Persistent Memory.
+
+* DAOS Version 2.4 includes initial support for excluding,
+ draining, and reintegrating DAOS engines to/from a pool,
+ using the `dmg pool {exclude|drain|reintegrate}` commands.
+ Expanding a pool by adding additional DAOS engines to the pool is
+ also supported, using the `dmg pool extend` command.
+ Refer to
+ [Pool Modifications](https://docs.daos.io/v2.4/admin/pool_operations/#pool-modifications)
+ in the Administration Guide for more information.
+
+* The default container redundancy level
+ has been changed from _engine_ to _server_
+ (the `rf_lvl` container property now has a value of `node (2)`).
+ For DAOS systems with multiple engines per server, this will reduce
+ the number of available fault domains,
+ so wide erasure codes may no longer work.
+ For testing purposes, it is possible to change the redundancy level
+ back to _engine_.
+ For production usage, the new default is highly recommended
+ as it more appropriately reflects the actual fault domains.
+
+* The Erasure Coding implementation now uses _EC parity rotation_.
+ This significantly improves EC performance,
+ in particular for parallel I/O into a single shared file.
+
+* In addition to the `libioil.so` interception library (which can
+ be used to intercept POSIX data I/O calls but not metadata operations),
+ DAOS Version 2.4 includes a Technology Preview of a new interception
+ library `libpil4dfs.so` which can also intercept POSIX metadata calls.
+ Refer to
+ [this section](https://docs.daos.io/staging/v2.4/user/filesystem/#interception-library-libpil4dfs)
+ in the User Guide for more information on `libpil4dfs.so`,
+ including the current limitations of this Technology Preview.
+
+* On DAOS servers with
+ [VMD](https://docs.daos.io/v2.4/admin/vmd/) enabled,
+ the `dmg storage led identify` command can now be used
+ to visually identify one or more NVMe SSD(s).
+
+* DAOS Version 2.4 supports
+ [Multi-user dfuse](https://docs.daos.io/v2.4/user/multi-user-dfuse/).
+ This feature is particularly useful on shared nodes like login nodes:
+ A single instance of the `dfuse` process can be run (as root,
+ or under a non-root service userid), and all users can access
+ DAOS POSIX containers through that single `dfuse` instance
+ instead of starting multiple per-user `dfuse` instances.
+
+* Several dfuse enhancements have been implemented, including
+ readdir caching, interception support for streaming I/O calls,
+ and the ability to fine-tune the dfuse caching behavior
+ through container properties and dfuse command parameters.
+
+#### Other notable changes
+
+When deleting a pool that still has containers configured in it,
+the `dmg pool destroy` command now needs the `--recursive` option.
+
+In `dmg pool create` the `-p $POOL_LABEL` option is now obsolete.
+Use `$POOL_LABEL` as a positional argument (without the `-p`).
+
+The `daos container create` command no longer supports the
+`-l $CONT_LABEL` option. Use the container label as a
+positional argument instead (without `-l`).
+
+
+### Known Issues and limitations
+
+Known issues from DAOS 2.2 that need to be validated before DAOS 2.4 GA:
+
+- [DAOS-11685](https://daosio.atlassian.net/browse/DAOS-11685):
+ Under certain workloads with `rf=2`, a server may crash.
+ There is no workaround; a fix is targeted for daos-2.2.1.
+
+- [DAOS-11317](https://daosio.atlassian.net/browse/DAOS-11317):
+ Running the Mellanox-provided `mlnxofedinstall` script to install a new version of MLNX\_OFED,
+ while the `mercury-ucx` RPM is already installed, will un-install `mercury-ucx`
+ (as well as mercury-ucx-debuginfo if the debuginfo RPMs are installed).
+ This leaves DAOS non-functional after the MOFED update.
+ Workaround: Run `{yum|dnf|zypper} install mercury-ucx [mercury-ucx-debuginfo]`
+ after the MLNX\_OFED update and before starting DAOS again.
+
+- [DAOS-8848](https://daosio.atlassian.net/browse/DAOS-8848) and
+ [SPDK-2587](https://github.com/spdk/spdk/issues/2587):
+ Binding and unbinding NVMe SSDs between the kernel and SPDK (using the
+ `daos_server storage prepare -n [--reset]` command) can sporadically cause
+ the NVMe SSDs to become inaccessible.
+ Workaround: This situation can be corrected by
+ running `rmmod vfio_pci; modprobe vfio_pci` and `rmmod nvme; modprobe nvme`.
+
+- [DAOS-10215](https://daosio.atlassian.net/browse/DAOS-10215):
+ For Replication and Erasure Coding (EC), in DAOS 2.2 the redundancy level (`rf_lvl`)
+ is set to `1 (rank=engine)`. On servers with more than one engine per server,
+ setting the redundancy level to `2 (server)` would be more appropriate,
+ but the `daos cont create` command currently does not support this.
+ No workaround is available at this point.
+
+- No OPA/PSM2 support.
+ Please refer to the "Fabric Support" section of the
+ [Support Matrix](https://docs.daos.io/v2.4/release/support_matrix/) for details.
+ No workaround is available at this point.
+
+- [DAOS-8943](https://daosio.atlassian.net/browse/DAOS-8943):
+ Premature ENOSPC error: Reclaiming free NVMe space under heavy I/O load can cause early
+ out-of-space errors to be reported to applications.
+ No workaround is available at this point.
+
+### Bug fixes
+
+The DAOS 2.4 release includes fixes for numerous defects.
+For details, please refer to the Github
+[release/2.4 commit history](https://github.com/daos-stack/daos/commits/release/2.4)
+and the associated [Jira tickets](https://jira.daos.io/) as stated in the commit messages.
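To illustrate the CLI changes described under "Other notable changes" above, a hypothetical session (the pool label `tank`, container label `mycont`, and the size value are placeholders; exact option spellings should be verified with `dmg help` and `daos help`):

```
# Pool label is now a positional argument (the -p option is obsolete):
dmg pool create --size 1TB tank

# Container label is also positional (the -l option is no longer supported):
daos container create tank mycont --type POSIX

# Destroying a pool that still contains containers now requires --recursive:
dmg pool destroy tank --recursive
```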
+ + +## Additional resources + +Visit the [online documentation](https://docs.daos.io/v2.4/) for more +information. All DAOS project source code is maintained in the +[https://github.com/daos-stack/daos](https://github.com/daos-stack/daos) repository. +Please visit this [link](https://github.com/daos-stack/daos/blob/release/2.4/LICENSE) +for more information on the licenses. + +Refer to the [System Deployment](https://docs.daos.io/v2.4/admin/deployment/) +section of the [DAOS Administration Guide](https://docs.daos.io/v2.4/admin/hardware/) +for installation details. diff --git a/docs/release/support_matrix.md b/docs/release/support_matrix.md index 34585207cc3..a0c9c7a283f 100644 --- a/docs/release/support_matrix.md +++ b/docs/release/support_matrix.md @@ -1,7 +1,437 @@ -# DAOS Version 2.4 Support +# DAOS Version 2.4 Support (DRAFT) -DAOS 2.4 is under active development and has not been released yet. -The release is planned for 2Q2023. -In the meantime, please refer to the support document for the -[latest](https://docs.daos.io/latest/release/support_matrix/) -stable DAOS release. +!!! note + This document is a DRAFT of the DAOS Version 2.4 Support document. + Information in this document may change without notice before the + release of DAOS Version 2.4. + + +## Community Support and Commercial Support + +Community support for DAOS is available through the +[DAOS mailing list](https://daos.groups.io/) and the +[DAOS Slack channel](https://daos-stack.slack.com/). +The [DAOS community JIRA tickets](https://daosio.atlassian.net/jira) +can be searched for known issues and possible solutions. +Community support is provided on a best effort basis +without any guaranteed SLAs. + +The Intel DAOS engineering team +can also be contracted to provide Commercial Level-3 Support for DAOS. +Under such a support agreement, Intel partners that offer DAOS +Commercial Support to their end customers will provide the DAOS +Level-1 and Level-2 support. They can then escalate Level-2 support +tickets to the Intel Level-3 support team +through a dedicated JIRA path with well-defined SLAs. +Please refer to the +[intel.com landing page for DAOS](https://www.intel.com/content/www/us/en/high-performance-computing/daos.html) +for information on the DAOS partner ecosystem. + +This document describes the supported environments for Intel Level-3 support +at the DAOS Version 2.4 level. +Information for future releases is indicative only and may change. +Partner support offerings may impose further constraints, for example if they +include DAOS support as part of a more general cluster support offering +with its own release cycle. + +Some members of the DAOS community have reported successful compilation +and basic testing of DAOS in other environments (for example on ARM64 +platforms, or on other Linux distributions). Those activities are highly +appreciated community contributions. However such environments are +not currently supported by Intel in a production environment. + + +## Hardware platforms supported for DAOS Servers + +DAOS Version 2.4 supports the x86\_64 architecture. + +DAOS servers require byte-addressable Storage Class Memory (SCM) +for the DAOS metadata, and there are two different ways to +implement SCM in a DAOS server: Using Persistent Memory, +or using DRAM combined with logging to NVMe SSDs. + + +### DAOS Servers with Persistent Memory + +All DAOS versions support Intel Optane Persistent Memory (PMem) +as its SCM layer. 
DAOS Version 2.4 has been validated with +[Intel Optane Persistent Memory 100 Series](https://ark.intel.com/content/www/us/en/ark/products/series/190349/intel-optane-persistent-memory-100-series.html) +on 2nd gen Intel Xeon Scalable processors, and with +[Intel Optane Persistent Memory 200 Series](https://ark.intel.com/content/www/us/en/ark/products/series/203877/intel-optane-persistent-memory-200-series.html) +on 3rd gen Intel Xeon Scalable processors. + +For maximum performance, it is strongly recommended that all memory channels +of a DAOS server are populated with one DRAM module and one Optane PMem module. +All Optane PMem modules in a DAOS server must have the same capacity. + +!!! note + Note that the Intel Optane Persistent Memory 300 Series + for 4th gen Intel Xeon Scalable processors has been cancelled, + and is not supported by DAOS. + +[PMDK](https://github.com/pmem/pmdk) is used as the programming interface +when using Optane Persistent Memory. + + +### DAOS Servers without Persistent Memory + +To support DAOS servers without Optane Persistent Memory, +DAOS Version 2.4 includes a Technology Preview of the +_Metadata-on-SSD_ feature. This code path uses DRAM memory to hold the +DAOS metadata, and persists the DAOS metadata on NVMe SSDs through +a write-ahead log (WAL) and asynchronous metadata checkpointing. + +More details on the Metadata-on-SSD functionality can be found in the +article [DAOS beyond Persistent Memory]() +in the _ISC High Performance 2023 International Workshops proceedings_ +and in the DAOS Administration Guide. + +For maximum performance, it is strongly recommended that all memory channels +of a DAOS server are populated. + + +### NVMe Storage in DAOS Servers + +While not strictly required, DAOS servers typically include NVMe disks +for bulk storage, which must be supported by [SPDK](https://spdk.io/). +(NVMe storage can be emulated by files on non-NVMe storage for development +and testing purposes, but this is not supported in a production environment.) +All NVMe disks managed by a single DAOS engine must have identical capacity, +and it is strongly recommended to use identical drive models. +It is also strongly recommended that all DAOS engines in a DAOS system +have identical NVMe storage configurations. +The number of targets per DAOS engine must be identical for all DAOS engines. + +DAOS Version 2.4 supports Intel Volume Management Devices (VMD) to manage the +NVMe disks on the DAOS servers. Enabling VMD is platform-dependent; +details are provided in the Administration Guide. + +Each DAOS engine needs one high-speed network port for communication in the +DAOS data plane. DAOS Version 2.4 does not support more than one +high-speed network port per DAOS engine. +(It is possible that two DAOS engines on a 2-socket server share a +single high-speed network port for development and testing purposes, +but this is not supported in a production environment.) +It is strongly recommended that all DAOS engines in a DAOS system use the same +model of high-speed fabric adapter. +Heterogeneous adapter population across DAOS engines has **not** been tested, +and running with such configurations may cause unexpected behavior. +Please refer to "Fabric Support" below for more details. + + +## Hardware platforms supported for DAOS Clients + +DAOS Version 2.4 supports the x86\_64 architecture. + +DAOS clients have no specific hardware dependencies. + +Each DAOS client needs a network port on the same high-speed interconnect +that the DAOS servers are connected to. 
+Multiple high-speed network ports per DAOS client are supported. +Note that a single task on a DAOS client will always use a single network port, +but when multiple tasks per client node are used then the DAOS agent will +distribute the load by allocating different network ports to different tasks. + + +## Operating Systems supported for DAOS Servers + +The DAOS software stack is built and supported on +Linux for the x86\_64 architecture. + +DAOS Version 2.4 has been primarily validated +on [Rocky Linux 8.6](https://docs.rockylinux.org/release_notes/8.6/) +and [openSUSE Leap 15.4](https://en.opensuse.org/openSUSE:Roadmap). +The following subsections provide details on the Linux distributions +which DAOS Version 2.4 supports on DAOS servers. + +Note that all DAOS servers in a DAOS server cluster (also called _DAOS system_) +must run the same Linux distribution. DAOS clients that access a DAOS server +cluster can run the same or different Linux distributions. + + +### SUSE Linux Enterprise Server 15 and openSUSE Leap 15 + +DAOS Version 2.4 is supported on SLES 15 SP4 and openSUSE Leap 15.4. + +General support for SLES 15 SP3 has ended on 31-Dec-2022. +DAOS nodes running SLES 15 SP3 or openSUSE 15.3 +have to be updated to 15.4 before updating DAOS to version 2.4. + +Links to SLES 15 Release Notes: + +* [SLES 15 SP3](https://www.suse.com/releasenotes/x86_64/SUSE-SLES/15-SP3/) +* [SLES 15 SP4](https://www.suse.com/releasenotes/x86_64/SUSE-SLES/15-SP4/) + +Links to openSUSE Leap 15 Release Notes: + +* [openSUSE Leap 15.3](https://doc.opensuse.org/release-notes/x86_64/openSUSE/Leap/15.3/) +* [openSUSE Leap 15.4](https://doc.opensuse.org/release-notes/x86_64/openSUSE/Leap/15.4/) + +Refer to the [SLES Life Cycle](https://www.suse.com/lifecycle/) +description on the SUSE support website for information on SLES support phases. + + +### Enterprise Linux 8 (EL8): RHEL 8, Rocky Linux 8, AlmaLinux 8 + +DAOS Version 2.4.0 is supported on EL 8.6 with Extended Update Support (EUS). +Support for the EL 8.7 release has ended, and DAOS Version 2.4 is not supported on EL 8.7. +Validation of DAOS Version 2.4 on EL 8.8 is in progress. + +!!! note + Most validation of DAOS Version 2.4 has been done on the Rocky Linux 8.6 release. + +!!! note + CentOS Linux 8 is not supported by DAOS Version 2.4. + Please install a supported EL8 operating system before deploying DAOS Version 2.4. + +Links to RHEL 8 Release Notes: + +* [RHEL 8.6](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.6_release_notes/index) +* [RHEL 8.7](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.7_release_notes/index) +* [RHEL 8.8](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/8.8_release_notes/index) + +Links to Rocky Linux 8 Release Notes: + +* [Rocky Linux 8.6](https://docs.rockylinux.org/release_notes/8_6/) +* [Rocky Linux 8.7](https://docs.rockylinux.org/release_notes/8_7/) +* [Rocky Linux 8.8](https://docs.rockylinux.org/release_notes/8_8/) + +Links to AlmaLinux 8 Release Notes: + +* [AlmaLinux 8.6](https://wiki.almalinux.org/release-notes/8.6.html) +* [AlmaLinux 8.7](https://wiki.almalinux.org/release-notes/8.7.html) +* [AlmaLinux 8.8](https://wiki.almalinux.org/release-notes/8.8.html) + +Refer to the [RHEL Life Cycle](https://access.redhat.com/support/policy/updates/errata/) +description on the Red Hat support website for information on RHEL support phases. 
+
+
+### Enterprise Linux 9 (EL9): RHEL 9, Rocky Linux 9, AlmaLinux 9
+
+DAOS Version 2.4.0 has not been validated and is not supported on EL9.
+Support for EL 9.2 (or later) will be added in DAOS Version 2.6.
+
+Links to RHEL 9 Release Notes:
+
+* [RHEL 9.0](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/9.0_release_notes/index)
+* [RHEL 9.1](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/9.1_release_notes/index)
+* [RHEL 9.2](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/9.2_release_notes/index)
+
+Links to Rocky Linux Release Notes:
+
+* [Rocky Linux 9.0](https://docs.rockylinux.org/release_notes/9_0/)
+* [Rocky Linux 9.1](https://docs.rockylinux.org/release_notes/9_1/)
+* [Rocky Linux 9.2](https://docs.rockylinux.org/release_notes/9_2/)
+
+Links to AlmaLinux Release Notes:
+
+* [AlmaLinux 9.0](https://wiki.almalinux.org/release-notes/9.0.html)
+* [AlmaLinux 9.1](https://wiki.almalinux.org/release-notes/9.1.html)
+* [AlmaLinux 9.2](https://wiki.almalinux.org/release-notes/9.2.html)
+
+
+### Unsupported Linux Distributions
+
+With DAOS Version 2.4, CentOS 7 and RHEL 7 are no longer supported.
+Please update your DAOS servers to a supported EL8 level before updating to DAOS 2.4.
+
+DAOS also does not support
+openSUSE Tumbleweed,
+Fedora,
+CentOS Linux,
+CentOS Stream,
+Ubuntu, or
+Oracle Linux.
+
+
+## Operating Systems supported for DAOS Clients
+
+The DAOS software stack is built and supported on
+Linux for the x86\_64 architecture.
+
+In DAOS Version 2.4, the supported Linux distributions and versions for DAOS clients
+are identical to those for DAOS servers. Please refer to the
+[previous section](#Operating-Systems-supported-for-DAOS-Servers) for details.
+
+In future DAOS releases, DAOS client support may be added for additional
+Linux distributions and/or versions.
+
+
+## Fabric Support
+
+DAOS Version 2.4 supports both OFI [libfabric](https://ofiwg.github.io/libfabric/)
+and UCF [UCX](https://openucx.org/) for communication in the DAOS data plane.
+This section describes the supported network providers and contains references
+to vendor-specific information for the supported networking hardware.
+
+
+### OFI libfabric
+
+With the exception of UCX for InfiniBand networks, OFI libfabric is the recommended
+networking stack for DAOS. DAOS Version 2.4 ships with version 1.18.1rc1 of
+[libfabric](https://ofiwg.github.io/libfabric/)
+(but see below for DAOS on HPE Slingshot).
+It is strongly recommended to use exactly the provided libfabric version
+on all DAOS servers and all DAOS clients.
+
+Links to libfabric releases on github
+(the RPM distribution of DAOS includes libfabric RPM packages with the correct version):
+
+* [libfabric 1.18.1rc1](https://github.com/ofiwg/libfabric/releases/tag/v1.18.1rc1) (release candidate)
+
+Not all libfabric core providers listed in
+[fi\_provider(7)](https://ofiwg.github.io/libfabric/main/man/fi_provider.7.html)
+are supported by DAOS. The following providers are supported:
+
+* The `ofi+tcp` provider is supported on all networking hardware.
+ It does not use RDMA, so on an RDMA-capable network this provider typically
+ does not achieve the maximum performance of the fabric.
+* The `ofi+verbs` provider is supported for RDMA communication over InfiniBand
+ fabrics. Note that as an alternative to libfabric, the UCX networking stack
+ can be used on InfiniBand fabrics as described in the next subsection.
+* The `ofi+cxi` provider is supported for RDMA communication over Slingshot. +* The `ofi+opx` (Omni-Path Express) provider is enabled as a _Technology Preview_ + for RDMA transport over Omni-Path fabrics, for testing and evaluation purposes. + In production environments using Omni-Path networking, please continue to use + the `ofi+tcp` provider until the `ofi+opx` provider is fully supported. + +!!! note + Starting with libfabric 1.18.0, libfabric has support for TCP without `rxm`. + DAOS [PR12436](https://github.com/daos-stack/daos/pull/12436) + will remove the automatic addition of `rxm` to the `ofi+tcp` provider string; + to get `rxm` it then has to be explicitly added as `ofi+tcp;ofi_rxm`. + +!!! note + The `ofi+psm2` provider for Omni-Path fabrics has known issues + when used with DAOS, and it has been removed from DAOS Version 2.4. + +!!! note + The `ofi+psm3` provider for Ethernet fabrics has not been validated with + and is not supported by DAOS Version 2.4. + + +### UCF Unified Communication X (UCX) + +For InfiniBand fabrics, DAOS 2.4 also supports [UCX](https://openucx.org/), +which is maintained by the Unified Communication Framework (UCF) consortium. + +DAOS Version 2.4 has been validated primarily with UCX Version 1.14.0-1, +which is included in the MLNX\_OFED 5.8 levels listed in the next section. +UCX Version 1.15.0-1 (included in MLNX\_OFED 5.9 and 23.04) +has not been validated with DAOS 2.4.0. + +* The `ucx+dc_x` provider has been validated and is supported with DAOS Version 2.4. + It is the recommended fabric provider on InfiniBand fabrics. + +* The `ucx+tcp` provider can be used for evaluation and testing purposes, + but it has not been fully validated with DAOS Version 2.4 and is not supported + for use in production environments. + + +### NVIDIA/Mellanox OFED (MLNX\_OFED) + +On [NVIDIA/Mellanox InfiniBand](https://www.nvidia.com/en-us/networking/products/infiniband/) +fabrics, DAOS requires that the +[Mellanox OFED (MLNX\_OFED)](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed) +software stack is installed on the DAOS servers and DAOS clients. + +DAOS Version 2.4 has been validated with MLNX\_OFED Version 5.8-1, +and both 5.8-1.0.1.1 and 5.8-1.1.2.1 are supported. +Versions older than 5.8-1 are not supported by DAOS 2.4. +MLNX\_OFED 5.8-2, 5.9 and 23.04 have not been validated with DAOS 2.4.0. + +Links to MLNX\_OFED Release Notes: + +* [MLNX\_OFED 5.8-1.0.1.1](https://docs.nvidia.com/networking/display/MLNXOFEDv581011/Release+Notes) (October 31, 2022) +* [MLNX\_OFED 5.8-1.1.2.1](https://docs.nvidia.com/networking/display/MLNXOFEDv581121LTS/Release+Notes) (December 1, 2022) +* [MLNX\_OFED 5.8-2.0.3.0](https://docs.nvidia.com/networking/display/MLNXOFEDv582030LTS) (February 28, 2023) +* [MLNX\_OFED 5.9-0.5.6.0](https://docs.nvidia.com/networking/display/MLNXOFEDv590560/Release+Notes) (February 2, 2023) +* [MLNX\_OFED 23.04-0.5.3.3](https://docs.nvidia.com/networking/display/MLNXOFEDv23040533/Release+Notes) (May 8, 2023) +* [MLNX\_OFED 23.04-1.1.3.0](https://docs.nvidia.com/networking/display/MLNXOFEDv23041130/Release+Notes) (June 1, 2023) + +It is strongly recommended that all DAOS servers and all DAOS clients +run the same version of MLNX\_OFED, and that the InfiniBand adapters are +updated to the firmware levels that are included in that MLNX\_OFED +distribution. +It is also strongly recommended that the same model of +InfiniBand fabric adapter is used in all DAOS servers. 
+DAOS Version 2.4 has **not** been tested with heterogeneous InfiniBand
+adapter configurations.
+The only exception to this recommendation is the mix of single-port
+and dual-port adapters of the same generation, where only one of the ports
+of the dual-port adapter(s) is used by DAOS.
+
+
+### HPE Slingshot
+
+Customers using an [HPE Slingshot](https://www.hpe.com/us/en/compute/hpc/slingshot-interconnect.html)
+fabric should contact their HPE representatives for information on the recommended HPE software stack
+to use with DAOS Version 2.4 and the libfabric CXI provider.
+
+
+### Cornelis Omni-Path Express (OPX)
+
+DAOS Version 2.4 includes a Technology Preview of the libfabric Omni-Path Express
+(OPX) provider that supports [Omni-Path](https://www.cornelisnetworks.com/products/)
+networking from Cornelis Networks.
+See [fi\_opx(7)](https://ofiwg.github.io/libfabric/main/man/fi_opx.7.html) for details.
+This OPX Technology Preview can be used for evaluation and testing purposes,
+but it is not yet supported for production environments.
+In production environments using Omni-Path networking, please continue to use the
+libfabric TCP provider until the libfabric OPX provider is fully supported.
+
+The DAOS Version 2.4 RPM builds include the required version of libfabric
+to enable the OPX Technology Preview.
+
+Please refer to the
+[Cornelis Networks presentation](https://daosio.atlassian.net/wiki/download/attachments/11015454821/12_Update_on_Omni-Path_Support_for_DAOS_DUG21_19Nov2021.pdf)
+at [DUG21](https://daosio.atlassian.net/wiki/spaces/DC/pages/11015454821/DUG21)
+for information on Omni-Path Express support for DAOS.
+
+
+## DAOS Scaling
+
+DAOS is a scale-out storage solution that is designed for extreme scale.
+This section summarizes the DAOS scaling targets, some DAOS architectural limits,
+and the current testing limits of DAOS Version 2.4.
+
+Note: Scaling characteristics depend on the properties of the high-performance
+interconnect, and the libfabric provider that is used. The DAOS scaling targets
+below assume a non-blocking, RDMA-capable fabric. Most scaling tests so far
+have been performed on InfiniBand fabrics with the libfabric `verbs` provider.
+
+DAOS scaling targets
+(these are order of magnitude figures that indicate what the DAOS architecture
+should support - see below for the scales at which DAOS 2.4 has been validated):
+
+* DAOS client nodes in a DAOS system: 10^5 (hundreds of thousands)
+* DAOS servers in a DAOS system: 10^3 (thousands)
+* DAOS engines per DAOS server: 10^0 (less than ten)
+* DAOS engines per CPU socket: 10^0 (1, 2 or 4)
+* DAOS targets per DAOS engine: 10^1 (tens)
+* SCM storage devices per DAOS engine: 10^1 (tens)
+* NVMe storage devices per DAOS engine: 10^1 (tens)
+* DAOS pools in a DAOS system: 10^2 (hundreds)
+* DAOS containers in a DAOS pool: 10^2 (hundreds)
+* DAOS objects in a DAOS container: 10^10 (tens of billions)
+* Application tasks accessing a DAOS container: 10^6 (millions)
+
+Note that DAOS has an architectural limit of 2^16 = 65536 storage targets
+in a DAOS system, because the number of storage targets is encoded in
+16 of the 32 "DAOS internal bits" within the 128-bit DAOS Object ID.
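As a back-of-the-envelope illustration of that limit (hypothetical numbers; the per-engine target count is taken from the validated scales listed below):

```
# 16 bits of target index allow at most 2^16 targets in one DAOS system:
echo $((1 << 16))            # 65536
# With 32 targets per engine and 2 engines per server, this corresponds to
# at most 65536 / (32 * 2) = 1024 DAOS servers in a single system:
echo $((65536 / (32 * 2)))   # 1024
```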
+
+DAOS Version 2.4 has been validated at the following scales:
+
+* DAOS client nodes in a DAOS system: 256
+* DAOS servers in a DAOS system: 256
+* DAOS engines per DAOS server: 1, 2 and 4
+* DAOS engines per CPU socket: 1 and 2
+* DAOS targets per DAOS engine: 4-32
+* SCM storage devices per DAOS engine: 6 (Optane PMem 100), 8 (Optane PMem 200)
+* NVMe storage devices per DAOS engine: 0 (PMem-only pools), 4-12
+* DAOS pools in a DAOS system: 100
+* DAOS containers in a DAOS pool: 100
+* DAOS objects in a DAOS container: 6 billion (in mdtest benchmarks)
+* Application tasks accessing a DAOS container: 3072 (using verbs)
+
+This test coverage will be expanded in subsequent DAOS releases.
diff --git a/docs/release/upgrading.md b/docs/release/upgrading.md
index 24f34be7479..48338666502 100644
--- a/docs/release/upgrading.md
+++ b/docs/release/upgrading.md
@@ -1,6 +1,6 @@
# Upgrading to DAOS Version 2.4
DAOS 2.4 is under active development and has not been released yet.
-The release is planned for 2Q2023.
+The release is planned for 3Q2023.
In the meantime, please refer to the upgrading information for the
[latest](https://docs.daos.io/latest/release/upgrading/) DAOS release.
diff --git a/site_scons/components/__init__.py b/site_scons/components/__init__.py
index e4412bb5228..205cc9c3563 100644
--- a/site_scons/components/__init__.py
+++ b/site_scons/components/__init__.py
@@ -218,8 +218,6 @@ def define_common(reqs):
reqs.define('yaml', headers=['yaml.h'], package='libyaml-devel')
- reqs.define('lmdb', headers=['lmdb.h'], libs=['lmdb'], package='lmdb-devel')
-
reqs.define('event', libs=['event'], package='libevent-devel')
reqs.define('crypto', libs=['crypto'], headers=['openssl/md5.h'], package='openssl-devel')
diff --git a/site_scons/prereq_tools/base.py b/site_scons/prereq_tools/base.py
index 3259fe1c1d1..491e132deea 100644
@@ -528,7 +528,7 @@ def run_build(self, opts):
common_reqs = ['argobots', 'ucx', 'ofi', 'hwloc', 'mercury', 'boost', 'uuid', 'crypto', 'protobufc', 'lz4', 'isal', 'isal_crypto']
client_reqs = ['fuse', 'json-c', 'capstone']
- server_reqs = ['pmdk', 'spdk', 'lmdb']
+ server_reqs = ['pmdk', 'spdk']
test_reqs = ['cmocka']
reqs = []
diff --git a/src/bio/SConscript b/src/bio/SConscript
index 6a7eacd4d93..88211d8eaf1 100644
--- a/src/bio/SConscript
+++ b/src/bio/SConscript
@@ -38,9 +38,6 @@ def scons():
bio = denv.d_library("bio", tgts, install_off="../..", LIBS=libs)
denv.Install('$PREFIX/lib64/daos_srv', bio)
- if prereqs.test_requested():
- SConscript('tests/SConscript', exports='denv')
-
if __name__ == "SCons.Script":
scons()
diff --git a/src/bio/bio_xstream.c b/src/bio/bio_xstream.c
index 0c9db8d31a4..fa99bd18a1d 100644
--- a/src/bio/bio_xstream.c
+++ b/src/bio/bio_xstream.c
@@ -317,6 +317,13 @@ bio_nvme_init(const char *nvme_conf, int numa_node, unsigned int mem_size,
goto free_cond;
}
+ /*
+ * Let's keep using a large cluster size (1GB) for pmem mode, as the SPDK blobstore
+ * loading time is unexpectedly long for a smaller cluster size (32MB), see DAOS-13694.
+ */
+ if (!bio_nvme_configured(SMD_DEV_TYPE_META))
+ nvme_glb.bd_bs_opts.cluster_sz = (1UL << 30); /* 1GB */
+
D_INFO("MD on SSD is %s\n", bio_nvme_configured(SMD_DEV_TYPE_META) ?
"enabled" : "disabled"); diff --git a/src/bio/smd/tests/smd_ut.c b/src/bio/smd/tests/smd_ut.c index c8259c248e9..d877eefed62 100644 --- a/src/bio/smd/tests/smd_ut.c +++ b/src/bio/smd/tests/smd_ut.c @@ -199,7 +199,6 @@ db_fini(void) uuid_t dev_id1; uuid_t dev_id2; -static bool is_lmdb; static int smd_ut_setup(void **state) { @@ -210,13 +209,8 @@ smd_ut_setup(void **state) print_error("Error initializing the debug instance\n"); return rc; } - if (is_lmdb) { - lmm_db_init_ex(SMD_STORAGE_PATH, "sys_db", true, false); - rc = smd_init(lmm_db_get()); - } else { - db_init(); - rc = smd_init(&ut_db.ud_db); - } + db_init(); + rc = smd_init(&ut_db.ud_db); if (rc) { print_error("Error initializing SMD store: %d\n", rc); @@ -230,13 +224,8 @@ smd_ut_setup(void **state) static int smd_ut_teardown(void **state) { - if (is_lmdb) { - lmm_db_fini(); - smd_fini(); - } else { - smd_fini(); - db_fini(); - } + smd_fini(); + db_fini(); daos_debug_fini(); return 0; } @@ -552,14 +541,12 @@ print_usage(char *name) print_message( "\n\nCOMMON TESTS\n==========================\n"); print_message("%s -h|--help\n", name); - print_message("%s -l|--lmdb\n", name); } const char *s_opts = "hl"; static int idx; static struct option l_opts[] = { {"help", no_argument, NULL, 'h'}, - {"lmdb", no_argument, NULL, 'l'}, }; int main(int argc, char **argv) @@ -579,9 +566,6 @@ int main(int argc, char **argv) print_usage(argv[0]); rc = 0; goto out; - case 'l': - is_lmdb = true; - break; default: rc = 1; goto out; diff --git a/src/bio/tests/SConscript b/src/bio/tests/SConscript deleted file mode 100644 index 80d7dd7c931..00000000000 --- a/src/bio/tests/SConscript +++ /dev/null @@ -1,20 +0,0 @@ -"""Build Blob I/O tests""" - - -def scons(): - """Execute build""" - Import('denv') - - libraries = ['uuid', 'abt', 'bio', 'gurt', 'cmocka', 'daos_common_pmem', 'daos_tests'] - - env = denv.Clone() - - env.AppendUnique(LIBPATH=[Dir('..')]) - env.AppendUnique(RPATH_FULL=['$PREFIX/lib64/daos_srv']) - bio_ut_src = ['bio_ut.c', 'wal_ut.c'] - bio_ut = env.d_test_program('bio_ut', bio_ut_src, LIBS=libraries) - env.Install('$PREFIX/bin/', bio_ut) - - -if __name__ == "SCons.Script": - scons() diff --git a/src/cart/crt_corpc.c b/src/cart/crt_corpc.c index 9442c0f3766..6232af5741f 100644 --- a/src/cart/crt_corpc.c +++ b/src/cart/crt_corpc.c @@ -94,6 +94,7 @@ crt_corpc_initiate(struct crt_rpc_priv *rpc_priv) struct crt_grp_gdata *grp_gdata; struct crt_grp_priv *grp_priv; struct crt_corpc_hdr *co_hdr; + int src_timeout; bool grp_ref_taken = false; int rc = 0; @@ -121,6 +122,10 @@ crt_corpc_initiate(struct crt_rpc_priv *rpc_priv) } } + /* Inherit a timeout from a source */ + src_timeout = rpc_priv->crp_req_hdr.cch_src_timeout; + rpc_priv->crp_timeout_sec = src_timeout; + rc = crt_corpc_info_init(rpc_priv, grp_priv, grp_ref_taken, co_hdr->coh_filter_ranks, co_hdr->coh_grp_ver /* grp_ver */, @@ -675,7 +680,8 @@ crt_corpc_reply_hdlr(const struct crt_cb_info *cb_info) D_ERROR("co_ops->co_aggregate(opc: %#x) " "failed: "DF_RC"\n", child_req->cr_opc, DP_RC(rc)); - rc = 0; + if (co_info->co_rc == 0) + co_info->co_rc = rc; } co_info->co_child_ack_num++; D_DEBUG(DB_NET, "parent rpc %p, child rpc %p, " @@ -713,7 +719,8 @@ crt_corpc_reply_hdlr(const struct crt_cb_info *cb_info) D_ERROR("co_ops->co_aggregate(opc: %#x)" " failed: "DF_RC"\n", child_req->cr_opc, DP_RC(rc)); - rc = 0; + if (co_info->co_rc == 0) + co_info->co_rc = rc; } } } @@ -872,6 +879,7 @@ crt_corpc_req_hdlr(struct crt_rpc_priv *rpc_priv) child_rpc_priv = container_of(child_rpc, struct crt_rpc_priv, 
crp_pub); + child_rpc_priv->crp_timeout_sec = rpc_priv->crp_timeout_sec; corpc_add_child_rpc(rpc_priv, child_rpc_priv); child_rpc_priv->crp_grp_priv = co_info->co_grp_priv; diff --git a/src/cart/crt_ctl.c b/src/cart/crt_ctl.c index 3500bec4a91..3ab6f1c0df4 100644 --- a/src/cart/crt_ctl.c +++ b/src/cart/crt_ctl.c @@ -117,11 +117,11 @@ crt_hdlr_ctl_get_uri_cache(crt_rpc_t *rpc_req) out_args->cguc_grp_cache.ca_count = uri_cache.idx; /* actual count */ rc = 0; out: + D_RWLOCK_UNLOCK(&grp_priv->gp_rwlock); out_args->cguc_rc = rc; rc = crt_reply_send(rpc_req); D_ASSERTF(rc == 0, "crt_reply_send() failed. rc: %d\n", rc); D_DEBUG(DB_TRACE, "sent reply to get uri cache request\n"); - D_RWLOCK_UNLOCK(&grp_priv->gp_rwlock); D_FREE(uri_cache.grp_cache); } diff --git a/src/cart/crt_hg_proc.c b/src/cart/crt_hg_proc.c index 5055f011c9f..e317f1ecbfb 100644 --- a/src/cart/crt_hg_proc.c +++ b/src/cart/crt_hg_proc.c @@ -542,6 +542,7 @@ crt_proc_in_common(crt_proc_t proc, crt_rpc_input_t *data) ); hdr->cch_dst_tag = rpc_priv->crp_pub.cr_ep.ep_tag; + hdr->cch_src_timeout = rpc_priv->crp_timeout_sec; if (crt_is_service()) { hdr->cch_src_rank = crt_grp_priv_get_primary_rank( diff --git a/src/cart/crt_rpc.c b/src/cart/crt_rpc.c index 857613c4d08..e834064250b 100644 --- a/src/cart/crt_rpc.c +++ b/src/cart/crt_rpc.c @@ -1897,6 +1897,28 @@ crt_req_dst_tag_get(crt_rpc_t *rpc, uint32_t *tag) return rc; } +int +crt_req_src_timeout_get(crt_rpc_t *rpc, uint16_t *timeout) +{ + struct crt_rpc_priv *rpc_priv = NULL; + int rc = 0; + + if (rpc == NULL) { + D_ERROR("NULL rpc passed\n"); + D_GOTO(out, rc = -DER_INVAL); + } + + if (timeout == NULL) { + D_ERROR("NULL timeout passed\n"); + D_GOTO(out, rc = -DER_INVAL); + } + + rpc_priv = container_of(rpc, struct crt_rpc_priv, crp_pub); + *timeout = rpc_priv->crp_req_hdr.cch_src_timeout; +out: + return rc; +} + int crt_register_hlc_error_cb(crt_hlc_error_cb event_handler, void *arg) { diff --git a/src/cart/crt_rpc.h b/src/cart/crt_rpc.h index 608b5d37f04..1b22358fb71 100644 --- a/src/cart/crt_rpc.h +++ b/src/cart/crt_rpc.h @@ -65,12 +65,15 @@ struct crt_common_hdr { d_rank_t cch_dst_rank; /* originator rank in default primary group */ d_rank_t cch_src_rank; - /* tag to which rpc request was sent to */ - uint32_t cch_dst_tag; + /* destination tag */ + uint16_t cch_dst_tag; + /* source timeout, to be replaced by deadline eventually */ + uint16_t cch_src_timeout; /* used in crp_reply_hdr to propagate rpc failure back to sender */ uint32_t cch_rc; }; + typedef enum { RPC_STATE_INITED = 0x36, RPC_STATE_QUEUED, /* queued for flow controlling */ diff --git a/src/client/api/event.c b/src/client/api/event.c index ba12a05a996..42382f95df5 100644 --- a/src/client/api/event.c +++ b/src/client/api/event.c @@ -619,8 +619,12 @@ daos_eq_create(daos_handle_t *eqh) int rc = 0; /** not thread-safe, but best effort */ - if (eq_ref == 0) + D_MUTEX_LOCK(&daos_eq_lock); + if (eq_ref == 0) { + D_MUTEX_UNLOCK(&daos_eq_lock); return -DER_UNINIT; + } + D_MUTEX_UNLOCK(&daos_eq_lock); eq = daos_eq_alloc(); if (eq == NULL) diff --git a/src/client/dfs/dfs.c b/src/client/dfs/dfs.c index 280ec52318d..0184287abc8 100644 --- a/src/client/dfs/dfs.c +++ b/src/client/dfs/dfs.c @@ -4659,7 +4659,8 @@ dfs_read_int(dfs_t *dfs, dfs_obj_t *obj, daos_off_t off, dfs_iod_t *iod, if (rc) D_GOTO(err_params, rc = daos_der2errno(rc)); - return dc_task_schedule(task, true); + rc = dc_task_schedule(task, true); + return daos_der2errno(rc); err_params: D_FREE(params); diff --git a/src/client/dfs/dfs_sys.c b/src/client/dfs/dfs_sys.c 
index 39acae2fd4a..08a07b47b68 100644
--- a/src/client/dfs/dfs_sys.c
+++ b/src/client/dfs/dfs_sys.c
@@ -472,7 +472,7 @@ fini_sys(dfs_sys_t *dfs_sys, bool disconnect)
rc = d_hash_table_destroy(dfs_sys->hash, false);
if (rc) {
D_DEBUG(DB_TRACE, "failed to destroy hash table: "DF_RC"\n", DP_RC(rc));
- return rc;
+ return daos_der2errno(rc);
}
dfs_sys->hash = NULL;
}
diff --git a/src/client/dfuse/dfuse_core.c b/src/client/dfuse/dfuse_core.c
index bb14e10bbd0..9b39bcf1152 100644
--- a/src/client/dfuse/dfuse_core.c
+++ b/src/client/dfuse/dfuse_core.c
@@ -727,7 +727,6 @@ dfuse_cont_open_by_label(struct dfuse_projection_info *fs_handle, struct dfuse_p
DFUSE_TRA_INFO(dfc, "Using default caching values");
dfuse_set_default_cont_cache_values(dfc);
- rc = 0;
} else if (rc != 0) {
D_GOTO(err_close, rc);
}
diff --git a/src/client/dfuse/dfuse_main.c b/src/client/dfuse/dfuse_main.c
index 1c3e1aeef8a..50e25720c2b 100644
--- a/src/client/dfuse/dfuse_main.c
+++ b/src/client/dfuse/dfuse_main.c
@@ -251,10 +251,11 @@ static void
show_help(char *name)
{
printf(
- "usage: %s [pool] [container]\n"
+ "usage: %s [OPTIONS] [mountpoint [pool container]]\n"
+ "\n"
"Options:\n"
"\n"
- " -m --mountpoint= Mount point to use\n"
+ " -m --mountpoint= Mount point to use (deprecated, use positional argument)\n"
"\n"
" --pool=name pool UUID/label\n"
" --container=name container UUID/label\n"
@@ -262,8 +263,8 @@ show_help(char *name)
" --sys-name=STR DAOS system name context for servers\n"
"\n"
" -S --singlethread Single threaded\n"
- " -t --thread-count=count Number of threads to use\n"
- " -e --eq-count=count Number of event queues to use\n"
+ " -t --thread-count=count Total number of threads to use\n"
+ " -e --eq-count=count Number of event queues to use\n"
" -f --foreground Run in foreground\n"
" --enable-caching Enable all caching (default)\n"
" --enable-wb-cache Use write-back cache rather than write-through (default)\n"
@@ -271,50 +272,83 @@ show_help(char *name)
" --disable-wb-cache Use write-through rather than write-back cache\n"
" -o options mount style options string\n"
"\n"
- " --multi-user Run dfuse in multi user mode\n"
+ " --multi-user Run dfuse in multi user mode\n"
"\n"
" -h --help Show this help\n"
" -v --version Show version\n"
"\n"
- "Specifying pool and container are optional. If not set then dfuse can connect to\n"
- "many using the uuids as leading components of the path.\n"
- "Pools and containers can be specified using either uuids or labels.\n"
+ "dfuse performs a user space mount of a DAOS POSIX container at the mountpoint\n"
+ "directory that is specified as the first positional argument. This directory\n"
+ "has to exist and has to be accessible to the user, or the mount will fail.\n"
+ "Alternatively, the mountpoint directory can also be specified with the -m or\n"
+ "--mountpoint= option but this usage is deprecated.\n"
"\n"
- "The path option can be use to set a filesystem path from which Namespace attributes\n"
- "will be loaded, or if path is not set then the mount directory will also be\n"
- "checked. Only one way of setting pool and container data should be used.\n"
+ "The DAOS pool and container can be specified in several different ways " "(only one way of specifying the pool and container should be used):\n"
+ "* The DAOS pool and container can be explicitly specified on the command line\n"
+ " as positional arguments, using either UUIDs or labels.
This is the most\n"
+ " common way to use dfuse to mount a POSIX container.\n"
+ "* The DAOS pool and container can be explicitly specified on the command line\n"
+ " using the --pool and --container options, with either UUIDs or labels.\n"
+ " This usage is deprecated in favor of using positional arguments.\n"
+ "* When the --path option is used, DAOS namespace attributes are loaded from\n"
+ " that filesystem path, including the DAOS pool and container information.\n"
+ "* When the --path option is not used, then the mountpoint directory will also\n"
+ " be checked and DAOS namespace attributes will be loaded from there if present.\n"
+ "* When using the -o mount option string, pool= and container= keys in the mount\n"
+ " option string identify the DAOS pool and container.\n"
+ "* When the pool and container are not specified through any of these methods,\n"
+ " dfuse will construct filesystem pathnames under the mountpoint by using the\n"
+ " pool and container UUIDs (not labels) of *all* pools and POSIX containers to\n"
+ " which the user running dfuse has access as pathname components.\n"
+ " - A path to a POSIX container that is mounted this way can be traversed to\n"
+ " access the root of that container, for example by changing directory to\n"
+ " /mountpoint/pool_uuid/cont_uuid/.\n"
+ " - However, listing the /mountpoint/ directory is not supported and will not\n"
+ " show the pool UUIDs that are mounted there.\n"
+ " - Similarly, while the user can change directory into a /mountpoint/pool_uuid/\n"
+ " directory, listing that directory is not supported and will not show the\n"
+ " container UUIDs that are mounted there.\n"
+ " - Running 'fusermount3 -u /mountpoint' will unmount *all* POSIX containers that\n"
+ " have been mounted this way, as well as the /mountpoint/pool_uuid/ directories.\n"
"\n"
- "The default thread count is one per available core to allow maximum throughput,\n"
- "this can be modified by running dfuse in a cpuset via numactl or similar tools or\n"
- "by using the --thread-count option.\n"
- "dfuse has two types of threads: fuse threads which accept requests from the kernel\n"
- "and process them, and progress threads which complete asynchronous read/write\n"
- "operations. Each asynchronous thread will have one daos event queue so consume\n"
- "additional network resources. The --thread-count option will control the total\n"
- "number of threads, increasing the --eq-count option will reduce the number of\n"
- "fuse threads accordingly. The default value for eq-count is 1.\n"
- "As all metadata operations are blocking the level of concurrency is limited by the\n"
- "number of fuse threads."
- "Singlethreaded mode will use one thread for handling fuse requests and a second\n"
- "thread for a single event queue for a total of two threads\n"
+ "Threading and resource usage:\n"
+ "dfuse has two types of threads: fuse threads which accept and process requests from\n"
+ "the kernel, and progress threads which complete asynchronous read/write operations.\n"
+ "Each asynchronous progress thread uses one DAOS event queue and so consumes\n"
+ "additional network resources. As all metadata operations are blocking, the level of\n"
+ "concurrency in dfuse is limited by the number of fuse threads.\n"
+ "By default, the total thread count is one per available core to allow maximum\n"
+ "throughput. If hyperthreading is enabled, then one thread per hyperthread core\n"
+ "is used.
This can be modified in two ways: Reducing the number of available\n" + "cores by running dfuse in a cpuset via numactl or similar tools,\n" + "or by using the --thread-count, --eq-count or --singlethread options:\n" + "* The --thread-count option controls the total number of threads.\n" + "* Increasing the --eq-count option at a fixed --thread-count will reduce the number\n" + " of fuse threads accordingly. The default value for --eq-count is 1.\n" + "* The --singlethread mode will use one thread for handling fuse requests and a\n" + " second thread for a single event queue, for a total of two threads.\n" "\n" "If dfuse is running in background mode (the default unless launched via mpirun)\n" "then it will stay in the foreground until the mount is registered with the\n" "kernel to allow appropriate error reporting.\n" "\n" - "The -o is to allow use of dfuse via fstab or similar and accepts standard mount\n" - "options. This will be treated as a comma separated list of key=value pairs and\n" - "dfuse will use pool= and container= keys from this string.\n" + "The -o option can be used to run dfuse via fstab or similar and accepts standard\n" + "mount options. This will be treated as a comma separated list of key=value pairs,\n" + "and dfuse will use pool= and container= keys from this string.\n" "\n" - "Caching is on by default with short metadata timeouts and write-back data cache,\n" - "this can be disabled entirely for the mount by the use of command line options.\n" - "Further settings can be set on a per-container basis via the use of container\n" - "attributes. If the --disable-caching option is given then no caching will be\n" - "performed and the container attributes are not used, if --disable-wb-cache is\n" - "given the data caching for the whole mount is performed in write-back mode and\n" - "the container attributes are still used\n" + "Caching is on by default. The caching behavior for a dfuse mount can be controlled\n" + "by command line options. Further caching controls can be set on a per-container\n" + "basis through container attributes.\n" + "* If the --disable-caching option is used then no caching will be performed, and the\n" + " container attributes are not used. The default is --enable-caching.\n" + "* If --disable-wb-cache is used then the write operations for the whole mount are\n" + " performed in write-through mode, and the container attributes are still used.\n" + " The default is --enable-wb-cache.\n" + "* If --disable-caching and --enable-wb-cache are both specified,\n" + " the --enable-wb-cache option is ignored and no caching is performed.\n" "\n" - "version: %s\n", + "Version: %s\n", name, DAOS_VERSION); } diff --git a/src/client/dfuse/il/int_posix.c b/src/client/dfuse/il/int_posix.c index 43baa8fde95..a7c3161cb49 100644 --- a/src/client/dfuse/il/int_posix.c +++ b/src/client/dfuse/il/int_posix.c @@ -675,6 +675,65 @@ ioil_open_cont_handles(int fd, struct dfuse_il_reply *il_reply, struct ioil_cont return true; } +/* Wrapper function for daos_init() + * Within ioil there are some use-cases where the caller opens files in sequence and expects back + * specific file descriptors, specifically some configure scripts which hard-code fd numbers. To + * avoid problems here then if the fd being intercepted is low then pre-open a number of fds before + * calling daos_init() and close them afterwards so that daos itself does not use and of the low + * number file descriptors. 
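+ * For example, a configure script that expects its next open() to return fd 4 would break if
+ * daos_init() consumed fds 3-9 first; the pre-opened placeholders keep those low numbers busy
+ * until daos_init() returns (illustrative numbers, the actual threshold is IOIL_MIN_FD below).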
+ * The DAOS logging uses fnctl calls to force it's FDs to higher numbers to avoid the same problems. + * See DAOS-13381 for more details. Returns true on success + */ + +#define IOIL_MIN_FD 10 + +static bool +call_daos_init(int fd) +{ + int fds[IOIL_MIN_FD] = {}; + int i = 0; + int rc; + bool rcb = false; + + if (fd < IOIL_MIN_FD) { + fds[0] = __real_open("/", O_RDONLY); + + while (fds[i] < IOIL_MIN_FD) { + fds[i + 1] = __real_dup(fds[i]); + if (fds[i + 1] == -1) { + DFUSE_LOG_DEBUG("Pre-opening files failed: %d (%s)", errno, + strerror(errno)); + goto out; + } + i++; + D_ASSERT(i < IOIL_MIN_FD); + } + } + + rc = daos_init(); + if (rc) { + DFUSE_LOG_DEBUG("daos_init() failed, " DF_RC, DP_RC(rc)); + goto out; + } + rcb = true; + +out: + i = 0; + while (fds[i] > 0) { + __real_close(fds[i]); + i++; + D_ASSERT(i < IOIL_MIN_FD); + } + + if (rcb) + ioil_iog.iog_daos_init = true; + else + ioil_iog.iog_no_daos = true; + + return rcb; +} + +/* Returns true on success */ static bool check_ioctl_on_open(int fd, struct fd_entry *entry, int flags) { @@ -709,15 +768,9 @@ check_ioctl_on_open(int fd, struct fd_entry *entry, int flags) rc = pthread_mutex_lock(&ioil_iog.iog_lock); D_ASSERT(rc == 0); - if (!ioil_iog.iog_daos_init) { - rc = daos_init(); - if (rc) { - DFUSE_LOG_DEBUG("daos_init() failed, " DF_RC, DP_RC(rc)); - ioil_iog.iog_no_daos = true; - D_GOTO(err, 0); - } - ioil_iog.iog_daos_init = true; - } + if (!ioil_iog.iog_daos_init) + if (!call_daos_init(fd)) + goto err; d_list_for_each_entry(pool, &ioil_iog.iog_pools_head, iop_pools) { if (uuid_compare(pool->iop_uuid, il_reply.fir_pool) != 0) diff --git a/src/client/dfuse/ops/readdir.c b/src/client/dfuse/ops/readdir.c index 5a56062d528..f49113bbc09 100644 --- a/src/client/dfuse/ops/readdir.c +++ b/src/client/dfuse/ops/readdir.c @@ -539,9 +539,9 @@ dfuse_do_readdir(struct dfuse_projection_info *fs_handle, fuse_req_t req, struct DFUSE_TRA_DEBUG(oh, "Switching to private handle"); dfuse_dre_drop(fs_handle, oh); oh->doh_rd = _handle_init(oh->doh_ie->ie_dfs); + hdl = oh->doh_rd; if (oh->doh_rd == NULL) D_GOTO(out_reset, rc = ENOMEM); - hdl = oh->doh_rd; DFUSE_TRA_UP(oh->doh_rd, oh, "readdir"); } else { dfuse_readdir_reset(hdl); @@ -647,9 +647,11 @@ dfuse_do_readdir(struct dfuse_projection_info *fs_handle, fuse_req_t req, struct NULL); if (rc == ENOENT) { DFUSE_TRA_DEBUG(oh, "File does not exist"); + D_FREE(drc); continue; } else if (rc != 0) { DFUSE_TRA_DEBUG(oh, "Problem finding file %d", rc); + D_FREE(drc); D_GOTO(reply, rc); } @@ -665,6 +667,8 @@ dfuse_do_readdir(struct dfuse_projection_info *fs_handle, fuse_req_t req, struct rc = create_entry(fs_handle, oh->doh_ie, &stbuf, obj, dre->dre_name, out, attr_len, &rlink); if (rc != 0) { + dfs_release(obj); + D_FREE(drc); D_GOTO(reply, rc); } @@ -769,7 +773,8 @@ dfuse_do_readdir(struct dfuse_projection_info *fs_handle, fuse_req_t req, struct return 0; out_reset: - dfuse_readdir_reset(hdl); + if (hdl) + dfuse_readdir_reset(hdl); D_ASSERT(rc != 0); return rc; } diff --git a/src/client/kv/dc_kv.c b/src/client/kv/dc_kv.c index 3924c48c04b..ff3d7a7ac2f 100644 --- a/src/client/kv/dc_kv.c +++ b/src/client/kv/dc_kv.c @@ -358,9 +358,10 @@ dc_kv_put(tse_task_t *task) daos_obj_update_t *update_args; tse_task_t *update_task; struct io_params *params = NULL; + bool free_params = true; int rc; - if (args->key == NULL || args->buf_size == 0 || args->buf == NULL) + if (args->key == NULL) D_GOTO(err_task, rc = -DER_INVAL); kv = kv_hdl2ptr(args->oh); @@ -403,6 +404,7 @@ dc_kv_put(tse_task_t *task) rc = 
tse_task_register_comp_cb(task, free_io_params_cb, ¶ms, sizeof(params)); if (rc != 0) D_GOTO(err_utask, rc); + free_params = false; rc = tse_task_register_deps(task, 1, &update_task); if (rc != 0) @@ -418,7 +420,8 @@ dc_kv_put(tse_task_t *task) tse_task_complete(update_task, rc); err_task: tse_task_complete(task, rc); - D_FREE(params); + if (free_params) + D_FREE(params); if (kv) kv_decref(kv); return rc; @@ -434,6 +437,7 @@ dc_kv_get(tse_task_t *task) struct io_params *params = NULL; void *buf; daos_size_t *buf_size; + bool free_params = true; int rc; if (args->key == NULL) @@ -493,6 +497,7 @@ dc_kv_get(tse_task_t *task) rc = tse_task_register_comp_cb(task, free_io_params_cb, ¶ms, sizeof(params)); if (rc != 0) D_GOTO(err_ftask, rc); + free_params = false; rc = tse_task_register_deps(task, 1, &fetch_task); if (rc != 0) @@ -508,7 +513,8 @@ dc_kv_get(tse_task_t *task) tse_task_complete(fetch_task, rc); err_task: tse_task_complete(task, rc); - D_FREE(params); + if (free_params) + D_FREE(params); if (kv) kv_decref(kv); return rc; @@ -522,6 +528,7 @@ dc_kv_remove(tse_task_t *task) daos_obj_punch_t *punch_args; tse_task_t *punch_task; struct io_params *params = NULL; + bool free_params = true; int rc; if (args->key == NULL) @@ -553,6 +560,7 @@ dc_kv_remove(tse_task_t *task) rc = tse_task_register_comp_cb(task, free_io_params_cb, ¶ms, sizeof(params)); if (rc != 0) D_GOTO(err_ptask, rc); + free_params = false; rc = tse_task_register_deps(task, 1, &punch_task); if (rc != 0) @@ -568,7 +576,8 @@ dc_kv_remove(tse_task_t *task) tse_task_complete(punch_task, rc); err_task: tse_task_complete(task, rc); - D_FREE(params); + if (free_params) + D_FREE(params); if (kv) kv_decref(kv); return rc; diff --git a/src/common/SConscript b/src/common/SConscript index 10ad0ab4197..c61ecdeebe3 100644 --- a/src/common/SConscript +++ b/src/common/SConscript @@ -18,7 +18,6 @@ def build_daos_common(denv, client): stack_mmap_files = [] ad_mem_files = [] dav_src = [] - sys_lmdb_files = [] common_libs = ['isal', 'isal_crypto', 'cart', 'gurt', 'lz4', 'protobuf-c', 'uuid', 'pthread'] if client: @@ -31,8 +30,7 @@ def build_daos_common(denv, client): 'dav/ravl_interval.c', 'dav/recycler.c', 'dav/stats.c', 'dav/tx.c', 'dav/ulog.c', 'dav/util.c', 'dav/wal_tx.c'] ad_mem_files = ['ad_mem.c', 'ad_tx.c'] - sys_lmdb_files = ['sys_lmdb.c'] - common_libs.extend(['pmemobj', 'lmdb', 'abt']) + common_libs.extend(['pmemobj', 'abt']) benv.AppendUnique(RPATH_FULL=['$PREFIX/lib64/daos_srv']) benv.Append(CPPDEFINES=['-DDAOS_PMEM_BUILD']) benv.Append(OBJPREFIX="v_") @@ -45,8 +43,8 @@ def build_daos_common(denv, client): benv.require('argobots') benv.Append(CCFLAGS=['-DULT_MMAP_STACK']) - common = benv.d_library(libname, COMMON_FILES + dav_src + ad_mem_files + stack_mmap_files - + sys_lmdb_files, LIBS=common_libs) + common = benv.d_library(libname, COMMON_FILES + dav_src + ad_mem_files + stack_mmap_files, + LIBS=common_libs) benv.Install('$PREFIX/lib64/', common) return common diff --git a/src/common/btree.c b/src/common/btree.c index 98ccd56fe92..41d13c87a73 100644 --- a/src/common/btree.c +++ b/src/common/btree.c @@ -1059,7 +1059,7 @@ btr_check_availability(struct btr_context *tcx, struct btr_check_alb *alb) } } -static void +static int btr_node_insert_rec_only(struct btr_context *tcx, struct btr_trace *trace, struct btr_record *rec) { @@ -1068,6 +1068,7 @@ btr_node_insert_rec_only(struct btr_context *tcx, struct btr_trace *trace, bool leaf; bool reuse = false; char sbuf[BTR_PRINT_BUF]; + int rc; /* NB: assume trace->tr_node has been added to TX 
*/ D_ASSERT(!btr_node_is_full(tcx, trace->tr_node)); @@ -1080,7 +1081,6 @@ btr_node_insert_rec_only(struct btr_context *tcx, struct btr_trace *trace, nd = btr_off2ptr(tcx, trace->tr_node); if (nd->tn_keyn > 0) { struct btr_check_alb alb; - int rc; if (trace->tr_at != nd->tn_keyn) alb.at = trace->tr_at; @@ -1102,7 +1102,9 @@ btr_node_insert_rec_only(struct btr_context *tcx, struct btr_trace *trace, rec_a = btr_node_rec_at(tcx, trace->tr_node, trace->tr_at); if (reuse) { - btr_rec_free(tcx, rec_a, NULL); + rc = btr_rec_free(tcx, rec_a, NULL); + if (rc) + return rc; } else { if (trace->tr_at != nd->tn_keyn) { struct btr_record *rec_b; @@ -1116,6 +1118,7 @@ btr_node_insert_rec_only(struct btr_context *tcx, struct btr_trace *trace, } btr_rec_copy(tcx, rec_a, rec, 1); + return 0; } /** @@ -1194,7 +1197,9 @@ btr_node_split_and_insert(struct btr_context *tcx, struct btr_trace *trace, D_DEBUG(DB_TRACE, "Splitting leaf node\n"); btr_rec_copy(tcx, rec_dst, rec_src, nd_right->tn_keyn); - btr_node_insert_rec_only(tcx, trace, rec); + rc = btr_node_insert_rec_only(tcx, trace, rec); + if (rc) + return rc; /* insert the right node and the first key of the right * node to its parent @@ -1241,7 +1246,9 @@ btr_node_split_and_insert(struct btr_context *tcx, struct btr_trace *trace, */ btr_hkey_copy(tcx, &hkey_buf[0], &rec_src->rec_hkey[0]); - btr_node_insert_rec_only(tcx, trace, rec); + rc = btr_node_insert_rec_only(tcx, trace, rec); + if (rc) + return rc; btr_hkey_copy(tcx, &rec->rec_hkey[0], &hkey_buf[0]); @@ -1347,7 +1354,7 @@ btr_node_insert_rec(struct btr_context *tcx, struct btr_trace *trace, if (btr_node_is_full(tcx, trace->tr_node)) rc = btr_node_split_and_insert(tcx, trace, rec); else - btr_node_insert_rec_only(tcx, trace, rec); + rc = btr_node_insert_rec_only(tcx, trace, rec); done: return rc; } @@ -2011,7 +2018,9 @@ btr_update(struct btr_context *tcx, d_iov_t *key, d_iov_t *val, d_iov_t *val_out } D_DEBUG(DB_TRACE, "Replace the original record\n"); - btr_rec_free(tcx, rec, NULL); + rc = btr_rec_free(tcx, rec, NULL); + if (rc) + goto out; rc = btr_rec_alloc(tcx, key, val, rec, val_out); } out: diff --git a/src/common/mem.c b/src/common/mem.c index f9b5a267c8e..bfe64515ebf 100644 --- a/src/common/mem.c +++ b/src/common/mem.c @@ -34,14 +34,7 @@ struct umem_tx_stage_item { #ifdef DAOS_PMEM_BUILD -enum { - DAOS_MD_PMEM = 0, - DAOS_MD_BMEM = 1, - DAOS_MD_ADMEM = 2, -}; - static int daos_md_backend = DAOS_MD_PMEM; - #define UMM_SLABS_CNT 16 /** Initializes global settings for the pmem objects. @@ -87,6 +80,27 @@ umempobj_settings_init(bool md_on_ssd) return 0; } +int umempobj_get_backend_type(void) +{ + return daos_md_backend; +} + +int umempobj_backend_type2class_id(int backend) +{ + switch (backend) { + case DAOS_MD_PMEM: + return UMEM_CLASS_PMEM; + case DAOS_MD_BMEM: + return UMEM_CLASS_BMEM; + case DAOS_MD_ADMEM: + return UMEM_CLASS_ADMEM; + default: + D_ASSERTF(0, + "bad daos_md_backend %d\n", backend); + return -DER_INVAL; + } +} + /** Define common slabs. 
We can refine this for 2.4 pools but that is for next patch */ static const int slab_map[] = { 0, /* 32 bytes */ @@ -122,7 +136,7 @@ set_slab_desc(struct umem_pool *ph_p, struct umem_slab_desc *slab) struct dav_alloc_class_desc davslab; int rc = 0; - switch (daos_md_backend) { + switch (ph_p->up_store.store_type) { static unsigned class_id = 10; case DAOS_MD_PMEM: @@ -152,7 +166,7 @@ set_slab_desc(struct umem_pool *ph_p, struct umem_slab_desc *slab) slab->class_id = class_id++; break; default: - D_ASSERTF(0, "bad daos_md_backend %d\n", daos_md_backend); + D_ASSERTF(0, "bad daos_md_backend %d\n", ph_p->up_store.store_type); break; } return rc; @@ -176,7 +190,7 @@ static inline uint64_t slab_flags(struct umem_pool *pool, unsigned int slab_id) { D_ASSERT(slab_id < UMM_SLABS_CNT); - return (daos_md_backend == DAOS_MD_PMEM) ? + return (pool->up_store.store_type == DAOS_MD_PMEM) ? POBJ_CLASS_ID(pool->up_slabs[slab_id].class_id) : DAV_CLASS_ID(pool->up_slabs[slab_id].class_id); } @@ -276,10 +290,12 @@ umempobj_create(const char *path, const char *layout_name, int flags, if (store != NULL) umm_pool->up_store = *store; + else + umm_pool->up_store.store_type = DAOS_MD_PMEM; /* default */ D_DEBUG(DB_TRACE, "creating path %s, poolsize %zu, store_size %zu ...\n", path, poolsize, store != NULL ? store->stor_size : 0); - switch (daos_md_backend) { + switch (umm_pool->up_store.store_type) { case DAOS_MD_PMEM: pop = pmemobj_create(path, layout_name, poolsize, mode); if (!pop) { @@ -319,7 +335,7 @@ umempobj_create(const char *path, const char *layout_name, int flags, umm_pool->up_priv = bh.bh_blob; break; default: - D_ASSERTF(0, "bad daos_md_backend %d\n", daos_md_backend); + D_ASSERTF(0, "bad daos_md_backend %d\n", store->store_type); break; }; @@ -355,11 +371,15 @@ umempobj_open(const char *path, const char *layout_name, int flags, struct umem_ if (umm_pool == NULL) return NULL; - if (store != NULL) + if (store != NULL) { umm_pool->up_store = *store; + } else { + umm_pool->up_store.store_type = DAOS_MD_PMEM; /* default */ + umm_pool->up_store.store_standalone = true; + } D_DEBUG(DB_TRACE, "opening %s\n", path); - switch (daos_md_backend) { + switch (umm_pool->up_store.store_type) { case DAOS_MD_PMEM: pop = pmemobj_open(path, layout_name); if (!pop) { @@ -400,7 +420,7 @@ umempobj_open(const char *path, const char *layout_name, int flags, struct umem_ umm_pool->up_priv = bh.bh_blob; break; default: - D_ASSERTF(0, "bad daos_md_backend %d\n", daos_md_backend); + D_ASSERTF(0, "bad daos_md_backend %d\n", umm_pool->up_store.store_type); break; } @@ -423,7 +443,7 @@ umempobj_close(struct umem_pool *ph_p) PMEMobjpool *pop; struct ad_blob_handle bh; - switch (daos_md_backend) { + switch (ph_p->up_store.store_type) { case DAOS_MD_PMEM: pop = (PMEMobjpool *)ph_p->up_priv; @@ -437,7 +457,7 @@ umempobj_close(struct umem_pool *ph_p) ad_blob_close(bh); break; default: - D_ASSERTF(0, "bad daos_md_backend %d\n", daos_md_backend); + D_ASSERTF(0, "bad daos_md_backend %d\n", ph_p->up_store.store_type); break; } @@ -461,7 +481,7 @@ umempobj_get_rootptr(struct umem_pool *ph_p, size_t size) struct ad_blob_handle bh; uint64_t off; - switch (daos_md_backend) { + switch (ph_p->up_store.store_type) { case DAOS_MD_PMEM: pop = (PMEMobjpool *)ph_p->up_priv; @@ -475,7 +495,7 @@ umempobj_get_rootptr(struct umem_pool *ph_p, size_t size) bh.bh_blob = (struct ad_blob *)ph_p->up_priv; return ad_root(bh, size); default: - D_ASSERTF(0, "bad daos_md_backend %d\n", daos_md_backend); + D_ASSERTF(0, "bad daos_md_backend %d\n", 
ph_p->up_store.store_type); break; } @@ -496,7 +516,7 @@ umempobj_get_heapusage(struct umem_pool *ph_p, daos_size_t *curr_allocated) struct dav_heap_stats st; int rc = 0; - switch (daos_md_backend) { + switch (ph_p->up_store.store_type) { case DAOS_MD_PMEM: pop = (PMEMobjpool *)ph_p->up_priv; @@ -512,7 +532,7 @@ umempobj_get_heapusage(struct umem_pool *ph_p, daos_size_t *curr_allocated) *curr_allocated = 40960; /* TODO */ break; default: - D_ASSERTF(0, "bad daos_md_backend %d\n", daos_md_backend); + D_ASSERTF(0, "bad daos_md_backend %d\n", ph_p->up_store.store_type); break; } @@ -531,7 +551,7 @@ umempobj_log_fraginfo(struct umem_pool *ph_p) daos_size_t scm_used, scm_active; struct dav_heap_stats st; - switch (daos_md_backend) { + switch (ph_p->up_store.store_type) { case DAOS_MD_PMEM: pop = (PMEMobjpool *)ph_p->up_priv; @@ -552,7 +572,7 @@ umempobj_log_fraginfo(struct umem_pool *ph_p) D_ERROR("Fragmentation info, not implemented in ADMEM yet.\n"); break; default: - D_ASSERTF(0, "bad daos_md_backend %d\n", daos_md_backend); + D_ASSERTF(0, "bad daos_md_backend %d\n", ph_p->up_store.store_type); break; } } @@ -1398,17 +1418,6 @@ umem_class_init(struct umem_attr *uma, struct umem_instance *umm) bool found; found = false; -#ifdef DAOS_PMEM_BUILD - if (uma->uma_id == UMEM_CLASS_PMEM) { - if (daos_md_backend == DAOS_MD_BMEM) - uma->uma_id = UMEM_CLASS_BMEM; - else if (daos_md_backend == DAOS_MD_ADMEM) - uma->uma_id = UMEM_CLASS_ADMEM; - else - D_ASSERTF(daos_md_backend == DAOS_MD_PMEM, - "bad daos_md_backend %d\n", daos_md_backend); - } -#endif for (umc = &umem_class_defined[0]; umc->umc_id != UMEM_CLASS_UNKNOWN; umc++) { if (umc->umc_id == uma->uma_id) { diff --git a/src/common/sys_lmdb.c b/src/common/sys_lmdb.c deleted file mode 100644 index a1020c903f6..00000000000 --- a/src/common/sys_lmdb.c +++ /dev/null @@ -1,543 +0,0 @@ -/** - * (C) Copyright 2022-2023 Intel Corporation. 
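With sys_lmdb.c removed, both the PMem and MD-on-SSD code paths initialize the engine's system DB through VOS (see the dss_sys_db_init() hunk in src/engine/srv.c later in this patch). A condensed sketch of that unified flow, where use_meta stands in for the bio_nvme_configured(SMD_DEV_TYPE_META) check used in the real hunk:

```c
/* Condensed sketch, not part of this patch: pick the DB path by backend,
 * initialize the VOS-backed system DB, then hand it to SMD.
 */
static int
sys_db_init_sketch(bool use_meta, const char *sys_db_path, const char *storage_path)
{
	int rc;

	rc = vos_db_init(use_meta ? sys_db_path : storage_path);
	if (rc)
		return rc;
	rc = smd_init(vos_db_get());
	if (rc)
		vos_db_fini();
	return rc;
}
```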
- * - * SPDX-License-Identifier: BSD-2-Clause-Patent - */ -#define D_LOGFAC DD_FAC(common) - -#include -#include -#include -#include -#include -#include -#include - -#define SYS_DB_NAME "sys_db" - -#define SYS_DB_MD "metadata" -#define SYS_DB_MD_VER "version" - -#define SYS_DB_VERSION_1 1 -#define SYS_DB_VERSION SYS_DB_VERSION_1 -#define SYS_DB_MAX_MAP_SIZE (1024 * 1024 *32) - -/** private information of LMDB based system DB */ -struct lmm_sys_db { - /** exported part of VOS system DB */ - struct sys_db db_pub; - /* LMDB environment handle */ - MDB_env *db_env; - MDB_txn *db_txn; - /* Address where the new MDB_dbi handle will be stored */ - MDB_dbi db_dbi; - /* If MDB_dbi handle is valid or not */ - bool db_dbi_valid; - char *db_file; - char *db_path; - /* DB should be destroyed on exit */ - bool db_destroy_db; - ABT_mutex db_lock; -}; - -static struct lmm_sys_db lmm_db; - -static int -lmm_db_upsert(struct sys_db *db, char *table, d_iov_t *key, d_iov_t *val); - -static int -lmm_db_fetch(struct sys_db *db, char *table, d_iov_t *key, d_iov_t *val); - -static int -lmm_db_tx_begin(struct sys_db *db); - -static int -lmm_db_tx_end(struct sys_db *db, int rc); - -static struct lmm_sys_db * -db2lmm(struct sys_db *db) -{ - return container_of(db, struct lmm_sys_db, db_pub); -} - -static void -lmm_db_unlink(struct sys_db *db) -{ - struct lmm_sys_db *ldb = db2lmm(db); - - unlink(ldb->db_file); /* ignore error code */ -} - -static int -mdb_error2daos_error(int rc) -{ - if (rc > 0) - rc = -rc; - - switch (rc) { - case 0: - return 0; - case MDB_VERSION_MISMATCH: - return -DER_MISMATCH; - case MDB_INVALID: - return -DER_INVAL; - case MDB_PANIC: - case MDB_MAP_RESIZED: - return -DER_SHUTDOWN; - case MDB_READERS_FULL: - return -DER_AGAIN; - case MDB_NOTFOUND: - return -DER_NONEXIST; - case MDB_KEYEXIST: - return -DER_EXIST; - default: - return daos_errno2der(-rc); - } - -} - -/* open or create system DB stored in external storage */ -static int -lmm_db_open_create(struct sys_db *db, bool try_create) -{ - struct lmm_sys_db *ldb = db2lmm(db); - d_iov_t key; - d_iov_t val; - uint32_t ver; - int rc = 0; - - if (try_create) { - rc = mkdir(ldb->db_path, 0777); - if (rc < 0 && errno != EEXIST) { - rc = daos_errno2der(errno); - return rc; - } - } else if (access(ldb->db_file, R_OK | W_OK) != 0) { - rc = -DER_NO_PERM; - D_CRIT("No access to existing db file %s\n", ldb->db_file); - return rc; - } - - D_DEBUG(DB_IO, "Opening %s, try_create=%d\n", ldb->db_file, try_create); - rc = mdb_env_create(&ldb->db_env); - if (rc) { - rc = mdb_error2daos_error(rc); - D_CRIT("Failed to create env handle for sysdb: "DF_RC"\n", DP_RC(rc)); - goto out; - } - - rc = mdb_env_set_mapsize(ldb->db_env, SYS_DB_MAX_MAP_SIZE); - if (rc) { - rc = mdb_error2daos_error(rc); - D_CRIT("Failed to set env map size: "DF_RC"\n", DP_RC(rc)); - goto out; - } - - rc = mdb_env_open(ldb->db_env, ldb->db_file, MDB_NOSUBDIR, 0664); - if (rc) { - rc = mdb_error2daos_error(rc); - D_CRIT("Failed to open env handle for sysdb: "DF_RC"\n", DP_RC(rc)); - goto out; - } - - rc = mdb_txn_begin(ldb->db_env, NULL, 0, &ldb->db_txn); - if (rc) { - rc = mdb_error2daos_error(rc); - D_CRIT("Failed to begin tx for sysdb: "DF_RC"\n", DP_RC(rc)); - goto out; - } - - rc = mdb_dbi_open(ldb->db_txn, NULL, 0, &ldb->db_dbi); - if (rc) { - rc = mdb_error2daos_error(rc); - D_CRIT("Failed to open sysdb: "DF_RC"\n", DP_RC(rc)); - goto txn_abort; - } - ldb->db_dbi_valid = true; - - d_iov_set(&key, SYS_DB_MD_VER, strlen(SYS_DB_MD_VER)); - d_iov_set(&val, &ver, sizeof(ver)); - if 
(try_create) { - ver = SYS_DB_VERSION; - rc = lmm_db_upsert(db, SYS_DB_MD, &key, &val); - if (rc) { - D_CRIT("Failed to set version for sysdb: "DF_RC"\n", - DP_RC(rc)); - goto txn_abort; - } - rc = mdb_txn_commit(ldb->db_txn); - if (rc) - D_CRIT("Failed to commit version for sysdb: "DF_RC"\n", - DP_RC(rc)); - goto out; - } else { - /* make lock assertion happen */ - ABT_mutex_lock(db2lmm(db)->db_lock); - rc = lmm_db_fetch(db, SYS_DB_MD, &key, &val); - ABT_mutex_unlock(db2lmm(db)->db_lock); - if (rc) { - D_CRIT("Failed to read sysdb version: "DF_RC"\n", - DP_RC(rc)); - rc = -DER_INVAL; - } - - if (ver < SYS_DB_VERSION_1 || ver > SYS_DB_VERSION) - rc = -DER_DF_INCOMPT; - } -txn_abort: - mdb_txn_abort(ldb->db_txn); -out: - ldb->db_txn = NULL; - return rc; -} - -#define MAX_SMD_TABLE_LEN 32 -static int -lmm_db_generate_key(char *table, d_iov_t *key, MDB_val *db_key) -{ - char *new_key; - int table_len; - - table_len = strnlen(table, MAX_SMD_TABLE_LEN + 1); - if (table_len > MAX_SMD_TABLE_LEN) - return -DER_INVAL; - - db_key->mv_size = key->iov_len + table_len; - D_ALLOC(new_key, db_key->mv_size); - if (new_key == NULL) - return -DER_NOMEM; - - memcpy(new_key, table, table_len); - memcpy(new_key + table_len, (char *)key->iov_buf, key->iov_len); - db_key->mv_data = new_key; - - return 0; -} - -static int -lmm_db_unpack_key(char *table, d_iov_t *key, MDB_val *db_key) -{ - int table_len = strnlen(table, MAX_SMD_TABLE_LEN + 1); - char *buf; - int len; - - if (table_len > MAX_SMD_TABLE_LEN) - return -DER_INVAL; - - if (db_key->mv_size < table_len) - return -DER_INVAL; - - len = db_key->mv_size - table_len; - D_ALLOC(buf, len); - if (buf == NULL) - return -DER_NOMEM; - - memcpy(buf, db_key->mv_data + table_len, len); - d_iov_set(key, buf, len); - - return 0; -} - -static int -lmm_db_fetch(struct sys_db *db, char *table, d_iov_t *key, d_iov_t *val) -{ - MDB_val db_key, db_data; - struct lmm_sys_db *ldb = db2lmm(db); - int rc; - bool end_tx = false; - - D_ASSERT(ABT_mutex_trylock(db2lmm(db)->db_lock) == ABT_ERR_MUTEX_LOCKED); - if (ldb->db_txn == NULL) { - rc = mdb_txn_begin(ldb->db_env, NULL, MDB_RDONLY, &ldb->db_txn); - if (rc) - return mdb_error2daos_error(rc); - end_tx = true; - } - - rc = lmm_db_generate_key(table, key, &db_key); - if (rc) - goto out; - - rc = mdb_get(ldb->db_txn, ldb->db_dbi, &db_key, &db_data); - D_FREE(db_key.mv_data); - if (rc) { - rc = mdb_error2daos_error(rc); - goto out; - } - - if (db_data.mv_size != val->iov_len) { - D_ERROR("mismatch value for table: %s, expected: %lu, got: %lu\n", - table, val->iov_len, db_data.mv_size); - rc = -DER_MISMATCH; - goto out; - } - memcpy(val->iov_buf, db_data.mv_data, db_data.mv_size); - -out: - if (end_tx) { - mdb_txn_abort(ldb->db_txn); - ldb->db_txn = NULL; - } - return rc; -} - -static int -lmm_db_upsert(struct sys_db *db, char *table, d_iov_t *key, d_iov_t *val) -{ - MDB_val db_key, db_data; - struct lmm_sys_db *ldb = db2lmm(db); - int rc; - bool end_tx = false; - - if (ldb->db_txn == NULL) { - rc = lmm_db_tx_begin(db); - if (rc) - return rc; - end_tx = true; - } - - rc = lmm_db_generate_key(table, key, &db_key); - if (rc) - goto out; - - db_data.mv_size = val->iov_len; - db_data.mv_data = val->iov_buf; - - rc = mdb_put(ldb->db_txn, ldb->db_dbi, &db_key, &db_data, 0); - if (rc) - D_ERROR("Failed to put in mdb: %d\n", rc); - D_FREE(db_key.mv_data); - -out: - rc = mdb_error2daos_error(rc); - if (end_tx) - rc = lmm_db_tx_end(db, rc); - return rc; -} - -static int -lmm_db_delete(struct sys_db *db, char *table, d_iov_t *key) -{ - MDB_val 
db_key; - struct lmm_sys_db *ldb = db2lmm(db); - int rc; - bool end_tx = false; - - if (ldb->db_txn == NULL) { - rc = lmm_db_tx_begin(db); - if (rc) - return rc; - end_tx = true; - } - - rc = lmm_db_generate_key(table, key, &db_key); - if (rc) - goto out; - - rc = mdb_del(ldb->db_txn, ldb->db_dbi, &db_key, NULL); - if (rc) - D_ERROR("Failed to delete in mdb: %d\n", rc); - D_FREE(db_key.mv_data); - -out: - rc = mdb_error2daos_error(rc); - if (end_tx) - rc = lmm_db_tx_end(db, rc); - return rc; -} - -static int -lmm_db_traverse(struct sys_db *db, char *table, sys_db_trav_cb_t cb, void *args) -{ - struct lmm_sys_db *ldb = db2lmm(db); - MDB_cursor *cursor; - MDB_val db_key, db_data; - int rc; - d_iov_t key; - int table_len = strnlen(table, MAX_SMD_TABLE_LEN); - - D_ASSERT(ldb->db_txn == NULL); - rc = mdb_txn_begin(ldb->db_env, NULL, MDB_RDONLY, &ldb->db_txn); - if (rc) - return mdb_error2daos_error(rc); - - rc = mdb_cursor_open(ldb->db_txn, ldb->db_dbi, &cursor); - if (rc) { - rc = mdb_error2daos_error(rc); - goto tx_end; - } - - while ((rc = mdb_cursor_get(cursor, &db_key, &db_data, MDB_NEXT)) == 0) { - if (strncmp(db_key.mv_data, table, table_len) != 0) - continue; - - rc = lmm_db_unpack_key(table, &key, &db_key); - if (rc) - goto close; - - rc = cb(db, table, &key, args); - D_FREE(key.iov_buf); - if (rc) - goto close; - } - /* reach end */ - if (rc == MDB_NOTFOUND) - rc = 0; - rc = mdb_error2daos_error(rc); -close: - mdb_cursor_close(cursor); -tx_end: - mdb_txn_abort(ldb->db_txn); - ldb->db_txn = NULL; - - return rc; -} - -static int -lmm_db_tx_begin(struct sys_db *db) -{ - struct lmm_sys_db *ldb = db2lmm(db); - int rc; - - D_ASSERT(ldb->db_txn == NULL); - rc = mdb_txn_begin(ldb->db_env, NULL, 0, &ldb->db_txn); - - return mdb_error2daos_error(rc); -} - -static int -lmm_db_tx_end(struct sys_db *db, int rc) -{ - struct lmm_sys_db *ldb = db2lmm(db); - MDB_txn *txn = ldb->db_txn; - - D_ASSERT(txn != NULL); - ldb->db_txn = NULL; - - if (rc) { - mdb_txn_abort(txn); - return rc; - } - - rc = mdb_txn_commit(txn); - if (rc) - D_ERROR("Failed to commit txn in mdb: %d\n", rc); - - return mdb_error2daos_error(rc); -} - -static void -lmm_db_lock(struct sys_db *db) -{ - ABT_mutex_lock(db2lmm(db)->db_lock); -} - -static void -lmm_db_unlock(struct sys_db *db) -{ - ABT_mutex_unlock(db2lmm(db)->db_lock); -} - -/** Finalize system DB of VOS */ -void -lmm_db_fini(void) -{ - if (lmm_db.db_lock) - ABT_mutex_free(&lmm_db.db_lock); - if (lmm_db.db_file) { - if (lmm_db.db_destroy_db) - lmm_db_unlink(&lmm_db.db_pub); - if (lmm_db.db_env) { - if (lmm_db.db_dbi_valid) - mdb_dbi_close(lmm_db.db_env, lmm_db.db_dbi); - mdb_env_close(lmm_db.db_env); - } - D_FREE(lmm_db.db_file); - } - - D_FREE(lmm_db.db_path); - memset(&lmm_db, 0, sizeof(lmm_db)); -} - -int -lmm_db_init_ex(const char *db_path, const char *db_name, bool force_create, bool destroy_db_on_fini) -{ - int rc; - - D_ASSERT(db_path != NULL); - - memset(&lmm_db, 0, sizeof(lmm_db)); - lmm_db.db_destroy_db = destroy_db_on_fini; - - rc = ABT_mutex_create(&lmm_db.db_lock); - if (rc != ABT_SUCCESS) - return -DER_NOMEM; - - D_ASPRINTF(lmm_db.db_path, "%s", db_path); - if (lmm_db.db_path == NULL) { - D_ERROR("Generate sysdb path failed. %d\n", rc); - rc = -DER_NOMEM; - goto failed; - } - - if (!db_name) - db_name = SYS_DB_NAME; - - D_ASPRINTF(lmm_db.db_file, "%s/%s", lmm_db.db_path, db_name); - if (lmm_db.db_file == NULL) { - D_ERROR("Generate sysdb filename failed. 
%d\n", rc); - rc = -DER_NOMEM; - goto failed; - } - - strncpy(lmm_db.db_pub.sd_name, db_name, SYS_DB_NAME_SZ - 1); - lmm_db.db_pub.sd_fetch = lmm_db_fetch; - lmm_db.db_pub.sd_upsert = lmm_db_upsert; - lmm_db.db_pub.sd_delete = lmm_db_delete; - lmm_db.db_pub.sd_traverse = lmm_db_traverse; - lmm_db.db_pub.sd_tx_begin = lmm_db_tx_begin; - lmm_db.db_pub.sd_tx_end = lmm_db_tx_end; - lmm_db.db_pub.sd_lock = lmm_db_lock; - lmm_db.db_pub.sd_unlock = lmm_db_unlock; - - if (force_create) - lmm_db_unlink(&lmm_db.db_pub); - - rc = access(lmm_db.db_file, F_OK); - if (rc == 0) { - rc = lmm_db_open_create(&lmm_db.db_pub, false); - if (rc) { - D_ERROR("Failed to open sys DB: "DF_RC"\n", DP_RC(rc)); - goto failed; - } - D_DEBUG(DB_IO, "successfully open system DB\n"); - } else { - rc = lmm_db_open_create(&lmm_db.db_pub, true); - if (rc) { - D_ERROR("Failed to create sys DB: "DF_RC"\n", DP_RC(rc)); - goto failed; - } - D_DEBUG(DB_IO, "successfully create system DB\n"); - } - - return 0; - -failed: - lmm_db_fini(); - return rc; -} - -/** Initialize system DB of VOS */ -int -lmm_db_init(const char *db_path) -{ - return lmm_db_init_ex(db_path, NULL, false, false); -} - - -/** Export system DB of VOS */ -struct sys_db * -lmm_db_get(void) -{ - return &lmm_db.db_pub; -} diff --git a/src/common/tests/umem_test.c b/src/common/tests/umem_test.c index 147787bdb8b..8c192f7e892 100644 --- a/src/common/tests/umem_test.c +++ b/src/common/tests/umem_test.c @@ -1,5 +1,5 @@ /** - * (C) Copyright 2019-2021 Intel Corporation. + * (C) Copyright 2019-2023 Intel Corporation. * * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -417,6 +417,7 @@ test_page_cache(void **state) arg->ta_store.stor_size = 46 * 1024 * 1024; arg->ta_store.stor_ops = &stor_ops; + arg->ta_store.store_type = DAOS_MD_BMEM; rc = umem_cache_alloc(&arg->ta_store, 0); assert_rc_equal(rc, 0); diff --git a/src/common/tests/umem_test_bmem.c b/src/common/tests/umem_test_bmem.c index b0076087fbd..b8b78a025aa 100644 --- a/src/common/tests/umem_test_bmem.c +++ b/src/common/tests/umem_test_bmem.c @@ -73,7 +73,8 @@ struct umem_store_ops _store_ops = { .so_wal_submit = _persist_submit, }; -struct umem_store ustore = { .stor_size = POOL_SIZE, .stor_ops = &_store_ops }; +struct umem_store ustore = { .stor_size = POOL_SIZE, .stor_ops = &_store_ops, + .store_type = DAOS_MD_BMEM }; int teardown_pmem(void **state) diff --git a/src/common/tests/utest_common.c b/src/common/tests/utest_common.c index dda9bbb8a7c..472039c74a8 100644 --- a/src/common/tests/utest_common.c +++ b/src/common/tests/utest_common.c @@ -1,5 +1,5 @@ /** - * (C) Copyright 2019-2022 Intel Corporation. + * (C) Copyright 2019-2023 Intel Corporation. 
* * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -69,7 +69,11 @@ utest_pmem_create(const char *name, size_t pool_size, size_t root_size, return -DER_NOMEM; strcpy(ctx->uc_pool_name, name); - ctx->uc_uma.uma_id = UMEM_CLASS_PMEM; + if (store) + ctx->uc_uma.uma_id = umempobj_backend_type2class_id(store->store_type); + else + ctx->uc_uma.uma_id = UMEM_CLASS_PMEM; + ctx->uc_uma.uma_pool = umempobj_create(name, "utest_pool", UMEMPOBJ_ENABLE_STATS, pool_size, 0666, store); diff --git a/src/container/srv_target.c b/src/container/srv_target.c index c21f9885416..b4004822d56 100644 --- a/src/container/srv_target.c +++ b/src/container/srv_target.c @@ -1464,10 +1464,8 @@ ds_cont_local_open(uuid_t pool_uuid, uuid_t cont_hdl_uuid, uuid_t cont_uuid, D_GOTO(err_hdl, rc); hdl->sch_cont = cont; - if (rc == 1) { + if (rc == 1) poh = hdl->sch_cont->sc_pool->spc_hdl; - rc = 0; - } } uuid_copy(hdl->sch_uuid, cont_hdl_uuid); diff --git a/src/control/cmd/daos/acl.go b/src/control/cmd/daos/acl.go index 9085d01e091..d2533f44b47 100644 --- a/src/control/cmd/daos/acl.go +++ b/src/control/cmd/daos/acl.go @@ -384,10 +384,16 @@ func (cmd *containerSetOwnerCmd) Execute(args []string) error { var user *C.char var group *C.char if cmd.User != "" { + if !strings.ContainsRune(cmd.User, '@') { + cmd.User += "@" + } user = C.CString(cmd.User) defer C.free(unsafe.Pointer(user)) } if cmd.Group != "" { + if !strings.ContainsRune(cmd.Group, '@') { + cmd.Group += "@" + } group = C.CString(cmd.Group) defer C.free(unsafe.Pointer(group)) } diff --git a/src/control/cmd/daos_agent/infocache.go b/src/control/cmd/daos_agent/infocache.go index 0eec423bd35..0dbdf4fc645 100644 --- a/src/control/cmd/daos_agent/infocache.go +++ b/src/control/cmd/daos_agent/infocache.go @@ -139,9 +139,6 @@ func (ci *cachedAttachInfo) Refresh(ctx context.Context) error { return errors.New("cachedAttachInfo is nil") } - ci.Lock() - defer ci.Unlock() - req := &control.GetAttachInfoReq{System: ci.system, AllRanks: true} resp, err := ci.fetch(ctx, ci.rpcClient, req) if err != nil { @@ -185,9 +182,6 @@ func (cfi *cachedFabricInfo) Refresh(ctx context.Context) error { return errors.New("cachedFabricInfo is nil") } - cfi.Lock() - defer cfi.Unlock() - results, err := cfi.fetch(ctx) if err != nil { return errors.Wrap(err, "refreshing cached fabric info") @@ -328,7 +322,28 @@ func (c *InfoCache) GetAttachInfo(ctx context.Context, sys string) (*control.Get return nil, errors.Errorf("unexpected attach info data type %T", item) } - return cai.lastResponse, nil + return copyGetAttachInfoResp(cai.lastResponse), nil +} + +func copyGetAttachInfoResp(orig *control.GetAttachInfoResp) *control.GetAttachInfoResp { + if orig == nil { + return nil + } + + cp := new(control.GetAttachInfoResp) + *cp = *orig + + // Copy slices instead of using original pointers + cp.MSRanks = make([]uint32, len(orig.MSRanks)) + _ = copy(cp.MSRanks, orig.MSRanks) + cp.ServiceRanks = make([]*control.PrimaryServiceRank, len(orig.ServiceRanks)) + _ = copy(cp.ServiceRanks, orig.ServiceRanks) + + if orig.ClientNetHint.EnvVars != nil { + cp.ClientNetHint.EnvVars = make([]string, len(orig.ClientNetHint.EnvVars)) + _ = copy(cp.ClientNetHint.EnvVars, orig.ClientNetHint.EnvVars) + } + return cp } func (c *InfoCache) getAttachInfoRemote(ctx context.Context, sys string) (*control.GetAttachInfoResp, error) { diff --git a/src/control/lib/cache/cache.go b/src/control/lib/cache/cache.go index 4ffd98e3fb7..c31d82b0a61 100644 --- a/src/control/lib/cache/cache.go +++ b/src/control/lib/cache/cache.go @@ -94,6 +94,13 @@ 
func (ic *ItemCache) Keys() []string { return nil } + ic.mutex.RLock() + defer ic.mutex.RUnlock() + + return ic.keys() +} + +func (ic *ItemCache) keys() []string { keys := []string{} for k := range ic.items { keys = append(keys, k) @@ -144,13 +151,14 @@ func (ic *ItemCache) GetOrCreate(ctx context.Context, key string, missFn ItemCre ic.set(item) } + item.Lock() if item.NeedsRefresh() { if err := item.Refresh(ctx); err != nil { + item.Unlock() return nil, noopRelease, errors.Wrapf(err, "fetch data for %q", key) } ic.log.Debugf("refreshed item %q", key) } - item.Lock() return item, item.Unlock, nil } @@ -174,13 +182,14 @@ func (ic *ItemCache) Get(ctx context.Context, key string) (Item, func(), error) return nil, noopRelease, err } + item.Lock() if item.NeedsRefresh() { if err := item.Refresh(ctx); err != nil { + item.Unlock() return nil, noopRelease, errors.Wrapf(err, "fetch data for %q", key) } ic.log.Debugf("refreshed item %q", key) } - item.Lock() return item, item.Unlock, nil } @@ -203,18 +212,28 @@ func (ic *ItemCache) Refresh(ctx context.Context, keys ...string) error { defer ic.mutex.Unlock() if len(keys) == 0 { - keys = ic.Keys() + keys = ic.keys() } for _, key := range keys { - item, err := ic.get(key) - if err != nil { + if err := ic.refreshItem(ctx, key); err != nil { return err } + } + return nil +} - if err := item.Refresh(ctx); err != nil { - return errors.Wrapf(err, "failed to refresh cached item %q", item.Key()) - } +func (ic *ItemCache) refreshItem(ctx context.Context, key string) error { + item, err := ic.get(key) + if err != nil { + return err } + + item.Lock() + defer item.Unlock() + if err := item.Refresh(ctx); err != nil { + return errors.Wrapf(err, "failed to refresh cached item %q", item.Key()) + } + return nil } diff --git a/src/control/lib/control/mocks.go b/src/control/lib/control/mocks.go index 1ec98b972bc..71d06131dfe 100644 --- a/src/control/lib/control/mocks.go +++ b/src/control/lib/control/mocks.go @@ -125,10 +125,12 @@ func (mi *MockInvoker) InvokeUnaryRPCAsync(ctx context.Context, uReq UnaryReques ur := mi.cfg.UnaryResponse mi.invokeCountMutex.RLock() if len(mi.cfg.UnaryResponseSet) > mi.invokeCount { + mi.log.Debugf("using configured UnaryResponseSet[%d]", mi.invokeCount) ur = mi.cfg.UnaryResponseSet[mi.invokeCount] } mi.invokeCountMutex.RUnlock() if ur == nil { + mi.log.Debugf("using dummy UnaryResponse") // If the config didn't define a response, just dummy one up for // tests that don't care. 
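The ItemCache changes above take the item lock before testing NeedsRefresh, so a stale check can no longer race a concurrent refresh of the same item. A minimal C analogue of this check-then-refresh-under-lock pattern (hypothetical item type, pthreads; not code from this patch):

```c
#include <pthread.h>
#include <time.h>

struct item {
	pthread_mutex_t lock;
	time_t          expiry; /* refresh when now >= expiry */
	int             value;  /* cached payload */
};

/* Check staleness and refresh while holding the lock, so no other
 * thread can observe or refresh the item in between.
 */
static int
item_get(struct item *it, int *out)
{
	pthread_mutex_lock(&it->lock);
	if (time(NULL) >= it->expiry) {
		it->value  = /* ... fetch fresh data ... */ 0;
		it->expiry = time(NULL) + 60;
	}
	*out = it->value;
	pthread_mutex_unlock(&it->lock);
	return 0;
}
```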
ur = &UnaryResponse{ @@ -140,6 +142,8 @@ func (mi *MockInvoker) InvokeUnaryRPCAsync(ctx context.Context, uReq UnaryReques }, }, } + } else { + mi.log.Debugf("using configured UnaryResponse") } var invokeCount int @@ -148,6 +152,7 @@ func (mi *MockInvoker) InvokeUnaryRPCAsync(ctx context.Context, uReq UnaryReques invokeCount = mi.invokeCount mi.invokeCountMutex.Unlock() go func(invokeCount int) { + mi.log.Debugf("returning mock responses, invokeCount=%d", invokeCount) delayIdx := invokeCount - 1 for idx, hr := range ur.Responses { var delay time.Duration @@ -156,14 +161,17 @@ func (mi *MockInvoker) InvokeUnaryRPCAsync(ctx context.Context, uReq UnaryReques delay = mi.cfg.UnaryResponseDelays[delayIdx][idx] } if delay > 0 { - time.Sleep(delay) + mi.log.Debugf("delaying mock response for %s", delay) + select { + case <-time.After(delay): + case <-ctx.Done(): + mi.log.Debugf("context canceled on iteration %d (error=%s)", idx, ctx.Err().Error()) + return + } } - select { - case <-ctx.Done(): - return - case responses <- hr: - } + mi.log.Debug("sending mock response") + responses <- hr } close(responses) }(invokeCount) diff --git a/src/control/lib/control/system_test.go b/src/control/lib/control/system_test.go index 4597439489a..67a6ddd3bea 100644 --- a/src/control/lib/control/system_test.go +++ b/src/control/lib/control/system_test.go @@ -1455,7 +1455,7 @@ func TestControl_SystemJoin_Timeouts(t *testing.T) { }, "inner context is canceled; request is retried": { mic: &MockInvokerConfig{ - ReqTimeout: 100 * time.Millisecond, // outer timeout + ReqTimeout: 500 * time.Millisecond, // outer timeout RetryTimeout: 10 * time.Millisecond, // inner timeout UnaryResponseSet: []*UnaryResponse{ { diff --git a/src/control/server/server.go b/src/control/server/server.go index 356c6311bd3..8e5d921bb57 100644 --- a/src/control/server/server.go +++ b/src/control/server/server.go @@ -85,12 +85,7 @@ func processFabricProvider(cfg *config.Server) { } func shouldAppendRXM(provider string) bool { - for _, rxmProv := range []string{"ofi+verbs", "ofi+tcp"} { - if rxmProv == provider { - return true - } - } - return false + return provider == "ofi+verbs" } // server struct contains state and components of DAOS Server. 
diff --git a/src/control/server/server_utils_test.go b/src/control/server/server_utils_test.go index 80a89cbd0c8..069fac5e028 100644 --- a/src/control/server/server_utils_test.go +++ b/src/control/server/server_utils_test.go @@ -1247,3 +1247,43 @@ func TestServerUtils_getControlAddr(t *testing.T) { }) } } + +func TestServer_processFabricProvider(t *testing.T) { + for name, tc := range map[string]struct { + cfgFabric string + expFabric string + }{ + "ofi+verbs": { + cfgFabric: "ofi+verbs", + expFabric: "ofi+verbs;ofi_rxm", + }, + "ofi+verbs;ofi_rxm": { + cfgFabric: "ofi+verbs;ofi_rxm", + expFabric: "ofi+verbs;ofi_rxm", + }, + "ofi+tcp": { + cfgFabric: "ofi+tcp", + expFabric: "ofi+tcp", + }, + "ofi+tcp;ofi_rxm": { + cfgFabric: "ofi+tcp;ofi_rxm", + expFabric: "ofi+tcp;ofi_rxm", + }, + "ucx": { + cfgFabric: "ucx+ud", + expFabric: "ucx+ud", + }, + } { + t.Run(name, func(t *testing.T) { + cfg := &config.Server{ + Fabric: engine.FabricConfig{ + Provider: tc.cfgFabric, + }, + } + + processFabricProvider(cfg) + + test.AssertEqual(t, tc.expFabric, cfg.Fabric.Provider, "") + }) + } +} diff --git a/src/engine/srv.c b/src/engine/srv.c index bbf3734af63..affd7d7af15 100644 --- a/src/engine/srv.c +++ b/src/engine/srv.c @@ -120,7 +120,6 @@ struct dss_xstream_data { /** barrier for all ULTs to enter handling loop */ ABT_cond xd_ult_barrier; ABT_mutex xd_mutex; - struct dss_thread_local_storage *xd_dtc; }; static struct dss_xstream_data xstream_data; @@ -1196,13 +1195,7 @@ enum { static void dss_sys_db_fini(void) { - - if (!bio_nvme_configured(SMD_DEV_TYPE_META)) { - vos_db_fini(); - return; - } - - lmm_db_fini(); + vos_db_fini(); } /** @@ -1228,7 +1221,7 @@ dss_srv_fini(bool force) dss_sys_db_fini(); /* fall through */ case XD_INIT_TLS_INIT: - dss_tls_fini(xstream_data.xd_dtc); + vos_standalone_tls_fini(); /* fall through */ case XD_INIT_TLS_REG: pthread_key_delete(dss_tls_key); @@ -1254,18 +1247,11 @@ static int dss_sys_db_init() { int rc; - char *lmm_db_path = NULL; + char *sys_db_path = NULL; char *nvme_conf_path = NULL; - if (!bio_nvme_configured(SMD_DEV_TYPE_META)) { - rc = vos_db_init(dss_storage_path); - if (rc) - return rc; - rc = smd_init(vos_db_get()); - if (rc) - vos_db_fini(); - return rc; - } + if (!bio_nvme_configured(SMD_DEV_TYPE_META)) + goto db_init; if (dss_nvme_conf == NULL) { D_ERROR("nvme conf path not set\n"); @@ -1275,21 +1261,21 @@ dss_sys_db_init() D_STRNDUP(nvme_conf_path, dss_nvme_conf, PATH_MAX); if (nvme_conf_path == NULL) return -DER_NOMEM; - D_STRNDUP(lmm_db_path, dirname(nvme_conf_path), PATH_MAX); + D_STRNDUP(sys_db_path, dirname(nvme_conf_path), PATH_MAX); D_FREE(nvme_conf_path); - if (lmm_db_path == NULL) { + if (sys_db_path == NULL) return -DER_NOMEM; - } - rc = lmm_db_init(lmm_db_path); +db_init: + rc = vos_db_init(bio_nvme_configured(SMD_DEV_TYPE_META) ? 
sys_db_path : dss_storage_path); if (rc) goto out; - rc = smd_init(lmm_db_get()); + rc = smd_init(vos_db_get()); if (rc) - lmm_db_fini(); + vos_db_fini(); out: - D_FREE(lmm_db_path); + D_FREE(sys_db_path); return rc; } @@ -1344,8 +1330,8 @@ dss_srv_init(void) xstream_data.xd_init_step = XD_INIT_TLS_REG; /* initialize xstream-local storage */ - xstream_data.xd_dtc = dss_tls_init(DAOS_SERVER_TAG - DAOS_TGT_TAG, 0, -1); - if (!xstream_data.xd_dtc) { + rc = vos_standalone_tls_init(DAOS_SERVER_TAG - DAOS_TGT_TAG); + if (rc) { D_ERROR("Not enough DRAM to initialize XS local storage.\n"); D_GOTO(failed, rc = -DER_NOMEM); } @@ -1387,10 +1373,16 @@ dss_srv_init(void) return rc; } +bool +dss_srv_shutting_down(void) +{ + return dss_get_module_info()->dmi_srv_shutting_down; +} + static void set_draining(void *arg) { - dss_get_module_info()->dmi_srv_shutting_down = 1; + dss_get_module_info()->dmi_srv_shutting_down = true; } /* @@ -1416,8 +1408,6 @@ dss_srv_set_shutting_down(void) rc = ABT_task_free(&task); D_ASSERTF(rc == ABT_SUCCESS, "join task: %d\n", rc); } - - dss_get_module_info()->dmi_srv_shutting_down = 1; } void diff --git a/src/engine/srv_internal.h b/src/engine/srv_internal.h index 319290b74b8..9a4b3941319 100644 --- a/src/engine/srv_internal.h +++ b/src/engine/srv_internal.h @@ -212,14 +212,15 @@ void sched_stop(struct dss_xstream *dx); static inline bool sched_xstream_stopping(void) { - struct dss_xstream *dx = dss_current_xstream(); + struct dss_xstream *dx; ABT_bool state; int rc; /* ULT creation from main thread which doesn't have dss_xstream */ - if (dx == NULL) + if (dss_tls_get() == NULL) return false; + dx = dss_current_xstream(); rc = ABT_future_test(dx->dx_stopping, &state); D_ASSERTF(rc == ABT_SUCCESS, "%d\n", rc); return state == ABT_TRUE; @@ -251,7 +252,8 @@ static inline void dss_free_stack_cb(void *arg) { mmap_stack_desc_t *desc = (mmap_stack_desc_t *)arg; - struct dss_xstream *dx = dss_current_xstream(); + /* main thread doesn't have TLS and XS */ + struct dss_xstream *dx = dss_tls_get() ? dss_current_xstream() : NULL; /* ensure pool where to free stack is from current-XStream/ULT-exiting */ if (dx != NULL) @@ -271,7 +273,11 @@ sched_create_thread(struct dss_xstream *dx, void (*func)(void *), void *arg, struct sched_info *info = &dx->dx_sched_info; int rc; #ifdef ULT_MMAP_STACK - struct dss_xstream *cur_dx = dss_current_xstream(); + bool tls_set = dss_tls_get() ? 
true : false; + struct dss_xstream *cur_dx = NULL; + + if (tls_set) + cur_dx = dss_current_xstream(); /* if possible,stack should be allocated from launching XStream pool */ if (cur_dx == NULL) diff --git a/src/include/cart/api.h b/src/include/cart/api.h index b7ee1f82064..fc6d34c41fb 100644 --- a/src/include/cart/api.h +++ b/src/include/cart/api.h @@ -429,6 +429,18 @@ crt_req_dst_rank_get(crt_rpc_t *req, d_rank_t *rank); int crt_req_dst_tag_get(crt_rpc_t *req, uint32_t *tag); +/** + * Return source timeout in seconds + * + * \param[in] req Pointer to RPC request + * \param[out] timeout Returned timeout + * + * \return DER_SUCCESS on success or error + * on failure + */ +int +crt_req_src_timeout_get(crt_rpc_t *rpc, uint16_t *timeout); + /** * Return reply buffer * diff --git a/src/include/daos/mem.h b/src/include/daos/mem.h index 3513b4557e2..ebf3be12cdd 100644 --- a/src/include/daos/mem.h +++ b/src/include/daos/mem.h @@ -27,9 +27,24 @@ int umempobj_settings_init(bool md_on_ssd); +/* convert backend type to umem class id */ +int umempobj_backend_type2class_id(int backend); + /* umem persistent object property flags */ #define UMEMPOBJ_ENABLE_STATS 0x1 +#ifdef DAOS_PMEM_BUILD +enum { + DAOS_MD_PMEM = 0, + DAOS_MD_BMEM = 1, + DAOS_MD_ADMEM = 2, +}; + +/* return umem backend type */ +int umempobj_get_backend_type(void); + +#endif + struct umem_wal_tx; struct umem_wal_tx_ops { @@ -134,6 +149,10 @@ struct umem_store { * the storage device. */ struct umem_store_ops *stor_ops; + /* backend type */ + int store_type; + /* standalone store */ + bool store_standalone; }; struct umem_slab_desc { diff --git a/src/include/daos/sys_db.h b/src/include/daos/sys_db.h index acc3fe6f578..850e9c4203a 100644 --- a/src/include/daos/sys_db.h +++ b/src/include/daos/sys_db.h @@ -43,11 +43,4 @@ struct sys_db { void (*sd_unlock)(struct sys_db *db); }; -/* for lmdb backend apis */ -int lmm_db_init(const char *db_path); -int lmm_db_init_ex(const char *db_path, const char *db_name, - bool force_create, bool destroy_db_on_fini); -void lmm_db_fini(void); -struct sys_db *lmm_db_get(void); - #endif /* __SYS_DB_H__ */ diff --git a/src/include/daos_srv/daos_engine.h b/src/include/daos_srv/daos_engine.h index 6704f7e5c77..90d35300b40 100644 --- a/src/include/daos_srv/daos_engine.h +++ b/src/include/daos_srv/daos_engine.h @@ -192,11 +192,8 @@ dss_current_xstream(void) * finish entering shutdown mode (i.e., any dss_srv_set_shutting_down call * won't return). */ -static inline bool -dss_srv_shutting_down(void) -{ - return dss_get_module_info()->dmi_srv_shutting_down; -} +bool +dss_srv_shutting_down(void); /** * Module facility feature bits diff --git a/src/include/daos_srv/pool.h b/src/include/daos_srv/pool.h index 25068b08033..6f9d2ef781e 100644 --- a/src/include/daos_srv/pool.h +++ b/src/include/daos_srv/pool.h @@ -78,6 +78,8 @@ struct ds_pool { uint32_t sp_rebuild_gen; int sp_reintegrating; + + int sp_discard_status; /** path to ephemeral metrics */ char sp_path[D_TM_MAX_NAME_LEN]; diff --git a/src/include/daos_srv/vos.h b/src/include/daos_srv/vos.h index ad5e635a7bc..3b6078b21d6 100644 --- a/src/include/daos_srv/vos.h +++ b/src/include/daos_srv/vos.h @@ -1253,14 +1253,6 @@ enum vos_cont_opc { int vos_cont_ctl(daos_handle_t coh, enum vos_cont_opc opc); -/** - * Profile the VOS operation in standalone vos mode. 
- **/ -int -vos_profile_start(char *path, int avg); -void -vos_profile_stop(void); - uint64_t vos_get_io_size(daos_handle_t ioh); @@ -1274,6 +1266,11 @@ int vos_dedup_verify(daos_handle_t ioh); struct sys_db *vos_db_get(void); + +/* return sysdb pool uuid */ +uuid_t * +vos_db_pool_uuid(void); + /** * Create the system DB in VOS * System DB is KV store that can support insert/delete/traverse @@ -1488,4 +1485,17 @@ vos_obj_key2anchor(daos_handle_t coh, daos_unit_oid_t oid, daos_key_t *dkey, dao int vos_obj_layout_upgrade(daos_handle_t hdl, daos_unit_oid_t oid, uint32_t layout_ver); +/** + * Init standalone VOS TLS. + * \param[in] tags + */ +int +vos_standalone_tls_init(int tags); + +/** + * Finish standalone VOS TLS. + */ +void +vos_standalone_tls_fini(void); + #endif /* __VOS_API_H */ diff --git a/src/include/daos_srv/vos_types.h b/src/include/daos_srv/vos_types.h index 5eca027ad06..5bccb2d2ef2 100644 --- a/src/include/daos_srv/vos_types.h +++ b/src/include/daos_srv/vos_types.h @@ -94,6 +94,8 @@ enum vos_pool_open_flags { VOS_POF_EXTERNAL_FLUSH = (1 << 3), /** RDB pool */ VOS_POF_RDB = (1 << 4), + /** SYS DB pool */ + VOS_POF_SYSDB = (1 << 5), }; enum vos_oi_attr { @@ -395,9 +397,7 @@ typedef int (*vos_iter_filter_cb_t)(daos_handle_t ih, vos_iter_desc_t *desc, * Parameters for initializing VOS iterator */ typedef struct { - /** standalone prepare: pool connection handle or container open handle - * nested prepare: DAOS_HDL_INVAL - */ + /** pool connection handle or container open handle */ daos_handle_t ip_hdl; /** standalone prepare: DAOS_HDL_INVAL * nested prepare: parent iterator handle diff --git a/src/include/daos_task.h b/src/include/daos_task.h index e832c9082a1..e6e00af1f7a 100644 --- a/src/include/daos_task.h +++ b/src/include/daos_task.h @@ -843,19 +843,15 @@ typedef daos_obj_list_t daos_obj_list_recx_t; */ typedef daos_obj_list_t daos_obj_list_obj_t; -/** daos_obj_key2anchor args */ -typedef struct { - /** Object open handle */ - daos_handle_t oh; - /** Transaction open handle. */ - daos_handle_t th; - /** Distribution key. */ - daos_key_t *dkey; - /** Attribute key. */ - daos_key_t *akey; - /** Anchor to set */ - daos_anchor_t *anchor; -} daos_obj_key2anchor_t; +/** + * parameter subset for list_obj - + * daos_handle_t oh; + * daos_handle_t th; + * daos_key_t *dkey; + * daos_key_t *akey; + * daos_anchor_t *anchor; + */ +typedef daos_obj_list_t daos_obj_key2anchor_t; /** Array create args */ typedef struct { diff --git a/src/mgmt/srv_pool.c b/src/mgmt/srv_pool.c index 7e6b0782b73..851cedfcfad 100644 --- a/src/mgmt/srv_pool.c +++ b/src/mgmt/srv_pool.c @@ -1,5 +1,5 @@ /* - * (C) Copyright 2016-2022 Intel Corporation. + * (C) Copyright 2016-2023 Intel Corporation. 
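The srv_pool.c change below adds pool_create_rpc_timeout(), which widens the collective pool-create RPC timeout with the requested SCM size and never drops below the engine's default. Illustrative arithmetic for its size brackets (hypothetical sizes, reusing the tc_req handle from the surrounding code):

```c
/* hypothetical sizes; the brackets match pool_create_rpc_timeout() below */
uint32_t t1 = pool_create_rpc_timeout(tc_req, (size_t)16 << 30);  /* <32 GiB:   max(15, default) */
uint32_t t2 = pool_create_rpc_timeout(tc_req, (size_t)48 << 30);  /* <64 GiB:   max(30, default) */
uint32_t t3 = pool_create_rpc_timeout(tc_req, (size_t)200 << 30); /* >=128 GiB: max(90, default) */
```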
* * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -58,6 +58,29 @@ ds_mgmt_tgt_pool_destroy_ranks(uuid_t pool_uuid, d_rank_list_t *filter_ranks) return rc; } +static uint32_t +pool_create_rpc_timeout(crt_rpc_t *tc_req, size_t scm_size) +{ + uint32_t timeout; + uint32_t default_timeout; + size_t gib; + int rc = crt_req_get_timeout(tc_req, &default_timeout); + + D_ASSERTF(rc == 0, "crt_req_get_timeout: "DF_RC"\n", DP_RC(rc)); + + gib = scm_size / ((size_t)1024 * 1024 * 1024); + if (gib < 32) + timeout = 15; + else if (gib < 64) + timeout = 30; + else if (gib < 128) + timeout = 60; + else + timeout = 90; + + return max(timeout, default_timeout); +} + static int ds_mgmt_tgt_pool_create_ranks(uuid_t pool_uuid, char *tgt_dev, d_rank_list_t *rank_list, size_t scm_size, size_t nvme_size) @@ -69,6 +92,7 @@ ds_mgmt_tgt_pool_create_ranks(uuid_t pool_uuid, char *tgt_dev, d_rank_list_t *ra int topo; int rc; int rc_cleanup; + uint32_t timeout; /* Collective RPC to all of targets of the pool */ topo = crt_tree_topo(CRT_TREE_KNOMIAL, 4); @@ -83,6 +107,10 @@ ds_mgmt_tgt_pool_create_ranks(uuid_t pool_uuid, char *tgt_dev, d_rank_list_t *ra return rc; } + timeout = pool_create_rpc_timeout(tc_req, scm_size); + crt_req_set_timeout(tc_req, timeout); + D_DEBUG(DB_MGMT, DF_UUID": pool create RPC timeout: %u\n", + DP_UUID(pool_uuid), timeout); tc_in = crt_req_get(tc_req); D_ASSERT(tc_in != NULL); uuid_copy(tc_in->tc_pool_uuid, pool_uuid); diff --git a/src/object/cli_obj.c b/src/object/cli_obj.c index c36306ba8c8..3bec18327a4 100644 --- a/src/object/cli_obj.c +++ b/src/object/cli_obj.c @@ -3782,6 +3782,7 @@ obj_get_sub_anchors(daos_obj_list_t *obj_args, int opc) case DAOS_OBJ_AKEY_RPC_ENUMERATE: return (struct shard_anchors *)obj_args->akey_anchor->da_sub_anchors; case DAOS_OBJ_RECX_RPC_ENUMERATE: + case DAOS_OBJ_RPC_KEY2ANCHOR: return (struct shard_anchors *)obj_args->anchor->da_sub_anchors; } return NULL; @@ -3799,6 +3800,7 @@ obj_set_sub_anchors(daos_obj_list_t *obj_args, int opc, struct shard_anchors *an obj_args->akey_anchor->da_sub_anchors = (uint64_t)anchors; break; case DAOS_OBJ_RECX_RPC_ENUMERATE: + case DAOS_OBJ_RPC_KEY2ANCHOR: obj_args->anchor->da_sub_anchors = (uint64_t)anchors; break; } @@ -3933,10 +3935,8 @@ sub_anchors_is_eof(struct shard_anchors *sub_anchors) daos_anchor_t *sub_anchor; sub_anchor = &sub_anchors->sa_anchors[i].ssa_anchor; - if (!daos_anchor_is_eof(sub_anchor)) { - D_DEBUG(DB_TRACE, "sub anchor %d not eof\n", i); + if (!daos_anchor_is_eof(sub_anchor)) break; - } } return i == sub_anchors->sa_anchors_nr; @@ -4170,28 +4170,6 @@ obj_list_comp(struct obj_auxi_args *obj_auxi, return 0; } -static int -k2a_update_sub_anchor_cb(tse_task_t *shard_task, struct shard_auxi_args *shard_auxi, - struct obj_auxi_args *obj_auxi, void *cb_arg) -{ - tse_task_t *task = obj_auxi->obj_task; - daos_obj_key2anchor_t *obj_arg = dc_task_get_args(task); - struct shard_k2a_args *shard_arg; - int shard; - - shard_arg = container_of(shard_auxi, struct shard_k2a_args, ka_auxi); - shard = shard_auxi->shard % obj_get_grp_size(obj_auxi->obj); - shard = obj_ec_shard_off(obj_auxi->obj, obj_auxi->dkey_hash, shard); - if (obj_arg->anchor && obj_arg->anchor->da_sub_anchors) { - struct shard_anchors *sub_anchors; - - sub_anchors = (struct shard_anchors *)obj_arg->anchor->da_sub_anchors; - memcpy(&sub_anchors->sa_anchors[shard].ssa_anchor, - shard_arg->ka_anchor, sizeof(daos_anchor_t)); - } - return 0; -} - static int obj_comp_cb_internal(struct obj_auxi_args *obj_auxi) { @@ -4224,7 +4202,6 @@ obj_comp_cb_internal(struct 
obj_auxi_args *obj_auxi) DP_OID(obj_auxi->obj->cob_md.omd_id), rc); obj_auxi->io_retry = 1; } - D_DEBUG(DB_TRACE, "exit %d\n", rc); D_GOTO(out, rc); } @@ -4232,14 +4209,22 @@ obj_comp_cb_internal(struct obj_auxi_args *obj_auxi) rc = obj_list_comp(obj_auxi, &iter_arg); else if (obj_auxi->opc == DAOS_OBJ_RPC_KEY2ANCHOR) { daos_obj_key2anchor_t *obj_arg = dc_task_get_args(obj_auxi->obj_task); + int grp_idx; - if (!obj_arg->anchor->da_sub_anchors || !obj_is_ec(obj_auxi->obj)) - goto out; - rc = obj_auxi_shards_iterate(obj_auxi, k2a_update_sub_anchor_cb, NULL); - if (rc) - D_GOTO(out, rc); - } + grp_idx = obj_dkey2grpidx(obj_auxi->obj, obj_auxi->dkey_hash, + obj_auxi->map_ver_req); + D_ASSERTF(grp_idx >= 0, "grp_idx %d obj_auxi->map_ver_req %u", + grp_idx, obj_auxi->map_ver_req); + obj_arg->anchor->da_shard = grp_idx * obj_get_grp_size(obj_auxi->obj); + sub_anchors = (struct shard_anchors *)obj_arg->anchor->da_sub_anchors; + if (sub_anchors) { + if (sub_anchors_is_eof(sub_anchors)) + daos_anchor_set_eof(obj_arg->anchor); + else + daos_anchor_set_zero(obj_arg->anchor); + } + } out: if (sub_anchors == NULL && obj_is_enum_opc(obj_auxi->opc)) merged_list_free(&merged_list, obj_auxi->opc); @@ -5791,103 +5776,158 @@ dc_obj_update_task(tse_task_t *task) } static int -shard_anchors_check_alloc_bufs(struct obj_auxi_args *obj_auxi, - struct shard_anchors *sub_anchors, - int shards_nr, int nr, daos_size_t buf_size) +daos_shard_tgt_lookup(struct daos_shard_tgt *tgts, int tgt_nr, uint32_t shard) { - struct obj_req_tgts *req_tgts = &obj_auxi->req_tgts; - struct shard_sub_anchor *sub_anchor; - daos_obj_list_t *obj_args; - int rc = 0; + int i; + + for (i = 0; i < tgt_nr; i++) { + if (tgts[i].st_shard == shard) + return i; + } + + return -1; +} + +/* + * Check if any sub anchor enumeration reach EOF, then set them to IGNORE_RANK, so + * to avoid send more RPC. + */ +static int +shard_anchors_eof_check(struct obj_auxi_args *obj_auxi, struct shard_anchors *sub_anchors) +{ + struct daos_shard_tgt *shard_tgts = obj_auxi->req_tgts.ort_shard_tgts; + uint32_t tgt_nr = obj_auxi->req_tgts.ort_grp_nr * + obj_auxi->req_tgts.ort_grp_size; + int shards_nr = sub_anchors->sa_anchors_nr; int i; - D_ASSERT(nr > 0); - obj_args = dc_task_get_args(obj_auxi->obj_task); - for (i = 0; i < shards_nr && obj_args->sgl != NULL; i++) { - d_sg_list_t *sgl; + /* + * To avoid complexity of post sgl merge(see obj_shard_list_obj_cb()) and following + * rebuild process, let's skip shard eof check for object enumeration, i.e. always + * enumerate even for eof shard. + */ + if (obj_auxi->opc == DAOS_OBJ_RPC_ENUMERATE) { + if (tgt_nr != shards_nr) { + D_ERROR(DF_OID" shards_nr %u tgt_nr %u: "DF_RC"\n", + DP_OID(obj_auxi->obj->cob_md.omd_id), shards_nr, tgt_nr, + DP_RC(-DER_IO)); + return -DER_IO; + } + return 0; + } - sub_anchor = &sub_anchors->sa_anchors[i]; + /* Check if any shards reach their EOF */ + D_ASSERT(sub_anchors != NULL); + for (i = 0; i < shards_nr; i++) { + struct shard_sub_anchor *sub_anchor; + sub_anchor = &sub_anchors->sa_anchors[i]; /* - * check if sg_iovs needs to be re-allocated, since it may - * reallocate sgl with REC2BIG. + * If the shard from sub_anchors does not exist in forward tgts(obj_auxi->req_tgts) + * anymore, then it means the shard become invalid, i.e. we do not need enumerate + * from this shard anymore, so set it to eof. 
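+		 * (For example, if a target is excluded while the enumeration is in
+		 * flight, its shard drops out of req_tgts; marking the orphaned sub
+		 * anchor EOF stops any further enumeration RPC to it.)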
*/ - sub_anchor->ssa_shard = req_tgts->ort_shard_tgts[i].st_shard; - if (obj_auxi->opc != DAOS_OBJ_RPC_ENUMERATE && - daos_anchor_is_eof(&sub_anchor->ssa_anchor)) { - if (sub_anchor->ssa_sgl.sg_iovs) - d_sgl_fini(&sub_anchor->ssa_sgl, true); - req_tgts->ort_shard_tgts[i].st_rank = DAOS_TGT_IGNORE; + if (daos_shard_tgt_lookup(shard_tgts, tgt_nr, sub_anchor->ssa_shard) == -1) { + D_DEBUG(DB_IO, DF_OID" set anchor eof %d/%d/%u\n", + DP_OID(obj_auxi->obj->cob_md.omd_id), i, shards_nr, + sub_anchor->ssa_shard); + daos_anchor_set_eof(&sub_anchor->ssa_anchor); continue; } - if (obj_shard_is_invalid(obj_auxi->obj, sub_anchor->ssa_shard, - DAOS_OBJ_RPC_ENUMERATE)) { - daos_anchor_set_eof(&sub_anchor->ssa_anchor); + if (daos_anchor_is_eof(&sub_anchor->ssa_anchor)) { if (sub_anchor->ssa_sgl.sg_iovs) d_sgl_fini(&sub_anchor->ssa_sgl, true); + if (sub_anchor->ssa_recxs != NULL) + D_FREE(sub_anchor->ssa_recxs); + if (sub_anchor->ssa_kds) + D_FREE(sub_anchor->ssa_kds); + D_DEBUG(DB_IO, DF_OID" anchor eof %d/%d/%u\n", + DP_OID(obj_auxi->obj->cob_md.omd_id), i, shards_nr, + sub_anchor->ssa_shard); + shard_tgts[i].st_rank = DAOS_TGT_IGNORE; continue; } + } - if (sub_anchor->ssa_sgl.sg_iovs) { - if (sub_anchor->ssa_sgl.sg_iovs->iov_buf_len == buf_size) - continue; - d_sgl_fini(&sub_anchor->ssa_sgl, true); - } + if (tgt_nr <= shards_nr) + return 0; - rc = d_sgl_init(&sub_anchor->ssa_sgl, 1); - if (rc) - D_GOTO(out, rc); + /* More shards are added during enumeration, though to keep the anchor, let's + * ignore those new added shards */ + D_DEBUG(DB_IO, DF_OID" shards %u tgt_nr %u ignore tgts not in sub_anchors\n", + DP_OID(obj_auxi->obj->cob_md.omd_id), shards_nr, tgt_nr); - sgl = &sub_anchor->ssa_sgl; - rc = daos_iov_alloc(&sgl->sg_iovs[0], buf_size, false); - if (rc) - D_GOTO(out, rc); + for (i = 0; i < tgt_nr; i++) { + struct daos_shard_tgt *tgt = &shard_tgts[i]; + + if (shard_anchor_lookup(sub_anchors, tgt->st_shard) == -1) + shard_tgts[i].st_rank = DAOS_TGT_IGNORE; } - for (i = 0; i < shards_nr && obj_args->kds != NULL; i++) { + return 0; +} + +static int +shard_anchors_check_alloc_bufs(struct obj_auxi_args *obj_auxi, struct shard_anchors *sub_anchors, + int nr, daos_size_t buf_size) +{ + struct obj_req_tgts *req_tgts = &obj_auxi->req_tgts; + int shards_nr = sub_anchors->sa_anchors_nr; + struct shard_sub_anchor *sub_anchor; + daos_obj_list_t *obj_args; + int rc = 0; + int i; + + obj_args = dc_task_get_args(obj_auxi->obj_task); + for (i = 0; i < shards_nr; i++) { sub_anchor = &sub_anchors->sa_anchors[i]; + if (sub_anchor->ssa_shard == (uint32_t)(-1)) + sub_anchor->ssa_shard = req_tgts->ort_shard_tgts[i].st_shard; - sub_anchor->ssa_shard = req_tgts->ort_shard_tgts[i].st_shard; - if (obj_auxi->opc != DAOS_OBJ_RPC_ENUMERATE && - daos_anchor_is_eof(&sub_anchor->ssa_anchor)) { - if (sub_anchor->ssa_kds) - D_FREE(sub_anchor->ssa_kds); - req_tgts->ort_shard_tgts[i].st_rank = DAOS_TGT_IGNORE; + if (daos_anchor_is_eof(&sub_anchor->ssa_anchor)) continue; - } - if (sub_anchor->ssa_kds != NULL) { - if (sub_anchors->sa_nr == nr) - continue; - D_FREE(sub_anchor->ssa_kds); - } + if (obj_args->sgl != NULL) { + if (sub_anchor->ssa_sgl.sg_iovs && + sub_anchor->ssa_sgl.sg_iovs->iov_buf_len != buf_size) + d_sgl_fini(&sub_anchor->ssa_sgl, true); - D_ALLOC_ARRAY(sub_anchor->ssa_kds, nr); - if (sub_anchor->ssa_kds == NULL) - D_GOTO(out, rc = -DER_NOMEM); - } + if (sub_anchor->ssa_sgl.sg_iovs == NULL) { + d_sg_list_t *sgl; - for (i = 0; i < shards_nr && obj_args->recxs != NULL; i++) { - sub_anchor = &sub_anchors->sa_anchors[i]; + 
rc = d_sgl_init(&sub_anchor->ssa_sgl, 1); + if (rc) + D_GOTO(out, rc); - sub_anchor->ssa_shard = req_tgts->ort_shard_tgts[i].st_shard; - if (obj_auxi->opc != DAOS_OBJ_RPC_ENUMERATE && - daos_anchor_is_eof(&sub_anchor->ssa_anchor)) { - if (sub_anchor->ssa_recxs != NULL) - D_FREE(sub_anchor->ssa_recxs); - req_tgts->ort_shard_tgts[i].st_rank = DAOS_TGT_IGNORE; + sgl = &sub_anchor->ssa_sgl; + rc = daos_iov_alloc(&sgl->sg_iovs[0], buf_size, false); + if (rc) + D_GOTO(out, rc); + } } - if (sub_anchor->ssa_recxs != NULL) { - if (sub_anchors->sa_nr == nr) - continue; - D_FREE(sub_anchor->ssa_recxs); + if (obj_args->kds != NULL) { + if (sub_anchor->ssa_kds != NULL && sub_anchors->sa_nr != nr) + D_FREE(sub_anchor->ssa_kds); + + if (sub_anchor->ssa_kds == NULL) { + D_ALLOC_ARRAY(sub_anchor->ssa_kds, nr); + if (sub_anchor->ssa_kds == NULL) + D_GOTO(out, rc = -DER_NOMEM); + } } - D_ALLOC_ARRAY(sub_anchor->ssa_recxs, nr); - if (sub_anchor->ssa_recxs == NULL) - D_GOTO(out, rc = -DER_NOMEM); + if (obj_args->recxs != NULL) { + if (sub_anchor->ssa_recxs != NULL && sub_anchors->sa_nr == nr) + D_FREE(sub_anchor->ssa_recxs); + + if (sub_anchor->ssa_recxs == NULL) { + D_ALLOC_ARRAY(sub_anchor->ssa_recxs, nr); + if (sub_anchor->ssa_recxs == NULL) + D_GOTO(out, rc = -DER_NOMEM); + } + } } sub_anchors->sa_nr = nr; @@ -5908,27 +5948,12 @@ shard_anchors_alloc(struct obj_auxi_args *obj_auxi, int shards_nr, int nr, if (sub_anchors == NULL) return NULL; + for (i = 0; i < shards_nr; i++) + sub_anchors->sa_anchors[i].ssa_shard = -1; + D_INIT_LIST_HEAD(&sub_anchors->sa_merged_list); sub_anchors->sa_anchors_nr = shards_nr; - if (obj_auxi->opc == DAOS_OBJ_RPC_KEY2ANCHOR) { - for (i = 0; i < shards_nr; i++) { - d_sg_list_t *sgl; - struct shard_sub_anchor *sub_anchor; - - sub_anchor = &sub_anchors->sa_anchors[i]; - rc = d_sgl_init(&sub_anchor->ssa_sgl, 1); - if (rc) - D_GOTO(out, rc); - - sgl = &sub_anchor->ssa_sgl; - rc = daos_iov_alloc(&sgl->sg_iovs[0], buf_size, false); - if (rc) - D_GOTO(out, rc); - } - return sub_anchors; - } - - rc = shard_anchors_check_alloc_bufs(obj_auxi, sub_anchors, shards_nr, nr, buf_size); + rc = shard_anchors_check_alloc_bufs(obj_auxi, sub_anchors, nr, buf_size); if (rc) D_GOTO(out, rc); @@ -5955,7 +5980,7 @@ shard_anchors_alloc(struct obj_auxi_args *obj_auxi, int shards_nr, int nr, * For migrate enumeration(OBJ_RPC_ENUMERATE), all 3 sub anchors(ssa_anchors, ssa_recx_anchors, * ssa_akey_anchors) will be attached to obj_args->dkey_anchors, i.e. anchors and akey_anchors * are "useless" here. - * Though for normal migrate enumeration (no sub anchors), anchors/dkey_anchors/akey_anchors + * Though for normal enumeration (no sub anchors), anchors/dkey_anchors/akey_anchors * will all be used. 
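The rewritten shard_anchors_check_alloc_bufs() applies one reuse-or-realloc rule per buffer kind: an existing per-shard buffer survives only while its recorded size still matches the request. A generic sketch of that rule; `reuse_or_realloc` is a hypothetical helper, with plain calloc()/free() standing in for D_ALLOC_ARRAY/D_FREE:

```c
#include <stdlib.h>

/*
 * Reuse-or-realloc policy distilled from shard_anchors_check_alloc_bufs():
 * keep the buffer while its element count matches, otherwise free and
 * reallocate. Returns -1 where DAOS would return -DER_NOMEM.
 */
static int
reuse_or_realloc(void **buf, size_t *have_nr, size_t want_nr, size_t elem_size)
{
	if (*buf != NULL && *have_nr != want_nr) {
		free(*buf);
		*buf = NULL;
	}
	if (*buf == NULL) {
		*buf = calloc(want_nr, elem_size);
		if (*buf == NULL)
			return -1;
		*have_nr = want_nr;
	}
	return 0;
}
```

By this rule the kds and recxs branches would both be expected to test `sub_anchors->sa_nr != nr` before freeing; the sgl branch follows the same shape but keys off iov_buf_len versus the requested buf_size instead of an entry count.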
*/ static int @@ -5963,11 +5988,12 @@ sub_anchors_prep(struct obj_auxi_args *obj_auxi, int shards_nr) { daos_obj_list_t *obj_args; struct shard_anchors *sub_anchors; - int nr; + int nr = 0; daos_size_t buf_size; obj_args = dc_task_get_args(obj_auxi->obj_task); - nr = *obj_args->nr; + if (obj_args->nr != NULL) + nr = *obj_args->nr; buf_size = daos_sgl_buf_size(obj_args->sgl); if (obj_auxi->opc == DAOS_OBJ_RPC_ENUMERATE) { D_ASSERTF(nr >= shards_nr, "nr %d shards_nr %d\n", nr, shards_nr); @@ -5977,10 +6003,16 @@ sub_anchors_prep(struct obj_auxi_args *obj_auxi, int shards_nr) obj_auxi->sub_anchors = 1; sub_anchors = obj_get_sub_anchors(obj_args, obj_auxi->opc); - if (sub_anchors != NULL) - return shard_anchors_check_alloc_bufs(obj_auxi, sub_anchors, - sub_anchors->sa_anchors_nr, nr, - buf_size); + if (sub_anchors != NULL) { + int rc; + + rc = shard_anchors_eof_check(obj_auxi, sub_anchors); + if (rc) + return rc; + + return shard_anchors_check_alloc_bufs(obj_auxi, sub_anchors, nr, buf_size); + } + sub_anchors = shard_anchors_alloc(obj_auxi, shards_nr, nr, buf_size); if (sub_anchors == NULL) return -DER_NOMEM; @@ -6452,8 +6484,17 @@ shard_k2a_prep(struct shard_auxi_args *shard_auxi, struct dc_object *obj, obj_args = dc_task_get_args(obj_auxi->obj_task); shard_arg = container_of(shard_auxi, struct shard_k2a_args, ka_auxi); - shard_arg->ka_anchor = obj_args->anchor; + if (obj_args->anchor->da_sub_anchors) { + struct shard_anchors *sub_anchors; + int shard; + sub_anchors = (struct shard_anchors *)obj_args->anchor->da_sub_anchors; + shard = shard_anchor_lookup(sub_anchors, shard_auxi->shard); + D_ASSERT(shard != -1); + shard_arg->ka_anchor = &sub_anchors->sa_anchors[shard].ssa_anchor; + } else { + shard_arg->ka_anchor = obj_args->anchor; + } return 0; } @@ -6494,16 +6535,10 @@ dc_obj_key2anchor(tse_task_t *task) } if (obj_auxi->is_ec_obj) { - struct shard_anchors *sub_anchors = NULL; - - rc = obj_ec_get_parity_or_alldata_shard(obj_auxi, map_ver, grp_idx, args->dkey, - &shard_cnt, NULL); + rc = obj_ec_get_parity_or_alldata_shard(obj_auxi, map_ver, grp_idx, + args->dkey, &shard_cnt, NULL); if (obj_ec_parity_rotate_enabled(obj)) shard_cnt = obj_get_grp_size(obj); - sub_anchors = shard_anchors_alloc(obj_auxi, shard_cnt, 1, args->dkey->iov_buf_len); - if (sub_anchors == NULL) - D_GOTO(err_obj, rc = -DER_NOMEM); - args->anchor->da_sub_anchors = (uint64_t)sub_anchors; } else { shard_cnt = 1; if (obj_auxi->to_leader) { @@ -6521,20 +6556,22 @@ dc_obj_key2anchor(tse_task_t *task) if (rc < 0) { D_ERROR(DF_OID" Can not find shard grp %d: "DF_RC"\n", DP_OID(obj->cob_md.omd_id), grp_idx, DP_RC(rc)); - if (args->anchor->da_sub_anchors) - shard_anchors_free((struct shard_anchors *)args->anchor->da_sub_anchors, - DAOS_OBJ_RPC_KEY2ANCHOR); D_GOTO(err_obj, rc); } shard = rc; rc = obj_shards_2_fwtgts(obj, map_ver, NIL_BITMAP, shard, shard_cnt, 1, OBJ_TGT_FLAG_CLI_DISPATCH, obj_auxi); - if (rc != 0) { - if (args->anchor->da_sub_anchors) - shard_anchors_free((struct shard_anchors *)args->anchor->da_sub_anchors, - DAOS_OBJ_RPC_KEY2ANCHOR); + if (rc != 0) D_GOTO(err_obj, rc); + + if (shard_cnt > 1) { + rc = sub_anchors_prep(obj_auxi, shard_cnt); + if (rc) { + D_ERROR(DF_OID" prepare %d anchor fail: %d\n", + DP_OID(obj->cob_md.omd_id), shard_cnt, rc); + D_GOTO(err_obj, rc); + } } if (daos_handle_is_valid(args->th)) { diff --git a/src/object/cli_shard.c b/src/object/cli_shard.c index 9040559db7b..2c1cc9f1979 100644 --- a/src/object/cli_shard.c +++ b/src/object/cli_shard.c @@ -2458,7 +2458,6 @@ dc_k2a_cb(tse_task_t *task, 
void *arg) struct obj_k2a_args *k2a_args = (struct obj_k2a_args *)arg; struct obj_key2anchor_in *oki; struct obj_key2anchor_out *oko; - uint64_t save_sub_anchor; int ret = task->dt_result; int rc = 0; @@ -2502,9 +2501,7 @@ dc_k2a_cb(tse_task_t *task, void *arg) } *k2a_args->eaa_map_ver = obj_reply_map_version_get(k2a_args->rpc); - save_sub_anchor = k2a_args->anchor->da_sub_anchors; enum_anchor_copy(k2a_args->anchor, &oko->oko_anchor); - k2a_args->anchor->da_sub_anchors = save_sub_anchor; dc_obj_shard2anchor(k2a_args->anchor, k2a_args->shard); out: if (k2a_args->eaa_obj != NULL) @@ -2578,7 +2575,7 @@ dc_obj_shard_key2anchor(struct dc_obj_shard *obj_shard, enum obj_rpc_opc opc, cb_args.eaa_map_ver = &args->ka_auxi.map_ver; cb_args.epoch = &args->ka_auxi.epoch; cb_args.th = &obj_args->th; - cb_args.anchor = obj_args->anchor; + cb_args.anchor = args->ka_anchor; cb_args.shard = obj_shard->do_shard_idx; rc = tse_task_register_comp_cb(task, dc_k2a_cb, &cb_args, sizeof(cb_args)); if (rc != 0) diff --git a/src/object/obj_task.c b/src/object/obj_task.c index b8d3d473b89..7e307c2ff65 100644 --- a/src/object/obj_task.c +++ b/src/object/obj_task.c @@ -400,6 +400,7 @@ dc_obj_key2anchor_task_create(daos_handle_t oh, daos_handle_t th, daos_key_t *dk args->dkey = dkey; args->akey = akey; args->anchor = anchor; + args->nr = NULL; return 0; } diff --git a/src/object/srv_ec_aggregate.c b/src/object/srv_ec_aggregate.c index 5034a65fc51..3edb9606fa5 100644 --- a/src/object/srv_ec_aggregate.c +++ b/src/object/srv_ec_aggregate.c @@ -1797,7 +1797,7 @@ agg_process_stripe(struct ec_agg_param *agg_param, struct ec_agg_entry *entry) /* Query the parity, entry->ae_par_extent.ape_epoch will be set to * parity ext epoch if exist. */ - iter_param.ip_hdl = DAOS_HDL_INVAL; + iter_param.ip_hdl = agg_param->ap_cont_handle; /* set epr_lo as zero to pass-through possibly existed snapshot * between agg_param->ap_epr.epr_lo and .epr_hi. */ diff --git a/src/object/srv_obj_migrate.c b/src/object/srv_obj_migrate.c index 4b3c6879eff..dfada37cca8 100644 --- a/src/object/srv_obj_migrate.c +++ b/src/object/srv_obj_migrate.c @@ -2884,13 +2884,21 @@ migrate_obj_ult(void *data) * discard, or discard has been done. spc_discard_done means * discarding has been done in the current VOS target. 
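A few hunks up, shard_k2a_prep() starts routing each shard RPC to its own sub-anchor slot, and dc_k2a_cb() now copies the reply into that slot (cb_args.anchor = args->ka_anchor) rather than into the caller's top-level anchor, which is also what makes the old save_sub_anchor save/restore unnecessary. The routing decision, sketched with simplified stand-in types and the same linear lookup shape as shard_anchor_lookup():

```c
#include <stddef.h>
#include <stdint.h>

struct anchor { int state; };  /* stand-in for daos_anchor_t */
struct sub_anchor { uint32_t shard; struct anchor a; };

/*
 * With sub-anchors present every shard touches only its own slot, so
 * concurrent shard replies cannot clobber the caller's anchor; in the
 * single-shard case the caller's anchor is used directly.
 */
static struct anchor *
route_anchor(struct sub_anchor *subs, int nr, uint32_t shard,
	     struct anchor *parent)
{
	int i;

	if (subs == NULL)
		return parent;
	for (i = 0; i < nr; i++)
		if (subs[i].shard == shard)
			return &subs[i].a;
	return NULL; /* the real code asserts the shard is always found */
}
```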
*/ - while (tls->mpt_pool->spc_pool->sp_need_discard && - !tls->mpt_pool->spc_discard_done) { - D_DEBUG(DB_REBUILD, DF_UUID" wait for discard to finish.\n", - DP_UUID(arg->pool_uuid)); - dss_sleep(2 * 1000); - if (tls->mpt_fini) + if (tls->mpt_pool->spc_pool->sp_need_discard) { + while(!tls->mpt_pool->spc_discard_done) { + D_DEBUG(DB_REBUILD, DF_UUID" wait for discard to finish.\n", + DP_UUID(arg->pool_uuid)); + dss_sleep(2 * 1000); + if (tls->mpt_fini) + D_GOTO(free_notls, rc); + } + D_ASSERT(tls->mpt_pool->spc_pool->sp_need_discard == 0); + if (tls->mpt_pool->spc_pool->sp_discard_status) { + rc = tls->mpt_pool->spc_pool->sp_discard_status; + D_DEBUG(DB_REBUILD, DF_UUID" discard failure"DF_RC".\n", + DP_UUID(arg->pool_uuid), DP_RC(rc)); D_GOTO(free_notls, rc); + } } for (i = 0; i < arg->snap_cnt; i++) { diff --git a/src/pool/srv_target.c b/src/pool/srv_target.c index 75a32ba066d..9e1b5b81b10 100644 --- a/src/pool/srv_target.c +++ b/src/pool/srv_target.c @@ -1958,6 +1958,8 @@ ds_pool_tgt_discard_ult(void *data) DP_UUID(arg->pool_uuid), DP_RC(rc)); put: pool->sp_need_discard = 0; + pool->sp_discard_status = rc; + ds_pool_put(pool); free: tgt_discard_arg_free(arg); @@ -1992,6 +1994,7 @@ ds_pool_tgt_discard_handler(crt_rpc_t *rpc) } pool->sp_need_discard = 1; + pool->sp_discard_status = 0; rc = dss_ult_create(ds_pool_tgt_discard_ult, arg, DSS_XS_SYS, 0, 0, NULL); ds_pool_put(pool); diff --git a/src/rdb/raft b/src/rdb/raft index 3d20556a08f..9524cdb7161 160000 --- a/src/rdb/raft +++ b/src/rdb/raft @@ -1 +1 @@ -Subproject commit 3d20556a08fc21deb6899104ad817d0a6e8e7af4 +Subproject commit 9524cdb716151f1830071d66b61191444bde74f7 diff --git a/src/rdb/rdb.c b/src/rdb/rdb.c index 219e337a7a5..cdfed7dd618 100644 --- a/src/rdb/rdb.c +++ b/src/rdb/rdb.c @@ -97,6 +97,8 @@ rdb_create(const char *path, const uuid_t uuid, uint64_t caller_term, size_t siz if (rc != 0) goto out_mc_hdl; + db->d_new = true; + *storagep = rdb_to_storage(db); out_mc_hdl: if (rc != 0) @@ -485,6 +487,16 @@ rdb_close(struct rdb_storage *storage) D_FREE(db); } +static bool +rdb_get_use_leases(void) +{ + char *name = "RDB_USE_LEASES"; + bool value = true; + + d_getenv_bool(name, &value); + return value; +} + /** * Start \a storage, converting \a storage into \a dbp. If this is successful, * the caller must stop using \a storage; otherwise, the caller remains @@ -514,7 +526,9 @@ rdb_start(struct rdb_storage *storage, struct rdb **dbp) return rc; } - D_DEBUG(DB_MD, DF_DB": started db %p\n", DP_DB(db), db); + db->d_use_leases = rdb_get_use_leases(); + + D_DEBUG(DB_MD, DF_DB": started db %p: use_leases=%d\n", DP_DB(db), db, db->d_use_leases); *dbp = db; return 0; } diff --git a/src/rdb/rdb_internal.h b/src/rdb/rdb_internal.h index 7bd361893c8..8a8b9bdc341 100644 --- a/src/rdb/rdb_internal.h +++ b/src/rdb/rdb_internal.h @@ -91,6 +91,8 @@ struct rdb { ABT_cond d_commit_cv; /* for waking active pool checkpoint */ daos_handle_t d_mc; /* metadata container */ uint64_t d_nospc_ts; /* last time commit observed low/no space (usec) */ + bool d_new; /* for skipping lease recovery */ + bool d_use_leases; /* when verifying leadership */ /* rdb_raft fields */ raft_server_t *d_raft; @@ -206,7 +208,7 @@ int rdb_raft_trigger_compaction(struct rdb *db, bool compact_all, uint64_t *idx) * These are for daos_rpc::dr_opc and DAOS_RPC_OPCODE(opc, ...) rather than * crt_req_create(..., opc, ...). See src/include/daos/rpc.h. 
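The rewritten wait above adds an error path to the migrate/discard interlock: ds_pool_tgt_discard_ult() now publishes its result in sp_discard_status, and migrate_obj_ult() aborts instead of rebuilding on top of a partially discarded target when that status is non-zero. The consumer side, reduced to its shape; the names abbreviate the sp_ and spc_ fields, and sleep_ms stands in for dss_sleep():

```c
#include <stdbool.h>

/*
 * Waiter side of the discard hand-off in migrate_obj_ult(), simplified.
 * The real loop also bails out when the migration ULT is shutting down
 * (mpt_fini).
 */
static int
wait_for_discard(bool need_discard, const volatile bool *done,
		 const int *status, void (*sleep_ms)(int))
{
	if (!need_discard)
		return 0;       /* nothing to wait for */
	while (!*done)
		sleep_ms(2000); /* poll until this VOS target reports done */
	return *status;         /* non-zero aborts the migration */}
```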
*/ -#define DAOS_RDB_VERSION 3 +#define DAOS_RDB_VERSION 4 /* LIST of internal RPCS in form of: * OPCODE, flags, FMT, handler, corpc_hdlr, */ diff --git a/src/rdb/rdb_raft.c b/src/rdb/rdb_raft.c index 0a9075518cc..cfc292c0c6a 100644 --- a/src/rdb/rdb_raft.c +++ b/src/rdb/rdb_raft.c @@ -1402,6 +1402,17 @@ rdb_raft_cb_debug(raft_server_t *raft, raft_node_t *node, void *arg, } } +static raft_time_t +rdb_raft_cb_get_time(raft_server_t *raft, void *user_data) +{ + struct timespec now; + int rc; + + rc = clock_gettime(CLOCK_REALTIME, &now); + D_ASSERTF(rc == 0, "clock_gettime: %d\n", errno); + return now.tv_sec * 1000 + now.tv_nsec / (1000 * 1000); +} + /* * rdb's raft callback implementations * @@ -1423,7 +1434,8 @@ static raft_cbs_t rdb_raft_cbs = { .log_pop = rdb_raft_cb_log_pop, .log_get_node_id = rdb_raft_cb_log_get_node_id, .notify_membership_event = rdb_raft_cb_notify_membership_event, - .log = rdb_raft_cb_debug + .log = rdb_raft_cb_debug, + .get_time = rdb_raft_cb_get_time }; static int @@ -2024,16 +2036,18 @@ rdb_raft_append_apply(struct rdb *db, void *entry, size_t size, void *result) return rdb_raft_append_apply_internal(db, &mentry, result); } -/* Verify the leadership with a quorum. */ +/* Verify the leadership with a majority. */ int rdb_raft_verify_leadership(struct rdb *db) { + if (db->d_use_leases && raft_has_majority_leases(db->d_raft)) + return 0; + /* - * raft does not provide this functionality yet; append an empty entry - * as a (slower) workaround. + * Since raft does not provide a function for verifying leadership via + * RPCs yet, append an empty entry as a (slower) workaround. */ - return rdb_raft_append_apply(db, NULL /* entry */, 0 /* size */, - NULL /* result */); + return rdb_raft_append_apply(db, NULL /* entry */, 0 /* size */, NULL /* result */); } /* Generate a random double in [0.0, 1.0]. 
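rdb_raft_verify_leadership() above gains a fast path: when RDB_USE_LEASES is in effect and the leader still holds majority leases, leadership is confirmed from local state; only otherwise does it fall back to committing an empty entry, which costs a full majority round trip. The decision as a standalone sketch; the two function pointers are stand-ins, not the real raft API:

```c
#include <stdbool.h>

/*
 * Sketch of the rdb_raft_verify_leadership() decision. The callbacks
 * model raft_has_majority_leases() and the append-empty-entry fallback.
 */
static int
verify_leadership(bool use_leases,
		  bool (*has_majority_leases)(void),
		  int  (*append_empty_entry)(void))
{
	if (use_leases && has_majority_leases())
		return 0;            /* local check, no RPC */
	return append_empty_entry(); /* slower quorum round trip */
}
```

Leases are only trustworthy under bounded clock skew, which is why the reply handling further down shrinks each lease reported in an AE or IS response by d_hlc2msec(d_hlc_epsilon_get()) plus a 1 ms margin before raft consumes it.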
*/ @@ -2080,7 +2094,7 @@ rdb_timerd(void *arg) ABT_mutex_lock(db->d_raft_mutex); rdb_raft_save_state(db, &state); - rc = raft_periodic(db->d_raft, d_prev * 1000 /* ms */); + rc = raft_periodic(db->d_raft); rc = rdb_raft_check_state(db, &state, rc); ABT_mutex_unlock(db->d_raft_mutex); if (rc != 0) @@ -2430,6 +2444,21 @@ rdb_raft_get_request_timeout(void) return value; } +static int +rdb_raft_get_lease_maintenance_grace(void) +{ + char *name = "RDB_LEASE_MAINTENANCE_GRACE"; + unsigned int default_value = 7000; + unsigned int value = default_value; + + d_getenv_int(name, &value); + if (value == 0 || value > INT_MAX) { + D_WARN("%s not in (0, %d] (defaulting to %u)\n", name, INT_MAX, default_value); + value = default_value; + } + return value; +} + static uint64_t rdb_raft_get_compact_thres(void) { @@ -2654,6 +2683,7 @@ rdb_raft_start(struct rdb *db) { int election_timeout; int request_timeout; + int lease_maintenance_grace; int rc; D_ASSERT(db->d_raft == NULL); @@ -2667,6 +2697,8 @@ rdb_raft_start(struct rdb *db) } raft_set_nodeid(db->d_raft, dss_self_rank()); + if (db->d_new) + raft_set_first_start(db->d_raft); raft_set_callbacks(db->d_raft, &rdb_raft_cbs, db); rc = rdb_raft_load(db); @@ -2677,8 +2709,10 @@ rdb_raft_start(struct rdb *db) election_timeout = rdb_raft_get_election_timeout(); request_timeout = rdb_raft_get_request_timeout(); + lease_maintenance_grace = rdb_raft_get_lease_maintenance_grace(); raft_set_election_timeout(db->d_raft, election_timeout); raft_set_request_timeout(db->d_raft, request_timeout); + raft_set_lease_maintenance_grace(db->d_raft, lease_maintenance_grace); rc = dss_ult_create(rdb_recvd, db, DSS_XS_SELF, 0, 0, &db->d_recvd); if (rc != 0) @@ -2697,8 +2731,9 @@ rdb_raft_start(struct rdb *db) D_DEBUG(DB_MD, DF_DB": raft started: election_timeout=%dms request_timeout=%dms " - "compact_thres="DF_U64" ae_max_entries=%u ae_max_size="DF_U64"\n", DP_DB(db), - election_timeout, request_timeout, db->d_compact_thres, db->d_ae_max_entries, + "lease_maintenance_grace=%dms compact_thres="DF_U64" ae_max_entries=%u " + "ae_max_size="DF_U64"\n", DP_DB(db), election_timeout, request_timeout, + lease_maintenance_grace, db->d_compact_thres, db->d_ae_max_entries, db->d_ae_max_size); return 0; @@ -3092,6 +3127,7 @@ rdb_raft_process_reply(struct rdb *db, crt_rpc_t *rpc) struct rdb_installsnapshot_out *out_is; d_rank_t rank; raft_node_t *node; + raft_time_t *lease = NULL; int rc; /* Get the destination of the request - that is the source @@ -3107,6 +3143,31 @@ rdb_raft_process_reply(struct rdb *db, crt_rpc_t *rpc) return; } + /* + * If this is an AE or IS response, adjust the lease expiration time + * for clock offsets among replicas. + */ + switch (opc) { + case RDB_APPENDENTRIES: + out_ae = out; + lease = &out_ae->aeo_msg.lease; + break; + case RDB_INSTALLSNAPSHOT: + out_is = out; + lease = &out_is->iso_msg.lease; + break; + } + if (lease != NULL) { + int adjustment = d_hlc2msec(d_hlc_epsilon_get()) + 1 /* ms margin */; + + if (*lease < adjustment) { + D_ERROR(DF_DB": dropping %s response from rank %u: invalid lease: %ld\n", + DP_DB(db), opc == RDB_APPENDENTRIES ? 
"AE" : "IS", rank, *lease); + return; + } + *lease -= adjustment; + } + ABT_mutex_lock(db->d_raft_mutex); node = raft_get_node(db->d_raft, rank); @@ -3119,18 +3180,15 @@ rdb_raft_process_reply(struct rdb *db, crt_rpc_t *rpc) switch (opc) { case RDB_REQUESTVOTE: out_rv = out; - rc = raft_recv_requestvote_response(db->d_raft, node, - &out_rv->rvo_msg); + rc = raft_recv_requestvote_response(db->d_raft, node, &out_rv->rvo_msg); break; case RDB_APPENDENTRIES: out_ae = out; - rc = raft_recv_appendentries_response(db->d_raft, node, - &out_ae->aeo_msg); + rc = raft_recv_appendentries_response(db->d_raft, node, &out_ae->aeo_msg); break; case RDB_INSTALLSNAPSHOT: out_is = out; - rc = raft_recv_installsnapshot_response(db->d_raft, node, - &out_is->iso_msg); + rc = raft_recv_installsnapshot_response(db->d_raft, node, &out_is->iso_msg); break; default: D_ASSERTF(0, DF_DB": unexpected opc: %u\n", DP_DB(db), opc); diff --git a/src/rdb/rdb_rpc.c b/src/rdb/rdb_rpc.c index 05933ec5f88..7de71cff2d0 100644 --- a/src/rdb/rdb_rpc.c +++ b/src/rdb/rdb_rpc.c @@ -1,5 +1,5 @@ /* - * (C) Copyright 2017-2022 Intel Corporation. + * (C) Copyright 2017-2023 Intel Corporation. * * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -154,6 +154,9 @@ crt_proc_msg_installsnapshot_response_t(crt_proc_t proc, crt_proc_op_t proc_op, if (unlikely(rc)) return rc; rc = crt_proc_int32_t(proc, proc_op, &p->complete); + if (unlikely(rc)) + return rc; + rc = crt_proc_int64_t(proc, proc_op, &p->lease); if (unlikely(rc)) return rc; diff --git a/src/tests/ftest/avocado_tests.py b/src/tests/ftest/avocado_tests.py old mode 100755 new mode 100644 index 75fe7ded542..0a2c12a8859 --- a/src/tests/ftest/avocado_tests.py +++ b/src/tests/ftest/avocado_tests.py @@ -1,4 +1,3 @@ -#!/usr/bin/python3 """ (C) Copyright 2020-2023 Intel Corporation. @@ -41,8 +40,10 @@ def tearDown(self): def test_junit_stdio(self): """Test full Stdout in Jenkins JUnit display + :avocado: tags=manual + :avocado: tags=vm :avocado: tags=avocado_tests,avocado_junit_stdout - :avocado: tags=test_junit_stdio + :avocado: tags=ApricotTests,test_junit_stdio """ with open('large_stdout.txt', 'r') as inp: print(inp.read()) @@ -51,9 +52,10 @@ def test_junit_stdio(self): def test_teardown_timeout_timed_out(self): """Test the PoC tearDown() timeout patch + :avocado: tags=manual + :avocado: tags=vm :avocado: tags=avocado_tests,avocado_test_teardown_timeout - :avocado: tags=avocado_test_teardown_timeout_timed_out - :avocado: tags=test_teardown_timeout_timed_out + :avocado: tags=ApricotTests,test_teardown_timeout_timed_out """ self.log.debug("Sleeping for 10 seconds") time.sleep(10) @@ -61,8 +63,10 @@ def test_teardown_timeout_timed_out(self): def test_teardown_timeout(self): """Test the PoC tearDown() timeout patch + :avocado: tags=manual + :avocado: tags=vm :avocado: tags=avocado_tests,avocado_test_teardown_timeout - :avocado: tags=test_teardown_timeout + :avocado: tags=ApricotTests,test_teardown_timeout """ self.log.debug("Sleeping for 1 second") time.sleep(1) diff --git a/src/tests/ftest/cart/corpc/corpc_five_node.py b/src/tests/ftest/cart/corpc/corpc_five_node.py index 261df5b050e..fbd262ccdff 100644 --- a/src/tests/ftest/cart/corpc/corpc_five_node.py +++ b/src/tests/ftest/cart/corpc/corpc_five_node.py @@ -13,12 +13,13 @@ class CartCoRpcFiveNodeTest(CartTest): :avocado: recursive """ - def test_cart_corpc(self): + def test_cart_corpc_five_node(self): """Test CaRT CoRPC. 
:avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,corpc,five_node,memcheck - :avocado: tags=test_cart_corpc + :avocado: tags=CartCoRpcFiveNodeTest,test_cart_corpc_five_node """ cmd = self.build_cmd(self.env, "test_servers") self.launch_test(cmd) diff --git a/src/tests/ftest/cart/corpc/corpc_one_node.py b/src/tests/ftest/cart/corpc/corpc_one_node.py index 40ff1c06406..3b339de4201 100644 --- a/src/tests/ftest/cart/corpc/corpc_one_node.py +++ b/src/tests/ftest/cart/corpc/corpc_one_node.py @@ -13,12 +13,13 @@ class CartCoRpcOneNodeTest(CartTest): :avocado: recursive """ - def test_cart_corpc(self): + def test_cart_corpc_one_node(self): """Test CaRT CoRPC. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,corpc,one_node,memcheck - :avocado: tags=test_cart_corpc + :avocado: tags=CartCoRpcOneNodeTest,test_cart_corpc_one_node """ cmd = self.build_cmd(self.env, "test_servers") self.launch_test(cmd) diff --git a/src/tests/ftest/cart/corpc/corpc_two_node.py b/src/tests/ftest/cart/corpc/corpc_two_node.py index 3d4d5a71a37..4f0c4dc674b 100644 --- a/src/tests/ftest/cart/corpc/corpc_two_node.py +++ b/src/tests/ftest/cart/corpc/corpc_two_node.py @@ -13,12 +13,13 @@ class CartCoRpcTwoNodeTest(CartTest): :avocado: recursive """ - def test_cart_corpc(self): + def test_cart_corpc_two_node(self): """Test CaRT CoRPC. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,corpc,two_node,memcheck - :avocado: tags=test_cart_corpc + :avocado: tags=CartCoRpcTwoNodeTest,test_cart_corpc_two_node """ cmd = self.build_cmd(self.env, "test_servers") self.launch_test(cmd) diff --git a/src/tests/ftest/cart/ctl/ctl_five_node.py b/src/tests/ftest/cart/ctl/ctl_five_node.py index 9cd23801148..d8cf310c694 100644 --- a/src/tests/ftest/cart/ctl/ctl_five_node.py +++ b/src/tests/ftest/cart/ctl/ctl_five_node.py @@ -13,12 +13,13 @@ class CartCtlFiveNodeTest(CartTest): :avocado: recursive """ - def test_cart_ctl(self): + def test_cart_ctl_five_node(self): """Test CaRT ctl. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,ctl,five_node,memcheck - :avocado: tags=test_cart_ctl + :avocado: tags=CartCtlFiveNodeTest,test_cart_ctl_five_node """ srvcmd = self.build_cmd(self.env, "test_servers") diff --git a/src/tests/ftest/cart/ctl/ctl_one_node.py b/src/tests/ftest/cart/ctl/ctl_one_node.py index 545f4b58511..bf260960592 100644 --- a/src/tests/ftest/cart/ctl/ctl_one_node.py +++ b/src/tests/ftest/cart/ctl/ctl_one_node.py @@ -13,12 +13,13 @@ class CartCtlOneNodeTest(CartTest): :avocado: recursive """ - def test_cart_ctl(self): + def test_cart_ctl_one_node(self): """Test CaRT ctl. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,ctl,one_node,memcheck - :avocado: tags=test_cart_ctl + :avocado: tags=CartCtlOneNodeTest,test_cart_ctl_one_node """ srvcmd = self.build_cmd(self.env, "test_servers") diff --git a/src/tests/ftest/cart/ghost_rank_rpc/ghost_rank_rpc_one_node.py b/src/tests/ftest/cart/ghost_rank_rpc/ghost_rank_rpc_one_node.py index 2c7684dac17..2e92676b9f2 100644 --- a/src/tests/ftest/cart/ghost_rank_rpc/ghost_rank_rpc_one_node.py +++ b/src/tests/ftest/cart/ghost_rank_rpc/ghost_rank_rpc_one_node.py @@ -17,8 +17,9 @@ def test_cart_ghost_rank_rpc(self): """Test ghost rank RPC. 
:avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,ghost_rank_rpc,one_node,memcheck - :avocado: tags=test_cart_ghost_rank_rpc + :avocado: tags=CartGhostRankRpcOneNodeTest,test_cart_ghost_rank_rpc """ cmd = self.build_cmd(self.env, "test_servers") self.launch_test(cmd) diff --git a/src/tests/ftest/cart/group_test/group_test.py b/src/tests/ftest/cart/group_test/group_test.py index 85ca17d78ac..bfa7b596d23 100644 --- a/src/tests/ftest/cart/group_test/group_test.py +++ b/src/tests/ftest/cart/group_test/group_test.py @@ -6,19 +6,20 @@ from cart_utils import CartTest -class GroupTest(CartTest): +class CartGroupTest(CartTest): # pylint: disable=too-few-public-methods """Run GroupTests for primary and secondary groups. :avocado: recursive """ - def test_group(self): + def test_cart_group(self): """Test CaRT NoPmix Launcher. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,group_test,one_node,memcheck - :avocado: tags=test_group + :avocado: tags=CartGroupTest,test_cart_group """ cmd = self.build_cmd(self.env, "test_servers") self.launch_test(cmd) diff --git a/src/tests/ftest/cart/iv/iv_one_node.py b/src/tests/ftest/cart/iv/iv_one_node.py index 4c49e46530b..e566d5d36ff 100644 --- a/src/tests/ftest/cart/iv/iv_one_node.py +++ b/src/tests/ftest/cart/iv/iv_one_node.py @@ -254,12 +254,13 @@ def _iv_test_actions(self, cmd, actions): 'Error code {!s} running command "{!s}"'.format( cli_rtn, command)) - def test_cart_iv(self): + def test_cart_iv_one_node(self): """Test CaRT IV. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,iv,one_node,memcheck - :avocado: tags=test_cart_iv + :avocado: tags=CartIvOneNodeTest,test_cart_iv_one_node """ srvcmd = self.build_cmd(self.env, "test_servers") diff --git a/src/tests/ftest/cart/iv/iv_two_node.py b/src/tests/ftest/cart/iv/iv_two_node.py index 5d5b13bf28d..4c49e9305a9 100644 --- a/src/tests/ftest/cart/iv/iv_two_node.py +++ b/src/tests/ftest/cart/iv/iv_two_node.py @@ -182,13 +182,14 @@ def _iv_test_actions(self, cmd, actions): 'Error code {!s} running command "{!s}"'.format( cli_rtn, command)) - def test_cart_iv(self): + def test_cart_iv_two_node(self): """ Test CaRT IV :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,iv,two_node,memcheck - :avocado: tags=test_cart_iv + :avocado: tags=CartIvTwoNodeTest,test_cart_iv_two_node """ srvcmd = self.build_cmd(self.env, "test_servers") diff --git a/src/tests/ftest/cart/no_pmix/multictx_one_node.py b/src/tests/ftest/cart/no_pmix/multictx_one_node.py index 574dff1bea6..14161f6123f 100644 --- a/src/tests/ftest/cart/no_pmix/multictx_one_node.py +++ b/src/tests/ftest/cart/no_pmix/multictx_one_node.py @@ -19,8 +19,9 @@ def test_cart_no_pmix(self): """Test CaRT NoPmix. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,no_pmix,one_node,memcheck - :avocado: tags=test_cart_no_pmix + :avocado: tags=CartNoPmixOneNodeTest,test_cart_no_pmix """ cmd = self.params.get("tst_bin", '/run/tests/*/') diff --git a/src/tests/ftest/cart/nopmix_launcher/launcher_one_node.py b/src/tests/ftest/cart/nopmix_launcher/launcher_one_node.py index 61bd689bfe4..401cefc49fe 100644 --- a/src/tests/ftest/cart/nopmix_launcher/launcher_one_node.py +++ b/src/tests/ftest/cart/nopmix_launcher/launcher_one_node.py @@ -17,8 +17,9 @@ def test_cart_no_pmix_launcher(self): """Test CaRT NoPmix Launcher. 
:avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,no_pmix_launcher,one_node,memcheck - :avocado: tags=test_cart_no_pmix_launcher + :avocado: tags=CartNoPmixLauncherOneNodeTest,test_cart_no_pmix_launcher """ cli_bin = self.params.get("test_clients_bin", '/run/tests/*/') cli_arg = self.params.get("test_clients_arg", '/run/tests/*/') diff --git a/src/tests/ftest/cart/rpc/multisend_one_node.py b/src/tests/ftest/cart/rpc/multisend_one_node.py index 4b56b12e988..b9d97390902 100644 --- a/src/tests/ftest/cart/rpc/multisend_one_node.py +++ b/src/tests/ftest/cart/rpc/multisend_one_node.py @@ -18,8 +18,9 @@ def test_cart_multisend(self): """Test multi-send :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,rpc,one_node,memcheck,multisend,bulk - :avocado: tags=test_cart_multisend + :avocado: tags=CartMultisendOneNodeTest,test_cart_multisend """ srvcmd = self.build_cmd(self.env, "test_servers") diff --git a/src/tests/ftest/cart/rpc/rpc_one_node.py b/src/tests/ftest/cart/rpc/rpc_one_node.py index 81e71b41e4d..b8ca50bf1a8 100644 --- a/src/tests/ftest/cart/rpc/rpc_one_node.py +++ b/src/tests/ftest/cart/rpc/rpc_one_node.py @@ -13,12 +13,13 @@ class CartRpcOneNodeTest(CartTest): :avocado: recursive """ - def test_cart_rpc(self): + def test_cart_rpc_one_node(self): """Test CaRT RPC. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,rpc,one_node,memcheck - :avocado: tags=test_cart_rpc + :avocado: tags=CartRpcOneNodeTest,test_cart_rpc_one_node """ srvcmd = self.build_cmd(self.env, "test_servers") clicmd = self.build_cmd(self.env, "test_clients") diff --git a/src/tests/ftest/cart/rpc/rpc_two_node.py b/src/tests/ftest/cart/rpc/rpc_two_node.py index 5b93ebc323f..df6afc58ecb 100644 --- a/src/tests/ftest/cart/rpc/rpc_two_node.py +++ b/src/tests/ftest/cart/rpc/rpc_two_node.py @@ -13,12 +13,13 @@ class CartRpcTwoNodeTest(CartTest): :avocado: recursive """ - def test_cart_rpc(self): + def test_cart_rpc_two_node(self): """Test CaRT RPC. :avocado: tags=all,pr,daily_regression + :avocado: tags=vm :avocado: tags=cart,rpc,two_node,memcheck - :avocado: tags=test_cart_rpc + :avocado: tags=CartRpcTwoNodeTest,test_cart_rpc_two_node """ srvcmd = self.build_cmd(self.env, "test_servers") clicmd = self.build_cmd(self.env, "test_clients") diff --git a/src/tests/ftest/cart/rpc/swim_notification.py b/src/tests/ftest/cart/rpc/swim_notification.py index 72488106357..970631b74ce 100644 --- a/src/tests/ftest/cart/rpc/swim_notification.py +++ b/src/tests/ftest/cart/rpc/swim_notification.py @@ -19,6 +19,7 @@ def test_cart_rpc(self): """Test CaRT RPC. :avocado: tags=all,pr + :avocado: tags=vm :avocado: tags=cart,rpc,one_node,swim_rank_eviction,memcheck :avocado: tags=CartRpcOneNodeSwimNotificationOnRankEvictionTest,test_cart_rpc """ diff --git a/src/tests/ftest/cart/selftest/selftest_three_node.py b/src/tests/ftest/cart/selftest/selftest_three_node.py index c36acb2e6ae..a9f54bee628 100644 --- a/src/tests/ftest/cart/selftest/selftest_three_node.py +++ b/src/tests/ftest/cart/selftest/selftest_three_node.py @@ -13,12 +13,13 @@ class CartSelfThreeNodeTest(CartTest): :avocado: recursive """ - def test_cart_selftest(self): + def test_cart_selftest_three_node(self): """Test CaRT Self Test. 
:avocado: tags=all,pr,daily_regression - :avocado: tags=cart,selftest,three_node,memcheck - :avocado: tags=test_cart_selftest + :avocado: tags=vm + :avocado: tags=cart,selftest,memcheck + :avocado: tags=CartSelfThreeNodeTest,test_cart_selftest_three_node """ srvcmd = self.build_cmd(self.env, "test_servers") diff --git a/src/tests/ftest/checksum/csum_basic.py b/src/tests/ftest/checksum/csum_basic.py index 566f6cc0bd9..bbb396b9f3b 100644 --- a/src/tests/ftest/checksum/csum_basic.py +++ b/src/tests/ftest/checksum/csum_basic.py @@ -62,9 +62,9 @@ def test_single_object_with_checksum(self): Test Description: Write Avocado Test to verify single data after pool/container disconnect/reconnect. :avocado: tags=all,daily_regression - :avocado: vm + :avocado: tags=vm :avocado: tags=checksum - :avocado: tags=basic_checksum_object,test_single_object_with_checksum + :avocado: tags=CsumContainerValidation,test_single_object_with_checksum """ self.d_log.info("Writing the Single Dataset") record_index = 0 diff --git a/src/tests/ftest/container/global_handle.py b/src/tests/ftest/container/global_handle.py deleted file mode 100644 index 00df09cadf5..00000000000 --- a/src/tests/ftest/container/global_handle.py +++ /dev/null @@ -1,119 +0,0 @@ -''' - (C) Copyright 2018-2023 Intel Corporation. - - SPDX-License-Identifier: BSD-2-Clause-Patent -''' -import ctypes -import traceback -from multiprocessing import sharedctypes - -from pydaos.raw import DaosPool, DaosContainer, DaosApiError, IOV -from avocado import fail_on - -from apricot import TestWithServers - - -class GlobalHandle(TestWithServers): - """Test the ability to share container handles among processes. - - :avocado: recursive - """ - - @fail_on(DaosApiError) - def check_handle(self, pool_glob_handle, uuidstr, cont_glob_handle, rank): - """Verify that the global handles can be turned into local handles. - - This gets run in a child process and verifies the global handles can be - turned into local handles in another process. - - Args: - pool_glob_handle (sharedctypes.RawValue): pool handle - uuidstr (sharedctypes.RawArray): pool uuid - cont_glob_handle (sharedctypes.RawValue): container handle - rank (int): pool svc rank - - Raises: - DaosApiError: if there was an error converting the pool handle or - using the local pool handle to create a container. - - """ - # setup the pool and connect using global handle - pool = DaosPool(self.context) - pool.uuid = uuidstr - pool.set_svc(rank) - buf = ctypes.cast( - pool_glob_handle.iov_buf, - ctypes.POINTER(ctypes.c_byte * pool_glob_handle.iov_buf_len)) - buf2 = bytearray() - buf2.extend(buf.contents) - pool_handle = pool.global2local( - self.context, pool_glob_handle.iov_len, - pool_glob_handle.iov_buf_len, buf2) - - # perform an operation that will use the new handle, if it - # doesn't throw an exception, then all is well. - pool.pool_query() - - # setup the container and then connect using the global handle - container = DaosContainer(self.context) - container.poh = pool_handle - buf = ctypes.cast( - cont_glob_handle.iov_buf, - ctypes.POINTER(ctypes.c_byte * cont_glob_handle.iov_buf_len)) - buf2 = bytearray() - buf2.extend(buf.contents) - _ = container.global2local( - self.context, cont_glob_handle.iov_len, - cont_glob_handle.iov_buf_len, buf2) - # just try one thing to make sure handle is good - container.query() - - def test_global_handle(self): - """Test Description: Use a pool handle in another process. 
- - :avocado: tags=all,daily_regression - :avocado: tags=vm - :avocado: tags=container - :avocado: tags=global_handle,container_global_handle,test_global_handle - """ - # initialize a python pool object then create the underlying - # daos storage and connect to it - self.add_pool(create=True, connect=True) - - # create a pool global handle - iov_len, buf_len, buf = self.pool.pool.local2global() - buftype = ctypes.c_byte * buf_len - c_buf = buftype.from_buffer(buf) - sct_pool_handle = ( - sharedctypes.RawValue( - IOV, ctypes.cast(c_buf, ctypes.c_void_p), buf_len, iov_len)) - - # create a container - self.add_container(self.pool) - self.container.open() - - try: - # create a container global handle - iov_len, buf_len, buf = self.container.container.local2global() - buftype = ctypes.c_byte * buf_len - c_buf = buftype.from_buffer(buf) - sct_cont_handle = ( - sharedctypes.RawValue( - IOV, ctypes.cast(c_buf, ctypes.c_void_p), buf_len, iov_len)) - - sct_pool_uuid = sharedctypes.RawArray( - ctypes.c_byte, self.pool.pool.uuid) - # this should work in the future but need on-line server addition - # arg_list = ( - # p = Process(target=check_handle, args=arg_list) - # p.start() - # p.join() - # for now verifying global handle in the same process which is not - # the intended use case - self.check_handle( - sct_pool_handle, sct_pool_uuid, sct_cont_handle, 0) - - except DaosApiError as error: - self.log.error(error) - self.log.error(traceback.format_exc()) - self.fail("Expecting to pass but test has failed.\n") diff --git a/src/tests/ftest/container/global_handle.yaml b/src/tests/ftest/container/global_handle.yaml deleted file mode 100644 index f38a0d2c013..00000000000 --- a/src/tests/ftest/container/global_handle.yaml +++ /dev/null @@ -1,20 +0,0 @@ -# change host names to your reserved nodes, the -# required quantity is indicated by the placeholders -hosts: - test_servers: 1 -timeout: 60 -server_config: - name: daos_server - engines_per_host: 1 - engines: - 0: - targets: 4 - nr_xs_helpers: 0 - storage: - 0: - class: ram - scm_mount: /mnt/daos - system_ram_reserved: 1 -pool: - control_method: dmg - scm_size: 1073741824 diff --git a/src/tests/ftest/harness/basic.py b/src/tests/ftest/harness/basic.py index 18cec9167a9..e3997f1054a 100644 --- a/src/tests/ftest/harness/basic.py +++ b/src/tests/ftest/harness/basic.py @@ -22,8 +22,9 @@ def test_always_fails(self): """Simple test of apricot test code. 
:avocado: tags=all + :avocado: tags=vm :avocado: tags=harness,harness_basic_test - :avocado: tags=always_fails,test_always_fails + :avocado: tags=HarnessBasicTest,always_fails,test_always_fails """ self.fail("NOOP test to do nothing but fail") @@ -33,7 +34,7 @@ def test_always_fails_hw(self): :avocado: tags=all :avocado: tags=hw,large,medium,small :avocado: tags=harness,harness_basic_test - :avocado: tags=always_fails,test_always_fails_hw + :avocado: tags=HarnessBasicTest,always_fails,test_always_fails_hw """ self.test_always_fails() diff --git a/src/tests/ftest/harness/unit.py b/src/tests/ftest/harness/unit.py index 174b46e2760..318a29406ed 100644 --- a/src/tests/ftest/harness/unit.py +++ b/src/tests/ftest/harness/unit.py @@ -3,9 +3,12 @@ SPDX-License-Identifier: BSD-2-Clause-Patent """ +from ClusterShell.NodeSet import NodeSet + from apricot import TestWithoutServers from data_utils import list_unique, list_flatten, list_stats, \ dict_extract_values, dict_subtract +from run_utils import run_remote, ResultData class HarnessUnitTest(TestWithoutServers): @@ -14,13 +17,50 @@ class HarnessUnitTest(TestWithoutServers): :avocado: recursive """ + def _verify_remote_command_result(self, result, passed, expected, timeout, homogeneous, + passed_hosts, failed_hosts, all_stdout, all_stderr): + """Verify a RemoteCommandResult object. + + Args: + result (RemoteCommandResult): object to verify + passed (bool): expected passed command state + expected (list): expected list of ResultData objects + timeout (bool): expected command timeout state + homogeneous (bool): expected homogeneous command output state + passed_hosts (NodeSet): expected set of hosts on which the command passed + failed_hosts (NodeSet): expected set of hosts on which the command failed + all_stdout (dict): expected stdout str per host key + all_stderr (dict): expected stderr str per host key + """ + self.assertEqual(passed, result.passed, 'Incorrect RemoteCommandResult.passed') + self.assertEqual( + len(expected), len(result.output), 'Incorrect RemoteCommandResult.output count') + sorted_output = sorted(result.output) + for index, expect in enumerate(sorted(expected)): + actual = sorted_output[index] + for key in ('command', 'returncode', 'hosts', 'stdout', 'stderr', 'timeout'): + self.assertEqual( + getattr(expect, key), getattr(actual, key), + 'Incorrect ResultData.{}'.format(key)) + self.assertEqual(timeout, result.timeout, 'Incorrect RemoteCommandResult.timeout') + self.assertEqual( + homogeneous, result.homogeneous, 'Incorrect RemoteCommandResult.homogeneous') + self.assertEqual( + passed_hosts, result.passed_hosts, 'Incorrect RemoteCommandResult.passed_hosts') + self.assertEqual( + failed_hosts, result.failed_hosts, 'Incorrect RemoteCommandResult.failed_hosts') + self.assertEqual(all_stdout, result.all_stdout, 'Incorrect RemoteCommandResult.all_stdout') + self.assertEqual(all_stderr, result.all_stderr, 'Incorrect RemoteCommandResult.all_stderr') + def test_harness_unit_list_unique(self): """Verify list_unique(). :avocado: tags=all + :avocado: tags=vm :avocado: tags=harness,dict_utils :avocado: tags=HarnessUnitTest,test_harness_unit_list_unique """ + self.log_step('Verify list_unique()') self.assertEqual( list_unique([1, 2, 3]), [1, 2, 3]) @@ -36,14 +76,17 @@ def test_harness_unit_list_unique(self): self.assertEqual( list_unique([{0: 1}, {2: 3}, {2: 3}]), [{0: 1}, {2: 3}]) + self.log_step('Unit Test Passed') def test_harness_unit_list_flatten(self): """Verify list_flatten(). 
:avocado: tags=all + :avocado: tags=vm :avocado: tags=harness,dict_utils :avocado: tags=HarnessUnitTest,test_harness_unit_list_flatten """ + self.log_step('Verify list_flatten()') self.assertEqual( list_flatten([1, 2, 3]), [1, 2, 3]) @@ -65,14 +108,17 @@ def test_harness_unit_list_flatten(self): self.assertEqual( list_flatten([1, 2, 3, {'foo': 'bar'}]), [1, 2, 3, {'foo': 'bar'}]) + self.log_step('Unit Test Passed') def test_harness_unit_list_stats(self): """Verify list_stats(). :avocado: tags=all + :avocado: tags=vm :avocado: tags=harness,dict_utils :avocado: tags=HarnessUnitTest,test_harness_unit_list_stats """ + self.log_step('Verify list_stats()') self.assertEqual( list_stats([100, 200]), { @@ -87,14 +133,17 @@ def test_harness_unit_list_stats(self): 'min': -100, 'max': 200 }) + self.log_step('Unit Test Passed') def test_harness_unit_dict_extract_values(self): """Verify dict_extract_values(). :avocado: tags=all + :avocado: tags=vm :avocado: tags=harness,dict_utils :avocado: tags=HarnessUnitTest,test_harness_unit_dict_extract_values """ + self.log_step('Verify dict_extract_values()') dict1 = { 'key1': { 'key1.1': { @@ -147,14 +196,17 @@ def test_harness_unit_dict_extract_values(self): self.assertEqual( dict_extract_values(dict2, ['a']), [{'b': {'a': 0}}, 0]) + self.log_step('Unit Test Passed') def test_harness_unit_dict_subtract(self): """Verify dict_subtract(). :avocado: tags=all + :avocado: tags=vm :avocado: tags=harness,dict_utils :avocado: tags=HarnessUnitTest,test_harness_unit_dict_subtract """ + self.log_step('Verify dict_subtract()') dict1 = { 'key1': { 'key2': { @@ -181,3 +233,220 @@ def test_harness_unit_dict_subtract(self): } } }) + self.log_step('Unit Test Passed') + + def test_harness_unit_run_remote_single(self): + """Verify run_remote() with a single host. + + :avocado: tags=all + :avocado: tags=vm + :avocado: tags=harness,run_utils + :avocado: tags=HarnessUnitTest,test_harness_unit_run_remote_single + """ + hosts = self.get_hosts_from_yaml('test_clients', 'partition', 'reservation', '/run/hosts/*') + command = 'uname -o' + self.log_step('Verify run_remote() w/ single host') + self._verify_remote_command_result( + result=run_remote(self.log, NodeSet(hosts[0]), command), + passed=True, + expected=[ResultData(command, 0, NodeSet(hosts[0]), ['GNU/Linux'], [], False)], + timeout=False, + homogeneous=True, + passed_hosts=NodeSet(hosts[0]), + failed_hosts=NodeSet(), + all_stdout={hosts[0]: 'GNU/Linux'}, + all_stderr={hosts[0]: ''} + ) + self.log_step('Unit Test Passed') + + def test_harness_unit_run_remote_homogeneous(self): + """Verify run_remote() with homogeneous output. + + :avocado: tags=all + :avocado: tags=vm + :avocado: tags=harness,run_utils + :avocado: tags=HarnessUnitTest,test_harness_unit_run_remote_homogeneous + """ + hosts = self.get_hosts_from_yaml('test_clients', 'partition', 'reservation', '/run/hosts/*') + command = 'uname -o' + self.log_step('Verify run_remote() w/ homogeneous output') + self._verify_remote_command_result( + result=run_remote(self.log, hosts, command), + passed=True, + expected=[ResultData(command, 0, hosts, ['GNU/Linux'], [], False)], + timeout=False, + homogeneous=True, + passed_hosts=hosts, + failed_hosts=NodeSet(), + all_stdout={str(hosts): 'GNU/Linux'}, + all_stderr={str(hosts): ''} + ) + self.log_step('Unit Test Passed') + + def test_harness_unit_run_remote_heterogeneous(self): + """Verify run_remote() with heterogeneous output. 
+ + :avocado: tags=all + :avocado: tags=vm + :avocado: tags=harness,run_utils + :avocado: tags=HarnessUnitTest,test_harness_unit_run_remote_heterogeneous + """ + hosts = self.get_hosts_from_yaml('test_clients', 'partition', 'reservation', '/run/hosts/*') + command = 'hostname -s' + self.log_step('Verify run_remote() w/ heterogeneous output') + self._verify_remote_command_result( + result=run_remote(self.log, hosts, command), + passed=True, + expected=[ + ResultData(command, 0, NodeSet(hosts[0]), [hosts[0]], [], False), + ResultData(command, 0, NodeSet(hosts[1]), [hosts[1]], [], False), + ], + timeout=False, + homogeneous=False, + passed_hosts=hosts, + failed_hosts=NodeSet(), + all_stdout={ + hosts[0]: hosts[0], + hosts[1]: hosts[1] + }, + all_stderr={ + hosts[0]: '', + hosts[1]: '' + }, + ) + self.log_step('Unit Test Passed') + + def test_harness_unit_run_remote_combined(self): + """Verify run_remote() with combined stdout and stderr. + + :avocado: tags=all + :avocado: tags=vm + :avocado: tags=harness,run_utils + :avocado: tags=HarnessUnitTest,test_harness_unit_run_remote_combined + """ + hosts = self.get_hosts_from_yaml('test_clients', 'partition', 'reservation', '/run/hosts/*') + command = 'echo stdout; if [ $(hostname -s) == \'{}\' ]; then echo stderr 1>&2; fi'.format( + hosts[1]) + self.log_step('Verify run_remote() w/ separated stdout and stderr') + self._verify_remote_command_result( + result=run_remote(self.log, hosts, command, stderr=False), + passed=True, + expected=[ + ResultData(command, 0, NodeSet(hosts[0]), ['stdout'], [], False), + ResultData(command, 0, NodeSet(hosts[1]), ['stdout', 'stderr'], [], False), + ], + timeout=False, + homogeneous=False, + passed_hosts=hosts, + failed_hosts=NodeSet(), + all_stdout={ + hosts[0]: 'stdout', + hosts[1]: 'stdout\nstderr' + }, + all_stderr={ + hosts[0]: '', + hosts[1]: '' + } + ) + self.log_step('Unit Test Passed') + + def test_harness_unit_run_remote_separated(self): + """Verify run_remote() with separated stdout and stderr. + + :avocado: tags=all + :avocado: tags=vm + :avocado: tags=harness,run_utils + :avocado: tags=HarnessUnitTest,test_harness_unit_run_remote_separated + """ + hosts = self.get_hosts_from_yaml('test_clients', 'partition', 'reservation', '/run/hosts/*') + command = 'echo stdout; if [ $(hostname -s) == \'{}\' ]; then echo stderr 1>&2; fi'.format( + hosts[1]) + self.log_step('Verify run_remote() w/ separated stdout and stderr') + self._verify_remote_command_result( + result=run_remote(self.log, hosts, command, stderr=True), + passed=True, + expected=[ + ResultData(command, 0, NodeSet(hosts[0]), ['stdout'], [], False), + ResultData(command, 0, NodeSet(hosts[1]), ['stdout'], ['stderr'], False), + ], + timeout=False, + homogeneous=False, + passed_hosts=hosts, + failed_hosts=NodeSet(), + all_stdout={ + hosts[0]: 'stdout', + hosts[1]: 'stdout' + }, + all_stderr={ + hosts[0]: '', + hosts[1]: 'stderr' + } + ) + self.log_step('Unit Test Passed') + + def test_harness_unit_run_remote_no_stdout(self): + """Verify run_remote() with separated stdout and stderr. 
+ + :avocado: tags=all + :avocado: tags=vm + :avocado: tags=harness,run_utils + :avocado: tags=HarnessUnitTest,test_harness_unit_run_remote_separated + """ + hosts = self.get_hosts_from_yaml('test_clients', 'partition', 'reservation', '/run/hosts/*') + command = 'if [ $(hostname -s) == \'{}\' ]; then echo stderr 1>&2; fi'.format(hosts[1]) + self.log_step('Verify run_remote() w/ no stdout') + self._verify_remote_command_result( + result=run_remote(self.log, hosts, command, stderr=True), + passed=True, + expected=[ + ResultData(command, 0, NodeSet(hosts[0]), [], [], False), + ResultData(command, 0, NodeSet(hosts[1]), [], ['stderr'], False), + ], + timeout=False, + homogeneous=False, + passed_hosts=hosts, + failed_hosts=NodeSet(), + all_stdout={ + hosts[0]: '', + hosts[1]: '' + }, + all_stderr={ + hosts[0]: '', + hosts[1]: 'stderr' + } + ) + self.log_step('Unit Test Passed') + + def test_harness_unit_run_remote_failure(self): + """Verify run_remote() with separated stdout and stderr. + + :avocado: tags=all + :avocado: tags=vm + :avocado: tags=harness,run_utils + :avocado: tags=HarnessUnitTest,test_harness_unit_run_remote_separated + """ + hosts = self.get_hosts_from_yaml('test_clients', 'partition', 'reservation', '/run/hosts/*') + command = 'if [ $(hostname -s) == \'{}\' ]; then echo fail; exit 1; fi; echo pass'.format( + hosts[1]) + self.log_step('Verify run_remote() w/ a failure') + self._verify_remote_command_result( + result=run_remote(self.log, hosts, command, stderr=True), + passed=False, + expected=[ + ResultData(command, 0, NodeSet(hosts[0]), ['pass'], [], False), + ResultData(command, 1, NodeSet(hosts[1]), ['fail'], [], False), + ], + timeout=False, + homogeneous=False, + passed_hosts=NodeSet(hosts[0]), + failed_hosts=NodeSet(hosts[1]), + all_stdout={ + hosts[0]: 'pass', + hosts[1]: 'fail' + }, + all_stderr={ + hosts[0]: '', + hosts[1]: '' + } + ) + self.log_step('Unit Test Passed') diff --git a/src/tests/ftest/harness/unit.yaml b/src/tests/ftest/harness/unit.yaml index 9ca52a511a8..03a2c3a16e1 100644 --- a/src/tests/ftest/harness/unit.yaml +++ b/src/tests/ftest/harness/unit.yaml @@ -1,2 +1,3 @@ timeout: 10 -test_clients: 1 +hosts: + test_clients: 2 diff --git a/src/tests/ftest/ior/intercept_basic.py b/src/tests/ftest/ior/intercept_basic.py deleted file mode 100644 index 3bc38ea7aec..00000000000 --- a/src/tests/ftest/ior/intercept_basic.py +++ /dev/null @@ -1,51 +0,0 @@ -""" - (C) Copyright 2019-2023 Intel Corporation. - - SPDX-License-Identifier: BSD-2-Clause-Patent -""" - -from ior_intercept_test_base import IorInterceptTestBase - - -class IorInterceptBasic(IorInterceptTestBase): - """Test class Description: Verify IOR performance with DFUSE + IL is similar to DFS - for a single server and single client node. - - :avocado: recursive - """ - - def test_ior_intercept(self): - """Jira ID: DAOS-3498. - - Test Description: - Verify IOR performance with DFUSE + IL is similar to DFS. - - Use case: - Run IOR write + read with DFS. - Run IOR write + read with DFUSE + IL. - Verify performance with DFUSE + IL is similar to DFS. - - :avocado: tags=all,daily_regression - :avocado: tags=hw,medium - :avocado: tags=daosio,dfuse,il,ior,ior_intercept - :avocado: tags=IorInterceptBasic,test_ior_intercept - """ - self.run_il_perf_check('libioil.so') - - def test_ior_intercept_pil4dfs(self): - """Jira ID: DAOS-12142. - - Test Description: - Verify IOR performance with DFUSE + libpil4dfs is similar to DFS. - - Use case: - Run IOR write + read with DFS. - Run IOR write + read with DFUSE + libpil4dfs. 
- Verify performance with DFUSE + libpil4dfs is similar to DFS. - - :avocado: tags=all,daily_regression - :avocado: tags=hw,medium - :avocado: tags=daosio,dfuse,il,ior,ior_intercept,pil4dfs - :avocado: tags=IorInterceptBasic,test_ior_intercept_pil4dfs - """ - self.run_il_perf_check('libpil4dfs.so') diff --git a/src/tests/ftest/ior/intercept_basic.yaml b/src/tests/ftest/ior/intercept_basic.yaml deleted file mode 100644 index 01bf441601f..00000000000 --- a/src/tests/ftest/ior/intercept_basic.yaml +++ /dev/null @@ -1,33 +0,0 @@ -hosts: - test_servers: 1 - test_clients: 1 -timeout: 1000 -server_config: - name: daos_server - engines_per_host: 1 - engines: - 0: - log_mask: INFO - storage: auto -pool: - size: 90% - svcn: 1 -container: - type: POSIX - control_method: daos -ior: - env_vars: - - D_LOG_MASK=INFO - client_processes: - np: 32 - test_file: testFile - repetitions: 3 - sw_deadline: 60 - flags: "-v -w -r -R" - dfs_oclass: 'SX' - transfer_size: '1M' - block_size: '100G' - write_x: 0.08 # Max 8% performance difference. - read_x: 0.08 # Loosely derived from 3% stddev + 5% actual deviation. -dfuse: - disable_caching: true diff --git a/src/tests/ftest/ior/intercept_multi_client.py b/src/tests/ftest/ior/intercept_multi_client.py index 72e94ae4dd7..3a8254d5bc1 100644 --- a/src/tests/ftest/ior/intercept_multi_client.py +++ b/src/tests/ftest/ior/intercept_multi_client.py @@ -14,7 +14,7 @@ class IorInterceptMultiClient(IorInterceptTestBase): :avocado: recursive """ - def test_ior_intercept_multi_client(self): + def test_ior_intercept_libioil(self): """Jira ID: DAOS-3499. Test Description: @@ -28,6 +28,24 @@ def test_ior_intercept_multi_client(self): :avocado: tags=all,full_regression :avocado: tags=hw,large :avocado: tags=daosio,dfuse,il,ior,ior_intercept - :avocado: tags=IorInterceptMultiClient,test_ior_intercept_multi_client + :avocado: tags=IorInterceptMultiClient,test_ior_intercept_libioil """ self.run_il_perf_check('libioil.so') + + def test_ior_intercept_libpil4dfs(self): + """Jira ID: DAOS-12142. + + Test Description: + Verify IOR performance with DFUSE + libpil4dfs is similar to DFS. + + Use case: + Run IOR write + read with DFS. + Run IOR write + read with DFUSE + libpil4dfs. + Verify performance with DFUSE + libpil4dfs is similar to DFS. 
+ + :avocado: tags=all,full_regression + :avocado: tags=hw,large + :avocado: tags=daosio,dfuse,il,ior,ior_intercept,pil4dfs + :avocado: tags=IorInterceptMultiClient,test_ior_intercept_libpil4dfs + """ + self.run_il_perf_check('libpil4dfs.so') diff --git a/src/tests/ftest/ior/intercept_multi_client.yaml b/src/tests/ftest/ior/intercept_multi_client.yaml index 11d6d4d059e..94a0508fbbb 100644 --- a/src/tests/ftest/ior/intercept_multi_client.yaml +++ b/src/tests/ftest/ior/intercept_multi_client.yaml @@ -45,8 +45,6 @@ ior: transfersize: !mux 512B: transfer_size: '512B' - 1K: - transfer_size: '1K' 4K: transfer_size: '4K' 1M: diff --git a/src/tests/ftest/launch.py b/src/tests/ftest/launch.py index 719295155db..8e5234699e8 100755 --- a/src/tests/ftest/launch.py +++ b/src/tests/ftest/launch.py @@ -10,6 +10,7 @@ from collections import OrderedDict, defaultdict from tempfile import TemporaryDirectory import errno +import getpass import json import logging import os @@ -28,6 +29,7 @@ # from util.distro_utils import detect # pylint: disable=import-error,no-name-in-module from process_core_files import CoreFileProcessing, CoreFileException +from slurm_setup import SlurmSetup, SlurmSetupException # Update the path to support utils files that import other utils files sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), "util")) @@ -47,7 +49,6 @@ BULLSEYE_SRC = os.path.join(os.path.dirname(os.path.abspath(__file__)), "test.cov") BULLSEYE_FILE = os.path.join(os.sep, "tmp", "test.cov") -DEFAULT_DAOS_APP_DIR = os.path.join(os.sep, "scratch") DEFAULT_DAOS_TEST_LOG_DIR = os.path.join(os.sep, "var", "tmp", "daos_testing") DEFAULT_DAOS_TEST_USER_DIR = os.path.join(os.sep, "var", "tmp", "daos_testing", "user") DEFAULT_DAOS_TEST_SHARED_DIR = os.path.expanduser(os.path.join("~", "daos_test")) @@ -61,10 +62,14 @@ ("cxi", "ofi+cxi"), ("verbs", "ofi+verbs"), ("ucx", "ucx+dc_x"), - ("tcp", "ofi+tcp"), + ("tcp", "ofi+tcp;ofi_rxm"), ("opx", "ofi+opx"), ] ) +# Temporary pipeline-lib workaround until DAOS-13934 is implemented +PROVIDER_ALIAS = { + "ofi+tcp": "ofi+tcp;ofi_rxm" +} PROCS_TO_CLEANUP = [ "daos_server", "daos_engine", "daos_agent", "cart_ctl", "orterun", "mpirun", "dfuse"] TYPES_TO_UNMOUNT = ["fuse.daos"] @@ -586,15 +591,19 @@ class Launch(): RESULTS_DIRS = ( "daos_configs", "daos_logs", "cart_logs", "daos_dumps", "valgrind_logs", "stacktraces") - def __init__(self, name, mode): + def __init__(self, name, mode, slurm_install, slurm_setup): """Initialize a Launch object. Args: name (str): launch job name mode (str): execution mode, e.g. "normal", "manual", or "ci" + slurm_install (bool): whether or not to install slurm RPMs if needed + slurm_setup (bool): whether or not to enable configuring slurm if needed """ self.name = name self.mode = mode + self.slurm_install = slurm_install + self.slurm_setup = slurm_setup self.avocado = AvocadoInfo() self.class_name = f"FTEST_launch.launch-{self.name.lower().replace('.', '-')}" @@ -604,6 +613,7 @@ def __init__(self, name, mode): self.tag_filters = [] self.repeat = 1 self.local_host = get_local_host() + self.user = getpass.getuser() # Results tracking settings self.job_results_dir = None @@ -619,7 +629,6 @@ def __init__(self, name, mode): # Options for creating slurm partitions self.slurm_control_node = NodeSet() self.slurm_partition_hosts = NodeSet() - self.slurm_add_partition = False def _start_test(self, class_name, test_name, log_file): """Start a new test result. 
@@ -945,7 +954,6 @@ def run(self, args): message = f"Invalid '--slurm_control_node={args.slurm_control_node}' argument" return self.get_exit_status(1, message, "Setup", sys.exc_info()) self.slurm_partition_hosts.add(args.test_clients or args.test_servers) - self.slurm_add_partition = args.slurm_setup # Execute the tests status = self.run_tests( @@ -1062,8 +1070,6 @@ def _set_test_environment(self, servers, clients, list_tests, provider, insecure # Set the default location for daos log files written during testing # if not already defined. - if "DAOS_APP_DIR" not in os.environ: - os.environ["DAOS_APP_DIR"] = DEFAULT_DAOS_APP_DIR if "DAOS_TEST_LOG_DIR" not in os.environ: os.environ["DAOS_TEST_LOG_DIR"] = DEFAULT_DAOS_TEST_LOG_DIR if "DAOS_TEST_USER_DIR" not in os.environ: @@ -1073,6 +1079,9 @@ def _set_test_environment(self, servers, clients, list_tests, provider, insecure os.environ["DAOS_TEST_SHARED_DIR"] = os.path.join(base_dir, "tmp") else: os.environ["DAOS_TEST_SHARED_DIR"] = DEFAULT_DAOS_TEST_SHARED_DIR + if "DAOS_TEST_APP_DIR" not in os.environ: + os.environ["DAOS_TEST_APP_DIR"] = os.path.join( + os.environ["DAOS_TEST_SHARED_DIR"], "daos_test", "apps") os.environ["D_LOG_FILE"] = os.path.join(os.environ["DAOS_TEST_LOG_DIR"], "daos.log") os.environ["D_LOG_FILE_APPEND_PID"] = "1" @@ -1277,6 +1286,7 @@ def _set_provider_environment(self, servers, interface, provider): """ logger.debug("-" * 80) # Use the detected provider if one is not set + provider = PROVIDER_ALIAS.get(provider, provider) if not provider: provider = os.environ.get("CRT_PHY_ADDR_STR") if provider is None: @@ -1808,8 +1818,11 @@ def run_tests(self, sparse, fail_fast, stop_daos, archive, rename, jenkinslog, c # Display the location of the avocado logs logger.info("Avocado job results directory: %s", self.job_results_dir) + # Configure slurm if any tests use partitions + return_code |= self.setup_slurm() + # Configure hosts to collect code coverage - self.setup_bullseye() + return_code |= self.setup_bullseye() # Run each test for as many repetitions as requested for repeat in range(1, self.repeat + 1): @@ -1861,7 +1874,7 @@ def run_tests(self, sparse, fail_fast, stop_daos, archive, rename, jenkinslog, c logger.removeHandler(test_file_handler) # Collect code coverage files after all test have completed - self.finalize_bullseye() + return_code |= self.finalize_bullseye() # Summarize the run return self._summarize_run(return_code) @@ -1933,6 +1946,80 @@ def finalize_bullseye(self): os.rename(old_file, new_file) return status + def setup_slurm(self): + """Set up slurm on the hosts if any tests are using partitions. 
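setup_slurm(), setup_bullseye(), and finalize_bullseye() now feed run_tests()'s return code through bitwise OR, so a later success cannot mask an earlier failure. A toy model of that accumulation (the lambdas stand in for the real setup steps, which return 0 on success and 128 on failure):

    def run_steps(steps):
        """OR together the status of each step, as run_tests() does."""
        status = 0
        for step in steps:
            status |= step()
        return status

    assert run_steps([lambda: 0, lambda: 128, lambda: 0]) == 128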
+ + Returns: + int: status code: 0 = success, 128 = failure + """ + status = 0 + logger.info("Setting up slurm partitions if required by tests") + if not any(test.yaml_info["client_partition"] for test in self.tests): + logger.debug(" No tests using client partitions detected - skipping slurm setup") + return status + + if not self.slurm_setup: + logger.debug(" The 'slurm_setup' argument is not set - skipping slurm setup") + return status + + status |= self.setup_application_directory() + + slurm_setup = SlurmSetup(logger, self.slurm_partition_hosts, self.slurm_control_node, True) + try: + if self.slurm_install: + slurm_setup.install() + slurm_setup.update_config(self.user, 'daos_client') + slurm_setup.start_munge(self.user) + slurm_setup.start_slurm(self.user, True) + except SlurmSetupException: + message = "Error setting up slurm" + self._fail_test(self.result.tests[-1], "Run", message, sys.exc_info()) + status |= 128 + except Exception: # pylint: disable=broad-except + message = "Unknown error setting up slurm" + self._fail_test(self.result.tests[-1], "Run", message, sys.exc_info()) + status |= 128 + + return status + + def setup_application_directory(self): + """Set up the application directory. + + Returns: + int: status code: 0 = success, 128 = failure + """ + app_dir = os.environ.get('DAOS_TEST_APP_DIR') + app_src = os.environ.get('DAOS_TEST_APP_SRC') + + logger.debug("Setting up the '%s' application directory", app_dir) + if not os.path.exists(app_dir): + # Create the apps directory if it does not already exist + try: + logger.debug(' Creating the application directory') + os.makedirs(app_dir) + except OSError: + message = 'Error creating the application directory' + self._fail_test(self.result.tests[-1], 'Run', message, sys.exc_info()) + return 128 + else: + logger.debug(' Using the existing application directory') + + if app_src and os.path.exists(app_src): + logger.debug(" Copying applications from the '%s' directory", app_src) + run_local(logger, f"ls -al '{app_src}'") + for app in os.listdir(app_src): + try: + run_local( + logger, f"cp -r '{os.path.join(app_src, app)}' '{app_dir}'", check=True) + except RunException: + message = 'Error copying files to the application directory' + self._fail_test(self.result.tests[-1], 'Run', message, sys.exc_info()) + return 128 + + logger.debug(" Applications in '%s':", app_dir) + run_local(logger, f"ls -al '{app_dir}'") + return 0 + @staticmethod def display_disk_space(path): """Display disk space of provided path destination. 
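DAOS_TEST_APP_DIR replaces DAOS_APP_DIR and now defaults to a path under DAOS_TEST_SHARED_DIR, with setup_application_directory() populating it from DAOS_TEST_APP_SRC. A standalone sketch of the defaulting; the environment variable names are from the patch, resolve_app_dir() is ours:

    import os

    def resolve_app_dir(environ):
        """Mirror the DAOS_TEST_APP_DIR defaulting in _set_test_environment()."""
        shared = environ.get(
            "DAOS_TEST_SHARED_DIR", os.path.expanduser(os.path.join("~", "daos_test")))
        return environ.get("DAOS_TEST_APP_DIR", os.path.join(shared, "daos_test", "apps"))

    print(resolve_app_dir({"DAOS_TEST_SHARED_DIR": "/scratch/shared"}))
    # -> /scratch/shared/daos_test/apps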
@@ -1999,18 +2086,18 @@ def _setup_host_information(self, test): partition = test.yaml_info["client_partition"] logger.debug("Determining if the %s client partition exists", partition) exists = show_partition(logger, self.slurm_control_node, partition).passed - if not exists and not self.slurm_add_partition: + if not exists and not self.slurm_setup: message = f"Error missing {partition} partition" self._fail_test(self.result.tests[-1], "Prepare", message, None) return 128 - if self.slurm_add_partition and exists: + if self.slurm_setup and exists: logger.info( "Removing existing %s partition to ensure correct configuration", partition) if not delete_partition(logger, self.slurm_control_node, partition).passed: message = f"Error removing existing {partition} partition" self._fail_test(self.result.tests[-1], "Prepare", message, None) return 128 - if self.slurm_add_partition: + if self.slurm_setup: hosts = self.slurm_partition_hosts.difference(test.yaml_info["test_servers"]) logger.debug( "Partition hosts from '%s', excluding test servers '%s': %s", @@ -3130,6 +3217,10 @@ def main(): type=str, help="slurm control node where scontrol commands will be issued to check for the existence " "of any slurm partitions required by the tests") + parser.add_argument( + "-si", "--slurm_install", + action="store_true", + help="enable installing slurm RPMs if required by the tests") parser.add_argument( "--scm_mount", action="store", @@ -3141,7 +3232,7 @@ def main(): parser.add_argument( "-ss", "--slurm_setup", action="store_true", - help="setup any slurm partitions required by the tests") + help="enable setting up slurm partitions if required by the tests") parser.add_argument( "--scm_size", action="store", @@ -3216,11 +3307,12 @@ def main(): args.sparse = True if not args.logs_threshold: args.logs_threshold = DEFAULT_LOGS_THRESHOLD + args.slurm_install = True args.slurm_setup = True args.user_create = True # Setup the Launch object - launch = Launch(args.name, args.mode) + launch = Launch(args.name, args.mode, args.slurm_install, args.slurm_setup) # Perform the steps defined by the arguments specified try: diff --git a/src/tests/ftest/pool/global_handle.py b/src/tests/ftest/pool/global_handle.py deleted file mode 100644 index 306412bdc3f..00000000000 --- a/src/tests/ftest/pool/global_handle.py +++ /dev/null @@ -1,85 +0,0 @@ -''' - (C) Copyright 2018-2023 Intel Corporation. - - SPDX-License-Identifier: BSD-2-Clause-Patent -''' -import traceback - -from pydaos.raw import DaosPool, DaosContainer, DaosApiError - -from apricot import TestWithServers - - -class GlobalHandle(TestWithServers): - """Test the ability to share pool handles among processes. - - :avocado: recursive - """ - - def check_handle(self, buf_len, iov_len, buf, uuidstr, rank): - """Verify that the global handle can be turned into a local handle. - - This gets run in a child process and verifies the global handle can be - turned into a local handle in another process. - - Args: - buf_len (object): buffer length; 1st return value from - DaosPool.local2global() - iov_len (object): iov length; 2nd return value from - DaosPool.local2global() - buf (object): buffer; 3rd return value from DaosPool.local2global() - uuidstr (str): pool UUID - rank (int): pool svc rank - - Raises: - DaosApiError: if there was an error converting the pool handle or - using the local pool handle to create a container. 
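With --slurm_setup, _setup_host_information() now deletes and recreates an existing partition so its host list matches the current test; without it, a missing partition is fatal. Reduced to a standalone decision table (plain booleans stand in for the .passed results of the real show_partition() and delete_partition() helpers):

    def partition_action(exists, slurm_setup):
        """Sketch of the branch structure above; illustrative only."""
        if not exists and not slurm_setup:
            return "fail: partition missing and slurm setup disabled"
        if exists and slurm_setup:
            return "delete stale partition, then recreate it"
        if slurm_setup:
            return "create partition"
        return "use existing partition"

    assert partition_action(True, True) == "delete stale partition, then recreate it"
    assert partition_action(False, False).startswith("fail")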
- - """ - pool = DaosPool(self.context) - pool.set_uuid_str(uuidstr) - pool.set_svc(rank) - - # note that the handle is stored inside the pool as well - pool.global2local(self.context, iov_len, buf_len, buf) - - # perform some operations that will use the new handle - pool.pool_query() - container = DaosContainer(self.context) - container.create(pool.handle) - - def test_global_handle(self): - """Test ID: Jira-XXXX. - - Test Description: Use a pool handle in another process. - - :avocado: tags=all,daily_regression - :avocado: tags=vm - :avocado: tags=pool,global_handle - :avocado: tags=GlobalHandle,test_global_handle - """ - # initialize a python pool object then create the underlying - # daos storage - self.add_pool() - - # create a container just to make sure handle is good - self.add_container(self.pool) - - try: - # create a global handle - iov_len, buf_len, buf = self.pool.pool.local2global() - - # this should work in the future but need on-line server addition - # arg_list = (buf_len, iov_len, buf, pool.get_uuid_str(), 0) - # p = Process(target=check_handle, args=arg_list) - # p.start() - # p.join() - # for now verifying global handle in the same process which is not - # the intended use case - self.check_handle( - buf_len, iov_len, buf, self.pool.pool.get_uuid_str(), 0) - - except DaosApiError as error: - self.log.error(error) - self.log.error(traceback.format_exc()) - self.fail("Expecting to pass but test has failed.\n") diff --git a/src/tests/ftest/pool/global_handle.yaml b/src/tests/ftest/pool/global_handle.yaml deleted file mode 100644 index f38a0d2c013..00000000000 --- a/src/tests/ftest/pool/global_handle.yaml +++ /dev/null @@ -1,20 +0,0 @@ -# change host names to your reserved nodes, the -# required quantity is indicated by the placeholders -hosts: - test_servers: 1 -timeout: 60 -server_config: - name: daos_server - engines_per_host: 1 - engines: - 0: - targets: 4 - nr_xs_helpers: 0 - storage: - 0: - class: ram - scm_mount: /mnt/daos - system_ram_reserved: 1 -pool: - control_method: dmg - scm_size: 1073741824 diff --git a/src/tests/ftest/scripts/main.sh b/src/tests/ftest/scripts/main.sh index 46fdae198d4..3bfb69309ca 100644 --- a/src/tests/ftest/scripts/main.sh +++ b/src/tests/ftest/scripts/main.sh @@ -157,23 +157,6 @@ if ${SETUP_ONLY:-false}; then exit 0 fi -export DAOS_APP_DIR=${DAOS_APP_DIR:-$DAOS_TEST_SHARED_DIR} - -# check if slurm needs to be configured for soak -if [[ "${TEST_TAG_ARG}" =~ soak && "${STAGE_NAME}" =~ Hardware ]]; then - if ! ./slurm_setup.py -d -c "$FIRST_NODE" -n "${TEST_NODES}" -s -i; then - exit "${PIPESTATUS[0]}" - fi - - if ! mkdir -p "${DAOS_APP_DIR}/soak/apps"; then - exit "${PIPESTATUS[0]}" - fi - - if ! cp -r /scratch/soak/apps/* "${DAOS_APP_DIR}/soak/apps/"; then - exit "${PIPESTATUS[0]}" - fi -fi - # need to increase the number of oopen files (on EL8 at least) ulimit -n 4096 @@ -188,6 +171,8 @@ export WITH_VALGRIND export STAGE_NAME export TEST_RPMS export DAOS_BASE +export DAOS_TEST_APP_SRC=${DAOS_TEST_APP_SRC:-"/scratch/daos_test/apps"} +export DAOS_TEST_APP_DIR=${DAOS_TEST_APP_DIR:-"${DAOS_TEST_SHARED_DIR}/daos_test/apps"} launch_node_args="-ts ${TEST_NODES}" if [ "${STAGE_NAME}" == "Functional Hardware 24" ]; then @@ -199,6 +184,7 @@ if [ "${STAGE_NAME}" == "Functional Hardware 24" ]; then client_nodes=$(IFS=','; echo "${test_node_list[*]:8}") launch_node_args="-ts ${server_nodes} -tc ${client_nodes}" fi + # shellcheck disable=SC2086,SC2090 if ! 
./launch.py --mode ci ${launch_node_args} ${LAUNCH_OPT_ARGS} ${TEST_TAG_ARR[*]}; then rc=${PIPESTATUS[0]} diff --git a/src/tests/ftest/slurm_setup.py b/src/tests/ftest/slurm_setup.py index 2b0266604b7..8cfa0869cbd 100755 --- a/src/tests/ftest/slurm_setup.py +++ b/src/tests/ftest/slurm_setup.py @@ -10,279 +10,487 @@ import argparse import getpass import logging +import os import re import socket import sys from ClusterShell.NodeSet import NodeSet -from util.logger_utils import get_console_handler -from util.run_utils import get_clush_command, run_remote +# Update the path to support utils files that import other utils files +sys.path.append(os.path.join(os.path.dirname(os.path.abspath(__file__)), "util")) +# pylint: disable=import-outside-toplevel +from logger_utils import get_console_handler # noqa: E402 +from package_utils import install_packages, remove_packages # noqa: E402 +from run_utils import get_clush_command, run_remote, command_as_user # noqa: E402 # Set up a logger for the console messages logger = logging.getLogger(__name__) logger.setLevel(logging.DEBUG) logger.addHandler(get_console_handler("%(message)s", logging.DEBUG)) -SLURM_CONF = "/etc/slurm/slurm.conf" -EPILOG_FILE = "/var/tmp/epilog_soak.sh" - -PACKAGE_LIST = ["slurm", "slurm-example-configs", - "slurm-slurmctld", "slurm-slurmd"] - -COPY_LIST = ["cp /etc/slurm/slurm.conf.example /etc/slurm/slurm.conf", - "cp /etc/slurm/cgroup.conf.example /etc/slurm/cgroup.conf", - "cp /etc/slurm/slurmdbd.conf.example /etc/slurm/slurmdbd.conf"] - -MUNGE_STARTUP = [ - "chown munge. {0}".format("/etc/munge/munge.key"), - "systemctl restart munge", - "systemctl enable munge"] - -SLURMCTLD_STARTUP = [ - "systemctl restart slurmctld", - "systemctl enable slurmctld"] - -SLURMCTLD_STARTUP_DEBUG = [ - "cat /var/log/slurmctld.log", - "grep -v \"^#\\w\" /etc/slurm/slurm.conf"] - -SLURMD_STARTUP = [ - "systemctl restart slurmd", - "systemctl enable slurmd"] - -SLURMD_STARTUP_DEBUG = [ - "cat /var/log/slurmd.log", - "grep -v \"^#\\w\" /etc/slurm/slurm.conf"] - - -def create_epilog_script(args): - """Create epilog script to run after each job. - - Args: - args (Namespace): command line arguments - - Returns: - int: 0 if command passes; 1 otherwise - - """ - sudo = "sudo" if args.sudo else "" - with open(EPILOG_FILE, 'w') as script_file: - script_file.write("#!/bin/bash\n#\n") - script_file.write("/usr/bin/bash -c 'pkill --signal 9 dfuse'\n") - script_file.write("/usr/bin/bash -c 'for dir in $(find /tmp/daos_dfuse);" - "do fusermount3 -uz $dir;rm -rf $dir; done'\n") - script_file.write("exit 0\n") - command = f"{sudo} chmod 755 {EPILOG_FILE}" - return execute_cluster_cmds(args.control, [command]) - - -def update_config_cmdlist(args): - """Create the command lines to update slurmd.conf file. - - Args: - args (Namespace): command line arguments - - Returns: - cmd_list: list of cmdlines to update config file - - """ - all_nodes = NodeSet("{},{}".format(str(args.control), str(args.nodes))) - if create_epilog_script(args) > 1: - logger.error("%s could not be updated. 
Check if file exists", EPILOG_FILE) - sys.exit(1) - cmd_list = [f"sed -i -e 's/ClusterName=cluster/ClusterName=ci_cluster/g' {SLURM_CONF}", - f"sed -i -e 's/SlurmUser=slurm/SlurmUser={args.user}/g' {SLURM_CONF}", - f"sed -i -e 's/NodeName/#NodeName/g' {SLURM_CONF}", - f"sed -i -e 's#EpilogSlurmctld=#EpilogSlurmctld={EPILOG_FILE}#g' {SLURM_CONF}"] - sudo = "sudo" if args.sudo else "" - # Copy the slurm*example.conf files to /etc/slurm/ - if execute_cluster_cmds(all_nodes, COPY_LIST, args.sudo) > 0: - sys.exit(1) - match = False - # grep SLURM_CONF to determine format of the the file - for ctl_host in ["SlurmctldHost", "ControlMachine"]: - command = r"grep {} {}".format(ctl_host, SLURM_CONF) - if run_remote(logger, all_nodes, command).passed: - ctl_str = "sed -i -e 's/{0}=linux0/{0}={1}/g' {2}".format( - ctl_host, args.control, SLURM_CONF) - cmd_list.insert(0, ctl_str) - match = True - break - if not match: - logger.error("% could not be updated. Check conf file format", SLURM_CONF) - sys.exit(1) - - # This info needs to be gathered from every node that can run a slurm job - command = r"lscpu | grep -E '(Socket|Core|Thread)\(s\)'" - result = run_remote(logger, all_nodes, command) - for data in result.output: - info = { - match[0]: match[1] - for match in re.findall(r"(Socket|Core|Thread).*:\s+(\d+)", "\n".join(data.stdout)) - if len(match) > 1} - - if "Socket" not in info or "Core" not in info or "Thread" not in info: - # Did not find value for socket|core|thread so do not - # include in config file - pass - cmd_list.append("echo \"NodeName={0} Sockets={1} CoresPerSocket={2} " - "ThreadsPerCore={3}\" |{4} tee -a {5}".format( - data.hosts, info["Socket"], info["Core"], info["Thread"], sudo, - SLURM_CONF)) - - # - cmd_list.append("echo \"PartitionName={} Nodes={} Default=YES " - "MaxTime=INFINITE State=UP\" |{} tee -a {}".format( - args.partition, args.nodes, sudo, SLURM_CONF)) - - return execute_cluster_cmds(all_nodes, cmd_list, args.sudo) - - -def execute_cluster_cmds(nodes, cmdlist, sudo=False): - """Execute the list of cmds on hostlist nodes. - - Args: - nodes (NodeSet): nodes on which to execute the commands - cmdlist ([type]): list of cmdlines to execute - sudo (str, optional): Execute cmd with sudo privileges. Defaults to false. - - Returns: - ret_code: returns 0 if all commands passed on all hosts; 1 otherwise - - """ - for cmd in cmdlist: - if sudo: - cmd = "sudo {}".format(cmd) - if not run_remote(logger, nodes, cmd, timeout=600).passed: - # Do not bother executing any remaining commands if this one failed - return 1 - return 0 - - -def configuring_packages(args, action): - """Install required slurm and munge packages. - - Args: - args (Namespace): command line arguments - action (str): 'install' or 'remove' - - """ - # Install packages on control and compute nodes - all_nodes = NodeSet("{},{}".format(str(args.control), str(args.nodes))) - logger.info("%s slurm packages on %s: %s", action, all_nodes, ", ".join(PACKAGE_LIST)) - command = ["dnf", action, "-y"] + PACKAGE_LIST - return execute_cluster_cmds(all_nodes, [" ".join(command)], args.sudo) - - -def start_munge(args): - """Start munge service on all nodes. 
- - Args: - args (Namespace): command line arguments - - """ - sudo = "sudo" if args.sudo else "" - all_nodes = NodeSet("{},{}".format(str(args.control), str(args.nodes))) - # exclude the control node - nodes = NodeSet(str(args.nodes)) - nodes.difference_update(str(args.control)) - - # copy key to all nodes FROM slurmctl node; - # change the protections/ownership on the munge dir on all nodes - cmd_list = [ - "{0} chmod -R 777 /etc/munge; {0} chown {1}. /etc/munge".format( - sudo, args.user)] - if execute_cluster_cmds(all_nodes, cmd_list) > 0: - return 1 - - # Check if file exists on slurm control node - # change the protections/ownership on the munge key before copying - cmd_list = ["set -Eeu", - "rc=0", - "if [ ! -f /etc/munge/munge.key ]", - "then {} create-munge-key".format(sudo), - "fi", - "{} chmod 777 /etc/munge/munge.key".format(sudo), - "{} chown {}. /etc/munge/munge.key".format(sudo, args.user)] - - if execute_cluster_cmds(args.control, ["; ".join(cmd_list)]) > 0: - return 1 - # remove any existing key from other nodes - cmd_list = ["{} rm -f /etc/munge/munge.key".format(sudo)] - if execute_cluster_cmds(nodes, ["; ".join(cmd_list)]) > 0: - return 1 - - # copy munge.key to all hosts - command = get_clush_command( - nodes, args="--copy /etc/munge/munge.key --dest /etc/munge/munge.key") - if execute_cluster_cmds(args.control, [command]) > 0: - return 1 - - # set the protection back to defaults - cmd_list = [ - "{} chmod 400 /etc/munge/munge.key".format(sudo), - "{} chown munge. /etc/munge/munge.key".format(sudo), - "{} chmod 700 /etc/munge".format(sudo), - "{} chown munge. /etc/munge".format(sudo)] - if execute_cluster_cmds(all_nodes, ["; ".join(cmd_list)]) > 0: - return 1 - - # Start Munge service on all nodes - all_nodes = NodeSet("{},{}".format(str(args.control), str(args.nodes))) - return execute_cluster_cmds(all_nodes, MUNGE_STARTUP, args.sudo) - - -def start_slurm(args): - """Start the slurm services on all nodes. - - Args: - args (Namespace): command line arguments - - """ - # Setting up slurm on all nodes - all_nodes = NodeSet("{},{}".format(str(args.control), str(args.nodes))) - cmd_list = [ - "mkdir -p /var/log/slurm", - "chown {}. {}".format(args.user, "/var/log/slurm"), - "mkdir -p /var/spool/slurmd", - "mkdir -p /var/spool/slurmctld", - "mkdir -p /var/spool/slurm/d", - "mkdir -p /var/spool/slurm/ctld", - "chown {}. {}/ctld".format(args.user, "/var/spool/slurm"), - "chown {}. 
{}".format(args.user, "/var/spool/slurmctld"), - "chmod 775 {}".format("/var/spool/slurmctld"), - "rm -f /var/spool/slurmctld/clustername"] - - if execute_cluster_cmds(all_nodes, cmd_list, args.sudo) > 0: - return 1 - - # Startup the slurm control service - status = execute_cluster_cmds(args.control, SLURMCTLD_STARTUP, args.sudo) - if status > 0 or args.debug: - execute_cluster_cmds(args.control, SLURMCTLD_STARTUP_DEBUG, args.sudo) - if status > 0: - return 1 - - # Startup the slurm service - status = execute_cluster_cmds(all_nodes, SLURMD_STARTUP, args.sudo) - if status > 0 or args.debug: - execute_cluster_cmds(all_nodes, SLURMD_STARTUP_DEBUG, args.sudo) - if status > 0: - return 1 - - # ensure that the nodes are in the idle state - cmd_list = ["scontrol update nodename={} state=idle".format(args.nodes)] - status = execute_cluster_cmds(args.nodes, cmd_list, args.sudo) - if status > 0 or args.debug: - cmd_list = SLURMCTLD_STARTUP_DEBUG - execute_cluster_cmds(args.control, cmd_list, args.sudo) - cmd_list = SLURMD_STARTUP_DEBUG - execute_cluster_cmds(all_nodes, cmd_list, args.sudo) - if status > 0: - return 1 - return 0 +class SlurmSetupException(Exception): + """Exception for SlurmSetup class.""" + + +class SlurmSetup(): + """Slurm setup class.""" + + EPILOG_FILE = '/var/tmp/epilog_soak.sh' + EXAMPLE_FILES = [ + '/etc/slurm/slurm.conf.example', + '/etc/slurm/cgroup.conf.example', + '/etc/slurm/slurmdbd.conf.example'] + MUNGE_DIR = '/etc/munge' + MUNGE_KEY = '/etc/munge/munge.key' + PACKAGE_LIST = ['slurm', 'slurm-example-configs', 'slurm-slurmctld', 'slurm-slurmd'] + SLURM_CONF = '/etc/slurm/slurm.conf' + SLURM_LOG_DIR = '/var/log/slurm' + + def __init__(self, log, nodes, control_node, sudo=False): + """Initialize a SlurmSetup object. + + Args: + log (logger): object configured to log messages + nodes (NodeSet): slurm nodes + control_node (NodeSet): slurm control node + sudo (bool, optional): whether or not to use sudo with commands. Defaults to False. + """ + self.log = log + self.nodes = NodeSet(nodes) + self.control = NodeSet(control_node) + self.root = 'root' if sudo else None + + @property + def all_nodes(self): + """Get all the nodes specified in this class. + + Returns: + NodeSet: all the nodes specified in this class + """ + return self.nodes.union(self.control) + + def remove(self): + """Remove slurm packages from the nodes. + + Raises: + SlurmSetupException: if there is a problem removing the packages + """ + self.log.info("Removing slurm packages") + result = remove_packages(self.log, self.all_nodes, self.PACKAGE_LIST, self.root) + if not result.passed: + raise SlurmSetupException(f"Error removing slurm packages on {result.failed_hosts}") + + def install(self): + """Install slurm packages on the nodes. + + Raises: + SlurmSetupException: if there is a problem installing the packages + """ + self.log.info("Installing slurm packages") + result = install_packages(self.log, self.all_nodes, self.PACKAGE_LIST, self.root) + if not result.passed: + raise SlurmSetupException(f"Error installing slurm packages on {result.failed_hosts}") + + def update_config(self, slurm_user, partition): + """Update the slurm config. 
+
+        Args:
+            slurm_user (str): user to define in the slurm config file
+            partition (str): name of the slurm partition to include in the configuration
+
+        Raises:
+            SlurmSetupException: if there is a problem updating the slurm config files
+        """
+        self.log.info("Updating slurm config files")
+
+        # Create the slurm epilog script on the control node
+        self._create_epilog_script(self.EPILOG_FILE)
+
+        # Copy the slurm example.conf files to all nodes
+        for source in self.EXAMPLE_FILES:
+            self._copy_file(self.all_nodes, source, os.path.splitext(source)[0])
+
+        # Update the config file on all hosts
+        self._update_slurm_config(slurm_user, partition)
+
+    def start_munge(self, user):
+        """Start munge.
+
+        Args:
+            user (str): user account to use with munge
+
+        Raises:
+            SlurmSetupException: if there is a problem starting munge
+        """
+        self.log.info("Starting munge")
+
+        # Create a munge key only if one does not already exist
+        result = run_remote(
+            self.log, self.control, command_as_user(f'test -f {self.MUNGE_KEY}', self.root))
+        if not result.passed:
+            # Create a munge key on the control host
+            self.log.debug('Creating a new munge key on %s', self.control)
+            result = run_remote(
+                self.log, self.control, command_as_user('create-munge-key', self.root))
+            if not result.passed:
+                # Try the other possible munge key creation command:
+                result = run_remote(
+                    self.log, self.control, command_as_user('mungekey -c', self.root))
+                if not result.passed:
+                    raise SlurmSetupException(f'Error creating munge key on {result.failed_hosts}')
+
+        # Set up the munge dir file permissions on all hosts
+        self._update_file(self.all_nodes, self.MUNGE_DIR, '777', user)
+
+        # Set up the munge key file permissions on the control host
+        self._update_file(self.control, self.MUNGE_KEY, '777', user)
+
+        # Copy the munge key from the control node to the non-control nodes
+        non_control = self.nodes.difference(self.control)
+        self.log.debug('Copying the munge key to %s', non_control)
+        command = get_clush_command(
+            non_control, args=f"-B -S -v --copy {self.MUNGE_KEY} --dest {self.MUNGE_KEY}")
+        result = run_remote(self.log, self.control, command)
+        if not result.passed:
+            raise SlurmSetupException(f'Error copying the munge key to {non_control}')
+
+        # Reset the munge dir and key permissions to their defaults
+        self._update_file(self.all_nodes, self.MUNGE_KEY, '400', 'munge')
+        self._update_file(self.all_nodes, self.MUNGE_DIR, '700', 'munge')
+
+        # Restart munge on all nodes
+        self._restart_systemctl(self.all_nodes, 'munge')
+
+    def start_slurm(self, user, debug):
+        """Start slurm. 
+ + Args: + user (str): user account to use with slurm + debug (bool): whether or not to display slurm debug + + Raises: + SlurmSetupException: if there is a problem starting slurm + """ + self.log.info("Starting slurm") + + self._mkdir(self.all_nodes, self.SLURM_LOG_DIR) + self._update_file_ownership(self.all_nodes, self.SLURM_LOG_DIR, user) + self._mkdir(self.all_nodes, '/var/spool/slurmd') + self._mkdir(self.all_nodes, '/var/spool/slurmctld') + self._mkdir(self.all_nodes, '/var/spool/slurm/d') + self._mkdir(self.all_nodes, '/var/spool/slurm/ctld') + self._update_file_ownership(self.all_nodes, '/var/spool/slurm/ctld', user) + self._update_file(self.all_nodes, '/var/spool/slurmctld', '775', user) + self._remove_file(self.all_nodes, '/var/spool/slurmctld/clustername') + + # Restart slurmctld on the control node + self._restart_systemctl( + self.control, 'slurmctld', '/var/log/slurmctld.log', self.SLURM_CONF) + + # Restart slurmd on all nodes + self._restart_systemctl(self.all_nodes, 'slurmd', '/var/log/slurmd.log', self.SLURM_CONF) + + # Update nodes to the idle state + command = command_as_user( + f'scontrol update nodename={str(self.nodes)} state=idle', self.root) + result = run_remote(self.log, self.nodes, command) + if not result.passed or debug: + self._display_debug(self.control, '/var/log/slurmctld.log', self.SLURM_CONF) + self._display_debug(self.all_nodes, '/var/log/slurmd.log', self.SLURM_CONF) + if not result.passed: + raise SlurmSetupException(f'Error setting nodes to idle on {self.nodes}') + + def _create_epilog_script(self, script): + """Create epilog script to run after each job. + + Args: + script (str): epilog script name. + + Raises: + SlurmSetupException: if there is a problem creating the epilog script + """ + self.log.debug('Creating the slurm epilog script to run after each job.') + try: + with open(script, 'w') as script_file: + script_file.write('#!/bin/bash\n#\n') + script_file.write('/usr/bin/bash -c \'pkill --signal 9 dfuse\'\n') + script_file.write( + '/usr/bin/bash -c \'for dir in $(find /tmp/daos_dfuse);' + 'do fusermount3 -uz $dir;rm -rf $dir; done\'\n') + script_file.write('exit 0\n') + except IOError as error: + self.log.debug('Error writing %s - verifying file existence:', script) + run_remote(self.log, self.control, f'ls -al {script}') + raise SlurmSetupException(f'Error writing slurm epilog script {script}') from error + + command = command_as_user(f'chmod 755 {script}', self.root) + if not run_remote(self.log, self.control, command).passed: + raise SlurmSetupException(f'Error setting slurm epilog script {script} permissions') + + def _copy_file(self, nodes, source, destination): + """Copy the source file to the destination on all the nodes. + + Args: + nodes (NodeSet): nodes on which to copy the files + source (str): file to copy + destination (str): where to copy the file + + Raises: + SlurmSetupException: if there is an error copying the file on any host + """ + self.log.debug(f'Copying the {source} file to {destination} on {str(nodes)}') + command = command_as_user(f'cp {source} {destination}', self.root) + result = run_remote(self.log, nodes, command) + if not result.passed: + raise SlurmSetupException( + f'Error copying {source} to {destination} on {str(result.failed_hosts)}') + + def _update_slurm_config(self, slurm_user, partition): + """Update the slurm config file. 
+
+        Args:
+            slurm_user (str): user to define in the slurm config file
+            partition (str): name of the slurm partition to include in the configuration
+
+        Raises:
+            SlurmSetupException: if there is a problem modifying the slurm config file
+        """
+        # Update the config file with the slurm cluster name
+        self._modify_slurm_config_file(
+            'slurm cluster name', self.all_nodes, 's/ClusterName=cluster/ClusterName=ci_cluster/g',
+            self.root)
+
+        # Update the config file with the slurm user
+        self._modify_slurm_config_file(
+            'slurm user', self.all_nodes, f's/SlurmUser=slurm/SlurmUser={slurm_user}/g',
+            self.root)
+
+        # Update the config file with the removal of the NodeName entry
+        self._modify_slurm_config_file(
+            'node name', self.all_nodes, 's/NodeName/#NodeName/g', self.root)
+
+        # Update the config file with the slurm epilog file
+        self._modify_slurm_config_file(
+            'epilog file', self.all_nodes,
+            f's#EpilogSlurmctld=#EpilogSlurmctld={self.EPILOG_FILE}#g', self.root)
+
+        # Update the config file with the slurm control node
+        not_updated = self.all_nodes.copy()
+        for control_keyword in ['SlurmctldHost', 'ControlMachine']:
+            command = f'grep {control_keyword} {self.SLURM_CONF}'
+            results = run_remote(self.log, self.all_nodes, command)
+            if results.passed_hosts:
+                not_updated.remove(
+                    self._modify_slurm_config_file(
+                        'slurm control node', results.passed_hosts,
+                        f's/{control_keyword}=linux0/{control_keyword}={str(self.control)}/g',
+                        self.root))
+        if not_updated:
+            raise SlurmSetupException(f'Slurm control node not updated on {not_updated}')
+
+        # Update the config file with each node's socket/core/thread information
+        self._update_slurm_config_sys_info()
+
+        # Update the config file with the partition information
+        self._update_slurm_config_partitions(partition)
+
+    def _modify_slurm_config_file(self, description, hosts, replacement, user=None):
+        """Replace text in the slurm configuration file.
+
+        Args:
+            description (str): what is being modified in the slurm config file
+            hosts (NodeSet): hosts on which to modify the slurm config file
+            replacement (str): the sed substitution expression to apply
+            user (str, optional): user to use when running the sed command. Defaults to None.
+
+        Raises:
+            SlurmSetupException: if there is a problem modifying the slurm config file
+
+        Returns:
+            NodeSet: hosts on which the command succeeded
+        """
+        self.log.debug(
+            'Updating the %s in the %s config file on %s', description, self.SLURM_CONF, hosts)
+        command = command_as_user(f'sed -i -e \'{replacement}\' {self.SLURM_CONF}', user)
+        result = run_remote(self.log, hosts, command)
+        if result.failed_hosts:
+            raise SlurmSetupException(
+                f'Error updating {description} in the {self.SLURM_CONF} config '
+                f'file on {result.failed_hosts}')
+        return result.passed_hosts
+
+    def _update_slurm_config_sys_info(self):
+        """Update the slurm config files with hosts socket/core/thread information. 
+
+        Raises:
+            SlurmSetupException: if there is a problem updating the slurm config file
+        """
+        self.log.debug('Updating slurm config socket/core/thread information on %s', self.all_nodes)
+        command = r"lscpu | grep -E '(Socket|Core|Thread)\(s\)'"
+        result = run_remote(self.log, self.all_nodes, command)
+        for data in result.output:
+            info = {
+                match[0]: match[1]
+                for match in re.findall(r"(Socket|Core|Thread).*:\s+(\d+)", "\n".join(data.stdout))
+                if len(match) > 1}
+
+            if "Socket" in info and "Core" in info and "Thread" in info:
+                echo_command = (f'echo \"NodeName={data.hosts} Sockets={info["Socket"]} '
+                                f'CoresPerSocket={info["Core"]} ThreadsPerCore={info["Thread"]}\"')
+                mod_result = self._append_config_file(echo_command)
+                if mod_result.failed_hosts:
+                    raise SlurmSetupException(
+                        'Error updating socket/core/thread information on '
+                        f'{mod_result.failed_hosts}')
+
+    def _update_slurm_config_partitions(self, partition):
+        """Update the slurm config files with hosts partition information.
+
+        Args:
+            partition (str): name of the slurm partition to include in the configuration
+
+        Raises:
+            SlurmSetupException: if there is a problem updating the slurm config file
+        """
+        self.log.debug('Updating slurm config partition information on %s', self.all_nodes)
+        echo_command = (
+            f'echo \"PartitionName={partition} Nodes={self.nodes} Default=YES MaxTime=INFINITE '
+            'State=UP\"')
+        mod_result = self._append_config_file(echo_command)
+        if mod_result.failed_hosts:
+            raise SlurmSetupException(
+                f'Error updating partition information on {mod_result.failed_hosts}')
+
+    def _append_config_file(self, echo_command):
+        """Append data to the config file.
+
+        Args:
+            echo_command (str): command adding contents to the config file
+
+        Returns:
+            RemoteCommandResult: the result from the echo | tee command
+        """
+        tee_command = command_as_user(f'tee -a {self.SLURM_CONF}', self.root)
+        return run_remote(self.log, self.all_nodes, f'{echo_command} | {tee_command}')
+
+    def _update_file(self, nodes, file, permission, user):
+        """Update file permissions and ownership.
+
+        Args:
+            nodes (NodeSet): nodes on which to update the file permissions/ownership
+            file (str): file whose permissions/ownership will be updated
+            permission (str): file permission to set
+            user (str): user to have ownership of the file
+
+        Raises:
+            SlurmSetupException: if there was an error updating the file permissions/ownership
+        """
+        self._update_file_permissions(nodes, file, permission)
+        self._update_file_ownership(nodes, file, user)
+
+    def _update_file_permissions(self, nodes, file, permission):
+        """Update the file permissions.
+
+        Args:
+            nodes (NodeSet): nodes on which to update the file permissions
+            file (str): file whose permissions will be updated
+            permission (str): file permission to set
+
+        Raises:
+            SlurmSetupException: if there was an error updating the file permissions
+        """
+        self.log.debug('Updating file permissions for %s on %s', file, nodes)
+        result = run_remote(
+            self.log, nodes, command_as_user(f'chmod -R {permission} {file}', self.root))
+        if not result.passed:
+            raise SlurmSetupException(
+                f'Error updating permissions to {permission} for {file} on {result.failed_hosts}')
+
+    def _update_file_ownership(self, nodes, file, user):
+        """Update the file ownership. 
+
+        Args:
+            nodes (NodeSet): nodes on which to update the file ownership
+            file (str): file whose ownership will be updated
+            user (str): user to have ownership of the file
+
+        Raises:
+            SlurmSetupException: if there was an error updating the file ownership
+        """
+        result = run_remote(self.log, nodes, command_as_user(f'chown {user}. {file}', self.root))
+        if not result.passed:
+            raise SlurmSetupException(
+                f'Error updating ownership to {user} for {file} on {result.failed_hosts}')
+
+    def _remove_file(self, nodes, file):
+        """Remove a file.
+
+        Args:
+            nodes (NodeSet): nodes on which to remove the file
+            file (str): file to remove
+
+        Raises:
+            SlurmSetupException: if there was an error removing the file
+        """
+        self.log.debug('Removing %s on %s', file, nodes)
+        result = run_remote(self.log, nodes, command_as_user(f'rm -fr {file}', self.root))
+        if not result.passed:
+            raise SlurmSetupException(f'Error removing {file} on {result.failed_hosts}')
+
+    def _restart_systemctl(self, nodes, service, debug_log=None, debug_config=None):
+        """Restart the systemctl service.
+
+        Args:
+            nodes (NodeSet): nodes on which to restart the systemctl service
+            service (str): systemctl service to restart/enable
+            debug_log (str, optional): log file to display if there is a problem restarting
+            debug_config (str, optional): config file to display if there is a problem restarting
+
+        Raises:
+            SlurmSetupException: if there is a problem restarting the systemctl service
+        """
+        self.log.debug('Restarting %s on %s', service, nodes)
+        for action in ('restart', 'enable'):
+            command = command_as_user(f'systemctl {action} {service}', self.root)
+            result = run_remote(self.log, nodes, command)
+            if not result.passed:
+                self._display_debug(result.failed_hosts, debug_log, debug_config)
+                raise SlurmSetupException(f'Error restarting {service} on {result.failed_hosts}')
+
+    def _display_debug(self, nodes, debug_log=None, debug_config=None):
+        """Display debug information.
+
+        Args:
+            nodes (NodeSet): nodes on which to display the debug information
+            debug_log (str, optional): log file to display. Defaults to None.
+            debug_config (str, optional): config file to display. Defaults to None.
+        """
+        if debug_log:
+            self.log.debug('DEBUG: %s contents:', debug_log)
+            command = command_as_user(f'cat {debug_log}', self.root)
+            run_remote(self.log, nodes, command)
+        if debug_config:
+            self.log.debug('DEBUG: %s contents:', debug_config)
+            command = command_as_user(f'grep -v \"^#\\w\" {debug_config}', self.root)
+            run_remote(self.log, nodes, command)
+
+    def _mkdir(self, nodes, directory):
+        """Create a directory. 
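Taken together, the public methods above replace the old function-per-step flow in slurm_setup.py. A hedged end-to-end example mirroring how launch.py's setup_slurm() drives the class; the host names and logger are placeholders, and passwordless sudo plus reachable slurm RPM repositories are assumed:

    import logging
    from ClusterShell.NodeSet import NodeSet
    from slurm_setup import SlurmSetup, SlurmSetupException

    logger = logging.getLogger(__name__)

    # "node[1-4]" and "node1" are placeholder hosts; sudo=True mirrors launch.py.
    setup = SlurmSetup(logger, NodeSet("node[1-4]"), NodeSet("node1"), True)
    try:
        setup.install()
        setup.update_config("daos_user", "daos_client")
        setup.start_munge("daos_user")
        setup.start_slurm("daos_user", True)
    except SlurmSetupException as error:
        logger.error(str(error))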
+ + Args: + nodes (NodeSet): nodes on which to create the directory + directory (str): directory to create + + Raises: + SlurmSetupException: if there was an error creating the directory + """ + self.log.debug('Creating %s on %s', directory, nodes) + result = run_remote(self.log, nodes, command_as_user(f'mkdir -p {directory}', self.root)) + if not result.passed: + raise SlurmSetupException(f'Error creating {directory} on {result.failed_hosts}') def main(): @@ -330,36 +538,44 @@ def main(): logger.error("slurm_nodes: Specify at least one slurm node") sys.exit(1) - # Convert control node and slurm node list into NodeSets - args.control = NodeSet(args.control) - args.nodes = NodeSet(args.nodes) + slurm_setup = SlurmSetup(logger, args.nodes, args.control, args.sudo) # Remove packages if specified with --remove and then exit if args.remove: - ret_code = configuring_packages(args, "remove") - if ret_code > 0: + try: + slurm_setup.remove() + sys.exit(0) + except SlurmSetupException as error: + logger.error(str(error)) sys.exit(1) - sys.exit(0) # Install packages if specified with --install and continue with setup if args.install: - ret_code = configuring_packages(args, "install") - if ret_code > 0: + try: + slurm_setup.install() + except SlurmSetupException as error: + logger.error(str(error)) sys.exit(1) # Edit the slurm conf files - ret_code = update_config_cmdlist(args) - if ret_code > 0: + try: + slurm_setup.update_config(args.user, args.partition) + except SlurmSetupException as error: + logger.error(str(error)) sys.exit(1) # Munge Setup - ret_code = start_munge(args) - if ret_code > 0: + try: + slurm_setup.start_munge(args.user) + except SlurmSetupException as error: + logger.error(str(error)) sys.exit(1) # Slurm Startup - ret_code = start_slurm(args) - if ret_code > 0: + try: + slurm_setup.start_slurm(args.user, args.debug) + except SlurmSetupException as error: + logger.error(str(error)) sys.exit(1) sys.exit(0) diff --git a/src/tests/ftest/soak/smoke.yaml b/src/tests/ftest/soak/smoke.yaml index 0cee0ef4ccd..85e73ab55ff 100644 --- a/src/tests/ftest/soak/smoke.yaml +++ b/src/tests/ftest/soak/smoke.yaml @@ -164,7 +164,7 @@ vpic_smoke: - 1 taskspernode: - 1 - cmdline: "${DAOS_APP_DIR}/soak/apps/vpic-install/bin/harris.Linux" + cmdline: "${DAOS_TEST_APP_DIR}/vpic-install/bin/harris.Linux" api: - POSIX - POSIX-LIBIOIL @@ -183,7 +183,7 @@ lammps_smoke: - 1 taskspernode: - 1 - cmdline: "${DAOS_APP_DIR}/soak/apps/lammps/src/lmp_mpi -i ${DAOS_APP_DIR}/soak/apps/lammps/bench/in.lj.smoke" + cmdline: "${DAOS_TEST_APP_DIR}/lammps/src/lmp_mpi -i ${DAOS_TEST_APP_DIR}/lammps/bench/in.lj.smoke" api: - POSIX - POSIX-LIBIOIL diff --git a/src/tests/ftest/soak/stress.yaml b/src/tests/ftest/soak/stress.yaml index 736bfe46936..ea86426f6b0 100644 --- a/src/tests/ftest/soak/stress.yaml +++ b/src/tests/ftest/soak/stress.yaml @@ -185,7 +185,7 @@ vpic_stress: - 1 taskspernode: - 32 - cmdline: "${DAOS_APP_DIR}/soak/apps/vpic-install/bin/harris.Linux" + cmdline: "${DAOS_TEST_APP_DIR}/vpic-install/bin/harris.Linux" api: - POSIX - POSIX-LIBIOIL @@ -202,7 +202,7 @@ lammps_stress: - 8 taskspernode: - 32 - cmdline: "${DAOS_APP_DIR}/soak/apps/lammps/src/lmp_mpi -i ${DAOS_APP_DIR}/soak/apps/lammps/bench/in.lj" + cmdline: "${DAOS_TEST_APP_DIR}/lammps/src/lmp_mpi -i ${DAOS_TEST_APP_DIR}/lammps/bench/in.lj" api: - POSIX - POSIX-LIBIOIL diff --git a/src/tests/ftest/util/ior_intercept_test_base.py b/src/tests/ftest/util/ior_intercept_test_base.py index 69993b7aae0..7cec0fd32db 100644 --- 
a/src/tests/ftest/util/ior_intercept_test_base.py +++ b/src/tests/ftest/util/ior_intercept_test_base.py @@ -57,6 +57,7 @@ def run_il_perf_check(self, libname): # Log some params for debugging. server_provider = self.server_managers[0].get_config_value("provider") self.log.info("Provider: %s", server_provider) + self.log.info("Library: %s", libname) self.log.info("Servers: %s", self.hostlist_servers) self.log.info("Clients: %s", self.hostlist_clients) self.log.info("PPN: %s", self.ppn) diff --git a/src/tests/ftest/util/package_utils.py b/src/tests/ftest/util/package_utils.py new file mode 100644 index 00000000000..bbc5f549ecb --- /dev/null +++ b/src/tests/ftest/util/package_utils.py @@ -0,0 +1,64 @@ +""" +(C) Copyright 2023 Intel Corporation. + +SPDX-License-Identifier: BSD-2-Clause-Patent +""" + +from run_utils import run_remote, command_as_user + + +def find_packages(log, hosts, pattern, user=None): + """Get the installed packages on each specified host. + + Args: + log (logger): logger for the messages produced by this method + hosts (NodeSet): hosts on which to search for installed packages + pattern (str): grep pattern to use to search for installed packages + user (str, optional): user account to use to run the search command. Defaults to None. + + Returns: + dict: a dictionary of host keys with a list of installed RPM values + """ + installed = {} + command = command_as_user(f"rpm -qa | grep -E {pattern} | sort -n", user) + result = run_remote(log, hosts, command) + for data in result.output: + if data.passed: + installed[str(data.hosts)] = data.stdout or [] + return installed + + +def install_packages(log, hosts, packages, user=None, timeout=600): + """Install the packages on the hosts. + + Args: + log (logger): logger for the messages produced by this method + hosts (NodeSet): hosts on which to install the packages + packages (list): a list of packages to install + user (str, optional): user to use when installing the packages. Defaults to None. + timeout (int, optional): timeout for the dnf install command. Defaults to 600. + + Returns: + RemoteCommandResult: the 'dnf install' command results + """ + log.info('Installing packages on %s: %s', hosts, ', '.join(packages)) + command = command_as_user(' '.join(['dnf', 'install', '-y'] + packages), user) + return run_remote(log, hosts, command, timeout=timeout) + + +def remove_packages(log, hosts, packages, user=None, timeout=600): + """Remove the packages on the hosts. + + Args: + log (logger): logger for the messages produced by this method + hosts (NodeSet): hosts on which to remove the packages + packages (list): a list of packages to remove + user (str, optional): user to use when removing the packages. Defaults to None. + timeout (int, optional): timeout for the dnf remove command. Defaults to 600. 
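A short usage example for the new package helpers; the hosts and logger are placeholders, and passing 'root' routes dnf through sudo via command_as_user():

    import logging
    from ClusterShell.NodeSet import NodeSet
    from package_utils import install_packages

    log = logging.getLogger(__name__)

    # Package list matches part of SlurmSetup.PACKAGE_LIST; hosts are placeholders.
    result = install_packages(log, NodeSet("node[1-4]"), ["slurm", "slurm-slurmd"], "root")
    if not result.passed:
        log.error("Install failed on %s", result.failed_hosts)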
+
+    Returns:
+        RemoteCommandResult: the 'dnf remove' command results
+    """
+    log.info('Removing packages on %s: %s', hosts, ', '.join(packages))
+    command = command_as_user(' '.join(['dnf', 'remove', '-y'] + packages), user)
+    return run_remote(log, hosts, command, timeout=timeout)
diff --git a/src/tests/ftest/util/run_utils.py b/src/tests/ftest/util/run_utils.py
index b1cd471756c..730dbb3bbd1 100644
--- a/src/tests/ftest/util/run_utils.py
+++ b/src/tests/ftest/util/run_utils.py
@@ -15,38 +15,64 @@ class RunException(Exception):
     """Base exception for this module."""
 
 
-class RemoteCommandResult():
-    """Stores the command result from a Task object."""
+class ResultData():
+    # pylint: disable=too-few-public-methods
+    """Command result data for the set of hosts."""
+
+    def __init__(self, command, returncode, hosts, stdout, stderr, timeout):
+        """Initialize a ResultData object.
+
+        Args:
+            command (str): the executed command
+            returncode (int): the return code of the executed command
+            hosts (NodeSet): the host(s) on which the executed command yielded this result
+            stdout (list): the stdout of the executed command split by newlines
+            stderr (list): the stderr of the executed command split by newlines
+            timeout (bool): indicator for a command timeout
+        """
+        self.command = command
+        self.returncode = returncode
+        self.hosts = hosts
+        self.stdout = stdout
+        self.stderr = stderr
+        self.timeout = timeout
 
-    class ResultData():
-        # pylint: disable=too-few-public-methods
-        """Command result data for the set of hosts."""
+    def __lt__(self, other):
+        """Determine if another ResultData object is less than this one.
 
-        def __init__(self, command, returncode, hosts, stdout, timeout):
-            """Initialize a ResultData object.
+        Args:
+            other (ResultData): the other ResultData object to compare
 
-            Args:
-                command (str): the executed command
-                returncode (int): the return code of the executed command
-                hosts (NodeSet): the host(s) on which the executed command yielded this result
-                stdout (list): the result of the executed command split by newlines
-                timeout (bool): indicator for a command timeout
-            """
-            self.command = command
-            self.returncode = returncode
-            self.hosts = hosts
-            self.stdout = stdout
-            self.timeout = timeout
+        Returns:
+            bool: True if this object is less than the other ResultData object; False otherwise
+        """
+        if not isinstance(other, ResultData):
+            raise NotImplementedError
+        return str(self.hosts) < str(other.hosts)
 
-        @property
-        def passed(self):
-            """Did the command pass.
+    def __gt__(self, other):
+        """Determine if another ResultData object is greater than this one.
 
-            Returns:
-                bool: if the command was successful
+        Args:
+            other (ResultData): the other ResultData object to compare
 
-            """
-            return self.returncode == 0
+        Returns:
+            bool: True if this object is greater than the other ResultData object; False otherwise
+        """
+        return not self.__lt__(other)
+
+    @property
+    def passed(self):
+        """Did the command pass.
+
+        Returns:
+            bool: if the command was successful
+
+        """
+        return self.returncode == 0
+
+
+class RemoteCommandResult():
+    """Stores the command result from a Task object."""
 
     def __init__(self, command, task):
         """Create a RemoteCommandResult object.
@@ -122,6 +148,19 @@ def all_stdout(self):
             stdout[str(data.hosts)] = '\n'.join(data.stdout)
         return stdout
 
+    @property
+    def all_stderr(self):
+        """Get all of the stderr from the issued command from each host. 
+
+        Returns:
+            dict: the stderr (the values) from each set of hosts (the keys, as a str of the NodeSet)
+
+        """
+        stderr = {}
+        for data in self.output:
+            stderr[str(data.hosts)] = '\n'.join(data.stderr)
+        return stderr
+
     def _process_task(self, task, command):
         """Populate the output list and determine the passed result for the specified task.
 
@@ -137,23 +176,67 @@ def _process_task(self, task, command):
 
         # Populate the a list of unique output for each NodeSet
         for code in sorted(results):
-            output_data = list(task.iter_buffers(results[code]))
-            if not output_data:
-                output_data = [["", results[code]]]
-            for output, output_hosts in output_data:
+            stdout_data = self._sanitize_iter_data(
+                results[code], list(task.iter_buffers(results[code])), '')
+
+            for stdout_raw, stdout_hosts in stdout_data:
                 # In run_remote(), task.run() is executed with the stderr=False default.
                 # As a result task.iter_buffers() will return combined stdout and stderr.
-                stdout = []
-                for line in output.splitlines():
-                    if isinstance(line, bytes):
-                        stdout.append(line.decode("utf-8"))
-                    else:
-                        stdout.append(line)
-                self.output.append(
-                    self.ResultData(command, code, NodeSet.fromlist(output_hosts), stdout, False))
+                stdout = self._msg_tree_elem_to_list(stdout_raw)
+                stderr_data = self._sanitize_iter_data(
+                    stdout_hosts, list(task.iter_errors(stdout_hosts)), '')
+                for stderr_raw, stderr_hosts in stderr_data:
+                    stderr = self._msg_tree_elem_to_list(stderr_raw)
+                    self.output.append(
+                        ResultData(
+                            command, code, NodeSet.fromlist(stderr_hosts), stdout, stderr, False))
         if timed_out:
             self.output.append(
-                self.ResultData(command, 124, NodeSet.fromlist(timed_out), None, True))
+                ResultData(command, 124, NodeSet.fromlist(timed_out), None, None, True))
+
+    @staticmethod
+    def _sanitize_iter_data(hosts, data, default_entry):
+        """Ensure the data generated from an iter function has entries for each host.
+
+        Args:
+            hosts (list): list of hosts which generated the data
+            data (list): data from an iter function as a list
+            default_entry (object): entry to add to data for missing hosts in data
+
+        Returns:
+            list: a list of tuples of entries and list of hosts
+        """
+        if not data:
+            return [(default_entry, hosts)]
+
+        source_keys = NodeSet.fromlist(hosts)
+        data_keys = NodeSet()
+        for _, keys in data:
+            data_keys.add(NodeSet.fromlist(keys))
+
+        sanitized_data = data.copy()
+        missing_keys = source_keys - data_keys
+        if missing_keys:
+            sanitized_data.append((default_entry, list(missing_keys)))
+        return sanitized_data
+
+    @staticmethod
+    def _msg_tree_elem_to_list(msg_tree_elem):
+        """Convert a ClusterShell.MsgTree.MsgTreeElem to a list of strings.
+
+        Args:
+            msg_tree_elem (MsgTreeElem): output from Task.iter_* method.
+
+        Returns:
+            list: list of strings
+        """
+        msg_tree_elem_list = []
+        for line in msg_tree_elem.splitlines():
+            if isinstance(line, bytes):
+                msg_tree_elem_list.append(line.decode("utf-8"))
+            else:
+                msg_tree_elem_list.append(line)
+        return msg_tree_elem_list
 
     def log_output(self, log):
         """Log the command result. 
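With the streams separated, callers opt in via run_remote(..., stderr=True) and read per-host stderr either from each ResultData or from the aggregated all_stderr property. A hedged usage sketch (the hosts and logger are placeholders):

    import logging
    from ClusterShell.NodeSet import NodeSet
    from run_utils import run_remote

    log = logging.getLogger(__name__)

    # stderr=True enables the stdout/stderr separation added by this patch.
    result = run_remote(log, NodeSet("node[1-4]"), "ls /nonexistent", stderr=True)
    for data in sorted(result.output):  # sorted() works via ResultData.__lt__
        log.info("%s rc=%s stderr=%s", data.hosts, data.returncode, data.stderr)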
@@ -174,14 +257,21 @@ def log_result_data(log, data): data (ResultData): command result common to a set of hosts """ info = " timed out" if data.timeout else "" - if not data.stdout: + if not data.stdout and not data.stderr: log.debug(" %s (rc=%s)%s: ", str(data.hosts), data.returncode, info) - elif len(data.stdout) == 1: + elif data.stdout and len(data.stdout) == 1 and not data.stderr: log.debug(" %s (rc=%s)%s: %s", str(data.hosts), data.returncode, info, data.stdout[0]) else: log.debug(" %s (rc=%s)%s:", str(data.hosts), data.returncode, info) + indent = 6 if data.stderr else 4 + if data.stdout and data.stderr: + log.debug(" :") for line in data.stdout: - log.debug(" %s", line) + log.debug("%s%s", " " * indent, line) + if data.stderr: + log.debug(" :") + for line in data.stderr: + log.debug("%s%s", " " * indent, line) def get_clush_command(hosts, args=None, command="", command_env=None, command_sudo=False): @@ -289,7 +379,7 @@ def run_local(log, command, capture_output=True, timeout=None, check=False, verb return result -def run_remote(log, hosts, command, verbose=True, timeout=120, task_debug=False): +def run_remote(log, hosts, command, verbose=True, timeout=120, task_debug=False, stderr=False): """Run the command on the remote hosts. Args: @@ -300,6 +390,7 @@ def run_remote(log, hosts, command, verbose=True, timeout=120, task_debug=False) timeout (int, optional): number of seconds to wait for the command to complete. Defaults to 120 seconds. task_debug (bool, optional): whether to enable debug for the task object. Defaults to False. + stderr (bool, optional): whether to enable stdout/stderr separation. Defaults to False. Returns: RemoteCommandResult: a grouping of the command results from the same hosts with the same @@ -307,8 +398,8 @@ def run_remote(log, hosts, command, verbose=True, timeout=120, task_debug=False) """ task = task_self() - if task_debug: - task.set_info('debug', True) + task.set_info('debug', task_debug) + task.set_default("stderr", stderr) # Enable forwarding of the ssh authentication agent connection task.set_info("ssh_options", "-oForwardAgent=yes") if verbose: diff --git a/src/tests/ftest/util/server_utils_params.py b/src/tests/ftest/util/server_utils_params.py index 443a0be0d5c..8f456706c04 100644 --- a/src/tests/ftest/util/server_utils_params.py +++ b/src/tests/ftest/util/server_utils_params.py @@ -437,7 +437,7 @@ class EngineYamlParameters(YamlParameters): "common": [ "D_LOG_FILE_APPEND_PID=1", "COVFILE=/tmp/test.cov"], - "ofi+tcp": [], + "ofi+tcp;ofi_rxm": [], "ofi+verbs": [ "FI_OFI_RXM_USE_SRX=1"], "ofi+cxi": [ @@ -459,7 +459,7 @@ def __init__(self, base_namespace, index, provider=None, max_storage_tiers=MAX_S namespace = [os.sep] + base_namespace.split(os.sep)[1:-1] + ["engines", str(index), "*"] self._base_namespace = base_namespace self._index = index - self._provider = provider or os.environ.get("CRT_PHY_ADDR_STR", "ofi+tcp") + self._provider = provider or os.environ.get("CRT_PHY_ADDR_STR", "ofi+tcp;ofi_rxm") self._max_storage_tiers = max_storage_tiers super().__init__(os.path.join(*namespace)) diff --git a/src/tests/suite/daos_capa.c b/src/tests/suite/daos_capa.c index 481017dcf0a..c0b650d272c 100644 --- a/src/tests/suite/daos_capa.c +++ b/src/tests/suite/daos_capa.c @@ -501,6 +501,8 @@ update_ro(void **state) rc = daos_obj_close(oh, NULL); assert_rc_equal(rc, 0); + par_barrier(PAR_COMM_WORLD); + /** close container handle */ rc = daos_cont_close(coh, NULL); assert_rc_equal(rc, 0); diff --git a/src/tests/suite/daos_kv.c b/src/tests/suite/daos_kv.c index 
701c5cc2ede..788caf1e73c 100644 --- a/src/tests/suite/daos_kv.c +++ b/src/tests/suite/daos_kv.c @@ -287,43 +287,59 @@ kv_cond_ops(void **state) val_out = 5; size = sizeof(int); print_message("Conditional FETCH of non existent Key(should fail)\n"); - rc = daos_kv_get(oh, DAOS_TX_NONE, DAOS_COND_KEY_GET, "Key2", - &size, &val_out, NULL); + rc = daos_kv_get(oh, DAOS_TX_NONE, DAOS_COND_KEY_GET, "Key2", &size, &val_out, NULL); assert_rc_equal(rc, -DER_NONEXIST); assert_int_equal(val_out, 5); val = 1; print_message("Conditional UPDATE of non existent Key(should fail)\n"); - rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_UPDATE, "Key1", - sizeof(int), &val, NULL); + rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_UPDATE, "Key1", sizeof(int), &val, NULL); assert_rc_equal(rc, -DER_NONEXIST); print_message("Conditional INSERT of non existent Key\n"); - rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_INSERT, "Key1", - sizeof(int), &val, NULL); + rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_INSERT, "Key1", sizeof(int), &val, NULL); assert_rc_equal(rc, 0); val = 2; print_message("Conditional INSERT of existing Key (Should fail)\n"); - rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_INSERT, "Key1", - sizeof(int), &val, NULL); + rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_INSERT, "Key1", sizeof(int), &val, NULL); assert_rc_equal(rc, -DER_EXIST); size = sizeof(int); print_message("Conditional FETCH of existing Key\n"); - rc = daos_kv_get(oh, DAOS_TX_NONE, DAOS_COND_KEY_GET, "Key1", - &size, &val_out, NULL); + rc = daos_kv_get(oh, DAOS_TX_NONE, DAOS_COND_KEY_GET, "Key1", &size, &val_out, NULL); assert_rc_equal(rc, 0); assert_int_equal(val_out, 1); print_message("Conditional Remove non existing Key (should fail)\n"); - rc = daos_kv_remove(oh, DAOS_TX_NONE, DAOS_COND_KEY_REMOVE, "Key2", - NULL); + rc = daos_kv_remove(oh, DAOS_TX_NONE, DAOS_COND_KEY_REMOVE, "Key2", NULL); assert_rc_equal(rc, -DER_NONEXIST); print_message("Conditional Remove existing Key\n"); - rc = daos_kv_remove(oh, DAOS_TX_NONE, DAOS_COND_KEY_REMOVE, "Key1", - NULL); + rc = daos_kv_remove(oh, DAOS_TX_NONE, DAOS_COND_KEY_REMOVE, "Key1", NULL); + assert_rc_equal(rc, 0); + + print_message("Conditional INSERT of Key with no value\n"); + rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_INSERT, "Empty_Key", 0, NULL, NULL); + assert_rc_equal(rc, 0); + + print_message("Conditional INSERT of existing (but empty) Key (should fail)\n"); + rc = daos_kv_put(oh, DAOS_TX_NONE, DAOS_COND_KEY_INSERT, "Empty_Key", sizeof(int), &val, + NULL); + assert_rc_equal(rc, -DER_EXIST); + + size = sizeof(int); + print_message("Conditional FETCH of existing but empty Key\n"); + rc = daos_kv_get(oh, DAOS_TX_NONE, DAOS_COND_KEY_GET, "Empty_Key", &size, &val_out, NULL); + assert_rc_equal(rc, 0); + assert_int_equal(size, 0); + + print_message("Update the empty Key with a no value update\n"); + rc = daos_kv_put(oh, DAOS_TX_NONE, 0, "Empty_Key", 0, NULL, NULL); + assert_rc_equal(rc, 0); + + print_message("Conditional Remove existing but empty Key\n"); + rc = daos_kv_remove(oh, DAOS_TX_NONE, DAOS_COND_KEY_REMOVE, "Empty_Key", NULL); assert_rc_equal(rc, 0); print_message("Destroying KV\n"); diff --git a/src/tests/suite/daos_obj_ec.c b/src/tests/suite/daos_obj_ec.c index 84d5b4b94e2..c23e66d7b3a 100644 --- a/src/tests/suite/daos_obj_ec.c +++ b/src/tests/suite/daos_obj_ec.c @@ -2292,6 +2292,62 @@ ec_update_2akeys(void **state) } } +static void +ec_dkey_enum_fail(void **state) +{ + test_arg_t *arg = *state; + struct ioreq req; + daos_obj_id_t oid; + int num_dkey = 
1000; + daos_anchor_t anchor = { 0 }; + int total = 0; + char buf[512]; + daos_size_t buf_len = 512; + int i; + int rc; + + if (!test_runable(arg, 3)) + return; + + oid = daos_test_oid_gen(arg->coh, OC_EC_2P1G1, 0, 0, arg->myrank); + ioreq_init(&req, arg->coh, oid, DAOS_IOD_ARRAY, arg); + for (i = 0; i < num_dkey; i++) { + char dkey[32]; + char data[5]; + daos_recx_t recx; + + /* Make dkey on different shards */ + req.iod_type = DAOS_IOD_ARRAY; + sprintf(dkey, "dkey_%d", i); + recx.rx_nr = 5; + recx.rx_idx = 0; + memset(data, 'a', 5); + insert_recxs(dkey, "a_key", 1, DAOS_TX_NONE, &recx, 1, data, 16, &req); + } + + print_message("iterate dkey...\n"); + while (!daos_anchor_is_eof(&anchor)) { + daos_key_desc_t kds[10]; + uint32_t number = 10; + + memset(buf, 0, buf_len); + memset(kds, 0, sizeof(*kds) * number); + rc = enumerate_dkey(DAOS_TX_NONE, &number, kds, &anchor, buf, buf_len, &req); + assert_rc_equal(rc, 0); + if (total == 0) { + daos_fail_loc_set(DAOS_FAIL_SHARD_OPEN | DAOS_FAIL_ALWAYS); + daos_fail_value_set(2); + } + total += number; + } + daos_fail_loc_set(0); + daos_fail_value_set(0); + + assert_rc_equal(total, 1000); + + ioreq_fini(&req); +} + /** create a new pool/container for each test */ static const struct CMUnitTest ec_tests[] = { {"EC0: ec dkey list and punch test", @@ -2342,6 +2398,8 @@ static const struct CMUnitTest ec_tests[] = { test_case_teardown}, {"EC24: ec multi-array update", ec_multi_array, async_disable, test_case_teardown}, + {"EC25: ec dkey enumerate with failure shard", ec_dkey_enum_fail, async_disable, + test_case_teardown}, }; int diff --git a/src/vos/sys_db.c b/src/vos/sys_db.c index 05ff9d14508..575e605d80f 100644 --- a/src/vos/sys_db.c +++ b/src/vos/sys_db.c @@ -1,5 +1,5 @@ /** - * (C) Copyright 2020-2022 Intel Corporation. + * (C) Copyright 2020-2023 Intel Corporation. 
* * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -71,6 +71,12 @@ db2vos(struct sys_db *db) return container_of(db, struct vos_sys_db, db_pub); } +uuid_t * +vos_db_pool_uuid() +{ + return &vos_db.db_pool; +} + static void db_close(struct sys_db *db) { @@ -124,13 +130,13 @@ db_open_create(struct sys_db *db, bool try_create) D_DEBUG(DB_IO, "Opening %s, try_create=%d\n", vdb->db_file, try_create); if (try_create) { rc = vos_pool_create(vdb->db_file, vdb->db_pool, SYS_DB_SIZE, 0, - 0, &vdb->db_poh); + VOS_POF_SYSDB, &vdb->db_poh); if (rc) { D_CRIT("sys pool create error: "DF_RC"\n", DP_RC(rc)); goto failed; } } else { - rc = vos_pool_open(vdb->db_file, vdb->db_pool, 0, &vdb->db_poh); + rc = vos_pool_open(vdb->db_file, vdb->db_pool, VOS_POF_SYSDB, &vdb->db_poh); if (rc) { /** * The access checks above should ensure the file diff --git a/src/vos/tests/SConscript b/src/vos/tests/SConscript index 9439312de39..0c4aafc2222 100644 --- a/src/vos/tests/SConscript +++ b/src/vos/tests/SConscript @@ -40,6 +40,17 @@ def scons(): LIBS=libraries) unit_env.Install('$PREFIX/bin/', test) + tenv = denv.Clone() + tenv.AppendUnique(RPATH_FULL=['$PREFIX/lib64/daos_srv']) + tenv.Append(CPPDEFINES={'VOS_STANDALONE': '1'}) + + libraries = ['uuid', 'bio', 'gurt', 'cmocka', 'daos_common_pmem', 'daos_tests', 'vos', 'abt'] + + tenv.require('spdk') + bio_ut_src = ['bio_ut.c', 'wal_ut.c'] + bio_ut = tenv.d_test_program('bio_ut', bio_ut_src, LIBS=libraries) + tenv.Install('$PREFIX/bin/', bio_ut) + if __name__ == "SCons.Script": scons() diff --git a/src/bio/tests/bio_ut.c b/src/vos/tests/bio_ut.c similarity index 55% rename from src/bio/tests/bio_ut.c rename to src/vos/tests/bio_ut.c index f1818adb200..b8e557aca4e 100644 --- a/src/bio/tests/bio_ut.c +++ b/src/vos/tests/bio_ut.c @@ -11,6 +11,8 @@ #include #include #include "bio_ut.h" +#include "../vos_tls.h" +#include static char db_path[100]; struct bio_ut_args ut_args; @@ -18,11 +20,7 @@ struct bio_ut_args ut_args; void ut_fini(struct bio_ut_args *args) { - bio_xsctxt_free(args->bua_xs_ctxt); - smd_fini(); - lmm_db_fini(); - bio_nvme_fini(); - ABT_finalize(); + vos_self_fini(); daos_debug_fini(); } @@ -34,61 +32,14 @@ ut_fini(struct bio_ut_args *args) int ut_init(struct bio_ut_args *args) { - struct sys_db *db; - char nvme_conf[200] = { 0 }; - int fd, rc; - - snprintf(nvme_conf, sizeof(nvme_conf), "%s/daos_nvme.conf", db_path); - - rc = daos_debug_init(DAOS_LOG_DEFAULT); - if (rc != 0) - return rc; - - rc = ABT_init(0, NULL); - if (rc != 0) - goto out_debug; - - fd = open(nvme_conf, O_RDONLY, 0600); - if (fd < 0) { - D_ERROR("Failed to open %s. %s\n", nvme_conf, strerror(errno)); - rc = daos_errno2der(errno); - goto out_abt; - } - close(fd); - - rc = bio_nvme_init(nvme_conf, BIO_UT_NUMA_NODE, BIO_UT_MEM_SIZE, BIO_UT_HUGEPAGE_SZ, - BIO_UT_TARGET_NR, true); - if (rc) { - D_ERROR("NVMe init failed. "DF_RC"\n", DP_RC(rc)); - goto out_abt; - } - - rc = lmm_db_init_ex(db_path, "self_db", true, true); - if (rc) { - D_ERROR("lmm DB init failed. "DF_RC"\n", DP_RC(rc)); - goto out_nvme; - } - db = lmm_db_get(); - - rc = smd_init(db); - D_ASSERT(rc == 0); + int rc; - rc = bio_xsctxt_alloc(&args->bua_xs_ctxt, BIO_STANDALONE_TGT_ID, true); - if (rc) { - D_ERROR("Allocate Per-xstream NVMe context failed. 
"DF_RC"\n", DP_RC(rc)); - goto out_smd; - } + rc = vos_self_init(db_path, false, BIO_STANDALONE_TGT_ID); + if (rc) + daos_debug_fini(); + else + args->bua_xs_ctxt = vos_xsctxt_get(); - return 0; -out_smd: - smd_fini(); - lmm_db_fini(); -out_nvme: - bio_nvme_fini(); -out_abt: - ABT_finalize(); -out_debug: - daos_debug_fini(); return rc; } diff --git a/src/bio/tests/bio_ut.h b/src/vos/tests/bio_ut.h similarity index 100% rename from src/bio/tests/bio_ut.h rename to src/vos/tests/bio_ut.h diff --git a/src/vos/tests/vts_aggregate.c b/src/vos/tests/vts_aggregate.c index e54ee6a1c93..2b2b92082af 100644 --- a/src/vos/tests/vts_aggregate.c +++ b/src/vos/tests/vts_aggregate.c @@ -202,11 +202,11 @@ lookup_object(struct io_test_args *arg, daos_unit_oid_t oid) * tree. If this returns 0, we need to release the object though * this is only presently used to check existence */ - rc = vos_obj_hold(vos_obj_cache_current(), + rc = vos_obj_hold(vos_obj_cache_current(true), vos_hdl2cont(arg->ctx.tc_co_hdl), oid, &epr, 0, VOS_OBJ_VISIBLE, DAOS_INTENT_DEFAULT, &obj, 0); if (rc == 0) - vos_obj_release(vos_obj_cache_current(), obj, false); + vos_obj_release(vos_obj_cache_current(true), obj, false); return rc; } diff --git a/src/vos/tests/vts_io.c b/src/vos/tests/vts_io.c index a52b4589788..c8dac0c6e58 100644 --- a/src/vos/tests/vts_io.c +++ b/src/vos/tests/vts_io.c @@ -224,7 +224,7 @@ setup_io(void **state) srand(time(NULL)); test_args_init(&test_args, VPOOL_SIZE); - table = vos_ts_table_get(); + table = vos_ts_table_get(true); if (table == NULL) return -1; @@ -236,7 +236,7 @@ int teardown_io(void **state) { struct io_test_args *arg = *state; - struct vos_ts_table *table = vos_ts_table_get(); + struct vos_ts_table *table = vos_ts_table_get(true); int rc; if (table) { @@ -966,7 +966,7 @@ io_obj_cache_test(void **state) rc = vos_obj_cache_create(10, &occ); assert_rc_equal(rc, 0); - tls = vos_tls_get(); + tls = vos_tls_get(true); old_cache = tls->vtl_ocache; tls->vtl_ocache = occ; diff --git a/src/vos/tests/vts_ts.c b/src/vos/tests/vts_ts.c index bd7cd4c7904..60302ffe262 100644 --- a/src/vos/tests/vts_ts.c +++ b/src/vos/tests/vts_ts.c @@ -1,5 +1,5 @@ /** - * (C) Copyright 2020-2022 Intel Corporation. + * (C) Copyright 2020-2023 Intel Corporation. 
* * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -145,7 +145,7 @@ run_positive_entry_test(struct ts_test_arg *ts_arg, uint32_t type) /** Now evict the extra records to reset the array for child tests */ for (idx = 0; idx < NUM_EXTRA; idx++) - vos_ts_evict(&ts_arg->ta_extra_records[idx], type); + vos_ts_evict(&ts_arg->ta_extra_records[idx], type, true); /** evicting an entry should move it to lru */ vos_ts_set_reset(ts_arg->ta_ts_set, type, 0); @@ -153,7 +153,7 @@ run_positive_entry_test(struct ts_test_arg *ts_arg, uint32_t type) false, &same); assert_true(found); assert_int_equal(same->te_info->ti_type, type); - vos_ts_evict(&ts_arg->ta_records[type][20], type); + vos_ts_evict(&ts_arg->ta_records[type][20], type, true); found = vos_ts_lookup(ts_arg->ta_ts_set, &ts_arg->ta_records[type][20], true, &entry); assert_false(found); @@ -195,7 +195,7 @@ ilog_test_ts_get(void **state) for (type = VOS_TS_TYPE_AKEY;; type--) { for (idx = 0; idx < ts_arg->ta_counts[type]; idx++) { - vos_ts_evict(&ts_arg->ta_records[type][idx], type); + vos_ts_evict(&ts_arg->ta_records[type][idx], type, true); found = vos_ts_lookup(ts_arg->ta_ts_set, &ts_arg->ta_records[type][idx], true, &entry); @@ -231,7 +231,7 @@ alloc_ts_cache(void **state) int rc; /** Free already allocated table */ - ts_table = vos_ts_table_get(); + ts_table = vos_ts_table_get(true); if (ts_table != NULL) ts_arg->old_table = ts_table; @@ -734,13 +734,13 @@ ts_test_init(void **state) alloc_ts_cache(state); - ts_table = vos_ts_table_get(); + ts_table = vos_ts_table_get(true); for (i = 0; i < VOS_TS_TYPE_COUNT; i++) ts_arg->ta_counts[i] = ts_table->tt_type_info[i].ti_count; daos_dti_gen_unique(&dth.dth_xid); - rc = vos_ts_set_allocate(&ts_arg->ta_ts_set, 0, 0, 1, &dth); + rc = vos_ts_set_allocate(&ts_arg->ta_ts_set, 0, 0, 1, &dth, true); if (rc != 0) { D_FREE(ts_arg); return rc; @@ -756,7 +756,7 @@ ts_test_fini(void **state) struct vos_ts_table *ts_table; vos_ts_set_free(ts_arg->ta_ts_set); - ts_table = vos_ts_table_get(); + ts_table = vos_ts_table_get(true); vos_ts_table_free(&ts_table); vos_ts_table_set(ts_arg->old_table); D_FREE(ts_arg); diff --git a/src/vos/tests/vts_wal.c b/src/vos/tests/vts_wal.c index f9a30415c56..6d46f74dbb4 100644 --- a/src/vos/tests/vts_wal.c +++ b/src/vos/tests/vts_wal.c @@ -480,7 +480,7 @@ wal_kv_large(void **state) /* Update small EV/SV, large EV/SV (located on data blob) */ umm = &vos_hdl2cont(tcx->tc_co_hdl)->vc_pool->vp_umm; - rc = umem_tx_begin(umm, vos_txd_get()); + rc = umem_tx_begin(umm, vos_txd_get(true)); assert_rc_equal(rc, 0); epoch = epc_lo; @@ -500,7 +500,7 @@ wal_kv_large(void **state) /* Verify all values */ umm = &vos_hdl2cont(tcx->tc_co_hdl)->vc_pool->vp_umm; - rc = umem_tx_begin(umm, vos_txd_get()); + rc = umem_tx_begin(umm, vos_txd_get(true)); assert_rc_equal(rc, 0); epoch = epc_lo; diff --git a/src/bio/tests/wal_ut.c b/src/vos/tests/wal_ut.c similarity index 99% rename from src/bio/tests/wal_ut.c rename to src/vos/tests/wal_ut.c index 57709bdfcde..8273b387595 100644 --- a/src/bio/tests/wal_ut.c +++ b/src/vos/tests/wal_ut.c @@ -7,7 +7,7 @@ #define D_LOGFAC DD_FAC(tests) #include "bio_ut.h" -#include "../bio_wal.h" +#include "../../bio/bio_wal.h" static void ut_mc_fini(struct bio_ut_args *args) diff --git a/src/vos/vos_aggregate.c b/src/vos/vos_aggregate.c index a6108bc8d47..4f753f45fcb 100644 --- a/src/vos/vos_aggregate.c +++ b/src/vos/vos_aggregate.c @@ -162,10 +162,8 @@ struct vos_agg_param { /* Boundary for aggregatable write filter */ daos_epoch_t ap_filter_epoch; uint32_t ap_flags; - 
unsigned int ap_discard:1, - ap_csum_err:1, - ap_nospc_err:1, - ap_discard_obj:1; + unsigned int ap_discard : 1, ap_csum_err : 1, ap_nospc_err : 1, ap_in_progress : 1, + ap_discard_obj : 1; struct umem_instance *ap_umm; int (*ap_yield_func)(void *arg); void *ap_yield_arg; @@ -317,10 +315,11 @@ need_aggregate(daos_handle_t ih, struct vos_agg_param *agg_param, vos_iter_desc_ static inline bool vos_aggregate_yield(struct vos_agg_param *agg_param) { - int rc; + int rc; + struct vos_container *cont = vos_hdl2cont(agg_param->ap_coh); /* Current DTX handle must be NULL, since aggregation runs under non-DTX mode. */ - D_ASSERT(vos_dth_get() == NULL); + D_ASSERT(vos_dth_get(cont->vc_pool->vp_sysdb) == NULL); if (agg_param->ap_yield_func == NULL) { bio_yield(agg_param->ap_umm); @@ -2209,6 +2208,7 @@ vos_agg_ev(daos_handle_t ih, vos_iter_entry_t *entry, struct evt_extent phy_ext, lgc_ext; int rc = 0; int next_idx; + struct vos_container *cont = vos_hdl2cont(agg_param->ap_coh); D_ASSERT(agg_param != NULL); D_ASSERT(acts != NULL); @@ -2249,7 +2249,7 @@ vos_agg_ev(daos_handle_t ih, vos_iter_entry_t *entry, } /* Current DTX handle must be NULL, since aggregation runs under non-DTX mode. */ - D_ASSERT(vos_dth_get() == NULL); + D_ASSERT(vos_dth_get(cont->vc_pool->vp_sysdb) == NULL); /* Aggregation Yield for testing purpose */ while (DAOS_FAIL_CHECK(DAOS_VOS_AGG_BLOCKED)) @@ -2324,7 +2324,7 @@ vos_aggregate_pre_cb(daos_handle_t ih, vos_iter_entry_t *entry, *acts |= VOS_ITER_CB_ABORT; if (rc == -DER_CSUM) { - agg_param->ap_csum_err = true; + agg_param->ap_csum_err = 1; if (vam && vam->vam_csum_errs) d_tm_inc_counter(vam->vam_csum_errs, 1); } else if (rc == -DER_NOSPACE) { @@ -2334,6 +2334,7 @@ vos_aggregate_pre_cb(daos_handle_t ih, vos_iter_entry_t *entry, * this entry to avoid orphaned tree * assertion */ + agg_param->ap_in_progress = 1; agg_param->ap_skip_akey = true; agg_param->ap_skip_dkey = true; agg_param->ap_skip_obj = true; @@ -2437,6 +2438,7 @@ vos_aggregate_post_cb(daos_handle_t ih, vos_iter_entry_t *entry, if (rc == -DER_TX_BUSY) { struct vos_agg_metrics *vam = agg_cont2metrics(cont); + agg_param->ap_in_progress = 1; rc = 0; switch (type) { default: @@ -2689,6 +2691,15 @@ vos_aggregate(daos_handle_t coh, daos_epoch_range_t *epr, rc = -DER_CSUM; /* Inform caller the csum error */ close_merge_window(&ad->ad_agg_param.ap_window, rc); /* HAE needs be updated for csum error case */ + } else if (ad->ad_agg_param.ap_in_progress) { + /* Don't update HAE when there were in-progress entries. Otherwise, + * we will never aggregate anything in those subtrees until there is + * a new write. + * + * NB: We may be able to improve this by tracking the lowest epoch + * of such entries and updating the HAE to that value - 1. 
+ */ + goto exit; } update_hae: @@ -2731,7 +2742,8 @@ vos_discard(daos_handle_t coh, daos_unit_oid_t *oidp, daos_epoch_range_t *epr, return -DER_NOMEM; if (oidp != NULL) { - rc = vos_obj_discard_hold(vos_obj_cache_current(), cont, *oidp, &obj); + rc = vos_obj_discard_hold(vos_obj_cache_current(cont->vc_pool->vp_sysdb), + cont, *oidp, &obj); if (rc != 0) { if (rc == -DER_NONEXIST) rc = 0; @@ -2787,7 +2799,7 @@ vos_discard(daos_handle_t coh, daos_unit_oid_t *oidp, daos_epoch_range_t *epr, release_obj: if (oidp != NULL) - vos_obj_discard_release(vos_obj_cache_current(), obj); + vos_obj_discard_release(vos_obj_cache_current(cont->vc_pool->vp_sysdb), obj); free_agg_data: D_FREE(ad); diff --git a/src/vos/vos_common.c b/src/vos/vos_common.c index d863d15aa7c..0b605819c75 100644 --- a/src/vos/vos_common.c +++ b/src/vos/vos_common.c @@ -58,11 +58,14 @@ vos_report_layout_incompat(const char *type, int version, int min_version, } struct vos_tls * -vos_tls_get(void) +vos_tls_get(bool standalone) { #ifdef VOS_STANDALONE return self_mode.self_tls; #else + if (standalone) + return self_mode.self_tls; + return dss_module_key_get(dss_tls_get(), &vos_module_key); #endif } @@ -103,46 +106,16 @@ vos_ts_add_missing(struct vos_ts_set *ts_set, daos_key_t *dkey, int akey_nr, } } -#ifdef VOS_STANDALONE -int -vos_profile_start(char *path, int avg) -{ - struct vos_tls *tls = vos_tls_get(); - struct daos_profile *dp; - int rc; - - if (tls == NULL) - return 0; - - rc = daos_profile_init(&dp, path, avg, 0, 0); - if (rc) - return rc; - - tls->vtl_dp = dp; - return 0; -} - -void -vos_profile_stop() -{ - struct vos_tls *tls = vos_tls_get(); - - if (tls == NULL || tls->vtl_dp == NULL) - return; - - daos_profile_dump(tls->vtl_dp); - daos_profile_destroy(tls->vtl_dp); - tls->vtl_dp = NULL; -} - -#endif - struct bio_xs_context * vos_xsctxt_get(void) { #ifdef VOS_STANDALONE return self_mode.self_xs_ctxt; #else + /* main thread doesn't have TLS and XS context*/ + if (dss_tls_get() == NULL) + return NULL; + return dss_get_module_info()->dmi_nvme_ctxt; #endif } @@ -227,27 +200,28 @@ vos_tx_publish(struct dtx_handle *dth, bool publish) } int -vos_tx_begin(struct dtx_handle *dth, struct umem_instance *umm) +vos_tx_begin(struct dtx_handle *dth, struct umem_instance *umm, bool is_sysdb) { int rc; if (dth == NULL) - return umem_tx_begin(umm, vos_txd_get()); + return umem_tx_begin(umm, vos_txd_get(is_sysdb)); + D_ASSERT(!is_sysdb); /** Note: On successful return, dth tls gets set and will be cleared by the corresponding * call to vos_tx_end. This is to avoid ever keeping that set after a call to * umem_tx_end, which may yield for bio operations. */ if (dth->dth_local_tx_started) { - vos_dth_set(dth); + vos_dth_set(dth, false); return 0; } - rc = umem_tx_begin(umm, vos_txd_get()); + rc = umem_tx_begin(umm, vos_txd_get(is_sysdb)); if (rc == 0) { dth->dth_local_tx_started = 1; - vos_dth_set(dth); + vos_dth_set(dth, false); } return rc; @@ -290,7 +264,7 @@ vos_tx_end(struct vos_container *cont, struct dtx_handle *dth_in, /* Not the last modification. 
*/ if (err == 0 && dth->dth_modification_cnt > dth->dth_op_seq) { - vos_dth_set(NULL); + vos_dth_set(NULL, cont->vc_pool->vp_sysdb); return 0; } @@ -302,7 +276,7 @@ vos_tx_end(struct vos_container *cont, struct dtx_handle *dth_in, if (err == 0) err = vos_tx_publish(dth, true); - vos_dth_set(NULL); + vos_dth_set(NULL, cont->vc_pool->vp_sysdb); if (bio_nvme_configured(SMD_DEV_TYPE_META) && biod != NULL) err = umem_tx_end_ex(vos_cont2umm(cont), err, biod); @@ -427,6 +401,16 @@ vos_tls_fini(int tags, void *data) D_FREE(tls); } +void +vos_standalone_tls_fini(void) +{ + if (self_mode.self_tls) { + vos_tls_fini(DAOS_TGT_TAG, self_mode.self_tls); + self_mode.self_tls = NULL; + } + +} + static void * vos_tls_init(int tags, int xs_id, int tgt_id) { @@ -494,6 +478,17 @@ vos_tls_init(int tags, int xs_id, int tgt_id) return NULL; } +int +vos_standalone_tls_init(int tags) +{ + self_mode.self_tls = vos_tls_init(tags, 0, -1); + if (!self_mode.self_tls) + return -DER_NOMEM; + + return 0; +} + + struct dss_module_key vos_module_key = { .dmk_tags = DAOS_RDB_TAG | DAOS_TGT_TAG, .dmk_index = -1, @@ -839,16 +834,10 @@ vos_self_fini_locked(void) self_mode.self_xs_ctxt = NULL; } - if (!bio_nvme_configured(SMD_DEV_TYPE_META)) - vos_db_fini(); - else - lmm_db_fini(); + vos_db_fini(); vos_self_nvme_fini(); - if (self_mode.self_tls) { - vos_tls_fini(DAOS_TGT_TAG, self_mode.self_tls); - self_mode.self_tls = NULL; - } + vos_standalone_tls_fini(); ABT_finalize(); } @@ -892,8 +881,8 @@ vos_self_init(const char *db_path, bool use_sys_db, int tgt_id) vos_start_epoch = 0; #if VOS_STANDALONE - self_mode.self_tls = vos_tls_init(DAOS_TGT_TAG, 0, -1); - if (!self_mode.self_tls) { + rc = vos_standalone_tls_init(DAOS_TGT_TAG); + if (rc) { ABT_finalize(); goto out; } @@ -906,23 +895,14 @@ vos_self_init(const char *db_path, bool use_sys_db, int tgt_id) if (rc) goto failed; - if (bio_nvme_configured(SMD_DEV_TYPE_META)) { - /* LMM DB path same as VOS DB path argument in self init case */ - if (use_sys_db) - rc = lmm_db_init(db_path); - else - rc = lmm_db_init_ex(db_path, "self_db", true, true); - db = lmm_db_get(); - } else { - if (use_sys_db) - rc = vos_db_init(db_path); - else - rc = vos_db_init_ex(db_path, "self_db", true, true); - db = vos_db_get(); - } + if (use_sys_db) + rc = vos_db_init(db_path); + else + rc = vos_db_init_ex(db_path, "self_db", true, true); if (rc) goto failed; + db = vos_db_get(); rc = smd_init(db); if (rc) goto failed; diff --git a/src/vos/vos_container.c b/src/vos/vos_container.c index c6ca3089db9..19a10d6acac 100644 --- a/src/vos/vos_container.c +++ b/src/vos/vos_container.c @@ -55,15 +55,15 @@ static int cont_df_rec_free(struct btr_instance *tins, struct btr_record *rec, void *args) { struct vos_cont_df *cont_df; + struct vos_pool *vos_pool = (struct vos_pool *)tins->ti_priv; if (UMOFF_IS_NULL(rec->rec_off)) return -DER_NONEXIST; cont_df = umem_off2ptr(&tins->ti_umm, rec->rec_off); - vos_ts_evict(&cont_df->cd_ts_idx, VOS_TS_TYPE_CONT); + vos_ts_evict(&cont_df->cd_ts_idx, VOS_TS_TYPE_CONT, vos_pool->vp_sysdb); - return gc_add_item(tins->ti_priv, DAOS_HDL_INVAL, GC_CONT, rec->rec_off, - 0); + return gc_add_item(vos_pool, DAOS_HDL_INVAL, GC_CONT, rec->rec_off, 0); } static int @@ -198,7 +198,8 @@ cont_free_internal(struct vos_container *cont) } cont->vc_pool->vp_dtx_committed_count -= cont->vc_dtx_committed_count; - d_tm_dec_gauge(vos_tls_get()->vtl_committed, cont->vc_dtx_committed_count); + d_tm_dec_gauge(vos_tls_get(cont->vc_pool->vp_sysdb)->vtl_committed, + cont->vc_dtx_committed_count); D_FREE(cont); } 
@@ -229,7 +230,7 @@ cont_insert(struct vos_container *cont, struct d_uuid *key, struct d_uuid *pkey, D_ASSERT(cont != NULL && coh != NULL); d_uhash_ulink_init(&cont->vc_uhlink, &co_hdl_uh_ops); - rc = d_uhash_link_insert(vos_cont_hhash_get(), key, + rc = d_uhash_link_insert(vos_cont_hhash_get(cont->vc_pool->vp_sysdb), key, pkey, &cont->vc_uhlink); if (rc) { D_ERROR("UHASH table container handle insert failed\n"); @@ -245,11 +246,11 @@ cont_insert(struct vos_container *cont, struct d_uuid *key, struct d_uuid *pkey, static int cont_lookup(struct d_uuid *key, struct d_uuid *pkey, - struct vos_container **cont) { + struct vos_container **cont, bool is_sysdb) { struct d_ulink *ulink; - ulink = d_uhash_link_lookup(vos_cont_hhash_get(), key, pkey); + ulink = d_uhash_link_lookup(vos_cont_hhash_get(is_sysdb), key, pkey); if (ulink == NULL) return -DER_NONEXIST; @@ -260,13 +261,13 @@ cont_lookup(struct d_uuid *key, struct d_uuid *pkey, static void cont_decref(struct vos_container *cont) { - d_uhash_link_putref(vos_cont_hhash_get(), &cont->vc_uhlink); + d_uhash_link_putref(vos_cont_hhash_get(cont->vc_pool->vp_sysdb), &cont->vc_uhlink); } static void cont_addref(struct vos_container *cont) { - d_uhash_link_addref(vos_cont_hhash_get(), &cont->vc_uhlink); + d_uhash_link_addref(vos_cont_hhash_get(cont->vc_pool->vp_sysdb), &cont->vc_uhlink); } /** @@ -342,7 +343,7 @@ vos_cont_open(daos_handle_t poh, uuid_t co_uuid, daos_handle_t *coh) * Check if handle exists * then return the handle immediately */ - rc = cont_lookup(&ukey, &pkey, &cont); + rc = cont_lookup(&ukey, &pkey, &cont, pool->vp_sysdb); if (rc == 0) { cont->vc_open_count++; D_DEBUG(DB_TRACE, "Found handle for cont "DF_UUID @@ -482,7 +483,7 @@ vos_cont_close(daos_handle_t coh) cont->vc_open_count--; if (cont->vc_open_count == 0) - vos_obj_cache_evict(vos_obj_cache_current(), cont); + vos_obj_cache_evict(vos_obj_cache_current(cont->vc_pool->vp_sysdb), cont); D_DEBUG(DB_TRACE, "Close cont "DF_UUID", open count: %d\n", DP_UUID(cont->vc_id), cont->vc_open_count); @@ -563,12 +564,12 @@ vos_cont_destroy(daos_handle_t poh, uuid_t co_uuid) vos_dedup_invalidate(pool); - rc = cont_lookup(&key, &pkey, &cont); + rc = cont_lookup(&key, &pkey, &cont, pool->vp_sysdb); if (rc != -DER_NONEXIST) { D_ASSERT(rc == 0); if (cont->vc_open_count == 0) { - d_uhash_link_delete(vos_cont_hhash_get(), + d_uhash_link_delete(vos_cont_hhash_get(pool->vp_sysdb), &cont->vc_uhlink); cont_decref(cont); } else { diff --git a/src/vos/vos_dtx.c b/src/vos/vos_dtx.c index 8e2e69bc011..747e117723b 100644 --- a/src/vos/vos_dtx.c +++ b/src/vos/vos_dtx.c @@ -199,8 +199,8 @@ dtx_act_ent_cleanup(struct vos_container *cont, struct vos_dtx_act_ent *dae, } for (i = 0; i < count; i++) - vos_obj_evict_by_oid(vos_obj_cache_current(), cont, - oids[i]); + vos_obj_evict_by_oid(vos_obj_cache_current(cont->vc_pool->vp_sysdb), + cont, oids[i]); } if (dae->dae_oids != NULL && dae->dae_oids != &dae->dae_oid_inline && @@ -825,8 +825,9 @@ vos_dtx_commit_one(struct vos_container *cont, struct dtx_id *dti, daos_epoch_t DCE_XID(dce) = DAE_XID(dae); DCE_EPOCH(dce) = DAE_EPOCH(dae); } else { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth = vos_dth_get(false); + D_ASSERT(!cont->vc_pool->vp_sysdb); D_ASSERT(dtx_is_valid_handle(dth)); D_ASSERT(dth->dth_solo); @@ -1106,7 +1107,7 @@ int vos_dtx_check_availability(daos_handle_t coh, uint32_t entry, daos_epoch_t epoch, uint32_t intent, uint32_t type, bool retry) { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth; struct vos_container 
*cont; struct vos_dtx_act_ent *dae = NULL; bool found; @@ -1114,6 +1115,7 @@ vos_dtx_check_availability(daos_handle_t coh, uint32_t entry, cont = vos_hdl2cont(coh); D_ASSERT(cont != NULL); + dth = vos_dth_get(cont->vc_pool->vp_sysdb); if (dth != NULL && dth->dth_for_migration) intent = DAOS_INTENT_MIGRATION; @@ -1347,9 +1349,9 @@ vos_dtx_check_availability(daos_handle_t coh, uint32_t entry, } uint32_t -vos_dtx_get(void) +vos_dtx_get(bool standalone) { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth = vos_dth_get(standalone); if (!dtx_is_valid_handle(dth)) return DTX_LID_COMMITTED; @@ -1423,7 +1425,7 @@ int vos_dtx_register_record(struct umem_instance *umm, umem_off_t record, uint32_t type, uint32_t *tx_id) { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth = vos_dth_get(umm->umm_pool->up_store.store_standalone); struct vos_dtx_act_ent *dae; int rc = 0; @@ -2105,8 +2107,9 @@ vos_dtx_post_handle(struct vos_container *cont, } if (!abort && dces != NULL) { - struct vos_tls *tls = vos_tls_get(); + struct vos_tls *tls = vos_tls_get(false); + D_ASSERT(cont->vc_pool->vp_sysdb == false); for (i = 0; i < count; i++) { if (dces[i] != NULL) { cont->vc_dtx_committed_count++; @@ -2397,7 +2400,7 @@ vos_dtx_set_flags(daos_handle_t coh, struct dtx_id dtis[], int count, uint32_t f int vos_dtx_aggregate(daos_handle_t coh) { - struct vos_tls *tls = vos_tls_get(); + struct vos_tls *tls = vos_tls_get(false); struct vos_container *cont; struct vos_cont_df *cont_df; struct umem_instance *umm; @@ -2421,6 +2424,7 @@ vos_dtx_aggregate(daos_handle_t coh) if (dbd == NULL || dbd->dbd_count == 0) return 0; + D_ASSERT(cont->vc_pool->vp_sysdb == false); /* Take the opportunity to free some memory if we can */ lrua_array_aggregate(cont->vc_dtx_array); @@ -2615,7 +2619,7 @@ vos_dtx_mark_sync(daos_handle_t coh, daos_unit_oid_t oid, daos_epoch_t epoch) int rc; cont = vos_hdl2cont(coh); - occ = vos_obj_cache_current(); + occ = vos_obj_cache_current(cont->vc_pool->vp_sysdb); rc = vos_obj_hold(occ, cont, oid, &epr, 0, VOS_OBJ_VISIBLE, DAOS_INTENT_DEFAULT, &obj, 0); if (rc != 0) { @@ -3173,7 +3177,8 @@ vos_dtx_cache_reset(daos_handle_t coh, bool force) } cont->vc_pool->vp_dtx_committed_count -= cont->vc_dtx_committed_count; - d_tm_dec_gauge(vos_tls_get()->vtl_committed, cont->vc_dtx_committed_count); + D_ASSERT(cont->vc_pool->vp_sysdb == false); + d_tm_dec_gauge(vos_tls_get(false)->vtl_committed, cont->vc_dtx_committed_count); cont->vc_dtx_committed_hdl = DAOS_HDL_INVAL; cont->vc_dtx_committed_count = 0; diff --git a/src/vos/vos_gc.c b/src/vos/vos_gc.c index 0b6d24d0834..6c6c9e44d12 100644 --- a/src/vos/vos_gc.c +++ b/src/vos/vos_gc.c @@ -1,5 +1,5 @@ /** - * (C) Copyright 2019-2022 Intel Corporation. + * (C) Copyright 2019-2023 Intel Corporation. * * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -875,7 +875,7 @@ gc_check_cont(struct vos_container *cont) int gc_add_pool(struct vos_pool *pool) { - struct vos_tls *tls = vos_tls_get(); + struct vos_tls *tls = vos_tls_get(pool->vp_sysdb); D_DEBUG(DB_TRACE, "Register pool="DF_UUID" for GC\n", DP_UUID(pool->vp_id)); @@ -941,7 +941,7 @@ gc_log_pool(struct vos_pool *pool) static int vos_gc_run(int *credits) { - struct vos_tls *tls = vos_tls_get(); + struct vos_tls *tls = vos_tls_get(true); d_list_t *pools = &tls->vtl_gc_pools; int rc = 0; int checked = 0; @@ -1083,7 +1083,7 @@ vos_gc_yield(void *arg) int rc; /* Current DTX handle must be NULL, since GC runs under non-DTX mode. 
*/ - D_ASSERT(vos_dth_get() == NULL); + D_ASSERT(vos_dth_get(false) == NULL); if (param->vgc_yield_func == NULL) { param->vgc_credits = GC_CREDS_TIGHT; @@ -1107,12 +1107,13 @@ vos_gc_pool(daos_handle_t poh, int credits, int (*yield_func)(void *arg), void *yield_arg) { struct vos_pool *pool = vos_hdl2pool(poh); - struct vos_tls *tls = vos_tls_get(); + struct vos_tls *tls = vos_tls_get(pool->vp_sysdb); struct vos_gc_param param; uint32_t nr_flushed = 0; int rc = 0, total = 0; D_ASSERT(daos_handle_is_valid(poh)); + D_ASSERT(pool->vp_sysdb == false); vos_space_update_metrics(pool); diff --git a/src/vos/vos_ilog.c b/src/vos/vos_ilog.c index bac68e598b7..3a0e08f5b89 100644 --- a/src/vos/vos_ilog.c +++ b/src/vos/vos_ilog.c @@ -44,8 +44,9 @@ static int vos_ilog_is_same_tx(struct umem_instance *umm, uint32_t tx_id, daos_epoch_t epoch, bool *same, void *args) { - struct dtx_handle *dth = vos_dth_get(); - uint32_t dtx = vos_dtx_get(); + bool standalone = umm->umm_pool->up_store.store_standalone; + struct dtx_handle *dth = vos_dth_get(standalone); + uint32_t dtx = vos_dtx_get(standalone); daos_handle_t coh; coh.cookie = (unsigned long)args; @@ -372,7 +373,7 @@ int vos_ilog_update_(struct vos_container *cont, struct ilog_df *ilog, struct vos_ilog_info *parent, struct vos_ilog_info *info, uint32_t cond, struct vos_ts_set *ts_set) { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth = vos_dth_get(cont->vc_pool->vp_sysdb); daos_epoch_range_t max_epr = *epr; struct ilog_desc_cbs cbs; daos_handle_t loh; @@ -459,7 +460,7 @@ vos_ilog_punch_(struct vos_container *cont, struct ilog_df *ilog, struct vos_ilog_info *parent, struct vos_ilog_info *info, struct vos_ts_set *ts_set, bool leaf, bool replay) { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth = vos_dth_get(cont->vc_pool->vp_sysdb); daos_epoch_range_t max_epr = *epr; struct ilog_desc_cbs cbs; daos_handle_t loh; @@ -658,17 +659,17 @@ vos_ilog_ts_mark(struct vos_ts_set *ts_set, struct ilog_df *ilog) } void -vos_ilog_ts_evict(struct ilog_df *ilog, uint32_t type) +vos_ilog_ts_evict(struct ilog_df *ilog, uint32_t type, bool standalone) { uint32_t *idx; idx = ilog_ts_idx_get(ilog); - return vos_ts_evict(idx, type); + return vos_ts_evict(idx, type, standalone); } void -vos_ilog_last_update(struct ilog_df *ilog, uint32_t type, daos_epoch_t *epc) +vos_ilog_last_update(struct ilog_df *ilog, uint32_t type, daos_epoch_t *epc, bool standalone) { struct vos_ts_entry *se_entry = NULL; struct vos_wts_cache *wcache; @@ -679,7 +680,7 @@ vos_ilog_last_update(struct ilog_df *ilog, uint32_t type, daos_epoch_t *epc) D_ASSERT(epc != NULL); idx = ilog_ts_idx_get(ilog); - found = vos_ts_peek_entry(idx, type, &se_entry); + found = vos_ts_peek_entry(idx, type, &se_entry, standalone); if (found) { D_ASSERT(se_entry != NULL); wcache = &se_entry->te_w_cache; diff --git a/src/vos/vos_ilog.h b/src/vos/vos_ilog.h index 5bae3e6aae0..b256889e2ed 100644 --- a/src/vos/vos_ilog.h +++ b/src/vos/vos_ilog.h @@ -347,11 +347,13 @@ vos_ilog_ts_mark(struct vos_ts_set *ts_set, struct ilog_df *ilog); * * \param ilog[in] The incarnation log * \param type[in] The timestamp type + * \param standalone[in] standalone TLS or not */ void -vos_ilog_ts_evict(struct ilog_df *ilog, uint32_t type); +vos_ilog_ts_evict(struct ilog_df *ilog, uint32_t type, bool standalone); void -vos_ilog_last_update(struct ilog_df *ilog, uint32_t type, daos_epoch_t *epc); +vos_ilog_last_update(struct ilog_df *ilog, uint32_t type, daos_epoch_t *epc, + bool standalone); #endif /* __VOS_ILOG_H__ */
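
The recurring change across the VOS hunks in this patch is mechanical: every TLS accessor (vos_tls_get(), vos_dth_get()/vos_dth_set(), vos_txd_get(), vos_ts_table_get(), vos_obj_cache_current(), vos_sched_seq(), and friends) now takes a boolean that selects the standalone (sysdb) TLS instead of the per-xstream module key, and each call site derives that flag from whatever context it already holds: pool->vp_sysdb, cont->vc_pool->vp_sysdb, iter->it_for_sysdb, or umm->umm_pool->up_store.store_standalone. A minimal sketch of the caller-side convention, assuming the vp_sysdb pool flag introduced by this patch; the cont_is_sysdb() helper is illustrative only and not part of the patch:

    #include "vos_internal.h"   /* struct vos_container, struct vos_pool, vos_tls_get() */

    static inline bool
    cont_is_sysdb(struct vos_container *cont)
    {
            /* vp_sysdb marks the standalone system DB pool (see the
             * VOS_POF_SYSDB handling in vos_pool.c below) */
            return cont->vc_pool->vp_sysdb;
    }

    /* Typical call site: resolve the flag once, then thread it through. */
    bool                sysdb = cont_is_sysdb(cont);
    struct vos_tls     *tls   = vos_tls_get(sysdb);
    struct dtx_handle  *dth   = vos_dth_get(sysdb);
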
diff --git a/src/vos/vos_internal.h b/src/vos/vos_internal.h index 96059dfc57b..dc452c9269a 100644 --- a/src/vos/vos_internal.h +++ b/src/vos/vos_internal.h @@ -236,6 +236,8 @@ struct vos_pool { uint32_t vp_dying : 1; /** exclusive handle (see VOS_POF_EXCL) */ int vp_excl:1; + /* this pool is for sysdb */ + bool vp_sysdb; /** this pool is for rdb */ bool vp_rdb; /** caller specifies pool is small (for sys space reservation) */ @@ -569,19 +571,19 @@ extern struct vos_iter_ops vos_dtx_iter_ops; static inline void vos_pool_addref(struct vos_pool *pool) { - d_uhash_link_addref(vos_pool_hhash_get(), &pool->vp_hlink); + d_uhash_link_addref(vos_pool_hhash_get(pool->vp_sysdb), &pool->vp_hlink); } static inline void vos_pool_decref(struct vos_pool *pool) { - d_uhash_link_putref(vos_pool_hhash_get(), &pool->vp_hlink); + d_uhash_link_putref(vos_pool_hhash_get(pool->vp_sysdb), &pool->vp_hlink); } static inline void vos_pool_hash_del(struct vos_pool *pool) { - d_uhash_link_delete(vos_pool_hhash_get(), &pool->vp_hlink); + d_uhash_link_delete(vos_pool_hhash_get(pool->vp_sysdb), &pool->vp_hlink); } /** @@ -591,7 +593,7 @@ vos_pool_hash_del(struct vos_pool *pool) static inline struct daos_lru_cache * vos_get_obj_cache(void) { - return vos_tls_get()->vtl_ocache; + return vos_tls_get(false)->vtl_ocache; } /** @@ -701,7 +703,7 @@ vos_dtx_register_record(struct umem_instance *umm, umem_off_t record, /** Return the already active dtx id, if any */ uint32_t -vos_dtx_get(void); +vos_dtx_get(bool standalone); /** * Deregister the record from the DTX entry. @@ -1036,7 +1038,8 @@ struct vos_iterator { it_for_discard:1, it_for_migration:1, it_show_uncommitted:1, - it_ignore_uncommitted:1; + it_ignore_uncommitted:1, + it_for_sysdb:1; }; /* Auxiliary structure for passing information between parent and nested @@ -1202,7 +1205,7 @@ vos_evt_desc_cbs_init(struct evt_desc_cbs *cbs, struct vos_pool *pool, daos_handle_t coh); int -vos_tx_begin(struct dtx_handle *dth, struct umem_instance *umm); +vos_tx_begin(struct dtx_handle *dth, struct umem_instance *umm, bool is_sysdb); /** Finish the transaction and publish or cancel the reservations or * return if err == 0 and it's a multi-modification transaction that @@ -1431,17 +1434,27 @@ vos_epc_punched(daos_epoch_t epc, uint16_t minor_epc, } static inline bool -vos_dtx_hit_inprogress(void) +vos_dtx_hit_inprogress(bool standalone) { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth; + + if (standalone) + return false; + + dth = vos_dth_get(false); return dth != NULL && dth->dth_share_tbd_count > 0; } static inline bool -vos_dtx_continue_detect(int rc) +vos_dtx_continue_detect(int rc, bool standalone) { - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth; + + if (standalone) + return false; + + dth = vos_dth_get(false); /* Continue to detect other potential in-prepared DTX. 
*/ return rc == -DER_INPROGRESS && dth != NULL && diff --git a/src/vos/vos_io.c b/src/vos/vos_io.c index 1d7b4482e78..644ffa771dc 100644 --- a/src/vos/vos_io.c +++ b/src/vos/vos_io.c @@ -577,7 +577,8 @@ vos_ioc_destroy(struct vos_io_context *ioc, bool evict) dcs_csum_info_list_fini(&ioc->ic_csum_list); if (ioc->ic_obj) - vos_obj_release(vos_obj_cache_current(), ioc->ic_obj, evict); + vos_obj_release(vos_obj_cache_current(ioc->ic_cont->vc_pool->vp_sysdb), + ioc->ic_obj, evict); vos_ioc_reserve_fini(ioc); vos_ilog_fetch_finish(&ioc->ic_dkey_info); @@ -670,8 +671,9 @@ vos_ioc_create(daos_handle_t coh, daos_unit_oid_t oid, bool read_only, } } + cont = vos_hdl2cont(coh); rc = vos_ts_set_allocate(&ioc->ic_ts_set, vos_flags, cflags, iod_nr, - dth); + dth, cont->vc_pool->vp_sysdb); if (rc != 0) goto error; @@ -680,7 +682,6 @@ vos_ioc_create(daos_handle_t coh, daos_unit_oid_t oid, bool read_only, return 0; } - cont = vos_hdl2cont(coh); bioc = vos_data_ioctxt(cont->vc_pool); ioc->ic_biod = bio_iod_alloc(bioc, vos_ioc2umm(ioc), iod_nr, read_only ? BIO_IOD_TYPE_FETCH : BIO_IOD_TYPE_UPDATE); @@ -786,6 +787,7 @@ akey_fetch_single(daos_handle_t toh, const daos_epoch_range_t *epr, struct bio_iov biov; /* iov to return data buffer */ int rc; struct dcs_csum_info csum_info = {0}; + bool standalone = ioc->ic_cont->vc_pool->vp_sysdb; d_iov_set(&kiov, &key, sizeof(key)); key.sk_epoch = ioc->ic_bound; @@ -798,7 +800,7 @@ akey_fetch_single(daos_handle_t toh, const daos_epoch_range_t *epr, rc = dbtree_fetch(toh, BTR_PROBE_LE, DAOS_INTENT_DEFAULT, &kiov, &kiov, &riov); - if (vos_dtx_hit_inprogress()) + if (vos_dtx_hit_inprogress(standalone)) D_GOTO(out, rc = (rc == 0 ? -DER_INPROGRESS : rc)); if (rc == -DER_NONEXIST) { @@ -911,6 +913,7 @@ akey_fetch_recx(daos_handle_t toh, const daos_epoch_range_t *epr, bool with_shadow = (shadow_ep != DAOS_EPOCH_MAX); uint32_t inob; int rc; + bool standalone = ioc->ic_cont->vc_pool->vp_sysdb; index = recx->rx_idx; end = recx->rx_idx + recx->rx_nr; @@ -926,7 +929,7 @@ akey_fetch_recx(daos_handle_t toh, const daos_epoch_range_t *epr, evt_ent_array_init(ioc->ic_ent_array, 0); rc = evt_find(toh, &filter, ioc->ic_ent_array); - if (rc != 0 || vos_dtx_hit_inprogress()) + if (rc != 0 || vos_dtx_hit_inprogress(standalone)) D_GOTO(failed, rc = (rc == 0 ? 
-DER_INPROGRESS : rc)); holes = 0; @@ -1128,6 +1131,7 @@ stop_check(struct vos_io_context *ioc, uint64_t cond, daos_iod_t *iod, int *rc, bool check_uncertainty) { uint64_t flags; + bool standalone = ioc->ic_cont->vc_pool->vp_sysdb; if (*rc == 0) return false; @@ -1135,7 +1139,7 @@ stop_check(struct vos_io_context *ioc, uint64_t cond, daos_iod_t *iod, int *rc, if (*rc != -DER_NONEXIST) return true; - if (vos_dtx_hit_inprogress()) { + if (vos_dtx_hit_inprogress(standalone)) { *rc = -DER_INPROGRESS; return true; } @@ -1194,6 +1198,7 @@ akey_fetch(struct vos_io_context *ioc, daos_handle_t ak_toh) bool is_array = (iod->iod_type == DAOS_IOD_ARRAY); bool has_cond = false; struct daos_recx_ep_list *shadow; + bool standalone = ioc->ic_cont->vc_pool->vp_sysdb; D_DEBUG(DB_IO, "akey "DF_KEY" fetch %s epr "DF_X64"-"DF_X64"\n", DP_KEY(&iod->iod_name), @@ -1281,7 +1286,7 @@ akey_fetch(struct vos_io_context *ioc, daos_handle_t ak_toh) rc = akey_fetch_recx(toh, &val_epr, &fetch_recx, shadow_ep, &rsize, ioc); - if (vos_dtx_continue_detect(rc)) + if (vos_dtx_continue_detect(rc, standalone)) continue; if (rc != 0) { @@ -1291,7 +1296,7 @@ akey_fetch(struct vos_io_context *ioc, daos_handle_t ak_toh) } } - if (vos_dtx_hit_inprogress()) { + if (vos_dtx_hit_inprogress(standalone)) { D_DEBUG(DB_IO, "inprogress %d: idx %lu, nr %lu rsize " DF_U64"\n", i, (unsigned long)iod->iod_recxs[i].rx_idx, @@ -1321,7 +1326,7 @@ akey_fetch(struct vos_io_context *ioc, daos_handle_t ak_toh) } } - if (vos_dtx_hit_inprogress()) + if (vos_dtx_hit_inprogress(standalone)) goto out; ioc_trim_tail_holes(ioc); @@ -1329,7 +1334,7 @@ akey_fetch(struct vos_io_context *ioc, daos_handle_t ak_toh) if (daos_handle_is_valid(toh)) key_tree_release(toh, is_array); - return vos_dtx_hit_inprogress() ? -DER_INPROGRESS : rc; + return vos_dtx_hit_inprogress(standalone) ? -DER_INPROGRESS : rc; } static void @@ -1350,6 +1355,7 @@ dkey_fetch(struct vos_io_context *ioc, daos_key_t *dkey) daos_handle_t toh = DAOS_HDL_INVAL; int i, rc; bool has_cond; + bool standalone = ioc->ic_cont->vc_pool->vp_sysdb; rc = obj_tree_init(obj); if (rc != 0) @@ -1404,7 +1410,7 @@ dkey_fetch(struct vos_io_context *ioc, daos_key_t *dkey) for (i = 0; i < ioc->ic_iod_nr; i++) { iod_set_cursor(ioc, i); rc = akey_fetch(ioc, toh); - if (vos_dtx_continue_detect(rc)) + if (vos_dtx_continue_detect(rc, standalone)) continue; if (rc != 0) @@ -1412,14 +1418,14 @@ dkey_fetch(struct vos_io_context *ioc, daos_key_t *dkey) } /* Add this check to prevent some new added logic after above for(). */ - if (vos_dtx_hit_inprogress()) + if (vos_dtx_hit_inprogress(standalone)) goto out; out: if (daos_handle_is_valid(toh)) key_tree_release(toh, false); - return vos_dtx_hit_inprogress() ? -DER_INPROGRESS : rc; + return vos_dtx_hit_inprogress(standalone) ? 
-DER_INPROGRESS : rc; } uint64_t @@ -1474,13 +1480,13 @@ vos_fetch_begin(daos_handle_t coh, daos_unit_oid_t oid, daos_epoch_t epoch, if (rc != 0) return rc; - vos_dth_set(dth); + vos_dth_set(dth, ioc->ic_cont->vc_pool->vp_sysdb); rc = vos_ts_set_add(ioc->ic_ts_set, ioc->ic_cont->vc_ts_idx, NULL, 0); D_ASSERT(rc == 0); - rc = vos_obj_hold(vos_obj_cache_current(), ioc->ic_cont, oid, - &ioc->ic_epr, ioc->ic_bound, VOS_OBJ_VISIBLE, + rc = vos_obj_hold(vos_obj_cache_current(ioc->ic_cont->vc_pool->vp_sysdb), + ioc->ic_cont, oid, &ioc->ic_epr, ioc->ic_bound, VOS_OBJ_VISIBLE, DAOS_INTENT_DEFAULT, &ioc->ic_obj, ioc->ic_ts_set); if (stop_check(ioc, VOS_COND_FETCH_MASK | VOS_OF_COND_PER_AKEY, NULL, &rc, false)) { @@ -1509,7 +1515,7 @@ vos_fetch_begin(daos_handle_t coh, daos_unit_oid_t oid, daos_epoch_t epoch, set_ioc: *ioh = vos_ioc2ioh(ioc); out: - vos_dth_set(NULL); + vos_dth_set(NULL, ioc->ic_cont->vc_pool->vp_sysdb); if (rc == -DER_NONEXIST || rc == -DER_INPROGRESS || (rc == 0 && ioc->ic_read_ts_only)) { @@ -2302,7 +2308,7 @@ vos_update_end(daos_handle_t ioh, uint32_t pm_ver, daos_key_t *dkey, int err, err = vos_ts_set_add(ioc->ic_ts_set, ioc->ic_cont->vc_ts_idx, NULL, 0); D_ASSERT(err == 0); - err = vos_tx_begin(dth, umem); + err = vos_tx_begin(dth, umem, ioc->ic_cont->vc_pool->vp_sysdb); if (err != 0) goto abort; @@ -2325,8 +2331,8 @@ vos_update_end(daos_handle_t ioh, uint32_t pm_ver, daos_key_t *dkey, int err, D_FREE(daes); } - err = vos_obj_hold(vos_obj_cache_current(), ioc->ic_cont, ioc->ic_oid, - &ioc->ic_epr, ioc->ic_bound, + err = vos_obj_hold(vos_obj_cache_current(ioc->ic_cont->vc_pool->vp_sysdb), + ioc->ic_cont, ioc->ic_oid, &ioc->ic_epr, ioc->ic_bound, VOS_OBJ_CREATE | VOS_OBJ_VISIBLE, DAOS_INTENT_UPDATE, &ioc->ic_obj, ioc->ic_ts_set); if (err != 0) diff --git a/src/vos/vos_iterator.c b/src/vos/vos_iterator.c index 50f90d60707..817ea2e56c3 100644 --- a/src/vos/vos_iterator.c +++ b/src/vos/vos_iterator.c @@ -1,5 +1,5 @@ /** - * (C) Copyright 2016-2022 Intel Corporation. + * (C) Copyright 2016-2023 Intel Corporation. 
* * SPDX-License-Identifier: BSD-2-Clause-Patent */ @@ -106,8 +106,8 @@ nested_prepare(vos_iter_type_t type, struct vos_iter_dict *dict, return -DER_NONEXIST; } - old = vos_dth_get(); - vos_dth_set(iter->it_dth); + old = vos_dth_get(!!iter->it_for_sysdb); + vos_dth_set(iter->it_dth, !!iter->it_for_sysdb); rc = iter->it_ops->iop_nested_tree_fetch(iter, type, &info); if (rc != 0) { VOS_TX_TRACE_FAIL(rc, "Problem fetching nested tree (%s) from " @@ -141,10 +141,29 @@ nested_prepare(vos_iter_type_t type, struct vos_iter_dict *dict, *cih = vos_iter2hdl(citer); out: - vos_dth_set(old); + vos_dth_set(old, !!iter->it_for_sysdb); return rc; } +static bool +is_sysdb_pool(vos_iter_type_t type, vos_iter_param_t *param) +{ + struct vos_pool *vos_pool; + struct vos_container *vos_cont; + + if (type == VOS_ITER_COUUID) { + vos_pool = vos_hdl2pool(param->ip_hdl); + D_ASSERT(vos_pool != NULL); + + return vos_pool->vp_sysdb; + } + + vos_cont = vos_hdl2cont(param->ip_hdl); + D_ASSERT(vos_cont != NULL); + + return vos_cont->vc_pool->vp_sysdb; +} + int vos_iter_prepare(vos_iter_type_t type, vos_iter_param_t *param, daos_handle_t *ih, struct dtx_handle *dth) @@ -155,6 +174,7 @@ vos_iter_prepare(vos_iter_type_t type, vos_iter_param_t *param, struct vos_ts_set *ts_set = NULL; int rc; int rlevel; + bool is_sysdb; if (ih == NULL) { D_ERROR("Argument 'ih' is invalid to vos_iter_param\n"); @@ -163,11 +183,11 @@ vos_iter_prepare(vos_iter_type_t type, vos_iter_param_t *param, *ih = DAOS_HDL_INVAL; - if (daos_handle_is_inval(param->ip_hdl) && - daos_handle_is_inval(param->ip_ih)) { - D_ERROR("No valid handle specified in vos_iter_param\n"); + if (daos_handle_is_inval(param->ip_hdl)) { + D_ERROR("No valid pool or cont handle specified in vos_iter_param\n"); return -DER_INVAL; } + is_sysdb = is_sysdb_pool(type, param); for (dict = &vos_iterators[0]; dict->id_ops != NULL; dict++) { if (dict->id_type == type) @@ -213,17 +233,17 @@ vos_iter_prepare(vos_iter_type_t type, vos_iter_param_t *param, D_ASSERT(!dtx_is_valid_handle(dth)); break; } - rc = vos_ts_set_allocate(&ts_set, 0, rlevel, 1 /* max akeys */, dth); + rc = vos_ts_set_allocate(&ts_set, 0, rlevel, 1 /* max akeys */, dth, is_sysdb); if (rc != 0) goto out; D_DEBUG(DB_TRACE, "Preparing standalone iterator of type %s\n", dict->id_name); - old = vos_dth_get(); - vos_dth_set(dth); + old = vos_dth_get(is_sysdb); + vos_dth_set(dth, is_sysdb); rc = dict->id_ops->iop_prepare(type, param, &iter, ts_set); - vos_dth_set(old); + vos_dth_set(old, is_sysdb); if (rc != 0) { VOS_TX_LOG_FAIL(rc, "Could not prepare iterator for %s: "DF_RC "\n", dict->id_name, DP_RC(rc)); @@ -317,13 +337,14 @@ vos_iter_probe_ex(daos_handle_t ih, daos_anchor_t *anchor, uint32_t flags) struct vos_iterator *iter = vos_hdl2iter(ih); struct dtx_handle *old; int rc; + bool is_sysdb = !!iter->it_for_sysdb; D_ASSERT(iter->it_ops != NULL); - old = vos_dth_get(); - vos_dth_set(iter->it_dth); + old = vos_dth_get(is_sysdb); + vos_dth_set(iter->it_dth, is_sysdb); rc = iter->it_ops->iop_probe(iter, anchor, flags); - vos_dth_set(old); + vos_dth_set(old, is_sysdb); if (rc == 0) iter->it_state = VOS_ITS_OK; else if (rc == -DER_NONEXIST) @@ -360,6 +381,7 @@ vos_iter_next(daos_handle_t ih, daos_anchor_t *anchor) struct vos_iterator *iter = vos_hdl2iter(ih); struct dtx_handle *old; int rc; + bool is_sysdb = !!iter->it_for_sysdb; rc = iter_verify_state(iter); if (rc) @@ -367,10 +389,10 @@ vos_iter_next(daos_handle_t ih, daos_anchor_t *anchor) D_ASSERT(iter->it_ops != NULL); - old = vos_dth_get(); - 
vos_dth_set(iter->it_dth); + old = vos_dth_get(is_sysdb); + vos_dth_set(iter->it_dth, is_sysdb); rc = iter->it_ops->iop_next(iter, anchor); - vos_dth_set(old); + vos_dth_set(old, is_sysdb); if (rc == 0) iter->it_state = VOS_ITS_OK; else if (rc == -DER_NONEXIST) @@ -387,7 +409,8 @@ vos_iter_fetch(daos_handle_t ih, vos_iter_entry_t *it_entry, { struct vos_iterator *iter = vos_hdl2iter(ih); struct dtx_handle *old; - int rc; + bool is_sysdb = !!iter->it_for_sysdb; + int rc; rc = iter_verify_state(iter); if (rc) @@ -395,10 +418,10 @@ vos_iter_fetch(daos_handle_t ih, vos_iter_entry_t *it_entry, D_ASSERT(iter->it_ops != NULL); - old = vos_dth_get(); - vos_dth_set(iter->it_dth); + old = vos_dth_get(is_sysdb); + vos_dth_set(iter->it_dth, is_sysdb); rc = iter->it_ops->iop_fetch(iter, it_entry, anchor); - vos_dth_set(old); + vos_dth_set(old, is_sysdb); return rc; } @@ -656,13 +679,13 @@ advance_stage(vos_iter_type_t type, unsigned int acts, vos_iter_param_t *param, static inline void vos_iter_sched_sync(struct vos_iterator *iter) { - iter->it_seq = vos_sched_seq(); + iter->it_seq = vos_sched_seq(!!iter->it_for_sysdb); } static inline bool vos_iter_sched_check(struct vos_iterator *iter) { - uint64_t seq = vos_sched_seq(); + uint64_t seq = vos_sched_seq(!!iter->it_for_sysdb); bool ret = iter->it_seq != seq; iter->it_seq = seq; @@ -970,6 +993,7 @@ vos_iter_validate_internal(struct vos_iterator *iter) daos_anchor_t *anchor; int rc; struct dtx_handle *old; + bool is_sysdb = !!iter->it_for_sysdb; D_ASSERT(iter->it_anchors != NULL); @@ -1004,10 +1028,10 @@ vos_iter_validate_internal(struct vos_iterator *iter) D_ASSERTF(0, "Unexpected iterator type %d\n", iter->it_type); } - old = vos_dth_get(); - vos_dth_set(iter->it_dth); + old = vos_dth_get(is_sysdb); + vos_dth_set(iter->it_dth, is_sysdb); rc = iter->it_ops->iop_probe(iter, anchor, VOS_ITER_PROBE_AGAIN); - vos_dth_set(old); + vos_dth_set(old, is_sysdb); if (rc == 0) return 0; diff --git a/src/vos/vos_obj.c b/src/vos/vos_obj.c index f7459a4b530..6a9e5474355 100644 --- a/src/vos/vos_obj.c +++ b/src/vos/vos_obj.c @@ -108,7 +108,7 @@ tree_is_empty(struct vos_object *obj, umem_off_t *known_key, daos_handle_t toh, const daos_epoch_range_t *epr, vos_iter_type_t type) { daos_anchor_t anchor = {0}; - struct dtx_handle *dth = vos_dth_get(); + struct dtx_handle *dth = vos_dth_get(obj->obj_cont->vc_pool->vp_sysdb); struct umem_instance *umm; d_iov_t key; struct vos_key_info kinfo = {0}; @@ -354,16 +354,17 @@ static int obj_punch(daos_handle_t coh, struct vos_object *obj, daos_epoch_t epoch, daos_epoch_t bound, uint64_t flags, struct vos_ts_set *ts_set) { - struct daos_lru_cache *occ = vos_obj_cache_current(); + struct daos_lru_cache *occ; struct vos_container *cont; struct vos_ilog_info *info; int rc; + cont = vos_hdl2cont(coh); + occ = vos_obj_cache_current(cont->vc_pool->vp_sysdb); D_ALLOC_PTR(info); if (info == NULL) return -DER_NOMEM; vos_ilog_fetch_init(info); - cont = vos_hdl2cont(coh); rc = vos_oi_punch(cont, obj->obj_id, epoch, bound, flags, obj->obj_df, info, ts_set); if (rc) @@ -447,7 +448,8 @@ vos_obj_punch(daos_handle_t coh, daos_unit_oid_t oid, daos_epoch_t epoch, } - rc = vos_ts_set_allocate(&ts_set, flags, cflags, akey_nr, dth); + rc = vos_ts_set_allocate(&ts_set, flags, cflags, akey_nr, + dth, cont->vc_pool->vp_sysdb); if (rc != 0) goto reset; @@ -455,7 +457,7 @@ vos_obj_punch(daos_handle_t coh, daos_unit_oid_t oid, daos_epoch_t epoch, if (rc != 0) goto reset; - rc = vos_tx_begin(dth, vos_cont2umm(cont)); + rc = vos_tx_begin(dth, vos_cont2umm(cont), 
cont->vc_pool->vp_sysdb); if (rc != 0) goto reset; @@ -479,8 +481,8 @@ vos_obj_punch(daos_handle_t coh, daos_unit_oid_t oid, daos_epoch_t epoch, hold_flags = (flags & VOS_OF_COND_PUNCH) ? 0 : VOS_OBJ_CREATE; hold_flags |= VOS_OBJ_VISIBLE; /* NB: punch always generate a new incarnation of the object */ - rc = vos_obj_hold(vos_obj_cache_current(), vos_hdl2cont(coh), oid, &epr, - bound, hold_flags, DAOS_INTENT_PUNCH, &obj, ts_set); + rc = vos_obj_hold(vos_obj_cache_current(cont->vc_pool->vp_sysdb), vos_hdl2cont(coh), + oid, &epr, bound, hold_flags, DAOS_INTENT_PUNCH, &obj, ts_set); if (rc == 0) { if (dkey) { /* key punch */ rc = key_punch(obj, epr.epr_hi, bound, pm_ver, dkey, @@ -507,7 +509,8 @@ vos_obj_punch(daos_handle_t coh, daos_unit_oid_t oid, daos_epoch_t epoch, rc = vos_mark_agg(cont, &obj->obj_df->vo_tree, &cont->vc_cont_df->cd_obj_root, epoch); - vos_obj_release(vos_obj_cache_current(), obj, rc != 0); + vos_obj_release(vos_obj_cache_current(cont->vc_pool->vp_sysdb), + obj, rc != 0); } } @@ -559,7 +562,7 @@ vos_obj_key2anchor(daos_handle_t coh, daos_unit_oid_t oid, daos_key_t *dkey, dao daos_anchor_t *anchor) { struct vos_container *cont; - struct daos_lru_cache *occ = vos_obj_cache_current(); + struct daos_lru_cache *occ; int rc; struct vos_object *obj; daos_epoch_range_t epr = {0, DAOS_EPOCH_MAX}; @@ -570,6 +573,7 @@ vos_obj_key2anchor(daos_handle_t coh, daos_unit_oid_t oid, daos_key_t *dkey, dao D_ERROR("Container is not open"); return -DER_INVAL; } + occ = vos_obj_cache_current(cont->vc_pool->vp_sysdb); rc = vos_obj_hold(occ, cont, oid, &epr, DAOS_EPOCH_MAX, 0, DAOS_INTENT_DEFAULT, &obj, NULL); if (rc != 0) { @@ -622,8 +626,8 @@ vos_obj_key2anchor(daos_handle_t coh, daos_unit_oid_t oid, daos_key_t *dkey, dao static int vos_obj_delete_internal(daos_handle_t coh, daos_unit_oid_t oid, bool only_delete_entry) { - struct daos_lru_cache *occ = vos_obj_cache_current(); struct vos_container *cont = vos_hdl2cont(coh); + struct daos_lru_cache *occ = vos_obj_cache_current(cont->vc_pool->vp_sysdb); struct umem_instance *umm = vos_cont2umm(cont); struct vos_object *obj; daos_epoch_range_t epr = {0, DAOS_EPOCH_MAX}; @@ -674,8 +678,8 @@ int vos_obj_del_key(daos_handle_t coh, daos_unit_oid_t oid, daos_key_t *dkey, daos_key_t *akey) { - struct daos_lru_cache *occ = vos_obj_cache_current(); struct vos_container *cont = vos_hdl2cont(coh); + struct daos_lru_cache *occ = vos_obj_cache_current(cont->vc_pool->vp_sysdb); struct umem_instance *umm = vos_cont2umm(cont); struct vos_object *obj; daos_key_t *key; @@ -853,7 +857,8 @@ key_iter_fill(struct vos_krec_df *krec, struct vos_obj_iter *oiter, bool check_e ent->ie_epoch = epr.epr_hi; ent->ie_punch = oiter->it_ilog_info.ii_next_punch; ent->ie_obj_punch = oiter->it_obj->obj_ilog_info.ii_next_punch; - vos_ilog_last_update(&krec->kr_ilog, ts_type, &ent->ie_last_update); + vos_ilog_last_update(&krec->kr_ilog, ts_type, &ent->ie_last_update, + !!oiter->it_iter.it_for_sysdb); return 0; } @@ -872,6 +877,8 @@ key_iter_fetch(struct vos_obj_iter *oiter, vos_iter_entry_t *ent, uint32_t ts_type; unsigned int acts; int rc; + struct vos_object *obj = oiter->it_obj; + bool is_sysdb = obj->obj_cont->vc_pool->vp_sysdb; rc = key_iter_fetch_helper(oiter, &rbund, &ent->ie_key, anchor); D_ASSERTF(check_existence || rc != -DER_NONEXIST, @@ -896,22 +903,21 @@ key_iter_fetch(struct vos_obj_iter *oiter, vos_iter_entry_t *ent, ts_type = VOS_TS_TYPE_DKEY; else ts_type = VOS_TS_TYPE_AKEY; - vos_ilog_last_update(&krec->kr_ilog, ts_type, &desc.id_agg_write); + 
vos_ilog_last_update(&krec->kr_ilog, ts_type, &desc.id_agg_write, + !!oiter->it_iter.it_for_sysdb); } acts = 0; - start_seq = vos_sched_seq(); - dth = vos_dth_get(); - if (dth != NULL) - vos_dth_set(NULL); + start_seq = vos_sched_seq(is_sysdb); + dth = vos_dth_get(is_sysdb); + vos_dth_set(NULL, is_sysdb); rc = oiter->it_iter.it_filter_cb(vos_iter2hdl(&oiter->it_iter), &desc, oiter->it_iter.it_filter_arg, &acts); - if (dth != NULL) - vos_dth_set(dth); + vos_dth_set(dth, is_sysdb); if (rc != 0) return rc; - if (start_seq != vos_sched_seq()) + if (start_seq != vos_sched_seq(is_sysdb)) acts |= VOS_ITER_CB_YIELD; if (acts & (VOS_ITER_CB_EXIT | VOS_ITER_CB_ABORT | VOS_ITER_CB_RESTART | VOS_ITER_CB_DELETE | VOS_ITER_CB_YIELD)) @@ -1615,8 +1621,9 @@ vos_obj_iter_prep(vos_iter_type_t type, vos_iter_param_t *param, struct vos_ts_set *ts_set) { struct vos_obj_iter *oiter; - struct vos_container *cont; - struct dtx_handle *dth = vos_dth_get(); + struct vos_container *cont = vos_hdl2cont(param->ip_hdl); + bool is_sysdb = cont->vc_pool->vp_sysdb; + struct dtx_handle *dth = vos_dth_get(is_sysdb); daos_epoch_t bound; int rc; @@ -1642,6 +1649,8 @@ vos_obj_iter_prep(vos_iter_type_t type, vos_iter_param_t *param, oiter->it_iter.it_for_discard = 1; if (param->ip_flags & VOS_IT_FOR_MIGRATION) oiter->it_iter.it_for_migration = 1; + if (is_sysdb) + oiter->it_iter.it_for_sysdb = 1; if (param->ip_flags == VOS_IT_KEY_TREE) { /** Prepare the iterator from an already open tree handle. See * vos_iterate_key @@ -1652,7 +1661,6 @@ vos_obj_iter_prep(vos_iter_type_t type, vos_iter_param_t *param, goto done; } - cont = vos_hdl2cont(param->ip_hdl); rc = vos_ts_set_add(ts_set, cont->vc_ts_idx, NULL, 0); D_ASSERT(rc == 0); @@ -1660,7 +1668,7 @@ vos_obj_iter_prep(vos_iter_type_t type, vos_iter_param_t *param, * the object/key if it's punched more than once. However, rebuild * system should guarantee this will never happen. */ - rc = vos_obj_hold(vos_obj_cache_current(), cont, + rc = vos_obj_hold(vos_obj_cache_current(is_sysdb), cont, param->ip_oid, &oiter->it_epr, oiter->it_iter.it_bound, (oiter->it_flags & VOS_IT_PUNCHED) ? 0 : @@ -1760,13 +1768,14 @@ vos_obj_iter_nested_tree_fetch(struct vos_iterator *iter, vos_iter_type_t type, static int nested_dkey_iter_init(struct vos_obj_iter *oiter, struct vos_iter_info *info) { - int rc; + int rc; + struct vos_container *cont = vos_hdl2cont(info->ii_hdl); /* XXX the condition epoch ranges could cover multiple versions of * the object/key if it's punched more than once. However, rebuild * system should guarantee this will never happen. */ - rc = vos_obj_hold(vos_obj_cache_current(), vos_hdl2cont(info->ii_hdl), + rc = vos_obj_hold(vos_obj_cache_current(cont->vc_pool->vp_sysdb), cont, info->ii_oid, &info->ii_epr, oiter->it_iter.it_bound, (oiter->it_flags & VOS_IT_PUNCHED) ? 
0 : VOS_OBJ_VISIBLE, vos_iter_intent(&oiter->it_iter), @@ -1797,7 +1806,7 @@ nested_dkey_iter_init(struct vos_obj_iter *oiter, struct vos_iter_info *info) return 0; failed: - vos_obj_release(vos_obj_cache_current(), oiter->it_obj, false); + vos_obj_release(vos_obj_cache_current(cont->vc_pool->vp_sysdb), oiter->it_obj, false); return rc; } @@ -1808,7 +1817,8 @@ vos_obj_iter_nested_prep(vos_iter_type_t type, struct vos_iter_info *info, { struct vos_object *obj = info->ii_obj; struct vos_obj_iter *oiter; - struct dtx_handle *dth = vos_dth_get(); + struct vos_container *vos_cont; + struct dtx_handle *dth; daos_epoch_t bound; struct evt_desc_cbs cbs; struct evt_filter filter = {0}; @@ -1816,6 +1826,11 @@ vos_obj_iter_nested_prep(vos_iter_type_t type, struct vos_iter_info *info, int rc = 0; uint32_t options; + if (type != VOS_ITER_DKEY) + vos_cont = obj->obj_cont; + else + vos_cont = vos_hdl2cont(info->ii_hdl); + dth = vos_dth_get(vos_cont->vc_pool->vp_sysdb); D_ALLOC_PTR(oiter); if (oiter == NULL) return -DER_NOMEM; @@ -1838,6 +1853,8 @@ vos_obj_iter_nested_prep(vos_iter_type_t type, struct vos_iter_info *info, oiter->it_iter.it_for_discard = 1; if (info->ii_flags & VOS_IT_FOR_MIGRATION) oiter->it_iter.it_for_migration = 1; + if (vos_cont->vc_pool->vp_sysdb) + oiter->it_iter.it_for_sysdb = 1; switch (type) { default: @@ -1906,6 +1923,7 @@ vos_obj_iter_fini(struct vos_iterator *iter) { struct vos_obj_iter *oiter = vos_iter2oiter(iter); int rc; + struct vos_object *object; if (daos_handle_is_inval(oiter->it_hdl)) D_GOTO(out, rc = -DER_NO_HDL); @@ -1930,9 +1948,11 @@ vos_obj_iter_fini(struct vos_iterator *iter) * to ensure that a parent never gets removed before all nested * iterators are finalized */ - if (oiter->it_flags != VOS_IT_KEY_TREE && oiter->it_obj != NULL && + object = oiter->it_obj; + if (oiter->it_flags != VOS_IT_KEY_TREE && object != NULL && (iter->it_type == VOS_ITER_DKEY || !iter->it_from_parent)) - vos_obj_release(vos_obj_cache_current(), oiter->it_obj, false); + vos_obj_release(vos_obj_cache_current(object->obj_cont->vc_pool->vp_sysdb), + object, false); vos_ilog_fetch_finish(&oiter->it_ilog_info); D_FREE(oiter); diff --git a/src/vos/vos_obj.h b/src/vos/vos_obj.h index 48da4ed820f..20160a97abf 100644 --- a/src/vos/vos_obj.h +++ b/src/vos/vos_obj.h @@ -137,9 +137,9 @@ void vos_obj_cache_evict(struct daos_lru_cache *occ, struct vos_container *cont); /** - * Return object cache for the current thread. + * Return object cache for the current IO. */ -struct daos_lru_cache *vos_obj_cache_current(void); +struct daos_lru_cache *vos_obj_cache_current(bool standalone); /** * Object Index API and handles diff --git a/src/vos/vos_obj_cache.c b/src/vos/vos_obj_cache.c index a51a0c32714..286191cd66b 100644 --- a/src/vos/vos_obj_cache.c +++ b/src/vos/vos_obj_cache.c @@ -190,12 +190,12 @@ vos_obj_cache_evict(struct daos_lru_cache *cache, struct vos_container *cont) } /** - * Return object cache for the current thread. + * Return object cache for the current IO. 
diff --git a/src/vos/vos_obj_cache.c b/src/vos/vos_obj_cache.c
index a51a0c32714..286191cd66b 100644
--- a/src/vos/vos_obj_cache.c
+++ b/src/vos/vos_obj_cache.c
@@ -190,12 +190,12 @@ vos_obj_cache_evict(struct daos_lru_cache *cache, struct vos_container *cont)
 }
 
 /**
- * Return object cache for the current thread.
+ * Return object cache for the current IO.
  */
 struct daos_lru_cache *
-vos_obj_cache_current(void)
+vos_obj_cache_current(bool standalone)
 {
-	return vos_obj_cache_get();
+	return vos_obj_cache_get(standalone);
 }
 
 static __thread struct vos_object	 obj_local = {0};
@@ -467,7 +467,7 @@ vos_obj_hold(struct daos_lru_cache *occ, struct vos_container *cont,
 		obj->obj_sync_epoch = obj->obj_df->vo_sync;
 
 	if (obj->obj_df != NULL && epr->epr_hi <= obj->obj_sync_epoch &&
-	    vos_dth_get() != NULL &&
+	    vos_dth_get(obj->obj_cont->vc_pool->vp_sysdb) != NULL &&
 	    (intent == DAOS_INTENT_PUNCH || intent == DAOS_INTENT_UPDATE)) {
 		/* If someone has synced the object against the
 		 * obj->obj_sync_epoch, then we do not allow to modify the
diff --git a/src/vos/vos_obj_index.c b/src/vos/vos_obj_index.c
index 7d68ae83b52..58d1dba1021 100644
--- a/src/vos/vos_obj_index.c
+++ b/src/vos/vos_obj_index.c
@@ -71,7 +71,8 @@ static int
 oi_rec_alloc(struct btr_instance *tins, d_iov_t *key_iov,
 	     d_iov_t *val_iov, struct btr_record *rec, d_iov_t *val_out)
 {
-	struct dtx_handle	*dth = vos_dth_get();
+	struct vos_container	*cont = vos_hdl2cont(tins->ti_coh);
+	struct dtx_handle	*dth = vos_dth_get(cont->vc_pool->vp_sysdb);
 	struct vos_obj_df	*obj;
 	daos_unit_oid_t		*key;
 	umem_off_t		 obj_off;
@@ -132,9 +133,12 @@ oi_rec_free(struct btr_instance *tins, struct btr_record *rec, void *args)
 	struct oi_delete_arg	*del_arg = args;
 	daos_handle_t		 coh = { 0 };
 	int			 rc;
+	struct vos_pool		*pool;
 
 	obj = umem_off2ptr(umm, rec->rec_off);
 
+	D_ASSERT(tins->ti_priv);
+	pool = (struct vos_pool *)tins->ti_priv;
 	/* Normally it should delete both ilog and vo_tree, but during upgrade
 	 * the new OID (with new layout version) will share the same ilog and
 	 * vos_tree with the old OID (with old layout version), so it will only
@@ -152,10 +156,9 @@ oi_rec_free(struct btr_instance *tins, struct btr_record *rec, void *args)
 			return rc;
 		}
 
-		vos_ilog_ts_evict(&obj->vo_ilog, VOS_TS_TYPE_OBJ);
+		vos_ilog_ts_evict(&obj->vo_ilog, VOS_TS_TYPE_OBJ, pool->vp_sysdb);
 	}
 
-	D_ASSERT(tins->ti_priv);
 	if (del_arg != NULL)
 		coh = vos_cont2hdl((struct vos_container *)del_arg->cont);
 
@@ -245,7 +248,7 @@ vos_oi_find_alloc(struct vos_container *cont, daos_unit_oid_t oid,
 		  daos_epoch_t epoch, bool log, struct vos_obj_df **obj_p,
 		  struct vos_ts_set *ts_set)
 {
-	struct dtx_handle	*dth = vos_dth_get();
+	struct dtx_handle	*dth = vos_dth_get(cont->vc_pool->vp_sysdb);
 	struct vos_obj_df	*obj = NULL;
 	d_iov_t			 key_iov;
 	d_iov_t			 val_iov;
@@ -512,7 +515,7 @@ oi_iter_prep(vos_iter_type_t type, vos_iter_param_t *param,
 {
 	struct vos_oi_iter	*oiter = NULL;
 	struct vos_container	*cont = NULL;
-	struct dtx_handle	*dth = vos_dth_get();
+	struct dtx_handle	*dth;
 	int			 rc = 0;
 
 	if (type != VOS_ITER_OBJ) {
@@ -525,6 +528,8 @@ oi_iter_prep(vos_iter_type_t type, vos_iter_param_t *param,
 	if (cont == NULL)
 		return -DER_INVAL;
 
+	dth = vos_dth_get(cont->vc_pool->vp_sysdb);
+
 	D_ALLOC_PTR(oiter);
 	if (oiter == NULL)
 		return -DER_NOMEM;
@@ -552,6 +557,8 @@ oi_iter_prep(vos_iter_type_t type, vos_iter_param_t *param,
 		oiter->oit_iter.it_for_discard = 1;
 	if (param->ip_flags & VOS_IT_FOR_MIGRATION)
 		oiter->oit_iter.it_for_migration = 1;
+	if (cont->vc_pool->vp_sysdb)
+		oiter->oit_iter.it_for_sysdb = 1;
 
 	rc = dbtree_iter_prepare(cont->vc_btr_hdl, 0, &oiter->oit_hdl);
 	if (rc)
@@ -580,6 +587,7 @@ oi_iter_match_probe(struct vos_iterator *iter, daos_anchor_t *anchor, uint32_t f
 	uint64_t		 feats;
 	unsigned int		 acts;
 	int			 rc;
+	bool			 is_sysdb = !!iter->it_for_sysdb;
 
 	while (1) {
 		struct vos_obj_df *obj;
@@ -605,22 +613,21 @@ oi_iter_match_probe(struct vos_iterator *iter, daos_anchor_t *anchor, uint32_t f
 			/* Upgrading case, set it to latest known epoch */
 			if (obj->vo_max_write == 0)
 				vos_ilog_last_update(&obj->vo_ilog, VOS_TS_TYPE_OBJ,
-						     &desc.id_agg_write);
+						     &desc.id_agg_write,
+						     !!iter->it_for_sysdb);
 			else
 				desc.id_agg_write = obj->vo_max_write;
 		}
 
 		acts = 0;
-		start_seq = vos_sched_seq();
-		dth = vos_dth_get();
-		if (dth != NULL)
-			vos_dth_set(NULL);
+		start_seq = vos_sched_seq(is_sysdb);
+		dth = vos_dth_get(is_sysdb);
+		vos_dth_set(NULL, is_sysdb);
 		rc = iter->it_filter_cb(vos_iter2hdl(iter), &desc, iter->it_filter_arg,
 					&acts);
-		if (dth != NULL)
-			vos_dth_set(dth);
+		vos_dth_set(dth, is_sysdb);
 		if (rc != 0)
 			goto failed;
-		if (start_seq != vos_sched_seq())
+		if (start_seq != vos_sched_seq(is_sysdb))
 			acts |= VOS_ITER_CB_YIELD;
 		if (acts & (VOS_ITER_CB_EXIT | VOS_ITER_CB_ABORT | VOS_ITER_CB_RESTART |
 			    VOS_ITER_CB_DELETE | VOS_ITER_CB_YIELD))
@@ -720,7 +727,7 @@ oi_iter_fill(struct vos_obj_df *obj, struct vos_oi_iter *oiter, bool check_exist
 	/* Upgrading case, set it to latest known epoch */
 	if (obj->vo_max_write == 0)
 		vos_ilog_last_update(&obj->vo_ilog, VOS_TS_TYPE_OBJ,
-				     &ent->ie_last_update);
+				     &ent->ie_last_update, oiter->oit_iter.it_for_sysdb);
 	else
 		ent->ie_last_update = obj->vo_max_write;
 
@@ -817,7 +824,7 @@ oi_iter_check_punch(daos_handle_t ih)
 	D_DEBUG(DB_IO, "Moving object "DF_UOID" to gc heap\n", DP_UOID(oid));
 	/* Evict the object from cache */
-	rc = vos_obj_evict_by_oid(vos_obj_cache_current(),
+	rc = vos_obj_evict_by_oid(vos_obj_cache_current(oiter->oit_cont->vc_pool->vp_sysdb),
 				  oiter->oit_cont, oid);
 	if (rc != 0)
 		D_ERROR("Could not evict object "DF_UOID" "DF_RC"\n",
@@ -876,7 +883,7 @@ oi_iter_aggregate(daos_handle_t ih, bool range_discard)
 	 */
 	/* Evict the object from cache */
-	rc = vos_obj_evict_by_oid(vos_obj_cache_current(),
+	rc = vos_obj_evict_by_oid(vos_obj_cache_current(oiter->oit_cont->vc_pool->vp_sysdb),
 				  oiter->oit_cont, oid);
 	if (rc != 0)
 		D_ERROR("Could not evict object "DF_UOID" "DF_RC"\n",
diff --git a/src/vos/vos_pool.c b/src/vos/vos_pool.c
index 94801919313..0b1dda5157c 100644
--- a/src/vos/vos_pool.c
+++ b/src/vos/vos_pool.c
@@ -573,6 +573,13 @@ vos_pmemobj_create(const char *path, uuid_t pool_id, const char *layout,
 	int	rc, ret;
 
 	*ph = NULL;
+	/* always use PMEM mode for SMD */
+	store.store_type = umempobj_get_backend_type();
+	if (flags & VOS_POF_SYSDB) {
+		store.store_type = DAOS_MD_PMEM;
+		store.store_standalone = true;
+	}
+
 	/* No NVMe is configured or current xstream doesn't have NVMe context */
 	if (!bio_nvme_configured(SMD_DEV_TYPE_MAX) || xs_ctxt == NULL)
 		goto umem_create;
@@ -648,6 +655,13 @@ vos_pmemobj_open(const char *path, uuid_t pool_id, const char *layout, unsigned
 	int	rc, ret;
 
 	*ph = NULL;
+	/* always use PMEM mode for SMD */
+	store.store_type = umempobj_get_backend_type();
+	if (flags & VOS_POF_SYSDB) {
+		store.store_type = DAOS_MD_PMEM;
+		store.store_standalone = true;
+	}
+
 	/* No NVMe is configured or current xstream doesn't have NVMe context */
 	if (!bio_nvme_configured(SMD_DEV_TYPE_MAX) || xs_ctxt == NULL)
 		goto umem_open;
@@ -812,7 +826,7 @@ pool_link(struct vos_pool *pool, struct d_uuid *ukey, daos_handle_t *poh)
 {
 	int	rc;
 
-	rc = d_uhash_link_insert(vos_pool_hhash_get(), ukey, NULL,
+	rc = d_uhash_link_insert(vos_pool_hhash_get(pool->vp_sysdb), ukey, NULL,
 				 &pool->vp_hlink);
 	if (rc) {
 		D_ERROR("uuid hash table insert failed: "DF_RC"\n", DP_RC(rc));
@@ -828,8 +842,10 @@ static int
 pool_lookup(struct d_uuid *ukey, struct vos_pool **pool)
 {
 	struct d_ulink	*hlink;
+	bool		 is_sysdb = uuid_compare(ukey->uuid, *vos_db_pool_uuid()) == 0 ?
+				    true : false;
 
-	hlink = d_uhash_link_lookup(vos_pool_hhash_get(), ukey, NULL);
+	hlink = d_uhash_link_lookup(vos_pool_hhash_get(is_sysdb), ukey, NULL);
 	if (hlink == NULL) {
 		D_DEBUG(DB_MGMT, "can't find "DF_UUID"\n", DP_UUID(ukey->uuid));
 		return -DER_NONEXIST;
@@ -949,7 +965,7 @@ vos_pool_create_ex(const char *path, uuid_t uuid, daos_size_t scm_sz,
 		scm_sz = lstat.st_size;
 	}
 
-	uma.uma_id = UMEM_CLASS_PMEM;
+	uma.uma_id = umempobj_backend_type2class_id(ph->up_store.store_type);
 	uma.uma_pool = ph;
 
 	rc = umem_class_init(&uma, &umem);
@@ -1008,8 +1024,8 @@ vos_pool_create_ex(const char *path, uuid_t uuid, daos_size_t scm_sz,
 	uuid_copy(blob_hdr.bbh_pool, uuid);
 
 	/* Format SPDK blob*/
-	rc = vea_format(&umem, vos_txd_get(), &pool_df->pd_vea_df, VOS_BLK_SZ,
-			VOS_BLOB_HDR_BLKS, nvme_sz, vos_blob_format_cb,
+	rc = vea_format(&umem, vos_txd_get(flags & VOS_POF_SYSDB), &pool_df->pd_vea_df,
+			VOS_BLK_SZ, VOS_BLOB_HDR_BLKS, nvme_sz, vos_blob_format_cb,
 			&blob_hdr, false);
 	if (rc) {
 		D_ERROR("Format blob error for pool:"DF_UUID". "DF_RC"\n",
@@ -1063,6 +1079,7 @@ vos_pool_kill(uuid_t uuid, unsigned int flags)
 			rc = 0;
 			break;
 		}
+		D_ASSERT(pool->vp_sysdb == false);
 		D_ASSERT(pool != NULL);
 
 		if (gc_have_pool(pool)) {
@@ -1207,8 +1224,8 @@ pool_open(void *ph, struct vos_pool_df *pool_df, unsigned int flags, void *metri
 	}
 
 	uma = &pool->vp_uma;
-	uma->uma_id = UMEM_CLASS_PMEM;
 	uma->uma_pool = ph;
+	uma->uma_id = umempobj_backend_type2class_id(uma->uma_pool->up_store.store_type);
 
 	/* Initialize dummy data I/O context */
 	rc = bio_ioctxt_open(&pool->vp_dummy_ioctxt, vos_xsctxt_get(), pool->vp_id, true);
@@ -1244,8 +1261,8 @@ pool_open(void *ph, struct vos_pool_df *pool_df, unsigned int flags, void *metri
 		unmap_ctxt.vnc_unmap = vos_blob_unmap_cb;
 		unmap_ctxt.vnc_data = vos_data_ioctxt(pool);
 		unmap_ctxt.vnc_ext_flush = flags & VOS_POF_EXTERNAL_FLUSH;
-		rc = vea_load(&pool->vp_umm, vos_txd_get(), &pool_df->pd_vea_df,
-			      &unmap_ctxt, vea_metrics, &pool->vp_vea_info);
+		rc = vea_load(&pool->vp_umm, vos_txd_get(flags & VOS_POF_SYSDB),
+			      &pool_df->pd_vea_df, &unmap_ctxt, vea_metrics, &pool->vp_vea_info);
 		if (rc) {
 			D_ERROR("Failed to load block space info: "DF_RC"\n",
 				DP_RC(rc));
@@ -1259,6 +1276,7 @@ pool_open(void *ph, struct vos_pool_df *pool_df, unsigned int flags, void *metri
 
 	/* Insert the opened pool to the uuid hash table */
 	uuid_copy(ukey.uuid, pool_df->pd_id);
+	pool->vp_sysdb = !!(flags & VOS_POF_SYSDB);
 	rc = pool_link(pool, &ukey, poh);
 	if (rc) {
 		D_ERROR("Error inserting into vos DRAM hash\n");
@@ -1496,6 +1514,7 @@ vos_pool_query_space(uuid_t pool_id, struct vos_pool_space *vps)
 	}
 
 	D_ASSERT(pool != NULL);
+	D_ASSERT(pool->vp_sysdb == false);
 	rc = vos_space_query(pool, vps, false);
 	vos_pool_decref(pool);
 	return rc;
diff --git a/src/vos/vos_query.c b/src/vos/vos_query.c
index 6e6c6f4356f..b7007ec6b87 100644
--- a/src/vos/vos_query.c
+++ b/src/vos/vos_query.c
@@ -113,7 +113,7 @@ find_key(struct open_query *query, daos_handle_t toh, daos_key_t *key,
 		ci_set_null(rbund.rb_csum);
 
 		rc = dbtree_iter_fetch(ih, &kiov, &riov, anchor);
-		if (vos_dtx_continue_detect(rc))
+		if (vos_dtx_continue_detect(rc, query->qt_pool->vp_sysdb))
 			goto next;
 
 		if (rc != 0)
@@ -123,7 +123,7 @@ find_key(struct open_query *query, daos_handle_t toh, daos_key_t *key,
 		if (rc == 0)
 			break;
 
-		if (vos_dtx_continue_detect(rc))
+		if (vos_dtx_continue_detect(rc, query->qt_pool->vp_sysdb))
 			continue;
 
 		if (rc != -DER_NONEXIST)
@@ -145,7 +145,7 @@ find_key(struct open_query *query, daos_handle_t toh, daos_key_t *key,
 	if (rc == 0)
 		rc = fini_rc;
 
-	return vos_dtx_hit_inprogress() ? -DER_INPROGRESS : rc;
+	return vos_dtx_hit_inprogress(query->qt_pool->vp_sysdb) ? -DER_INPROGRESS : rc;
 }
 
 static int
@@ -583,6 +583,7 @@ vos_obj_query_key(daos_handle_t coh, daos_unit_oid_t oid, uint32_t flags,
 	uint32_t		 cflags = 0;
 	int			 rc = 0;
 	int			 nr_akeys = 0;
+	bool			 is_sysdb = false;
 
 	obj_epr.epr_hi = dtx_is_valid_handle(dth) ? dth->dth_epoch : epoch;
 	bound = dtx_is_valid_handle(dth) ? dth->dth_epoch_bound : epoch;
@@ -654,21 +655,21 @@ vos_obj_query_key(daos_handle_t coh, daos_unit_oid_t oid, uint32_t flags,
 		}
 	}
 
-	vos_dth_set(dth);
-	rc = vos_ts_set_allocate(&query->qt_ts_set, 0, cflags, nr_akeys, dth);
+	cont = vos_hdl2cont(coh);
+	is_sysdb = cont->vc_pool->vp_sysdb;
+	vos_dth_set(dth, is_sysdb);
+	rc = vos_ts_set_allocate(&query->qt_ts_set, 0, cflags, nr_akeys, dth, is_sysdb);
 	if (rc != 0) {
 		D_ERROR("Failed to allocate timestamp set: "DF_RC"\n",
 			DP_RC(rc));
 		goto free_query;
 	}
 
-	cont = vos_hdl2cont(coh);
-
 	rc = vos_ts_set_add(query->qt_ts_set, cont->vc_ts_idx, NULL, 0);
 	D_ASSERT(rc == 0);
 
 	query->qt_bound = MAX(obj_epr.epr_hi, bound);
-	rc = vos_obj_hold(vos_obj_cache_current(), vos_hdl2cont(coh), oid,
+	rc = vos_obj_hold(vos_obj_cache_current(is_sysdb), vos_hdl2cont(coh), oid,
 			  &obj_epr, query->qt_bound, VOS_OBJ_VISIBLE,
 			  DAOS_INTENT_DEFAULT, &obj, query->qt_ts_set);
 	if (rc != 0) {
@@ -786,7 +787,7 @@ vos_obj_query_key(daos_handle_t coh, daos_unit_oid_t oid, uint32_t flags,
 			*max_write = obj->obj_df->vo_max_write;
 
 	if (obj != NULL)
-		vos_obj_release(vos_obj_cache_current(), obj, false);
+		vos_obj_release(vos_obj_cache_current(is_sysdb), obj, false);
 
 	if (rc == 0 || rc == -DER_NONEXIST) {
 		if (vos_ts_wcheck(query->qt_ts_set, obj_epr.epr_hi,
@@ -799,7 +800,7 @@ vos_obj_query_key(daos_handle_t coh, daos_unit_oid_t oid, uint32_t flags,
 	vos_ts_set_free(query->qt_ts_set);
 free_query:
-	vos_dth_set(NULL);
+	vos_dth_set(NULL, is_sysdb);
 	D_FREE(query);
 
 	return rc;
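The iterator and query hunks above replace the conditional save/restore of the DTX handle around user callbacks with an unconditional set/clear keyed by is_sysdb. A minimal compilable model of that discipline follows; dtx_handle, dth_get/dth_set and run_filter are stand-ins for illustration, not the DAOS functions.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct dtx_handle { int id; };

static __thread struct dtx_handle *dth_regular;
static __thread struct dtx_handle *dth_standalone;

static struct dtx_handle **
dth_slot(bool standalone)
{
	return standalone ? &dth_standalone : &dth_regular;
}

static struct dtx_handle *dth_get(bool standalone) { return *dth_slot(standalone); }
static void dth_set(struct dtx_handle *dth, bool standalone) { *dth_slot(standalone) = dth; }

/* Pattern from the iterator code: detach the transaction handle while a
 * user filter callback runs (it may yield), then restore it afterwards.
 * Restoring unconditionally, even when the saved handle is NULL, is what
 * lets the patch drop the "if (dth != NULL)" guards. */
static int
run_filter(bool is_sysdb, int (*filter_cb)(void *), void *arg)
{
	struct dtx_handle *saved = dth_get(is_sysdb);
	int                rc;

	dth_set(NULL, is_sysdb);  /* callback must not see our handle */
	rc = filter_cb(arg);
	dth_set(saved, is_sysdb); /* unconditional restore */
	return rc;
}

static int noop_cb(void *arg) { (void)arg; return 0; }

int main(void)
{
	struct dtx_handle tx = { 42 };

	dth_set(&tx, false);
	printf("rc=%d restored=%d\n", run_filter(false, noop_cb, NULL),
	       dth_get(false) == &tx);
	return 0;
}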
diff --git a/src/vos/vos_tls.h b/src/vos/vos_tls.h
index c6b03597201..96c9a3e0c6d 100644
--- a/src/vos/vos_tls.h
+++ b/src/vos/vos_tls.h
@@ -1,5 +1,5 @@
 /**
- * (C) Copyright 2016-2022 Intel Corporation.
+ * (C) Copyright 2016-2023 Intel Corporation.
  *
  * SPDX-License-Identifier: BSD-2-Clause-Patent
  */
@@ -66,48 +66,48 @@ struct vos_tls {
 };
 
 struct bio_xs_context *vos_xsctxt_get(void);
-struct vos_tls *vos_tls_get();
+struct vos_tls *vos_tls_get(bool standalone);
 
 static inline struct d_hash_table *
-vos_pool_hhash_get(void)
+vos_pool_hhash_get(bool is_sysdb)
 {
-	return vos_tls_get()->vtl_pool_hhash;
+	return vos_tls_get(is_sysdb)->vtl_pool_hhash;
 }
 
 static inline struct d_hash_table *
-vos_cont_hhash_get(void)
+vos_cont_hhash_get(bool is_sysdb)
 {
-	return vos_tls_get()->vtl_cont_hhash;
+	return vos_tls_get(is_sysdb)->vtl_cont_hhash;
 }
 
 static inline struct daos_lru_cache *
-vos_obj_cache_get(void)
+vos_obj_cache_get(bool standalone)
 {
-	return vos_tls_get()->vtl_ocache;
+	return vos_tls_get(standalone)->vtl_ocache;
 }
 
 static inline struct umem_tx_stage_data *
-vos_txd_get(void)
+vos_txd_get(bool standalone)
 {
-	return &vos_tls_get()->vtl_txd;
+	return &vos_tls_get(standalone)->vtl_txd;
 }
 
 static inline struct vos_ts_table *
-vos_ts_table_get(void)
+vos_ts_table_get(bool standalone)
 {
-	return vos_tls_get()->vtl_ts_table;
+	return vos_tls_get(standalone)->vtl_ts_table;
 }
 
 static inline void
 vos_ts_table_set(struct vos_ts_table *ts_table)
 {
-	vos_tls_get()->vtl_ts_table = ts_table;
+	vos_tls_get(false)->vtl_ts_table = ts_table;
 }
 
 static inline void
-vos_dth_set(struct dtx_handle *dth)
+vos_dth_set(struct dtx_handle *dth, bool standalone)
 {
-	struct vos_tls		*tls = vos_tls_get();
+	struct vos_tls		*tls = vos_tls_get(standalone);
 	struct dtx_share_peer	*dsp;
 
 	if (dth != NULL && dth != tls->vtl_dth &&
@@ -123,26 +123,23 @@ vos_dth_set(struct dtx_handle *dth)
 }
 
 static inline struct dtx_handle *
-vos_dth_get(void)
+vos_dth_get(bool standalone)
 {
-	struct vos_tls		*tls = vos_tls_get();
+	struct vos_tls		*tls = vos_tls_get(standalone);
 
-	if (tls != NULL)
-		return vos_tls_get()->vtl_dth;
-
-	return NULL;
+	return tls ? tls->vtl_dth : NULL;
 }
 
 static inline void
-vos_kh_clear(void)
+vos_kh_clear(bool standalone)
 {
-	vos_tls_get()->vtl_hash_set = false;
+	vos_tls_get(standalone)->vtl_hash_set = false;
 }
 
 static inline void
-vos_kh_set(uint64_t hash)
+vos_kh_set(uint64_t hash, bool standalone)
 {
-	struct vos_tls	*tls = vos_tls_get();
+	struct vos_tls	*tls = vos_tls_get(standalone);
 
 	tls->vtl_hash = hash;
 	tls->vtl_hash_set = true;
@@ -150,9 +147,9 @@ vos_kh_set(uint64_t hash)
 }
 
 static inline bool
-vos_kh_get(uint64_t *hash)
+vos_kh_get(uint64_t *hash, bool standalone)
 {
-	struct vos_tls	*tls = vos_tls_get();
+	struct vos_tls	*tls = vos_tls_get(standalone);
 
 	*hash = tls->vtl_hash;
 
@@ -160,12 +157,12 @@ vos_kh_get(uint64_t *hash)
 }
 
 static inline uint64_t
-vos_hash_get(const void *buf, uint64_t len)
+vos_hash_get(const void *buf, uint64_t len, bool standalone)
 {
-	uint64_t hash;
+	uint64_t	hash;
 
-	if (vos_kh_get(&hash)) {
-		vos_kh_clear();
+	if (vos_kh_get(&hash, standalone)) {
+		vos_kh_clear(standalone);
 		return hash;
 	}
 
@@ -174,14 +171,17 @@ vos_hash_get(const void *buf, uint64_t len)
 
 #ifdef VOS_STANDALONE
 static inline uint64_t
-vos_sched_seq(void)
+vos_sched_seq(bool standalone)
 {
 	return 0;
 }
 #else
 static inline uint64_t
-vos_sched_seq(void)
+vos_sched_seq(bool standalone)
 {
+	if (standalone)
+		return 0;
+
 	return sched_cur_seq();
 }
 #endif
diff --git a/src/vos/vos_tree.c b/src/vos/vos_tree.c
index fef05721a92..bd2d1c45555 100644
--- a/src/vos/vos_tree.c
+++ b/src/vos/vos_tree.c
@@ -148,12 +148,13 @@ ktr_rec_msize(int alloc_overhead)
 static void
 ktr_hkey_gen(struct btr_instance *tins, d_iov_t *key_iov, void *hkey)
 {
-	struct ktr_hkey		*kkey = (struct ktr_hkey *)hkey;
+	struct ktr_hkey		*kkey = (struct ktr_hkey *)hkey;
+	struct umem_pool	*umm_pool = tins->ti_umm.umm_pool;
 
 	hkey_common_gen(key_iov, hkey);
 
 	if (key_iov->iov_len > KH_INLINE_MAX)
-		vos_kh_set(kkey->kh_murmur64);
+		vos_kh_set(kkey->kh_murmur64, umm_pool->up_store.store_standalone);
 }
 
 /** compare the hashed key */
@@ -278,6 +279,7 @@ ktr_rec_free(struct btr_instance *tins, struct btr_record *rec, void *args)
 	daos_handle_t		 coh;
 	int			 gc;
 	int			 rc;
+	struct vos_pool		*pool;
 
 	if (UMOFF_IS_NULL(rec->rec_off))
 		return 0;
@@ -290,14 +292,14 @@ ktr_rec_free(struct btr_instance *tins, struct btr_record *rec, void *args)
 	if (rc != 0)
 		return rc;
 
+	pool = (struct vos_pool *)tins->ti_priv;
 	vos_ilog_ts_evict(&krec->kr_ilog, (krec->kr_bmap & KREC_BF_DKEY) ?
-			  VOS_TS_TYPE_DKEY : VOS_TS_TYPE_AKEY);
+			  VOS_TS_TYPE_DKEY : VOS_TS_TYPE_AKEY, pool->vp_sysdb);
 
 	D_ASSERT(tins->ti_priv);
 	gc = (krec->kr_bmap & KREC_BF_DKEY) ? GC_DKEY : GC_AKEY;
 	coh = vos_cont2hdl(args);
-	return gc_add_item((struct vos_pool *)tins->ti_priv, coh, gc,
-			   rec->rec_off, 0);
+	return gc_add_item(pool, coh, gc, rec->rec_off, 0);
 }
 
 static int
@@ -363,7 +365,8 @@ static int
 svt_rec_store(struct btr_instance *tins, struct btr_record *rec,
 	      struct vos_svt_key *skey, struct vos_rec_bundle *rbund)
 {
-	struct dtx_handle	*dth = vos_dth_get();
+	struct vos_container	*cont = vos_hdl2cont(tins->ti_coh);
+	struct dtx_handle	*dth = vos_dth_get(cont->vc_pool->vp_sysdb);
 	struct vos_irec_df	*irec = vos_rec2irec(tins, rec);
 	struct dcs_csum_info	*csum = rbund->rb_csum;
 	struct bio_iov		*biov = rbund->rb_biov;
@@ -580,13 +583,14 @@ svt_rec_free_internal(struct btr_instance *tins, struct btr_record *rec,
 	bio_addr_t		*addr = &irec->ir_ex_addr;
 	struct dtx_handle	*dth = NULL;
 	struct umem_rsrvd_act	*rsrvd_scm;
+	struct vos_container	*cont = vos_hdl2cont(tins->ti_coh);
 	int			 i;
 
 	if (UMOFF_IS_NULL(rec->rec_off))
 		return 0;
 
 	if (overwrite) {
-		dth = vos_dth_get();
+		dth = vos_dth_get(cont->vc_pool->vp_sysdb);
 		if (dth == NULL)
 			return -DER_NO_PERM; /* Not allowed */
 	}
@@ -924,7 +928,7 @@ key_tree_prepare(struct vos_object *obj, daos_handle_t toh,
 	int			 tmprc;
 
 	/** reset the saved hash */
-	vos_kh_clear();
+	vos_kh_clear(obj->obj_cont->vc_pool->vp_sysdb);
 
 	if (krecp != NULL)
 		*krecp = NULL;
diff --git a/src/vos/vos_ts.c b/src/vos/vos_ts.c
index f69149e4739..9e47d100097 100644
--- a/src/vos/vos_ts.c
+++ b/src/vos/vos_ts.c
@@ -1,5 +1,5 @@
 /**
- * (C) Copyright 2020-2022 Intel Corporation.
+ * (C) Copyright 2020-2023 Intel Corporation.
  *
  * SPDX-License-Identifier: BSD-2-Clause-Patent
  */
@@ -262,7 +262,7 @@ vos_ts_evict_lru(struct vos_ts_table *ts_table, struct vos_ts_entry **entryp,
 int
 vos_ts_set_allocate(struct vos_ts_set **ts_set, uint64_t flags,
 		    uint16_t cflags, uint32_t akey_nr,
-		    const struct dtx_handle *dth)
+		    const struct dtx_handle *dth, bool standalone)
 {
 	const struct dtx_id	*tx_id = NULL;
 	uint32_t		 size;
@@ -271,7 +271,7 @@ vos_ts_set_allocate(struct vos_ts_set **ts_set, uint64_t flags,
 				 VOS_COND_UPDATE_MASK | VOS_OF_COND_PER_AKEY;
 
-	vos_kh_clear();
+	vos_kh_clear(standalone);
 
 	*ts_set = NULL;
 	if (!dtx_is_valid_handle(dth)) {
@@ -313,7 +313,7 @@ vos_ts_set_upgrade(struct vos_ts_set *ts_set)
 	if (!vos_ts_in_tx(ts_set))
 		return;
 
-	ts_table = vos_ts_table_get();
+	ts_table = vos_ts_table_get(false);
 
 	for (i = 0; i < ts_set->ts_init_count; i++) {
 		set_entry = &ts_set->ts_entries[i];
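Note on vos_ts.c above: every vos_ts_table_get() call becomes vos_ts_table_get(false), because the read/write timestamp cache exists only in regular xstream TLS and the sysdb pool is expected never to reach these paths (the "sysdb pool should not come here" comment in vos_ts.h below makes the same point). A self-contained sketch of how that invariant can be made explicit with a debug guard; ts_table_get and ts_table_get_checked are invented names:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct ts_table { int dummy; };

static __thread struct ts_table regular_table;

/* Only the regular TLS owns a timestamp table; the standalone
 * (sysdb) TLS intentionally has none. */
static struct ts_table *
ts_table_get(bool standalone)
{
	if (standalone)
		return NULL;
	return &regular_table;
}

/* Guarded accessor: documents, and enforces in debug builds, the
 * invariant that timestamp-cache paths never run for sysdb pools. */
static struct ts_table *
ts_table_get_checked(bool standalone)
{
	assert(!standalone && "sysdb pool must not use the timestamp cache");
	return ts_table_get(false);
}

int main(void)
{
	return ts_table_get_checked(false) == NULL;
}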
diff --git a/src/vos/vos_ts.h b/src/vos/vos_ts.h
index 823b51ab136..2772fab2ce2 100644
--- a/src/vos/vos_ts.h
+++ b/src/vos/vos_ts.h
@@ -1,5 +1,5 @@
 /**
- * (C) Copyright 2020-2022 Intel Corporation.
+ * (C) Copyright 2020-2023 Intel Corporation.
  *
  * SPDX-License-Identifier: BSD-2-Clause-Patent
  */
@@ -234,7 +234,7 @@ static inline bool
 vos_ts_lookup_internal(struct vos_ts_set *ts_set, uint32_t type, uint32_t *idx,
 		       struct vos_ts_entry **entryp)
 {
-	struct vos_ts_table	*ts_table = vos_ts_table_get();
+	struct vos_ts_table	*ts_table = vos_ts_table_get(false);
 	struct vos_ts_info	*info = &ts_table->tt_type_info[type];
 	void			*entry;
 	struct vos_ts_set_entry	 set_entry = {0};
@@ -317,7 +317,7 @@ vos_ts_alloc(struct vos_ts_set *ts_set, uint32_t *idx, uint64_t hash)
 	if (!vos_ts_in_tx(ts_set))
 		return NULL;
 
-	ts_table = vos_ts_table_get();
+	ts_table = vos_ts_table_get(false);
 
 	vos_ts_set_get_info(ts_table, ts_set, &info, &hash_offset);
@@ -378,7 +378,7 @@ vos_ts_get_negative(struct vos_ts_set *ts_set, uint64_t hash, bool reset)
 	if (reset)
 		ts_set->ts_init_count--;
 
-	ts_table = vos_ts_table_get();
+	ts_table = vos_ts_table_get(false);
 
 	vos_ts_set_get_info(ts_table, ts_set, &info, &hash_offset);
@@ -490,7 +490,7 @@ vos_ts_set_add(struct vos_ts_set *ts_set, uint32_t *idx, const void *rec,
 			return -DER_BUSY; /** No more room in the set */
 	if (vos_ts_lookup(ts_set, idx, false, &entry)) {
-		vos_kh_clear();
+		vos_kh_clear(false);
 		expected_type = entry->te_info->ti_type;
 		D_ASSERT(expected_type == ts_set->ts_etype);
 		goto set_params;
@@ -498,8 +498,9 @@ vos_ts_set_add(struct vos_ts_set *ts_set, uint32_t *idx, const void *rec,
 calc_hash:
 	if (ts_set->ts_etype > VOS_TS_TYPE_CONT) {
+		/* sysdb pool should not come here */
 		if (ts_set->ts_etype != VOS_TS_TYPE_OBJ) {
-			hash = vos_hash_get(rec, rec_size);
+			hash = vos_hash_get(rec, rec_size, false);
 		} else {
 			daos_unit_oid_t	*oid = (daos_unit_oid_t *)rec;
@@ -591,9 +592,9 @@ vos_ts_set_mark_entry(struct vos_ts_set *ts_set, uint32_t *idx)
  * \param[in]	type	Type of the object
  */
 static inline void
-vos_ts_evict(uint32_t *idx, uint32_t type)
+vos_ts_evict(uint32_t *idx, uint32_t type, bool standalone)
 {
-	struct vos_ts_table	*ts_table = vos_ts_table_get();
+	struct vos_ts_table	*ts_table = vos_ts_table_get(standalone);
 
 	if (ts_table == NULL)
 		return;
@@ -602,9 +603,10 @@ vos_ts_evict(uint32_t *idx, uint32_t type)
 }
 
 static inline bool
-vos_ts_peek_entry(uint32_t *idx, uint32_t type, struct vos_ts_entry **entryp)
+vos_ts_peek_entry(uint32_t *idx, uint32_t type, struct vos_ts_entry **entryp,
+		  bool standalone)
 {
-	struct vos_ts_table	*ts_table = vos_ts_table_get();
+	struct vos_ts_table	*ts_table = vos_ts_table_get(standalone);
 	struct vos_ts_info	*info;
 
 	if (ts_table == NULL)
@@ -640,13 +642,14 @@ vos_ts_table_free(struct vos_ts_table **ts_table);
  * \param[in]	cflags		Check/update flags
  * \param[in]	akey_nr		Number of akeys in operation
  * \param[in]	dth		Optional transaction handle
+ * \param[in]	standalone	use standalone tls
  *
 * \return	0 on success, error otherwise.
 */
 int
 vos_ts_set_allocate(struct vos_ts_set **ts_set, uint64_t flags,
 		    uint16_t cflags, uint32_t akey_nr,
-		    const struct dtx_handle *dth);
+		    const struct dtx_handle *dth, bool standalone);
 
 /** Upgrade any negative entries in the set now that the associated
  * update/punch has committed
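vos_ts.h now threads the standalone flag through the public timestamp-set API as well. The caller-side idiom, as vos_obj_query_key adopts it earlier in this patch, is to derive the flag once from the pool and reuse the same value for every TLS-keyed call. A self-contained model of that idiom; all names here (pool, container, ts_set_allocate, query_key) are invented for illustration:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct pool      { bool sysdb; };
struct container { struct pool *pool; };
struct ts_set    { bool standalone; };

/* Models the shape of vos_ts_set_allocate(..., bool standalone). */
static int
ts_set_allocate(struct ts_set **out, bool standalone)
{
	*out = calloc(1, sizeof(**out));
	if (*out == NULL)
		return -1;
	(*out)->standalone = standalone;
	return 0;
}

static int
query_key(struct container *cont)
{
	/* Derive the flag once from the pool... */
	bool           is_sysdb = cont->pool->sysdb;
	struct ts_set *set;

	/* ...then pass the same value to every TLS-keyed call. */
	if (ts_set_allocate(&set, is_sysdb) != 0)
		return -1;
	printf("allocated %s set\n", set->standalone ? "standalone" : "xstream");
	free(set);
	return 0;
}

int main(void)
{
	struct pool      p = { .sysdb = false };
	struct container c = { .pool = &p };

	return query_key(&c);
}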
diff --git a/utils/cq/words.dict b/utils/cq/words.dict
index 07b9e37cef1..f5dba89a4ee 100644
--- a/utils/cq/words.dict
+++ b/utils/cq/words.dict
@@ -388,7 +388,9 @@ shlex
 simul
 sinfo
 slurm
+slurmd
 slurmctl
+slurmctld
 spdk
 squeue
 src
diff --git a/utils/rpms/daos.spec b/utils/rpms/daos.spec
index 7f2738c4d04..574dc82747f 100644
--- a/utils/rpms/daos.spec
+++ b/utils/rpms/daos.spec
@@ -15,7 +15,7 @@
 Name:          daos
 Version:       2.3.108
-Release:       2%{?relval}%{?dist}
+Release:       4%{?relval}%{?dist}
 Summary:       DAOS Storage Engine
 
 License:       BSD-2-Clause-Patent
@@ -72,11 +72,10 @@ BuildRequires: libisa-l_crypto-devel
 BuildRequires: libisal-devel
 BuildRequires: libisal_crypto-devel
 %endif
-BuildRequires: daos-raft-devel = 0.9.2-1.403.g3d20556%{?dist}
+BuildRequires: daos-raft-devel = 0.10.1-1.408.g9524cdb%{?dist}
 BuildRequires: openssl-devel
 BuildRequires: libevent-devel
 BuildRequires: libyaml-devel
-BuildRequires: lmdb-devel
 BuildRequires: libcmocka-devel
 BuildRequires: valgrind-devel
 BuildRequires: systemd
@@ -217,7 +216,7 @@ Requires: git
 Requires: dbench
 Requires: lbzip2
 Requires: attr
-Requires: golang >= 1.18
+Requires: go >= 1.18
 %if (0%{?suse_version} >= 1315)
 Requires: lua-lmod
 Requires: libcapstone-devel
@@ -558,8 +557,14 @@ getent passwd daos_agent >/dev/null || useradd -s /sbin/nologin -r -g daos_agent
 # No files in a shim package
 
 %changelog
-* Thu Jun 29 2023 Michael MacDonald 2.3.108-2
-- Install golang >= 1.18 as a daos-client-tests dependency
+* Mon Jul 17 2023 Michael MacDonald 2.3.108-4
+- Install go >= 1.18 as a daos-client-tests dependency
+
+* Thu Jul 13 2023 Wang Shilong 2.3.108-3
+- Remove lmdb-devel for MD on SSD
+
+* Wed Jun 28 2023 Li Wei 2.3.108-2
+- Update raft to 0.10.1-1.408.g9524cdb
 
 * Tue Jun 06 2023 Jeff Olivier 2.3.108-1
 - Switch version to 2.3.108
diff --git a/utils/scripts/install-el8.sh b/utils/scripts/install-el8.sh
index e809c7efb61..807c3bab733 100755
--- a/utils/scripts/install-el8.sh
+++ b/utils/scripts/install-el8.sh
@@ -45,7 +45,6 @@ dnf --nodocs install \
     libunwind-devel \
     libuuid-devel \
     libyaml-devel \
-    lmdb-devel \
     Lmod \
     lz4-devel \
     make \
diff --git a/utils/scripts/install-el9.sh b/utils/scripts/install-el9.sh
index acc264068d1..6e9e83ed8e4 100755
--- a/utils/scripts/install-el9.sh
+++ b/utils/scripts/install-el9.sh
@@ -44,7 +44,6 @@ dnf --nodocs install \
     libtool-ltdl-devel \
     libunwind-devel \
     libuuid-devel \
-    lmdb-devel \
     libyaml-devel \
     lz4-devel \
     make \
diff --git a/utils/scripts/install-leap15.sh b/utils/scripts/install-leap15.sh
index 6938139da85..f4c60e8c5e7 100755
--- a/utils/scripts/install-leap15.sh
+++ b/utils/scripts/install-leap15.sh
@@ -45,7 +45,6 @@ dnf --nodocs install \
     libunwind-devel \
     libuuid-devel \
     libyaml-devel \
-    lmdb-devel \
     lua-lmod \
     make \
     maven \
diff --git a/utils/scripts/install-ubuntu.sh b/utils/scripts/install-ubuntu.sh
index 5b5c2694f0d..45f4bee522b 100755
--- a/utils/scripts/install-ubuntu.sh
+++ b/utils/scripts/install-ubuntu.sh
@@ -43,7 +43,6 @@ apt-get install \
     libtool-bin \
     libunwind-dev \
     libyaml-dev \
-    liblmdb-dev \
     locales \
     maven \
     numactl \
diff --git a/utils/utest.yaml b/utils/utest.yaml
index fa41401c539..8566c28126f 100644
--- a/utils/utest.yaml
+++ b/utils/utest.yaml
@@ -112,7 +112,7 @@
 - name: VOS_md_on_ssd
   base: "PREFIX"
   sudo: True
-  required_src: ["src/bio/tests/bio_ut.c"]
+  required_src: ["src/vos/tests/bio_ut.c"]
   tests:
     - cmd: ["bin/vos_tests", "-A", "50"]
       aio: "AIO_7"
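Closing review note: the vos_pool.c hunks are where the two modes diverge. A pool opened or created with VOS_POF_SYSDB is pinned to the PMEM backend and marked standalone before any umem or vea call can consult the store type, and everything else in this patch follows from that one decision. A compilable model of the decision point; the flag value, enum names and store struct below are stand-ins, not the DAOS definitions:

#include <stdbool.h>
#include <stdio.h>

#define POF_SYSDB (1u << 0)            /* stand-in for VOS_POF_SYSDB */

enum md_backend { MD_PMEM, MD_BMEM };  /* stand-ins for the DAOS_MD_* types */

struct store {
	enum md_backend type;
	bool            standalone;
};

/* Models the default chosen by umempobj_get_backend_type(), e.g. a
 * DRAM+WAL backend when running Metadata-on-SSD without PMem. */
static enum md_backend
default_backend(void)
{
	return MD_BMEM;
}

/* Mirrors the new preamble of vos_pmemobj_create()/vos_pmemobj_open():
 * the system DB always uses PMEM mode and the standalone TLS. */
static void
store_init(struct store *s, unsigned int flags)
{
	s->type       = default_backend();
	s->standalone = false;
	if (flags & POF_SYSDB) {
		s->type       = MD_PMEM;
		s->standalone = true;
	}
}

int main(void)
{
	struct store s;

	store_init(&s, POF_SYSDB);
	printf("type=%d standalone=%d\n", s.type, s.standalone);
	return 0;
}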