Merge branch 'master' into mjean/DAOS-16167
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: soak_smoke

Required-githooks: true

Signed-off-by: Maureen Jean <maureen.jean@intel.com>
mjean308 committed Jul 23, 2024
2 parents 9bc61a3 + f53f578 commit cd4c628
Showing 171 changed files with 7,820 additions and 3,471 deletions.
3 changes: 3 additions & 0 deletions SConstruct
@@ -373,6 +373,9 @@ def scons():

deps_env = Environment()

# Silence deprecation warning so it doesn't fail the build
SetOption('warn', ['no-python-version'])

add_command_line_options()

# Scons strips out the environment, however that is not always desirable so add back in
1 change: 1 addition & 0 deletions ci/unit/required_packages.sh
@@ -26,6 +26,7 @@ pkgs="argobots \
numactl-devel \
openmpi$OPENMPI_VER \
patchelf \
pciutils-devel \
pmix \
protobuf-c \
spdk-devel \
13 changes: 13 additions & 0 deletions debian/changelog
@@ -1,3 +1,16 @@
daos (2.7.100-3) unstable; urgency=medium
[ Dalton Bohning ]
* Add pciutils-devel build dep for client-tests package

-- Dalton Bohning <dalton.bohning@intel.com> Thu, 11 Jul 2024 10:00:00 -0800

daos (2.7.100-2) unstable; urgency=medium
[ Tom Nabarro ]
* Add pciutils runtime dep for daos_server lspci call
* Add libpci-dev build dep for pciutils CGO bindings

-- Tom Nabarro <tom.nabarro@intel.com> Thu, 24 Jun 2024 16:55:00 -0000

daos (2.7.100-1) unstable; urgency=medium
[ Phillip Henderson ]
* Bump version to 2.7.100
5 changes: 3 additions & 2 deletions debian/control
@@ -33,7 +33,8 @@ Build-Depends: debhelper (>= 10),
python3-tabulate,
liblz4-dev,
libaio-dev,
libcapstone-dev
libcapstone-dev,
libpci-dev
Standards-Version: 4.1.2
Homepage: https://docs.daos.io/
Vcs-Git: https://github.com/daos-stack/daos.git
@@ -171,7 +172,7 @@ Package: daos-server
Section: net
Architecture: any
Multi-Arch: same
Depends: ${shlibs:Depends}, ${misc:Depends}, openmpi-bin,
Depends: ${shlibs:Depends}, ${misc:Depends}, openmpi-bin, pciutils,
ipmctl (>=03.00.00.0468), libfabric (>= 1.15.1-1), spdk-tools (>= 22.01.2)
Description: The Distributed Asynchronous Object Storage (DAOS) is an open-source
software-defined object store designed from the ground up for
16 changes: 13 additions & 3 deletions docs/admin/administration.md
@@ -44,10 +44,20 @@ severity, message, description, and cause.

|Event|Event type|Severity|Message|Description|Cause|
|:----|:----|:----|:----|:----|:----|
| device\_set\_faulty| INFO\_ONLY| NOTICE or ERROR| Device: <uuid\> set faulty / Device: <uuid\> set faulty failed: <rc\> / Device: <uuid\> auto faulty detect / Device: <uuid\> auto faulty detect failed: <rc\> | Indicates that a device has been either explicitly or automatically set as faulty. Device UUID specified in event data. | Either the DMG set nvme-faulty command was used to explicitly set the device as faulty, or an error threshold was reached on a device, which triggered an auto faulty reaction. |
| device\_media\_error| INFO\_ONLY| ERROR| Device: <uuid\> <error-type\> error logged from tgt\_id:<idx\> | Indicates that a device media error has been detected for a specific target. The error type could be unmap, write, read or checksum (csum). Device UUID and target ID specified in event data. | Media error occurred on backing device. |
| device\_unplugged| INFO\_ONLY| NOTICE| Device: <uuid\> unplugged | Indicates device was physically removed from host. | NVMe SSD physically removed from host. |
| device\_plugged| INFO\_ONLY| NOTICE| Detected hot plugged device: <bdev-name\> | Indicates device was physically inserted into host. | NVMe SSD physically added to host. |
| device\_replace| INFO\_ONLY| NOTICE or ERROR| Replaced device: <uuid\> with device: <uuid\> [failed: <rc\>] | Indicates that a faulty device was replaced with a new device and if the operation failed. The old and new device IDs as well as any non-zero return code are specified in the event data. | Device was replaced using DMG nvme replace command. |
| device\_link\_speed\_changed| NOTICE or WARNING| NVMe PCIe device at <pci-address\> port-<idx\>: link speed changed to <transfer-rate\> (max <transfer-rate\>)| Indicates that an NVMe device link speed has changed. The negotiated and maximum device link speeds are indicated in the event message field and the severity is set to warning if the negotiated speed is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data.| Either device link speed was previously downgraded and has returned to maximum or link speed has downgraded to a value that is less than its maximum capability.|
| device\_link\_width\_changed| NOTICE or WARNING| NVMe PCIe device at <pci-address\> port-<idx\>: link width changed to <pcie-link-lanes\> (max <pcie-link-lanes\>)| Indicates that an NVMe device link width has changed. The negotiated and maximum device link widths are indicated in the event message field and the severity is set to warning if the negotiated width is not at maximum capability (and notice level severity if at maximum). No other specific information is included in the event data.| Either device link width was previously downgraded and has returned to maximum or link width has downgraded to a value that is less than its maximum capability.|
| engine\_format\_required|INFO\_ONLY|NOTICE|DAOS engine <idx\> requires a <type\> format|Indicates engine is waiting for allocated storage to be formatted on instance <idx\> with dmg tool. <type\> can be either SCM or Metadata.|DAOS server attempts to bring up an engine that has unformatted storage.|
| engine\_died| STATE\_CHANGE| ERROR| DAOS engine <idx\> exited unexpectedly: <error\> | Indicates engine instance <idx\> exited unexpectedly. <error\> describes the exit state returned from the exited daos\_engine process.| N/A |
| engine\_asserted| STATE\_CHANGE| ERROR| TBD| Indicates engine instance <idx> threw a runtime assertion, causing a crash. | An unexpected internal state resulted in assert failure. |
| engine\_asserted| STATE\_CHANGE| ERROR| TBD| Indicates engine instance <idx\> threw a runtime assertion, causing a crash. | An unexpected internal state resulted in assert failure. |
| engine\_clock\_drift| INFO\_ONLY | ERROR| clock drift detected| Indicates CART comms layer has detected clock skew between engines.| NTP may not be syncing clocks across DAOS system. |
| engine\_join\_failed| INFO\_ONLY| ERROR | DAOS engine <idx\> (rank <rank\>) was not allowed to join the system | Join operation failed for the given engine instance ID and rank (if assigned). | Reason should be provided in the extended info field of the event data. |
| pool\_corruption\_detected| INFO\_ONLY| ERROR | Data corruption detected| Indicates a corruption in pool data has been detected. The event fields will contain pool and container UUIDs. | A corruption was found by the checksum scrubber. |
| pool\_destroy\_deferred| INFO\_ONLY| WARNING | pool:<uuid\> destroy is deferred| Indicates a destroy operation has been deferred. | Pool destroy in progress but not complete. |
| pool\_rebuild\_started| INFO\_ONLY| NOTICE | Pool rebuild started.| Indicates a pool rebuild has started. The event data field contains pool map version and pool operation identifier. | When a pool rank becomes unavailable a rebuild will be triggered. |
| pool\_rebuild\_finished| INFO\_ONLY| NOTICE| Pool rebuild finished.| Indicates a pool rebuild has finished successfully. The event data field includes the pool map version and pool operation identifier. | N/A|
| pool\_rebuild\_failed| INFO\_ONLY| ERROR| Pool rebuild failed: <rc\>.| Indicates a pool rebuild has failed. The event data field includes the pool map version and pool operation identifier. <rc\> provides a string representation of DER code.| N/A |
@@ -59,7 +69,7 @@ severity, message, description, and cause.
| swim\_rank\_dead| STATE\_CHANGE| NOTICE| SWIM rank marked as dead.| The SWIM protocol has detected the specified rank is unresponsive.| A remote DAOS engine has become unresponsive.|
| system\_start\_failed| INFO\_ONLY| ERROR| System startup failed, <errors\>| Indicates that a user initiated controlled startup failed. <errors\> shows which ranks failed.| Ranks failed to start.|
| system\_stop\_failed| INFO\_ONLY| ERROR| System shutdown failed during <action\> action, <errors\> | Indicates that a user initiated controlled shutdown failed. <action\> identifies the failing shutdown action and <errors\> shows which ranks failed.| Ranks failed to stop.|

| system\_fabric\_provider\_changed| NOTICE| System fabric provider has changed: <old-provider\> -> <new-provider\>| Indicates that the system-wide fabric provider has been updated. No other specific information is included in event data.| A system-wide fabric provider change has been intentionally applied to all joined ranks.|

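The severity rule described for the two link-change events (warning when the negotiated value is below the maximum, notice when it is at the maximum) can be sketched in C as follows. This is only an illustration: the `sketch_` names, the enum, and the `x<lanes>` rendering of link width are assumptions made here, not identifiers or formats taken from the DAOS source.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical severity levels; DAOS uses its own RAS_SEV_* constants. */
enum sketch_sev { SKETCH_SEV_NOTICE, SKETCH_SEV_WARNING };

/* NOTICE when the negotiated value equals the maximum, WARNING otherwise. */
enum sketch_sev
sketch_link_sev(unsigned int negotiated, unsigned int max)
{
	return negotiated == max ? SKETCH_SEV_NOTICE : SKETCH_SEV_WARNING;
}

/*
 * Build a link-width message in the documented shape. The "x<lanes>"
 * formatting is an assumption. Returns the snprintf() result.
 */
int
sketch_width_msg(char *buf, size_t len, const char *pci_addr, unsigned int port,
		 unsigned int negotiated, unsigned int max)
{
	assert(buf != NULL && pci_addr != NULL);
	return snprintf(buf, len,
			"NVMe PCIe device at %s port-%u: link width changed to x%u (max x%u)",
			pci_addr, port, negotiated, max);
}
```

Under this rule, a device that trains at x2 in an x4-capable slot would be reported at warning level, and a later return to x4 would be reported at notice level.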
## System Logging

@@ -578,7 +588,7 @@ The engine's NVMe config (produced during format) then contains the following
JSON to apply the criteria:
```json
[tanabarr@wolf-310 ~]$ cat /mnt/daos0/daos_nvme.conf
cat /mnt/daos0/daos_nvme.conf
{
"daos_data": {
"config": [
23 changes: 16 additions & 7 deletions src/bio/bio_device.c
@@ -382,6 +382,7 @@ struct pci_dev_opts {
bool finished;
int *socket_id;
char **pci_type;
char **pci_cfg;
int status;
};

@@ -391,6 +392,7 @@ pci_device_cb(void *ctx, struct spdk_pci_device *pci_device)
struct pci_dev_opts *opts = ctx;
const char *device_type;
int len;
int rc;

if (opts->status != 0)
return;
@@ -422,6 +424,13 @@ pci_device_cb(void *ctx, struct spdk_pci_device *pci_device)
opts->status = -DER_NOMEM;
return;
}

rc = spdk_pci_device_cfg_read(pci_device, *opts->pci_cfg, NVME_PCI_CFG_SPC_MAX_LEN, 0);
if (rc != 0) {
D_ERROR("Failed to read config space of device (%s)\n", spdk_strerror(-rc));
opts->status = -DER_INVAL;
return;
}
}

static int
@@ -443,6 +452,7 @@ fetch_pci_dev_info(struct nvme_ctrlr_t *w_ctrlr, const char *tr_addr)
opts.pci_addr = pci_addr;
opts.socket_id = &w_ctrlr->socket_id;
opts.pci_type = &w_ctrlr->pci_type;
opts.pci_cfg = &w_ctrlr->pci_cfg;

spdk_pci_for_each_device(&opts, pci_device_cb);

@@ -485,6 +495,10 @@ alloc_ctrlr_info(uuid_t dev_id, char *dev_name, struct bio_dev_info *b_info)
if (b_info->bdi_ctrlr->nss == NULL)
return -DER_NOMEM;

D_ALLOC(b_info->bdi_ctrlr->pci_cfg, NVME_PCI_CFG_SPC_MAX_LEN);
if (b_info->bdi_ctrlr->pci_cfg == NULL)
return -DER_NOMEM;

/* Namespace capacity by direct query of SPDK bdev object */
blk_sz = spdk_bdev_get_block_size(bdev);
nr_blks = spdk_bdev_get_num_blocks(bdev);
@@ -497,13 +511,8 @@ alloc_ctrlr_info(uuid_t dev_id, char *dev_name, struct bio_dev_info *b_info)
return rc;
}

/* Fetch socket ID and PCI device type by enumerating spdk_pci_device list */
rc = fetch_pci_dev_info(b_info->bdi_ctrlr, b_info->bdi_traddr);
if (rc != 0) {
return rc;
}

return 0;
/* Fetch PCI details by enumerating spdk_pci_device list */
return fetch_pci_dev_info(b_info->bdi_ctrlr, b_info->bdi_traddr);
}
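
The hunks above only copy the raw PCI configuration space into `pci_cfg`; how that buffer is decoded is not shown in this excerpt (the changelog mentions pciutils bindings). As a rough illustration of what such a buffer contains, the sketch below walks the standard capability list and extracts the negotiated and maximum link speed and width from the PCI Express capability. All `sketch_`/`SKETCH_` names are hypothetical, and the register offsets follow the PCI and PCIe specifications rather than any DAOS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PCI_STATUS_OFF  0x06 /* Status register */
#define SKETCH_PCI_STATUS_CAPS 0x10 /* Status bit 4: capability list present */
#define SKETCH_PCI_CAP_PTR_OFF 0x34 /* Pointer to first capability */
#define SKETCH_PCI_CAP_ID_EXP  0x10 /* PCI Express capability ID */
#define SKETCH_PCIE_LNKCAP_OFF 0x0C /* Link Capabilities (32-bit) */
#define SKETCH_PCIE_LNKSTA_OFF 0x12 /* Link Status (16-bit) */

struct sketch_link_info {
	unsigned int cur_speed; /* encoded generation, e.g. 3 means 8 GT/s */
	unsigned int max_speed;
	unsigned int cur_width; /* number of lanes */
	unsigned int max_width;
};

static inline uint16_t
sketch_rd16(const uint8_t *cfg, size_t off)
{
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static inline uint32_t
sketch_rd32(const uint8_t *cfg, size_t off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Returns 0 on success, -1 if no PCI Express capability can be read. */
int
sketch_parse_link(const uint8_t *cfg, size_t len, struct sketch_link_info *out)
{
	size_t pos;
	int    guard = 48; /* bound the walk in case the list loops */

	assert(cfg != NULL && out != NULL);
	if (len < 0x40)
		return -1;
	if ((sketch_rd16(cfg, SKETCH_PCI_STATUS_OFF) & SKETCH_PCI_STATUS_CAPS) == 0)
		return -1;

	pos = cfg[SKETCH_PCI_CAP_PTR_OFF] & 0xFC;
	while (pos >= 0x40 && pos + 2 <= len && guard-- > 0) {
		if (cfg[pos] == SKETCH_PCI_CAP_ID_EXP) {
			uint32_t cap;
			uint16_t sta;

			if (pos + SKETCH_PCIE_LNKSTA_OFF + 2 > len)
				return -1;
			cap            = sketch_rd32(cfg, pos + SKETCH_PCIE_LNKCAP_OFF);
			sta            = sketch_rd16(cfg, pos + SKETCH_PCIE_LNKSTA_OFF);
			out->max_speed = cap & 0xF;
			out->max_width = (cap >> 4) & 0x3F;
			out->cur_speed = sta & 0xF;
			out->cur_width = (sta >> 4) & 0x3F;
			return 0;
		}
		pos = cfg[pos + 1] & 0xFC; /* next capability pointer */
	}
	return -1;
}
```

Comparing `cur_speed` with `max_speed` and `cur_width` with `max_width` is the kind of check that the link-change events documented in this commit report on.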

int
29 changes: 13 additions & 16 deletions src/bio/bio_monitor.c
@@ -221,15 +221,14 @@ bio_dev_set_faulty(struct bio_xs_context *xs, uuid_t dev_uuid)
rc = dss_abterr2der(rc);

if (rc == 0)
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO,
RAS_SEV_NOTICE, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL,
"Dev: "DF_UUID" set faulty\n", DP_UUID(dev_uuid));
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO, RAS_SEV_NOTICE, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL,
"Device: " DF_UUID " set faulty\n", DP_UUID(dev_uuid));
else
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO,
RAS_SEV_ERROR, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL,
"Dev: "DF_UUID" set faulty failed: %d\n", DP_UUID(dev_uuid), rc);
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO, RAS_SEV_ERROR, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL,
"Device: " DF_UUID " set faulty failed: %d\n", DP_UUID(dev_uuid),
rc);
return rc;
}

@@ -742,16 +741,14 @@ auto_faulty_detect(struct bio_blobstore *bbs)
D_ERROR("Failed to set FAULTY state. "DF_RC"\n", DP_RC(rc));

if (rc == 0)
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO,
RAS_SEV_NOTICE, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL,
"Dev: "DF_UUID" auto faulty detect\n",
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO, RAS_SEV_NOTICE, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL,
"Device: " DF_UUID " auto faulty detect\n",
DP_UUID(bbs->bb_dev->bb_uuid));
else
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO,
RAS_SEV_ERROR, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL,
"Dev: "DF_UUID" auto faulty detect failed: %d\n",
ras_notify_eventf(RAS_DEVICE_SET_FAULTY, RAS_TYPE_INFO, RAS_SEV_ERROR, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, NULL,
"Device: " DF_UUID " auto faulty detect failed: %d\n",
DP_UUID(bbs->bb_dev->bb_uuid), rc);
}

2 changes: 1 addition & 1 deletion src/bio/bio_wal.c
@@ -29,7 +29,7 @@ D_CASSERT(sizeof(struct wal_header) <= WAL_BLK_SZ);
D_CASSERT(sizeof(struct wal_trans_tail) == WAL_CSUM_LEN);

#define WAL_MIN_CAPACITY (8192 * WAL_BLK_SZ) /* Minimal WAL capacity, in bytes */
#define WAL_MAX_TRANS_BLKS 2048 /* Maximal blocks used by a transaction */
#define WAL_MAX_TRANS_BLKS 4096 /* Maximal blocks used by a transaction */
#define WAL_HDR_BLKS 1 /* Ensure atomic header write */

#define META_BLK_SZ WAL_BLK_SZ
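
The effect of raising `WAL_MAX_TRANS_BLKS` from 2048 to 4096 can be worked out from the defines in this hunk: `WAL_MIN_CAPACITY` is 8192 blocks, so a maximal transaction grows from a quarter to half of a minimum-size WAL. The sketch below does that arithmetic. The `SKETCH_` names are hypothetical, and the 4096-byte block size is an assumption, since the value of `WAL_BLK_SZ` is not shown here.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_WAL_BLK_SZ   4096u /* assumption: WAL_BLK_SZ is not shown in the hunk */
#define SKETCH_WAL_MIN_BLKS 8192u /* from WAL_MIN_CAPACITY = 8192 * WAL_BLK_SZ */

/* Upper bound, in bytes, of a single transaction for a given block limit. */
uint64_t
sketch_max_trans_bytes(uint32_t max_trans_blks, uint32_t blk_sz)
{
	return (uint64_t)max_trans_blks * blk_sz;
}

/* How many maximal transactions fit in a minimum-capacity WAL. */
uint32_t
sketch_trans_per_min_wal(uint32_t max_trans_blks)
{
	assert(max_trans_blks != 0);
	return SKETCH_WAL_MIN_BLKS / max_trans_blks;
}
```

With the assumed block size, `sketch_max_trans_bytes(4096, SKETCH_WAL_BLK_SZ)` is 16 MiB, compared with 8 MiB for the previous limit of 2048 blocks.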
5 changes: 2 additions & 3 deletions src/bio/bio_xstream.c
@@ -747,9 +747,8 @@ bio_bdev_event_cb(enum spdk_bdev_event_type type, struct spdk_bdev *bdev,
D_ASSERT(d_bdev->bb_desc != NULL);
d_bdev->bb_removed = 1;

ras_notify_eventf(RAS_DEVICE_UNPLUGGED, RAS_TYPE_INFO,
RAS_SEV_NOTICE, NULL, NULL, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, "Dev: "DF_UUID" unplugged\n",
ras_notify_eventf(RAS_DEVICE_UNPLUGGED, RAS_TYPE_INFO, RAS_SEV_NOTICE, NULL, NULL, NULL,
NULL, NULL, NULL, NULL, NULL, NULL, "Device: " DF_UUID " unplugged\n",
DP_UUID(d_bdev->bb_uuid));

/* The bio_bdev is still under construction */