Test merge from release/2.4 [DO NOT MERGE] #14355

Closed
wants to merge 50 commits

Conversation

mlawsonca
Collaborator

Run-GHA: true

chowes and others added 30 commits April 10, 2024 13:30
…iptor

libfuse supports opening /dev/fuse and passing the file descriptor as the
mountpoint. In some cases, realpath may not work for these file descriptors,
and so we should ignore ENOENT errors and instead check that we can get file
descriptor attributes from the given path.
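
A minimal sketch of that check; check_fd_mountpoint() and the sscanf()/fcntl()
calls appear in this PR's diff, but the body below is illustrative rather than
the exact patch:

#include <fcntl.h>
#include <stdio.h>

/*
 * If the mountpoint looks like "/dev/fd/N", skip realpath() (which may fail
 * with ENOENT) and instead verify that N is a usable file descriptor.
 * Returns the descriptor on success, -1 otherwise.
 */
static int
check_fd_mountpoint(const char *mountpoint)
{
        unsigned int fd  = 0;
        int          len = 0;

        if (sscanf(mountpoint, "/dev/fd/%u%n", &fd, &len) != 1)
                return -1; /* not a file-descriptor mountpoint */

        if (mountpoint[len] != '\0')
                return -1; /* trailing characters after /dev/fd/N */

        if (fcntl(fd, F_GETFD) == -1)
                return -1; /* not a valid open descriptor */

        return (int)fd;
}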

Change-Id: I2e9aad0e11a4c6f27ec2c4b1aeb75fc651d2540d
The setuid, setgid, and sticky bits can cause fatal errors when the datamover
tool sets file permissions after copying a file, since these bits are not
supported by DFS.  We can simply ignore these bits when calling dfs_chmod.
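
A small illustrative helper for the masking described above (the dfs_chmod()
call itself is not shown; only the bit mask is the point):

#include <sys/stat.h>

/* Illustrative sketch: strip the bits DFS cannot store (setuid, setgid,
 * sticky) before handing the mode to dfs_chmod(), so the datamover does not
 * treat the resulting error as fatal. */
static mode_t
dfs_safe_mode(mode_t mode)
{
        return mode & ~(S_ISUID | S_ISGID | S_ISVTX);
}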

Change-Id: Ibf2b6d793f95dd59c902c8d847bc087fb479c5ea
To prevent known races caused by the lack of
locking in the Glibc environment APIs (getenv()/[uns]setenv()/
putenv()/clearenv()), they have been overloaded and
strengthened in Gurt with hooks that now all use a common
lock/mutex.

Libgurt is the preferred place for this as it is the lowest
layer in DAOS, so it is the earliest to be loaded; this
ensures the hooks are installed as early as possible and
avoids the need for LD_PRELOAD.

This addresses the main lack of multi-thread protection
in the Glibc APIs but does not handle all unsafe use-cases
(such as the change/removal of an env var whose value address
has already been grabbed by a previous getenv(), ...).
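
The gurt code itself is not shown here; the sketch below only illustrates the
pattern of serializing the Glibc environment APIs behind a single lock, with
hypothetical names:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; the real gurt hooks differ.  The point is simply that
 * every environment accessor funnels through one process-wide lock, and the
 * getter returns a copy so callers never hold a live environment pointer. */
static pthread_mutex_t env_lock = PTHREAD_MUTEX_INITIALIZER;

static char *
locked_getenv(const char *name)
{
        char *copy = NULL;
        char *val;

        pthread_mutex_lock(&env_lock);
        val = getenv(name);
        if (val != NULL)
                copy = strdup(val);
        pthread_mutex_unlock(&env_lock);

        return copy; /* caller frees; NULL if unset or out of memory */
}

static int
locked_setenv(const char *name, const char *value, int overwrite)
{
        int rc;

        pthread_mutex_lock(&env_lock);
        rc = setenv(name, value, overwrite);
        pthread_mutex_unlock(&env_lock);

        return rc;
}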

Change-Id: I38cda09746ddb4e79f0297fee26c2a22e1cb881b
Signed-off-by: Bruno Faccini <bruno.faccini@intel.com>
Change-Id: Ic0eeee9df2f0ef29f3f3f047080fdce109af71bf
TESTED=https://paste.googleplex.com/6208972604833792
BUG=311738671

Change-Id: Ia6658d7c99c8d21c35d724b86fa2c1c48b41069f
The upstream 2.4 release has support for storing engine
metadata outside of tmpfs, but it is tied to the new
MD-on-SSD feature preview. With some small adjustments
to the code, we can enable external metadata without
MD-on-SSD.

Required-githooks: true

Change-Id: If3e728a2db7a4994572bbe53c92654f2e9b01ee0
Signed-off-by: Michael MacDonald <mjmac@google.com>
- D_QUOTA_RPCS environment variable added. When set, it limits the number of in-flight RPCs being sent out by the process.
- RPCs that exceed the quota limit (if set) will now be queued by the sender (see the sketch below).
- Quota support code added to handle and track the resources.
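
A simplified sketch of the quota bookkeeping described above; the real CaRT
code differs, and every name except D_QUOTA_RPCS is illustrative:

#include <pthread.h>
#include <stdbool.h>

/* If D_QUOTA_RPCS is set, an RPC may be sent only while the in-flight count
 * is below the quota; otherwise the sender parks it on a queue and releases
 * it when an earlier RPC completes. */
struct rpc_quota {
        pthread_mutex_t q_lock;
        int             q_limit;    /* value of D_QUOTA_RPCS, 0 = unlimited */
        int             q_inflight; /* RPCs currently on the wire */
};

/* Returns true if the caller may send now, false if it must queue the RPC. */
static bool
quota_try_acquire(struct rpc_quota *q)
{
        bool ok;

        pthread_mutex_lock(&q->q_lock);
        ok = (q->q_limit == 0 || q->q_inflight < q->q_limit);
        if (ok)
                q->q_inflight++;
        pthread_mutex_unlock(&q->q_lock);
        return ok;
}

/* Called on RPC completion; the real code would also dispatch a queued RPC. */
static void
quota_release(struct rpc_quota *q)
{
        pthread_mutex_lock(&q->q_lock);
        if (q->q_inflight > 0)
                q->q_inflight--;
        pthread_mutex_unlock(&q->q_lock);
}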

Required-githooks: true

Signed-off-by: Alexander A Oganezov <alexander.a.oganezov@intel.com>
Adds some cart-level metrics for RPC quota
exceeded and RPC queue depth.

Required-githooks: true
Change-Id: I5760c255e13ca9a70d352017cae2f6bcee5a6959
Signed-off-by: Michael MacDonald <mjmac@google.com>
Matches new default in 2.6+; aligns default value with
standard tuning practices.

Required-githooks: true
Change-Id: I817927a160fc3dbb2c60a12107da668147e78706
Signed-off-by: Michael MacDonald <mjmac@google.com>
It should be part of server build, not tests

Required-githooks: true

Change-Id: I28b537e1ea7c32a323036c3ec935517ec97ad80c
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
This PR is a subset of PR #13250, which allows thread-safe management of environment variables; it has been split into smaller PRs to facilitate the review process.
This PR mainly adds thread-safe environment variable management functions.
It also removes and replaces the old non-thread-safe custom environment management functions.
Finally, it replaces the setenv() function with d_setenv().

Required-githooks: true

Change-Id: Ife6690e2c63dd6c47279a2ac8c3c5a3da5cf8213
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Fix a regression in the d_getenv_xxx() functions used to retrieve integer
environment variables: support strings representing signed integers.

Required-githooks: true

Change-Id: I7a7f84fe17378ffca1cc0179e1119c1f17a3c4da
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Replace getenv() function with d_agetenv_str() and d_freeenv_str()
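
A usage sketch; the signatures are assumed from the commit messages
(an allocate-and-copy getter returning 0 on success, plus a matching free),
not taken from the header:

#include <stdio.h>
#include <gurt/common.h> /* assumed location of the d_*env_str() helpers */

/* Usage sketch only: unlike getenv(), the value is an allocated copy, so it
 * stays valid even if another thread later changes the variable. */
static void
print_log_mask(void)
{
        char *mask = NULL;

        if (d_agetenv_str(&mask, "D_LOG_MASK") == 0 && mask != NULL) {
                printf("D_LOG_MASK=%s\n", mask);
                d_freeenv_str(&mask);
        }
}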

Required-githooks: true

Change-Id: I6a3e3fafc82327c091bfe96bea3e5f0ef5bece48
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Required-githooks: true

Change-Id: I886d130eb20194a1870579bd47ade2b6e4b3b35a
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
…13053)

Allow metadata caching even when the file is open. This was
initially disabled due to conflicts with the interception library;
however, dfuse now tracks interception-library use, so it is possible
to disable caching only when the interception library is in use rather
than all the time.

Required-githooks: true

Change-Id: Ida03a854030f6b9ded24c5465e0f1126fcba310e
Signed-off-by: Ashley Pittman ashley.m.pittman@intel.com
FUSE will call this often to read non-existent xattrs for every write request,
so short-circuit these to avoid server round-trips.
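
A sketch of the general short-circuit shape using the libfuse low-level API;
the specific prefix test below is an assumption for illustration, not
necessarily the dfuse change:

#define FUSE_USE_VERSION 35 /* any libfuse3 API level works for this sketch */

#include <errno.h>
#include <string.h>
#include <fuse3/fuse_lowlevel.h>

/* The kernel probes certain xattrs (e.g. "security.capability") on every
 * write; if the filesystem never stores such attributes it can answer
 * ENODATA locally instead of issuing a server round-trip. */
static void
sketch_getxattr(fuse_req_t req, fuse_ino_t ino, const char *name, size_t size)
{
        (void)ino;
        (void)size;

        if (strncmp(name, "security.", 9) == 0) {
                fuse_reply_err(req, ENODATA);
                return;
        }
        /* ... otherwise fall through to the normal remote lookup ... */
        fuse_reply_err(req, ENOTSUP); /* placeholder for the real path */
}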

Required-githooks: true

Change-Id: I3337b1724f237cc50a5a537e0844f05f0ed9cc61
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
* DAOS-14981 gurt: restore d_getenv_int undefined symbol

Restore missing plain function d_getenv_int() to fix missing symbol with
libdaos.

Required-githooks: true

Change-Id: I86d5c2f5d4d8bbd3c4ab3fdef70ffc5b41ce0921
Signed-off-by: Cedric Koch-Hofer <cedric.koch-hofer@intel.com>
Change-Id: I6bf8765142024e3fd404d51f186c830e8af4bca5
getlogin does not work on the GKE pods that host our presubmits.
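
The commit does not show the replacement; the snippet below is only a common
fallback pattern that works in such environments, not necessarily the patch
applied here:

#include <pwd.h>
#include <unistd.h>

/* getlogin() needs a controlling terminal / utmp entry, which GKE pods lack;
 * getpwuid(getuid()) only needs the passwd database. */
static const char *
current_user_name(void)
{
        struct passwd *pw = getpwuid(getuid());

        return pw != NULL ? pw->pw_name : NULL;
}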

BUG=318885377

Change-Id: If4175d8a19b0174d489754659f34d4237cab6e97
Add a STATIC_FUSE option, off by default.  When enabled,
DAOS will link statically with the fuse library.
Also add a developer build.  This needs some work on
the libfuse RPM side.

Change-Id: I976f135af29d4e3da61cad9129ee19cbb419cddb
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
This ensures the dfuse we ship uses the version of
libfuse we want.

Required-githooks: true

Change-Id: I5aca28fdcb0e678fbd19df94cbf7428f5b9d61d2
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
1. The target count calculation should not use pool_tree_count, which might
count targets under other domains and thus corrupt the pool map
during extending.

2. Return the correct error code in migrate_pool_tls_lookup_create() and
mrone_one_fetch.

3. Fix a missing free in regenerate_task_of_type.

Signed-off-by: Di Wang <di.wang@intel.com>
Adds a gauge to measure SWIM delay and a counter
for glitches (temporary network outages).

Change-Id: Ibd85c08ab3e3a38931d795d62270f3e4059d7c67
Required-githooks: true

Change-Id: I854937dd249ad9f7211a3b7d40d3365a3e2f79f2
Signed-off-by: Michael MacDonald <mjmac@google.com>
During migration, choose the minimum of the rebuild stable epoch and
the EC aggregation boundary to make sure the correct data is fetched
during recovery.
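
Stated as code, with illustrative names (a trivial sketch of the rule above):

#include <stdint.h>

typedef uint64_t daos_epoch_t; /* DAOS epochs are 64-bit values */

/* Fetch during recovery at the lower of the rebuild stable epoch and the EC
 * aggregation boundary, per the rule described above. */
static daos_epoch_t
migrate_fetch_epoch(daos_epoch_t rebuild_stable_epoch, daos_epoch_t ec_agg_boundary)
{
        return rebuild_stable_epoch < ec_agg_boundary ?
               rebuild_stable_epoch : ec_agg_boundary;
}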

Add tests to verify the process.

Signed-off-by: Di Wang <di.wang@intel.com>
Use the stable epoch for partial parity updates to make sure
these partial updates are not below the stable epoch boundary;
otherwise both EC and VOS aggregation might operate on
the same recxs at the same time, which can corrupt the data
during rebuild.

During EC aggregation, the un-aggregated epoch on non-leader
parity shards should be considered as well. Otherwise, if the leader
parity fails, it is excluded from the global EC stable epoch
calculation immediately; then, before the leader parity is rebuilt,
the global stable epoch might pass the un-aggregated epoch on the
failed target, and the partial updates on the data shards might be
aggregated by VOS aggregation before EC aggregation, which can
cause data corruption.

Also, choose a shard with a lower fail sequence (fseq) among all
parity shards as the aggregation leader, in case the last parity
cannot be rebuilt in time.

Signed-off-by: Di Wang <di.wang@intel.com>
Add missing properties to the check (for testing purposes) in ds_pool_query_handler.

Add missing DAOS_FAIL_ALWAYS to POOL10.

Clear fail_loc in the MGMT and POOL tests even if DAOS_FAIL_ONCE has been
requested. Other fail_loc-using tests will be cleaned up later.

Change-Id: Ied6c248763ec60fc722a1c636bad08ffff0cc58c
Signed-off-by: Li Wei <wei.g.li@intel.com>
Fix and clean up fail_loc usage in daos_test CONTAINER tests. Also, fix
bugs revealed by the fixed tests:

  - cont_iv_prop_l2g should set DAOS_CO_QUERY_PROP_SCRUB_DIS for
    DAOS_PROP_CO_SCRUBBER_DISABLED.

  - CONT_ACL_UPDATE should update the IV.

Change-Id: I1fa3a25d8283c9e5ef0b7ddaa76febd29b100cfb
Signed-off-by: Li Wei <wei.g.li@intel.com>
Correct some doxygen style formatting that was not valid doxygen.

Change-Id: If332fc006b7ed615903a19f1ee59337322a406c0
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Check RF and other conditions before the retry check, so
a non-allowed write returns failure immediately
instead of retrying endlessly.

Use rebuild/reintegrate_pool_rank in daos_container test
to avoid DER_BUSY failure.

Change-Id: I421defce185a928ebd3e52f59f1b19247d90f420
Signed-off-by: Di Wang <di.wang@intel.com>
* DAOS-14010 rebuild: add delay rebuild

Add "delay rebuild" healing mode, so the delay rebuild process is

1) SWIM detects dead ranks and report to the PS leader, which update
the pool map, i.e. marking the related targets as DOWN.
2) Though the rebuild job will not be scheduled, until there are further
manual pool operations, for example drain, extend, reintegration.
3) Then all these pool operations will be merged into one rebuild job,
then scheduled.

Update placement algothrim to be able to calculate the layout with
merged pool operation.

Abort the rebuild job immediately if it finds further pool map update,
so the current job will be merged to the following rebuild job. So
concurrent pool operation will be allowed, no EBUSY check anymore.

Add various tests to verify the delay rebuild process.

Change-Id: If6f163345938bb7e1ee7550124770babd815c695
Signed-off-by: Di Wang <di.wang@intel.com>
wangdi and others added 20 commits April 10, 2024 13:31
Fix a few typos for delay rebuild.

Change-Id: I9db5c2de7e2773da9dd0cc631f13ebec12fbb6c0
Signed-off-by: Di Wang <di.wang@intel.com>
Address CVEs found in these dependencies by updating to
the latest released versions.

Change-Id: I032403700d6ebb43ba6be519bf0d82cc5eb1ebfb
…ner (#13807)

- add new public function for dfs to set-owner
- add an NLT test for it

TESTED=https://paste.googleplex.com/6210316960006144
BUG=311736144

Required-githooks: true

Change-Id: I9191b09219fbd58de60b75a36eec2f51a2766260
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
Test the provider we are deploying with.

Also fixes NLT after landing the cont chown patch.

Required-githooks: true
Change-Id: I3371be152d509cf1bb5f94cf85cc27b95fb108be
Signed-off-by: Michael MacDonald <mjmac@google.com>
Seeking to SEEK_END is not implemented in libioil.
This causes interception to be disabled with some Python frameworks.

Change-Id: I362d5d1d61449e7b03b2af21460512143547f99d
Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>
Change-Id: Ia4d10688686d992a706da725f7d15db45a418531
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
* DAOS-14845 object: retry migration for retriable failure

To avoid re-running rebuild and reclaim, retry the migration
until there are further pool map changes; in that case, fail
the current rebuild, and the following rebuild will resolve
the failure.

Various fixes for rebuild when the PS leader keeps changing
during rebuild.

Move migrate max ULT control to migrate_obj_iter_cb() to make
sure the max ULT count will not exceed the setting.

Change the yield freq from 128 to 16 to make sure the object

Optimize migrate memory usage (see the sketch below):
- Add max ULT control for all targets on the xstream, so
  the objects being migrated cannot exceed MIGRATE_MAX_ULT.

- Add per-target max ULT control, so each target's migrate
  ULTs cannot exceed MIGRATE_MAX_ULT/dss_tgt_nr.

- Add migrate_cont_open to avoid dsc_cont_open and dsc_pool_open
  for each object and dkey migration.
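
A simplified throttle check illustrating the two caps above; only
MIGRATE_MAX_ULT and dss_tgt_nr are taken from the commit message, everything
else (including the value 512) is illustrative:

#include <stdbool.h>

#define MIGRATE_MAX_ULT 512   /* assumed value, for illustration only */

extern int dss_tgt_nr;        /* targets per engine, named in the commit */

struct migrate_throttle {
        int inflight_total;   /* migration ULTs running on this xstream */
        int inflight_per_tgt; /* migration ULTs running for one target */
};

/* A new migration ULT may be created only if both caps are respected. */
static bool
migrate_ult_allowed(const struct migrate_throttle *t)
{
        return t->inflight_total < MIGRATE_MAX_ULT &&
               t->inflight_per_tgt < MIGRATE_MAX_ULT / dss_tgt_nr;
}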

Change-Id: I3b426542f6a5b196fc0e7cabb680d4ff9b1db65c
Signed-off-by: Di Wang <di.wang@intel.com>
When using server target, daos_metrics wasn't built
because it was buried under a check for client target.
I really need to figure out a better way to specify
targets, but this fixes the immediate issue.

Change-Id: Ifa3e49e42ad95fb96f246e723a5e4ec77f10e4d9
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Allow the suid and sgid bits to be stored via dfs_osetattr.
Even though libdfs does not act on those bits itself, storing them
allows dfuse to support them via the kernel.

The lack of sgid support causes spack to fail over dfuse, as
reported in the Jira ticket.
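
A usage sketch, assuming dfs_osetattr() takes a struct stat plus a
DFS_SET_ATTR_MODE flags mask as in the public DFS header; illustrative only:

#include <sys/stat.h>
#include <daos_fs.h> /* dfs_osetattr(), DFS_SET_ATTR_MODE (assumed here) */

/* With this change the setgid bit in st_mode is preserved rather than
 * rejected, which is what dfuse relies on (e.g. for spack stage dirs). */
static int
set_setgid_dir_mode(dfs_t *dfs, dfs_obj_t *dir)
{
        struct stat stbuf = {0};

        stbuf.st_mode = S_IFDIR | S_ISGID | 0775;
        return dfs_osetattr(dfs, dir, &stbuf, DFS_SET_ATTR_MODE);
}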

Change-Id: I76b41d9b231fa2b7f1d434d6ae06e6252cadc2b4
Signed-off-by: Johann Lombardi <johann.lombardi@gmail.com>
disable CODEOWNERS for google branch
disable upstream hardware tests on branch by default
remove bad merge block
fix ordering of imports
Rename google-changeId.py
set option for dynamic fuse

Backports included here for test fixes
DAOS-15429 test: Fix Go unit tests (#13981)
DAOS-13490 test: Update valgrind suppressions. (#13142)
DAOS-15159 test: add a supression for new valgrind warning in NLT (#13782)
DAOS-14669 test: switch tcp;ofi_rxm testing to tcp (#13365)
DAOS-15548 test: add new valgrind suppression for daos tool (#14081)

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Michael MacDonald <mjmac@google.com>
Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
Signed-off-by: Jerome Soumagne <jerome.soumagne@intel.com>
Example usage:
// Original state
[juszhan_google_com@juszhan-dev daos]$ ls -la /tmp | grep dfuse0
drwxr-xr-x   1 juszhan_google_com juszhan_google_com   120 Apr 17 23:53 dfuse0
-rw-r--r--   1 juszhan_google_com juszhan_google_com  9146 Apr 17 23:53 dfuse0.log

// Change group to a known group id
[juszhan_google_com@juszhan-dev daos]$ getent group 1001
tmpuserjohn:x:1001:

[juszhan_google_com@juszhan-dev daos]$ run_cmd daos fs chown pool cont -g 1001 --dfs-path=/
Running DAOS_AGENT_DRPC_DIR=/tmp/agent daos fs chown pool cont -g 1001 --dfs-path=/

[juszhan_google_com@juszhan-dev daos]$ ls -la /tmp | grep dfuse0
drwxr-xr-x   1 juszhan_google_com tmpuserjohn          120 Apr 17 23:53 dfuse0
-rw-r--r--   1 juszhan_google_com juszhan_google_com  9146 Apr 17 23:53 dfuse0.log

// Change group to a nonexistent group id
[juszhan_google_com@juszhan-dev daos]$ getent group 1002

[juszhan_google_com@juszhan-dev daos]$ run_cmd daos fs chown pool cont -g 1002 --dfs-path=/
Running DAOS_AGENT_DRPC_DIR=/tmp/agent daos fs chown pool cont -g 1002 --dfs-path=/

[juszhan_google_com@juszhan-dev daos]$ ls -la /tmp | grep dfuse0
drwxr-xr-x   1 juszhan_google_com               1002   120 Apr 17 23:53 dfuse0
-rw-r--r--   1 juszhan_google_com juszhan_google_com  9146 Apr 17 23:53 dfuse0.log

Required-githooks: true

Signed-off-by: Justin Zhang <juszhan@google.com>
As requested by the Jira ticket, add a new I/O forwarding mechanism,
dss_chore, to avoid creating a ULT for every forwarding task.

  - Forwarding of object I/O and DTX RPCs is converted to chores.

  - Cancelation is not implemented, because the I/O forwarding tasks
    themselves do not support cancelation yet.

  - In certain engine configurations, some xstreams do not need to
    initialize dx_chore_queue. This is left to future work.

Signed-off-by: Li Wei <wei.g.li@intel.com>
When dss_chore.cho_func returns DSS_CHORE_DONE, the dss_chore object may
have been freed already. For instance, in the dtx_rpc_helper case,
dtx_check may have already returned, freeing (strictly speaking,
releasing) its stack frame that contains the dca.dca_chore object.
Hence, after calling chore->cho_func, dss_chore_queue_ult should only
dereference chore if the return value is DSS_CHORE_YIELD.
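
A minimal sketch of that rule, not the engine code; only DSS_CHORE_DONE and
DSS_CHORE_YIELD come from the commit text, and the types here are simplified:

#include <stddef.h>

enum dss_chore_status {
        DSS_CHORE_DONE,  /* finished; the chore may already be freed by its owner */
        DSS_CHORE_YIELD, /* wants to run again later */
};

struct dss_chore {
        struct dss_chore      *next;
        enum dss_chore_status (*cho_func)(struct dss_chore *chore);
};

static void
chore_queue_drain(struct dss_chore *head)
{
        struct dss_chore *chore = head;

        while (chore != NULL) {
                /* Save the link before running the chore: once cho_func()
                 * returns DSS_CHORE_DONE the object must not be touched. */
                struct dss_chore *next = chore->next;

                if (chore->cho_func(chore) == DSS_CHORE_YIELD) {
                        /* Safe to dereference only in this branch; the real
                         * code re-appends the chore to the queue here. */
                        chore->next = NULL;
                }
                chore = next;
        }
}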

Signed-off-by: Li Wei <wei.g.li@intel.com>
Updates control plane tools to set a context in a logger
for ease of debug/trace logging.

Signed-off-by: Michael MacDonald <mjmac@google.com>
This commit comprises two separate patches to enable optional
collection and export of client-side telemetry.

The daos_agent configuration file includes new parameters to control
collection and export of per-client telemetry. If the telemetry_port option
is set, then per-client telemetry will be published in Prometheus format
for real-time sampling of client processes. By default, the client telemetry
will be automatically cleaned up on client exit, but may be optionally
retained for some amount of time after client exit in order to allow for
a final sample to be read.

Example daos_agent.yml updates:
telemetry_port: 9192 # export on port 9192
telemetry_enable: true # enable client telemetry for all connected clients
telemetry_retain: 1m # retain metrics for 1 minute after client exit

If telemetry_enable is false (default), client telemetry may be enabled on
a per-process basis by setting D_CLIENT_METRICS_ENABLE=1 in the
environment for clients that should collect telemetry.

Notes from the first patch by Di:

Move TLS to common, so both client and server can have TLS
that metrics can be attached to.

Add object metrics on the client side, enabled by
exporting D_CLIENT_METRICS_ENABLE=1. Client metrics are organized
as "/jobid/pid/xxxxx".

During each DAOS thread's initialization, another shmem segment
(pid/xxx) is created, to which all metrics of the thread are attached.
This segment is destroyed once the thread exits, though
if D_CLIENT_METRICS_RETAIN is set, these client metrics are
retained and can be retrieved by
daos_metrics --jobid
Add D_CLIENT_METRICS_DUMP_PATH to dump metrics from the current thread
once it exits.

Some fixes in telemetry for conv_ptr when re-opening the
shared memory.

Add a daos_metrics --jobid XXX option to retrieve all metrics
of the job.

Includes some useful ftest updates from the following commit:
* DAOS-11626 test: Adding MD on SSD metrics tests (#13661)
Adding tests for WAL commit, reply, and checkpoint metrics.

Signed-off-by: Phil Henderson <phillip.henderson@intel.com>
Signed-off-by: Michael MacDonald <mjmac@google.com>
Signed-off-by: Di Wang <di.wang@intel.com>
Co-authored-by: Phil Henderson <phillip.henderson@intel.com>
Co-authored-by: Di Wang <di.wang@intel.com>
Multiple cherry-picks to add daos fs query feature for fuse
statistics.  Helpful for understanding and tuning DAOS
performance when dfuse is used.

DAOS-13625 dfuse: Merge the info and projection_info structs. (#11881)
DAOS-13658 dfuse: Add filesystem query command. (#12367)
DAOS-12751 control: Add a daos filesystem evict command. (#12331)
DAOS-12751 dfuse: Improve evict command. (#12633)
DAOS-13625 dfuse: Remove dfuse_projection_info entirely. (#12796)
DAOS-13625 dfuse: Replace fs_handle with dfuse_info. (#12894)
DAOS-13625 dfuse: Add core inode_lookup() and inode_decref() functions. (#12573)
DAOS-14411 dfuse: Add per-container statistics. (#12819)
DAOS-14411 control: Expose dfuse statistics as yaml. (#13876)

Changed base branch to google/2.4 for daos_build test

Change-Id: I8ae3cc743697c2434ae0d54b382ee6c585a3b033

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
Signed-off-by: Jeff Olivier <jeffolivier@google.com>
…4274)

Change-Id: Ia8452f68990f495e42e8af2e8a1eb7c951fbbdfa

Signed-off-by: Ashley Pittman <ashley.m.pittman@intel.com>
…stem (#14318)

Improve the daos_init() and pool_connect() process to reuse the attach info
instead of making agent dRPC upcalls multiple times.

Also includes: DAOS-15655 control: fix support for non-default system name (#14170)

Signed-off-by: Mohamad Chaarawi <mohamad.chaarawi@intel.com>
The pipeline lib isn't reading any default that uses '-'
rather than '_'.  After talking to Intel,
changing to '_' is the best path forward.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Backport for the following patches
DAOS-13380 engine: refine tgt_nr check (#12405)
DAOS-15739 engine: Add multi-socket support (#14234)
DAOS-623 engine: Fix a typo (#14329)

* DAOS-13380 engine: refine tgt_nr check

1. For the non-DAOS_TARGET_OVERSUBSCRIBE case,
   fail to start the engine if there are not enough cores.
2. For the DAOS_TARGET_OVERSUBSCRIBE case,
   allow forcing the engine to start.
The #nr_xs_helpers may be reduced in either case.

* DAOS-15739 engine: Add multi-socket support (#14234)

Add a simple multi-socket mode for use cases where a single
engine must be used. This avoids having all helper xstreams
automatically assigned to a single NUMA node, thus increasing
the efficiency of synchronization between I/O and helper xstreams.

It is the default behavior if all of the following are true:

- Neither pinned_numa_node nor first_core is used.
- No oversubscription is requested.
- Each NUMA node has a uniform number of cores.
- Targets and helpers divide evenly among NUMA nodes.
- There is more than one NUMA node.

Update the server config logic to ensure first_core is passed
on to the engine if it is set, while keeping the existing behavior
when both first_core: 0 and pinned_numa_node are set.

Signed-off-by: Jeff Olivier <jeffolivier@google.com>
Signed-off-by: Xuezhao Liu <xuezhao.liu@intel.com>
Signed-off-by: Tom Nabarro <tom.nabarro@intel.com>

Bug-tracker data:
Errors are: component not formatted correctly, ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments. Unable to load ticket data.
https://daosio.atlassian.net/browse/Test

Collaborator

@daosbuild1 daosbuild1 left a comment


@@ -16,6 +16,8 @@ run-parts() {
for i in $(LC_ALL=C; echo "${dir%/}"/*[^~,]); do
# don't run vim .swp files
[ "${i%.sw?}" != "${i}" ] && continue
# for new repo, skip old changeId script
[ $(basename "${i}") == "20-user-changeId" ] && continue

(lint) Quote this to prevent word splitting. [SC2046]

@@ -1167,6 +1212,7 @@ crt_context_req_track(struct crt_rpc_priv *rpc_priv)
d_list_t *rlink;
d_rank_t ep_rank;
int rc = 0;
int quota_rc = 0;

Suggested change
int quota_rc = 0;
int quota_rc = 0;

int len = 0;

int res = sscanf(mountpoint, "/dev/fd/%u%n", &fd, &len);
if (res != 1) {

Suggested change
if (res != 1) {
int res = sscanf(mountpoint, "/dev/fd/%u%n", &fd, &len);

}

int fd_flags = fcntl(fd, F_GETFD);
if (fd_flags == -1) {

Suggested change
if (fd_flags == -1) {
int fd_flags = fcntl(fd, F_GETFD);

* fail for these paths.
*/
int fd = check_fd_mountpoint(dfuse_info->di_mountpoint);
if (fd == -1) {

Suggested change
if (fd == -1) {
int fd = check_fd_mountpoint(dfuse_info->di_mountpoint);

rc = regenerate_task_of_type(pool, PO_COMP_ST_DOWN, RB_OP_EXCLUDE);
if (entry->dpe_val & (DAOS_SELF_HEAL_AUTO_REBUILD | DAOS_SELF_HEAL_DELAY_REBUILD)) {
rc = regenerate_task_of_type(pool, PO_COMP_ST_DOWN,
entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD ? -1 : 0);

Suggested change
entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD ? -1 : 0);
entry->dpe_val & DAOS_SELF_HEAL_DELAY_REBUILD ? -1 : 0);

Comment on lines +1378 to +1381
print_message("sleep 30 seconds for rebuild to be scheduled/delay \n");
sleep(30);
extend_single_pool_rank(arg, 6);
print_message("sleep 5 seconds for extend be scheduled/combined \n");

Suggested change
print_message("sleep 30 seconds for rebuild to be scheduled/delay \n");
sleep(30);
extend_single_pool_rank(arg, 6);
print_message("sleep 5 seconds for extend be scheduled/combined \n");
print_message("sleep 30 seconds for rebuild to be scheduled/delay\n");
sleep(30);
extend_single_pool_rank(arg, 6);
print_message("sleep 5 seconds for extend be scheduled/combined\n");

* If this is client rank 0, set fail_loc to \a fail_loc on \a engine_rank. The
* caller must eventually set fail_loc to 0 on these engines, even when using
* DAOS_FAIL_ONCE.
*

Suggested change
*
*


/**
* If this is client rank 0, set fail_value to \a fail_value on \a engine_rank.
*

Suggested change
*
*

Comment on lines +1447 to +1449

/**
* If this is client rank 0, set fail_num to \a fail_num on \a engine_rank.

Suggested change
/**
* If this is client rank 0, set fail_num to \a fail_num on \a engine_rank.
/**
* If this is client rank 0, set fail_num to \a fail_num on \a engine_rank.
*

@mlawsonca mlawsonca closed this May 13, 2024