DAOS-16167 test: update soak test to use internal job scheduler #14775
base: master
Ticket title is 'Soak: update soak test to use internal job scheduler instead of depending on slurm' |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/415/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/313/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/1/execution/node/357/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/373/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/414/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/364/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/360/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/361/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/2/execution/node/367/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/383/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/387/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/363/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/365/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/364/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/3/execution/node/353/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/359/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/341/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/344/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/467/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/398/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/4/execution/node/336/log |
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/383/log |
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/351/log |
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/350/log |
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/357/log |
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/518/log |
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/6/execution/node/352/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Required-githooks: true Signed-off-by: Maureen Jean <maureen.jean@intel.com>
015b254 to 47e9f7c
src/tests/ftest/util/soak_utils.py
Outdated
run_local(self.log, cmd3, timeout=600)
run_local(self.log, cmd4, timeout=600)
You'll want to pass check=True, or it won't raise an exception when the command fails.
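To illustrate the point, here is a minimal sketch using subprocess.run from the standard library as an analogy. The internals of run_local are not shown in this thread, so this assumes its check flag behaves like the one on subprocess.run; FAILING_CMD, run_unchecked and run_checked are hypothetical names used only for the illustration.

```python
import subprocess
import sys

# A command that always exits with status 3; sys.executable avoids PATH assumptions.
FAILING_CMD = [sys.executable, "-c", "import sys; sys.exit(3)"]


def run_unchecked():
    """Without check=True the failure is only visible in the return code."""
    return subprocess.run(FAILING_CMD, timeout=600).returncode


def run_checked():
    """With check=True a non-zero exit status raises CalledProcessError."""
    try:
        subprocess.run(FAILING_CMD, timeout=600, check=True)
    except subprocess.CalledProcessError as error:
        return f"raised with exit status {error.returncode}"
    return "no exception"
```

In the unchecked form, a failed setup command would pass silently unless the caller inspects the return code.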
src/tests/ftest/util/soak_utils.py
Outdated
run_local(self.log, cmd, timeout=600)
run_local(self.log, cmd2, timeout=600)
You'll want to pass check=True, or it won't raise an exception when the command fails.
env = ";".join([f"export LD_LIBRARY_PATH={lib_path}",
                f"export PATH={path}",
                f"export VIRTUAL_ENV={v_env}"])
I recommend using EnvironmentVariables here. It's just a dictionary with some helper methods:
from command_utils_base import EnvironmentVariables
env = EnvironmentVariables()
env['FOO'] = 'bar'
env_str = env.to_export_str() # export FOO=bar;
while not job_queue.empty():
    job_results = job_queue.get()
So this potentially submits multiple jobs and doesn't schedule any more jobs until all of those are complete. I think you'd rather submit and complete asynchronously? I could help with that improvement if needed. There are a few approaches we could take.
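One possible approach, sketched under assumptions: a thread pool keeps a fixed number of jobs in flight and starts the next job as soon as a slot frees up, instead of waiting for a whole batch. run_job, schedule_jobs, max_running and the result keys are hypothetical stand-ins, not names from soak_utils.py, and a real scheduler would also have to track node allocation.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_job(job_id, duration):
    """Hypothetical stand-in for launching one soak job and waiting for it."""
    time.sleep(duration)
    return {"handle": job_id, "state": "COMPLETED"}


def schedule_jobs(jobs, max_running=2):
    """Run (job_id, duration) pairs with at most max_running jobs in flight.

    The executor starts a queued job whenever a worker becomes free, so one
    long job does not hold back the jobs queued behind it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_running) as executor:
        futures = [executor.submit(run_job, job_id, duration)
                   for job_id, duration in jobs]
        # as_completed yields each future as soon as its job finishes
        for future in as_completed(futures):
            result = future.result()
            results[result["handle"]] = result["state"]
    return results
```

For example, schedule_jobs([(1000, 0.05), (1001, 0.01), (1002, 0.01)]) lets job 1002 start as soon as 1001 finishes, while 1000 is still running.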
else:
    # update soak_results to include job id NOT run and set state = CANCELLED
    for job in job_id_list:
        if job not in list(self.soak_results.keys()):
Suggested change:
- if job not in list(self.soak_results.keys()):
+ if job not in self.soak_results:
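A small sketch of why the shorter form is equivalent: the in operator on a dictionary tests its keys directly, so building a list of keys first only adds a copy and a linear scan. The job IDs and states below are made-up sample values.

```python
# Hypothetical sample data standing in for self.soak_results and job_id_list.
soak_results = {"1000": "COMPLETED", "1001": "FAILED"}
job_id_list = ["1000", "1001", "1002"]


def jobs_not_run(job_ids, results):
    """Return the job ids that have no entry in the results dictionary."""
    return [job for job in job_ids if job not in results]


def jobs_not_run_verbose(job_ids, results):
    """Same check written with an explicit list of keys."""
    return [job for job in job_ids if job not in list(results.keys())]
```

Both helpers return the same list; the first avoids rebuilding the key list on every iteration.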
src/tests/ftest/util/soak_utils.py
Outdated
@@ -233,7 +254,7 @@ def get_daos_server_logs(self):
 self (obj): soak obj
 """
 daos_dir = self.outputsoak_dir + "/daos_server_logs"
-logs_dir = os.environ.get("DAOS_TEST_LOG_DIR", "/var/tmp/daos_testing/")
+logs_dir = os.environ.get("DAOS_TEST_LOG_DIR", "/tmp/")
This should always be defined now.
Suggested change:
- logs_dir = os.environ.get("DAOS_TEST_LOG_DIR", "/tmp/")
+ logs_dir = self.test_env.log_dir
job_queue.put(results)
# give time to update the queue before exiting
time.sleep(0.5)
Is this necessary? As far as I'm aware, the queue should be updated as soon as put returns.
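A minimal sketch of that behaviour, assuming job_queue is a thread-safe queue.Queue (the thread does not show how it is created). The worker puts its results and returns immediately, and the item is already there once the thread has been joined. worker and collect_results are hypothetical names. If job_queue were a multiprocessing.Queue instead, the standard library documents a brief delay before empty() reflects a put, so that case may differ.

```python
import queue
import threading


def worker(job_queue, results):
    """Hypothetical job thread: publish the results and return without sleeping."""
    job_queue.put(results)


def collect_results():
    """Start one worker, wait for it, then drain the queue."""
    job_queue = queue.Queue()
    results = {"handle": 1000, "state": "COMPLETED"}
    thread = threading.Thread(target=worker, args=(job_queue, results))
    thread.start()
    thread.join()
    collected = []
    while not job_queue.empty():
        collected.append(job_queue.get())
    return collected
```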
src/tests/ftest/util/soak_utils.py
Outdated
for key, value in list(sbatch_params.items()):
    if value is not None:
        if key == "error":
            value = value
Suggested change:
- value = value
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/24/execution/node/1446/log |
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/24/execution/node/1551/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/28/execution/node/944/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/30/execution/node/812/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14775/31/execution/node/922/log |
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Signed-off-by: Maureen Jean <maureen.jean@intel.com>
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke
Skip-unit-tests: true Skip-fault-injection-test: true Test-tag: soak_smoke Signed-off-by: Maureen Jean <maureen.jean@intel.com>
from test_utils_container import add_container


H_LOCK = threading.Lock()
id_counter = count(start=1)
Suggestion: you might want to start with a larger number so the job IDs are easily sortable (if you need that). E.g. with start=1 you'll get
1
2
...
9
10 <-- throws off sorting / printing ids in a list format
But with e.g. start=1000, the numbers will be aligned from 1000 to 9999
1000
1001
...
1009
1010
...
9999
10000 # <-- finally not aligned
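The effect can be sketched as follows, assuming the IDs end up compared as strings (for example in file names or log listings); as integers they sort correctly either way. make_ids is a hypothetical helper built on itertools.count, the same counter used in the diff.

```python
from itertools import count


def make_ids(start, how_many):
    """Return the first how_many ids from a counter, rendered as strings."""
    counter = count(start=start)
    return [str(next(counter)) for _ in range(how_many)]


short_ids = make_ids(1, 11)     # "1" through "11"
wide_ids = make_ids(1000, 11)   # "1000" through "1010"

# Lexicographic sorting puts "10" and "11" before "2" for the short ids,
# while the four-digit ids keep their numeric order.
short_sorted = sorted(short_ids)
wide_sorted = sorted(wide_ids)
```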
job_queue.put(results)
# give time to update the queue before exiting
time.sleep(0.5)
return
When job_queue.put returns, the queue should already be updated, with no need to wait. Did you find otherwise?
for cmd in list(job_cmds):
    script_file.write(cmd + "\n")
script_file.close()
This shouldn't be necessary, since the file was opened with "with open(scriptfile, 'w') as script_file", which closes it when the block exits.
Suggested change:
- script_file.close()
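A short sketch of the behaviour being relied on: a file opened in a with statement is closed when the block exits, so no explicit close() is needed. write_script and the sample commands are hypothetical, used only for the illustration.

```python
import os
import tempfile


def write_script(path, job_cmds):
    """Write one command per line and report whether the file ended up closed."""
    with open(path, "w") as script_file:
        for cmd in job_cmds:
            script_file.write(cmd + "\n")
    # The with block has exited, so the file is already closed here.
    return script_file.closed


with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "job.sh")
    closed_after_with = write_script(script_path, ["echo start", "echo done"])
    with open(script_path) as check_file:
        written = check_file.read()
```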
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: soak_smoke
Required-githooks: true
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: