DAOS-16585 tests: Fix NLT handling of __fxstat detection #15150
base: master
Ticket title is 'NLT test failures under Ubuntu 22.04'
14f3e0e to 6e23f42
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/3/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/4/display/redirect
6e23f42 to 9a53b93
Use strace to determine whether calls to __fxstat actually happen when using a utility/command the IL is being tested on, and stop treating it as an error to not see __fxstat when it's not used. Signed-off-by: Nicholas Murphy <ncmurphy@google.com> Required-githooks: true Run-GHA: true
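A minimal sketch of the approach this commit message describes, assuming strace's -k option (a stack trace after each syscall, available when strace is built with unwinding support), -f to follow child processes, and -o to write the trace to a file. The helper names build_strace_cmd, trace_mentions and library_call_seen, and the '.strace' suffix, are hypothetical and are not the PR's actual code:

```python
import subprocess
import tempfile


def build_strace_cmd(cmd, out_path):
    """Return an strace command line tracing cmd, with stack traces (-k), into out_path."""
    return ['strace', '-k', '-f', '-o', out_path] + list(cmd)


def trace_mentions(trace_path, library_str):
    """Return True if any line of the trace file contains library_str."""
    with open(trace_path, 'r', errors='replace') as trace:
        return any(library_str in line for line in trace)


def library_call_seen(cmd, library_str):
    """Run cmd under strace and report whether library_str appears in the trace."""
    with tempfile.NamedTemporaryFile(mode='r', prefix='dnt_assess_',
                                     suffix='.strace') as tmp:
        # strace writes to the named path; the command's own output is discarded.
        subprocess.run(build_strace_cmd(cmd, tmp.name), check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return trace_mentions(tmp.name, library_str)
```

With helpers like these, a caller could gate the existing log check on something like library_call_seen(cmd, '__fxstat'), so that a missing __fxstat is only an error when the command actually calls it.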
9a53b93 to ad18961
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15150/6/display/redirect
utils/node_local_test.py
Outdated
@@ -6327,6 +6327,23 @@ def server_fi(args):
    server.set_fi(probability=0)


def look_for_library_call(conf, cmd, library_str):
    """Look for library_str in the strace call stack of running cmd."""
    tmpfile = tempfile.NamedTemporaryFile(mode='r',
I don't recall whether pylint will still complain on every PR thereafter, but it used to be the case. You could either remove "strace" from the comment or add a comment such as
Line 3402 in b5d2047
# pylint: disable=wrong-spelling-in-comment
utils/node_local_test.py
Outdated
def look_for_library_call(conf, cmd, library_str):
    """Look for library_str in the strace call stack of running cmd."""
    tmpfile = tempfile.NamedTemporaryFile(mode='r',
                                          prefix='dnt_assess_',
This one is simple enough to address; then you don't need an explicit close.
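Assuming this comment refers to creating the temporary file inside a with block (which is what pylint's consider-using-with check asks for), a sketch of the pattern might look like the following. The function write_and_read_back is a hypothetical stand-in used only for illustration; only the 'dnt_assess_' prefix comes from the diff above:

```python
import tempfile


def write_and_read_back(text):
    """Write text to a temporary file and read it back.

    The with block closes (and, by default, deletes) the file on exit,
    so no explicit close() call is needed.
    """
    with tempfile.NamedTemporaryFile(mode='w+', prefix='dnt_assess_') as tmpfile:
        tmpfile.write(text)
        tmpfile.flush()
        tmpfile.seek(0)
        return tmpfile.read()
```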
Required-githooks: true
This would work inasmuch as, if glibc is behaving differently, the specific check that's failing will be disabled. What we should be testing, however, is whether fstat is being intercepted properly. Historically, and in the current code, this is done by checking the logs for the wrapper function. However, dfuse now has per-operation statistics available to the client, so a more comprehensive solution would be to sample the fstat count before and after the command is invoked as a way of knowing whether it had been intercepted or not.
One complexity here is that the first fstat of every file is forwarded so that the st_dev value can be loaded/cached, so in order to fix this properly il_stat may need to be passed the number of files which are accessed. I'll see if I can get a PR together on this basis.
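The before/after sampling idea could be sketched roughly as follows. How the dfuse per-operation statistics are actually read is not shown in this thread, so read_fstat_count is a hypothetical callable supplied by the caller, and files_accessed is the assumed extra parameter that accounts for the first fstat of each file being forwarded. The sketch also assumes the counter counts fstat calls that reached dfuse, which may not match the real statistic:

```python
def fstat_delta(read_fstat_count, run_cmd):
    """Sample the counter, run the command, sample again, and return the difference."""
    before = read_fstat_count()
    run_cmd()
    return read_fstat_count() - before


def looks_intercepted(delta, files_accessed):
    """Decide whether the observed delta is consistent with interception.

    Assumption: the counter counts fstat calls that reached dfuse. Only the
    first fstat of each file is expected to be forwarded, so an intercepted
    run should add at most files_accessed to the counter.
    """
    return delta <= files_accessed
```

Keeping the sampling separate from the decision means the threshold logic can change (for example, if the counter turns out to count intercepted calls instead) without touching the code that runs the command.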
check_fstat = check_fstat and not self.caching and \
    look_for_library_call(self.conf, cmd, '__fxstat')
This would be called for every il_cmd invocation, and there are probably dozens, so it could/should be saved in conf.
I thought about that and wanted to not assume different executables ended up using the same libraries. And, I don't think it's going to affect overall runtime to just do this every time. Do you feel differently?
I'd missed that it was running the actual command rather than just a generic unix command here. This means the command has to be idempotent and perform the same operations on subsequent invocations, but looking at the places where this is called, it seems it probably is.
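For illustration only, one way to reconcile the two positions in this thread would be to cache the probe result per executable rather than globally, so that different executables are not assumed to use the same libraries. ProbeCache and the probe callable are hypothetical names, and keying on the first word of the command assumes the same executable makes the same library calls regardless of its arguments:

```python
class ProbeCache:
    """Remember probe results so each distinct executable is probed only once."""

    def __init__(self, probe):
        self._probe = probe      # callable(cmd) -> bool, e.g. an strace-based check
        self._results = {}

    def check(self, cmd):
        """Return the cached result for cmd[0], running the probe on first use."""
        key = cmd[0]
        if key not in self._results:
            self._results[key] = self._probe(cmd)
        return self._results[key]
```

An instance could live on the shared configuration object, which would give the "saved in conf" behaviour while still probing each executable separately.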
# hack to install 24.04's golang-go on 22.04:
apt-get install -y software-properties-common
add-apt-repository "deb http://archive.ubuntu.com/ubuntu noble main"
apt-get update
apt-get install -y golang-go
add-apt-repository -r "deb http://archive.ubuntu.com/ubuntu noble main"
Is this a part of the fix or something else?
Is this a part of the fix or something else?
There was a recent change that made Go 1.22 a requirement, and Ubuntu 22.04 has 1.18. Not sure why the builds didn't fail on that patch.
This would probably be better landing as part of #15174, where the Go developers can review it.
That said, this script normally just installs packages; https://github.com/daos-stack/daos/blob/master/utils/docker/Dockerfile.ubuntu would be a better place for this code, or perhaps a utils/scripts/helpers/repo-helper-debian.sh script to match what Rocky does.
FWIW I kind of like the strace approach as a general solution, as it lets you establish a ground truth about what's actually happening without making assumptions. One can imagine extending this to have the strace output dictate all the calls you should be intercepting. @jolivier23 pointed out, for instance, that "newfstat" shows up and is probably not being intercepted right now? (shrug) In the meantime, we (Google) would like some fix here ASAP to unblock our own client testing. So: request to separate a short-term fix from a longer-term, more complete solution?
I think we're making different points but with the same end. The current code checks whether a particular glibc implementation of fstat is being intercepted; if it's not, is that because the interception is broken or because a different implementation is in use? Using strace will hide the second failure mode. Overall this is really just a quick smoke test, and tracking for filenames in the log file is a bit of a hack; to do this properly we'd write custom
Use strace to determine whether calls to __fxstat actually happen when using a utility/command the IL is being tested on, and stop treating it as an error to not see __fxstat when it's not used.
Required-githooks: true