DAOS-16355 client: pydaos.torch module #15475
base: master
Conversation
Ticket title is 'pydaos.torch modules'
Converting to draft to fix linters.
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/317/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/273/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/314/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/357/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/1/testReport/
Several typos to fix.
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/328/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/379/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/382/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/331/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/354/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/348/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/353/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/357/log
Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15475/3/display/redirect
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/355/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/354/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/349/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/364/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/5/execution/node/370/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/5/execution/node/367/log
I think
Force-pushed from a472716 to 8c83854
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/15/testReport/
Force-pushed from 8c83854 to 76ed7e9
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/16/testReport/
Force-pushed from 76ed7e9 to 46dee9b
Force-pushed from 46dee9b to 0db9a0c
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/18/testReport/
Still need a few changes before it can land. But looks promising :)
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/19/testReport/
just a side note:
the rule for updating PRs after you mark them as ready for review is that you are not allowed to force-push again, because reviewers lose their review history and need to review everything again, instead of being able to review just the changes since their last review.
So please be mindful of that and do not force-push to the PR anymore; just add commits on top. The gatekeeper will always squash on the merge to master.
src/client/pydaos/torch/torch_shim.c
Outdated
if (rc) {
	rc = daos_der2errno(rc);
}
if (rc == EBUSY) {
hmm, this doesn't sound right to me.
daos_init/fini() will do proper ref counting internally for the case of multiple threads calling init/finalize.
So if you are getting EBUSY, the only explanation is that you have some in-flight events that were not completed on the last thread that calls fini. So it sounds like a problem.
That was a workaround before introducing daos_reinit(). I removed it, good call!
I'm still checking for -DER_UNINIT for the NLT tests and will remove it with the functional tests PR.
for (int i = 0; i < eq_rc; ++i) {
	struct io_op *op = container_of(evp[i], struct io_op, ev);
	int op_rc = complete_read_op(hdl, op);
complete_read_op() sets op->err, but I missed where this err is checked to indicate a read error has occurred (if it did)?
Good point! Fixed.
unresolving as it sounds like we can still overwrite the error on a subsequent iteration? and the return of that function would still be SUCCESS?
the function returns rc, and in this loop it can be overwritten only if it's 0, so only the first encountered error code will be set 😬
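The "first error wins" policy described above can be sketched like this (illustrative Python, not the shim code itself; drain is a hypothetical name):

```python
# A completion loop records only the first non-zero status and keeps
# draining the remaining events; later errors do not overwrite the first.
def drain(statuses):
    rc = 0
    for op_rc in statuses:
        if rc == 0 and op_rc != 0:
            rc = op_rc   # set once: subsequent errors are not recorded
    return rc

assert drain([0, 5, 0, 7]) == 5   # first error encountered is returned
assert drain([0, 0, 0]) == 0      # all-success path returns 0
```

This answers the reviewer's concern in part: the return is not SUCCESS when any operation failed, though only the first failure's code survives.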
@mchaarawi ^ I'm not sure why it does not show reply in the comments section 😬
# For daos_der2errno() used by pydaos.torch module
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/common.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/debug.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/profile.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/dtx.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/cmd_parser.h')
Is this comment accurate?
For daos_der2errno() used by pydaos.torch module
Are these imports ONLY for daos_der2errno? If so, it's odd that we need so much for that function - particularly dtx.h
out:
	if (rc) {
		dfs_disconnect(hdl->dfs);

		D_FREE(hdl->global.iov_buf);
		D_FREE(hdl);
		hdl = NULL;
	}
Do we also need to free this one on error?
PyObject *result = PyList_New(2);
No, the caller expects this tuple as a return value.
So if we return an error after allocating this, the caller is expected to free it? I would think on error we should free this and just return a null handle.
No, the caller does not need to do it; the Python reference counter will clean it up after the caller is done using the returned tuple.
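The ownership rule described above can be demonstrated from the Python side (illustrative sketch; Result and shim_call are hypothetical stand-ins for the object built with PyList_New):

```python
import weakref

# The shim returns a newly created Python object; ownership transfers to
# the caller, and CPython's reference counting reclaims it as soon as the
# caller drops the last reference -- no manual free is needed.
class Result(list):          # list subclass so it supports weak references
    pass

def shim_call():
    return Result([0, None])  # stand-in for the returned [rc, payload] pair

res = shim_call()
ref = weakref.ref(res)
assert ref() is not None      # alive while the caller holds a reference
del res                       # caller is done with the result
assert ref() is None          # CPython reclaimed it immediately
```

Note the immediate reclamation on `del` is a CPython refcounting detail; other interpreters may defer it to the garbage collector.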
Co-authored-by: Dalton Bohning <dalton.bohning@intel.com> Signed-off-by: 0xE0F <denis.barahtanov@gmail.com>
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
And trying to check other tests in the pipeline with: Allow-unstable-test: true Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15475/21/display/redirect
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/22/execution/node/1497/log
static PyObject *
__shim_handle__module_init(PyObject *self, PyObject *args)
{
	int rc = daos_init();
is there ever a case where the module might get loaded but not used? I know with the IL we delay init until someone actually does I/O that requires it. I realize it's a little different use case here, and maybe they will only load the module if they are going to use it.
I'm not entirely sure, but I think Python lazy-loads modules by default?
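For what it's worth, a standard Python import executes the module body eagerly on first import, so an init call at module scope runs at import time. The delayed-init pattern the reviewer mentions (init on first actual I/O) can be sketched like this; `_daos_init` and `read` are hypothetical stand-ins:

```python
import functools

calls = []

def _daos_init():
    # stand-in for the real daos_init(); records that it ran
    calls.append("init")
    return 0

@functools.lru_cache(maxsize=None)
def _ensure_initialized():
    # lru_cache on a zero-argument function runs the body exactly once
    return _daos_init()

def read(path):
    _ensure_initialized()   # init is deferred to the first actual use
    return b""

assert calls == []          # merely defining/importing does not init
read("/a")
read("/b")
assert calls == ["init"]    # one init despite two reads
```

A threading.Lock would be needed around the first call if multiple threads can race to the first read; lru_cache alone does not guarantee single execution under contention.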
int rc = dfs_lookup(hdl->dfs, path, O_RDONLY, &obj, NULL, NULL);
if (rc) {
	return PyLong_FromLong(-rc);
hmm, I still see some negative errno returned here.
The caller expects a positive number of splits in case of success and a negative errno in case of a failure; I'm not sure what's wrong with this?
Since python can use buffer like objects that might not have contiguous memory layout,
let's put a guardrail accepting only buffers with contiguous memory region
what is the challenge of supporting non-contiguous mem layout? DAOS supports that with the sgl
There's no challenge; the python wrapper allocates the buffer before the call to the shim layer: https://github.com/daos-stack/daos/pull/15475/files/92e30a337949e2eb8116cf91476e18c1bd67cec3#diff-2ea6a1bbd40194ac265357a571f9323b1097ff49ef7d887ae663170a9396379dR359 so this is more a guardrail to keep things simple.
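A quick illustration of the kind of buffer the guardrail rejects: a sliced memoryview is a perfectly valid buffer-like object, but its bytes are not laid out in one contiguous region.

```python
# A full memoryview over a bytearray is C-contiguous and would pass the
# guardrail; slicing with a step produces a strided, non-contiguous view
# that the shim rejects rather than handling scatter/gather itself.
buf = memoryview(bytearray(range(16)))
assert buf.c_contiguous             # accepted: one contiguous region

strided = buf[::2]                  # every other byte
assert not strided.c_contiguous     # rejected by the guardrail
```

At the C level this corresponds to requesting the buffer with a contiguity flag (e.g. PyBUF_C_CONTIGUOUS) so non-contiguous exporters fail the request up front.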
if (rc) {
	goto out;
}
if (read != bview.len) {
I see you added that now, but what I had meant earlier is that in this case, if you read less (hit EOF, for example), is an error appropriate? Like, is this guaranteed to never read beyond EOF?
If yes, then all good. If not, then an error cannot be returned here, I would think.
Yeah, I think this check is more for detecting errors earlier - during the namespace scanning, along with the file path, its size is also stored, and later a buffer is created with that size. If it does not match on read - something has changed between scan and read, which might be a bit fishy, since the dataset is supposed to be static.
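The scan-vs-read size check described above can be sketched as follows (illustrative; check_read is a hypothetical name, not the shim's API):

```python
# During the namespace scan the file size is recorded; a later read must
# return exactly that many bytes. A short read means the file changed
# between scan and read, which a static dataset should never do.
def check_read(scanned_size, data):
    if len(data) != scanned_size:
        raise OSError("dataset changed between scan and read")
    return data

assert check_read(4, b"abcd") == b"abcd"   # sizes agree: data returned

raised = False
try:
    check_read(4, b"ab")                   # short read: file shrank
except OSError:
    raised = True
assert raised
```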
rc2 = daos_event_fini(&op->ev);
if (rc2) {
	D_ERROR("Could not finalize event: %s (rc=%d)", d_errstr(rc2), rc2);
same as before, need to set rc to rc2 if rc == 0
for this function there's return 0 as the happy path; if the control flow is after out:, then rc would contain the original error code that I'd like to return, instead of the error code from the cleanup phase.
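The two positions in this thread are compatible, and the combined policy can be sketched like this (illustrative Python; finish is a hypothetical name):

```python
# Policy: a cleanup failure (rc2) is surfaced only when the primary
# operation succeeded (rc == 0); an earlier primary error is never
# overwritten by an error from the cleanup phase.
def finish(rc, cleanup_rc):
    if cleanup_rc != 0 and rc == 0:
        rc = cleanup_rc     # happy path: report the cleanup failure
    return rc               # otherwise the original error wins

assert finish(0, 0) == 0    # nothing failed
assert finish(0, 9) == 9    # only cleanup failed: caller must hear it
assert finish(5, 9) == 5    # original error preserved over cleanup error
```

This matches the reviewer's "set rc to rc2 if rc == 0" suggestion while keeping the original error code the author wants to return.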
for (int i = 0; i < eq_rc; ++i) {
	struct io_op *op = container_of(evp[i], struct io_op, ev);
	int op_rc = complete_read_op(hdl, op);
unresolving as it sounds like we can still overwrite the error on a subsequent iteration? and the return of that function would still be SUCCESS?
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Introducing the pydaos.torch module, which allows using DAOS POSIX containers as a data source for the PyTorch framework, in the form of pydaos.torch.Dataset and pydaos.torch.IterableDataset classes.
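To illustrate what a map-style dataset over a POSIX namespace does, here is a minimal local-filesystem stand-in (the real pydaos.torch.Dataset scans a DFS container instead of a local directory; DirDataset and its scan-once behaviour are an assumption for illustration, not the module's API):

```python
import os
import tempfile

# Map-style dataset protocol: scan the namespace once up front (the
# dataset is assumed static afterwards), then serve items by index.
class DirDataset:
    def __init__(self, root):
        self.paths = sorted(
            os.path.join(d, f)
            for d, _dirs, files in os.walk(root)
            for f in files)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as fp:
            return fp.read()

# Build a tiny throwaway namespace and exercise the protocol.
root = tempfile.mkdtemp()
for name, payload in [("a.bin", b"aa"), ("b.bin", b"bb")]:
    with open(os.path.join(root, name), "wb") as fp:
        fp.write(payload)

ds = DirDataset(root)
assert len(ds) == 2
assert ds[0] == b"aa"
```

An object implementing `__len__` and `__getitem__` like this can be handed directly to torch.utils.data.DataLoader for batching and shuffling, which is the integration point the module targets.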
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used, or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: