Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16355 client: pydaos.torch module #15475

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open

DAOS-16355 client: pydaos.torch module #15475

wants to merge 22 commits into from

Conversation

0xE0F
Copy link
Contributor

@0xE0F 0xE0F commented Nov 8, 2024

Introducing pydaos.torch module that allows use DAOS POSIX containers as a datasource for pytorch framework in form of pydaos.torch.Dataset and pydaos.torch.IterableDataset classes.

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follows the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

@0xE0F 0xE0F requested review from a team as code owners November 8, 2024 07:09
Copy link

github-actions bot commented Nov 8, 2024

Ticket title is 'pydaos.torch modules'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-16355

@0xE0F 0xE0F marked this pull request as draft November 8, 2024 07:12
@0xE0F
Copy link
Contributor Author

0xE0F commented Nov 8, 2024

Converting to draft to fix linters.

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/317/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/273/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/314/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/357/log

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/1/testReport/

Copy link
Contributor

@johannlombardi johannlombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several typos to fix.

src/client/pydaos/torch/torch_api.py Outdated Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Show resolved Hide resolved
utils/node_local_test.py Show resolved Hide resolved
@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/328/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/379/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/382/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/331/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/354/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/348/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/353/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/357/log

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15475/3/display/redirect

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/355/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/354/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/349/log

@daosbuild1
Copy link
Collaborator

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/364/log

utils/ci/run_in_gha.sh Fixed Show fixed Hide fixed
@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/5/execution/node/370/log

@daosbuild1
Copy link
Collaborator

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/5/execution/node/367/log

@0xE0F
Copy link
Contributor Author

0xE0F commented Nov 14, 2024

I think Build / Build DAOS (rocky, gcc) fails intermittently and not only for my PR 😬

utils/ci/run_in_gha.sh Fixed Show fixed Hide fixed
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/15/testReport/

@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/16/testReport/

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/18/testReport/

Copy link
Contributor

@johannlombardi johannlombardi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still need a few changes before it can land. But looks promising :)

debian/changelog Show resolved Hide resolved
utils/rpms/daos.spec Show resolved Hide resolved
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
@daosbuild1
Copy link
Collaborator

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/19/testReport/

johannlombardi
johannlombardi previously approved these changes Nov 14, 2024
Copy link
Contributor

@mchaarawi mchaarawi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a side note:
the rule for updating PRs after you mark them as ready for review is that you are not allowed to do force-push again because reviewers loose their review history and need to review everything again vs being able to review just the changes since their last review.
So please be mindful of that and do not force push to the PR anymore and just add commits on top. the gatekeeper will always squash on the merge to master.

if (rc) {
rc = daos_der2errno(rc);
}
if (rc == EBUSY) {
Copy link
Contributor

@mchaarawi mchaarawi Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this doesn't sound right to me.
daos_init/fini() will do proper ref counting internally for the case multiple threads calling init /finalize.
so if you are getting EBUSY, the only explanation is that you have some events that are inflight that were not completed on the last thread that calls fini. so it sounds like a problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was workaround before introducing daos_reinit().
I removed it, good call !
I'm still checking for -DER_UNINIT for NLT tests and remove it with functional tests PR.

src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Show resolved Hide resolved
src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved

for (int i = 0; i < eq_rc; ++i) {
struct io_op *op = container_of(evp[i], struct io_op, ev);
int op_rc = complete_read_op(hdl, op);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

complete_read_op() sets op->err but i missed where this err is checked to indicate a read error has occurred (if it did)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point ! Fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unresolving as it sounds like we can still overwrite the error on a subsequent iteration? and the return of that function would still be SUCCESS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function returns rc and this loop it can be overwritten only if it's 0 , so only first encountered error code will be set 😬

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mchaarawi ^ I'm not sure why it does not show reply in the comments section 😬

Comment on lines +69 to +75
# For daos_der2errno() used by pydaos.torch module
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/common.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/debug.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/profile.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/dtx.h')
env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/cmd_parser.h')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment accurate?

For daos_der2errno() used by pydaos.torch module

Are these imports ONLY for daos_der2errno? If so, it's odd that we need so much for that function - particularly dtx.h

src/client/pydaos/torch/Readme.md Outdated Show resolved Hide resolved
src/client/pydaos/torch/Readme.md Outdated Show resolved Hide resolved
src/client/pydaos/torch/Readme.md Outdated Show resolved Hide resolved
src/client/pydaos/torch/Readme.md Outdated Show resolved Hide resolved
src/client/pydaos/torch/__init__.py Outdated Show resolved Hide resolved
src/client/pydaos/torch/torch_api.py Outdated Show resolved Hide resolved
utils/node_local_test.py Outdated Show resolved Hide resolved
Comment on lines +171 to +178
out:
if (rc) {
dfs_disconnect(hdl->dfs);

D_FREE(hdl->global.iov_buf);
D_FREE(hdl);
hdl = NULL;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we also need to free this one on error?

PyObject *result = PyList_New(2);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the caller expects this tuple as a return value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we return an error after allocating this, the caller is expected to free it? I would think on error we should free this and just return a null handle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, caller does not need to do it, the python refcounter will clean it up after the caller done using the returned tuple.

src/client/pydaos/torch/Readme.md Show resolved Hide resolved
Co-authored-by: Dalton Bohning <dalton.bohning@intel.com>
Signed-off-by: 0xE0F <denis.barahtanov@gmail.com>
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
And trying to check other tests in the pipeline with:

Allow-unstable-test: true

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15475/21/display/redirect

@0xE0F 0xE0F requested a review from mchaarawi November 15, 2024 06:12
@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/22/execution/node/1497/log

static PyObject *
__shim_handle__module_init(PyObject *self, PyObject *args)
{
int rc = daos_init();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there every a case where the module might get loaded but not used? I know with the IL we delay init until someone actually does I/O that requires it. I realize it's a little different use case here and maybe they will only load the module if they are going to use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure but I think python does lazy load module by default ?

src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved

int rc = dfs_lookup(hdl->dfs, path, O_RDONLY, &obj, NULL, NULL);
if (rc) {
return PyLong_FromLong(-rc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm still see some neg errno returned here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller expects positive number of splits in case of success and a negative errno in case of a failure, I'm not sure what's wrong with this ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller expects positive number of splits in case of success and a negative errno in case of a failure, I'm not sure what's wrong with this ?

Comment on lines +423 to +424
Since python can use buffer like objects that might not have contiguous memory layout,
let's put a guardrail accepting only buffers with contiguous memory region
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the challenge of supporting non-contiguous mem layout? DAOS supports that with the sgl

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no challange, the python wrapper allocates buffer before the call to shim layer: https://github.com/daos-stack/daos/pull/15475/files/92e30a337949e2eb8116cf91476e18c1bd67cec3#diff-2ea6a1bbd40194ac265357a571f9323b1097ff49ef7d887ae663170a9396379dR359 so this is more a guardrail to keep things simple.

src/client/pydaos/torch/torch_shim.c Outdated Show resolved Hide resolved
if (rc) {
goto out;
}
if (read != bview.len) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see you added that now, but what i had meant earlier is that in this case, if you read less (hit EOF for example), is an error appropriate? like is this guaranteed to never read beyond EOF?
if yes then all good. if not then an error cannot be returned here i would think.

Copy link
Contributor Author

@0xE0F 0xE0F Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think this check is more for detecting errors earlier - during the namespace scanning, along with the file path, its size is also stored and later buffer is created to with that size. If it does not mach on read - something has chenged between scan and read which might be a bit fishy, since dataset is suppose to be static.


rc2 = daos_event_fini(&op->ev);
if (rc2) {
D_ERROR("Could not finalize event: %s (rc=%d)", d_errstr(rc2), rc2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before, need to set rc to rc2 if rc == 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this function there's return 0 as a happy path , if the control flow is after out then rc would containt the original error code that I'd like to return, instead of the error code during the cleanup phase.

src/client/pydaos/torch/torch_shim.c Show resolved Hide resolved

for (int i = 0; i < eq_rc; ++i) {
struct io_op *op = container_of(evp[i], struct io_op, ev);
int op_rc = complete_read_op(hdl, op);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unresolving as it sounds like we can still overwrite the error on a subsequent iteration? and the return of that function would still be SUCCESS?

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
@0xE0F 0xE0F requested a review from mchaarawi November 15, 2024 23:55
Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants