DAOS-16355 client: pydaos.torch module #15475

0xE0F · 2024-11-08T07:09:17Z

Introduces the pydaos.torch module, which allows DAOS POSIX containers to be used as a data source for the PyTorch framework via the pydaos.torch.Dataset and pydaos.torch.IterableDataset classes.

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follow the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2024-11-08T07:09:38Z

Ticket title is 'pydaos.torch modules'
Status is 'Open'
https://daosio.atlassian.net/browse/DAOS-16355

0xE0F · 2024-11-08T07:13:08Z

Converting to draft to fix linters.

daosbuild1 · 2024-11-08T07:16:32Z

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/317/log

daosbuild1 · 2024-11-08T07:16:47Z

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/273/log

daosbuild1 · 2024-11-08T07:16:55Z

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/314/log

daosbuild1 · 2024-11-08T07:21:30Z

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/1/execution/node/357/log

daosbuild1 · 2024-11-08T07:50:51Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/1/testReport/

johannlombardi

Several typos to fix.

src/client/pydaos/torch/torch_api.py

src/client/pydaos/torch/torch_shim.c

utils/node_local_test.py

daosbuild1 · 2024-11-11T00:16:41Z

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/328/log

daosbuild1 · 2024-11-11T00:18:23Z

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/379/log

daosbuild1 · 2024-11-11T00:18:34Z

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/382/log

daosbuild1 · 2024-11-11T00:24:50Z

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/2/execution/node/331/log

daosbuild1 · 2024-11-11T00:47:08Z

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/354/log

daosbuild1 · 2024-11-11T00:47:58Z

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/348/log

daosbuild1 · 2024-11-11T00:48:07Z

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/353/log

daosbuild1 · 2024-11-11T00:52:02Z

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/3/execution/node/357/log

daosbuild1 · 2024-11-11T00:54:19Z

Test stage NLT on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15475/3/display/redirect

daosbuild1 · 2024-11-11T01:00:12Z

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/355/log

daosbuild1 · 2024-11-11T01:00:30Z

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/354/log

daosbuild1 · 2024-11-11T01:00:32Z

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/349/log

daosbuild1 · 2024-11-11T01:06:51Z

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/4/execution/node/364/log

utils/ci/run_in_gha.sh

daosbuild1 · 2024-11-11T03:52:16Z

Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/5/execution/node/370/log

daosbuild1 · 2024-11-11T03:52:53Z

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/5/execution/node/367/log

0xE0F · 2024-11-14T03:05:36Z

I think Build / Build DAOS (rocky, gcc) fails intermittently and not only for my PR 😬

utils/ci/run_in_gha.sh

daosbuild1 · 2024-11-14T04:44:41Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/15/testReport/

daosbuild1 · 2024-11-14T05:30:01Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/16/testReport/

daosbuild1 · 2024-11-14T07:09:19Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/18/testReport/

johannlombardi

Still need a few changes before it can land. But looks promising :)

debian/changelog

utils/rpms/daos.spec

daosbuild1 · 2024-11-14T12:17:10Z

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15475/19/testReport/

mchaarawi

Just a side note:
The rule for updating PRs after you mark them as ready for review is that you may not force-push again, because reviewers lose their review history and have to review everything again instead of just the changes since their last review.
So please be mindful of that: do not force-push to the PR anymore and just add commits on top. The gatekeeper will always squash on the merge to master.

mchaarawi · 2024-11-14T18:07:27Z

src/client/pydaos/torch/torch_shim.c

+	if (rc) {
+		rc = daos_der2errno(rc);
+	}
+	if (rc == EBUSY) {


Hmm, this doesn't sound right to me.
daos_init()/daos_fini() do proper ref counting internally for the case of multiple threads calling init/fini.
So if you are getting EBUSY, the only explanation is that some in-flight events were not completed on the last thread that calls fini, which sounds like a problem.

That was a workaround from before daos_reinit() was introduced.
I removed it, good call!
I'm still checking for -DER_UNINIT for the NLT tests and will remove that check in the functional tests PR.

src/client/pydaos/torch/torch_shim.c

mchaarawi · 2024-11-14T18:21:24Z

src/client/pydaos/torch/torch_shim.c

+
+		for (int i = 0; i < eq_rc; ++i) {
+			struct io_op *op    = container_of(evp[i], struct io_op, ev);
+			int           op_rc = complete_read_op(hdl, op);


complete_read_op() sets op->err, but I missed where this err is checked to indicate that a read error has occurred (if it did)?

Good point! Fixed.

Unresolving, as it sounds like we can still overwrite the error on a subsequent iteration, and the function would then still return SUCCESS?

The function returns rc, and in this loop rc can be overwritten only while it is 0, so only the first encountered error code is kept 😬

@mchaarawi ^ I'm not sure why the reply does not show up in the comments section 😬
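
The first-error-wins behavior described in this thread can be sketched as follows. This is a minimal Python model of the idea, not the C code from torch_shim.c; `first_error` and the list of per-operation return codes are hypothetical names used only for illustration.

```python
def first_error(op_results):
    """Return 0 if every operation succeeded, else the first non-zero code.

    `op_results` is a hypothetical list of per-operation return codes,
    standing in for the values produced while draining completed events.
    """
    rc = 0
    for op_rc in op_results:
        # Only record an error while rc is still 0, so a later failure
        # (or a later success) never overwrites the first error seen.
        if op_rc != 0 and rc == 0:
            rc = op_rc
    return rc
```

With this shape, a batch such as `[0, 5, 0, 7]` reports 5, and a later successful operation cannot reset the result to success.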

daltonbohning · 2024-11-14T18:52:44Z

src/SConscript

+    # For daos_der2errno() used by pydaos.torch module
+    env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/common.h')
+    env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/debug.h')
+    env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/profile.h')
+    env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/dtx.h')
+    env.Install(os.path.join('$PREFIX', 'include/daos'), 'include/daos/cmd_parser.h')
+


Is this comment accurate?

For daos_der2errno() used by pydaos.torch module

Are these imports ONLY for daos_der2errno? If so, it's odd that we need so much for that function - particularly dtx.h

src/client/pydaos/torch/Readme.md

src/client/pydaos/torch/__init__.py

src/client/pydaos/torch/torch_api.py

utils/node_local_test.py

daltonbohning · 2024-11-14T19:18:59Z

src/client/pydaos/torch/torch_shim.c

+out:
+	if (rc) {
+		dfs_disconnect(hdl->dfs);
+
+		D_FREE(hdl->global.iov_buf);
+		D_FREE(hdl);
+		hdl = NULL;
+	}


Do we also need to free this one on error?

PyObject *result = PyList_New(2);

No, the caller expects this tuple as a return value.

So if we return an error after allocating this, the caller is expected to free it? I would think on error we should free this and just return a null handle.

No, the caller does not need to do that; the Python reference counter will clean it up once the caller is done using the returned tuple.

src/client/pydaos/torch/Readme.md

daosbuild1 · 2024-11-15T03:56:28Z

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15475/21/display/redirect

daosbuild1 · 2024-11-15T12:13:35Z

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15475/22/execution/node/1497/log

jolivier23 · 2024-11-15T15:30:17Z

src/client/pydaos/torch/torch_shim.c

+static PyObject *
+__shim_handle__module_init(PyObject *self, PyObject *args)
+{
+	int rc = daos_init();


Is there ever a case where the module might get loaded but not used? I know with the IL we delay init until someone actually does I/O that requires it. I realize it's a slightly different use case here, and maybe they will only load the module if they are going to use it.

I'm not entirely sure, but I think Python lazy-loads modules by default?

src/client/pydaos/torch/torch_shim.c

mchaarawi · 2024-11-15T18:20:34Z

src/client/pydaos/torch/torch_shim.c

+
+	int rc = dfs_lookup(hdl->dfs, path, O_RDONLY, &obj, NULL, NULL);
+	if (rc) {
+		return PyLong_FromLong(-rc);


Hmm, I still see a negative errno returned here.

The caller expects a positive number of splits on success and a negative errno on failure; I'm not sure what's wrong with this?
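
For readers following the convention under discussion, here is a minimal sketch of how a Python caller could interpret such a return value. The helper name `check_split_count` is hypothetical and is not taken from the PR; only the convention itself (positive count on success, negative errno on failure) comes from the comment above.

```python
import errno
import os


def check_split_count(ret):
    """Interpret a shim-style return value.

    Assumed convention, as stated in the thread: a non-negative value is
    the number of splits, a negative value is -errno.
    """
    if ret < 0:
        code = -ret
        # OSError maps well-known errno values to subclasses such as
        # FileNotFoundError automatically.
        raise OSError(code, os.strerror(code))
    return ret
```

A call like `check_split_count(-errno.ENOENT)` raises an `OSError` whose `errno` attribute is `errno.ENOENT`, while `check_split_count(3)` simply returns 3.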

mchaarawi · 2024-11-15T18:28:27Z

src/client/pydaos/torch/torch_shim.c

+	  Since python can use buffer like objects that might not have contiguous memory layout,
+	  let's put a guardrail accepting only buffers with contiguous memory region


What is the challenge in supporting a non-contiguous memory layout? DAOS supports that with the sgl.

There's no challenge; the Python wrapper allocates the buffer before the call to the shim layer: https://github.com/daos-stack/daos/pull/15475/files/92e30a337949e2eb8116cf91476e18c1bd67cec3#diff-2ea6a1bbd40194ac265357a571f9323b1097ff49ef7d887ae663170a9396379dR359 so this is more of a guardrail to keep things simple.
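
To illustrate what such a guardrail rejects, the same property can be checked on the Python side with the standard `memoryview` API. This is only a sketch; `require_contiguous` is a hypothetical helper and does not claim to mirror how the C shim performs its check.

```python
def require_contiguous(buf):
    """Return a memoryview of `buf`, refusing non-contiguous layouts."""
    view = memoryview(buf)
    if not view.c_contiguous:
        raise BufferError("only buffers with a contiguous memory region are accepted")
    return view


# A freshly allocated bytearray is one contiguous block and is accepted.
data = bytearray(16)
whole = require_contiguous(data)

# A strided slice of the same memory skips every other byte, so it is
# not contiguous and would be rejected by the guard.
strided = memoryview(data)[::2]
```

Since the wrapper allocates the buffer itself, the accepted case is the only one expected in practice; the strided view only shows the kind of buffer-like object the guardrail is meant to keep out.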

src/client/pydaos/torch/torch_shim.c

mchaarawi · 2024-11-15T18:30:44Z

src/client/pydaos/torch/torch_shim.c

+	if (rc) {
+		goto out;
+	}
+	if (read != bview.len) {


I see you added that now, but what I meant earlier is: if you read less in this case (hit EOF, for example), is an error appropriate? Is this guaranteed to never read beyond EOF?
If yes, then all good. If not, then I would think an error cannot be returned here.

Yeah, I think this check is more for detecting errors earlier. During the namespace scan, each file's size is stored along with its path, and the buffer is later created with that size. If the size does not match on read, something has changed between scan and read, which might be a bit fishy, since the dataset is supposed to be static.
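
The scan-then-read consistency check described above can be sketched like this. Everything here is illustrative: `read_exact`, `read_fn`, and `fake_read` are hypothetical names, and the choice of `EIO` for the mismatch is an assumption rather than what the PR returns.

```python
import errno


def read_exact(read_fn, path, expected_size):
    """Read `path` into a buffer sized at scan time and insist on a full read.

    `read_fn(path, buf)` is a hypothetical stand-in for the shim's read
    call: it fills `buf` and returns the number of bytes actually read.
    """
    buf = bytearray(expected_size)
    nread = read_fn(path, buf)
    if nread != expected_size:
        # The size recorded during the scan no longer matches, so the
        # supposedly static dataset changed between scan and read.
        raise OSError(errno.EIO, f"{path}: expected {expected_size} bytes, read {nread}")
    return buf


def fake_read(path, buf):
    """Toy reader that always serves the same 4-byte payload."""
    payload = b"abcd"
    n = min(len(payload), len(buf))
    buf[:n] = payload[:n]
    return n
```

Here `read_exact(fake_read, "/a", 4)` succeeds, while asking for 8 bytes raises, which models a file that shrank after the scan.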

mchaarawi · 2024-11-15T18:31:50Z

src/client/pydaos/torch/torch_shim.c

+
+	rc2 = daos_event_fini(&op->ev);
+	if (rc2) {
+		D_ERROR("Could not finalize event: %s (rc=%d)", d_errstr(rc2), rc2);


Same as before: need to set rc to rc2 if rc == 0.

This function has return 0 as its happy path; if the control flow reaches the code after out, then rc contains the original error code, which I'd like to return instead of an error code from the cleanup phase.

src/client/pydaos/torch/torch_shim.c

mchaarawi · 2024-11-15T18:34:30Z

src/client/pydaos/torch/torch_shim.c

+
+		for (int i = 0; i < eq_rc; ++i) {
+			struct io_op *op    = container_of(evp[i], struct io_op, ev);
+			int           op_rc = complete_read_op(hdl, op);


Unresolving, as it sounds like we can still overwrite the error on a subsequent iteration, and the function would then still return SUCCESS?

0xE0F requested review from a team as code owners November 8, 2024 07:09

0xE0F marked this pull request as draft November 8, 2024 07:12

johannlombardi requested changes Nov 8, 2024

github-advanced-security bot found potential problems Nov 11, 2024

utils/ci/run_in_gha.sh (fixed)

0xE0F force-pushed the 0xe0f/pydaos.torch branch from a472716 to 8c83854 November 14, 2024 04:03

github-advanced-security bot found potential problems Nov 14, 2024

utils/ci/run_in_gha.sh (fixed)

0xE0F force-pushed the 0xe0f/pydaos.torch branch from 8c83854 to 76ed7e9 November 14, 2024 04:47

0xE0F force-pushed the 0xe0f/pydaos.torch branch from 76ed7e9 to 46dee9b November 14, 2024 05:47

Fixed NLT

0db9a0c

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>

0xE0F force-pushed the 0xe0f/pydaos.torch branch from 46dee9b to 0db9a0c November 14, 2024 06:20

johannlombardi requested changes Nov 14, 2024

debian/changelog (resolved)

utils/rpms/daos.spec (resolved)

Fixed release version in rpm spec

92e30a3

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>

johannlombardi approved these changes Nov 14, 2024

mchaarawi requested changes Nov 14, 2024

mchaarawi requested a review from daltonbohning November 14, 2024 18:40

daltonbohning reviewed Nov 14, 2024

Apply suggestions from code review

6064df1

Co-authored-by: Dalton Bohning <dalton.bohning@intel.com> Signed-off-by: 0xE0F <denis.barahtanov@gmail.com>

0xE0F dismissed johannlombardi’s stale review via 6064df1 November 15, 2024 01:17

0xE0F added 2 commits November 15, 2024 12:21

Apply suggestions from code review

f1c2468

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>

Update python docstring

8183ece

And trying to check other tests in the pipeline with: Allow-unstable-test: true Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>

0xE0F requested a review from mchaarawi November 15, 2024 06:12

jolivier23 reviewed Nov 15, 2024

mchaarawi reviewed Nov 15, 2024

Review fixes

a3a09ff

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>

0xE0F requested a review from mchaarawi November 15, 2024 23:55

Linter fix

f63284a

Signed-off-by: Denis Barakhtanov <dbarahtanov@enakta.com>
