Add warnings against resetting pipeline before end of epoch and test parallel ES with fw iterator #4023

Draft · wants to merge 7 commits into main

Conversation

@stiepan (Member) commented on Jun 29, 2022:

Category:

  • Other (e.g. Documentation, Tests, Configuration)

Description:

  • If someone uses fw iterators and passes a size that is incorrect (i.e. different from what follows from the external source raising StopIteration, or one of the pipelines raises while the others do not), it may result in premature resetting of the pipelines, which in turn can lead to a range of issues: from empty epochs to corrupted data.

Adding the warnings in the hope that having them in the logs will be helpful for users, and for us when handling gh issues. A hedged sketch of the scenario the warnings target is shown below.
Adding the PES tests to the iterators, as the existing tests focused on the pipeline's run API, which is not the one used by the fw iterators.
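
A minimal, hedged sketch of the misuse described above (illustrative, not part of this PR's diff). The iterator is told the epoch has 8 batches while the external source actually produces 10, so reset() fires before the source has raised StopIteration:

    import numpy as np
    from nvidia.dali import pipeline_def, fn
    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    batch_size = 4
    batches_per_epoch = 10

    def source(sample_info):
        # End the epoch after `batches_per_epoch` iterations.
        if sample_info.iteration >= batches_per_epoch:
            raise StopIteration
        return np.full((2, 2), sample_info.idx_in_epoch, dtype=np.int32)

    @pipeline_def(batch_size=batch_size, num_threads=2, device_id=None)
    def pipe():
        return fn.external_source(source=source, batch=False)

    p = pipe()
    p.build()

    # The size is wrong on purpose: 8 batches instead of the actual 10, so the
    # iterator resets the pipeline before the source has raised StopIteration.
    it = DALIGenericIterator([p], ["data"], size=8 * batch_size)
    for epoch in range(2):
        for batch in it:
            pass
        it.reset()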

Additional information:

Affected modules and functionalities:

  • Python pipeline
  • Python fw iterator (plugins)

Key points relevant for the review:

Tests:

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-2864

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@stiepan (Member, Author) commented on Jun 29, 2022:

!build

@dali-automaton (Collaborator)

CI MESSAGE: [5209598]: BUILD STARTED

Quoted code context from the diff:

    if self._last_iter:
    if not self._last_iter:
    # resetting before some external source raised StopIteration is a no-op
    if self._input_callbacks:
@stiepan (Member, Author):

How legitimate a use case is a pipeline with an infinite external source combined with a FW iterator whose size is passed explicitly? In that case the warning will be triggered too.
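
For reference, a hedged sketch of that setup (illustrative, not from this PR): the source never raises StopIteration and the epoch boundary exists only on the iterator side, so resetting before StopIteration is intended here.

    import numpy as np
    from nvidia.dali import pipeline_def, fn
    from nvidia.dali.plugin.pytorch import DALIGenericIterator

    def infinite_source(sample_info):
        # Never raises StopIteration - produces data indefinitely.
        return np.float32([sample_info.idx_in_epoch])

    @pipeline_def(batch_size=8, num_threads=2, device_id=None)
    def pipe():
        return fn.external_source(source=infinite_source, batch=False)

    p = pipe()
    p.build()
    # The epoch length comes only from `size`; the pipeline itself never sees
    # an end of epoch, so the proposed warning would fire on every reset.
    it = DALIGenericIterator([p], ["data"], size=100 * 8, auto_reset=True)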

Contributor:

🤷 I guess it still can happen and we should behave consistently.

@stiepan (Member, Author):

😟 I thought this one would be useful when one has two or more pipelines with (P)ES that raise StopIteration and should be reset, but for some reason the epochs diverge: either because the number of iterations really is (unintentionally) different in the two, or because one uses the .run API with prefetch_queue_depth 1 and resets all pipelines as soon as the first one raises, not letting the others actually reach the end of the epoch.

@stiepan (Member, Author) commented on Jun 29, 2022:

Well, I could safeguard this check with pipeline._epoch_idx > 0. It seems that if it has ever been incremented, then there must be an ES that raises StopIteration.
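
A hedged sketch of that guard, assuming it sits next to the reset code quoted above; the warning text is illustrative, not the wording from this PR:

    if not self._last_iter:
        # Warn only if some epoch has already ended via StopIteration: if
        # _epoch_idx was never incremented, the external source may simply be
        # infinite and resetting on an explicit iterator size is legitimate.
        if self._epoch_idx > 0 and self._input_callbacks:
            warnings.warn(
                "Resetting the pipeline before its external source raised "
                "StopIteration; data from the unfinished epoch may be "
                "dropped or duplicated.")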

Contributor:

> 🤷 I guess it still can happen and we should behave consistently.

I think that we should warn that something may not be right, and if the user provides -1 as the size it should work silently.
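
A hedged usage sketch of that suggestion, reusing the pipeline p from the first sketch above: with size=-1 the epoch length comes from StopIteration itself, so no mismatch is possible and the reset should stay silent.

    # size=-1: the iterator runs until the external source raises StopIteration,
    # so resets always happen at a real end of epoch.
    it = DALIGenericIterator([p], ["data"], size=-1, auto_reset=True)
    for epoch in range(2):
        for batch in it:
            pass  # the epoch ends exactly when the source raises StopIteration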

@dali-automaton (Collaborator)

CI MESSAGE: [5209598]: BUILD PASSED

Quoted code context from the diff:

    # For one, when prefetching, parallel external source will reuse buffers
    # that might be still referenced by no_copy input fed to pipeline
    if not self.empty():
    warnings.warn(
Contributor:

I think it is possible: with a prefetch queue depth of 2 and 1 batch left to consume, we can still schedule one more run. The native part can overschedule - it will just wait for an empty output buffer - but the ES may fail (in parallel mode with no_copy; the regular ES does a copy and has an internal queue).

@stiepan (Member, Author) commented on Jun 29, 2022:

Do you mean that it should be supported? By "possible", do you mean that when using fw iterators correctly you may still end up in such a situation?

If that should be supported:

  • Should we warn only if we have PES and treat it as an extra limitation of the schedule API that is not there otherwise?
  • Or does the PES need to be adjusted to handle that as well?

Contributor:

> Should we warn only if we have PES and treat it as an extra limitation of the schedule API that is not there otherwise?

I think so. You can still use this API without the FW iterator.
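
A hedged sketch of that conclusion, reusing the names from the snippet quoted in this thread; whether self._input_callbacks is the right way to detect parallel external source here is an assumption, and the message text is illustrative:

    # Warn only when parallel external source is in play: only there can the
    # reused no_copy buffers be clobbered by an over-scheduled run. A pipeline
    # driven directly through schedule_run(), without the FW iterator, stays
    # silent.
    if self._input_callbacks and not self.empty():
        warnings.warn(
            "Resetting a pipeline that still has scheduled but unconsumed "
            "outputs; with parallel external source (no_copy) the prefetch "
            "buffers may be reused and the data corrupted.")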

Signed-off-by: Kamil Tokarski <ktokarski@nvidia.com>
@stiepan stiepan marked this pull request as draft August 26, 2022 15:28
@dali-automaton (Collaborator)

CI MESSAGE: [8224548]: BUILD FAILED
