Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Executor 2.0 #5528

Merged
merged 6 commits into from
Sep 10, 2024
Merged

Executor 2.0 #5528

merged 6 commits into from
Sep 10, 2024

Conversation

mzient
Copy link
Contributor

@mzient mzient commented Jun 17, 2024

Category:

New feature (non-breaking change which adds functionality)

Description:

This PR adds the Executor 2.0 facade.
It does not expose it in Python yet nor does it allow the Python user to construct D2H transitions.

The PR consists of 3 parts:

  • the implementation of the ExecutorBase interface
  • the changes in ExecutorFactory that enable the new executor to be used by default when AsyncPipelined executor was used - based on DALI_USE_EXEC2 environment variable
  • changes in tests that use a different executor type for buffer presizing and ExecutorMeta tests - executor 2.0 provides neither (presizing is incompatible with the dynamic allocation approach)
  • changes in Python backend related to memory destruction timing (mostly GIL related)
  • new CI tests that explicitly enable or disable DALI_USE_EXEC2 flag

Additional information:

Affected modules and functionalities:

  • ExecutorFactory
  • buffer Presizing tests
  • ExecutorMeta tests
  • CI jobs

Key points relevant for the review:

Tests:

Existing tests can be reused by specifying DALI_USE_EXEC2=1 in CI arguments (or locally, as an environment variable).

  • Existing tests apply
  • New tests added
    • Python tests
    • GTests
    • Benchmark
    • Other
  • N/A

Checklist

Documentation

  • Existing documentation applies
  • Documentation updated
    • Docstring
    • Doxygen
    • RST
    • Jupyter
    • Other
  • N/A

DALI team only

Requirements

  • Implements new requirements
  • Affects existing requirements
  • N/A

REQ IDs: N/A

JIRA TASK: DALI-4030

@mzient mzient force-pushed the executor3 branch 3 times, most recently from 4e618df to 3dd85e6 Compare June 24, 2024 09:07
@mzient mzient force-pushed the executor3 branch 3 times, most recently from e2f392c to bb9b2d5 Compare July 4, 2024 12:54
@mzient mzient force-pushed the executor3 branch 5 times, most recently from c1ecd90 to fa82135 Compare July 10, 2024 17:45
@NVIDIA NVIDIA deleted a comment from dali-automaton Jul 16, 2024
@NVIDIA NVIDIA deleted a comment from dali-automaton Jul 16, 2024
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [16632816]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [16632816]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [16675503]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [16675793]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [16675793]: BUILD FAILED

@mzient mzient force-pushed the executor3 branch 2 times, most recently from 51e4976 to ecdbb55 Compare July 18, 2024 18:42
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [16715762]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [16715762]: BUILD FAILED

@mzient mzient force-pushed the executor3 branch 2 times, most recently from fcc9d27 to 803da65 Compare July 19, 2024 15:55
@mzient mzient force-pushed the executor3 branch 2 times, most recently from aa197c2 to 97dfdcb Compare July 23, 2024 08:01
@mzient mzient changed the title [WIP] Executor2 Executor 2.0 Sep 5, 2024
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18167691]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18186935]: BUILD STARTED

Comment on lines +625 to +627
daliCreatePipeline2(&handle, serialized.c_str(), serialized.size(), batch_size, num_thread,
this->device_id_, false, false, false,
prefetch_queue_depth, prefetch_queue_depth, prefetch_queue_depth, true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Executor 2.0 doesn't provide ExecutorMeta and doesn't support buffer preallocation (because it's against it's very design) - and it can forcibly replace the async-pipelined one if the environment variable is specified. Hence, use non-async, non-pipelined executor in meta and preallocation tests.

Comment on lines +406 to +407
const bool pipelined = false;
const bool async = false;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presizing is not supported for Executor 2.0, which can be foricbly enabled with env. DALI_USE_EXEC2=1.

@mzient mzient force-pushed the executor3 branch 2 times, most recently from 23feefc to 810ad5c Compare September 6, 2024 12:54
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18186935]: BUILD PASSED

mzient and others added 3 commits September 9, 2024 00:28
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
…ice.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@mzient mzient marked this pull request as ready for review September 8, 2024 22:36
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18251041]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18251041]: BUILD PASSED

Copy link
Member

@szalpal szalpal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments for possible enhancements

template <typename... T>
std::unique_ptr<ExecutorBase> GetExecutorImpl(bool pipelined, bool separated, bool async,
T&&... args) {
if (async && separated && pipelined) {
return std::make_unique<AsyncSeparatedPipelinedExecutor>(std::forward<T>(args)...);
} else if (async && !separated && pipelined) {
return std::make_unique<AsyncPipelinedExecutor>(std::forward<T>(args)...);
if (ForceExec2()) {
std::cerr << "\n!!! Forced use of Executor 2.0 !!!" << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a nitpick: I'm wondering whether this message fits in cerr stream - as it's not really an error, more like info. Maybe simple cout should be sufficient?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit fuzzy, but I see it as a warning rather than output.

return cfg;
}

bool ForceExec2() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe a documentation saying, what does the return value mean would be good here. From the name I'd say this is more like a void function.

Comment on lines +28 to +33
enum class QueueDepthPolicy : int {
FullyBuffered, //< All operators maintain a queue
BackendChange, //< Only operators followed by one with a different backend have a queue
OutputOnly, //< Only the pipeline output has multiple buffers
Legacy = BackendChange,
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found it hard to understand from this documentation (only this one, the other enums are clear). Could you extend the docs slightly and maybe back it up with some example?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll try to clarify.

Comment on lines +42 to +47
Single, //< There's just one stream that's used by all operators
PerBackend, //< Operators are scheduled on a stream specific to their backend (mixed or GPU)
PerOperator //< Independent operators are executed on separate streams.

// TODO(michalz): Check if this is legal with existing operator implementations - likely not
// PerIteration, //< Streams are cycled on a per-iteration basis
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add that it's the CUDA stream that we're referring here to. Just to be perfectly clear.

Comment on lines +35 to +39
enum class OperatorConcurrency : int {
None, //< at no time can mutliple operators run
Backend, //< operators from different backends can execute in parallel
Full, //< independent operators can run in parallel
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a concurrency limit on operators?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The number of operators that can be run in parallel is limited by the number of threads in the executor.
Otherwise the concurrency is limited by the policy (as described in this enum), the graph topology and the operators being implicitly non-reentrant (you can't run the next iteration of an operator before the previous one is finished).

Comment on lines 204 to 205
// Must be 1st member to be destroyed last.
std::optional<DeviceGuard> dtor_guard_;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to add some test/static_assert that will check this condition? I'm asking, because somebody may by accident add a member above (e.g. in unlabelled private section right after class keyword)

Also, how this would relate if somebody adds a static variable e.g in the constructor?

Copy link
Contributor Author

@mzient mzient Sep 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope the comment will serve to address the former. I could add a base class, but there's nothing that could stop a careless contributor from adding another one before that.
Unfortunately, I don't think there's a way to have such a check. It's possible only if the enclosing class has no virtual functions (which may be true at the time, but it'd be limiting).
I'll move it to the top of the class.

As for the second - well, that would be extremely bad programming - and adding a static variable that depends on the thread-local state (current device) is asking for trouble to say the least.

Comment on lines 300 to 304
py::gil_scoped_acquire aqr;
{
auto tmp = std::move(obj_ref);
(void)tmp;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lambda is quite unclear - e.g. why the (void)tmp;. Could you add a description what happens there and why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "trick" stores the object whose lifetime we want to prolong in the lambda closure. When the shared pointer is destroyed, the deleter (the lambda) is invoked. In the lambda, we acquire GIL and destroy the contents of the closure while GIL is held. I've described it in the comments, too.

…plain closure destruction / GIL trick.

Signed-off-by: Michal Zientkiewicz <michalz@nvidia.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18263317]: BUILD STARTED

}

void Executor2::Init() {
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it on purpose that impl_->Init is not called?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's nothing to call, so yes. I can mark it as no-op, as suggested below.

}

void Executor2::ReleaseOutputs() {
// no-op
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that it directly says no-op, can you mark other no-op calls as well (Init?, EnableMemoryStats?)

return prefetch_depth_;
}

OperatorBase *GetOperator(std::string_view input_name) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How likely is that the GetOperator will fail to find a given input_name?
If it is highly unlikely, maybe throwing an exception would be better than notifying about an error by returning a special value?
I understand that this would probably induce more significant refactoring, so this may be a suggestion for future improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an excellent suggestion, but as you rightfully pointed out, it calls for its own PR. I've looked at usages and in many places the caller simply assumes that the function returns non-null. There are a few places where the return value is checked - if the need to have those checks, then I'd rename this function to GetOperatorPtr and add an inline OperarorBase &GetOperator(sring_view name) which would check the result of GetOperatorPtr and throw if it's null.

}

enum class State {
New = 0,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you expect to have more states in the future (if not ignore the comment)? If so, I would ask myself if any transitions could be in invalid, e.g. (Building->Running)? I guess I would be happy to see a state transition method, that would throw an exception if advancing between specific states is not correct. This would give an opportunity to find such bugs.

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18263317]: BUILD FAILED

Signed-off-by: Michał Zientkiewicz <mzient@gmail.com>
@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18268756]: BUILD STARTED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18268756]: BUILD FAILED

@dali-automaton
Copy link
Collaborator

CI MESSAGE: [18268756]: BUILD PASSED

@mzient mzient merged commit f1a9a4d into NVIDIA:main Sep 10, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants