[RLlib] Add support for multi-agent off-policy algorithms in the new API stack. #45182
Conversation
…ge_episode_buffers_to_return_episode_lists_from_sample
…hat held DQN off from learning. In addition fixed some minor bugs. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ists_from_sample Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…rror occurred in CI tests. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ists_from_sample Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…r readability of the test code for users (we describe the connector to add the 'NEXT_OBS' to the batch). Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ependent'-mode sampling. Added multi-agent example for SAC and modified 'compute_gradients' in 'SACTorchLearner' to deal with MARLModules. Commented 2 assertions in connectors that avoided multi-agent setups with 'SingleAgentEpisode's. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
@@ -33,7 +33,10 @@ def __call__(
         # to a batch structure of:
         # [module_id] -> [col0] -> [list of items]
         if is_marl_module and column in rl_module:
-            assert is_multi_agent
+            # assert is_multi_agent
+            # TODO (simon, sven): Check if we need this check for other cases as well.
This is a good point. There are still some "weird" assumptions left in some connectors' logic.
We should comb these out and make it clearer when to go into which loop with SingleAgentEpisodes vs. MultiAgentEpisodes.
Some of this has to do with the fact that EnvRunners can hold either a SingleAgentRLModule OR a MultiAgentRLModule, but Learners always(!) hold a MultiAgentRLModule. Maybe we should have Learners that operate on SingleAgentRLModules for simplicity and more transparency. It shouldn't be too hard to fix that on the Learner side.
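For orientation, here is a minimal sketch of the kind of branching this thread is about, under the assumption that the connector's data dict is keyed either by column names (single-agent) or by module IDs (multi-agent). Only the names is_marl_module, rl_module, and the membership check mirror the diff above; the helper itself is hypothetical and not RLlib's actual implementation.

```python
# Illustrative sketch only, not RLlib's connector code: the incoming `data`
# dict is assumed to be keyed either by column names (single-agent) or by
# module IDs (multi-agent), and we rebuild it into the
# [module_id] -> [column] -> [list of items] layout mentioned in the diff.
def restructure_batch(data, rl_module, is_marl_module):
    batch = {}
    for key, value in data.items():
        if is_marl_module and key in rl_module:
            # Multi-agent case: `key` is a module ID and `value` maps
            # columns to lists of items.
            module_batch = batch.setdefault(key, {})
            for column, items in value.items():
                module_batch.setdefault(column, []).extend(items)
        else:
            # Single-agent case: `key` is already a column name and
            # `value` is a list of items.
            batch.setdefault(key, []).extend(value)
    return batch
```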
rllib/utils/replay_buffers/prioritized_episode_replay_buffer.py (outdated review thread, resolved)
…sode' Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…agent off-policy algorithms. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
# If no episodes at all, log NaN stats.
if len(self._done_episodes_for_metrics) == 0:
    self._log_episode_metrics(np.nan, np.nan, np.nan)
# TODO (simon): This results in hundreds of warnings in the logs
We'll have to see. This might lead to Tune errors in the sense that at the beginning, if no episode is done yet, Tune will complain that none of the stop criteria (e.g. num_env_steps_sampled_lifetime) can be found in the result dict.
LGTM now.
I do have one concern about removing the NaN from the MultiAgentEnvRunner, but we can move it back or find a better solution (maybe initialize the most common stop keys already in algo) later.
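To make the Tune concern concrete, here is a hedged sketch: if the first results reported to Tune contain only NaN episode metrics and the stop-criterion key has not been reported yet, Tune may complain that the criterion cannot be found. The stop key num_env_steps_sampled_lifetime is the one named above; the environment, algorithm registration name, and step count are illustrative, not taken from this PR.

```python
# Hedged sketch of a Tune run whose stop criterion must be present in the
# reported results; early iterations without any finished episodes may only
# report NaN metrics, which is where the warning/complaint can come from.
from ray import tune
from ray.rllib.algorithms.sac import SACConfig

config = SACConfig().environment("Pendulum-v1")  # illustrative env choice

tune.run(
    "SAC",
    config=config.to_dict(),
    stop={"num_env_steps_sampled_lifetime": 100_000},
)
```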
Signed-off-by: Sven Mika <sven@anyscale.io>
…footprint of the class. Changes to 'MultiAgentEpisodeReplayBuffer' to reduce memory usage and increase performance. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…sed. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…irection single-agent buffer. Memory leak should be fixed with this commit. Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…:simonsays1980/ray into change_ma_buffer_to_use_list_of_episodes Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
…ge_ma_buffer_to_use_list_of_episodes
Signed-off-by: sven1977 <svenmika1977@gmail.com>
…:simonsays1980/ray into change_ma_buffer_to_use_list_of_episodes Signed-off-by: Simon Zehnder <simon.zehnder@gmail.com>
Why are these changes needed?
Off-policy algorithms have been moved from the old to the new API stack, but so far they only worked in single-agent mode. The standard Learner API for the new stack that we were missing is now available: any LearnerGroup now receives a List[EpisodeType] for updates.
This PR adds support for multi-agent setups in off-policy algorithms using the new MultiAgentEpisodeReplayBuffer. It includes all modifications necessary for "independent" sampling, plus a multi-agent SAC example to be added to the learning_tests.
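As a rough illustration of what this enables, the sketch below configures a multi-agent SAC with the new episode replay buffer. It is not the learning test added in this PR: the environment ID, policy names, and mapping function are hypothetical, and depending on the Ray version the new API stack may additionally need to be enabled via the corresponding AlgorithmConfig flag.

```python
# Hedged sketch of a multi-agent SAC setup using the buffer from this PR;
# only the class name MultiAgentEpisodeReplayBuffer comes from the PR
# description, everything else is illustrative.
from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .environment("my_multi_agent_env")  # hypothetical registered env ID
    .multi_agent(
        policies={"policy_0", "policy_1"},
        policy_mapping_fn=lambda agent_id, episode, **kwargs: (
            "policy_0" if agent_id == "agent_0" else "policy_1"
        ),
    )
    .training(
        replay_buffer_config={
            # Depending on the version, the buffer can be given as a class
            # or by name; shown by name here.
            "type": "MultiAgentEpisodeReplayBuffer",
        },
    )
)
algo = config.build()
```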
Related issue number
Checks
- I've signed off every commit (using git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.