Releases: takuseno/d3rlpy
Release v2.0.3
An emergency patch to fix a bug in the predict_value method (#297).
Release v2.0.2
The major update has finally been released! Since the start of the project, d3rlpy has earned almost 1K GitHub stars ⭐, which is a great milestone. This update includes many major changes.
Upgrade Gym version
From this version, d3rlpy only supports the latest Gym version, 0.26.0. This change allows us to support Gymnasium in a future update.
Algorithm
Clear separation between configuration and algorithm
From this version, each algorithm (e.g. "DQN") has a config class (e.g. "DQNConfig"). This allows us to serialize and deserialize algorithms as described later.
import d3rlpy

dqn = d3rlpy.algos.DQNConfig(learning_rate=3e-4).create(device="cuda:0")
Decision Transformer
Decision Transformer is finally available! You can check the reproduction script to see how to use it.
import d3rlpy

dataset, env = d3rlpy.datasets.get_pendulum()

dt = d3rlpy.algos.DecisionTransformerConfig(
    batch_size=64,
    learning_rate=1e-4,
    optim_factory=d3rlpy.models.AdamWFactory(weight_decay=1e-4),
    encoder_factory=d3rlpy.models.VectorEncoderFactory(
        [128],
        exclude_last_activation=True,
    ),
    observation_scaler=d3rlpy.preprocessing.StandardObservationScaler(),
    reward_scaler=d3rlpy.preprocessing.MultiplyRewardScaler(0.001),
    context_size=20,
    num_heads=1,
    num_layers=3,
    warmup_steps=10000,
    max_timestep=1000,
).create(device="cuda:0")

dt.fit(
    dataset,
    n_steps=100000,
    n_steps_per_epoch=1000,
    save_interval=10,
    eval_env=env,
    eval_target_return=0.0,
)
Serialization
In this version, d3rlpy introduces a compact serialization format, d3, that includes both hyperparameters and model parameters in a single file. This makes it easy to save checkpoints and reconstruct algorithms for evaluation and deployment.
import d3rlpy
dataset, env = d3rlpy.datasets.get_cartpole()
dqn = d3rlpy.algos.DQNConfig().create()
dqn.fit(dataset, n_steps=10000)
# save as d3 file
dqn.save("model.d3")
# reconstruct the exactly same DQN
new_dqn = d3rlpy.load_learnable("model.d3")
ReplayBuffer
From this version, there is no longer a clear separation between ReplayBuffer and MDPDataset. Instead, ReplayBuffer has the flexibility to support any kind of algorithm and experiment. Please check the documentation for details.
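For example, the same ReplayBuffer now backs both offline datasets and online training. Below is a minimal sketch, assuming the v2 helper create_fifo_replay_buffer and the SACConfig class; check the documentation for the exact signatures.

import gym

import d3rlpy

# create a FIFO replay buffer bound to the environment for online training
env = gym.make("Pendulum-v1")
buffer = d3rlpy.dataset.create_fifo_replay_buffer(limit=100000, env=env)

# the same buffer object is passed to fit_online
sac = d3rlpy.algos.SACConfig().create(device="cuda:0")
sac.fit_online(env, buffer, n_steps=100000)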
Release v1.1.1
Benchmark
The benchmark results of IQL and NFQ have been added to d3rlpy-benchmarks. In addition, results with more random seeds (up to 10) have been added for all algorithms, making the benchmark results more reliable.
Documentation
- More descriptions have been added to the Finetuning tutorial page.
- The Offline Policy Selection tutorial page has been added.
Enhancements
- cloudpickle and GPUUtil dependencies have been removed.
- The Gaussian likelihood computation for MOPO is now mathematically correct (thanks, @tominku).
Release v1.1.0
MDPDataset
The timestep alignment is now exactly the same as D4RL:
# observations = [o_1, o_2, ..., o_n]
observations = np.random.random((1000, 10))
# actions = [a_1, a_2, ..., a_n]
actions = np.random.random((1000, 10))
# rewards = [r(o_1, a_1), r(o_2, a_2), ...]
rewards = np.random.random(1000)
# terminals = [t(o_1, a_1), t(o_2, a_2), ...]
terminals = ...
where r(o, a) is the reward function and t(o, a) is the terminal function.
The reason for this change is that many users were confused by the difference between d3rlpy and D4RL; now the two are aligned in the same way. Note that this change might break your existing datasets.
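Putting the arrays above together, a dataset is constructed as before. A minimal sketch with toy data follows; here terminals is simply a zero vector with only the last step marked as terminal.

import numpy as np

import d3rlpy

# toy data following the D4RL-style alignment above
observations = np.random.random((1000, 10))
actions = np.random.random((1000, 10))
rewards = np.random.random(1000)

# mark only the final step of this toy trajectory as terminal
terminals = np.zeros(1000)
terminals[-1] = 1.0

dataset = d3rlpy.dataset.MDPDataset(observations, actions, rewards, terminals)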
Algorithms
- Neural Fitted Q-iteration (NFQ)
Enhancements
- AWAC, CRR and IQL use a non-squashed gaussian policy function.
- More tutorial pages have been added to the documentation.
- The software design page has been added to the documentation.
- The reproduction script for IQL has been added.
- The progress bar in online training is visually improved in Jupyter Notebook #161 (thanks, @aiueola )
- NaN checks have been added to MDPDataset.
- The target_reduction_type and bootstrap options have been removed.
Bugfix
Release v1.0.0
We are proud to announce that v1.0.0 has finally been released! The first version was released in August 2020 under the support of the IPA MITOU program. At the first release, d3rlpy only supported a few algorithms and did not even support online training. After months of constructive feedback and insights from the users and the community, d3rlpy has established itself as the first offline deep RL library supporting many online and offline algorithms along with unique features. The next chapter towards the ambitious v2.0.0 also starts today. Please stay tuned for the next announcement!
NeurIPS 2021 Offline RL Workshop
The workshop paper about d3rlpy has been presented at the NeurIPS 2021 Offline RL Workshop.
URL: https://arxiv.org/abs/2111.03788
Benchmarks
The full benchmark results are finally available at d3rlpy-benchmarks.
Algorithms
- Implicit Q-Learning (IQL)
Enhancements
- deterministic option is added to collect method (see the sketch after this list)
- rollout_return metric is added to online training
- random_steps is added to fit_online method
- --save option is added to d3rlpy CLI commands (thanks, @pstansell)
- multiplier option is added to reward normalizers
- many reproduction scripts are added
- policy_type option is added to BC
- get_atari_transition function is added for the Atari 2600 offline benchmark procedure
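Below is a minimal sketch of how the new deterministic and collection options fit into the existing collect and fit_online calls. The surrounding setup (SAC, ReplayBuffer, Pendulum-v0) is illustrative, and keyword names other than deterministic and random_steps are assumptions.

import gym

import d3rlpy

env = gym.make("Pendulum-v0")
sac = d3rlpy.algos.SAC()
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)

# collect data with the greedy (deterministic) policy instead of the stochastic one
sac.collect(env, buffer, deterministic=True, n_steps=10000)

# fill the buffer with uniformly random actions before learning starts
sac.fit_online(env, buffer, random_steps=1000, n_steps=100000)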
Bugfix
- document fix (thanks, @araffin )
- Fix TD3+BC's actor loss function
- Fix gaussian noise for TD3 exploration
Roadmap towards v2.0.0
- Sophisticated config system using dataclasses
- Dump configuration and model parameters in a single file
- Change MDPDataset format to align with D4RL datasets
- Support large dataset
- Support tuple observation
- Support large-scale data-parallel offline training
- Support large-scale distributed online training
- Support Transformer architecture (e.g. Decision Transformer)
- Speed up training with torch.jit.script and CUDA Graphs
- Change library name to represent the unification of offline and online
Release v0.91
Algorithm
RewardScaler
From this version, preprocessors are available for rewards, allowing you to normalize, standardize and clip the reward values.
import d3rlpy
# normalize
cql = d3rlpy.algos.CQL(reward_scaler="min_max")
# standardize
cql = d3rlpy.algos.CQL(reward_scaler="standardize")
# clip (you can't use string alias)
cql = d3rlpy.algos.CQL(reward_scaler=d3rlpy.preprocessing.ClipRewardScaler(-1.0, 1.0))
copy_policy_from and copy_q_function_from methods
In the scenario of finetuning, you might want to initialize SAC's policy function with the pretrained CQL's policy function to boost the initial performance. From this version, you can do that as follows:
import d3rlpy
# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(...)
# transfer the policy function
sac = d3rlpy.algos.SAC()
sac.copy_policy_from(cql)
# you can also transfer the Q-function
sac.copy_q_function_from(cql)
# finetuning with online algorithm
sac.fit_online(...)
Enhancements
- show messages for skipping model builds
- add alpha parameter option to DiscreteCQL
- keep counting the number of gradient steps
- allow expanding MDPDataset with larger discrete actions (thanks, @jamartinh)
- callback function is called every gradient step (previously, it was called every epoch; see the sketch after this list)
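Since the callback now runs every gradient step, it can be used for fine-grained monitoring. A minimal sketch follows, assuming the (algo, epoch, total_step) callback signature and the callback argument added in v0.90.

import d3rlpy

dataset, env = d3rlpy.datasets.get_cartpole()

def print_progress(algo, epoch, total_step):
    # invoked after every gradient step from this version
    if total_step % 1000 == 0:
        print(f"epoch={epoch} total_step={total_step}")

cql = d3rlpy.algos.DiscreteCQL()
cql.fit(dataset, n_steps=10000, callback=print_progress)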
Bugfix
- FQE's loss function has been fixed (thanks for the report, @guyk1971)
- fix documentation build (thanks, @astrojuanlu)
- fix d4rl dataset conversion for MDPDataset (this has a significant impact on performance with d4rl datasets)
Release v0.90
Algorithm
- Conservative Offline Model-Based Optimization (COMBO)
Drop data augmentation feature
From this version, the data augmentation feature has been dropped because it introduced a lot of code complexity. In order to keep d3rlpy supporting many algorithms while staying as simple as possible, the feature was removed. Instead, TorchMiniBatch was introduced internally, and all algorithms became simpler.
collect method
In offline RL experiments, data collection plays an important role, especially when you try new tasks.
From this version, the collect method is finally available.
import d3rlpy
import gym
# prepare environment
env = gym.make('Pendulum-v0')
# prepare algorithm
sac = d3rlpy.algos.SAC()
# prepare replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
# start data collection without updates
sac.collect(env, buffer)
# export to MDPDataset
dataset = buffer.to_mdp_dataset()
# save as file
dataset.dump('pendulum.h5')
Along with this change, random policies are also introduced. These are useful for collecting datasets with a random policy (a usage sketch follows the snippet below).
# continuous action-space
policy = d3rlpy.algos.RandomPolicy()
# discrete action-space
policy = d3rlpy.algos.DiscreteRandomPolicy()
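For instance, a random policy can be dropped into the same collect workflow shown above. This is a minimal sketch, assuming RandomPolicy exposes the same collect interface as the other algorithms.

import gym

import d3rlpy

env = gym.make('Pendulum-v0')

# continuous random policy used purely for data collection
policy = d3rlpy.algos.RandomPolicy()

# collect random transitions and export them as a dataset
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
policy.collect(env, buffer)
dataset = buffer.to_mdp_dataset()
dataset.dump('pendulum_random.h5')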
Enhancements
- CQL and BEAR become closer to the official implementations
- callback argument has been added to algorithms
- random datasets have been added to the cartpole and pendulum datasets; you can specify them via dataset_type='random' at the get_cartpole and get_pendulum methods (see the sketch after this list)
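A minimal sketch of the dataset_type option described above; the default value shown in the first call is an assumption.

import d3rlpy

# default dataset collected during training
dataset, env = d3rlpy.datasets.get_cartpole()

# dataset collected by a random policy
random_dataset, env = d3rlpy.datasets.get_cartpole(dataset_type='random')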
Bugfix
- fix action normalization at predict_value method (thanks, @navidmdn)
- fix seed settings at reproduction codes
What's missing before v1.00?
Currently, I'm benchmarking all algorithms with the d4rl dataset. Through these experiments, I realized that it's very difficult to reproduce the tables reported in the papers because the full hyperparameters, which are tuned per dataset, were not revealed. So I gave up on reproducing the tables and started producing numbers with the official code to see if d3rlpy's results match.
Release v0.80
Algorithms
New algorithms are introduced in this version.
- Critic Regularized Regression (CRR)
- Model-based Offline Policy Optimization (MOPO)
Model-based RL
Model-based RL has been supported previously, with the model-based-specific logic implemented on the dynamics side. This approach enabled us to combine model-based algorithms with arbitrary model-free algorithms. However, it required complex designs to implement recent model-based RL methods. So the dynamics interface was refactored, and MOPO is the first algorithm to show how d3rlpy supports model-based RL algorithms.
# train dynamics model
from d3rlpy.datasets import get_pendulum
from d3rlpy.dynamics import ProbabilisticEnsembleDynamics
from d3rlpy.metrics.scorer import dynamics_observation_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_reward_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_prediction_variance_scorer
from sklearn.model_selection import train_test_split
dataset, _ = get_pendulum()
train_episodes, test_episodes = train_test_split(dataset)
dynamics = ProbabilisticEnsembleDynamics(learning_rate=1e-4, use_gpu=True)

dynamics.fit(train_episodes,
             eval_episodes=test_episodes,
             n_epochs=100,
             scorers={
                 'observation_error': dynamics_observation_prediction_error_scorer,
                 'reward_error': dynamics_reward_prediction_error_scorer,
                 'variance': dynamics_prediction_variance_scorer,
             })
# train Model-based RL algorithm
from d3rlpy.algos import MOPO
# give the trained dynamics model as an argument
mopo = MOPO(dynamics=dynamics)
mopo.fit(dataset, n_steps=100000)
enhancements
- fitter method has been implemented (thanks, @jamartinh)
- tensorboard_dir replaces the tensorboard flag at fit method (thanks, @navidmdn)
- show warning messages when unused arguments are passed
- show comprehensive error messages when action-space is not compatible
- fit method accepts MDPDataset object
- dropout option has been implemented in encoders
- add appropriate __repr__ methods to show pretty outputs for print(algo)
- metrics collection is refactored
bugfix
- fix core dumped errors by fixing the numpy version
- fix CQL backup
Release v0.70
Command Line Interface
New commands are added in this version.
record
You can record the video of the evaluation episodes without coding anything.
$ d3rlpy record d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# record wrapped environment
$ d3rlpy record d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
play
You can run the evaluation episodes with rendering images.
# play simple environment
$ d3rlpy play d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0
# play wrapped environment
$ d3rlpy play d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
--env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
data-point mask for bootstrapping
Ensemble training of Q-functions has been shown to be a powerful method for achieving robust training. Previously, the bootstrap option has been available for algorithms, but the mask for the Q-function loss was randomly created every time a batch was sampled.
In this version, the create_mask option is available for MDPDataset and ReplayBuffer, which creates a unique mask at each data-point.
# offline training
dataset = d3rlpy.dataset.MDPDataset(observations, actions, rewards, terminals, create_mask=True, mask_size=5)
cql = d3rlpy.algos.CQL(n_critics=5, bootstrap=True, target_reduction_type='none')
cql.fit(dataset)
# online training
buffer = d3rlpy.online.buffers.ReplayBuffer(1000000, create_mask=True, mask_size=5)
sac = d3rlpy.algos.SAC(n_critics=5, bootstrap=True, target_reduction_type='none')
sac.fit_online(env, buffer)
As you noticed above, target_reduction_type is newly introduced to specify how to aggregate target Q-values. In the standard Soft Actor-Critic, target_reduction_type='min' is used. If you choose 'none', each ensemble Q-function uses its own target value, which is similar to what Bootstrapped DQN does.
better module access
From this version, you can navigate to all modules through d3rlpy.
# previously
from d3rlpy.datasets import get_cartpole
dataset, env = get_cartpole()

# v0.70
import d3rlpy
dataset, env = d3rlpy.datasets.get_cartpole()
new logger style
From this version, structlog is used internally to print information instead of the raw print function. This allows us to emit more structured information. Furthermore, you can control what to show and what to save to a file if you override the logger configuration.
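The snippet below is a generic structlog sketch, not a d3rlpy-specific API: it filters console output to warnings and above via the standard structlog configuration (assuming structlog >= 20.2; whether d3rlpy's own logger picks this up depends on how it configures structlog internally).

import logging

import structlog

# only emit WARNING and above from structlog-based loggers
structlog.configure(
    wrapper_class=structlog.make_filtering_bound_logger(logging.WARNING),
)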
enhancements
- soft_q_backup option is added to CQL
- Paper Reproduction page has been added to the documentation in order to show the performance with the paper configurations
- commit method at D3RLPyLogger returns metrics (thanks, @jamartinh)
bugfix
- fix epoch count in offline training
- fix total_step count in online training
- fix typos in documentation (thanks, @pstansell)
Release v0.61
CLI
The record command is newly introduced in this version. You can record videos of evaluation episodes with the saved model.
$ d3rlpy record d3rlpy_logs/CQL_20210131144357/model_100.pt --env-id Hopper-v2
You can also use the wrapped environment.
$ d3rlpy record d3rlpy_logs/DQN_online_20210130170041/model_1000.pt \
--env-header 'import gym; from d3rlpy.envs import Atari; env = Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'
bugfix
- fix saving models every step in fit_online method
- fix Atari wrapper to reproduce the paper result
- fix CQL and BEAR algorithms