Code for the NeurIPS 2022 paper "RORL: Robust Offline Reinforcement Learning via Conservative Smoothing". RORL trades off robustness and conservatism for offline RL via conservative smoothing and OOD underestimation.
The implementation is based on EDAC and rlkit.
To install the required dependencies:

- Install the MuJoCo 2.0 engine, which can be downloaded from the MuJoCo website.
- Install the Python packages listed in the requirements file, along with d4rl and dm_control. The commands are as follows:
```bash
conda create -n rorl python=3.7
conda activate rorl
pip install --no-cache-dir -r requirements.txt
git clone https://github.com/rail-berkeley/d4rl.git
cd d4rl
# Note: remove lines including 'dm_control' in setup.py of d4rl
pip install -e .
```
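As a convenience, the `dm_control` lines can be stripped from d4rl's `setup.py` with a one-liner (a sketch assuming GNU sed; run it inside the `d4rl` directory and check the file afterwards):

```bash
# Delete every line mentioning dm_control from d4rl's setup.py
sed -i '/dm_control/d' setup.py
```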
To reproduce the benchmark results of RORL:

```bash
python -m scripts.sac --env_name [ENVIRONMENT] --num_qs 10 --norm_input --load_config_type 'benchmark' --exp_prefix RORL
```
To reproduce the results of the adversarial experiments, simply replace 'benchmark' with 'attack'.
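For example, to train RORL on a D4RL MuJoCo dataset (the dataset name below is only an illustration; substitute any dataset name supported by the scripts):

```bash
# Train RORL with the benchmark config on an example D4RL dataset
python -m scripts.sac --env_name halfcheetah-medium-v2 --num_qs 10 --norm_input --load_config_type 'benchmark' --exp_prefix RORL
```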
To train the SAC and EDAC baselines:

```bash
python -m scripts.sac --env_name [ENVIRONMENT] --num_qs 10 --norm_input --exp_prefix SAC
python -m scripts.sac --env_name [ENVIRONMENT] --num_qs 10 --eta 1 --norm_input --exp_prefix EDAC
```
To evaluate a trained agent without further training:

```bash
python -m scripts.sac --env_name [ENVIRONMENT] --num_qs 10 --norm_input --eval_no_training --load_path [model path] --exp_prefix eval_RORL
```

`model path`: e.g., `~/offline_itr_3000.pt`.
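For instance, to evaluate a saved checkpoint (the dataset name and checkpoint path are placeholders):

```bash
python -m scripts.sac --env_name halfcheetah-medium-v2 --num_qs 10 --norm_input --eval_no_training \
    --load_path ~/offline_itr_3000.pt --exp_prefix eval_RORL
```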
To evaluate a trained agent under observation attacks:

```bash
python -m scripts.sac --env_name [ENVIRONMENT] --num_qs 10 --norm_input --eval_no_training --load_path [model path] --eval_attack --eval_attack_mode [mode] --eval_attack_eps [epsilon] --exp_prefix eval_RORL
```

`mode`: one of `random`, `action_diff`, `min_Q`, `action_diff_mixed_order`, `min_Q_mixed_order`.
`epsilon`: the perturbation scale, in the range [0.0, 0.3].
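For example, to evaluate a checkpoint under the `min_Q` attack with a moderate perturbation scale (dataset name and checkpoint path are placeholders):

```bash
python -m scripts.sac --env_name halfcheetah-medium-v2 --num_qs 10 --norm_input --eval_no_training \
    --load_path ~/offline_itr_3000.pt --eval_attack --eval_attack_mode min_Q --eval_attack_eps 0.1 \
    --exp_prefix eval_RORL
```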
To evaluate under attacks generated with a separately loaded set of Q functions:

```bash
python -m scripts.sac --env_name [ENVIRONMENT] --num_qs 10 --norm_input --eval_no_training --load_path [model path] --eval_attack --eval_attack_mode [mode] --eval_attack_eps [epsilon] --load_Qs [Qs path] --exp_prefix eval_RORL
```

`Qs path`: the path to the attacker's Q functions, which can differ from the evaluated agent's Q functions.
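For example, with a separately trained attacker ensemble (the `--load_Qs` path below is hypothetical):

```bash
python -m scripts.sac --env_name halfcheetah-medium-v2 --num_qs 10 --norm_input --eval_no_training \
    --load_path ~/offline_itr_3000.pt --eval_attack --eval_attack_mode min_Q --eval_attack_eps 0.1 \
    --load_Qs ~/attacker_Qs.pt --exp_prefix eval_RORL
```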
According to the ablation study results in Appendix C, we summarize below some tips for adapting RORL to customized use.
- Hyper-parameter Tuning: Since RORL tackles a challenging problem, it has many hyper-parameters. Our first suggestion is to start from the hyper-parameter search ranges in Appendix B.1 and tune them according to the importance of each component; the general order of importance is: OOD loss > policy smoothing loss > Q smoothing loss.
- Computation Cost: To reduce GPU memory usage and training time, you can (1) set $\beta_Q = 0$ and $\epsilon_Q = 0$, because the Q smoothing loss contributes the least while incurring a large computational cost, and (2) use a small number $n$ of sampled perturbed states to reduce GPU memory usage (see the illustrative sketch after this list).
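To make the second tip concrete, here is a minimal PyTorch-style sketch (not the repository's exact implementation; the names `QNet`, `q_smoothing_loss`, `eps_Q`, `beta_Q`, and `n` are illustrative) showing where the number of sampled perturbed states $n$ and the coefficients $\beta_Q$, $\epsilon_Q$ enter the Q smoothing term and drive the extra compute and memory:

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Tiny Q(s, a) network, used only for illustration."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

def q_smoothing_loss(q_net, states, actions, eps_Q=0.005, beta_Q=1e-4, n=10):
    """Penalize changes of Q under small state perturbations (illustrative only).

    Setting beta_Q = 0 or eps_Q = 0 disables the term, removing the
    n-times-larger forward pass; a smaller n shrinks the extra memory.
    """
    if beta_Q == 0.0 or eps_Q == 0.0:
        return states.new_zeros(())
    B, S = states.shape
    # n uniform perturbations per state inside an L_inf ball of radius eps_Q
    noise = (torch.rand(B, n, S, device=states.device) * 2.0 - 1.0) * eps_Q
    perturbed = (states.unsqueeze(1) + noise).reshape(B * n, S)          # (B*n, S)
    actions_rep = actions.unsqueeze(1).expand(-1, n, -1).reshape(B * n, -1)
    q_clean = q_net(states, actions)                                     # (B, 1)
    q_pert = q_net(perturbed, actions_rep).reshape(B, n, 1)              # (B, n, 1)
    return beta_Q * (q_pert - q_clean.unsqueeze(1)).pow(2).mean()

# Usage on random data (state_dim=17, action_dim=6, batch of 32)
s, a = torch.randn(32, 17), torch.randn(32, 6)
print(q_smoothing_loss(QNet(17, 6), s, a))
```

With $\beta_Q = \epsilon_Q = 0$ the perturbed forward pass is skipped entirely, and a smaller $n$ shrinks the effective batch size of that pass proportionally.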
If you find RORL helpful for your work, please cite:
```bibtex
@inproceedings{yang2022rorl,
  title={RORL: Robust Offline Reinforcement Learning via Conservative Smoothing},
  author={Yang, Rui and Bai, Chenjia and Ma, Xiaoteng and Wang, Zhaoran and Zhang, Chongjie and Han, Lei},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```
Update (2022.11.26): fixed the Q smoothing loss.