[Project Page] [Paper] [Models] [Processed Dataset] [Raw GoPro Videos]
Fanqi Lin1,2,3*, Yingdong Hu1,2,3*, Pingyue Sheng1, Chuan Wen1,2,3, Jiacheng You1, Yang Gao1,2,3
1Tsinghua University, 2Shanghai Qi Zhi Institute, 3Shanghai Artificial Intelligence Laboratory
* indicates equal contributions
See the UMI repository for installation.
We release data for all four of our tasks: pour water, arrange mouse, fold towel, and unplug charger. You can view or download all raw GoPro videos from this link, and generate the dataset for training by running:
```bash
bash run_slam.sh && bash run_generate_dataset.sh
```
Alternatively, we provide the processed dataset here, ready for direct use in training.
You can visualize the dataset with a simple script:
```bash
python visualize_dataset.py
```
For the hardware setup, please refer to the UMI repo (note: we remove the mirror from the gripper, see link).
For each task, we release a policy trained on data collected from 32 unique environment-object pairs, with 50 demonstrations per environment. These policies generalize well to new environments and new objects. You can download them from this link and run real-world evaluation using:
```bash
bash eval.sh
```
The `temporal_agg` parameter in `eval.sh` enables the temporal ensemble strategy described in our paper, which produces smoother robot actions. Additionally, you can use the `-j` parameter to reset the robot arm to a fixed initial position (make sure that the initial joint configuration specified in `example/eval_robots_config.yaml` is safe for your robot!).
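As an illustration only, an evaluation run with both options enabled might look like the sketch below; the actual entry point, checkpoint argument, and remaining flags are defined in `eval.sh`, so mirror that script rather than copying this verbatim:

```bash
# Hypothetical invocation: only temporal_agg, -j, and example/eval_robots_config.yaml
# are mentioned in this README; the script name and other arguments are placeholders.
python eval.py \
    --ckpt_path <path_to_policy_checkpoint> \
    --robot_config example/eval_robots_config.yaml \
    --temporal_agg \
    -j
```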
After downloading the processed dataset, you can train a policy by running:
```bash
cd train_scripts && bash <task_name>.sh
```
For multi-GPU training, configure your setup with `accelerate config`, then replace `python` with `accelerate launch` in the `<task_name>.sh` script. Additionally, you can speed up training without sacrificing policy performance by adding the `--mixed_precision 'bf16'` argument.
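For example, assuming a `<task_name>.sh` script launches training with a plain `python` command (the entry point and config name below are placeholders, not necessarily the repo's actual ones), the launch line would change roughly as follows:

```bash
# Before (single GPU) — illustrative command:
#   python train.py --config-name=<task_config>
# After (multi-GPU, bf16 mixed precision via Hugging Face Accelerate):
accelerate launch --mixed_precision 'bf16' train.py --config-name=<task_config>
```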
Note that for the `pour_water` and `unplug_charger` tasks, we incorporate an additional step of historical observation for policy training and inference.
The current parameters in the `<task_name>.sh` scripts correspond to our released models, but you can customize training (a sketch of these overrides appears after the list):
- Use `policy.obs_encoder.model_name` to specify the type of vision encoder for the diffusion policy. Other options include `vit_base_patch14_dinov2.lvd142m` (DINOv2 ViT-Base) and `vit_large_patch14_clip_224.openai` (CLIP ViT-Large).
- To adjust the number of training environment-object pairs (up to a maximum of 32), modify `task.dataset.dataset_idx`. You can change the proportion of demonstrations used by adjusting `task.dataset.use_ratio` within the range (0, 1].
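For instance, a customized run might append these overrides to the training command inside `<task_name>.sh`. In the sketch below, the `train.py` entry point and the concrete values are assumptions; only the override keys come from this README:

```bash
# Illustrative only: check the released <task_name>.sh for the real entry point and arguments.
python train.py \
    policy.obs_encoder.model_name=vit_large_patch14_clip_224.openai \
    task.dataset.use_ratio=0.5
# task.dataset.dataset_idx selects which of the 32 environment-object pairs are used;
# see the task config in this repo for its expected value format.
```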
Training policies on data from different environment-object pairs, using 100% of the demonstrations, generates scaling curves similar to the following:
The curve (third column) shows that the policy’s ability to generalize to new environments and objects scales approximately as a power law with the number of training environment-object pairs.
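In generic form (the exact functional form and fitted coefficients are reported in the paper), such a power-law relationship between generalization performance $Y$ and the number of training environment-object pairs $X$ can be written as

$$
Y \approx \alpha \, X^{\beta},
$$

which appears as a straight line on a log-log plot; $\alpha$ and $\beta$ are task-dependent constants.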
We thank the authors of UMI for sharing their codebase.