
A Generalizable and Slim Affordance Reasoning Framework for Task-oriented Manipulation

This repository is the official implementation of Afford-X:

A Generalizable and Slim Affordance Reasoning Framework for Task-oriented Manipulation
Xiaomeng Zhu, Yuyang Li, Leiyao Cui, Pengfei Li, Huan-ang Gao, Yixin Zhu, Hao Zhao

Introduction

Object affordance reasoning, the ability to infer object functionalities based on physical properties, is fundamental for task-oriented planning and activities in both humans and AI. This capability, required for planning and executing daily activities in a task-oriented manner, relies on commonsense knowledge of object physics and functionalities, extending beyond simple object recognition. Current computational models for affordance reasoning from perception lack generalizability, limiting their applicability in novel scenarios. Meanwhile, comprehensive large language models (LLMs) with emerging reasoning capabilities are challenging to deploy on local devices for task-oriented manipulations. Here, we introduce LVIS-Aff, a large-scale dataset comprising 1,496 tasks and 897k images, designed to enhance the generalizability of affordance reasoning from perception. Utilizing this dataset, we develop Afford-X, an end-to-end trainable affordance reasoning model that incorporates Verb Attention and Bi-Fusion modules to improve multi-modal understanding. This model achieves up to a 25.5% performance improvement on unseen categories and tasks, while maintaining a compact 187M parameter size and inferring nearly 50 times faster than the GPT-4V API. Our work demonstrates the potential for efficient, generalizable affordance reasoning models that can be deployed on local devices for task-oriented manipulations. We showcase Afford-X's effectiveness in enabling task-oriented manipulations for robots across various tasks and environments, underscoring its efficiency and broad implications for advancing robotics and AI systems in real-world applications.

This repository is a PyTorch implementation.

Datasets

Please follow the instructions on the official website to download the COCO-Tasks dataset.

You can organize the 'data' folder as follows:

data/
  ├── id2name.json
  ├── images/
  │    ├── train2014/
  │    └── val2014/
  └── coco-tasks/
       └── annotations/
            ├── task_1_train.json
            ├── task_1_test.json
            ...
            ├── task_14_train.json
            └── task_14_test.json

Then set the arguments coco_path, refexp_ann_path, and catid2name_path in configs/tdod.json to the paths of data/images/, data/coco-tasks/annotations/, and data/id2name.json, respectively.
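For convenience, the sketch below shows one way to write these three paths into configs/tdod.json programmatically; it assumes the config file is plain JSON and that its keys match the argument names above. Editing the file by hand works just as well.

import json
from pathlib import Path

config_path = Path("configs/tdod.json")
config = json.loads(config_path.read_text())

# Point the dataset arguments at the layout shown above.
config["coco_path"] = str(Path("data/images").resolve())
config["refexp_ann_path"] = str(Path("data/coco-tasks/annotations").resolve())
config["catid2name_path"] = str(Path("data/id2name.json").resolve())

config_path.write_text(json.dumps(config, indent=2))
print("Updated", config_path)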

Installation

Make sure that you have all dependencies in place. The simplest way to do so is to use Anaconda.

Create a new conda environment and activate it:

conda create --name AffordX python=3.8
conda activate AffordX

Install the packages listed in requirements.txt:

pip install -r requirements.txt
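As a quick sanity check (assuming requirements.txt installs PyTorch, which this implementation depends on), you can confirm that the environment sees your GPUs before launching distributed training:

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:", torch.cuda.device_count())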

Running

1. Plain detection

Training

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=6 --use_env main.py \
--dataset_config configs/tdod.json \
--train_batch_size 6  \
--valid_batch_size 8  \
--load /path/to/pretrained_resnet101_checkpoint.pth  \
--ema --text_encoder_lr 1e-5 --lr 5e-5 \
--num_workers 5 \
--output-dir 'logs/test' \
--eval_skip 1

To leverage the pre-trained noun referring expression comprehension model, download the checkpoint from here (provided by MDETR) and set --load to the path of the checkpoint.

Evaluation

Please change --resume to the path of the trained model to be evaluated.

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=1 --use_env main.py \
--dataset_config configs/tdod.json \
--valid_batch_size 8  \
--num_workers 5 \
--resume /path/to/checkpoint  \
--ema --eval \
--output-dir 'logs/test' \
--no_contrastive_align_loss

Verb-noun input

To train or evaluate the teacher model, which leverages privileged ground-truth knowledge by taking verb-noun expressions as text input, simply add --verb_noun_input:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=6 --use_env main.py \
--dataset_config configs/tdod.json \
--train_batch_size 6  \
--valid_batch_size 8  \
--load /path/to/pretrained_resnet101_checkpoint.pth  \
--ema --text_encoder_lr 1e-5 --lr 5e-5 \
--num_workers 5 \
--output-dir 'logs/test' \
--eval_skip 1 \
--verb_noun_input

Running without pre-training

To train Afford-X without the pre-trained noun referring expression comprehension model, omit the --load parameter and set --without_pretrain.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=6 --use_env main.py \
--dataset_config configs/tdod.json \
--train_batch_size 6  \
--valid_batch_size 8  \
--ema --text_encoder_lr 1e-5 --lr 5e-5 \
--num_workers 5 \
--output-dir 'logs/test' \
--eval_skip 1 \
--without_pretrain

For evaluation, change --resume as before and add --without_pretrain to the aforementioned evaluation command.

2. Plain segmentation

After training the detection part of Afford-X, use the following commands to train and evaluate the segmentation head of Afford-X.

Training

Please change --frozen_weights to the path of the trained detection model.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=6 --use_env main.py \
--dataset_config configs/tdod.json \
--train_batch_size 2  \
--valid_batch_size 4  \
--frozen_weights /path/to/trained/detection/checkpoint \
--mask_model smallconv \
--no_aux_loss \
--ema --text_encoder_lr 1e-5 --lr 5e-5 \
--num_workers 5 \
--output-dir 'logs/test' \
--eval_skip 1 \
--no_contrastive_align_loss

Evaluation

Please change --resume to the path of the trained model to be evaluated.

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=1 --use_env main.py \
--dataset_config configs/tdod.json \
--valid_batch_size 4  \
--num_workers 5 \
--resume /path/to/checkpoint  \
--ema --eval \
--output-dir 'logs/test' \
--mask_model smallconv \
--no_contrastive_align_loss

3. Afford-X detection with noun-pronoun distillation

Training

To train Afford-X with distillation, change --load to the path of the trained student model (taking verb-pronoun as text input) and --load_noun to the path of the trained teacher model (taking verb-noun as text input).

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=6 --use_env main.py \
--dataset_config configs/tdod.json \
--train_batch_size 3  \
--valid_batch_size 8  \
--load /path/to/pronoun/detection/checkpoint  \
--load_noun /path/to/noun/detection/checkpoint \
--ema --text_encoder_lr 1e-5 --lr 5e-5 \
--num_workers 5 \
--output-dir 'logs/test' \
--eval_skip 1 \
--distillation \
--softkd_loss \
--softkd_coef 50 \
--cluster \
--cluster_memory_size 1024 \
--cluster_num 3 \
--cluster_feature_loss 1e4

The parameters --cluster, --cluster_memory_size, --cluster_num and --cluster_feature_loss are used for Clustering Distillation. The parameters --softkd_loss and --softkd_coef are used for Preference Distillation.
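For intuition only, the snippet below is a minimal, generic sketch of a coefficient-weighted soft knowledge-distillation term of the kind that --softkd_loss and --softkd_coef control; it is not the repository's actual implementation (see the training code for the real Preference and Clustering Distillation losses), and the temperature argument is purely illustrative.

import torch
import torch.nn.functional as F

def soft_kd_loss(student_logits, teacher_logits, coef=50.0, temperature=1.0):
    # KL divergence between softened student and teacher distributions,
    # scaled by a weighting coefficient (in the spirit of --softkd_coef).
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return coef * F.kl_div(s, t, reduction="batchmean") * temperature ** 2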

Evaluation

Please change --resume to the path of the trained model to be evaluated.

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=1 --use_env main.py \
--dataset_config configs/tdod.json \
--valid_batch_size 4  \
--num_workers 5 \
--resume /path/to/checkpoint  \
--ema --eval \
--output-dir 'logs/test' \
--cluster \
--cluster_memory_size 1024 \
--cluster_num 3 \
--no_contrastive_align_loss \
--distillation

The parameters --cluster_memory_size and --cluster_num should be consistent with the training setting.

4. Afford-X segmentation with noun-pronoun distillation

Training

Please change --frozen_weights to the path of the detection model trained with distillation.

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=6 --use_env main.py \
--dataset_config configs/tdod.json \
--train_batch_size 2  \
--valid_batch_size 4  \
--frozen_weights /path/to/trained/detection/with/distillation/checkpoint \
--mask_model smallconv \
--no_aux_loss \
--ema --text_encoder_lr 1e-5 --lr 5e-5 \
--num_workers 5 \
--output-dir 'logs/test' \
--eval_skip 1 \
--cluster \
--cluster_memory_size 1024 \
--cluster_num 3 \
--no_contrastive_align_loss

Evaluation

Please change --resume to the path of the trained model to be evaluated.

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port=23456 --nproc_per_node=1 --use_env main.py \
--dataset_config configs/tdod.json \
--valid_batch_size 4  \
--num_workers 5 \
--resume /path/to/checkpoint  \
--ema --eval \
--output-dir 'logs/test' \
--cluster \
--cluster_memory_size 1024 \
--cluster_num 3 \
--mask_model smallconv \
--no_contrastive_align_loss

License

Afford-X is released under the MIT License.

Acknowledgment

We would like to thank the authors of COCO-Tasks, Microsoft COCO, GGNN, MDETR, DETR, Detectron2, and TOIST for their open-source data and code.
