This repository contains the code for the paper ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension (ACL 2022).
This code has been tested on Ubuntu 18.04. We recommend creating a new environment with Python 3.6+ to install the appropriate versions of dependencies for this project. First, install pytorch, torchvision, and cudatoolkit following the instructions at https://pytorch.org/get-started/locally/. Then run pip install -r requirements.txt. Download the ALBEF pre-trained checkpoint and place it at the path albef/checkpoint.pth.
Download the preprocessed data files and extract them:
gsutil cp gs://reclip-sanjays/reclip_data.tar.gz .
tar -xvzf reclip_data.tar.gz
This data does not include images.
Download the images for RefCOCO/g/+ from http://images.cocodataset.org/zips/train2014.zip. Download the images for RefGTA from the original dataset release. NOTE: As stated in the original RefGTA dataset release, the images in RefGTA may only be used "in non-commercial and research uses."
The following format can be used to run experiments:
python main.py --input_file INPUT_FILE --image_root IMAGE_ROOT --method {parse/baseline/gradcam/random} --gradcam_alpha 0.5 0.5 --box_method_aggregator sum {--clip_model RN50x16,ViT-B/32} {--albef_path albef --albef_mode itm/itc --albef_block_num 8/11} {--mdetr mdetr_efficientnetB3/mdetr_efficientnetB3_refcocoplus/mdetr_efficientnetB3_refcocog} {--box_representation_method crop,blur/crop/blur/shade} {--detector_file PATH_TO_DETECTOR_FILE} {--cache_path PATH_TO_CACHE_DIRECTORY} {--output_file PATH_TO_OUTPUT_FILE}
(/ is used above to denote different options for a given argument.)
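As a rough illustration of the format above, the following sketch assembles one concrete command as an argument list. The helper build_reclip_command and the placeholder paths are hypothetical and not part of this repository; only flags that appear in the format above are used.

```python
import shlex

# Method names listed in the command format above.
METHODS = ("parse", "baseline", "gradcam", "random")


def build_reclip_command(input_file, image_root, method="parse",
                         clip_model="RN50x16,ViT-B/32",
                         box_representation_method="crop,blur",
                         detector_file=None, output_file=None):
    """Assemble an argument list for main.py (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError("unknown method: {}".format(method))
    cmd = [
        "python", "main.py",
        "--input_file", input_file,
        "--image_root", image_root,
        "--method", method,
        "--clip_model", clip_model,
        "--box_representation_method", box_representation_method,
    ]
    # Optional arguments are appended only when a value is supplied.
    if detector_file is not None:
        cmd += ["--detector_file", detector_file]
    if output_file is not None:
        cmd += ["--output_file", output_file]
    return cmd


# Placeholder paths; substitute your own files and directories.
cmd = build_reclip_command("INPUT_FILE.jsonl", "IMAGE_ROOT", method="baseline")
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root, which avoids shell quoting problems with values such as ViT-B/32.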
--input_file: should be in .jsonl format (we provide these files for the datasets discussed in our paper; see the Data Download information above).
--image_root: the top-level directory containing all images in the dataset. For RefCOCO/g/+, this is the train2014 directory. For RefGTA, this directory contains three subdirectories called black_wearing, dont_specify, and white_wearing.
--detector_file: if not specified, ground-truth proposals are used. For RefCOCO/g/+, the detection files are in reclip_data.tar.gz and have the format {refcoco/refcocog/refcoco+}_dets_dict.json. For RefGTA, the detections are in reclip_data.tar.gz and have the format refgta_{val/test}_{gt/unidet_dt/unidet_all_dt}_output2.json.
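The detector file naming patterns above can be expanded as in the following sketch. The function detector_file_name is a hypothetical helper, not part of this repository, and it returns only the file name; the directory layout inside reclip_data.tar.gz is not assumed here.

```python
# Dataset, split, and detection-type names taken from the patterns above.
REFCOCO_DATASETS = ("refcoco", "refcocog", "refcoco+")
REFGTA_SPLITS = ("val", "test")
REFGTA_KINDS = ("gt", "unidet_dt", "unidet_all_dt")


def detector_file_name(dataset, split=None, kind=None):
    """Return the detection file name for a dataset (hypothetical helper)."""
    if dataset in REFCOCO_DATASETS:
        # RefCOCO/g/+ use one detection file per dataset.
        return "{}_dets_dict.json".format(dataset)
    if dataset == "refgta":
        # RefGTA files are additionally keyed by split and detection type.
        if split not in REFGTA_SPLITS:
            raise ValueError("split must be one of {}".format(REFGTA_SPLITS))
        if kind not in REFGTA_KINDS:
            raise ValueError("kind must be one of {}".format(REFGTA_KINDS))
        return "refgta_{}_{}_output2.json".format(split, kind)
    raise ValueError("unknown dataset: {}".format(dataset))


print(detector_file_name("refcoco+"))
print(detector_file_name("refgta", split="val", kind="unidet_dt"))
```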
For ALBEF, we use ALBEF block num 8 for ITM (following the ALBEF paper) and block num 11 for ITC. Note that several arguments are only required for a particular "method," but they can still be included in the command when using a different method.
Choices for method: "parse" is the full version of ReCLIP that includes isolated proposal scoring and the heuristic-based relation handling system. "baseline" is the version of ReCLIP using only isolated proposal scoring. "gradcam" uses GradCAM, and "random" selects one of the proposals uniformly at random. (default: "parse")
Choices for clip_model: The choices are the same as the model names used in the CLIP repository, except that multiple model names can be concatenated with a comma between consecutive names. (default: "RN50x16,ViT-B/32")
Choices for box_representation_method: This argument dictates which of the following methods is used to score proposals: CPT-adapted, cropping, blurring, or some combination of these. For CPT-adapted, choose "shade". To use more than one method, concatenate them with a comma between consecutive methods. (default: "crop,blur")
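Both clip_model and box_representation_method accept comma-concatenated values. The sketch below shows one way such a value could be split and validated; split_choices is a hypothetical helper, and this is not necessarily how main.py parses these arguments.

```python
# Box representation methods named in the description above.
BOX_METHODS = {"crop", "blur", "shade"}


def split_choices(value, allowed=None):
    """Split a comma-concatenated argument into a list of names (hypothetical helper)."""
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise ValueError("expected at least one name")
    if allowed is not None:
        unknown = [name for name in names if name not in allowed]
        if unknown:
            raise ValueError("unknown choices: {}".format(unknown))
    return names


print(split_choices("RN50x16,ViT-B/32"))        # ['RN50x16', 'ViT-B/32']
print(split_choices("crop,blur", BOX_METHODS))  # ['crop', 'blur']
```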
For explanations of the other arguments, see the main.py file.
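To illustrate what combining box representation methods with --box_method_aggregator sum might look like, and what the "random" method does, here is a minimal sketch. It assumes each method yields one score per proposal and that the aggregator adds these scores elementwise; the function names and the example scores are hypothetical, and the actual implementation in main.py may differ.

```python
import random


def aggregate_sum(scores_by_method):
    """Sum per-proposal scores across methods and return (best index, totals).

    scores_by_method maps a method name (e.g. "crop") to a list holding one
    score per proposal. This is an assumed interface, not the repository's.
    """
    per_method = list(scores_by_method.values())
    if not per_method:
        raise ValueError("no scores given")
    num_proposals = len(per_method[0])
    if any(len(scores) != num_proposals for scores in per_method):
        raise ValueError("every method must score the same proposals")
    totals = [sum(scores[i] for scores in per_method)
              for i in range(num_proposals)]
    best = max(range(num_proposals), key=totals.__getitem__)
    return best, totals


def pick_random(num_proposals, rng=random):
    """Select a proposal index uniformly at random, as the "random" method does."""
    return rng.randrange(num_proposals)


# Made-up scores for three proposals under two methods.
best, totals = aggregate_sum({"crop": [0.25, 0.5, 0.25],
                              "blur": [0.25, 0.5, 0.25]})
print(best, totals)
print(pick_random(3))
```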
We recommend creating a new environment for UNITER experiments. See UNITER/requirements.txt for the dependencies/versions that we used for these experiments. Note that the packages on the commented-out lines should still be installed, but it may be easier to install them separately rather than installing all packages at once via pip. In particular, we recommend first following the instructions at https://pytorch.org/get-started/locally to install pytorch, torchvision, and cudatoolkit. Then clone https://github.com/NVIDIA/apex and follow the instructions within that repository to install apex. Then install horovod via pip install horovod. Finally, run pip install -r requirements.txt.
Download the pre-trained UNITER model from https://acvrpublicycchen.blob.core.windows.net/uniter/pretrained/uniter-large.pt and place it inside UNITER/downloads/pretrained/. To train a model on RefCOCO+, edit UNITER/configs/train-refcoco+-large-1gpu.json to have the correct data paths and desired output path. The necessary data files are provided in reclip_data.tar.gz. Run the following command within the UNITER/ directory to train the model:
python train_re.py --config config/train-refcoco+-large-1gpu.json --output_dir OUTPUT_DIR --simple_format
where OUTPUT_DIR is the desired output directory. (Training on RefCOCOg can be done in a similar manner.) Alternatively, you can download our UNITER models trained on RefCOCO+/RefCOCOg:
gsutil cp gs://reclip-sanjays/uniter_large_refcoco+_py10100feats.tar.gz .
gsutil cp gs://reclip-sanjays/uniter_large_refcocog_py10100feats.tar.gz .
To evaluate, run bash scripts/eval_{refcoco+/refcocog/refgta}.sh OUTPUT_DIR. As with training, you will probably need to modify the data paths in eval_{refcoco+/refcocog/refgta}.sh.
To obtain the accuracies for the relations task on synthetic CLEVR-like images (Section 3.2 in our paper), download and extract the data:
gsutil cp gs://reclip-sanjays/clevr-dataset-gen.tar.gz .
tar -xvzf clevr-dataset-gen.tar.gz
Then run the following command to obtain results on the spatial text pair task using the CLIP RN50x16 model:
python generic_clip_pairs.py --input_file clevr-dataset-gen/spatial_2obj_text_pairs.json --image_root clevr-dataset-gen/output/images --gpu 0 --clip_model RN50x16
Results for the spatial image pair and non-spatial image/text pair tasks can be obtained by replacing the JSON file name appropriately, and results for the other CLIP models can be obtained by replacing "RN50x16" with the appropriate model name. Results for the ALBEF model can be obtained by specifying the ALBEF path (which should be "albef"). To obtain results with ALBEF ITC, add the --albef_itc flag.
We used UniDet to detect objects for RefGTA. We provide the outputs in reclip_data.tar.gz, but if you would like to run the pipeline yourself, you can clone UniDet (https://github.com/xingyizhou/UniDet) and use our script in UniDet/extract_boxes.py on the outputs to obtain the desired detections.
We provide input features for UNITER in reclip_data.tar.gz, but if you would like to run the feature extraction yourself, you can clone py-bottom-up-attention (https://github.com/airsplay/py-bottom-up-attention) and use our script in py-bottom-up-attention/extract_features.py to obtain the features for ground-truth/detected proposals. You should compile the repository (following the directions given in the repository) before running the script.
The code in the albef directory is taken from the ALBEF repository. The code in clip_mm_explain is taken from https://github.com/hila-chefer/Transformer-MM-Explainability. The code in UNITER is a slightly modified version of https://github.com/ChenRocks/UNITER. The script py-bottom-up-attention/extract_features.py is adapted from code in https://github.com/airsplay/py-bottom-up-attention. The file clevr-dataset-gen/bounding_box.py is adapted from https://github.com/larchen/clevr-vqa/blob/master/bounding_box.py.
If you find this repository useful, please cite our paper:
@inproceedings{subramanian-etal-2022-reclip,
title = "ReCLIP: A Strong Zero-shot Baseline for Referring Expression Comprehension",
author = "Subramanian, Sanjay and
Merrill, Will and
Darrell, Trevor and
Gardner, Matt and
Singh, Sameer and
Rohrbach, Anna",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics"
}