arXiv | code & data | website | baselines
We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We annotate all statements with executable Python programs representing their meaning to enable exact reward computation in every possible world state.
Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty.
We experiment on lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem.
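As a rough, hypothetical illustration of the annotation scheme (the statement, state representation, program, and reward mapping below are illustrative sketches, not the actual lilGym annotation format):

statement = "There is a tower with exactly two blue blocks."

def reward_program(state):
    # state is assumed here to be a list of towers, each a list of block colors
    return any(sum(color == "blue" for color in tower) == 2 for tower in state)

state = [["blue", "blue", "red"], ["yellow"]]
reward = 1.0 if reward_program(state) else -1.0  # illustrative reward mapping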
TowerScratch (left), TowerFlipIt (right)
ScatterScratch (left), ScatterFlipIt (right)
The data and details can be found in lilgym/data/.
A description can be found in lilGym: Natural Language Visual Reasoning with Reinforcement Learning. The data is based on the Cornell Natural Language Visual Reasoning (NLVR) corpus v1.0 (Suhr et al. 2017).
Notes:
- The codebase has been tested with Python 3.7/3.8, PyTorch 1.12.1+cu102, and CUDA 11.2
- Work on compatibility with higher versions is ongoing
- Create a conda environment
conda create -n lilgym python=3.7
conda activate lilgym
Install PyTorch:
pip install torch==1.12.1+cu102 torchvision==0.13.1+cu102 --extra-index-url https://download.pytorch.org/whl/cu102
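To verify the PyTorch install (an optional sanity check; the printed version and CUDA availability depend on your setup):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"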
Note:
- To use conda with Python 3.7 on Apple Silicon, you may check: link
- Clone the repo:
git clone https://github.com/lil-lab/lilgym.git
- Install the dependencies:
cd lilgym
pip install -r requirements.txt
Note: the environment has been updated to work with Gymnasium (formerly Gym).
To install the package from source:
cd lilgym
pip install .
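A quick optional check that the package is importable after installation:
python -c "import lilgym"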
The environments follow the standard Gym API. Below is a short demo script:
import gymnasium as gym

env = gym.make("TowerScratch-v0", split="train", stop_forcing=False, disable_env_checker=True)
env.seed(1)
observation, info = env.reset()

for _ in range(100):
    action = env.action_space.sample()
    observation, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        observation, info = env.reset()
Note: disable_env_checker comes with Gymnasium (the new Gym) and can be set to False if needed.
Configurations
There are four configurations: TowerScratch, TowerFlipIt, ScatterScratch, and ScatterFlipIt. Examples:
env = gym.make("TowerFlipIt-v0", split="train", stop_forcing=False)
env = gym.make("ScatterScratch-v0", split="dev", stop_forcing=False)
env = gym.make("ScatterFlipIt-v0", split="test", stop_forcing=False)
Data splits
There are three data splits for each configuration: train, dev, and test.
Stop forcing
stop_forcing specifies whether to use the algorithm with stop forcing at training time. Inference is always done without stop forcing.
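For example (an illustrative sketch; stop_forcing only changes training-time behavior, as noted above):

env = gym.make("TowerScratch-v0", split="train", stop_forcing=True)  # train with stop forcing
env = gym.make("TowerScratch-v0", split="dev", stop_forcing=False)   # evaluate without stop forcing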
Data reading
There are two ways to load data:
- Using the argument split, as above
- Using the argument data. An example:
import gymnasium as gym
from lilgym.data.utils import get_data
data = get_data('tower', 'scratch', 'train')
env = gym.make("TowerScratch-v0", data=data, stop_forcing=True)
More details about the environment can be found in lilgym/envs/README.md.
The baselines, with the training and inference code, will also be released soon.
License: MIT
@inproceedings{wu-etal-2023-lilgym,
    title = "lil{G}ym: Natural Language Visual Reasoning with Reinforcement Learning",
    author = "Wu, Anne and
      Brantley, Kiante and
      Kojima, Noriyuki and
      Artzi, Yoav",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.512",
    pages = "9214--9234",
}
This research was supported by ARO W911NF21-1-0106, NSF under grant No. 1750499, a gift from Open Philanthropy, and NSF under grant No. 2127309 to the Computing Research Association for the CIFellows Project. Results presented in this paper were obtained using CloudBank, which is supported by the National Science Foundation under award No. 1925001. We thank Alane Suhr, Ge Gao, Justin Chiu, Woojeong Kim, Jack Morris, Jacob Sharf and the Cornell NLP Group for support, comments and helpful discussions.
Anne Wu (annewu@cs.cornell.edu)