Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs

This repository contains code and technical details for the paper:

Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs (website)

Authors: Angelos Mavrogiannis, Dehao Yuan, Yiannis Aloimonos

Please cite our work if you found it useful:

@misc{mavrogiannis2024discoveringobjectattributesprompting,
      title={Discovering Object Attributes by Prompting Large Language Models with Perception-Action APIs}, 
      author={Angelos Mavrogiannis and Dehao Yuan and Yiannis Aloimonos},
      year={2024},
      eprint={2409.15505},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2409.15505},
}

[Cover image]

Overview

Pipeline

We describe our end-to-end framework for embodied attribute detection. The LLM receives as input a perception API backed by LLMs and VLMs, an action API built on a Robot Control API, a natural language (NL) instruction from a user, and a visual scene observation. It then produces a Python program that combines LLM and VLM function calls with robot actions to actively reason about attribute detection.
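For concreteness, here is a minimal sketch of the program-generation step, assuming an OpenAI-style chat client and a hypothetical `load_api_docstrings()` helper that concatenates the $\mathtt{ImagePatch}$ and $\mathtt{Robot}$ docstrings; the exact prompt format and model used in the paper may differ.

```python
# Sketch of prompt assembly and program generation (illustrative only).
from openai import OpenAI

client = OpenAI()

def generate_program(instruction: str, scene_description: str) -> str:
    """Ask the LLM to write a Python program that calls the
    perception-action API to answer the user's instruction."""
    api_docs = load_api_docstrings()  # hypothetical helper: ImagePatch + Robot docstrings
    prompt = (
        f"{api_docs}\n\n"
        "Write a Python program using only the API above to answer:\n"
        f"Instruction: {instruction}\n"
        f"Scene: {scene_description}\n"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```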

Perception-Action API

API

The API consists of an $\mathtt{ImagePatch}$ perception class and a $\mathtt{Robot}$ action class, with methods and examples of their use in the form of docstrings. Inspired by ViperGPT, $\mathtt{ImagePatch}$ supports Open-Vocabulary object Detection (OVD), Visual Question Answering (VQA), and answering textual queries through the $\mathtt{find}$, $\mathtt{visual\_query}$, and $\mathtt{language\_query}$ functions, respectively. The $\mathtt{Robot}$ class has a list of sensors as a member variable and a set of functions to focus on the center of an image ($\mathtt{focus\_on\_patch}$), measure weight ($\mathtt{measure\_weight}$) and distance ($\mathtt{measure\_distance}$), navigate to an object ($\mathtt{go\_to\_coords}$ and $\mathtt{go\_to\_object}$), or pick it up ($\mathtt{pick\_up}$) and place it ($\mathtt{put\_on}$). The input prompt includes the API with guidelines on how to use it, and a natural language query.
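The sketch below outlines the two classes as described above; the method signatures are illustrative assumptions rather than the exact interface shipped in this repository.

```python
# Illustrative sketch of the perception-action API surface.
class ImagePatch:
    """A crop of the current image, with VLM/LLM-backed queries."""

    def find(self, object_name: str) -> list["ImagePatch"]:
        """Open-vocabulary detection: return patches matching object_name."""
        ...

    def visual_query(self, question: str) -> str:
        """Visual question answering on this patch (VLM backbone)."""
        ...

    def language_query(self, question: str) -> str:
        """Answer a textual query (LLM backbone)."""
        ...

class Robot:
    """Action wrapper with a list of onboard sensors."""

    def __init__(self, sensors: list[str]):
        self.sensors = sensors

    def focus_on_patch(self, patch: ImagePatch) -> None: ...
    def measure_weight(self) -> float: ...
    def measure_distance(self) -> float: ...
    def go_to_coords(self, x: float, y: float) -> None: ...
    def go_to_object(self, patch: ImagePatch) -> None: ...
    def pick_up(self, patch: ImagePatch) -> None: ...
    def put_on(self, patch: ImagePatch) -> None: ...
```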

AI2-THOR Simulation

[Distance Estimation GIF] [Weight Estimation GIF]

We integrate the perception-action API in different AI2-THOR household environments. In the Distance Estimation task (left), the robot has to identify which object is closer to its camera. We use the question “which one is closer to me?” followed by the objects in question. Our API invokes an active perception behavior that computes the distance to an object by fixating on it with a call to $\mathtt{focus\_on\_patch}$. It then calls Python's built-in $\mathtt{min}$ function to find the smallest distance. In Weight Estimation (right), the invoked behavior determines the weight of an object by navigating to it ($\mathtt{go\_to\_object}$), picking it up ($\mathtt{pick\_up}$), and calling $\mathtt{measure\_weight}$, which simulates the use of a force/torque sensor mounted on the wrist of the robot arm.
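As a hedged sketch, programs of roughly the following shape could implement these two behaviors on top of the API classes sketched earlier; the function names and the way the initial image patch is obtained are assumptions, not the exact programs generated in our experiments.

```python
# Illustrative examples of generated programs for the two tasks.
def which_is_closer(robot: Robot, image: ImagePatch, object_a: str, object_b: str) -> str:
    """Answer 'which one is closer to me?' for two named objects."""
    distances = {}
    for name in (object_a, object_b):
        patch = image.find(name)[0]          # open-vocabulary detection
        robot.focus_on_patch(patch)          # fixate on the object
        distances[name] = robot.measure_distance()
    return min(distances, key=distances.get)  # object with the smallest distance

def estimate_weight(robot: Robot, image: ImagePatch, object_name: str) -> float:
    """Determine an object's weight by picking it up."""
    patch = image.find(object_name)[0]
    robot.go_to_object(patch)                # navigate to the object
    robot.pick_up(patch)                     # grasp it
    return robot.measure_weight()            # simulated wrist force/torque sensor
```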

Robot Demonstration

[Robot Demonstration video]

We integrate our framework into a RoboMaster EP robot with the $\mathtt{Robot}$ class as a wrapper over the RoboMaster SDK. The robot is connected to a (local) computer over a Wi-Fi connection and communicates with a (remote) computing cluster through a client-server architecture running over an SSH tunnel. To reduce latency, we only run OVD on the first frame captured by the robot camera and then track the corresponding position(s) with the Kanade–Lucas–Tomasi (KLT) feature tracker (GitHub implementation). We tune a lateral PID controller to align the geometric centers of the robot camera image and the object bounding box to implement the $\mathtt{focus\_on\_patch}$ function. To approach an object and implement the $\mathtt{go\_to\_object}$ function, we use an infrared distance sensor mounted on the front of the robot and tune a longitudinal PID controller. The video above shows the robot performing active distance estimation using our framework.
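A minimal sketch of the lateral alignment loop described above is shown below; the tracker output `get_bbox_center_x()`, the `drive_lateral()` velocity command, and the gains are illustrative assumptions, not the tuned values or the RoboMaster SDK calls used on the real robot.

```python
# Illustrative lateral PID loop for focus_on_patch on the physical robot.
class PID:
    def __init__(self, kp: float, ki: float, kd: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float, dt: float) -> float:
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def focus_on_patch(robot, tracker, image_width: int = 1280, dt: float = 0.05) -> None:
    """Center the KLT-tracked bounding box in the camera image (lateral axis)."""
    pid = PID(kp=0.002, ki=0.0, kd=0.0005)            # illustrative gains
    while True:
        error = image_width / 2 - tracker.get_bbox_center_x()  # pixel offset from center
        if abs(error) < 5:                            # close enough to the image center
            break
        robot.drive_lateral(pid.step(error, dt))      # hypothetical lateral velocity command
```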
