This project is currently intended for inference use only. If you want to fine-tune any of the models, you will have to research and implement the required steps yourself.
Local model support relies heavily on llama.cpp, whisper.cpp, and their derivatives where possible. The common interface of this project is written in Python, however, to also allow easy integration and testing of native PyTorch models and ML libraries.
This project currently supports requests to the public OpenAI and Anthropic API servers. To access their APIs you must have an OpenAI and/or Anthropic account with a valid payment method set.
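How this project reads the credentials is up to its own configuration, but the official client libraries pick up the keys from the `OPENAI_API_KEY` and `ANTHROPIC_API_KEY` environment variables by default. A minimal sketch of direct requests to both services with the official `openai` and `anthropic` Python libraries (not this project's commander interface; the model names are only examples) could look like this:

```python
# Minimal sketch: direct requests to the public OpenAI and Anthropic APIs using
# their official Python clients. This is NOT the robot-commander interface, only
# an illustration of the underlying calls. Both clients read their API keys from
# the OPENAI_API_KEY and ANTHROPIC_API_KEY environment variables by default.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()
anthropic_client = Anthropic()

# OpenAI chat completion (model name is an example, use any you have access to).
openai_reply = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize the robot's task in one sentence."}],
)
print(openai_reply.choices[0].message.content)

# Anthropic message (model name is an example).
anthropic_reply = anthropic_client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize the robot's task in one sentence."}],
)
print(anthropic_reply.content[0].text)
```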
The following three example scenarios showcase preliminary results. The commander always first waits for a fully generated audio response, and only then proceeds to generate a valid goal pose command for Navigation2. All examples were executed with locally hosted models. The robotic platform used was developed at the BUT Robotics and AI Group.
- Sending the robot to an unspecified location (voice prompt, speech response):
explore_onethird_resolution_amplified_speech.mp4
- After the pose contexts are updated with the new location, sending the robot back to the origin pose (voice prompt, speech response):
return_onethird_resolution_amplified_speech.mp4
- Sending the robot to the newly acquired hall pose (voice prompt, speech response):
hall_onethird_resolution_amplified_speech.mp4
- The current design works best when translating prompts with defined keywords into exact pose messages (a minimal sketch of this idea is shown after this list). Some RAG-like feedback context about the current state could be included to allow for handling more complex prompts. This is currently a work-in-progress.
- The ROS interface currently only supports generating `PoseStamped` commands. Integration of other message types and some decision-making for more complex behaviors might be useful, but it is not a high priority for now.
- Quantization support for TTS is currently not fully developed. This will have to be managed by the external project owners, unless we contribute the changes ourselves.
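As a purely illustrative sketch of the keyword-to-pose idea above (not the robot-commander implementation), a recognized keyword could be mapped onto a fixed `PoseStamped` goal for Navigation2 roughly as follows; the keyword table, frame, and `goal_pose` topic are assumptions made for the example:

```python
# Illustrative sketch of mapping prompt keywords to exact PoseStamped goals.
# This is a standalone example, not the robot-commander implementation; the
# keyword table, frame, and topic name are assumptions.
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped

# Example keyword -> (x, y, quaternion z, quaternion w) mapping.
POSE_CONTEXTS = {
    "home": (0.0, 0.0, 0.0, 1.0),
    "hall": (5.2, -1.3, 0.707, 0.707),
}


class KeywordPosePublisher(Node):
    def __init__(self):
        super().__init__("keyword_pose_publisher")
        # Navigation2 accepts PoseStamped goals on the "goal_pose" topic by default.
        self.publisher = self.create_publisher(PoseStamped, "goal_pose", 10)

    def send_keyword_goal(self, keyword: str) -> None:
        x, y, qz, qw = POSE_CONTEXTS[keyword]
        goal = PoseStamped()
        goal.header.frame_id = "map"
        goal.header.stamp = self.get_clock().now().to_msg()
        goal.pose.position.x = x
        goal.pose.position.y = y
        goal.pose.orientation.z = qz
        goal.pose.orientation.w = qw
        self.publisher.publish(goal)
        self.get_logger().info(f"Published goal for keyword '{keyword}'")


def main():
    rclpy.init()
    node = KeywordPosePublisher()
    node.send_keyword_goal("home")
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```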
- Install system dependencies:
sudo apt install ninja-build portaudio19-dev nlohmann-json3-dev
NOTE: Also make sure you have installed ROS and the following dependencies: `python3-colcon-common-extensions`, `python3-rosdep`, and `python3-vcstool`.
- Download this repo and update ROS dependencies:
git clone https://github.com/Imaniac230/robot-commander.git && . /opt/ros/<ros-distro>/setup.bash && rosdep install --from-paths robot-commander/ --ignore-src -r -y
- This project relies on the independent `robot_commander_library` package located in `library_vendor/`. The other external projects listed in `library_vendor/libraries.repos` are required for usage with local models only (if you're only going to make requests to external servers, you only need `robot_commander_library`). The `input` package is also optional, and provides a gamepad interface with support for additional interactions with the DualSense PS5 controller. To download all the external libraries use the `library_vendor_setup.sh` script (or run the included commands manually):
cd robot-commander/ && ./library_vendor_setup.sh
- Build the packages:
colcon build
NOTE: If you want to build the .cpp libraries with CUDA support, enable the `USE_CUDA` option:
colcon build --cmake-args -DUSE_CUDA=ON
NOTE: If you want to compile `llama.cpp` and `whisper.cpp` with different options than those defined in this project, please refer to their respective instructions for more detailed compilation steps.
- Source the current ROS workspace:
. install/setup.bash
This project requires all models to be downloaded and/or quantized manually before running any examples. For the most comprehensive and up-to-date instructions, always refer to the llama.cpp and whisper.cpp READMEs. Here are simplified example steps to prepare the `whisper`, `llama3`, and `bark` models for use:
Whisper
- Download the official `whisper` repo from OpenAI:
git clone https://github.com/openai/whisper.git
- Download one of the official models from OpenAI, e.g.:
wget https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt
Or download from OpenAI on Hugging Face, e.g.:
git clone https://huggingface.co/openai/whisper-large-v3
Make sure that you have `git lfs` installed:
git lfs install
- Use the `convert-pt-to-ggml.py` script from `whisper.cpp` to convert, e.g.:
python3 whisper-convert-pt-to-ggml.py <path-to-downloaded-model-file>/large-v3.pt <path-to-official-whisper-repo> <path-to-converted-file> && mv <path-to-converted-file>/ggml-model.bin <path-to-converted-file>/ggml-model-f16-large-v3.bin
To convert the model downloaded from Hugging Face, use `convert-h5-to-ggml.py` instead, e.g.:
python3 whisper-convert-h5-to-ggml.py <path-to-downloaded-model-files> <path-to-official-whisper-repo> <path-to-converted-file> && mv <path-to-converted-file>/ggml-model.bin <path-to-converted-file>/ggml-model-f16-large-v3.bin
- Quantize the converted model, e.g.:
whisper-quantize <path-to-converted-file>/ggml-model-f16-large-v3.bin <path-to-quantized-file>/ggml-model-q4_0-large-v3.bin q4_0
Llama3
- Download one of the official models from Meta on Hugging Face, e.g.:
git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
Make sure that you have `git lfs` installed:
git lfs install
- Use the `convert_hf_to_gguf.py` script from `llama.cpp` to convert, e.g.:
python3 llama-convert_hf_to_gguf.py <path-to-downloaded-model-files> --outtype f16
- Quantize the converted model, e.g.:
llama-quantize <path-to-converted-file>/ggml-model-f16.gguf <path-to-quantized-file>/ggml-model-q4_0.gguf Q4_0
Bark
- Download the official model files from SunoAI on Hugging Face, e.g.:
git clone https://huggingface.co/suno/bark
Make sure that you have `git lfs` installed:
git lfs install
NOTE: The GGML format and quantization for bark are currently only experimental. Use the full PyTorch models if you want the best results (a minimal usage sketch is shown after these steps).
- Use the `convert.py` script from `bark.cpp` to convert, e.g.:
python3 bark-convert.py --dir-model <path-to-downloaded-model-files> --use-f16 && mv <path-to-converted-file>/ggml_weights.bin <path-to-converted-file>/ggml-model-f16.bin
- Quantize the converted model, e.g.:
bark-quantize <path-to-converted-file>/ggml-model-f16.bin <path-to-quantized-file>/ggml-model-q4_0.bin q4_0
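For reference, the "full PyTorch models" path mentioned in the note above boils down to the upstream Suno `bark` Python package. A minimal sketch of generating speech with it directly (not through this project's TTS agent) could look like this:

```python
# Minimal sketch: text-to-speech with the upstream Suno "bark" Python package
# (pip install git+https://github.com/suno-ai/bark.git). This is not the
# robot-commander TTS agent, only an illustration of the raw PyTorch path.
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Downloads and caches the full PyTorch checkpoints on first use.
preload_models()

# Generate a spoken response and store it as a WAV file.
audio_array = generate_audio("The robot has reached the hall position.")
write_wav("response.wav", SAMPLE_RATE, audio_array)
```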
- Specify your commander parameters in `params/commander_params.yaml`.
- Launch the ROS commanders:
SUNO_OFFLOAD_CPU=True SUNO_USE_SMALL_MODELS=True ros2 launch robot_commander_py commanders.launch.py
If you have enough memory to hold the full Bark models, you can disable the `SUNO_USE_SMALL_MODELS` option. If you want to use CUDA support with Bark, you can disable the `SUNO_OFFLOAD_CPU` option. Please refer to the Bark README for more details.
NOTE: The local quantized Bark TTS agent server is currently only experimental. For best results, use the raw PyTorch models, which do not support a server mode and must be loaded dynamically for each request. To use the TTS server host, disable the `use_pytorch` parameter in the chat commander `text_to_speech` section.
- Start providing voice commands in natural language using a "push-to-talk" interface exposed through the `/record_prompt` topic.
NOTE: If you want to use the included gamepad interface node, use the `launch_with_gamepad:=True` argument.
- To update the pose "keyword" mappings with the current robot position, call the `update_pose_context` ROS service (a Python client sketch is shown after this list):
ros2 service call /update_pose_context robot_commander_interfaces/srv/UpdateContext "{"keyword": "home"}"
- To store the current pose "keyword" mappings into the specified file, call the `save_contexts` ROS service:
ros2 service call /save_contexts std_srvs/srv/Trigger
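For completeness, the `update_pose_context` call above can also be made from Python with a plain rclpy service client. The sketch below assumes only what the CLI example shows, i.e. an `UpdateContext` request with a string `keyword` field; check `robot_commander_interfaces` for the actual definition:

```python
# Minimal sketch of calling the update_pose_context service from Python.
# Assumes the UpdateContext request has a string "keyword" field, as implied
# by the CLI example above; see robot_commander_interfaces for the full definition.
import rclpy
from rclpy.node import Node
from robot_commander_interfaces.srv import UpdateContext


def main():
    rclpy.init()
    node = Node("pose_context_updater")
    client = node.create_client(UpdateContext, "update_pose_context")

    if not client.wait_for_service(timeout_sec=5.0):
        node.get_logger().error("update_pose_context service not available")
        rclpy.shutdown()
        return

    request = UpdateContext.Request()
    request.keyword = "home"  # map the current robot pose to the "home" keyword

    future = client.call_async(request)
    rclpy.spin_until_future_complete(node, future)
    node.get_logger().info(f"Service response: {future.result()}")
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```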
To launch the locally hosted agent servers:
- Specify your agent parameters in `params/agent_params.yaml`.
- Launch the local agent servers:
ros2 launch robot_commander_py agents.launch.py
Some other examples, such as the `voice_prompted_image.py` script, use a wrapper around the `openai` Python library. If you want to use them, you will also have to install the library:
pip install openai
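That wrapper presumably builds on the standard image-generation endpoint of the `openai` library. A minimal sketch of the direct library call (not the `voice_prompted_image.py` wrapper; the model name and size are only examples) could look like this:

```python
# Minimal sketch: generating an image from a text prompt with the plain
# "openai" Python library. This is not the voice_prompted_image.py wrapper,
# only the underlying API call it builds on. Requires OPENAI_API_KEY to be set.
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",  # example model name
    prompt="A mobile robot exploring a long university hallway",
    size="1024x1024",
    n=1,
)

# The API returns a URL to the generated image by default.
print(response.data[0].url)
```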