Welcome to the GitHub repository of Xiaohao's embodied intelligence system. Xiaohao is a humanoid robot designed by Hangzhou Shenhao Technology that serves as an exhibition hall guide. It is a multi-agent system with a large language model and a vision language model at its core, supported by robotic arms, mobility wheels, and a camera. These agents communicate via ROS messages, allowing Xiaohao to respond to user prompts, make decisions, and interact with its environment.
In this repository, you will find the open-sourced code for the large language model and the camera, as well as the ROS publisher nodes responsible for orchestrating communications among the different components.
Language Processing Nodes:
- cozy_chat.py: This node operates as a ROS subscriber to the 'wake' topic. Upon activation, it records audio, processes it using VAD (Voice Activity Detection), and converts the audio to text with the Paraformer speech recognition model. Paraformer and the other ModelScope models loaded in the initialization section are not bundled; please download them yourself from ModelScope by Alibaba. The text is then processed by the large language model (Qwen) to generate a response, which is converted back to speech with SAMBERT TTS to interact with users. This node also handles movement commands for the robot.
- vad.py: Voice activity detection function utilized in cn_chat.py.
- silero_vad.onnx: Open-source VAD model used throughout the project.
- cozy_tts.py: Text-to-speech function used in cozy_chat.py.
- llm_prompt.md: System prompt for the large language model.
- vl_prompt.md: System prompt for the vision language model.
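The flow described for cozy_chat.py (wake, record, VAD, speech recognition, LLM, TTS) can be sketched roughly as below. This is a minimal illustration and not the code in this repository: `ChatPipeline`, `on_wake`, `build_demo_pipeline`, and every stage name are hypothetical, and the stages are plain callables so the sketch runs without ROS, a microphone, or any model. In an actual ROS node, a method like `on_wake` would be registered as the callback of the subscriber on the 'wake' topic.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ChatPipeline:
    """Hypothetical stand-in for the stages that cozy_chat.py chains together."""
    record: Callable[[], bytes]            # microphone capture
    trim_speech: Callable[[bytes], bytes]  # VAD: keep only the voiced audio
    transcribe: Callable[[bytes], str]     # speech recognition (Paraformer in the real node)
    generate: Callable[[str], str]         # large language model (Qwen in the real node)
    speak: Callable[[str], None]           # text-to-speech (SAMBERT in the real node)

    def on_wake(self, _msg: object = None) -> Optional[str]:
        """Run one listen/think/speak turn; return the reply, or None if nothing was said."""
        audio = self.trim_speech(self.record())
        if not audio:
            return None  # VAD found no speech, so skip the model calls
        text = self.transcribe(audio)
        if not text.strip():
            return None  # recognizer produced no usable text
        reply = self.generate(text)
        self.speak(reply)
        return reply


def build_demo_pipeline(spoken: List[str]) -> ChatPipeline:
    """Wire the pipeline with toy stages; zero bytes play the role of silence."""
    return ChatPipeline(
        record=lambda: b"\x00\x00hello\x00",
        trim_speech=lambda audio: audio.strip(b"\x00"),
        transcribe=lambda audio: audio.decode("ascii"),
        generate=lambda text: f"You said: {text}",
        speak=spoken.append,
    )
```

Passing the stages in as callables keeps the turn logic independent of any particular model, so each stage can be swapped or stubbed out when testing on a machine without the robot hardware.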
Camera and Vision Processing:
- rs_cam.py: This node manages the Intel® RealSense™ stereo depth camera. It subscribes to the 'camera' topic and, upon receiving a command, captures color and depth images, identifies objects, and communicates with the mobility components to navigate towards them.
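To illustrate how a detected object could be turned into a navigation target, the sketch below estimates the object's depth and horizontal bearing from a depth image and a bounding box, using a simple pinhole camera model. It is an assumption about one workable approach, not a description of what rs_cam.py does: `locate_target`, the bounding-box layout, and the use of a median are all hypothetical. The parameter names `fx` and `ppx` follow the focal length and principal point fields of the RealSense camera intrinsics.

```python
import math
from statistics import median
from typing import List, Optional, Tuple


def locate_target(
    depth_m: List[List[float]],
    box: Tuple[int, int, int, int],
    fx: float,
    ppx: float,
) -> Optional[Tuple[float, float]]:
    """Return (depth in meters, bearing in radians) for a detected object.

    depth_m : depth image as rows of values in meters; 0.0 means "no reading"
    box     : (x_min, y_min, x_max, y_max) in pixels, with the max bounds exclusive
    fx, ppx : horizontal focal length and principal point, both in pixels
    """
    x0, y0, x1, y1 = box
    # Collect valid depth readings inside the box, skipping missing pixels.
    samples = [d for row in depth_m[y0:y1] for d in row[x0:x1] if d > 0.0]
    if not samples:
        return None  # no usable depth inside the box
    depth = median(samples)  # the median resists stray background pixels
    center_x = (x0 + x1 - 1) / 2.0
    # Positive bearing means the object is to the right of the optical axis.
    bearing = math.atan2(center_x - ppx, fx)
    return depth, bearing
```

A mobility component could then rotate by the bearing and drive forward until the depth falls below a chosen stopping distance; how Xiaohao actually does this is not specified here.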