Welcome to the LiveAudio repository! This project hosts applications that use advanced audio understanding and speech generation models to provide an interactive, natural chat experience, making it easier to adopt AI-driven dialogue in a variety of settings.


Clone and install

  • Clone the repo and submodules
#0  source code

apt update
# (Ubuntu / Debian User) Install sox + ffmpeg
apt install libsox-dev espeak-ng ffmpeg libopenblas-dev vim git-lfs -y

# (Ubuntu / Debian User) Install pyaudio
apt install -y build-essential \
    cmake \
    libasound-dev \
    portaudio19-dev \
    libportaudio2

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --upgrade --no-cache-dir --verbose

mkdir /asset
chmod 777 /asset/
git clone
cd /workspace/LiveAudio
git pull

# Install miniconda and create the conda environment for PyTorch/CUDA
mkdir -p ~/miniconda3
wget -O ~/miniconda3/
bash ~/miniconda3/ -b -u -p ~/miniconda3
rm -rf ~/miniconda3/
~/miniconda3/bin/conda init bash && source ~/miniconda3/bin/activate
conda config --set auto_activate_base false
conda create -n rt python=3.10  -y
conda activate rt

#2  LiveAudio
cd /workspace/LiveAudio
pip install -r requirements.txt -i

#3 xtts
cd /workspace/LiveAudio/src/xtts
pip install -e .[all,server,notebooks,bn,ja,ko,zh,languages]  -i

#4. download xtts-v2
HF_ENDPOINT= huggingface-cli download coqui/XTTS-v2  --local-dir  XTTS-v2

#5. OpenVoice v2
cd OpenVoice
pip install -e .   -i


#6. parler-tts
pip install git+
# pip install flash-attn

# Verify the GPU driver, the CUDA toolkit, and the PyTorch installation
(rt) root@ash:~/audio# nvidia-smi
(rt) root@ash:~/audio# nvcc --version
(rt) root@ash:~/audio# pip show torch
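
The same three checks can also be run from Python inside the `rt` environment. The following is a minimal sketch, not part of the repository: the function name `gpu_environment_report` is hypothetical, and it only uses the standard library plus `torch.cuda.is_available()` when PyTorch is installed.

```python
import importlib.util
import shutil


def gpu_environment_report() -> dict:
    """Collect basic facts about the GPU tooling visible to this interpreter.

    Hypothetical helper for illustration; it is not part of LiveAudio.
    """
    report = {
        # True when the NVIDIA driver utility is on PATH.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
        # True when the CUDA compiler is on PATH.
        "nvcc_on_path": shutil.which("nvcc") is not None,
        # True when PyTorch can be imported from the active environment.
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "torch_version": None,
        "cuda_available": None,
    }
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        # False here usually means a CPU-only build or a driver mismatch.
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for key, value in gpu_environment_report().items():
        print(f"{key}: {value}")
```

If `cuda_available` is reported as `False` while `nvidia-smi` works, the installed PyTorch build is a likely place to look first.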

Docker Setup

  1. Install NVIDIA Container Toolkit:

    To use the GPU for model training and inference in Docker, install the NVIDIA Container Toolkit:

    For Ubuntu users:

    # Add repository
    curl -fsSL | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
        && curl -s -L | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    # Install nvidia-container-toolkit
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit
    # Restart Docker service
    sudo systemctl restart docker

    For other Linux distributions, refer to the NVIDIA Container Toolkit install guide.

  2. Build the container image with:

    sudo docker build -t liveaudio .

    After getting your VAD token (see next sections), run:

    sudo docker volume create huggingface
    sudo docker run --gpus all -p 8765:8765 -v huggingface:/root/.cache/huggingface -e PYANNOTE_AUTH_TOKEN='VAD_TOKEN_HERE' liveaudio

    The huggingface volume caches the downloaded Hugging Face models so they are not re-downloaded each time you re-run the container. If you don't need this, use:

    sudo docker run --gpus all -p 19999:19999 -e PYANNOTE_AUTH_TOKEN='VAD_TOKEN_HERE' liveaudio

    In both commands, the port published with -p must match the port the server listens on inside the container.



OpenAI API token.

Browser microphone access requires SSL/TLS, so the server needs a certificate (.pem) file and a key file:

HF_ENDPOINT= python3 -m src.main --port 20000 --certfile cf.pem --keyfile cf.key --tts-type xtts-v2 --vad-type pyannote --vad-args '{"auth_token": "VAD_TOKEN_HERE"}' --llm-type ollama
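
Hand-quoting the JSON passed to `--vad-args` is easy to get wrong, and a token pasted into a command tends to end up in shell history or documentation. The sketch below builds the same command line from Python. It is an illustration only: `build_server_command` is a hypothetical helper, reading the token from `PYANNOTE_AUTH_TOKEN` is an assumption borrowed from the Docker section, and the `HF_ENDPOINT` prefix is left out because its value is not given here. The flags themselves are copied from the command above.

```python
import json
import os
import shlex
import sys


def build_server_command(token: str,
                         port: int = 20000,
                         certfile: str = "cf.pem",
                         keyfile: str = "cf.key") -> list[str]:
    """Return the argv list for the server launch shown above.

    Hypothetical helper; only the flags come from the README command.
    """
    return [
        sys.executable, "-m", "src.main",
        "--port", str(port),
        "--certfile", certfile,
        "--keyfile", keyfile,
        "--tts-type", "xtts-v2",
        "--vad-type", "pyannote",
        # json.dumps produces valid JSON whatever characters the token contains.
        "--vad-args", json.dumps({"auth_token": token}),
        "--llm-type", "ollama",
    ]


if __name__ == "__main__":
    # Assumption: the token is exported as PYANNOTE_AUTH_TOKEN; fall back to a placeholder.
    token = os.environ.get("PYANNOTE_AUTH_TOKEN", "VAD_TOKEN_HERE")
    # Print a shell-quoted command line rather than starting the server.
    print(shlex.join(build_server_command(token)))
```

The returned list can be passed directly to `subprocess.run` from the repository root, which avoids shell quoting altogether.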


ASR_TYPE=sensevoice python -m unittest test.server.test_server


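  1. "`GLIBCXX_3.4.32' not found" error at runtime: this symbol version ships with the libstdc++ from GCC 13.2.0, so a libstdc++ at least that new must be available at runtime.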


  1. How to clone a voice: submit the filename of a WAV file containing the source voice.

Voice cloning works best with a 22050 Hz, mono, 16-bit WAV file containing a short (~5-30 sec) sample of the target speaker's voice. The sample should be a clean recording with no background noise or music. The speaker should talk in a natural, conversational tone, and the sample should be representative of the speaker's voice, including accent, intonation, and speaking style.

  1. Coqui AI XTTS-v2 TTS architecture (high level)

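XTTS uses a VQ-VAE model to discretize audio into audio tokens. It then uses a GPT model to predict these audio tokens from the input text and the speaker latents. The speaker latents are computed by a stack of self-attention layers. The output of the GPT model is passed to a decoder model that produces the audio signal: a diffusion model converts the GPT output into spectrogram frames, and UnivNet then generates the final audio signal.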
