real-time chat #2

Open

jingli-wtbox opened this issue Nov 16, 2023 · 11 comments

@jingli-wtbox commented Nov 16, 2023

Thank you for sharing such great work. It's awesome.

Going through some of the examples, such as "Examples on Image-based Chat Persona" on the page below, it feels like a real-time chat:

[example image]

May I know if ChatAnything supports real-time chat?

Thanks.

@ermu2001 (Collaborator)

Setting up the conversation usually takes around 60 sec.

Afterwards, chatting usually takes about 6 sec to get a response from ChatGPT.

In my test on a single GPU (an RTX A5000), rendering takes around 8 sec, but SadTalker's rendering could be parallelized.

You can try running it locally and see whether it feels real-time. :)
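For reference, a rough sketch of that parallelization idea: round-robin the frames across GPUs and render the chunks concurrently. `render_chunk` here is a hypothetical stand-in for SadTalker's renderer, not the project's actual code.

```python
import torch.multiprocessing as mp

def render_chunk(rank, chunks, return_dict):
    # Hypothetical stand-in for the face renderer; in a real setup you
    # would load the SadTalker renderer on cuda:{rank} and process here.
    return_dict[rank] = [f"rendered({f}, gpu={rank})" for f in chunks[rank]]

def parallel_render(frames, n_gpus):
    # Split the frame list round-robin so each GPU gets an equal share.
    chunks = [frames[i::n_gpus] for i in range(n_gpus)]
    manager = mp.Manager()
    return_dict = manager.dict()
    mp.spawn(render_chunk, args=(chunks, return_dict), nprocs=n_gpus, join=True)
    # Re-interleave the rendered chunks back into original frame order.
    ordered = [None] * len(frames)
    for rank in range(n_gpus):
        for j, frame in enumerate(return_dict[rank]):
            ordered[rank + j * n_gpus] = frame
    return ordered

if __name__ == "__main__":
    print(parallel_render(list(range(10)), n_gpus=2))
```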

@jingli-wtbox (Author)

Thank you. I will give it a try on other types of GPU.

@puffy310

Could you theoretically just run this on 8x H100 and have it work in "real time"? Maybe a real-time conversation version of this software should be looked into.

@zhoudaquan (Owner)

> Could you theoretically just run this on 8x H100 and have it work in "real time"? Maybe a real-time conversation version of this software should be looked into.

Hi, thanks for your interest in the work! We do not have an H100 at hand right now... However, based on our observations on A100 GPUs, the total time cost excluding GPT API calls is within 10 s, and the face rendering process takes 1-2 s. We will try to replace the ChatGPT APIs for real-time chat in the coming month.

@tolecy commented Nov 24, 2023

Great project!
I replaced ChatGPT with my own small model and tested on my 3080 Ti graphics card; the timing details are as follows:

```
Face Renderer:: 100%|80/80 [00:22<00:00, 3.49it/s]
fps:25.0
OpenCV: FFMPEG: tag 0x44495658/'XVID' is not supported with codec id 12 and format 'mp4 / MP4 (MPEG-4 Part 14)'
OpenCV: FFMPEG: fallback to use tag 0x7634706d/'mp4v'
seamlessClone:: 100%|318/318 [00:15<00:00, 20.66it/s]
```
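As a side note, the two OpenCV lines are only a codec-tag warning: the writer was opened with the XVID tag inside an .mp4 container, and FFMPEG falls back to mp4v. Opening the writer with the mp4v fourcc up front avoids the fallback; a minimal sketch, where the output path and frame size are placeholder values:

```python
import cv2
import numpy as np

# Use the mp4v fourcc that FFMPEG falls back to anyway.
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("out.mp4", fourcc, 25.0, (512, 512))
writer.write(np.zeros((512, 512, 3), dtype=np.uint8))  # one black frame
writer.release()
```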

I wonder if anyone has an efficient implementation or ideas for accelerating the video generation process; I have been interested in this recently. What I want to do now is output the generated face video in sync with the voice once TTS has completed. However, because face generation is relatively slow, the streaming output is actually very jerky.

(My goal is to be as smooth as D-ID: input any image and voice, and quickly generate video or a smooth streaming output.)
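One rough sketch of that chunked-streaming idea: synthesize the TTS audio first, then render and emit short clips per audio segment instead of waiting for the full video. `synthesize_tts` and `render_segment` are hypothetical stubs, not ChatAnything's actual functions; playback only stays smooth if rendering keeps up with real time.

```python
def synthesize_tts(text, sample_rate=16000):
    # Hypothetical stub: a real TTS would return waveform samples.
    return [0.0] * (sample_rate * 3), sample_rate  # 3 s of "audio"

def render_segment(samples, fps=25, sample_rate=16000):
    # Hypothetical stub for the per-segment face renderer (the slow step).
    n_frames = int(len(samples) / sample_rate * fps)
    return [f"frame{i}" for i in range(n_frames)]

def stream_talking_head(text, chunk_seconds=1.0, fps=25):
    samples, sr = synthesize_tts(text)
    step = int(chunk_seconds * sr)
    for start in range(0, len(samples), step):
        chunk = samples[start:start + step]
        # Yield each clip as soon as it is rendered; playback only stays
        # smooth if rendering keeps up with real time (>= fps frames/sec).
        yield render_segment(chunk, fps=fps, sample_rate=sr)

for clip in stream_talking_head("hello"):
    print(len(clip), "frames ready")
```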

@tolecy commented Nov 24, 2023
By the way, this is the message used for the face rendering run above: "Thank you for the kind words. It is a pleasure to meet you as well. I am here to share the magic and beauty of the world around us. If you have any questions or need any guidance, I am always here to help."

@puffy310

How did you replace ChatGPT: with another OpenAI model, or a locally hosted OpenAI-API-compatible program?

@tolecy commented Nov 24, 2023

> How did you replace ChatGPT: with another OpenAI model, or a locally hosted OpenAI-API-compatible program?

I simply wrapped my local model as a service (with an input/output format similar to OpenAI's), deployed it locally, and then made some modifications to /chat_anything/chatbot/chat.py.
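For anyone trying the same, here is a minimal sketch of such a wrapper, assuming FastAPI; `local_generate` is a placeholder for your own model, and only the subset of the /v1/chat/completions response schema that most clients read is returned:

```python
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str = "local-model"
    messages: list[Message]
    temperature: float = 0.7

def local_generate(prompt: str) -> str:
    # Hypothetical placeholder: call your own local model here.
    return "Hello from the local model."

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    # Flatten the chat history into a single prompt for the local model.
    prompt = "\n".join(f"{m.role}: {m.content}" for m in req.messages)
    reply = local_generate(prompt)
    # Mimic the pieces of the OpenAI response that client code reads.
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
    }

# Run with: uvicorn server:app --port 8000
# then point the OpenAI client's base URL at http://localhost:8000/v1
```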

@ermu2001 (Collaborator) commented Nov 24, 2023

> Great project! I replaced ChatGPT with my own small model and tested it on my 3080 Ti graphics card; the timing details are as follows: […]

The facial image generation only executes once, at the first round of conversation ("..., Bot: how are you doing..."), so I think the latency would be acceptable.

And by the way, the "seamlessClone:: 100%|318/318 [00:15<00:00, 20.66it/s]" step is an option for SadTalker to avoid cropping the face out for rendering: it renders the cropped face and then pastes it back onto the full image. You can disable it by unchecking "Use full body instead of a face." on the settings tab. It seems unoptimized and takes up a lot of time. O.o

https://github.com/zhoudaquan/ChatAnything/blob/main/chat_anything/sad_talker/utils/paste_pic.py#L59-L65
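For context, the linked paste-back loop appears to run OpenCV's Poisson blending once per frame (318 calls in the log above), which is why it dominates runtime. A minimal sketch of a single paste-back step; the file names and crop box are placeholder values, not the project's actual code:

```python
import cv2
import numpy as np

# Placeholder inputs: the original full image and a rendered face crop.
full_frame = cv2.imread("full_body_frame.png")
rendered_face = cv2.imread("rendered_face.png")
x, y, w, h = 100, 50, 256, 256  # assumed crop box of the face region

# Solid mask over the rendered crop; seamlessClone solves a Poisson
# blend for every frame, which is the expensive part.
mask = np.full(rendered_face.shape, 255, dtype=np.uint8)
center = (x + w // 2, y + h // 2)
blended = cv2.seamlessClone(rendered_face, full_frame, mask, center,
                            cv2.NORMAL_CLONE)
cv2.imwrite("blended.png", blended)
```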

@puffy310

Very excited to see more progress in this area!

@tolecy commented Nov 29, 2023

> The facial image generation only executes once, at the first round of conversation, so I think the latency would be acceptable. […] You can disable it by unchecking "Use full body instead of a face." on the settings tab. […]

Yep. When running on a 4090 (considering only the face render), the time required for video generation is not significantly different from the video length, so theoretically a streaming output (at 25 fps) could feel relatively smooth.

Currently I am trying to integrate Live2D, and beyond that I hope to input a custom full-body image for full-body driving (that is my next plan), just like a prepared Live2D model. But I don't have much experience in this area of CV; any suggestions?
