1. Correct Usage
The reference image and the pose (gesture) sequence need to be aligned; the alignment code and the gesture extraction code are in the notebook demo.ipynb.
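As a quick sanity check before running inference, you can overlay the extracted pose keypoints on the reference image. The sketch below is a hypothetical helper: it assumes the keypoints are stored as an (N, 2) array of pixel coordinates in a .npy file, which may differ from the format demo.ipynb actually produces.

```python
# Visual alignment check: draw pose keypoints on top of the reference image.
# Sketch only; adjust the keypoint format to whatever demo.ipynb emits.
import cv2
import numpy as np

image = cv2.imread("reference.png")        # hypothetical reference image path
keypoints = np.load("pose_frame_000.npy")  # assumed (N, 2) pixel coordinates

for x, y in keypoints:
    cv2.circle(image, (int(x), int(y)), radius=3, color=(0, 255, 0), thickness=-1)

cv2.imwrite("alignment_check.png", image)
# If the drawn points do not land on the corresponding body parts,
# the image and the pose sequence are misaligned.
```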
The 40 Flux-generated images provided on the homepage are already aligned. The simplest approach is to use them as references and generate similar portraits with ControlNet or other image-generation tools (see section 4 below).
Currently, we trained strictly on frontal, half-body data, so the algorithm does not support side views or other non-standard images. The following are some unsupported types.
2. Inference Speed
The SD-based algorithm currently has high GPU requirements: it needs around 16 GB of VRAM to run and is slow on ordinary consumer cards.
An accelerated version is currently in training and is expected to be released soon. Its speedup should be comparable to that of V1's accelerated version.
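Until the accelerated version lands, if the inference pipeline is diffusers-based (an assumption; adapt to the actual EchoMimicV2 code), standard memory-saving switches may help it fit on smaller cards, usually at some speed cost:

```python
# Generic diffusers-style memory savers; a sketch under the assumption that
# the inference pipeline is a diffusers Pipeline object. Whether EchoMimicV2's
# pipeline supports each switch must be checked against the actual code.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "some/model-id",            # hypothetical placeholder, not the real repo id
    torch_dtype=torch.float16,  # half precision roughly halves VRAM usage
)
pipe.enable_attention_slicing()  # trade speed for lower peak memory
pipe.enable_model_cpu_offload()  # keep idle submodules in CPU RAM
```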
3. Video Noise Issue
Noise in the generated video can be reduced by adjusting the CFG parameter within the range 1.5 to 3.0: a lower CFG value gives better video quality but poorer lip-sync accuracy, while a higher value gives better lip-sync at the cost of video quality. Sample clips at several CFG values are listed below, followed by a parameter-sweep sketch for comparing the trade-off.
Sample clips at different CFG values: cfg1-5.mp4, cfg1-8.mp4, cfg2-0.mp4, cfg2-5.mp4.
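To compare CFG settings on your own input, a small sweep is enough. The `run_inference` function below is a hypothetical placeholder; substitute the project's real inference entry point and its actual CFG argument name.

```python
# Sweep the CFG parameter across the recommended 1.5-3.0 range and save
# one clip per value, named like the sample clips above.
def run_inference(ref_image: str, audio: str, cfg: float, out_path: str) -> None:
    # Placeholder: call the real EchoMimicV2 inference here.
    print(f"would generate {out_path} with cfg={cfg}")

for cfg in (1.5, 1.8, 2.0, 2.5):
    # Lower CFG: cleaner video, weaker lip-sync; higher CFG: the reverse.
    run_inference(
        ref_image="reference.png",
        audio="speech.wav",
        cfg=cfg,
        out_path=f"cfg{str(cfg).replace('.', '-')}.mp4",
    )
```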
4. How to Generate High-Quality Reference Images
The 40 images on the homepage were generated by Flux and are already aligned; the simplest method is to use them as references and generate similar portraits with ControlNet or other image-generation tools.
The images provided on the homepage were generated by Flux together with LoRA models.
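As an illustration of the ControlNet route, the sketch below uses the diffusers library (plus controlnet_aux) with an OpenPose-conditioned SD 1.5 ControlNet. The model ids and prompt are examples, not the project's recommended setup; any pose-conditioned ControlNet should work similarly.

```python
# Generate a new portrait that matches the pose of an aligned reference image.
# Sketch only: model ids and prompt are illustrative, not the project's setup.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image
from controlnet_aux import OpenposeDetector

ref = load_image("aligned_reference.png")  # e.g. one of the homepage images

# Extract the pose from the aligned reference, then condition generation on it.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose = openpose(ref)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "half-body frontal portrait photo of a person, studio lighting",
    image=pose,
    num_inference_steps=30,
).images[0]
image.save("new_reference.png")
```

Because the pose is lifted from an already-aligned image, the generated portrait inherits the same framing and should need little or no further alignment.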
5. Generated Video Length
EchoMimicV2 can in principle generate video of unlimited length: as long as GPU memory allows, the length can be increased.
A common question from users is why their attempts to extend the video length are capped at 13 seconds. This happens because they are using our provided test pose sequences, which are 13 seconds long; with longer custom pose sequences, the cap does not apply.
We take this issue very seriously and are developing a Jupyter notebook, to be released soon, that will include custom pose sequences, reference image alignment, and segmented inference. Please stay tuned.
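Until that notebook is out, one simple workaround is to tile an existing pose sequence to the target length. The sketch below assumes pose frames are stored as sequentially numbered per-frame .npy files in a directory, and assumes a 25 fps frame rate; both are guesses, so check the actual layout of the provided pose samples.

```python
# Tile a pose-frame directory so inference can run past the sample's length.
# Assumes one .npy file per frame, named so lexicographic order == time order;
# adjust to the real pose-sequence layout shipped with the repo.
from pathlib import Path
import shutil

src = Path("pose_samples/demo_13s")  # hypothetical provided sample
dst = Path("pose_samples/demo_extended")
dst.mkdir(parents=True, exist_ok=True)

frames = sorted(src.glob("*.npy"))
target_frames = 30 * 25              # e.g. 30 seconds at an assumed 25 fps

for i in range(target_frames):
    # Loop the source frames; a ping-pong order would avoid a visible jump
    # at the seam, at the cost of slightly more code.
    shutil.copy(frames[i % len(frames)], dst / f"{i:06d}.npy")
```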