1. Hunyuan-Captioner is released.
2. 6GB GPU VRAM inference scripts are released.
3. Support LoRA and ControlNet in diffusers and ComfyUI.
4. Support specifying different ControlNet layer weights.
5. Fix a ComfyUI bug.
rongweiquan authored and zml-ai committed Jun 30, 2024
1 parent 5657364 commit 3bb80e1
Showing 117 changed files with 81,555 additions and 572 deletions.
214 changes: 213 additions & 1 deletion Notice

Large diffs are not rendered by default.

140 changes: 132 additions & 8 deletions README.md
@@ -25,6 +25,9 @@ This repo contains PyTorch model definitions, pre-trained weights and inference/
> [**DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation**](https://arxiv.org/abs/2403.08857) <br>
## 🔥🔥🔥 News!!
* Jun 27, 2024: :art: Hunyuan-Captioner is released, providing fine-grained captions for training data. See [mllm](./mllm) for details.
* Jun 27, 2024: :tada: Support for LoRA and ControlNet in diffusers. See [diffusers](./diffusers) for details.
* Jun 27, 2024: :tada: 6GB GPU VRAM inference scripts are released. See [lite](./lite) for details.
* Jun 19, 2024: :tada: ControlNet is released, supporting canny, pose and depth control. See [training/inference codes](#controlnet) for details.
* Jun 13, 2024: :zap: HYDiT-v1.1 version is released, which mitigates the issue of image oversaturation and alleviates the watermark issue. Please check [HunyuanDiT-v1.1 ](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.1) and
[Distillation-v1.1](https://huggingface.co/Tencent-Hunyuan/Distillation-v1.1) for more details.
@@ -70,11 +73,15 @@ or multi-turn language interactions to create the picture.
- [x] Training
- [x] Lora
- [x] Controlnet (Pose, Canny, Depth)
- [x] Hunyuan-Captioner (Re-caption the raw image-text pairs)
- [x] 6GB GPU VRAM Inference
- [ ] IP-adapter
- [ ] Hunyuan-DiT-S checkpoints (0.7B model)
- [ ] Caption model (Re-caption the raw image-text pairs)
- [DialogGen](https://github.com/Centaurusalpha/DialogGen) (Prompt Enhancement Model)
- [x] Inference
- Mllm
- Hunyuan-Captioner
- [x] Inference
- [Hunyuan-DialogGen](https://github.com/Centaurusalpha/DialogGen) (Prompt Enhancement Model)
- [x] Inference
- [X] Web Demo (Gradio)
- [x] Multi-turn T2I Demo (Gradio)
- [X] Cli Demo
@@ -100,13 +107,15 @@ or multi-turn language interactions to create the picture.
- [Full Parameter Training](#full-parameter-training)
- [LoRA](#lora)
- [🔑 Inference](#-inference)
- [6GB GPU VRAM Inference](#6gb-gpu-vram-inference)
- [Using Gradio](#using-gradio)
- [Using Diffusers](#using--diffusers)
- [Using Command Line](#using-command-line)
- [More Configurations](#more-configurations)
- [Using ComfyUI](#using-comfyui)
- [:building_construction: Adapter](#building_construction-adapter)
- [ControlNet](#controlnet)
- [:art: Hunyuan-Captioner](#art-hunyuan-captioner)
- [🚀 Acceleration (for Linux)](#-acceleration-for-linux)
- [🔗 BibTeX](#-bibtex)

@@ -225,6 +234,8 @@ cd HunyuanDiT
We provide an `environment.yml` file for setting up a Conda environment.
Conda's installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).

We recommend CUDA versions 11.7 and 12.0+.

```shell
# 1. Prepare conda environment
conda env create -f environment.yml
@@ -337,7 +348,7 @@ All models will be automatically downloaded. For more information about the mode

4. Data Selection and Configuration File Creation

We configure the training data through YAML files. In these files, you can set up standard data processing strategies for filtering, copying, deduplicating, and more regarding the training data. For more details, see [docs](IndexKits/docs/MakeDataset.md).
We configure the training data through YAML files. In these files, you can set up standard data processing strategies for filtering, copying, deduplicating, and more regarding the training data. For more details, see [./IndexKits](IndexKits/docs/MakeDataset.md).

For a sample file, please refer to [file](./dataset/yamls/porcelain.yaml). For a full parameter configuration file, see [file](./IndexKits/docs/MakeDataset.md).

@@ -389,7 +400,7 @@ All models will be automatically downloaded. For more information about the mode



We provide training and inference scripts for LoRA, detailed in the [guidances](./lora/README.md).
We provide training and inference scripts for LoRA, detailed in [./lora](./lora/README.md).

```shell
# Training for porcelain LoRA.
@@ -446,6 +457,37 @@ We provide training and inference scripts for LoRA, detailed in the [guidances](

## 🔑 Inference

### 6GB GPU VRAM Inference
Running HunyuanDiT in under 6GB of GPU VRAM is now available, based on [diffusers](https://huggingface.co/docs/diffusers/main/en/api/pipelines/hunyuandit). Here we provide instructions and a demo for a quick start.

> The 6GB version supports NVIDIA GPUs from the Ampere architecture onward, such as the RTX 3070/3080/4080/4090, A100, and so on.

The only thing you need to do is install the following libraries:

```bash
pip install -U bitsandbytes
pip install git+https://github.com/huggingface/diffusers
pip install torch==2.0.0
```

Then you can enjoy your HunyuanDiT text-to-image journey in under 6GB of GPU VRAM!

Here is a demo:

```bash
cd HunyuanDiT
# Quick start
model_id=Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers-Distilled
prompt=一个宇航员在骑马
infer_steps=50
guidance_scale=6
python3 lite/inference.py ${model_id} ${prompt} ${infer_steps} ${guidance_scale}
```

More details can be found in [./lite](lite/README.md).
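For a rough picture of what the lite script does, the core idea is to quantize the large mT5 text encoder with bitsandbytes and let idle sub-models be offloaded to the CPU. The snippet below is only an illustrative sketch of that idea with the diffusers/transformers APIs; the model id is the one used in the demo above, the `text_encoder_2` subfolder name is an assumption, and the actual implementation in [./lite](lite/README.md) may differ.

```python
# Illustrative sketch only -- see ./lite for the official script.
import torch
from transformers import T5EncoderModel
from diffusers import HunyuanDiTPipeline

model_id = "Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers-Distilled"

# Load the mT5 text encoder (the largest text model in the pipeline) in 8-bit
# via bitsandbytes; "text_encoder_2" is assumed to be the mT5 subfolder.
text_encoder_2 = T5EncoderModel.from_pretrained(
    model_id, subfolder="text_encoder_2", load_in_8bit=True, device_map="auto"
)

pipe = HunyuanDiTPipeline.from_pretrained(
    model_id, text_encoder_2=text_encoder_2, torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # keep only the active sub-model on the GPU

image = pipe("一个宇航员在骑马", num_inference_steps=50, guidance_scale=6).images[0]
image.save("astronaut.png")
```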


### Using Gradio

Make sure the conda environment is activated before running the following command.
@@ -513,6 +555,8 @@ image = pipe(prompt, num_inference_steps=25).images[0]
```
More details can be found in [HunyuanDiT-Diffusers-Distilled](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-Diffusers-Distilled)

**More functions:** For additional features such as LoRA and ControlNet, please see the README in [./diffusers](diffusers).
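As a rough illustration of the ControlNet path, a sketch along the lines of the upstream diffusers ControlNet classes might look like the following. The checkpoint names and argument names here are assumptions; the custom pipeline shipped in [./diffusers](diffusers) may expose a different interface, so treat this only as a sketch.

```python
import torch
from diffusers import HunyuanDiT2DControlNetModel, HunyuanDiTControlNetPipeline
from diffusers.utils import load_image

# Checkpoint names are assumptions -- check ./diffusers and the Hugging Face Hub.
controlnet = HunyuanDiT2DControlNetModel.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.1-ControlNet-Diffusers-Canny", torch_dtype=torch.float16
)
pipe = HunyuanDiTControlNetPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-v1.1-Diffusers", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

cond = load_image("canny_edges.png")  # a pre-computed canny edge map of the target layout
image = pipe(
    "在夜晚的酒店门前, 一座古老的建筑",
    control_image=cond,  # argument name may differ in the custom ./diffusers pipeline
    num_inference_steps=50,
).images[0]
image.save("controlnet_canny.png")
```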

### Using Command Line

We provide several commands for a quick start:
@@ -566,7 +610,7 @@ git clone https://github.com/comfyanonymous/ComfyUI.git
# Install torch, torchvision, torchaudio
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
# Install Comfyui essential python package
# Install Comfyui essential python package.
cd ComfyUI
pip install -r requirements.txt
@@ -594,13 +638,13 @@ python main.py --listen --port 80
# Running ComfyUI successfully!
```
More details can be found in [ComfyUI README](comfyui-hydit/README.md)
More details can be found in [./comfyui-hydit](comfyui-hydit/README.md)

## :building_construction: Adapter

### ControlNet

We provide training scripts for ControlNet, detailed in the [guidances](./controlnet/README.md).
We provide training scripts for ControlNet, detailed in [./controlnet](./controlnet/README.md).

```shell
# Training for canny ControlNet.
@@ -654,6 +698,86 @@ We provide training scripts for ControlNet, detailed in the [guidances](./contro

</table>

## :art: Hunyuan-Captioner
Hunyuan-Captioner meets the needs of text-to-image techniques by maintaining a high degree of image-text consistency. It can generate high-quality image descriptions from a variety of angles, including object description, object relationships, background information, and image style. Our code is based on the [LLaVA](https://github.com/haotian-liu/LLaVA) implementation.

### Examples

<td align="center"><img src="./asset/caption_demo.jpg" alt="Image 3" width="1200"/></td>

### Instructions
a. Install dependencies

The dependencies and installation are basically the same as the [**base model**](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT-v1.1).

b. Data download
```shell
cd HunyuanDiT
wget -O ./dataset/data_demo.zip https://dit.hunyuan.tencent.com/download/HunyuanDiT/data_demo.zip
unzip ./dataset/data_demo.zip -d ./dataset
mkdir ./dataset/porcelain/arrows ./dataset/porcelain/jsons
```

c. Model download
```shell
# Use the huggingface-cli tool to download the model.
huggingface-cli download Tencent-Hunyuan/HunyuanCaptioner --local-dir ./ckpts/captioner
```

### Inference

Current supported prompt templates:

|Mode | Prompt template |Description |
| --- | --- | --- |
|caption_zh | 描述这张图片 |Caption in Chinese |
|insert_content | 根据提示词“{}”,描述这张图片 |Insert specific knowledge into caption|
|caption_en | Please describe the content of this image |Caption in English |


a. Single picture inference in Chinese

```bash
python mllm/caption_demo.py --mode "caption_zh" --image_file "mllm/images/demo1.png" --model_path "./ckpts/captioner"
```

b. Insert specific knowledge into caption

```bash
python mllm/caption_demo.py --mode "insert_content" --content "宫保鸡丁" --image_file "mllm/images/demo2.png" --model_path "./ckpts/captioner"
```

c. Single picture inference in English

```bash
python mllm/caption_demo.py --mode "caption_en" --image_file "mllm/images/demo3.png" --model_path "./ckpts/captioner"
```

d. Multiple-picture inference in Chinese

```bash
### Convert multiple pictures to csv file.
python mllm/make_csv.py --img_dir "mllm/images" --input_file "mllm/images/demo.csv"
### Multiple pictures inference
python mllm/caption_demo.py --mode "caption_zh" --input_file "mllm/images/demo.csv" --output_file "mllm/images/demo_res.csv" --model_path "./ckpts/captioner"
```

(Optional) To convert the output csv file to Arrow format, please refer to [Data Preparation #3](#data-preparation) for detailed instructions.
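For orientation only, a generic CSV-to-Arrow conversion with pandas and pyarrow looks roughly like the sketch below. The output path is just an example, and the repository's own tooling expects its specific column schema, so prefer the scripts referenced in [Data Preparation #3](#data-preparation) for real training data.

```python
# Illustrative sketch only -- use the repository's conversion scripts for training data.
import pandas as pd
import pyarrow as pa

df = pd.read_csv("mllm/images/demo_res.csv")      # captioner output
table = pa.Table.from_pandas(df)

# Write an Arrow IPC file next to the other training arrows (path is an example).
with pa.OSFile("dataset/porcelain/arrows/demo_res.arrow", "wb") as sink:
    with pa.RecordBatchFileWriter(sink, table.schema) as writer:
        writer.write_table(table)
```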


### Gradio
To launch a Gradio demo locally, please run the following commands one by one. For more detailed instructions, please refer to [LLaVA](https://github.com/haotian-liu/LLaVA).
```bash
cd mllm
python -m llava.serve.controller --host 0.0.0.0 --port 10000
python -m llava.serve.gradio_web_server --controller http://0.0.0.0:10000 --model-list-mode reload --port 443
python -m llava.serve.model_worker --host 0.0.0.0 --controller http://0.0.0.0:10000 --port 40000 --worker http://0.0.0.0:40000 --model-path "./ckpts/captioner" --model-name LlavaMistral
```
Then the demo can be accessed at http://0.0.0.0:443. Note that 0.0.0.0 here needs to be replaced with your server's IP address (X.X.X.X).

## 🚀 Acceleration (for Linux)

8 changes: 4 additions & 4 deletions app/multiTurnT2I_app.py
@@ -14,7 +14,7 @@
import base64
import pandas as pd
from sample_t2i import inferencer
from dialoggen.dialoggen_demo import init_dialoggen_model, eval_model
from mllm.dialoggen_demo import init_dialoggen_model, eval_model

SIZES = {
"正方形(square, 1024x1024)": (1024, 1024),
@@ -45,7 +45,7 @@ def get_image_md5(image):


# mllm调用
def request_mllm(server_url='http://0.0.0.0:8080',history_messages=[], question="画一个木制的鸟",image=""):
def request_dialogGen(server_url='http://0.0.0.0:8080',history_messages=[], question="画一个木制的鸟",image=""):
if image != "":
image = base64.b64encode(open(image, "rb").read()).decode()
print("history_messages before request",history_messages)
@@ -95,7 +95,7 @@ def image_generation(
# 图文对话
def chat(history_messages, input_text):

history_messages, response_text = request_mllm(history_messages=history_messages, question=input_text)
history_messages, response_text = request_dialogGen(history_messages=history_messages, question=input_text)
return history_messages, response_text
#
def pipeline(input_text, state, infer_steps, seed, image_size):
@@ -141,7 +141,7 @@ def upload_image(state, image_input):
(224, 224)).convert('RGB')
input_image.save(image_input.name) # Overwrite with smaller image.
system_prompt = '请先判断用户的意图,若为画图则在输出前加入<画图>:'
history_messages, response = request_mllm(question="这张图描述了什么?",history_messages=history_messages,
history_messages, response = request_dialogGen(question="这张图描述了什么?",history_messages=history_messages,
image=image_input.name)
conversation += [(f'<img src="./file={image_input.name}" style="display: inline-block;">', response)]
print("conversation" , conversation)
Binary file added asset/caption_demo.jpg
74 changes: 56 additions & 18 deletions comfyui-hydit/README.md
@@ -1,43 +1,51 @@
# comfyui-hydit

This repository contains a customized node and workflow designed specifically for HunYuan DIT. The official tests conducted on DDPM, DDIM, and DPMMS have consistently yielded results that align with those obtained through the Diffusers library. However, it's important to note that we cannot assure the consistency of results from other ComfyUI native samplers with the Diffusers inference. We cordially invite users to explore our workflow and are open to receiving any inquiries or suggestions you may have.
This repository houses a tailored node and workflow designed specifically for HunYuan DIT. The official tests conducted on DDPM, DDIM, and DPMMS have consistently yielded results that align with those obtained through the Diffusers library. However, it's important to note that we cannot assure the consistency of results from other ComfyUI native samplers with the Diffusers inference. We cordially invite users to explore our workflow and are open to receiving any inquiries or suggestions you may have.

## Overview


### Workflow text2image

![Workflow](img/txt2img_v2.png)

[workflow_diffusers](workflow/hunyuan_diffusers_api.json) file for HunyuanDiT txt2image with diffusers backend.
[workflow_ksampler](workflow/hunyuan_ksampler_api.json) file for HunyuanDiT txt2image with ksampler backend.
![Workflow](img/work_diffusers.png)
[workflow_diffusers](workflow/workflow_diffusers.json) file for HunyuanDiT txt2image with diffusers backend.
![Workflow](img/workflow_ksampler.png)
[workflow_ksampler](workflow/workflow_ksampler.json) file for HunyuanDiT txt2image with ksampler backend.
![Workflow](img/workflow_lora_controlnet.png)
[workflow_lora_controlnet_diffusers](workflow/workflow_lora_controlnet.json) file for HunyuanDiT lora and controlnet model with diffusers backend.


## Usage

We provide several commands to quick start:
Make sure you run the following commands inside the [ComfyUI](https://github.com/comfyanonymous/ComfyUI) project with our [comfyui-hydit](.) and the correct conda environment.

```shell
# Please use python 3.10 version with cuda 11.7
# Download comfyui code
git clone https://github.com/comfyanonymous/ComfyUI.git

# Install torch, torchvision, torchaudio
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu117 --default-timeout=100 future

# Install Comfyui essential python package
cd ComfyUI
pip install -r requirements.txt

# ComfyUI has been successfully installed!

# Download model weight as before or link the existing model folder to ComfyUI.
python -m pip install "huggingface_hub[cli]"
mkdir models/hunyuan
huggingface-cli download Tencent-Hunyuan/HunyuanDiT --local-dir ./models/hunyuan/ckpts

# Move to the ComfyUI custom_nodes folder and copy comfyui-hydit folder from HunyuanDiT Repo.
cd custom_nodes
cp -r ${HunyuanDiT}/comfyui-hydit ./
git clone https://github.com/Tencent/HunyuanDiT.git
cp -r HunyuanDiT/comfyui-hydit ./
rm -rf HunyuanDiT
cd comfyui-hydit

# !!! If using windows system !!!
cd custom_nodes
git clone https://github.com/Tencent/HunyuanDiT.git
xcopy /E /I HunyuanDiT\comfyui-hydit comfyui-hydit
rmdir /S /Q HunyuanDiT
cd comfyui-hydit

# Install some essential python Package.
@@ -53,20 +61,42 @@ python main.py --listen --port 80
# Running ComfyUI successfully!
```

## Download weights for diffusers mode

```shell
python -m pip install "huggingface_hub[cli]"
mkdir models/hunyuan
huggingface-cli download Tencent-Hunyuan/HunyuanDiT-v1.1 --local-dir ./models/hunyuan/ckpts
huggingface-cli download Tencent-Hunyuan/Distillation-v1.1 pytorch_model_distill.pt --local-dir ./models/hunyuan/ckpts/t2i/model
```

## Download weights for ksampler mode
- Download the [clip encoder](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/blob/main/t2i/clip_text_encoder/pytorch_model.bin) and place it in `ComfyUI/models/clip`
- Download the [mt5](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/blob/main/t2i/mt5/pytorch_model.bin) and place it in `ComfyUI/models/t5`
- Download the [base model](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/blob/main/t2i/model/pytorch_model_ema.pt) and place it in `ComfyUI/models/checkpoints`
- Download the [sdxl vae](https://huggingface.co/Tencent-Hunyuan/HunyuanDiT/blob/main/t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin) and place it in `ComfyUI/models/vae`
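If you prefer doing this from Python, the same four files can be fetched with `huggingface_hub` as in the convenience sketch below; note that `hf_hub_download` keeps the repository sub-paths under `local_dir`, so you may need to move the files into the folders listed above afterwards.

```python
# Convenience sketch: programmatic download of the ksampler-mode weights.
from huggingface_hub import hf_hub_download

repo = "Tencent-Hunyuan/HunyuanDiT"
files = [
    ("t2i/clip_text_encoder/pytorch_model.bin", "ComfyUI/models/clip"),
    ("t2i/mt5/pytorch_model.bin", "ComfyUI/models/t5"),
    ("t2i/model/pytorch_model_ema.pt", "ComfyUI/models/checkpoints"),
    ("t2i/sdxl-vae-fp16-fix/diffusion_pytorch_model.bin", "ComfyUI/models/vae"),
]
for filename, target_dir in files:
    # Files land under <target_dir>/<filename>, keeping the repo sub-path.
    path = hf_hub_download(repo, filename, local_dir=target_dir)
    print(f"Downloaded {filename} -> {path}")
```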


## Custom Node
Below we document all the nodes, with thanks to prior work [[1]](#1)[[2]](#2).
#### HunYuan Pipeline Loader
- Loads the full stack of models needed for HunYuanDiT.
- **pipeline_folder_name** is the official weight folder path for Hunyuan-DiT, including clip_text_encoder, model, mt5, sdxl-vae-fp16-fix and tokenizer.
- **lora** (optional): the LoRA weights to load.

#### HunYuan Checkpoint Loader
- Loads the base model for HunYuanDiT in the ksampler backend.
- **model_name** is the list of weights in the ComfyUI checkpoint folder.
- **vae_name** is the list of weights in the ComfyUI vae folder.
- **backend**: "diffusers" uses diffusers as the backend, while "ksampler" uses the ComfyUI ksampler as the backend.
- **PIPELINE** is the instance of StableDiffusionPipeline.
- **MODEL** is the instance of the ComfyUI MODEL.
- **CLIP** is the instance of the ComfyUI CLIP.
- **VAE** is the instance of the ComfyUI VAE.


#### HunYuan CLIP Loader
- Loads the clip and mt5 models for HunYuanDiT in the ksampler backend.
- **text_encoder_path** is the list of weights in the ComfyUI clip model folder.
- **t5_text_encoder_path** is the list of weights in the ComfyUI t5 model folder.

#### HunYuan VAE Loader
- Loads the vae model for HunYuanDiT in the ksampler backend.
- **model_name** is the list of weights in the ComfyUI vae model folder.

#### HunYuan Scheduler Loader
- Loads the scheduler algorithm for HunYuanDiT.
@@ -88,6 +118,14 @@ Below I'm trying to document all the nodes, thanks for some good work[[1]](#1)[[
- **Input** is the instance of StableDiffusionPipeline and some hyper-parameters for sampling.
- **Output** is the generated image.

#### HunYuan Lora Loader
- Loads the LoRA model for HunYuanDiT in the diffusers backend.
- **lora_name** is the list of weights in the ComfyUI lora folder.

#### HunYuan ControlNet Loader
- Loads the ControlNet model for HunYuanDiT in the diffusers backend.
- **controlnet_path** is the list of weights in the ComfyUI controlnet folder.

## Reference
<a id="1">[1]</a>
https://github.com/Limitex/ComfyUI-Diffusers