A fine-tuned version of the Stable Diffusion model trained on a self-translated 10k diffusiondb Chinese corpus, plus a model to "extend" prompts
Stable Diffusion is a state-of-the-art text-to-image model that generates images from text.
Nowadays, with the help of diffusers, which provides pretrained diffusion models across multiple modalities, people can customize their own image generators, either conditional (based on a prompt) or unconditional.
This project focuses on running the text-to-image example with self-translated data based on diffusiondb.
This repository was created on 2022.11.5.
Recently, IDEA-CCNL released Taiyi-Stable-Diffusion-1B-Chinese-v0.1 (on 2022.11.2). As a base model trained on a massive dataset, it performs the Chinese prompt-to-image generation task very well. The model was trained on wukong-dataset, which is dominated by realistic-style features. This makes its output style slightly different from that of the original CompVis/stable-diffusion-v1-4. That can be a drawback when one wants to generate an image in the style of the original CompVis/stable-diffusion-v1-4 or, more usefully, wants to use modifiers (style cues) to make prompts more expressive.
The above idea comes from a project named prompt-extend, which extends Stable Diffusion English prompts with suitable style cues using text generation. You can try it on HuggingFace Space.
Below are some examples of using Taiyi-Stable-Diffusion-1B-Chinese-v0.1 to generate images with and without style cues in Chinese.
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("IDEA-CCNL/Taiyi-Stable-Diffusion-1B-Chinese-v0.1")
pipe = pipe.to("cuda")
pipe.safety_checker = lambda images, clip_input: (images, False)
prompt = '卡通龙'
image = pipe(prompt, guidance_scale=7.5).images[0]
image
prompt_with_cues = "卡通龙,数字艺术、艺术品趋势、电影照明、工作室质量、光滑成型"
image = pipe(prompt_with_cues, guidance_scale=7.5).images[0]
image
| Prompt | 卡通龙 | 卡通龙,数字艺术、艺术品趋势、电影照明、工作室质量、光滑成型 |
|---|---|---|
| 卡通龙 | (generated image not shown) | (generated image not shown) |
As the outputs above show, the two images do not differ much in detail. This may reduce the imagination of the model and squeeze the space for fine-grained creation.
This project aims to implement prompt-extend in the Chinese domain by fine-tuning Taiyi-Stable-Diffusion-1B-Chinese-v0.1 on self-translated data sampled from diffusiondb, and to provide a text generator that makes style cues work. It provides a fine-tuned version of Taiyi-Stable-Diffusion-1B-Chinese-v0.1 and an MT5 model that generates style cues.
All models are uploaded to the Huggingface Hub.
To fine-tune them, the only requirement is a translated dataset. First, I randomly sampled 10k English samples from diffusiondb and used NMT to translate them into Chinese, with some corrections. The result has been uploaded to svjack/diffusiondb_random_10k_zh_v1. Fine-tuning the base model on this dataset gives the desired behavior.
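The sampling and translation step can be sketched roughly as follows. This is a minimal illustration, not the script used for this project: the function name `build_translated_sample`, the `translate` callable (a stand-in for whatever NMT system is used), the de-duplication step, and the `en`/`zh` field names are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, Iterable, List


def build_translated_sample(
    prompts: Iterable[str],
    translate: Callable[[str], str],
    sample_size: int = 10_000,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Randomly sample English prompts and pair each with a Chinese translation.

    `translate` is a hypothetical hook for an NMT system; manual corrections
    would be applied to the returned rows afterwards.
    """
    # Assumption: drop empty prompts and duplicates before sampling.
    unique_prompts = sorted({p.strip() for p in prompts if p.strip()})
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(sample_size, len(unique_prompts))
    sampled = rng.sample(unique_prompts, k)
    return [{"en": p, "zh": translate(p)} for p in sampled]
```

With a real translator plugged in, the returned rows could then be reviewed by hand and pushed to the Hub as a dataset.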
The fine-tuned text-to-image release includes three models, named:
svjack/Stable-Diffusion-FineTuned-zh-v0
svjack/Stable-Diffusion-FineTuned-zh-v1
svjack/Stable-Diffusion-FineTuned-zh-v2
These three models are trained with an increasing number of steps (i.e. v0 stops earliest and v2 stops last).
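A small helper for picking one of the three checkpoints might look like the sketch below. It assumes the fine-tuned checkpoints load through `StableDiffusionPipeline.from_pretrained` in the same way as the Taiyi base model shown earlier; the helper names `finetuned_model_id` and `load_finetuned_pipeline` are hypothetical, and the step counts are the ones reported in this README.

```python
# Training steps per released checkpoint, as reported in this README.
FINETUNED_STEPS = {"v0": 10_000, "v1": 28_000, "v2": 56_000}


def finetuned_model_id(version: str) -> str:
    """Return the Hub id for a fine-tuned checkpoint ("v0", "v1" or "v2")."""
    if version not in FINETUNED_STEPS:
        raise ValueError(f"unknown version {version!r}; expected one of {sorted(FINETUNED_STEPS)}")
    return f"svjack/Stable-Diffusion-FineTuned-zh-{version}"


def load_finetuned_pipeline(version: str = "v1", device: str = "cuda"):
    """Load a fine-tuned checkpoint (assumed to load like the Taiyi base model)."""
    # Imported lazily so the helpers above work without diffusers installed.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(finetuned_model_id(version))
    return pipe.to(device)
```

For example, `load_finetuned_pipeline("v1")("卡通龙", guidance_scale=7.5).images[0]` would mirror the Taiyi example above, with the fine-tuned weights swapped in.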
Second, I trained a Chinese style-cue generator based on MT5. This step needs only the translated text. The model is available at svjack/prompt-extend-chinese, and a GPT2-based variant at svjack/prompt-extend-chinese-gpt.
You can try it in the Space svjack/prompt-extend-gpt-chinese.
For usage, refer to the model cards on the Huggingface Hub, or run the scripts locally: predict_image.py uses only the fine-tuned Stable Diffusion model, prompt_extend.py only predicts style cues from short Chinese prompt strings, and predict_image_and_extend.py merges the two into one simple function.
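The way the two stages combine can be sketched as below. This does not reproduce predict_image_and_extend.py; it only illustrates the composition, assuming the style cues are appended to the short prompt after a full-width comma (as in the example prompts in this README). The names `extend_prompt` and `extend_then_generate`, and the two callables `generate_cues` and `generate_image`, are hypothetical stand-ins for the style-cue model and the fine-tuned pipeline.

```python
from typing import Any, Callable, Tuple

STYLE_SEPARATOR = ","  # full-width comma, as in the example prompts


def extend_prompt(prompt: str, generate_cues: Callable[[str], str]) -> str:
    """Append generated style cues to a short Chinese prompt."""
    # Strip whitespace and any leading comma the generator may emit.
    cues = generate_cues(prompt).strip().lstrip(",,").strip()
    if not cues:
        return prompt  # nothing generated: fall back to the bare prompt
    return f"{prompt}{STYLE_SEPARATOR}{cues}"


def extend_then_generate(
    prompt: str,
    generate_cues: Callable[[str], str],
    generate_image: Callable[[str], Any],
) -> Tuple[str, Any]:
    """Extend the prompt with style cues, then render the extended prompt."""
    extended = extend_prompt(prompt, generate_cues)
    return extended, generate_image(extended)
```

Passing the models in as callables keeps the sketch independent of how each one is loaded; in practice `generate_image` could wrap a call such as `pipe(text, guidance_scale=7.5).images[0]`.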
The fine-tuned svjack/Stable-Diffusion-FineTuned-zh-vx models V0, V1 and V2 were trained for 10000, 28000 and 56000 steps respectively on svjack/diffusiondb_random_10k_zh_v1. V1 outperforms the others in imagination and in sensitivity to style cues. Taiyi and V0 seem less imaginative; V2 seems insensitive to style cues and always produces a richly styled output. The prompts in the even rows of the table above are style cues generated by the MT5 model; the style cues work for V0 and V1 but have no effect on Taiyi and V2. Using svjack/Stable-Diffusion-FineTuned-zh-v0 or svjack/Stable-Diffusion-FineTuned-zh-v1 together with the MT5 model gives outputs that are both imaginative and sensitive to style cues.
Sometimes style cues matter for sample migration fine-tuning. Below is an example. Using svjack/Stable-Diffusion-FineTuned-zh-v1 to generate an image of "护国公克伦威尔" (Lord Protector Cromwell): without style cues, the output looks like a supernatural being; with the style cues generated by the MT5 model, "的肖像,由,和,制作,在艺术站上趋势", the output is relatively good.
| Prompt | 护国公克伦威尔 | 护国公克伦威尔,的肖像,由,和,制作,在艺术站上趋势 |
|---|---|---|
| 护国公克伦威尔 | (generated image not shown) | (generated image not shown) |
svjack - svjackbt@gmail.com - ehangzhou@outlook.com
Project Link: https://github.com/svjack/Stable-Diffusion-Chinese-Extend