Dreambooth

Personalized Diffusion Model

This repo uses Dreambooth to teach a diffusion model my face and generate images of me from text prompts. I fine-tune the stable-diffusion-xl model from Huggingface (over 10GB in size) on a single Turing T4 GPU (16GB) on Google Colab using LoRA and Accelerate from Huggingface. The repo also explores merging different LoRA adapters to combine styles.

Motivation

  • Can Low Rank Adapters work well for training Dreambooth?
  • Dreambooth works well on pictures of objects, can it learn to represent human faces well?
  • How many images do I need to teach the model about myself?
  • What is Prior Preservation?
  • How will a model recognize me from the text prompt?
  • Can we merge different Adapters to learn different styles?
  • What are some difficulties when it comes to training on human faces and how can we offset them?
  • On what text prompts does the model do well and when does it mess up?


Project Structure

  • The data directory contains 6 high-resolution images of me. It also contains prior.zip, which holds 197 images of human faces (excluding my own); these are used to train the model with prior preservation.
  • The Train_Raj.ipynb notebook trains the model with and without Prior Preservation.
  • The dream_booth.py script contains the model and the code to train it. It is a simplified adaptation of this script from Huggingface.
  • The Dreambooth_Qualitative_Inference.ipynb notebook contains a comprehensive and structured qualitative evaluation of the models trained with and without Prior Preservation. It contains all the images generated from text prompts after training.
  • The Dreambooth_Quantitaive_Inference.ipynb notebook contains quantitative evaluation metrics.
  • The eval.py script is the official evaluation script from the original Dreambooth repo: Google Research - Dreambooth

Dataset

The data contains 6 high-resolution images of me. For Dreambooth, it is important that these images cover different angles and clearly display the face. In my experiments, 5-6 images are enough to train stable-diffusion-xl (SDXL) with LoRA. For prior preservation, we also use 197 images of other human faces to increase diversity and reduce language drift. These images are generated by the same diffusion model itself.
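A minimal sketch of how such class-prior images can be generated with the base model itself (the exact class prompt, step count, and output path are assumptions; the repo ships the pre-generated images in prior.zip):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the same base model that will later be fine-tuned.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

class_prompt = "a photo of a person"   # class prior only, no rare token
num_class_images = 197                 # matches the size of prior.zip

for i in range(num_class_images):
    image = pipe(class_prompt, num_inference_steps=30).images[0]
    image.save(f"prior/{i:04d}.png")
```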

Prior-Preservation

Fine-tuning layers that are conditioned on the text embeddings gives rise to the problem of language drift, where a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language. This phenomenon also affects diffusion models, where the model slowly forgets how to generate subjects of the same class as the target subject.

Another problem is the possibility of reduced output diversity. Text-to-image diffusion models naturally possess high amounts of output diversity. When fine-tuning on a small set of images, we would like to be able to generate the subject in novel viewpoints, poses and articulations. Yet, there is a risk of reducing the amount of variability in the output poses and views of the subject. To mitigate the two aforementioned issues, the paper proposes an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. The method is to supervise the model with its own generated samples, so that it retains the prior once the few-shot fine-tuning begins. This allows it to generate diverse images of the class prior, as well as retain knowledge about the class prior that it can use in conjunction with knowledge about the subject instance.
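In code, prior preservation amounts to computing the usual denoising loss on both the instance batch and a batch of class-prior images and adding them with a weight. A minimal sketch in the spirit of the Huggingface training script (the batch layout and `prior_loss_weight` default are assumptions):

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(model_pred, target, prior_loss_weight=1.0):
    """Instance + prior-preservation loss, assuming the batch stacks
    instance examples first and class (prior) examples second."""
    model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
    target, target_prior = torch.chunk(target, 2, dim=0)

    # Standard denoising loss on the subject images.
    instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
    # Same loss on the model's own generated class images, to preserve the prior.
    prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")
    return instance_loss + prior_loss_weight * prior_loss
```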

image

Training

To accommodate such a large model on a 16GB Turing T4 GPU, I make use of gradient accumulation, gradient checkpointing, and 8-bit fused Adam (instead of regular Adam). Training on 6 images for 1000 steps is conducted with and without the prior preservation loss to verify that prior preservation actually helps.
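A minimal sketch of this memory-saving setup (hyperparameters and names such as `unet`, `lora_parameters`, and `train_dataloader` are assumptions; the actual configuration lives in dream_booth.py):

```python
import bitsandbytes as bnb
from accelerate import Accelerator

# Accumulate gradients over several small batches and train in fp16.
accelerator = Accelerator(gradient_accumulation_steps=4, mixed_precision="fp16")

# Recompute activations during the backward pass instead of storing them.
unet.enable_gradient_checkpointing()

# 8-bit Adam stores optimizer state in 8 bits, greatly reducing its memory footprint.
optimizer = bnb.optim.AdamW8bit(lora_parameters, lr=1e-4)

unet, optimizer, train_dataloader = accelerator.prepare(unet, optimizer, train_dataloader)

for batch in train_dataloader:
    with accelerator.accumulate(unet):
        loss = ...  # denoising (+ prior preservation) loss from the section above
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```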

Training Prompt

In order to teach the model a mapping between text and a subject, Dreambooth proposes using a rare token from the model's vocabulary and combining it with the class prior. For instance, to train on my face, I use the prompt

A photo of rraj person

Here rraj is the rare vocabulary token and person is the class prior for the subject.
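In the training script this shows up as two prompts, one for the subject and one for the class-prior images (the exact wording of the class prompt is an assumption):

```python
instance_prompt = "a photo of rraj person"  # rare token "rraj" + class prior "person"
class_prompt = "a photo of a person"        # class prior only, used for prior preservation
```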

Results

It turns out that LoRA + Dreambooth with 1000 steps works decently well on human faces too. Prior preservation clearly improves the model (as seen in the images below). For me, the PNDM scheduler works well with just 50 timesteps and DDIM with 80 timesteps.
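A minimal inference sketch with the trained adapter and a scheduler swap (the LoRA path is a placeholder; the scheduler choice and step counts reflect the observations above):

```python
import torch
from diffusers import StableDiffusionXLPipeline, PNDMScheduler, DDIMScheduler

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/rraj_lora")  # Dreambooth LoRA adapter trained above

# PNDM worked well at 50 steps; DDIM needed around 80.
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
image = pipe("a painting of rraj person at Oktoberfest", num_inference_steps=50).images[0]
image.save("oktoberfest.png")
```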

prompt = "a painting of rraj person at Oktoberfest"

Without Prior Preservation

image

With Prior Preservation

image

Art Renditions

prompt = "a painting of rraj person in the style of Van Gogh"

image

Property Modification

prompt = "a painting of rraj person with blonde hair"

image

Novel-View Synthesis

prompt = "a side view photo of rraj person"

image

Accessorization

prompt = "a of rraj person with sunglasses"

image

CLIP-I Score

CLIP-I is the average pairwise cosine similarity between CLIP embeddings of generated and real images.
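A minimal sketch of how this metric can be computed with the Huggingface CLIP implementation (the checkpoint choice is an assumption; the repo's eval.py uses the official Dreambooth evaluation script):

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_i(generated_images, real_images):
    """Mean pairwise cosine similarity between CLIP image embeddings."""
    gen = processor(images=generated_images, return_tensors="pt")
    real = processor(images=real_images, return_tensors="pt")
    gen_emb = model.get_image_features(**gen)
    real_emb = model.get_image_features(**real)
    gen_emb = gen_emb / gen_emb.norm(dim=-1, keepdim=True)
    real_emb = real_emb / real_emb.norm(dim=-1, keepdim=True)
    return (gen_emb @ real_emb.T).mean().item()
```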

| Scheduler | Steps | Prior Preservation | CLIP-I |
|-----------|-------|--------------------|--------|
| DDIM      | 50    | No                 | 0.9580 |
| DDIM      | 50    | Yes                | 0.9760 |
| DDIM      | 80    | No                 | 0.9663 |
| DDIM      | 80    | Yes                | 0.9683 |
| PNDM      | 50    | No                 | 0.9761 |
| PNDM      | 50    | Yes                | 0.9702 |
| PNDM      | 80    | No                 | 0.9751 |
| PNDM      | 80    | Yes                | 0.9688 |

Merging Adapters

I experiment with generating images of myself in pixel-art style by merging two LoRA adapters: the Dreambooth adapter trained above and a Pixel Art style adapter.
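A sketch of merging the two adapters, assuming a recent diffusers version with the PEFT backend (adapter names, paths, and weights here are placeholders):

```python
# Load both LoRA adapters under distinct names, then activate them together.
pipe.load_lora_weights("path/to/rraj_lora", adapter_name="rraj")
pipe.load_lora_weights("path/to/pixel_art_lora", adapter_name="pixel")
pipe.set_adapters(["rraj", "pixel"], adapter_weights=[1.0, 1.0])

image = pipe("pixel, a photo of rraj person wearing sunglasses",
             num_inference_steps=50).images[0]
```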

prompt = "pixel, a photo of rraj person wearing sunglasses"

image

Limitations

Generating faces is tough: sometimes eyes and teeth are not rendered properly or are mismatched. For instance, below, my eyes are rendered green although they are black in the training images.

image

Compute Limits

The GPU memory did not allow me to fine-tune the text encoder (SDXL has two text encoders). Fine-tuning the text encoders is known to further improve image generation quality.

References

[1] Huggingface Blog

[2] “DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation”

[3] Dreambooth Diffusers Training Script