Dreambooth improves text-to-image diffusion models for subject-driven generation of objects or people.
The Dreambooth artificial intelligence research was carried out by researchers at Google:
- Nataniel Ruiz
- Yuanzhen Li
- Varun Jampani
- Yael Pritch
- Michael Rubinstein
- Kfir Aberman
Dreambooth is like a photo booth, except that once the object or person has been captured, it can synthesize them wherever your imagination takes you...
Text-to-image models have made a remarkable leap in the evolution of artificial intelligence, enabling diverse and high-quality image synthesis from a given text.
However, these models lack the ability to mimic the appearance of subjects in a given reference set and to synthesize new renditions of them in different contexts.
In this work, Dreambooth presents a new approach to "customizing" text-to-image diffusion models (specializing them to user needs).
Given just a few images of a subject as input, we fine-tune a pre-trained text-to-image model so that it learns to bind a unique identifier to that subject.
Once the subject is embedded in the model's output domain, the unique identifier can then be used to synthesize photorealistic images of the subject contextualized in different scenes.
By leveraging the semantic prior built into the model together with a new autogenous class-specific prior preservation loss, Dreambooth allows the subject to be synthesized in various scenes, poses, views, and lighting conditions that do not appear in the reference images. We apply this technique to several previously intractable tasks, including subject re-contextualization, text-guided view synthesis, appearance modification, and artistic rendering (while preserving key subject characteristics).
Take the clock in the example: today it is very difficult to generate it in different contexts with state-of-the-art text-to-image models while maintaining high fidelity to its key visual characteristics.
Even with dozens of iterations over a text prompt containing a detailed description of the clock's appearance ("retro-style yellow alarm clock with a white dial and a yellow number three on the right side of the dial in the jungle"), the Imagen model fails to reconstruct its main visual characteristics (third column).
Moreover, even models whose text embedding lies in a shared language-vision space and that can create semantic variations of the image, such as DALL-E 2, can neither reconstruct the appearance of the given subject nor alter its context (second column).
In contrast, Dreambooth (right) can synthesize the clock with high fidelity and in new contexts.
Dreambooth takes as input a few images of a subject (e.g., a specific dog) and the corresponding class name (e.g., "dog"), and returns a "customized" text-to-image model that encodes a unique identifier referring to the subject. Then, during inference, we can embed the unique identifier in different sentences to synthesize the subject in different contexts.
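This prompting scheme can be sketched in plain Python. The rare identifier token `sks` and the helper functions below are illustrative assumptions for the sake of the example, not part of any official Dreambooth API:

```python
# Illustrative sketch of Dreambooth-style prompt construction.
# "sks" is an assumed rare token bound to the subject during fine-tuning;
# the helper names are ours, not from the paper.

UNIQUE_ID = "sks"   # unique identifier that refers to the subject
CLASS_NAME = "dog"  # coarse class the subject belongs to

def instance_prompt(context: str) -> str:
    """Prompt used at inference to place the subject in a new context."""
    return f"a photo of a {UNIQUE_ID} {CLASS_NAME} {context}"

def class_prompt() -> str:
    """Prompt with the class name alone, used for prior preservation."""
    return f"a photo of a {CLASS_NAME}"

print(instance_prompt("swimming"))  # a photo of a sks dog swimming
print(class_prompt())               # a photo of a dog
```

At inference time, swapping the `context` string re-contextualizes the same subject without retraining.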
Given a subject's images, Dreambooth fine-tunes a text-to-image diffusion model in two steps:
- Fine-tune the low-resolution text-to-image model on the input images paired with a text prompt containing a unique identifier and the name of the class the subject belongs to (e.g., "A picture of a dog [T]"). In parallel, we apply a class-specific prior preservation loss, which exploits the semantic prior the model has on the class and encourages it to keep generating diverse instances of the subject's class, by injecting the class name alone into the text prompt (e.g., "A picture of a dog").
- Fine-tune the super-resolution components with low-resolution and high-resolution image pairs taken from our input image set, which lets the model maintain high fidelity to small subject details.
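The first step can be sketched as a weighted sum of two reconstruction errors: one on the subject's images and one on class images generated by the original model. This is a minimal numeric sketch under that assumption; real Dreambooth training operates on diffusion-model noise predictions, not raw pixel vectors, and `lambda_prior` is an assumed name for the weighting hyperparameter:

```python
# Minimal sketch of a class-specific prior preservation objective:
# instance reconstruction loss plus a weighted class-prior term.
# Operates on flat number lists standing in for images.

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dreambooth_loss(pred_instance, target_instance,
                    pred_class, target_class, lambda_prior=1.0):
    """Subject term keeps fidelity to the input images; the prior term
    keeps the model able to generate generic class instances."""
    return (mse(pred_instance, target_instance)
            + lambda_prior * mse(pred_class, target_class))

loss = dreambooth_loss([0.2, 0.5, 0.9], [0.0, 0.5, 1.0],   # subject pair
                       [0.4, 0.4, 0.4], [0.5, 0.5, 0.5])   # class pair
```

Setting `lambda_prior` to zero recovers naive fine-tuning, which tends to overfit the few subject images and "forget" how to draw the rest of the class.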
Here are some example results with Dreambooth for the re-contextualization of subject instances: glasses, a bag, and a vase.
By refining a model using our method, we are able to generate different images of a subject instance in different environments, with high preservation of subject detail and realistic interaction between scene and subject.
We display the conditioning prompts below each image.
Dreambooth can represent a subject (dog for example) in the style of famous painters.
Interestingly, some renderings appear to have novel compositions while closely mimicking the painter's style, even suggesting a degree of creativity (extrapolation given prior knowledge).
Text-guided view synthesis
Dreambooth can synthesize images of a subject from specified viewpoints (from left to right: top, bottom, side, and back views). Note that the generated poses differ from the input poses and that the background changes realistically as the pose changes. Note also the preservation of the intricate fur pattern on the cat's forehead.
Here you can see color changes, as well as crosses between a specific dog and different animal species. Dreambooth preserves the unique visual characteristics that give the subject its identity or essence while performing the required property change.
Equipping a dog with accessories: the identity of the subject is preserved while many different outfits or accessories are applied to it. Dreambooth supports a wide variety of such options.
Dreambooth aims to provide users with an efficient tool to synthesize personal subjects (animals, objects, people) in different contexts. While general text-to-image models can be biased towards specific attributes when synthesizing images from text, Dreambooth allows the user to achieve a better reconstruction of their subjects.
On the other hand, malicious parties might try to use such synthesized images to mislead viewers. This is a common problem that also exists in other generative modeling approaches and content manipulation techniques.
Future research on generative modeling, and specifically on personalized generative priors, should continue to investigate and reevaluate these concerns.