June 25, 2024 • 6 min read

Generative AI: Learn How to Control Image Generation with Stable Diffusion

Written by Simon Playe

Have you ever used ChatGPT to create images and been disappointed by the results? If you've felt this frustration or are simply interested in exploring more effective image-generation tools, Stable Diffusion could be the right solution.

With the popularity of ChatGPT, many are excited to uncover what generative AI can do. It's not just for creating text; this technology can also make images, videos, and even music. Yet sometimes, it's hard to get the pictures you want from a simple text prompt.

This article will look at Stable Diffusion, a tool built for generating images. Here, I'll show you how to use an API to control how images are made with Stable Diffusion, covering options for both non-coders and developers. The methods presented in this article include:

  • Directing the diffusion process towards a specific image (IP-Adapter).
  • Retaining certain characteristics within an image through the diffusion process (ControlNet).
  • Extracting features from an initial image (Image-to-Image).
  • Training a specific model with a limited set of images for targeted adjustments (LoRA, model fine-tuning).
  • Adding new weights to the cross-attention layers to influence image characteristics (LoRA).

Stable Diffusion

What is Stable Diffusion?

Stable Diffusion is a diffusion model specifically designed for image generation. It starts with an initial pattern of random noise and systematically refines or "denoises" this noise to produce images that closely resemble real-life pictures. The model guides this transformation by applying certain conditions (e.g. a text prompt) during the denoising process.

The diffusion (denoising) process in Stable Diffusion
source: https://en.wikipedia.org/wiki/Diffusion_model#/media/File:X-Y_plot_of_algorithmically-generated_AI_art_of_European-style_castle_in_Japan_demonstrating_DDIM_diffusion_steps.png

However, unlike traditional methods that rely solely on text prompts for conditioning, Stable Diffusion offers more flexible and sophisticated controls:

  • Image-based Conditioning: Through features like the IP-Adapter and ControlNet, Stable Diffusion can use an existing image to steer the generation, enabling modifications or enhancements based on that image.
  • Image-to-Image Transformation: This feature lets the model start not from scratch but from an existing image, which it then alters into a new creation.
  • Style Control: Users can select models that generate images in specific artistic styles, or even train their models to produce custom styles.
  • Hybrid Conditioning: It allows the combination of text prompts, images and different models to condition the generation, offering unprecedented control over the output.

To experiment with these various techniques of controlling image generation, ModelsLab provides a convenient Stable Diffusion API. It is available in two formats:

  • Playground Version: A user-friendly interface designed for experimentation and exploration without needing extensive technical knowledge.
  • Developer API: Offers more detailed control and customization, suitable for developers looking to integrate these capabilities into their applications.

In the following sections, I will delve into each technique, demonstrating how to leverage the ModelsLab platform to realize your creative vision.

Stable Diffusion text-to-image

Before diving into the diverse tools available for controlling image generation, let us begin with an introduction to the Stable Diffusion text-to-image API. This API, akin to Dall-E, facilitates image generation from textual prompts.

For developers, there are two key endpoints: text2img and realtime-stable-diffusion. The text2img endpoint is the main choice for image generation, accessible via both the playground and the developer API. It lets users create images with community-trained diffusion models, available as Stable Diffusion and Stable Diffusion XL. The XL version provides more precise images but requires more time to generate.

In the playground, users can customize their image generation by setting different parameters such as:

  • Negative Prompt: Specify what you don't want to appear in the image.
  • Guidance Scale: Set how much the prompt influences the denoising.
  • Steps: Decide how many steps the generation should take, affecting image detail.

The developer API also offers additional options, like:

  • enhance_style: Choose a specific style for the image.
  • highres_fix: Create high-resolution images.

Moreover, the realtime-stable-diffusion endpoint has fewer customization options and doesn't allow choosing a diffusion model, but it's faster at generating images.
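
To make this concrete, here is a minimal Python sketch of a text2img request using the parameters described above. The endpoint path, the model_id value, and the exact field names beyond those listed above are assumptions on my part, based on the general shape of the ModelsLab API; check the official documentation for the exact schema before using it.

```python
import requests

# Minimal text2img sketch. The endpoint path and model_id are assumptions;
# check the ModelsLab documentation for the exact schema.
API_URL = "https://modelslab.com/api/v6/images/text2img"  # assumed path

payload = {
    "key": "YOUR_API_KEY",                                  # your ModelsLab API key
    "model_id": "sdxl",                                     # hypothetical id: Stable Diffusion vs. SDXL
    "prompt": "A cat dancing on a table with a Christmas hat",
    "negative_prompt": "blurry, low quality",               # what should NOT appear in the image
    "guidance_scale": 7.5,                                  # how much the prompt influences denoising
    "num_inference_steps": 30,                              # number of denoising steps (more = more detail)
    "width": 768,
    "height": 768,
    "samples": 1,                                           # number of images to generate
}

response = requests.post(API_URL, json=payload, timeout=120)
print(response.json())                                      # the response typically contains the image URL(s)
```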

Here's a quick example of how text2img works.

text-to-image
A cat dancing on a table with a Christmas hat

Controlling Image Generation with Stable Diffusion

In this section, we'll explore three advanced tools provided by Stable Diffusion that offer alternative ways to guide image generation, beyond the conventional text prompt:

  • IP-Adapter
  • ControlNet
  • Image-to-image

IP-Adapter

Imagine guiding image generation not by giving textual instructions but by using a target image instead. That’s exactly what IP-Adapter does: it lets you steer the generation process with an image, much like a text prompt would. Note, however, that IP-Adapter must still be combined with a text prompt (or an initial image, as we will see below).

However, the IP-Adapter is only available through the developer API on the img2img endpoint. In this setting, you can adjust parameters like:

  • ip_adapter_id, which sets how the image is encoded,
  • ip_adapter_scale, which affects how much the IP-Adapter image influences the denoising,
  • ip_adapter_image, which is the URL of the IP-Adapter image.
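
Putting these parameters together, here is a minimal sketch of an img2img request with IP-Adapter guidance. The ip_adapter_* names are those listed above; the endpoint path, the init_image field, and the example values are assumptions, so double-check them against the ModelsLab documentation.

```python
import requests

# img2img request with IP-Adapter guidance (developer API only).
# The endpoint path, init_image field, and example values are assumptions.
API_URL = "https://modelslab.com/api/v6/images/img2img"  # assumed path

payload = {
    "key": "YOUR_API_KEY",
    "prompt": "a rose and cute castle",                       # IP-Adapter still needs a prompt or an initial image
    "init_image": "https://example.com/initial.png",          # assumed field: the initial image
    "ip_adapter_id": "ip-adapter_sdxl",                       # hypothetical id: sets how the image is encoded
    "ip_adapter_scale": 0.6,                                  # how much the IP-Adapter image influences denoising
    "ip_adapter_image": "https://example.com/reference.png",  # URL of the guiding (IP-Adapter) image
    "num_inference_steps": 30,
    "guidance_scale": 7.5,
}

response = requests.post(API_URL, json=payload, timeout=120)
print(response.json())
```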

IP-Adapter with a text prompt

ip-adapter stable diffusion
Create a rose and cute castle with IP-Adapter

ControlNet

ControlNet functions like IP-Adapter but with a nuanced approach. Unlike IP-Adapter, which influences image generation based on the whole image, ControlNet targets specific features within the image to guide the generation process towards these details.

Several ControlNet types exist, each allowing you to extract specific features from an image:

  • softedge: finds edges in images
  • canny: accurately delineates boundaries in controlled environments
  • openpose: detects human body, hand, facial, and foot keypoints
  • and more

You can also integrate multiple ControlNet models by listing several controlnet_model parameters, separated by commas, such as "canny, softedge".

ControlNet's adaptability means it can work alongside text prompts and IP-Adapter images. It's user-friendly for non-developers in the playground, while developers can tweak additional settings, such as:

  • controlnet_type: This sets one of the various accepted types of ControlNet models, like canny, depth, hed, etc. You can find the full list of model types available here.
  • controlnet_model: This specifies the specific ControlNet model being used, which could be a default or a community model. In the case of a default model, 'controlnet_model' corresponds directly to 'controlnet_type'.
  • controlnet_conditioning_scale: This determines how significantly ControlNet influences the denoising process.
  • control_image (optional): This is the image from which features will be extracted. If not specified, and an initial image is provided, that image will be used.
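
As a rough illustration, here is what a ControlNet request could look like with these settings. The controlnet_* parameter names are those described above; the endpoint path, the init_image field, and the example values are assumptions, so verify them against the ModelsLab documentation.

```python
import requests

# ControlNet request sketch. Endpoint path and init_image field are assumptions;
# the controlnet_* parameters are the ones described above.
API_URL = "https://modelslab.com/api/v6/images/controlnet"  # assumed path

payload = {
    "key": "YOUR_API_KEY",
    "prompt": "a colorful bird, digital painting",     # optional text prompt
    "controlnet_type": "canny",                        # which feature extractor to use
    "controlnet_model": "canny",                       # default model matches the type; "canny, softedge" combines two
    "controlnet_conditioning_scale": 0.8,              # how strongly ControlNet influences denoising
    "init_image": "https://example.com/bird.png",      # assumed field: the initial image
    "control_image": "https://example.com/bird.png",   # optional: image to extract features from
    "num_inference_steps": 30,
}

response = requests.post(API_URL, json=payload, timeout=120)
print(response.json())
```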

ControlNet without a text prompt and without an IP-Adapter image

controlnet
Transforming a bird picture with a softedge ControlNet

ControlNet with a text prompt and without an IP-Adapter image

controlnet
Transforming a bird picture with a prompt and a canny ControlNet

Image-to-Image Generation

Finally, instead of steering the generation by conditioning the output, why not start the generation from an image that already shares characteristics with your final image? That’s how the Stable Diffusion img2img API works. Rather than starting from pure random noise, it adds noise to the initial image and denoises from there. It is designed to capture general features of the initial image, such as its colors and composition.

Image-to-image can be combined with a text prompt, ControlNet and/or IP-Adapter.

Because img2img generation operates similarly to text-to-image generation, both APIs share nearly the same features. Both community models and realtime-stable-diffusion can be applied to either generation type.
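
For completeness, here is a minimal img2img sketch without IP-Adapter or ControlNet. The endpoint path and the strength field are assumptions; the rest mirrors the text2img example above.

```python
import requests

# Plain img2img sketch: generation starts from a noised version of init_image
# instead of pure noise. Endpoint path and "strength" field are assumptions.
API_URL = "https://modelslab.com/api/v6/images/img2img"  # assumed path

payload = {
    "key": "YOUR_API_KEY",
    "prompt": "a plane flying over the sea",          # optional text prompt
    "init_image": "https://example.com/bird.png",     # its colors and composition will be preserved
    "strength": 0.7,                                  # assumed field: how far to move away from the initial image
    "num_inference_steps": 30,
    "guidance_scale": 7.5,
}

response = requests.post(API_URL, json=payload, timeout=120)
print(response.json())
```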

Image generation with a text prompt and without IP-Adapter

image-to-image
A plane inspired by a bird picture

Image generation without a text prompt and without IP-Adapter

image-to-image
source: https://tinyurl.com/2eyf5ky5

Conclusion on Controlling Image Generation

In conclusion, traditional image generators like Dall-E rely solely on text prompts for control, but Stable Diffusion introduces IP-Adapter, ControlNet, and Image-to-Image as powerful tools to diversify and enhance image generation. Users can employ these tools independently or in combination, providing extensive flexibility and creativity in generating images, with or without text prompts.

Picking and training specific models

So far, we have talked about adding conditions to control image generation. But why not modify the generation process itself? Under the hood of image generation lies a diffusion model, and that model can be adapted to generate more specific images. Here I will present two ways to enhance the generation process: by selecting or fine-tuning either a diffusion model or a LoRA. Both approaches can be combined with the tools presented above.

Diffusion Models

As mentioned above, Stable Diffusion works with a diffusion model: a model trained to generate images. To do so, the model is trained on a set of images it aims to replicate. For instance, if you want a model that generates dog pictures, you train a diffusion model on a large set of dog pictures.

The power of Stable Diffusion is that it gives you easy access to models fine-tuned by the community. Fine-tuning is the process of partially re-training the standard Stable Diffusion model on your own set of images. Instead of using the standard model, which generates generic images, you can pick a model that generates pixel-art images:

pixel diffusion
A pixelated fish using model Chibi Pixel Art Style

Or to generate cartoon images:

cartoon diffusion
Cartoon Image using model Cartoon Backgrounds

Model selection is available in both the playground and the developer API.
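
For developers, picking a community model simply means changing the model_id field of a text2img request. Here is a minimal sketch; both the endpoint path and the model id are placeholders, so look up the real id of the model you want on ModelsLab.

```python
import requests

# Selecting a community diffusion model via model_id. The id below is a
# hypothetical placeholder for a model such as "Chibi Pixel Art Style".
API_URL = "https://modelslab.com/api/v6/images/text2img"  # assumed path

payload = {
    "key": "YOUR_API_KEY",
    "model_id": "chibi-pixel-art-style",   # hypothetical community model id
    "prompt": "a pixelated fish",
    "num_inference_steps": 30,
}

print(requests.post(API_URL, json=payload, timeout=120).json())
```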

Lastly, if no model fits your taste, you can also fine-tune your own model. According to the documentation, fine-tuning only requires 7-8 images. However, this feature is only available through the developer API.

LoRA

Finally, LoRA is the last tool I will present for gaining more control over image generation. LoRAs are compact versions of Stable Diffusion models, typically 10 to 100 times smaller. LoRAs add modifications on top of the diffusion model: they fine-tune the standard models, subtly altering specific styles or the overall appearance of the generated images. The advantage of LoRAs lies in their low computational needs and quick training times.

Just as with diffusion models, you can easily pick LoRAs trained by the community. But, unlike diffusion models and similarly to ControlNet, you can use multiple LoRAs at once by separating them with commas. LoRA selection is available in both the playground and the developer API.

Additionally, training your own LoRA is straightforward using the developer API—it typically requires just 7 to 8 images.

You can find two new settings for LoRAs in both the playground and developer API:

  • lora_model: Specifies which LoRA model to use.
  • lora_strength: Determines the extent of the LoRA's impact during denoising.
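
In practice, adding a LoRA is just two extra fields on top of a text2img (or img2img) request. Here is a minimal sketch with hypothetical LoRA ids; the endpoint path is again an assumption to verify against the ModelsLab documentation.

```python
import requests

# text2img request with LoRA settings. The lora_model ids are hypothetical
# placeholders; multiple LoRAs can be combined with commas.
API_URL = "https://modelslab.com/api/v6/images/text2img"  # assumed path

payload = {
    "key": "YOUR_API_KEY",
    "model_id": "sdxl",                              # hypothetical base model id
    "prompt": "a portrait of a princess in a garden",
    "lora_model": "princess, kids-illustration",     # which LoRA model(s) to use
    "lora_strength": 0.8,                            # extent of the LoRA's impact during denoising
    "num_inference_steps": 30,
}

print(requests.post(API_URL, json=payload, timeout=120).json())
```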

Here's an example of a LoRA “Princess”:

lora
LoRA: Princess

And here “Kid Illustrations”:

lora
LoRA: kids_illustration

Conclusion

In summary, Stable Diffusion provides several ways to control image generation beyond just using prompts.

As a final note, integrating the various components of Stable Diffusion into a seamless workflow can be complex. For those looking to create detailed images with greater control in an easy-to-use format, ComfyUI offers an ideal solution.

comfyui
source: https://tinyurl.com/2wuudfat

ComfyUI enables you to design pipelines for image generation using Stable Diffusion. For more details on its capabilities, you can refer to the documentation. With the appropriate tools and knowledge, the potential of generative AI is limitless—enjoy creating!

Are you looking for Image Generation Experts? Don't hesitate to contact us!

This article was written by

Simon Playe