Introduction
Humans are creative beings, and ideas come to us constantly. Over the past few centuries, artists have put their skills to work by sketching on paper, painting on canvas, carving on walls, and so on. But can we make a machine do all of this, and in a few seconds at that? Let's examine this question in this blog post.
Imagine you are a novice who doesn't know how to draw. Even if you did, it would take hours to complete a sketch or a design. A sea of ideas is flooding your brain, so you take pen and paper, or open a text editor on your computer, and jot them down. Then you make the description more structured. For example, you might come up with something like this:
‘Create an image of an astronaut, garbed in a pristine white space suit and shimmering visor, standing on an otherworldly landscape characterised by towering exotic plants and a surreal, multicoloured sky. The astronaut is holding out a large, glistening block of ice with frosty vapours emanating from it, offering it as if it were a precious gemstone. Directly in front of him is an alien, a peculiar, yet friendly creature devoid of clothes, displaying hues of pastel green with pearlescent skin that shimmers under the alien sun. The alien's eyes are wide with intrigue and anticipation, tentacle-like appendages reaching out towards the block of ice in an almost reverent manner, capturing an odd and comedic cosmic trade.’
Now give the above description as input to a text-to-image AI model, and the following image will be created in seconds:
Isn't it amazing how a machine generated such a high-quality image just from our thoughts expressed as text? It does not even have watermarks. Images like this can be used for high-quality content in marketing, advertising, and much more. The applications are countless.
We will walk through how such AI models work, what is actually happening under the hood and how they can be implemented.
A Brief History
- On 10 June 2014, the Generative Adversarial Network (GAN) paper was released, describing a machine learning framework in which two adversarial neural networks compete to generate images.
- On 1 July 2015, Google released DeepDream, a computer program characterised by its psychedelic visuals. It was one of the first widely seen demonstrations of how neural networks recognise and generate patterns in images.
- June 2016 saw style transfer: a deep neural network that can separate the content and style of an image and recombine them across different images.
- Fast forward to 2021, and we have Latent Diffusion, a text-to-image model from the CompVis group.
- In 2022 came an influx of models such as DALL·E 2 and various others.
In this article, we will talk about how Stable Diffusion, released on 22 August 2022, can be used to create images from text prompts.
Stable Diffusion
Stable Diffusion is a text-to-image latent diffusion model created by the researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512x512 images from a subset of the LAION-5B database. LAION-5B is the largest, freely accessible multi-modal dataset that currently exists.
What Is a Latent Diffusion Model?
A latent diffusion model is a type of diffusion model in which the model learns the distribution of patterns in an image by sequentially adding Gaussian noise to it.
[Figure: a Gaussian distribution used as the noise source, generated using Plotly Studio]
The above is a graphical representation of Gaussian noise. Each time it is applied to an image vector, it adds a small amount of noise; this is repeated at every step until the whole image vector is filled with noise.
Think of it like this: you add ink to a glass of water, drop by drop, and once the ink has fully diffused it becomes hard to distinguish the water from the ink. In other words, the water has become completely noisy.
To generate images, this process is reversed: a model learns to predict the noise that was added, and reverse diffusion subtracts that predicted noise at each step.
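To make the forward process concrete, here is a toy NumPy sketch; the noise schedule values are illustrative assumptions, not the ones Stable Diffusion actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(512 * 512 * 3,))  # a flattened stand-in for an RGB image
betas = np.linspace(1e-4, 0.02, 1000)                  # per-step noise amounts (assumed schedule)
alphas_bar = np.cumprod(1.0 - betas)                   # cumulative fraction of signal kept after t steps

def add_noise(x0, t):
    """Sample the noisy vector at step t: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

for t in (0, 250, 500, 999):
    xt = add_noise(image, t)
    print(f"step {t:4d}: fraction of original signal remaining = {np.sqrt(alphas_bar[t]):.3f}")
```

Reverse diffusion trains a neural network to predict the noise term above so that it can be subtracted back out, one step at a time.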
Running forward and backward diffusion directly in image space is very slow and would struggle on a single GPU, because the image space is large. Take a 512x512 image: it has 3 RGB channels, so running a diffusion model over even a single image in pixel space is computationally very expensive.
Instead of working in this high-dimensional image space, latent diffusion compresses images into a latent space, which is much faster to work with. In latent space the dimensionality is low, so we avoid the curse of dimensionality and don't need to reduce dimensions using techniques such as Principal Component Analysis. The point is that similar data points lie closer together in latent space, which makes processing a lot faster.
There are three main components in latent diffusion.
- An autoencoder (VAE).
- A U-Net.
- A text-encoder, e.g. CLIP's Text Encoder.
The Variational Autoencoder
The variational autoencoder uses a probabilistic approach to learn the latent space. This means that the VAE learns a distribution over the latent space rather than a single point, which allows it to generate more realistic and diverse data.
The VAE consists of two neural networks: an encoder and a decoder. The encoder takes the input data and maps it to a distribution over the latent space. The decoder takes a sample from the latent space and maps it back to the original input data. The VAE is trained by minimising the difference between the input data and the decoded data.
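As a rough sketch of this encode/decode round trip, assuming the diffusers library, the Stable Diffusion v1.4 VAE weights, and a hypothetical input file example.png:

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

# Load the VAE used by Stable Diffusion v1.4.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# Prepare a 512x512 RGB image scaled to [-1, 1], shaped (1, 3, 512, 512).
image = Image.open("example.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0)

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()  # compressed (1, 4, 64, 64) latent
    recon = vae.decode(latents).sample            # decoded back to (1, 3, 512, 512)

print(latents.shape, recon.shape)
```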
The U-Net
The U-Net is used in latent diffusion models because it is a very effective architecture for predicting the noise present in a latent representation at each denoising step. It learns to capture the different features encoded in the latents, such as the overall shape of the image, its texture, and its colours, which allows the model to generate images that are more realistic and diverse than those produced by many other methods.
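A quick sketch, again assuming the diffusers library, of what the U-Net consumes and produces at a single denoising step: a noisy 4×64×64 latent, a timestep, and text embeddings go in, and a predicted noise residual of the same shape comes out.

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)        # a noisy latent image representation
timestep = torch.tensor([999])             # which diffusion step we are at
text_embeddings = torch.randn(1, 77, 768)  # placeholder for CLIP text embeddings

with torch.no_grad():
    noise_pred = unet(latents, timestep, encoder_hidden_states=text_embeddings).sample

print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```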
Text Encoder
The text-encoder is responsible for transforming the input prompt into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps a sequence of input tokens to a sequence of latent text-embeddings.
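For instance, with the CLIP checkpoint used by Stable Diffusion v1 (a sketch assuming the transformers library), a prompt becomes a 77×768 embedding tensor:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "an astronaut offering a block of ice to a friendly alien"
tokens = tokenizer(prompt, padding="max_length",
                   max_length=tokenizer.model_max_length,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids)[0]  # last hidden state

print(embeddings.shape)  # torch.Size([1, 77, 768])
```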
Inference
- The Stable Diffusion model takes a latent seed and a text prompt as input.
- The latent seed is used to generate random latent image representations of size 64×64.
- The text prompt is transformed into text embeddings of size 77×768 using CLIP's text encoder.
- The U-Net iteratively denoises the random latent image representations while being conditioned on the text embeddings.
- The output of the U-Net, the noise residual, is used to compute a denoised latent image representation.
- Finally, the denoised latents are decoded back into an image by the VAE decoder, as spelled out in the sketch below.
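The StableDiffusionPipeline used later in this article wraps all of these steps. For readers who want to see them spelled out, here is a condensed sketch of the denoising loop built from individual diffusers components; the scheduler choice, seed, and guidance scale are illustrative assumptions:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler

model_id, device = "CompVis/stable-diffusion-v1-4", "cuda"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = LMSDiscreteScheduler.from_pretrained(model_id, subfolder="scheduler")

prompt = ["an astronaut offering a block of ice to a friendly alien"]
num_inference_steps, guidance_scale = 50, 7.5
generator = torch.manual_seed(42)  # the latent seed

# 1. Text prompt -> 77x768 text embeddings (plus unconditional embeddings
#    for classifier-free guidance).
def encode(texts):
    tokens = tokenizer(texts, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        return text_encoder(tokens.input_ids.to(device))[0]

text_embeddings = torch.cat([encode([""]), encode(prompt)])

# 2. Latent seed -> random 64x64 latent image representation (4 channels).
latents = torch.randn((1, unet.config.in_channels, 64, 64), generator=generator).to(device)
scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.init_noise_sigma

# 3. Iteratively denoise the latents, conditioned on the text embeddings.
for t in scheduler.timesteps:
    latent_input = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    with torch.no_grad():
        noise_pred = unet(latent_input, t, encoder_hidden_states=text_embeddings).sample
    noise_uncond, noise_text = noise_pred.chunk(2)
    noise_pred = noise_uncond + guidance_scale * (noise_text - noise_uncond)
    # 4. The predicted noise residual is used to compute a less noisy latent.
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 5. Decode the final latents into a 512x512 image with the VAE decoder.
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample
# `image` is a (1, 3, 512, 512) tensor in [-1, 1]; rescale to [0, 255] to view it.
```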
Implementing Stable Diffusion
Now that we have understood the nuances of the Stable Diffusion algorithm, let’s see how it can be implemented via code.
We will conduct this experiment on E2E Cloud, which provides a range of advanced GPUs along with free credits to get started. If you haven't yet created an account on E2E Cloud, go ahead and do so on the MyAccount dashboard.
Once that’s done, log in to E2E Networks with your credentials and then follow the steps outlined below.
GPU Node Creation
To begin with, you would need to create a GPU node. This is where you would be training and testing your model.
Node Creation
Click on 'Create Node' on your dashboard.
Under GPU, select Ubuntu 22.04 node.
Select the appropriate node
You can choose cheaper nodes, but your mileage may vary. In this case we have chosen an advanced GPU node.
Then click 'Create'. The node will be created.
Create SSH Keys
Generate your set of SSH keys in your local system using the following command:
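The command below assumes the standard OpenSSH tooling; the email in the comment is a placeholder you can change.

```bash
ssh-keygen -t ed25519 -C "your_email@example.com"
```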
A public and a private key will be generated on your local system. Never share your private key with anyone. Add the public SSH key on E2E Cloud under Settings > SSH Keys > Add New Key.
SSH-ing into the Node
After you have added the key, connect to the node from your local machine via SSH:
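The exact address is shown on your node's dashboard; the command takes this general form:

```bash
# Replace the username and placeholder address with the ones shown for your node.
ssh root@<your-node-public-ip>
```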
Enter the password when prompted to.
It's always good practice to update and upgrade the machine:
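On an Ubuntu 22.04 node, that means:

```bash
sudo apt update && sudo apt upgrade -y
```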
Installation Steps
After this, we install diffusers as well as scipy, ftfy and transformers. accelerate is used to achieve much faster model loading.
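In a terminal on the node (or prefixed with ! in a notebook cell), an unpinned install looks like this:

```bash
pip install diffusers transformers scipy ftfy accelerate
```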
Out:
Stable Diffusion Pipeline
StableDiffusionPipeline is an end-to-end inference pipeline that we can use to generate images from text with just a few lines of code.
First, we load the pre-trained weights of all components of the model. In this case, we use Stable Diffusion version 1.4 (CompVis/stable-diffusion-v1-4):
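A sketch of this loading step; the half-precision setting is an optional assumption to reduce GPU memory use:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,  # optional: half precision to save GPU memory
)
```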
Out:
Next, let’s move the pipeline to GPU to have faster inference.
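Assuming the node exposes a CUDA GPU:

```python
pipe = pipe.to("cuda")
```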
And we are ready to generate images:
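For example, using a shortened version of the astronaut prompt from earlier:

```python
prompt = "an astronaut offering a glistening block of ice to a friendly pastel-green alien"
image = pipe(prompt).images[0]  # a PIL image
image.save("astronaut_alien_trade.png")
```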
Out:
Running the above cell multiple times will give you a different image every time. If you want a deterministic output, you can pass a random seed to the pipeline. Every time you use the same seed, you’ll have the same image result.
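One way to do that (the seed value itself is arbitrary):

```python
import torch

generator = torch.Generator("cuda").manual_seed(1024)  # fixed seed for reproducible output
image = pipe(prompt, generator=generator).images[0]
```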
Out:
We can change the number of inference steps using the num_inference_steps argument. In general, results are better the more steps we use. Stable Diffusion, being one of the latest models, works great with a relatively small number of steps, so we recommend using the default of 50. If you want faster results, you can use a smaller number.
The other parameter in the pipeline call is guidance_scale. It is a way to increase adherence to the conditional signal (here, the text prompt) as well as overall sample quality. In simple terms, classifier-free guidance forces the generation to better match the prompt. Values like 7 or 8.5 give good results; if you use a very large number, the images might look good but will be less diverse.
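Both parameters are passed directly in the pipeline call; the values below are just reasonable starting points:

```python
import torch

generator = torch.Generator("cuda").manual_seed(1024)
image = pipe(
    prompt,
    num_inference_steps=50,  # more steps generally means better quality, but slower
    guidance_scale=7.5,      # higher values follow the prompt more closely, with less diversity
    generator=generator,
).images[0]
```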
To generate multiple images for the same prompt, we simply use a list with the same prompt repeated several times. We’ll send the list to the pipeline instead of the string we used before.
Let's first write a helper function to display a grid of images. Just run the following cell to create the image_grid function, or read through the code if you are interested in how it's done.
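One straightforward way to write it, using PIL and assuming all images share the same size:

```python
from PIL import Image

def image_grid(imgs, rows, cols):
    """Paste `rows * cols` equally sized PIL images into a single grid image."""
    assert len(imgs) == rows * cols
    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
```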
Now we can generate a grid image by running the pipeline with a list of 4 prompts.
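For example, repeating the prompt four times and arranging the results in a single row:

```python
num_images = 4
prompts = [prompt] * num_images           # the same prompt repeated 4 times

images = pipe(prompts).images             # a list of 4 PIL images
grid = image_grid(images, rows=1, cols=4)
grid.save("astronaut_alien_grid.png")
```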
Out:
Closing Thoughts
Now that you have a basic idea of how Stable Diffusion works and how to implement it in code, we can experiment with it to generate our own art.
Follow the steps above to build, deploy, launch and scale your own Text-to-Image platform using E2E Cloud today. If you need further help, do feel free to reach out to sales@e2enetworks.com.