Fine-Tuning Stable Diffusion to Create a Virtual Fashion Designer for Customers

January 29, 2024

Introduction

The world of online shopping is changing quickly, and consumers expect more individualized and engaging experiences. In this blog post, we set out to build a virtual changing room using artificial intelligence (AI). Our goal is to let users upload their own photos and see life-like renderings of themselves in different outfits, providing a fresh and entertaining way for people to experiment with different looks.

Problem Statement

One of the most frequent problems customers in online retail encounter is trying to picture how a dress would appear on them. Our goal is to overcome this difficulty by creating a virtual changing room where users can upload pictures of themselves and see life-like simulations of themselves wearing various outfits. This not only makes online shopping more enjoyable; it also adds some creativity and fun to the process.

What Is Stable Diffusion?

Stable Diffusion is a generative artificial intelligence (AI) model that produces photorealistic images from text and image prompts; it can also be used to create videos and animations. It is a deep learning model that translates written descriptions into intricate visuals.

Under the hood, Stable Diffusion is a latent diffusion model (LDM) trained on a large variety of real-world image datasets, which is what allows it to produce detailed, life-like outputs.

Because users can alter the artistic style and content of the generated pictures, Stable Diffusion models are flexible tools for developers and designers. They are part of a broader trend of AI-driven creative tools that are changing digital art and content creation.

Produces realistic images using generative AI.
Makes use of a latent diffusion model trained on real-world images.
Gives the user control over content and style.

How Can I Access Stable Diffusion Models?

Stable Diffusion models can be downloaded from several websites that host AI models. Civitai and Hugging Face are two well-known repositories where users can find a variety of Stable Diffusion models, each with its own traits and abilities.

These models are frequently accompanied by user guides and documentation to help with setup and operation. Some models also include built-in safety filters to limit the creation of explicit content, but it is vital to remember that these filters are not infallible.

Available for download on websites like Civitai and Hugging Face.
User manuals and documentation are normally supplied.
Certain models come with safety filters.

Why Is Stable Diffusion Important?

Stable Diffusion is significant because it is readily available and simple to use. It runs on consumer-grade graphics cards, so for the first time anyone can download the model and create their own images. You also have control over important hyperparameters, such as the number of denoising steps and the amount of noise applied.

Stable Diffusion is easy to use and does not require specialist knowledge to generate images. Thanks to its active community, there is a wealth of tutorials and documentation. The model is released under the CreativeML OpenRAIL-M license, which allows you to use, modify, and redistribute it.

What Architecture Does Stable Diffusion Use?

The primary architectural elements of Stable Diffusion are a variational autoencoder, forward and reverse diffusion, a noise predictor, and text conditioning.

Variational Autoencoder

The variational autoencoder (VAE) consists of a separate encoder and decoder. The encoder compresses a 512x512 pixel image into a smaller 64x64 representation in latent space. The decoder restores that latent representation to a full-size 512x512 pixel image.

Forward Diffusion

Forward diffusion gradually adds Gaussian noise to an image until only random noise remains, at which point it is impossible to tell what the original image was. Every image goes through this process during training. After training, forward diffusion is used only for image-to-image conversion.
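As a rough illustration of forward diffusion, the sketch below noises a toy "image" (a flat list of values) in closed form: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps. The linear beta schedule and its endpoints (1e-4 to 0.02 over 1,000 steps) are assumptions borrowed from the diffusion literature, not values taken from this post, and the function names are illustrative only.

```python
import math
import random


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    running = 1.0
    for step in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * step / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t from x_0 in one shot instead of looping over t steps."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n
           for x, n in zip(x0, noise)]
    return x_t, noise


alpha_bars = make_alpha_bars()
rng = random.Random(0)
x0 = [1.0] * 8  # toy "image"
x_early, _ = add_noise(x0, 10, alpha_bars, rng)
x_late, _ = add_noise(x0, 999, alpha_bars, rng)

# Fraction of the original signal that survives at each timestep
print("signal kept at t=10: ", math.sqrt(alpha_bars[10]))
print("signal kept at t=999:", math.sqrt(alpha_bars[999]))
```

At a late timestep almost none of the original signal survives, which is why the final noisy image reveals nothing about the input.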


Reverse Diffusion

Reverse diffusion is a parameterized process that iteratively undoes the forward diffusion. For example, you could train the model on just two images, such as a cat and a dog. In that case, the reverse process would drift toward either a cat or a dog, with nothing in between. In practice, training involves billions of images and uses prompts to create unique images.

U-Net Noise Predictor

The key to denoising images is a noise predictor. Stable Diffusion uses a U-Net model for this. U-Net models are convolutional neural networks that were originally developed for image segmentation in biomedicine. Specifically, Stable Diffusion uses a U-Net based on the Residual Neural Network (ResNet) architecture developed for computer vision.
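To see why predicting the noise is enough to denoise, the sketch below inverts a single closed-form noising step. It uses an "oracle" predictor that simply returns the true noise, standing in for the trained U-Net; the function names and the noise level `alpha_bar = 0.3` are hypothetical choices for illustration.

```python
import math
import random


def noisy_sample(x0, noise, alpha_bar):
    """Forward step in closed form: mix the clean values with Gaussian noise."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * n
            for x, n in zip(x0, noise)]


def estimate_x0(x_t, predicted_noise, alpha_bar):
    """Invert the forward step using the predicted noise."""
    return [(x - math.sqrt(1.0 - alpha_bar) * n) / math.sqrt(alpha_bar)
            for x, n in zip(x_t, predicted_noise)]


rng = random.Random(42)
x0 = [0.5, -0.25, 1.0, 0.0]
noise = [rng.gauss(0.0, 1.0) for _ in x0]
alpha_bar = 0.3  # hypothetical noise level

x_t = noisy_sample(x0, noise, alpha_bar)
recovered = estimate_x0(x_t, noise, alpha_bar)  # oracle: prediction == true noise
print(recovered)
```

A real noise predictor only approximates the noise, which is one reason sampling proceeds over many small denoising steps rather than a single inversion.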

Use Case of Stable Diffusion

Stable Diffusion differs from many other diffusion models. In principle, diffusion models use Gaussian noise to encode an image and then recreate it using a noise predictor together with a reverse diffusion process. Beyond these technical aspects, what sets Stable Diffusion apart is that it does not operate in the pixel space of the image. Instead, it works in a lower-dimensional latent space.

The reason is that a color image with 512x512 resolution consists of 786,432 values (512 x 512 pixels x 3 color channels). In contrast, Stable Diffusion works on a compressed representation with 16,384 values, which is 48 times smaller. This greatly reduces processing requirements.
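The arithmetic behind these figures can be checked in a few lines. The sketch assumes a 64x64 latent with 4 channels, which is the shape that yields the 16,384 values quoted above; the helper name is illustrative.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


pixel_values = num_values(512, 512, 3)   # RGB image in pixel space
latent_values = num_values(64, 64, 4)    # assumed 4-channel latent
print(pixel_values, latent_values, pixel_values // latent_values)
# prints: 786432 16384 48
```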

At the core of our solution lies the Stable Diffusion AI model, designed for image generation and manipulation. Fine-tuned specifically for clothing modifications, this model acts as the creative engine behind our virtual dressing room, delivering realistic and visually appealing results.

Dataset: Images of a Customer, Product Images of a Dress

Our dataset comprises a diverse collection of customer images and product images of different dresses. This dataset serves as the training ground for our AI model, allowing it to understand various clothing styles and generate compelling simulations.

Why Advanced GPUs Are Necessary

Running Stable Diffusion models requires a powerful dedicated GPU because of a number of computationally intensive requirements related to the model's architecture and training procedure.

In the figure below, a typical GPU architecture is displayed. Developers do not necessarily have to purchase high-end GPUs, however; the same capabilities are usually available through a cloud GPU platform. The best cloud GPU offerings let you take advantage of the full GPU stack, including GPU clusters, higher bandwidth, and memory efficiency.


Why advanced GPUs are necessary:

Computational Intensity: Complex operations such as forward and reverse diffusion, noise prediction, and image generation are involved in Stable Diffusion models. Although these operations require a significant amount of processing power, the complex calculations involved can be effectively handled by a powerful GPU.

Model Dimensions and Architecture: Latent Diffusion models usually function in a space with a large number of dimensions. To efficiently handle this large latent space, computations of this nature call for a powerful GPU with parallel processing capabilities. Complex operations are carried out by the VAE component, which encodes and decodes images. The computations are accelerated by a dedicated GPU, especially when working with high-resolution images.

High-Resolution Image Generation: Images with 512x512 pixels or higher in resolution are frequently produced by Stable Diffusion models. This resolution of image processing requires a significant amount of memory and computational resources.

E2E Networks: A Cloud-Based Dedicated GPU Platform

E2E Networks is a leading Indian hyperscaler that specializes in cutting-edge cloud GPU infrastructure. We offer solutions for accelerated cloud computing, such as the AI Supercomputer HGX 8xH100 GPUs and state-of-the-art cloud GPUs like the A100 and H100, at low prices. Go here to learn more about the products that E2E Networks offers. The optimal GPU for running Stable Diffusion will mostly depend on your needs and budget. I used a dedicated A100 80 GB GPU node.

To proceed with E2E Networks, add your SSH key by going to Settings.

Then create a node by going to Compute.

Launch Visual Studio Code and install the Remote Explorer and Remote SSH extensions. Open a new terminal. To connect to the node from your local system, enter the command below:

ssh root@<your public ip address>

SSH will log you in to the remote node from your local computer. Let's now begin putting the code into practice.

Step-by-Step Guide to Fine-Tuning Stable Diffusion to Create a Virtual Fashion Designer for Customers

Part 1: Launching Node and Downloading Model

Our journey commences with the setup of the computing environment. We launch a node on E2E Cloud and download the Stable Diffusion model.


# Install necessary libraries
!pip install -q matplotlib
!pip install -q numpy
!pip install -q pandas
!pip install -q scikit-learn
!pip install opencv-python -q
!pip install pyarrow pillow -q
!pip install keras-cv==0.6.0 -q
!pip install -U tensorflow -q
!pip install keras-core -q


# Import libraries
import os
import warnings
warnings.filterwarnings("ignore")
import keras_cv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.experimental.numpy as tnp
import cv2
from PIL import Image
from textwrap import wrap
from keras_cv.models.stable_diffusion.clip_tokenizer import SimpleTokenizer
from keras_cv.models.stable_diffusion.diffusion_model import DiffusionModel
from keras_cv.models.stable_diffusion.image_encoder import ImageEncoder
from keras_cv.models.stable_diffusion.noise_scheduler import NoiseScheduler
from keras_cv.models.stable_diffusion.text_encoder import TextEncoder
from tensorflow import keras

The installation of required libraries ensures a well-equipped environment for seamless execution. We then import essential libraries and set paths for our image and text data.

Part 2: Gathering Fine-Tuning Data

Next, we load images and text descriptions, creating a structured DataFrame. The data is filtered based on specific keywords related to clothing styles.

You can download the images dataset from here and the text descriptions as well from here.

During training, we used both the detailed text descriptions and the images contained in these datasets.


# Specify the paths to your image and text description files
images_dir = "/path/to/your/images/directory"
text_descriptions_file = "/path/to/your/text/descriptions/file.txt"

# Load images from the directory (sorted so the order is deterministic)
image_files = sorted(os.listdir(images_dir))
image_paths = [os.path.join(images_dir, file) for file in image_files]

# Load text descriptions from the file, one description per line
with open(text_descriptions_file, 'r') as file:
    text_descriptions = [line.strip() for line in file]

# Create a DataFrame with image paths and text descriptions.
# This assumes line i of the text file describes the i-th image,
# so both lists must have the same length.
data = {'image': image_paths, 'text': text_descriptions}
df = pd.DataFrame(data)

# Specify the keywords for fine-tuning
keywords = ["latex short black dress", "pantyhose", "white oversized coat"]

# Filter the dataset based on keywords (case-insensitive match on any keyword)
filtered_df = df[df['text'].str.contains('|'.join(keywords), case=False)]

This step establishes the foundation for training our model by organizing the data and filtering out irrelevant entries using predefined keywords.
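The keyword filter can also be sketched without pandas, which makes the matching rule explicit. The file names and captions below are made up for illustration, and `filter_by_keywords` is a hypothetical helper; `re.escape` is used so that a keyword containing regex metacharacters is still matched literally.

```python
import re


def filter_by_keywords(records, keywords):
    """Keep (image_path, caption) pairs whose caption mentions any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [(path, caption) for path, caption in records if pattern.search(caption)]


# Hypothetical records standing in for the real dataset
records = [
    ("img_001.png", "A woman in a White Oversized Coat"),
    ("img_002.png", "A man wearing blue jeans"),
    ("img_003.png", "Model in a latex short black dress"),
]
keywords = ["latex short black dress", "pantyhose", "white oversized coat"]

kept = filter_by_keywords(records, keywords)
print([path for path, _ in kept])
# prints: ['img_001.png', 'img_003.png']
```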

Part 3: Fine-Tuning Stable Diffusion

We prepare the model for fine-tuning by setting up components such as the image encoder, diffusion model, and trainer. We define hyperparameters and initiate the training process.

You can download the model from here.


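# Display a sample of the filtered dataset (at most five rows)
filtered_df.sample(min(5, len(filtered_df)))

# Define constants for fine-tuning
RESOLUTION = 256
MAX_PROMPT_LENGTH = 77
PADDING_TOKEN = 49407  # end-of-text token of the CLIP tokenizer, used for padding

# Path to the pretrained checkpoint in .safetensors format
pretrained_model_path = "/path/to/your/pretrained_model.safetensors"
# Note: tf.saved_model.load() expects a SavedModel directory and cannot read a
# .safetensors file, so the original call is left commented out. The KerasCV
# components created below download their own pretrained weights by default.
# pretrained_model = tf.saved_model.load(pretrained_model_path)

# Define the tokenizer and text encoder for fine-tuning
tokenizer = SimpleTokenizer()
text_encoder = TextEncoder(MAX_PROMPT_LENGTH)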

Fine-tuning the model involves configuring essential components and defining parameters for effective learning. We also consider mixed-precision training for enhanced efficiency.

Part 4: Showcasing Prompting

To effectively train our model, we create a function to process text for fine-tuning and tokenize the text data using a tokenizer.


# Define a function to process text for fine-tuning
def process_text_for_fine_tuning(text):
    # Tokenize, truncate to the maximum prompt length, then pad to a fixed length
    tokens = tokenizer.encode(text)[:MAX_PROMPT_LENGTH]
    tokens = tokens + [PADDING_TOKEN] * (MAX_PROMPT_LENGTH - len(tokens))
    return np.array(tokens)

# Tokenize the text for fine-tuning
tokenized_texts = np.array([process_text_for_fine_tuning(text) for text in filtered_df['text']])

Text processing is a crucial step, ensuring that our AI model comprehends input prompts effectively. Tokenization converts textual data into a format suitable for training.
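The padding step is easier to see on toy values. The sketch below uses a length of 8 and a padding id of 0 purely for readability; in the tutorial the corresponding values are MAX_PROMPT_LENGTH and PADDING_TOKEN, and `pad_tokens` is a hypothetical helper, not part of KerasCV.

```python
def pad_tokens(tokens, max_length, padding_token):
    """Truncate to max_length, then right-pad with padding_token."""
    tokens = list(tokens)[:max_length]
    return tokens + [padding_token] * (max_length - len(tokens))


short_prompt = pad_tokens([5, 6, 7], max_length=8, padding_token=0)
long_prompt = pad_tokens(range(1, 13), max_length=8, padding_token=0)

print(short_prompt)  # [5, 6, 7, 0, 0, 0, 0, 0]
print(long_prompt)   # [1, 2, 3, 4, 5, 6, 7, 8]
```

Every prompt ends up with the same length, which is what allows the tokenized prompts to be stacked into a single array for training.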


# Define the image augmentation pipeline
augmenter = keras.Sequential(
    layers=[
        keras_cv.layers.CenterCrop(RESOLUTION, RESOLUTION),
        keras_cv.layers.RandomFlip(),
        tf.keras.layers.Rescaling(scale=1.0 / 127.5, offset=-1),
    ]
)

Part 5: Training the Model

We define a dedicated Trainer class that wraps the diffusion model, the VAE encoder, and the noise scheduler, and then start the training process.


# Define the Trainer class for fine-tuning
class Trainer(tf.keras.Model):
    def __init__(self, diffusion_model, vae, noise_scheduler, use_mixed_precision=False, max_grad_norm=1.0, **kwargs):
        super().__init__(**kwargs)
        self.diffusion_model = diffusion_model
        self.vae = vae
        self.noise_scheduler = noise_scheduler
        self.max_grad_norm = max_grad_norm
        self.use_mixed_precision = use_mixed_precision
        self.vae.trainable = False  # only the diffusion model is fine-tuned

    def train_step(self, inputs):
        images = inputs["images"]
        encoded_text = inputs["encoded_text"]
        batch_size = tf.shape(images)[0]

        with tf.GradientTape() as tape:
            # Encode the images into latents and scale them
            latents = self.sample_from_encoder_outputs(self.vae(images, training=False))
            latents = latents * 0.18215
            # Add noise at a random timestep per sample (forward diffusion)
            noise = tf.random.normal(tf.shape(latents))
            timesteps = tnp.random.randint(0, self.noise_scheduler.train_timesteps, (batch_size,))
            noisy_latents = self.noise_scheduler.add_noise(tf.cast(latents, noise.dtype), noise, timesteps)
            # The model is trained to predict the noise that was added
            target = noise
            timestep_embedding = tf.map_fn(lambda t: self.get_timestep_embedding(t), timesteps, fn_output_signature=tf.float32)
            timestep_embedding = tf.squeeze(timestep_embedding, 1)
            model_pred = self.diffusion_model([noisy_latents, timestep_embedding, encoded_text], training=True)
            loss = self.compiled_loss(target, model_pred)
            if self.use_mixed_precision:
                loss = self.optimizer.get_scaled_loss(loss)

        # Update only the diffusion model's parameters
        trainable_vars = self.diffusion_model.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        if self.use_mixed_precision:
            gradients = self.optimizer.get_unscaled_gradients(gradients)
        gradients = [tf.clip_by_norm(g, self.max_grad_norm) for g in gradients]
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        return {m.name: m.result() for m in self.metrics}

    def get_timestep_embedding(self, timestep, dim=320, max_period=10000):
        # Sinusoidal embedding of the timestep
        half = dim // 2
        log_max_period = tf.math.log(tf.cast(max_period, tf.float32))
        freqs = tf.math.exp(-log_max_period * tf.range(0, half, dtype=tf.float32) / half)
        args = tf.convert_to_tensor([timestep], dtype=tf.float32) * freqs
        embedding = tf.concat([tf.math.cos(args), tf.math.sin(args)], 0)
        embedding = tf.reshape(embedding, [1, -1])
        return embedding

    def sample_from_encoder_outputs(self, outputs):
        # Sample a latent from the mean and log-variance produced by the encoder
        mean, logvar = tf.split(outputs, 2, axis=-1)
        logvar = tf.clip_by_value(logvar, -30.0, 20.0)
        std = tf.exp(0.5 * logvar)
        sample = tf.random.normal(tf.shape(mean), dtype=mean.dtype)
        return mean + std * sample

    def save_weights(self, filepath, overwrite=True, save_format=None, options=None):
        # Save only the fine-tuned diffusion model's weights
        self.diffusion_model.save_weights(filepath=filepath, overwrite=overwrite, save_format=save_format, options=options)

Training the model involves specifying hyperparameters, defining a checkpoint for saving weights, and executing the training process. This step fine-tunes the model for accurate clothing modifications.
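The Trainer's `get_timestep_embedding` method tells the model how noisy its input is by turning an integer timestep into a vector of cosines and sines at geometrically spaced frequencies. The pure-Python sketch below mirrors that computation so it can be inspected without TensorFlow; it is an illustration with the same default `dim` and `max_period`, not a replacement for the method.

```python
import math


def timestep_embedding(timestep, dim=320, max_period=10000):
    """Sinusoidal embedding: dim // 2 cosines followed by dim // 2 sines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [timestep * f for f in freqs]
    return [math.cos(a) for a in args] + [math.sin(a) for a in args]


emb = timestep_embedding(10)
print(len(emb))          # 320
print(emb[0], emb[160])  # cos(10) and sin(10): the first frequency is 1.0
```

Nearby timesteps get similar embeddings, while distant ones differ across many frequencies, which gives the network a smooth signal for the current noise level.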


# Enable mixed-precision training if the underlying GPU has tensor cores.
USE_MP = True
if USE_MP:
    keras.mixed_precision.set_global_policy("mixed_float16")

image_encoder = ImageEncoder()
diffusion_ft_trainer = Trainer(
    diffusion_model=DiffusionModel(RESOLUTION, RESOLUTION, MAX_PROMPT_LENGTH),
    vae=tf.keras.Model(image_encoder.input, image_encoder.layers[-2].output),
    noise_scheduler=NoiseScheduler(),
    use_mixed_precision=USE_MP,
)

# Hyperparameters
lr = 1e-5
beta_1, beta_2 = 0.9, 0.999
weight_decay = 1e-2
epsilon = 1e-08

# Optimizer
optimizer = tf.keras.optimizers.experimental.AdamW(
    learning_rate=lr,
    weight_decay=weight_decay,
    beta_1=beta_1,
    beta_2=beta_2,
    epsilon=epsilon,
)
diffusion_ft_trainer.compile(optimizer=optimizer, loss="mse")

Now, let’s train for 100 epochs.


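# Training
# Note: `training_dataset` is not defined in this post. It is assumed to be a
# batched tf.data.Dataset whose elements are dicts with "images" and
# "encoded_text" keys, which is what Trainer.train_step reads.
epochs = 100
ckpt_path = "finetuned_stable_diffusion.h5"
ckpt_callback = tf.keras.callbacks.ModelCheckpoint(
    ckpt_path,
    save_weights_only=True,
    monitor="loss",
    mode="min",
)
diffusion_ft_trainer.fit(training_dataset, epochs=epochs, callbacks=[ckpt_callback])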

Results

Now, let's showcase the results of our virtual dressing room by modifying the clothing in an example image.


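# Text-to-Image Generation with Clothing Modification
# Note: `apply_augmentation` and `run_text_encoder` are helper functions that this
# post does not define, so treat this function as an outline rather than runnable code.
def modify_clothing(image_path, prompt):
    # Load the input image and resize it to the training resolution
    input_image = tf.io.read_file(image_path)
    input_image = tf.io.decode_png(input_image, 3)
    input_image = tf.image.resize(input_image, (RESOLUTION, RESOLUTION))
    # Tokenize the prompt and add a batch dimension to both inputs
    tokenized_prompt = process_text_for_fine_tuning(prompt)
    input_image = tf.expand_dims(input_image, axis=0)
    tokenized_prompt = tf.expand_dims(tokenized_prompt, axis=0)
    augmented_image, encoded_prompt = apply_augmentation(input_image, tokenized_prompt)
    _, _, encoded_text_batch = run_text_encoder(augmented_image, encoded_prompt)
    # Note: Trainer.train_step calls the diffusion model with
    # [noisy_latents, timestep_embedding, encoded_text]; the call below passes only
    # the image and the encoded text, so it would need adapting before it can run.
    modified_image = diffusion_ft_trainer.diffusion_model.predict([augmented_image, encoded_text_batch])
    return modified_image[0]

# Example Usage
input_image_path = '/path/to/your/input/image.png'
prompt_for_clothing_modification = "Change to suit"
modified_image_output = modify_clothing(input_image_path, prompt_for_clothing_modification)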

The example demonstrates how an input image is transformed according to the clothing-modification prompt. A side-by-side comparison of the input and modified images lets users see the AI-driven changes in attire.

Conclusion

In conclusion, fine-tuning the Stable Diffusion model for e-commerce image generation benefited greatly from E2E Networks' dedicated A100 80 GB GPU compute. The A100's computational power handled the complex model operations effectively, including forward and reverse diffusion, noise prediction, and image generation, which led to faster training.

The A100 also allowed for quick experimentation and effective model customization through fine-tuning on our own datasets, cutting training times and keeping image generation responsive. The cloud-based infrastructure from E2E Networks offered a customizable environment that removed hardware limitations and made dedicated GPU resources readily available.

In summary, combining E2E Networks' A100 GPU with Stable Diffusion fine-tuning gave us an accessible, computationally efficient setup with accelerated model training, making the creation of visual content for e-commerce both efficient and enjoyable.


Manthan Abhay Deshpande
