Introduction
Imagine a world where high-quality visuals of a fashion collection are created not in a photo studio, but by a neural network. There are no models, no cameras — just lines of code generating striking visuals tailored to a brand’s aesthetic. This is no longer the future; it’s the present, thanks to advancements in AI like Stable Diffusion XL (SDXL).
In this blog, I’ll walk you through my journey of fine-tuning SDXL to generate custom images for a fashion brand. By leveraging AI’s generative capabilities, I aimed to create visuals that reflect the brand’s unique style and maintain the high level of realism that SDXL is known for.
Why Stable Diffusion?
- Accessibility and User-Friendliness: Anyone with a consumer-grade GPU can download the model and start generating images, which democratizes access to high-quality image generation.
- Active Community and Open License: Stable Diffusion has a vibrant community that produces a wealth of tutorials and documentation, and its permissive license lets users freely use, modify, and distribute the model.
Why Fine-Tune Stable Diffusion?
Stable Diffusion XL is a powerful model for generating high-resolution, photorealistic images from text prompts. Out of the box, it can create stunning visuals across various domains, but in niche areas like fashion it can fall short. A generic model lacks the context and nuance needed to capture a specific brand's aesthetic, whether it's the intricate patterns of a dress or the distinct vibe of a collection.
Fine-tuning bridges this gap. By training the model on a curated dataset of a fashion brand’s images, I could teach it the brand’s visual language. This customization unlocks endless possibilities for fashion marketing, design ideation, and more, all while reducing reliance on traditional photoshoots.
Project Objective
The objective was to create a fine-tuned version of Stable Diffusion XL capable of generating photorealistic images of models wearing clothes inspired by a specific fashion brand. The process involved:
- Preparing a dataset of branded fashion images.
- Fine-tuning the model to align its outputs with the brand’s aesthetic.
- Generating text prompts to produce realistic visuals of models wearing the designs.
The following sections detail the approach taken, from dataset preparation to training and results. This exploration aims to highlight the potential of generative AI in fashion and beyond, offering insights for tech enthusiasts, fashion designers, and anyone interested in the intersection of AI and creativity.
Preparing the Dataset
Fine-tuning Stable Diffusion XL begins with a crucial step: dataset preparation. The dataset forms the foundation for teaching the model the specific nuances of the fashion brand’s aesthetic. The process included:
1. Scraping Images and Captions
Images and corresponding captions were scraped directly from the fashion brand's website. This ensured that the data accurately reflected the brand's unique visual style and language. A rough sketch of this step is shown below.
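For illustration, here is a minimal scraping sketch using requests and BeautifulSoup. The URL and CSS selectors are hypothetical placeholders; the real selectors depend entirely on the brand's site markup, and any scraping should respect the site's terms of service.
import os
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-fashion-brand.com/collection"  # placeholder URL
OUT_DIR = "./SDXL_train"
os.makedirs(OUT_DIR, exist_ok=True)

soup = BeautifulSoup(requests.get(BASE_URL, timeout=30).text, "html.parser")
for i, card in enumerate(soup.select(".product-card")):  # assumed CSS class
    img_url = card.select_one("img")["src"]
    caption = card.select_one(".product-name").get_text(strip=True)  # assumed class
    with open(os.path.join(OUT_DIR, f"item_{i:04d}.jpg"), "wb") as f:
        f.write(requests.get(img_url, timeout=30).content)
    with open(os.path.join(OUT_DIR, f"item_{i:04d}.txt"), "w") as f:
        f.write(caption)  # keep the product caption alongside the image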
2. Resizing Images
To meet Stable Diffusion XL's native 1024×1024 resolution, all images were resized and center-cropped to 1024×1024 pixels, ensuring uniformity and compatibility during training.
from PIL import Image
import os
import glob

def resize_and_center_crop(img_paths, output_folder, target_size=(1024, 1024)):
    os.makedirs(output_folder, exist_ok=True)
    for img_path in img_paths:
        img = Image.open(img_path)
        img_ratio = img.width / img.height
        target_ratio = target_size[0] / target_size[1]
        if img_ratio > target_ratio:
            # Image is wider than the target: match the height, then crop the sides.
            new_height = target_size[1]
            new_width = int(new_height * img_ratio)
            img = img.resize((new_width, new_height), Image.LANCZOS)
            left = (new_width - target_size[0]) // 2
            img = img.crop((left, 0, left + target_size[0], target_size[1]))
        else:
            # Image is taller than the target: match the width, then crop top and bottom.
            new_width = target_size[0]
            new_height = int(new_width / img_ratio)
            img = img.resize((new_width, new_height), Image.LANCZOS)
            top = (new_height - target_size[1]) // 2
            img = img.crop((0, top, target_size[0], top + target_size[1]))
        img_name = os.path.basename(img_path)
        img.save(os.path.join(output_folder, img_name))
    print(f"Images resized and center-cropped, saved to {output_folder}")

img_paths = glob.glob("./SDXL_train/*.jpg")
output_folder = "./SDXL"
resize_and_center_crop(img_paths, output_folder)
3. Creating a Metadata File
Each image was tagged with a token identifier and its corresponding caption, formatted into a JSON Lines (JSONL) metadata file. Here's the code snippet I used.
import json

# Note: local_dir, imgs_and_paths, and caption_images() are assumed to be defined
# earlier in the notebook (e.g., a BLIP-based helper that captions each image).
caption_prefix = "a photo of TOK woman"  # a trailing separator such as ", " may be needed, depending on the captioner's output
with open(f'{local_dir}metadata.jsonl', 'w') as outfile:
    for img in imgs_and_paths:
        # Prepend the rare-token identifier to the generated caption.
        caption = caption_prefix + caption_images(img[1]).split("\n")[0]
        entry = {"file_name": img[0].split("/")[-1], "prompt": caption}
        json.dump(entry, outfile)
        outfile.write('\n')
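Each line of the resulting metadata.jsonl pairs a file name with its prompt. A hypothetical entry (file name and caption are illustrative) looks like this:
{"file_name": "item_0012.jpg", "prompt": "a photo of TOK woman wearing a black cutout dress"}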
Fine-Tuning the Model
Once the dataset was ready, the next step was fine-tuning the Stable Diffusion XL model. This process involved teaching the model to generate images that align with the brand’s aesthetic.
Fine-Tuning: LoRA on Top of DreamBooth
Achieving true customization for a fashion brand required a two-step approach to fine-tuning: combining the personalization power of DreamBooth with the efficiency of LoRA (Low-Rank Adaptation).
How We Achieved Customization
- DreamBooth
DreamBooth enabled us to personalize the model by training it on branded product images, ensuring it became familiar with the brand's specifics, such as recurring patterns, styles, and overall aesthetic.
- LoRA (Low-Rank Adaptation)
To refine the model efficiently, we employed LoRA. Rather than updating the model's entire structure, LoRA freezes the original weights and trains small low-rank update matrices for selected layers, making fine-tuning far more lightweight than traditional full retraining. A conceptual sketch follows this list.
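To make the idea concrete, here is a minimal conceptual sketch of a LoRA layer in PyTorch. This illustrates the technique only; the actual adapters are injected by the diffusers training script, and the rank and scaling values here are arbitrary examples.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights stay frozen
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # update starts at zero, so behavior is unchanged at step 0
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Only lora_A and lora_B are trained, a tiny fraction of the original parameters.
layer = LoRALinear(nn.Linear(768, 768), rank=4)
out = layer(torch.randn(2, 768))
Because only the small A and B matrices are trained and saved, a LoRA adapter is typically a few megabytes rather than gigabytes, which is what makes per-collection updates cheap.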
Key Benefits of This Approach
- Efficient & Lightweight
- LoRA requires fewer computational resources than full-scale retraining, making it faster and more cost-effective.
- This efficiency allowed us to focus our computational efforts where they were most impactful.
- Granularity
- LoRA allows precise updates when new products or variations are introduced.
- This adaptability eliminates the need for complete retraining, saving time and effort while keeping the model up-to-date.
Optimizing the Training Pipeline
To ensure seamless integration of DreamBooth with LoRA on a resource-intensive pipeline like Stable Diffusion XL, I implemented the following optimizations:
- Gradient Checkpointing
Reduces memory usage during training by discarding most intermediate activations and recomputing them during the backward pass, trading compute for memory.
Parameter: --gradient_checkpointing
- 8-bit Adam Optimizer
A memory-efficient optimizer (from the bitsandbytes library) that stores optimizer states in 8-bit precision, reducing resource requirements for large-scale training.
Parameter: --use_8bit_adam
- Mixed-Precision Training
Training in FP16 precision reduces memory consumption and speeds up training without significantly affecting output quality.
Parameter: --mixed_precision="fp16"
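For intuition, gradient checkpointing can be demonstrated in a few lines of plain PyTorch; the training script enables the same behavior internally via the flag above. This is an illustration, not part of the pipeline:
import torch
from torch.utils.checkpoint import checkpoint

# Activations inside `block` are not stored during the forward pass;
# they are recomputed during backward, trading compute for memory.
block = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
x = torch.randn(4, 512, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()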
Custom Captions or Instance Prompts
Depending on your data and goals, you can choose how captions are integrated into your fine-tuning process:
- If you use custom captions (like those in a metadata file), you'll need to install the datasets library and specify --dataset_name.
- If you prefer to train exclusively with an instance prompt, you can skip the captioning step.
This flexibility makes the workflow adaptable to different datasets and training strategies.
!accelerate launch train_dreambooth_lora_sdxl.py \
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
--pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
--dataset_name="SDXL_train" \
--output_dir="SDXL_LoRA_model" \
--caption_column="prompt" \
--mixed_precision="fp16" \
--instance_prompt="a photo of TOK woman" \
--resolution=1024 \
--train_batch_size=1 \
--gradient_accumulation_steps=3 \
--gradient_checkpointing \
--learning_rate=1e-4 \
--snr_gamma=5.0 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--use_8bit_adam \
--max_train_steps=2500 \
--checkpointing_steps=717 \
--seed="0"
Results and Outputs
After fine-tuning the Stable Diffusion XL model, the next step was to test its ability to generate high-quality, photorealistic fashion imagery. To make this process interactive and user-friendly, a Gradio interface was built, enabling users to input prompts and view the generated images instantly.
Building the Fashion Image Generator
The Gradio interface provides a lightweight and intuitive way to interact with the fine-tuned model. With just a few lines of Python, I built a web-based application that lets users generate custom fashion images by entering descriptive prompts.
Here’s the code to set up the interface:
import gradio as gr
from diffusers import DiffusionPipeline
import torch
from PIL import Image

# Load the base SDXL model and apply the fine-tuned LoRA weights
# (assumes the adapter was saved to SDXL_LoRA_model, the training output_dir).
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("SDXL_LoRA_model")

def generate_image(prompt):
    image = pipe(prompt).images[0]
    # Downscale for quicker display in the browser.
    return image.resize((512, 512), Image.Resampling.LANCZOS)

with gr.Blocks() as interface:
    gr.Markdown("# Fashion Image Generator")
    gr.Markdown("Enter a prompt to generate a fashion image using the fine-tuned Stable Diffusion model.")
    prompt_input = gr.Textbox(label="Enter your prompt", placeholder="Enter a fashion description...")
    generate_button = gr.Button("Generate Image")
    output_image = gr.Image(label="Generated Image")
    generate_button.click(fn=generate_image, inputs=prompt_input, outputs=output_image)

interface.launch()
How It Works
- Input: The user provides a text description of the desired fashion image, for example, "A photo of a TOK woman wearing a Cutout Racerback Tank Top."
- Processing: The fine-tuned Stable Diffusion model processes the prompt and generates a corresponding image.
- Output: The generated image is displayed within the interface, resized for better visualization.
This interactive approach makes it easy to test various prompts and evaluate the model’s ability to capture the desired aesthetic.
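The pipeline can also be called directly in Python, bypassing the UI. Here is a small hedged example with typical SDXL sampling parameters; the prompt is hypothetical and the values shown are common defaults, not necessarily the settings used for the results below.
image = pipe(
    "a photo of TOK woman wearing a black cutout dress",  # hypothetical prompt
    num_inference_steps=30,  # number of denoising steps
    guidance_scale=7.5,      # how strongly generation follows the prompt
).images[0]
image.save("generated_look.png")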
Example Results:
Conclusion
Fine-tuning Stable Diffusion XL demonstrated the immense potential of AI to transform the fashion industry. By bridging technology and creativity, we can:
- Cut costs and time associated with traditional photoshoots.
- Enable brands to generate visuals that truly reflect their aesthetic.
- Open doors to new design ideation methods.
This project highlights how AI can not only complement creativity but also redefine how we approach design and marketing. Whether you’re in fashion, marketing, or tech, the future of personalized content generation is here — and it’s only getting started.