Decoding the Mystery of Diffusion Models in AI


    An introduction to diffusion models in AI, explaining their core concepts, training process, and applications in image generation, with a focus on noise prediction, neural networks, and resource-efficient breakthroughs like Stable Diffusion.

    By AI Club on 9/21/2024

    Introduction

    In the ever-evolving world of artificial intelligence, diffusion models have emerged as a groundbreaking approach in image generation and manipulation. This article aims to demystify these complex models, making them accessible to enthusiasts and experts alike. We’ll embark on a journey through the intuitive concepts, intricate processes, and exciting applications of diffusion models.

    The Intuition Behind Diffusion Models

    Picture an ink drop dispersing in water. Initially distinct, the drop gradually blends until it’s indistinguishable. This physical diffusion process is the cornerstone of understanding diffusion models in AI. Unlike the irreversible nature of ink in water, diffusion models uniquely reverse this process, transforming diffused noise back into clear, structured data.


    The Neural Network Approach

    At the heart of this reverse engineering lies the neural network, a powerful tool capable of learning and mimicking complex functions. For diffusion models, the goal is to teach a neural network to counteract the diffusion process, transforming a noisy, unstructured image into its original, clear form.

    Training the Neural Network: A Two-Step Dance

    1. Adding Noise: We start by artificially introducing noise to an image, creating a series of increasingly distorted versions. It's like watching our ink drop gradually dissolve. In the example below, we iteratively add Gaussian noise to a sprite (Bob the Sprite). Notice that after enough noise is added, Bob the Sprite becomes indistinguishable from random noise (a code sketch of this noising step follows the two steps below).

    Adding noise to Bob The Sprite

    2. Predicting Noise: Here’s where the neural network shines. By analyzing these noisy images, it learns to identify and predict the specific noise in each one. Removing this predicted noise effectively reverses the diffusion, bringing us closer to the original image.

    Removing Noise from Fred The Sprite
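
    To make the two steps concrete, here is a minimal sketch of the forward (noising) process in PyTorch. The schedule values (T = 500 and the beta range) are illustrative choices, and `add_noise` returns the sampled noise `eps`, which is exactly what the network in step 2 learns to predict.

    ```python
    import torch

    T = 500
    betas = torch.linspace(1e-4, 0.02, T)         # per-step noise levels (illustrative)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)      # cumulative signal level

    def add_noise(x0, t):
        """Jump straight to timestep t: x_t = sqrt(ab_t)*x0 + sqrt(1-ab_t)*eps."""
        eps = torch.randn_like(x0)                # Gaussian noise
        ab_t = alpha_bar[t].reshape(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
        xt = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * eps
        return xt, eps                            # eps is the training target

    # Noise a batch of one 3x16x16 sprite at a late timestep: nearly pure noise.
    x0 = torch.rand(1, 3, 16, 16) * 2 - 1         # image scaled to [-1, 1]
    xt, eps = add_noise(x0, torch.tensor([400]))
    ```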

    The Role of Timesteps

    In diffusion models, timesteps are crucial. Each timestep represents a distinct phase in the image’s transformation, where new layers of noise are added. The number of timesteps is a key factor, influencing both the model’s performance and its computational demands.

    The Magic of Sampling

    Once trained, the model uses its noise-predicting prowess to generate images. Starting with a noisy sample, the model iteratively refines the image, gradually peeling away the noise through repeated applications. It’s a meticulous process, requiring several iterations to achieve high-quality results.

    But a single pass of the above process won't give a proper image, or in this case a sprite. Generating the entire image from a noisy sample in one shot is an extremely difficult task, so we need to repeat this step many times to get a good-quality image. A sketch of this sampling loop follows below.
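
    Here is a minimal DDPM-style sampling loop, reusing the schedule tensors (T, betas, alphas, alpha_bar) from the earlier sketch; `model` is assumed to be a trained noise predictor that takes a noisy image and a timestep.

    ```python
    import torch

    @torch.no_grad()
    def sample(model, shape=(1, 3, 16, 16)):
        xt = torch.randn(shape)                      # start from pure noise
        for t in reversed(range(T)):
            eps_pred = model(xt, torch.tensor([t]))  # predict the added noise
            coef = betas[t] / (1 - alpha_bar[t]).sqrt()
            mean = (xt - coef * eps_pred) / alphas[t].sqrt()
            z = torch.randn_like(xt) if t > 0 else torch.zeros_like(xt)
            xt = mean + betas[t].sqrt() * z          # re-inject a little noise
        return xt                                    # the final denoised sample
    ```

    Note that a small amount of fresh noise is re-injected at every step except the last; this is what makes the gradual, iterative refinement work, rather than attempting to strip all the noise away in one shot.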

    The Architectural Marvel: U-Net

    The neural network of choice for diffusion models is U-Net. Renowned for its U-shaped architecture, U-Net excels in tasks like image segmentation, combining encoding and decoding pathways to process and refine images effectively. In diffusion models, U-Net is tailored to excel in noise prediction, with added layers to integrate timestep information crucial for the reverse diffusion process.

    The architecture of the U-Net

    But how do we add timestep-related information to the neural network?

    As you can see in the image above, the timestep information is converted into an embedding and added directly to the upsampling blocks.
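
    One common way to build that embedding is the sinusoidal timestep embedding sketched below; the embedding dimension and the injection mechanism (a small MLP whose output is added to the feature maps) are typical choices rather than the only ones.

    ```python
    import math
    import torch

    def timestep_embedding(t, dim=128):
        """Map integer timesteps to dense sinusoidal vectors, transformer-style."""
        half = dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
        args = t.float()[:, None] * freqs[None, :]                    # (batch, half)
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)

    # Inside a U-Net block, the embedding is typically passed through a small
    # MLP and added to the feature maps, e.g.:
    #   h = h + mlp(timestep_embedding(t))[:, :, None, None]
    ```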

    Training: Crafting the Noise Schedule

    To understand the training procedure, we first need to understand what exactly a noise schedule is.

    Noise Schedule: The noise schedule determines the level of noise to be added at each diffusion step. There are different ways of adding noise to an image. One is a linear schedule, which adds the same amount of noise at every timestep. Another is a cosine schedule, which adds little noise in the initial timesteps and much more in the later ones.

    This choice is pivotal: the noise schedule shapes both the training process and the model's ability to predict and reverse the noise. Both schedules are sketched in code below.
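
    The two schedules can be computed as follows; the linear beta range matches the original DDPM paper, and the cosine form follows Nichol & Dhariwal's improved-DDPM formulation (the offset s = 0.008 is their suggested value).

    ```python
    import math
    import torch

    T = 500

    # Linear schedule: the per-step noise level grows at a constant rate.
    betas_linear = torch.linspace(1e-4, 0.02, T)

    # Cosine schedule: defined through the cumulative signal level alpha_bar,
    # which decays slowly early on and faster later.
    def cosine_alpha_bar(t, s=0.008):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2

    ab = torch.tensor([cosine_alpha_bar(t) for t in range(T + 1)])
    betas_cosine = (1 - ab[1:] / ab[:-1]).clamp(max=0.999)
    ```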


    For training, we sample images and timesteps. Then, for every image, the sampled timestep determines the amount of noise to be added, using the noise schedule defined earlier. We add the sampled noise to the original image and ask our neural network to predict the added noise.

    A simple loss function, such as the mean squared error between the added noise and the predicted noise, can be used. A sketch of the full training step follows.
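
    Putting the pieces together, one training step might look like this minimal sketch, reusing `add_noise` and T from the earlier sketch; `model` and `optimizer` are assumed to be a noise-predicting U-Net and any standard optimizer.

    ```python
    import torch
    import torch.nn.functional as F

    def training_step(model, x0, optimizer):
        t = torch.randint(0, T, (x0.shape[0],))   # a random timestep per image
        xt, eps = add_noise(x0, t)                # forward process from earlier
        eps_pred = model(xt, t)                   # network predicts the noise
        loss = F.mse_loss(eps_pred, eps)          # mean squared error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()
    ```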

    The Control Mechanism: Steering the Output

    A key challenge is directing the model to generate specific images. This is achieved through text embeddings, where textual descriptions are converted into dense representations and integrated into the training process. By associating images with corresponding text during training, the model learns to generate images that align with textual inputs, giving us control over the output.

    Just as we embed the time information in the model, we embed the text information in the model.
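
    As a concrete example, Stable Diffusion conditions on embeddings from a frozen CLIP text encoder; the sketch below obtains such embeddings via Hugging Face transformers (the checkpoint here is a small illustrative one, and the `context` keyword in the final comment is hypothetical; in practice the embeddings enter the U-Net through cross-attention layers).

    ```python
    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

    tokens = tokenizer(["a pixel-art sprite of a knight"],
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        text_emb = text_encoder(**tokens).last_hidden_state  # (1, seq_len, 512)

    # During training and sampling, the U-Net sees this alongside x_t and t:
    #   eps_pred = model(xt, t, context=text_emb)   # `context` is illustrative
    ```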

    Stable Diffusion: A Resource-Efficient Breakthrough

    Training robust diffusion models typically demands immense computational resources. Stable Diffusion addresses this by operating in a more compact latent space, significantly reducing the resources needed without compromising quality. The latent space represents a compressed, yet richly informative, version of the data, enabling efficient training and generation processes.

    In Stable Diffusion, the entire process of adding noise and predicting noise happens in this latent space, which is much smaller than the original image space.
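
    A sketch of hopping between pixel space and latent space with Stable Diffusion's pretrained VAE via the diffusers library; the model ID and the 0.18215 scaling factor are those commonly used with Stable Diffusion v1, but treat them as an example rather than the only option.

    ```python
    import torch
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained(
        "runwayml/stable-diffusion-v1-5", subfolder="vae"
    )

    x = torch.randn(1, 3, 512, 512)                       # a pixel-space "image"
    with torch.no_grad():
        z = vae.encode(x).latent_dist.sample() * 0.18215  # (1, 4, 64, 64) latent
        x_rec = vae.decode(z / 0.18215).sample            # back to (1, 3, 512, 512)
    ```

    Diffusion then runs on the 4x64x64 latent rather than the 3x512x512 image, roughly 48 times fewer values per step, which is where the resource savings come from.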

    Conclusion

    Diffusion models stand as a testament to the remarkable progress in AI and machine learning. By reversing a natural dispersion process, these models open up new horizons in image generation and manipulation, offering a blend of scientific rigor and creative potential. Whether you’re an AI enthusiast or a seasoned professional, understanding and leveraging diffusion models can be a transformative addition to your toolkit.

    You can try open-source Stable Diffusion models using the notebook below:
    https://colab.research.google.com/drive/1roZqqhsdpCXZr8kgV_Bx_ABVBPgea3lX?usp=sharing
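
    If you'd rather run it locally, a minimal sketch with the diffusers library looks like this (the model ID and prompt are just examples, and a CUDA GPU is assumed):

    ```python
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
    image.save("lighthouse.png")
    ```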

    This article aimed to provide a comprehensive yet accessible overview of diffusion models. If you believe additional details or examples would enhance understanding, please feel free to suggest them. The goal is to ensure that this complex topic is as clear and engaging as possible for all readers.

    If you found this article useful and believe others would too, leave a clap!

    Resources

    1. https://colab.research.google.com/drive/1roZqqhsdpCXZr8kgV_Bx_ABVBPgea3lX?usp=sharing

    2. https://www.deeplearning.ai/short-courses/how-diffusion-models-work/

    3. https://medium.com/@sergio.leal/how-diffusion-models-work-by-deeplearning-ai-recap-c66de2e19d71

    4. https://www.youtube.com/watch?v=-lz30by8-sU

    5. https://www.youtube.com/watch?v=1CIpzeNxIhU&t=707s
