Today’s video explains some of the principles behind generative AI algorithms based on diffusion methods.
To get started, here are some important references on the subject. First of all, the article that first proposed diffusion methods:
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015, June). Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (pp. 2256-2265). PMLR.
The paper that popularized the use of diffusion models for generating images:
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840-6851.
And finally, the more recent research paper behind Stable Diffusion:
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695).
It includes, in particular, conditioning on a class or on text.
On how diffusion methods compare with other generative approaches, this paper from NVIDIA provides a diagram in the form of a “trilemma”.
One of Stable Diffusion’s strengths is that it achieves the sampling speed that diffusion models usually lack, thanks in particular to one trick: denoising in “latent space”.
Diffusion in latent space
Here’s the gist: if you want images of a reasonable resolution (say 512 x 512), the sampling process of a classic diffusion model takes time. Their solution is to work on a “compressed” version of the image. We take 64 x 64 noise, denoise it, and then “upscale” the result by a factor of 8 in each dimension to get to 512 x 512. Obviously, for this to work, the denoising algorithm must have been trained on compressed versions of the images in the dataset.
At this point, you might think that what I call “compressing” or “upscaling” is just a simple change in the size of the images. But no, the idea is to compress the image into a “latent space” using a Variational Auto-Encoder (VAE). This is a family of algorithms that compress data (somewhat) intelligently by learning to represent it in a lower-dimensional latent space (I recommend this excellent post by Joseph Rocca on the subject).
So if we have such a VAE trained on the base images, we proceed as follows: we encode the images, then we train a denoising algorithm on the encoded representations. When we generate an image by denoising, we generate it “in latent space”, and at the end we only have to decode it.
It is this decoding step in particular that explains why, when you use Stable Diffusion and enjoy watching the intermediate steps (when not all of the noise has been removed yet), you see images whose noise does not look like the one I showed in my video, but rather like this:
This is Gaussian noise in the latent space (64 x 64), upscaled by the VAE decoder to produce a 512 x 512 image.
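The latent-space procedure described above can be sketched at the level of array shapes. This is only a minimal illustration: `toy_encode`, `toy_decode` and `toy_denoise_step` are hypothetical stand-ins invented for this sketch. In particular, a real VAE encoder and decoder are learned networks, not the simple block averaging and block repetition used here, and the real denoiser is a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

FACTOR = 8        # compression factor per spatial dimension (from the text)
LATENT_SIZE = 64  # 64 x 64 latent, decoded to 512 x 512


def toy_encode(image):
    """Hypothetical stand-in for the VAE encoder: average each 8 x 8 block."""
    h, w = image.shape
    blocks = image.reshape(h // FACTOR, FACTOR, w // FACTOR, FACTOR)
    return blocks.mean(axis=(1, 3))


def toy_decode(latent):
    """Hypothetical stand-in for the VAE decoder: repeat each value 8 x 8 times."""
    return np.kron(latent, np.ones((FACTOR, FACTOR)))


def toy_denoise_step(latent):
    """Hypothetical stand-in for the trained denoiser: shrink the noise a bit."""
    return 0.9 * latent


def generate(num_steps=20):
    # Start from Gaussian noise directly in latent space.
    latent = rng.standard_normal((LATENT_SIZE, LATENT_SIZE))
    # All denoising steps happen on the small 64 x 64 array.
    for _ in range(num_steps):
        latent = toy_denoise_step(latent)
    # Decode only once, at the very end.
    return toy_decode(latent)


image = generate()
print(image.shape)  # (512, 512)
```

The point of the design is visible in the loop: each of the 20 steps touches 64 x 64 values instead of 512 x 512, and the costly full-resolution operation happens a single time.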
Regarding DALL·E, I understand that the first version did not use diffusion algorithms (and worked only moderately well as a result), while DALL·E 2 switched to diffusion algorithms, but (if I understand correctly) without the trick of doing the diffusion in a latent space obtained with a VAE.
Moreover, Stable Diffusion and DALL·E 2 both use OpenAI’s CLIP model to perform the embedding I talked about in the video.
Denoising and U-Net
I didn’t specify in my video what kind of network architecture is used to perform the denoising. You may know that in computer vision we use convolutional networks, made famous in particular by Yann LeCun. But in their typical use, such as classification, the network takes a lot of pixels as input and outputs a limited number of values (e.g. the estimated probability for each of 1000 predefined categories). Convolution filters allow spatial information to be aggregated precisely in order to extract semantic information from it.
To do denoising, we take an image as input, and the output must be an image with the same resolution. For this, we use an architecture originally invented for medical image processing and segmentation problems: the “U-Net” architecture.
The idea is to start classically with a sequence of convolutions and pooling that reduces the spatial dimensions while increasing the number of channels, and then to go back in the other direction to return to an image with the initial dimensions. The subtlety is that during this ascent phase we feed the network the intermediate results obtained during the descent (the horizontal gray arrows represent this in the diagram).
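To make the descent, the ascent and the horizontal arrows concrete, here is a minimal shape-level sketch. It is an assumption-laden toy, not a real U-Net: `mix_channels` is a hypothetical stand-in for a learned convolution block (it uses random, untrained weights), and the depth and channel counts are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def mix_channels(x, out_channels):
    """Hypothetical stand-in for a convolution block: random 1x1 mixing + ReLU."""
    weights = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.maximum(np.einsum("oc,chw->ohw", weights, x), 0.0)


def downsample(x):
    """2 x 2 average pooling: halves height and width."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def upsample(x):
    """Nearest-neighbour upsampling: doubles height and width."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def toy_unet(image, base_channels=8, depth=3):
    x = mix_channels(image, base_channels)
    skips = []
    # Descent: fewer pixels, more channels. Keep each intermediate map.
    for _ in range(depth):
        skips.append(x)
        x = mix_channels(downsample(x), x.shape[0] * 2)
    # Ascent: more pixels, fewer channels. The saved maps are concatenated
    # back in (these are the horizontal arrows of the diagram).
    for skip in reversed(skips):
        x = np.concatenate([upsample(x), skip], axis=0)
        x = mix_channels(x, skip.shape[0])
    # Map back to the number of channels of the input image.
    return mix_channels(x, image.shape[0])


noisy = rng.standard_normal((3, 64, 64))
output = toy_unet(noisy)
print(noisy.shape, output.shape)  # (3, 64, 64) (3, 64, 64)
```

The skip connections let the ascent reuse fine spatial detail that pooling discarded on the way down, which matters when the output must match the input pixel for pixel.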
Sampling and scheduling
If you’ve played around a bit with Stable Diffusion or some of the APIs that serve it, you’ll have noticed that they offer many samplers/schedulers. These are different ways of carrying out the denoising with the same denoising network. In particular, you can choose the rate at which the noise is removed. In the video, I acted as if we were removing 5% of the noise each time, so as to remove all of it in 20 steps. But on the one hand we can choose the number of steps, and on the other hand we are not obliged to adopt a linear pace. So the “scheduling” of the denoising can vary, and some schedules obviously work better than others, both in terms of quality and speed.
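As a rough illustration of what “scheduling” means, the sketch below compares the linear pace from the video (5% per step over 20 steps) with one possible non-linear pace. Both functions are assumptions made up for this example: they are not the actual schedulers shipped with Stable Diffusion or any library, and the cosine shape is just one convenient curve that removes little noise at the start and end and more in the middle.

```python
import numpy as np


def linear_schedule(num_steps):
    """Noise level going from 1 to 0 in equal decrements."""
    return np.linspace(1.0, 0.0, num_steps + 1)


def cosine_schedule(num_steps):
    """Illustrative non-linear pace: slow start, fast middle, slow end."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    return 0.5 * (1.0 + np.cos(np.pi * t))


linear = linear_schedule(20)
cosine = cosine_schedule(20)

# Amount of noise removed at each step under both schedules.
for step, (a, b) in enumerate(zip(-np.diff(linear), -np.diff(cosine)), start=1):
    print(f"step {step:2d}: linear removes {a:.3f}, cosine removes {b:.3f}")
```

Changing `num_steps` changes how many times the denoising network is called, which is where the trade-off between quality and speed comes from.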
Among the possible uses of trained diffusion models is inpainting, that is, letting the algorithm fill in certain parts of an image, for example after erasing something.
Another is to provide some kind of sketch and ask the algorithm to fill it in; here is an example of landscape generation.
Either way, it still works on the principle of conditioned denoising. We start from noise, and we denoise it while conditioning on an image that is used as a reference.
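One simple way to picture denoising guided by a reference image, in the inpainting case, is to re-impose the known pixels at every step. The sketch below is an assumption for illustration, not necessarily what the tools shown here do: `toy_denoise_step` is a hypothetical stand-in for the trained network, and the way the reference is noised (adding Gaussian noise scaled by the current level) is deliberately simplified.

```python
import numpy as np

rng = np.random.default_rng(0)


def toy_denoise_step(x):
    """Hypothetical stand-in for the trained denoising network."""
    return 0.8 * x


def inpaint(reference, known_mask, num_steps=20):
    """Fill the region where known_mask is False; keep the reference elsewhere."""
    noise_levels = np.linspace(1.0, 0.0, num_steps + 1)
    x = rng.standard_normal(reference.shape)  # start from pure noise
    for level in noise_levels[1:]:
        x = toy_denoise_step(x)
        # Noise the reference to the current level and paste it over the
        # known region, so the free region is denoised in a consistent context.
        noised_reference = reference + level * rng.standard_normal(reference.shape)
        x = np.where(known_mask, noised_reference, x)
    return x


reference = rng.uniform(0.0, 1.0, size=(64, 64))
known_mask = np.ones((64, 64), dtype=bool)
known_mask[16:48, 16:48] = False  # the erased square to be filled in

result = inpaint(reference, known_mask)
print(np.allclose(result[known_mask], reference[known_mask]))  # True
```

Because the last noise level is zero, the known region ends up identical to the reference, while the erased square contains whatever the denoiser produced.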
Note that diffusion models can also be used to interpolate between two images: the interpolation is done in their “diffused” (noised) representation, and the result is then brought back to the original image representation by denoising.
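For the interpolation step itself, a common choice for Gaussian-like noisy representations is spherical interpolation rather than a straight average; that choice is an assumption here, since the text does not say which interpolation is used. The sketch only covers the mixing of two already-noised arrays: the noising of the two images and the final denoising are not shown.

```python
import numpy as np


def slerp(a, b, t):
    """Spherical interpolation between two same-shaped arrays, with 0 <= t <= 1."""
    a_flat, b_flat = a.ravel(), b.ravel()
    cos_angle = np.dot(a_flat, b_flat) / (
        np.linalg.norm(a_flat) * np.linalg.norm(b_flat)
    )
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    if np.sin(angle) < 1e-6:
        # Nearly parallel or opposite vectors: fall back to a linear mix.
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * angle) * a + np.sin(t * angle) * b) / np.sin(angle)


rng = np.random.default_rng(0)
noised_a = rng.standard_normal((64, 64))  # stands for the first noised image
noised_b = rng.standard_normal((64, 64))  # stands for the second noised image

# Five intermediate representations, each of which would then be denoised.
frames = [slerp(noised_a, noised_b, t) for t in np.linspace(0.0, 1.0, 5)]
print(len(frames), frames[0].shape)  # 5 (64, 64)
```

Unlike a plain average, this keeps the overall magnitude of the intermediate arrays close to that of the endpoints, which keeps them closer to what a denoiser trained on Gaussian noise expects.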