Analyzing SOTA Image Generation Models — From Diffusion to FLUX

Introduction
The Big Picture: What Is Being Learned
Principles of Diffusion Models
- Noise Prediction and the Objective Function
- Reverse Diffusion and Samplers
Latent Diffusion
Text Conditioning
- Text Encoders
- Cross-Attention and Condition Injection
CFG: Classifier-Free Guidance
Rectified Flow and Flow Matching
DiT: The Shift to Diffusion Transformers
Lineage and Representative Model Families
Comparison Table: An Overview Along Architectural Axes
Full Pipeline Diagram
Strengths
Limitations and Open Problems
Practical Implications
Conclusion
References

Introduction

Text-to-image generation has been one of the fastest-advancing areas of generative AI over the past few years. After the GAN era, diffusion models became the de facto standard. Since 2024 two trends have been clear: generalizing the mathematical framework of diffusion through rectified flow and flow matching, and replacing the U-Net backbone with a transformer (DiT). Rather than asserting the detailed specs of individual models, this article focuses on the architectural principles and lineage shared by SOTA models in this field.

This field changes very quickly. The content below is based on widely known concepts, papers, and architecture families. Keep in mind that any given model's ranking or specific numbers vary by benchmark, version, and evaluation method.

The Big Picture: What Is Being Learned

The goal of an image generation model is to learn a data distribution and draw new samples from it. A text-to-image model adds a "text condition" on top of this, sampling from the conditional distribution of images described by a given sentence.

The core idea of diffusion models divides into two stages.

Forward process: gradually add Gaussian noise to a clean image until it becomes pure noise.
Reverse process: start from pure noise and gradually remove the noise to reconstruct an image.

The model learns to predict "what noise was mixed in at each step." The more accurate this noise prediction, the more reliably the reverse process can turn pure noise back into a plausible image.

Forward (generate training signal): x0 --noise--> x1 --noise--> ... --> xT (pure noise)

Reverse (generation):           xT --denoise--> ... --> x1 --denoise--> x0 (image)

              At each step the network predicts the "mixed-in noise"
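
The forward chain above can be sketched in a few lines. This is a minimal illustration, assuming a variance-preserving update and a linear noise schedule from 1e-4 to 0.02 over 1000 steps; the function names and the 16-value toy "image" are hypothetical and not taken from any particular model.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances; the linear range is an assumed, common choice."""
    span = beta_end - beta_start
    return [beta_start + span * i / (num_steps - 1) for i in range(num_steps)]


def forward_chain(x0, betas, rng):
    """Noise x0 step by step: x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps."""
    trajectory = [list(x0)]
    for beta in betas:
        previous = trajectory[-1]
        trajectory.append([
            math.sqrt(1.0 - beta) * value + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for value in previous
        ])
    return trajectory


def surviving_signal(betas):
    """Factor multiplying x0 inside x_T; near zero means x_T is almost pure noise."""
    return math.sqrt(math.prod(1.0 - beta for beta in betas))


rng = random.Random(0)
betas = linear_betas(1000)
x0 = [1.0] * 16  # toy "image" with 16 pixel values
trajectory = forward_chain(x0, betas, rng)
print(f"states: {len(trajectory)}, surviving signal: {surviving_signal(betas):.4f}")
```

Under these assumed settings, well under 1% of the original signal amplitude survives at the last step, which is the sense in which xT is "pure noise."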

Principles of Diffusion Models

Noise Prediction and the Objective Function

In the most widely used formulation (the DDPM family), noise is mixed into the image at a randomly chosen timestep t, in an amount set by the noise schedule for that t, and the network is trained to predict that noise. The loss is generally the mean squared error between the "actually mixed noise" and the "predicted noise."

Training loop (conceptual):
1. Sample image x0 from the data
2. Randomly select timestep t (1..T)
3. Sample noise eps, generate x_t according to the schedule
4. Network predicts eps_pred = model(x_t, t, condition)
5. Minimize loss = mean( (eps - eps_pred)^2 )

Here the "condition" is the text embedding. Instead of the noise-prediction form, variants that directly predict the original image or predict the velocity (v-prediction, etc.) are also widely used. Even when the formulation differs, the essence is the same: the network learns the "denoising direction" from a noisy state.
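
As a rough sketch of the conceptual training loop, the code below computes the loss for one sample without any gradient update. It assumes the common closed form x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps with a linear schedule; `zero_model` and `make_oracle` are hypothetical stand-ins for a real network, and `v_target` shows the velocity target used by v-prediction under the same assumption.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level abar_t for an assumed linear beta schedule."""
    levels, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        levels.append(running)
    return levels


def add_noise(x0, eps, abar_t):
    """Closed-form noising: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    a, s = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [a * x + s * e for x, e in zip(x0, eps)]


def v_target(x0, eps, abar_t):
    """v-prediction target: v = sqrt(abar_t) * eps - sqrt(1 - abar_t) * x0."""
    a, s = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [a * e - s * x for x, e in zip(x0, eps)]


def training_loss(model, x0, condition, abar, rng):
    """Steps 2 to 5 of the conceptual loop for a single sample."""
    t = rng.randrange(len(abar))                   # 0-based index of the timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]        # sample the noise
    x_t = add_noise(x0, eps, abar[t])              # build the noisy input
    eps_pred = model(x_t, t, condition)            # network predicts the noise
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def zero_model(x_t, t, condition):
    """Hypothetical untrained model: always predicts zero noise."""
    return [0.0] * len(x_t)


def make_oracle(x0, abar):
    """Hypothetical perfect model that already knows x0 and inverts add_noise."""
    def model(x_t, t, condition):
        a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
        return [(x - a * clean) / s for x, clean in zip(x_t, x0)]
    return model


abar = alpha_bars(1000)
x0 = [0.5, -1.0, 2.0, 0.0]
print(training_loss(zero_model, x0, "a cat", abar, random.Random(0)))
print(training_loss(make_oracle(x0, abar), x0, "a cat", abar, random.Random(0)))
```

The oracle's loss is essentially zero while the zero model's is not, which is all the objective asks for: recover the noise that was mixed in.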

Reverse Diffusion and Samplers

Once training is done, generation starts from pure noise and removes noise over multiple steps. The algorithm that actually performs this process is called a sampler (solver).

DDPM: The original approach. It is slow because it needs many steps (hundreds to around a thousand).
DDIM: Allows a deterministic path and greatly reduces the number of steps.
DPM-Solver family: Uses high-order approximations from a differential-equation perspective to maintain quality with only a few steps.

Reverse diffusion can in fact be viewed as the problem of solving a stochastic differential equation (SDE) or an ordinary differential equation (ODE). This perspective connects naturally to the flow matching discussed later.
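
To make the sampler idea concrete, here is a minimal sketch of a deterministic DDIM-style loop (the eta = 0 case) that uses only 10 of 1000 training timesteps. The schedule and the "oracle" noise predictor, which pretends all data equals one known target vector, are assumptions for illustration; a real sampler would call a trained network instead.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal level abar_t for an assumed linear beta schedule."""
    levels, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        levels.append(running)
    return levels


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic update from noise level abar_t to abar_prev."""
    a_t, s_t = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    a_p, s_p = math.sqrt(abar_prev), math.sqrt(1.0 - abar_prev)
    # Estimate the clean sample, then re-noise it to the previous level.
    x0_pred = [(x - s_t * e) / a_t for x, e in zip(x_t, eps_pred)]
    return [a_p * clean + s_p * e for clean, e in zip(x0_pred, eps_pred)]


def sample(model, abar, timesteps, dim, rng):
    """Start from Gaussian noise and walk down the chosen timesteps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i, t in enumerate(timesteps):
        is_last = i + 1 == len(timesteps)
        abar_prev = 1.0 if is_last else abar[timesteps[i + 1]]
        x = ddim_step(x, model(x, t), abar[t], abar_prev)
    return x


def make_point_oracle(target, abar):
    """Hypothetical noise predictor for data concentrated on a single target."""
    def model(x_t, t):
        a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
        return [(x - a * m) / s for x, m in zip(x_t, target)]
    return model


abar = alpha_bars(1000)
timesteps = list(range(999, -1, -100))  # 10 evenly spaced steps
target = [0.25, -0.5, 1.0]
result = sample(make_point_oracle(target, abar), abar, timesteps, len(target), random.Random(0))
print(result)
```

With a perfect predictor the skipped steps cost nothing; with a learned one, the quality of higher-order solvers comes from how well they approximate the same trajectory in few steps.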

Latent Diffusion

Early diffusion models ran the diffusion process directly in pixel space. Because the cost grows with the number of pixels, high-resolution images quickly become expensive to train on and to generate. The Latent Diffusion Model (arXiv 2112.10752) addressed this problem elegantly.

The key is to first train an autoencoder (VAE) that compresses images into a much smaller latent space. The diffusion process then takes place on this compressed latent representation.

[image] --VAE encoder--> [small latent tensor] --diffusion training/generation--> [latent tensor] --VAE decoder--> [image]

           e.g.: 512x512x3 pixels  ==>  64x64x4 latent (8x spatial reduction)

Thanks to this structure, the diffusion network only has to handle much smaller tensors, greatly reducing computation and memory. The Stable Diffusion family was built on top of this latent diffusion, and most practical text-to-image models afterward adopted the latent-space approach.
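
The saving can be checked with simple arithmetic. The sketch below uses the example sizes from the diagram (8x downsampling per side, 4 latent channels); the helper names are hypothetical, and other models use different factors and channel counts.

```python
def element_count(height, width, channels):
    """Number of values in a height x width x channels tensor."""
    return height * width * channels


def latent_shape(height, width, downsample=8, latent_channels=4):
    """Latent size for an assumed VAE with the given downsampling factor."""
    return (height // downsample, width // downsample, latent_channels)


pixel_values = element_count(512, 512, 3)
latent_values = element_count(*latent_shape(512, 512))
print(latent_shape(512, 512), pixel_values, latent_values, pixel_values / latent_values)
```

For this example the diffusion network sees 16,384 values instead of 786,432, a 48x reduction in tensor size.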

Text Conditioning

To control images with text, you need to convert a sentence into a vector the network can understand and inject that vector into the generation process.

Text Encoders

CLIP text encoder: An encoder trained with image-text contrastive learning. It aligns text and images in the same embedding space. It is a natural way to connect the meaning of a prompt to image generation.
T5 family text encoder: The encoder of a large pretrained language model, which captures the syntax and semantics of long, complex prompts more richly.

Recent models often adopt a hybrid configuration that uses both CLIP and T5. The intuition is that CLIP provides an image-alignment signal while T5 provides depth of language understanding.

Cross-Attention and Condition Injection

The representative way to inject text embeddings into image generation is cross-attention. The latent representation being generated becomes the query, and the text embedding becomes the key and value. Each image position learns which word of the prompt to attend to.

[latent representation tokens] --Query-->
                              [cross-attention] --> text-conditioned features
[text embedding]   --Key,Value-->
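
The query/key/value roles can be sketched as single-head scaled dot-product attention. This is a simplified illustration: the learned projection matrices of a real layer are omitted, and the tiny vectors are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query (image token) mixes the values (text tokens) by similarity to the keys."""
    dim = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        w = softmax(scores)
        outputs.append([
            sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))
        ])
        weights.append(w)
    return outputs, weights


# Two text tokens (keys/values) and three latent tokens (queries); toy numbers.
keys = [[4.0, 0.0], [0.0, 4.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = cross_attention(queries, keys, values)
for row in weights:
    print([round(w, 3) for w in row])
```

The first latent token attends mostly to the first text token, the second to the second, and the third splits its attention evenly, which is the "which word to attend to" behavior described above.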

In a transformer backbone (DiT), the approach of concatenating text tokens and image tokens into a single sequence and running attention together (joint attention) is also widely used. This lets text and image representations interact with each other more deeply.

CFG: Classifier-Free Guidance

Classifier-Free Guidance (CFG) is a technique for controlling how strongly the model follows the text condition. During training, the condition is replaced with an empty one (unconditional) with some probability, so that a single network learns both conditional and unconditional prediction. During generation, the conditional and unconditional predictions are mixed.

guided = uncond + scale * (cond - uncond)

scale = 0: ignores the condition (unconditional prediction only)
scale = 1: plain conditional prediction, no extra guidance
higher scale: increased prompt fidelity, but too high causes oversaturation / unnaturalness

The CFG scale is the key knob for adjusting the trade-off between prompt fidelity and diversity/naturalness. When the value is too large, colors burn out or the image feels artificial.
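
A minimal sketch of both halves of CFG follows: dropping the condition during training and mixing the two predictions at generation time. The function names, the 10% drop probability, and the empty-string placeholder are assumptions for illustration.

```python
import random


def maybe_drop_condition(condition, drop_prob, rng, empty=""):
    """Training side: replace the condition with an empty one with some probability."""
    return empty if rng.random() < drop_prob else condition


def cfg_combine(uncond, cond, scale):
    """Generation side: guided = uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]
print(cfg_combine(uncond, cond, 0.0))  # equals the unconditional prediction
print(cfg_combine(uncond, cond, 1.0))  # equals the conditional prediction
print(cfg_combine(uncond, cond, 2.0))  # extrapolates past the conditional prediction

rng = random.Random(0)
kept = sum(maybe_drop_condition("a cat", 0.1, rng) == "a cat" for _ in range(1000))
print(kept)
```

Scales above 1 push the prediction beyond the conditional one, away from the unconditional one, which is why very large values start to distort the image.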

Rectified Flow and Flow Matching

Flow matching and rectified flow emerged naturally from viewing the reverse of diffusion as an ODE. This family has established itself as the training frame for recent SOTA text-to-image models.

The core idea is this: learn directly the "velocity field" that goes from the noise distribution to the data distribution. In particular, rectified flow makes the path connecting noise and data as close to a straight line as possible.

Diffusion (curved path)        rectified flow (path close to a straight line)

noise . . . data      noise -------- data
   winding trajectory          straightened trajectory -> reaches with few steps

The closer the path is to a straight line, the fewer integration steps are needed at generation time. In other words, good quality can be obtained with less computation. This property is one of the practical reasons the latest models prefer the flow matching family.
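
The step-count argument can be illustrated with a toy example. The sketch assumes the convention t = 0 for noise and t = 1 for data, a single noise/data pair with a known velocity, and a made-up curved path (a quarter-circle blend) standing in for a winding trajectory; none of this is the training procedure of a specific model.

```python
import math


def interpolate(noise, data, t):
    """Straight-line path used by rectified flow: x_t = (1 - t) * noise + t * data."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def velocity_target(noise, data):
    """Regression target along the straight path: constant velocity data - noise."""
    return [d - n for n, d in zip(noise, data)]


def euler_integrate(velocity_fn, start, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x, dt = list(start), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def max_error(a, b):
    return max(abs(u - v) for u, v in zip(a, b))


noise, data = [0.5, -1.0], [2.0, 1.0]


def straight(x, t):
    """Velocity of the straight path; it does not change along the way."""
    return velocity_target(noise, data)


def curved(x, t):
    """Velocity of the toy curved path x_t = cos(pi*t/2) * noise + sin(pi*t/2) * data."""
    angle = math.pi * t / 2.0
    return [(math.pi / 2.0) * (-math.sin(angle) * n + math.cos(angle) * d)
            for n, d in zip(noise, data)]


for steps in (1, 10, 100):
    print(steps,
          max_error(euler_integrate(straight, noise, steps), data),
          max_error(euler_integrate(curved, noise, steps), data))
```

The straight path reaches the data point in a single Euler step, while the curved one needs many steps to get close. Learned velocity fields are not perfectly straight, so real models still use more than one step.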

DiT: The Shift to Diffusion Transformers

The backbone of early diffusion models was mostly U-Net, a convolution-based encoder-decoder structure with skip connections. Later, the Diffusion Transformer (DiT) appeared, strengthening the trend of replacing the backbone with a transformer.

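The idea of DiT is simple. Cut the latent tensor into patches to form a token sequence, and process it with standard transformer blocks. The timestep t and the condition are typically injected through adaptive layer normalization (adaLN), where they set the scale and shift applied after normalization, or through attention over the condition tokens.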

[latent tensor] --patch split--> [token sequence]
                              |
                     [transformer blocks x N]
                     (self-attention + condition injection)
                              |
                     [noise/velocity prediction] --> patch reconstruction
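
The patch split and the final patch reconstruction can be sketched with plain nested lists. The 64x64x4 latent and the patch size of 2 are assumed example values, and real implementations also apply a learned linear projection and positional information, which are left out here.

```python
def patchify(latent, patch):
    """Split an H x W x C latent into (H/patch * W/patch) tokens of size patch*patch*C."""
    height, width = len(latent), len(latent[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for dy in range(patch):
                for dx in range(patch):
                    token.extend(latent[top + dy][left + dx])
            tokens.append(token)
    return tokens


def unpatchify(tokens, height, width, channels, patch):
    """Inverse of patchify: place each token's values back at their positions."""
    latent = [[None] * width for _ in range(height)]
    index = 0
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = tokens[index]
            index += 1
            for dy in range(patch):
                for dx in range(patch):
                    start = (dy * patch + dx) * channels
                    latent[top + dy][left + dx] = token[start:start + channels]
    return latent


# Toy latent whose values encode their own position, so the round trip is checkable.
latent = [[[float(y * 1000 + x * 10 + c) for c in range(4)] for x in range(64)]
          for y in range(64)]
tokens = patchify(latent, patch=2)
restored = unpatchify(tokens, 64, 64, 4, patch=2)
print(len(tokens), len(tokens[0]), restored == latent)
```

With these sizes the transformer processes 1,024 tokens of 16 values each, and the sequence length grows quadratically as the patch size shrinks.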

The advantage of the transformer backbone is scalability. Performance has been observed to improve smoothly as model size and data are scaled up, which is the background for large text-to-image models converging on the DiT family.

Lineage and Representative Model Families

Below is an overview centered on concept and architecture families. Detailed specs differ by version and release timing, so the descriptions stay at the level of each family's general characteristics.

Early Stable Diffusion family: latent diffusion + U-Net + CLIP conditioning. It greatly widened the open ecosystem.
DiT-based large models: a trend that replaced the backbone with a transformer and introduced text-image joint attention. The Stable Diffusion 3 family is known to have adopted rectified flow, a transformer backbone, and a multi-text-encoder configuration.
FLUX family: known for combining rectified flow with a large-scale transformer backbone, and widely regarded as strong in prompt fidelity and image quality. How much of the detailed training recipe and exact numbers is known depends on what has been publicly disclosed.
Imagen / DALL-E family: known as a family that emphasizes powerful text encoders and diffusion (or cascade) structures. Many are closed, so details are disclosed only to a limited degree.

Three common directions are observed across this lineage: (1) latent space instead of pixels, (2) transformers instead of U-Net, and (3) flow matching / rectified flow instead of the classic diffusion formulation.

Comparison Table: An Overview Along Architectural Axes

Axis	Early diffusion (U-Net)	DiT-based large models	Rectified flow family
Backbone	U-Net (convolution)	Transformer	Transformer-centric
Training formulation	Noise prediction (DDPM family)	Noise/velocity prediction	Velocity field (flow matching)
Condition injection	Cross-attention	Joint/cross-attention	Joint/cross-attention
Text encoder	Mostly CLIP	Tends toward CLIP + T5	Tends toward CLIP + T5
Scalability	Limited	Excellent	Excellent
Generation-step tendency	Many	Medium	Fewer (straight path)

The values in the table are general tendencies of each family and may differ from the exact configuration of a specific product or version.

Full Pipeline Diagram

[prompt text]
      |
 [text encoder: CLIP / T5]
      |
 [text embedding] ------------------+
                                   |
[pure noise (latent)] --> [diffusion/flow backbone: U-Net or DiT] <-- (cross/joint attention)
                                   |
                           [iterative denoise: sampler + CFG]
                                   |
                           [final latent tensor]
                                   |
                            [VAE decoder]
                                   |
                             [final image]
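
The diagram can be read as the following control flow. This is only a structural sketch: `toy_encode`, `toy_backbone`, and `toy_decode` are hypothetical stand-ins for the text encoder, the backbone, and the VAE decoder, and the loop assumes a flow-style Euler sampler with CFG rather than any specific product's pipeline.

```python
import random


def generate(prompt, encode_text, backbone, decode, steps, cfg_scale, latent_dim, rng):
    """Prompt -> text embedding -> iterative denoising with CFG -> decoded image."""
    cond = encode_text(prompt)
    uncond = encode_text("")                       # empty prompt for the unconditional branch
    x = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]   # pure noise in latent space
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v_cond = backbone(x, t, cond)
        v_uncond = backbone(x, t, uncond)
        v = [u + cfg_scale * (c - u) for u, c in zip(v_uncond, v_cond)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step of the sampler
    return decode(x)


def toy_encode(prompt, dim=4):
    """Hypothetical text encoder: hashes characters into a small vector."""
    return [sum(ord(ch) for ch in prompt[i::dim]) % 7 / 7.0 for i in range(dim)]


def toy_backbone(x, t, embedding):
    """Hypothetical backbone: velocity pointing straight at the embedding."""
    return [(e - xi) / (1.0 - t) for xi, e in zip(x, embedding)]


def toy_decode(latent):
    """Hypothetical decoder: a fixed rescaling instead of a real VAE."""
    return [2.0 * value for value in latent]


image = generate("a red cube", toy_encode, toy_backbone, toy_decode,
                 steps=8, cfg_scale=1.0, latent_dim=4, rng=random.Random(0))
print(image)
```

Each component is passed in as a function, which mirrors the modularity of the real pipeline: the encoder, backbone, sampler settings, and decoder can be swapped independently.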

Strengths

Balance of quality and control: The diffusion/flow families make it easier to capture diversity and fidelity together, and provide an adjustment knob via CFG.
Modularity: The text encoder, backbone, and VAE are separated, making it easy to swap and improve components.
Scalability: With the introduction of DiT, these models benefit from scaling up parameters and data.
Room for efficiency: Research on reducing generation steps with rectified flow and high-order samplers is active.

Limitations and Open Problems

Text rendering: Drawing letters accurately inside an image is still tricky. It has improved greatly recently but is not perfect.
Compositional accuracy: Compositional prompts that must accurately keep objects, attributes, and spatial relations — like "a blue ball on top of a red cube" — are prone to failure.
Hands and anatomy: Details such as the number of fingers still have frequent errors.
Difficulty of evaluation: Metrics like FID do not fully capture perceptual quality and can diverge from human preference evaluation. Rankings vary by benchmark and version.
Copyright and data provenance: The copyright of training data and the imitation of style are major issues beyond the technical realm.

Practical Implications

If prompt fidelity matters, tune the CFG scale and the sampler together. A higher CFG scale is not always better.
If speed matters, reducing steps with the rectified flow family or high-order samplers is advantageous.
To fix a specific style or object, combine lightweight fine-tuning such as LoRA with condition-control (e.g., structure guidance) techniques.
When selecting a model, it is safer to run a direct comparative evaluation in your target domain than to assume "the newest is the best."

Conclusion

The trend of SOTA image generation can be summarized by three axes: "latent space + transformer backbone + flow matching." The Stable Diffusion 3 family and the FLUX family are known as representative examples that combine these axes. However, since rankings and detailed numbers in this field change very quickly, the approach that stays useful longest is to understand the concepts and architectural principles and then verify them directly in your own domain.