AI Image Generation 2026: Flux / Midjourney 7 / Ideogram 3 / Recraft / SD 3.5 / GPT-4o / Imagen 4 Deep Dive

Prologue — One Model in 2024, an Entire Ecosystem in 2026

Two years ago, "I made an image with AI" naturally meant Midjourney, DALL·E 3, or Stable Diffusion XL. There were three models, and the choice was simple. For aesthetics, Midjourney. For chat integration, DALL·E. For hands-on control, SDXL.

By spring 2026, that simplicity is gone. To answer the same question, you first have to ask: "What kind of image? Photorealistic, illustration, a poster with text, a vector logo, real-time generation? Do you need open weights, or is an API enough? Does training-data licensing matter, or just the result?"

This piece is a map of AI image generation in 2026 that follows every branch of that question. The Black Forest Labs Flux 1.1 Pro and Kontext story setting a new bar for photorealistic open-weight quality. Midjourney 7 cementing itself as the aesthetic standard. Ideogram 3 running away with text-in-image. Recraft V3 opening a separate design category. Stable Diffusion 3.5 returning as the community base model after the Stability AI restructure. Plus the GPT-4o Ghibli moment, Imagen 4, Firefly 4, Krea and Photon real-time generation, the ComfyUI node graph, LoRA/ControlNet/IPAdapter building blocks, and the Korean and Japanese ecosystems.

Chapter 1 · The 2026 AI Image Generation Map — Three Camps

Draw the 2026 AI image generation market on a single map, and three camps appear.

**1. The closed-API camp** — Midjourney 7, OpenAI (GPT-4o image and the DALL·E 4 rumors), Google (Imagen 3 and 4), Adobe (Firefly 4), Ideogram 3. Weights are not published, inference runs on their own infrastructure, and users pay via tokens or subscriptions. The quality ceiling is high, tool integration is smooth, and safety filters are strict.

**2. The open-weight camp** — Black Forest Labs Flux (Schnell and Dev are open; Pro, Ultra, and Kontext are API), Stable Diffusion 3.5 Large and Medium, NovelAI (partial), Sakana AI Japanese models. Weights live on HuggingFace, anyone can download them and run them on their own GPU. LoRA finetuning, ControlNet, IPAdapter, and ComfyUI node graphs are the weapons of this camp. Civitai is the community LoRA hub.

**3. The real-time camp** — Krea AI, Luma Photon, fal.ai LCM/Turbo hosting, and the canvas UIs built on top. The target is 50 ms per generation, not five seconds. Move a slider and the image follows in real time. Sketch on a canvas with the mouse and diffusion is layered on instantly. The UX shifts from "prompt, wait, result" to "prompt, instant, interactive."

The borders between the three camps keep blurring. Black Forest Labs ships open-weight Dev while running Pro, Ultra, and Kontext as an API. Krea AI usually serves Flux and SD 3.5 distilled to LCM in real time rather than its own model. Still, the first question a user asks when picking a model — "can I touch the weights?" "do I pay in tokens or in GPU hours?" "does the result arrive in five seconds or 50 ms?" — keeps dividing these three camps.

Chapters 2 to 9 walk through the individual models: Flux, Midjourney 7, Ideogram 3, Recraft V3, Stable Diffusion 3.5, Imagen, GPT-4o image, and Firefly 4. Chapter 10 covers real-time generation. Chapters 11 and 12 cover tools and workflow building blocks. Chapter 13 covers regional models. Chapter 14 covers selection.

Chapter 2 · Flux (Black Forest Labs) — The New Company by the SD Founders

In August 2024, the core researchers who built Stable Diffusion at Stability AI launched a new company, Black Forest Labs (BFL). It is headquartered in Freiburg, Germany, and raised a roughly 31M USD seed round led by Andreessen Horowitz. The first model was Flux.1, in three variants.

- **Flux.1 [schnell]** — Distilled variant that generates in about four steps. Apache 2.0 license. Commercial use is free. Anyone can download the weights on HuggingFace.

- **Flux.1 [dev]** — Standard variant at roughly 50 steps. Weights are open but the license is non-commercial. Free for personal and research use; commercial use requires a separate license.

- **Flux.1 [pro]** — The largest variant. Weights closed. Available only via the BFL API and partner hosts like fal.ai, Replicate, and Together.ai.

In October 2024, **Flux 1.1 Pro** arrived. Same interface, better quality, faster inference. Price around `$0.04` per image. Two larger announcements followed.

**Flux Ultra.** Announced in late 2024, this variant generates natively at up to 4 megapixels. Not 1024 plus upscaling but high-resolution generation from the start. Big news for photographers heading to print (commercial ads, posters).

**Flux Kontext** — BFL's biggest move. A model dedicated to image editing and re-contextualization. It takes an input image and accepts natural-language instructions like "keep this person and change the background to a Tokyo street" or "change this product from red to blue." A clear step up from the previous generation of InstructPix2Pix, SDEdit, and IP2P work.

Three technical highlights of Flux.

First, a **rectified-flow-based diffusion transformer (MM-DiT)**. Inheriting the MM-DiT architecture introduced by Stable Diffusion 3, text and image tokens are concatenated into one sequence and processed by joint attention in the same transformer blocks, rather than through a separate cross-attention layer. The result: subtle nuances in the text prompt (spatial relations, materials, lighting) make it into the image.

Second, **aggressive use of T5 text encoders**. Where SDXL used two CLIP encoders, Flux pairs a large T5 (XXL) encoder with CLIP-L. T5 understands natural language far better than CLIP, so syntactic requirements like "holding a red apple in the left hand with the right hand in a pocket" land more reliably.

Third, **hands and text are almost solved**. Models up through SDXL routinely failed at finger counts, clock hands, and in-image text (signs, labels). Flux Pro gets all three right most of the time. Five fingers, twelve numbers on a clock, "OPEN" written exactly "OPEN" on a sign.
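The joint attention described in the first highlight can be sketched in plain Python. This is a toy, single-head version under loud assumptions: queries, keys, and values are the token vectors themselves (no learned projections), and the token values are made up. It only illustrates that text and image tokens share one sequence, so every token can attend to every other.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(text_tokens, image_tokens):
    """Single-head attention over the concatenated [text; image] sequence.

    Simplification (assumption): identity projections, so Q = K = V = tokens.
    """
    tokens = text_tokens + image_tokens  # one shared sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return weights, outputs


# Hypothetical toy tokens: 2 "text" tokens and 3 "image" tokens of width 2.
text = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]
weights, outputs = joint_attention(text, image)
```

Each row of `weights` has five entries: an image token's row includes non-zero weight on both text tokens, which is the property cross-attention-free joint blocks rely on.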

Here is a Flux workflow in ComfyUI.

ComfyUI node graph (summary; in practice you wire nodes in the GUI)

1) Load Diffusion Model -> flux1-dev.safetensors

2) Load CLIP -> t5xxl_fp8_e4m3fn.safetensors + clip_l.safetensors

3) Load VAE -> ae.safetensors

4) CLIP Text Encode (Positive) -> "a photo of a red ceramic mug on a wooden desk, soft window light"

5) Empty Latent Image -> 1024x1024

6) BasicScheduler / KSamplerSelect / RandomNoise / SamplerCustomAdvanced

7) VAE Decode -> Save Image

Flux Dev runs in ComfyUI on roughly 16 GB of VRAM with fp8 weights (RTX 4080, 4090, 5080, 5090, A100). For fp16 full precision, 24 GB is the safer target.
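The seven-step graph above can also be written as data. The sketch below uses the dictionary layout ComfyUI exports in its API format, where each node has a `class_type` and `inputs`, and a link is `[source_node_id, output_index]`. The node class names are the built-in ones as far as I know, but treat them as assumptions, along with the step count, the seed, and the `BasicGuider` node, which is not in the list above and is added here because `SamplerCustomAdvanced` expects a guider input.

```python
import json

# File names follow the list above; steps, seed and prefix are illustrative.
workflow = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "flux1-dev.safetensors", "weight_dtype": "default"}},
    "2": {"class_type": "DualCLIPLoader",
          "inputs": {"clip_name1": "t5xxl_fp8_e4m3fn.safetensors",
                     "clip_name2": "clip_l.safetensors", "type": "flux"}},
    "3": {"class_type": "VAELoader", "inputs": {"vae_name": "ae.safetensors"}},
    "4": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["2", 0],
                     "text": "a photo of a red ceramic mug on a wooden desk, soft window light"}},
    "5": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "6": {"class_type": "BasicScheduler",
          "inputs": {"model": ["1", 0], "scheduler": "simple", "steps": 20, "denoise": 1.0}},
    "7": {"class_type": "KSamplerSelect", "inputs": {"sampler_name": "euler"}},
    "8": {"class_type": "RandomNoise", "inputs": {"noise_seed": 42}},
    "9": {"class_type": "BasicGuider",
          "inputs": {"model": ["1", 0], "conditioning": ["4", 0]}},
    "10": {"class_type": "SamplerCustomAdvanced",
           "inputs": {"noise": ["8", 0], "guider": ["9", 0], "sampler": ["7", 0],
                      "sigmas": ["6", 0], "latent_image": ["5", 0]}},
    "11": {"class_type": "VAEDecode", "inputs": {"samples": ["10", 0], "vae": ["3", 0]}},
    "12": {"class_type": "SaveImage",
           "inputs": {"images": ["11", 0], "filename_prefix": "flux_mug"}},
}


def dangling_links(graph):
    """Return (node_id, input_name) pairs whose link points at a missing node."""
    missing = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            is_link = (isinstance(value, list) and len(value) == 2
                       and isinstance(value[0], str))
            if is_link and value[0] not in graph:
                missing.append((node_id, name))
    return missing


workflow_json = json.dumps(workflow, indent=2)
```

Checking `dangling_links(workflow)` before submitting a graph catches wiring typos early, which is cheaper than waiting for the server to reject it.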

By spring 2026, Flux's position is clear. **"The high-water mark of photorealistic quality you can download as open weights."** It replaced SDXL as the new base, and hundreds of Flux-based LoRAs land on Civitai every week.

Chapter 3 · Midjourney 7 — The Aesthetic Standard

Midjourney has held one line from beginning to end: **"We sell aesthetics, not technical accuracy."** No public API, an interactive workflow on Discord (and since 2024 a native web app), and results that are always "artistic." Same prompt, SDXL leans photographic, Midjourney leans painterly.

The **V7 alpha** appeared in April 2025 and became the default model that summer; by spring 2026 V7 is the established default. The main V7 changes.

**1. Stronger character and style consistency.** `--cref` (character reference) and `--sref` (style reference) flags appeared in V6. In V7, `--sref` got far more precise, and character reference is handled by Omni Reference (`--oref`). Putting the same character into different scenes or transferring the look of one photo onto another prompt is stable.

**2. Personalize model.** Midjourney's model fine-tuned on a user's likes. After roughly 200 paired ratings it activates, and you trigger it with `--p`. The same prompt now produces different aesthetic results per user.

**3. Video mode.** V1 video shipped in mid-2025. It animates still images into 5-10 second clips. The category competes with Luma, Runway, and Pika, but the differentiator is that Midjourney's aesthetic consistency survives in video.

**4. Moodboards UI.** In the web interface you can collect a grid of images into a moodboard and use the entire board as a style guide instead of `--sref`.

Midjourney 7 pricing: `$10/month` (Basic, about 3.3 hours of fast GPU time), `$30/month` (Standard, 15 hours), `$60/month` (Pro, 30 hours plus Stealth Mode), `$120/month` (Mega, 60 hours). The hours are fast-queue time; Standard and above also get unlimited generation on the slower Relax queue.

Technically Midjourney does not publish its architecture. The best guess is latent diffusion plus heavy in-house RLHF. User data is the company's core asset, and new "style tokens" surface on the community subreddit each week.

Midjourney's two weak spots. **No API.** Automation and service integration are painful (third-party wrappers around Discord exist but violate ToS). **Text-in-image is weak.** Posters, signage, anything where letterforms are the point — concede to Ideogram or Flux Pro.

That said, for "ad concept, fashion lookbook, book cover, moodboard, illustration, painterly style" — categories where aesthetics drive 90 percent of the result — Midjourney 7 is still the standard.

Chapter 4 · Ideogram 3 — The Answer for Text-in-Image

When Ideogram first appeared in late 2023, the biggest shock was that **text in an image was actually correct**. Every other model trying to draw a "STORE" sign produced "STOORE," "STOPRE," or "STORF." Only Ideogram drew "STORE."

After Ideogram 2, the 2025 launch of **Ideogram 3** extended that strength.

**Text fidelity.** English is almost perfect. Korean, Japanese, and Chinese are far less awkward than in V2. Font style (serif, sans, hand-drawn), letter size, alignment, multilingual mixing — you can direct all of it in natural language.

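**Magic Fill / Magic Prompt.** Magic Fill is Ideogram's inpainting. Mask a region of the image and say "change this here." Text-area edits are especially strong. Swapping "BLACK FRIDAY" to "CYBER MONDAY" on a poster works cleanly. Magic Prompt is a separate feature that expands a short prompt into a more detailed one before generation.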

**Style library.** Around 4,400 predefined style tokens as of spring 2026. Drop "Vintage Travel Poster," "1980s Anime," or "Watercolor Illustration" into a prompt and the result stays consistent.

Ideogram pricing: `$8/month` (Basic), `$20/month` (Plus), `$60/month` (Pro). A free tier lets anyone test it. The public API also makes it easy to integrate into marketing and design tools.

Ideogram's strong categories are obvious. **Posters, ad banners, book covers, business cards, T-shirt designs, Instagram cards, menus** — anything where text matters. Pure illustration quality sits a step below Midjourney, but under the constraint "the letters have to be correct," Ideogram is the answer.

Technically, the conjecture for why Ideogram wins on text is that text rendering is treated as a separate loss term during training. A vanilla diffusion model treats all pixels equally; Ideogram reportedly runs an OCR-style auxiliary model on generated images, re-reads the text, and back-propagates the accuracy as a loss.
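To make the conjecture concrete, here is a sketch of the kind of score an OCR-style check could produce: the edit distance between the text that was requested and the text that was read back from the image. This is not Ideogram's method, which is unpublished. The function names are hypothetical, and a real training loss would need a differentiable proxy rather than this discrete count.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def text_penalty(target: str, ocr_read: str) -> float:
    """Character error rate of the rendered text against the requested text."""
    return edit_distance(target, ocr_read) / max(len(target), 1)


# The misspellings quoted earlier in this chapter, scored against "STORE".
penalties = {read: text_penalty("STORE", read)
             for read in ["STORE", "STOORE", "STOPRE", "STORF"]}
```

A correct render scores 0.0, and each of the three misspellings is one edit away from the target, so a model penalized this way has a direct signal that "almost right" lettering is still wrong.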

Chapter 5 · Recraft V3 — Design (Vectors / Logos) Specialty

Recraft started in a different category from every other model. **AI that outputs vector (SVG) images, not raster (pixel) images.** Logos, icons, illustrations, patterns — results designers can take straight into Illustrator or Figma.

The late-2024 Recraft V3 held the top of the Artificial Analysis Text-to-Image Arena for a while. Text fidelity, design quality, and SVG output all moved the needle.

Recraft's core features.

**Vector mode.** Give it a prompt, get SVG. Download the SVG and open it in Illustrator for further edits. Color palettes and layer structure are clean. Genuinely useful as a first draft of a logo.

**Brand Style.** Upload a few images of your brand and Recraft extracts the style for consistent output. It learns "our company's illustration style."

**Mockup.** Once you make a design, it auto-applies to T-shirts, mugs, posters, laptop cases, and more. Useful for ecommerce and POD (Print on Demand) businesses.

**Recraft API.** Plug into design workflows. Webflow, Framer, and Figma plugins are already integrated.
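Why vector output matters for the features above is easiest to see in code. The sketch below builds a tiny hand-written SVG (a stand-in, not actual Recraft output) with the standard library and then recolors one shape by changing a single attribute. The function names and colors are hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # serialize without an "ns0:" prefix


def make_logo_svg(fill: str) -> str:
    """Build a two-shape logo: a filled circle with a white square inside."""
    svg = ET.Element(f"{{{SVG_NS}}}svg",
                     {"viewBox": "0 0 100 100", "width": "100", "height": "100"})
    ET.SubElement(svg, f"{{{SVG_NS}}}circle",
                  {"cx": "50", "cy": "50", "r": "40", "fill": fill})
    ET.SubElement(svg, f"{{{SVG_NS}}}rect",
                  {"x": "35", "y": "35", "width": "30", "height": "30",
                   "fill": "#ffffff"})
    return ET.tostring(svg, encoding="unicode")


def recolor(svg_text: str, old: str, new: str) -> str:
    """Swap one fill color for another on every shape that uses it."""
    root = ET.fromstring(svg_text)
    for element in root.iter():
        if element.get("fill") == old:
            element.set("fill", new)
    return ET.tostring(root, encoding="unicode")


logo = make_logo_svg("#e63946")
blue_logo = recolor(logo, "#e63946", "#1d3557")
```

With a raster image the same color change means repainting pixels and cleaning up edges; with SVG the shapes stay resolution-independent and each one remains individually editable.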

Pricing: free tier (daily quota), Basic `$10/month`, Advanced `$33/month`, Enterprise on request. The API charges per call.

Recraft's weakness is **photorealism**. Unlike Flux or Midjourney, it's not built for "photo-grade" output. It's a specialist tool for the narrow category of design and illustration.

The reason Recraft matters lies elsewhere. **It signals that AI image generation is no longer "one generalist model" but is splitting along category lines.** Photos for Flux, painterly illustration for Midjourney, text posters for Ideogram, vectors and logos for Recraft. For a while the trend was "everything in one model"; 2025-2026 is bending back toward category specialization.

Chapter 6 · Stable Diffusion 3.5 — After the Stability AI Restructure

Stability AI went through a major upheaval in 2024. CEO Emad Mostaque left, much of the core research team moved to Black Forest Labs, and the company nearly collapsed. Under new leadership (Sean Parker joined the board) the org reorganized, and the October 2024 launch of **Stable Diffusion 3.5** marked stabilization.

SD 3.5 came in three variants.

- **SD 3.5 Large** — 8.1B parameters. Full precision wants 24 GB VRAM. Fp8 runs on 16 GB.

- **SD 3.5 Medium** — 2.5B parameters. Runs on 12 GB.

- **SD 3.5 Large Turbo** — A 4-step distilled version of Large, for fast inference.
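The VRAM figures in the list follow from simple arithmetic on the parameter counts. A rough sketch, assuming 2 bytes per parameter for fp16 and 1 byte for fp8, and counting only the diffusion model's weights (text encoders, the VAE, and activations need additional memory on top):

```python
BYTES_PER_PARAM = {"fp16": 2, "fp8": 1}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9


# Parameter counts as quoted in the list above.
estimates = {
    ("SD 3.5 Large", "fp16"): weight_gb(8.1, "fp16"),
    ("SD 3.5 Large", "fp8"): weight_gb(8.1, "fp8"),
    ("SD 3.5 Medium", "fp16"): weight_gb(2.5, "fp16"),
}
```

Large at fp16 comes to about 16.2 GB of weights, which is why a 24 GB card is the comfortable target, while fp8 halves that to about 8.1 GB and leaves headroom on 16 GB.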

License: **Stability AI Community License**. Free for commercial use for companies and individuals under 1M USD in annual revenue; over that requires a separate enterprise license. The 2024 SD3 Medium license drew backlash for being too restrictive, and 3.5 relaxed it.

Technically, SD 3.5 keeps the MM-DiT (Multimodal Diffusion Transformer) architecture. Same lineage as Flux, carrying traces of the last joint work before the BFL founders left.

In ComfyUI, SD 3.5 looks like this:

1) Load Checkpoint -> sd3.5_large.safetensors

2) CLIPTextEncodeSD3 (clip_g + clip_l + t5xxl)

Positive: "A close-up portrait of a woman with curly hair, golden hour lighting"

Negative: "blurry, low quality, distorted hands"

3) EmptySD3LatentImage -> 1024x1024

4) ModelSamplingSD3 -> shift 3.0

5) KSampler -> euler / sgm_uniform / 28 steps / cfg 4.5

6) VAE Decode (sd3.5 vae) -> Save Image
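The `shift 3.0` setting in step 4 is worth unpacking. As I understand it, the shift remaps each noise level so that more of the sampling budget is spent at high noise, using the timestep-shift formula from the SD3 paper. The sketch below applies that formula to an evenly spaced grid; the grid is a simplification and not the exact `sgm_uniform` schedule, and the function names are hypothetical.

```python
def shift_sigma(sigma: float, shift: float) -> float:
    """Remap a noise level in [0, 1]; shift > 1 pushes values toward 1."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(steps: int, shift: float) -> list:
    """Evenly spaced noise levels from 1.0 down to 0.0, then shifted."""
    base = [1.0 - i / steps for i in range(steps + 1)]
    return [shift_sigma(s, shift) for s in base]


schedule = shifted_schedule(28, 3.0)
midpoint = shift_sigma(0.5, 3.0)
```

With a shift of 3.0 the midpoint noise level 0.5 becomes 0.75, while the endpoints 0.0 and 1.0 stay fixed, so the sampler lingers longer in the noisy part of the trajectory where global composition is decided.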

SD 3.5's position is subtle. **On raw quality, Flux Dev is a step ahead.** But SD 3.5 has a **clearer license** (an explicit revenue threshold for free use) and a **richer community ecosystem** (the SD 1.5 and SDXL LoRA, ControlNet, and IPAdapter heritage is migrating).

By spring 2026, **two open-weight base models have settled in.** If photorealism, text fidelity, and reliable fingers come first, choose Flux Dev. If license clarity and the broadest community LoRA library matter more, choose SD 3.5. SD 1.5 and SDXL are sliding into legacy.

Chapter 7 · Google Imagen 3 / 4 / ImageFX

Google has always been in the "second to ship" position on AI image. Imagen 1 was a paper without public weights or access, and Imagen 2 reached users only in limited form. With Imagen 3 in mid-2024, access opened up broadly.

Imagen 3 is reachable two ways.

**ImageFX** — Google Labs's free web UI at labs.google/fx/tools/image-fx. Anyone gets a daily quota. Imagen 3 powered.

**Vertex AI / Gemini API** — Google Cloud's enterprise path. Pay per call. Safety filters, the SynthID watermark, enterprise SLAs.

In May 2025 **Imagen 4** was announced, and by spring 2026 it is available in both ImageFX and the Gemini API. Imagen 4's changes.

- **Text fidelity** — The pre-3 weakness with in-image text is now close to Ideogram.

- **Multilingual prompts** — Non-English (Korean, Japanese, Chinese) prompt comprehension improved markedly. Type "노을 진 한강의 풍경" in Korean and you get a meaningful result.

- **SynthID watermark** — Google's strongly pushed invisible watermark. Imperceptible to humans but Google's detector identifies it as "AI generated."

Imagen's strength is **Google ecosystem integration**. Call image generation straight from Gemini, drop output into Google Workspace (Docs, Slides), use it naturally inside NotebookLM or Google AI Studio.

The weakness is **conservative safety filters**. Generating human faces is restricted (some race-gender combinations especially), and political figures, depictions of violence, and sexual implication all carry strong constraints. Good enough for ad and marketing illustration, but cramped as a free-form creative tool.

Chapter 8 · OpenAI GPT-4o Image (March 2025 Ghibli Moment) / DALL·E 4

OpenAI's DALL·E 3, integrated into ChatGPT in late 2023, was hugely influential. The current shifted in 2025. **GPT-4o's native image generation** launched, and instead of a separate DALL·E model, GPT-4o itself produces images.

**The March 2025 Ghibli moment.** In the days after GPT-4o image generation rolled out in ChatGPT, "make it Studio Ghibli style" exploded on Twitter (X). Personal photos, family photos, company logos, city scenes: anything got transformed into Hayao Miyazaki style and posted. OpenAI servers nearly melted for days, and Sam Altman tweeted that "our GPUs are melting."

Three things the moment meant.

**1. Natural image generation inside a chat UI redefined the category.** Instead of opening a separate tool and typing a prompt, you say "draw this in Ghibli style" mid-conversation and the result arrives. The friction delta produced a 100x usage delta.

**2. The cultural shock of style transfer.** One word, "Ghibli," pours an entire studio's painterly style onto anyone's everyday photo. Debates about copyright and creator rights exploded, and Hayao Miyazaki's 2016 reaction to an AI animation demo ("an insult to life itself") was re-circulated.

**3. The future of model unification.** The split between "image model" and "text model" started crumbling. GPT-4o handles text, image, audio, and video inside one model. As this multimodal unification becomes standard, the UX of "calling DALL·E separately" gradually disappears.

**The DALL·E 4 rumor.** No public release as of spring 2026, but industry chatter suggests a new dedicated image model is in the works to follow up the GPT-4o image release. The guess is integration with video generation (the Sora lineage) and a larger text encoder.

GPT-4o image pricing: included in ChatGPT Plus (`$20/month`), with a quota on ChatGPT Free, and metered separately on the API (priced per image output token).

Chapter 9 · Adobe Firefly 4 — Clean Training Data License

Adobe has run its own image model **Firefly** since 2023. The biggest difference from other models is one thing. **The training data has clean licensing.** Adobe says it trained only on Adobe Stock, public domain, and properly licensed images.

This licensing promise clearly targets one market. **Enterprise and ad agencies.** Users who, when handing off the result to a client, need a guarantee that "this image does not infringe anyone's copyright." Adobe **even backs Firefly output with legal indemnification**.

Firefly 4's position as of spring 2026.

**Quality** — Often rated a notch or two below Flux Pro, Midjourney 7, and Imagen 4. That "notch or two" is rarely a problem in everyday use. Good enough for advertising and marketing.

**Integration** — Photoshop, Illustrator, Premiere Pro, Express. Firefly is deeply integrated into every Adobe product. Photoshop's Generative Fill, Illustrator's Generative Recolor, Premiere's Generative Extend (automatic clip-length extension), and so on. To Adobe users it's not a separate tool but a part of daily work.

**Subscription** — Bundled into Adobe Creative Cloud, metered as "generative credits." A standalone Firefly Premium subscription also exists.

**Custom Models** — Enterprise can finetune Firefly on its own imagery. Brand guideline conformity, consistent characters, and so on.

Firefly 4's weakness is **creative latitude**. Safety filters and licensing policy are conservative, so images "somehow won't generate" with some frequency. Cramped as a free-form creative tool.

But Firefly's market value lies elsewhere. **In the "legal safety first" enterprise market, Firefly is close to a monopoly.** Ad agencies, enterprise marketing, government design contracts — you cannot use Midjourney or Flux in those markets.

Chapter 10 · Krea AI / Photon (Luma) — Real-time Generation

A new category emerged in 2024-2025: **real-time image generation**. Models and interfaces where a single image takes 50 ms instead of five seconds.

**Krea AI** is the most visible interface in the category, at krea.ai. Sketch roughly on a canvas with the mouse and diffusion overlays in real time. Change the color and the result follows immediately. Edit the prompt and the result refreshes with almost no lag.

Internally Krea **distills** base models like Flux, SD 3.5, and SDXL with **LCM (Latent Consistency Model) or Turbo** so they generate in roughly one to four steps, then ships its own canvas UI on top. The UX shifts from "prompt, wait" to "prompt, interactive canvas."

**Luma Photon** is the image model from Luma Labs. Originally known for the Dream Machine video model, Luma announced the image-only Photon in late 2024. The pitch: **balance of fast inference and photorealistic quality**. Photon is available via API and the Luma web interface.

**fal.ai** is the infrastructure company hosting this real-time, fast-inference layer. Calling fast variants like Flux Schnell, SDXL Lightning, or SD 3.5 Turbo through fal.ai returns near-real-time responses. fal.ai also lets you host ComfyUI workflows directly as a server.

Three use cases where real-time generation matters.

**1. Design exploration.** Sliders for color, composition, and material with the result moving in real time. When the "result, edit, regenerate" loop is 50 ms, design thinking becomes a different shape.

**2. Real-time collaboration.** Diffusion output becomes a layer inside collaborative canvases like Figma and Miro. One person draws a shape and the others see the AI-overlaid version live.

**3. Live content.** Live streaming, VJ-ing (live visuals), real-time advertising — using live diffusion as part of the content itself is a growing case.

Pricing varies by model and infrastructure. On fal.ai, one Flux Schnell image is about `$0.003`; one SDXL Lightning is about `$0.001`. An hour of use rarely costs more than a few dollars.
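A quick back-of-the-envelope check of that claim, using the per-image prices quoted above. The usage rate of 600 images per hour (one every six seconds) is a hypothetical figure chosen for illustration, not a measurement.

```python
# Per-image prices as quoted above, in USD.
PRICE_PER_IMAGE = {"Flux Schnell": 0.003, "SDXL Lightning": 0.001}


def hourly_cost(price_per_image: float, images_per_hour: int) -> float:
    """Cost of one hour of generation at a steady request rate."""
    return price_per_image * images_per_hour


IMAGES_PER_HOUR = 600  # assumption: one generation every six seconds
costs = {model: hourly_cost(price, IMAGES_PER_HOUR)
         for model, price in PRICE_PER_IMAGE.items()}
```

At that rate an hour costs about 1.80 USD on Flux Schnell and 0.60 USD on SDXL Lightning; the total scales linearly, so a heavier session is easy to estimate by changing `IMAGES_PER_HOUR`.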

Chapter 11 · ComfyUI — The Node-based Workflow Standard

For open-weight image generation, the standard tool as of spring 2026 is **ComfyUI**. It is a node-based workflow GUI that emerged in early 2023, and by now Stability AI, Black Forest Labs, NVIDIA, and Apple all ship their models with "ComfyUI workflow examples."

ComfyUI's core idea is that **everything is a node**.

- Load model → node

- Encode text → node

- Initialize latent noise → node

- Diffusion step → node

- VAE decode → node

- Save → node

Each node has input and output ports, and you wire nodes together to form a graph. The graph saves to JSON, and anyone who imports the JSON with the same models and seed can reproduce the output.
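The "everything is a node" idea reduces to a dependency graph. A minimal sketch, using made-up node names that mirror the list above rather than ComfyUI's real class names: each node lists the nodes it depends on, the graph round-trips through JSON, and a topological sort gives a valid execution order.

```python
import json
from graphlib import TopologicalSorter

# node -> list of nodes whose outputs it consumes (illustrative names)
graph = {
    "load_model": [],
    "encode_text": ["load_model"],
    "init_latent": [],
    "diffusion": ["load_model", "encode_text", "init_latent"],
    "vae_decode": ["load_model", "diffusion"],
    "save": ["vae_decode"],
}

# Saving and re-importing the graph loses nothing.
restored = json.loads(json.dumps(graph))

# A node can run only after all of its dependencies have run.
order = list(TopologicalSorter(restored).static_order())
```

Several orders are valid (for example `init_latent` can run before or after `encode_text`), but every valid order places `diffusion` before `vae_decode` and `vae_decode` before `save`.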

ComfyUI's strengths.

**1. Reproducibility.** Share the workflow JSON and others can hit the same result. When you download a LoRA from Civitai, a "recommended ComfyUI workflow" usually comes with it.

**2. Natural expression of complex pipelines.** Pipelines like "text, first diffusion, upscale, apply ControlNet, second diffusion, post-process" sit naturally in a graph.

**3. Custom-node ecosystem.** Thousands of custom-node packages live on GitHub. ComfyUI-Manager installs them in one click, and you can grab "the nodes for this specific use case" as a bundle.

**4. API mode.** ComfyUI is a GUI but also an HTTP API. POST a workflow JSON to a ComfyUI instance and the job is queued; the finished image is then fetched from the same server. fal.ai, RunPod, and others host ComfyUI as serverless.
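A minimal sketch of API mode, assuming a local ComfyUI server on its default address `127.0.0.1:8188` and the `/prompt` and `/history/{prompt_id}` routes used by ComfyUI's bundled API example scripts. The helper names and the `client_id` value are hypothetical. Building the request is kept separate from sending it so the payload can be inspected without a running server.

```python
import json
import urllib.request

DEFAULT_HOST = "127.0.0.1:8188"  # assumption: default local ComfyUI address


def build_prompt_request(workflow: dict, host: str = DEFAULT_HOST,
                         client_id: str = "demo-client") -> urllib.request.Request:
    """Wrap an API-format workflow in the POST request that queues it."""
    payload = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_prompt(workflow: dict, host: str = DEFAULT_HOST) -> str:
    """Send the workflow; the server replies with an id for the queued job."""
    with urllib.request.urlopen(build_prompt_request(workflow, host)) as response:
        return json.load(response)["prompt_id"]


def history_url(prompt_id: str, host: str = DEFAULT_HOST) -> str:
    """URL to poll for the finished job's outputs (image file names)."""
    return f"http://{host}/history/{prompt_id}"


# Placeholder one-node workflow, only to show the request shape.
example = {"1": {"class_type": "EmptyLatentImage",
                 "inputs": {"width": 1024, "height": 1024, "batch_size": 1}}}
request = build_prompt_request(example)
```

The call is asynchronous: `queue_prompt` returns as soon as the job is accepted, and the client polls the history route (or listens on the server's websocket) until the output file names appear.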

ComfyUI's weakness is the **learning curve**. Users accustomed to AUTOMATIC1111's WebUI or Fooocus's form-based UI find a node graph foreign at first. But once workflows get complex, there's essentially no alternative to graphs.

**Other tools worth noting.**

- **AUTOMATIC1111 / SD WebUI** — The oldest SD GUI, form-based. Good through SDXL by spring 2026, a step behind ComfyUI on Flux and SD 3.5.

- **Forge** — A1111 fork focused on performance. Lower VRAM.

- **InvokeAI** — A more designer-friendly interface. Inpainting and outpainting feel natural.

- **Fooocus** — A Midjourney-style simple interface. Fill two or three fields, get a result.

**Civitai** is the community hub for LoRA, checkpoints, and embeddings. Users upload their LoRAs and others download to use. By spring 2026, Flux- and SD 3.5-based LoRAs see the most uploads, and the NSFW policy debate continues.

**HuggingFace** is the official hub for model weights. BFL's Flux line, Stability AI's SD 3.5, and finetunes on top of them all live there. Where Civitai centers on community LoRAs, HuggingFace centers on base and research models.

Chapter 12 · LoRA / ControlNet / IPAdapter — Workflow Building Blocks

If you do open-weight image generation seriously, you need three building blocks.

**1. LoRA (Low-Rank Adaptation).** Skip retraining the whole base model and train a small adapter (about 10-100 MB) that changes the model's behavior. Used to teach one character, one art style, or one concept. SDXL LoRA was the richest catalog; by spring 2026, the center of gravity is shifting to Flux Dev LoRA.

What you need to make a LoRA: 20-100 reference images, captions for those images, and roughly 10-30 minutes of GPU time (RTX 4090 baseline). Train with Kohya_ss, OneTrainer, or ai-toolkit.

Using a LoRA in ComfyUI:

1) Load Checkpoint -> base model

2) Load LoRA -> my_character.safetensors / strength 0.8

3) CLIP Text Encode -> "a portrait of <trigger_word>, soft lighting"

4) Standard KSampler flow afterwards

`trigger_word` is the token chosen at training time. When the prompt contains it, the LoRA activates.
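Two numbers in this section, the small adapter size and the `strength 0.8` setting, both come from the low-rank update in the LoRA paper: the adapted weight is the base weight plus a scaled product of two thin matrices. A sketch in plain Python, where the 3072-wide layer and rank 16 are hypothetical sizes chosen for illustration:

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two thin matrices: up (d_out x r) and down (r x d_in)."""
    return rank * (d_out + d_in)


def apply_lora(weight, lora_up, lora_down, strength, alpha, rank):
    """Return W + strength * (alpha / rank) * (up @ down) for nested lists."""
    scale = strength * alpha / rank
    rows, cols = len(weight), len(weight[0])
    return [
        [weight[i][j]
         + scale * sum(lora_up[i][r] * lora_down[r][j] for r in range(rank))
         for j in range(cols)]
        for i in range(rows)
    ]


full_params = 3072 * 3072                         # one dense layer
adapter_params = lora_param_count(3072, 3072, 16)  # its rank-16 adapter

# Tiny worked example: 2x2 identity weight, rank-1 adapter, strength 0.8.
merged = apply_lora([[1.0, 0.0], [0.0, 1.0]],
                    lora_up=[[1.0], [2.0]],
                    lora_down=[[3.0, 4.0]],
                    strength=0.8, alpha=1.0, rank=1)
```

For this layer the adapter holds 98,304 parameters against 9,437,184 in the full matrix, roughly 1 percent, which is why LoRA files stay in the tens of megabytes. Lowering `strength` simply shrinks the added term, so 0.0 gives back the base model unchanged.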

**2. ControlNet.** Extract structural information (edges, pose, depth map, segmentation) from an input image and generate a new image conditioned on that structure. Workflows like "keep the pose of this photo, change only the outfit" work.

ControlNet's main modes:

- **Canny edge** — Edge extraction. Preserve the composition.

- **OpenPose** — Human pose extraction. Same pose, different character.

- **Depth** — Depth-map extraction. Preserve spatial structure.

- **Tile** — Detail enhancement, upscaling.

- **Inpaint** — Regenerate only the masked region.

SDXL ControlNet is very rich; Flux ControlNet is filling in fast. SD 3.5 ControlNet is not yet at SDXL levels but covers the main modes.
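To show what "structural information" means for the edge mode, here is a deliberately simplified edge extractor. It is not the Canny algorithm (which adds smoothing, non-maximum suppression, and hysteresis thresholds) but a plain forward-difference threshold on a tiny grayscale grid, enough to see that the conditioning image keeps outlines and discards everything else. The function name and test image are hypothetical.

```python
def edge_map(gray, threshold):
    """Mark pixels whose brightness jumps toward the right or downward neighbor."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = gray[y][min(x + 1, width - 1)]   # clamp at the border
            down = gray[min(y + 1, height - 1)][x]
            gradient = abs(right - gray[y][x]) + abs(down - gray[y][x])
            edges[y][x] = 1 if gradient >= threshold else 0
    return edges


# 4x4 image: dark left half, bright right half.
image = [[0, 0, 255, 255] for _ in range(4)]
edges = edge_map(image, threshold=128)
```

The result is a single vertical line of ones at the dark-to-bright boundary. A ControlNet conditioned on such a map is free to change color, texture, and subject while keeping that boundary in place.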

**3. IPAdapter (Image Prompt Adapter).** An adapter that uses an image itself as the prompt. Carry "this style, mood, and color palette" through a reference image when text alone can't. Using CLIP embeddings, it injects the meaning of the reference image into diffusion.

IPAdapter use cases.

- **Style transfer** — Photo to painting, painting to photo.

- **Color palette consistency** — A series of images keeping the same color tone.

- **Character consistency** — One face appearing in many scenes.

ControlNet and IPAdapter shine together. ControlNet sets structure, IPAdapter brings style and mood.

**Img-to-Img / Inpainting / Outpainting** belong on the list too. Img-to-Img takes an existing image, partially re-noises it, and re-denoises. Inpainting regenerates only the masked region. Outpainting extends the image past its edges. All three are baseline across every open-weight model.
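The "partially re-noises" step in Img-to-Img is usually controlled by a strength (or denoise) value between 0 and 1. A sketch of the common convention, as I understand it from Diffusers-style pipelines: strength decides how many of the scheduled steps actually run, and sampling starts partway into the schedule. The function name is hypothetical.

```python
def img2img_steps(num_steps: int, strength: float):
    """Return (start_index, steps_to_run) for a given denoising strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_to_run = min(int(num_steps * strength), num_steps)
    start_index = num_steps - steps_to_run  # earlier steps are skipped
    return start_index, steps_to_run


light_touch = img2img_steps(28, 0.3)   # keeps most of the input image
heavy_rework = img2img_steps(28, 0.6)  # repaints more of it
from_scratch = img2img_steps(28, 1.0)  # equivalent to plain text-to-image
```

Low strength adds little noise and runs only the last few steps, so composition survives; at 1.0 the input is fully replaced by noise and every step runs.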

Chapter 13 · Korean / Japanese AI Image (NovelAI, Sakana, Tsuzumi)

Look only at English-speaking models and you miss half the market. Korea and Japan run their own ecosystems on the side.

**Korea.**

- **Kakao KoGPT image** — Kakao's in-house image generation. Integrated into KakaoTalk and KakaoTalk Channel. Strong on Korean illustration (webtoon style, hanbok, Korean food, and so on).

- **Naver CLOVA X (CLOVA Studio)** — Naver's combined LLM-image platform on top of HyperCLOVA X. Integrated into Naver Search, Naver Blog, and Naver Shopping. Korean-language prompt comprehension is natural.

- **lytics** — A Korean startup focused on advertising and marketing image generation. The model itself is SDXL or Flux base finetuned with LoRAs trained on Korean product data.

**Japan.**

- **NovelAI** — In effect the standard for Japanese anime style image generation since 2022. NovelAI Diffusion V4 (2025) is in a different league from other SDXL-based competitors on anime and illustration quality. Proprietary dataset and finetuning.

- **Sakana AI** — Tokyo-headquartered. UK-born researcher David Ha is a co-founder. Known for original research like evolutionary model merging. Developing Japanese LLMs and Japanese multimodal models, with growing government and enterprise tie-ins.

- **NTT Tsuzumi** — Japanese LLM developed by NTT. Stronger on multimodal understanding (describing images in text) than on raw image generation.

- **Yi-Vision** — 01.AI (China) model, but frequently cited in Japan and Korea. Multimodal understanding strong on OCR and document analysis.

Regional models matter for two reasons. **First, language and cultural understanding.** Ask a Korean model in Korean for "갈치조림" and it draws the correct dish. A global model often has no idea what "갈치" is. **Second, data sovereignty.** Government contracts, public agencies, and some enterprises don't want data on foreign clouds. They need models hosted in domestic data centers.

The weaknesses are clear too. **Absolute quality** doesn't yet match Flux, Midjourney, or Imagen 4. On generic photorealistic output the global models lead by a step. The regional models win in narrow categories — "Korean context," "Japanese anime style."

Chapter 14 · Who Should Pick What — Advertising / Product Design / Comics / Marketing

We've covered 11 models and tools. So what should real users pick? Organized by use case.

**Advertising and marketing visuals (agencies and in-house).**

A safe combo: **Midjourney 7 (concept) + Adobe Firefly 4 (delivery)**. Sketch concept moodboards quickly with Midjourney, then make finals with Firefly once the client approves. Firefly has clean licensing, so legal risk goes away. For ad banners that hinge on text, route to Ideogram 3 separately.

**Product photography (e-commerce, brand).**

**Flux 1.1 Pro or Flux Kontext**. Photorealistic quality is the most reliable here. Kontext is especially handy for "keep the product, change the background." Many teams keep an SDXL-era IPAdapter and ControlNet workflow running in ComfyUI as-is.

**Logo, icon, and illustration design.**

**Recraft V3**. Vector output is the decisive feature. With other models you redraw in Illustrator afterward; with Recraft you start in SVG. A mixed flow — concept in Midjourney 7, vectorize in Recraft — also works.

**Book covers, posters, album art.**

**Ideogram 3 (when text is the point) + Midjourney 7 (when imagery is the point)**. If text occupies a large area, Ideogram. If imagery is the focus with smaller text, generate in Midjourney and overlay text in Photoshop or Figma later.

**Webtoons, comics, illustration.**

**NovelAI (when anime style is the point) or SD 3.5 / Flux Dev + LoRA**. NovelAI's illustration quality is overwhelming but locks you into a service. To grow your own style, base on SD 3.5 or Flux Dev and train LoRAs on your own art. ComfyUI workflows hold consistency.

**Personal creation and experimentation.**

**ChatGPT (GPT-4o image)** has the lowest friction. Just type "make this" in chat. For more creative freedom, run Stable Diffusion 3.5 or Flux Dev locally.

**Design exploration and real-time collaboration.**

**Krea AI or Photon**. The friction of real-time generation changes design thinking itself. Figma and Miro integrations are getting more natural.

**Enterprise and government.**

**Adobe Firefly 4** (licensing), **Google Imagen 4 (Vertex AI)** (infrastructure and SLA), or **self-hosted SD 3.5 or Flux** (data sovereignty). A category where the user isn't picking alone — security, legal, and finance pick together.

**One more — the era of "one model for everything" is over.** Around 2024, "Midjourney for everything" was a viable answer. By 2026, a serious user runs two or three models in parallel. Flux for photos, Midjourney for illustration, Ideogram for text, Recraft for vectors, plus an in-house LoRA on SD 3.5. The era of locking into a single model or vendor has ended.

Epilogue — The Next Two Years

To close, two directions visible from spring 2026.

**1. Multimodal unification.** What GPT-4o showed — "text, image, audio, and video in one model" — becomes the norm. The footprint of "separate image models" like DALL·E and Imagen shrinks. UX converges on a chat-plus-canvas hybrid.

**2. The video generation explosion.** Sora, Kling, Hailuo, Runway Gen-3, Luma Dream Machine, and Veo 2 in 2024; Veo 3 and Sora 2 in 2025. Image-tested techniques keep flowing into video. The "image model versus video model" boundary blurs (the same company runs both, the same UI calls both).

Image generation itself is no longer "the most shocking AI technology." The 2022 shock of DALL·E 2 is now everyday. By 2026 we use image generation as a tool and wait for the next shock on top. Whatever that next shock turns out to be, the models in this piece — Flux, Midjourney, Ideogram, Recraft, SD 3.5, Imagen, GPT-4o, Firefly — will keep working as someone's everyday tool.

References

- Black Forest Labs Flux: https://blackforestlabs.ai/

- Flux on HuggingFace: https://huggingface.co/black-forest-labs

- Midjourney: https://www.midjourney.com/

- Ideogram: https://ideogram.ai/

- Recraft: https://www.recraft.ai/

- Stable Diffusion 3.5 (Stability AI): https://stability.ai/news/introducing-stable-diffusion-3-5

- Google ImageFX: https://labs.google/fx/tools/image-fx

- Google Vertex AI Imagen: https://cloud.google.com/vertex-ai/generative-ai/docs/image/overview

- OpenAI DALL·E and GPT-4o image: https://openai.com/index/dall-e-3/

- Adobe Firefly: https://www.adobe.com/products/firefly.html

- Krea AI: https://www.krea.ai/

- Luma Photon: https://lumalabs.ai/

- fal.ai: https://fal.ai/

- ComfyUI: https://www.comfy.org/

- AUTOMATIC1111 / Stable Diffusion WebUI: https://github.com/AUTOMATIC1111/stable-diffusion-webui

- Forge: https://github.com/lllyasviel/stable-diffusion-webui-forge

- InvokeAI: https://invoke.com/

- Fooocus: https://github.com/lllyasviel/Fooocus

- Civitai: https://civitai.com/

- HuggingFace Diffusers: https://huggingface.co/docs/diffusers

- LoRA paper (Hu et al., 2021): https://arxiv.org/abs/2106.09685

- ControlNet paper (Zhang et al., 2023): https://arxiv.org/abs/2302.05543

- IPAdapter paper (Ye et al., 2023): https://arxiv.org/abs/2308.06721

- NovelAI: https://novelai.net/

- Sakana AI: https://sakana.ai/

- NTT Tsuzumi: https://www.rd.ntt/e/research/JN202310_15738.html

- Kakao Brain (KoGPT): https://kakaobrain.com/

- Naver HyperCLOVA X: https://clova.ai/hyperclova
