
Prologue — In 2026, you no longer buy clothes blind

As recently as 2019, "buying clothes online" was an infinite loop: look at product photos, look at the size guide, order anyway, return it. The biggest cost in fashion e-commerce was **return logistics**, and apparel return rates hovered around 30 to 40 percent.

The 2026 landscape is different.

- **IDM-VTON, OOTDiffusion, CatVTON, Outfit Anyone, StableVITON, MMTryon** — diffusion-based Virtual Try-On models produce photorealistic fit images at 1024x768.

- **Google Shopping**, **Amazon Virtual Try-On for Shoes**, **Nike Fit**, **Warby Parker** — major retailers ship VTON as a standard mobile feature.

- Korean stacks like **Musinsa visual search**, **ABLY AI recommendations**, **PerfectFit VTON**, **Blackpin body measurement**, and **Doodle AI** are turning fashion AI into K-fashion infrastructure.

- Japanese solutions like **ZOZOSUIT** and **ZOZOMAT** digitized sizing through a dot-pattern suit and a foot measurement mat.

- **Cala** (acquired by Adobe), **Resleeve.ai**, and **Mosaic** moved AI past designer assistant and into collection co-creation.

Clothes are no longer something you guess at and order. You try them on first — on screen. This article maps that whole shift.

> One-line summary: **"Whose body, which garment, at what resolution and fidelity, and who pays the bill?"** Those four questions decide 90 percent of every fashion AI choice.

Chapter 1 · Why Virtual Try-On exploded in 2026

The value story for VTON is simple.

- **Conversion rate** — people buy what they tried on. +20 to 40 percent on average.

- **Return rate** — seeing fit in advance cuts returns. -15 to 30 percent on average.

- **Time on page** — VTON product pages get 2 to 3 times the dwell time of plain PDPs.

- **New categories** — eyewear, watches, shoes, makeup, and hair are all in scope.
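To make the conversion and return figures concrete, here is a small arithmetic sketch. It assumes the percentages are relative changes (a 20 percent lift multiplies the conversion rate by 1.2), and the traffic numbers and baseline rates are made-up illustrations, not data from any retailer.

```python
from dataclasses import dataclass


@dataclass
class Funnel:
    sessions: int           # product-page sessions per month (illustrative)
    conversion_rate: float  # fraction of sessions that place an order
    return_rate: float      # fraction of orders that are returned


def kept_orders(f: Funnel) -> float:
    """Orders that are placed and not sent back."""
    return f.sessions * f.conversion_rate * (1.0 - f.return_rate)


def with_vton(f: Funnel, conversion_lift: float, return_cut: float) -> Funnel:
    """Apply relative changes: +lift on conversion, -cut on the return rate."""
    return Funnel(
        sessions=f.sessions,
        conversion_rate=f.conversion_rate * (1.0 + conversion_lift),
        return_rate=f.return_rate * (1.0 - return_cut),
    )


baseline = Funnel(sessions=100_000, conversion_rate=0.02, return_rate=0.35)
treated = with_vton(baseline, conversion_lift=0.20, return_cut=0.15)

print(round(kept_orders(baseline)))  # 1300
print(round(kept_orders(treated)))   # 1686
```

Even at the low end of both ranges, the two effects compound: kept orders rise by about 30 percent in this toy funnel, more than the conversion lift alone.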

Three forces converged here.

1. **Diffusion models** unlocked photoreal image generation.

2. **Mobile GPU/NPU** made on-device inference viable.

3. **Datasets (VITON-HD, DressCode, DeepFashion2)** opened up at trainable scale.

When those three lined up between 2024 and 2026, VTON graduated from "demo video" to "production feature."

Chapter 2 · VITON-HD — the first benchmark in 2021

**VITON-HD** is both a paper (CVPR 2021) and a dataset for high-resolution virtual try-on: 1024x768 resolution, 13K pairs of garment and model photos.

VITON-HD solved two things.

- **clothing-agnostic person representation** — erase the garment region from a person photo to make a "space to dress."

- **misalignment-aware normalization** — handle the regions where the warped garment does not line up with the target clothing area on the body, so the mismatch does not show up as artifacts.

Because it was GAN-based, results had unnatural patches, especially around hands, sleeves, and logos. Still, 1024-class resolution and the paired dataset became the starting point for all subsequent VTON research.
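The clothing-agnostic idea can be sketched in a few lines. This is a toy version on nested lists, assuming a per-pixel human-parsing map is already available; the label ids and the choice to erase arms along with the top are illustrative assumptions, and the real pipeline also uses pose information.

```python
# Hypothetical label ids for a human-parsing map; real parsers define their own.
BACKGROUND, HAIR, FACE, UPPER_CLOTHES, ARMS, PANTS = 0, 1, 2, 3, 4, 5
GRAY = (128, 128, 128)


def clothing_agnostic(image, parsing, erase_labels=(UPPER_CLOTHES, ARMS)):
    """Return (agnostic_image, mask): erased pixels turn gray, mask is 1 there."""
    agnostic, mask = [], []
    for img_row, label_row in zip(image, parsing):
        out_row, mask_row = [], []
        for pixel, label in zip(img_row, label_row):
            erase = label in erase_labels
            out_row.append(GRAY if erase else pixel)
            mask_row.append(1 if erase else 0)
        agnostic.append(out_row)
        mask.append(mask_row)
    return agnostic, mask


# A 2x3 toy "photo" and its parsing map.
image = [[(1, 1, 1), (2, 2, 2), (3, 3, 3)],
         [(4, 4, 4), (5, 5, 5), (6, 6, 6)]]
parsing = [[FACE, HAIR, BACKGROUND],
           [ARMS, UPPER_CLOTHES, PANTS]]

agnostic, mask = clothing_agnostic(image, parsing)
print(mask)  # [[0, 0, 0], [1, 1, 0]]
```

The mask is what a later stage fills in; everything outside it (face, hair, trousers) is carried over from the original photo.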

Chapter 3 · HR-VITON and GP-VTON — the late GAN era before diffusion

After VITON-HD came **HR-VITON** (ECCV 2022) and **GP-VTON** (CVPR 2023).

- **HR-VITON** — separates the try-on condition generator from the image generator for training stability. Artifacts around hands and hair are reduced.

- **GP-VTON** — General-Purpose Virtual Try-On, built on collaborative local-flow and global-parsing learning. Warps garment parts (sleeve, body, collar) separately for finer compositing.

The shared limits of this era were **GAN mode collapse** and **poor generalization to new garments and poses**. They composited well inside the training distribution and were brittle in the wild — varied poses, body types, busy patterns.

To break those limits, diffusion entered the room.

Chapter 4 · IDM-VTON — the de facto standard for diffusion VTON

**IDM-VTON** (Improving Diffusion Models for Virtual Try-On, Choi et al, ECCV 2024) is the most-cited and most-reimplemented diffusion VTON model of 2024 to 2026.

The core idea is **"inject garment information through two paths at once."**

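1. **GarmentNet** — a parallel UNet that encodes low-level garment features and fuses them into the main UNet's self-attention layers. Alongside it, an image encoder supplies high-level garment semantics through cross-attention.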

2. **PromptNet** — encode garment text (for example, "white short-sleeve shirt with blue stripes") into text features as an extra condition.

Dual conditioning preserves color, texture, and logos much better than a single image-only path.

IDM-VTON inference — pseudo-flow

1. take a person image and a garment image

2. build a clothing-agnostic mask

3. encode the garment with GarmentNet

4. encode the garment text with PromptNet

5. compose with a Stable Diffusion backbone

Expect roughly 3 to 5 seconds per image on an H100 or RTX 4090.

The IDM-VTON HuggingFace checkpoint became the open-source baseline. ComfyUI nodes and the Replicate API both picked IDM-VTON as their first reference implementation.
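The five inference steps listed earlier in this chapter can be read as a simple data flow. The sketch below wires them together with stub stages that only return labelled strings; every function name here is hypothetical and none of it is the IDM-VTON codebase's API. It is meant to show which stage consumes which input, nothing more.

```python
from dataclasses import dataclass


@dataclass
class TryOnInputs:
    person_image: str     # step 1: shopper photo (path or id)
    garment_image: str    # step 1: flat garment photo (path or id)
    garment_caption: str  # short text description of the garment


# Each stage is a stub that returns a labelled placeholder string.
def build_agnostic_mask(person_image):
    return f"mask({person_image})"


def encode_garment_image(garment_image):
    return f"garment_features({garment_image})"


def encode_garment_text(caption):
    return f"text_features({caption})"


def denoise(person_image, mask, garment_features, text_features):
    return f"tryon({person_image}, {mask}, {garment_features}, {text_features})"


def run_tryon(inputs: TryOnInputs) -> str:
    mask = build_agnostic_mask(inputs.person_image)                 # step 2
    garment_features = encode_garment_image(inputs.garment_image)   # step 3
    text_features = encode_garment_text(inputs.garment_caption)     # step 4
    return denoise(inputs.person_image, mask,
                   garment_features, text_features)                 # step 5


result = run_tryon(TryOnInputs("person.jpg", "shirt.jpg", "white shirt"))
print(result)
```

Steps 2 to 4 are independent of each other, so a production service could run them in parallel and cache the garment-side results per catalog item.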

Chapter 5 · OOTDiffusion — handling out-of-distribution garments

**OOTDiffusion** (Outfitting Fusion based Latent Diffusion, Xu et al, 2024) shipped around the same time as IDM-VTON but with a different design philosophy.

- **garment fusion** — blend garment latents and person latents in the same UNet via self-attention. No separate cross-attention module.

- **out-of-distribution generalization** — more robust to garments outside the training distribution (uncommon patterns or structures).
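A minimal sketch of fusing garment and person features inside self-attention follows. It assumes a single head with identity query, key, and value projections and toy two-dimensional tokens; a real UNet uses learned projections and many heads, so treat this as an illustration of the idea rather than OOTDiffusion's implementation.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, tokens))
                    for d in range(dim)])
    return out


def fuse(person_tokens, garment_tokens):
    """Attend over person and garment tokens jointly, keep the person half."""
    joint = self_attention(person_tokens + garment_tokens)
    return joint[: len(person_tokens)]


person = [[1.0, 0.0], [0.0, 1.0]]
garment = [[1.0, 1.0]]

fused = fuse(person, garment)
alone = self_attention(person)
print(len(fused), fused[0][1] > alone[0][1])  # 2 True
```

Because the garment tokens sit in the same attention window as the person tokens, every person position can pull in garment detail without a dedicated cross-attention module.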

OOTDiffusion ships two checkpoints: a half-body model trained on VITON-HD for upper-body garments, and a full-body model trained on DressCode that covers tops, bottoms, and dresses. Half-body maximizes the realism of a single garment; full-body prioritizes the coherence of a full outfit.

Open code and weights live on GitHub at levihsu/OOTDiffusion, and it is the model Korean and Japanese fashion startups reach for first when running a PoC.

Chapter 6 · CatVTON — concatenation is all you need

The provocation of **CatVTON** (Chong et al, 2024) is its message: "Without complex garment encoders, plain concatenation alone gets us close to SOTA."

The setup.

- **Concatenate garment and person images along a spatial dimension** in latent space (side by side, rather than stacked as extra channels).

- **Fine-tune a Stable Diffusion inpainting backbone** as-is, with no extra modules.

- Trainable parameters are roughly one tenth of IDM-VTON.
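The shape bookkeeping behind "plain concatenation" is easy to show. The sketch below uses nested lists laid out as [channels][height][width] and compares joining two latents along the width with stacking them as channels; the latent sizes are toy values chosen for readability.

```python
def shape(t):
    """Shape of a nested-list tensor laid out as [channels][height][width]."""
    return (len(t), len(t[0]), len(t[0][0]))


def filled(c, h, w, value):
    return [[[value] * w for _ in range(h)] for _ in range(c)]


def concat_width(a, b):
    """Spatial concatenation: place b to the right of a, channel by channel."""
    return [[row_a + row_b for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]


def concat_channels(a, b):
    """Channel concatenation: stack b's channels after a's."""
    return a + b


person = filled(4, 8, 6, 1.0)   # toy latent: 4 channels, 8 high, 6 wide
garment = filled(4, 8, 6, 2.0)

print(shape(concat_width(person, garment)))     # (4, 8, 12)
print(shape(concat_channels(person, garment)))  # (8, 8, 6)
```

Spatial concatenation leaves the channel count untouched, so a pretrained backbone's input layer needs no change, and its self-attention can relate person positions to garment positions directly. Channel stacking would instead require widening the first layer.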

This model raised the question: "Why has everyone been building a separate GarmentNet all this time?" The answer was, "We didn't need to." Simplicity wins on inference speed, on training ease, and on integration. It is a frequently cited candidate for on-device mobile VTON.

Chapter 7 · Outfit Anyone — training-free garment composition from Alibaba

**Outfit Anyone** (Sun et al, Alibaba, 2024) is unusual in two ways.

1. **training-free** — runs on pretrained Stable Diffusion with no fine-tune.

2. **multi-garment** — supports top, bottom, and dress composition together.

The trick is two-stage inversion and mask-guided attention manipulation. Person and garment are each inverted, then swapped in latent space.

The upside is zero training cost and unlimited garment variety. The downside is that realism and detail preservation are not at IDM-VTON levels. But for users who say "use my own photo and try this without any training," it is the top choice.

Chapter 8 · StableVITON — direct descendant of Stable Diffusion

**StableVITON** (Kim et al, CVPR 2024) is designed as a direct heir to Stable Diffusion, as the name implies. The key contribution is **zero cross-attention** — preserving Stable Diffusion's existing cross-attention weights while injecting garment information through a separate path.

Two effects follow.

- Stable Diffusion's text comprehension is inherited intact.

- Garment texture and pattern preservation are strong.
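One way to add a garment path without disturbing pretrained weights is a residual branch whose output projection starts at zero, in the spirit of ControlNet-style zero initialization. The sketch below assumes that is the mechanism behind the "zero" in the name; the class and its fields are hypothetical and the vectors are toy values.

```python
def matvec(weight, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


class ZeroInitBranch:
    """Residual branch whose output projection starts as an all-zero matrix."""

    def __init__(self, dim):
        self.proj = [[0.0] * dim for _ in range(dim)]

    def __call__(self, hidden, garment_feature):
        injected = matvec(self.proj, garment_feature)
        return [h + i for h, i in zip(hidden, injected)]


branch = ZeroInitBranch(dim=3)
hidden = [0.5, -1.0, 2.0]      # stand-in for a backbone activation
garment = [9.0, 9.0, 9.0]      # stand-in for a garment feature

before = branch(hidden, garment)
print(before == hidden)        # True: an untrained branch changes nothing

branch.proj[0][0] = 0.1        # pretend training moved one weight
after = branch(hidden, garment)
print(after)
```

At the start of fine-tuning the backbone behaves exactly as it did before, which is why its text comprehension survives; garment influence is added gradually as the projection moves away from zero.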

**StableVITON** is the second most popular backbone in the ComfyUI community after IDM-VTON, with Stable Diffusion 1.5 and SDXL based variants.

Chapter 9 · MMTryon — the path to multi-modal inputs

**MMTryon** (Zhang et al, 2024) extends the input modalities themselves.

- **image** — garment image

- **text** — garment description ("a navy blazer with gold buttons")

- **garment sketch** — a hand-drawn sketch

- **garment composition** — combinations of several garments

Diversifying the input lets you try things even when "no photo of the garment exists." A designer can preview fit from a sketch; a casual user can simulate clothes from text alone.

Pure text or sketch can't match the realism of image conditioning, so MMTryon is typically used in a hybrid with image conditioning.

Chapter 10 · FitDiT, TPD, GR-VTON — the follow-up variants

Between 2025 and 2026 a wave of IDM-VTON and OOTDiffusion variants shipped.

- **FitDiT** — applies a Diffusion Transformer (DiT) backbone to VTON. Bigger model, longer training, better realism.

- **TPD** (Texture-Preserving Diffusion) — maximizes garment texture and pattern fidelity. Strong on checks and florals.

- **GR-VTON** (Garment-Region VTON) — splits garments into regions (sleeve, body, collar) for region-specific processing.

- **FashionFit** — bundles size guidance into the output as a complete solution.

These variants all share IDM-VTON's dual conditioning pattern and vary backbone, attention, or loss one piece at a time. 2026 VTON research is in a "refining existing patterns" phase rather than chasing the next big idea.

Chapter 11 · Datasets — VITON-HD, DressCode, DeepFashion, VTONHD-Public

VTON models live and die by their data. In 2026 the four standard datasets are VITON-HD, DressCode, DeepFashion, and VTONHD-Public.

Most models train on a VITON-HD plus DressCode mix. DeepFashion-family data feeds auxiliary tasks like garment classification, landmarks, and segmentation. Adding fine-tunes on Korean or Japanese user data is the standard playbook for K-fashion and J-fashion startups.

Chapter 12 · Doodle AI — VTON service born in Korea

**Doodle AI** is a Korean-origin Virtual Try-On service that sells VTON APIs to apparel brands and e-commerce. A shopper uploads one photo of themselves and can try on garments from a catalog.

Distinctive points.

- **Korean body-type data** — fine-tuned for East Asian body types.

- **Local hosting option** — inference inside Korean data centers.

- **Mobile SDK** — SDKs you can drop into iOS and Android apps.

Parts of K-fashion e-commerce — especially small and mid-size brands — adopt VTON via specialist services like Doodle AI instead of building their own models.

Chapter 13 · Vue.ai, 3DLook, Zeekit, Bold Metrics — overseas commercial solutions

Outside Korea the division of labor in fashion AI is sharper.

- **Vue.ai** (Mad Street Den) — first-generation retail fashion AI. Catalog auto-tagging, image enhancement, and VTON.

- **3DLook YourFit** — accurate body measurement from two photos. Sizing recommendation is the strength.

- **Zeekit** — acquired by Walmart in 2021. Powers the VTON inside the Walmart app.

- **Bold Metrics** — sizing recommendation from height and weight. Used by many US apparel brands.

- **Snap AR Try-On** — Snap's AR Mirror tech. Strong in eyewear, makeup, and shoes.

These are specialized — some do only VTON, some only sizing, some only AR — and brands typically combine two or more.

Chapter 14 · Amazon, Google Shopping, Nike Fit — Big Tech VTON

Big Tech is absorbing VTON into its own platforms.

- **Amazon Virtual Try-On for Shoes** — try shoes on virtually inside the Amazon Fashion app.

- **Google Shopping virtual try-on** — launched with women's tops in 2023, expanded in September 2024. Preview garments on models close to your own body type.

- **Nike Fit** — recommends accurate sizes from a foot photo. A core Nike app feature.

- **Warby Parker virtual try-on** — composites eyewear onto a face. Uses the iPhone TrueDepth camera.

- **Fenty Beauty Pro Filt'r** and **L'Oreal Modiface** — real-time makeup, lipstick, and eyeshadow composition.

These do not run VTON in a separate app; they make it a natural part of the shopping flow. Because they work from catalog-side metadata, with no per-brand training, they show how productized fashion AI can become.

Chapter 15 · Musinsa, ABLY, PerfectFit, Blackpin — Korean fashion AI

In Korea the dominant pattern is e-commerce platforms growing their own AI teams.

- **Musinsa** — visual search and style recommendation. Upload a photo, get similar products.

- **ABLY** — personalization recommendation AI at the core. Combines collaborative filtering and content-based recommendation, specialized for apparel.

- **PerfectFit** — VTON specialist startup. B2B SaaS for apparel brands.

- **Blackpin** — body measurement tech. Height, weight, and body shape in, accurate size out.

- **Cultureland** — virtual fitting at some partner outlets.

K-fashion always had brand-specific size charts as a major friction point, and solutions like Blackpin and PerfectFit are quietly closing that gap.
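The size-chart friction is easy to reproduce in code. This sketch picks, per brand, the size whose garment measurements sit closest to the shopper's body measurements plus a wearing ease. The charts, the 4 cm ease, and the squared-distance rule are all illustrative assumptions, not how Blackpin or PerfectFit work.

```python
# Hypothetical size charts in centimetres; real brands publish their own.
CHARTS = {
    "brand_a": {"S": {"chest": 92, "waist": 78},
                "M": {"chest": 98, "waist": 84},
                "L": {"chest": 104, "waist": 90}},
    "brand_b": {"S": {"chest": 96, "waist": 82},
                "M": {"chest": 102, "waist": 88},
                "L": {"chest": 108, "waist": 94}},
}


def recommend_size(body, chart, ease_cm=4):
    """Pick the size whose garment measurements best match body + ease."""
    def distance(garment):
        return sum((garment[k] - (body[k] + ease_cm)) ** 2 for k in body)
    return min(chart, key=lambda size: distance(chart[size]))


shopper = {"chest": 94, "waist": 80}
picks = {brand: recommend_size(shopper, chart)
         for brand, chart in CHARTS.items()}
print(picks)  # {'brand_a': 'M', 'brand_b': 'S'}
```

The same body lands on M at one brand and S at the other, which is exactly the gap that measurement-based recommendation is meant to close.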

Chapter 16 · ZOZOSUIT, ZOZOMAT, ASNAS — Japan goes deep on body measurement

Japan pushes further on body measurement.

- **ZOZOSUIT** — a dot-pattern suit from ZOZO. Wear it, take a rotating phone video, and your body is measured in 360 degrees. Millions distributed after the 2018 launch.

- **ZOZOMAT** — a measurement mat for feet. Precise shoe sizing.

- **ZOZOGLASS** — face measurement that matches makeup tones.

- **ASNAS** — VTON service integrated with Japanese apparel brands.

- **Furusato** — recommender-driven fashion AI.

ZOZO's measurement data shaped how other Japanese apparel brands standardized sizing afterward. "My ZOZOSUIT size" effectively became a cross-brand unit.

Chapter 17 · Cala, Mosaic, Resleeve.ai — AI design and collection generation

If VTON is "trying on a garment that already exists," AI design is "making the garment itself."

- **Cala** — a fashion design platform Adobe acquired in 2024. Text or sketch in; clothing design, pattern output, and factory order in one flow.

- **Mosaic** — AI-driven collection generation. Feed in brand tone, season, and trend; get a lookbook.

- **Resleeve.ai** — clothing design generation. Positioned as a designer assistant.

- **The Fabricant** — digital-only fashion. Garments that exist only digitally, with no physical version.

These tools settled in not as "designer replacement" but as "the way a designer auditions 50 variations in 10 minutes." They also power the ever-shorter season cycles of fast fashion.

Chapter 18 · Body sizing and 3D fit — Apple Reality Composer, Maison Meta, Vsble

The last piece of fitting is a 3D body model.

- **Apple Reality Composer Pro / RealityKit** — generates a 3D body model from a user's LiDAR data on visionOS 26. Virtual closet scenarios.

- **Maison Meta** — a fashion 3D asset platform. A library of 3D models for garments and accessories.

- **Vsble** — virtual showroom. Real-time garment dressing on a 3D body.

- **CLO 3D** and **Browzwear** — 3D garment simulation software for design. Designers cut patterns, then check the fit on a 3D mannequin.

The 3D approach wins on data efficiency — physical simulation works without huge training sets — and loses on realistic texture and lighting. So the 2026 trend is **3D simulation plus diffusion rendering**, combined.

Chapter 19 · ComfyUI and open-source VTON workflows

ComfyUI is a node-based Stable Diffusion workflow tool, and between 2024 and 2026 it became the de facto VTON laboratory.

- **IDM-VTON nodes** — wrap IDM-VTON inference as ComfyUI nodes.

- **OOTDiffusion nodes** — support both half-body and full-body variants.

- **StableVITON nodes** — choose Stable Diffusion 1.5 or SDXL backbones.

- **CatVTON nodes** — the lightest nodes. Fast on a single GPU.

Typical flow.

[person image] + [garment image] ─> [Garment Encoder] ─> [Inpainting Diffusion] ─> [output]

Auxiliary inputs: [pose extraction (OpenPose/DWPose)], [garment mask (SAM/SCHP)]

Open-source workflows let small fashion brands stand up an in-house VTON PoC in days.
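A node-based workflow is a dependency graph, and the execution order falls out of a topological sort. The sketch below uses Python's standard `graphlib` for that. The node names and the exact wiring are illustrative assumptions based on the typical flow above; they are not ComfyUI's node identifiers.

```python
from graphlib import TopologicalSorter

# node -> set of nodes it depends on (illustrative wiring, not ComfyUI's names)
graph = {
    "pose_extraction": {"person_image"},
    "garment_mask": {"person_image"},
    "garment_encoder": {"garment_image"},
    "inpainting_diffusion": {
        "person_image", "pose_extraction", "garment_mask", "garment_encoder",
    },
    "output": {"inpainting_diffusion"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

The two image inputs come first, the three preprocessing nodes follow in some order, and the diffusion step runs only once all of them have finished. Swapping one pose extractor or mask model for another changes a single node, which is what makes this setup convenient for experiments.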

Chapter 20 · AI runway — NYFW, Milan, Digital Fashion Week

If VTON is the consumer side, AI runway is the industry side.

- **NYFW 2025 and 2026** — brands like Pinar&Viola and Collina Strada walked AI-generated garments down the runway.

- **Milan Fashion Week** — 3D digital assets from outfits like Maison Meta made their debut.

- **Metaverse Fashion Week** — a digital-only fashion week held on Decentraland and Spatial. The peak of NFT fashion, and also where its ceiling showed.

- **AI fashion editorial** — Vogue and Harper's Bazaar carry AI-generated garments in their printed pages.

The NFT fashion bubble of 2021 to 2023 cooled, but the practical applications of generative outfits (lookbooks, ads, design sketches) stuck.

Chapter 21 · AI fashion search and visual search

Discovery is changing too.

- **Pinterest Lens** — search for similar clothes by image. Launched in 2017.

- **Google Shopping image search** — based on Google Lens.

- **Musinsa visual search** — image search inside the K-fashion catalog.

- **ABLY AI recommendations** — collaborative filtering on click and purchase history.

- **TikTok Shop** — tap a garment in a video to buy. Uses CLIP and SigLIP-style embeddings.

The core technology is **multi-modal embedding** — measuring visual similarity of garments using image-text embeddings like CLIP and SigLIP, often alongside vision encoders such as EVA-02 and DINOv2.
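Once garments are embedded as vectors, visual search reduces to nearest-neighbor lookup by cosine similarity. The sketch below uses hand-written three-dimensional vectors as stand-ins for real embeddings; the catalog items and numbers are made up, and a production system would use an approximate nearest-neighbor index rather than a full scan.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for image embeddings; real ones have hundreds of dimensions.
catalog = {
    "navy_blazer": [1.0, 0.0, 0.0],
    "grey_cardigan": [0.6, 0.6, 0.0],
    "floral_dress": [0.0, 0.1, 1.0],
}


def search(query, catalog, top_k=2):
    """Return the top_k catalog ids ranked by cosine similarity to the query."""
    ranked = sorted(catalog, key=lambda name: cosine(query, catalog[name]),
                    reverse=True)
    return ranked[:top_k]


query = [0.9, 0.1, 0.0]  # pretend this came from an uploaded photo
print(search(query, catalog))  # ['navy_blazer', 'grey_cardigan']
```

With an image-text model, the query vector can just as well come from a text prompt, which is how "search by photo" and "search by description" share one index.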

Chapter 22 · Ethics — body image, diversity, and privacy

The darker side of fashion AI.

1. **Body image** — if VTON only shows clothes on thin models, body image problems are amplified. Diverse baseline models are needed.

2. **Diversity** — training-data biases in ethnicity, body type, and age get reproduced in the output. VITON-HD is overwhelmingly frontal photos of white and Asian women.

3. **Privacy** — body scans and face photos are highly sensitive data. On-device inference or short-retention policies are needed.

4. **Model watermarking** — provenance labels and watermarks on synthetic images (C2PA Content Credentials, SynthID) are gradually becoming mandatory.

5. **Copyright** — copyright issues for designer garments in training data. Some designers have asked to opt out.

The EU AI Act and Korea's AI Framework Act both classify body and face data as biometric data, requiring consent, notice, and deletion rights for storage and processing.

Chapter 23 · Hardware and inference cost

VTON inference cost is, surprisingly, a serious problem.

- **H100 / A100** — IDM-VTON at 1024 resolution at 3 to 5 seconds per image. Backend for large e-commerce.

- **RTX 4090 / RTX 5090** — 4 to 6 seconds per image. A candidate for small operator self-hosting.

- **Apple M3/M4** — lightweight models like CatVTON at 10 to 20 seconds per image. On-device inference is possible.

- **Mobile NPU (Snapdragon 8 Elite, Apple Neural Engine)** — 10 to 30 seconds after quantization. Not real-time.

Large e-commerce VTON cost sits at about **0.001 to 0.01 USD per product view**, and the conversion lift makes the ROI comfortable. But once you accumulate billions of inferences a month, GPU capacity itself becomes the bottleneck. The 2026 trend is **batch plus cache plus quantization** to cut unit cost by 10x.
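A back-of-the-envelope model shows how batching, caching, and quantization multiply. The GPU price, the speedup factors, and the cache hit rate below are assumptions picked to illustrate the arithmetic; cached views are treated as free, which ignores storage and lookup cost.

```python
def cost_per_view(gpu_usd_per_hour, seconds_per_image, cache_hit_rate=0.0,
                  batch_speedup=1.0, quant_speedup=1.0):
    """USD per product view; views served from cache are counted as free."""
    effective_seconds = seconds_per_image / (batch_speedup * quant_speedup)
    cost_per_inference = gpu_usd_per_hour * effective_seconds / 3600.0
    return cost_per_inference * (1.0 - cache_hit_rate)


# Assumed: a 3 USD/hour GPU and 4 seconds per image.
naive = cost_per_view(gpu_usd_per_hour=3.0, seconds_per_image=4.0)
tuned = cost_per_view(gpu_usd_per_hour=3.0, seconds_per_image=4.0,
                      cache_hit_rate=0.75, batch_speedup=2.0,
                      quant_speedup=1.25)

print(f"{naive:.5f}")          # 0.00333
print(f"{tuned:.5f}")          # 0.00033
print(round(naive / tuned, 2))  # 10.0
```

No single lever reaches 10x here: a 2x batching gain, a 1.25x quantization gain, and a 75 percent cache hit rate get there only in combination.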

Chapter 24 · Beyond 2026 — the next five years of fashion AI

The five-year outlook.

1. **Real-time VTON** — composite garments onto video in real time. For live commerce, Zoom meetings, and social video.

2. **Personal avatars** — users build a personal body model once and reuse it everywhere.

3. **3D plus diffusion hybrid** — 3D for physical fit, diffusion for photorealistic rendering.

4. **Fast design** — compress trend, design, pattern, and production into a week.

5. **AR mirrors** — store mirrors become VTON displays. Already piloted in some Japanese department stores.

6. **Size standardization** — global sizing unified on body-measurement basis. ZOZOSUIT size may become the de facto unit.

7. **Mandatory ethics and labeling** — origin labels required on AI-generated apparel images.

8. **Designer rights** — emerging ledger-based models that compensate designers whose works are in training data.

Digitizing clothes is not like digitizing music or movies. Clothes still have to be worn. So the future of AI fashion is not "digital-only clothes" but a bridge between digital and physical.

Epilogue — Where to start

If the toolset in this article feels overwhelming, here is a recommended learning path.

1. **Theory** — start with the VITON-HD and IDM-VTON papers. Internalize the one-line arc from GAN to diffusion.

2. **Hands-on (open source)** — upload your own photo to the IDM-VTON demo on HuggingFace. Look at the composition results with garment photos.

3. **Workflow (ComfyUI)** — run the IDM-VTON nodes in ComfyUI. Observe the impact of garment masks and pose extraction.

4. **Commercial services** — compare demos of Doodle AI, Vue.ai, and 3DLook. Notice the differences in B2B SaaS packaging.

5. **Sizing** — walk through the measurement flow of ZOZOSUIT and Bold Metrics. Feel how recommendations shift.

> "Whose body, which garment, at what resolution and fidelity, and who pays the bill?" Carry those four questions back into the body of the article and the fashion AI choices become surprisingly clear.

— AI Fashion and VTON 2026, end.

References

1. Choi, Y. et al. (2024). "IDM-VTON: Improving Diffusion Models for Authentic Virtual Try-on in the Wild." ECCV 2024. [https://arxiv.org/abs/2403.05139](https://arxiv.org/abs/2403.05139)

2. Xu, Y. et al. (2024). "OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-On." [https://arxiv.org/abs/2403.01779](https://arxiv.org/abs/2403.01779)

3. Chong, Z. et al. (2024). "CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models." [https://arxiv.org/abs/2407.15886](https://arxiv.org/abs/2407.15886)

4. Sun, K. et al. (2024). "Outfit Anyone: Ultra-high quality virtual try-on for any clothing and any person." [https://humanaigc.github.io/outfit-anyone/](https://humanaigc.github.io/outfit-anyone/)

5. Kim, J. et al. (2024). "StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On." CVPR 2024. [https://arxiv.org/abs/2312.01725](https://arxiv.org/abs/2312.01725)

6. Choi, S. et al. (2021). "VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization." CVPR 2021. [https://arxiv.org/abs/2103.16874](https://arxiv.org/abs/2103.16874)

7. Lee, S. et al. (2022). "High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions (HR-VITON)." ECCV 2022. [https://arxiv.org/abs/2206.14180](https://arxiv.org/abs/2206.14180)

8. Xie, Z. et al. (2023). "GP-VTON: Towards General Purpose Virtual Try-on via Collaborative Local-Flow Global-Parsing Learning." CVPR 2023. [https://arxiv.org/abs/2303.13756](https://arxiv.org/abs/2303.13756)

9. Morelli, D. et al. (2022). "Dress Code: High-Resolution Multi-Category Virtual Try-On." ECCV 2022. [https://arxiv.org/abs/2204.08532](https://arxiv.org/abs/2204.08532)

10. Liu, Z. et al. (2016). "DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations." [https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html](https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html)

11. Ge, Y. et al. (2019). "DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images." [https://github.com/switchablenorms/DeepFashion2](https://github.com/switchablenorms/DeepFashion2)

12. Zhang, X. et al. (2024). "MMTryon: Multi-Modal Multi-Reference Virtual Try-On." [https://arxiv.org/abs/2405.00448](https://arxiv.org/abs/2405.00448)

13. Google. "Try on clothes virtually with generative AI in Search." [https://blog.google/products/shopping/virtual-try-on-google-generative-ai/](https://blog.google/products/shopping/virtual-try-on-google-generative-ai/)

14. Amazon. "Virtual Try-On for Shoes." [https://www.aboutamazon.com/news/retail/virtual-try-on-for-shoes](https://www.aboutamazon.com/news/retail/virtual-try-on-for-shoes)

15. Nike. "Nike Fit." [https://news.nike.com/news/nike-fit-digital-foot-measurement-tool](https://news.nike.com/news/nike-fit-digital-foot-measurement-tool)

16. Warby Parker. "Virtual Try-On." [https://www.warbyparker.com/virtual-try-on](https://www.warbyparker.com/virtual-try-on)

17. ZOZO. "ZOZOSUIT." [https://zozo.jp/zozosuit/](https://zozo.jp/zozosuit/)

18. ZOZO. "ZOZOMAT." [https://zozo.jp/zozomat/](https://zozo.jp/zozomat/)

19. Musinsa Tech. "Musinsa Visual Search." [https://www.musinsa.com/](https://www.musinsa.com/)

20. ABLY Corp. "ABLY AI Recommendations." [https://ably.co.kr/](https://ably.co.kr/)

21. Adobe. "Cala — AI-powered fashion design." [https://ca.la/](https://ca.la/)

22. Resleeve.ai. "AI Fashion Design." [https://www.resleeve.ai/](https://www.resleeve.ai/)

23. ComfyUI. "ComfyUI VTON workflows." [https://github.com/comfyanonymous/ComfyUI](https://github.com/comfyanonymous/ComfyUI)

24. HuggingFace. "IDM-VTON model card." [https://huggingface.co/yisol/IDM-VTON](https://huggingface.co/yisol/IDM-VTON)