TED-VITON: Transformer-Empowered Diffusion Models for Virtual Try-On (2024)

Zhenchen Wan1  Yanwu Xu2  Zhaoqing Wang3  Feng Liu1  Tongliang Liu3  Mingming Gong1,4
1University of Melbourne, Melbourne, Australia
2Snapchat, Los Angeles, USA
3The University of Sydney, Sydney, Australia
4Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
zhenchenw@student.unimelb.edu.au, zwan6779@uni.sydney.edu.au, fengliu.ml@gmail.com
tongliang.liu@sydney.edu.au, mingming.gong@unimelb.edu.au

Abstract

Recent advancements in Virtual Try-On (VTO) have demonstrated exceptional efficacy in generating realistic images and preserving garment details, largely attributed to the robust generative capabilities of text-to-image (T2I) diffusion backbones. However, the T2I models that underpin these methods have become outdated, limiting the potential for further improvement in VTO. Additionally, current methods face notable challenges in accurately rendering text on garments without distortion and in preserving fine-grained details such as textures and material fidelity. The emergence of Diffusion Transformer (DiT) based T2I models has showcased impressive performance and offers a promising opportunity for advancing VTO. However, directly applying existing VTO techniques to transformer-based T2I models is ineffective due to substantial architectural differences, which hinder their ability to fully leverage these models' advanced capabilities for improved text generation. To address these challenges and unlock the full potential of DiT-based T2I models for VTO, we propose TED-VITON, a novel framework that integrates a Garment Semantic (GS) Adapter for enhancing garment-specific features, a Text Preservation Loss to ensure accurate and distortion-free text rendering, and a constraint mechanism that optimizes prompt generation by a Large Language Model (LLM). These innovations enable state-of-the-art (SOTA) performance in visual quality and text fidelity, establishing a new benchmark for the VTO task.

[Figure 1]

1 Introduction

Using input images of an individual and a selected garment, image-based VTO technology generates realistic images of the individual wearing the selected garment. By bypassing the need for physical fitting, VTO offers a transformative solution for applications in e-commerce, fashion cataloging, and the burgeoning metaverse. The primary challenges in VTO are threefold: (1) human body alignment, whereby the generated try-on image must accurately reflect the person's body shape and pose; (2) garment fidelity, whereby fine garment details such as texture, color, and logo clarity must be preserved to ensure authenticity; and (3) image quality, which pertains to the final output's resolution and the absence of artifacts.

While early VTO methods based on Generative Adversarial Networks (GANs) [10] addressed these challenges to some extent [2, 4, 7, 13, 28, 40, 44, 46], they often struggled with garment misalignment, visible artifacts, and limited generalizability. To address these limitations, diffusion models [16] have emerged as a promising alternative in VTO research, leveraging a progressive noise-reversal process that enhances control over image generation and significantly improves texture and detail preservation. Recent UNet-based methods [11, 30, 31, 19, 23, 39, 5] utilize the generative strength of pretrained text-to-image (T2I) diffusion models [36, 34] to capture detailed garment semantics and enhance the realism of try-on images. These approaches achieve high image fidelity by encoding garment semantics through simple descriptions [30, 5] or by using explicit warping networks [11, 39] to align the garment structure with human poses. However, despite their advancements, these models still face challenges in preserving fine-grained garment details, such as logos, text, and intricate textures, and often struggle with accurately representing natural lighting and adapting to complex body poses.

To overcome these limitations, we explore the use of transformers in diffusion models, specifically building upon the DiT architecture [8]. Unlike UNet-based architectures, transformers offer enhanced scalability, long-range dependency modeling, and the ability to handle diverse visual contexts. However, directly migrating existing VTO approaches to a Transformer-based diffusion model does not guarantee performance improvements, as traditional UNet-based methods fail to fully exploit the potential of the transformer architecture. This observation aligns with our initial experiments, where a naive application of prior VTO techniques to DiT yielded suboptimal results.

To harness the capabilities of DiT, this paper proposes Transformer-Empowered Diffusion Models for Virtual Try-On (TED-VITON), a framework designed to overcome key challenges in VTO by leveraging Transformer-based diffusion architectures. TED-VITON integrates several novel components to address limitations in garment detail preservation, model generalization, and text fidelity. Our contributions can be summarized as follows:

  • Successful Migration of VTO to DiT-based Architecture: We demonstrate the successful adaptation of Virtual Try-On technology to a DiT-based architecture. This paves the way for subsequent enhancements in the preservation of garment detail, semantic alignment, and visual fidelity.

  • Enhanced Garment Semantics with GS-Adapter: Integrating the GS-Adapter, TED-VITON precisely aligns high-order semantic features from the image encoder with the DiT. This integration allows the model to more accurately capture occlusions, wrinkles, and material properties, maintaining realism across varied poses.

  • Text and Logo Clarity through Text Preservation Loss: We introduce a Text Preservation Loss to address common challenges in text and logo fidelity. This loss function effectively enhances clarity and mitigates distortion, ensuring high-quality, distortion-free renderings of logos and text, critical for garments with complex designs.

  • Optimized Prompt Generation through Constraint Mechanism for LLMs: To optimize DiT training, we introduce a constraint mechanism that tailors LLM prompts to garment-specific semantics. This mechanism improves training input quality, facilitating effective learning and generating outputs with superior visual fidelity.

2 Related Works

Pose-Guided Person Image Synthesis (PPIS). VTO technology originated with Pose-Guided Person Image Synthesis (PPIS). Initial PPIS approaches aimed to generate person images conditioned on specific body poses, laying the groundwork for generating visually convincing images of people in various postures. Pioneering works in this domain [26, 24, 51, 49, 9, 1, 22, 27] concentrated on aligning human poses with target clothing images, addressing key challenges in pose transfer and adapting to individual body shapes.

GAN-based VTO. Following PPIS advancements, VTO progressed to the application of Generative Adversarial Networks (GANs) for 2D VTO. GAN-based VTO approaches [24, 9, 20, 7, 17, 35, 33, 1, 27, 43, 21, 38] typically involve two stages: deforming the garment to match the target person’s body shape, followed by fusing this deformed garment with the person’s image. Methods for improving garment deformation include using dense flow maps to create a seamless fit, while normalization and distillation techniques help to minimize misalignment. However, GAN-based VTO models face generalization limitations, especially in complex backgrounds and varied poses, limiting their applicability in dynamic real-world environments.

Diffusion-based VTO. Diffusion models have opened new avenues in VTO, enabling enhanced fidelity and detail preservation. Recent diffusion-based VTO methods [6, 3, 50, 31, 23, 30, 11, 19, 39, 5] extend beyond standard Stable Diffusion (SD), often employing customized architectures to boost performance. For instance, StableVITON [19] builds on SD1.4 [36] and incorporates ControlNet [47] to enhance control over garment and body alignment, while IDM-VTON [5] leverages SDXL [34] with IP-Adapter [45] to refine garment-body fit through additional image-based control signals. These approaches effectively address key limitations of GAN-based methods, particularly in garment fidelity and preservation of fine details, establishing diffusion-based models as suitable for complex VTO applications. However, preserving intricate elements like garment text, logos, and texture under diverse poses and lighting conditions remains a challenge. TED-VITON aims to bridge these gaps, advancing VTO with a DiT architecture that integrates the GS-Adapter for semantic alignment, a DINOv2 encoder for capturing fine-grained garment details, and a Text Preservation Loss that ensures clarity in logos and text.

3 Methodology

[Figure 2: (a) Overview of the TED-VITON framework; (b) the MM-DiT block.]

3.1 Background on Controlling Diffusion Models

Stable Diffusion (SD) 3 Model. The SD3 model [8] is one of the first large-scale text-to-image diffusion models built on a Transformer-based architecture. Building upon Latent Diffusion Models (LDMs) [36], SD3 introduces the rectified flow approach [25], which connects data points in the latent space via straight linear paths, replacing the traditional curved trajectories. This straight-line trajectory minimizes noise accumulation and allows for efficient, high-quality image synthesis.

In SD3, the input image $x$ is encoded into a latent representation $z_0 = E(x)$ by a pre-trained encoder $E$. The rectified flow formulation defines a forward diffusion process with a variance schedule $\beta_t$, expressed as follows:

$q(z_t \mid z_0) = \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t) I\big),$   (1)

where $t \in \{1, \dots, T\}$ indicates diffusion steps, $\alpha_t := 1 - \beta_t$, and $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. SD3 leverages the Conditional Flow Matching (CFM) loss to guide rectified flow during training:

$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{E(x),\ \epsilon \sim \mathcal{N}(0,1),\ t}\left[\,\big\| v_\theta(z, t) - u_t(z \mid \epsilon) \big\|^2 \,\right],$   (2)

where $u_t(z \mid \epsilon)$ denotes the rectified vector field for direct, linear alignment. Unlike prior diffusion models, SD3 employs a Transformer backbone (Multimodal DiT) that facilitates bidirectional information flow between text and image tokens. This multimodal structure enhances text comprehension and visual quality, making SD3 highly suitable for text-guided image generation with improved fidelity and detail retention.
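To make the training objective concrete, a minimal PyTorch-style sketch of the CFM loss in Eq. 2 is shown below. It assumes the common rectified-flow convention $z_t = (1 - t)\, z_0 + t\, \epsilon$ with target velocity $u_t = \epsilon - z_0$; the callable `v_theta` stands in for the DiT backbone and is a placeholder, not the actual SD3 implementation.

```python
import torch

def cfm_loss(v_theta, z0):
    """Illustrative Conditional Flow Matching loss (Eq. 2).

    v_theta : callable (z_t, t) -> predicted velocity with the same shape as z_t.
    z0      : clean latents z_0 = E(x) from the VAE encoder, shape (B, C, H, W).
    Assumes z_t = (1 - t) * z0 + t * eps, whose target velocity is u_t = eps - z0.
    """
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)           # timesteps sampled uniformly in [0, 1]
    eps = torch.randn_like(z0)                    # Gaussian noise sample
    t_ = t.view(b, 1, 1, 1)
    z_t = (1.0 - t_) * z0 + t_ * eps              # straight-line interpolation between z0 and noise
    u_t = eps - z0                                # rectified vector field u_t(z | eps)
    return ((v_theta(z_t, t) - u_t) ** 2).mean()  # squared error averaged over the batch
```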

ControlNet for Conditional Image Generation. ControlNet [47] extends diffusion models by enabling conditional image generation with additional guidance inputs, such as edge maps, segmentation masks, or pose annotations. ControlNet operates by branching from the base model's intermediate features $F$. The conditional input $C$ is processed with learnable weights $W_c$, resulting in conditioned features $F_{\text{ctrl}} = \text{ControlNet}(C; W_c)$. These conditioned features are merged back into the main pipeline as $F_{\text{combined}} = F + \lambda F_{\text{ctrl}}$, where $\lambda$ regulates the influence of the conditional input. During training, ControlNet minimizes a composite loss:

$\mathcal{L}_{\text{control}} = \mathcal{L}_{\text{diff}} + \gamma \mathcal{L}_{\text{cond}},$   (3)

where $\mathcal{L}_{\text{diff}}$ is the base diffusion model's loss, $\mathcal{L}_{\text{cond}}$ ensures alignment with the guidance input, and $\gamma$ balances their contributions. This framework enables precise control over image generation, making ControlNet highly effective for tasks requiring fine-grained customization.
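The sketch below illustrates the ControlNet-style feature merge $F_{\text{combined}} = F + \lambda F_{\text{ctrl}}$ described above. The layer choices and the zero-initialized projection are illustrative assumptions, not the exact ControlNet architecture.

```python
import torch
import torch.nn as nn

class ControlBranch(nn.Module):
    """Minimal sketch of ControlNet-style conditioning: a small branch encodes the
    guidance input C and its (zero-initialized) output is added back onto the base
    model's intermediate features, i.e. F_combined = F + lam * F_ctrl."""

    def __init__(self, cond_channels: int, feat_channels: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.encode = nn.Conv2d(cond_channels, feat_channels, kernel_size=3, padding=1)
        self.zero_proj = nn.Conv2d(feat_channels, feat_channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)   # branch starts as an identity on F
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        f_ctrl = self.zero_proj(self.encode(cond))  # conditioned features F_ctrl
        return feat + self.lam * f_ctrl             # merge back into the main pipeline
```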

3.2 TED-VITON

Figure 2 (a) illustrates the TED-VITON framework, comprising DiT-GarmentNet, the Garment Semantic (GS) Adapter and DiT-TryOnNet. The following section provides a comprehensive description of each module and the training procedure.

DiT-GarmentNet. DiT-GarmentNet is designed to extract fine-grained garment features, including textures, patterns, fabric structures, logos, and other subtle design elements essential for realistic VTO results. By preserving the garment’s true visual characteristics, this module ensures high fidelity, particularly in applications requiring precise appearance rendering.

DiT-GarmentNet processes the latent representation of the garment image, $\mathcal{E}(X_g)$, extracted via a pre-trained VAE encoder $\mathcal{E}$, along with the conditioned text prompt $\tau_\theta(D)$ produced by the multimodal text encoders. These representations flow through multiple transformer layers, refining and retaining intricate garment details. The transformer architecture, inspired by Esser et al. [8], captures long-range dependencies, ensuring consistent textures and accurate logo placement.

The processing of the garment image $X_g$ alongside the conditioned text prompt $D$ is defined as:

$F_{\text{garment}}^{i} = \text{DiT-GarmentNet}^{i}\big(\mathcal{E}(X_g),\ \tau_\theta(D)\big),$   (4)

where $F_{\text{garment}}^{i}$ denotes the fine-grained features extracted from the $i$-th transformer layer of DiT-GarmentNet.

In this way, DiT-GarmentNet ensures high visual fidelity by combining garment-specific details with broader model context, enabling the VTO system to accurately render complex designs on various body shapes and poses.
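A minimal sketch of how the per-layer garment features $F_{\text{garment}}^{i}$ in Eq. 4 could be collected is given below; the attribute `blocks` and the block signature are hypothetical stand-ins for the DiT-GarmentNet layers, not the released code.

```python
import torch
import torch.nn as nn

def collect_garment_features(garment_net: nn.Module, z_g: torch.Tensor, txt: torch.Tensor):
    """Illustrative extraction of per-layer garment features F_garment^i (Eq. 4).
    Assumes garment_net.blocks is an iterable of transformer blocks that take and
    return (image_tokens, text_tokens); both names are hypothetical."""
    feats = []
    h_img, h_txt = z_g, txt
    for block in garment_net.blocks:
        h_img, h_txt = block(h_img, h_txt)
        feats.append(h_img)                # F_garment^i from the i-th layer
    return feats
```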

Garment Semantic Adapter (GS-Adapter). The GS-Adapter [45] is a key module that enhances generalization, making the model less sensitive to variations in body poses, garment deformations, and conditions like lighting or camera angles. By focusing on low-frequency features, it captures essential garment attributes, enabling consistent performance across diverse scenarios.

Unlike DiT-GarmentNet, which extracts high-frequency details like textures and logos, the GS-Adapter uses the DINOv2 encoder [32] to distill semantic garment information, including structure, style, and material. These high-order semantics, $H_{\text{semantic}}$, encapsulate broader contextual attributes while maintaining adaptability.

The GS-Adapter employs a decoupled cross-attention mechanism to independently process joint and image embeddings. Let $\mathbf{Q} \in \mathbb{R}^{N \times d}$ represent the query matrix, and let $(\mathbf{K}_j, \mathbf{V}_j)$ and $(\mathbf{K}_i, \mathbf{V}_i)$ denote the key-value pairs for the joint and image embeddings, respectively. The combined output is:

$\mathbf{Z}_{\text{new}} = \text{Attention}(\mathbf{Q}, \mathbf{K}_j, \mathbf{V}_j) + \lambda \cdot \text{Attention}(\mathbf{Q}, \mathbf{K}_i, \mathbf{V}_i),$   (5)

where $\lambda$ balances image and joint feature contributions. This design allows the GS-Adapter to generalize effectively across diverse poses, complex garments, and varying environmental conditions, enhancing model robustness and ensuring realistic outputs.
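A compact sketch of the decoupled cross-attention in Eq. 5 is shown below; it uses PyTorch's scaled dot-product attention and assumes pre-projected query/key/value tensors, so it illustrates the formula rather than the GS-Adapter's actual implementation.

```python
import torch
import torch.nn.functional as F

def decoupled_cross_attention(Q, K_j, V_j, K_i, V_i, lam: float = 1.0):
    """Illustrative decoupled cross-attention (Eq. 5):
    Z_new = Attention(Q, K_j, V_j) + lam * Attention(Q, K_i, V_i).
    Q: (B, N, d) queries; (K_j, V_j): joint-embedding keys/values;
    (K_i, V_i): DINOv2 image-embedding keys/values, each (B, M, d)."""
    z_joint = F.scaled_dot_product_attention(Q, K_j, V_j)   # attention over joint tokens
    z_image = F.scaled_dot_product_attention(Q, K_i, V_i)   # attention over image tokens
    return z_joint + lam * z_image                          # weighted combination
```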

DiT-TryOnNet. DiT-TryOnNet builds upon the DiT architecture, leveraging its powerful Transformer-based diffusion capabilities within the latent space of a pre-trained VAE. By integrating DiT, our model benefits from the scalability and long-range dependency modeling of Transformers, enabling precise alignment and realistic rendering in virtual try-on scenarios. For DiT-TryOnNet, we construct a combined input $\zeta = [\mathcal{E}(X_{\text{model}});\ m;\ \mathcal{E}(X_{\text{mask}});\ \mathcal{E}(X_{\text{pose}})]$ to provide a comprehensive context. This input consists of: the person's latent image representation $\mathcal{E}(X_{\text{model}})$ as the primary structural guide; a dynamically resized mask $m$ to isolate the garment area and focus the model's attention; the masked person's image $X_{\text{mask}} = (1 - m) \odot X_{\text{model}}$ for garment reconstruction; and the DensePose embedding $\mathcal{E}(X_{\text{pose}})$ to align with the person's pose.
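The construction of the combined input $\zeta$ can be sketched as below; the concatenation axis, the mask resizing, and the encoder interface are assumptions made for illustration rather than the paper's exact layout.

```python
import torch
import torch.nn.functional as F

def build_tryon_input(enc, x_model, mask, x_pose):
    """Illustrative construction of zeta = [E(X_model); m; E(X_mask); E(X_pose)].
    enc is a frozen VAE encoder; mask is (B, 1, H, W) with 1 inside the garment area."""
    z_model = enc(x_model)                                   # person latents E(X_model)
    z_mask = enc((1.0 - mask) * x_model)                     # masked person E(X_mask)
    z_pose = enc(x_pose)                                     # DensePose rendering latents E(X_pose)
    m = F.interpolate(mask, size=z_model.shape[-2:], mode="nearest")  # resize mask to latent grid
    return torch.cat([z_model, m, z_mask, z_pose], dim=1)    # channel-wise concatenation (assumed)
```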

Within the MM-DiT block (Fig. 2(b)), the fine-grained garment details $F_{\text{garment}}^{i}$ extracted from the $i$-th transformer layer of DiT-GarmentNet merge with the feature representation $F_{\text{tryon}}^{i}$ from the corresponding $i$-th layer of DiT-TryOnNet to form $F_{\text{image}}^{i}$, which serves as the primary input for attention processing. Descriptive text embeddings $\tau_\theta(D)$, generated by the multimodal text encoders, are concatenated with $F_{\text{image}}^{i}$ within the query, key, and value components of the joint attention mechanism (i.e., $Q_{\text{joint}} = \text{Concat}(Q_{\text{image}}^{i}, Q_{\tau_\theta(D)})$, $K_{\text{joint}} = \text{Concat}(K_{\text{image}}^{i}, K_{\tau_\theta(D)})$, $V_{\text{joint}} = \text{Concat}(V_{\text{image}}^{i}, V_{\tau_\theta(D)})$). This results in a hidden state $H_{\text{joint}}^{i}$ that unifies the visual and textual modalities. Subsequently, this hidden state is further enriched by incorporating the high-order semantic features $H_{\text{semantic}}$ provided by the GS-Adapter, as described in Eq. 5.
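The joint attention over concatenated image and text tokens can be sketched as follows; head splitting and projection layers are omitted, and the split-back step is an assumption about how the two streams are recovered.

```python
import torch
import torch.nn.functional as F

def joint_attention(q_img, k_img, v_img, q_txt, k_txt, v_txt):
    """Illustrative MM-DiT joint attention: image and text tokens are concatenated
    along the sequence axis in Q, K, V so both modalities attend to each other.
    Expected shapes: (B, heads, N, d) for image tokens, (B, heads, M, d) for text."""
    q = torch.cat([q_img, q_txt], dim=-2)    # Q_joint = Concat(Q_image, Q_text)
    k = torch.cat([k_img, k_txt], dim=-2)    # K_joint
    v = torch.cat([v_img, v_txt], dim=-2)    # V_joint
    h = F.scaled_dot_product_attention(q, k, v)
    n_img = q_img.shape[-2]
    return h[..., :n_img, :], h[..., n_img:, :]   # split the hidden state back into streams
```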

To produce the final VTO output $\hat{X}$, DiT-TryOnNet leverages the combined input $\zeta$ and the garment description embedding $\tau_\theta(D)$:

$\hat{X} = \text{DiT-TryOnNet}\big(\zeta,\ \tau_\theta(D)\big).$   (6)

Prior Preservation for Text Generation. To retain the model’s ability to generate accurate and clear text, such as logos and labels, we introduce a prior preservation mechanism inspired by DreamBooth [37]. This mechanism incorporates a text preservation loss to ensure text clarity and fidelity, preventing the model from losing this capability while fine-tuning for VTO tasks. As the final component of our framework, prior preservation complements the GS-Adapter and DiT-TryOnNet. Together, they form a comprehensive training objective, achieving a balance between high-fidelity garment rendering and robust text generation for realistic VTO outputs.

As shown in Fig. 2(a), the total loss function combines two main components: (1) the CFM loss $\mathcal{L}_{\text{CFM}}$ defined in Eq. 2, which ensures high-quality VTO outputs by aligning generated images with the desired garment and pose, and (2) the text preservation loss $\mathcal{L}_{\text{pres}}$, which maintains clarity in text details. The CFM loss guides the model in generating the VTO result $\hat{X}$, leveraging DiT-GarmentNet for detail retention and DiT-TryOnNet for fit adjustments based on pose and body type. The text preservation loss is computed as $\mathcal{L}_{\text{pres}} = \text{MSE}(\hat{X}, \hat{X}')$, where $\hat{X}'$ is the baseline latent representation from the original model, helping to retain text fidelity in the fine-tuned output. The final loss function is given by:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CFM}} + \lambda_{\text{pres}} \cdot \mathcal{L}_{\text{pres}},$   (7)

where $\lambda_{\text{pres}}$ controls the balance between VTO adaptation and text retention. This approach enables high-quality garment realism while preserving essential text rendering for realistic try-on images.
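The combined objective in Eq. 7 reduces to a few lines, sketched below; the prior output $\hat{X}'$ is assumed to come from a frozen copy of the base model, and the value of $\lambda_{\text{pres}}$ is a placeholder rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

def total_loss(l_cfm: torch.Tensor, x_hat: torch.Tensor, x_hat_prior: torch.Tensor,
               lambda_pres: float = 0.1) -> torch.Tensor:
    """Illustrative total objective (Eq. 7): L_total = L_CFM + lambda_pres * L_pres,
    with L_pres = MSE(x_hat, x_hat_prior) and x_hat_prior produced by the frozen
    original model (the text-preservation prior)."""
    l_pres = F.mse_loss(x_hat, x_hat_prior.detach())   # keep the prior fixed
    return l_cfm + lambda_pres * l_pres
```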

GPT-4o Generated Garment Descriptions. Our approach uses GPT-4o to generate detailed garment descriptions that capture both basic and nuanced features. These descriptions provide rich semantic context, enhancing the model’s ability to faithfully represent garment details. For DiT-GarmentNet, descriptions help preserve intricate details like texture and logos. Meanwhile, for DiT-TryOnNet, the text prompt is tailored to emphasize how the garment appears when worn, focusing on fit and interaction with the body. This adjustment improves realism in the generated images. This dual-conditioning approach enables more accurate garment representation, as shown in Fig.2(a).

[Figure 3: Qualitative comparison with baseline VTO methods.]

Table 1: Quantitative comparison on the VITON-HD and DressCode Upper-body test sets. LPIPS, SSIM, and CLIP-I are computed in the paired setting; FID and KID (UN) in the unpaired setting. "-" indicates results that are not available.

                            VITON-HD                                           DressCode Upper-body
Method                      LPIPS↓   SSIM↑   CLIP-I↑  FID↓ (UN)  KID↓ (UN)     LPIPS↓   SSIM↑   CLIP-I↑  FID↓ (UN)  KID↓ (UN)
GAN-based methods
HR-VITON [21]               0.115    0.877   0.800    12.238     3.757         0.118    0.910   0.749    29.383     3.104
SD-VITON [38]               0.104    0.896   0.831    9.857      1.450         -        -       -        -          -
Diffusion-based methods
LaDI-VTON [30]              0.166    0.873   0.819    9.386      1.590         0.157    0.905   0.789    22.689     2.580
DCI-VTON [11]               0.197    0.863   0.823    9.775      1.762         0.171    0.893   0.756    24.184     2.379
StableVITON [19]            0.142    0.875   0.838    9.371      1.990         0.113    0.910   0.844    19.712     2.149
IDM-VTON [5]                0.102    0.868   0.875    9.156      1.242         0.065    0.920   0.870    11.852     1.181
TED-VITON (Ours)            0.095    0.881   0.878    8.848      0.858         0.050    0.934   0.875    11.451     1.393

4 Experiment

To thoroughly evaluate TED-VITON, we conduct a comprehensive study that includes quantitative and qualitative analyses, ablation studies to assess the contributions of individual components, and a user study to gauge human preferences. For the quantitative analysis, we measure standard metrics to evaluate the generated images’ alignment with ground truth and overall visual quality. In the qualitative analysis, we compare TED-VITON’s outputs with those of baseline models to examine its ability to reproduce fine garment features, such as textures, logos, and material details. We also perform ablation studies by systematically removing key components to assess their impact on performance and image quality. Since human preference is a critical measure of success in generative tasks, we conduct a user study to gather feedback on the perceived realism and aesthetic appeal of the fitting images. The results highlight TED-VITON’s ability to produce visually compelling and realistic outputs that surpass existing methods.

4.1 Experiment Setup

Baselines. We evaluate our method against both GAN-based and diffusion-based VTO approaches. The GAN-based baselines include HR-VITON [21] and SD-VITON [38]. Both methods employ a separate warping module to fit the garment onto the target person, followed by GAN-based generation with the fitted garment as input. Among the diffusion-based methods, we compare with LaDI-VTON [30], DCI-VTON [11], StableVITON [19], and IDM-VTON [5]. All of these models leverage pretrained SD models, though with different conditioning techniques. LaDI-VTON and DCI-VTON incorporate distinct warping modules for garment conditioning, while StableVITON directly uses the SD1.4 encoder for conditioning. IDM-VTON, by contrast, utilizes the SDXL inpainting model checkpoints from official repositories. Our approach similarly builds on the original SD3 checkpoints from official sources. For a fair comparison, we generate images at a resolution of 1024×768 when available; otherwise, we generate images at 512×384 and upscale them to 1024×768 using interpolation or super-resolution techniques [41], reporting the highest-quality results achieved.
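For reference, bicubic upscaling of 512×384 outputs to 1024×768 can be done as below; this is one simple option alongside the super-resolution route [41], not necessarily the exact procedure applied to every baseline.

```python
import torch.nn.functional as F

def upscale_to_eval_resolution(img, size=(1024, 768)):
    """Illustrative bicubic upscaling of 512x384 generations to 1024x768 (H, W)
    so all methods are compared at the same resolution; super-resolution models
    such as Real-ESRGAN [41] are an alternative."""
    return F.interpolate(img, size=size, mode="bicubic", align_corners=False)
```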

[Figure 4: Qualitative ablation of TED-VITON's key components.]

Table 2: Ablation of TED-VITON's key components (FID and KID in the unpaired setting).

Component                      LPIPS↓   SSIM↑   CLIP-I↑  FID↓ (UN)  KID↓ (UN)
w/o DINOv2                     0.120    0.870   0.852    9.655      1.734
w/o GS-Adapter                 0.111    0.842   0.829    9.674      1.680
w/o DiT-GarmentNet             0.113    0.850   0.817    9.931      1.693
w/o Text Preservation Loss     0.098    0.877   0.864    9.438      1.487
Full Model                     0.095    0.881   0.878    8.848      0.858

Evaluation datasets. We evaluate the effectiveness of TED-VITON on two widely-used VTO datasets, VITON-HD [4] and DressCode [29]. The VITON-HD dataset consists of 13,679 pairs of frontal-view images of women and corresponding upper garments. Following the standard dataset practices of previous works [30, 11, 19, 5, 39], we divide VITON-HD into a training set of 11,647 pairs and a test set of 2,032 pairs. The DressCode dataset contains 15,366 image pairs focused specifically on upper-body garments. Consistent with the original dataset splits, we use 1,800 upper-body image pairs from DressCode as the test set. All experiments on both VITON-HD and DressCode are conducted at a resolution of 1024×768.

Evaluation metrics. We evaluate TED-VITON in both paired and unpaired settings, following established practices in VTO literature. In the paired setting, the input garment matches the one originally shown in the person image. To assess performance, we use three key metrics: Structural Similarity Index (SSIM) [42], Learned Perceptual Image Patch Similarity (LPIPS) [48] and the CLIP image similarity score (CLIP-I) [14] to measure similarity between the generated image and the ground truth. Additionally, in the unpaired setting, where the garment in the person image is replaced with a different one and no ground truth is available, we assess TED-VITON’s performance in terms of image quality and realism using Fréchet Inception Distance (FID) [15] and Kernel Inception Distance (KID) [18] scores.
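A possible evaluation setup with torchmetrics is sketched below (one convenient choice, assuming the torchmetrics image extras are installed; not necessarily the paper's evaluation code). CLIP-I would additionally be computed as the cosine similarity between CLIP image embeddings of the generated and ground-truth images.

```python
import torch
from torchmetrics.image import (
    StructuralSimilarityIndexMeasure,
    LearnedPerceptualImagePatchSimilarity,
    FrechetInceptionDistance,
    KernelInceptionDistance,
)

ssim = StructuralSimilarityIndexMeasure(data_range=1.0)           # paired, inputs in [0, 1]
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex")    # paired, inputs in [-1, 1]
fid = FrechetInceptionDistance(normalize=True)                    # unpaired, inputs in [0, 1]
kid = KernelInceptionDistance(subset_size=50, normalize=True)     # unpaired, inputs in [0, 1]

def update_paired(gen: torch.Tensor, gt: torch.Tensor):
    """Paired setting: compare each generated image with its ground truth."""
    ssim.update(gen, gt)
    lpips.update(gen * 2 - 1, gt * 2 - 1)

def update_unpaired(gen: torch.Tensor, real: torch.Tensor):
    """Unpaired setting: compare the distributions of generated and real images."""
    fid.update(real, real=True)
    fid.update(gen, real=False)
    kid.update(real, real=True)
    kid.update(gen, real=False)
```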

4.2 Qualitative Results

Fig. 3 provides a qualitative comparison of VTO models alongside the input person image and selected garments. TED-VITON stands out as the only model capable of accurately reproducing text details on garments, such as the large “1969” and “Wrangler” logos, as well as finer text like “Vans”. In terms of color and texture fidelity, TED-VITON precisely aligns the four colors in “1969” across the text rows, maintaining the original garment’s design. Unlike other models, which often exhibit text distortion or color misalignment, TED-VITON preserves text clarity and color accuracy. This is achieved through the integration of a Text Preservation Loss and enhanced prompt conditioning, which together ensure that fine-grained text and color details are retained in the generated VTO images.

4.3 Quantitative Results

VITON-HD. We evaluate TED-VITON on the VITON-HD dataset and compare it with SOTA VTO methods, including GAN-based approaches (HR-VITON [21] and SD-VITON [38]) and diffusion-based methods (LaDI-VTON [30], DCI-VTON [11], StableVITON [19], and IDM-VTON [5]). Table 1 presents the quantitative results, where TED-VITON achieves top scores in LPIPS, CLIP-I, FID, and KID, indicating superior perceptual quality and realism. It ranks second in SSIM, highlighting its strong structural similarity preservation and alignment with perceptual semantics.

DressCode Upper-body. To evaluate TED-VITON’s generalization across diverse garment styles, we test it on the DressCode upper-body dataset. As shown in Table 1, TED-VITON outperforms the other models across most metrics, achieving the top scores in LPIPS, SSIM, CLIP-I, and FID, which indicates strong alignment with the perceptual features of the in-shop garment. TED-VITON outperforms diffusion-based models like IDM-VTON and StableVITON by consistently capturing finer patterns and more accurate garment textures. In contrast, GAN-based methods struggle with complex patterns, resulting in lower-quality outputs on this dataset.

4.4 User Study

[Figure 5: User study results on text/logo clarity and pattern/texture fidelity.]

To complement objective metrics, we conducted a user study to evaluate the visual appeal of our model. The study used 10 image pairs from the VITON-HD dataset, divided into two groups: 5 pairs focusing on text and logo clarity, and 5 pairs evaluating pattern and texture fidelity. As shown in Fig. 5, with 50 valid responses, TED-VITON was the preferred model, demonstrating strong user preference for its text clarity and pattern accuracy.

4.5 Ablation Study

Effect of key components of TED-VITON. To analyze the contribution of each key component in TED-VITON, we conduct an ablation study by systematically removing individual components, namely DINOv2, the GS-Adapter, DiT-GarmentNet, and the Text Preservation Loss, and evaluate their impact on the results. In Fig. 4(a), replacing DINOv2 with a standard CLIP encoder results in blurry and distorted text. This highlights DINOv2’s role, in conjunction with the GS-Adapter, in enhancing text clarity and garment alignment by capturing fine semantic and garment details. Fig. 4(b) demonstrates the impact of removing the GS-Adapter, leading to misaligned garment features and reinforcing its importance for detailed garment representation. As shown in Fig. 4(c), removing DiT-GarmentNet compromises fine garment details, such as textures and logo placement, indicating its role in preserving intricate design elements. In Fig. 4(d), without the Text Preservation Loss, text appears slightly distorted, emphasizing this loss function’s role in maintaining text fidelity. As shown in Fig. 4(e), the full model achieves optimal performance by incorporating all components, accurately capturing both structural and stylistic details. The quantitative evaluation in Table 2 further supports these observations.

Effect of using GPT-generated captions. We conducted an ablation study to evaluate the effect of detailed GPT-generated captions on TED-VITON’s performance. As shown in Fig. 6, utilizing GPT-generated captions significantly improves the model’s ability to render multi-line text and garment details such as color accuracy and texture. With a brief description, the model accurately renders the first line, “SUN”, but fails to capture “SAND” and “SURF” due to insufficient contextual guidance. A detailed caption provides the necessary guidance for accurately rendering all lines and maintaining consistent color and texture. The corresponding quantitative improvements across all metrics are reported in Table 3.

[Figure 6: Effect of brief versus detailed GPT-generated captions.]

Table 3: Effect of detailed GPT-generated captions (FID and KID in the unpaired setting).

Detailed Captions   LPIPS↓   SSIM↑   CLIP-I↑  FID↓ (UN)  KID↓ (UN)
No                  0.113    0.872   0.829    9.881      1.706
Yes                 0.095    0.881   0.878    8.848      0.858

5 Conclusion

We presented TED-VITON, a novel VTO framework built on the DiT architecture to tackle critical challenges in garment detail fidelity and text clarity. By incorporating a GS-Adapter and a Text Preservation Loss, TED-VITON significantly improves garment-specific feature representation and ensures distortion-free rendering of logos and text. Additionally, a constraint mechanism for LLM-generated prompts enhances training inputs, leading to superior performance. Comprehensive evaluations on the VITON-HD [4] and DressCode [29] datasets showcase state-of-the-art results in visual quality, garment alignment, and text fidelity, establishing TED-VITON as a scalable and high-quality solution for next-generation VTO applications.

References

  • Albahar et al. [2021] Badour Albahar, Jingwan Lu, Jimei Yang, Zhixin Shu, Eli Shechtman, and Jia-Bin Huang. Pose with style: detail-preserving pose-guided image synthesis with conditional StyleGAN. ACM Transactions on Graphics, 40(6):218:1–218:11, 2021.
  • Bai et al. [2022] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. Single Stage Virtual Try-on via Deformable Attention Flows, 2022. arXiv:2207.09161 [cs].
  • Bhatnagar et al. [2019] Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. Multi-Garment Net: Learning to Dress 3D People From Images. pages 5420–5430, 2019.
  • Choi et al. [2021] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. VITON-HD: High-Resolution Virtual Try-On via Misalignment-Aware Normalization. pages 14131–14140, 2021.
  • Choi et al. [2024] Yisol Choi, Sangkyung Kwak, Kyungmin Lee, Hyungwon Choi, and Jinwoo Shin. Improving Diffusion Models for Virtual Try-on, 2024. arXiv:2403.05139 [cs].
  • Cui et al. [2024] Aiyu Cui, Jay Mahajan, Viraj Shah, Preeti Gomathinayagam, Chang Liu, and Svetlana Lazebnik. Street TryOn: Learning In-the-Wild Virtual Try-On from Unpaired Person Images. pages 8235–8239, 2024.
  • Dong et al. [2019] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bowen Wu, Bing-Cheng Chen, and Jian Yin. FW-GAN: Flow-Navigated Warping GAN for Video Virtual Try-On. pages 1161–1170, 2019.
  • Esser et al. [2024] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, and Robin Rombach. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis. 2024.
  • Frühstück et al. [2022] Anna Frühstück, Krishna Kumar Singh, Eli Shechtman, Niloy J. Mitra, Peter Wonka, and Jingwan Lu. InsetGAN for Full-Body Image Generation. pages 7723–7732, 2022.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative Adversarial Nets. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2014.
  • Gou et al. [2023] Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow. In Proceedings of the 31st ACM International Conference on Multimedia, pages 7599–7607, New York, NY, USA, 2023. Association for Computing Machinery.
  • Güler et al. [2018] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense Human Pose Estimation in the Wild. pages 7297–7306, 2018.
  • Han et al. [2018] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S. Davis. VITON: An Image-based Virtual Try-on Network, 2018. arXiv:1711.08447 [cs].
  • Hessel et al. [2021] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems, pages 6840–6851. Curran Associates, Inc., 2020.
  • Honda [2019] Shion Honda. VITON-GAN: Virtual Try-on Image Generator Trained with Adversarial Loss, 2019. arXiv:1911.07926.
  • Kim et al. [2019] Junho Kim, Minjae Kim, Hyeonwoo Kang, and Kwanghee Lee. U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation, 2019. arXiv:1907.10830.
  • Kim et al. [2023] Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo. StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On, 2023. arXiv:2312.01725 [cs].
  • Kips et al. [2020] Robin Kips, Pietro Gori, Matthieu Perrot, and Isabelle Bloch. CA-GAN: Weakly Supervised Color Aware GAN for Controllable Makeup Transfer. In Computer Vision – ECCV 2020 Workshops, pages 280–296, Cham, 2020. Springer International Publishing.
  • Lee et al. [2022] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. High-Resolution Virtual Try-On with Misalignment and Occlusion-Handled Conditions, 2022. arXiv:2206.14180 [cs].
  • Li et al. [2020] Kun Li, Jinsong Zhang, Yebin Liu, Yu-Kun Lai, and Qionghai Dai. PoNA: Pose-Guided Non-Local Attention for Human Pose Transfer. IEEE Transactions on Image Processing, 29:9584–9599, 2020.
  • Li et al. [2023] Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, and Zhe Lin. UniHuman: A Unified Model for Editing Human Images in the Wild, 2023. arXiv:2312.14985 [cs].
  • Liu et al. [2019] Wen Liu, Zhixin Piao, Jie Min, Wenhan Luo, Lin Ma, and Shenghua Gao. Liquid Warping GAN: A Unified Framework for Human Motion Imitation, Appearance Transfer and Novel View Synthesis. pages 5904–5913, 2019.
  • Liu et al. [2022] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow, 2022. arXiv:2209.03003.
  • Ma et al. [2017] Liqian Ma, Xu Jia, Qianru Sun, Bernt Schiele, Tinne Tuytelaars, and Luc Van Gool. Pose Guided Person Image Generation. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2017.
  • Men et al. [2020] Yifang Men, Yiming Mao, Yuning Jiang, Wei-Ying Ma, and Zhouhui Lian. Controllable Person Image Synthesis With Attribute-Decomposed GAN. pages 5084–5093, 2020.
  • Minar et al. [2020] Matiur Rahman Minar, T. Tuan, Heejune Ahn, Paul L. Rosin, and Yu-Kun Lai. CP-VTON+: Clothing Shape and Texture Preserving Image-Based Virtual Try-On. 2020.
  • Morelli et al. [2022] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress Code: High-Resolution Multi-Category Virtual Try-On. pages 2231–2235, 2022.
  • Morelli et al. [2023] Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the 31st ACM International Conference on Multimedia, pages 8580–8589, New York, NY, USA, 2023. Association for Computing Machinery.
  • Ning et al. [2024] Shuliang Ning, Duomin Wang, Yipeng Qin, Zirong Jin, Baoyuan Wang, and Xiaoguang Han. PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns. pages 6976–6985, 2024.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning Robust Visual Features without Supervision, 2023.
  • Pecenakova et al. [2022] Sonia Pecenakova, Nour Karessli, and Reza Shirvany. FitGAN: Fit- and Shape-Realistic Generative Adversarial Networks for Fashion. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 3097–3104, 2022.
  • Podell et al. [2023] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, 2023. arXiv:2307.01952 [cs].
  • Raffiee and Sollami [2021] Amir Hossein Raffiee and Michael Sollami. GarmentGAN: Photo-realistic Adversarial Fashion Transfer. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 3923–3930, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis With Latent Diffusion Models. pages 10684–10695, 2022.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation, 2023. arXiv:2208.12242.
  • Shim et al. [2024] Sang-Heon Shim, Jiwoo Chung, and Jae-Pil Heo. Towards Squeezing-Averse Virtual Try-On via Sequential Deformation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(5):4856–4863, 2024.
  • Wan et al. [2024] Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, and Tao Mei. Improving Virtual Try-On with Garment-focused Diffusion Models, 2024.
  • Wang et al. [2018] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward Characteristic-Preserving Image-based Virtual Try-On Network. pages 589–604, 2018.
  • Wang et al. [2021] Xintao Wang, Liangbin Xie, Chao Dong, and Ying Shan. Real-ESRGAN: Training Real-World Blind Super-Resolution With Pure Synthetic Data. pages 1905–1914, 2021.
  • Wang et al. [2004] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Xie et al. [2023] Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, and Xiaodan Liang. GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning. pages 23550–23559, 2023.
  • Yang et al. [2020] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards Photo-Realistic Virtual Try-On by Adaptively Generating↔Preserving Image Content, 2020. arXiv:2003.05863 [cs, eess].
  • Ye et al. [2023] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, 2023. arXiv:2308.06721 [cs].
  • Yu et al. [2019] Ruiyun Yu, Xiaoqi Wang, and Xiaohui Xie. VTNFP: An Image-Based Virtual Try-On Network With Body and Clothing Feature Preservation. pages 10511–10520, 2019.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. pages 3836–3847, 2023.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. pages 586–595, 2018.
  • Zhou et al. [2022] Xinyue Zhou, Mingyu Yin, Xinyuan Chen, Li Sun, Changxin Gao, and Qingli Li. Cross Attention Based Style Distribution for Controllable Person Image Synthesis. In Computer Vision – ECCV 2022, pages 161–178, Cham, 2022. Springer Nature Switzerland.
  • Zhu et al. [2023] Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. TryOnDiffusion: A Tale of Two UNets. pages 4606–4615, 2023.
  • Zhu et al. [2019] Zhen Zhu, Tengteng Huang, Baoguang Shi, Miao Yu, Bofei Wang, and Xiang Bai. Progressive Pose Attention Transfer for Person Image Generation. pages 2347–2356, 2019.