151. 基于大规模预训练文本图像模型的虚拟试穿方法 (A Virtual Try-On Method Based on a Large-Scale Pre-Trained Text-Image Model).
- Author
- 祖雅妮 and 张毅
- Abstract
Virtual try-on is a technology that predicts and visualizes how clothing will look on a given body. Traditional virtual try-on methods rely on expensive 3D body-scanning devices and physical simulation to render clothing on the human body. While these methods offer high-quality results, the cost of 3D scanning hardware is a barrier. Using 2D images is a more convenient and cost-effective alternative: users only need to supply 2D images of the body and the clothing, and the try-on result can be visualized directly. This study builds upon previous 2D virtual try-on methods and extends them from image-image input to text-image input, so that users can provide text descriptions of clothing instead of specific clothing images. By utilizing text descriptions, the system can generate clothing that matches the provided text, expanding the range of use cases for virtual try-on. To generate accurate try-on results from text-image input, this study introduces a framework comprising six modules: a text-image encoder, a pose extractor, an image-segmentation module, a GAN encoder, a GAN generator, and a mapping module. The overall framework follows a GAN-inversion editing pipeline. First, the GAN encoder encodes the input body image into a latent vector that captures the essential characteristics of the image (such as image style and body shape). Second, this latent vector is edited, and the edited vector is fed to the GAN generator to produce the desired result. Specifically, the latent vector is edited using the mapping module, which shares the same network structure as the GAN generator. The mapping module generates an additional offset latent vector with the same dimension as the one obtained from the GAN encoder; this offset is used to edit the latent vector so that the generated image satisfies the desired pose and text-description requirements.
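The latent-editing step described above can be sketched in a few lines. This is a toy illustration only, with stand-in functions and a small latent dimension; `mapping_module`, the encoder output `w`, and the text embedding `text_emb` are all hypothetical placeholders for the paper's actual networks:

```python
import random

LATENT_DIM = 8  # toy size; real GAN latents are typically e.g. 512
random.seed(0)

def mapping_module(w, text_emb):
    """Stand-in for the mapper: returns an offset latent vector with
    the same dimension as w, conditioned on the text embedding."""
    return [0.1 * (wi + ti) for wi, ti in zip(w, text_emb)]

# Hypothetical GAN-encoder output for the input body image.
w = [random.gauss(0, 1) for _ in range(LATENT_DIM)]
# Hypothetical text embedding for a clothing description.
text_emb = [random.gauss(0, 1) for _ in range(LATENT_DIM)]

delta = mapping_module(w, text_emb)              # offset latent vector
w_edit = [wi + di for wi, di in zip(w, delta)]   # edited latent, fed to the generator

assert len(w_edit) == len(w)  # offset editing preserves the latent dimension
```

The key design point is that editing happens purely in latent space: the generator itself is untouched, and only the latent code is shifted by a text-conditioned offset.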
The offset vector also helps constrain the latent vector within the GAN latent space, facilitating the generation of high-quality images by the GAN generator. To maintain consistency of pose and appearance, the pose extractor and image-segmentation modules are used to construct loss functions; these losses guide the optimization of the latent vectors so that the final generated image remains consistent with the input human body. To generate accurate clothing images from the input text descriptions, the pre-trained text-image model CLIP is employed: CLIP encodes both the text description and the final output image, and the resulting loss function regulates the optimization process during training. In experimental evaluations, the proposed method successfully generates correct images corresponding to the input body image and text descriptions, quantitatively outperforming existing methods with improvements of 15% in IoU, 8% in semantics, and 77.1% in image-quality evaluation. Compared to traditional physical fitting, virtual try-on provides consumers with an economical and convenient way to try on clothes. With rapid advances in machine learning and computer vision, virtual try-on has achieved impressive results. Furthermore, as consumers increasingly seek personalized experiences, the proposed method can generate the desired clothing from text descriptions and present the final fitting results, further enhancing the flexibility and application scope of virtual try-on. [ABSTRACT FROM AUTHOR]
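The CLIP-based supervision described in the abstract amounts to penalizing misalignment between the text embedding and the generated image's embedding. A minimal sketch, assuming (as CLIP does) that alignment is measured by cosine similarity; the embeddings here are tiny hand-made vectors, not real CLIP outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def clip_style_loss(image_emb, text_emb):
    # Loss shrinks as the generated image's embedding aligns with the text's,
    # so minimizing it pushes the latent optimization toward the description.
    return 1.0 - cosine_similarity(image_emb, text_emb)

text_emb = [1.0, 0.0, 0.5, -0.2]    # hypothetical text embedding
aligned = [0.9, 0.1, 0.45, -0.25]   # image matching the description
unrelated = [-0.3, 1.0, -0.8, 0.6]  # image unrelated to the description

assert clip_style_loss(aligned, text_emb) < clip_style_loss(unrelated, text_emb)
```

In the paper's pipeline this term would be combined with the pose and segmentation losses, so the optimized latent must satisfy the text description without drifting away from the input body.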
- Published
- 2023