Author: "Shan, Ying" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Shan, Ying"' showing total 2,297 results

101. Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

Author: Di, Qianji, Ma, Wenxi, Qi, Zhongang, Hou, Tianxiang, Shan, Ying, and Wang, Hanzi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene Graph Generation (SGG) aims to structurally and comprehensively represent objects and their connections in images, which can significantly benefit scene understanding and other related downstream tasks. Existing SGG models often struggle to solve the long-tailed problem caused by biased datasets. However, even if these models can fit specific datasets better, it may be hard for them to resolve the unseen triples which are not included in the training set. Most methods tend to feed a whole triple and learn the overall features based on statistical machine learning. Such models have difficulty predicting unseen triples because the objects and predicates in the training set are combined differently as novel triples in the test set. In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalization capability of the SGG models. We propose a Joint Feature Learning (JFL) module and a Factual Knowledge based Refinement (FKR) module to learn object and predicate categories separately at the feature level and align them with corresponding visual features so that the model is no longer limited to triple matching. Besides, since we observe that the long-tailed problem also affects the generalization ability, we design a novel balanced learning strategy, including a Character Guided Sampling (CGS) and an Informative Re-weighting (IR) module, to provide tailor-made learning methods for each predicate according to their characters. Extensive experiments show that our model achieves state-of-the-art performance. In more detail, TISGG improves zR@20 (zero-shot recall) by 11.7% on the PredCls sub-task on the Visual Genome dataset.
Published: 2023

102. TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter

Author: Zhang, Binjie, Ge, Yixiao, Xu, Xuyuan, Shan, Ying, and Shou, Mike Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual foundation models like CLIP excel in learning feature representations from extensive datasets through self-supervised methods, demonstrating remarkable transfer learning and generalization capabilities. A growing number of applications based on visual foundation models are emerging, including innovative solutions such as BLIP-2. These applications employ pre-trained CLIP models as upstream feature extractors and train various downstream modules to accomplish diverse tasks. In situations involving system upgrades that require updating the upstream foundation model, it becomes essential to re-train all downstream modules to adapt to the new foundation model, which is inflexible and inefficient. In this paper, we introduce a parameter-efficient and task-agnostic adapter, dubbed TaCA, that facilitates compatibility across distinct foundation models while ensuring enhanced performance for the new models. TaCA allows downstream applications to seamlessly integrate better-performing foundation models without necessitating retraining. We conduct extensive experimental validation of TaCA using different scales of models with up to one billion parameters on various tasks such as video-text retrieval, video recognition, and visual question answering. The results consistently demonstrate the emergent ability of TaCA on hot-plugging upgrades for visual foundation models. Codes and models will be available at https://github.com/TencentARC/TaCA.
Published: 2023

103. InstructP2P: Learning to Edit 3D Point Clouds with Text Instructions

Author: Xu, Jiale, Wang, Xintao, Cao, Yan-Pei, Cheng, Weihao, Shan, Ying, and Gao, Shenghua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Enhancing AI systems to perform tasks following human instructions can significantly boost productivity. In this paper, we present InstructP2P, an end-to-end framework for 3D shape editing on point clouds, guided by high-level textual instructions. InstructP2P extends the capabilities of existing methods by synergizing the strengths of a text-conditioned point cloud diffusion model, Point-E, and powerful language models, enabling color and geometry editing using language instructions. To train InstructP2P, we introduce a new shape editing dataset, constructed by integrating a shape segmentation dataset, off-the-shelf shape programs, and diverse edit instructions generated by a large language model, ChatGPT. Our proposed method allows for editing both color and geometry of specific regions in a single forward pass, while leaving other regions unaffected. In our experiments, InstructP2P shows generalization capabilities, adapting to novel shape categories and instructions, despite being trained on a limited amount of data.
Published: 2023

104. Sticker820K: Empowering Interactive Retrieval with Stickers

Author: Zhao, Sijie, Ge, Yixiao, Qi, Zhongang, Song, Lin, Ding, Xiaohan, Xie, Zehua, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Stickers have become a ubiquitous part of modern-day communication, conveying complex emotions through visual imagery. To facilitate the development of more powerful algorithms for analyzing stickers, we propose a large-scale Chinese sticker dataset, namely Sticker820K, which consists of 820k image-text pairs. Each sticker has rich and high-quality textual annotations, including descriptions, optical characters, emotional labels, and style classifications. Although vision-language tasks in the domain of natural images have been well studied, directly applying those models, such as CLIP, to sticker data is not an optimal solution due to the discrepant nature between natural and emotive image data. Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset. For the text-to-image retrieval task, our StickerCLIP demonstrates strong superiority over CLIP, achieving an absolute gain of 66.0% in mean recall on the Sticker820K test set. Additionally, we endeavor to extend the recently popularized LLM by means of prompt tuning, integrating its ability for sticker retrieval and allowing users to retrieve stickers through instructions. We validate the feasibility of this method, demonstrating the immense potential of prompt tuning in expanding LLM abilities while not affecting the quality of upstream tasks.
Published: 2023

105. SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

Author: Li, Xuewei, Wu, Tao, Qi, Zhongang, Wang, Gaoang, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation (PASS) gives complete scene perception based on an ultra-wide angle of view. Usually, prevalent PASS methods with 2D panoramic image input focus on solving image distortions but lack consideration of the 3D properties of original 360° data. Therefore, their performance will drop a lot when inputting panoramic images with the 3D disturbance. To be more robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical geometry knowledge. Specifically, a spherical geometry-aware framework is proposed for PASS. It includes three modules, i.e., spherical geometry-aware image projection, spherical deformable patch embedding, and a panorama-aware loss, which takes input images with 3D disturbance into account, adds a spherical geometry-aware constraint on the existing deformable patch embedding, and indicates the pixel density of original 360° data, respectively. Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude. Our code and supplementary material are available at https://github.com/TencentARC/SGAT4PASS., Comment: Accepted by IJCAI 2023
Published: 2023

106. PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas

Author: Chen, Zheng, Cao, Yan-Pei, Guo, Yuan-Chen, Wang, Chen, Shan, Ying, and Zhang, Song-Hai
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Achieving an immersive experience enabling users to explore virtual environments with six degrees of freedom (6DoF) is essential for various applications such as virtual reality (VR). Wide-baseline panoramas are commonly used in these applications to reduce network bandwidth and storage requirements. However, synthesizing novel views from these panoramas remains a key challenge. Although existing neural radiance field methods can produce photorealistic views under narrow-baseline and dense image captures, they tend to overfit the training views when dealing with wide-baseline panoramas due to the difficulty in learning accurate geometry from sparse 360° views. To address this problem, we propose PanoGRF, Generalizable Spherical Radiance Fields for Wide-baseline Panoramas, which constructs spherical radiance fields incorporating 360° scene priors. Unlike generalizable radiance fields trained on perspective images, PanoGRF avoids the information loss from panorama-to-perspective conversion and directly aggregates geometry and appearance features of 3D sample points from each panoramic view based on spherical projection. Moreover, as some regions of the panorama are only visible from one view while invisible from others under wide baseline settings, PanoGRF incorporates 360° monocular depth priors into spherical depth estimation to improve the geometry features. Experimental results on multiple panoramic datasets demonstrate that PanoGRF significantly outperforms state-of-the-art generalizable view synthesis methods for wide-baseline panoramas (e.g., OmniSyn) and perspective images (e.g., IBRNet, NeuRay)., Comment: accepted to NeurIPS2023; Project Page: https://thucz.github.io/PanoGRF/
Published: 2023

107. Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Author: Xing, Jinbo, Xia, Menghan, Liu, Yuxin, Zhang, Yuechen, Zhang, Yong, He, Yingqing, Liu, Hanyuan, Chen, Haoxin, Cun, Xiaodong, Wang, Xintao, Shan, Ying, and Wong, Tien-Tsin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g. frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves the performance by transferring the rich concepts available in image datasets solely into video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which mitigates the potential quality degradation effectively. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage., Comment: 13 pages, 8 figures. Project page: https://doubiiu.github.io/projects/Make-Your-Video/
Published: 2023

108. Inserting Anybody in Diffusion Models via Celeb Basis

Author: Yuan, Ge, Cun, Xiaodong, Zhang, Yong, Li, Maomao, Qi, Chenyang, Wang, Xintao, Shan, Ying, and Zheng, Huicheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Exquisite demand exists for customizing the pretrained large text-to-image model, e.g., Stable Diffusion, to generate innovative concepts, such as the users themselves. However, the newly-added concept from previous customization methods often shows weaker combination abilities than the original ones even given several images during training. We thus propose a new personalization method that allows for the seamless integration of a unique individual into the pre-trained diffusion model using just one facial photograph and only 1024 learnable parameters in under 3 minutes. As a result, we can effortlessly generate stunning images of this person in any pose or position, interacting with anyone and doing anything imaginable from text prompts. To achieve this, we first analyze and build a well-defined celeb basis from the embedding space of the pre-trained large text encoder. Then, given one facial photo as the target identity, we generate its own embedding by optimizing the weight of this basis and locking all other parameters. Empowered by the proposed celeb basis, the new identity in our customized model showcases a better concept combination ability than previous personalization methods. Besides, our model can also learn several new identities at once and have them interact with each other, where previous customization models fail. The code will be released., Comment: Project page: http://celeb-basis.github.io ; Github repository: https://github.com/ygtxr1997/CelebBasis
Published: 2023

109. GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Author: Yang, Rui, Song, Lin, Li, Yanwei, Zhao, Sijie, Ge, Yixiao, Li, Xiu, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering. Nevertheless, these models typically rely on prohibitive computational costs and publicly inaccessible data. To address these challenges, we propose the GPT4Tools based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools. It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts. By using the Low-Rank Adaptation (LoRA) optimization, our approach facilitates the open-source LLMs to solve a range of visual problems, including visual comprehension and image generation. Moreover, we provide a benchmark to evaluate the ability of LLMs to use tools, which is performed in both zero-shot and fine-tuning ways. Extensive experiments demonstrate the effectiveness of our method on various language models, which not only significantly improves the accuracy of invoking seen tools, but also enables the zero-shot capacity for unseen tools. The code and demo are available at https://github.com/StevenGrove/GPT4Tools.
Published: 2023

110. Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

Author: Gu, Yuchao, Wang, Xintao, Wu, Jay Zhangjie, Shi, Yujun, Chen, Yunpeng, Fan, Zihan, Xiao, Wuyou, Zhao, Rui, Chang, Shuning, Wu, Weijia, Ge, Yixiao, Shan, Ying, and Shou, Mike Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Public large-scale text-to-image diffusion models, such as Stable Diffusion, have gained significant attention from the community. These models can be easily customized for new concepts using low-rank adaptations (LoRAs). However, the utilization of multiple concept LoRAs to jointly support multiple customized concepts presents a challenge. We refer to this scenario as decentralized multi-concept customization, which involves single-client concept tuning and center-node concept fusion. In this paper, we propose a new framework called Mix-of-Show that addresses the challenges of decentralized multi-concept customization, including concept conflicts resulting from existing single-client LoRA tuning and identity loss during model fusion. Mix-of-Show adopts an embedding-decomposed LoRA (ED-LoRA) for single-client tuning and gradient fusion for the center node to preserve the in-domain essence of single concepts and support theoretically limitless concept fusion. Additionally, we introduce regionally controllable sampling, which extends spatially controllable sampling (e.g., ControlNet and T2I-Adaptor) to address attribute binding and missing object problems in multi-concept sampling. Extensive experiments demonstrate that Mix-of-Show is capable of composing multiple customized concepts with high fidelity, including characters, objects, and scenes.
Published: 2023

111. TaleCrafter: Interactive Story Visualization with Multiple Characters

Author: Gong, Yuan, Pang, Youxin, Cun, Xiaodong, Xia, Menghan, He, Yingqing, Chen, Haoxin, Wang, Longyue, Zhang, Yong, Wang, Xintao, Shan, Ying, and Yang, Yujiu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Accurate Story visualization requires several necessary elements, such as identity consistency across frames, the alignment between plain text and visual content, and a reasonable layout of objects in images. Most previous works endeavor to meet these requirements by fitting a text-to-image (T2I) model on a set of videos in the same style and with the same characters, e.g., the FlintstonesSV dataset. However, the learned T2I models typically struggle to adapt to new characters, scenes, and styles, and often lack the flexibility to revise the layout of the synthesized images. This paper proposes a system for generic interactive story visualization, capable of handling multiple novel characters and supporting the editing of layout and local structure. It is developed by leveraging the prior knowledge of large language and T2I models, trained on massive corpora. The system comprises four interconnected components: story-to-prompt generation (S2P), text-to-layout generation (T2L), controllable text-to-image generation (C-T2I), and image-to-video animation (I2V). First, the S2P module converts concise story information into detailed prompts required for subsequent stages. Next, T2L generates diverse and reasonable layouts based on the prompts, offering users the ability to adjust and refine the layout to their preference. The core component, C-T2I, enables the creation of images guided by layouts, sketches, and actor-specific identifiers to maintain consistency and detail across visualizations. Finally, I2V enriches the visualization process by animating the generated images. Extensive experiments and a user study are conducted to validate the effectiveness and flexibility of interactive editing of the proposed system., Comment: Github repository: https://github.com/VideoCrafter/TaleCrafter
Published: 2023

112. TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Author: Zeng, Ziyun, Ge, Yixiao, Tong, Zhan, Liu, Xihui, Xia, Shu-Tao, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it is still challenging for video models to reach this goal due to the increasing uncertainty of spatiotemporal signals. To ease training, existing works leverage image foundation models' prior knowledge and equip them with efficient temporal modules. Despite the satisfactory fine-tuning performance, we empirically find they fall short of out-of-the-box usage, given that their performance in zero-shot/linear protocols is even degraded compared to their baseline counterparts. In this work, we analyze the factor that leads to degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers. The overfitted text encoder, in turn, provides a harmful supervision signal, degrading the video representation. To tackle this issue, we propose a degradation-free pre-training strategy to retain the generalization ability of the text encoder via freezing shallow layers while enabling the capture of task-related semantics in tunable deep layers. As for the training objective, we adopt the transcript sorting task in TVTS, combined with masking techniques, to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-art results on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, etc. Code is available at https://github.com/TencentARC/TVTS., Comment: Technical Report
Published: 2023

113. A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition

Author: Xiong, Limao, Zhou, Jie, Zhu, Qunxi, Wang, Xiao, Wu, Yuanbin, Zhang, Qi, Gui, Tao, Huang, Xuanjing, Ma, Jin, and Shan, Ying
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Existing models for named entity recognition (NER) are mainly based on large-scale labeled datasets, which are usually obtained through crowdsourcing. However, it is hard to obtain a unified and correct label via majority voting from multiple annotators for NER due to the large labeling space and complexity of this task. To address this problem, we aim to utilize the original multi-annotator labels directly. Particularly, we propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER. This model learns a token- and content-dependent confidence via an Expectation-Maximization (EM) algorithm by minimizing empirical risk. The true posterior estimator and confidence estimator run iteratively to update the true posterior and confidence respectively. We conduct extensive experiments on both real-world and synthetic datasets, which show that our model can improve performance effectively compared with strong baselines.
Published: 2023

114. What Makes for Good Visual Tokenizers for Large Language Models?

Author: Wang, Guangzhi, Ge, Yixiao, Ding, Xiaohan, Kankanhalli, Mohan, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We empirically investigate proper pre-training methods to build good visual tokenizers, making Large Language Models (LLMs) powerful Multimodal Large Language Models (MLLMs). In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO), and observe that: i) Fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset. ii) Self-supervised models are better at fine-grained perception, where patch-level supervision is particularly effective. iii) Tuning the visual tokenizer leads to the loss of semantics obtained from large-scale pretraining, which is unfavorable with a relatively small-scale instruction-tuning dataset. Given the findings, we review methods that attempted to unify semantics and fine-grained visual understanding, e.g., patch-level feature distillation with semantically-rich targets. We obtain an intriguing insight: mask-based strategies that were once all the rage may not be applicable for obtaining good visual tokenizers. Based on this critical observation, we obtain a new MLLM equipped with a tailored Good Visual Tokenizer (GVT), which exhibits strong visual comprehension capability at multiple scales. In particular, without introducing extra parameters and task-specific fine-tuning, GVT achieves superior performance on visual question answering, image captioning, and other fine-grained visual understanding tasks such as object counting and multi-class identification., Comment: 15 pages, 3 figures. Project released at: https://github.com/TencentARC/GVT
Published: 2023

115. SparseGNV: Generating Novel Views of Indoor Scenes with Sparse Input Views

Author: Cheng, Weihao, Cao, Yan-Pei, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the generation of novel views of indoor scenes given sparse input views. The challenge is to achieve both photorealism and view consistency. We present SparseGNV: a learning framework that incorporates 3D structures and image generative models to generate novel views with three modules. The first module builds a neural point cloud as underlying geometry, providing contextual information and guidance for the target novel view. The second module utilizes a transformer-based network to map the scene context and the guidance into a shared latent space and autoregressively decodes the target view in the form of discrete image tokens. The third module reconstructs the tokens into the image of the target view. SparseGNV is trained across a large indoor scene dataset to learn generalizable priors. Once trained, it can efficiently generate novel views of an unseen indoor scene in a feed-forward manner. We evaluate SparseGNV on both real-world and synthetic indoor scenes and demonstrate that it outperforms state-of-the-art methods based on either neural radiance fields or conditional image generation., Comment: 10 pages, 6 figures
Published: 2023

116. π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

Author: Wu, Chengyue, Wang, Teng, Ge, Yixiao, Lu, Zeyu, Zhou, Ruisong, Shan, Ying, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks. However, the potential of such multi-task learners has not been exploited during transfer learning. In this work, we present a universal parameter-efficient transfer learning method, termed Predict-Interpolate Tuning (π-Tuning), for vision, language, and vision-language tasks. It aggregates the parameters of lightweight task-specific experts learned from similar tasks to aid the target downstream task. The task similarities are predicted in a unified modality-independent space, yielding a scalable graph to demonstrate task relationships. π-Tuning has several appealing benefits. First, it flexibly explores both intra- and inter-modal transferability between similar tasks to improve the accuracy and robustness of transfer learning, especially in data-scarce scenarios. Second, it offers a systematic solution for transfer learning with multi-task prediction-and-then-interpolation, compatible with diverse types of parameter-efficient experts, such as prompt and adapter. Third, an extensive study of task-level mutual benefits on 14 unimodal and 6 multimodal datasets shows that π-Tuning surpasses fine-tuning and other parameter-efficient transfer learning methods both in full-shot and low-shot regimes. The task graph also enables an in-depth interpretable analysis of task transferability across modalities. The code will be available at https://github.com/TencentARC/pi-Tuning., Comment: To appear in ICML 2023
Published: 2023

117. HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video

Author: Liu, Jia-Wei, Cao, Yan-Pei, Yang, Tianyuan, Xu, Eric Zhongcong, Keppo, Jussi, Shan, Ying, Qie, Xiaohu, and Shou, Mike Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that reconstructs neural radiance fields for dynamic human-object-scene from a single monocular in-the-wild video. Our method enables pausing the video at any frame and rendering all scene details (dynamic humans, objects, and backgrounds) from arbitrary viewpoints. The first challenge in this task is the complex object motions in human-object interactions, which we tackle by introducing the new object bones into the conventional human skeleton hierarchy to effectively estimate large object deformations in our dynamic human-object model. The second challenge is that humans interact with different objects at different times, for which we introduce two new learnable object state embeddings that can be used as conditions for learning our human-object representation and scene representation, respectively. Extensive experiments show that HOSNeRF significantly outperforms SOTA approaches on two challenging datasets by a large margin of 40% ~ 50% in terms of LPIPS. The code, data, and compelling examples of 360° free-viewpoint renderings from single videos will be released in https://showlab.github.io/HOSNeRF., Comment: Project page: https://showlab.github.io/HOSNeRF
Published: 2023

118. SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes

Author: Gao, Yiming, Cao, Yan-Pei, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Online reconstruction and rendering of large-scale indoor scenes is a long-standing challenge. SLAM-based methods can reconstruct 3D scene geometry progressively in real time but cannot render photorealistic results. While NeRF-based methods produce promising novel view synthesis results, their long offline optimization time and lack of geometric constraints pose challenges to efficiently handling online input. Inspired by the complementary advantages of classical 3D reconstruction and NeRF, we thus investigate marrying explicit geometric representation with NeRF rendering to achieve efficient online reconstruction and high-quality rendering. We introduce SurfelNeRF, a variant of neural radiance field which employs a flexible and scalable neural surfel representation to store geometric attributes and extracted appearance features from input images. We further extend the conventional surfel-based fusion scheme to progressively integrate incoming input frames into the reconstructed global neural scene representation. In addition, we propose a highly-efficient differentiable rasterization scheme for rendering neural surfel radiance fields, which helps SurfelNeRF achieve 10× speedups in training and inference time, respectively. Experimental results show that our method achieves the state-of-the-art 23.82 PSNR and 29.58 PSNR on ScanNet in feedforward inference and per-scene optimization settings, respectively., Comment: To appear in CVPR 2023
Published: 2023

119. NeAI: A Pre-convoluted Representation for Plug-and-Play Neural Ambient Illumination

Author: Zhuang, Yiyu, Zhang, Qi, Wang, Xuan, Zhu, Hao, Feng, Ying, Li, Xiaoyu, Shan, Ying, and Cao, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images. However, the use of simplified lighting models such as environment maps to represent non-distant illumination, or using a network to fit indirect light modeling without a solid basis, can lead to an undesirable decomposition between lighting and material. To address this, we propose a fully differentiable framework named neural ambient illumination (NeAI) that uses Neural Radiance Fields (NeRF) as a lighting model to handle complex lighting in a physically based way. Together with integral lobe encoding for roughness-adaptive specular lobe and leveraging the pre-convoluted background for accurate decomposition, the proposed method represents a significant step towards integrating physically based rendering into the NeRF representation. The experiments demonstrate the superior performance of novel-view rendering compared to previous works, and the capability to re-render objects under arbitrary NeRF-style environments opens up exciting possibilities for bridging the gap between virtual and real-world scenes. The project and supplementary materials are available at https://yiyuzhuang.github.io/NeAI/., Comment: Project page: https://yiyuzhuang.github.io/NeAI/
Published: 2023

120. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

Author: Cao, Mingdeng, Wang, Xintao, Qi, Zhongang, Shan, Ying, Qie, Xiaohu, and Zheng, Yinqiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously. Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency. To further alleviate the query confusion between foreground and background, we propose a mask-guided mutual self-attention strategy, where the mask can be easily extracted from the cross-attention maps. Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing., Comment: Project available at https://ljzycmd.github.io/projects/MasaCtrl
Published: 2023
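
Illustrative note for entry 120: the abstract describes mutual self-attention as letting the image being generated or edited query contents from a source image. A minimal pure-Python sketch of that idea follows, assuming plain scaled dot-product attention in which queries come from the target while keys and values come from the source. The function names are hypothetical, and the sketch is not the authors' implementation: it leaves out the diffusion model and the mask-guided variant.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention over lists of equal-length vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def mutual_self_attention(target_queries, source_keys, source_values):
    # Hypothetical helper: target tokens attend to source-image tokens
    # instead of their own, so source content and texture are reused.
    return attention(target_queries, source_keys, source_values)
```

With a single source token, every target query simply copies that token's value, which makes it easy to see that the output content is drawn from the source.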

121. Improved Test-Time Adaptation for Domain Generalization

Author: Chen, Liang, Zhang, Yong, Song, Yibing, Shan, Ying, and Liu, Lingqiao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The main challenge in domain generalization (DG) is to handle the distribution shift problem that lies between the training and test data. Recent studies suggest that test-time training (TTT), which adapts the learned model with test data, might be a promising solution to the problem. Generally, a TTT strategy hinges its performance on two main factors: selecting an appropriate auxiliary TTT task for updating and identifying reliable parameters to update during the test phase. Both previous arts and our experiments indicate that TTT may not improve but be detrimental to the learned model if those two factors are not properly considered. This work addresses those two factors by proposing an Improved Test-Time Adaptation (ITTA) method. First, instead of heuristically defining an auxiliary objective, we propose a learnable consistency loss for the TTT task, which contains learnable parameters that can be adjusted toward better alignment between our TTT task and the main prediction task. Second, we introduce additional adaptive parameters for the trained model, and we suggest only updating the adaptive parameters during the test phase. Through extensive experiments, we show that the proposed two strategies are beneficial for the learned model (see Figure 1), and ITTA could achieve superior performance to the current state-of-the-art methods on several DG benchmarks. Code is available at https://github.com/liangchen527/ITTA., Comment: Accepted by CVPR 2023
Published: 2023

122. TagGPT: Large Language Models are Zero-shot Multimodal Taggers

Author: Li, Chen, Ge, Yixiao, Mao, Jiayong, Li, Dian, and Shan, Ying
Subjects: Computer Science - Information Retrieval
Abstract: Tags are pivotal in facilitating the effective distribution of multimedia content in various applications in the contemporary Internet era, such as search engines and recommendation systems. Recently, large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. In this work, we propose TagGPT, a fully automated system capable of tag extraction and multimodal tagging in a completely zero-shot fashion. Our core insight is that, through elaborate prompt engineering, LLMs are able to extract and reason about proper tags given textual clues of multimodal data, e.g., OCR, ASR, title, etc. Specifically, to automatically build a high-quality tag set that reflects user intent and interests for a specific application, TagGPT predicts large-scale candidate tags from a series of raw data via prompting LLMs, filtered with frequency and semantics. Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching with the tag set, and another selective method with early matching in prompts. Notably, TagGPT provides a system-level solution based on a modular framework equipped with a pre-trained LLM (GPT-3.5 used here) and a sentence embedding model (SimCSE used here), either of which can be seamlessly replaced with a more advanced alternative. TagGPT is applicable to various modalities of data in modern social media and showcases strong generalization ability to a wide range of applications. We evaluate TagGPT on publicly available datasets, i.e., Kuaishou and Food.com, and demonstrate the effectiveness of TagGPT compared to existing hashtags and off-the-shelf taggers. Project page: https://github.com/TencentARC/TagGPT., Comment: 13 pages, 6 figures
Published: 2023

123. Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

Author: Ma, Yue, He, Yingqing, Cun, Xiaodong, Wang, Xintao, Chen, Siran, Shan, Ying, Li, Xiu, and Chen, Qifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and the generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and the pre-trained text-to-image (T2I) model to obtain the pose-controllable character videos. Specifically, in the first stage, the keypoint-image pairs are used only for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition ability of the pre-trained T2I model. The code and models will be made publicly available., Comment: Project page: https://follow-your-pose.github.io/; Github repository: https://github.com/mayuelala/FollowYourPose
Published: 2023

124. DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

Author: Cao, Yukang, Cao, Yan-Pei, Han, Kai, Shan, Ying, and Wong, Kwan-Yee K.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present DreamAvatar, a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses. While encouraging results have been produced by recent methods on text-guided 3D common object generation, generating high-quality human avatars remains an open challenge due to the complexity of the human body's shape, pose, and appearance. We propose DreamAvatar to tackle this challenge, which utilizes a trainable NeRF for predicting density and color features for 3D points and a pre-trained text-to-image diffusion model for providing 2D self-supervision. Specifically, we leverage SMPL models to provide rough pose and shape guidance for the generation. We introduce a dual space design that comprises a canonical space and an observation space, which are related by a learnable deformation field through the NeRF, allowing for the transfer of well-optimized texture and geometry from the canonical space to the target posed avatar. Additionally, we exploit a normal-consistency regularization to allow for more vivid generation with detailed geometry and texture. Through extensive evaluations, we demonstrate that DreamAvatar significantly outperforms existing methods, establishing a new state-of-the-art for text-and-shape guided 3D human generation., Comment: 19 pages, 19 figures. Project page: https://yukangcao.github.io/DreamAvatar/
Published: 2023

125. Learning Anchor Transformations for 3D Garment Animation

Author: Zhao, Fang, Li, Zekun, Huang, Shaoli, Weng, Junwu, Zhou, Tianfei, Xie, Guo-Sen, Wang, Jue, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper proposes an anchor-based deformation model, namely AnchorDEF, to predict 3D garment animation from a body motion sequence. It deforms a garment mesh template by a mixture of rigid transformations with extra nonlinear displacements. A set of anchors around the mesh surface is introduced to guide the learning of rigid transformation matrices. Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, which reduces the complexity of deformation space learning. By explicitly constraining the transformed anchors to satisfy the consistencies of position, normal and direction, the physical meaning of learned anchor transformations in space is guaranteed for better generalization. Furthermore, an adaptive anchor updating is proposed to optimize the anchor position by being aware of local mesh topology for learning representative anchor transformations. Qualitative and quantitative experiments on different types of garments demonstrate that AnchorDEF achieves the state-of-the-art performance on 3D garment deformation prediction in motion, especially for loose-fitting garments., Comment: Accepted to CVPR 2023. Project page: https://semanticdh.github.io/AnchorDEF
Published: 2023

126. DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

Author: Wu, Qiangqiang, Yang, Tianyu, Liu, Ziquan, Wu, Baoyuan, Shan, Ying, and Chan, Antoni B.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better finetuning results on matching-based tasks than the ImageNet-based MAE with 2X faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git., Comment: CVPR 2023; V2: fixed typos in Table-2
Published: 2023

127. Glucose fluctuations aggravate cardiomyocyte apoptosis by enhancing the interaction between Txnip and Akt

Author: Zhang, Zhen-Ye, Pan, Lu, Dang, Shipeng, Wang, Ning, Zhao, Shan-Ying, Li, Feng, Wu, Li-Da, Zhang, Lei, Liu, Huan-Huan, Zhao, Ning, Yang, Ya-Juan, Qian, Ling-Ling, Liu, Tong, and Wang, Ru-Xing
Published: 2024
Full Text: View/download PDF

128. COVID-19 in the Tibet, China, the roof of the world: a comparative analysis of high-altitude residents and newcomers

Author: Yan, Xiaosong, Shan, Ying, Luo, Qiaoxia, Liu, Dongmei, Zhang, Yanfei, Zhai, Qiangrong, Zhou, Zhenxing, Wang, Weikai, La, Ba, Yan, Jingting, Zhu, Xiaoyao, Wang, Xinxin, Thuen, Langjie, Chen, Wei, Li, Qian, Zeng, Jian, Tian, Gang, Chen, Xiaoli, Ci, Qu, Zhou, Qidi, Jin, Xin, and Pingcuo, Tashi
Published: 2024
Full Text: View/download PDF

129. COVID-19 increases extracorporeal coagulation during hemodialysis associated with upregulation of vWF/FBLN5 signaling in patients with severe/critical symptoms

Author: Yang, Guang, Shan, Hui, Wu, Dibin, Li, Sanmu, Lai, Zhiwei, Zheng, Fengping, Xiong, Zibo, Xiong, Zuying, Diao, Yuhan, Shan, Ying, Chen, Yun, Wang, Aihong, Liang, Wei, and Yin, Yuxin
Published: 2024
Full Text: View/download PDF

130. Physiological response of potato leaves to uniconazole under drought stress during the tuber expansion period

Author: Ding, Kaixin, Shan, Ying, Wang, Lichun, Tian, Guokui, Li, Fengyun, Wang, Haiyan, Pang, Ze, Pan, Yang, and Jiang, Hui
Published: 2024
Full Text: View/download PDF

131. LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

Author: Zheng, Guangcong, Zhou, Xianpan, Li, Xuewei, Qi, Zhongang, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image generation where an image often has a complex scene of multiple objects, how to make strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects and designed to be object-aware and position-sensitive, allowing for precisely controlling the spatial related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID, CAS by relatively 46.35%, 26.70% on COCO-stuff and 44.29%, 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion., Comment: Accepted by CVPR2023
Published: 2023

132. VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis

Author: Guo, Yuan-Chen, Cao, Yan-Pei, Wang, Chen, He, Yu, Shan, Ying, Qie, Xiaohu, and Zhang, Song-Hai
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level. Compared to traditional mesh-based assets, this volumetric representation is more powerful in expressing scene geometry but inevitably suffers from high rendering costs and can hardly be involved in further processes like editing, posing significant difficulties in combination with the existing graphics pipeline. In this paper, we present a hybrid volume-mesh representation, VMesh, which depicts an object with a textured mesh along with an auxiliary sparse volume. VMesh retains the advantages of mesh-based assets, such as efficient rendering, compact storage, and easy editing, while also incorporating the ability to represent subtle geometric structures provided by the volumetric counterpart. VMesh can be obtained from multi-view images of an object and renders at 2K 60FPS on common consumer devices with high fidelity, unleashing new opportunities for real-time immersive applications., Comment: Project page: https://bennyguo.github.io/vmesh/
Published: 2023

133. Accelerating Vision-Language Pretraining with Free Language Modeling

Author: Wang, Teng, Ge, Yixiao, Zheng, Feng, Cheng, Ran, Shan, Ying, Qie, Xiaohu, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The state of the art in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling (MLM), that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from prediction loss. To accelerate the convergence of VLP, we propose a new pretraining task, namely, free language modeling (FLM), that enables a 100% prediction rate with arbitrary corruption rates. FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted. FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly. Extensive experiments show FLM could achieve an impressive 2.5x pretraining time reduction in comparison to the MLM-based methods, while keeping competitive performance on both vision-language understanding and generation tasks. Code will be public at https://github.com/TencentARC/FLM., Comment: To appear in CVPR 2023
Published: 2023

134. BoPR: Body-aware Part Regressor for Human Shape and Pose Estimation

Author: Cheng, Yongkang, Huang, Shaoli, Ning, Jifeng, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents a novel approach for estimating human body shape and pose from monocular images that effectively addresses the challenges of occlusions and depth ambiguity. Our proposed method BoPR, the Body-aware Part Regressor, first extracts features of both the body and part regions using an attention-guided mechanism. We then utilize these features to encode extra part-body dependency for per-part regression, with part features as queries and body feature as a reference. This allows our network to infer the spatial relationship of occluded parts with the body by leveraging visible parts and body reference information. Our method outperforms existing state-of-the-art methods on two benchmark datasets, and our experiments show that it significantly surpasses existing methods in terms of depth ambiguity and occlusion handling. These results provide strong evidence of the effectiveness of our approach. The code and data are available for research purposes at https://github.com/cyk990422/BoPR.
Published: 2023

135. HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions

Author: Ai, Hao, Cao, Zidong, Cao, Yan-Pei, Shan, Ying, and Wang, Lin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Depth estimation from a monocular 360° image is a burgeoning problem owing to its holistic sensing of a scene. Recently, some methods, e.g., OmniFusion, have applied the tangent projection (TP) to represent a 360° image and predicted depth values via patch-wise regressions, which are merged to get a depth map in equirectangular projection (ERP) format. However, these methods suffer from 1) the non-trivial process of merging plenty of patches; 2) capturing less holistic-with-regional contextual information by directly regressing the depth value of each pixel. In this paper, we propose a novel framework, HRDFuse, that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning the holistic contextual information from the ERP and the regional structural information from the TP. Firstly, we propose a spatial feature alignment (SFA) module that learns feature similarities between the TP and ERP to aggregate the TP features into a complete ERP feature map in a pixel-wise manner. Secondly, we propose a collaborative depth distribution classification (CDDC) module that learns the holistic-with-regional histograms capturing the ERP and TP depth distributions. As such, the final depth values can be predicted as a linear combination of histogram bin centers. Lastly, we adaptively combine the depth predictions from ERP and TP to obtain the final depth map. Extensive experiments show that our method predicts smoother and more accurate depth results while achieving favorably better results than the SOTA methods., Comment: To appear at CVPR2023, 20 pages
Published: 2023
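
Illustrative note for entry 135: the abstract states that final depth values can be predicted as a linear combination of histogram bin centers. The sketch below shows one common way to read that statement for a single pixel, assuming the network outputs one logit per depth bin and that the combination weights are the softmax probabilities of those logits. The function names, the softmax choice, and the bin edges are assumptions for illustration, not details taken from the paper.

```python
import math

def softmax(logits):
    # Numerically stable softmax over per-bin logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bin_centers(edges):
    # Midpoint of each consecutive pair of bin edges.
    return [(lo + hi) / 2 for lo, hi in zip(edges, edges[1:])]

def depth_from_bins(logits, edges):
    # Depth = sum_i p_i * c_i, with p_i the bin probability and c_i the bin center.
    probs = softmax(logits)
    centers = bin_centers(edges)
    return sum(p * c for p, c in zip(probs, centers))
```

For example, with edges 0, 2, 4, 6 (centers 1, 3, 5) and equal logits, the prediction is the mean of the centers; raising the logit of one bin pulls the prediction toward that bin's center.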

136. HMC: Hierarchical Mesh Coarsening for Skeleton-free Motion Retargeting

Author: Wang, Haoyu, Huang, Shaoli, Zhao, Fang, Yuan, Chun, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: We present a simple yet effective method for skeleton-free motion retargeting. Previous methods transfer motion between high-resolution meshes, failing to preserve the inherent local-part motions in the mesh. Addressing this issue, our proposed method learns the correspondence in a coarse-to-fine fashion by integrating the retargeting process with a mesh-coarsening pipeline. First, we propose a mesh-coarsening module that coarsens the mesh representations for better motion transfer. This module improves the ability to handle small-part motion and preserves the local motion interdependence between neighboring mesh vertices. Furthermore, we leverage a hierarchical refinement procedure to complement missing mesh details by gradually improving the low-resolution mesh output with a higher-resolution one. We evaluate our method on several well-known 3D character datasets, and it yields an average improvement of 25% on point-wise mesh Euclidean distance (PMD) against the state-of-the-art method. Moreover, our qualitative results show that our method is significantly helpful in preserving the moving consistency of different body parts on the target character due to disentangling body-part structures and mesh details in a hierarchical way.
Published: 2023

137. FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

Author: Qi, Chenyang, Cun, Xiaodong, Zhang, Yong, Lei, Chenyang, Wang, Xintao, Shan, Ying, and Chen, Qifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion-based generative models have achieved remarkable success in text-based image generation. However, since the generation process contains enormous randomness, it is still challenging to apply such models to real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask. To edit videos consistently, we propose several techniques based on the pre-trained models. Firstly, in contrast to the straightforward DDIM inversion technique, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are directly fused in the editing process rather than generated during denoising. To further minimize semantic leakage of the source video, we then fuse self-attentions with a blending mask obtained by cross-attention features from the source prompt. Furthermore, we have implemented a reform of the self-attention mechanism in denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Although succinct, our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model. We also have a better zero-shot shape-aware editing ability based on the text-to-video model. Extensive experiments demonstrate superior temporal consistency and editing capability compared with previous works., Comment: Accepted to ICCV 2023 as an Oral Presentation. Project page: https://fate-zero-edit.github.io ; GitHub repository: https://github.com/ChenyangQiQi/FateZero
Published: 2023

138. Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry

Author: Zhang, Jiaxu, Weng, Junwu, Kang, Di, Zhao, Fang, Huang, Shaoli, Zhe, Xuefei, Bao, Linchao, Shan, Ying, Wang, Jue, and Tu, Zhigang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics
Abstract: A good motion retargeting cannot be reached without reasonable consideration of source-target differences on both the skeleton and shape geometry levels. In this work, we propose a novel Residual RETargeting network (R2ET) structure, which relies on two neural modification modules, to adjust the source motions to fit the target skeletons and shapes progressively. In particular, a skeleton-aware module is introduced to preserve the source motion semantics. A shape-aware module is designed to perceive the geometries of target characters to reduce interpenetration and contact-missing. Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing. To balance these two modifications, we further present a balancing gate to conduct linear interpolation between them. Extensive experiments on the public dataset Mixamo demonstrate that our R2ET achieves the state-of-the-art performance, and provides a good balance between the preservation of motion semantics as well as the attenuation of interpenetration and contact-missing. Code is available at https://github.com/Kebii/R2ET., Comment: CVPR 2023
Published: 2023

139. Binary Embedding-based Retrieval at Tencent

Author: Gan, Yukang, Ge, Yixiao, Zhou, Chang, Su, Shupeng, Xu, Zhouchuan, Xu, Xuyuan, Hui, Quanchao, Chen, Xiang, Wang, Yexin, and Shan, Ying
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Large-scale embedding-based retrieval (EBR) is the cornerstone of search-related industrial applications. Given a user query, the system of EBR aims to identify relevant information from a large corpus of documents that may be tens or hundreds of billions in size. The storage and computation turn out to be expensive and inefficient with massive documents and high concurrent queries, making it difficult to further scale up. To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension. Specifically, we compress the full-precision query and document embeddings, formulated as float vectors in general, into a composition of multiple binary vectors using a lightweight transformation model with residual multilayer perceptron (MLP) blocks. We can therefore tailor the number of bits for different applications to trade off accuracy loss and cost savings. Importantly, we enable task-agnostic efficient training of the binarization model using a new embedding-to-embedding strategy. We also exploit the compatible training of binary embeddings so that the BEBR engine can support indexing among multiple embedding versions within a unified system. To further realize efficient search, we propose Symmetric Distance Calculation (SDC) to achieve lower response time than Hamming codes. We have successfully deployed BEBR in Tencent products, including Sogou, Tencent Video, QQ World, etc. The binarization algorithm can be seamlessly generalized to various tasks with multiple modalities. Extensive experiments on offline benchmarks and online A/B tests demonstrate the efficiency and effectiveness of our method, significantly saving 30%~50% index costs with almost no loss of accuracy at the system level.
Published: 2023
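
Illustrative note for entry 139: the abstract describes compressing a float embedding into a composition of multiple binary vectors, with the number of bits per dimension chosen per application. The sketch below illustrates the general residual-binarization idea only. It swaps in a hand-written rule (sign bits plus a mean-absolute-value scale at each level) for the learned residual MLP blocks mentioned in the abstract, and all function names are hypothetical.

```python
def sign(x):
    # Map a float to +1.0 or -1.0 (zero is treated as positive).
    return 1.0 if x >= 0 else -1.0

def recurrent_binarize(vec, levels):
    # Each level binarizes what the previous levels failed to capture.
    # `levels` plays the role of bits per dimension.
    residual = list(vec)
    codes = []
    for _ in range(levels):
        scale = sum(abs(r) for r in residual) / len(residual)
        bits = [sign(r) for r in residual]
        codes.append((scale, bits))
        residual = [r - scale * b for r, b in zip(residual, bits)]
    return codes

def reconstruct(codes):
    # Approximate the original vector as a scaled sum of binary vectors.
    out = [0.0] * len(codes[0][1])
    for scale, bits in codes:
        out = [o + scale * b for o, b in zip(out, bits)]
    return out

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

In this simplified rule, adding a level never increases the squared reconstruction error, which mirrors the trade-off the abstract describes between the number of bits and accuracy loss.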

140. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Author: Mou, Chong, Wang, Xintao, Xie, Liangbin, Wu, Yanze, Zhang, Jian, Qi, Zhongang, Shan, Ying, and Qie, Xiaohu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications., Comment: Tech Report. GitHub: https://github.com/TencentARC/T2I-Adapter
Published: 2023

141. OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer

Author: Yu, Fanghua, Wang, Xintao, Cao, Mingdeng, Li, Gen, Shan, Ying, and Dong, Chao
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Omnidirectional images (ODIs) have obtained lots of research interest for immersive experiences. Although ODIs require extremely high resolution to capture details of the entire scene, the resolutions of most ODIs are insufficient. Previous methods attempt to solve this issue by image super-resolution (SR) on equirectangular projection (ERP) images. However, they omit geometric properties of ERP in the degradation process, and their models can hardly generalize to real ERP images. In this paper, we propose Fisheye downsampling, which mimics the real-world imaging process and synthesizes more realistic low-resolution samples. Then we design a distortion-aware Transformer (OSRT) to modulate ERP distortions continuously and self-adaptively. Without a cumbersome process, OSRT outperforms previous methods by about 0.2dB on PSNR. Moreover, we propose a convenient data augmentation strategy, which synthesizes pseudo ERP images from plain images. This simple strategy can alleviate the over-fitting problem of large networks and significantly boost the performance of ODISR. Extensive experiments have demonstrated the state-of-the-art performance of our OSRT. Codes and models will be available at https://github.com/Fanghua-Yu/OSRT. Comment: main paper + supplement
Published: 2023

142. Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

Author: Chen, Yizhen, Wang, Jie, Lin, Lijian, Qi, Zhongang, Ma, Jin, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-language alignment learning for video-text retrieval has attracted a lot of attention in recent years. Most of the existing methods either transfer the knowledge of an image-text pretraining model to the video-text retrieval task without fully exploring the multi-modal information of videos, or simply fuse multi-modal features in a brute-force manner without explicit guidance. In this paper, we integrate multi-modal information in an explicit manner by tagging, and use the tags as the anchors for better video-text alignment. Various pretrained experts are utilized for extracting the information of multiple modalities, including object, person, motion, audio, etc. To take full advantage of this information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tag information. Furthermore, to strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks, Video Text Matching (VTM) and Masked Language Modeling (MLM). Extensive experimental results demonstrate that the TABLE model is capable of achieving State-Of-The-Art (SOTA) performance on various video-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC and DiDeMo. Comment: Accepted to AAAI 2023 (Oral)
Published: 2023

143. RILS: Masked Visual Reconstruction in Language Semantic Space

Author: Yang, Shusheng, Ge, Yixiao, Yi, Kun, Li, Dian, Shan, Ying, Qie, Xiaohu, and Wang, Xinggang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this work, we seek the synergy between the two paradigms and study the emerging properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sentence representations, encoded by the text encoder, serve as prototypes to transform the vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets. The vision models can therefore capture useful components with structured information by predicting the proper semantics of masked tokens. Better visual representations could, in turn, improve the text encoder via the image-text alignment objective, which is essential for the effective MIM target transformation. Extensive experimental results demonstrate that our method not only enjoys the best of previous MIM and CLIP but also achieves further improvements on various tasks due to their mutual benefits. RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially for low-shot regimes. Code will be made available at https://github.com/hustvl/RILS.
Published: 2023

144. DPE: Disentanglement of Pose and Expression for General Video Portrait Editing

Author: Pang, Youxin, Zhang, Yong, Quan, Weize, Fan, Yanbo, Cun, Xiaodong, Shan, Ying, and Yan, Dong-ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: One-shot video-driven talking face generation aims at producing a synthetic talking video by transferring the facial motion from a video to an arbitrary portrait image. Head pose and facial expression are always entangled in facial motion and transferred simultaneously. However, the entanglement sets up a barrier for these methods to be used in video portrait editing directly, where it may be required to modify the expression only while keeping the pose unchanged. One challenge of decoupling pose and expression is the lack of paired data, such as the same pose but different expressions. Only a few methods attempt to tackle this challenge with the feat of 3D Morphable Models (3DMMs) for explicit disentanglement. But 3DMMs are not accurate enough to capture facial details due to the limited number of blendshapes, which has side effects on motion transfer. In this paper, we introduce a novel self-supervised disentanglement framework to decouple pose and expression without 3DMMs and paired data, which consists of a motion editing module, a pose generator, and an expression generator. The editing module projects faces into a latent space where pose motion and expression motion can be disentangled, and the pose or expression transfer can be performed in the latent space conveniently via addition. The two generators render the modified latent codes to images, respectively. Moreover, to guarantee the disentanglement, we propose a bidirectional cyclic training strategy with well-designed constraints. Evaluations demonstrate our method can control pose or expression independently and be used for general video editing. Comment: https://carlyx.github.io/DPE/
Published: 2023

145. Study on the mechanism of DDX6 promoting proliferation and migration of nasopharyngeal carcinoma cells by regulating stability of CKMT1A mRNA

Author: XIONG Jiayan, LEI Wei, YOU Bo, ZHANG Zhenxin, XIE Haijing, SHAN Ying, XIA Tian, ZHOU Yong
Subjects: nasopharyngeal carcinoma, ckmt1a, ddx6, cell proliferation, cell migration, Neoplasms. Tumors. Oncology. Including cancer and carcinogens, RC254-282
Abstract: Background and purpose: DDX is an adenosine triphosphate (ATP)-dependent RNA helicase closely related to mRNA regulation, tumor proliferation and invasion. This article aimed to explore the effect of DDX6, a member of the DDX family, on the stability of CKMT1A mRNA, as well as the effect of the DDX6-CKMT1A axis on the proliferation and migration ability of human nasopharyngeal carcinoma cell CNE2 and its molecular mechanism. Methods: We retrieved the data of expressions of DDX6 and CKMT1A in human head and neck squamous cell carcinoma from The Cancer Genome Atlas (TCGA) database and performed a correlation analysis. Western blot was performed to detect the expressions of CKMT1A and DDX6 in human nasopharyngeal carcinoma tissues and normal nasopharyngeal tissues preserved by Affiliated Hospital of Nantong University. This study was approved by the Ethics Committee of Affiliated Hospital of Nantong University (Number: 2022-L114). We used transwell assay to detect cell migration ability, EdU assay to detect cell proliferation ability, and colony formation assay to detect clone formation ability. We transfected cells with lentivirus and plasmids to construct sh-DDX6, sh-CKMT1A, sh-CKMT1A+sh-DDX6 and oe-CKMT1A cell models derived from the human nasopharyngeal carcinoma cell line CNE2, preserved by Affiliated Hospital of Nantong University, to clarify the impact of DDX6 and CKMT1A expression levels on the malignant biological phenotypes of nasopharyngeal carcinoma cells. A BALB/c nude mice subcutaneous xenograft tumor model was constructed to detect the effects of DDX6 and CKMT1A on nasopharyngeal carcinoma cells in mice. RNA stability assay was used to detect the effect of DDX6 knockout on CKMT1A mRNA and further clarify the molecular mechanism of DDX6. Results: DDX6 was highly expressed, CKMT1A level was low in human nasopharyngeal carcinoma tissue, and DDX6 was negatively correlated with CKMT1A expression. DDX6 inhibited protein translation of CKMT1A by disrupting its mRNA stability. Low expression of CKMT1A in CNE2 cells enhanced cell migration and proliferation ability, while high expression inhibited migration and proliferation ability. Knocking out DDX6 reversed the progression of malignant behavior caused by downregulation of CKMT1A. Low expression of CKMT1A promoted tumor cell growth in the BALB/c nude mice subcutaneous xenograft tumor model, while low expression of DDX6 inhibited tumor cell growth. Knocking out DDX6 and CKMT1A simultaneously restored the inhibitory effect caused by knocking down DDX6 alone. Conclusion: DDX6 in nasopharyngeal carcinoma cells disrupts the stability of CKMT1A mRNA, negatively regulates CKMT1A protein translation, upregulates the proliferation and migration ability of nasopharyngeal carcinoma cells, and promotes malignant progression of nasopharyngeal carcinoma.
Published: 2024
Full Text: View/download PDF

146. Correction: Rotenone Could Activate Microglia Through NFκB Associated Pathway

Author: Yuan, Yu-he, Sun, Jian-dong, Wu, Miao-miao, Hu, Jin-feng, Peng, Shan-ying, and Chen, Nai-Hong
Published: 2024
Full Text: View/download PDF

147. Power Comprehensive Energy Optimization Scheme Based on Improved PSO Algorithm

Author: Shan, Ying, primary and Chang, Di, additional
Published: 2024
Full Text: View/download PDF

148. Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models

Author: Xu, Jiale, Wang, Xintao, Cheng, Weihao, Cao, Yan-Pei, Shan, Ying, Qie, Xiaohu, and Gao, Shenghua
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training and random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process. Specifically, we first generate a high-quality 3D shape from the input text in the text-to-shape stage as a 3D shape prior. We then use it as the initialization of a neural radiance field and optimize it with the full prompt. To address the challenging text-to-shape generation task, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between the images synthesized by the text-to-image diffusion model and shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy compared to state-of-the-art methods. Comment: Accepted by CVPR 2023. Project page: https://bluestyle97.github.io/dream3d/
Published: 2022

149. Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

Author: Wu, Jay Zhangjie, Ge, Yixiao, Wang, Xintao, Lei, Weixian, Gu, Yuchao, Shi, Yufei, Hsu, Wynne, Shan, Ying, Qie, Xiaohu, and Shou, Mike Zheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications. Comment: Preprint
Published: 2022

150. Mitigating Artifacts in Real-World Video Super-Resolution Models

Author: Xie, Liangbin, Wang, Xintao, Shi, Shuwei, Gu, Jinjin, Dong, Chao, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The recurrent structure is a prevalent framework for the task of video super-resolution, which models the temporal dependency between frames via hidden states. When applied to real-world scenarios with unknown and complex degradations, hidden states tend to contain unpleasant artifacts and propagate them to restored frames. In this circumstance, our analyses show that such artifacts can be largely alleviated when the hidden state is replaced with a cleaner counterpart. Based on the observations, we propose a Hidden State Attention (HSA) module to mitigate artifacts in real-world video super-resolution. Specifically, we first adopt various cheap filters to produce a hidden state pool. For example, Gaussian blur filters are for smoothing artifacts while sharpening filters are for enhancing details. To aggregate a new hidden state that contains fewer artifacts from the hidden state pool, we devise a Selective Cross Attention (SCA) module, in which the attention between input features and each hidden state is calculated. Equipped with HSA, our proposed method, namely FastRealVSR, is able to achieve 2x speedup while obtaining better performance than Real-BasicVSR. Codes will be available at https://github.com/TencentARC/FastRealVSR. Comment: Accepted by AAAI 2023.
Published: 2022
