Author: "Qi, Zhongang" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Qi, Zhongang"' showing total 132 results

Start Over Author "Qi, Zhongang"

132 results on '"Qi, Zhongang"'

1. DOGE: Towards Versatile Visual Document Grounding and Referring

Author: Zhou, Yinan, Chen, Yuxin, Lin, Haokun, Yang, Shuyu, Zhu, Li, Qi, Zhongang, Ma, Chen, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In recent years, Multimodal Large Language Models (MLLMs) have increasingly emphasized grounding and referring capabilities to achieve detailed understanding and flexible user interaction. However, in the realm of visual document understanding, these capabilities lag behind due to the scarcity of fine-grained datasets and comprehensive benchmarks. To fill this gap, we propose the DOcument Grounding and Eferring data engine (DOGE-Engine), which produces two types of high-quality fine-grained document data: multi-granular parsing data for enhancing fundamental text localization and recognition capabilities; and instruction-tuning data to activate MLLM's grounding and referring capabilities during dialogue and reasoning. Additionally, using our engine, we construct DOGE-Bench, which encompasses 7 grounding and referring tasks across 3 document types (chart, poster, PDF document), providing comprehensive evaluations for fine-grained document understanding. Furthermore, leveraging the data generated by our engine, we develop a strong baseline model, DOGE. This pioneering MLLM is capable of accurately referring and grounding texts at multiple granularities within document images. Our code, data, and model will be open-sourced for community development., Comment: 20 pages, 13 figures
Published: 2024

2. mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Author: Zhang, Tao, Zhang, Ziqi, Ma, Zongyang, Chen, Yuxin, Qi, Zhongang, Yuan, Chunfeng, Li, Bing, Pu, Junfu, Zhao, Yuxuan, Xie, Zehua, Ma, Jin, Shan, Ying, and Hu, Weiming
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, due to their limited and frozen knowledge scope, often leading to ambiguous and inaccurate responses. Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope. However, current mRAG methods have inherent drawbacks, including: 1) Performing retrieval even when external knowledge is not needed. 2) Lacking of identification of evidence that supports the query. 3) Increasing model complexity due to additional information filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called \textbf{m}ultimodal \textbf{R}etrieval-\textbf{R}eflection-\textbf{A}ugmented \textbf{G}eneration (mR$^2$AG), which achieves adaptive retrieval and useful information localization to enable answers through two easy-to-implement reflection operations, preventing high model complexity. In mR$^2$AG, Retrieval-Reflection is designed to distinguish different user queries and avoids redundant retrieval calls, and Relevance-Reflection is introduced to guide the MLLM in locating beneficial evidence of the retrieved content and generating answers accordingly. In addition, mR$^2$AG can be integrated into any well-trained MLLM with efficient fine-tuning on the proposed mR$^2$AG Instruction-Tuning dataset (mR$^2$AG-IT). mR$^2$AG significantly outperforms state-of-the-art MLLMs (e.g., GPT-4v/o) and RAG-based MLLMs on INFOSEEK and Encyclopedic-VQA, while maintaining the exceptional capabilities of base MLLMs across a wide range of Visual-dependent tasks.
Published: 2024

3. Taming Rectified Flow for Inversion and Editing

Author: Wang, Jiangshan, Pu, Junfu, Qi, Zhongang, Guo, Jiayi, Ma, Yue, Huang, Nisha, Chen, Yuxin, Li, Xiu, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Rectified-flow-based diffusion transformers like FLUX and OpenSora have demonstrated outstanding performance in the field of image and video generation. Despite their robust generative capabilities, these models often struggle with inversion inaccuracies, which could further limit their effectiveness in downstream tasks such as image and video editing. To address this issue, we propose RF-Solver, a novel training-free sampler that effectively enhances inversion precision by mitigating the errors in the ODE-solving process of rectified flow. Specifically, we derive the exact formulation of the rectified flow ODE and apply the high-order Taylor expansion to estimate its nonlinear components, significantly enhancing the precision of ODE solutions at each timestep. Building upon RF-Solver, we further propose RF-Edit, a general feature-sharing-based framework for image and video editing. By incorporating self-attention features from the inversion process into the editing process, RF-Edit effectively preserves the structural information of the source image or video while achieving high-quality editing results. Our approach is compatible with any pre-trained rectified-flow-based models for image and video tasks, requiring no additional training or optimization. Extensive experiments across generation, inversion, and editing tasks in both image and video modalities demonstrate the superiority and versatility of our method. The source code is available at https://github.com/wangjiangshan0725/RF-Solver-Edit., Comment: GitHub: https://github.com/wangjiangshan0725/RF-Solver-Edit
Published: 2024

4. E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

Author: Liu, Ye, Ma, Zongyang, Qi, Zhongang, Wu, Yang, Shan, Ying, and Chen, Chang Wen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks merely evaluate models through video-level question-answering, lacking fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale and high-quality benchmark for open-ended event-level video understanding. Categorized within a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) under 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding event-of-interests within videos, largely due to the short video context length, improper time representations, and lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset E.T. Instruct 164K tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios., Comment: Accepted to NeurIPS 2024 Datasets and Benchmarks Track
Published: 2024

5. CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

Author: Wu, Tao, Zhang, Yong, Wang, Xintao, Zhou, Xianpan, Zheng, Guangcong, Qi, Zhongang, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Customized video generation aims to generate high-quality videos guided by text prompts and subject's reference images. However, since it is only trained on static images, the fine-tuning process of subject learning disrupts abilities of video diffusion models (VDMs) to combine concepts and generate motions. To restore these abilities, some methods use additional video similar to the prompt to fine-tune or guide the model. This requires frequent changes of guiding videos and even re-tuning of the model when generating different motions, which is very inconvenient for users. In this paper, we propose CustomCrafter, a novel framework that preserves the model's motion generation and conceptual combination abilities without additional video and fine-tuning to recovery. For preserving conceptual combination ability, we design a plug-and-play module to update few parameters in VDMs, enhancing the model's ability to capture the appearance details and the ability of concept combinations for new subjects. For motion generation, we observed that VDMs tend to restore the motion of video in the early stage of denoising, while focusing on the recovery of subject details in the later stage. Therefore, we propose Dynamic Weighted Video Sampling Strategy. Using the pluggability of our subject learning modules, we reduce the impact of this module on motion generation in the early stage of denoising, preserving the ability to generate motion of VDMs. In the later stage of denoising, we restore this module to repair the appearance details of the specified subject, thereby ensuring the fidelity of the subject's appearance. Experimental results show that our method has a significant improvement compared to previous methods., Comment: project page: https://customcrafter.github.io/
Published: 2024

6. SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

Author: Tan, Chaolei, Lin, Zihang, Pu, Junfu, Qi, Zhongang, Pei, Wei-Yi, Qu, Zhi, Wang, Yexin, Shan, Ying, Zheng, Wei-Shi, and Hu, Jian-Fang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Video grounding is a fundamental problem in multimodal content understanding, aiming to localize specific natural language queries in an untrimmed video. However, current video grounding datasets merely focus on simple events and are either limited to shorter videos or brief sentences, which hinders the model from evolving toward stronger multimodal understanding capabilities. To address these limitations, we present a large-scale video grounding dataset named SynopGround, in which more than 2800 hours of videos are sourced from popular TV dramas and are paired with accurately localized human-written synopses. Each paragraph in the synopsis serves as a language query and is manually annotated with precise temporal boundaries in the long video. These paragraph queries are tightly correlated to each other and contain a wealth of abstract expressions summarizing video storylines and specific descriptions portraying event details, which enables the model to learn multimodal perception on more intricate concepts over longer context dependencies. Based on the dataset, we further introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG), which takes as input multiple paragraphs and a long video for grounding each paragraph query to its temporal interval. In addition, we propose a novel Local-Global Multimodal Reasoner (LGMR) to explicitly model the local-global structures of long-term multimodal inputs for MPVG. Our method provides an effective baseline solution to the multi-paragraph video grounding problem. Extensive experiments verify the proposed model's effectiveness as well as its superiority in long-term multi-paragraph video grounding over prior state-of-the-arts. Dataset and code are publicly available. Project page: https://synopground.github.io/., Comment: Accepted to ACM MM 2024. Project page: https://synopground.github.io/
Published: 2024

7. How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Author: Chen, Yuxin, Ma, Zongyang, Zhang, Ziqi, Qi, Zhongang, Yuan, Chunfeng, Li, Bing, Pu, Junfu, Shan, Ying, Qi, Xiaojuan, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy while the cross-encoder models offer higher accuracy at the expense of efficiency. Distilling cross-modality matching knowledge from cross-encoder to dual-encoder provides a natural approach to harness their strengths. Thus we investigate the following valuable question: how to make cross-encoder a good teacher for dual-encoder? Our findings are threefold:(1) Cross-modal similarity score distribution of cross-encoder is more concentrated while the result of dual-encoder is nearly normal making vanilla logit distillation less effective. However ranking distillation remains practical as it is not affected by the score distribution.(2) Only the relative order between hard negatives conveys valid knowledge while the order information between easy negatives has little significance.(3) Maintaining the coordination between distillation loss and dual-encoder training loss is beneficial for knowledge transfer. Based on these findings we propose a novel Contrastive Partial Ranking Distillation (CPRD) method which implements the objective of mimicking relative order between hard negative samples with contrastive learning. This approach coordinates with the training of the dual-encoder effectively transferring valid knowledge from the cross-encoder to the dual-encoder. Extensive experiments on image-text retrieval and ranking tasks show that our method surpasses other distillation methods and significantly improves the accuracy of dual-encoder., Comment: Accepted by CVPR 2024
Published: 2024

8. EA-VTR: Event-Aware Video-Text Retrieval

Author: Ma, Zongyang, Zhang, Ziqi, Chen, Yuxin, Qi, Zhongang, Yuan, Chunfeng, Li, Bing, Luo, Yingmin, Li, Xu, Qi, Xiaojuan, Shan, Ying, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding the content of events occurring in the video and their inherent temporal logic is crucial for video-text retrieval. However, web-crawled pre-training datasets often lack sufficient event information, and the widely adopted video-level cross-modal contrastive learning also struggles to capture detailed and complex video-text event alignment. To address these challenges, we make improvements from both data and model perspectives. In terms of pre-training data, we focus on supplementing the missing specific event content and event temporal transitions with the proposed event augmentation strategies. Based on the event-augmented data, we construct a novel Event-Aware Video-Text Retrieval model, ie, EA-VTR, which achieves powerful video-text retrieval ability through superior video event awareness. EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events. Our method not only significantly outperforms existing approaches on multiple datasets for Text-to-Video Retrieval and Video Action Recognition tasks, but also demonstrates superior event content perceive ability on Multi-event Video-Text Retrieval and Video Moment Retrieval tasks, as well as outstanding event temporal logic understanding ability on Test of Time task., Comment: Accepted by ECCV 2024
Published: 2024

9. PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Author: Yang, Tao, Luo, Yingmin, Qi, Zhongang, Wu, Yang, Shan, Ying, and Chen, Chang Wen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Layout generation is the keystone in achieving automated graphic design, requiring arranging the position and size of various multi-modal design elements in a visually pleasing and constraint-following manner. Previous approaches are either inefficient for large-scale applications or lack flexibility for varying design requirements. Our research introduces a unified framework for automated graphic layout generation, leveraging the multi-modal large language model (MLLM) to accommodate diverse design tasks. In contrast, our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts under specific visual and textual constraints, including user-defined natural language specifications. We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks, demonstrating the effectiveness of our method. Moreover, recognizing existing datasets' limitations in capturing the complexity of real-world graphic designs, we propose two new datasets for much more challenging tasks (user-constrained generation and complicated poster), further validating our model's utility in real-life settings. Marking by its superior accessibility and adaptability, this approach further automates large-scale graphic design tasks. Finally, we develop an automated text-to-poster system that generates editable SVG posters based on users' design intentions, bridging the gap between layout generation and real-world graphic design applications. This system integrates our proposed layout generation method as the core component, demonstrating its effectiveness in practical scenarios. The code and datasets are open-sourced on https://github.com/posterllava/PosterLLaVA., Comment: 13 pages; with PosterGen as extension; IEEE template
Published: 2024

10. StyleAdapter: A Unified Stylized Image Generation Model

Author: Wang, Zhouxia, Wang, Xintao, Xie, Liangbin, Qi, Zhongang, Shan, Ying, Wang, Wenping, and Luo, Ping
Published: 2024
Full Text: View/download PDF

11. SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

Author: Wu, Tao, Li, Xuewei, Qi, Zhongang, Hu, Di, Wang, Xintao, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains.However, it remains a challenging task due to the inherent spherical distortion and geometry characteristics, resulting in low-quality content generation.In this paper, we introduce a novel framework of SphereDiffusion to address these unique challenges, for better generating high-quality and precisely controllable spherical panoramic images.For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of the planar images.Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion.For the spherical geometry characteristic, in virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic.Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images.With these specific techniques, experiments on Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation and relatively reduces around 35% FID on average., Comment: Accepted by AAAI2024
Published: 2024

12. RecDCL: Dual Contrastive Learning for Recommendation

Author: Zhang, Dan, Geng, Yangliao, Gong, Wenwen, Qi, Zhongang, Chen, Zhiyu, Tang, Xing, Shan, Ying, Dong, Yuxiao, and Tang, Jie
Subjects: Computer Science - Information Retrieval, Computer Science - Computation and Language
Abstract: Self-supervised learning (SSL) has recently achieved great success in mining the user-item interactions for collaborative filtering. As a major paradigm, contrastive learning (CL) based SSL helps address data sparsity in Web platforms by contrasting the embeddings between raw and augmented data. However, existing CL-based methods mostly focus on contrasting in a batch-wise way, failing to exploit potential regularity in the feature dimension. This leads to redundant solutions during the representation learning of users and items. In this work, we investigate how to employ both batch-wise CL (BCL) and feature-wise CL (FCL) for recommendation. We theoretically analyze the relation between BCL and FCL, and find that combining BCL and FCL helps eliminate redundant solutions but never misses an optimal solution. We propose a dual contrastive learning recommendation framework -- RecDCL. In RecDCL, the FCL objective is designed to eliminate redundant solutions on user-item positive pairs and to optimize the uniform distributions within users and items using a polynomial kernel for driving the representations to be orthogonal; The BCL objective is utilized to generate contrastive embeddings on output vectors for enhancing the robustness of the representations. Extensive experiments on four widely-used benchmarks and one industry dataset demonstrate that RecDCL can consistently outperform the state-of-the-art GNNs-based and SSL-based models (with an improvement of up to 5.65\% in terms of Recall@20). The source code is publicly available (https://github.com/THUDM/RecDCL)., Comment: Accepted to WWW 2024
Published: 2024
Full Text: View/download PDF

13. EA-VTR: Event-Aware Video-Text Retrieval

Author: Ma, Zongyang, Zhang, Ziqi, Chen, Yuxin, Qi, Zhongang, Yuan, Chunfeng, Li, Bing, Luo, Yingmin, Li, Xu, Qi, Xiaojuan, Shan, Ying, Hu, Weiming, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Leonardis, Aleš, editor, Ricci, Elisa, editor, Roth, Stefan, editor, Russakovsky, Olga, editor, Sattler, Torsten, editor, and Varol, Gül, editor
Published: 2025
Full Text: View/download PDF

14. PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Author: Li, Zhen, Cao, Mingdeng, Wang, Xintao, Qi, Zhongang, Cheng, Ming-Ming, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However, existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency, promising identity (ID) fidelity, and flexible text controllability. In this work, we introduce PhotoMaker, an efficient personalized text-to-image generation method, which mainly encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information. Such an embedding, serving as a unified ID representation, can not only encapsulate the characteristics of the same input ID comprehensively, but also accommodate the characteristics of different IDs for subsequent integration. This paves the way for more intriguing and practically valuable applications. Besides, to drive the training of our PhotoMaker, we propose an ID-oriented data construction pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline, our PhotoMaker demonstrates better ID preservation ability than test-time fine-tuning based methods, yet provides significant speed improvements, high-quality generation results, strong generalization capabilities, and a wide range of applications. Our project page is available at https://photo-maker.github.io/, Comment: Tech report; Project page: https://photo-maker.github.io/
Published: 2023

15. CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

Author: Yuan, Ziyang, Cao, Mingdeng, Wang, Xintao, Qi, Zhongang, Yuan, Chun, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Incorporating a customized object into image generation presents an attractive feature in text-to-image generation. However, existing optimization-based and encoder-based methods are hindered by drawbacks such as time-consuming optimization, insufficient identity preservation, and a prevalent copy-pasting effect. To overcome these limitations, we introduce CustomNet, a novel object customization approach that explicitly incorporates 3D novel view synthesis capabilities into the object customization process. This integration facilitates the adjustment of spatial position relationships and viewpoints, yielding diverse outputs while effectively preserving object identity. Moreover, we introduce delicate designs to enable location control and flexible background control through textual descriptions or specific user-defined images, overcoming the limitations of existing 3D novel view synthesis methods. We further leverage a dataset construction pipeline that can better handle real-world objects and complex backgrounds. Equipped with these designs, our method facilitates zero-shot object customization without test-time optimization, offering simultaneous control over the viewpoints, location, and background. As a result, our CustomNet ensures enhanced identity preservation and generates diverse, harmonious outputs., Comment: Project webpage available at https://jiangyzy.github.io/CustomNet/
Published: 2023

16. StyleAdapter: A Unified Stylized Image Generation Model

Author: Wang, Zhouxia, Wang, Xintao, Xie, Liangbin, Qi, Zhongang, Shan, Ying, Wang, Wenping, and Luo, Ping
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work focuses on generating high-quality images with specific style of reference images and content of provided textual descriptions. Current leading algorithms, i.e., DreamBooth and LoRA, require fine-tuning for each style, leading to time-consuming and computationally expensive processes. In this work, we propose StyleAdapter, a unified stylized image generation model capable of producing a variety of stylized images that match both the content of a given prompt and the style of reference images, without the need for per-style fine-tuning. It introduces a two-path cross-attention (TPCA) module to separately process style information and textual prompt, which cooperate with a semantic suppressing vision model (SSVM) to suppress the semantic content of style images. In this way, it can ensure that the prompt maintains control over the content of the generated images, while also mitigating the negative impact of semantic information in style references. This results in the content of the generated image adhering to the prompt, and its style aligning with the style references. Besides, our StyleAdapter can be integrated with existing controllable synthesis methods, such as T2I-adapter and ControlNet, to attain a more controllable and stable generation process. Extensive experiments demonstrate the superiority of our method over previous works., Comment: Accepted by IJCV24
Published: 2023

17. Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

Author: Di, Qianji, Ma, Wenxi, Qi, Zhongang, Hou, Tianxiang, Shan, Ying, and Wang, Hanzi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scene Graph Generation (SGG) aims to structurally and comprehensively represent objects and their connections in images, it can significantly benefit scene understanding and other related downstream tasks. Existing SGG models often struggle to solve the long-tailed problem caused by biased datasets. However, even if these models can fit specific datasets better, it may be hard for them to resolve the unseen triples which are not included in the training set. Most methods tend to feed a whole triple and learn the overall features based on statistical machine learning. Such models have difficulty predicting unseen triples because the objects and predicates in the training set are combined differently as novel triples in the test set. In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of the SGG models. We propose a Joint Fearture Learning (JFL) module and a Factual Knowledge based Refinement (FKR) module to learn object and predicate categories separately at the feature level and align them with corresponding visual features so that the model is no longer limited to triples matching. Besides, since we observe the long-tailed problem also affects the generalization ability, we design a novel balanced learning strategy, including a Charater Guided Sampling (CGS) and an Informative Re-weighting (IR) module, to provide tailor-made learning methods for each predicate according to their characters. Extensive experiments show that our model achieves state-of-the-art performance. In more detail, TISGG boosts the performances by 11.7% of zR@20(zero-shot recall) on the PredCls sub-task on the Visual Genome dataset.
Published: 2023

18. Sticker820K: Empowering Interactive Retrieval with Stickers

Author: Zhao, Sijie, Ge, Yixiao, Qi, Zhongang, Song, Lin, Ding, Xiaohan, Xie, Zehua, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Stickers have become a ubiquitous part of modern-day communication, conveying complex emotions through visual imagery. To facilitate the development of more powerful algorithms for analyzing stickers, we propose a large-scale Chinese sticker dataset, namely Sticker820K, which consists of 820k image-text pairs. Each sticker has rich and high-quality textual annotations, including descriptions, optical characters, emotional labels, and style classifications. Although vision-language tasks in the domain of natural images have been well studied, directly applying the those models, such as CLIP, to sticker data is not an optimal solution due to the discrepant nature between natural and emotive image data. Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset. For the text-to-image retrieval task, our StickerCLIP demonstrates strong superiority over the CLIP, which achieves an absolute gain of 66.0\% in mean recall on the Sticker820K test set. Additionally, we endeavor to extend the recently popularized LLM by means of prompt tuning, integrating its ability for sticker retrieval and allowing users to retrieve stickers through instructions. We validate the feasibility of this method, demonstrating the immense potential of prompt tuning in expanding LLM abilities while not affecting the quality of upstream tasks.
Published: 2023

19. SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

Author: Li, Xuewei, Wu, Tao, Qi, Zhongang, Wang, Gaoang, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: As an important and challenging problem in computer vision, PAnoramic Semantic Segmentation (PASS) gives complete scene perception based on an ultra-wide angle of view. Usually, prevalent PASS methods with 2D panoramic image input focus on solving image distortions but lack consideration of the 3D properties of original $360^{\circ}$ data. Therefore, their performance will drop a lot when inputting panoramic images with the 3D disturbance. To be more robust to 3D disturbance, we propose our Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation (SGAT4PASS), considering 3D spherical geometry knowledge. Specifically, a spherical geometry-aware framework is proposed for PASS. It includes three modules, i.e., spherical geometry-aware image projection, spherical deformable patch embedding, and a panorama-aware loss, which takes input images with 3D disturbance into account, adds a spherical geometry-aware constraint on the existing deformable patch embedding, and indicates the pixel density of original $360^{\circ}$ data, respectively. Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude. Our code and supplementary material are available at https://github.com/TencentARC/SGAT4PASS., Comment: Accepted by IJCAI 2023
Published: 2023

20. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

Author: Cao, Mingdeng, Wang, Xintao, Qi, Zhongang, Shan, Ying, Qie, Xiaohu, and Zheng, Yinqiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously. Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency. To further alleviate the query confusion between foreground and background, we propose a mask-guided mutual self-attention strategy, where the mask can be easily extracted from the cross-attention maps. Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing., Comment: Project available at https://ljzycmd.github.io/projects/MasaCtrl
Published: 2023

21. LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

Author: Zheng, Guangcong, Zhou, Xianpan, Li, Xuewei, Qi, Zhongang, Shan, Ying, and Li, Xi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image generation where an image often has a complex scene of multiple objects, how to make strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects and designed to be object-aware and position-sensitive, allowing for precisely controlling the spatial related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID, CAS by relatively 46.35%, 26.70% on COCO-stuff and 44.29%, 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion., Comment: Accepted by CVPR2023
Published: 2023

22. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Author: Mou, Chong, Wang, Xintao, Xie, Liangbin, Wu, Yanze, Zhang, Jian, Qi, Zhongang, Shan, Ying, and Qie, Xiaohu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications., Comment: Tech Report. GitHub: https://github.com/TencentARC/T2I-Adapter
Published: 2023

23. Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

Author: Chen, Yizhen, Wang, Jie, Lin, Lijian, Qi, Zhongang, Ma, Jin, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-language alignment learning for video-text retrieval arouses a lot of attention in recent years. Most of the existing methods either transfer the knowledge of image-text pretraining model to video-text retrieval task without fully exploring the multi-modal information of videos, or simply fuse multi-modal features in a brute force manner without explicit guidance. In this paper, we integrate multi-modal information in an explicit manner by tagging, and use the tags as the anchors for better video-text alignment. Various pretrained experts are utilized for extracting the information of multiple modalities, including object, person, motion, audio, etc. To take full advantage of these information, we propose the TABLE (TAgging Before aLignmEnt) network, which consists of a visual encoder, a tag encoder, a text encoder, and a tag-guiding cross-modal encoder for jointly encoding multi-frame visual features and multi-modal tags information. Furthermore, to strengthen the interaction between video and text, we build a joint cross-modal encoder with the triplet input of [vision, tag, text] and perform two additional supervised tasks, Video Text Matching (VTM) and Masked Language Modeling (MLM). Extensive experimental results demonstrate that the TABLE model is capable of achieving State-Of-The-Art (SOTA) performance on various video-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC and DiDeMo., Comment: Accepted to AAAI 2023 (Oral)
Published: 2023

24. Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Author: Du, Jia-Run, Feng, Jia-Chang, Lin, Kun-Yu, Hong, Fa-Ting, Wu, Xiao-Ming, Qi, Zhongang, Shan, Ying, and Zheng, Wei-Shi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Weakly Supervised Temporal Action Localization (WSTAL) aims to localize and classify action instances in long untrimmed videos with only video-level category labels. Due to the lack of snippet-level supervision for indicating action boundaries, previous methods typically assign pseudo labels for unlabeled snippets. However, since some action instances of different categories are visually similar, it is non-trivial to exactly label the (usually) one action category for a snippet, and incorrect pseudo labels would impair the localization performance. To address this problem, we propose a novel method from a category exclusion perspective, named Progressive Complementary Learning (ProCL), which gradually enhances the snippet-level supervision. Our method is inspired by the fact that video-level labels precisely indicate the categories that all snippets surely do not belong to, which is ignored by previous works. Accordingly, we first exclude these surely non-existent categories by a complementary learning loss. And then, we introduce the background-aware pseudo complementary labeling in order to exclude more categories for snippets of less ambiguity. Furthermore, for the remaining ambiguous snippets, we attempt to reduce the ambiguity by distinguishing foreground actions from the background. Extensive experimental results show that our method achieves new state-of-the-art performance on two popular benchmarks, namely THUMOS14 and ActivityNet1.3.
Published: 2022

25. Do we really need temporal convolutions in action segmentation?

Author: Du, Dazhao, Su, Bing, Li, Yu, Qi, Zhongang, Si, Lingyu, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Action classification has made great progress, but segmenting and recognizing actions from long untrimmed videos remains a challenging problem. Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models. Transformer-based models with adaptable and sequence modeling capabilities have recently been used in various tasks. However, the lack of inductive bias and the inefficiency of handling long video sequences limit the application of Transformer in action segmentation. In this paper, we design a pure Transformer-based model without temporal convolutions by incorporating temporal sampling, called Temporal U-Transformer (TUT). The U-Transformer architecture reduces complexity while introducing an inductive bias that adjacent frames are more likely to belong to the same class, but the introduction of coarse resolutions results in the misclassification of boundaries. We observe that the similarity distribution between a boundary frame and its neighboring frames depends on whether the boundary frame is the start or end of an action segment. Therefore, we further propose a boundary-aware loss based on the distribution of similarity scores between frames from attention modules to enhance the ability to recognize boundaries. Extensive experiments show the effectiveness of our model.
Published: 2022

26. Accelerating the Training of Video Super-Resolution Models

Author: Lin, Lijian, Wang, Xintao, Qi, Zhongang, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: Despite that convolution neural networks (CNN) have recently demonstrated high-quality reconstruction for video super-resolution (VSR), efficiently training competitive VSR models remains a challenging problem. It usually takes an order of magnitude more time than training their counterpart image models, leading to long research cycles. Existing VSR methods typically train models with fixed spatial and temporal sizes from beginning to end. The fixed sizes are usually set to large values for good performance, resulting to slow training. However, is such a rigid training strategy necessary for VSR? In this work, we show that it is possible to gradually train video models from small to large spatial/temporal sizes, i.e., in an easy-to-hard manner. In particular, the whole training is divided into several stages and the earlier stage has smaller training spatial shape. Inside each stage, the temporal size also varies from short to long while the spatial size remains unchanged. Training is accelerated by such a multigrid training strategy, as most of computation is performed on smaller spatial and shorter temporal shapes. For further acceleration with GPU parallelization, we also investigate the large minibatch training without the loss in accuracy. Extensive experiments demonstrate that our method is capable of largely speeding up training (up to $6.2\times$ speedup in wall-clock training time) without performance drop for various VSR models. The code is available at https://github.com/TencentARC/Efficient-VSR-Training., Comment: The code is available at https://github.com/TencentARC/Efficient-VSR-Training
Published: 2022

27. CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation

Author: Zhang, Ziqi, Chen, Yuxin, Ma, Zongyang, Qi, Zhongang, Yuan, Chunfeng, Li, Bing, Shan, Ying, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Previous works of video captioning aim to objectively describe the video's actual content, which lacks subjective and attractive expression, limiting its practical application scenarios. Video titling is intended to achieve this goal, but there is a lack of a proper benchmark. In this paper, we propose to CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese. CREATE consists of a high-quality labeled 210K dataset and two large-scale 3M/10M pre-training datasets, covering 51 categories, 50K+ tags, 537K manually annotated titles and captions, and 10M+ short videos. Based on CREATE, we propose a novel model ALWIG which combines video retrieval and video titling tasks to achieve the purpose of multi-modal ALignment WIth Generation with the help of video tags and a GPT pre-trained model. CREATE opens new directions for facilitating future research and applications on video titling and video retrieval in the field of Chinese short videos.
Published: 2022

28. From Heatmaps to Structural Explanations of Image Classifiers

Author: Fuxin, Li, Qi, Zhongang, Khorram, Saeed, Shitole, Vivswan, Tadepalli, Prasad, Kahng, Minsuk, and Fern, Alan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This paper summarizes our endeavors in the past few years in terms of explaining image classifiers, with the aim of including negative results and insights we have gained. The paper starts with describing the explainable neural network (XNN), which attempts to extract and visualize several high-level concepts purely from the deep network, without relying on human linguistic concepts. This helps users understand network classifications that are less intuitive and substantially improves user performance on a difficult fine-grained classification task of discriminating among different species of seagulls. Realizing that an important missing piece is a reliable heatmap visualization tool, we have developed I-GOS and iGOS++ utilizing integrated gradients to avoid local optima in heatmap generation, which improved the performance across all resolutions. During the development of those visualizations, we realized that for a significant number of images, the classifier has multiple different paths to reach a confident prediction. This has lead to our recent development of structured attention graphs (SAGs), an approach that utilizes beam search to locate multiple coarse heatmaps for a single image, and compactly visualizes a set of heatmaps by capturing how different combinations of image regions impact the confidence of a classifier. Through the research process, we have learned much about insights in building deep network explanations, the existence and frequency of multiple explanations, and various tricks of the trade that make explanations work. In this paper, we attempt to share those insights and opinions with the readers with the hope that some of them will be informative for future researchers on explainable deep learning., Comment: Submitted to Applied AI Letters
Published: 2021
Full Text: View/download PDF

29. Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution

Author: Xie, Liangbin, Wang, Xintao, Dong, Chao, Qi, Zhongang, and Shan, Ying
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Recent blind super-resolution (SR) methods typically consist of two branches, one for degradation prediction and the other for conditional restoration. However, our experiments show that a one-branch network can achieve comparable performance to the two-branch scheme. Then we wonder: how can one-branch networks automatically learn to distinguish degradations? To find the answer, we propose a new diagnostic tool -- Filter Attribution method based on Integral Gradient (FAIG). Unlike previous integral gradient methods, our FAIG aims at finding the most discriminative filters instead of input pixels/features for degradation removal in blind SR networks. With the discovered filters, we further develop a simple yet effective method to predict the degradation of an input image. Based on FAIG, we show that, in one-branch blind SR networks, 1) we are able to find a very small number of (1%) discriminative filters for each specific degradation; 2) The weights, locations and connections of the discovered filters are all important to determine the specific network function. 3) The task of degradation prediction can be implicitly realized by these discriminative filters without explicit supervised learning. Our findings can not only help us better understand network behaviors inside one-branch blind SR networks, but also provide guidance on designing more efficient architectures and diagnosing networks for blind SR., Comment: Tech report
Published: 2021

30. Stochastic Block-ADMM for Training Deep Networks

Author: Khorram, Saeed, Fu, Xiao, Danesh, Mohamad H., Qi, Zhongang, and Fuxin, Li
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In this paper, we propose Stochastic Block-ADMM as an approach to train deep neural networks in batch and online settings. Our method works by splitting neural networks into an arbitrary number of blocks and utilizes auxiliary variables to connect these blocks while optimizing with stochastic gradient descent. This allows training deep networks with non-differentiable constraints where conventional backpropagation is not applicable. An application of this is supervised feature disentangling, where our proposed DeepFacto inserts a non-negative matrix factorization (NMF) layer into the network. Since backpropagation only needs to be performed within each block, our approach alleviates vanishing gradients and provides potentials for parallelization. We prove the convergence of our proposed method and justify its capabilities through experiments in supervised and weakly-supervised settings.
Published: 2021

31. Open-book Video Captioning with Retrieve-Copy-Generate Network

Author: Zhang, Ziqi, Qi, Zhongang, Yuan, Chunfeng, Shan, Ying, Li, Bing, Deng, Ying, and Hu, Weiming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years. In this paper, we convert traditional video captioning task into a new paradigm, \ie, Open-book Video Captioning, which generates natural language under the prompts of video-content-relevant sentences, not limited to the video itself. To address the open-book video captioning problem, we propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively, and a copy-mechanism generator is introduced to extract expressions from multi-retrieved sentences dynamically. The two modules can be trained end-to-end or separately, which is flexible and extensible. Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate content of the video. Extensive experiments on several benchmark datasets show that our proposed approach surpasses the state-of-the-art performance, indicating the effectiveness and promising of the proposed paradigm in the task of video captioning., Comment: Accepted by CVPR 2021
Published: 2021

32. A Generic Object Re-identification System for Short Videos

Author: Qiu, Tairu, Chen, Guanxian, Qi, Zhongang, Li, Bin, Shan, Ying, and Xue, Xiangyang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Short video applications like TikTok and Kwai have been a great hit recently. In order to meet the increasing demands and take full advantage of visual information in short videos, objects in each short video need to be located and analyzed as an upstream task. A question is thus raised -- how to improve the accuracy and robustness of object detection, tracking, and re-identification across tons of short videos with hundreds of categories and complicated visual effects (VFX). To this end, a system composed of a detection module, a tracking module and a generic object re-identification module, is proposed in this paper, which captures features of major objects from short videos. In particular, towards the high efficiency demands in practical short video application, a Temporal Information Fusion Network (TIFN) is proposed in the object detection module, which shows comparable accuracy and improved time efficiency to the state-of-the-art video object detector. Furthermore, in order to mitigate the fragmented issue of tracklets in short videos, a Cross-Layer Pointwise Siamese Network (CPSN) is proposed in the tracking module to enhance the robustness of the appearance model. Moreover, in order to evaluate the proposed system, two challenge datasets containing real-world short videos are built for video object trajectory extraction and generic object re-identification respectively. Overall, extensive experiments for each module and the whole system demonstrate the effectiveness and efficiency of our system., Comment: 9 pages, 8 figures
Published: 2021

33. Visualizing Point Cloud Classifiers by Curvature Smoothing

Author: Ziwen, Chen, Wu, Wenxuan, Qi, Zhongang, and Fuxin, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, several networks that operate directly on point clouds have been proposed. There is significant utility in understanding their mechanisms to classify point clouds, which can potentially help diagnosing these networks and designing better architectures. In this paper, we propose a novel approach to visualize features important to the point cloud classifiers. Our approach is based on smoothing curved areas on a point cloud. After prominent features were smoothed, the resulting point cloud can be evaluated on the network to assess whether the feature is important to the classifier. A technical contribution of the paper is an approximated curvature smoothing algorithm, which can smoothly transition from the original point cloud to one of constant curvature, such as a uniform sphere. Based on the smoothing algorithm, we propose PCI-GOS (Point Cloud Integrated-Gradients Optimized Saliency), a visualization technique that can automatically find the minimal saliency map that covers the most important features on a shape. Experiment results revealed insights into different point cloud classifiers.
Published: 2019

34. Visualizing Deep Networks by Optimizing with Integrated Gradients

Author: Qi, Zhongang, Khorram, Saeed, and Fuxin, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Understanding and interpreting the decisions made by deep learning models is valuable in many domains. In computer vision, computing heatmaps from a deep network is a popular approach for visualizing and understanding deep networks. However, heatmaps that do not correlate with the network may mislead human, hence the performance of heatmaps in providing a faithful explanation to the underlying deep network is crucial. In this paper, we propose I-GOS, which optimizes for a heatmap so that the classification scores on the masked image would maximally decrease. The main novelty of the approach is to compute descent directions based on the integrated gradients instead of the normal gradient, which avoids local optima and speeds up convergence. Compared with previous approaches, our method can flexibly compute heatmaps at any resolution for different user needs. Extensive experiments on several benchmark datasets show that the heatmaps produced by our approach are more correlated with the decision of the underlying deep network, in comparison with other state-of-the-art approaches.
Published: 2019
Full Text: View/download PDF

35. Interactive Naming for Explaining Deep Neural Networks: A Formative Study

Author: Hamidi-Haines, Mandana, Qi, Zhongang, Fern, Alan, Li, Fuxin, and Tadepalli, Prasad
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: We consider the problem of explaining the decisions of deep neural networks for image recognition in terms of human-recognizable visual concepts. In particular, given a test set of images, we aim to explain each classification in terms of a small number of image regions, or activation maps, which have been associated with semantic concepts by a human annotator. This allows for generating summary views of the typical reasons for classifications, which can help build trust in a classifier and/or identify example types for which the classifier may not be trusted. For this purpose, we developed a user interface for "interactive naming," which allows a human annotator to manually cluster significant activation maps in a test set into meaningful groups called "visual concepts". The main contribution of this paper is a systematic study of the visual concepts produced by five human annotators using the interactive naming interface. In particular, we consider the adequacy of the concepts for explaining the classification of test-set images, correspondence of the concepts to activations of individual neurons, and the inter-annotator agreement of visual concepts. We find that a large fraction of the activation maps have recognizable visual concepts, and that there is significant agreement between the different annotators about their denotations. Our work is an exploratory study of the interplay between machine learning and human recognition mediated by visualizations of the results of learning.
Published: 2018

36. PointConv: Deep Convolutional Networks on 3D Point Clouds

Author: Wu, Wenxuan, Qi, Zhongang, and Fuxin, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Unlike images which are represented in regular dense grids, 3D point clouds are irregular and unordered, hence applying convolution on them can be difficult. In this paper, we extend the dynamic filter to a new convolution operation, named PointConv. PointConv can be applied on point clouds to build deep convolutional networks. We treat convolution kernels as nonlinear functions of the local coordinates of 3D points comprised of weight and density functions. With respect to a given point, the weight functions are learned with multi-layer perceptron networks and density functions through kernel density estimation. The most important contribution of this work is a novel reformulation proposed for efficiently computing the weight functions, which allowed us to dramatically scale up the network and significantly improve its performance. The learned convolution kernel can be used to compute translation-invariant and permutation-invariant convolution on any point set in the 3D space. Besides, PointConv can also be used as deconvolution operators to propagate features from a subsampled point cloud back to its original resolution. Experiments on ModelNet40, ShapeNet, and ScanNet show that deep convolutional neural networks built on PointConv are able to achieve state-of-the-art on challenging semantic segmentation benchmarks on 3D point clouds. Besides, our experiments converting CIFAR-10 into a point cloud showed that networks built on PointConv can match the performance of convolutional networks in 2D images of a similar structure., Comment: CVPR 2019
Published: 2018

37. Deep Air Learning: Interpolation, Prediction, and Feature Analysis of Fine-grained Air Quality

Author: Qi, Zhongang, Wang, Tianchun, Song, Guojie, Hu, Weisong, Li, Xi, Zhongfei, and Zhang
Subjects: Computer Science - Learning
Abstract: The interpolation, prediction, and feature analysis of fine-gained air quality are three important topics in the area of urban air computing. The solutions to these topics can provide extremely useful information to support air pollution control, and consequently generate great societal and technical impacts. Most of the existing work solves the three problems separately by different models. In this paper, we propose a general and effective approach to solve the three problems in one model called the Deep Air Learning (DAL). The main idea of DAL lies in embedding feature selection and semi-supervised learning in different layers of the deep learning network. The proposed approach utilizes the information pertaining to the unlabeled spatio-temporal data to improve the performance of the interpolation and the prediction, and performs feature selection and association analysis to reveal the main relevant features to the variation of the air quality. We evaluate our approach with extensive experiments based on real data sources obtained in Beijing, China. Experiments show that DAL is superior to the peer models from the recent literature when solving the topics of interpolation, prediction, and feature analysis of fine-gained air quality.
Published: 2017
Full Text: View/download PDF

38. Embedding Deep Networks into Visual Explanations

Author: Qi, Zhongang, Khorram, Saeed, and Li, Fuxin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In this paper, we propose a novel Explanation Neural Network (XNN) to explain the predictions made by a deep network. The XNN works by learning a nonlinear embedding of a high-dimensional activation vector of a deep network layer into a low-dimensional explanation space while retaining faithfulness i.e., the original deep learning predictions can be constructed from the few concepts extracted by our explanation network. We then visualize such concepts for human to learn about the high-level concepts that the deep network is using to make decisions. We propose an algorithm called Sparse Reconstruction Autoencoder (SRAE) for learning the embedding to the explanation space. SRAE aims to reconstruct part of the original feature space while retaining faithfulness. A pull-away term is applied to SRAE to make the bases of the explanation space more orthogonal to each other. A visualization system is then introduced for human understanding of the features in the explanation space. The proposed method is applied to explain CNN models in image classification tasks. We conducted a human study, which shows that the proposed approach outperforms single saliency map baselines, and improves human performance on a difficult classification tasks. Also, several novel metrics are introduced to evaluate the performance of explanations quantitatively without human involvement.
Published: 2017
Full Text: View/download PDF

39. RecDCL: Dual Contrastive Learning for Recommendation

Author: Zhang, Dan, primary, Geng, Yangliao, additional, Gong, Wenwen, additional, Qi, Zhongang, additional, Chen, Zhiyu, additional, Tang, Xing, additional, Shan, Ying, additional, Dong, Yuxiao, additional, and Tang, Jie, additional
Published: 2024
Full Text: View/download PDF

40. DARTScore: DuAl-Reconstruction Transformer for Video Captioning Evaluation

Author: Chen, Yuxin, primary, Zhang, Ziqi, additional, Qi, Zhongang, additional, Yuan, Chunfeng, additional, Wang, Jie, additional, Shan, Ying, additional, Li, Bing, additional, Hu, Weiming, additional, Qie, Xiaohu, additional, and Wu, Jianping, additional
Published: 2024
Full Text: View/download PDF

41. Embedding deep networks into visual explanations

Author: Qi, Zhongang, Khorram, Saeed, and Fuxin, Li
Published: 2021
Full Text: View/download PDF

42. Toward Human Perception-Centric Video Thumbnail Generation

Author: Yang, Tao, primary, Wang, Fan, additional, Lin, Junfan, additional, Qi, Zhongang, additional, Wu, Yang, additional, Xu, Jing, additional, Shan, Ying, additional, and Chen, Changwen, additional
Published: 2023
Full Text: View/download PDF

43. VTLayout: A Multi-Modal Approach for Video Text Layout

Author: Zhao, Yuxuan, primary, Ma, Jin, additional, Qi, Zhongang, additional, Xie, Zehua, additional, Luo, Yu, additional, Kang, Qiusheng, additional, and Shan, Ying, additional
Published: 2023
Full Text: View/download PDF

44. Task-Aware Dual-Representation Network for Few-Shot Action Recognition

Author: Wang, Xiao, primary, Ye, Weirong, additional, Qi, Zhongang, additional, Wang, Guangge, additional, Wu, Jianping, additional, Shan, Ying, additional, Qie, Xiaohu, additional, and Wang, Hanzi, additional
Published: 2023
Full Text: View/download PDF

45. SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

Author: Li, Xuewei, primary, Wu, Tao, additional, Qi, Zhongang, additional, Wang, Gaoang, additional, Shan, Ying, additional, and Li, Xi, additional
Published: 2023
Full Text: View/download PDF

46. Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval

Author: Chen, Yizhen, primary, Wang, Jie, additional, Lin, Lijian, additional, Qi, Zhongang, additional, Ma, Jin, additional, and Shan, Ying, additional
Published: 2023
Full Text: View/download PDF

47. Accelerating the Training of Video Super-resolution Models

Author: Lin, Lijian, primary, Wang, Xintao, additional, Qi, Zhongang, additional, and Shan, Ying, additional
Published: 2023
Full Text: View/download PDF

48. ERBNet: An Effective Representation Based Network for Unbiased Scene Graph Generation

Author: Ma, Wenxi, primary, Hou, Tianxiang, additional, Di, Qianji, additional, Qi, Zhongang, additional, Shan, Ying, additional, and Wang, Hanzi, additional
Published: 2023
Full Text: View/download PDF

49. ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

Author: Chen, Yuxin, primary, Ma, Zongyang, additional, Zhang, Ziqi, additional, Qi, Zhongang, additional, Yuan, Chunfeng, additional, Shan, Ying, additional, Li, Bing, additional, HU, Weiming, additional, Qie, Xiaohu, additional, and Wu, JianPing, additional
Published: 2023
Full Text: View/download PDF

50. LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation

Author: Zheng, Guangcong, primary, Zhou, Xianpan, additional, Li, Xuewei, additional, Qi, Zhongang, additional, Shan, Ying, additional, and Li, Xi, additional
Published: 2023
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

132 results on '"Qi, Zhongang"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources