Author: "Zhang, Xiangyu" / Topic: computer science - computer vision and pattern recognition - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhang, Xiangyu"' showing total 206 results

1. Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models

Author: Cao, Meng, Liu, Yuyang, Liu, Yingfei, Wang, Tiancai, Dong, Jiahua, Ding, Henghui, Zhang, Xiangyu, Reid, Ian, and Liang, Xiaodan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Instruction tuning constitutes a prevalent technique for tailoring Large Vision Language Models (LVLMs) to meet individual task requirements. To date, most of the existing approaches are confined to single-task adaptation, whereas the requirements in real-world scenarios are inherently varied and continually evolving. Thus an ideal LVLM should sustain continual instruction tuning in the face of stream-task distributions (i.e., different domains, emerging capabilities, and new datasets) while minimizing the forgetting of previously acquired knowledge. To achieve this, we propose a new benchmark for COntinuAl inStruction Tuning on LVLMs (COAST), which encompasses the aforementioned domain-incremental, capability-incremental, and dataset-incremental configurations. In terms of methodology, we propose Continual LLaVA, a rehearsal-free method tailored for continual instruction tuning in LVLMs. To circumvent the additional overhead associated with experience replay, we freeze LVLMs and construct the dual increment embeddings for each input instruction to facilitate parameter-efficient tuning. Specifically, the increment embeddings can be decomposed into two principal components: 1) intrinsic increment embeddings to encode task-specific characteristics. To achieve this, we set up a low-rank pool containing candidate embeddings, from which we select the relevant ones based on their similarity with the user instructions; 2) contextual increment embeddings to investigate the inter-dependencies across tasks. In this regard, the low-rank embeddings chosen in the previous tasks are aggregated via learnable weighted sum to provide complementary hints. Extensive experiments indicate that the proposed Continual LLaVA outperforms previous methods by significantly reducing the forgetting during the continual instruction tuning process.
Published: 2024

2. Reconstructive Visual Instruction Tuning

Author: Wang, Haochen, Zheng, Anlin, Zhao, Yucheng, Wang, Tiancai, Ge, Zheng, Zhang, Xiangyu, and Zhang, Zhaoxiang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
Published: 2024

3. General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model

Author: Wei, Haoran, Liu, Chenglong, Chen, Jinyue, Wang, Jia, Kong, Lingyu, Xu, Yanming, Ge, Zheng, Zhao, Liang, Sun, Jianjian, Peng, Yuang, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Traditional OCR systems (OCR-1.0) are increasingly unable to meet people's usage due to the growing demand for intelligent processing of man-made optical characters. In this paper, we collectively refer to all artificial optical signals (e.g., plain texts, math/molecular formulas, tables, charts, sheet music, and even geometric shapes) as "characters" and propose the General OCR Theory along with an excellent model, namely GOT, to promote the arrival of OCR-2.0. The GOT, with 580M parameters, is a unified, elegant, and end-to-end model, consisting of a high-compression encoder and a long-contexts decoder. As an OCR-2.0 model, GOT can handle all the above "characters" under various OCR tasks. On the input side, the model supports commonly used scene- and document-style images in slice and whole-page styles. On the output side, GOT can generate plain or formatted results (markdown/tikz/smiles/kern) via an easy prompt. Besides, the model enjoys interactive OCR features, i.e., region-level recognition guided by coordinates or colors. Furthermore, we also adapt dynamic resolution and multi-page OCR technologies to GOT for better practicality. In experiments, we provide sufficient results to prove the superiority of our model.
Published: 2024

4. Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving

Author: Wen, Yuqing, Zhao, Yucheng, Liu, Yingfei, Huang, Binyuan, Jia, Fan, Wang, Yanhui, Zhang, Chi, Wang, Tiancai, Sun, Xiaoyan, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection tasks on the nuScenes and Argoverse 2 datasets. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving.
Comment: Project page: https://panacea-ad.github.io/. arXiv admin note: text overlap with arXiv:2311.16813
Published: 2024

5. XNN: Paradigm Shift in Mitigating Identity Leakage within Cloud-Enabled Deep Learning

Author: Liu, Kaixin, Xiong, Huixin, Duan, Bingyu, Cheng, Zexuan, Zhou, Xinyu, Zhang, Wanqian, and Zhang, Xiangyu
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: In the domain of cloud-based deep learning, the imperative for external computational resources coexists with acute privacy concerns, particularly identity leakage. To address this challenge, we introduce XNN and XNN-d, pioneering methodologies that infuse neural network features with randomized perturbations, striking a harmonious balance between utility and privacy. XNN, designed for the training phase, ingeniously blends random permutation with matrix multiplication techniques to obfuscate feature maps, effectively shielding private data from potential breaches without compromising training integrity. Concurrently, XNN-d, devised for the inference phase, employs adversarial training to integrate generative adversarial noise. This technique effectively counters black-box access attacks aimed at identity extraction, while a distilled face recognition network adeptly processes the perturbed features, ensuring accurate identification. Our evaluation demonstrates XNN's effectiveness, significantly outperforming existing methods in reducing identity leakage while maintaining a high model accuracy.
Published: 2024

6. UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening

Author: Cheng, Siyuan, Shen, Guangyu, Zhang, Kaiyuan, Tao, Guanhong, An, Shengwei, Guo, Hanxi, Ma, Shiqing, and Zhang, Xiangyu
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition
Abstract: Deep neural networks (DNNs) have demonstrated effectiveness in various fields. However, DNNs are vulnerable to backdoor attacks, which inject a unique pattern, called trigger, into the input to cause misclassification to an attack-chosen target label. While existing works have proposed various methods to mitigate backdoor effects in poisoned models, they tend to be less effective against recent advanced attacks. In this paper, we introduce a novel post-training defense technique UNIT that can effectively eliminate backdoor effects for a variety of attacks. Specifically, UNIT approximates a unique and tight activation distribution for each neuron in the model. It then proactively dispels substantially large activation values that exceed the approximated boundaries. Our experimental results demonstrate that UNIT outperforms 7 popular defense methods against 14 existing backdoor attacks, including 2 advanced attacks, using only 5% of clean training data. UNIT is also cost efficient. The code is accessible at https://github.com/Megum1/UNIT.
Comment: The 18th European Conference on Computer Vision ECCV 2024
Published: 2024

7. DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

Author: Peng, Yuang, Cui, Yuxin, Tang, Haomiao, Qi, Zekun, Dong, Runpei, Bai, Jing, Han, Chunrui, Ge, Zheng, Zhang, Xiangyu, and Xia, Shu-Tao
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings.
Comment: Project page: https://dreambenchplus.github.io/
Published: 2024

8. Self-supervised Adversarial Training of Monocular Depth Estimation against Physical-World Attacks

Author: Cheng, Zhiyuan, Han, Cheng, Liang, James, Wang, Qifan, Zhang, Xiangyu, and Liu, Dongfang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Monocular Depth Estimation (MDE) plays a vital role in applications such as autonomous driving. However, various attacks target MDE models, with physical attacks posing significant threats to system security. Traditional adversarial training methods, which require ground-truth labels, are not directly applicable to MDE models that lack ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) overlook the domain knowledge of MDE, resulting in suboptimal performance. In this work, we introduce a novel self-supervised adversarial training approach for MDE models, leveraging view synthesis without the need for ground-truth depth. We enhance adversarial robustness against real-world attacks by incorporating L_0-norm-bounded perturbation during training. We evaluate our method against supervised learning-based and contrastive learning-based approaches specifically designed for MDE. Our experiments with two representative MDE networks demonstrate improved robustness against various adversarial attacks, with minimal impact on benign performance.
Comment: Accepted in TPAMI'24. Extended from our ICLR'23 publication (arXiv:2301.13487). arXiv admin note: substantial text overlap with arXiv:2301.13487
Published: 2024

9. Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving?

Author: Bai, Yifan, Wu, Dongming, Liu, Yingfei, Jia, Fan, Mao, Weixin, Zhang, Ziheng, Zhao, Yucheng, Shen, Jianbing, Wei, Xing, Wang, Tiancai, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Rapid advancements in Autonomous Driving (AD) tasks have brought a significant shift toward an end-to-end fashion, particularly in the utilization of vision-language models (VLMs) that integrate robust logical reasoning and cognitive abilities to enable comprehensive end-to-end planning. However, these VLM-based approaches tend to integrate 2D vision tokenizers and a large language model (LLM) for ego-car planning, which lack 3D geometric priors as a cornerstone of reliable planning. Naturally, this observation raises a critical concern: Can a 2D-tokenized LLM accurately perceive the 3D environment? Our evaluation of current VLM-based methods across 3D object detection, vectorized map construction, and environmental caption suggests that the answer is, unfortunately, NO. In other words, a 2D-tokenized LLM fails to provide reliable autonomous driving. In response, we introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM with a one-layer linear projector. This simple yet elegant strategy, termed Atlas, harnesses the inherent priors of the 3D physical world, enabling it to simultaneously process high-resolution multi-view images and employ spatiotemporal modeling. Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks on the nuScenes dataset, proving that a 3D-tokenized LLM is the key to reliable autonomous driving. The code and datasets will be released.
Published: 2024

10. Focus Anywhere for Fine-grained Multi-page Document Understanding

Author: Liu, Chenglong, Wei, Haoran, Chen, Jinyue, Kong, Lingyu, Ge, Zheng, Zhu, Zining, Zhao, Liang, Sun, Jianjian, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Modern LVLMs still struggle to achieve fine-grained document understanding, such as OCR/translation/caption for regions of interest to the user, tasks that require the context of the entire page, or even multiple pages. Accordingly, this paper proposes Fox, an effective pipeline, hybrid data, and tuning strategy, that catalyzes LVLMs to focus anywhere on single/multi-page documents. We introduce a novel task to boost the document understanding by making LVLMs focus attention on the document-level region, such as redefining full-page OCR as foreground focus. We employ multiple vision vocabularies to extract visual hybrid knowledge for interleaved document pages (e.g., a page containing a photo). Meanwhile, we render cross-vocabulary vision data as the catalyzer to achieve a full reaction of multiple visual vocabularies and in-document figure understanding. Further, without modifying the weights of multiple vision vocabularies, the above catalyzed fine-grained understanding capabilities can be efficiently tuned to multi-page documents, enabling the model to focus anywhere in both format-free and page-free manners. Besides, we build a benchmark including 9 fine-grained sub-tasks (e.g., region-level OCR/summary, color-guided OCR) to promote document analysis in the community. The experimental results verify the superiority of our model.
Published: 2024

11. Self-Supervised Visual Preference Alignment

Author: Zhu, Ke, Ge, Zheng, Zhao, Liang, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce the VLM to generate false but hard negative responses, which helps the model to learn from them and produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient with few lines of code. With only 8k randomly sampled unsupervised data, it achieves 90% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% score on the complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations is conducted to reveal the latent mechanism of the approach, which also indicates its potential towards further scaling. Code is available at https://github.com/Kevinz-code/SeVa.
Comment: MM2024 oral
Published: 2024

12. OneChart: Purify the Chart Structural Extraction via One Auxiliary Token

Author: Chen, Jinyue, Kong, Lingyu, Wei, Haoran, Liu, Chenglong, Ge, Zheng, Zhao, Liang, Sun, Jianjian, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Chart parsing poses a significant challenge due to the diversity of styles, values, texts, and so forth. Even advanced large vision-language models (LVLMs) with billions of parameters struggle to handle such tasks satisfactorily. To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information. Similar to popular LVLMs, OneChart incorporates an autoregressive main body. Uniquely, to enhance the reliability of the numerical parts of the output, we introduce an auxiliary token placed at the beginning of the total tokens along with an additional decoder. The numerically optimized (auxiliary) token allows subsequent tokens for chart parsing to capture enhanced numerical features through causal attention. Furthermore, with the aid of the auxiliary token, we have devised a self-evaluation mechanism that enables the model to gauge the reliability of its chart parsing results by providing confidence scores for the generated content. Compared to current state-of-the-art (SOTA) chart parsing models, e.g., DePlot, ChartVLM, ChartAst, OneChart significantly outperforms in Average Precision (AP) for chart structural extraction across multiple public benchmarks, despite enjoying only 0.2 billion parameters. Moreover, as a chart parsing agent, it also brings 10%+ accuracy gains for the popular LVLM (LLaVA-1.6) in the downstream ChartQA benchmark.
Comment: 14 pages, 9 figures and 6 tables
Published: 2024

13. BadPart: Unified Black-box Adversarial Patch Attacks against Pixel-wise Regression Tasks

Author: Cheng, Zhiyuan, Liu, Zhaoyi, Guo, Tengda, Feng, Shiwei, Liu, Dongfang, Tang, Mingjie, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Pixel-wise regression tasks (e.g., monocular depth estimation (MDE) and optical flow estimation (OFE)) have been widely involved in our daily life in applications like autonomous driving, augmented reality and video composition. Although certain applications are security-critical or bear societal significance, the adversarial robustness of such models is not sufficiently studied, especially in the black-box scenario. In this work, we introduce the first unified black-box adversarial patch attack framework against pixel-wise regression tasks, aiming to identify the vulnerabilities of these models under query-based black-box attacks. We propose a novel square-based adversarial patch optimization framework and employ probabilistic square sampling and score-based gradient estimation techniques to generate the patch effectively and efficiently, overcoming the scalability problem of previous black-box patch attacks. Our attack prototype, named BadPart, is evaluated on both MDE and OFE tasks, utilizing a total of 7 models. BadPart surpasses 3 baseline methods in terms of both attack performance and efficiency. We also apply BadPart on the Google online service for portrait depth estimation, causing 43.5% relative distance error with 50K queries. State-of-the-art (SOTA) countermeasures cannot defend against our attack effectively.
Comment: Paper accepted at ICML 2024
Published: 2024

14. SubjectDrive: Scaling Generative Data in Autonomous Driving via Subject Control

Author: Huang, Binyuan, Wen, Yuqing, Zhao, Yucheng, Hu, Yaosi, Liu, Yingfei, Jia, Fan, Mao, Weixin, Wang, Tiancai, Zhang, Chi, Chen, Chang Wen, Chen, Zhenzhong, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Autonomous driving progress relies on large-scale annotated datasets. In this work, we explore the potential of generative models to produce vast quantities of freely-labeled data for autonomous driving applications and present SubjectDrive, the first model proven to scale generative data production in a way that could continuously improve autonomous driving applications. We investigate the impact of scaling up the quantity of generative data on the performance of downstream perception models and find that enhancing data diversity plays a crucial role in effectively scaling generative data production. Therefore, we have developed a novel model equipped with a subject control mechanism, which allows the generative model to leverage diverse external data sources for producing varied and useful data. Extensive evaluations confirm SubjectDrive's efficacy in generating scalable autonomous driving training data, marking a significant step toward revolutionizing data production methods in this field.
Comment: Project page: https://subjectdrive.github.io/
Published: 2024

15. LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning

Author: Cheng, Siyuan, Tao, Guanhong, Liu, Yingqi, Shen, Guangyu, An, Shengwei, Feng, Shiwei, Xu, Xiangzhe, Zhang, Kaiyuan, Ma, Shiqing, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security
Abstract: Backdoor attacks pose a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function, such that the trigger can cause misclassification for any input. In response to this, recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent, they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience, we introduce a novel backdoor attack LOTUS. Specifically, it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore, LOTUS incorporates an effective trigger focusing mechanism, ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve a high attack success rate across 4 datasets and 7 model structures, while effectively evading 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS.
Comment: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Published: 2024

16. Small Language Model Meets with Reinforced Vision Vocabulary

Author: Wei, Haoran, Kong, Lingyu, Chen, Jinyue, Zhao, Liang, Ge, Zheng, Yu, En, Sun, Jianjian, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Playing Large Vision Language Models (LVLMs) in 2023 is trendy among the AI community. However, the relatively large number of parameters (more than 7B) of popular LVLMs makes it difficult to train and deploy on consumer GPUs, discouraging many researchers with limited resources. Imagine how cool it would be to experience all the features of current LVLMs on an old GTX1080ti (our only game card). Accordingly, we present Vary-toy in this report, a small-size Vary along with Qwen-1.8B as the base "large" language model. In Vary-toy, we introduce an improved vision vocabulary, allowing the model to not only possess all features of Vary but also gather more generality. Specifically, we replace negative samples of natural images with positive sample data driven by object detection in the procedure of generating vision vocabulary, more sufficiently utilizing the capacity of the vocabulary network and enabling it to efficiently encode visual information corresponding to natural objects. For experiments, Vary-toy can achieve 65.6% ANLS on DocVQA, 59.1% accuracy on ChartQA, 88.1% accuracy on RefCOCO, and 29% on MMVet. The code will be publicly available on the homepage.
Published: 2024

17. Slot-guided Volumetric Object Radiance Fields

Author: Qi, Di, Yang, Tong, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel framework for 3D object-centric representation learning. Our approach effectively decomposes complex scenes into individual objects from a single image in an unsupervised fashion. This method, called slot-guided Volumetric Object Radiance Fields (sVORF), composes volumetric object radiance fields with object slots as a guidance to implement unsupervised 3D scene decomposition. Specifically, sVORF obtains object slots from a single image via a transformer module, maps these slots to volumetric object radiance fields with a hypernetwork and composes object radiance fields with the guidance of object slots at a 3D location. Moreover, sVORF significantly reduces memory requirement due to small-sized pixel rendering during training. We demonstrate the effectiveness of our approach by showing top results in scene decomposition and generation tasks of complex synthetic datasets (e.g., Room-Diverse). Furthermore, we also confirm the potential of sVORF to segment objects in real-world scenes (e.g., the LLFF dataset). We hope our approach can provide preliminary understanding of the physical world and help ease future research in 3D object-centric representation learning.
Comment: NeurIPS 2023
Published: 2024

18. Bootstrap Masked Visual Modeling via Hard Patches Mining

Author: Wang, Haochen, Fan, Junsong, Wang, Yuxi, Song, Kaiyou, Wang, Tiancai, Zhang, Xiangyu, and Zhang, Zhaoxiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Masked visual modeling has attracted much attention due to its promising potential in learning generalizable representations. Typical approaches urge models to predict specific contents of masked tokens, which can be intuitively considered as teaching a student (the model) to solve given problems (predicting masked contents). Under such settings, the performance is highly correlated with mask strategies (the difficulty of provided problems). We argue that it is equally important for the model to stand in the shoes of a teacher to produce challenging problems by itself. Intuitively, patches with high values of reconstruction loss can be regarded as hard samples, and masking those hard patches naturally becomes a demanding reconstruction task. To empower the model as a teacher, we propose Hard Patches Mining (HPM), predicting patch-wise losses and subsequently determining where to mask. Technically, we introduce an auxiliary loss predictor, which is trained with a relative objective to prevent overfitting to exact loss values. Also, to gradually guide the training procedure, we propose an easy-to-hard mask strategy. Empirically, HPM brings significant improvements under both image and video benchmarks. Interestingly, solely incorporating the extra loss prediction objective leads to better representations, verifying the efficacy of determining where is hard to reconstruct. The code is available at https://github.com/Haochen-Wang409/HPM.
Comment: arXiv admin note: substantial text overlap with arXiv:2304.05919
Published: 2023

19. Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

Author: Tan, Hao, Li, Jun, Zhou, Yizhuang, Wan, Jun, Lei, Zhen, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable generalization capabilities to downstream tasks. However, existing prompt tuning based frameworks need to parallelize learnable textual inputs for all categories, suffering from massive GPU memory consumption when there is a large number of categories in the target dataset. Moreover, previous works require category names to be included within prompts, exhibiting subpar performance when dealing with ambiguous category names. To address these shortcomings, we propose Compound Text-Guided Prompt Tuning (TGP-T) that significantly reduces resource demand while achieving superior performance. We introduce text supervision to the optimization of prompts, which enables two benefits: 1) releasing the model reliance on the pre-defined category names during inference, thereby enabling more flexible prompt generation; 2) reducing the number of inputs to the text encoder, which decreases GPU memory consumption significantly. Specifically, we found that compound text supervisions, i.e., category-wise and content-wise, are highly effective, since they provide inter-class separability and capture intra-class variations, respectively. Moreover, we condition the prompt generation on visual features through a module called Bonder, which facilitates the alignment between prompts and visual features. Extensive experiments on few-shot recognition and domain generalization demonstrate that TGP-T achieves superior performance with consistently lower training costs. It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet. The code is available at https://github.com/EricTan7/TGP-T.
Comment: 13 pages, 7 figures, accepted to AAAI 2024
Published: 2023

20. Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Author: Wei, Haoran, Kong, Lingyu, Chen, Jinyue, Zhao, Liang, Ge, Zheng, Yang, Jinrong, Sun, Jianjian, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Modern Large Vision-Language Models (LVLMs) enjoy the same vision vocabulary -- CLIP, which can cover most common vision tasks. However, for some special vision tasks that need dense and fine-grained vision perception, e.g., document-level OCR or chart understanding, especially in non-English scenarios, the CLIP-style vocabulary may encounter low efficiency in tokenizing the vision knowledge and even suffer from the out-of-vocabulary problem. Accordingly, we propose Vary, an efficient and effective method to scale up the vision vocabulary of LVLMs. The procedures of Vary are naturally divided into two folds: the generation and integration of a new vision vocabulary. In the first phase, we devise a vocabulary network along with a tiny decoder-only transformer to produce the desired vocabulary via autoregression. In the next, we scale up the vanilla vision vocabulary by merging the new one with the original one (CLIP), enabling the LVLMs to quickly garner new features. Compared to the popular BLIP-2, MiniGPT4, and LLaVA, Vary can maintain its vanilla capabilities while enjoying more excellent fine-grained perception and understanding ability. Specifically, Vary is competent in new document parsing features (OCR or markdown conversion) while achieving 78.2% ANLS in DocVQA and 36.2% in MMVet. Our code will be publicly available on the homepage.
Published: 2023

21. Merlin:Empowering Multimodal LLMs with Foresight Minds

Author: Yu, En, Zhao, Liang, Wei, Yana, Yang, Jinrong, Wu, Dongming, Kong, Lingyu, Wei, Haoran, Wang, Tiancai, Ge, Zheng, Zhang, Xiangyu, and Tao, Wenbing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Humans possess the remarkable ability to foresee the future to a certain extent based on present observations, a skill we term foresight minds. However, this capability remains largely underexplored within existing Multimodal Large Language Models (MLLMs), hindering their capacity to learn the fundamental principles of how things operate and the intentions behind the observed subjects. To address this issue, we introduce the integration of future modeling into the existing learning frameworks of MLLMs. By utilizing the subject trajectory, a highly structured representation of a consecutive frame sequence, as a learning objective, we aim to bridge the gap between the past and the future. We propose two innovative methods to empower MLLMs with foresight minds, Foresight Pre-Training (FPT) and Foresight Instruction-Tuning (FIT), which are inspired by the modern learning paradigm of LLMs. Specifically, FPT jointly trains various tasks centered on trajectories, enabling MLLMs to learn how to attend to and predict entire trajectories from a given initial observation. Then, FIT requires MLLMs to first predict trajectories of related objects and then reason about potential future events based on them. Aided by FPT and FIT, we build a novel and unified MLLM named Merlin that supports multi-image input and analysis of potential actions of multiple objects for future reasoning. Experimental results show that Merlin has powerful foresight minds, with impressive performance on both future reasoning and visual comprehension tasks., Comment: Accepted by ECCV2024. Project page: https://ahnsun.github.io/merlin
Published: 2023

22. ADriver-I: A General World Model for Autonomous Driving

Author: Jia, Fan, Mao, Weixin, Liu, Yingfei, Zhao, Yucheng, Wen, Yuqing, Zhang, Chi, Zhang, Xiangyu, and Wang, Tiancai
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Typically, autonomous driving adopts a modular design, which divides the full stack into perception, prediction, planning and control parts. Though interpretable, such modular design tends to introduce a substantial amount of redundancy. Recently, multimodal large language models (MLLM) and diffusion techniques have demonstrated their superior performance on comprehension and generation ability. In this paper, we first introduce the concept of interleaved vision-action pair, which unifies the format of visual features and control signals. Based on the vision-action pairs, we construct a general world model based on MLLM and diffusion model for autonomous driving, termed ADriver-I. It takes the vision-action pairs as inputs and autoregressively predicts the control signal of the current frame. The generated control signals together with the historical vision-action pairs are further conditioned to predict the future frames. With the predicted next frame, ADriver-I performs further control signal prediction. Such a process can be repeated indefinitely, so that ADriver-I achieves autonomous driving in the world created by itself. Extensive experiments are conducted on nuScenes and our large-scale private datasets. ADriver-I shows impressive performance compared to several constructed baselines. We hope our ADriver-I can provide some new insights for future autonomous driving and embodied intelligence., Comment: Tech Report
Published: 2023

23. LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation

Author: Wu, Ruiqi, Chen, Liangyu, Yang, Tong, Guo, Chunle, Li, Chongyi, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: With the impressive progress in diffusion-based text-to-image generation, extending such powerful generative ability to text-to-video has attracted enormous attention. Existing methods either require large-scale text-video pairs and a large number of training resources or learn motions that are precisely aligned with template videos. It is non-trivial to balance a trade-off between the degree of generation freedom and the resource costs for video generation. In our study, we present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU. Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation so that our tuned video diffusion model mainly focuses on motion learning. The well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions, which highly improves video quality and generation freedom. To capture the features of temporal dimension, we expand the pretrained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers and modify the attention blocks to the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which can improve the stability of videos with computational costs. Our method can also be flexibly applied to other tasks, e.g. real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern on limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP., Comment: Project Page: https://rq-wu.github.io/projects/LAMP
Published: 2023

24. Unidirectional brain-computer interface: Artificial neural network encoding natural images to fMRI response in the visual cortex

Author: Liang, Ruixing, Zhang, Xiangyu, Li, Qiong, Wei, Lai, Liu, Hexin, Kumar, Avisha, Leadingham, Kelley M. Kempski, Punnoose, Joshua, Garcia, Leibny Paola, and Manbachi, Amir
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Quantitative Biology - Neurons and Cognition
Abstract: While significant advancements in artificial intelligence (AI) have catalyzed progress across various domains, its full potential in understanding visual perception remains underexplored. We propose an artificial neural network dubbed VISION, an acronym for "Visual Interface System for Imaging Output of Neural activity," to mimic the human brain and show how it can foster neuroscientific inquiries. Using visual and contextual inputs, this multimodal model predicts the brain's functional magnetic resonance imaging (fMRI) scan response to natural images. VISION successfully predicts human hemodynamic responses as fMRI voxel values to visual inputs with an accuracy exceeding state-of-the-art performance by 45%. We further probe the trained networks to reveal representational biases in different visual areas, generate experimentally testable hypotheses, and formulate an interpretable metric to associate these hypotheses with cortical functions. With both a model and evaluation metric, the cost and time burdens associated with designing and implementing functional analysis on the visual cortex could be reduced. Our work suggests that the evolution of computational models may shed light on our fundamental understanding of the visual cortex and provide a viable approach toward reliable brain-machine interfaces.
Published: 2023

25. DreamLLM: Synergistic Multimodal Comprehension and Creation

Author: Dong, Runpei, Han, Chunrui, Peng, Yuang, Qi, Zekun, Ge, Zheng, Yang, Jinrong, Zhao, Liang, Sun, Jianjian, Zhou, Hongyu, Wei, Haoran, Kong, Xiangwen, Zhang, Xiangyu, Ma, Kaisheng, and Yi, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: This paper presents DreamLLM, a learning framework that first achieves versatile Multimodal Large Language Models (MLLMs) empowered with frequently overlooked synergy between multimodal comprehension and creation. DreamLLM operates on two fundamental principles. The first focuses on the generative modeling of both language and image posteriors by direct sampling in the raw multimodal space. This approach circumvents the limitations and information loss inherent to external feature extractors like CLIP, and a more thorough multimodal understanding is obtained. Second, DreamLLM fosters the generation of raw, interleaved documents, modeling both text and image contents, along with unstructured layouts. This allows DreamLLM to learn all conditional, marginal, and joint multimodal distributions effectively. As a result, DreamLLM is the first MLLM capable of generating free-form interleaved content. Comprehensive experiments highlight DreamLLM's superior performance as a zero-shot multimodal generalist, reaping from the enhanced learning synergy. Project page: https://dreamllm.github.io., Comment: ICLR 2024 (Spotlight)
Published: 2023

26. Language Prompt for Autonomous Driving

Author: Wu, Dongming, Han, Wencheng, Wang, Tiancai, Liu, Yingfei, Zhang, Xiangyu, and Shen, Jianbing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: A new trend in the computer vision community is to capture objects of interest following flexible human command represented by a natural language prompt. However, the progress of using language prompts in driving scenarios is stuck in a bottleneck due to the scarcity of paired prompt-instance data. To address this challenge, we propose the first object-centric language prompt set for driving scenes within 3D, multi-view, and multi-frame space, named NuPrompt. It expands the nuScenes dataset by constructing a total of 35,367 language descriptions, each referring to an average of 5.3 object tracks. Based on the object-text pairs from the new benchmark, we formulate a new prompt-based driving task, i.e., employing a language prompt to predict the described object trajectory across views and frames. Furthermore, we provide a simple end-to-end baseline model based on Transformer, named PromptTrack. Experiments show that our PromptTrack achieves impressive performance on NuPrompt. We hope this work can provide more new insights for the autonomous driving community. Dataset and Code will be made public at https://github.com/wudongming97/Prompt4Driving.
Published: 2023

27. RevColV2: Exploring Disentangled Representations in Masked Image Modeling

Author: Han, Qi, Cai, Yuxuan, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance. Despite its success, existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning, which can hamper downstream task performance. In this paper, we propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning. The main body of RevColV2 contains bottom-up columns and top-down columns, between which information is reversibly propagated and gradually disentangled. Such a design gives our architecture a nice property: maintaining disentangled low-level and semantic information at the end of the network in MIM pre-training. Our experimental results suggest that a foundation model with decoupled features can achieve competitive performance across multiple downstream vision tasks such as image classification, semantic segmentation and object detection. For example, after intermediate fine-tuning on ImageNet-22K dataset, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K classification and 58.6 mIoU on ADE20K semantic segmentation. With extra teacher and large scale dataset, RevColV2-L achieves 62.1 box AP on COCO detection and 60.4 mIoU on ADE20K semantic segmentation. Code and models are released at https://github.com/megvii-research/RevCol
Published: 2023

28. Far3D: Expanding the Horizon for Surround-view 3D Object Detection

Author: Jiang, Xiaohui, Li, Shuailin, Liu, Yingfei, Wang, Shihao, Jia, Fan, Wang, Tiancai, Han, Lijin, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently 3D object detection from surround-view images has made notable advancements with its low deployment cost. However, most works have primarily focused on close perception range while leaving long-range detection less explored. Expanding existing methods directly to cover long distances poses challenges such as heavy computation costs and unstable convergence. To address these limitations, this paper proposes a novel sparse query-based framework, dubbed Far3D. By utilizing high-quality 2D object priors, we generate 3D adaptive queries that complement the 3D global queries. To efficiently capture discriminative features across different views and scales for long-range objects, we introduce a perspective-aware aggregation module. Additionally, we propose a range-modulated 3D denoising approach to address query error propagation and mitigate convergence issues in long-range tasks. Significantly, Far3D demonstrates SoTA performance on the challenging Argoverse 2 dataset, covering a wide range of 150 meters, surpassing several LiDAR-based approaches. Meanwhile, Far3D exhibits superior performance compared to previous methods on the nuScenes dataset. The code is available at https://github.com/megvii-research/Far3D., Comment: Accepted by AAAI-2024
Published: 2023

29. SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers

Author: Wang, Xijun, Chu, Xiaojie, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper presents a module, Spatial Cross-scale Convolution (SCSC), which is verified to be effective in improving both CNNs and Transformers. Nowadays, CNNs and Transformers have been successful in a variety of tasks. Especially for Transformers, a growing number of works achieve state-of-the-art performance in the computer vision community. Therefore, researchers have started to explore the mechanism of those architectures. Large receptive fields, sparse connections, weight sharing, and dynamic weight have been considered keys to designing effective base models. However, there are still some issues to be addressed: large dense kernels and self-attention are inefficient, and large receptive fields make it hard to capture local features. Inspired by the above analyses and to solve the mentioned problems, in this paper, we design a general module taking in these design keys to enhance both CNNs and Transformers. SCSC introduces an efficient spatial cross-scale encoder and spatial embed module to capture assorted features in one layer. On the face recognition task, FaceResNet with SCSC can improve by 2.7% with 68% fewer FLOPs and 79% fewer parameters. On the ImageNet classification task, Swin Transformer with SCSC can achieve even better performance with 22% fewer FLOPs, and ResNet with SCSC can improve by 5.3% with similar complexity. Furthermore, a traditional network (e.g., ResNet) embedded with SCSC can match Swin Transformer's performance., Comment: ICCV2023 Workshop (New Ideas in Vision Transformers)
Published: 2023

30. ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning

Author: Zhao, Liang, Yu, En, Ge, Zheng, Yang, Jinrong, Wei, Haoran, Zhou, Hongyu, Sun, Jianjian, Peng, Yuang, Dong, Runpei, Han, Chunrui, and Zhang, Xiangyu
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Human-AI interactivity is a critical aspect that reflects the usability of multimodal large language models (MLLMs). However, existing end-to-end MLLMs only allow users to interact with them through language instructions, which limits interactive accuracy and efficiency. In this study, we present precise referring instructions that utilize diverse reference representations such as points and boxes as referring prompts to refer to the special region. This enables MLLMs to focus on the region of interest and achieve finer-grained interaction. Based on precise referring instruction, we propose ChatSpot, a unified end-to-end multimodal large language model that supports diverse forms of interactivity including mouse clicks, drag-and-drop, and drawing boxes, which provides a more flexible and seamless interactive experience. We also construct a multi-grained vision-language instruction-following dataset based on existing datasets and GPT-4 generation. Furthermore, we design a series of evaluation tasks to assess the effectiveness of region recognition and interaction. Experimental results showcase ChatSpot's promising performance., Comment: 15 pages, 8 figures
Published: 2023

31. GroupLane: End-to-End 3D Lane Detection with Channel-wise Grouping

Author: Li, Zhuoling, Han, Chunrui, Ge, Zheng, Yang, Jinrong, Yu, En, Wang, Haoqian, Zhao, Hengshuang, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Efficiency is quite important for 3D lane detection due to practical deployment demand. In this work, we propose a simple, fast, and end-to-end detector that still maintains high detection precision. Specifically, we devise a set of fully convolutional heads based on row-wise classification. In contrast to previous counterparts, ours supports recognizing both vertical and horizontal lanes. Besides, our method is the first one to perform row-wise classification in bird-eye-view. In the heads, we split the features into multiple groups, and every group of features corresponds to a lane instance. During training, the predictions are associated with lane labels using the proposed single-win one-to-one matching to compute loss, and no post-processing operation is required for inference. In this way, our proposed fully convolutional detector, GroupLane, realizes end-to-end detection like DETR. Evaluated on 3 real world 3D lane benchmarks, OpenLane, Once-3DLanes, and OpenLane-Huawei, GroupLane adopting ConvNext-Base as the backbone outperforms the published state-of-the-art PersFormer by 13.6% F1 score in the OpenLane validation set. Besides, GroupLane with ResNet18 still surpasses PersFormer by 4.9% F1 score, while the inference speed is nearly 7x faster and the FLOPs are only 13.3% of PersFormer's.
Published: 2023
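
Illustrative note on record 31: the channel-wise grouping and row-wise classification described in the abstract can be pictured with a small toy example. The sketch below is not the authors' implementation; the function names are hypothetical, and averaging the channels of a group into one score map is an assumption standing in for the learned convolutional heads. Bird's-eye-view projection and the single-win one-to-one matching are left out entirely.

```python
def split_into_groups(feature, num_groups):
    """Split a [C][H][W] feature (nested lists) into num_groups channel chunks."""
    channels = len(feature)
    if channels % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = channels // num_groups
    return [feature[g * size:(g + 1) * size] for g in range(num_groups)]


def row_wise_classify(group):
    """For one channel group, pick the best-scoring column in every row.

    Assumption: the group's channels are averaged into a single H x W score
    map; a real detector would use learned heads instead.
    """
    num_channels = len(group)
    height = len(group[0])
    width = len(group[0][0])
    columns = []
    for r in range(height):
        scores = [
            sum(group[c][r][x] for c in range(num_channels)) / num_channels
            for x in range(width)
        ]
        columns.append(max(range(width), key=scores.__getitem__))
    return columns


# Toy feature: 4 channels, 2 rows, 3 columns, split into 2 lane groups.
feature = [
    [[0, 1, 0], [0, 0, 1]],
    [[0, 1, 0], [0, 0, 1]],
    [[1, 0, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 0]],
]
groups = split_into_groups(feature, num_groups=2)
lanes = [row_wise_classify(g) for g in groups]  # one column index per row, per lane
```

Each entry of `lanes` is one lane instance, given as the selected column for every row, which is the sense in which one channel group corresponds to one lane.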

32. OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation

Author: Wu, Dongming, Wang, Tiancai, Zhang, Yuang, Zhang, Xiangyu, and Shen, Jianbing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually hold that the offline pattern is necessary for RVOS, yet they model only limited temporal association within each clip. In this work, we break with the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods., Comment: Accepted by ICCV2023. The code is at https://github.com/wudongming97/OnlineRefer
Published: 2023

33. MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking

Author: Yu, En, Wang, Tiancai, Li, Zhuoling, Zhang, Yuang, Zhang, Xiangyu, and Tao, Wenbing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Although end-to-end multi-object trackers like MOTR enjoy the merits of simplicity, they suffer seriously from the conflict between detection and association, resulting in unsatisfactory convergence dynamics. While MOTRv2 partly addresses this problem, it demands an additional detection network for assistance. In this work, we are the first to reveal that this conflict arises from the unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries associate them. Based on this observation, we propose MOTRv3, which balances the label assignment process using the developed release-fetch supervision strategy. In this strategy, labels are first released for detection and gradually fetched back for association. Besides, another two strategies named pseudo label distillation and track group denoising are designed to further improve the supervision for detection and association. Without the assistance of an extra detection network during inference, MOTRv3 achieves impressive performance across diverse benchmarks, e.g., MOT17, DanceTrack.
Published: 2023

34. Fusion is Not Enough: Single Modal Attacks on Fusion Models for 3D Object Detection

Author: Cheng, Zhiyuan, Choi, Hongjun, Liang, James, Feng, Shiwei, Tao, Guanhong, Liu, Dongfang, Zuzak, Michael, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Cryptography and Security
Abstract: Multi-sensor fusion (MSF) is widely used in autonomous vehicles (AVs) for perception, particularly for 3D object detection with camera and LiDAR sensors. The purpose of fusion is to capitalize on the advantages of each modality while minimizing its weaknesses. Advanced deep neural network (DNN)-based fusion techniques have demonstrated exceptional and industry-leading performance. Due to the redundant information in multiple modalities, MSF is also recognized as a general defence strategy against adversarial attacks. In this paper, we attack fusion models from the camera modality that is considered to be of lesser importance in fusion but is more affordable for attackers. We argue that the weakest link of fusion models depends on their most vulnerable modality, and propose an attack framework that targets advanced camera-LiDAR fusion-based 3D object detection models through camera-only adversarial attacks. Our approach employs a two-stage optimization-based strategy that first thoroughly evaluates vulnerable image areas under adversarial attacks, and then applies dedicated attack strategies for different fusion models to generate deployable patches. The evaluations with six advanced camera-LiDAR fusion models and one camera-only model indicate that our attacks successfully compromise all of them. Our approach can either decrease the mean average precision (mAP) of detection performance from 0.824 to 0.353, or degrade the detection score of a target object from 0.728 to 0.156, demonstrating the efficacy of our proposed attack framework. Code is available., Comment: Accepted at ICLR'2024
Published: 2023

35. Self-supervised Learning by View Synthesis

Author: Liu, Shaoteng, Zhang, Xiangyu, Hu, Tao, and Jia, Jiaya
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We present view-synthesis autoencoders (VSA) in this paper, which is a self-supervised learning framework designed for vision transformers. Different from traditional 2D pretraining methods, VSA can be pre-trained with multi-view data. In each iteration, the input to VSA is one view (or multiple views) of a 3D object and the output is a synthesized image in another target pose. The decoder of VSA has several cross-attention blocks, which use the source view as value, source pose as key, and target pose as query. They achieve cross-attention to synthesize the target view. This simple approach realizes large-angle view synthesis and learns spatial invariant representation, where the latter is decent initialization for transformers on downstream tasks, such as 3D classification on ModelNet40, ShapeNet Core55, and ScanObjectNN. VSA outperforms existing methods significantly for linear probing and is competitive for fine-tuning. The code will be made publicly available., Comment: 13 pages, 12 figures
Published: 2023
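
Illustrative note on record 35: the abstract states that the VSA decoder uses the source view as value, the source pose as key, and the target pose as query. The sketch below shows that role assignment with plain single-head scaled dot-product attention. It is a minimal illustration under assumptions, not the authors' code: the function names, the toy vectors, and the absence of learned projections and multiple heads are all simplifications introduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(target_pose_queries, source_pose_keys, source_view_values):
    """Single-head scaled dot-product cross-attention.

    query = target pose, key = source pose, value = source view features,
    following the role assignment described in the abstract.
    """
    key_dim = len(source_pose_keys[0])
    value_dim = len(source_view_values[0])
    outputs = []
    for query in target_pose_queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(key_dim)
            for key in source_pose_keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, source_view_values))
            for j in range(value_dim)
        ])
    return outputs


# Toy data: two source tokens, each with a pose embedding and a view feature.
source_pose_keys = [[1.0, 0.0], [0.0, 1.0]]
source_view_values = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]]
# One target-pose query that resembles the first source pose.
target_pose_queries = [[5.0, 0.0]]

synthesized = cross_attention(
    target_pose_queries, source_pose_keys, source_view_values
)
```

Because the query is closest to the first source pose, the output is dominated by the first source view feature, which is the mechanism by which pose similarity decides which source content is copied into the target view.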

36. Align-DETR: Improving DETR with Simple IoU-aware BCE loss

Author: Cai, Zhi, Liu, Songtao, Wang, Guodong, Ge, Zheng, Zhang, Xiangyu, and Huang, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: DETR has set up a simple end-to-end pipeline for object detection by formulating this task as a set prediction problem, showing promising potential. However, despite the significant progress in improving DETR, this paper identifies a problem of misalignment in the output distribution, which prevents the best-regressed samples from being assigned with high confidence, hindering the model's accuracy. We propose a metric, recall of best-regressed samples, to quantitatively evaluate the misalignment problem. Observing its importance, we propose a novel Align-DETR that incorporates a localization precision-aware classification loss in optimization. The proposed loss, IA-BCE, guides the training of DETR to build a strong correlation between classification score and localization precision. We also adopt the mixed-matching strategy, to facilitate DETR-based detectors with faster training convergence while keeping an end-to-end scheme. Moreover, to overcome the dramatic decrease in sample quality induced by the sparsity of queries, we introduce a prime sample weighting mechanism to suppress the interference of unimportant samples. Extensive experiments are conducted with very competitive results reported. In particular, it delivers a 46 (+3.8)% AP on the DAB-DETR baseline with the ResNet-50 backbone and reaches a new SOTA performance of 50.2% AP in the 1x setting on the COCO validation set when employing the strong baseline DINO. Our code is available at https://github.com/FelixCaae/AlignDETR.
Published: 2023

37. Detecting Backdoors in Pre-trained Encoders

Author: Feng, Shiwei, Tao, Guanhong, Cheng, Siyuan, Shen, Guangyu, Xu, Xiangzhe, Liu, Yingqi, Zhang, Kaiyuan, Ma, Shiqing, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security
Abstract: Self-supervised learning in computer vision trains on unlabeled data, such as images or (image, text) pairs, to obtain an image encoder that learns high-quality embeddings for input data. Emerging backdoor attacks towards encoders expose crucial vulnerabilities of self-supervised learning, since downstream classifiers (even further trained on clean data) may inherit backdoor behaviors from encoders. Existing backdoor detection methods mainly focus on supervised learning settings and cannot handle pre-trained encoders especially when input labels are not available. In this paper, we propose DECREE, the first backdoor detection approach for pre-trained encoders, requiring neither classifier headers nor input labels. We evaluate DECREE on over 400 encoders trojaned under 3 paradigms. We show the effectiveness of our method on image encoders pre-trained on ImageNet and OpenAI's CLIP 400 million image-text pairs. Our method consistently has a high detection accuracy even if we have only limited or no access to the pre-training dataset., Comment: Accepted at CVPR 2023. Code is available at https://github.com/GiantSeaweed/DECREE
Published: 2023

38. Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection

Author: Wang, Shihao, Liu, Yingfei, Wang, Tiancai, Li, Ying, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a long-sequence modeling framework, named StreamPETR, for multi-view 3D object detection. Built upon the sparse query design in the PETR series, we systematically develop an object-centric temporal mechanism. The model runs in an online manner and the long-term historical information is propagated through object queries frame by frame. Besides, we introduce a motion-aware layer normalization to model the movement of the objects. StreamPETR achieves significant performance improvements with only negligible computation cost, compared to the single-frame baseline. On the standard nuScenes benchmark, it is the first online multi-view method that achieves comparable performance (67.6% NDS & 65.3% AMOTA) with lidar-based methods. The lightweight version realizes 45.0% mAP and 31.7 FPS, outperforming the state-of-the-art method (SOLOFusion) by 2.3% mAP and 1.8x faster FPS. Code is available at https://github.com/exiawsh/StreamPETR.git.
Published: 2023

39. VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Author: Chen, Yukang, Liu, Jianhui, Zhang, Xiangyu, Qi, Xiaojuan, and Jia, Jiaya
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: 3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects through voxel features entirely. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainframe detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking. Extensive experiments on nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark., Comment: In CVPR 2023, Code and models are available at https://github.com/dvlab-research/VoxelNeXt
Published: 2023

40. Exploring Recurrent Long-term Temporal Fusion for Multi-view 3D Perception

Author: Han, Chunrui, Yang, Jinrong, Sun, Jianjian, Ge, Zheng, Dong, Runpei, Zhou, Hongyu, Mao, Weixin, Peng, Yuang, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Existing methods are mostly in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon the LSS-based methods and find it already able to enjoy the merits from both sides, i.e., rich long-term information and efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusing pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80m minADE and 0.463 EPA).
Published: 2023

41. Referring Multi-Object Tracking

Author: Wu, Dongming, Han, Wencheng, Wang, Tiancai, Dong, Xingping, Zhang, Xiangyu, and Shen, Jianbing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts. The dataset and code will be available at https://github.com/wudongming97/RMOT. Comment: Accepted by CVPR 2023. The dataset and code will be available at https://github.com/wudongming97/RMOT
Published: 2023

42. Contrast with Reconstruct: Contrastive 3D Representation Learning Guided by Generative Pretraining

Author: Qi, Zekun, Dong, Runpei, Fan, Guofan, Ge, Zheng, Zhang, Xiangyu, Ma, Kaisheng, and Yi, Li
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Mainstream 3D representation learning approaches are built upon contrastive or generative modeling pretext tasks, where great improvements in performance on various downstream tasks have been achieved. However, we find these two paradigms have different characteristics: (i) contrastive models are data-hungry and suffer from a representation over-fitting issue; (ii) generative models have a data filling issue and show inferior data scaling capacity compared to contrastive models. This motivates us to learn 3D representations by sharing the merits of both paradigms, which is non-trivial due to the pattern difference between the two paradigms. In this paper, we propose Contrast with Reconstruct (ReCon), which unifies these two paradigms. ReCon is trained to learn from both generative modeling teachers and single/cross-modal contrastive teachers through ensemble distillation, where the generative student guides the contrastive student. An encoder-decoder style ReCon-block is proposed that transfers knowledge through cross attention with stop-gradient, which avoids pretraining over-fitting and pattern difference issues. ReCon achieves a new state-of-the-art in 3D representation learning, e.g., 91.26% accuracy on ScanObjectNN. Code has been released at https://github.com/qizekun/ReCon. Comment: Accepted at ICML 2023
Published: 2023

43. Adversarial Training of Self-supervised Monocular Depth Estimation against Physical-World Attacks

Author: Cheng, Zhiyuan, Liang, James, Tao, Guanhong, Liu, Dongfang, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Monocular Depth Estimation (MDE) is a critical component in applications such as autonomous driving. There are various attacks against MDE networks. These attacks, especially the physical ones, pose a great threat to the security of such systems. Traditional adversarial training methods require ground-truth labels and hence cannot be directly applied to self-supervised MDE, which does not have ground-truth depth. Some self-supervised model hardening techniques (e.g., contrastive learning) ignore the domain knowledge of MDE and can hardly achieve optimal performance. In this work, we propose a novel adversarial training method for self-supervised MDE models based on view synthesis without using ground-truth depth. We improve adversarial robustness against physical-world attacks using L0-norm-bounded perturbation in training. We compare our method with supervised learning based and contrastive learning based methods that are tailored for MDE. Results on two representative MDE networks show that we achieve better robustness against various adversarial attacks with nearly no benign performance degradation. Comment: Initially accepted at ICLR 2023 (Spotlight)
Published: 2023

44. Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Author: Yan, Junjie, Liu, Yingfei, Sun, Jianjian, Jia, Fan, Li, Shuailin, Wang, Tiancai, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. It achieves 74.1% NDS (state-of-the-art with a single model) on the nuScenes test set while maintaining fast inference speed. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code is released at https://github.com/junjie18/CMT.
Published: 2023

45. Understanding Imbalanced Semantic Segmentation Through Neural Collapse

Author: Zhong, Zhisheng, Cui, Jiequan, Yang, Yibo, Wu, Xiaoyang, Qi, Xiaojuan, Zhang, Xiangyu, and Jia, Jiaya
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: A recent study has shown a phenomenon called neural collapse, in which the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning. Comment: Technical Report
Published: 2023

46. Reversible Column Networks

Author: Cai, Yuxuan, Zhou, Yizhuang, Han, Qi, Sun, Jianjian, Kong, Xiangwen, Li, Jun, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a new neural network design paradigm, Reversible Column Network (RevCol). The main body of RevCol is composed of multiple copies of a subnetwork, named columns, between which multi-level reversible connections are employed. Such an architectural scheme gives RevCol very different behavior from conventional networks: during forward propagation, features in RevCol are learned to be gradually disentangled when passing through each column, and their total information is maintained rather than compressed or discarded as in other networks. Our experiments suggest that CNN-style RevCol models can achieve very competitive performance on multiple computer vision tasks such as image classification, object detection and semantic segmentation, especially with a large parameter budget and a large dataset. For example, after ImageNet-22K pre-training, RevCol-XL obtains 88.2% ImageNet-1K accuracy. Given more pre-training data, our largest model RevCol-H reaches 90.0% on ImageNet-1K, 63.8% APbox on the COCO detection minival set, and 61.0% mIoU on ADE20k segmentation. To our knowledge, these are the best COCO detection and ADE20k segmentation results among pure (static) CNN models. Moreover, as a general macro architecture fashion, RevCol can also be introduced into transformers or other neural networks, which is demonstrated to improve performance in both computer vision and NLP tasks. We release code and models at https://github.com/megvii-research/RevCol. Comment: Accepted by ICLR 2023
Published: 2022

47. Twin-S: A Digital Twin for Skull-base Surgery

Author: Shu, Hongchao, Liang, Ruixing, Li, Zhaoshuo, Goodridge, Anna, Zhang, Xiangyu, Ding, Hao, Nagururu, Nimesh, Sahu, Manish, Creighton, Francis X., Taylor, Russell H., Munawar, Adnan, and Unberath, Mathias
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Purpose: Digital twins are virtual interactive models of the real world, exhibiting identical behavior and properties. In surgical applications, computational analysis from digital twins can be used, for example, to enhance situational awareness. Methods: We present a digital twin framework for skull-base surgeries, named Twin-S, which can be integrated within various image-guided interventions seamlessly. Twin-S combines high-precision optical tracking and real-time simulation. We rely on rigorous calibration routines to ensure that the digital twin representation precisely mimics all real-world processes. Twin-S models and tracks the critical components of skull-base surgery, including the surgical tool, patient anatomy, and surgical camera. Significantly, Twin-S updates and reflects real-world drilling of the anatomical model at frame rate. Results: We extensively evaluate the accuracy of Twin-S, which achieves an average 1.39 mm error during the drilling process. We further illustrate how segmentation masks derived from the continuously updated digital twin can augment the surgical microscope view in a mixed reality setting, where bone requiring ablation is highlighted to provide surgeons additional situational awareness. Conclusion: We present Twin-S, a digital twin environment for skull-base surgery. Twin-S tracks and updates the virtual model in real-time given measurements from modern tracking technologies. Future research on complementing optical tracking with higher-precision vision-based approaches may further increase the accuracy of Twin-S.
Published: 2022

48. MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception

Author: Zhou, Hongyu, Ge, Zheng, Li, Zeming, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper proposes an efficient multi-camera to Bird's-Eye-View (BEV) view transformation method for 3D perception, dubbed MatrixVT. Existing view transformers either suffer from poor transformation efficiency or rely on device-specific operators, hindering the broad application of BEV models. In contrast, our method generates BEV features efficiently with only convolutions and matrix multiplications (MatMul). Specifically, we propose describing the BEV feature as the MatMul of the image feature and a sparse Feature Transporting Matrix (FTM). A Prime Extraction module is then introduced to compress the dimension of image features and reduce the FTM's sparsity. Moreover, we propose the Ring & Ray Decomposition to replace the FTM with two matrices and reformulate our pipeline to reduce calculation further. Compared to existing methods, MatrixVT enjoys a faster speed and a smaller memory footprint while remaining deploy-friendly. Extensive experiments on the nuScenes benchmark demonstrate that our method is highly efficient yet obtains results on par with the SOTA method in object detection and map segmentation tasks.
Published: 2022

49. MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors

Author: Zhang, Yuang, Wang, Tiancai, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose MOTRv2, a simple yet effective pipeline to bootstrap end-to-end multi-object tracking with a pretrained object detector. Existing end-to-end methods, MOTR and TrackFormer, are inferior to their tracking-by-detection counterparts mainly due to their poor detection performance. We aim to improve MOTR by elegantly incorporating an extra object detector. We first adopt the anchor formulation of queries and then use an extra object detector to generate proposals as anchors, providing a detection prior to MOTR. The simple modification greatly eases the conflict between jointly learning detection and association tasks in MOTR. MOTRv2 keeps the query propagation feature and scales well on large-scale benchmarks. MOTRv2 ranks 1st (73.4% HOTA on DanceTrack) in the 1st Multiple People Tracking in Group Dance Challenge. Moreover, MOTRv2 reaches state-of-the-art performance on the BDD100K dataset. We hope this simple and effective pipeline can provide some new insights to the end-to-end MOT community. Code is available at https://github.com/megvii-research/MOTRv2. Comment: Accepted by CVPR 2023
Published: 2022

50. Towards 3D Object Detection with 2D Supervision

Author: Yang, Jinrong, Wang, Tiancai, Ge, Zheng, Mao, Weixin, Li, Xiaoping, and Zhang, Xiangyu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The great progress of 3D object detectors relies on large-scale data and 3D annotations. The annotation cost for 3D bounding boxes is extremely high, while 2D ones are easier and cheaper to collect. In this paper, we introduce a hybrid training framework, enabling us to learn a visual 3D object detector with massive 2D (pseudo) labels, even without 3D annotations. To break through the information bottleneck of 2D clues, we explore a new perspective: Temporal 2D Supervision. We propose a temporal 2D transformation to bridge the 3D predictions with temporal 2D labels. Two steps, homography warping and 2D box deduction, are taken to transform the 3D predictions into 2D ones for supervision. Experiments conducted on the nuScenes dataset show strong results (nearly 90% of its fully-supervised performance) with only 25% 3D annotations. We hope our findings can provide new insights for using a large number of 2D annotations for 3D perception.
Published: 2022
