Search

Your search for keyword '"Li, Hongyang"' returned a total of 107 results.

Search Constraints

Author: "Li, Hongyang"
Publication Type: Electronic Resources

Search Results

1. TAPTR: Tracking Any Point with Transformers as Detection

2. Generalized Predictive Model for Autonomous Driving

3. SparseFusion: Efficient Sparse Multi-Modal Fusion Framework for Long-Range 3D Perception

4. FastMAC: Stochastic Spectral Sampling of Correspondence Graph

5. Enhancing Generalization in Medical Visual Question Answering Tasks via Gradient-Guided Model Perturbation

6. Embodied Understanding of Driving Scenarios

7. Translating Images to Road Network: A Non-Autoregressive Sequence-to-Sequence Approach

8. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

9. Learning Manipulation by Predicting Interaction

10. Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability

11. Characterisation of novel mobile genetic elements and their association with antibiotic resistance genes in Gram-negative bacteria

12. Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR

13. Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking

14. Introducing Depth into Transformer-based 3D Object Detection

15. Policy Pre-training for Autonomous Driving via Self-supervised Geometric Modeling

16. Scene as Occupancy

17. Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

18. Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

19. A Strong and Reproducible Object Detector with Only Public Datasets

20. OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping

21. Graph-based Topology Reasoning for Driving Scenes

22. Sparse Dense Fusion for 3D Object Detection

23. Detection Transformer with Stable Matching

24. Geometric-aware Pretraining for Vision-centric 3D Object Detection

25. 3D Data Augmentation for Driving Scenes on Camera

26. Grounded-SAM: Detect, Segment and Generate Anything

27. Visual Point Cloud Forecasting enables Scalable Autonomous Driving

28. Fully Sparse 3D Occupancy Prediction

29. LaneSegNet: Map Learning with Lane Segment Perception for Autonomous Driving

30. DriveLM: Driving with Graph Visual Question Answering

31. A Survey of Reasoning with Foundation Models

32. LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

33. Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

34. Visual In-Context Prompting

35. LLM4Drive: A Survey of Large Language Models for Autonomous Driving

36. Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

37. DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving

38. DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting

39. Density-invariant Features for Distant Point Cloud Registration

40. End-to-end Autonomous Driving: Challenges and Frontiers

41. detrex: Benchmarking Detection Transformers

42. Planning-oriented Autonomous Driving

43. BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

44. Stare at What You See: Masked Image Modeling without Reconstruction

45. DCL-Net: Deep Correspondence Learning Network for 6D Pose Estimation

46. Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

47. ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning

48. HDGT: Heterogeneous Driving Graph Transformer for Multi-Agent Trajectory Prediction via Scene Encoding

49. BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

50. PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark
