Author: "Zhou, Jie" / Topic: computer science - computer vision and pattern recognition - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhou, Jie"' showing total 288 results

Start Over Author "Zhou, Jie" Topic computer science - computer vision and pattern recognition

288 results on '"Zhou, Jie"'

1. XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation

Author: Wang, Ziyi, Wang, Yanbo, Yu, Xumin, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Existing methodologies in open vocabulary 3D semantic segmentation primarily concentrate on establishing a unified feature space encompassing 3D, 2D, and textual modalities. Nevertheless, traditional techniques such as global feature alignment or vision-language model distillation tend to impose only approximate correspondence, struggling notably with delineating fine-grained segmentation boundaries. To address this gap, we propose a more meticulous mask-level alignment between 3D features and the 2D-text embedding space through a cross-modal mask reasoning framework, XMask3D. In our approach, we developed a mask generator based on the denoising UNet from a pre-trained diffusion model, leveraging its capability for precise textual control over dense pixel representations and enhancing the open-world adaptability of the generated masks. We further integrate 3D global features as implicit conditions into the pre-trained 2D denoising UNet, enabling the generation of segmentation masks with additional 3D geometry awareness. Subsequently, the generated 2D masks are employed to align mask-level 3D representations with the vision-language feature space, thereby augmenting the open vocabulary capability of 3D geometry embeddings. Finally, we fuse complementary 2D and 3D mask features, resulting in competitive performance across multiple benchmarks for 3D open vocabulary semantic segmentation. Code is available at https://github.com/wangzy22/XMask3D., Comment: Accepted to NeurIPS 2024
Published: 2024

2. V2M: Visual 2-Dimensional Mamba for Image Representation Learning

Author: Wang, Chengkun, Zheng, Wenzhao, Huang, Yuanhui, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Mamba has garnered widespread attention due to its flexible design and efficient hardware performance to process 1D sequences based on the state space model (SSM). Recent studies have attempted to apply Mamba to the visual domain by flattening 2D images into patches and then regarding them as a 1D sequence. To compensate for the 2D structure information loss (e.g., local similarity) of the original image, most existing methods focus on designing different orders to sequentially process the tokens, which could only alleviate this issue to some extent. In this paper, we propose a Visual 2-Dimensional Mamba (V2M) model as a complete solution, which directly processes image tokens in the 2D space. We first generalize SSM to the 2-dimensional space which generates the next state considering two adjacent states on both dimensions (e.g., columns and rows). We then construct our V2M based on the 2-dimensional SSM formulation and incorporate Mamba to achieve hardware-efficient parallel processing. The proposed V2M effectively incorporates the 2D locality prior yet inherits the efficiency and input-dependent scalability of Mamba. Extensive experimental results on ImageNet classification and downstream visual tasks including object detection and instance segmentation on COCO and semantic segmentation on ADE20K demonstrate the effectiveness of our V2M compared with other visual backbones.
Published: 2024

3. GlobalMamba: Global Image Serialization for Vision Mamba

Author: Wang, Chengkun, Zheng, Wenzhao, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision mambas have demonstrated strong performance with linear complexity to the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten them into 1D sequences for causal processing, which ignore the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of our GlobalMamba, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.
Published: 2024

4. SG-Nav: Online 3D Scene Graph Prompting for LLM-based Zero-shot Object Navigation

Author: Yin, Hang, Xu, Xiuwei, Wu, Zhenyu, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: In this paper, we propose a new framework for zero-shot object navigation. Existing zero-shot object navigation methods prompt LLM with the text of spatially closed objects, which lacks enough scene context for in-depth reasoning. To better preserve the information of environment and fully exploit the reasoning ability of LLM, we propose to represent the observed scene with 3D scene graph. The scene graph encodes the relationships between objects, groups and rooms with a LLM-friendly structure, for which we design a hierarchical chain-of-thought prompt to help LLM reason the goal location according to scene context by traversing the nodes and edges. Moreover, benefit from the scene graph representation, we further design a re-perception mechanism to empower the object navigation framework with the ability to correct perception error. We conduct extensive experiments on MP3D, HM3D and RoboTHOR environments, where SG-Nav surpasses previous state-of-the-art zero-shot methods by more than 10% SR on all benchmarks, while the decision process is explainable. To the best of our knowledge, SG-Nav is the first zero-shot method that achieves even higher performance than supervised object navigation methods on the challenging MP3D benchmark., Comment: Accepted to NeurIPS 2024. Project page: https://bagh2178.github.io/SG-Nav/
Published: 2024

5. Q-VLM: Post-training Quantization for Large Vision-Language Models

Author: Wang, Changyuan, Wang, Ziwei, Xu, Xiuwei, Tang, Yansong, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a post-training quantization framework of large vision-language models (LVLMs) for efficient multi-modal inference. Conventional quantization methods sequentially search the layer-wise rounding functions by minimizing activation discretization errors, which fails to acquire optimal quantization strategy without considering cross-layer dependency. On the contrary, we mine the cross-layer dependency that significantly influences discretization errors of the entire vision-language model, and embed this dependency into optimal quantization strategy searching with low search cost. Specifically, we observe the strong correlation between the activation entropy and the cross-layer dependency concerning output discretization errors. Therefore, we employ the entropy as the proxy to partition blocks optimally, which aims to achieve satisfying trade-offs between discretization errors and the search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency for fine-grained decomposition of search space, so that the search cost is further reduced without harming the quantization accuracy. Experimental results demonstrate that our method compresses the memory by 2.78x and increase generate speed by 1.44x about 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.
Published: 2024

6. OPONeRF: One-Point-One NeRF for Robust Neural Rendering

Author: Zheng, Yu, Duan, Yueqi, Zheng, Kangfu, Yan, Hongru, Lu, Jiwen, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a One-Point-One NeRF (OPONeRF) framework for robust scene rendering. Existing NeRFs are designed based on a key assumption that the target scene remains unchanged between the training and test time. However, small but unpredictable perturbations such as object movements, light changes and data contaminations broadly exist in real-life 3D scenes, which lead to significantly defective or failed rendering results even for the recent state-of-the-art generalizable methods. To address this, we propose a divide-and-conquer framework in OPONeRF that adaptively responds to local scene variations via personalizing appropriate point-wise parameters, instead of fitting a single set of NeRF parameters that are inactive to test-time unseen changes. Moreover, to explicitly capture the local uncertainty, we decompose the point representation into deterministic mapping and probabilistic inference. In this way, OPONeRF learns the sharable invariance and unsupervisedly models the unexpected scene variations between the training and testing scenes. To validate the effectiveness of the proposed method, we construct benchmarks from both realistic and synthetic data with diverse test-time perturbations including foreground motions, illumination variations and multi-modality noises, which are more challenging than conventional generalization and temporal reconstruction benchmarks. Experimental results show that our OPONeRF outperforms state-of-the-art NeRFs on various evaluation metrics through benchmark experiments and cross-scene evaluations. We further show the efficacy of the proposed method via experimenting on other existing generalization-based benchmarks and incorporating the idea of One-Point-One NeRF into other advanced baseline methods., Comment: Project page and dataset: https://yzheng97.github.io/OPONeRF/
Published: 2024

7. MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation

Author: Chen, Wenchao, Niu, Liqiang, Lu, Ziyao, Meng, Fandong, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image generation models have encountered challenges related to scalability and quadratic complexity, primarily due to the reliance on Transformer-based backbones. In this study, we introduce MaskMamba, a novel hybrid model that combines Mamba and Transformer architectures, utilizing Masked Image Modeling for non-autoregressive image synthesis. We meticulously redesign the bidirectional Mamba architecture by implementing two key modifications: (1) replacing causal convolutions with standard convolutions to better capture global context, and (2) utilizing concatenation instead of multiplication, which significantly boosts performance while accelerating inference speed. Additionally, we explore various hybrid schemes of MaskMamba, including both serial and grouped parallel arrangements. Furthermore, we incorporate an in-context condition that allows our model to perform both class-to-image and text-to-image generation tasks. Our MaskMamba outperforms Mamba-based and Transformer-based models in generation quality. Notably, it achieves a remarkable $54.44\%$ improvement in inference speed at a resolution of $2048\times 2048$ over Transformer.
Published: 2024

8. FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner

Author: Zhao, Wenliang, Shi, Minglei, Yu, Xumin, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Building on the success of diffusion models in visual generation, flow-based models reemerge as another prominent family of generative models that have achieved competitive or better performance in terms of both visual quality and inference speed. By learning the velocity field through flow-matching, flow-based models tend to produce a straighter sampling trajectory, which is advantageous during the sampling process. However, unlike diffusion models for which fast samplers are well-developed, efficient sampling of flow-based generative models has been rarely explored. In this paper, we propose a framework called FlowTurbo to accelerate the sampling of flow-based models while still enhancing the sampling quality. Our primary observation is that the velocity predictor's outputs in the flow-based models will become stable during the sampling, enabling the estimation of velocity via a lightweight velocity refiner. Additionally, we introduce several techniques including a pseudo corrector and sample-aware compilation to further reduce inference time. Since FlowTurbo does not change the multi-step sampling paradigm, it can be effectively applied for various tasks such as image editing, inpainting, etc. By integrating FlowTurbo into different flow-based models, we obtain an acceleration ratio of 53.1%$\sim$58.3% on class-conditional generation and 29.8%$\sim$38.5% on text-to-image generation. Notably, FlowTurbo reaches an FID of 2.12 on ImageNet with 100 (ms / img) and FID of 3.93 with 38 (ms / img), achieving the real-time image generation and establishing the new state-of-the-art. Code is available at https://github.com/shiml20/FlowTurbo., Comment: Accepted to NeurIPS 2024
Published: 2024

9. AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity

Author: Lan, Zhibin, Niu, Liqiang, Meng, Fandong, Li, Wenbo, Zhou, Jie, and Su, Jinsong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Recently, when dealing with high-resolution images, dominant LMMs usually divide them into multiple local images and one global image, which will lead to a large number of visual tokens. In this work, we introduce AVG-LLaVA, an LMM that can adaptively select the appropriate visual granularity based on the input image and instruction. This approach not only reduces the number of visual tokens and speeds up inference, but also improves the overall model performance. Specifically, we introduce the following modules based on LLaVA-NeXT: (a) a visual granularity scaler that includes multiple pooling layers to obtain visual tokens with different granularities; (b) a visual granularity router, which includes a Transformer layer, an MLP layer, and a voter layer, used to select the appropriate visual granularity based on the image and instruction. Furthermore, we propose RGLF, a novel training paradigm that aims at aligning the granularity predicted by the router with the preferences of the LMM, without the need for additional manually annotated data. Extensive experiments and analysis show that AVG-LLaVA achieves superior performance across 11 benchmarks, as well as significantly reduces the number of visual tokens and speeds up inference (e.g., an 85.3% reduction in visual tokens and a 2.53$\times$ increase in inference speed on the AI2D benchmark)., Comment: Preprint
Published: 2024

10. POINTS: Improving Your Vision-language Model with Affordable Strategies

Author: Liu, Yuan, Zhao, Zhongyin, Zhuang, Ziyuan, Tian, Le, Zhou, Xiao, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community., Comment: v2
Published: 2024

11. DC-Solver: Improving Predictor-Corrector Diffusion Sampler via Dynamic Compensation

Author: Zhao, Wenliang, Wang, Haolin, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion probabilistic models (DPMs) have shown remarkable performance in visual synthesis but are computationally expensive due to the need for multiple evaluations during the sampling. Recent predictor-corrector diffusion samplers have significantly reduced the required number of function evaluations (NFE), but inherently suffer from a misalignment issue caused by the extra corrector step, especially with a large classifier-free guidance scale (CFG). In this paper, we introduce a new fast DPM sampler called DC-Solver, which leverages dynamic compensation (DC) to mitigate the misalignment of the predictor-corrector samplers. The dynamic compensation is controlled by compensation ratios that are adaptive to the sampling steps and can be optimized on only 10 datapoints by pushing the sampling trajectory toward a ground truth trajectory. We further propose a cascade polynomial regression (CPR) which can instantly predict the compensation ratios on unseen sampling configurations. Additionally, we find that the proposed dynamic compensation can also serve as a plug-and-play module to boost the performance of predictor-only samplers. Extensive experiments on both unconditional sampling and conditional sampling demonstrate that our DC-Solver can consistently improve the sampling quality over previous methods on different DPMs with a wide range of resolutions up to 1024$\times$1024. Notably, we achieve 10.38 FID (NFE=5) on unconditional FFHQ and 0.394 MSE (NFE=5, CFG=7.5) on Stable-Diffusion-2.1. Code is available at https://github.com/wl-zhao/DC-Solver, Comment: Accepted by ECCV 2024
Published: 2024

12. EmbodiedSAM: Online Segment Any 3D Thing in Real Time

Author: Xu, Xiuwei, Chen, Huangxing, Zhao, Linqing, Wang, Ziwei, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is almost infeasible. Meanwhile, vision foundation models (VFM) has revolutionized the field of 2D computer vision with superior performance, which makes the use of VFM to assist embodied 3D perception a promising direction. However, most existing VFM-assisted 3D perception methods are either offline or too slow that cannot be applied in practical embodied tasks. In this paper, we aim to leverage Segment Anything Model (SAM) for real-time 3D instance segmentation in an online setting. This is a challenging problem since future frames are not available in the input streaming RGB-D video, and an instance may be observed in several frames so object matching between frames is required. To address these challenges, we first propose a geometric-aware query lifting module to represent the 2D masks generated by SAM by 3D-aware queries, which is then iteratively refined by a dual-level query decoder. In this way, the 2D masks are transferred to fine-grained shapes on 3D point clouds. Benefit from the query representation for 3D masks, we can compute the similarity matrix between the 3D masks from different views by efficient matrix operation, which enables real-time inference. Experiments on ScanNet, ScanNet200, SceneNN and 3RScan show our method achieves leading performance even compared with offline methods. Our method also demonstrates great generalization ability in several zero-shot dataset transferring experiments and show great potential in open-vocabulary and data-efficient setting. Code and demo are available at https://xuxw98.github.io/ESAM/, with only one RTX 3090 GPU required for training and evaluation., Comment: Project page: https://xuxw98.github.io/ESAM/
Published: 2024

13. Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction

Author: Li, Wenhao, Zhou, Jie, Luo, Chuan, Tang, Chao, Zhang, Kun, and Zhao, Shixiong
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, 68T09, I.2.0
Abstract: In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices, along with the short online lifespan of scenes, give rise to the complexity of effective recommendations in online and dynamic scenes. In this work, we propose Scene-wise Adaptive Network (SwAN), a novel approach that emphasizes high-performance cold-start online recommendations for new scenes. Our approach introduces several crucial capabilities, including scene similarity learning, user-specific scene transition cognition, scene-specific information construction for the new scene, and enhancing the diverged logical information between scenes. We demonstrate SwAN's potential to optimize dynamic multi-scene recommendation problems by effectively online handling cold-start recommendations for any newly arrived scenes. More encouragingly, SwAN has been successfully deployed in Meituan's online catering recommendation service, which serves millions of customers per day, and SwAN has achieved a 5.64% CTR index improvement relative to the baselines and a 5.19% increase in daily order volume proportion., Comment: 10 pages, 6 figures, accepted by Recsys 2024
Published: 2024
Full Text: View/download PDF

14. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Author: Yao, Yuan, Yu, Tianyu, Zhang, Ao, Wang, Chongyi, Cui, Junbo, Zhu, Hongji, Cai, Tianchi, Li, Haoyu, Zhao, Weilin, He, Zhihui, Chen, Qianyu, Zhou, Huarong, Zou, Zhensheng, Zhang, Haoye, Hu, Shengding, Zheng, Zhi, Zhou, Jie, Cai, Jie, Han, Xu, Zeng, Guoyang, Li, Dahai, Liu, Zhiyuan, and Sun, Maosong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future., Comment: preprint
Published: 2024

15. UniTTA: Unified Benchmark and Versatile Framework Towards Realistic Test-Time Adaptation

Author: Du, Chaoqun, Wang, Yulin, Guo, Jiayi, Han, Yizeng, Zhou, Jie, and Huang, Gao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Test-Time Adaptation (TTA) aims to adapt pre-trained models to the target domain during testing. In reality, this adaptability can be influenced by multiple factors. Researchers have identified various challenging scenarios and developed diverse methods to address these challenges, such as dealing with continual domain shifts, mixed domains, and temporally correlated or imbalanced class distributions. Despite these efforts, a unified and comprehensive benchmark has yet to be established. To this end, we propose a Unified Test-Time Adaptation (UniTTA) benchmark, which is comprehensive and widely applicable. Each scenario within the benchmark is fully described by a Markov state transition matrix for sampling from the original dataset. The UniTTA benchmark considers both domain and class as two independent dimensions of data and addresses various combinations of imbalance/balance and i.i.d./non-i.i.d./continual conditions, covering a total of $ (2 \times 3)^2 = 36 $ scenarios. It establishes a comprehensive evaluation benchmark for realistic TTA and provides a guideline for practitioners to select the most suitable TTA method. Alongside this benchmark, we propose a versatile UniTTA framework, which includes a Balanced Domain Normalization (BDN) layer and a COrrelated Feature Adaptation (COFA) method--designed to mitigate distribution gaps in domain and class, respectively. Extensive experiments demonstrate that our UniTTA framework excels within the UniTTA benchmark and achieves state-of-the-art performance on average. Our code is available at \url{https://github.com/LeapLabTHU/UniTTA}.
Published: 2024

16. Mobius: A High Efficient Spatial-Temporal Parallel Training Paradigm for Text-to-Video Generation Task

Author: Yang, Yiran, Zhang, Jinchao, Deng, Ying, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Inspired by the success of the text-to-image (T2I) generation task, many researchers are devoting themselves to the text-to-video (T2V) generation task. Most of the T2V frameworks usually inherit from the T2I model and add extra-temporal layers of training to generate dynamic videos, which can be viewed as a fine-tuning task. However, the traditional 3D-Unet is a serial mode and the temporal layers follow the spatial layers, which will result in high GPU memory and training time consumption according to its serial feature flow. We believe that this serial mode will bring more training costs with the large diffusion model and massive datasets, which are not environmentally friendly and not suitable for the development of the T2V. Therefore, we propose a highly efficient spatial-temporal parallel training paradigm for T2V tasks, named Mobius. In our 3D-Unet, the temporal layers and spatial layers are parallel, which optimizes the feature flow and backpropagation. The Mobius will save 24% GPU memory and 12% training time, which can greatly improve the T2V fine-tuning task and provide a novel insight for the AIGC community. We will release our codes in the future.
Published: 2024

17. Camera-LiDAR Cross-modality Gait Recognition

Author: Guo, Wenxuan, Liang, Yingping, Pan, Zhiyu, Xi, Ziheng, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Gait recognition is a crucial biometric identification technique. Camera-based gait recognition has been widely applied in both research and industrial fields. LiDAR-based gait recognition has also begun to evolve most recently, due to the provision of 3D structural information. However, in certain applications, cameras fail to recognize persons, such as in low-light environments and long-distance recognition scenarios, where LiDARs work well. On the other hand, the deployment cost and complexity of LiDAR systems limit its wider application. Therefore, it is essential to consider cross-modality gait recognition between cameras and LiDARs for a broader range of applications. In this work, we propose the first cross-modality gait recognition framework between Camera and LiDAR, namely CL-Gait. It employs a two-stream network for feature embedding of both modalities. This poses a challenging recognition task due to the inherent matching between 3D and 2D data, exhibiting significant modality discrepancy. To align the feature spaces of the two modalities, i.e., camera silhouettes and LiDAR points, we propose a contrastive pre-training strategy to mitigate modality discrepancy. To make up for the absence of paired camera-LiDAR data for pre-training, we also introduce a strategy for generating data on a large scale. This strategy utilizes monocular depth estimated from single RGB images and virtual cameras to generate pseudo point clouds for contrastive pre-training. Extensive experiments show that the cross-modality gait recognition is very challenging but still contains potential and feasibility with our proposed model and pre-training strategy. To the best of our knowledge, this is the first work to address cross-modality gait recognition., Comment: Accepted at ECCV 2024
Published: 2024

18. Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Author: Yang, Chenyu, Zhu, Xizhou, Zhu, Jinguo, Su, Weijie, Wang, Junjie, Dong, Xuan, Wang, Wenhai, Lu, Lewei, Li, Bin, Zhou, Jie, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data. Despite these advances, there is no pre-training method that effectively exploits the interleaved image-text data, which is very prevalent on the Internet. Inspired by the recent success of compression learning in natural language processing, we propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data. This method performs latent compression learning by maximizing the mutual information between the inputs and outputs of a causal attention model. The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation. Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets (e.g., LAION), but can also leverage interleaved pre-training data (e.g., MMC4) to learn robust visual representation from scratch, showcasing the potential of vision model pre-training with interleaved image-text data. Code is released at https://github.com/OpenGVLab/LCL.
Published: 2024

19. Learning 1D Causal Visual Representation with De-focus Attention Networks

Author: Tao, Chenxin, Zhu, Xizhou, Su, Shiqian, Lu, Lewei, Tian, Changyao, Luo, Xuan, Huang, Gao, Li, Hongsheng, Qiao, Yu, Zhou, Jie, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Modality differences have led to the development of heterogeneous architectures for vision and language models. While images typically require 2D non-causal modeling, texts utilize 1D causal modeling. This distinction poses significant challenges in constructing unified multi-modal models. This paper explores the feasibility of representing images using 1D causal modeling. We identify an "over-focus" issue in existing 1D causal vision models, where attention overly concentrates on a small proportion of visual tokens. The issue of "over-focus" hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization. To address this, we propose De-focus Attention Networks, which employ learnable bandpass filters to create varied attention patterns. During training, large and scheduled drop path rates, and an auxiliary loss on globally pooled features for global understanding tasks are introduced. These two strategies encourage the model to attend to a broader range of tokens and enhance network optimization. Extensive experiments validate the efficacy of our approach, demonstrating that 1D causal visual representation can perform comparably to 2D non-causal representation in tasks such as global perception, dense prediction, and multi-modal understanding. Code is released at https://github.com/OpenGVLab/De-focus-Attention-Networks.
Published: 2024

20. Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion

Author: Liu, Fangfu, Wang, Hanyang, Yao, Shunyu, Zhang, Shengjun, Zhou, Jie, and Duan, Yueqi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Graphics
Abstract: In recent years, there has been rapid development in 3D generation models, opening up new possibilities for applications such as simulating the dynamic movements of 3D objects and customizing their behaviors. However, current 3D generative models tend to focus only on surface features such as color and shape, neglecting the inherent physical properties that govern the behavior of objects in the real world. To accurately simulate physics-aligned dynamics, it is essential to predict the physical properties of materials and incorporate them into the behavior prediction process. Nonetheless, predicting the diverse materials of real-world objects is still challenging due to the complex nature of their physical attributes. In this paper, we propose \textbf{Physics3D}, a novel method for learning various physical properties of 3D objects through a video diffusion model. Our approach involves designing a highly generalizable physical simulation system based on a viscoelastic material model, which enables us to simulate a wide range of materials with high-fidelity capabilities. Moreover, we distill the physical priors from a video diffusion model that contains more understanding of realistic object materials. Extensive experiments demonstrate the effectiveness of our method with both elastic and plastic materials. Physics3D shows great potential for bridging the gap between the physical world and virtual neural space, providing a better integration and application of realistic physical principles in virtual environments. Project page: https://liuff19.github.io/Physics3D., Comment: Project page: https://liuff19.github.io/Physics3D
Published: 2024

21. FlowIE: Efficient Image Enhancement via Rectified Flow

Author: Zhu, Yixuan, Zhao, Wenliang, Li, Ao, Tang, Yansong, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Image enhancement holds extensive applications in real-world scenarios due to complex environments and limitations of imaging devices. Conventional methods are often constrained by their tailored models, resulting in diminished robustness when confronted with challenging degradation conditions. In response, we propose FlowIE, a simple yet highly effective flow-based image enhancement framework that estimates straight-line paths from an elementary distribution to high-quality images. Unlike previous diffusion-based methods that suffer from long-time inference, FlowIE constructs a linear many-to-one transport mapping via conditioned rectified flow. The rectification straightens the trajectories of probability transfer, accelerating inference by an order of magnitude. This design enables our FlowIE to fully exploit rich knowledge in the pre-trained diffusion model, rendering it well-suited for various real-world applications. Moreover, we devise a faster inference algorithm, inspired by Lagrange's Mean Value Theorem, harnessing midpoint tangent direction to optimize path estimation, ultimately yielding visually superior results. Thanks to these designs, our FlowIE adeptly manages a diverse range of enhancement tasks within a concise sequence of fewer than 5 steps. Our contributions are rigorously validated through comprehensive experiments on synthetic and real-world datasets, unveiling the compelling efficacy and efficiency of our proposed FlowIE. Code is available at https://github.com/EternalEvan/FlowIE., Comment: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024 as an oral presentation
Published: 2024

22. GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Author: Huang, Yuanhui, Zheng, Wenzhao, Zhang, Yunpeng, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: 3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to unbalanced allocation of resources. To address this, we propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian represents a flexible region of interest and its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of 3D Gaussians including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which only aggregates the neighboring Gaussians for a certain position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8% - 24.8% of their memory consumption. Code is available at: https://github.com/huang-yh/GaussianFormer., Comment: Code is available at: https://github.com/huang-yh/GaussianFormer
Published: 2024

23. NeuroGauss4D-PCI: 4D Neural Fields and Gaussian Deformation Fields for Point Cloud Interpolation

Author: Jiang, Chaokang, Du, Dalong, Liu, Jiuming, Zhu, Siting, Liu, Zhenqiang, Ma, Zhuang, Liang, Zhujin, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Point Cloud Interpolation confronts challenges from point sparsity, complex spatiotemporal dynamics, and the difficulty of deriving complete 3D point clouds from sparse temporal information. This paper presents NeuroGauss4D-PCI, which excels at modeling complex non-rigid deformations across varied dynamic scenes. The method begins with an iterative Gaussian cloud soft clustering module, offering structured temporal point cloud representations. The proposed temporal radial basis function Gaussian residual utilizes Gaussian parameter interpolation over time, enabling smooth parameter transitions and capturing temporal residuals of Gaussian distributions. Additionally, a 4D Gaussian deformation field tracks the evolution of these parameters, creating continuous spatiotemporal deformation fields. A 4D neural field transforms low-dimensional spatiotemporal coordinates ($x,y,z,t$) into a high-dimensional latent space. Finally, we adaptively and efficiently fuse the latent features from neural fields and the geometric features from Gaussian deformation fields. NeuroGauss4D-PCI outperforms existing methods in point cloud frame interpolation, delivering leading performance on both object-level (DHB) and large-scale autonomous driving datasets (NL-Drive), with scalability to auto-labeling and point cloud densification tasks. The source code is released at https://github.com/jiangchaokang/NeuroGauss4D-PCI., Comment: Under review
Published: 2024

24. Rethinking Overlooked Aspects in Vision-Language Models

Author: Liu, Yuan, Tian, Le, Zhou, Xiao, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advancements in large vision-language models (LVLMs), such as GPT4-V and LLaVA, have been substantial. LLaVA's modular architecture, in particular, offers a blend of simplicity and efficiency. Recent works mainly focus on introducing more pre-training and instruction tuning data to improve model's performance. This paper delves into the often-neglected aspects of data efficiency during pre-training and the selection process for instruction tuning datasets. Our research indicates that merely increasing the size of pre-training data does not guarantee improved performance and may, in fact, lead to its degradation. Furthermore, we have established a pipeline to pinpoint the most efficient instruction tuning (SFT) dataset, implying that not all SFT data utilized in existing studies are necessary. The primary objective of this paper is not to introduce a state-of-the-art model, but rather to serve as a roadmap for future research, aiming to optimize data usage during pre-training and fine-tuning processes to enhance the performance of vision-language models.
Published: 2024

25. Joint Identity Verification and Pose Alignment for Partial Fingerprints

Author: Guan, Xiongjun, Pan, Zhiyu, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Currently, portable electronic devices are becoming more and more popular. For lightweight considerations, their fingerprint recognition modules usually use limited-size sensors. However, partial fingerprints have few matchable features, especially when there are differences in finger pressing posture or image quality, which makes partial fingerprint verification challenging. Most existing methods regard fingerprint position rectification and identity verification as independent tasks, ignoring the coupling relationship between them -- relative pose estimation typically relies on paired features as anchors, and authentication accuracy tends to improve with more precise pose alignment. In this paper, we propose a novel framework for joint identity verification and pose alignment of partial fingerprint pairs, aiming to leverage their inherent correlation to improve each other. To achieve this, we present a multi-task CNN (Convolutional Neural Network)-Transformer hybrid network, and design a pre-training task to enhance the feature extraction capability. Experiments on multiple public datasets (NIST SD14, FVC2002 DB1A & DB3A, FVC2004 DB1A & DB2A, FVC2006 DB1A) and an in-house dataset show that our method achieves state-of-the-art performance in both partial fingerprint verification and relative pose estimation, while being more efficient than previous methods.
Published: 2024

26. Latent Fingerprint Matching via Dense Minutia Descriptor

Author: Pan, Zhiyu, Duan, Yongjie, Guan, Xiongjun, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Latent fingerprint matching is a daunting task, primarily due to the poor quality of latent fingerprints. In this study, we propose a deep-learning based dense minutia descriptor (DMD) for latent fingerprint matching. A DMD is obtained by extracting the fingerprint patch aligned by its central minutia, capturing detailed minutia information and texture information. Our dense descriptor takes the form of a three-dimensional representation, with two dimensions associated with the original image plane and the other dimension representing the abstract features. Additionally, the extraction process outputs the fingerprint segmentation map, ensuring that the descriptor is only valid in the foreground region. The matching between two descriptors occurs in their overlapping regions, with a score normalization strategy to reduce the impact brought by the differences outside the valid area. Our descriptor achieves state-of-the-art performance on several latent fingerprint datasets. Overall, our DMD is more representative and interpretable compared to previous methods., Comment: accepted by IJCB 2024
Published: 2024

27. Sports Analysis and VR Viewing System Based on Player Tracking and Pose Estimation with Multimodal and Multiview Sensors

Author: Guo, Wenxuan, Pan, Zhiyu, Xi, Ziheng, Tuerxun, Alapati, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Sports analysis and viewing play a pivotal role in the current sports domain, offering significant value not only to coaches and athletes but also to fans and the media. In recent years, the rapid development of virtual reality (VR) and augmented reality (AR) technologies have introduced a new platform for watching games. Visualization of sports competitions in VR/AR represents a revolutionary technology, providing audiences with a novel immersive viewing experience. However, there is still a lack of related research in this area. In this work, we present for the first time a comprehensive system for sports competition analysis and real-time visualization on VR/AR platforms. First, we utilize multiview LiDARs and cameras to collect multimodal game data. Subsequently, we propose a framework for multi-player tracking and pose estimation based on a limited amount of supervised data, which extracts precise player positions and movements from point clouds and images. Moreover, we perform avatar modeling of players to obtain their 3D models. Ultimately, using these 3D player data, we conduct competition analysis and real-time visualization on VR/AR. Extensive quantitative experiments demonstrate the accuracy and robustness of our multi-player tracking and pose estimation framework. The visualization results showcase the immense potential of our sports visualization system on the domain of watching games on VR/AR devices. The multimodal competition dataset we collected and all related code will be released soon., Comment: arXiv admin note: text overlap with arXiv:2312.06409
Published: 2024

28. Regression of Dense Distortion Field from a Single Fingerprint Image

Author: Guan, Xiongjun, Duan, Yongjie, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Skin distortion is a long standing challenge in fingerprint matching, which causes false non-matches. Previous studies have shown that the recognition rate can be improved by estimating the distortion field from a distorted fingerprint and then rectifying it into a normal fingerprint. However, existing rectification methods are based on principal component representation of distortion fields, which is not accurate and are very sensitive to finger pose. In this paper, we propose a rectification method where a self-reference based network is utilized to directly estimate the dense distortion field of distorted fingerprint instead of its low dimensional representation. This method can output accurate distortion fields of distorted fingerprints with various finger poses and distortion patterns. We conducted experiments on FVC2004 DB1\_A, expanded Tsinghua Distorted Fingerprint database (with additional distorted fingerprints in diverse finger poses and distortion patterns) and a latent fingerprint database. Experimental results demonstrate that our proposed method achieves the state-of-the-art rectification performance in terms of distortion field estimation and rectified fingerprint matching., Comment: arXiv admin note: text overlap with arXiv:2404.17148
Published: 2024
Full Text: View/download PDF

29. Phase-aggregated Dual-branch Network for Efficient Fingerprint Dense Registration

Author: Guan, Xiongjun, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Fingerprint dense registration aims to finely align fingerprint pairs at the pixel level, thereby reducing intra-class differences caused by distortion. Unfortunately, traditional methods exhibited subpar performance when dealing with low-quality fingerprints while suffering from slow inference speed. Although deep learning based approaches shows significant improvement in these aspects, their registration accuracy is still unsatisfactory. In this paper, we propose a Phase-aggregated Dual-branch Registration Network (PDRNet) to aggregate the advantages of both types of methods. A dual-branch structure with multi-stage interactions is introduced between correlation information at high resolution and texture feature at low resolution, to perceive local fine differences while ensuring global stability. Extensive experiments are conducted on more comprehensive databases compared to previous works. Experimental results demonstrate that our method reaches the state-of-the-art registration performance in terms of accuracy and robustness, while maintaining considerable competitiveness in efficiency.
Published: 2024
Full Text: View/download PDF

30. Pose-Specific 3D Fingerprint Unfolding

Author: Guan, Xiongjun, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In order to make 3D fingerprints compatible with traditional 2D flat fingerprints, a common practice is to unfold the 3D fingerprint into a 2D rolled fingerprint, which is then matched with the flat fingerprints by traditional 2D fingerprint recognition algorithms. The problem with this method is that there may be large elastic deformation between the unfolded rolled fingerprint and flat fingerprint, which affects the recognition rate. In this paper, we propose a pose-specific 3D fingerprint unfolding algorithm to unfold the 3D fingerprint using the same pose as the flat fingerprint. Our experiments show that the proposed unfolding algorithm improves the compatibility between 3D fingerprint and flat fingerprint and thus leads to higher genuine matching scores.
Published: 2024
Full Text: View/download PDF

31. Direct Regression of Distortion Field from a Single Fingerprint Image

Author: Guan, Xiongjun, Duan, Yongjie, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Skin distortion is a long standing challenge in fingerprint matching, which causes false non-matches. Previous studies have shown that the recognition rate can be improved by estimating the distortion field from a distorted fingerprint and then rectifying it into a normal fingerprint. However, existing rectification methods are based on principal component representation of distortion fields, which is not accurate and are very sensitive to finger pose. In this paper, we propose a rectification method where a self-reference based network is utilized to directly estimate the dense distortion field of distorted fingerprint instead of its low dimensional representation. This method can output accurate distortion fields of distorted fingerprints with various finger poses. Considering the limited number and variety of distorted fingerprints in the existing public dataset, we collected more distorted fingerprints with diverse finger poses and distortion patterns as a new database. Experimental results demonstrate that our proposed method achieves the state-of-the-art rectification performance in terms of distortion field estimation and rectified fingerprint matching.
Published: 2024
Full Text: View/download PDF

32. NTIRE 2024 Quality Assessment of AI-Generated Content Challenge

Author: Liu, Xiaohong, Min, Xiongkuo, Zhai, Guangtao, Li, Chunyi, Kou, Tengchuan, Sun, Wei, Wu, Haoning, Gao, Yixuan, Cao, Yuqin, Zhang, Zicheng, Wu, Xiele, Timofte, Radu, Peng, Fei, Fu, Huiyuan, Ming, Anlong, Wang, Chuanming, Ma, Huadong, He, Shuai, Dou, Zifei, Chen, Shu, Zhang, Huacong, Xie, Haiyi, Wang, Chengwei, Chen, Baoying, Zeng, Jishen, Yang, Jianquan, Wang, Weigang, Fang, Xi, Lv, Xiaoxin, Yan, Jun, Zhi, Tianwu, Zhang, Yabin, Li, Yaohui, Li, Yang, Xu, Jingwen, Liu, Jianzhao, Liao, Yiting, Li, Junlin, Yu, Zihao, Lu, Yiting, Li, Xin, Motamednia, Hossein, Hosseini-Benvidi, S. Farhad, Guan, Fengbin, Mahmoudi-Aznaveh, Ahmad, Mansouri, Azadeh, Gankhuyag, Ganzorig, Yoon, Kihwan, Xu, Yifang, Fan, Haotian, Kong, Fangyuan, Zhao, Shiling, Dong, Weifeng, Yin, Haibing, Zhu, Li, Wang, Zhiling, Huang, Bingchen, Saha, Avinab, Mishra, Sandeep, Gupta, Shashank, Sureddi, Rajesh, Saha, Oindrila, Celona, Luigi, Bianco, Simone, Napoletano, Paolo, Schettini, Raimondo, Yang, Junfeng, Fu, Jing, Zhang, Wei, Cao, Wenzhi, Liu, Limei, Peng, Han, Yuan, Weijun, Li, Zhan, Cheng, Yihang, Deng, Yifan, Li, Haohui, Qu, Bowen, Li, Yao, Luo, Shuqing, Wang, Shunzhou, Gao, Wei, Lu, Zihao, Conde, Marcos V., Wang, Xinrui, Chen, Zhibo, Liao, Ruling, Ye, Yan, Wang, Qiulin, Li, Bing, Zhou, Zhaokun, Geng, Miao, Chen, Rui, Tao, Xin, Liang, Xiaoyu, Sun, Shangkun, Ma, Xingyuan, Li, Jiaze, Yang, Mengduo, Xu, Haoran, Zhou, Jie, Zhu, Shiding, Yu, Bohan, Chen, Pengfei, Xu, Xinrui, Shen, Jiabin, Duan, Zhichao, Asadi, Erfan, Liu, Jiahe, Yan, Qi, Qu, Youran, Zeng, Xiaohui, Wang, Lele, and Liao, Renjie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which will be held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. This challenge is to address a major challenge in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into the image track and the video track. The image track uses the AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track has a total of 318 registered participants. A total of 1,646 submissions are received in the development phase, and 221 submissions are received in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants have registered in the video track. A total of 991 submissions are received in the development phase, and 185 submissions are received in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods have achieved better results than baseline methods, and the winning methods in both tracks have demonstrated superior prediction performance on AIGC.
Published: 2024

33. CodeEnhance: A Codebook-Driven Approach for Low-Light Image Enhancement

Author: Wu, Xu, Hou, XianXu, Lai, Zhihui, Zhou, Jie, Zhang, Ya-nan, Pedrycz, Witold, and Shen, Linlin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Low-light image enhancement (LLIE) aims to improve low-illumination images. However, existing methods face two challenges: (1) uncertainty in restoration from diverse brightness degradations; (2) loss of texture and color information caused by noise suppression and light enhancement. In this paper, we propose a novel enhancement approach, CodeEnhance, by leveraging quantized priors and image refinement to address these challenges. In particular, we reframe LLIE as learning an image-to-code mapping from low-light images to discrete codebook, which has been learned from high-quality images. To enhance this process, a Semantic Embedding Module (SEM) is introduced to integrate semantic information with low-level features, and a Codebook Shift (CS) mechanism, designed to adapt the pre-learned codebook to better suit the distinct characteristics of our low-light dataset. Additionally, we present an Interactive Feature Transformation (IFT) module to refine texture and color information during image reconstruction, allowing for interactive enhancement based on user preferences. Extensive experiments on both real-world and synthetic benchmarks demonstrate that the incorporation of prior knowledge and controllable information transfer significantly enhances LLIE performance in terms of quality and fidelity. The proposed CodeEnhance exhibits superior robustness to various degradations, including uneven illumination, noise, and color distortion., Comment: 10 pages, 13 figures
Published: 2024

34. LOGO: A Long-Form Video Dataset for Group Action Quality Assessment

Author: Zhang, Shiyi, Dai, Wenxun, Wang, Sujia, Shen, Xiangwei, Lu, Jiwen, Zhou, Jie, and Tang, Yansong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Action quality assessment (AQA) has become an emerging topic since it can be extensively applied in numerous scenarios. However, most existing methods and datasets focus on single-person short-sequence scenes, hindering the application of AQA in more complex situations. To address this issue, we construct a new multi-person long-form video dataset for action quality assessment named LOGO. Distinguished in scenario complexity, our dataset contains 200 videos from 26 artistic swimming events with 8 athletes in each sample along with an average duration of 204.2 seconds. As for richness in annotations, LOGO includes formation labels to depict group information of multiple athletes and detailed annotations on action procedures. Furthermore, we propose a simple yet effective method to model relations among athletes and reason about the potential temporal logic in long-form videos. Specifically, we design a group-aware attention module, which can be easily plugged into existing AQA methods, to enrich the clip-wise representations based on contextual group information. To benchmark LOGO, we systematically conduct investigations on the performance of several popular methods in AQA and action segmentation. The results reveal the challenges our dataset brings. Extensive experiments also show that our approach achieves state-of-the-art on the LOGO dataset. The dataset and code will be released at \url{https://github.com/shiyi-zh0408/LOGO }., Comment: Accepted by CVPR 2023
Published: 2024

35. DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery

Author: Zhu, Yixuan, Li, Ao, Tang, Yansong, Zhao, Wenliang, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The recovery of occluded human meshes presents challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper, we introduce DPMesh, an innovative framework for occluded human mesh recovery that capitalizes on the profound diffusion prior about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction, DPMesh seamlessly integrates the pre-trained denoising U-Net with potent knowledge as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses, DPMesh incorporates well-designed guidance via condition injection, which produces effective controls from 2D observations for the denoising U-Net. Furthermore, we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior, thereby enhancing accuracy. Extensive experiments affirm the efficacy of our framework, as we outperform state-of-the-art methods on both occlusion-specific and standard datasets. The persuasive results underscore its ability to achieve precise and robust 3D human mesh recovery, particularly in challenging scenarios involving occlusion and crowded scenes., Comment: Accepted by IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024
Published: 2024

36. Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models

Author: Liu, Zuyan, Dong, Yuhao, Rao, Yongming, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In the realm of vision-language understanding, the proficiency of models in interpreting and reasoning over visual content has become a cornerstone for numerous applications. However, it is challenging for the visual encoder in Large Vision-Language Models (LVLMs) to extract useful features tailored to questions that aid the language model's response. Furthermore, a common practice among existing LVLMs is to utilize lower-resolution images, which restricts the ability for visual recognition. Our work introduces the Chain-of-Spot (CoS) method, which we describe as Interactive Reasoning, a novel approach that enhances feature extraction by focusing on key regions of interest (ROI) within the image, corresponding to the posed questions or instructions. This technique allows LVLMs to access more detailed visual information without altering the original image resolution, thereby offering multi-granularity image features. By integrating Chain-of-Spot with instruct-following LLaVA-1.5 models, the process of image reasoning consistently improves performance across a wide range of multimodal datasets and benchmarks without bells and whistles and achieves new state-of-the-art results. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content, paving the way for more sophisticated visual instruction-following applications. Code and models are available at https://github.com/dongyh20/Chain-of-Spot, Comment: Project Page: https://sites.google.com/view/chain-of-spot/
Published: 2024

37. RCdpia: A Renal Carcinoma Digital Pathology Image Annotation dataset based on pathologists

Author: Sun, Qingrong, Zhong, Weixiang, Zhou, Jie, Lai, Chong, Teng, Xiaodong, and Lai, Maode
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The annotation of digital pathological slide data for renal cell carcinoma is of paramount importance for correct diagnosis of artificial intelligence models due to the heterogeneous nature of the tumor. This process not only facilitates a deeper understanding of renal cell cancer heterogeneity but also aims to minimize noise in the data for more accurate studies. To enhance the applicability of the data, two pathologists were enlisted to meticulously curate, screen, and label a kidney cancer pathology image dataset from The Cancer Genome Atlas Program (TCGA) database. Subsequently, a Resnet model was developed to validate the annotated dataset against an additional dataset from the First Affiliated Hospital of Zhejiang University. Based on these results, we have meticulously compiled the TCGA digital pathological dataset with independent labeling of tumor regions and adjacent areas (RCdpia), which includes 109 cases of kidney chromophobe cell carcinoma, 486 cases of kidney clear cell carcinoma, and 292 cases of kidney papillary cell carcinoma. This dataset is now publicly accessible at http://39.171.241.18:8888/RCdpia/. Furthermore, model analysis has revealed significant discrepancies in predictive outcomes when applying the same model to datasets from different centers. Leveraging the RCdpia, we can now develop more precise digital pathology artificial intelligence models for tasks such as normalization, classification, and segmentation. These advancements underscore the potential for more nuanced and accurate AI applications in the field of digital pathology., Comment: 8 pages, 3 figures, 1 table
Published: 2024

38. Learning Dual-Level Deformable Implicit Representation for Real-World Scale Arbitrary Super-Resolution

Author: Li, Zhiheng, Li, Muheng, Fan, Jixuan, Chen, Lei, Tang, Yansong, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Scale arbitrary super-resolution based on implicit image function gains increasing popularity since it can better represent the visual world in a continuous manner. However, existing scale arbitrary works are trained and evaluated on simulated datasets, where low-resolution images are generated from their ground truths by the simplest bicubic downsampling. These models exhibit limited generalization to real-world scenarios due to the greater complexity of real-world degradations. To address this issue, we build a RealArbiSR dataset, a new real-world super-resolution benchmark with both integer and non-integer scaling factors for the training and evaluation of real-world scale arbitrary super-resolution. Moreover, we propose a Dual-level Deformable Implicit Representation (DDIR) to solve real-world scale arbitrary super-resolution. Specifically, we design the appearance embedding and deformation field to handle both image-level and pixel-level deformations caused by real-world degradations. The appearance embedding models the characteristics of low-resolution inputs to deal with photometric variations at different scales, and the pixel-based deformation field learns RGB differences which result from the deviations between the real-world and simulated degradations at arbitrary coordinates. Extensive experiments show our trained model achieves state-of-the-art performance on the RealArbiSR and RealSR benchmarks for real-world scale arbitrary super-resolution. Our dataset as well as source code will be publicly available.
Published: 2024

39. Memory-based Adapters for Online 3D Scene Perception

Author: Xu, Xiuwei, Xia, Chong, Wang, Ziwei, Zhao, Linqing, Duan, Yueqi, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we propose a new framework for online 3D scene perception. Conventional 3D scene perception methods are offline, i.e., take an already reconstructed 3D scene geometry as input, which is not applicable in robotic applications where the input data is streaming RGB-D videos rather than a complete 3D scene reconstructed from pre-collected RGB-D videos. To deal with online 3D scene perception tasks where data collection and perception should be performed simultaneously, the model should be able to process 3D scenes frame by frame and make use of the temporal information. To this end, we propose an adapter-based plug-and-play module for the backbone of 3D scene perception model, which constructs memory to cache and aggregate the extracted RGB-D features to empower offline models with temporal learning ability. Specifically, we propose a queued memory mechanism to cache the supporting point cloud and image features. Then we devise aggregation modules which directly perform on the memory and pass temporal information to current frame. We further propose 3D-to-2D adapter to enhance image features with strong global context. Our adapters can be easily inserted into mainstream offline architectures of different tasks and significantly boost their performance on online tasks. Extensive experiments on ScanNet and SceneNN datasets demonstrate our approach achieves leading performance on three 3D scene perception tasks compared with state-of-the-art online methods by simply finetuning existing offline models, without any model and task-specific designs. \href{https://xuxw98.github.io/Online3D/}{Project page}., Comment: Accepted to CVPR24. Link: https://xuxw98.github.io/Online3D/
Published: 2024

40. 3D Vascular Segmentation Supervised by 2D Annotation of Maximum Intensity Projection

Author: Guo, Zhanqiang, Tan, Zimeng, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vascular structure segmentation plays a crucial role in medical analysis and clinical applications. The practical adoption of fully supervised segmentation models is impeded by the intricacy and time-consuming nature of annotating vessels in the 3D space. This has spurred the exploration of weakly-supervised approaches that reduce reliance on expensive segmentation annotations. Despite this, existing weakly supervised methods employed in organ segmentation, which encompass points, bounding boxes, or graffiti, have exhibited suboptimal performance when handling sparse vascular structure. To alleviate this issue, we employ maximum intensity projection (MIP) to decrease the dimensionality of 3D volume to 2D image for efficient annotation, and the 2D labels are utilized to provide guidance and oversight for training 3D vessel segmentation model. Initially, we generate pseudo-labels for 3D blood vessels using the annotations of 2D projections. Subsequently, taking into account the acquisition method of the 2D labels, we introduce a weakly-supervised network that fuses 2D-3D deep features via MIP to further improve segmentation performance. Furthermore, we integrate confidence learning and uncertainty estimation to refine the generated pseudo-labels, followed by fine-tuning the segmentation network. Our method is validated on five datasets (including cerebral vessel, aorta and coronary artery), demonstrating highly competitive performance in segmenting vessels and the potential to significantly reduce the time and effort required for vessel annotation. Our code is available at: https://github.com/gzq17/Weakly-Supervised-by-MIP.
Published: 2024

41. MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

Author: Tian, Changyao, Zhu, Xizhou, Xiong, Yuwen, Wang, Weiyun, Chen, Zhe, Wang, Wenhai, Chen, Yuntao, Lu, Lewei, Lu, Tong, Zhou, Jie, Li, Hongsheng, Qiao, Yu, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}., Comment: 20 pages, 9 figures, 17 tables
Published: 2024

42. Path Choice Matters for Clear Attribution in Path Methods

Author: Zhang, Borui, Zheng, Wenzhao, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Rigorousness and clarity are both essential for interpretations of DNNs to engender human trust. Path methods are commonly employed to generate rigorous attributions that satisfy three axioms. However, the meaning of attributions remains ambiguous due to distinct path choices. To address the ambiguity, we introduce \textbf{Concentration Principle}, which centrally allocates high attributions to indispensable features, thereby endowing aesthetic and sparsity. We then present \textbf{SAMP}, a model-agnostic interpreter, which efficiently searches the near-optimal path from a pre-defined set of manipulation paths. Moreover, we propose the infinitesimal constraint (IC) and momentum strategy (MS) to improve the rigorousness and optimality. Visualizations show that SAMP can precisely reveal DNNs by pinpointing salient image pixels. We also perform quantitative experiments and observe that our method significantly outperforms the counterparts. Code: https://github.com/zbr17/SAMP., Comment: ICLR 2024 accepted
Published: 2024

43. Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Author: Xiong, Yuwen, Li, Zhiqi, Chen, Yuntao, Wang, Feng, Zhu, Xizhou, Luo, Jiapeng, Wang, Wenhai, Lu, Tong, Li, Hongsheng, Qiao, Yu, Lu, Lewei, Zhou, Jie, and Dai, Jifeng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase in processing speed, with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks, including image classification, instance and semantic segmentation, and notably, image generation. When integrated into generative models like U-Net in the latent diffusion model, DCNv4 outperforms its baseline, underscoring its possibility to enhance generative models. In practical applications, replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage results in up to 80% speed increase and further performance improvement without further modifications. The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models., Comment: Tech report; Code: https://github.com/OpenGVLab/DCNv4
Published: 2024

44. Domain Similarity-Perceived Label Assignment for Domain Generalized Underwater Object Detection

Author: Li, Xisheng, Li, Wei, Song, Pinhao, Zhang, Mingjun, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The inherent characteristics and light fluctuations of water bodies give rise to the huge difference between different layers and regions in underwater environments. When the test set is collected in a different marine area from the training set, the issue of domain shift emerges, significantly compromising the model's ability to generalize. The Domain Adversarial Learning (DAL) training strategy has been previously utilized to tackle such challenges. However, DAL heavily depends on manually one-hot domain labels, which implies no difference among the samples in the same domain. Such an assumption results in the instability of DAL. This paper introduces the concept of Domain Similarity-Perceived Label Assignment (DSP). The domain label for each image is regarded as its similarity to the specified domains. Through domain-specific data augmentation techniques, we achieved state-of-the-art results on the underwater cross-domain object detection benchmark S-UODAC2020. Furthermore, we validated the effectiveness of our method in the Cityscapes dataset., Comment: 10 pages,7 figures
Published: 2023

45. Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft

Author: Li, Hao, Yang, Xue, Wang, Zhaokai, Zhu, Xizhou, Zhou, Jie, Qiao, Yu, Wang, Xiaogang, Li, Hongsheng, Lu, Lewei, and Dai, Jifeng
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Many reinforcement learning environments (e.g., Minecraft) provide only sparse rewards that indicate task completion or failure with binary values. The challenge in exploration efficiency in such environments makes it difficult for reinforcement-learning-based agents to learn complex tasks. To address this, this paper introduces an advanced learning system, named Auto MC-Reward, that leverages Large Language Models (LLMs) to automatically design dense reward functions, thereby enhancing the learning efficiency. Auto MC-Reward consists of three important components: Reward Designer, Reward Critic, and Trajectory Analyzer. Given the environment information and task descriptions, the Reward Designer first design the reward function by coding an executable Python function with predefined observation inputs. Then, our Reward Critic will be responsible for verifying the code, checking whether the code is self-consistent and free of syntax and semantic errors. Further, the Trajectory Analyzer summarizes possible failure causes and provides refinement suggestions according to collected trajectories. In the next round, Reward Designer will further refine and iterate the dense reward function based on feedback. Experiments demonstrate a significant improvement in the success rate and learning efficiency of our agents in complex tasks in Minecraft, such as obtaining diamond with the efficient ability to avoid lava, and efficiently explore trees and animals that are sparse in the plains biome., Comment: Accepted by CVPR2024
Published: 2023

46. LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-frame 3D Human Pose Estimation

Author: Pan, Zhiyu, Zhong, Zhicheng, Guo, Wenxuan, Chen, Yifan, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Several methods have been proposed to estimate 3D human pose from multi-view images, achieving satisfactory performance on public datasets collected under relatively simple conditions. However, there are limited approaches studying extracting 3D human skeletons from multimodal inputs, such as RGB and point cloud data. To address this gap, we introduce LiCamPose, a pipeline that integrates multi-view RGB and sparse point cloud information to estimate robust 3D human poses via single frame. We demonstrate the effectiveness of the volumetric architecture in combining these modalities. Furthermore, to circumvent the need for manually labeled 3D human pose annotations, we develop a synthetic dataset generator for pretraining and design an unsupervised domain adaptation strategy to train a 3D human pose estimator without manual annotations. To validate the generalization capability of our method, LiCamPose is evaluated on four datasets, including two public datasets, one synthetic dataset, and one challenging self-collected dataset named BasketBall, covering diverse scenarios. The results demonstrate that LiCamPose exhibits great generalization performance and significant application potential. The code, generator, and datasets will be made available upon acceptance of this paper.
Published: 2023

47. HumanReg: Self-supervised Non-rigid Registration of Human Point Cloud

Author: Chen, Yifan, Pan, Zhiyu, Zhong, Zhicheng, Guo, Wenxuan, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we present a novel registration framework, HumanReg, that learns a non-rigid transformation between two human point clouds end-to-end. We introduce body prior into the registration process to efficiently handle this type of point cloud. Unlike most exsisting supervised registration techniques that require expensive point-wise flow annotations, HumanReg can be trained in a self-supervised manner benefiting from a set of novel loss functions. To make our model better converge on real-world data, we also propose a pretraining strategy, and a synthetic dataset (HumanSyn4D) consists of dynamic, sparse human point clouds and their auto-generated ground truth annotations. Our experiments shows that HumanReg achieves state-of-the-art performance on CAPE-512 dataset and gains a qualitative result on another more challenging real-world dataset. Furthermore, our ablation studies demonstrate the effectiveness of our synthetic dataset and novel loss functions. Our code and synthetic dataset is available at https://github.com/chenyifanthu/HumanReg.
Published: 2023

48. LiDAR-based Person Re-identification

Author: Guo, Wenxuan, Pan, Zhiyu, Liang, Yingping, Xi, Ziheng, Zhong, Zhi Chen, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Camera-based person re-identification (ReID) systems have been widely applied in the field of public security. However, cameras often lack the perception of 3D morphological information of human and are susceptible to various limitations, such as inadequate illumination, complex background, and personal privacy. In this paper, we propose a LiDAR-based ReID framework, ReID3D, that utilizes pre-training strategy to retrieve features of 3D body shape and introduces Graph-based Complementary Enhancement Encoder for extracting comprehensive features. Due to the lack of LiDAR datasets, we build LReID, the first LiDAR-based person ReID dataset, which is collected in several outdoor scenes with variations in natural conditions. Additionally, we introduce LReID-sync, a simulated pedestrian dataset designed for pre-training encoders with tasks of point cloud completion and shape parameter learning. Extensive experiments on LReID show that ReID3D achieves exceptional performance with a rank-1 accuracy of 94.0, highlighting the significant potential of LiDAR in addressing person ReID tasks. To the best of our knowledge, we are the first to propose a solution for LiDAR-based ReID. The code and datasets will be released soon.
Published: 2023

49. Fixed-length Dense Descriptor for Efficient Fingerprint Matching

Author: Pan, Zhiyu, Duan, Yongjie, Feng, Jianjiang, and Zhou, Jie
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: In fingerprint matching, fixed-length descriptors generally offer greater efficiency compared to minutiae set, but the recognition accuracy is not as good as that of the latter. Although much progress has been made in deep learning based fixed-length descriptors recently, they often fall short when dealing with incomplete or partial fingerprints, diverse fingerprint poses, and significant background noise. In this paper, we propose a three-dimensional representation called Fixed-length Dense Descriptor (FDD) for efficient fingerprint matching. FDD features great spatial properties, enabling it to capture the spatial relationships of the original fingerprints, thereby enhancing interpretability and robustness. Our experiments on various fingerprint datasets reveal that FDD outperforms other fixed-length descriptors, especially in matching fingerprints of different areas, cross-modal fingerprint matching, and fingerprint matching with background noise., Comment: Accepted by WIFS 2024
Published: 2023

50. SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction

Author: Huang, Yuanhui, Zheng, Wenzhao, Zhang, Borui, Zhou, Jie, and Lu, Jiwen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: 3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving, which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However, it is very laborious to annotate the occupancy status of each voxel. In this paper, we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g., bird's eye view) to obtain 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields. We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on Occ3D. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis, monocular depth estimation, and surround-view depth estimation on the SemanticKITTI, KITTI-2015, and nuScenes, respectively. Code: https://github.com/huang-yh/SelfOcc., Comment: Code is available at: https://github.com/huang-yh/SelfOcc
Published: 2023

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

288 results on '"Zhou, Jie"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources