Author: "He, Yuze" / Database: OAIster - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"He, Yuze"' showing total 11 results

1. LongAlign: A Recipe for Long Context Alignment of Large Language Models

Author: Bai, Yushi, Lv, Xin, Zhang, Jiajie, He, Yuze, Qi, Ji, Hou, Lei, Tang, Jie, Dong, Yuxiao, and Li, Juanzi
Abstract: Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe of the instruction data, training, and evaluation for long context alignment. First, we construct a long instruction-following dataset using Self-Instruct. To ensure the data diversity, it covers a broad range of tasks from various long context sources. Second, we adopt the packing and sorted batching strategies to speed up supervised fine-tuning on data with varied length distributions. Additionally, we develop a loss weighting method to balance the contribution to the loss across different sequences during packing training. Third, we introduce the LongBench-Chat benchmark for evaluating instruction-following capabilities on queries of 10k-100k in length. Experiments show that LongAlign outperforms existing recipes for LLMs in long context tasks by up to 30%, while also maintaining their proficiency in handling short, generic tasks. The code, data, and long-aligned models are open-sourced at https://github.com/THUDM/LongAlign.
Published: 2024

2. Soar: Design and Deployment of A Smart Roadside Infrastructure System for Autonomous Driving

Author: Shi, Shuyao, Ling, Neiwen, Jiang, Zhehao, Huang, Xuan, He, Yuze, Zhao, Xiaoguang, Yang, Bufang, Bian, Chen, Xia, Jingfei, Yan, Zhenyu, Yeung, Raymond, and Xing, Guoliang
Abstract: Recently, smart roadside infrastructure (SRI) has demonstrated the potential of achieving fully autonomous driving systems. To explore the potential of infrastructure-assisted autonomous driving, this paper presents the design and deployment of Soar, the first end-to-end SRI system specifically designed to support autonomous driving systems. Soar consists of both software and hardware components carefully designed to overcome various system and physical challenges. Soar can leverage the existing operational infrastructure like street lampposts for a lower barrier of adoption. Soar adopts a new communication architecture that comprises a bi-directional multi-hop I2I network and a downlink I2V broadcast service, which are designed based on off-the-shelf 802.11ac interfaces in an integrated manner. Soar also features a hierarchical DL task management framework to achieve desirable load balancing among nodes and enable them to collaborate efficiently to run multiple data-intensive autonomous driving applications. We deployed a total of 18 Soar nodes on existing lampposts on campus, which have been operational for over two years. Our real-world evaluation shows that Soar can support a diverse set of autonomous driving applications and achieve desirable real-time performance and high communication reliability. Our findings and experiences in this work offer key insights into the development and deployment of next-generation smart roadside infrastructure and autonomous driving systems.
Published: 2024

3. Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos

Author: Hu, Yubin, He, Yuze, Li, Yanghao, Li, Jisheng, Han, Yuxing, Wen, Jiangtao, and Liu, Yong-Jin
Abstract: Video semantic segmentation (VSS) is a computationally expensive task due to the per-frame prediction for videos of high frame rates. In recent work, compact models or adaptive network strategies have been proposed for efficient VSS. However, they did not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes. To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module, and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg
Comment: CVPR 2023
Published: 2023

4. Benchmarking Foundation Models with Language-Model-as-an-Examiner

Author: Bai, Yushi, Ying, Jiahao, Cao, Yixin, Lv, Xin, He, Yuze, Wang, Xiaozhi, Yu, Jifan, Zeng, Kaisheng, Xiao, Yijia, Lyu, Haozhe, Zhang, Jiayin, Li, Juanzi, and Hou, Lei
Abstract: Numerous benchmarks have been established to assess the performance of foundation models on open-ended question answering, which serves as a comprehensive test of a model's ability to understand and generate language in a manner similar to humans. Most of these works focus on proposing new datasets; however, we see two main issues within previous benchmarking pipelines, namely testing leakage and evaluation automation. In this paper, we propose a novel benchmarking framework, Language-Model-as-an-Examiner, where the LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner. Our framework allows for effortless extensibility as various LMs can be adopted as the examiner, and the questions can be constantly updated given more diverse trigger topics. For a more comprehensive and equitable evaluation, we devise three strategies: (1) We instruct the LM examiner to generate questions across a multitude of domains to probe for a broad acquisition, and raise follow-up questions to engage in a more in-depth assessment. (2) Upon evaluation, the examiner combines both scoring and ranking measurements, providing a reliable result as it aligns closely with human annotations. (3) We additionally propose a decentralized Peer-examination method to address the biases in a single examiner. Our data and benchmarking results are available at: http://lmexam.xlore.cn
Comment: NeurIPS 2023 Datasets and Benchmarks
Published: 2023

5. Text-Image Conditioned Diffusion for Consistent Text-to-3D Generation

Author: He, Yuze, Bai, Yushi, Lin, Matthieu, Sheng, Jenny, Hu, Yubin, Wang, Qi, Wen, Yu-Hui, and Liu, Yong-Jin
Abstract: By lifting the pre-trained 2D diffusion models into Neural Radiance Fields (NeRFs), text-to-3D generation methods have made great progress. Many state-of-the-art approaches usually apply score distillation sampling (SDS) to optimize the NeRF representations, which supervises the NeRF optimization with pre-trained text-conditioned 2D diffusion models such as Imagen. However, the supervision signal provided by such pre-trained diffusion models only depends on text prompts and does not constrain the multi-view consistency. To inject the cross-view consistency into diffusion priors, some recent works finetune the 2D diffusion model with multi-view data, but still lack fine-grained view coherence. To tackle this challenge, we incorporate multi-view image conditions into the supervision signal of NeRF optimization, which explicitly enforces fine-grained view consistency. With such stronger supervision, our proposed text-to-3D method effectively mitigates the generation of floaters (due to excessive densities) and completely empty spaces (due to insufficient densities). Our quantitative evaluations on the T$^3$Bench dataset demonstrate that our method achieves state-of-the-art performance over existing text-to-3D methods. We will make the code publicly available.
Published: 2023

6. T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation

Author: He, Yuze, Bai, Yushi, Lin, Matthieu, Zhao, Wang, Hu, Yubin, Sheng, Jenny, Yi, Ran, Li, Juanzi, and Liu, Yong-Jin
Abstract: Recent methods in text-to-3D leverage powerful pretrained diffusion models to optimize NeRF. Notably, these methods are able to produce high-quality 3D scenes without training on 3D data. Due to the open-ended nature of the task, most studies evaluate their results with subjective case studies and user experiments, thereby presenting a challenge in quantitatively addressing the question: How has current progress in Text-to-3D gone so far? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark containing diverse text prompts of three increasing complexity levels that are specially designed for 3D generation. To assess both the subjective quality and the text alignment, we propose two automatic metrics based on multi-view images produced by the 3D contents. The quality metric combines multi-view text-image scores and regional convolution to detect quality and view inconsistency. The alignment metric uses multi-view captioning and GPT-4 evaluation to measure text-3D consistency. Both metrics closely correlate with different dimensions of human judgments, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results, shown in Fig. 1, reveal performance differences among 10 prevalent text-to-3D methods. Our analysis further highlights the common struggles for current methods on generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com
Comment: Under review
Published: 2023

7. MMPI: a Flexible Radiance Field Representation by Multiple Multi-plane Images Blending

Author: He, Yuze, Wang, Peng, Hu, Yubin, Zhao, Wang, Yi, Ran, Liu, Yong-Jin, and Wang, Wenping
Abstract: This paper presents a flexible representation of neural radiance fields based on multi-plane images (MPI), for high-quality view synthesis of complex scenes. MPI with Normalized Device Coordinate (NDC) parameterization is widely used in NeRF learning for its simple definition, easy calculation, and powerful ability to represent unbounded scenes. However, existing NeRF works that adopt MPI representation for novel view synthesis can only handle simple forward-facing unbounded scenes, where the input cameras are all observing in similar directions with small relative translations. Hence, extending these MPI-based methods to more complex scenes like large-range or even 360-degree scenes is very challenging. In this paper, we explore the potential of MPI and show that MPI can synthesize high-quality novel views of complex scenes with diverse camera distributions and view directions, which are not only limited to simple forward-facing scenes. Our key idea is to encode the neural radiance field with multiple MPIs facing different directions and blend them with an adaptive blending operation. For each region of the scene, the blending operation gives larger blending weights to those advantaged MPIs with stronger local representation abilities while giving lower weights to those with weaker representation abilities. Such blending operation automatically modulates the multiple MPIs to appropriately represent the diverse local density and color information. Experiments on the KITTI dataset and ScanNet dataset demonstrate that our proposed MMPI synthesizes high-quality images from diverse camera pose distributions and is fast to train, outperforming the previous fast-training NeRF methods for novel view synthesis. Moreover, we show that MMPI can encode extremely long trajectories and produce novel view renderings, demonstrating its potential in applications like autonomous driving.
Published: 2023

8. O$^2$-Recon: Completing 3D Reconstruction of Occluded Objects in the Scene with a Pre-trained 2D Diffusion Model

Author: Hu, Yubin, Ye, Sheng, Zhao, Wang, Lin, Matthieu, He, Yuze, Wen, Yu-Hui, He, Ying, and Liu, Yong-Jin
Abstract: Occlusion is a common issue in 3D reconstruction from RGB-D videos, often blocking the complete reconstruction of objects and presenting an ongoing problem. In this paper, we propose a novel framework, empowered by a 2D diffusion-based in-painting model, to reconstruct complete surfaces for the hidden parts of objects. Specifically, we utilize a pre-trained diffusion model to fill in the hidden areas of 2D images. Then we use these in-painted images to optimize a neural implicit surface representation for each instance for 3D reconstruction. Since creating the in-painting masks needed for this process is tricky, we adopt a human-in-the-loop strategy that involves very little human engagement to generate high-quality masks. Moreover, some parts of objects can be totally hidden because the videos are usually shot from limited perspectives. To ensure recovering these invisible areas, we develop a cascaded network architecture for predicting signed distance field, making use of different frequency bands of positional encoding and maintaining overall smoothness. Besides the commonly used rendering loss, Eikonal loss, and silhouette loss, we adopt a CLIP-based semantic consistency loss to guide the surface from unseen camera angles. Experiments on ScanNet scenes show that our proposed framework achieves state-of-the-art accuracy and completeness in object-level reconstruction from scene-level RGB-D videos. Code: https://github.com/THU-LYJ-Lab/O2-Recon
Comment: AAAI 2024
Published: 2023

9. AutoMatch: Leveraging Traffic Camera to Improve Perception and Localization of Autonomous Vehicles

Author: He, Yuze, Ma, Li, Cui, Jiahe, Yan, Zhenyu, Xing, Guoliang, Wang, Sen, Hu, Qintao, and Pan, Chen
Abstract: Traffic cameras are among the most ubiquitous traffic facilities, providing high coverage of complex, accident-prone road sections such as intersections. This work leverages traffic cameras to improve the perception and localization performance of autonomous vehicles at intersections. In particular, vehicles can expand their range of perception by matching the images captured by both the traffic cameras and on-vehicle cameras. Moreover, a traffic camera can match its images to an existing high-definition map (HD map) to derive centimeter-level location of the vehicles in its field of view. To this end, we propose AutoMatch - a novel system for real-time image registration, which is a key enabling technology for traffic camera-assisted perception and localization of autonomous vehicles. Our key idea is to leverage landmark keypoints of distinctive structures such as ground signs at intersections to facilitate image registration between traffic cameras and HD maps or vehicles. By leveraging the strong structural characteristics of ground signs, AutoMatch can extract very few but precise landmark keypoints for registration, which effectively reduces the communication/compute overhead. We implement AutoMatch on a testbed consisting of a self-built autonomous car, drones for surveying and mapping, and real traffic cameras. In addition, we collect two new multi-view traffic image datasets at intersections, which contain images from 220 real operational traffic cameras in 22 cities. Experimental results show that AutoMatch achieves pixel-level image registration accuracy within 88 milliseconds, and delivers an 11.7× improvement in accuracy, 1.4× speedup in compute time, and 17.1× data transmission saving over existing approaches. © 2022 ACM.
Published: 2022

10. VI-eye: semantic-based 3D point cloud registration for infrastructure-assisted autonomous driving

Author: He, Yuze, Ma, Li, Jiang, Zhehao, Tang, Yi, and Xing, Guoliang
Abstract: Infrastructure-assisted autonomous driving is an emerging paradigm that aims to make affordable autonomous vehicles a reality. A key technology for realizing this vision is real-time point cloud registration, which allows a vehicle to fuse the 3D point clouds generated by its own LiDAR and those on roadside infrastructures such as smart lampposts, which can deliver increased sensing range, more robust object detection, and centimeter-level navigation. Unfortunately, the existing methods for point cloud registration assume two clouds to share a similar perspective and large overlap, which results in significant delay and inaccuracy in real-world infrastructure-assisted driving settings. This paper proposes VI-Eye - the first system that can align vehicle-infrastructure point clouds at centimeter accuracy in real-time. Our key idea is to exploit traffic domain knowledge by detecting a set of key semantic objects including road, lane lines, curbs, and traffic signs. Based on the inherent regular geometries of such semantic objects, VI-Eye extracts a small number of saliency points and leverages them to achieve real-time registration of two point clouds. By allowing vehicles and infrastructures to extract the semantic information in parallel, VI-Eye leads to a highly scalable architecture for infrastructure-assisted autonomous driving. To evaluate the performance of VI-Eye, we collect two new multiview LiDAR point cloud datasets on an indoor autonomous driving testbed and a campus smart lamppost testbed, respectively. They contain a total of 915 point cloud pairs and cover three roads of 1.12 km. Experimental results show that VI-Eye achieves centimeter-level accuracy within around 0.2s, and delivers a 5X improvement in accuracy and 2X speedup over state-of-the-art baselines.
Published: 2021

11. Learning to compose 6-DoF omnidirectional videos using multi-sphere images

Author: Li, Jisheng, He, Yuze, Hu, Yubin, Han, Yuxing, and Wen, Jiangtao
Abstract: Omnidirectional video is an essential component of Virtual Reality. Although various methods have been proposed to generate content that can be viewed with six degrees of freedom (6-DoF), existing systems usually involve complex depth estimation, image in-painting or stitching pre-processing. In this paper, we propose a system that uses a 3D ConvNet to generate a multi-sphere images (MSI) representation that can be experienced in 6-DoF VR. The system utilizes conventional omnidirectional VR camera footage directly without the need for a depth map or segmentation mask, thereby significantly simplifying the overall complexity of the 6-DoF omnidirectional video composition. By using a newly designed weighted sphere sweep volume (WSSV) fusing technique, our approach is compatible with most panoramic VR camera setups. A ground truth generation approach for high-quality artifact-free 6-DoF contents is proposed and can be used by the research and development community for 6-DoF content generation.
Published: 2021
