4,276 results for "Zhao, Hang"
Search Results
2. Can Open-source LLMs Enhance Data Synthesis for Toxic Detection?: An Experimental Study
- Author
Hui, Zheng, Guo, Zhaoxiao, Zhao, Hang, Duan, Juanyong, Ai, Lin, Li, Yinheng, Hirschberg, Julia, and Huang, Congrui
- Subjects
Computer Science - Computation and Language; Computer Science - Artificial Intelligence
- Abstract
Effective toxic content detection relies heavily on high-quality and diverse data, which serves as the foundation for robust content moderation models. This study explores the potential of open-source LLMs for harmful data synthesis, utilizing prompt engineering and fine-tuning techniques to enhance data quality and diversity. In a two-stage evaluation, we first examine the capabilities of six open-source LLMs in generating harmful data across multiple datasets using prompt engineering. In the second stage, we fine-tune these models to improve data generation while addressing challenges such as hallucination, data duplication, and overfitting. Our findings reveal that Mistral excels in generating high-quality and diverse harmful data with minimal hallucination. Furthermore, fine-tuning enhances data quality, offering scalable and cost-effective solutions for augmenting datasets for specific toxic content detection tasks. These results emphasize the significance of data synthesis in building robust, standalone detection models and highlight the potential of open-source LLMs to advance smaller downstream content moderation systems. We implemented this approach in real-world industrial settings, demonstrating the feasibility and efficiency of fine-tuned open-source LLMs for harmful data synthesis. Comment: 12 pages
- Published
- 2024
3. Generalizing Motion Planners with Mixture of Experts for Autonomous Driving
- Author
Sun, Qiao, Wang, Huimin, Zhan, Jiahao, Nie, Fan, Wen, Xin, Xu, Leimeng, Zhan, Kun, Jia, Peng, Lang, Xianpeng, and Zhao, Hang
- Subjects
Computer Science - Robotics; Computer Science - Computer Vision and Pattern Recognition
- Abstract
Large real-world driving datasets have sparked significant research into various aspects of data-driven motion planners for autonomous driving. These include data augmentation, model architecture, reward design, training strategies, and planner pipelines. These planners promise better generalization on complicated and few-shot cases than previous methods. However, experimental results show that many of these approaches yield limited generalization in planning performance due to overly complex designs or training paradigms. In this paper, we review and benchmark previous methods focusing on generalization. The experimental results indicate that as models are appropriately scaled, many design elements become redundant. We introduce StateTransformer-2 (STR2), a scalable, decoder-only motion planner that uses a Vision Transformer (ViT) encoder and a mixture-of-experts (MoE) causal Transformer architecture. The MoE backbone addresses modality collapse and reward balancing by expert routing during training. Extensive experiments on the NuPlan dataset show that our method generalizes better than previous approaches across different test sets and closed-loop simulations. Furthermore, we assess its scalability on billions of real-world urban driving scenarios, demonstrating consistent accuracy improvements as both data and model size grow. Comment: 7 pages, 3 figures
- Published
- 2024
4. Playful DoggyBot: Learning Agile and Precise Quadrupedal Locomotion
- Author
Duan, Xin, Zhuang, Ziwen, Zhao, Hang, and Schwertfeger, Soeren
- Subjects
Computer Science - Robotics
- Abstract
Quadrupedal animals can perform tasks that are both agile and accurate: a trained dog can chase and catch a flying frisbee before it touches the ground; a cat alone at home can jump and grab the door handle accurately. However, agility and precision are usually a trade-off in robotics problems. Recent works in quadruped robots either focus on agile but not-so-accurate tasks, such as locomotion in challenging terrain, or accurate but not-so-fast tasks, such as using an additional manipulator to interact with objects. In this work, we target a task that demands both accuracy and agility: catching a small object hanging above the robot. We mount a passive gripper in front of the robot chassis, so that the robot has to jump and catch the object with extreme precision. Our experiments show that our system is able to jump and successfully catch a ball hanging 1.05m high in simulation and 0.8m high in the real world, while the robot stands only 0.3m tall.
- Published
- 2024
5. ToxiCraft: A Novel Framework for Synthetic Generation of Harmful Information
- Author
Hui, Zheng, Guo, Zhaoxiao, Zhao, Hang, Duan, Juanyong, and Huang, Congrui
- Subjects
Computer Science - Computation and Language; Computer Science - Artificial Intelligence
- Abstract
In different NLP tasks, detecting harmful content is crucial for online environments, especially with the growing influence of social media. However, previous research has two main issues: 1) a lack of data in low-resource settings, and 2) inconsistent definitions and criteria for judging harmful content, requiring classification models to be robust to spurious features and diverse criteria. We propose ToxiCraft, a novel framework for synthesizing datasets of harmful information to address these weaknesses. With only a small amount of seed data, our framework can generate a wide variety of synthetic, yet remarkably realistic, examples of toxic information. Experimentation across various datasets showcases a notable enhancement in detection model robustness and adaptability, surpassing or approaching gold-label performance. We will release the generated data on GitHub upon acceptance.
- Published
- 2024
6. CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction
- Author
Ye, Zhangchen, Jiang, Tao, Xu, Chenfeng, Li, Yiming, and Zhao, Hang
- Subjects
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence
- Abstract
Vision-based 3D occupancy prediction is significantly challenged by the inherent limitations of monocular vision in depth estimation. This paper introduces CVT-Occ, a novel approach that leverages temporal fusion through the geometric correspondence of voxels over time to improve the accuracy of 3D occupancy predictions. By sampling points along the line of sight of each voxel and integrating the features of these points from historical frames, we construct a cost volume feature map that refines current volume features for improved prediction outcomes. Our method takes advantage of parallax cues from historical observations and employs a data-driven approach to learn the cost volume. We validate the effectiveness of CVT-Occ through rigorous experiments on the Occ3D-Waymo dataset, where it outperforms state-of-the-art methods in 3D occupancy prediction with minimal additional computational cost. The code is released at https://github.com/Tsinghua-MARS-Lab/CVT-Occ. Comment: Accepted to ECCV 2024
- Published
- 2024
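The cost-volume construction described in the CVT-Occ abstract above can be made concrete with a short sketch. The function below is a simplified, hypothetical reimplementation (not the authors' released code): it samples points along each voxel's line of sight and gathers features from a historical frame's feature volume, assuming that volume has already been warped into the current coordinate frame and that coordinates are normalized to [-1, 1].

```python
import torch
import torch.nn.functional as F

def build_cost_volume(voxel_centers, cam_origin, hist_feats, num_samples=4):
    """Gather historical features along each voxel's line of sight.

    voxel_centers: (N, 3) voxel centers, normalized to [-1, 1].
    cam_origin:    (3,)   camera origin in the same normalized frame.
    hist_feats:    (1, C, D, H, W) historical feature volume, pre-warped
                   into the current frame (an assumption of this sketch).
    Returns:       (N, C * num_samples) per-voxel cost-volume features.
    """
    # Points sampled along the ray from the camera origin to each voxel.
    t = torch.linspace(0.0, 1.0, num_samples, device=voxel_centers.device)
    rays = voxel_centers.unsqueeze(1) - cam_origin          # (N, 1, 3)
    points = cam_origin + t.view(1, -1, 1) * rays           # (N, K, 3)

    # grid_sample expects (x, y, z) coords in [-1, 1]; fold N*K into one axis.
    grid = points.view(1, 1, 1, -1, 3)                      # (1, 1, 1, N*K, 3)
    sampled = F.grid_sample(hist_feats, grid, align_corners=True)
    C = hist_feats.shape[1]                                 # (1, C, 1, 1, N*K)
    return sampled.view(C, -1, num_samples).permute(1, 0, 2).reshape(-1, C * num_samples)
```

In the paper these gathered features refine the current volume features before the prediction head; only the sampling step is shown here.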
7. Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
- Author
Bai, Ye, Chen, Haonan, Chen, Jitong, Chen, Zhuo, Deng, Yi, Dong, Xiaohong, Hantrakul, Lamtharn, Hao, Weituo, Huang, Qingqing, Huang, Zhongyi, Jia, Dongya, La, Feihu, Le, Duc, Li, Bochen, Li, Chumin, Li, Hui, Li, Xingxing, Liu, Shouda, Lu, Wei-Tsung, Lu, Yiqing, Shaw, Andrew, Spijkervet, Janne, Sun, Yakun, Wang, Bo, Wang, Ju-Chiang, Wang, Yuping, Wang, Yuxuan, Xu, Ling, Yang, Yifeng, Yao, Chao, Zhang, Shuo, Zhang, Yang, Zhang, Yilin, Zhao, Hang, Zhao, Ziyi, Zhong, Dejian, Zhou, Shicen, and Zou, Pei
- Subjects
Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We introduce Seed-Music, a suite of music generation systems capable of producing high-quality music with fine-grained style control. Our unified framework leverages both auto-regressive language modeling and diffusion approaches to support two key music creation workflows: controlled music generation and post-production editing. For controlled music generation, our system enables vocal music generation with performance controls from multi-modal inputs, including style descriptions, audio references, musical scores, and voice prompts. For post-production editing, it offers interactive tools for editing lyrics and vocal melodies directly in the generated audio. We encourage readers to listen to demo audio examples at https://team.doubao.com/seed-music. Comment: Seed-Music technical report, 20 pages, 5 figures
- Published
- 2024
8. Robust Robot Walker: Learning Agile Locomotion over Tiny Traps
- Author
Zhu, Shaoting, Huang, Runhan, Mou, Linzhan, and Zhao, Hang
- Subjects
Computer Science - Robotics; Computer Science - Artificial Intelligence
- Abstract
Quadruped robots must exhibit robust walking capabilities in practical applications. In this work, we propose a novel approach that enables quadruped robots to pass various small obstacles, or "tiny traps". Existing methods often rely on exteroceptive sensors, which can be unreliable for detecting such tiny traps. To overcome this limitation, our approach focuses solely on proprioceptive inputs. We introduce a two-stage training framework incorporating a contact encoder and a classification head to learn implicit representations of different traps. Additionally, we design a set of tailored reward functions to improve both the stability of training and the ease of deployment for goal-tracking tasks. To benefit further research, we design a new benchmark for the tiny trap task. Extensive experiments in both simulation and real-world settings demonstrate the effectiveness and robustness of our method. Project page: https://robust-robot-walker.github.io/. Comment: 10 pages, 17 figures
- Published
- 2024
9. SARO: Space-Aware Robot System for Terrain Crossing via Vision-Language Model
- Author
Zhu, Shaoting, Li, Derun, Mou, Linzhan, Liu, Yong, Xu, Ningyi, and Zhao, Hang
- Subjects
Computer Science - Robotics
- Abstract
The application of vision-language models (VLMs) has achieved impressive success in various robotics tasks. However, few works have explored applying these foundation models to quadruped robot navigation through terrains in 3D environments. In this work, we introduce SARO (Space Aware Robot System for Terrain Crossing), an innovative system composed of a high-level reasoning module, a closed-loop sub-task execution module, and a low-level control policy. It enables the robot to navigate across 3D terrains and reach the goal position. For high-level reasoning and execution, we propose a novel algorithmic system taking advantage of a VLM, with a design of task decomposition and a closed-loop sub-task execution mechanism. For low-level locomotion control, we utilize the Probability Annealing Selection (PAS) method to effectively train a control policy by reinforcement learning. Numerous experiments show that our whole system can accurately and robustly navigate across several 3D terrains, and its generalization ability enables application to diverse indoor and outdoor scenarios and terrains. Project page: https://saro-vlm.github.io/. Comment: 12 pages, 9 figures
- Published
- 2024
10. Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
- Author
Shao, Tong, Tian, Zhuotao, Zhao, Hang, and Su, Jingyong
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average across 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods. The code is made publicly available at https://github.com/leaves162/CLIPtrase. Comment: Accepted to ECCV 2024
- Published
- 2024
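The recalibrated self-correlation idea in the CLIPtrase abstract above can be sketched in a few lines. This is an illustrative simplification, assuming patch features come from CLIP's visual encoder with the [CLS] token removed; the paper's actual recalibration is more involved.

```python
import torch
import torch.nn.functional as F

def patch_self_correlation(patch_feats):
    """Correlate L2-normalized patch features with each other.

    patch_feats: (N, D) per-patch features, [CLS] token already dropped.
    Returns an (N, N) similarity map used to group semantically
    coherent patches for training-free segmentation.
    """
    f = F.normalize(patch_feats, dim=-1)
    return f @ f.t()
```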
11. DISCO: Efficient Diffusion Solver for Large-Scale Combinatorial Optimization Problems
- Author
Yu, Kexiong, Zhao, Hang, Huang, Yuhang, Yi, Renjiao, Xu, Kai, and Zhu, Chenyang
- Subjects
Computer Science - Artificial Intelligence
- Abstract
Combinatorial Optimization (CO) problems are fundamentally important in numerous real-world applications across diverse industries, characterized by enormous solution spaces and time-sensitive response requirements. Despite recent advancements in neural solvers, their limited expressiveness struggles to capture the multi-modal nature of CO landscapes. While some research has shifted towards diffusion models, these models still sample solutions indiscriminately from the entire NP-complete solution space with time-consuming denoising processes, which limits their practicality for large problem scales. We propose DISCO, an efficient DIffusion Solver for large-scale Combinatorial Optimization problems that excels in both solution quality and inference speed. DISCO's efficacy is twofold: First, it enhances solution quality by constraining the sampling space to a more meaningful domain guided by solution residues, while preserving the multi-modal properties of the output distributions. Second, it accelerates the denoising process through an analytically solvable approach, enabling solution sampling with minimal reverse-time steps and significantly reducing inference time. DISCO delivers strong performance on large-scale Traveling Salesman Problems and challenging Maximal Independent Set benchmarks, with inference time up to 5.28 times faster than other diffusion alternatives. By incorporating a divide-and-conquer strategy, DISCO generalizes well to problem instances of unseen scales, even surpassing models specifically trained for those scales.
- Published
- 2024
12. GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory
- Author
Wu, Haoze, Qiu, Zihan, Wang, Zili, Zhao, Hang, and Fu, Jie
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence
- Abstract
Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models. By dynamically and sparsely selecting activated experts, MoE can effectively reduce computational costs. Despite the success, we observe that many tokens in MoE models have uncertain routing results. These tokens have nearly equal scores for choosing each expert, and we demonstrate that this uncertainty can lead to incorrect selections. Inspired by the Global Workspace Theory (GWT), we propose a new fine-tuning method, GW-MoE, to address this issue. The core idea is to broadcast the uncertain tokens across experts during fine-tuning. Therefore, these tokens can acquire the necessary knowledge from any expert during inference and become less sensitive to the choice. GW-MoE does not introduce additional inference overhead. We validate that GW-MoE can mitigate the uncertainty problem and consistently improve performance across different tasks (text classification, question answering, summarization, code generation, and mathematical problem solving) and model sizes (650M and 8B parameters).
- Published
- 2024
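The routing uncertainty that GW-MoE targets can be made concrete with a small sketch. The entropy threshold below is a hypothetical hyperparameter, and the broadcast step is only indicated by a mask; this is not the paper's implementation.

```python
import math
import torch
import torch.nn.functional as F

def route_with_uncertainty(router_logits, threshold=0.9):
    """Flag tokens whose expert scores are nearly equal.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns top-1 expert ids and a mask of uncertain tokens whose routing
    distribution is close to uniform (high normalized entropy).
    """
    probs = F.softmax(router_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    max_entropy = math.log(router_logits.shape[-1])
    uncertain = entropy / max_entropy > threshold
    # During GW-MoE fine-tuning, uncertain tokens would be broadcast to all
    # experts instead of being committed to the top-1 choice.
    return probs.argmax(dim=-1), uncertain
```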
13. Humanoid Parkour Learning
- Author
Zhuang, Ziwen, Yao, Shenzhe, and Zhao, Hang
- Subjects
Computer Science - Robotics
- Abstract
Parkour is a grand challenge for legged locomotion, even for quadruped robots, requiring active perception and various maneuvers to overcome multiple challenging obstacles. Existing methods for humanoid locomotion either optimize a trajectory for a single parkour track or train a reinforcement learning policy only to walk with a significant amount of motion references. In this work, we propose a framework for learning an end-to-end vision-based whole-body-control parkour policy for humanoid robots that masters multiple parkour skills without any motion priors. Using the parkour policy, the humanoid robot can jump onto a 0.42m platform, leap over hurdles and 0.8m gaps, and much more. It can also run at 1.8m/s in the wild and walk robustly on different terrains. We test our policy in indoor and outdoor environments to demonstrate that it can autonomously select parkour skills while following the rotation command of the joystick. We override the arm actions and show that this framework can easily transfer to humanoid mobile manipulation tasks. Videos can be found at https://humanoid4parkour.github.io. Comment: Published at CoRL 2024
- Published
- 2024
14. TimeSieve: Extracting Temporal Dynamics through Information Bottlenecks
- Author
Feng, Ninghui, Lai, Songning, Yang, Jiayu, Zhou, Fobao, Yin, Zhenxiao, and Zhao, Hang
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence
- Abstract
Time series forecasting has become an increasingly popular research area due to its critical applications in various real-world domains such as traffic management, weather prediction, and financial analysis. Despite significant advancements, existing models face notable challenges, including the necessity of manual hyperparameter tuning for different datasets, and difficulty in effectively distinguishing signal from redundant features in data characterized by strong seasonality. These issues hinder the generalization and practical application of time series forecasting models. To address these issues, we propose TimeSieve, an innovative time series forecasting model designed to meet these challenges. Our approach employs wavelet transforms to preprocess time series data, effectively capturing multi-scale features without the need for additional parameters or manual hyperparameter tuning. Additionally, we apply information bottleneck theory to filter out redundant features from both detail and approximation coefficients, retaining only the most predictive information. This combination significantly improves the model's accuracy. Extensive experiments demonstrate that our model outperforms existing state-of-the-art methods on 70% of the datasets, achieving higher predictive accuracy and better generalization across diverse datasets. Our results validate the effectiveness of our approach in addressing the key challenges in time series forecasting, paving the way for more reliable and efficient predictive models in practical applications. The code for our model is available at https://github.com/xll0328/TimeSieve.
- Published
- 2024
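The wavelet preprocessing step described in the TimeSieve abstract above can be sketched with PyWavelets; the learned information-bottleneck filtering of the resulting coefficients is not reproduced here, and the wavelet choice is illustrative.

```python
import numpy as np
import pywt

def wavelet_features(series, wavelet="db4", level=2):
    """Decompose a 1-D series into approximation and detail coefficients,
    giving multi-scale features without per-dataset hyperparameter tuning."""
    coeffs = pywt.wavedec(series, wavelet, level=level)
    return coeffs[0], coeffs[1:]   # approximation, list of detail bands

# Usage on a toy signal:
approx, details = wavelet_features(np.sin(np.linspace(0, 20, 256)))
```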
15. FTS: A Framework to Find a Faithful TimeSieve
- Author
Lai, Songning, Feng, Ninghui, Gao, Jiechao, Wang, Hao, Sui, Haochen, Zou, Xin, Yang, Jiayu, Chen, Wenshuo, Zhao, Hang, Hu, Xuming, and Yue, Yutao
- Subjects
Computer Science - Machine Learning
- Abstract
The field of time series forecasting has garnered significant attention in recent years, prompting the development of advanced models like TimeSieve, which demonstrates impressive performance. However, an analysis reveals certain unfaithfulness issues, including high sensitivity to random seeds, to input and layer noise perturbations, and to parametric perturbations. Recognizing these challenges, we embark on a quest to define the concept of a Faithful TimeSieve (FTS), a model that consistently delivers reliable and robust predictions. To address these issues, we propose a novel framework aimed at identifying and rectifying unfaithfulness in TimeSieve. Our framework is designed to enhance the model's stability and faithfulness, ensuring that its outputs are less susceptible to the aforementioned factors. Experimentation validates the effectiveness of our proposed framework, demonstrating improved faithfulness in the model's behavior.
- Published
- 2024
16. Generating Comprehensive Lithium Battery Charging Data with Generative AI
- Author
Jiang, Lidang, Hu, Changyan, Ji, Sibei, Zhao, Hang, Chen, Junxiong, and He, Ge
- Subjects
Computer Science - Machine Learning; Electrical Engineering and Systems Science - Signal Processing
- Abstract
In optimizing performance and extending the lifespan of lithium batteries, accurate state prediction is pivotal. Traditional regression and classification methods have achieved some success in battery state prediction. However, the efficacy of these data-driven approaches heavily relies on the availability and quality of public datasets. Additionally, generating electrochemical data predominantly through battery experiments is a lengthy and costly process, making it challenging to acquire high-quality electrochemical data. This difficulty, coupled with data incompleteness, significantly impacts prediction accuracy. Addressing these challenges, this study introduces the End of Life (EOL) and Equivalent Cycle Life (ECL) as conditions for generative AI models. By integrating an embedding layer into the CVAE model, we developed the Refined Conditional Variational Autoencoder (RCVAE). Through preprocessing data into a quasi-video format, our study achieves an integrated synthesis of electrochemical data, including voltage, current, temperature, and charging capacity, which is then processed by the RCVAE model. Coupled with customized training and inference algorithms, this model can generate specific electrochemical data for EOL and ECL under supervised conditions. This method provides users with a comprehensive electrochemical dataset, pioneering a new research domain for the artificial synthesis of lithium battery data. Furthermore, based on the detailed synthetic data, various battery state indicators can be calculated, offering new perspectives and possibilities for lithium battery performance prediction.
- Published
- 2024
17. The Association Between Blood Mercury and Lipid Biomarkers in US Hypertensive Adults
- Author
Zhao, Hang and Peng, Jiecheng
- Published
- 2024
18. Influence of Porosity on Vibration of Porous FG Plates Resting on an Arbitrarily Orthotropic Winkler-Pasternak Foundation by PDDO
- Author
Yang, Yongyu, Wang, Xiaoqi, Zhao, Hang, Wang, Chao, Cheng, Changzheng, and Das, Raj
- Published
- 2024
19. P-MapNet: Far-seeing Map Generator Enhanced by both SDMap and HDMap Priors
- Author
Jiang, Zhou, Zhu, Zhenxin, Li, Pengfei, Gao, Huan-ang, Yuan, Tianyuan, Shi, Yongliang, Zhao, Hang, and Zhao, Hao
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Autonomous vehicles are gradually entering city roads today, with the help of high-definition maps (HDMaps). However, the reliance on HDMaps prevents autonomous vehicles from stepping into regions without this expensive digital infrastructure. This fact drives many researchers to study online HDMap generation algorithms, but the performance of these algorithms at far regions is still unsatisfying. We present P-MapNet, in which the letter P highlights the fact that we focus on incorporating map priors to improve model performance. Specifically, we exploit priors in both SDMap and HDMap. On one hand, we extract weakly aligned SDMap from OpenStreetMap, and encode it as an additional conditioning branch. Despite the misalignment challenge, our attention-based architecture adaptively attends to relevant SDMap skeletons and significantly improves performance. On the other hand, we exploit a masked autoencoder to capture the prior distribution of HDMap, which can serve as a refinement module to mitigate occlusions and artifacts. We benchmark on the nuScenes and Argoverse2 datasets. Through comprehensive experiments, we show that: (1) our SDMap prior can improve online map generation performance, using both rasterized (by up to +18.73 mIoU) and vectorized (by up to +8.50 mAP) output representations; (2) our HDMap prior can improve map perceptual metrics by up to 6.34%; (3) P-MapNet can be switched into different inference modes that cover different regions of the accuracy-efficiency trade-off landscape; (4) P-MapNet is a far-seeing solution that brings larger improvements on longer ranges. Code and models are publicly available at https://jike5.github.io/P-MapNet.
- Published
- 2024
20. PreSight: Enhancing Autonomous Vehicle Perception with City-Scale NeRF Priors
- Author
Yuan, Tianyuan, Mao, Yucheng, Yang, Jiawei, Liu, Yicheng, Wang, Yue, and Zhao, Hang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Autonomous vehicles rely extensively on perception systems to navigate and interpret their surroundings. Despite significant advancements in these systems recently, challenges persist under conditions like occlusion, extreme lighting, or in unfamiliar urban areas. Unlike these systems, humans do not solely depend on immediate observations to perceive the environment. In navigating new cities, humans gradually develop a preliminary mental map to supplement real-time perception during subsequent visits. Inspired by this human approach, we introduce a novel framework, PreSight, that leverages past traversals to construct static prior memories, enhancing online perception in later navigations. Our method involves optimizing a city-scale neural radiance field with data from previous journeys to generate neural priors. These priors, rich in semantic and geometric details, are derived without manual annotations and can seamlessly augment various state-of-the-art perception models, improving their efficacy with minimal additional computational cost. Experimental results on the nuScenes dataset demonstrate the framework's high compatibility with diverse online perception models. Specifically, it shows remarkable improvements in HD-map construction and occupancy prediction tasks, highlighting its potential as a new perception framework for autonomous driving systems. Our code will be released at https://github.com/yuantianyuan01/PreSight.
- Published
- 2024
21. DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
- Author
Tian, Xiaoyu, Gu, Junru, Li, Bailin, Liu, Yicheng, Wang, Yang, Zhao, Zhiyong, Zhan, Kun, Jia, Peng, Lang, Xianpeng, and Zhao, Hang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments. Comment: Project page: https://tsinghua-mars-lab.github.io/DriveVLM/
- Published
- 2024
22. MINT: Boosting Audio-Language Model via Multi-Target Pre-Training and Instruction Tuning
- Author
Zhao, Hang, Xin, Yifei, Yu, Zhesong, Zhu, Bilei, Lu, Lu, and Ma, Zejun
- Subjects
Computer Science - Sound; Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
In the realm of audio-language pre-training (ALP), the challenge of achieving cross-modal alignment is significant. Moreover, the integration of audio inputs with diverse distributions and task variations poses challenges in developing generic audio-language models. In this study, we present MINT, a novel ALP framework boosting audio-language models through multi-target pre-training and instruction tuning. MINT leverages the strength of frozen pre-trained audio encoders and large language models (LLMs) to improve audio-language pre-training, enabling effective transferability to both audio-text understanding and generation tasks. To address the modality gap, we introduce Bridge-Net, a trainable module that enhances cross-modality alignment and the model's ability to follow instructions for a variety of audio-text tasks. Bridge-Net is pivotal within MINT, initially enhancing audio-language representation learning through a multi-target pre-training approach. Subsequently, Bridge-Net further boosts audio-to-language generative learning by integrating a frozen language model with instruction tuning. This integration empowers MINT to extract features in a flexible and effective manner, specifically tailored to the provided instructions for diverse tasks. Experimental results demonstrate that MINT attains superior performance across various audio-language understanding and generation tasks, highlighting its robust generalization capabilities even in zero-shot scenarios.
- Published
- 2024
23. PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models
- Author
Chen, Junsong, Wu, Yue, Luo, Simian, Xie, Enze, Paul, Sayak, Luo, Ping, Zhao, Hang, and Li, Zhenguo
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
This technical report introduces PIXART-δ, a text-to-image synthesis framework that integrates the Latent Consistency Model (LCM) and ControlNet into the advanced PIXART-α model. PIXART-α is recognized for its ability to generate high-quality images of 1024px resolution through a remarkably efficient training process. The integration of LCM in PIXART-δ significantly accelerates the inference speed, enabling the production of high-quality images in just 2-4 steps. Notably, PIXART-δ achieves a breakthrough 0.5 seconds for generating 1024x1024 pixel images, marking a 7x improvement over PIXART-α. Additionally, PIXART-δ is designed to be efficiently trainable on 32GB V100 GPUs within a single day. With its 8-bit inference capability (von Platen et al., 2023), PIXART-δ can synthesize 1024px images within 8GB GPU memory constraints, greatly enhancing its usability and accessibility. Furthermore, incorporating a ControlNet-like module enables fine-grained control over text-to-image diffusion models. We introduce a novel ControlNet-Transformer architecture, specifically tailored for Transformers, achieving explicit controllability alongside high-quality image generation. As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis. Comment: Technical Report
- Published
- 2024
24. CVT-Occ: Cost Volume Temporal Fusion for 3D Occupancy Prediction
- Author
Ye, Zhangchen, Jiang, Tao, Xu, Chenfeng, Li, Yiming, and Zhao, Hang (in a volume edited by Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül)
- Published
- 2025
25. Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
- Author
Shao, Tong, Tian, Zhuotao, Zhao, Hang, and Su, Jingyong (in a volume edited by Leonardis, Aleš, Ricci, Elisa, Roth, Stefan, Russakovsky, Olga, Sattler, Torsten, and Varol, Gül)
- Published
- 2025
26. Selective laser melting of 3D queue microelectrodes and its application in micro-EDM
- Author
Xu, Bin, Liu, Yang-quan, Lei, Jian-guo, Zhao, Hang, Guo, Cheng, Wu, Xiao-yu, and Peng, Tai-jiang
- Published
- 2024
27. Phonological Development in 3–6-Year-Old Mandarin-Speaking Children with Autism, Developmental Delays, and Typical Development
- Author
Liu, Min, Han, Jinhe, Zhang, Yuexin, Wen, Jieling, Wang, Yanxia, Hu, Xinyu, Sun, Mudi, Qu, Lu, Han, Xuling, Xu, Lian, Zhao, Hang, Lu, Haidan, and Liu, Qiaoyun
- Published
- 2024
28. The relationship between physical activity and quality of life in Chinese adolescents: a cross-sectional study
- Author
Zhao, Hang, Chen, Huayong, Luo, Yi, and Xiao, Mimi
- Published
- 2024
29. LCM-LoRA: A Universal Stable-Diffusion Acceleration Module
- Author
Luo, Simian, Tan, Yiqin, Patil, Suraj, Gu, Daniel, von Platen, Patrick, Passos, Apolinário, Huang, Longbo, Li, Jian, and Zhao, Hang
- Subjects
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning
- Abstract
Latent Consistency Models (LCMs) have achieved impressive performance in accelerating text-to-image generative tasks, producing high-quality images with minimal inference steps. LCMs are distilled from pre-trained latent diffusion models (LDMs), requiring only ~32 A100 GPU training hours. This report further extends LCMs' potential in two aspects: First, by applying LoRA distillation to Stable-Diffusion models including SD-V1.5, SSD-1B, and SDXL, we have expanded LCM's scope to larger models with significantly less memory consumption, achieving superior image generation quality. Second, we identify the LoRA parameters obtained through LCM distillation as a universal Stable-Diffusion acceleration module, named LCM-LoRA. LCM-LoRA can be directly plugged into various Stable-Diffusion fine-tuned models or LoRAs without training, thus representing a universally applicable accelerator for diverse image generation tasks. Compared with previous numerical PF-ODE solvers such as DDIM and DPM-Solver, LCM-LoRA can be viewed as a plug-in neural PF-ODE solver that possesses strong generalization abilities. Project page: https://github.com/luosiallen/latent-consistency-model. Comment: Technical Report
- Published
- 2023
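Because LCM-LoRA is distributed as a plug-in module, its use can be shown concretely. The sketch below follows the Hugging Face diffusers integration (scheduler swap plus LoRA loading); the model and adapter repository IDs are the publicly published ones, but treat the exact calls as a usage sketch rather than the authors' reference code.

```python
import torch
from diffusers import DiffusionPipeline, LCMScheduler

# Load a Stable Diffusion pipeline, swap in the LCM scheduler, and attach
# the LCM-LoRA weights as a training-free acceleration module.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

# Few-step inference: LCMs need only ~4 steps and low guidance.
image = pipe("a photo of a cat", num_inference_steps=4, guidance_scale=1.0).images[0]
```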
30. Large Trajectory Models are Scalable Motion Predictors and Planners
- Author
Sun, Qiao, Zhang, Shiduo, Ma, Danjiao, Shi, Jingzhe, Li, Derun, Luo, Simian, Wang, Yu, Xu, Ningyi, Cao, Guangzhi, and Zhao, Hang
- Subjects
Computer Science - Robotics; Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
- Abstract
Motion prediction and planning are vital tasks in autonomous driving, and recent efforts have shifted to machine learning-based approaches. The challenges include understanding diverse road topologies, reasoning about traffic dynamics over a long time horizon, interpreting heterogeneous behaviors, and generating policies in a large continuous state space. Inspired by the success of large language models in addressing similar complexities through model scaling, we introduce a scalable trajectory model called State Transformer (STR). STR reformulates the motion prediction and motion planning problems by arranging observations, states, and actions into one unified sequence modeling task. Our approach unites trajectory generation problems with other sequence modeling problems, powering rapid iterations with breakthroughs in neighboring domains such as language modeling. Remarkably, experimental results reveal that large trajectory models (LTMs), such as STR, adhere to the scaling laws by presenting outstanding adaptability and learning efficiency. Qualitative results further demonstrate that LTMs are capable of making plausible predictions in scenarios that diverge significantly from the training data distribution. LTMs also learn to make complex reasonings for long-term planning, without explicit loss designs or costly high-level annotations.
- Published
- 2023
31. LiDAR-based 4D Occupancy Completion and Forecasting
- Author
Liu, Xinhao, Gong, Moonjun, Fang, Qi, Xie, Haoyu, Li, Yiming, Zhao, Hang, and Feng, Chen
- Subjects
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Robotics
- Abstract
Scene completion and forecasting are two popular perception problems in research for mobile agents like autonomous vehicles. Existing approaches treat the two problems in isolation, resulting in a separate perception of the two aspects. In this paper, we introduce a novel LiDAR perception task of Occupancy Completion and Forecasting (OCF) in the context of autonomous driving to unify these aspects into a cohesive framework. This task requires new algorithms to address three challenges altogether: (1) sparse-to-dense reconstruction, (2) partial-to-complete hallucination, and (3) 3D-to-4D prediction. To enable supervision and evaluation, we curate a large-scale dataset termed OCFBench from public autonomous driving datasets. We analyze the performance of closely related existing baseline models and our own on this dataset. We envision that this research will inspire and call for further investigation in this evolving and crucial area of 4D perception. Our code for data curation and baseline implementation is available at https://github.com/ai4ce/Occ4cast. Comment: IROS 2024
- Published
- 2023
32. What Makes for Robust Multi-Modal Models in the Face of Missing Modalities?
- Author
Li, Siting, Du, Chenzhuang, Zhao, Yue, Huang, Yu, and Zhao, Hang
- Subjects
Computer Science - Artificial Intelligence
- Abstract
With the growing success of multi-modal learning, research on the robustness of multi-modal models, especially when facing situations with missing modalities, is receiving increased attention. Nevertheless, previous studies in this domain exhibit certain limitations, as they often lack theoretical insights or their methodologies are tied to specific network architectures or modalities. We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective and illustrate that the performance ceiling in such scenarios can be approached by efficiently utilizing the information inherent in non-missing modalities. In practice, there are two key aspects: (1) the encoder should be able to extract sufficiently good features from the non-missing modality; (2) the extracted features should be robust enough not to be influenced by noise during the fusion process across modalities. To this end, we introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA). UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities. Apart from that, UME-MMA, built on a late-fusion learning framework, allows for the plug-and-play use of various encoders, making it suitable for a wide range of modalities and enabling seamless integration of large-scale pre-trained encoders to further enhance performance. We demonstrate UME-MMA's effectiveness on audio-visual datasets (e.g., AV-MNIST, Kinetics-Sound, AVE) and vision-language datasets (e.g., MM-IMDB, UPMC Food101).
- Published
- 2023
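The missing-modality data augmentation mentioned in the UME-MMA abstract above can be sketched as randomly zeroing one modality per sample during training; the drop probability below is a hypothetical choice, not a value from the paper.

```python
import torch

def drop_modality(audio, video, p=0.3):
    """With probability p, zero out one of the two modalities so the
    fusion head learns to rely on whichever modality remains."""
    if torch.rand(()) < p:
        if torch.rand(()) < 0.5:
            audio = torch.zeros_like(audio)
        else:
            video = torch.zeros_like(video)
    return audio, video
```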
33. Imitator Learning: Achieve Out-of-the-Box Imitation Ability in Variable Environments
- Author
Chen, Xiong-Hui, Ye, Junyin, Zhao, Hang, Li, Yi-Chen, Shi, Haoran, Xu, Yu-Yan, Ye, Zhihao, Yang, Si-Hang, Huang, Anqi, Xu, Kai, Zhang, Zongzhang, and Yu, Yang
- Subjects
Computer Science - Machine Learning
- Abstract
Imitation learning (IL) enables agents to mimic expert behaviors. Most previous IL techniques focus on precisely imitating one policy through mass demonstrations. However, in many applications, what humans require is the ability to perform various tasks directly through a few demonstrations of corresponding tasks, where the agent would meet many unexpected changes when deployed. In this scenario, the agent is expected to not only imitate the demonstration but also adapt to unforeseen environmental changes. This motivates us to propose a new topic called imitator learning (ItorL), which aims to derive an imitator module that can on-the-fly reconstruct the imitation policies based on very limited expert demonstrations for different unseen tasks, without any extra adjustment. In this work, we focus on imitator learning based on only one expert demonstration. To solve ItorL, we propose Demo-Attention Actor-Critic (DAAC), which integrates IL into a reinforcement-learning paradigm that can regularize policies' behaviors in unexpected situations. Besides, for autonomous imitation policy building, we design a demonstration-based attention architecture for the imitator policy that can effectively output imitated actions by adaptively tracing the suitable states in demonstrations. We develop a new navigation benchmark and a robot environment for ItorL and show that DAAC outperforms previous imitation methods with large margins on both seen and unseen tasks.
- Published
- 2023
34. Improving Discriminative Multi-Modal Learning with Large-Scale Pre-Trained Models
- Author
Du, Chenzhuang, Zhao, Yue, Liao, Chonghua, You, Jiacheng, Fu, Jie, and Zhao, Hang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
This paper investigates how to better leverage large-scale pre-trained uni-modal models to further enhance discriminative multi-modal learning. Even when fine-tuned with only uni-modal data, these models can outperform previous multi-modal models in certain tasks. It's clear that their incorporation into multi-modal learning would significantly improve performance. However, multi-modal learning with these models still suffers from insufficient learning of uni-modal features, which weakens the resulting multi-modal model's generalization ability. While fine-tuning uni-modal models separately and then aggregating their predictions is straightforward, it doesn't allow for adequate adaptation between modalities, also leading to sub-optimal results. To this end, we introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA). By freezing the weights of uni-modal fine-tuned models, adding extra trainable rank decomposition matrices to them, and subsequently performing multi-modal joint training, our method enhances adaptation between modalities and boosts overall performance. We demonstrate the effectiveness of MMLoRA on three dataset categories: audio-visual (e.g., AVE, Kinetics-Sound, CREMA-D), vision-language (e.g., MM-IMDB, UPMC Food101), and RGB-Optical Flow (UCF101).
- Published
- 2023
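The core mechanism of MMLoRA described above, freezing uni-modal fine-tuned weights and training only rank-decomposition matrices during multi-modal joint training, can be sketched with a minimal LoRA layer; the rank and scaling values below are illustrative, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # freeze uni-modal fine-tuned weights
        self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
        self.B = nn.Parameter(torch.zeros(r, base.out_features))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus trainable rank-r correction.
        return self.base(x) + (x @ self.A @ self.B) * self.scale
```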
35. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
- Author
Luo, Simian, Tan, Yiqin, Huang, Longbo, Li, Jian, and Zhao, Hang
- Subjects
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning
- Abstract
Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768x768 2-4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: https://latent-consistency-models.github.io/
- Published
- 2023
36. GPT-Driver: Learning to Drive with GPT
- Author
Mao, Jiageng, Qian, Yuxi, Ye, Junjie, Zhao, Hang, and Wang, Yue
- Subjects
Computer Science - Computer Vision and Pattern Recognition; Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Robotics
- Abstract
We present a simple yet effective approach that can transform the OpenAI GPT-3.5 model into a reliable motion planner for autonomous vehicles. Motion planning is a core challenge in autonomous driving, aiming to plan a driving trajectory that is safe and comfortable. Existing motion planners predominantly leverage heuristic methods to forecast driving trajectories, yet these approaches demonstrate insufficient generalization capabilities in the face of novel and unseen driving scenarios. In this paper, we propose a novel approach to motion planning that capitalizes on the strong reasoning capabilities and generalization potential inherent to Large Language Models (LLMs). The fundamental insight of our approach is the reformulation of motion planning as a language modeling problem, a perspective not previously explored. Specifically, we represent the planner inputs and outputs as language tokens, and leverage the LLM to generate driving trajectories through a language description of coordinate positions. Furthermore, we propose a novel prompting-reasoning-finetuning strategy to stimulate the numerical reasoning potential of the LLM. With this strategy, the LLM can describe highly precise trajectory coordinates and also its internal decision-making process in natural language. We evaluate our approach on the large-scale nuScenes dataset, and extensive experiments substantiate the effectiveness, generalization ability, and interpretability of our GPT-based motion planner. Code is now available at https://github.com/PointsCoder/GPT-Driver. Comment: NeurIPS 2023 Foundation Models for Decision Making Workshop
- Published
- 2023
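GPT-Driver's key move, representing planner inputs and outputs as language tokens, can be illustrated with a toy serializer. The field names and output format below are hypothetical, for illustration only, and are not the paper's actual prompt.

```python
def format_planning_prompt(ego_history, detections):
    """Serialize planner inputs as text for an LLM-based motion planner."""
    lines = ["You are a motion planner. Output a trajectory as (x, y) waypoints."]
    lines.append("Ego past positions: " +
                 ", ".join(f"({x:.2f}, {y:.2f})" for x, y in ego_history))
    for obj in detections:
        lines.append(f"Object {obj['id']} at ({obj['x']:.2f}, {obj['y']:.2f}), "
                     f"velocity ({obj['vx']:.2f}, {obj['vy']:.2f})")
    return "\n".join(lines)

# Usage with toy values:
prompt = format_planning_prompt([(0.0, 0.0), (1.2, 0.1)],
                                [{"id": 1, "x": 5.0, "y": 2.0, "vx": 0.5, "vy": 0.0}])
```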
37. Research on SAR image quality evaluation method based on improved Harris hawk optimization algorithm and XGBoost
- Author
Huang, Min, Zhao, Hang, and Chen, Yazhou
- Published
- 2024
38. Formaldehyde initiates memory and motor impairments under weightlessness condition
- Author
Mei, Tianhao, Chen, Ying, Gao, Yajuan, Zhao, Hang, Lyu, Xingzhou, Lin, Jing, Niu, Tianye, Han, Hongbin, and Tong, Zhiqian
- Published
- 2024
39. Psychometric evaluation of the Chinese version of the Chronic Heart Failure Health-related Quality of Life Questionnaire (CHFQOLQ-20)
- Author
Zhao, Ying, Zhao, Hang, Deng, Xiaoxue, Wang, Yanan, Luan, Xin, and Yu, Hongyu
- Published
- 2024
40. Association between the non-high-density lipoprotein cholesterol to high-density lipoprotein cholesterol ratio (NHHR) and cardiovascular outcomes in patients undergoing percutaneous coronary intervention: a retrospective study
- Author
Liu, Jiuling, Oorloff, Melysze Deanne, Nadella, Adithya, Guo, Ping, Ye, Min, Wang, Xiaoqing, and Zhao, Hang
- Published
- 2024
41. Cardiovascular risk and its influencing factors during exercise in apparently healthy Chinese adult population
- Author
Zeng, Zhipeng, Zhao, Hang, Wang, Juan, Pi, Peng, Hao, Li, Wang, Yan, and Wang, Zhengzhen
- Published
- 2024
42. ERRFI1 exacerbates hepatic ischemia reperfusion injury by promoting hepatocyte apoptosis and ferroptosis in a GRB2-dependent manner
- Author
Zhao, Hang and Mao, Huizi
- Published
- 2024
43. Developing a machine learning model for accurate nucleoside hydrogels prediction based on descriptors
- Author
Li, Weiqi, Wen, Yinghui, Wang, Kaichao, Ding, Zihan, Wang, Lingfeng, Chen, Qianming, Xie, Liang, Xu, Hao, and Zhao, Hang
- Published
- 2024
44. The transcultural adaptation and validation of the Chinese version of the Oral Health Literacy Scale for Diabetic Patients
- Author
Zhao, Ying, Zhao, Hang, and Yu, Hongyu
- Published
- 2024
45. Uncertainty-Aware Decision Transformer for Stochastic Driving Environments
- Author
Li, Zenan, Nie, Fan, Sun, Qiao, Da, Fang, and Zhao, Hang
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence
- Abstract
Offline Reinforcement Learning (RL) enables policy learning without active interactions, making it especially appealing for self-driving tasks. Recent successes of Transformers inspire casting offline RL as sequence modeling, which, however, fails in stochastic environments because of the incorrect assumption that identical actions can consistently achieve the same goal. In this paper, we introduce an UNcertainty-awaRE deciSion Transformer (UNREST) for planning in stochastic driving environments without introducing additional transition or complex generative models. Specifically, UNREST estimates uncertainties by conditional mutual information between transitions and returns. Discovering 'uncertainty accumulation' and 'temporal locality' properties of driving environments, we replace the global returns in decision transformers with truncated returns less affected by environments to learn from actual outcomes of actions rather than environment transitions. We also dynamically evaluate uncertainty at inference for cautious planning. Extensive experiments demonstrate UNREST's superior performance in various driving scenarios and the power of our uncertainty estimation strategy.
- Published
- 2023
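The truncated-returns idea in the UNREST abstract above, replacing global returns with short-horizon returns less affected by stochastic transitions, can be sketched directly; the discount factor and horizon below are illustrative, not the paper's values.

```python
def truncated_returns(rewards, gamma=0.99, horizon=30):
    """Discounted return-to-go over a short window instead of the full episode."""
    out = []
    for t in range(len(rewards)):
        window = rewards[t:t + horizon]
        out.append(sum(r * gamma ** i for i, r in enumerate(window)))
    return out
```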
46. AutoEncoding Tree for City Generation and Applications
- Author
Han, Wenyu, Wen, Congcong, Chok, Lazarus, Tan, Yan Liang, Chan, Sheung Lung, Zhao, Hang, and Feng, Chen
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
City modeling and generation have attracted an increased interest in various applications, including gaming, urban planning, and autonomous driving. Unlike previous works focused on the generation of single objects or indoor scenes, the huge volumes of spatial data in cities pose a challenge to the generative models. Furthermore, the scarcity of publicly available 3D real-world city datasets also hinders the development of methods for city generation. In this paper, we first collect over 3,000,000 geo-referenced objects for the cities of New York, Zurich, Tokyo, Berlin, Boston, and several other large cities. Based on this dataset, we propose AETree, a tree-structured auto-encoder neural network, for city generation. Specifically, we first propose a novel Spatial-Geometric Distance (SGD) metric to measure the similarity between building layouts and then construct a binary tree over the raw geometric data of buildings based on the SGD metric. Next, we present a tree-structured network whose encoder learns to extract and merge spatial information from bottom-up iteratively. The resulting global representation is reversely decoded for reconstruction or generation. To address the issue of long-dependency as the level of the tree increases, a Long Short-Term Memory (LSTM) Cell is employed as a basic network element of the proposed AETree. Moreover, we introduce a novel metric, Overlapping Area Ratio (OAR), to quantitatively evaluate the generation results. Experiments on the collected dataset demonstrate the effectiveness of the proposed model on 2D and 3D city generation. Furthermore, the latent features learned by AETree can serve downstream urban planning applications.
- Published
- 2023
47. Boosting Offline Reinforcement Learning for Autonomous Driving with Hierarchical Latent Skills
- Author
Li, Zenan, Nie, Fan, Sun, Qiao, Da, Fang, and Zhao, Hang
- Subjects
Computer Science - Robotics; Computer Science - Artificial Intelligence
- Abstract
Learning-based vehicle planning is receiving increasing attention with the emergence of diverse driving simulators and large-scale driving datasets. While offline reinforcement learning (RL) is well suited for these safety-critical tasks, it still struggles to plan over extended periods. In this work, we present a skill-based framework that enhances offline RL to overcome the long-horizon vehicle planning challenge. Specifically, we design a variational autoencoder (VAE) to learn skills from offline demonstrations. To mitigate posterior collapse of common VAEs, we introduce a two-branch sequence encoder to capture both discrete options and continuous variations of the complex driving skills. The final policy treats learned skills as actions and can be trained by any off-the-shelf offline RL algorithms. This facilitates a shift in focus from per-step actions to temporally extended skills, thereby enabling long-term reasoning into the future. Extensive results on CARLA prove that our model consistently outperforms strong baselines at both training and new scenarios. Additional visualizations and experiments demonstrate the interpretability and transferability of extracted skills.
- Published
- 2023
48. Robot Parkour Learning
- Author
Zhuang, Ziwen, Fu, Zipeng, Wang, Jianren, Atkeson, Christopher, Schwertfeger, Soeren, Finn, Chelsea, and Zhao, Hang
- Subjects
Computer Science - Robotics; Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Machine Learning
- Abstract
Parkour is a grand challenge for legged locomotion that requires robots to overcome various obstacles rapidly in complex environments. Existing methods can generate either diverse but blind locomotion skills or vision-based but specialized skills by using reference animal data or complex rewards. However, autonomous parkour requires robots to learn generalizable skills that are both vision-based and diverse to perceive and react to various scenarios. In this work, we propose a system for learning a single end-to-end vision-based parkour policy of diverse parkour skills using a simple reward without any reference motion data. We develop a reinforcement learning method inspired by direct collocation to generate parkour skills, including climbing over high obstacles, leaping over large gaps, crawling beneath low barriers, squeezing through thin slits, and running. We distill these skills into a single vision-based parkour policy and transfer it to a quadrupedal robot using its egocentric depth camera. We demonstrate that our system can empower two different low-cost robots to autonomously select and execute appropriate parkour skills to traverse challenging real-world environments. Comment: CoRL 2023 (Oral). Project website at https://robot-parkour.github.io
- Published
- 2023
49. StreamMapNet: Streaming Mapping Network for Vectorized Online HD Map Construction
- Author
Yuan, Tianyuan, Liu, Yicheng, Wang, Yue, Wang, Yilun, and Zhao, Hang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
High-Definition (HD) maps are essential for the safety of autonomous driving systems. While existing techniques employ camera images and onboard sensors to generate vectorized high-precision maps, they are constrained by their reliance on single-frame input. This approach limits their stability and performance in complex scenarios such as occlusions, largely due to the absence of temporal information. Moreover, their performance diminishes when applied to broader perception ranges. In this paper, we present StreamMapNet, a novel online mapping pipeline adept at long-sequence temporal modeling of videos. StreamMapNet employs multi-point attention and temporal information, which empowers the construction of large-range local HD maps with high stability and further addresses the limitations of existing methods. Furthermore, we critically examine the widely used online HD map construction benchmarks and datasets, Argoverse2 and nuScenes, revealing significant bias in the existing evaluation protocols. We propose to resplit the benchmarks according to geographical spans, promoting fair and precise evaluations. Experimental results validate that StreamMapNet significantly outperforms existing methods across all settings while maintaining an online inference speed of 14.2 FPS. Our code is available at https://github.com/yuantianyuan01/StreamMapNet.
- Published
- 2023
50. Radio2Text: Streaming Speech Recognition Using mmWave Radio Signals
- Author
Zhao, Running, Yu, Jiangtao, Zhao, Hang, and Ngai, Edith C. H.
- Subjects
Computer Science - Sound; Computer Science - Computation and Language; Computer Science - Human-Computer Interaction; Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
Millimeter wave (mmWave) based speech recognition provides more possibilities for audio-related applications, such as conference speech transcription and eavesdropping. However, considering the practicality in real scenarios, latency and recognizable vocabulary size are two critical factors that cannot be overlooked. In this paper, we propose Radio2Text, the first mmWave-based system for streaming automatic speech recognition (ASR) with a vocabulary size exceeding 13,000 words. Radio2Text is based on a tailored streaming Transformer that is capable of effectively learning representations of speech-related features, paving the way for streaming ASR with a large vocabulary. To alleviate the deficiency of streaming networks unable to access entire future inputs, we propose Guidance Initialization, which facilitates the transfer of feature knowledge related to the global context from the non-streaming Transformer to the tailored streaming Transformer through weight inheritance. Further, we propose a cross-modal structure based on knowledge distillation (KD), named cross-modal KD, to mitigate the negative effect of low-quality mmWave signals on recognition performance. In the cross-modal KD, the audio streaming Transformer provides feature and response guidance that inherits fruitful and accurate speech information to supervise the training of the tailored radio streaming Transformer. The experimental results show that Radio2Text can achieve a character error rate of 5.7% and a word error rate of 9.4% for the recognition of a vocabulary consisting of over 13,000 words. Comment: Accepted by Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (ACM IMWUT/UbiComp 2023)
- Published
- 2023
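The response-guidance part of Radio2Text's cross-modal knowledge distillation can be sketched as a standard softened-logit KD loss, with the audio streaming Transformer as teacher and the radio streaming Transformer as student; the temperature below is an illustrative hyperparameter, not the paper's setting.

```python
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence between softened student and teacher distributions."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```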