15,322 results for "Zhao, Bo"
Search Results
2. How to Design a State Education Aid Formula Using a Regression-based Estimate of the Cost-capacity Gap: The Case of Connecticut, USA
- Author
-
Zhao, Bo
- Published
- 2023
3. Molecular mechanism of extreme hypoxia tolerance difference between male and female adult fish and juvenile fish of Acrossocheilus fasciatus by transcriptomics
- Author
-
He, Jinghong, Wang, Handong, Guo, Yongyao, Chu, Zhangjie, and Zhao, Bo
- Published
- 2022
4. A Comprehensive Description and Evolutionary Analysis of Testudines Mitochondrial Genomes
- Author
-
Wang, Handong, Chen, Ye, Shi, Wei, Guo, Yongyao, He, Jinghong, Chu, Zhangjie, and Zhao, Bo
- Published
- 2021
5. Touchstone Benchmark: Are We on the Right Way for Evaluating AI Algorithms for Medical Segmentation?
- Author
-
Bassi, Pedro R. A. S., Li, Wenxuan, Tang, Yucheng, Isensee, Fabian, Wang, Zifu, Chen, Jieneng, Chou, Yu-Cheng, Kirchhoff, Yannick, Rokuss, Maximilian, Huang, Ziyan, Ye, Jin, He, Junjun, Wald, Tassilo, Ulrich, Constantin, Baumgartner, Michael, Roy, Saikat, Maier-Hein, Klaus H., Jaeger, Paul, Ye, Yiwen, Xie, Yutong, Zhang, Jianpeng, Chen, Ziyang, Xia, Yong, Xing, Zhaohu, Zhu, Lei, Sadegheih, Yousef, Bozorgpour, Afshin, Kumari, Pratibha, Azad, Reza, Merhof, Dorit, Shi, Pengcheng, Ma, Ting, Du, Yuxin, Bai, Fan, Huang, Tiejun, Zhao, Bo, Wang, Haonan, Li, Xiaomeng, Gu, Hanxue, Dong, Haoyu, Yang, Jichen, Mazurowski, Maciej A., Gupta, Saumya, Wu, Linshan, Zhuang, Jiaxin, Chen, Hao, Roth, Holger, Xu, Daguang, Blaschko, Matthew B., Decherchi, Sergio, Cavalli, Andrea, Yuille, Alan L., and Zhou, Zongwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
How can we test AI performance? This question seems trivial, but it isn't. Standard benchmarks often have problems such as in-distribution and small-size test sets, oversimplified metrics, unfair comparisons, and short-term outcome pressure. As a consequence, good performance on standard benchmarks does not guarantee success in real-world scenarios. To address these problems, we present Touchstone, a large-scale collaborative segmentation benchmark of 9 types of abdominal organs. This benchmark is based on 5,195 training CT scans from 76 hospitals around the world and 5,903 testing CT scans from 11 additional hospitals. This diverse test set enhances the statistical significance of benchmark results and rigorously evaluates AI algorithms across various out-of-distribution scenarios. We invited 14 inventors of 19 AI algorithms to train their algorithms, while our team, as a third party, independently evaluated these algorithms on three test sets. In addition, we also evaluated pre-existing AI frameworks--which, differing from algorithms, are more flexible and can support different algorithms--including MONAI from NVIDIA, nnU-Net from DKFZ, and numerous other open-source frameworks. We are committed to expanding this benchmark to encourage more innovation of AI algorithms for the medical domain., Comment: Accepted to NeurIPS-2024
- Published
- 2024
6. Emu3: Next-Token Prediction is All You Need
- Author
-
Wang, Xinlong, Zhang, Xiaosong, Luo, Zhengxiong, Sun, Quan, Cui, Yufeng, Wang, Jinsheng, Zhang, Fan, Wang, Yueze, Li, Zhen, Yu, Qiying, Zhao, Yingli, Ao, Yulong, Min, Xuebin, Li, Tao, Wu, Boya, Zhao, Bo, Zhang, Bowen, Wang, Liangdong, Liu, Guang, He, Zheqi, Yang, Xi, Liu, Jingjing, Lin, Yonghua, Huang, Tiejun, and Wang, Zhongyuan
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction., Comment: Project Page: https://emu.baai.ac.cn
- Published
- 2024
7. Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
- Author
-
Shu, Yan, Zhang, Peitian, Liu, Zheng, Qin, Minghao, Zhou, Junjie, Huang, Tiejun, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of visual tokens that exceed the maximum context length, and they suffer from information decay due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and propose Visual Context Latent Summarization, which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks. For example, Video-XL outperforms the current state-of-the-art method on VNBench by nearly 10% in accuracy. Moreover, Video-XL presents an impressive balance between efficiency and effectiveness, processing 2048 frames on a single 80GB GPU while achieving nearly 95% accuracy in the Needle-in-a-Haystack evaluation.
- Published
- 2024
8. Automated design of nonreciprocal thermal emitters via Bayesian optimization
- Author
-
Do, Bach, Ghalekohneh, Sina Jafari, Adebiyi, Taiwo, Zhao, Bo, and Zhang, Ruda
- Subjects
Condensed Matter - Materials Science ,Computer Science - Machine Learning ,Physics - Applied Physics - Abstract
Nonreciprocal thermal emitters that break Kirchhoff's law of thermal radiation promise exciting opportunities for thermal and energy applications. The design of the bandwidth and angular range of the nonreciprocal effect, which directly affects the performance of nonreciprocal emitters, typically relies on physical intuition. In this study, we present a general numerical approach to maximize the nonreciprocal effect. We choose doped magneto-optic materials and magnetic Weyl semimetal materials as model materials and focus on pattern-free multilayer structures. The optimization randomly starts from a less effective structure and incrementally improves the broadband nonreciprocity through the combination of Bayesian optimization and reparameterization. Optimization results show that the proposed approach can discover structures that achieve broadband nonreciprocal emission at wavelengths from 5 to 40 micrometers using only a few layers, significantly outperforming current state-of-the-art designs based on intuition in terms of both performance and simplicity.
- Published
- 2024
9. Enhancing Long Video Understanding via Hierarchical Event-Based Memory
- Author
-
Cheng, Dingxin, Li, Mingda, Liu, Jingyu, Guo, Yongxin, Jiang, Bin, Liu, Qingbin, Chen, Xi, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Recently, integrating visual foundation models into large language models (LLMs) to form video understanding systems has attracted widespread attention. Most of the existing models compress diverse semantic information within the whole video and feed it into LLMs for content comprehension. While this method excels in short video understanding, it may result in a blend of multiple event information in long videos due to coarse compression, which causes information redundancy. Consequently, the semantics of key events might be obscured within the vast information, which hinders the model's understanding capabilities. To address this issue, we propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos. Firstly, we design a novel adaptive sequence segmentation scheme to divide multiple events within long videos. In this way, we can perform individual memory modeling for each event to establish intra-event contextual connections, thereby reducing information redundancy. Secondly, while modeling the current event, we compress and inject the information of the previous event to enhance the long-term inter-event dependencies in videos. Finally, we perform extensive experiments on various video understanding tasks, and the results show that our model achieves state-of-the-art performance.
- Published
- 2024
10. TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
- Author
-
Gao, Mingze, Liu, Jingyu, Li, Mingda, Xie, Jiangtao, Liu, Qingbin, Zhao, Bo, Chen, Xi, and Xiong, Hui
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Multimodal Large Language Models (MLLMs) have significantly improved performance across various image-language applications. Recently, there has been a growing interest in adapting image pre-trained MLLMs for video-related tasks. However, most efforts concentrate on enhancing the vision encoder and projector components, while the core part, Large Language Models (LLMs), remains comparatively under-explored. In this paper, we propose two strategies to enhance the model's capability in video understanding tasks by improving inter-layer attention computation in LLMs. Specifically, the first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE, which introduces temporal position information to strengthen the MLLM's temporal modeling capabilities while preserving the relative position relationships of both visual and text tokens. The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask, a simple yet effective method that broadens visual token interactions within and across video frames while maintaining the causal inference mechanism. Based on these proposed methods, we adapt LLaVA for video understanding tasks, naming it Temporal-Considered LLaVA (TC-LLaVA). Our TC-LLaVA achieves new state-of-the-art performance across various video understanding benchmarks with only supervised fine-tuning (SFT) on video-related datasets.
- Published
- 2024
11. 52B to 1T: Lessons Learned via Tele-FLM Series
- Author
-
Li, Xiang, Yao, Yiqun, Jiang, Xin, Fang, Xuezhi, Wang, Chao, Liu, Xinzhang, Wang, Zihan, Zhao, Yu, Wang, Xin, Huang, Yuyao, Song, Shuangyong, Li, Yongxiang, Zhang, Zheng, Zhao, Bo, Sun, Aixin, Wang, Yequan, He, Zhongjiang, Wang, Zhongyuan, Li, Xuelong, and Huang, Tiejun
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Large Language Models (LLMs) represent a significant stride toward Artificial General Intelligence. As scaling laws underscore the potential of increasing model sizes, the academic community has intensified its investigations into LLMs with capacities exceeding 50 billion parameters. This technical report builds on our prior work with Tele-FLM (also known as FLM-2), a publicly available 52-billion-parameter model. We delve into two primary areas: we first discuss our observation of Supervised Fine-tuning (SFT) on Tele-FLM-52B, which supports the "less is more" approach for SFT data construction; second, we demonstrate our experiments and analyses on the best practices for progressively growing a model from 52 billion to 102 billion, and subsequently to 1 trillion parameters. We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research., Comment: For the Tele-FLM-52B tech report, see also 2404.16645
- Published
- 2024
12. PVUW 2024 Challenge on Complex Video Understanding: Methods and Results
- Author
-
Ding, Henghui, Liu, Chang, Wei, Yunchao, Ravi, Nikhila, He, Shuting, Bai, Song, Torr, Philip, Miao, Deshui, Li, Xin, He, Zhenyu, Wang, Yaowei, Yang, Ming-Hsuan, Xu, Zhensong, Yao, Jiangtao, Wu, Chengjing, Liu, Ting, Liu, Luoqi, Liu, Xinyu, Zhang, Jing, Zhang, Kexin, Yang, Yuting, Jiao, Licheng, Yang, Shuyuan, Gao, Mingqi, Luo, Jingnan, Yang, Jinyu, Han, Jungong, Zheng, Feng, Cao, Bin, Zhang, Yisi, Lin, Xuanxu, He, Xingjian, Zhao, Bo, Liu, Jing, Pan, Feiyu, Fang, Hao, and Lu, Xiankai
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The Pixel-level Video Understanding in the Wild Challenge (PVUW) focuses on complex video understanding. In this CVPR 2024 workshop, we add two new tracks: the Complex Video Object Segmentation track based on the MOSE dataset and the Motion Expression guided Video Segmentation track based on the MeViS dataset. In the two new tracks, we provide additional videos and annotations that feature challenging elements, such as the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, and crowded environments in MOSE. Moreover, we provide a new motion expression guided video segmentation dataset, MeViS, to study natural language-guided video understanding in complex environments. These new videos, sentences, and annotations enable us to foster the development of a more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios. The MOSE challenge had 140 registered teams in total; 65 teams participated in the validation phase and 12 teams made valid submissions in the final challenge phase. The MeViS challenge had 225 registered teams in total; 50 teams participated in the validation phase and 5 teams made valid submissions in the final challenge phase., Comment: MOSE Challenge: https://henghuiding.github.io/MOSE/ChallengeCVPR2024, MeViS Challenge: https://henghuiding.github.io/MeViS/ChallengeCVPR2024
- Published
- 2024
13. 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation
- Author
-
Cao, Bin, Zhang, Yisi, Lin, Xuanxu, He, Xingjian, Zhao, Bo, and Liu, Jing
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Motion Expression guided Video Segmentation is a challenging task that aims at segmenting objects in the video based on natural language expressions with motion descriptions. Unlike the previous referring video object segmentation (RVOS), this task focuses more on the motion in video content for language-guided video object segmentation, requiring an enhanced ability to model longer temporal, motion-oriented vision-language data. In this report, based on the RVOS methods, we successfully introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement. Finally, our method achieved a score of 49.92 J&F in the validation phase and 54.20 J&F in the test phase, securing 2nd place in the MeViS Track at the CVPR 2024 PVUW Challenge.
- Published
- 2024
14. SpatialBot: Precise Spatial Understanding with Vision Language Models
- Author
-
Cai, Wenxiao, Ponomarenko, Iaroslav, Yuan, Jianhao, Li, Xiaoqi, Yang, Wankou, Dong, Hao, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding; however, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks, and Embodied AI tasks demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code and data are available at https://github.com/BAAI-DCAI/SpatialBot.
- Published
- 2024
15. Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions
- Author
-
Liu, Yexin, Liang, Zhengyang, Wang, Yueze, He, Muyang, Li, Jian, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Multimodal Large Language Models (MLLMs) have exhibited impressive capabilities in visual understanding and reasoning, providing seemingly reasonable answers, such as image descriptions. This has spurred extensive research on the evaluation of MLLMs. Most evaluation benchmarks assume that incorrect answers indicate a lack of understanding of the visual content. However, our findings reveal that, in many cases, MLLMs answer questions incorrectly despite correctly understanding the visual content. This suggests that incorrect answers do not necessarily imply a lack of comprehension but may instead result from lacking robustness to leading questions. To comprehensively measure MLLMs' understanding capability and robustness to leading questions, we introduce a MultiModal Robustness benchmark (MMR). MMR contains paired positive and negative questions across 12 categories, meticulously annotated by humans. We evaluate 18 leading MLLMs on the MMR benchmark, revealing that MLLMs suffer from fragility to leading questions despite understanding the visual content. To enhance MLLMs' understanding capability and robustness, we further present a training set with paired positive and negative visual question-answer samples. Experiments verify that MLLMs' robustness can be significantly enhanced by tuning on this new training set. The benchmark, training set, and code can be found at https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark.
- Published
- 2024
16. Omni6DPose: A Benchmark and Model for Universal 6D Object Pose Estimation and Tracking
- Author
-
Zhang, Jiyao, Huang, Weiyao, Peng, Bo, Wu, Mingdong, Hu, Fei, Chen, Zijian, Zhao, Bo, and Dong, Hao
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
6D Object Pose Estimation is a crucial yet challenging task in computer vision, suffering from a significant lack of large-scale datasets. This scarcity impedes comprehensive evaluation of model performance, limiting research advancements. Furthermore, the restricted number of available instances or categories curtails its applications. To address these issues, this paper introduces Omni6DPose, a substantial dataset characterized by its diversity in object categories, large scale, and variety in object materials. Omni6DPose is divided into three main components: ROPE (Real 6D Object Pose Estimation Dataset), which includes 332K images annotated with over 1.5M annotations across 581 instances in 149 categories; SOPE (Simulated 6D Object Pose Estimation Dataset), consisting of 475K images created in a mixed reality setting with depth simulation, annotated with over 5M annotations across 4162 instances in the same 149 categories; and the manually aligned real scanned objects used in both ROPE and SOPE. Omni6DPose is inherently challenging due to the substantial variations and ambiguities. To address this challenge, we introduce GenPose++, an enhanced version of the SOTA category-level pose estimation framework, incorporating two pivotal improvements: Semantic-aware feature extraction and Clustering-based aggregation. Moreover, we provide a comprehensive benchmarking analysis to evaluate the performance of previous methods on this large-scale dataset in the realms of 6D object pose estimation and pose tracking.
- Published
- 2024
17. VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
- Author
-
Zhou, Junjie, Liu, Zheng, Xiao, Shitao, Zhao, Bo, and Xiong, Yongping
- Subjects
Computer Science - Information Retrieval ,Computer Science - Computation and Language ,Computer Science - Computer Vision and Pattern Recognition - Abstract
Multi-modal retrieval becomes increasingly popular in practice. However, the existing retrievers are mostly text-oriented, which lack the capability to process visual information. Despite the presence of vision-language models like CLIP, the current methods are severely limited in representing the text-only and image-only data. In this work, we present a new embedding model VISTA for universal multi-modal retrieval. Our work brings forth threefold technical contributions. Firstly, we introduce a flexible architecture which extends a powerful text encoder with the image understanding capability by introducing visual token embeddings. Secondly, we develop two data generation strategies, which bring high-quality composed image-text to facilitate the training of the embedding model. Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performances across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding., Comment: Accepted to ACL 2024 main conference
- Published
- 2024
18. MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding
- Author
-
Zhou, Junjie, Shu, Yan, Zhao, Bo, Wu, Boya, Xiao, Shitao, Yang, Xi, Xiong, Yongping, Zhang, Bo, Huang, Tiejun, and Liu, Zheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language - Abstract
The evaluation of Long Video Understanding (LVU) performance poses an important but challenging research problem. Despite previous efforts, the existing video understanding benchmarks are severely constrained by several issues, especially the insufficient lengths of videos, a lack of diversity in video types and evaluation tasks, and the inappropriateness for evaluating LVU performances. To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU. MLVU presents the following critical values: 1) The substantial and flexible extension of video lengths, which enables the benchmark to evaluate LVU performance across a wide range of durations. 2) The inclusion of various video genres, e.g., movies, surveillance footage, egocentric videos, cartoons, game videos, etc., which reflects the models' LVU performances in different scenarios. 3) The development of diversified evaluation tasks, which enables a comprehensive examination of MLLMs' key abilities in long-video understanding. The empirical study with 20 latest MLLMs reveals significant room for improvement in today's techniques, as all existing methods struggle with most of the evaluation tasks and exhibit severe performance degradation when handling longer videos. Additionally, it suggests that factors such as context length, image-understanding quality, and the choice of LLM backbone can play critical roles in future advancements. We anticipate that MLVU will advance the research of long video understanding by providing a comprehensive and in-depth analysis of MLLMs.
- Published
- 2024
19. Filamentary Hierarchies and Superbubbles: Galactic Multiscale MHD Simulations of GMC to Star Cluster Formation
- Author
-
Zhao, Bo, Pudritz, Ralph E., Pillsworth, Rachel, Robinson, Hector, and Wadsley, James
- Subjects
Astrophysics - Astrophysics of Galaxies - Abstract
There is now abundant observational evidence that star formation is a highly dynamical process that connects filament hierarchies and supernova feedback, from galaxy-scale kpc filaments and superbubbles, to giant molecular clouds (GMCs) on 100 pc scales and star clusters (1 pc). Here we present galactic multi-scale MHD simulations that track the formation of structure from galactic down to sub-pc scales in a magnetized, Milky Way-like galaxy undergoing supernova-driven feedback processes. We do this by adopting a novel zoom-in technique that follows the evolution of typical 3-kpc sub-regions without cutting out the surrounding galactic environment, allowing us to reach 0.28 pc resolution in the individual zoom-in regions. We find a wide range of morphologies and hierarchical structure, including superbubbles, turbulence, and kpc atomic gas filaments hosting multiple GMC condensations that are often associated with superbubble compression, down to smaller-scale filamentary GMCs and star cluster regions within them. Gas accretion and compression ultimately drive filaments over a critical, scale-dependent line mass, leading to gravitational instabilities that produce GMCs and clusters. In quieter regions, galactic shear can produce filamentary GMCs within flattened, rotating disk-like structures on 100 pc scales. Strikingly, our simulations demonstrate the formation of helical magnetic fields associated with the formation of these disk-like structures., Comment: 29 pages main text, 5 page Appendix, 29 figures, revised version submitted to ApJ
- Published
- 2024
20. The SkatingVerse Workshop & Challenge: Methods and Results
- Author
-
Zhao, Jian, Jin, Lei, Li, Jianshu, Zhu, Zheng, Teng, Yinglei, Zhao, Jiaojiao, Gulshad, Sadaf, Wang, Zheng, Zhao, Bo, Shu, Xiangbo, Wei, Yunchao, Nie, Xuecheng, Jin, Xiaojie, Liang, Xiaodan, Satoh, Shin'ichi, Guo, Yandong, Lu, Cewu, Xing, Junliang, and Shengmei, Jane Shen
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
The SkatingVerse Workshop & Challenge aims to encourage research in developing novel and accurate methods for human action understanding. The SkatingVerse dataset used for the SkatingVerse Challenge has been publicly released. There are two subsets in the dataset, i.e., the training subset and the testing subset. The training subset consists of 19,993 RGB video sequences, and the testing subset consists of 8,586 RGB video sequences. Around 10 participating teams from around the globe competed in the SkatingVerse Challenge. In this paper, we provide a brief summary of the SkatingVerse Workshop & Challenge, including brief introductions to the top three methods. The submission leaderboard will be reopened for researchers who are interested in the human action understanding challenge. The benchmark dataset and other information can be found at: https://skatingverse.github.io/.
- Published
- 2024
21. VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
- Author
-
Guo, Yongxin, Liu, Jingyu, Li, Mingda, Tang, Xiaoying, Chen, Xi, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query, playing a vital role in downstream tasks such as video browsing and editing. While Video Large Language Models (video LLMs) have made significant progress in understanding video content, they often face challenges in accurately pinpointing timestamps within videos, which limits their performance on VTG tasks. Therefore, to improve video LLMs' ability to effectively locate timestamps, we argue that two critical aspects need to be enhanced. First, it is essential to have high-quality instructional tuning datasets that encompass mainstream VTG tasks. Second, directly incorporating timestamp knowledge into video LLMs is crucial, as it enables models to efficiently comprehend timestamp information. To address these needs, we first introduce VTG-IT-120K, a high-quality and comprehensive instruction tuning dataset that covers VTG tasks such as moment retrieval, dense video captioning, video summarization, and video highlight detection. Furthermore, we propose a specially designed video LLM model for VTG tasks, VTG-LLM, which (1) effectively integrates timestamp knowledge into visual tokens; (2) incorporates absolute-time tokens that specifically handle timestamp knowledge, thereby avoiding concept shifts; and (3) introduces a lightweight, high-performance slot-based token compression method to facilitate the sampling of more video frames. Comprehensive experiments showcase the superior performance of VTG-LLM in comparison to other video LLM methods across various VTG tasks. Our code and datasets are available at \url{https://github.com/gyxxyg/VTG-LLM}.
- Published
- 2024
22. Efficient Multimodal Large Language Models: A Survey
- Author
-
Jin, Yizhang, Li, Jian, Liu, Yexin, Gu, Tianjun, Wu, Kai, Jiang, Zhengkai, He, Muyang, Zhao, Bo, Tan, Xin, Gan, Zhenye, Wang, Yabiao, Wang, Chengjie, and Ma, Lizhuang
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, the research state of efficient structures and strategies, and their applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.
- Published
- 2024
23. Eliminating nearfield coupling in dense high quality factor phase gradient metasurfaces
- Author
-
Ameyaw, Samuel, Lin, Lin, Zhao, Bo, Delgado, Hamish Carr, and Lawrence, Mark
- Subjects
Physics - Optics - Abstract
High Q phase gradient metasurfaces are becoming promising elements for revolutionizing light manipulation, but near-field coupling typically forces a trade-off between quality factor and resolution. Here, we show a strategy for not just reducing but eliminating coupling-based nonlocal effects in wave shaping metasurfaces composed of meta-pixels with arbitrarily high Q arranged with sub-diffraction spatial resolution. By working at a zero-coupling regime introduced by the interference between enhanced longitudinal and transverse electric fields, the tradeoff between Q and resolution no longer exists. Exploiting for wave shaping the ability to fully suppress coupling between high Q meta-atoms, we numerically show structurally uniform devices that produce beam-splitting to angles of $\pm53^\circ$ and beam-steering to an angle of $33^\circ$ with diffraction efficiencies over 90% via refractive index bias of just $2\times10^{-6}$ and $7\times10^{-6}$, respectively. These are made possible by the meta-structure supporting local dipole resonances with Qs of 2.8 million and 0.87 million, respectively, arranged with a dense pixel pitch of ${\lambda}/1.6$. Extending the approach to structures with ${\lambda}/2.2$ resolution, we also unlock full-field beam steering via index biasing of just $1\times10^{-4}$. The signature of a zero-coupling regime is discovered in the form of a sign flip in the angular dispersion with resonant wavelength in experiment, which validates our scheme. Aside from triangulating a perfect decoupling configuration, one of our fabricated nanofin-isolated metasurfaces with Q-factor >870 has a resonant wavelength that stays within the half linewidth for incident angles of $-20^\circ$ to $20^\circ$. Our platform provides a route for densely arrayed high Q metasurfaces with independently addressable constituent meta-atoms, paving the way for highly efficient nonlinear and dynamic wavefront shaping., Comment: 25 pages, 11 figures
- Published
- 2024
24. Large Language Model-aided Edge Learning in Distribution System State Estimation
- Author
-
Xie, Renyou, Yin, Xin, Li, Chaojie, Chen, Guo, Liu, Nian, Zhao, Bo, and Dong, Zhaoyang
- Subjects
Electrical Engineering and Systems Science - Systems and Control - Abstract
Distribution system state estimation (DSSE) plays a crucial role in the real-time monitoring, control, and operation of distribution networks. Besides intensive computational requirements, conventional DSSE methods need high-quality measurements to obtain accurate states, whereas missing values often occur due to sensor failures or communication delays. To address these challenging issues, a forecast-then-estimate framework of edge learning is proposed for DSSE, leveraging large language models (LLMs) to forecast missing measurements and provide pseudo-measurements. Firstly, natural language-based prompts and measurement sequences are integrated by the proposed LLM to learn patterns from historical data and provide accurate forecasting results. Secondly, a convolutional layer-based neural network model is introduced to improve the robustness of state estimation under missing measurement. Thirdly, to alleviate the overfitting of the deep learning-based DSSE, it is reformulated as a multi-task learning framework containing shared and task-specific layers. The uncertainty weighting algorithm is applied to find the optimal weights to balance different tasks. The numerical simulation on the Simbench case is used to demonstrate the effectiveness of the proposed forecast-then-estimate framework.
- Published
- 2024
25. Understanding the Difficulty of Solving Cauchy Problems with PINNs
- Author
-
Wang, Tao, Zhao, Bo, Gao, Sicun, and Yu, Rose
- Subjects
Computer Science - Machine Learning ,Mathematics - Numerical Analysis - Abstract
Physics-Informed Neural Networks (PINNs) have gained popularity in scientific computing in recent years. However, they often fail to achieve the same level of accuracy as classical methods in solving differential equations. In this paper, we identify two sources of this issue in the case of Cauchy problems: the use of $L^2$ residuals as objective functions and the approximation gap of neural networks. We show that minimizing the sum of $L^2$ residual and initial condition error is not sufficient to guarantee the true solution, as this loss function does not capture the underlying dynamics. Additionally, neural networks are not capable of capturing singularities in the solutions due to the non-compactness of their image sets. This, in turn, influences the existence of global minima and the regularity of the network. We demonstrate that when the global minimum does not exist, machine precision becomes the predominant source of achievable error in practice. We also present numerical experiments in support of our theoretical claims., Comment: 13 pages and 18 figures
- Published
- 2024
26. FlexiFilm: Long Video Generation with Flexible Conditions
- Author
-
Ouyang, Yichen, Yuan, Jianhao, Zhao, Hao, Wang, Gaoang, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Generating long and consistent videos has emerged as a significant yet challenging problem. While most existing diffusion-based video generation models, derived from image generation models, demonstrate promising performance in generating short videos, their simple conditioning mechanism and sampling strategy, originally designed for image generation, cause severe performance degradation when adapted to long video generation. This results in prominent temporal inconsistency and overexposure. Thus, in this work, we introduce FlexiFilm, a new diffusion model tailored for long video generation. Our framework incorporates a temporal conditioner to establish a more consistent relationship between generation and multi-modal conditions, and a resampling strategy to tackle overexposure. Empirical results demonstrate FlexiFilm generates long and consistent videos, each over 30 seconds in length, outperforming competitors in qualitative and quantitative analyses. Project page: https://y-ichen.github.io/FlexiFilm-Page/, Comment: 9 pages, 9 figures
- Published
- 2024
27. Tele-FLM Technical Report
- Author
-
Li, Xiang, Yao, Yiqun, Jiang, Xin, Fang, Xuezhi, Wang, Chao, Liu, Xinzhang, Wang, Zihan, Zhao, Yu, Wang, Xin, Huang, Yuyao, Song, Shuangyong, Li, Yongxiang, Zhang, Zheng, Zhao, Bo, Sun, Aixin, Wang, Yequan, He, Zhongjiang, Wang, Zhongyuan, Li, Xuelong, and Huang, Tiejun
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Large language models (LLMs) have showcased profound capabilities in language understanding and generation, facilitating a wide array of applications. However, there is a notable paucity of detailed, open-sourced methodologies on efficiently scaling LLMs beyond 50 billion parameters with minimum trial-and-error cost and computational resources. In this report, we introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model that features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities. Tele-FLM demonstrates superior multilingual language modeling abilities, measured by BPB on textual corpus. Besides, in both English and Chinese foundation model evaluation, it is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B. In addition to the model weights, we share the core designs, engineering practices, and training details, which we expect to benefit both the academic and industrial communities.
- Published
- 2024
28. Advances and Open Challenges in Federated Foundation Models
- Author
-
Ren, Chao, Yu, Han, Peng, Hongyi, Tang, Xiaoli, Zhao, Bo, Yi, Liping, Tan, Alysa Ziying, Gao, Yulan, Li, Anran, Li, Xiaoxiao, Li, Zengxiang, and Yang, Qiang
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence - Abstract
The integration of Foundation Models (FMs) with Federated Learning (FL) presents a transformative paradigm in Artificial Intelligence (AI). This integration offers enhanced capabilities, while addressing concerns of privacy, data decentralization and computational efficiency. This paper provides a comprehensive survey of the emerging field of Federated Foundation Models (FedFM), elucidating their synergistic relationship and exploring novel methodologies, challenges, and future directions that the FL research field needs to focus on in order to thrive in the age of FMs. A systematic multi-tiered taxonomy is proposed, categorizing existing FedFM approaches for model training, aggregation, trustworthiness, and incentivization. Key challenges, including how to enable FL to deal with high complexity of computational demands, privacy considerations, contribution evaluation, and communication efficiency, are thoroughly discussed. Moreover, this paper explores the intricate challenges of communication, scalability and security inherent in training/fine-tuning FMs via FL. It highlights the potential of quantum computing to revolutionize the processes of training, inference, optimization and security. This survey also introduces the implementation requirement of FedFM and some practical FedFM applications. It highlights lessons learned with a clear understanding of our findings for FedFM. Finally, this survey not only provides insights into the current state and challenges of FedFM, but also offers a blueprint for future research directions, emphasizing the need for developing trustworthy solutions. It serves as a foundational guide for researchers and practitioners interested in contributing to this interdisciplinary and rapidly advancing field., Comment: Survey of Federated Foundation Models (FedFM)
- Published
- 2024
29. Stable Acceleration of a LHe-Free Nb3Sn demo SRF e-linac Based on Conduction Cooling
- Author
-
Yang, Ziqin, He, Yuan, Jiang, Tiancai, Bai, Feng, Wang, Fengfeng, Chen, Weilong, Jiang, Guangze, Chu, Yimeng, Li, Hangxu, Zhao, Bo, Sun, Guozhen, Xue, Zongheng, Zhao, Yugang, Gao, Zheng, Li, Yaguang, Xiong, Pingran, Guo, Hao, Sun, Liepeng, Huang, Guirong, Wang, Zhijun, Zhang, Junhui, Tan, Teng, Zhao, Hongwei, and Zhan, Wenlong
- Subjects
Physics - Accelerator Physics - Abstract
The design, construction, and commissioning of a conduction-cooled Nb3Sn demonstration superconducting radio frequency (SRF) electron accelerator at the Institute of Modern Physics of the Chinese Academy of Sciences (IMP, CAS) will be presented. In the context of engineering application planning for Nb3Sn thin-film SRF cavities within the CiADS project, a 650MHz 5-cell elliptical cavity was coated using the vapor diffusion method for electron beam acceleration. Through high-precision collaborative control of 10 GM cryocoolers, a slow cooldown of the cavity crossing 18 K is achieved, accompanied by obvious characteristic magnetic flux expulsion. The horizontal test results of the liquid helium-free (LHe-free) cryomodule show that the cavity can operate steadily at Epk=6.02MV/m in continuous wave (CW) mode, and at Epk=14.90MV/m in 40% duty cycle pulse mode. The beam acceleration experiment indicates that the maximum average current of the electron beam in the macropulse after acceleration exceeds 200mA, with a maximum energy gain of 4.6MeV. The results provide a principle validation for the engineering application of Nb3Sn thin-film SRF cavities, highlighting the promising industrial application prospects of a small-scale compact Nb3Sn SRF accelerator driven by commercial cryocoolers.
- Published
- 2024
30. M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models
- Author
-
Bai, Fan, Du, Yuxin, Huang, Tiejun, Meng, Max Q. -H., and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Medical image analysis is essential to clinical diagnosis and treatment, which is increasingly supported by multi-modal large language models (MLLMs). However, previous research has primarily focused on 2D medical images, leaving 3D images under-explored, despite their richer spatial information. This paper aims to advance 3D medical image analysis with MLLMs. To this end, we present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs specifically tailored for various 3D medical tasks, such as image-text retrieval, report generation, visual question answering, positioning, and segmentation. Additionally, we propose M3D-LaMed, a versatile multi-modal large language model for 3D medical image analysis. Furthermore, we introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks. Through comprehensive evaluation, our method proves to be a robust model for 3D medical image analysis, outperforming existing solutions. All code, data, and models are publicly available at: https://github.com/BAAI-DCAI/M3D., Comment: MLLM, 3D medical image analysis
- Published
- 2024
31. Efficient size-prescribed $k$-core search
- Author
-
Liu, Yiping, Yan, Bo, Zhao, Bo, Su, Hongyi, Chen, Yang, and Witbrock, Michael
- Subjects
Computer Science - Data Structures and Algorithms - Abstract
A $k$-core is a subgraph in which every node has at least $k$ neighbors within the subgraph. $k$-core subgraphs have been employed in large platforms like Network Repository to comprehend the underlying structures and dynamics of the network. Existing studies have primarily focused on finding $k$-core groups without considering their size, despite the relevance of solution sizes in many real-world scenarios. This paper addresses this gap by introducing the size-prescribed $k$-core search (SPCS) problem, where the goal is to find a subgraph of a specified size that has the highest possible core number. We propose two algorithms, namely the {\it TSizeKcore-BU} and the {\it TSizeKcore-TD}, to identify cohesive subgraphs that satisfy both the $k$-core requirement and the size constraint. Our experimental results demonstrate the superiority of our approach in terms of solution quality and efficiency. The {\it TSizeKcore-BU} algorithm proves to be highly efficient in finding size-prescribed $k$-core subgraphs on large datasets, making it a favorable choice for such scenarios. On the other hand, the {\it TSizeKcore-TD} algorithm is better suited for small datasets where running time is less critical.
- Published
- 2024
32. Joint optimization of day-ahead of a microgrid including demand response and electric vehicles
- Author
-
Fu, Chengfang, Zhao, Bo, Dadfar, Sajjad, and Samad, Nasir
- Published
- 2024
33. Mesenchymal stem cell-derived exosomes in renal ischemia–reperfusion injury: a new therapeutic strategy
- Author
-
Zhao, Bo, Zhang, Zhenwang, Guo, Xiying, Liu, Xiufen, Lei, Min, Guo, Shuang, Yao, Qing, Zhang, Feixue, Peng, Tie, Liu, Aimei, Jiang, Botao, and Zhu, Dan
- Published
- 2024
34. Design of Ionic Liquids for HF/HFC-245fa Superefficient Separation: COSMO-RS Selection and Process Assessment
- Author
-
Liao, Yuan-Hao, Zeng, Jijun, Yang, Zhiqiang, Han, Sheng, Zhao, Bo, An, Yu, Tang, Xiaobo, Yu, Tao, Zhang, Wei, and Lu, Jian
- Published
- 2024
35. Polyoxometalate-based iron-organic complex nanozymes with peroxidase-like activities for colorimetric detection of hydrogen peroxide and ascorbic acid
- Author
-
Liu, Jingjing, Zhang, Yuan, Wang, Siyue, Zhao, Bo, Liu, Zhelin, Dong, Xiangting, and Feng, Shouhua
- Published
- 2024
36. Phylogeny of Neolissochilus and studies on intergeneric kinship geography of Cyprinidae
- Author
-
Zhou, Chenyao, He, Jinghong, Huang, Honghao, Wang, Handong, Chu, Zhangjie, Zhao, Bo, and Guo, Shuirong
- Published
- 2024
37. Deep learning nomogram for predicting neoadjuvant chemotherapy response in locally advanced gastric cancer patients
- Author
-
Zhang, Jingjing, Zhang, Qiang, Zhao, Bo, and Shi, Gaofeng
- Published
- 2024
38. Identification of LRRC46 as a novel candidate gene for high myopia
- Author
-
Jiang, Lingxi, Dai, Chao, Wei, Yao, Zhao, Bo, Li, Qi, Wu, Zhengzheng, Zou, Liang, Ye, Zimeng, Yang, Zhenglin, Huang, Lulin, and Shi, Yi
- Published
- 2024
39. SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model
- Author
-
Cao, Bin, Yuan, Jianhao, Liu, Yexin, Li, Jian, Sun, Shuyang, Liu, Jing, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In the rapidly evolving area of image synthesis, a serious challenge is the presence of complex artifacts that compromise the perceptual realism of synthetic images. To alleviate artifacts and improve the quality of synthetic images, we fine-tune a Vision-Language Model (VLM) as an artifact classifier to automatically identify and classify a wide range of artifacts and provide supervision for further optimizing generative models. Specifically, we develop a comprehensive artifact taxonomy and construct a dataset of synthetic images with artifact annotations for fine-tuning the VLM, named SynArtifact-1K. The fine-tuned VLM exhibits a superior ability to identify artifacts and outperforms the baseline by 25.66%. To our knowledge, this is the first time such an end-to-end artifact classification task and solution have been proposed. Finally, we leverage the output of the VLM as feedback to refine the generative model for alleviating artifacts. Visualization results and a user study demonstrate that the quality of images synthesized by the refined diffusion model has been clearly improved.
- Published
- 2024
40. Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability
- Author
-
Qian, Xuelin, Wang, Yu, Luo, Simian, Zhang, Yinda, Tai, Ying, Zhang, Zhenyu, Wang, Chengjie, Xue, Xiangyang, Zhao, Bo, Huang, Tiejun, Wu, Yunsheng, and Fu, Yanwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Auto-regressive models have achieved impressive results in 2D image generation by modeling joint distributions in grid space. In this paper, we extend auto-regressive models to 3D domains, and seek a stronger ability of 3D shape generation by improving auto-regressive models at capacity and scalability simultaneously. Firstly, we leverage an ensemble of publicly available 3D datasets to facilitate the training of large-scale models. It consists of a comprehensive collection of approximately 900,000 objects, with multiple properties of meshes, points, voxels, rendered images, and text captions. This diverse labeled dataset, termed Objaverse-Mix, empowers our model to learn from a wide range of object variations. However, directly applying 3D auto-regression encounters critical challenges of high computational demands on volumetric grids and ambiguous auto-regressive order along grid dimensions, resulting in inferior quality of 3D shapes. To this end, we then present a novel framework Argus3D in terms of capacity. Concretely, our approach introduces discrete representation learning based on a latent vector instead of volumetric grids, which not only reduces computational costs but also preserves essential geometric details by learning the joint distributions in a more tractable order. The capacity of conditional generation can thus be realized by simply concatenating various conditioning inputs to the latent vector, such as point clouds, categories, images, and texts. In addition, thanks to the simplicity of our model architecture, we naturally scale up our approach to a larger model with an impressive 3.6 billion parameters, further enhancing the quality of versatile 3D generation. Extensive experiments on four generation tasks demonstrate that Argus3D can synthesize diverse and faithful shapes across multiple categories, achieving remarkable performance., Comment: Project page: https://argus-3d.github.io/ . Datasets: https://huggingface.co/datasets/BAAI/Objaverse-MIX. arXiv admin note: substantial text overlap with arXiv:2303.14700
- Published
- 2024
41. Efficient Multimodal Learning from Data-centric Perspective
- Author
-
He, Muyang, Liu, Yexin, Wu, Boya, Yuan, Jianhao, Wang, Yueze, Huang, Tiejun, and Zhao, Bo
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference, limiting accessibility to the broader research and user communities. A straightforward solution is to leverage smaller pre-trained vision and language models, which inevitably cause significant performance drops. In this paper, we demonstrate the possibility of training a smaller but better MLLM with high-quality training data. Specifically, we introduce Bunny, a family of lightweight MLLMs with flexible vision and language backbones for efficient multimodal learning from selected training data. Experiments show that our Bunny-4B/8B outperforms the state-of-the-art large MLLMs on multiple benchmarks. We expect that this work can provide the community with a clean and flexible open-source tool for further research and development. The code, models, and data can be found in https://github.com/BAAI-DCAI/Bunny.
- Published
- 2024
42. RAG-Driver: Generalisable Driving Explanations with Retrieval-Augmented In-Context Learning in Multi-Modal Large Language Model
- Author
-
Yuan, Jianhao, Sun, Shuyang, Omeiza, Daniel, Zhao, Bo, Newman, Paul, Kunze, Lars, and Gadd, Matthew
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
We need to trust robots that use often opaque AI methods. They need to explain themselves to us, and we need to trust their explanation. In this regard, explainability plays a critical role in trustworthy autonomous decision-making to foster transparency and acceptance among end users, especially in complex autonomous driving. Recent advancements in Multi-Modal Large Language models (MLLMs) have shown promising potential in enhancing the explainability as a driving agent by producing control predictions along with natural language explanations. However, severe data scarcity due to expensive annotation costs and significant domain gaps between different datasets makes the development of a robust and generalisable system an extremely challenging task. Moreover, the prohibitively expensive training requirements of MLLM and the unsolved problem of catastrophic forgetting further limit their generalisability post-deployment. To address these challenges, we present RAG-Driver, a novel retrieval-augmented multi-modal large language model that leverages in-context learning for high-performance, explainable, and generalisable autonomous driving. By grounding in retrieved expert demonstration, we empirically validate that RAG-Driver achieves state-of-the-art performance in producing driving action explanations, justifications, and control signal prediction. More importantly, it exhibits exceptional zero-shot generalisation capabilities to unseen environments without further training endeavours., Comment: 14 pages, 6 figures
- Published
- 2024
43. Spin: An Efficient Secure Computation Framework with GPU Acceleration
- Author
-
Jiang, Wuxuan, Song, Xiangjun, Hong, Shenbai, Zhang, Haijun, Liu, Wenxin, Zhao, Bo, Xu, Wei, and Li, Yi
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Machine Learning - Abstract
Accuracy and efficiency remain challenges for multi-party computation (MPC) frameworks. Spin is a GPU-accelerated MPC framework that supports multiple computation parties and a dishonest majority adversarial setup. We propose optimized protocols for non-linear functions that are critical for machine learning, as well as several novel optimizations specific to attention that is the fundamental unit of Transformer models, allowing Spin to perform non-trivial CNNs training and Transformer inference without sacrificing security. At the backend level, Spin leverages GPU, CPU, and RDMA-enabled smart network cards for acceleration. Comprehensive evaluations demonstrate that Spin can be up to $2\times$ faster than the state-of-the-art for deep neural network training. For inference on a Transformer model with 18.9 million parameters, our attention-specific optimizations enable Spin to achieve better efficiency, less communication, and better accuracy.
- Published
- 2024
44. Distributional Counterfactual Explanations With Optimal Transport
- Author
-
You, Lei, Cao, Lele, Nilsson, Mattias, Zhao, Bo, and Lei, Lei
- Subjects
Computer Science - Artificial Intelligence ,Statistics - Machine Learning - Abstract
Counterfactual explanations (CE) are the de facto method for providing insights into black-box decision-making models by identifying alternative inputs that lead to different outcomes. However, existing CE approaches, including group and global methods, focus predominantly on specific input modifications, lacking the ability to capture nuanced distributional characteristics that influence model outcomes across the entire input-output spectrum. This paper proposes distributional counterfactual explanation (DCE), shifting focus to the distributional properties of observed and counterfactual data, thus providing broader insights. DCE is particularly beneficial for stakeholders making strategic decisions based on statistical data analysis, as it makes the statistical distribution of the counterfactual resemble that of the factual while aligning model outputs with a target distribution, something that existing CE methods cannot fully achieve. We leverage optimal transport (OT) to formulate a chance-constrained optimization problem, deriving a counterfactual distribution aligned with its factual counterpart, supported by statistical confidence. The efficacy of this approach is demonstrated through experiments, highlighting its potential to provide deeper insights into decision-making models.
- Published
- 2024
45. Learning Position-Aware Implicit Neural Network for Real-World Face Inpainting
- Author
-
Zhao, Bo, Yang, Huan, and Fu, Jianlong
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Face inpainting requires the model to have a precise global understanding of the facial position structure. Benefiting from the powerful capabilities of deep learning backbones, recent works in face inpainting have achieved decent performance in the ideal setting (square shape with $512px$). However, existing methods often produce a visually unpleasant result, especially in the position-sensitive details (e.g., eyes and nose), when directly applied to arbitrary-shaped images in real-world scenarios. The visually unpleasant position-sensitive details indicate the shortcomings of existing methods in terms of position information processing capability. In this paper, we propose an Implicit Neural Inpainting Network (IN$^2$) to handle arbitrary-shape face images in real-world scenarios by explicitly modeling position information. Specifically, a downsample processing encoder is proposed to reduce information loss while obtaining the global semantic feature. A neighbor hybrid attention block is proposed with a hybrid attention mechanism to improve the facial understanding ability of the model without restricting the shape of the input. Finally, an implicit neural pyramid decoder is introduced to explicitly model position information and bridge the gap between low-resolution features and high-resolution output. Extensive experiments demonstrate the superiority of the proposed method in the real-world face inpainting task., Comment: 10 pages, 5 figures
- Published
- 2024
46. Tenplex: Dynamic Parallelism for Deep Learning using Parallelizable Tensor Collections
- Author
-
Wagenländer, Marcel, Li, Guo, Zhao, Bo, Mai, Luo, and Pietzuch, Peter
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Deep learning (DL) jobs use multi-dimensional parallelism, i.e. combining data, model, and pipeline parallelism, to use large GPU clusters efficiently. Long-running jobs may experience changes to their GPU allocation: (i) resource elasticity during training adds or removes GPUs; (ii) hardware maintenance may require redeployment on different GPUs; and (iii) GPU failures force jobs to run with fewer devices. Current DL frameworks tie jobs to a set of GPUs and thus lack support for these scenarios. In particular, they cannot change the multi-dimensional parallelism of an already-running job in an efficient and model-independent way. We describe Scalai, a state management library for DL systems that enables jobs to change their parallelism dynamically after the GPU allocation is updated at runtime. Scalai achieves this through a new abstraction, a parallelizable tensor collection (PTC), that externalizes the job state during training. After a GPU change, Scalai uses the PTC to transform the job state: the PTC repartitions the dataset state under data parallelism and exposes it to DL workers through a virtual file system; and the PTC obtains the model state as partitioned checkpoints and transforms them to reflect the new parallelization configuration. For efficiency, Scalai executes PTC transformations in parallel with minimum data movement between workers. Our experiments show that Scalai enables DL jobs to support dynamic parallelization with low overhead., Comment: The 30th Symposium on Operating Systems Principles (SOSP24)
- Published
- 2023
47. Open-DDVM: A Reproduction and Extension of Diffusion Model for Optical Flow Estimation
- Author
-
Dong, Qiaole, Zhao, Bo, and Fu, Yanwei
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, Google proposed DDVM, which for the first time demonstrates that a general diffusion model for the image-to-image translation task works impressively well on the optical flow estimation task without any specific designs like RAFT. However, DDVM is still a closed-source model with the expensive and private Palette-style pretraining. In this technical report, we present the first open-source DDVM by reproducing it. We study several design choices and identify the important ones. By training on 40k public data with 4 GPUs, our reproduction achieves comparable performance to the closed-source DDVM. The code and model have been released at https://github.com/DQiaole/FlowDiffusion_pytorch., Comment: Technical Report
- Published
- 2023
48. Defect-engineered indium–organic framework displays the higher CO2 adsorption and more excellent catalytic performance on the cycloaddition of CO2 with epoxides under mild conditions
- Author
-
Ren, Meiyu, Zhao, Bo, Li, Chong, Fei, Yang, Wang, Xiaotong, Fan, Liming, Hu, Tuoping, and Zhang, Xiutang
- Published
- 2024
49. An AI-Based Method for Estimating the Potential Runout Distance of Post-Seismic Debris Flows
- Author
-
Qiu, Chenchen, Su, Lijun, Bian, Congchao, Zhao, Bo, and Geng, Xueyu
- Published
- 2024
50. Enhanced Ethylene Production from Electrocatalytic Acetylene Semi-hydrogenation Over Porous Carbon-Supported Cu Nanoparticles
- Author
-
Li, Li, Chen, Fanpeng, Zhao, Bo-Hang, and Yu, Yifu
- Published
- 2024