Author: "Wang, Bin" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Wang, Bin"' showing total 59,047 results

1. HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation

Author: Chen, Yuhan, Lv, Ang, Luan, Jian, Wang, Bin, and Liu, Wei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. First, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs) and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. These advantages enhance the model's context awareness and extrapolation, as validated by extensive experiments.
Published: 2024
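
The abstract above builds on rotary position encoding (RoPE). As a rough illustration only, the following sketch implements the standard RoPE rotation (not HoPE, whose modified components the abstract does not specify) and checks the property that makes it a relative encoding: the attention score depends only on the offset between two positions. The function names, the `base` value of 10000, and the sample vectors are assumptions chosen for the example.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply a standard rotary position encoding to an even-length vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive feature pair is rotated at its own frequency;
        # earlier pairs rotate faster (higher frequency) than later ones.
        theta = position * base ** (-i / dim)
        x, y = vec[i], vec[i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))
```

Here `score_near` and `score_far` agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged.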

2. Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Author: Zhang, Qintong, Huang, Victor Shea-Jay, Wang, Bin, Zhang, Junyuan, Wang, Zhengren, Liang, Hao, Wang, Shawn, Lin, Matthieu, He, Conghui, and Zhang, Wentao
Subjects: Computer Science - Multimedia, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Document parsing is essential for converting unstructured and semi-structured documents (such as contracts, academic papers, and invoices) into structured, machine-readable data. Document parsing extracts reliable structured data from unstructured inputs, providing huge convenience for numerous applications. Especially with recent achievements in Large Language Models, document parsing plays an indispensable role in both knowledge base construction and training data generation. This survey presents a comprehensive review of the current state of document parsing, covering key methodologies, from modular pipeline systems to end-to-end models driven by large vision-language models. Core components such as layout detection, content extraction (including text, tables, and mathematical expressions), and multi-modal data integration are examined in detail. Additionally, this paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts, integrating multiple modules, and recognizing high-density text. It emphasizes the importance of developing larger and more diverse datasets and outlines future research directions.
Published: 2024

3. Personalized Playback Technology: How Short Video Services Create Excellent User Experience

Author: Deng, Weihui, Fan, Zhiwei, Fu, Deliang, Gong, Yun, Huang, Shenglan, Li, Xiaocheng, Li, Zheng, Liao, Yiting, Liu, He, Qiao, Chunyu, Wang, Bin, Wang, Zhen, and Xiong, Zhengyu
Subjects: Computer Science - Multimedia
Abstract: Short-form video content has become increasingly popular and influential in recent years. Its concise yet engaging format aligns well with today's fast-paced and on-the-go lifestyles, making it a dominating trend in the digital world. As one of the front runners in the short video platform space, ByteDance has been highly successful in delivering a one-of-a-kind short video experience and attracting billions of users worldwide. One key contributing factor is its advanced end-to-end personalized short video playback technology, where we pioneered and developed the new technical field over the past five years to optimize user experience. This paper introduces the major concepts and methodologies of this personalized video playback technology that distinguish it from traditional multimedia technologies. More details, including goal setting, iterative process, modeling, experimental methods and required supporting systems, are also provided to encourage deeper research in this area.
Published: 2024

4. DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

Author: Zhu, Haowei, Tang, Dehua, Liu, Ji, Lu, Mingjie, Zheng, Jintu, Peng, Jinzhang, Li, Dong, Wang, Yu, Jiang, Fan, Tian, Lu, Tiwari, Spandan, Sirasao, Ashish, Yong, Jun-Hai, Wang, Bin, and Barsoum, Emad
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Diffusion models have achieved remarkable progress in the field of image generation due to their outstanding capabilities. However, these models require substantial computing resources because of the multi-step denoising process during inference. While traditional pruning methods have been employed to optimize these models, the retraining process necessitates large-scale training datasets and extensive computational costs to maintain generalization ability, making it neither convenient nor efficient. Recent studies attempt to utilize the similarity of features across adjacent denoising stages to reduce computational costs through simple and static strategies. However, these strategies cannot fully harness the potential of the similar feature patterns across adjacent timesteps. In this work, we propose a novel pruning method that derives an efficient diffusion model via a more intelligent and differentiable pruner. At the core of our approach is casting the model pruning process into a SubNet search process. Specifically, we first introduce a SuperNet based on standard diffusion via adding some backup connections built upon the similar features. We then construct a plugin pruner network and design optimization losses to identify redundant computation. Finally, our method can identify an optimal SubNet through few-step gradient optimization and a simple post-processing procedure. We conduct extensive experiments on various diffusion models including Stable Diffusion series and DiTs. Our DiP-GO approach achieves 4.4 x speedup for SD-1.5 without any loss of accuracy, significantly outperforming the previous state-of-the-art methods.
Published: 2024

5. SiamSeg: Self-Training with Contrastive Learning for Unsupervised Domain Adaptation Semantic Segmentation in Remote Sensing

Author: Wang, Bin, Deng, Fei, Wang, Shuang, Luo, Wen, Zhang, Zhixuan, Zhang, Gulan, and Jiang, Peifan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Semantic segmentation of remote sensing (RS) images is a challenging yet crucial task. While deep learning, particularly supervised learning with large-scale labeled datasets, has significantly advanced this field, acquiring high-quality labeled data is expensive and time-consuming. Additionally, variations in ground sampling distance, imaging equipment, and geographic differences cause domain shifts between datasets, which limit model performance across domains. Unsupervised domain adaptation (UDA) offers a solution by enabling models to learn from unlabeled target domain data while training on labeled source domain data. Recent self-supervised learning approaches using pseudo-label generation have shown promise in addressing domain discrepancies. Combining source and target images with their true and pseudo labels has proven effective in reducing domain bias. However, the use of pseudo-labeling for RS image segmentation is underexplored. Existing methods often rely on high-confidence pixel points as pseudo-labels, reducing supervision in low-confidence areas. Noise in pseudo-labels further weakens the model's ability to learn target domain semantics. While some methods assign confidence weights, noisy pseudo-labels remain an issue. To address these limitations, we propose integrating contrastive learning into UDA, enhancing the model's capacity to capture semantic information by maximizing the similarity between augmented views of the same image. This provides additional supervision to improve performance in the target domain. Extensive experiments on key RS datasets (Potsdam, Vaihingen, LoveDA) demonstrate that our SiamSeg method outperforms existing approaches, achieving state-of-the-art results. Visualization and quantitative analyses confirm its superior ability to learn from the target domain. The code is available at https://github.com/woldier/SiamSeg.
Published: 2024
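
The abstract above describes contrastive learning as maximizing the similarity between augmented views of the same image. The abstract does not state which loss SiamSeg uses, so the sketch below swaps in InfoNCE, one common contrastive objective, purely as an assumed illustration. The toy feature vectors and the temperature of 0.1 are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: small when the positive is the closest candidate."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum - logits[0]

# Hypothetical features: two augmented views of one image, plus two other images.
view_a = [1.0, 0.1, 0.0]
view_b = [0.9, 0.2, 0.1]
other = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.5]]

aligned = info_nce(view_a, view_b, other)
misaligned = info_nce(view_a, other[0], [view_b, other[1]])
```

With these values `aligned` is much smaller than `misaligned`, which is the signal a training loop would use to pull the two views of the same image together in feature space.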

6. Order-aware Interactive Segmentation

Author: Wang, Bin, Choudhuri, Anwesa, Zheng, Meng, Gao, Zhongpai, Planche, Benjamin, Deng, Andong, Liu, Qin, Chen, Terrence, Bagci, Ulas, and Wu, Ziyan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Interactive segmentation aims to accurately segment target objects with minimal user interactions. However, current methods often fail to accurately separate target objects from the background, due to a limited understanding of order, the relative depth between objects in a scene. To address this issue, we propose OIS: order-aware interactive segmentation, where we explicitly encode the relative depth between objects into order maps. We introduce a novel order-aware attention, where the order maps seamlessly guide the user interactions (in the form of clicks) to attend to the image features. We further present an object-aware attention module to incorporate a strong object-level understanding to better differentiate objects with similar order. Our approach allows both dense and sparse integration of user clicks, enhancing both accuracy and efficiency as compared to prior works. Experimental results demonstrate that OIS achieves state-of-the-art performance, improving mIoU after one click by 7.61 on the HQSeg44K dataset and 1.32 on the DAVIS dataset as compared to the previous state-of-the-art SegNext, while also doubling inference speed compared to current leading methods. The project page is https://ukaukaaaa.github.io/projects/OIS/index.html
Comment: Interactive demo can be found on the project page: https://ukaukaaaa.github.io/projects/OIS/index.html
Published: 2024

7. DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Author: Zhao, Zhiyuan, Kang, Hengrui, Wang, Bin, and He, Conghui
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Document Layout Analysis is crucial for real-world document understanding systems, but it encounters a challenging trade-off between speed and accuracy: multimodal methods leveraging both text and visual features achieve higher accuracy but suffer from significant latency, whereas unimodal methods relying solely on visual features offer faster processing speeds at the expense of accuracy. To address this dilemma, we introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages through document-specific optimizations in both pre-training and model design. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm, which frames document synthesis as a two-dimensional bin packing problem, generating the large-scale, diverse DocSynth-300K dataset. Pre-training on the resulting DocSynth-300K dataset significantly improves fine-tuning performance across various document types. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module that is capable of better handling multi-scale variations of document elements. Furthermore, to validate performance across different document types, we introduce a complex and challenging benchmark named DocStructBench. Extensive experiments on downstream datasets demonstrate that DocLayout-YOLO excels in both speed and accuracy. Code, data, and models are available at https://github.com/opendatalab/DocLayout-YOLO.
Comment: Github Repo: https://github.com/opendatalab/DocLayout-YOLO
Published: 2024

8. 3-D Magnetotelluric Deep Learning Inversion Guided by Pseudo-Physical Information

Author: Jiang, Peifan, Wang, Xuben, Wang, Shuang, Deng, Fei, Wang, Kunpeng, Wang, Bin, Yang, Yuhan, and Fadel, Islam
Subjects: Physics - Geophysics, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Magnetotelluric (MT) deep learning (DL) inversion methods that combine data-driven and physics-driven approaches have become a hot topic in recent years. When mapping observation data (or forward modeling data) to the resistivity model using neural networks (NNs), incorporating the error (loss) term of the inversion resistivity's forward modeling response, which introduces physical information about electromagnetic field propagation, can significantly enhance the inversion accuracy. To efficiently achieve data-physical dual-driven MT deep learning inversion for large-scale 3-D MT data, we propose using DL forward modeling networks to compute this portion of the loss. This approach introduces pseudo-physical information through the forward modeling of NN simulation, further guiding the inversion network fitting. Specifically, we first pre-train the forward modeling networks as fixed forward modeling operators, then transfer and integrate them into the inversion network training, and finally optimize the inversion network by minimizing the multinomial loss. Theoretical experimental results indicate that despite some simulation errors in DL forward modeling, the introduced pseudo-physical information still enhances inversion accuracy and significantly mitigates the overfitting problem during training. Additionally, we propose a new input mode that involves masking and adding noise to the data, simulating the field data environment of 3-D MT inversion, thereby making the method more flexible and effective for practical applications.
Published: 2024

9. CALoR: Towards Comprehensive Model Inversion Defense

Author: Yu, Hongyao, Qiu, Yixiang, Fang, Hao, Chen, Bin, Yu, Sijin, Wang, Bin, Xia, Shu-Tao, and Xu, Ke
Subjects: Computer Science - Cryptography and Security, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Model Inversion Attacks (MIAs) aim at recovering privacy-sensitive training data from the knowledge encoded in the released machine learning models. Recent advances in the MIA field have significantly enhanced the attack performance under multiple scenarios, posing serious privacy risks of Deep Neural Networks (DNNs). However, the development of defense strategies lags behind the latest MIAs, and existing defenses fail to achieve a better trade-off between model utility and model robustness. In this paper, we provide an in-depth analysis from the perspective of intrinsic vulnerabilities of MIAs, comprehensively uncovering the weaknesses inherent in the basic pipeline, which are partially investigated in the previous defenses. Building upon these new insights, we propose a robust defense mechanism, integrating Confidence Adaptation and Low-Rank compression (CALoR). Our method includes a novel robustness-enhanced classification loss specially designed for model inversion defenses and reveals the extraordinary effectiveness of compressing the classification header. With CALoR, we can mislead the optimization objective, reduce the leaked information and impede the backpropagation of MIAs, thus mitigating the risk of privacy leakage. Extensive experimental results demonstrate that our method achieves state-of-the-art (SOTA) defense performance against MIAs and exhibits superior generalization to existing defenses across various scenarios.
Comment: 26 pages
Published: 2024

10. MinerU: An Open-Source Solution for Precise Document Content Extraction

Author: Wang, Bin, Xu, Chao, Zhao, Xiaomeng, Ouyang, Linke, Wu, Fan, Zhao, Zhiyuan, Xu, Rui, Liu, Kaiwen, Qu, Yuan, Shang, Fukai, Zhang, Bo, Wei, Liqun, Sui, Zhihao, Li, Wei, Shi, Botian, Qiao, Yu, Lin, Dahua, and He, Conghui
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
Comment: MinerU Technical Report
Published: 2024

11. PMSS: Pretrained Matrices Skeleton Selection for LLM Fine-tuning

Author: Wang, Qibin, Hu, Xiaolin, Xu, Weikai, Liu, Wei, Luan, Jian, and Wang, Bin
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Low-rank adaptation (LoRA) and its variants have recently gained much interest due to their ability to avoid excessive inference costs. However, LoRA still encounters the following challenges: (1) the limitation of the low-rank assumption; and (2) its initialization method may be suboptimal. To this end, we propose PMSS (Pre-trained Matrices Skeleton Selection), which enables high-rank updates at low cost while leveraging semantic and linguistic information inherent in the pre-trained weights. It achieves this by selecting skeletons from the pre-trained weight matrix and only learning a small matrix instead. Experiments demonstrate that PMSS outperforms LoRA and other fine-tuning methods across tasks with far fewer trainable parameters. We demonstrate its effectiveness, especially in handling complex tasks such as the DROP benchmark (+3.4%/+5.9% on LLaMA2-7B/13B) and math reasoning (+12.89%/+5.61%/+3.11% on LLaMA2-7B, Mistral-7B and Gemma-7B of GSM8K). The code and model will be released soon.
Published: 2024
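
For context on the low-rank assumption that the abstract above criticizes, the following is a minimal sketch of a plain LoRA-style update (not PMSS, whose skeleton selection the abstract does not detail). A frozen weight W receives an additive update B·A of rank at most r. The helper names, matrix sizes, and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, b, a, scale=1.0):
    """Return W + scale * (B @ A); W stays frozen, only B and A would be trained."""
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d, k, r = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen d x k weight
b = [[0.5], [0.0], [0.0], [0.0]]  # trainable d x r factor
a = [[0.0, 2.0, 0.0, 0.0]]        # trainable r x k factor

w_adapted = lora_weight(w, b, a)

# Trainable parameter counts: full fine-tuning versus the two low-rank factors.
full_params = d * k
lora_params = r * (d + k)
```

The update can only change W within a rank-r subspace, which is the restriction the abstract refers to. The saving in trainable parameters grows as d and k become much larger than r.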

12. ToolPlanner: A Tool Augmented LLM for Multi Granularity Instructions with Path Planning and Feedback

Author: Wu, Qinzhuo, Liu, Wei, Luan, Jian, and Wang, Bin
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recently, tool-augmented LLMs have gained increasing attention. Given an instruction, tool-augmented LLMs can interact with various external tools in multiple rounds and provide a final answer. However, previous LLMs were trained on overly detailed instructions, which included API names or parameters, while real users would not explicitly mention these API details. This leads to a gap between trained LLMs and real-world scenarios. In addition, most works ignore whether the interaction process follows the instruction. To address these issues, we constructed a training dataset called MGToolBench, which contains statement and category-level instructions to better reflect real-world scenarios. In addition, we propose ToolPlanner, a two-stage reinforcement learning framework that utilizes path planning and two feedback mechanisms to enhance the LLM's task completion and instruction-following capabilities. Experimental results show that ToolPlanner significantly improves the Match Rate, Pass Rate and Win Rate by 26.8%, 20.2%, and 5.6% compared to the SOTA model. Human evaluation verifies that the multi-granularity instructions can better align with users' usage habits. Our data and code will be released upon acceptance.
Published: 2024

13. MobileVLM: A Vision-Language Model for Better Intra- and Inter-UI Understanding

Author: Wu, Qinzhuo, Xu, Weikai, Liu, Wei, Tan, Tao, Liu, Jianfeng, Li, Ang, Luan, Jian, Wang, Bin, and Shang, Shuo
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recently, mobile AI agents based on VLMs have been gaining increasing attention. These works typically utilize a VLM as a foundation, fine-tuning it with instruction-based mobile datasets. However, these VLMs are typically pre-trained on general-domain data, which often results in a lack of fundamental capabilities specific to the mobile domain. Therefore, they may struggle to recognize specific UI elements and understand intra-UI fine-grained information. In addition, the current fine-tuning task focuses on interacting with the most relevant element for the given instruction. These fine-tuned VLMs may still ignore the relationships between UI pages, neglect the roles of elements in page transitions and lack inter-UI understanding. To address these issues, we propose a VLM called MobileVLM, which includes two additional pre-training stages to enhance both intra- and inter-UI understanding. We defined four UI-based pre-training tasks, enabling the model to better perceive fine-grained elements and capture page transition actions. To address the lack of mobile pre-training data, we built a large Chinese mobile dataset, Mobile3M, from scratch, which contains 3 million UI pages and real-world transition actions, forming a directed graph structure. Experimental results show MobileVLM excels on both our test set and public mobile benchmarks, outperforming existing VLMs.
Published: 2024

14. Gauge invariant quantum transport theory for non-Hermitian systems

Author: Wei, Miaomiao, Wang, Bin, and Wang, Jian
Subjects: Condensed Matter - Mesoscale and Nanoscale Physics
Abstract: Gauge invariance is a fundamental principle that must be preserved in quantum transport. However, when a complex potential is incorporated into the Hamiltonian, we find that the current described by the well-established Landauer-Büttiker formula no longer satisfies gauge invariance. Using the non-equilibrium Green's function (NEGF) method, we derive a current expression for a multi-probe system that includes a complex potential in the scattering region. We observe that an additional current term arises compared to the Landauer-Büttiker formula, which leads to a violation of gauge invariance. To address this, we propose two phenomenological methods for redistributing the conductance to restore gauge invariance in non-Hermitian systems. These methods are applied to various trivial and nontrivial non-Hermitian quantum states, confirming the necessity of gauge-invariant treatments in non-Hermitian systems.
Comment: 6 pages of main text, 2 figures
Published: 2024

15. Tunable Anomalous Hall Effect in a Kagome Ferromagnetic Weyl Semimetal

Author: Pate, Samuel E., Wang, Bin, Zhang, Yang, Shen, Bing, Liu, Enke, Martin, Ivar, Jiang, J. Samuel, Zhou, Xiuquan, Chung, Duck Young, Kanatzidis, Mercouri G., Welp, Ulrich, Kwok, Wai-Kwong, and Xiao, Zhi-Li
Subjects: Condensed Matter - Materials Science
Abstract: Emerging from the intricate interplay of topology and magnetism, the giant anomalous Hall effect (AHE) is the best-known topological property of the recently discovered kagome ferromagnetic Weyl semimetal Co$_3$Sn$_2$S$_2$, with the magnetic Co atoms arranged on a kagome lattice. Here we report that the AHE in Co$_3$Sn$_2$S$_2$ can be fine-tuned by an applied magnetic field oriented within ~2 degrees of the kagome plane, while beyond this regime, it stays unchanged. Particularly, it can vanish in magnetic fields parallel to the kagome plane and even decrease in magnetic fields collinear with the spin direction. This tunable AHE can be attributed to local spin switching enabled by the geometrical frustration of the magnetic kagome lattice, revealing that spins in a kagome ferromagnet change their switching behavior as the magnetic field approaches the kagome plane. Our results also suggest a versatile way to tune the properties of a kagome magnet.
Published: 2024

16. Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data

Author: Zhou, Jiaming, Ghaddar, Abbas, Zhang, Ge, Ma, Liheng, Hu, Yaochen, Pal, Soumyasundar, Coates, Mark, Wang, Bin, Zhang, Yingxue, and Hao, Jianye
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Despite recent advances in training and prompting strategies for Large Language Models (LLMs), these models continue to face challenges with complex logical reasoning tasks that involve long reasoning chains. In this work, we explore the potential and limitations of using graph-based synthetic reasoning data as training signals to enhance LLMs' reasoning capabilities. Our extensive experiments, conducted on two established natural language reasoning tasks -- inductive reasoning and spatial reasoning -- demonstrate that supervised fine-tuning (SFT) with synthetic graph-based reasoning data effectively enhances LLMs' reasoning performance without compromising their effectiveness on other standard evaluation benchmarks.
Published: 2024

17. Mixture of Diverse Size Experts

Author: Sun, Manxi, Liu, Wei, Luan, Jian, Gao, Pengzhi, and Wang, Bin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs. Despite its success, the current design faces a challenge where all experts have the same size, limiting the ability of tokens to choose the experts with the most appropriate size for generating the next token. In this paper, we propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes. Our analysis of difficult token generation tasks shows that experts of various sizes achieve better predictions, and the routing path of the experts tends to be stable after a training period. However, having experts of diverse sizes can lead to uneven workload distribution. To tackle this limitation, we introduce an expert-pair allocation strategy to evenly distribute the workload across multiple GPUs. Comprehensive evaluations across multiple benchmarks demonstrate the effectiveness of MoDSE, as it outperforms existing MoEs by allocating the parameter budget to experts adaptively while maintaining the same total parameter size and the number of experts.
Published: 2024
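
To make the routing idea in the abstract above concrete, here is a minimal sketch of top-k gating over experts of different sizes. Top-k softmax routing is the common MoE scheme and is assumed here; the abstract does not confirm MoDSE's exact router. The expert sizes and gate logits are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of gate logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(gate_logits, k=2):
    """Pick the k experts with the highest gate probability and renormalize."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Hypothetical hidden sizes: experts differ in size while the total budget
# (2048) matches four uniform experts of size 512.
expert_sizes = [256, 256, 768, 768]

# Gate logits for a single token; experts 1 and 2 score highest.
weights = route_top_k([0.1, 2.0, 1.5, -0.3], k=2)

# Compute cost for this token depends on which expert sizes were selected.
active_size = sum(expert_sizes[i] for i in weights)
```

Because the selected experts can differ in size, per-token compute varies, which is the source of the uneven workload that the abstract's expert-pair allocation strategy is meant to balance.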

18. An Enhanced-State Reinforcement Learning Algorithm for Multi-Task Fusion in Large-Scale Recommender Systems

Author: Liu, Peng, Zhu, Jiawei, Xu, Cong, Zhao, Ming, and Wang, Bin
Subjects: Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: As the last key stage of Recommender Systems (RSs), Multi-Task Fusion (MTF) is in charge of combining multiple scores predicted by Multi-Task Learning (MTL) into a final score to maximize user satisfaction, which decides the ultimate recommendation results. In recent years, to maximize long-term user satisfaction within a recommendation session, Reinforcement Learning (RL) has been widely used for MTF in large-scale RSs. However, limited by their modeling pattern, all the current RL-MTF methods can only utilize user features as the state to generate actions for each user, and are unable to make use of item features and other valuable features, which leads to suboptimal results. Addressing this problem is a challenge that requires breaking through the current modeling pattern of RL-MTF. To solve this problem, we propose a novel method called Enhanced-State RL for MTF in RSs. Unlike the existing methods mentioned above, our method first defines user features, item features, and other valuable features collectively as the enhanced state; then proposes a novel actor and critic learning process to utilize the enhanced state to make much better actions for each user-item pair. To the best of our knowledge, this novel modeling pattern is being proposed for the first time in the field of RL-MTF. We conduct extensive offline and online experiments in a large-scale RS. The results demonstrate that our model outperforms other models significantly. Enhanced-State RL has been fully deployed in our RS for more than half a year, improving user valid consumption by +3.84% and user duration time by +0.58% compared to the baseline.
Comment: arXiv admin note: substantial text overlap with arXiv:2404.17589
Published: 2024

19. Picard Groups of Spectral Varieties and Moduli of Higgs Sheaves

Author: Su, Xiaoyu and Wang, Bin
Subjects: Mathematics - Algebraic Geometry
Abstract: We study moduli spaces of Higgs sheaves valued in line bundles and the associated Hitchin maps on surfaces. We first work out the Picard groups of generic (very general) spectral varieties, a result which holds in dimension at least 2, i.e., a Noether--Lefschetz type theorem for spectral varieties. We then apply this to obtain a necessary and sufficient condition for the non-emptiness of generic Hitchin fibers in the surface case. Then we move on to detect the geometry of the moduli spaces of Higgs sheaves as the second Chern class varies.
Comment: Comments are welcome! arXiv admin note: text overlap with arXiv:2109.09989
Published: 2024

20. Probing Quantum Gravity Effects with Eccentric Extreme Mass-Ratio Inspirals

Author: Fu, Guoyang, Liu, Yunqi, Wang, Bin, Wu, Jian-Pin, and Zhang, Chao
Subjects: General Relativity and Quantum Cosmology, High Energy Physics - Theory
Abstract: In this paper, we investigate the impact of loop quantum gravity (LQG) on extreme mass-ratio inspirals (EMRIs), and the results indicate that LQG effects cause the orbital decay to occur faster compared to the Schwarzschild case. Furthermore, we use the augmented analytic kludge approach to generate EMRI waveforms and study LISA's capability to detect the LQG effect with faithfulness. Additionally, employing the Fisher information matrix method for parameter estimation, we estimate that after one year of observation, the uncertainty in $r_0$ reduces to approximately $6.59\times 10^{-4}$ with a signal-to-noise ratio of $49$.
Comment: 22 pages, 7 figures
Published: 2024

21. MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

Author: Zhang, Wenyu, Sun, Shuo, Wang, Bin, Zou, Xunlong, Liu, Zhuohan, He, Yingxu, Lin, Geyu, Chen, Nancy F., and Aw, Ai Ti
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
Published: 2024

22. Dark QCD perspective inspired by strong CP problem at QCD scale

Author: Wang, Bin, Matsuzaki, Shinya, and Ishida, Hiroyuki
Subjects: High Energy Physics - Phenomenology
Abstract: We discuss a QCD-scale composite axion model arising from dark QCD coupled to QCD. The proposed scenario not only solves the strong CP problem, but is also compatible with the preheating setup for QCD baryogenesis. The composite axion is phenomenologically required to mimic the QCD pion, but can generically be flavorful, which could be tested via the induced flavor-changing processes at experiments. Another axionlike particle (ALP) is predicted to achieve the axion relaxation mechanism, and can phenomenologically act as the conventional QCD axion. This ALP can be ultralight, with a mass below 1 eV, making it a dark matter candidate. The QCD $\times$ dark QCD symmetry structure constrains the dark QCD meson spectra, so that only the dark $\eta'$-like meson would be accessible at collider experiments. Still, the Belle II and Electron-Ion Collider experiments can have a high enough sensitivity to probe the dark $\eta'$-like meson in the diphoton channel, which dominantly arises from the mixing with the QCD $\eta'$ and the pionic composite axion. We also briefly address nontrivial cosmological aspects, such as those related to the dark-chiral phase transition, the dark matter production, and an ultraviolet completion related to the ultralight ALP. Comment: 15 pages, 1 figure; discussions on experimental and astrophysical limits revised
Published: 2024

23. UMOD: A Novel and Effective Urban Metro Origin-Destination Flow Prediction Method

Author: Xie, Peng, Ma, Minbo, Wang, Bin, Zhang, Junbo, and Li, Tianrui
Subjects: Computer Science - Machine Learning
Abstract: Accurate prediction of metro Origin-Destination (OD) flow is essential for the development of intelligent transportation systems and effective urban traffic management. Existing approaches typically either predict passenger outflow of departure stations or inflow of destination stations. However, we argue that travelers generally have clearly defined departure and arrival stations, making these OD pairs inherently interconnected. Consequently, considering OD pairs as a unified entity more accurately reflects actual metro travel patterns and allows for analyzing potential spatio-temporal correlations between different OD pairs. To address these challenges, we propose a novel and effective urban metro OD flow prediction method (UMOD), comprising three core modules: a data embedding module, a temporal relation module, and a spatial relation module. The data embedding module projects raw OD pair inputs into hidden space representations, which are subsequently processed by the temporal and spatial relation modules to capture both inter-pair and intra-pair spatio-temporal dependencies. Experimental results on two real-world urban metro OD flow datasets demonstrate that adopting the OD pairs perspective is critical for accurate metro OD flow prediction. Our method outperforms existing approaches, delivering superior predictive performance.
Published: 2024

24. CDM: A Reliable Metric for Fair and Accurate Formula Recognition Evaluation

Author: Wang, Bin, Wu, Fan, Ouyang, Linke, Gu, Zhuangcheng, Zhang, Rui, Xia, Renqiu, Zhang, Bo, and He, Conghui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Formula recognition presents significant challenges due to the complicated structure and varied notation of mathematical expressions. Despite continuous advancements in formula recognition models, the evaluation metrics employed by these models, such as BLEU and Edit Distance, still exhibit notable limitations. They overlook the fact that the same formula has diverse representations and are highly sensitive to the distribution of training data, thereby causing unfairness in formula recognition evaluation. To this end, we propose a Character Detection Matching (CDM) metric, ensuring evaluation objectivity by designing an image-level rather than LaTeX-level metric score. Specifically, CDM renders both the model-predicted LaTeX and the ground-truth LaTeX formulas into image-formatted formulas, then employs visual feature extraction and localization techniques for precise character-level matching, incorporating spatial position information. Such a spatially-aware and character-matching method offers a more accurate and equitable evaluation compared with previous BLEU and Edit Distance metrics that rely solely on text-based character matching. Experimentally, we evaluated various formula recognition models using CDM, BLEU, and ExpRate metrics. The results demonstrate that CDM aligns more closely with human evaluation standards and provides a fairer comparison across different models by eliminating discrepancies caused by diverse formula representations. Comment: Project Website: https://github.com/opendatalab/UniMERNet/tree/main/cdm
Published: 2024

25. Further study of the maximally symmetry breaking patterns in an ${\rm SU}(8)$ theory

Author: Chen, Ning, Chen, Zhiyuan, Hou, Zhanpeng, Teng, Zhaolong, and Wang, Bin
Subjects: High Energy Physics - Phenomenology
Abstract: The ${\rm SU}(8)$ was previously found to be the minimal simple gauge group where all three-generational Standard Model fermions can be non-trivially embedded, and it is maximally broken into ${\rm SU}(8)\to {\cal G}_{441}\equiv {\rm SU}(4)_s \otimes {\rm SU}(4)_W \otimes {\rm U}(1)_{X_0}$ at the GUT scale by the ${\rm SU}(8)$ adjoint Higgs field. Gauge symmetries in the strong and the weak sectors are extended by one and two ranks, respectively. The sequential strong-weak-weak (SWW) symmetry breaking stages were found to generate the observed hierarchical SM quark/lepton masses as well as the Cabibbo-Kobayashi-Maskawa (CKM) mixing pattern with the precise flavor identifications [1, 2]. We further study the possible weak-strong-weak (WSW) and weak-weak-strong (WWS) symmetry breaking patterns, and compare them with the results that we have obtained by following the SWW sequence. The two-loop RGEs following both patterns are derived, and we find that gauge coupling unification cannot be achieved in the field theory framework. Based on these analyses, we suggest that gauge coupling unification be interpreted in the context of the Kač-Moody Lie algebra. Comment: 55 pages plus references, 3 figures, 21 tables. arXiv admin note: substantial text overlap with arXiv:2402.10471
Published: 2024

26. PuYun: Medium-Range Global Weather Forecasting Using Large Kernel Attention Convolutional Networks

Author: Zhu, Shengchen, Chen, Yiming, Yu, Peiying, Qu, Xiang, Zhou, Yuxiao, Ma, Yiming, Zhao, Zhizhan, Liu, Yukai, Mi, Hao, and Wang, Bin
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Physics - Atmospheric and Oceanic Physics
Abstract: Accurate weather forecasting is essential for understanding and mitigating weather-related impacts. In this paper, we present PuYun, an autoregressive cascade model that leverages large kernel attention convolutional networks. The model's design inherently supports extended weather prediction horizons while broadening the effective receptive field. The integration of large kernel attention mechanisms within the convolutional layers enhances the model's capacity to capture fine-grained spatial details, thereby improving its predictive accuracy for meteorological phenomena. We introduce PuYun, comprising PuYun-Short for 0-5 day forecasts and PuYun-Medium for 5-10 day predictions. This approach enhances the accuracy of 10-day weather forecasting. Through evaluation, we demonstrate that PuYun-Short alone surpasses the performance of both GraphCast and FuXi-Short in generating accurate 10-day forecasts. Specifically, on the 10th day, PuYun-Short reduces the RMSE for Z500 to 720 $m^2/s^2$, compared to 732 $m^2/s^2$ for GraphCast and 740 $m^2/s^2$ for FuXi-Short. Additionally, the RMSE for T2M is reduced to 2.60 K, compared to 2.63 K for GraphCast and 2.65 K for FuXi-Short. Furthermore, when employing a cascaded approach by integrating PuYun-Short and PuYun-Medium, our method achieves superior results compared to the combined performance of FuXi-Short and FuXi-Medium. On the 10th day, the RMSE for Z500 is further reduced to 638 $m^2/s^2$, compared to 641 $m^2/s^2$ for FuXi. These findings underscore the effectiveness of our model ensemble in advancing medium-range weather prediction. Our training code and model will be open-sourced.
Published: 2024

27. ToolACE: Winning the Points of LLM Function Calling

Author: Liu, Weiwen, Huang, Xu, Zeng, Xingshan, Hao, Xinlong, Yu, Shuai, Li, Dexun, Wang, Shuai, Gan, Weinan, Liu, Zhengying, Yu, Yuanqing, Wang, Zezhong, Wang, Yuxian, Ning, Wu, Hou, Yutai, Wang, Bin, Wu, Chuhan, Wang, Xinzhi, Liu, Yong, Wang, Yasheng, Tang, Duyu, Tu, Dandan, Shang, Lifeng, Jiang, Xin, Tang, Ruiming, Lian, Defu, Liu, Qun, and Chen, Enhong
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard, rivaling the latest GPT-4 models. Our model and a subset of the data are publicly available at https://huggingface.co/Team-ACE. Comment: 21 pages, 22 figures
Published: 2024

28. Integrated Treatment Technology of Rural Domestic Sewage

Author: Li, Wensheng, Li, Yungui, Zhang, Jianmin, Wang, Fengyu, and Wang, Bin
Subjects: Decentralized Sewage Treatment Technology, Environmental Equipment Manufacturing, Environmental Automatic Control System, Environmental Cloud Management System, Water Treatment Operation and Maintenance Guidelines, bic Book Industry Communication::R Earth sciences, geography, environment, planning::RN The environment::RNH Waste management, bic Book Industry Communication::R Earth sciences, geography, environment, planning::RB Earth sciences::RBK Hydrology & the hydrosphere, bic Book Industry Communication::R Earth sciences, geography, environment, planning::RN The environment::RNP Pollution & threats to the environment, bic Book Industry Communication::R Earth sciences, geography, environment, planning::RN The environment::RNU Sustainability
Abstract: This open access book presents integrated treatment technology for rural domestic sewage and highlights ten typical cases in China. The integrated sewage treatment system (ISTE) combines sewage pretreatment, biological treatment, sedimentation, and disinfection, and is particularly suitable for decentralized domestic wastewater treatment in rural areas without pipe networks. The main advantages of ISTE include its compact structure, small footprint, short construction period, high treatment efficiency, and economical rationality. First applied in Japan, ISTE has achieved rapid growth in China in the last decade. The relevant technological R&D, practice, and promotion experience accumulated by Chinese enterprises can offer a significant reference to other developing countries and regions. The book consists of five chapters. The first chapter introduces the characteristics, environmental risks and management styles of rural domestic sewage. The second chapter illustrates the collection and treatment approaches for rural domestic sewage. The third chapter presents the high-efficiency integrated sewage treatment process. The fourth chapter covers the design concept, manufacturing requirements, automatic control, and the cloud management system for ISTE. Chapter five covers ten typical application cases of ISTE in multiple regions of China. This book is suitable for graduate students, engineers, university teachers and government administrators engaged in decentralized sewage treatment.
Published: 2024
Full Text: View/download PDF

29. Segmentation-guided Layer-wise Image Vectorization with Gradient Fills

Author: Zhou, Hengyu, Zhang, Hui, and Wang, Bin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The widespread use of vector graphics creates a significant demand for vectorization methods. While recent learning-based techniques have shown their capability to create vector images of clear topology, filling these primitives with gradients remains a challenge. In this paper, we propose a segmentation-guided vectorization framework to convert raster images into concise vector graphics with radial gradient fills. With the guidance of an embedded gradient-aware segmentation subroutine, our approach progressively appends gradient-filled Bézier paths to the output, where primitive parameters are initialized with our newly designed initialization technique and are optimized to minimize our novel loss function. We build our method on a differentiable renderer with traditional segmentation algorithms to develop it as a model-free tool for raster-to-vector conversion. It is tested on various inputs to demonstrate its feasibility, independent of datasets, to synthesize vector graphics with improved visual quality and layer-wise topology compared to prior work.
Published: 2024

30. IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities

Author: Wang, Bin, Xie, Chunyu, Leng, Dawei, and Yin, Yuhui
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: In the field of multimodal large language models (MLLMs), common methods typically involve unfreezing the language model during training to foster profound visual understanding. However, the fine-tuning of such models with vision-language data often leads to a diminution of their natural language processing (NLP) capabilities. To avoid this performance degradation, a straightforward solution is to freeze the language model while developing multimodal competencies. Unfortunately, previous works have not attained satisfactory outcomes. Building on the strategy of freezing the language model, we conduct thorough structural exploration and introduce the Inner-Adaptor Architecture (IAA). Specifically, the architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers, thereby enabling the frozen language model to acquire multimodal capabilities. Unlike previous approaches of freezing language models that require large-scale aligned data, our proposed architecture is able to achieve superior performance on small-scale datasets. We conduct extensive experiments to improve the general multimodal capabilities and visual grounding abilities of the MLLM. Our approach remarkably outperforms previous state-of-the-art methods across various vision-language benchmarks without sacrificing performance on NLP tasks. Code and models are available at https://github.com/360CVGroup/Inner-Adaptor-Architecture.
Published: 2024

31. Understanding Literary Texts by LLMs: A Case Study of Ancient Chinese Poetry

Author: Zhao, Cheng, Wang, Bin, and Wang, Zhen
Subjects: Computer Science - Computation and Language
Abstract: The birth and rapid development of large language models (LLMs) have caused quite a stir in the field of literature. Once considered unattainable, AI's role in literary creation is increasingly becoming a reality. In genres such as poetry, jokes, and short stories, numerous AI tools have emerged, offering refreshing new perspectives. However, it is difficult to further improve the quality of these works. This is primarily because understanding and appreciating a good literary work involves a considerable threshold, such as knowledge of literary theory, aesthetic sensibility, and interdisciplinary knowledge. Therefore, authoritative data in this area is quite lacking. Additionally, evaluating literary works is often complex and hard to fully quantify, which directly hinders the further development of AI creation. To address this issue, this paper attempts to explore the mysteries of literary texts from the perspective of LLMs, using ancient Chinese poetry as an example for experimentation. First, we collected a variety of ancient poems from different sources and had experts annotate a small portion of them. Then, we designed a range of comprehension metrics based on LLMs to evaluate all these poems. Finally, we analyzed the correlations and differences between various poem collections to identify literary patterns. Through our experiments, we observed a series of enlightening phenomena that provide technical support for the future development of high-level literary creation based on LLMs.
Published: 2024

32. TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning

Author: Wang, Bin and Wang, Wenqian
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, large-scale pre-trained vision-language models (e.g., CLIP) have garnered significant attention thanks to their powerful representative capabilities. This inspires researchers to transfer the knowledge from these large pre-trained models to other task-specific models, e.g., Video Action Recognition (VAR) models, particularly by leveraging side networks to enhance the efficiency of parameter-efficient fine-tuning (PEFT). However, current transferring approaches in VAR tend to directly transfer the frozen knowledge from large pre-trained models to action recognition networks with minimal cost, instead of exploiting the temporal modeling capabilities of the action recognition models themselves. Therefore, in this paper, we propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transferring and temporal modeling, avoiding backpropagation in frozen parameter models. Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features to strengthen the model's global temporal modeling capabilities. Furthermore, we designed a Side Motion Enhancement Adapter (SME-Adapter) to guide the proposed side network in efficiently learning the rich motion information in videos, thereby improving the side network's ability to capture and learn motion information. Extensive experiments are conducted on three benchmark datasets, including Something-Something V1&V2 and Kinetics-400. Experimental results demonstrate that our approach achieves competitive performance.
Published: 2024

33. SpeechEE: A Novel Benchmark for Speech Event Extraction

Author: Wang, Bin, Zhang, Meishan, Fei, Hao, Zhao, Yu, Li, Bobo, Wu, Shengqiong, Ji, Wei, and Zhang, Min
Subjects: Computer Science - Multimedia
Abstract: Event extraction (EE) is a critical direction in the field of information extraction, laying an important foundation for the construction of structured knowledge bases. EE from text has received ample research and attention for years, yet there can be numerous real-world applications that require direct information acquisition from speech signals, online meeting minutes, interview summaries, press releases, etc. While EE from speech has remained under-explored, this paper fills the gap by pioneering SpeechEE, defined as detecting the event predicates and arguments from a given audio speech. To benchmark the SpeechEE task, we first construct a large-scale high-quality dataset. Based on textual EE datasets under the sentence, document, and dialogue scenarios, we convert texts into speeches through both manual real-person narration and automatic synthesis, empowering the data with diverse scenarios, languages, domains, ambiences, and speaker styles. Further, to effectively address the key challenges in the task, we tailor an E2E SpeechEE system based on the encoder-decoder architecture, where a novel Shrinking Unit module and a retrieval-aided decoding mechanism are devised. Extensive experimental results on all SpeechEE subsets demonstrate the efficacy of the proposed model, offering a strong baseline for the task. Finally, being the first work on this topic, we shed light on key directions for future research.
Published: 2024

34. Hypersurfaces of constant scalar curvature in hyperbolic space with prescribed asymptotic boundary at infinity

Author: Wang, Bin
Subjects: Mathematics - Differential Geometry, Primary 53C21, Secondary 35J60, 53C40
Abstract: In this note, we study the asymptotic Plateau problem in hyperbolic space, and we prove the existence of a smooth complete hypersurface of constant scalar curvature in hyperbolic space with a prescribed asymptotic boundary at infinity. Following a pioneering work of Bo Guan and Joel Spruck, we seek the solution as a vertical graph over a bounded domain and solve the corresponding Dirichlet problem for a fully nonlinear partial differential equation by establishing the crucial second order estimates for admissible solutions. Our proof consists of three main ingredients: (1) a non-standard, special choice for the parameter in the auxiliary function, (2) a so-called almost-Jacobi inequality for the equation operator, and (3) a set of arguments which reduce the situation to semi-convex case and which keep the coefficient of the troublesome negative term within a suitable magnitude. Comment: fixed minor typos
Published: 2024

35. Gravitational odd-parity perturbation of a Horndeski hairy black hole: quasinormal mode and parameter constraint

Author: Yang, Zhen-Hao, Lei, Yun-He, Kuang, Xiao-Mei, and Wang, Bin
Subjects: General Relativity and Quantum Cosmology
Abstract: In binary black hole coalescence, the gravitational wave emitted at the ringdown stage can be well described within black hole perturbation theory, where the quasinormal modes (QNMs) become an important ingredient in modeling the waveform pattern. In general relativity (GR), the QNMs can be obtained by solving the Regge-Wheeler equation for a static black hole, while in Horndeski gravity, the metric perturbation equation can be simplified into a modified Regge-Wheeler equation from the perturbed action. In this paper, we calculate the QNM frequencies of the gravitational odd-parity perturbation of a specific hairy black hole in Horndeski gravity using the matrix method and the pseudo-spectral method. Our results indicate that such a Horndeski hairy black hole is stable under the odd perturbation, which is also verified by the time evolution of the perturbation. In particular, we find that for a certain range of the Horndeski hair, the $\ell>2$ modes become the long-lived modes instead of the $\ell=2$ mode in GR. Then, we use the ringdown QNMs to preliminarily investigate the signal-to-noise ratio (SNR) rescaled measurement error of the Horndeski hair. We find significant effects of the angular momentum and overtone on the error bound of the hairy parameter. We hope that our findings could inspire more theoretical and phenomenological work on testing the no-hair theorem of black holes with gravitational wave physics. Comment: 16 pages; v2: corrected typos and added references
Published: 2024

36. CSI-Free Position Optimization for Movable Antenna Communication Systems: A Black-Box Optimization Approach

Author: Zeng, Xianlong, Fang, Jun, Wang, Bin, Ning, Boyu, and Li, Hongbin
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: Movable antenna (MA) is a new technology which leverages local movement of antennas to improve channel qualities and enhance the communication performance. Nevertheless, to fully realize the potential of MA systems, complete channel state information (CSI) between the transmitter-MA and the receiver-MA is required, which involves estimating a large number of channel parameters and incurs an excessive amount of training overhead. To address this challenge, in this paper, we propose a CSI-free MA position optimization method. The basic idea is to treat position optimization as a black-box optimization problem and calculate the gradient of the unknown objective function using zeroth-order (ZO) gradient approximation techniques. Simulation results show that the proposed ZO-based method, through adaptively adjusting the position of the MA, can achieve a favorable signal-to-noise ratio (SNR) using a smaller number of position measurements than the CSI-based approach. Such a merit makes the proposed algorithm more adaptable to fast-changing propagation channels. Comment: 5 pages, 4 figures, published in IEEE WCL
Published: 2024
Full Text: View/download PDF
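
Note on item 36: the abstract describes treating position optimization as a black-box problem and approximating the gradient with zeroth-order (ZO) techniques. The sketch below illustrates that general idea with a generic two-point random-direction estimator. It is not the authors' algorithm: the function names, the toy objective standing in for a measured SNR, and all step sizes are assumptions made purely for illustration.

```python
import random


def zo_gradient(f, x, mu=1e-3, num_dirs=20, rng=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Only function evaluations are used: f is probed at x + mu*u and
    x - mu*u for random Gaussian directions u, and the resulting
    directional slopes are averaged. (Generic estimator, assumed here.)
    """
    rng = rng or random.Random(0)
    d = len(x)
    grad = [0.0] * d
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i in range(d):
            grad[i] += slope * u[i] / num_dirs
    return grad


def zo_ascent(f, x0, lr=0.05, steps=200, **kw):
    """Maximize a black-box objective f by ascent along ZO gradient estimates."""
    x = list(x0)
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    for _ in range(steps):
        g = zo_gradient(f, x, rng=rng, **kw)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x


def toy_objective(p):
    """Hypothetical stand-in for a measured SNR, peaking at position (0.3, -0.2)."""
    return -((p[0] - 0.3) ** 2 + (p[1] + 0.2) ** 2)


if __name__ == "__main__":
    print(zo_ascent(toy_objective, [0.0, 0.0]))
```

Each estimate costs 2 × num_dirs objective evaluations, so in a measurement-limited setting the number of directions trades estimate quality against the number of position measurements.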

37. In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models

Author: Joaquin, Ayrton San, Wang, Bin, Liu, Zhengyuan, Asher, Nicholas, Lim, Brian, Muller, Philippe, and Chen, Nancy F.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Despite advancements, fine-tuning Large Language Models (LLMs) remains costly due to the extensive parameter count and substantial data requirements for model generalization. Accessibility to computing resources remains a barrier for the open-source community. To address this challenge, we propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model. Notably, we assess the model's internal gradients to estimate this relationship, aiming to rank the contribution of each training point. To enhance efficiency, we propose an optimization to compute influence functions with a reduced number of layers while achieving similar accuracy. By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data. Meanwhile, using influence functions to analyze model coverage of certain testing samples could provide a reliable and interpretable signal on the training set's coverage of those test points. Comment: EMNLP 2024 - Findings
Published: 2024

38. Walk Wisely on Graph: Knowledge Graph Reasoning with Dual Agents via Efficient Guidance-Exploration

Author: Wang, Zijian, Wang, Bin, Jing, Haifeng, Li, Huayu, and Dou, Hongbo
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: In recent years, multi-hop reasoning has been widely studied for knowledge graph (KG) reasoning due to its efficacy and interpretability. However, previous multi-hop reasoning approaches are subject to two primary shortcomings. First, agents struggle to learn effective and robust policies at the early phase due to sparse rewards. Second, these approaches often falter on specific datasets like sparse knowledge graphs, where agents are required to traverse lengthy reasoning paths. To address these problems, we propose a multi-hop reasoning model with dual agents based on hierarchical reinforcement learning (HRL), which is named FULORA. FULORA tackles the above reasoning challenges by eFficient GUidance-ExpLORAtion between dual agents. The high-level agent walks on the simplified knowledge graph to provide stage-wise hints for the low-level agent walking on the original knowledge graph. In this framework, the low-level agent optimizes a value function that balances two objectives: (1) maximizing return, and (2) integrating efficient guidance from the high-level agent. Experiments conducted on three real-world knowledge graph datasets demonstrate that FULORA outperforms RL-based baselines, especially in the case of long-distance reasoning.
Published: 2024

39. Image Re-Identification: Where Self-supervision Meets Vision-Language Learning

Author: Wang, Bin, Liang, Yuying, Cai, Lei, Huang, Huakun, and Zeng, Huanqiang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, large-scale vision-language pre-trained models like CLIP have shown impressive performance in image re-identification (ReID). In this work, we explore whether self-supervision can aid in the use of CLIP for image ReID tasks. Specifically, we propose SVLL-ReID, the first attempt to integrate self-supervision and pre-trained CLIP via two training stages to facilitate image ReID. We observe that: 1) incorporating language self-supervision in the first training stage can make the learnable text prompts more distinguishable, and 2) incorporating vision self-supervision in the second training stage can make the image features learned by the image encoder more discriminative. These observations imply that: 1) the text prompt learning in the first stage can benefit from the language self-supervision, and 2) the image feature learning in the second stage can benefit from the vision self-supervision. These benefits jointly facilitate the performance gain of the proposed SVLL-ReID. By conducting experiments on six image ReID benchmark datasets without any concrete text labels, we find that the proposed SVLL-ReID achieves the best overall performance compared with state-of-the-art methods. Code will be publicly available at https://github.com/BinWangGzhu/SVLL-ReID.
Published: 2024

40. Two-dimensional DtN-FEM scattering analysis of SH guided waves by an interface debonding in a double-layered plate

Author: Yang, Chen, Qin, Ruigang, Hirose, Sohichi, Wang, Bin, and Qian, Zhenghua
Subjects: Mathematics - Numerical Analysis
Abstract: In this paper, a two-dimensional Dirichlet-to-Neumann (DtN) finite element method (FEM) is developed to analyze the scattering of SH guided waves due to an interface delamination in a bi-material plate. During the finite element analysis, it is necessary to determine the far-field DtN conditions at virtual boundaries where both displacements and tractions are unknown. In this study, firstly, the scattered waves at the virtual boundaries are represented by a superposition of guided waves with unknown scattered coefficients. Secondly, utilizing the mode orthogonality, the unknown tractions at virtual boundaries are expressed in terms of the unknown scattered displacements at virtual boundaries via scattered coefficients. Thirdly, this relationship at virtual boundaries is assembled into the global DtN-FEM matrix to solve the problem. This method is simple and elegant: it offers advantages in dimension reduction and, unlike traditional FEM, needs no absorption medium or perfectly matched layer to suppress the reflected waves. Furthermore, the reflection and transmission coefficients of each guided mode can be directly obtained without post-processing. The proposed DtN-FEM will be compared with the boundary element method (BEM), and finally validated on several benchmark problems.
Published: 2024

41. A New Dataset and Framework for Real-World Blurred Images Super-Resolution

Author: Qin, Rui, Sun, Ming, Zhou, Chao, and Wang, Bin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent Blind Image Super-Resolution (BSR) methods have shown proficiency in general images. However, we find that the efficacy of recent methods diminishes markedly when employed on image data with blur, while image data with intentional blur constitute a substantial proportion of general data. To further investigate and address this issue, we developed a new super-resolution dataset specifically tailored for blur images, named the Real-world Blur-kept Super-Resolution (ReBlurSR) dataset, which consists of nearly 3000 defocus and motion blur image samples with diverse blur sizes and varying blur intensities. Furthermore, we propose a new BSR framework for blur images called Perceptual-Blur-adaptive Super-Resolution (PBaSR), which comprises two main modules: the Cross Disentanglement Module (CDM) and the Cross Fusion Module (CFM). The CDM utilizes a dual-branch parallelism to isolate conflicting blur and general data during optimization. The CFM fuses the well-optimized prior from these distinct domains cost-effectively and efficiently based on model interpolation. By integrating these two modules, PBaSR achieves commendable performance on both general and blur data without any additional inference and deployment cost and is generalizable across multiple model architectures. Extensive experiments show that PBaSR achieves state-of-the-art performance across various metrics without incurring extra inference costs. On the widely adopted LPIPS metric, PBaSR achieves an improvement of approximately 0.02-0.10 with diverse anchor methods and blur types, across both the ReBlurSR and multiple common general BSR benchmarks. Code is available at https://github.com/Imalne/PBaSR.
Published: 2024

42. Learning Structurally Stabilized Representations for Multi-modal Lossless DNA Storage

Author: Cao, Ben, He, Tiantian, Li, Xue, Wang, Bin, Wu, Xiaohu, Zhang, Qiang, and Ong, Yew-Soon
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Emerging Technologies, Computer Science - Information Theory, Quantitative Biology - Biomolecules
Abstract: In this paper, we present Reed-Solomon coded single-stranded representation learning (RSRL), a novel end-to-end model for learning representations for multi-modal lossless DNA storage. In contrast to existing learning-based methods, the proposed RSRL is inspired by both error-correction codec and structural biology. Specifically, RSRL first learns the representations for the subsequent storage from the binary data transformed by the Reed-Solomon codec. Then, the representations are masked by an RS-code-informed mask to focus on correcting the burst errors occurring in the learning process. With the decoded representations with error corrections, a novel biologically stabilized loss is formulated to regularize the data representations to possess stable single-stranded structures. By incorporating these novel strategies, the proposed RSRL can learn highly durable, dense, and lossless representations for the subsequent storage tasks into DNA sequences. The proposed RSRL has been compared with a number of strong baselines in real-world tasks of multi-modal data storage. The experimental results obtained demonstrate that RSRL can store diverse types of data with much higher information density and durability but much lower error rates.
Published: 2024

43. Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations

Author: Shen, Bowen, Lin, Zheng, Zha, Daren, Liu, Wei, Luan, Jian, Wang, Bin, and Wang, Weiping
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Structured pruning fundamentally reduces computational and memory overheads of large language models (LLMs) and offers a feasible solution for end-side LLM deployment. Structurally pruned models remain dense and high-precision, highly compatible with further tuning and compression. However, as the coarse-grained structured pruning poses large damage to the highly interconnected model, achieving a high compression ratio for scaled-up LLMs remains a challenge. In this paper, we introduce a task-agnostic structured pruning approach coupled with a compact Transformer architecture design. The proposed approach, named TransAct, reduces transitional activations inside multi-head attention (MHA) and multi-layer perceptron (MLP) modules, while preserving the inter-module activations that are sensitive to perturbations. Hence, the LLM is pruned into an intra-module low-rank architecture, significantly reducing weights, KV Cache and attention computation. TransAct is implemented on the LLaMA model and evaluated on downstream benchmarks. Results verify the optimality of our approach at high compression with respect to both efficiency and performance. Further, ablation studies reveal the strength of activation-guided iterative pruning and provide experimental analysis on the redundancy of MHA and MLP modules.
Comment: Findings of ACL 2024
Published: 2024

44. A General Maximum Principle for Progressive Optimal Control of Fully Coupled Forward-Backward Stochastic Systems with Jumps

Author: Wang, Bin, Si, Yu, and Shi, Jingtao
Subjects: Mathematics - Optimization and Control, 93E20, 49K45, 60H10, 60G55
Abstract: This paper is concerned with a general maximum principle for the fully coupled forward-backward stochastic optimal control problem with jumps, where the control domain is not necessarily convex, within the progressively measurable framework. It is worth noting that not only the control variable but also the jump size "$e$" enters into all the coefficients. We first proposed that the solution $Z$ of BSDEP also contains the variable "$e$", which differs from previous articles; we provide an explanation in Remark 2.1.
Comment: 32 pages
Published: 2024

45. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Author: Zhang, Pan, Dong, Xiaoyi, Zang, Yuhang, Cao, Yuhang, Qian, Rui, Chen, Lin, Guo, Qipeng, Duan, Haodong, Wang, Bin, Ouyang, Linke, Zhang, Songyang, Zhang, Wenwei, Li, Yining, Gao, Yang, Sun, Peng, Zhang, Xinyue, Li, Wei, Li, Jingwen, Wang, Wenhai, Yan, Hang, He, Conghui, Zhang, Xingcheng, Chen, Kai, Dai, Jifeng, Qiao, Yu, Lin, Dahua, and Wang, Jiaqi
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.
Comment: Technical Report. https://github.com/InternLM/InternLM-XComposer
Published: 2024

46. A Wolf in Sheep's Clothing: Practical Black-box Adversarial Attacks for Evading Learning-based Windows Malware Detection in the Wild

Author: Ling, Xiang, Wu, Zhiyu, Wang, Bin, Deng, Wei, Wu, Jingzheng, Ji, Shouling, Luo, Tianyue, and Wu, Yanjun
Subjects: Computer Science - Cryptography and Security
Abstract: Given the remarkable achievements of existing learning-based malware detection in both academia and industry, this paper presents MalGuise, a practical black-box adversarial attack framework that evaluates the security risks of existing learning-based Windows malware detection systems under the black-box setting. MalGuise first employs a novel semantics-preserving transformation of call-based redividing to concurrently manipulate both nodes and edges of malware's control-flow graph, making it less noticeable. By employing a Monte-Carlo-tree-search-based optimization, MalGuise then searches for an optimized sequence of call-based redividing transformations to apply to the input Windows malware for evasions. Finally, it reconstructs the adversarial malware file based on the optimized transformation sequence while adhering to Windows executable format constraints, thereby maintaining the same semantics as the original. MalGuise is systematically evaluated against three state-of-the-art learning-based Windows malware detection systems under the black-box setting. Evaluation results demonstrate that MalGuise achieves a remarkably high attack success rate, mostly exceeding 95%, with over 91% of the generated adversarial malware files maintaining the same semantics. Furthermore, MalGuise achieves up to a 74.97% attack success rate against five anti-virus products, highlighting potential tangible security concerns to real-world users.
Comment: This paper has been accepted by 33rd USENIX Security Symposium 2024
Published: 2024

47. Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents

Author: Deng, Shihan, Xu, Weikai, Sun, Hongda, Liu, Wei, Tan, Tao, Liu, Jianfeng, Li, Ang, Luan, Jian, Wang, Bin, Yan, Rui, and Shang, Shuo
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: With the remarkable advancements of large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, there is a scarcity of benchmarks available for LLM-based mobile agents. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations imposes limitations to task evaluation. (2) Specific instructions within a singular application lack adequacy for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics are insufficient to accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to accelerate the efficiency of task completion. Subsequently, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capabilities for mobile agents, our data is categorized into three distinct groups: SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-APP collaboration scenarios. Furthermore, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.
Published: 2024

48. A Cross Spatio-Temporal Pathology-based Lung Nodule Dataset

Author: Jian, Muwei, Zhang, Haoran, Shao, Mingju, Chen, Hongyu, Huang, Huihui, Zhong, Yanjie, Zhang, Changlei, Wang, Bin, and Gao, Penghui
Subjects: Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Recently, intelligent analysis of lung nodules with the assistance of computer-aided detection (CAD) techniques can improve the accuracy rate of lung cancer diagnosis. However, existing CAD systems and pulmonary datasets mainly focus on Computed Tomography (CT) images from one single period, while ignoring the cross spatio-temporal features associated with the progression of nodules contained in imaging data from various captured periods of lung cancer. If the evolution patterns of nodules across various periods in the patients' CT sequences can be explored, it will play a crucial role in guiding the precise screening and identification of lung cancer. Therefore, a cross spatio-temporal lung nodule dataset with pathological information for nodule identification and diagnosis is constructed, which contains 328 CT sequences and 362 annotated nodules from 109 patients. This comprehensive database is intended to drive research in the field of CAD towards more practical and robust methods, and also to contribute to the further exploration of precision-medicine-related fields. To ensure patient confidentiality, we have removed sensitive information from the dataset.
Published: 2024

49. AudioBench: A Universal Benchmark for Audio Large Language Models

Author: Wang, Bin, Zou, Xunlong, Lin, Geyu, Sun, Shuo, Liu, Zhuohan, Zhang, Wenyu, Liu, Zhengyuan, Aw, AiTi, and Chen, Nancy F.
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce AudioBench, a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, among which 7 are newly proposed datasets. The evaluation targets three main aspects: speech understanding, audio scene understanding, and voice understanding (paralinguistic). Despite recent advancements, there is no comprehensive benchmark for AudioLLMs on instruction-following capabilities conditioned on audio signals. AudioBench addresses this gap by setting up datasets as well as desired evaluation metrics. We also evaluated the capabilities of five popular models and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-sourced evaluation toolkit, data, and leaderboard will offer a robust testbed for future model developments.
Comment: v4 - Add acknowledgment and slight update on structure; Code: https://github.com/AudioLLMs/AudioBench
Published: 2024

50. EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot

Author: Fei, Hao, Zhang, Han, Wang, Bin, Liao, Lizi, Liu, Qian, and Cambria, Erik
Subjects: Computer Science - Multimedia
Abstract: This paper introduces EmpathyEar, a pioneering open-source, avatar-based multimodal empathetic chatbot, to fill the gap in traditional text-only empathetic response generation (ERG) systems. Leveraging the advancements of a large language model, combined with multimodal encoders and generators, EmpathyEar supports user inputs in any combination of text, sound, and vision, and produces multimodal empathetic responses, offering users not just textual responses but also digital avatars with talking faces and synchronized speech. A series of emotion-aware instruction-tuning is performed for comprehensive emotional understanding and generation capabilities. In this way, EmpathyEar provides users with responses that achieve a deeper emotional resonance, closely emulating human-like empathy. The system paves the way for the next emotional intelligence, for which we open-source the code for public access.
Comment: ACL 2024 Demonstration Paper
Published: 2024
