Author: "Zhang, Yichi" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Zhang, Yichi" returned 3,204 results

1. Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Author: Zhao, Haozhe, Si, Shuzheng, Chen, Liang, Zhang, Yichi, Sun, Maosong, Zhang, Mingjia, and Chang, Baobao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at https://lacing-lvlm.github.io.
Comment: 19 pages, 12 figures
Published: 2024

2. MKGL: Mastery of a Three-Word Language

Author: Guo, Lingbing, Bo, Zhongpu, Chen, Zhuo, Zhang, Yichi, Chen, Jiaoyan, Lan, Yarong, Sun, Mengshu, Zhang, Zhiqiang, Luo, Yangyifei, Li, Qian, Zhang, Qiang, Zhang, Wen, and Chen, Huajun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have significantly advanced performance across a spectrum of natural language processing (NLP) tasks. Yet, their application to knowledge graphs (KGs), which describe facts in the form of triplets and allow minimal hallucinations, remains an underexplored frontier. In this paper, we investigate the integration of LLMs with KGs by introducing a specialized KG Language (KGL), where a sentence precisely consists of an entity noun, a relation verb, and ends with another entity noun. Despite KGL's unfamiliar vocabulary to the LLM, we facilitate its learning through a tailored dictionary and illustrative sentences, and enhance context understanding via real-time KG context retrieval and KGL token embedding augmentation. Our results reveal that LLMs can achieve fluency in KGL, drastically reducing errors compared to conventional KG embedding methods on KG completion. Furthermore, our enhanced LLM shows exceptional competence in generating accurate three-word sentences from an initial entity and interpreting new unseen terms out of KGs.
Comment: NeurIPS 2024 (spotlight)
Published: 2024

3. Liver Cancer Knowledge Graph Construction based on dynamic entity replacement and masking strategies RoBERTa-BiLSTM-CRF model

Author: Zhang, YiChi, Wang, HaiLing, Gao, YongBin, Hu, XiaoJun, Fan, YingFang, and Fang, ZhiJun
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence
Abstract: Background: Liver cancer ranks as the fifth most common malignant tumor and the second most fatal in our country. Early diagnosis is crucial, necessitating that physicians identify liver cancer in patients at the earliest possible stage. However, the diagnostic process is complex and demanding. Physicians must analyze a broad spectrum of patient data, encompassing physical condition, symptoms, medical history, and results from various examinations and tests, recorded in both structured and unstructured medical formats. This results in a significant workload for healthcare professionals. In response, integrating knowledge graph technology to develop a liver cancer knowledge graph-assisted diagnosis and treatment system aligns with national efforts toward smart healthcare. Such a system promises to mitigate the challenges faced by physicians in diagnosing and treating liver cancer. Methods: This paper addresses the major challenges in building a knowledge graph for hepatocellular carcinoma diagnosis, such as the discrepancy between public data sources and real electronic medical records, the effective integration of which remains a key issue. The knowledge graph construction process consists of six steps: conceptual layer design, data preprocessing, entity identification, entity normalization, knowledge fusion, and graph visualization. A novel Dynamic Entity Replacement and Masking Strategy (DERM) for named entity recognition is proposed. Results: A knowledge graph for liver cancer was established, including 7 entity types such as disease, symptom, and constitution, containing 1495 entities. The recognition accuracy of the model was 93.23%, the recall was 94.69%, and the F1 score was 93.96%.
Published: 2024

4. MetaOOD: Automatic Selection of OOD Detection Models

Author: Qin, Yuehan, Zhang, Yichi, Nian, Yi, Ding, Xueying, and Zhao, Yue
Subjects: Computer Science - Machine Learning
Abstract: How can we automatically select an out-of-distribution (OOD) detection model for various underlying tasks? This is crucial for maintaining the reliability of open-world applications by identifying data distribution shifts, particularly in critical domains such as online transactions, autonomous driving, and real-time patient diagnosis. Despite the availability of numerous OOD detection methods, the challenge of selecting an optimal model for diverse tasks remains largely underexplored, especially in scenarios lacking ground truth labels. In this work, we introduce MetaOOD, the first zero-shot, unsupervised framework that utilizes meta-learning to automatically select an OOD detection model. As a meta-learning approach, MetaOOD leverages historical performance data of existing methods across various benchmark OOD datasets, enabling the effective selection of a suitable model for new datasets without the need for labeled data at the test time. To quantify task similarities more accurately, we introduce language model-based embeddings that capture the distinctive OOD characteristics of both datasets and detection models. Through extensive experimentation with 24 unique test dataset pairs to choose from among 11 OOD detection models, we demonstrate that MetaOOD significantly outperforms existing methods and only brings marginal time overhead. Our results, validated by Wilcoxon statistical tests, show that MetaOOD surpasses a diverse group of 11 baselines, including established OOD detectors and advanced unsupervised selection methods.
Comment: Best paper at 2024 KDD Workshop on Resource-Efficient Learning. Extended version
Published: 2024

5. A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Author: Chen, Liang, Tan, Sinan, Cai, Zefan, Xie, Weichu, Zhao, Haozhe, Zhang, Yichi, Lin, Junyang, Bai, Jinze, Liu, Tianyu, and Chang, Baobao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at https://github.com/chenllliang/DnD-Transformer.
Comment: 25 pages, 20 figures, code is open at https://github.com/chenllliang/DnD-Transformer
Published: 2024

6. A Unified Framework of Bond-Associated Peridynamic Material Correspondence Models

Author: Hu, Xuan, Chen, Hailong, Zhang, Yichi, and Wang, Zening
Subjects: Condensed Matter - Materials Science, Physics - Classical Physics
Abstract: This paper presents a unified framework for bond-associated peridynamic material correspondence models that were proposed to inherently address the issue of material instability or existence of zero-energy modes in the conventional correspondence formulation. The conventional formulation is well-known for having the issue of material instability due to the non-unique mapping between bond force density state and nonlocal deformation gradient. Several bond-associated models that employ bond-level deformation gradients address this issue in a very effective and inherent manner. Although different approaches were taken to formulate the bond-level deformation gradient so that the bond-associated quantities can be captured more accurately, a detailed study finds that a unified systematic framework exists for these models. It is the purpose of this paper to consolidate these approaches by providing a unified and systematic framework for bond-associated peridynamic correspondence models. Based on all the bond-associated deformation gradients proposed in the literature, a unified bond-associated deformation gradient is formulated. Assuming energy equivalence with the local continuum mechanics theory, the unified bond force density state is derived using the Fréchet derivative. Additionally, the properties of the formulated unified framework including linear momentum balance, angular momentum balance, and objectivity are thoroughly examined. This work serves as a valuable reference for the further development and application of bond-associated correspondence formulations in peridynamics.
Published: 2024

7. Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Author: Ye, Chengxi, Chu, Grace, Liu, Yanfeng, Zhang, Yichi, Lew, Lukasz, and Howard, Andrew
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Mathematics - Numerical Analysis
Abstract: The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.
Published: 2024

8. Best of two worlds: Cartesian sampling and volume computation for distance-constrained configuration spaces using Cayley coordinates

Author: Zhang, Yichi and Sitharam, Meera
Subjects: Computer Science - Computational Geometry
Abstract: Volume calculation of configurational spaces acts as a vital part in configurational entropy calculation, which contributes towards calculating free energy landscape for molecular systems. In this article, we present our sampling-based volume computation method using distance-based Cayley coordinate, mitigating drawbacks: our method guarantees that the sampling procedure stays in lower-dimensional coordinate space (instead of higher-dimensional Cartesian space) throughout the whole process; and our mapping function, utilizing Cayley parameterization, can be applied in both directions with low computational cost. Our method uniformly samples and computes a discrete volume measure of a Cartesian configuration space of point sets satisfying systems of distance inequality constraints. The systems belong to a large natural class whose feasible configuration spaces are effectively lower dimensional subsets of high dimensional ambient space. Their topological complexity makes discrete volume computation challenging, yet necessary in several application scenarios including free energy calculation in soft matter assembly modeling. The algorithm runs in linear time and empirically sub-linear space in the number of grid hypercubes (used to define the discrete volume measure) that intersect the configuration space. In other words, the number of wasted grid cube visits is insignificant compared to prevailing methods typically based on gradient descent. Specifically, the traversal stays within the feasible configuration space by viewing it as a branched covering, using a recent theory of Cayley or distance coordinates to convexify the base space, and by employing a space-efficient, frontier hypercube traversal data structure. A software implementation and comparison with existing methods is provided.
Published: 2024

9. Unleashing the Potential of SAM2 for Biomedical Images and Videos: A Survey

Author: Zhang, Yichi and Shen, Zhenrong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The unprecedented developments in segmentation foundational models have become a dominant force in the field of computer vision, introducing a multitude of previously unexplored capabilities in a wide range of natural images and videos. Specifically, the Segment Anything Model (SAM) signifies a noteworthy expansion of the prompt-driven paradigm into the domain of image segmentation. The recent introduction of SAM2 effectively extends the original SAM to a streaming fashion and demonstrates strong performance in video segmentation. However, due to the substantial distinctions between natural and medical images, the effectiveness of these models on biomedical images and videos is still under exploration. This paper presents an overview of recent efforts in applying and adapting SAM2 to biomedical images and videos. The findings indicate that while SAM2 shows promise in reducing annotation burdens and enabling zero-shot segmentation, its performance varies across different datasets and tasks. Addressing the domain gap between natural and medical images through adaptation and fine-tuning is essential to fully unleash SAM2's potential in clinical applications. To support ongoing research endeavors, we maintain an active repository that contains up-to-date SAM & SAM2-related papers and projects at https://github.com/YichiZhang98/SAM4MIS.
Published: 2024

10. Prompt Your Brain: Scaffold Prompt Tuning for Efficient Adaptation of fMRI Pre-trained Model

Author: Dong, Zijian, Wu, Yilei, Chen, Zijiao, Zhang, Yichi, Jin, Yueming, and Zhou, Juan Helen
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: We introduce Scaffold Prompt Tuning (ScaPT), a novel prompt-based framework for adapting large-scale functional magnetic resonance imaging (fMRI) pre-trained models to downstream tasks, with high parameter efficiency and improved performance compared to fine-tuning and baselines for prompt tuning. The full fine-tuning updates all pre-trained parameters, which may distort the learned feature space and lead to overfitting with limited training data which is common in fMRI fields. In contrast, we design a hierarchical prompt structure that transfers the knowledge learned from high-resource tasks to low-resource ones. This structure, equipped with a Deeply-conditioned Input-Prompt (DIP) mapping module, allows for efficient adaptation by updating only 2% of the trainable parameters. The framework enhances semantic interpretability through attention mechanisms between inputs and prompts, and it clusters prompts in the latent space in alignment with prior knowledge. Experiments on public resting state fMRI datasets reveal ScaPT outperforms fine-tuning and multitask-based prompt tuning in neurodegenerative diseases diagnosis/prognosis and personality trait prediction, even with fewer than 20 participants. It highlights ScaPT's efficiency in adapting pre-trained fMRI models to low-resource tasks.
Comment: MICCAI 2024
Published: 2024

11. Timeliness-Fidelity Tradeoff in 3D Scene Representations

Author: Xu, Xiangmin, Meng, Zhen, Zhang, Yichi, She, Changyang, and Zhao, Philip G.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Real-time three-dimensional (3D) scene representations serve as one of the building blocks that bolster various innovative applications, e.g., digital manufacturing, Virtual/Augmented/Extended/Mixed Reality (VR/AR/XR/MR), and the metaverse. Despite substantial efforts that have been made to real-time communications and computing, real-time 3D scene representations remain a challenging task. This paper investigates the tradeoff between timeliness and fidelity in real-time 3D scene representations. Specifically, we establish a framework to evaluate the impact of communication delay on the tradeoff, where the real-world scenario is monitored by multiple cameras that communicate with an edge server. To improve fidelity for 3D scene representations, we propose to use a single-step Proximal Policy Optimization (PPO) method that leverages the Age of Information (AoI) to decide if the received image needs to be involved in 3D scene representations and rendering. We test our framework and the proposed approach with different well-known 3D scene representation methods. Simulation results reveal that real-time 3D scene representation can be sensitively affected by communication delay, and our proposed method can achieve optimal 3D scene representation results.
Comment: This paper has been accepted for publication by the IEEE International Conference on Computer Communications (INFOCOM) Workshops 2024
Published: 2024

12. Cloud-based Semi-Quantum Money

Author: Zhang, Yichi, Jin, Siyuan, Huang, Yuhan, Zeng, Bei, and Shao, Qiming
Subjects: Quantum Physics, Computer Science - Cryptography and Security, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: In the 1970s, Wiesner introduced the concept of quantum money, where quantum states generated according to specific rules function as currency. These states circulate among users with quantum resources through quantum channels or face-to-face interactions. Quantum mechanics grants quantum money physical-level unforgeability but also makes minting, storing, and circulating it significantly challenging. Currently, quantum computers capable of minting and preserving quantum money have not yet emerged, and existing quantum channels are not stable enough to support the efficient transmission of quantum states for quantum money, limiting its practicality. Semi-quantum money schemes support fully classical transactions and complete classical banks, reducing dependence on quantum resources and enhancing feasibility. To further minimize the system's reliance on quantum resources, we propose a cloud-based semi-quantum money (CSQM) scheme. This scheme relies only on semi-honest third-party quantum clouds, while the rest of the system remains entirely classical. We also discuss estimating the computational power required by the quantum cloud for the scheme and conduct a security analysis.
Published: 2024

13. MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine

Author: Zhang, Renrui, Wei, Xinyu, Jiang, Dongzhi, Guo, Ziyu, Li, Shicheng, Zhang, Yichi, Tong, Chengzhuo, Liu, Jiaming, Zhou, Aojun, Wei, Bin, Zhang, Shanghang, Gao, Peng, Li, Chunyuan, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The mathematical capabilities of Multi-modal Large Language Models (MLLMs) remain under-explored with three areas to be improved: visual encoding of math diagrams, diagram-language alignment, and chain-of-thought (CoT) reasoning. This draws forth an urgent demand for an effective training paradigm and a large-scale, comprehensive dataset with detailed CoT rationales, which is challenging to collect and costly to annotate manually. To tackle this issue, we propose MAVIS, a MAthematical VISual instruction tuning pipeline for MLLMs, featuring an automatic data engine to efficiently create mathematical visual datasets. We design the data generation process to be entirely independent of human intervention or GPT API usage, while ensuring the diagram-caption correspondence, question-answer correctness, and CoT reasoning quality. With this approach, we curate two datasets, MAVIS-Caption (558K diagram-caption pairs) and MAVIS-Instruct (834K visual math problems with CoT rationales), and propose four progressive stages for training MLLMs from scratch. First, we utilize MAVIS-Caption to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we also leverage MAVIS-Caption to align the CLIP-Math with a large language model (LLM) by a projection layer, enhancing vision-language alignment in mathematical domains. Third, we adopt MAVIS-Instruct to perform the instruction tuning for robust problem-solving skills, and term the resulting model as MAVIS-7B. Fourth, we apply Direct Preference Optimization (DPO) to enhance the CoT capabilities of our model, further refining its step-wise reasoning performance. Code and data will be released at https://github.com/ZrrSkywalker/MAVIS.
Comment: Data and models will be released at https://github.com/ZrrSkywalker/MAVIS
Published: 2024

14. MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

Author: Huang, Jinsheng, Chen, Liang, Guo, Taian, Zeng, Fu, Zhao, Yusheng, Wu, Bohan, Yuan, Ye, Zhao, Haozhe, Guo, Zhihui, Zhang, Yichi, Yuan, Jingyang, Ju, Wei, Liu, Luchen, Liu, Tianyu, Chang, Baobao, and Zhang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.
Comment: 21 pages, code released at https://github.com/chenllliang/MMEvalPro, Homepage at https://mmevalpro.github.io/
Published: 2024
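
The dataset sizes quoted in the MMEvalPro abstract can be checked with a little arithmetic. The following is a minimal sketch; the variable names are our own, and the per-triplet split simply restates the abstract's description of one original question plus two human-written ones.

```python
# Counts quoted in the MMEvalPro abstract.
triplets = 2138             # question triplets
questions_per_triplet = 3   # original + perception + knowledge anchor

total_questions = triplets * questions_per_triplet

# Each triplet keeps one question from an existing benchmark and adds two
# questions written by human annotators, which gives the "two-thirds" share.
human_labeled = triplets * 2
from_benchmarks = triplets * 1

print(total_questions)                  # 6414
print(human_labeled, from_benchmarks)   # 4276 2138
```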

15. MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

Author: Wang, Guanqun, Wei, Xinyu, Liu, Jiaming, Zhang, Ray, Zhang, Yichi, Zhang, Kevin, Chong, Maurice, and Zhang, Shanghang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception tasks, such as detection and segmentation. However, MLLMs mainly focus on high-level image-text interpretations and struggle with fine-grained visual understanding, and vision perception models usually suffer from open-world distribution shifts due to their limited model capacity. To overcome these challenges, we propose the Mutually Reinforced Multimodal Large Language Model (MR-MLLM), a novel framework that synergistically enhances visual perception and multimodal comprehension. First, a shared query fusion mechanism is proposed to harmonize detailed visual inputs from vision models with the linguistic depth of language models, enhancing multimodal comprehension and vision perception synergistically. Second, we propose the perception-enhanced cross-modal integration method, incorporating novel modalities from vision perception outputs, like object detection bounding boxes, to capture subtle visual elements, thus enriching the understanding of both visual and textual data. In addition, an innovative perception-embedded prompt generation mechanism is proposed to embed perceptual information into the language model's prompts, aligning the responses contextually and perceptually for a more accurate multimodal interpretation. Extensive experiments demonstrate MR-MLLM's superior performance in various multimodal comprehension and vision perception tasks, particularly those requiring corner case vision perception and fine-grained language comprehension.
Comment: 14 pages, 8 figures
Published: 2024

16. Semiparametric Localized Principal Stratification Analysis with Continuous Strata

Author: Zhang, Yichi and Yang, Shu
Subjects: Statistics - Methodology
Abstract: Principal stratification is essential for revealing causal mechanisms involving post-treatment intermediate variables. Principal stratification analysis with continuous intermediate variables is increasingly common but challenging due to the infinite principal strata and the nonidentifiability and nonregularity of principal causal effects. Inspired by recent research, we resolve these challenges by first using a flexible copula-based principal score model to identify principal causal effect under weak principal ignorability. We then target the local functional substitute of principal causal effect, which is statistically regular and can accurately approximate principal causal effect with vanishing bandwidth. We simplify the full efficient influence function of the local functional substitute by considering its oracle-scenario alternative. This leads to a computationally efficient and straightforward estimator for the local functional substitute and principal causal effect with vanishing bandwidth. We prove the double robustness and statistical optimality of our proposed estimator, and derive its asymptotic normality for inferential purposes. We illustrate the appealing statistical performance of our proposed estimator in simulations, and apply it to two real datasets with intriguing scientific discoveries.
Published: 2024

17. On Efficient Neural Network Architectures for Image Compression

Author: Zhang, Yichi, Duan, Zhihao, and Zhu, Fengqing
Subjects: Electrical Engineering and Systems Science - Image and Video Processing
Abstract: Recent advances in learning-based image compression typically come at the cost of high complexity. Designing computationally efficient architectures remains an open challenge. In this paper, we empirically investigate the impact of different network designs in terms of rate-distortion performance and computational complexity. Our experiments involve testing various transforms, including convolutional neural networks and transformers, as well as various context models, including hierarchical, channel-wise, and space-channel context models. Based on the results, we present a series of efficient models, the final model of which has comparable performance to recent best-performing methods but with significantly lower complexity. Extensive experiments provide insights into the design of architectures for learned image compression and potential direction for future research. The code is available at https://gitlab.com/viper-purdue/efficient-compression.
Comment: 2024 IEEE International Conference on Image Processing (ICIP2024)
Published: 2024

18. Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

Author: Zhang, Yichi, Huang, Yao, Sun, Yitong, Liu, Chang, Zhao, Zhe, Fang, Zhengwei, Wang, Yifan, Chen, Huanran, Yang, Xiao, Wei, Xingxing, Su, Hang, Dong, Yinpeng, and Zhu, Jun
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: https://multi-trust.github.io/.
Comment: 100 pages, 84 figures, 33 tables
Published: 2024

19. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Author: Cai, Zefan, Zhang, Yichi, Gao, Bofei, Liu, Yuliang, Liu, Tianyu, Lu, Keming, Xiong, Wayne, Dong, Yue, Chang, Baobao, Hu, Junjie, and Xiao, Wen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100% Acc. performance, matching that of a full KV cache.
Published: 2024

20. Quantum-Inspired Mean Field Probabilistic Model for Combinatorial Optimization Problems

Author: Huang, Yuhan, Jin, Siyuan, Zhang, Yichi, Pan, Ling, and Shao, Qiming
Subjects: Mathematics - Optimization and Control, Quantum Physics
Abstract: Combinatorial optimization problems are pivotal across many fields. Among these, Quadratic Unconstrained Binary Optimization (QUBO) problems, central to fields like portfolio optimization, network design, and computational biology, are NP-hard and require exponential computational resources. To address these challenges, we develop a novel Quantum-Inspired Mean Field (QIMF) probabilistic model that approximates solutions to QUBO problems with enhanced accuracy and efficiency. The QIMF model draws inspiration from quantum measurement principles and leverages the mean field probabilistic model. We incorporate a measurement grouping technique and an amplitude-based shot allocation strategy, both critical for optimizing cost functions with a polynomial speedup over traditional methods. Our extensive empirical studies demonstrate significant improvements in solution evaluation for large-scale problems of portfolio selection, the weighted maxcut problem, and the Ising model. Specifically, using S&P 500 data from 2022 and 2023, QIMF improves cost values by 152.8% and 12.5%, respectively, compared to the state-of-the-art baselines. Furthermore, when evaluated on increasingly larger datasets for QUBO problems, QIMF's scalability demonstrates its potential for large-scale QUBO challenges., Comment: 13 pages, 10 figures
Published: 2024

21. Multiple Heads are Better than One: Mixture of Modality Knowledge Experts for Entity Representation Learning

Author: Zhang, Yichi, Chen, Zhuo, Guo, Lingbing, Xu, Yajing, Hu, Binbin, Liu, Ziqi, Zhang, Wen, and Chen, Huajun
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Learning high-quality multi-modal entity representations is an important goal of multi-modal knowledge graph (MMKG) representation learning, which can enhance reasoning tasks within the MMKGs, such as MMKG completion (MMKGC). The main challenge is to collaboratively model the structural information concealed in massive triples and the multi-modal features of the entities. Existing methods focus on crafting elegant entity-wise multi-modal fusion strategies, yet they overlook the utilization of multi-perspective features concealed within the modalities under diverse relational contexts. To address this issue, we introduce a novel framework with Mixture of Modality Knowledge experts (MoMoK for short) to learn adaptive multi-modal entity representations for better MMKGC. We design relation-guided modality knowledge experts to acquire relation-aware modality embeddings and integrate the predictions from multi-modalities to achieve joint decisions. Additionally, we disentangle the experts by minimizing their mutual information. Experiments on four public MMKG benchmarks demonstrate the outstanding performance of MoMoK under complex scenarios., Comment: Work in progress. Code and data will be released at https://github.com/zjukg/MoMoK
Published: 2024

22. Eliciting Informative Text Evaluations with Large Language Models

Author: Lu, Yuxuan, Xu, Shengwei, Zhang, Yichi, Kong, Yuqing, and Schoenebeck, Grant
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory
Abstract: Peer prediction mechanisms motivate high-quality feedback with provable guarantees. However, current methods only apply to rather simple reports, like multiple-choice or scalar numbers. We aim to broaden these techniques to the larger domain of text-based reports, drawing on the recent developments in large language models. This vastly increases the applicability of peer prediction mechanisms as textual feedback is the norm in a large variety of feedback channels: peer reviews, e-commerce customer reviews, and comments on social media. We introduce two mechanisms, the Generative Peer Prediction Mechanism (GPPM) and the Generative Synopsis Peer Prediction Mechanism (GSPPM). These mechanisms utilize LLMs as predictors, mapping from one agent's report to a prediction of her peer's report. Theoretically, we show that when the LLM prediction is sufficiently accurate, our mechanisms can incentivize high effort and truth-telling as an (approximate) Bayesian Nash equilibrium. Empirically, we confirm the efficacy of our mechanisms through experiments conducted on two real datasets: the Yelp review dataset and the ICLR OpenReview dataset. We highlight the results that on the ICLR dataset, our mechanisms can differentiate three quality levels -- human-written reviews, GPT-4-generated reviews, and GPT-3.5-generated reviews in terms of expected scores. Additionally, GSPPM penalizes LLM-generated reviews more effectively than GPPM., Comment: Accepted by the Twenty-Fifth ACM Conference on Economics and Computation (EC'24)
Published: 2024

23. Multi-domain Knowledge Graph Collaborative Pre-training and Prompt Tuning for Diverse Downstream Tasks

Author: Zhang, Yichi, Hu, Binbin, Chen, Zhuo, Guo, Lingbing, Liu, Ziqi, Zhang, Zhiqiang, Liang, Lei, Chen, Huajun, and Zhang, Wen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Knowledge graphs (KGs) provide reliable external knowledge for a wide variety of AI tasks in the form of structured triples. Knowledge graph pre-training (KGP) aims to pre-train neural networks on large-scale KGs and provide unified interfaces to enhance different downstream tasks, which is a key direction for KG management, maintenance, and applications. Existing works often focus on pure research questions in open domains, or they are not open source due to data security and privacy in real scenarios. Meanwhile, existing studies have not explored the training efficiency and transferability of KGP models in depth. To address these problems, we propose a framework MuDoK to achieve multi-domain collaborative pre-training and efficient prefix prompt tuning to serve diverse downstream tasks like recommendation and text understanding. Our design is a plug-and-play prompt learning approach that can be flexibly adapted to different downstream task backbones. In response to the lack of open-source benchmarks, we constructed a new multi-domain KGP benchmark called KPI with two large-scale KGs and six different sub-domain tasks to evaluate our method and open-sourced it for subsequent research. We evaluated our approach based on constructed KPI benchmarks using diverse backbone models in heterogeneous downstream tasks. The experimental results show that our framework brings significant performance gains, along with its generality, efficiency, and transferability., Comment: Work in progress. Code and data will be open-sourced at https://github.com/zjukg/MuDoK
Published: 2024

24. Surveying Attitudinal Alignment Between Large Language Models Vs. Humans Towards 17 Sustainable Development Goals

Author: Wu, Qingyang, Xu, Ying, Xiao, Tingsong, Xiao, Yunze, Li, Yitong, Wang, Tianyang, Zhang, Yichi, Zhong, Shanghai, Zhang, Yuwei, Lu, Wei, and Yang, Yifan
Subjects: Computer Science - Computers and Society, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large Language Models (LLMs) have emerged as potent tools for advancing the United Nations' Sustainable Development Goals (SDGs). However, the attitudinal disparities between LLMs and humans towards these goals can pose significant challenges. This study conducts a comprehensive review and analysis of the existing literature on the attitudes of LLMs towards the 17 SDGs, emphasizing the comparison between their attitudes and support for each goal and those of humans. We examine the potential disparities, primarily focusing on aspects such as understanding and emotions, cultural and regional differences, task objective variations, and factors considered in the decision-making process. These disparities arise from the underrepresentation and imbalance in LLM training data, historical biases, quality issues, lack of contextual understanding, and skewed ethical values reflected. The study also investigates the risks and harms that may arise from neglecting the attitudes of LLMs towards the SDGs, including the exacerbation of social inequalities, racial discrimination, environmental destruction, and resource wastage. To address these challenges, we propose strategies and recommendations to guide and regulate the application of LLMs, ensuring their alignment with the principles and goals of the SDGs, and therefore creating a more just, inclusive, and sustainable future.
Published: 2024

25. Beyond Pixel-Wise Supervision for Medical Image Segmentation: From Traditional Models to Foundation Models

Author: Shi, Yuyan, Ma, Jialu, Yang, Jin, Wang, Shasha, and Zhang, Yichi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Medical image segmentation plays an important role in many image-guided clinical approaches. However, existing segmentation algorithms mostly rely on the availability of fully annotated images with pixel-wise annotations for training, which can be both labor-intensive and expertise-demanding, especially in the medical imaging domain where only experts can provide reliable and accurate annotations. To alleviate this challenge, there has been a growing focus on developing segmentation methods that can train deep models with weak annotations, such as image-level, bounding boxes, scribbles, and points. The emergence of vision foundation models, notably the Segment Anything Model (SAM), has introduced innovative capabilities for segmentation tasks using weak annotations for promptable segmentation enabled by large-scale pre-training. Adopting foundation models together with traditional learning methods has gained increasing interest in the research community and shown potential for real-world applications. In this paper, we present a comprehensive survey of recent progress on annotation-efficient learning for medical image segmentation utilizing weak annotations before and in the era of foundation models. Furthermore, we analyze and discuss several challenges of existing approaches, which we believe will provide valuable guidance for shaping the trajectory of foundational models to further advance the field of medical image segmentation.
Published: 2024

26. Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Author: Zhang, Yichi, Dong, Yinpeng, Zhang, Siyuan, Min, Tianzan, Su, Hang, and Zhu, Jun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility. However, fine-tuning methods require independent training for every model, leading to huge computation and memory overheads. In this paper, we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this, we propose Transferable Visual Prompting (TVP), a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts, including 1) Feature Consistency Alignment, which imposes constraints on the prompted feature changes to maintain task-agnostic knowledge; 2) Task Semantics Enrichment, which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction., Comment: Accepted in CVPR 2024 as Poster (Highlight)
Published: 2024

27. MyGO: Discrete Modality Information as Fine-Grained Tokens for Multi-modal Knowledge Graph Completion

Author: Zhang, Yichi, Chen, Zhuo, Guo, Lingbing, Xu, Yajing, Hu, Binbin, Liu, Ziqi, Chen, Huajun, and Zhang, Wen
Subjects: Computer Science - Artificial Intelligence
Abstract: Multi-modal knowledge graphs (MMKG) store structured world knowledge containing rich multi-modal descriptive information. To overcome their inherent incompleteness, multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given MMKGs, leveraging both structural information from the triples and multi-modal information of the entities. Existing MMKGC methods usually extract multi-modal features with pre-trained models and employ a fusion module to integrate multi-modal features with triple prediction. However, this often results in a coarse handling of multi-modal data, overlooking the nuanced, fine-grained semantic details and their interactions. To tackle this shortfall, we introduce a novel framework MyGO to process, fuse, and augment the fine-grained modality information from MMKGs. MyGO tokenizes multi-modal raw data as fine-grained discrete tokens and learns entity representations with a cross-modal entity encoder. To further augment the multi-modal representations, MyGO incorporates fine-grained contrastive learning to highlight the specificity of the entity representations. Experiments on standard MMKGC benchmarks reveal that our method surpasses 20 of the latest models, underlining its superior performance. Code and data are available at https://github.com/zjukg/MyGO, Comment: Work in progress; Repo is available at https://github.com/zjukg/MyGO
Published: 2024

28. Autonomous Evaluation and Refinement of Digital Agents

Author: Pan, Jiayi, Zhang, Yichi, Tomlin, Nicholas, Zhou, Yifei, Levine, Sergey, and Suhr, Alane
Subjects: Computer Science - Artificial Intelligence
Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve around 75% relative improvement in device control settings., Comment: Published at COLM 2024. Code at https://github.com/Berkeley-NLP/Agent-Eval-Refine
Published: 2024

29. Quantum Graph Optimization Algorithm

Author: Huang, Yuhan, Nugraha, Ferris Prima, Jin, Siyuan, Zhang, Yichi, Zeng, Bei, and Shao, Qiming
Subjects: Quantum Physics
Abstract: Quadratic unconstrained binary optimization (QUBO) tasks are very important in chemistry, finance, job scheduling, and so on, which can be represented using graph structures, with the variables as nodes and the interaction between them as edges. Variational quantum algorithms, especially the Quantum Approximate Optimization Algorithm (QAOA) and its variants, present a promising way, potentially exceeding the capabilities of classical algorithms, for addressing QUBO tasks. However, the possibility of using message-passing machines, inspired by classical graph neural networks, to enhance the power and performance of these quantum algorithms for QUBO tasks has not been investigated. This study introduces a novel variational quantum graph optimization algorithm that integrates the message-passing mechanism, which demonstrates significant improvements in performance for solving QUBO problems in terms of resource efficiency and solution precision, compared to QAOA, its variants, and other quantum graph neural networks. Furthermore, in terms of scalability on QUBO tasks, our algorithm shows superior performance compared to QAOA, presenting a substantial advancement in the field of quantum approximate optimization., Comment: 11 pages, 5 figures
Published: 2024

30. Movable Antenna-Aided Hybrid Beamforming for Multi-User Communications

Author: Zhang, Yichi, Zhang, Yuchen, Zhu, Lipeng, Xiao, Sa, Tang, Wanbin, Eldar, Yonina C., and Zhang, Rui
Subjects: Computer Science - Information Theory, Electrical Engineering and Systems Science - Signal Processing
Abstract: In this correspondence, we propose a movable antenna (MA)-aided multi-user hybrid beamforming scheme with a sub-connected structure, where multiple movable sub-arrays can independently change their positions within different local regions. To maximize the system sum rate, we jointly optimize the digital beamformer, analog beamformer, and positions of subarrays, under the constraints of unit modulus, finite movable regions, and power budget. Due to the non-concave/non-convex objective function/constraints, as well as the highly coupled variables, the formulated problem is challenging to solve. By employing fractional programming, we develop an alternating optimization framework to solve the problem via a combination of Lagrange multipliers, penalty method, and gradient descent. Numerical results reveal that the proposed MA-aided hybrid beamforming scheme significantly improves the sum rate compared to its fixed-position antenna (FPA) counterpart. Moreover, with sufficiently large movable regions, the proposed scheme with sub-connected MA arrays even outperforms the fully-connected FPA array.
Published: 2024

31. Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs

Author: Liu, Xiaoze, Wu, Feijie, Xu, Tianyang, Chen, Zhuo, Zhang, Yichi, Wang, Xiaoqian, and Gao, Jing
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The advent of Large Language Models (LLMs) has significantly transformed the AI landscape, enhancing machine learning and AI capabilities. Factuality is a critical concern for LLMs, as they may generate factually incorrect responses. In this paper, we propose GraphEval to evaluate an LLM's performance using a substantially large test dataset. Specifically, the test dataset is retrieved from a large knowledge graph with more than 10 million facts without expensive human efforts. Unlike conventional methods that evaluate LLMs based on generated responses, GraphEval streamlines the evaluation process by creating a judge model to estimate the correctness of the answers given by the LLM. Our experiments demonstrate that the judge model's factuality assessment aligns closely with the correctness of the LLM's generated outputs, while also substantially reducing evaluation costs. Besides, our findings offer valuable insights into LLM performance across different metrics and highlight the potential for future improvements in ensuring the factual integrity of LLM outputs. The code is publicly available at https://github.com/xz-liu/GraphEval.
Published: 2024

32. NativE: Multi-modal Knowledge Graph Completion in the Wild

Author: Zhang, Yichi, Chen, Zhuo, Guo, Lingbing, Xu, Yajing, Hu, Binbin, Liu, Ziqi, Zhang, Wen, and Chen, Huajun
Subjects: Computer Science - Multimedia, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval
Abstract: Multi-modal knowledge graph completion (MMKGC) aims to automatically discover the unobserved factual knowledge from a given multi-modal knowledge graph by collaboratively modeling the triple structure and multi-modal information from entities. However, real-world MMKGs present challenges due to their diverse and imbalanced nature, which means that the modality information can span various types (e.g., image, text, numeric, audio, video) but its distribution among entities is uneven, leading to missing modalities for certain entities. Existing works usually focus on common modalities like image and text while neglecting the imbalanced distribution phenomenon of modal information. To address these issues, we propose a comprehensive framework NativE to achieve MMKGC in the wild. NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities and employs a collaborative modality adversarial training framework to augment the imbalanced modality information. We construct a new benchmark called WildKGC with five datasets to evaluate our method. The empirical results compared with 21 recent baselines confirm the superiority of our method, consistently achieving state-of-the-art performance across different datasets and various scenarios while keeping efficient and generalizable. Our code and data are released at https://github.com/zjukg/NATIVE, Comment: Accepted by SIGIR 2024 as a full paper
Published: 2024

33. Theoretical Bound-Guided Hierarchical VAE for Neural Image Codecs

Author: Zhang, Yichi, Duan, Zhihao, Huang, Yuning, and Zhu, Fengqing
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Machine Learning
Abstract: Recent studies reveal a significant theoretical link between variational autoencoders (VAEs) and rate-distortion theory, notably in utilizing VAEs to estimate the theoretical upper bound of the information rate-distortion function of images. Such estimated theoretical bounds substantially exceed the performance of existing neural image codecs (NICs). To narrow this gap, we propose a theoretical bound-guided hierarchical VAE (BG-VAE) for NIC. The proposed BG-VAE leverages the theoretical bound to guide the NIC model towards enhanced performance. We implement the BG-VAE using Hierarchical VAEs and demonstrate its effectiveness through extensive experiments. Along with advanced neural network blocks, we provide a versatile, variable-rate NIC that outperforms existing methods when considering both rate-distortion performance and computational complexity. The code is available at BG-VAE., Comment: 2024 IEEE International Conference on Multimedia and Expo (ICME2024)
Published: 2024

34. MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?

Author: Zhang, Renrui, Jiang, Dongzhi, Zhang, Yichi, Lin, Haokun, Guo, Ziyu, Qiu, Pengshuo, Zhou, Aojun, Lu, Pan, Chang, Kai-Wei, Gao, Peng, and Li, Hongsheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has garnered unparalleled attention, due to their superior performance in visual contexts. However, their capabilities in visual math problem-solving remain insufficiently evaluated and understood. We find that current benchmarks incorporate excessive visual content within textual questions, which potentially assists MLLMs in deducing answers without truly interpreting the input diagrams. To this end, we introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs. We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources. Each problem is then transformed by human annotators into six distinct versions, each offering varying degrees of information content in multi-modality, contributing to 15K test samples in total. This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning. In addition, we propose a Chain-of-Thought (CoT) evaluation strategy for a fine-grained assessment of the output answers. Rather than naively judging True or False, we employ GPT-4(V) to adaptively extract crucial reasoning steps, and then score each step with detailed error analysis, which can reveal the intermediate CoT reasoning quality by MLLMs. We hope the MathVerse benchmark may provide unique insights to guide the future development of MLLMs. Project page: https://mathverse-cuhk.github.io, Comment: Accepted by ECCV 2024, 46 Pages, Benchmark Project Page: https://mathverse-cuhk.github.io
Published: 2024

35. D-Net: Dynamic Large Kernel with Dynamic Feature Fusion for Volumetric Medical Image Segmentation

Author: Yang, Jin, Qiu, Peijie, Zhang, Yichi, Marcus, Daniel S., and Sotiras, Aristeidis
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: Hierarchical transformers have achieved significant success in medical image segmentation due to their large receptive field and capabilities of effectively leveraging global long-range contextual information. Convolutional neural networks (CNNs) can also deliver a large receptive field by using large kernels, enabling them to achieve competitive performance with fewer model parameters. However, CNNs incorporated with large convolutional kernels remain constrained in adaptively capturing multi-scale features from organs with large variations in shape and size due to the employment of fixed-sized kernels. Additionally, they are unable to utilize global contextual information efficiently. To address these limitations, we propose Dynamic Large Kernel (DLK) and Dynamic Feature Fusion (DFF) modules. The DLK module employs multiple large kernels with varying kernel sizes and dilation rates to capture multi-scale features. Subsequently, a dynamic selection mechanism is utilized to adaptively highlight the most important spatial features based on global information. Additionally, the DFF module is proposed to adaptively fuse multi-scale local feature maps based on their global information. We integrate DLK and DFF in a hierarchical transformer architecture to develop a novel architecture, termed D-Net. D-Net is able to effectively utilize a multi-scale large receptive field and adaptively harness global contextual information. Extensive experimental results demonstrate that D-Net outperforms other state-of-the-art models in the two volumetric segmentation tasks, including abdominal multi-organ segmentation and multi-modality brain tumor segmentation. Our code is available at https://github.com/sotiraslab/DLK., Comment: 18 pages, 8 figures, 9 tables
Published: 2024

36. The Power of Noise: Toward a Unified Multi-modal Knowledge Graph Representation Framework

Author: Chen, Zhuo, Fang, Yin, Zhang, Yichi, Guo, Lingbing, Chen, Jiaoyan, Chen, Huajun, and Zhang, Wen
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: The advancement of Multi-modal Pre-training highlights the necessity for a robust Multi-Modal Knowledge Graph (MMKG) representation learning framework. This framework is crucial for integrating structured knowledge into multi-modal Large Language Models (LLMs) at scale, aiming to alleviate issues like knowledge misconceptions and multi-modal hallucinations. In this work, to evaluate models' ability to accurately embed entities within MMKGs, we focus on two widely researched tasks: Multi-modal Knowledge Graph Completion (MKGC) and Multi-modal Entity Alignment (MMEA). Building on this foundation, we propose a novel SNAG method that utilizes a Transformer-based architecture equipped with modality-level noise masking for the robust integration of multi-modal entity features in KGs. By incorporating specific training objectives for both MKGC and MMEA, our approach achieves SOTA performance across a total of ten datasets (three for MKGC and seven for MMEA), demonstrating its robustness and versatility. Besides, SNAG can not only function as a standalone model but also enhance other existing methods, providing stable performance improvements. Our code and data are available at: https://github.com/zjukg/SNAG., Comment: Ongoing work; 10 pages, 6 Tables, 2 Figures; Repo is available at https://github.com/zjukg/SNAG
Published: 2024

37. Facile microwave hydrothermal synthesis of citric acid-derived carbon dots for photothermal therapy of cancers under NIR irradiation

Author: Jin, Yingying, Qiao, Huanhuan, Zhang, Yichi, He, Yujia, Xie, Shuangning, Gu, Yiwen, and Lin, Fawei
Published: 2024

38. Emergence of ferromagnetism at the onset of moiré Kondo breakdown

Author: Zhao, Wenjin, Shen, Bowen, Tao, Zui, Kim, Sunghoon, Knüppel, Patrick, Han, Zhongdong, Zhang, Yichi, Watanabe, Kenji, Taniguchi, Takashi, Chowdhury, Debanjan, Shan, Jie, and Mak, Kin Fai
Published: 2024

39. Evaluating two small-sample corrections for fixed-effects standard errors and inferences in multilevel models with heteroscedastic, unbalanced, clustered data

Author: Zhang, Yichi and Lai, Mark H. C.
Published: 2024

40. GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Author: Zhang, Yichi, Ma, Ziqiao, Gao, Xiaofeng, Shakiah, Suhaila, Gao, Qiaozi, and Chai, Joyce
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens. This paradigm lacks pixel-level representations that are important for fine-grained visual understanding and diagnosis. In this work, we introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation. GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone, which then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To train GROUNDHOG, we carefully curated M3G2, a grounded visual instruction tuning dataset with Multi-Modal Multi-Grained Grounding, by harvesting a collection of segmentation-grounded datasets with rich annotations. Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning, and significantly reduces object hallucination. GROUNDHOG also demonstrates better grounding towards complex forms of visual input and provides easy-to-understand diagnosis in failure cases., Comment: Accepted to CVPR 2024. Website: https://groundhog-mllm.github.io/
Published: 2024

41. Unleashing the Power of Imbalanced Modality Information for Multi-modal Knowledge Graph Completion

Author: Zhang, Yichi, Chen, Zhuo, Liang, Lei, Chen, Huajun, and Zhang, Wen
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: Multi-modal knowledge graph completion (MMKGC) aims to predict the missing triples in the multi-modal knowledge graphs by incorporating structural, visual, and textual information of entities into the discriminant models. The information from different modalities will work together to measure the triple plausibility. Existing MMKGC methods overlook the imbalance problem of modality information among entities, resulting in inadequate modal fusion and inefficient utilization of the raw modality information. To address the mentioned problems, we propose Adaptive Multi-modal Fusion and Modality Adversarial Training (AdaMF-MAT) to unleash the power of imbalanced modality information for MMKGC. AdaMF-MAT achieves multi-modal fusion with adaptive modality weights and further generates adversarial samples by modality-adversarial training to enhance the imbalanced modality information. Our approach is a co-design of the MMKGC model and training strategy which can outperform 19 recent MMKGC methods and achieve new state-of-the-art results on three public MMKGC benchmarks. Our code and data have been released at https://github.com/zjukg/AdaMF-MAT., Comment: Accepted by LREC-COLING 2024
Published: 2024

42. PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

Author: Chen, Liang, Zhang, Yichi, Ren, Shuhuai, Zhao, Haozhe, Cai, Zefan, Wang, Yuchi, Wang, Peiyi, Meng, Xiangdi, Liu, Tianyu, and Chang, Baobao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: We present PCA-Bench, a multimodal decision-making benchmark for evaluating the integrated capabilities of Multimodal Large Language Models (MLLMs). Departing from previous benchmarks focusing on simplistic tasks and individual model capability, PCA-Bench introduces three complex scenarios: autonomous driving, domestic robotics, and open-world games. Given task instructions and diverse contexts, the model is required to seamlessly integrate multiple capabilities of Perception, Cognition, and Action in a reasoning chain to make accurate decisions. Moreover, PCA-Bench features error localization capabilities, scrutinizing model inaccuracies in areas such as perception, knowledge, or reasoning. This enhances the reliability of deploying MLLMs. To balance accuracy and efficiency in evaluation, we propose PCA-Eval, an automatic evaluation protocol, and assess 10 prevalent MLLMs. The results reveal significant performance disparities between open-source models and powerful proprietary models like GPT-4 Vision. To address this, we introduce Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing instruction tuning examples in multimodal embodied environments. EIE generates 7,510 training examples in PCA-Bench and enhances the performance of open-source MLLMs, occasionally surpassing GPT-4 Vision (+3% in decision accuracy), thereby validating the effectiveness of EIE. Our findings suggest that robust MLLMs like GPT4-Vision show promise for decision-making in embodied agents, opening new avenues for MLLM research., Comment: Code and Data released at https://github.com/pkunlp-icler/PCA-EVAL. Leaderboard at: https://docs.qq.com/sheet/DVUd4WUpGRHRqUnNV. This article supersedes its workshop version arxiv: 2310.02071. arXiv admin note: text overlap with arXiv:2310.02071
Published: 2024

43. Spot Check Equivalence: an Interpretable Metric for Information Elicitation Mechanisms

Author: Xu, Shengwei, Zhang, Yichi, Resnick, Paul, and Schoenebeck, Grant
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory
Abstract: Because high-quality data is like oxygen for AI systems, effectively eliciting information from crowdsourcing workers has become a first-order problem for developing high-performance machine learning algorithms. Two prevalent paradigms, spot-checking and peer prediction, enable the design of mechanisms to evaluate and incentivize high-quality data from human labelers. So far, at least three metrics have been proposed to compare the performances of these techniques [33, 8, 3]. However, different metrics lead to divergent and even contradictory results in various contexts. In this paper, we harmonize these divergent stories, showing that two of these metrics are actually the same within certain contexts and explain the divergence of the third. Moreover, we unify these different contexts by introducing Spot Check Equivalence, which offers an interpretable metric for the effectiveness of a peer prediction mechanism. Finally, we present two approaches to compute spot check equivalence in various contexts, where simulation results verify the effectiveness of our proposed metric., Comment: Accepted by the Web Conference 2024 (WWW '24)
Published: 2024

44. UMOEA/D: A Multiobjective Evolutionary Algorithm for Uniform Pareto Objectives based on Decomposition

Author: Zhang, Xiaoyuan, Lin, Xi, Zhang, Yichi, Chen, Yifan, and Zhang, Qingfu
Subjects: Computer Science - Machine Learning
Abstract: Multiobjective optimization (MOO) is prevalent in numerous applications, in which a Pareto front (PF) is constructed to display optima under various preferences. Previous methods commonly utilize the set of Pareto objectives (particles on the PF) to represent the entire PF. However, the empirical distribution of the Pareto objectives on the PF is rarely studied, which implicitly impedes the generation of diverse and representative Pareto objectives in previous methods. To bridge the gap, we suggest in this paper constructing uniformly distributed Pareto objectives on the PF, so as to alleviate the limited diversity found in previous MOO approaches. We are the first to formally define the concept of "uniformity" for an MOO problem. We optimize the maximal minimal distances on the Pareto front using a neural network, resulting in both asymptotically and non-asymptotically uniform Pareto objectives. Our proposed method is validated through experiments on real-world and synthetic problems, which demonstrates the efficacy in generating high-quality uniform Pareto objectives and the encouraging performance exceeding existing state-of-the-art methods. The detailed model implementation and the code are scheduled to be open-sourced upon publication.
Published: 2024

45. Knowledge Graphs Meet Multi-Modal Learning: A Comprehensive Survey

Author: Chen, Zhuo, Zhang, Yichi, Fang, Yin, Geng, Yuxia, Guo, Lingbing, Chen, Xiang, Li, Qian, Zhang, Wen, Chen, Jiaoyan, Zhu, Yushan, Li, Jiaqi, Liu, Xiaoze, Pan, Jeff Z., Zhang, Ningyu, and Chen, Huajun
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the semantic web community's exploration into multi-modal dimensions unlocking new avenues for innovation. In this survey, we carefully review over 300 articles, focusing on KG-aware research in two principal aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into the MMKG realm. We begin by defining KGs and MMKGs, then explore their construction progress. Our review includes two primary task categories: KG-aware multi-modal learning tasks, such as Image Classification and Visual Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph Completion and Entity Alignment, highlighting specific research trajectories. For most of these tasks, we provide definitions, evaluation benchmarks, and additionally outline essential insights for conducting relevant research. Finally, we discuss current challenges and identify emerging trends, such as progress in Large Language Modeling and Multi-modal Pre-training strategies. This survey aims to serve as a comprehensive reference for researchers already involved in or considering delving into KG and multi-modal learning research, offering insights into the evolving landscape of MMKG research and supporting future work., Comment: Ongoing work; 41 pages (Main Text), 55 pages (Total), 11 Tables, 13 Figures, 619 citations; Paper list is available at https://github.com/zjukg/KG-MM-Survey
Published: 2024

46. Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation

Author: Wang, Ziyang, Zheng, Jian-Qing, Zhang, Yichi, Cui, Ge, and Li, Lei
Subjects: Electrical Engineering and Systems Science - Image and Video Processing, Computer Science - Computer Vision and Pattern Recognition
Abstract: In recent advancements in medical image analysis, Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have set significant benchmarks. While the former excels in capturing local features through its convolution operations, the latter achieves remarkable global context understanding by leveraging self-attention mechanisms. However, both architectures exhibit limitations in efficiently modeling long-range dependencies within medical images, which is a critical aspect for precise segmentation. Inspired by the Mamba architecture, known for its proficiency in handling long sequences and global contextual information with enhanced computational efficiency as a State Space Model (SSM), we propose Mamba-UNet, a novel architecture that synergizes the U-Net in medical image segmentation with Mamba's capability. Mamba-UNet adopts a pure Visual Mamba (VMamba)-based encoder-decoder structure, infused with skip connections to preserve spatial information across different scales of the network. This design facilitates a comprehensive feature learning process, capturing intricate details and broader semantic contexts within medical images. We introduce a novel integration mechanism within the VMamba blocks to ensure seamless connectivity and information flow between the encoder and decoder paths, enhancing the segmentation performance. We conducted experiments on publicly available ACDC MRI Cardiac segmentation dataset, and Synapse CT Abdomen segmentation dataset. The results show that Mamba-UNet outperforms several types of UNet in medical image segmentation under the same hyper-parameter setting. The source code and baseline implementations are available.
Published: 2024

47. Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science

Author: Tang, Xiangru, Jin, Qiao, Zhu, Kunlun, Yuan, Tongxin, Zhang, Yichi, Zhou, Wangchunshu, Qu, Meng, Zhao, Yilun, Tang, Jian, Zhang, Zhuosheng, Cohan, Arman, Lu, Zhiyong, and Gerstein, Mark
Subjects: Computer Science - Computers and Society, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines. While their capabilities are promising, these agents, called scientific LLM agents, also introduce novel vulnerabilities that demand careful consideration for safety. However, there exists a notable gap in the literature, as there has been no comprehensive exploration of these vulnerabilities. This perspective paper fills this gap by conducting a thorough examination of vulnerabilities in LLM-based agents within scientific domains, shedding light on potential risks associated with their misuse and emphasizing the need for safety measures. We begin by providing a comprehensive overview of the potential risks inherent to scientific LLM agents, taking into account user intent, the specific scientific domain, and their potential impact on the external environment. Then, we delve into the origins of these vulnerabilities and provide a scoping review of the limited existing works. Based on our analysis, we propose a triadic framework involving human regulation, agent alignment, and an understanding of environmental feedback (agent regulation) to mitigate these identified risks. Furthermore, we highlight the limitations and challenges associated with safeguarding scientific agents and advocate for the development of improved models, robust benchmarks, and comprehensive regulations to address these issues effectively.
Published: 2024

48. Electrical 180° switching of Néel vector in spin-splitting antiferromagnet

Author: Han, Lei, Fu, Xizhi, Peng, Rui, Cheng, Xingkai, Dai, Jiankun, Liu, Liangyang, Li, Yidian, Zhang, Yichi, Zhu, Wenxuan, Bai, Hua, Zhou, Yongjian, Liang, Shixuan, Chen, Chong, Wang, Qian, Chen, Xianzhe, Yang, Luyi, Zhang, Yang, Song, Cheng, Liu, Junwei, and Pan, Feng
Subjects: Condensed Matter - Materials Science
Abstract: Antiferromagnetic spintronics have attracted wide attention due to its great potential in constructing ultra-dense and ultra-fast antiferromagnetic memory that suits modern high-performance information technology. The electrical 180° switching of Néel vector is a long-term goal for developing electrical-controllable antiferromagnetic memory with opposite Néel vectors as binary "0" and "1". However, the state-of-art antiferromagnetic switching mechanisms have long been limited for 90° or 120° switching of Néel vector, which unavoidably require multiple writing channels that contradicts ultra-dense integration. Here, we propose a deterministic switching mechanism based on spin-orbit torque with asymmetric energy barrier, and experimentally achieve electrical 180° switching of spin-splitting antiferromagnet Mn5Si3. Such a 180° switching is read out by the Néel vector-induced anomalous Hall effect. Based on our writing and readout methods, we fabricate an antiferromagnet device with electrical-controllable high and low resistance states that accomplishes robust write and read cycles. Besides fundamental advance, our work promotes practical spin-splitting antiferromagnetic devices based on spin-splitting antiferromagnet., Comment: 19 pages, 4 figures
Published: 2024

49. Trainable Fixed-Point Quantization for Deep Learning Acceleration on FPGAs

Author: Dai, Dingyi, Zhang, Yichi, Zhang, Jiahao, Hu, Zhanqiu, Cai, Yaohui, Sun, Qi, and Zhang, Zhiru
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Quantization is a crucial technique for deploying deep learning models on resource-constrained devices, such as embedded FPGAs. Prior efforts mostly focus on quantizing matrix multiplications, leaving other layers like BatchNorm or shortcuts in floating-point form, even though fixed-point arithmetic is more efficient on FPGAs. A common practice is to fine-tune a pre-trained model to fixed-point for FPGA deployment, but potentially degrading accuracy. This work presents QFX, a novel trainable fixed-point quantization approach that automatically learns the binary-point position during model training. Additionally, we introduce a multiplier-free quantization strategy within QFX to minimize DSP usage. QFX is implemented as a PyTorch-based library that efficiently emulates fixed-point arithmetic, supported by FPGA HLS, in a differentiable manner during backpropagation. With minimal effort, models trained with QFX can readily be deployed through HLS, producing the same numerical results as their software counterparts. Our evaluation shows that compared to post-training quantization, QFX can quantize models trained with element-wise layers quantized to fewer bits and achieve higher accuracy on both CIFAR-10 and ImageNet datasets. We further demonstrate the efficacy of multiplier-free quantization using a state-of-the-art binarized neural network accelerator designed for an embedded FPGA (AMD Xilinx Ultra96 v2). We plan to release QFX in open-source format.
Published: 2024

50. Higher-Order Entrywise Eigenvectors Analysis of Low-Rank Random Matrices: Bias Correction, Edgeworth Expansion, and Bootstrap

Author: Xie, Fangzheng and Zhang, Yichi
Subjects: Mathematics - Statistics Theory
Abstract: Understanding the distributions of spectral estimators in low-rank random matrix models, also known as signal-plus-noise matrix models, is fundamentally important in various statistical learning problems, including network analysis, matrix denoising, and matrix completion. This paper investigates the entrywise eigenvector distributions in a broad range of low-rank signal-plus-noise matrix models by establishing their higher-order accurate stochastic expansions. At a high level, the stochastic expansion states that the eigenvector perturbation approximately decomposes into the sum of a first-order term and a second-order term, where the first-order term in the expansion is a linear function of the noise matrix, and the second-order term is a linear function of the squared noise matrix. Our theoretical finding is used to derive the bias correction procedure for the eigenvectors. We further establish the Edgeworth expansion formula for the studentized entrywise eigenvector statistics. In particular, under mild conditions, we show that Cramér's condition on the smoothness of noise distribution is not required, thanks to the self-smoothing effect of the second-order term in the eigenvector stochastic expansion. The Edgeworth expansion result is then applied to justify the higher-order correctness of the residual bootstrap procedure for approximating the distributions of the studentized entrywise eigenvector statistics.
Published: 2024
