Author: "Jin, Qin" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Jin, Qin"' showing total 1,604 results

1. Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

Author: Wang, Ye, Zheng, Sipeng, Cao, Bin, Wei, Qianshan, Jin, Qin, and Lu, Zongqing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted towards the development of large motion models. Despite some progress, current state-of-the-art works remain far from achieving truly generalist models, largely due to the lack of large-scale, high-quality motion data. To address this, we present MotionBase, the first million-level motion generation benchmark, offering 15 times the data volume of the previous largest dataset, and featuring multimodal data with hierarchically detailed text descriptions. By leveraging this vast dataset, our large motion model demonstrates strong performance across a broad range of motions, including unseen ones. Through systematic investigation, we underscore the importance of scaling both data and model size, with synthetic data and pseudo labels playing a crucial role in mitigating data acquisition costs. Moreover, our research reveals the limitations of existing evaluation metrics, particularly in handling out-of-domain text instructions -- an issue that has long been overlooked. In addition to these, we introduce a novel 2D lookup-free approach for motion tokenization, which preserves motion information and expands codebook capacity, further enhancing the representative ability of large motion models. The release of MotionBase and the insights gained from this study are expected to pave the way for the development of more powerful and versatile motion generation models.
Published: 2024

2. Revealing Personality Traits: A New Benchmark Dataset for Explainable Personality Recognition on Dialogues

Author: Sun, Lei, Zhao, Jinming, and Jin, Qin
Subjects: Computer Science - Computation and Language
Abstract: Personality recognition aims to identify the personality traits implied in user data such as dialogues and social media posts. Current research predominantly treats personality recognition as a classification task, failing to reveal the supporting evidence for the recognized personality. In this paper, we propose a novel task named Explainable Personality Recognition, aiming to reveal the reasoning process as supporting evidence of the personality trait. Inspired by personality theories, personality traits are made up of stable patterns of personality state, where the states are short-term characteristic patterns of thoughts, feelings, and behaviors in a concrete situation at a specific moment in time. We propose an explainable personality recognition framework called Chain-of-Personality-Evidence (CoPE), which involves a reasoning process from specific contexts to short-term personality states to long-term personality traits. Furthermore, based on the CoPE framework, we construct an explainable personality recognition dataset from dialogues, PersonalityEvd. We introduce two explainable personality state recognition and explainable personality trait recognition tasks, which require models to recognize the personality state and trait labels and their corresponding support evidence. Our extensive experiments based on Large Language Models on the two tasks show that revealing personality traits is very challenging and we present some insights for future research. Our data and code are available at https://github.com/Lei-Sun-RUC/PersonalityEvd., Comment: Accepted to EMNLP 2024 Main Conference (Long Paper)
Published: 2024

3. ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

Author: Shi, Jiatong, Tian, Jinchuan, Wu, Yihan, Jung, Jee-weon, Yip, Jia Qi, Masuyama, Yoshiki, Chen, William, Wu, Yuning, Tang, Yuxun, Baali, Massa, Alharhi, Dareen, Zhang, Dong, Deng, Ruifan, Srivastava, Tejes, Wu, Haibin, Liu, Alexander H., Raj, Bhiksha, Jin, Qin, Song, Ruihua, and Watanabe, Shinji
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with autoregressive language models. However, as extensive downstream applications are investigated, challenges have arisen in ensuring fair comparisons across diverse applications. To address these issues, we present a new open-source platform ESPnet-Codec, which is built on ESPnet and focuses on neural codec training and evaluation. ESPnet-Codec offers various recipes in audio, music, and speech for training and evaluation using several widely adopted codec models. Together with ESPnet-Codec, we present VERSA, a standalone evaluation toolkit, which provides a comprehensive evaluation of codec performance over 20 audio evaluation metrics. Notably, we demonstrate that ESPnet-Codec can be integrated into six ESPnet tasks, supporting diverse applications., Comment: Accepted by SLT
Published: 2024

4. Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm

Author: Wu, Yuning, Shi, Jiatong, Yu, Yifeng, Tang, Yuxun, Qian, Tao, Lin, Yueqian, Han, Jionghao, Bai, Xinyi, Watanabe, Shinji, and Jin, Qin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This research presents Muskits-ESPnet, a versatile toolkit that introduces new paradigms to Singing Voice Synthesis (SVS) through the application of pretrained audio models in both continuous and discrete approaches. Specifically, we explore discrete representations derived from SSL models and audio codecs and offer significant advantages in versatility and intelligence, supporting multi-format inputs and adaptable data processing workflows for various SVS models. The toolkit features automatic music score error detection and correction, as well as a perception auto-evaluation module to imitate human subjective evaluating scores. Muskits-ESPnet is available at https://github.com/espnet/espnet., Comment: Accepted by ACMMM 2024 demo track
Published: 2024

5. mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Author: Hu, Anwen, Xu, Haiyang, Zhang, Liang, Ye, Jiabo, Yan, Ming, Zhang, Ji, Jin, Qin, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal Large Language Models (MLLMs) have achieved promising OCR-free Document Understanding performance by increasing the supported resolution of document images. However, this comes at the cost of generating thousands of visual tokens for a single document image, leading to excessive GPU memory and slower inference times, particularly in multi-page document comprehension. In this work, to address these challenges, we propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens, guided by low-resolution global visual features. With this compression module, to strengthen multi-page document comprehension ability and balance both token efficiency and question-answering performance, we develop the DocOwl2 under a three-stage training framework: Single-image Pretraining, Multi-image Continue-pretraining, and Multi-task Finetuning. DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%, demonstrating advanced capabilities in multi-page question answering, explanation with evidence pages, and cross-page structure understanding. Additionally, compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens. Our codes, models, and data are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl2., Comment: 15 pages, 7 figures
Published: 2024

6. What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation

Author: Yang, Dingyi and Jin, Qin
Subjects: Computer Science - Computation and Language, A.1, I.2.7, I.2.10
Abstract: With the development of artificial intelligence, particularly the success of Large Language Models (LLMs), the quantity and quality of automatically generated stories have significantly increased. This has led to the need for automatic story evaluation to assess the generative capabilities of computing systems and analyze the quality of both automatic-generated and human-written stories. Evaluating a story can be more challenging than other generation evaluation tasks. While tasks like machine translation primarily focus on assessing the aspects of fluency and accuracy, story evaluation demands complex additional measures such as overall coherence, character development, interestingness, etc. This requires a thorough review of relevant research. In this survey, we first summarize existing storytelling tasks, including text-to-text, visual-to-text, and text-to-visual. We highlight their evaluation challenges, identify various human criteria to measure stories, and present existing benchmark datasets. Then, we propose a taxonomy to organize evaluation metrics that have been developed or can be adopted for story evaluation. We also provide descriptions of these metrics, along with the discussion of their merits and limitations. Later, we discuss the human-AI collaboration for story evaluation and generation. Finally, we suggest potential future research directions, extending from story evaluation to general evaluations.
Published: 2024

7. Unveiling Visual Biases in Audio-Visual Localization Benchmarks

Author: Chen, Liangyu, Yue, Zihao, Xu, Boshen, and Jin, Qin
Subjects: Computer Science - Multimedia, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Audio-Visual Source Localization (AVSL) aims to localize the source of sound within a video. In this paper, we identify a significant issue in existing benchmarks: the sounding objects are often easily recognized based solely on visual cues, which we refer to as visual bias. Such biases hinder these benchmarks from effectively evaluating AVSL models. To further validate our hypothesis regarding visual biases, we examine two representative AVSL benchmarks, VGG-SS and EpicSounding-Object, where the vision-only models outperform all audiovisual baselines. Our findings suggest that existing AVSL benchmarks need further refinement to facilitate audio-visual learning., Comment: Accepted by ECCV24 AVGenL Workshop
Published: 2024

8. QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds

Author: Wang, Ye, Mei, Yuting, Zheng, Sipeng, and Jin, Qin
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence
Abstract: While pets offer companionship, their limited intelligence restricts advanced reasoning and autonomous interaction with humans. Considering this, we propose QuadrupedGPT, a versatile agent designed to master a broad range of complex tasks with agility comparable to that of a pet. To achieve this goal, the primary challenges include: i) effectively leveraging multimodal observations for decision-making; ii) mastering agile control of locomotion and path planning; iii) developing advanced cognition to execute long-term objectives. QuadrupedGPT processes human command and environmental contexts using a large multimodal model (LMM). Empowered by its extensive knowledge base, our agent autonomously assigns appropriate parameters for adaptive locomotion policies and guides the agent in planning a safe but efficient path towards the goal, utilizing semantic-aware terrain analysis. Moreover, QuadrupedGPT is equipped with problem-solving capabilities that enable it to decompose long-term goals into a sequence of executable subgoals through high-level reasoning. Extensive experiments across various benchmarks confirm that QuadrupedGPT can adeptly handle multiple tasks with intricate instructions, demonstrating a significant step towards the versatile quadruped agents in open-ended worlds. Our website and codes can be found at https://quadruped-hub.github.io/Quadruped-GPT/., Comment: Under review
Published: 2024

9. UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

Author: Mei, Yuting, Yao, Linli, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: With the surge in the amount of video data, video summarization techniques, including visual-modal (VM) and textual-modal (TM) summarization, are attracting more and more attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, $NDCG_{MS}$, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS., Comment: Accepted by ACM International Conference on Multimedia Retrieval (ICMR'24)
Published: 2024

10. ESCoT: Towards Interpretable Emotional Support Dialogue Systems

Author: Zhang, Tenggan, Zhang, Xinjie, Zhao, Jinming, Zhou, Li, and Jin, Qin
Subjects: Computer Science - Computation and Language
Abstract: Understanding the reason for emotional support response is crucial for establishing connections between users and emotional support dialogue systems. Previous works mostly focus on generating better responses but ignore interpretability, which is extremely important for constructing reliable dialogue systems. To empower the system with better interpretability, we propose an emotional support response generation scheme, named Emotion-Focused and Strategy-Driven Chain-of-Thought (ESCoT), mimicking the process of identifying, understanding, and regulating emotions. Specifically, we construct a new dataset with ESCoT in two steps: (1) Dialogue Generation, where we first generate diverse conversation situations, then enhance dialogue generation using richer emotional support strategies based on these situations; (2) Chain Supplement, where we focus on supplementing selected dialogues with elements such as emotion, stimuli, appraisal, and strategy reason, forming the manually verified chains. Additionally, we further develop a model to generate dialogue responses with better interpretability. We also conduct extensive experiments and human evaluations to validate the effectiveness of the proposed ESCoT and generated dialogue responses. Our data and code are available at https://github.com/TeigenZhang/ESCoT., Comment: Accepted to ACL 2024 (Long Paper)
Published: 2024

11. SingMOS: An extensive Open-Source Singing Voice Dataset for MOS Prediction

Author: Tang, Yuxun, Shi, Jiatong, Wu, Yuning, and Jin, Qin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In speech generation tasks, human subjective ratings, usually referred to as the opinion score, are considered the "gold standard" for speech quality evaluation, with the mean opinion score (MOS) serving as the primary evaluation metric. Due to the high cost of human annotation, several MOS prediction systems have emerged in the speech domain, demonstrating good performance. These MOS prediction models are trained using annotations from previous speech-related challenges. However, compared to the speech domain, the singing domain faces data scarcity and stricter copyright protections, leading to a lack of high-quality MOS-annotated datasets for singing. To address this, we propose SingMOS, a high-quality and diverse MOS dataset for singing, covering a range of Chinese and Japanese datasets. These synthesized vocals are generated using state-of-the-art models in singing synthesis, conversion, or resynthesis tasks and are rated by professional annotators alongside real vocals. Data analysis demonstrates the diversity and reliability of our dataset. Additionally, we conduct further exploration on SingMOS, providing insights for singing MOS prediction and guidance for the continued expansion of SingMOS.
Published: 2024

12. Adaptive Temporal Motion Guided Graph Convolution Network for Micro-expression Recognition

Author: Zhang, Fengyuan, Huang, Zhaopei, Zhang, Xinjie, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Micro-expressions serve as essential cues for understanding individuals' genuine emotional states. Recognizing micro-expressions attracts increasing research attention due to its various applications in fields such as business negotiation and psychotherapy. However, the intricate and transient nature of micro-expressions poses a significant challenge to their accurate recognition. Most existing works either neglect temporal dependencies or suffer from redundancy issues in clip-level recognition. In this work, we propose a novel framework for micro-expression recognition, named the Adaptive Temporal Motion Guided Graph Convolution Network (ATM-GCN). Our framework excels at capturing temporal dependencies between frames across the entire clip, thereby enhancing micro-expression recognition at the clip level. Specifically, the integration of Adaptive Temporal Motion layers empowers our method to aggregate global and local motion features inherent in micro-expressions. Experimental results demonstrate that ATM-GCN not only surpasses existing state-of-the-art methods, particularly on the Composite dataset, but also achieves superior performance on the latest micro-expression dataset CAS(ME)$^3$., Comment: Accepted by ICME 2024
Published: 2024

13. SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models

Author: Tang, Yuxun, Wu, Yuning, Shi, Jiatong, and Jin, Qin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Discrete representation has shown advantages in speech generation tasks, wherein discrete tokens are derived by discretizing hidden features from self-supervised learning (SSL) pre-trained models. However, the direct application of speech SSL models to singing generation encounters domain gaps between speech and singing. Furthermore, singing generation necessitates a more refined representation than typical speech. To address these challenges, we introduce SingOMD, a novel method to extract singing-oriented multi-resolution discrete representations from speech SSL models. Specifically, we first adapt the features from speech SSL through a resynthesis task and incorporate multi-resolution modules based on resampling to better serve singing generation. These adapted multi-resolution features are then discretized via clustering. Extensive experiments demonstrate the robustness, efficiency, and effectiveness of these representations in singing vocoders and singing voice synthesis., Comment: Accepted by Interspeech 2024
Published: 2024

14. TokSing: Singing Voice Synthesis based on Discrete Tokens

Author: Wu, Yuning, Zhang, Chunlei, Shi, Jiatong, Tang, Yuxun, Yang, Shan, and Jin, Qin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent advancements in speech synthesis witness significant benefits by leveraging discrete tokens extracted from self-supervised learning (SSL) models. Discrete tokens offer higher storage efficiency and greater operability in intermediate representations compared to traditional continuous Mel spectrograms. However, when it comes to singing voice synthesis (SVS), achieving higher levels of melody expression poses a great challenge for utilizing discrete tokens. In this paper, we introduce TokSing, a discrete-based SVS system equipped with a token formulator that offers flexible token blendings. We observe a melody degradation during discretization, prompting us to integrate a melody signal with the discrete token and incorporate a specially-designed melody enhancement strategy in the musical encoder. Extensive experiments demonstrate that our TokSing achieves better performance against the Mel spectrogram baselines while offering advantages in intermediate representation space cost and convergence speed., Comment: Accepted by Interspeech 2024
Published: 2024

15. The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

Author: Chang, Xuankai, Shi, Jiatong, Tian, Jinchuan, Wu, Yuning, Tang, Yuxun, Wu, Yihan, Watanabe, Shinji, Adi, Yossi, Chen, Xie, and Jin, Qin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge, which focuses on new speech processing benchmarks using discrete units. It encompasses three pivotal tasks, namely multilingual automatic speech recognition, text-to-speech, and singing voice synthesis, and aims to assess the potential applicability of discrete units in these tasks. This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in this evolving field., Comment: This manuscript has been accepted by Interspeech 2024
Published: 2024

16. EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

Author: Xu, Boshen, Wang, Ziheng, Du, Yang, Song, Zhinan, Zheng, Sipeng, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Egocentric video-language pretraining is a crucial paradigm to advance the learning of egocentric hand-object interactions (EgoHOI). Despite the great success on existing testbeds, these benchmarks focus more on closed-set visual concepts or limited scenarios. Due to the occurrence of diverse EgoHOIs in the real world, we propose an open-vocabulary benchmark named EgoHOIBench to reveal the diminished performance of current egocentric video-language models (EgoVLM) on fine-grained concepts, indicating that these models still lack a full spectrum of egocentric understanding. We attribute this performance gap to insufficient fine-grained supervision and strong bias towards understanding objects rather than temporal dynamics in current methods. To tackle these issues, we introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++. For video-to-text loss, we enhance text supervision through the generation of negative captions by leveraging the in-context learning of large language models to perform HOI-related word substitution. For text-to-video loss, we propose an object-centric positive video sampling strategy that aggregates video representations by the same nouns. Our extensive experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks across various egocentric models, with improvements of up to +26.55%. Our code is available at https://github.com/xuboshen/EgoNCEpp., Comment: Code: https://github.com/xuboshen/EgoNCEpp
Published: 2024

17. Synchronized Video Storytelling: Generating Video Narrations with Structured Storyline

Author: Yang, Dingyi, Zhan, Chunru, Wang, Ziheng, Wang, Biao, Ge, Tiezheng, Zheng, Bo, and Jin, Qin
Subjects: Computer Science - Multimedia
Abstract: Video storytelling is engaging multimedia content that utilizes video and its accompanying narration to attract the audience, where a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, in practical applications, we typically require synchronized narrations for ongoing visual scenes. In this work, we introduce a new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, associated with each video clip, should relate to the visual content, integrate relevant knowledge, and have an appropriate word count corresponding to the clip's duration. Specifically, a structured storyline is beneficial to guide the generation process, ensuring coherence and integrity. To support the exploration of this task, we introduce a new benchmark dataset E-SyncVidStory with rich annotations. Since existing Multimodal LLMs are not effective in addressing this task in one-shot or few-shot settings, we propose a framework named VideoNarrator that can generate a storyline for input videos and simultaneously generate narrations with the guidance of the generated or predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generation. Both automatic and human evaluations validate the effectiveness of our approach. Our dataset, codes, and evaluations will be released., Comment: 15 pages, 13 figures
Published: 2024

18. ECR-Chain: Advancing Generative Language Models to Better Emotion-Cause Reasoners through Reasoning Chains

Author: Huang, Zhaopei, Zhao, Jinming, and Jin, Qin
Subjects: Computer Science - Computation and Language
Abstract: Understanding the process of emotion generation is crucial for analyzing the causes behind emotions. Causal Emotion Entailment (CEE), an emotion-understanding task, aims to identify the causal utterances in a conversation that stimulate the emotions expressed in a target utterance. However, current works in CEE mainly focus on modeling semantic and emotional interactions in conversations, neglecting the exploration of the emotion-generation process. This hinders the models from deeply understanding emotions, restricting their ability to produce explainable predictions. In this work, inspired by the emotion generation process of "stimulus-appraisal-emotion" in the cognitive appraisal theory, we introduce a step-by-step reasoning method, Emotion-Cause Reasoning Chain (ECR-Chain), to infer the stimulus from the target emotional expressions in conversations. Specifically, we first introduce the ECR-Chain to ChatGPT via few-shot prompting, which significantly improves its performance on the CEE task. We further propose an automated construction process to utilize ChatGPT in building an ECR-Chain set, which can enhance the reasoning abilities of smaller models through supervised training and assist the Vicuna-7B model in achieving state-of-the-art CEE performance. Moreover, our methods can enable these generative language models to effectively perform emotion-cause reasoning in an explainable manner. Our code, data and more details are at https://github.com/hzp3517/ECR-Chain., Comment: Accepted by IJCAI 2024
Published: 2024

19. TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

Author: Zhang, Liang, Hu, Anwen, Xu, Haiyang, Yan, Ming, Xu, Yichen, Jin, Qin, Zhang, Ji, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Charts are important for presenting and explaining complex data relationships. Recently, multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks. However, the sheer size of these models in terms of parameters and computational requirements limits their use in resource-constrained environments. In this paper, we present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reducing the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) reducing lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module, which gradually merges the most similar vision tokens. Extensive experiments demonstrate that our 3B TinyChart achieves SOTA performance on a variety of chart understanding benchmarks including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, and the closed-source general-purpose MLLM GPT-4V on ChartQA. It also demonstrates its superior efficiency with higher throughput during inference due to a smaller model scale and more efficient vision encoding. Our code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart., Comment: 13 pages, 11 figures
Published: 2024

20. Think-Program-reCtify: 3D Situated Reasoning with Large Language Models

Author: He, Qingrong, Lin, Kejun, Chen, Shizhe, Hu, Anwen, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and limited generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a Think-Program-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/.
Published: 2024

21. Movie101v2: Improved Movie Narration Benchmark

Author: Yue, Zihao, Zhang, Yepeng, Wang, Ziheng, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Multimedia
Abstract: Automatic movie narration aims to generate video-aligned plot descriptions to assist visually impaired audiences. Unlike standard video captioning, it involves not only describing key visual details but also inferring plots that unfold across multiple movie shots, presenting distinct and complex challenges. To advance this field, we introduce Movie101v2, a large-scale, bilingual dataset with enhanced data quality specifically designed for movie narration. Revisiting the task, we propose breaking down the ultimate goal of automatic movie narration into three progressive stages, offering a clear roadmap with corresponding evaluation metrics. Based on our new benchmark, we baseline a range of large vision-language models, including GPT-4V, and conduct an in-depth analysis of the challenges in narration generation. Our findings highlight that achieving applicable movie narration generation is a fascinating goal that requires significant research.
Published: 2024

22. mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

Author: Hu, Anwen, Xu, Haiyang, Ye, Jiabo, Yan, Ming, Zhang, Liang, Zhang, Bo, Li, Chen, Zhang, Ji, Jin, Qin, Huang, Fei, and Zhou, Jingren
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Multimodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs. Our Unified Structure Learning comprises structure-aware parsing tasks and multi-grained text localization tasks across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Furthermore, by constructing structure-aware text sequences and multi-grained pairs of texts and bounding boxes for publicly available text-rich images, we build a comprehensive training set DocStruct4M to support structure learning. Finally, we construct a small but high-quality reasoning tuning dataset DocReason25K to trigger the detailed explanation ability in the document domain. Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks, improving the SOTA performance of MLLMs with a 7B LLM by more than 10 points in 5/10 benchmarks. Our codes, models, and datasets are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
Comment: 21 pages, 15 figures
Published: 2024

23. SPAFormer: Sequential 3D Part Assembly with Transformers

Author: Xu, Boshen, Zheng, Sipeng, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task. This task requires accurate prediction of each part's pose and shape in sequential steps, and as the number of parts increases, the possible assembly combinations increase exponentially, leading to a combinatorial explosion that severely hinders the efficacy of 3D-PA. SPAFormer addresses this problem by leveraging weak constraints from assembly sequences, effectively reducing the solution space's complexity. Since assembly part sequences convey construction rules similar to sentences being structured through words, our model explores both parallel and autoregressive generation. It further enhances assembly through knowledge enhancement strategies that utilize the attributes of parts and their sequence information, enabling it to capture the inherent assembly pattern and relationships among sequentially ordered parts. We also construct a more challenging benchmark named PartNet-Assembly covering 21 varied categories to more comprehensively validate the effectiveness of SPAFormer. Extensive experiments demonstrate the superior generalization capabilities of SPAFormer, particularly with multi-tasking and in scenarios requiring long-horizon assembly. Codes and model weights will be released at https://github.com/xuboshen/SPAFormer.
Comment: Code: https://github.com/xuboshen/SPAFormer
Published: 2024

24. POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

Author: Xu, Boshen, Zheng, Sipeng, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representation from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose a Prompt-Oriented View-agnostic learning (POV) framework in this paper, which enables this view adaptation with few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representation. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt tuning techniques in terms of view adaptation and view generalization. Our code is available at https://github.com/xuboshen/pov_acmmm2023.
Comment: Accepted by ACM MM 2023. Project page: https://xuboshen.github.io/
Published: 2024
Full Text: View/download PDF

25. Cuproptosis-Associated lncRNA Gene Signature Establishes New Prognostic Profile and Predicts Immunotherapy Response in Endometrial Carcinoma

Author: Jiang, Xi-Ya, Hu, Jing-Jing, Wang, Rui, Zhang, Wei-Yu, Jin, Qin-Qin, Yang, Yin-Ting, Mei, Jie, Hong, Lin, Yao, Hui, Tao, Feng, Li, Jie-Jie, Liu, Yu, Zhang, Li, Chen, Shun-Xia, Chen, Guo, Song, Yang, and Zhou, Shu-Guang
Published: 2024
Full Text: View/download PDF

26. Less is More: Mitigating Multimodal Hallucination from an EOS Decision Perspective

Author: Yue, Zihao, Zhang, Liang, and Jin, Qin
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Large Multimodal Models (LMMs) often suffer from multimodal hallucinations, wherein they may create content that is not present in the visual inputs. In this paper, we explore a new angle of this issue: overly detailed training data hinders the model's ability to timely terminate generation, leading to continued outputs beyond visual perception limits. By investigating how the model decides to terminate generation with EOS, the special end-of-sentence token, we find that the model assesses the completeness of the entire sequence by comparing the generated text with the image. This observation suggests that the model possesses an inherent potential of making proper EOS decisions based on its visual perception to avoid overly lengthy outputs. To take advantage of such potential, we explore two methods to mitigate multimodal hallucinations: a training objective that enables the model to reduce hallucinations by learning from regular instruction data, and a data filtering strategy to prevent harmful training data from exacerbating model hallucinations. Both methods significantly improve the hallucination performance of LMMs, without requiring any additional data or knowledge.
Comment: Accepted to ACL 2024
Published: 2024

27. Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing

Author: Shi, Jiatong, Lin, Yueqian, Bai, Xinyi, Zhang, Keyi, Wu, Yuning, Tang, Yuxun, Yu, Yifeng, Jin, Qin, and Watanabe, Shinji
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In singing voice synthesis (SVS), generating singing voices from musical scores faces challenges due to limited data availability. This study proposes a unique strategy to address the data scarcity in SVS. We employ an existing singing voice synthesizer for data augmentation, complemented by detailed manual tuning, an approach not previously explored in data curation, to reduce instances of unnatural voice synthesis. This innovative method has led to the creation of two expansive singing voice datasets, ACE-Opencpop and ACE-KiSing, which are instrumental for large-scale, multi-singer voice synthesis. Through thorough experimentation, we establish that these datasets not only serve as new benchmarks for SVS but also enhance SVS performance on other singing voice datasets when used as supplementary resources. The corpora, pre-trained models, and their related training recipes are publicly available at ESPnet-Muskits (https://github.com/espnet/espnet).
Comment: Accepted by Interspeech2024
Published: 2024

28. Deactivation Mechanism of Potassium on the γ-Fe2O3 Catalysts for SCR Reaction: A DFT Study

Author: Zhong, Jin-Qin, Li, Zi-Peng, Ren, Dong-Dong, Guo, Jian-Xiang, Wang, Ji-Jin, Zhang, Lin-Yang, and Liu, Na
Published: 2024
Full Text: View/download PDF

29. The upregulation of Annexin A2 by TLR4 pathway facilitates lipid accumulation and liver injury via blocking AMPK/mTOR-mediated autophagy flux during the development of non-alcoholic fatty liver disease

Author: Wu, Haifeng, Zhou, Meng, Jin, Qin, Wang, Xun, Xu, Yue, Li, Ming, Chen, Shuhui, Tang, Qin, Wang, Qi, Hu, Baoying, Wu, Hongpei, Xiao, Mingbing, Qu, Lishuai, Zhang, Qiong, and Liu, Jinxia
Published: 2024
Full Text: View/download PDF

30. UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model

Author: Ye, Jiabo, Hu, Anwen, Xu, Haiyang, Ye, Qinghao, Yan, Ming, Xu, Guohai, Li, Chenliang, Tian, Junfeng, Qian, Qi, Zhang, Ji, Jin, Qin, He, Liang, Lin, Xin Alex, and Huang, Fei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Text is ubiquitous in our visual world, conveying crucial information, such as in documents, websites, and everyday photographs. In this work, we propose UReader, a first exploration of universal OCR-free visually-situated language understanding based on the Multimodal Large Language Model (MLLM). By leveraging the shallow text recognition ability of the MLLM, we finetune only 1.2% of the parameters, and the training cost is much lower than that of previous work following domain-specific pretraining and finetuning paradigms. Concretely, UReader is jointly finetuned on a wide range of Visually-situated Language Understanding tasks via a unified instruction format. To enhance the visual text and semantic understanding, we further apply two auxiliary tasks with the same format, namely text reading and key points generation tasks. We design a shape-adaptive cropping module before the encoder-decoder architecture of MLLM to leverage the frozen low-resolution vision encoder for processing high-resolution images. Without downstream finetuning, our single model achieves state-of-the-art OCR-free performance in 8 out of 10 visually-situated language understanding tasks, across 5 domains: documents, tables, charts, natural images, and webpage screenshots. Codes and instruction-tuning datasets will be released.
Published: 2023

31. The development of a dietary nutrient density educational tool and the investigation of its acceptance by Chinese residents from Henan province

Author: Junya Zhai, Xu Zhang, Pipasha Khatun, Saiqi Wang, Minghua Cong, Rui Liang, Fangfang Yao, Huan Liu, Jin Qin, Lijun Guo, Yongxia Kong, Hongbo Wu, and Baihui Ma
Subjects: Public aspects of medicine, RA1-1270
Abstract: Objectives: Helping residents select nutrient-dense foods is a strategy to improve their diet quality. However, communication based on nutrient-dense foods as a positive attribute has not been widely used in nutritional education. This study aimed to develop an educational tool based on the picture and guidance of the “Chinese food guide pagoda (2022)”, extend it with the concept of nutrient density, and investigate its acceptance by Chinese residents from Henan province. Methods: Three examples (one-day diets with high, medium, and low nutrient-rich food (NRF) 9.2 scores, an indicator for evaluating dietary nutrient density) were designed for developing a dietary nutrient density educational tool. A self-designed questionnaire was conducted to investigate the acceptance of the “dietary nutrient density educational tool” among college students from Henan province on the basis of the theory of planned behavior. Results: Among the three one-day diets used in the tool, with the decrease in the NRF9.2 score, the energy intake increased from 1686 kcal to 2363 kcal, the dietary fat-to-energy ratio increased from 28 to 42%, and the mean adequacy ratio (MAR) decreased from 0.97 to 0.87. A total of 851 college students completed the acceptance questionnaire. The average acceptance score was 4.07 out of a total score of 5. This study showed that residents' intention to use the tool was correlated with family residence, perceptual behavior control, and subjective norms. These three factors accounted for 83.5% of the variation in behavior intention. Conclusion: To encourage residents to choose healthier foods, a dietary nutrient density educational tool was developed to expand the current instructional tool, the Chinese food guide pagoda (2022). The acceptance questionnaire survey revealed that residents had good acceptance of the tool, and family residence, perceptual behavior control, and subjective norms may strongly contribute to their acceptance of and intention to use the tool.
Published: 2024
Full Text: View/download PDF

32. Explore and Tell: Embodied Visual Captioning in 3D Environments

Author: Hu, Anwen, Chen, Shizhe, Zhang, Liang, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, codes and models are available at https://aim3-ruc.github.io/ExploreAndTell.
Comment: 12 pages; 10 figures; ICCV 2023
Published: 2023

33. A Systematic Exploration of Joint-training for Singing Voice Synthesis

Author: Wu, Yuning, Yu, Yifeng, Shi, Jiatong, Qian, Tao, and Jin, Qin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: There has been a growing interest in using end-to-end acoustic models for singing voice synthesis (SVS). Typically, these models require an additional vocoder to transform the generated acoustic features into the final waveform. However, since the acoustic model and the vocoder are not jointly optimized, a gap can exist between the two models, leading to suboptimal performance. Although a similar problem has been addressed in the TTS systems by joint-training or by replacing acoustic features with a latent representation, adopting corresponding approaches to SVS is not an easy task. How to improve the joint-training of SVS systems has not been well explored. In this paper, we conduct a systematic investigation of how to better perform a joint-training of an acoustic model and a vocoder for SVS. We carry out extensive experiments and demonstrate that our joint-training strategy outperforms baselines, achieving more stable performance across different datasets while also increasing the interpretability of the entire framework.
Published: 2023

34. Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences

Author: Yang, Dingyi, Chen, Hongyu, Hou, Xinglin, Ge, Tiezheng, Jiang, Yuning, and Jin, Qin
Subjects: Computer Science - Multimedia, Computer Science - Computer Vision and Pattern Recognition
Abstract: Stylized visual captioning aims to generate image or video descriptions with specific styles, making them more attractive and emotionally appropriate. One major challenge with this task is the lack of paired stylized captions for visual content, so most existing works focus on unsupervised methods that do not rely on parallel datasets. However, these approaches still require training with sufficient examples that have style labels, and the generated captions are limited to predefined styles. To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module. Our two-step training scheme proceeds as follows: first, we train a style extractor to generate style representations on an unlabeled text-only corpus. Then, we freeze the extractor and enable our decoder to generate stylized descriptions based on the extracted style vector and projected visual content vectors. During inference, our model can generate desired stylized captions by deriving the style representation from user-supplied examples. Our automatic evaluation results for few-shot sentimental visual captioning outperform state-of-the-art approaches and are comparable to models that are fully trained on labeled style corpora. Human evaluations further confirm our model's ability to handle multiple styles.
Comment: 9 pages, 6 figures
Published: 2023

35. No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

Author: Zhang, Qi, Zheng, Sipeng, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video. A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)", which results in worse performance with lower SNR. Prior works have addressed this challenge using sophisticated techniques. In this paper, we propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to only aggregate visual contexts from its neighbor, enabling the extraction of the most distinguishing information with multi-scale feature hierarchies from high-ratio noises. The zoom-in boundary detection then focuses on local-wise discrimination of the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model achieves competitive performance on different TVG benchmarks, while also having the advantage of faster inference speed and lighter model parameters, thanks to its lightweight architecture.
Published: 2023

36. Learning Descriptive Image Captioning via Semipermeable Maximum Likelihood Estimation

Author: Yue, Zihao, Hu, Anwen, Zhang, Liang, and Jin, Qin
Subjects: Computer Science - Computation and Language
Abstract: Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', there could be various correct descriptions for an image. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction mismatches with the label. For instance, when the model predicts a word expressing richer semantics than the label, it will be penalized and optimized to prefer more concise expressions, referred to as conciseness optimization. In contrast, predictions that are more concise than labels lead to richness optimization. Such conflicting optimization directions could eventually result in the model generating general descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets MSCOCO and Flickr30K demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.
Comment: Accepted to NeurIPS 2023
Published: 2023

37. Exosomes derived from BMSCs in osteogenic differentiation promote type H blood vessel angiogenesis through miR-150-5p mediated metabolic reprogramming of endothelial cells

Author: Wu, Feng, Song, Chengchao, Zhen, Guanqi, Jin, Qin, Li, Wei, Liang, Xiongjie, Xu, Wenbo, Guo, Wenhui, Yang, Yang, Dong, Wei, Jiang, Anlong, Kong, Pengyu, and Yan, Jinglong
Published: 2024
Full Text: View/download PDF

38. Exonuclease editor promotes precision of gene editing in mammalian cells

Author: Shi, Hui, Li, Lei, Mu, Shuangshuang, Gou, Shixue, Liu, Xiaoyi, Chen, Fangbing, Chen, Menglong, Jin, Qin, Lai, Liangxue, and Wang, Kepin
Published: 2024
Full Text: View/download PDF

39. Enhancing prime editor flexibility with coiled-coil heterodimers

Author: Mu, Shuangshuang, Chen, Huangyao, Li, Qianru, Gou, Shixue, Liu, Xiaoyi, Wang, Junwei, Zheng, Wei, Chen, Menglong, Jin, Qin, Lai, Liangxue, Wang, Kepin, and Shi, Hui
Published: 2024
Full Text: View/download PDF

40. Movie101: A New Movie Understanding Benchmark

Author: Yue, Zihao, Zhang, Qi, Hu, Anwen, Zhang, Liang, Wang, Ziheng, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: To help the visually impaired enjoy movies, automatic movie narrating systems are expected to narrate accurate, coherent, and role-aware plots when there are no speaking lines of actors. Existing works benchmark this challenge as a normal video captioning task via some simplifications, such as removing role names and evaluating narrations with ngram-based metrics, which makes it difficult for automatic systems to meet the needs of real application scenarios. To narrow this gap, we construct a large-scale Chinese movie benchmark, named Movie101. Closer to real scenarios, the Movie Clip Narrating (MCN) task in our benchmark asks models to generate role-aware narration paragraphs for complete movie clips where no actors are speaking. External knowledge, such as role information and movie genres, is also provided for better movie understanding. Besides, we propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation, which achieves the best correlation with human evaluation. Our benchmark also supports the Temporal Narration Grounding (TNG) task to investigate clip localization given text descriptions. For both tasks, our proposed methods leverage external knowledge well and outperform carefully designed baselines. The dataset and codes are released at https://github.com/yuezih/Movie101.
Comment: Accepted to ACL 2023
Published: 2023

41. Edit As You Wish: Video Caption Editing with Multi-grained User Control

Author: Yao, Linli, Zhang, Yuanmeng, Wang, Ziheng, Hou, Xinglin, Ge, Tiezheng, Jiang, Yuning, Sun, Xu, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Automatically narrating videos in natural language complying with user requests, i.e. Controllable Video Captioning task, can help people manage massive videos with desired intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained which can not satisfy diverse user intentions; 2) the video description is generated in a single round which can not be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an open-domain benchmark dataset named VATEX-EDIT and manually collect an e-commerce dataset called EMMAD-EDIT. We further propose a specialized small-scale model (i.e., OPA) compared with two generalist Large Multi-modal Models to perform an exhaustive analysis of the novel task. For evaluation, we adopt comprehensive metrics considering caption fluency, command-caption consistency, and video-caption alignment. Experiments reveal the task challenges of fine-grained multi-modal semantics understanding and processing. Our datasets, codes, and evaluation tools are available at https://github.com/yaolinli/VCE.
Comment: Accepted by ACM MM 2024
Published: 2023

42. InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation

Author: Hu, Anwen, Chen, Shizhe, Zhang, Liang, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Automatic image captioning evaluation is critical for benchmarking and promoting advances in image captioning research. Existing metrics only provide a single score to measure caption quality, which is less explainable and informative. Instead, we humans can easily identify the problems of captions in detail, e.g., which words are inaccurate and which salient objects are not described, and then rate the caption quality. To support such informative feedback, we propose an Informative Metric for Reference-free Image Caption evaluation (InfoMetIC). Given an image and a caption, InfoMetIC is able to report incorrect words and unmentioned image regions at fine-grained level, and also provide a text precision score, a vision recall score and an overall quality score at coarse-grained level. The coarse-grained score of InfoMetIC achieves significantly better correlation with human judgements than existing metrics on multiple benchmarks. We also construct a token-level evaluation dataset and demonstrate the effectiveness of InfoMetIC in fine-grained evaluation. Our code and datasets are publicly available at https://github.com/HAWLYQ/InfoMetIC.
Comment: Accepted by ACL 2023 main conference
Published: 2023

43. Knowledge Enhanced Model for Live Video Comment Generation

Author: Chen, Jieting, Ding, Junkai, Chen, Wenping, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Live video commenting is popular on video media platforms, as it can create a chatting atmosphere and provide supplementary information for users while watching videos. Automatically generating live video comments can improve user experience and enable human-like generation for bot chatting. Existing works mostly focus on short video datasets while ignoring other important video types such as long videos like movies. In this work, we collect a new Movie Live Comments (MovieLC) dataset to support research on live video comment generation for long videos. We also propose a knowledge enhanced generation model inspired by the divergent and informative nature of live video comments. Our model adopts a pre-training encoder-decoder framework and incorporates external knowledge. Extensive experiments show that both objective metrics and human evaluation demonstrate the effectiveness of our proposed model. The MovieLC dataset and our code will be released.
Published: 2023

44. Rethinking Benchmarks for Cross-modal Image-text Retrieval

Author: Chen, Weijing, Yao, Linli, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Image-text retrieval, as a fundamental and important branch of information retrieval, has attracted extensive research attention. The main challenge of this task is cross-modal semantic understanding and matching. Some recent works focus more on fine-grained cross-modal semantic matching. With the prevalence of large scale multimodal pretraining models, several state-of-the-art models (e.g. X-VLM) have achieved near-perfect performance on widely-used image-text retrieval benchmarks, i.e. MSCOCO-Test-5K and Flickr30K-Test-1K. In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching. The reason is that a large number of images and texts in the benchmarks are coarse-grained. Based on the observation, we renovate the coarse-grained images and texts in the old benchmarks and establish the improved benchmarks called MSCOCO-FG and Flickr30K-FG. Specifically, on the image side, we enlarge the original image pool by adopting more similar images. On the text side, we propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort. Furthermore, we evaluate representative image-text retrieval models on our new benchmarks to demonstrate the effectiveness of our method. We also analyze the capability of models on fine-grained semantic comprehension through extensive experiments. The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding, especially in distinguishing attributes of close objects in images. Our code and improved benchmark datasets are publicly available at: https://github.com/cwj1412/MSCOCO-Flikcr30K_FG, which we hope will inspire further in-depth research on cross-modal retrieval.
Comment: Accepted to SIGIR2023
Published: 2023
Full Text: View/download PDF

45. MPMQA: Multimodal Question Answering on Product Manuals

Author: Zhang, Liang, Hu, Anwen, Zhang, Jing, Hu, Shuo, and Jin, Qin
Subjects: Computer Science - Computation and Language
Abstract: Visual contents, such as illustrations and images, play a big role in product manual understanding. Existing Product Manual Question Answering (PMQA) datasets tend to ignore visual contents and only retain textual parts. In this work, to emphasize the importance of multimodal contents, we propose a Multimodal Product Manual Question Answering (MPMQA) task. For each question, MPMQA requires the model not only to process multimodal contents but also to provide multimodal answers. To support MPMQA, a large-scale dataset PM209 is constructed with human annotations, which contains 209 product manuals from 27 well-known consumer electronic brands. Human annotations include 6 types of semantic regions for manual contents and 22,021 pairs of question and answer. Especially, each answer consists of a textual sentence and related visual regions from manuals. Taking into account the length of product manuals and the fact that a question is always related to a small number of pages, MPMQA can be naturally split into two subtasks: retrieving most related pages and then generating multimodal answers. We further propose a unified model that can perform these two subtasks all together and achieve comparable performance with multiple task-specific models. The PM209 dataset is available at https://github.com/AIM3-RUC/MPMQA.
Published: 2023

46. PHONEix: Acoustic Feature Processing Strategy for Enhanced Singing Pronunciation with Phoneme Distribution Predictor

Author: Wu, Yuning, Shi, Jiatong, Qian, Tao, Gao, Dongji, and Jin, Qin
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Singing voice synthesis (SVS), as a specific task for generating the vocal singing voice from a music score, has drawn much attention in recent years. SVS faces the challenge that singing allows considerable pronunciation flexibility conditioned on the same music score. Most previous SVS works cannot handle the misalignment between the music score and actual singing well. In this paper, we propose an acoustic feature processing strategy, named PHONEix, with a phoneme distribution predictor, to alleviate the gap between the music score and the singing voice, which can be easily adopted in different SVS systems. Extensive experiments in various settings demonstrate the effectiveness of our PHONEix in both objective and subjective evaluations. Comment: Accepted by ICASSP 2023
Published: 2023

47. Accommodating Audio Modality in CLIP for Multimodal Processing

Author: Ruan, Ludan, Hu, Anwen, Song, Yuqing, Zhang, Liang, Zheng, Sipeng, and Jin, Qin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, the exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities in addition to the inner characteristics of the audio modality. Moreover, we further design an audio type token to dynamically learn different audio information types for different scenarios, as both verbal and nonverbal heterogeneous information is conveyed in general audios. Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning, and achieves the state-of-the-art performance on the benchmark datasets of MSR-VTT, VATEX, and Audiocaps. Comment: Accepted by AAAI2023
Published: 2023

48. TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World

Author: Lin, Hongpeng, Ruan, Ludan, Xia, Wenke, Liu, Peiyu, Wen, Jingyuan, Xu, Yixin, Hu, Di, Song, Ruihua, Zhao, Wayne Xin, Jin, Qin, and Lu, Zhiwu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: To facilitate the research on intelligent and human-like chatbots with multi-modal context, we introduce a new video-based multi-modal dialogue dataset, called TikTalk. We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context. Compared to previous multi-modal dialogue datasets, the richer context types in TikTalk lead to more diverse conversations, but also increase the difficulty in capturing human interests from intricate multi-modal information to generate personalized responses. Moreover, external knowledge is more frequently evoked in our dataset. These facts reveal new challenges for multi-modal dialogue models. We quantitatively demonstrate the characteristics of TikTalk, propose a video-based multi-modal chitchat task, and evaluate several dialogue baselines. Experimental results indicate that the models incorporating large language models (LLM) can generate more diverse responses, while the model utilizing knowledge graphs to introduce external knowledge performs the best overall. Furthermore, no existing model can solve all the above challenges well. There is still a large room for future improvements, even for LLM with visual extensions. Our dataset is available at https://ruc-aimind.github.io/projects/TikTalk/. Comment: Accepted to ACM Multimedia 2023
Published: 2023

49. MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

Author: Ruan, Ludan, Ma, Yiyang, Yang, Huan, He, Huiguo, Liu, Bei, Fu, Jianlong, Yuan, Nicholas Jing, Jin, Qin, and Guo, Baining
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two-coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noises. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block bridging over the two subnets, which enables efficient cross-modal alignment, and thus reinforces the audio-video fidelity for each other. Extensive experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion. Comment: Accepted by CVPR 2023
Published: 2022

50. Systematic review and network meta-analysis of non-invasive respiratory support in paediatric patients with acute hypoxaemic respiratory failure: a protocol

Author: Jin Qin, Yan-Dong Feng, Yu-Xia Li, Yang-Qi Yin, and Ji-Zu Ling
Subjects: Medicine
Abstract: Introduction: Acute hypoxic respiratory failure (AHRF) is one of the most common causes of admission to paediatric intensive care units (PICUs) around the world, posing a serious health concern for the global community. Non-invasive respiratory support (NRS) is considered effective in reducing mortality and intubation rates in adults. However, it is not yet clear whether NRS is beneficial for children and which NRS modalities are most effective. This network meta-analysis aims to summarise existing evidence and compare the efficacy and safety of different NRS modalities in paediatric patients with acute hypoxaemic respiratory failure. Methods and analysis: To identify randomised controlled trials, we will perform a systematic search of key databases (Embase, PubMed, CENTRAL, CINAHL Complete and Web of Science) and registered clinical trials (ClinicalTrials.gov, WHO ICTRP and ISRCTN). To ensure the inclusion of the latest literature, an initial pilot search was conducted on 8 July 2024, and an updated search will be conducted after the main research work of this study. AHRF in children treated with NRS will be included. Hospital mortality, intubation rate, treatment failure rate and serious adverse events are critical outcomes closely related to patient-centredness and importance. Two authors will independently select the studies and extract the data. The risk of bias will be assessed using the Cochrane risk of bias tool V.2.0. In order to compare the effects of different NRS modalities, pairwise meta-analysis and network meta-analysis will be conducted using R software. Several subgroup analyses will be conducted, including analyses of different causes of AHRF. We will conduct sensitivity analyses by excluding studies with a high risk of bias and those involving neonates. Using the Grading of Recommendations Assessment, Development and Evaluation methodology, we will assess the certainty of the evidence for the effect estimates of all the outcomes. Ethics and dissemination: Since this research is a network meta-analysis based on published literature, no formal ethics approval is required. The results will be disseminated through a peer-reviewed journal for publication. PROSPERO registration number: CRD42024529804.
Published: 2024
