Author: "Chen, Jingdong" / Database: arXiv - Searchworks@Jio Institute Digital Library Search Results

31 results on '"Chen, Jingdong"'

1. POA: Pre-training Once for Models of All Sizes

Author: Zhang, Yingying, Guo, Xin, Lao, Jiangwei, Yu, Lei, Ru, Lixiang, Wang, Jian, Ye, Guo, He, Huimei, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition, 68T07
Abstract: Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a certain size at one time. Nevertheless, various computation or storage constraints in real-world scenarios require substantial efforts to develop a series of models with different sizes to deploy. Thus, in this study, we propose a novel tri-branch self-supervised training framework, termed as POA (Pre-training Once for All), to tackle this aforementioned issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes for downstream tasks. Remarkably, the elastic student facilitates the simultaneous pre-training of multiple models with different sizes, which also acts as an additional ensemble of models of various sizes to enhance representation learning. Extensive experiments, including k-nearest neighbors, linear probing evaluation and assessments on multiple downstream tasks demonstrate the effectiveness and advantages of our POA. It achieves state-of-the-art performance using ViT, Swin Transformer and ResNet backbones, producing around a hundred models with different sizes through a single pre-training session. The code is available at: https://github.com/Qichuzyy/POA.
Comment: Accepted by ECCV2024
Published: 2024

2. Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight

Author: Huang, Ziyuan, Ji, Kaixiang, Gong, Biao, Qing, Zhiwu, Zhang, Qinglong, Zheng, Kecheng, Wang, Jian, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performance, competitive with existing approaches in a series of benchmarks.
Published: 2024

3. ViTime: A Visual Intelligence-Based Foundation Model for Time Series Forecasting

Author: Yang, Luoxiao, Wang, Yun, Fan, Xinqi, Cohen, Israel, Chen, Jingdong, Zhao, Yue, and Zhang, Zijun
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: The success of large pretrained models in natural language processing (NLP) and computer vision (CV) has opened new avenues for constructing foundation models for time series forecasting (TSF). Traditional TSF foundation models rely heavily on numerical data fitting. In contrast, the human brain is inherently skilled at processing visual information, preferring to predict future trends by observing visualized sequences. From a biomimetic perspective, utilizing models to directly process numerical sequences might not be the most effective route to achieving Artificial General Intelligence (AGI). This paper proposes ViTime, a novel Visual Intelligence-based foundation model for TSF. ViTime overcomes the limitations of numerical time series data fitting by utilizing visual data processing paradigms and employs an innovative data synthesis method during training, called Real Time Series (RealTS). Experiments on a diverse set of previously unseen forecasting datasets demonstrate that ViTime achieves state-of-the-art zero-shot performance, even surpassing the best individually trained supervised models in some situations. These findings suggest that visual intelligence can significantly enhance time series analysis and forecasting, paving the way for more advanced and versatile models in the field. The code for our framework is accessible at https://github.com/IkeYang/ViTime.
Published: 2024

4. SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

Author: Luo, Junwei, Pang, Zhen, Zhang, Yongjun, Wang, Tingzhu, Wang, Linlin, Dang, Bo, Lao, Jiangwei, Wang, Jian, Chen, Jingdong, Tan, Yihua, and Li, Yansheng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Remote Sensing Large Multi-Modal Models (RSLMMs) are developing rapidly and showcase significant capabilities in remote sensing imagery (RSI) comprehension. However, due to the limitations of existing datasets, RSLMMs have shortcomings in understanding the rich semantic relations among objects in complex remote sensing scenes. To unlock RSLMMs' complex comprehension ability, we propose a large-scale instruction tuning dataset FIT-RS, containing 1,800,851 instruction samples. FIT-RS covers common interpretation tasks and innovatively introduces several complex comprehension tasks of escalating difficulty, ranging from relation reasoning to image-level scene graph generation. Based on FIT-RS, we build the FIT-RSFG benchmark. Furthermore, we establish a new benchmark to evaluate the fine-grained relation comprehension capabilities of LMMs, named FIT-RSRC. Based on combined instruction data, we propose SkySenseGPT, which achieves outstanding performance on both public datasets and FIT-RSFG, surpassing existing RSLMMs. We hope the FIT-RS dataset can enhance the relation comprehension capability of RSLMMs and provide a large-scale fine-grained data source for the remote sensing community. The dataset will be available at https://github.com/Luo-Z13/SkySenseGPT
Comment: 30 pages, 5 figures, 19 tables, dataset and code see https://github.com/Luo-Z13/SkySenseGPT
Published: 2024

5. Low algorithmic delay implementation of convolutional beamformer for online joint source separation and dereverberation

Author: Mo, Kaien, Wang, Xianrui, Yang, Yichen, Makino, Shoji, and Chen, Jingdong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Blind-audio-source-separation (BASS) techniques, particularly those with low latency, play an important role in a wide range of real-time systems, e.g., hearing aids, in-car hands-free voice communication, real-time human-machine interaction, etc. Most existing BASS algorithms are derived to run in batch mode, and therefore large latency is unavoidable. Recently, some online algorithms were developed, which achieve separation on a frame-by-frame basis in the short-time-Fourier-transform (STFT) domain, and the latency is significantly reduced as compared to those batch methods. However, the latency with these algorithms may still be too long for many real-time systems to bear. To further reduce latency while achieving good separation performance, we propose in this work to integrate a weighted prediction error (WPE) module into a non-causal sample-truncating-based independent vector analysis (NST-IVA). The resulting algorithm can maintain the same algorithmic delay as NST-IVA if the delay with WPE is appropriately controlled while achieving significantly better performance, which is validated by simulations.
Comment: 4 pages, 4 figures. Accepted by EUSIPCO 2024
Published: 2024

6. Enhancing DETRs Variants through Improved Content Query and Similar Query Aggregation

Author: Zhang, Yingying, Shi, Chuangji, Guo, Xin, Lao, Jiangwei, Wang, Jian, Wang, Jiaotuan, and Chen, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: The design of the query is crucial for the performance of DETR and its variants. Each query consists of two components: a content part and a positional one. Traditionally, the content query is initialized with a zero or learnable embedding, lacking essential content information and resulting in sub-optimal performance. In this paper, we introduce a novel plug-and-play module, Self-Adaptive Content Query (SACQ), to address this limitation. The SACQ module utilizes features from the transformer encoder to generate content queries via self-attention pooling. This allows candidate queries to adapt to the input image, resulting in a more comprehensive content prior and better focus on target objects. However, this improved concentration poses a challenge for the training process that utilizes the Hungarian matching, which selects only a single candidate and suppresses other similar ones. To overcome this, we propose a query aggregation strategy to cooperate with SACQ. It merges similar predicted candidates from different queries, easing the optimization. Our extensive experiments on the COCO dataset demonstrate the effectiveness of our proposed approaches across six different DETR variants with multiple configurations, achieving an average improvement of over 1.0 AP.
Comment: 11 pages, 7 figures
Published: 2024

7. Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis

Author: Zhang, Zicheng, Zheng, Ruobing, Liu, Ziwen, Han, Congying, Li, Tianqi, Wang, Meng, Guo, Tiande, Chen, Jingdong, Li, Bonan, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent works in implicit representations, such as Neural Radiance Fields (NeRF), have advanced the generation of realistic and animatable head avatars from video sequences. These implicit methods are still confronted by visual artifacts and jitters, since the lack of explicit geometric constraints poses a fundamental challenge in accurately modeling complex facial deformations. In this paper, we introduce Dynamic Tetrahedra (DynTet), a novel hybrid representation that encodes explicit dynamic meshes by neural networks to ensure geometric consistency across various motions and viewpoints. DynTet is parameterized by the coordinate-based networks which learn signed distance, deformation, and material texture, anchoring the training data into a predefined tetrahedra grid. Leveraging Marching Tetrahedra, DynTet efficiently decodes textured meshes with a consistent topology, enabling fast rendering through a differentiable rasterizer and supervision via a pixel loss. To enhance training efficiency, we incorporate classical 3D Morphable Models to facilitate geometry learning and define a canonical space for simplifying texture learning. These advantages are readily achievable owing to the effective geometric representation employed in DynTet. Compared with prior works, DynTet demonstrates significant improvements in fidelity, lip synchronization, and real-time performance according to various metrics. Beyond producing stable and visually appealing synthesis videos, our method also outputs the dynamic meshes, which are promising for enabling many emerging applications.
Comment: CVPR 2024
Published: 2024

8. M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Author: Guo, Qingpei, Xu, Furong, Zhang, Hanxiao, Ren, Wang, Ma, Ziping, Ju, Lin, Wang, Jian, Chen, Jingdong, and Yang, Ming
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Vision-language foundation models like CLIP have revolutionized the field of artificial intelligence. Nevertheless, VLM models supporting multiple languages, e.g., both Chinese and English, have lagged due to the relative scarcity of large-scale pretraining datasets. Toward this end, we introduce a comprehensive bilingual (Chinese-English) dataset BM-6B with over 6 billion image-text pairs, aimed at enhancing multimodal foundation models to understand images well in both languages. To handle such a scale of dataset, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which reduces the communication overhead and GPU memory demands significantly, facilitating a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with an enhanced fine-grained understanding ability on BM-6B; the resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest $M^2$-Encoder-10B model has achieved top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
Published: 2024

9. Independent low-rank matrix analysis based on the Sinkhorn divergence source model for blind source separation

Author: Wang, Jianyu, Guan, Shanzheng, Chen, Jingdong, and Benesty, Jacob
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The so-called independent low-rank matrix analysis (ILRMA) has demonstrated a great potential for dealing with the problem of determined blind source separation (BSS) for audio and speech signals. This method assumes that the spectra from different frequency bands are independent and the spectral coefficients in any frequency band are Gaussian distributed. The Itakura-Saito divergence is then employed to estimate the source model related parameters. In reality, however, the spectral coefficients from different frequency bands may be dependent, which is not considered in the existing ILRMA algorithm. This paper presents an improved version of ILRMA, which considers the dependency between the spectral coefficients from different frequency bands. The Sinkhorn divergence is then exploited to optimize the source model parameters. As a result of using the cross-band information, the BSS performance is improved. However, the number of parameters to be estimated also increases significantly, and so does the computational complexity. To reduce the algorithm complexity, we apply the Kronecker product to decompose the modeling matrix into the product of a number of matrices of much smaller dimensionality. An efficient algorithm is then developed to implement the Sinkhorn divergence based BSS algorithm and the complexity is reduced by an order of magnitude.
Published: 2024

10. SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Author: Guo, Xin, Lao, Jiangwei, Dang, Bo, Zhang, Yingying, Yu, Lei, Ru, Lixiang, Zhong, Liheng, Huang, Ziyuan, Wu, Kang, Hu, Dingxiang, He, Huimei, Wang, Jian, Chen, Jingdong, Yang, Ming, Zhang, Yongjun, and Li, Yansheng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To the best of our knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.
Comment: Accepted by CVPR2024
Published: 2023

11. A computationally efficient semi-blind source separation based approach for nonlinear echo cancellation based on an element-wise iterative source steering

Author: Lu, Kunxing, Wang, Xianrui, Ueda, Tetsuya, Makino, Shoji, and Chen, Jingdong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: While the semi-blind source separation-based acoustic echo cancellation (SBSS-AEC) has received much research attention due to its promising performance during double-talk compared to the traditional adaptive algorithms, it suffers from system latency and nonlinear distortions. To circumvent these drawbacks, the recently developed ideas on convolutive transfer function (CTF) approximation and nonlinear expansion have been used in the iterative projection (IP)-based semi-blind source separation (SBSS) algorithm. However, because of the introduction of CTF approximation and nonlinear expansion, this algorithm becomes computationally very expensive, which makes it difficult to implement in embedded systems. Thus, we attempt in this paper to improve this IP-based algorithm, thereby developing an element-wise iterative source steering (EISS) algorithm. In comparison with the IP-based SBSS algorithm, the proposed algorithm is computationally much more efficient, especially when the nonlinear expansion order is high and the length of the CTF filter is long. Meanwhile, its AEC performance is as good as that of IP-based SBSS.
Published: 2023

12. Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup

Author: Wang, Maolin, Zhao, Yao, Liu, Jiajia, Chen, Jingdong, Zhuang, Chenyi, Gu, Jinjie, Guo, Ruocheng, and Zhao, Xiangyu
Subjects: Computer Science - Artificial Intelligence
Abstract: The deployment of Large Multimodal Models (LMMs) within AntGroup has significantly advanced multimodal tasks in payment, security, and advertising, notably enhancing advertisement audition tasks in Alipay. However, the deployment of such sizable models introduces challenges, particularly in increased latency and carbon emissions, which are antithetical to the ideals of Green AI. This paper introduces a novel multi-stage compression strategy for our proprietary LLM, AntGMM. Our methodology pivots on three main aspects: employing small training sample sizes, addressing multi-level redundancy through multi-stage pruning, and introducing an advanced distillation loss design. In our research, we constructed a dataset, the Multimodal Advertisement Audition Dataset (MAAD), from real-world scenarios within Alipay, and conducted experiments to validate the reliability of our proposed strategy. Furthermore, the effectiveness of our strategy is evident in its operational success in Alipay's real-world multimodal advertisement audition for three months from September 2023. Notably, our approach achieved a substantial reduction in latency, decreasing it from 700ms to 90ms, while maintaining online performance with only a slight performance decrease. Moreover, our compressed model is estimated to reduce electricity consumption by approximately 75 million kWh annually compared to the direct deployment of AntGMM, demonstrating our commitment to green AI initiatives. We will publicly release our code and the MAAD dataset after some reviews (https://github.com/MorinW/AntGMM_Pruning).
Published: 2023

13. LogicMP: A Neuro-symbolic Approach for Encoding First-order Logic Constraints

Author: Xu, Weidi, Wang, Jingwei, Xie, Lele, He, Jianshan, Zhou, Hongting, Wang, Taifeng, Wan, Xiaopei, Chen, Jingdong, Qu, Chao, and Chu, Wei
Subjects: Computer Science - Artificial Intelligence, Computer Science - Symbolic Computation
Abstract: Integrating first-order logic constraints (FOLCs) with neural networks is a crucial but challenging problem since it involves modeling intricate correlations to satisfy the constraints. This paper proposes a novel neural layer, LogicMP, whose layers perform mean-field variational inference over a Markov Logic Network (MLN). It can be plugged into any off-the-shelf neural network to encode FOLCs while retaining modularity and efficiency. By exploiting the structure and symmetries in MLNs, we theoretically demonstrate that our well-designed, efficient mean-field iterations effectively mitigate the difficulty of MLN inference, reducing the inference from sequential calculation to a series of parallel tensor operations. Empirical results in three kinds of tasks over graphs, images, and text show that LogicMP outperforms advanced competitors in both performance and efficiency.
Comment: 28 pages, 14 figures, 12 tables
Published: 2023

14. The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Author: Wu, Shilong, Wang, Chenxi, Chen, Hang, Dai, Yusheng, Zhang, Chenyue, Wang, Ruoyu, Lan, Hongbo, Du, Jun, Lee, Chin-Hui, Chen, Jingdong, Watanabe, Shinji, Siniscalchi, Sabato Marco, Scharenborg, Odette, Wang, Zhong-Qiu, Pan, Jia, and Gao, Jianqing
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Previous Multimodal Information based Speech Processing (MISP) challenges mainly focused on audio-visual speech recognition (AVSR) with commendable success. However, the most advanced back-end recognition systems often hit performance limits due to the complex acoustic environments. This has prompted a shift in focus towards the Audio-Visual Target Speaker Extraction (AVTSE) task for the MISP 2023 challenge in ICASSP 2024 Signal Processing Grand Challenges. Unlike existing audio-visual speech enhancement challenges primarily focused on simulation data, the MISP 2023 challenge uniquely explores how front-end speech processing, combined with visual clues, impacts back-end tasks in real-world scenarios. This pioneering effort aims to set the first benchmark for the AVTSE task, offering fresh insights into enhancing the accuracy of back-end speech recognition systems through AVTSE in challenging and real acoustic environments. This paper delivers a thorough overview of the task setting, dataset, and baseline system of the MISP 2023 challenge. It also includes an in-depth analysis of the challenges participants may encounter. The experimental results highlight the demanding nature of this task, and we look forward to the innovative solutions participants will bring forward.
Comment: 5 pages, 4 figures
Published: 2023

15. Mapping EEG Signals to Visual Stimuli: A Deep Learning Approach to Match vs. Mismatch Classification

Author: Yang, Yiqian, Zhao, Zhengqiao, Wang, Qian, Yang, Yan, and Chen, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computational Engineering, Finance, and Science
Abstract: Existing approaches to modeling associations between visual stimuli and brain responses are facing difficulties in handling between-subject variance and model generalization. Inspired by the recent progress in modeling speech-brain response, we propose in this work a "match-vs-mismatch" deep learning model to classify whether a video clip induces excitatory responses in recorded EEG signals and learn associations between the visual content and corresponding neural recordings. Using an exclusive experimental dataset, we demonstrate that the proposed model is able to achieve the highest accuracy on unseen subjects as compared to other baseline models. Furthermore, we analyze the inter-subject noise using a subject-level silhouette score in the embedding space and show that the developed model is able to mitigate inter-subject noise and significantly reduce the silhouette score. Moreover, we examine the Grad-CAM activation score and show that the brain regions associated with language processing contribute most to the model predictions, followed by regions associated with visual processing. These results have the potential to facilitate the development of neural recording-based video reconstruction and its related applications.
Published: 2023

16. An Anchor-Point Based Image-Model for Room Impulse Response Simulation with Directional Source Radiation and Sensor Directivity Patterns

Author: Pan, Chao, Zhang, Lei, Lu, Yilong, Jin, Jilu, Qiu, Lin, Chen, Jingdong, and Benesty, Jacob
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The image model method has been widely used to simulate room impulse responses and the endeavor to adapt this method to different applications has also piqued great interest over the last few decades. This paper attempts to extend the image model method and develops an anchor-point-image-model (APIM) approach as a solution for simulating impulse responses by including both the source radiation and sensor directivity patterns. To determine the orientations of all the virtual sources, anchor points are introduced to real sources, which subsequently lead to the determination of the orientations of the virtual sources. An algorithm is developed to generate room impulse responses with APIM by taking into account the directional pattern functions, fractional time delays, as well as the computational complexity. The developed model and algorithms can be used in various acoustic problems to simulate room acoustics and improve and evaluate processing algorithms.
Comment: 19 pages, 8 figures
Published: 2023

17. The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

Author: Wang, Zhe, Wu, Shilong, Chen, Hang, He, Mao-Kui, Du, Jun, Lee, Chin-Hui, Chen, Jingdong, Watanabe, Shinji, Siniscalchi, Sabato, Scharenborg, Odette, Liu, Diyuan, Yin, Baocai, Pan, Jia, Gao, Jianqing, and Liu, Cong
Subjects: Computer Science - Multimedia
Abstract: The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting the research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), aiming to solve "who spoke when" using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task that focuses on addressing "who spoke what when" with audio-visual speaker diarization results. Both tracks focus on the Chinese language, and use far-field audio and video in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate the good performance of the AVDR baseline system, and the potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and the indistinguishable speakers.
Comment: 5 pages, 4 figures, to be published in ICASSP2023
Published: 2023

18. Robust Manifold Nonnegative Tucker Factorization for Tensor Data Representation

Author: Wang, Jianyu, Tang, Linruize, Chen, Jie, and Chen, Jingdong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Nonnegative Tucker Factorization (NTF) minimizes the Euclidean distance or Kullback-Leibler divergence between the original data and its low-rank approximation, which often suffers from gross corruptions or outliers and the neglect of manifold structures of data. In particular, NTF suffers from rotational ambiguity: solutions with and without rotation transformations are equally good in the sense of yielding the maximum likelihood. In this paper, we propose three Robust Manifold NTF algorithms to handle outliers by incorporating structural knowledge about the outliers. They first apply a half-quadratic optimization algorithm to transform the problem into a general weighted NTF where the weights are influenced by the outliers. Then, we introduce the correntropy induced metric, Huber function and Cauchy function for weights respectively, to handle the outliers. Finally, we introduce a manifold regularization to overcome the rotational ambiguity of NTF. We have compared the proposed method with a number of representative references covering major branches of NTF on a variety of real-world image databases. Experimental results illustrate the effectiveness of the proposed method under two evaluation metrics (accuracy and NMI).
Published: 2022

19. SimAN: Exploring Self-Supervised Representation Learning of Scene Text via Similarity-Aware Normalization

Author: Luo, Canjie, Jin, Lianwen, and Chen, Jingdong
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, self-supervised representation learning has drawn considerable attention from the scene text recognition community. Different from previous studies using contrastive learning, we tackle the issue from an alternative perspective, i.e., by formulating the representation learning scheme in a generative manner. Typically, the neighboring image patches among one text line tend to have similar styles, including the strokes, textures, colors, etc. Motivated by this common sense, we augment one image patch and use its neighboring patch as guidance to recover itself. Specifically, we propose a Similarity-Aware Normalization (SimAN) module to identify the different patterns and align the corresponding styles from the guiding patch. In this way, the network gains representation capability for distinguishing complex patterns such as messy strokes and cluttered backgrounds. Experiments show that the proposed SimAN significantly improves the representation quality and achieves promising performance. Moreover, we surprisingly find that our self-supervised generative network has impressive potential for data synthesis, text image editing, and font interpolation, which suggests that the proposed SimAN has a wide range of practical applications.
Comment: Accepted to appear in CVPR 2022
Published: 2022

20. Hierarchical Memory Learning for Fine-Grained Scene Graph Generation

Author: Deng, Youming, Li, Yansheng, Zhang, Yongjun, Xiang, Xiang, Wang, Jian, Chen, Jingdong, and Ma, Jiayi
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In Scene Graph Generation (SGG), coarse and fine predicates mix in the dataset due to the crowd-sourced labeling, and the long-tail problem is also pronounced. Given this tricky situation, many existing SGG methods treat the predicates equally and learn the model under the supervision of mixed-granularity predicates in one stage, leading to relatively coarse predictions. In order to alleviate the negative impact of the suboptimal mixed-granularity annotation and long-tail effect problems, this paper proposes a novel Hierarchical Memory Learning (HML) framework to learn the model from simple to complex, which is similar to human beings' hierarchical memory learning process. After the autonomous partition of coarse and fine predicates, the model is first trained on the coarse predicates and then learns the fine predicates. In order to realize this hierarchical learning pattern, this paper, for the first time, formulates the HML framework using the new Concept Reconstruction (CR) and Model Reconstruction (MR) constraints. It is worth noticing that the HML framework can be taken as one general optimization strategy to improve various SGG models, and significant improvement can be achieved on the SGG benchmark (i.e., Visual Genome)., Comment: ECCV 2022
Published: 2022
Full Text: View/download PDF

21. Training Protocol Matters: Towards Accurate Scene Text Recognition via Training Protocol Searching

Author: Chu, Xiaojie, Wang, Yongtao, Shen, Chunhua, Chen, Jingdong, and Chu, Wei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The development of scene text recognition (STR) in the era of deep learning has been mainly focused on novel architectures of STR models. However, the training protocol (i.e., settings of the hyper-parameters involved in the training of STR models), which plays an equally important role in successfully training a good STR model, is under-explored for scene text recognition. In this work, we attempt to improve the accuracy of existing STR models by searching for an optimal training protocol. Specifically, we develop a training protocol search algorithm, based on a newly designed search space and an efficient search algorithm using evolutionary optimization and proxy tasks. Experimental results show that our searched training protocol can improve the recognition accuracy of mainstream STR models by 2.7%~3.9%. In particular, with the searched training protocol, TRBA-Net achieves 2.1% higher accuracy than the state-of-the-art STR model (i.e., EFIFSTR), while the inference speed is 2.3x and 3.7x faster on CPU and GPU respectively. Extensive experiments are conducted to demonstrate the effectiveness of the proposed method and the generalization ability of the training protocol found by our search method. Code is available at https://github.com/VDIGPKU/STR_TPSearch.
Published: 2022

22. CBNet: A Composite Backbone Network Architecture for Object Detection

Author: Liang, Tingting, Chu, Xiaojie, Liu, Yudong, Wang, Yongtao, Tang, Zhi, Chu, Wei, Chen, Jingdong, and Ling, Haibin
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through exploring more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNetV2, to construct high-performance detectors using existing open-sourced pre-trained backbones under the pre-training fine-tuning paradigm. In particular, CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to more efficiently perform object detection. We also propose a better training strategy with assistant supervision for CBNet-based detectors. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to various backbones (CNN-based vs. Transformer-based) and head designs of most mainstream detectors (one-stage vs. two-stage, anchor-based vs. anchor-free-based). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 introduces a more efficient, effective, and resource-friendly way to build high-performance backbone networks. Particularly, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model and single-scale testing protocol, which is significantly better than the state-of-the-art result (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while the training schedule is reduced by 6$\times$. With multi-scale testing, we push the current best single model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2., Comment: IEEE Transactions on Image Processing (TIP) camera ready
Published: 2021
Full Text: View/download PDF

23. MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction

Author: Tang, Guozhi, Xie, Lele, Jin, Lianwen, Wang, Jiapeng, Chen, Jingdong, Xu, Zhen, Wang, Qianying, Wu, Yaqiang, and Li, Hui
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: The Visual Information Extraction (VIE) task aims to extract key information from multifarious document images (e.g., invoices and purchase receipts). Most previous methods treat the VIE task simply as a sequence labeling problem or classification problem, which requires models to carefully identify each kind of semantics by introducing multimodal features, such as font, color, and layout. But simply introducing multimodal features does not work well when faced with numeric semantic categories or some ambiguous texts. To address this issue, in this paper we propose a novel key-value matching model based on a graph neural network for VIE (MatchVIE). Through key-value matching based on relevancy evaluation, the proposed MatchVIE can bypass the recognition of various semantics and simply focus on the strong relevancy between entities. Besides, we introduce a simple but effective operation, Num2Vec, to tackle the instability of encoded values, which helps the model converge more smoothly. Comprehensive experiments demonstrate that the proposed MatchVIE can significantly outperform previous methods. Notably, to the best of our knowledge, MatchVIE may be the first attempt to tackle the VIE task by modeling the relevancy between keys and values and it is a good complement to the existing methods., Comment: accepted by IJCAI 2021
Published: 2021

24. CMUA-Watermark: A Cross-Model Universal Adversarial Watermark for Combating Deepfakes

Author: Huang, Hao, Wang, Yongtao, Chen, Zhaoyu, Zhang, Yuze, Li, Yuheng, Tang, Zhi, Chu, Wei, Chen, Jingdong, Lin, Weisi, and Ma, Kai-Kuang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Malicious applications of deepfakes (i.e., technologies generating target facial attributes or entire faces from facial images) have posed a huge threat to individuals' reputation and security. To mitigate these threats, recent studies have proposed adversarial watermarks to combat deepfake models, leading them to generate distorted outputs. Despite achieving impressive results, these adversarial watermarks have low image-level and model-level transferability, meaning that they can protect only one facial image from one specific deepfake model. To address these issues, we propose a novel solution that can generate a Cross-Model Universal Adversarial Watermark (CMUA-Watermark), protecting a large number of facial images from multiple deepfake models. Specifically, we begin by proposing a cross-model universal attack pipeline that attacks multiple deepfake models iteratively. Then, we design a two-level perturbation fusion strategy to alleviate the conflict between the adversarial watermarks generated by different facial images and models. Moreover, we address the key problem in cross-model optimization with a heuristic approach to automatically find the suitable attack step sizes for different models, further weakening the model-level conflict. Finally, we introduce a more reasonable and comprehensive evaluation method to fully test the proposed method and compare it with existing ones. Extensive experimental results demonstrate that the proposed CMUA-Watermark can effectively distort the fake facial images generated by multiple deepfake models while achieving a better performance than existing methods., Comment: 9 pages, 7 figures, Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI22
Published: 2021

25. AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario

Author: Fu, Yihui, Cheng, Luyao, Lv, Shubo, Jv, Yukai, Kong, Yuxiang, Chen, Zhuo, Hu, Yanxin, Xie, Lei, Wu, Jian, Bu, Hui, Xu, Xin, Du, Jun, and Chen, Jingdong
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: In this paper, we present AISHELL-4, a sizable real-recorded Mandarin speech dataset collected by an 8-channel circular microphone array for speech processing in conference scenarios. The dataset consists of 211 recorded meeting sessions, each containing 4 to 8 speakers, with a total length of 120 hours. This dataset aims to bridge the advanced research on multi-speaker processing and the practical application scenario in three aspects. With real recorded meetings, AISHELL-4 provides realistic acoustics and rich natural speech characteristics in conversation such as short pause, speech overlap, quick speaker turn, noise, etc. Meanwhile, accurate transcription and speaker voice activity are provided for each meeting in AISHELL-4. This allows the researchers to explore different aspects in meeting processing, ranging from individual tasks such as speech front-end processing, speech recognition and speaker diarization, to multi-modality modeling and joint optimization of relevant tasks. Given that most open-source datasets for multi-speaker tasks are in English, AISHELL-4 is the only Mandarin dataset for conversation speech, providing additional value for data diversity in the speech community. We also release a PyTorch-based training and evaluation framework as a baseline system to promote reproducible research in this field., Comment: Accepted by Interspeech 2021
Published: 2021

26. Affine Combination of Diffusion Strategies over Networks

Author: Jin, Danqi, Chen, Jie, Richard, Cedric, Chen, Jingdong, and Sayed, Ali H.
Subjects: Electrical Engineering and Systems Science - Signal Processing, Electrical Engineering and Systems Science - Systems and Control
Abstract: Diffusion adaptation is a powerful strategy for distributed estimation and learning over networks. Motivated by the concept of combining adaptive filters, this work proposes a combination framework that aggregates the operation of multiple diffusion strategies for enhanced performance. By assigning a combination coefficient to each node, and using an adaptation mechanism to minimize the network error, we obtain a combined diffusion strategy that benefits from the best characteristics of all component strategies simultaneously in terms of excess-mean-square error (EMSE). Analyses of the universality are provided to show the superior performance of affine combination scheme and to characterize its behavior in the mean and mean-square sense. Simulation results are presented to demonstrate the effectiveness of the proposed strategies, as well as the accuracy of theoretical findings., Comment: 31 pages
Published: 2020

27. Partial AUC optimization based deep speaker embeddings with class-center learning for text-independent speaker verification

Author: Bai, Zhongxin, Zhang, Xiao-Lei, and Chen, Jingdong
Subjects: Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Deep embedding based text-independent speaker verification has demonstrated superior performance to traditional methods in many challenging scenarios. Its loss functions can be generally categorized into two classes, i.e., verification and identification. The verification loss functions match the pipeline of speaker verification, but their implementations are difficult. Thus, most state-of-the-art deep embedding methods use the identification loss functions with softmax output units or their variants. In this paper, we propose a verification loss function, named the maximization of partial area under the Receiver-operating-characteristic (ROC) curve (pAUC), for deep embedding based text-independent speaker verification. We also propose a class-center based training trial construction method to improve the training efficiency, which is critical for the proposed loss function to be comparable to the identification loss in performance. Experiments on the Speaker in the Wild (SITW) and NIST SRE 2016 datasets show that the proposed pAUC loss function is highly competitive with the state-of-the-art identification loss functions.
Published: 2019

28. Speaker Verification By Partial AUC Optimization With Mahalanobis Distance Metric Learning

Author: Bai, Zhongxin, Zhang, Xiao-Lei, and Chen, Jingdong
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Receiver operating characteristic (ROC) and detection error tradeoff (DET) curves are two widely used evaluation metrics for speaker verification. They are equivalent since the latter can be obtained by transforming the former's true positive y-axis to false negative y-axis and then re-scaling both axes by a probit operator. Real-world speaker verification systems, however, usually work on part of the ROC curve instead of the entire ROC curve given an application. Therefore, we propose in this paper to use the area under part of the ROC curve (pAUC) as a more efficient evaluation metric for speaker verification. A Mahalanobis distance metric learning based back-end is applied to optimize pAUC, where the Mahalanobis distance metric learning guarantees that the optimization objective of the back-end is a convex one so that the global optimum solution is achievable. To improve the performance of the state-of-the-art speaker verification systems by the proposed back-end, we further propose two feature preprocessing techniques based on length-normalization and probabilistic linear discriminant analysis respectively. We evaluate the proposed systems on the major languages of NIST SRE16 and the core tasks of SITW. Experimental results show that the proposed back-end outperforms the state-of-the-art speaker verification back-ends in terms of seven evaluation metrics.
Published: 2019

29. End-to-End Model for Speech Enhancement by Consistent Spectrogram Masking

Author: Du, Xingjian, Zhu, Mengyao, Shi, Xuan, Zhang, Xinpeng, Zhang, Wen, and Chen, Jingdong
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recently, phase processing is attracting increasing interest in the speech enhancement community. Some researchers integrate a phase estimation module into speech enhancement models by using complex-valued short-time Fourier transform (STFT) spectrogram based training targets, e.g. Complex Ratio Mask (cRM) [1]. However, masking on the spectrogram would violate its consistency constraints. In this work, we prove that the inconsistency problem enlarges the solution space of the speech enhancement model and causes unintended artifacts. Consistency Spectrogram Masking (CSM) is proposed to estimate the complex spectrogram of a signal with the consistency constraint in a simple but not trivial way. The experiments comparing our CSM based end-to-end model with other methods are conducted to confirm that the CSM accelerates the model training and yields significant improvements in speech quality. From our experimental results, we assured that our method could enha
Published: 2019

30. Adaptive Parameters Adjustment for Group Reweighted Zero-Attracting LMS

Author: Jin, Danqi, Chen, Jie, Richard, Cedric, and Chen, Jingdong
Subjects: Electrical Engineering and Systems Science - Signal Processing
Abstract: Group zero-attracting LMS and its reweighted form have been proposed for addressing system identification problems with structural group sparsity in the parameters to estimate. Both algorithms however suffer from a trade-off between sparsity degree and estimation bias and, in addition, between convergence speed and steady-state performance like most adaptive filtering algorithms. It is therefore necessary to properly set their step size and regularization parameter. Based on a model of their transient behavior, we introduce a variable-parameter variant of both algorithms to address this issue. By minimizing their mean-square deviation at each time instant, we obtain closed-form expressions of the optimal step size and regularization parameter. Simulation results illustrate the effectiveness of the proposed algorithms., Comment: 9 pages, 3 figures
Published: 2018

31. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Author: Amodei, Dario, Anubhai, Rishita, Battenberg, Eric, Case, Carl, Casper, Jared, Catanzaro, Bryan, Chen, Jingdong, Chrzanowski, Mike, Coates, Adam, Diamos, Greg, Elsen, Erich, Engel, Jesse, Fan, Linxi, Fougner, Christopher, Han, Tony, Hannun, Awni, Jun, Billy, LeGresley, Patrick, Lin, Libby, Narang, Sharan, Ng, Andrew, Ozair, Sherjil, Prenger, Ryan, Raiman, Jonathan, Satheesh, Sanjeev, Seetapun, David, Sengupta, Shubho, Wang, Yi, Wang, Zhiqian, Wang, Chong, Xiao, Bo, Yogatama, Dani, Zhan, Jun, and Zhu, Zhenyao
Subjects: Computer Science - Computation and Language
Abstract: We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.
Published: 2015


