13 results for "Songlong Xing"
Search Results
2. Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion.
- Author
- Sijie Mai, Haifeng Hu 0001, and Songlong Xing
- Published
- 2020
- Full Text
- View/download PDF
3. Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing.
- Author
- Sijie Mai, Haifeng Hu 0001, and Songlong Xing
- Published
- 2019
- Full Text
- View/download PDF
4. Multimodal Graph for Unaligned Multimodal Sequence Analysis via Graph Convolution and Graph Pooling
- Author
- Sijie Mai, Songlong Xing, Jiaxuan He, Ying Zeng, and Haifeng Hu
- Subjects
Computer Networks and Communications, Hardware and Architecture - Abstract
Multimodal sequence analysis aims to draw inferences from visual, language, and acoustic sequences. A majority of existing works focus on the aligned fusion of the three modalities to explore inter-modal interactions, which is impractical in real-world scenarios. To overcome this issue, we focus on analyzing unaligned sequences, which is still relatively underexplored and more challenging. We propose Multimodal Graph, whose novelty mainly lies in transforming the sequential learning problem into a graph learning problem. The graph-based structure enables parallel computation along the time dimension (as opposed to recurrent neural networks) and can effectively learn long-range intra- and inter-modal temporal dependencies in unaligned sequences. First, we propose multiple ways to construct the adjacency matrix for a sequence to perform the sequence-to-graph transformation. To learn intra-modal dynamics, a graph convolution network is employed for each modality based on the defined adjacency matrix. To learn inter-modal dynamics, given that the unimodal sequences are unaligned, the commonly considered word-level fusion does not apply. To this end, we devise graph pooling algorithms to automatically explore the associations between time slices from different modalities and learn high-level graph representations hierarchically. Multimodal Graph outperforms state-of-the-art models on three datasets under the same experimental setting.
- Published
- 2023
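The following is a minimal, hypothetical sketch (PyTorch) of the sequence-to-graph idea summarised in the abstract above: each time step becomes a graph node, a window-based adjacency matrix links nearby steps, and a single graph-convolution layer propagates information along those edges. The window size, dimensions, and names are illustrative assumptions, not the authors' released code; the graph pooling used for inter-modal fusion is omitted.

```python
# Illustrative sketch only: names, dimensions, and structure are assumptions, not the paper's code.
import torch
import torch.nn as nn


def window_adjacency(seq_len: int, window: int = 3) -> torch.Tensor:
    """Connect each time step to neighbours within `window` steps (including itself)."""
    idx = torch.arange(seq_len)
    adj = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window
    return adj.float()


class GraphConv(nn.Module):
    """One symmetrically normalised graph convolution: H' = relu(D^-1/2 A D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        deg = adj.sum(dim=-1)                       # node degrees
        norm = deg.rsqrt()                          # D^{-1/2}
        adj_norm = norm.unsqueeze(1) * adj * norm.unsqueeze(0)
        return torch.relu(adj_norm @ self.linear(x))


if __name__ == "__main__":
    acoustic = torch.randn(50, 74)                  # unaligned acoustic sequence: 50 steps, 74-dim
    adj = window_adjacency(seq_len=50, window=3)
    out = GraphConv(74, 32)(acoustic, adj)
    print(out.shape)                                # torch.Size([50, 32])
```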
5. Adapted Dynamic Memory Network for Emotion Recognition in Conversation
- Author
- Sijie Mai, Haifeng Hu, and Songlong Xing
- Subjects
Dependency (UML), Computer science, Speech recognition, Process (computing), Context (language use), Human-Computer Interaction, Benchmark (computing), Task analysis, Conversation, Representation (mathematics), Episodic memory, Software - Abstract
In this paper, we address Emotion Recognition in Conversation (ERC), where conversational data are presented in a multimodal setting. Psychological evidence shows that self- and inter-speaker influence are two central factors in emotion dynamics in conversation. State-of-the-art models do not effectively synthesise these two factors. Therefore, we propose an Adapted Dynamic Memory Network (A-DMN) in which self- and inter-speaker influences are modelled individually and further synthesised with respect to the current utterance. Specifically, we model the dependency of the constituent utterances in a dialogue video using a global RNN to capture inter-speaker influence. Likewise, each speaker is assigned an RNN to capture their self-influence. Afterwards, an Episodic Memory Module is devised to extract contexts for self- and inter-speaker influence and synthesise them to update the memory. This process repeats for multiple passes until a refined representation is obtained and used for the final prediction. Additionally, we explore cross-modal fusion in the context of multimodal ERC and propose a convolution-based method that proves effective at extracting local interactions while remaining computationally efficient. Extensive experiments demonstrate that A-DMN outperforms state-of-the-art models on benchmark datasets.
- Published
- 2022
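As a rough, hypothetical illustration of the idea that self- and inter-speaker influence are modelled by separate recurrent tracks, the sketch below runs one global GRU over every utterance and one GRU per speaker over that speaker's utterances, then concatenates the two states per utterance. It is not the A-DMN implementation; the episodic memory module and cross-modal fusion are omitted, and all names and dimensions are assumptions.

```python
# Illustrative sketch only: not the authors' A-DMN code; dimensions and names are assumptions.
import torch
import torch.nn as nn


class SelfAndInterSpeakerEncoder(nn.Module):
    def __init__(self, utt_dim: int, hidden: int, num_speakers: int):
        super().__init__()
        self.global_rnn = nn.GRUCell(utt_dim, hidden)               # inter-speaker influence
        self.speaker_rnns = nn.ModuleList(
            [nn.GRUCell(utt_dim, hidden) for _ in range(num_speakers)]
        )
        self.hidden = hidden

    def forward(self, utterances: torch.Tensor, speakers: list[int]) -> torch.Tensor:
        """utterances: (T, utt_dim); speakers: speaker index for each utterance."""
        g = utterances.new_zeros(self.hidden)
        s = {i: utterances.new_zeros(self.hidden) for i in set(speakers)}
        outputs = []
        for t, spk in enumerate(speakers):
            u = utterances[t]
            g = self.global_rnn(u.unsqueeze(0), g.unsqueeze(0)).squeeze(0)
            s[spk] = self.speaker_rnns[spk](u.unsqueeze(0), s[spk].unsqueeze(0)).squeeze(0)
            outputs.append(torch.cat([g, s[spk]], dim=-1))           # combine both influences
        return torch.stack(outputs)                                  # (T, 2 * hidden)


if __name__ == "__main__":
    enc = SelfAndInterSpeakerEncoder(utt_dim=100, hidden=64, num_speakers=2)
    utts = torch.randn(6, 100)                                       # six utterances in a dialogue
    print(enc(utts, speakers=[0, 1, 0, 0, 1, 1]).shape)              # torch.Size([6, 128])
```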
6. Multi-Fusion Residual Memory Network for Multimodal Human Sentiment Comprehension
- Author
- Sijie Mai, Haifeng Hu, Songlong Xing, and Jia Xu
- Subjects
Focus (computing), Sequence, Modalities, Forgetting, Dependency (UML), Computer science, Process (engineering), Machine learning, Human-Computer Interaction, Comprehension, State (computer science), Artificial intelligence, Software - Abstract
Multimodal human sentiment comprehension refers to recognizing human affect from multiple modalities. There are two key issues for this problem. Firstly, it is difficult to explore time-dependent interactions between modalities and to focus on the important time steps. Secondly, processing the long fused sequence of utterances is susceptible to the forgetting problem due to long-term temporal dependency. In this paper, we introduce a hierarchical learning architecture to classify utterance-level sentiment. To address the first issue, we perform time-step-level fusion to generate fused features for each time step, which explicitly models time-restricted interactions by incorporating information across modalities at the same time step. Furthermore, based on the assumption that acoustic features directly reflect emotional intensity, we introduce an emotion-intensity attention mechanism to focus on the time steps where emotion changes or intense affect takes place. To handle the second issue, we propose a Residual Memory Network (RMN) to process the fused sequence. RMN directly passes the previous state into the next time step, which helps retain information from many time steps ago. We show that our method achieves state-of-the-art performance on multiple datasets. Results also suggest that RMN yields competitive performance on sequence modeling tasks.
- Published
- 2022
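The sketch below is a hypothetical illustration of two points from the abstract above: time-step-level fusion (concatenating the modalities at each step) and a residual-style recurrence that adds the previous state directly into the new one so information from distant steps is retained. It is one reading of the abstract, not the authors' RMN; dimensions are made up.

```python
# Illustrative sketch only: a reading of the abstract, not the paper's RMN.
import torch
import torch.nn as nn


class ResidualMemoryCell(nn.Module):
    def __init__(self, in_dim: int, hidden: int):
        super().__init__()
        self.update = nn.Linear(in_dim + hidden, hidden)

    def forward(self, fused_seq: torch.Tensor) -> torch.Tensor:
        """fused_seq: (T, in_dim) -> final state of shape (hidden,)."""
        h = fused_seq.new_zeros(self.update.out_features)
        for x_t in fused_seq:
            candidate = torch.tanh(self.update(torch.cat([x_t, h], dim=-1)))
            h = h + candidate                      # residual pass-through of the previous state
        return h


if __name__ == "__main__":
    T = 20
    language, acoustic, visual = torch.randn(T, 300), torch.randn(T, 74), torch.randn(T, 35)
    fused = torch.cat([language, acoustic, visual], dim=-1)    # time-step-level fusion
    sentiment_state = ResidualMemoryCell(in_dim=300 + 74 + 35, hidden=128)(fused)
    print(sentiment_state.shape)                   # torch.Size([128])
```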
7. A Unimodal Representation Learning and Recurrent Decomposition Fusion Structure for Utterance-Level Multimodal Embedding Learning
- Author
- Sijie Mai, Songlong Xing, and Haifeng Hu
- Subjects
Facial expression, Modality (human–computer interaction), Computer science, Speech recognition, Computer Science Applications, Kernel (image processing), Signal Processing, Media Technology, Embedding, Electrical and Electronic Engineering, Representation (mathematics), Feature learning, Utterance, Spoken language - Abstract
Learning a unified embedding for utterance-level video has attracted significant attention recently due to the rapid development of social media and its broad applications. An utterance normally contains not only spoken language but also nonverbal behaviors such as facial expressions and vocal patterns. Instead of directly learning the utterance embedding from low-level features, we first explore a high-level representation for each modality separately via a unimodal representation learning gyroscope structure. In this way, the learnt unimodal representations are more representative and contain more abstract semantic information. In the gyroscope structure, we introduce multi-scale kernel learning, 'channel expansion' and 'channel fusion' operations to explore high-level features both spatially and channel-wise. Another insight of our method is that we fuse the representations of all modalities into a unified embedding by interpreting the fusion procedure as the flow of inter-modality information between modalities, which is more specialized in terms of the information to be fused and the fusion process. Specifically, considering that each modality carries both modality-specific and cross-modality interactions, we decompose unimodal representations into intra- and inter-modality dynamics using a gating mechanism, and further fuse the inter-modality dynamics by passing them from previous modalities to the following one using a recurrent neural fusion architecture. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple benchmark datasets.
- Published
- 2022
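As a hypothetical illustration of the decomposition idea described above, the sketch below uses a sigmoid gate to split each unimodal representation into intra- and inter-modality parts and passes the inter-modality parts through a small recurrent fusion cell, modality by modality. The gyroscope structure and multi-scale kernels are not reproduced; all names, sizes, and the final concatenation are assumptions.

```python
# Illustrative sketch only: the gating/recurrent-fusion idea, not the paper's architecture.
import torch
import torch.nn as nn


class GatedDecompositionFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)            # per-feature gate in (0, 1)
        self.fuse = nn.GRUCell(dim, dim)           # recurrent flow of inter-modality dynamics

    def forward(self, modalities: list[torch.Tensor]) -> torch.Tensor:
        """modalities: list of (dim,) unimodal representations, e.g. [language, acoustic, visual]."""
        intra_parts = []
        state = modalities[0].new_zeros(1, self.gate.out_features)
        for m in modalities:
            g = torch.sigmoid(self.gate(m))
            intra, inter = g * m, (1.0 - g) * m    # decompose into intra-/inter-modality dynamics
            state = self.fuse(inter.unsqueeze(0), state)   # pass inter-dynamics to the next modality
            intra_parts.append(intra)
        return torch.cat(intra_parts + [state.squeeze(0)], dim=-1)


if __name__ == "__main__":
    dim = 64
    language, acoustic, visual = (torch.randn(dim) for _ in range(3))
    embedding = GatedDecompositionFusion(dim)([language, acoustic, visual])
    print(embedding.shape)                          # torch.Size([256])  (3 * dim + dim)
```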
8. Analyzing Multimodal Sentiment Via Acoustic- and Visual-LSTM With Channel-Aware Temporal Convolution Network
- Author
- Sijie Mai, Haifeng Hu, and Songlong Xing
- Subjects
Modality (human–computer interaction), Acoustics and Ultrasonics, Computer science, Speech recognition, Perspective (graphical), Feature extraction, Sentiment analysis, Visualization, Multimodal learning, Computational Mathematics, Computer Science (miscellaneous), Task analysis, Electrical and Electronic Engineering, Feature learning - Abstract
Human emotion is naturally expressed from a multimodal perspective. Analyzing multimodal human sentiment remains challenging due to the difficulty of interpreting inter-modality dynamics. Mainstream multimodal learning architectures tend to design various fusion strategies to learn inter-modality interactions, which barely consider the fact that the language modality is far more important than the acoustic and visual modalities. In contrast, we learn inter-modality dynamics from a different perspective via acoustic- and visual-LSTMs in which language features play the dominant role. Specifically, inside each LSTM variant, a well-designed gating mechanism is introduced to enhance the language representation via the corresponding auxiliary modality. Furthermore, in the unimodal representation learning stage, instead of using RNNs, we introduce a 'channel-aware' temporal convolution network to extract high-level representations for each modality, exploring both temporal and channel-wise interdependencies. Extensive experiments demonstrate that our approach achieves very competitive performance compared to the state-of-the-art methods on three widely used benchmarks for multimodal sentiment analysis and emotion recognition.
- Published
- 2021
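The sketch below is a hypothetical, simplified reading of the gating idea above: language is treated as the dominant stream, and an auxiliary modality (acoustic or visual) contributes only through a learned gate before a standard LSTM. It is not the paper's LSTM variant, and the channel-aware temporal convolution stage is omitted; all names and dimensions are assumptions.

```python
# Illustrative sketch only: a simplified gating scheme, not the paper's acoustic-/visual-LSTM.
import torch
import torch.nn as nn


class AuxiliaryGatedEnhancer(nn.Module):
    def __init__(self, lang_dim: int, aux_dim: int):
        super().__init__()
        self.gate = nn.Linear(lang_dim + aux_dim, lang_dim)
        self.project = nn.Linear(aux_dim, lang_dim)

    def forward(self, language: torch.Tensor, auxiliary: torch.Tensor) -> torch.Tensor:
        """language: (T, lang_dim); auxiliary: (T, aux_dim); returns enhanced (T, lang_dim)."""
        g = torch.sigmoid(self.gate(torch.cat([language, auxiliary], dim=-1)))
        return language + g * torch.tanh(self.project(auxiliary))   # language stays dominant


if __name__ == "__main__":
    T = 30
    language, acoustic = torch.randn(T, 300), torch.randn(T, 74)
    enhanced = AuxiliaryGatedEnhancer(lang_dim=300, aux_dim=74)(language, acoustic)
    lstm_out, _ = nn.LSTM(input_size=300, hidden_size=128, batch_first=True)(enhanced.unsqueeze(0))
    print(lstm_out.shape)                           # torch.Size([1, 30, 128])
```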
9. Constrained LSTM and Residual Attention for Image Captioning
- Author
- Xinlong Lu, Haifeng Hu, Songlong Xing, and Liang Yang
- Subjects
Closed captioning, Computer Networks and Communications, Computer science, Object detection, Focus (linguistics), Hardware and Architecture, Visual Objects, Relevance (information retrieval), Language model, Artificial intelligence, Representation (mathematics), Natural language processing, Sentence - Abstract
Visual structure and syntactic structure are essential in images and texts, respectively. Visual structure depicts both entities in an image and their interactions, whereas syntactic structure in texts can reflect the part-of-speech constraints between adjacent words. Most existing methods either use a global visual representation to guide the language model or generate captions without considering the relationships of different entities or adjacent words. Thus, their language models lack relevance to both visual and syntactic structure. To solve this problem, we propose a model that aligns the language model to a certain visual structure and also constrains it with a specific part-of-speech template. In addition, most methods exploit the latent relationship between words in a sentence and pre-extracted visual regions in an image yet ignore the effects of unextracted regions on predicted words. We develop a residual attention mechanism to simultaneously focus on the pre-extracted visual objects and unextracted regions in an image. Residual attention is capable of capturing precise regions of an image corresponding to the predicted words, considering the effects of both visual objects and unextracted regions. The effectiveness of our entire framework and of each proposed module is verified on two classical datasets: MSCOCO and Flickr30k. Our framework is on par with or even better than the state-of-the-art methods and achieves superior performance on the COCO captioning leaderboard.
- Published
- 2020
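As a loose, hypothetical illustration of a residual attention step, the sketch below computes additive attention over pre-extracted region features and adds a residual term derived from a global image feature to stand in for unextracted regions. The constrained LSTM and part-of-speech template are not shown; names and dimensions are assumptions, not the paper's architecture.

```python
# Illustrative sketch only: not the paper's residual attention module.
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    def __init__(self, region_dim: int, hidden_dim: int):
        super().__init__()
        self.score = nn.Linear(region_dim + hidden_dim, 1)
        self.residual = nn.Linear(region_dim, region_dim)   # contribution of unextracted content

    def forward(self, regions: torch.Tensor, global_feat: torch.Tensor,
                decoder_state: torch.Tensor) -> torch.Tensor:
        """regions: (N, region_dim); global_feat: (region_dim,); decoder_state: (hidden_dim,)."""
        expanded = decoder_state.unsqueeze(0).expand(regions.size(0), -1)
        weights = torch.softmax(self.score(torch.cat([regions, expanded], dim=-1)), dim=0)
        attended = (weights * regions).sum(dim=0)            # focus on detected objects
        return attended + torch.tanh(self.residual(global_feat))  # residual: unextracted regions


if __name__ == "__main__":
    regions = torch.randn(36, 2048)                # e.g. 36 detected region features
    global_feat = torch.randn(2048)                # whole-image feature
    decoder_state = torch.randn(512)               # current decoder LSTM state
    context = ResidualAttention(2048, 512)(regions, global_feat, decoder_state)
    print(context.shape)                           # torch.Size([2048])
```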
10. Locally Confined Modality Fusion Network With a Global Perspective for Multimodal Human Affective Computing
- Author
- Sijie Mai, Haifeng Hu, and Songlong Xing
- Subjects
Fusion, Modality (human–computer interaction), Theoretical computer science, Computer science, Feature vector, Perspective (graphical), Computer Science Applications, Signal Processing, Media Technology, Segmentation, Tensor, Electrical and Electronic Engineering, Affective computing - Abstract
In this paper, we propose a novel multimodal fusion framework, called the locally confined modality fusion network (LMFN), which contains a bidirectional multiconnected LSTM (BM-LSTM), to address the multimodal human affective computing problem. In the LMFN, we introduce a generic fusion structure that explores both local and global fusion to obtain an integral comprehension of the information. Specifically, we partition the feature vector corresponding to each modality into multiple segments and learn every local interaction through a tensor fusion procedure. Global interaction is then modeled by learning the dependence between local tensors via a newly designed BM-LSTM architecture, which establishes direct connections between the cells and states of local tensors that are several time steps apart. With the LMFN, we achieve advantages over other methods in the following aspects: 1) local interactions are successfully modeled using a feasible vector segmentation procedure that can explore cross-modal dynamics in a more specialized manner; 2) global interactions are modeled to obtain an integral view of multimodal information using BM-LSTM, which guarantees an adequate flow of information; and 3) our general fusion structure is highly extendable by applying other local and global fusion methods. Experiments show that the LMFN yields state-of-the-art results. Moreover, the LMFN achieves higher efficiency than other models by applying the outer product as the fusion method.
- Published
- 2020
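A hypothetical sketch of the "locally confined" fusion step described above: each modality's feature vector is split into segments, and matching segments are fused with an outer product, giving one local interaction tensor per segment. The BM-LSTM that models global dependencies between these local tensors is omitted; segment counts and dimensions are illustrative assumptions.

```python
# Illustrative sketch only: segment-wise outer-product fusion, not the LMFN implementation.
import torch


def local_tensor_fusion(language: torch.Tensor, acoustic: torch.Tensor,
                        visual: torch.Tensor, num_segments: int = 4) -> list[torch.Tensor]:
    """Split each modality into segments and fuse matching segments via outer products."""
    l_seg = language.chunk(num_segments)
    a_seg = acoustic.chunk(num_segments)
    v_seg = visual.chunk(num_segments)
    local_tensors = []
    for l, a, v in zip(l_seg, a_seg, v_seg):
        # Outer product over the three local segments -> (|l|, |a|, |v|) interaction tensor.
        t = torch.einsum("i,j,k->ijk", l, a, v)
        local_tensors.append(t.flatten())          # flattened local fusion result
    # A BM-LSTM-style model would then relate these local tensors globally.
    return local_tensors


if __name__ == "__main__":
    language, acoustic, visual = torch.randn(32), torch.randn(16), torch.randn(8)
    locals_ = local_tensor_fusion(language, acoustic, visual, num_segments=4)
    print(len(locals_), locals_[0].shape)          # 4 torch.Size([64])  (8 * 4 * 2)
```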
11. Efficient and Fast Real-World Noisy Image Denoising by Combining Pyramid Neural Network and Two-Pathway Unscented Kalman Filter
- Author
- Haifeng Hu, Songlong Xing, Zhengming Li, and Ruijun Ma
- Subjects
Artificial neural network, Computer science, Noise reduction, Pattern recognition, Filter (signal processing), Kalman filter, Real image, Computer Graphics and Computer-Aided Design, Pyramid (image processing), Noise (video), Artificial intelligence, Neural coding, Software - Abstract
Recently, image prior learning has emerged as an effective tool for image denoising, which exploits prior knowledge to obtain sparse coding models and uses them to reconstruct the clean image from the noisy one. Albeit promising, these prior-learning-based methods suffer from limitations such as a lack of adaptivity and an inability to improve performance and efficiency simultaneously. To address these problems, in this paper we propose a Pyramid Guided Filter Network (PGF-Net) that integrates a pyramid-based neural network with a Two-Pathway Unscented Kalman Filter (TP-UKF). The combination of the pyramid network and TP-UKF is based on the consideration that the former enables our model to better exploit hierarchical and multi-scale features, while the latter can guide the network to produce an improved (a posteriori) estimate of the denoising result with fine-scale image details. By synthesizing the respective advantages of the pyramid network and TP-UKF, our proposed architecture, in stark contrast to prior-learning methods, is able to decompose the image denoising task into a series of more manageable stages and adaptively eliminate the noise in real images in an efficient manner. We conduct extensive experiments and show that PGF-Net achieves notable improvements in visual perceptual quality and higher computational efficiency compared to state-of-the-art methods.
- Published
- 2020
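The sketch below illustrates only the pyramid (multi-scale) portion of the description above: the noisy image is processed at several scales, each scale predicts a noise estimate, and the estimates are merged back at full resolution. The Two-Pathway Unscented Kalman Filter stage is not reproduced, and the whole network is an assumption made for illustration, not PGF-Net.

```python
# Illustrative sketch only: a toy multi-scale residual denoiser, not the PGF-Net architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyPyramidDenoiser(nn.Module):
    def __init__(self, channels: int = 3, scales: int = 3):
        super().__init__()
        self.scales = scales
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(32, channels, kernel_size=3, padding=1),
            )
            for _ in range(scales)
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        """noisy: (B, C, H, W); returns a denoised estimate of the same shape."""
        h, w = noisy.shape[-2:]
        residual = 0.0
        for s, branch in enumerate(self.branches):
            scaled = F.interpolate(noisy, scale_factor=1 / 2 ** s, mode="bilinear",
                                   align_corners=False) if s else noisy
            est = branch(scaled)                   # per-scale noise estimate
            residual = residual + F.interpolate(est, size=(h, w), mode="bilinear",
                                                align_corners=False)
        return noisy - residual / self.scales      # subtract the averaged noise estimate


if __name__ == "__main__":
    noisy = torch.randn(1, 3, 64, 64)
    print(TinyPyramidDenoiser()(noisy).shape)      # torch.Size([1, 3, 64, 64])
```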
12. Constrained LSTM and Residual Attention for Image Captioning.
- Author
- Liang Yang, Haifeng Hu, Songlong Xing, and Xinlong Lu
- Subjects
Image, Machine learning - Abstract
Visual structure and syntactic structure are essential in images and texts, respectively. Visual structure depicts both entities in an image and their interactions, whereas syntactic structure in texts can reflect the part-of-speech constraints between adjacent words. Most existing methods either use visual global representation to guide the language model or generate captions without considering the relationships of different entities or adjacent words. Thus, their language models lack relevance in both visual and syntactic structure. To solve this problem, we propose a model that aligns the language model to certain visual structure and also constrains it with a specific part-of-speech template. In addition, most methods exploit the latent relationship between words in a sentence and pre-extracted visual regions in an image yet ignore the effects of unextracted regions on predicted words. We develop a residual attention mechanism to simultaneously focus on the pre-extracted visual objects and unextracted regions in an image. Residual attention is capable of capturing precise regions of an image corresponding to the predicted words considering both the effects of visual objects and unextracted regions. The effectiveness of our entire framework and each proposed module are verified on two classical datasets: MSCOCO and Flickr30k. Our framework is on par with or even better than the state-of-the-art methods and achieves superior performance on COCO captioning Leaderboard.
- Published
- 2020
- Full Text
- View/download PDF
13. Modality to modality translation: An adversarial representation learning and graph fusion network for multimodal fusion
- Author
- Sijie Mai, Haifeng Hu, and Songlong Xing
- Subjects
Computer Vision and Pattern Recognition (cs.CV), Machine Learning (cs.LG), Multimedia (cs.MM), Machine learning, Adversarial system, Discriminative model, Visualization, Embedding, Graph (abstract data type), Artificial intelligence, Encoder, Feature learning - Abstract
Learning a joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap that heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of the various modalities differ in nature, to reduce the modality gap we translate the distributions of the source modalities into that of the target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on the embedding space by introducing a reconstruction loss and a classification loss. We then fuse the encoded representations using a hierarchical graph neural network that explicitly explores unimodal, bimodal, and trimodal interactions in multiple stages. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative. Accepted at AAAI-2020; code is available at: https://github.com/TmacMai/ARGF_multimodal_fusion
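As a hypothetical illustration of the adversarial translation idea above, the sketch below maps acoustic and visual inputs into the same embedding space as language and trains a discriminator to distinguish translated embeddings from language embeddings, giving the encoders an adversarial signal that reduces the modality gap. The reconstruction loss, classification loss, and hierarchical graph fusion are omitted; all names and dimensions are assumptions, not the released ARGF code.

```python
# Illustrative sketch only: a generic adversarial alignment loop, not the ARGF implementation.
import torch
import torch.nn as nn

embed_dim = 64
encoders = nn.ModuleDict({
    "acoustic": nn.Sequential(nn.Linear(74, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)),
    "visual":   nn.Sequential(nn.Linear(35, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)),
    "language": nn.Sequential(nn.Linear(300, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)),
})
discriminator = nn.Sequential(nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()

acoustic, visual, language = torch.randn(8, 74), torch.randn(8, 35), torch.randn(8, 300)
z_a = encoders["acoustic"](acoustic)
z_v = encoders["visual"](visual)
z_l = encoders["language"](language)

# Discriminator step: real = language embeddings, fake = translated acoustic/visual embeddings.
d_loss = (bce(discriminator(z_l), torch.ones(8, 1))
          + bce(discriminator(z_a.detach()), torch.zeros(8, 1))
          + bce(discriminator(z_v.detach()), torch.zeros(8, 1)))

# Encoder (generator) step: make translated embeddings indistinguishable from language ones.
g_loss = bce(discriminator(z_a), torch.ones(8, 1)) + bce(discriminator(z_v), torch.ones(8, 1))
print(float(d_loss), float(g_loss))
```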