Author: "Dong, Jianfeng" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Dong, Jianfeng"' showing total 419 results

1. Rethinking Video Deblurring with Wavelet-Aware Dynamic Transformer and Diffusion Model

Author: Rao, Chen, Li, Guangyuan, Lan, Zehua, Sun, Jiakai, Luan, Junsheng, Xing, Wei, Zhao, Lei, Lin, Huaizhong, Dong, Jianfeng, and Zhang, Dalong
Subjects: Computer Science - Computer Vision and Pattern Recognition, I.4.4
Abstract: Current video deblurring methods have limitations in recovering high-frequency information since the regression losses are conservative with high-frequency details. Since Diffusion Models (DMs) have strong capabilities in generating high-frequency details, we consider introducing DMs into the video deblurring task. However, we found that directly applying DMs to the video deblurring task has the following problems: (1) DMs require many iteration steps to generate videos from Gaussian noise, which consumes many computational resources. (2) DMs are easily misled by the blurry artifacts in the video, resulting in irrational content and distortion of the deblurred video. To address the above issues, we propose a novel video deblurring framework VD-Diff that integrates the diffusion model into the Wavelet-Aware Dynamic Transformer (WADT). Specifically, we perform the diffusion model in a highly compact latent space to generate prior features containing high-frequency information that conforms to the ground truth distribution. We design the WADT to preserve and recover the low-frequency information in the video while utilizing the high-frequency information generated by the diffusion model. Extensive experiments show that our proposed VD-Diff outperforms SOTA methods on GoPro, DVD, BSD, and Real-World Video datasets.
Comment: Accepted by ECCV 2024
Published: 2024

2. Alleviating Hallucination in Large Vision-Language Models with Active Retrieval Augmentation

Author: Qu, Xiaoye, Chen, Qiyuan, Wei, Wei, Sun, Jishuo, and Dong, Jianfeng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Despite the remarkable ability of large vision-language models (LVLMs) in image comprehension, these models frequently generate plausible yet factually incorrect responses, a phenomenon known as hallucination. Recently, in large language models (LLMs), augmenting LLMs by retrieving information from external knowledge resources has been proven as a promising solution to mitigate hallucinations. However, the retrieval augmentation in LVLM significantly lags behind the widespread applications of LVLM. Moreover, when transferred to augmenting LVLMs, sometimes the hallucination degree of the model is even exacerbated. Motivated by the research gap and counter-intuitive phenomenon, we introduce a novel framework, the Active Retrieval-Augmented large vision-language model (ARA), specifically designed to address hallucinations by incorporating three critical dimensions: (i) dissecting the retrieval targets based on the inherent hierarchical structures of images; (ii) pinpointing the most effective retrieval methods and filtering out the reliable retrieval results; and (iii) timing the retrieval process to coincide with episodes of low certainty, while circumventing unnecessary retrieval during periods of high certainty. To assess the capability of our proposed ARA model in reducing hallucination, we employ three widely used LVLM models (LLaVA-1.5, Qwen-VL, and mPLUG-Owl2) across four benchmarks. Our empirical observations suggest that by utilizing fitting retrieval mechanisms and timing the retrieval judiciously, we can effectively mitigate the hallucination problem. We hope that this study can provide deeper insights into how to adapt the retrieval augmentation to LVLMs for reducing hallucinations with more effective retrieval and minimal retrieval occurrences.
Published: 2024

3. Mitigating Multilingual Hallucination in Large Vision-Language Models

Author: Qu, Xiaoye, Song, Mingyang, Wei, Wei, Dong, Jianfeng, and Cheng, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: While Large Vision-Language Models (LVLMs) have exhibited remarkable capabilities across a wide range of tasks, they suffer from hallucination problems, where models generate plausible yet incorrect answers given the input image-query pair. This hallucination phenomenon is even more severe when querying the image in non-English languages, while existing methods for mitigating hallucinations in LVLMs only consider the English scenarios. In this paper, we make the first attempt to mitigate this important multilingual hallucination in LVLMs. With thorough experiment analysis, we found that multilingual hallucination in LVLMs is a systemic problem that could arise from deficiencies in multilingual capabilities or inadequate multimodal abilities. To this end, we propose a two-stage Multilingual Hallucination Removal (MHR) framework for LVLMs, aiming to improve resistance to hallucination for both high-resource and low-resource languages. Instead of relying on the intricate manual annotations of multilingual resources, we fully leverage the inherent capabilities of the LVLM and propose a novel cross-lingual alignment method, which generates multiple responses for each image-query input and then identifies the hallucination-aware pairs for each language. These data pairs are finally used for direct preference optimization to prompt the LVLMs to favor non-hallucinating responses. Experimental results show that our MHR achieves a substantial reduction in hallucination generation for LVLMs. Notably, on our extended multilingual POPE benchmark, our framework delivers an average increase of 19.0% in accuracy across 13 different languages. Our code and model weights are available at https://github.com/ssmisya/MHR
Published: 2024

4. Representation Alignment Contrastive Regularization for Multi-Object Tracking

Author: Liu, Zhonglin, Chen, Shujie, Dong, Jianfeng, Wang, Xun, and Zhou, Di
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Achieving high performance in multi-object tracking algorithms heavily relies on modeling spatio-temporal relationships during the data association stage. Mainstream approaches encompass rule-based and deep learning-based methods for spatio-temporal relationship modeling. While the former relies on physical motion laws, offering wider applicability but yielding suboptimal results for complex object movements, the latter, though achieving high performance, lacks interpretability and involves complex module designs. This work aims to simplify deep learning-based spatio-temporal relationship models and introduce interpretability into features for data association. Specifically, a lightweight single-layer transformer encoder is utilized to model spatio-temporal relationships. To make features more interpretable, two contrastive regularization losses based on representation alignment are proposed, derived from spatio-temporal consistency rules. By applying weighted summation to affinity matrices, the aligned features can seamlessly integrate into the data association stage of the original tracking workflow. Experimental results show that our model enhances the performance of the majority of existing tracking networks without excessive complexity, with minimal increase in training overhead and nearly negligible computational and storage costs.
Published: 2024

5. An all-dielectric chiral metasurface with circular dichroism and asymmetric transmission characteristics

Author: Wu, Qihang, Shi, Zhuolin, He, Jinglin, and Dong, Jianfeng
Published: 2024

6. Let All be Whitened: Multi-teacher Distillation for Efficient Visual Retrieval

Author: Ma, Zhe, Dong, Jianfeng, Ji, Shouling, Liu, Zhenguang, Zhang, Xuhong, Wang, Zonghui, He, Sifeng, Qian, Feng, Zhang, Xiaobo, and Yang, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Visual retrieval aims to search for the most relevant visual items, e.g., images and videos, from a candidate gallery with a given query item. Accuracy and efficiency are two competing objectives in retrieval tasks. Instead of crafting a new method pursuing further improvement on accuracy, in this paper we propose a multi-teacher distillation framework Whiten-MTD, which is able to transfer knowledge from off-the-shelf pre-trained retrieval models to a lightweight student model for efficient visual retrieval. Furthermore, we discover that the similarities obtained by different retrieval models are diversified and incommensurable, which makes it challenging to jointly distill knowledge from multiple models. Therefore, we propose to whiten the output of teacher models before fusion, which enables effective multi-teacher distillation for retrieval models. Whiten-MTD is conceptually simple and practically effective. Extensive experiments on two landmark image retrieval datasets and one video retrieval dataset demonstrate the effectiveness of our proposed method, and its good balance of retrieval performance and efficiency. Our source code is released at https://github.com/Maryeon/whiten_mtd.
Comment: Accepted by AAAI 2024
Published: 2023

7. CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

Author: Wang, Yabing, Wang, Fan, Dong, Jianfeng, and Luo, Hao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Cross-lingual cross-modal retrieval has garnered increasing attention recently, which aims to achieve the alignment between vision and target language (V-T) without using any annotated V-T data pairs. Current methods employ machine translation (MT) to construct pseudo-parallel data pairs, which are then used to learn a multi-lingual and multi-modal embedding space that aligns visual and target-language representations. However, the large heterogeneous gap between vision and text, along with the noise present in target language translations, poses significant challenges in effectively aligning their representations. To address these challenges, we propose a general framework, Cross-Lingual to Cross-Modal (CL2CM), which improves the alignment between vision and target language using cross-lingual transfer. This approach allows us to fully leverage the merits of multi-lingual pre-trained models (e.g., mBERT) and the benefits of the same modality structure, i.e., smaller gap, to provide reliable and comprehensive semantic correspondence (knowledge) for the cross-modal network. We evaluate our proposed approach on two multilingual image-text datasets, Multi30K and MSCOCO, and one video-text dataset, VATEX. The results clearly demonstrate the effectiveness of our proposed method and its high potential for large-scale retrieval.
Comment: Accepted by AAAI 2024
Published: 2023

8. Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding

Author: Sun, Shengkai, Liu, Daizong, Dong, Jianfeng, Qu, Xiaoye, Gao, Junyu, Yang, Xun, Wang, Xun, and Wang, Meng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Unsupervised pre-training has shown great success in skeleton-based action understanding recently. Existing works typically train separate modality-specific models, then integrate the multi-modal information for action understanding by a late-fusion strategy. Although these approaches have achieved significant performance, they suffer from complex yet redundant multi-stream model designs, each of which is also limited to a fixed input skeleton modality. To alleviate these issues, in this paper, we propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL, which exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner. Specifically, instead of designing separate modality-specific optimization processes for uni-modal unsupervised learning, we feed different modality inputs into the same stream with an early-fusion strategy to learn their multi-modal features for reducing model complexity. To ensure that the fused multi-modal features do not exhibit modality bias, i.e., being dominated by a certain modality input, we further propose both intra- and inter-modal consistency learning to guarantee that the multi-modal features contain the complete semantics of each modality via feature decomposition and distinct alignment. In this manner, our framework is able to learn unified representations of uni-modal or multi-modal skeleton input, which is flexible to different kinds of modality input for robust action understanding in practical cases. Extensive experiments conducted on three large-scale datasets, i.e., NTU-60, NTU-120, and PKU-MMD II, demonstrate that UmURL is highly efficient, with complexity comparable to that of uni-modal methods, while achieving new state-of-the-art performance across various downstream task scenarios in skeleton-based action representation learning.
Comment: Accepted by ACM MM 2023. The code is available at https://github.com/HuiGuanLab/UmURL
Published: 2023

9. Video Infringement Detection via Feature Disentanglement and Mutual Information Maximization

Author: Liu, Zhenguang, Yu, Xinyang, Wang, Ruili, Ye, Shuai, Ma, Zhe, Dong, Jianfeng, He, Sifeng, Qian, Feng, Zhang, Xiaobo, Zimmermann, Roger, and Yang, Lei
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: The self-media era provides us with a tremendous number of high-quality videos. Unfortunately, frequent video copyright infringements are now seriously damaging the interests and enthusiasm of video creators. Identifying infringing videos is therefore a compelling task. Current state-of-the-art methods tend to simply feed high-dimensional mixed video features into deep neural networks and count on the networks to extract useful representations. Despite its simplicity, this paradigm heavily relies on the original entangled features and lacks constraints guaranteeing that useful task-relevant semantics are extracted from the features. In this paper, we seek to tackle the above challenges from two aspects: (1) We propose to disentangle an original high-dimensional feature into multiple sub-features, explicitly disentangling the feature into exclusive lower-dimensional components. We expect the sub-features to encode non-overlapping semantics of the original feature and remove redundant information. (2) On top of the disentangled sub-features, we further learn an auxiliary feature to enhance the sub-features. We theoretically analyze the mutual information between the label and the disentangled features, arriving at a loss that maximizes the extraction of task-relevant information from the original feature. Extensive experiments on two large-scale benchmark datasets (i.e., SVD and VCSL) demonstrate that our method achieves 90.1% TOP-100 mAP on the large-scale SVD dataset and also sets the new state-of-the-art on the VCSL benchmark dataset. Our code and model have been released at https://github.com/yyyooooo/DMI/, hoping to contribute to the community.
Comment: This paper is accepted by ACM MM 2023
Published: 2023

10. Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval

Author: Wang, Yabing, Wang, Shuhui, Luo, Hao, Dong, Jianfeng, Wang, Fan, Han, Meng, Wang, Xun, and Wang, Meng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Current research on cross-modal retrieval is mostly English-oriented, owing to the availability of a large number of English-oriented human-labeled vision-language corpora. In order to overcome the scarcity of non-English labeled data, cross-lingual cross-modal retrieval (CCR) has attracted increasing attention. Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation (MT) to achieve cross-lingual transfer. However, the translated sentences from MT are generally imperfect in describing the corresponding visual contents. Improperly assuming the pseudo-parallel data are correctly correlated will make the networks overfit to the noisy correspondence. Therefore, we propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR. In particular, we quantify the confidence of the sample pair correlation with optimal transport theory from both the cross-lingual and cross-modal views, and design dual-view curriculum learning to dynamically model the transportation costs according to the learning stage of the two views. Extensive experiments are conducted on two multilingual image-text datasets and one video-text dataset, and the results demonstrate the effectiveness and robustness of the proposed method. Besides, our proposed method also shows good extensibility to cross-lingual image-text baselines and decent generalization on out-of-domain data.
Published: 2023

11. Robust video question answering via contrastive cross-modality representation learning

Author: Yang, Xun, Zeng, Jianming, Guo, Dan, Wang, Shanshan, Dong, Jianfeng, and Wang, Meng
Published: 2024

12. From Region to Patch: Attribute-Aware Foreground-Background Contrastive Learning for Fine-Grained Fashion Retrieval

Author: Dong, Jianfeng, Peng, Xiaoman, Ma, Zhe, Liu, Daizong, Qu, Xiaoye, Yang, Xun, Zhu, Jixiang, and Liu, Baolong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Attribute-specific fashion retrieval (ASFR) is a challenging information retrieval task, which has attracted increasing attention in recent years. Different from traditional fashion retrieval which mainly focuses on optimizing holistic similarity, the ASFR task concentrates on attribute-specific similarity, resulting in more fine-grained and interpretable retrieval results. As the attribute-specific similarity typically corresponds to the specific subtle regions of images, we propose a Region-to-Patch Framework (RPF) that consists of a region-aware branch and a patch-aware branch to extract fine-grained attribute-related visual features for precise retrieval in a coarse-to-fine manner. In particular, the region-aware branch is first utilized to locate the potential regions related to the semantics of the given attribute. Then, considering that the located region is coarse and still contains the background visual contents, the patch-aware branch is proposed to capture patch-wise attribute-related details from the previous amplified region. Such a hybrid architecture strikes a proper balance between region localization and feature extraction. Besides, different from previous works that solely focus on discriminating the attribute-relevant foreground visual features, we argue that the attribute-irrelevant background features are also crucial for distinguishing the detailed visual contexts in a contrastive manner. Therefore, a novel E-InfoNCE loss based on the foreground and background representations is further proposed to improve the discrimination of attribute-specific representation. Extensive experiments on three datasets demonstrate the effectiveness of our proposed framework, and also show a decent generalization of our RPF on out-of-domain fashion images. Our source code is available at https://github.com/HuiGuanLab/RPF.
Comment: This paper has been published as a full paper at SIGIR 2023
Published: 2023

13. Transform-Equivariant Consistency Learning for Temporal Sentence Grounding

Author: Liu, Daizong, Qu, Xiaoye, Dong, Jianfeng, Zhou, Pan, Xu, Zichuan, Wang, Haozhao, Di, Xing, Lu, Weining, and Cheng, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper addresses temporal sentence grounding (TSG). Although existing methods have made decent achievements in this task, they not only severely rely on abundant video-query paired data for training, but also easily fall into the dataset distribution bias. To alleviate these limitations, we introduce a novel Equivariant Consistency Regulation Learning (ECRL) framework to learn more discriminative query-related frame-wise representations for each video, in a self-supervised manner. Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted under various video-level transformations. Concretely, we first design a series of spatio-temporal augmentations on both foreground and background video segments to generate a set of synthetic video samples. In particular, we devise a self-refine module to enhance the completeness and smoothness of the augmented video. Then, we present a novel self-supervised consistency loss (SSCL) applied on the original and augmented videos to capture their invariant query-related semantics by minimizing the KL-divergence between the sequence similarity of two videos and a prior Gaussian distribution of timestamp distance. At last, a shared grounding head is introduced to predict the transform-equivariant query-guided segment boundaries for both the original and augmented videos. Extensive experiments on three challenging datasets (ActivityNet, TACoS, and Charades-STA) demonstrate both the effectiveness and efficiency of our proposed ECRL framework.
Published: 2023

14. Hierarchical Contrast for Unsupervised Skeleton-based Action Representation Learning

Author: Dong, Jianfeng, Sun, Shengkai, Liu, Zhonglin, Chen, Shujie, Liu, Baolong, and Wang, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper targets unsupervised skeleton-based action representation learning and proposes a new Hierarchical Contrast (HiCo) framework. Different from the existing contrastive-based solutions that typically represent an input skeleton sequence into instance-level features and perform contrast holistically, our proposed HiCo represents the input into multiple-level features and performs contrast in a hierarchical manner. Specifically, given a human skeleton sequence, we represent it into multiple feature vectors of different granularities from both temporal and spatial domains via sequence-to-sequence (S2S) encoders and unified downsampling modules. Besides, the hierarchical contrast is conducted in terms of four levels: instance level, domain level, clip level, and part level. Moreover, HiCo is orthogonal to the S2S encoder, which allows us to flexibly embrace state-of-the-art S2S encoders. Extensive experiments on four datasets, i.e., NTU-60, NTU-120, PKU-MMD I and II, show that HiCo achieves a new state-of-the-art for unsupervised skeleton-based action representation learning in two downstream tasks including action recognition and retrieval, and its learned action representation is of good transferability. Besides, we also show that our framework is effective for semi-supervised skeleton-based action recognition. Our code is available at https://github.com/HuiGuanLab/HiCo.
Comment: Accepted by AAAI 2023. The code is available at http://github.com/HuiGuanLab/HiCo
Published: 2022

15. Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Author: Wang, Yabing, Dong, Jianfeng, Liang, Tianxiang, Zhang, Minsong, Cai, Rui, and Wang, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, rendering textual embeddings corrupted and thereby compromising the retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets to provide direct supervision from the similarity-based view and feature-based view. Besides, inspired by the back-translation in unsupervised MT, we minimize the semantic discrepancies between origin sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results demonstrate that our method significantly improves the overall performance without using extra human-labeled data. In addition, equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
Comment: Accepted by ACM MM 2022. Code and data are available at https://github.com/HuiGuanLab/nrccr
Published: 2022

16. Partially Relevant Video Retrieval

Author: Dong, Jianfeng, Chen, Xianke, Zhang, Minsong, Yang, Xun, Chen, Shujie, Li, Xirong, and Wang, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: Current methods for text-to-video retrieval (T2VR) are trained and tested on video-captioning oriented datasets such as MSVD, MSR-VTT and VATEX. A key property of these datasets is that videos are assumed to be temporally pre-trimmed with short duration, whilst the provided captions well describe the gist of the video content. Consequently, for a given paired video and caption, the video is supposed to be fully relevant to the caption. In reality, however, as queries are not known a priori, pre-trimmed video clips may not contain sufficient content to fully meet the query. This suggests a gap between the literature and the real world. To fill the gap, we propose in this paper a novel T2VR subtask termed Partially Relevant Video Retrieval (PRVR). An untrimmed video is considered to be partially relevant w.r.t. a given textual query if it contains a moment relevant to the query. PRVR aims to retrieve such partially relevant videos from a large collection of untrimmed videos. PRVR differs from single video moment retrieval and video corpus moment retrieval, as the latter two are to retrieve moments rather than untrimmed videos. We formulate PRVR as a multiple instance learning (MIL) problem, where a video is simultaneously viewed as a bag of video clips and a bag of video frames. Clips and frames represent video content at different time scales. We propose a Multi-Scale Similarity Learning (MS-SL) network that jointly learns clip-scale and frame-scale similarities for PRVR. Extensive experiments on three datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the viability of the proposed method. We also show that our method can be used for improving video corpus moment retrieval.
Comment: Accepted by ACM MM 2022. The paper's homepage is http://danieljf24.github.io/prvr
Published: 2022

17. Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval

Author: Dong, Jianfeng, Wang, Yabing, Chen, Xianke, Qu, Xiaoye, Li, Xirong, He, Yuan, and Wang, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia
Abstract: This paper aims for the task of text-to-video retrieval, where given a query in the form of a natural-language sentence, it is asked to retrieve videos which are semantically relevant to the given query, from a great number of unlabeled videos. The success of this task depends on cross-modal representation learning that projects both videos and sentences into common spaces for semantic similarity computation. In this work, we concentrate on video representation learning, an essential component for text-to-video retrieval. Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos, which consists of two branches: a previewing branch and an intensive-reading branch. The previewing branch is designed to briefly capture the overview information of videos, while the intensive-reading branch is designed to obtain more in-depth information. Moreover, the intensive-reading branch is aware of the video overview captured by the previewing branch. Such holistic information is found to be useful for the intensive-reading branch to extract more fine-grained features. Extensive experiments on three datasets are conducted, where our model RIVRL achieves a new state-of-the-art on TGIF and VATEX. Moreover, on MSR-VTT, our model using two video features shows comparable performance to the state-of-the-art using seven video features and even outperforms models pre-trained on the large-scale HowTo100M dataset.
Comment: Accepted by IEEE Transactions on Circuits and Systems for Video Technology. Code is available at https://github.com/LiJiaBei-7/rivrl
Published: 2022

18. Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

Author: Hu, Fan, Chen, Aozhu, Wang, Ziyue, Zhou, Fangming, Dong, Jianfeng, and Li, Xirong
Subjects: Computer Science - Multimedia, Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, let it be video or text, we aim for feature fusion for both ends within a unified framework. We hypothesize that optimizing the convex combination of the features is preferred to modeling their correlations by computationally heavy multi-head self attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016-2020) justify LAFF as a new baseline for text-to-video retrieval.
Comment: Accepted by ECCV 2022
Published: 2021

19. Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos

Author: Liu, Daizong, Qu, Xiaoye, Dong, Jianfeng, and Zhou, Pan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: We address the problem of temporal sentence localization in videos (TSLV). Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals. Although they have achieved decent performance, the proposals are handcrafted and redundant. Recently, the bottom-up framework has attracted increasing attention due to its superior efficiency. It directly predicts the probabilities for each frame as a boundary. However, the performance of the bottom-up model is inferior to its top-down counterpart as it fails to exploit the segment-level interaction. In this paper, we propose an Adaptive Proposal Generation Network (APGN) to maintain the segment-level interaction while improving efficiency. Specifically, we first perform a foreground-background classification upon the video and regress on the foreground frames to adaptively generate proposals. In this way, the handcrafted proposal design is discarded and the redundant proposals are decreased. Then, a proposal consolidation module is further developed to enhance the semantics of the generated proposals. Finally, we locate the target moments with these generated proposals following the top-down framework. Extensive experiments on three challenging benchmarks show that our proposed APGN significantly outperforms previous state-of-the-art methods.
Comment: Accepted as a long paper in the main conference of EMNLP 2021
Published: 2021

20. Fine-Grained Fashion Similarity Prediction by Attribute-Specific Embedding Learning

Author: Dong, Jianfeng, Ma, Zhe, Mao, Xiaofeng, Yang, Xun, He, Yuan, Hong, Richang, and Ji, Shouling
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval
Abstract: This paper strives to predict fine-grained fashion similarity. In this similarity paradigm, one should pay more attention to the similarity in terms of a specific design/attribute between fashion items, for example, whether the collar designs of two garments are similar. It has potential value in many fashion-related applications, such as fashion copyright protection. To this end, we propose an Attribute-Specific Embedding Network (ASEN) to jointly learn multiple attribute-specific embeddings and thus measure the fine-grained similarity in the corresponding space. The proposed ASEN comprises a global branch and a local branch. The global branch takes the whole image as input to extract features from a global perspective, while the local branch takes as input the zoomed-in region-of-interest (RoI) w.r.t. the specified attribute and is thus able to extract more fine-grained features. As the global branch and the local branch extract the features from different perspectives, they are complementary to each other. Additionally, in each branch, two attention modules, i.e., Attribute-aware Spatial Attention and Attribute-aware Channel Attention, are integrated so that ASEN can locate the related regions and capture the essential patterns under the guidance of the specified attribute, thus making the learned attribute-specific embeddings better reflect the fine-grained similarity. Extensive experiments on three fashion-related datasets, i.e., FashionAI, DARN, and DeepFashion, show the effectiveness of ASEN for fine-grained fashion similarity prediction and its potential for fashion reranking. Code and data are available at https://github.com/maryeon/asenpp., Comment: Conference paper: arXiv:2002.02814
Published: 2021
Full Text: View/download PDF

21. Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

Author: Liu, Daizong, Qu, Xiaoye, Dong, Jianfeng, Zhou, Pan, Cheng, Yu, Wei, Wei, Xu, Zichuan, and Xie, Yulai
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. Previous works either compare pre-defined candidate segments with the query and select the best one by ranking, or directly regress the boundary timestamps of the target segment. In this paper, we propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism. In particular, we present a Context-aware Biaffine Localizing Network (CBLN) which incorporates both local and global contexts into features of each start/end position for biaffine-based localization. The local contexts from the adjacent frames help distinguish the visually similar appearance, and the global contexts from the entire video contribute to reasoning the temporal relation. Besides, we also develop a multi-modal self-attention module to provide fine-grained query-guided video representation for this biaffine strategy. Extensive experiments show that our CBLN significantly outperforms state-of-the-arts on three public datasets (ActivityNet Captions, TACoS, and Charades-STA), demonstrating the effectiveness of the proposed localization framework., Comment: Accepted by CVPR 2021
Published: 2021

22. Hierarchical Similarity Learning for Language-based Product Image Retrieval

Author: Ma, Zhe, Liu, Fenghao, Dong, Jianfeng, Qu, Xiaoye, He, Yuan, and Ji, Shouling
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Multimedia
Abstract: This paper aims for the language-based product image retrieval task. The majority of previous works have made significant progress by designing network structure, similarity measurement, and loss function. However, they typically perform vision-text matching at certain granularity regardless of the intrinsic multiple granularities of images. In this paper, we focus on the cross-modal similarity measurement, and propose a novel Hierarchical Similarity Learning (HSL) network. HSL first learns multi-level representations of input data by stacked encoders, and object-granularity similarity and image-granularity similarity are computed at each level. All the similarities are combined as the final hierarchical cross-modal similarity. Experiments on a large-scale product retrieval dataset demonstrate the effectiveness of our proposed method. Code and data are available at https://github.com/liufh1/hsl., Comment: Accepted by ICASSP 2021. Code and data will be available at https://github.com/liufh1/hsl
Published: 2021

23. Progressive Localization Networks for Language-based Moment Localization

Author: Zheng, Qi, Dong, Jianfeng, Qu, Xiaoye, Yang, Xun, Wang, Yabing, Zhou, Pan, Liu, Baolong, and Wang, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper targets the task of language-based video moment localization. The language-based setting of this task allows for an open set of target activities, resulting in a large variation of the temporal lengths of video moments. Most existing methods prefer to first sample sufficient candidate moments with various temporal lengths, and then match them with the given query to determine the target moment. However, candidate moments generated with a fixed temporal granularity may be suboptimal for handling the large variation in moment lengths. To this end, we propose a novel multi-stage Progressive Localization Network (PLN) which progressively localizes the target moment in a coarse-to-fine manner. Specifically, each stage of PLN has a localization branch, and focuses on candidate moments that are generated with a specific temporal granularity. The temporal granularities of candidate moments are different across the stages. Moreover, we devise a conditional feature manipulation module and an upsampling connection to bridge the multiple localization branches. In this fashion, the later stages are able to absorb the previously learned information, thus facilitating more fine-grained localization. Extensive experiments on three public datasets demonstrate the effectiveness of our proposed PLN for language-based moment localization, especially for localizing short moments in long videos., Comment: submitted to TOMM
Published: 2021

24. Association Analysis Between Albumin Level and Maxillofacial Space Infection Severity

Author: Huang, Ying, Lu, Lei, Fei, Hetong, Ma, Jing, Dong, Jianfeng, and Xie, Fuqiang
Published: 2024
Full Text: View/download PDF

25. Dual Encoding for Video Retrieval by Text

Author: Dong, Jianfeng, Li, Xirong, Xu, Chaoxi, Yang, Xun, Yang, Gang, Wang, Xun, and Wang, Meng
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Multimedia
Abstract: This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method., Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence. Code and data will be available at https://github.com/danieljf24/hybrid_space. Conference version: arXiv:1809.06181
Published: 2020
Full Text: View/download PDF

26. Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Author: Qu, Xiaoye, Tang, Pengwei, Zhou, Zhikang, Cheng, Yu, Dong, Jianfeng, and Zhou, Pan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field focus only on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, while clues from query-by-video interactions implying where to look are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention, then the video iteratively attends to the integrated query. Finally, both video and query information is utilized to provide a robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of applying recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches., Comment: ACM MM 2020
Published: 2020

27. Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization

Author: Liu, Daizong, Qu, Xiaoye, Liu, Xiao-Yang, Dong, Jianfeng, Zhou, Pan, and Xu, Zichuan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval
Abstract: Query-based moment localization is a new task that localizes the best matched segment in an untrimmed video according to a given sentence query. In this localization task, one should pay more attention to thoroughly mining visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation Graph (SMG), where frames and words are represented as nodes, and the relations between cross- and self-modal node pairs are described by an attention mechanism. Through parametric message passing, CMG highlights relevant instances across video and sentence, and then SMG models the pairwise relation inside each modality for frame (word) correlating. With multiple layers of such a joint graph, our CSMGAN is able to effectively capture high-order interactions between two modalities, thus enabling more precise localization. Besides, to better comprehend the contextual details in the query, we develop a hierarchical sentence encoder to enhance the query understanding. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed model, and CSMGAN significantly outperforms the state-of-the-arts., Comment: Accepted by ACM MM 2020
Published: 2020

28. Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

Author: Yang, Xun, Dong, Jianfeng, Cao, Yixin, Wang, Xun, Wang, Meng, and Chua, Tat-Seng
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries, which are usually ineffective for complex queries that carry far more complex semantics. Recently, embedding-based paradigm has emerged as a popular approach. It aims to map the queries and videos into a shared embedding space where semantically-similar texts and videos are much closer to each other. Despite its simplicity, it forgoes the exploitation of the syntactic structure of text queries, making it suboptimal to model the complex queries. To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method by jointly learning the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree to structurally describe the text query. We then design a tree-augmented query encoder to derive structure-aware query representation and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and videos are mapped into a joint embedding space for matching and ranking. In this approach, we have a better understanding and modeling of the complex queries, thereby achieving a better video retrieval performance. Extensive experiments on large scale video retrieval benchmark datasets demonstrate the effectiveness of our approach., Comment: Accepted For 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020)
Published: 2020

29. Feature Re-Learning with Data Augmentation for Video Relevance Prediction

Author: Dong, Jianfeng, Wang, Xun, Zhang, Leimin, Xu, Chaoxi, Yang, Gang, and Li, Xirong
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval
Abstract: Predicting the relevance between two given videos with respect to their visual content is a key component for content-based video recommendation and retrieval. Thanks to the increasing availability of pre-trained image and video convolutional neural network models, deep visual features are widely used for video content representation. However, as how two videos are relevant is task-dependent, such off-the-shelf features are not always optimal for all tasks. Moreover, due to varied concerns including copyright, privacy and security, one might have access to only pre-computed video features rather than original videos. We propose in this paper feature re-learning for improving video relevance prediction, with no need of revisiting the original video content. In particular, re-learning is realized by projecting a given deep feature into a new space by an affine transformation. We optimize the re-learning process by a novel negative-enhanced triplet ranking loss. In order to generate more training data, we propose a new data augmentation strategy which works directly on frame-level and video-level features. Extensive experiments in the context of the Hulu Content-based Video Relevance Prediction Challenge 2018 justify the effectiveness of the proposed method and its state-of-the-art performance for content-based video relevance prediction., Comment: accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)
Published: 2020
Full Text: View/download PDF

30. Fine-Grained Fashion Similarity Learning by Attribute-Specific Embedding Network

Author: Ma, Zhe, Dong, Jianfeng, Zhang, Yao, Long, Zhongzi, He, Yuan, Xue, Hui, and Ji, Shouling
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: This paper strives to learn fine-grained fashion similarity. In this similarity paradigm, one should pay more attention to the similarity in terms of a specific design/attribute among fashion items, which has potential value in many fashion-related applications such as fashion copyright protection. To this end, we propose an Attribute-Specific Embedding Network (ASEN) to jointly learn multiple attribute-specific embeddings in an end-to-end manner and thus measure the fine-grained similarity in the corresponding space. With two attention modules, i.e., Attribute-aware Spatial Attention and Attribute-aware Channel Attention, ASEN is able to locate the related regions and capture the essential patterns under the guidance of the specified attribute, thus making the learned attribute-specific embeddings better reflect the fine-grained similarity. Extensive experiments on four fashion-related datasets show the effectiveness of ASEN for fine-grained fashion similarity learning and its potential for fashion reranking., Comment: 16 pages, 13 figures. Accepted by AAAI 2020. Code and data are available at https://github.com/Maryeon/asen
Published: 2020

31. Tunable circular dichroism in all-dielectric metasurface with efficient wavefront manipulation

Author: Shi, Zhuolin, He, Jinglin, Wu, Qihang, Shao, Hanru, and Dong, Jianfeng
Published: 2023
Full Text: View/download PDF

32. Reduction of zinc and iron in dust removal ash by H2 generated from lignite pyrolysis

Author: Wang, Zihan, Dong, Jianfeng, Guo, Jing, Guo, Hanjie, and Li, Bin
Published: 2023
Full Text: View/download PDF

33. Multi-level feature disentanglement network for cross-dataset face forgery detection

Author: Fu, Zhixiao, Chen, Xinyuan, Liu, Daizong, Qu, Xiaoye, Dong, Jianfeng, Zhang, Xuhong, and Ji, Shouling
Published: 2023
Full Text: View/download PDF

34. Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

Author: Hu, Fan, Chen, Aozhu, Wang, Ziyue, Zhou, Fangming, Dong, Jianfeng, Li, Xirong, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Avidan, Shai, editor, Brostow, Gabriel, editor, Cissé, Moustapha, editor, Farinella, Giovanni Maria, editor, and Hassner, Tal, editor
Published: 2022
Full Text: View/download PDF

35. One4All: Manipulate one agent to poison the cooperative multi-agent reinforcement learning

Author: Zheng, Haibin, Li, Xiaohao, Chen, Jinyin, Dong, Jianfeng, Zhang, Yan, and Lin, Changting
Published: 2023
Full Text: View/download PDF

36. Dual Encoding for Zero-Example Video Retrieval

Author: Dong, Jianfeng, Li, Xirong, Xu, Chaoxi, Ji, Shouling, He, Yuan, Yang, Gang, and Wang, Xun
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper attacks the challenging problem of zero-example video retrieval. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described in natural language text with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is required. The majority of existing methods are concept based, extracting relevant concepts from queries and videos and accordingly establishing associations between the two modalities. In contrast, this paper takes a concept-free approach, proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Dual encoding is conceptually simple, practically effective and end-to-end. As experiments on three benchmarks, i.e. MSR-VTT, TRECVID 2016 and 2017 Ad-hoc Video Search show, the proposed solution establishes a new state-of-the-art for zero-example video retrieval., Comment: Accepted by CVPR 2019. Code and data are available at https://github.com/danieljf24/dual_encoding
Published: 2018

37. Reconfigurable all-dielectric metasurface based on GSST

Author: He, Jinglin, Shi, Zhuolin, Ye, Sheng, Li, Minhua, and Dong, Jianfeng
Published: 2022
Full Text: View/download PDF

38. Graphene integrated rasorber at terahertz frequencies with functionalities of both absorption and transmission

Author: Li, Minhua, Huang, Congying, Huang, Chengye, Song, Youting, Shao, Hanru, and Dong, Jianfeng
Published: 2022
Full Text: View/download PDF

39. Typical case of carbon capture and utilization in Chinese iron and steel enterprises: CO2 emission analysis

Author: Feng, Chao, Zhu, Rong, Wei, Guangsheng, Dong, Kai, and Dong, Jianfeng
Published: 2022
Full Text: View/download PDF

40. FeatInter: Exploring fine-grained object features for video-text retrieval

Author: Liu, Baolong, Zheng, Qi, Wang, Yabing, Zhang, Minsong, Dong, Jianfeng, and Wang, Xun
Published: 2022
Full Text: View/download PDF

41. The B7H4-PDL1 classifier stratifies immuno-phenotype in cervical cancer

Author: Chen, Lingyan, Dong, Jianfeng, Li, Zeying, Chen, Yu, and Zhang, Yan
Published: 2022
Full Text: View/download PDF

42. NRG1 knockdown rescues PV interneuron GABAergic maturation deficits and schizophrenia behaviors in fetal growth restriction mice

Author: Dong, Jianfeng, Chen, Wen, Liu, Nana, Chang, Shujuan, Zhu, Wei, and Kang, Jiuhong
Published: 2022
Full Text: View/download PDF

43. Ultra-wideband tunable reflective linear-to-circular polarization converter realized by GST-based metasurface at terahertz frequency

Author: Cai, GaoFeng, Chen, Jia, Zhou, YuJie, Li, MinHua, and Dong, JianFeng
Published: 2022
Full Text: View/download PDF

44. Exploring Human-like Attention Supervision in Visual Question Answering

Author: Qiao, Tingting, Dong, Jianfeng, and Xu, Duanqing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Attention mechanisms have been widely applied in the Visual Question Answering (VQA) task, as they help to focus on the area-of-interest of both visual and textual information. To answer the questions correctly, the model needs to selectively target different areas of an image, which suggests that an attention-based model may benefit from an explicit attention supervision. In this work, we aim to address the problem of adding attention supervision to VQA models. Since there is a lack of human attention data, we first propose a Human Attention Network (HAN) to generate human-like attention maps, training on a recently released dataset called Human ATtention Dataset (VQA-HAT). Then, we apply the pre-trained HAN on the VQA v2.0 dataset to automatically produce the human-like attention maps for all image-question pairs. The generated human-like attention map dataset for the VQA v2.0 dataset is named as Human-Like ATtention (HLAT) dataset. Finally, we apply human-like attention supervision to an attention-based VQA model. The experiments show that adding human-like supervision yields a more accurate attention together with a better performance, showing a promising future for human-like attention supervision in VQA.
Published: 2017

45. Predicting Visual Features from Text for Image and Video Caption Retrieval

Author: Dong, Jianfeng, Li, Xirong, and Snoek, Cees G. M.
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do so in a visual space exclusively. Apart from this conceptual novelty, we contribute Word2VisualVec, a deep neural network architecture that learns to predict a visual feature representation from textual input. Example captions are encoded into a textual embedding based on multi-scale sentence vectorization and further transferred into a deep visual feature of choice via a simple multi-layer perceptron. We further generalize Word2VisualVec for video caption retrieval, by predicting from text both 3-D convolutional neural network features as well as a visual-audio representation. Experiments on Flickr8k, Flickr30k, the Microsoft Video Description dataset and the very recent NIST TrecVid challenge for video caption retrieval detail Word2VisualVec's properties, its benefit over textual embeddings, the potential for multimodal query composition and its state-of-the-art results., Comment: Accepted by Transaction on Multimedia. Code is available at https://github.com/danieljf24/w2vv
Published: 2017
Full Text: View/download PDF

46. Cross-Media Similarity Evaluation for Web Image Retrieval in the Wild

Author: Dong, Jianfeng, Li, Xirong, and Xu, Duanqing
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In order to retrieve unlabeled images by textual queries, cross-media similarity computation is a key ingredient. Although novel methods are continuously introduced, little has been done to evaluate these methods together with large-scale query log analysis. Consequently, how far have these methods brought us in answering real-user queries is unclear. Given baseline methods that compute cross-media similarity using relatively simple text/image matching, how much progress have advanced models made is also unclear. This paper takes a pragmatic approach to answering the two questions. Queries are automatically categorized according to the proposed query visualness measure, and later connected to the evaluation of multiple cross-media similarity models on three test sets. Such a connection reveals that the success of the state-of-the-art is mainly attributed to their good performance on visual-oriented queries, while these queries account for only a small part of real-user queries. To quantify the current progress, we propose a simple text2image method, representing a novel test query by a set of images selected from large-scale query log. Consequently, computing cross-media similarity between the test query and a given image boils down to comparing the visual similarity between the given image and the selected images. Image retrieval experiments on the challenging Clickture dataset show that the proposed text2image compares favorably to recent deep learning based alternatives., Comment: 14 pages, 10 figures, accepted by IEEE Transactions on Multimedia 2018
Published: 2017

47. Fluency-Guided Cross-Lingual Image Captioning

Author: Lan, Weiyu, Li, Xirong, and Dong, Jianfeng
Subjects: Computer Science - Computation and Language
Abstract: Image captioning has so far been explored mostly in English, as most available datasets are in this language. However, the application of image captioning should not be restricted by language. Only few studies have been conducted for image captioning in a cross-lingual setting. Different from these works that manually build a dataset for a target language, we aim to learn a cross-lingual captioning model fully from machine-translated sentences. To conquer the lack of fluency in the translated sentences, we propose in this paper a fluency-guided learning framework. The framework comprises a module to automatically estimate the fluency of the sentences and another module to utilize the estimated fluency scores to effectively train an image captioning model for the target language. As experiments on two bilingual (English-Chinese) datasets show, our approach improves both fluency and relevance of the generated captions in Chinese, but without using any manually written sentences from the target language., Comment: 9 pages, 2 figures, accepted as ORAL by ACM Multimedia 2017
Published: 2017
Full Text: View/download PDF

48. CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

Author: Wang, Yabing, primary, Wang, Fan, additional, Dong, Jianfeng, additional, and Luo, Hao, additional
Published: 2024
Full Text: View/download PDF

49. Theoretical and Experimental Research on the Mass Changes of Elements in Molten Steel with CO2 Used as RH Lifting Gas

Author: Han, Baochen, Zhu, Rong, Wei, Guangsheng, Feng, Chao, Dong, Jianfeng, Chen, Xiaobo, editor, Zhong, Yulin, editor, Zhang, Lei, editor, Howarter, John A., editor, Baba, Alafara Abdullahi, editor, Wang, Cong, editor, Sun, Ziqi, editor, Zhang, Mingming, editor, Olivetti, Elsa, editor, Luo, Alan, editor, and Powell, Adam, editor
Published: 2020
Full Text: View/download PDF

50. Jet characteristics of CO2–O2 mixed injection using a dual-parameter oxygen lance nozzle for different smelting periods

Author: Feng, Chao, Zhu, Rong, Wei, Guangsheng, Dong, Jianfeng, and Dong, Kai
Subjects: jet, furnace gas composition, ambient temperature, CO2–O2 mixed injection, Technology, Chemical technology, TP1-1185, Chemicals: Manufacture, use, etc., TP200-248
Abstract: The jet performance of an oxygen lance nozzle influences the smelting rhythm, smelting index, and energy consumption of a converter. Due to the complexity of the process of converter smelting, the changing temperature and gas composition in the converter significantly impact the jet characteristics of the oxygen lance nozzle in the smelting process; however, research on the change law of jet characteristics in different smelting periods is limited. In this study, we used Ansys Fluent 17.0, which is commercially available fluid simulation software, to simulate the variation of jet characteristics of a mixed injection comprising 6% CO2 and 94% O2; the mixed injection was tested using a dual-parameter oxygen lance nozzle in the early, middle, and late stages of smelting. The results show that the increase of CO concentration and ambient temperature in the converter lead to a decrease in the attenuation rate of jet velocity, improved independence of multiple jets, and an increase in the impact area of jets on the molten pool. Thus, when designing and employing oxygen lance nozzles, the influence of ambient temperature and furnace gas composition on jet characteristics must be considered.
Published: 2021
Full Text: View/download PDF
