Journal: IEEE Transactions on Pattern Analysis & Machine Intelligence / Publication Type: Electronic Resources / Publication Year Range: Last 10 years / Publisher: IEEE / Search Limiters: Available in Library Collection and Full Text / Topic: feature extraction and visualization - Searchworks@Jio Institute Digital Library Search Results

1. Dual Encoding for Video Retrieval by Text.

Author: Dong, Jianfeng, Li, Xirong, Xu, Chaoxi, Yang, Xun, Yang, Gang, Wang, Xun, and Wang, Meng
Subjects: *VIDEO coding, *BLENDED learning, *ENCODING, *VIDEOS, *MACHINE learning, *RECURRENT neural networks
Abstract: This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method. Code and data are available at https://github.com/danieljf24/hybrid_space. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

2. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey.

Author: Jing, Longlong and Tian, Yingli
Subjects: *VISUAL learning, *SUPERVISED learning, *DEEP learning, *COMPUTER vision
Abstract: Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, as a subset of unsupervised learning methods, self-supervised learning methods are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the schema and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used datasets for images, videos, audio, and 3D data, as well as the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. The paper concludes with a set of promising future directions for self-supervised visual feature learning. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

3. GPCA: A Probabilistic Framework for Gaussian Process Embedded Channel Attention.

Author: Xie, Jiyang, Ma, Zhanyu, Chang, Dongliang, Zhang, Guoqiang, and Guo, Jun
Subjects: *GAUSSIAN processes, *BETA distribution, *CONVOLUTIONAL neural networks
Abstract: Channel attention mechanisms have been commonly applied in many visual tasks for effective performance improvement. They are able to reinforce the informative channels as well as to suppress the useless channels. Recently, different channel attention modules have been proposed and implemented in various ways. Generally speaking, they are mainly based on convolution and pooling operations. In this paper, we propose a Gaussian process embedded channel attention (GPCA) module and further interpret the channel attention schemes in a probabilistic way. The GPCA module intends to model the correlations among the channels, which are assumed to be captured by beta distributed variables. As the beta distribution cannot be integrated into the end-to-end training of convolutional neural networks (CNNs) with a mathematically tractable solution, we utilize an approximation of the beta distribution to solve this problem. Specifically, we adopt a Sigmoid-Gaussian approximation, in which the Gaussian distributed variables are transferred into the interval [0,1]. The Gaussian process is then utilized to model the correlations among different channels. In this case, a mathematically tractable solution is derived. The GPCA module can be efficiently implemented and integrated into the end-to-end training of the CNNs. Experimental results demonstrate the promising performance of the proposed GPCA module. Code is available at https://github.com/PRIS-CV/GPCA. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
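
Note on record 3: the Sigmoid-Gaussian step mentioned in the abstract (Gaussian variables mapped into the unit interval) can be illustrated with a minimal sketch. This is not the authors' GPCA module: it draws independent Gaussians per channel rather than using a Gaussian process to model channel correlations, and the names `sample_attention`, `channel_mean`, and `channel_std` are hypothetical.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Logistic function; maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_attention(mean, std, rng):
    """Draw one Gaussian variable per channel and squash it with a sigmoid.

    Assumption: channels are sampled independently here. The abstract's
    module instead models correlations among channels with a Gaussian process.
    """
    return [sigmoid(rng.gauss(m, s)) for m, s in zip(mean, std)]


rng = random.Random(0)                # fixed seed so the sketch is repeatable
channel_mean = [-2.0, 0.0, 2.0]       # hypothetical per-channel Gaussian means
channel_std = [0.5, 0.5, 0.5]         # hypothetical per-channel std deviations

weights = sample_attention(channel_mean, channel_std, rng)

# Channel attention then rescales each channel's features by its weight.
features = [1.0, 1.0, 1.0]
reweighted = [w * f for w, f in zip(weights, features)]
```

Each sampled weight lies strictly between 0 and 1, which is what lets it act as a soft gate on its channel.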

4. How to Trust Unlabeled Data? Instance Credibility Inference for Few-Shot Learning.

Author: Wang, Yikai, Zhang, Li, Yao, Yuan, and Fu, Yanwei
Subjects: *TRUST, *COMPUTER vision, *DATA augmentation, *DEEP learning, *SECURE Sockets Layer (Computer network protocol), *SUPERVISED learning, *EIGENVALUES
Abstract: Deep learning based models have excelled in many computer vision tasks and appear to surpass humans’ performance. However, these models require an avalanche of expensive human labeled training data and many iterations to train their large number of parameters. This severely limits their scalability to the real-world long-tail distributed categories, some of which have a large number of instances, but with only a few manually annotated. Learning from such extremely limited labeled examples is known as Few-Shot Learning (FSL). Different from prior art that leverages meta-learning or data augmentation strategies to alleviate this extremely data-scarce problem, this paper presents a statistical approach, dubbed Instance Credibility Inference (ICI), to exploit the support of unlabeled instances for few-shot visual recognition. Typically, we repurpose the self-taught learning paradigm to predict pseudo-labels of unlabeled instances with an initial classifier trained from the few shot and then select the most confident ones to augment the training set to re-train the classifier. This is achieved by constructing a (Generalized) Linear Model (LM/GLM) with incidental parameters to model the mapping from (un-)labeled features to their (pseudo-)labels, in which the sparsity of the incidental parameters indicates the credibility of the corresponding pseudo-labeled instance. We rank the credibility of pseudo-labeled instances along the regularization path of their corresponding incidental parameters, and the most trustworthy pseudo-labeled examples are preserved as the augmented labeled instances. This process is repeated until all the unlabeled samples are included in the expanded training set. Theoretically, under the conditions of restricted eigenvalue, irrepresentability, and large error, our approach is guaranteed to collect all the correctly-predicted pseudo-labeled instances from the noisy pseudo-labeled set. Extensive experiments under two few-shot settings show the effectiveness of our approach on four widely used few-shot visual recognition benchmark datasets including miniImageNet, tieredImageNet, CIFAR-FS, and CUB. Code and models are released at https://github.com/Yikai-Wang/ICI-FSL. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

5. Cross-Modal Progressive Comprehension for Referring Segmentation.

Author: Liu, Si, Hui, Tianrui, Huang, Shaofei, Wei, Yunchao, Li, Bo, and Li, Guanbin
Subjects: *MODALITY (Linguistics), *NATURAL languages, *IMAGE segmentation, *VIDEO coding, *PROBLEM solving, *PIXELS
Abstract: Given a natural language expression and an image/video, the goal of referring segmentation is to produce the pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem by implicit feature interaction and fusion between visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem in a progressive manner based on informative words in the expression, i.e., first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a cross-modal progressive comprehension (CMPC) scheme to effectively mimic human behaviors and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, the relational words are adopted to highlight the target entity as well as suppress other irrelevant ones by spatial graph reasoning. For video data, our CMPC-V module further exploits action words based on CMPC-I to highlight the correct entity matched with the action cues by temporal graph reasoning. In addition to the CMPC, we also introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features corresponding to different levels in the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE forms our image or video referring segmentation frameworks, which achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks, respectively. Our code is available at https://github.com/spyflying/CMPC-Refseg. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

6. Two-Branch Relational Prototypical Network for Weakly Supervised Temporal Action Localization.

Author: Huang, Linjiang, Huang, Yan, Ouyang, Wanli, and Wang, Liang
Subjects: *FEATURE extraction, *PROTOTYPES, *TASK analysis
Abstract: As a challenging task of high-level video understanding, weakly supervised temporal action localization has attracted more attention recently. With only video-level category labels, this task should indistinguishably identify the background and action categories frame by frame. However, it is non-trivial to achieve this in untrimmed videos, due to the unconstrained background and complex, multi-label actions. With the observation that these difficulties are mainly brought by the large variations within background and actions, we propose to address these challenges from the perspective of modeling variations. Moreover, it is desired to further reduce the variations, or learn compact features, so as to cast the problem of background identification as rejecting background and alleviate the contradiction between classification and detection. Accordingly, in this paper, we propose a two-branch relational prototypical network. The first branch, namely action-branch, adopts class-wise prototypes and mainly acts as an auxiliary to introduce a priori knowledge about label dependencies and be a guide for the second branch. Meanwhile, the second branch, namely sub-branch, starts with multiple prototypes, namely sub-prototypes, to enable a powerful ability of modeling variations. As a further benefit, we elaborately design a multi-label clustering loss based on the sub-prototypes to learn compact features under the multi-label setting. The two branches are associated using the correspondences between two types of prototypes, leading to a special two-stage classifier in the s-branch; on the other hand, the two branches serve as regularization terms to each other, improving the final performance. Ablation studies find that the proposed model is capable of modeling classes with large variations and learning compact features. Extensive experimental evaluations on Thumos14, MultiThumos and ActivityNet datasets demonstrate the effectiveness of the proposed method and superior performance over state-of-the-art approaches. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

7. Attention in Attention Networks for Person Retrieval.

Author: Fang, Pengfei, Zhou, Jieming, Roy, Soumava Kumar, Ji, Pan, Petersson, Lars, and Harandi, Mehrtash
Subjects: *CONVOLUTIONAL neural networks, *ARTIFICIAL neural networks, *DEEP learning, *MAP design, *HILBERT space, *PETRI nets
Abstract: This paper generalizes the Attention in Attention (AiA) mechanism, in P. Fang et al., 2019 by employing explicit mapping in reproducing kernel Hilbert spaces to generate attention values of the input feature map. The AiA mechanism models the capacity of building inter-dependencies among the local and global features by the interaction of inner and outer attention modules. Besides a vanilla AiA module, termed linear attention with AiA, two non-linear counterparts, namely, second-order polynomial attention and Gaussian attention, are also proposed to utilize the non-linear properties of the input features explicitly, via the second-order polynomial kernel and Gaussian kernel approximation. The deep convolutional neural network, equipped with the proposed AiA blocks, is referred to as Attention in Attention Network (AiA-Net). The AiA-Net learns to extract a discriminative pedestrian representation, which combines complementary person appearance and corresponding part features. Extensive ablation studies verify the effectiveness of the AiA mechanism and the use of non-linear features hidden in the feature map for attention design. Furthermore, our approach outperforms current state-of-the-art by a considerable margin across a number of benchmarks. In addition, state-of-the-art performance is also achieved in the video person retrieval task with the assistance of the proposed AiA blocks. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

8. Self-Supervised Video Representation Learning by Uncovering Spatio-Temporal Statistics.

Author: Wang, Jiangliu, Jiao, Jianbo, Bao, Linchao, He, Shengfeng, Liu, Wei, and Liu, Yun-hui
Subjects: *VISUAL fields, *IMAGE color analysis, *CARTESIAN coordinates, *TASK analysis, *STATISTICAL learning, *VIDEO excerpts, *SELF-efficacy
Abstract: This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion, the spatial location and dominant color of the largest color diversity along the temporal axis, etc. Then a neural network is built and trained to yield the statistical summaries given the video frames as inputs. In order to alleviate the learning difficulty, we employ several spatial partitioning patterns to encode rough spatial locations instead of exact spatial Cartesian coordinates. Our approach is inspired by the observation that human visual system is sensitive to rapidly changing contents in the visual field, and only needs impressions about rough spatial locations to understand the visual contents. To validate the effectiveness of the proposed approach, we conduct extensive experiments with four 3D backbone networks, i.e., C3D, 3D-ResNet, R(2+1)D and S3D-G. The results show that our approach outperforms the existing approaches across these backbone networks on four downstream video analysis tasks including action recognition, video retrieval, dynamic scene recognition, and action similarity labeling. The source code is publicly available at: https://github.com/laura-wang/video_repres_sts. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

9. Referring Segmentation in Images and Videos With Cross-Modal Self-Attention Network.

Author: Ye, Linwei, Rochan, Mrigank, Liu, Zhi, Zhang, Xiaoqin, and Wang, Yang
Subjects: *IMAGE segmentation, *VIDEOS, *NATURAL languages
Abstract: We consider the problem of referring segmentation in images and videos with natural language. Given an input image (or video) and a referring expression, the goal is to segment the entity referred to by the expression in the image or video. In this paper, we propose a cross-modal self-attention (CMSA) module to utilize fine details of individual words and the input image or video, which effectively captures the long-range dependencies between linguistic and visual features. Our model can adaptively focus on informative words in the referring expression and important regions in the visual input. We further propose a gated multi-level fusion (GMLF) module to selectively integrate self-attentive cross-modal features corresponding to different levels of visual features. This module controls the feature fusion of information flow of features at different levels with high-level and low-level semantic information related to different attentive words. Besides, we introduce a cross-frame self-attention (CFSA) module to effectively integrate temporal information in consecutive frames, which extends our method to the case of referring segmentation in videos. Experiments on benchmark datasets of four referring image datasets and two actor and action video segmentation datasets consistently demonstrate that our proposed approach outperforms existing state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

10. Transferable Interactiveness Knowledge for Human-Object Interaction Detection.

Author: Li, Yong-Lu, Liu, Xinpeng, Wu, Xiaoqian, Huang, Xijie, Xu, Liang, and Lu, Cewu
Subjects: *HUMAN body, *DEEP learning, *FEATURE extraction
Abstract: Human-object interaction (HOI) detection is an important problem for understanding how humans interact with objects. In this paper, we explore Interactiveness Knowledge, which indicates whether a human and an object interact with each other or not. We found that interactiveness knowledge can be learned across HOI datasets and alleviate the gap between diverse HOI category settings. Our core idea is to exploit an Interactiveness Network to learn the general interactiveness knowledge from multiple HOI datasets and perform Non-Interaction Suppression before HOI classification in inference. On account of the generalization of interactiveness, the interactiveness network is a transferable knowledge learner and can cooperate with any HOI detection model to achieve desirable results. We utilize the human instance and body part features together to learn the interactiveness in a hierarchical paradigm, i.e., instance-level and body part-level interactiveness. Thereafter, a consistency task is proposed to guide the learning and extract deeper interactive visual clues. We extensively evaluate the proposed method on HICO-DET, V-COCO, and a newly constructed HAKE-HOI dataset. With the learned interactiveness, our method outperforms state-of-the-art HOI detection methods, verifying its efficacy and flexibility. Code is available at https://github.com/DirtyHarryLYL/Transferable-Interactiveness-Network. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

11. OANet: Learning Two-View Correspondences and Geometry Using Order-Aware Network.

Author: Zhang, Jiahui, Sun, Dawei, Luo, Zixin, Yao, Anbang, Chen, Hongkai, Zhou, Lei, Shen, Tianwei, Chen, Yurong, Quan, Long, and Liao, Hongen
Subjects: *GEOMETRY, *IMAGE registration, *POSE estimation (Computer vision), *COMPUTER architecture, *FEATURE extraction
Abstract: Establishing correct correspondences between two images should consider both local and global spatial context. Given putative correspondences of feature points in two views, in this paper, we propose Order-Aware Network, which infers the probabilities of correspondences being inliers and regresses the relative pose encoded by the essential or fundamental matrix. Specifically, this proposed network is built hierarchically and comprises three operations. First, to capture the local context of sparse correspondences, the network clusters unordered input correspondences by learning a soft assignment matrix. These clusters are in canonical order and invariant to input permutations. Next, the clusters are spatially correlated to encode the global context of correspondences. After that, the context-encoded clusters are interpolated back to the original size and position to build a hierarchical architecture. We experiment intensively on both outdoor and indoor datasets. The accuracy of the two-view geometry and correspondences is significantly improved over the state of the art. Besides, based on the proposed method and an advanced local feature, we won first place in the CVPR 2019 image matching workshop challenge and also achieved state-of-the-art results in the Visual Localization benchmark. Code is available at https://github.com/zjhthu/OANet. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
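
Note on record 11: the idea of clustering unordered inputs through a soft assignment matrix, so that the clusters do not depend on input order, can be sketched roughly as follows. This is an illustration only, not the OANet code: the scoring function is a plain dot product, `cluster_weights` stands in for learned parameters, and normalizing each cluster's assignment over the points is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_pool(points, cluster_weights):
    """Pool N unordered feature vectors into K cluster vectors.

    points: list of N feature vectors, each of length D.
    cluster_weights: K hypothetical weight vectors of length D; each one
    scores every point from that point's own features only.
    """
    dim = len(points[0])
    pooled = []
    for w in cluster_weights:
        scores = [sum(wi * xi for wi, xi in zip(w, p)) for p in points]
        assign = softmax(scores)  # one column of the soft assignment matrix
        pooled.append(
            [sum(a * p[d] for a, p in zip(assign, points)) for d in range(dim)]
        )
    return pooled


points = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [-1.0, 0.5]]
cluster_weights = [[1.0, 0.0], [0.0, 1.0]]

pooled = soft_pool(points, cluster_weights)
# Feeding the same points in a different order gives the same clusters.
shuffled = soft_pool(list(reversed(points)), cluster_weights)
```

Because each point's score depends only on its own features and the pooled vector is a weighted sum, reordering the inputs leaves the clusters unchanged up to floating-point rounding.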

12. Learning Visual Instance Retrieval from Failure: Efficient Online Local Metric Adaptation from Negative Samples.

Author: Zhou, Jiahuan and Wu, Ying
Subjects: *VISUAL learning, *FEATURE extraction, *IMAGE retrieval, *GLOBAL method of teaching
Abstract: Existing visual instance retrieval (VIR) approaches attempt to learn a faithful global matching metric or discriminative feature embedding offline to cover enormous visual appearance variations, so as to directly use it online on various unseen probes for retrieval. However, their requirement for a huge set of positive training pairs is very demanding in practice and the performance is largely constrained for the unseen testing samples due to the severe data shifting issue. In contrast, this paper advocates a different paradigm: part of the learning can be performed online but with nominal costs, so as to achieve online metric adaptation for different query probes. By exploiting easily-available negative samples, we propose a novel solution to achieve the optimal local metric adaptation effectively and efficiently. The insight of our method is the local hard negative samples can actually provide tight constraints to fine tune the metric locally. Our local metric adaptation method is generally applicable to be used on top of any offline-learned baselines. In addition, this paper gives in-depth theoretical analyses of the proposed method to guarantee the reduction of the classification error both asymptotically and practically. Extensive experiments on various VIR tasks have confirmed our effectiveness and superiority. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

13. Unsupervised Learning of a Hierarchical Spiking Neural Network for Optical Flow Estimation: From Events to Global Motion Perception.

Author: Paredes-Valles, Federico, Scheper, Kirk Yannick Willehm, and de Croon, Guido C. H. E.
Subjects: *OPTICAL flow, *IMAGE sensors, *MOTION, *FEATURE extraction, *SENSORY perception, *ACTION potentials
Abstract: The combination of spiking neural networks and event-based vision sensors holds the potential of highly efficient and high-bandwidth optical flow estimation. This paper presents the first hierarchical spiking architecture in which motion (direction and speed) selectivity emerges in an unsupervised fashion from the raw stimuli generated with an event-based camera. A novel adaptive neuron model and stable spike-timing-dependent plasticity formulation are at the core of this neural network governing its spike-based processing and learning, respectively. After convergence, the neural architecture exhibits the main properties of biological visual motion systems, namely feature extraction and local and global motion perception. Convolutional layers with input synapses characterized by single and multiple transmission delays are employed for feature and local motion perception, respectively; while global motion selectivity emerges in a final fully-connected layer. The proposed solution is validated using synthetic and real event sequences. Along with this paper, we provide the cuSNN library, a framework that enables GPU-accelerated simulations of large-scale spiking neural networks. Source code and samples are available at https://github.com/tudelft/cuSNN. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

14. Revisiting Image-Language Networks for Open-Ended Phrase Detection.

Author: Plummer, Bryan A., Shih, Kevin J., Li, Yichen, Xu, Ke, Lazebnik, Svetlana, Sclaroff, Stan, and Saenko, Kate
Subjects: *OBJECT recognition (Computer vision), *NAIVE Bayes classification, *NATURAL languages, *TERMS & phrases, *STATISTICAL correlation
Abstract: Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases, resulting in over double the performance compared to a naive adaptation on three popular phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, with test-time phrase vocabulary sizes of 5K, 32K, and 159K, respectively. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

15. Visual Grounding Via Accumulated Attention.

Author: Deng, Chaorui, Wu, Qi, Wu, Qingyao, Hu, Fuyuan, Lyu, Fan, and Tan, Mingkui
Subjects: *REDUNDANCY in engineering, *ATTENTION, *NATURAL languages, *DATA quality
Abstract: Visual grounding (VG) aims to locate the most relevant object or region in an image, based on a natural language query. Generally, it requires the machine to first understand the query, identify the key concepts in the image, and then locate the target object by specifying its bounding box. However, in many real-world visual grounding applications, we have to face ambiguous queries and images with complicated scene structures. Identifying the target based on highly redundant and correlated information can be very challenging, often leading to unsatisfactory performance. To tackle this, in this paper, we exploit an attention module for each kind of information to reduce internal redundancies. We then propose an accumulated attention (A-ATT) mechanism to reason among all the attention modules jointly. In this way, the relation among different kinds of information can be explicitly captured. Moreover, to improve the performance and robustness of our VG models, we additionally introduce some noise into the training procedure to bridge the distribution gap between the human-labeled training data and the real-world poor quality data. With this “noised” training strategy, we can further learn a bounding box regressor, which can be used to refine the bounding box of the target object. We evaluate the proposed methods on four popular datasets (namely ReferCOCO, ReferCOCO+, ReferCOCOg, and GuessWhat?!). The experimental results show that our methods significantly outperform all previous works on every dataset in terms of accuracy. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

16. Power Normalizations in Fine-Grained Image, Few-Shot Image and Graph Classification.

Author: Koniusz, Piotr and Zhang, Hongguang
Subjects: *COVARIANCE matrices, *NONLINEAR operators, *DEEP learning, *LAPLACIAN matrices, *CLASSIFICATION, *CHARTS, diagrams, etc., *FEATURE extraction
Abstract: Power Normalizations (PN) are useful non-linear operators which tackle feature imbalances in classification problems. We study PNs in the deep learning setup via a novel PN layer pooling feature maps. Our layer combines the feature vectors and their respective spatial locations in the feature maps produced by the last convolutional layer of CNN into a positive definite matrix with second-order statistics to which PN operators are applied, forming so-called Second-order Pooling (SOP). As the main goal of this paper is to study Power Normalizations, we investigate the role and meaning of MaxExp and Gamma, two popular PN functions. To this end, we provide probabilistic interpretations of such element-wise operators and discover surrogates with well-behaved derivatives for end-to-end training. Furthermore, we look at the spectral applicability of MaxExp and Gamma by studying Spectral Power Normalizations (SPN). We show that SPN on the autocorrelation/covariance matrix and the Heat Diffusion Process (HDP) on a graph Laplacian matrix are closely related, thus sharing their properties. Such a finding leads us to the culmination of our work, a fast spectral MaxExp which is a variant of HDP for covariances/autocorrelation matrices. We evaluate our ideas on fine-grained recognition, scene recognition, and material classification, as well as in few-shot learning and graph classification. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
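
Note on record 16: the abstract names Gamma and MaxExp as element-wise Power Normalization functions but does not state their formulas. The sketch below assumes the commonly cited forms, Gamma as p^gamma with 0 < gamma <= 1 and MaxExp as 1 - (1 - p)^eta with eta >= 1, applied to values already scaled to [0, 1]. The function names and parameter values are illustrative, not taken from the paper.

```python
def gamma_pn(p: float, gamma: float) -> float:
    """Gamma power normalization (assumed form): p ** gamma for p in [0, 1]."""
    return p ** gamma


def maxexp_pn(p: float, eta: float) -> float:
    """MaxExp power normalization (assumed form): 1 - (1 - p) ** eta."""
    return 1.0 - (1.0 - p) ** eta


def normalize_matrix(matrix, fn, param):
    """Apply an element-wise operator to every entry of a matrix."""
    return [[fn(value, param) for value in row] for row in matrix]


# Hypothetical second-order statistics, already scaled into [0, 1].
stats = [[0.01, 0.25], [0.25, 1.0]]

gamma_out = normalize_matrix(stats, gamma_pn, 0.5)
maxexp_out = normalize_matrix(stats, maxexp_pn, 2.0)
```

Under these assumed forms both operators raise small entries more than large ones while leaving 0 and 1 fixed, which is one way to read the abstract's remark that PNs tackle feature imbalances.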

17. P-CNN: Part-Based Convolutional Neural Networks for Fine-Grained Visual Categorization.

Author: Han, Junwei, Yao, Xiwen, Cheng, Gong, Feng, Xiaoxu, and Xu, Dong
Subjects: *CONVOLUTIONAL neural networks, *FILTER banks
Abstract: This paper proposes an end-to-end fine-grained visual categorization system, termed Part-based Convolutional Neural Network (P-CNN), which consists of three modules. The first module is a Squeeze-and-Excitation (SE) block, which learns to recalibrate channel-wise feature responses by emphasizing informative channels and suppressing less useful ones. The second module is a Part Localization Network (PLN) used to locate distinctive object parts, through which a bank of convolutional filters are learned as discriminative part detectors. Thus, a group of informative parts can be discovered by convolving the feature maps with each part detector. The third module is a Part Classification Network (PCN) that has two streams. The first stream classifies each individual object part into image-level categories. The second stream concatenates part features and global feature into a joint feature for the final classification. In order to learn powerful part features and boost the joint feature capability, we propose a Duplex Focal Loss used for metric learning and part classification, which focuses on training hard examples. We further merge PLN and PCN into a unified network for an end-to-end training process via a simple training technique. Comprehensive experiments and comparisons with state-of-the-art methods on three benchmark datasets demonstrate the effectiveness of our proposed method. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF
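
Note on record 17: the channel recalibration that the abstract attributes to the Squeeze-and-Excitation block (emphasize informative channels, suppress less useful ones) can be sketched in plain Python. This is a simplified illustration, not the P-CNN implementation: biases are omitted, and the weights `w_reduce` and `w_expand` are hypothetical stand-ins for learned parameters.

```python
import math


def se_block(feature_maps, w_reduce, w_expand):
    """Recalibrate channels with a squeeze-and-excitation style gate.

    feature_maps: C channels, each a flat list of spatial activations.
    w_reduce: H x C weights of the bottleneck layer (hypothetical values).
    w_expand: C x H weights of the expansion layer (hypothetical values).
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = [sum(ch) / len(ch) for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then expansion with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every activation in a channel by that channel's gate.
    scaled = [[g * v for v in ch] for g, ch in zip(gates, feature_maps)]
    return scaled, gates


feature_maps = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [4.0, 4.0, 4.0, 4.0]]
w_reduce = [[0.5, 0.5, 0.5]]       # 3 channels reduced to 1 hidden unit
w_expand = [[1.0], [-1.0], [0.0]]  # 1 hidden unit expanded to 3 gates

scaled, gates = se_block(feature_maps, w_reduce, w_expand)
```

With these toy weights the first channel is emphasized, the second is suppressed, and the third receives a neutral gate of 0.5.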

18. Hierarchical Deep Click Feature Prediction for Fine-Grained Image Recognition.

Author: Yu, Jun, Tan, Min, Zhang, Hongyuan, Rui, Yong, and Tao, Dacheng
Subjects: *IMAGE recognition (Computer vision), *LEARNING ability, *FORECASTING, *FEATURE extraction, *SEMANTICS
Abstract: The click feature of an image, defined as the user click frequency vector of the image on a predefined word vocabulary, is known to effectively reduce the semantic gap for fine-grained image recognition. Unfortunately, user click frequency data are usually absent in practice. It remains challenging to predict the click feature from the visual feature, because the user click frequency vector of an image is always noisy and sparse. In this paper, we devise a Hierarchical Deep Word Embedding (HDWE) model by integrating sparse constraints and an improved RELU operator to address click feature prediction from visual features. HDWE is a coarse-to-fine click feature predictor that is learned with the help of an auxiliary image dataset containing click information. It can therefore discover the hierarchy of word semantics. We evaluate HDWE on three dog and one bird image datasets, in which Clickture-Dog and Clickture-Bird are utilized as auxiliary datasets to provide click data, respectively. Our empirical studies show that HDWE has 1) higher recognition accuracy, 2) a larger compression ratio, and 3) good one-shot learning ability and scalability to unseen categories. [ABSTRACT FROM AUTHOR]
Published: 2022
Full Text: View/download PDF

19. Extraction of an Explanatory Graph to Interpret a CNN.

Author: Zhang, Quanshi, Wang, Xin, Cao, Ruiming, Wu, Ying Nian, Shi, Feng, and Zhu, Song-Chun
Subjects: *REPRESENTATIONS of graphs, *CONVOLUTIONAL neural networks
Abstract: This paper introduces an explanatory graph representation to reveal object parts encoded inside convolutional layers of a CNN. Given a pre-trained CNN, each filter in a conv-layer usually represents a mixture of object parts. We develop a simple yet effective method to learn an explanatory graph, which automatically disentangles object parts from each filter without any part annotations. Specifically, given the feature map of a filter, we mine neural activations from the feature map, which correspond to different object parts. The explanatory graph is constructed to organize each mined part as a graph node. Each edge connects two nodes, whose corresponding object parts usually co-activate and keep a stable spatial relationship. Experiments show that each graph node consistently represented the same object part through different images, which boosted the transferability of CNN features. The explanatory graph transferred features of object parts to the task of part localization, and our method significantly outperformed other approaches. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

20. Interpretable CNNs for Object Classification.

Author: Zhang, Quanshi, Wang, Xin, Wu, Ying Nian, Zhou, Huilin, and Zhu, Song-Chun
Subjects: *KNOWLEDGE representation (Information theory), *CONVOLUTIONAL neural networks, *DEEP learning, *CLASSIFICATION, *ARTIFICIAL neural networks
Abstract: This paper proposes a generic method to learn interpretable convolutional filters in a deep convolutional neural network (CNN) for object classification, where each interpretable filter encodes features of a specific object part. Our method does not require additional annotations of object parts or textures for supervision. Instead, we use the same training data as traditional CNNs. Our method automatically assigns each interpretable filter in a high conv-layer with an object part of a certain category during the learning process. Such explicit knowledge representations in conv-layers of the CNN help people clarify the logic encoded in the CNN, i.e., answering what patterns the CNN extracts from an input image and uses for prediction. We have tested our method using different benchmark CNNs with various architectures to demonstrate the broad applicability of our method. Experiments have shown that our interpretable filters are much more semantically meaningful than traditional filters. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

21. Relationship-Embedded Representation Learning for Grounding Referring Expressions.

Author: Yang, Sibei, Li, Guanbin, and Yu, Yizhou
Subjects: *SEMANTIC computing, *NATURAL languages, *LOGIC circuits, *DATA mining
Abstract: Grounding referring expressions in images aims to locate the object instance in an image described by a referring expression. It involves a joint understanding of natural language and image content, and is essential for a range of visual tasks related to human-computer interaction. As a language-to-vision matching task, the core of this problem is to not only extract all the necessary information (i.e., objects and the relationships among them) in both the image and referring expression, but also make full use of context information to align cross-modal semantic concepts in the extracted information. Unfortunately, existing work on grounding referring expressions fails to accurately extract multi-order relationships from the referring expression and associate them with the objects and their related contexts in the image. In this paper, we propose a cross-modal relationship extractor (CMRE) to adaptively highlight objects and relationships (spatial and semantic relations) related to the given expression with a cross-modal attention mechanism, and represent the extracted information as a language-guided visual relation graph. In addition, we propose a Gated Graph Convolutional Network (GGCN) to compute multimodal semantic contexts by fusing information from different modes and propagating multimodal information in the structured relation graph. Experimental results on three common benchmark datasets show that our Cross-Modal Relationship Inference Network, which consists of CMRE and GGCN, significantly surpasses all existing state-of-the-art methods. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

22. Visual Scanpath Prediction Using IOR-ROI Recurrent Mixture Density Network.

Author: Sun, Wanjie, Chen, Zhenzhong, and Wu, Feng
Subjects: *EYE movements, *VISUAL memory, *CONVOLUTIONAL neural networks, *GAZE, *HUMAN mechanics, *VISUAL perception, *VISUAL fields
Abstract: A visual scanpath represents the human eye movements when scanning the visual field for acquiring and receiving visual information. Predicting visual scanpaths when a certain stimulus is presented plays an important role in modeling overt human visual attention and search behavior. In this paper, we present an 'Inhibition of Return - Region of Interest' (IOR-ROI) recurrent mixture density network based framework that learns to produce human-like visual scanpaths under task-free viewing conditions. The proposed model simultaneously predicts a sequence of ordered fixation positions and their corresponding fixation durations. Our model integrates bottom-up features and semantic features extracted by convolutional neural networks. Then the integrated feature maps are fed into the IOR-ROI Long Short-Term Memory (LSTM) which is the core component of the proposed model. The IOR-ROI LSTM is a dual LSTM unit, i.e., the IOR-LSTM and the ROI-LSTM, capturing IOR dynamics and gaze shift behavior simultaneously. IOR-LSTM simulates the visual working memory to adaptively maintain and update visual information regarding previously fixated regions. ROI-LSTM is responsible for predicting the next possible ROIs given the spatially inhibited image feature maps on a feature-wise basis. Fixation duration is predicted by a regression neural network given the viewing history and image feature maps corresponding to the currently fixated ROI. Considering the eye movement pattern variations among subjects, a mixture density network is adopted to model the next fixation distribution as Gaussian mixtures and the fixation duration is also modeled using a Gaussian distribution. Our model is evaluated on the OSIE and MIT low resolution eye-tracking datasets and experimental results indicate that the proposed method can achieve superior performance in predicting visual scanpaths. The code will be publicly available at https://github.com/sunwj/scanpath. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

23. Crafting GBD-Net for Object Detection.

Author: Zeng, Xingyu, Ouyang, Wanli, Yan, Junjie, Li, Hongsheng, Xiao, Tong, Wang, Kun, Liu, Yu, Zhou, Yucong, Yang, Bin, Wang, Zhe, Zhou, Hui, and Wang, Xiaogang
Subjects: *OBJECT-oriented methods (Computer science), *SYSTEMS design, *OBJECT monitors (Computer software), *OBJECT-oriented databases, *OBJECT-oriented programming
Abstract: The visual cues from multiple support regions of different sizes and resolutions are complementary in classifying a candidate box in object detection. Effective integration of local and contextual visual cues from these regions has become a fundamental problem in object detection. In this paper, we propose a gated bi-directional CNN (GBD-Net) to pass messages among features from different support regions during both feature learning and feature extraction. Such message passing can be implemented through convolution between neighboring support regions in two directions and can be conducted in various layers. Therefore, local and contextual visual patterns can validate the existence of each other by learning their nonlinear relationships and their close interactions are modeled in a more complex way. It is also shown that message passing is not always helpful but dependent on individual samples. Gated functions are therefore needed to control message transmission, whose on-or-offs are controlled by extra visual evidence from the input sample. The effectiveness of GBD-Net is shown through experiments on three object detection datasets, ImageNet, Pascal VOC2007 and Microsoft COCO. Besides the GBD-Net, this paper also shows the details of our approach in winning the ImageNet object detection challenge of 2016, with source code provided on https://github.com/craftGBD/craftGBD. In this winning system, the modified GBD-Net, new pretraining scheme and better region proposal designs are provided. We also show the effectiveness of different network structures and existing techniques for object detection, such as multi-scale testing, left-right flip, bounding box voting, NMS, and context. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

24. Ordered or Orderless: A Revisit for Video Based Person Re-Identification.

Author: Zhang, Le, Shi, Zenglin, Zhou, Joey Tianyi, Cheng, Ming-Ming, Liu, Yun, Bian, Jia-Wang, Zeng, Zeng, and Shen, Chunhua
Subjects: *VISUAL learning, *VIDEOS
Abstract: Is a recurrent network really necessary for learning a good visual representation for video based person re-identification (VPRe-id)? In this paper, we first show that the common practice of employing recurrent neural networks (RNNs) to aggregate temporal-spatial features may not be optimal. Specifically, with a diagnostic analysis, we show that the recurrent structure may not learn temporal dependencies as effectively as expected and implicitly yields an orderless representation. Based on this observation, we then present a simple yet surprisingly powerful approach for VPRe-id, where we treat VPRe-id as an efficient orderless ensemble of image based person re-identification problems. More specifically, we divide videos into individual images and re-identify the person with an ensemble of image based rankers. Under the i.i.d. assumption, we provide an error bound that sheds light upon how we could improve VPRe-id. Our work also presents a promising way to bridge the gap between video and image based person re-identification. Comprehensive experimental evaluations demonstrate that the proposed solution achieves state-of-the-art performance on multiple widely used datasets (iLIDS-VID, PRID 2011, and MARS). [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

25. Visual Tracking via Dynamic Memory Networks.

Author: Yang, Tianyu and Chan, Antoni B.
Abstract: Template-matching methods for visual tracking have gained popularity recently due to their good performance and fast speed. However, they lack effective ways to adapt to changes in the target object's appearance, making their tracking accuracy still far from state-of-the-art. In this paper, we propose a dynamic memory network to adapt the template to the target's appearance variations during tracking. The reading and writing process of the external memory is controlled by an LSTM network with the search feature map as input. A spatial attention mechanism is applied to concentrate the LSTM input on the potential target as the location of the target is at first unknown. To prevent aggressive model adaptivity, we apply gated residual template learning to control the amount of retrieved memory that is used to combine with the initial template. In order to alleviate the drift problem, we also design a “negative” memory unit that stores templates for distractors, which are used to cancel out wrong responses from the object template. To further boost the tracking performance, an auxiliary classification loss is added after the feature extractor part. Unlike tracking-by-detection methods where the object's information is maintained by the weight parameters of neural networks, which requires expensive online fine-tuning to be adaptable, our tracker runs completely feed-forward and adapts to the target's appearance changes by updating the external memory. Moreover, the capacity of our model is not determined by the network size as with other trackers – the capacity can be easily enlarged as the memory requirements of a task increase, which is favorable for memorizing long-term object information. Extensive experiments on the OTB and VOT datasets demonstrate that our trackers perform favorably against state-of-the-art tracking methods while retaining real-time speed. [ABSTRACT FROM AUTHOR]
Published: 2021
Full Text: View/download PDF

26. Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks.

Author: Jiang, Yu-Gang, Wu, Zuxuan, Wang, Jun, Xue, Xiangyang, and Chang, Shih-Fu
Subjects: *FEATURE extraction, *ARTIFICIAL neural networks, *DEEP learning, *VIDEO recording, *KERNEL (Mathematics), *IMAGE fusion
Abstract: In this paper, we study the challenging problem of categorizing videos according to high-level semantics such as the existence of a particular human action or a complex event. Although extensive efforts have been devoted in recent years, most existing works combined multiple video features using simple fusion strategies and neglected the utilization of inter-class semantic relationships. This paper proposes a novel unified framework that jointly exploits the feature relationships and the class relationships for improved categorization performance. Specifically, these two types of relationships are estimated and utilized by imposing regularizations in the learning process of a deep neural network (DNN). Through arming the DNN with better capability of harnessing both the feature and the class relationships, the proposed regularized DNN (rDNN) is more suitable for modeling video semantics. We show that rDNN produces better performance over several state-of-the-art approaches. Competitive results are reported on the well-known Hollywood2 and Columbia Consumer Video benchmarks. In addition, to stimulate future research on large scale video categorization, we collect and release a new benchmark dataset, called FCVID, which contains 91,223 Internet videos and 239 manually annotated categories. [ABSTRACT FROM PUBLISHER]
Published: 2018
Full Text: View/download PDF

27. Structured Label Inference for Visual Understanding.

Author: Nauata, Nelson, Hu, Hexiang, Zhou, Guang-Tong, Deng, Zhiwei, Liao, Zicheng, and Mori, Greg
Subjects: *LABELS, *COMPUTER vision, *MARKOV processes, *TASK analysis
Abstract: Visual data such as images and videos contain a rich source of structured semantic labels as well as a wide range of interacting components. Visual content could be assigned with fine-grained labels describing major components, coarse-grained labels depicting high level abstractions, or a set of labels revealing attributes. Such categorization over different, interacting layers of labels evinces the potential for a graph-based encoding of label information. In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos. We consider the use of the Bidirectional Inference Neural Network (BINN) and Structured Inference Neural Network (SINN) for performing graph-based inference in label space and propose a Long Short-Term Memory (LSTM) based extension for exploiting activity progression on untrimmed videos. The methods were evaluated on (i) the Animal with Attributes (AwA), Scene Understanding (SUN) and NUS-WIDE datasets for multi-label image classification, (ii) the first two releases of the YouTube-8M large scale dataset for multi-label video classification, and (iii) the THUMOS'14 and MultiTHUMOS video datasets for action detection. Our results demonstrate the effectiveness of structured label inference in these challenging tasks, achieving significant improvements against baselines. [ABSTRACT FROM AUTHOR]
Published: 2020
Full Text: View/download PDF

28. Robust Visual Tracking via Hierarchical Convolutional Features.

Author: Ma, Chao, Huang, Jia-Bin, Yang, Xiaokang, and Yang, Ming-Hsuan
Subjects: *OBJECT tracking (Computer vision), *OBJECT recognition (Computer vision), *ADAPTIVE filters, *LONG-term memory, *IMAGE representation
Abstract: Visual tracking is challenging as target objects often undergo significant appearance changes caused by deformation, abrupt motion, background clutter and occlusion. In this paper, we propose to exploit the rich hierarchical features of deep convolutional neural networks to improve the accuracy and robustness of visual tracking. Deep neural networks trained on object recognition datasets consist of multiple convolutional layers. These layers encode target appearance with different levels of abstraction. For example, the outputs of the last convolutional layers encode the semantic information of targets and such representations are invariant to significant appearance variations. However, their spatial resolutions are too coarse to precisely localize the target. In contrast, features from earlier convolutional layers provide more precise localization but are less invariant to appearance changes. We interpret the hierarchical features of convolutional layers as a nonlinear counterpart of an image pyramid representation and explicitly exploit these multiple levels of abstraction to represent target objects. Specifically, we learn adaptive correlation filters on the outputs from each convolutional layer to encode the target appearance. We infer the maximum response of each layer to locate targets in a coarse-to-fine manner. To further handle the issues with scale estimation and re-detecting target objects from tracking failures caused by heavy occlusion or out-of-the-view movement, we conservatively learn another correlation filter, that maintains a long-term memory of target appearance, as a discriminative classifier. We apply the classifier to two types of object proposals: (1) proposals with a small step size and tightly around the estimated location for scale estimation; and (2) proposals with large step size and across the whole image for target re-detection. Extensive experimental results on large-scale benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art tracking methods. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

29. Transferring Knowledge Fragments for Learning Distance Metric from a Heterogeneous Domain.

Author: Luo, Yong, Wen, Yonggang, Liu, Tongliang, and Tao, Dacheng
Subjects: *LEARNING, *THEORY of knowledge, *PSYCHOLOGICAL distance, *HETEROGENEOUS catalysis, *COMPENSATION effect (Catalysis)
Abstract: The goal of transfer learning is to improve the performance of a target learning task by leveraging information (or transferring knowledge) from other related tasks. In this paper, we examine the problem of transfer distance metric learning (DML), which usually aims to mitigate the label information deficiency issue in the target DML. Most of the current Transfer DML (TDML) methods are not applicable to the scenario where data are drawn from heterogeneous domains. Some existing heterogeneous transfer learning (HTL) approaches can learn a target distance metric, usually by transforming the samples of the source and target domains into a common subspace. However, these approaches lack flexibility in real-world applications, and the learned transformations are often restricted to be linear. This motivates us to develop a general flexible heterogeneous TDML (HTDML) framework. In particular, any (linear/nonlinear) DML algorithm can be employed to learn the source metric beforehand. Then the pre-learned source metric is represented as a set of knowledge fragments to help target metric learning. We show how generalization error in the target domain could be reduced using the proposed transfer strategy, and develop a novel algorithm to learn either a linear or nonlinear target metric. Extensive experiments on various applications demonstrate the effectiveness of the proposed method. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

30. Learning Two-Branch Neural Networks for Image-Text Matching Tasks.

Author: Wang, Liwei, Li, Yin, Huang, Jing, and Lazebnik, Svetlana
Subjects: *IMAGE registration, *ARTIFICIAL neural networks, *DEEP learning, *TASK analysis, *FEATURE extraction
Abstract: Image-language matching tasks have recently attracted a lot of attention in the computer vision field. These tasks include image-sentence matching, i.e., given an image query, retrieving relevant sentences and vice versa, and region-phrase matching or visual grounding, i.e., matching a phrase to relevant regions. This paper investigates two-branch neural networks for learning the similarity between these two data modalities. We propose two network structures that produce different output representations. The first one, referred to as an embedding network, learns an explicit shared latent embedding space with a maximum-margin ranking loss and novel neighborhood constraints. Compared to standard triplet sampling, we perform improved neighborhood sampling that takes neighborhood information into consideration while constructing mini-batches. The second network structure, referred to as a similarity network, fuses the two branches via element-wise product and is trained with regression loss to directly predict a similarity score. Extensive experiments show that our networks achieve high accuracies for phrase localization on the Flickr30K Entities dataset and for bi-directional image-sentence retrieval on Flickr30K and MSCOCO datasets. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

31. Head and Body Orientation Estimation Using Convolutional Random Projection Forests.

Author: Lee, Donghoon, Yang, Ming-Hsuan, and Oh, Songhwai
Subjects: *RANDOM projection method, *FILTER banks, *ALGORITHMS, *COMBINATORIAL optimization, *BANDPASS filters
Abstract: In this paper, we consider the problem of estimating the head pose and body orientation of a person from a low-resolution image. Under this setting, it is difficult to reliably extract facial features or detect body parts. We propose a convolutional random projection forest (CRPforest) algorithm for these tasks. A convolutional random projection network (CRPnet) is used at each node of the forest. It maps an input image to a high-dimensional feature space using a rich filter bank. The filter bank is designed to generate sparse responses so that they can be efficiently computed by compressive sensing. A sparse random projection matrix can capture most essential information contained in the filter bank without using all the filters in it. Therefore, the CRPnet is fast, e.g., it requires 0.04 ms to process an image of 50 × 50 pixels, due to the small number of convolutions (e.g., 0.01 percent of a layer of a neural network) at the expense of less than 2 percent accuracy. The overall forest estimates head and body pose well on benchmark datasets, e.g., over 98 percent on the HIIT dataset, while requiring 3.8 ms without using a GPU. Extensive experiments on challenging datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in low-resolution images with noise, occlusion, and motion blur. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

32. ELD-Net: An Efficient Deep Learning Architecture for Accurate Saliency Detection.

Author: Lee, Gayoung, Tai, Yu-Wing, and Kim, Junmo
Subjects: *DIGITAL image processing, *IMAGE recognition (Computer vision), *X-ray diffraction, *DEEP learning, *IMAGE quality analysis
Abstract: Recent advances in saliency detection have utilized deep learning to obtain high-level features to detect salient regions in scenes. These advances have yielded results superior to those reported in past work, which involved the use of hand-crafted low-level features for saliency detection. In this paper, we propose ELD-Net, a unified deep learning framework for accurate and efficient saliency detection. We show that hand-crafted features can provide complementary information to enhance saliency detection that uses only high-level features. Our method uses both low-level and high-level features for saliency detection. High-level features are extracted using GoogLeNet, and low-level features evaluate the relative importance of a local region using its differences from other regions in an image. The two feature maps are independently encoded by the convolutional and the ReLU layers. The encoded low-level and high-level features are then combined by concatenation and convolution. Finally, a linear fully connected layer is used to evaluate the saliency of a queried region. A full resolution saliency map is obtained by querying the saliency of each local region of an image. Since the high-level features are encoded at low resolution, and the encoded high-level features can be reused for every query region, our ELD-Net is very fast. Our experiments show that our method outperforms state-of-the-art deep learning-based saliency detection methods. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

33. A Novel Linelet-Based Representation for Line Segment Detection.

Author: Cho, Nam-Gyu, Yuille, Alan, and Lee, Seong-Whan
Subjects: *IMAGE segmentation, *DIGITAL image processing, *DATA visualization, *EDGE detection (Image processing), *FEATURE extraction
Abstract: This paper proposes a method for line segment detection in digital images. We propose a novel linelet-based representation to model intrinsic properties of line segments in rasterized image space. Based on this, line segment detection, validation, and aggregation frameworks are constructed. For a numerical evaluation on real images, we propose a new benchmark dataset of real images with annotated lines called YorkUrban-LineSegment. The results show that the proposed method outperforms state-of-the-art methods numerically and visually. To our best knowledge, this is the first report of numerical evaluation of line segment detection on real images. [ABSTRACT FROM PUBLISHER]
Published: 2018
Full Text: View/download PDF

34. Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts.

Author: Fu, Kun, Jin, Junqi, Cui, Runpeng, Sha, Fei, and Zhang, Changshui
Subjects: *PHOTOGRAPHS, *PHOTOGRAPH captions, *IMAGE processing, *VISUAL perception, *SEMANTIC networks (Information theory)
Abstract: Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image captioning system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience where the attention shifts among the visual regions—such transitions impose a thread of ordering in visual perception. This alignment characterizes the flow of latent meaning, which encodes what is semantically shared by both the visual scene and the text description. Our system also makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. The contexts adapt language models for word generation to specific scene types. We benchmark our system and contrast to published results on several popular datasets, using both automatic evaluation metrics and human evaluation. We show that either region-based attention or scene-specific contexts improves systems without those components. Furthermore, combining these two modeling ingredients attains the state-of-the-art performance. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

35. Visually Grounded Meaning Representations.

Author: Silberer, Carina, Ferrari, Vittorio, and Lapata, Mirella
Subjects: *LEXICAL grammar, *SEMANTICS, *PRAGMATICS, *FEATURE extraction, *COMPUTER vision
Abstract: In this paper we address the problem of grounding distributional representations of lexical meaning. We introduce a new model which uses stacked autoencoders to learn higher-level representations from textual and visual input. The visual modality is encoded via vectors of attributes obtained automatically from images. We create a new large-scale taxonomy of 600 visual attributes representing more than 500 concepts and 700K images. We use this dataset to train attribute classifiers and integrate their predictions with text-based distributional models of word meaning. We evaluate our model on its ability to simulate word similarity judgments and concept categorization. On both tasks, our model yields a better fit to behavioral data compared to baselines and related models which either rely on a single modality or do not make use of attribute-based input. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

36. Jointly Learning Heterogeneous Features for RGB-D Activity Recognition.

Author: Hu, Jian-Fang, Zheng, Wei-Shi, Lai, Jianhuang, and Zhang, Jianguo
Subjects: *IMAGE recognition (Computer vision), *OPTICAL pattern recognition, *IMAGE color analysis, *THREE-dimensional display systems, *FEATURE extraction
Abstract: In this paper, we focus on heterogeneous features learning for RGB-D activity recognition. We find that features from different channels (RGB, depth) could share some similar hidden structures, and then propose a joint learning model to simultaneously explore the shared and feature-specific components as an instance of heterogeneous multi-task learning. The proposed model formed in a unified framework is capable of: 1) jointly mining a set of subspaces with the same dimensionality to exploit latent shared features across different feature channels, 2) meanwhile, quantifying the shared and feature-specific components of features in the subspaces, and 3) transferring feature-specific intermediate transforms (i-transforms) for learning fusion of heterogeneous features across datasets. To efficiently train the joint model, a three-step iterative optimization algorithm is proposed, followed by a simple inference model. Extensive experimental results on four activity datasets have demonstrated the efficacy of the proposed method. A new RGB-D activity dataset focusing on human-object interaction is further contributed, which presents more challenges for RGB-D activity benchmarking. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

37. Cross-Convolutional-Layer Pooling for Image Recognition.

Author: Liu, Lingqiao, Shen, Chunhua, and Hengel, Anton van den
Subjects: *IMAGE recognition (Computer vision), *OPTICAL pattern recognition, *FEATURE extraction, *IMAGE representation, *ARTIFICIAL neural networks
Abstract: Recent studies have shown that a Deep Convolutional Neural Network (DCNN) trained on a large image dataset can be used as a universal image descriptor and that doing so leads to impressive performance for a variety of image recognition tasks. Most of these studies adopt activations from a single DCNN layer, usually a fully-connected layer, as the image representation. In this paper, we propose a novel way to extract image representations from two consecutive convolutional layers: one layer is used for local feature extraction and the other serves as guidance to pool the extracted features. By taking different viewpoints of convolutional layers, we further develop two schemes to realize this idea. The first directly uses convolutional layers from a DCNN. The second applies the pre-trained CNN on densely sampled image regions and treats the fully-connected activations of each image region as a convolutional layer's feature activations. We then train another convolutional layer on top of that as the pooling-guidance convolutional layer. By applying our method to three popular visual classification tasks, we find that our first scheme tends to perform better on applications which demand strong discrimination on lower-level visual patterns while the second excels in cases that require discrimination on category-level patterns. Overall, the proposed method achieves superior performance over existing approaches for extracting image representations from a DCNN. In addition, we apply cross-layer pooling to the problem of image retrieval and propose schemes to reduce the computational cost. Experimental results suggest that the proposed method achieves promising results for the image retrieval task. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF

38. Video2vec Embeddings Recognize Events When Examples Are Scarce.

Author: Habibian, Amirhossein, Mensink, Thomas, and Snoek, Cees G. M.
Subjects: *DATA visualization, *FEATURE extraction, *IMAGE reconstruction, *IMAGE processing, *MACHINE learning
Abstract: This paper aims for event recognition when video examples are scarce or even completely absent. The key in such a challenging setting is a semantic video representation. Rather than building the representation from individual attribute detectors and their annotations, we propose to learn the entire representation from freely available web videos and their descriptions using an embedding between video features and term vectors. In our proposed embedding, which we call Video2vec, the correlations between the words are utilized to learn a more effective representation by optimizing a joint objective balancing descriptiveness and predictability. We show how learning the Video2vec embedding using a multimodal predictability loss, including appearance, motion and audio features, results in a better predictable representation. We also propose an event specific variant of Video2vec to learn a more accurate representation for the words, which are indicative of the event, by introducing a term sensitive descriptiveness loss. Our experiments on three challenging collections of web videos from the NIST TRECVID Multimedia Event Detection and Columbia Consumer Videos datasets demonstrate: i) the advantages of Video2vec over representations using attributes or alternative embeddings, ii) the benefit of fusing video modalities by an embedding over common strategies, iii) the complementarity of term sensitive descriptiveness and multimodal predictability for event recognition. By its ability to improve predictability of present day audio-visual video features, while at the same time maximizing their semantic descriptiveness, Video2vec leads to state-of-the-art accuracy for both few- and zero-example recognition of events in video. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF

39. HOTS: A Hierarchy of Event-Based Time-Surfaces for Pattern Recognition.

Author: Lagorce, Xavier, Orchard, Garrick, Galluppi, Francesco, Shi, Bertram E., and Benosman, Ryad B.
Subjects: *PATTERN recognition systems, *IMAGE sensors, *PIXELS, *FEATURE extraction, *NEUROMORPHICS
Abstract: This paper describes novel event-based spatio-temporal features called time-surfaces and how they can be used to create a hierarchical event-based pattern recognition architecture. Unlike existing hierarchical architectures for pattern recognition, the presented model relies on a time oriented approach to extract spatio-temporal features from the asynchronously acquired dynamics of a visual scene. These dynamics are acquired using biologically inspired frameless asynchronous event-driven vision sensors. Similarly to cortical structures, subsequent layers in our hierarchy extract increasingly abstract features using increasingly large spatio-temporal windows. The central concept is to use the rich temporal information provided by events to create contexts in the form of time-surfaces which represent the recent temporal activity within a local spatial neighborhood. We demonstrate that this concept can robustly be used at all stages of an event-based hierarchical model. First layer feature units operate on groups of pixels, while subsequent layer feature units operate on the output of lower level feature units. We report results on a previously published 36 class character recognition task and a four class canonical dynamic card pip task, achieving near 100 percent accuracy on each. We introduce a new seven class moving face recognition task, achieving 79 percent accuracy. [ABSTRACT FROM AUTHOR]
Published: 2017
Full Text: View/download PDF
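
The time-surface idea in the abstract above (recent temporal activity within a local spatial neighborhood of an event) can be illustrated with a minimal sketch. The exponential decay kernel, the parameter names `radius` and `tau`, and the `last_times` grid layout are assumptions made for illustration; they are not taken from the abstract, and the hierarchy of layers is not modeled here.

```python
import math

def time_surface(last_times, x, y, t, radius, tau):
    """Build a (2*radius+1) x (2*radius+1) time-surface around an event at (x, y, t).

    last_times[row][col] holds the timestamp of the most recent event seen at
    that pixel, or None if no event has occurred there yet (assumed layout).
    Each cell is exp(-(t - t_last) / tau), so recent activity is close to 1 and
    old activity decays toward 0. Pixels outside the sensor or without any
    event contribute 0.
    """
    height = len(last_times)
    width = len(last_times[0])
    surface = []
    for dy in range(-radius, radius + 1):
        row = []
        for dx in range(-radius, radius + 1):
            yy, xx = y + dy, x + dx
            inside = 0 <= yy < height and 0 <= xx < width
            t_last = last_times[yy][xx] if inside else None
            row.append(0.0 if t_last is None else math.exp(-(t - t_last) / tau))
        surface.append(row)
    return surface

# Hypothetical 3x3 sensor: the current event is at the center (t = 10.0),
# and its right-hand neighbor last fired at t = 5.0.
grid = [[None, None, None],
        [None, 10.0, 5.0],
        [None, None, None]]
surface = time_surface(grid, x=1, y=1, t=10.0, radius=1, tau=5.0)
```

In this sketch the center cell is 1 because the event itself has just updated that pixel, and older neighbors fade according to `tau`.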

40. Higher-Order Occurrence Pooling for Bags-of-Words: Visual Concept Detection.

Author: Koniusz, Piotr, Yan, Fei, Gosselin, Philippe-Henri, and Mikolajczyk, Krystian
Subjects: *FEATURE extraction, *OBJECT recognition (Computer vision), *DESCRIPTOR systems, *ENCODING, *KERNEL (Mathematics), *VISUALIZATION, *ARTIFICIAL neural networks
Abstract: In object recognition, the Bag-of-Words model assumes: i) extraction of local descriptors from images, ii) embedding the descriptors by a coder to a given visual vocabulary space which results in mid-level features, iii) extracting statistics from mid-level features with a pooling operator that aggregates occurrences of visual words in images into signatures, which we refer to as First-order Occurrence Pooling. This paper investigates higher-order pooling that aggregates over co-occurrences of visual words. We derive Bag-of-Words with Higher-order Occurrence Pooling based on linearisation of Minor Polynomial Kernel, and extend this model to work with various pooling operators. This approach is then effectively used for fusion of various descriptor types. Moreover, we introduce Higher-order Occurrence Pooling performed directly on local image descriptors as well as a novel pooling operator that reduces the correlation in the image signatures. Finally, First-, Second-, and Third-order Occurrence Pooling are evaluated given various coders and pooling operators on several widely used benchmarks. The proposed methods are compared to other approaches such as Fisher Vector Encoding and demonstrate improved results. [ABSTRACT FROM PUBLISHER]
Published: 2017
Full Text: View/download PDF
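
The difference between first-order pooling (occurrences of visual words) and second-order pooling (co-occurrences of visual words) described in the abstract above can be sketched as follows. Average pooling and the function names are assumptions chosen for illustration; the abstract states that the model works with various pooling operators, and this sketch does not reproduce the kernel linearisation or the decorrelating operator it mentions.

```python
def first_order_pooling(codes):
    """Average each visual-word activation over all local descriptors."""
    n = len(codes)
    k = len(codes[0])
    return [sum(c[i] for c in codes) / n for i in range(k)]

def second_order_pooling(codes):
    """Average the outer product of each mid-level feature vector with itself.

    Entry (i, j) measures how often visual words i and j are active together.
    The matrix is symmetric, so only the upper triangle is kept, flattened
    row by row into a signature of length k * (k + 1) / 2.
    """
    n = len(codes)
    k = len(codes[0])
    signature = []
    for i in range(k):
        for j in range(i, k):
            signature.append(sum(c[i] * c[j] for c in codes) / n)
    return signature

# Two hypothetical descriptors coded over a vocabulary of two visual words.
codes = [[1.0, 0.0],
         [0.0, 1.0]]
occurrences = first_order_pooling(codes)       # both words appear equally often
co_occurrences = second_order_pooling(codes)   # but they never appear together
```

The example shows why the second-order signature carries extra information: both words occur with the same frequency, yet their co-occurrence entry is zero.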

41. Dynamic Scene Recognition with Complementary Spatiotemporal Features.

Author: Feichtenhofer, Christoph, Pinz, Axel, and Wildes, Richard P.
Subjects: *PATTERN recognition systems, *SPATIOTEMPORAL processes, *FEATURE extraction, *COLOR image processing, *ACQUISITION of data
Abstract: This paper presents Dynamically Pooled Complementary Features (DPCF), a unified approach to dynamic scene recognition that analyzes a short video clip in terms of its spatial, temporal and color properties. The complementarity of these properties is preserved through all main steps of processing, including primitive feature extraction, coding and pooling. In the feature extraction step, spatial orientations capture static appearance, spatiotemporal oriented energies capture image dynamics and color statistics capture chromatic information. Subsequently, primitive features are encoded into a mid-level representation that has been learned for the task of dynamic scene recognition. Finally, a novel dynamic spacetime pyramid is introduced. This dynamic pooling approach can handle both global as well as local motion by adapting to the temporal structure, as guided by pooling energies. The resulting system provides online recognition of dynamic scenes that is thoroughly evaluated on the two current benchmark datasets and yields best results to date on both datasets. In-depth analysis reveals the benefits of explicitly modeling feature complementarity in combination with the dynamic spacetime pyramid, indicating that this unified approach should be well-suited to many areas of video analysis. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

42. Max-Margin Action Prediction Machine.

Author: Kong, Yu and Fu, Yun
Subjects: *QUANTUM mechanics, *DIGITAL image processing, *FACTORIZATION, *LINEAR programming, *IMAGE converters
Abstract: The speed with which intelligent systems can react to an action depends on how soon it can be recognized. The ability to recognize ongoing actions is critical in many applications, for example, spotting criminal activity. It is challenging, since decisions have to be made based on partial videos of temporally incomplete action executions. In this paper, we propose a novel discriminative multi-scale kernelized model for predicting the action class from a partially observed video. The proposed model captures temporal dynamics of human actions by explicitly considering all the history of observed features as well as features in smaller temporal segments. A compositional kernel is proposed to hierarchically capture the relationships between partial observations as well as the temporal segments, respectively. We develop a new learning formulation, which elegantly captures the temporal evolution over time, and enforces the label consistency between segments and corresponding partial videos. We prove that the proposed learning formulation minimizes the upper bound of the empirical risk. Experimental results on four public datasets show that the proposed approach outperforms state-of-the-art action prediction methods. [ABSTRACT FROM AUTHOR]
Published: 2016
Full Text: View/download PDF

43. Adopting Abstract Images for Semantic Scene Understanding.

Author: Zitnick, C. Lawrence, Vedantam, Ramakrishna, and Parikh, Devi
Subjects: *SEMANTICS research, *LINGUISTICS research, *LANGUAGE research, *CLIP art, *GRAPHIC arts
Abstract: Relating visual information to its linguistic semantic meaning remains an open and challenging area of research. The semantic meaning of images depends on the presence of objects, their attributes and their relations to other objects. But precisely characterizing this dependence requires extracting complex visual information from an image, which is in general a difficult and yet unsolved problem. In this paper, we propose studying semantic information in abstract images created from collections of clip art. Abstract images provide several advantages over real images. They allow for the direct study of how to infer high-level semantic information, since they remove the reliance on noisy low-level object, attribute and relation detectors, or the tedious hand-labeling of real images. Importantly, abstract images also allow the ability to generate sets of semantically similar scenes. Finding analogous sets of real images that are semantically similar would be nearly impossible. We create 1,002 sets of 10 semantically similar abstract images with corresponding written descriptions. We thoroughly analyze this dataset to discover semantically important features, the relations of words to visual features and methods for measuring semantic similarity. Finally, we study the relation between the saliency and memorability of objects and their semantic importance. [ABSTRACT FROM PUBLISHER]
Published: 2016
Full Text: View/download PDF

44. One Shot Detection with Laplacian Object and Fast Matrix Cosine Similarity.

Author: Biswas, Sujoy Kumar and Milanfar, Peyman
Subjects: *OBJECT recognition (Computer vision), *LAPLACIAN matrices, *FOURIER transforms, *COSINE function, *EMBEDDINGS (Mathematics)
Abstract: One shot, generic object detection involves searching for a single query object in a larger target image. Relevant approaches have benefited from features that typically model the local similarity patterns. In this paper, we combine local similarity (encoded by local descriptors) with a global context (i.e., a graph structure) of pairwise affinities among the local descriptors, embedding the query descriptors into a low dimensional but discriminatory subspace. Unlike principal components that preserve global structure of feature space, we actually seek a linear approximation to the Laplacian eigenmap that permits us a locality preserving embedding of high dimensional region descriptors. Our second contribution is an accelerated but exact computation of matrix cosine similarity as the decision rule for detection, obviating the computationally expensive sliding window search. We leverage the power of Fourier transform combined with integral image to achieve superior runtime efficiency that allows us to test multiple hypotheses (for pose estimation) within a reasonably short time. Our approach to one shot detection is training-free, and experiments on the standard data sets confirm the efficacy of our model. Besides, low computation cost of the proposed (codebook-free) object detector facilitates rather straightforward query detection in large data sets including movie videos. [ABSTRACT FROM PUBLISHER]
Published: 2016
Full Text: View/download PDF
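
The decision rule named in the abstract above, matrix cosine similarity, is commonly defined as the Frobenius inner product of two feature matrices divided by the product of their Frobenius norms. The sketch below computes that definition directly for one pair of equally sized matrices; it is not the accelerated Fourier-transform and integral-image computation the abstract describes, and the function name is hypothetical.

```python
import math

def matrix_cosine_similarity(a, b):
    """Frobenius inner product of a and b, normalized by their Frobenius norms.

    a and b are equally sized matrices given as lists of rows. The result lies
    in [-1, 1]; it is 1 when one matrix is a positive multiple of the other.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    dot = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(y * y for row in b for y in row))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("similarity is undefined for an all-zero matrix")
    return dot / (norm_a * norm_b)

# Hypothetical query features compared against a scaled copy of themselves
# and against a matrix with no overlapping entries.
query = [[1.0, 2.0], [3.0, 4.0]]
scaled = [[2.0, 4.0], [6.0, 8.0]]
same_pattern = matrix_cosine_similarity(query, scaled)
no_overlap = matrix_cosine_similarity([[1.0, 0.0], [0.0, 0.0]],
                                      [[0.0, 0.0], [0.0, 1.0]])
```

Because the measure is scale-invariant, a detector built on it responds to the pattern of the features rather than their magnitude.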

45. Weakly Supervised Large Scale Object Localization with Multiple Instance Learning and Bag Splitting.

Author: Ren, Weiqiang, Huang, Kaiqi, Tao, Dacheng, and Tan, Tieniu
Subjects: *IMAGE processing, *FACE perception, *BIG data, *MACHINE learning, *PATTERNS (Mathematics)
Abstract: Localizing objects of interest in images when provided with only image-level labels is a challenging visual recognition task. Previous efforts have required carefully designed features and have difficulty in handling images with cluttered backgrounds. Up-scaling to large datasets also poses a challenge to applying these methods to real applications. In this paper, we propose an efficient and effective learning framework called MILinear, which is able to learn an object localization model from large-scale data without using bounding box annotations. We integrate rich general prior knowledge into a learning model using a large pre-trained convolutional network. Moreover, to reduce ambiguity in positive images, we present a bag-splitting algorithm that iteratively generates new negative bags from positive ones. We evaluate the proposed approach on the challenging Pascal VOC 2007 dataset, and our method outperforms other state-of-the-art methods by a large margin; some results are even comparable to fully supervised models trained with bounding box annotations. To further demonstrate scalability, we also present detection results on the ILSVRC 2013 detection dataset, and our method outperforms supervised deformable part-based model without using box annotations. [ABSTRACT FROM PUBLISHER]
Published: 2016
Full Text: View/download PDF

46. Scalable Feature Matching by Dual Cascaded Scalar Quantization for Image Retrieval.

Author: Zhou, Wengang, Yang, Ming, Wang, Xiaoyu, Li, Houqiang, Lin, Yuanqing, and Tian, Qi
Subjects: *IMAGE retrieval, *FEATURE extraction, *PATTERN matching, *NEAREST neighbor analysis (Statistics), *VECTOR analysis
Abstract: In this paper, we investigate the problem of scalable visual feature matching in large-scale image search and propose a novel cascaded scalar quantization scheme in dual resolution. We formulate the visual feature matching as a range-based neighbor search problem and approach it by identifying hyper-cubes with a dual-resolution scalar quantization strategy. Specifically, for each dimension of the PCA-transformed feature, scalar quantization is performed at both coarse and fine resolutions. The scalar quantization results at the coarse resolution are cascaded over multiple dimensions to index an image database. The scalar quantization results over multiple dimensions at the fine resolution are concatenated into a binary super-vector and stored into the index list for efficient verification. The proposed cascaded scalar quantization (CSQ) method is free of the costly visual codebook training and thus is independent of any image descriptor training set. The index structure of the CSQ is flexible enough to accommodate new image features and scalable to index large-scale image database. We evaluate our approach on the public benchmark datasets for large-scale image retrieval. Experimental results demonstrate the competitive retrieval performance of the proposed method compared with several recent retrieval algorithms on feature quantization. [ABSTRACT FROM PUBLISHER]
Published: 2016
Full Text: View/download PDF

47. Multi-Camera Saliency.

Author: Luo, Yan, Jiang, Ming, Wong, Yongkang, and Zhao, Qi
Subjects: *CAMERAS, *EYE tracking, *IMAGE processing, *IMAGE converters, *ALGORITHM research
Abstract: A significant body of literature on saliency modeling predicts where humans look in a single image or video. Besides the scientific goal of understanding how information is fused from multiple visual sources to identify regions of interest in a holistic manner, there are tremendous engineering applications of multi-camera saliency due to the widespread use of cameras. This paper proposes a principled framework to smoothly integrate visual information from multiple views to a global scene map, and to employ a saliency algorithm incorporating high-level features to identify the most important regions by fusing visual information. The proposed method has the following key distinguishing features compared with its counterparts: (1) the proposed saliency detection is global (salient regions from one local view may not be important in a global context), (2) it does not require special ways for camera deployment or overlapping field of view, and (3) the key saliency algorithm is effective in highlighting interesting object regions though not a single detector is used. Experiments on several data sets confirm the effectiveness of the proposed principled framework. [ABSTRACT FROM PUBLISHER]
Published: 2015
Full Text: View/download PDF

48. Single-Pedestrian Detection Aided by Two-Pedestrian Detection.

Author: Ouyang, Wanli, Zeng, Xingyu, and Wang, Xiaogang
Subjects: *PEDESTRIAN areas, *DETECTORS, *PEDESTRIANS, *DETECTION alarms, *ENGINEERING instruments
Abstract: In this paper, we address the challenging problem of detecting pedestrians who appear in groups. A new approach is proposed for single-pedestrian detection aided by two-pedestrian detection. A mixture model of two-pedestrian detectors is designed to capture the unique visual cues which are formed by nearby pedestrians but cannot be captured by single-pedestrian detectors. A probabilistic framework is proposed to model the relationship between the configurations estimated by single- and two-pedestrian detectors, and to refine the single-pedestrian detection result using two-pedestrian detection. The two-pedestrian detector can integrate with any single-pedestrian detector. Twenty-five state-of-the-art single-pedestrian detection approaches are combined with the two-pedestrian detector on three widely used public datasets: Caltech, TUD-Brussels, and ETH. Experimental results show that our framework improves all these approaches. The average improvement is 9 percent on the Caltech-Test dataset, 11 percent on the TUD-Brussels dataset and 17 percent on the ETH dataset in terms of average miss rate. The lowest average miss rate is reduced from 37 to percent on the Caltech-Test dataset, from 55 to 50 percent on the TUD-Brussels dataset and from 43 to 38 percent on the ETH dataset. [ABSTRACT FROM AUTHOR]
Published: 2015
Full Text: View/download PDF

49. Image Geo-Localization Based on Multiple Nearest Neighbor Feature Matching Using Generalized Graphs.

Author: Zamir, Amir Roshan and Shah, Mubarak
Subjects: *IMAGE registration, *DIGITAL image processing, *IMAGE recognition (Computer vision), *RADIAL basis functions, *APPROXIMATION theory
Abstract: In this paper, we present a new framework for geo-locating an image utilizing a novel multiple nearest neighbor feature matching method using Generalized Minimum Clique Graphs (GMCP). First, we extract local features (e.g., SIFT) from the query image and retrieve a number of nearest neighbors for each query feature from the reference data set. Next, we apply our GMCP-based feature matching to select a single nearest neighbor for each query feature such that all matches are globally consistent. Our approach to feature matching is based on the proposition that the first nearest neighbors are not necessarily the best choices for finding correspondences in image matching. Therefore, the proposed method considers multiple reference nearest neighbors as potential matches and selects the correct ones by enforcing consistency among their global features (e.g., GIST) using GMCP. In this context, we argue that using a robust distance function for finding the similarity between the global features is essential for the cases where the query matches multiple reference images with dissimilar global features. Towards this end, we propose a robust distance function based on the Gaussian Radial Basis Function (G-RBF). We evaluated the proposed framework on a new data set of 102k street view images; the experiments show it outperforms the state of the art by 10 percent. [ABSTRACT FROM PUBLISHER]
Published: 2014
Full Text: View/download PDF

49 results
