14 results for "Zhou, Wengang"
Search Results
2. Anti-Distractor Active Object Tracking in 3D Environments.
- Author
Xi, Mao, Zhou, Yun, Chen, Zheng, Zhou, Wengang, and Li, Houqiang
- Subjects
REINFORCEMENT learning, DRONE aircraft
- Abstract
In active object tracking, given a visual observation as input, the goal is to lock onto the target by autonomously adjusting the camera's position and posture. Previous works on active tracking assume that there is only one object (person) in the environment, without distractors. In this work, moving toward a realistic setting, we consider a more challenging scenario in which the tracker moves freely in 3D space, like an unmanned aerial vehicle (UAV), to track a person in various complex scenes with multiple distractors. To this end, we propose a novel end-to-end anti-distractor active object tracking framework that introduces multiple attention modules. On one hand, we learn an embedding from the target template as channel-wise attention over the current observation, to distinguish the target from the distractors. On the other hand, temporal attention is introduced to fuse the observation history into a feature representation, which is then fed into a reinforcement learning network to output the tracker's action. To evaluate our method, we build several multi-object 3D environments in Unreal Engine, and extensive experiments demonstrate the effectiveness of our approach. [ABSTRACT FROM AUTHOR] (A minimal sketch of the channel-wise attention idea follows this entry.)
- Published
- 2022
- Full Text
- View/download PDF
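The paper itself ships no code; below is a minimal, hypothetical PyTorch sketch of the channel-wise attention idea described in the abstract: an embedding of the target template gates the channels of the current observation's feature map. Module structure and sizes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemplateChannelAttention(nn.Module):
    """Gate observation feature channels with an embedding of the target template."""
    def __init__(self, channels: int = 256):
        super().__init__()
        # Pool the template feature map to a vector, then map it to per-channel gates.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),  # gates in (0, 1)
        )

    def forward(self, obs_feat: torch.Tensor, tmpl_feat: torch.Tensor) -> torch.Tensor:
        # obs_feat, tmpl_feat: (B, C, H, W) feature maps from a shared backbone.
        b, c, _, _ = tmpl_feat.shape
        gates = self.fc(self.pool(tmpl_feat).view(b, c))   # (B, C)
        return obs_feat * gates.view(b, c, 1, 1)           # channel-wise re-weighting

if __name__ == "__main__":
    attn = TemplateChannelAttention(channels=256)
    obs = torch.randn(2, 256, 32, 32)
    tmpl = torch.randn(2, 256, 8, 8)
    print(attn(obs, tmpl).shape)  # torch.Size([2, 256, 32, 32])
```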
3. Learning Diverse Models for End-to-End Ensemble Tracking.
- Author
Wang, Ning, Zhou, Wengang, and Li, Houqiang
- Subjects
OBJECT tracking (Computer vision), FEATURE extraction
- Abstract
In visual tracking, how to effectively model the target appearance using limited prior information remains an open problem. In this paper, we leverage an ensemble of diverse models to learn manifold representations for robust object tracking. The proposed ensemble framework includes a shared backbone network for efficient feature extraction and multiple head networks for independent predictions. When the heads share an identical structure and are trained on the same data, they become mutually correlated, which heavily hinders the potential of ensemble learning. To shrink the representational overlap among the models while encouraging diversity in their individual predictions, we introduce model diversity and response diversity regularization terms during training. By fusing these distinctive predictions via a fusion module, the tracking variance caused by distractor objects can be largely restrained. The whole framework is trained end-to-end in a data-driven manner, avoiding heuristic designs of multiple base models and fusion strategies. The proposed method achieves state-of-the-art results on seven challenging benchmarks while operating in real time. [ABSTRACT FROM AUTHOR] (A toy response-diversity penalty follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
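As an illustration only, here is a hypothetical PyTorch sketch of a response-diversity penalty in the spirit of the abstract: the pairwise cosine similarity between head response maps is added to the loss so the heads are pushed apart. The paper's exact regularizers may differ.

```python
import torch
import torch.nn.functional as F

def response_diversity_penalty(responses: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity among K head responses.

    responses: (K, N) tensor, one flattened response map per ensemble head.
    Returns the mean similarity over all head pairs (lower = more diverse).
    """
    z = F.normalize(responses, dim=1)     # unit-norm each head's response
    sim = z @ z.t()                       # (K, K) cosine similarity matrix
    k = sim.shape[0]
    off_diag = sim - torch.eye(k)         # drop self-similarity on the diagonal
    return off_diag.sum() / (k * (k - 1)) # average over ordered head pairs

if __name__ == "__main__":
    heads = torch.randn(4, 1024, requires_grad=True)  # 4 heads, flattened maps
    loss = response_diversity_penalty(heads)
    loss.backward()                                   # usable as an auxiliary loss
    print(float(loss))
```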
4. An End-to-End Foreground-Aware Network for Person Re-Identification.
- Author
Liu, Yiheng, Zhou, Wengang, Liu, Jianzhuang, Qi, Guo-Jun, Tian, Qi, and Li, Houqiang
- Subjects
PEDESTRIANS, IDENTIFICATION, FEATURE extraction
- Abstract
Person re-identification is the crucial task of identifying pedestrians of interest across multiple surveillance camera views. For person re-identification, a pedestrian is usually represented by features extracted from a rectangular image region that inevitably contains scene background, which introduces ambiguity when distinguishing pedestrians and degrades accuracy. We therefore propose an end-to-end foreground-aware network that separates the foreground from the background by learning a soft mask for person re-identification. In our method, in addition to the pedestrian ID as supervision for the foreground, we introduce the camera ID of each pedestrian image for background modeling. The foreground branch and the background branch are optimized collaboratively. Through a target attention loss, the pedestrian features extracted from the foreground branch become insensitive to backgrounds, which greatly reduces the negative impact of changing backgrounds on pedestrian matching across camera views. Notably, in contrast to existing methods, our approach does not require an additional dataset to train a human landmark detector or a segmentation model for locating background regions. Experimental results on three challenging datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, demonstrate the effectiveness of our approach. [ABSTRACT FROM AUTHOR] (A minimal soft-mask sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
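A minimal PyTorch sketch of the soft-masking idea, assuming a 1x1 convolution predicts the per-pixel foreground probability; the paper's actual branch design and losses (e.g., the target attention loss) are richer than this.

```python
import torch
import torch.nn as nn

class SoftMaskForeground(nn.Module):
    """Split a feature map into foreground/background parts via a learned soft mask."""
    def __init__(self, channels: int = 512):
        super().__init__()
        # 1x1 conv predicts a per-pixel foreground probability.
        self.mask_head = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1),
                                       nn.Sigmoid())

    def forward(self, feat: torch.Tensor):
        mask = self.mask_head(feat)   # (B, 1, H, W) soft foreground mask
        fg = feat * mask              # foreground branch input (pedestrian ID supervision)
        bg = feat * (1.0 - mask)      # background branch input (camera ID supervision)
        return fg, bg, mask

if __name__ == "__main__":
    net = SoftMaskForeground(512)
    fg, bg, mask = net(torch.randn(2, 512, 24, 8))
    print(fg.shape, bg.shape, mask.shape)
```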
5. Deep Relation Embedding for Cross-Modal Retrieval.
- Author
Zhang, Yifan, Zhou, Wengang, Wang, Min, Tian, Qi, and Li, Houqiang
- Subjects
IMAGE retrieval, FEATURE extraction, TASK analysis, COSINE function
- Abstract
Cross-modal retrieval aims to identify relevant data across different modalities. In this work, we focus on cross-modal retrieval between images and text sentences, formulated as similarity measurement for each image-text pair. To this end, we propose a Cross-modal Relation Guided Network (CRGN) that embeds images and text into a latent feature space. The CRGN model uses a GRU to extract text features and a ResNet model to learn globally guided image features. Based on global feature guiding and sentence generation learning, the relations between image regions can be modeled. The final image embedding is generated by a relation embedding module with an attention mechanism. Given the image and text embeddings, we perform cross-modal retrieval based on cosine similarity. The learned embedding space well captures the inherent relevance between images and text. We evaluate our approach with extensive experiments on two public benchmark datasets, i.e., MS-COCO and Flickr30K. Experimental results demonstrate that our approach achieves better or comparable performance relative to state-of-the-art methods, with notable efficiency. [ABSTRACT FROM AUTHOR] (A cosine-similarity retrieval sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
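The retrieval step itself is standard: L2-normalize both embeddings and rank by dot product. A small NumPy sketch, assuming the embeddings were already produced by the two modality branches (ResNet for images, GRU for text):

```python
import numpy as np

def rank_by_cosine(image_embs: np.ndarray, text_emb: np.ndarray, top_k: int = 5):
    """Rank image embeddings by cosine similarity to one text embedding.

    image_embs: (N, D); text_emb: (D,).
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb)
    sims = img @ txt                   # cosine similarity == dot of unit vectors
    order = np.argsort(-sims)[:top_k]  # highest similarity first
    return order, sims[order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    idx, scores = rank_by_cosine(rng.normal(size=(100, 256)), rng.normal(size=256))
    print(idx, scores)
```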
6. Cascaded Regression Tracking: Towards Online Hard Distractor Discrimination.
- Author
Wang, Ning, Zhou, Wengang, Tian, Qi, and Li, Houqiang
- Subjects
RUNNING speed
- Abstract
Visual tracking can easily be disturbed by similar surrounding objects. Such objects, known as hard distractors, increase the risk of target drift and model corruption even though they are a minority among negative samples, and they deserve additional attention during online tracking and model update. To enhance tracking robustness, in this paper we propose a cascaded regression tracker with two sequential stages. In the first stage, we filter out abundant, easily identified negative candidates via an efficient convolutional regression. In the second stage, a discrete-sampling-based ridge regression double-checks the remaining ambiguous hard samples; it serves as an alternative to fully connected layers and benefits from a closed-form solver for efficient learning. During model update, we utilize hard negative mining and an adaptive ridge regression scheme to improve the discrimination capability of the second-stage regressor. Extensive experiments are conducted on 11 challenging tracking benchmarks: OTB-2013, OTB-2015, VOT2018, VOT2019, UAV123, Temple-Color, NfS, TrackingNet, LaSOT, UAV20L, and OxUvA. The proposed method achieves state-of-the-art performance on prevalent benchmarks while running at real-time speed. [ABSTRACT FROM AUTHOR] (A closed-form ridge regression sketch follows this entry.)
- Published
- 2021
- Full Text
- View/download PDF
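The closed-form solver the second stage relies on is textbook ridge regression, w = (XᵀX + λI)⁻¹Xᵀy. A minimal NumPy sketch with synthetic candidate features and targets; the paper's adaptive variant builds on this:

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, lam: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.

    X: (N, D) sample features; y: (N,) regression targets
    (e.g., soft labels for candidate boxes).
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 64))            # 200 candidate samples, 64-dim features
    w_true = rng.normal(size=64)
    y = X @ w_true + 0.01 * rng.normal(size=200)
    w = ridge_fit(X, y, lam=0.1)
    print(np.mean((X @ w - y) ** 2))          # small training error
```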
7. Hierarchical Recurrent Deep Fusion Using Adaptive Clip Summarization for Sign Language Translation.
- Author
Guo, Dan, Zhou, Wengang, Li, Anyang, Li, Houqiang, and Wang, Meng
- Subjects
SIGN language, SUPERVISED learning, FACIAL expression, TRANSLATING & interpreting, SKELETON
- Abstract
Vision-based sign language translation (SLT) is a challenging task due to the complicated variations of facial expressions, gestures, and articulated poses involved in sign linguistics. As a weakly supervised sequence-to-sequence learning problem, SLT usually provides no exact temporal boundaries of actions. To adequately explore temporal hints in videos, we propose a novel framework named Hierarchical deep Recurrent Fusion (HRF). Aiming to model discriminative action patterns, HRF employs an adaptive temporal encoder to capture crucial RGB visemes and skeleton signees, each learned by the same scheme, named Adaptive Clip Summarization (ACS). ACS consists of three key modules: variable-length clip mining, adaptive temporal pooling, and attention-aware weighting. Based on the unaligned action patterns (RGB visemes and skeleton signees), a query-adaptive decoding fusion is then proposed to translate the target sentence. Extensive experiments demonstrate the effectiveness of the proposed HRF framework. [ABSTRACT FROM AUTHOR] (An attention-pooling sketch follows this entry.)
- Published
- 2020
- Full Text
- View/download PDF
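A minimal PyTorch sketch of the "attention-aware weighting" module named in the abstract, under the assumption that a learned scoring layer produces softmax weights over time and pools the clip sequence into one vector; sizes are illustrative.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Attention-aware weighting over a sequence of frame/clip features."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one scalar relevance score per time step

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) clip features from the temporal encoder.
        weights = torch.softmax(self.score(feats).squeeze(-1), dim=1)  # (B, T)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)              # (B, D)

if __name__ == "__main__":
    pool = AttentionPooling(512)
    clips = torch.randn(2, 16, 512)   # 16 clips per video
    print(pool(clips).shape)          # torch.Size([2, 512])
```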
8. Collaborative Index Embedding for Image Retrieval.
- Author
Zhou, Wengang, Li, Houqiang, Sun, Jian, and Tian, Qi
- Subjects
IMAGE retrieval, SIGNAL convolution, ARTIFICIAL neural networks, FEATURE extraction, MATHEMATICAL optimization, ACCURACY
- Abstract
In content-based image retrieval, both the SIFT feature and features from deep convolutional neural networks (CNNs) have demonstrated promising performance. To fully exploit both visual features in a unified framework for effective and efficient retrieval, we propose a collaborative index embedding method that implicitly integrates their index matrices. We formulate the index embedding as an optimization problem from the perspective of neighborhood sharing and solve it with an alternating index update scheme. After the iterative embedding, only the embedded CNN index is kept for online queries, which yields a significant gain in retrieval accuracy at a very economical memory cost. Extensive experiments have been conducted on public datasets with million-scale distractor images. The results reveal that, compared with recent state-of-the-art retrieval algorithms, our approach achieves competitive accuracy with less memory overhead and efficient query computation. [ABSTRACT FROM PUBLISHER] (A neighborhood-sharing sketch follows this entry.)
- Published
- 2018
- Full Text
- View/download PDF
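The alternating optimization is out of scope here; as a hypothetical illustration of the "neighborhood sharing" quantity the abstract optimizes, the NumPy sketch below builds kNN neighbor lists from two feature sets (stand-ins for the SIFT and CNN indexes) and measures how much they agree.

```python
import numpy as np

def knn_graph(feats: np.ndarray, k: int = 5) -> np.ndarray:
    """Return each item's k nearest-neighbor indices (excluding itself)."""
    d = ((feats[:, None, :] - feats[None, :, :]) ** 2).sum(-1)  # squared L2 distances
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def neighborhood_sharing(nn_a: np.ndarray, nn_b: np.ndarray) -> float:
    """Average Jaccard overlap between two indexes' neighbor sets, per image."""
    scores = []
    for row_a, row_b in zip(nn_a, nn_b):
        a, b = set(row_a), set(row_b)
        scores.append(len(a & b) / len(a | b))
    return float(np.mean(scores))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sift_like = rng.normal(size=(50, 32))  # stand-in for a SIFT-based index
    cnn_like = rng.normal(size=(50, 32))   # stand-in for a CNN-based index
    print(neighborhood_sharing(knn_graph(sift_like), knn_graph(cnn_like)))
```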
9. Democratic Diffusion Aggregation for Image Retrieval.
- Author
Gao, Zhanning, Xue, Jianru, Zhou, Wengang, Pang, Shanmin, and Tian, Qi
- Abstract
Content-based image retrieval is an important research topic in the multimedia field. In large-scale image search using local features, image features are encoded and aggregated into a compact vector to avoid indexing each feature individually. In the aggregation step, sum-aggregation is widely used in many existing works and demonstrates promising performance. However, it rests on a strong implicit assumption that the local descriptors of an image are independently and identically distributed in descriptor space and on the image plane. To address this problem, we propose a new aggregation method, democratic diffusion aggregation (DDA), with weak spatial context embedded. The main idea is to re-weight the embedded vectors before sum-aggregation according to the relevance among local descriptors. Unlike previous work, by conducting a diffusion process on an improved kernel matrix, we calculate the weighting coefficients more efficiently, without any iterative optimization. Besides considering the relevance of local descriptors from different images, we also present an efficient query fusion strategy that uses the initial top-ranked image vectors to enhance retrieval performance. Experimental results show that our aggregation method is much more efficient (about 14× faster) and more accurate than previous methods, and the query fusion strategy consistently improves retrieval quality. [ABSTRACT FROM PUBLISHER] (A weighted sum-aggregation sketch follows this entry.)
- Published
- 2016
- Full Text
- View/download PDF
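A hypothetical NumPy reading of the re-weighting idea: diffuse a uniform vector over a row-normalized similarity kernel and down-weight descriptors that accumulate high similarity mass (bursty, redundant ones), then sum-aggregate. This is a sketch of the spirit of DDA, not the paper's exact diffusion.

```python
import numpy as np

def diffusion_weights(phi: np.ndarray, steps: int = 3) -> np.ndarray:
    """Per-descriptor weights from a few diffusion steps on a similarity kernel.

    phi: (N, D) embedded local descriptors.
    """
    k = np.clip(phi @ phi.T, 0.0, None)     # non-negative similarity kernel
    p = k / k.sum(axis=1, keepdims=True)    # row-stochastic transition matrix
    mass = np.full(len(phi), 1.0 / len(phi))
    for _ in range(steps):                  # diffusion instead of iterative optimization
        mass = mass @ p
    return 1.0 / (mass * len(phi) + 1e-8)   # redundant descriptors -> smaller weight

def aggregate(phi: np.ndarray) -> np.ndarray:
    """Weighted sum-aggregation into a single compact image vector."""
    w = diffusion_weights(phi)
    v = (w[:, None] * phi).sum(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descriptors = rng.normal(size=(100, 64))
    print(aggregate(descriptors).shape)     # (64,)
```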
10. Making Residual Vector Distribution Uniform for Distinctive Image Representation.
- Author
Liu, Zhen, Li, Houqiang, Zhou, Wengang, Rui, Ting, and Tian, Qi
- Subjects
IMAGE representation, COMPUTATIONAL complexity, MULTIPLE correspondence analysis (Statistics), COMPUTER graphics research, CORRESPONDENCE analysis (Statistics)
- Abstract
Recently, the vector of locally aggregated descriptors (VLAD) has been demonstrated to be highly efficient for image representation. However, due to the coarse division of the feature space, its discriminative power is limited. One intuitive remedy is to construct the VLAD with a larger vocabulary, but this leads to a higher-dimensional VLAD and increases the computational complexity of learning the principal component analysis parameters used to project the VLAD onto a low-dimensional space. In this paper, we propose a hierarchical scheme to build the VLAD. In our approach, a hidden-layer visual vocabulary is constructed by generating subwords for each visual word of a coarse vocabulary, dividing the feature space more finely. We then aggregate the residuals from the hidden layer to the coarse layer to obtain an image descriptor of the same dimension as the original VLAD. In addition, we show that whitening the local descriptors further enhances the discriminative power of the VLAD. We validate our approach with experiments mainly conducted on three benchmark datasets, i.e., the Holidays, UKBench, and Oxford Buildings datasets, with Flickr1M as distractors, and compare with related VLAD algorithms. The experimental results demonstrate the effectiveness of our algorithm. [ABSTRACT FROM PUBLISHER] (A plain VLAD sketch follows this entry.)
- Published
- 2016
- Full Text
- View/download PDF
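For reference, a plain single-layer VLAD encoder in NumPy: each descriptor's residual to its nearest visual word is accumulated per word, then power- and L2-normalized. The paper's contribution is a two-layer (hidden-vocabulary) variant of this baseline.

```python
import numpy as np

def vlad_encode(descs: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Plain VLAD encoding.

    descs: (N, D) local descriptors; centroids: (K, D) visual words.
    """
    d2 = ((descs[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)  # (N, K)
    assign = d2.argmin(axis=1)                                       # nearest word
    k, dim = centroids.shape
    vlad = np.zeros((k, dim))
    for i, c in enumerate(assign):
        vlad[c] += descs[i] - centroids[c]        # accumulate residuals per word
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))  # power normalization
    flat = vlad.ravel()
    return flat / (np.linalg.norm(flat) + 1e-12)  # L2 normalization

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(vlad_encode(rng.normal(size=(500, 32)), rng.normal(size=(16, 32))).shape)
```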
11. Scalable Feature Matching by Dual Cascaded Scalar Quantization for Image Retrieval.
- Author
Zhou, Wengang, Yang, Ming, Wang, Xiaoyu, Li, Houqiang, Lin, Yuanqing, and Tian, Qi
- Subjects
IMAGE retrieval, FEATURE extraction, PATTERN matching, NEAREST neighbor analysis (Statistics), VECTOR analysis
- Abstract
In this paper, we investigate scalable visual feature matching in large-scale image search and propose a novel cascaded scalar quantization scheme in dual resolution. We formulate visual feature matching as a range-based neighbor search problem and approach it by identifying hyper-cubes with a dual-resolution scalar quantization strategy. Specifically, for each dimension of the PCA-transformed feature, scalar quantization is performed at both coarse and fine resolutions. The coarse-resolution quantization results are cascaded over multiple dimensions to index an image database, while the fine-resolution results over multiple dimensions are concatenated into a binary super-vector and stored in the index list for efficient verification. The proposed cascaded scalar quantization (CSQ) method is free of costly visual codebook training and is therefore independent of any image-descriptor training set. The CSQ index structure is flexible enough to accommodate new image features and scales to large image databases. We evaluate our approach on public benchmark datasets for large-scale image retrieval. Experimental results demonstrate competitive retrieval performance compared with several recent feature-quantization retrieval algorithms. [ABSTRACT FROM PUBLISHER] (A dual-resolution quantization sketch follows this entry.)
- Published
- 2016
- Full Text
- View/download PDF
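A hypothetical NumPy sketch of dual-resolution scalar quantization: each dimension gets a coarse code (cascaded into one index key) and a fine 1-bit code (concatenated into a binary signature for verification). The bit widths, assumed value range, and key construction are illustrative assumptions, not the CSQ specification.

```python
import numpy as np

def dual_resolution_quantize(x: np.ndarray, coarse_bits: int = 2):
    """Quantize each dimension of a (PCA-transformed) feature at two resolutions."""
    levels = 2 ** coarse_bits
    # Coarse: uniform bins over an assumed [-3, 3] range for standardized features.
    coarse = np.clip(((x + 3.0) / 6.0 * levels).astype(int), 0, levels - 1)
    key = 0
    for c in coarse:                 # cascade coarse codes into one index key
        key = key * levels + int(c)
    fine = (x > 0).astype(np.uint8)  # fine 1-bit sign code per dimension
    return key, fine

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    return int(np.count_nonzero(a != b))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    f = rng.normal(size=8)                             # e.g., top-8 PCA dimensions
    key1, sig1 = dual_resolution_quantize(f)
    key2, sig2 = dual_resolution_quantize(f + 0.01 * rng.normal(size=8))
    print(key1 == key2, hamming(sig1, sig2))           # compare bucket and signature gap
```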
12. Uniting Keypoints: Local Visual Information Fusion for Large-Scale Image Search.
- Author
Liu, Zhen, Li, Houqiang, Zhou, Wengang, Hong, Richang, and Tian, Qi
- Abstract
In this paper, we propose a novel approach to cope with the huge number of local features in a large-scale database. First, the local features in each image are organized into dozens of groups by running the standard k-means clustering algorithm on their spatial positions. Second, a compact descriptor is generated to describe the visual information of each group. Since the thousands of local features in each image are reorganized into only dozens of groups, each described by a single descriptor, the total number of descriptors in a large-scale database is greatly reduced, which significantly lowers the complexity of the search procedure. Further, the group descriptors are encoded into binary format for storage and computational efficiency. Experiments on two benchmark datasets, i.e., UKBench and Holidays, with the Flickr1M distractor database demonstrate the effectiveness of the proposed approach. [ABSTRACT FROM PUBLISHER] (A spatial-grouping sketch follows this entry.)
- Published
- 2015
- Full Text
- View/download PDF
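A minimal Python sketch of the grouping step, using scikit-learn's KMeans on keypoint positions. Mean-pooling plus sign binarization here is an illustrative stand-in for the paper's group descriptor and binary encoding.

```python
import numpy as np
from sklearn.cluster import KMeans

def group_descriptors(positions: np.ndarray, descs: np.ndarray, n_groups: int = 20):
    """Group local features by spatial position; pool one binary code per group.

    positions: (N, 2) keypoint (x, y) coordinates; descs: (N, D) descriptors.
    """
    labels = KMeans(n_clusters=n_groups, n_init=10).fit(positions).labels_
    groups = []
    for g in range(n_groups):
        pooled = descs[labels == g].mean(axis=0)                  # one vector per group
        groups.append((pooled > pooled.mean()).astype(np.uint8))  # binarize
    return np.stack(groups)                                       # (n_groups, D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.uniform(0, 640, size=(1000, 2))   # 1000 keypoints in an image
    desc = rng.normal(size=(1000, 128))         # SIFT-like descriptors
    print(group_descriptors(pos, desc).shape)   # (20, 128)
```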
13. Cross-Indexing of Binary SIFT Codes for Large-Scale Image Search.
- Author
Liu, Zhen, Li, Houqiang, Zhang, Liyan, Zhou, Wengang, and Tian, Qi
- Subjects
DIGITAL image processing, BINARY codes, IMAGE compression, INFORMATION storage & retrieval systems, HAMMING distance, FEATURE extraction, ALGORITHMS
- Abstract
In recent years, there has been growing interest in mapping visual features into compact binary codes for applications on large-scale image collections. Encoding high-dimensional data as compact binary codes reduces the memory cost of storage and benefits computational efficiency, since similarity can be measured efficiently with the Hamming distance. In this paper, we propose a novel flexible scale-invariant feature transform (SIFT) binarization (FSB) algorithm for large-scale image search. The FSB algorithm exploits the magnitude patterns of the SIFT descriptor; it is unsupervised, and the generated binary codes are demonstrated to be distance-preserving. In addition, we propose a new search strategy that finds target features by cross-indexing in the binary SIFT space and the original SIFT space. We evaluate our approach on two publicly released datasets. Experiments on a large-scale partial-duplicate image retrieval system demonstrate the effectiveness and efficiency of the proposed algorithm. [ABSTRACT FROM AUTHOR] (A SIFT binarization sketch follows this entry.)
- Published
- 2014
- Full Text
- View/download PDF
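A simple unsupervised magnitude-pattern rule, as a stand-in for FSB: threshold each element of a SIFT descriptor against the descriptor's own median, then compare codes with the Hamming distance. The paper's actual binarization is more flexible than this sketch.

```python
import numpy as np

def binarize_sift(desc: np.ndarray) -> np.ndarray:
    """Binarize a 128-D SIFT descriptor by thresholding against its median."""
    return (desc > np.median(desc)).astype(np.uint8)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d1 = rng.random(128)                    # SIFT-like descriptor
    d2 = d1 + 0.05 * rng.random(128)        # a slightly perturbed match
    d3 = rng.random(128)                    # an unrelated descriptor
    b1, b2, b3 = map(binarize_sift, (d1, d2, d3))
    print(hamming(b1, b2), hamming(b1, b3)) # matches yield smaller distances
```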
14. Towards Codebook-Free: Scalable Cascaded Hashing for Mobile Image Search.
- Author
Zhou, Wengang, Yang, Ming, Li, Houqiang, Wang, Xiaoyu, Lin, Yuanqing, and Tian, Qi
- Abstract
State-of-the-art image retrieval algorithms using local invariant features mostly rely on a large visual codebook to accelerate feature quantization and matching. Such a codebook typically contains millions of visual words, which not only demands considerable resources to train offline but also consumes a large amount of memory at the online retrieval stage. This is hardly affordable in resource-limited scenarios such as mobile image search. To address this issue, we propose a codebook-free algorithm for large-scale mobile image search. In our method, we first employ a novel scalable cascaded hashing scheme to ensure the recall rate of local feature matching, and we then enhance matching precision with an efficient verification based on the binary signatures of these local features. Consequently, our method achieves fast and accurate feature matching without a huge visual codebook. Moreover, the quantization and binarization functions in the proposed scheme are independent of any small collection of training images and generalize well to diverse image datasets. Evaluated on two public datasets with a million distractor images, the proposed algorithm demonstrates competitive retrieval accuracy and scalability against four recent retrieval methods from the literature. [ABSTRACT FROM PUBLISHER] (A hash-then-verify sketch follows this entry.)
- Published
- 2014
- Full Text
- View/download PDF
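A hypothetical NumPy sketch of the two-step recall/precision idea: bucket features by a coarse sign hash of a few leading dimensions (recall stage), then verify candidates with a binary-signature Hamming check (precision stage). The specific hash and signature here are illustrative, not the paper's scalable cascaded hashing scheme.

```python
import numpy as np
from collections import defaultdict

def coarse_hash(feat: np.ndarray, dims: int = 4) -> tuple:
    """Hash a feature by the signs of its first few (e.g., PCA) dimensions."""
    return tuple((feat[:dims] > 0).astype(int))

def signature(feat: np.ndarray) -> np.ndarray:
    """1-bit-per-dimension binary signature for precision verification."""
    return (feat > 0).astype(np.uint8)

def build_index(db: np.ndarray):
    index = defaultdict(list)
    for i, f in enumerate(db):
        index[coarse_hash(f)].append(i)   # recall stage: hash buckets
    return index

def search(q: np.ndarray, db: np.ndarray, index, max_hamming: int = 8):
    cands = index.get(coarse_hash(q), [])
    sig_q = signature(q)
    # Precision stage: verify candidates by binary-signature Hamming distance.
    return [i for i in cands
            if np.count_nonzero(signature(db[i]) != sig_q) <= max_hamming]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    db = rng.normal(size=(1000, 32))
    idx = build_index(db)
    q = db[42] + 0.05 * rng.normal(size=32)  # near-duplicate query of item 42
    print(search(q, db, idx))                # 42 should typically appear
```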