Search results for "Yan, Shuicheng" (2,228 results)
152. Towards Class Interpretable Vision Transformer with Multi-Class-Tokens
- Author
Dong, Bowen, Zhou, Pan, Yan, Shuicheng, Zuo, Wangmeng, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Yu, Shiqi, editor, Zhang, Zhaoxiang, editor, Yuen, Pong C., editor, Han, Junwei, editor, Tan, Tieniu, editor, Guo, Yike, editor, Lai, Jianhuang, editor, and Zhang, Jianguo, editor
- Published
- 2022
153. DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition
- Author
Liang, Yuxuan, Zhou, Pan, Zimmermann, Roger, Yan, Shuicheng, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Avidan, Shai, editor, Brostow, Gabriel, editor, Cissé, Moustapha, editor, Farinella, Giovanni Maria, editor, and Hassner, Tal, editor
- Published
- 2022
154. Graph-Based Global Reasoning Networks
- Author
Chen, Yunpeng, Rohrbach, Marcus, Yan, Zhicheng, Yan, Shuicheng, Feng, Jiashi, and Kalantidis, Yannis
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Globally modeling and reasoning over relations between regions can be beneficial for many computer vision tasks on both images and videos. Convolutional Neural Networks (CNNs) excel at modeling local relations by convolution operations, but they are typically inefficient at capturing global relations between distant regions and require stacking multiple convolution layers. In this work, we propose a new approach for reasoning globally in which a set of features is globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed. After reasoning, relation-aware features are distributed back to the original coordinate space for downstream tasks. We further present a highly efficient instantiation of the proposed approach and introduce the Global Reasoning unit (GloRe unit) that implements the coordinate-interaction space mapping by weighted global pooling and weighted broadcasting, and the relation reasoning via graph convolution on a small graph in interaction space. The proposed GloRe unit is lightweight, end-to-end trainable and can be easily plugged into existing CNNs for a wide range of tasks. Extensive experiments show that our GloRe unit can consistently boost the performance of state-of-the-art backbone architectures, including ResNet, ResNeXt, SE-Net and DPN, for both 2D and 3D CNNs, on image classification, semantic segmentation and video action recognition tasks.
- Published
- 2018
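The coordinate-to-interaction-space mapping described in this abstract is compact enough to sketch. Below is a minimal PyTorch rendering of a GloRe-style unit; the channel widths, the 1x1-convolution form of the graph convolution, and the residual connection are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GloReUnit(nn.Module):
    """Sketch of a Global Reasoning (GloRe) style unit, under assumed widths."""
    def __init__(self, in_channels, mid_channels=64, num_nodes=32):
        super().__init__()
        self.phi = nn.Conv2d(in_channels, mid_channels, 1)    # feature reduction
        self.theta = nn.Conv2d(in_channels, num_nodes, 1)     # pooling/broadcast weights
        # "graph convolution on a small graph in interaction space"
        self.node_conv = nn.Conv1d(num_nodes, num_nodes, 1)          # mix across nodes
        self.channel_conv = nn.Conv1d(mid_channels, mid_channels, 1) # update node states
        self.out = nn.Conv2d(mid_channels, in_channels, 1)    # back to input width

    def forward(self, x):
        b, c, h, w = x.shape
        feats = self.phi(x).flatten(2)                  # B x C' x HW
        proj = self.theta(x).flatten(2)                 # B x N x HW
        # weighted global pooling: coordinate space -> interaction space
        nodes = torch.bmm(proj, feats.transpose(1, 2))  # B x N x C'
        # relational reasoning on the small N-node graph
        nodes = nodes + self.node_conv(nodes)
        nodes = torch.relu(self.channel_conv(nodes.transpose(1, 2)))  # B x C' x N
        # weighted broadcasting: interaction space -> coordinate space
        back = torch.bmm(nodes, proj).view(b, -1, h, w)
        return x + self.out(back)                       # residual, plug-in friendly
```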
155. Style Separation and Synthesis via Generative Adversarial Networks
- Author
Zhang, Rui, Tang, Sheng, Li, Yu, Guo, Junbo, Zhang, Yongdong, Li, Jintao, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Style synthesis has attracted great interest recently, while few works focus on its dual problem, "style separation". In this paper, we propose the Style Separation and Synthesis Generative Adversarial Network (S3-GAN) to simultaneously implement style separation and style synthesis on object photographs of specific categories. Based on the assumption that the object photographs lie on a manifold, and that the contents and styles are independent, we employ S3-GAN to build mappings between the manifold and a latent vector space for separating and synthesizing the contents and styles. The S3-GAN consists of an encoder network, a generator network, and an adversarial network. The encoder network performs style separation by mapping an object photograph to a latent vector. Two halves of the latent vector represent the content and style, respectively. The generator network performs style synthesis by taking a concatenated vector as input. The concatenated vector contains the style half-vector of the style target image and the content half-vector of the content target image. Once the images are obtained from the generator network, an adversarial network is imposed to generate more photo-realistic images. Experiments on the CelebA and UT Zappos 50K datasets demonstrate that the S3-GAN has the capacity for style separation and synthesis simultaneously, and can capture various styles in a single model.
- Published
- 2018
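The half-vector bookkeeping described above can be made concrete in a few lines. The sketch below assumes hypothetical `encoder` and `generator` callables standing in for the paper's networks; only the split-and-concatenate logic comes from the abstract.

```python
import torch

def swap_styles(encoder, generator, content_img, style_img):
    """Sketch of the S3-GAN separation/synthesis step (hypothetical networks)."""
    z_content = encoder(content_img)   # latent: first half content, second half style
    z_style = encoder(style_img)
    half = z_content.shape[1] // 2
    # content half from the content target, style half from the style target
    z_mixed = torch.cat([z_content[:, :half], z_style[:, half:]], dim=1)
    return generator(z_mixed)          # synthesized image with swapped style
```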
156. $A^2$-Nets: Double Attention Networks
- Author
Chen, Yunpeng, Kalantidis, Yannis, Li, Jianshu, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Learning to capture long-range relations is fundamental to image/video recognition. Existing CNN models generally rely on increasing depth to model such relations, which is highly inefficient. In this work, we propose the "double attention block", a novel component that aggregates and propagates informative global features from the entire spatio-temporal space of input images/videos, enabling subsequent convolution layers to access features from the entire space efficiently. The component is designed with a double attention mechanism in two steps, where the first step gathers features from the entire space into a compact set through second-order attention pooling and the second step adaptively selects and distributes features to each location via another attention. The proposed double attention block is easy to adopt and can be plugged into existing deep neural networks conveniently. We conduct extensive ablation studies and experiments on both image and video recognition tasks to evaluate its performance. On the image recognition task, a ResNet-50 equipped with our double attention blocks outperforms a much larger ResNet-152 architecture on the ImageNet-1k dataset with over 40% fewer parameters and fewer FLOPs. On the action recognition task, our proposed model achieves state-of-the-art results on the Kinetics and UCF-101 datasets with significantly higher efficiency than recent works., Comment: Accepted at NIPS 2018
- Published
- 2018
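A minimal PyTorch sketch of the two-step mechanism (second-order attention pooling, then adaptive distribution) might look as follows; the single-frame 2D form and the descriptor counts `m` and `n` are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoubleAttention(nn.Module):
    """Sketch of a double attention block: gather, then distribute."""
    def __init__(self, in_channels, m=64, n=64):
        super().__init__()
        self.to_feats = nn.Conv2d(in_channels, m, 1)
        self.gather_attn = nn.Conv2d(in_channels, n, 1)
        self.distrib_attn = nn.Conv2d(in_channels, n, 1)
        self.out = nn.Conv2d(m, in_channels, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        feats = self.to_feats(x).flatten(2)                       # B x m x HW
        gather = F.softmax(self.gather_attn(x).flatten(2), -1)    # attention over positions
        # step 1: second-order attention pooling into n global descriptors
        globals_ = torch.bmm(feats, gather.transpose(1, 2))       # B x m x n
        distrib = F.softmax(self.distrib_attn(x).flatten(2), 1)   # B x n x HW, softmax over n
        # step 2: each location adaptively selects from the descriptor set
        z = torch.bmm(globals_, distrib).view(b, -1, h, w)        # B x m x H x W
        return x + self.out(z)
```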
157. Look Across Elapse: Disentangled Representation Learning and Photorealistic Cross-Age Face Synthesis for Age-Invariant Face Recognition
- Author
Zhao, Jian, Cheng, Yu, Cheng, Yi, Yang, Yang, Lan, Haochong, Zhao, Fang, Xiong, Lin, Xu, Yan, Li, Jianshu, Pranata, Sugiri, Shen, Shengmei, Xing, Junliang, Liu, Hengzhu, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
Despite the remarkable progress in face recognition related technologies, reliably recognizing faces across ages remains a big challenge. The appearance of a human face changes substantially over time, resulting in significant intra-class variations. As opposed to current techniques for age-invariant face recognition, which either directly extract age-invariant features for recognition, or first synthesize a face that matches the target age before feature extraction, we argue that it is more desirable to perform both tasks jointly so that they can leverage each other. To this end, we propose a deep Age-Invariant Model (AIM) for face recognition in the wild with three distinct novelties. First, AIM presents a novel unified deep architecture jointly performing cross-age face synthesis and recognition in a mutually boosting way. Second, AIM achieves continuous face rejuvenation/aging with remarkable photorealistic and identity-preserving properties, avoiding the requirement of paired data and the true age of testing samples. Third, we develop effective and novel training strategies for end-to-end learning of the whole deep architecture, which generates powerful age-invariant face representations explicitly disentangled from the age variation. Moreover, we propose a new large-scale Cross-Age Face Recognition (CAFR) benchmark dataset to facilitate existing efforts and push the frontiers of age-invariant face recognition research. Extensive experiments on both our CAFR and several other cross-age datasets (MORPH, CACD and FG-NET) demonstrate the superiority of the proposed AIM model over the state of the art. Benchmarking our model on one of the most popular unconstrained face recognition datasets, IJB-C, additionally verifies the promising generalizability of AIM in recognizing faces in the wild.
- Published
- 2018
158. Multi-Fiber Networks for Video Recognition
- Author
Chen, Yunpeng, Kalantidis, Yannis, Li, Jianshu, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
In this paper, we aim to reduce the computational cost of spatio-temporal deep neural networks, making them run as fast as their 2D counterparts while preserving state-of-the-art accuracy on video recognition benchmarks. To this end, we present the novel Multi-Fiber architecture that slices a complex neural network into an ensemble of lightweight networks, or fibers, that run through the network. To facilitate information flow between fibers, we further incorporate multiplexer modules and end up with an architecture that reduces the computational cost of 3D networks by an order of magnitude, while increasing recognition performance at the same time. Extensive experimental results show that our multi-fiber architecture significantly boosts the efficiency of existing convolution networks for both image and video recognition tasks, achieving state-of-the-art performance on the UCF-101, HMDB-51 and Kinetics datasets. Our proposed model requires over 9x and 13x fewer computations than the I3D and R(2+1)D models, respectively, yet provides higher accuracy., Comment: ECCV 2018, Code is on GitHub
- Published
- 2018
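One way to read the fiber-plus-multiplexer design is as a grouped convolution (isolated fibers) wrapped with cheap 1x1 convolutions that route information across fibers. The 2D sketch below is an illustrative reduction of the 3D video architecture, with assumed widths, not the paper's exact unit.

```python
import torch.nn as nn

class MultiFiberUnit(nn.Module):
    """Sketch: fibers as a grouped conv, 1x1 multiplexer convs routing between them.

    Assumes `channels` is divisible by both `num_fibers` and 4.
    """
    def __init__(self, channels, num_fibers=8):
        super().__init__()
        self.multiplex_in = nn.Conv2d(channels, channels // 4, 1)   # mix across fibers
        self.multiplex_out = nn.Conv2d(channels // 4, channels, 1)
        # the ensemble of lightweight fibers: each group sees only its own slice
        self.fibers = nn.Conv2d(channels, channels, 3, padding=1, groups=num_fibers)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        routed = self.multiplex_out(self.multiplex_in(x))  # cheap cross-fiber flow
        return self.act(self.bn(self.fibers(x + routed)) + x)
```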
159. Exact Low Tubal Rank Tensor Recovery from Gaussian Measurements
- Author
Lu, Canyi, Feng, Jiashi, Lin, Zhouchen, and Yan, Shuicheng
- Subjects
Statistics - Machine Learning, Computer Science - Learning
- Abstract
The recently proposed Tensor Nuclear Norm (TNN) [Lu et al., 2016; 2018a] is an interesting convex penalty induced by the tensor SVD [Kilmer and Martin, 2011]. It plays a role similar to the matrix nuclear norm, which is the convex surrogate of the matrix rank. Considering that the TNN based Tensor Robust PCA [Lu et al., 2018a] is an elegant extension of Robust PCA with a similar tight recovery bound, it is natural to solve other low rank tensor recovery problems extended from the matrix cases. However, the extensions and proofs are generally tedious. The general atomic norm provides a unified view of norms induced by low-complexity structures, e.g., the $\ell_1$-norm and the nuclear norm. Sharp estimates of the required number of generic measurements for exact recovery based on the atomic norm are known in the literature. In this work, with a careful choice of the atomic set, we prove that TNN is a special atomic norm. Then by computing the Gaussian width of a certain cone which is necessary for the sharp estimate, we achieve a simple bound for guaranteed low tubal rank tensor recovery from Gaussian measurements. Specifically, we show that by solving a TNN minimization problem, the underlying tensor of size $n_1\times n_2\times n_3$ with tubal rank $r$ can be exactly recovered when the given number of Gaussian measurements is $O(r(n_1+n_2-r)n_3)$. It is order optimal when compared with the degrees of freedom $r(n_1+n_2-r)n_3$. Beyond the Gaussian mapping, we also give the recovery guarantee of tensor completion based on the uniform random mapping by TNN minimization. Numerical experiments verify our theoretical results., Comment: International Joint Conference on Artificial Intelligence (IJCAI), 2018
- Published
- 2018
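The TNN at the heart of this result has a short computational definition: transform the tensor along its third mode with the FFT and average the matrix nuclear norms of the frontal slices. A minimal numpy sketch, following the 1/n3 scaling convention in Lu et al.:

```python
import numpy as np

def tensor_nuclear_norm(A):
    """TNN induced by the t-SVD: mean of slice-wise nuclear norms in the Fourier domain."""
    n3 = A.shape[2]
    A_fft = np.fft.fft(A, axis=2)   # block-diagonalizes the t-product
    tnn = sum(np.linalg.norm(A_fft[:, :, i], ord='nuc') for i in range(n3))
    return tnn / n3

# Per the abstract: a tensor of size n1 x n2 x n3 with tubal rank r is exactly
# recoverable from O(r (n1 + n2 - r) n3) Gaussian measurements via TNN minimization.
```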
160. Subspace Clustering by Block Diagonal Representation
- Author
Lu, Canyi, Feng, Jiashi, Lin, Zhouchen, Mei, Tao, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
This paper studies the subspace clustering problem. Given some data points approximately drawn from a union of subspaces, the goal is to group these data points into their underlying subspaces. Many subspace clustering methods have been proposed, among which sparse subspace clustering and low-rank representation are two representative ones. Despite the different motivations, we observe that many existing methods share the common block diagonal property, which possibly leads to correct clustering, yet with their proofs given case by case. In this work, we first consider a general formulation and provide a unified theoretical guarantee of the block diagonal property; the block diagonal property of many existing methods falls into our result as a special case. Second, we observe that many existing methods approximate the block diagonal representation matrix by using different structure priors, e.g., sparsity and low-rankness, which are indirect. We propose the first block diagonal matrix induced regularizer for directly pursuing the block diagonal matrix. With this regularizer, we solve the subspace clustering problem by Block Diagonal Representation (BDR), which uses the block diagonal structure prior. The BDR model is nonconvex, and we propose an alternating minimization solver and prove its convergence. Experiments on real datasets demonstrate the effectiveness of BDR.
- Published
- 2018
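The block diagonal structure prior mentioned above has a clean spectral characterization that such a regularizer can build on: a symmetric nonnegative affinity matrix has k connected components (i.e., is k-block diagonal up to permutation) exactly when the k smallest eigenvalues of its Laplacian are zero. A small numpy illustration of that quantity; the direct use as a regularizer value is a simplification of the paper's formulation.

```python
import numpy as np

def block_diagonal_regularizer(B, k):
    """Sum of the k smallest Laplacian eigenvalues of affinity B (symmetric, >= 0).

    Zero exactly when B is k-block diagonal up to permutation.
    """
    L = np.diag(B.sum(axis=1)) - B      # graph Laplacian of the affinity
    eigvals = np.linalg.eigvalsh(L)     # ascending eigenvalues
    return eigvals[:k].sum()
```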
161. Tensor Robust Principal Component Analysis with A New Tensor Nuclear Norm
- Author
Lu, Canyi, Feng, Jiashi, Chen, Yudong, Liu, Wei, Lin, Zhouchen, and Yan, Shuicheng
- Subjects
Statistics - Machine Learning, Computer Science - Machine Learning
- Abstract
In this paper, we consider the Tensor Robust Principal Component Analysis (TRPCA) problem, which aims to exactly recover the low-rank and sparse components from their sum. Our model is based on the recently proposed tensor-tensor product (or t-product). Induced by the t-product, we first rigorously deduce the tensor spectral norm, tensor nuclear norm, and tensor average rank, and show that the tensor nuclear norm is the convex envelope of the tensor average rank within the unit ball of the tensor spectral norm. These definitions, their relationships and properties are consistent with matrix cases. Equipped with the new tensor nuclear norm, we then solve the TRPCA problem by solving a convex program and provide the theoretical guarantee for the exact recovery. Our TRPCA model and recovery guarantee include matrix RPCA as a special case. Numerical experiments verify our results, and the applications to image recovery and background modeling problems demonstrate the effectiveness of our method., Comment: arXiv admin note: text overlap with arXiv:1708.04181
- Published
- 2018
162. Understanding Humans in Crowded Scenes: Deep Nested Adversarial Learning and A New Benchmark for Multi-Human Parsing
- Author
Zhao, Jian, Li, Jianshu, Cheng, Yu, Zhou, Li, Sim, Terence, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Despite the noticeable progress in perceptual tasks like detection, instance segmentation and human parsing, computers still perform unsatisfactorily at visually understanding humans in crowded scenes, which matters for group behavior analysis, person re-identification, autonomous driving, etc. To this end, models need to comprehensively perceive the semantic information and the differences between instances in a multi-human image, which is recently defined as the multi-human parsing task. In this paper, we present a new large-scale database, "Multi-Human Parsing (MHP)", for algorithm development and evaluation, and advance the state-of-the-art in understanding humans in crowded scenes. MHP contains 25,403 elaborately annotated images with 58 fine-grained semantic category labels, involving 2-26 persons per image, captured in real-world scenes with various viewpoints, poses, occlusions, interactions and backgrounds. We further propose a novel deep Nested Adversarial Network (NAN) model for multi-human parsing. NAN consists of three Generative Adversarial Network (GAN)-like sub-nets, respectively performing semantic saliency prediction, instance-agnostic parsing and instance-aware clustering. These sub-nets form a nested structure and are carefully designed to learn jointly in an end-to-end way. NAN consistently outperforms existing state-of-the-art solutions on our MHP and several other datasets, and serves as a strong baseline to drive future research on multi-human parsing., Comment: The first three authors are with equal contributions
- Published
- 2018
163. Multi-Oriented Scene Text Detection via Corner Localization and Region Segmentation
- Author
Lyu, Pengyuan, Yao, Cong, Wu, Wenhao, Yan, Shuicheng, and Bai, Xiang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Previous deep learning based state-of-the-art scene text detection methods can be roughly classified into two categories. The first category treats scene text as a type of general object and follows the general object detection paradigm to localize scene text by regressing the text box locations, but is troubled by the arbitrary orientations and large aspect ratios of scene text. The second one segments text regions directly, but mostly needs complex post-processing. In this paper, we present a method that combines the ideas of the two types of methods while avoiding their shortcomings. We propose to detect scene text by localizing corner points of text bounding boxes and segmenting text regions in relative positions. In the inference stage, candidate boxes are generated by sampling and grouping corner points, which are further scored by segmentation maps and suppressed by NMS. Compared with previous methods, our method can handle long oriented text naturally and does not need complex post-processing. Experiments on ICDAR2013, ICDAR2015, MSRA-TD500, MLT and COCO-Text demonstrate that the proposed algorithm achieves better or comparable results in both accuracy and efficiency. Based on VGG16, it achieves an F-measure of 84.3% on ICDAR2015 and 81.5% on MSRA-TD500., Comment: To appear in CVPR2018
- Published
- 2018
164. Face Aging with Contextual Generative Adversarial Nets
- Author
Liu, Si, Sun, Yao, Zhu, Defa, Bao, Renda, Wang, Wei, Shu, Xiangbo, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Face aging, which renders aging faces for an input face, has attracted extensive attention in multimedia research. Recently, several conditional Generative Adversarial Nets (GANs) based methods have achieved great success. They can generate images fitting the real face distributions conditioned on each individual age group. However, these methods fail to capture the transition patterns, e.g., the gradual shape and texture changes between adjacent age groups. In this paper, we propose novel Contextual Generative Adversarial Nets (C-GANs) to specifically take this into consideration. The C-GANs consist of a conditional transformation network and two discriminative networks. The conditional transformation network imitates the aging procedure with several specially designed residual blocks. The age discriminative network guides the synthesized face to fit the real conditional distribution. The transition pattern discriminative network is novel, aiming to distinguish the real transition patterns from the fake ones. It serves as an extra regularization term for the conditional transformation network, ensuring that the generated image pairs fit the corresponding real transition pattern distribution. Experimental results demonstrate that the proposed framework produces appealing results compared with the state-of-the-art and ground truth. We also observe a performance gain for cross-age face verification., Comment: accepted at ACM Multimedia 2017
- Published
- 2018
165. BT-Nets: Simplifying Deep Neural Networks via Block Term Decomposition
- Author
Li, Guangxi, Ye, Jinmian, Yang, Haiqin, Chen, Di, Yan, Shuicheng, and Xu, Zenglin
- Subjects
Statistics - Machine Learning, Computer Science - Learning
- Abstract
Recently, deep neural networks (DNNs) have been regarded as the state-of-the-art classification methods in a wide range of applications, especially in image classification. Despite this success, the huge number of parameters blocks their deployment in situations with light computing resources. Researchers resort to the redundancy in the weights of DNNs and attempt to find how few parameters can be chosen while preserving accuracy at the same time. Although several promising results have been shown along this research line, most existing methods either fail to significantly compress a well-trained deep network or require a heavy fine-tuning process for the compressed network to regain the original performance. In this paper, we propose the Block Term networks (BT-nets), in which the commonly used fully-connected layers (FC-layers) are replaced with block term layers (BT-layers). In BT-layers, the inputs and the outputs are reshaped into two low-dimensional high-order tensors, and then block-term decomposition is applied as a tensor operator to connect them. We conduct extensive experiments on benchmark datasets to demonstrate that BT-layers can achieve a very large compression ratio on the number of parameters while preserving the representation power of the original FC-layers as much as possible. Specifically, we can achieve higher performance while requiring fewer parameters compared with the tensor-train method.
- Published
- 2017
166. Weaving Multi-scale Context for Single Shot Detector
- Author
Chen, Yunpeng, Li, Jianshu, Zhou, Bin, Feng, Jiashi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Aggregating context information from multiple scales has proven effective for improving the accuracy of Single Shot Detectors (SSDs) on object detection. However, existing multi-scale context fusion techniques are computationally expensive, which unfavorably diminishes the advantageous speed of SSD. In this work, we propose a novel network topology, called WeaveNet, that can efficiently fuse multi-scale information and boost the detection accuracy with negligible extra cost. The proposed WeaveNet iteratively weaves context information from adjacent scales together to enable more sophisticated context reasoning while maintaining fast speed. Built by stacking light-weight blocks, WeaveNet is easy to train without requiring batch normalization and can be further accelerated by our proposed architecture simplification. Experimental results on the PASCAL VOC 2007 and PASCAL VOC 2012 benchmarks show a significant performance boost brought by WeaveNet. For a 320x320 input with batch size 8, WeaveNet reaches 79.5% mAP on the PASCAL VOC 2007 test set at 101 fps with only 4 fps extra cost, and further improves to 79.7% mAP with more iterations.
- Published
- 2017
167. Nonconvex Sparse Spectral Clustering by Alternating Direction Method of Multipliers and Its Convergence Analysis
- Author
Lu, Canyi, Feng, Jiashi, Lin, Zhouchen, and Yan, Shuicheng
- Subjects
Computer Science - Learning
- Abstract
Spectral Clustering (SC) is a widely used data clustering method which first learns a low-dimensional embedding $U$ of data by computing the eigenvectors of the normalized Laplacian matrix, and then performs k-means on $U^\top$ to get the final clustering result. The Sparse Spectral Clustering (SSC) method extends SC with a sparse regularization on $UU^\top$ by using the block diagonal structure prior of $UU^\top$ in the ideal case. However, encouraging $UU^\top$ to be sparse leads to a heavily nonconvex problem which is challenging to solve, and the work (Lu, Yan, and Lin 2016) proposes a convex relaxation in pursuit of this aim indirectly. However, the convex relaxation generally leads to a loose approximation and the quality of the solution is not clear. This work instead considers solving the nonconvex formulation of SSC which directly encourages $UU^\top$ to be sparse. We propose an efficient Alternating Direction Method of Multipliers (ADMM) to solve the nonconvex SSC and provide the convergence guarantee. In particular, we prove that the sequence generated by ADMM always has a limit point and any limit point is a stationary point. Our analysis does not impose any assumptions on the iterates and thus is practical. Our proposed ADMM for nonconvex problems allows the stepsize to be increasing but upper bounded, and this makes it very efficient in practice. Experimental analysis on several real data sets verifies the effectiveness of our method., Comment: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). 2018
- Published
- 2017
168. WSNet: Compact and Efficient Networks Through Weight Sampling
- Author
Jin, Xiaojie, Yang, Yingzhen, Xu, Ning, Yang, Jianchao, Jojic, Nebojsa, Feng, Jiashi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Neural and Evolutionary Computing, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Abstract
We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks. Existing approaches conventionally learn full model parameters independently and then compress them via ad hoc processing such as model pruning or filter factorization. Alternatively, WSNet proposes learning model parameters by sampling from a compact set of learnable parameters, which naturally enforces parameter sharing throughout the learning process. We demonstrate that such a novel weight sampling approach (and the induced WSNet) promotes both weight and computation sharing favorably. By employing this method, we can more efficiently learn much smaller networks with competitive performance compared to baseline networks with equal numbers of convolution filters. Specifically, we consider learning compact and efficient 1D convolutional neural networks for audio classification. Extensive experiments on multiple audio classification datasets verify the effectiveness of WSNet. Combined with weight quantization, the resulting models are up to 180 times smaller and theoretically up to 16 times faster than the well-established baselines, without noticeable performance drop., Comment: To appear at ICML 2018
- Published
- 2017
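The weight sampling idea can be sketched in a few lines: instead of independent filters, every filter of a 1D convolution is read as a window from one shared parameter vector. The stride-1 overlapping-window scheme below is one illustrative choice among the sampling strategies the paper explores, not its exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampledConv1d(nn.Module):
    """WSNet-style sketch: filters sampled as overlapping windows of one vector."""
    def __init__(self, in_channels, out_channels, kernel_size, stride=1):
        super().__init__()
        self.shape = (out_channels, in_channels, kernel_size)
        # compact set: far fewer parameters than out*in*kernel independent weights
        n_params = out_channels + in_channels * kernel_size - 1
        self.params = nn.Parameter(torch.randn(n_params) * 0.02)
        self.stride = stride

    def forward(self, x):
        size = self.shape[1] * self.shape[2]
        # filter i = params[i : i + in_channels*kernel_size], reshaped
        weight = torch.stack([self.params[i:i + size] for i in range(self.shape[0])])
        return F.conv1d(x, weight.view(self.shape), stride=self.stride)
```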
169. HashGAN: Attention-aware Deep Adversarial Hashing for Cross Modal Retrieval
- Author
Zhang, Xi, Zhou, Siyu, Feng, Jiashi, Lai, Hanjiang, Li, Bo, Pan, Yan, Yin, Jian, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
With the rapid growth of multi-modal data, hashing methods for cross-modal retrieval have received considerable attention. Deep-networks-based cross-modal hashing methods are appealing as they can integrate feature learning and hash coding into end-to-end trainable frameworks. However, it is still challenging to find content similarities between different modalities of data due to the heterogeneity gap. To further address this problem, we propose an adversarial hashing network with an attention mechanism to enhance the measurement of content similarities by selectively focusing on the informative parts of multi-modal data. The proposed new adversarial network, HashGAN, consists of three building blocks: 1) the feature learning module to obtain feature representations, 2) the generative attention module to generate an attention mask, which is used to obtain the attended (foreground) and the unattended (background) feature representations, 3) the discriminative hash coding module to learn hash functions that preserve the similarities between different modalities. In our framework, the generative module and the discriminative module are trained in an adversarial way: the generator is trained so that the discriminator cannot preserve the similarities of multi-modal data w.r.t. the background feature representations, while the discriminator aims to preserve the similarities of multi-modal data w.r.t. both the foreground and the background feature representations. Extensive evaluations on several benchmark datasets demonstrate that the proposed HashGAN brings substantial improvements over other state-of-the-art cross-modal hashing methods., Comment: 10 pages, 8 figures, 3 tables
- Published
- 2017
170. Personalized and Occupational-aware Age Progression by Generative Adversarial Networks
- Author
Zhou, Siyu, Zhao, Weiqiang, Feng, Jiashi, Lai, Hanjiang, Pan, Yan, Yin, Jian, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Face age progression, which aims to predict future looks, is important for various applications and has received considerable attention. Existing methods and datasets are limited in exploring the effects of occupations, which may influence personal appearance. In this paper, we first introduce an occupational face aging dataset for studying the influence of occupations on appearance. It includes five occupations, which enables the development of new algorithms for age progression and facilitates future research. Second, we propose a new occupational-aware adversarial face aging network, which learns the human aging process under different occupations. Two factors are taken into consideration in our aging process: personality preservation and visually plausible texture change for different occupations. We propose a personalized network with a personalized loss in a deep autoencoder network for keeping personalized facial characteristics, and an occupational-aware adversarial network with an occupational-aware adversarial loss for obtaining more realistic texture changes. Experimental results demonstrate the advantages of the proposed method in comparison with other state-of-the-art age progression methods., Comment: 9 pages, 10 figures
- Published
- 2017
171. Integrated Face Analytics Networks through Cross-Dataset Hybrid Training
- Author
Li, Jianshu, Xiao, Shengtao, Zhao, Fang, Zhao, Jian, Li, Jianan, Feng, Jiashi, Yan, Shuicheng, and Sim, Terence
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Face analytics benefits many multimedia applications. It consists of a number of tasks, such as facial emotion recognition and face parsing, and most existing approaches generally treat these tasks independently, which limits their deployment in real scenarios. In this paper we propose an integrated Face Analytics Network (iFAN), which is able to perform multiple tasks jointly for face analytics with a novel, carefully designed network architecture that fully facilitates the informative interaction among different tasks. The proposed integrated network explicitly models the interactions between tasks so that the correlations between tasks can be fully exploited for a performance boost. In addition, to solve the bottleneck of the absence of datasets with comprehensive training data for various tasks, we propose a novel cross-dataset hybrid training strategy. It allows "plug-and-play" of multiple datasets annotated for different tasks without the requirement of a fully labeled common dataset for all the tasks. We experimentally show that the proposed iFAN achieves state-of-the-art performance on multiple face analytics tasks using a single integrated model. Specifically, iFAN achieves an overall F-score of 91.15% on the Helen dataset for face parsing, a normalized mean error of 5.81% on the MTFL dataset for facial landmark localization and an accuracy of 45.73% on the BNU dataset for emotion recognition with a single model., Comment: 10 pages
- Published
- 2017
172. Predicting Scene Parsing and Motion Dynamics in the Future
- Author
Jin, Xiaojie, Xiao, Huaxin, Shen, Xiaohui, Yang, Jimei, Lin, Zhe, Chen, Yunpeng, Jie, Zequn, Feng, Jiashi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
The ability to predict the future is important for intelligent systems, e.g. autonomous vehicles and robots, to plan early and make decisions accordingly. Future scene parsing and optical flow estimation are two key tasks that help agents better understand their environments, as the former provides dense semantic information, i.e. what objects will be present and where they will appear, while the latter provides dense motion information, i.e. how the objects will move. In this paper, we propose a novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames. To the best of our knowledge, this is the first attempt at jointly predicting scene parsing and motion dynamics. In particular, scene parsing enables structured motion prediction by decomposing optical flow into different groups, while optical flow estimation brings reliable pixel-wise correspondence to scene parsing. By exploiting this mutually beneficial relationship, our model shows significantly better parsing and motion prediction results when compared to well-established baselines and individual prediction models on the large-scale Cityscapes dataset. In addition, we also demonstrate that our model can be used to predict the steering angle of vehicles, which further verifies the ability of our model to learn latent representations of scene dynamics., Comment: To appear in NIPS 2017
- Published
- 2017
173. Learning to Segment Human by Watching YouTube
- Author
Liang, Xiaodan, Wei, Yunchao, Lin, Liang, Chen, Yunpeng, Shen, Xiaohui, Yang, Jianchao, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
An intuition on human segmentation is that when a human is moving in a video, the video context (e.g., appearance and motion clues) may potentially infer reasonable mask information for the whole human body. Inspired by this, based on popular deep convolutional neural networks (CNNs), we explore a very-weakly supervised learning framework for the human segmentation task, where only an imperfect human detector is available along with massive weakly-labeled YouTube videos. In our solution, the video-context guided human mask inference and the CNN based segmentation network learning iterate to mutually enhance each other until no further improvement is gained. In the first step, each video is decomposed into supervoxels by unsupervised video segmentation. The superpixels within the supervoxels are then classified as human or non-human by graph optimization with unary energies from the imperfect human detection results and the confidence maps predicted by the CNN trained in the previous iteration. In the second step, the video-context derived human masks are used as direct labels to train the CNN. Extensive experiments on the challenging PASCAL VOC 2012 semantic segmentation benchmark demonstrate that the proposed framework has already achieved superior results to all previous weakly-supervised methods with object class or bounding box annotations. In addition, by augmenting with the annotated masks from PASCAL VOC 2012, our method reaches a new state-of-the-art performance on the human segmentation task., Comment: Very-weakly supervised learning framework. New state-of-the-art performance on the human segmentation task! (Published in T-PAMI 2017)
- Published
- 2017
174. Deep Sparse Subspace Clustering
- Author
Peng, Xi, Feng, Jiashi, Xiao, Shijie, Lu, Jiwen, Yi, Zhang, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
In this paper, we present a deep extension of Sparse Subspace Clustering, termed Deep Sparse Subspace Clustering (DSSC). Regularized by the unit sphere distribution assumption for the learned deep features, DSSC can infer a new data affinity matrix by simultaneously satisfying the sparsity principle of SSC and the nonlinearity given by neural networks. One of the appealing advantages brought by DSSC is: when original real-world data do not meet the class-specific linear subspace distribution assumption, DSSC can employ neural networks to make the assumption valid with its hierarchical nonlinear transformations. To the best of our knowledge, this is among the first deep learning based subspace clustering methods. Extensive experiments are conducted on four real-world datasets to show the proposed DSSC is significantly superior to 12 existing methods for subspace clustering., Comment: The initial version is completed at the beginning of 2015
- Published
- 2017
175. Meta Networks for Neural Style Transfer
- Author
Shen, Falong, Yan, Shuicheng, and Zeng, Gang
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
In this paper we propose a new method to obtain the specified network parameters through a one-time feed-forward propagation of a meta network, and explore its application to neural style transfer. Recent works on style transfer typically need to train an image transformation network for every new style, where the style is encoded in the network parameters by enormous iterations of stochastic gradient descent. To tackle these issues, we build a meta network which takes in the style image and directly produces a corresponding image transformation network. Compared with optimization-based methods for every style, our meta network can handle an arbitrary new style within 19 ms on one modern GPU card. The fast image transformation network generated by our meta network is only 449 KB, which is capable of real-time execution on a mobile device. We also investigate the manifold of the style transfer networks by operating on the hidden features from the meta network. Experiments have well validated the effectiveness of our method. Code and trained models have been released at https://github.com/FalongShen/styletransfer.
- Published
- 2017
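The one-pass idea is essentially a hypernetwork: a meta network maps the style image to the parameters of a transformation network. The sketch below shrinks the generated network to a single convolution and uses a placeholder feature extractor; both are assumptions for illustration, far smaller than the paper's generated network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaNet(nn.Module):
    """Sketch: one forward pass over the style image emits conv weights."""
    def __init__(self, style_dim=128, channels=3, k=3):
        super().__init__()
        self.w_shape = (channels, channels, k, k)
        n_w = channels * channels * k * k
        self.embed = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                   nn.Linear(3 * 8 * 8, style_dim), nn.ReLU())
        self.to_weight = nn.Linear(style_dim, n_w)     # emits conv weights
        self.to_bias = nn.Linear(style_dim, channels)

    def forward(self, style_img, content_img):
        h = self.embed(style_img)                      # style embedding
        # for simplicity, apply the weights generated from the first style in the batch
        weight = self.to_weight(h).view(-1, *self.w_shape)[0]
        bias = self.to_bias(h)[0]
        # the "generated transformation network" applied to the content image
        return F.conv2d(content_img, weight, bias, padding=1)
```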
176. Discriminative Similarity for Clustering and Semi-Supervised Learning
- Author
Yang, Yingzhen, Liang, Feng, Jojic, Nebojsa, Yan, Shuicheng, Feng, Jiashi, and Huang, Thomas S.
- Subjects
Statistics - Machine Learning, Computer Science - Learning
- Abstract
Similarity-based clustering and semi-supervised learning methods separate the data into clusters or classes according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose a novel discriminative similarity learning framework which learns discriminative similarity for either data clustering or semi-supervised learning. The proposed framework learns a classifier from each hypothetical labeling, and searches for the optimal labeling by minimizing the generalization error of the learned classifiers associated with the hypothetical labeling. A kernel classifier is employed in our framework. By generalization analysis via Rademacher complexity, the generalization error bound for the kernel classifier learned from a hypothetical labeling is expressed as the sum of pairwise similarity between the data from different classes, parameterized by the weights of the kernel classifier. Such pairwise similarity serves as the discriminative similarity for the purpose of clustering and semi-supervised learning, and discriminative similarity of a similar form can also be induced by the integrated squared error bound for kernel density classification. Based on the discriminative similarity induced by the kernel classifier, we propose new clustering and semi-supervised learning methods.
- Published
- 2017
177. Learning with Rethinking: Recurrently Improving Convolutional Neural Networks through Feedback
- Author
Li, Xin, Jie, Zequn, Feng, Jiashi, Liu, Changsong, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Recent years have witnessed the great success of convolutional neural network (CNN) based models in the field of computer vision. CNN is able to learn hierarchically abstracted features from images in an end-to-end training manner. However, most of the existing CNN models only learn features through a feedforward structure and no feedback information from top to bottom layers is exploited to enable the networks to refine themselves. In this paper, we propose a "Learning with Rethinking" algorithm. By adding a feedback layer and producing the emphasis vector, the model is able to recurrently boost the performance based on previous prediction. Particularly, it can be employed to boost any pre-trained models. This algorithm is tested on four object classification benchmark datasets: CIFAR-100, CIFAR-10, MNIST-background-image and ILSVRC-2012 dataset. These results have demonstrated the advantage of training CNN models with the proposed feedback mechanism.
- Published
- 2017
178. Tensor Robust Principal Component Analysis: Exact Recovery of Corrupted Low-Rank Tensors via Convex Optimization
- Author
Lu, Canyi, Feng, Jiashi, Chen, Yudong, Liu, Wei, Lin, Zhouchen, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
This paper studies the Tensor Robust Principal Component (TRPCA) problem, which extends the known Robust PCA (Candes et al. 2011) to the tensor case. Our model is based on a new tensor Singular Value Decomposition (t-SVD) (Kilmer and Martin 2011) and its induced tensor tubal rank and tensor nuclear norm. Consider that we have a 3-way tensor ${\mathcal{X}}\in\mathbb{R}^{n_1\times n_2\times n_3}$ such that ${\mathcal{X}}={\mathcal{L}}_0+{\mathcal{E}}_0$, where ${\mathcal{L}}_0$ has low tubal rank and ${\mathcal{E}}_0$ is sparse. Is it possible to recover both components? In this work, we prove that under certain suitable assumptions, we can recover both the low-rank and the sparse components exactly by simply solving a convex program whose objective is a weighted combination of the tensor nuclear norm and the $\ell_1$-norm, i.e., $\min_{{\mathcal{L}},\ {\mathcal{E}}} \ \|{{\mathcal{L}}}\|_*+\lambda\|{{\mathcal{E}}}\|_1, \ \text{s.t.} \ {\mathcal{X}}={\mathcal{L}}+{\mathcal{E}}$, where $\lambda= {1}/{\sqrt{\max(n_1,n_2)n_3}}$. Interestingly, TRPCA involves RPCA as a special case when $n_3=1$, and thus it is a simple and elegant tensor extension of RPCA. Numerical experiments verify our theory, and the application to image denoising demonstrates the effectiveness of our method., Comment: IEEE International Conference on Computer Vision and Pattern Recognition (CVPR, 2016)
- Published
- 2017
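The convex program quoted above has a standard ADMM solver built from two proximal maps: tensor singular value thresholding (the prox of the tensor nuclear norm, computed slice-wise in the Fourier domain) and soft-thresholding (the prox of the $\ell_1$-norm). A hedged numpy sketch; the step-size schedule and iteration count are illustrative, not the authors' exact settings.

```python
import numpy as np

def t_svt(Y, tau):
    """Tensor singular value thresholding: slice-wise SVT in the Fourier domain."""
    Yf = np.fft.fft(Y, axis=2)
    for i in range(Y.shape[2]):
        U, s, Vh = np.linalg.svd(Yf[:, :, i], full_matrices=False)
        Yf[:, :, i] = (U * np.maximum(s - tau, 0)) @ Vh
    return np.fft.ifft(Yf, axis=2).real

def trpca_admm(X, mu=1e-2, rho=1.1, iters=200):
    """ADMM sketch for: min ||L||_* + lambda ||E||_1  s.t.  X = L + E,
    with lambda = 1/sqrt(max(n1, n2) * n3) as stated in the abstract."""
    n1, n2, n3 = X.shape
    lam = 1.0 / np.sqrt(max(n1, n2) * n3)
    L, E, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
    for _ in range(iters):
        L = t_svt(X - E + Y / mu, 1.0 / mu)                    # TNN prox
        R = X - L + Y / mu
        E = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)   # l1 soft-thresholding
        Y += mu * (X - L - E)                                  # dual update
        mu = min(mu * rho, 1e10)                               # increasing step size
    return L, E
```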
179. FoveaNet: Perspective-aware Urban Scene Parsing
- Author
Li, Xin, Jie, Zequn, Wang, Wei, Liu, Changsong, Yang, Jimei, Shen, Xiaohui, Lin, Zhe, Chen, Qiang, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Parsing urban scene images benefits many applications, especially self-driving. Most current solutions employ generic image parsing models that treat all scales and locations in the images equally and do not consider the geometry property of car-captured urban scene images. Thus, they suffer from heterogeneous object scales caused by the perspective projection of cameras on actual scenes, and inevitably encounter parsing failures on distant objects as well as other boundary and recognition errors. In this work, we propose a new FoveaNet model to fully exploit the perspective geometry of scene images and address the common failures of generic parsing models. FoveaNet estimates the perspective geometry of a scene image through a convolutional network which integrates supportive evidence from contextual objects within the image. Based on the perspective geometry information, FoveaNet "undoes" the camera perspective projection, analyzing regions in the space of the actual scene, and thus provides much more reliable parsing results. Furthermore, to effectively address the recognition errors, FoveaNet introduces a new dense CRF model that takes the perspective geometry as a prior potential. We evaluate FoveaNet on two urban scene parsing datasets, Cityscapes and CamVid, which demonstrates that FoveaNet can outperform all the well-established baselines and provide new state-of-the-art performance.
- Published
- 2017
180. Neural Person Search Machines
- Author
Liu, Hao, Feng, Jiashi, Jie, Zequn, Jayashree, Karlekar, Zhao, Bo, Qi, Meibin, Jiang, Jianguo, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
We investigate the problem of person search in the wild in this work. Instead of comparing the query against all candidate regions generated in a query-blind manner, we propose to recursively shrink the search area from the whole image until achieving precise localization of the target person, by fully exploiting information from the query and contextual cues in every recursive search step. We develop the Neural Person Search Machines (NPSM) to implement such recursive localization for person search. Benefiting from its neural search mechanism, NPSM is able to selectively shrink its focus from a loose region to a tighter one containing the target automatically. In this process, NPSM employs an internal primitive memory component to memorize the query representation, which modulates the attention and augments its robustness to other distracting regions. Evaluations on two benchmark datasets, the CUHK-SYSU Person Search dataset and the PRW dataset, have demonstrated that our method can outperform the current state of the art in both mAP and top-1 evaluation protocols., Comment: ICCV2017 camera ready
- Published
- 2017
181. Dual Path Networks
- Author
Chen, Yunpeng, Li, Jianan, Xiao, Huaxin, Jin, Xiaojie, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of internal connection paths. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Connected Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new feature exploration, which are both important for learning good representations. To enjoy the benefits of both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImageNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over the state of the art. In particular, on the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101(64x4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single model performance with about 2 times faster training speed. Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications., Comment: for code and models, see https://github.com/cypw/DPNs
- Published
- 2017
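The dual path idea reduces to simple tensor bookkeeping: each block emits one slice that is added to a residual path (feature re-usage) and another that is concatenated to a dense path (new-feature exploration). A minimal PyTorch sketch with an assumed, simplified bottleneck:

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Sketch of one dual-path block; widths and bottleneck layout are illustrative."""
    def __init__(self, res_channels, dense_in, dense_growth):
        super().__init__()
        in_ch = res_channels + dense_in
        self.bottleneck = nn.Sequential(
            nn.Conv2d(in_ch, res_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(res_channels, res_channels + dense_growth, 3, padding=1))
        self.res_channels = res_channels

    def forward(self, res_path, dense_path):
        out = self.bottleneck(torch.cat([res_path, dense_path], dim=1))
        res_path = res_path + out[:, :self.res_channels]          # ResNet-style add
        dense_path = torch.cat([dense_path, out[:, self.res_channels:]], dim=1)  # DenseNet-style concat
        return res_path, dense_path
```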
182. Perceptual Generative Adversarial Networks for Small Object Detection
- Author
Li, Jianan, Liang, Xiaodan, Wei, Yunchao, Xu, Tingfa, Feng, Jiashi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Detecting small objects is notoriously challenging due to their low resolution and noisy representation. Existing object detection pipelines usually detect small objects by learning representations of all the objects at multiple scales. However, the performance gain of such ad hoc architectures is usually too limited to pay off the computational cost. In this work, we address the small object detection problem by developing a single architecture that internally lifts representations of small objects to "super-resolved" ones, achieving similar characteristics to large objects and thus making them more discriminative for detection. For this purpose, we propose a new Perceptual Generative Adversarial Network (Perceptual GAN) model that improves small object detection by narrowing the representation difference between small objects and large ones. Specifically, its generator learns to transfer perceived poor representations of small objects to super-resolved ones that are similar enough to real large objects to fool a competing discriminator. Meanwhile its discriminator competes with the generator to identify the generated representation and imposes an additional perceptual requirement on the generator: generated representations of small objects must be beneficial for the detection purpose. Extensive evaluations on the challenging Tsinghua-Tencent 100K and Caltech benchmarks well demonstrate the superiority of Perceptual GAN in detecting small objects, including traffic signs and pedestrians, over well-established state-of-the-art methods.
- Published
- 2017
183. Personalized Age Progression with Bi-level Aging Dictionary Learning
- Author
Shu, Xiangbo, Tang, Jinhui, Li, Zechao, Lai, Hanjiang, Zhang, Liyan, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Age progression is defined as aesthetically re-rendering the aging face at any future age for an individual face. In this work, we aim to automatically render aging faces in a personalized way. Basically, for each age group, we learn an aging dictionary to reveal its aging characteristics (e.g., wrinkles), where the dictionary bases corresponding to the same index yet from two neighboring aging dictionaries form a particular aging pattern across these two age groups, and a linear combination of all these patterns expresses a particular personalized aging process. Moreover, two factors are taken into consideration in the dictionary learning process. First, beyond the aging dictionaries, each person may have extra personalized facial characteristics, e.g. a mole, which are invariant in the aging process. Second, it is challenging or even impossible to collect faces of all age groups for a particular person, yet much easier and more practical to get face pairs from neighboring age groups. To this end, we propose a novel Bi-level Dictionary Learning based Personalized Age Progression (BDL-PAP) method. Here, bi-level dictionary learning is formulated to learn the aging dictionaries based on face pairs from neighboring age groups. Extensive experiments well demonstrate the advantages of the proposed BDL-PAP over other state-of-the-art methods in terms of personalized age progression, as well as the performance gain for cross-age face verification by synthesizing aging faces., Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence
- Published
- 2017
184. Generative Partition Networks for Multi-Person Pose Estimation
- Author
Nie, Xuecheng, Feng, Jiashi, Xing, Junliang, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
This paper proposes a new Generative Partition Network (GPN) to address the challenging multi-person pose estimation problem. Different from existing models that are either completely top-down or bottom-up, the proposed GPN introduces a novel strategy--it generates partitions for multiple persons from their global joint candidates and infers instance-specific joint configurations simultaneously. The GPN is favorably featured by low complexity and high accuracy of joint detection and re-organization. In particular, GPN designs a generative model that performs one feed-forward pass to efficiently generate robust person detections with joint partitions, relying on dense regressions from global joint candidates in an embedding space parameterized by centroids of persons. In addition, GPN formulates the inference procedure for joint configurations of human poses as a graph partition problem, and conducts local optimization for each person detection with reliable global affinity cues, leading to complexity reduction and performance improvement. GPN is implemented with the Hourglass architecture as the backbone network to simultaneously learn joint detector and dense regressor. Extensive experiments on benchmarks MPII Human Pose Multi-Person, extended PASCAL-Person-Part, and WAF, show the efficiency of GPN with new state-of-the-art performance.
- Published
- 2017
185. Multiple-Human Parsing in the Wild
- Author
Li, Jianshu, Zhao, Jian, Wei, Yunchao, Lang, Congyan, Li, Yidong, Sim, Terence, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
Human parsing is attracting increasing research attention. In this work, we aim to push the frontier of human parsing by introducing the problem of multi-human parsing in the wild. Existing works on human parsing mainly tackle single-person scenarios, which deviates from real-world applications where multiple persons are present simultaneously with interaction and occlusion. To address the multi-human parsing problem, we introduce a new multi-human parsing (MHP) dataset and a novel multi-human parsing model named MH-Parser. The MHP dataset contains multiple persons captured in real-world scenes with pixel-level fine-grained semantic annotations in an instance-aware setting. The MH-Parser generates global parsing maps and person instance masks simultaneously in a bottom-up fashion with the help of a new Graph-GAN model. We envision that the MHP dataset will serve as a valuable data resource to develop new multi-human parsing models, and the MH-Parser offers a strong baseline to drive future research for multi-human parsing in the wild., Comment: The first two authors are with equal contribution
- Published
- 2017
186. More is Less: A More Complicated Network with Less Inference Complexity
- Author
Dong, Xuanyi, Huang, Junshi, Yang, Yi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
In this paper, we present a novel and general network structure towards accelerating the inference process of convolutional neural networks, which is more complicated in network structure yet has less inference complexity. The core idea is to equip each original convolutional layer with another low-cost collaborative layer (LCCL), and the element-wise multiplication of the ReLU outputs of these two parallel layers produces the layer-wise output. The combined layer is potentially more discriminative than the original convolutional layer, and its inference is faster for two reasons: 1) the zero cells of the LCCL feature maps will remain zero after element-wise multiplication, and thus it is safe to skip the calculation of the corresponding high-cost convolution in the original convolutional layer; 2) the LCCL is very fast if it is implemented as a 1x1 convolution or only a single filter shared by all channels. Extensive experiments on the CIFAR-10, CIFAR-100 and ILSVRC-2012 benchmarks show that our proposed network structure can accelerate the inference process by 32% on average with negligible performance drop., Comment: This paper has been accepted by the IEEE CVPR 2017
- Published
- 2017
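The two-branch structure is easy to sketch: the LCCL (here the 1x1-convolution variant) produces a ReLU mask that multiplies the output of the expensive convolution, so zeros in the mask mark skippable positions. This is a dense training-time rendering; realizing the actual speed-up would require a sparse inference kernel.

```python
import torch.nn as nn

class LowCostCollaborativeConv(nn.Module):
    """Sketch of the more-is-less pairing: original conv x low-cost companion."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1)  # high-cost
        self.lccl = nn.Conv2d(in_channels, out_channels, 1)             # cheap companion
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        mask = self.act(self.lccl(x))        # zero cells stay zero after the multiply,
        return self.act(self.conv(x)) * mask # so the matching conv outputs are skippable
```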
187. Object Region Mining with Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach
- Author
Wei, Yunchao, Feng, Jiashi, Liang, Xiaodan, Cheng, Ming-Ming, Zhao, Yao, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition
- Abstract
We investigate a principled way to progressively mine discriminative object regions using classification networks to address the weakly-supervised semantic segmentation problem. Classification networks are only responsive to small and sparse discriminative regions of the object of interest, which deviates from the requirement of the segmentation task to localize dense, interior and integral regions for pixel-wise inference. To mitigate this gap, we propose a new adversarial erasing approach for localizing and expanding object regions progressively. Starting with a single small object region, our proposed approach drives the classification network to sequentially discover new and complementary object regions by erasing the currently mined regions in an adversarial manner. These localized regions eventually constitute a dense and complete object region for learning semantic segmentation. To further enhance the quality of the regions discovered by adversarial erasing, an online prohibitive segmentation learning approach is developed to collaborate with adversarial erasing by providing auxiliary segmentation supervision modulated by the more reliable classification scores. Despite its apparent simplicity, the proposed approach achieves 55.0% and 55.7% mean Intersection-over-Union (mIoU) scores on the PASCAL VOC 2012 val and test sets, which are the new state of the art., Comment: Accepted to appear in CVPR 2017 (oral)
- Published
- 2017
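A toy sketch of the adversarial-erasing loop, assuming a CAM-style classifier (global average pooling plus a linear head). The tiny network, threshold, and step count are ours; the actual method also retrains the classifier after each erasure, which we omit for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy classifier: conv features + linear head, so a class activation map
# (CAM) is just the head weights applied at every spatial position.
features = nn.Conv2d(3, 8, 3, padding=1)
head = nn.Linear(8, 5)

def cam(img, cls):
    f = F.relu(features(img))                            # (B, 8, H, W)
    return torch.einsum('c,bchw->bhw', head.weight[cls], f)

def adversarial_erasing(img, cls, steps=3, thresh=0.6):
    img = img.clone()
    mined = torch.zeros(img.shape[0], *img.shape[2:], dtype=torch.bool)
    for _ in range(steps):
        m = cam(img, cls)
        region = m > thresh * m.amax()                   # most discriminative region
        mined |= region                                  # accumulate mined evidence
        img[region.unsqueeze(1).expand_as(img)] = 0      # erase it, forcing new regions
    return mined                                         # union approximates the object

mask = adversarial_erasing(torch.rand(1, 3, 32, 32), cls=2)
print(mask.float().mean())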
188. Interpretable Structure-Evolving LSTM
- Author
-
Liang, Xiaodan, Lin, Liang, Shen, Xiaohui, Feng, Jiashi, Yan, Shuicheng, and Xing, Eric P.
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Learning - Abstract
This paper develops a general framework for learning interpretable data representations with Long Short-Term Memory (LSTM) recurrent neural networks over hierarchical graph structures. Instead of learning LSTM models over pre-fixed structures, we propose to also learn the intermediate, interpretable multi-level graph structures in a progressive and stochastic way from data during LSTM network optimization. We thus call this model the structure-evolving LSTM. In particular, starting with an initial element-level graph representation where each node is a small data element, the structure-evolving LSTM gradually evolves multi-level graph representations by stochastically merging graph nodes with high compatibilities along the stacked LSTM layers. In each LSTM layer, we estimate the compatibility of two connected nodes from their corresponding LSTM gate outputs, which is used to generate a merging probability. Candidate graph structures are then generated by grouping nodes into cliques according to their merging probabilities. We produce the new graph structure with a Metropolis-Hastings algorithm, which alleviates the risk of getting stuck in local optima through stochastic sampling with an acceptance probability. Once a graph structure is accepted, a higher-level graph is constructed by taking the partitioned cliques as its nodes. During the evolving process, representations become more abstract at higher levels, where redundant information is filtered out, allowing more efficient propagation of long-range data dependencies. We evaluate the structure-evolving LSTM on semantic object parsing and demonstrate its advantage over state-of-the-art LSTM models on standard benchmarks., Comment: To appear in CVPR 2017 as a spotlight paper [Sketch follows this entry.]
- Published
- 2017
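A minimal sketch of the stochastic graph-coarsening step, in the spirit of the structure-evolving LSTM. Here `merge_prob` stands in for the compatibility the model derives from LSTM gate outputs, and the acceptance test is a simplified Metropolis-Hastings step; the union-find bookkeeping is our own.

```python
import random

def evolve_structure(num_nodes, edges, merge_prob, steps=10, seed=0):
    """Stochastically coarsen a graph into higher-level cliques.

    Each proposed merge of two connected cliques is accepted with probability
    `merge_prob(a, b)`, so unlikely merges still happen occasionally and the
    evolution can escape poor local structures.
    """
    rng = random.Random(seed)
    parent = list(range(num_nodes))          # union-find: clique membership

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving
            x = parent[x]
        return x

    for _ in range(steps):
        u, v = rng.choice(edges)             # propose merging along one edge
        a, b = find(u), find(v)
        if a != b and rng.random() < merge_prob(a, b):
            parent[b] = a                    # accepted: merge the two cliques

    cliques = {}
    for n in range(num_nodes):
        cliques.setdefault(find(n), []).append(n)
    return list(cliques.values())            # nodes of the next, higher-level graph

ring = [(i, (i + 1) % 6) for i in range(6)]  # 6 nodes on a ring
print(evolve_structure(6, ring, merge_prob=lambda a, b: 0.5))
```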
189. Tree-Structured Reinforcement Learning for Sequential Object Localization
- Author
-
Jie, Zequn, Liang, Xiaodan, Feng, Jiashi, Jin, Xiaojie, Lu, Wen Feng, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Existing object proposal algorithms usually search for possible object regions over multiple locations and scales separately, which ignores the interdependency among different objects and deviates from human perception. To incorporate global interdependency between objects into object localization, we propose an effective Tree-structured Reinforcement Learning (Tree-RL) approach that sequentially searches for objects by fully exploiting both the current observation and historical search paths. Tree-RL learns multiple search policies by maximizing a long-term reward that reflects localization accuracy over all objects. Starting by taking the entire image as a proposal, Tree-RL allows the agent to sequentially discover multiple objects via a tree-structured traversing scheme. By allowing multiple near-optimal policies, Tree-RL offers more diversity in search paths and is able to find multiple objects with a single feed-forward pass. Therefore, Tree-RL can better cover objects at various scales, which is quite appealing in the context of object proposal. Experiments on PASCAL VOC 2007 and 2012 validate the effectiveness of Tree-RL, which achieves recall comparable to current object proposal algorithms with far fewer candidate windows., Comment: Advances in Neural Information Processing Systems 2016 [Sketch follows this entry.]
- Published
- 2017
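A toy illustration of the tree-structured traversal only: each window spawns two children, mimicking Tree-RL's two policy branches, so a single pass yields many diverse proposals. The fixed zoom/shift geometry below is entirely our stand-in; the actual method learns these actions with reinforcement learning.

```python
def tree_proposals(image_w, image_h, depth=3):
    """Breadth-first tree of windows: one 'scale' child, one 'translate' child."""
    root = (0, 0, image_w, image_h)          # start from the whole image
    frontier, proposals = [root], [root]
    for _ in range(depth):
        nxt = []
        for (x, y, w, h) in frontier:
            zoom = (x + w // 4, y + h // 4, w // 2, h // 2)        # "scale" branch
            shift = (min(x + w // 4, image_w - w), y, w, h)        # "translate" branch
            nxt += [zoom, shift]
        proposals += nxt
        frontier = nxt
    return proposals

print(len(tree_proposals(640, 480)))         # 1 + 2 + 4 + 8 = 15 windows in one pass
```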
190. Training Group Orthogonal Neural Networks with Privileged Information
- Author
-
Chen, Yunpeng, Jin, Xiaojie, Feng, Jiashi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Learning rich and diverse representations is critical for the performance of deep convolutional neural networks (CNNs). In this paper, we consider how to use privileged information to promote the inherent diversity of a single CNN model, so that the model can learn better representations and offer stronger generalization ability. To this end, we propose a novel group orthogonal convolutional neural network (GoCNN) that learns untangled representations within each layer by exploiting the provided privileged information, effectively enhancing representation diversity. We take image classification as an example, where image segmentation annotations serve as privileged information during training. Experiments on two benchmark datasets -- ImageNet and PASCAL VOC -- clearly demonstrate the strong generalization ability of the proposed GoCNN model. On ImageNet, GoCNN improves the state-of-the-art ResNet-152 model by an absolute 1.2% while using privileged information for only 10% of the training images, confirming the effectiveness of GoCNN in exploiting available privileged knowledge to train better CNNs., Comment: Proceedings of the IJCAI-17 [Sketch follows this entry.]
- Published
- 2017
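A minimal sketch of how a segmentation mask could act as privileged supervision to untangle two channel groups. The even channel split and the quadratic penalty are our simplifications, not the paper's exact loss.

```python
import torch

def privileged_group_loss(feats, seg_mask):
    """Penalise each channel group for responding on the other group's region.

    Channels are split into a foreground group and a background group; the
    mask (1 = foreground) suppresses cross-region activations, pushing the
    two groups toward orthogonal, untangled representations.
    """
    c = feats.shape[1] // 2
    fg, bg = feats[:, :c], feats[:, c:]
    mask = seg_mask.unsqueeze(1)                  # (B, 1, H, W)
    loss_fg = (fg * (1 - mask)).pow(2).mean()     # fg channels should ignore background
    loss_bg = (bg * mask).pow(2).mean()           # bg channels should ignore foreground
    return loss_fg + loss_bg

feats = torch.randn(2, 8, 16, 16, requires_grad=True)
mask = (torch.rand(2, 16, 16) > 0.5).float()      # privileged segmentation annotation
print(privileged_group_loss(feats, mask))
```

At test time no mask is needed: the privileged signal only shapes the representation during training.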
191. Video-based Person Re-identification with Accumulative Motion Context
- Author
-
Liu, Hao, Jie, Zequn, Jayashree, Karlekar, Qi, Meibin, Jiang, Jianguo, Yan, Shuicheng, and Feng, Jiashi
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Video-based person re-identification plays a central role in realistic security and video surveillance. In this paper we propose a novel Accumulative Motion Context (AMOC) network for this important problem, which effectively exploits long-range motion context for robustly identifying the same person under challenging conditions. Given a video sequence of the same or different persons, the proposed AMOC network jointly learns an appearance representation and motion context from a collection of adjacent frames using a two-stream convolutional architecture. AMOC then accumulates clues from the motion context by recurrent aggregation, allowing effective information flow among adjacent frames and capturing the dynamic gist of the persons. The architecture is end-to-end trainable, so the motion context can adapt to complement appearance clues under unfavorable conditions (e.g., occlusions). Extensive experiments are conducted on three public benchmark datasets, i.e., the iLIDS-VID, PRID-2011 and MARS datasets, to investigate the performance of AMOC. The results demonstrate that the proposed AMOC network significantly outperforms the state of the art for video-based re-identification and confirm the advantage of exploiting long-range motion context, clearly validating our motivation., Comment: accepted by TCSVT [Sketch follows this entry.]
- Published
- 2016
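A compact sketch of the two-stream-plus-recurrent-accumulation idea. Layer widths, the GRU aggregator, and the use of a precomputed flow-like motion map are all our assumptions for a self-contained example.

```python
import torch
import torch.nn as nn

class TwoStreamAccumulator(nn.Module):
    """Per frame, encode appearance (RGB) and motion (2-channel flow-like map),
    then accumulate the fused clues across the sequence with an RNN."""
    def __init__(self, dim=64):
        super().__init__()
        self.appearance = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1),
                                        nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.motion = nn.Sequential(nn.Conv2d(2, dim, 3, padding=1),
                                    nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.rnn = nn.GRU(2 * dim, dim, batch_first=True)

    def forward(self, frames, flows):
        # frames: (B, T, 3, H, W); flows: (B, T, 2, H, W)
        b, t = frames.shape[:2]
        app = self.appearance(frames.flatten(0, 1)).view(b, t, -1)
        mot = self.motion(flows.flatten(0, 1)).view(b, t, -1)
        _, h = self.rnn(torch.cat([app, mot], dim=-1))   # recurrent accumulation
        return h[-1]                                     # sequence-level descriptor

model = TwoStreamAccumulator()
desc = model(torch.randn(2, 5, 3, 64, 32), torch.randn(2, 5, 2, 64, 32))
print(desc.shape)  # (2, 64): one descriptor per track for matching
```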
192. Robust LSTM-Autoencoders for Face De-Occlusion in the Wild
- Author
-
Zhao, Fang, Feng, Jiashi, Zhao, Jian, Yang, Wenhan, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Face recognition techniques have developed significantly in recent years. However, recognizing faces with partial occlusion remains challenging for existing face recognizers, a capability heavily desired in real-world applications concerning surveillance and security. Although much research effort has been devoted to face de-occlusion methods, most of them work well only under constrained conditions, such as when all faces come from a pre-defined closed set. In this paper, we propose a robust LSTM-Autoencoders (RLA) model to effectively restore partially occluded faces even in the wild. The RLA model consists of two LSTM components, which aim at occlusion-robust face encoding and recurrent occlusion removal respectively. The first, a multi-scale spatial LSTM encoder, reads facial patches of various scales sequentially and outputs a latent representation; occlusion-robustness is achieved because the occlusion influences only some of the patches. Receiving the representation learned by the encoder, an LSTM decoder with a dual-channel architecture reconstructs the overall face and detects the occlusion simultaneously; by virtue of the LSTM, the decoder breaks the task of face de-occlusion into restoring the occluded part step by step. Moreover, to minimize identity information loss and guarantee face recognition accuracy over recovered faces, we introduce an identity-preserving adversarial training scheme to further improve RLA. Extensive experiments on both synthetic and real datasets of occluded faces clearly demonstrate the effectiveness of the proposed RLA in removing different types of facial occlusion at various locations. The proposed method also provides a significantly larger performance gain than other de-occlusion methods in improving recognition over partially occluded faces. [Sketch follows this entry.]
- Published
- 2016
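A sketch of the "read patches sequentially" intuition behind the encoder: an occlusion corrupts only a few of the LSTM's inputs rather than the whole code. For brevity we use a single patch scale (the paper uses several), and the patch/hidden sizes are ours.

```python
import torch
import torch.nn as nn

class PatchSequenceEncoder(nn.Module):
    """Cut the face into patches, flatten each, and feed them to an LSTM in
    sequence; the final hidden state is an occlusion-robust latent code."""
    def __init__(self, patch=16, hidden=128):
        super().__init__()
        self.patch = patch
        self.lstm = nn.LSTM(3 * patch * patch, hidden, batch_first=True)

    def forward(self, img):                    # img: (B, 3, H, W), H, W % patch == 0
        p = self.patch
        patches = img.unfold(2, p, p).unfold(3, p, p)      # (B, 3, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3).flatten(1, 2)
        _, (h, _) = self.lstm(patches)         # read the patch sequence
        return h[-1]                           # latent code for the decoder

code = PatchSequenceEncoder()(torch.randn(2, 3, 64, 64))
print(code.shape)  # (2, 128)
```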
193. Dual-Constrained Deep Semi-Supervised Coupled Factorization Network with Enriched Prior
- Author
-
Zhang, Yan, Zhang, Zhao, Wang, Yang, Zhang, Zheng, Zhang, Li, Yan, Shuicheng, and Wang, Meng
- Published
- 2021
- Full Text
- View/download PDF
194. Enhancing Visual Grounding in Vision-Language Pre-Training With Position-Guided Text Prompts
- Author
-
Wang, Alex Jinpeng, Zhou, Pan, Shou, Mike Zheng, and Yan, Shuicheng
- Published
- 2024
- Full Text
- View/download PDF
195. Rethinking Bottleneck Structure for Efficient Mobile Network Design
- Author
-
Zhou, Daquan, Hou, Qibin, Chen, Yunpeng, Feng, Jiashi, Yan, Shuicheng, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Vedaldi, Andrea, editor, Bischof, Horst, editor, Brox, Thomas, editor, and Frahm, Jan-Michael, editor
- Published
- 2020
- Full Text
- View/download PDF
196. Video Scene Parsing with Predictive Feature Learning
- Author
-
Jin, Xiaojie, Li, Xin, Xiao, Huaxin, Shen, Xiaohui, Lin, Zhe, Yang, Jimei, Chen, Yunpeng, Dong, Jian, Liu, Luoqi, Jie, Zequn, Feng, Jiashi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this work, we address the challenging video scene parsing problem by developing effective representation learning methods given limited parsing annotations. In particular, we contribute two novel methods that constitute a unified parsing framework. (1) \textbf{Predictive feature learning} from nearly unlimited unlabeled video data: different from existing methods that learn features from single-frame parsing, we learn spatiotemporal discriminative features by enforcing a parsing network to predict future frames and their parsing maps (if available) given only historical frames. In this way, the network effectively learns to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations. (2) \textbf{Prediction steering parsing} architecture that adapts the learned spatiotemporal features to scene parsing tasks and provides strong guidance for any off-the-shelf parsing model to achieve better video scene parsing performance. Extensive experiments over two challenging datasets, Cityscapes and CamVid, demonstrate the effectiveness of our methods, showing significant improvement over well-established baselines., Comment: 15 pages, 7 figures, 5 tables, currently v2 [Sketch follows this entry.]
- Published
- 2016
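A minimal sketch of the self-supervised signal in predictive feature learning: an encoder must summarise K historical frames well enough to predict frame K+1, so its features absorb video dynamics without labels. The network sizes and single-step prediction are our simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FramePredictor(nn.Module):
    """Encode K stacked history frames; decode a prediction of the next frame."""
    def __init__(self, k=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Conv2d(3 * k, 32, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(32, 3, 3, padding=1)     # predicted next frame

    def forward(self, hist):                  # hist: (B, K, 3, H, W)
        feats = self.encoder(hist.flatten(1, 2))
        return self.decoder(feats), feats     # features are what parsing reuses

clip = torch.randn(2, 5, 3, 32, 32)           # 4 history frames + 1 target frame
pred, feats = FramePredictor()(clip[:, :4])
loss = F.mse_loss(pred, clip[:, 4])           # supervision comes from raw video itself
print(loss.item())
```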
197. Deep Joint Rain Detection and Removal from a Single Image
- Author
-
Yang, Wenhan, Tan, Robby T., Feng, Jiashi, Liu, Jiaying, Guo, Zongming, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we address the problem of rain removal from a single image, even in the presence of heavy rain and rain streak accumulation. Our core ideas lie in new rain image models and a novel deep learning architecture. We first modify an existing model comprising a rain streak layer and a background layer by adding a binary map that locates rain streak regions. Second, we create a new model consisting of a component representing rain streak accumulation (where individual streaks cannot be seen and are thus visually similar to mist or fog), and another component representing the various shapes and directions of overlapping rain streaks that usually occur in heavy rain. Based on the first model, we develop a multi-task deep learning architecture that learns the binary rain streak map, the appearance of rain streaks, and the clean background, which is our ultimate output. The additional binary map is critically beneficial, since its loss function provides strong extra information to the network. To handle rain streak accumulation and the various shapes and directions of overlapping rain streaks, we propose a recurrent rain detection and removal network that removes rain streaks and clears up rain accumulation iteratively and progressively. In each recurrence, a new contextualized dilated network exploits regional contextual information and outputs a better representation for rain detection. Evaluation on real images, particularly in heavy rain, shows the effectiveness of our models and architecture, which outperform the state-of-the-art methods significantly. Our code and data sets will be publicly available., Comment: Preliminary version to appear in CVPR2017 [Sketch follows this entry.]
- Published
- 2016
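A toy rendering of the first rain image model above: a rainy observation O = B + S (background plus accumulated streak layers), together with the binary region map R that the detection branch is supervised with. The sparsity pattern and threshold are ours, and the atmospheric-accumulation component of the second model is omitted.

```python
import torch

def compose_rain_image(background, streaks):
    """O = B + S, plus a binary map R marking pixels any streak touches."""
    s = streaks.sum(dim=0)                        # accumulated streak layer S
    observed = (background + s).clamp(0, 1)       # rain image O = B + S
    region = (s > 0.05).float()                   # binary rain-region map R
    return observed, region

b = torch.rand(3, 64, 64) * 0.6                                # clean background B
sparse = (torch.rand(2, 3, 64, 64) > 0.97).float()             # sparse streak locations
streaks = torch.rand(2, 3, 64, 64) * sparse                    # two overlapping layers
o, r = compose_rain_image(b, streaks)
print(o.shape, r.mean().item())                 # rainy image, fraction of rain pixels
```

Synthesising (O, R, B) triplets this way is what makes the extra binary-map loss cheap to supervise.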
198. Multi-Path Feedback Recurrent Neural Network for Scene Parsing
- Author
-
Jin, Xiaojie, Chen, Yunpeng, Feng, Jiashi, Jie, Zequn, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we consider the scene parsing problem and propose a novel Multi-Path Feedback recurrent neural network (MPF-RNN) for parsing scene images. MPF-RNN enhances the capability of RNNs to model long-range context information at multiple levels and to better distinguish pixels that are easy to confuse. Different from feedforward CNNs and RNNs with only a single feedback connection, MPF-RNN propagates the contextual features learned at the top layer through \textit{multiple} weighted recurrent connections to learn bottom-level features. To better train MPF-RNN, we propose a new strategy that accumulates the loss over multiple recurrent steps, improving performance on parsing small objects. With these two novel components, MPF-RNN achieves significant improvement over strong baselines (VGG16 and Res101) on five challenging scene parsing benchmarks: the traditional SiftFlow, Barcelona, CamVid and Stanford Background datasets, as well as the recently released large-scale ADE20K., Comment: Accepted by AAAI-17. Camera-ready version [Sketch follows this entry.]
- Published
- 2016
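A minimal sketch of multi-path feedback with an accumulative loss. Two conv layers stand in for "bottom layers", each with its own weighted feedback path from the top-layer context, and a loss is collected at every recurrent step; all sizes are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPFSketch(nn.Module):
    """Top-layer context is fed back through separate paths into both lower layers."""
    def __init__(self, ch=16, classes=5):
        super().__init__()
        self.l1 = nn.Conv2d(3, ch, 3, padding=1)
        self.l2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.top = nn.Conv2d(ch, ch, 3, padding=1)
        self.fb1 = nn.Conv2d(ch, 3, 1)     # feedback path into layer 1's input
        self.fb2 = nn.Conv2d(ch, ch, 1)    # feedback path into layer 2's input
        self.head = nn.Conv2d(ch, classes, 1)

    def forward(self, x, steps=3):
        ctx, logits = None, []
        for _ in range(steps):
            h1 = F.relu(self.l1(x if ctx is None else x + self.fb1(ctx)))
            h2 = F.relu(self.l2(h1 if ctx is None else h1 + self.fb2(ctx)))
            ctx = F.relu(self.top(h2))        # context reused at the next step
            logits.append(self.head(ctx))     # a prediction at every step
        return logits

outs = MPFSketch()(torch.randn(1, 3, 24, 24))
target = torch.zeros(1, 24, 24, dtype=torch.long)
loss = sum(F.cross_entropy(o, target) for o in outs)  # accumulative multi-step loss
print(loss.item())
```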
199. Visual Processing by a Unified Schatten-$p$ Norm and $\ell_q$ Norm Regularized Principal Component Pursuit
- Author
-
Wang, Jing, Wang, Meng, Hu, Xuegang, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we propose a non-convex formulation to recover authentic structure from corrupted real data. Typically, the specific structure is assumed to be low-rank, which holds for a wide range of data such as images and videos, while the corruption is assumed to be sparse. In the literature, such a problem is known as Robust Principal Component Analysis (RPCA), which usually recovers the low-rank structure by approximating the rank function with a nuclear norm and penalizing the error with an $\ell_1$-norm. Although RPCA is a convex formulation and can be solved effectively, the introduced norms are not tight approximations, which may cause the solution to deviate from the authentic one. We therefore consider a non-convex relaxation consisting of a Schatten-$p$ norm and an $\ell_q$-norm that promote low rank and sparsity, respectively. We derive a proximal iteratively reweighted algorithm (PIRA) to solve the problem. Our algorithm is based on the alternating direction method of multipliers, where in each iteration we linearize the underlying objective function, which yields a closed-form solution. We demonstrate that the solutions produced by the linearized approximation always converge and give a tighter approximation than the convex counterpart. Experimental results on benchmarks show encouraging results for our approach., Comment: Pattern Recognition, 2015 [Sketch follows this entry.]
- Published
- 2016
- Full Text
- View/download PDF
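A NumPy sketch of one inner update in the flavour of PIRA: the non-convex $\sigma^p$ penalty is linearized at the current singular values, giving per-value weights whose proximal step is a weighted singular-value thresholding. This is only the Schatten-$p$ half under our chosen constants, not the paper's full ADMM solver with the $\ell_q$ term.

```python
import numpy as np

def reweighted_sv_step(L, p=0.5, tau=1.0, eps=1e-6):
    """One proximal iteratively-reweighted update for a Schatten-p penalty.

    Linearizing sigma^p at the current singular values gives weights
    w_i = p * (sigma_i + eps)^(p - 1), and the weighted nuclear norm has a
    closed-form proximal step: weighted soft-thresholding of the spectrum.
    """
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    w = p * (s + eps) ** (p - 1)             # large singular values get small weights
    s_new = np.maximum(s - tau * w, 0.0)     # weighted singular-value thresholding
    return U @ np.diag(s_new) @ Vt

X = np.random.randn(20, 5) @ np.random.randn(5, 20)        # a rank-5 matrix
L = reweighted_sv_step(X + 0.01 * np.random.randn(20, 20))  # noisy observation
print(np.linalg.matrix_rank(L, tol=1e-3))
```

Because large singular values receive small weights, the step shrinks them less than a nuclear norm would, which is the sense in which the relaxation is tighter.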
200. Multi-stage Object Detection with Group Recursive Learning
- Author
-
Li, Jianan, Liang, Xiaodan, Li, Jianshu, Xu, Tingfa, Feng, Jiashi, and Yan, Shuicheng
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Most existing detection pipelines treat object proposals independently, predicting bounding box locations and classification scores for each of them separately. However, the important semantic and spatial layout correlations among proposals are often ignored, even though they are useful for more accurate object detection. In this work, we propose a new EM-like group recursive learning approach that iteratively refines object proposals by incorporating the context of surrounding proposals, yielding an optimal spatial configuration of object detections. In addition, we incorporate weakly-supervised object segmentation cues and region-based object detection into a multi-stage architecture, fully exploiting the learned segmentation features for better object detection in an end-to-end way. The architecture consists of three cascaded networks which respectively learn weakly-supervised object segmentation, object proposal generation and recursive detection refinement. Combining group recursive learning with the multi-stage architecture yields competitive mAPs of 78.6% and 74.9% on the PASCAL VOC2007 and VOC2012 datasets respectively, significantly outperforming many well-established baselines [10] [20]. [Sketch follows this entry.]
- Published
- 2016
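A toy illustration of the group-recursive idea only: each iteration nudges a box toward a consensus of its overlapping neighbours before a per-box regressor runs, so spatial-layout context enters every refinement step. The consensus rule, radius, and placeholder regressor are all our stand-ins for the learned networks.

```python
import numpy as np

def group_recursive_refine(boxes, regress, iters=3, alpha=0.3):
    """Iteratively refine boxes (x1, y1, x2, y2) using group spatial context."""
    boxes = boxes.astype(float)
    for _ in range(iters):
        centers = (boxes[:, :2] + boxes[:, 2:]) / 2
        for i in range(len(boxes)):
            d = np.linalg.norm(centers - centers[i], axis=1)
            group = d < 50                           # neighbours within 50 px
            consensus = boxes[group].mean(axis=0)    # group spatial context
            boxes[i] = (1 - alpha) * boxes[i] + alpha * consensus
        boxes = regress(boxes)                       # stand-in for the learned refiner
    return boxes

identity = lambda b: b                               # placeholder per-box regressor
print(group_recursive_refine(np.array([[0, 0, 10, 10], [4, 4, 14, 14]]), identity))
```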