124 results for "Xian-Sheng Hua"
Search Results
2. HEART: Towards Effective Hash Codes under Label Noise
- Author
- Jinan Sun, Haixin Wang, Xiao Luo, Shikun Zhang, Wei Xiang, Chong Chen, and Xian-Sheng Hua
- Published
- 2022
- Full Text
- View/download PDF
3. Pursuing Knowledge Consistency: Supervised Hierarchical Contrastive Learning for Facial Action Unit Recognition
- Author
- Yingjie Chen, Chong Chen, Xiao Luo, Jianqiang Huang, Xian-Sheng Hua, Tao Wang, and Yun Liang
- Published
- 2022
- Full Text
- View/download PDF
4. Improved Deep Unsupervised Hashing via Prototypical Learning
- Author
- Zeyu Ma, Wei Ju, Xiao Luo, Chong Chen, Xian-Sheng Hua, and Guangming Lu
- Published
- 2022
- Full Text
- View/download PDF
5. CoHOZ
- Author
- Ning Liao, Yifeng Liu, Xiaobo Li, Chenyi Lei, Guoxin Wang, Xian-Sheng Hua, and Junchi Yan
- Published
- 2022
- Full Text
- View/download PDF
6. DEAL: An Unsupervised Domain Adaptive Framework for Graph-level Classification
- Author
- Nan Yin, Li Shen, Baopu Li, Mengzhu Wang, Xiao Luo, Chong Chen, Zhigang Luo, and Xian-Sheng Hua
- Published
- 2022
- Full Text
- View/download PDF
7. Token Embeddings Alignment for Cross-Modal Retrieval
- Author
- Chen-Wei Xie, Jianmin Wu, Yun Zheng, Pan Pan, and Xian-Sheng Hua
- Published
- 2022
- Full Text
- View/download PDF
8. Attention-guided Temporally Coherent Video Object Matting
- Author
- Chi Wang, Weiwei Xu, Xuansong Xie, Peiran Ren, Xian-Sheng Hua, Miaomiao Cui, Yunke Zhang, Hujun Bao, and Qixing Huang
- Subjects
Computer science, Computer Vision and Pattern Recognition (cs.CV), Pixel, Artificial neural network, Feature vector, Deep learning, Segmentation, Computer vision, Artificial intelligence
- Abstract
This paper proposes a novel deep learning-based video object matting method that achieves temporally coherent matting results. Its key component is an attention-based temporal aggregation module that carries the strengths of image matting networks over to video matting. This module computes temporal correlations for pixels adjacent to each other along the time axis in feature space, which is robust against motion noise. We also design a novel loss term to train the attention weights, which drastically boosts video matting performance. Besides, we show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network with a sparse set of user-annotated keyframes. To facilitate the training of the video matting and trimap generation networks, we construct a large-scale video matting dataset with 80 training and 28 validation foreground video clips with ground-truth alpha mattes. Experimental results show that our method can generate high-quality alpha mattes for various videos featuring appearance change, occlusion, and fast motion. Our code and dataset can be found at https://github.com/yunkezhang/TCVOM. (10 pages, 6 figures, MM '21 camera-ready.) [A minimal code sketch of the temporal aggregation idea follows this entry.]
- Published
- 2021
- Full Text
- View/download PDF
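The sketch below illustrates the kind of attention-based temporal aggregation the abstract describes. It is not the paper's module: it simply lets every pixel of a centre frame attend, in feature space, to the same spatial location in neighbouring frames and takes a softmax-weighted average along time. The function name, the five-frame window and all tensor shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def temporal_attention_aggregate(feats):
    """Attention-weighted aggregation of per-frame features along time.

    feats: (T, C, H, W) feature maps of T adjacent frames.
    Returns an aggregated (C, H, W) map for the centre frame.
    Hypothetical simplification, not the paper's exact module.
    """
    T, C, H, W = feats.shape
    q = feats[T // 2].reshape(C, H * W)              # centre-frame queries
    k = feats.reshape(T, C, H * W)                   # per-frame keys
    # similarity of each pixel with the same location in every frame
    attn = torch.einsum('cp,tcp->tp', q, k) / C ** 0.5
    attn = F.softmax(attn, dim=0)                    # weights over time
    out = torch.einsum('tp,tcp->cp', attn, k)        # weighted sum over frames
    return out.reshape(C, H, W)

agg = temporal_attention_aggregate(torch.randn(5, 64, 32, 32))
```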
9. Pairwise VLAD Interaction Network for Video Question Answering
- Author
- Dan Guo, Xian-Sheng Hua, Hui Wang, and Meng Wang
- Subjects
Computer science, Modality (human–computer interaction), Perspective (graphical), Interaction network, Question answering, Pairwise comparison, Cluster analysis, Encoder, Natural language processing, Natural language, Artificial intelligence
- Abstract
Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of the video and the natural-language question. Existing methods that perform correlation learning between video and question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which is not enough to correctly answer the question. From a human's perspective, answering a video question should first summarize both the visual and the language information, and then explore their correlations for answer reasoning. In this paper, we propose a new method called Pairwise VLAD Interaction Network (PVI-Net) to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder that summarizes the video and question modalities into a small number of compact VLAD descriptors. For correlation learning, a pairwise VLAD interaction mechanism is proposed to better exploit complementary information for each pair of modality descriptors, avoiding modeling uninformative individual relations (e.g., frame-word and clip-word relations), and exploring both inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA. [A minimal sketch of a VLAD encoder follows this entry.]
- Published
- 2021
- Full Text
- View/download PDF
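As a rough picture of what a learnable clustering-based VLAD encoder does, the sketch below follows the NetVLAD recipe: soft-assign each input feature to learnable centroids and accumulate the residuals per cluster. Cluster count, dimensions and names are illustrative; PVI-Net's actual encoder and its pairwise interaction mechanism are not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLADEncoder(nn.Module):
    """Learnable clustering-based VLAD encoder (NetVLAD-style sketch)."""

    def __init__(self, dim=256, n_clusters=8):
        super().__init__()
        self.centroids = nn.Parameter(torch.randn(n_clusters, dim))
        self.assign = nn.Linear(dim, n_clusters)  # soft-assignment logits

    def forward(self, x):
        # x: (N, dim) frame or word features
        a = F.softmax(self.assign(x), dim=-1)         # (N, K) soft assignments
        resid = x.unsqueeze(1) - self.centroids       # (N, K, dim) residuals
        vlad = (a.unsqueeze(-1) * resid).sum(dim=0)   # (K, dim) descriptors
        return F.normalize(vlad, dim=-1)              # per-cluster L2 norm

enc = VLADEncoder()
descriptors = enc(torch.randn(20, 256))  # 20 frame features -> 8 descriptors
```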
10. A Statistical Approach to Mining Semantic Similarity for Deep Unsupervised Hashing
- Author
- Xian-Sheng Hua, Zeyu Ma, Jianqiang Huang, Daqing Wu, Minghua Deng, Xiao Luo, and Chong Chen
- Subjects
Computer science, Semantic similarity, Similarity (network science), Sampling distribution, Hash function, Metric (mathematics), Pairwise comparison, Pattern recognition, Divergence (statistics), Image retrieval, Artificial intelligence
- Abstract
Most deep unsupervised hashing methods first construct pairwise semantic similarity information and then learn to map images into compact hash codes while preserving the similarity structure, which implies that the quality of the hash codes highly depends on the constructed semantic similarity structure. However, since the features of the images for each kind of semantics usually scatter in high-dimensional space with unknown distribution, previous methods, which rely on pairwise cosine distances, could introduce a large number of false positives and negatives for boundary points of the distributions in the local semantic structure. To overcome this limitation, we propose a general distribution-based metric to depict the pairwise distance between images. Specifically, each image is characterized by its random augmentations, which can be viewed as samples from the corresponding latent semantic distribution. We then estimate the distance between images by calculating the sample distribution divergence of their semantics. Applying this new metric to deep unsupervised hashing, we arrive at Distribution-based similArity sTructure rEconstruction (DATE). DATE generates more accurate semantic similarity information by using a non-parametric ball divergence. Moreover, DATE explores both semantic-preserving learning and contrastive learning to obtain high-quality hash codes. Extensive experiments on several widely used datasets validate the superiority of DATE. [A minimal sample-based divergence sketch follows this entry.]
- Published
- 2021
- Full Text
- View/download PDF
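The distribution-based distance can be illustrated with any sample-based divergence between two sets of augmentation embeddings. The paper uses a non-parametric ball divergence; the energy distance below is a simpler stand-in with the same flavour, chosen only because it is easy to state, so treat it as an assumption rather than DATE's actual metric.

```python
import numpy as np

def energy_distance(x, y):
    """Sample-based divergence between two sets of augmentation embeddings.

    x: (n, d), y: (m, d) embeddings of random augmentations of two images.
    Compares the two empirical distributions purely through pairwise
    sample distances: 2*E||X-Y|| - E||X-X'|| - E||Y-Y'||.
    """
    def mean_pdist(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(-1)).mean()

    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

rng = np.random.default_rng(0)
d = energy_distance(rng.normal(0, 1, (16, 128)), rng.normal(0.5, 1, (16, 128)))
```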
11. Large-scale vehicle trajectory reconstruction with camera sensing network
- Author
- Mo Li, Xian-Sheng Hua, Jianqiang Huang, Mingqian Li, and Panrong Tong
- Subjects
Computer science, Real-time computing, Process (computing), Convolution, Consistency (database systems), Trajectory, Graph (abstract data type)
- Abstract
Vehicle trajectories provide essential information for understanding urban mobility and benefit a wide range of urban applications. State-of-the-art solutions for vehicle sensing may not build accurate and complete knowledge of all vehicle trajectories. To fill the gap, this paper proposes VeTrac, a comprehensive system that employs widely deployed traffic cameras as a sensing network to trace vehicle movements and reconstruct their trajectories at large scale. VeTrac fuses mobility correlation and vision-based analysis to reduce uncertainty in identifying vehicles. A graph convolution process is employed to maintain identity consistency across different camera observations, and a self-training process is invoked when aligning with the urban road network to reconstruct vehicle trajectories with confidence. Extensive experiments with real-world input of over 7 million vehicle snapshots from over one thousand traffic cameras demonstrate that VeTrac achieves 98% accuracy in a simple expressway scenario and 89% accuracy in a complex urban environment, outperforming alternative solutions by 32% and 59%, respectively. [A minimal sketch of graph-based identity propagation follows this entry.]
- Published
- 2021
- Full Text
- View/download PDF
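One simple way to picture the graph-convolution step for identity consistency is to smooth per-snapshot identity embeddings over a snapshot-similarity graph, so observations of the same vehicle drift toward a common representation. The normalized propagation below is a generic GCN-style sketch under that assumption, not VeTrac's actual model.

```python
import numpy as np

def propagate_identities(adj, feats, n_layers=2):
    """Smooth identity embeddings over a camera-observation graph.

    adj:   (n, n) snapshot similarity graph (mobility + visual cues)
    feats: (n, d) initial per-snapshot identity embeddings
    Uses symmetric normalization, as in a standard GCN layer.
    """
    a = adj + np.eye(len(adj))                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(1))
    a_norm = a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    for _ in range(n_layers):
        feats = a_norm @ feats                  # neighbour averaging
    return feats
```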
12. PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation
- Author
- Jianqiang Huang, Rongxin Jiang, Yaowu Chen, Chen Shen, Shaotian Yan, Zhongming Jin, and Xian-Sheng Hua
- Subjects
Computer science, Computer Vision and Pattern Recognition (cs.CV), Theoretical computer science, Predicate (grammar), Correlation, Annotation, Class imbalance, Perception, Scene graph, Encoder
- Abstract
Today, the scene graph generation (SGG) task is largely limited in realistic scenarios, mainly due to the extremely long-tailed bias of the predicate annotation distribution. Thus, tackling the class-imbalance problem of SGG is critical and challenging. In this paper, we first discover that when predicate labels have strong correlation with each other, prevalent re-balancing strategies (e.g., re-sampling and re-weighting) either over-fit the tail data (e.g., "bench sitting on sidewalk" rather than "on") or still suffer the adverse effect of the original uneven distribution (e.g., aggregating the varied "parked on"/"standing on"/"sitting on" into "on"). We argue the principal reason is that re-balancing strategies are sensitive to the frequencies of predicates yet blind to their relatedness, which may play a more important role in promoting the learning of predicate features. Therefore, we propose a novel Predicate-Correlation Perception Learning (PCPL for short) scheme to adaptively seek out appropriate loss weights by directly perceiving and utilizing the correlation among predicate classes. Moreover, our PCPL framework is further equipped with a graph encoder module to better extract context features. Extensive experiments on the benchmark VG150 dataset show that the proposed PCPL performs markedly better on tail classes while well preserving the performance on head ones, significantly outperforming previous state-of-the-art methods. (Comment: to appear at ACM MM 2020.) [A minimal sketch of correlation-aware loss weighting follows this entry.]
- Published
- 2020
- Full Text
- View/download PDF
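To make the relatedness-over-frequency idea concrete, the sketch below derives per-class loss weights from correlations between class prototypes: a predicate strongly correlated with many others receives a smaller weight, an isolated one a larger weight. The prototype input and the exact weighting formula are assumptions for illustration, not PCPL's scheme.

```python
import numpy as np

def correlation_aware_weights(proto):
    """Per-class loss weights driven by predicate relatedness, not frequency.

    proto: (K, d) one feature prototype per predicate class.
    """
    p = proto / np.linalg.norm(proto, axis=1, keepdims=True)
    corr = p @ p.T                                 # cosine relatedness
    np.fill_diagonal(corr, 0.0)
    relatedness = corr.clip(min=0).sum(axis=1)     # total positive correlation
    weights = 1.0 / (1.0 + relatedness)            # correlated -> down-weighted
    return weights * len(weights) / weights.sum()  # normalize to mean 1

w = correlation_aware_weights(np.random.rand(50, 64))  # e.g. for weighted CE
```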
13. Bridging the Web Data and Fine-Grained Visual Recognition via Alleviating Label Noise and Domain Mismatch
- Author
- Zhibin Li, Yazhou Yao, Xian-Sheng Hua, Guanyu Gao, Zeren Sun, and Jian Zhang
- Subjects
Computer science, Professional knowledge, Training set, Noise reduction, Visual recognition, Categorization, Noise removal, Network model, World Wide Web
- Abstract
To distinguish the subtle differences among fine-grained categories, a large number of well-labeled images are typically required. However, manual annotation for fine-grained categories is an extremely difficult task, as it usually demands professional knowledge. To this end, we propose to directly leverage web images for fine-grained visual recognition. Our work mainly focuses on two critical issues in web images: "label noise" and "domain mismatch". Specifically, we propose an end-to-end deep denoising network (DDN) model to jointly solve these problems in the process of web image selection. To verify the effectiveness of our proposed approach, we first collect web images by using the labels in fine-grained datasets. Then we apply the proposed deep denoising network model for noise removal and domain mismatch alleviation. We use the selected web images as the training set for learning fine-grained categorization models. Extensive experiments and ablation studies demonstrate the state-of-the-art performance of our proposed approach, which also delivers a new pipeline for fine-grained visual categorization that promises to be highly effective for real-world applications.
- Published
- 2020
- Full Text
- View/download PDF
14. CRSSC: Salvage Reusable Samples from Noisy Data for Robust Learning
- Author
- Zeren Sun, Guosheng Hu, Xiu-Shen Wei, Xian-Sheng Hua, Jian Zhang, and Yazhou Yao
- Subjects
Computer science, Sample selection, Machine learning, Memorization, Robust learning, Robustness (computer science), Noise (video), Noisy data, Artificial intelligence
- Abstract
Due to the existence of label noise in web images and the high memorization capacity of deep neural networks, training deep fine-grained (FG) models directly on web images tends to yield inferior recognition ability. In the literature, loss correction methods alleviate this issue by estimating the noise transition matrix, but inevitable false corrections cause severe accumulated errors. Sample selection methods identify clean ("easy") samples by their small losses, which alleviates the accumulated errors. However, "hard" and mislabeled examples, both of which can boost the robustness of FG models, are also dropped. To this end, we propose a certainty-based reusable sample selection and correction approach, termed CRSSC, for coping with label noise when training deep FG models on web images. Our key idea is to additionally identify and correct reusable samples, and then leverage them together with clean examples to update the networks. We demonstrate the superiority of the proposed approach from both theoretical and experimental perspectives. [A minimal sketch of certainty-based sample splitting follows this entry.]
- Published
- 2020
- Full Text
- View/download PDF
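A minimal version of certainty-based sample splitting might look like the following: small-loss samples are kept as clean, confidently predicted large-loss samples are relabelled and reused, and the remainder is dropped. The entropy criterion and both thresholds are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def split_samples(losses, probs, loss_thresh=0.5, entropy_thresh=0.7):
    """Partition noisy training samples into clean / reusable / dropped.

    losses: (n,) per-sample loss; probs: (n, K) softmax predictions.
    Returns boolean masks plus pseudo-labels for the reusable set.
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    clean = losses < loss_thresh                      # small-loss samples
    reusable = (~clean) & (entropy < entropy_thresh)  # confidently predicted
    dropped = ~(clean | reusable)
    pseudo_labels = probs.argmax(axis=1)              # relabel reusable samples
    return clean, reusable, dropped, pseudo_labels
```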
15. PyRetri: A PyTorch-based Library for Unsupervised Image Retrieval by Deep Convolutional Neural Networks
- Author
- Ren-Jie Song, Benyi Hu, Yuehu Liu, Xian-Sheng Hua, Xiu-Shen Wei, and Yazhou Yao
- Subjects
Computer science, Machine Learning (cs.LG), Computer Vision and Pattern Recognition (cs.CV), Information Retrieval (cs.IR), Multimedia (cs.MM), Source code, Convolutional neural network, Extensibility, Software, Deep learning, Usability, Information retrieval, Image retrieval, Artificial intelligence
- Abstract
Despite significant progress in applying deep learning methods to the field of content-based image retrieval, there has not been a software library that covers these methods in a unified manner. To fill this gap, we introduce PyRetri, an open-source library for deep learning based unsupervised image retrieval. The library encapsulates the retrieval process in several stages and provides functionality that covers various prominent methods for each stage. The idea underlying its design is to provide a unified platform for deep learning based image retrieval research, with high usability and extensibility. To the best of our knowledge, this is the first open-source library for unsupervised image retrieval by deep learning. (Accepted by ACM Multimedia Conference 2020. PyRetri is open-source and available at https://github.com/PyRetri/PyRetri.) [A minimal stage-wise retrieval sketch follows this entry.]
- Published
- 2020
- Full Text
- View/download PDF
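The staged design can be pictured with a minimal retrieval pipeline: post-process features (centering plus L2 normalization), then rank the gallery by cosine similarity. This is a generic sketch of the kind of stages the library wraps; PyRetri's real API differs, so consult the repository linked above for actual usage.

```python
import numpy as np

def retrieve(gallery, query, k=5):
    """Rank gallery features against a query feature.

    gallery: (n, d) pre-extracted CNN features; query: (d,).
    Stages: center -> L2 normalize -> cosine ranking.
    """
    mean = gallery.mean(axis=0)
    g = gallery - mean                               # center
    q = query - mean
    g /= np.linalg.norm(g, axis=1, keepdims=True)    # L2 normalize
    q /= np.linalg.norm(q)
    scores = g @ q                                   # cosine similarity
    return np.argsort(-scores)[:k]                   # top-k gallery indices

idx = retrieve(np.random.rand(1000, 512), np.random.rand(512))
```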
16. Challenges and Practices of Large Scale Visual Intelligence in the Real-World
- Author
- Xian-Sheng Hua
- Subjects
Computer science, Communication design, Product design, Visual objects, Big data, Reinforcement learning, Business value, Data science
- Abstract
Visual intelligence is one of the key aspects of Artificial Intelligence. Considerable technological progress has been made along this direction in the past few years. However, how to incubate the right technologies and convert them into real business value in the real world remains a challenge. In this talk, we will analyze the current challenges of visual intelligence in the real world and try to summarize a few key points that help us successfully develop and apply technologies to solve real-world problems. In particular, we will introduce a few successful examples, including "City Brain" and "Luban (visual design)", covering problem definition/discovery, technology development, product design, and the realization of business value. City Brain: A city is an aggregate of a huge amount of heterogeneous data. However, extracting meaningful value from that data is nontrivial. City Brain is an end-to-end system whose goal is to glean irreplaceable value from big-city data, specifically videos, with the assistance of rapidly evolving AI technologies and fast-growing computing capacity. From cognition to optimization, to decision-making, from search to prediction and ultimately, to intervention, City Brain improves the way we manage the city, as well as the way we live in it. In this talk, we will introduce current practices of the City Brain platform, as well as what we can do to achieve the goal and make it a reality, step by step. Luban: Different from most typical visual intelligence technologies, which focus on analyzing, recognizing or searching visual objects, the goal of Luban (visual design) is to create visual content. In particular, we will introduce an automatic 2D banner design technique based on deep learning and reinforcement learning. We will detail how Luban was created and how it changed the world of 2D banner design by creating 50M banners a day.
- Published
- 2018
- Full Text
- View/download PDF
17. Local Convolutional Neural Networks for Person Re-Identification
- Author
- Xinmei Tian, Houqiang Li, Jianqiang Huang, Jiwei Yang, Xu Shen, and Xian-Sheng Hua
- Subjects
Computer science, Boosting (machine learning), Embedding, Convolutional neural network, Re-identification, Artificial intelligence
- Abstract
Recent works have shown that person re-identification can be substantially improved by introducing attention mechanisms, which allow learning both global and local representations. However, all these works learn global and local features in separate branches. As a consequence, no interaction or mutual boosting of global and local information is possible except in the final feature embedding layer. In this paper, we propose local operations as a generic family of building blocks for synthesizing global and local information in any layer. This building block can be inserted into any convolutional network with only a small amount of prior knowledge about the approximate locations of local parts. For the task of person re-identification, even with only one local block inserted, our local convolutional neural networks (Local CNN) consistently outperform state-of-the-art methods on three large-scale benchmarks, including Market-1501, CUHK03, and DukeMTMC-ReID.
- Published
- 2018
- Full Text
- View/download PDF
18. Previewer for Multi-Scale Object Detector
- Author
- Zhongming Jin, Guo-Jun Qi, Chen Shen, Zhihang Fu, Rongxin Jiang, Xian-Sheng Hua, and Yaowu Chen
- Subjects
Computer science, Pattern recognition, Convolutional neural network, Object detection, Feature (computer vision), Benchmark (computing), Artificial intelligence
- Abstract
Most multi-scale detectors face a challenge of small-size false positives due to the inadequacy of low-level features, which have small receptive fields and weak semantic capabilities. This paper demonstrates that independent predictions from different feature layers on the same region are beneficial for reducing false positives. We propose a novel light-weight previewer block, which previews the objectness probability for the potential regression region of each prior box, using the stronger features with larger receptive fields and more contextual information for better predictions. This previewer block is generic and can easily be implemented in multi-scale detectors such as SSD, RFBNet and MS-CNN. Extensive experiments are conducted on the PASCAL VOC and KITTI pedestrian benchmarks to show the superiority of the proposed method.
- Published
- 2018
- Full Text
- View/download PDF
19. Session details: Multimodal-1 (Multimodal Reasoning)
- Author
- Xian-Sheng Hua
- Subjects
Computer science, Human–computer interaction, Session (computer science)
- Published
- 2018
- Full Text
- View/download PDF
20. The City Brain
- Author
- Xian-Sheng Hua
- Subjects
Computer science, Smart city, Cloud computing, Real-time search, Aggregate (data warehouse), Data science
- Abstract
A city is an aggregate of a huge amount of heterogeneous data. However, extracting meaningful value from that data remains challenging. City Brain is an end-to-end system whose goal is to glean irreplaceable value from big-city data, specifically from videos, with the assistance of rapidly evolving AI technologies and fast-growing computing capacity. From cognition to optimization, to decision-making, from search to prediction and ultimately, to intervention, City Brain improves the way we manage the city, as well as the way we live in it. In this talk, we will first introduce current practices of the City Brain platform in a few cities in China, including what we can do to achieve the goal and make it a reality. Then we will focus on visual search technologies and applications that can be applied to city data. Last, a few video demos will be shown, followed by a few future directions of city computing.
- Published
- 2018
- Full Text
- View/download PDF
21. Learning Feature Embedding with Strong Neural Activations for Fine-Grained Retrieval
- Author
- Chang Zhou, Rongxin Jiang, Wenqing Chu, Chen Shen, Zhongming Jin, Xian-Sheng Hua, and Yaowu Chen
- Subjects
Computer science, Deep learning, Pattern recognition, Machine learning, Convolutional neural network, Discriminative model, Feature (computer vision), Similarity (geometry), Softmax function, Embedding, Artificial intelligence
- Abstract
Fine-grained object retrieval, which aims at finding objects belonging to the same sub-category as the probe object in a large database, is becoming increasingly popular because of its research and application significance. Recently, convolutional neural network (CNN) based deep learning models have achieved promising retrieval performance, as they can learn feature representations and discriminative distance metrics jointly. Specifically, a generic method is to extract activations of the fully-connected layer as feature descriptors and simultaneously optimize classification constraints (e.g., a softmax loss) and similarity constraints (e.g., a triplet loss) to improve the representative capability of the features. However, typical fully-connected layer activations are more focused on representing global attributes of the corresponding image, and thus relatively insensitive to specific local characteristics. Therefore, the features learned through these approaches are in general not sufficiently capable of retrieving fine-grained objects. To attack this issue, we propose an effective feature embedding that simultaneously encodes original global features and discriminative local features, in which the local features are extracted by exploiting strong neural activations on the last convolutional layer. We show that the novel feature embedding can dramatically enlarge the gap between inter-class variance and intra-class variance, which is the key factor in improving retrieval precision. In addition, we show that our architecture can also be applied to person re-identification. Experimental results on multiple challenging benchmarks demonstrate that our method outperforms the current state-of-the-art approaches by large margins. [A minimal global-plus-local embedding sketch follows this entry.]
- Published
- 2017
- Full Text
- View/download PDF
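A toy rendering of the global-plus-local embedding idea: pool the last convolutional feature map into a global descriptor, pick the spatial locations with the strongest activations as local descriptors, and concatenate. The selection rule and sizes below are assumptions; the paper's embedding is more elaborate.

```python
import torch
import torch.nn.functional as F

def global_local_embedding(fmap, n_local=4):
    """Concatenate a global descriptor with strongest-activation local ones.

    fmap: (C, H, W) last convolutional feature map of one image.
    """
    C, H, W = fmap.shape
    global_feat = fmap.mean(dim=(1, 2))                   # global average pool
    energy = fmap.pow(2).sum(dim=0).reshape(-1)           # per-location energy
    top = energy.topk(n_local).indices                    # strongest locations
    local_feats = fmap.reshape(C, -1)[:, top].T.reshape(-1)  # (n_local * C,)
    emb = torch.cat([global_feat, local_feats])
    return F.normalize(emb, dim=0)

emb = global_local_embedding(torch.randn(256, 14, 14))
```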
22. Layout Style Modeling for Automating Banner Design
- Author
- Weiwei Xu, Peiran Ren, Changyuan Yang, Xian-Sheng Hua, Yunke Zhang, and Kangkang Hu
- Subjects
Computer science, Multimedia, Online advertising, Creativity, Style (sociolinguistics), Human–computer interaction, Banner
- Abstract
Banner design is challenging: a banner must clearly convey information while also satisfying aesthetic goals and complying with the banner owner's or advertiser's visual identity system. In online advertising, banners often come in tens of different display sizes and rapidly changing design styles to chase fashion in many distinct market areas, and designers have to make huge efforts to adjust their designs for each display size and target style. Therefore, automating multi-size and multi-style banner design can greatly release designers' creativity. Different from previous work relying on a single unified omnipotent optimization to accomplish such a complex task, we tackle it with a combination of layout style learning, interpolation and transfer. We optimize a banner layout given the style parameter learned from a set of training banners for a particular display size and layout style. This kind of optimization is faster and much more controllable than optimizing for all sizes and diverse styles at once. To achieve multi-size banner design, we collect style parameters for a small collection of sizes and interpolate them to support arbitrary target sizes. To reduce the difficulty of style parameter training, we introduce a novel style transfer technique so that creating a multi-size style becomes as easy as designing a single banner. With all three techniques, a robust and easy-to-use layout style model is built, upon which we automate banner design. We test our method on a dataset containing thousands of real banners for online advertising and evaluate our generated banners in various sizes and styles by comparing them with professional designs.
- Published
- 2017
- Full Text
- View/download PDF
23. Spatio-Temporal AutoEncoder for Video Anomaly Detection
- Author
- Hongtao Lu, Bing Deng, Chen Shen, Xian-Sheng Hua, Yao Liu, and Yiru Zhao
- Subjects
Computer science, Anomaly detection, Autoencoder, Motion (physics), Computer vision, Feature learning
- Abstract
Anomalous event detection in real-world video scenes is a challenging problem due to the complexity of "anomaly" as well as the cluttered backgrounds, objects and motions in the scenes. Most existing methods use hand-crafted features in local spatial regions to identify anomalies. In this paper, we propose a novel model called the Spatio-Temporal AutoEncoder (ST AutoEncoder or STAE), which utilizes deep neural networks to learn video representations automatically and extracts features from both the spatial and temporal dimensions by performing 3-dimensional convolutions. In addition to the reconstruction loss used in existing typical autoencoders, we introduce a weight-decreasing prediction loss for generating future frames, which enhances motion feature learning in videos. Since most anomaly detection datasets are restricted to appearance anomalies or unnatural motion anomalies, we collected a new challenging dataset comprising a set of real-world traffic surveillance videos. Several experiments performed on both the public benchmarks and our traffic dataset show that our proposed method remarkably outperforms the state-of-the-art approaches. [A minimal sketch of the combined loss follows this entry.]
- Published
- 2017
- Full Text
- View/download PDF
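The combined objective can be sketched directly: a reconstruction term on the input frames plus a prediction term on future frames whose weight decreases with the prediction horizon. The exponential decay schedule below is an assumed stand-in for the paper's weight-decreasing scheme.

```python
import torch

def stae_loss(recon, frames, preds, future, decay=0.8):
    """Reconstruction loss plus weight-decreasing future-prediction loss.

    recon/frames: (B, C, T, H, W); preds/future: (B, C, Tf, H, W).
    The decay schedule is an illustrative assumption.
    """
    rec = (recon - frames).pow(2).mean()
    t_f = preds.shape[2]
    w = torch.tensor([decay ** t for t in range(t_f)])           # decreasing
    per_frame = (preds - future).pow(2).mean(dim=(0, 1, 3, 4))   # (Tf,)
    pred = (w * per_frame).sum() / w.sum()
    return rec + pred
```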
24. Spatiotemporal Multi-Task Network for Human Activity Understanding
- Author
- Chang Zhou, Jianqiang Huang, Xian-Sheng Hua, Yao Liu, and Deng Cai
- Subjects
Computer science, Deep learning, Machine learning, Convolutional neural network, Motion (physics)
- Abstract
Recently, remarkable progress has been achieved in human action recognition and detection using deep learning techniques. However, for action detection in real-world untrimmed videos, the accuracy of most existing approaches is still far from satisfactory, due to the difficulty of temporal action localization. On the other hand, spatiotemporal features are not well utilized in recent work on video analysis. To tackle these problems, we propose a spatiotemporal, multi-task, 3D deep convolutional neural network to detect (i.e., temporally localize and recognize) actions in untrimmed videos. First, we introduce a fusion framework which aims to extract video-level spatiotemporal features in the training phase, and we demonstrate the effectiveness of video-level features by evaluating our model on the human action recognition task. Then, under the fusion framework, we propose a spatiotemporal multi-task network, which has two sibling output layers for action classification and temporal localization, respectively. To obtain precise temporal locations, we present a novel temporal regression method to revise the proposal window that contains an action. Meanwhile, in order to better utilize the rich motion information in videos, we introduce a novel video representation, interlaced images, as an additional network input stream. As a result, our model outperforms state-of-the-art methods for both action recognition and detection on standard benchmarks.
- Published
- 2017
- Full Text
- View/download PDF
25. Deep Siamese Network with Multi-level Similarity Perception for Person Re-identification
- Author
- Zhihang Fu, Rongxin Jiang, Chen Shen, Yiru Zhao, Xian-Sheng Hua, Yaowu Chen, and Zhongming Jin
- Subjects
Computer science, Machine learning, Convolutional neural network, Discriminative model, Similarity (network science), Feature (computer vision), Perception, Artificial intelligence
- Abstract
Person re-identification (re-ID), which aims at spotting a person of interest across multiple camera views, has gained more and more attention in the computer vision community. In this paper, we propose a novel deep Siamese architecture based on a convolutional neural network (CNN) and multi-level similarity perception. According to the distinct characteristics of diverse feature maps, we apply different similarity constraints to low-level and high-level feature maps during the training stage. Therefore, our network can efficiently learn discriminative feature representations at different levels, which significantly improves re-ID performance. Besides, our framework has two additional benefits. Firstly, classification constraints can easily be incorporated into the framework, forming a unified multi-task network with similarity constraints. Secondly, as similarity-comparable information has been encoded in the network's learning parameters via back-propagation, pairwise input is not necessary at test time. That means we can extract features of each gallery image and build the index in an offline manner, which is essential for large-scale real-world applications. Experimental results on multiple challenging benchmarks demonstrate that our method achieves splendid performance compared with the current state-of-the-art approaches.
- Published
- 2017
- Full Text
- View/download PDF
26. Stylized Adversarial AutoEncoder for Image Generation
- Author
- Hongtao Lu, Bing Deng, Yiru Zhao, Xian-Sheng Hua, and Jianqiang Huang
- Subjects
Computer science, Pattern recognition, Autoencoder, Latent variable, Prior probability, Artificial intelligence
- Abstract
In this paper, we propose an autoencoder-based generative adversarial network (GAN) for automatic image generation, which we call a "stylized adversarial autoencoder". Different from existing generative autoencoders, which typically impose a prior distribution over the latent vector, the proposed approach splits the latent variable into two components: a style feature and a content feature, both encoded from real images. The split of the latent vector enables us to adjust the content and the style of the generated image arbitrarily by choosing different exemplary images. In addition, a multi-class classifier is adopted in the GAN network as the discriminator, which makes the generated images more realistic. We performed experiments on hand-written digit, scene text and face datasets, on which the stylized adversarial autoencoder achieves superior results for image generation as well as remarkably improving the corresponding supervised recognition tasks.
- Published
- 2017
- Full Text
- View/download PDF
27. Deep CTR Prediction in Display Advertising
- Author
- Junxuan Chen, Hongtao Lu, Hao Li, Xian-Sheng Hua, and Baigui Sun
- Subjects
Computer science, Computer Vision and Pattern Recognition (cs.CV), Multimedia (cs.MM), Artificial neural network, Display advertising, Click-through rate, Pattern recognition, Computer vision, Artificial intelligence
- Abstract
Click-through rate (CTR) prediction of image ads is the core task of online display advertising systems, and logistic regression (LR) has frequently been applied as the prediction model. However, the LR model lacks the ability to extract complex and intrinsic nonlinear features from handcrafted high-dimensional image features, which limits its effectiveness. To solve this issue, in this paper, we introduce a novel deep neural network (DNN) based model that directly predicts the CTR of an image ad from raw image pixels and other basic features in one step. The DNN model employs convolution layers to automatically extract representative visual features from images, and nonlinear CTR features are then learned from the visual features and other contextual features by fully-connected layers. Empirical evaluations on a real-world dataset with over 50 million records demonstrate the effectiveness and efficiency of this method. (Comment: accepted version for ACM Multimedia Conference 2016.) [A minimal two-branch model sketch follows this entry.]
- Published
- 2016
- Full Text
- View/download PDF
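A minimal two-branch model in the spirit of the abstract: convolution layers embed the raw ad image, and fully-connected layers fuse the visual embedding with basic (contextual) features into a click probability. All layer sizes and names are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DeepCTR(nn.Module):
    """Toy CTR model: conv branch for the raw image, FC head for fusion."""

    def __init__(self, n_basic=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())      # -> (B, 32)
        self.head = nn.Sequential(
            nn.Linear(32 + n_basic, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, image, basic):
        v = self.conv(image)                             # visual features
        logit = self.head(torch.cat([v, basic], dim=1))  # fuse with basics
        return torch.sigmoid(logit)                      # click probability

ctr = DeepCTR()(torch.randn(4, 3, 64, 64), torch.randn(4, 32))
```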
28. A Domain Robust Approach For Image Dataset Construction
- Author
- Fumin Shen, Zhenmin Tang, Yazhou Yao, Jian Zhang, and Xian-Sheng Hua
- Subjects
Computer science, Information retrieval, Data mining, The Internet
- Abstract
There have been increasing research interests in automatically constructing image datasets by collecting images from the Internet. However, existing methods tend to have weak domain adaptation ability, known as the "dataset bias problem". To address this issue, in this work, we propose a novel image dataset construction framework which generalizes well to unseen target domains. Specifically, the given queries are first expanded by searching in the Google Books Ngram Corpus (GBNC) to obtain a richer semantic description, from which the noisy query expansions are then filtered out. By treating each expansion as a "bag" and the retrieved images therein as "instances", we formulate image filtering as a multi-instance learning (MIL) problem with constrained positive bags. With this approach, images from different data distributions are kept while noisy images are filtered out. Comprehensive experiments on two challenging tasks demonstrate the effectiveness of our proposed approach.
- Published
- 2016
- Full Text
- View/download PDF
29. Session details: Open Source Software Competition
- Author
- Xian-Sheng Hua, Marco Bertini, and Tao Mei
- Subjects
Computer science, Multimedia, Open source software, Competition (economics), Session (computer science)
- Published
- 2015
- Full Text
- View/download PDF
30. Pushing Image Recognition in the Real World
- Author
- Xian-Sheng Hua
- Subjects
Computer science, Search engine, Model selection, Computation, Scalability, The Internet, Computer vision, Artificial intelligence
- Abstract
Building a system that can recognize "what," "who," and "where" from arbitrary images has motivated researchers in the computer vision, multimedia and machine learning areas for decades. Significant progress has been made in recent years based on distributed computation and/or deep neural network techniques. However, it is still very challenging to realize a general-purpose real-world image recognition engine that has reasonable recognition accuracy, semantic coverage, and recognition speed. In this talk, we will first review the current status of this area, analyze the difficulties, and discuss potential solutions. Then two promising schemes to attack this challenge will be introduced: (1) learning millions of concepts from search engine click logs, and (2) recognizing whatever you want without data labeling. The first work tries to build large-scale recognition models by mining search engine click logs. Challenges in training data selection and model selection will be discussed, and efficient and scalable approaches for model training and prediction will be introduced. The second work aims at building image recognition engines for any set of entities without using any human-labeled training data, which helps generalize image recognition to a wide range of semantic concepts. Automatic training data generation steps will be presented, and techniques for improving recognition accuracy that effectively leverage massive amounts of Internet data will be discussed. Different parallelization strategies for different computation tasks will be introduced, which guarantee the efficiency and scalability of the entire system. Last, we will discuss possible directions for pushing image recognition in the real world.
- Published
- 2014
- Full Text
- View/download PDF
31. Clickage
- Author
- Kuansan Wang, Jingdong Wang, Yong Rui, Linjun Yang, Jin Li, Xian-Sheng Hua, Ming Ye, and Jing Wang
- Subjects
Computer science, Semantic search, Semantics, Search engine, Semantic gap, World Wide Web
- Abstract
The semantic gap between low-level visual features and high-level semantics has been investigated for decades but still remains a big challenge in multimedia. When "search" became one of the most frequently used applications, the "intent gap", the gap between query expressions and users' search intents, emerged. Researchers have been focusing on three approaches to bridge the semantic and intent gaps: 1) developing more representative features, 2) exploiting better learning approaches or statistical models to represent the semantics, and 3) collecting more and better-quality training data. However, it remains a challenge to close the gaps. In this paper, we argue that the massive amount of click data from commercial search engines provides a dataset that is unique in bridging the semantic and intent gaps. Search engines generate millions of click data records (i.e., image-query pairs), which provide almost "unlimited" yet strong connections between semantics and images, as well as connections between users' intents and queries. Studying the intrinsic properties of click data, and investigating how to effectively leverage this huge amount of data to bridge the semantic and intent gaps, is a promising direction for advancing multimedia research. In the past, the primary obstacle was that no such dataset was available to the public research community. This has changed, as Microsoft has released a new large-scale real-world image click dataset to the public. This paper presents preliminary studies on the power of large-scale click data with a variety of experiments, such as building large-scale concept detectors, tag processing, search, definitive tag detection, and intent analysis, with the goal of inspiring deeper research based on this dataset.
- Published
- 2013
- Full Text
- View/download PDF
32. Towards next generation multimedia recommendation systems
- Author
- Jialie Shen, Emre Sargin, and Xian-Sheng Hua
- Subjects
Computer science, Multimedia, Social media network, Mobile computing, Information technology, Recommender system, Digital library, World Wide Web
- Abstract
Empowered by advances in information technology, such as social media networks, digital libraries and mobile computing, ever-increasing amounts of multimedia data are emerging. As the key technology for addressing the problem of information overload, multimedia recommendation systems have received a lot of attention from both industry and academia. This course aims to 1) provide a detailed review of the state of the art in multimedia recommendation; 2) analyze key technical challenges in developing and evaluating next-generation multimedia recommendation systems from different perspectives; and 3) give some predictions about the road that lies ahead of us.
- Published
- 2013
- Full Text
- View/download PDF
33. Scalable similar image search by joint indices
- Author
- Jing Wang, Shipeng Li, Xian-Sheng Hua, and Jingdong Wang
- Subjects
Computer science, Scalability, Search engine indexing, Graph (abstract data type), Pattern recognition, Artificial intelligence
- Abstract
Text-based image search is able to return desired images for simple queries, but has limited capability in finding images with additional visual requirements. As a result, an image is usually used to help describe the appearance requirements. In this demonstration, we show a similar-image search system that supports joint textual and visual queries. We present an efficient and effective indexing algorithm, the neighborhood graph index, which is suitable for millions of images, and use it to organize joint inverted indices to search over billions of images.
- Published
- 2012
- Full Text
- View/download PDF
34. The role of attractiveness in web image search
- Author
- Shipeng Li, Chao Xu, Bo Geng, Linjun Yang, and Xian-Sheng Hua
- Subjects
Computer science, Information retrieval, Attractiveness, Search engine, Index (publishing), End user, Web page, Semantic search, Relevance (information retrieval), Ranking (information retrieval)
- Abstract
Existing web image search engines are mainly designed to optimize topical relevance. However, according to our user study, attractiveness is becoming a more and more important factor for web image search engines in satisfying users' search intentions. Important as it is, web image attractiveness from the search users' perspective has not been sufficiently recognized in either industry or academia. In this paper, we present a definition of web image attractiveness with three levels according to end users' feedback: perceptual quality, aesthetic sensitivity and affective tune. Corresponding to each level of the definition, various visual features are investigated for their applicability to attractiveness estimation of web images. To further deal with the unreliability of visual features induced by the large variations among web images, we propose a contextual approach that integrates the visual features with contextual cues mined from image EXIF information and the associated web pages. We explore the role of attractiveness by applying it to various stages of a web image search engine, including online ranking and interactive reranking, as well as offline index selection. Experimental results on three large-scale web image search datasets demonstrate that, compared to conventional relevance-based approaches, incorporating attractiveness brings more satisfaction to 80% of users for ranking/reranking search results and a 30.5% index coverage improvement for index selection.
- Published
- 2011
- Full Text
- View/download PDF
35. StoryImaging
- Author
- Genliang Guan, Xian-Sheng Hua, Dagan Feng, and Zhiyong Wang
- Subjects
Computer science, Multimedia, Presentation, Key terms, Context (language use), User interface, World Wide Web
- Abstract
In this demo, we develop the StoryImaging system to illustrate a textual story with both images harvested from the Web and synthesized speech. At the backend, a story is first processed to identify key terms, such as named entities, and to obtain the story summary. With the aid of commercial search engines, images are then collected from the Web for those key terms and re-ranked by taking the summary as context. Finally, images are clustered to provide an overview of the story. At the web-based frontend, the user interface has been tailored to both improve information comprehension and provide engaging, explorative experiences for users by closely bridging the textual and visual modalities.
- Published
- 2011
- Full Text
- View/download PDF
36. Web-scale image search by color sketch
- Author
- Xian-Sheng Hua and Jingdong Wang
- Subjects
Computer science, Information retrieval, Search engine, Interface (computing), Sketch, Index (publishing), Computer vision, Visual word, Image retrieval
- Abstract
Most existing image search engines rely on the texts or tags associated with images to index and retrieve them, which limits their ability to search images with visual requirements. In this demonstration, we present an image search system which enables consumers to find images by specifying how colors are spatially distributed. It is a well-designed trade-off between scalability and feasibility. To the best of our knowledge, this system is the first to scale up to Web-scale image collections. The interface is very intuitive: users only scribble a few color strokes, or drag an image and mask a few regions of interest, to express their search intent.
- Published
- 2011
- Full Text
- View/download PDF
37. Hybrid image summarization
- Author
- Shipeng Li, Xian-Sheng Hua, Hao Xu, and Jingdong Wang
- Subjects
Computer science, Hybrid image, Affinity propagation, Pattern recognition, Machine learning, Automatic summarization, Artificial intelligence
- Abstract
In this paper, we address the problem of managing tagged images with hybrid summarization. We formulate this problem as finding a few image exemplars to represent the image set semantically and visually, and solve it in a hybrid way by exploiting both the visual and the textual information associated with images. We propose a novel approach, called Homogeneous and Heterogeneous Message Propagation (H2MP), which extends affinity propagation, which only works over homogeneous relations, to heterogeneous relations. The summary obtained by our approach is both visually and semantically satisfactory. The experimental results demonstrate the effectiveness and efficiency of the proposed approach.
- Published
- 2011
- Full Text
- View/download PDF
38. TapTell
- Author
- Xian-Sheng Hua, Shipeng Li, Tao Mei, Ling Guan, and Ning Zhang
- Subjects
Computer science, Metadata, Visual search, Human–computer interaction, Gesture, Phone, World Wide Web
- Abstract
This demonstration presents TapTell, a mobile visual recognition and recommendation application on Windows Phone 7. It differs from other mobile visual search mechanisms, which merely focus on the search process. TapTell first discovers and understands users' visual intents via a circle-based natural user interaction called the "O" gesture. A Tap action then selects the "O"-gestured region. A context-aware visual search mechanism is utilized to recognize the intent and associate it with indexed metadata. Finally, the "Tell" action recommends relevant entities utilizing contextual information. The TapTell system has been evaluated in different scenarios on million-scale image collections.
- Published
- 2011
- Full Text
- View/download PDF
39. Internet multimedia advertising
- Author
- Tao Mei, Ruofei Zhang, and Xian-Sheng Hua
- Published
- 2011
- Full Text
- View/download PDF
40. Million-scale near-duplicate video retrieval system
- Author
- Xian-Sheng Hua, Fei Wang, Linjun Yang, Shipeng Li, Yang Cai, Wei Ping, and Tao Mei
- Subjects
Computer science, Information retrieval, Bag-of-words model, Video tracking, Frame (networking), Visual word, Video retrieval
- Abstract
In this paper, we present a novel near-duplicate video retrieval system serving one million web videos. To achieve both effectiveness and efficiency, a visual-word based approach is proposed, which quantizes each video frame into a word and represents the whole video as a bag of words. The system responds to a query in 41 ms with 78.4% MAP on average. [A toy bag-of-words index sketch follows this entry.]
- Published
- 2011
- Full Text
- View/download PDF
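A toy version of the visual-word scheme: each frame is quantized to one word, a video becomes a bag of words, and an inverted index maps words to videos. Frame quantization itself is assumed given, and the intersection-based ranking is a simplification of whatever scoring the real system uses.

```python
from collections import Counter, defaultdict

index = defaultdict(set)   # word id -> video ids containing it
bags = {}                  # video id -> Counter of word ids

def add_video(vid, frame_words):
    """Index a video given its per-frame visual-word ids."""
    bags[vid] = Counter(frame_words)
    for w in bags[vid]:
        index[w].add(vid)

def query(frame_words, topk=3):
    """Rank indexed videos by bag-of-words intersection with the query."""
    q = Counter(frame_words)
    candidates = set().union(*(index[w] for w in q if w in index)) if q else set()
    scores = {v: sum(min(c, bags[v][w]) for w, c in q.items()) for v in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:topk]

add_video('a', [1, 1, 2, 3])
add_video('b', [2, 4, 4])
print(query([1, 2, 2]))    # -> ['a', 'b']
```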
41. Modeling social strength in social media community via kernel-based learning
- Author
- Shipeng Li, Steven C. H. Hoi, Xian-Sheng Hua, Jinfeng Zhuang, and Tao Mei
- Subjects
Computer science, Social computing, Social network, Social media, Social relationship, Interpersonal ties, Learning to rank, Data science, World Wide Web
- Abstract
Modeling continuous social strength, rather than conventional binary social ties, in a social network can lead to a more precise and informative description of the social relationships among people. In this paper, we study the problem of social strength modeling (SSM) for the users in a social media community, who are typically associated with diverse forms of data. In particular, we take Flickr, the most popular online photo sharing community, as an example, in which users share their experiences through substantial amounts of multimodal content (e.g., photos, tags, geo-locations, friend lists) and social behaviors (e.g., commenting and joining interest groups). Such heterogeneous data in Flickr bring opportunities yet also challenges to the research community for SSM. One of the key issues in SSM is how to effectively explore the heterogeneous data and optimally combine it to measure social strength. In this paper, we present a kernel-based learning-to-rank framework for inferring the social strength of Flickr users, which involves two learning stages. The first stage employs a kernel target alignment algorithm to integrate the heterogeneous data into a holistic similarity space. With the learned kernel, the second stage rectifies the pairwise learning-to-rank approach to estimate social strength. By learning the social strength graph, we are able to conduct collaborative recommendation and collective classification. The promising results show that the learning-based approach is effective for SSM. Despite being focused on Flickr, our technique can be applied to model the social strength of users in any other social media community. [A minimal kernel target alignment sketch follows this entry.]
- Published
- 2011
- Full Text
- View/download PDF
42. Contextual image search
- Author
-
Shengjin Wang, Jingdong Wang, Wenhao Lu, Shipeng Li, and Xian-Sheng Hua
- Subjects
Web search query ,Information retrieval ,Concept search ,Computer science ,business.industry ,Search analytics ,Semantic search ,Full text search ,Phrase search ,Search engine ,Query expansion ,Web query classification ,Web page ,business - Abstract
In this paper, we propose a novel image search scheme, contextual image search. Unlike conventional image search schemes that present a separate interface (e.g., a text input box) for users to submit a query, the new scheme enables users to search for images simply by marking a few words while reading Web pages or other documents. Rather than merely relying on the explicit query input, which is often insufficient to express the user's search intent, our approach explores context information to better understand that intent through two key steps: query augmentation and search-result reranking using context, which are expected to yield better search results. Compared with contextual Web search, the context in our case is much richer and includes images as well as text. In addition to this scheme, called contextual image search with text input, we present another scheme, contextual image search with image input, which allows users to select an image from the Web pages or documents they are reading as the search query. The key idea is to use a search-to-annotation technique and a contextual textual-query mining scheme to determine the corresponding textual query and thereby obtain semantically similar search results. Experiments show that the proposed schemes make image search more convenient and produce results more relevant to user intention.
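A minimal sketch of the two context steps named above: augmenting the marked words with salient terms from the surrounding page, and reranking initial results by their overlap with the page context. The tokenization, the frequency-based term selection, and the `caption` field on results are illustrative assumptions.

```python
from collections import Counter

def augment_query(marked_words, page_text, k=3):
    """Add the k most frequent non-query page terms to the marked words."""
    marked = {w.lower() for w in marked_words}
    tokens = [t.lower() for t in page_text.split() if t.isalpha()]
    counts = Counter(t for t in tokens if t not in marked)
    return list(marked_words) + [t for t, _ in counts.most_common(k)]

def rerank(results, page_text):
    """Order initial results by their textual overlap with the page context."""
    context = set(page_text.lower().split())
    return sorted(results,
                  key=lambda r: len(set(r["caption"].lower().split()) & context),
                  reverse=True)
```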
- Published
- 2011
- Full Text
- View/download PDF
43. Multimedia tagging
- Author
-
Shuicheng Yan, Xian-Sheng Hua, Jialie Shen, and Meng Wang
- Subjects
World Wide Web ,Multimedia ,Computer science ,Scale (chemistry) ,computer.software_genre ,Media content ,computer ,Variety (cybernetics) - Abstract
The tags have proved to be a very crucial mechanism to facilitate the effective sharing and organization of large scale of multimedia information. As a result, technical developments on intelligent multimedia tagging have attracted a substantial amount of efforts involving experts from information retrieval, multimedia computing and artificial intelligence (particularly computer vision). The truly interdisciplinary research has resulted in many algorithmic and methodological developments. Meanwhile, many commercial web systems (e.g., Youtube, Last.fm and Flickr) have successfully introduced a variety of toolkits to assist different users in discovering and exploring media content using tags. This tutorial aims to provide a comprehensive coverage on the evolution of research for developing multimedia tagging technologies and identify a range of major challenges for the further scholarly study in the coming years.
- Published
- 2011
- Full Text
- View/download PDF
44. Video-based image retrieval
- Author
-
Yang Cai, Alan Hanjalic, Linjun Yang, Shipeng Li, and Xian-Sheng Hua
- Subjects
Information retrieval ,Concept search ,Computer science ,business.industry ,InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL ,ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION ,Image processing ,Content-based image retrieval ,Query expansion ,Automatic image annotation ,Image texture ,Computer vision ,Visual Word ,Artificial intelligence ,business ,Image retrieval - Abstract
Likely variations in capture conditions (e.g., lighting, blur, scale, occlusion) and in viewpoint between the query image and the images in the collection are the main reasons why image retrieval based on the Query-by-Example (QBE) principle is still not reliable enough. In this paper, we propose a novel QBE-based image retrieval system in which users may submit a short video clip as a query to improve retrieval reliability. The improvement is achieved by integrating information about the different viewpoints and conditions under which object and scene appearances are captured across video frames. The rich information extracted from a video can be exploited to generate a more complete query representation than a single-image query provides and to improve the relevance of the retrieved results. Our experimental results show that video-based image retrieval (VBIR) is significantly more reliable than retrieval using a single image as the query.
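One simple way to realize the "more complete query representation" idea is to pool per-frame bag-of-visual-words histograms so that words observed under any viewpoint or lighting survive into the query; max pooling here is an illustrative choice, not necessarily the authors' aggregation.

```python
import numpy as np

def pool_frame_histograms(frame_hists: np.ndarray) -> np.ndarray:
    """frame_hists: (num_frames, vocab_size) array of per-frame BoW histograms.

    Returns a single query histogram covering all observed appearances."""
    pooled = frame_hists.max(axis=0)           # keep words seen in any frame
    return pooled / max(pooled.sum(), 1e-12)   # renormalize for comparison
```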
- Published
- 2011
- Full Text
- View/download PDF
45. Graph-cut based tag enrichment
- Author
-
Xueming Qian and Xian-Sheng Hua
- Subjects
Max-flow min-cut theorem ,Computer science ,Computer Science::Information Retrieval ,Cut ,Graph (abstract data type) ,Computer Science::Computation and Language (Computational Linguistics and Natural Language and Speech Processing) ,Data mining ,computer.software_genre ,computer ,Graph ,MathematicsofComputing_DISCRETEMATHEMATICS - Abstract
In this paper, a graph-cut-based tag enrichment approach is proposed. For each image, we build a graph from its initial tags. The graph has two terminals, and its tag nodes are fully connected with one another. A min-cut/max-flow algorithm is then used to find the tags relevant to the image. Experiments on a Flickr dataset demonstrate the effectiveness of the proposed graph-cut-based tag enrichment approach.
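A minimal sketch of the two-terminal construction, using networkx's max-flow/min-cut; what the edge capacities encode here (tag-image relevance on terminal edges, tag-tag co-occurrence on inter-node edges) is an assumption made for illustration.

```python
import networkx as nx

def enrich_tags(tags, relevance, cooccurrence):
    """Return the tags on the source side of the min cut.

    relevance[t]       -- assumed affinity between tag t and the image
    cooccurrence[t][u] -- assumed pairwise tag affinity
    """
    G = nx.DiGraph()
    for t in tags:
        G.add_edge("SRC", t, capacity=relevance[t])         # cost of dropping t
        G.add_edge(t, "SINK", capacity=1.0 - relevance[t])  # cost of keeping t
    for i, t in enumerate(tags):                            # fully connected nodes
        for u in tags[i + 1:]:
            c = cooccurrence[t][u]
            G.add_edge(t, u, capacity=c)
            G.add_edge(u, t, capacity=c)
    _, (source_side, _) = nx.minimum_cut(G, "SRC", "SINK")
    return [t for t in tags if t in source_side]
```

In a Flickr-style setting, `relevance` might come from visual tag models and `cooccurrence` from global tag statistics, but both inputs here are placeholders.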
- Published
- 2011
- Full Text
- View/download PDF
46. WonderWhat
- Author
-
Xian-Sheng Hua, Mingyan Gao, and Ramesh Jain
- Subjects
World Wide Web ,Event (computing) ,Computer science ,media_common.quotation_subject ,Component (UML) ,Rank (computer programming) ,Complex event processing ,Context (language use) ,Ignorance ,media_common ,Task (project management) - Abstract
How often have you felt disappointed in a foreign country when you had been craving to participate in authentic native events but miserably ended up lost in the crowd, owing to ignorance of the local culture? Have you ever imagined that, with merely a simple click, a tool could identify the events right in front of you? As a step in this direction, in this paper we propose a system that provides users with information about the public events they are attending by analyzing, in real time, the photos they take at the event, leveraging both spatio-temporal context and photo content. To fulfill this task, we designed the system to collect event information, maintain a dedicated event database, build photo-content models for event types, and rank the final results. Extensive experiments were conducted to demonstrate the effectiveness of each component.
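A minimal sketch of the ranking component, assuming a candidate event list with location, time, and a photo-content score; the exponential decay scales and the linear combination weight `alpha` are illustrative assumptions, not the paper's ranking model.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

def rank_events(photo, candidates, alpha=0.5, km_scale=1.0, hour_scale=6.0):
    """Score each candidate event by spatio-temporal proximity to the photo,
    blended with a content score, and return events best-first."""
    def score(e):
        d = haversine_km(photo["lat"], photo["lon"], e["lat"], e["lon"])
        dt = abs(photo["time_h"] - e["time_h"])  # hours apart
        context = math.exp(-d / km_scale) * math.exp(-dt / hour_scale)
        return alpha * context + (1 - alpha) * e["content_score"]
    return sorted(candidates, key=score, reverse=True)
```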
- Published
- 2011
- Full Text
- View/download PDF
47. Melog
- Author
-
Xian-Sheng Hua and Hongzhi Li
- Subjects
Multimedia ,business.industry ,Wireless network ,Computer science ,Mobile computing ,Mobile Web ,Cloud computing ,computer.software_genre ,Camera phone ,Contextual design ,Mobile media ,Mobile station ,Mobile database ,Mobile search ,Mobile technology ,business ,Mobile device ,computer - Abstract
The rapid development of smart mobile devices, wireless networks, and cloud computing has extended mobile phones far beyond their use as voice communication tools. With an increasing trend, more and more people are using camera phones to record and share their daily experiences, thanks to their mobility and immediacy. Camera phones are true "multimedia" devices capable of acquiring, processing, transmitting, and presenting multimodal data such as image, video, audio, and text, as well as rich contextual information like location, direction, and velocity from the equipped sensors. All of this provides sufficient information and channels to share people's experiences effectively. However, owing to the complexity and lack of structure of raw multimedia and contextual data, experience sharing is still a nontrivial task, and efficient tools that support mobile, rapid, and real-time experience sharing are still lacking. In this paper, we propose a "mobile + cloud" system enabling rapid and near-real-time experience sharing through automatic blogging and micro-blogging, based on multimodal media content analysis and synthesis. An experimental system shows the effectiveness and efficiency of the proposed scheme.
- Published
- 2010
- Full Text
- View/download PDF
48. Session details: Media networking and content delivery
- Author
-
Xian-Sheng Hua
- Subjects
Multimedia ,Computer science ,Content delivery ,Session (computer science) ,computer.software_genre ,computer - Published
- 2010
- Full Text
- View/download PDF
49. The e-recall environment for cloud based mobile rich media data management
- Author
-
Jialie Shen, Shuicheng Yan, and Xian-Sheng Hua
- Subjects
User information ,Service (systems architecture) ,Multimedia ,business.industry ,Computer science ,Data management ,Cloud computing ,computer.software_genre ,Digital media ,Personalization ,World Wide Web ,Publishing ,Scalability ,The Internet ,business ,computer - Abstract
With the pervasiveness of mobile and wireless devices and the rapid growth of the Internet, the availability of rich media continues to accelerate in amount, variety, complexity, and scale, as exemplified by online media service portals such as Flickr, Facebook, and Last.fm. Unfortunately, the lack of proper data processing techniques has become a major obstacle to effective personal data management, especially in a mobile environment. Although the related technical developments have attracted much research attention recently, many open problems remain unsolved. Among them, two major challenges need to be addressed. The first is how to comprehensively describe (1) queries, i.e., model user information needs, and (2) the content of rich media data. Moreover, the media data collected by a modern personal digital assistant (PDA) can be huge and will continue to grow; consequently, fast data processing and the associated scalability issues are becoming more important than ever, and yet little serious research has been conducted in this field. In this paper, we report ongoing efforts to develop the E-Recall system, a novel platform for cloud-based mobile rich media data management. It aims to provide an intelligent and comprehensive infrastructure for (1) scalable media data processing, (2) flexible media content sharing and publishing, and (3) personalized media content integration in a mobile environment.
- Published
- 2010
- Full Text
- View/download PDF
50. Large-scale robust visual codebook construction
- Author
-
Linjun Yang, Hong-Jiang Zhang, Darui Li, and Xian-Sheng Hua
- Subjects
Linde–Buzo–Gray algorithm ,Best bin first ,Computer science ,business.industry ,Nearest neighbor search ,Codebook ,Vector quantization ,Pattern recognition ,Artificial intelligence ,Cluster analysis ,business ,Algorithm ,Image retrieval - Abstract
Web-scale image retrieval systems demand a large-scale visual codebook, which is difficult to generate with the commonly adopted K-means vector quantization because the standard algorithm does not scale to that size. Approximate K-means has been proposed to scale up visual codebook construction, but it requires a high-precision approximate nearest neighbor search in the assignment step and is difficult to converge, which limits its scalability. In this paper, we propose an improved approximate K-means that leverages assignment information from history, namely the previous iterations, to improve assignment precision. By further randomizing the approximate nearest neighbor search employed in each iteration, the proposed algorithm improves assignment precision, in a way conceptually similar to randomized k-d trees, while introducing nearly no additional cost. The algorithm is provably convergent, and we demonstrate both experimentally and analytically that it improves the quality of the generated visual codebook as well as its scalability.
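A minimal sketch of the history idea: each iteration proposes an approximate nearest center for every point, but a point switches assignment only if the proposal is strictly closer than the center it retained from previous iterations, which is what keeps the distortion non-increasing. The random-subset probe below is a stand-in for the randomized k-d tree search; everything else is an illustrative simplification of the paper's method.

```python
import numpy as np

def improved_approx_kmeans(X, k, iters=10, probes=32, seed=0):
    """X: (n, d) data matrix. Returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    probes = min(probes, k)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    assign = rng.integers(0, k, size=len(X))                  # initial guess
    for _ in range(iters):
        # Distance to each point's historically retained center.
        best_d = ((X - centers[assign]) ** 2).sum(1)
        # Approximate search: probe a random subset of centers (k-d tree stand-in).
        probe_ids = rng.choice(k, size=probes, replace=False)
        d = ((X[:, None, :] - centers[probe_ids][None]) ** 2).sum(-1)
        cand = probe_ids[d.argmin(1)]
        cand_d = d.min(1)
        # Switch only when the proposal beats the historical assignment.
        switch = cand_d < best_d
        assign[switch] = cand[switch]
        for j in range(k):                                    # standard update
            members = X[assign == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers, assign
```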
- Published
- 2010
- Full Text
- View/download PDF