26 results for "Zhangzhang Si"
Search Results
2. Unsupervised learning of event AND-OR grammar and semantics from video.
- Author
- Zhangzhang Si, Mingtao Pei, Benjamin Z. Yao, and Song-Chun Zhu
- Published
- 2011
- Full Text
- View/download PDF
3. Unsupervised learning of stochastic AND-OR templates for object modeling.
- Author
- Zhangzhang Si and Song-Chun Zhu
- Published
- 2011
- Full Text
- View/download PDF
4. Wavelet, active basis, and shape script: a tour in the sparse land.
- Author
- Zhangzhang Si and Ying Nian Wu
- Published
- 2010
- Full Text
- View/download PDF
5. Learning mixed templates for object recognition.
- Author
- Zhangzhang Si, Haifeng Gong, Ying Nian Wu, and Song Chun Zhu
- Published
- 2009
- Full Text
- View/download PDF
6. Deformable Template As Active Basis.
- Author
- Ying Nian Wu, Zhangzhang Si, Chuck Fleming, and Song Chun Zhu
- Published
- 2007
- Full Text
- View/download PDF
7. Using High-Level Semantic Features in Video Retrieval.
- Author
- Wujie Zheng, Jianmin Li 0001, Zhangzhang Si, Fuzong Lin, and Bo Zhang 0010
- Published
- 2006
- Full Text
- View/download PDF
8. Learning and parsing video events with goal and intent prediction.
- Author
- Mingtao Pei, Zhangzhang Si, Benjamin Z. Yao, and Song-Chun Zhu
- Published
- 2013
- Full Text
- View/download PDF
9. Learning AND-OR Templates for Object Recognition and Detection.
- Author
- Zhangzhang Si and Song-Chun Zhu
- Published
- 2013
- Full Text
- View/download PDF
10. Learning Hybrid Image Templates (HIT) by Information Projection.
- Author
- Zhangzhang Si and Song-Chun Zhu
- Published
- 2012
- Full Text
- View/download PDF
11. Learning Active Basis Model for Object Detection and Recognition.
- Author
- Ying Nian Wu, Zhangzhang Si, Haifeng Gong, and Song Chun Zhu
- Published
- 2010
- Full Text
- View/download PDF
12. Learning explicit and implicit visual manifolds by information projection.
- Author
- Song Chun Zhu, Kent Shi, and Zhangzhang Si
- Published
- 2010
- Full Text
- View/download PDF
13. Tsinghua University at TRECVID 2005.
- Author
- Jinhui Yuan, Huiyi Wang, Lan Xiao, Dong Wang 0022, Dayong Ding, Yuanyuan Zuo, Zijan Tong, Xiaobing Liu, Shuping Xu, Wujie Zheng, Xirong Li 0001, Zhangzhang Si, Jianmin Li 0001, Fuzong Lin, and Bo Zhang 0010
- Published
- 2005
14. Unsupervised learning of compositional sparse code for natural image representation
- Author
- Ying Nian Wu, Wenze Hu, Song-Chun Zhu, Zhangzhang Si, and Yi Hong
- Subjects
- Image representation, Code (cryptography), Unsupervised learning, Pattern recognition, Artificial intelligence, Applied Mathematics, Computer science
- Published
- 2013
15. Learning and parsing video events with goal and intent prediction
- Author
- Zhangzhang Si, Song-Chun Zhu, Benjamin Yao, and Mingtao Pei
- Subjects
- Parsing, Grammar, Computer science, Machine learning, Spatial relation, Rule-based machine translation, Signal Processing, Graph (abstract data type), Unsupervised learning, Synchronous context-free grammar, Computer Vision and Pattern Recognition, Artificial intelligence, Software, Information projection
- Abstract
In this paper, we present a framework for parsing video events with stochastic Temporal And-Or Graph (T-AOG) and unsupervised learning of the T-AOG from video. This T-AOG represents a stochastic event grammar. The alphabet of the T-AOG consists of a set of grounded spatial relations including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions which are specified by a number of grounded relations over image frames. An And-node represents a sequence of actions. An Or-node represents a number of alternative ways of such concatenations. The And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as the language of a stochastic context-free grammar (SCFG). For each And-node we model the temporal relations of its children nodes to distinguish events with similar structures but different temporal patterns and interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm to learn the atomic actions, the temporal relations and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG which can understand events, infer the goal of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions. (i) We represent events by a T-AOG with hierarchical compositions of events and the temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from the video data without manual supervision. 
(iii) Our algorithm infers the goal of agents and predicts their intents by a top-down process, handles event insertion and multi-agent events, keeps all possible interpretations of the video to preserve the ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework. (iv) The algorithm uses event context to improve the detection of atomic actions and to segment and recognize objects in the scene. Extensive experiments, including indoor and outdoor scenes and single- and multi-agent events, are conducted to validate the effectiveness of the proposed approach.
- Published
- 2013
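The stochastic AND-OR structure described in the abstract above can be illustrated in a few lines of Python. This is a sketch only: the event names, grammar, and probabilities below are invented for illustration and are not taken from the paper.

```python
import random

# Toy T-AOG-style grammar: Or-nodes pick one alternative by probability,
# And-nodes concatenate their children in order; anything not listed is a
# terminal (an atomic action).
GRAMMAR = {
    "Event":       ("OR",  [("UseComputer", 0.5), ("GetWater", 0.5)]),
    "UseComputer": ("AND", ["Approach", "Sit", "Type"]),
    "GetWater":    ("AND", ["Approach", "Bend", "Drink"]),
}

def sample(node):
    """Expand a node top-down into a sequence of atomic actions."""
    if node not in GRAMMAR:                 # terminal = atomic action
        return [node]
    kind, children = GRAMMAR[node]
    if kind == "OR":
        names = [c for c, _ in children]
        probs = [p for _, p in children]
        return sample(random.choices(names, weights=probs)[0])
    out = []                                # AND: concatenate children
    for child in children:
        out += sample(child)
    return out

print(sample("Event"))  # e.g. ['Approach', 'Sit', 'Type']
```

The set of sequences such a grammar can generate corresponds to the language of a stochastic context-free grammar, as the abstract notes; the paper's temporal relations between And-node children are omitted here.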
16. Learning Hybrid Image Templates (HIT) by Information Projection
- Author
- Song-Chun Zhu and Zhangzhang Si
- Subjects
- Hybrid image, Orientation (computer vision), Computer science, Applied Mathematics, Feature extraction, Feature selection, Pattern recognition, Sketch, Support vector machine, Computational Theory and Mathematics, Image texture, Discriminative model, Artificial Intelligence, Feature (computer vision), Computer vision, Computer Vision and Pattern Recognition, Software, Information projection
- Abstract
This paper presents a novel framework for learning a generative image representation, the hybrid image template (HIT), from a small number (i.e., 3 to 20) of image examples. Each learned template is composed of, typically, 50 to 500 image patches whose geometric attributes (location, scale, orientation) may adapt in a local neighborhood for deformation, and whose appearances are characterized, respectively, by four types of descriptors: local sketch (edge or bar), texture gradients with orientations, flatness regions, and colors. These heterogeneous patches are automatically ranked and selected from a large pool according to their information gains using an information projection framework. Intuitively, a patch has a higher information gain if 1) its feature statistics are consistent within the training examples and are distinctive from the statistics of negative examples (i.e., generic images or examples from other categories); and 2) its feature statistics have less intraclass variation. The learning process pursues the most informative (for either generative or discriminative purposes) patches one at a time and stops when the information gain is within statistical fluctuation. The template is associated with a well-normalized probability model that integrates the heterogeneous feature statistics. This automated feature selection procedure allows our algorithm to scale up to a wide range of image categories, from those with regular shapes to those with stochastic textures. The learned representation captures the intrinsic characteristics of the object or scene categories. We evaluate the hybrid image templates on several public benchmarks and demonstrate classification performance on par with state-of-the-art methods such as HoG+SVM; when small training sample sizes are used, the proposed system shows a clear advantage.
- Published
- 2012
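One way to illustrate the information-gain ranking in the abstract above is to score each candidate patch by the KL divergence between how often its feature fires on positive versus negative examples, keeping patches above a fluctuation threshold. This is a hypothetical simplification: the patch names, firing rates, and threshold below are invented, and the paper's actual model works on richer feature statistics.

```python
import math

def bernoulli_kl(p, q, eps=1e-6):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def select_patches(pos_stats, neg_stats, threshold=0.05):
    """Rank candidate patches by information gain (positives vs. negatives)
    and keep those whose gain exceeds a statistical-fluctuation threshold."""
    gains = {k: bernoulli_kl(pos_stats[k], neg_stats[k]) for k in pos_stats}
    ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
    return [k for k, g in ranked if g > threshold]

# Hypothetical firing rates of three candidate patches.
pos = {"edge_nose": 0.9, "texture_fur": 0.8, "flat_sky": 0.35}
neg = {"edge_nose": 0.2, "texture_fur": 0.3, "flat_sky": 0.3}
print(select_patches(pos, neg))  # -> ['edge_nose', 'texture_fur']
```

The "flat_sky" patch fires almost as often on negatives as on positives, so its gain falls within fluctuation and it is dropped, mirroring the stopping rule the abstract describes.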
17. Learning explicit and implicit visual manifolds by information projection
- Author
- Zhangzhang Si, Song-Chun Zhu, and Kent Shi
- Subjects
- Hybrid image, Cognitive neuroscience of visual object recognition, Pattern recognition, Object detection, Manifold, Artificial Intelligence, Signal Processing, Pattern recognition (psychology), Computer Vision and Pattern Recognition, Mathematical structure, Visual learning, Software, Information projection, Mathematics
- Abstract
Natural images have a vast amount of visual patterns distributed in a wide spectrum of subspaces of varying complexities and dimensions. Understanding the characteristics of these subspaces and their compositional structures is of fundamental importance for pattern modeling, learning and recognition. In this paper, we start with small image patches and define two types of atomic subspaces: explicit manifolds of low dimensions for structural primitives and implicit manifolds of high dimensions for stochastic textures. Then we present an information theoretical learning framework that derives common models for these manifolds through information projection, and study a manifold pursuit algorithm that clusters image patches into those atomic subspaces and ranks them according to their information gains. We further show how those atomic subspaces change over an image scaling process and how they are composed to form larger and more complex image patterns. Finally, we integrate the implicit and explicit manifolds to form a primal sketch model as a generic representation in early vision and to generate a hybrid image template representation for object category recognition in high level vision. The study of the mathematical structures in the image space sheds light on some basic questions in human vision, such as atomic elements in visual perception, the perceptual metrics in various manifolds, and the perceptual transitions over image scales. This paper is based on the J.K. Aggarwal Prize lecture by the first author at the International Conference on Pattern Recognition, Tampa, FL, 2008.
- Published
- 2010
18. Learning AND-OR templates for object recognition and detection
- Author
- Song-Chun Zhu and Zhangzhang Si
- Subjects
- Computer science, Applied Mathematics, Template matching, Cognitive neuroscience of visual object recognition, Pattern recognition, Object detection, Visualization, Computational Theory and Mathematics, Artificial Intelligence, Histogram, Visual Objects, Unsupervised learning, Computer vision, Computer Vision and Pattern Recognition, Software
- Abstract
This paper presents a framework for unsupervised learning of a hierarchical reconfigurable image template - the AND-OR Template (AOT) for visual objects. The AOT includes: 1) hierarchical composition as "AND" nodes, 2) deformation and articulation of parts as geometric "OR" nodes, and 3) multiple ways of composition as structural "OR" nodes. The terminal nodes are hybrid image templates (HIT) [17] that are fully generative to the pixels. We show that both the structures and parameters of the AOT model can be learned in an unsupervised way from images using an information projection principle. The learning algorithm consists of two steps: 1) a recursive block pursuit procedure to learn the hierarchical dictionary of primitives, parts, and objects, and 2) a graph compression procedure to minimize model structure for better generalizability. We investigate the factors that influence how well the learning algorithm can identify the underlying AOT, and we propose a number of ways to evaluate the performance of the learned AOTs through both synthesized examples and real-world images. Our model advances the state of the art for object detection by improving the accuracy of template matching.
- Published
- 2013
19. Structure vs. Appearance and 3D vs. 2D? A Numeric Answer
- Author
- Zhangzhang Si, Song-Chun Zhu, and Wenze Hu
- Subjects
- Structure (mathematical logic), Set (abstract data type), Computer science, Spectrum (functional analysis), Representation (systemics), Pattern recognition, Artificial intelligence, Information gain, Data dependent, Information projection, Image (mathematics)
- Abstract
This paper introduces an information projection framework to provide a numerical criterion for evaluating the information contribution of competing image representations. Such representations include structure vs. appearance, and 3D vs. 2D representations. The framework allows a heterogeneous model of mixed representations, and sequentially selects representation elements according to their information gains. Optimal representations for a given set of images can be learned automatically in this manner. Experiments on these two competing representation pairs show that the optimal representation is data dependent, and forms a spectrum across multiple images. This shows the necessity of having numerical solutions to these problems.
- Published
- 2013
20. Unsupervised learning of event AND-OR grammar and semantics from video
- Author
- Benjamin Yao, Song-Chun Zhu, Mingtao Pei, and Zhangzhang Si
- Subjects
- Parsing, Grammar, Event (computing), Computer science, Attribute grammar, Semantics, Machine learning, Rule-based machine translation, Stochastic grammar, Unsupervised learning, Synchronous context-free grammar, Artificial intelligence, Natural language processing
- Abstract
We study the problem of automatically learning an event AND-OR grammar from videos of a certain environment, e.g. an office where students conduct daily activities. We propose to learn the event grammar under the information projection and minimum description length principles in a coherent probabilistic framework, without manual supervision about what events happen and when they happen. First, a predefined set of unary and binary relations is detected for each video frame, e.g. the agent's position, pose and interaction with the environment. Then their co-occurrences are clustered into a dictionary of simple and transient atomic actions. Recursively, these actions are grouped into longer and more complex events, resulting in a stochastic event grammar. By modeling the time constraints of successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style video of an office, and present a prototype system for video analysis integrating bottom-up detection, grammatical learning and parsing. On this dataset, the learning algorithm is able to automatically discover important events and construct a stochastic grammar, which can be used to accurately parse newly observed video. The learned grammar can be used as a prior to improve the noisy bottom-up detection of atomic actions. It can also be used to infer the semantics of the scene. In general, the event grammar is an efficient way to acquire common knowledge from video.
- Published
- 2011
21. Learning Active Basis Models by EM-Type Algorithms
- Author
- Ying Nian Wu, Haifeng Gong, Zhangzhang Si, and Song-Chun Zhu
- Subjects
- Statistics and Probability, Basis (linear algebra), Computer science, General Mathematics, Supervised learning, Generative models, Cognitive neuroscience of visual object recognition, Latent variable, Object (computer science), Object recognition, Wavelet sparse coding, Set (abstract data type), Wavelet, Expectation–maximization algorithm, Statistics, Probability and Uncertainty, Algorithm
- Abstract
The EM algorithm is a convenient tool for maximum likelihood model fitting when the data are incomplete or when there are latent variables or hidden states. In this review article we explain that the EM algorithm is a natural computational scheme for learning image templates of object categories where the learning is not fully supervised. We represent an image template by an active basis model, which is a linear composition of a selected set of localized, elongated and oriented wavelet elements that are allowed to slightly perturb their locations and orientations to account for the deformations of object shapes. The model can be easily learned when the objects in the training images are of the same pose, and appear at the same location and scale. This is often called supervised learning. In the situation where the objects may appear at different unknown locations, orientations and scales in the training images, we have to incorporate the unknown locations, orientations and scales as latent variables into the image generation process, and learn the template by EM-type algorithms. The E-step imputes the unknown locations, orientations and scales based on the currently learned template. This step can be considered self-supervision, which involves using the current template to recognize the objects in the training images. The M-step then relearns the template based on the imputed locations, orientations and scales, and this is essentially the same as supervised learning. So the EM learning process iterates between recognition and supervised learning. We illustrate this scheme by several experiments. (Published in Statistical Science by the Institute of Mathematical Statistics; http://dx.doi.org/10.1214/09-STS281.)
- Published
- 2011
- Full Text
- View/download PDF
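The E-step/M-step iteration in the abstract above can be shown on a toy 1-D analogue, where the latent variable is reduced to an unknown integer shift of each signal. This is a sketch under that drastic simplification, not the authors' implementation: real active basis learning imputes location, orientation, and scale of wavelet elements.

```python
# Toy EM-type template learning: each "image" is a 1-D signal containing
# the same bump at an unknown shift (the latent variable).
def learn_template(signals, width, iters=10):
    template = signals[0][:width]          # crude initialization
    for _ in range(iters):
        # E-step: impute the best shift of each signal given the template
        # (self-supervision: recognize the template in each signal).
        shifts = []
        for s in signals:
            scores = [sum(a * b for a, b in zip(s[d:d + width], template))
                      for d in range(len(s) - width + 1)]
            shifts.append(scores.index(max(scores)))
        # M-step: relearn the template by averaging the aligned windows
        # (essentially supervised learning on the imputed alignments).
        template = [sum(s[d + i] for s, d in zip(signals, shifts)) / len(signals)
                    for i in range(width)]
    return template, shifts

signals = [[0, 0, 1, 3, 1, 0, 0],
           [0, 1, 3, 1, 0, 0, 0],
           [0, 0, 0, 1, 3, 1, 0]]
template, shifts = learn_template(signals, width=3)
print(shifts)    # -> [1, 0, 2]
print(template)  # -> [0.0, 1.0, 3.0]
```

After one iteration the imputed shifts align all three signals on the shared bump and the averaged template sharpens, after which the iteration is stable, mirroring the recognition/supervised-learning loop the abstract describes.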
22. Learning Active Basis Model for Object Detection and Recognition
- Author
- Zhangzhang Si, Ying Nian Wu, Haifeng Gong, and Song-Chun Zhu
- Subjects
- Artificial Intelligence (incl. Robotics), Sum maps and max maps, Image processing, Pattern Recognition, Gabor filter, Shared sketch algorithm, Computer vision, Mathematics, Basis (linear algebra), Orientation (computer vision), Template matching, Gabor wavelet, Computer Imaging, Vision, Pattern Recognition and Graphics, Object detection, Wavelet sparse coding, Image Processing and Computer Vision, Generative model, Deformable template, Computer Science, Computer Vision and Pattern Recognition, Software
- Abstract
This article proposes an active basis model, a shared sketch algorithm, and a computational architecture of sum-max maps for representing, learning, and recognizing deformable templates. In our generative model, a deformable template is in the form of an active basis, which consists of a small number of Gabor wavelet elements at selected locations and orientations. These elements are allowed to slightly perturb their locations and orientations before they are linearly combined to generate the observed image. The active basis model, in particular, the locations and the orientations of the basis elements, can be learned from training images by the shared sketch algorithm. The algorithm selects the elements of the active basis sequentially from a dictionary of Gabor wavelets. When an element is selected at each step, the element is shared by all the training images, and the element is perturbed to encode or sketch a nearby edge segment in each training image. The recognition of the deformable template from an image can be accomplished by a computational architecture that alternates the sum maps and the max maps. The computation of the max maps deforms the active basis to match the image data, and the computation of the sum maps scores the template matching by the log-likelihood of the deformed active basis.
- Published
- 2010
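The sum-max architecture in the abstract above alternates a local MAX (absorbing each element's small perturbations, i.e., deforming the template) with a SUM over elements at their nominal offsets (scoring the match). A 1-D toy version follows; the response arrays are made-up numbers standing in for Gabor filter scores, not output of the paper's system.

```python
def max_map(responses, radius):
    """MAX map: local max of an element's responses, absorbing
    perturbations of up to `radius` positions."""
    n = len(responses)
    return [max(responses[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]

def sum_map(element_responses, offsets, radius):
    """SUM map: at each candidate template origin, add the MAX-pooled
    response of every basis element at its nominal offset."""
    maxed = [max_map(r, radius) for r in element_responses]
    n = len(element_responses[0])
    scores = []
    for x in range(n):
        s = 0.0
        for m, off in zip(maxed, offsets):
            if 0 <= x + off < n:
                s += m[x + off]
        scores.append(s)
    return scores

# Two toy elements expected at offsets 0 and 2 from the template origin.
r1 = [0, 5, 0, 0, 0, 0]   # element 1 responds near position 1
r2 = [0, 0, 0, 4, 0, 0]   # element 2 responds near position 3
scores = sum_map([r1, r2], offsets=[0, 2], radius=1)
print(scores)  # -> [9.0, 9.0, 9.0, 0.0, 0.0, 0.0]
```

Only origins where both elements can be found within their perturbation radius score high; elsewhere the template-matching score collapses, which is the log-likelihood scoring role the abstract assigns to the sum maps.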
23. Wavelet, active basis, and shape script
- Author
- Zhangzhang Si and Ying Nian Wu
- Subjects
- Basis (linear algebra), Computer science, Generalization, Pattern recognition, Set (abstract data type), Generative model, Wavelet, Active shape model, Artificial intelligence, Neural coding, Representation (mathematics)
- Abstract
Sparse coding is a key principle that underlies wavelet representation of natural images. In this paper, we explain that the effort of seeking a common wavelet sparse coding of images from the same object category leads to an active basis model, where the images share the same set of selected wavelet elements, which form a linear basis for representing the images. The selected wavelet elements are allowed to perturb their locations and orientations to account for shape deformations, so that the basis becomes active, and the active basis serves as a mathematical representation of a deformable template. We show that a recursive application of the strategy underlying the active basis model leads to a shape script model, which is a composition of shape motifs such as ellipsoids, parallel bars, angles, etc. These shape motifs are allowed to change their locations, orientations, scales and aspect ratios, and the shape motifs themselves are modeled by active bases. Compared to the active basis model, the shape script model is a sparser representation and therefore has stronger generalization power. It can also be considered another layer of sparse coding of the selected wavelet elements that themselves provide sparse coding of the image intensities.
- Published
- 2010
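The wavelet sparse coding that the abstract above takes as its starting point is commonly computed by matching pursuit: greedily select the dictionary element most correlated with the residual, subtract its contribution, and repeat. The sketch below uses a toy orthonormal dictionary as a stand-in for localized, oriented wavelet elements; it is a generic illustration, not the paper's method.

```python
def matching_pursuit(signal, dictionary, n_atoms):
    """Greedy sparse coding: pick the atom most correlated with the
    residual, record its coefficient, subtract, and repeat."""
    residual = list(signal)
    code = []                                  # (atom index, coefficient)
    for _ in range(n_atoms):
        # atoms are assumed unit-norm, so correlation = coefficient
        corrs = [sum(a * r for a, r in zip(atom, residual))
                 for atom in dictionary]
        best = max(range(len(corrs)), key=lambda i: abs(corrs[i]))
        coef = corrs[best]
        residual = [r - coef * a for r, a in zip(residual, dictionary[best])]
        code.append((best, coef))
    return code, residual

# Toy orthonormal dictionary: four "one-hot wavelets".
D = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
code, residual = matching_pursuit([0.0, 2.0, 0.0, -1.0], D, n_atoms=2)
print(code)      # -> [(1, 2.0), (3, -1.0)]
print(residual)  # -> [0.0, 0.0, 0.0, 0.0]
```

The active basis model adds what this sketch lacks: the selected elements are shared across all training images of a category and may perturb their locations and orientations per image.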
24. Using High-Level Semantic Features in Video Retrieval.
- Author
- Sundaram, Hari, Naphade, Milind, Smith, John R., Yong Rui, Wujie Zheng, Jianmin Li, Zhangzhang Si, Fuzong Lin, and Bo Zhang
- Abstract
Extraction and utilization of high-level semantic features are critical for more effective video retrieval. However, the performance of video retrieval has not benefited much despite the advances in high-level feature extraction. To make good use of high-level semantic features in video retrieval, we present a method called the pointwise mutual information weighted scheme (PMIWS). The method makes a good judgment of the relevance of all the semantic features to the queries, taking the characteristics of semantic features into account. The method can also be extended to the fusion of multiple modalities. Experimental results based on the TRECVID 2005 corpus demonstrate the effectiveness of the method.
- Published
- 2006
- Full Text
- View/download PDF
25. UNSUPERVISED LEARNING OF COMPOSITIONAL SPARSE CODE FOR NATURAL IMAGE REPRESENTATION.
- Author
- YI HONG, WENZE HU, SONG-CHUN ZHU, ZHANGZHANG SI, and YING NIAN WU
- Subjects
- IMAGE representation, WAVELETS (Mathematics), RADIAL basis functions, LEARNING, PIXELS
- Abstract
This article proposes an unsupervised method for learning a compositional sparse code for representing natural images. Our method is built upon the original sparse coding framework, where there is a dictionary of basis functions, often in the form of localized, elongated and oriented wavelets, so that each image can be represented by a linear combination of a small number of basis functions automatically selected from the dictionary. In our compositional sparse code, the representational units are composite: they are compositional patterns formed by the basis functions. These compositional patterns can be viewed as shape templates. We propose an unsupervised learning method for learning a dictionary of frequently occurring templates from training images, so that each training image can be represented by a small number of templates automatically selected from the learned dictionary. The compositional sparse code approximates the raw image, a large number of pixel intensities, using a small number of templates, thus facilitating the signal-to-symbol transition and allowing a symbolic description of the image. The current form of our model consists of two layers of representational units (basis functions and shape templates). It is possible to extend it to multiple layers of hierarchy. Experiments show that our method is capable of learning meaningful compositional sparse code, and the learned templates are useful for image classification.
- Published
- 2014
26. Learning Active Basis Models by EM-Type Algorithms.
- Author
- Zhangzhang Si, Haifeng Gong, Song-Chun Zhu, and Ying Nian Wu
- Subjects
- EXPECTATION-maximization algorithms, MAXIMUM likelihood statistics, LEARNING, WAVELETS (Mathematics), PERTURBATION theory
- Abstract
The EM algorithm is a convenient tool for maximum likelihood model fitting when the data are incomplete or when there are latent variables or hidden states. In this review article we explain that the EM algorithm is a natural computational scheme for learning image templates of object categories where the learning is not fully supervised. We represent an image template by an active basis model, which is a linear composition of a selected set of localized, elongated and oriented wavelet elements that are allowed to slightly perturb their locations and orientations to account for the deformations of object shapes. The model can be easily learned when the objects in the training images are of the same pose, and appear at the same location and scale. This is often called supervised learning. In the situation where the objects may appear at different unknown locations, orientations and scales in the training images, we have to incorporate the unknown locations, orientations and scales as latent variables into the image generation process, and learn the template by EM-type algorithms. The E-step imputes the unknown locations, orientations and scales based on the currently learned template. This step can be considered self-supervision, which involves using the current template to recognize the objects in the training images. The M-step then relearns the template based on the imputed locations, orientations and scales, and this is essentially the same as supervised learning. So the EM learning process iterates between recognition and supervised learning. We illustrate this scheme by several experiments.
- Published
- 2010
- Full Text
- View/download PDF
Discovery Service for Jio Institute Digital Library