15 results on '"Zhangzhang Si"'
Search Results
2. Learning and parsing video events with goal and intent prediction
- Author
-
Zhangzhang Si, Song-Chun Zhu, Benjamin Yao, and Mingtao Pei
- Subjects
Parsing, Grammar, Computer science, Machine learning, Spatial relation, Rule-based machine translation, Signal Processing, Graph (abstract data type), Unsupervised learning, Synchronous context-free grammar, Computer Vision and Pattern Recognition, Artificial intelligence, Software, Information projection - Abstract
In this paper, we present a framework for parsing video events with stochastic Temporal And-Or Graph (T-AOG) and unsupervised learning of the T-AOG from video. This T-AOG represents a stochastic event grammar. The alphabet of the T-AOG consists of a set of grounded spatial relations including the poses of agents and their interactions with objects in the scene. The terminal nodes of the T-AOG are atomic actions which are specified by a number of grounded relations over image frames. An And-node represents a sequence of actions. An Or-node represents a number of alternative ways of such concatenations. The And-Or nodes in the T-AOG can generate a set of valid temporal configurations of atomic actions, which can be equivalently represented as the language of a stochastic context-free grammar (SCFG). For each And-node we model the temporal relations of its children nodes to distinguish events with similar structures but different temporal patterns and interpolate missing portions of events. This makes the T-AOG grammar context-sensitive. We propose an unsupervised learning algorithm to learn the atomic actions, the temporal relations and the And-Or nodes under the information projection principle in a coherent probabilistic framework. We also propose an event parsing algorithm based on the T-AOG which can understand events, infer the goal of agents, and predict their plausible intended actions. In comparison with existing methods, our paper makes the following contributions. (i) We represent events by a T-AOG with hierarchical compositions of events and the temporal relations between the sub-events. (ii) We learn the grammar, including atomic actions and temporal relations, automatically from the video data without manual supervision. 
(iii) Our algorithm infers the goal of agents and predicts their intents by a top-down process, handles event insertions and multi-agent events, keeps all possible interpretations of the video to preserve the ambiguities, and achieves the globally optimal parsing solution in a Bayesian framework. (iv) The algorithm uses event context to improve the detection of atomic actions and to segment and recognize objects in the scene. Extensive experiments, covering indoor and outdoor scenes with single- and multi-agent events, are conducted to validate the effectiveness of the proposed approach.
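The And-Or structure described in this abstract can be illustrated with a minimal sketch: And-nodes concatenate their children's sub-events in order, and Or-nodes choose among alternative branches, so the graph generates a set of valid temporal configurations (the language of the grammar). The node and event names below are hypothetical, not from the paper, and probabilities and temporal relations are omitted.

```python
# Minimal And-Or sketch: And-nodes concatenate, Or-nodes choose.
# Event names ("fetch_cup", etc.) are made up for illustration.

def configurations(node):
    """Enumerate all terminal sequences (valid temporal configurations)."""
    kind, children = node
    if kind == "terminal":
        return [[children]]                      # a single atomic action
    if kind == "and":                            # concatenate child sequences
        seqs = [[]]
        for child in children:
            seqs = [s + t for s in seqs for t in configurations(child)]
        return seqs
    if kind == "or":                             # union over the alternatives
        return [s for child in children for s in configurations(child)]
    raise ValueError(kind)

# "make drink" = fetch cup, then (pour water OR use dispenser), then drink
make_drink = ("and", [
    ("terminal", "fetch_cup"),
    ("or", [("terminal", "pour_water"), ("terminal", "use_dispenser")]),
    ("terminal", "drink"),
])
print(configurations(make_drink))
```

Each Or-branch yields one configuration, so this toy event has exactly two valid temporal orderings.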
- Published
- 2013
3. Learning Hybrid Image Templates (HIT) by Information Projection
- Author
-
Song-Chun Zhu and Zhangzhang Si
- Subjects
Hybrid image, Orientation (computer vision), Computer science, Applied Mathematics, Feature extraction, Feature selection, Pattern recognition, Sketch, Support vector machine, Computational Theory and Mathematics, Image texture, Discriminative model, Feature (computer vision), Computer vision, Computer Vision and Pattern Recognition, Artificial intelligence, Software, Information projection - Abstract
This paper presents a novel framework for learning a generative image representation, the hybrid image template (HIT), from a small number (i.e., 3 to 20) of image examples. Each learned template is composed of, typically, 50 to 500 image patches whose geometric attributes (location, scale, orientation) may adapt in a local neighborhood for deformation, and whose appearances are characterized, respectively, by four types of descriptors: local sketch (edge or bar), texture gradients with orientations, flatness regions, and colors. These heterogeneous patches are automatically ranked and selected from a large pool according to their information gains using an information projection framework. Intuitively, a patch has a higher information gain if 1) its feature statistics are consistent within the training examples and are distinctive from the statistics of negative examples (i.e., generic images or examples from other categories); and 2) its feature statistics have less intraclass variation. The learning process pursues the most informative (for either generative or discriminative purposes) patches one at a time and stops when the information gain is within statistical fluctuation. The template is associated with a well-normalized probability model that integrates the heterogeneous feature statistics. This automated feature selection procedure allows our algorithm to scale up to a wide range of image categories, from those with regular shapes to those with stochastic textures. The learned representation captures the intrinsic characteristics of the object or scene categories. We evaluate the hybrid image templates on several public benchmarks and demonstrate classification performance on par with state-of-the-art methods like HoG+SVM; when small training sample sizes are used, the proposed system shows a clear advantage.
- Published
- 2012
4. Learning explicit and implicit visual manifolds by information projection
- Author
-
Zhangzhang Si, Song-Chun Zhu, and Kent Shi
- Subjects
Hybrid image, Cognitive neuroscience of visual object recognition, Pattern recognition, Object detection, Manifold, Artificial Intelligence, Signal Processing, Pattern recognition (psychology), Computer Vision and Pattern Recognition, Mathematical structure, Visual learning, Software, Information projection, Mathematics - Abstract
Natural images have a vast amount of visual patterns distributed in a wide spectrum of subspaces of varying complexities and dimensions. Understanding the characteristics of these subspaces and their compositional structures is of fundamental importance for pattern modeling, learning and recognition. In this paper, we start with small image patches and define two types of atomic subspaces: explicit manifolds of low dimensions for structural primitives and implicit manifolds of high dimensions for stochastic textures. Then we present an information theoretical learning framework that derives common models for these manifolds through information projection, and study a manifold pursuit algorithm that clusters image patches into those atomic subspaces and ranks them according to their information gains. We further show how those atomic subspaces change over an image scaling process and how they are composed to form larger and more complex image patterns. Finally, we integrate the implicit and explicit manifolds to form a primal sketch model as a generic representation in early vision and to generate a hybrid image template representation for object category recognition in high level vision. The study of the mathematical structures in the image space sheds light on some basic questions in human vision, such as atomic elements in visual perception, the perceptual metrics in various manifolds, and the perceptual transitions over image scales. This paper is based on the J.K. Aggarwal Prize lecture by the first author at the International Conference on Pattern Recognition, Tampa, FL, 2008.
- Published
- 2010
5. Learning AND-OR templates for object recognition and detection
- Author
-
Song-Chun Zhu and Zhangzhang Si
- Subjects
Computer science, Applied Mathematics, Template matching, Cognitive neuroscience of visual object recognition, Pattern recognition, Object detection, Visualization, Computational Theory and Mathematics, Artificial Intelligence, Histogram, Visual objects, Unsupervised learning, Computer vision, Computer Vision and Pattern Recognition, Software - Abstract
This paper presents a framework for unsupervised learning of a hierarchical reconfigurable image template, the AND-OR Template (AOT), for visual objects. The AOT includes: 1) hierarchical composition as "AND" nodes, 2) deformation and articulation of parts as geometric "OR" nodes, and 3) multiple ways of composition as structural "OR" nodes. The terminal nodes are hybrid image templates (HIT) [17] that are fully generative to the pixels. We show that both the structures and parameters of the AOT model can be learned in an unsupervised way from images using an information projection principle. The learning algorithm consists of two steps: 1) a recursive block pursuit procedure to learn the hierarchical dictionary of primitives, parts, and objects, and 2) a graph compression procedure to minimize model structure for better generalizability. We investigate the factors that influence how well the learning algorithm can identify the underlying AOT, and we propose a number of ways to evaluate the performance of the learned AOTs through both synthesized examples and real-world images. Our model advances the state of the art for object detection by improving the accuracy of template matching.
- Published
- 2013
6. Structure vs. Appearance and 3D vs. 2D? A Numeric Answer
- Author
-
Zhangzhang Si, Song-Chun Zhu, and Wenze Hu
- Subjects
Structure (mathematical logic), Set (abstract data type), Computer science, Spectrum (functional analysis), Representation (systemics), Pattern recognition, Artificial intelligence, Information gain, Data dependent, Information projection, Image (mathematics) - Abstract
This paper introduces an information projection framework to provide a numerical criterion for evaluating the information contribution of competing image representations. Such representations include structure vs. appearance, and 3D vs. 2D representations. The framework allows a heterogeneous model of mixed representations, and sequentially selects representation elements according to their information gains. Optimal representations for a given set of images can be learned automatically in this manner. Experiments on these two competing representation pairs show that the optimal representation is data dependent, and forms a spectrum across multiple images. This shows the necessity of having numerical solutions to these problems.
- Published
- 2013
7. Unsupervised learning of event AND-OR grammar and semantics from video
- Author
-
Benjamin Yao, Song-Chun Zhu, Mingtao Pei, and Zhangzhang Si
- Subjects
Parsing, Grammar, Event (computing), Computer science, Attribute grammar, Semantics, Machine learning, Rule-based machine translation, Stochastic grammar, Unsupervised learning, Synchronous context-free grammar, Artificial intelligence, Natural language processing - Abstract
We study the problem of automatically learning event AND-OR grammar from videos of a certain environment, e.g. an office where students conduct daily activities. We propose to learn the event grammar under the information projection and minimum description length principles in a coherent probabilistic framework, without manual supervision about what events happen and when they happen. First, a predefined set of unary and binary relations is detected for each video frame: e.g. the agent's position, pose and interaction with the environment. Then their co-occurrences are clustered into a dictionary of simple and transient atomic actions. Recursively, these actions are grouped into longer and more complex events, resulting in a stochastic event grammar. By modeling time constraints of successive events, the learned grammar becomes context-sensitive. We introduce a new dataset of surveillance-style video of an office, and present a prototype system for video analysis integrating bottom-up detection, grammatical learning and parsing. On this dataset, the learning algorithm is able to automatically discover important events and construct a stochastic grammar, which can be used to accurately parse newly observed video. The learned grammar can be used as a prior to improve the noisy bottom-up detection of atomic actions. It can also be used to infer semantics of the scene. In general, the event grammar is an efficient way for common knowledge acquisition from video.
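The first learning stage in this abstract, grouping per-frame relation detections by co-occurrence into candidate atomic actions, can be roughed out as below. The relation names and the simple frequency threshold are hypothetical; the paper's actual clustering is probabilistic, under information projection and MDL, not a plain count cutoff.

```python
# Rough sketch: frequent co-occurring relation sets become candidate atomic
# actions; rare combinations are dropped.  Relation names are made up.
from collections import Counter

def atomic_actions(frames, min_count):
    """frames: list of sets of detected relations per frame.
    Returns relation combinations seen at least min_count times."""
    counts = Counter(frozenset(f) for f in frames)
    return {combo for combo, c in counts.items() if c >= min_count}

frames = [
    {"near_dispenser", "arm_extended"},   # fetching water
    {"near_dispenser", "arm_extended"},
    {"sitting", "near_desk"},             # working at desk
    {"sitting", "near_desk"},
    {"sitting", "near_desk"},
    {"standing"},                         # too rare, discarded
]
print(atomic_actions(frames, min_count=2))
```

The surviving clusters would then serve as the terminal symbols that the grammar-learning stage composes into longer events.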
- Published
- 2011
8. Unsupervised learning of stochastic AND-OR templates for object modeling
- Author
-
Zhangzhang Si and Song-Chun Zhu
- Subjects
Hierarchy (mathematics), Computer science, Visual dictionary, Pattern recognition, Visualization, Image (mathematics), Visual objects, Object model, Unsupervised learning, Artificial intelligence - Abstract
This paper presents a framework for unsupervised learning of a hierarchical generative image model called the AND-OR Template (AOT) for visual objects. The AOT includes: (1) hierarchical composition as "AND" nodes, (2) deformation of parts as continuous "OR" nodes, and (3) multiple ways of composition as discrete "OR" nodes. These AND/OR nodes form the hierarchical visual dictionary. We show that both the structure and parameters of the AOT model can be learned in an unsupervised way from example images using an information projection principle. The learning algorithm consists of two steps: i) a recursive Block-Pursuit procedure to learn the hierarchical dictionary of primitives, parts and objects, which form leaf nodes, AND nodes and structural OR nodes, and ii) a Graph-Compression operation to minimize model structure for better generalizability, which produces additional OR nodes across the compositional hierarchy. We investigate the conditions under which the learning algorithm can identify (i.e., recover) an underlying AOT that generates the data, and evaluate the performance of our learning algorithm through both artificial and real examples.
- Published
- 2011
9. Learning Active Basis Models by EM-Type Algorithms
- Author
-
Ying Nian Wu, Haifeng Gong, Zhangzhang Si, and Song-Chun Zhu
- Subjects
Statistics and Probability, Basis (linear algebra), Computer science, General Mathematics, Supervised learning, Generative models, Cognitive neuroscience of visual object recognition, Latent variable, Object (computer science), Object recognition, Wavelet sparse coding, Set (abstract data type), Wavelet, Expectation-maximization algorithm, Statistics, Probability and Uncertainty, Algorithm - Abstract
The EM algorithm is a convenient tool for maximum likelihood model fitting when the data are incomplete or when there are latent variables or hidden states. In this review article we explain that the EM algorithm is a natural computational scheme for learning image templates of object categories where the learning is not fully supervised. We represent an image template by an active basis model, which is a linear composition of a selected set of localized, elongated and oriented wavelet elements that are allowed to slightly perturb their locations and orientations to account for the deformations of object shapes. The model can be easily learned when the objects in the training images are of the same pose, and appear at the same location and scale. This is often called supervised learning. In the situation where the objects may appear at different unknown locations, orientations and scales in the training images, we have to incorporate the unknown locations, orientations and scales as latent variables into the image generation process, and learn the template by EM-type algorithms. The E-step imputes the unknown locations, orientations and scales based on the currently learned template. This step can be considered self-supervision, which involves using the current template to recognize the objects in the training images. The M-step then relearns the template based on the imputed locations, orientations and scales, and this is essentially the same as supervised learning. So the EM learning process iterates between recognition and supervised learning. We illustrate this scheme by several experiments. Published in Statistical Science (http://dx.doi.org/10.1214/09-STS281) by the Institute of Mathematical Statistics.
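The E-step/M-step alternation described here can be shown in a 1-D toy: the E-step imputes each signal's unknown shift by matching it against the current template (recognition), and the M-step re-averages the aligned signals (supervised learning). The signals and the dot-product matching score are illustrative only; the paper's model uses wavelet elements and log-likelihood scores.

```python
# 1-D toy of the EM-type scheme: impute unknown shifts (E-step), then
# relearn the template from the aligned signals (M-step).

def best_shift(template, signal):
    """E-step for one signal: the shift maximizing overlap with the template."""
    k = len(template)
    scores = {s: sum(a * b for a, b in zip(template, signal[s:s + k]))
              for s in range(len(signal) - k + 1)}
    return max(scores, key=scores.get)

def em_learn(signals, template, iters=5):
    k = len(template)
    for _ in range(iters):
        shifts = [best_shift(template, x) for x in signals]           # E-step
        template = [sum(x[s + i] for x, s in zip(signals, shifts)) / len(signals)
                    for i in range(k)]                                # M-step
    return template, shifts

# three signals, each containing the same "bump" at a different position
signals = [[0, 0, 1, 2, 1, 0], [1, 2, 1, 0, 0, 0], [0, 1, 2, 1, 0, 0]]
template, shifts = em_learn(signals, template=[1.0, 1.0, 1.0])
print(template, shifts)
```

Starting from a flat template, the iteration locks onto the bump in each signal and the template converges to the shared shape, mirroring the recognition/relearning loop in the abstract.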
- Published
- 2011
- Full Text
- View/download PDF
10. Learning Active Basis Model for Object Detection and Recognition
- Author
-
Zhangzhang Si, Ying Nian Wu, Haifeng Gong, and Song-Chun Zhu
- Subjects
Artificial Intelligence (incl. Robotics), Sum maps and max maps, Image processing, Pattern recognition, Gabor filter, Shared sketch algorithm, Computer vision, Mathematics, Basis (linear algebra), Orientation (computer vision), Template matching, Gabor wavelet, Computer Imaging, Vision, Pattern Recognition and Graphics, Object detection, Wavelet sparse coding, Image Processing and Computer Vision, Generative model, Deformable template, Artificial intelligence, Computer Vision and Pattern Recognition, Software - Abstract
This article proposes an active basis model, a shared sketch algorithm, and a computational architecture of sum-max maps for representing, learning, and recognizing deformable templates. In our generative model, a deformable template is in the form of an active basis, which consists of a small number of Gabor wavelet elements at selected locations and orientations. These elements are allowed to slightly perturb their locations and orientations before they are linearly combined to generate the observed image. The active basis model, in particular, the locations and the orientations of the basis elements, can be learned from training images by the shared sketch algorithm. The algorithm selects the elements of the active basis sequentially from a dictionary of Gabor wavelets. When an element is selected at each step, the element is shared by all the training images, and the element is perturbed to encode or sketch a nearby edge segment in each training image. The recognition of the deformable template from an image can be accomplished by a computational architecture that alternates the sum maps and the max maps. The computation of the max maps deforms the active basis to match the image data, and the computation of the sum maps scores the template matching by the log-likelihood of the deformed active basis.
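The sum-max architecture in this abstract can be caricatured in 1-D: a MAX map takes the local maximum of filter responses over the allowed perturbations (this is what deforms the template), and the SUM map adds the maxed responses at the template's element positions to score the match. The response values and positions below are made-up numbers, not Gabor filter outputs.

```python
# 1-D caricature of sum-max maps: MAX over perturbations, SUM over elements.
# Responses and element positions are illustrative.

def max_map(responses, radius):
    """MAX1: local max over positions within +/- radius (allowed perturbation)."""
    n = len(responses)
    return [max(responses[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]

def sum_score(max1, element_positions):
    """SUM2: template matching score = sum of maxed responses at elements."""
    return sum(max1[p] for p in element_positions)

responses = [0.1, 0.9, 0.2, 0.0, 0.8, 0.1, 0.0]
m1 = max_map(responses, radius=1)
score = sum_score(m1, element_positions=[1, 4])
print(m1, score)
```

The max step lets each element claim the strongest response within its perturbation window, so the template matches even when edges shift slightly, which is exactly the deformation-tolerance role the abstract assigns to the max maps.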
- Published
- 2010
11. Learning mixed templates for object recognition
- Author
-
Zhangzhang Si, Haifeng Gong, Ying Nian Wu, and Song-Chun Zhu
- Subjects
Contextual image classification, Template matching, Feature extraction, Cognitive neuroscience of visual object recognition, Pattern recognition, Generative model, Gabor filter, Image texture, Projection pursuit, Computer vision, Artificial intelligence, Mathematics - Abstract
This article proposes a method for learning object templates composed of local sketches and local textures, and investigates the relative importance of the sketches and textures for different object categories. Local sketches and local textures in the object templates account for shapes and appearances respectively. Both local sketches and local textures are extracted from the maps of Gabor filter responses. The local sketches are captured by the local maxima of Gabor responses, where the local maximum pooling accounts for shape deformations in objects. The local textures are captured by the local averages of Gabor filter responses, where the local average pooling extracts texture information for appearances. The selection of local sketch variables and local texture variables can be accomplished by a projection pursuit type of learning process, where both types of variables can be compared and merged within a common framework. The learning process returns a generative model for image intensities from a relatively small number of training images. The recognition or classification by template matching can then be based on log-likelihood ratio scores. We apply the learning method to a variety of object and texture categories. The results show that both the sketches and textures are useful for classification, and they complement each other.
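The two pooling operations contrasted in this abstract are easy to see side by side: local MAX pooling of filter responses yields a sketch feature (sensitive to one strong, localized edge), while local AVERAGE pooling yields a texture feature (sensitive to overall response energy). The response profiles below are illustrative toy numbers, not real Gabor outputs.

```python
# Local max pooling (sketch) vs. local average pooling (texture).
# Response values are made up for illustration.

def local_max(responses, i, radius):
    lo, hi = max(0, i - radius), min(len(responses), i + radius + 1)
    return max(responses[lo:hi])        # sketch feature: strongest nearby edge

def local_avg(responses, i, radius):
    lo, hi = max(0, i - radius), min(len(responses), i + radius + 1)
    window = responses[lo:hi]
    return sum(window) / len(window)    # texture feature: average energy

edge_like    = [0.0, 0.0, 1.0, 0.0, 0.0]   # one strong, well-localized response
texture_like = [0.4, 0.5, 0.4, 0.5, 0.4]   # many moderate responses

print(local_max(edge_like, 2, 1), local_avg(edge_like, 2, 1))
print(local_max(texture_like, 2, 1), local_avg(texture_like, 2, 1))
```

The edge-like profile scores high under max pooling but low under averaging, and the texture-like profile the reverse, which is why the two variable types complement each other in the mixed template.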
- Published
- 2009
12. Deformable Template As Active Basis
- Author
-
Ying Nian Wu, Zhangzhang Si, C. Fleming, and Song-Chun Zhu
- Subjects
Basis (linear algebra), Gabor wavelet, Cognitive neuroscience of visual object recognition, Wavelet transform, Pattern recognition, Generative model, Wavelet, Encoding (memory), Artificial intelligence, Neural coding, Mathematics - Abstract
This article proposes an active basis model and a shared pursuit algorithm for learning deformable templates from image patches of various object categories. In our generative model, a deformable template is in the form of an active basis, which consists of a small number of Gabor wavelet elements at different locations and orientations. These elements are allowed to slightly perturb their locations and orientations before they are linearly combined to generate each individual training or testing example. The active basis model can be learned from training image patches by the shared pursuit algorithm. The algorithm selects the elements of the active basis sequentially from a dictionary of Gabor wavelets. When an element is selected at each step, the element is shared by all the training examples, in the sense that a perturbed version of this element is added to improve the encoding of each example. Our model and algorithm are developed within a probabilistic framework that naturally embraces wavelet sparse coding and random field.
- Published
- 2007
13. Unsupervised learning of compositional sparse code for natural image representation
- Author
-
Yi Hong, Wenze Hu, Song-Chun Zhu, Zhangzhang Si, and Ying Nian Wu
- Subjects
Image representation, Wavelets (mathematics), Radial basis functions, Learning, Pixels - Abstract
This article proposes an unsupervised method for learning compositional sparse code for representing natural images. Our method is built upon the original sparse coding framework where there is a dictionary of basis functions often in the form of localized, elongated and oriented wavelets, so that each image can be represented by a linear combination of a small number of basis functions automatically selected from the dictionary. In our compositional sparse code, the representational units are composite: they are compositional patterns formed by the basis functions. These compositional patterns can be viewed as shape templates. We propose an unsupervised learning method for learning a dictionary of frequently occurring templates from training images, so that each training image can be represented by a small number of templates automatically selected from the learned dictionary. The compositional sparse code approximates the raw image of a large number of pixel intensities using a small number of templates, thus facilitating the signal-to-symbol transition and allowing a symbolic description of the image. The current form of our model consists of two layers of representational units (basis functions and shape templates). It is possible to extend it to multiple layers of hierarchy. Experiments show that our method is capable of learning meaningful compositional sparse code, and the learned templates are useful for image classification.
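The sparse-coding selection step that this work builds on can be sketched with plain matching pursuit: greedily pick the dictionary element most correlated with the current residual, subtract its contribution, and repeat. The orthonormal toy dictionary and signal below are invented for illustration; the paper's dictionary consists of learned wavelets and composite shape templates.

```python
# Toy matching pursuit over an orthonormal dictionary (a simplification of
# the sparse-coding selection underlying the compositional code).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matching_pursuit(signal, dictionary, n_select):
    """Greedily select n_select basis functions; return names and residual."""
    residual = list(signal)
    selected = []
    for _ in range(n_select):
        # pick the element with the largest |<residual, basis>|
        name, basis = max(dictionary.items(),
                          key=lambda kv: abs(dot(residual, kv[1])))
        coeff = dot(residual, basis)
        residual = [r - coeff * b for r, b in zip(residual, basis)]
        selected.append(name)
    return selected, residual

dictionary = {"e0": [1.0, 0.0, 0.0], "e1": [0.0, 1.0, 0.0], "e2": [0.0, 0.0, 1.0]}
selected, residual = matching_pursuit([3.0, 0.5, 0.0], dictionary, n_select=2)
print(selected, residual)
```

In the compositional code, the selectable units would be whole templates (groups of basis functions) rather than single elements, but the greedy select-and-subtract loop is the same idea.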
- Published
- 2014
Discovery Service for Jio Institute Digital Library