Aponyi, Ilona, Izsányi, Gizella, Sótonyi, János, Varga, Krisztina, Vendrei, Zsolt, and Leonardis, Aleš
We investigate a framework for visual object categorization in artificial cognitive systems that is based on the discovery of appearance and structural similarities between object exemplars and prototypes that represent a category. Our main contribution is a novel approach for visual categorization of objects by synchronous hierarchical matching to a prototype, where high-level matches between an object and a prototype are gradually discovered through several steps of binding, selection and inhibition. We show that categorization can be achieved without an excessive collection of evidence or learning from examples. Instead, by discovering commonalities between an object and a prototype, categorization can be based on the notion of ``family resemblances'' that does not require an explicit modeling of features to define a category. We combine the successful methods for categorization based on local features with a flexible and general prototype matching framework. This framework balances the prevalent dependence of local methods on patch appearance by gradually constructing high-level features in a hierarchical feature binding process: the descriptive power of features increases from localized features with little or no structural information to complex composite features that span wider areas, until they encompass the whole object. We start with a low-level description, which defines relatively stable local regions of interest (ROI) based on their appearance. Features are then dynamically constructed, or activated, in order to search for possible matches between the view being interpreted and a prototype view of an object. The composite features describe geometrical and photometric properties of a local area that expands until a high-level focused response, with a receptive field that potentially covers the whole object, is constructed.
The construction of high-level features is not steered by matching to a predefined set of features, but is rather conditioned by hierarchical matching and inhibition of ad hoc features. The advantage of such an approach is that we do not have to define a category with respect to a set of learning examples, since objects from the same category can be articulated using many different local geometrical and appearance matches. In order to characterize the appearance variation within local regions, we learn a codebook of ICA filters that impose a maximally sparse response; we intentionally keep the number of filters in the codebook low. In that way, local regions are strongly informed in terms of their structural (orientation, scale) properties, but carry only basic information on appearance. Clusters of local features that conform to a subset of Gestalt rules are then grouped. This reduces the redundancy and provides a raw description of linear structures and equi-curvature areas. Based on a prototype image, we also learn the structure of the geometric conceptual space. The hierarchical binding is a massively parallel process that does not require global information. Twoplets of features are matched to twoplets of features, where the shape of the receptive field is conditioned to simulate attentive processing. The result is a set of matches of composite features at the highest level of the hierarchy, where each of the features can be traced back to its constituent features at different levels. The decision whether the object categories match can be made based on the number of high-level matches. We evaluate our framework on two domains: the first set of experiments assesses the performance of the hierarchical categorization on segmented images of objects that come from multiple categories, where the task is to find the closest match to a given image.
The second set of experiments assesses the performance of hierarchical categorization of objects in occluded images. We use the ETH80 and the Caltech-101 public databases. The results show that our framework achieves a reliable categorization of objects in a canonical view. Besides categorization, it enables reliable detection of an object as a whole and matching of object parts at different resolutions. We argue that the model could be integrated in a cognitive framework, where several other sources of information can be used to establish a stable interpretation of the scene. We also claim that categorization by visual matching to a prototype has biological plausibility, and that our architecture implements some ideas that originate in enactive cognitive science.
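
As an illustration only, the following minimal Python sketch shows one way the pairwise binding of matches and the decision based on the number of matches could look. It is a single-level simplification under stated assumptions, not the implementation described here: the names `bind_pairs` and `closest_prototype`, the distance-ratio consistency test, and the tolerance `tol` are all hypothetical, and the actual framework uses a multi-level hierarchy with selection and inhibition.

```python
from itertools import combinations
from math import dist
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
# A match pairs a feature location in the object view with one in the prototype view.
Match = Tuple[Point, Point]


def bind_pairs(matches: List[Match], tol: float = 0.2) -> List[Tuple[Match, Match]]:
    """Bind two matches into a composite when their spacing agrees in both views.

    The relative-distance test is an assumed stand-in for the geometric
    consistency check; the source does not specify the criterion.
    """
    bound = []
    for m1, m2 in combinations(matches, 2):
        d_obj = dist(m1[0], m2[0])  # spacing in the object view
        d_pro = dist(m1[1], m2[1])  # spacing in the prototype view
        if d_obj == 0 or d_pro == 0:
            continue  # coincident features carry no structural information
        if abs(d_obj - d_pro) / max(d_obj, d_pro) <= tol:
            bound.append((m1, m2))
    return bound


def closest_prototype(
    matches_by_prototype: Dict[str, List[Match]], min_count: int = 1
) -> Tuple[Optional[str], Dict[str, int]]:
    """Pick the prototype with the most bound pairs, or None if too few."""
    counts = {label: len(bind_pairs(m)) for label, m in matches_by_prototype.items()}
    label = max(counts, key=counts.get, default=None)
    if label is None or counts[label] < min_count:
        return None, counts
    return label, counts


if __name__ == "__main__":
    candidates = {
        "cup": [((0, 0), (0, 0)), ((1, 0), (1, 0)), ((0, 1), (0, 1))],
        "car": [((0, 0), (0, 0)), ((1, 0), (5, 0))],
    }
    print(closest_prototype(candidates))
```

In this toy input, the three "cup" matches keep their mutual spacing in both views and therefore bind into three pairs, whereas the two "car" matches are stretched in the prototype view and do not bind, so "cup" is selected.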