22 results on '"Tim K. Marks"'
Search Results
2. LUVLi Face Alignment: Estimating Landmarks’ Location, Uncertainty, and Visibility Likelihood
- Author
-
Xiaoming Liu, Abhinav Kumar, Wenxuan Mou, Tim K. Marks, Michael Jones, Ye Wang, Anoop Cherian, Toshiaki Koike-Akino, and Chen Feng
- Subjects
Computer Science - Machine Learning (cs.LG), Computer Science - Computer Vision and Pattern Recognition (cs.CV), Electrical Engineering and Systems Science - Image and Video Processing (eess.IV), Computer science, Pattern recognition, Landmark, Face (geometry), Range (statistics), Visibility, Artificial intelligence
- Abstract
Modern face alignment methods have become quite accurate at predicting the locations of facial landmarks, but they do not typically estimate the uncertainty of their predicted locations nor predict whether landmarks are visible. In this paper, we present a novel framework for jointly predicting landmark locations, associated uncertainties of these predicted locations, and landmark visibilities. We model these as mixed random variables and estimate them using a deep network trained with our proposed Location, Uncertainty, and Visibility Likelihood (LUVLi) loss. In addition, we release an entirely new labeling of a large face alignment dataset with over 19,000 face images in a full range of head poses. Each face is manually labeled with the ground-truth locations of 68 landmarks, with the additional information of whether each landmark is unoccluded, self-occluded (due to extreme head poses), or externally occluded. Not only does our joint estimation yield accurate estimates of the uncertainty of predicted landmark locations, but it also yields state-of-the-art estimates for the landmark locations themselves on multiple standard face alignment datasets. Our method's estimates of the uncertainty of predicted landmark locations could be used to automatically identify input images on which face alignment fails, which can be critical for downstream tasks. (Accepted to CVPR 2020.)
- Published
- 2020
- Full Text
- View/download PDF
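The LUVLi entry above jointly models each landmark's location, a location uncertainty, and a visibility likelihood. The following numpy sketch illustrates one plausible form of such a joint negative log-likelihood, combining a Bernoulli visibility term with a 2D Gaussian location term; all names, shapes, and the exact form of the loss are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def joint_nll(pred_mu, pred_cov, pred_vis, gt_loc, gt_vis, eps=1e-6):
    """Illustrative joint NLL over landmark locations and visibilities.

    pred_mu:  (L, 2) predicted landmark means
    pred_cov: (L, 2, 2) predicted per-landmark covariance matrices
    pred_vis: (L,) predicted visibility probabilities in (0, 1)
    gt_loc:   (L, 2) ground-truth locations
    gt_vis:   (L,) ground-truth visibility indicators {0, 1}
    """
    total = 0.0
    for mu, cov, p, x, v in zip(pred_mu, pred_cov, pred_vis, gt_loc, gt_vis):
        # Bernoulli term for visibility.
        total -= v * np.log(p + eps) + (1 - v) * np.log(1 - p + eps)
        if v:  # Gaussian term only where the landmark location is annotated.
            d = x - mu
            total += 0.5 * (np.log(np.linalg.det(cov) + eps)
                            + d @ np.linalg.solve(cov, d))
    return total / len(pred_mu)
```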
3. FX-GAN: Self-Supervised GAN Learning via Feature Exchange
- Author
-
Teng-Yok Lee, Anoop Cherian, Ye Wang, Tim K. Marks, Rui Huang, and Wenju Xu
- Subjects
Computer science, Discriminator, Generator (mathematics), Image quality, Real image, Image (mathematics), Feature (computer vision), Feature structure, Stability (learning theory), Pattern recognition, Artificial intelligence
- Abstract
We propose a self-supervised approach to improve the training of Generative Adversarial Networks (GANs) via inducing the discriminator to examine the structural consistency of images. Although natural image samples provide ideal examples of both valid structure and valid texture, learning to reproduce both together remains an open challenge. In our approach, we augment the training set of natural images with modified examples that have degraded structural consistency. These degraded examples are automatically created by randomly exchanging pairs of patches in an image’s convolutional feature map. We call this approach feature exchange. With this setup, we propose a novel GAN formulation, termed Feature eXchange GAN (FX-GAN), in which the discriminator is trained not only to distinguish real versus generated images, but also to perform the auxiliary task of distinguishing between real images and structurally corrupted (feature-exchanged) real images. This auxiliary task causes the discriminator to learn the proper feature structure of natural images, which in turn guides the generator to produce images with more realistic structure. Compared with strong GAN baselines, our proposed self-supervision approach improves generated image quality, diversity, and training stability for both the unconditional and class-conditional settings.
- Published
- 2020
- Full Text
- View/download PDF
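The self-supervision signal in the FX-GAN entry above is "feature exchange": structurally corrupting a real image by randomly swapping pairs of patches in its convolutional feature map, so the discriminator can be trained to detect the corruption. Below is a minimal numpy sketch of the patch-swapping operation itself; the patch size, number of swaps, and array shapes are illustrative assumptions.

```python
import numpy as np

def feature_exchange(fmap, num_swaps=4, patch=2, rng=None):
    """fmap: (C, H, W) feature map. Returns a copy with patch pairs exchanged."""
    rng = rng or np.random.default_rng()
    out = fmap.copy()
    _, H, W = out.shape
    for _ in range(num_swaps):
        # Pick two random top-left corners and swap the patches at those locations.
        y1, x1 = rng.integers(0, H - patch + 1), rng.integers(0, W - patch + 1)
        y2, x2 = rng.integers(0, H - patch + 1), rng.integers(0, W - patch + 1)
        a = out[:, y1:y1 + patch, x1:x1 + patch].copy()
        b = out[:, y2:y2 + patch, x2:x2 + patch].copy()
        out[:, y1:y1 + patch, x1:x1 + patch] = b
        out[:, y2:y2 + patch, x2:x2 + patch] = a
    return out
```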
4. Unsupervised Joint 3D Object Model Learning and 6D Pose Estimation for Depth-Based Instance Segmentation
- Author
-
Alan Sullivan, Tim K. Marks, Guanghui Wang, Yuanwei Wu, Chen Feng, Anoop Cherian, and Siheng Chen
- Subjects
Computer science, Deep learning, Point cloud, Object model, Object (computer science), Computer vision, Segmentation, Image segmentation, Pose, Rigid transformation, Artificial intelligence
- Abstract
In this work, we propose a novel unsupervised approach to jointly learn the 3D object model and estimate the 6D poses of multiple instances of the same object, with applications to depth-based instance segmentation. The inputs are depth images, and the learned object model is represented by a 3D point cloud. Traditional 6D pose estimation approaches are not sufficient to address this problem, where neither a CAD model of the object nor the ground-truth 6D poses of its instances are available during training. To solve this problem, we propose to jointly optimize the model learning and pose estimation in an end-to-end deep learning framework. Specifically, our network produces a 3D object model and a list of rigid transformations on this model to generate instances, which, when rendered, must match the observed point cloud by minimizing the Chamfer distance. To render the set of instance point clouds with occlusions, the network automatically removes the occluded points in a given camera view. Extensive experiments evaluate our technique on several object models and varying numbers of instances in 3D point clouds. We demonstrate the application of our method to instance segmentation of depth images of small bins of industrial parts. Compared with popular baselines for instance segmentation, our model not only demonstrates competitive performance, but also learns a 3D object model that is represented as a 3D point cloud.
- Published
- 2019
- Full Text
- View/download PDF
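The unsupervised objective in the entry above compares the rendered, transformed model points against the observed depth point cloud via the Chamfer distance. The brute-force numpy version of the symmetric Chamfer distance below, suitable only for small point clouds, is a sketch of that matching term rather than the paper's implementation.

```python
import numpy as np

def chamfer_distance(P, Q):
    """P: (N, 3), Q: (M, 3) point clouds. Returns the symmetric Chamfer distance."""
    # Pairwise squared distances between every point in P and every point in Q.
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # shape (N, M)
    # Nearest-neighbor distance from each P point to Q, and from each Q point to P.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```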
5. UGLLI Face Alignment: Estimating Uncertainty with Gaussian Log-Likelihood Loss
- Author
-
Abhinav Kumar, Xiaoming Liu, Wenxuan Mou, Tim K. Marks, and Chen Feng
- Subjects
Computer science, Estimation, Gaussian, Log likelihood, Face (geometry), Pattern recognition, Computer Science::Graphics, Computer Science::Computer Vision and Pattern Recognition, Artificial intelligence
- Abstract
Modern face alignment methods have become quite accurate at predicting the locations of facial landmarks, but they do not typically estimate the uncertainty of their predicted locations. In this paper, we present a novel framework for jointly predicting facial landmark locations and the associated uncertainties, modeled as 2D Gaussian distributions, using Gaussian log-likelihood loss. Not only does our joint estimation of uncertainty and landmark locations yield state-of-the-art estimates of the uncertainty of predicted landmark locations, but it also yields state-of-the-art estimates for the landmark locations (face alignment). Our method's estimates of the uncertainty of landmarks' predicted locations could be used to automatically identify input images on which face alignment fails, which can be critical for downstream tasks.
- Published
- 2019
- Full Text
- View/download PDF
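The UGLLI entry above trains landmark predictions with a Gaussian log-likelihood loss, so the network must produce both a location estimate and a 2x2 covariance per landmark. One common way to obtain such quantities from a nonnegative heatmap is the weighted mean and weighted covariance sketched below; this is an illustrative assumption rather than necessarily the paper's exact estimator, and the Gaussian negative log-likelihood itself takes the same form as in the LUVLi sketch earlier.

```python
import numpy as np

def heatmap_mean_cov(hm):
    """hm: (H, W) nonnegative heatmap with positive sum.

    Returns (mean xy location, 2x2 covariance) under the normalized heatmap.
    """
    w = hm / hm.sum()                       # treat the heatmap as a 2D distribution
    ys, xs = np.mgrid[0:hm.shape[0], 0:hm.shape[1]]
    mu = np.array([(w * xs).sum(), (w * ys).sum()])
    dx, dy = xs - mu[0], ys - mu[1]
    cov = np.array([[(w * dx * dx).sum(), (w * dx * dy).sum()],
                    [(w * dx * dy).sum(), (w * dy * dy).sum()]])
    return mu, cov
```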
6. Audio Visual Scene-Aware Dialog
- Author
-
Chiori Hori, Devi Parikh, Vincent Cartillier, Anoop Cherian, Abhishek Das, Jue Wang, Huda Alamri, Irfan Essa, Stefan Lee, Dhruv Batra, Peter Anderson, and Tim K. Marks
- Subjects
Computer science, Audio visual, Task (project management), Benchmark (computing), Natural (music), Dialog box, Baseline (configuration management), Natural language processing, Artificial intelligence
- Abstract
We introduce the task of scene-aware dialog. Our goal is to generate a complete and natural response to a question about a scene, given video and audio of the scene and the history of previous turns in the dialog. To answer successfully, agents must ground concepts from the question in the video while leveraging contextual cues from the dialog history. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more than 11,000 videos of human actions from the Charades dataset, our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using both qualitative and quantitative metrics. Our results indicate that models must utilize all the available inputs (video, audio, question, and dialog history) to perform best on this dataset.
- Published
- 2019
- Full Text
- View/download PDF
7. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features
- Author
-
Devi Parikh, Raphael Gontijo Lopes, Anoop Cherian, Gordon Wichern, Abhishek Das, Vincent Cartillier, Dhruv Batra, Jue Wang, Huda Alamri, Tim K. Marks, Takaaki Hori, Chiori Hori, and Irfan Essa
- Subjects
Computer Science - Computation and Language (cs.CL), Computer Science - Computer Vision and Pattern Recognition (cs.CV), Computer Science - Sound (cs.SD), Electrical Engineering and Systems Science - Audio and Speech Processing (eess.AS), Computer science, Feature extraction, Human behavior, Visualization, End-to-end principle, Human–computer interaction, Question answering, Mel-frequency cepstrum, Dialog box
- Abstract
Dialog systems need to understand dynamic visual scenes in order to have conversations with users about the objects and events around them. Scene-aware dialog systems for real-world applications could be developed by integrating state-of-the-art technologies from multiple research areas, including: end-to-end dialog technologies, which generate system responses using models trained from dialog data; visual question answering (VQA) technologies, which answer questions about images using learned image features; and video description technologies, in which descriptions/captions are generated from videos using multimodal information. We introduce a new dataset of dialogs about videos of human behaviors. Each dialog is a typed conversation that consists of a sequence of 10 question-and-answer (QA) pairs between two Amazon Mechanical Turk (AMT) workers. In total, we collected dialogs on roughly 9,000 videos. Using this new dataset for Audio Visual Scene-aware dialog (AVSD), we trained an end-to-end conversation model that generates responses in a dialog about a video. Our experiments demonstrate that using multimodal features that were developed for multimodal attention-based video description enhances the quality of generated dialog about dynamic scenes (videos). Our dataset, model code and pretrained models will be publicly available for a new Video Scene-Aware Dialog challenge. (A prototype system for the Audio Visual Scene-aware Dialog (AVSD) at DSTC7.)
- Published
- 2019
- Full Text
- View/download PDF
8. SparsePPG: Towards Driver Monitoring Using Camera-Based Vital Signs Estimation in Near-Infrared
- Author
-
Tim K. Marks, Ewa Magdalena Nowara, Ashok Veeraraghavan, and Hassan Mansour
- Subjects
Computer science, Heartbeat, Vital signs, Near-infrared spectroscopy, Band-pass filter, RGB color model, Robustness (computer science), Computer vision, Artificial intelligence
- Abstract
Camera-based measurement of the heartbeat signal from minute changes in the appearance of a person's skin is known as remote photoplethysmography (rPPG). Methods for rPPG have improved considerably in recent years, making possible its integration into applications such as telemedicine. Driver monitoring using in-car cameras is another potential application of this emerging technology. Unfortunately, there are several challenges unique to the driver monitoring context that must be overcome. First, there are drastic illumination changes on the driver's face, both during the day (as sun filters in and out of overhead trees, etc.) and at night (from streetlamps and oncoming headlights), which current rPPG algorithms cannot account for. We argue that these variations are significantly reduced by narrow-bandwidth near-infrared (NIR) active illumination at 940 nm, with a matching bandpass filter on the camera. Second, the amount of motion during driving is significant. We perform a preliminary analysis of the motion magnitude and argue that any in-car solution must provide better robustness to motion artifacts. Third, low signal-to-noise ratio (SNR) and false peaks due to motion have the potential to confound the rPPG signal. To address these challenges, we develop a novel rPPG signal tracking and denoising algorithm (sparsePPG) based on Robust Principal Components Analysis and sparse frequency spectrum estimation. We release a new dataset of face videos collected simultaneously in RGB and NIR. We demonstrate that in each of these frequency ranges, our new method performs as well as or better than current state-of-the-art rPPG algorithms. Overall, our preliminary study indicates that while driver vital signs monitoring using cameras is promising, much work needs to be done in terms of improving robustness to motion artifacts before it becomes practical.
- Published
- 2018
- Full Text
- View/download PDF
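For context on the signal being recovered in the SparsePPG entry above, the sketch below shows a generic rPPG-style heart-rate estimate from a mean skin-pixel intensity trace: band-limit the spectrum to plausible heart-rate frequencies and take the spectral peak. This is a baseline illustration only, not the sparsePPG tracking-and-denoising algorithm itself; the band limits and variable names are assumptions.

```python
import numpy as np

def estimate_heart_rate(trace, fps, lo=0.7, hi=4.0):
    """trace: 1D mean skin intensity over time; fps: frames per second. Returns bpm."""
    x = trace - trace.mean()                       # remove the DC component
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= lo) & (freqs <= hi)           # 0.7-4 Hz ~ 42-240 beats per minute
    return 60.0 * freqs[band][np.argmax(power[band])]
```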
9. Early and late integration of audio features for automatic video description
- Author
-
Chiori Hori, John R. Hershey, Tim K. Marks, and Takaaki Hori
- Subjects
Computer science, Closed captioning, Audio signal, Artificial neural network, Deep learning, Speech recognition, Question answering, Mel-frequency cepstrum, Fusion mechanism, Natural language
- Abstract
This paper presents our approach to improve video captioning by integrating audio and video features. Video captioning is the task of generating a textual description to describe the content of a video. State-of-the-art approaches to video captioning are based on sequence-to-sequence models, in which a single neural network accepts sequential images and audio data, and outputs a sequence of words that best describe the input data in natural language. The network thus learns to encode the video input into an intermediate semantic representation, which can be useful in applications such as multimedia indexing, automatic narration, and audio-visual question answering. In our prior work, we proposed an attention-based multi-modal fusion mechanism to integrate image, motion, and audio features, where the multiple features are integrated in the network. Here, we apply hypothesis-level integration based on minimum Bayes-risk (MBR) decoding to further improve the caption quality, focusing on well-known evaluation metrics (BLEU and METEOR scores). Experiments with the YouTube2Text and MSR-VTT datasets demonstrate that combinations of early and late integration of multimodal features significantly improve the audio-visual semantic representation, as measured by the resulting caption quality. In addition, we compared the performance of our method using two different types of audio features: MFCC features, and the audio features extracted using SoundNet, which was trained to recognize objects and scenes from videos using only the audio signals.
- Published
- 2017
- Full Text
- View/download PDF
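The hypothesis-level ("late") integration in the entry above uses minimum Bayes-risk decoding over candidate captions. The sketch below shows the generic MBR selection rule: choose the hypothesis with the highest expected similarity to the other hypotheses under their posterior weights. The `similarity` argument is a stand-in for a metric such as BLEU; this illustrates the decoding rule, not the authors' exact implementation.

```python
import numpy as np

def mbr_select(hypotheses, log_posteriors, similarity):
    """hypotheses: list of candidate captions; log_posteriors: array of model scores.

    similarity(a, b) should return a gain (higher is better), e.g. sentence BLEU.
    """
    # Normalize the posterior over hypotheses.
    p = np.exp(log_posteriors - np.max(log_posteriors))
    p /= p.sum()
    # Expected gain of each hypothesis against the posterior-weighted set.
    expected_gain = [sum(p[j] * similarity(h, hypotheses[j])
                         for j in range(len(hypotheses)))
                     for h in hypotheses]
    return hypotheses[int(np.argmax(expected_gain))]
```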
10. Attention-Based Multimodal Fusion for Video Description
- Author
-
Kazuhiko Sumi, Chiori Hori, Ziming Zhang, Teng-Yok Lee, Bret Harsham, John R. Hershey, Takaaki Hori, and Tim K. Marks
- Subjects
Computer science, Artificial neural network, Recurrent neural network, Concatenation, Feature extraction, Feature (computer vision), Cognitive neuroscience of visual object recognition, Pattern recognition, Relevance (information retrieval), Word (computer architecture), Computer vision, Artificial intelligence
- Abstract
Current methods for video description are based on encoder-decoder sentence generation using recurrent neural networks (RNNs). Recent work has demonstrated the advantages of integrating temporal attention mechanisms into these models, in which the decoder network predicts each word in the description by selectively giving more weight to encoded features from specific time frames. Such methods typically use two different types of features: image features (from an object classification model), and motion features (from an action recognition model), combined by naive concatenation in the model input. Because different feature modalities may carry task-relevant information at different times, fusing them by naive concatenation may limit the model's ability to dynamically determine the relevance of each type of feature to different parts of the description. In this paper, we incorporate audio features in addition to the image and motion features. To fuse these three modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. Combining our new multimodal attention model with standard temporal attention outperforms state-of-the-art methods on two standard datasets: YouTube2Text and MSR-VTT.
- Published
- 2017
- Full Text
- View/download PDF
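The entry above fuses image, motion, and audio features with a multimodal attention model that reweights the modalities for each output word. The numpy sketch below shows the core idea for a single decoding step using a simple dot-product score; the feature dimensions and scoring function are assumptions, not the paper's architecture.

```python
import numpy as np

def multimodal_attention(decoder_state, modality_feats):
    """decoder_state: (d,); modality_feats: list of (d,) vectors (e.g. image, motion, audio).

    Returns the attention-weighted fused feature and the per-modality weights.
    """
    scores = np.array([decoder_state @ f for f in modality_feats])
    weights = np.exp(scores - scores.max())     # softmax over modalities
    weights /= weights.sum()
    fused = sum(w * f for w, f in zip(weights, modality_feats))
    return fused, weights
```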
11. High-accuracy user identification using EEG biometrics
- Author
-
Philip Orlik, Oncel Tuzel, Ye Wang, Toshiaki Koike-Akino, Shinji Watanabe, Ruhi Mahajan, and Tim K. Marks
- Subjects
Computer science, Biometrics, Biometric Identification, Electroencephalography, Event-Related Potentials (P300), Evoked Potentials, Feature extraction, Dimensionality reduction, Principal component analysis, Support vector machine, Logistic Models, Artificial neural network, Machine Learning, Pattern recognition, Identification (information), Artificial intelligence
- Abstract
We analyze brain waves acquired through a consumer-grade EEG device to investigate its capabilities for user identification and authentication. First, we show the statistical significance of the P300 component in event-related potential (ERP) data from 14-channel EEGs across 25 subjects. We then apply a variety of machine learning techniques, comparing the user identification performance of various combinations of a dimensionality reduction technique followed by a classification algorithm. Experimental results show that an identification accuracy of 72% can be achieved using only a single 800 ms ERP epoch. In addition, we demonstrate that the user identification accuracy can be significantly improved to more than 96.7% by joint classification of multiple epochs.
- Published
- 2016
- Full Text
- View/download PDF
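The entry above compares combinations of a dimensionality reduction technique followed by a classifier over ERP epochs. The scikit-learn sketch below shows one such combination (PCA followed by logistic regression) purely to illustrate the pipeline shape; the specific choices and hyperparameters are assumptions, not the paper's best-performing configuration.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_eeg_identifier(epochs, subject_ids, n_components=30):
    """epochs: (N, channels * samples) flattened ERP epochs; subject_ids: (N,) labels.

    Returns a fitted dimensionality-reduction + classification pipeline.
    """
    clf = make_pipeline(PCA(n_components=n_components),
                        LogisticRegression(max_iter=1000))
    clf.fit(epochs, subject_ids)
    return clf
```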
12. A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection
- Author
-
Oncel Tuzel, Michael Jones, Tim K. Marks, Ming Shao, and Bharat Singh
- Subjects
Computer science, Artificial neural network, Convolutional neural network, Recurrent neural network, Pixel, Frame (networking), Optical flow, Background noise, Minimum bounding box, Computer vision, Artificial intelligence
- Abstract
We present a multi-stream bi-directional recurrent neural network for fine-grained action detection. Recently, two-stream convolutional neural networks (CNNs) trained on stacked optical flow and image frames have been successful for action recognition in videos. Our system uses a tracking algorithm to locate a bounding box around the person, which provides a frame of reference for appearance and motion and also suppresses background noise that is not within the bounding box. We train two additional streams on motion and appearance cropped to the tracked bounding box, along with full-frame streams. Our motion streams use pixel trajectories of a frame as raw features, in which the displacement values corresponding to a moving scene point are at the same spatial position across several frames. To model long-term temporal dynamics within and between actions, the multi-stream CNN is followed by a bi-directional Long Short-Term Memory (LSTM) layer. We show that our bi-directional LSTM network utilizes about 8 seconds of the video sequence to predict an action label. We test on two action detection datasets: the MPII Cooking 2 Dataset, and a new MERL Shopping Dataset that we introduce and make available to the community with this paper. The results demonstrate that our method significantly outperforms state-of-the-art action detection methods on both datasets.
- Published
- 2016
- Full Text
- View/download PDF
13. An improved deep learning architecture for person re-identification
- Author
-
Michael Jones, Ejaz Ahmed, and Tim K. Marks
- Subjects
Computer science, Set (abstract data type), Data set, Similarity (geometry), Metric (mathematics), Deep learning, Layer (object-oriented design), Image (mathematics), Computer vision, Artificial intelligence
- Abstract
In this work, we propose a method for simultaneously learning features and a corresponding similarity metric for person re-identification. We present a deep convolutional architecture with layers specially designed to address the problem of re-identification. Given a pair of images as input, our network outputs a similarity value indicating whether the two input images depict the same person. Novel elements of our architecture include a layer that computes cross-input neighborhood differences, which capture local relationships between the two input images based on mid-level features from each input image. A high-level summary of the outputs of this layer is computed by a layer of patch summary features, which are then spatially integrated in subsequent layers. Our method significantly outperforms the state of the art on both a large data set (CUHK03) and a medium-sized data set (CUHK01), and is resistant to over-fitting. We also demonstrate that by initially training on an unrelated large data set before fine-tuning on a small target data set, our network can achieve results comparable to the state of the art even on a small data set (VIPeR).
- Published
- 2015
- Full Text
- View/download PDF
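A distinctive element of the architecture in the entry above is a layer that computes cross-input neighborhood differences between mid-level feature maps of the two input images. The numpy sketch below illustrates one plausible reading of that operation, comparing each position in one map against a small neighborhood in the other; the neighborhood size and padding choice are assumptions for illustration.

```python
import numpy as np

def neighborhood_differences(f, g, k=5):
    """f, g: (C, H, W) mid-level feature maps from the two input images.

    Returns a (C, H, W, k, k) array: each feature value in f minus the k x k
    neighborhood of the corresponding location in g.
    """
    C, H, W = f.shape
    pad = k // 2
    g_pad = np.pad(g, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    out = np.empty((C, H, W, k, k), dtype=f.dtype)
    for y in range(H):
        for x in range(W):
            out[:, y, x] = f[:, y, x, None, None] - g_pad[:, y:y + k, x:x + k]
    return out
```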
14. Real-time 3D head pose and facial landmark estimation from depth images using triangular surface patch features
- Author
-
Chavdar Papazov, Tim K. Marks, and Michael Jones
- Subjects
Computer science, Landmark, Depth map, Pose, Facial recognition system, k-nearest neighbors algorithm, Robustness (computer science), Computer vision, Artificial intelligence
- Abstract
We present a real-time system for 3D head pose estimation and facial landmark localization using a commodity depth sensor. We introduce a novel triangular surface patch (TSP) descriptor, which encodes the shape of the 3D surface of the face within a triangular area. The proposed descriptor is viewpoint invariant, and it is robust to noise and to variations in the data resolution. Using a fast nearest neighbor lookup, TSP descriptors from an input depth map are matched to the most similar ones that were computed from synthetic head models in a training phase. The matched triangular surface patches in the training set are used to compute estimates of the 3D head pose and facial landmark positions in the input depth map. By sampling many TSP descriptors, many votes for pose and landmark positions are generated which together yield robust final estimates. We evaluate our approach on the publicly available Biwi Kinect Head Pose Database to compare it against state-of-the-art methods. Our results show a significant improvement in the accuracy of both pose and landmark location estimates while maintaining real-time speed.
- Published
- 2015
- Full Text
- View/download PDF
15. Improving Person Tracking Using an Inexpensive Thermal Infrared Sensor
- Author
-
Suren Kumar, Tim K. Marks, and Michael Jones
- Subjects
Computer science, Detector, Video tracking, Tracking (particle physics), Image segmentation, Segmentation, RGB color model, Image resolution, Visualization, Computer vision, Artificial intelligence
- Abstract
This paper proposes a person tracking framework using a scanning low-resolution thermal infrared (IR) sensor colocated with a wide-angle RGB camera. The low temporal and spatial resolution of the low-cost IR sensor make it unable to track moving people and prone to false detections of stationary people. Thus, IR-only tracking using only this sensor would be quite problematic. We demonstrate that despite the limited capabilities of this low-cost IR sensor, it can be used effectively to correct the errors of a real-time RGB camera-based tracker. We align the signals from the two sensors both spatially (by computing a pixel-to-pixel geometric correspondence between the two modalities) and temporally (by modeling the temporal dynamics of the scanning IR sensor), which enables multi-modal improvements based on judicious application of elementary reasoning. Our combined RGB+IR system improves upon the RGB camera-only tracking by: rejecting false positives, improving segmentation of tracked objects, and correcting false negatives (starting new tracks for people that were missed by the camera-only tracker). Since we combine RGB and thermal information at the level of RGB camera-based tracks, our method is not limited to the particular camera-based tracker that we used in our experiments. Our method could improve the results of any tracker that uses RGB camera input alone. We collect a new dataset and demonstrate the superiority of our method over RGB camera-only tracking.
- Published
- 2014
- Full Text
- View/download PDF
16. Log-linear dialog manager
- Author
-
Hao Tang, Shinji Watanabe, Tim K. Marks, and John R. Hershey
- Subjects
Computer science, Dialog manager, Action (philosophy), Feature vector, Partially observable Markov decision process, Log-linear model, Human–computer interaction, Machine learning, Artificial intelligence
- Abstract
A dialog manager receives previous user actions, previous observations, and current observations. Previous and current user states, previous user actions, current user actions, future system actions, and future observations are hypothesized. The user states, the user actions, and the user observations are hidden. A feature vector is extracted based on the user states, the system actions, the user actions, and the observations. An expected reward for each current action is computed based on a log-linear model using the feature vectors. Then, the current action that has the optimal expected reward is output.
- Published
- 2014
- Full Text
- View/download PDF
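The patent-style abstract above scores candidate system actions with a log-linear model over extracted feature vectors and outputs the action with the optimal expected reward. The tiny sketch below shows the generic log-linear action-selection step only; the feature extractor and weights are hypothetical placeholders, and the belief tracking over hidden user states described in the abstract is omitted.

```python
import numpy as np

def select_action(candidate_actions, extract_features, weights):
    """Return the candidate action with the highest log-linear score w . phi(action).

    extract_features(action) must return a feature vector of the same length as weights.
    """
    scores = [weights @ extract_features(action) for action in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]
```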
17. Detecting 3D geometric boundaries of indoor scenes under varying lighting
- Author
-
Oncel Tuzel, Jie Ni, Tim K. Marks, and Fatih Porikli
- Subjects
Computer science, Photography, Boundary (topology), Edge detection, Object detection, Non-negative matrix factorization, Image texture, Shadow, Normal, Computer vision, Artificial intelligence
- Abstract
The goal of this research is to identify 3D geometric boundaries in a set of 2D photographs of a static indoor scene under unknown, changing lighting conditions. A 3D geometric boundary is a contour located at a 3D depth discontinuity or a discontinuity in the surface normal. These boundaries can be used effectively for reasoning about the 3D layout of a scene. To distinguish 3D geometric boundaries from 2D texture edges, we analyze the illumination subspace of local appearance at each image location. In indoor time-lapse photography and surveillance video, we frequently see images that are lit by unknown combinations of uncalibrated light sources. We introduce an algorithm for semi-binary nonnegative matrix factorization (SBNMF) to decompose such images into a set of lighting basis images, each of which shows the scene lit by a single light source. These basis images provide a natural, succinct representation of the scene, enabling tasks such as scene editing (e.g., relighting) and shadow edge identification.
- Published
- 2014
- Full Text
- View/download PDF
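The entry above decomposes time-lapse images into lighting basis images via semi-binary nonnegative matrix factorization (SBNMF): the basis images are nonnegative, while the mixing coefficients are binary (each light is either on or off in a given photo). The sketch below is a brute-force alternating scheme for a tiny instance of that factorization, offered only to illustrate the constraint structure; it is not the paper's SBNMF algorithm, and the iteration counts and initialization are assumptions.

```python
import itertools
import numpy as np
from scipy.optimize import nnls

def semi_binary_nmf(V, rank, iters=20, rng=None):
    """V: (pixels, images) nonnegative matrix. Returns W (pixels, rank) >= 0 and binary H (rank, images)."""
    rng = rng or np.random.default_rng(0)
    H = rng.integers(0, 2, size=(rank, V.shape[1])).astype(float)
    W = np.zeros((V.shape[0], rank))
    # All possible binary on/off codes for one image (feasible only for small rank).
    binary_codes = [np.array(c, dtype=float)
                    for c in itertools.product([0, 1], repeat=rank)]
    for _ in range(iters):
        # W-step: one nonnegative least-squares problem per pixel row, V[i, :] ~ H.T @ W[i, :].
        for i in range(V.shape[0]):
            W[i], _ = nnls(H.T, V[i])
        # H-step: exhaustively pick the best binary code for each image column.
        for j in range(V.shape[1]):
            errs = [np.linalg.norm(V[:, j] - W @ code) for code in binary_codes]
            H[:, j] = binary_codes[int(np.argmin(errs))]
    return W, H
```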
18. Fully automatic pose-invariant face recognition via 3D pose normalization
- Author
-
Tim K. Marks, Kinh Tieu, M. V. Rohith, Akshay Asthana, and Michael Jones
- Subjects
Computer science, Normalization (image processing), Facial recognition system, Three-dimensional face recognition, Kernel (image processing), Fully automatic, Invariant (mathematics), Pose, FERET, Pattern recognition, Computer vision, Artificial intelligence
- Abstract
An ideal approach to the problem of pose-invariant face recognition would handle continuous pose variations, would not be database specific, and would achieve high accuracy without any manual intervention. Most of the existing approaches fail to match one or more of these goals. In this paper, we present a fully automatic system for pose-invariant face recognition that not only meets these requirements but also outperforms other comparable methods. We propose a 3D pose normalization method that is completely automatic and leverages the accurate 2D facial feature points found by the system. The current system can handle 3D pose variation up to ±45° in yaw and ±30° in pitch angles. Recognition experiments were conducted on the USF 3D, Multi-PIE, CMU-PIE, FERET, and FacePix databases. Our system not only shows excellent generalization by achieving high accuracy on all 5 databases but also outperforms other methods convincingly.
- Published
- 2011
- Full Text
- View/download PDF
19. Morphable Reflectance Fields for enhancing face recognition
- Author
-
Michael Jones, Ritwik Kumar, and Tim K. Marks
- Subjects
Computer science, Set (abstract data type), Pixel, Face (geometry), Facial recognition system, Reflectivity, Benchmark (computing), Computer vision, Artificial intelligence
- Abstract
In this paper, we present a novel framework to address the confounding effects of illumination variation in face recognition. By augmenting the gallery set with realistically relit images, we enhance recognition performance in a classifier-independent way. We describe a novel method for single-image relighting, Morphable Reflectance Fields (MoRF), which does not require manual intervention and provides relighting superior to that of existing automatic methods. We test our framework through face recognition experiments using various state-of-the-art classifiers and popular benchmark datasets: CMU PIE, Multi-PIE, and MERL Dome. We demonstrate that our MoRF relighting and gallery augmentation framework achieves improvements in terms of both rank-1 recognition rates and ROC curves. We also compare our model with other automatic relighting methods to confirm its advantage. Finally, we show that the recognition rates achieved using our framework exceed those of state-of-the-art recognizers on the aforementioned databases.
- Published
- 2010
- Full Text
- View/download PDF
20. Rao-Blackwellized particle filtering for probing-based 6-DOF localization in robotic assembly
- Author
-
Haruhisa Okuda, Tim K. Marks, and Yuichi Taguchi
- Subjects
Computer Science::Robotics, Solid modeling, Kalman filter, Extended Kalman filter, Particle filter, Probability distribution, Mesh generation, Robotic arm, Pose, Mathematics, Computer vision, Artificial intelligence
- Abstract
This paper presents a probing-based method for probabilistic localization in automated robotic assembly. We consider peg-in-hole problems in which a needle-like peg has a single point of contact with the object that contains the hole, and in which the initial uncertainty in the relative pose (3D position and 3D angle) between the peg and the object is much greater than the required accuracy (assembly clearance). We solve this 6 degree-of-freedom (6-DOF) localization problem using a Rao-Blackwellized particle filter, in which the probability distribution over the peg's pose is factorized into two components: The distribution over position (3-DOF) is represented by particles, while the distribution over angle (3-DOF) is approximated as a Gaussian distribution for each particle, updated using an extended Kalman filter. This factorization reduces the number of particles required for localization by orders of magnitude, enabling real-time online 6-DOF pose estimation. Each measurement is simply the contact position obtained by randomly repositioning the peg and moving towards the object until there is contact. To compute the likelihood of each measurement, we use as a map a mesh model of the object that is based on the CAD model but also explicitly models the uncertainty in the map. The mesh uncertainty model makes our system robust to cases in which the actual measurement is different from the expected one. We demonstrate the advantages of our approach over previous methods using simulations as well as physical experiments with a robotic arm and a metal peg and object.
- Published
- 2010
- Full Text
- View/download PDF
21. Gamma-SLAM: Using stereo vision and variance grid maps for SLAM in unstructured environments
- Author
-
Garrison W. Cottrell, L. Matthies, Tim K. Marks, A. Howard, and M. Bajracharya
- Subjects
Computer science, Simultaneous localization and mapping, Occupancy grid mapping, Grid, Ground truth, Posterior probability, Particle filter, Filter (signal processing), Visual odometry, GPS signals, Global Positioning System, Robot, Motion planning, Computer vision, Artificial intelligence
- Abstract
We introduce a new method for stereo visual SLAM (simultaneous localization and mapping) that works in unstructured, outdoor environments. Unlike other grid-based SLAM algorithms, which use occupancy grid maps, our algorithm uses a new mapping technique that maintains a posterior distribution over the height variance in each cell. This idea was motivated by our experience with outdoor navigation tasks, which has shown height variance to be a useful measure of traversability. To obtain a joint posterior over poses and maps, we use a Rao-Blackwellized particle filter: the pose distribution is estimated using a particle filter, and each particle has its own map that is obtained through exact filtering conditioned on the particle's pose. Visual odometry provides good proposal distributions for the particle pose. In the analytical (exact) filter for the map, we update the sufficient statistics of a gamma distribution over the precision (inverse variance) of heights in each grid cell. We verify the algorithm's accuracy on two outdoor courses by comparing with ground truth data obtained using electronic surveying equipment. In addition, we solve for the optimal transformation from the SLAM map to georeferenced coordinates, based on a noisy GPS signal. We derive an online version of this alignment process, which can be used to maintain a running estimate of the robot's global position that is much more accurate than the GPS readings.
- Published
- 2008
- Full Text
- View/download PDF
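The map representation in the Gamma-SLAM entry above keeps, for each grid cell, a gamma distribution over the precision (inverse variance) of observed heights, updated via sufficient statistics. The sketch below shows the standard conjugate gamma update for a precision parameter under the simplifying assumption that the cell's mean height is known; the paper's exact sufficient-statistics update may differ.

```python
def update_height_precision(alpha, beta, heights, mean_height):
    """Conjugate Gamma(alpha, beta) posterior update over the precision of cell heights.

    heights: new height observations for this grid cell.
    mean_height: assumed known mean height of the cell (simplifying assumption).
    """
    n = len(heights)
    sq_err = sum((h - mean_height) ** 2 for h in heights)
    # Posterior shape grows with the number of observations; rate with squared error.
    return alpha + n / 2.0, beta + sq_err / 2.0
```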
22. 3D Tracking of Morphable Objects Using Conditionally Gaussian Nonlinear Filters
- Author
-
John R. Hershey, Javier R. Movellan, J. C. Roddey, and Tim K. Marks
- Subjects
Mathematics, Stochastic process, Gaussian, Gaussian process, Kalman filter, Generative model, Filtering problem, Tracking (particle physics), Position (vector), Computer vision, Artificial intelligence
- Abstract
We present a generative model and its associated stochastic filtering algorithm for simultaneous tracking of 3D position and orientation, non-rigid motion, object texture, and background texture. The model defines a stochastic process that belongs to the class of conditionally Gaussian processes (cf. Kalman filtering for conditionally Gaussian systems with random matrices). This allows partitioning the filtering problem into two components: a linear component for texture that is solved using a bank of Kalman filters with time-varying parameters, and a nonlinear component for pose (rigid and non-rigid motion parameters) whose solution depends on the states of the Kalman filters. When applied to the 3D tracking problem, this results in an inference algorithm from which existing optic flow-based tracking algorithms and tracking algorithms based on texture templates emerge as special cases. Flow-based tracking emerges when the pose of the object is certain but its appearance is uncertain. Template-based tracking emerges when the position of the object is uncertain but its texture is relatively certain. In practice, optimal inference under this model integrates optic flow-based and template-based tracking, dynamically weighting their relative importance as new images are presented.
- Published
- 2005
- Full Text
- View/download PDF
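The filtering algorithm in the entry above handles texture with a bank of Kalman filters with time-varying parameters, conditioned on sampled pose. As background, the sketch below is a generic single predict/update Kalman step of the kind each filter in such a bank would run; all matrices here are placeholders, and the pose-conditioning and nonlinear components of the paper's model are omitted.

```python
import numpy as np

def kalman_step(x, P, A, Q, H, R, z):
    """One linear Kalman filter step: predict with dynamics (A, Q), update with observation z under (H, R)."""
    # Predict.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update.
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)         # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```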