Author: "Gould, Stephen" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Gould, Stephen"' showing total 3,249 results

1. Can We Predict Performance of Large Models across Vision-Language Tasks?

Author: Zhao, Qinyu, Xu, Ming, Gupta, Kartik, Asthana, Akshay, Zheng, Liang, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Evaluating large vision-language models (LVLMs) is very expensive, due to the high computational costs and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, that is, predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, quickly reducing errors in performance prediction. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. In experiments, we systematically evaluate 108 LVLMs on 176 datasets from 36 benchmarks, constructing training and testing sets for validating our framework. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data., Comment: Under Review. Project page: https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM
Published: 2024
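
The matrix-completion idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: it fits a regularized low-rank factorization by plain gradient descent (a point estimate) rather than PMF with MCMC, and so it gives no uncertainty estimates. The function name `complete_matrix`, the hyperparameters, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def complete_matrix(R, mask, rank=1, lr=0.01, reg=0.01, epochs=3000, seed=0):
    """Fill in unobserved entries of R with a low-rank model U @ V.T.

    R    : (M, N) score matrix; values where mask is False are ignored.
    mask : (M, N) boolean array, True where a score was observed.
    """
    rng = np.random.default_rng(seed)
    M, N = R.shape
    R_obs = np.where(mask, R, 0.0)           # neutralise unobserved entries (e.g. NaN)
    U = 0.1 * rng.standard_normal((M, rank))  # latent factors for models
    V = 0.1 * rng.standard_normal((N, rank))  # latent factors for datasets
    for _ in range(epochs):
        err = mask * (U @ V.T - R_obs)        # residual on observed entries only
        grad_U = err @ V + reg * U
        grad_V = err.T @ U + reg * V
        U -= lr * grad_U
        V -= lr * grad_V
    return U @ V.T

# Toy example: a rank-1 "performance matrix" with three hidden scores.
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 0.5, 2.0, 1.5])
R_true = np.outer(a, b)
mask = np.ones_like(R_true, dtype=bool)
for (m, n) in [(0, 1), (2, 3), (3, 0)]:
    mask[m, n] = False
R_hat = complete_matrix(np.where(mask, R_true, np.nan), mask)
```

Comparing `R_hat` with `R_true` at the masked positions shows how well the hidden scores are recovered in this toy setting.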

2. Temporally Grounding Instructional Diagrams in Unconstrained Videos

Author: Zhang, Jiahao, Zhang, Frederic Z., Rodriguez, Cristian, Ben-Shabat, Yizhak, Cherian, Anoop, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We study the challenging problem of simultaneously localizing a sequence of queries in the form of instructional diagrams in a video. This requires understanding not only the individual queries but also their interrelationships. However, most existing methods focus on grounding one query at a time, ignoring the inherent structures among queries such as the general mutual exclusiveness and the temporal order. Consequently, the predicted timespans of different step diagrams may overlap considerably or violate the temporal order, thus harming the accuracy. In this paper, we tackle this issue by simultaneously grounding a sequence of step diagrams. Specifically, we propose composite queries, constructed by exhaustively pairing up the visual content features of the step diagrams and a fixed number of learnable positional embeddings. Our insight is that self-attention lets composite queries carrying different content features suppress each other, reducing timespan overlaps in predictions, while cross-attention corrects the temporal misalignment via joint content and position guidance. We demonstrate the effectiveness of our approach on the IAW dataset for grounding step diagrams and the YouCook2 benchmark for grounding natural language queries, significantly outperforming existing methods while simultaneously grounding multiple queries.
Published: 2024

3. Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation

Author: Xu, Ming and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: We propose a novel approach to the action segmentation task for long, untrimmed videos, based on solving an optimal transport problem. By encoding a temporal consistency prior into a Gromov-Wasserstein problem, we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches, our method does not require knowing the action order for a video to attain temporal consistency. Furthermore, our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting, where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast, 50-Salads, YouTube Instructions and Desktop Assembly datasets, yielding state-of-the-art results for the unsupervised video action segmentation task., Comment: Accepted to CVPR 2024 (Oral)
Published: 2024

4. The First to Know: How Token Distributions Reveal Hidden Knowledge in Large Vision-Language Models?

Author: Zhao, Qinyu, Xu, Ming, Gupta, Kartik, Asthana, Akshay, Zheng, Liang, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Large vision-language models (LVLMs), designed to interpret and respond to human instructions, occasionally generate hallucinated or harmful content due to inappropriate instructions. This study uses linear probing to shed light on the hidden knowledge at the output layers of LVLMs. We demonstrate that the logit distributions of the first tokens contain sufficient information to determine whether to respond to the instructions, including recognizing unanswerable visual questions, defending against jailbreaking attacks, and identifying deceptive questions. Such hidden knowledge is gradually lost in logits of subsequent tokens during response generation. Then, we illustrate a simple decoding strategy at the generation of the first token, effectively improving the generated content. In experiments, we find a few interesting insights: First, the CLIP model already contains a strong signal for solving these tasks, which indicates potential bias in the existing datasets. Second, we observe performance improvement by utilizing the first logit distributions on three additional tasks, including indicating uncertainty in math solving, mitigating hallucination, and image classification. Last, with the same training data, simply finetuning LVLMs improves models' performance but is still inferior to linear probing on these tasks., Comment: ECCV 2024. Project page: https://github.com/Qinyu-Allen-Zhao/LVLM-LP
Published: 2024

5. An Empirical Study Into What Matters for Calibrating Vision-Language Models

Author: Tu, Weijie, Deng, Weijian, Campbell, Dylan, Gould, Stephen, and Gedeon, Tom
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Vision-Language Models (VLMs) have emerged as the dominant approach for zero-shot recognition, adept at handling diverse scenarios and significant distribution changes. However, their deployment in risk-sensitive areas requires a deeper understanding of their uncertainty estimation capabilities, a relatively uncharted area. In this study, we explore the calibration properties of VLMs across different architectures, datasets, and training strategies. In particular, we analyze the uncertainty estimation performance of VLMs when calibrated in one domain, label set or hierarchy level, and tested in a different one. Our findings reveal that while VLMs are not inherently calibrated for uncertainty, temperature scaling significantly and consistently improves calibration, even across shifts in distribution and changes in label set. Moreover, VLMs can be calibrated with a very small set of examples. Through detailed experimentation, we highlight the potential applications and importance of our insights, aiming for more reliable and effective use of VLMs in critical, real-world scenarios., Comment: ICML 2024 Camera Ready
Published: 2024
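
Temperature scaling, which this abstract reports as consistently improving calibration, divides the logits by a single scalar T fitted on held-out data. The sketch below is a generic illustration and not the authors' code: it fits T by a simple grid search over the negative log-likelihood, and the function names, the grid range, and the synthetic data are all assumptions.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits / T, computed stably."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises held-out NLL."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 400)
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic held-out set: labels follow softmax(z), but the "model"
# reports logits 3 * z, i.e. it is overconfident by a factor of 3.
rng = np.random.default_rng(0)
n, K = 2000, 3
z = rng.standard_normal((n, K))
cum = softmax(z).cumsum(axis=1)
labels = np.minimum((rng.random(n)[:, None] > cum).sum(axis=1), K - 1)
logits = 3.0 * z
T_hat = fit_temperature(logits, labels)
```

Because the synthetic logits are inflated by a factor of 3, the fitted temperature should land near 3, and `softmax(logits, T_hat)` gives the rescaled probabilities.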

6. Towards Optimal Feature-Shaping Methods for Out-of-Distribution Detection

Author: Zhao, Qinyu, Xu, Ming, Gupta, Kartik, Asthana, Akshay, Zheng, Liang, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Feature shaping refers to a family of methods that exhibit state-of-the-art performance for out-of-distribution (OOD) detection. These approaches manipulate the feature representation, typically from the penultimate layer of a pre-trained deep learning model, so as to better differentiate between in-distribution (ID) and OOD samples. However, existing feature-shaping methods usually employ rules manually designed for specific model architectures and OOD datasets, which consequently limit their generalization ability. To address this gap, we first formulate an abstract optimization framework for studying feature-shaping methods. We then propose a concrete reduction of the framework with a simple piecewise constant shaping function and show that existing feature-shaping methods approximate the optimal solution to the concrete optimization problem. Further, assuming that OOD data is inaccessible, we propose a formulation that yields a closed-form solution for the piecewise constant shaping function, utilizing solely the ID data. Through extensive experiments, we show that the feature-shaping function optimized by our method improves the generalization ability of OOD detection across a large variety of datasets and model architectures., Comment: ICLR 2024. Project page: https://github.com/Qinyu-Allen-Zhao/OptFSOOD
Published: 2024

7. Revisiting Implicit Differentiation for Learning Problems in Optimal Control

Author: Xu, Ming, Molloy, Timothy, and Gould, Stephen
Subjects: Computer Science - Machine Learning, Computer Science - Robotics, Electrical Engineering and Systems Science - Systems and Control
Abstract: This paper proposes a new method for differentiating through optimal trajectories arising from non-convex, constrained discrete-time optimal control (COC) problems using the implicit function theorem (IFT). Previous works solve a differential Karush-Kuhn-Tucker (KKT) system for the trajectory derivative, and achieve this efficiently by solving an auxiliary Linear Quadratic Regulator (LQR) problem. In contrast, we directly evaluate the matrix equations which arise from applying variable elimination on the Lagrange multiplier terms in the (differential) KKT system. By appropriately accounting for the structure of the terms within the resulting equations, we show that the trajectory derivatives scale linearly with the number of timesteps. Furthermore, our approach allows for easy parallelization, significantly improved scalability with model size, direct computation of vector-Jacobian products and improved numerical stability compared to prior works. As an additional contribution, we unify prior works, addressing claims that computing trajectory derivatives using IFT scales quadratically with the number of timesteps. We evaluate our method on both a synthetic benchmark and four challenging learning-from-demonstration benchmarks, including a 6-DoF maneuvering quadrotor and 6-DoF rocket powered landing., Comment: Accepted to NeurIPS 2023 (poster)
Published: 2023

8. 3D-GPT: Procedural 3D Modeling with Large Language Models

Author: Sun, Chunyi, Han, Junlin, Deng, Weijian, Wang, Xinlong, Qin, Zishan, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Computer Science - Machine Learning
Abstract: In the pursuit of efficient automated content creation, procedural generation, leveraging modifiable parameters and rule-based systems, emerges as a promising approach. Nonetheless, it could be a demanding endeavor, given its intricate nature necessitating a deep understanding of rules, algorithms, and parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing large language models (LLMs) for instruction-driven 3D modeling. 3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task. 3D-GPT integrates three core agents: the task dispatch agent, the conceptualization agent, and the modeling agent. They collaboratively achieve two objectives. First, it enhances concise initial scene descriptions, evolving them into detailed forms while dynamically adapting the text based on subsequent instructions. Second, it integrates procedural generation, extracting parameter values from enriched text to effortlessly interface with 3D software for asset creation. Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results but also collaborates effectively with human designers. Furthermore, it seamlessly integrates with Blender, unlocking expanded manipulation possibilities. Our work highlights the potential of LLMs in 3D modeling, offering a basic framework for future advancements in scene generation and animation., Comment: Project page: https://chuny1.github.io/3DGPT/3dgpt.html
Published: 2023

9. Exploring Predicate Visual Context in Detecting Human-Object Interactions

Author: Zhang, Frederic Z., Yuan, Yuhui, Campbell, Dylan, Zhong, Zhuoyao, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recently, the DETR framework has emerged as the dominant approach for human-object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost., Comment: Accepted to ICCV 2023
Published: 2023

10. Scaling Data Generation in Vision-and-Language Navigation

Author: Wang, Zun, Li, Jialu, Hong, Yicong, Wang, Yi, Wu, Qi, Bansal, Mohit, Gould, Stephen, Tan, Hao, and Qiao, Yu
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments., Comment: ICCV 2023
Published: 2023

11. Learning Navigational Visual Representations with Semantic Map Supervision

Author: Hong, Yicong, Zhou, Yang, Zhang, Ruiyi, Dernoncourt, Franck, Bui, Trung, Gould, Stephen, and Tan, Hao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, neglecting the spatial relationships that are essential to the learning of navigation. Inspired by the behavior that humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper, we propose a novel navigational-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps (Ego$^2$-Map). We apply the visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego$^2$-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server.
Published: 2023

12. PMaF: Deep Declarative Layers for Principal Matrix Features

Author: Xu, Zhiwei, Wang, Hao, Liu, Yanbin, and Gould, Stephen
Subjects: Computer Science - Machine Learning
Abstract: We explore two differentiable deep declarative layers, namely least squares on sphere (LESS) and implicit eigen decomposition (IED), for learning the principal matrix features (PMaF). It can be used to represent data features with a low-dimensional vector containing dominant information from a high-dimensional matrix. We first solve the problems with iterative optimization in the forward pass and then backpropagate the solution for implicit gradients under a bi-level optimization framework. Particularly, adaptive descent steps with the backtracking line search method and descent decay in the tangent space are studied to improve the forward pass efficiency of LESS. Meanwhile, exploited data structures are used to greatly reduce the computational complexity in the backward pass of LESS and IED. Empirically, we demonstrate the superiority of our layers over the off-the-shelf baselines by comparing the solution optimality and computational requirements., Comment: 16 pages, 7 figures, 10 tables, accepted to the differentiable almost everything workshop, ICML, 2023
Published: 2023

13. Towards Understanding Gradient Approximation in Equality Constrained Deep Declarative Networks

Author: Gould, Stephen, Xu, Ming, Xu, Zhiwei, and Liu, Yanbin
Subjects: Computer Science - Machine Learning
Abstract: We explore conditions for when the gradient of a deep declarative node can be approximated by ignoring constraint terms and still result in a descent direction for the global loss function. This has important practical application when training deep learning models since the approximation is often computationally much more efficient than the true gradient calculation. We provide theoretical analysis for problems with linear equality constraints and normalization constraints, and show examples where the approximation works well in practice as well as some cautionary tales for when it fails., Comment: 10 pages, 4 figures, ICML 2023 workshop on Differentiable Almost Everything
Published: 2023

14. Candidate Set Re-ranking for Composed Image Retrieval with Dual Multi-modal Encoder

Author: Liu, Zheyuan, Sun, Weixuan, Teney, Damien, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Composed image retrieval aims to find an image that best matches a given multi-modal user query consisting of a reference image and text pair. Existing methods commonly pre-compute image embeddings over the entire corpus and compare these to a reference image embedding modified by the query text at test time. Such a pipeline is very efficient at test time since fast vector distances can be used to evaluate candidates, but modifying the reference image embedding guided only by a short textual description can be difficult, especially independent of potential candidates. An alternative approach is to allow interactions between the query and every possible candidate, i.e., reference-text-candidate triplets, and pick the best from the entire set. Though this approach is more discriminative, for large-scale datasets the computational cost is prohibitive since pre-computation of candidate embeddings is no longer possible. We propose to combine the merits of both schemes using a two-stage model. Our first stage adopts the conventional vector distancing metric and performs a fast pruning among candidates. Meanwhile, our second stage employs a dual-encoder architecture, which effectively attends to the input triplet of reference-text-candidate and re-ranks the candidates. Both stages utilize a vision-and-language pre-trained network, which has proven beneficial for various downstream tasks. Our method consistently outperforms state-of-the-art approaches on standard benchmarks for the task. Our implementation is available at https://github.com/Cuberick-Orion/Candidate-Reranking-CIR., Comment: Accepted at TMLR, 19 pages, 8 figures
Published: 2023

15. Combined PD-L1/TGFβ blockade allows expansion and differentiation of stem cell-like CD8 T cells in immune excluded tumors.

Author: Castiglioni, Alessandra, Yang, Yagai, Williams, Katherine, Gogineni, Alvin, Lane, Ryan, Wang, Amber, Shyer, Justin, Zhang, Zhe, Mittman, Stephanie, Gutierrez, Alan, Astarita, Jillian, Thai, Minh, Hung, Jeffrey, Yang, Yeqing, Pourmohamad, Tony, Himmels, Patricia, De Simone, Marco, Elstrott, Justin, Capietto, Aude-Hélène, Cubas, Rafael, Modrusan, Zora, Sandoval, Wendy, Ziai, James, Gould, Stephen, Fu, Wenxian, Wang, Yulei, Koerber, James, Mellman, Ira, Turley, Shannon, Müller, Sören, and Sanjabi, Shomyseh
Subjects: Female, Animals, Mice, Cell Differentiation, CD8-Positive T-Lymphocytes, Stem Cells, B7-H1 Antigen, Transforming Growth Factor beta, Interferon-gamma, T-Cell Exhaustion, Immune Checkpoint Inhibitors, Mice, Inbred BALB C, Cell Line, Tumor, Breast Neoplasms, RNA-Seq
Abstract: TGFβ signaling is associated with non-response to immune checkpoint blockade in patients with advanced cancers, particularly in the immune-excluded phenotype. While previous work demonstrates that converting tumors from excluded to inflamed phenotypes requires attenuation of PD-L1 and TGFβ signaling, the underlying cellular mechanisms remain unclear. Here, we show that TGFβ and PD-L1 restrain intratumoral stem cell-like CD8 T cell (TSCL) expansion and replacement of progenitor-exhausted and dysfunctional CD8 T cells with non-exhausted T effector cells in the EMT6 tumor model in female mice. Upon combined TGFβ/PD-L1 blockade IFNγhi CD8 T effector cells show enhanced motility and accumulate in the tumor. Ensuing IFNγ signaling transforms myeloid, stromal, and tumor niches to yield an immune-supportive ecosystem. Blocking IFNγ abolishes the anti-PD-L1/anti-TGFβ therapy efficacy. Our data suggest that TGFβ works with PD-L1 to prevent TSCL expansion and replacement of exhausted CD8 T cells, thereby maintaining the T cell compartment in a dysfunctional state.
Published: 2023

16. A Biological Homage to Mickey Mouse

Author: Gould, Stephen Jay
Published: 2012

17. GoferBot: A Visual Guided Human-Robot Collaborative Assembly System

Author: Zhuang, Zheyu, Ben-Shabat, Yizhak, Zhang, Jiahao, Gould, Stephen, and Mahony, Robert
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition
Abstract: The current transformation towards smart manufacturing has led to a growing demand for human-robot collaboration (HRC) in the manufacturing process. Perceiving and understanding the human co-worker's behaviour introduces challenges for collaborative robots to efficiently and effectively perform tasks in unstructured and dynamic environments. Integrating recent data-driven machine vision capabilities into HRC systems is a logical next step in addressing these challenges. However, in these cases, off-the-shelf components struggle due to generalisation limitations. Real-world evaluation is required in order to fully appreciate the maturity and robustness of these approaches. Furthermore, understanding the pure-vision aspects is a crucial first step before combining multiple modalities in order to understand the limitations. In this paper, we propose GoferBot, a novel vision-based semantic HRC system for a real-world assembly task. It is composed of a visual servoing module that reaches and grasps assembly parts in an unstructured multi-instance and dynamic environment, an action recognition module that performs human action prediction for implicit communication, and a visual handover module that uses the perceptual understanding of human behaviour to produce an intuitive and efficient collaborative assembly experience. GoferBot is a novel assembly system that seamlessly integrates all sub-modules by utilising implicit semantic information purely from visual perception.
Published: 2023

18. Adaptive Cross Batch Normalization for Metric Learning

Author: Ajanthan, Thalaiyasingam, Ma, Matt, Hengel, Anton van den, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Metric learning is a fundamental problem in computer vision whereby a model is trained to learn a semantically useful embedding space via ranking losses. Traditionally, the effectiveness of a ranking loss depends on the minibatch size, and is, therefore, inherently limited by the memory constraints of the underlying hardware. While simply accumulating the embeddings across minibatches has proved useful (Wang et al. [2020]), we show that it is equally important to ensure that the accumulated embeddings are up to date. In particular, it is necessary to circumvent the representational drift between the accumulated embeddings and the feature embeddings at the current training iteration as the learnable parameters are being updated. In this paper, we model representational drift as distribution misalignment and tackle it using moment matching. The result is a simple method for updating the stored embeddings to match the first and second moments of the current embeddings at each training iteration. Experiments on three popular image retrieval datasets, namely, SOP, In-Shop, and DeepFashion2, demonstrate that our approach significantly improves the performance in all scenarios.
Published: 2023
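
The moment-matching step described in this abstract (updating stored embeddings so that their first and second moments match those of the current embeddings) can be sketched as follows. This is a simplified per-dimension version written for illustration, not the authors' implementation; the function name `match_moments` and the synthetic "drift" are assumptions.

```python
import numpy as np

def match_moments(stored, current, eps=1e-8):
    """Shift and rescale stored embeddings, dimension by dimension, so that
    their mean and standard deviation match those of the current embeddings."""
    mu_s, sd_s = stored.mean(axis=0), stored.std(axis=0)
    mu_c, sd_c = current.mean(axis=0), current.std(axis=0)
    return (stored - mu_s) / (sd_s + eps) * sd_c + mu_c

# Toy example: stored embeddings have drifted in both location and scale.
rng = np.random.default_rng(0)
current = rng.standard_normal((256, 8))
stored = 2.0 * rng.standard_normal((1024, 8)) + 0.5
adapted = match_moments(stored, current)
```

After the update, `adapted` has the same per-dimension mean and standard deviation as `current`, while the relative arrangement of the stored points is preserved.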

19. Bi-directional Training for Composed Image Retrieval via Text Prompt Learning

Author: Liu, Zheyuan, Sun, Weixuan, Hong, Yicong, Teney, Damien, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text describing the desired changes. Existing approaches to solving this challenging task learn a mapping from the (reference image, modification text)-pair to an image embedding that is then matched against a large image corpus. One area that has not yet been explored is the reverse direction, which asks the question, what reference image when modified as described by the text would produce the given target image? In this work we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures with minimum changes, which improves the performance of the model. To encode the bi-directional query we prepend a learnable token to the modification text that designates the direction of the query and then finetune the parameters of the text embedding module. We make no other changes to the network architecture. Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model that itself already achieves competitive performance. Our code is released at https://github.com/Cuberick-Orion/Bi-Blip4CIR., Comment: WACV 2024 accepted. 12 pages, 7 figures
Published: 2023

20. Aligning Step-by-Step Instructional Diagrams to Video Demonstrations

Author: Zhang, Jiahao, Cherian, Anoop, Liu, Yanbin, Ben-Shabat, Yizhak, Rodriguez, Cristian, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos comprising an enactment of the assembly actions in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset: IAW for Ikea assembly in the wild consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals and annotated for their ground truth alignments. We define two tasks on this dataset: First, nearest neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps and the segments for each video. Extensive experiments on IAW demonstrate superior performances of our approach against alternatives., Comment: Project website: https://academic.davidz.cn/en/publication/zhang-cvpr-2023/
Published: 2023

21. Deep Declarative Dynamic Time Warping for End-to-End Learning of Alignment Paths

Author: Xu, Ming, Garg, Sourav, Milford, Michael, and Gould, Stephen
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: This paper addresses learning end-to-end models for time series data that include a temporal alignment step via dynamic time warping (DTW). Existing approaches to differentiable DTW either differentiate through a fixed warping path or apply a differentiable relaxation to the min operator found in the recursive steps used to solve the DTW problem. We instead propose a DTW layer based around bi-level optimisation and deep declarative networks, which we name DecDTW. By formulating DTW as a continuous, inequality constrained optimisation problem, we can compute gradients for the solution of the optimal alignment (with respect to the underlying time series) using implicit differentiation. An interesting byproduct of this formulation is that DecDTW outputs the optimal warping path between two time series as opposed to a soft approximation, recoverable from Soft-DTW. We show that this property is particularly useful for applications where downstream loss functions are defined on the optimal alignment path itself. This naturally occurs, for instance, when learning to improve the accuracy of predicted alignments against ground truth alignments. We evaluate DecDTW on two such applications, namely the audio-to-score alignment task in music information retrieval and the visual place recognition task in robotics, demonstrating state-of-the-art results in both., Comment: ICLR 2023 (Poster)
Published: 2023
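
Note on the abstract above: it refers to "the min operator found in the recursive steps used to solve the DTW problem". As a minimal sketch of that classic recursion (not the DecDTW layer itself, which the abstract describes as a continuous, inequality-constrained optimisation problem), the following computes the textbook DTW cost and recovers a hard warping path by backtracking. The function name `dtw` and the absolute-difference local cost are assumptions chosen for illustration.

```python
from math import inf


def dtw(x, y):
    """Classic DTW between two 1-D sequences: returns (total cost, warping path)."""
    n, m = len(x), len(y)
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])  # assumed local cost
            # The recursive step: a hard min over the three predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )

    # Backtrack from the last cell to recover one optimal alignment path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path
```

The hard `min` is the non-differentiable step that, per the abstract, existing approaches either relax or differentiate around.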

22. 3DInAction: Understanding Human Actions in 3D Point Clouds

Author: Ben-Shabat, Yizhak, Shrout, Oren, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We propose a novel method for 3D point cloud action recognition. Understanding human actions in RGB videos has been widely studied in recent years, however, its 3D point cloud counterpart remains under-explored. This is mostly due to the inherent limitation of the point cloud data modality -- lack of structure, permutation invariance, and varying number of points -- which makes it difficult to learn a spatio-temporal representation. To address this limitation, we propose the 3DinAction pipeline that first estimates patches moving in time (t-patches) as a key building block, alongside a hierarchical architecture that learns an informative spatio-temporal representation. We show that our method achieves improved performance on existing datasets, including DFAUST and IKEA ASM. Code is publicly available at https://github.com/sitzikbs/3dincaction.
Published: 2023

23. Learning to Select Camera Views: Efficient Multiview Understanding at Few Glances

Author: Hou, Yunzhong, Gould, Stephen, and Zheng, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Multiview camera setups have proven useful in many computer vision applications for reducing ambiguities, mitigating occlusions, and increasing field-of-view coverage. However, the high computational cost associated with multiple views poses a significant challenge for end devices with limited computational resources. To address this issue, we propose a view selection approach that analyzes the target object or scenario from given views and selects the next best view for processing. Our approach features a reinforcement learning based camera selection module, MVSelect, that not only selects views but also facilitates joint training with the task network. Experimental results on multiview classification and detection tasks show that our approach achieves promising performance while using only 2 or 3 out of N available views, significantly reducing computational costs. Furthermore, analysis on the selected views reveals that certain cameras can be shut off with minimal performance impact, shedding light on future camera layout optimization for multiview systems. Code is available at https://github.com/hou-yz/MVSelect.
Published: 2023

24. Confidence and Dispersity Speak: Characterising Prediction Matrix for Unsupervised Accuracy Estimation

Author: Deng, Weijian, Suh, Yumin, Gould, Stephen, and Zheng, Liang
Subjects: Computer Science - Machine Learning
Abstract: This work aims to assess how well a model performs under distribution shifts without using labels. While recent methods study prediction confidence, this work reports prediction dispersity is another informative cue. Confidence reflects whether the individual prediction is certain; dispersity indicates how the overall predictions are distributed across all categories. Our key insight is that a well-performing model should give predictions with high confidence and high dispersity. That is, we need to consider both properties so as to make more accurate estimates. To this end, we use the nuclear norm that has been shown to be effective in characterizing both properties. Extensive experiments validate the effectiveness of nuclear norm for various models (e.g., ViT and ConvNeXt), different datasets (e.g., ImageNet and CUB-200), and diverse types of distribution shifts (e.g., style shift and reproduction shift). We show that the nuclear norm is more accurate and robust in accuracy estimation than existing methods. Furthermore, we validate the feasibility of other measurements (e.g., mutual information maximization) for characterizing dispersity and confidence. Lastly, we investigate the limitation of the nuclear norm, study its improved variant under severe class imbalance, and discuss potential directions., Comment: This version is not fully edited and will be updated soon
Published: 2023
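
Note on the abstract above: the nuclear norm of a matrix is the sum of its singular values. The sketch below, assuming NumPy is available, computes it for three toy prediction matrices (rows are samples, columns are class probabilities) to illustrate why it responds to both confidence and dispersity. The toy matrices and the unnormalised score are illustrative assumptions; the abstract does not give the exact normalisation the authors use.

```python
import numpy as np


def nuclear_norm(probs):
    """Sum of singular values of an (N x K) matrix of predicted class probabilities."""
    singular_values = np.linalg.svd(np.asarray(probs, dtype=float), compute_uv=False)
    return float(singular_values.sum())


# Confident and dispersed: each sample is confidently assigned to a different class.
confident_dispersed = np.eye(3)
# Confident but collapsed: every sample is confidently assigned to the same class.
confident_collapsed = np.array([[1.0, 0.0, 0.0]] * 3)
# Uncertain: every sample receives a uniform distribution over classes.
uncertain = np.full((3, 3), 1.0 / 3.0)

for name, matrix in [
    ("confident + dispersed", confident_dispersed),
    ("confident + collapsed", confident_collapsed),
    ("uncertain", uncertain),
]:
    print(name, round(nuclear_norm(matrix), 3))
```

In this toy setting the confident and dispersed matrix has the largest nuclear norm, which matches the abstract's claim that a well-performing model should show both properties.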

25. Understanding and Improving the Role of Projection Head in Self-Supervised Learning

Author: Gupta, Kartik, Ajanthan, Thalaiyasingam, Hengel, Anton van den, and Gould, Stephen
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Self-supervised learning (SSL) aims to produce useful feature representations without access to any human-labeled data annotations. Due to the success of recent SSL methods based on contrastive learning, such as SimCLR, this problem has gained popularity. Most current contrastive learning approaches append a parametrized projection head to the end of some backbone network to optimize the InfoNCE objective and then discard the learned projection head after training. This raises a fundamental question: Why is a learnable projection head required if we are to discard it after training? In this work, we first perform a systematic study on the behavior of SSL training focusing on the role of the projection head layers. By formulating the projection head as a parametric component for the InfoNCE objective rather than a part of the network, we present an alternative optimization scheme for training contrastive learning based SSL frameworks. Our experimental study on multiple image classification datasets demonstrates the effectiveness of the proposed approach over alternatives in the SSL literature.
Published: 2022

26. NeRFEditor: Differentiable Style Decomposition for Full 3D Scene Editing

Author: Sun, Chunyi, Liu, Yanbin, Han, Junlin, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Graphics, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: We present NeRFEditor, an efficient learning framework for 3D scene editing, which takes a video captured over 360° as input and outputs a high-quality, identity-preserving stylized 3D scene. Our method supports diverse types of editing such as guided by reference images, text prompts, and user interactions. We achieve this by encouraging a pre-trained StyleGAN model and a NeRF model to learn from each other mutually. Specifically, we use a NeRF model to generate numerous image-angle pairs to train an adjustor, which can adjust the StyleGAN latent code to generate high-fidelity stylized images for any given angle. To extrapolate editing to GAN out-of-domain views, we devise another module that is trained in a self-supervised learning manner. This module maps novel-view images to the hidden space of StyleGAN that allows StyleGAN to generate stylized images on novel views. These two modules together produce guided images in 360° views to finetune a NeRF to make stylization effects, where a stable fine-tuning strategy is proposed to achieve this. Experiments show that NeRFEditor outperforms prior work on benchmark and real-world scenes with better editability, fidelity, and identity preservation., Comment: Project page: https://chuny1.github.io/NeRFEditor/nerfeditor.html
Published: 2022

27. High-Fidelity Guided Image Synthesis with Latent Diffusion Models

Author: Singh, Jaskirat, Gould, Stephen, and Zheng, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we note that prior works in this direction suffer from an intrinsic domain shift problem, wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation of the same can be achieved while just requiring a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user is also able to control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. Project page for our paper is available at https://1jsingh.github.io/gradop.
Published: 2022

28. Order and Life by Joseph Needham (review)

Author: Gould, Stephen Jay
Published: 2017

29. Learning to Structure an Image with Few Colors and Beyond

Author: Hou, Yunzhong, Zheng, Liang, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Color and structure are the two pillars that combine to give an image its meaning. Interested in critical structures for neural network recognition, we isolate the influence of colors by limiting the color space to just a few bits, and find structures that enable network recognition under such constraints. To this end, we propose a color quantization network, ColorCNN, which learns to structure an image in limited color spaces by minimizing the classification loss. Building upon the architecture and insights of ColorCNN, we introduce ColorCNN+, which supports multiple color space size configurations, and addresses the previous issues of poor recognition accuracy and undesirable visual fidelity under large color spaces. Via a novel imitation learning approach, ColorCNN+ learns to cluster colors like traditional color quantization methods. This reduces overfitting and helps both visual fidelity and recognition accuracy under large color spaces. Experiments verify that ColorCNN+ achieves very competitive results under most circumstances, preserving both key structures for network recognition and visual fidelity with accurate colors. We further discuss differences between key structures and accurate colors, and their specific contributions to network recognition. For potential applications, we show that ColorCNNs can be used as image compression methods for network recognition.
Published: 2022

30. Multi-View Correlation Consistency for Semi-Supervised Semantic Segmentation

Author: Hou, Yunzhong, Gould, Stephen, and Zheng, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Semi-supervised semantic segmentation needs rich and robust supervision on unlabeled data. Consistency learning enforces the same pixel to have similar features in different augmented views, which is a robust signal but neglects relationships with other pixels. In comparison, contrastive learning considers rich pairwise relationships, but it can be a conundrum to assign binary positive-negative supervision signals for pixel pairs. In this paper, we take the best of both worlds and propose multi-view correlation consistency (MVCC) learning: it considers rich pairwise relationships in self-correlation matrices and matches them across views to provide robust supervision. Together with this correlation consistency loss, we propose a view-coherent data augmentation strategy that guarantees pixel-pixel correspondence between different views. In a series of semi-supervised settings on two datasets, we report competitive accuracy compared with the state-of-the-art methods. Notably, on Cityscapes, we achieve 76.8% mIoU with 1/8 labeled data, just 0.6% shy from the fully supervised oracle.
Published: 2022

31. On the Strong Correlation Between Model Invariance and Generalization

Author: Deng, Weijian, Gould, Stephen, and Zheng, Liang
Subjects: Computer Science - Machine Learning
Abstract: Generalization and invariance are two essential properties of any machine learning model. Generalization captures a model's ability to classify unseen data while invariance measures consistency of model predictions on transformations of the data. Existing research suggests a positive relationship: a model generalizing well should be invariant to certain visual factors. Building on this qualitative implication we make two contributions. First, we introduce effective invariance (EI), a simple and reasonable measure of model invariance which does not rely on image labels. Given predictions on a test image and its transformed version, EI measures how well the predictions agree and with what level of confidence. Second, using invariance scores computed by EI, we perform large-scale quantitative correlation studies between generalization and invariance, focusing on rotation and grayscale transformations. From a model-centric view, we observe generalization and invariance of different models exhibit a strong linear relationship, on both in-distribution and out-of-distribution datasets. From a dataset-centric view, we find a certain model's accuracy and invariance linearly correlated on different test sets. Apart from these major findings, other minor but interesting insights are also discussed., Comment: 18 pages, 11 figures; this version is not fully edited and will be updated soon
Published: 2022

32. Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

Author: Hong, Yicong, Wang, Zun, Wu, Qi, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Robotics
Abstract: Most existing works in vision-and-language navigation (VLN) focus on either discrete or continuous environments, training agents that cannot generalize across the two. The fundamental difference between the two setups is that discrete navigation assumes prior knowledge of the connectivity graph of the environment, so that the agent can effectively transfer the problem of navigation with low-level controls to jumping from node to node with high-level actions by grounding to an image of a navigable direction. To bridge the discrete-to-continuous gap, we propose a predictor to generate a set of candidate waypoints during navigation, so that agents designed with high-level actions can be transferred to and trained in continuous environments. We refine the connectivity graph of Matterport3D to fit the continuous Habitat-Matterport3D, and train the waypoints predictor with the refined graphs to produce accessible waypoints at each time step. Moreover, we demonstrate that the predicted waypoints can be augmented during training to diversify the views and paths, and therefore enhance agent's generalization ability. Through extensive experiments we show that agents navigating in continuous environments with predicted waypoints perform significantly better than agents using low-level actions, which reduces the absolute discrete-to-continuous gap by 11.76% Success Weighted by Path Length (SPL) for the Cross-Modal Matching Agent and 18.24% SPL for the Recurrent VLN-BERT. Our agents, trained with a simple imitation learning objective, outperform previous methods by a large margin, achieving new state-of-the-art results on the testing environments of the R2R-CE and the RxR-CE datasets.
Published: 2022

33. Exploiting Problem Structure in Deep Declarative Networks: Two Case Studies

Author: Gould, Stephen, Campbell, Dylan, Ben-Shabat, Itzik, Koneputugodage, Chamin Hewa, and Xu, Zhiwei
Subjects: Computer Science - Machine Learning
Abstract: Deep declarative networks and other recent related works have shown how to differentiate the solution map of a (continuous) parametrized optimization problem, opening up the possibility of embedding mathematical optimization problems into end-to-end learnable models. These differentiability results can lead to significant memory savings by providing an expression for computing the derivative without needing to unroll the steps of the forward-pass optimization procedure during the backward pass. However, the results typically require inverting a large Hessian matrix, which is computationally expensive when implemented naively. In this work we study two applications of deep declarative networks -- robust vector pooling and optimal transport -- and show how problem structure can be exploited to obtain very efficient backward pass computations in terms of both time and memory. Our ideas can be used as a guide for improving the computational performance of other novel deep declarative nodes., Comment: Appears in OT-SDM 2022: The 1st International Workshop on Optimal Transport and Structured Data Modeling
Published: 2022

34. Phase 2 of extracellular RNA communication consortium charts next-generation approaches for extracellular RNA research

Author: Mateescu, Bogdan, Jones, Jennifer C, Alexander, Roger P, Alsop, Eric, An, Ji Yeong, Asghari, Mohammad, Boomgarden, Alex, Bouchareychas, Laura, Cayota, Alfonso, Chang, Hsueh-Chia, Charest, Al, Chiu, Daniel T, Coffey, Robert J, Das, Saumya, De Hoff, Peter, deMello, Andrew, D’Souza-Schorey, Crislyn, Elashoff, David, Eliato, Kiarash R, Franklin, Jeffrey L, Galas, David J, Gerstein, Mark B, Ghiran, Ionita H, Go, David B, Gould, Stephen, Grogan, Tristan R, Higginbotham, James N, Hladik, Florian, Huang, Tony Jun, Huo, Xiaoye, Hutchins, Elizabeth, Jeppesen, Dennis K, Jovanovic-Talisman, Tijana, Kim, Betty YS, Kim, Sung, Kim, Kyoung-Mee, Kim, Yong, Kitchen, Robert R, Knouse, Vaughan, LaPlante, Emily L, Lebrilla, Carlito B, Lee, L James, Lennon, Kathleen M, Li, Guoping, Li, Feng, Li, Tieyi, Liu, Tao, Liu, Zirui, Maddox, Adam L, McCarthy, Kyle, Meechoovet, Bessie, Maniya, Nalin, Meng, Yingchao, Milosavljevic, Aleksandar, Min, Byoung-Hoon, Morey, Amber, Ng, Martin, Nolan, John, De Oliveira, Getulio P, Paulaitis, Michael E, Phu, Tuan Anh, Raffai, Robert L, Reátegui, Eduardo, Roth, Matthew E, Routenberg, David A, Rozowsky, Joel, Rufo, Joseph, Senapati, Satyajyoti, Shachar, Sigal, Sharma, Himani, Sood, Anil K, Stavrakis, Stavros, Stürchler, Alessandra, Tewari, Muneesh, Tosar, Juan P, Tucker-Schwartz, Alexander K, Turchinovich, Andrey, Valkov, Nedyalka, Van Keuren-Jensen, Kendall, Vickers, Kasey C, Vojtech, Lucia, Vreeland, Wyatt N, Wang, Ceming, Wang, Kai, Wang, ZeYu, Welsh, Joshua A, Witwer, Kenneth W, Wong, David TW, Xia, Jianping, Xie, Ya-Hong, Yang, Kaichun, Zaborowski, Mikołaj P, Zhang, Chenguang, Zhang, Qin, Zivkovic, Angela M, and Laurent, Louise C
Subjects: Biological Sciences, Biomedical and Clinical Sciences, Genetics, Biochemistry, Biological sciences, Cell biology, Molecular biology
Abstract: The extracellular RNA communication consortium (ERCC) is an NIH-funded program aiming to promote the development of new technologies, resources, and knowledge about exRNAs and their carriers. After Phase 1 (2013-2018), Phase 2 of the program (ERCC2, 2019-2023) aims to fill critical gaps in knowledge and technology to enable rigorous and reproducible methods for separation and characterization of both bulk populations of exRNA carriers and single EVs. ERCC2 investigators are also developing new bioinformatic pipelines to promote data integration through the exRNA atlas database. ERCC2 has established several Working Groups (Resource Sharing, Reagent Development, Data Analysis and Coordination, Technology Development, nomenclature, and Scientific Outreach) to promote collaboration between ERCC2 members and the broader scientific community. We expect that ERCC2's current and future achievements will significantly improve our understanding of exRNA biology and the development of accurate and efficient exRNA-based diagnostic, prognostic, and theranostic biomarker assays.
Published: 2022

35. Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer

Author: Zhang, Frederic Z., Campbell, Dylan, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialise, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU., Comment: Accepted to CVPR2022. 14 pages, 14 figures and 5 tables
Published: 2021

36. A Regularized Wasserstein Framework for Graph Kernels

Author: Wijesinghe, Asiri, Wang, Qing, and Gould, Stephen
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: We propose a learning framework for graph kernels, which is theoretically grounded on regularizing optimal transport. This framework provides a novel optimal transport distance metric, namely Regularized Wasserstein (RW) discrepancy, which can preserve both features and structure of graphs via Wasserstein distances on features and their local variations, local barycenters and global connectivity. Two strongly convex regularization terms are introduced to improve the learning ability. One is to relax an optimal alignment between graphs to be a cluster-to-cluster mapping between their locally connected vertices, thereby preserving the local clustering structure of graphs. The other is to take into account node degree distributions in order to better preserve the global structure of graphs. We also design an efficient algorithm to enable a fast approximation for solving the optimization problem. Theoretically, our framework is robust and can guarantee the convergence and numerical stability in optimization. We have empirically validated our method using 12 datasets against 16 state-of-the-art baselines. The experimental results show that our method consistently outperforms all state-of-the-art methods on all benchmark databases for both graphs with discrete attributes and graphs with continuous attributes., Comment: 21st IEEE International Conference on Data Mining (ICDM 2021)
Published: 2021

37. Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models

Author: Liu, Zheyuan, Rodriguez-Opazo, Cristian, Teney, Damien, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: We extend the task of composed image retrieval, where an input query consists of an image and short textual description of how to modify the image. Existing methods have only been applied to non-complex images within narrow domains, such as fashion products, thereby limiting the scope of study on in-depth visual reasoning in rich image and language contexts. To address this issue, we collect the Compose Image Retrieval on Real-life images (CIRR) dataset, which consists of over 36,000 pairs of crowd-sourced, open-domain images with human-generated modifying text. To extend current methods to the open-domain, we propose CIRPLANT, a transformer based model that leverages rich pre-trained vision-and-language (V&L) knowledge for modifying visual features conditioned on natural language. Retrieval is then done by nearest neighbor lookup on the modified features. We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on the existing narrow datasets, such as fashion. Together with the release of CIRR, we believe this work will inspire further research on composed image retrieval., Comment: ICCV 2021. Dataset, code, and pre-trained models are released at https://cuberick-orion.github.io/CIRR/
Published: 2021

38. DiGS : Divergence guided shape implicit neural representation for unoriented point clouds

Author: Ben-Shabat, Yizhak, Koneputugodage, Chamin Hewa, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Shape implicit neural representations (INRs) have recently been shown to be effective in shape analysis and reconstruction tasks. Existing INRs require point coordinates to learn the implicit level sets of the shape. When a normal vector is available for each point, a higher fidelity representation can be learned; however, normal vectors are often not provided as raw data. Furthermore, the method's initialization has been shown to play a crucial role for surface reconstruction. In this paper, we propose a divergence guided shape representation learning approach that does not require normal vectors as input. We show that incorporating a soft constraint on the divergence of the distance function favours smooth solutions that reliably orient gradients to match the unknown normal at each point, in some cases even better than approaches that use ground truth normal vectors directly. Additionally, we introduce a novel geometric initialization method for sinusoidal INRs that further improves convergence to the desired solution. We evaluate the effectiveness of our approach on the task of surface reconstruction and shape space learning and show SOTA performance compared to other unoriented methods. Code and model parameters available at our project page https://chumbyte.github.io/DiGS-Site/.
Published: 2021

39. What Does Rotation Prediction Tell Us about Classifier Accuracy under Varying Testing Environments?

Author: Deng, Weijian, Gould, Stephen, and Zheng, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Understanding classifier decision under novel environments is central to the community, and a common practice is evaluating it on labeled test sets. However, in real-world testing, image annotations are difficult and expensive to obtain, especially when the test environment is changing. A natural question then arises: given a trained classifier, can we evaluate its accuracy on varying unlabeled test sets? In this work, we train semantic classification and rotation prediction in a multi-task way. On a series of datasets, we report an interesting finding, i.e., the semantic classification accuracy exhibits a strong linear relationship with the accuracy of the rotation prediction task (Pearson's Correlation r > 0.88). This finding allows us to utilize linear regression to estimate classifier performance from the accuracy of rotation prediction which can be obtained on the test set through the freely generated rotation labels., Comment: ICML 2021 camera ready
Published: 2021
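
Note on the abstract above: it describes using linear regression to estimate classifier accuracy from rotation-prediction accuracy. A minimal sketch of that idea follows, using ordinary least squares in plain Python. The helper names `fit_line` and `estimate_accuracy` are hypothetical, and the accuracy pairs are made-up values for illustration only, not results from the paper.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares fit y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up accuracies on labeled sets (illustration only, not from the paper).
rotation_acc = [0.55, 0.62, 0.70, 0.78, 0.85]
semantic_acc = [0.48, 0.57, 0.63, 0.74, 0.80]
slope, intercept, r = fit_line(rotation_acc, semantic_acc)


def estimate_accuracy(rotation_accuracy):
    """Predict classification accuracy on an unlabeled set from its rotation accuracy."""
    return slope * rotation_accuracy + intercept


print("Pearson r:", round(r, 3))
print("Estimated accuracy at rotation accuracy 0.66:", round(estimate_accuracy(0.66), 3))
```

Rotation labels can be generated for free on any test set, so only the fitting step needs semantic labels.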

40. View-coherent correlation consistency for semi-supervised semantic segmentation

Author: Hou, Yunzhong, Gould, Stephen, and Zheng, Liang
Published: 2024

41. Hamilton transversals in random Latin squares

Author: Gould, Stephen and Kelly, Tom
Subjects: Mathematics - Combinatorics
Abstract: Gyárfás and Sárközy conjectured that every $n\times n$ Latin square has a 'cycle-free' partial transversal of size $n-2$. We confirm this conjecture in a strong sense for almost all Latin squares, by showing that as $n \rightarrow \infty$, all but a vanishing proportion of $n\times n$ Latin squares have a Hamilton transversal, i.e. a full transversal for which any proper subset is cycle-free. In fact, we prove a counting result that in almost all Latin squares, the number of Hamilton transversals is essentially that of Taranenko's upper bound on the number of full transversals. This result strengthens a result of Kwan (which in turn implies that almost all Latin squares also satisfy the famous Ryser-Brualdi-Stein conjecture)., Comment: 28 pages, 4 figures. To appear in Random Structures & Algorithms
Published: 2021

42. Semantics for Robotic Mapping, Perception and Interaction: A Survey

Author: Garg, Sourav, Sünderhauf, Niko, Dayoub, Feras, Morrison, Douglas, Cosgun, Akansel, Carneiro, Gustavo, Wu, Qi, Chin, Tat-Jun, Reid, Ian, Gould, Stephen, Corke, Peter, and Milford, Michael
Subjects: Computer Science - Robotics, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: For robots to navigate and interact more richly with the world around them, they will likely require a deeper understanding of the world in which they operate. In robotics and related research fields, the study of understanding is often referred to as semantics, which dictates what does the world "mean" to a robot, and is strongly tied to the question of how to represent that meaning. With humans and robots increasingly operating in the same world, the prospects of human-robot interaction also bring semantics and ontology of natural language into the picture. Driven by need, as well as by enablers like increasing availability of training data and computational resources, semantics is a rapidly growing research area in robotics. The field has received significant attention in the research literature to date, but most reviews and surveys have focused on particular aspects of the topic: the technical research issues regarding its use in specific robotic topics like mapping or segmentation, or its relevance to one particular application domain like autonomous driving. A new treatment is therefore required, and is also timely because so much relevant research has occurred since many of the key surveys were published. This survey therefore provides an overarching snapshot of where semantics in robotics stands today. We establish a taxonomy for semantics research in or relevant to robotics, split into four broad categories of activity, in which semantics are extracted, used, or both. Within these broad categories we survey dozens of major topics including fundamentals from the computer vision field and key robotics research areas utilizing semantics, including mapping, navigation and interaction with the world. The survey also covers key practical considerations, including enablers like increased data availability and improved computational hardware, and major application areas where..., Comment: 81 pages, 1 figure, published in Foundations and Trends in Robotics, 2020
Published: 2021

43. Fine-grained Classification via Categorical Memory Networks

Author: Deng, Weijian, Marsh, Joshua, Gould, Stephen, and Zheng, Liang
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Motivated by the desire to exploit patterns shared across classes, we present a simple yet effective class-specific memory module for fine-grained feature learning. The memory module stores the prototypical feature representation for each category as a moving average. We hypothesize that the combination of similarities with respect to each category is itself a useful discriminative cue. To detect these similarities, we use attention as a querying mechanism. The attention scores with respect to each class prototype are used as weights to combine prototypes via weighted sum, producing a uniquely tailored response feature representation for a given input. The original and response features are combined to produce an augmented feature for classification. We integrate our class-specific memory module into a standard convolutional neural network, yielding a Categorical Memory Network. Our memory module significantly improves accuracy over baseline CNNs, achieving competitive accuracy with state-of-the-art methods on four benchmarks, including CUB-200-2011, Stanford Cars, FGVC Aircraft, and NABirds., Comment: 10 pages, 9 figures, 7 tables; this version is not fully edited and will be updated soon
Published: 2020

44. Spatially Conditioned Graphs for Detecting Human-Object Interactions

Author: Zhang, Frederic Z., Campbell, Dylan, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We address the problem of detecting human-object interactions in images using graphical neural networks. Unlike conventional methods, where nodes send scaled but otherwise identical messages to each of their neighbours, we propose to condition messages between pairs of nodes on their spatial relationships, resulting in different messages going to neighbours of the same node. To this end, we explore various ways of applying spatial conditioning under a multi-branch structure. Through extensive experimentation we demonstrate the advantages of spatial conditioning for the computation of the adjacency structure, messages and the refined graph features. In particular, we empirically show that as the quality of the bounding boxes increases, their coarse appearance features contribute relatively less to the disambiguation of interactions compared to the spatial information. Our method achieves an mAP of 31.33% on HICO-DET and 54.2% on V-COCO, significantly outperforming state-of-the-art on fine-tuned detections., Comment: Accepted to ICCV 2021
Published: 2020

45. Probabilistic Tracklet Scoring and Inpainting for Multiple Object Tracking

Author: Saleh, Fatemeh, Aliakbarian, Sadegh, Rezatofighi, Hamid, Salzmann, Mathieu, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Despite the recent advances in multiple object tracking (MOT), achieved by joint detection and tracking, dealing with long occlusions remains a challenge. This is due to the fact that such techniques tend to ignore the long-term motion information. In this paper, we introduce a probabilistic autoregressive motion model to score tracklet proposals by directly measuring their likelihood. This is achieved by training our model to learn the underlying distribution of natural tracklets. As such, our model allows us not only to assign new detections to existing tracklets, but also to inpaint a tracklet when an object has been lost for a long time, e.g., due to occlusion, by sampling tracklets so as to fill the gap caused by misdetections. Our experiments demonstrate the superiority of our approach at tracking objects in challenging sequences; it outperforms the state of the art in most standard MOT metrics on multiple MOT benchmark datasets, including MOT16, MOT17, and MOT20.
Published: 2020

46. A Recurrent Vision-and-Language BERT for Navigation

Author: Hong, Yicong, Wu, Qi, Qi, Yuankai, Rodriguez-Opazo, Cristian, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Accuracy of many visiolinguistic tasks has benefited significantly from the application of vision-and-language (V&L) BERT. However, its application for the task of vision-and-language navigation (VLN) remains limited. One reason for this is the difficulty adapting the BERT architecture to the partially observable Markov decision process present in VLN, requiring history-dependent attention and decision making. In this paper we propose a recurrent BERT model that is time-aware for use in VLN. Specifically, we equip the BERT model with a recurrent function that maintains cross-modal state information for the agent. Through extensive experiments on R2R and REVERIE we demonstrate that our model can replace more complex encoder-decoder models to achieve state-of-the-art results. Moreover, our approach can be generalised to other transformer-based architectures, supports pre-training, and is capable of solving navigation and referring expression tasks simultaneously.
Published: 2020

47. Rethinking conditional GAN training: An approach using geometrically structured latent manifolds

Author: Ramasinghe, Sameera, Farazi, Moshiur, Khan, Salman, Barnes, Nick, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Conditional GANs (cGAN), in their rudimentary form, suffer from critical drawbacks such as the lack of diversity in generated outputs and distortion between the latent and output manifolds. Although efforts have been made to improve results, they can suffer from unpleasant side-effects such as the topology mismatch between latent and output spaces. In contrast, we tackle this problem from a geometrical perspective and propose a novel training mechanism that increases both the diversity and the visual quality of a vanilla cGAN, by systematically encouraging a bi-lipschitz mapping between the latent and the output manifolds. We validate the efficacy of our solution on a baseline cGAN (i.e., Pix2Pix) which lacks diversity, and show that by only modifying its training mechanism (i.e., with our proposed Pix2Pix-Geo), one can achieve more diverse and realistic outputs on a broad set of image-to-image translation tasks. Codes are available at https://github.com/samgregoost/Rethinking-CGANs.
Published: 2020

48. Language and Visual Entity Relationship Graph for Agent Navigation

Author: Hong, Yicong, Rodriguez-Opazo, Cristian, Qi, Yuankai, Wu, Qi, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. From both the textual and visual perspectives, we find that the relationships among the scene, its objects, and directional clues are essential for the agent to interpret complex instructions and correctly perceive the environment. To capture and utilize the relationships, we propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision, and the intra-modal relationships among visual entities. We propose a message passing algorithm for propagating information between language elements and visual entities in the graph, which we then combine to determine the next action to take. Experiments show that by taking advantage of the relationships we are able to improve over state-of-the-art. On the Room-to-Room (R2R) benchmark, our method achieves the new best performance on the test unseen split with success rate weighted by path length (SPL) of 52%. On the Room-for-Room (R4R) dataset, our method significantly improves the previous best from 13% to 34% on the success weighted by normalized dynamic time warping (SDTW). Code is available at: https://github.com/YicongHong/Entity-Graph-VLN.
Published: 2020

49. DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video

Author: Rodriguez-Opazo, Cristian, Marrese-Taylor, Edison, Fernando, Basura, Li, Hongdong, and Gould, Stephen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper studies the task of temporal moment localization in a long untrimmed video using a natural language query. Given a query sentence, the goal is to determine the start and end of the relevant segment within the video. Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm suitable for temporal moment localization which captures the relationships between humans, objects and activities in the video. These relationships are obtained by a spatial sub-graph that contextualizes the scene representation using detected objects and human features conditioned on the language query. Moreover, a temporal sub-graph captures the activities within the video through time. Our method is evaluated on three standard benchmark datasets, and we also introduce YouCookII as a new benchmark for this task. Experiments show our method outperforms state-of-the-art methods on these datasets, confirming the effectiveness of our approach.
Published: 2020

50. Conditional Generative Modeling via Learning the Latent Space

Author: Ramasinghe, Sameera, Ranasinghe, Kanchana, Khan, Salman, Barnes, Nick, and Gould, Stephen
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition
Abstract: Although deep learning has achieved appealing results on several machine learning tasks, most of the models are deterministic at inference, limiting their application to single-modal settings. We propose a novel general-purpose framework for conditional generation in multimodal spaces, that uses latent variables to model generalizable learning patterns while minimizing a family of regression cost functions. At inference, the latent variables are optimized to find optimal solutions corresponding to multiple output modes. Compared to existing generative solutions, in multimodal spaces, our approach demonstrates faster and stable convergence, and can learn better representations for downstream tasks. Importantly, it provides a simple generic model that can beat highly engineered pipelines tailored using domain expertise on a variety of tasks, while generating diverse outputs. Our codes will be released.
Published: 2020
