Author: "Zisserman A" / Publisher: arxiv - Searchworks@Jio Institute Digital Library Search Results

1. Multi-Modal Classifiers for Open-Vocabulary Object Detection

Author: Kaul, Prannay, Xie, Weidi, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, I.2.10, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), I.4.6, I.4.8, I.4.9, Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
Abstract: The goal of this paper is open-vocabulary object detection (OVOD) $\unicode{x2013}$ building a model that can detect objects beyond the set of categories seen at training, thus enabling the user to specify categories of interest at inference without the need for model retraining. We adopt a standard two-stage object detector architecture, and explore three ways for specifying novel categories: via language descriptions, via image exemplars, or via a combination of the two. We make three contributions: first, we prompt a large language model (LLM) to generate informative language descriptions for object classes, and construct powerful text-based classifiers; second, we employ a visual aggregator on image exemplars that can ingest any number of images as input, forming vision-based classifiers; and third, we provide a simple method to fuse information from language descriptions and image exemplars, yielding a multi-modal classifier. When evaluating on the challenging LVIS open-vocabulary benchmark we demonstrate that: (i) our text-based classifiers outperform all previous OVOD works; (ii) our vision-based classifiers perform as well as text-based classifiers in prior work; (iii) using multi-modal classifiers perform better than either modality alone; and finally, (iv) our text-based and multi-modal classifiers yield better performance than a fully-supervised detector., Comment: ICML 2023, project page: https://www.robots.ox.ac.uk/vgg/research/mm-ovod/
Published: 2023
Full Text: View/download PDF

2. TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement

Author: Doersch, Carl, Yang, Yi, Vecerik, Mel, Gokay, Dilara, Gupta, Ankush, Aytar, Yusuf, Carreira, Joao, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time. Visualizations, source code, and pretrained models can be found on our project webpage.
Published: 2023
Full Text: View/download PDF

3. Epic-Sounds: A Large-scale Dataset of Actions That Sound

Author: Huh, J, Chalk, J, Kazakos, E, Damen, D, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: We introduce EPIC-SOUNDS, a large-scale dataset of audio annotations capturing temporal extents and class labels within the audio stream of the egocentric videos. We propose an annotation pipeline where annotators temporally label distinguishable audio segments and describe the action that could have caused this sound. We identify actions that can be discriminated purely from audio, through grouping these free-form descriptions of audio into classes. For actions that involve objects colliding, we collect human annotations of the materials of these objects (e.g. a glass object being placed on a wooden surface), which we verify from visual labels, discarding ambiguities. Overall, EPIC-SOUNDS includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments. We train and evaluate two state-of-the-art audio recognition models on our dataset, highlighting the importance of audio-only labels and the limitations of current models to recognise actions that sound., Comment: 6 pages, 4 figures
Published: 2023
Full Text: View/download PDF

4. Verbs in Action: Improving verb understanding in video-language models

Author: Momeni, Liliane, Caron, Mathilde, Nagrani, Arsha, Zisserman, Andrew, and Schmid, Cordelia
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding: video-text matching, video question-answering and video classification. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it.
Published: 2023
Full Text: View/download PDF

5. VoxSRC 2022: The Fourth VoxCeleb Speaker Recognition Challenge

Author: Huh, Jaesung, Brown, Andrew, Jung, Jee-weon, Chung, Joon Son, Nagrani, Arsha, Garcia-Romero, Daniel, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: This paper summarises the findings from the VoxCeleb Speaker Recognition Challenge 2022 (VoxSRC-22), which was held in conjunction with INTERSPEECH 2022. The goal of this challenge was to evaluate how well state-of-the-art speaker recognition systems can diarise and recognise speakers from speech obtained "in the wild". The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and hybrid workshop held at INTERSPEECH 2022. We describe the four tracks of our challenge along with the baselines, methods, and results. We conclude with a discussion on the new domain-transfer focus of VoxSRC-22, and on the progression of the challenge from the previous three editions.
Published: 2023
Full Text: View/download PDF

6. Zorro: the masked multimodal transformer

Author: Recasens, Adrià, Lin, Jason, Carreira, Joāo, Jaegle, Drew, Wang, Luyu, Alayrac, Jean-baptiste, Luc, Pauline, Miech, Antoine, Smaira, Lucas, Hemsley, Ross, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
Published: 2023
Full Text: View/download PDF

7. Three ways to improve feature alignment for open vocabulary detection

Author: Arandjelović, Relja, Andonian, Alex, Mensch, Arthur, Hénaff, Olivier J., Alayrac, Jean-Baptiste, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
Abstract: The core problem in zero-shot open vocabulary detection is how to align visual and text features, so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs thus improving detection performance on classes with no human annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new stateof-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
Published: 2023
Full Text: View/download PDF

8. AutoAD: Movie Description in Context

Author: Han, T, Bain, M, Nagrani, A, Varol, G, Xie, W, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation. In order to obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, as well as the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets, where visual or contextual information is unavailable, e.g. text-only AD without movies or visual captioning datasets without context; (iii) we improve on the currently available AD datasets, by removing label noise in the MAD dataset, and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods., Comment: CVPR2023 Highlight. Project page: https://www.robots.ox.ac.uk/~vgg/research/autoad/
Published: 2023
Full Text: View/download PDF

9. SpineNetV2: Automated detection, labelling and radiological grading of clinical MR scans

Author: Windsor, Rhydian, Jamaludin, Amir, Kadir, Timor, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), Computer Science - Computer Vision and Pattern Recognition, FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Image and Video Processing
Abstract: This technical report presents SpineNetV2, an automated tool which: (i) detects and labels vertebral bodies in clinical spinal magnetic resonance (MR) scans across a range of commonly used sequences; and (ii) performs radiological grading of lumbar intervertebral discs in T2-weighted scans for a range of common degenerative changes. SpineNetV2 improves over the original SpineNet software in two ways: (1) The vertebral body detection stage is significantly faster, more accurate and works across a range of fields-of-view (as opposed to just lumbar scans). (2) Radiological grading adopts a more powerful architecture, adding several new grading schemes without loss in performance. A demo of the software is available at the project website: http://zeus.robots.ox.ac.uk/spinenet2/., Comment: Technical Report, 22 pages, 9 Figures
Published: 2022
Full Text: View/download PDF

10. Weakly-supervised Fingerspelling Recognition in British Sign Language Videos

Author: Prajwal, K R, Bull, Hannah, Momeni, Liliane, Albanie, Samuel, Varol, Gül, Zisserman, Andrew, Visual Geometry Group (VGG), University of Oxford, Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Architectures et Modèles pour l'Interaction (AMI), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Interaction avec l'Humain (IaH), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Information, Langue Ecrite et Signée (ILES), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Sciences et Technologies des Langues (STL), University of Cambridge [UK] (CAM), Laboratoire d'Informatique Gaspard-Monge (LIGM), École des Ponts ParisTech (ENPC)-Centre National de la Recherche Scientifique (CNRS)-Université Gustave Eiffel, and ANR-21-CE23-0003,CorVis,Traduire la langue des signes avec la vision par ordinateur(2021)
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, [INFO]Computer Science [cs]
Abstract: The goal of this work is to detect and recognize sequences of letters signed using fingerspelling in British Sign Language (BSL). Previous fingerspelling recognition methods have not focused on BSL, which has a very different signing alphabet (e.g., two-handed instead of one-handed) to American Sign Language (ASL). They also use manual annotations for training. In contrast to previous methods, our method only uses weak annotations from subtitles for training. We localize potential instances of fingerspelling using a simple feature similarity method, then automatically annotate these instances by querying subtitle words and searching for corresponding mouthing cues from the signer. We propose a Transformer architecture adapted to this task, with a multiple-hypothesis CTC loss function to learn from alternative annotation possibilities. We employ a multi-stage training approach, where we make use of an initial version of our trained model to extend and enhance our training data before re-training again to achieve better performance. Through extensive evaluations, we verify our method for automatic annotation and our model architecture. Moreover, we provide a human expert annotated test set of 5K video clips for evaluating BSL fingerspelling recognition methods to support sign language research., Comment: Appears in: British Machine Vision Conference 2022 (BMVC 2022)
Published: 2022
Full Text: View/download PDF

11. VoxSRC 2021: The Third VoxCeleb Speaker Recognition Challenge

Author: Brown, Andrew, Huh, Jaesung, Chung, Joon Son, Nagrani, Arsha, Garcia-Romero, Daniel, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The third instalment of the VoxCeleb Speaker Recognition Challenge was held in conjunction with Interspeech 2021. The aim of this challenge was to assess how well current speaker recognition technology is able to diarise and recognise speakers in unconstrained or `in the wild' data. The challenge consisted of: (i) the provision of publicly available speaker recognition and diarisation data from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a virtual public challenge and workshop held at Interspeech 2021. This paper outlines the challenge, and describes the baselines, methods and results. We conclude with a discussion on the new multi-lingual focus of VoxSRC 2021, and on the progression of the challenge since the previous two editions., Comment: arXiv admin note: substantial text overlap with arXiv:2012.06867
Published: 2022
Full Text: View/download PDF

12. Turbo Training with Token Dropout

Author: Han, T, Xie, W, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: The objective of this paper is an efficient training method for video tasks. We make three contributions: (1) We propose Turbo training, a simple and versatile training paradigm for Transformers on multiple video tasks. (2) We illustrate the advantages of Turbo training on action classification, video-language representation learning, and long-video activity classification, showing that Turbo training can largely maintain competitive performance while achieving almost 4X speed-up and significantly less memory consumption. (3) Turbo training enables long-schedule video-language training and end-to-end long-video training, delivering competitive or superior performance than previous works, which were infeasible to train under limited resources., Comment: BMVC2022
Published: 2022
Full Text: View/download PDF

13. Is an Object-Centric Video Representation Beneficial for Transfer?

Author: Zhang, Chuhan, Gupta, Ankush, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video, and uses these vectors to fuse the visual and spatio-temporal trajectory 'modalities' of the video clip. We also introduce a novel trajectory contrast loss to further enhance objectness in these summary vectors. With experiments on four datasets -- SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens -- we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware), when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probe to other downstream tasks; as well as (4) for standard action classification., Comment: Accepted to ACCV 2022
Published: 2022
Full Text: View/download PDF

14. Flamingo: a Visual Language Model for Few-Shot Learning

Author: Alayrac, Jean-Baptiste, Donahue, Jeff, Luc, Pauline, Miech, Antoine, Barr, Iain, Hasson, Yana, Lenc, Karel, Mensch, Arthur, Millican, Katie, Reynolds, Malcolm, Ring, Roman, Rutherford, Eliza, Cabi, Serkan, Han, Tengda, Gong, Zhitao, Samangooei, Sina, Monteiro, Marianne, Menick, Jacob, Borgeaud, Sebastian, Brock, Andrew, Nematzadeh, Aida, Sharifzadeh, Sahand, Binkowski, Mikolaj, Barreira, Ricardo, Vinyals, Oriol, Zisserman, Andrew, and Simonyan, Karen
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
Abstract: Building models that can be rapidly adapted to novel tasks using only a handful of annotated examples is an open challenge for multimodal machine learning research. We introduce Flamingo, a family of Visual Language Models (VLM) with this ability. We propose key architectural innovations to: (i) bridge powerful pretrained vision-only and language-only models, (ii) handle sequences of arbitrarily interleaved visual and textual data, and (iii) seamlessly ingest images or videos as inputs. Thanks to their flexibility, Flamingo models can be trained on large-scale multimodal web corpora containing arbitrarily interleaved text and images, which is key to endow them with in-context few-shot learning capabilities. We perform a thorough evaluation of our models, exploring and measuring their ability to rapidly adapt to a variety of image and video tasks. These include open-ended tasks such as visual question-answering, where the model is prompted with a question which it has to answer; captioning tasks, which evaluate the ability to describe a scene or an event; and close-ended tasks such as multiple-choice visual question-answering. For tasks lying anywhere on this spectrum, a single Flamingo model can achieve a new state of the art with few-shot learning, simply by prompting the model with task-specific examples. On numerous benchmarks, Flamingo outperforms models fine-tuned on thousands of times more task-specific data., Comment: 54 pages. In Proceedings of Neural Information Processing Systems (NeurIPS) 2022
Published: 2022
Full Text: View/download PDF

15. HiP: Hierarchical Perceiver

Author: Carreira, Joao, Koppula, Skanda, Zoran, Daniel, Recasens, Adria, Ionescu, Catalin, Henaff, Olivier, Shelhamer, Evan, Arandjelovic, Relja, Botvinick, Matt, Vinyals, Oriol, Simonyan, Karen, Zisserman, Andrew, and Jaegle, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: General perception systems such as Perceivers can process arbitrary modalities in any combination and are able to handle up to a few hundred thousand inputs. They achieve this generality by using exclusively global attention operations. This however hinders them from scaling up to the inputs sizes required to process raw high-resolution images or video. In this paper, we show that some degree of locality can be introduced back into these models, greatly improving their efficiency while preserving their generality. To scale them further, we introduce a self-supervised approach that enables learning dense low-dimensional positional embeddings for very large signals. We call the resulting model a Hierarchical Perceiver (HiP). In sum our contributions are: 1) scaling Perceiver-type models to raw high-resolution images and audio+video, 2) showing the feasibility of learning 1M+ positional embeddings from scratch using masked auto-encoding, 3) demonstrating competitive performance on raw data from ImageNet, AudioSet, PASCAL VOC, ModelNet40 and Kinetics datasets with the same exact, unchanged model and without specialized preprocessing or any tokenization.
Published: 2022
Full Text: View/download PDF

16. Omnimatte: Associating Objects and Their Effects in Video

Author: William T. Freeman, Andrew Zisserman, Erika Lu, Forrester Cole, Michael Rubinstein, and Tali Dekel
Subjects: FOS: Computer and information sciences, Color image, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Ranging, Image segmentation, Alpha (programming language), Pattern recognition (psychology), Segmentation, Computer vision, Artificial intelligence, Reflection (computer graphics), business
Abstract: Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects -- shadows, reflections, generated smoke, etc -- are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject -- an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic -- it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject., Comment: CVPR 2021 Oral. Project webpage: https://omnimatte.github.io/. Added references
Published: 2021
Full Text: View/download PDF

17. Visual Keyword Spotting with Attention

Author: Prajwal, KR, Momeni, L, Afouras, T, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Computer Vision and Pattern Recognition (cs.CV), InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, Computer Science - Computer Vision and Pattern Recognition, Computation and Language (cs.CL)
Abstract: In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos., Comment: Appears in: British Machine Vision Conference 2021 (BMVC 2021)
Published: 2021
Full Text: View/download PDF

18. Open-Set Recognition: a Good Closed-Set Classifier is All You Need?

Author: Vaze, S, Han, K, Vedaldi, A, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
Abstract: The ability to identify whether or not a test sample belongs to one of the semantic classes in a classifier's training set is critical to practical deployment of the model. This task is termed open-set recognition (OSR) and has received significant attention in recent years. In this paper, we first demonstrate that the ability of a classifier to make the 'none-of-above' decision is highly correlated with its accuracy on the closed-set classes. We find that this relationship holds across loss objectives and architectures, and further demonstrate the trend both on the standard OSR benchmarks as well as on a large-scale ImageNet evaluation. Second, we use this correlation to boost the performance of a maximum logit score OSR 'baseline' by improving its closed-set accuracy, and with this strong baseline achieve state-of-the-art on a number of OSR benchmarks. Similarly, we boost the performance of the existing state-of-the-art method by improving its closed-set accuracy, but the resulting discrepancy with the strong baseline is marginal. Our third contribution is to present the 'Semantic Shift Benchmark' (SSB), which better respects the task of detecting semantic novelty, in contrast to other forms of distribution shift also considered in related sub-fields, such as out-of-distribution detection. On this new evaluation, we again demonstrate that there is negligible difference between the strong baseline and the existing state-of-the-art. Project Page: https://www.robots.ox.ac.uk/~vgg/research/osr/, Comment: ICLR 22 Oral. Changes from pre-print highlighted on Github page
Published: 2021
Full Text: View/download PDF

19. Comment on Stochastic Polyak Step-Size: Performance of ALI-G

Author: Berrada, Leonard, Zisserman, Andrew, and Kumar, M. Pawan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: This is a short note on the performance of the ALI-G algorithm (Berrada et al., 2020) as reported in (Loizou et al., 2021). ALI-G (Berrada et al., 2020) and SPS (Loizou et al., 2021) are both adaptations of the Polyak step-size to optimize machine learning models that can interpolate the training data. The main algorithmic differences are that (1) SPS employs a multiplicative constant in the denominator of the learning-rate while ALI-G uses an additive constant, and (2) SPS uses an iteration-dependent maximal learning-rate while ALI-G uses a constant one. There are also differences in the analysis provided by the two works, with less restrictive assumptions proposed in (Loizou et al., 2021). In their experiments, (Loizou et al., 2021) did not use momentum for ALI-G (which is a standard part of the algorithm) or standard hyper-parameter tuning (for e.g. learning-rate and regularization). Hence this note as a reference for the improved performance that ALI-G can obtain with well-chosen hyper-parameters. In particular, we show that when training a ResNet-34 on CIFAR-10 and CIFAR-100, the performance of ALI-G can reach respectively 93.5% (+6%) and 76% (+8%) with a very small amount of tuning. Thus ALI-G remains a very competitive method for training interpolating neural networks.
Published: 2021
Full Text: View/download PDF

20. Perceiver IO: A General Architecture for Structured Inputs & Outputs

Author: Jaegle, Andrew, Borgeaud, Sebastian, Alayrac, Jean-Baptiste, Doersch, Carl, Ionescu, Catalin, Ding, David, Koppula, Skanda, Zoran, Daniel, Brock, Andrew, Shelhamer, Evan, H��naff, Olivier, Botvinick, Matthew M., Zisserman, Andrew, Vinyals, Oriol, and Carreira, Jo��o
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Audio and Speech Processing (eess.AS), Computer Vision and Pattern Recognition (cs.CV), FOS: Electrical engineering, electronic engineering, information engineering, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: A central goal of machine learning is the development of systems that can solve many problems in as many data domains as possible. Current architectures, however, cannot be applied beyond a small set of stereotyped settings, as they bake in domain & task assumptions or scale poorly to large inputs or outputs. In this work, we propose Perceiver IO, a general-purpose architecture that handles data from arbitrary settings while scaling linearly with the size of inputs and outputs. Our model augments the Perceiver with a flexible querying mechanism that enables outputs of various sizes and semantics, doing away with the need for task-specific architecture engineering. The same architecture achieves strong results on tasks spanning natural language and visual understanding, multi-task and multi-modal reasoning, and StarCraft II. As highlights, Perceiver IO outperforms a Transformer-based BERT baseline on the GLUE language benchmark despite removing input tokenization and achieves state-of-the-art performance on Sintel optical flow estimation with no explicit mechanisms for multiscale correspondence., ICLR 2022 camera ready. Code: https://dpmd.ai/perceiver-code
Published: 2021
Full Text: View/download PDF

21. Self-supervised Video Object Segmentation by Motion Grouping

Author: Yang, C, Lamdouar, H, Lu, E, Zisserman, A, and Xie, W
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Machine Learning (cs.LG)
Abstract: Animals have evolved highly functional visual systems to understand motion, assisting perception even under complex environments. In this paper, we work towards developing a computer vision system able to segment objects by exploiting motion cues, i.e. motion segmentation. We make the following contributions: First, we introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background. Second, we train the architecture in a self-supervised manner, i.e. without using any manual annotations. Third, we analyze several critical components of our method and conduct thorough ablation studies to validate their necessity. Fourth, we evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59). Despite using only optical flow as input, our approach achieves superior or comparable results to previous state-of-the-art self-supervised methods, while being an order of magnitude faster. We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming the other self-supervised approaches, and comparing favourably to the top supervised approach, highlighting the importance of motion cues, and the potential bias towards visual appearance in existing video segmentation models., Comment: Best Paper in CVPR2021 RVSU Workshop. Accepted by ICCV
Published: 2021
Full Text: View/download PDF

22. BBC-Oxford British Sign Language Dataset

Author: Albanie, Samuel, Varol, Gül, Momeni, Liliane, Bull, Hannah, Afouras, Triantafyllos, Chowdhury, Himel, Fox, Neil, Woll, Bencie, Cooper, Rob, McParland, Andrew, Zisserman, Andrew, University of Cambridge [UK] (CAM), Laboratoire d'Informatique Gaspard-Monge (LIGM), École des Ponts ParisTech (ENPC)-Centre National de la Recherche Scientifique (CNRS)-Université Gustave Eiffel, Visual Geometry Group (VGG), University of Oxford [Oxford], Laboratoire Interdisciplinaire des Sciences du Numérique (LISN), CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Architecture et Modèles pour l'Interaction (AMI), Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur (LIMSI), Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Deafness, Cognition and Language Research Centre (DCAL), University College of London [London] (UCL), BBC Research & Development, BBC, University of Oxford, Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS), Architectures et Modèles pour l'Interaction (AMI), Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Interaction avec l'Humain (IaH), and Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-CentraleSupélec-Université Paris-Saclay-Centre National de la Recherche Scientifique (CNRS)
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV]
Abstract: In this work, we introduce the BBC-Oxford British Sign Language (BOBSL) dataset, a large-scale video collection of British Sign Language (BSL). BOBSL is an extended and publicly released dataset based on the BSL-1K dataset introduced in previous work. We describe the motivation for the dataset, together with statistics and available annotations. We conduct experiments to provide baselines for the tasks of sign recognition, sign language alignment, and sign language translation. Finally, we describe several strengths and limitations of the data from the perspectives of machine learning and linguistics, note sources of bias present in the dataset, and discuss potential applications of BOBSL in the context of sign language technology. The dataset is available at https://www.robots.ox.ac.uk/~vgg/data/bobsl/.
Published: 2021
Full Text: View/download PDF

23. NeRF in detail: Learning to sample for view synthesis

Author: Arandjelović, Relja and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Graphics, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Graphics (cs.GR), Machine Learning (cs.LG)
Abstract: Neural radiance fields (NeRF) methods have demonstrated impressive novel view synthesis performance. The core approach is to render individual rays by querying a neural network at points sampled along the ray to obtain the density and colour of the sampled points, and integrating this information using the rendering equation. Since dense sampling is computationally prohibitive, a common solution is to perform coarse-to-fine sampling. In this work we address a clear limitation of the vanilla coarse-to-fine approach -- that it is based on a heuristic and not trained end-to-end for the task at hand. We introduce a differentiable module that learns to propose samples and their importance for the fine network, and consider and compare multiple alternatives for its neural architecture. Training the proposal module from scratch can be unstable due to lack of supervision, so an effective pre-training strategy is also put forward. The approach, named `NeRF in detail' (NeRF-ID), achieves superior view synthesis quality over NeRF and the state-of-the-art on the synthetic Blender benchmark and on par or better performance on the real LLFF-NeRF scenes. Furthermore, by leveraging the predicted sample importance, a 25% saving in computation can be achieved without significantly sacrificing the rendering quality.
Published: 2021
Full Text: View/download PDF

24. The End-of-End-to-End: A Video Understanding Pentathlon Challenge (2020)

Author: Albanie, Samuel, Liu, Yang, Nagrani, Arsha, Miech, Antoine, Coto, Ernesto, Laptev, Ivan, Sukthankar, Rahul, Ghanem, Bernard, Zisserman, Andrew, Gabeur, Valentin, Sun, Chen, Alahari, Karteek, Schmid, Cordelia, Chen, Shizhe, Zhao, Yida, Jin, Qin, Cui, Kaixu, Liu, Hui, Wang, Chen, Jiang, Yudong, and Hao, Xiaoshuai
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We present a new video understanding pentathlon challenge, an open competition held in conjunction with the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020. The objective of the challenge was to explore and evaluate new methods for text-to-video retrieval-the task of searching for content within a corpus of videos using natural language queries. This report summarizes the results of the first edition of the challenge together with the findings of the participants., Comment: Individual reports, dataset information, rules, and released source code can be found at the competition webpage (https://www.robots.ox.ac.uk/~vgg/challenges/video-pentathlon)
Published: 2020
Full Text: View/download PDF

25. Self-Supervised MultiModal Versatile Networks

Author: Alayrac, Jean-Baptiste, Recasens, Adrià, Schneider, Rosalia, Arandjelović, Relja, Ramapuram, Jason, De Fauw, Jeffrey, Smaira, Lucas, Dieleman, Sander, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available., Comment: To appear in the Thirty-Fourth Annual Conference on Neural Information Processing Systems (NeurIPS 2020)
Published: 2020
Full Text: View/download PDF

26. Self-supervised Co-training for Video Representation Learning

Author: Han, T, Xie, W, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: The objective of this paper is visual-only self-supervised video representation learning. We make the following contributions: (i) we investigate the benefit of adding semantic-class positives to instance-based Info Noise Contrastive Estimation (InfoNCE) training, showing that this form of supervised contrastive learning leads to a clear improvement in performance; (ii) we propose a novel self-supervised co-training scheme to improve the popular infoNCE loss, exploiting the complementary information from different views, RGB streams and optical flow, of the same data source by using one view to obtain positive class samples for the other; (iii) we thoroughly evaluate the quality of the learnt representation on two different downstream tasks: action recognition and video retrieval. In both cases, the proposed approach demonstrates state-of-the-art or comparable performance with other self-supervised approaches, whilst being significantly more efficient to train, i.e. requiring far less training data to achieve similar performance., Comment: NeurIPS2020
Published: 2020
Full Text: View/download PDF

27. QuerYD: A video dataset with high-quality text and audio narrations

Author: João F. Henriques, Andreea-Maria Oncescu, Andrew Zisserman, Yang Liu, and Samuel Albanie
Subjects: FOS: Computer and information sciences, Vocabulary, Event (computing), business.industry, Computer science, media_common.quotation_subject, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Audio description, computer.software_genre, Code (semiotics), Visualization, Feature (computer vision), Benchmark (surveying), Artificial intelligence, business, computer, Natural language, Natural language processing, media_common
Abstract: We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video. A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content. The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos. This ever-growing collection of videos contains highly detailed, temporally aligned audio and text annotations. The content descriptions are more relevant than dialogue, and more detailed than previous description attempts, which can be observed to contain many superficial or uninformative descriptions. To demonstrate the utility of the QuerYD dataset, we show that it can be used to train and benchmark strong models for retrieval and event localisation. Data, code and models are made publicly available, and we hope that QuerYD inspires further research on video understanding with written and spoken natural language., Comment: 5 pages, 4 figures, accepted at ICASSP 2021
Published: 2020
Full Text: View/download PDF

28. Speech2Action: Cross-modal Supervision for Action Recognition

Author: David A. Ross, Chen Sun, Cordelia Schmid, Arsha Nagrani, Rahul Sukthankar, and Andrew Zisserman
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Speech recognition, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Modal, ComputingMethodologies_PATTERNRECOGNITION, 020204 information systems, 0202 electrical engineering, electronic engineering, information engineering, Action recognition, 020201 artificial intelligence & image processing, Artificial intelligence, business, Classifier (UML)
Abstract: Is it possible to guess human action from dialogue alone? In this work we investigate the link between spoken words and actions in movies. We note that movie screenplays describe actions, as well as contain the speech of characters and hence can be used to learn this correlation with no additional supervision. We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments. We then apply this model to the speech segments of a large unlabelled movie corpus (188M speech segments from 288K movies). Using the predictions of this model, we obtain weak action labels for over 800K video clips. By training on these video clips, we demonstrate superior action recognition performance on standard action recognition benchmarks, without using a single manually labelled action example., Comment: Accepted to CVPR 2020
Published: 2020
Full Text: View/download PDF

29. Inducing Predictive Uncertainty Estimation for Face Recognition

Author: Xie, W, Byrne, J, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (cs.LG)
Abstract: Knowing when an output can be trusted is critical for reliably using face recognition systems. While there has been enormous effort in recent research on improving face verification performance, understanding when a model's predictions should or should not be trusted has received far less attention. Our goal is to assign a confidence score for a face image that reflects its quality in terms of recognizable information. To this end, we propose a method for generating image quality training data automatically from 'mated-pairs' of face images, and use the generated data to train a lightweight Predictive Confidence Network, termed as PCNet, for estimating the confidence score of a face image. We systematically evaluate the usefulness of PCNet with its error versus reject performance, and demonstrate that it can be universally paired with and improve the robustness of any verification model. We describe three use cases on the public IJB-C face verification benchmark: (i) to improve 1:1 image-based verification error rates by rejecting low-quality face images; (ii) to improve quality score based fusion performance on the 1:1 set-based verification benchmark; and (iii) its use as a quality measure for selecting high quality (unblurred, good lighting, more frontal) faces from a collection, e.g. for automatic enrolment or display., Comment: To Appear at the British Machine Vision Conference (BMVC), 2020
Published: 2020
Full Text: View/download PDF

30. CrossTransformers: spatially-aware few-shot transfer

Author: Doersch, Carl, Gupta, Ankush, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Given new tasks with very little data$-$such as new classes in a classification problem or a domain shift in the input$-$performance of modern vision systems degrades remarkably quickly. In this work, we illustrate how the neural network representations which underpin modern vision systems are subject to supervision collapse, whereby they lose any information that is not necessary for performing the training task, including information that may be necessary for transfer to new tasks or domains. We then propose two methods to mitigate this problem. First, we employ self-supervised learning to encourage general-purpose features that transfer better. Second, we propose a novel Transformer based neural network architecture called CrossTransformers, which can take a small number of labeled images and an unlabeled query, find coarse spatial correspondence between the query and the labeled images, and then infer class membership by computing distances between spatially-corresponding features. The result is a classifier that is more robust to task and domain shift, which we demonstrate via state-of-the-art performance on Meta-Dataset, a recent dataset for evaluating transfer from ImageNet to many other vision datasets., Comment: Published at NeurIPS 2020. Code/checkpoints: https://github.com/google-research/meta-dataset
Published: 2020
Full Text: View/download PDF

31. The AVA-Kinetics Localized Human Actions Video Dataset

Author: Li, Ang, Thotakuri, Meghana, Ross, David A., Carreira, João, Vostrikov, Alexander, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, ComputingMethodologies_PATTERNRECOGNITION, Computer Vision and Pattern Recognition (cs.CV), Image and Video Processing (eess.IV), Computer Science - Computer Vision and Pattern Recognition, FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Image and Video Processing, Machine Learning (cs.LG)
Abstract: This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe the annotation process and provide statistics about the new dataset. We also include a baseline evaluation using the Video Action Transformer Network on the AVA-Kinetics dataset, demonstrating improved performance for action classification on the AVA test set. The dataset can be downloaded from https://research.google.com/ava/, Comment: 8 pages, 8 figures
Published: 2020
Full Text: View/download PDF

32. Seeing wake words: Audio-visual Keyword Spotting

Author: Momeni, L, Afouras, T, Stafylakis, T, Albanie, S, and Zisserman, A
Subjects: FOS: Computer and information sciences, Audio and Speech Processing (eess.AS), Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, FOS: Electrical engineering, electronic engineering, information engineering, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The goal of this work is to automatically determine whether and when a word of interest is spoken by a talking face, with or without the audio. We propose a zero-shot method suitable for ‘in the wild’ videos. Our key contributions are: (1) a novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into (i) sequence matching, and (ii) pattern detection, to decide whether the word is there and when; (2) we demonstrate that if audio is available, visual keyword spotting improves the performance both for a clean and noisy audio signal. Finally, (3) we show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data, by fine-tuning the network pre-trained on English. The method exceeds the performance of the previous state-of-the-art visual keyword spotting architecture when trained and tested on the same benchmark, and also that of a state-of-the-art lip reading method.
Published: 2020
Full Text: View/download PDF

33. VoxSRC 2019: The first VoxCeleb Speaker Recognition Challenge

Author: Chung, Joon Son, Nagrani, Arsha, Coto, Ernesto, Xie, Weidi, McLaren, Mitchell, Reynolds, Douglas A, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Statistics - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Machine Learning (stat.ML), Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: The VoxCeleb Speaker Recognition Challenge 2019 aimed to assess how well current speaker recognition technology is able to identify speakers in unconstrained or `in the wild' data. It consisted of: (i) a publicly available speaker recognition dataset from YouTube videos together with ground truth annotation and standardised evaluation software; and (ii) a public challenge and workshop held at Interspeech 2019 in Graz, Austria. This paper outlines the challenge and provides its baselines, results and discussions., Comment: ISCA Archive
Published: 2019
Full Text: View/download PDF

34. Semi-Supervised Learning with Scarce Annotations

Author: Andrea Vedaldi, Sebastien Ehrhardt, Andrew Zisserman, Sylvestre-Alvise Rebuffi, and Kai Han
Subjects: FOS: Computer and information sciences, Computer science, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), 02 engineering and technology, Semi-supervised learning, Overfitting, Machine learning, computer.software_genre, 03 medical and health sciences, 0302 clinical medicine, Statistics - Machine Learning, 0202 electrical engineering, electronic engineering, information engineering, Leverage (statistics), Representation (mathematics), business.industry, Bootstrapping, Data point, ComputingMethodologies_PATTERNRECOGNITION, Key (cryptography), 020201 artificial intelligence & image processing, Artificial intelligence, Transfer of learning, business, computer, 030217 neurology & neurosurgery
Abstract: While semi-supervised learning (SSL) algorithms provide an efficient way to make use of both labelled and unlabelled data, they generally struggle when the number of annotated samples is very small. In this work, we consider the problem of SSL multi-class classification with very few labelled instances. We introduce two key ideas. The first is a simple but effective one: we leverage the power of transfer learning among different tasks and self-supervision to initialize a good representation of the data without making use of any label. The second idea is a new algorithm for SSL that can exploit well such a pre-trained representation. The algorithm works by alternating two phases, one fitting the labelled points and one fitting the unlabelled ones, with carefully-controlled information flow between them. The benefits are greatly reducing overfitting of the labelled data and avoiding issue with balancing labelled and unlabelled losses during training. We show empirically that this method can successfully train competitive models with as few as 10 labelled data points per class. More in general, we show that the idea of bootstrapping features using self-supervised learning always improves SSL on standard benchmarks. We show that our algorithm works increasingly well compared to other methods when refining from other tasks or datasets., Comment: Workshop on Deep Vision, CVPR 2020
Published: 2019
Full Text: View/download PDF

35. Sim2real transfer learning for 3D human pose estimation: motion to the rescue

Author: Doersch, C and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person's motion, notably as optical flow and the motion of 2D keypoints. Therefore, our results suggest that motion can be a simple way to bridge a sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern benchmark for 3D pose estimation, where we show full 3D mesh recovery that is on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset., Comment: Accepted at NeurIPS 2019
Published: 2019
Full Text: View/download PDF

36. Object Discovery with a Copy-Pasting GAN

Author: Arandjelović, Relja and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Machine Learning (cs.LG)
Abstract: We tackle the problem of object discovery, where objects are segmented for a given input image, and the system is trained without using any direct supervision whatsoever. A novel copy-pasting GAN framework is proposed, where the generator learns to discover an object in one image by compositing it into another image such that the discriminator cannot tell that the resulting image is fake. After carefully addressing subtle issues, such as preventing the generator from `cheating', this game results in the generator learning to select objects, as copy-pasting objects is most likely to fool the discriminator. The system is shown to work well on four very different datasets, including large object appearance variations in challenging cluttered backgrounds.
Published: 2019
Full Text: View/download PDF

37. Unsupervised Learning of Object Keypoints for Perception and Control

Author: Kulkarni, T, Gupta, A, Ionescu, C, Borgeaud, S, Reynolds, M, Zisserman, A, and Mnih, V
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Machine Learning (cs.LG)
Abstract: The study of object representations in computer vision has primarily focused on developing representations that are useful for image classification, object detection, or semantic segmentation as downstream tasks. In this work we aim to learn object representations that are useful for control and reinforcement learning (RL). To this end, we introduce Transporter, a neural network architecture for discovering concise geometric object representations in terms of keypoints or image-space coordinates. Our method learns from raw video frames in a fully unsupervised manner, by transporting learnt image features between video frames using a keypoint bottleneck. The discovered keypoints track objects and object parts across long time-horizons more accurately than recent similar methods. Furthermore, consistent long-term tracking enables two notable results in control domains -- (1) using the keypoint co-ordinates and corresponding image features as inputs enables highly sample-efficient reinforcement learning; (2) learning to explore by controlling keypoint locations drastically reduces the search space, enabling deep exploration (leading to states unreachable through random action exploration) without any extrinsic rewards., Comment: In NeurIPS 2019. Code https://github.com/deepmind/deepmind-research/tree/master/transporter
Published: 2019
Full Text: View/download PDF

38. My lips are concealed: Audio-visual speech enhancement through obstructions

Author: Triantafyllos Afouras, Andrew Zisserman, and Joon Son Chung
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Speech recognition, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Representation (systemics), 020206 networking & telecommunications, 02 engineering and technology, Visual modality, Computer Science - Sound, Speech enhancement, Background noise, 030507 speech-language pathology & audiology, 03 medical and health sciences, Audio and Speech Processing (eess.AS), Audio visual, 0202 electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, Mouth region, 0305 other medical science, Sensory cue, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Our objective is an audio-visual model for separating a single speaker from a mixture of sounds such as other speakers and background noise. Moreover, we wish to hear the speaker even when the visual cues are temporarily absent due to occlusion. To this end we introduce a deep audio-visual speech enhancement network that is able to separate a speaker's voice by conditioning on both the speaker's lip movements and/or a representation of their voice. The voice representation can be obtained by either (i) enrollment, or (ii) by self-enrollment -- learning the representation on-the-fly given sufficient unobstructed visual input. The model is trained by blending audios, and by introducing artificial occlusions around the mouth region that prevent the visual modality from dominating. The method is speaker-independent, and we demonstrate it on real examples of speakers unheard (and unseen) during training. The method also improves over previous models in particular for cases of occlusion in the visual modality., Comment: Accepted to Interspeech 2019
Published: 2019
Full Text: View/download PDF

39. Count, Crop and Recognise: Fine-Grained Recognition in the Wild

Author: Andrew Zisserman, Arsha Nagrani, Max Bain, and Daniel Schofield
Subjects: FOS: Computer and information sciences, business.industry, Computer science, Computer Vision and Pattern Recognition (cs.CV), Frame (networking), Computer Science - Computer Vision and Pattern Recognition, Pattern recognition, Facial recognition system, Task (project management), ComputingMethodologies_PATTERNRECOGNITION, Face (geometry), Labelling, Artificial intelligence, business
Abstract: The goal of this paper is to label all the animal individuals present in every frame of a video. Unlike previous methods that have principally concentrated on labelling face tracks, we aim to label individuals even when their faces are not visible. We make the following contributions: (i) we introduce a 'Count, Crop and Recognise' (CCR) multi-stage recognition process for frame level labelling. The Count and Recognise stages involve specialised CNNs for the task, and we show that this simple staging gives a substantial boost in performance; (ii) we compare the recall using frame based labelling to both face and body track based labelling, and demonstrate the advantage of frame based with CCR for the specified goal; (iii) we introduce a new dataset for chimpanzee recognition in the wild; and (iv) we apply a high-granularity visualisation technique to further understand the learned CNN features for the recognition of chimpanzee individuals.
Published: 2019
Full Text: View/download PDF

40. Exploiting temporal context for 3D human pose estimation in the wild

Author: Andrew Zisserman, Carl Doersch, and Anurag Arnab
Subjects: FOS: Computer and information sciences, Sequence, Monocular, Computer science, business.industry, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 020207 software engineering, 02 engineering and technology, Face (geometry), 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, Computer vision, Artificial intelligence, business, Pose, Gesture
Abstract: We present a bundle-adjustment-based algorithm for recovering accurate 3D human pose and meshes from monocular videos. Unlike previous algorithms which operate on single frames, we show that reconstructing a person over an entire sequence gives extra constraints that can resolve ambiguities. This is because videos often give multiple views of a person, yet the overall body shape does not change and 3D positions vary slowly. Our method improves not only on standard mocap-based datasets like Human 3.6M -- where we show quantitative improvements -- but also on challenging in-the-wild datasets such as Kinetics. Building upon our algorithm, we present a new dataset of more than 3 million frames of YouTube videos from Kinetics with automatically generated 3D poses and meshes. We show that retraining a single-frame 3D pose estimator on this data improves accuracy on both real-world and mocap data by evaluating on the 3DPW and HumanEVA datasets., Comment: CVPR 2019
Published: 2019
Full Text: View/download PDF

41. VoxCeleb2: Deep Speaker Recognition

Author: Joon Son Chung, Arsha Nagrani, and Andrew Zisserman
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Speech recognition, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, Speaker recognition, Pipeline (software), Convolutional neural network, Computer Science - Sound, 030507 speech-language pathology & audiology, 03 medical and health sciences, Margin (machine learning), Audio and Speech Processing (eess.AS), 0202 electrical engineering, electronic engineering, information engineering, Benchmark (computing), Key (cryptography), FOS: Electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 0305 other medical science, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The objective of this paper is speaker recognition under noisy and unconstrained conditions. We make two key contributions. First, we introduce a very large-scale audio-visual speaker recognition dataset collected from open-source media. Using a fully automated pipeline, we curate VoxCeleb2 which contains over a million utterances from over 6,000 speakers. This is several times larger than any publicly available speaker recognition dataset. Second, we develop and compare Convolutional Neural Network (CNN) models and training strategies that can effectively recognise identities from voice under various conditions. The models trained on the VoxCeleb2 dataset surpass the performance of previous works on a benchmark dataset by a significant margin., Comment: To appear in Interspeech 2018. The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb2 . 1806.05622v2: minor fixes; 5 pages
Published: 2018
Full Text: View/download PDF

42. Learning to Navigate in Cities Without a Map

Author: Mirowski, P., Grimes, M. K., Mateusz Malinowski, Hermann, K. M., Anderson, K., Teplyashin, D., Simonyan, K., Kavukcuoglu, K., Zisserman, A., and Hadsell, R.
Subjects: FOS: Computer and information sciences, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence
Abstract: Navigating through unstructured environments is a basic capability of intelligent creatures, and thus is of fundamental interest in the study and development of artificial intelligence. Long-range navigation is a complex cognitive task that relies on developing an internal representation of space, grounded by recognisable landmarks and robust visual processing, that can simultaneously support continuous self-localisation ("I am here") and a representation of the goal ("I am going there"). Building upon recent research that applies deep reinforcement learning to maze navigation problems, we present an end-to-end deep reinforcement learning approach that can be applied on a city scale. Recognising that successful navigation relies on integration of general policies with locale-specific knowledge, we propose a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities. We present an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage, and demonstrate that our learning method allows agents to learn to navigate multiple cities and to traverse to target destinations that may be kilometres away. The project webpage http://streetlearn.cc contains a video summarising our research and showing the trained agent in diverse city environments and on the transfer task, the form to request the StreetLearn dataset and links to further resources. The StreetLearn environment code is available at https://github.com/deepmind/streetlearn, Comment: 17 pages, 16 figures, published at NeurIPS 2018
Published: 2018
Full Text: View/download PDF

43. Comparator Networks

Author: Xie, W, Shen, L, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 0202 electrical engineering, electronic engineering, information engineering, 020201 artificial intelligence & image processing, 02 engineering and technology, Machine Learning (cs.LG)
Abstract: The objective of this work is set-based verification, e.g. to decide if two sets of images of a face are of the same person or not. The traditional approach to this problem is to learn to generate a feature vector per image, aggregate them into one vector to represent the set, and then compute the cosine similarity between sets. Instead, we design a neural network architecture that can directly learn set-wise verification. Our contributions are: (i) We propose a Deep Comparator Network (DCN) that can ingest a pair of sets (each may contain a variable number of images) as inputs, and compute a similarity between the pair--this involves attending to multiple discriminative local regions (landmarks), and comparing local descriptors between pairs of faces; (ii) To encourage high-quality representations for each set, internal competition is introduced for recalibration based on the landmark score; (iii) Inspired by image retrieval, a novel hard sample mining regime is proposed to control the sampling process, such that the DCN is complementary to the standard image classification models. Evaluations on the IARPA Janus face recognition benchmarks show that the comparator networks outperform the previous state-of-the-art results by a large margin., Comment: To appear in ECCV 2018
Published: 2018
Full Text: View/download PDF

44. Self-supervised learning of a facial attribute embedding from video

Author: Wiles, O, Koepke, A, and Zisserman, A
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: We propose a self-supervised framework for learning facial attributes by simply watching videos of a human face speaking, laughing, and moving over time. To perform this task, we introduce a network, Facial Attributes-Net (FAb-Net), that is trained to embed multiple frames from the same video face-track into a common low-dimensional space. With this approach, we make three contributions: first, we show that the network can leverage information from multiple source frames by predicting confidence/attention masks for each frame; second, we demonstrate that using a curriculum learning regime improves the learned embedding; finally, we demonstrate that the network learns a meaningful face embedding that encodes information about head pose, facial landmarks and facial expression, i.e. facial attributes, without having been supervised with any labelled data. We are comparable or superior to state-of-the-art self-supervised methods on these tasks and approach the performance of supervised methods., Comment: To appear in BMVC 2018. Supplementary material can be found at http://www.robots.ox.ac.uk/~vgg/research/unsup_learn_watch_faces/fabnet.html
Published: 2018
Full Text: View/download PDF

45. LRS3-TED: a large-scale dataset for visual speech recognition

Author: Afouras, Triantafyllos, Chung, Joon Son, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale compared to other public datasets that are available for general research., Comment: The audio-visual dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/lip_reading/
Published: 2018
Full Text: View/download PDF

46. The Visual Centrifuge: Model-Free Layered Video Representations

Author: Andrew Zisserman, Jean-Baptiste Alayrac, and Joao Carreira
Subjects: FOS: Computer and information sciences, Color constancy, Computer science, business.industry, Deep learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION, Object (computer science), Motion (physics), Computational photography, Computer vision, Artificial intelligence, Image sensor, business, Representation (mathematics), Feature learning, ComputingMethodologies_COMPUTERGRAPHICS
Abstract: True video understanding requires making sense of non-lambertian scenes where the color of light arriving at the camera sensor encodes information about not just the last object it collided with, but about multiple mediums -- colored windows, dirty mirrors, smoke or rain. Layered video representations have the potential of accurately modelling realistic scenes but have so far required stringent assumptions on motion, lighting and shape. Here we propose a learning-based approach for multi-layered video representation: we introduce novel uncertainty-capturing 3D convolutional architectures and train them to separate blended videos. We show that these models then generalize to single videos, where they exhibit interesting abilities: color constancy, factoring out shadows and separating reflections. We present quantitative and qualitative results on real world videos., Comment: Appears in: 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019). This arXiv contains the CVPR Camera Ready version of the paper (although we have included larger figures) as well as an appendix detailing the model architecture
Published: 2018
Full Text: View/download PDF

47. Seeing Voices and Hearing Faces: Cross-modal biometric matching

Author: Samuel Albanie, Arsha Nagrani, and Andrew Zisserman
Subjects: FOS: Computer and information sciences, Matching (statistics), Biometrics, Computer science, business.industry, Speech recognition, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 02 engineering and technology, 010501 environmental sciences, 01 natural sciences, Facial recognition system, Task (project management), Modal, Face (geometry), 0202 electrical engineering, electronic engineering, information engineering, Task analysis, 020201 artificial intelligence & image processing, Artificial intelligence, business, 0105 earth and related environmental sciences
Abstract: We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face and vice versa? We study this task "in the wild", employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. different gender across faces) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age and nationality)., Comment: To appear in: IEEE Computer Vision and Pattern Recognition (CVPR), 2018
Published: 2018
Full Text: View/download PDF

48. VoxCeleb: a large-scale speaker identification dataset

Author: Joon Son Chung, Andrew Zisserman, and Arsha Nagrani
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer science, Speech recognition, 020206 networking & telecommunications, 02 engineering and technology, Convolutional neural network, Facial recognition system, Pipeline (software), Computer Science - Sound, Identification (information), Synchronization (computer science), 0202 electrical engineering, electronic engineering, information engineering, Identity (object-oriented programming), 020201 artificial intelligence & image processing, State (computer science), Scale (map)
Abstract: Most existing datasets for speaker identification contain samples obtained under quite constrained conditions, and are usually hand-annotated, hence limited in size. The goal of this paper is to generate a large scale text-independent speaker identification dataset collected 'in the wild'. We make two contributions. First, we propose a fully automated pipeline based on computer vision techniques to create the dataset from open-source media. Our pipeline involves obtaining videos from YouTube; performing active speaker verification using a two-stream synchronization Convolutional Neural Network (CNN), and confirming the identity of the speaker using CNN based facial recognition. We use this pipeline to curate VoxCeleb which contains hundreds of thousands of 'real world' utterances for over 1,000 celebrities. Our second contribution is to apply and compare various state of the art speaker identification techniques on our dataset to establish baseline performance. We show that a CNN based architecture obtains the best performance for both identification and verification., Comment: The dataset can be downloaded from http://www.robots.ox.ac.uk/~vgg/data/voxceleb . 1706.08612v2: minor fixes; 6 pages
Published: 2017
Full Text: View/download PDF

49. The Kinetics Human Action Video Dataset

Author: Kay, Will, Carreira, Joao, Simonyan, Karen, Zhang, Brian, Hillier, Chloe, Vijayanarasimhan, Sudheendra, Viola, Fabio, Green, Tim, Back, Trevor, Natsev, Paul, Suleyman, Mustafa, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition
Abstract: We describe the DeepMind Kinetics human action video dataset. The dataset contains 400 human action classes, with at least 400 video clips for each action. Each clip lasts around 10s and is taken from a different YouTube video. The actions are human focussed and cover a broad range of classes including human-object interactions such as playing instruments, as well as human-human interactions such as shaking hands. We describe the statistics of the dataset, how it was collected, and give some baseline performance figures for neural network architectures trained and tested for human action classification on this dataset. We also carry out a preliminary analysis of whether imbalance in the dataset leads to bias in the classifiers.
Published: 2017
Full Text: View/download PDF

50. You said that?

Author: Chung, Joon Son, Jamaludin, Amir, and Zisserman, Andrew
Subjects: FOS: Computer and information sciences, Computer Vision and Pattern Recognition (cs.CV), ComputingMethodologies_IMAGEPROCESSINGANDCOMPUTERVISION
Abstract: We present a method for generating a video of a talking face. The method takes as inputs: (i) still images of the target face, and (ii) an audio speech segment; and outputs a video of the target face lip synched with the audio. The method runs in real time and is applicable to faces and audio not seen at training time. To achieve this we propose an encoder-decoder CNN model that uses a joint embedding of the face and audio to generate synthesised talking face video frames. The model is trained on tens of hours of unlabelled videos. We also show results of re-dubbing videos using speech from a different person., https://youtu.be/LeufDSb15Kc British Machine Vision Conference (BMVC), 2017
Published: 2017
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

57 results on '"Zisserman A"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources