101 results for "Mangalam, Karttikeya"
Search Results
2. LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
- Author
-
Lee, Nicholas, Wattanawong, Thanakul, Kim, Sehoon, Mangalam, Karttikeya, Shen, Sheng, Anumanchipalli, Gopala, Mahoney, Michael W., Keutzer, Kurt, and Gholami, Amir
- Subjects
Computer Science - Computation and Language - Abstract
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a Llama-2-7B student model. Our code is available at https://github.com/SqueezeAILab/LLM2LLM ., Comment: ACL 2024
- Published
- 2024
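A minimal Python sketch of the iterative loop described in the entry above, using toy stand-ins for the student, teacher, and evaluation step (the function names and the trivial task are hypothetical, not the released LLM2LLM code):

```python
# Toy sketch of an LLM2LLM-style iterative augmentation loop.
# The "student", "teacher", and task below are deliberately trivial stand-ins.
import random

random.seed(0)

seed_data = [(f"q{i}", f"a{i}") for i in range(10)]          # small seed dataset


def finetune(train_set):
    """Toy 'student': memorizes a random 70% of its training set."""
    return {qa for qa in train_set if random.random() < 0.7}


def evaluate(student, dataset):
    """Return the examples the student still gets wrong."""
    return [qa for qa in dataset if qa not in student]


def teacher_generate(wrong_examples, n_per_example=2):
    """Toy 'teacher': emits paraphrase-like variants of each hard example."""
    return [(f"{q}-v{k}", a) for q, a in wrong_examples for k in range(n_per_example)]


train_set = list(seed_data)
for step in range(3):
    student = finetune(train_set)                 # (1) fine-tune student on current data
    wrong = evaluate(student, seed_data)          # (2) find seed examples it gets wrong
    train_set += teacher_generate(wrong)          # (3) teacher augments around the errors
    print(f"step {step}: {len(wrong)} wrong, train size {len(train_set)}")
```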
3. xT: Nested Tokenization for Larger Context in Large Images
- Author
-
Gupta, Ritwik, Li, Shufan, Zhu, Tyler, Malik, Jitendra, Darrell, Trevor, and Mangalam, Karttikeya
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
Modern computer vision pipelines handle large images in one of two sub-optimal ways: down-sampling or cropping. These two methods incur significant losses in the amount of information and context present in an image. There are many downstream applications in which global context matters as much as high frequency details, such as in real-world satellite imagery; in such cases researchers have to make the uncomfortable choice of which information to discard. We introduce xT, a simple framework for vision transformers which effectively aggregates global context with local details and can model large images end-to-end on contemporary GPUs. We select a set of benchmark datasets across classic vision tasks which accurately reflect a vision model's ability to understand truly large images and incorporate fine details over large scales and assess our method's improvement on them. xT is a streaming, two-stage architecture that adapts existing vision backbones and long sequence language models to effectively model large images without quadratic memory growth. We are able to increase accuracy by up to 8.6% on challenging classification tasks and $F_1$ score by 11.6 on context-dependent segmentation on images as large as 29,000 x 29,000 pixels., Comment: Accepted to the 2024 International Conference on Machine Learning (ICML)
- Published
- 2024
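As a rough illustration of the two-stage idea in the entry above (encode regions locally, then model the long sequence of region tokens globally), here is a hedged PyTorch sketch; the region size, the tiny convolutional encoder, and the LSTM context model are placeholders rather than the xT architecture:

```python
# Hedged sketch: encode a large image region-by-region, then model the
# resulting long token sequence with a sequence model (placeholder choices).
import torch
import torch.nn as nn

region = 256                                   # side length of each square region
local_encoder = nn.Sequential(                 # toy per-region encoder (stand-in for a ViT backbone)
    nn.Conv2d(3, 32, kernel_size=16, stride=16),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                              # -> one 32-d token per region
)
context_model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

image = torch.randn(1, 3, 1024, 1024)          # "large" image (kept small for the demo)

# Split into non-overlapping regions: (1, 3, H, W) -> (num_regions, 3, region, region)
patches = image.unfold(2, region, region).unfold(3, region, region)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, region, region)

tokens = local_encoder(patches)                # stage 1: local detail per region
tokens = tokens.unsqueeze(0)                   # (1, num_regions, 32) long sequence
context, _ = context_model(tokens)             # stage 2: global context across regions
print(patches.shape, context.shape)            # (16, 3, 256, 256) and (1, 16, 64)
```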
4. Do Vision and Language Encoders Represent the World Similarly?
- Author
-
Maniparambil, Mayug, Akshulakov, Raiymbek, Djilali, Yasser Abdelaziz Dahou, Narayan, Sanath, Seddik, Mohamed El Amine, Mangalam, Karttikeya, and O'Connor, Noel E.
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning - Abstract
Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks. Furthermore, modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders since they fundamentally represent the same physical world? Analyzing the latent spaces structure of vision and language models on image-caption benchmarks using the Centered Kernel Alignment (CKA), we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP, we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods - a Fast Quadratic Assignment Problem optimization, and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this on several downstream tasks including cross-lingual, cross-domain caption matching and image classification. Code available at github.com/mayug/0-shot-llm-vision., Comment: Accepted CVPR 2024
- Published
- 2024
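Centered Kernel Alignment, the similarity measure used in the entry above to compare vision and language representation spaces, has a simple linear form. A minimal NumPy implementation over two feature matrices with matched samples:

```python
# Linear CKA between two representation matrices X (n x d1) and Y (n x d2),
# where row i of X and row i of Y come from the same input (e.g., an image
# and its caption encoded by two different models).
import numpy as np


def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    X = X - X.mean(axis=0, keepdims=True)      # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    self_x = np.linalg.norm(X.T @ X, "fro")
    self_y = np.linalg.norm(Y.T @ Y, "fro")
    return float(cross / (self_x * self_y))


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))             # e.g., vision encoder features
Y = X @ rng.standard_normal((64, 32))          # a linear transform of X: high CKA
Z = rng.standard_normal((100, 32))             # unrelated features: low CKA
print(round(linear_cka(X, Y), 3), round(linear_cka(X, Z), 3))
```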
5. Dr$^2$Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning
- Author
-
Zhao, Chen, Liu, Shuming, Mangalam, Karttikeya, Qian, Guocheng, Zohra, Fatimah, Alghannam, Abdulmohsen, Malik, Jitendra, and Ghanem, Bernard
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning, which is highly memory-intensive for tasks with high-resolution data, e.g., video understanding, small object detection, and point cloud analysis. In this paper, we propose Dynamic Reversible Dual-Residual Networks, or Dr$^2$Net, a novel family of network architectures that acts as a surrogate network to finetune a pretrained model with substantially reduced memory consumption. Dr$^2$Net contains two types of residual connections, one maintaining the residual structure in the pretrained models, and the other making the network reversible. Due to its reversibility, intermediate activations, which can be reconstructed from output, are cleared from memory during training. We use two coefficients on either type of residual connections respectively, and introduce a dynamic training strategy that seamlessly transitions the pretrained model to a reversible network with much higher numerical precision. We evaluate Dr$^2$Net on various pretrained models and various tasks, and show that it can reach comparable performance to conventional finetuning but with significantly less memory usage.
- Published
- 2024
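The method above relies on residual connections that can be inverted so intermediate activations need not be stored. The NumPy sketch below shows a generic two-coefficient reversible couple and verifies exact reconstruction of the inputs; the particular update form and the coefficients alpha and beta are illustrative, not the exact Dr$^2$Net formulation:

```python
# Illustrative two-coefficient reversible couple: outputs can be inverted to
# recover the inputs exactly, so activations can be freed during training.
import numpy as np

rng = np.random.default_rng(0)
W_f = rng.standard_normal((8, 8))
W_g = rng.standard_normal((8, 8))
F = lambda x: np.tanh(x @ W_f)                 # arbitrary sub-networks
G = lambda x: np.tanh(x @ W_g)
alpha, beta = 1.0, 0.1                         # hypothetical mixing coefficients


def forward(x1, x2):
    y1 = alpha * x1 + beta * F(x2)
    y2 = alpha * x2 + beta * G(y1)
    return y1, y2


def inverse(y1, y2):
    x2 = (y2 - beta * G(y1)) / alpha           # invert the second equation first
    x1 = (y1 - beta * F(x2)) / alpha
    return x1, x2


x1, x2 = rng.standard_normal((2, 4, 8))
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))   # True True
```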
6. Adaptive Human Trajectory Prediction via Latent Corridors
- Author
-
Thakkar, Neerja, Mangalam, Karttikeya, Bajcsy, Andrea, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Human trajectory prediction is typically posed as a zero-shot generalization problem: a predictor is learnt on a dataset of human motion in training scenes, and then deployed on unseen test scenes. While this paradigm has yielded tremendous progress, it fundamentally assumes that trends in human behavior within the deployment scene are constant over time. As such, current prediction models are unable to adapt to scene-specific transient human behaviors, such as crowds temporarily gathering to see buskers, pedestrians hurrying through the rain and avoiding puddles, or a protest breaking out. We formalize the problem of scene-specific adaptive trajectory prediction and propose a new adaptation approach inspired by prompt tuning called latent corridors. By augmenting the input of any pre-trained human trajectory predictor with learnable image prompts, the predictor can improve in the deployment scene by inferring trends from extremely small amounts of new data (e.g., 2 humans observed for 30 seconds). With less than 0.1% additional model parameters, we see up to 23.9% ADE improvement in MOTSynth simulated data and 16.4% ADE in MOT and Wildtrack real pedestrian data. Qualitatively, we observe that latent corridors imbue predictors with an awareness of scene geometry and scene-specific human behaviors that non-adaptive predictors struggle to capture. The project website can be found at https://neerja.me/atp_latent_corridors/., Comment: Accepted to ECCV 2024. Project website can be found at https://neerja.me/atp_latent_corridors/
- Published
- 2023
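The adaptation scheme above keeps the pretrained predictor frozen and learns only a small prompt appended to its input. A hedged PyTorch sketch of that pattern, with a toy MLP standing in for the trajectory predictor and a plain learnable vector standing in for the image prompt:

```python
# Prompt-tuning pattern: freeze a pretrained predictor, learn only a small
# prompt concatenated to its input. The predictor here is a toy stand-in.
import torch
import torch.nn as nn

obs_dim, prompt_dim, out_dim = 16, 4, 24        # hypothetical sizes

predictor = nn.Sequential(                      # pretend this is pretrained
    nn.Linear(obs_dim + prompt_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)
)
for p in predictor.parameters():
    p.requires_grad_(False)                     # frozen backbone

prompt = nn.Parameter(torch.zeros(prompt_dim))  # the only adapted parameters
opt = torch.optim.Adam([prompt], lr=1e-2)

past = torch.randn(32, obs_dim)                 # a handful of scene-specific samples
future = torch.randn(32, out_dim)

for _ in range(100):                            # cheap scene-specific adaptation
    inp = torch.cat([past, prompt.expand(past.size(0), -1)], dim=1)
    loss = nn.functional.mse_loss(predictor(inp), future)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"adapted with {prompt.numel()} extra parameters, loss {loss.item():.3f}")
```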
7. Sequential Modeling Enables Scalable Learning for Large Vision Models
- Author
-
Bai, Yutong, Geng, Xinyang, Mangalam, Karttikeya, Bar, Amir, Yuille, Alan, Darrell, Trevor, Malik, Jitendra, and Efros, Alexei A
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time., Comment: Website: https://yutongbai.com/lvm.html
- Published
- 2023
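The "visual sentences" above are ultimately token sequences trained with a next-token cross-entropy loss, as in a language model. A minimal PyTorch sketch of that loss with a tiny embedding-plus-linear stand-in for the model (purely illustrative):

```python
# Next-token prediction over "visual sentences": sequences of discrete visual
# tokens trained with a shifted cross-entropy loss. Tiny stand-in model.
import torch
import torch.nn as nn

vocab, dim, seq_len = 1024, 64, 32              # hypothetical vocabulary and sizes

model = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, vocab))  # toy "LVM"

tokens = torch.randint(0, vocab, (8, seq_len))  # batch of tokenized visual sentences
logits = model(tokens[:, :-1])                  # predict token t+1 from token t
targets = tokens[:, 1:]                         # shifted targets

loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab), targets.reshape(-1)
)
loss.backward()
print(f"next-token loss: {loss.item():.2f}")    # ~log(1024) ≈ 6.9 at initialization
```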
8. EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding
- Author
-
Mangalam, Karttikeya, Akshulakov, Raiymbek, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language - Abstract
We introduce EgoSchema, a very long-form video question-answering dataset, and benchmark to evaluate long video understanding capabilities of modern vision and language systems. Derived from Ego4D, EgoSchema consists of over 5000 human curated multiple choice question answer pairs, spanning over 250 hours of real video data, covering a very broad range of natural human activity and behavior. For each question, EgoSchema requires the correct answer to be selected between five given options based on a three-minute-long video clip. While some prior works have proposed video datasets with long clip lengths, we posit that merely the length of the video clip does not truly capture the temporal difficulty of the video task that is being considered. To remedy this, we introduce temporal certificate sets, a general notion for capturing the intrinsic temporal understanding length associated with a broad range of video understanding tasks & datasets. Based on this metric, we find EgoSchema to have intrinsic temporal lengths over 5.7x longer than the second closest dataset and 10x to 100x longer than any other video understanding dataset. Further, our evaluation of several current state-of-the-art video and language models shows them to be severely lacking in long-term video understanding capabilities. Even models with several billions of parameters achieve QA accuracy less than 33% (random is 20%) on the EgoSchema multi-choice question answering task, while humans achieve about 76% accuracy. We posit that EgoSchema, with its long intrinsic temporal structures and diverse complexity, would serve as a valuable evaluation probe for developing effective long-term video understanding systems in the future. Data and Zero-shot model evaluation code are open-sourced for both public and commercial use under the Ego4D license at http://egoschema.github.io, Comment: https://egoschema.github.io/
- Published
- 2023
9. PaReprop: Fast Parallelized Reversible Backpropagation
- Author
-
Zhu, Tyler and Mangalam, Karttikeya
- Subjects
Computer Science - Machine Learning, Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition - Abstract
The growing size of datasets and deep learning models has made faster and memory-efficient training crucial. Reversible transformers have recently been introduced as an exciting new method for extremely memory-efficient training, but they come with an additional computation overhead of activation re-computation in the backpropagation phase. We present PaReprop, a fast Parallelized Reversible Backpropagation algorithm that parallelizes the additional activation re-computation overhead in reversible training with the gradient computation itself in backpropagation phase. We demonstrate the effectiveness of the proposed PaReprop algorithm through extensive benchmarking across model families (ViT, MViT, Swin and RoBERTa), data modalities (Vision & NLP), model sizes (from small to giant), and training batch sizes. Our empirical results show that PaReprop achieves up to 20% higher training throughput than vanilla reversible training, largely mitigating the theoretical overhead of 25% lower throughput from activation recomputation in reversible training. Project page: https://tylerzhu.com/pareprop., Comment: Spotlight paper, T4V Workshop @ CVPR 2023
- Published
- 2023
10. Diffusion Models as Masked Autoencoders
- Author
-
Wei, Chen, Mangalam, Karttikeya, Huang, Po-Yao, Li, Yanghao, Fan, Haoqi, Xu, Hu, Wang, Huiyu, Xie, Cihang, Yuille, Alan, and Feichtenhofer, Christoph
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders., Comment: Tech report. Project page: https://weichen582.github.io/diffmae.html
- Published
- 2023
11. Speculative Decoding with Big Little Decoder
- Author
-
Kim, Sehoon, Mangalam, Karttikeya, Moon, Suhong, Malik, Jitendra, Mahoney, Michael W., Gholami, Amir, and Keutzer, Kurt
- Subjects
Computer Science - Computation and Language - Abstract
The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced., Comment: NeurIPS 2023
- Published
- 2023
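The coordination logic above (a small model drafts tokens, a large model is consulted occasionally and can override a draft token) can be condensed into a greatly simplified, single-step analogue. The toy models and thresholds below are hypothetical, not BiLD's tuned policies:

```python
# Simplified big-little decoding: the small model drafts each token; when its
# confidence drops (fallback) the large model is consulted, and a draft token
# the large model assigns very low probability is replaced (rollback-style check).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50


def small_model(prefix):          # toy distribution: noisy preference for (last + 1)
    p = np.full(VOCAB, 0.3 / VOCAB)
    p[(prefix[-1] + 1) % VOCAB] += 0.7 * rng.uniform(0.3, 1.0)
    return p / p.sum()


def large_model(prefix):          # toy "accurate" distribution
    p = np.full(VOCAB, 0.05 / VOCAB)
    p[(prefix[-1] + 1) % VOCAB] += 0.95
    return p / p.sum()


def generate(n_tokens, fallback=0.5, rollback=0.1):
    seq, large_calls = [0], 0
    while len(seq) <= n_tokens:
        p_small = small_model(seq)
        tok = int(p_small.argmax())
        if p_small[tok] < fallback:               # fallback: small model is unsure
            p_large = large_model(seq)
            if p_large[tok] < rollback:           # discard a draft token the large model rejects
                tok = int(p_large.argmax())
            large_calls += 1
        seq.append(tok)
    return seq[1:], large_calls


tokens, calls = generate(20)
print(tokens)
print(f"large model consulted on {calls}/20 steps")
```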
12. Reversible Vision Transformers
- Author
-
Mangalam, Karttikeya, Fan, Haoqi, Li, Yanghao, Wu, Chao-Yuan, Xiong, Bo, Feichtenhofer, Christoph, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for deeper models, where throughput can increase up to 2.3x over their non-reversible counterparts. Full code and trained models are available at https://github.com/facebookresearch/slowfast. A simpler, easy to understand and modify version is also available at https://github.com/karttikeya/minREV, Comment: Oral at CVPR 2022, updated version
- Published
- 2023
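The memory saving above comes from not caching intermediate activations: because each block is invertible, a block's inputs can be recomputed from its outputs while sweeping backwards through the network. A small NumPy demonstration over a stack of generic two-stream reversible blocks (not the actual transformer blocks):

```python
# Depth-independent memory: run a stack of reversible blocks keeping only the
# final output, then recover every intermediate activation by inverting the
# blocks in reverse order (as a reversible backward pass would).
import numpy as np

rng = np.random.default_rng(0)
DEPTH, DIM = 6, 16
weights = [(rng.standard_normal((DIM, DIM)) * 0.1,
            rng.standard_normal((DIM, DIM)) * 0.1) for _ in range(DEPTH)]


def block_forward(x1, x2, wf, wg):
    y1 = x1 + np.tanh(x2 @ wf)
    y2 = x2 + np.tanh(y1 @ wg)
    return y1, y2


def block_inverse(y1, y2, wf, wg):
    x2 = y2 - np.tanh(y1 @ wg)
    x1 = y1 - np.tanh(x2 @ wf)
    return x1, x2


x1 = rng.standard_normal((4, DIM))
x2 = rng.standard_normal((4, DIM))
inputs = (x1.copy(), x2.copy())

for wf, wg in weights:                       # forward pass: no activation cache kept
    x1, x2 = block_forward(x1, x2, wf, wg)

for wf, wg in reversed(weights):             # backward sweep: reconstruct inputs on the fly
    x1, x2 = block_inverse(x1, x2, wf, wg)

print(np.allclose(inputs[0], x1), np.allclose(inputs[1], x2))   # True True
```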
13. Re-evaluating the Need for Multimodal Signals in Unsupervised Grammar Induction
- Author
-
Li, Boyi, Corona, Rodolfo, Mangalam, Karttikeya, Chen, Catherine, Flaherty, Daniel, Belongie, Serge, Weinberger, Kilian Q., Malik, Jitendra, Darrell, Trevor, and Klein, Dan
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning - Abstract
Are multimodal inputs necessary for grammar induction? Recent work has shown that multimodal training inputs can improve grammar induction. However, these improvements are based on comparisons to weak text-only baselines that were trained on relatively little textual data. To determine whether multimodal inputs are needed in regimes with large amounts of textual training data, we design a stronger text-only baseline, which we refer to as LC-PCFG. LC-PCFG is a C-PCFG that incorporates embeddings from text-only large language models (LLMs). We use a fixed grammar family to directly compare LC-PCFG to various multimodal grammar induction methods. We compare performance on four benchmark datasets. LC-PCFG provides an up to 17% relative improvement in Corpus-F1 compared to state-of-the-art multimodal grammar induction methods. LC-PCFG is also more computationally efficient, providing an up to 85% reduction in parameter count and 8.8x reduction in training time compared to multimodal approaches. These results suggest that multimodal inputs may not be necessary for grammar induction, and emphasize the importance of strong vision-free baselines for evaluating the benefit of multimodal approaches., Comment: NAACL Findings 2024
- Published
- 2022
14. Re^2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
- Author
-
Zhao, Chen, Liu, Shuming, Mangalam, Karttikeya, and Ghanem, Bernard
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Multimedia - Abstract
Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content. Given limited GPU memory, training TAL end to end (i.e., from videos to predictions) on long videos is a significant challenge. Most methods can only train on pre-extracted features without optimizing them for the localization problem, consequently limiting localization performance. In this work, to extend the potential in TAL networks, we propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone with reversible modules, where the input can be recovered from the output such that the bulky intermediate activations can be cleared from memory during training. Instead of designing one single type of reversible module, we propose a network rewiring mechanism, to transform any module with a residual connection to a reversible module without changing any parameters. This provides two benefits: (1) a large variety of reversible networks are easily obtained from existing and even future model designs, and (2) the reversible models require much less training effort as they reuse the pre-trained parameters of their original non-reversible versions. Re2TAL, only using the RGB modality, reaches 37.01% average mAP on ActivityNet-v1.3, a new state-of-the-art record, and mAP 64.9% at tIoU=0.5 on THUMOS-14, outperforming all other RGB-only methods.
- Published
- 2022
15. Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022
- Author
-
Ben-Avraham, Elad, Herzig, Roei, Mangalam, Karttikeya, Bar, Amir, Rohrbach, Anna, Karlinsky, Leonid, Darrell, Trevor, and Globerson, Amir
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
This technical report describes the SViT approach for the Ego4D Point of No Return (PNR) Temporal Localization Challenge. We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos. Second, the scene representations of individual frames in video should "align" with those of still images. This is achieved via a "Frame-Clip Consistency" loss, which ensures the flow of structured information between images and videos. SViT obtains strong performance on the challenge test set with 0.656 absolute temporal localization error., Comment: Ego4D CVPR22 Object State Localization challenge. arXiv admin note: substantial text overlap with arXiv:2206.06346
- Published
- 2022
16. Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens
- Author
-
Ben-Avraham, Elad, Herzig, Roei, Mangalam, Karttikeya, Bar, Amir, Rohrbach, Anna, Karlinsky, Leonid, Darrell, Trevor, and Globerson, Amir
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how could we leverage these for a video downstream task? We propose a learning framework StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images only available during training can improve a video model. SViT relies on two key insights. First, as both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos. Second, the scene representations of individual frames in video should "align" with those of still images. This is achieved via a "Frame-Clip Consistency" loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a Hand-Object Graph, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets. Furthermore, it won in the Ego4D CVPR'22 Object State Localization challenge. For code and pretrained models, visit the project page at https://eladb3.github.io/SViT/, Comment: Tech report
- Published
- 2022
17. Squeezeformer: An Efficient Transformer for Automatic Speech Recognition
- Author
-
Kim, Sehoon, Gholami, Amir, Shaw, Albert, Lee, Nicholas, Mangalam, Karttikeya, Malik, Jitendra, Mahoney, Michael W., and Keutzer, Kurt
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound - Abstract
The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed up by feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online., Comment: NeurIPS 2022
- Published
- 2022
18. MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
- Author
-
Wu, Chao-Yuan, Li, Yanghao, Mangalam, Karttikeya, Fan, Haoqi, Xiong, Bo, Malik, Jitendra, and Feichtenhofer, Christoph
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit., Comment: Technical report. arXiv v2: add link to code
- Published
- 2022
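Concretely, the "memory" above is cached, detached keys and values from previously processed clips that the current clip's queries can also attend to. A hedged single-head PyTorch sketch of that attention pattern with random features standing in for the real model:

```python
# Attend over cached keys/values from earlier clips plus the current clip's
# own keys/values; the cache is detached so no gradients flow into the past.
import torch

dim, clip_len, n_clips = 32, 16, 4
cache_k, cache_v = [], []
torch.manual_seed(0)

for t in range(n_clips):                        # process a long video clip by clip
    x = torch.randn(clip_len, dim)              # current clip's token features
    q, k, v = x, x, x                           # single-head self-attention stand-in

    keys = torch.cat(cache_k + [k], dim=0)      # current + remembered context
    values = torch.cat(cache_v + [v], dim=0)
    attn = torch.softmax(q @ keys.t() / dim ** 0.5, dim=-1)
    out = attn @ values
    print(f"clip {t}: attends over {keys.size(0)} tokens -> out {tuple(out.shape)}")

    cache_k.append(k.detach())                  # extend the memory at marginal cost
    cache_v.append(v.detach())
    cache_k, cache_v = cache_k[-3:], cache_v[-3:]   # keep a bounded memory window
```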
19. Overcoming Mode Collapse with Adaptive Multi Adversarial Training
- Author
-
Mangalam, Karttikeya and Garg, Rohin
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
Generative Adversarial Networks (GANs) are a class of generative models used for various applications, but they have been known to suffer from the mode collapse problem, in which some modes of the target distribution are ignored by the generator. Investigative study using a new data generation procedure indicates that the mode collapse of the generator is driven by the discriminator's inability to maintain classification accuracy on previously seen samples, a phenomenon called Catastrophic Forgetting in continual learning. Motivated by this observation, we introduce a novel training procedure that adaptively spawns additional discriminators to remember previous modes of generation. On several datasets, we show that our training scheme can be plugged-in to existing GAN frameworks to mitigate mode collapse and improve standard metrics for GAN evaluation., Comment: BMVC 2021 Poster
- Published
- 2021
20. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
- Author
-
Li, Yanghao, Wu, Chao-Yuan, Fan, Haoqi, Mangalam, Karttikeya, Xiong, Bo, Malik, Jitendra, and Feichtenhofer, Christoph
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit., Comment: CVPR 2022 Camera Ready
- Published
- 2021
21. Ego4D: Around the World in 3,000 Hours of Egocentric Video
- Author
-
Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, Martin, Miguel, Nagarajan, Tushar, Radosavovic, Ilija, Ramakrishnan, Santhosh Kumar, Ryan, Fiona, Sharma, Jayant, Wray, Michael, Xu, Mengmeng, Xu, Eric Zhongcong, Zhao, Chen, Bansal, Siddhant, Batra, Dhruv, Cartillier, Vincent, Crane, Sean, Do, Tien, Doulaty, Morrie, Erapalli, Akshay, Feichtenhofer, Christoph, Fragomeni, Adriano, Fu, Qichen, Gebreselasie, Abrham, Gonzalez, Cristina, Hillis, James, Huang, Xuhua, Huang, Yifei, Jia, Wenqi, Khoo, Weslie, Kolar, Jachym, Kottur, Satwik, Kumar, Anurag, Landini, Federico, Li, Chao, Li, Yanghao, Li, Zhenqiang, Mangalam, Karttikeya, Modhugu, Raghava, Munro, Jonathan, Murrell, Tullie, Nishiyasu, Takumi, Price, Will, Puentes, Paola Ruiz, Ramazanova, Merey, Sari, Leda, Somasundaram, Kiran, Southerland, Audrey, Sugano, Yusuke, Tao, Ruijie, Vo, Minh, Wang, Yuchen, Wu, Xindi, Yagi, Takuma, Zhao, Ziwei, Zhu, Yunyi, Arbelaez, Pablo, Crandall, David, Damen, Dima, Farinella, Giovanni Maria, Fuegen, Christian, Ghanem, Bernard, Ithapu, Vamsi Krishna, Jawahar, C. V., Joo, Hanbyul, Kitani, Kris, Li, Haizhou, Newcombe, Richard, Oliva, Aude, Park, Hyun Soo, Rehg, James M., Sato, Yoichi, Shi, Jianbo, Shou, Mike Zheng, Torralba, Antonio, Torresani, Lorenzo, Yan, Mingfei, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence - Abstract
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/, Comment: To appear in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. This version updates the baseline result numbers for the Hands and Objects benchmark (appendix)
- Published
- 2021
22. Object-Region Video Transformers
- Author
-
Herzig, Roei, Ben-Avraham, Elad, Mangalam, Karttikeya, Bar, Amir, Chechik, Gal, Rohrbach, Anna, Darrell, Trevor, and Globerson, Amir
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer-layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48 and Epic-Kitchen100. We show strong performance improvement across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/ORViT/, Comment: CVPR 2022
- Published
- 2021
23. LOKI: Long Term and Key Intentions for Trajectory Prediction
- Author
-
Girase, Harshayu, Gang, Haiming, Malla, Srikanth, Li, Jiachen, Kanehara, Akira, Mangalam, Karttikeya, and Choi, Chiho
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Multiagent Systems, Computer Science - Robotics - Abstract
Recent advances in trajectory prediction have shown that explicit reasoning about agents' intent is important to accurately forecast their motion. However, the current research activities are not directly applicable to intelligent and safety critical systems. This is mainly because very few public datasets are available, and they only consider pedestrian-specific intents for a short temporal horizon from a restricted egocentric view. To this end, we propose LOKI (LOng term and Key Intentions), a novel large-scale dataset that is designed to tackle joint trajectory and intention prediction for heterogeneous traffic agents (pedestrians and vehicles) in an autonomous driving setting. The LOKI dataset is created to discover several factors that may affect intention, including i) agent's own will, ii) social interactions, iii) environmental constraints, and iv) contextual information. We also propose a model that jointly performs trajectory and intention prediction, showing that recurrently reasoning about intention can assist with trajectory prediction. We show our method outperforms state-of-the-art trajectory prediction methods by up to 27% and also provide a baseline for frame-wise intention estimation., Comment: ICCV 2021 (The dataset is available at https://usa.honda-ri.com/loki)
- Published
- 2021
24. Multiscale Vision Transformers
- Author
-
Fan, Haoqi, Xiong, Bo, Mangalam, Karttikeya, Li, Yanghao, Yan, Zhicheng, Malik, Jitendra, and Feichtenhofer, Christoph
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning - Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in computation and parameters. We further remove the temporal dimension and apply our model for image classification where it outperforms prior work on vision transformers. Code is available at: https://github.com/facebookresearch/SlowFast, Comment: Technical report
- Published
- 2021
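The core operation behind the multiscale stages above is attention in which keys and values are pooled to a coarser resolution before the dot product. A hedged single-head sketch using plain 1D average pooling over tokens; it illustrates the pooled-attention idea only, not MViT's exact pooling operators:

```python
# Pooled self-attention: keys and values are downsampled along the token axis
# before attention, shrinking the attention matrix from N x N to N x (N/s).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, dim, stride = 64, 32, 4                     # tokens, channels, pooling stride

x = torch.randn(1, N, dim)
q = x
# pool K and V over tokens: (1, N, dim) -> (1, dim, N) -> avg-pool -> (1, N/s, dim)
k = F.avg_pool1d(x.transpose(1, 2), kernel_size=stride, stride=stride).transpose(1, 2)
v = F.avg_pool1d(x.transpose(1, 2), kernel_size=stride, stride=stride).transpose(1, 2)

attn = torch.softmax(q @ k.transpose(1, 2) / dim ** 0.5, dim=-1)   # (1, N, N/s)
out = attn @ v                                                      # (1, N, dim)
print(attn.shape, out.shape)   # torch.Size([1, 64, 16]) torch.Size([1, 64, 32])
```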
25. From Goals, Waypoints & Paths To Long Term Human Trajectory Forecasting
- Author
-
Mangalam, Karttikeya, An, Yang, Girase, Harshayu, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Robotics - Abstract
Human trajectory forecasting is an inherently multi-modal problem. Uncertainty in future trajectories stems from two sources: (a) sources that are known to the agent but unknown to the model, such as long term goals, and (b) sources that are unknown to both the agent & the model, such as intent of other agents & irreducible randomness in decisions. We propose to factorize this uncertainty into its epistemic & aleatoric sources. We model the epistemic uncertainty through multimodality in long term goals and the aleatoric uncertainty through multimodality in waypoints & paths. To exemplify this dichotomy, we also propose a novel long term trajectory forecasting setting, with prediction horizons up to a minute, an order of magnitude longer than prior works. Finally, we present Y-net, a scene compliant trajectory forecasting network that exploits the proposed epistemic & aleatoric structure for diverse trajectory predictions across long prediction horizons. Y-net significantly improves previous state-of-the-art performance on both (a) the well studied short prediction horizon settings on the Stanford Drone & ETH/UCY datasets and (b) the proposed long prediction horizon setting on the re-purposed Stanford Drone & Intersection Drone datasets., Comment: 14 pages, 7 figures (including 2 GIFs)
- Published
- 2020
26. Long-term Human Motion Prediction with Scene Context
- Author
-
Cao, Zhe, Gao, Hang, Mangalam, Karttikeya, Cai, Qi-Zhi, Vo, Minh, and Malik, Jitendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Human movement is goal-directed and influenced by the spatial layout of the objects in the scene. To plan future human motion, it is crucial to perceive the environment -- imagine how hard it is to navigate a new room with lights off. Existing works on predicting human motion do not pay attention to the scene context and thus struggle in long-term prediction. In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. Given a single scene image and 2D pose histories, our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. For stable training and rigorous evaluation, we contribute a diverse synthetic dataset with clean annotations. In both synthetic and real datasets, our method shows consistent quantitative and qualitative improvements over existing methods., Comment: ECCV 2020 Oral. Dataset & Code: https://github.com/ZheC/GTA-IM-Dataset Video: https://people.eecs.berkeley.edu/~zhecao/hmp/index.html
- Published
- 2020
27. It Is Not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction
- Author
-
Mangalam, Karttikeya, Girase, Harshayu, Agarwal, Shreyas, Lee, Kuan-Hui, Adeli, Ehsan, Malik, Jitendra, and Gaidon, Adrien
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning - Abstract
Human trajectory forecasting with multiple socially interacting agents is of critical importance for autonomous navigation in human environments, e.g., for self-driving cars and social robots. In this work, we present Predicted Endpoint Conditioned Network (PECNet) for flexible human trajectory prediction. PECNet infers distant trajectory endpoints to assist in long-range multi-modal trajectory prediction. A novel non-local social pooling layer enables PECNet to infer diverse yet socially compliant trajectories. Additionally, we present a simple "truncation-trick" for improving few-shot multi-modal trajectory prediction performance. We show that PECNet improves state-of-the-art performance on the Stanford Drone trajectory prediction benchmark by ~20.9% and on the ETH/UCY benchmark by ~40.8%. Project homepage: https://karttikeya.github.io/publication/htf/, Comment: Accepted at ECCV 2020 (Oral)
- Published
- 2020
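The key idea above is to first infer where a trajectory will end and then decode the path conditioned on that endpoint. A hedged PyTorch sketch with toy MLPs and Gaussian endpoint noise (the modules and noise model are illustrative, not PECNet's architecture):

```python
# Endpoint-conditioned trajectory prediction: sample candidate endpoints,
# then decode a full future trajectory conditioned on the past + endpoint.
import torch
import torch.nn as nn

past_len, fut_len = 8, 12
torch.manual_seed(0)

endpoint_head = nn.Sequential(nn.Linear(past_len * 2, 64), nn.ReLU(), nn.Linear(64, 2))
traj_decoder = nn.Sequential(nn.Linear(past_len * 2 + 2, 64), nn.ReLU(),
                             nn.Linear(64, fut_len * 2))

past = torch.randn(16, past_len * 2)            # flattened (x, y) history per agent

endpoint_mu = endpoint_head(past)               # predicted trajectory endpoint
samples = []
for _ in range(5):                              # multi-modal futures via endpoint noise
    endpoint = endpoint_mu + 0.5 * torch.randn_like(endpoint_mu)
    future = traj_decoder(torch.cat([past, endpoint], dim=1))
    samples.append(future.view(16, fut_len, 2))

print(torch.stack(samples).shape)               # (5 modes, 16 agents, 12 steps, 2)
```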
28. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision
- Author
-
Mangalam, Karttikeya, Adeli, Ehsan, Lee, Kuan-Hui, Gaidon, Adrien, and Niebles, Juan Carlos
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics - Abstract
We tackle the problem of Human Locomotion Forecasting, a task for jointly predicting the spatial positions of several keypoints on the human body in the near future under an egocentric setting. In contrast to the previous work that aims to solve either the task of pose prediction or trajectory forecasting in isolation, we propose a framework to unify the two problems and address the practically useful task of pedestrian locomotion prediction in the wild. Among the major challenges in solving this task is the scarcity of annotated egocentric video datasets with dense annotations for pose, depth, or egomotion. To surmount this difficulty, we use state-of-the-art models to generate (noisy) annotations and propose robust forecasting models that can learn from this noisy supervision. We present a method to disentangle the overall pedestrian motion into easier to learn subparts by utilizing a pose completion and a decomposition module. The completion module fills in the missing key-point annotations and the decomposition module breaks the cleaned locomotion down to global (trajectory) and local (pose keypoint movements). Further, with Quasi RNN as our backbone, we propose a novel hierarchical trajectory forecasting network that utilizes low-level vision domain specific signals like egomotion and depth to predict the global trajectory. Our method leads to state-of-the-art results for the prediction of human locomotion in the egocentric view. Project page: https://karttikeya.github.io/publication/plf/, Comment: Accepted to WACV 2020 (Oral)
- Published
- 2019
29. On Compressing U-net Using Knowledge Distillation
- Author
-
Mangalam, Karttikeya and Salzmann, Mathieu
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning - Abstract
We study the use of knowledge distillation to compress the U-net architecture. We show that, while standard distillation is not sufficient to reliably train a compressed U-net, introducing other regularization methods, such as batch normalization and class re-weighting, in knowledge distillation significantly improves the training process. This allows us to compress a U-net by over 1000x, i.e., to 0.1% of its original number of parameters, at a negligible decrease in performance., Comment: 4 pages, 1 figure
- Published
- 2018
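The setup above trains a small U-net to match a large one's softened per-pixel predictions alongside the usual labels, with class re-weighting. A generic PyTorch sketch of such a combined distillation loss (temperature, weights, and shapes are illustrative):

```python
# Knowledge-distillation loss for a pixelwise classifier: weighted cross-entropy
# on the labels plus a temperature-softened KL term matching the teacher.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, C, H, W = 2, 3, 32, 32                       # batch, classes, spatial size
T, alpha = 2.0, 0.5                             # temperature, loss mixing weight

student_logits = torch.randn(B, C, H, W, requires_grad=True)
teacher_logits = torch.randn(B, C, H, W)        # from the frozen large U-net
labels = torch.randint(0, C, (B, H, W))
class_weights = torch.tensor([1.0, 2.0, 4.0])   # re-weight rare classes

ce = F.cross_entropy(student_logits, labels, weight=class_weights)
kd = F.kl_div(
    F.log_softmax(student_logits / T, dim=1),
    F.softmax(teacher_logits / T, dim=1),
    reduction="batchmean",
) * T * T                                        # standard T^2 scaling

loss = alpha * ce + (1 - alpha) * kd
loss.backward()
print(f"ce={ce.item():.3f} kd={kd.item():.3f} total={loss.item():.3f}")
```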
30. Learning Spontaneity to Improve Emotion Recognition In Speech
- Author
-
Mangalam, Karttikeya and Guha, Tanaya
- Subjects
Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, Computer Science - Sound - Abstract
We investigate the effect and usefulness of spontaneity (i.e. whether a given speech is spontaneous or not) in speech in the context of emotion recognition. We hypothesize that emotional content in speech is interrelated with its spontaneity, and use spontaneity classification as an auxiliary task to the problem of emotion recognition. We propose two supervised learning settings that utilize spontaneity to improve speech emotion recognition: a hierarchical model that performs spontaneity detection before performing emotion recognition, and a multitask learning model that jointly learns to recognize both spontaneity and emotion. Through various experiments on the well known IEMOCAP database, we show that by using spontaneity detection as an additional task, significant improvement can be achieved over emotion recognition systems that are unaware of spontaneity. We achieve state-of-the-art emotion recognition accuracy (4-class, 69.1%) on the IEMOCAP database outperforming several relevant and competitive baselines., Comment: Accepted at Interspeech 2018
- Published
- 2017
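The multitask variant above shares an encoder between the emotion and spontaneity labels. A minimal PyTorch sketch of a shared-encoder, two-head model over acoustic feature sequences (the GRU encoder and feature sizes are stand-ins, not the paper's model):

```python
# Shared encoder with two heads: one for emotion (4-way) and one for
# spontaneity (binary), trained with a joint loss.
import torch
import torch.nn as nn

feat_dim, hidden, n_emotions = 40, 64, 4        # e.g., 40-dim acoustic frames
torch.manual_seed(0)

encoder = nn.GRU(feat_dim, hidden, batch_first=True)
emotion_head = nn.Linear(hidden, n_emotions)
spont_head = nn.Linear(hidden, 2)

speech = torch.randn(8, 120, feat_dim)          # batch of feature sequences
emotion_y = torch.randint(0, n_emotions, (8,))
spont_y = torch.randint(0, 2, (8,))

_, h = encoder(speech)                          # final hidden state as utterance embedding
h = h.squeeze(0)
loss = (nn.functional.cross_entropy(emotion_head(h), emotion_y)
        + 0.5 * nn.functional.cross_entropy(spont_head(h), spont_y))
loss.backward()
print(f"joint loss: {loss.item():.3f}")
```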
31. Future Person Localization in First-Person Videos
- Author
-
Yagi, Takuma, Mangalam, Karttikeya, Yonetani, Ryo, and Sato, Yoichi
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We present a new task that predicts future locations of people observed in first-person videos. Consider a first-person video stream continuously recorded by a wearable camera. Given a short clip of a person that is extracted from the complete stream, we aim to predict that person's location in future frames. To facilitate this future person localization ability, we make the following three key observations: a) First-person videos typically involve significant ego-motion which greatly affects the location of the target person in future frames; b) Scales of the target person act as a salient cue to estimate a perspective effect in first-person videos; c) First-person videos often capture people up-close, making it easier to leverage target poses (e.g., where they look) for predicting their future locations. We incorporate these three observations into a prediction framework with a multi-stream convolution-deconvolution architecture. Experimental results reveal our method to be effective on our new dataset as well as on a public social interaction dataset., Comment: Accepted to CVPR 2018
- Published
- 2017
32. Bitwise Operations of Cellular Automaton on Gray-scale Images
- Author
-
Mangalam, Karttikeya and Venkatesh, K S
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Cellular Automata (CA) theory is a discrete model that represents the state of each of its cells from a finite set of possible values which evolve in time according to a pre-defined set of transition rules. CA have been applied to a number of image processing tasks such as Convex Hull Detection, Image Denoising etc. but mostly under the limitation of restricting the input to binary images. In general, a gray-scale image may be converted to a number of different binary images which are finally recombined after CA operations on each of them individually. We have developed a multinomial regression based weighted summation method to recombine binary images for better performance of CA based Image Processing algorithms. The recombination algorithm is tested for the specific case of denoising Salt and Pepper Noise to test against standard benchmark algorithms such as the Median Filter for various images and noise levels. The results indicate several interesting invariances in the application of the CA, such as the particular noise realization and the choice of sub-sampling of pixels to determine recombination weights. Additionally, it appears that simpler algorithms for weight optimization which seek local minima work as effectively as those that seek global minima such as Simulated Annealing., Comment: 5 Pages. The code is available at: https://github.com/karttikeya/Bitwise-CA-Opeartions/
- Published
- 2017
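The recombination step above treats a gray-scale image as a stack of binary bit planes and merges the (CA-processed) planes with per-plane weights. A NumPy sketch of the decomposition and a weighted recombination; with the fixed weights 2^b the original image is recovered exactly, whereas the paper fits such weights by regression on sampled pixels:

```python
# Decompose an 8-bit gray-scale image into binary bit planes and recombine
# them with per-plane weights (here the exact weights 2**b; in the paper such
# weights are fit by regression after processing each plane with a CA rule).
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

planes = [((img >> b) & 1).astype(np.float64) for b in range(8)]   # 8 binary images
# ... each plane would be processed independently by a cellular-automaton rule ...

weights = [2 ** b for b in range(8)]            # recombination weights
recombined = sum(w * p for w, p in zip(weights, planes))

print(np.array_equal(recombined.astype(np.uint8), img))            # True
```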
33. Perceiving People over Long Periods: Algorithms, Architectures & Datasets
- Author
-
Mangalam, Karttikeya
- Subjects
Artificial intelligence, Computer Vision, Long Sequence Modelling, Long-form video understanding, Reversible models, Transformers, Video Understanding - Abstract
Long-form video understanding remains one of the last enduring open problems in computer vision. While the natural world offers long periods of visual stimuli, most computer vision systems still operate within a limited temporal scope, typically just a few seconds in both input and output. This thesis presents my work developing the neural machinery, i.e., the algorithms, architectures and datasets, that extend the temporal capacity of video understanding systems to minutes and beyond. I start by presenting my work on algorithms for long-term multimodal human motion forecasting, termed PECNet and Y-net. Next, I introduce my contributions on hierarchical, temporally scalable and memory-efficient neural architectures for understanding long-form videos, in the form of MViT and Rev-ViT. Finally, I close by presenting my work on EgoSchema, the first certifiably long-form video-language dataset, which serves as a benchmark for evaluating the long-form understanding capabilities of multimodal models. The presented benchmark results on EgoSchema highlight the existing performance gap between current state-of-the-art models and human-level long-form video understanding. I believe that my presented advancements in algorithms, architectures, and datasets not only address several existing limitations but also open new avenues for future research and application.
- Published
- 2023
34. It Is Not the Journey But the Destination: Endpoint Conditioned Trajectory Prediction
- Author
-
Mangalam, Karttikeya, primary, Girase, Harshayu, additional, Agarwal, Shreyas, additional, Lee, Kuan-Hui, additional, Adeli, Ehsan, additional, Malik, Jitendra, additional, and Gaidon, Adrien, additional
- Published
- 2020
35. Long-Term Human Motion Prediction with Scene Context
- Author
-
Cao, Zhe, primary, Gao, Hang, additional, Mangalam, Karttikeya, additional, Cai, Qi-Zhi, additional, Vo, Minh, additional, and Malik, Jitendra, additional
- Published
- 2020
36. Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization
- Author
-
Zhao, Chen, primary, Liu, Shuming, additional, Mangalam, Karttikeya, additional, and Ghanem, Bernard, additional
- Published
- 2023
37. Latency Matters: Real-Time Action Forecasting Transformer
- Author
-
Girase, Harshayu, primary, Agarwal, Nakul, additional, Choi, Chiho, additional, and Mangalam, Karttikeya, additional
- Published
- 2023
38. A Vision-free Baseline for Multimodal Grammar Induction
- Author
-
Li, Boyi, Corona, Rodolfo, Mangalam, Karttikeya, Chen, Catherine, Flaherty, Daniel, Belongie, Serge, Weinberger, Kilian Q., Malik, Jitendra, Darrell, Trevor, and Klein, Dan
- Abstract
Past work has shown that paired vision-language signals substantially improve grammar induction in multimodal datasets such as MSCOCO. We investigate whether advancements in large language models (LLMs) that are only trained with text could provide strong assistance for grammar induction in multimodal settings. We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms previous multi-modal methods, and achieves state-of-the-art grammar induction performance for various multimodal datasets. Compared to image-aided grammar induction, LC-PCFG outperforms the prior state-of-the-art by 7.9 Corpus-F1 points, with an 85% reduction in parameter count and 1.7x faster training speed. Across three video-assisted grammar induction benchmarks, LC-PCFG outperforms prior state-of-the-art by up to 7.7 Corpus-F1, with 8.8x faster training. These results shed light on the notion that text-only language models might include visually grounded cues that aid in grammar induction in multimodal contexts. Moreover, our results emphasize the importance of establishing a robust vision-free baseline when evaluating the benefit of multimodal approaches.
- Published
- 2023
39. Big Little Transformer Decoder
- Author
-
Kim, Sehoon, Mangalam, Karttikeya, Moon, Suhong, Canny, John, Malik, Jitendra, Mahoney, Michael W., Gholami, Amir, and Keutzer, Kurt
- Subjects
FOS: Computer and information sciences, Computer Science - Computation and Language, Computation and Language (cs.CL) - Abstract
The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment, and which makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced.
- Published
- 2023
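Illustrative sketch: the following is a minimal, hypothetical Python rendering of a BiLD-style big-little decoding loop, not the authors' released implementation. It assumes Hugging Face-style causal language models that return logits, greedy decoding, and single-sequence batches; the function name, the threshold values, and the exact placement of the fallback and rollback checks are assumptions made for illustration.

import torch

@torch.no_grad()
def bild_generate(small_lm, large_lm, prompt_ids, max_new_tokens=64,
                  fallback_threshold=0.5, rollback_threshold=2.0):
    # The small model drafts tokens autoregressively; the large model is
    # invoked only occasionally and scores the draft in one parallel pass.
    tokens = prompt_ids.clone()
    while tokens.size(1) - prompt_ids.size(1) < max_new_tokens:
        draft_start = tokens.size(1)
        # 1) Draft with the small model until it becomes unconfident.
        while tokens.size(1) - prompt_ids.size(1) < max_new_tokens:
            probs = torch.softmax(small_lm(tokens).logits[:, -1, :], dim=-1)
            top_prob, top_token = probs.max(dim=-1)
            tokens = torch.cat([tokens, top_token.view(1, 1)], dim=1)
            if top_prob.item() < fallback_threshold:   # fallback policy
                break
        # 2) Large model checks the drafted span non-autoregressively.
        large_logits = large_lm(tokens[:, :-1]).logits
        for pos in range(draft_start, tokens.size(1)):
            log_probs = torch.log_softmax(large_logits[:, pos - 1, :], dim=-1)
            drafted = tokens[0, pos]
            # Rollback policy: if the large model strongly disagrees with a
            # drafted token, replace it and discard everything after it.
            if -log_probs[0, drafted].item() > rollback_threshold:
                replacement = log_probs.argmax(dim=-1, keepdim=True)
                tokens = torch.cat([tokens[:, :pos], replacement], dim=1)
                break
    return tokens

In this sketch the expensive model never generates token by token; it only verifies or overwrites the cheap model's draft, which is where a latency saving of the kind reported above would come from.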
40. Object-Region Video Transformers
- Author
-
Herzig, Roei, Ben-Avraham, Elad, Mangalam, Karttikeya, Bar, Amir, Chechik, Gal, Rohrbach, Anna, Darrell, Trevor, and Globerson, Amir
- Published
- 2022
- Full Text
- View/download PDF
41. MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
- Author
-
Wu, Chao-Yuan, Li, Yanghao, Mangalam, Karttikeya, Fan, Haoqi, Xiong, Bo, Malik, Jitendra, and Feichtenhofer, Christoph
- Published
- 2022
- Full Text
- View/download PDF
42. Ego4D: Around the World in 3,000 Hours of Egocentric Video
- Author
-
Grauman, Kristen, Westbury, Andrew, Byrne, Eugene, Chavis, Zachary, Furnari, Antonino, Girdhar, Rohit, Hamburger, Jackson, Jiang, Hao, Liu, Miao, Liu, Xingyu, Martin, Miguel, Nagarajan, Tushar, Radosavovic, Ilija, Ramakrishnan, Santhosh Kumar, Ryan, Fiona, Sharma, Jayant, Wray, Michael, Xu, Mengmeng, Xu, Eric Zhongcong, Zhao, Chen, Bansal, Siddhant, Batra, Dhruv, Cartillier, Vincent, Crane, Sean, Do, Tien, Doulaty, Morrie, Erapalli, Akshay, Feichtenhofer, Christoph, Fragomeni, Adriano, Fu, Qichen, Gebreselasie, Abrham, Gonzalez, Cristina, Hillis, James, Huang, Xuhua, Huang, Yifei, Jia, Wenqi, Khoo, Weslie, Kolar, Jachym, Kottur, Satwik, Kumar, Anurag, Landini, Federico, Li, Chao, Li, Yanghao, Li, Zhenqiang, Mangalam, Karttikeya, Modhugu, Raghava, Munro, Jonathan, Murrell, Tullie, Nishiyasu, Takumi, Price, Will, Puentes, Paola Ruiz, Ramazanova, Merey, Sari, Leda, Somasundaram, Kiran, Southerland, Audrey, Sugano, Yusuke, Tao, Ruijie, Vo, Minh, Wang, Yuchen, Wu, Xindi, Yagi, Takuma, Zhao, Ziwei, Zhu, Yunyi, Arbelaez, Pablo, Crandall, David, Damen, Dima, Farinella, Giovanni Maria, Fuegen, Christian, Ghanem, Bernard, Ithapu, Vamsi Krishna, Jawahar, C. V., Joo, Hanbyul, Kitani, Kris, Li, Haizhou, Newcombe, Richard, Oliva, Aude, Park, Hyun Soo, Rehg, James M., Sato, Yoichi, Shi, Jianbo, Shou, Mike Zheng, Torralba, Antonio, Torresani, Lorenzo, Yan, Mingfei, and Malik, Jitendra
- Published
- 2022
- Full Text
- View/download PDF
43. MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
- Author
-
Li, Yanghao, Wu, Chao-Yuan, Fan, Haoqi, Mangalam, Karttikeya, Xiong, Bo, Malik, Jitendra, and Feichtenhofer, Christoph
- Published
- 2022
- Full Text
- View/download PDF
44. Reversible Vision Transformers
- Author
-
Mangalam, Karttikeya, Fan, Haoqi, Li, Yanghao, Wu, Chao-Yuan, Xiong, Bo, Feichtenhofer, Christoph, and Malik, Jitendra
- Published
- 2022
- Full Text
- View/download PDF
45. Does unsupervised grammar induction need pixels?
- Author
-
Li, Boyi, Corona, Rodolfo, Mangalam, Karttikeya, Chen, Catherine, Flaherty, Daniel, Belongie, Serge, Weinberger, Kilian Q., Malik, Jitendra, Darrell, Trevor, and Klein, Dan
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer Science - Computation and Language ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Computation and Language (cs.CL) ,Machine Learning (cs.LG) - Abstract
Are extralinguistic signals such as image pixels crucial for inducing constituency grammars? While past work has shown substantial gains from multimodal cues, we investigate whether such gains persist in the presence of rich information from large language models (LLMs). We find that our approach, LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods on the task of unsupervised constituency parsing, achieving state-of-the-art performance on a variety of datasets. Moreover, LC-PCFG yields an over 50% reduction in parameter count and training-time speedups of 1.7x over image-aided models and more than 5x over video-aided models. These results challenge the notion that extralinguistic signals such as image pixels are needed for unsupervised grammar induction, and point to the need for better text-only baselines when evaluating the benefit of multimodality for the task. (A minimal text-only feature-extraction sketch follows this record.)
- Published
- 2022
- Full Text
- View/download PDF
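Illustrative sketch: the snippet below shows one hedged, hypothetical way to obtain the kind of text-only sentence features that an LC-PCFG-style baseline could use in place of image or video features. The model choice (GPT-2), the mean pooling, and the function name are assumptions made for illustration rather than the authors' exact pipeline, and the compound PCFG itself is not shown.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")      # illustrative LLM choice
llm = AutoModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def text_only_features(sentence: str) -> torch.Tensor:
    # Mean-pool the frozen LLM's final hidden states into a single vector.
    # In a multimodal C-PCFG this slot would be filled by image/video
    # features; here the conditioning signal comes from text alone.
    inputs = tokenizer(sentence, return_tensors="pt")
    hidden = llm(**inputs).last_hidden_state           # [1, seq_len, hidden_dim]
    return hidden.mean(dim=1).squeeze(0)               # [hidden_dim]

features = text_only_features("a dog chases a ball in the park")
print(features.shape)   # torch.Size([768]) for GPT-2 small

Swapping a vector like this in for the visual embedding would leave the rest of a compound-PCFG training loop unchanged, which is roughly what a vision-free baseline of this kind amounts to.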
46. From Goals, Waypoints & Paths To Long Term Human Trajectory Forecasting
- Author
-
Mangalam, Karttikeya, An, Yang, Girase, Harshayu, and Malik, Jitendra
- Published
- 2021
- Full Text
- View/download PDF
47. Multiscale Vision Transformers
- Author
-
Fan, Haoqi, Xiong, Bo, Mangalam, Karttikeya, Li, Yanghao, Yan, Zhicheng, Malik, Jitendra, and Feichtenhofer, Christoph
- Published
- 2021
- Full Text
- View/download PDF
48. LOKI: Long Term and Key Intentions for Trajectory Prediction
- Author
-
Girase, Harshayu, Gang, Haiming, Malla, Srikanth, Li, Jiachen, Kanehara, Akira, Mangalam, Karttikeya, and Choi, Chiho
- Published
- 2021
- Full Text
- View/download PDF
49. Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision
- Author
-
Mangalam, Karttikeya, Adeli, Ehsan, Lee, Kuan-Hui, Gaidon, Adrien, and Niebles, Juan Carlos
- Published
- 2020
- Full Text
- View/download PDF
50. Future Person Localization in First-Person Videos
- Author
-
Yagi, Takuma, Mangalam, Karttikeya, Yonetani, Ryo, and Sato, Yoichi
- Published
- 2018
- Full Text
- View/download PDF