Author: "Buch, Shyamal" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Buch, Shyamal"' showing total 19 results

1. Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Author: Jain, Gagan, Hegde, Nidhi, Kusupati, Aditya, Nagrani, Arsha, Buch, Shyamal, Jain, Prateek, Arnab, Anurag, and Paul, Sujoy
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The visual medium (images and videos) naturally contains a large amount of information redundancy, thereby providing a great opportunity for leveraging efficiency in processing. While Vision Transformer (ViT) based models scale effectively to large data regimes, they fail to capitalize on this inherent redundancy, leading to higher computational costs. Mixture of Experts (MoE) networks demonstrate scalability while maintaining same inference-time costs, but they come with a larger parameter footprint. We present Mixture of Nested Experts (MoNE), which utilizes a nested structure for experts, wherein individual experts fall on an increasing compute-accuracy curve. Given a compute budget, MoNE learns to dynamically choose tokens in a priority order, and thus redundant tokens are processed through cheaper nested experts. Using this framework, we achieve equivalent performance as the baseline models, while reducing inference time compute by over two-fold. We validate our approach on standard image and video datasets - ImageNet-21K, Kinetics400, and Something-Something-v2. We further highlight MoNE's adaptability by showcasing its ability to maintain strong performance across different inference-time compute budgets on videos, using only a single trained model.
Published: 2024

2. MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

Author: Min, Juhong, Buch, Shyamal, Nagrani, Arsha, Cho, Minsu, and Schmid, Cordelia
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
Comment: CVPR 2024
Published: 2024

3. Streaming Dense Video Captioning

Author: Zhou, Xingyi, Arnab, Anurag, Buch, Shyamal, Yan, Shen, Myers, Austin, Xiong, Xuehan, Nagrani, Arsha, and Schmid, Cordelia
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First, we propose a new memory module, based on clustering incoming tokens, which can handle arbitrarily long videos as the memory is of a fixed size. Second, we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability, and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet, YouCook2 and ViTT. Our code is released at https://github.com/google-research/scenic.
Comment: CVPR 2024. Code is available at https://github.com/google-research/scenic/tree/main/scenic/projects/streaming_dvc
Published: 2024

4. Revisiting the 'Video' in Video-Language Understanding

Author: Buch, Shyamal, Eyzaguirre, Cristóbal, Gaidon, Adrien, Wu, Jiajun, Fei-Fei, Li, and Niebles, Juan Carlos
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
Comment: CVPR 2022 (Oral)
Published: 2022

5. On the Opportunities and Risks of Foundation Models

Author: Bommasani, Rishi, Hudson, Drew A., Adeli, Ehsan, Altman, Russ, Arora, Simran, von Arx, Sydney, Bernstein, Michael S., Bohg, Jeannette, Bosselut, Antoine, Brunskill, Emma, Brynjolfsson, Erik, Buch, Shyamal, Card, Dallas, Castellon, Rodrigo, Chatterji, Niladri, Chen, Annie, Creel, Kathleen, Davis, Jared Quincy, Demszky, Dora, Donahue, Chris, Doumbouya, Moussa, Durmus, Esin, Ermon, Stefano, Etchemendy, John, Ethayarajh, Kawin, Fei-Fei, Li, Finn, Chelsea, Gale, Trevor, Gillespie, Lauren, Goel, Karan, Goodman, Noah, Grossman, Shelby, Guha, Neel, Hashimoto, Tatsunori, Henderson, Peter, Hewitt, John, Ho, Daniel E., Hong, Jenny, Hsu, Kyle, Huang, Jing, Icard, Thomas, Jain, Saahil, Jurafsky, Dan, Kalluri, Pratyusha, Karamcheti, Siddharth, Keeling, Geoff, Khani, Fereshte, Khattab, Omar, Koh, Pang Wei, Krass, Mark, Krishna, Ranjay, Kuditipudi, Rohith, Kumar, Ananya, Ladhak, Faisal, Lee, Mina, Lee, Tony, Leskovec, Jure, Levent, Isabelle, Li, Xiang Lisa, Li, Xuechen, Ma, Tengyu, Malik, Ali, Manning, Christopher D., Mirchandani, Suvir, Mitchell, Eric, Munyikwa, Zanele, Nair, Suraj, Narayan, Avanika, Narayanan, Deepak, Newman, Ben, Nie, Allen, Niebles, Juan Carlos, Nilforoshan, Hamed, Nyarko, Julian, Ogut, Giray, Orr, Laurel, Papadimitriou, Isabel, Park, Joon Sung, Piech, Chris, Portelance, Eva, Potts, Christopher, Raghunathan, Aditi, Reich, Rob, Ren, Hongyu, Rong, Frieda, Roohani, Yusuf, Ruiz, Camilo, Ryan, Jack, Ré, Christopher, Sadigh, Dorsa, Sagawa, Shiori, Santhanam, Keshav, Shih, Andy, Srinivasan, Krishnan, Tamkin, Alex, Taori, Rohan, Thomas, Armin W., Tramèr, Florian, Wang, Rose E., Wang, William, Wu, Bohan, Wu, Jiajun, Wu, Yuhuai, Xie, Sang Michael, Yasunaga, Michihiro, You, Jiaxuan, Zaharia, Matei, Zhang, Michael, Zhang, Tianyi, Zhang, Xikun, Zhang, Yuhui, Zheng, Lucia, Zhou, Kaitlyn, and Liang, Percy
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computers and Society
Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.
Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html
Published: 2021

6. BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments

Author: Srivastava, Sanjana, Li, Chengshu, Lingelbach, Michael, Martín-Martín, Roberto, Xia, Fei, Vainio, Kent, Lian, Zheng, Gokmen, Cem, Buch, Shyamal, Liu, C. Karen, Savarese, Silvio, Gweon, Hyowon, Wu, Jiajun, and Fei-Fei, Li
Subjects: Computer Science - Robotics, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce BEHAVIOR, a benchmark for embodied AI with 100 activities in simulation, spanning a range of everyday household chores such as cleaning, maintenance, and food preparation. These activities are designed to be realistic, diverse, and complex, aiming to reproduce the challenges that agents must face in the real world. Building such a benchmark poses three fundamental difficulties for each activity: definition (it can differ by time, place, or person), instantiation in a simulator, and evaluation. BEHAVIOR addresses these with three innovations. First, we propose an object-centric, predicate logic-based description language for expressing an activity's initial and goal conditions, enabling generation of diverse instances for any activity. Second, we identify the simulator-agnostic features required by an underlying environment to support BEHAVIOR, and demonstrate its realization in one such simulator. Third, we introduce a set of metrics to measure task progress and efficiency, absolute and relative to human demonstrators. We include 500 human demonstrations in virtual reality (VR) to serve as the human ground truth. Our experiments demonstrate that even state of the art embodied AI solutions struggle with the level of realism, diversity, and complexity imposed by the activities in our benchmark. We make BEHAVIOR publicly available at behavior.stanford.edu to facilitate and calibrate the development of new embodied AI solutions.
Published: 2021

7. iGibson 1.0: a Simulation Environment for Interactive Tasks in Large Realistic Scenes

Author: Shen, Bokui, Xia, Fei, Li, Chengshu, Martín-Martín, Roberto, Fan, Linxi, Wang, Guanzhi, Pérez-D'Arpino, Claudia, Buch, Shyamal, Srivastava, Sanjana, Tchapmi, Lyne P., Tchapmi, Micael E., Vainio, Kent, Wong, Josiah, Fei-Fei, Li, and Savarese, Silvio
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: We present iGibson 1.0, a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes. Our environment contains 15 fully interactive home-sized scenes with 108 rooms populated with rigid and articulated objects. The scenes are replicas of real-world homes, with distribution and the layout of objects aligned to those of the real world. iGibson 1.0 integrates several key features to facilitate the study of interactive tasks: i) generation of high-quality virtual sensor signals (RGB, depth, segmentation, LiDAR, flow and so on), ii) domain randomization to change the materials of the objects (both visual and physical) and/or their shapes, iii) integrated sampling-based motion planners to generate collision-free trajectories for robot bases and arms, and iv) intuitive human-iGibson interface that enables efficient collection of human demonstrations. Through experiments, we show that the full interactivity of the scenes enables agents to learn useful visual representations that accelerate the training of downstream manipulation tasks. We also show that iGibson 1.0 features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of human demonstrated (mobile) manipulation behaviors. iGibson 1.0 is open-source, equipped with comprehensive examples and documentation. For more information, visit our project website: http://svl.stanford.edu/igibson/
Published: 2020

8. The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary

Author: Ghanem, Bernard, Niebles, Juan Carlos, Snoek, Cees, Heilbron, Fabian Caba, Alwassel, Humam, Escorcia, Victor, Krishna, Ranjay, Buch, Shyamal, and Dao, Cuong Duc
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The 3rd annual installment of the ActivityNet Large-Scale Activity Recognition Challenge, held as a full-day workshop in CVPR 2018, focused on the recognition of daily life, high-level, goal-oriented activities from user-generated videos as those found in internet video portals. The 2018 challenge hosted six diverse tasks which aimed to push the limits of semantic visual understanding of videos as well as bridge visual content with human captions. Three out of the six tasks were based on the ActivityNet dataset, which was introduced in CVPR 2015 and organized hierarchically in a semantic taxonomy. These tasks focused on tracing evidence of activities in time in the form of proposals, class labels, and captions. In this installment of the challenge, we hosted three guest tasks to enrich the understanding of visual information in videos. The guest tasks focused on complementary aspects of the activity recognition problem at large scale and involved three challenging and recently compiled datasets: the Kinetics-600 dataset from Google DeepMind, the AVA dataset from Berkeley and Google, and the Moments in Time dataset from MIT and IBM Research.
Comment: CVPR Workshop 2018 challenge summary
Published: 2018

9. ActivityNet Challenge 2017 Summary

Author: Ghanem, Bernard, Niebles, Juan Carlos, Snoek, Cees, Heilbron, Fabian Caba, Alwassel, Humam, Krishna, Ranjay, Escorcia, Victor, Hata, Kenji, and Buch, Shyamal
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The ActivityNet Large Scale Activity Recognition Challenge 2017 Summary: results and challenge participants' papers.
Comment: 76 pages
Published: 2017

10. RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

Author: Fan, Linxi, Buch, Shyamal, Wang, Guanzhi, Cao, Ryan, Zhu, Yuke, Niebles, Juan Carlos, and Fei-Fei, Li
Editor: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Woeginger, Gerhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Vedaldi, Andrea (editor), Bischof, Horst (editor), Brox, Thomas (editor), and Frahm, Jan-Michael (editor)
Published: 2020

11. End-to-End Joint Semantic Segmentation of Actors and Actions in Video

Author: Ji, Jingwei, Buch, Shyamal, Soto, Alvaro, and Niebles, Juan Carlos
Editor: Hutchison, David (Series Editor), Kanade, Takeo (Series Editor), Kittler, Josef (Series Editor), Kleinberg, Jon M. (Series Editor), Mattern, Friedemann (Series Editor), Mitchell, John C. (Series Editor), Naor, Moni (Series Editor), Pandu Rangan, C. (Series Editor), Steffen, Bernhard (Series Editor), Terzopoulos, Demetri (Series Editor), Tygar, Doug (Series Editor), Weikum, Gerhard (Series Editor), Ferrari, Vittorio (editor), Hebert, Martial (editor), Sminchisescu, Cristian (editor), and Weiss, Yair (editor)
Published: 2018

12. RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

Author: Fan, Linxi, primary, Buch, Shyamal, additional, Wang, Guanzhi, additional, Cao, Ryan, additional, Zhu, Yuke, additional, Niebles, Juan Carlos, additional, and Fei-Fei, Li, additional
Published: 2020

13. End-to-End Joint Semantic Segmentation of Actors and Actions in Video

Author: Ji, Jingwei, primary, Buch, Shyamal, additional, Soto, Alvaro, additional, and Niebles, Juan Carlos, additional
Published: 2018

14. Revisiting the “Video” in Video-Language Understanding

Author: Buch, Shyamal, primary, Eyzaguirre, Cristobal, additional, Gaidon, Adrien, additional, Wu, Jiajun, additional, Fei-Fei, Li, additional, and Niebles, Juan Carlos, additional
Published: 2022

15. iGibson 1.0: A Simulation Environment for Interactive Tasks in Large Realistic Scenes

Author: Shen, Bokui, primary, Xia, Fei, additional, Li, Chengshu, additional, Martin-Martin, Roberto, additional, Fan, Linxi, additional, Wang, Guanzhi, additional, Perez-D'Arpino, Claudia, additional, Buch, Shyamal, additional, Srivastava, Sanjana, additional, Tchapmi, Lyne, additional, Tchapmi, Micael, additional, Vainio, Kent, additional, Wong, Josiah, additional, Fei-Fei, Li, additional, and Savarese, Silvio, additional
Published: 2021

16. Neural Event Semantics for Grounded Language Understanding

Author: Buch, Shyamal, primary, Fei-Fei, Li, additional, and Goodman, Noah D., additional
Published: 2021

17. Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos

Author: Huang, De-An, primary, Buch, Shyamal, additional, Dery, Lucio, additional, Garg, Animesh, additional, Fei-Fei, Li, additional, and Niebles, Juan Carlos, additional
Published: 2018

18. SST: Single-Stream Temporal Action Proposals

Author: Buch, Shyamal, primary, Escorcia, Victor, additional, Shen, Chuanqi, additional, Ghanem, Bernard, additional, and Niebles, Juan Carlos, additional
Published: 2017

19. End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos

Author: Buch, Shyamal, primary, Escorcia, Victor, additional, Ghanem, Bernard, additional, and Niebles, Juan Carlos, additional
Published: 2017
