60,690 results for "A. Shrivastava"
Search Results
2. Benchmarking Floworks against OpenAI & Anthropic: A Novel Framework for Enhanced LLM Function Calling
- Author
-
Bhan, Nirav, Gupta, Shival, Manaswini, Sai, Baba, Ritik, Yadav, Narun, Desai, Hillori, Choudhary, Yash, Pawar, Aman, Shrivastava, Sarthak, and Biswas, Sudipta
- Subjects
Computer Science - Artificial Intelligence - Abstract
Large Language Models (LLMs) have shown remarkable capabilities in various domains, yet their economic impact has been limited by challenges in tool use and function calling. This paper introduces ThorV2, a novel architecture that significantly enhances LLMs' function calling abilities. We develop a comprehensive benchmark focused on HubSpot CRM operations to evaluate ThorV2 against leading models from OpenAI and Anthropic. Our results demonstrate that ThorV2 outperforms existing models in accuracy, reliability, latency, and cost efficiency for both single and multi-API calling tasks. We also show that ThorV2 is far more reliable and scales better to multistep tasks compared to traditional models. Our work offers the tantalizing possibility of more accurate function-calling compared to today's best-performing models using significantly smaller LLMs. These advancements have significant implications for the development of more capable AI assistants and the broader application of LLMs in real-world scenarios., Comment: 15 pages for main paper, 21 pages in total including references and appendix, 10 figures
- Published
- 2024
3. Twisted bilinear spherical maximal functions
- Author
-
Bhojak, Ankit, Choudhary, Surjeet Singh, and Shrivastava, Saurabh
- Subjects
Mathematics - Classical Analysis and ODEs, 42B15, 42B25 - Abstract
We obtain $L^p$-estimates for the full and lacunary maximal functions associated to the twisted bilinear spherical averages given by \[\mathfrak{A}_t(f_1,f_2)(x,y)=\int_{\mathbb S^{2d-1}}f_1(x+tz_1,y)f_2(x,y+tz_2)\;d\sigma(z_1,z_2),\;t>0,\] for all dimensions $d\geq1$. We show that the estimates for such operators in dimensions $d\geq2$ rely essentially on the method of slicing. The bounds for the lacunary maximal function in dimension one are more delicate and require a trilinear smoothing inequality, which is based on an appropriate sublevel set estimate in this context., Comment: 27 pages, 3 figures
- Published
- 2024
4. Measuring Free-Form Decision-Making Inconsistency of Language Models in Military Crisis Simulations
- Author
-
Shrivastava, Aryan, Hullman, Jessica, and Lamparth, Max
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computers and Society - Abstract
There is an increasing interest in using language models (LMs) for automated decision-making, with multiple countries actively testing LMs to aid in military crisis decision-making. To scrutinize relying on LM decision-making in high-stakes settings, we examine the inconsistency of responses in a crisis simulation ("wargame"), similar to reported tests conducted by the US military. Prior work illustrated escalatory tendencies and varying levels of aggression among LMs but was constrained to simulations with pre-defined actions. This was due to the challenges associated with quantitatively measuring semantic differences and evaluating natural language decision-making without relying on pre-defined actions. In this work, we query LMs for free-form responses and use a metric based on BERTScore to measure response inconsistency quantitatively. Leveraging the benefits of BERTScore, we show that the inconsistency metric is robust to linguistic variations that preserve semantic meaning in a question-answering setting across text lengths. We show that all five tested LMs exhibit levels of inconsistency that indicate semantic differences, even when adjusting the wargame setting, anonymizing involved conflict countries, or adjusting the sampling temperature parameter $T$. Further qualitative evaluation shows that models recommend courses of action that share few to no similarities. We also study the impact of different prompt sensitivity variations on inconsistency at temperature $T = 0$. We find that inconsistency due to semantically equivalent prompt variations can exceed response inconsistency from temperature sampling for most studied models across different levels of ablations. Given the high-stakes nature of military deployment, we recommend further consideration be taken before using LMs to inform military decisions or other cases of high-stakes decision-making.
- Published
- 2024
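The pairwise-inconsistency aggregation described in the abstract above can be sketched in a few lines. This toy version swaps BERTScore for a simple token-overlap F1 (so it lacks the semantic robustness the paper relies on), but the shape of the metric, mean dissimilarity over all pairs of repeated responses, is the same:

```python
from itertools import combinations

def token_f1(a: str, b: str) -> float:
    # Stand-in similarity: token-overlap F1. The paper uses BERTScore F1,
    # which additionally captures embedding-level semantic similarity.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = len(ta & tb)
    if common == 0:
        return 0.0
    precision, recall = common / len(tb), common / len(ta)
    return 2 * precision * recall / (precision + recall)

def inconsistency(responses: list[str]) -> float:
    # Mean pairwise dissimilarity across repeated responses to one prompt.
    pairs = list(combinations(responses, 2))
    return sum(1.0 - token_f1(a, b) for a, b in pairs) / len(pairs)
```

Identical responses score 0; responses with no shared vocabulary score 1.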
5. Analyzing (In)Abilities of SAEs via Formal Languages
- Author
-
Menon, Abhinav, Shrivastava, Manish, Krueger, David, and Lubana, Ekdeep Singh
- Subjects
Computer Science - Machine Learning - Abstract
Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well-studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, finding that interpretable latents often emerge in the features learned by our SAEs. However, as in vision, we find that performance is highly sensitive to the inductive biases of the training pipeline. Moreover, we show that latents correlating with certain features of the input do not always have a causal impact on the model's computation. We thus argue that causality has to become a central target in SAE training: learning of causal features should be incentivized from the ground up. Motivated by this, we propose and perform preliminary investigations for an approach that promotes learning of causally relevant features in our formal language setting., Comment: Under review
- Published
- 2024
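A minimal sketch of the SAE setup the abstract describes: a ReLU encoder into an overcomplete latent space, a linear decoder, and an L1 sparsity penalty. The dimensions and penalty weight here are illustrative, not the paper's:

```python
import numpy as np

def sae_step(x, W_enc, b_enc, W_dec, l1=1e-3):
    """Forward pass and loss of a sparse autoencoder on a batch of
    hidden representations x (batch, d_model)."""
    h = np.maximum(0.0, x @ W_enc + b_enc)   # overcomplete sparse code
    x_hat = h @ W_dec                        # linear reconstruction
    loss = np.mean((x - x_hat) ** 2) + l1 * np.mean(np.abs(h))
    return x_hat, h, loss
```

Training then minimizes this loss over the model's activations; the paper's point is that low loss and interpretable latents do not by themselves guarantee causal relevance.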
6. Advanced Gesture Recognition in Autism: Integrating YOLOv7, Video Augmentation and VideoMAE for Video Analysis
- Author
-
Singh, Amit Kumar, Shrivastava, Trapti, and Singh, Vrijendra
- Subjects
Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning - Abstract
Deep learning and advancements in contactless sensors have significantly enhanced our ability to understand complex human activities in healthcare settings. In particular, deep learning models utilizing computer vision have been developed to enable detailed analysis of human gesture recognition, especially repetitive gestures which are commonly observed behaviors in children with autism. This research work aims to identify repetitive behaviors indicative of autism by analyzing videos captured in natural settings as children engage in daily activities. The focus is on accurately categorizing real-time repetitive gestures such as spinning, head banging, and arm flapping. To this end, we utilize the publicly accessible Self-Stimulatory Behavior Dataset (SSBD) to classify these stereotypical movements. A key component of the proposed methodology is the use of VideoMAE, a model designed to improve both spatial and temporal analysis of video data through a masking and reconstruction mechanism. This model significantly outperformed traditional methods, achieving an accuracy of 97.7%, a 14.7% improvement over the previous state-of-the-art.
- Published
- 2024
7. SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching
- Author
-
Zhang, Tianyi, Su, Junda, Wu, Oscar, Xu, Zhaozhuo, and Shrivastava, Anshumali
- Subjects
Computer Science - Machine Learning - Abstract
Compressive adaptation approaches, such as QLoRA, are widely popular alternatives for reducing memory requirements during fine-tuning of large language models (LLMs) while producing models capable of handling various downstream tasks. The key idea is to employ a "two-tower" architecture: compressing pre-trained LLM parameters into compact representations and fine-tuning the additive full-precision adapter, which typically has few tunable parameters in low-rank format. However, the strict algebraic assumptions, such as the low-rank assumption, and the complexity of composing two-tower architectures are known shortcomings, resulting in a poor accuracy-efficiency trade-off. In response to these known limitations, we propose SpaLLM (Sketched Parameter Adaptation of LLMs), a novel compressive adaptation approach for LLMs. This method is also the first to illustrate parameter-sharing compression methods for LLM fine-tuning, which, unlike QLoRA, are free from strict low-rank algebraic assumptions on adapters. Furthermore, our proposal unifies model compression and adaptation into a single, streamlined process, eliminating the need for two-tower architectures. SpaLLM sketches pre-trained LLM weights into lookup tables and directly fine-tunes the values in these tables. This approach simplifies LLMs' compressive adaptation workflow, potentially improves multi-user serving efficiency, and delivers significantly better accuracy for both natural language understanding and generation tasks. Moreover, by avoiding the "two-tower" architecture, our framework only requires one compressed matrix multiplication per layer during inference, demonstrating superior inference efficiency compared to previous methods.
- Published
- 2024
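A toy version of the sketch-into-lookup-table idea, assuming a simple random-hash parameter-sharing scheme: every weight position hashes to a slot in a small shared table, and fine-tuning would update the table values directly. The paper's actual sketching and fine-tuning procedure is more involved; this only illustrates the compression/lookup mechanics:

```python
import numpy as np

def sketch_to_table(W, table_size, seed=0):
    # Hash each weight position into a small shared lookup table;
    # slot value = mean of the weights mapped to it (illustrative choice).
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, table_size, size=W.size)
    sums = np.bincount(idx, weights=W.ravel(), minlength=table_size)
    counts = np.bincount(idx, minlength=table_size)
    table = sums / np.maximum(counts, 1)
    return table, idx

def dequantize(table, idx, shape):
    # Reconstruction is a pure lookup; fine-tuning touches only `table`.
    return table[idx].reshape(shape)
```

Note there is no low-rank assumption anywhere: compression comes entirely from parameter sharing (table_size << W.size).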
8. Fast Encoding and Decoding for Implicit Video Representation
- Author
-
Chen, Hao, Xie, Saining, Lim, Ser-Nam, and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Despite the abundant availability and content richness of video data, its high dimensionality poses challenges for video research. Recent advancements have explored the implicit representation for videos using neural networks, demonstrating strong performance in applications such as video compression and enhancement. However, the prolonged encoding time remains a persistent challenge for video Implicit Neural Representations (INRs). In this paper, we focus on improving the speed of video encoding and decoding within implicit representations. We introduce two key components: NeRV-Enc, a transformer-based hyper-network for fast encoding; and NeRV-Dec, a parallel decoder for efficient video loading. NeRV-Enc achieves an impressive speed-up of $\mathbf{10^4\times}$ by eliminating gradient-based optimization. Meanwhile, NeRV-Dec simplifies video decoding, outperforming conventional codecs with a loading speed $\mathbf{11\times}$ faster, and surpassing RAM loading with pre-decoded videos ($\mathbf{2.5\times}$ faster while being $\mathbf{65\times}$ smaller in size)., Comment: ECCV 2024. Project page at https://haochen-rye.github.io/FastNeRV/, code will be at https://github.com/haochen-rye/FastNeRV
- Published
- 2024
9. Self-Supervised Any-Point Tracking by Contrastive Random Walks
- Author
-
Shrivastava, Ayush and Owens, Andrew
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to obtain high spatial precision and to obtain a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them. Our method achieves strong performance on the TapVid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods., Comment: ECCV 2024. Project link: https://ayshrv.com/gmrw . Code: https://github.com/ayshrv/gmrw/
- Published
- 2024
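The construction above, attention-style all-pairs matching turned into random-walk transition matrices, plus a cycle-consistency objective, can be illustrated in a few lines. This sketch uses plain dot-product affinities in place of the paper's transformer, and the temperature value is illustrative:

```python
import numpy as np

def transition(feat_a, feat_b, temp=0.07):
    # Row-stochastic transition matrix from all-pairs feature affinities,
    # playing the role of the transformer's global matching attention.
    logits = (feat_a @ feat_b.T) / temp
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def cycle_loss(frames):
    # Walk forward through the frames and back again; cycle consistency
    # asks each point to return to itself (cross-entropy vs. identity).
    walk = frames + frames[-2::-1]
    P = np.eye(frames[0].shape[0])
    for a, b in zip(walk[:-1], walk[1:]):
        P = P @ transition(a, b)
    return -np.mean(np.log(np.diag(P) + 1e-9))
```

With perfectly matchable features the round-trip matrix is near-identity and the loss is near zero; training pushes real features toward that regime.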
10. Training Language Models to Self-Correct via Reinforcement Learning
- Author
-
Kumar, Aviral, Zhuang, Vincent, Agarwal, Rishabh, Su, Yi, Co-Reyes, John D, Singh, Avi, Baumli, Kate, Iqbal, Shariq, Bishop, Colton, Roelofs, Rebecca, Zhang, Lei M, McKinney, Kay, Shrivastava, Disha, Paduraru, Cosmin, Tucker, George, Precup, Doina, Behbahani, Feryal, and Faust, Aleksandra
- Subjects
Computer Science - Machine Learning - Abstract
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
- Published
- 2024
11. Emergence of coupling induced transparency by tuning purely dissipative couplings
- Author
-
Shrivastava, Kuldeep Kumar, Ketkar, Moulik Deviprasad, Bhoi, Biswanath, and Singh, Rajeev
- Subjects
Quantum Physics - Abstract
Controlled transitions between coupling induced transparency (CIT) and coupling induced absorption (CIA) are of both fundamental importance and potential use in various devices. We explore these peculiar phenomena in multi-mode coupled hybrid quantum systems consisting of a tunable mode (TM) and several static modes (SMs). The individual SMs and TM are designed such that they show CIA, but upon coupling different SMs we observe a transition from CIA to CIT. Quite remarkably, we are able to achieve CIT using only purely dissipative couplings, whereas CIT is well known to appear with coherent coupling. We develop a robust quantum-theory-based formalism that captures the transition between CIA and CIT and explains both inter-transitions (CIT to CIA) and intra-transitions (CIA to CIA, CIT to CIT, etc.) in a multimode hybrid quantum system within a purely linear approach. A general model is developed for hybrid quantum systems with N modes. We explicitly describe two sets of hybrid systems, the first with three modes (one TM coupled with two SMs) and the second with four modes (one TM coupled with three SMs), before generalising to hybrid quantum systems with N modes. The results provide a pathway for designing hybrid systems that can control the group velocity of light, offering potential applications in optical switching and quantum information technology. Our finding that controllable inter- and intra-transitions of CIT/CIA can be achieved in a single hybrid quantum system may serve as a tool and guide for applications in quantum technology and quantum materials, as the TMs/SMs may well be extended to other real or quasi-particles.
- Published
- 2024
12. Robust Controller Synthesis under Markovian Mode Switching with Periodic LTV Dynamics
- Author
-
Shrivastava, Shaurya and Oguri, Kenshiro
- Subjects
Mathematics - Optimization and Control - Abstract
In this work, we propose novel LMI-based controller synthesis frameworks for periodically time-varying Markov-jump linear systems. We first discuss the necessary conditions for mean square stability and derive Lyapunov-like conditions for stability assurance. To relax strict stability requirements, we introduce a new criterion that does not require the Lyapunov function to decrease at each time step. Further, we incorporate these stability theorems in LMI-based controller synthesis frameworks while considering two separate problems: minimizing a quadratic cost, and maximizing the region of attraction. Numerical simulations verify the controllers' stability and showcase their applicability to fault-tolerant control.
- Published
- 2024
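For context, the classical mean-square stability test for a time-invariant discrete-time Markov-jump linear system x_{k+1} = A_{r_k} x_k reduces to a spectral-radius check on a lifted operator built from the mode matrices and the transition probabilities (a standard MJLS result; the paper's periodic LTV setting and LMI synthesis go well beyond this baseline). A sketch, with P[j, i] denoting the probability of jumping from mode j to mode i:

```python
import numpy as np

def mjls_ms_stable(A_modes, P):
    """Mean-square stability iff the lifted operator assembled from the
    Kronecker squares A_j (x) A_j, weighted by transition probabilities,
    has spectral radius < 1."""
    n = A_modes[0].shape[0]
    N = len(A_modes)
    m = n * n
    L = np.zeros((N * m, N * m))
    for i in range(N):          # next mode i
        for j in range(N):      # current mode j; P[j, i] = Pr(j -> i)
            L[i*m:(i+1)*m, j*m:(j+1)*m] = P[j, i] * np.kron(A_modes[j], A_modes[j])
    return float(np.max(np.abs(np.linalg.eigvals(L)))) < 1.0
```

Note the second-moment recursion underlying this check is exactly why Lyapunov-like (LMI) conditions arise in the synthesis problem.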
13. LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation
- Author
-
Swaminathan, Archana, Gupta, Anubhav, Gupta, Kamal, Maiya, Shishira R., Agarwal, Vatsal, and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Neural Radiance Fields (NeRFs) have revolutionized the reconstruction of static scenes and objects in 3D, offering unprecedented quality. However, extending NeRFs to model dynamic objects or object articulations remains a challenging problem. Previous works have tackled this issue by focusing on part-level reconstruction and motion estimation for objects, but they often rely on heuristics regarding the number of moving parts or object categories, which can limit their practical use. In this work, we introduce LEIA, a novel approach for representing dynamic 3D objects. Our method involves observing the object at distinct time steps or "states" and conditioning a hypernetwork on the current state, using this to parameterize our NeRF. This approach allows us to learn a view-invariant latent representation for each state. We further demonstrate that by interpolating between these states, we can generate novel articulation configurations in 3D space that were previously unseen. Our experimental results highlight the effectiveness of our method in articulating objects in a manner that is independent of the viewing angle and joint configuration. Notably, our approach outperforms previous methods that rely on motion information for articulation registration., Comment: Accepted to ECCV 2024. Project Website at https://archana1998.github.io/leia/
- Published
- 2024
14. Randomness in quantum random number generator from vacuum fluctuations with source-device-independence
- Author
-
Shrivastava, Megha, Mittal, Mohit, Kumari, Isha, and Abhignan, Venkat
- Subjects
Quantum Physics, Physics - Optics - Abstract
Applications of random numbers are ubiquitous. We experimentally build a well-studied quantum random number generator from homodyne measurements on the quadrature of the vacuum fluctuations. Semi-device-independence in this random number generator is usually obtained using phase modulators to shift the phase of the laser and obtain random sampling from both X and P quadrature measurements of the vacuum state in previous implementations. We characterize the experimental parameters for optimal performance of this source-device-independent quantum random number generator by measuring the two quadratures concurrently using two homodyne detectors. We also study the influence of these parameters on randomness, which can be extracted based on Shannon entropy and von Neumann entropy, which correspond to an eavesdropper listening to classical and quantum side information, respectively.
- Published
- 2024
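The classical-side-information part of the randomness quantification can be sketched as a histogram-based Shannon entropy estimate over digitized quadrature samples. This is a simplification of the paper's analysis (the von Neumann / quantum-side-information bound requires the quantum state, not just the empirical distribution):

```python
import numpy as np

def shannon_entropy_bits(samples, n_bins=256):
    # Empirical Shannon entropy (bits per sample) of digitized
    # quadrature measurements; n_bins models the ADC resolution.
    counts, _ = np.histogram(samples, bins=n_bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A flat distribution over 256 bins approaches the 8-bit maximum; any structure in the quadrature data lowers the extractable randomness.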
15. Real-time Speech Enhancement on Raw Signals with Deep State-space Modeling
- Author
-
Pei, Yan Ru, Shrivastava, Ritik, and Sidharth, FNU
- Subjects
Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network's performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, the model maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments. Code is available at github.com/Brainchip-Inc/aTENNuate, Comment: 7 pages, 2 figures
- Published
- 2024
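The linear recurrence at the core of a state-space layer looks like the following sketch. aTENNuate stacks learned, nonlinear variants of this block over raw waveform samples, so the parameters here are purely illustrative:

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """One linear state-space layer applied to a raw signal u:
    x_{k+1} = A x_k + B u_k,  y_k = C x_k."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B * u_k    # state update driven by one raw sample
        ys.append(C @ x)       # linear read-out
    return np.array(ys)
```

The recurrence runs in constant memory per sample, which is what makes this family of models attractive for real-time, low-latency enhancement.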
16. CONClave -- Secure and Robust Cooperative Perception for CAVs Using Authenticated Consensus and Trust Scoring
- Author
-
Andert, Edward, Mendoza, Francis, Behrens, Hans Walter, and Shrivastava, Aviral
- Subjects
Computer Science - Robotics, Computer Science - Cryptography and Security, Computer Science - Multiagent Systems - Abstract
Connected Autonomous Vehicles have great potential to improve automobile safety and traffic flow, especially in cooperative applications where perception data is shared between vehicles. However, this cooperation must be secured from malicious intent and unintentional errors that could cause accidents. Previous works typically address singular security or reliability issues for cooperative driving in specific scenarios rather than the set of errors together. In this paper, we propose CONClave, a tightly coupled authentication, consensus, and trust scoring mechanism that provides comprehensive security and reliability for cooperative perception in autonomous vehicles. CONClave benefits from the pipelined nature of the steps such that faults can be detected significantly faster and with less compute. Overall, CONClave shows huge promise in preventing security flaws, detecting even relatively minor sensing faults, and increasing the robustness and accuracy of cooperative perception in CAVs while adding minimal overhead., Comment: 6 pages, 6 figures, Design Automation Conference June 2024
- Published
- 2024
17. GeoAI in resource-constrained environments
- Author
-
Böhlen, Marc, Sughiarta, Gede, Kurnianingsih, Atiek, Gopaladinne, Srikar Reddy, Shrivastava, Sujay, and Gorla, Hemanth Kumar Reddy
- Subjects
Computer Science - Computers and Society - Abstract
This paper describes spatially aware Artificial Intelligence, GeoAI, tailored for small organizations such as NGOs in resource-constrained contexts where access to large datasets, expensive compute infrastructure, and AI expertise may be restricted. We furthermore consider future scenarios in which resource-intensive, large geospatial models may homogenize the representation of complex landscapes, and suggest strategies to prepare for this condition., Comment: 8 pages, 6 figures
- Published
- 2024
18. VLM-KD: Knowledge Distillation from VLM for Long-Tail Visual Recognition
- Author
-
Zhang, Zaiwei, Meyer, Gregory P., Lu, Zhichao, Shrivastava, Ashish, Ravichandran, Avinash, and Wolff, Eric M.
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
For visual recognition, knowledge distillation typically involves transferring knowledge from a large, well-trained teacher model to a smaller student model. In this paper, we introduce an effective method to distill knowledge from an off-the-shelf vision-language model (VLM), demonstrating that it provides novel supervision in addition to those from a conventional vision-only teacher model. Our key technical contribution is the development of a framework that generates novel text supervision and distills free-form text into a vision encoder. We showcase the effectiveness of our approach, termed VLM-KD, across various benchmark datasets, showing that it surpasses several state-of-the-art long-tail visual classifiers. To our knowledge, this work is the first to utilize knowledge distillation with text supervision generated by an off-the-shelf VLM and apply it to vanilla randomly initialized vision encoders.
- Published
- 2024
19. Knowledge-Aware Reasoning over Multimodal Semi-structured Tables
- Author
-
Mathur, Suyash Vardhan, Bafna, Jainit Sushil, Kartik, Kunal, Khandelwal, Harshita, Shrivastava, Manish, Gupta, Vivek, Bansal, Mohit, and Roth, Dan
- Subjects
Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition - Abstract
Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual content in tables. With the evolution of AI models capable of multimodal reasoning, it is pertinent to assess their efficacy in handling such structured data. This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We explore their ability to reason on tables that integrate both images and text, introducing MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs, understanding visual context, and comparing visual content across images. These findings establish our dataset as a robust benchmark for advancing AI's comprehension and capabilities in analyzing multimodal structured data.
- Published
- 2024
20. CoDi: Conversational Distillation for Grounded Question Answering
- Author
-
Huber, Patrick, Einolghozati, Arash, Conway, Rylan, Narang, Kanika, Smith, Matt, Nayyar, Waqar, Sagar, Adithya, Aly, Ahmed, and Shrivastava, Akshat
- Subjects
Computer Science - Computation and Language, Computer Science - Artificial Intelligence - Abstract
Distilling conversational skills into Small Language Models (SLMs) with approximately 1 billion parameters presents significant challenges. Firstly, SLMs have limited capacity in their model parameters to learn extensive knowledge compared to larger models. Secondly, high-quality conversational datasets are often scarce, small, and domain-specific. Addressing these challenges, we introduce a novel data distillation framework named CoDi (short for Conversational Distillation, pronounced "Cody"), allowing us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner. Specifically, while our framework is task agnostic at its core, we explore and evaluate the potential of CoDi on the task of conversational grounded reasoning for question answering. This is a typical on-device scenario for specialist SLMs, allowing for open-domain model responses, without requiring the model to "memorize" world knowledge in its limited weights. Our evaluations show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics. Additionally, when using our framework to generate larger datasets from web data, our models surpass larger, instruction-tuned models in zero-shot conversational grounded reasoning tasks., Comment: 13 pages
- Published
- 2024
21. DSP-MLIR: A MLIR Dialect for Digital Signal Processing
- Author
-
Kumar, Abhinav, Khedkar, Atharva, and Shrivastava, Aviral
- Subjects
Electrical Engineering and Systems Science - Signal Processing, Computer Science - Computation and Language - Abstract
Traditional Digital Signal Processing (DSP) compilers work at a low level (C level / assembly level) and hence lose many of the optimization opportunities present at the high (domain) level. The emerging multi-level compiler infrastructure MLIR (Multi-Level Intermediate Representation) allows optimizations to be specified at a higher level. In this paper, we utilize the MLIR framework to introduce a DSP dialect, perform domain-specific optimizations at the dialect (high) level, and show the usefulness of these optimizations on sample DSP apps. In particular, we develop a compiler for DSP and a DSL (Domain-Specific Language) to ease the development of apps. We show an improvement in execution time of up to 10x for these sample apps, which would have been difficult if the IR were at the C/affine level.
- Published
- 2024
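One example of the kind of domain-level rewrite a DSP dialect makes cheap: folding cascaded gain operations into one, gain(a) followed by gain(b) becoming gain(a*b). At the dialect level this is a single pattern match; after lowering to C/affine loops the same fact is much harder to recover. The sketch below works on a toy op list, not actual MLIR syntax:

```python
def fuse_gains(ops):
    # Peephole rewrite over a toy DSP op list: adjacent ("gain", a) and
    # ("gain", b) ops fold into a single ("gain", a * b).
    out = []
    for op in ops:
        if op[0] == "gain" and out and out[-1][0] == "gain":
            out[-1] = ("gain", out[-1][1] * op[1])
        else:
            out.append(op)
    return out
```

Real MLIR rewrites are expressed as patterns over dialect operations, but the payoff is the same: the optimization reads off the domain semantics directly.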
22. Latent-INR: A Flexible Framework for Implicit Representations of Videos with Discriminative Semantics
- Author
-
Maiya, Shishira R, Gupta, Anubhav, Gwilliam, Matthew, Ehrlich, Max, and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Implicit Neural Representations (INRs) have emerged as powerful representations to encode all forms of data, including images, videos, audio, and scenes. For video, many INRs have been proposed for the compression task, and recent methods feature significant improvements in encoding time, storage, and reconstruction quality. However, these encoded representations lack semantic meaning, so they cannot be used for any downstream tasks that require such properties, such as retrieval. This can act as a barrier to adoption of video INRs over traditional codecs, as they do not offer any significant edge apart from compression. To alleviate this, we propose a flexible framework that decouples the spatial and temporal aspects of the video INR. We accomplish this with a dictionary of per-frame latents that are learned jointly with a set of video-specific hypernetworks, such that given a latent, these hypernetworks can predict the INR weights to reconstruct the given frame. This framework not only retains the compression efficiency, but the learned latents can be aligned with features from large vision models, which grants them discriminative properties. We align these latents with CLIP and show good performance for both compression and video retrieval tasks. By aligning with VideoLlama, we are able to perform open-ended chat with our learned latents as the visual inputs. Additionally, the learned latents serve as a proxy for the underlying weights, allowing us to perform tasks like video interpolation. These semantic properties and applications, coexisting with the ability to perform compression, interpolation, and super-resolution, are a first in this field of work., Comment: equal contribution for first two authors; accepted to ECCV2024; 14 pages, 4 tables, 10 figures in main paper, supplementary after bibliography
- Published
- 2024
23. IncidentNet: Traffic Incident Detection, Localization and Severity Estimation with Sparse Sensing
- Author
-
Peddiraju, Sai Shashank, Harapanahalli, Kaustubh, Andert, Edward, and Shrivastava, Aviral
- Subjects
Computer Science - Machine Learning, Computer Science - Artificial Intelligence - Abstract
Prior art in traffic incident detection relies on high sensor coverage and is primarily based on decision-tree and random forest models that have limited representation capacity and, as a result, cannot detect incidents with high accuracy. This paper presents IncidentNet - a novel approach for classifying, localizing, and estimating the severity of traffic incidents using deep learning models trained on data captured from sparsely placed sensors in urban environments. Our model works on microscopic traffic data that can be collected using cameras installed at traffic intersections. Due to the unavailability of datasets that provide microscopic traffic details and traffic incident details simultaneously, we also present a methodology to generate a synthetic microscopic traffic dataset that matches given macroscopic traffic data. IncidentNet achieves a traffic incident detection rate of 98%, with false alarm rates of less than 7% in 197 seconds on average in urban environments with cameras on less than 20% of the traffic intersections., Comment: 6 pages, 6 figures, 2024 IEEE 27th International Conference on Intelligent Transportation Systems (ITSC)
- Published
- 2024
24. InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models
- Author
-
Saini, Nirat, Bodla, Navaneeth, Shrivastava, Ashish, Ravichandran, Avinash, Zhang, Xiao, Shrivastava, Abhinav, and Singh, Bharat
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We introduce InVi, an approach for inserting or replacing objects within videos (referred to as inpainting) using off-the-shelf, text-to-image latent diffusion models. Unlike existing video editing methods that focus on comprehensive re-styling or entire scene alterations, InVi targets controlled manipulation of objects, blending them seamlessly into a background video. To achieve this goal, we tackle two key challenges. First, for high-quality control and blending, we employ a two-step process involving inpainting and matching. This process begins with inserting the object into a single frame using a ControlNet-based inpainting diffusion model, and then generates subsequent frames conditioned on features from the inpainted frame, which serves as an anchor to minimize the domain gap between the background and the object. Second, to ensure temporal coherence, we replace the diffusion model's self-attention layers with extended-attention layers. The anchor frame features serve as the keys and values for these layers, enhancing consistency across frames. Our approach removes the need for video-specific fine-tuning, presenting an efficient and adaptable solution. Experimental results demonstrate that InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.
- Published
- 2024
25. A Personal Journey of Studying Positive Psychology: Reflections of Undergraduate Students in the United Arab Emirates
- Author
-
Anita Shrivastava, Humna Azhar, and Lynda Hyland
- Abstract
Background: An increasing number of undergraduate positive psychology courses offer students a holistic view of the broader discipline of psychology. Even short-term participation in positive psychology activities as part of a taught course may improve psychological well-being and lower stress. However, there is a dearth of qualitative evidence on how students experience this learning process. Objective: This study aimed to explore UAE-based undergraduate students' reflections on their experiences of an elective positive psychology course and their participation in various positive psychology interventions (PPIs). Method: This qualitative study explored 21 UAE-based undergraduate students' reflections on taking a semester-long positive psychology course, in which they participated in PPIs. The rich data from semi-structured interviews were analyzed using reflexive thematic analysis. Results: Three main themes emerged, namely "rethinking positive psychology," "changes in perspective on happiness and search for positivity," and "enhanced relationships." Conclusion and Teaching Implications: The study suggests that positive psychology may reach past the time and space of the taught course and have at least a short-term positive impact on students' mental and social lives. Findings from this study imply the potential of positive psychology in higher education and point towards further integration of such courses in undergraduate programs in the UAE and beyond.
- Published
- 2024
- Full Text
- View/download PDF
26. Evidence for the general dominance of proton shells in low-energy fission
- Author
-
K. Mahata, C. Schmitt, Shilpi Gupta, A. Shrivastava, G. Scamps, and K.-H. Schmidt
- Subjects
Physics ,QC1-999 - Abstract
A regular pattern, revealing the leading role of the light-fragment nuclear charge, is found to emerge from a consistent analysis of the experimental information collected recently on low-energy asymmetric fission of neutron-deficient nuclei around lead. The observation is corroborated by a theoretical investigation within a microscopic framework, suggesting the importance of proton configurations driven by quadrupole-octupole correlations. This is in contrast to the earlier theoretical interpretations in terms of dominant neutron shells. The survey of a wider area of the nuclear chart by a semi-empirical approach points to the lack of a quantitative understanding of the competition between the different underlying macroscopic and microscopic forces. Combined with previously identified stabilizing forces, the present finding shows a striking connection between the “old” (actinide) and “new” (pre-actinide) islands of asymmetric fission, which could steer the search for a unified theory of fission.
- Published
- 2022
- Full Text
- View/download PDF
27. MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
- Author
-
Lin, Xi Victoria, Shrivastava, Akshat, Luo, Liang, Iyer, Srinivasan, Lewis, Mike, Ghosh, Gargi, Zettlemoyer, Luke, and Aghajanyan, Armen
- Subjects
Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems., Comment: v2 -> update related work section v3 -> fix spelling
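The core mechanism in the MoMa abstract, partitioning experts into modality-specific groups and routing only within a group, can be mimicked with a toy dispatcher. Everything here (expert names, the hash-based stand-in for a learned router) is illustrative, not the paper's implementation.

```python
# toy modality-aware expert routing: experts are partitioned into
# modality-specific groups, and routing happens only within a group
TEXT_EXPERTS = ["text_expert_%d" % i for i in range(4)]
IMAGE_EXPERTS = ["image_expert_%d" % i for i in range(4)]

def route(token):
    """Dispatch a (modality, payload) token to one expert in its group."""
    modality, payload = token
    group = TEXT_EXPERTS if modality == "text" else IMAGE_EXPERTS
    # stand-in for a learned router: pick an expert within the group only,
    # so text tokens can never land on an image expert and vice versa
    return group[hash(payload) % len(group)]

mixed_sequence = [("text", "a"), ("image", "patch_0"), ("text", "cat")]
chosen = [route(tok) for tok in mixed_sequence]
assert chosen[0] in TEXT_EXPERTS and chosen[1] in IMAGE_EXPERTS
```

The point of the partition is exactly what the abstract claims: parameters are allocated per modality, while the (learned) routing inside each group preserves adaptivity.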
- Published
- 2024
28. Trajectory-aligned Space-time Tokens for Few-shot Action Recognition
- Author
-
Kumar, Pulkit, Padmanabhan, Namitha, Luo, Luke, Rambhatla, Sai Saketh, and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We propose a simple yet effective approach for few-shot action recognition, emphasizing the disentanglement of motion and appearance representations. By harnessing recent progress in tracking, specifically point trajectories and self-supervised representation learning, we build trajectory-aligned tokens (TATs) that capture motion and appearance information. This approach significantly reduces the data requirements while retaining essential information. To process these representations, we use a Masked Space-time Transformer that effectively learns to aggregate information to facilitate few-shot action recognition. We demonstrate state-of-the-art results on few-shot action recognition across multiple datasets. Our project page is available at https://www.cs.umd.edu/~pulkit/tats, Comment: ECCV 2024
- Published
- 2024
29. WayEx: Waypoint Exploration using a Single Demonstration
- Author
-
Levy, Mara, Saini, Nirat, and Shrivastava, Abhinav
- Subjects
Computer Science - Robotics ,Computer Science - Artificial Intelligence - Abstract
We propose WayEx, a new method for learning complex goal-conditioned robotics tasks from a single demonstration. Our approach distinguishes itself from existing imitation learning methods by demanding fewer expert examples and eliminating the need for information about the actions taken during the demonstration. This is accomplished by introducing a new reward function and employing a knowledge expansion technique. We demonstrate the effectiveness of WayEx, our waypoint exploration strategy, across six diverse tasks, showcasing its applicability in various environments. Notably, our method significantly reduces training time by 50% as compared to traditional reinforcement learning methods. WayEx obtains a higher reward than existing imitation learning methods given only a single demonstration. Furthermore, we demonstrate its success in tackling complex environments where standard approaches fall short. More information is available at: https://waypoint-ex.github.io., Comment: ICRA 2024
- Published
- 2024
30. SecScale: A Scalable and Secure Trusted Execution Environment for Servers
- Author
-
Sunny, Ani, Shrivastava, Nivedita, and Sarangi, Smruti R.
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Hardware Architecture - Abstract
Trusted execution environments (TEEs) are an integral part of modern secure processors. They ensure that their application and code pages are confidential, tamper-proof, and immune to diverse types of attacks. In 2021, Intel suddenly announced its plans to deprecate its most trustworthy enclave, SGX, on its 11th and 12th generation processors. The reasons stemmed from the fact that it was difficult to scale the enclaves (sandboxes) beyond 256 MB, as the hardware overheads outweighed the benefits. Competing solutions by Intel and other vendors are much more scalable, but do not provide many key security guarantees that SGX used to provide, notably replay-attack protection. In the last three years, no proposal from industry or academia has been able to provide both scalability (with a modest slowdown) and replay protection on generic hardware (to the best of our knowledge). We solve this problem by proposing SecScale, which uses new ideas centered around speculative execution (read first, verify later), creating a forest of MACs (instead of a tree of counters), and providing complete memory encryption (no generic unsecure regions). We show that we are 10% faster than the nearest competing alternative.
- Published
- 2024
31. NeuroPlug: Plugging Side-Channel Leaks in NPUs using Space Filling Curves
- Author
-
Shrivastava, Nivedita and Sarangi, Smruti R.
- Subjects
Computer Science - Cryptography and Security - Abstract
Securing deep neural networks (DNNs) from side-channel attacks is an important problem today, given the substantial investment of time and resources in acquiring the raw data and training complex models. All published countermeasures (CMs) add noise N to a signal X (a parameter of interest, such as the net memory traffic that is leaked). The adversary observes X + N; we show that it is easy to filter this noise out using targeted measurements, statistical analyses, and different kinds of reasonably-assumed side information. We present a novel CM, NeuroPlug, that is immune to these attack methodologies, mainly because we use a different formulation, CX + N. We introduce a multiplicative variable C that naturally arises from feature-map compression; it plays a key role in obfuscating the parameters of interest. Our approach is based on mapping all the computations to a 1-D space-filling curve and then performing a sequence of tiling, compression, and binning-based obfuscation operations. We follow up by proposing a theoretical framework based on Mellin transforms that allows us to accurately quantify the size of the search space as a function of the noise we add and the side information that an adversary possesses. The security guarantees provided by NeuroPlug are validated using a battery of statistical and information-theory-based tests. We also demonstrate a substantial performance enhancement of 15% compared to the closest competing work.
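The abstract's central contrast, that additive noise X + N can be averaged away while a multiplicative formulation CX + N cannot, is easy to demonstrate with a toy simulation. The signal value, noise level, and the distribution of C below are invented for illustration and carry no relation to the paper's actual parameters.

```python
import random

random.seed(0)
X = 1000.0  # the secret signal, e.g. net memory traffic

def observe_additive():
    # classic countermeasure: the adversary sees X + N (zero-mean noise)
    return X + random.gauss(0.0, 200.0)

def observe_multiplicative():
    # CX + N formulation: a multiplicative C, unknown to the adversary
    # (here drawn uniformly from [0.2, 0.8] as an illustrative choice)
    C = random.uniform(0.2, 0.8)
    return C * X + random.gauss(0.0, 200.0)

n = 100_000
avg_add = sum(observe_additive() for _ in range(n)) / n
avg_mul = sum(observe_multiplicative() for _ in range(n)) / n

# Averaging filters out additive noise and recovers X almost exactly...
assert abs(avg_add - X) < 10
# ...but under CX + N the average converges to E[C]*X = 0.5*X, not X.
assert abs(avg_mul - X) > 300
```

Repeated measurement defeats the additive scheme but, without knowledge of C's distribution, only recovers a scaled value under the multiplicative one, which is the obfuscation property the abstract claims.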
- Published
- 2024
32. LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid
- Author
-
Zhang, Tianyi and Shrivastava, Anshumali
- Subjects
Computer Science - Machine Learning - Abstract
Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising technique to reduce memory requirements and decoding latency. However, recent accurate quantization methods often depend on specialized computations or custom data formats to achieve better model quality, which limits their compatibility with popular frameworks, as they require dedicated inference kernels tailored to specific hardware and software platforms, hindering wider adoption. Furthermore, many competitive methods have high resource requirements and computational overhead, making it challenging to scale them to hundreds of billions of parameters. In response to these challenges, we propose LeanQuant (Loss-error-aware Network Quantization), a novel quantization method that is accurate, versatile, and scalable. In the existing popular iterative loss-error-based quantization framework, we identify a critical limitation in prior methods: the min-max affine quantization grid fails to preserve model quality due to outliers in inverse Hessian diagonals. To overcome this fundamental issue, we propose learning loss-error-aware grids, instead of using non-adaptive min-max affine grids. Our approach not only produces quantized models that are more accurate but also generalizes to a wider range of quantization types, including affine and non-uniform quantization, enhancing compatibility with more frameworks. Extensive empirical evaluations on recent LLMs demonstrate that LeanQuant is highly accurate, comparing favorably against recent competitive baselines in model quality, and scalable, achieving very accurate quantization of Llama-3.1 405B, one of the largest open-source LLMs to date, using two Quadro RTX 8000-48GB GPUs in 21 hours.
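The failure mode LeanQuant targets, a min-max affine grid stretched by outliers, can be reproduced on a toy weight vector. The quantizer below is the generic min-max affine scheme the abstract critiques, not LeanQuant itself, and the weight values are illustrative.

```python
def minmax_affine_quantize(ws, bits=4):
    # standard min-max affine grid: scale and zero point are derived
    # from the full value range, so extremes dictate the resolution
    lo, hi = min(ws), max(ws)
    scale = (hi - lo) / (2 ** bits - 1)
    q = [round((w - lo) / scale) for w in ws]
    return [lo + qi * scale for qi in q]  # dequantized values

weights = [0.01, -0.02, 0.03, 0.015, -0.01]
err_clean = max(abs(w - d)
                for w, d in zip(weights, minmax_affine_quantize(weights)))

# a single large outlier stretches the grid, wrecking the resolution
# available to every ordinary weight
deq = minmax_affine_quantize(weights + [8.0])
err_outlier = max(abs(w - d) for w, d in zip(weights, deq))

assert err_outlier > 10 * err_clean
```

With the outlier present, all five small weights collapse onto one or two grid points; a loss-aware grid, as the abstract proposes, would instead place levels where they reduce the loss error.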
- Published
- 2024
33. V-VIPE: Variational View Invariant Pose Embedding
- Author
-
Levy, Mara and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence - Abstract
Learning to represent a three-dimensional (3D) human pose given a two-dimensional (2D) image of a person is a challenging problem. To make the problem less ambiguous, it has become common practice to estimate 3D pose in the camera coordinate space. However, this makes the task of comparing two 3D poses difficult. In this paper, we address this challenge by separating the problem of estimating 3D pose from 2D images into two steps. We use a variational autoencoder (VAE) to find an embedding that represents 3D poses in a canonical coordinate space. We refer to this embedding as the variational view-invariant pose embedding (V-VIPE). Using V-VIPE, we can encode 2D and 3D poses and use the embedding for downstream tasks, like retrieval and classification. We can estimate 3D poses from these embeddings using the decoder, as well as generate unseen 3D poses. The variability of our encoding allows it to generalize well to unseen camera views when mapping from 2D space. To the best of our knowledge, V-VIPE is the only representation to offer this diversity of applications. Code and more information can be found at https://v-vipe.github.io/., Comment: CVPR 2024 - RHOBIN Workshop
- Published
- 2024
34. Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text
- Author
-
Bafna, Jainit Sushil, Mittal, Hardik, Sethia, Suyash, Shrivastava, Manish, and Mamidi, Radhika
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse of such texts in journalistic, educational, and academic contexts have surfaced. SemEval 2024 introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection, aiming to develop automated systems for identifying machine-generated text and detecting potential misuse. In this paper, we i) propose a RoBERTa-BiLSTM-based classifier designed to classify text into two categories, AI-generated or human, and ii) conduct a comparative study of our model with baseline approaches to evaluate its effectiveness. This paper contributes to the advancement of automatic text detection systems in addressing the challenges posed by machine-generated text misuse. Our architecture ranked 46th among 125 on the official leaderboard, with an accuracy of 80.83., Comment: SemEval-2024
- Published
- 2024
35. Search for MeV Gamma-ray emission from TeV bright red dwarfs with COMPTEL
- Author
-
Shrivastava, Niharika, Manna, Siddhant, and Desai, Shantanu
- Subjects
Astrophysics - High Energy Astrophysical Phenomena ,Astrophysics - Solar and Stellar Astrophysics - Abstract
The SHALON atmospheric Cherenkov telescope has detected very high energy gamma-ray emission at TeV energies from eight red dwarfs, namely V388 Cas, V547 Cas, V780 Tau, V962 Tau, V1589 Cyg, GJ 1078, GJ 3684, and GL 851.1. Consequently, these red dwarfs have been suggested as sources of ultra-high energy cosmic rays. In this work, we search for soft gamma-ray emission from these TeV-bright red dwarfs between 0.75-30 MeV using archival data from the COMPTEL gamma-ray imaging telescope, as a follow-up to a similar search for GeV gamma-ray emission using the Fermi-LAT telescope. Although, prima facie, we detect a non-zero photon flux from three red dwarfs with high significance, these signals can be attributed to contamination from nearby sources such as Crab and Cygnus, which are within the angular resolution of COMPTEL and have previously been detected as very bright point sources at MeV energies. Therefore, we could not detect any statistically significant signal ($>3\sigma$) from any of these eight red dwarfs from 0.75-30 MeV. We then report the 95% confidence level upper limits on the differential photon flux (at 30 MeV), integral photon flux, and integral energy flux for all eight red dwarfs. The integral energy flux limits range between $10^{-11}-10^{-10} \rm{ergs/cm^2/s}$., Comment: 10 pages, 12 figures. Accepted for publication in JCAP
- Published
- 2024
- Full Text
- View/download PDF
36. Unveiling photon-photon coupling induced transparency and absorption
- Author
-
Shrivastava, Kuldeep Kumar, Sahu, Ansuman, Bhoi, Biswanath, and Singh, Rajeev
- Subjects
Quantum Physics ,Physics - Optics - Abstract
This study presents the theoretical foundations of analogues of electromagnetically induced transparency (EIT) and absorption (EIA), which we refer to as coupling-induced transparency (CIT) and absorption (CIA) respectively, along with an exploration of the transition between these phenomena. We provide a concise phenomenological description, with analytical expressions for transmission spectra and dispersion, elucidating how the interplay of coherent and dissipative interactions in a coupled system results in the emergence of level repulsion and attraction, corresponding to CIT and CIA, respectively. The model is validated through numerical simulations using a hybrid system comprising a split-ring resonator (SRR) and an electric inductive-capacitive (ELC) resonator in planar geometry. We analyse two cases while keeping the ELC parameters constant: one involving a dynamic adjustment of the SRR size with a fixed split gap, and the other a varying gap with a constant SRR size. Notably, in the first case the dispersion profile of the transmission signal demonstrates level repulsion, while the second case results in level attraction, effectively showcasing CIT and CIA, respectively. These simulated findings not only align with the theoretical model but also underscore the versatility of our approach. Subsequently, we expand our model to a more general case, demonstrating that a controlled transition from CIT to CIA is achievable by manipulating the dissipation rate of individual modes within the hybrid system, leading to either coherent or dissipative interactions between the modes. The results provide a pathway for designing hybrid systems that can control the group velocity of light, offering potential applications in the fields of optical switching and quantum information technology.
- Published
- 2024
- Full Text
- View/download PDF
37. IDentity with Locality: An ideal hash for gene sequence search
- Author
-
Desai, Aditya, Gupta, Gaurav, Zhang, Tianyi, and Shrivastava, Anshumali
- Subjects
Computer Science - Information Retrieval - Abstract
Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). State-of-the-art systems such as the Compact bit-sliced signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BFs with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast the gene search problem as a sequence of membership problems, testing whether each subsequent substring (called a kmer) of a query Q is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and computational advantage of BFs, are also detrimental to the system performance of gene-search systems. While subsequent kmers being queried are likely very similar, RH, oblivious to any similarity, uniformly distributes the kmers to different parts of a potentially large BF, thus triggering excessive cache misses and causing system slowdown. We propose a novel hash function called the Identity with Locality (IDL) hash family, which co-locates keys that are close in input space without causing collisions. This approach ensures both cache locality and key preservation. IDL functions can be a drop-in replacement for RH functions and help improve the performance of information retrieval systems. We give a simple but practical construction of IDL function families and show that replacing RH with IDL functions reduces cache misses by a factor of 5x, thus improving the query and indexing times of SOTA methods such as COBS and RAMBO by factors of up to 2x without compromising their quality. We also provide a theoretical analysis of the false positive rate of a BF with IDL functions. Ours is the first study to bridge Locality Sensitive Hashing (LSH) and RH to obtain cache efficiency., Comment: 13 pages
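The locality-versus-randomness contrast in this abstract can be illustrated with a toy pair of hash functions. The construction below (an order-preserving base-4 encoding scaled into the table) is a simple illustrative stand-in for an identity-with-locality map, not the paper's actual IDL construction.

```python
import hashlib

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_int(kmer):
    # order-preserving base-4 encoding of a kmer
    v = 0
    for b in kmer:
        v = v * 4 + BASES[b]
    return v

def idl_like_hash(kmer, table_size, k):
    # identity-like, locality-preserving map: kmers sharing a long
    # prefix get nearby (often identical) slots in the table
    return kmer_to_int(kmer) * table_size // 4 ** k

def random_hash(kmer, table_size):
    # conventional random hash: similar kmers scatter across the table,
    # which is exactly what triggers the cache misses described above
    return int(hashlib.md5(kmer.encode()).hexdigest(), 16) % table_size

k, table = 8, 1024
a, b = "ACGTACGA", "ACGTACGC"  # differ only in the final base
# near-identical kmers land in adjacent (or the same) slots
assert abs(idl_like_hash(a, table, k) - idl_like_hash(b, table, k)) <= 1
```

A Bloom-filter probe sequence built over such a map touches a small, contiguous region of memory for a run of similar query kmers, whereas the random hash touches the whole table.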
- Published
- 2024
38. ARDuP: Active Region Video Diffusion for Universal Policies
- Author
-
Huang, Shuaiyi, Levy, Mara, Jiang, Zhenyu, Anandkumar, Anima, Zhu, Yuke, Fan, Linxi, Huang, De-An, and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Robotics - Abstract
Sequential decision-making can be formulated as a text-conditioned video generation problem, where a video planner, guided by a text-defined goal, generates future frames visualizing planned actions, from which control actions are subsequently derived. In this work, we introduce Active Region Video Diffusion for Universal Policies (ARDuP), a novel framework for video-based policy learning that emphasizes the generation of active regions, i.e. potential interaction areas, enhancing the conditional policy's focus on interactive areas critical for task execution. This innovative framework integrates active region conditioning with latent diffusion models for video planning and employs latent representations for direct action decoding during inverse dynamic modeling. By utilizing motion cues in videos for automatic active region discovery, our method eliminates the need for manual annotations of active regions. We validate ARDuP's efficacy via extensive experiments on simulator CLIPort and the real-world dataset BridgeData v2, achieving notable improvements in success rates and generating convincingly realistic video plans.
- Published
- 2024
39. Composing Object Relations and Attributes for Image-Text Matching
- Author
-
Pham, Khoi, Huynh, Chuong, Lim, Ser-Nam, and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
We study the visual semantic embedding problem for image-text matching. Most existing work utilizes a tailored cross-attention mechanism to perform local alignment across the image and text modalities. This is computationally expensive, even though it is more powerful than the unimodal dual-encoder approach. This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges. Utilizing a graph attention network, our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system. Representing the caption as a scene graph offers the ability to utilize the strong relational inductive bias of graph neural networks to learn object-attribute and object-object relations effectively. To train the model, we propose losses that align the image and caption both at the holistic level (image-caption) and the local level (image-object entity), which we show is key to the success of the model. Our model is termed the Composition model for Object Relations and Attributes, CORA. Experimental results on two prominent image-text retrieval benchmarks, Flickr30K and MSCOCO, demonstrate that CORA outperforms existing state-of-the-art computationally expensive cross-attention methods in recall score while achieving the fast computation speed of the dual encoder., Comment: Accepted to CVPR'24
- Published
- 2024
40. AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models
- Author
-
Wu, Xiyang, Guan, Tianrui, Li, Dianqi, Huang, Shuaiyi, Liu, Xiaoyu, Wang, Xijun, Xian, Ruiqi, Shrivastava, Abhinav, Huang, Furong, Boyd-Graber, Jordan Lee, Zhou, Tianyi, and Manocha, Dinesh
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Computation and Language - Abstract
Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module to produce overconfident and incorrect reasoning about abnormal or hypothetical objects. While some benchmarks have been developed to investigate LVLM hallucinations, they often rely on hand-crafted corner cases whose failure patterns may not generalize well. Additionally, fine-tuning on these examples could undermine their validity. To address this, we aim to scale up the number of cases through an automated approach, reducing human bias in crafting such corner cases. This motivates the development of AutoHallusion, the first automated benchmark generation approach that employs several key strategies to create a diverse range of hallucination examples. Our generated visual-question pairs pose significant challenges to LVLMs, requiring them to overcome contextual biases and distractions to arrive at correct answers. AutoHallusion enables us to create new benchmarks at the minimum cost and thus overcomes the fragility of hand-crafted benchmarks. It also reveals common failure patterns and reasons, providing key insights to detect, avoid, or control hallucinations. Comprehensive evaluations of top-tier LVLMs, e.g., GPT-4V(ision), Gemini Pro Vision, Claude 3, and LLaVA-1.5, show a 97.7% and 98.7% success rate of hallucination induction on synthetic and real-world datasets of AutoHallusion, paving the way for a long battle against hallucinations. The codebase and data can be accessed at https://github.com/wuxiyang1996/AutoHallusion.
- Published
- 2024
41. GenMM: Geometrically and Temporally Consistent Multimodal Data Generation for Video and LiDAR
- Author
-
Singh, Bharat, Kulharia, Viveka, Yang, Luyu, Ravichandran, Avinash, Tyagi, Ambrish, and Shrivastava, Ashish
- Subjects
Computer Science - Computer Vision and Pattern Recognition ,Computer Science - Artificial Intelligence ,Computer Science - Machine Learning - Abstract
Multimodal synthetic data generation is crucial in domains such as autonomous driving, robotics, augmented/virtual reality, and retail. We propose a novel approach, GenMM, for jointly editing RGB videos and LiDAR scans by inserting temporally and geometrically consistent 3D objects. Our method uses a reference image and 3D bounding boxes to seamlessly insert and blend new objects into target videos. We inpaint the 2D Regions of Interest (consistent with the 3D boxes) using a diffusion-based video inpainting model. We then compute the semantic boundaries of the object and estimate its surface depth using state-of-the-art semantic segmentation and monocular depth estimation techniques. Subsequently, we employ a geometry-based optimization algorithm to recover the 3D shape of the object's surface, ensuring it fits precisely within the 3D bounding box. Finally, LiDAR rays intersecting with the new object surface are updated to reflect depths consistent with its geometry. Our experiments demonstrate the effectiveness of GenMM in inserting various 3D objects across video and LiDAR modalities.
- Published
- 2024
42. Simulations of distributed-phase-reference quantum key distribution protocols
- Author
-
Abhignan, Venkat, Jamunkar, Abhishek, Nair, Gokul, Mittal, Mohit, and Shrivastava, Megha
- Subjects
Quantum Physics - Abstract
Quantum technology can enable secure communication for cryptography purposes using quantum key distribution. Quantum key distribution protocols provide a secret key between two users with security guaranteed by the laws of quantum mechanics. To define the proper implementation of a quantum key distribution system using a particular cryptography protocol, it is crucial to critically and meticulously assess the device's performance, owing to technological limitations in the components used. We perform simulations on the ANSYS Interconnect platform to characterise the practical implementation of these devices using the distributed-phase-reference protocols differential-phase-shift and coherent-one-way quantum key distribution. Further, we briefly describe and simulate some possible eavesdropping attempts, namely the backflash attack, the trojan-horse attack, and the detector-blinding attack, which exploit device imperfections.
- Published
- 2024
- Full Text
- View/download PDF
43. PRoDeliberation: Parallel Robust Deliberation for End-to-End Spoken Language Understanding
- Author
-
Le, Trang, Lazar, Daniel, Kim, Suyoun, Jiang, Shan, Le, Duc, Sagar, Adithya, Livshits, Aleksandr, Aly, Ahmed, and Shrivastava, Akshat
- Subjects
Computer Science - Computation and Language ,Computer Science - Sound ,Electrical Engineering and Systems Science - Audio and Speech Processing - Abstract
Spoken Language Understanding (SLU) is a critical component of voice assistants; it consists of converting speech to semantic parses for task execution. Previous works have explored end-to-end models to improve the quality and robustness of SLU models with Deliberation; however, these models have remained autoregressive, resulting in higher latencies. In this work we introduce PRoDeliberation, a novel method leveraging a Connectionist Temporal Classification-based decoding strategy as well as a denoising objective to train robust non-autoregressive deliberation models. We show that PRoDeliberation achieves the latency reduction of parallel decoding (a 2-10x improvement over autoregressive models) while retaining the ability of autoregressive deliberation systems to correct Automatic Speech Recognition (ASR) mistranscriptions. We further show that the design of the denoising training allows PRoDeliberation to overcome the limitations of small ASR devices, and we provide an analysis of the necessity of each component of the system.
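The Connectionist Temporal Classification (CTC) decoding mentioned in the abstract rests on a simple collapse rule that makes fully parallel, non-autoregressive emission possible. The sketch below shows only that standard, generic CTC rule, not the paper's deliberation model.

```python
BLANK = "_"

def ctc_greedy_collapse(frame_labels):
    # standard CTC collapse: merge consecutive repeats, then drop blanks;
    # blanks let the model separate genuine doubled characters
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return "".join(out)

# a non-autoregressive decoder emits one label per audio frame in a
# single parallel pass; collapsing the frame labels yields the output
assert ctc_greedy_collapse(list("__hh_ee_lll_lo__")) == "hello"
assert ctc_greedy_collapse(list("aa__aa")) == "aa"
```

Because every frame's label is predicted independently, there is no left-to-right dependency chain, which is the source of the latency advantage over autoregressive decoding that the abstract reports.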
- Published
- 2024
44. UVIS: Unsupervised Video Instance Segmentation
- Author
-
Huang, Shuaiyi, Suri, Saksham, Gupta, Kamal, Rambhatla, Sai Saketh, Lim, Ser-nam, and Shrivastava, Abhinav
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
Video instance segmentation requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel unsupervised video instance segmentation framework that can perform video instance segmentation without any video annotations or dense label-based pretraining. Our key insight comes from leveraging the dense shape prior from the self-supervised vision foundation model DINO and the open-set recognition ability from the image-caption-supervised vision-language model CLIP. Our UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design. This design includes a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YoutubeVIS-2019, YoutubeVIS-2021, and Occluded VIS. Our UVIS achieves 21.1 AP on YoutubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework., Comment: CVPR2024 Workshop
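The query-based tracking step can be illustrated with a minimal memory-bank sketch (an illustrative greedy cosine matcher under assumed embeddings, threshold, and update rule, not the paper's tracking memory bank):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def assign_tracks(memory, queries, thresh=0.6):
    """Match each current-frame query embedding to the most similar track
    embedding in the memory bank; below-threshold queries open new tracks.
    Matched memory entries are updated with a running average."""
    ids = []
    for q in queries:
        sims = [cosine(m, q) for m in memory]
        if sims and max(sims) >= thresh:
            j = int(np.argmax(sims))
            memory[j] = 0.9 * memory[j] + 0.1 * q   # refresh the track embedding
            ids.append(j)
        else:
            memory.append(q.copy())                  # start a new track
            ids.append(len(memory) - 1)
    return ids

mem = []
ids0 = assign_tracks(mem, [np.array([1.0, 0.0]), np.array([0.0, 1.0])])
ids1 = assign_tracks(mem, [np.array([0.95, 0.05])])
# ids0 -> [0, 1]; ids1 -> [0] (the query is re-identified as the first track)
```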
- Published
- 2024
45. PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs
- Author
-
Hou, Charlie, Shrivastava, Akshat, Zhan, Hongyuan, Conway, Rylan, Le, Trang, Sagar, Adithya, Fanti, Giulia, and Lazar, Daniel
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Computation and Language ,Computer Science - Cryptography and Security ,Computer Science - Distributed, Parallel, and Cluster Computing - Abstract
On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, on-device training has several drawbacks: (1) most user devices are too small to train large models on-device, (2) on-device training is communication- and computation-intensive, and (3) on-device training can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text., Comment: ICML 2024 (Oral). Latest revision corrects a discussion on concurrent work arXiv:2403.01749. We described their work as reliant on using closed-sourced models when in reality they also evaluate and use open source models. This has been corrected in this version
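The Private Evolution family of methods that PrE-Text builds on privatizes a nearest-neighbor vote histogram with the Gaussian mechanism. A minimal sketch (illustrative only; the function name and vote layout are assumptions, and in practice sigma must be calibrated to the target (epsilon, delta) budget):

```python
import numpy as np

def dp_noisy_histogram(votes, num_bins, sigma, rng=None):
    """Add Gaussian noise to a per-sample nearest-neighbor vote histogram.
    Each private sample casts exactly one vote, so the L2 sensitivity of the
    histogram is 1; sigma is then chosen for the desired (epsilon, delta)."""
    rng = np.random.default_rng(rng)
    hist = np.bincount(votes, minlength=num_bins).astype(float)
    return hist + rng.normal(0.0, sigma, size=num_bins)

# 6 private samples each vote for the synthetic candidate closest to them:
votes = [2, 2, 0, 1, 2, 1]
noisy = dp_noisy_histogram(votes, num_bins=4, sigma=0.5, rng=0)
# Candidates with the highest noisy counts are kept and mutated/re-sampled.
```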
- Published
- 2024
46. Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents
- Author
-
Lee, Andrew H., Semnani, Sina J., Castillo-López, Galo, de Chalendar, Gäel, Choudhury, Monojit, Dua, Ashna, Kavitha, Kapil Rajesh, Kim, Sungkyun, Kodali, Prashant, Kumaraguru, Ponnurangam, Lombard, Alexis, Moradshahi, Mehrad, Park, Gihyun, Semmar, Nasredine, Seo, Jiwon, Shen, Tianhao, Shrivastava, Manish, Xiong, Deyi, and Lam, Monica S.
- Subjects
Computer Science - Computation and Language - Abstract
Creating multilingual task-oriented dialogue (TOD) agents is challenging due to the high cost of training data acquisition. Following the research trend of improving training data efficiency, we show for the first time that in-context learning is sufficient to tackle multilingual TOD. To handle the challenging dialogue state tracking (DST) subtask, we break it down into simpler steps that are more compatible with in-context learning, where only a handful of few-shot examples are used. We test our approach on the multilingual TOD dataset X-RiSAWOZ, which has 12 domains in Chinese, English, French, Korean, Hindi, and code-mixed Hindi-English. Our turn-by-turn DST accuracy on the 6 languages ranges from 55.6% to 80.3%, seemingly worse than the SOTA results from fine-tuned models, which range from 60.7% to 82.8%; our BLEU scores in the response generation (RG) subtask are also significantly lower than SOTA. However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages. This leads us to conclude that current automatic metrics heavily underestimate the effectiveness of in-context learning.
- Published
- 2024
47. DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning
- Author
-
Mathur, Suyash Vardhan, Jindal, Akshett Rai, and Shrivastava, Manish
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
While significant work has been done in the field of NLP on vertical thinking, which involves primarily logical thinking, little work has been done on lateral thinking, which involves looking at problems from an unconventional perspective and defying existing conceptions and notions. In this direction, SemEval 2024 introduces the BRAINTEASER task, which involves two types of questions, Sentence Puzzles and Word Puzzles, that defy conventional common-sense reasoning and constraints. In this paper, we tackle both types of questions using few-shot prompting on GPT-3.5 and gain insights regarding the difference in the nature of the two types. Our prompting strategy placed us 26th on the leaderboard for the Sentence Puzzle task and 15th on the Word Puzzle task.
- Published
- 2024
48. From Human Judgements to Predictive Models: Unravelling Acceptability in Code-Mixed Sentences
- Author
-
Kodali, Prashant, Goel, Anmol, Asapu, Likhith, Bonagiri, Vamshi Krishna, Govil, Anirudh, Choudhury, Monojit, Shrivastava, Manish, and Kumaraguru, Ponnurangam
- Subjects
Computer Science - Computation and Language ,Computer Science - Artificial Intelligence - Abstract
Current computational approaches for analysing or generating code-mixed sentences do not explicitly model "naturalness" or "acceptability" of code-mixed sentences, but rely on training corpora to reflect the distribution of acceptable code-mixed sentences. Modelling human judgement for the acceptability of code-mixed text can help in distinguishing natural code-mixed text and enable quality-controlled generation of code-mixed text. To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text. Cline is the largest of its kind with 16,642 sentences, consisting of samples sourced from two sources: synthetically generated code-mixed text and samples collected from online social media. Our analysis establishes that popular code-mixing metrics such as CMI, Number of Switch Points, and Burstiness, which are used to filter/curate/compare code-mixed corpora, have low correlation with human acceptability judgements, underlining the necessity of our dataset. Experiments using Cline demonstrate that simple Multilayer Perceptron (MLP) models trained solely on code-mixing metrics are outperformed by fine-tuned pre-trained Multilingual Large Language Models (MLLMs). Specifically, XLM-Roberta and Bernice outperform IndicBERT across different configurations in challenging data settings. Comparison with ChatGPT's zero- and few-shot capabilities shows that MLLMs fine-tuned on larger data outperform ChatGPT, providing scope for improvement in code-mixed tasks. Zero-shot transfer from English-Hindi to English-Telugu acceptability judgments using our model checkpoints proves superior to random baselines, enabling application to other code-mixed language pairs and providing further avenues of research. We publicly release our human-annotated dataset, trained checkpoints, code-mix corpus, and code for data generation and model training.
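For reference, the Code-Mixing Index (CMI) mentioned above is conventionally computed from token-level language tags; a sketch of the standard Das and Gambäck formulation (the tag set and example sentence here are hypothetical):

```python
from collections import Counter

def cmi(lang_tags):
    """Code-Mixing Index: 100 * (1 - max_lang / (n - u)), where n is the total
    token count, u the count of language-independent tokens ("univ"), and
    max_lang the token count of the dominant language. 0 means monolingual."""
    n = len(lang_tags)
    u = sum(1 for t in lang_tags if t == "univ")
    if n == u:
        return 0.0
    counts = Counter(t for t in lang_tags if t != "univ")
    return 100.0 * (1 - max(counts.values()) / (n - u))

# n=6 tokens, u=1 language-independent, dominant language "hi" with 3 tokens:
tags = ["en", "hi", "hi", "en", "univ", "hi"]
print(round(cmi(tags), 2))  # 40.0
```

Metrics like this capture only how evenly languages are mixed, which is one reason they can correlate poorly with human acceptability judgements.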
- Published
- 2024
49. KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
- Author
-
Zhang, Tianyi, Yi, Jonah, Xu, Zhaozhuo, and Shrivastava, Anshumali
- Subjects
Computer Science - Machine Learning - Abstract
Efficient deployment of Large Language Models (LLMs) requires batching multiple requests together to improve throughput. As the batch size, context length, or model size increases, the size of the key and value (KV) cache can quickly become the main contributor to GPU memory usage and the bottleneck of inference latency. Quantization has emerged as an effective technique for KV cache compression, but existing methods still fail at very low bit widths. We observe that distinct channels of a key/value activation embedding are highly inter-dependent, and the joint entropy of multiple channels grows at a slower rate than the sum of their marginal entropies. Based on this insight, we propose Coupled Quantization (CQ), which couples multiple key/value channels together to exploit their inter-dependency and encode the activations in a more information-efficient manner. Extensive experiments reveal that CQ outperforms or is competitive with existing baselines in preserving model quality. Furthermore, we demonstrate that CQ can preserve model quality with KV cache quantized down to 1-bit.
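The core idea of Coupled Quantization, sharing one codebook across a group of inter-dependent channels, can be sketched with plain k-means over channel pairs (an illustrative toy, not the paper's implementation; the group size, bit width, and Lloyd loop are assumptions):

```python
import numpy as np

def coupled_quantize(kv, group=2, bits_per_channel=1, iters=20, seed=0):
    """Sketch of coupled quantization: split activations of shape (n, d) into
    groups of `group` channels and run k-means with 2**(group*bits_per_channel)
    centroids over each group jointly, so the bit budget is shared across the
    coupled channels instead of being spent per channel independently."""
    rng = np.random.default_rng(seed)
    n, d = kv.shape
    k = 2 ** (group * bits_per_channel)
    codes, books = [], []
    for g in range(0, d, group):
        x = kv[:, g:g + group]                      # (n, group) joint vectors
        cent = x[rng.choice(n, size=k, replace=False)]
        for _ in range(iters):                      # plain Lloyd iterations
            assign = np.argmin(((x[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                if (assign == c).any():
                    cent[c] = x[assign == c].mean(0)
        codes.append(assign)                        # one small code per group
        books.append(cent)
    return codes, books

x = np.random.default_rng(1).normal(size=(256, 4))
codes, books = coupled_quantize(x)  # 4 channels -> 2 groups of 4 centroids each
```

With `group=2` and 1 bit per channel, each pair of channels is stored as a single 2-bit code, which is why exploiting inter-channel dependency is more information-efficient than quantizing each channel to 1 bit on its own.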
- Published
- 2024
50. Bayesian optimization for stable properties amid processing fluctuations in sputter deposition
- Author
-
Shrivastava, Ankit, Kalaswad, Matias, Custer, Joyce O., Adams, David P., and Najm, Habib N.
- Subjects
Condensed Matter - Materials Science ,Computer Science - Machine Learning ,Mathematics - Optimization and Control - Abstract
We introduce a Bayesian optimization approach to guide the sputter deposition of molybdenum thin films, aiming to achieve desired residual stress and sheet resistance while minimizing susceptibility to stochastic fluctuations during deposition. Thin films are pivotal in numerous technologies, including semiconductors and optical devices, where their properties are critical. Sputter deposition parameters, such as deposition power, vacuum chamber pressure, and working distance, influence physical properties like residual stress and resistance. Excessive stress and high resistance can impair device performance, necessitating the selection of optimal process parameters. Furthermore, these parameters should ensure the consistency and reliability of thin film properties, assisting in the reproducibility of the devices. However, exploring the multidimensional design space for process optimization is expensive. Bayesian optimization is ideal for optimizing inputs/parameters of general black-box functions without reliance on gradient information. We utilize Bayesian optimization to optimize deposition power and pressure using a custom-built objective function incorporating observed stress and resistance data. Additionally, we integrate prior knowledge of stress variation with pressure into the objective function to prioritize films least affected by stochastic variations. Our findings demonstrate that Bayesian optimization effectively explores the design space and identifies optimal parameter combinations meeting desired stress and resistance specifications.
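Bayesian optimization of this kind typically ranks candidate process settings with an acquisition function such as expected improvement over a Gaussian-process surrogate. A minimal sketch for a minimization objective (the generic closed-form EI, not the authors' custom stress/resistance objective; mu, sigma, and xi are assumed posterior quantities):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """EI acquisition for minimization: expected amount by which a candidate
    with posterior mean `mu` and std `sigma` improves on the best observed
    objective value `best`; larger EI marks a more promising candidate."""
    if sigma <= 0.0:
        return 0.0
    z = (best - mu - xi) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)          # N(z; 0, 1)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))                   # Phi(z)
    return (best - mu - xi) * cdf + sigma * pdf

# Candidate A: confidently better mean; candidate B: worse mean, high uncertainty.
a = expected_improvement(mu=0.8, sigma=0.1, best=1.0)
b = expected_improvement(mu=1.2, sigma=0.5, best=1.0)
```

The exploration term `sigma * pdf` keeps uncertain regions in play, which is what lets the search also favour settings whose properties are stable under deposition fluctuations once that variability enters the objective.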
- Published
- 2024