Author: "Al-Halah, Ziad" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Al-Halah, Ziad"' showing total 107 results

1. Switch-a-View: Few-Shot View Selection Learned from Edited Videos

Author: Majumder, Sagnik, Nagarajan, Tushar, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce Switch-a-View, a model that learns to automatically select the viewpoint to display at each timepoint when creating a how-to video. The key insight of our approach is how to train such a model from unlabeled--but human-edited--video samples. We pose a pretext task that pseudo-labels segments in the training videos for their primary viewpoint (egocentric or exocentric), and then discovers the patterns between those view-switch moments on the one hand and the visual and spoken content in the how-to video on the other hand. Armed with this predictor, our model then takes an unseen multi-view video as input and orchestrates which viewpoint should be displayed when. We further introduce a few-shot training setting that permits steering the model towards a new data domain. We demonstrate our idea on a variety of real-world video from HowTo100M and Ego-Exo4D and rigorously validate its advantages.
Published: 2024

2. Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos

Author: Majumder, Sagnik, Nagarajan, Tushar, Al-Halah, Ziad, Pradhan, Reina, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Given a multi-view video, which viewpoint is most informative for a human observer? Existing methods rely on heuristics or expensive "best-view" supervision to answer this question, limiting their applicability. We propose a weakly supervised approach that leverages language accompanying an instructional multi-view video as a means to recover its most informative viewpoint(s). Our key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is. To put this into action, we propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels. Then, those pseudo-labels are used to train a view selector, together with an auxiliary camera pose predictor that enhances view-sensitivity. During inference, our model takes as input only a multi-view video--no language or camera poses--and returns the best viewpoint to watch at each timestep. On two challenging datasets comprised of diverse multi-camera setups and how-to activities, our model consistently outperforms state-of-the-art baselines, both with quantitative metrics and human evaluation.
Published: 2024

3. Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

Author: Majumder, Sagnik, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.
Comment: Accepted to CVPR 2024
Published: 2023

4. SpotEM: Efficient Video Search for Episodic Memory

Author: Ramakrishnan, Santhosh Kumar, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable-camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: 1) a novel clip selector that learns to identify promising video regions to search conditioned on the language query; 2) a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and 3) distillation losses that address the optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10% - 25% of the clip features, we preserve 84% - 97% of the original EM model's accuracy. Project page: https://vision.cs.utexas.edu/projects/spotem
Comment: Published in ICML 2023
Published: 2023

5. A Domain-Agnostic Approach for Characterization of Lifelong Learning Systems

Author: Baker, Megan M., New, Alexander, Aguilar-Simon, Mario, Al-Halah, Ziad, Arnold, Sébastien M. R., Ben-Iwhiwhu, Ese, Brna, Andrew P., Brooks, Ethan, Brown, Ryan C., Daniels, Zachary, Daram, Anurag, Delattre, Fabien, Dellana, Ryan, Eaton, Eric, Fu, Haotian, Grauman, Kristen, Hostetler, Jesse, Iqbal, Shariq, Kent, Cassandra, Ketz, Nicholas, Kolouri, Soheil, Konidaris, George, Kudithipudi, Dhireesha, Learned-Miller, Erik, Lee, Seungwon, Littman, Michael L., Madireddy, Sandeep, Mendez, Jorge A., Nguyen, Eric Q., Piatko, Christine D., Pilly, Praveen K., Raghavan, Aswin, Rahman, Abrar, Ramakrishnan, Santhosh Kumar, Ratzlaff, Neale, Soltoggio, Andrea, Stone, Peter, Sur, Indranil, Tang, Zhipeng, Tiwari, Saket, Vedder, Kyle, Wang, Felix, Xu, Zifan, Yanguas-Gil, Angel, Yedidsion, Harel, Yu, Shangqun, and Vallabha, Gautam K.
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Robotics
Abstract: Despite the advancement of machine learning techniques in recent years, state-of-the-art systems lack robustness to "real world" events, where the input distributions and tasks encountered by the deployed systems will not be limited to the original training context, and systems will instead need to adapt to novel distributions and tasks while deployed. This critical gap may be addressed through the development of "Lifelong Learning" systems that are capable of 1) Continuous Learning, 2) Transfer and Adaptation, and 3) Scalability. Unfortunately, efforts to improve these capabilities are typically treated as distinct areas of research that are assessed independently, without regard to the impact of each separate capability on other aspects of the system. We instead propose a holistic approach, using a suite of metrics and an evaluation framework to assess Lifelong Learning in a principled way that is agnostic to specific domains or system techniques. Through five case studies, we show that this suite of metrics can inform the development of varied and complex Lifelong Learning systems. We highlight how the proposed suite of metrics quantifies performance trade-offs present during Lifelong Learning system development - both the widely discussed Stability-Plasticity dilemma and the newly proposed relationship between Sample Efficient and Robust Learning. Further, we make recommendations for the formulation and use of metrics to guide the continuing development of Lifelong Learning systems and assess their progress in the future.
Comment: To appear in Neural Networks
Published: 2023

6. NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Author: Ramakrishnan, Santhosh Kumar, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature makes it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories. Code and models: http://vision.cs.utexas.edu/projects/naq
Comment: 13 pages, 7 figures, appearing in CVPR 2023
Published: 2023

7. Few-Shot Audio-Visual Learning of Environment Acoustics

Author: Majumder, Sagnik, Chen, Changan, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Room impulse response (RIR) functions capture how the surrounding physical environment transforms the sounds heard by a listener, with implications for various applications in AR, VR, and robotics. Whereas traditional methods to estimate RIRs assume dense geometry and/or sound measurements throughout the environment, we explore how to infer RIRs based on a sparse set of images and echoes observed in the space. Towards that goal, we introduce a transformer-based method that uses self-attention to build a rich acoustic context, then predicts RIRs of arbitrary query source-receiver locations through cross-attention. Additionally, we design a novel training objective that improves the match in the acoustic signature between the RIR predictions and the targets. In experiments using a state-of-the-art audio-visual simulator for 3D environments, we demonstrate that our method successfully generates arbitrary RIRs, outperforming state-of-the-art methods and -- in a major departure from traditional methods -- generalizing to novel environments in a few-shot manner. Project: http://vision.cs.utexas.edu/projects/fs_rir.
Comment: Accepted to NeurIPS 2022
Published: 2022

8. Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation

Author: Al-Halah, Ziad, Ramakrishnan, Santhosh K., and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated whenever there is a change in the task type or the goal modality. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, RoomNav, ViewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.
Comment: CVPR 2022. Project page: https://vision.cs.utexas.edu/projects/zsel/
Published: 2022

9. PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

Author: Ramakrishnan, Santhosh Kumar, Chaplot, Devendra Singh, Al-Halah, Ziad, Malik, Jitendra, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: State-of-the-art approaches to ObjectGoal navigation rely on reinforcement learning and typically require significant computational resources and time for learning. We propose Potential functions for ObjectGoal Navigation with Interaction-free learning (PONI), a modular approach that disentangles the skills of 'where to look?' for an object and 'how to navigate to (x, y)?'. Our key insight is that 'where to look?' can be treated purely as a perception problem, and learned without environment interactions. To address this, we propose a network that predicts two complementary potential functions conditioned on a semantic map and uses them to decide where to look for an unseen object. We train the potential function network using supervised learning on a passive dataset of top-down semantic maps, and integrate it into a modular framework to perform ObjectGoal navigation. Experiments on Gibson and Matterport3D demonstrate that our method achieves the state-of-the-art for ObjectGoal navigation while incurring up to 1,600x less computational cost for training. Code and pre-trained models are available: https://vision.cs.utexas.edu/projects/poni/
Comment: 8 pages + supplementary. Accepted in CVPR 2022
Published: 2022

10. Move2Hear: Active Audio-Visual Source Separation

Author: Majumder, Sagnik, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest in its environment. The agent hears multiple audio sources simultaneously (e.g., a person speaking down the hall in a noisy household) and it must use its eyes and ears to automatically separate out the sounds originating from a target object within a limited time budget. Towards this goal, we introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time, guided by the improvement in predicted audio separation quality. We demonstrate our approach in scenarios motivated by both augmented reality (system is already co-located with the target object) and mobile robotics (agent begins arbitrarily far from the target object). Using state-of-the-art realistic audio-visual simulations in 3D environments, we demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation. Project: http://vision.cs.utexas.edu/projects/move2hear.
Comment: Accepted to ICCV 2021
Published: 2021

11. Environment Predictive Coding for Embodied Agents

Author: Ramakrishnan, Santhosh K., Nagarajan, Tushar, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce environment predictive coding, a self-supervised approach to learn environment-level representations for embodied agents. In contrast to prior work on self-supervised learning for images, we aim to jointly encode a series of images gathered by an agent as it moves about in 3D environments. We learn these representations via a zone prediction task, where we intelligently mask out portions of an agent's trajectory and predict them from the unmasked portions, conditioned on the agent's camera poses. By learning such representations on a collection of videos, we demonstrate successful transfer to multiple downstream navigation-oriented tasks. Our experiments on the photorealistic 3D environments of Gibson and Matterport3D show that our method outperforms the state-of-the-art on challenging tasks with only a limited budget of experience.
Comment: 9 pages, 6 figures, appendix
Published: 2021

12. Semantic Audio-Visual Navigation

Author: Chen, Changan, Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Robotics, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent work on audio-visual navigation assumes a constantly-sounding target and restricts the role of audio to signaling the target's position. We introduce semantic audio-visual navigation, where objects in the environment make sounds consistent with their semantic meaning (e.g., toilet flushing, door creaking) and acoustic events are sporadic or short in duration. We propose a transformer-based model to tackle this new semantic AudioGoal task, incorporating an inferred goal descriptor that captures both spatial and semantic properties of the target. Our model's persistent multimodal memory enables it to reach the goal even long after the acoustic event stops. In support of the new task, we also expand the SoundSpaces audio simulations to provide semantically grounded sounds for an array of objects in Matterport3D. Our method strongly outperforms existing audio-visual navigation methods by learning to associate semantic, acoustic, and visual cues.
Comment: Project page: http://vision.cs.utexas.edu/projects/semantic-audio-visual-navigation
Published: 2020

13. Modeling Fashion Influence from Photos

Author: Al-Halah, Ziad and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Multimedia
Abstract: The evolution of clothing styles and their migration across the world is intriguing, yet difficult to describe quantitatively. We propose to discover and quantify fashion influences from catalog and social media photos. We explore fashion influence along two channels: geolocation and fashion brands. We introduce an approach that detects which of these entities influence which other entities in terms of propagating their styles. We then leverage the discovered influence patterns to inform a novel forecasting model that predicts the future popularity of any given style within any given city or brand. To demonstrate our idea, we leverage public large-scale datasets of 7.7M Instagram photos from 44 major world cities (where styles are worn with variable frequency) as well as 41K Amazon product photos (where styles are purchased with variable frequency). Our model learns directly from the image data how styles move between locations and how certain brands affect each other's designs in a predictable way. The discovered influence relationships reveal how both cities and brands exert and receive fashion influence for an array of visual styles inferred from the images. Furthermore, the proposed forecasting model achieves state-of-the-art results for challenging style forecasting tasks. Our results indicate the advantage of grounding visual style evolution both spatially and temporally, and for the first time, they quantify the propagation of inter-brand and inter-city influences.
Comment: To appear in the IEEE Transactions on Multimedia, 2020. Project page: https://www.cs.utexas.edu/~ziad/influence_from_photos.html. arXiv admin note: substantial text overlap with arXiv:2004.01316
Published: 2020

14. Learning to Set Waypoints for Audio-Visual Navigation

Author: Chen, Changan, Majumder, Sagnik, Al-Halah, Ziad, Gao, Ruohan, Ramakrishnan, Santhosh Kumar, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Robotics, Computer Science - Sound
Abstract: In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source (e.g., a phone ringing in another room). Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations. We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements: 1) waypoints that are dynamically set and learned end-to-end within the navigation policy, and 2) an acoustic memory that provides a structured, spatially grounded record of what the agent has heard as it moves. Both new ideas capitalize on the synergy of audio and visual data for revealing the geometry of an unmapped space. We demonstrate our approach on two challenging datasets of real-world 3D scenes, Replica and Matterport3D. Our model improves the state of the art by a substantial margin, and our experiments reveal that learning the links between sights, sounds, and space is essential for audio-visual navigation. Project: http://vision.cs.utexas.edu/projects/audio_visual_waypoints.
Comment: Accepted to ICLR 2021
Published: 2020

15. Occupancy Anticipation for Efficient Exploration and Navigation

Author: Ramakrishnan, Santhosh K., Al-Halah, Ziad, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: State-of-the-art navigation methods leverage a spatial memory to generalize to new environments, but their occupancy maps are limited to capturing the geometric structures directly observed by the agent. We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions. In doing so, the agent builds its spatial awareness more rapidly, which facilitates efficient exploration and navigation in 3D environments. By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment, with performance significantly better than strong baselines. Furthermore, when deployed for the sequential decision-making tasks of exploration and navigation, our model outperforms state-of-the-art methods on the Gibson and Matterport3D datasets. Our approach is the winning entry in the 2020 Habitat PointNav Challenge. Project page: http://vision.cs.utexas.edu/projects/occupancy_anticipation/
Comment: Accepted in ECCV 2020. 19 pages, 6 figures, appendix at end
Published: 2020

16. VisualEchoes: Spatial Image Representation Learning through Echolocation

Author: Gao, Ruohan, Chen, Changan, Al-Halah, Ziad, Schissler, Carl, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning---monocular depth estimation, surface normal estimation, and visual navigation---with results comparable or even better than heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world.
Comment: Appears in ECCV 2020
Published: 2020

17. From Paris to Berlin: Discovering Fashion Style Influences Around the World

Author: Al-Halah, Ziad and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Social and Information Networks
Abstract: The evolution of clothing styles and their migration across the world is intriguing, yet difficult to describe quantitatively. We propose to discover and quantify fashion influences from everyday images of people wearing clothes. We introduce an approach that detects which cities influence which other cities in terms of propagating their styles. We then leverage the discovered influence patterns to inform a forecasting model that predicts the popularity of any given style at any given city into the future. Demonstrating our idea with GeoStyle---a large-scale dataset of 7.7M images covering 44 major world cities, we present the discovered influence relationships, revealing how cities exert and receive fashion influence for an array of 50 observed visual styles. Furthermore, the proposed forecasting model achieves state-of-the-art results for a challenging style forecasting task, showing the advantage of grounding visual style evolution both spatially and temporally.
Comment: CVPR 2020. Project page: https://www.cs.utexas.edu/~ziad/fashion_influence.html
Published: 2020

18. SoundSpaces: Audio-Visual Navigation in 3D Environments

Author: Chen, Changan, Jain, Unnat, Schissler, Carl, Gari, Sebastia Vicenc Amengual, Al-Halah, Ziad, Ithapu, Vamsi Krishna, Robinson, Philip, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Moving around in the world is naturally a multisensory experience, but today's embodied agents are deaf---restricted to solely their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to a sounding object. We propose a multi-modal deep reinforcement learning approach to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce SoundSpaces: a first-of-its-kind dataset of audio renderings based on geometrical acoustic simulations for two sets of publicly available 3D environments (Matterport3D and Replica), and we instrument Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of real-world scanned environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces, and our work lays groundwork for new research in embodied AI with audio-visual perception.
Comment: Accepted to ECCV 2020 (Spotlight). Project page: http://vision.cs.utexas.edu/projects/audio_visual_navigation/
Published: 2019

19. Smile, Be Happy :) Emoji Embedding for Visual Sentiment Analysis

Author: Al-Halah, Ziad, Aitken, Andrew, Shi, Wenzhe, and Caballero, Jose
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Due to the lack of large-scale datasets, the prevailing approach in visual sentiment analysis is to leverage models trained for object classification in large datasets like ImageNet. However, objects are sentiment-neutral, which hinders the expected gain of transfer learning for such tasks. In this work, we propose to overcome this problem by learning a novel sentiment-aligned image embedding that is better suited for subsequent visual sentiment analysis. Our embedding leverages the intricate relation between emojis and images in large-scale and readily available data from social media. Emojis are language-agnostic, consistent, and carry a clear sentiment signal, which makes them an excellent proxy to learn a sentiment-aligned embedding. Hence, we construct a novel dataset of 4 million images collected from Twitter with their associated emojis. We train a deep neural model for image embedding using an emoji prediction task as a proxy. Our evaluation demonstrates that the proposed embedding outperforms the popular object-based counterpart consistently across several sentiment analysis benchmarks. Furthermore, without bells and whistles, our compact, effective and simple embedding outperforms the more elaborate and customized state-of-the-art deep models on these public benchmarks. Additionally, we introduce a novel emoji representation based on their visual emotional response which supports a deeper understanding of the emoji modality and their usage on social media.
Comment: International Conference on Computer Vision (ICCV 2019) Workshops. Project page and the Visual Smiley Dataset: https://www.cs.utexas.edu/~ziad/emoji_visual_sentiment.html
Published: 2019

20. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback

Author: Wu, Hui, Gao, Yupeng, Guo, Xiaoxiao, Al-Halah, Ziad, Rennie, Steven, Grauman, Kristen, and Feris, Rogerio
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Conversational interfaces for the detail-oriented retail fashion domain are more natural, expressive, and user friendly than classical keyword-based search interfaces. In this paper, we introduce the Fashion IQ dataset to support and advance research on interactive fashion image retrieval. Fashion IQ is the first fashion dataset to provide human-generated captions that distinguish similar pairs of garment images together with side-information consisting of real-world product descriptions and derived visual attribute labels for these images. We provide a detailed analysis of the characteristics of the Fashion IQ data, and present a transformer-based user simulator and interactive image retriever that can seamlessly integrate visual attributes with image features, user feedback, and dialog history, leading to improved performance over the state of the art in dialog-based image retrieval. We believe that our dataset will encourage further work on developing more natural and real-world applicable conversational shopping assistants.
Published: 2019

21. Traversing the Continuous Spectrum of Image Retrieval with Deep Dynamic Models

Author: Al-Halah, Ziad, Lehrmann, Andreas M., and Sigal, Leonid
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: We introduce the first work to tackle the image retrieval problem as a continuous operation. While the proposed approaches in the literature can be roughly categorized into two main groups: category- and instance-based retrieval, in this work we show that the retrieval task is much richer and more complex. Image similarity goes beyond this discrete vantage point and spans a continuous spectrum among the classical operating points of category and instance similarity. However, current retrieval models are static and incapable of exploring this rich structure of the retrieval space since they are trained and evaluated with a single operating point as a target objective. Hence, we introduce a novel retrieval model that for a given query is capable of producing a dynamic embedding that can target an arbitrary point along the continuous retrieval spectrum. Our model disentangles the visual signal of a query image into its basic components of categorical and attribute information. Furthermore, using a continuous control parameter our model learns to reconstruct a dynamic embedding of the query by mixing these components with different proportions to target a specific point along the retrieval simplex. We demonstrate our idea in a comprehensive evaluation of the proposed model and highlight the advantages of our approach against a set of well-established discrete retrieval models.
Published: 2018

22. Informed Democracy: Voting-based Novelty Detection for Action Recognition

Author: Roitberg, Alina, Al-Halah, Ziad, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Novelty detection is crucial for real-life applications. While it is common in activity recognition to assume a closed-set setting, i.e. test samples are always of training categories, this assumption is impractical in a real-world scenario. Test samples can be of various categories including those never seen before during training. Thus, being able to know what we know and what we do not know is decisive for the model to avoid what can be catastrophic consequences. We present in this work a novel approach for identifying samples of activity classes that are not previously seen by the classifier. Our model employs a voting-based scheme that leverages the estimated uncertainty of the individual classifiers in their predictions to measure the novelty of a new input sample. Furthermore, the voting is privileged to a subset of informed classifiers that can best estimate whether a sample is novel or not when it is classified to a certain known category. In a thorough evaluation on UCF-101 and HMDB-51, we show that our model consistently outperforms state-of-the-art in novelty detection. Additionally, by combining our model with off-the-shelf zero-shot learning (ZSL) approaches, our model leads to a significant improvement in action classification accuracy for the generalized ZSL setting.
Comment: Published in BMVC 2018. First and second authors contributed equally to this work
Published: 2018

23. Fashion Forward: Forecasting Visual Style in Fashion

Author: Al-Halah, Ziad, Stiefelhagen, Rainer, and Grauman, Kristen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: What is the future of fashion? Tackling this question from a data-driven vision perspective, we propose to forecast visual style trends before they occur. We introduce the first approach to predict the future popularity of styles discovered from fashion images in an unsupervised manner. Using these styles as a basis, we train a forecasting model to represent their trends over time. The resulting model can hypothesize new mixtures of styles that will become popular in the future, discover style dynamics (trendy vs. classic), and name the key visual attributes that will dominate tomorrow's fashion. We demonstrate our idea applied to three datasets encapsulating 80,000 fashion products sold across six years on Amazon. Results indicate that fashion forecasting benefits greatly from visual analysis, much more than textual or meta-data cues surrounding products.
Comment: ICCV 2017. Project page: https://cvhci.anthropomatik.kit.edu/~zalhalah/prj_fashion_forecast.html
Published: 2017

24. Automatic Discovery, Association Estimation and Learning of Semantic Attributes for a Thousand Categories

Author: Al-Halah, Ziad and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Attribute-based recognition models, due to their impressive performance and their ability to generalize well on novel categories, have been widely adopted for many computer vision applications. However, usually both the attribute vocabulary and the class-attribute associations have to be provided manually by domain experts or a large number of annotators. This is very costly and not necessarily optimal regarding recognition performance, and most importantly, it limits the applicability of attribute-based models to large scale data sets. To tackle this problem, we propose an end-to-end unsupervised attribute learning approach. We utilize online text corpora to automatically discover a salient and discriminative vocabulary that correlates well with the human concept of semantic attributes. Moreover, we propose a deep convolutional model to optimize class-attribute associations with a linguistic prior that accounts for noise and missing data in text. In a thorough evaluation on ImageNet, we demonstrate that our model is able to efficiently discover and learn semantic attributes at a large scale. Furthermore, we demonstrate that our model outperforms the state-of-the-art in zero-shot learning on three data sets: ImageNet, Animals with Attributes and aPascal/aYahoo. Finally, we enable attribute-based learning on ImageNet and will share the attributes and associations for future research.
Comment: Accepted as a conference paper at CVPR 2017
Published: 2017

25. Relaxed Earth Mover's Distances for Chain- and Tree-connected Spaces and their use as a Loss Function in Deep Learning

Author: Martinez, Manuel, Haurilet, Monica, Al-Halah, Ziad, Tapaswi, Makarand, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: The Earth Mover's Distance (EMD) computes the optimal cost of transforming one distribution into another, given a known transport metric between them. In deep learning, the EMD loss allows us to embed information during training about the output space structure like hierarchical or semantic relations. This helps in achieving better output smoothness and generalization. However, EMD is computationally expensive. Moreover, solving EMD optimization problems usually requires complex techniques like lasso. These properties limit the applicability of EMD-based approaches in large scale machine learning. We address in this work the difficulties facing incorporation of EMD-based loss in deep learning frameworks. Additionally, we provide insight and novel solutions on how to integrate such loss function in training deep neural networks. Specifically, we make three main contributions: (i) we provide an in-depth analysis of the fastest state-of-the-art EMD algorithm (Sinkhorn Distance) and discuss its limitations in deep learning scenarios; (ii) we derive fast and numerically stable closed-form solutions for the EMD gradient in output spaces with chain- and tree-connectivity; and (iii) we propose a relaxed form of the EMD gradient with equivalent computational complexity but faster convergence rate. We support our claims with experiments on real datasets. In a restricted data setting on the ImageNet dataset, we train a model to classify 1000 categories using 50K images, and demonstrate that our relaxed EMD loss achieves better Top-1 accuracy than the cross entropy loss. Overall, we show that our relaxed EMD loss criterion is a powerful asset for deep learning in the small data regime.
Published: 2016

26. Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning

Author: Al-Halah, Ziad, Tapaswi, Makarand, and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Collecting training images for all visual categories is not only expensive but also impractical. Zero-shot learning (ZSL), especially using attributes, offers a pragmatic solution to this problem. However, at test time most attribute-based methods require a full description of attribute associations for each unseen class. Providing these associations is time consuming and often requires domain specific knowledge. In this work, we aim to carry out attribute-based zero-shot classification in an unsupervised manner. We propose an approach to learn relations that couples class embeddings with their corresponding attributes. Given only the name of an unseen class, the learned relationship model is used to automatically predict the class-attribute associations. Furthermore, our model facilitates transferring attributes across data sets without additional effort. Integrating knowledge from multiple sources results in a significant additional improvement in performance. We evaluate on two public data sets: Animals with Attributes and aPascal/aYahoo. Our approach outperforms state-of-the-art methods in both predicting class-attribute associations and unsupervised ZSL by a large margin.
Comment: Published as a conference paper at CVPR 2016
Published: 2016

27. How to Transfer? Zero-Shot Object Recognition via Hierarchical Transfer of Semantic Attributes

Author: Al-Halah, Ziad and Stiefelhagen, Rainer
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Attribute based knowledge transfer has proven very successful in visual object analysis and learning previously unseen classes. However, the common approach learns and transfers attributes without taking into consideration the embedded structure between the categories in the source set. Such information provides important cues on the intra-attribute variations. We propose to capture these variations in a hierarchical model that expands the knowledge source with additional abstraction levels of attributes. We also provide a novel transfer approach that can choose the appropriate attributes to be shared with an unseen class. We evaluate our approach on three public datasets: aPascal, Animals with Attributes and CUB-200-2011 Birds. The experiments demonstrate the effectiveness of our model with significant improvement over state-of-the-art.
Comment: Published as a conference paper at WACV 2015, modifications include new results with GoogLeNet features
Published: 2016
Full Text: View/download PDF

28. SoundSpaces: Audio-Visual Navigation in 3D Environments

Author: Chen, Changan, Jain, Unnat, Schissler, Carl, Gari, Sebastia Vicenc Amengual, Al-Halah, Ziad, Ithapu, Vamsi Krishna, Robinson, Philip, Grauman, Kristen, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Vedaldi, Andrea, editor, Bischof, Horst, editor, Brox, Thomas, editor, and Frahm, Jan-Michael, editor
Published: 2020
Full Text: View/download PDF

29. DynGraph: Visual Question Answering via Dynamic Scene Graphs

Author: Haurilet, Monica, Al-Halah, Ziad, Stiefelhagen, Rainer, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Fink, Gernot A., editor, Frintrop, Simone, editor, and Jiang, Xiaoyi, editor
Published: 2019
Full Text: View/download PDF

30. MoQA – A Multi-modal Question Answering Architecture

Author: Haurilet, Monica, Al-Halah, Ziad, Stiefelhagen, Rainer, Hutchison, David, Series Editor, Kanade, Takeo, Series Editor, Kittler, Josef, Series Editor, Kleinberg, Jon M., Series Editor, Mattern, Friedemann, Series Editor, Mitchell, John C., Series Editor, Naor, Moni, Series Editor, Pandu Rangan, C., Series Editor, Steffen, Bernhard, Series Editor, Terzopoulos, Demetri, Series Editor, Tygar, Doug, Series Editor, Leal-Taixé, Laura, editor, and Roth, Stefan, editor
Published: 2019
Full Text: View/download PDF

31. SoundSpaces: Audio-Visual Navigation in 3D Environments

Author: Chen, Changan, primary, Jain, Unnat, additional, Schissler, Carl, additional, Gari, Sebastia Vicenc Amengual, additional, Al-Halah, Ziad, additional, Ithapu, Vamsi Krishna, additional, Robinson, Philip, additional, and Grauman, Kristen, additional
Published: 2020
Full Text: View/download PDF

32. VisualEchoes: Spatial Image Representation Learning Through Echolocation

Author: Gao, Ruohan, primary, Chen, Changan, additional, Al-Halah, Ziad, additional, Schissler, Carl, additional, and Grauman, Kristen, additional
Published: 2020
Full Text: View/download PDF

33. Occupancy Anticipation for Efficient Exploration and Navigation

Author: Ramakrishnan, Santhosh K., primary, Al-Halah, Ziad, additional, and Grauman, Kristen, additional
Published: 2020
Full Text: View/download PDF

34. NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory

Author: Ramakrishnan, Santhosh Kumar, primary, Al-Halah, Ziad, additional, and Grauman, Kristen, additional
Published: 2023
Full Text: View/download PDF

35. A domain-agnostic approach for characterization of lifelong learning systems

Author: Baker, Megan M., primary, New, Alexander, additional, Aguilar-Simon, Mario, additional, Al-Halah, Ziad, additional, Arnold, Sébastien M.R., additional, Ben-Iwhiwhu, Ese, additional, Brna, Andrew P., additional, Brooks, Ethan, additional, Brown, Ryan C., additional, Daniels, Zachary, additional, Daram, Anurag, additional, Delattre, Fabien, additional, Dellana, Ryan, additional, Eaton, Eric, additional, Fu, Haotian, additional, Grauman, Kristen, additional, Hostetler, Jesse, additional, Iqbal, Shariq, additional, Kent, Cassandra, additional, Ketz, Nicholas, additional, Kolouri, Soheil, additional, Konidaris, George, additional, Kudithipudi, Dhireesha, additional, Learned-Miller, Erik, additional, Lee, Seungwon, additional, Littman, Michael L., additional, Madireddy, Sandeep, additional, Mendez, Jorge A., additional, Nguyen, Eric Q., additional, Piatko, Christine, additional, Pilly, Praveen K., additional, Raghavan, Aswin, additional, Rahman, Abrar, additional, Ramakrishnan, Santhosh Kumar, additional, Ratzlaff, Neale, additional, Soltoggio, Andrea, additional, Stone, Peter, additional, Sur, Indranil, additional, Tang, Zhipeng, additional, Tiwari, Saket, additional, Vedder, Kyle, additional, Wang, Felix, additional, Xu, Zifan, additional, Yanguas-Gil, Angel, additional, Yedidsion, Harel, additional, Yu, Shangqun, additional, and Vallabha, Gautam K., additional
Published: 2023
Full Text: View/download PDF

36. PONI: Potential Functions for ObjectGoal Navigation with Interaction-free Learning

Author: Ramakrishnan, Santhosh Kumar, primary, Chaplot, Devendra Singh, additional, Al-Halah, Ziad, additional, Malik, Jitendra, additional, and Grauman, Kristen, additional
Published: 2022
Full Text: View/download PDF

37. Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation

Author: Al-Halah, Ziad, primary, Ramakrishnan, Santhosh K., additional, and Grauman, Kristen, additional
Published: 2022
Full Text: View/download PDF

38. Move2Hear: Active Audio-Visual Source Separation

Author: Majumder, Sagnik, primary, Al-Halah, Ziad, additional, and Grauman, Kristen, additional
Published: 2021
Full Text: View/download PDF

39. Semantic Audio-Visual Navigation

Author: Chen, Changan, primary, Al-Halah, Ziad, additional, and Grauman, Kristen, additional
Published: 2021
Full Text: View/download PDF

40. Fashion IQ: A New Dataset Towards Retrieving Images by Natural Language Feedback

Author: Wu, Hui, primary, Gao, Yupeng, additional, Guo, Xiaoxiao, additional, Al-Halah, Ziad, additional, Rennie, Steven, additional, Grauman, Kristen, additional, and Feris, Rogerio, additional
Published: 2021
Full Text: View/download PDF

41. From Paris to Berlin: Discovering Fashion Style Influences Around the World

Author: Al-Halah, Ziad, primary and Grauman, Kristen, additional
Published: 2020
Full Text: View/download PDF

42. Modeling Fashion Influence from Photos

Author: Al-Halah, Ziad, primary and Grauman, Kristen, additional
Published: 2020
Full Text: View/download PDF

43. Smile, Be Happy :) Emoji Embedding for Visual Sentiment Analysis

Author: Al-Halah, Ziad, primary, Aitken, Andrew, additional, Shi, Wenzhe, additional, and Caballero, Jose, additional
Published: 2019
Full Text: View/download PDF

44. Semantic Attributes for Transfer Learning in Visual Recognition

Author: Al Halah, Ziad and Stiefelhagen, R.
Subjects: fashion forecast, DATA processing & computer science, fine-grained visual recognition, ddc:004, transfer learning, zero-shot learning, attribute discovery
Abstract: Driven by the success of deep learning methods, artificial intelligence has made considerable progress in machine understanding. However, thousands of manually annotated training examples are essential to ensure that such models generalize. Moreover, the model must be retrained from scratch every time it is applied to a new class of problems. This in turn means that the very costly process of collecting and annotating training data must be repeated, which considerably limits the scalability of such models. We humans, on the other hand, do not handle new tasks in isolation; we have the remarkable ability to draw on previously acquired knowledge when solving new problems. This ability is known as transfer learning. It enables us to learn new things faster, better, and from only very few examples. There is therefore great interest in imitating this ability algorithmically, especially in domains where training data is very scarce or not available at all. In this thesis, we investigate transfer learning in the context of computer vision. In particular, we study how visual recognition (e.g., object or action classification) can be performed when only few or no training examples exist. A promising solution in this direction is the framework of semantic attributes, in which visual categories are described in terms of attributes such as color, pattern, and shape. These attributes can be learned from a disjoint set of training examples. Since attributes have a dual interpretation, both visual and semantic, language can be used effectively to steer the transfer process. This means that models for a new visual category can be built from its linguistic description alone, by selecting relevant attributes and transferring them to the new category. This process eliminates the need for training images entirely. In this thesis, we present new solutions for modeling semantic attributes, transferring them, automatically associating them with visual categories, and recognizing them from linguistic descriptions. To this end, we examine attribute-based recognition from the following four perspectives: 1) Unlike the prevalent model, in which attributes must be learned globally, we present a hierarchical approach that allows attributes to be learned at different levels of abstraction. We also show how the structure among the categories can be exploited effectively to guide the learning and transfer process and thereby build discriminative models for new categories. In a thorough experimental analysis, we demonstrate a clear improvement of our model over the global approach, especially for fine-grained recognition. 2) In prevailing attribute-based transfer approaches, the user supervises the mapping between attributes and categories. We propose to establish the connection between the two automatically and without user intervention. Our model captures the semantic relations that couple attributes with objects in order to predict their associations and to select, without supervision, which attributes should be transferred. 3) We bypass the need for a predefined attribute vocabulary. Instead, we propose to use encyclopedia articles that describe object categories in free text to automatically discover a set of discriminative, salient, and diverse attributes. Removing the need for a user-defined vocabulary allows us to fully exploit the potential of attribute-based models on very large-scale data. 4) We present a novel real-world application of semantic attributes. We propose the first method that automatically learns fashion styles and predicts how their popularity will develop in the near future. We show that semantic attributes yield interpretable fashion styles and lead to better forecasts of the popularity of visual styles than other representations.
Published: 2018
Full Text: View/download PDF

45. SPaSe - Multi-Label Page Segmentation for Presentation Slides

Author: Haurilet, Monica, primary, Al-Halah, Ziad, additional, and Stiefelhagen, Rainer, additional
Published: 2019
Full Text: View/download PDF

46. Transfer metric learning for action similarity using high-level semantics

Author: Al-Halah, Ziad, Rybok, Lukas, and Stiefelhagen, Rainer
Published: 2016
Full Text: View/download PDF

47. Fashion Forward: Forecasting Visual Style in Fashion

Author: Al-Halah, Ziad, primary, Stiefelhagen, Rainer, additional, and Grauman, Kristen, additional
Published: 2017
Full Text: View/download PDF

48. Automatic Discovery, Association Estimation and Learning of Semantic Attributes for a Thousand Categories

Author: Al-Halah, Ziad, primary and Stiefelhagen, Rainer, additional
Published: 2017
Full Text: View/download PDF

49. Recovering the Missing Link: Predicting Class-Attribute Associations for Unsupervised Zero-Shot Learning

Author: Al-Halah, Ziad, primary, Tapaswi, Makarand, additional, and Stiefelhagen, Rainer, additional
Published: 2016
Full Text: View/download PDF

50. Naming TV characters by watching and analyzing dialogs

Author: Haurilet, Monica-Laura, primary, Tapaswi, Makarand, additional, Al-Halah, Ziad, additional, and Stiefelhagen, Rainer, additional
Published: 2016
Full Text: View/download PDF
