Author: "Eghbal‐Zadeh, Hamid" / Topic: audio and speech processing (eess.as) - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Eghbal‐Zadeh, Hamid"' showing total 5 results

1. Efficient Training of Audio Transformers with Patchout

Author: Koutini, Khaled, Schlüter, Jan, Eghbal-zadeh, Hamid, and Widmer, Gerhard
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: The great success of transformer-based models in natural language processing (NLP) has led to various attempts at adapting these architectures to other domains such as vision and audio. Recent work has shown that transformers can outperform Convolutional Neural Networks (CNNs) on vision and audio tasks. However, one of the main shortcomings of transformer models, compared to the well-established CNNs, is the computational complexity. In transformers, the compute and memory complexity is known to grow quadratically with the input length. Therefore, there has been extensive work on optimizing transformers, but often at the cost of degrading predictive performance. In this work, we propose a novel method to optimize and regularize transformers on audio spectrograms. Our proposed models achieve a new state-of-the-art performance on Audioset and can be trained on a single consumer-grade GPU. Furthermore, we propose a transformer model that outperforms CNNs in terms of both performance and training speed. Source code: https://github.com/kkoutini/PaSST
Comment: Submitted to Interspeech 2022.
Published: 2022
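
The abstract of result 1 notes that transformer compute and memory grow quadratically with input length. The sketch below illustrates that scaling with a toy count of query-key pairs. It assumes, based only on the title, that Patchout means dropping a random subset of spectrogram patches during training; the function names `attention_pairs` and `drop_patches` are hypothetical and are not taken from the PaSST code.

```python
import random


def attention_pairs(num_tokens):
    """Number of query-key pairs scored by one self-attention layer."""
    return num_tokens * num_tokens


def drop_patches(patches, keep_ratio, seed=None):
    """Keep a random subset of patches, preserving their original order.

    Illustrative assumption only: the abstract does not specify how
    patches are selected.
    """
    rng = random.Random(seed)
    keep = max(1, round(len(patches) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(patches)), keep))
    return [patches[i] for i in kept_idx]


patches = list(range(1000))          # stand-ins for spectrogram patches
kept = drop_patches(patches, keep_ratio=0.5, seed=0)

print(attention_pairs(len(patches)))  # 1000000
print(attention_pairs(len(kept)))     # 250000
```

Halving the sequence length cuts the number of attention pairs to a quarter, which is why shortening the input is an effective lever on training cost.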

2. Receptive-Field Regularized CNNs for Music Classification and Tagging

Author: Koutini, Khaled, Eghbal-Zadeh, Hamid, Haunschmid, Verena, Primus, Paul, Chowdhury, Shreyan, and Widmer, Gerhard
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Machine Learning (cs.LG)
Abstract: Convolutional Neural Networks (CNNs) have been successfully used in various Music Information Retrieval (MIR) tasks, both as end-to-end models and as feature extractors for more complex systems. However, the MIR field is still dominated by the classical VGG-based CNN architecture variants, often in combination with more complex modules such as attention, and/or techniques such as pre-training on large datasets. Deeper models such as ResNet -- which surpassed VGG by a large margin in other domains -- are rarely used in MIR. One of the main reasons for this, as we will show, is the lack of generalization of deeper CNNs in the music domain. In this paper, we present a principled way to make deep architectures like ResNet competitive for music-related tasks, based on well-designed regularization strategies. In particular, we analyze the recently introduced Receptive-Field Regularization and Shake-Shake, and show that they significantly improve the generalization of deep CNNs on music-related tasks, and that the resulting deep CNNs can outperform current more complex models such as CNNs augmented with pre-training and attention. We demonstrate this on two different MIR tasks and two corresponding datasets, thus offering our deep regularized CNNs as a new baseline for these datasets, which can also be used as a feature-extracting module in future, more complex approaches.
Published: 2020

3. Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs

Author: Koutini, Khaled, Chowdhury, Shreyan, Haunschmid, Verena, Eghbal-zadeh, Hamid, and Widmer, Gerhard
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We present CP-JKU submission to MediaEval 2019; a Receptive Field-(RF)-regularized and Frequency-Aware CNN approach for tagging music with emotion/mood labels. We perform an investigation regarding the impact of the RF of the CNNs on their performance on this dataset. We observe that ResNets with smaller receptive fields -- originally adapted for acoustic scene classification -- also perform well in the emotion tagging task. We improve the performance of such architectures using techniques such as Frequency Awareness and Shake-Shake regularization, which were used in previous work on general acoustic recognition tasks.
Comment: MediaEval'19, 27-29 October 2019, Sophia Antipolis, France
Published: 2019

4. Large-Scale Weakly Labeled Semi-Supervised Sound Event Detection in Domestic Environments

Author: Serizel, Romain, Turpault, Nicolas, Eghbal-Zadeh, Hamid, and Shah, Ankit Parag
Affiliations: Speech Modeling for Facilitating Oral-Based Communication (MULTISPEECH), Inria Nancy - Grand Est; Department of Natural Language Processing & Knowledge Discovery (LORIA - NLPKD), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA); Institut National de Recherche en Informatique et en Automatique (Inria); Université de Lorraine (UL); Centre National de la Recherche Scientifique (CNRS); Johannes Kepler University Linz (JKU); Language Technologies Institute (LTI), Carnegie Mellon University (CMU); Grid'5000
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Sound event detection, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing, Audio and Speech Processing (eess.AS), Semi-supervised learning, [INFO.INFO-SD]Computer Science [cs]/Sound [cs.SD], FOS: Electrical engineering, electronic engineering, information engineering, Weakly labeled data, Large scale, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: This paper presents DCASE 2018 task 4. The task evaluates systems for the large-scale detection of sound events using weakly labeled data (without time boundaries). The target of the systems is to provide not only the event class but also the event time boundaries given that multiple events can be present in an audio recording. Another challenge of the task is to explore the possibility to exploit a large amount of unbalanced and unlabeled training data together with a small weakly labeled training set to improve system performance. The data are YouTube video excerpts from domestic contexts, which have many applications such as ambient assisted living. The domain was chosen due to the scientific challenges (wide variety of sounds, time-localized events...) and potential industrial applications.
Comment: Submitted to DCASE2018 Workshop; International audience
Published: 2018

5. Deep Within-Class Covariance Analysis for Robust Audio Representation Learning

Author: Eghbal-zadeh, Hamid, Dorfer, Matthias, and Widmer, Gerhard
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Audio and Speech Processing (eess.AS), Computer Science - Artificial Intelligence, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Sound, Machine Learning (cs.LG), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Convolutional Neural Networks (CNNs) can learn effective features, though have been shown to suffer from a performance drop when the distribution of the data changes from training to test data. In this paper we analyze the internal representations of CNNs and observe that the representations of unseen data in each class, spread more (with higher variance) in the embedding space of the CNN compared to representations of the training data. More importantly, this difference is more extreme if the unseen data comes from a shifted distribution. Based on this observation, we objectively evaluate the degree of representation's variance in each class via eigenvalue decomposition on the within-class covariance of the internal representations of CNNs and observe the same behaviour. This can be problematic as larger variances might lead to mis-classification if the sample crosses the decision boundary of its class. We apply nearest neighbor classification on the representations and empirically show that the embeddings with the high variance actually have significantly worse KNN classification performances, although this could not be foreseen from their end-to-end classification results. To tackle this problem, we propose Deep Within-Class Covariance Analysis (DWCCA), a deep neural network layer that significantly reduces the within-class covariance of a DNN's representation, improving performance on unseen test data from a shifted distribution. We empirically evaluate DWCCA on two datasets for Acoustic Scene Classification (DCASE2016 and DCASE2017). We demonstrate that not only does DWCCA significantly improve the network's internal representation, it also increases the end-to-end classification accuracy, especially when the test set exhibits a distribution shift. By adding DWCCA to a VGG network, we achieve around 6 percentage points improvement in the case of a distribution mismatch.
Comment: 11 pages, 3 tables, 4 figures
Published: 2017
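
The abstract of result 5 describes measuring per-class spread via an eigenvalue decomposition of the within-class covariance of CNN embeddings. The following is a minimal sketch of that diagnostic only, not of the DWCCA layer itself. It assumes embeddings arrive as a NumPy array with one row per sample; the function names and the normalisation by the total sample count are illustrative choices, not taken from the paper.

```python
import numpy as np


def within_class_covariance(embeddings, labels):
    """Pooled within-class covariance of row-vector embeddings.

    Each sample is centred on its own class mean; the scatter matrices
    are summed and divided by the total number of samples (an assumed
    normalisation).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    dim = embeddings.shape[1]
    s_w = np.zeros((dim, dim))
    for c in np.unique(labels):
        x = embeddings[labels == c]
        centered = x - x.mean(axis=0)
        s_w += centered.T @ centered
    return s_w / len(embeddings)


def within_class_spectrum(embeddings, labels):
    """Eigenvalues of the within-class covariance, largest first."""
    s_w = within_class_covariance(embeddings, labels)
    return np.linalg.eigvalsh(s_w)[::-1]


# Toy data: two classes, each spread along a different axis.
toy_x = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
toy_y = [0, 0, 1, 1]
print(within_class_spectrum(toy_x, toy_y))
```

Larger eigenvalues indicate directions in which samples of the same class are spread further apart; comparing the spectrum on training data with the spectrum on unseen data is one way to reproduce the kind of comparison the abstract describes.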
