Author: "Jansen, Aren" - Searchworks@Jio Institute Digital Library Search Results

Your search for "Jansen, Aren" returned 146 results

1. A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

Author: Kim, Gwanghyun, Martinez, Alonso, Su, Yu-Chuan, Jou, Brendan, Lezama, José, Gupta, Agrim, Yu, Lijun, Jiang, Lu, Jansen, Aren, Walker, Jacob, and Somandepalli, Krishna
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. Nevertheless, this strategy often requires training a separate model for each task which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space. Our key contribution lies in how we parameterize the diffusion timestep in the forward diffusion process. Instead of the standard fixed diffusion timestep, we propose applying variable diffusion timesteps across the temporal dimension and across modalities of the inputs. This formulation offers flexibility to introduce variable noise levels for various portions of the input, hence the term mixture of noise levels. We propose a transformer-based audiovisual latent diffusion model and show that it can be trained in a task-agnostic fashion using our approach to enable a variety of audiovisual generation tasks at inference time. Experiments demonstrate the versatility of our method in tackling cross-modal and multimodal interpolation tasks in the audiovisual space. Notably, our proposed approach surpasses baselines in generating temporally and perceptually consistent samples conditioned on the input. Project page: avdit2024.github.io
Published: 2024

2. Dataset balancing can hurt model performance

Author: Moore, R. Channing, Ellis, Daniel P. W., Fonseca, Eduardo, Hershey, Shawn, Jansen, Aren, and Plakal, Manoj
Subjects: Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Machine learning from training data with a skewed distribution of examples per class can lead to models that favor performance on common classes at the expense of performance on rare ones. AudioSet has a very wide range of priors over its 527 sound event classes. Classification performance on AudioSet is usually evaluated by a simple average over per-class metrics, meaning that performance on rare classes is equal in importance to the performance on common ones. Several recent papers have used dataset balancing techniques to improve performance on AudioSet. We find, however, that while balancing improves performance on the public AudioSet evaluation data it simultaneously hurts performance on an unpublished evaluation set collected under the same conditions. By varying the degree of balancing, we show that its benefits are fragile and depend on the evaluation set. We also do not find evidence indicating that balancing improves rare class performance relative to common classes. We therefore caution against blind application of balancing, as well as against paying too much attention to small improvements on a public evaluation set.
Comment: 5 pages, 3 figures, ICASSP 2023
Published: 2023

3. V2Meow: Meowing to the Visual Beat via Video-to-Music Generation

Author: Su, Kun, Li, Judith Yue, Huang, Qingqing, Kuzmin, Dima, Lee, Joonseok, Donahue, Chris, Sha, Fei, Jansen, Aren, Wang, Yu, Verzetti, Mauro, and Denk, Timo I.
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Video-to-music generation demands both a temporally localized high-quality listening experience and globally aligned video-acoustic signatures. While recent music generation models excel at the former through advanced audio codecs, the exploration of video-acoustic signatures has been confined to specific visual scenarios. In contrast, our research confronts the challenge of learning globally aligned signatures between video and music directly from paired music and videos, without explicitly modeling domain-specific rhythmic or semantic relationships. We propose V2Meow, a video-to-music generation system capable of producing high-quality music audio for a diverse range of video input types using a multi-stage autoregressive model. Trained on 5k hours of music audio clips paired with video frames mined from in-the-wild music videos, V2Meow is competitive with previous domain-specific models when evaluated in a zero-shot manner. It synthesizes high-fidelity music audio waveforms solely by conditioning on pre-trained general-purpose visual features extracted from video frames, with optional style control via text prompts. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms various existing music generation systems in terms of visual-audio correspondence and audio quality. Music samples are available at tinyurl.com/v2meow.
Comment: Accepted at AAAI 2024; music samples available at https://tinyurl.com/v2meow
Published: 2023

4. MusicLM: Generating Music From Text

Author: Agostinelli, Andrea, Denk, Timo I., Borsos, Zalán, Engel, Jesse, Verzetti, Mauro, Caillon, Antoine, Huang, Qingqing, Jansen, Aren, Roberts, Adam, Tagliasacchi, Marco, Sharifi, Matt, Zeghidour, Neil, and Frank, Christian
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: We introduce MusicLM, a model generating high-fidelity music from text descriptions such as "a calming violin melody backed by a distorted guitar riff". MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task, and it generates music at 24 kHz that remains consistent over several minutes. Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description. Moreover, we demonstrate that MusicLM can be conditioned on both text and a melody in that it can transform whistled and hummed melodies according to the style described in a text caption. To support future research, we publicly release MusicCaps, a dataset composed of 5.5k music-text pairs, with rich text descriptions provided by human experts.
Comment: Supplementary material at https://google-research.github.io/seanet/musiclm/examples and https://kaggle.com/datasets/googleai/musiccaps
Published: 2023

5. MAQA: A Multimodal QA Benchmark for Negation

Author: Li, Judith Yue, Jansen, Aren, Huang, Qingqing, Lee, Joonseok, Ganti, Ravi, and Kuzmin, Dima
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Multimodal learning can benefit from the representation power of pretrained Large Language Models (LLMs). However, state-of-the-art transformer-based LLMs often ignore negations in natural language, and there is no existing benchmark to quantitatively evaluate whether multimodal transformers inherit this weakness. In this study, we present a new multimodal question answering (QA) benchmark adapted from labeled music videos in AudioSet (Gemmeke et al., 2017) with the goal of systematically evaluating if multimodal transformers can perform complex reasoning to recognize new concepts as negation of previously learned concepts. We show that with a standard fine-tuning approach multimodal transformers are still incapable of correctly interpreting negation irrespective of model size. However, our experiments demonstrate that augmenting the original training task distributions with negated QA examples allows the model to reliably reason with negation. To do this, we describe a novel data generation procedure that prompts the 540B-parameter PaLM model to automatically generate negated QA examples as compositions of easily accessible video tags. The generated examples contain more natural linguistic patterns and the gains compared to a template-based task augmentation approach are significant.
Comment: NeurIPS 2022 SyntheticData4ML Workshop
Published: 2023

6. MuLan: A Joint Embedding of Music Audio and Natural Language

Author: Huang, Qingqing, Jansen, Aren, Lee, Joonseok, Ganti, Ravi, Li, Judith Yue, and Ellis, Daniel P. W.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Sound, Statistics - Machine Learning
Abstract: Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
Comment: To appear in ISMIR 2022
Published: 2022

7. Text-Driven Separation of Arbitrary Sounds

Author: Kilgour, Kevin, Gfeller, Beat, Huang, Qingqing, Jansen, Aren, Wisdom, Scott, and Tagliasacchi, Marco
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: We propose a method of separating a desired sound source from a single-channel mixture, based on either a textual description or a short audio sample of the target source. This is achieved by combining two distinct models. The first model, SoundWords, is trained to jointly embed both an audio clip and its textual description to the same embedding in a shared representation. The second model, SoundFilter, takes a mixed source audio clip as an input and separates it based on a conditioning vector from the shared text-audio representation defined by SoundWords, making the model agnostic to the conditioning modality. Evaluating on multiple datasets, we show that our approach can achieve an SI-SDR of 9.1 dB for mixtures of two arbitrary sounds when conditioned on text and 10.1 dB when conditioned on audio. We also show that SoundWords is effective at learning co-embeddings and that our multi-modal training approach improves the performance of SoundFilter.
Comment: Submitted to INTERSPEECH 2022
Published: 2022

8. Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

Author: Shor, Joel, Jansen, Aren, Han, Wei, Park, Daniel, and Zhang, Yu
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Many speech applications require understanding aspects beyond the words being spoken, such as recognizing emotion, detecting whether the speaker is wearing a mask, or distinguishing real from synthetic speech. In this work, we introduce a new state-of-the-art paralinguistic representation derived from large-scale, fully self-supervised training of a 600M+ parameter Conformer-based architecture. We benchmark on a diverse set of speech tasks and demonstrate that simple linear classifiers trained on top of our time-averaged representation outperform nearly all previous results, in some cases by large margins. Our analyses of context-window size demonstrate that, surprisingly, 2-second context windows achieve 96% of the performance of the Conformers that use the full long-term context on 7 out of 9 tasks. Furthermore, while the best per-task representations are extracted internally in the network, stable performance across several layers allows a single universal representation to reach near-optimal performance on all tasks.
Published: 2021

9. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Author: Zhang, Yu, Park, Daniel S., Han, Wei, Qin, James, Gulati, Anmol, Shor, Joel, Jansen, Aren, Xu, Yuanzhong, Huang, Yanping, Wang, Shibo, Zhou, Zongwei, Li, Bo, Ma, Min, Chan, William, Yu, Jiahui, Wang, Yongqiang, Cao, Liangliang, Sim, Khe Chai, Ramabhadran, Bhuvana, Sainath, Tara N., Beaufays, Françoise, Chen, Zhifeng, Le, Quoc V., Chiu, Chung-Cheng, Pang, Ruoming, and Wu, Yonghui
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Sound
Abstract: We summarize the results of a host of efforts using giant automatic speech recognition (ASR) models pre-trained using large, diverse unlabeled datasets containing approximately a million hours of audio. We find that the combination of pre-training, self-training and scaling up model size greatly increases data efficiency, even for extremely large tasks with tens of thousands of hours of labeled data. In particular, on an ASR task with 34k hours of labeled data, by fine-tuning an 8 billion parameter pre-trained Conformer model we can match state-of-the-art (SoTA) performance with only 3% of the training data and significantly improve SoTA with the full training set. We also report on the universal benefits gained from using big pre-trained and self-trained models for a large set of downstream tasks that cover a wide range of speech domains and span multiple orders of magnitude of dataset sizes, including obtaining SoTA performance on many public benchmarks. In addition, we utilize the learned representation of pre-trained networks to achieve SoTA results on non-ASR tasks.
Comment: 14 pages, 7 figures, 13 tables; v2: minor corrections, reference baselines and bibliography updated; v3: corrections based on reviewer feedback, bibliography updated
Published: 2021

10. Attention Bottlenecks for Multimodal Fusion

Author: Nagrani, Arsha, Yang, Shan, Arnab, Anurag, Jansen, Aren, Schmid, Cordelia, and Sun, Chen
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality ('late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including AudioSet, Epic-Kitchens and VGGSound. All code and models will be released.
Comment: Published at NeurIPS 2021. Note this version updates numbers due to a bug in the AudioSet mAP calculation in Table 1 (last row)
Published: 2021

11. Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

Author: Wisdom, Scott, Jansen, Aren, Weiss, Ron J., Erdogan, Hakan, and Hershey, John R.
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: Supervised neural network training has led to significant progress on single-channel sound separation. This approach relies on ground truth isolated sources, which precludes scaling to widely available mixture data and limits progress on open-domain tasks. The recent mixture invariant training (MixIT) method enables training on in-the-wild data; however, it suffers from two outstanding problems. First, it produces models which tend to over-separate, producing more output sources than are present in the input. Second, the exponential computational complexity of the MixIT loss limits the number of feasible output sources. In this paper we address both issues. To combat over-separation we introduce new losses: sparsity losses that favor fewer output sources and a covariance loss that discourages correlated outputs. We also experiment with a semantic classification loss by predicting weak class labels for each mixture. To handle larger numbers of sources, we introduce an efficient approximation using a fast least-squares solution, projected onto the MixIT constraint set. Our experiments show that the proposed losses curtail over-separation and improve overall performance. The best performance is achieved using larger numbers of output sources, enabled by our efficient MixIT loss, combined with sparsity losses to prevent over-separation. On the FUSS test set, we achieve over 13 dB in multi-source SI-SNR improvement, while boosting single-source reconstruction SI-SNR by over 17 dB.
Comment: 5 pages, 1 figure. WASPAA 2021
Published: 2021

12. The Benefit Of Temporally-Strong Labels In Audio Event Classification

Author: Hershey, Shawn, Ellis, Daniel P. W., Fonseca, Eduardo, Jansen, Aren, Liu, Caroline, Moore, R. Channing, and Plakal, Manoj
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: To reveal the importance of temporal precision in ground truth audio event labels, we collected precise (~0.1 sec resolution) "strong" labels for a portion of the AudioSet dataset. We devised a temporally strong evaluation set (including explicit negatives of varying difficulty) and a small strong-labeled training subset of 67k clips (compared to the original dataset's 1.8M clips labeled at 10 sec resolution). We show that fine-tuning with a mix of weak and strongly labeled data can substantially improve classifier performance, even when evaluated using only the original weak labels. For a ResNet50 architecture, d' on the strong evaluation data including explicit negatives improves from 1.13 to 1.41. The new labels are available as an update to AudioSet.
Comment: Accepted for publication at ICASSP 2021
Published: 2021

13. Self-Supervised Learning from Automatically Separated Sound Scenes

Author: Fonseca, Eduardo, Jansen, Aren, Ellis, Daniel P. W., Wisdom, Scott, Tagliasacchi, Marco, Hershey, John R., Plakal, Manoj, Hershey, Shawn, Moore, R. Channing, and Serra, Xavier
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Real-world sound scenes consist of time-varying collections of sound sources, each generating characteristic sound events that are mixed together in audio recordings. The association of these constituent sound events with their mixture and each other is semantically constrained: the sound scene contains the union of source classes and not all classes naturally co-occur. With this motivation, this paper explores the use of unsupervised automatic sound separation to decompose unlabeled sound scenes into multiple semantically-linked views for use in self-supervised contrastive learning. We find that learning to associate input mixtures with their automatically separated outputs yields stronger representations than past approaches that use the mixtures alone. Further, we discover that optimal source separation is not required for successful contrastive learning by demonstrating that a range of separation system convergence states all lead to useful and often complementary example transformations. Our best system incorporates these unsupervised separation models into a single augmentation front-end and jointly optimizes similarity maximization and coincidence prediction objectives across the views. The result is an unsupervised audio representation that rivals state-of-the-art alternatives on the established shallow AudioSet classification benchmark.
Published: 2021

14. Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Author: Tzinis, Efthymios, Wisdom, Scott, Jansen, Aren, Hershey, Shawn, Remez, Tal, Ellis, Daniel P. W., and Hershey, John R.
Subjects: Computer Science - Sound, Computer Science - Computer Vision and Pattern Recognition, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present AudioScope, a novel audio-visual sound separation framework that can be trained without supervision to isolate on-screen sound sources from real in-the-wild videos. Prior audio-visual separation work assumed artificial limitations on the domain of sound classes (e.g., to speech or music), constrained the number of sources, and required strong sound separation or visual segmentation labels. AudioScope overcomes these limitations, operating on an open domain of sounds, with variable numbers of sources, and without labels or prior visual segmentation. The training procedure for AudioScope uses mixture invariant training (MixIT) to separate synthetic mixtures of mixtures (MoMs) into individual sources, where noisy labels for mixtures are provided by an unsupervised audio-visual coincidence model. Using the noisy labels, along with attention between video and audio features, AudioScope learns to identify audio-visual similarity and to suppress off-screen sounds. We demonstrate the effectiveness of our approach using a dataset of video clips extracted from open-domain YFCC100m video data. This dataset contains a wide diversity of sound classes recorded in unconstrained conditions, making the application of previous methods unsuitable. For evaluation and semi-supervised experiments, we collected human labels for presence of on-screen and off-screen sounds on a small subset of clips.
Comment: ICLR 2021, 27 pages
Published: 2020

15. Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

Author: Fonseca, Eduardo, Hershey, Shawn, Plakal, Manoj, Ellis, Daniel P. W., Jansen, Aren, Moore, R. Channing, and Serra, Xavier
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: The study of label noise in sound event recognition has recently gained attention with the advent of larger and noisier datasets. This work addresses the problem of missing labels, one of the big weaknesses of large audio datasets, and one of the most conspicuous issues for AudioSet. We propose a simple and model-agnostic method based on a teacher-student framework with loss masking to first identify the most critical missing label candidates, and then ignore their contribution during the learning process. We find that a simple optimisation of the training label set improves recognition performance without additional computation. We discover that most of the improvement comes from ignoring a critical tiny portion of the missing labels. We also show that the damage done by missing labels is larger as the training set gets smaller, yet it can still be observed even when training with massive amounts of audio. We believe these insights can generalize to other large-scale datasets.
Comment: Accepted in IEEE Signal Processing Letters, openly accessible at https://ieeexplore.ieee.org/document/9130823
Published: 2020

16. Towards Learning a Universal Non-Semantic Representation of Speech

Author: Shor, Joel, Jansen, Aren, Maor, Ronnie, Lang, Oran, Tuval, Omry, Quitry, Felix de Chaumont, Tagliasacchi, Marco, Shavitt, Ira, Emanuel, Dotan, and Haviv, Yinnon
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Machine Learning, Computer Science - Sound, Statistics - Machine Learning
Abstract: The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a representation based on an unsupervised triplet-loss objective. The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks. The embedding is trained on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and medical domain. The benchmark, models, and evaluation code are publicly released.
Published: 2020

17. Improving Universal Sound Separation Using Sound Classification

Author: Tzinis, Efthymios, Wisdom, Scott, Hershey, John R., Jansen, Aren, and Ellis, Daniel P. W.
Subjects: Computer Science - Sound, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: Deep learning approaches have recently achieved impressive performance on both audio source separation and sound classification. Most audio source separation approaches focus only on separating sources belonging to a restricted domain of source classes, such as speech and music. However, recent work has demonstrated the possibility of "universal sound separation", which aims to separate acoustic sources from an open domain, regardless of their class. In this paper, we utilize the semantic information learned by sound classifier networks trained on a vast amount of diverse sounds to improve universal sound separation. In particular, we show that semantic embeddings extracted from a sound classifier can be used to condition a separation network, providing it with useful additional information. This approach is especially useful in an iterative setup, where source estimates from an initial separation stage and their corresponding classifier-derived embeddings are fed to a second separation network. By performing a thorough hyperparameter search consisting of over a thousand experiments, we find that classifier embeddings from clean sources provide nearly one dB of SNR gain, and our best iterative models achieve a significant fraction of this oracle performance, establishing a new state-of-the-art for universal sound separation.
Published: 2019

18. Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

Author: Jansen, Aren, Ellis, Daniel P. W., Hershey, Shawn, Moore, R. Channing, Plakal, Manoj, Popat, Ashok C., and Saurous, Rif A.
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: Humans do not acquire perceptual abilities in the way we train machines. While machine learning algorithms typically operate on large collections of randomly-chosen, explicitly-labeled examples, human acquisition relies more heavily on multimodal unsupervised learning (as infants) and active learning (as children). With this motivation, we present a learning framework for sound representation and recognition that combines (i) a self-supervised objective based on a general notion of unimodal and cross-modal coincidence, (ii) a clustering objective that reflects our need to impose categorical structure on our experiences, and (iii) a cluster-based active learning procedure that solicits targeted weak supervision to consolidate categories into relevant semantic classes. By training a combined sound embedding/clustering/classification network according to these criteria, we achieve a new state-of-the-art unsupervised audio representation and demonstrate up to a 20-fold reduction in the number of labels required to reach a desired classification performance.
Comment: This extended version of an ICASSP 2020 submission under the same title has an added figure and additional discussion for easier consumption
Published: 2019

19. Unsupervised Learning of Semantic Audio Representations

Author: Jansen, Aren, Plakal, Manoj, Pandya, Ratheet, Ellis, Daniel P. W., Hershey, Shawn, Liu, Jiayang, Moore, R. Channing, and Saurous, Rif A.
Subjects: Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing, Statistics - Machine Learning
Abstract: Even in the absence of any explicit semantic annotation, vast collections of audio recordings provide valuable information for learning the categorical structure of sounds. We consider several class-agnostic semantic constraints that apply to unlabeled nonspeech audio: (i) noise and translations in time do not change the underlying sound category, (ii) a mixture of two sound events inherits the categories of the constituents, and (iii) the categories of events in close temporal proximity are likely to be the same or related. Without labels to ground them, these constraints are incompatible with classification loss functions. However, they may still be leveraged to identify geometric inequalities needed for triplet loss-based training of convolutional neural networks. The result is low-dimensional embeddings of the input spectrograms that recover 41% and 84% of the performance of their fully-supervised counterparts when applied to downstream query-by-example sound retrieval and sound event classification tasks, respectively. Moreover, in limited-supervision settings, our unsupervised embeddings double the state-of-the-art classification performance.
Comment: Submitted to ICASSP 2018
Published: 2017

20. Shared computational principles for language processing in humans and deep language models

Author: Goldstein, Ariel, Zada, Zaid, Buchnik, Eliav, Schain, Mariano, Price, Amy, Aubrey, Bobbi, Nastase, Samuel A., Feder, Amir, Emanuel, Dotan, Cohen, Alon, Jansen, Aren, Gazula, Harshvardhan, Choe, Gina, Rao, Aditi, Kim, Catherine, Casto, Colton, Fanda, Lora, Doyle, Werner, Friedman, Daniel, Dugan, Patricia, Melloni, Lucia, Reichart, Roi, Devore, Sasha, Flinker, Adeen, Hasenfratz, Liat, Levy, Omer, Hassidim, Avinatan, Brenner, Michael, Matias, Yossi, Norman, Kenneth A., Devinsky, Orrin, and Hasson, Uri
Published: 2022

21. A machine-learning based objective measure for ALS disease severity

Author: Vieira, Fernando G., Venugopalan, Subhashini, Premasiri, Alan S., McNally, Maeve, Jansen, Aren, McCloskey, Kevin, Brenner, Michael P., and Perrin, Steven
Published: 2022
Full Text: View/download PDF

22. CNN Architectures for Large-Scale Audio Classification

Author: Hershey, Shawn, Chaudhuri, Sourish, Ellis, Daniel P. W., Gemmeke, Jort F., Jansen, Aren, Moore, R. Channing, Plakal, Manoj, Platt, Devin, Saurous, Rif A., Seybold, Bryan, Slaney, Malcolm, Weiss, Ron J., and Wilson, Kevin
Subjects: Computer Science - Sound, Computer Science - Learning, Statistics - Machine Learning
Abstract: Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
Comment: Accepted for publication at ICASSP 2017. Changes: Added definitions of mAP, AUC, and d-prime. Updated mAP/AUC/d-prime numbers for Audio Set based on changes of latest Audio Set revision. Changed wording to fit 4 page limit with new additions.
Published: 2016

23. A segmental framework for fully-unsupervised large-vocabulary speech recognition

Author: Kamper, Herman, Jansen, Aren, and Goldwater, Sharon
Subjects: Computer Science - Computation and Language, Computer Science - Learning
Abstract: Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units, effectively performing unsupervised speech recognition. This article presents the first attempt we are aware of to apply such a system to large-vocabulary multi-speaker data. Our system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). Very high word error rates are reported (in the order of 70–80% for speaker-dependent and 80–95% for speaker-independent systems), highlighting the difficulty of this task. Nevertheless, in terms of cluster quality and word segmentation metrics, we show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system's discovered clusters are still less pure than those of unsupervised term discovery systems, but provide far greater coverage.
Comment: 15 pages, 6 figures, 8 tables
Published: 2016
Full Text: View/download PDF

24. Unsupervised word segmentation and lexicon discovery using acoustic word embeddings

Author: Kamper, Herman, Jansen, Aren, and Goldwater, Sharon
Subjects: Computer Science - Computation and Language
Abstract: In settings where only unlabelled speech data is available, speech technology needs to be developed without transcriptions, pronunciation dictionaries, or language modelling text. A similar problem is faced when modelling infant language acquisition. In these cases, categorical linguistic structure needs to be discovered directly from speech audio. We present a novel unsupervised Bayesian model that segments unlabelled speech and clusters the segments into hypothesized word groupings. The result is a complete unsupervised tokenization of the input speech in terms of discovered word types. In our approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional acoustic vector space. The model, implemented as a Gibbs sampler, then builds a whole-word acoustic model in this space while jointly performing segmentation. We report word error rates in a small-vocabulary connected digit recognition task by mapping the unsupervised decoded output to ground truth transcriptions. The model achieves around 20% error rate, outperforming a previous HMM-based system by about 10% absolute. Moreover, in contrast to the baseline, our model does not require a pre-specified vocabulary size.
Comment: 11 pages, 8 figures; Accepted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing
Published: 2016
Full Text: View/download PDF

25. Scalable Out-of-Sample Extension of Graph Embeddings Using Deep Neural Networks

Author: Jansen, Aren, Sell, Gregory, and Lyzinski, Vince
Subjects: Statistics - Machine Learning, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing, Statistics - Methodology
Abstract: Several popular graph embedding techniques for representation learning and dimensionality reduction rely on performing computationally expensive eigendecompositions to derive a nonlinear transformation of the input data space. The resulting eigenvectors encode the embedding coordinates for the training samples only, and so the embedding of novel data samples requires further costly computation. In this paper, we present a method for the out-of-sample extension of graph embeddings using deep neural networks (DNN) to parametrically approximate these nonlinear maps. Compared with traditional nonparametric out-of-sample extension methods, we demonstrate that the DNNs can generalize with equal or better fidelity and require orders of magnitude less computation at test time. Moreover, we find that unsupervised pretraining of the DNNs improves optimization for larger network sizes, thus removing sensitivity to model selection.
Comment: 10 pages, 2 figures, 1 table. This paper is under consideration for publication in Pattern Recognition Letters.
Published: 2015

26. A Framework for Evaluating Speech Representations

Author: Richter, Caitlin, Feldman, Naomi, Salgado, Harini, and Jansen, Aren
Subjects: Speech perception, speaker normalization, Bayesian modeling, approximate inference
Abstract: Listeners track distributions of speech sounds along perceptual dimensions. We introduce a method for evaluating hypotheses about what those dimensions are, using a cognitive model whose prior distribution is estimated directly from speech recordings. We use this method to evaluate two speaker normalization algorithms against human data. Simulations show that representations that are normalized across speakers predict human discrimination data better than unnormalized representations, consistent with previous research. Results further reveal differences across normalization methods in how well each predicts human data. This work provides a framework for evaluating hypothesized representations of speech and lays the groundwork for testing models of speech perception on natural speech recordings from ecologically valid settings.
Published: 2016

27. A segmental framework for fully-unsupervised large-vocabulary speech recognition

Author: Kamper, Herman, Jansen, Aren, and Goldwater, Sharon
Published: 2017
Full Text: View/download PDF

28. Dataset Balancing Can Hurt Model Performance

Author: Moore, R. Channing, primary, Ellis, Daniel P. W., additional, Fonseca, Eduardo, additional, Hershey, Shawn, additional, Jansen, Aren, additional, and Plakal, Manoj, additional
Published: 2023
Full Text: View/download PDF

29. V2Meow: Meowing to the Visual Beat via Music Generation

Author: Su, Kun, Li, Judith Yue, Huang, Qingqing, Kuzmin, Dima, Lee, Joonseok, Donahue, Chris, Sha, Fei, Jansen, Aren, Wang, Yu, Verzetti, Mauro, and Denk, Timo I.
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Machine Learning, Audio and Speech Processing (eess.AS), Computer Vision and Pattern Recognition (cs.CV), FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Sound, Computer Science - Multimedia, Machine Learning (cs.LG), Multimedia (cs.MM), Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Generating high quality music that complements the visual content of a video is a challenging task. Most existing visual conditioned music generation systems generate symbolic music data, such as MIDI files, instead of raw audio waveform. Given the limited availability of symbolic music data, such methods can only generate music for a few instruments or for specific types of visual input. In this paper, we propose a novel approach called V2Meow that can generate high-quality music audio that aligns well with the visual semantics of a diverse range of video input types. Specifically, the proposed music generation system is a multi-stage autoregressive model which is trained with a number of O(100K) music audio clips paired with video frames, which are mined from in-the-wild music videos, and no parallel symbolic music data is involved. V2Meow is able to synthesize high-fidelity music audio waveform solely conditioned on pre-trained visual features extracted from an arbitrary silent video clip, and it also allows high-level control over the music style of generation examples via supporting text prompts in addition to the video frames conditioning. Through both qualitative and quantitative evaluations, we demonstrate that our model outperforms several existing music generation systems in terms of both visual-audio correspondence and audio quality.
Published: 2023

30. BigSSL: Exploring the Frontier of Large-Scale Semi-Supervised Learning for Automatic Speech Recognition

Author: Zhang, Yu, primary, Park, Daniel S., additional, Han, Wei, additional, Qin, James, additional, Gulati, Anmol, additional, Shor, Joel, additional, Jansen, Aren, additional, Xu, Yuanzhong, additional, Huang, Yanping, additional, Wang, Shibo, additional, Zhou, Zongwei, additional, Li, Bo, additional, Ma, Min, additional, Chan, William, additional, Yu, Jiahui, additional, Wang, Yongqiang, additional, Cao, Liangliang, additional, Sim, Khe Chai, additional, Ramabhadran, Bhuvana, additional, Sainath, Tara N., additional, Beaufays, Francoise, additional, Chen, Zhifeng, additional, Le, Quoc V., additional, Chiu, Chung-Cheng, additional, Pang, Ruoming, additional, and Wu, Yonghui, additional
Published: 2022
Full Text: View/download PDF

31. Text-Driven Separation of Arbitrary Sounds

Author: Kilgour, Kevin, primary, Gfeller, Beat, additional, Huang, Qingqing, additional, Jansen, Aren, additional, Wisdom, Scott, additional, and Tagliasacchi, Marco, additional
Published: 2022
Full Text: View/download PDF

32. Universal Paralinguistic Speech Representations Using Self-Supervised Conformers

Author: Shor, Joel, primary, Jansen, Aren, additional, Han, Wei, additional, Park, Daniel, additional, and Zhang, Yu, additional
Published: 2022
Full Text: View/download PDF

33. MuLan: A Joint Embedding of Music Audio and Natural Language

Author: Huang, Qingqing, Jansen, Aren, Lee, Joonseok, Ganti, Ravi, Li, Judith Yue, and Ellis, Daniel P. W.
Subjects: FOS: Computer and information sciences, Sound (cs.SD), Computer Science - Computation and Language, Statistics - Machine Learning, Audio and Speech Processing (eess.AS), FOS: Electrical engineering, electronic engineering, information engineering, ismir, Machine Learning (stat.ML), Computation and Language (cs.CL), Computer Science - Sound, ismir2022, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Music tagging and content-based retrieval systems have traditionally been constructed using pre-defined ontologies covering a rigid set of music attributes or text queries. This paper presents MuLan: a first attempt at a new generation of acoustic models that link music audio directly to unconstrained natural language music descriptions. MuLan takes the form of a two-tower, joint audio-text embedding model trained using 44 million music recordings (370K hours) and weakly-associated, free-form text annotations. Through its compatibility with a wide range of music genres and text styles (including conventional music tags), the resulting audio-text representation subsumes existing ontologies while graduating to true zero-shot functionalities. We demonstrate the versatility of the MuLan embeddings with a range of experiments including transfer learning, zero-shot music tagging, language understanding in the music domain, and cross-modal retrieval applications.
Comment: To appear in ISMIR 2022
Published: 2022
Full Text: View/download PDF

34. A Machine-Learning Based Objective Measure for ALS Disease Severity

Author: Vieira, Fernando G, primary, Venugopalan, Subhashini, additional, Premasiri, Alan S, additional, McNally, Maeve, additional, Jansen, Aren, additional, McCloskey, Kevin, additional, Brenner, Michael P, additional, and Perrin, Steven, additional
Published: 2022
Full Text: View/download PDF

35. Self-Supervised Learning from Automatically Separated Sound Scenes

Author: Fonseca, Eduardo, primary, Jansen, Aren, additional, Ellis, Daniel P. W., additional, Wisdom, Scott, additional, Tagliasacchi, Marco, additional, Hershey, John R., additional, Plakal, Manoj, additional, Hershey, Shawn, additional, Moore, R. Channing, additional, and Serra, Xavier, additional
Published: 2021
Full Text: View/download PDF

36. Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-Wild Unsupervised Sound Separation

Author: Wisdom, Scott, primary, Jansen, Aren, additional, Weiss, Ron J., additional, Erdogan, Hakan, additional, and Hershey, John R., additional
Published: 2021
Full Text: View/download PDF

37. Point process models for event-based speech recognition

Author: Jansen, Aren and Niyogi, Partha
Published: 2009
Full Text: View/download PDF

38. The Benefit of Temporally-Strong Labels in Audio Event Classification

Author: Hershey, Shawn, primary, Ellis, Daniel P. W., additional, Fonseca, Eduardo, additional, Jansen, Aren, additional, Liu, Caroline, additional, Moore, R. Channing, additional, and Plakal, Manoj, additional
Published: 2021
Full Text: View/download PDF

39. A Convolutional Neural Network for Automated Detection of Humpback Whale Song in a Diverse, Long-Term Passive Acoustic Dataset

Author: Allen, Ann N., primary, Harvey, Matt, additional, Harrell, Lauren, additional, Jansen, Aren, additional, Merkens, Karlina P., additional, Wall, Carrie C., additional, Cattiau, Julie, additional, and Oleson, Erin M., additional
Published: 2021
Full Text: View/download PDF

40. Thinking ahead: spontaneous prediction in context as a keystone of language in humans and machines

Author: Goldstein, Ariel, primary, Zada, Zaid, additional, Buchnik, Eliav, additional, Schain, Mariano, additional, Price, Amy, additional, Aubrey, Bobbi, additional, Nastase, Samuel A., additional, Feder, Amir, additional, Emanuel, Dotan, additional, Cohen, Alon, additional, Jansen, Aren, additional, Gazula, Harshvardhan, additional, Choe, Gina, additional, Rao, Aditi, additional, Kim, Se Catherine, additional, Casto, Colton, additional, Fanda, Lora, additional, Doyle, Werner, additional, Friedman, Daniel, additional, Dugan, Patricia, additional, Melloni, Lucia, additional, Reichart, Roi, additional, Devore, Sasha, additional, Flinker, Adeen, additional, Hasenfratz, Liat, additional, Levy, Omer, additional, Hassidim, Avinatan, additional, Brenner, Michael, additional, Matias, Yossi, additional, Norman, Kenneth A., additional, Devinsky, Orrin, additional, and Hasson, Uri, additional
Published: 2020
Full Text: View/download PDF

41. Towards Learning a Universal Non-Semantic Representation of Speech

Author: Shor, Joel, primary, Jansen, Aren, additional, Maor, Ronnie, additional, Lang, Oran, additional, Tuval, Omry, additional, Quitry, Félix de Chaumont, additional, Tagliasacchi, Marco, additional, Shavitt, Ira, additional, Emanuel, Dotan, additional, and Haviv, Yinnon, additional
Published: 2020
Full Text: View/download PDF

42. Coincidence, Categorization, and Consolidation: Learning to Recognize Sounds with Minimal Supervision

Author: Jansen, Aren, primary, Ellis, Daniel P. W., additional, Hershey, Shawn, additional, Moore, R. Channing, additional, Plakal, Manoj, additional, Popat, Ashok C., additional, and Saurous, Rif A., additional
Published: 2020
Full Text: View/download PDF

43. Improving Universal Sound Separation Using Sound Classification

Author: Tzinis, Efthymios, primary, Wisdom, Scott, additional, Hershey, John R., additional, Jansen, Aren, additional, and Ellis, Daniel P. W., additional
Published: 2020
Full Text: View/download PDF

44. Large-Scale Weakly-Supervised Content Embeddings for Music Recommendation and Tagging

Author: Huang, Qingqing, primary, Jansen, Aren, additional, Zhang, Li, additional, Ellis, Daniel P. W., additional, Saurous, Rif A., additional, and Anderson, John, additional
Published: 2020
Full Text: View/download PDF

45. Addressing Missing Labels in Large-Scale Sound Event Recognition Using a Teacher-Student Framework With Loss Masking

Author: Fonseca, Eduardo, primary, Hershey, Shawn, additional, Plakal, Manoj, additional, Ellis, Daniel P. W., additional, Jansen, Aren, additional, and Moore, R. Channing, additional
Published: 2020
Full Text: View/download PDF

46. Semantically Meaningful Attributes from Co-Listen Embeddings for Playlist Exploration and Expansion

Author: Patwari, Ayush, Kong, Nicholas, Wang, Jun, Gargi, Ullas, Covell, Michele, and Jansen, Aren
Subjects: ARTIFICIAL neural networks, EMBEDDINGS (Mathematics), MUSIC, CLASSIFIERS (Linguistics), MACHINE learning
Abstract: Audio embeddings of musical similarity are often used for music recommendations and autoplay discovery. These embeddings are typically learned using co-listen data to train a deep neural network, to provide consistent triplet-loss distances. Instead of directly using these co-listen-based embeddings, we explore making recommendations based on a second, smaller embedding space of human-intelligible musical attributes. To do this, we use the co-listen-based audio embeddings as inputs to small attribute classifiers, trained on a small hand-labeled dataset. These classifiers map from the original embedding space to a new interpretable attribute coordinate system that provides a more useful distance measure for downstream applications. The attributes and attribute embeddings allow us to provide a search interface and more intelligible recommendations for music curators. We examine the relative performance of these two embedding spaces (the co-listen-audio embedding and the attribute embedding) for the mathematical separation of thematic playlists. We also report on the usefulness of recommendations from the attribute-embedding space to human curators for automatically extending thematic playlists.
Published: 2020

47. Temporal Dynamics of Meaning

Author: Goldstein, Ariel, primary, Jansen, Aren, additional, Slaney, Malcolm, additional, Price, Amy, additional, Zada, Zaid Kokaja, additional, Choe, Gina, additional, Aubrey, Bobbi, additional, Rao, Aditi, additional, Fanda, Lora, additional, Norman, Kenneth, additional, Flinker, Adeen, additional, Devinsky, Orrin, additional, Brenner, Michael, additional, and Hasson, Uri, additional
Published: 2019
Full Text: View/download PDF

48. Scalable out-of-sample extension of graph embeddings using deep neural networks

Author: Jansen, Aren, Sell, Gregory, and Lyzinski, Vince
Published: 2017
Full Text: View/download PDF

49. Unsupervised Learning of Semantic Audio Representations

Author: Jansen, Aren, primary, Plakal, Manoj, additional, Pandya, Ratheet, additional, Ellis, Daniel P. W., additional, Hershey, Shawn, additional, Liu, Jiayang, additional, Moore, R. Channing, additional, and Saurous, Rif A., additional
Published: 2018
Full Text: View/download PDF

50. Evaluating Low-Level Speech Features Against Human Perceptual Data

Author: Richter, Caitlin, primary, Feldman, Naomi H., additional, Salgado, Harini, additional, and Jansen, Aren, additional
Published: 2017
Full Text: View/download PDF
