Author: "Lee, Seong-Whan" / Publication Year Range: Last 3 years - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Lee, Seong-Whan"' showing total 339 results

1. Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

Author: Lee, Sang-Hoon, Choi, Ha-Yeong, and Lee, Seong-Whan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient waveform generation model via adversarial flow matching optimization. Recently, conditional flow matching (CFM) generative models have been successfully adopted for waveform generation tasks, leveraging a single vector field estimation objective for training. Although these models can generate high-fidelity waveform signals, they require significantly more ODE steps than GAN-based models, which need only a single generation step. Additionally, the generated samples often lack high-frequency information because noisy vector field estimation fails to ensure high-frequency reproduction. To address these limitations, we enhance pre-trained CFM-based generative models by incorporating a fixed-step generator modification. We use reconstruction losses and adversarial feedback to accelerate high-fidelity waveform generation. With adversarial flow matching optimization, only 1,000 steps of fine-tuning are required to achieve state-of-the-art performance across various objective metrics. Moreover, we reduce the number of inference steps from 16 to 2 or 4. Additionally, by scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance, with a perceptual evaluation of speech quality (PESQ) score of 4.454 on the LibriTTS dataset. Audio samples, source code and checkpoints will be available at https://github.com/sh-lee-prml/PeriodWave.
Comment: 9 pages, 9 tables, 1 figure
Published: 2024
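Note: the step counts quoted in the abstract above (16 versus 2 or 4) refer to the number of ODE solver steps taken at inference. The following is a minimal sketch of fixed-step Euler sampling, assuming a toy straight-line vector field in place of the trained network. The names `toy_vector_field` and `euler_sample` are hypothetical and do not come from the PeriodWave code.

```python
def toy_vector_field(x, t, target):
    """Toy stand-in for a learned vector field (an assumption, not the paper's model).

    It points from the current sample toward `target` along a straight path.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0, target, num_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with `num_steps` Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the division above is safe
        v = toy_vector_field(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

With a learned field, each Euler step costs one network evaluation, so cutting the step count cuts the number of evaluations in proportion.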

2. PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Author: Lee, Sang-Hoon, Choi, Ha-Yeong, and Lee, Seong-Whan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Electrical Engineering and Systems Science - Audio and Speech Processing, Electrical Engineering and Systems Science - Signal Processing
Abstract: Recently, universal waveform generation tasks have been investigated conditioned on various out-of-distribution scenarios. Although GAN-based methods have shown their strength in fast waveform generation, they are vulnerable to train-inference mismatch scenarios such as two-stage text-to-speech. Meanwhile, diffusion-based models have shown their powerful generative performance in other domains; however, they stay out of the limelight due to slow inference speed in waveform generation tasks. Above all, there is no generator architecture that can explicitly disentangle the natural periodic features of high-resolution waveform signals. In this paper, we propose PeriodWave, a novel universal waveform generation model. First, we introduce a period-aware flow matching estimator that can capture the periodic features of the waveform signal when estimating the vector fields. Additionally, we utilize a multi-period estimator that avoids overlaps to capture different periodic features of waveform signals. Although increasing the number of periods can improve the performance significantly, it also increases the computational cost. To mitigate this issue, we also propose a single period-conditional universal estimator that can run feed-forward in parallel via period-wise batch inference. Additionally, we utilize the discrete wavelet transform to losslessly disentangle the frequency information of waveform signals for high-frequency modeling, and introduce FreeU to reduce the high-frequency noise for waveform generation. The experimental results demonstrate that our model outperforms previous models in both Mel-spectrogram reconstruction and text-to-speech tasks. All source code will be available at https://github.com/sh-lee-prml/PeriodWave.
Comment: 24 pages, 16 tables, 4 figures
Published: 2024

3. EmoSphere-TTS: Emotional Style and Intensity Modeling via Spherical Emotion Vector for Controllable Emotional Text-to-Speech

Author: Cho, Deok-Hyeon, Oh, Hyung-Seok, Kim, Seung-Bin, Lee, Sang-Hoon, and Lee, Seong-Whan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Despite rapid advances in the field of emotional text-to-speech (TTS), recent studies primarily focus on mimicking the average style of a particular emotion. As a result, the ability to manipulate speech emotion remains constrained to several predefined labels, compromising the ability to reflect the nuanced variations of emotion. In this paper, we propose EmoSphere-TTS, which synthesizes expressive emotional speech by using a spherical emotion vector to control the emotional style and intensity of the synthetic speech. Without any human annotation, we use the arousal, valence, and dominance pseudo-labels to model the complex nature of emotion via a Cartesian-spherical transformation. Furthermore, we propose a dual conditional adversarial network to improve the quality of generated speech by reflecting the multi-aspect characteristics. The experimental results demonstrate the model's ability to control emotional style and intensity with high-quality expressive speech.
Comment: Accepted at INTERSPEECH 2024
Published: 2024
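Note: the Cartesian-spherical transformation mentioned in the abstract above can be illustrated with the standard coordinate change. This is a minimal sketch only. The axis assignment (arousal, valence, dominance as x, y, z), the optional `center` point, and the reading of the radius as intensity and the angles as style are assumptions made for illustration, since the abstract does not state the paper's exact convention.

```python
import math


def cartesian_to_spherical(arousal, valence, dominance, center=(0.0, 0.0, 0.0)):
    """Convert an (arousal, valence, dominance) point to (r, theta, phi).

    `center` is a hypothetical reference point (for example, a neutral emotion).
    r     : distance from the center (read here as intensity, an assumption)
    theta : polar angle measured from the z axis, in [0, pi]
    phi   : azimuth in the x-y plane, in [-pi, pi]
    """
    x = arousal - center[0]
    y = valence - center[1]
    z = dominance - center[2]
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        # The angles are undefined at the center; return zeros by convention.
        return 0.0, 0.0, 0.0
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    theta = math.acos(max(-1.0, min(1.0, z / r)))
    phi = math.atan2(y, x)
    return r, theta, phi
```

Under this reading, scaling a point away from the center changes only `r`, while `theta` and `phi` stay fixed.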

4. TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression

Author: Kim, Ho-Joong, Hong, Jung-Ho, Kong, Heejo, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: In this paper, we show that normalized coordinate expression is a key factor behind the reliance on hand-crafted components in query-based detectors for temporal action detection (TAD). Despite significant advancements towards an end-to-end framework in object detection, query-based detectors have been limited in achieving full end-to-end modeling in TAD. To address this issue, we propose TE-TAD, a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression. We reformulate coordinate expression utilizing actual timeline values, ensuring length-invariant representations across extremely diverse video durations. Furthermore, our proposed adaptive query selection dynamically adjusts the number of queries based on video length, providing a more suitable solution for varying video durations than a fixed query set. Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors. Our TE-TAD outperforms the previous query-based detectors and achieves competitive performance compared to state-of-the-art methods on popular benchmark datasets. Code is available at: https://github.com/Dotori-HJ/TE-TAD
Published: 2024

5. Explaining generative diffusion models via visual analysis for interpretable decision-making process

Author: Park, Ji-Hoon, Ju, Yeong-Joon, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, 68T01
Abstract: Diffusion models have demonstrated remarkable performance in generation tasks. Nevertheless, explaining the diffusion process remains challenging because it is a sequence of denoising steps over noisy images that are difficult for experts to interpret. To address this issue, we propose three research questions to interpret the diffusion process from the perspective of the visual concepts generated by the model and the region where the model attends in each time step. We devise tools for visualizing the diffusion process and answering the aforementioned research questions to render the diffusion process human-understandable. Through experiments with various visual analyses using these tools, we show how the output is progressively generated in the diffusion process by explaining the level of denoising and highlighting relationships to foundational visual concepts at each time step. Throughout the training of the diffusion model, the model learns diverse visual concepts corresponding to each time step, enabling the model to predict varying levels of visual concepts at different stages. We substantiate our tools using Area Under Cover (AUC) score, correlation quantification, and cross-attention mapping. Our findings provide insights into the diffusion process and pave the way for further research into explainable diffusion mechanisms.
Comment: 22 pages, published in Expert Systems with Applications
Published: 2024

6. AM-SORT: Adaptable Motion Predictor with Historical Trajectory Embedding for Multi-Object Tracking

Author: Kim, Vitaliy, Jung, Gunho, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Many multi-object tracking (MOT) approaches, which employ the Kalman Filter as a motion predictor, assume constant velocity and Gaussian-distributed filtering noises. These assumptions render the Kalman Filter-based trackers effective in linear motion scenarios. However, these linear assumptions serve as a key limitation when estimating future object locations within scenarios involving non-linear motion and occlusions. To address this issue, we propose a motion-based MOT approach with an adaptable motion predictor, called AM-SORT, which adapts to estimate non-linear uncertainties. AM-SORT is a novel extension of the SORT-series trackers that supersedes the Kalman Filter with the transformer architecture as a motion predictor. We introduce a historical trajectory embedding that empowers the transformer to extract spatio-temporal features from a sequence of bounding boxes. AM-SORT achieves competitive performance compared to state-of-the-art trackers on DanceTrack, with 56.3 IDF1 and 55.6 HOTA. We conduct extensive experiments to demonstrate the effectiveness of our method in predicting non-linear movement under occlusions.
Published: 2024

7. TIFu: Tri-directional Implicit Function for High-Fidelity 3D Character Reconstruction

Author: Lim, Byoungsung and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recent advances in implicit function-based approaches have shown promising results in 3D human reconstruction from a single RGB image. However, these methods are not sufficient to extend to more general cases, often generating dragged or disconnected body parts, particularly for animated characters. We argue that these limitations stem from the use of the existing point-level 3D shape representation, which lacks holistic 3D context understanding. Voxel-based reconstruction methods are more suitable for capturing the entire 3D space at once, however, these methods are not practical for high-resolution reconstructions due to their excessive memory usage. To address these challenges, we introduce Tri-directional Implicit Function (TIFu), which is a vector-level representation that increases global 3D consistencies while significantly reducing memory usage compared to voxel representations. We also introduce a new algorithm in 3D reconstruction at an arbitrary resolution by aggregating vectors along three orthogonal axes, resolving inherent problems with regressing fixed dimension of vectors. Our approach achieves state-of-the-art performances in both our self-curated character dataset and the benchmark 3D human dataset. We provide both quantitative and qualitative analyses to support our findings.
Published: 2024

8. Edge Conditional Node Update Graph Neural Network for Multi-variate Time Series Anomaly Detection

Author: Jo, Hayoung and Lee, Seong-Whan
Subjects: Computer Science - Machine Learning
Abstract: With the rapid advancement in cyber-physical systems, the increasing number of sensors has significantly complicated manual monitoring of system states. Consequently, graph-based time-series anomaly detection methods have gained attention due to their ability to explicitly represent relationships between sensors. However, these methods often apply a uniform source node representation across all connected target nodes, even when updating different target node representations. Moreover, the graph attention mechanism, commonly used to infer unknown graph structures, could constrain the diversity of source node representations. In this paper, we introduce the Edge Conditional Node-update Graph Neural Network (ECNU-GNN). Our model, equipped with an edge conditional node update module, dynamically transforms source node representations based on connected edges to represent target nodes aptly. We validate performance on three real-world datasets: SWaT, WADI, and PSM. Our model demonstrates 5.4%, 12.4%, and 6.0% higher performance, respectively, compared to best F1 baseline models.
Published: 2024

9. Appearance Debiased Gaze Estimation via Stochastic Subject-Wise Adversarial Learning

Author: Kim, Suneung, Nam, Woo-Jeoung, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Recently, appearance-based gaze estimation has been attracting attention in computer vision, and remarkable improvements have been achieved using various deep learning techniques. Despite such progress, most methods aim to infer gaze vectors from images directly, which causes overfitting to person-specific appearance factors. In this paper, we address these challenges and propose a novel framework: Stochastic subject-wise Adversarial gaZE learning (SAZE), which trains a network to generalize the appearance of subjects. We design a Face generalization Network (Fgen-Net) using a face-to-gaze encoder, a face identity classifier, and a proposed adversarial loss. The proposed loss generalizes face appearance factors so that the identity classifier infers a uniform probability distribution. In addition, the Fgen-Net is trained by a learning mechanism that optimizes the network by reselecting a subset of subjects at every training step to avoid overfitting. Our experimental results verify the robustness of the method in that it yields state-of-the-art performance, achieving 3.89 and 4.42 on the MPIIGaze and EyeDiap datasets, respectively. Furthermore, we demonstrate the positive generalization effect by conducting further experiments using face images of different styles produced by a generative model.
Published: 2024

10. TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data

Author: Kim, Seung-Bin, Lee, Sang-Hoon, and Lee, Seong-Whan
Subjects: Computer Science - Computation and Language, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Although there has been significant advancement in the field of speech-to-speech translation, conventional models still require language-parallel speech data between the source and target languages for training. In this paper, we introduce TranSentence, a novel speech-to-speech translation method that requires no language-parallel speech data. To achieve this, we first adopt a language-agnostic sentence-level speech encoding that captures the semantic information of speech, irrespective of language. We then train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder that is pre-trained with various languages. With this method, despite training exclusively on the target language's monolingual data, we can generate target language speech in the inference stage using language-agnostic speech embedding from the source language speech. Furthermore, we extend TranSentence to multilingual speech-to-speech translation. The experimental results demonstrate that TranSentence is superior to other models.
Comment: Accepted by ICASSP 2024
Published: 2024

11. DurFlex-EVC: Duration-Flexible Emotional Voice Conversion with Parallel Generation

Author: Oh, Hyung-Seok, Lee, Sang-Hoon, Cho, Deok-Hyeon, and Lee, Seong-Whan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Emotional voice conversion involves modifying the pitch, spectral envelope, and other acoustic characteristics of speech to match a desired emotional state while maintaining the speaker's identity. Recent advances in EVC involve simultaneously modeling pitch and duration by exploiting the potential of sequence-to-sequence models. In this study, we focus on parallel speech generation to increase the reliability and efficiency of conversion. We introduce a duration-flexible EVC (DurFlex-EVC) that integrates a style autoencoder and a unit aligner. The previous variable-duration parallel generation model required text-to-speech alignment. We consider self-supervised model representation and discrete speech units to be the core of our parallel generation. The style autoencoder promotes content style disentanglement by separating the source style of the input features and applying them with the target style. The unit aligner encodes unit-level features by modeling emotional context. Furthermore, we enhance the style of the features with a hierarchical stylize encoder and generate high-quality Mel-spectrograms with a diffusion-based generator. The effectiveness of the approach has been validated through subjective and objective evaluations and has been demonstrated to be superior to baseline models.
Comment: 14 pages, 11 figures, 12 tables
Published: 2024

12. Towards Better Visualizing the Decision Basis of Networks via Unfold and Conquer Attribution Guidance

Author: Hong, Jung-Ho, Nam, Woo-Jeoung, Jeon, Kyu-Sung, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Revealing the transparency of Deep Neural Networks (DNNs) has been widely studied to describe the decision mechanisms of network inner structures. In this paper, we propose a novel post-hoc framework, Unfold and Conquer Attribution Guidance (UCAG), which enhances the explainability of the network decision by spatially scrutinizing the input features with respect to the model confidence. Addressing the phenomenon of missing detailed descriptions, UCAG sequentially complies with the confidence of slices of the image, leading to providing an abundant and clear interpretation. Therefore, it is possible to enhance the representation ability of explanation by preserving the detailed descriptions of assistant input features, which are commonly overwhelmed by the main meaningful regions. We conduct numerous evaluations to validate the performance in several metrics: i) deletion and insertion, ii) (energy-based) pointing games, and iii) positive and negative density maps. Experimental results, including qualitative comparisons, demonstrate that our method outperforms the existing methods with the nature of clear and detailed explanations and applicability.
Comment: 9 pages, 5 figures, Accepted paper in AAAI Conference on Artificial Intelligence (AAAI), 2023
Published: 2023

13. Sparse Multitask Learning for Efficient Neural Representation of Motor Imagery and Execution

Author: Shin, Hye-Bin, Yin, Kang, and Lee, Seong-Whan
Subjects: Computer Science - Machine Learning, Electrical Engineering and Systems Science - Signal Processing
Abstract: In the quest for efficient neural network models for neural data interpretation and user intent classification in brain-computer interfaces (BCIs), learning meaningful sparse representations of the underlying neural subspaces is crucial. The present study introduces a sparse multitask learning framework for motor imagery (MI) and motor execution (ME) tasks, inspired by the natural partitioning of associated neural subspaces observed in the human brain. Given a dual-task CNN model for MI-ME classification, we apply a saliency-based sparsification approach to prune superfluous connections and reinforce those that show high importance in both tasks. Through our approach, we seek to elucidate the distinct and common neural ensembles associated with each task, employing principled sparsification techniques to eliminate redundant connections and boost the fidelity of neural signal decoding. Our results indicate that this tailored sparsity can mitigate the overfitting problem and improve the test performance with a small amount of data, suggesting a viable path forward for computationally efficient and robust BCI systems.
Published: 2023

14. Neural Speech Embeddings for Speech Synthesis Based on Deep Generative Networks

Author: Lee, Seo-Hyun, Lee, Young-Eun, Kim, Soowon, Ko, Byung-Kwan, Kim, Jun-Young, and Lee, Seong-Whan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Brain-to-speech technology represents a fusion of interdisciplinary applications encompassing fields of artificial intelligence, brain-computer interfaces, and speech synthesis. Intention decoding and speech synthesis based on neural representation learning directly connect neural activity to the means of human linguistic communication, which may greatly enhance the naturalness of communication. With the current discoveries on representation learning and the development of speech synthesis technologies, direct translation of brain signals into speech has shown great promise. In particular, the processed input features and neural speech embeddings given to the neural network play a significant role in overall performance when deep generative models are used for speech generation from brain signals. In this paper, we introduce the current brain-to-speech technology with the possibility of speech synthesis from brain signals, which may ultimately facilitate innovation in non-verbal communication. We also perform a comprehensive analysis of the neural features and neural speech embeddings underlying the neurophysiological activation during speech, which may play a significant role in speech synthesis work.
Comment: 4 pages
Published: 2023

15. HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

Author: Lee, Sang-Hoon, Choi, Ha-Yeong, Kim, Seung-Bin, and Lee, Seong-Whan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis. However, such models require large-scale data and share the limitations of previous autoregressive speech models, including slow inference speed and lack of robustness. This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC). We verified that hierarchical speech synthesis frameworks could significantly improve the robustness and expressiveness of the synthetic speech. Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios. For text-to-speech, we adopt the text-to-vec framework, which generates a self-supervised speech representation and an F0 representation based on text representations and prosody prompts. Then, HierSpeech++ generates speech from the generated vector, F0, and voice prompt. We further introduce a highly efficient speech super-resolution framework from 16 kHz to 48 kHz. The experimental results demonstrated that the hierarchical variational autoencoder could be a strong zero-shot speech synthesizer given that it outperforms LLM-based and diffusion-based models. Moreover, we achieved the first human-level quality zero-shot speech synthesis. Audio samples and source code are available at https://github.com/sh-lee-prml/HierSpeechpp.
Comment: 16 pages, 9 figures, 12 tables
Published: 2023

16. Neurophysiological Response Based on Auditory Sense for Brain Modulation Using Monaural Beat

Author: Jo, Ha-Na, Kweon, Young-Seok, Shin, Gi-Hwan, Kwak, Heon-Gyu, and Lee, Seong-Whan
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Human-Computer Interaction
Abstract: Brain modulation is a modification process of brain activity through external stimulations. However, which conditions can induce the activation is still unclear. Therefore, we aimed to identify brain activation conditions using a 40 Hz monaural beat (MB). Under this stimulation, the auditory sense status, which is determined by frequency and power range, is the condition to consider. Hence, we designed five sessions to compare: no stimulation, audible (AB), inaudible in frequency, inaudible in power, and inaudible in frequency and power. Ten healthy participants underwent each stimulation session for ten minutes with electroencephalogram (EEG) recording. For analysis, we calculated the power spectral density (PSD) of EEG for each session and compared them in frequency, time, and five brain regions. As a result, we observed a prominent power peak at 40 Hz only in AB. The induced EEG amplitude increase started at one minute and continued until the end of the session. These AB results differed significantly from the other stimulations in the frontal, central, temporal, parietal, and occipital regions. In the statistical analysis, the PSD of the right temporal region was significantly higher than that of the left. We conclude that the auditory sense plays an important role in inducing brain activation. These findings help to understand the neurophysiological principle and effects of auditory stimulation.
Comment: Accepted to EMBC 2023
Published: 2023

17. Impact of Nap on Performance in Different Working Memory Tasks Using EEG

Author: Shin, Gi-Hwan, Kweon, Young-Seok, Kwak, Heon-Gyu, Jo, Ha-Na, and Lee, Seong-Whan
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Human-Computer Interaction
Abstract: Electroencephalography (EEG) has been widely used to study the relationship between naps and working memory, yet the effects of naps on distinct working memory tasks remain unclear. Here, participants performed word-pair and visuospatial working memory tasks pre- and post-nap sessions. We found marked differences in accuracy and reaction time between tasks performed pre- and post-nap. In order to identify the impact of naps on performance in each working memory task, we employed clustering to classify participants as high- or low-performers. Analysis of sleep architecture revealed significant variations in sleep onset latency and rapid eye movement (REM) proportion. In addition, the two groups exhibited prominent differences, especially in the delta power of the Non-REM 3 stage linked to memory. Our results emphasize the interplay between nap-related neural activity and working memory, underlining specific EEG markers associated with cognitive performance.
Comment: Submitted to 2024 12th IEEE International Winter Conference on Brain-Computer Interface
Published: 2023

18. Improving Generalization of Drowsiness State Classification by Domain-Specific Normalization

Author: Kim, Dong-Young, Han, Dong-Kyun, Park, Seo-Hyeon, Jang, Geun-Deok, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Abnormal driver states, particularly drowsiness, have been major concerns for road safety, emphasizing the importance of accurate drowsiness detection to prevent accidents. Electroencephalogram (EEG) signals are recognized for their effectiveness in monitoring a driver's mental state by monitoring brain activities. However, the challenge lies in the requirement for prior calibration due to the variation of EEG signals among and within individuals. The necessity of calibration has made the brain-computer interface (BCI) less accessible. We propose a practical generalized framework for classifying driver drowsiness states to improve accessibility and convenience. We separate the normalization process for each driver, treating them as individual domains. The goal of developing a general model is similar to that of domain generalization. The framework considers the statistics of each domain separately since they vary among domains. We experimented with various normalization methods to enhance the ability to generalize across subjects, i.e., the model's generalization performance on unseen domains. The experiments showed that applying individual domain-specific normalization yielded an outstanding improvement in generalizability. Furthermore, our framework demonstrates the potential and accessibility by removing the need for calibration in BCI applications.
Comment: Submitted to 2024 12th IEEE International Winter Conference on Brain-Computer Interface
Published: 2023
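Note: the idea of treating each driver as a separate domain and normalizing with that domain's own statistics can be sketched as follows. This is a minimal illustration only. Per-subject z-scoring of a single scalar feature is an assumption made for simplicity (the abstract says only that various normalization methods were tried), and `normalize_per_domain` is a hypothetical name.

```python
from statistics import mean, pstdev


def normalize_per_domain(features_by_subject, eps=1e-8):
    """Z-score each subject's feature values using that subject's own statistics.

    features_by_subject: dict mapping a subject id to a list of scalar features.
    eps: small constant that avoids division by zero for constant inputs.
    """
    normalized = {}
    for subject, values in features_by_subject.items():
        mu = mean(values)       # per-subject (per-domain) mean
        sigma = pstdev(values)  # per-subject population standard deviation
        normalized[subject] = [(v - mu) / (sigma + eps) for v in values]
    return normalized
```

Two subjects whose raw values differ only in offset and scale map to nearly identical normalized values, which is the property that lets a shared classifier be applied to an unseen subject without a calibration session.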

19. Relationship Between Mood, Sleepiness, and EEG Functional Connectivity by 40 Hz Monaural Beats

Author: Jo, Ha-Na, Kweon, Young-Seok, Shin, Gi-Hwan, Kwak, Heon-Gyu, and Lee, Seong-Whan
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Human-Computer Interaction
Abstract: Monaural beats are known to modulate brain and personal states. However, which changes in brain waves are related to changes in state is still unclear. Therefore, we aimed to investigate the effects of monaural beats and find the relationship between them. Ten participants took part in five separate random sessions, which included a baseline session and four sessions with monaural beat stimulation: one audible session and three inaudible sessions. Electroencephalogram (EEG) signals were recorded, and participants completed pre- and post-stimulation questionnaires assessing mood and sleepiness. As a result, the audible session led to increased arousal and positive mood compared to other conditions. In the neurophysiological analysis, statistical differences in frontal-central, central-central, and central-parietal connectivity were observed only in the audible session. Furthermore, a significant correlation was identified between sleepiness and EEG power in the temporal and occipital regions. These results suggest a more detailed relationship between stimulation and changes in personal state. These findings have implications for applications in areas such as cognitive enhancement, mood regulation, and sleep management.
Published: 2023

20. Brain-Driven Representation Learning Based on Diffusion Model

Author: Kim, Soowon, Lee, Seo-Hyun, Lee, Young-Eun, Lee, Ji-Won, Park, Ji-Ha, and Lee, Seong-Whan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Interpreting EEG signals linked to spoken language presents a complex challenge, given the data's intricate temporal and spatial attributes, as well as the various noise factors. Denoising diffusion probabilistic models (DDPMs), which have recently gained prominence in diverse areas for their capabilities in representation learning, are explored in our research as a means to address this issue. Using DDPMs in conjunction with a conditional autoencoder, our new approach considerably outperforms traditional machine learning algorithms and established baseline models in accuracy. Our results highlight the potential of DDPMs as a sophisticated computational method for the analysis of speech-related EEG signals. This could lead to significant advances in brain-computer interfaces tailored for spoken communication.
Published: 2023

21. Influence of Video Dynamics on EEG-based Single-Trial Video Target Surveillance System

Author: Kwak, Heon-Gyu, Kim, Sung-Jin, Han, Hyeon-Taek, Jeong, Ji-Hoon, and Lee, Seong-Whan
Subjects: Computer Science - Human-Computer Interaction
Abstract: Target detection models are among the widely used deep learning-based applications for reducing human effort in video surveillance and patrol. However, the application of conventional computer vision-based target detection models in military usage can result in limited performance, due to the lack of sample data of hostile targets. In this paper, we present the possibility of an electroencephalography-based video target detection model, which could be applied as a supportive module of a military video surveillance system. The proposed framework and detection model showed promising performance, achieving a mean macro F-beta of 0.6522 with asynchronous real-time data from five subjects for a certain video stimulus, but not for some other video stimuli. By analyzing the results of experiments using each video stimulus, we present the factors that would affect the performance of electroencephalography-based video target detection models. Comment: 2024 International BCI winter conference accepted paper
Published: 2023

22. Multi-Signal Reconstruction Using Masked Autoencoder From EEG During Polysomnography

Author: Kweon, Young-Seok, Shin, Gi-Hwan, Kwak, Heon-Gyu, Jo, Ha-Na, and Lee, Seong-Whan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Signal Processing
Abstract: Polysomnography (PSG) is an indispensable diagnostic tool in sleep medicine, essential for identifying various sleep disorders. By capturing physiological signals, including EEG, EOG, EMG, and cardiorespiratory metrics, PSG presents a patient's sleep architecture. However, its dependency on complex equipment and expertise confines its use to specialized clinical settings. Addressing these limitations, our study aims to perform PSG by developing a system that requires only a single EEG measurement. We propose a novel system capable of reconstructing multi-signal PSG from a single-channel EEG based on a masked autoencoder. The masked autoencoder was trained and evaluated using the Sleep-EDF-20 dataset, with mean squared error as the metric for assessing the similarity between original and reconstructed signals. The model demonstrated proficiency in reconstructing multi-signal data. Our results present promise for the development of more accessible and long-term sleep monitoring systems. This suggests the expansion of PSG's applicability, enabling its use beyond the confines of clinics. Comment: Proc. 12th IEEE International Winter Conference on Brain-Computer Interface
Published: 2023

23. Sample Dominance Aware Framework via Non-Parametric Estimation for Spontaneous Brain-Computer Interface

Author: Lee, Byeong-Hoo, Kwon, Byoung-Hee, and Lee, Seong-Whan
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Signal Processing
Abstract: Deep learning has shown promise in decoding brain signals, such as electroencephalogram (EEG), in the field of brain-computer interfaces (BCIs). However, the non-stationary characteristics of EEG signals pose challenges for training neural networks to acquire appropriate knowledge. Inconsistent EEG signals resulting from these non-stationary characteristics can lead to poor performance. Therefore, it is crucial to investigate and address sample inconsistency to ensure robust performance in spontaneous BCIs. In this study, we introduce the concept of sample dominance as a measure of EEG signal inconsistency and propose a method to modulate its effect on network training. We present a two-stage dominance score estimation technique that compensates for performance degradation caused by sample inconsistencies. Our proposed method utilizes non-parametric estimation to infer sample inconsistency and assigns each sample a dominance score. This score is then aggregated with the loss function during training to modulate the impact of sample inconsistency. Furthermore, we design a curriculum learning approach that gradually increases the influence of inconsistent signals during training to improve overall performance. We evaluate our proposed method using a public spontaneous BCI dataset. The experimental results highlight the importance of addressing sample dominance for achieving robust performance in spontaneous BCIs. Comment: 5 pages, 2 figures
Published: 2023

24. Pioneering EEG Motor Imagery Classification Through Counterfactual Analysis

Author: Yin, Kang, Shin, Hye-Bin, Kim, Hee-Dong, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The application of counterfactual explanation (CE) techniques in the realm of electroencephalography (EEG) classification has been relatively infrequent in contemporary research. In this study, we attempt to introduce and explore a novel non-generative approach to CE, specifically tailored for the analysis of EEG signals. This innovative approach assesses the model's decision-making process by strategically swapping patches derived from time-frequency analyses. By meticulously examining the variations and nuances introduced in the classification outcomes through this method, we aim to derive insights that can enhance interpretability. The empirical results obtained from our experimental investigations serve not only to validate the efficacy of our proposed approach but also to reinforce human confidence in the model's predictive capabilities. Consequently, these findings underscore the significance and potential value of conducting further, more extensive research in this promising direction.
Published: 2023

25. A Distributed Inference System for Detecting Task-wise Single Trial Event-Related Potential in Stream of Satellite Images

Author: Kim, Sung-Jin, Kwak, Heon-Gyu, Han, Hyeon-Taek, Lee, Dae-Hyeok, Jeong, Ji-Hoon, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition
Abstract: Brain-computer interfaces (BCIs) have garnered significant attention for their potential in various applications, with event-related potentials (ERPs) playing a considerable role in BCI systems. This paper introduces a novel Distributed Inference System tailored for detecting task-wise single-trial ERPs in a stream of satellite images. Unlike traditional methodologies that employ a single model for target detection, our system utilizes multiple models, each optimized for specific tasks, ensuring enhanced performance across varying image transition times and target onset times. Our experiments, conducted on four participants, employed two paradigms: the Normal paradigm and an AI paradigm with bounding boxes. Results indicate that our proposed system outperforms the conventional methods in both paradigms, achieving the highest $F_{\beta}$ scores. Furthermore, including bounding boxes in the AI paradigm significantly improved target recognition. This study underscores the potential of our Distributed Inference System in advancing the field of ERP detection in satellite image streams.
Published: 2023

26. Decoding EEG-based Workload Levels Using Spatio-temporal Features Under Flight Environment

Author: Lee, Dae-Hyeok, Kim, Sung-Jin, Kim, Si-Hyun, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The detection of pilots' mental states is important due to the potential for their abnormal mental states to result in catastrophic accidents. This study introduces the feasibility of employing deep learning techniques to classify different workload levels, specifically normal state, low workload, and high workload. To the best of our knowledge, this study is the first attempt to classify workload levels of pilots. Our approach involves a hybrid deep neural network that consists of five convolutional blocks and one long short-term memory block to extract the significant features from electroencephalography signals. Ten pilots participated in the experiment, which was conducted within a simulated flight environment. Compared with four conventional models, our proposed model achieved a superior grand-average accuracy of 0.8613, surpassing the conventional models by at least 0.0597 in classifying workload levels across all participants. Our model not only successfully classified workload levels but also provided valuable feedback to the participants. Hence, we anticipate that our study will make significant contributions to the advancement of autonomous flight and driving leveraging artificial intelligence technology in the future. Comment: 5 pages, 3 figures, 1 table, 1 algorithm
Published: 2023

27. Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

Author: Choi, Ha-Yeong, Lee, Sang-Hoon, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still have an inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which can effectively generate F0 with the target voice style. Subsequently, the generated F0 is fed to DiffVoice to convert the speech with a target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model can improve the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance, and our model also achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios. Comment: INTERSPEECH 2023 (Oral)
Published: 2023

28. NeuroInspect: Interpretable Neuron-based Debugging Framework through Class-conditional Visualizations

Author: Ju, Yeong-Joon, Park, Ji-Hoon, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Although deep learning (DL) has achieved remarkable progress in various domains, DL models are still prone to making mistakes. This issue necessitates effective debugging tools for DL practitioners to interpret the decision-making process within the networks. However, existing debugging methods often demand extra data or adjustments to the decision process, limiting their applicability. To tackle this problem, we present NeuroInspect, an interpretable neuron-based debugging framework with three key stages: counterfactual explanations, feature visualizations, and false correlation mitigation. Our debugging framework first pinpoints neurons responsible for mistakes in the network and then visualizes features embedded in the neurons to be human-interpretable. To provide these explanations, we introduce CLIP-Illusion, a novel feature visualization method that generates images representing features conditioned on classes to examine the connection between neurons and the decision layer. We alleviate convoluted explanations of the conventional visualization approach by employing class information, thereby isolating mixed properties. This process offers more human-interpretable explanations for model errors without altering the trained network or requiring additional data. Furthermore, our framework mitigates false correlations learned from a dataset under a stochastic perspective, modifying decisions for the neurons considered as the main causes. We validate the effectiveness of our framework by addressing false correlations and improving inferences for classes with the worst performance in real-world settings. Moreover, we demonstrate that NeuroInspect helps debug the mistakes of DL models through evaluation for human understanding. The code is openly available at https://github.com/yeongjoonJu/NeuroInspect.
Published: 2023

29. WUDI: A Human Involved Self-Adaptive Framework to Prevent Childhood Obesity in Internet of Things Environment

Author: Lee, Euijong, Jung, Jaemin, Moon, Gee-Myung, Lee, Seong-Whan, and Jeong, Ji-Hoon
Subjects: Computer Science - Software Engineering
Abstract: The Internet of Things (IoT) connects people, devices, and information resources in various domains to improve efficiency. The healthcare domain has been transformed by the integration of the IoT, leading to the development of digital healthcare solutions such as health monitoring, emergency detection, and remote operation. This integration has led to an increase in the health data collected from a variety of IoT sources. Consequently, advanced technologies are required to analyze health data, and artificial intelligence has been employed to extract meaningful insights from the data. Childhood overweight and obesity have emerged as some of the most serious global public health challenges, as they can lead to a variety of health-related problems and the early development of chronic diseases. To address this, a self-adaptive framework is proposed to prevent childhood obesity by using lifelog data from IoT environments, with human involvement being an important consideration in the framework. The framework uses an ensemble-based learning model to predict obesity using the lifelog data. Empirical experiments using lifelog data from smartphone applications were conducted to validate the effectiveness of human involvement and obesity prediction. The results demonstrated the efficiency of the proposed framework with human involvement in obesity prediction. The proposed framework can be applied in real-world healthcare services for childhood obesity.
Published: 2023

30. DeepHealthNet: Adolescent Obesity Prediction System Based on a Deep Learning Framework

Author: Jeong, Ji-Hoon, Lee, In-Gyu, Kim, Sung-Kyung, Kam, Tae-Eui, Lee, Seong-Whan, and Lee, Euijong
Subjects: Computer Science - Artificial Intelligence, Computer Science - Software Engineering
Abstract: Childhood and adolescent obesity rates are a global concern because obesity is associated with chronic diseases and long-term health risks. Artificial intelligence technology has emerged as a promising solution to accurately predict obesity rates and provide personalized feedback to adolescents. This study emphasizes the importance of early identification and prevention of obesity-related health issues. Factors such as height, weight, waist circumference, calorie intake, physical activity levels, and other relevant health information need to be considered for developing robust algorithms for obesity rate prediction and delivering personalized feedback. Hence, by collecting health datasets from 321 adolescents, we proposed an adolescent obesity prediction system that provides personalized predictions and assists individuals in making informed health decisions. Our proposed deep learning framework, DeepHealthNet, effectively trains the model using data augmentation techniques, even when daily health data are limited, resulting in improved prediction accuracy (acc: 0.8842). Additionally, the study revealed variations in the prediction of the obesity rate between boys (acc: 0.9320) and girls (acc: 0.9163), allowing the identification of disparities and the determination of the optimal time to provide feedback. The proposed system shows significant potential in effectively addressing childhood and adolescent obesity.
Published: 2023

31. Local-Global Temporal Fusion Network with an Attention Mechanism for Multiple and Multiclass Arrhythmia Classification

Author: Kim, Yun Kwan, Lee, Minji, Jo, Kunwook, Song, Hee Seok, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Signal Processing, Computer Science - Machine Learning, 68T07, 92C55
Abstract: Clinical decision support systems (CDSSs) have been widely utilized to support the decisions made by cardiologists when detecting and classifying arrhythmia from electrocardiograms (ECGs). However, forming a CDSS for the arrhythmia classification task is challenging due to the varying lengths of arrhythmias. Although the onset time of arrhythmia varies, previously developed methods have not considered such conditions. Thus, we propose a framework that consists of (i) local temporal information extraction, (ii) global pattern extraction, and (iii) local-global information fusion with attention to perform arrhythmia detection and classification with a constrained input length. The 10-class and 4-class performances of our approach were assessed by detecting the onset and offset of arrhythmia as an episode and the duration of arrhythmia based on the MIT-BIH arrhythmia database (MITDB) and MIT-BIH atrial fibrillation database (AFDB), respectively. The results were statistically superior to those achieved by the comparison models. To check the generalization ability of the proposed method, an AFDB-trained model was tested on the MITDB, and superior performance was attained compared with that of a state-of-the-art model. The proposed method can capture local-global information and dynamics without incurring information losses. Therefore, arrhythmias can be recognized more accurately, and their occurrence times can be calculated; thus, the clinical field can create more accurate treatment plans by using the proposed method. Comment: 14 pages, 6 figures
Published: 2023

32. DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training

Author: Oh, Hyung-Seok, Lee, Sang-Hoon, and Lee, Seong-Whan
Subjects: Computer Science - Sound, Computer Science - Computation and Language, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Expressive text-to-speech systems have undergone significant advancements owing to prosody modeling, but conventional methods can still be improved. Traditional approaches have relied on the autoregressive method to predict the quantized prosody vector; however, it suffers from the issues of long-term dependency and slow inference. This study proposes a novel approach called DiffProsody in which expressive speech is synthesized using a diffusion-based latent prosody generator and prosody conditional adversarial training. Our findings confirm the effectiveness of our prosody generator in generating a prosody vector. Furthermore, our prosody conditional discriminator significantly improves the quality of the generated speech by accurately emulating prosody. We use denoising diffusion generative adversarial networks to improve the prosody generation speed. Consequently, DiffProsody is capable of generating prosody 16 times faster than the conventional diffusion model. The superior performance of our proposed method has been demonstrated via experiments. Comment: 10 pages, 8 figures, 5 tables, under review
Published: 2023

33. HierVST: Hierarchical Adaptive Zero-shot Voice Style Transfer

Author: Lee, Sang-Hoon, Choi, Ha-Yeong, Oh, Hyung-Seok, and Lee, Seong-Whan
Subjects: Computer Science - Sound, Computer Science - Artificial Intelligence, Computer Science - Multimedia, Electrical Engineering and Systems Science - Audio and Speech Processing
Abstract: Despite rapid progress in the voice style transfer (VST) field, recent zero-shot VST systems still lack the ability to transfer the voice style of a novel speaker. In this paper, we present HierVST, a hierarchical adaptive end-to-end zero-shot VST model. Without any text transcripts, we only use the speech dataset to train the model by utilizing hierarchical variational inference and self-supervised representation. In addition, we adopt a hierarchical adaptive generator that generates the pitch representation and waveform audio sequentially. Moreover, we utilize unconditional generation to improve the speaker-relative acoustic capacity in the acoustic representation. With a hierarchical adaptive structure, the model can adapt to a novel voice style and convert speech progressively. The experimental results demonstrate that our method outperforms other VST models in zero-shot VST scenarios. Audio samples are available at https://hiervst.github.io/. Comment: INTERSPEECH 2023 (Oral)
Published: 2023

34. Diff-E: Diffusion-based Learning for Decoding Imagined Speech EEG

Author: Kim, Soowon, Lee, Young-Eun, Lee, Seo-Hyun, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning, 68T10
Abstract: Decoding EEG signals for imagined speech is a challenging task due to the high-dimensional nature of the data and low signal-to-noise ratio. In recent years, denoising diffusion probabilistic models (DDPMs) have emerged as promising approaches for representation learning in various domains. Our study proposes a novel method for decoding EEG signals for imagined speech using DDPMs and a conditional autoencoder named Diff-E. Results indicate that Diff-E significantly improves the accuracy of decoding EEG signals for imagined speech compared to traditional machine learning techniques and baseline models. Our findings suggest that DDPMs can be an effective tool for EEG signal decoding, with potential implications for the development of brain-computer interfaces that enable communication through imagined speech. Comment: Accepted to Interspeech 2023
Published: 2023

35. PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and Pause-based Prosody Modeling

Author: Hwang, Ji-Sang, Lee, Sang-Hoon, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Although text-to-speech (TTS) systems have significantly improved, most TTS systems still have limitations in synthesizing speech with appropriate phrasing. For natural speech synthesis, it is important to synthesize the speech with a phrasing structure that groups words into phrases based on semantic information. In this paper, we propose PauseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling. First, we introduce a phrasing structure encoder that utilizes a context representation from the pre-trained language model. In the phrasing structure encoder, we extract a speaker-dependent syntactic representation from the context representation and then predict a pause sequence that separates the input text into phrases. Furthermore, we introduce a pause-based word encoder to model word-level prosody based on the pause sequence. Experimental results show PauseSpeech outperforms previous models in terms of naturalness. Furthermore, in terms of objective evaluations, we can observe that our proposed methods help the model decrease the distance between ground-truth and synthesized speech. Audio samples are available at https://jisang93.github.io/pausespeech-demo/. Comment: 13 pages, 4 figures, 3 tables, under review
Published: 2023

36. HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models

Author: Hwang, Ji-Sang, Lee, Sang-Hoon, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Recently, denoising diffusion models have demonstrated remarkable performance among generative models in various domains. However, in the speech domain, the application of diffusion models for synthesizing time-varying audio faces limitations in terms of complexity and controllability, as speech synthesis requires very high-dimensional samples with long-term acoustic features. To alleviate the challenges posed by model complexity in singing voice synthesis, we propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models. To ensure high-fidelity audio, we introduce an audio autoencoder that can encode audio into an audio codec as a compressed representation and reconstruct the high-fidelity audio from the low-dimensional compressed latent vector. Subsequently, we use the latent diffusion models to sample a latent representation from a musical score. In addition, our proposed model is extended to an unsupervised singing voice learning framework, HiddenSinger-U, to train the model using an unlabeled singing voice dataset. Experimental results demonstrate that our model outperforms previous models in terms of audio quality. Furthermore, the HiddenSinger-U can synthesize high-quality singing voices of speakers trained solely on unlabeled data. Comment: 11 pages, 5 figures, 5 tables, under review
Published: 2023

37. DDDM-VC: Decoupled Denoising Diffusion Models with Disentangled Representation and Prior Mixup for Verified Robust Voice Conversion

Author: Choi, Ha-Yeong, Lee, Sang-Hoon, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Artificial Intelligence, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Diffusion-based generative models have exhibited powerful generative performance in recent years. However, as many attributes exist in the data distribution and owing to several limitations of sharing the model parameters across all levels of the generation process, it remains challenging to control specific styles for each attribute. To address the above problem, this paper presents decoupled denoising diffusion models (DDDMs) with disentangled representations, which can control the style for each attribute in generative models. We apply DDDMs to voice conversion (VC) tasks to address the challenges of disentangling and controlling each speech attribute (e.g., linguistic information, intonation, and timbre). First, we use a self-supervised representation to disentangle the speech representation. Subsequently, the DDDMs are applied to resynthesize the speech from the disentangled representations for denoising with respect to each attribute. Moreover, we propose the prior mixup for robust voice style transfer, which uses the converted representation of the mixed style as a prior distribution for the diffusion models. The experimental results reveal that our method outperforms publicly available VC models. Furthermore, we show that our method provides robust generative performance regardless of the model size. Audio samples are available at https://hayeong0.github.io/DDDM-VC-demo/. Comment: 23 pages, 10 figures, 17 tables, under review
Published: 2023

38. Towards Voice Reconstruction from EEG during Imagined Speech

Author: Lee, Young-Eun, Lee, Seo-Hyun, Kim, Sang-Ho, and Lee, Seong-Whan
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Human-Computer Interaction, Computer Science - Sound, Electrical Engineering and Systems Science - Signal Processing
Abstract: Translating imagined speech from human brain activity into voice is a challenging and absorbing research issue that can provide new means of human communication via brain signals. Endeavors toward reconstructing speech from brain activity have shown their potential using invasive measures of spoken speech data; however, they have faced challenges in reconstructing imagined speech. In this paper, we propose NeuroTalk, which converts non-invasive brain signals of imagined speech into the user's own voice. Our model was trained with spoken speech EEG which was generalized to adapt to the domain of imagined speech, thus allowing natural correspondence between the imagined speech and the voice as a ground truth. In our framework, an automatic speech recognition decoder contributed to decomposing the phonemes of generated speech, thereby displaying the potential of voice reconstruction from unseen words. Our results imply the potential of speech synthesis from human EEG signals, not only from spoken speech but also from the brain signals of imagined speech. Comment: 9 pages, 4 figures, accepted paper of AAAI 2023 in main track
Published: 2023

39. Towards Neural Decoding of Imagined Speech based on Spoken Speech

Author: Lee, Seo-Hyun, Lee, Young-Eun, Kim, Soowon, Ko, Byung-Kwan, and Lee, Seong-Whan
Subjects: Computer Science - Human-Computer Interaction
Abstract: Decoding imagined speech from human brain signals is a challenging and important issue that may enable human communication via brain signals. While imagined speech can be the paradigm for silent communication via brain signals, it is always hard to collect enough stable data to train the decoding model. Meanwhile, spoken speech data is relatively easy to obtain, implying the significance of utilizing spoken speech brain signals to decode imagined speech. In this paper, we performed a preliminary analysis to find out whether it would be possible to utilize spoken speech electroencephalography data to decode imagined speech, by simply applying the pre-trained model trained with spoken speech brain signals to decode imagined speech. While the classification performance with imagined speech data used solely for training and validation was 30.5 %, the transferred performance of the spoken speech based classifier on imagined speech data displayed an average accuracy of 26.8 %, which did not differ significantly from the imagined speech based classifier (p = 0.0983, chi-square = 4.64). For a more comprehensive analysis, we compared the result with the visual imagery dataset, which would naturally be less related to spoken speech compared to the imagined speech. As a result, visual imagery showed a solely trained performance of 31.8 % and a transferred performance of 26.3 %, which differed significantly from each other (p = 0.022, chi-square = 7.64). Our results imply the potential of applying spoken speech to decode imagined speech, as well as their underlying common features. Comment: 4 pages, 2 figures
Published: 2022

40. Kinematic-aware Hierarchical Attention Network for Human Pose Estimation in Videos

Author: Jin, Kyung-Min, Lim, Byoung-Sung, Lee, Gun-Hee, Kang, Tae-Kyung, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Previous video-based human pose estimation methods have shown promising results by leveraging aggregated features of consecutive frames. However, most approaches compromise accuracy to mitigate jitter or do not sufficiently comprehend the temporal aspects of human motion. Furthermore, occlusion increases uncertainty between consecutive frames, which results in unsmooth results. To address these issues, we design an architecture that exploits the keypoint kinematic features with the following components. First, we effectively capture the temporal features by leveraging individual keypoint's velocity and acceleration. Second, the proposed hierarchical transformer encoder aggregates spatio-temporal dependencies and refines the 2D or 3D input pose estimated from existing estimators. Finally, we provide an online cross-supervision between the refined input pose generated from the encoder and the final pose from our decoder to enable joint optimization. We demonstrate comprehensive results and validate the effectiveness of our model in various tasks: 2D pose estimation, 3D pose estimation, body mesh recovery, and sparsely annotated multi-human pose estimation. Our code is available at https://github.com/KyungMinJin/HANet.
Published: 2022
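
The abstract above mentions per-keypoint velocity and acceleration as temporal features. A minimal sketch of one common way to obtain such quantities is a first and second finite difference over frames. This is an illustrative assumption, not the authors' implementation (see their repository for that); the function names and the toy 2D pose sequence are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Frame = List[Point]  # one (x, y) pair per keypoint

def finite_difference(frames: List[Frame]) -> List[Frame]:
    """Frame-to-frame difference per keypoint; returns len(frames) - 1 frames."""
    return [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]

def keypoint_kinematics(poses: List[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Velocity is the first difference of positions, acceleration the second."""
    velocity = finite_difference(poses)
    acceleration = finite_difference(velocity)
    return velocity, acceleration

# Toy sequence: 3 frames, 2 keypoints (values are made up for illustration).
poses = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(1.0, 0.0), (1.0, 2.0)],
    [(3.0, 0.0), (1.0, 4.0)],
]
velocity, acceleration = keypoint_kinematics(poses)
print(velocity)
print(acceleration)
```

With T input frames this yields T - 1 velocity frames and T - 2 acceleration frames; how a model pads or aligns them with the pose sequence is a separate design choice not covered here.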

41. Spatial Reasoning for Few-Shot Object Detection

Author: Kim, Geonuk, Jung, Hong-Gyu, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Although modern object detectors rely heavily on a significant amount of training data, humans can easily detect novel objects using a few training examples. The mechanism of the human visual system is to interpret spatial relationships among various objects and this process enables us to exploit contextual information by considering the co-occurrence of objects. Thus, we propose a spatial reasoning framework that detects novel objects with only a few training examples in a context. We infer geometric relatedness between novel and base RoIs (Region-of-Interests) to enhance the feature representation of novel categories using an object detector well trained on base categories. We employ a graph convolutional network as the RoIs and their relatedness are defined as nodes and edges, respectively. Furthermore, we present spatial data augmentation to overcome the few-shot environment where all objects and bounding boxes in an image are resized randomly. Using the PASCAL VOC and MS COCO datasets, we demonstrate that the proposed method significantly outperforms the state-of-the-art methods and verify its efficacy through extensive ablation studies., Comment: Pattern Recognition, Vol.120, 2021
Published: 2022

42. Instance-Ambiguity Weighting for Multi-label Recognition with Limited Annotations

Author: Shrewsbury, Daniel, Kim, Suneung, Kim, Young-Eun, Kong, Heejo, and Lee, Seong-Whan
Editors: Goos, Gerhard (Founding Editor), Hartmanis, Juris (Founding Editor), Bertino, Elisa (Editorial Board Member), Gao, Wen (Editorial Board Member), Steffen, Bernhard (Editorial Board Member), Yung, Moti (Editorial Board Member), Yang, De-Nian, Xie, Xing, Tseng, Vincent S., Pei, Jian, Huang, Jen-Wei, and Lin, Jerry Chun-Wei
Published: 2024
Full Text: View/download PDF

43. OTPose: Occlusion-Aware Transformer for Pose Estimation in Sparsely-Labeled Videos

Author: Jin, Kyung-Min, Lee, Gun-Hee, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition, 68T45
Abstract: Although many approaches for multi-human pose estimation in videos have shown profound results, they require densely annotated data, which entails excessive manual labor. Furthermore, occlusion and motion blur inevitably lead to poor estimation performance. To address these problems, we propose a method that leverages an attention mask for occluded joints and encodes temporal dependency between frames using transformers. First, our framework composes different combinations of sparsely annotated frames that denote the track of the overall joint movement. We propose an occlusion attention mask from these combinations that enables encoding occlusion-aware heatmaps as a semi-supervised task. Second, the proposed temporal encoder employs a transformer architecture to effectively aggregate the temporal relationship and keypoint-wise attention from each time step and accurately refine the target frame's final pose estimation. We achieve state-of-the-art pose estimation results for the PoseTrack2017 and PoseTrack2018 datasets and demonstrate the robustness of our approach to occlusion and motion blur in sparsely annotated video data., Comment: 6 pages
Published: 2022

44. HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers

Author: Kang, Tae-Kyung, Lee, Gun-Hee, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition, 68T45
Abstract: Temporal action localization (TAL) is the task of identifying a set of actions in a video, which involves localizing the start and end frames and classifying each action instance. Existing methods have addressed this task by using predefined anchor windows or heuristic bottom-up boundary-matching strategies, which are major bottlenecks in inference time. Additionally, the main challenge is the inability to capture long-range actions due to a lack of global contextual information. In this paper, we present a novel anchor-free framework, referred to as HTNet, which predicts a set of triplets from a video based on a Transformer architecture. After the prediction of coarse boundaries, we refine them through a background feature sampling (BFS) module and hierarchical Transformers, which enable our model to aggregate global contextual information and effectively exploit the inherent semantic relationships in a video. We demonstrate how our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets: THUMOS14 and ActivityNet 1.3., Comment: 6 pages
Published: 2022

45. Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection

Author: Lee, Joo-Yeon, Nam, Woo-Jeoung, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Video Anomaly Detection (VAD) has traditionally been tackled with two main methodologies: the reconstruction-based approach and the prediction-based one. As the reconstruction-based methods learn to generalize the input image, the model merely learns an identity function, which causes the so-called generalizing issue. On the other hand, since the prediction-based ones learn to predict a future frame given several previous frames, they are less sensitive to the generalizing issue. However, it is still uncertain whether the model can learn the spatio-temporal context of a video. Our intuition is that understanding the spatio-temporal context of a video plays a vital role in VAD, as it provides precise information on how the appearance of an event in a video clip changes. Hence, to fully exploit the context information for anomaly detection in video, we designed a transformer model with three different contextual prediction streams: masked, whole and partial. By learning to predict the missing frames of consecutive normal frames, our model can effectively learn various normality patterns in the video, which leads to a high reconstruction error in abnormal cases that do not fit the learned context. To verify the effectiveness of our approach, we assess our model on the public benchmark datasets UCSD Pedestrian 2, CUHK Avenue and ShanghaiTech, and evaluate the performance with the anomaly score metric of reconstruction error. The results demonstrate that our proposed approach achieves competitive performance compared to existing video anomaly detection methods.
Published: 2022

46. Neural Architecture Adaptation for Object Detection by Searching Channel Dimensions and Mapping Pre-trained Parameters

Author: Jung, Harim, Oh, Myeong-Seok, Yang, Cheoljong, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Most object detection frameworks use backbone architectures originally designed for image classification, conventionally with pre-trained parameters on ImageNet. However, image classification and object detection are essentially different tasks and there is no guarantee that the optimal backbone for classification is also optimal for object detection. Recent neural architecture search (NAS) research has demonstrated that automatically designing a backbone specifically for object detection helps improve the overall accuracy. In this paper, we introduce a neural architecture adaptation method that can optimize the given backbone for detection purposes, while still allowing the use of pre-trained parameters. We propose to adapt both the micro- and macro-architecture by searching for specific operations and the number of layers, in addition to the output channel dimensions of each block. It is important to find the optimal channel depth, as it greatly affects the feature representation capability and computation cost. We conduct experiments with our searched backbone for object detection and demonstrate that our backbone outperforms both manually designed and searched state-of-the-art backbones on the COCO dataset., Comment: Accepted to ICPR 2022
Published: 2022

47. Factorization Approach for Sparse Spatio-Temporal Brain-Computer Interface

Author: Lee, Byeong-Hoo, Cho, Jeong-Hyun, Kwon, Byoung-Hee, and Lee, Seong-Whan
Subjects: Computer Science - Artificial Intelligence, Electrical Engineering and Systems Science - Signal Processing
Abstract: Recently, advanced technologies have shown unlimited potential in solving various problems with a large amount of data. However, these technologies have yet to show competitive performance in brain-computer interfaces (BCIs), which deal with brain signals. Brain signals are difficult to collect in large quantities; in particular, the amount of information would be sparse in spontaneous BCIs. In addition, we conjecture that high spatial and temporal similarities between tasks increase the prediction difficulty. We define this problem as the sparse condition. To solve this, a factorization approach is introduced to allow the model to obtain distinct representations from latent space. To this end, we propose two feature extractors: a class-common module, which is trained through adversarial learning and acts as a generator, and a class-specific module, which utilizes a loss function generated from classification so that features are extracted with traditional methods. To minimize the latent space shared by the class-common and class-specific features, the model is trained under an orthogonal constraint. As a result, EEG signals are factorized into two separate latent spaces. Evaluations were conducted on a single-arm motor imagery dataset. From the results, we demonstrated that factorizing the EEG signal allows the model to extract rich and decisive features under the sparse condition., Comment: 8 pages
Published: 2022

48. Gradient Hedging for Intensively Exploring Salient Interpretation beyond Neuron Activation

Author: Nam, Woo-Jeoung and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: Hedging is a strategy for reducing the potential risks in various types of investments by adopting an opposite position in a related asset. Motivated by the equity technique, we introduce a method for decomposing output predictions into intensive salient attributions by hedging the evidence for a decision. We analyze the conventional approach applied to the evidence for a decision and discuss the paradox of the conservation rule. Subsequently, we define the viewpoint of evidence as a gap of positive and negative influence among the gradient-derived initial contribution maps and propagate the antagonistic elements to the evidence as suppressors, following the criterion of the degree of positive attribution defined by user preference. In addition, we reflect the severance or sparseness contribution of inactivated neurons, which are mostly irrelevant to a decision, resulting in increased robustness to interpretability. We conduct the following assessments in a verified experimental environment: pointing game, most relevant first region insertion, outside-inside relevance ratio, and mean average precision on the PASCAL VOC 2007, MS COCO 2014, and ImageNet datasets. The results demonstrate that our method outperforms existing attribution methods in distinctive, intensive, and intuitive visualization with robustness and applicability in general models.
Published: 2022

49. Instance-Ambiguity Weighting for Multi-label Recognition with Limited Annotations

Author: Shrewsbury, Daniel, primary, Kim, Suneung, additional, Kim, Young-Eun, additional, Kong, Heejo, additional, and Lee, Seong-Whan, additional
Published: 2024
Full Text: View/download PDF

50. Few-Shot Object Detection with Proposal Balance Refinement

Author: Kim, Sueyeon, Nam, Woo-Jeoung, and Lee, Seong-Whan
Subjects: Computer Science - Computer Vision and Pattern Recognition
Abstract: Few-shot object detection has gained significant attention in recent years as it has the potential to greatly reduce the reliance on large amounts of manually annotated bounding boxes. While most existing few-shot object detection literature primarily focuses on bounding box classification by obtaining feature embeddings that are as discriminative as possible, we emphasize the necessity of handling the lack of intersection-over-union (IoU) variations induced by a biased distribution of novel samples. In this paper, we analyze the IoU imbalance that is caused by the relatively high number of low-quality region proposals, and reveal that it plays a critical role in improving few-shot learning capabilities. The well-known two-stage fine-tuning technique causes insufficient quality and quantity of the novel positive samples, which hinders the effective object detection of unseen novel classes. To alleviate this issue, we present a few-shot object detection model with proposal balance refinement, a simple yet effective approach to learning object proposals using an auxiliary sequential bounding box refinement process. This process enables the detector to be optimized on the various IoU scores through additional novel class samples. To fully exploit our sequential stage architecture, we revise the fine-tuning strategy and expose the Region Proposal Network to the novel classes in order to provide increased learning opportunities for the region-of-interest (RoI) classifiers and regressors. Our extensive assessments on PASCAL VOC and COCO demonstrate that our framework substantially outperforms other existing few-shot object detection approaches.
Published: 2022
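
The abstract above centers on intersection-over-union (IoU) between region proposals and ground-truth boxes. For readers unfamiliar with the measure, here is a minimal sketch of the standard IoU computation for axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the function name are assumptions made for illustration; the paper's own code may use a different convention.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes; 0.0 if they do not overlap."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap rectangle, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))
```

A proposal is typically counted as a positive sample when its IoU with a ground-truth box exceeds a chosen threshold, so a proposal set dominated by low-IoU boxes yields few high-quality positives, which is the kind of imbalance the abstract describes.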
