1,355 results for "Data imputation"
Search Results
2. Maize yield prediction with trait-missing data via bipartite graph neural network.
- Author
-
Kaiyi Wang, Yanyun Han, Yuqing Zhang, Yong Zhang, Shufeng Wang, Feng Yang, Chunqing Liu, Dongfeng Zhang, Tiangang Lu, Like Zhang, and Zhongqiang Liu
- Subjects
GRAPH neural networks ,DATA structures ,BIPARTITE graphs ,PLANTING ,AGRICULTURAL policy ,CORN ,DEEP learning - Abstract
The timely and accurate prediction of maize (Zea mays L.) yields prior to harvest is critical for food security and agricultural policy development. Currently, many researchers are using machine learning and deep learning to predict maize yields in specific regions with high accuracy. However, existing methods typically have two limitations. One is that they ignore the extensive correlation in maize planting data, such as the association of maize yields between adjacent planting locations and the combined effect of meteorological features and maize traits on maize yields. The other issue is that the performance of existing models may suffer significantly when some data in maize planting records is missing, or the samples are unbalanced. Therefore, this paper proposes an end-to-end bipartite graph neural network-based model for trait data imputation and yield prediction. The maize planting data is initially converted to a bipartite graph data structure. Then, a yield prediction model based on a bipartite graph neural network is developed to impute missing trait data and predict maize yield. This model can mine correlations between different samples of data, correlations between different meteorological features and traits, and correlations between different traits. Finally, to address the issue of unbalanced sample size at each planting location, we propose a loss function based on the gradient balancing mechanism that effectively reduces the impact of data imbalance on the prediction model. When compared to other data imputation and prediction models, our method achieves the best yield prediction result even when missing data is not pre-processed. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
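The first step the abstract above describes, converting a maize planting table into a bipartite graph of sample nodes and trait nodes, can be sketched as follows. This is a minimal illustration under an assumed data layout, not the authors' code; `table_to_bipartite` is a hypothetical name.

```python
import numpy as np

def table_to_bipartite(X):
    """Convert a samples-by-traits matrix with NaNs into bipartite graph parts.

    Observed entries become weighted edges (sample node, trait node, value);
    missing entries are the edges the GNN is asked to predict.
    """
    n_samples, n_traits = X.shape
    observed, missing = [], []
    for i in range(n_samples):
        for j in range(n_traits):
            if np.isnan(X[i, j]):
                missing.append((i, j))            # edge to impute
            else:
                observed.append((i, j, X[i, j]))  # known edge with weight
    return observed, missing

# toy planting table: 3 samples x 2 traits, one missing trait value
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])
obs, mis = table_to_bipartite(X)
```

A message-passing layer would then aggregate over these edges so that sample nodes see trait information and vice versa.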
3. Life cycle assessment of metal powder production: a Bayesian stochastic Kriging model-based autonomous estimation.
- Author
-
Xiao, Haibo, Gao, Baoyun, Yu, Shoukang, Liu, Bin, Cao, Sheng, and Peng, Shitong
- Subjects
ALLOY powders ,TITANIUM powder ,NICKEL alloys ,PRODUCT life cycle assessment ,TITANIUM alloys ,METAL powders - Abstract
Metal powder contributes to the environmental burdens of additive manufacturing (AM) substantially. Current life cycle assessments (LCAs) of metal powders present considerable variations of lifecycle environmental inventory due to process divergence, spatial heterogeneity, or temporal fluctuation. Most importantly, the amounts of LCA studies on metal powder are limited and primarily confined to partial material types. To this end, based on the data surveyed from a metal powder supplier, this study conducted an LCA of titanium and nickel alloy produced by electrode-inducted and vacuum-inducted melting gas atomization, respectively. Given that energy consumption dominates the environmental burden of powder production and is influenced by metal materials' physical properties, we proposed a Bayesian stochastic Kriging model to estimate the energy consumption during the gas atomization process. This model considered the inherent uncertainties of training data and adaptively updated the parameters of interest when new environmental data on gas atomization were available. With the predicted energy use information of specific powder, the corresponding lifecycle environmental impacts can be further autonomously estimated in conjunction with the other surveyed powder production stages. Results indicated the environmental impact of titanium alloy powder is slightly higher than that of nickel alloy powder and their lifecycle carbon emissions are around 20 kg CO2 equivalency. The proposed Bayesian stochastic Kriging model showed more accurate predictions of energy consumption compared with conventional Kriging and stochastic Kriging models. This study enables data imputation of energy consumption during gas atomization given the physical properties and producing technique of powder materials. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
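The core of the Kriging estimator described above is a Gaussian-process posterior mean with a noise term. The sketch below shows that core in plain NumPy on assumed toy data; the paper's model additionally places Bayesian priors on the kernel parameters and updates them as new gas-atomization data arrive, which this sketch omits.

```python
import numpy as np

def kriging_predict(X, y, Xs, ls=1.0, var=1.0, noise=0.1):
    """GP (Kriging) posterior mean with an RBF kernel plus a noise term.

    The noise term plays the role of the stochastic component; a Bayesian
    treatment would also infer ls/var/noise instead of fixing them.
    """
    def k(a, b):
        d = a[:, None] - b[None, :]
        return var * np.exp(-0.5 * (d / ls) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    return k(Xs, X) @ np.linalg.solve(K, y)

# toy data: energy use vs. a single scaled material property
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0, 4.0])   # near-linear response
pred = kriging_predict(X, y, np.array([1.5]))
```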
4. A novel 8-connected Pixel Identity GAN with Neutrosophic (ECP-IGANN) for missing imputation.
- Author
-
Mahmoud, Gamal M., Elbaz, Mostafa, Alqahtani, Fayez, Alginahi, Yasser, and Said, Wael
- Subjects
-
COMPUTER vision ,IMAGE processing ,IMAGE reconstruction ,REMOTE-sensing images ,MISSING data (Statistics) - Abstract
Missing pixel imputation presents a critical challenge in image processing and computer vision, particularly in applications such as image restoration and inpainting. The primary objective of this paper is to accurately estimate and reconstruct missing pixel values to restore complete visual information. This paper introduces a novel model called the Enhanced Connected Pixel Identity GAN with Neutrosophic (ECP-IGANN), which is designed to address two fundamental issues inherent in existing GAN architectures for missing pixel generation: (1) mode collapse, which leads to a lack of diversity in generated pixels, and (2) the preservation of pixel integrity within the reconstructed images. ECP-IGANN incorporates two key innovations to improve missing pixel imputation. First, an identity block is integrated into the generation process to facilitate the retention of existing pixel values and ensure consistency. Second, the model calculates the values of the 8-connected neighbouring pixels around each missing pixel, thereby enhancing the coherence and integrity of the imputed pixels. The efficacy of ECP-IGANN was rigorously evaluated through extensive experimentation across five diverse datasets: BigGAN-ImageNet, the 2024 Medical Imaging Challenge Dataset, the Autonomous Vehicles Dataset, the 2024 Satellite Imagery Dataset, and the Fashion and Apparel Dataset 2024. These experiments assessed the model's performance in terms of diversity, pixel imputation accuracy, and mode collapse mitigation, with results demonstrating significant improvements in the Inception Score (IS) and Fréchet Inception Distance (FID). ECP-IGANN markedly enhanced image segmentation performance in the validation phase across all datasets. Key metrics, such as Dice Score, Accuracy, Precision, and Recall, were improved substantially for various segmentation models, including Spatial Attention U-Net, Dense U-Net, and Residual Attention U-Net. 
For example, in the 2024 Medical Imaging Challenge Dataset, the Residual Attention U-Net's Dice Score increased from 0.84 to 0.90, while accuracy improved from 0.88 to 0.93 following the application of ECP-IGANN. Similar performance enhancements were observed with the other datasets, highlighting the model's robust generalizability across diverse imaging domains. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
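The 8-connected neighbour calculation the abstract describes can be illustrated with a simple neighbourhood average. This shows only the neighbourhood idea, not the GAN generator that consumes these values in ECP-IGANN; the function name is hypothetical.

```python
import numpy as np

def impute_8_connected(img):
    """Fill each NaN pixel with the mean of its valid 8-connected neighbours."""
    out = img.copy()
    h, w = img.shape
    for i, j in zip(*np.where(np.isnan(img))):
        vals = [img[i + di, j + dj]
                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if not (di == 0 and dj == 0)           # skip the pixel itself
                and 0 <= i + di < h and 0 <= j + dj < w  # stay inside the image
                and not np.isnan(img[i + di, j + dj])]   # only observed pixels
        if vals:
            out[i, j] = np.mean(vals)
    return out

img = np.array([[1.0, 1.0, 1.0],
                [1.0, np.nan, 1.0],
                [1.0, 1.0, 1.0]])
filled = impute_8_connected(img)
```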
5. Comprehensive data optimization and risk prediction framework: machine learning methods for inflammatory bowel disease prediction based on the human gut microbiome data.
- Author
-
Yan Peng, Yue Liu, Yifei Liu, and Jie Wang
- Subjects
HUMAN microbiota ,MISSING data (Statistics) ,RANDOM forest algorithms ,OVERALL survival ,INFLAMMATORY bowel diseases ,MACHINE learning ,GUT microbiome - Abstract
Over the past decade, the prevalence of inflammatory bowel disease (IBD) has significantly increased, making early detection crucial for improving patient survival rates. Medical research suggests that changes in the human gut microbiome are closely linked to IBD onset, playing a critical role in its prediction. However, the current gut microbiome data often exhibit missing values and high dimensionality, posing challenges to the accuracy of predictive algorithms. To address these issues, we proposed the comprehensive data optimization and risk prediction framework (CDORPF), an ensemble learning framework designed to predict IBD risk based on the human gut microbiome, aiding early diagnosis. The framework comprised two main components: data optimization and risk prediction. The data optimization module first employed triple optimization imputation (TOI) to impute missing data while preserving the biological characteristics of the microbiome. It then utilized an importance-weighted variational autoencoder (IWVAE) to reduce redundant information from the high-dimensional microbiome data. This process resulted in a complete, low-dimensional representation of the data, laying the foundation for improved algorithm efficiency and accuracy. In the risk prediction module, the optimized data was classified using a random forest (RF) model, and hyperparameters were globally optimized using the improved Aquila optimizer (IAO), which incorporated multiple strategies. Experimental results on IBD-related gut microbiome datasets showed that the proposed framework achieved classification accuracy, recall, and F1 scores exceeding 0.9, outperforming comparison models and serving as a valuable tool for predicting IBD onset risk. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
6. Data Imputation in Electricity Consumption Profiles through Shape Modeling with Autoencoders.
- Author
-
Duarte, Oscar, Duarte, Javier E., and Rosero-Garcia, Javier
- Subjects
-
ELECTRIC power consumption ,SMART meters ,ENERGY consumption ,TIME series analysis ,FORECASTING - Abstract
In this paper, we propose a novel methodology for estimating missing data in energy consumption datasets. Conventional data imputation methods are not suitable for these datasets, because they are time series with special characteristics and because, for some applications, it is quite important to preserve the shape of the daily energy profile. Our answer to this need is the use of autoencoders. First, we split the problem into two subproblems: how to estimate the total amount of daily energy, and how to estimate the shape of the daily energy profile. We encode the shape as a new feature that can be modeled and predicted using autoencoders. In this way, the problem of imputing profile data is reduced to two relatively simple problems to which conventional methods can be applied. However, the two predictions are related, so special care should be taken when reconstructing the profile. We show that, as a result, our data imputation methodology produces plausible profiles where other methods fail. We tested it on a highly corrupted dataset, outperforming conventional methods by a factor of 3.7. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
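The split into the two subproblems described above (total daily energy vs. profile shape) amounts to a normalization that reconstructs the profile exactly. A minimal sketch with made-up profiles in place of smart-meter data; the paper models the shape feature with an autoencoder rather than leaving it raw.

```python
import numpy as np

# Each row is one day's consumption profile (4 slots here instead of 24h).
profiles = np.array([[1.0, 2.0, 3.0, 2.0],
                     [2.0, 4.0, 6.0, 4.0]])

totals = profiles.sum(axis=1)         # subproblem 1: total daily energy
shapes = profiles / totals[:, None]   # subproblem 2: unit-sum daily shape

# The two parts are predicted separately in the paper; multiplying them
# back together reconstructs the profile exactly.
reconstructed = shapes * totals[:, None]
```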
7. Enhancing Aggregate Load Forecasting Accuracy with Adversarial Graph Convolutional Imputation Network and Learnable Adjacency Matrix.
- Author
-
Zhao, Junhao, Shen, Xiaodong, Liu, Youbo, Liu, Junyong, and Tang, Xisheng
- Subjects
-
CONVOLUTIONAL neural networks ,LOAD forecasting (Electric power systems) ,MISSING data (Statistics) ,ELECTRICITY markets ,DATA quality ,MARKET power - Abstract
Accurate load forecasting, especially in the short term, is crucial for the safe and stable operation of power systems and their market participants. However, as modern power systems become increasingly complex, the challenges of short-term load forecasting are also intensifying. To address this challenge, data-driven deep learning techniques and load aggregation technologies have gradually been introduced into the field of load forecasting. However, data quality issues persist due to various factors such as sensor failures, unstable communication, and susceptibility to network attacks, leading to data gaps. Furthermore, in the domain of aggregated load forecasting, considering the potential interactions among aggregated loads can help market participants engage in cross-market transactions. However, aggregated loads often lack clear geographical locations, making it difficult to predefine graph structures. To address the issue of data quality, this study proposes a model named adversarial graph convolutional imputation network (AGCIN), combined with local and global correlations for imputation. To tackle the problem of the difficulty in predefining graph structures for aggregated loads, this study proposes a learnable adjacency matrix, which generates an adaptive adjacency matrix based on the relationships between different sequences without the need for geographical information. The experimental results demonstrate that the proposed imputation method outperforms other imputation methods in scenarios with random and continuous missing data. Additionally, the prediction accuracy of the proposed method exceeds that of several baseline methods, affirming the effectiveness of our approach in imputation and prediction, ultimately enhancing the accuracy of aggregated load forecasting. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
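One simple stand-in for the learnable adjacency matrix described above is to derive edge weights from the load series themselves, e.g. from absolute correlations, since no geographical information is required. This is an assumption-laden sketch: the paper learns the matrix jointly with the forecaster rather than fixing it from correlations.

```python
import numpy as np

def learned_adjacency(series):
    """Build an adjacency matrix for aggregated loads from the data alone.

    Absolute Pearson correlation between load series is used as the edge
    weight, then each row is normalized; no geographic layout is needed.
    """
    A = np.abs(np.corrcoef(series))
    np.fill_diagonal(A, 0.0)                 # no self-loops
    return A / A.sum(axis=1, keepdims=True)  # row-normalize

# three aggregated load series; the first two move together exactly
loads = np.array([[1.0, 2.0, 3.0, 4.0],
                  [2.0, 4.0, 6.0, 8.0],
                  [4.0, 1.0, 3.0, 2.0]])
A = learned_adjacency(loads)
```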
8. A Bayesian Aoristic Logistic Regression to Model Spatio-Temporal Crime Risk Under the Presence of Interval-Censored Event Times.
- Author
-
Briz-Redón, Álvaro
- Subjects
-
CRIME prevention laws ,OFFENSES against property ,CENSORING (Statistics) ,CRIME analysis ,LOGISTIC regression analysis - Abstract
Purpose: Crime data analysis has gained significant interest due to its peculiarities. One key characteristic of property crimes is the uncertainty surrounding their exact temporal location, often limited to a time window. Methods: This study introduces a spatio-temporal logistic regression model that addresses the challenges posed by temporal uncertainty in crime data analysis. Inspired by the aoristic method, our Bayesian approach allows for the inclusion of temporal uncertainty in the model. Results: To demonstrate the effectiveness of our proposed model, we apply it to both simulated datasets and a dataset of residential burglaries recorded in Valencia, Spain. We compare our proposal with a complete cases model, which excludes temporally-uncertain events, and also with alternative models that rely on imputation procedures. Our model exhibits superior performance in terms of recovering the true underlying crime risk. Conclusions: The proposed modeling framework effectively handles interval-censored temporal observations while incorporating covariate and space–time effects. This flexible model can be implemented to analyze crime data with uncertainty in temporal locations, providing valuable insights for crime prevention and law enforcement strategies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
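The aoristic idea the abstract builds on can be shown in a few lines: each interval-censored event is spread uniformly over its time window. The Bayesian model in the paper instead treats the true event time as a latent variable; this sketch shows only the classical weighting.

```python
from collections import Counter

def aoristic_counts(events):
    """Spread each interval-censored event uniformly over its time window.

    Each event is (start_hour, end_hour), inclusive: a burglary known only
    to lie in hours 20-23 contributes 1/4 of an event to each of those hours.
    """
    counts = Counter()
    for start, end in events:
        hours = range(start, end + 1)
        for h in hours:
            counts[h] += 1.0 / len(hours)
    return counts

# two burglaries: one exactly at hour 10, one somewhere in hours 20-23
counts = aoristic_counts([(10, 10), (20, 23)])
```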
9. MISSING DATA IMPUTATION FOR HEALTH CARE BIG DATA USING DENOISING AUTOENCODER WITH GENERATIVE ADVERSARIAL NETWORK.
- Author
-
YINBING ZHANG
- Subjects
STANDARD deviations ,GENERATIVE adversarial networks ,DEEP learning ,K-nearest neighbor classification ,NONRESPONSE (Statistics) ,MISSING data (Statistics) ,MULTIPLE imputation (Statistics) - Abstract
Missing data imputation is a key topic in healthcare that covers the issues and strategies involved in dealing with partial data in medical records, clinical trials, and health surveys. Data in healthcare might be missing for a variety of reasons, including non-response in surveys, data entry problems, or unrecorded information during therapeutic appointments. This paper introduces a novel approach to impute missing data utilizing a hybrid model that integrates denoising autoencoders with generative adversarial networks (GANs). We begin by highlighting the prevalence of missing data in health care datasets and the potential impact on analytical outcomes. The proposed methodology leverages the denoising autoencoder's ability to reconstruct data from noisy inputs, coupled with the GAN's proficiency in generating synthetic data that is indistinguishable from real data. By combining these two neural network architectures, our model demonstrates an enhanced capability to predict and fill in missing data points effectively. To validate our approach, we conducted experiments on several large-scale health care datasets with varying degrees of artificially introduced missingness. The performance of our model was benchmarked against traditional imputation methods such as mean imputation and k-nearest neighbors, as well as against standalone denoising autoencoders and GANs. Our results indicate a significant improvement in imputation accuracy, as measured by root mean square error (RMSE) and mean absolute error (MAE), confirming the efficacy of the hybrid model in handling missing data in a robust manner. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
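One of the baselines the abstract benchmarks against, k-nearest-neighbour imputation, can be sketched as follows. This is a simplified version that searches only complete records; `knn_impute` is a hypothetical name, not from the paper.

```python
import numpy as np

def knn_impute(X, k=2):
    """For each record with missing fields, average the k nearest complete
    records (Euclidean distance on the shared observed fields)."""
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        out[i, miss] = nearest[:, miss].mean(axis=0)
    return out

# toy records: last patient's second measurement is missing
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [2.0, np.nan]])
filled = knn_impute(X, k=2)
```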
10. Improving Air Quality Data Reliability through Bi-Directional Univariate Imputation with the Random Forest Algorithm.
- Author
-
Arnaut, Filip, Đurđević, Vladimir, Kolarski, Aleksandra, Srećković, Vladimir A., and Jevremović, Sreten
- Abstract
Forecasting the future levels of air pollution provides valuable information that holds importance for the general public, vulnerable populations, and policymakers. High-quality data are essential for precise and reliable forecasts and investigations of air pollution. Missing observations arise when the sensors utilized for assessing air quality parameters experience malfunctions, which result in erroneous measurements or gaps in the dataset and hinder the data quality. This research paper presents a novel approach for imputing missing values in air quality data in a univariate approach. The algorithm employs the random forest (RF) algorithm to impute missing observations in a bi-directional (forward and reverse in time) manner for air quality (particulate matter less than 2.5 μm (PM2.5)) data from the Republic of Serbia. The algorithm was evaluated against simple methods, such as the mean and median imputation methods, for missing observations over durations of 24, 48, and 72 h. The results indicate that our algorithm yielded comparable error rates to the median imputation method for all periods when imputing the PM2.5 data. Ultimately, the algorithm's higher computational complexity was not justified by the minimal error decrease it achieved compared with the simpler methods. However, for future improvement, additional research is needed, such as utilizing low-code machine learning libraries and time-series forecasting techniques. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
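The bi-directional (forward and reverse in time) idea above can be illustrated with a deliberately simple stand-in: estimate each gap value from the last observation before it and the first after it, blended by distance, which reduces to linear interpolation. The paper trains a random forest per direction on lagged features; persistence is used here only to keep the sketch dependency-free, and interior gaps are assumed.

```python
import numpy as np

def bidirectional_impute(series):
    """Blend a forward (last value before the gap) and a reverse (first value
    after the gap) estimate, weighted by distance to each gap edge."""
    x = series.copy()
    obs = np.where(~np.isnan(x))[0]          # indices observed at the start
    for i in np.where(np.isnan(x))[0]:
        left = obs[obs < i].max()            # forward pass anchor
        right = obs[obs > i].min()           # reverse pass anchor
        w = (i - left) / (right - left)      # distance-based blend weight
        x[i] = (1 - w) * x[left] + w * x[right]
    return x

# hourly PM2.5 with a 3-hour sensor gap
pm = np.array([10.0, np.nan, np.nan, np.nan, 18.0])
filled = bidirectional_impute(pm)
```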
11. A comparison of machine learning methods for recovering noisy and missing 4D flow MRI data.
- Author
-
Csala, Hunor, Amili, Omid, D'Souza, Roshan M., and Arzani, Amirhossein
- Subjects
-
COMPUTATIONAL fluid dynamics ,SINGULAR value decomposition ,DEEP learning ,BLOOD flow measurement ,MAGNETIC resonance imaging - Abstract
Experimental blood flow measurement techniques are invaluable for a better understanding of cardiovascular disease formation, progression, and treatment. One of the emerging methods is time‐resolved three‐dimensional phase‐contrast magnetic resonance imaging (4D flow MRI), which enables noninvasive time‐dependent velocity measurements within large vessels. However, several limitations hinder the usability of 4D flow MRI and other experimental methods for quantitative hemodynamics analysis. These mainly include measurement noise, corrupt or missing data, low spatiotemporal resolution, and other artifacts. Traditional filtering is routinely applied for denoising experimental blood flow data without any detailed discussion on why it is preferred over other methods. In this study, filtering is compared to different singular value decomposition (SVD)‐based machine learning and autoencoder‐type deep learning methods for denoising and filling in missing data (imputation). An artificially corrupted and voxelized computational fluid dynamics (CFD) simulation as well as in vitro 4D flow MRI data are used to test the methods. SVD‐based algorithms achieve excellent results for the idealized case but severely struggle when applied to in vitro data. The autoencoders are shown to be versatile and applicable to all investigated cases. For denoising, the in vitro 4D flow MRI data, the denoising autoencoder (DAE), and the Noise2Noise (N2N) autoencoder produced better reconstructions than filtering both qualitatively and quantitatively. Deep learning methods such as N2N can result in noise‐free velocity fields even though they did not use clean data during training. This work presents one of the first comprehensive assessments and comparisons of various classical and modern machine‐learning methods for enhancing corrupt cardiovascular flow data in diseased arteries for both synthetic and experimental test cases. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
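The SVD-based methods compared in the abstract rest on truncated-SVD reconstruction: keep the leading singular modes of the snapshot matrix and discard the rest as noise. A minimal sketch on a synthetic rank-1 field; the paper's SVD variants add robust and weighted terms on top of this core.

```python
import numpy as np

def svd_denoise(X, rank):
    """Reconstruct X from its top `rank` singular modes, dropping the
    remaining modes as noise."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

rng = np.random.default_rng(0)
# rank-1 synthetic "flow field": 50 time snapshots x 20 spatial points
clean = np.outer(np.sin(np.linspace(0, 3, 50)), np.ones(20))
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
denoised = svd_denoise(noisy, rank=1)
```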
12. An L1-and-L2-regularized nonnegative tensor factorization for power load monitoring data imputation.
- Author
-
Luo, Xing, Hu, Zijian, Ma, Zhoujun, Lv, Zhan, Wang, Qu, Zeng, Aoling, Leung, Man Fai, and Zhang, Lei
- Subjects
FEATURE extraction ,FACTORIZATION ,FORECASTING ,EMPIRICAL research ,DATA modeling - Abstract
As smart grids advance, Power Load Forecasting (PLF) has become a research hotspot. As the foundation of the forecasting model, the Power Load Monitoring (PLM) data takes on great importance due to its completeness, reliability and accuracy. However, monitoring equipment failures, transmission channel congestion and anomalies result in missing PLM data, which directly affects the performance of the PLF model. To address this issue, this paper proposes an L1-and-L2-Regularized Nonnegative Tensor Factorization (LNTF) model to impute missing PLM data. Its main idea is threefold: (1) combining L1 and L2 norms to achieve effective feature extraction and improve the model's robustness; (2) incorporating two temporal-dependent linear biases to describe the fluctuations of PLM data; (3) adding nonnegative constraints to precisely define the nonnegativity of PLM data. Extensive empirical studies on two publicly available real-world PLM datasets with 1,569,491 and 413,357 known entries and missing rates of 93.35% and 96.75% demonstrate that the proposed LNTF improves on state-of-the-art imputation models by 14.04%, 59.31%, and 71.43% on average in terms of imputation error, convergence rounds, and time cost, respectively. Its high computational efficiency and low imputation error make it practical for PLM data imputation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
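A two-way (matrix) analogue of the LNTF model above can be sketched with L1-and-L2-regularized nonnegative factorization fitted on observed entries only, via multiplicative updates; missing entries are then read off the reconstruction. The paper factorizes a three-way tensor and adds temporal bias terms, which this sketch omits.

```python
import numpy as np

def masked_nmf(X, mask, rank=1, l1=0.01, l2=0.01, iters=500, seed=0):
    """Nonnegative factorization of the observed entries of X with L1 and L2
    penalties, via multiplicative updates (which keep factors nonnegative)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    Xo = np.where(mask, X, 0.0)          # zero out missing entries
    eps = 1e-9
    for _ in range(iters):
        WH = np.where(mask, W @ H, 0.0)  # compare only on observed entries
        W *= (Xo @ H.T) / (WH @ H.T + l2 * W + l1 + eps)
        WH = np.where(mask, W @ H, 0.0)
        H *= (W.T @ Xo) / (W.T @ WH + l2 * H + l1 + eps)
    return W @ H                          # full reconstruction imputes gaps

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [3.0, 6.0, 9.0]])           # rank-1 toy load matrix
mask = np.ones_like(X, dtype=bool)
mask[1, 2] = False                        # pretend this reading is missing
imputed = masked_nmf(X, mask, rank=1)
```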
13. A novel 8-connected Pixel Identity GAN with Neutrosophic (ECP-IGANN) for missing imputation
- Author
-
Gamal M. Mahmoud, Mostafa Elbaz, Fayez Alqahtani, Yasser Alginahi, and Wael Said
- Subjects
GANs ,Missing pixel imputation ,Data imputation ,Mode collapse ,Missing pixel ,Neutrosophic ,Medicine ,Science - Abstract
Abstract Missing pixel imputation presents a critical challenge in image processing and computer vision, particularly in applications such as image restoration and inpainting. The primary objective of this paper is to accurately estimate and reconstruct missing pixel values to restore complete visual information. This paper introduces a novel model called the Enhanced Connected Pixel Identity GAN with Neutrosophic (ECP-IGANN), which is designed to address two fundamental issues inherent in existing GAN architectures for missing pixel generation: (1) mode collapse, which leads to a lack of diversity in generated pixels, and (2) the preservation of pixel integrity within the reconstructed images. ECP-IGANN incorporates two key innovations to improve missing pixel imputation. First, an identity block is integrated into the generation process to facilitate the retention of existing pixel values and ensure consistency. Second, the model calculates the values of the 8-connected neighbouring pixels around each missing pixel, thereby enhancing the coherence and integrity of the imputed pixels. The efficacy of ECP-IGANN was rigorously evaluated through extensive experimentation across five diverse datasets: BigGAN-ImageNet, the 2024 Medical Imaging Challenge Dataset, the Autonomous Vehicles Dataset, the 2024 Satellite Imagery Dataset, and the Fashion and Apparel Dataset 2024. These experiments assessed the model’s performance in terms of diversity, pixel imputation accuracy, and mode collapse mitigation, with results demonstrating significant improvements in the Inception Score (IS) and Fréchet Inception Distance (FID). ECP-IGANN markedly enhanced image segmentation performance in the validation phase across all datasets. Key metrics, such as Dice Score, Accuracy, Precision, and Recall, were improved substantially for various segmentation models, including Spatial Attention U-Net, Dense U-Net, and Residual Attention U-Net. 
For example, in the 2024 Medical Imaging Challenge Dataset, the Residual Attention U-Net’s Dice Score increased from 0.84 to 0.90, while accuracy improved from 0.88 to 0.93 following the application of ECP-IGANN. Similar performance enhancements were observed with the other datasets, highlighting the model’s robust generalizability across diverse imaging domains.
- Published
- 2024
- Full Text
- View/download PDF
14. Life cycle assessment of metal powder production: a Bayesian stochastic Kriging model-based autonomous estimation
- Author
-
Haibo Xiao, Baoyun Gao, Shoukang Yu, Bin Liu, Sheng Cao, and Shitong Peng
- Subjects
Data imputation ,Gas atomization ,Stochastic Kriging model ,Additive manufacturing ,Uncertainty ,Electronic computers. Computer science ,QA75.5-76.95 ,Computer engineering. Computer hardware ,TK7885-7895 - Abstract
Abstract Metal powder contributes to the environmental burdens of additive manufacturing (AM) substantially. Current life cycle assessments (LCAs) of metal powders present considerable variations of lifecycle environmental inventory due to process divergence, spatial heterogeneity, or temporal fluctuation. Most importantly, the amounts of LCA studies on metal powder are limited and primarily confined to partial material types. To this end, based on the data surveyed from a metal powder supplier, this study conducted an LCA of titanium and nickel alloy produced by electrode-inducted and vacuum-inducted melting gas atomization, respectively. Given that energy consumption dominates the environmental burden of powder production and is influenced by metal materials’ physical properties, we proposed a Bayesian stochastic Kriging model to estimate the energy consumption during the gas atomization process. This model considered the inherent uncertainties of training data and adaptively updated the parameters of interest when new environmental data on gas atomization were available. With the predicted energy use information of specific powder, the corresponding lifecycle environmental impacts can be further autonomously estimated in conjunction with the other surveyed powder production stages. Results indicated the environmental impact of titanium alloy powder is slightly higher than that of nickel alloy powder and their lifecycle carbon emissions are around 20 kg CO2 equivalency. The proposed Bayesian stochastic Kriging model showed more accurate predictions of energy consumption compared with conventional Kriging and stochastic Kriging models. This study enables data imputation of energy consumption during gas atomization given the physical properties and producing technique of powder materials.
- Published
- 2024
- Full Text
- View/download PDF
15. IMD-MP: Imputation of Missing Data in IoT Based on Matrix Profile and Spatio-temporal Correlations
- Author
-
G. V. Vidya Lakshmi and S. Gopikrishnan
- Subjects
Internet of Things ,Data imputation ,Univariate data ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Data in the Internet of Things (IoT) domain may be missing due to connectivity errors, environmental extremes, sensor malfunctions, and human errors. Despite the many approaches for imputing missing values, imputing larger missing sub-sequences in univariate series with good precision and acceptable computational complexity remains an open problem. This work introduces IMD-MP (Imputation of Missing Data using Matrix Profile), a new technique that improves imputation accuracy for big data analysis in IoT applications based on spatio-temporal correlations, using a novel distance metric, the Matrix Profile Distance (MPD). Our method preserves spatial correlation by grouping the sensors present in the network (using a grouping algorithm, GA) to impute the missing data of the failed sensor node. After grouping, sensor nodes similar to the failed sensor node are identified using the Node Similarity Algorithm (NSF). From the data of these similar sensors, a number of sub-sequences most similar to the one preceding the failed node's missing values are gathered. The heights of these sub-sequences are optimized to ensure temporal correlation in the imputed data. To find the optimal imputation sequence, the method uses MPD and similarity scores. Numerical findings using sensor data from real-time environmental monitoring and the Intel data sets demonstrate the algorithm's effectiveness compared to other benchmarks.
- Published
- 2024
- Full Text
- View/download PDF
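The distance computation underlying the Matrix Profile Distance above can be illustrated with a single distance profile: the z-normalized Euclidean distance from one query sub-sequence to every sub-sequence of a series. The imputation step then reuses the best-matching sub-sequence; `distance_profile` is a hypothetical name, not from the paper.

```python
import numpy as np

def distance_profile(query, series):
    """z-normalized Euclidean distance from `query` to every sub-sequence of
    `series` of the same length: one row of a matrix profile computation."""
    m = len(query)
    def znorm(v):
        s = v.std()
        return (v - v.mean()) / s if s > 0 else v - v.mean()
    q = znorm(query)
    return np.array([np.linalg.norm(q - znorm(series[i:i + m]))
                     for i in range(len(series) - m + 1)])

# the pattern [0, 1, 2] occurs at indices 0 and 4
series = np.array([0.0, 1.0, 2.0, 9.0, 0.0, 1.0, 2.0, 5.0])
dp = distance_profile(np.array([0.0, 1.0, 2.0]), series)
best = int(np.argmin(dp))
```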
16. Comparing preprocessing strategies for 3D-Gene microarray data of extracellular vesicle-derived miRNAs
- Author
-
Yuto Takemoto, Daisuke Ito, Shota Komori, Yoshiyuki Kishimoto, Shinichiro Yamada, Atsushi Hashizume, Masahisa Katsuno, and Masahiro Nakatochi
- Subjects
Data imputation ,Extracellular vesicle ,miRNA ,Normalization ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Biology (General) ,QH301-705.5 - Abstract
Abstract Background Extracellular vesicle-derived (EV)-miRNAs have potential to serve as biomarkers for the diagnosis of various diseases. miRNA microarrays are widely used to quantify circulating EV-miRNA levels, and the preprocessing of miRNA microarray data is critical for analytical accuracy and reliability. Thus, although microarray data have been used in various studies, the effects of preprocessing have not been studied for Toray's 3D-Gene chip, a widely used measurement method. We aimed to evaluate batch effects, missing-value imputation accuracy, and the influence of preprocessing on measured values across 18 different preprocessing pipelines for EV-miRNA microarray data from two amyotrophic lateral sclerosis cohorts measured with 3D-Gene technology. Results Eighteen different pipelines with different types and orders of missing-value imputation and normalization were used to preprocess the 3D-Gene microarray EV-miRNA data. Notably, batch effects were suppressed in all pipelines that used the batch-effect correction method ComBat. Furthermore, pipelines utilizing missForest for missing-value imputation showed high agreement with measured values. In contrast, imputation using constant values for missing data exhibited low agreement. Conclusions This study highlights the importance of selecting the appropriate preprocessing strategy for EV-miRNA microarray data when using 3D-Gene technology. These findings emphasize the importance of validating preprocessing approaches, particularly in the context of batch effect correction and missing value imputation, for reliably analyzing data in biomarker discovery and disease research.
- Published
- 2024
- Full Text
- View/download PDF
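The batch-effect suppression the abstract attributes to ComBat can be illustrated with its simplest ingredient, per-batch location adjustment. This is only the mean-centering part; ComBat proper also adjusts scale and shrinks the batch estimates with empirical Bayes.

```python
import numpy as np

def center_batches(X, batches):
    """Remove per-batch location shifts: subtract each batch's mean and add
    back the global mean, feature by feature."""
    out = X.astype(float).copy()
    grand = X.mean(axis=0)
    for b in np.unique(batches):
        rows = batches == b
        out[rows] += grand - X[rows].mean(axis=0)
    return out

# two cohorts measuring the same miRNA; cohort B is shifted up by 5
X = np.array([[1.0], [2.0], [6.0], [7.0]])
batches = np.array(["A", "A", "B", "B"])
corrected = center_batches(X, batches)
```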
17. A Deep Auto-Optimized Collaborative Learning (DACL) model for disease prognosis using AI-IoMT systems
- Author
-
Malarvizhi Nandagopal, Koteeswaran Seerangan, Tamilmani Govindaraju, Neeba Eralil Abi, Balamurugan Balusamy, and Shitharth Selvarajan
- Subjects
Deep Auto-Optimized Collaborative Learning (DACL) model ,Internet of Medical Things (IoMT) ,Disease diagnosis ,Artificial Intelligence (AI) ,Data imputation ,Optimization ,Medicine ,Science - Abstract
Abstract In modern healthcare, integrating Artificial Intelligence (AI) and the Internet of Medical Things (IoMT) is highly beneficial and has made it possible to effectively control disease using networks of interconnected sensors worn by individuals. The purpose of this work is to develop an AI-IoMT framework for identifying several chronic diseases from patients’ medical records. To that end, the Deep Auto-Optimized Collaborative Learning (DACL) Model, a new AI-IoMT framework, has been developed for rapid diagnosis of chronic diseases such as heart disease, diabetes, and stroke. A Deep Auto-Encoder Model (DAEM) is used in the proposed framework to impute and preprocess the data by identifying the fields of characteristics or information that are missing. To speed up classification training and testing, the Golden Flower Search (GFS) approach is then utilized to choose the best features from the imputed data. In addition, the Collaborative Bias Integrated GAN (ColBGaN) model has been created for precisely recognizing and classifying the types of chronic diseases from patients’ medical records. The loss function is optimally estimated during classification using the Water Drop Optimization (WDO) technique, reducing the classifier’s error rate. Using several well-known benchmarking datasets and performance measures, the proposed DACL’s effectiveness and efficiency in identifying diseases are evaluated and compared.
- Published
- 2024
- Full Text
- View/download PDF
18. A Classification Method for Incomplete Mixed Data Using Imputation and Feature Selection.
- Author
-
Li, Gengsong, Zheng, Qibin, Liu, Yi, Li, Xiang, Qin, Wei, and Diao, Xingchun
- Subjects
MACHINE learning ,PARTICLE swarm optimization ,MISSING data (Statistics) ,MACHINE performance ,STANDARD deviations ,FEATURE selection ,MULTIPLE imputation (Statistics) - Abstract
Missing data is a ubiquitous problem in real-world systems that adversely affects the performance of machine learning algorithms. Although many useful imputation methods are available to address this issue, they often fail to consider the information provided by both features and labels. As a result, the performance of these methods may be constrained. Furthermore, feature selection has been widely used as a data quality improvement technique and has demonstrated its efficiency. To overcome the limitations of imputation methods, we propose a novel algorithm that combines data imputation and feature selection to tackle classification problems for mixed data. Based on the mean and standard deviation of quantitative features and the selection probabilities of unique values of categorical features, our algorithm constructs different imputation models for quantitative and categorical features. Particle swarm optimization is used to optimize the parameters of the imputation models and select feature subsets simultaneously. Additionally, we introduce a legacy learning mechanism to enhance the optimization capability of our method. To evaluate the performance of the proposed method, seven algorithms and twelve datasets are used for comparison. The results show that our algorithm outperforms the others in terms of accuracy and F1 score and has reasonable time overhead. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
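The imputation scheme this entry describes, quantitative features drawn from their mean and standard deviation and categorical features sampled by observed value frequency, can be sketched as follows. This is a minimal illustration on invented data; the paper additionally tunes these parameters with particle swarm optimization, which is omitted here:

```python
import random

def impute_mixed(rows, categorical_cols, seed=0):
    """Fill quantitative gaps by sampling N(mean, std) and categorical gaps
    by sampling observed values with their empirical frequencies."""
    rng = random.Random(seed)
    cols = list(zip(*rows))
    filled_cols = []
    for j, col in enumerate(cols):
        obs = [v for v in col if v is not None]
        if j in categorical_cols:
            # choosing uniformly over all observed values reproduces
            # each unique value's frequency as its selection probability
            filled_cols.append([v if v is not None else rng.choice(obs)
                                for v in col])
        else:
            n = len(obs)
            mu = sum(obs) / n
            sd = (sum((v - mu) ** 2 for v in obs) / n) ** 0.5
            filled_cols.append([v if v is not None else rng.gauss(mu, sd)
                                for v in col])
    return [list(r) for r in zip(*filled_cols)]

# Toy mixed dataset: column 0 quantitative, column 1 categorical.
rows = [[1.0, "a"], [2.0, "b"], [None, "a"], [3.0, None]]
filled = impute_mixed(rows, categorical_cols={1})
print(filled)
```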
19. Review of Statistical Considerations and Data Imputation Methodologies in Psoriasis Clinical Trials.
- Author
-
DIRUGGIERO, DOUGLAS, TRICKETT, CYNTHIA, HIPPELI, LAUREN, SANG HEE PARK, BAUM-JONES, AMY, and DAVIDSON, DAVID S.
- Subjects
- *
CLINICAL trials , *STATISTICS , *MEDICAL personnel , *PSORIASIS , *MISSING data (Statistics) - Abstract
Numerous clinical trials have established that various biologic and oral small-molecule therapies are efficacious in patients with psoriasis. However, as there are limited head-to-head trials, healthcare providers may compare results across multiple trials when providing treatment recommendations. Direct comparisons among agents are challenging because psoriasis trials differ in terms of study design, patient population, and data analysis methodologies. Long-term clinical trials present additional challenges because the number of patients enrolled generally declines over time. The missing patient data that might occur, coupled with the specific approach used to substitute or impute that missing data, might introduce bias and skew efficacy results. In this review, we discuss how variations in study design and analytical methodologies affect efficacy outcomes in clinical trials. We also review published trials of biologic and oral small-molecule therapies for psoriasis to illustrate how issues related to missing data and choices in data imputation methodologies can affect the interpretation of efficacy outcomes. Imputation methodologies discussed include nonresponder imputation, modified nonresponder imputation, treatment failure rules, last observation carried forward, modified baseline observation carried forward, and multiple imputation. This review provides a foundation for the healthcare provider's critical evaluation of the psoriasis literature and emphasizes the importance of considering the level of evidence provided in a clinical trial when making treatment decisions. [ABSTRACT FROM AUTHOR]
- Published
- 2024
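Two of the imputation rules this review covers, nonresponder imputation and last observation carried forward, can be illustrated on a toy responder series. The visit data here are hypothetical, not from any trial discussed:

```python
def nonresponder_imputation(visits):
    """Nonresponder imputation (NRI): any missing visit is counted
    as a non-response (False)."""
    return [v if v is not None else False for v in visits]

def locf(visits):
    """Last observation carried forward (LOCF): a missing visit repeats
    the most recent observed value (None if nothing observed yet)."""
    out, last = [], None
    for v in visits:
        if v is not None:
            last = v
        out.append(last)
    return out

# One hypothetical patient's responder status over four visits; None = dropout.
visits = [False, True, None, None]
print(nonresponder_imputation(visits))  # [False, True, False, False]
print(locf(visits))                     # [False, True, True, True]
```

The same dropout pattern yields opposite conclusions under the two rules, which is exactly the interpretation risk the review warns about: NRI is conservative, while LOCF carries an early response forward indefinitely.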
20. Using Paradata for Imputation of Missing Values in Sociological Survey Data: Results of Statistical Modeling (Case of Croatia and Slovakia).
- Author
-
GORBACHYK, ANDRII
- Subjects
MISSING data (Statistics) ,NONRESPONSE (Statistics) ,STATISTICAL models ,QUANTITATIVE research ,STATISTICS - Abstract
Missing values are a common issue in quantitative social research. One way to handle missing data is data imputation. This article outlines the challenges of traditional data imputation methods, which often introduce biases, and presents an advanced approach that integrates paradata (auxiliary information collected during surveys) into the imputation process, using the European Social Survey (ESS) as its dataset. It is proposed that the use of paradata could enhance the predictive models used for imputation. The article discusses practical applications of data imputation, particularly through the lens of sensitive topics such as LGBT issues in socially conservative countries, where missingness can be heavily skewed due to the social unacceptability of certain answers. To evaluate the effectiveness of the proposed approach, the research uses an 'ideal dataset', a subset of the original dataset with no missing values, and then introduces artificial missing values that are not MCAR (Missing Completely at Random) to simulate the real case of missing data. Artificial missingness allows the imputation procedure to be evaluated by comparison with the original dataset. The study uses a novel approach to creating realistic missing data patterns through clustering based on response patterns. The research uses advanced statistical methods to handle missing data and incorporates paradata from the survey process to improve the accuracy of predictive models. By comparing statistical metrics such as RMSE, MAE, and R-squared, the article evaluates the effectiveness of these methods in mimicking the original dataset's variability. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
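The evaluation strategy described above, hiding values in a fully observed subset and scoring the imputations against the hidden truth, can be sketched as follows. For brevity the artificial holes here are MCAR, whereas the article deliberately constructs non-MCAR patterns via clustering; the series and the mean imputer are stand-ins:

```python
import random

def rmse(truth, preds):
    """Root mean square error between hidden true values and imputations."""
    return (sum((t - p) ** 2 for t, p in zip(truth, preds)) / len(truth)) ** 0.5

def mean_impute(xs):
    """Baseline imputer: fill gaps (None) with the observed mean."""
    obs = [v for v in xs if v is not None]
    mu = sum(obs) / len(obs)
    return [mu if v is None else v for v in xs]

def evaluate_imputer(complete, impute_fn, miss_rate=0.3, seed=1):
    """Hide values in a fully observed series, impute, and score the
    imputations against the hidden truth."""
    rng = random.Random(seed)
    holed = [None if rng.random() < miss_rate else v for v in complete]
    hidden = [i for i, v in enumerate(holed) if v is None]
    filled = impute_fn(holed)
    return rmse([complete[i] for i in hidden], [filled[i] for i in hidden])

series = [3.0, 4.0, 5.0, 4.0, 6.0, 5.0, 4.0, 5.0]  # stand-in survey variable
score = evaluate_imputer(series, mean_impute)
print(round(score, 3))
```

Any candidate imputer (paradata-enhanced or not) can be passed as `impute_fn`, so the same harness ranks competing methods by RMSE, as the article does.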
21. Digital Phenotyping-Based Bipolar Disorder Assessment Using Multiple Correlation Data Imputation and Lasso-MLP.
- Author
-
Hsu, Jia-Hao, Wu, Chung-Hsien, Wang, Wei-Kai, Su, Hung-Yi, Lin, Esther Ching-Lan, and Chen, Po See
- Abstract
Clinical rating scales can be used to assess the severity of bipolar disorder; however, their use involves clinician–patient interactions, which is labor-intensive. Therefore, this study proposes a digital-phenotyping-based system that provides clinical ratings of bipolar disorder severity using global positioning system, self-scale, daily mood, user emotion, sleep time, and multimedia data; these ratings are given on the Hamilton Depression Rating Scale (HAM-D) and the Young Mania Rating Scale (YMRS). A K-nearest-neighbor-based imputation method was used to handle missing data. In this method, missing data points are filled in using the multiple correlations between different features. Furthermore, the Least Absolute Shrinkage and Selection Operator (Lasso)-regression-based multilayer perceptron (Lasso-MLP) method was adopted to predict the total and factor scores on the HAM-D and YMRS. Five-fold cross-validation was used in the evaluation experiments. When the designed data imputation method was used with Lasso-MLP, the mean square errors of the total score and average factor score on the HAM-D (YMRS) were 0.56 (0.38) and 1.88 (0.98), respectively, which were smaller than the corresponding values obtained through Lasso regression (by 0.12 and 0.05, respectively, for the HAM-D and by 0.12 and 0.10, respectively, for the YMRS). The experimental results also indicated that the models trained with the imputed data outperformed those trained without imputed data. Thus, the developed approaches can eliminate the missing data problem and provide accurate clinical ratings. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
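A generic K-nearest-neighbor imputation step of the kind used in this entry can be sketched as follows. This is a simplified Euclidean-distance version on invented data, not the paper's multiple-correlation variant:

```python
import math

def knn_impute(rows, k=2):
    """Fill each missing entry (None) with the mean of that feature over the
    k nearest rows, using Euclidean distance on co-observed features."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            dists = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # donor row must have this feature observed
                common = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if common:
                    d = math.sqrt(sum((a - b) ** 2 for a, b in common))
                    dists.append((d, other[j]))
            if not dists:
                continue  # no donor observed this feature; leave the gap
            dists.sort(key=lambda t: t[0])
            neighbours = [val for _, val in dists[:k]]
            filled[i][j] = sum(neighbours) / len(neighbours)
    return filled

rows = [[1.0, 2.0], [1.1, 2.2], [0.9, None], [5.0, 9.0]]
filled = knn_impute(rows, k=2)
print(filled)  # the gap is filled from the two closest rows
```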
22. Comparing preprocessing strategies for 3D-Gene microarray data of extracellular vesicle-derived miRNAs.
- Author
-
Takemoto, Yuto, Ito, Daisuke, Komori, Shota, Kishimoto, Yoshiyuki, Yamada, Shinichiro, Hashizume, Atsushi, Katsuno, Masahisa, and Nakatochi, Masahiro
- Subjects
- *
MISSING data (Statistics) , *AMYOTROPHIC lateral sclerosis , *MICRORNA , *BIOMARKERS - Abstract
Background: Extracellular vesicle-derived (EV)-miRNAs have potential to serve as biomarkers for the diagnosis of various diseases. miRNA microarrays are widely used to quantify circulating EV-miRNA levels, and the preprocessing of miRNA microarray data is critical for analytical accuracy and reliability. Thus, although microarray data have been used in various studies, the effects of preprocessing have not been studied for Toray's 3D-Gene chip, a widely used measurement method. We aimed to evaluate batch effect, missing value imputation accuracy, and the influence of preprocessing on measured values in 18 different preprocessing pipelines for EV-miRNA microarray data from two cohorts with amyotrophic lateral sclerosis using 3D-Gene technology. Results: Eighteen different pipelines with different types and orders of missing value completion and normalization were used to preprocess the 3D-Gene microarray EV-miRNA data. Notably, batch effects were suppressed in all pipelines that used the batch effect correction method ComBat. Furthermore, pipelines utilizing missForest for missing value imputation showed high agreement with measured values. In contrast, imputation using constant values for missing data exhibited low agreement. Conclusions: This study highlights the importance of selecting the appropriate preprocessing strategy for EV-miRNA microarray data when using 3D-Gene technology. These findings emphasize the importance of validating preprocessing approaches, particularly in the context of batch effect correction and missing value imputation, for reliably analyzing data in biomarker discovery and disease research. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
23. PEDI-GAN: power equipment data imputation based on generative adversarial networks with auxiliary encoder.
- Author
-
Lv, Qianwei, Luo, He, Wang, Guoqiang, Tai, Jianwei, and Zhang, Shengzhi
- Subjects
- *
GENERATIVE adversarial networks , *PROBABILISTIC generative models , *STANDARD deviations , *MISSING data (Statistics) - Abstract
Smart grids commonly rely on analyzing sensor data to monitor power equipment. However, these sensor data can suffer varying levels of loss or corruption due to complex interferences, leading to a pressing need for precise missing value imputation in power equipment data. We propose a data imputation model for power equipment based on generative adversarial networks with an auxiliary encoder, named PEDI-GAN. In particular, the auxiliary encoder is designed and integrated into the GAN structure to optimize random vectors for the generator. Through data masking, we pinpoint missing data locations, enabling the generator to focus on generating accurate values for those points. Additionally, we address vanishing gradients and mode collapse in GAN training by using a gradient penalty to redesign the loss function for PEDI-GAN. Experimental results demonstrate PEDI-GAN's superiority in accuracy and generalization compared to baseline methods, with notable reductions in mean absolute error and root mean square error by an average of 16.75% and 11.09%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
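The data-masking step described above, pinpointing missing locations so the generator only replaces those points, can be sketched as follows. This is a minimal mask-and-combine illustration in the style of GAIN-like GAN imputers, not the PEDI-GAN implementation itself, and the sensor readings are invented:

```python
def make_mask(x):
    """Binary mask: 1 where the sensor reading is observed, 0 where missing."""
    return [[0 if v is None else 1 for v in row] for row in x]

def combine(x, generated, mask):
    """Keep observed values; take the generator's output only at missing
    positions, i.e. x_hat = m * x + (1 - m) * g."""
    return [[xv if m == 1 else gv
             for xv, gv, m in zip(xr, gr, mr)]
            for xr, gr, mr in zip(x, generated, mask)]

x = [[0.7, None], [None, 0.2]]  # raw readings with gaps
g = [[0.9, 0.5], [0.1, 0.8]]    # generator output (stand-in values here)
m = make_mask(x)
x_hat = combine(x, g, m)
print(m)      # [[1, 0], [0, 1]]
print(x_hat)  # [[0.7, 0.5], [0.1, 0.2]]
```

Because observed entries pass through untouched, the training signal concentrates the generator's capacity on the masked-out points, which is the motivation the abstract gives for the masking step.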
24. Evaluation of the Impact of Gap Filling Technology in Precipitation Series on the Estimation of Climate Trends, the Case of the Souss Massa Watershed.
- Author
-
Ismail, Oumechtaq, Abbdelmajid, Laghzali, Bahaj, Tarik, Abderrahim, Oulidi, Lamya, Amghar, Abdelhamid, Allaoui, Manal, Mouadil, Boualoul, Mustapha, El Mostafa, Bachaoui, and Khalid, Elkhaldi
- Subjects
METEOROLOGICAL precipitation ,HYDROLOGIC cycle ,WATERSHED management ,SOIL conservation ,CLIMATE change - Abstract
Accurate climatic data, especially precipitation measurements, play a critical role in various studies concerning the water cycle, particularly in modeling flood and drought risks. Unfortunately, these datasets often suffer from temporary gaps that are randomly dispersed over time. This study aims to assess the effectiveness of three imputation methods, KNN, MICE, and missForest, at imputing missing values in climate series. The evaluation is conducted in two distinct rainfall regimes: the Moulouya basin and the Sous Massa basin. The performance analysis considers the percentage of missing data across the entire dataset. The imputed datasets are used to estimate annual precipitation, which is then subjected to statistical tests to identify potential trends and detect changepoints. The analysis focuses on the precipitation series within the Souss Massa watershed, encompassing 27 rainfall stations. Results indicate that data imputation has a highly positive impact on the study of rainfall series trends and change point detection. The study found that studying trends without data imputation could lead to questionable conclusions. The most significant breakpoints detected in the analyzed rainfall series were in the years 1988, 1991, 1997, 2007, and 2010. The decrease in precipitation at stations showing a downward trend varies between -60 mm and -137 mm using the MICE method, and between -40 mm and 186 mm using the missForest method. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
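A single chained-equations step of the kind MICE iterates, regressing a gauge with gaps on a complete neighbouring station, can be sketched as follows. The rainfall figures are toy values, not the study's data:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def impute_from_neighbour(target, neighbour):
    """Fill gaps (None) in `target` by regressing it on a complete
    neighbouring station's series: one chained-equations step."""
    pairs = [(nb, t) for t, nb in zip(target, neighbour) if t is not None]
    a, b = fit_line([p[0] for p in pairs], [p[1] for p in pairs])
    return [t if t is not None else a + b * nb
            for t, nb in zip(target, neighbour)]

station_a = [10.0, 20.0, None, 40.0]   # gauge with a gap (mm)
station_b = [11.0, 21.0, 31.0, 41.0]   # nearby complete gauge (mm)
filled = impute_from_neighbour(station_a, station_b)
print(filled)
```

Full MICE cycles such regressions over every incomplete variable (and draws multiple completed datasets); this single deterministic pass only shows the core regression step.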
25. SensorGAN: A Novel Data Recovery Approach for Wearable Human Activity Recognition.
- Author
-
Hussein, Dina and Bhat, Ganapati
- Subjects
DATA recovery ,HUMAN activity recognition ,MOBILE health ,MISSING data (Statistics) ,SMART homes - Abstract
Human activity recognition (HAR) and, more broadly, activities of daily life recognition using wearable devices have the potential to transform a number of applications, including mobile healthcare, smart homes, and fitness monitoring. Recent approaches for HAR use multiple sensors on various locations on the body to achieve higher accuracy for complex activities. While multiple sensors increase the accuracy, they are also susceptible to reliability issues when one or more sensors are unable to provide data to the application due to sensor malfunction, user error, or energy limitations. Training multiple activity classifiers that use a subset of sensors is not desirable, since it may lead to reduced accuracy for applications. To handle these limitations, we propose a novel generative approach that recovers the missing data of sensors using data available from other sensors. The recovered data are then used to seamlessly classify activities. Experiments using three publicly available activity datasets show that with data missing from one sensor, the proposed approach achieves accuracy that is within 10% of the accuracy with no missing data. Moreover, implementation on a wearable device prototype shows that the proposed approach takes about 1.5 ms for recovering data in the w-HAR dataset, which results in an energy consumption of 606 μJ. The low-energy consumption ensures that SensorGAN is suitable for effectively recovering data in tinyML applications on energy-constrained devices. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
26. Spatio-Temporal Bi-LSTM Based Variational Auto-Encoder for Multivariate IoT Data Imputation.
- Author
-
Guggilam, Venkata Vidyalakshmi and Sundaram, Gopikrishnan
- Subjects
DEEP learning ,MULTIPLE imputation (Statistics) ,INTERNET of things ,SMART devices ,MISSING data (Statistics) ,MACHINE learning ,TASK performance - Abstract
In the realm of the Internet of Things (IoT), the prevalence of missing data in the streams continuously collected by smart devices makes data imputation an essential preliminary step before information mining. IoT data exhibit robust interconnections in both the spatial and temporal dimensions, surpassing the limitations of Euclidean space. Yet, prevailing machine learning and deep learning approaches often focus solely on temporal attributes or capture spatial features exclusively within a Euclidean framework. To address these challenges, this paper introduces a novel network named ST-Bi-LSTM-VAE (Spatio-Temporal Bidirectional Long Short-Term Memory based Variational Auto-Encoder). The architecture of ST-Bi-LSTM-VAE is primarily grounded in the Variational Auto-Encoder (VAE) framework. This approach incorporates two distinct types of VAEs. The first type is dedicated to computing the adjacency matrix of the device network, a crucial input for the Graph Convolutional Network (GCN), which is essential for capturing intricate spatial relationships among devices. The second type of VAE is specifically tailored for data imputation, leveraging both global spatial and temporal dependencies. Empirical experiments conducted on diverse publicly available datasets substantiate the efficacy of ST-Bi-LSTM-VAE. The results consistently demonstrate that the proposed method surpasses baseline techniques in maintaining pattern, structure, and trend across datasets, even at a 50% missing gap in the imputation task, with a 4.91% performance improvement on the Intel Berkeley Research Laboratory (IBRL) dataset and 3.5% on the PRSA dataset. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. A review of data-driven fault detection and diagnostics for building HVAC systems
- Author
-
Chen, Zhelun, O’Neill, Zheng, Wen, Jin, Pradhan, Ojas, Yang, Tao, Lu, Xing, Lin, Guanjing, Miyata, Shohei, Lee, Seungjae, Shen, Chou, Chiosa, Roberto, Piscitelli, Marco Savino, Capozzoli, Alfonso, Hengel, Franz, Kührer, Alexander, Pritoni, Marco, Liu, Wei, Clauß, John, Chen, Yimin, and Herr, Terry
- Subjects
Control Engineering ,Mechatronics and Robotics ,Engineering ,Machine Learning and Artificial Intelligence ,Networking and Information Technology R&D (NITRD) ,Building HVAC ,Fault detection ,Fault diagnostics ,Fault prognostics ,Data imputation ,Feature selection ,Data -driven ,Machine learning ,Anomaly detection ,Economics ,Energy ,Built environment and design - Abstract
With the wide adoption of building automation systems and the advancement of data, sensing, and machine learning techniques, data-driven fault detection and diagnostics (FDD) for building heating, ventilation, and air conditioning systems has gained increasing attention. In this paper, data-driven FDD methods are defined as those that are built or trained from data via machine learning or multivariate statistical analysis methods. Following this definition, this paper reviews and summarizes the literature on data-driven FDD from three aspects: process, systems studied (including the systems being investigated, the faults being identified, and the associated data sources), and evaluation metrics. A data-driven FDD process is further divided into the following steps: data collection, data cleansing, data preprocessing, baseline establishment, fault detection, fault diagnostics, and potential fault prognostics. Data-driven methods reported in the literature for each step of an FDD process are discussed first. Applications of data-driven FDD in various HVAC systems/components and commonly used data sources for FDD development are reviewed next, followed by a summary of typical metrics for evaluating FDD methods. Finally, this literature review concludes that despite the promising performance reported in the literature, data-driven FDD methods still face many challenges, such as real-building deployment, performance evaluation and benchmarking, scalability and transferability, interpretability, cyber security and data privacy, and user experience. Addressing these challenges is critical for broad real-building adoption of data-driven FDD.
- Published
- 2023
28. Missing Data Imputation: Do Advanced ML/DL Techniques Outperform Traditional Approaches?
- Author
-
Zhou, Youran, Bouadjenek, Mohamed Reda, Aryal, Sunil, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Bifet, Albert, editor, Krilavičius, Tomas, editor, Miliou, Ioanna, editor, and Nowaczyk, Slawomir, editor
- Published
- 2024
- Full Text
- View/download PDF
29. Emergence of Bayesian Network as Data Imputation Technique in Clinical Trials
- Author
-
Choudhary, Shashank G., Verma, Jai Prakash, Bhavsar, Madhuri, Celebi, Emre, Series Editor, Chen, Jingdong, Series Editor, Gopi, E. S., Series Editor, Neustein, Amy, Series Editor, Liotta, Antonio, Series Editor, Di Mauro, Mario, Series Editor, Singh, Pradeep Kumar, editor, Trovati, Marcello, editor, Murtagh, Fionn, editor, Atiquzzaman, Mohammed, editor, and Farid, Mohsen, editor
- Published
- 2024
- Full Text
- View/download PDF
30. An Efficient and Reliable scRNA-seq Data Imputation Method Using Variational Autoencoders
- Author
-
Alyassine, Widad, Raju, Anuradha Samkham, Braytee, Ali, Anaissi, Ali, Naji, Mohamad, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Daimi, Kevin, editor, and Al Sadoon, Abeer, editor
- Published
- 2024
- Full Text
- View/download PDF
31. Handling Missing Data in Longitudinal Anthropometric Data Using Multiple Imputation Method
- Author
-
Varma, Dhruv, Yajnik, Chittaranjan S., Thorave, Aniket, Sharma, Neha, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Sharma, Neha, editor, Goje, Amol C., editor, Chakrabarti, Amlan, editor, and Bruckstein, Alfred M., editor
- Published
- 2024
- Full Text
- View/download PDF
32. An Enhanced Neural Network Collaborative Filtering (ENNCF) for Personalized Recommender System
- Author
-
Ganesan, Thenmozhi, Vellaiyan, Palanisamy, Angrisani, Leopoldo, Series Editor, Arteaga, Marco, Series Editor, Chakraborty, Samarjit, Series Editor, Chen, Jiming, Series Editor, Chen, Shanben, Series Editor, Chen, Tan Kay, Series Editor, Dillmann, Rüdiger, Series Editor, Duan, Haibin, Series Editor, Ferrari, Gianluigi, Series Editor, Ferre, Manuel, Series Editor, Jabbari, Faryar, Series Editor, Jia, Limin, Series Editor, Kacprzyk, Janusz, Series Editor, Khamis, Alaa, Series Editor, Kroeger, Torsten, Series Editor, Li, Yong, Series Editor, Liang, Qilian, Series Editor, Martín, Ferran, Series Editor, Ming, Tan Cher, Series Editor, Minker, Wolfgang, Series Editor, Misra, Pradeep, Series Editor, Mukhopadhyay, Subhas, Series Editor, Ning, Cun-Zheng, Series Editor, Nishida, Toyoaki, Series Editor, Oneto, Luca, Series Editor, Panigrahi, Bijaya Ketan, Series Editor, Pascucci, Federica, Series Editor, Qin, Yong, Series Editor, Seng, Gan Woon, Series Editor, Speidel, Joachim, Series Editor, Veiga, Germano, Series Editor, Wu, Haitao, Series Editor, Zamboni, Walter, Series Editor, Tan, Kay Chen, Series Editor, Singh, Yashwant, editor, Singh, Pradeep Kumar, editor, Gonçalves, Paulo J. Sequeira, editor, and Kar, Arpan Kumar, editor
- Published
- 2024
- Full Text
- View/download PDF
33. A Machine Learning Approach to Mental Disorder Prediction: Handling the Missing Data Challenge
- Author
-
Mokheleli, Tsholofelo, Bokaba, Tebogo, Museba, Tinofirei, Ntshingila, Nompumelelo, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin, Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Masinde, Muthoni, editor, Möbs, Sabine, editor, and Bagula, Antoine, editor
- Published
- 2024
- Full Text
- View/download PDF
34. Data Imputation Using Artificial Neural Network for a Reservoir System
- Author
-
Shrinivas, Chintala Rahulsai, Bhatia, Rajesh, Wadhwa, Shruti, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Kumar, Sandeep, editor, K., Balachandran, editor, Kim, Joong Hoon, editor, and Bansal, Jagdish Chand, editor
- Published
- 2024
- Full Text
- View/download PDF
35. Robustness of Centrality Measures Under Incomplete Data
- Author
-
Meshcheryakova, Natalia, Shvydun, Sergey, Kacprzyk, Janusz, Series Editor, Cherifi, Hocine, editor, Rocha, Luis M., editor, Cherifi, Chantal, editor, and Donduran, Murat, editor
- Published
- 2024
- Full Text
- View/download PDF
36. Evaluation of the Impact of Gap Filling Technology in Precipitation Series on the Estimation of Climate Trends, the Case of the Souss Massa Watershed
- Author
-
Oumechtaq Ismail, Abbdelmajid Laghzali, Tarik Bahaj, Oulidi Abderrahim, Amghar Lamya, Allaoui Abdelhamid, Mouadil Manal, Mustapha Boualoul, Bachaoui El Mostafa, and Elkhaldi Khalid
- Subjects
precipitation ,r software ,change point ,knn ,data imputation ,mice ,climate trends ,missforest ,Ecology ,QH540-549.5 - Abstract
Accurate climatic data, especially precipitation measurements, play a critical role in various studies concerning the water cycle, particularly in modeling flood and drought risks. Unfortunately, these datasets often suffer from temporary gaps that are randomly dispersed over time. This study aims to assess the effectiveness of three imputation methods, KNN, MICE, and missForest, at imputing missing values in climate series. The evaluation is conducted in two distinct rainfall regimes: the Moulouya basin and the Sous Massa basin. The performance analysis considers the percentage of missing data across the entire dataset. The imputed datasets are used to estimate annual precipitation, which is then subjected to statistical tests to identify potential trends and detect changepoints. The analysis focuses on the precipitation series within the Souss Massa watershed, encompassing 27 rainfall stations. Results indicate that data imputation has a highly positive impact on the study of rainfall series trends and change point detection. The study found that studying trends without data imputation could lead to questionable conclusions. The most significant breakpoints detected in the analyzed rainfall series were in the years 1988, 1991, 1997, 2007, and 2010. The decrease in precipitation at stations showing a downward trend varies between -60 mm and -137 mm using the MICE method, and between -40 mm and 186 mm using the missForest method.
- Published
- 2024
- Full Text
- View/download PDF
37. The Effect of Data Missingness on Machine Learning Predictions of Uncontrolled Diabetes Using All of Us Data
- Author
-
Zain Jabbar and Peter Washington
- Subjects
algorithmic fairness ,electronic health records ,data missingness ,data imputation ,diabetes ,Neurosciences. Biological psychiatry. Neuropsychiatry ,RC321-571 ,Computer applications to medicine. Medical informatics ,R858-859.7 - Abstract
Electronic Health Records (EHR) provide a vast amount of patient data relevant to predicting clinical outcomes. The inherent presence of missing values poses challenges to building performant machine learning models. This paper investigates the effect of various imputation methods on the National Institutes of Health's All of Us dataset, a dataset with a high degree of data missingness. We apply several imputation techniques, such as mean substitution, constant filling, and multiple imputation, to the same dataset for the task of diabetes prediction. We find that imputing values causes heteroskedastic performance for machine learning models as data missingness increases. That is, the more missing values a patient has in their tests, the higher the variance in a diabetes model's AUROC, F1, precision, recall, and accuracy scores. This highlights a critical challenge in using EHR data for predictive modeling. This work underscores the need for future research to develop methodologies to mitigate the effects of missing data and heteroskedasticity in EHR-based predictive models.
- Published
- 2024
- Full Text
- View/download PDF
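The two simplest techniques this study compares, mean substitution and constant filling, can be sketched as follows. The lab values are invented; the paper applies these methods to All of Us EHR features:

```python
def mean_impute(values):
    """Mean substitution: replace missing entries (None) with the mean
    of the observed entries."""
    observed = [v for v in values if v is not None]
    mu = sum(observed) / len(observed)
    return [mu if v is None else v for v in values]

def constant_impute(values, fill=0.0):
    """Constant filling: replace missing entries with a fixed constant."""
    return [fill if v is None else v for v in values]

glucose = [95.0, None, 110.0, None, 101.0]  # toy lab series
print(mean_impute(glucose))      # [95.0, 102.0, 110.0, 102.0, 101.0]
print(constant_impute(glucose))  # [95.0, 0.0, 110.0, 0.0, 101.0]
```

A patient with many gaps ends up with many identical substituted values, so the filled features carry less individual signal; that shrinking information content is one intuition for the heteroskedastic model performance the study reports.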
38. QAR Data Imputation Using Generative Adversarial Network with Self-Attention Mechanism
- Author
-
Jingqi Zhao, Chuitian Rong, Xin Dang, and Huabo Sun
- Subjects
multivariate time series ,data imputation ,self-attention ,generative adversarial network (gan) ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Quick Access Recorder (QAR), an important device for storing data from various flight parameters, contains a large amount of valuable data and comprehensively records the real state of the airline flight. However, the recorded data have certain missing values due to factors such as weather and equipment anomalies. These missing values seriously affect the analysis of QAR data by aeronautical engineers, such as airline flight scenario reproduction and airline flight safety status assessment. Therefore, imputing the missing values in QAR data, which can further guarantee the flight safety of airlines, is crucial. QAR data also have multivariate, multiprocess, and temporal features. Therefore, we propose the imputation models A-AEGAN ("A" denotes attention mechanism, "AE" denotes autoencoder, and "GAN" denotes generative adversarial network) and SA-AEGAN ("SA" denotes self-attention mechanism) for missing values in QAR data. Specifically, we apply an innovative generative adversarial network to impute missing values from QAR data. An improved gated recurrent unit is then introduced as the neural unit of the GAN, which can successfully capture the temporal relationships in QAR data. In addition, we modify the basic structure of the GAN by using an autoencoder as the generator and a recurrent neural network as the discriminator. The missing values in the QAR data are imputed using the adversarial relationship between the generator and discriminator. We introduce an attention mechanism in the autoencoder to further improve the capability of the proposed model to capture the features of QAR data. Attention mechanisms can maintain the correlation among QAR data and improve the capability of the model to impute missing data. Furthermore, we improve the proposed model by integrating a self-attention mechanism to further capture the relationships between different parameters within the QAR data. Experimental results on real datasets demonstrate that the models impute the missing values in QAR data effectively.
- Published
- 2024
- Full Text
- View/download PDF
39. Exploring the optimization of autoencoder design for imputing single-cell RNA sequencing data
- Author
-
Xi, Nan Miles and Li, Jingyi Jessica
- Subjects
Information and Computing Sciences ,Biological Sciences ,Machine Learning ,Genetics ,Neurosciences ,Generic health relevance ,Good Health and Well Being ,ScRNA-seq ,Data imputation ,Autoencoder design ,Benchmark ,Numerical and Computational Mathematics ,Computation Theory and Mathematics ,Biochemistry and cell biology ,Applied computing - Abstract
Autoencoders are the backbones of many imputation methods that aim to relieve the sparsity issue in single-cell RNA sequencing (scRNA-seq) data. The imputation performance of an autoencoder relies on both the neural network architecture and the hyperparameter choice. So far, literature in the single-cell field lacks a formal discussion on how to design the neural network and choose the hyperparameters. Here, we conducted an empirical study to answer this question. Our study used many real and simulated scRNA-seq datasets to examine the impacts of the neural network architecture, the activation function, and the regularization strategy on imputation accuracy and downstream analyses. Our results show that (i) deeper and narrower autoencoders generally lead to better imputation performance; (ii) the sigmoid and tanh activation functions consistently outperform other commonly used functions including ReLU; (iii) regularization improves the accuracy of imputation and downstream cell clustering and DE gene analyses. Notably, our results differ from common practices in the computer vision field regarding the activation function and the regularization strategy. Overall, our study offers practical guidance on how to optimize the autoencoder design for scRNA-seq data imputation.
- Published
- 2023
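The design guidance in entry 39 (deeper, narrower autoencoders; tanh over ReLU; regularization) can be sketched as a small imputation loop. This is an illustrative sketch and not the authors' implementation; the function name, layer sizes, regularization strength, and the zeros-as-dropouts convention are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def autoencoder_impute(X, n_iter=3, hidden=(64, 32, 16, 32, 64), seed=0):
    """Impute zeros (candidate dropout events) in a counts-like matrix
    with a deep, narrow tanh autoencoder. Illustrative sketch only."""
    X = X.astype(float)
    mask = X == 0                          # treat zeros as candidate dropouts
    filled = X.copy()
    for _ in range(n_iter):
        ae = MLPRegressor(hidden_layer_sizes=hidden,
                          activation="tanh",   # tanh beat ReLU in the study
                          alpha=1e-3,          # L2 regularization
                          max_iter=300, random_state=seed)
        ae.fit(filled, filled)                 # reconstruct the input
        recon = ae.predict(filled)
        filled[mask] = np.clip(recon[mask], 0, None)  # replace dropouts only
    return filled
```

Iterating the fit-and-replace loop lets the reconstruction of dropout entries stabilize while observed counts are left untouched.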
40. funspace: An R package to build, analyse and plot functional trait spaces.
- Author
-
Carmona, Carlos P., Pavanetto, Nicola, and Puglielli, Giacomo
- Subjects
- *
RESEARCH personnel , *PRINCIPAL components analysis - Abstract
Aim: Functional trait space analyses are pivotal to describe and compare organisms' functional diversity across the tree of life. Yet, there is no single application that streamlines the many sometimes‐troublesome steps needed to build and analyse functional trait spaces. Innovation: To fill this gap, we propose funspace, an R package to easily handle bivariate and multivariate functional trait space analyses. The six functions that constitute the package can be grouped in three modules: 'Building and exploring', 'Mapping' and 'Plotting'. The building and exploring module defines the main features of a functional trait space (e.g. functional diversity metrics) by leveraging kernel density‐based methods. The mapping module uses general additive models to map how a target variable distributes within a trait space. The plotting module provides many options for creating flexible and publication‐ready figures representing the outputs obtained from previous modules. We provide a worked example to demonstrate a complete funspace workflow. Main Conclusions: funspace will provide researchers working with functional traits across the tree of life with a new tool to easily explore: (i) the main features of any functional trait space, (ii) the relationship between a functional trait space and any other biological or non‐biological factor that might contribute to shaping species' functional diversity. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
41. Learning the structure of the mTOR protein signaling pathway from protein phosphorylation data.
- Author
-
Salam, Abdul and Grzegorczyk, Marco
- Subjects
- *
MTOR protein , *PROTEIN structure , *CELLULAR signal transduction , *SYSTEMS biology , *COMPUTATIONAL biology , *POLYMER networks - Abstract
Statistical learning of the structures of cellular networks, such as protein signaling pathways, is a topical research field in computational systems biology. To get the most information out of experimental data, it is often required to develop a tailored statistical approach rather than applying one of the off-the-shelf network reconstruction methods. The focus of this paper is on learning the structure of the mTOR protein signaling pathway from immunoblotting protein phosphorylation data. Under two experimental conditions eleven phosphorylation sites of eight key proteins of the mTOR pathway were measured at ten non-equidistant time points. For the statistical analysis we propose a new advanced hierarchically coupled non-homogeneous dynamic Bayesian network (NH-DBN) model, and we consider various data imputation methods for dealing with non-equidistant temporal observations. Because of the absence of a true gold standard network, we propose to use predictive probabilities in combination with a leave-one-out cross validation strategy to objectively cross-compare the accuracies of different NH-DBN models and data imputation methods. Finally, we employ the best combination of model and data imputation method for predicting the structure of the mTOR protein signaling pathway. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. Estimation of missing weather variables using different data mining techniques for avalanche forecasting.
- Author
-
Kaur, Prabhjot, Joshi, Jagdish Chandra, and Aggarwal, Preeti
- Subjects
DATA mining ,FORECASTING ,MISSING data (Statistics) ,SCATTER diagrams ,WEATHER ,LANDSLIDES ,WEATHER forecasting - Abstract
The availability of continuous weather data is essential in many applications such as the study of hydrology, glaciology, and modelling of extreme catastrophic events such as landslides, heavy precipitation, cloud bursts and snow avalanches. Weather data are collected either manually or automatically, and due to a variety of reasons, it becomes difficult to maintain continuous records of these data. In the present study, different data mining techniques like multivariate imputation by chained equations and nearest neighbour have been used to address the missing data problem for avalanche forecasting over the Himalayas. Six weather variables used in all avalanche and weather forecasting models, maximum temperature, minimum temperature, wind speed, pressure, fresh snow and relative humidity, have been made available from 1996 to 2019. Missing data are generated randomly at rates of 10, 15, 20 and 30% in order to study the algorithms. Scatter plots, root-mean-square error and coefficient of determination (R²) of the generated missing data have been computed. Case analysis of imputed major snow events is done from 2017 to 2019, demonstrating proficient imputation. The performance of artificial neural network-based avalanche forecasting models has been compared with and without data imputation. Results of the study are promising, as HSS and accuracy for avalanche forecasting models improve to 0.36 from 0.31 and 0.74 from 0.71, respectively, for Station-1, and HSS to 0.3 from 0.24 and accuracy to 0.72 from 0.68 for Station-2 after missing data imputation. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
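The two imputation techniques named in entry 42, multivariate imputation by chained equations (MICE) and nearest neighbour, are both available in scikit-learn. A minimal sketch, assuming a days-by-variables weather matrix with NaNs marking gaps; the function name and settings are illustrative:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer

def impute_weather(X, method="mice", k=5, seed=0):
    """Fill NaNs in a (days x variables) weather matrix using either
    chained-equations regression (MICE-style) or k-nearest neighbours.
    Observed entries are preserved; only missing positions are filled."""
    if method == "mice":
        imp = IterativeImputer(max_iter=10, random_state=seed)
    else:
        imp = KNNImputer(n_neighbors=k)
    return imp.fit_transform(X)
```

Both imputers exploit the correlation between variables (e.g., temperature and humidity), which is what makes them suitable for co-recorded weather series.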
43. Regenerating Networked Systems’ Monitoring Traces Using Neural Networks.
- Author
-
Paim, Kayuã Oleques, Quincozes, Vagner Ereno, Kreutz, Diego, Mansilha, Rodrigo Brandão, and Cordeiro, Weverton
- Abstract
Monitoring the main entities in distributed systems is important for research, development, and innovation activities involving those systems (offline analysis, simulated evaluation, etc.). In many of them, monitoring can only be done via periodic and indirect sampling of the entities online (e.g., obtaining peer lists from trackers in BitTorrent). A problem with such an approach is that the monitoring system may fail to see one or more online entities when samples are captured. Avoiding such failures by increasing monitoring resources (i.e., using fault tolerance techniques) may be challenging due to restrictions imposed by observed entities (e.g., a minimum interval between monitoring requests), if not impossible (e.g., monitoring again the behavior of a system during the 2023 Women’s World Cup). To cope with such failures after the monitoring has occurred, previous investigations have applied statistical methods to identify and correct failures. In this paper, we move in that direction by investigating artificial neural networks as a means to regenerate monitoring traces collected via sampling. We propose a deep-learning-based algorithm and three neural network topologies for correcting traces. We provide evidence that precision, accuracy, and recall can be substantially improved compared to existing statistical methods. The proposed method has the potential to pave the way for improving the quality of monitoring traces of large distributed systems using neural networks, for increasing the quality of previously taken monitoring traces, and for delivering more resource-efficient distributed system monitoring. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. The Effect of Data Missingness on Machine Learning Predictions of Uncontrolled Diabetes Using All of Us Data.
- Author
-
Jabbar, Zain and Washington, Peter
- Subjects
- *
ELECTRONIC health records , *MACHINE learning , *DIABETES , *CLINICAL trials , *PREDICTION models - Abstract
Electronic Health Records (EHR) provide a vast amount of patient data that are relevant to predicting clinical outcomes. The inherent presence of missing values poses challenges to building performant machine learning models. This paper investigates the effect of various imputation methods on the National Institutes of Health's All of Us dataset, a dataset containing a high degree of data missingness. We apply several imputation techniques, such as mean substitution, constant filling, and multiple imputation, to the same dataset for the task of diabetes prediction. We find that imputing values causes heteroskedastic performance for machine learning models as data missingness increases. That is, the more missing values a patient has for their tests, the higher the variance in the diabetes model's AUROC, F1, precision, recall, and accuracy scores. This highlights a critical challenge in using EHR data for predictive modeling. This work highlights the need for future research to develop methodologies that mitigate the effects of missing data and heteroskedasticity in EHR-based predictive models. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
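A hedged sketch of the kind of pipeline entry 44 studies: simple imputation (mean or constant filling) plus an explicit per-patient missingness fraction, so a downstream diabetes model can condition on how much of each row was imputed. The helper name and the appended missingness feature are assumptions for illustration, not the paper's method:

```python
import numpy as np
from sklearn.impute import SimpleImputer

def impute_and_flag(X, strategy="mean", fill_value=0.0):
    """Impute EHR-style missing lab values and append each row's
    missingness fraction as an extra feature column. Illustrative only:
    the paper reports that higher per-patient missingness yields
    higher-variance model scores, which this feature makes visible."""
    miss_frac = np.isnan(X).mean(axis=1, keepdims=True)  # fraction missing per patient
    imp = SimpleImputer(strategy=strategy, fill_value=fill_value)
    return np.hstack([imp.fit_transform(X), miss_frac])
```

Swapping `strategy="mean"` for `"constant"` reproduces the constant-filling baseline; multiple imputation would require scikit-learn's `IterativeImputer` instead.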
45. IMPUTING MISSING DATA IN A SWAT WATER QUALITY MODELLING STUDY USING STATISTICAL METHODS.
- Author
-
Boyacıoğlu, Hülya, Uyar, Meltem Kaya, and Boyacıoğlu, Hayal
- Abstract
Large water-quality databases are useful in modeling studies to identify optimal measures for pollution mitigation and management of water basins. The objective of the study was to apply statistical methods to impute missing data in the water quality simulation study in the Küçük Menderes River Basin, Türkiye, where missing data caused by a lack of periodic sampling is an important challenge. In the study, the Soil Water Assessment Tool (SWAT) was used to simulate nitrate-nitrogen concentrations (NO3-N). Water-quality data collected between 2001 and 2012 from the outlet of the basin was subjected to regression analysis-based imputation methods. In this scope, simple regression models were developed to estimate missing water quality data. Hence, a continuous data set was created, and then the SWAT water quality model was calibrated and validated. Since the calculated Nash–Sutcliffe model efficiency coefficient values were above 0.65, model simulations were judged "good". Furthermore, the Mann-Whitney U test was applied to test model performance by comparing continuous data generated by the SWAT model with the limited observed water quality data. It can be concluded that a simple regression model and non-parametric Mann-Whitney U tests can be performed to impute missing data and evaluate model performance in modeling studies of data-shortage basins. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
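The regression-based imputation in entry 45 reduces to fitting a simple least-squares line on the observed pairs and evaluating it at the missing ones. A minimal sketch; the choice of covariate `x` (e.g., a concurrently measured flow or nutrient series) is a hypothetical illustration:

```python
import numpy as np

def regression_impute(y, x):
    """Fill missing y values (e.g., NO3-N concentrations) from an
    observed covariate x via a simple least-squares line, in the spirit
    of the paper's regression-based imputation. Sketch only."""
    obs = ~np.isnan(y)
    slope, intercept = np.polyfit(x[obs], y[obs], 1)  # fit on observed pairs
    y = y.copy()
    y[~obs] = slope * x[~obs] + intercept             # predict at the gaps
    return y
```

The resulting continuous series can then feed a calibration/validation cycle such as the SWAT workflow the entry describes.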
46. Forecasting PM2.5 Concentration Using Gradient-Boosted Regression Tree with CNN Learning Model.
- Author
-
Usha Ruby, A., Chandran, J. George Chellin, Theerthagiri, Prasannavenkatesan, Patil, Renuka, Chaithanya, B. N., and Jain, T. J. Swasthika
- Abstract
Air pollution caused by particulate matter (PM) has made it a public health concern and a hazard to humans and the environment. Reduced vision, allergic responses, pneumonia, asthma, cardiovascular disorders, lung cancer, and even mortality can result from prolonged exposure to high concentrations of fine particulate matter. Air quality prediction can offer reliable information about future air pollution status, supporting effective air pollution control and preventative planning. Tracking, predicting, and regulating emissions is crucial. Controlling PM2.5 is key to enhancing air quality, and it can be accomplished by forecasting PM2.5 concentrations. This work develops a methodology for forecasting PM2.5 concentrations using a gradient-boosted regression tree with a Convolutional Neural Network (CNN) and fuzzy K-nearest neighbour (fuzzy-KNN). The results of the proposed methodology have been comparatively analysed against multiple linear regression, stacked long short-term memory, bidirectional gated recurrent unit, and gradient-boosted regression tree models. The Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) are evaluated, and the results show that the gradient-boosted regression tree model produces a reduced error with improved accuracy in forecasting air quality. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
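A gradient-boosted regression tree forecaster like the baseline in entry 46 can be sketched with plain lagged features. This simplification omits the paper's CNN and fuzzy-KNN components; the function name and lag count are assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def forecast_pm25(series, n_lags=3):
    """One-step-ahead PM2.5 forecast from lagged values with a
    gradient-boosted regression tree. Each training row holds n_lags
    consecutive values; the target is the next value in the series."""
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    model = GradientBoostingRegressor(random_state=0).fit(X, y)
    return model.predict(series[-n_lags:].reshape(1, -1))[0]
```

Error metrics such as RMSE, MAE, and MAPE, as in the entry, would then be computed on a held-out tail of the series rather than on the training rows.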
47. A visual analysis approach for data imputation via multi-party tabular data correlation strategies.
- Author
-
Zhu, Haiyang, Han, Dongming, Pan, Jiacheng, Wei, Yating, Feng, Yingchaojie, Weng, Luoxuan, Mao, Ketian, Xing, Yuankai, Lv, Jianshu, Wan, Qiucheng, and Chen, Wei
- Abstract
Data imputation is an essential pre-processing task for data governance, aimed at filling in incomplete data. However, conventional data imputation methods can only partly alleviate data incompleteness using isolated tabular data, and they fail to achieve the best balance between accuracy and efficiency. In this paper, we present a novel visual analysis approach for data imputation. We develop a multi-party tabular data association strategy that uses intelligent algorithms to identify similar columns and establish column correlations across multiple tables. Then, we perform the initial imputation of incomplete data using correlated data entries from other tables. Additionally, we develop a visual analysis system to refine data imputation candidates. Our interactive system combines the multi-party data imputation approach with expert knowledge, allowing for a better understanding of the relational structure of the data. This significantly enhances the accuracy and efficiency of data imputation, thereby enhancing the quality of data governance and the intrinsic value of data assets. Experimental validation and user surveys demonstrate that this method supports users in verifying and judging the associated columns and similar rows using their domain knowledge. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
48. A framework for cloud cover prediction using machine learning with data imputation.
- Author
-
Mandal, Nabanita and Sarode, Tanuja
- Abstract
The climatic conditions of a region are affected by multiple factors: dew point temperature, humidity, wind speed, and wind direction. These factors are closely related to each other. In this paper, the correlation between these factors is studied, and an approach for data imputation is proposed. The idea is to utilize all these features to predict the total cloud cover of a region instead of removing the missing values. Total cloud cover prediction is significant because it affects the agriculture, aviation, and energy sectors. Based on the imputed data obtained as the output of the proposed method, a machine learning-based model is proposed. The foundation of this model is the bi-directional variant of the long short-term memory (LSTM) model. It is trained for 8 stations using two different approaches. In the first approach, 80% of the entire data is used for training and 20% for testing. In the second approach, 90% of the data is used for training and 10% for testing. It is observed that the model gives a lower prediction error in the first approach. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
49. An L1-and-L2-regularized nonnegative tensor factorization for power load monitoring data imputation
- Author
-
Xing Luo, Zijian Hu, Zhoujun Ma, Zhan Lv, Qu Wang, and Aoling Zeng
- Subjects
power load forecasting ,power load monitoring ,missing data ,tensor factorization ,imputation models ,data imputation ,General Works - Abstract
As smart grids advance, Power Load Forecasting (PLF) has become a research hotspot. As the foundation of the forecasting model, Power Load Monitoring (PLM) data takes on great importance due to its completeness, reliability and accuracy. However, monitoring equipment failures, transmission channel congestion and anomalies result in missing PLM data, which directly affects the performance of the PLF model. To address this issue, this paper proposes an L1-and-L2-Regularized Nonnegative Tensor Factorization (LNTF) model to impute missing PLM data. Its main idea is threefold: (1) combining L1 and L2 norms to achieve effective feature extraction and improve the model’s robustness; (2) incorporating two temporal-dependent linear biases to describe the fluctuations of PLM data; (3) adding nonnegative constraints to precisely define the nonnegativity of PLM data. Extensive empirical studies on two publicly available real-world PLM datasets with 1,569,491 and 413,357 known entries and missing rates of 93.35% and 96.75% demonstrate that the proposed LNTF improves 14.04%, 59.31%, and 71.43% on average over the state-of-the-art imputation models in terms of imputation error, convergence rounds, and time cost, respectively. Its high computational efficiency and low imputation error make practical sense for PLM data imputation.
- Published
- 2024
- Full Text
- View/download PDF
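The LNTF idea in entry 49 can be illustrated in two dimensions: factorize the observed entries of a nonnegative load matrix under L1+L2 regularization and read imputations off the low-rank reconstruction. This sketch drops the paper's tensor structure and temporal bias terms; the function name and hyperparameters are assumptions:

```python
import numpy as np

def lnmf_impute(M, mask, rank=2, l1=1e-3, l2=1e-2, lr=1e-2, epochs=500, seed=0):
    """Impute a nonnegative load matrix from its observed entries via a
    rank-r factorization with L1+L2 regularization, a 2-D simplification
    of the paper's tensor model. mask is 1 where M is observed."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    U = rng.random((n, rank))
    V = rng.random((m, rank))
    for _ in range(epochs):
        E = mask * (U @ V.T - M)               # error on observed entries only
        gU = E @ V + l1 * np.sign(U) + l2 * U  # gradient with L1+L2 penalties
        gV = E.T @ U + l1 * np.sign(V) + l2 * V
        U = np.maximum(U - lr * gU, 0)         # projection keeps factors >= 0
        V = np.maximum(V - lr * gV, 0)
    return U @ V.T                             # missing entries read off here
```

The nonnegativity projection plays the role of the paper's nonnegative constraints, guaranteeing that imputed load values are never negative.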
50. Anomaly Signal Imputation Using Latent Coordination Relations
- Author
-
Thasorn Chalongvorachai and Kuntpong Woraratpanya
- Subjects
Data imputation ,time series analysis ,anomaly detection ,neural networks ,variational autoencoder ,latent coordination relations ,Electrical engineering. Electronics. Nuclear engineering ,TK1-9971 - Abstract
Missing data is a critical challenge in industrial data analysis, particularly during anomaly incidents caused by system equipment malfunctions or, more critically, by cyberattacks in industrial systems. It impedes effective imputation and compromises data integrity. Existing statistical and machine learning techniques struggle with heavily missing data, often failing to restore original data characteristics. To address this, we propose Anomaly Signal Imputation Using Latent Coordination Relations, a framework employing a variational autoencoder (VAE) to learn from complete data and establish a robust imputation model based on latent space coordination points. Experimental results from a water treatment testbed show significant improvements in output signal fidelity despite substantial data loss, outperforming conventional techniques.
- Published
- 2024
- Full Text
- View/download PDF