1,355 results for "Data imputation"
Search Results
202. Adaptive RBF Interpolation for Estimating Missing Values in Geographical Data
- Author
-
Gao, Kaifeng, Mei, Gang, Cuomo, Salvatore, Piccialli, Francesco, Xu, Nengxiong, Sergeyev, Yaroslav D., editor, and Kvasov, Dmitri E., editor
- Published
- 2020
- Full Text
- View/download PDF
203. Machine learning-based identification of patients with a cardiovascular defect
- Author
-
Nabaouia Louridi, Samira Douzi, and Bouabid El Ouahidi
- Subjects
Cardiovascular diseases ,Data imputation ,Machine learning ,Preprocessing ,Normalization ,Computer engineering. Computer hardware ,TK7885-7895 ,Information technology ,T58.5-58.64 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Abstract Cardiovascular diseases have long been among the most serious medical problems. According to the World Health Organization, heart disease ranks at the top of the ten leading causes of death. Correct and early identification is a vital step in rehabilitation and treatment. Diagnosing heart defects requires a system able to predict the existence of heart disease. In the current article, our main motivation is to develop an effective intelligent medical system based on machine learning techniques, to aid in identifying a patient’s heart condition and guide a doctor in making an accurate diagnosis of whether or not a patient has cardiovascular disease. Using multiple data processing techniques, we address the problems of missing data and imbalanced data in the publicly available UCI Heart Disease dataset and the Framingham dataset. Furthermore, we use machine learning to select the most effective algorithm for predicting cardiovascular diseases. Different metrics, such as accuracy, sensitivity, F-measure, and precision, were used to test our system, demonstrating that the proposed approach significantly outperforms other models.
- Published
- 2021
- Full Text
- View/download PDF
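The metrics named in this abstract (accuracy, sensitivity, precision, F-measure) all derive from the binary confusion matrix. A minimal, dependency-free sketch of how they are computed; the function and label names are illustrative, not from the paper:

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity (recall), precision and F-measure from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                 if precision + sensitivity else 0.0)
    return accuracy, sensitivity, precision, f_measure
```

In a diagnosis setting like this one, sensitivity (the fraction of true patients detected) is usually weighted more heavily than raw accuracy, which is why the paper reports all four.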
204. A genetic algorithm for multivariate missing data imputation.
- Author
-
Figueroa-García, Juan Carlos, Neruda, Roman, and Hernandez-Pérez, German
- Subjects
- *
MISSING data (Statistics) , *GENETIC algorithms , *EXPECTATION-maximization algorithms , *GLOBAL optimization , *ELECTRONIC data processing , *DATA mining - Abstract
Some data mining, AI, and data processing tasks suffer data loss, whose estimation/imputation is an important problem to solve. Genetic algorithms are efficient and flexible global optimization methods able to handle both multiple missing observations and mixed feature types (continuous/discrete/binary) often found in multivariate databases, unlike classical missing-data estimation methods, which only handle univariate continuous data. This paper presents a genetic algorithm that imputes multiple missing observations in multivariate data by minimizing a new multi-objective (fitness) function based on the Minkowski distance between the means, variances, covariances, and skewness of the available and completed data. Two sets of examples were tested: a continuous/discrete dataset, compared against both the EM algorithm and auxiliary regressions, and a comparison over seven benchmark datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
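The fitness function this abstract describes, a Minkowski distance between the moments of the available and completed data, can be sketched compactly. The toy below restricts the fitness to means and variances and uses a simplified mutation-only GA with elitism; all names and parameter values are illustrative assumptions, not the authors' implementation:

```python
import random
import statistics

def minkowski_fitness(observed, fills, p=2):
    """Minkowski distance between moments of observed and completed data."""
    completed = observed + fills
    d_mean = abs(statistics.mean(completed) - statistics.mean(observed))
    d_var = abs(statistics.pvariance(completed) - statistics.pvariance(observed))
    return (d_mean ** p + d_var ** p) ** (1 / p)

def ga_impute(observed, n_missing, pop_size=30, generations=60, seed=7):
    """Mutation-only GA with elitism: evolve fill values that minimize the fitness."""
    rng = random.Random(seed)
    lo, hi = min(observed), max(observed)
    pop = [[rng.uniform(lo, hi) for _ in range(n_missing)] for _ in range(pop_size)]
    best = min(pop, key=lambda c: minkowski_fitness(observed, c))
    for _ in range(generations):
        # keep the elite candidate; refill the population with mutated copies
        pop = [best] + [
            [min(hi, max(lo, g + rng.gauss(0, 0.1 * (hi - lo)))) for g in best]
            for _ in range(pop_size - 1)
        ]
        best = min(pop, key=lambda c: minkowski_fitness(observed, c))
    return best
```

A faithful version would add crossover, covariance and skewness terms, and per-feature handling of discrete/binary columns, as described in the abstract.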
205. Machine Learning-Based Imputation Approach with Dynamic Feature Extraction for Wireless RAN Performance Data Preprocessing
- Author
-
Jean Nestor M. Dahj and Kingsley A. Ogudo
- Subjects
machine learning (ML) ,data imputation ,radio access network (RAN) ,data preprocessing ,telecommunications ,mobile network operators (MNOs) ,Mathematics ,QA1-939 - Abstract
Machine learning (ML) in wireless mobile communication is becoming increasingly common, with application trends leaning toward performance improvement and network automation. The radio access network (RAN), critical for service access, frequently generates performance data that mobile network operators (MNOs) and researchers leverage for planning, self-optimization, and intelligent network operations. However, missing values in RAN performance data, as in any valuable data, impair analysis. Poor handling of such missing data can distort the relationships between different metrics, leading to inaccurate and unreliable conclusions and predictions. There is therefore a need for imputation methods that preserve the overall structure of the RAN data as far as possible. In this study, we present an imputation approach for handling missing RAN performance data based on machine learning algorithms. The method customizes the feature-extraction mechanism using dynamic correlation analysis. We apply the method to actual RAN performance indicator data and compare it against statistical imputation techniques such as the mean, median, and mode. The results show that machine learning-based imputation, as approached in this experimental study, preserves relationships between KPIs that non-ML techniques do not, with the Random Forest regressor giving the best imputation performance.
- Published
- 2023
- Full Text
- View/download PDF
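To illustrate why model-based imputation can preserve relationships between KPIs where mean filling cannot, the sketch below imputes one KPI from a correlated one. A plain least-squares regressor stands in for the paper's Random Forest regressor to keep the example dependency-free; the KPI values are invented:

```python
import statistics

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def regression_impute(x, y):
    """Fill missing y values (None) from a least-squares fit on complete pairs."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mx = statistics.mean(a for a, _ in pairs)
    my = statistics.mean(b for _, b in pairs)
    slope = (sum((a - mx) * (b - my) for a, b in pairs)
             / sum((a - mx) ** 2 for a, _ in pairs))
    return [b if b is not None else my + slope * (a - mx) for a, b in zip(x, y)]

def mean_impute(y):
    """Baseline: fill every missing value with the observed mean."""
    m = statistics.mean(b for b in y if b is not None)
    return [b if b is not None else m for b in y]
```

On correlated KPIs, the regression fill keeps the inter-KPI correlation intact, while mean filling flattens it, which is the structural distortion the abstract warns about.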
206. Identifying common driver modules by equilibrating coverage and mutual exclusivity across pan-cancer data.
- Author
-
Wu, Jingli, Wu, Cong, and Li, Gaoshi
- Subjects
- *
CANCER genes , *STOCHASTIC processes , *PROBLEM solving , *HIERARCHICAL clustering (Cluster analysis) , *RANDOM walks , *NEAREST neighbor analysis (Statistics) - Abstract
Identifying common driver modules from pan-cancer data is important for interpreting the heterogeneity of cancer, and accumulated omic data have made this feasible. In this paper, the pan-cancer common driver module identification problem is formulated, taking the frequency differences among various cancers into account. To solve this problem, a K-nearest neighbors based imputation algorithm, KNNImp, is first devised to infer variation values for potentially significant missing genes. Second, a random walk algorithm based on the harmonic mean of coverage and mutual exclusivity, HMCEwalk, is proposed. It weights the integrated PPI network with the harmonic mean of gene coverage scores and mutual exclusion scores across cancer types, and extracts modules through a random walk process. Experiments were conducted on both simulated and real biological data. The results on simulated data indicate that, given two types of cancer, the HMCEwalk algorithm tends to identify modules that not only mutate in a large proportion of samples of these cancers but also have similar proportions of mutated samples in each cancer. The results on biological data show that the presented imputation algorithm helps recover some important cancer-related genes. Compared with two state-of-the-art identification methods, MEXCOwalk and DriveWays, the presented method exhibits competitive performance in most instances in terms of revealing known cancer genes and producing modules with satisfactory coverage and mutual exclusivity for each cancer. Many detected modules participate in known cancer-related biological pathways. In addition, the presented method recognizes many cancer-associated genes omitted by MEXCOwalk and DriveWays. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
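The core of a method like HMCEwalk, harmonic-mean weighting of a gene network followed by a random walk, can be sketched on a toy graph. The scores, adjacency matrix, and the way edge weights combine endpoint scores below are illustrative assumptions, not the authors' exact formulation:

```python
import numpy as np

def harmonic_mean(a, b):
    """Harmonic mean balances two scores: it is high only when both are high."""
    return 2 * a * b / (a + b) if a + b else 0.0

def random_walk_with_restart(W, seed_idx, restart=0.3, tol=1e-10):
    """Stationary visiting probabilities on a column-normalized weighted graph."""
    P = W / W.sum(axis=0, keepdims=True)  # column-stochastic transition matrix
    e = np.zeros(W.shape[0])
    e[seed_idx] = 1.0
    p = e.copy()
    while True:
        p_next = (1 - restart) * P @ p + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

# Toy 4-gene network: per-gene coverage and mutual-exclusivity scores are
# combined by harmonic mean, and each edge averages its endpoints' scores.
cov = [0.9, 0.6, 0.8, 0.4]
mex = [0.7, 0.8, 0.5, 0.9]
score = [harmonic_mean(c, m) for c, m in zip(cov, mex)]
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
W = A * np.add.outer(score, score) / 2
p = random_walk_with_restart(W, seed_idx=0)
```

Genes with high stationary probability relative to the seed would then be grouped into candidate modules; the harmonic mean is the ingredient that equilibrates coverage against mutual exclusivity, since it penalizes genes strong on only one criterion.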
207. Missing Data Repairs for Traffic Flow With Self-Attention Generative Adversarial Imputation Net.
- Author
-
Zhang, Weibin, Zhang, Pulin, Yu, Yinghao, Li, Xiying, Biancardo, Salvatore Antonio, and Zhang, Junyi
- Abstract
With the rapid development of sensor technologies, time series data collected by multiple, spatially distributed sensors have been widely used in different research fields. Examples include geo-tagged temperature data collected by temperature sensors, air pollutant monitoring data, and traffic data collected by road traffic sensors. Due to sensor failure, communication errors, storage loss, and similar issues, data collected by sensors inevitably include missing values. However, models commonly used in the analysis of such large-scale data often rely on complete data sets. This paper proposes a model for imputing missing traffic flow data that combines a self-attention mechanism, an auto-encoder, and a generative adversarial network into a self-attention generative adversarial imputation net (SA-GAIN). The self-attention mechanism helps the proposed model effectively capture correlations between spatially distributed sensors at different time points. Adversarial training between two neural networks, a generator and a discriminator, allows the proposed model to generate imputed data close to the real data. In comparison with different imputation models, the proposed model shows the best performance in imputing missing data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
208. Imputation of missing precipitation data using KNN, SOM, RF, and FNN.
- Author
-
Sahoo, Abinash and Ghose, Dillip Kumar
- Subjects
- *
STANDARD deviations , *MULTIPLE imputation (Statistics) , *GEOSPATIAL data , *ESTIMATION theory , *SELF-organizing maps , *MISSING data (Statistics) , *K-nearest neighbor classification - Abstract
Efficient methods are necessary for interpolating precipitation data in geospatial systems. In recent years, there has been an increasing need to complete rainfall data networks. Reliable estimation of missing data is significant for hydrologists, meteorologists, and environmentalists. A study is conducted in the Cachar watershed, Assam state (India), on the imputation of missing precipitation data for nineteen rain gauging stations using K-nearest neighbors (KNN), self-organizing maps (SOM), random forest (RF), and a feed-forward neural network (FNN). Performance indices such as root mean squared error (RMSE), coefficient of determination (R2), and mean absolute error (MAE) are used to assess model efficacy. The indices show MAE (0.043), R2 (0.999), and RMSE (0.066) values for the FNN, demonstrating its effectiveness in imputing missing precipitation data compared with KNN, SOM, and RF, especially in regions with extreme missingness. The results of this research are highly useful for selecting suitable techniques to estimate precipitation data and reduce data gaps in complex watersheds like Cachar. For all stations, the performance indices of the proposed models fell within the standard range for hydrological modeling. This study can be well utilized for water resources management and hydrological modeling. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
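A bare-bones version of the KNN imputation and the three performance indices used in this study (RMSE, R², MAE) might look as follows; the station layout and rainfall readings are invented, and a real study would tune k and validate against held-out observations:

```python
import math

def knn_impute(rows, k=3):
    """Fill None entries using the k nearest complete rows, measured by
    Euclidean distance over the columns both rows observe."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = []
            for other in rows:
                if other is row or other[j] is None:
                    continue
                common = [(a, b) for c, (a, b) in enumerate(zip(row, other))
                          if c != j and a is not None and b is not None]
                if not common:
                    continue
                d = math.sqrt(sum((a - b) ** 2 for a, b in common) / len(common))
                donors.append((d, other[j]))
            donors.sort(key=lambda t: t[0])
            if donors:
                filled[i][j] = sum(val for _, val in donors[:k]) / min(k, len(donors))
    return filled

def rmse(y, yhat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mae(y, yhat):
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def r2(y, yhat):
    m = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - m) ** 2 for a in y)
    return 1 - ss_res / ss_tot
```

The SOM, RF, and FNN imputers compared in the paper would slot into the same evaluation loop: impute, then score the filled values against withheld ground truth with these indices.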
209. Data Imputation for Methylation by Variational Auto-Encoder.
- Author
-
WANG Xinfeng and HUANG Wei
- Subjects
MISSING data (Statistics) ,MULTIPLE imputation (Statistics) ,SINGULAR value decomposition ,DNA methylation ,FEATURE extraction ,K-nearest neighbor classification ,PRINCIPAL components analysis ,METHYLATION - Abstract
High-throughput sequencing technology is an important method for studying DNA methylation. Due to experimental and technical limitations, DNA methylation sequencing data contain missing values. To address this, VAE-MethImp, a variational auto-encoder model for imputing missing DNA methylation data, is proposed. VAE-MethImp is composed of an encoder layer, a hidden layer, and a decoder layer; it is a deep latent-space generative model with a powerful ability to reconstruct input data. The encoder layer infers the mean and variance; the hidden layer holds the normal distribution of the input data computed from the mean and variance output by the encoder layer; and the decoder layer decodes the information contained in the latent variables to generate reconstructed data. Imputation experiments on lung cancer and breast cancer data show that the features extracted by the VAE are more informative. The imputation accuracy of the VAE model is 4.8% higher than that of SVD, the best of the traditional methods compared (K-nearest neighbors (KNN), principal component analysis (PCA), and singular value decomposition (SVD)). Survival analysis shows that data imputed by the VAE have better predictive power, and also confirms that DNA methylation is directly related to cancer survival. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
210. Out-of-Sample Validity of the PROLOGUE Score to Predict Neurologic Function after Cardiac Arrest.
- Author
-
Schriefl, Christoph, Schoergenhofer, Christian, Buchtele, Nina, Mueller, Matthias, Poppe, Michael, Clodi, Christian, Ettl, Florian, Merrelaar, Anne, Boegl, Magdalena Sophie, Steininger, Philipp, Holzer, Michael, Herkner, Harald, and Schwameis, Michael
- Subjects
- *
CARDIAC arrest , *PREDICTIVE validity , *MARKOV chain Monte Carlo , *MARKOV processes - Abstract
Background: The clinical value of a prognostic score depends on its out-of-sample validity because inaccurate outcome prediction can be not only useless but potentially fatal. We aimed to evaluate the out-of-sample validity of a recently developed and highly accurate Korean prognostic score for predicting neurologic outcome after cardiac arrest in an independent, plausibly related sample of European cardiac arrest survivors. Methods: Analysis of data from a European cardiac arrest center, certified in compliance with the specifications of the German Council for Resuscitation. The study sample included adults with nontraumatic out-of-hospital cardiac arrest admitted between 2013 and 2018. Exposure was the PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages (PROLOGUE) score, including 12 clinical variables readily available at hospital admission. The outcome was poor 30-day neurologic function, as assessed using the cerebral performance category scale. The risk of a poor outcome was calculated using the PROLOGUE score regression equation. Predicted risk deciles were compared to observed outcome estimates in a complete-case analysis, a best-case analysis, and a multiple-data-imputation analysis using the Markov chain Monte Carlo method. Results: A total of 1051 patients (median 61 years, IQR 50–71; 29% female) were analyzed. A total of 808 patients (77%) were included in the complete-case analysis. The PROLOGUE score overestimated the risk of poor neurologic outcomes in the range of 40% to 100% predicted risk, involving 63% of patients. The model fit did not improve after missing data imputation. Conclusions: In a plausibly related sample of European cardiac arrest survivors, risk prediction by the PROLOGUE score was largely too pessimistic and failed to replicate the high accuracy found in the original study. 
Using the PROLOGUE score as an example, this study highlights the compelling need for independent validation of a proposed prognostic score to prevent potentially fatal mispredictions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
211. Data Imputation for Multivariate Time Series Sensor Data With Large Gaps of Missing Data.
- Author
-
Wu, Rui, Hamshaw, Scott D., Yang, Lei, Kincaid, Dustin W., Etheridge, Randall, and Ghasemkhani, Amir
- Abstract
Imputation of missing sensor-collected data is often an important step prior to machine learning and statistical data analysis. One particular data imputation challenge is filling large data gaps when the only related data come from the same sensor station. In this paper, we propose a framework to improve the popular multivariate imputation by chained equations (MICE) method for dealing with missing data. One key strategy we use to improve model accuracy is to reshape the original sensor data to leverage the correlation between the missing data and the observed data. We demonstrate our framework using data from continuous water quality monitoring stations in Vermont. Because of possible irregularly spaced peaks throughout the time series, the reshaped data are split into extreme and normal values and two MICE models are built. We also recommend that sensor-collected data be transformed to meet machine learning model assumptions. According to our experimental results, these strategies can improve MICE data imputation accuracy by at least 23% for large data gaps, based on R² values, and are promising for application to other data imputation algorithms. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
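The chained-equations idea behind MICE, cycling through incomplete variables and regressing each on the others, can be sketched in a few lines of NumPy. This is a minimal deterministic stand-in (a single linear model per column, no posterior draws, no extreme/normal split), not the authors' extended framework:

```python
import numpy as np

def mice_impute(X, n_iter=10):
    """Minimal chained-equations imputation: start from column means, then
    repeatedly regress each incomplete column on the others (least squares)
    and refresh only its missing entries with the fitted values."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            rows = ~miss[:, j]                      # rows where column j is observed
            others = np.delete(filled, j, axis=1)   # current values of the other columns
            A = np.column_stack([np.ones(rows.sum()), others[rows]])
            coef, *_ = np.linalg.lstsq(A, X[rows, j], rcond=None)
            pred = np.column_stack([np.ones(len(X)), others]) @ coef
            filled[miss[:, j], j] = pred[miss[:, j]]
    return filled
```

Library implementations (for example scikit-learn's experimental `IterativeImputer`) follow the same cycle but support richer per-column regressors, which is the slot where the paper's reshaping and extreme/normal split plug in.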
212. Study on Performance Evaluation and Prediction of Francis Turbine Units Considering Low-Quality Data and Variable Operating Conditions.
- Author
-
Duan, Ran, Liu, Jie, Zhou, Jianzhong, Liu, Yi, Wang, Pei, and Niu, Xinqiang
- Subjects
FRANCIS turbines ,GENERATIVE adversarial networks ,MISSING data (Statistics) ,SHORT-term memory ,PERFORMANCE theory ,FORECASTING - Abstract
The stable operation of the Francis turbine unit (FTU) determines the safety of the hydropower plant and the energy grid. Traditional FTU performance evaluation methods with a fixed threshold cannot avoid the influence of variable operating conditions. Meanwhile, anomaly samples and missing values in low-quality on-site data distort the monitoring signals, which greatly affects the evaluation and prediction accuracy of the FTU. Therefore, an approach to the performance evaluation and prediction of the FTU considering low-quality data and variable operating conditions is proposed in this study. First, taking variable operating conditions into consideration, an FTU on-site data-cleaning method based on DBSCAN is constructed to adaptively identify anomaly samples. Second, the gated recurrent unit with decay mechanism (GRUD) and the Wasserstein generative adversarial network (WGAN) are combined into the GRUD–WGAN model for missing data imputation. Third, to reduce the impact of data randomness, a healthy-state probability model of the FTU is established based on Gaussian process regression (GPR). Fourth, a prediction model based on temporal pattern attention–long short-term memory (TPA–LSTM) is constructed for accurate degradation trend forecasting. Finally, validation experiments were conducted on the on-site data set of a large FTU in production. The comparison experiments indicate that the proposed GRUD–WGAN has the highest accuracy at each data missing rate. In addition, since the cleaning and imputation improve data quality, the TPA–LSTM-based performance indicator prediction model achieves high accuracy and generalization performance. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
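The first step of the pipeline above, DBSCAN-based cleaning of anomaly samples, can be illustrated with a tiny self-contained DBSCAN; in practice one would use a library implementation such as scikit-learn's `DBSCAN`. Points labeled -1 are the anomaly samples to discard; the data and parameters below are invented:

```python
import math

def dbscan(points, eps=0.5, min_pts=3):
    """Tiny DBSCAN: returns one label per point; -1 marks noise (anomaly samples)."""
    n = len(points)
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1                    # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster           # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:  # core point: keep expanding
                queue.extend(neighbors[j])
    return labels
```

Because DBSCAN groups by density rather than a fixed threshold, normal operating points form clusters per operating condition while isolated anomaly samples fall out as noise, which is what makes it suitable for the variable-condition setting the abstract describes.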
213. Analysis of Spatiotemporal Data Imputation Methods for Traffic Flow Data in Urban Networks.
- Author
-
Joelianto, Endra, Fathurrahman, Muhammad Farhan, Sutarto, Herman Yoseph, Semanjski, Ivana, Putri, Adiyana, and Gautama, Sidharta
- Subjects
- *
CITY traffic , *TRAFFIC flow , *VEHICLE detectors , *TRAFFIC signs & signals , *DATA analysis , *PRINCIPAL components analysis - Abstract
The increase in traffic in cities worldwide has led to a need for better traffic management systems in urban networks. Despite advances in technology for traffic data collection, the collected data still suffer from significant issues, such as missing data, hence the need for data imputation methods. This paper explores a spatiotemporal probabilistic principal component analysis (PPCA) based data imputation method that utilizes traffic flow data from vehicle detectors and focuses specifically on detectors in urban networks as opposed to a freeway setting. In the urban context, detectors sit in a complex network, separated by traffic lights, measuring different flow directions on different types of roads. Different constructions of a spatial network are compared, from a single detector to a neighborhood and a city-wide network. Experiments are conducted on data from 285 detectors in the urban network of Surabaya, Indonesia, with a case study on the Diponegoro neighborhood. Methods are tested against both point-wise and interval-wise missing data in various scenarios. Results show that a spatial network adds robustness to the system and that the choice of subset affects the imputation error. Compared to a single detector, spatiotemporal PPCA is better suited for interval-wise errors and more robust against outliers and extreme missing data. Even when an entire day of data is missing, the method can still impute data accurately by relying on other vehicle detectors in the network. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
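A deterministic cousin of PPCA imputation, iterative low-rank SVD completion, conveys the idea: fill missing entries with column means, then alternate between a rank-r reconstruction and refreshing only the missing entries. A sketch under that simplification (not the paper's probabilistic model, which also estimates noise variance):

```python
import numpy as np

def pca_impute(X, rank=1, n_iter=50):
    """Iterative low-rank imputation, a deterministic stand-in for PPCA:
    mean-fill the gaps, then repeatedly project onto a rank-r subspace and
    overwrite only the missing entries with the reconstruction."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        recon = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        filled[miss] = recon[miss]
    return filled
```

In the traffic setting, rows would be time steps and columns detectors: correlated detectors span a low-dimensional subspace, which is why a whole missing day at one detector can still be reconstructed from its neighbors.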
214. A method for assessing robustness of the results of a star-shaped network meta-analysis under the unidentifiable consistency assumption
- Author
-
Jeong-Hwa Yoon, Sofia Dias, and Seokyung Hahn
- Subjects
Star-shaped network ,Indirect comparisons ,Network meta-analysis ,Inconsistency ,Sensitivity analysis ,Data imputation ,Medicine (General) ,R5-920 - Abstract
Abstract Background In a star-shaped network, pairwise comparisons link treatments with a reference treatment (often placebo or standard care), but not with each other. Thus, comparisons between non-reference treatments rely on indirect evidence, and are based on the unidentifiable consistency assumption, limiting the reliability of the results. We suggest a method of performing a sensitivity analysis through data imputation to assess the robustness of results with an unknown degree of inconsistency. Methods The method involves imputation of data for randomized controlled trials comparing non-reference treatments, to produce a complete network. The imputed data simulate a situation that would allow mixed treatment comparison, with a statistically acceptable extent of inconsistency. By comparing the agreement between the results obtained from the original star-shaped network meta-analysis and the results after incorporating the imputed data, the robustness of the results of the original star-shaped network meta-analysis can be quantified and assessed. To illustrate this method, we applied it to two real datasets and some simulated datasets. Results Applying the method to the star-shaped network formed by discarding all comparisons between non-reference treatments from a real complete network, 33% of the results from the analysis incorporating imputed data under acceptable inconsistency indicated that the treatment ranking would be different from the ranking obtained from the star-shaped network. Through a simulation study, we demonstrated the sensitivity of the results after data imputation for a star-shaped network with different levels of within- and between-study variability. An extended usability of the method was also demonstrated by another example where some head-to-head comparisons were incorporated. 
Conclusions Our method will serve as a practical technique to assess the reliability of results from a star-shaped network meta-analysis under the unverifiable consistency assumption.
- Published
- 2021
- Full Text
- View/download PDF
215. Machine-learning methods for hydrological imputation data: analysis of the goodness of fit of the model in hydrographic systems of the Pacific - Ecuador
- Author
-
Diego Heras and Carlos Matovelle
- Subjects
data imputation ,hydrographic systems ,machine learning ,Environmental sciences ,GE1-350 - Abstract
Computational methods based on machine learning have seen extensive development and application in hydrology, especially for modelling systems that do not have enough data. Within this problem, there are data series with missing values that should not necessarily be discarded; instead, they can be imputed to obtain complete sets. This research therefore compares machine-learning techniques to identify those best suited to hydrographic systems of the Pacific of Ecuador. Hydro-meteorological records from the monitoring stations located in the watersheds of the Esmeraldas, Cañar and Jubones Rivers were used, covering 22 years between 1990 and 2012. The imputed variables were precipitation and flow. Machine-learning models from the Python scikit-learn module were used; this module integrates a wide range of learning algorithms, such as Linear Regression and Random Forest. The results show a minimum mean square error for Random Forest, making it the machine-learning imputation method that best fits the systems and data analyzed.
- Published
- 2021
- Full Text
- View/download PDF
216. A novel data-characteristic-driven modeling approach for imputing missing value in industrial statistics: A case study of China electricity statistics.
- Author
-
Chen, Fan, Yu, Lan, Mao, Jinqi, Yang, Qing, Wang, Delu, and Yu, Chenghao
- Subjects
- *
INDUSTRIAL statistics , *PANEL analysis , *NUMERIC databases , *MISSING data (Statistics) , *TEST validity , *DATA modeling - Abstract
As a direct reference tool reflecting the operational status and development level of national industry, industrial statistics hold significant value for numerous systematic studies. Nevertheless, the quality of these statistics can be compromised by the common occurrence of missing values. This issue poses a substantial challenge for analyzing and utilizing industrial statistics, impeding progress in tasks that rely on them. Given the severity of the missing value problem in industrial statistical databases and the limitations of the existing literature on missing value imputation in terms of research objects and modeling approaches, this paper proposes a novel missing value imputation modeling approach for single-indicator panels of industrial statistics based on the idea of data-characteristic-driven (DCD) modeling. Taking the inter-provincial "monthly power generation" data from China as an example, the imputation model was constructed and its validity tested under different imputed objects (Jiangsu and Jilin), different missing types (continuous and discrete), and different missing rates (5%, 10% and 20%). The results indicate that the proposed DCD modeling approach exhibits excellent efficacy. The imputation model, constructed based on the data characteristics of the imputed object, demonstrates clear advantages in handling missing values with different missing types and rates. This is evident in its superior numerical accuracy, directional accuracy, and imputation stability, resulting in an outstanding comprehensive imputation effect. • A novel DCD modeling approach for missing value imputation of industrial statistics is proposed. • A data characteristic recognition path for single-indicator panel data is constructed. • The match between data characteristics and model mechanisms improves imputation performance. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
217. Toward an Axiomatization of Strongly Possible Functional Dependencies
- Author
-
Munqath Alattar and Attila Sali
- Subjects
strongly possible functional dependencies ,null values ,data imputation ,list coloring ,Information technology ,T58.5-58.64 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
In general, there are two main approaches to handling the missing data values problem in SQL tables. One is to ignore or remove any record with some missing data values. The other is to fill or impute the missing data with new values [A. Farhangfar, L. A. Kurgan and W. Pedrycz, A novel framework for imputation of missing values in databases, IEEE Trans. Syst. Man Cybern. A, Syst. Hum. 37(5) (2007) 692–709]. In this paper, the second approach is considered. Possible worlds, possible and certain keys, and weak and strong functional dependencies were introduced in Refs. 4 and 2 [H. Köhler, U. Leck, S. Link and X. Zhou, Possible and certain keys for SQL, VLDB J. 25(4) (2016) 571–596; M. Levene and G. Loizou, Axiomatisation of functional dependencies in incomplete relations, Theor. Comput. Sci. 206(1) (1998) 283–300]. We introduced the intermediate concept of strongly possible worlds in a preceding paper; these are obtained by filling missing data values with values already existing in the table. Using strongly possible worlds, strongly possible keys and strongly possible functional dependencies (spFDs) were introduced in Refs. 5 and 1 [M. Alattar and A. Sali, Keys in relational databases with nulls and bounded domains, in ADBIS 2019: Advances in Databases and Information Systems, Lecture Notes in Computer Science, Vol. 11695 (Springer, Cham, 2019), pp. 33–50; Functional dependencies in incomplete databases with limited domains, in FoiKS 2020: Foundations of Information and Knowledge Systems, Lecture Notes in Computer Science, Vol. 12012 (Springer, Cham, 2020), pp. 1–21]. In this paper, some axioms and rules for strongly possible functional dependencies are provided. These axioms and rules form the basis for a possible axiomatization of spFDs. 
For that, we analyze which weak/strong functional dependency and certain functional dependency axioms remain sound for strongly possible functional dependencies, and for the axioms that are not sound, we give appropriate modifications for soundness.
- Published
- 2021
- Full Text
- View/download PDF
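The notion of a strongly possible world, filling each null with a value already present in its column, lends itself to a brute-force check of an spFD on a small table. A sketch with hypothetical helper names (exponential in the number of nulls, so illustrative only; the paper's concern is axiomatization, not enumeration):

```python
from itertools import product

def sp_completions(table):
    """All strongly possible worlds: each None is replaced by a value already
    occurring in the same column (the column's active domain)."""
    domains = [sorted({row[c] for row in table if row[c] is not None})
               for c in range(len(table[0]))]
    slots = [(r, c) for r, row in enumerate(table)
             for c, v in enumerate(row) if v is None]
    for choice in product(*(domains[c] for _, c in slots)):
        world = [list(row) for row in table]
        for (r, c), v in zip(slots, choice):
            world[r][c] = v
        yield world

def satisfies_fd(world, lhs, rhs):
    """Classical FD check: equal lhs projections force equal rhs projections."""
    seen = {}
    for row in world:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def holds_spfd(table, lhs, rhs):
    """spFD X -> Y holds iff some strongly possible world satisfies X -> Y."""
    return any(satisfies_fd(w, lhs, rhs) for w in sp_completions(table))
```

Restricting fills to the active domain is what distinguishes strongly possible worlds from the unbounded possible worlds of weak functional dependencies.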
218. Geo-guided deep learning for spatial downscaling of solute transport in heterogeneous porous media.
- Author
-
Pawar, Nikhil M., Soltanmohammadi, Ramin, Faroughi, Shirko, and Faroughi, Salah A.
- Subjects
- *
POROUS materials , *GENERATIVE adversarial networks , *FINITE element method , *PERCEPTUAL learning , *SIGNAL-to-noise ratio - Abstract
Resolving solute transport in heterogeneous porous media is a complex task, because of the sparse experimental data and the high computational cost of numerical simulations. This work proposes a unique two-stage deep learning architecture comprising a dual-branch autoencoder and a geo-guided super-resolution generative adversarial network (Gg-SRGAN) to address this dual challenge. The dual-branch autoencoder addresses the issue of sparsity by constructing a continuous, but coarse representation of concentration and pressure profiles from a sparse, discontinuous profile with up to 85% missing data points. The Gg-SRGAN is then employed to generate a finer representation of field variables from the outputs generated by the dual-branch autoencoder (i.e., downscaling). We train and test our framework using six solute transport cases with varying levels of heterogeneity and compare the results with standalone methods, namely the vanilla autoencoder and vanilla SRGAN, in addition to ground truth profiles generated by the finite element method (FEM). The comparisons are performed based on several statistical metrics, such as absolute point error (APE), mean squared error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and learned perceptual image patch similarity (LPIPS). The first four cases are used for training, evaluation, and testing. The last two cases are utilized for blind testing to determine the generalizability of the framework. Our results show that the dual-branch autoencoder outperforms the vanilla autoencoder, and the Gg-SRGAN outperforms the SRGAN during both the training and evaluation phases. Moreover, the proposed framework can successfully construct the fine representation of concentration profiles, compared to FEM, using the coarse representation of the pressure, concentration, and domain permeability fields. 
When tested using the two blind test cases, the proposed dual-branch autoencoder and Gg-SRGAN exhibit superior performance compared to their counterparts in terms of all evaluation metrics. • A two-stage deep learning framework is developed to resolve solute transport in porous media. • The framework addresses the dual challenge of data sparsity and spatial downscaling. • The framework incorporates geological information to comprehend the underlying physics. • The framework outperforms conventional algorithms in terms of accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
219. Machine learning motivated data imputation of storm data used in coastal hazard assessments.
- Author
-
Liu, Ziyue, Carr, Meredith L., Nadal-Caraballo, Norberto C., Yawn, Madison C., Taflanidis, Alexandros A., and Bensi, Michelle T.
- Subjects
- *
ARTIFICIAL neural networks , *STORMS , *MARGINAL distributions , *RISK assessment , *KRIGING , *MACHINE learning - Abstract
In the Coastal Hazards System's (CHS) Probabilistic Coastal Hazard Analysis (PCHA) framework developed by the United States Army Corps of Engineers (USACE), historical records of tropical cyclone parameters have been used as data sources for statistical analysis, including fitting marginal distributions and measuring correlations between storm parameters. One limitation of the available historical databases is that observations of central pressure and radius of maximum winds are not available for a large number of storms. This may adversely affect the results of statistical analyses used to develop hazard curves. This study uses machine learning techniques to develop a data imputation method to "fill in" missing storm parameter records in historical datasets used for Joint Probability Method (JPM)-based coastal hazard analysis such as the USACE's CHS-PCHA. Specifically, Gaussian process regression (GPR) and artificial neural network (ANN) models are investigated as candidate machine learning-derived data imputation models, and the performance of different model parameterizations is assessed. Candidate imputation models are compared against existing statistical relationships. The effect of the data imputation process on statistical analyses (marginal distributions and correlation measures) is also evaluated for a series of example coastal reference locations. • Machine learning models for coastal hazard storm data imputation. • Comparative assessment of imputation model performance. • Assessing data imputation effect on coastal hazard statistical analysis. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
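The Gaussian process regression (GPR) imputation investigated in entry 219 can be sketched with a minimal, numpy-only GP regressor: a storm parameter that is missing from the record is predicted from the parameters that were observed. The RBF kernel choice, all variable names, and the synthetic storm records below are illustrative assumptions, not the USACE CHS-PCHA models.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=25.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def gpr_impute(X_obs, y_obs, X_miss, length_scale=25.0, noise=1e-2):
    """Posterior-mean GP prediction used to fill a missing storm parameter."""
    mu = y_obs.mean()                                   # centre the target
    K = rbf_kernel(X_obs, X_obs, length_scale) + noise * np.eye(len(X_obs))
    K_star = rbf_kernel(X_miss, X_obs, length_scale)
    return mu + K_star @ np.linalg.solve(K, y_obs - mu)

# Hypothetical records: predictors = (central pressure deficit, latitude),
# target = radius of maximum winds; values and units are illustrative only.
rng = np.random.default_rng(0)
X = rng.uniform([20.0, 10.0], [100.0, 40.0], size=(50, 2))
y = 60 - 0.4 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, 50)
filled = gpr_impute(X[:40], y[:40], X[40:])   # last 10 storms lack the target
print(np.abs(filled - y[40:]).mean())         # hold-out imputation error
```

The paper additionally compares ANN imputers and evaluates the downstream effect on marginal distributions; this sketch covers only the GPR fill-in step.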
220. Enhancing environmental data imputation: A physically-constrained machine learning framework.
- Author
-
Pastorini, Marcos, Rodríguez, Rafael, Etcheverry, Lorena, Castro, Alberto, and Gorgoglione, Angela
- Published
- 2024
- Full Text
- View/download PDF
221. A machine learning framework for intelligent development of Ultra-High performance concrete (UHPC): From dataset cleaning to performance predicting.
- Author
-
Xu, Liuliu, Fan, Dingqiang, Liu, Kangning, Xu, Wangyang, and Yu, Rui
- Subjects
- *
MACHINE learning , *CONCRETE , *MISSING data (Statistics) , *PREDICTION models , *FORECASTING - Abstract
This study proposes a new machine learning (ML) framework, which mainly includes dataset cleaning as well as performance prediction, for property prediction of ultra-high performance concrete (UHPC). Firstly, the missing data in the original dataset are interpolated and discussed through visualization results. Then, the outliers in the completed dataset are identified to improve the quality of the dataset. Meanwhile, by analyzing the influence of key parameters, the study not only clarifies the influence of dataset quality on model prediction results, but also demonstrates the necessity of anomaly detection, with R² increasing by 15% and RMSE decreasing by 37%. Finally, the chosen model is trained and further refined by hyperparameter optimization, in which the loss function is reduced by 68.82% for training data (R² > 0.95) and 84.36% for testing data (R² > 0.94). Overall, this framework can effectively improve the accuracy and generalization of UHPC predictive models and is also suitable for other types of concrete materials. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
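The dataset-cleaning stage described in entry 221 (fill missing values, then screen outliers before fitting) can be illustrated with a dependency-free sketch. Column-median imputation and a z-score screen stand in for the paper's actual interpolation and anomaly-detection choices, and all data below are synthetic.

```python
import numpy as np

def impute_median(X):
    """Column-median imputation for NaN entries."""
    X = X.copy()
    med = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = np.take(med, cols)
    return X

def drop_outliers(X, y, z=3.0):
    """Drop rows whose target deviates more than z std devs (simple anomaly screen)."""
    keep = np.abs(y - y.mean()) < z * y.std()
    return X[keep], y[keep]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.1, 200)
X[rng.random(X.shape) < 0.1] = np.nan     # 10% missing entries
y[:5] += 50                               # inject a few gross outliers
Xc = impute_median(X)
Xc2, y2 = drop_outliers(Xc, y)
w, *_ = np.linalg.lstsq(Xc2, y2, rcond=None)  # fit on the cleaned data
print(w)
```

As in the paper, removing the anomalies noticeably improves the recovered coefficients; median imputation still attenuates them slightly, which is why the paper discusses imputation quality separately.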
222. Attribute imputation autoencoders for attribute-missing graphs.
- Author
-
Xia, Riting, Zhang, Chunxu, Li, Anchen, Liu, Xueyan, and Yang, Bo
- Abstract
Analyzing attribute-missing graphs with a complete topology, but missing the attributes of some nodes, is an emerging and challenging research topic. Data imputation techniques based on graph autoencoders are commonly used for attribute-missing graphs. However, this method cannot effectively integrate existing attributes and structural information during the encoding stage and is prone to introducing noise, resulting in inaccurate imputation. In addition, the expressiveness of decoders in existing methods is limited because their network architecture has not been adequately designed, which restricts the accuracy and robustness of the generated attributes. To address these issues, we propose a novel Attribute Imputation AutoEncoder for attribute-missing graphs, named AIAE. In particular, during the encoding stage, a dual encoder based on knowledge distillation is designed to encode both attribute and structural information into representations of attribute-missing nodes to achieve more accurate imputation. To avoid introducing noise, we fully exploit the observed information by reorganizing the representations of the attribute-missing and attribute-observed nodes. In the decoding stage, we propose a multi-scale decoder with masking to make the decoder more expressive and enhance its robustness and generative ability. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art methods in attribute-missing graphs. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
223. Ensemble Learning Traffic Model for Sofia: A Case Study
- Author
-
Danail Brezov and Angel Burov
- Subjects
urban traffic models ,machine learning ,multiple regression ,data imputation ,AutoML ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
Traffic models have gained much popularity in recent years, in the context of smart cities and urban planning, as well as environmental and health research. With the development of Machine Learning (ML) and Artificial Intelligence (AI), some limitations imposed by the traditional analytical, numerical and statistical methods have been overcome. The present paper shows a case study of traffic modeling with scarce reliable data. The approach we propose relies on the advantages of ensemble learning, using a large number of related features such as road and street categories, population density, functional analysis, space syntax, previous traffic measurements and models, etc. We use advanced regression models such as Random Forest, XGBoost and CatBoost, ranked according to the chosen evaluation metrics and stacked in a weighted ensemble for optimal fitting. After a series of consecutive data imputations, we estimate the annual average daily traffic distribution in the street and road network of Sofia city and the metropolitan municipality for 2018 and 2022, and the NO2 levels for 2021, with accuracies of 78%, 74% and 92%, respectively, using AutoGluon and Scikit-Learn.
- Published
- 2023
- Full Text
- View/download PDF
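The weighted stacking of ranked regressors in entry 223 can be sketched in plain numpy: two simple base models (ridge and k-NN here, standing in for the paper's Random Forest/XGBoost/CatBoost) are combined with weights proportional to their inverse validation MSE. The data, base models, and weighting rule are illustrative assumptions.

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def knn_predict(Xtr, ytr, Xte, k=5):
    """Simple k-nearest-neighbour regressor."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    nn = np.argsort(d2, axis=1)[:, :k]
    return ytr[nn].mean(axis=1)

def stack(preds, val_errors):
    """Weighted ensemble: weights proportional to inverse validation MSE."""
    w = 1.0 / np.asarray(val_errors)
    w /= w.sum()
    return sum(wi * p for wi, p in zip(w, preds))

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(0, 0.2, 300)
Xtr, ytr = X[:200], y[:200]
Xval, yval = X[200:250], y[200:250]
Xte, yte = X[250:], y[250:]

w_ridge = fit_ridge(Xtr, ytr)
p1_val, p1_te = Xval @ w_ridge, Xte @ w_ridge
p2_val, p2_te = knn_predict(Xtr, ytr, Xval), knn_predict(Xtr, ytr, Xte)
e1 = ((p1_val - yval) ** 2).mean()
e2 = ((p2_val - yval) ** 2).mean()
p_ens = stack([p1_te, p2_te], [e1, e2])
print(((p_ens - yte) ** 2).mean())   # ensemble test MSE
```

Because the weights are learned on a held-out validation split, the ensemble leans toward whichever base model generalizes better, which is the basic idea behind the paper's AutoML-style ranking and stacking.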
224. Effective Handling of Missing Values in Datasets for Classification Using Machine Learning Methods
- Author
-
Ashokkumar Palanivinayagam and Robertas Damaševičius
- Subjects
diabetes classification ,missing values ,data imputation ,false rate reduction ,two-level classification ,Information technology ,T58.5-58.64 - Abstract
The existence of missing values reduces the amount of knowledge learned by the machine learning models in the training stage thus affecting the classification accuracy negatively. To address this challenge, we introduce the use of Support Vector Machine (SVM) regression for imputing the missing values. Additionally, we propose a two-level classification process to reduce the number of false classifications. Our evaluation of the proposed method was conducted using the PIMA Indian dataset for diabetes classification. We compared the performance of five different machine learning models: Naive Bayes (NB), Support Vector Machine (SVM), k-Nearest Neighbours (KNN), Random Forest (RF), and Linear Regression (LR). The results of our experiments show that the SVM classifier achieved the highest accuracy of 94.89%. The RF classifier had the highest precision (98.80%) and the SVM classifier had the highest recall (85.48%). The NB model had the highest F1-Score (95.59%). Our proposed method provides a promising solution for detecting diabetes at an early stage by addressing the issue of missing values in the dataset. Our results show that the use of SVM regression and a two-level classification process can notably improve the performance of machine learning models for diabetes classification. This work provides a valuable contribution to the field of diabetes research and highlights the importance of addressing missing values in machine learning applications.
- Published
- 2023
- Full Text
- View/download PDF
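The imputation step in entry 224 regresses the incomplete feature on the complete ones. The sketch below follows that recipe but substitutes ordinary least squares for the paper's SVR so it needs only numpy; the dataset and column layout are invented for the demo.

```python
import numpy as np

def regression_impute(X, target_col):
    """Fill NaNs in one column by regressing it on the remaining columns.
    The paper uses SVM regression; ordinary least squares stands in here
    to keep the sketch dependency-free."""
    X = X.copy()
    others = [c for c in range(X.shape[1]) if c != target_col]
    miss = np.isnan(X[:, target_col])
    # train only on rows where both the target and the predictors are complete
    ok = ~miss & ~np.isnan(X[:, others]).any(axis=1)
    A = np.c_[X[ok][:, others], np.ones(ok.sum())]          # add intercept
    w, *_ = np.linalg.lstsq(A, X[ok, target_col], rcond=None)
    fill = miss & ~np.isnan(X[:, others]).any(axis=1)
    X[fill, target_col] = np.c_[X[fill][:, others], np.ones(fill.sum())] @ w
    return X

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X[:, 0] = 1.5 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.05, 100)
truth = X[:, 0].copy()
X[rng.choice(100, 20, replace=False), 0] = np.nan   # 20 missing target values
X_filled = regression_impute(X, target_col=0)
print(np.abs(X_filled[:, 0] - truth).max())
```

The paper's two-level classification then runs on the completed feature matrix; the regression fill-in above is the part that removes the missing values.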
225. Fast and robust imputation for miRNA expression data using constrained least squares.
- Author
-
Webber, James W. and Elias, Kevin M.
- Subjects
- *
LEAST squares , *MISSING data (Statistics) , *MICRORNA , *INVERSE problems , *STATISTICAL sampling , *TRANSCRIPTOMES - Abstract
Background: High dimensional transcriptome profiling, whether through next generation sequencing techniques or high-throughput arrays, may result in scattered variables with missing data. Data imputation is a common strategy to maximize the inclusion of samples by using statistical techniques to fill in missing values. However, many data imputation methods are cumbersome and risk introducing systematic bias. Results: We present a new data imputation method using constrained least squares and algorithms from the inverse problems literature, and present applications of this technique in miRNA expression analysis. The proposed technique offers imputation that is orders of magnitude faster than, and at least as accurate as, similar methods from the literature. Conclusions: This study offers a robust and efficient algorithm for data imputation, which can be used, e.g., to improve cancer prediction accuracy in the presence of missing data. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
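A minimal regularised least-squares imputation in the spirit of entry 225: an incomplete expression profile is written as a Tikhonov-regularised linear combination of complete reference profiles on the observed coordinates, and the fitted combination then predicts the missing ones. The rank-3 synthetic "expression" data and the value of λ are assumptions for the demo, not the paper's exact constraint formulation.

```python
import numpy as np

def cls_impute(sample, reference, lam=1e-1):
    """Impute NaNs in `sample`: find coefficients a minimising
    ||R_obs a - s_obs||^2 + lam ||a||^2 over observed coordinates,
    then read the missing coordinates off R_miss a."""
    miss = np.isnan(sample)
    R_obs = reference[:, ~miss].T          # (n_observed, n_reference_profiles)
    a = np.linalg.solve(R_obs.T @ R_obs + lam * np.eye(reference.shape[0]),
                        R_obs.T @ sample[~miss])
    out = sample.copy()
    out[miss] = reference[:, miss].T @ a
    return out

rng = np.random.default_rng(4)
basis = rng.normal(size=(3, 30))                 # 3 latent expression programs
reference = rng.normal(size=(40, 3)) @ basis     # 40 complete reference profiles
s_true = rng.normal(size=3) @ basis              # a new, rank-consistent profile
s = s_true.copy()
s[rng.choice(30, 6, replace=False)] = np.nan     # 6 dropped measurements
print(np.abs(cls_impute(s, reference) - s_true).max())
```

Everything here is a single linear solve per sample, which is why this family of methods is fast compared with iterative imputers.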
226. Imputing DNA Methylation by Transferred Learning Based Neural Network.
- Author
-
Wang, Xin-Feng, Zhou, Xiang, Rao, Jia-Hua, Zhang, Zhu-Jin, and Yang, Yue-Dong
- Subjects
DNA methylation ,MISSING data (Statistics) ,DNA analysis ,NUCLEOTIDE sequencing ,CANCER prognosis - Abstract
DNA methylation is an important epigenetic mechanism that plays a vital role in many diseases, including cancers. With the development of high-throughput sequencing technology, much progress has been made in disclosing the relations of DNA methylation to diseases. However, the analysis of DNA methylation data is challenging due to missing values caused by the limitations of current techniques. While many methods have been developed to impute the missing values, these methods are mostly based on the correlations between individual samples and are thus limited for the abnormal samples found in cancers. In this study, we present a novel transfer learning based neural network to impute missing DNA methylation data, namely the TDimpute-DNAmeth method. The method learns common relations in DNA methylation from pan-cancer samples, and then fine-tunes the learned relations for each specific cancer type to impute the missing data. Tested on 16 cancer datasets, our method was shown to outperform other commonly-used methods. Further analyses indicated that DNA methylation is related to cancer survival and thus can be used as a biomarker of cancer prognosis. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
227. Data imputation via conditional generative adversarial network with fuzzy c mean membership based loss term.
- Author
-
Wu, Zisheng and Ling, Bingo Wing-Kuen
- Subjects
GENERATIVE adversarial networks ,MISSING data (Statistics) ,FUZZY algorithms - Abstract
Missing values often arise when data are acquired from sensors or other equipment, making analysis based on the data difficult. There are two major types of existing methods for performing data imputation: discriminative methods and generative methods. However, these methods cannot handle data with a high missing rate without unacceptable error. This paper proposes an effective method for performing data imputation. In particular, a conditional generative adversarial network (CGAN) is used to predict the missing data. Here, an enhanced fuzzy c-means algorithm performs the clustering so that information from local samples is exploited in the algorithm. Computer numerical simulations are performed on several real world datasets. Since the CGAN exploits the class of the missing values of the data, our proposed method is shown to achieve a higher imputation accuracy compared to state-of-the-art methods. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
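Entry 227's loss term is built from fuzzy c-means memberships, which are easy to compute directly. Below is the standard FCM membership/center alternation in numpy; the paper's enhanced variant adds locality information, which is omitted here, and the two synthetic blobs are purely illustrative.

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """Fuzzy c-means memberships: u[i,k] ∝ d(x_i, c_k)^(-2/(m-1)), rows sum to 1."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    d = np.maximum(d, 1e-12)            # guard points sitting exactly on a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm(X, init_centers, m=2.0, iters=50):
    """Alternate the standard FCM center and membership updates."""
    centers = np.asarray(init_centers, dtype=float)
    for _ in range(iters):
        Um = fcm_memberships(X, centers, m) ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]   # weighted means
    return centers, fcm_memberships(X, centers, m)

# two well-separated blobs: memberships should end up near one-hot
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, U = fcm(X, init_centers=[X[0], X[-1]])   # one seed per blob for the demo
print(U.max(axis=1).min())   # worst-case dominant membership
```

In the paper, memberships like `U` weight the CGAN's reconstruction loss so that imputed values are pulled toward their local cluster.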
228. Missing information in imbalanced data stream: fuzzy adaptive imputation approach.
- Author
-
Halder, Bohnishikha, Ahmed, Md Manjur, Amagasa, Toshiyuki, Isa, Nor Ashidi Mat, Faisal, Rahat Hossain, and Rahman, Md. Mostafijur
- Subjects
MISSING data (Statistics) ,DECOMPOSITION method ,PROBLEM solving - Abstract
From a real-world perspective, missing information is an ordinary scenario in data streams. Generally, missing data generate diverse problems in recognizing patterns in data (i.e., clustering and classification). In particular, missing data in a data stream is a challenging topic, and with imbalanced data the problem greatly affects pattern recognition. As a solution to these issues, this study puts forward an adaptive technique with a fuzzy-based information decomposition method, which simultaneously solves the problem of incomplete data and overcomes imbalance in the data stream. The main purpose of the proposed fuzzy adaptive imputation approach (FAIA) is to represent the effect of missing values in an imbalanced data stream and to handle the missing data problem there. FAIA is a single-pass method. It adaptively selects intervals based on all observed instances, using the interrelationship of attributes to identify the correct interval for computing missing instances. Here, the interrelationship of two attributes means that one attribute's value in an instance depends on another attribute's value in the same instance. In FAIA, after measuring all interval distances from a certain missing value, the least distance is selected for this missing value. Synthetic data for the minority class are generated using the same missing value imputation process to balance the data, which is called oversampling. Instances of the datasets are divided into chunks in the data stream to balance data without any ensemble of previous chunks, because missing values may misguide future chunks. To demonstrate the performance of FAIA, the experiment is divided into three parts: missing data imputation, imbalanced information for offline data in a data stream, and imbalanced information with missing values for offline data. Eleven numerical datasets with different dimensions from various repositories are used to evaluate missing data imputation and imbalanced data handling without streaming. Four further datasets are used to measure performance on imbalanced data streams. In most cases, the proposed method outperforms the alternatives. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
229. Hybrid diabetes disease prediction framework based on data imputation and outlier detection techniques.
- Author
-
Srivastava, Anand Kumar, Kumar, Yugal, and Singh, Pradeep Kumar
- Subjects
- *
OUTLIER detection , *MISSING data (Statistics) , *DATABASES , *DIABETES , *FORECASTING , *ART techniques - Abstract
In the field of medical science, accurate prediction is a difficult and challenging task, and the presence of missing values and outliers makes it more complicated still. Many researchers address the issue of missing values in medical data by either detecting missing values and deleting the corresponding data instances, or adopting default fills such as the mean, median, or nearest neighbour. However, neither approach produces optimal results. Furthermore, outliers are also present in the data and degrade classifier performance. A few researchers have also focused on outlier detection in medical datasets, but it has not been fully explored to date. This work considers two well-known data problems: (i) missing value imputation, and (ii) outliers. The missing value imputation issue is addressed through a K-Means++ based data imputation technique, which also validates the data through clustering and computes values for the missing data. Outliers are detected through an ABC-based outlier detection technique. Further, the final outcome is determined using LS-SVM classifiers. Hence, this work presents a hybrid disease diagnosis framework for diabetes prediction, called the hybrid diabetes prediction framework. The diabetes dataset was chosen for implementation because it contains 763 missing values and several outliers. The simulation results showed that the proposed hybrid framework effectively determines the missing values and outliers in the diabetes dataset. Further, the performance of the proposed hybrid diabetes prediction framework is evaluated using accuracy, sensitivity, specificity, kappa and AUC parameters and compared with 34 state-of-the-art techniques. Results confirmed that the proposed hybrid framework obtains accuracy, sensitivity, specificity, kappa and AUC rates of 96.57%, 93.37%, 98.12%, 98.17%, and 95.43%, respectively. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
230. scLRTD : A Novel Low Rank Tensor Decomposition Method for Imputing Missing Values in Single-Cell Multi-Omics Sequencing Data.
- Author
-
Ni, Zhijie, Zheng, Xiaoying, Zheng, Xiao, and Zou, Xiufen
- Abstract
With the successful application of single-cell sequencing technology, a large number of single-cell multi-omics sequencing (scMO-seq) data have been generated, enabling researchers to study heterogeneity between individual cells. One prominent problem in single-cell data analysis is the prevalence of dropouts, caused by failures in amplification during the experiments, so effective approaches for imputing the missing values are needed. Unlike general methods that impute a single type of single-cell data, we propose an imputation method called scLRTD, which uses low-rank tensor decomposition based on the nuclear norm to impute scMO-seq data and single-cell RNA-sequencing (scRNA-seq) data across different stages, tissues or conditions. Furthermore, four sets of simulated and two sets of real scRNA-seq data, from mouse embryonic stem cells and hepatocellular carcinoma respectively, are used to carry out numerical experiments and comparisons with six other published methods. Error accuracy and clustering results demonstrate the effectiveness of the proposed method. Moreover, we clearly identify two cell subpopulations after imputing the real scMO-seq data from hepatocellular carcinoma. Further, Gene Ontology analysis identifies 7 genes in the bile secretion pathway, which is related to metabolism in hepatocellular carcinoma. Survival analysis using the TCGA database also shows that the two cell subpopulations identified after imputation have distinct survival rates. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
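scLRTD in entry 230 works on tensors, but its matrix analogue, nuclear-norm completion via iterative SVD soft-thresholding ("SoftImpute"), conveys the same low-rank idea and fits in a few lines. The rank-4 synthetic matrix, 30% dropout rate, and λ below are assumptions for the demo.

```python
import numpy as np

def soft_impute(X, lam=0.5, iters=300):
    """Nuclear-norm matrix completion: alternately fill missing entries with the
    current low-rank estimate and soft-threshold the singular values."""
    miss = np.isnan(X)
    Z = np.where(miss, 0.0, X)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        L = (U * np.maximum(s - lam, 0.0)) @ Vt   # singular-value shrinkage
        Z = np.where(miss, L, X)                  # keep observed entries fixed
    return Z

rng = np.random.default_rng(6)
truth = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 30))  # rank-4 "expression" matrix
X = truth.copy()
X[rng.random(X.shape) < 0.3] = np.nan                        # 30% dropout
Z = soft_impute(X)
print(np.abs((Z - truth)[np.isnan(X)]).mean())               # error on dropped entries
```

The tensor method in the paper generalises this shrinkage to unfoldings of a third-order tensor, which is what lets it share information across stages, tissues or conditions.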
231. Automated Cognitive Health Assessment Using Partially Complete Time Series Sensor Data.
- Author
-
Thomas, Brian L., Holder, Lawrence B., and Cook, Diane J.
- Abstract
Background: Behavior and health are inextricably linked. As a result, continuous wearable sensor data offer the potential to predict clinical measures. However, interruptions in the data collection occur, which create a need for strategic data imputation. Objective: The objective of this work is to adapt a data generation algorithm to impute multivariate time series data. This will allow us to create digital behavior markers that can predict clinical health measures. Methods: We created a bidirectional time series generative adversarial network to impute missing sensor readings. Values are imputed based on relationships between multiple fields and multiple points in time, for single time points or larger time gaps. From the complete data, digital behavior markers are extracted and are mapped to predicted clinical measures. Results: We validate our approach using continuous smartwatch data for n = 14 participants. When reconstructing omitted data, we observe an average normalized mean absolute error of 0.0197. We then create machine learning models to predict clinical measures from the reconstructed, complete data with correlations ranging from r = 0.1230 to r = 0.7623. This work indicates that wearable sensor data collected in the wild can be used to offer insights on a person's health in natural settings. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
232. Practical Strategies for Extreme Missing Data Imputation in Dementia Diagnosis.
- Author
-
McCombe, Niamh, Liu, Shuo, Ding, Xuemei, Prasad, Girijesh, Bucholc, Magda, Finn, David P., Todd, Stephen, McClean, Paula L., and Wong-Lin, KongFatt
- Subjects
MISSING data (Statistics) ,DECISION support systems ,DEMENTIA ,DIAGNOSIS ,DIAGNOSTIC imaging - Abstract
Accurate computational models for clinical decision support systems require clean and reliable data but, in clinical practice, data are often incomplete. Hence, missing data could arise not only from training datasets but also test datasets which could consist of a single undiagnosed case, an individual. This work addresses the problem of extreme missingness in both training and test data by evaluating multiple imputation and classification workflows based on both diagnostic classification accuracy and computational cost. Extreme missingness is defined as having ∼50% of the total data missing in more than half the data features. In particular, we focus on dementia diagnosis due to long time delays, high variability, high attrition rates and lack of practical data imputation strategies in its diagnostic pathway. We identified and replicated the extreme missingness structure of data from a real-world memory clinic on a larger open dataset, with the original complete data acting as ground truth. Overall, we found that computational cost, but not accuracy, varies widely for various imputation and classification approaches. Particularly, we found that iterative imputation on the training dataset combined with a reduced-feature classification model provides the best approach, in terms of speed and accuracy. Taken together, this work has elucidated important factors to be considered when developing a predictive model for a dementia diagnostic support system. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
233. Towards an Efficient Prediction Model of Malaria Cases in Senegal
- Author
-
Mbaye, Ousseynou, Ba, Mouhamadou Lamine, Camara, Gaoussou, Sy, Alassane, Mboup, Balla Mbacké, Diallo, Aldiouma, Akan, Ozgur, Editorial Board Member, Bellavista, Paolo, Editorial Board Member, Cao, Jiannong, Editorial Board Member, Coulson, Geoffrey, Editorial Board Member, Dressler, Falko, Editorial Board Member, Ferrari, Domenico, Editorial Board Member, Gerla, Mario, Editorial Board Member, Kobayashi, Hisashi, Editorial Board Member, Palazzo, Sergio, Editorial Board Member, Sahni, Sartaj, Editorial Board Member, Shen, Xuemin (Sherman), Editorial Board Member, Stan, Mircea, Editorial Board Member, Jia, Xiaohua, Editorial Board Member, Zomaya, Albert Y., Editorial Board Member, Bassioni, Ghada, editor, Kebe, Cheikh M.F., editor, Gueye, Assane, editor, and Ndiaye, Ababacar, editor
- Published
- 2019
- Full Text
- View/download PDF
234. Missing Data Imputation for Machine Learning
- Author
-
Wang, Shaoqian, Li, Bo, Yang, Mao, Yan, Zhongjiang, Akan, Ozgur, Series Editor, Bellavista, Paolo, Series Editor, Cao, Jiannong, Series Editor, Coulson, Geoffrey, Series Editor, Dressler, Falko, Series Editor, Ferrari, Domenico, Series Editor, Gerla, Mario, Series Editor, Kobayashi, Hisashi, Series Editor, Palazzo, Sergio, Series Editor, Sahni, Sartaj, Series Editor, Shen, Xuemin (Sherman), Series Editor, Stan, Mircea, Series Editor, Xiaohua, Jia, Series Editor, Zomaya, Albert Y., Series Editor, Li, Bo, editor, Yang, Mao, editor, Yuan, Hui, editor, and Yan, Zhongjiang, editor
- Published
- 2019
- Full Text
- View/download PDF
235. Handling Missing Values for the CN2 Algorithm
- Author
-
Nguyen, Cuong Duc, Tran, Phuong-Tuan, Thai, Thi-Thanh-Thao, Akan, Ozgur, Series Editor, Bellavista, Paolo, Series Editor, Cao, Jiannong, Series Editor, Coulson, Geoffrey, Series Editor, Dressler, Falko, Series Editor, Ferrari, Domenico, Series Editor, Gerla, Mario, Series Editor, Kobayashi, Hisashi, Series Editor, Palazzo, Sergio, Series Editor, Sahni, Sartaj, Series Editor, Shen, Xuemin (Sherman), Series Editor, Stan, Mircea, Series Editor, Xiaohua, Jia, Series Editor, Zomaya, Albert Y., Series Editor, Cong Vinh, Phan, editor, and Alagar, Vangalur, editor
- Published
- 2019
- Full Text
- View/download PDF
236. EDA and a Tailored Data Imputation Algorithm for Daily Ozone Concentrations
- Author
-
Gualán, Ronald, Saquicela, Víctor, Tran-Thanh, Long, Kacprzyk, Janusz, Series Editor, Pal, Nikhil R., Advisory Editor, Bello Perez, Rafael, Advisory Editor, Corchado, Emilio S., Advisory Editor, Hagras, Hani, Advisory Editor, Kóczy, László T., Advisory Editor, Kreinovich, Vladik, Advisory Editor, Lin, Chin-Teng, Advisory Editor, Lu, Jie, Advisory Editor, Melin, Patricia, Advisory Editor, Nedjah, Nadia, Advisory Editor, Nguyen, Ngoc Thanh, Advisory Editor, Wang, Jun, Advisory Editor, Botto-Tobar, Miguel, editor, Barba-Maggi, Lida, editor, González-Huerta, Javier, editor, Villacrés-Cevallos, Patricio, editor, S. Gómez, Omar, editor, and Uvidia-Fassler, María I., editor
- Published
- 2019
- Full Text
- View/download PDF
237. Missing Slice Imputation in Population CMR Imaging via Conditional Generative Adversarial Nets
- Author
-
Zhang, Le, Pereañez, Marco, Bowles, Christopher, Piechnik, Stefan, Neubauer, Stefan, Petersen, Steffen, Frangi, Alejandro, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Shen, Dinggang, editor, Liu, Tianming, editor, Peters, Terry M., editor, Staib, Lawrence H., editor, Essert, Caroline, editor, Zhou, Sean, editor, Yap, Pew-Thian, editor, and Khan, Ali, editor
- Published
- 2019
- Full Text
- View/download PDF
238. MNAR Imputation with Distributed Healthcare Data
- Author
-
Pereira, Ricardo Cardoso, Santos, Miriam Seoane, Rodrigues, Pedro Pereira, Abreu, Pedro Henriques, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Moura Oliveira, Paulo, editor, Novais, Paulo, editor, and Reis, Luís Paulo, editor
- Published
- 2019
- Full Text
- View/download PDF
239. Query-Oriented Answer Imputation for Aggregate Queries
- Author
-
Hannou, Fatma-Zohra, Amann, Bernd, Baazizi, Mohamed-Amine, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Welzer, Tatjana, editor, Eder, Johann, editor, Podgorelec, Vili, editor, and Kamišalić Latifić, Aida, editor
- Published
- 2019
- Full Text
- View/download PDF
240. Keys in Relational Databases with Nulls and Bounded Domains
- Author
-
Alattar, Munqath, Sali, Attila, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Welzer, Tatjana, editor, Eder, Johann, editor, Podgorelec, Vili, editor, and Kamišalić Latifić, Aida, editor
- Published
- 2019
- Full Text
- View/download PDF
241. Missing Data Imputation for Operation Data of Transformer Based on Functional Principal Component Analysis and Wavelet Transform
- Author
-
Qin, Jiafeng, Yang, Yi, Hong, Zijing, Du, Hongyi, Goos, Gerhard, Founding Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Woeginger, Gerhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Huang, De-Shuang, editor, Huang, Zhi-Kai, editor, and Hussain, Abir, editor
- Published
- 2019
- Full Text
- View/download PDF
242. Smartic: A smart tool for Big Data analytics and IoT [version 1; peer review: 2 approved]
- Author
-
Shohel Sayeed, Abu Fuad Ahmad, and Tan Choo Peng
- Subjects
Method Article ,Articles ,IoT ,Big Data Analytics ,Data Cleaning ,Data Imputation ,Feature Engineering - Abstract
The Internet of Things (IoT) is leading the physical and digital worlds of technology to converge. Real-time, massive-scale connections produce a large amount of versatile data, which is where Big Data comes into the picture. Big Data refers to large, diverse sets of information whose dimensions exceed the capabilities of widely used database management systems and standard data processing software tools. Almost every big dataset is dirty and may contain missing data, mistyping, inaccuracies, and many more issues that impact Big Data analytics performance. One of the biggest challenges in Big Data analytics is to discover and repair dirty data; failure to do so can lead to inaccurate analytics results and unpredictable conclusions. We experimented with different missing value imputation techniques and compared machine learning (ML) model performance across imputation methods. We propose a hybrid model for missing value imputation combining ML and sample-based statistical techniques. Furthermore, we continued with the best-imputed dataset, chosen based on ML model performance, for feature engineering and hyperparameter tuning. We used k-means clustering and principal component analysis. Model performance improved dramatically, with the XGBoost model achieving a root mean squared logarithmic error (RMSLE) of around 0.125. To reduce overfitting, we used K-fold cross-validation.
- Published
- 2022
- Full Text
- View/download PDF
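Two ingredients of the Smartic pipeline in entry 242, the RMSLE score it reports and K-fold cross-validation, are compact enough to sketch directly; the array values and split sizes below are arbitrary.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, the metric reported for XGBoost."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def kfold_indices(n, k=5, seed=0):
    """Shuffled K-fold split: yields (train_idx, val_idx) index pairs."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

# RMSLE penalises relative rather than absolute error, so it suits
# targets spanning several orders of magnitude
print(rmsle(np.array([100.0]), np.array([110.0])))   # ≈ 0.094
folds = list(kfold_indices(20, k=5))
print([len(val) for _, val in folds])                # five folds of 4 indices
```

Averaging the RMSLE across the K validation folds gives the overfitting-resistant estimate the paper relies on.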
243. Online PMU Missing Value Replacement Via Event-Participation Decomposition.
- Author
-
Foggo, Brandon and Yu, Nanpeng
- Subjects
- *
MISSING data (Statistics) , *PHASOR measurement , *TIME series analysis , *INFORMATION networks - Abstract
We introduce a new method for online Phasor Measurement Unit (PMU) missing-value replacement. Our approach decomposes PMU event responses into a non-dynamic component (denoted the participation factor), which can be inferred directly from the past, and a dynamic component (denoted the event strength), which can be inferred directly from all other PMUs. When missing values occur, we can use these two components, which do not rely on the missing index, to estimate the correct value. The method is extremely fast and can easily be used in online applications. Furthermore, extensive testing on real power system event data reveals that our approach achieves state-of-the-art performance in terms of Mean Absolute Percent Error (MAPE) for PMU data dropped during event periods. The method also yields an interpretable and simplified view of events for further analysis and applications. It relies only on PMU data and does not require outside information such as network topology. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
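The decomposition idea in record 243 can be illustrated with a toy rank-1 model: each PMU's event response is a static participation factor times a shared event-strength signal, so a dropped channel can be rebuilt from its past participation factor and the strength inferred from the other PMUs. The least-squares fits below are our own simplification, not the authors' exact estimator:

```python
# Toy event-participation decomposition: X[t, i] ≈ p[i] * s(t).
import numpy as np

rng = np.random.default_rng(1)
T, n = 200, 6
s_true = np.sin(np.linspace(0, 3 * np.pi, T))           # shared event dynamics
p_true = rng.uniform(0.5, 2.0, size=n)                  # per-PMU participation
X = np.outer(s_true, p_true) + 0.01 * rng.normal(size=(T, n))

# Step 1: learn participation factors from a past (complete) event window
s_est = X.mean(axis=1)                                  # crude strength proxy
p_est = (X * s_est[:, None]).sum(axis=0) / (s_est ** 2).sum()  # per-PMU least squares

# Step 2: PMU 0 drops out during a new event; infer strength from the others
missing = 0
others = [j for j in range(n) if j != missing]
s_online = (X[:, others] / p_est[others]).mean(axis=1)

# Step 3: replace the missing channel with participation * strength
x_filled = p_est[missing] * s_online
mae_err = np.abs(x_filled - X[:, missing]).mean()       # error vs. held-out truth
```

Note that the overall scale ambiguity between p and s cancels in the reconstruction, which is why the crude strength proxy suffices here.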
244. Matrix factorization for biomedical link prediction and scRNA-seq data imputation: an empirical survey.
- Author
-
Ou-Yang, Le, Lu, Fan, Zhang, Zi-Chao, and Wu, Min
- Subjects
- *
MATRIX decomposition , *FACTORIZATION , *FORECASTING , *RNA sequencing , *DATA analysis - Abstract
Advances in high-throughput experimental technologies have promoted the accumulation of vast amounts of biomedical data. Biomedical link prediction and single-cell RNA-sequencing (scRNA-seq) data imputation are two essential tasks in biomedical data analysis; they facilitate various downstream studies and provide insights into the mechanisms of complex diseases. Both tasks can be transformed into matrix completion problems, and for a variety of matrix completion tasks, matrix factorization has shown promising performance. However, the sparsity and high dimensionality of biomedical networks and scRNA-seq data have raised new challenges, and various matrix factorization methods have emerged recently to resolve them. In this paper, we present a comprehensive review of such matrix factorization methods and their use in biomedical link prediction and scRNA-seq data imputation. Moreover, we select representative matrix factorization methods and conduct a systematic empirical comparison on 15 real data sets to evaluate their performance under different scenarios. By summarizing the experimental results, we provide general guidelines for selecting matrix factorization methods for different biomedical matrix completion tasks and point out future directions for further improving performance in biomedical link prediction and scRNA-seq data imputation. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
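The core mechanism surveyed in record 244 (casting completion as matrix factorization) can be sketched in a few lines: fit X ≈ U Vᵀ on the observed entries by gradient descent and read the missing entries off the reconstruction. Sizes, rank, and learning rate below are illustrative choices, not from any surveyed method:

```python
# Minimal matrix-factorization completion: gradient descent on observed cells.
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 40, 30, 3
X_true = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))   # exact rank-r matrix
observed = rng.random((m, n)) < 0.6                          # 60% of entries seen

U = 0.1 * rng.normal(size=(m, r))
V = 0.1 * rng.normal(size=(n, r))
lr = 0.01
for _ in range(2000):
    R = (U @ V.T - X_true) * observed        # residual, zeroed on missing cells
    U -= lr * (R @ V)                        # gradient step on both factors
    V -= lr * (R.T @ U)

X_hat = U @ V.T                              # completed matrix
rmse_missing = np.sqrt(((X_hat - X_true)[~observed] ** 2).mean())
```

Regularizers, non-negativity constraints, and side information (the main axes along which the surveyed methods differ) would be added to the loss in the same loop.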
245. Strongly Possible Functional Dependencies for SQL.
- Author
-
Alattar, Munqath and Sali, Attila
- Subjects
SQL ,POLYNOMIAL time algorithms ,MISSING data (Statistics) ,STATISTICAL power analysis ,BIPARTITE graphs ,NP-complete problems ,RELATIONAL databases - Abstract
Missing data is a large-scale challenge to research and investigation. It reduces statistical power and produces negative consequences that may introduce selection bias in the data. Many approaches to handling this problem have been introduced; the main ones suggest that missing values either be ignored (removed) or imputed (filled in) with new values [14]. This paper uses the second method. Possible worlds and possible and certain keys were introduced in [22, 25], certain functional dependencies (c-FDs) were introduced in [23] as a natural complement to Lien's class of possible functional dependencies (p-FDs) [26], and weak and strong functional dependencies were studied in [25]. The intermediate concept of strongly possible worlds was introduced in a preceding paper [3], and the resulting strongly possible keys (spKeys) and strongly possible functional dependencies (spFDs) were studied there. A polynomial algorithm to verify a single spKey was also given, and it was shown that verifying an arbitrary collection of spKeys is NP-complete in general. Furthermore, a graph-theoretical characterization was given for validating a given spFD X →sp Y. We show that verifying a single strongly possible functional dependency is NP-complete in general, then introduce cases in which verifying a single spFD can be done in polynomial time. As a step toward the axiomatization of spFDs, the rules given for weak/strong and certain functional dependencies are checked, and appropriate weakenings of those that are not sound for spFDs are listed. The interaction between spFDs, spKeys, and certain keys is studied. Furthermore, a graph-theoretical characterization of implication between singular-attribute spFDs is given. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
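Record 245's notion of a strongly possible key can be made concrete: nulls may only be filled with values already present in the same column (the "active domain"), and the spKey holds if some such completion makes all key tuples pairwise distinct. The paper states this is polynomial for a single spKey; the matching-based sketch below is our own illustration of that reduction, not the authors' algorithm:

```python
# Verify a single spKey: match each row to a distinct completed key tuple.
from itertools import product

NULL = None

def sp_key_holds(rows, key_cols):
    # Active domain per key column: the non-null values occurring in it
    domains = {c: {r[c] for r in rows if r[c] is not NULL} for c in key_cols}
    # Candidate completed key tuples per row
    options = []
    for r in rows:
        choices = [[r[c]] if r[c] is not NULL else sorted(domains[c])
                   for c in key_cols]
        options.append(set(product(*choices)))
    # Kuhn-style augmenting-path bipartite matching: tuple -> row index
    match = {}

    def augment(i, seen):
        for t in options[i]:
            if t in seen:
                continue
            seen.add(t)
            if t not in match or augment(match[t], seen):
                match[t] = i
                return True
        return False

    return all(augment(i, set()) for i in range(len(rows)))

# The null in t1 can only become 1 (b's active domain), which still separates
# the rows; a second (2, NULL) row in t2 forces a clash, so the spKey fails.
t1 = [{"a": 1, "b": 1}, {"a": 2, "b": NULL}]
t2 = t1 + [{"a": 2, "b": NULL}]
```

Verifying a *set* of spKeys, or a single spFD, is NP-complete per the abstract, so no such polynomial sketch exists for those cases.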
246. An Intelligent Neural Network Algorithm for Uncertainty Handling in Sensor Failure Scenario of Food Quality Assurance Model.
- Author
-
DEEPA, S. N. and JAYALAKSHMI, N. Yogambal
- Subjects
FOOD quality ,SUPPORT vector machines ,INTERNET of things ,K-nearest neighbor classification ,ARDUINO (Microcontroller) ,UNCERTAINTY - Abstract
The quality of food is usually tested by sensing the product's odor with an e-nose technique. However, in a real-time testing environment, some of the employed sensors may fail to operate, which imposes great uncertainty on the food quality assurance model. To handle this uncertainty, a support vector machine (SVM) classifier algorithm is developed that deals with the effect of failed sensors using a data imputation strategy. The proposed model is evaluated experimentally on benchmark datasets and validated in a real-time environment by programming an Arduino UNO controller in an internet of things (IoT) setting. [ABSTRACT FROM AUTHOR]
- Published
- 2022
- Full Text
- View/download PDF
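The failure-handling idea in record 246 can be sketched as: train the classifier on complete sensor data, then at test time impute a failed channel from the training data before classifying. The synthetic "odor" features and mean imputation below are illustrative stand-ins for the paper's strategy:

```python
# Impute a failed sensor channel, then classify with an SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(3)
n, d = 400, 4
X = rng.normal(size=(n, d))                             # 4 e-nose channels
y = (X[:, 0] + X[:, 1] > 0).astype(int)                 # two odor classes

clf = SVC(kernel="rbf").fit(X, y)
imputer = SimpleImputer(strategy="mean").fit(X)         # learns channel means

# Simulate sensor 2 failing on a batch of fresh readings
X_new = rng.normal(size=(50, d))
X_fail = X_new.copy()
X_fail[:, 2] = np.nan
y_pred = clf.predict(imputer.transform(X_fail))
acc = (y_pred == (X_new[:, 0] + X_new[:, 1] > 0)).mean()
```

On a microcontroller deployment like the paper's Arduino setup, the same fill-then-classify step would run on each incoming reading before prediction.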
247. A New Empirical Approach for Estimating Solar Insolation Using Air Temperature in Tropical and Mountainous Environments.
- Author
-
Hoyos-Gomez, Laura Sofía and Ruiz-Mendoza, Belizza Janet
- Abstract
Solar irradiance is an available resource that could support electrification in regions with low socio-economic indices. It is therefore increasingly important to understand the behavior of solar irradiance and to have data on it. Some locations, especially those with low socio-economic populations, have no measured solar irradiance data, and where such data exist, they are incomplete. There are different approaches to estimating solar irradiance, from learning models to empirical models. The latter have the advantage of low computational cost, allowing their wide use. Researchers estimate solar energy resources using information from other meteorological variables, such as temperature. However, there is no broad analysis of these techniques in tropical and mountainous environments. To address this gap, our research analyzes the performance of three well-known empirical temperature-based models (Hargreaves and Samani, Bristow and Campbell, and Okundamiya and Nzeako) and proposes a new one for tropical and mountainous environments. The new empirical technique models daily solar irradiance in some areas better than the other three models. Statistical error comparison allows us to select the best model for each location and determines the data imputation model. The Hargreaves and Samani model gave better results in the Pacific zone, with an average RMSE of 936.195 Wh/m2/day, SD of 36.01%, MAE of 748.435 Wh/m2/day, and U95 of 1836.325 Wh/m2/day. The new proposed model showed better results in the Andean and Amazon zones, with an average RMSE of 1032.99 Wh/m2/day, SD of 34.455 Wh/m2/day, MAE of 825.46 Wh/m2/day, and U95 of 2025.84 Wh/m2/day. Another result was the linear relationship between the new empirical model's constants and the altitude of 2500 MASL (meters above sea level). [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
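The first of the benchmark models in record 247, Hargreaves and Samani, estimates daily solar radiation from the diurnal temperature range alone: Rs = krs · sqrt(Tmax − Tmin) · Ra, where Ra is extraterrestrial radiation and krs is an empirical coefficient (commonly about 0.16 for interior and 0.19 for coastal sites). The sample values below are illustrative, not taken from the paper:

```python
# Hargreaves-Samani temperature-based solar radiation estimate.
import math

def hargreaves_samani(t_max, t_min, ra, krs=0.16):
    """Daily solar radiation estimate, in the same units as ra."""
    if t_max < t_min:
        raise ValueError("t_max must be >= t_min")
    return krs * math.sqrt(t_max - t_min) * ra

# Example: Ra = 35 MJ/m^2/day, 12 degC diurnal range at an interior site
rs = hargreaves_samani(t_max=28.0, t_min=16.0, ra=35.0)
```

The paper's proposed model and the Bristow–Campbell and Okundamiya–Nzeako alternatives follow the same pattern (radiation as a calibrated function of temperature range) with different functional forms and fitted constants.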
248. Spatial Origin-Destination Flow Imputation Using Graph Convolutional Networks.
- Author
-
Yao, Xin, Gao, Yong, Zhu, Di, Manley, Ed, Wang, Jiaoe, and Liu, Yu
- Abstract
Due to the limitations of data collection techniques and privacy issues, missing spatial origin-destination flows frequently occur. Data imputation provides great support for acquiring complete flow data, which enables us to better understand regional connections and mobility patterns. However, existing models and approaches neglect the network structure of spatial flows, resulting in inappropriate estimates and low performance. The development of graph neural networks offers a powerful tool for dealing with graph-structured data. In this article, we propose a spatial interaction graph convolutional network model, which combines graph convolution with a mapping function to predict flow data from the perspective of network learning. The model uses geographical-unit embedding in local spatial networks to improve prediction accuracy, and a negative sampling technique is adopted to reduce misestimation. Experiments on Beijing taxi trip data verified the usefulness of our model for spatial flow prediction. We also demonstrated that a biased training sample has a negative impact on the model's performance. More attributes of geographical units, a more appropriate negative sampling rate, and a larger training set can increase the prediction accuracy of flow data. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
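The graph-convolution operation at the heart of models like the one in record 248 is symmetric-normalized neighborhood averaging, H' = D^(-1/2)(A + I)D^(-1/2) H W. The tiny 4-node region graph and random weights below are illustrative only, not the paper's architecture:

```python
# One graph-convolution layer over a toy 4-region adjacency graph.
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)    # region adjacency
H = rng.normal(size=(4, 3))                  # per-region features
W = rng.normal(size=(3, 2))                  # learnable weights

A_hat = A + np.eye(4)                        # add self-loops
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
H_next = np.maximum(A_norm @ H @ W, 0.0)     # ReLU activation
```

In the flow-imputation setting, stacked layers of this form produce unit embeddings, and a mapping function over pairs of embeddings predicts the missing origin-destination flow values.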
249. ScLRTC: imputation for single-cell RNA-seq data via low-rank tensor completion.
- Author
-
Pan, Xiutao, Li, Zhong, Qin, Shengwei, Yu, Minzhe, and Hu, Hang
- Subjects
- *
RNA sequencing , *PEARSON correlation (Statistics) , *GENE expression , *SOURCE code , *ALGORITHMS - Abstract
Background: With single-cell RNA sequencing (scRNA-seq) methods, gene expression patterns can be revealed at single-cell resolution. However, due to current technical limitations, dropout events in scRNA-seq lead to missing data and noise in the gene-cell expression matrix and adversely affect downstream analyses. Accordingly, the true gene expression level should be recovered before downstream analysis is carried out. Results: In this paper, a novel low-rank tensor completion-based method, termed scLRTC, is proposed to impute the dropout entries of a given scRNA-seq expression matrix. It initially exploits the similarity of single cells to build a third-order low-rank tensor and employs tensor decomposition to denoise the data. Subsequently, it reconstructs the cell expression by adopting a low-rank tensor completion algorithm, which can restore the gene-to-gene and cell-to-cell correlations. ScLRTC is compared with other state-of-the-art methods on simulated datasets and on real scRNA-seq datasets of different sizes. On simulated datasets, scLRTC outperforms other methods in imputing the dropouts closest to the original expression values, as assessed by both the sum of squared errors (SSE) and the Pearson correlation coefficient (PCC). On real datasets, scLRTC achieves the most accurate cell classification results regardless of the choice of clustering method (e.g., SC3, or t-SNE followed by K-means), as evaluated using the adjusted Rand index (ARI) and normalized mutual information (NMI). Lastly, scLRTC is demonstrated to be effective in cell visualization and in inferring cell lineage trajectories. Conclusions: The novel low-rank tensor completion-based method scLRTC gave better imputation results than state-of-the-art tools. The source code of scLRTC can be accessed at https://github.com/jianghuaijie/scLRTC. [ABSTRACT FROM AUTHOR]
- Published
- 2021
- Full Text
- View/download PDF
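scLRTC in record 249 completes a third-order tensor; the sketch below shows the matrix analogue of the same principle (iterative low-rank SVD projection of the observed expression matrix, sometimes called "hard impute"). It is our own simplification for illustration, not the authors' algorithm:

```python
# Low-rank completion of a nonnegative "expression matrix" with dropouts.
import numpy as np

rng = np.random.default_rng(5)
genes, cells, r = 50, 30, 2
# Product of nonnegative factors: a nonnegative, exactly rank-r matrix
M_true = rng.uniform(0, 1, size=(genes, r)) @ rng.uniform(0, 1, size=(r, cells))
observed = rng.random((genes, cells)) < 0.7             # 30% dropout

M = np.where(observed, M_true, 0.0)                     # zeros at dropouts
for _ in range(100):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    low_rank = (U[:, :r] * s[:r]) @ Vt[:r]              # rank-r projection
    M = np.where(observed, M_true, low_rank)            # keep observed entries

rmse = np.sqrt(((M - M_true)[~observed] ** 2).mean())   # error on dropouts only
```

The tensor version replaces the SVD step with a tensor decomposition over a cell-similarity-based third-order tensor, which is what lets scLRTC restore gene-to-gene and cell-to-cell correlations jointly.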
250. Missing-data imputation using wearable sensors in heart rate variability
- Author
-
A. Tlija, K. Węgrzyn-Wolska, and D. Istrate
- Subjects
data imputation ,spline interpolation ,linear interpolation ,hrv ,iot ,Technology ,Technology (General) ,T1-995 - Abstract
The objective of this work is to establish a methodology for handling missing data from a connected heartbeat sensor and to propose a good replacement strategy in the context of heart rate variability (HRV) computation. The framework is a research project that aims to build a system capable of measuring stress and other factors influencing the onset and development of heart disease. The research encompasses studying existing methods and improving them using experimental data from a case study describing participants' everyday lives. We conducted a study to model stress from the HRV signal, which is extracted from a heart rate monitor belt connected to a smart watch. This paper describes the data recording procedure and the data imputation methodology. Missing data is a topic that has been discussed by several authors; the manuscript explains why we chose spline interpolation for imputing data values. We implemented a random data-suppression procedure to simulate removed data, then implemented several algorithms and chose the best one for our case study based on the mean squared error.
- Published
- 2020
- Full Text
- View/download PDF
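The evaluation procedure in record 250 (randomly suppress samples, fill them back in, compare mean squared errors across interpolation methods) can be condensed as follows. The synthetic RR-interval signal stands in for real HRV data:

```python
# Compare linear vs. cubic-spline imputation on randomly suppressed samples.
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(6)
t = np.linspace(0, 10, 200)
rr = 0.8 + 0.05 * np.sin(2 * np.pi * 0.3 * t)           # synthetic RR series (s)

# Randomly suppress 15% of the samples (keep the endpoints)
missing = rng.random(t.size) < 0.15
missing[[0, -1]] = False
t_obs, rr_obs = t[~missing], rr[~missing]

linear = np.interp(t[missing], t_obs, rr_obs)           # linear interpolation
spline = CubicSpline(t_obs, rr_obs)(t[missing])         # cubic spline

mse_linear = ((linear - rr[missing]) ** 2).mean()
mse_spline = ((spline - rr[missing]) ** 2).mean()
```

For smooth physiological signals like this one the spline typically wins, which is consistent with the paper's choice of spline interpolation.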