1,405 results for "Data cleaning"
Search Results
2. Data science for pattern recognition in agricultural large time series data: A case study on sugarcane sucrose yield
- Author
- Bautista-Romero, Laura Valentina, Sánchez-Murcia, Juan David, and Ramírez-Gil, Joaquín Guillermo
- Published
- 2025
- Full Text
- View/download PDF
3. Integrated STL-DBSCAN algorithm for online hydrological and water quality monitoring data cleaning
- Author
- Song, Chenyu, Cui, Jingyuan, Cui, Yafei, Zhang, Sheng, Wu, Chang, Qin, Xiaoyan, Wu, Qiaofeng, Chi, Shanqing, Yang, Mingqing, Liu, Jia, Chen, Ruihong, and Zhang, Haiping
- Published
- 2025
- Full Text
- View/download PDF
4. Wind field characterization with skip-connected variational autoencoder for data cleaning under deck disturbance effects
- Author
- Chen, Nanxi, Liu, Guilin, Ma, Rujin, Chen, Airong, Bai, Yufan, and Guo, Donghao
- Published
- 2025
- Full Text
- View/download PDF
5. Leveraging local and global relationships for corrupted label detection
- Author
- Lam, Phong, Nguyen, Ha-Linh, Dang, Xuan-Truc Dao, Tran, Van-Son, Le, Minh-Duc, Nguyen, Thu-Trang, Nguyen, Son, and Vo, Hieu Dinh
- Published
- 2025
- Full Text
- View/download PDF
6. A data-driven standardised generalisable methodology to validate a large energy performance Certification dataset: A case of the application in Ireland
- Author
- Raushan, K., Mac Uidhir, T., Llorens Salvador, M., Norton, B., and Ahern, C.
- Published
- 2024
- Full Text
- View/download PDF
7. MIDC: Medical image dataset cleaning framework based on deep learning
- Author
- Yi, Sanli and Chen, Ziyan
- Published
- 2024
- Full Text
- View/download PDF
8. Challenges in Using Data for Public Policy Decisions
- Author
- Paget-Seekins, Laurel; Brisbin, Abra, Lange, Karen, McNicholas, Erin, and Purvine, Emilie, editors
- Published
- 2025
- Full Text
- View/download PDF
9. Research on Improved Ant-Lion Optimization Long Short-Term Memory Network Power Load Forecasting Model
- Author
- Xu, Jianjun, Shi, Yuanbo, and Huang, Yueyang; Zhou, Yimin, editor
- Published
- 2025
- Full Text
- View/download PDF
10. Mitigating Hallucinations in LLMs Using Sieve of Fallacies and Truths (SoFT): A Game Theoretic Perspective
- Author
- Roy, Anuran and Roy, Sanjiban Sekhar; Vajjhala, Narasimha Rao, Roy, Sanjiban Sekhar, Taşcı, Burak, and Hoque Chowdhury, Muhammad Enamul, editors
- Published
- 2025
- Full Text
- View/download PDF
11. Study on Intelligent Cleaning of Hydrological Data in the Main Canal of the Middle Route of the South-to-North Water Diversion Project
- Author
- Chen, Xiaonan, Wang, Yilin, Gu, Qihao, Jin, Yanguo, and Duan, Chunqing; Qiu, Yanjun, Feng, Weimin, Zhang, Zhiqiang, and Ahmad, Fauziah, editors
- Published
- 2025
- Full Text
- View/download PDF
12. Enhancing Load Forecasting with VAE-GAN-Based Data Cleaning for Electric Vehicle Charging Loads
- Author
- Zhang, Wensi, Lei, Shuya, Jiang, Yuqing, Yao, Tiechui, Wang, Yishen, and Sun, Zhiqing; Morishima, Atsuyuki, Li, Guoliang, Ishikawa, Yoshiharu, Amer-Yahia, Sihem, Jagadish, H. V., and Lu, Kejing, editors
- Published
- 2025
- Full Text
- View/download PDF
13. Wind Power Anomaly Data Cleaning Based on KDE-DBSCAN
- Author
- Zhang, Qiushi, Liu, Yiqun, Zhou, Jiancheng, Liu, Yantao, and Lei, Boxiang; Muyeen, S M, editor
- Published
- 2025
- Full Text
- View/download PDF
14. A Novel and Efficient Machine Learning Technique for Cleaning Q-Messy Data
- Author
- Saradhi, K., Davanam, Ganesh, and Malchi, Sunil Kumar; Kumar, Amit, Gunjan, Vinit Kumar, Senatore, Sabrina, and Hu, Yu-Chen, editors
- Published
- 2025
- Full Text
- View/download PDF
15. Practical Machine Learning
- Author
- Nyamawe, Ally S., Mjahidi, Mohamedi M., Nnko, Noe E., Diwani, Salim A., Minja, Godbless G., and Malyango, Kulwa
- Subjects
- Ethics, pre-processing, data collection, hyperparameter optimization, data cleaning, programming, choosing algorithms, models, cloud computing, responsible AI, explainable AI, XAI, classification, regression, Python, Automatic control engineering, Artificial intelligence, Programming and scripting languages: general, Software Engineering
- Abstract
The book provides an accessible, comprehensive introduction for beginners to machine learning, equipping them with the fundamental skills and techniques essential for this field. It enables beginners to construct practical, real-world solutions powered by machine learning across diverse application domains. It demonstrates the fundamental techniques involved in data collection, integration, cleansing, transformation, development, and deployment of machine learning models. This book emphasizes the importance of integrating responsible and explainable AI into machine learning models, ensuring these principles are prioritized rather than treated as an afterthought. To support learning, this book also offers information on accessing additional machine learning resources such as datasets, libraries, pre-trained models, and tools for tracking machine learning models. This is a core resource for students and instructors of machine learning and data science looking for beginner-friendly material that offers real-world applications and takes ethical discussions into account. The Open Access version of this book, available at http://www.taylorfrancis.com, has been made available under a Creative Commons Attribution-Non Commercial-No Derivatives (CC-BY-NC-ND) 4.0 license.
- Published
- 2025
- Full Text
- View/download PDF
16. Wind power data cleaning using RANSAC-based polynomial and linear regression with adaptive threshold.
- Author
- Yang, Haineng, Tang, Jie, Shao, Wu, Yin, Jintian, and Liu, Baiyang
- Subjects
- CONVOLUTIONAL neural networks, REGRESSION analysis, RENEWABLE energy sources, CLEAN energy, WIND forecasting, WIND power
- Abstract
As the global demand for clean energy continues to rise, wind power has become one of the most important renewable energy sources. However, wind power data often contains a high proportion of dense anomalies, which not only significantly affect the accuracy of wind power forecasting models but may also mislead grid scheduling decisions, thereby jeopardizing grid security. To address this issue, this paper proposes an adaptive threshold robust regression model (RPR model) based on the combination of the Random Sample Consensus (RANSAC) algorithm and polynomial linear regression for wind power data cleaning. The model successfully captures the nonlinear relationship between wind speed and power by extending the polynomial features of wind speed and power, enabling the linear regression model to handle the nonlinearity. By combining the RANSAC algorithm and polynomial linear regression, a robust polynomial regression model is constructed to tackle anomalous data and enhance the accuracy of data cleaning. During the cleaning process, the model first fits the raw data by randomly selecting a minimal sample set, then dynamically adjusts the decision thresholds based on the median of residuals and median absolute deviation (MAD), ensuring effective identification and cleaning of anomalous data. The model's robustness allows it to maintain efficient cleaning performance even with a high proportion of anomalous data, addressing the limitations of existing methods when handling densely distributed anomalies. The effectiveness and innovation of the proposed method were validated by applying it to real data from a wind farm operated by Longyuan Power. 
Compared to other commonly used cleaning methods, such as the Bidirectional Change Point Grouping Quartile Statistical Model, Principal Contour Image Processing Model, DBSCAN Clustering Model, and Support Vector Machine (SVM) Model, experimental results showed that the proposed method delivered the best performance in improving data quality. Specifically, the method significantly reduced the mean absolute error (MAE) of the wind power forecasting model by 72.1%, which is higher than the reductions observed in other methods (ranging from 37.3% to 52.7%). Moreover, it effectively reduced the prediction error of the Convolutional Neural Network (CNN) + Gated Recurrent Unit (GRU) forecasting model, ensuring high prediction accuracy. The adaptive threshold robust regression model proposed in this study is innovative and has significant application potential. It provides an effective new approach for wind power data cleaning, applicable not only to conventional scenarios with low proportions of anomalous data but also to complex datasets with a high proportion of dense anomalies. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
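As a rough illustration of the cleaning idea in the abstract above (a RANSAC loop over minimal sample sets, a polynomial wind-speed/power fit, and a decision threshold built from the residual median and MAD), here is a NumPy sketch. It is not the authors' RPR implementation: the polynomial degree, iteration count, and the 3 × 1.4826 × MAD cutoff are assumed placeholder choices, and model selection here uses the tightest robust residual spread (a least-median-of-squares flavor).

```python
import numpy as np

def ransac_poly_clean(speed, power, degree=3, n_iter=200, min_samples=20, k=3.0, seed=0):
    """Return a boolean mask: True = keep, False = flagged as anomalous."""
    rng = np.random.default_rng(seed)
    best_mask, best_mad = np.ones(len(speed), dtype=bool), np.inf
    for _ in range(n_iter):
        # Fit a polynomial power curve on a randomly chosen minimal sample set
        idx = rng.choice(len(speed), size=min_samples, replace=False)
        coeffs = np.polyfit(speed[idx], power[idx], degree)
        resid = power - np.polyval(coeffs, speed)
        # Adaptive threshold from the residual median and the MAD
        # (1.4826 rescales MAD to a standard-deviation equivalent)
        med = np.median(resid)
        mad = np.median(np.abs(resid - med)) + 1e-12
        mask = np.abs(resid - med) <= k * 1.4826 * mad
        if mad < best_mad:  # keep the fit with the tightest robust residual spread
            best_mad, best_mask = mad, mask
    return best_mask
```

Usage is simply `mask = ransac_poly_clean(speed, power)` followed by `speed[mask], power[mask]` to drop the flagged rows.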
17. Data Cleaning Model of Mine Wind Speed Sensor Based on LOF-GMM and SGAIN.
- Author
- Ni, Jingfeng, Yang, Shengya, and Liu, Yujiao
- Abstract
To improve the quality of mine ventilation wind speed sensor data, a data cleaning model for mine ventilation wind speed sensors based on LOF-GMM and SGAIN is proposed. First, the LOF-GMM algorithm was used to identify wind speed sensor data, cluster the data, and determine the threshold of the local outlier factor, enabling automatic identification of abnormal data and recognition of ventilation fault state information. Abnormal data were then removed to create blank missing points. Finally, wind speed data from the normal operating state of the ventilation system were used to train the SGAIN model to obtain its optimal parameters. The trained SGAIN model was then used to fill in the blank points. The results show that the proposed method can effectively detect abnormal wind speed sensor data and identify ventilation system fault information. In terms of imputation performance, this model outperformed other data imputation models such as GAIN, RF, and DAE. Although the imputation speed was slightly lower than that of the RF and DAE models, considering the high accuracy requirements of mine wind speed data, SGAIN is more suitable for use in the field of mine ventilation. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
18. Prediction for Coastal Wind Speed Based on Improved Variational Mode Decomposition and Recurrent Neural Network.
- Author
- Du, Muyuan, Zhang, Zhimeng, and Ji, Chunning
- Subjects
- RECURRENT neural networks, WIND speed, OPTIMIZATION algorithms, DATA scrubbing, WIND forecasting, OUTLIER detection
- Abstract
Accurate and comprehensive wind speed forecasting is crucial for improving efficiency in offshore wind power operation systems in coastal regions. However, raw wind speed data often suffer from noise and missing values, which can undermine the prediction performance. This study proposes a systematic framework, termed VMD-RUN-Seq2Seq-Attention, for noise reduction, outlier detection, and wind speed prediction by integrating Variational Mode Decomposition (VMD), the Runge–Kutta optimization algorithm (RUN), and a Sequence-to-Sequence model with an Attention mechanism (Seq2Seq-Attention). Using wind speed data from the Shidao, Xiaomaidao, and Lianyungang stations as case studies, a fitness function based on the Pearson correlation coefficient was developed to optimize the VMD mode count and penalty factor. A comparative analysis of different Intrinsic Mode Function (IMF) selection ratios revealed that selecting a 50% IMF ratio effectively retains the intrinsic information of the raw data while minimizing noise. For outlier detection, statistical methods were employed, followed by a comparative evaluation of three models—LSTM, LSTM-KAN, and Seq2Seq-Attention—for multi-step wind speed forecasting over horizons ranging from 1 to 12 h. The results consistently showed that the Seq2Seq-Attention model achieved superior predictive accuracy across all forecast horizons, with the correlation coefficient of its prediction results greater than 0.9 in all cases. The proposed VMD-RUN-Seq2Seq-Attention framework outperformed other methods in the denoising, data cleansing, and reconstruction of the original wind speed dataset, with a maximum improvement of 21% in accuracy, producing highly accurate and reliable results. This approach offers a robust methodology for improving data quality and enhancing wind speed forecasting accuracy in coastal environments. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
19. Enhancing data cleaning process on accounting data for fraud detection.
- Author
- Abdul Malek, Mohamad Affendi and Jalil, Kamarularifin Abd
- Subjects
- DATA scrubbing, FRAUD investigation, DETECTION algorithms, ACCOUNTING fraud, FORENSIC accounting
- Abstract
Data cleaning is a crucial step in fraud detection as it involves identifying and correcting any inaccuracies or inconsistencies in the data. This can help to ensure that the data being used for fraud detection is reliable and accurate, which in turn can improve the effectiveness of fraud detection algorithms. Due to the overwhelming amount of data, data cleaning specific to fraud detection is a very important activity that helps the auditor find the appropriate information. Therefore, a new accounting data cleaning technique for fraud detection is needed. In this paper, an enhancement of the process of fraud detection by accounting auditors through the implementation of an accounting data cleaning technique is proposed. The proposed technique was embedded in a prototype system called accounting data cleaning for fraud detection (ADCFD). Through experiments, the performance of the proposed technique through ADCFD is compared with that obtained from the IDEA system, using the same dataset. The results show that the proposed enhanced technique through the ADCFD system performed better than the IDEA system. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
20. Standardizing Collections Record Data for Increased Understanding of Glass Collections Across the Smithsonian Institution.
- Author
- Hiebert, Miriam E., O'Hern, Robin, Colebank, Sadie, Baburi, Sarah, and Kaczkowski, Rebecca
- Subjects
- DATA scrubbing, CATALOGS, DATA management, DATABASES, ACQUISITION of data
- Abstract
The information contained in museum collections catalog databases is an invaluable resource for understanding and caring for collection items. The identification and grouping of items in a collection based on criteria such as location of manufacture, use, or materials composition allows for items at greater risk for long term preservation issues to be more appropriately monitored and cared for. However, identifying these items can be difficult, particularly when the collections data or terminology included in collections catalogs and databases vary significantly between items or across collections. This paper discusses the process and results of a data consolidation and cleaning campaign that was undertaken by the Smithsonian's Glass Deterioration Working Group. This work was done in order to develop a consistent database of glass and glass-containing collection items that is able to be effectively queried for items of particular concern or interest from a preservation standpoint. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
21. Taking the Next Step in Exploring the Literary Digest 1936 Poll
- Author
- Beth Chance, Andrew Kerr, and Jett Palmer
- Subjects
- Bias, Data cleaning, Data validation, Post-stratification, Probabilities. Mathematical statistics (QA273-280), Special aspects of education (LC8-6691)
- Abstract
While many instructors are aware of the Literary Digest 1936 poll as an example of biased sampling methods, this article details potential further explorations for the Digest’s 1924–1936 quadrennial U.S. presidential election polls. Potential activities range from lessons in data acquisition, cleaning, and validation, to basic data literacy and visualization skills, to exploring one or more methods of adjustment to account for bias based on information collected at that time. Students can also compare how those methods would have performed. One option could be to give introductory students a first look at the idea of “sampling adjustment” and how this principle can be used to account for difficulties in modern polling, but the context is rich in other opportunities that can be discussed at various times in the course or in more advanced sampling courses. Supplementary materials for this article are available online.
- Published
- 2024
- Full Text
- View/download PDF
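One adjustment method the article alludes to, post-stratification, amounts to reweighting stratum means by known population shares instead of the sample's biased shares. A tiny worked sketch with invented numbers, where stratum 1 (say, telephone owners, Digest-style) is heavily oversampled:

```python
def post_stratify(stratum_means, sample_sizes, population_shares):
    """Compare the naive (sample-share-weighted) estimate with a
    post-stratified (population-share-weighted) estimate."""
    n = sum(sample_sizes)
    naive = sum(m * k for m, k in zip(stratum_means, sample_sizes)) / n
    adjusted = sum(m * w for m, w in zip(stratum_means, population_shares))
    return naive, adjusted

# Hypothetical numbers: support for a candidate is 0.40 among owners and
# 0.70 among non-owners; owners are 90% of the sample but 40% of the population.
naive, adjusted = post_stratify([0.40, 0.70], [900, 100], [0.40, 0.60])
```

With these numbers the naive estimate is 0.43 while the post-stratified estimate is 0.58, showing how reweighting can reverse the conclusion drawn from a biased sample.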
22. Sensor data cleaning for applications in dairy herd management and breeding.
- Author
- Schodl, Katharina, Stygar, Anna, Steininger, Franz, and Egger-Danner, Christa
- Subjects
- ANIMAL herds, DATA scrubbing, DAIRY cattle, DATA analysis, DETECTORS
- Abstract
Data cleaning is a core process when it comes to using data from dairy sensor technologies. This article presents guidelines for sensor data cleaning with a specific focus on dairy herd management and breeding applications. Prior to any data cleaning steps, context and purpose of the data use must be considered. Recommendations for data cleaning are provided in five distinct steps: 1) validate the data merging process, 2) get to know the data, 3) check completeness of the data, 4) evaluate the plausibility of sensor measures and detect outliers, and 5) check for technology related noise. Whenever necessary, the recommendations are supported by examples of different sensor types (bolus, accelerometer) collected in an international project (D4Dairy) or supported by relevant literature. To ensure quality and reproducibility, data users are required to document their approach throughout the process. The target group for these guidelines are professionals involved in the process of collecting, managing, and analyzing sensor data from dairy herds. Providing guidelines for data cleaning could help to ensure that the data used for analysis is accurate, consistent, and reliable, ultimately leading to more informed management decisions and better breeding outcomes for dairy herds. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
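Steps 3 to 5 of the guidelines above (completeness, plausibility, technology-related noise) can be sketched generically in NumPy. The reporting interval, plausibility range, and window size below are invented placeholders, not values from the article:

```python
import numpy as np

def check_completeness(timestamps, expected_interval_s=600):
    """Step 3: completeness -- count gaps larger than 1.5x the sensor's interval."""
    gaps = np.diff(np.sort(timestamps))
    return int(np.sum(gaps > 1.5 * expected_interval_s))

def flag_implausible(x, lo, hi):
    """Step 4: plausibility -- flag values outside a physically meaningful range."""
    return (x < lo) | (x > hi)

def smooth_tech_noise(x, window=5):
    """Step 5: technology-related noise -- rolling median with an odd window."""
    pad = window // 2
    xp = np.pad(x, pad, mode="edge")
    return np.array([np.median(xp[i:i + window]) for i in range(len(x))])
```

Each helper corresponds to one recommendation; in practice the thresholds would come from the sensor's specification and the context established in steps 1 and 2.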
23. An ORSAC method for data cleaning inspired by RANSAC.
- Author
- Jenkins, Thomas, Goodwin, Autumn, and Talafha, Sameerah
- Subjects
- DATA scrubbing, COMPUTER vision, OUTLIER detection, DEEP learning, STATISTICAL sampling
- Abstract
In classification problems, mislabeled data can have a dramatic effect on the capability of a trained model. The traditional method of dealing with mislabeled data is through expert review. However, this is not always ideal, due to the large volume of data in many classification datasets, such as image datasets supporting deep learning models, and the limited availability of human experts for reviewing the data. Herein, we propose an ordered sample consensus (ORSAC) method to support data cleaning by flagging mislabeled data. This method is inspired by the random sample consensus (RANSAC) method for outlier detection. In short, the method involves iteratively training and testing a model on different splits of the dataset, recording misclassifications, and flagging data that is frequently misclassified as probably mislabeled. We evaluate the method by purposefully mislabeling subsets of data and assessing the method's capability to find such data. We demonstrate with three datasets, a mosquito image dataset, CIFAR-10, and CIFAR-100, that this method is reliable in finding mislabeled data with a high degree of accuracy. Our experimental results indicate a high proficiency of our methodology in identifying mislabeled data across these diverse datasets, with performance assessed using different mislabeling frequencies. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
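A minimal sketch of the ORSAC loop described above: repeatedly split the data, train and test a model, record per-sample misclassifications, and flag samples misclassified in most of their test appearances. A 1-nearest-neighbour classifier stands in for the paper's deep models, and the round count and flagging threshold are illustrative assumptions:

```python
import numpy as np

def orsac_flag(X, y, n_rounds=40, test_frac=0.3, flag_ratio=0.5, seed=0):
    """Flag samples misclassified in more than flag_ratio of their
    appearances in the test split as probably mislabeled."""
    rng = np.random.default_rng(seed)
    n = len(y)
    miss, seen = np.zeros(n), np.zeros(n)
    for _ in range(n_rounds):
        perm = rng.permutation(n)
        n_test = int(test_frac * n)
        test, train = perm[:n_test], perm[n_test:]
        # 1-NN: predict each test point with the label of its closest train point
        dists = np.abs(X[test][:, None] - X[train][None, :])
        pred = y[train][np.argmin(dists, axis=1)]
        seen[test] += 1
        miss[test] += (pred != y[test])
    return (seen > 0) & (miss / np.maximum(seen, 1) > flag_ratio)
```

On image data the distance computation would be replaced by training the actual classifier each round; the bookkeeping is the same.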
24. BConvLSTM: a deep learning-based technique for severity prediction of a traffic crash.
- Author
- Vinta, Surendra Reddy, Rajarajeswari, Pothuraju, Kumar, M. Vijay, and Kumar, G. Sai Chaitanya
- Subjects
- ARTIFICIAL neural networks, METAHEURISTIC algorithms, TRAFFIC estimation, ROAD users, DEEP learning, TRAFFIC accidents
- Abstract
Predicting the severity of crashes has become a significant issue in research on road accidents. Traffic accident severity prediction is essential for protecting vulnerable road users and preventing traffic accidents. Explainability of the forecast is also essential so that practitioners can identify significant risk variables and put appropriate countermeasures in place. Most previous research ignores the severity of property loss caused by traffic accidents and cannot differentiate between different levels of fatality and property-loss severity. Additionally, existing works largely lack an understandable structure for deep neural networks (DNN), whereas traditional systems are comparatively easy to interpret, and many attempts to incorporate neural networks fail to exploit structural data when explaining forecasts. We propose a Deep Learning (DL) framework for forecasting traffic crash severity to overcome these limitations. It processes data in three steps. Initially, the collected input data are cleaned; data cleaning is performed as a preprocessing step. We conduct experiments on two datasets, a Countrywide (US) Traffic Accident Dataset and the UK Road Accident Dataset. The outcomes of the experiments demonstrate that the proposed technique outperformed other approaches and produced the best accuracy. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
25. Cleaning of Abnormal Wind Speed Power Data Based on Quartile RANSAC Regression.
- Author
- Zhang, Fengjuan, Zhang, Xiaohui, Xu, Zhilei, Dong, Keliang, Li, Zhiwei, and Liu, Yubo
- Subjects
- WIND power, SUPERVISORY control systems, GEOGRAPHIC boundaries, WIND speed, WIND turbines
- Abstract
The combined complexity of wind turbine systems and harsh operating conditions pose significant challenges to the accuracy of operational data in Supervisory Control and Data Acquisition (SCADA) systems. Improving the precision of data cleaning for high proportions of stacked abnormalities remains an urgent problem. This paper deeply analyzes the distribution characteristics of abnormal data and proposes a novel method for abnormal data cleaning based on a classification processing framework. Firstly, the first type of abnormal data is cleaned based on operational criteria; secondly, the quartile method is used to eliminate sparse abnormal data to obtain a clearer boundary line; on this basis, the Random Sample Consensus (RANSAC) algorithm is employed to eliminate stacked abnormal data; finally, the effectiveness of the proposed algorithm in cleaning abnormal data with a high proportion of stacked abnormalities is verified through case studies, and evaluation indicators are introduced through comparative experiments to quantitatively assess the cleaning effect. The research results indicate that the algorithm excels in cleaning effectiveness, efficiency, accuracy, and rationality of data deletion. The cleaning accuracy improvement is particularly significant when dealing with a high proportion of stacked anomaly data, thereby bringing significant value to wind power applications such as wind power prediction, condition assessment, and fault detection. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
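The quartile stage of the classification framework above can be sketched as a per-speed-bin IQR fence. The bin width and fence factor are illustrative assumptions, and the operational-criteria and RANSAC stages of the paper's pipeline are omitted:

```python
import numpy as np

def quartile_clean(speed, power, bin_width=0.5, k=1.5):
    """Inside each wind-speed bin, keep power values within
    [Q1 - k*IQR, Q3 + k*IQR]; return a boolean keep-mask."""
    keep = np.ones(len(power), dtype=bool)
    for left in np.arange(speed.min(), speed.max() + bin_width, bin_width):
        in_bin = (speed >= left) & (speed < left + bin_width)
        if in_bin.sum() < 4:  # too few points to estimate quartiles
            continue
        q1, q3 = np.percentile(power[in_bin], [25, 75])
        iqr = q3 - q1
        keep[in_bin] &= (power[in_bin] >= q1 - k * iqr) & (power[in_bin] <= q3 + k * iqr)
    return keep
```

As the abstract notes, a fence like this removes sparse anomalies and sharpens the boundary line, after which a consensus method such as RANSAC can tackle the stacked anomalies it cannot reach.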
26. florabr: An R package to explore and spatialize species distribution using Flora e Funga do Brasil.
- Author
- Trindade, Weverton C. F.
- Subjects
- GEOGRAPHICAL discoveries, PHYTOGEOGRAPHY, PLANT diversity, DATA scrubbing, SPECIES distribution
- Abstract
Premise: The Flora e Funga do Brasil project is the most comprehensive effort to reliably document Brazilian plant and fungal diversity. It involves the collaborative work of hundreds of taxonomists, integrating detailed and standardized morphological descriptions, nomenclatural status, and geographic distribution information of plants, algae, and fungi collected throughout Brazil. Despite the extensive information available, managing the information from the Flora e Funga do Brasil website poses certain challenges. Methods and Results: florabr is an R package developed to facilitate the exploration and geographical analysis of species information derived from the Flora e Funga do Brasil. Unique to florabr is its ability to interact with the latest, or any other version of the dataset, which undergoes weekly updates. I illustrate the practical application of florabr in common tasks in biogeography and conservation studies. Conclusions: florabr is anticipated to be of significant interest to biogeographers, ecologists, curators of biological collections, and taxonomists actively contributing to the Flora e Funga do Brasil. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
27. Frequency of anthropometric implausible values estimated from different methodologies: a systematic review and meta-analysis.
- Author
- Santos, Iolanda Karla Santana dos, Pereira, Débora Borges dos Santos, Silva, Jéssica Cumpian, Gallo, Caroline de Oliveira, Oliveira, Mariane Helen de, Vasconcelos, Luana Cristina Pereira de, and Conde, Wolney Lisbôa
- Subjects
- MEDICAL information storage & retrieval systems, RESEARCH funding, NUTRITIONAL assessment, META-analysis, STATURE, MEDLINE, SYSTEMATIC reviews, NUTRITIONAL status, ANTHROPOMETRY, ONLINE information services, CONFIDENCE intervals, DATA analysis software
- Abstract
Context Poor anthropometric data quality affect the prevalence of malnutrition and could harm public policy planning. Objective This systematic review and meta-analysis was designed to identify different methods to evaluate and clean anthropometric data, and to calculate the frequency of implausible values for weight and height obtained from these methodologies. Data Sources Studies about anthropometric data quality and/or anthropometric data cleaning were searched for in the MEDLINE, LILACS, SciELO, Embase, Scopus, Web of Science, and Google Scholar databases in October 2020 and updated in January 2023. In addition, references of included studies were searched for the identification of potentially eligible studies. Data Extraction Paired researchers selected studies, extracted data, and critically appraised the selected publications. Data Analysis Meta-analysis of the frequency of implausible values and 95% confidence interval (CI) was estimated. Heterogeneity (I 2) and publication bias were examined by meta-regression and funnel plot, respectively. Results In the qualitative synthesis, 123 reports from 104 studies were included, and in the quantitative synthesis, 23 studies of weight and 14 studies of height were included. The study reports were published between 1980 and 2022. The frequency of implausible values for weight was 0.55% (95%CI, 0.29–0.91) and for height was 1.20% (95%CI, 0.44–2.33). Heterogeneity was not affected by the methodological quality score of the studies and publication bias was discarded. Conclusions Height had twice the frequency of implausible values compared with weight. Using a set of indicators of quality to evaluate anthropometric data is better than using indicators singly. Systematic Review Registration PROSPERO registration no. CRD42020208977. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
28. Background removal for debiasing computer-aided cytological diagnosis.
- Author
- Takeda, Keita, Sakai, Tomoya, and Mitate, Eiji
- Abstract
To address the background-bias problem in computer-aided cytology caused by microscopic slide deterioration, this article proposes a deep learning approach for cell segmentation and background removal without requiring cell annotation. A U-Net-based model was trained to separate cells from the background in an unsupervised manner by leveraging the redundancy of the background and the sparsity of cells in liquid-based cytology (LBC) images. The experimental results demonstrate that the U-Net-based model trained on a small set of cytology images can exclude background features and accurately segment cells. This capability is beneficial for debiasing in the detection and classification of the cells of interest in oral LBC. Slide deterioration can significantly affect deep learning-based cell classification. Our proposed method effectively removes background features at no cost of cell annotation, thereby enabling accurate cytological diagnosis through the deep learning of microscopic slide images. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
29. Design of a Water Level Prediction System for Station Surroundings Based on a Bidirectional LSTM Neural Network.
- Author
-
姚晔, 许锡伟, 管剑波, and 葛旭初
- Abstract
Copyright of Computer Measurement & Control is the property of Magazine Agency of Computer Measurement & Control and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
30. Toward Dynamic Data-Driven Time-Slicing LSH for Joinable Table Discovery.
- Author
-
Wang, Weiwei, Zhu, Chunxiang, and Yan, Han
- Subjects
SIMILARITY (Physics) ,DATA scrubbing ,DATABASES ,INDUSTRIALISM ,LEGACY systems ,BINARY codes ,DATA integration - Abstract
In legacy industrial systems, discovering joinable information between database tables is important for applications such as data integration and data analysis. Locality-Sensitive Hashing-based methods have been proven capable of handling chaotic and diverse table relationships, but these methods often rely on an incorrect assumption: that the similarity of table columns in the database directly reflects their joinability, which causes problems with the accuracy of their results. To solve this problem, this study proposes a dynamic data-driven time-slicing Locality-Sensitive Hashing method for joinable table discovery. This method introduces database log information and, within different time slices, uses the co-occurrence matrix of data tables to determine their joinability. Specifically, it first performs a MinHash dimensionality reduction on database columns and then uses Locality-Sensitive Hashing to calculate the static similarity. Next, it identifies business modular time slices through database logs, calculates the dynamic similarity of the slice-time data, and builds a co-occurrence matrix between tables. Finally, the joinability between data tables is calculated using the static similarity, dynamic similarity, and co-occurrence matrix. The experimental results demonstrate that, for data cleaning, this method effectively excludes tables that are merely similar but have no business relationship, and its accuracy exceeds that of methods that depend on similarity alone. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
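The MinHash step described in this abstract can be illustrated with a small sketch; the salted CRC32 hashes stand in for a family of independent hash functions, and the column values are invented:

```python
import zlib

def minhash_signature(values, num_hashes=64):
    """MinHash signature of a column's value set. Each salted CRC32 acts as
    one hash function; 64 hash functions is an arbitrary illustrative choice."""
    sig = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        sig.append(min(zlib.crc32(salt + str(v).encode()) for v in values))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity;
    LSH would then band these signatures to find candidate joinable columns."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

col_a = {"u01", "u02", "u03", "u04", "u05", "u06", "u07", "u08"}
col_b = {"u01", "u02", "u03", "u04", "u05", "u06", "u98", "u99"}

sig_a = minhash_signature(col_a)
sig_b = minhash_signature(col_b)
print(estimated_jaccard(sig_a, sig_b))  # close to the true Jaccard of 6/10 = 0.6
```

The paper's point is that this static similarity alone is not enough; it is combined with log-derived dynamic similarity and a co-occurrence matrix before deciding joinability.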
31. Taking the Next Step in Exploring the Literary Digest 1936 Poll.
- Author
-
Chance, Beth, Kerr, Andrew, and Palmer, Jett
- Subjects
UNITED States presidential elections ,DATA scrubbing ,ELECTION forecasting ,UPPER level courses (Education) ,PUBLIC opinion polls - Abstract
While many instructors are aware of the Literary Digest 1936 poll as an example of biased sampling methods, this article details potential further explorations for the Digest's 1924–1936 quadrennial U.S. presidential election polls. Potential activities range from lessons in data acquisition, cleaning, and validation, to basic data literacy and visualization skills, to exploring one or more methods of adjustment to account for bias based on information collected at that time. Students can also compare how those methods would have performed. One option could be to give introductory students a first look at the idea of "sampling adjustment" and how this principle can be used to account for difficulties in modern polling, but the context is rich in other opportunities that can be discussed at various times in the course or in more advanced sampling courses. Supplementary materials for this article are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
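The "sampling adjustment" idea mentioned above can be illustrated with a toy poststratification sketch; all shares and counts below are invented for illustration, not taken from the Digest polls:

```python
# Reweight a biased sample so each stratum counts in proportion to its known
# population share (the simplest form of sampling adjustment).

population_share = {"group_a": 0.6, "group_b": 0.4}   # known, e.g. from a census
sample_counts    = {"group_a": 200, "group_b": 800}   # over-represents group_b
candidate_share  = {"group_a": 0.30, "group_b": 0.80} # support by group (invented)

n = sum(sample_counts.values())
raw_estimate = sum(sample_counts[g] / n * candidate_share[g] for g in sample_counts)

# Weighting each stratum by population share / sample share collapses to:
adjusted_estimate = sum(population_share[g] * candidate_share[g] for g in sample_counts)

print(round(raw_estimate, 3))       # -> 0.7  (driven by the over-sampled group)
print(round(adjusted_estimate, 3))  # -> 0.5  (reflects the population mix)
```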
32. Research on urban water demand prediction based on machine learning and feature engineering
- Author
-
Dongfei Yan, Yi Tao, Jianqi Zhang, and Huijia Yang
- Subjects
data cleaning ,data interpolation ,feature engineering ,machine learning ,water demand prediction ,water supply system ,Water supply for domestic and industrial purposes ,TD201-500 ,River, lake, and water-supply engineering (General) ,TC401-506 - Abstract
Urban water demand prediction is not only the foundation of water resource planning and management, but also an important component of water supply system optimization and scheduling. Predicting future water demand is therefore of great significance. For univariate time series data, the issue of outliers can be addressed through data preprocessing. The data input dimension is then increased through feature engineering, and finally the LightGBM (Light Gradient Boosting Machine) model is used to predict future water demand. The results demonstrate that cubic polynomial interpolation outperforms the Prophet model and the linear method for missing value interpolation. In terms of predicting water demand, the LightGBM model demonstrates excellent forecasting performance and can effectively predict future water demand trends. The evaluation indicators MAPE (mean absolute percentage error) and NSE (Nash–Sutcliffe efficiency coefficient) on the test dataset are 4.28% and 0.94, respectively. These indicators can provide a scientific basis for short-term demand prediction by water supply enterprises. HIGHLIGHTS Interpolation of raw training data may not necessarily improve the performance of predictive models. Accurate prediction of univariate data can be achieved through feature engineering and machine learning.
- Published
- 2024
- Full Text
- View/download PDF
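The two evaluation indicators reported above are straightforward to compute; a minimal sketch with invented demand values:

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def nse(actual, predicted):
    """Nash-Sutcliffe efficiency: 1 minus squared error relative to the
    variance about the observed mean (1.0 is a perfect forecast)."""
    mean_a = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean_a) ** 2 for a in actual)
    return 1.0 - sse / sst

actual    = [100.0, 110.0, 120.0, 130.0]  # invented demand series
predicted = [102.0, 108.0, 123.0, 128.0]
print(round(mape(actual, predicted), 2))  # -> 1.96
print(round(nse(actual, predicted), 3))   # -> 0.958
```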
33. Data cleaning with variational autoencoders
- Author
-
Eduardo, Simão Fernandes Lopes Marques, Sutton, Charles, and Williams, Chris
- Subjects
poor quality data ,outliers ,corrupting errors ,data cleaning ,automated data cleaning ,deep learning - Abstract
A typical data science or machine learning pipeline starts with data exploration; then data engineering (wrangling, cleaning); then moves towards modelling (model selection, learning, validation); and finally model visualization or deployment. Most of the datasets used in industry are either structured or text based. Two relevant instances of structured datasets are graph data (e.g. knowledge graphs) and tabular data (e.g. Excel sheets, databases). However, image datasets are increasingly used in industry and have similar pipeline steps. This thesis explores the data cleaning problem, where two of its main steps are outlier detection and subsequent data repair. This work focuses on outliers that result from corruption processes applied to a subset of instances belonging to an original clean dataset. The remaining instances, unaffected by corruption, are called inliers. The outlier detection step finds which data instances have been corrupted. The repair step either replaces the entire instance with a clean version, or imputes the values of specific features in that instance that are deemed corrupted. In both cases, an ideal repair process restores the underlying inlier instance as it was before being corrupted by errors. The main goal is to devise machine learning (ML) models that automate both outlier detection and data repair, with minimal supervision by the end-user. In particular, we focus on solutions based on variational autoencoders (VAEs), because these are flexible generative models capable of providing repairs as samples or reconstructions. Moreover, the reconstructions provided by VAEs also allow for the detection of corrupted feature values, unlike classic outlier detection methods. Since the training dataset is corrupted by outliers, the key to good performance in detection and repair is model robustness to data corruption, which prevents overfitting to errors. 
If the model overfits to errors, then it is difficult to distinguish inliers from outliers, which degrades performance. In this thesis, two novel generative models are proposed for this task, to be used in different contexts. The two most common types of errors are of either random or systematic nature. Random errors corrupt each instance independently using an unknown distribution, exhibiting no clear anomalous pattern across outlier instances. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, exhibiting a clear pattern across outliers. Overall, this means high-capacity models like VAEs more easily overfit to systematic errors, which compromises outlier detection and repair performance. This thesis focuses on point outliers, as they are the most commonly found by practitioners. Point outliers are those that can be identified by evaluating an instance individually, without the context of other instances (e.g. space, time, graphs). The first model proposal devises a novel unsupervised VAE that is robust to random errors for mixed-type (e.g. categorical, continuous) tabular data. This first model is called the Robust Variational Autoencoder (RVAE). We introduce this robustness by designing a decoder architecture that downweighs the contribution of corrupted feature values (cells) during training. Unlike traditional methods, besides indicating which instances are outliers, the novel model indicates which cells have been corrupted, improving model interpretability. It is shown experimentally that the novel model performs better than baselines in cell outlier detection and repair, and is robust to initial hyper-parameter selection. In the second model proposal the focus is on detection and repair in datasets corrupted by systematic errors. This second model is called the Clean Subspace Variational Autoencoder (CLSVAE). The nature of systematic errors makes them easy to learn, and thus easy to overfit to. 
This means that if they are numerous in a dataset, then unsupervised methods will have difficulty distinguishing between inliers and outliers. A novel semi-supervised VAE is proposed that only requires a small labelled set of inliers and outliers, thus minimizing end-user intervention. The main idea is to learn separate latent representations for inliers and systematic errors, and only use the inlier representation for data repair. The novel model is shown to be robust to systematic errors, and it registers state-of-the-art repair in image datasets. Compared to the baselines, the novel model does better in challenging scenarios, where corruption level is higher or the labelled set is very small.
- Published
- 2023
- Full Text
- View/download PDF
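The cell-downweighting idea behind RVAE can be caricatured in one dimension: estimate a column's center while iteratively shrinking the weight of cells with large residuals, so corrupted cells barely influence the fit. This toy uses fixed Cauchy-type weights where the thesis uses a learned VAE decoder; the data and scale are invented:

```python
def robust_cell_mean(values, n_iters=10, scale=1.0):
    """Toy analogue of downweighting corrupted cells: iteratively reweighted
    estimate of a column's center, with weights that shrink as the squared
    residual grows (Cauchy-type weights, an illustrative choice)."""
    center = sum(values) / len(values)
    for _ in range(n_iters):
        weights = [1.0 / (1.0 + ((v - center) / scale) ** 2) for v in values]
        center = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return center

column = [9.8, 10.1, 10.0, 9.9, 10.2, 500.0]   # last cell corrupted
print(round(robust_cell_mean(column), 1))       # -> 10.0, not dragged toward 500
```

The corrupted cell ends up with a near-zero weight, which is the same signal RVAE uses to report which cells (not just which rows) are outliers.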
34. Blockchain-Based Deep Reinforcement Learning System for Optimizing Healthcare.
- Author
-
Ali, Tariq Emad, Ali, Faten Imad, Abdala, Mohammed A., Morad, Ameer Hussein, Gódor, Győző, and Zoltán, Alwahab Dhulfiqar
- Subjects
- *
DEEP reinforcement learning , *REINFORCEMENT learning , *BIOSENSORS , *DATA scrubbing , *MEDICAL appointments , *DEEP learning - Abstract
The Industrial Internet of Things (IIoT) has become a transformative force in various healthcare applications, providing integrated services for daily life. Healthcare applications based on the IIoT framework are broadly used to remotely monitor clients' health using advanced biomedical sensors with wireless technologies, managing activities such as monitoring blood pressure, heart rate, and vital signs. Despite its widespread use, IIoT in healthcare faces challenges such as security concerns, inefficient work scheduling, and associated costs. To address these issues, this paper proposes and evaluates the Blockchain-Based Deep Reinforcement Learning System for Optimizing Healthcare (BDRL) framework. BDRL aims to enhance security protocols and maximize makespan efficiency in scheduling medical applications. It facilitates the sharing of legitimate and secure data among linked network nodes beyond the initial stages of data validation and assignment. This study presents the design, implementation, and statistical evaluation of BDRL using a new dataset and varying platform resources. The evaluation shows that BDRL is versatile and successfully addresses the security, privacy, and makespan needs of healthcare applications on distributed networks, while also delivering excellent performance. However, the framework consumes more resources as the size of inserted data increases. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
35. Data processing to remove outliers and inliers: A systematic literature study.
- Author
-
Alves, Fernando, de Souza, Eduardo G., Sobjak, Ricardo, Bazzi, Claudio L., Hachisuca, Antonio M. M., and Mercante, Erivelto
- Subjects
DATA scrubbing ,PRINCIPAL components analysis ,PRECISION farming ,ELECTRONIC data processing ,ACQUISITION of data - Abstract
Copyright of Revista Brasileira de Engenharia Agricola e Ambiental - Agriambi is the property of Revista Brasileira de Engenharia Agricola e Ambiental and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
- Published
- 2024
- Full Text
- View/download PDF
36. Retinal image preprocessing techniques: Acquisition and cleaning perspective.
- Author
-
Pandey, Anuj Kumar, Singh, Satya Prakash, and Chakraborty, Chinmay
- Abstract
Image preprocessing is a method to transform raw image data into clean image data. The objective of preprocessing is to improve the image data by suppressing undesired distortions. Enhancement of image features relevant for further processing and analysis is also done in preprocessing. Screening and diagnosis of various eye diseases, such as diabetic retinopathy, choroidal neovascularization (CNV), and drusen, are possible using digital retinal images. This paper aims to provide a better understanding and knowledge of the computer algorithms used for retinal image preprocessing. In this paper, various image preprocessing techniques are incorporated, such as color correction, color space selection, noise reduction, and contrast enhancement on retinal images. Retinal blood vessels are better seen in the green color space than in the red or blue color space. Noise reduction through the Block Matching and 3D filtering (BM3D) technique shows a significant result compared to the Total Variation Filter (TVF) and Bilateral Filter (BLF). Contrast enhancement through Contrast Limited Adaptive Histogram Equalization (CLAHE) outperforms Global Equalization (GE) and Adaptive Histogram Equalization (AHE). Evaluation parameters such as mean square error, peak signal-to-noise ratio, structural similarity index measure, and normalized root mean square error for BM3D noise filtering are 0.0029, 25.3370, 0.6839, and 0.0998, respectively, which shows that BM3D outperforms the others. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
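The reported BM3D numbers are internally consistent: for images scaled to [0, 1], PSNR follows from MSE as 10·log10(MAX²/MSE). A minimal check (the [0, 1] scaling is an assumption, which is why the result matches the reported 25.3370 dB only approximately):

```python
import math

def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB from a mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

# The abstract's MSE of 0.0029 corresponds to roughly the reported PSNR:
print(round(psnr(0.0029), 2))  # -> 25.38, close to the reported 25.3370 dB
```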
37. OBMI: oversampling borderline minority instances by a two-stage Tomek link-finding procedure for class imbalance problem.
- Author
-
Leng, Qiangkui, Guo, Jiamei, Tao, Jiaqing, Meng, Xiangfu, and Wang, Changzhong
- Subjects
LEARNING communities ,MACHINE learning ,DATA scrubbing ,MINORITIES - Abstract
Mitigating the impact of class-imbalanced datasets on classifiers poses a challenge to the machine learning community. Conventional classifiers do not perform well, as they are habitually biased toward the majority class. Among existing solutions, the synthetic minority oversampling technique (SMOTE) has shown great potential, aiming to improve the dataset rather than the classifier. However, SMOTE still needs improvement because it oversamples every minority instance equally. Based on the consensus that instances far from the borderline contribute less to classification, a refined method for oversampling borderline minority instances (OBMI) is proposed in this paper using a two-stage Tomek link-finding procedure. In the oversampling stage, the pairs of between-class instances nearest to each other are first found to form Tomek links. Then, the minority instances in these Tomek links are extracted as base instances. Finally, new minority instances are generated, each linearly interpolated between a base instance and one of its minority neighbors. To address the overlap caused by oversampling, in the cleaning stage, Tomek links are employed again to remove borderline instances from both classes. OBMI is compared with ten baseline methods on 17 benchmark datasets. The results show that it performs better on most of the selected datasets in terms of the F1-score and G-mean. Statistical analysis also indicates its higher-level Friedman ranking. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
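A simplified sketch of the two OBMI ingredients: Tomek-link finding (reduced here to a mutual cross-class nearest-neighbour check) and SMOTE-style linear interpolation from a borderline base instance. The 2-D points and the midpoint interpolation factor are invented for illustration:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def tomek_link_minority(majority, minority):
    """Indices of minority instances in Tomek links, simplified to a mutual
    nearest-neighbour check across the two classes (stage 1 of the procedure)."""
    links = []
    for i, m in enumerate(minority):
        nearest_maj = min(range(len(majority)), key=lambda j: euclidean(m, majority[j]))
        # mutual check: is m also the closest minority point to that majority point?
        back = min(range(len(minority)),
                   key=lambda k: euclidean(majority[nearest_maj], minority[k]))
        if back == i:
            links.append(i)
    return links

def interpolate(base, neighbour, alpha=0.5):
    """Generate a synthetic minority instance between a borderline base
    instance and one of its minority neighbours, SMOTE-style."""
    return tuple(b + alpha * (n - b) for b, n in zip(base, neighbour))

majority = [(5.0, 5.0), (6.0, 5.0), (2.1, 2.0)]
minority = [(2.0, 2.0), (0.0, 0.0)]

borderline = tomek_link_minority(majority, minority)
print(borderline)  # -> [0]: only the minority point near the boundary qualifies
new_point = interpolate(minority[borderline[0]], minority[1])
print(new_point)   # -> (1.0, 1.0)
```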
38. A novel hybrid forecasting approach for NOx emission of coal‐fired boiler combined with CEEMDAN and self‐attention improved by LSTM.
- Author
-
Yan, Hua, Chen, Yunchi, Yang, Bin, Yang, Yang, Ni, Hu, and Wang, Ying
- Subjects
- *
STANDARD deviations , *AIR heaters , *DATA scrubbing , *CATALYTIC reduction , *PREDICTION models , *DEEP learning - Abstract
The precise prediction of NOx generation concentration in coal-fired boilers serves as the foundational cornerstone for the judicious optimization and control of selective catalytic reduction (SCR) denitrification systems. Owing to the intricate nature of the denitrification process within SCR, there exists a temporal delay in regulating the ammonia injection rate based on the monitored NOx concentration at the SCR inlet. Such delays can give rise to ammonia leakage and subsequent obstruction of the air preheater. In light of this, a predictive model, CEEMDAN-LSTM-SA, is proposed, an amalgamation of data decomposition and an LSTM (long short-term memory) deep learning network fused with a self-attention mechanism, to forecast the NOx emission concentration at the SCR inlet of coal-fired units. To mitigate the impact of data outliers on training effectiveness, a clustering method coupled with a statistical testing strategy is first applied to refine the dataset. CEEMDAN data decomposition is leveraged to break down the data, alleviating its non-stationary and intricate characteristics. Subsequently, through spectral analysis, the decomposed components are grouped and aggregated into novel data elements, which are then predicted by the constructed LSTM-SA deep learning network. The ultimate NOx emission concentration prediction is derived through fusion. Upon scrutinizing and comparing the predictions of various models on coal-fired power plant data, the performance metrics of CEEMDAN-LSTM-SA exhibit a mean absolute error of 7.425, a mean absolute percentage error of 2.415%, a root mean square error of 9.715, an R-squared (R²) value of 0.789, a mean absolute relative error of 2.109%, and a Theil's information criterion of 0.016. 
In contrast to other models, including traditional self‐attention networks, LSTM, and LSTM‐SA combination networks, CEEMDAN‐LSTM‐SA proposed in this study demonstrates superior prediction accuracy and enhanced generalization capabilities. Consequently, this predictive model stands poised to furnish an efficacious framework for the SCR ammonia injection strategy within thermal power units. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
39. occTest: An integrated approach for quality control of species occurrence data.
- Author
-
Serra‐Diaz, Josep M., Borderieux, Jeremy, Maitner, Brian, Boonman, Coline C. F., Park, Daniel, Guo, Wen‐Yong, Callebaut, Arnaud, Enquist, Brian J., Svenning, Jens‐C., and Merow, Cory
- Subjects
- *
DIGITIZATION , *QUALITY control , *DATA scrubbing , *OUTLIER detection , *SPECIES , *TEST interpretation - Abstract
Aim: Species occurrence data are valuable information that enables one to estimate geographical distributions, characterize niches and their evolution, and guide spatial conservation planning. Rapid increases in species occurrence data stem from increasing digitization and aggregation efforts, and citizen science initiatives. However, persistent quality issues in occurrence data can impact the accuracy of scientific findings, underscoring the importance of filtering erroneous occurrence records in biodiversity analyses. Innovation: We introduce an R package, occTest, that synthesizes a growing open-source ecosystem of biodiversity cleaning workflows to prepare occurrence data for different modelling applications. It offers a structured set of algorithms to identify potential problems with species occurrence records by employing a hierarchical organization of multiple tests. The workflow has a hierarchical structure organized in testPhases (i.e. cleaning vs. testing) that encompass different testBlocks grouping different testTypes (e.g. environmental outlier detection), which may use different testMethods (e.g. Rosner test, jackknife, etc.). Four different testBlocks characterize potential problems in the geographic, environmental, human influence and temporal dimensions. Filtering and plotting functions are incorporated to facilitate the interpretation of tests. We provide examples with different data sources, with default and user-defined parameters. Compared to other available tools and workflows, occTest offers a comprehensive suite of integrated tests, and allows multiple methods associated with each test to explore consensus among data cleaning methods. It uniquely incorporates both coordinate accuracy analysis and environmental analysis of occurrence records. Furthermore, it provides a hierarchical structure into which future tests, yet to be developed, can be incorporated. 
Main conclusions: occTest will help users understand the quality and quantity of data available before the start of data analysis, while also enabling users to filter data using either predefined rules or custom‐built rules. As a result, occTest can better assess each record's appropriateness for its intended application. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
40. A SYSTEMATIC MAPPING REVIEW ON DATA CLEANING METHODS IN BIG DATA ENVIRONMENTS.
- Author
-
Keiji Iwata, Cláudio, Verardi Galegale, Napoleão, Ito, Márcia, Macorin de Azevedo, Marília, Duduchi Feitosa, Marcelo, and Hideo Arima, Carlos
- Subjects
DATA scrubbing ,INFORMATION technology ,DATA mining ,BIG data ,ARTIFICIAL intelligence - Abstract
The evolution of information technology combined with artificial intelligence, IoT (Internet of Things) and robotics has made processes integrated and intelligent. The increased use of technology and the need for evidence-based decisions have contributed to the rapid expansion of a large volume of data in recent years. The quality of data generated mainly by humans must be given special attention, as errors can occur more frequently, making the pre-processing phase, such as data cleaning, a determining factor for better results in data analysis. The aim of this article is therefore to analyze data cleaning methods applied in Big Data environments by conducting a systematic review. The review method was based on the Kitchenham protocol, and the search databases were Scopus, Web of Science and CAPES. After searching and selecting the articles according to the protocol, 69 articles were analyzed, revealing the use of a wide variety of techniques, such as machine learning, data mining, natural language processing and others. The review also emphasized the various publication formats and the wide dissemination and discussion of research on data cleaning in Big Data in the academic community. Finally, this study provides the state of the art of data cleansing techniques that have been used in a Big Data context, offering insights and directions for future research. [ABSTRACT FROM AUTHOR]
- Published
- 2024
41. A Constrained Factor Mixture Model for Detecting Careless Responses that is Simple to Implement.
- Author
-
Kam, Chester Chun Seng and Cheung, Shu Fai
- Subjects
DATA scrubbing ,TEST validity ,DATA management ,RESEARCH personnel ,ACQUISITION of data - Abstract
Using constrained factor mixture models (FMM) for careless response identification is still in its infancy. Existing models have overly restrictive statistical assumptions that do not identify all types of careless respondents. The current paper presents a novel constrained FMM model with more reasonable assumptions that capture both longstring and random careless respondents. We provide a comprehensive comparison of the statistical assumptions between the proposed model and two previous constrained models. The proposed model was evaluated using both real data (N = 1,455) and statistical simulation. The results showed that the model had a superior fit, stronger convergent validity with other indicators of careless responding, more accurate parameter recovery and more accurate identification of careless respondents when compared to its predecessors. The proposed model does not require additional data collection effort, and thus researchers can routinely use it to control careless responses. We provide user-friendly syntax with detailed explanations online to facilitate its use. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
42. A Framework for Cleaning Streaming Data in Healthcare: A Context and User-Supported Approach.
- Author
-
Alotaibi, Obaid, Tomy, Sarath, and Pardede, Eric
- Subjects
GENERATIVE artificial intelligence ,DATA scrubbing ,MISSING data (Statistics) ,DATA quality ,DECISION making - Abstract
Nowadays, ubiquitous technology makes life easier, especially Internet of Things (IoT) devices. IoT devices have been used to generate data in various domains, including healthcare, industry, and education. However, there are often problems with this generated data, such as missing values, duplication, and data errors, which can significantly affect data analysis results and lead to inaccurate decision making. Enhancing the quality of real-time data streams has become a challenging task, and it is crucial for better decisions. In this paper, we propose a framework to improve the quality of a real-time data stream by considering different aspects, including context-awareness. The proposed framework tackles several issues in the data stream, including duplicated data, missing values, and outliers, to improve data quality. The proposed framework also recommends appropriate data cleaning techniques to the user to help improve data quality in real time. Data quality assessment is also included in the proposed framework, giving the user insight into the quality of the data stream for better decisions. We present a prototype to examine the concept of the proposed framework. We use a dataset collected in healthcare and process these data in a case study. The effectiveness of the proposed framework is verified by its ability to detect and repair stream data quality issues in the selected context and to recommend a context and data cleaning techniques to the expert for better decision making in providing healthcare advice to the patient. We evaluate our proposed framework by comparing it against previous works. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
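The cleaning steps the framework addresses (duplicates, missing values, outliers) can be sketched as a minimal streaming pipeline; the window size, z-score threshold, and heart-rate-like readings below are illustrative assumptions, not the paper's design:

```python
from collections import deque

class StreamCleaner:
    """Minimal stream-cleaning sketch: drop exact repeats, impute missing
    values with the last observation, and flag outliers against a sliding
    window using a z-score. Thresholds are illustrative only."""

    def __init__(self, window=20, z_threshold=3.0):
        self.window = deque(maxlen=window)
        self.z_threshold = z_threshold
        self.last = None

    def process(self, reading):
        if reading == self.last:            # duplicate reading: drop it
            return None
        if reading is None:                 # missing: impute last observation
            reading = self.last
            if reading is None:
                return None
        status = "ok"
        if len(self.window) >= 5:           # need a few points before testing
            mean = sum(self.window) / len(self.window)
            var = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = var ** 0.5
            if std > 0 and abs(reading - mean) / std > self.z_threshold:
                status = "outlier"
        if status == "ok":                  # outliers do not pollute the window
            self.window.append(reading)
        self.last = reading
        return reading, status

cleaner = StreamCleaner()
stream = [72, 73, 73, None, 74, 75, 74, 73, 190, 74]  # invented sensor readings
for r in stream:
    out = cleaner.process(r)
    if out:
        print(out)  # the 190 spike is flagged as an outlier
```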
43. Explainable artificial intelligence and machine learning algorithms for classification of thyroid disease.
- Author
-
Kumari, Priyanka, Kaur, Baljinder, Rakhra, Manik, Deka, Aniruddha, Byeon, Haewon, Asenso, Evans, and Rawat, Anil Kumar
- Abstract
A common endocrine issue affecting millions globally is thyroid illness. For this ailment to be effectively treated and managed, an early and accurate diagnosis is essential. Machine learning algorithms have attracted a lot of attention recently in the healthcare industry and have the potential to improve thyroid disease diagnosis and categorization. The implementation of machine learning methods for the classification of thyroid disease is presented in this study. To create predictive models, the study makes use of a dataset that includes a variety of thyroid-related factors, including age, gender, and hormone levels. Several machine learning techniques, including random forest, support vector machines, XGBoost, and an ensemble classifier, are implemented and compared to evaluate their effectiveness in classifying thyroid diseases. To ensure robust model performance, the methodology includes data preparation, feature selection, and model training, as well as strategies for hyperparameter adjustment and cross-validation. To assess the algorithms' efficiency in differentiating between several thyroid illness classifications, such as hyperthyroidism and hypothyroidism, the study measures the algorithms' accuracy, precision, recall, F1-score, voting, and area under the ROC curve. Highlights: Machine learning models have proven to be an important tool for disease diagnosis and classification. In this research, various machine learning models are implemented for thyroid disease classification, and an analysis across models is done to choose the best model for classification. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
44. An Automatic Sample Data Cleaning Method Based on Weight Iteration.
- Author
-
XIA Wang, XU Shixuan, and TONG Siqi
- Subjects
DEEP learning ,ARTIFICIAL neural networks ,DATA scrubbing ,DATABASES ,DIGITAL maps ,FEATURE extraction - Abstract
Based on digital line graphics and true digital orthophoto maps, a large amount of sample data that meets the requirements of deep learning can be automatically generated. However, this data often contains erroneous information, which increases the difficulty of training neural network models and limits the improvement of ground feature extraction accuracy. A sample data automatic cleaning method based on selection weight iteration is proposed to address this issue. First, a deep neural network model for data cleaning is constructed, and a network training method based on selection weight iteration is proposed. The method breaks the assumption that all samples carry the same weight in the loss function during network training. The prediction accuracy of each sample during training of the data cleaning network is used as its weight, and the sample weights are continuously updated through iterative training. Finally, samples with low weights are eliminated to achieve automatic data cleaning and sample database refinement. Training and accuracy comparison experiments were conducted on five classic semantic segmentation network models using the sample database before and after data cleaning. The results show that the model trained on the cleaned sample database improves the average accuracy of building extraction by 2.36%, road extraction by 3.48%, and water extraction by 1.88%. This experiment proves that the proposed data cleaning method can effectively improve the accuracy of network models in extracting ground features. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
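The selection-weight iteration can be caricatured with a toy 1-D "model" (weighted class centroids) standing in for the deep segmentation network: each sample's prediction accuracy feeds back as its weight, and low-weight (likely mislabeled) samples are dropped at the end. All data, the centroid model, and the thresholds are invented for illustration:

```python
def weight_iteration_clean(samples, labels, n_iters=5, keep_threshold=0.5):
    """Toy analogue of selection-weight iteration: repeatedly fit a weighted
    model, score each sample by whether it is predicted correctly, blend that
    score into its weight, then drop low-weight samples."""
    weights = [1.0] * len(samples)
    for _ in range(n_iters):
        # weighted 1-D class centroids stand in for "training the network"
        def centroid(cls):
            num = sum(w * x for w, x, y in zip(weights, samples, labels) if y == cls)
            den = sum(w for w, y in zip(weights, labels) if y == cls)
            return num / den
        c0, c1 = centroid(0), centroid(1)
        preds = [0 if abs(x - c0) < abs(x - c1) else 1 for x in samples]
        # a sample's new weight blends in its current prediction accuracy
        weights = [0.5 * w + 0.5 * (1.0 if p == y else 0.0)
                   for w, p, y in zip(weights, preds, labels)]
    return [i for i, w in enumerate(weights) if w >= keep_threshold]

xs     = [0.1, 0.2, 0.3, 0.9, 1.0, 1.1, 0.15]
labels = [0,   0,   0,   1,   1,   1,   1]      # the last label is wrong
print(weight_iteration_clean(xs, labels))        # -> [0, 1, 2, 3, 4, 5]: index 6 dropped
```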
45. End-To-End Machine Learning Workflow on Chronic Kidney Disease Dataset
- Author
-
Adithya, N. N. S. S. S., Sah, P. VaniShree, Lim, Meng-Hiot, Series Editor, Saha, Apu Kumar, editor, Sharma, Harish, editor, and Prasad, Mukesh, editor
- Published
- 2024
- Full Text
- View/download PDF
46. Preprocessing of Agricultural and Natural Resource Data
- Author
-
Raval, Mehul S., Chaudhary, Sanjay, Kacprzyk, Janusz, Series Editor, Raval, Mehul S., editor, Chaudhary, Sanjay, editor, Adinarayana, J., editor, and Guo, Wei, editor
- Published
- 2024
- Full Text
- View/download PDF
47. A Survey on Data Preprocessing Techniques in Stream Mining
- Author
-
Jajoo, Vranda, Tanwani, Sanjay, Kacprzyk, Janusz, Series Editor, Gomide, Fernando, Advisory Editor, Kaynak, Okyay, Advisory Editor, Liu, Derong, Advisory Editor, Pedrycz, Witold, Advisory Editor, Polycarpou, Marios M., Advisory Editor, Rudas, Imre J., Advisory Editor, Wang, Jun, Advisory Editor, Choudrie, Jyoti, editor, Mahalle, Parikshit N., editor, Perumal, Thinagaran, editor, and Joshi, Amit, editor
- Published
- 2024
- Full Text
- View/download PDF
48. JUVDATA: Data Visualization of Juvenile Crime in Malaysia
- Author
-
Daud, Nur’Aina, Samsuri, Nur Amalina, Hussein, Surya Sumarni, Wan Yaacob, Wan Fairos, editor, Wah, Yap Bee, editor, and Mehmood, Obaid Ullah, editor
- Published
- 2024
- Full Text
- View/download PDF
49. Optimal Update Repair with Maximum Likelihood and Minimum Cost
- Author
-
Li, Wenyu, Zhang, Anzhen, Zong, Chuanyu, Zhu, Rui, Qiu, Tao, Goos, Gerhard, Series Editor, Hartmanis, Juris, Founding Editor, Bertino, Elisa, Editorial Board Member, Gao, Wen, Editorial Board Member, Steffen, Bernhard, Editorial Board Member, Yung, Moti, Editorial Board Member, Onizuka, Makoto, editor, Lee, Jae-Gil, editor, Tong, Yongxin, editor, Xiao, Chuan, editor, Ishikawa, Yoshiharu, editor, Amer-Yahia, Sihem, editor, Jagadish, H. V., editor, and Lu, Kejing, editor
- Published
- 2024
- Full Text
- View/download PDF
50. Performance Analysis of Marketing Campaign with Customer Profile Using Machine Learning
- Author
-
Devaraju, S., Jawahar, S., Manimaran, M., Somasundaram, A., Thenmozhi, M., Bansal, Jagdish Chand, Series Editor, Kim, Joong Hoon, Series Editor, Nagar, Atulya K., Series Editor, Mandal, Jyotsna Kumar, editor, Hinchey, Mike, editor, and Chakrabarti, Satyajit, editor
- Published
- 2024
- Full Text
- View/download PDF