1,030 results on '"data synthesis"'
Search Results
2. User-perceptional privacy protection in NILM: A differential privacy approach
- Author
-
Zhang, Jiahao, Lu, Chenbei, Yi, Hongyu, and Wu, Chenye
- Published
- 2025
3. A novel robust data synthesis method based on feature subspace interpolation to optimize samples with unknown noise
- Author
-
Du, Yukun, Cai, Yitao, Jin, Xiao, Yu, Haiyue, Lou, Zhilong, Li, Yao, Jiang, Jiang, and Wang, Yongxiong
- Published
- 2025
4. Comparative Analysis of Systematic, Scoping, Umbrella, and Narrative Reviews in Clinical Research: Critical Considerations and Future Directions.
- Author
-
Motevalli, Mohamad and Xie, Zhongqiu
- Abstract
Review studies play a key role in the development of clinical practice by synthesizing data and drawing conclusions from multiple scientific sources. In recent decades, there has been a significant increase in the number of review studies conducted and published by researchers. In clinical research, different types of review studies (systematic, scoping, umbrella, and narrative reviews) are conducted with different objectives and methodologies. Despite the abundance of guidelines for conducting review studies, researchers often face challenges in selecting the most appropriate review method, mainly due to their overlapping characteristics, including the complexity of matching review types to specific research questions. The aim of this article is to compare the main features of systematic, scoping, umbrella, and narrative reviews in clinical research and to address key considerations for selecting the most appropriate review approach. It also discusses future opportunities for updating their strategies based on emerging trends in clinical research. Understanding the differences between review approaches will help researchers, practitioners, journalists, and policymakers to effectively navigate the complex and evolving field of scientific research, leading to informed decisions that ultimately enhance the overall quality of healthcare practices. [ABSTRACT FROM AUTHOR]
- Published
- 2025
5. Data-Centric Benchmarking of Neural Network Architectures for the Univariate Time Series Forecasting Task
- Author
-
Philipp Schlieper, Mischa Dombrowski, An Nguyen, Dario Zanca, and Bjoern Eskofier
- Subjects
deep learning, time series, neural networks, model selection, data synthesis, univariate forecasting, Science (General), Q1-390, Mathematics, QA1-939
- Abstract
Time series forecasting has witnessed a rapid proliferation of novel neural network approaches in recent times. However, performances in terms of benchmarking results are generally not consistent, and it is complicated to determine in which cases one approach fits better than another. Therefore, we propose adopting a data-centric perspective for benchmarking neural network architectures on time series forecasting by generating ad hoc synthetic datasets. In particular, we combine sinusoidal functions to synthesize univariate time series data for multi-input-multi-output prediction tasks. We compare the most popular architectures for time series, namely long short-term memory (LSTM) networks, convolutional neural networks (CNNs), and transformers, and directly connect their performance with different controlled data characteristics, such as the sequence length, noise and frequency, and delay length. Our findings suggest that transformers are the best architecture for dealing with different delay lengths. In contrast, for different noise and frequency levels and different sequence lengths, LSTM is the best-performing architecture by a significant amount. Based on our insights, we derive recommendations which allow machine learning (ML) practitioners to decide which architecture to apply, given the dataset’s characteristics.
- Published
- 2024
6. Data-Centric Benchmarking of Neural Network Architectures for the Univariate Time Series Forecasting Task.
- Author
-
Schlieper, Philipp, Dombrowski, Mischa, Nguyen, An, Zanca, Dario, and Eskofier, Bjoern
- Subjects
CONVOLUTIONAL neural networks, DEEP learning, TIME series analysis, TRANSFORMER models, MACHINE learning
- Abstract
Time series forecasting has witnessed a rapid proliferation of novel neural network approaches in recent times. However, performances in terms of benchmarking results are generally not consistent, and it is complicated to determine in which cases one approach fits better than another. Therefore, we propose adopting a data-centric perspective for benchmarking neural network architectures on time series forecasting by generating ad hoc synthetic datasets. In particular, we combine sinusoidal functions to synthesize univariate time series data for multi-input-multi-output prediction tasks. We compare the most popular architectures for time series, namely long short-term memory (LSTM) networks, convolutional neural networks (CNNs), and transformers, and directly connect their performance with different controlled data characteristics, such as the sequence length, noise and frequency, and delay length. Our findings suggest that transformers are the best architecture for dealing with different delay lengths. In contrast, for different noise and frequency levels and different sequence lengths, LSTM is the best-performing architecture by a significant amount. Based on our insights, we derive recommendations which allow machine learning (ML) practitioners to decide which architecture to apply, given the dataset's characteristics. [ABSTRACT FROM AUTHOR]
- Published
- 2024
7. Using machine learning to reveal seasonal nutrient dynamics and their impact on chlorophyll-a levels in lake ecosystems: A focus on nitrogen and phosphorus
- Author
-
Yong Fang, Ruting Huang, and Xianyang Shi
- Subjects
Eutrophication, Nutrients, Machine learning, Data synthesis, Seasons, Ecology, QH540-549.5
- Abstract
Chlorophyll-a (Chl-a) is a pivotal indicator of lake eutrophication. Studies examining nutrients limiting lake eutrophication at large scales have traditionally focused on summer and autumn, potentially limiting the applicability of their findings. This study encompasses 86 state-controlled points in the Eastern China Basin, spanning data collected from January 2020 to July 2023. Furthermore, we focus on the application of three machine-learning models (i.e., eXtreme Gradient Boosting, Support Vector Machines, and Naive Bayes Classifier) to analyze the seasonal nutrient dynamics in lake ecosystems. We categorized the monitoring data by season to eliminate outliers and employed adaptive synthetic sampling to address data imbalance issues. The results reveal that the direct correlations between total nitrogen (TN), total phosphorus (TP), and TP in conjunction with turbidity and Chl-a are broadly weak, possibly because of geographic variations, nutrient lag effects on algae, and differences in algal community composition. However, probabilistic analyses revealed that as TP or TN levels transitioned from oligo-mesotrophic (O) to eutrophic (E), TP exhibited a greater influence on the variation in Chl-a status than TN during spring and winter (p
- Published
- 2024
8. Paired Synoptic and Long-Term Monitoring Datasets Reveal Decadal Shifts in Suspended Sediment Supply and Particulate Organic Matter Sources in a River-Estuarine System
- Author
-
Richardson, CM, Young, M, and Paytan, A
- Subjects
Earth Sciences, Physical Geography and Environmental Geoscience, Atmospheric Sciences, Environmental Sciences, Estuary, Detritus, Rivers, Environmental change, Carbon, Data synthesis, Biological Sciences, Marine Biology & Hydrobiology, Biological sciences, Earth sciences, Environmental sciences
- Abstract
The San Francisco Estuary, in central California, has several long-running monitoring programs that have been used to reveal human-induced changes throughout the estuary in the last century. Here, we pair synoptic records of particulate organic matter (POM) composition from 1990–1996 and 2007–2016 with more robust long-term monitoring program records of total suspended sediment (TSS) concentrations generally starting in the mid-1970s to better understand how POM and TSS sources and transport have shifted. Specifically, POM C:N ratios and stable isotope values were used as indicators of POM source and to separate the bulk POC pool into detrital and phytoplankton components. We found that TSS and POC sources have shifted significantly across the estuary in time and space from declines in terrestrial inputs. Landward freshwater and brackish water sites, in the Delta and near Suisun Bay, witnessed long-term declines in TSS (32 to 52%), while seaward sites, near San Pablo Bay, recorded recent increases in TSS (16 to 121%) that began to trend downwards at the end of the record considered. Bulk POM C:N ratios shifted coeval with the TSS concentration changes at nearly all sites, with mean declines of 12 to 27% between 1990–1996 and 2007–2016. The widespread declines in bulk POM C:N ratios and inferred changes in POC concentrations from TSS trends, along with the substantial declines in upstream TSS supply through time (56%), suggest measurable reductions in terrestrial inputs to the system. Changes in terrestrial TSS and POM inputs have implications for biotic (e.g., food web dynamics) and abiotic organic matter cycling (e.g., burial, export) along the estuarine continuum. This work demonstrates how human-generated environmental changes can propagate spatially and temporally through a large river-estuary system. More broadly, we show how underutilized monitoring program datasets can be paired with existing (and often imperfect) synoptic records to generate new system insight in lieu of new data collection.
- Published
- 2023
9. Advanced integration of 2DCNN-GRU model for accurate identification of shockable life-threatening cardiac arrhythmias: a deep learning approach.
- Author
-
Ba Mahel, Abduljabbar S., Shenghong Cao, Kaixuan Zhang, Chelloug, Samia Allaoua, Alnashwan, Rana, and Ali Muthanna, Mohammed Saleh
- Subjects
ARRHYTHMIA, DEEP learning, VENTRICULAR tachycardia, VENTRICULAR fibrillation, CARDIAC patients, CARDIOVASCULAR diseases
- Abstract
Cardiovascular diseases remain one of the main threats to human health, significantly affecting the quality and life expectancy. Effective and prompt recognition of these diseases is crucial. This research aims to develop an effective novel hybrid method for automatically detecting dangerous arrhythmias based on cardiac patients’ short electrocardiogram (ECG) fragments. This study suggests using a continuous wavelet transform (CWT) to convert ECG signals into images (scalograms) and examining the task of categorizing short 2-s segments of ECG signals into four groups of dangerous arrhythmias that are shockable, including ventricular flutter (C1), ventricular fibrillation (C2), ventricular tachycardia torsade de pointes (C3), and high-rate ventricular tachycardia (C4). We propose developing a novel hybrid neural network with a deep learning architecture to classify dangerous arrhythmias. This work utilizes actual electrocardiogram (ECG) data obtained from the PhysioNet database, alongside artificially generated ECG data produced by the Synthetic Minority Over-sampling Technique (SMOTE) approach, to address the issue of imbalanced class distribution for obtaining an accuracy-trained model. Experimental results demonstrate that the proposed approach achieves high accuracy, sensitivity, specificity, precision, and an F1-score of 97.75%, 97.75%, 99.25%, 97.75%, and 97.75%, respectively, in classifying all the four shockable classes of arrhythmias and is superior to traditional methods. Our work possesses significant clinical value in real-life scenarios since it has the potential to significantly enhance the diagnosis and treatment of life-threatening arrhythmias in individuals with cardiac disease. Furthermore, our model also has demonstrated adaptability and generality for two other datasets. [ABSTRACT FROM AUTHOR]
- Published
- 2024
10. Efficient Wheat Head Segmentation with Minimal Annotation: A Generative Approach.
- Author
-
Myers, Jaden, Najafian, Keyhan, Maleki, Farhad, and Ovens, Katie
- Subjects
GENERATIVE adversarial networks, SUPERVISED learning, IMAGE processing, WHEAT
- Abstract
Deep learning models have been used for a variety of image processing tasks. However, most of these models are developed through supervised learning approaches, which rely heavily on the availability of large-scale annotated datasets. Developing such datasets is tedious and expensive. In the absence of an annotated dataset, synthetic data can be used for model development; however, due to the substantial differences between simulated and real data, a phenomenon referred to as domain gap, the resulting models often underperform when applied to real data. In this research, we aim to address this challenge by first computationally simulating a large-scale annotated dataset and then using a generative adversarial network (GAN) to fill the gap between simulated and real images. This approach results in a synthetic dataset that can be effectively utilized to train a deep-learning model. Using this approach, we developed a realistic annotated synthetic dataset for wheat head segmentation. This dataset was then used to develop a deep-learning model for semantic segmentation. The resulting model achieved a Dice score of 83.4% on an internal dataset and Dice scores of 79.6% and 83.6% on two external datasets from the Global Wheat Head Detection datasets. While we proposed this approach in the context of wheat head segmentation, it can be generalized to other crop types or, more broadly, to images with dense, repeated patterns such as those found in cellular imagery. [ABSTRACT FROM AUTHOR]
- Published
- 2024
11. Enhanced Pet Behavior Prediction via S2GAN-Based Heterogeneous Data Synthesis.
- Author
-
Kim, Jinah and Moon, Nammee
- Subjects
GENERATIVE adversarial networks, PREDICTION models
- Abstract
Heterogeneous data have been used to enhance behavior prediction performance; however, they involve issues such as missing data, which need to be addressed. This paper proposes enhanced pet behavior prediction via Sensor to Skeleton Generative Adversarial Networks (S2GAN)-based heterogeneous data synthesis. The S2GAN model synthesizes the key features of video skeletons based on collected nine-axis sensor data and replaces missing data, thereby enhancing the accuracy of behavior prediction. In this study, data collected from 10 pets in a real-life-like environment were used to conduct recognition experiments on 9 commonly occurring types of indoor behavior. Experimental results confirmed that the proposed S2GAN-based synthesis method effectively resolves possible missing data issues in real environments and significantly improves the performance of the pet behavior prediction model. Additionally, by utilizing data collected under conditions similar to the real environment, the method enables more accurate and reliable behavior prediction. This research demonstrates the importance and utility of synthesizing heterogeneous data in behavior prediction, laying the groundwork for applications in various fields such as abnormal behavior detection and monitoring. [ABSTRACT FROM AUTHOR]
- Published
- 2024
12. Synthesis methods used to combine observational studies and randomised trials in published meta-analyses
- Author
-
Cherifa Cheurfa, Sofia Tsokani, Katerina-Maria Kontouli, Isabelle Boutron, and Anna Chaimani
- Subjects
Data synthesis, Non-randomised studies, Comparative effectiveness, heterogeneous designs, Medicine
- Abstract
Background: This study examined the synthesis methods used in meta-analyses pooling data from observational studies (OSs) and randomised controlled trials (RCTs) from various medical disciplines. Methods: We searched Medline via PubMed to identify reports of systematic reviews of interventions, including and pooling data from RCTs and OSs published in 110 high-impact factor general and specialised journals between 2015 and 2019. Screening and data extraction were performed in duplicate. To describe the synthesis methods used in the meta-analyses, we considered the first meta-analysis presented in each article. Results: Overall, 132 reports were identified with a median number of included studies of 14 [9–26]. The median number of OSs was 6.5 [3–12] and that of RCTs was 3 [1–6]. The effect estimates recorded from OSs (i.e., adjusted or unadjusted) were not specified in 82% (n = 108) of the meta-analyses. An inverse-variance common-effect model was used in 2% (n = 3) of the meta-analyses, a random-effects model was used in 55% (n = 73), and both models were used in 40% (n = 53). A Poisson regression model was used in 1 meta-analysis, and 2 meta-analyses did not report the model they used. The mean total weight of OSs in the studied meta-analyses was 57.3% (standard deviation, ± 30.3%). Only 44 (33%) meta-analyses reported results stratified by study design. Of them, the results between OSs and RCTs had a consistent direction of effect in 70% (n = 31). Study design was explored as a potential source of heterogeneity in 79% of the meta-analyses, and confounding factors were investigated in only 10% (n = 13). Publication bias was assessed in 70% (n = 92) of the meta-analyses. Tau-square was reported in 32 meta-analyses with a median of 0.07 [0–0.30]. Conclusion: The inclusion of OSs in a meta-analysis on interventions could provide useful information. However, considerations of several methodological and conceptual aspects of OSs, that are required to avoid misleading findings, were often absent or insufficiently reported in our sample.
- Published
- 2024
13. Synthesis methods used to combine observational studies and randomised trials in published meta-analyses
- Author
-
Cheurfa, Cherifa, Tsokani, Sofia, Kontouli, Katerina-Maria, Boutron, Isabelle, and Chaimani, Anna
- Published
- 2024
14. Generative Models for Synthetic Urban Mobility Data: A Systematic Literature Review.
- Author
-
KAPP, ALEXANDRA, HANSMEYER, JULIA, and MIHALJEVIĆ, HELENA
- Subjects
ARTIFICIAL neural networks, MACHINE learning, ARTIFICIAL intelligence, DATA mining, ALGORITHMIC bias, DEEP learning, TRAFFIC violations, TAXICABS
- Published
- 2024
15. Research on the Simulation Method of HTTP Traffic Based on GAN.
- Author
-
Yang, Chenglin, Xu, Dongliang, and Ma, Xiao
- Subjects
COMPUTER network traffic, GENERATIVE adversarial networks, TRANSFORMER models, GAUSSIAN mixture models, HTTP (Computer network protocol), EVOLUTIONARY algorithms
- Abstract
Due to the increasing severity of network security issues, training corresponding detection models requires large datasets. In this work, we propose a novel method based on generative adversarial networks to synthesize network data traffic. We introduced a network traffic data normalization method based on Gaussian mixture models (GMM), and for the first time, incorporated a generator based on the Swin Transformer structure into the field of network traffic generation. To further enhance the robustness of the model, we mapped real data through an AE (autoencoder) module and optimized the training results in the form of evolutionary algorithms. We validated the training results on four different datasets and introduced four additional models for comparative experiments in the experimental evaluation section. Our proposed SEGAN outperformed other state-of-the-art network traffic emulation methods. [ABSTRACT FROM AUTHOR]
- Published
- 2024
16. Advanced integration of 2DCNN-GRU model for accurate identification of shockable life-threatening cardiac arrhythmias: a deep learning approach
- Author
-
Abduljabbar S. Ba Mahel, Shenghong Cao, Kaixuan Zhang, Samia Allaoua Chelloug, Rana Alnashwan, and Mohammed Saleh Ali Muthanna
- Subjects
dangerous arrhythmias, recognition, deep learning networks, data synthesis, scalogram, Physiology, QP1-981
- Abstract
Cardiovascular diseases remain one of the main threats to human health, significantly affecting the quality and life expectancy. Effective and prompt recognition of these diseases is crucial. This research aims to develop an effective novel hybrid method for automatically detecting dangerous arrhythmias based on cardiac patients’ short electrocardiogram (ECG) fragments. This study suggests using a continuous wavelet transform (CWT) to convert ECG signals into images (scalograms) and examining the task of categorizing short 2-s segments of ECG signals into four groups of dangerous arrhythmias that are shockable, including ventricular flutter (C1), ventricular fibrillation (C2), ventricular tachycardia torsade de pointes (C3), and high-rate ventricular tachycardia (C4). We propose developing a novel hybrid neural network with a deep learning architecture to classify dangerous arrhythmias. This work utilizes actual electrocardiogram (ECG) data obtained from the PhysioNet database, alongside artificially generated ECG data produced by the Synthetic Minority Over-sampling Technique (SMOTE) approach, to address the issue of imbalanced class distribution for obtaining an accuracy-trained model. Experimental results demonstrate that the proposed approach achieves high accuracy, sensitivity, specificity, precision, and an F1-score of 97.75%, 97.75%, 99.25%, 97.75%, and 97.75%, respectively, in classifying all the four shockable classes of arrhythmias and is superior to traditional methods. Our work possesses significant clinical value in real-life scenarios since it has the potential to significantly enhance the diagnosis and treatment of life-threatening arrhythmias in individuals with cardiac disease. Furthermore, our model also has demonstrated adaptability and generality for two other datasets.
- Published
- 2024
17. A systematic review of deep learning data augmentation in medical imaging: Recent advances and future research directions
- Author
-
Tauhidul Islam, Md. Sadman Hafiz, Jamin Rahman Jim, Md. Mohsin Kabir, and M.F. Mridha
- Subjects
Deep learning, Data augmentation, Image transformation, Medical imaging augmentation, Data synthesis, Systematic review, Computer applications to medicine. Medical informatics, R858-859.7
- Abstract
Data augmentation involves artificially expanding a dataset by applying various transformations to the existing data. Recent developments in deep learning have advanced data augmentation, enabling more complex transformations. Especially vital in the medical domain, deep learning-based data augmentation improves model robustness by generating realistic variations in medical images, enhancing diagnostic and predictive task performance. Therefore, to assist researchers and experts in their pursuits, there is a need for an extensive and informative study that covers the latest advancements in the growing domain of deep learning-based data augmentation in medical imaging. There is a gap in the literature regarding recent advancements in deep learning-based data augmentation. This study explores the diverse applications of data augmentation in medical imaging and analyzes recent research in these areas to address this gap. The study also explores popular datasets and evaluation metrics to improve understanding. Subsequently, the study provides a short discussion of conventional data augmentation techniques along with a detailed discussion on applying deep learning algorithms in data augmentation. The study further analyzes the results and experimental details from recent state-of-the-art research to understand the advancements and progress of deep learning-based data augmentation in medical imaging. Finally, the study discusses various challenges and proposes future research directions to address these concerns. This systematic review offers a thorough overview of deep learning-based data augmentation in medical imaging, covering application domains, models, results analysis, challenges, and research directions. It provides a valuable resource for multidisciplinary studies and researchers making decisions based on recent analytics.
- Published
- 2024
18. Theseus Data Synthesis Approach: A Privacy-Preserving Online Data Sharing Service
- Author
-
Yi-Jun Tang and Po-Wen Chi
- Subjects
Data anonymization, data synthesis, privacy-preserving data sharing, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
With the vigorous development of cloud computing services, it has become easier and more convenient for organizations and enterprises to open data on clouds. However, as personal information in electronic data becomes more massive and detailed, how to balance data opening and personal privacy has become a critical issue. In this paper, we propose the Theseus Data Synthesis Approach (TDSA), which generates synthetic data by replacing partial records until no record from the original dataset remains. Other data anonymization approaches, such as k-anonymity and differential privacy, encounter limitations and challenges when applied to real-world scenarios. In our work, since no real data remain, personal privacy is preserved. We also analyze the quality and utility of the synthetic dataset and make comparisons with related works. We conclude that with our scheme, opening useful data on clouds and preserving personal privacy can be simultaneously achieved.
- Published
- 2024
19. Feature Distribution-Based Medical Data Augmentation: Enhancing Mood Disorder Classification
- Author
-
Joo Hun Yoo, Ji Hyun An, and Tai-Myoung Chung
- Subjects
Data augmentation, data synthesis, deep neural networks, mood disorder classification, multimodal analysis, Electrical engineering. Electronics. Nuclear engineering, TK1-9971
- Abstract
Classification models using deep or machine learning algorithms require a sufficient and balanced training dataset to improve performance. Still, they suffer from data collection due to data privacy issues. In medical research, where most data variables are sensitive information, collecting enough training data for model performance improvement is more challenging. This study presents a new medical data augmentation algorithm consisting of four steps to solve the data shortage and class imbalance issues. The main idea of the proposed algorithm is to reflect the core characteristic of the original data’s class label. The algorithm receives an original dataset as an input value to extract the feature vector and trains the individual autoencoder model. Then it verifies the augmented feature vector through a distributional equality check, and each feature vector is concatenated into one feature vector. The deep learning model inference is applied on a concatenated vector for the second verification, to finalize the augmented training dataset. Our team performed mood disorder classification using patient data to prove the presented data augmentation algorithm. With the method, the classification performance improved by 0.059 in the severity classification of major depressive disorder, 0.041 in the severity classification of anxiety disorder, and 0.073 in the subtype classification of bipolar disorder. Through this study, we proved that our algorithm can be applied to minimize model bias and improve classification performance on the medical data that are unbalanced or insufficient in number by class.
- Published
- 2024
20. Crack modeling via minimum-weight surfaces in 3d Voronoi diagrams
- Author
-
Christian Jung and Claudia Redenbach
- Subjects
Fracture modeling, Tessellations, Data synthesis, 3d image processing, Adaptive dilation, Mathematics, QA1-939, Industry, HD2321-4730.9
- Abstract
As the number one building material, concrete is of fundamental importance in civil engineering. Understanding its failure mechanisms is essential for designing sustainable buildings and infrastructure. Micro-computed tomography (μCT) is a well-established tool for virtually assessing crack initiation and propagation in concrete. The reconstructed 3d images can be examined via techniques from the fields of classical image processing and machine learning. Ground truths are a prerequisite for an objective evaluation of crack segmentation methods. Furthermore, they are necessary for training machine learning models. However, manual annotation of large 3d concrete images is not feasible. To tackle the problem of data scarcity, the image pairs of cracked concrete and corresponding ground truth can be synthesized. In this work we propose a novel approach to stochastically model crack structures via Voronoi diagrams. The method is based on minimum-weight surfaces, an extension of shortest paths to 3d. Within a dedicated image processing pipeline, the surfaces are then discretized and embedded into real μCT images of concrete. The method is flexible and fast, such that a variety of different crack structures can be generated in a short amount of time.
- Published
- 2023
21. Unsupervised GAN epoch selection for biomedical data synthesis
- Author
-
Böhland Moritz, Bruch Roman, Löffler Katharina, and Reischl Markus
- Subjects
generative adversarial network, data synthesis, segmentation, computer vision, Medicine
- Abstract
Supervised Neural Networks are used for segmentation in many biological and biomedical applications. To omit the time-consuming and tiring process of manual labeling, unsupervised Generative Adversarial Networks (GANs) can be used to synthesize labeled data. However, the training of GANs requires extensive computation and is often unstable. Due to the lack of established stopping criteria, GANs are usually trained multiple times for a heuristically fixed number of epochs. Early stopping and epoch selection can lead to better synthetic datasets resulting in higher downstream segmentation quality on biological or medical data. This article examines whether the Frechet Inception Distance (FID), the Kernel Inception Distance (KID), or the WeightWatcher tool can be used for early stopping or epoch selection of unsupervised GANs. The experiments show that the last trained GAN epoch is not necessarily the best one to synthesize downstream segmentation data. On complex datasets, FID and KID correlate with the downstream segmentation quality, and both can be used for epoch selection.
- Published
- 2023
22. Deep learning based classification of sheep behaviour from accelerometer data with imbalance
- Author
-
Kirk E. Turner, Andrew Thompson, Ian Harris, Mark Ferguson, and Ferdous Sohel
- Subjects
Sheep behaviour classification, Data synthesis, Class imbalance, Grazing sheep, Agriculture (General), S1-972, Information technology, T58.5-58.64
- Abstract
Classification of sheep behaviour from a sequence of tri-axial accelerometer data has the potential to enhance sheep management. Sheep behaviour is inherently imbalanced (e.g., more ruminating than walking) resulting in underperforming classification for the minority activities which hold importance. Existing works have not addressed class imbalance and use traditional machine learning techniques, e.g., Random Forest (RF). We investigated Deep Learning (DL) models, namely, Long Short Term Memory (LSTM) and Bidirectional LSTM (BLSTM), appropriate for sequential data, from imbalanced data. Two data sets were collected in normal grazing conditions using jaw-mounted and ear-mounted sensors. Novel to this study, alongside typical single classes, e.g., walking, depending on the behaviours, data samples were labelled with compound classes, e.g., walking_grazing. The number of steps a sheep performed in the observed 10 s time window was also recorded and incorporated in the models. We designed several multi-class classification studies with imbalance being addressed using synthetic data. DL models achieved superior performance to traditional ML models, especially with augmented data (e.g., 4-Class + Steps: LSTM 88.0%, RF 82.5%). DL methods showed superior generalisability on unseen sheep (i.e., F1-score: BLSTM 0.84, LSTM 0.83, RF 0.65). LSTM, BLSTM and RF achieved sub-millisecond average inference time, making them suitable for real-time applications. The results demonstrate the effectiveness of DL models for sheep behaviour classification in grazing conditions. The results also demonstrate the DL techniques can generalise across different sheep. The study presents a strong foundation of the development of such models for real-time animal monitoring.
- Published
- 2023
- Full Text
- View/download PDF
23. Ad-RuLer: A Novel Rule-Driven Data Synthesis Technique for Imbalanced Classification.
- Author
-
Zhang, Xiao, Paz, Iván, Nebot, Àngela, Mugica, Francisco, and Romero, Enrique
- Subjects
MACHINE learning ,RANDOM forest algorithms ,MACHINE performance ,LOGISTIC regression analysis ,CLASSIFICATION - Abstract
When classifiers face imbalanced class distributions, they often misclassify minority class samples, consequently diminishing the predictive performance of machine learning models. Existing oversampling techniques predominantly rely on the selection of neighboring data via interpolation, with less emphasis on uncovering the intrinsic patterns and relationships within the data. In this research, we present the usefulness of an algorithm named RuLer to deal with the problem of classification with imbalanced data. RuLer is a learning algorithm initially designed to recognize new sound patterns within the context of the performative artistic practice known as live coding. This paper demonstrates that this algorithm, once adapted (Ad-RuLer), has great potential to address the problem of oversampling imbalanced data. An extensive comparison with other mainstream oversampling algorithms (SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KmeansSMOTE), using different classifiers (logistic regression, random forest, and XGBoost) is performed on several real-world datasets with different degrees of data imbalance. The experiment results indicate that Ad-RuLer serves as an effective oversampling technique with extensive applicability. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
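Note on record 23: the abstract says that existing oversampling techniques mostly interpolate between neighbouring minority samples. The Python sketch below illustrates that general idea only (in the spirit of SMOTE); it is not Ad-RuLer and not any library's implementation. The function name `interpolate_oversample`, the parameter `k`, and the toy data are assumptions made for illustration.

```python
import math
import random


def interpolate_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest neighbours of `base` among the other minority points.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = interpolate_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two existing minority points, the new samples stay inside the region the minority class already occupies. That is the limitation the abstract contrasts with rule-driven synthesis.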
24. Crack modeling via minimum-weight surfaces in 3d Voronoi diagrams.
- Author
-
Jung, Christian and Redenbach, Claudia
- Subjects
- *
VORONOI polygons , *MACHINE learning , *SUSTAINABLE architecture , *THREE-dimensional imaging , *CIVIL engineering - Abstract
As the number one building material, concrete is of fundamental importance in civil engineering. Understanding its failure mechanisms is essential for designing sustainable buildings and infrastructure. Micro-computed tomography (μCT) is a well-established tool for virtually assessing crack initiation and propagation in concrete. The reconstructed 3d images can be examined via techniques from the fields of classical image processing and machine learning. Ground truths are a prerequisite for an objective evaluation of crack segmentation methods. Furthermore, they are necessary for training machine learning models. However, manual annotation of large 3d concrete images is not feasible. To tackle the problem of data scarcity, the image pairs of cracked concrete and corresponding ground truth can be synthesized. In this work we propose a novel approach to stochastically model crack structures via Voronoi diagrams. The method is based on minimum-weight surfaces, an extension of shortest paths to 3d. Within a dedicated image processing pipeline, the surfaces are then discretized and embedded into real μCT images of concrete. The method is flexible and fast, such that a variety of different crack structures can be generated in a short amount of time. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
25. A Generative Adversarial Network to Synthesize 3D Magnetohydrodynamic Distortions for Electrocardiogram Analyses Applied to Cardiac Magnetic Resonance Imaging.
- Author
-
Mehri, Maroua, Calmon, Guillaume, Odille, Freddy, Oster, Julien, and Lalande, Alain
- Subjects
- *
CARDIAC magnetic resonance imaging , *GENERATIVE adversarial networks , *DATA augmentation , *MAGNETIC resonance imaging , *PROBABILISTIC generative models , *DEEP learning - Abstract
Recently, deep learning (DL) models have been increasingly adopted for automatic analyses of medical data, including electrocardiograms (ECGs). Large, available ECG datasets, generally of high quality, often lack specific distortions, which could be helpful for enhancing DL-based algorithms. Synthetic ECG datasets could overcome this limitation. A generative adversarial network (GAN) was used to synthesize realistic 3D magnetohydrodynamic (MHD) distortion templates, as observed during magnetic resonance imaging (MRI), and then added to available ECG recordings to produce an augmented dataset. Similarity metrics, as well as the accuracy of a DL-based R-peak detector trained with and without data augmentation, were used to evaluate the effectiveness of the synthesized data. Three-dimensional MHD distortions produced by the proposed GAN were similar to the measured ones used as input. The precision of a DL-based R-peak detector, tested on actual unseen data, was significantly enhanced by data augmentation; its recall was higher when trained with augmented data. Using synthesized MHD-distorted ECGs significantly improves the accuracy of a DL-based R-peak detector, with a good generalization capacity. This provides a simple and effective alternative to collecting new patient data. DL-based algorithms for ECG analyses can suffer from bias or gaps in training datasets. Using a GAN to synthesize new data, as well as metrics to evaluate its performance, can overcome the scarcity issue of data availability. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
26. Drivers of biodiversity change in the Anthropocene
- Author
-
Daskalova, Gergana Nikolaeva, Myers-Smith, Isla, Bjorkman, Anne, and Dornelas, Maria
- Subjects
biodiversity ,conservation ,global change ,ecology ,data science ,data synthesis ,time-series ,forest loss ,global change drivers ,rarity ,species traits ,biodiversity change ,species richness ,community composition - Abstract
Across the globe, the populations of species and the biodiversity of ecological communities are changing, including declines, gains and stable trends over time. Against a backdrop of accelerating global change, a critical research challenge is to disentangle the sources of the heterogeneous patterns of population and biodiversity change over time. In this thesis, I linked population and biodiversity change with species traits like rarity and commonness, and with global change drivers like forest loss. I synthesised global biodiversity databases with gridded driver datasets to quantify how species' populations and biodiversity are being impacted by human activities in the Anthropocene. The rise of open-access data in ecology has produced databases with millions of records which have launched large-scale syntheses of how Earth's biota is changing over time and space. However, our knowledge of biodiversity change is limited by the available data and their biases. In Chapter 1, I tested the representation of three worldwide biodiversity databases (Living Planet, BioTIME and PREDICTS) across geographic and temporal variation in global change over land and sea and across the tree of life. I found that variation in global change drivers is better captured over space than over time and in the marine realm versus on land. I provided recommendations on how to improve the use of existing data, better target future ecological monitoring and capture different combinations of global change. In Chapter 2, I tested whether vertebrate species from specific biomes, taxa or with certain species traits are more likely to increase or decrease in a time of accelerating global change. I analysed nearly 10 000 population abundance time series from over 2000 vertebrate species part of the Living Planet Database. I integrated abundance data with information on geographic range, habitat preference, taxonomic and phylogenetic relationships, and IUCN Red List Categories and threats. 
I found that 15% of populations declined, 18% increased, and 67% showed no net changes over time. Amphibians were the only taxa that experienced net declines in the analysed data, while birds, mammals and reptiles experienced net increases. Despite this variation among broad taxonomic groups, surprisingly I did not detect phylogenetic patterns in which species were more likely to decline versus increase. Population trends were poorly explained by species' rarity and global-scale threats. I found that incorporating the full spectrum of population change, including declines, gains and stable trends, will improve conservation efforts to protect global biodiversity. In Chapter 3, I explored land-use change to fill the gap in empirical evidence of how habitat transformations such as forest loss and gain are reshaping biodiversity over time. I quantified how change in forest cover has influenced temporal shifts in populations and ecological assemblages from over 6000 globally distributed time series across six taxonomic groups. I found that local-scale increases and decreases in abundance, species richness, and temporal species replacement (turnover) were intensified by as much as 48% after forest loss. Larger amounts of forest loss did not always correlate with higher population and biodiversity change across sites, highlighting the mediating effects of local context and historical baselines. Temporal lags in population- and assemblage-level shifts after forest loss extended up to 50 years and increased with species' generation time. My findings indicate that forest loss amplified population and biodiversity change, with effects on both short and long temporal scales. A mix of immediate and lagged biodiversity change following land-use change emphasises the need for temporally explicit biodiversity scenarios to accurately estimate progress towards conservation goals. 
Together, my thesis findings demonstrate the wide spectrum of population and biodiversity change happening across varying amounts of global change and different realms, taxa and species traits. These heterogeneous impacts of global change on population and biodiversity spanned temporal scales from immediate effects in a couple of years to lagged responses decades after disturbance. The links between global change drivers and shifts in species' abundance, species richness and compositional turnover depended on historical context and species' characteristics like generation time. I documented both immediate and temporally delayed effects of global change drivers on species' populations abundance and the biodiversity of ecological assemblages which highlights the importance of long-term ecological monitoring. The main implications of my thesis findings are that first, any inferences drawn from biodiversity syntheses reflect the types of species and places represented by the data and the global change that is experienced. To create accurate scenarios, we need biodiversity data that span not only different taxa and locations, but also the spectrum of global change variation around the world. Second, biodiversity predictions should incorporate both positive and negative impacts of global change drivers as well as lagged responses. Finally, ecosystems and the species within them are usually simultaneously exposed to a suite of global change drivers and a key future research step is to test the synergy and/or antagony in the effects and interactions among multiple types of environmental change on populations and biodiversity. Overall, my thesis research demonstrates that the drivers of biodiversity change in the Anthropocene have both immediate and temporally-delayed effects which depend on species' traits and the sites' historical context. 
My findings suggest that by incorporating the full spectrum of biodiversity change and the nuance around interacting global change drivers we can improve projections of future ecological shifts and enhance local and international conservation policies.
- Published
- 2021
- Full Text
- View/download PDF
27. Efficient Wheat Head Segmentation with Minimal Annotation: A Generative Approach
- Author
-
Jaden Myers, Keyhan Najafian, Farhad Maleki, and Katie Ovens
- Subjects
deep learning ,segmentation ,generative adversarial networks ,data synthesis ,Photography ,TR1-1050 ,Computer applications to medicine. Medical informatics ,R858-859.7 ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Deep learning models have been used for a variety of image processing tasks. However, most of these models are developed through supervised learning approaches, which rely heavily on the availability of large-scale annotated datasets. Developing such datasets is tedious and expensive. In the absence of an annotated dataset, synthetic data can be used for model development; however, due to the substantial differences between simulated and real data, a phenomenon referred to as domain gap, the resulting models often underperform when applied to real data. In this research, we aim to address this challenge by first computationally simulating a large-scale annotated dataset and then using a generative adversarial network (GAN) to fill the gap between simulated and real images. This approach results in a synthetic dataset that can be effectively utilized to train a deep-learning model. Using this approach, we developed a realistic annotated synthetic dataset for wheat head segmentation. This dataset was then used to develop a deep-learning model for semantic segmentation. The resulting model achieved a Dice score of 83.4% on an internal dataset and Dice scores of 79.6% and 83.6% on two external datasets from the Global Wheat Head Detection datasets. While we proposed this approach in the context of wheat head segmentation, it can be generalized to other crop types or, more broadly, to images with dense, repeated patterns such as those found in cellular imagery.
- Published
- 2024
- Full Text
- View/download PDF
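Note on record 27: the abstract reports segmentation quality as Dice scores. As a reminder of what that metric measures, the following Python sketch computes Dice for two binary masks, 2|A ∩ B| / (|A| + |B|). The function name `dice_score`, the toy masks, and the convention of returning 1.0 for two empty masks are assumptions for illustration, not details from the paper.

```python
def dice_score(pred, truth):
    """Dice = 2 * |pred AND truth| / (|pred| + |truth|) for binary masks."""
    flat_pred = [p for row in pred for p in row]
    flat_truth = [t for row in truth for t in row]
    intersection = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    total = sum(map(bool, flat_pred)) + sum(map(bool, flat_truth))
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 2x3 masks: 1 marks a foreground (e.g., wheat head) pixel.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
score = dice_score(pred, truth)  # 2 shared pixels, 3 + 3 foreground pixels
```

Here the masks share two foreground pixels out of three each, so the score is 4/6, roughly 0.667.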
28. Enhanced Pet Behavior Prediction via S2GAN-Based Heterogeneous Data Synthesis
- Author
-
Jinah Kim and Nammee Moon
- Subjects
behavior prediction ,behavior monitoring ,heterogeneous data ,data synthesis ,generative adversarial network ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
Heterogeneous data have been used to enhance behavior prediction performance; however, it involves issues such as missing data, which need to be addressed. This paper proposes enhanced pet behavior prediction via Sensor to Skeleton Generative Adversarial Networks (S2GAN)-based heterogeneous data synthesis. The S2GAN model synthesizes the key features of video skeletons based on collected nine-axis sensor data and replaces missing data, thereby enhancing the accuracy of behavior prediction. In this study, data collected from 10 pets in a real-life-like environment were used to conduct recognition experiments on 9 commonly occurring types of indoor behavior. Experimental results confirmed that the proposed S2GAN-based synthesis method effectively resolves possible missing data issues in real environments and significantly improves the performance of the pet behavior prediction model. Additionally, by utilizing data collected under conditions similar to the real environment, the method enables more accurate and reliable behavior prediction. This research demonstrates the importance and utility of synthesizing heterogeneous data in behavior prediction, laying the groundwork for applications in various fields such as abnormal behavior detection and monitoring.
- Published
- 2024
- Full Text
- View/download PDF
29. CTAB-GAN+: enhancing tabular data synthesis
- Author
-
Zilong Zhao, Aditya Kunar, Robert Birke, Hiek Van der Scheer, and Lydia Y. Chen
- Subjects
GAN ,data synthesis ,tabular data ,differential privacy ,imbalanced distribution ,Information technology ,T58.5-58.64 - Abstract
The usage of synthetic data is gaining momentum in part due to the unavailability of original data due to privacy and legal considerations and in part due to its utility as an augmentation to the authentic data. Generative adversarial networks (GANs), a paragon of generative models, initially for images and subsequently for tabular data, have contributed many of the state-of-the-art synthesizers. As GANs improve, the synthesized data increasingly resemble the real data, risking privacy leakage. Differential privacy (DP) provides theoretical guarantees on privacy loss but degrades data utility. Striking the best trade-off remains a challenging research question. In this study, we propose CTAB-GAN+, a novel conditional tabular GAN. CTAB-GAN+ improves upon the state of the art by (i) adding downstream losses to conditional GAN for higher utility synthetic data in both classification and regression domains; (ii) using Wasserstein loss with gradient penalty for better training convergence; (iii) introducing novel encoders targeting mixed continuous-categorical variables and variables with unbalanced or skewed data; and (iv) training with DP stochastic gradient descent to impose strict privacy guarantees. We extensively evaluate CTAB-GAN+ on statistical similarity and machine learning utility against state-of-the-art tabular GANs. The results show that CTAB-GAN+ synthesizes privacy-preserving data with at least 21.9% higher machine learning utility (i.e., F1-Score) across multiple datasets and learning tasks under a given privacy budget.
- Published
- 2024
- Full Text
- View/download PDF
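Note on record 29: item (iv) of the abstract mentions training with DP stochastic gradient descent. The Python sketch below shows the commonly described DP-SGD gradient step (clip each per-example gradient, sum, add Gaussian noise, average). It is not code from CTAB-GAN+, and it omits privacy accounting entirely. The names `dp_sgd_gradient`, `clip_norm`, and `noise_multiplier`, as well as the toy gradients, are assumptions for illustration.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Scale the gradient down only if its L2 norm exceeds clip_norm.
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            total[i] += g * scale
    # Noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


# Hypothetical per-example gradients for a two-parameter model.
grads = [[3.0, 4.0], [0.6, 0.8]]
private_grad = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1,
                               rng=random.Random(0))
```

Clipping bounds how much any single record can influence the update, and the added noise masks what influence remains. This is the source of the utility loss that the abstract describes as a trade-off.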
30. Risky business: human-related data is lacking from Lyme disease risk models
- Author
-
Erica Fellin, Mathieu Varin, and Virginie Millien
- Subjects
blacklegged ticks ,data synthesis ,human-related ,Lyme disease ,risk assessment ,risk map ,Public aspects of medicine ,RA1-1270 - Abstract
Used as a communicative tool for risk management, risk maps provide a service to the public, conveying information that can raise risk awareness and encourage mitigation. Several studies have utilized risk maps to determine risks associated with the distribution of Borrelia burgdorferi, the causal agent of Lyme disease in North America and Europe, as this zoonotic disease can lead to severe symptoms. This literature review focused on the use of risk maps to model distributions of B. burgdorferi and its vector, the blacklegged tick (Ixodes scapularis), in North America to compare variables used to predict these spatial models. Data were compiled from the existing literature to determine which ecological, environmental, and anthropic (i.e., human focused) variables past research has considered influential to the risk level for Lyme disease. The frequency of these variables was examined and analyzed via a non-metric multidimensional scaling analysis to compare different map elements that may categorize the risk models performed. Environmental variables were found to be the most frequently used in risk spatial models, particularly temperature. It was found that there was a significantly dissimilar distribution of variables used within map elements across studies: Map Type, Map Distributions, and Map Scale. Within these map elements, few anthropic variables were considered, particularly in studies that modeled future risk, despite the objective of these models directly or indirectly focusing on public health intervention. Without including human-related factors within risk map models, it is difficult to determine how reliable these risk maps truly are. Future researchers may be persuaded to improve disease risk models by taking this into consideration.
- Published
- 2023
- Full Text
- View/download PDF
31. SeedArc, a global archive of primary seed germination data.
- Author
-
Fernández‐Pascual, Eduardo, Carta, Angelino, Rosbakh, Sergey, Guja, Lydia, Phartyal, Shyam S., Silveira, Fernando A. O., Chen, Si‐Chong, Larson, Julie E., and Jiménez‐Alfaro, Borja
- Subjects
- *
GERMINATION , *BOTANY , *BIOTIC communities , *SEED size , *PLANT reproduction , *BIOMES , *PLANT ecology - Abstract
Keywords: data synthesis; database; germination; open science; plant reproduction; repository; seed; trait. Data availability: The data and code used to produce this article are available at https://github.com/efernandezpascual/seedarcms. The need for a global archive of primary seed germination data: The seed ecology community has recently recognized the need to synthesize knowledge, setting the research agenda for functional seed ecology (Saatkamp et al., [34]). SeedArc compiles primary seed germination data to synthesize the seed germination spectrum at a global scale. The theory underlying the seed germination spectrum has been laid out by decades of work on seed ecology (Baskin & Baskin, [1]), but empirical studies testing major ecological hypotheses at both global and local scales remain elusive without a standardized seed germination database. [Extracted from the article]
- Published
- 2023
- Full Text
- View/download PDF
32. Of causes and symptoms: using monitoring data and expert knowledge to diagnose the causes of stream degradation.
- Author
-
Rettig, Katharina, Semmler-Elpers, Renate, Brettschneider, Denise, Hering, Daniel, and Feld, Christian K.
- Subjects
WATER management ,BAYESIAN analysis ,ECOLOGICAL assessment ,WATER use ,LAND use ,FECAL contamination - Abstract
Ecological status assessment under the European Water Framework Directive (WFD) often integrates the impact of multiple stressors into a single index value. This hampers the identification of the individual stressors responsible for status deterioration. As a consequence, management measures are often disentangled from assessment results. To close this gap and to support river basin managers in the diagnosis of stressors, we linked numerous macroinvertebrate assessment metrics and one diatom index with potential causes of ecological deterioration through Bayesian belief networks (BBNs). The BBNs were informed by WFD monitoring data as well as regular consultation with experts and allow the probabilities of individual degradation causes to be estimated based upon a selection of biological metrics. Macroinvertebrate metrics were shown to be more strongly linked to hydromorphological conditions and land use than to water quality-related parameters (e.g., thermal and nutrient pollution). The modeled probabilities also allow the potential causes of degradation to be ordered hierarchically. The comparison of assessment metrics showed that compositional and trait-based community metrics performed equally well in the diagnosis. The testing of the BBNs by experts resulted in an agreement between model output and expert opinion of 17–92% for individual stressors. Overall, the expert-based validation confirmed a good diagnostic potential of the BBNs; on average 80% of the diagnosed causes were in agreement with expert judgement. We conclude that diagnostic BBNs can assist the identification of causes of stream and river degradation and thereby inform the derivation of appropriate management decisions. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
33. ASIDS: A Robust Data Synthesis Method for Generating Optimal Synthetic Samples.
- Author
-
Du, Yukun, Cai, Yitao, Jin, Xiao, Wang, Hongxia, Li, Yao, and Lu, Min
- Subjects
- *
SAMPLE size (Statistics) , *INTERPOLATION - Abstract
Most existing data synthesis methods are designed to tackle problems with dataset imbalance, data anonymization, and an insufficient sample size. There is a lack of effective synthesis methods in cases where the actual datasets have a limited number of data points but a large number of features and unknown noise. Thus, in this paper we propose a data synthesis method named Adaptive Subspace Interpolation for Data Synthesis (ASIDS). The idea is to divide the original data feature space into several subspaces with an equal number of data points, and then perform interpolation on the data points in the adjacent subspaces. This method can adaptively adjust the sample size of the synthetic dataset that contains unknown noise, and the generated sample data typically contain minimal errors. Moreover, it adjusts the feature composition of the data points, which can significantly reduce the proportion of the data points with large fitting errors. Furthermore, the hyperparameters of this method have an intuitive interpretation and usually require little calibration. Analysis results obtained using simulated original data and benchmark original datasets demonstrate that ASIDS is a robust and stable method for data synthesis. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
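Note on record 33: the abstract describes dividing the feature space into subspaces holding equal numbers of data points and interpolating between points in adjacent subspaces. The Python sketch below is a loose illustration of that one sentence, not the ASIDS algorithm. In particular, forming the subspaces by sorting on the first feature, the uniform interpolation weight, and the name `subspace_interpolate` are all assumptions. The adaptive sample-size and feature-composition adjustments mentioned in the abstract are not modelled.

```python
import random


def subspace_interpolate(points, n_subspaces, n_new, seed=0):
    """Split points into equal-sized groups and interpolate across adjacent groups."""
    rng = random.Random(seed)
    # Assumption: order by the first feature to form the groups.
    ordered = sorted(points, key=lambda p: p[0])
    size = len(ordered) // n_subspaces  # any leftover points are ignored
    groups = [ordered[i * size:(i + 1) * size] for i in range(n_subspaces)]
    synthetic = []
    for _ in range(n_new):
        g = rng.randrange(n_subspaces - 1)  # pick a pair of adjacent groups
        a = rng.choice(groups[g])
        b = rng.choice(groups[g + 1])
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Hypothetical small dataset: (feature, response) pairs.
data = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2), (4.0, 9.0), (5.0, 11.1)]
synthetic = subspace_interpolate(data, n_subspaces=3, n_new=4)
```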
34. Emotion recognition using facial expressions in an immersive virtual reality application.
- Author
-
Chen, Xinrun and Chen, Hengxin
- Subjects
EMOTION recognition ,FACIAL expression ,VIRTUAL reality ,HEAD-mounted displays ,INFRARED cameras ,EMOTIONS ,LIGHT sources - Abstract
Facial expression recognition (FER) is an important method to study and distinguish human emotions. In the virtual reality (VR) context, people's emotions are instantly and naturally triggered and mobilized due to the high immersion and realism of VR. However, when people are wearing head mounted display (HMD) VR equipment, the eye regions will be covered. The FER accuracy will be reduced if the eye region information is discarded. Therefore, it is necessary to obtain the information of eye regions using other methods. The main difficulty in FER in an immersive VR context is that the conventional FER methods depend on public databases. The image facial information in the public databases is complete, so these methods are difficult to directly apply to the VR context. To solve this problem, this paper designs and implements a solution for FER in the VR context as follows. A real facial expression database collection scheme in the VR context is implemented by adding an infrared camera and infrared light source to the HMD. A virtual database construction method is presented for FER in the VR context, which can improve the generalization of models. A deep network named the multi-region facial expression recognition model is designed for FER in the VR context. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
35. On Evaluating IoT Data Trust via Machine Learning.
- Author
-
Tadj, Timothy, Arablouei, Reza, and Dedeoglu, Volkan
- Subjects
TRUST ,SUPERVISED learning ,MACHINE learning ,INTERNET of things ,PYTHON programming language ,TAGS (Metadata) ,SECURE Sockets Layer (Computer network protocol) ,RANDOM walks ,CLUSTER analysis (Statistics) - Abstract
Data trust in IoT is crucial for safeguarding privacy, security, reliable decision-making, user acceptance, and complying with regulations. Various approaches based on supervised or unsupervised machine learning (ML) have recently been proposed for evaluating IoT data trust. However, assessing their real-world efficacy is hard mainly due to the lack of related publicly available datasets that can be used for benchmarking. Since obtaining such datasets is challenging, we propose a data synthesis method, called random walk infilling (RWI), to augment IoT time-series datasets by synthesizing untrustworthy data from existing trustworthy data. Thus, RWI enables us to create labeled datasets that can be used to develop and validate ML models for IoT data trust evaluation. We also extract new features from IoT time-series sensor data that effectively capture its autocorrelation as well as its cross-correlation with the data of the neighboring (peer) sensors. These features can be used to learn ML models for recognizing the trustworthiness of IoT sensor data. Equipped with our synthesized ground-truth-labeled datasets and informative correlation-based features, we conduct extensive experiments to critically examine various approaches to evaluating IoT data trust via ML. The results reveal that commonly used ML-based approaches to IoT data trust evaluation, which rely on unsupervised cluster analysis to assign trust labels to unlabeled data, perform poorly. This poor performance is due to the underlying assumption that clustering provides reliable labels for data trust, which is found to be untenable. The results also indicate that ML models, when trained on datasets augmented via RWI and using the proposed features, generalize well to unseen data and surpass existing related approaches. 
Moreover, we observe that a semi-supervised ML approach that requires only about 10% of the data labeled offers competitive performance while being practically more appealing compared to the fully supervised approaches. The related Python code and data are available online. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
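Note on record 35: the abstract introduces random walk infilling (RWI) as a way to synthesize untrustworthy data from trustworthy time series, producing labeled datasets. The Python sketch below gives one plausible reading of that idea and should not be taken as the authors' method. Starting the walk from the last trustworthy value, using Gaussian steps, and the names `random_walk_infill` and `step_std` are assumptions for illustration.

```python
import random


def random_walk_infill(series, start, length, step_std, seed=0):
    """Replace series[start:start+length] with a random walk; return data and labels."""
    rng = random.Random(seed)
    corrupted = list(series)
    labels = [0] * len(series)  # 0 = trustworthy, 1 = synthesized untrustworthy
    # Assumption: the walk starts from the last trustworthy reading.
    value = series[start - 1] if start > 0 else series[0]
    for i in range(start, min(start + length, len(series))):
        value += rng.gauss(0.0, step_std)
        corrupted[i] = value
        labels[i] = 1
    return corrupted, labels


# Hypothetical temperature readings from one sensor.
readings = [20.0, 20.1, 20.3, 20.2, 20.4, 20.5, 20.4, 20.6, 20.7, 20.6]
corrupted, labels = random_walk_infill(readings, start=3, length=4, step_std=0.5)
```

The returned labels mark exactly which samples were replaced, which is what makes the augmented series usable as ground truth for supervised or semi-supervised training.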
36. Genetic diversity and IUCN Red List status.
- Author
-
Schmidt, Chloé, Hoban, Sean, Hunter, Margaret, Paz‐Vinas, Ivan, and Garroway, Colin J.
- Subjects
- *
GENETIC variation , *GENETIC drift , *BIOLOGICAL extinction , *INBREEDING , *GENETIC correlations , *ENDANGERED species - Abstract
The International Union for Conservation of Nature (IUCN) Red List is an important and widely used tool for conservation assessment. The IUCN uses information about a species' range, population size, habitat quality and fragmentation levels, and trends in abundance to assess extinction risk. Genetic diversity is not considered, although it affects extinction risk. Declining populations are more strongly affected by genetic drift and higher rates of inbreeding, which can reduce the efficiency of selection, lead to fitness declines, and hinder species' capacities to adapt to environmental change. Given the importance of conserving genetic diversity, attempts have been made to find relationships between red‐list status and genetic diversity. Yet, there is still no consensus on whether genetic diversity is captured by the current IUCN Red List categories in a way that is informative for conservation. To assess the predictive power of correlations between genetic diversity and IUCN Red List status in vertebrates, we synthesized previous work and reanalyzed data sets based on 3 types of genetic data: mitochondrial DNA, microsatellites, and whole genomes. Consistent with previous work, species with higher extinction risk status tended to have lower genetic diversity for all marker types, but these relationships were weak and varied across taxa. Regardless of marker type, genetic diversity did not accurately identify threatened species for any taxonomic group. Our results indicate that red‐list status is not a useful metric for informing species‐specific decisions about the protection of genetic diversity and that genetic data cannot be used to identify threat status in the absence of demographic data. Thus, there is a need to develop and assess metrics specifically designed to assess genetic diversity and inform conservation policy, including policies recently adopted by the UN's Convention on Biological Diversity Kunming‐Montreal Global Biodiversity Framework. 
[ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
37. Genetic algorithms and their applications to synthetic data generation
- Author
-
Chen, Yingrui, Elliot, Mark, and Smith, Duncan
- Subjects
Machine Learning ,Data Privacy ,Genetic Algorithms ,Data Synthesis - Abstract
Data synthesis is a statistical disclosure control technique that prevents the leakage of personal information from survey data. Rubin, who originally proposed this technique, treated the confidential data within a dataset as missing and then replaced those data using multiple imputation [103]. Most methods in data synthesis were then developed based on this principle. However, data synthesis is a multi-objective problem that aims to maximise information utility as well as minimising disclosure risks, and these methods have no explicit mechanism for balancing the objectives. This issue is the basis for the line of enquiry embodied in this thesis. The need to optimise competing objectives suggests the possible use of iterative machine learning techniques for data synthesis, but - to date - investigations of this possibility have been limited. In the thesis, a new synthesis method using Genetic Algorithms (GAs) is introduced. GAs are evolutionary computational methods that simulate natural evolution. They allow candidates (which in this thesis are datasets) to compete, reproduce and mate in a pre-determined environment until one or more of them perfectly fits the environment (which is defined by a set of objectives). GAs were first used on binary strings and now have variants that deal with different problems and data forms. In this thesis, a GA data synthesiser whose candidates are matrix and real-coded data is designed, and most of its parameters and hyper-parameters are tested. A new information utility function to measure the overall divergence from synthetic data to the original data is used. The results of running the synthesiser on a real dataset are presented, which show that the GA approach successfully produced plausible synthetic data using a single utility objective and that it was able to seek a trade-off between information utility and disclosure risks during the process of synthesising. 
The overall conclusion is that GAs represent a significant opportunity for the practice of data synthesis.
- Published
- 2020
38. Generative neural data synthesis for autonomous systems
- Author
-
Jegorova, Marija, Hospedales, Timothy, Mistry, Michael, and Ramamoorthy, Subramanian
- Subjects
GANs ,data synthesis ,data augmentation - Abstract
A significant number of Machine Learning methods for automation currently rely on data-hungry training techniques. The lack of accessible training data often represents an insurmountable obstacle, especially in the fields of robotics and automation, where acquiring new data can be far from trivial. Additional data acquisition is not only often expensive and time-consuming, but occasionally is not even an option. Furthermore, the real world applications sometimes have commercial sensitivity issues associated with the distribution of the raw data. This doctoral thesis explores bypassing the aforementioned difficulties by synthesising new realistic and diverse datasets using the Generative Adversarial Network (GAN). The success of this approach is demonstrated empirically through solving a variety of case-specific data-hungry problems, via application of novel GAN-based techniques and architectures. Specifically, it starts with exploring the use of GANs for the realistic simulation of the extremely high-dimensional underwater acoustic imagery for the purpose of training both teleoperators and autonomous target recognition systems. We have developed a method capable of generating realistic sonar data of any chosen dimension by image-translation GANs with Markov principle. Following this, we apply GAN-based models to robot behavioural repertoire generation, that enables a robot manipulator to successfully overcome unforeseen impedances, such as unknown sets of obstacles and random broken joints scenarios. Finally, we consider dynamical system identification for articulated robot arms. We show how using diversity-driven GAN models to generate exploratory trajectories can allow dynamic parameters to be identified more efficiently and accurately than with conventional optimisation approaches. Together, these results show that GANs have the potential to benefit a variety of robotics learning problems where training data is currently a bottleneck.
- Published
- 2020
- Full Text
- View/download PDF
39. Data Synthesis for Alfalfa Biomass Yield Estimation
- Author
-
Jonathan Vance, Khaled Rasheed, Ali Missaoui, and Frederick W. Maier
- Subjects
machine learning ,data synthesis ,generative models ,alfalfa ,biomass ,precision agriculture ,Electronic computers. Computer science ,QA75.5-76.95 - Abstract
Alfalfa is critical to global food security, and its data is abundant in the U.S. nationally, but often scarce locally, limiting the potential performance of machine learning (ML) models in predicting alfalfa biomass yields. Training ML models on local-only data results in very low estimation accuracy when the datasets are very small. Therefore, we explore synthesizing non-local data to estimate biomass yields labeled as high, medium, or low. One option to remedy scarce local data is to train models using non-local data; however, this only works about as well as using local data. Therefore, we propose a novel pipeline that trains models using data synthesized from non-local data to estimate local crop yields. Our pipeline, synthesized non-local training (SNLT pronounced like sunlight), achieves a gain of 42.9% accuracy over the best results from regular non-local and local training on our very small target dataset. This pipeline produced the highest accuracy of 85.7% with a decision tree classifier. From these results, we conclude that SNLT can be a useful tool in helping to estimate crop yields with ML. Furthermore, we propose a software application called Predict Your CropS (PYCS pronounced like Pisces) designed to help farmers and researchers estimate and predict crop yields based on pretrained models.
- Published
- 2022
- Full Text
- View/download PDF
40. Effects and parameters of community-based exercise on motor symptoms in Parkinson’s disease: a meta-analysis
- Author
-
Chun-Lan Yang, Jia-Peng Huang, Ting-Ting Wang, Ying-Chao Tan, Yin Chen, Zi-Qi Zhao, Chao-Hua Qu, and Yun Qu
- Subjects
Data synthesis ,Exercise ,Movement ,Parkinson’s disease ,Prescription ,Review ,Neurology. Diseases of the nervous system ,RC346-429 - Abstract
Background: Community-based exercise is a continuation and complement to inpatient rehabilitation for Parkinson's disease and does not require a professional physical therapist or equipment. The effects, parameters, and forms of each exercise are diverse, and the effect is affected by many factors. A meta-analysis was conducted to determine the effect and the best parameters for improving motor symptoms and to explore the possible factors affecting the effect of community-based exercise. Methods: We conducted a comprehensive search of six databases: PEDro, PubMed/Medline, CENTRAL, Scopus, Embase, and WOS. Studies that compared community-based exercise with usual care were included. The intervention mainly included dance, Chinese martial arts, Nordic walking, and home-based exercise. The primary outcome measure was the Unified Parkinson’s Disease Rating Scale part III (UPDRS-III) score. The mean difference (95% CI) was used to calculate the treatment outcomes of continuous outcome variables, and the I² statistic was used to estimate the heterogeneity of the statistical analysis. We conducted subgroup analysis and meta-regression analysis to determine the optimal parameters and the most important influencing factors of the exercise effect. Results: Twenty-two studies that enrolled a total of 809 subjects were included in the analysis. Exercise had a positive effect on the UPDRS-III (MD = -5.83; 95% CI, -8.29 to -3.37), Timed Up and Go test (MD = -2.22; 95% CI, -3.02 to -1.42), UPDRS (MD = -7.80; 95% CI, -10.98 to -6.42), 6-Minute Walk Test (MD = 68.81; 95% CI, 32.14 to 105.48), and Berg Balance Scale (MD = 4.52; 95% CI, 2.72 to 5.78) scores. However, the heterogeneity of each included study was obvious. Weekly frequency, age, and duration of treatment were all factors that potentially influenced the effect. Conclusions: This meta-analysis suggests that community-based exercise may benefit motor function in patients with PD. The most commonly used modalities of exercise were tango and tai chi, and the most common prescription was 60 min twice a week. Future studies should consider the influence of age, duration of treatment, and weekly frequency on the effect of exercise. PROSPERO trial registration number: CRD42022327162.
- Published
- 2022
- Full Text
- View/download PDF
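The abstract of record 40 mentions pooling mean differences and using the I² statistic to gauge heterogeneity. As a minimal sketch of how those quantities are commonly computed, the Python below uses fixed-effect inverse-variance weighting and Cochran's Q; this is an assumption for illustration, since the record does not state which pooling model the authors used. The function name and the input numbers in the usage note are hypothetical and are not taken from the study.

```python
def pooled_mean_difference(mean_diffs, std_errors):
    """Fixed-effect inverse-variance pooling with Cochran's Q and I^2 (in percent).

    mean_diffs: per-study mean differences
    std_errors: per-study standard errors of those differences
    """
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * m for w, m in zip(weights, mean_diffs)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (m - pooled) ** 2 for w, m in zip(weights, mean_diffs))
    df = len(mean_diffs) - 1

    # I^2 = max(0, (Q - df) / Q), expressed as a percentage.
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i_squared
```

For example, `pooled_mean_difference([0.0, 10.0], [1.0, 1.0])` pools two made-up, strongly disagreeing studies and returns a high I², whereas identical study estimates give an I² of zero.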
41. A systematic review and future research agenda on detection of polycystic ovary syndrome (PCOS) with computer-aided techniques
- Author
-
Sayma Alam Suha and Muhammad Nazrul Islam
- Subjects
Polycystic ovary syndrome (PCOS) ,Computer-assisted methods ,Systematic literature review (SLR) ,Data synthesis ,Future research scopes ,Science (General) ,Q1-390 ,Social sciences (General) ,H1-99 - Abstract
Polycystic Ovary Syndrome (PCOS) is among the most prevalent endocrinological abnormalities seen in reproductive female bodies posing serious health hazards. The correctness of interpreting this condition depends heavily on the wide spectrum of associated symptoms and the doctor's expertise, making real-time clinical detection quite challenging. Thus, investigations on computer-aided PCOS detection systems have recently been explored by several researchers worldwide as a potential replacement for manual assessment. This review study's objective is to analyze the relevant research works on computer-assisted methods for automatically identifying PCOS through a systematic literature review (SLR) methodology as well as investigate the research limitations and explore potential future research scopes in this domain. 28 articles have been selected using the PRISMA approach based on a set of inclusion-exclusion criteria for conducting the review. The data synthesis of the selected articles has been conducted using six data exploration themes. As outcomes, the SLR explored the topical association between the studies; their research profiles; objectives; data size, type, and sources; methodologies applied for the detection of PCOS; and lastly the research outcomes along with their evaluation measures and performances. The study also highlights areas for future research directions examining the study gaps to enhance the current efforts for autonomous PCOS identification; such as integrating advanced techniques with the current methods; developing interactive software systems; exploring deep learning and unsupervised machine learning techniques; enhancing datasets and country context; and investigating more unknown factors behind PCOS. Thus, this SLR provides a state-of-the-art paradigm of autonomous PCOS detection which will support significantly efficient clinical assessment, diagnosis and treatment of PCOS.
- Published
- 2023
- Full Text
- View/download PDF
42. Laparoscopic versus ultrasound-guided transversus abdominis plane block for postoperative pain management in minimally invasive colorectal surgery: a meta-analysis protocol.
- Author
-
Wenming Yang, Tao Yuan, Zhaolun Cai, Qin Ma, Xueting Liu, Hang Zhou, Siyuan Qiu, and Lie Yang
- Subjects
POSTOPERATIVE pain treatment ,TRANSVERSUS abdominis muscle ,MINIMALLY invasive procedures ,INFLAMMATORY bowel diseases ,SURGICAL site infections - Abstract
Introduction: Transversus abdominis plane block (TAPB) is now commonly administered for postoperative pain control and reduced opioid consumption in patients undergoing major colorectal surgeries, such as colorectal cancer, diverticular disease, and inflammatory bowel disease resection. However, there remain several controversies about the effectiveness and safety of laparoscopic TAPB compared to ultrasound-guided TAPB. Therefore, the aim of this study is to integrate both direct and indirect comparisons to identify a more effective and safer TAPB approach. Materials and methods: Systematic electronic literature surveillance will be performed in the PubMed, Embase, Cochrane Central Register of Controlled Trials (CENTRAL), and ClinicalTrials.gov databases for eligible studies through July 31, 2023. The Cochrane Risk of Bias version 2 (RoB 2) and Risk of Bias in Non-randomized Studies of Interventions (ROBINS-I) tools will be applied to scrutinize the methodological quality of the selected studies. The primary outcomes will include (1) opioid consumption at 24 hours postoperatively and (2) pain scores at 24 hours postoperatively both at rest and at coughing and movement according to the numerical rating scale (NRS). Additionally, the probability of TAPB-related adverse events, overall postoperative 30-day complications, postoperative 30-day ileus, postoperative 30-day surgical site infection, postoperative 7-day nausea and vomiting, and length of stay will be analyzed as secondary outcome measures. The findings will be assessed for robustness through subgroup analyses and sensitivity analyses. Data analyses will be performed using RevMan 5.4.1 and Stata 17.0. P value of less than 0.05 will be defined as statistically significant. The certainty of evidence will be examined via the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) working group approach. 
Ethics and dissemination: Owing to the nature of the secondary analysis of existing data, no ethical approval will be required. Our meta-analysis will summarize all the available evidence for the effectiveness and safety of TAPB approaches for minimally invasive colorectal surgery. High-quality peer-reviewed publications and presentations at international conferences will facilitate disseminating the results of this study, which are expected to inform future clinical trials and help anesthesiologists and surgeons determine the optimal tailored clinical practice for perioperative pain management. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
43. Research on the Simulation Method of HTTP Traffic Based on GAN
- Author
-
Chenglin Yang, Dongliang Xu, and Xiao Ma
- Subjects
GAN ,HTTP stream ,traffic feature mimicry ,data synthesis ,network data ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
Due to the increasing severity of network security issues, training corresponding detection models requires large datasets. In this work, we propose a novel method based on generative adversarial networks to synthesize network data traffic. We introduced a network traffic data normalization method based on Gaussian mixture models (GMM), and for the first time, incorporated a generator based on the Swin Transformer structure into the field of network traffic generation. To further enhance the robustness of the model, we mapped real data through an AE (autoencoder) module and optimized the training results in the form of evolutionary algorithms. We validated the training results on four different datasets and introduced four additional models for comparative experiments in the experimental evaluation section. Our proposed SEGAN outperformed other state-of-the-art network traffic emulation methods.
- Published
- 2024
- Full Text
- View/download PDF
44. Software Application Profile: The Anchored Multiplier calculator—a Bayesian tool to synthesize population size estimates
- Author
-
Wesson, Paul D, McFarland, Willi, Qin, Cong Charlie, and Mirzazadeh, Ali
- Subjects
Mathematical Sciences ,Statistics ,Bayes Theorem ,Female ,HIV Infections ,Humans ,Iran ,Models ,Statistical ,Population Density ,Population Surveillance ,Sex Workers ,Software ,Bayesian modelling ,population size estimation ,key populations ,data synthesis ,Public Health and Health Services ,Epidemiology ,Public health - Abstract
Estimating the number of people in hidden populations is needed for public health research, yet available methods produce highly variable and uncertain results. The Anchored Multiplier calculator uses a Bayesian framework to synthesize multiple population size estimates to generate a consensus estimate. Users submit point estimates and lower/upper bounds which are converted to beta probability distributions and combined to form a single posterior probability distribution. The Anchored Multiplier calculator is available as a web browser-based application. The software allows for unlimited empirical population size estimates to be submitted and combined according to Bayes Theorem to form a single estimate. The software returns output as a forest plot (to visually compare data inputs and the final Anchored Multiplier estimate) and a table that displays results as population percentages and counts. The web application 'Anchored Multiplier Calculator' is free software and is available at [http://globalhealthsciences.ucsf.edu/resources/tools] or directly at [http://anchoredmultiplier.ucsf.edu/].
- Published
- 2019
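The abstract of record 44 describes converting point estimates with lower and upper bounds into beta distributions and combining them into a single posterior. The sketch below shows one simple way this could be done; it is not the calculator's actual code. It assumes that each estimate is a population proportion, that the bounds form a symmetric 95% interval, that the beta parameters are fitted by the method of moments, and that the independent beta densities are multiplied together. All function names are hypothetical.

```python
def beta_from_estimate(point, lower, upper):
    """Method-of-moments Beta(a, b) from a proportion and assumed 95% bounds."""
    # Assumption: the bounds span roughly +/- 1.96 standard deviations.
    sd = (upper - lower) / (2 * 1.96)
    common = point * (1 - point) / sd ** 2 - 1
    if common <= 0:
        raise ValueError("interval is too wide to match a beta distribution")
    return point * common, (1 - point) * common


def combine_betas(params):
    """Multiply beta densities; the product of beta kernels is a beta kernel.

    prod x^(a_i - 1) * (1 - x)^(b_i - 1) gives Beta(sum(a_i) - (n - 1),
    sum(b_i) - (n - 1)) for n input distributions.
    """
    n = len(params)
    a = sum(p[0] for p in params) - (n - 1)
    b = sum(p[1] for p in params) - (n - 1)
    return a, b


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)
```

A caller would convert each submitted estimate with `beta_from_estimate`, pass the resulting list to `combine_betas`, and multiply `beta_mean` of the result by the total population to obtain a consensus count.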
45. Towards automated molecular detection through simulated generation of CMOS-based rotational spectroscopy
- Author
-
Yasamin Fozouni, Eric C. Larson, and Bruce Gnade
- Subjects
Rotational spectroscopy ,Molecular detection ,Data synthesis ,Science (General) ,Q1-390 ,Social sciences (General) ,H1-99 - Abstract
The use of CMOS sensors for rotational spectroscopy is a promising, but challenging avenue for low-cost gas sensing and molecular identification. A main challenge in this approach is that practical CMOS spectroscopy samples contain various different noise sources that reduce the effectiveness of matching techniques for molecular identification with rotational spectroscopy. To help solve this challenge, we develop a software application tool that can demonstrate the feasibility and reliability of detection with CMOS sensor samples. Specifically, the tool characterizes the types of noise in CMOS sample collection and synthesizes spectroscopy files based upon existing databases of rotational spectroscopy samples gathered from other sensors. We use the software to create a large database of plausible CMOS-generated sample files of gases. This dataset is used to help evaluate spectral matching algorithms used in gas sensing and molecular identification applications. We evaluate these traditional methods on the synthesized dataset and discuss how peak finding and spectral matching algorithms can be altered to accommodate the noise sources present in CMOS sample collection.
- Published
- 2023
- Full Text
- View/download PDF
46. Laparoscopic versus ultrasound-guided transversus abdominis plane block for postoperative pain management in minimally invasive colorectal surgery: a meta-analysis protocol
- Author
-
Wenming Yang, Tao Yuan, Zhaolun Cai, Qin Ma, Xueting Liu, Hang Zhou, Siyuan Qiu, and Lie Yang
- Subjects
transversus abdominis plane block ,postoperative pain management ,minimally invasive ,colorectal surgery ,data synthesis ,Neoplasms. Tumors. Oncology. Including cancer and carcinogens ,RC254-282 - Abstract
Introduction: Transversus abdominis plane block (TAPB) is now commonly administered for postoperative pain control and reduced opioid consumption in patients undergoing major colorectal surgeries, such as colorectal cancer, diverticular disease, and inflammatory bowel disease resection. However, there remain several controversies about the effectiveness and safety of laparoscopic TAPB compared to ultrasound-guided TAPB. Therefore, the aim of this study is to integrate both direct and indirect comparisons to identify a more effective and safer TAPB approach. Materials and methods: Systematic electronic literature surveillance will be performed in the PubMed, Embase, Cochrane Central Register of Controlled Trials (CENTRAL), and ClinicalTrials.gov databases for eligible studies through July 31, 2023. The Cochrane Risk of Bias version 2 (RoB 2) and Risk of Bias in Non-randomized Studies of Interventions (ROBINS-I) tools will be applied to scrutinize the methodological quality of the selected studies. The primary outcomes will include (1) opioid consumption at 24 hours postoperatively and (2) pain scores at 24 hours postoperatively both at rest and at coughing and movement according to the numerical rating scale (NRS). Additionally, the probability of TAPB-related adverse events, overall postoperative 30-day complications, postoperative 30-day ileus, postoperative 30-day surgical site infection, postoperative 7-day nausea and vomiting, and length of stay will be analyzed as secondary outcome measures. The findings will be assessed for robustness through subgroup analyses and sensitivity analyses. Data analyses will be performed using RevMan 5.4.1 and Stata 17.0. P value of less than 0.05 will be defined as statistically significant. The certainty of evidence will be examined via the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) working group approach.
Ethics and dissemination: Owing to the nature of the secondary analysis of existing data, no ethical approval will be required. Our meta-analysis will summarize all the available evidence for the effectiveness and safety of TAPB approaches for minimally invasive colorectal surgery. High-quality peer-reviewed publications and presentations at international conferences will facilitate disseminating the results of this study, which are expected to inform future clinical trials and help anesthesiologists and surgeons determine the optimal tailored clinical practice for perioperative pain management. Systematic review registration: https://www.crd.york.ac.uk/PROSPERO/display_record.php?RecordID=281720, identifier CRD42021281720.
- Published
- 2023
- Full Text
- View/download PDF
47. Tracking the global application of conservation translocation and social attraction to reverse seabird declines.
- Author
-
Spatz, Dena R., Young, Lindsay C., Holmes, Nick D., Jones, Holly P., VanderWerf, Eric A., Lyons, Donald E., Kress, Stephen, Miskelly, Colin M., and Taylor, Graeme A.
- Subjects
- *
COASTAL zone management , *DATABASES , *ECOLOGICAL resilience , *ENVIRONMENTAL degradation , *CHARADRIIFORMES - Abstract
The global loss of biodiversity has inspired actions to restore nature across the planet. Translocation and social attraction actions deliberately move or lure a target species to a restoration site to reintroduce or augment populations and enhance biodiversity and ecosystem resilience. Given limited conservation funding and rapidly accelerating extinction trajectories, tracking progress of these interventions can inform best practices and advance management outcomes. Seabirds are globally threatened and commonly targeted for translocation and social attraction (“active seabird restoration”), yet no framework exists for tracking these efforts or informing best practices. This study addresses this gap for conservation decision makers responsible for seabirds and coastal management. We systematically reviewed active seabird restoration projects worldwide and collated results into a publicly accessible Seabird Restoration Database. We describe global restoration trends, apply a systematic process to measure success rates and response times since implementation, and examine global factors influencing outcomes. The database contains 851 active restoration events in 551 locations targeting 138 seabird species; 16% of events targeted globally threatened taxa. Visitation occurred in 80% of events and breeding occurred in 76%, on average 2 y after implementation began (SD = 3.2 y). Outcomes varied by taxonomy, with the highest and quickest breeding response rates for Charadriiformes (terns, gulls, and auks), primarily with social attraction. Given delayed and variable response times to active restoration, 5 y is appropriate before evaluating outcomes. The database and results serve as a model for tracking and evaluating restoration outcomes, and are applicable to measuring conservation interventions for additional threatened taxa. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
48. Data Synthesis for Alfalfa Biomass Yield Estimation.
- Author
-
Vance, Jonathan, Rasheed, Khaled, Missaoui, Ali, and Maier, Frederick W.
- Subjects
- *
BIOMASS estimation , *MACHINE learning , *ALFALFA , *CROP yields , *MACHINE performance , *DECISION trees - Abstract
Alfalfa is critical to global food security, and its data is abundant in the U.S. nationally, but often scarce locally, limiting the potential performance of machine learning (ML) models in predicting alfalfa biomass yields. Training ML models on local-only data results in very low estimation accuracy when the datasets are very small. Therefore, we explore synthesizing non-local data to estimate biomass yields labeled as high, medium, or low. One option to remedy scarce local data is to train models using non-local data; however, this only works about as well as using local data. Therefore, we propose a novel pipeline that trains models using data synthesized from non-local data to estimate local crop yields. Our pipeline, synthesized non-local training (SNLT pronounced like sunlight), achieves a gain of 42.9% accuracy over the best results from regular non-local and local training on our very small target dataset. This pipeline produced the highest accuracy of 85.7% with a decision tree classifier. From these results, we conclude that SNLT can be a useful tool in helping to estimate crop yields with ML. Furthermore, we propose a software application called Predict Your CropS (PYCS pronounced like Pisces) designed to help farmers and researchers estimate and predict crop yields based on pretrained models. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
49. Computationally efficient data synthesis for AC-OPF: Integrating Physics-Informed Neural Network solvers and active learning.
- Author
-
Zhang, Jiahao, Peng, Ruo, Lu, Chenbei, and Wu, Chenye
- Subjects
- *
DATA privacy , *BILEVEL programming , *ELECTRICAL load , *DATA release , *TEST systems - Abstract
This study addresses the challenges of privacy, utility, and efficiency in releasing privacy-preserving operational data for AC Optimal Power Flow (AC-OPF) research. Traditional methods, injecting noise into operational data (i.e., demand data and dispatch profiles) within the Differential Privacy (DP) framework, often violate physical constraints within the data, leading to unrealistic and infeasible outcomes that diminish data utility. While AC-OPF-solver-based bi-level post-processing optimizations can enforce physical feasibility, the objective divergence between post-processing and AC-OPF leads to discrepancies, compromising data utility. Additionally, their non-convex and adversarial nature makes computation prohibitively expensive, further preventing efficient data release. To overcome these challenges, our research introduces a DP approach that combines strategic noise injection for demand data with the computation of corresponding dispatch profiles, ensuring the privacy-preserving data satisfy AC-OPF's physical constraints. To accelerate data release, we employ Physics-Informed Neural Networks (PINNs). This ensures solutions' physical feasibility while enhancing computational efficiency. Furthermore, we incorporate active learning to target the most informative data samples, enhancing PINN training and optimizing efficiency while maintaining solution accuracy. Comprehensive experiments on IEEE test systems reveal our approach's improved performance and accelerated computation speed over traditional methods, highlighting its efficiency in maintaining data privacy and utility and decreasing computational burden amidst diverse privacy considerations. • Realistic, feasible, and fast data synthesis via AC-OPF-tailored PINNs. • Quantifying trade-off between demand privacy and dispatch profile accuracy. • Sampling efficient PINN training set via active learning. [ABSTRACT FROM AUTHOR]
- Published
- 2025
- Full Text
- View/download PDF
50. Ad-RuLer: A Novel Rule-Driven Data Synthesis Technique for Imbalanced Classification
- Author
-
Xiao Zhang, Iván Paz, Àngela Nebot, Francisco Mugica, and Enrique Romero
- Subjects
rule-based approach ,oversampling ,data synthesis ,imbalanced data ,classification ,Technology ,Engineering (General). Civil engineering (General) ,TA1-2040 ,Biology (General) ,QH301-705.5 ,Physics ,QC1-999 ,Chemistry ,QD1-999 - Abstract
When classifiers face imbalanced class distributions, they often misclassify minority class samples, consequently diminishing the predictive performance of machine learning models. Existing oversampling techniques predominantly rely on the selection of neighboring data via interpolation, with less emphasis on uncovering the intrinsic patterns and relationships within the data. In this research, we present the usefulness of an algorithm named RuLer to deal with the problem of classification with imbalanced data. RuLer is a learning algorithm initially designed to recognize new sound patterns within the context of the performative artistic practice known as live coding. This paper demonstrates that this algorithm, once adapted (Ad-RuLer), has great potential to address the problem of oversampling imbalanced data. An extensive comparison with other mainstream oversampling algorithms (SMOTE, ADASYN, Tomek-links, Borderline-SMOTE, and KmeansSMOTE), using different classifiers (logistic regression, random forest, and XGBoost) is performed on several real-world datasets with different degrees of data imbalance. The experiment results indicate that Ad-RuLer serves as an effective oversampling technique with extensive applicability.
- Published
- 2023
- Full Text
- View/download PDF
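The abstract of record 50 notes that existing oversampling techniques mostly interpolate between neighbouring minority samples. For context, the sketch below illustrates that baseline idea in the style of SMOTE; it is not the Ad-RuLer algorithm, whose rules the abstract does not describe. The function name, the default `k`, and the fixed seed are assumptions made for illustration.

```python
import random
from math import dist


def smote_like_oversample(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    minority: list of equal-length numeric tuples (needs at least two points)
    k: how many nearest minority neighbours each sample may pair with
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest other minority samples, by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x), key=lambda p: dist(x, p)
        )[:k]
        nb = rng.choice(neighbours)
        # A random point on the line segment between x and its neighbour.
        t = rng.random()
        synthetic.append(tuple(xi + t * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic
```

Because every synthetic point lies on a segment between two real minority samples, this kind of method cannot generate points outside the region spanned by the existing data, which is the limitation the abstract contrasts with a rule-driven approach.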
Catalog
Discovery Service for Jio Institute Digital Library