Using synthesized data to train machine learning models used in genome engineering pipeline

Authors :
Vallabhajosyula, Manikya Swathi
Chanana, Rishabh
Publication Year :
2023
Publisher :
figshare, 2023.

Abstract

Genome sequencing is the process of identifying the order of bases - adenine (A), cytosine (C), guanine (G), and thymine (T) - in the target deoxyribonucleic acid (DNA). The sequencing process consists of four stages: library preparation, cluster generation, sequencing, and data analysis. The first three stages can introduce noise, which can be reduced or eliminated in the data analysis stage. The Sequence Read Archive (SRA) files are filtered for quality and assembled into the genome. This process grows computationally intensive as the quality of the SRA files increases, and hence needs appropriate hyperparameters to assemble the genome effectively. Data synthesis helps identify appropriate features for training a machine learning algorithm that efficiently predicts the settings of the assembly pipeline needed to generate a high-quality assembly. Synthesizing task-specific data is challenging and domain-dependent but has many practical applications. Synthetic data is needed when the data available for training is limited, when the amount of data is disproportionate between classes, or when the data is too sensitive to share without raising privacy issues. In many cases, a machine learning model's performance also depends on the quality of the data used for training. In this work, we synthesize the genome sequence to identify potential machine learning features for tuning the hyperparameters of the tools in the assembly pipeline.
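
As an illustration only, the idea of synthesizing a genome sequence and deriving candidate features from noisy reads might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' method: the function names (synthesize_genome, simulate_reads, read_features), the substitution-only error model, and the chosen features (GC fraction, mean read length, coverage) are all hypothetical and do not come from the record.

```python
import random

BASES = "ACGT"


def synthesize_genome(length, gc_content=0.5, seed=0):
    """Draw a random genome of the given length (hypothetical helper).

    gc_content is the assumed probability that a base is G or C.
    """
    rng = random.Random(seed)
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights follow the order of BASES: A, C, G, T.
    return "".join(rng.choices(BASES, weights=[at, gc, gc, at], k=length))


def simulate_reads(genome, read_length, n_reads, error_rate=0.01, seed=0):
    """Sample fixed-length reads and inject substitution noise.

    Assumption: noise is modeled as independent base substitutions only.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_length + 1)
        read = list(genome[start:start + read_length])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append("".join(read))
    return reads


def read_features(reads, genome_length):
    """Compute a few candidate features from a set of reads."""
    total_bases = sum(len(r) for r in reads)
    gc_bases = sum(r.count("G") + r.count("C") for r in reads)
    return {
        "n_reads": len(reads),
        "mean_read_length": total_bases / len(reads),
        "gc_fraction": gc_bases / total_bases,
        "coverage": total_bases / genome_length,
    }


if __name__ == "__main__":
    genome = synthesize_genome(10_000, gc_content=0.4, seed=1)
    reads = simulate_reads(genome, read_length=100, n_reads=500,
                           error_rate=0.02, seed=2)
    print(read_features(reads, len(genome)))
```

Because the true genome and the noise level are known for synthetic data, features like these could be paired with the assembly settings that worked best for each synthetic dataset to form training examples; which features and settings matter in practice is not stated in the abstract.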

Details

Database :
OpenAIRE
Accession number :
edsair.doi.dedup.....37071f34dd90f6f5cb2e80c5a765ce20
Full Text :
https://doi.org/10.6084/m9.figshare.22565113