1. Extracting biological insights from genomics data using machine learning approaches
- Author
Hentges, LD, Hughes, J, and Taylor, S
- Subjects
Machine learning, Genomics, Biology
- Abstract
ATAC-seq, ChIP-seq, and DNase-seq have revolutionised molecular biology by allowing researchers to identify important DNA-encoded elements genome-wide. Regions where these elements are found appear as peaks in the analogue signal of an assay’s coverage track, and despite the ease with which humans can visually categorize these regions, meaningful peak calls from whole genome datasets require complex analytical techniques. Current methods focus on statistical tests to classify peaks, reducing the information-dense peak shapes to simply maximum height, and discounting that background signals do not completely follow any known probability distribution for significance testing. Deep learning has been shown to be highly accurate for image recognition, on par with or exceeding human ability, providing an opportunity to reimagine and improve peak calling. Here, a large, labelled dataset is built by classifying peak and noise regions by hand. These data are used to explore supervised and unsupervised machine learning techniques, as well as to assess the reliability of p-values in chromatin binding assays. The culmination of this work was the development of a peak calling framework, LanceOtron, which combines multifaceted enrichment measurements with deep learning image recognition techniques for assessing peak shape. In benchmarking transcription factor binding, chromatin modification, and open chromatin datasets, LanceOtron outperforms the long-standing, gold-standard peak caller MACS2 through its improved selectivity and near-perfect sensitivity. In addition to command line accessibility, a graphical web application was designed to give any researcher the ability to generate optimal peak calls and interactive visualizations in a single step. Furthermore, the general applicability of this technique is demonstrated in chromatin conformation capture experiments.
This establishes that the signals extracted and analysed may be found across a range of genomic data, and that the technique can be extended to new data types with only minor adjustments and model retraining.
- Published
2022