
Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction.

Authors :
Phan H
Brouard C
Mourad R
Source :
Briefings in bioinformatics [Brief Bioinform] 2024 Sep 23; Vol. 25 (6).
Publication Year :
2024

Abstract

Predicting molecular processes using deep learning is a promising approach to providing biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, whose amount is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequence is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) method based on pseudo-labeling, which makes it possible to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it by incorporating principles from the Noisy Student algorithm to predict the confidence in the pseudo-labeled data used for pre-training, which yielded improvements for transcription factors with very few binding sites (very small training data). The approach is very flexible, can be used to train any neural architecture including state-of-the-art models, and in most cases shows strong predictive performance improvements over standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than the large language model DNABERT2.
(© The Author(s) 2024. Published by Oxford University Press.)
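To make the idea concrete, below is a minimal, hypothetical sketch of a pseudo-labeling loop of the kind the abstract describes. It uses plain logistic regression as a stand-in for the paper's neural models, and a simple confidence threshold loosely inspired by the Noisy Student idea of trusting only high-confidence pseudo-labels; the function names, threshold value, and toy setup are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def train_logistic(X, y, lr=0.1, steps=500):
    """Gradient-descent logistic regression (stand-in for a deep model)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def predict_proba(X, w, b):
    """Predicted probability of the positive class."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def pseudo_label_ssl(X_lab, y_lab, X_unlab, threshold=0.9):
    """One round of pseudo-labeling SSL:
    1. train a teacher on the labeled data;
    2. pseudo-label the unlabeled pool;
    3. keep only confident pseudo-labels (prob > threshold either way);
    4. retrain a student on labeled + confident pseudo-labeled data."""
    w, b = train_logistic(X_lab, y_lab)
    p = predict_proba(X_unlab, w, b)
    confident = (p > threshold) | (p < 1.0 - threshold)
    X_aug = np.vstack([X_lab, X_unlab[confident]])
    y_aug = np.concatenate([y_lab, (p[confident] > 0.5).astype(float)])
    return train_logistic(X_aug, y_aug)

# Toy usage: 10 labeled points, 90 unlabeled, two well-separated classes.
rng = np.random.default_rng(0)
n = 50
X0 = rng.normal(loc=-2.0, size=(n, 2))   # class 0 cluster
X1 = rng.normal(loc=2.0, size=(n, 2))    # class 1 cluster
X_lab = np.vstack([X0[:5], X1[:5]])
y_lab = np.array([0.0] * 5 + [1.0] * 5)
X_unlab = np.vstack([X0[5:], X1[5:]])    # sequences without functional labels

w, b = pseudo_label_ssl(X_lab, y_lab, X_unlab)
```

In the paper's setting, the "unlabeled pool" corresponds to DNA sequences from many mammalian genomes that lack functional data, and the student model is pre-trained on pseudo-labels before fine-tuning on the labeled human data.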

Details

Language :
English
ISSN :
1477-4054
Volume :
25
Issue :
6
Database :
MEDLINE
Journal :
Briefings in bioinformatics
Publication Type :
Academic Journal
Accession number :
39489607
Full Text :
https://doi.org/10.1093/bib/bbae560