Evaluating the representational power of pre-trained DNA language models for regulatory genomics.
- Source :
- BioRxiv : the preprint server for biology [bioRxiv] 2024 Sep 25. Date of Electronic Publication: 2024 Sep 25.
- Publication Year :
- 2024
Abstract
- The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
- Competing Interests: Nothing to declare.
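The abstract's central comparison, probing frozen gLM embeddings against a baseline trained on one-hot encoded sequences, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `glm_embed` is a hypothetical stand-in (stubbed with random features here) for extracting frozen embeddings from a real pre-trained gLM, and the Ridge probe, sequence length, and toy labels are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def one_hot_encode(seqs):
    """Map equal-length DNA strings to flattened one-hot arrays (A, C, G, T)."""
    lut = {"A": 0, "C": 1, "G": 2, "T": 3}
    X = np.zeros((len(seqs), len(seqs[0]), 4), dtype=np.float32)
    for i, s in enumerate(seqs):
        for j, base in enumerate(s):
            X[i, j, lut.get(base, 0)] = 1.0
    return X.reshape(len(seqs), -1)

def glm_embed(seqs, dim=768):
    """Hypothetical placeholder for frozen gLM features (e.g. mean-pooled
    last-layer hidden states); stubbed with random noise in this sketch."""
    return np.random.default_rng(1).normal(size=(len(seqs), dim)).astype(np.float32)

def probe(X_train, y_train, X_test, y_test):
    """Fit a linear probe on fixed features and report held-out R^2."""
    model = Ridge(alpha=1.0).fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# Toy usage: random 200-bp sequences with a scalar activity label each.
rng = np.random.default_rng(0)
seqs = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(512)]
y = rng.normal(size=512)
split = 400

X_oh, X_lm = one_hot_encode(seqs), glm_embed(seqs)
print("one-hot probe R^2:", probe(X_oh[:split], y[:split], X_oh[split:], y[split:]))
print("gLM probe R^2:   ", probe(X_lm[:split], y[:split], X_lm[split:], y[split:]))
```

The key design point the sketch captures is that the gLM's weights stay frozen: only the lightweight probe is trained on each downstream task, which is what lets the comparison isolate the quality of the learned representations rather than the model's capacity under fine-tuning.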
Details
- Language :
- English
- ISSN :
- 2692-8205
- Database :
- MEDLINE
- Journal :
- BioRxiv : the preprint server for biology
- Publication Type :
- Academic Journal
- Accession number :
- 38464101
- Full Text :
- https://doi.org/10.1101/2024.02.29.582810