1. Are There Seven Symbols for the Nucleotide-Based Genetic Code?
- Author
Kłóś, Adam; Płonka, Przemysław M.; Baczyński, Krzysztof
- Subjects
ZIPF'S law, NUCLEOTIDES, NATURAL languages, GENOMES, DNA
- Abstract
The common assumption is that genetic information is built on a four-symbol alphabet, i.e., DNA nucleotides, the smallest meaningful blocks of genomes are codon triplets, and the record of genetic information does not contain any asserted symbols playing the role of the space. It is, however, well known that some nucleotides in some codons are redundant. Our study, therefore, tests the alternative scenario. As the same nucleotide may play various semiotic roles, the genomic alphabet actually contains seven semiotic symbols. Consequently, the meaningful fragments of genomes (words) can be of different sizes, and there are asserted symbols in the record of genomic information. If this is true, then, similarly to natural languages, the frequency-range of these genomic words should follow the power-law distribution. The presented hypothesis was tested, in comparison to competitive (codon-based and n-tuple) forms of tokenization, on a wide range of genomic texts. [ABSTRACT FROM AUTHOR]
- Published
2024