General encoding of canonical k-mers
- Source :
- Peer Community Journal, Vol 3 (2023)
- Publication Year :
- 2023
- Publisher :
- Peer Community In, 2023.
Abstract
- To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σ^k k-mers and the interval [0, σ^k − 1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σ^k − 1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 − 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.
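
The naive encoding and the canonicalization step described in the abstract can be illustrated with a short Python sketch. It is not the minimal encoding the paper presents. It only shows the baseline the abstract contrasts against: each k-mer is read as a base-σ number, and the canonical form is the lexicographically smaller of a k-mer and its reverse complement. The function names (`encode`, `reverse_complement`, `canonical`, `canonical_codes`) and the rank order A < C < G < T are assumptions made for this illustration.

```python
from itertools import product

# Assumed DNA alphabet (sigma = 4) with ranks A=0, C=1, G=2, T=3.
ALPHABET = "ACGT"
RANK = {c: i for i, c in enumerate(ALPHABET)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer):
    """Naive bijective encoding: read the k-mer as a base-sigma integer."""
    code = 0
    for c in kmer:
        code = code * len(ALPHABET) + RANK[c]
    return code


def reverse_complement(kmer):
    """Reverse the k-mer and complement every character."""
    return "".join(COMPLEMENT[c] for c in reversed(kmer))


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_codes(k):
    """Sorted naive codes of all distinct canonical k-mers of length k."""
    kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return sorted({encode(canonical(kmer)) for kmer in kmers})


codes = canonical_codes(3)
# For k = 3 there are 4**3 / 2 = 32 canonical k-mers, but their naive
# codes are scattered over [0, 63] rather than filling [0, 31].
print(len(codes), max(codes) > 31)
```

For k = 3 the 64 possible k-mers collapse to 32 canonical ones, yet some of them (for example GAC, with naive code 33) fall outside [0, 31]. That gap between the number of canonical k-mers and the range their naive codes occupy is what a minimal encoding is meant to close.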
Details
- Language :
- English
- ISSN :
- 28043871
- Volume :
- 3
- Issue :
- -
- Database :
- Directory of Open Access Journals
- Journal :
- Peer Community Journal
- Publication Type :
- Academic Journal
- Accession number :
- edsdoj.463430ca95d4a17aa405c118a29ed58
- Document Type :
- article
- Full Text :
- https://doi.org/10.24072/pcjournal.323