General encoding of canonical k-mers

Authors :
Wittler, Roland
Source :
Peer Community Journal, Vol 3, Iss , Pp - (2023)
Publication Year :
2023
Publisher :
Peer Community In, 2023.

Abstract

To index or compare sequences efficiently, often k-mers, i.e., substrings of fixed length k, are used. For efficient indexing or storage, k-mers are often encoded as integers, e.g., applying some bijective mapping between all possible σ^k k-mers and the interval [0, σ^k − 1], where σ is the alphabet size. In many applications, e.g., when the reading direction of a DNA-sequence is ambiguous, canonical k-mers are considered, i.e., the lexicographically smaller of a given k-mer and its reverse (or reverse complement) is chosen as a representative. In naive encodings, canonical k-mers are not evenly distributed within the interval [0, σ^k − 1]. We present a minimal encoding of canonical k-mers on alphabets of arbitrary size, i.e., a mapping to the interval [0, σ^k/2 − 1]. The approach is introduced for canonicalization under reversal and extended to canonicalization under reverse complementation. We further present a space and time efficient bit-based implementation for the DNA alphabet.
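
The problem the abstract describes can be illustrated with a minimal sketch of a naive encoding. It is not the minimal encoding presented in the article. The ranking A=0, C=1, G=2, T=3 and all function names are assumptions chosen for illustration. Each k-mer is read as a base-σ number, and the canonical form is taken as the lexicographically smaller of the k-mer and its reverse complement.

```python
from itertools import product

# Assumed DNA alphabet and ranking (A=0, C=1, G=2, T=3); illustrative only.
ALPHABET = "ACGT"
RANK = {c: i for i, c in enumerate(ALPHABET)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def encode(kmer):
    """Naive encoding: interpret the k-mer as a base-sigma integer."""
    code = 0
    for c in kmer:
        code = code * len(ALPHABET) + RANK[c]
    return code


def reverse_complement(kmer):
    """Reverse the k-mer and complement every base."""
    return "".join(COMPLEMENT[c] for c in reversed(kmer))


def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def canonical_codes(k):
    """Sorted naive codes of all distinct canonical k-mers of length k."""
    kmers = ("".join(p) for p in product(ALPHABET, repeat=k))
    return sorted({encode(canonical(x)) for x in kmers})


if __name__ == "__main__":
    codes = canonical_codes(3)
    # Number of canonical 3-mers and the largest naive code among them.
    print(len(codes), max(codes))
```

For k = 3 there are 4^3 = 64 k-mers and 32 canonical ones, yet under this naive encoding some canonical k-mers (for example TCA) receive codes above 31, so the codes do not fill the interval [0, σ^k/2 − 1] without gaps.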

Details

Language :
English
ISSN :
28043871
Volume :
3
Issue :
-
Database :
Directory of Open Access Journals
Journal :
Peer Community Journal
Publication Type :
Academic Journal
Accession number :
edsdoj.463430ca95d4a17aa405c118a29ed58
Document Type :
article
Full Text :
https://doi.org/10.24072/pcjournal.323