8 Sequence Representation & Tokens
TODO:
- …
- …
8.1 From Sequence to Model: The Representation Problem
Every genomic deep learning model must answer a fundamental question: how should DNA sequence be represented as numerical input? The previous chapters employed one-hot encoding—a simple, lossless representation where each nucleotide becomes a 4-dimensional binary vector. This approach worked well for CNN-based models like DeepSEA (Chapter 5) and SpliceAI (Chapter 7), but the emergence of transformer-based language models introduced new considerations around tokenization, vocabulary design, and the trade-offs between sequence compression and resolution.
This chapter examines the evolution of sequence representation strategies, from one-hot encoding through k-mer tokenization to modern approaches including Byte Pair Encoding (BPE), single-nucleotide tokens, and biologically-informed tokenization schemes. The choice of representation profoundly affects what a model can learn, how efficiently it trains, and what context lengths it can practically achieve.
8.2 One-Hot Encoding: The CNN Baseline
8.2.1 Representation
One-hot encoding represents each nucleotide as a sparse binary vector:
| Nucleotide | Vector |
|---|---|
| A | [1, 0, 0, 0] |
| C | [0, 1, 0, 0] |
| G | [0, 0, 1, 0] |
| T | [0, 0, 0, 1] |
A sequence of length \(L\) becomes a matrix of dimensions \(4 \times L\), interpretable as four input “channels” (analogous to the three RGB channels of an image, with one extra).
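The following is a minimal sketch of one-hot encoding in Python using NumPy; the channel ordering and function name are illustrative conventions rather than the API of any particular library.

```python
import numpy as np

# Conventional channel order (A, C, G, T); any fixed order works as long as it
# is used consistently for training and inference.
NUC_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string as a 4 x L binary matrix (one channel per nucleotide)."""
    encoding = np.zeros((4, len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in NUC_TO_IDX:       # ambiguous bases such as N stay all-zero
            encoding[NUC_TO_IDX[base], pos] = 1.0
    return encoding

print(one_hot_encode("ACGT"))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```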
8.2.2 Advantages
One-hot encoding offers several properties that made it the default for CNN-based genomic models:
- Lossless: No information is discarded; every nucleotide is explicitly represented
- Single-nucleotide resolution: Enables detection of effects from individual SNPs
- Translation equivariance: Convolutional filters learn position-invariant motifs
- Simplicity: No preprocessing, vocabulary construction, or tokenizer training required
8.2.3 Limitations
For transformer architectures, one-hot encoding presents challenges:
- Sequence length: A 10 kb sequence requires 10,000 tokens, straining attention’s \(O(L^2)\) complexity
- No learned embeddings: Each nucleotide has a fixed, sparse representation rather than a learned dense embedding
- Context constraints: Practical transformer context windows of 512–4,096 tokens translate to only 512–4,096 bp—a tiny fraction of genes or regulatory regions
8.3 K-mer Tokenization: DNABERT’s Approach
8.3.1 Concept
K-mer tokenization treats overlapping subsequences of length \(k\) as tokens, analogous to words in natural language. DNABERT (2021) pioneered this approach for genomic transformers, using 6-mers (Ji et al. 2021).
For a 6-mer vocabulary:
- Vocabulary size: \(4^6 = 4,096\) possible tokens
- Each token represents 6 consecutive nucleotides
- Adjacent tokens overlap by \(k-1 = 5\) positions
8.3.2 Overlapping vs. Non-Overlapping
DNABERT used overlapping k-mers: for the sequence ACGTACGT, the 3-mer tokens are ACG, CGT, GTA, TAC, ACG, CGT, with token \(i\) starting at position \(i\) (positions 1–6).
This preserves positional information but creates computational redundancy—the sequence length in tokens equals the sequence length in nucleotides (minus \(k-1\)).
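A short sketch of the sliding-window scheme (the function names are illustrative); it reproduces the 3-mer example above and shows the non-overlapping variant for contrast.

```python
def overlapping_kmers(seq: str, k: int) -> list[str]:
    """Slide a window of width k with stride 1: one token per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def nonoverlapping_kmers(seq: str, k: int) -> list[str]:
    """Stride k: roughly L/k tokens; a trailing fragment shorter than k is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

print(overlapping_kmers("ACGTACGT", 3))     # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
print(nonoverlapping_kmers("ACGTACGT", 3))  # ['ACG', 'TAC']
```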
8.3.3 Problems with K-mer Tokenization
DNABERT-2 (2024) identified fundamental limitations of k-mer tokenization (Zhou et al. 2024):
- No sequence compression: Overlapping k-mers don’t reduce sequence length, so context window limitations persist
- Tokenization ambiguity: A single sequence position contributes to \(k\) different tokens, complicating variant effect interpretation
- Sample inefficiency: The model must learn that overlapping tokens share nucleotides, rather than this being encoded in the representation
- Computational overhead: Processing \(L\) overlapping tokens for an \(L\)-bp sequence is no more efficient than one-hot encoding
- Fixed vocabulary: The \(4^k\) vocabulary doesn’t adapt to corpus statistics; frequent and rare k-mers receive equal representation capacity
8.4 Byte Pair Encoding: Learning the Vocabulary
8.4.1 The BPE Algorithm
Byte Pair Encoding, originally a data compression algorithm, constructs a vocabulary by iteratively merging the most frequent adjacent token pairs in the training corpus:
- Initialize vocabulary with single nucleotides: {A, C, G, T}
- Count all adjacent token pairs in the corpus
- Merge the most frequent pair into a new token
- Repeat until desired vocabulary size is reached
This produces variable-length tokens that capture frequently occurring sequence patterns, achieving genuine sequence compression.
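The merge loop can be sketched in a few lines of Python. This toy implementation (function name and corpus are illustrative) learns merges over raw nucleotide lists; it is a conceptual sketch, not the tokenizer used by any specific model.

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules over a corpus of DNA sequences."""
    tokenized = [list(seq) for seq in corpus]      # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pair_counts = Counter()
        for tokens in tokenized:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair into a single, longer token.
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        for s, tokens in enumerate(tokenized):
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            tokenized[s] = out
    return merges

# Toy corpus: the frequent CG pair is merged first, then longer patterns build up.
print(learn_bpe_merges(["ACGACGACG", "CGCGCG"], num_merges=3))
# [('C', 'G'), ('A', 'CG'), ('ACG', 'ACG')]
```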
8.4.2 DNABERT-2’s BPE Implementation
DNABERT-2 replaced 6-mer tokenization with BPE, demonstrating substantial improvements (Zhou et al. 2024):
- 21× fewer parameters than comparable k-mer models
- 92× less GPU time in pretraining
- Non-overlapping tokens: Actual sequence compression, enabling longer effective context
The BPE vocabulary learns corpus statistics—repetitive elements, common motifs, and frequent sequence patterns receive dedicated tokens, while rare sequences are represented as shorter subunits.
8.4.3 GROVER’s Custom BPE
GROVER (Genome Rules Obtained Via Extracted Representations) trained BPE specifically on the human genome and selected its vocabulary using a custom next-k-mer prediction task (Sanabria et al. 2024). Analysis revealed that learned token embeddings encode:
- Frequency: Common tokens cluster separately from rare ones
- Sequence content: GC-rich versus AT-rich tokens segregate
- Length: Token length correlates with embedding dimensions
- Genomic localization: Some tokens appear primarily in repeats; others distribute broadly
8.5 Single-Nucleotide Tokenization: HyenaDNA
8.5.1 The Case for Maximum Resolution
While k-mer and BPE tokenization compress sequences, they sacrifice single-nucleotide resolution. A single nucleotide polymorphism (SNP) can completely alter protein function, yet multi-nucleotide tokens obscure the precise position and identity of variants.
HyenaDNA (2023) took the opposite approach: single-nucleotide tokens with no compression (Nguyen et al. 2023). Each nucleotide (A, C, G, T) is a separate token, preserving:
- Full resolution: Every nucleotide is independently represented
- Variant precision: SNP effects can be isolated to specific tokens
- No tokenization artifacts: No ambiguity about which token contains a variant
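In code, single-nucleotide tokenization is essentially a character-level vocabulary. The sketch below is in the spirit of HyenaDNA's tokenizer, but the specific special tokens and integer IDs are illustrative, not the model's actual mapping.

```python
# Character-level vocabulary: four nucleotides plus a few special tokens.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "A": 2, "C": 3, "G": 4, "T": 5, "N": 6}

def encode(seq: str) -> list[int]:
    """One token per nucleotide: an L-bp sequence becomes exactly L token IDs."""
    return [VOCAB.get(base, VOCAB["[UNK]"]) for base in seq.upper()]

print(encode("ACGTN"))   # [2, 3, 4, 5, 6]
```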
8.5.2 Scaling to 1 Million Base Pairs
The challenge with single-nucleotide tokens is sequence length. A 1 Mb region requires 1 million tokens—far beyond standard transformer capacity. HyenaDNA addresses this through the Hyena architecture, which replaces attention with implicit convolutions that scale sub-quadratically:
| Model | Architecture | Max Context | Complexity |
|---|---|---|---|
| DNABERT | Transformer | 512 bp | \(O(L^2)\) |
| Nucleotide Transformer | Transformer | 6 kb | \(O(L^2)\) |
| HyenaDNA | Hyena | 1 Mb | \(O(L \log L)\) |
HyenaDNA achieved a 500× increase in context length over dense attention models while maintaining single-nucleotide resolution.
8.5.3 Performance Characteristics
On Nucleotide Transformer benchmarks, HyenaDNA reached state-of-the-art on 12 of 18 datasets with orders of magnitude fewer parameters and less pretraining data. On GenomicBenchmarks, it surpassed prior state-of-the-art on 7 of 8 datasets by an average of +10 accuracy points.
Notably, HyenaDNA demonstrated the first use of in-context learning in genomics—performing tasks based on examples provided in the context window without fine-tuning.
8.6 Biologically-Informed Tokenization
8.6.1 The Central Dogma as Tokenization Guide: Life-Code
Standard tokenization treats DNA as a homogeneous string, ignoring the biological reality that different genomic regions serve different functions. Coding sequences follow codon structure (3-nucleotide units encoding amino acids), while noncoding regions have no such constraint.
Life-Code (2025) proposed codon-aware tokenization that respects the central dogma of molecular biology (Liu et al. 2025):
- Coding regions: Tokenized by codons (3-mers in reading frame)
- Noncoding regions: Tokenized by learned patterns
- Integration: Unified framework spanning DNA, RNA, and protein
This approach enables Life-Code to:
- Learn protein structure through knowledge distillation from protein language models
- Capture interactions between coding and noncoding regions
- Achieve state-of-the-art results across DNA, RNA, and protein tasks
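A toy illustration of the codon-aware idea, assuming the coding-sequence boundaries and reading frame are known from annotation; this is a conceptual sketch, not Life-Code's actual tokenizer, and a real system would apply a learned tokenizer to the noncoding flanks rather than leaving them as single nucleotides.

```python
def codon_aware_tokenize(seq: str, cds_start: int, cds_end: int) -> list[str]:
    """Group annotated coding sequence into in-frame codons.

    Nucleotides inside [cds_start, cds_end) become 3-mer codon tokens;
    flanking noncoding sequence stays single-nucleotide here for simplicity.
    """
    cds = seq[cds_start:cds_end]
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    upstream = list(seq[:cds_start])
    downstream = list(seq[cds_start + 3 * len(codons):])
    return upstream + codons + downstream

# The CDS (ATG GCA TAA) spans positions 2-10; the flanks stay single-nucleotide.
print(codon_aware_tokenize("GGATGGCATAACC", cds_start=2, cds_end=11))
# ['G', 'G', 'ATG', 'GCA', 'TAA', 'C', 'C']
```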
8.6.2 BioToken: Encoding Genomic Annotations
BioToken (2025) extends tokenization beyond sequence content to include genomic structural annotations (Medvedev et al. 2025):
- Variant encoding: Tokens that explicitly represent SNPs, insertions, and deletions
- Regulatory annotations: Encoding of known regulatory elements
- Functional context: Integration of gene structure, chromatin state, and other annotations
By incorporating biological inductive biases directly into the token representation, BioToken’s associated model (BioFM) achieves competitive or superior performance to specialized models (Enformer, SpliceAI) with significantly fewer parameters (265M).
8.7 The Context Length Evolution
The history of genomic deep learning shows a consistent trend toward longer sequence context:
| Era | Representative Models | Max Context | Tokenization |
|---|---|---|---|
| 2015–2017 | DeepSEA, DeepBind | 1 kb | One-hot |
| 2018–2020 | ExPecto, SpliceAI | 10–40 kb | One-hot |
| 2021 | DNABERT, Enformer | 512 bp – 200 kb | K-mer / One-hot |
| 2022–2023 | Nucleotide Transformer | 6 kb | K-mer |
| 2023–2024 | HyenaDNA, Caduceus | 1 Mb | Single-nucleotide |
| 2025 | Evo 2 | 1 Mb | Single-nucleotide (byte-level) |
This expansion reflects biological reality: regulatory elements can influence genes from hundreds of kilobases away, and understanding genome function requires integrating information across these distances.
8.8 Trade-offs in Tokenization Design
8.8.1 Compression vs. Resolution
| Strategy | Compression | Resolution | Variant Handling |
|---|---|---|---|
| One-hot | None | Single-nucleotide | Precise |
| Overlapping k-mer | None | K-nucleotide | Ambiguous |
| Non-overlapping k-mer | ~K× | K-nucleotide | Frame-dependent |
| BPE | Variable | Variable | Context-dependent |
| Single-nucleotide | None | Single-nucleotide | Precise |
Higher compression enables longer context but loses precision for variant effects. BPE offers a middle ground with adaptive compression, but variant positions relative to token boundaries can affect predictions.
8.8.2 Vocabulary Size Considerations
| Tokenization | Typical Vocabulary Size |
|---|---|
| One-hot / Single-nucleotide | 4 (+ special tokens) |
| 6-mer | 4,096 |
| BPE (DNABERT-2) | 4,096–32,000 |
| Codon-aware | ~64 (codons) + noncoding |
Larger vocabularies increase embedding table size but may capture more complex patterns. Smaller vocabularies are parameter-efficient but require the model to learn compositional structure.
8.8.3 Computational Efficiency
For a sequence of length \(L\) bp:
| Tokenization | Tokens | Attention Cost |
|---|---|---|
| One-hot | \(L\) | \(O(L^2)\) |
| Non-overlapping k-mer | \(L/k\) | \(O(L^2/k^2)\) |
| BPE (average compression \(c\)) | \(L/c\) | \(O(L^2/c^2)\) |
BPE’s variable compression can achieve substantial speedups, but the benefit depends on corpus statistics and vocabulary size.
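As a back-of-the-envelope illustration, the snippet below compares token counts and relative attention cost for a 10 kb sequence, assuming an average BPE compression of roughly 5 nucleotides per token (the exact ratio depends on the corpus and vocabulary).

```python
# Token counts and attention cost relative to one-token-per-bp, for L = 10 kb.
L = 10_000
schemes = {
    "one-hot / single-nucleotide": L,
    "non-overlapping 6-mer": L // 6,
    "BPE (~5 nt/token, assumed)": L // 5,
}
for name, n_tokens in schemes.items():
    rel_cost = (n_tokens / L) ** 2   # O(n^2) attention, normalized to n = L
    print(f"{name:30s} {n_tokens:6d} tokens  ~{rel_cost:.3f}x attention cost")
```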
8.9 Implications for Variant Effect Prediction
Tokenization choice directly affects variant effect prediction:
8.9.1 Single-Nucleotide Tokens (HyenaDNA, Evo 2)
- Reference and alternate alleles occupy the same token position
- Effects are precisely localized
- No tokenization artifacts
8.9.2 K-mer Tokens
- A single SNP changes \(k\) overlapping tokens
- Must aggregate effects across affected tokens
- Boundary effects if variant is at token junction
8.9.3 BPE Tokens
- Variant may fall within a token or at token boundary
- Effect interpretation depends on token segmentation
- Re-tokenization may be needed for alternate allele
For clinical variant interpretation, single-nucleotide resolution is often preferred despite computational costs, as subtle genetic variations can have major phenotypic consequences.
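The difference is easy to see in code: under single-nucleotide tokenization a SNP changes exactly one token, while under overlapping k-mers it perturbs \(k\) consecutive tokens. The snippet below is a self-contained illustration; the sequences and helper function are made up for the example.

```python
def overlapping_kmers(seq: str, k: int) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

ref = "ACGTACGTAC"
alt = "ACGTCCGTAC"   # single SNP: A -> C at 0-based position 4

# Single-nucleotide tokens: exactly one token differs, at the variant position.
print([i for i, (r, a) in enumerate(zip(ref, alt)) if r != a])           # [4]

# Overlapping 3-mers: the same SNP perturbs k = 3 consecutive tokens.
ref_toks, alt_toks = overlapping_kmers(ref, 3), overlapping_kmers(alt, 3)
print([i for i, (r, a) in enumerate(zip(ref_toks, alt_toks)) if r != a])  # [2, 3, 4]
```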
8.10 The Emerging Consensus
Recent developments suggest convergence toward:
- Single-nucleotide resolution for maximum precision, enabled by sub-quadratic architectures (Hyena, Mamba, state space models)
- Learned embeddings rather than fixed one-hot vectors, allowing the model to discover meaningful nucleotide representations
- Biologically-informed augmentation where appropriate—encoding codons in coding regions, incorporating annotations, or using species-specific vocabularies
- Hybrid approaches combining the efficiency of compression with resolution where needed
The choice ultimately depends on the task: variant effect prediction demands high resolution, while tasks like species classification or repeat annotation may benefit from compression.
8.11 References in Context
The models discussed in this chapter set the stage for the genomic language models covered in Chapter 10. Understanding tokenization choices clarifies why models like the Nucleotide Transformer use 6-mers (Dalla-Torre et al. 2023), why DNABERT-2 switched to BPE, and why HyenaDNA’s single-nucleotide approach enabled unprecedented context lengths. The hybrid architectures of Chapter 11 (Enformer, Borzoi) largely retained one-hot encoding for its precision, while the long-range models of Chapter 12 explore how sub-quadratic architectures enable single-nucleotide tokenization at genomic scale.