8 Sequence Representation & Tokens
TODO:
- …
- …
8.1 From Sequence to Model: The Representation Problem
Every genomic deep learning model must answer a fundamental question: how should DNA sequence be represented as numerical input? The previous chapters employed one-hot encoding—a simple, lossless representation where each nucleotide becomes a 4-dimensional binary vector. This approach worked well for CNN-based models like DeepSEA (Chapter 5) and SpliceAI (Chapter 7), but the emergence of transformer-based language models introduced new considerations around tokenization, vocabulary design, and the trade-offs between sequence compression and resolution.
This chapter examines the evolution of sequence representation strategies, from one-hot encoding through k-mer tokenization to modern approaches including Byte Pair Encoding (BPE), single-nucleotide tokens, and biologically-informed tokenization schemes. The choice of representation profoundly affects what a model can learn, how efficiently it trains, and what context lengths it can practically achieve.
8.2 One-Hot Encoding: The CNN Baseline
8.2.1 Representation
One-hot encoding represents each nucleotide as a sparse binary vector:
| Nucleotide | Vector |
|---|---|
| A | [1, 0, 0, 0] |
| C | [0, 1, 0, 0] |
| G | [0, 0, 1, 0] |
| T | [0, 0, 0, 1] |
A sequence of length \(L\) becomes a matrix of dimensions \(4 \times L\), interpretable as four input “channels” (analogous to the three RGB channels of an image, with one extra).
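The following is a minimal sketch of one-hot encoding in Python using NumPy; the channel ordering and function name are illustrative conventions rather than the API of any particular library.

```python
import numpy as np

# Conventional channel order (A, C, G, T); any fixed order works as long as it
# is used consistently for training and inference.
NUC_TO_IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq: str) -> np.ndarray:
    """Encode a DNA string as a 4 x L binary matrix (one channel per nucleotide)."""
    encoding = np.zeros((4, len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in NUC_TO_IDX:       # ambiguous bases such as N stay all-zero
            encoding[NUC_TO_IDX[base], pos] = 1.0
    return encoding

print(one_hot_encode("ACGT"))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```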
8.2.2 Advantages
One-hot encoding offers several properties that made it the default for CNN-based genomic models:
- Lossless: No information is discarded; every nucleotide is explicitly represented
- Single-nucleotide resolution: Enables detection of effects from individual SNPs
- Translation equivariance: Convolutional filters learn position-invariant motifs
- Simplicity: No preprocessing, vocabulary construction, or tokenizer training required
8.2.3 Limitations
For transformer architectures, one-hot encoding presents challenges:
- Sequence length: A 10 kb sequence requires 10,000 tokens, straining attention’s \(O(L^2)\) complexity
- No learned embeddings: Each nucleotide has a fixed, sparse representation rather than a learned dense embedding
- Context constraints: Practical transformer context windows of 512–4,096 tokens translate to only 512–4,096 bp—a tiny fraction of genes or regulatory regions
8.3 K-mer Tokenization: DNABERT’s Approach
8.3.1 Concept
K-mer tokenization treats overlapping subsequences of length \(k\) as tokens, analogous to words in natural language. DNABERT (2021) pioneered this approach for genomic transformers, using 6-mers (Ji et al. 2021).
For a 6-mer vocabulary:
- Vocabulary size: \(4^6 = 4,096\) possible tokens
- Each token represents 6 consecutive nucleotides
- Adjacent tokens overlap by \(k-1 = 5\) positions
8.3.2 Overlapping vs. Non-Overlapping
DNABERT used overlapping k-mers: for the sequence ACGTACGT, the 3-mer tokens are ACG, CGT, GTA, TAC, ACG, CGT, with token \(i\) starting at position \(i\) (positions 1–6).
This preserves positional information but creates computational redundancy—the sequence length in tokens equals the sequence length in nucleotides (minus \(k-1\)).
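A short sketch of the sliding-window scheme (the function names are illustrative); it reproduces the 3-mer example above and shows the non-overlapping variant for contrast.

```python
def overlapping_kmers(seq: str, k: int) -> list[str]:
    """Slide a window of width k with stride 1: one token per start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def nonoverlapping_kmers(seq: str, k: int) -> list[str]:
    """Stride k: roughly L/k tokens; a trailing fragment shorter than k is dropped."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

print(overlapping_kmers("ACGTACGT", 3))     # ['ACG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
print(nonoverlapping_kmers("ACGTACGT", 3))  # ['ACG', 'TAC']
```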
8.3.3 Problems with K-mer Tokenization
DNABERT-2 (2024) identified fundamental limitations of k-mer tokenization (Zhou et al. 2024):
- No sequence compression: Overlapping k-mers don’t reduce sequence length, so context window limitations persist
- Tokenization ambiguity: A single sequence position contributes to \(k\) different tokens, complicating variant effect interpretation
- Sample inefficiency: The model must learn that overlapping tokens share nucleotides, rather than this being encoded in the representation
- Computational overhead: Processing \(L\) overlapping tokens for an \(L\)-bp sequence is no more efficient than one-hot encoding
- Fixed vocabulary: The \(4^k\) vocabulary doesn’t adapt to corpus statistics; frequent and rare k-mers receive equal representation capacity
8.4 Byte Pair Encoding: Learning the Vocabulary
8.4.1 The BPE Algorithm
Byte Pair Encoding, originally a data compression algorithm, constructs a vocabulary by iteratively merging the most frequent adjacent token pairs in the training corpus:
- Initialize vocabulary with single nucleotides: {A, C, G, T}
- Count all adjacent token pairs in the corpus
- Merge the most frequent pair into a new token
- Repeat until desired vocabulary size is reached
This produces variable-length tokens that capture frequently occurring sequence patterns, achieving genuine sequence compression.
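The merge loop can be sketched in a few lines of Python. This toy implementation (function name and corpus are illustrative) learns merges over raw nucleotide lists; it is a conceptual sketch, not the tokenizer used by any specific model.

```python
from collections import Counter

def learn_bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules over a corpus of DNA sequences."""
    tokenized = [list(seq) for seq in corpus]      # start from single nucleotides
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pair_counts = Counter()
        for tokens in tokenized:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair into a single, longer token.
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        merged = best[0] + best[1]
        for s, tokens in enumerate(tokenized):
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            tokenized[s] = out
    return merges

# Toy corpus: the frequent CG pair is merged first, then longer patterns build up.
print(learn_bpe_merges(["ACGACGACG", "CGCGCG"], num_merges=3))
# [('C', 'G'), ('A', 'CG'), ('ACG', 'ACG')]
```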
8.4.2 DNABERT-2’s BPE Implementation
DNABERT-2 replaced 6-mer tokenization with BPE, demonstrating substantial improvements (Zhou et al. 2024):
- 21× fewer parameters than comparable k-mer models
- 92× less GPU time in pretraining
- Non-overlapping tokens: Actual sequence compression, enabling longer effective context
The BPE vocabulary learns corpus statistics—repetitive elements, common motifs, and frequent sequence patterns receive dedicated tokens, while rare sequences are represented as shorter subunits.
8.4.3 GROVER’s Custom BPE
GROVER (Genome Rules Obtained Via Extracted Representations) trained BPE specifically on the human genome and selected its vocabulary using a custom next-k-mer prediction task (Sanabria et al. 2024). Analysis revealed that learned token embeddings encode:
- Frequency: Common tokens cluster separately from rare ones
- Sequence content: GC-rich versus AT-rich tokens segregate
- Length: Token length correlates with embedding dimensions
- Genomic localization: Some tokens appear primarily in repeats; others distribute broadly
8.5 Single-Nucleotide Tokenization: HyenaDNA
8.5.1 The Case for Maximum Resolution
While k-mer and BPE tokenization compress sequences, they sacrifice single-nucleotide resolution. A single nucleotide polymorphism (SNP) can completely alter protein function, yet multi-nucleotide tokens obscure the precise position and identity of variants.
HyenaDNA (2023) took the opposite approach: single-nucleotide tokens with no compression (Nguyen et al. 2023). Each nucleotide (A, C, G, T) is a separate token, preserving:
- Full resolution: Every nucleotide is independently represented
- Variant precision: SNP effects can be isolated to specific tokens
- No tokenization artifacts: No ambiguity about which token contains a variant
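In code, single-nucleotide tokenization is essentially a character-level vocabulary. The sketch below is in the spirit of HyenaDNA's tokenizer, but the specific special tokens and integer IDs are illustrative, not the model's actual mapping.

```python
# Character-level vocabulary: four nucleotides plus a few special tokens.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "A": 2, "C": 3, "G": 4, "T": 5, "N": 6}

def encode(seq: str) -> list[int]:
    """One token per nucleotide: an L-bp sequence becomes exactly L token IDs."""
    return [VOCAB.get(base, VOCAB["[UNK]"]) for base in seq.upper()]

print(encode("ACGTN"))   # [2, 3, 4, 5, 6]
```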
8.5.2 Scaling to 1 Million Base Pairs
The challenge with single-nucleotide tokens is sequence length. A 1 Mb region requires 1 million tokens—far beyond standard transformer capacity. HyenaDNA addresses this through the Hyena architecture, which replaces attention with implicit convolutions that scale sub-quadratically:
| Model | Architecture | Max Context | Complexity |
|---|---|---|---|
| DNABERT | Transformer | 512 bp | \(O(L^2)\) |
| Nucleotide Transformer | Transformer | 6 kb | \(O(L^2)\) |
| HyenaDNA | Hyena | 1 Mb | \(O(L \log L)\) |
HyenaDNA achieved a 500× increase in context length over dense attention models while maintaining single-nucleotide resolution.
8.5.3 Performance Characteristics
On Nucleotide Transformer benchmarks, HyenaDNA reached state-of-the-art on 12 of 18 datasets with orders of magnitude fewer parameters and less pretraining data. On GenomicBenchmarks, it surpassed prior state-of-the-art on 7 of 8 datasets by an average of +10 accuracy points.
Notably, HyenaDNA demonstrated the first use of in-context learning in genomics—performing tasks based on examples provided in the context window without fine-tuning.
8.6 Biologically-Informed Tokenization
8.6.1 The Central Dogma as Tokenization Guide: Life-Code
Standard tokenization treats DNA as a homogeneous string, ignoring the biological reality that different genomic regions serve different functions. Coding sequences follow codon structure (3-nucleotide units encoding amino acids), while noncoding regions have no such constraint.
Life-Code (2025) proposed codon-aware tokenization that respects the central dogma of molecular biology (Liu et al. 2025):
- Coding regions: Tokenized by codons (3-mers in reading frame)
- Noncoding regions: Tokenized by learned patterns
- Integration: Unified framework spanning DNA, RNA, and protein
This approach enables Life-Code to:
- Learn protein structure through knowledge distillation from protein language models
- Capture interactions between coding and noncoding regions
- Achieve state-of-the-art results across DNA, RNA, and protein tasks
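A toy illustration of the codon-aware idea, assuming the coding-sequence boundaries and reading frame are known from annotation; this is a conceptual sketch, not Life-Code's actual tokenizer, and a real system would apply a learned tokenizer to the noncoding flanks rather than leaving them as single nucleotides.

```python
def codon_aware_tokenize(seq: str, cds_start: int, cds_end: int) -> list[str]:
    """Group annotated coding sequence into in-frame codons.

    Nucleotides inside [cds_start, cds_end) become 3-mer codon tokens;
    flanking noncoding sequence stays single-nucleotide here for simplicity.
    """
    cds = seq[cds_start:cds_end]
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    upstream = list(seq[:cds_start])
    downstream = list(seq[cds_start + 3 * len(codons):])
    return upstream + codons + downstream

# The CDS (ATG GCA TAA) spans positions 2-10; the flanks stay single-nucleotide.
print(codon_aware_tokenize("GGATGGCATAACC", cds_start=2, cds_end=11))
# ['G', 'G', 'ATG', 'GCA', 'TAA', 'C', 'C']
```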
8.6.2 BioToken: Encoding Genomic Annotations
BioToken (2025) extends tokenization beyond sequence content to include genomic structural annotations (Medvedev et al. 2025):
- Variant encoding: Tokens that explicitly represent SNPs, insertions, and deletions
- Regulatory annotations: Encoding of known regulatory elements
- Functional context: Integration of gene structure, chromatin state, and other annotations
By incorporating biological inductive biases directly into the token representation, BioToken’s associated model (BioFM) achieves competitive or superior performance to specialized models (Enformer, SpliceAI) with significantly fewer parameters (265M).
8.7 The Context Length Evolution
The history of genomic deep learning shows a consistent trend toward longer sequence context:
| Era | Representative Models | Max Context | Tokenization |
|---|---|---|---|
| 2015–2017 | DeepSEA, DeepBind | 1 kb | One-hot |
| 2018–2020 | ExPecto, SpliceAI | 10–40 kb | One-hot |
| 2021 | DNABERT, Enformer | 512 bp – 200 kb | K-mer / One-hot |
| 2022–2023 | Nucleotide Transformer | 6 kb | K-mer |
| 2023–2024 | HyenaDNA, Caduceus | 1 Mb | Single-nucleotide |
| 2025 | Evo 2 | 1 Mb | Single-nucleotide (byte-level) |
This expansion reflects biological reality: regulatory elements can influence genes from hundreds of kilobases away, and understanding genome function requires integrating information across these distances.
8.8 Trade-offs in Tokenization Design
8.8.1 Compression vs. Resolution
| Strategy | Compression | Resolution | Variant Handling |
|---|---|---|---|
| One-hot | None | Single-nucleotide | Precise |
| Overlapping k-mer | None | K-nucleotide | Ambiguous |
| Non-overlapping k-mer | ~K× | K-nucleotide | Frame-dependent |
| BPE | Variable | Variable | Context-dependent |
| Single-nucleotide | None | Single-nucleotide | Precise |
Higher compression enables longer context but loses precision for variant effects. BPE offers a middle ground with adaptive compression, but variant positions relative to token boundaries can affect predictions.
8.8.2 Vocabulary Size Considerations
| Tokenization | Typical Vocabulary Size |
|---|---|
| One-hot / Single-nucleotide | 4 (+ special tokens) |
| 6-mer | 4,096 |
| BPE (DNABERT-2) | 4,096–32,000 |
| Codon-aware | ~64 (codons) + noncoding |
Larger vocabularies increase embedding table size but may capture more complex patterns. Smaller vocabularies are parameter-efficient but require the model to learn compositional structure.
8.8.3 Computational Efficiency
For a sequence of length \(L\) bp:
| Tokenization | Tokens | Attention Cost |
|---|---|---|
| One-hot | \(L\) | \(O(L^2)\) |
| Non-overlapping k-mer | \(L/k\) | \(O(L^2/k^2)\) |
| BPE (average compression \(c\)) | \(L/c\) | \(O(L^2/c^2)\) |
BPE’s variable compression can achieve substantial speedups, but the benefit depends on corpus statistics and vocabulary size.
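As a back-of-the-envelope illustration, the snippet below compares token counts and relative attention cost for a 10 kb sequence, assuming an average BPE compression of roughly 5 nucleotides per token (the exact ratio depends on the corpus and vocabulary).

```python
# Token counts and attention cost relative to one-token-per-bp, for L = 10 kb.
L = 10_000
schemes = {
    "one-hot / single-nucleotide": L,
    "non-overlapping 6-mer": L // 6,
    "BPE (~5 nt/token, assumed)": L // 5,
}
for name, n_tokens in schemes.items():
    rel_cost = (n_tokens / L) ** 2   # O(n^2) attention, normalized to n = L
    print(f"{name:30s} {n_tokens:6d} tokens  ~{rel_cost:.3f}x attention cost")
```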
8.9 Implications for Variant Effect Prediction
Tokenization choice directly affects variant effect prediction:
8.9.1 Single-Nucleotide Tokens (HyenaDNA, Evo 2)
- Reference and alternate alleles occupy the same token position
- Effects are precisely localized
- No tokenization artifacts
8.9.2 K-mer Tokens
- A single SNP changes \(k\) overlapping tokens
- Must aggregate effects across affected tokens
- Boundary effects if variant is at token junction
8.9.3 BPE Tokens
- Variant may fall within a token or at token boundary
- Effect interpretation depends on token segmentation
- Re-tokenization may be needed for alternate allele
For clinical variant interpretation, single-nucleotide resolution is often preferred despite computational costs, as subtle genetic variations can have major phenotypic consequences.
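The difference is easy to see in code: under single-nucleotide tokenization a SNP changes exactly one token, while under overlapping k-mers it perturbs \(k\) consecutive tokens. The snippet below is a self-contained illustration; the sequences and helper function are made up for the example.

```python
def overlapping_kmers(seq: str, k: int) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

ref = "ACGTACGTAC"
alt = "ACGTCCGTAC"   # single SNP: A -> C at 0-based position 4

# Single-nucleotide tokens: exactly one token differs, at the variant position.
print([i for i, (r, a) in enumerate(zip(ref, alt)) if r != a])           # [4]

# Overlapping 3-mers: the same SNP perturbs k = 3 consecutive tokens.
ref_toks, alt_toks = overlapping_kmers(ref, 3), overlapping_kmers(alt, 3)
print([i for i, (r, a) in enumerate(zip(ref_toks, alt_toks)) if r != a])  # [2, 3, 4]
```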
8.10 The Emerging Consensus
Recent developments suggest convergence toward:
- Single-nucleotide resolution for maximum precision, enabled by sub-quadratic architectures (Hyena, Mamba, state space models)
- Learned embeddings rather than fixed one-hot vectors, allowing the model to discover meaningful nucleotide representations
- Biologically-informed augmentation where appropriate—encoding codons in coding regions, incorporating annotations, or using species-specific vocabularies
- Hybrid approaches combining the efficiency of compression with resolution where needed
The choice ultimately depends on the task: variant effect prediction demands high resolution, while tasks like species classification or repeat annotation may benefit from compression.
8.11 References in Context
The models discussed in this chapter set the stage for the genomic language models covered in Chapter 10. Understanding tokenization choices clarifies why models like the Nucleotide Transformer use 6-mers (Dalla-Torre et al. 2023), why DNABERT-2 switched to BPE, and why HyenaDNA’s single-nucleotide approach enabled unprecedented context lengths. The hybrid architectures of Chapter 11 (Enformer, Borzoi) largely retained one-hot encoding for its precision, while the long-range models of Chapter 12 explore how sub-quadratic architectures enable single-nucleotide tokenization at genomic scale.