1 Sequencing: From Reads to Variants
TODO:
- Citations:
- …
- Add notes on imputation and boosting
1.1 The Challenge of NGS Data
Next-generation sequencing (NGS) has transformed genomics by making it routine to generate tens to hundreds of gigabases of sequence from a single individual in a few days. Modern instruments produce short reads—typically 100–300 bp paired-end Illumina reads—at very high throughput, but with non-trivial error profiles including substitutions, context-specific errors, and base quality uncertainties. These reads are then aligned to imperfect reference genomes that omit structural variation and some segmental duplications (Goodwin, McPherson, and McCombie 2016).
Turning these raw reads into a reliable list of variants is therefore not just a matter of comparing strings. Variant calling pipelines must disentangle sequencing errors (instrument noise, PCR artifacts), alignment artifacts (mis-mapping in repeats, paralogous regions, pseudogenes), and genuine biological variation (germline variants, somatic mutations, mosaicism). Historically, this was addressed by complex, modular pipelines combining probabilistic models and hand-crafted heuristics (Nielsen et al. 2011). Deep learning now plays an important role in simplifying and improving parts of this stack, but it is helpful to understand the classical pipeline first.
1.2 Targeting Strategies: Panels, Exomes, and Genomes
NGS is not a single technology; it is deployed in different targeting strategies, each with distinct trade-offs.
1.2.1 Targeted and Panel Sequencing
Targeted gene panels capture tens to hundreds of genes selected for a clinical indication such as cardiomyopathy or hereditary cancer syndromes. These panels offer high depth in a limited region (often 200–500×), relatively low cost per sample, and simple interpretation workflows tied to well-curated gene lists. However, panels miss novel genes outside their design, most structural variants and non-coding regulatory changes, and opportunities for reanalysis as gene–disease knowledge evolves.
1.2.2 Whole-Exome Sequencing
Whole-exome sequencing (WES) enriches coding exons and some flanking splice regions genome-wide. Typical coverage is 80–150× over exonic targets, though capture efficiency varies with GC content and across repetitive exons. Non-coding regions are largely unobserved.
WES has been especially successful for Mendelian disease gene discovery and diagnostic workflows. At the same time, it misses non-coding and structural causes, has non-uniform coverage leading to heterogeneous sensitivity across genes, and requires careful handling of capture biases and batch effects.
1.2.3 Whole-Genome Sequencing
Whole-genome sequencing (WGS) samples nearly all bases in the genome. Typical coverage is 30–60× across the genome, with more uniform depth than WES. Because there is no capture step, WGS produces fewer batch-specific artifacts and enables detection of non-coding variants, structural variants, and copy-number changes along with SNVs and indels.
WGS is increasingly favored for new large cohorts and rare disease diagnostics despite its higher cost: the data are reusable for many downstream analyses (GWAS, PGS, rare variant burden tests), the absence of a capture step simplifies pipelines by eliminating the need to track changing capture designs, and the resulting variant catalogs are more complete inputs for the models discussed later in this book.
Throughout this text, we assume a WES/WGS-style pipeline where we start from aligned reads and aim to call high-confidence SNVs and small indels.
1.2.4 Long-Read Sequencing Technologies
While short-read Illumina sequencing dominates population-scale studies, long-read technologies are increasingly important for resolving complex genomic regions and structural variation.
Pacific Biosciences (PacBio) HiFi produces reads of 10–25 kb with accuracy exceeding 99.9% through circular consensus sequencing (Wenger et al. 2019). These reads can span repetitive elements, segmental duplications, and structural variants that confound short-read alignment. Oxford Nanopore Technologies (ONT) generates ultra-long reads—routinely 10–100 kb, with some exceeding 1 Mb—at somewhat lower per-base accuracy (roughly 95–99% for recent chemistries). ONT’s portability and real-time sequencing enable novel applications ranging from field diagnostics to direct RNA sequencing (Dabernig-Heinz et al. 2024).
Long reads transform variant calling in several ways. Structural variants—deletions, insertions, inversions, and complex rearrangements that are invisible or ambiguous in short-read data—become directly observable. Tandem repeats, segmental duplications, and transposable element insertions can be traversed rather than collapsed. A single long read can span many heterozygous sites, enabling direct, read-backed phasing over tens of kilobases. The T2T-CHM13 reference genome, completed with long reads, added approximately 200 Mb of previously unresolved sequence, including centromeres and acrocentric chromosome arms (Nurk et al. 2022).
Specialized variant callers have been developed for long-read data. DeepVariant includes models trained on PacBio and ONT data, while tools like Clair3 and PEPPER-Margin-DeepVariant are optimized for nanopore error profiles (Poplin et al. 2018; Zheng et al. 2022; Shafin et al. 2021). For structural variants, callers such as Sniffles, pbsv, and cuteSV exploit the unique properties of long reads (Smolka et al. 2024; “PacificBiosciences/Pbsv” 2025; Jiang et al. 2020).
This chapter focuses on short-read pipelines, which remain the workhorse for large cohorts and cost-sensitive applications. However, hybrid approaches—combining short-read depth with long-read phasing and structural variant detection—are increasingly common, and the models in later chapters must accommodate variants discovered by either technology.
1.3 Classical Variant Calling Pipelines
While every institution implements its own details, a classical short-read pipeline has several common stages.
The process begins with base calling and demultiplexing, where instrument software converts fluorescent images to base calls and quality scores, and reads are demultiplexed by barcode into sample-specific FASTQ files.
Next comes read alignment, in which short reads are aligned to a reference genome (such as GRCh38 or T2T-CHM13) using seed-and-extend mappers such as BWA-MEM or minimap2 (Li 2013, 2018). Aligners must cope with mismatches, small indels, and repetitive sequence.
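As a concrete (if minimal) illustration, the sketch below drives a typical bwa mem and samtools sort invocation from Python. The file names, thread counts, and output paths are placeholders, and production pipelines add read-group tags, duplicate marking, and the other post-alignment steps described next.

```python
import subprocess

# Minimal alignment sketch (paths are placeholders).
# bwa mem emits SAM on stdout; samtools sort writes a coordinate-sorted BAM.
reference = "GRCh38.fa"          # assumed to already be bwa-indexed (bwa index GRCh38.fa)
fastq_r1, fastq_r2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"
output_bam = "sample.sorted.bam"

align_cmd = (
    f"bwa mem -t 8 {reference} {fastq_r1} {fastq_r2} "
    f"| samtools sort -@ 4 -o {output_bam} -"
)
subprocess.run(align_cmd, shell=True, check=True)

# Index the BAM so downstream tools can fetch regions efficiently.
subprocess.run(["samtools", "index", output_bam], check=True)
```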
Post-alignment processing follows, including marking or removing PCR duplicates, base quality score recalibration (BQSR) to model systematic quality score errors, and local realignment around indels in older pipelines.
Per-sample variant calling then takes place, where tools like the Genome Analysis Toolkit (GATK) HaplotypeCaller locally reassemble reads into candidate haplotypes (using de Bruijn-like graphs) and evaluate read support with pair hidden Markov models (McKenna et al. 2010). These tools produce candidate variants with genotype likelihoods for each sample.
Finally, for cohort variant calling, joint genotyping and cohort refinement combine per-sample likelihoods to enforce a consistent set of variants across individuals. Raw calls are then filtered to distinguish true variants from artifacts. Simple hard filters apply fixed cutoffs to individual metrics, but GATK’s Variant Quality Score Recalibration (VQSR) takes a more sophisticated approach: it fits a Gaussian mixture model in the multi-dimensional space of variant annotations (depth, strand bias, mapping quality, read position bias, etc.), using variants at known polymorphic sites as a training set. Each candidate receives a recalibrated score reflecting how well it matches the learned “true variant” distribution, allowing joint consideration of multiple quality axes rather than independent hard thresholds.
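To make the intuition behind VQSR concrete, the sketch below fits a Gaussian mixture to a matrix of variant annotations with scikit-learn and scores candidates by their likelihood under the fitted model. This is an illustrative analogy rather than GATK’s implementation; the annotation values and the pass threshold are invented.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Rows: candidate variants; columns: annotations such as QD, FS, MQ, ReadPosRankSum.
# In VQSR these come from VCF INFO fields; here they are random placeholders.
rng = np.random.default_rng(0)
known_sites = rng.normal(loc=[20, 2, 60, 0], scale=[5, 1, 2, 1], size=(5000, 4))
candidates = rng.normal(loc=[15, 4, 55, 0], scale=[8, 3, 8, 2], size=(1000, 4))

# Fit a mixture model on annotations at known polymorphic sites ("probable true variants").
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(known_sites)

# Score each candidate by how well it fits the learned "true variant" distribution.
scores = gmm.score_samples(candidates)                         # log-likelihood per candidate
threshold = np.percentile(gmm.score_samples(known_sites), 1)   # retain ~99% of truth-like sites
pass_filter = scores >= threshold
print(f"{pass_filter.mean():.1%} of candidates pass the mixture-model filter")
```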
These steps are encoded in pipelines like GATK Best Practices and similar frameworks. The key point is that each step uses hand-designed summary features and mechanistic models chosen by experts, not learned end-to-end (Van der Auwera et al. 2018).
1.3.1 Probabilistic Framework
At the core of GATK’s HaplotypeCaller is a Bayesian genotype likelihood model. For a candidate genotype \(G\) at a given site, the posterior probability given the read data \(D\) is:
\[P(G \mid D) \propto P(G) \prod_{r \in \text{reads}} P(r \mid G)\]
where \(P(G)\) is a prior over genotypes (often assuming Hardy-Weinberg equilibrium) and \(P(r \mid G)\) is the likelihood of observing read \(r\) given genotype \(G\). Computing \(P(r \mid G)\) is non-trivial: GATK uses a pair hidden Markov model (pair-HMM) to marginalize over possible alignments between the read and candidate haplotypes, incorporating base quality scores to weight the contribution of each base.
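The sketch below evaluates this posterior under a deliberately simplified read model: rather than a pair-HMM over alignments, each read contributes an independent per-base error term derived from its Phred quality. It is not GATK’s likelihood computation, but it shows how the prior, the per-read likelihoods, and the normalization fit together.

```python
import math

def genotype_posteriors(observed_bases, base_quals, ref="A", alt="G", af=0.01):
    """Toy genotype caller at one site under an independent per-base error model."""
    # Hardy-Weinberg priors for alternate allele frequency `af`.
    priors = {"hom_ref": (1 - af) ** 2, "het": 2 * af * (1 - af), "hom_alt": af ** 2}
    # Probability that a read was drawn from each allele under each genotype.
    allele_probs = {"hom_ref": {ref: 1.0, alt: 0.0},
                    "het":     {ref: 0.5, alt: 0.5},
                    "hom_alt": {ref: 0.0, alt: 1.0}}

    log_post = {}
    for g, prior in priors.items():
        logp = math.log(prior)
        for base, q in zip(observed_bases, base_quals):
            err = 10 ** (-q / 10)                       # Phred-scaled error probability
            p_read = 0.0
            for allele in (ref, alt):                   # marginalise over the true allele
                p_base = (1 - err) if base == allele else err / 3
                p_read += allele_probs[g][allele] * p_base
            logp += math.log(p_read)
        log_post[g] = logp

    # Normalise log-posteriors to probabilities.
    m = max(log_post.values())
    norm = sum(math.exp(v - m) for v in log_post.values())
    return {g: math.exp(v - m) / norm for g, v in log_post.items()}

# 12 reads: 7 reference, 5 alternate, all Q30 -> the heterozygous genotype dominates.
print(genotype_posteriors(["A"] * 7 + ["G"] * 5, [30] * 12))
```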
This formulation assumes conditional independence of reads given the genotype—an assumption known to be violated in practice due to correlated errors from PCR duplicates, systematic instrument biases, and local sequence context (DePristo et al. 2011). DeepVariant’s CNN (see Section 1.8), by contrast, sees all reads in a pileup simultaneously and can learn to model these dependencies implicitly.
1.4 Haplotype Phasing
Diploid organisms carry two copies of each autosomal chromosome—one inherited from each parent. Standard variant calling produces unphased genotypes: we know that an individual is heterozygous at two nearby sites (say, A/G and C/T), but not which alleles reside on the same physical chromosome. Phasing resolves this ambiguity by assigning each allele to a specific haplotype.
1.4.1 Why Phasing Matters
Phased haplotypes are essential for multiple applications. In recessive disease, two deleterious variants in the same gene cause disease only if they are on different haplotypes (in trans); unphased calls cannot distinguish cis from trans configurations, making compound heterozygosity detection impossible without phase information. Some regulatory and coding effects depend on the combination of alleles on a single chromosome, enabling haplotype-aware variant effect prediction. Reference panels used for genotype imputation (such as TOPMed and 1000 Genomes) are stored as phased haplotypes, and accurate phasing improves imputation quality. Finally, haplotype structure carries information about population history, recombination, and natural selection that drives ancestry and selection analyses.
1.4.2 Phasing Methods
Phasing can be achieved through several approaches. Read-backed phasing exploits physical linkage: when heterozygous variants are close enough to be spanned by the same sequencing read or read pair, the linkage directly reveals phase. Short-read data can phase variants within a few hundred base pairs, while long reads extend this to tens of kilobases or more.
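A minimal sketch of the counting logic behind read-backed phasing follows: for a pair of heterozygous sites, tally which phase configuration each spanning read supports and take the majority. The read observations are invented, and real tools solve a weighted version of this problem jointly over many sites.

```python
from collections import Counter

# Two heterozygous sites: site 1 alleles A/G, site 2 alleles C/T.
# Each entry records the alleles one read (or read pair) shows at the two sites.
site1_alleles, site2_alleles = ("A", "G"), ("C", "T")
read_observations = [
    ("A", "C"), ("G", "T"), ("A", "C"), ("G", "T"),
    ("A", "C"), ("G", "T"), ("A", "T"),   # the last read carries a sequencing error
]

votes = Counter()
for b1, b2 in read_observations:
    if (b1, b2) in {(site1_alleles[0], site2_alleles[0]),
                    (site1_alleles[1], site2_alleles[1])}:
        votes["A-C | G-T"] += 1      # haplotype 1 carries A and C, haplotype 2 carries G and T
    else:
        votes["A-T | G-C"] += 1      # the alternative phasing

phase, support = votes.most_common(1)[0]
print(f"Inferred haplotypes: {phase} ({support}/{len(read_observations)} reads agree)")
```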
Statistical or population-based phasing tools such as SHAPEIT, Eagle, and Beagle infer phase by leveraging linkage disequilibrium patterns observed in reference panels (O’Connell et al. 2014; Loh et al. 2016; Browning et al. 2021). These methods are highly accurate for common variants but struggle with rare variants that lack informative LD.
Pedigree-based phasing becomes possible when parent–offspring trios or larger pedigrees are available; Mendelian inheritance rules can resolve phase with high confidence.
Long-read and linked-read technologies provide direct observation of haplotype structure. PacBio HiFi and Oxford Nanopore reads span tens of kilobases, while linked-read methods (such as the now-discontinued 10x Genomics platform) tag short reads originating from the same long DNA molecule, providing intermediate-range phasing.
Modern pipelines often combine these approaches: statistical phasing across the genome, refined by read-backed evidence where available, and augmented by trio data when present. The result is a phased VCF with haplotype assignments (for example, 0|1 rather than 0/1), which downstream analyses can exploit.
1.5 Sources of Error and Uncertainty
Even with modern pipelines, variant calls are imperfect. Understanding the important failure modes is essential for interpreting downstream analyses (Li 2014).
Mapping ambiguity arises when reads from segmental duplications, paralogous genes, and low-complexity regions are mis-aligned. Reference bias can favor the reference allele in ambiguous regions, causing systematic undercalling of alternate alleles.
Systematic sequencing artifacts include context-specific errors in homopolymers and GC-rich regions, as well as batch effects across runs, instruments, or library preparations. These artifacts can create correlated false positives that are difficult to filter.
Low-coverage regions present another challenge. WES capture dropouts or WGS coverage dips can create false negatives for heterozygous variants, and somatic or mosaic variants at low allele fraction can be mistaken for noise.
Complex variants are also problematic. Small indels near homopolymers or repetitive elements are difficult to call accurately, and multi-nucleotide variants may be decomposed into multiple SNVs depending on the caller’s representation choices.
The deep learning models in later chapters inherit these errors as input noise. Understanding where variant calls are reliable—and where they are not—is essential when training sequence-to-function models, building polygenic scores, or interpreting predicted variant effects.
1.6 Difficult-to-Call Regions
Not all genomic regions are created equal. Some areas of the genome are systematically problematic for short-read variant calling due to their sequence properties.
1.6.1 Segmental Duplications and Paralogs
Regions with high sequence identity to other parts of the genome cause reads to map ambiguously. Paralogous genes—such as SMN1 and SMN2, or CYP2D6 and its pseudogenes—are particularly challenging. A read originating from one copy may align equally well to another, leading to false variant calls or missed true variants.
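One quick diagnostic for mapping ambiguity is the fraction of reads in a region that the aligner could not place uniquely (mapping quality zero). The pysam sketch below computes this; the BAM path and coordinates are placeholders.

```python
import pysam

def fraction_mapq_zero(bam_path, chrom, start, end):
    """Fraction of primary reads in a region with MAPQ 0 (ambiguously placed)."""
    total = mapq_zero = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(chrom, start, end):
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if read.mapping_quality == 0:
                mapq_zero += 1
    return mapq_zero / total if total else float("nan")

# Example (placeholder coordinates roughly spanning SMN1 on GRCh38):
# print(fraction_mapq_zero("sample.sorted.bam", "chr5", 70_900_000, 70_960_000))
```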
1.6.2 Low-Complexity and Repetitive Sequence
Homopolymers, short tandem repeats, and other low-complexity regions have elevated error rates on most sequencing platforms. Indel calling in these regions is especially unreliable, and many pipelines mask or flag them.
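Pipelines therefore often annotate candidate calls with local sequence context so that indels in such regions can be flagged or down-weighted; a minimal homopolymer-run check might look like the following (the threshold is illustrative).

```python
def max_homopolymer_run(sequence, center, window=10):
    """Length of the longest single-base run within `window` bp of a position."""
    region = sequence[max(0, center - window): center + window + 1].upper()
    longest = run = 1
    for prev, curr in zip(region, region[1:]):
        run = run + 1 if curr == prev else 1
        longest = max(longest, run)
    return longest

# Flag candidate indels that sit inside or next to long homopolymers.
ref_context = "ACGTTTTTTTTTACGTACGT"
if max_homopolymer_run(ref_context, center=9) >= 6:
    print("Candidate falls in a homopolymer run; treat indel calls with caution")
```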
1.6.3 The HLA Region: A Case Study
The human leukocyte antigen (HLA) locus on chromosome 6p21 is among the most challenging regions in the human genome—and among the most clinically important.
HLA is difficult to call for several reasons. The HLA genes (HLA-A, HLA-B, HLA-C, HLA-DRB1, and others) are the most polymorphic coding genes in the human genome, with thousands of known alleles per gene (Robinson et al. 2020). Standard reference-based alignment struggles because reads may match the reference poorly even when they represent common, well-characterized alleles. The MHC region contains segmental duplications, copy-number variable genes (such as HLA-DRB3/4/5), and pseudogenes that confound read mapping. Different HLA alleles may differ by only a few nucleotides, making accurate allele-level typing difficult with short reads alone. Reads carrying non-reference HLA alleles may fail to align or align with low mapping quality, causing systematic undercalling of alternate alleles.
Despite these challenges, accurate HLA typing is essential for several clinical applications. In transplantation, HLA matching between donor and recipient is critical for organ and hematopoietic stem cell transplantation outcomes. HLA alleles are the strongest genetic risk factors for many autoimmune conditions, including type 1 diabetes, rheumatoid arthritis, and multiple sclerosis; fine-mapping causal alleles and amino acid positions requires accurate genotyping (Sakaue et al. 2023; Padyukov 2022). Specific HLA alleles—such as HLA-B*57:01 for abacavir and HLA-B*15:02 for carbamazepine—are pharmacogenomic markers for severe adverse drug reactions (Mallal et al. 2008; Chung et al. 2004). HLA diversity also shapes immune responses to pathogens, including HIV, hepatitis, and SARS-CoV-2.
Because standard variant callers perform poorly in HLA, specialized tools have been developed. HLA imputation methods, including those available through the Michigan Imputation Server, use dense reference panels to impute HLA alleles from array genotypes, enabling large-scale association studies (Sakaue et al. 2023). Sequence-based typing tools such as T1K perform HLA and KIR (killer immunoglobulin-like receptor) genotyping directly from WES, WGS, or RNA-seq data by aligning reads against allele databases (such as IPD-IMGT/HLA) rather than the linear reference genome (Song et al. 2022). T1K is notable for its speed, accuracy across sequencing platforms, and ability to handle both DNA and RNA data. Graph-based approaches that incorporate known HLA alleles as alternate paths can also improve alignment and variant calling in this region (Garrison et al. 2018; Liao et al. 2023).
For the purposes of this book, HLA exemplifies a broader lesson: regions of extreme diversity, structural complexity, or clinical importance may require specialized methods beyond generic variant calling pipelines. Later chapters show how variant effect models handle these challenging regions—often by excluding them entirely or applying specialized processing.
1.7 Benchmarking and Ground Truth
Evaluating variant callers requires high-confidence truth sets against which predictions can be compared. The Genome in a Bottle (GIAB) Consortium, coordinated by NIST, provides extensively characterized reference samples with validated variant calls across most of the genome (Zook et al. 2019).
1.7.1 GIAB Reference Samples
The primary GIAB samples include NA12878 (also known as HG001), a well-studied female of European ancestry from the CEPH/Utah pedigree and the sample with the longest history of characterization. The collection also includes HG002 through HG007: an Ashkenazi Jewish trio (HG002–HG004) and a Han Chinese trio (HG005–HG007), providing diversity and enabling trio-based validation.
For each sample, GIAB provides high-confidence variant calls—consensus calls derived from multiple sequencing technologies and variant callers, representing the best current estimate of true genotypes. They also define high-confidence regions, genomic intervals where the truth set is believed to be reliable; difficult regions such as segmental duplications and centromeres are excluded. Benchmarking tools like hap.py and RTG Tools enable standardized comparison of callsets against truth, reporting precision, recall, and F1 by variant type (Krusche et al. 2019; “RealTimeGenomics/Rtg-Core” 2025).
1.7.2 Benchmarking Metrics
Standard metrics for variant calling include recall (sensitivity), defined as TP / (TP + FN) or the fraction of true variants detected; precision (positive predictive value), defined as TP / (TP + FP) or the fraction of called variants that are true; and the F1 score, the harmonic mean of precision and recall. These are typically reported separately for SNVs and indels and may be stratified by genomic context, such as performance inside versus outside difficult regions.
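Because real benchmarking must also reconcile genotypes and alternative variant representations, tools like hap.py are preferred in practice, but the core arithmetic can be sketched with a naive comparison keyed on (chromosome, position, ref, alt); the variants below are made up.

```python
def benchmark(truth_variants, query_variants):
    """Precision, recall, and F1 from sets of (chrom, pos, ref, alt) tuples."""
    truth, query = set(truth_variants), set(query_variants)
    tp = len(truth & query)
    fp = len(query - truth)
    fn = len(truth - query)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"TP": tp, "FP": fp, "FN": fn,
            "precision": precision, "recall": recall, "F1": f1}

truth = {("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T"), ("chr2", 500, "G", "GA")}
query = {("chr1", 1000, "A", "G"), ("chr2", 500, "G", "GA"), ("chr3", 42, "T", "C")}
print(benchmark(truth, query))   # 2 TP, 1 FP, 1 FN -> precision 0.67, recall 0.67
```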
1.7.3 Limitations of Current Benchmarks
GIAB truth sets have known limitations. They are derived primarily from short-read data and may miss complex variants, structural variants, and variation in difficult regions. High-confidence regions cover only approximately 85–90% of the genome, so performance in excluded regions is unknown. Sample diversity is limited, and performance may differ in underrepresented populations.
Ongoing efforts—including the T2T Consortium’s complete genome assemblies and the Human Pangenome Reference Consortium’s diverse haplotype collection—are expanding the scope of benchmarking resources (Liao et al. 2023).
1.8 DeepVariant: CNNs for Variant Calling
DeepVariant replaces much of the hand-engineered logic in classical pipelines with a deep convolutional neural network trained to classify candidate variants directly from read pileups (Poplin et al. 2018).
1.8.1 Image-Like Pileup Representation
Around each candidate site, DeepVariant constructs a six-channel tensor resembling an image. Each row corresponds to a read overlapping the site, with channels encoding reference match/mismatch status, Phred-scaled base quality, mapping quality, strand orientation, allele support (reference vs. alternate), and additional alignment features. The reference sequence and candidate alleles are overlaid. This representation allows the CNN to distinguish patterns consistent with true variants—balanced allele support across strands, consistent base qualities, clean alignments—from artifacts like strand-biased support or mismatches clustered at read ends.
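A drastically simplified version of this encoding is sketched below: each read becomes a row, each window position a column, and a handful of per-base features the channels. DeepVariant’s actual channels, normalization, and window construction differ; this conveys only the tensor layout.

```python
import numpy as np

def encode_pileup(reads, window_width=15, max_reads=100, n_channels=4):
    """Toy pileup tensor: rows are reads, columns are positions, channels are features.

    Each read is a dict with 'bases', 'quals', 'mapq', and 'strand' for the window
    (an invented structure for illustration; real encoders work from alignments).
    """
    base_to_float = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0, "N": 0.0}
    tensor = np.zeros((max_reads, window_width, n_channels), dtype=np.float32)
    for i, read in enumerate(reads[:max_reads]):
        for j, (base, qual) in enumerate(zip(read["bases"], read["quals"])):
            tensor[i, j, 0] = base_to_float.get(base, 0.0)            # base identity
            tensor[i, j, 1] = min(qual, 40) / 40                      # base quality
            tensor[i, j, 2] = min(read["mapq"], 60) / 60              # mapping quality
            tensor[i, j, 3] = 1.0 if read["strand"] == "+" else 0.0   # strand
    return tensor

reads = [{"bases": "ACGTACGTACGTACG", "quals": [30] * 15, "mapq": 60, "strand": "+"},
         {"bases": "ACGTACGTACGTACG", "quals": [35] * 15, "mapq": 60, "strand": "-"}]
print(encode_pileup(reads).shape)   # (100, 15, 4)
```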
1.8.2 Inception-Style CNN Classifier
DeepVariant uses an Inception-style CNN originally developed for image classification. Trained on high-confidence truth sets such as GIAB genomes, it learns to recognize true variant patterns and reject artifacts (strand bias, mapping pileups in repeats, inconsistent quality profiles). Once trained, the same architecture generalizes across whole-genome versus whole-exome data, PCR-free versus PCR-amplified libraries, and different instrument models and read lengths.
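For intuition about the classifier side, here is a toy convolutional network in PyTorch that maps a pileup tensor to three genotype classes (hom-ref, het, hom-alt). It is a stand-in for illustration, not the Inception architecture DeepVariant uses, and the input dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class ToyPileupClassifier(nn.Module):
    """Small CNN over (channels, reads, positions) pileup tensors -> 3 genotype classes."""
    def __init__(self, in_channels=6, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global pooling over reads x positions
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.classifier(h)             # logits for hom-ref / het / hom-alt

model = ToyPileupClassifier()
pileup = torch.randn(8, 6, 100, 221)          # a batch of 8 six-channel pileup windows
logits = model(pileup)
print(logits.shape)                           # torch.Size([8, 3])
```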
Crucially, DeepVariant learns to weigh quality signals jointly and end-to-end, rather than relying on post-hoc recalibration. Where VQSR fits a separate model on hand-selected annotations after calling, DeepVariant integrates the raw evidence directly into its classification—the CNN sees the same strand bias and quality patterns that VQSR would use, but learns their relationship to true variant status during training rather than in a decoupled second step.
1.8.3 Cohort Calling with DeepVariant and GLnexus
For cohort calling, DeepVariant can be combined with joint genotyping tools such as GLnexus to scale to tens or hundreds of thousands of samples while maintaining high accuracy (Yun et al. 2021). In this setup, DeepVariant produces per-sample gVCFs (genomic VCFs) containing genotype likelihoods at all sites, not just variant sites, and GLnexus merges gVCFs across samples to produce a cohort-wide callset.
Joint calling matters for several reasons. It improves sensitivity for rare variants: a variant observed in only one or two individuals may have weak per-sample evidence, but by combining likelihoods across carriers, joint calling can recover true rare variants that would be filtered in single-sample analysis. Joint calling ensures consistent representation, so that the same variants are genotyped across all samples, avoiding the problem of comparing different candidate variant sites across samples. Cohort-level quality filters can identify and remove systematic artifacts that affect subsets of samples, reducing batch effects and improving allele frequency estimates for downstream GWAS and PRS accuracy.
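The consistent-representation point can be illustrated with a toy gVCF merge: every sample receives a genotype at every site seen in any sample, with reference blocks supplying confident hom-ref calls. The record structures below are invented; real gVCFs carry genotype likelihoods rather than hard genotypes.

```python
# Each sample reports variant records plus reference blocks covering non-variant intervals.
sample_gvcfs = {
    "S1": {"variants": {("chr1", 1000, "A", "G"): "0/1"},
           "ref_blocks": [("chr1", 900, 1100)]},
    "S2": {"variants": {},
           "ref_blocks": [("chr1", 900, 1100)]},
    "S3": {"variants": {("chr1", 1000, "A", "G"): "1/1"},
           "ref_blocks": [("chr1", 990, 1010)]},
}

def covered(blocks, chrom, pos):
    """True if a reference block asserts confident hom-ref coverage at this position."""
    return any(c == chrom and start <= pos <= end for c, start, end in blocks)

all_sites = sorted({site for s in sample_gvcfs.values() for site in s["variants"]})
cohort = {}
for site in all_sites:
    chrom, pos, ref, alt = site
    genotypes = {}
    for name, gvcf in sample_gvcfs.items():
        if site in gvcf["variants"]:
            genotypes[name] = gvcf["variants"][site]
        elif covered(gvcf["ref_blocks"], chrom, pos):
            genotypes[name] = "0/0"          # reference block -> confident hom-ref
        else:
            genotypes[name] = "./."          # no data at this site
    cohort[site] = genotypes

print(cohort)   # {('chr1', 1000, 'A', 'G'): {'S1': '0/1', 'S2': '0/0', 'S3': '1/1'}}
```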
This combination has become a de facto standard for large WES and WGS cohorts, including recent releases of gnomAD and UK Biobank (Karczewski et al. 2020; Bycroft et al. 2018).
1.8.4 Comparison: Classical Pipelines vs. DeepVariant
| Aspect | GATK HaplotypeCaller | DeepVariant |
|---|---|---|
| Core approach | Pair-HMM + hand-crafted heuristics | CNN on pileup tensors |
| Feature engineering | Expert-designed (MQ, DP, FS, etc.) | Learned end-to-end |
| Read independence | Assumed (violated in practice) | Implicitly models dependencies |
| Calibration | VQSR post-hoc recalibration | Well-calibrated likelihoods |
| Generalization | Requires species/platform tuning | Transfers across species and platforms |
| Structural variants | Limited (SNVs/indels only) | Limited (SNVs/indels only) |
Both approaches achieve comparable accuracy on high-quality Illumina data, but DeepVariant shows advantages in difficult contexts and generalizes more readily to new sequencing technologies.
1.9 Significance for Genomic Deep Learning
NGS and variant calling set the stage for everything else in this book.
1.9.1 Defining the Atoms We Model
The output of WES and WGS pipelines—a VCF of SNVs, indels, and increasingly structural variants—is the raw material for nearly all downstream analyses. Polygenic scores (Chapter 3) aggregate variant effects across the genome. Rare variant burden tests collapse variants by gene or functional annotation. Variant effect predictors (Chapter 4 and later chapters) learn to score individual variants for deleteriousness or functional impact. The quality of variant calls directly limits the quality of these downstream models: false positives introduce noise, while false negatives create blind spots.
1.9.2 Constraining Downstream Models
If an assay never observes a class of variants, deep models cannot learn about them. Short-read WGS misses many structural variants and complex rearrangements. WES captures coding regions but ignores most non-coding regulatory variation. Difficult regions such as HLA and segmental duplications may be systematically excluded from training data. Models trained on these data inherit their biases and gaps. Chapter 16 returns to these issues when discussing confounders and dataset artifacts.
1.9.3 Motivating End-to-End Learning
DeepVariant is an early example of replacing a hand-designed pipeline with a learned model that operates on raw-ish data and directly optimizes accuracy. This paradigm—replacing feature engineering with learned representations—recurs throughout the book. DeepSEA and Basenji (Chapter 5) learn regulatory grammars from sequence. SpliceAI (Chapter 7) predicts splicing from local sequence context. DNA language models (Chapter 10) learn general-purpose representations from unlabeled genomes. In each case, the question is whether learned representations outperform hand-crafted features, and under what conditions.
1.9.4 Looking Ahead
The remaining chapters in Part I describe how variant calls are aggregated into genome-wide association studies (Chapter 3), which identify variant-trait associations; polygenic scores (Chapter 3), which predict complex traits from many variants; and deleteriousness scores (Chapter 4), which prioritize variants by predicted pathogenicity. These serve as baselines, inputs, and evaluation targets for the deep learning models that follow.