Appendix C — Data Curation
Training genomic foundation models requires carefully curated datasets that balance scale with quality. Poor data curation propagates errors into model representations, creating systematic biases that persist through downstream applications. This appendix provides practical guidance for constructing training sets, detecting contamination, and documenting data provenance.
C.1 Data Sources
Genomic foundation models draw from diverse data sources, each with distinct characteristics, biases, and access requirements.
C.1.1 Reference Genomes and Assemblies
Human reference genome (GRCh38/hg38) provides the coordinate system for human genomics:
| Assembly | Release | Key Features |
|---|---|---|
| GRCh38 | 2013 | Current standard, alternate loci |
| T2T-CHM13 | 2022 | First complete human genome |
| Pangenome | 2023 | Graph-based, population diversity |
Species genomes from Ensembl, NCBI, and UCSC provide sequences for comparative genomics and cross-species pretraining.
Considerations:
- The reference genome represents a single haplotype, missing population diversity
- Repetitive regions and centromeres are poorly represented in older assemblies
- Coordinate systems differ between assemblies; ensure consistency (see the sketch below)
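When combining sources aligned to different assemblies, convert coordinates explicitly rather than assuming they agree. A minimal sketch using the pyliftover package (an assumed dependency; UCSC liftOver and CrossMap are common alternatives), noting that pyliftover uses 0-based positions:

```python
# Hedged sketch: lift a position from hg19 to hg38 with pyliftover.
from pyliftover import LiftOver

lo = LiftOver('hg19', 'hg38')  # fetches the UCSC chain file on first use

result = lo.convert_coordinate('chr1', 1000000)  # 0-based position
if result:  # an empty list (or None) means the position failed to lift over
    chrom, pos, strand, _score = result[0]
    print(f"hg19 chr1:1000000 -> hg38 {chrom}:{pos} ({strand})")
```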
C.1.2 Population-Scale Sequencing
| Resource | Samples | Data Type | Access |
|---|---|---|---|
| gnomAD | 730,000+ | Exomes/genomes | Open |
| UK Biobank | 500,000 | WGS, WES, arrays | Controlled |
| All of Us | 1,000,000+ | WGS | Controlled |
| TOPMed | 180,000+ | WGS | Controlled |
| 1000 Genomes | 3,200 | WGS | Open |
Population databases provide variant frequency information and diverse genetic backgrounds. gnomAD’s open access makes it valuable for training; controlled-access resources like UK Biobank require applications and data use agreements.
C.1.3 Protein Sequence Databases
| Database | Sequences | Coverage |
|---|---|---|
| UniRef100 | 300M+ | Non-redundant proteins |
| UniRef90 | 150M+ | 90% identity clusters |
| UniRef50 | 55M+ | 50% identity clusters |
| UniParc | 500M+ | All known proteins |
UniRef provides clustered protein sequences at different identity thresholds, enabling control over sequence redundancy during pretraining. Lower redundancy (UniRef50) reduces training time but may miss sequence diversity; higher redundancy captures more variation but increases computational cost.
C.1.4 Functional Annotation
| Resource | Assays | Cell Types |
|---|---|---|
| ENCODE | ChIP-seq, ATAC-seq, RNA-seq | 500+ |
| Roadmap | Histone marks, DNase | 127 |
| GTEx | RNA-seq | 54 tissues |
| FANTOM5 | CAGE | 1,800+ |
Functional genomics data provides supervision signals for regulatory sequence models. ENCODE and Roadmap offer consistent protocols across cell types; GTEx provides tissue-specific expression. Data quality varies by assay and cell type; metadata review is essential.
C.1.5 Clinical Variant Databases
| Database | Variants | Curation |
|---|---|---|
| ClinVar | 2M+ | Submitter-dependent |
| HGMD | 350K+ | Expert curated |
| ClinGen | Genes + variants | Expert panels |
| LOVD | Gene-specific | Variable |
ClinVar provides the largest open-access collection of clinical variant interpretations but includes submitter variability and classification conflicts. See Chapter 2 for detailed discussion of ClinVar biases.
C.1.6 Access and Licensing
Data access requirements vary:
| Access Type | Examples | Requirements |
|---|---|---|
| Open | gnomAD, 1000G, ClinVar | None |
| Registered | ENCODE, GTEx | Account creation |
| Controlled | UK Biobank, dbGaP | Application, IRB approval |
| Commercial | HGMD Professional | Subscription |
Controlled-access data requires:
- Institutional review board (IRB) approval
- Data use agreements (DUAs)
- Secure computing environments
- Compliance with data return/destruction policies
C.2 Quality Filtering
Raw genomic data contains errors from sequencing, alignment, and annotation. Quality filtering removes problematic entries before training.
C.2.1 Sequence Quality Filters
For DNA sequences:
| Filter | Threshold | Rationale |
|---|---|---|
| N content | <5% | Removes poorly sequenced regions |
| Low complexity | DUST score | Removes repetitive sequence |
| Length | Task-dependent | Ensures sufficient context |
| GC content | Species-appropriate | Flags contamination |
```python
def filter_sequence(seq, max_n_frac=0.05, min_length=100):
    """Basic sequence quality filter."""
    if len(seq) < min_length:  # check length first; also avoids division by zero
        return False
    n_frac = seq.count('N') / len(seq)
    if n_frac > max_n_frac:
        return False
    return True
```

For protein sequences:
| Filter | Threshold | Rationale |
|---|---|---|
| X content | <1% | Removes ambiguous residues |
| Length | 30–10,000 AA | Filters fragments and artifacts |
| Stop codons | 0 internal | Removes pseudogenes |
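A corresponding filter for proteins, mirroring the thresholds in the table above (a minimal sketch; it assumes `X` marks ambiguous residues and `*` marks stop codons, per common FASTA conventions):

```python
def filter_protein(seq, max_x_frac=0.01, min_length=30, max_length=10000):
    """Basic protein sequence quality filter."""
    if not (min_length <= len(seq) <= max_length):
        return False
    if seq.count('X') / len(seq) > max_x_frac:
        return False
    if '*' in seq.rstrip('*'):  # internal stop codon (a trailing '*' is tolerated)
        return False
    return True
```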
C.2.2 Variant Quality Filters
Variant calls include false positives from sequencing errors and alignment artifacts:
| Filter | Typical Threshold | Notes |
|---|---|---|
| QUAL | >20 | Phred-scaled quality |
| DP | >10 | Read depth |
| GQ | >20 | Genotype quality |
| FILTER | PASS | Caller-specific filters |
| AF_gnomAD | <0.01 for rare | Population frequency |
GATK and other variant callers apply default filters; additional filtering may be needed for training data. Overly stringent filtering biases toward common variants; overly permissive filtering includes false positives.
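A minimal sketch of record-level filtering with pysam (an assumed dependency), applying the thresholds from the table above; production pipelines more often use bcftools or GATK VariantFiltration:

```python
import pysam

def passing_variants(vcf_path, min_qual=20, min_dp=10, min_gq=20):
    """Yield VCF records passing basic quality thresholds."""
    vcf = pysam.VariantFile(vcf_path)
    for rec in vcf:
        if rec.qual is None or rec.qual <= min_qual:
            continue
        if "PASS" not in rec.filter.keys():
            continue
        # Require every genotyped sample to meet depth/quality cutoffs
        if all((s.get("DP") or 0) > min_dp and (s.get("GQ") or 0) > min_gq
               for s in rec.samples.values()):
            yield rec
```

Treat these thresholds as starting points to tune against known-positive calls, not universal defaults.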
C.2.3 Annotation Quality
Clinical annotations have variable quality:
| Quality Indicator | Interpretation |
|---|---|
| Review status (ClinVar) | Stars indicate curation level |
| Submission count | More submissions increase confidence |
| Date | Recent annotations reflect current knowledge |
| Conflicts | Multiple interpretations reduce reliability |
Filtering strategy for ClinVar:
- Require ≥2 stars review status for training labels (see the sketch after this list)
- Exclude variants with conflicting interpretations
- Consider date cutoffs for temporal validation
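A hedged sketch of this strategy over ClinVar's variant_summary.txt using pandas; the column names ("ReviewStatus", "ClinicalSignificance") and status strings follow recent releases and should be verified against the file you download:

```python
import pandas as pd

# Review statuses corresponding to >= 2 stars
TWO_PLUS_STARS = {
    "criteria provided, multiple submitters, no conflicts",  # 2 stars
    "reviewed by expert panel",                              # 3 stars
    "practice guideline",                                    # 4 stars
}

df = pd.read_csv("variant_summary.txt", sep="\t", low_memory=False)
df = df[df["ReviewStatus"].isin(TWO_PLUS_STARS)]
# Drop variants with conflicting interpretations/classifications
df = df[~df["ClinicalSignificance"].str.contains("Conflicting", na=False)]
```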
C.2.4 Handling Missing Data
Missing annotations are common:
- Explicit missing: Marked as unknown/uncertain
- Implicit missing: Simply not annotated
Strategies:
- Exclude samples with critical missing fields
- Impute where appropriate (mean, median, model-based)
- Model missingness explicitly (separate category)
- Document missing data rates (see the sketch below)
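Documenting missing-data rates is easy to automate; a small sketch with pandas (the file name and tab-separated layout are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv("annotations.tsv", sep="\t")
# Fraction of missing values per column, worst first; record in the manifest
missing_rates = df.isna().mean().sort_values(ascending=False)
print(missing_rates.to_string())
```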
C.3 Deduplication
Duplicate sequences inflate dataset size without providing new information and can cause train-test leakage.
C.3.1 Exact Deduplication
Remove identical sequences:
```python
import hashlib

def deduplicate_exact(sequences):
    """Remove exact duplicate sequences, keeping the first occurrence."""
    seen = set()
    unique = []
    for seq in sequences:
        # MD5 serves only as a compact dedup key, not a security hash
        seq_hash = hashlib.md5(seq.encode()).hexdigest()
        if seq_hash not in seen:
            seen.add(seq_hash)
            unique.append(seq)
    return unique
```

C.3.2 Near-Duplicate Detection
Sequences that differ at only a few positions may represent the same biological entity:
For DNA:
- MinHash/LSH for approximate matching (see the sketch following this list)
- CD-HIT for clustering at identity thresholds
- MMseqs2 for scalable clustering
For proteins:
- CD-HIT at 90%, 70%, 50%, 30% identity
- MMseqs2 for large-scale clustering
- PSI-BLAST for remote homology
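A sketch of MinHash/LSH screening over DNA k-mers using the datasketch package (an assumed dependency); CD-HIT and MMseqs2 remain the standard choices at scale:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(seq, k=21, num_perm=128):
    """MinHash signature over a sequence's k-mers."""
    m = MinHash(num_perm=num_perm)
    for i in range(len(seq) - k + 1):
        m.update(seq[i:i + k].encode())
    return m

def near_duplicate_hits(sequences, threshold=0.9, num_perm=128):
    """Map each sequence index to earlier sequences with similar signatures."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    hits = {}
    for i, seq in enumerate(sequences):
        m = minhash_of(seq, num_perm=num_perm)
        hits[i] = lsh.query(m)  # keys of earlier, similar sequences
        lsh.insert(str(i), m)
    return hits
```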
Example CD-HIT usage:
```bash
# Cluster at 90% sequence identity
cd-hit -i proteins.fasta -o proteins_nr90.fasta -c 0.9 -n 5

# Cluster at 50% identity (requires a shorter word size)
cd-hit -i proteins.fasta -o proteins_nr50.fasta -c 0.5 -n 3
```

C.3.3 Redundancy Levels
| Redundancy | Use Case |
|---|---|
| 100% (no dedup) | Maximum data, risk of memorization |
| 90% identity | Reduce near-duplicates, preserve variants |
| 70% identity | Balance diversity and coverage |
| 50% identity | Maximize diversity, may lose variants |
| 30% identity | Remote homologs only |
The optimal redundancy level depends on the task and model capacity. Larger models can benefit from more redundancy; smaller models may need aggressive deduplication.
C.3.4 Train-Test Deduplication
Critical for valid evaluation:
- Define test set sequences first
- Remove training sequences similar to test sequences
- Use appropriate similarity threshold (often 30–50% for proteins)
- Document the deduplication procedure
```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Rough global similarity; use MMseqs2/BLAST identity in practice."""
    return SequenceMatcher(None, a, b).ratio()

def remove_test_similar(train_seqs, test_seqs, threshold=0.5):
    """Remove training sequences similar to any test sequence.

    O(train x test) pairwise scan, shown for clarity; use MMseqs2 or
    similar for datasets of realistic size.
    """
    clean_train = []
    for train_seq in train_seqs:
        if not any(similarity(train_seq, test_seq) > threshold
                   for test_seq in test_seqs):
            clean_train.append(train_seq)
    return clean_train
```

C.4 Contamination Detection
Contamination introduces sequences from unintended sources, corrupting training data.
C.4.1 Types of Contamination
| Type | Source | Detection |
|---|---|---|
| Cross-species | Sample mix-up, xenograft | BLAST to species databases |
| Microbial | Sample contamination | Screen against microbial genomes |
| Adapter | Library prep artifacts | Match adapter sequences |
| Vector | Cloning artifacts | Screen UniVec database |
| Human | Non-human samples | Screen against human genome |
C.4.2 Screening Approaches
BLAST-based screening:
```bash
# Screen against human genome for non-human samples
blastn -query sequences.fasta -db human_genome \
    -outfmt 6 -evalue 1e-10 > human_hits.txt
```

Specialized tools:
| Tool | Purpose |
|---|---|
| FastQ Screen | Multi-genome contamination |
| Kraken2 | Taxonomic classification |
| BBDuk | Adapter/contaminant removal |
| VecScreen | Vector contamination |
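Downstream of screening, flagged entries must actually be removed. A small sketch that drops query sequences with significant hits in the tabular output produced above (column 1 of outfmt 6 is the query ID):

```python
def remove_contaminated(sequences, hits_path="human_hits.txt"):
    """Drop sequences whose IDs appear as BLAST query hits.

    Assumes `sequences` maps sequence ID -> sequence, with IDs matching
    the FASTA headers used in the blastn query file.
    """
    with open(hits_path) as f:
        contaminated = {line.split("\t")[0] for line in f if line.strip()}
    return {sid: seq for sid, seq in sequences.items()
            if sid not in contaminated}
```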
C.4.3 Benchmark Contamination
A subtle but critical issue: when benchmark test sequences leak into the pretraining data, performance estimates are inflated.
Detection:
- Search the pretraining corpus for benchmark sequences
- Check for substring matches, not just exact matches (see the sketch below)
- Verify temporal separation (benchmark created after the pretraining data)
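A minimal sketch of the substring check (brute force for clarity; at corpus scale an Aho-Corasick index or k-mer screening is more appropriate):

```python
def find_benchmark_overlap(corpus_chunks, benchmark_seqs, min_len=50):
    """Return benchmark sequences appearing verbatim inside corpus chunks."""
    # Very short probes match by chance; screen only informative lengths
    probes = [s for s in benchmark_seqs if len(s) >= min_len]
    overlaps = set()
    for chunk in corpus_chunks:
        for probe in probes:
            if probe in chunk:
                overlaps.add(probe)
    return overlaps
```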
Prevention:
- Document pretraining data sources and dates
- Use chromosome-based holdouts where possible
- Report contamination checks in publications
C.5 Data Provenance
Tracking data origins enables reproducibility and debugging.
C.5.1 Metadata Requirements
Essential metadata for each data source:
| Field | Description |
|---|---|
| Source | Database/repository name |
| Version | Release version or date |
| Access date | When data was downloaded |
| URL | Exact download location |
| Processing | Filters and transformations applied |
| Checksum | MD5/SHA256 for verification |
C.5.2 Documentation Template
```yaml
# data_manifest.yaml
dataset:
  name: "variant_training_v2"
  created: "2024-01-15"
  sources:
    - name: "ClinVar"
      version: "2024-01"
      url: "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/..."
      checksum: "abc123..."
      filters:
        - "review_status >= 2 stars"
        - "no conflicting interpretations"
      records_raw: 2500000
      records_filtered: 850000
    - name: "gnomAD"
      version: "4.0"
      url: "https://gnomad.broadinstitute.org/..."
      checksum: "def456..."
      filters:
        - "FILTER == PASS"
        - "AF > 0.001"
      records_raw: 750000000
      records_filtered: 45000000
  processing:
    deduplication: "exact + 90% identity clustering"
    train_test_split: "chromosome-based (chr8 test)"
  final_counts:
    train: 800000
    validation: 50000
    test: 100000
```

C.5.3 Version Control
For reproducibility:
- Store data manifests in version control
- Use content-addressable storage for large files (DVC, git-lfs; see the sketch below)
- Tag dataset versions with model training runs
- Archive exact preprocessing scripts
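For example, with DVC (a hedged sketch using standard DVC commands; file names are illustrative, and git-lfs follows a similar pattern):

```bash
# Track the large file with DVC; git versions only the small .dvc pointer
dvc init
dvc add data/clinvar_2024-01.vcf.gz
git add data/clinvar_2024-01.vcf.gz.dvc .gitignore data_manifest.yaml
git commit -m "Add ClinVar 2024-01 training data"
git tag variant_training_v2  # tie the dataset version to training runs
dvc push                     # upload the data to remote storage
```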
C.6 Bias Assessment
Training data biases propagate into model predictions. Proactive assessment enables mitigation.
C.6.1 Population Bias
Genomic databases are not representative of global populations:
| Database | European | African | Asian | Other |
|---|---|---|---|---|
| ClinVar | ~70% | ~5% | ~15% | ~10% |
| gnomAD | ~50% | ~10% | ~25% | ~15% |
| UK Biobank | ~95% | ~2% | ~2% | ~1% |
Consequences:
- Variant frequency estimates biased toward European populations
- Pathogenic variants in non-European populations underrepresented
- Models may perform worse on underrepresented populations
Assessment:
- Compute the ancestry distribution of training samples
- Evaluate model performance stratified by ancestry (see the sketch below)
- Document limitations for underrepresented groups
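A minimal sketch of ancestry-stratified evaluation (hypothetical column names; assumes pandas and scikit-learn):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auroc_by_ancestry(df):
    """Per-group AUROC; df needs 'ancestry', 'label', and 'score' columns."""
    rows = []
    for group, sub in df.groupby("ancestry"):
        if sub["label"].nunique() == 2:  # AUROC requires both classes present
            rows.append((group, len(sub),
                         roc_auc_score(sub["label"], sub["score"])))
    return pd.DataFrame(rows, columns=["ancestry", "n", "auroc"])
```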
C.6.2 Gene Coverage Bias
Some genes are more studied than others:
- Cancer genes (BRCA1, TP53) have extensive annotation
- Novel disease genes have sparse data
- Gene function determines ascertainment
Assessment:
- Plot variants per gene vs. gene length (see the sketch below)
- Identify genes with suspiciously high or low variant counts
- Consider gene-level normalization
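A small sketch of the first two checks (hypothetical column names 'gene' and 'gene_length'; assumes pandas):

```python
import pandas as pd

def variant_density_outliers(variants, genes, z_cutoff=3.0):
    """Flag genes whose length-normalized variant count is an outlier."""
    counts = variants.groupby("gene").size().rename("n_variants")
    df = genes.set_index("gene").join(counts).fillna(0)
    df["density"] = df["n_variants"] / df["gene_length"]
    z = (df["density"] - df["density"].mean()) / df["density"].std()
    return df[z.abs() > z_cutoff]  # candidates for over/under-ascertainment
```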
C.6.3 Ascertainment Bias
Clinical databases reflect clinical practice:
- Common diseases overrepresented
- Severe phenotypes more likely to reach clinical attention
- Geographic patterns in healthcare access
Assessment:
- Compare phenotype distribution to population prevalence
- Identify systematic gaps in disease coverage
- Document clinical ascertainment assumptions
C.6.4 Label Bias
Annotations reflect annotator knowledge and conventions:
- Historical classifications may be outdated
- Different submitters use different standards
- Pathogenicity thresholds vary by context
Assessment:
- Track annotation dates and sources
- Identify conflicting labels
- Consider temporal validation (train on old data, test on new)
C.7 Building Training Sets
Practical workflow for constructing training data.
C.7.1 Step 1: Define Scope
- What sequences will the model process?
- What predictions will it make?
- What populations/contexts must it serve?
C.7.2 Step 2: Identify Sources
- List candidate data sources
- Assess access requirements and licenses
- Evaluate quality and coverage
C.7.3 Step 3: Download and Verify
```bash
# Download with verification
wget https://example.com/data.vcf.gz
md5sum data.vcf.gz  # Compare to published checksum

# Document in manifest
echo "Downloaded data.vcf.gz on $(date)" >> data_log.txt
echo "MD5: $(md5sum data.vcf.gz)" >> data_log.txt
```

C.7.4 Step 4: Quality Filter
Apply appropriate filters for each data type:
- Sequence quality (N content, length, complexity)
- Variant quality (QUAL, DP, GQ, FILTER)
- Annotation quality (review status, conflicts)
C.7.5 Step 5: Deduplicate
- Remove exact duplicates
- Cluster at appropriate identity threshold
- Ensure train-test separation
C.7.6 Step 6: Split Data
| Split | Purpose | Size |
|---|---|---|
| Train | Model training | 80–90% |
| Validation | Hyperparameter tuning | 5–10% |
| Test | Final evaluation | 5–10% |
Splitting strategies:
- Random (simple but may leak related samples)
- Chromosome-based (ensures spatial separation; see the sketch below)
- Temporal (train on older data, test on newer)
- Gene-family-based (tests generalization to new genes)
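A minimal sketch of a chromosome-based split (the test chromosome mirrors the manifest example above; the validation chromosome is an arbitrary illustration):

```python
def chromosome_split(records, test_chroms=("chr8",), val_chroms=("chr21",)):
    """Partition records by chromosome; each record needs a 'chrom' field."""
    splits = {"train": [], "validation": [], "test": []}
    for rec in records:
        if rec["chrom"] in test_chroms:
            splits["test"].append(rec)
        elif rec["chrom"] in val_chroms:
            splits["validation"].append(rec)
        else:
            splits["train"].append(rec)
    return splits
```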
C.7.7 Step 7: Assess Bias
- Compute population/gene/phenotype distributions
- Compare to expected distributions
- Document known biases and limitations
C.7.8 Step 8: Document
- Create comprehensive data manifest
- Archive preprocessing scripts
- Record final counts and splits
- Publish data card with limitations
C.8 Data Cards
A data card documents dataset characteristics for users:
```markdown
# Dataset: VariantBench-v2

## Overview
- Purpose: Training variant effect predictors
- Size: 950,000 variants (800K train / 50K val / 100K test)
- Created: January 2024

## Sources
- ClinVar 2024-01 (pathogenic/benign labels)
- gnomAD 4.0 (population frequencies)

## Curation
- Required 2+ stars review status
- Excluded conflicting interpretations
- 90% identity clustering applied
- Chromosome 8 held out for testing

## Known Biases
- 70% European ancestry
- Cancer genes overrepresented (*BRCA1*: 15K variants)
- Recent submissions may have unstable classifications

## Intended Use
- Training and evaluating pathogenicity predictors
- NOT suitable for: clinical diagnosis without validation

## Updates
- v2.1 (March 2024): Added 50K variants from new ClinVar release
```

C.9 Checklist
Before using a dataset for training, confirm:

Data Quality
- Sequence, variant, and annotation quality filters applied with thresholds recorded (C.2)
- Exact and near-duplicates removed; train-test similarity checked (C.3)
- Contamination screens run, including benchmark overlap (C.4)

Bias Assessment
- Population, gene-coverage, ascertainment, and label biases assessed (C.6)
- Performance evaluated stratified by ancestry where possible
- Known biases and limitations documented

Documentation
- Data manifest created with sources, versions, URLs, checksums, and filters (C.5)
- Data card published with intended use and limitations (C.8)

Reproducibility
- Preprocessing scripts archived and dataset versions tagged (C.5.3)
- Splits and final counts recorded; manifests stored in version control