Appendix G — Statistical Learning Theory Primer
A neural network with 1 billion parameters has enough capacity to memorize every training example verbatim. Classical learning theory predicts such models cannot generalize. Yet they do. This paradox sits at the heart of deep learning theory and shapes how we evaluate genomic foundation models.
G.1 The Generalization Problem
Setup. We observe \(n\) training examples \((x_1, y_1), \ldots, (x_n, y_n)\) drawn i.i.d. from an unknown distribution \(P\). We fit a model \(\hat{f}\) to minimize the training error (empirical risk):
\[\hat{R}_{train}(\hat{f}) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i)) \tag{G.1}\]
Goal. Bound the generalization error:
\[R(\hat{f}) = \mathbb{E}_{(x,y) \sim P}[L(y, \hat{f}(x))] \tag{G.2}\]
The generalization gap is \(R(\hat{f}) - \hat{R}_{train}(\hat{f})\).
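In practice the population risk in (G.2) is unobservable and is estimated on a large held-out sample from the same distribution. A minimal sketch of the setup (the synthetic task, classifier, and sample sizes are illustrative assumptions, not from the text):

```python
# Sketch of Eqs. (G.1)-(G.2): training risk is computed on the sample; the
# population risk is approximated on a large held-out draw from the same P.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, d=50):
    """Draw n pairs (x, y) from a simple synthetic distribution P."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)   # noisy linear labels
    return X, y

X_train, y_train = sample(200)
X_test, y_test = sample(100_000)          # large held-out set standing in for P

f_hat = LogisticRegression(max_iter=1000).fit(X_train, y_train)
zero_one = lambda y_true, y_pred: np.mean(y_true != y_pred)    # 0/1 loss L

r_train = zero_one(y_train, f_hat.predict(X_train))            # empirical risk, Eq. (G.1)
r_true = zero_one(y_test, f_hat.predict(X_test))               # Monte Carlo estimate of Eq. (G.2)
print(f"train risk {r_train:.3f}  est. true risk {r_true:.3f}  gap {r_true - r_train:.3f}")
```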
G.2 VC Dimension and Capacity
The Vapnik-Chervonenkis (VC) dimension measures a hypothesis class’s complexity.
Definition. A hypothesis class \(\mathcal{H}\) shatters a set of points \(\{x_1, \ldots, x_m\}\) if for every labeling \((y_1, \ldots, y_m) \in \{0,1\}^m\), there exists \(h \in \mathcal{H}\) with \(h(x_i) = y_i\) for all \(i\).
The VC dimension \(d_{VC}(\mathcal{H})\) is the largest \(m\) such that some set of \(m\) points can be shattered.
Examples:
| Hypothesis Class | VC Dimension |
|---|---|
| Linear classifiers in \(\mathbb{R}^d\) | \(d + 1\) |
| Axis-aligned rectangles in \(\mathbb{R}^2\) | 4 |
| Circles in \(\mathbb{R}^2\) | 3 |
| Neural network with \(W\) weights | \(O(W \log W)\) |
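Shattering can be checked mechanically for small classes. The sketch below (an illustration, not from the text) tests whether linear classifiers in \(\mathbb{R}^2\) shatter a point set by solving a feasibility linear program for each of the \(2^m\) labelings: three points in general position are shattered, while the XOR labeling of the corners of a square is not, consistent with \(d_{VC} = d + 1 = 3\).

```python
# Brute-force shattering check for linear classifiers in R^2: a labeling is
# realizable iff the feasibility LP  y_i * (w . x_i + b) >= 1 for all i  has a
# solution. Point sets below are illustrative.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """True if labels y in {-1,+1} on the rows of X admit a separating hyperplane."""
    n, d = X.shape
    # Variables z = [w_1, ..., w_d, b]; constraints -y_i * (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """True if linear classifiers realize every labeling of the rows of X."""
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1, 1], repeat=X.shape[0]))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(triangle))   # True: 3 points in general position are shattered
print(shattered(square))     # False: the XOR labeling is not linearly separable
```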
G.3 Classical Generalization Bounds
VC Bound. With probability at least \(1 - \delta\):
\[R(\hat{f}) \leq \hat{R}_{train}(\hat{f}) + \sqrt{\frac{d_{VC} (\log(2n/d_{VC}) + 1) + \log(4/\delta)}{n}} \tag{G.3}\]
Interpretation:
- Generalization gap scales as \(O(\sqrt{d_{VC}/n})\)
- More capacity → larger gap
- More data → smaller gap
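Plugging numbers into (G.3) shows how fast the bound loses its bite as capacity grows; the arithmetic below is purely illustrative.

```python
# Illustrative arithmetic for the VC bound (G.3).
import math

def vc_gap_bound(d_vc, n, delta=0.05):
    """Upper bound on the generalization gap from Eq. (G.3)."""
    if n <= d_vc:
        # The sample can be shattered, so the bound carries no information.
        return float("inf")
    return math.sqrt((d_vc * (math.log(2 * n / d_vc) + 1) + math.log(4 / delta)) / n)

print(vc_gap_bound(d_vc=100, n=100_000))    # ~0.09: informative for small models
print(vc_gap_bound(d_vc=1e10, n=1e7))       # inf: a ~1B-parameter net vs 10M examples
print(vc_gap_bound(d_vc=1e10, n=2e10))      # ~1.1: still vacuous even with 20B examples
```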
Rademacher Complexity. A data-dependent measure of capacity:
\[\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\right] \tag{G.4}\]
where \(\sigma_i \in \{-1, +1\}\) are i.i.d. uniform random signs (Rademacher variables).
With probability \(1 - \delta\): \[R(\hat{f}) \leq \hat{R}_{train}(\hat{f}) + 2\mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}} \tag{G.5}\]
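For simple classes the supremum in (G.4) has a closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo over random sign vectors. The sketch below assumes a norm-bounded linear class (an illustration, not from the text), where \(\sup_{\|w\|_2 \le B} \frac{1}{n}\sum_i \sigma_i \langle w, x_i \rangle = \frac{B}{n}\,\big\|\sum_i \sigma_i x_i\big\|_2\).

```python
# Monte Carlo estimate of the empirical Rademacher complexity (G.4) for the
# norm-bounded linear class H = { x -> <w, x> : ||w||_2 <= B }, using the
# closed-form supremum (B/n) * ||sum_i sigma_i x_i||_2. Data are illustrative.
import numpy as np

def rademacher_linear(X, B=1.0, num_draws=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))   # random sign vectors sigma
    sups = B * np.linalg.norm(sigma @ X, axis=1) / n       # closed-form sup for each draw
    return sups.mean()

X = np.random.default_rng(1).normal(size=(500, 20))
print(rademacher_linear(X))                                # roughly B * sqrt(d/n) here, ~0.2
```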
G.4 Why Classical Theory Fails for Deep Learning
A neural network with 1B parameters has a VC dimension on the order of \(10^{10}\). For any realistic dataset size, the classical bounds above are vacuous: they predict no generalization at all. Yet these models work. Why?
G.4.1 Implicit Regularization
Gradient descent on overparameterized models converges to solutions with special properties:
Minimum Norm. For overparameterized linear models, gradient descent initialized at zero converges to the interpolating solution with minimum \(\ell_2\) norm; for deep networks the bias is harder to characterize but appears qualitatively similar (a minimal demonstration appears at the end of this subsection).
Flat Minima. Solutions found by SGD tend to lie in “flat” regions of the loss landscape, where small perturbations don’t increase loss much. These correlate with good generalization.
Simplicity Bias. Networks learn simple functions first (low-frequency Fourier components), only fitting complex patterns with extended training.
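The minimum-norm bias is a theorem in the overparameterized linear case and can be verified in a few lines. A minimal sketch, assuming the standard least-squares setting with zero initialization (sizes and data are illustrative):

```python
# Implicit regularization sketch: gradient descent from zero initialization on
# squared loss with more features than examples converges to the interpolating
# solution of minimum l2 norm (the pseudoinverse solution).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                 # d >> n: infinitely many interpolating solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                                # zero init keeps iterates in the row space of X
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2       # step size below 2 / largest eigenvalue of X^T X
for _ in range(5_000):
    w -= lr * X.T @ (X @ w - y)                # full-batch gradient descent

w_min_norm = np.linalg.pinv(X) @ y             # minimum-l2-norm interpolator
print(np.abs(X @ w - y).max())                 # ~0: zero training error
print(np.linalg.norm(w - w_min_norm))          # ~0: GD converged to the min-norm solution
```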
G.4.2 Benign Overfitting
In high dimensions, models can interpolate training data (zero training error) yet generalize:
Double Descent. As capacity increases, test error first traces the classical U-shape, peaks near the interpolation threshold, and then decreases again as capacity grows further (see the sketch at the end of this subsection).
Conditions for Benign Overfitting:
- Covariance spectrum decays (signal in few directions)
- Signal-to-noise ratio is large in signal directions
- Model’s implicit bias aligns with signal directions
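Double descent can be reproduced with minimum-norm regression on random features, in the spirit of Belkin et al. (2019). Everything in the sketch below (target function, feature map, sizes) is an illustrative assumption; the test error typically spikes near the interpolation threshold \(p \approx n\) and falls again as \(p\) grows.

```python
# Double-descent sketch: minimum-norm least squares on random ReLU features.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d = 100, 2000, 5

def target(X):
    return np.sin(X @ np.ones(d))                   # fixed nonlinear target

X_train, X_test = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y_train = target(X_train) + 0.1 * rng.normal(size=n)
y_test = target(X_test)

for p in [20, 50, 90, 100, 110, 200, 1000, 5000]:   # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)        # random, untrained first layer
    F_train = np.maximum(X_train @ W, 0.0)          # ReLU features
    F_test = np.maximum(X_test @ W, 0.0)
    beta = np.linalg.pinv(F_train) @ y_train        # min-norm least-squares second layer
    mse = np.mean((F_test @ beta - y_test) ** 2)
    print(f"p = {p:5d}   test MSE = {mse:8.3f}")
```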
G.4.3 PAC-Bayes Bounds
Tighter bounds that account for the distribution over hypotheses:
\[R(\hat{f}) \leq \hat{R}_{train}(\hat{f}) + \sqrt{\frac{KL(Q || P) + \log(n/\delta)}{2n}} \tag{G.6}\]
where \(Q\) is the posterior over models and \(P\) is a prior fixed before seeing the data (here \(P\) denotes the prior, not the data distribution of G.1).
For neural networks: The prior is implicit (initialization distribution), and the posterior concentrates near found solutions. If training moves parameters only slightly, \(KL\) is small and generalization is good.
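One concrete instantiation (an assumption for illustration, not the text's prescription) uses an isotropic Gaussian prior centered at the initialization and a Gaussian posterior of the same width centered at the trained weights; the KL term then reduces to the squared distance the weights traveled, scaled by the prior variance.

```python
# PAC-Bayes sketch: Gaussian prior N(w_init, s^2 I) and Gaussian posterior
# N(w_trained, s^2 I) give KL(Q || P) = ||w_trained - w_init||^2 / (2 s^2).
# All numbers below are illustrative assumptions.
import numpy as np

def pac_bayes_gap(w_init, w_trained, sigma, n, delta=0.05):
    """Gap term of Eq. (G.6) under the Gaussian prior/posterior assumption."""
    kl = np.sum((w_trained - w_init) ** 2) / (2 * sigma ** 2)
    return np.sqrt((kl + np.log(n / delta)) / (2 * n))

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=10_000)
w_small_move = w_init + rng.normal(scale=0.01, size=10_000)  # training barely moved the weights
w_large_move = w_init + rng.normal(scale=1.0, size=10_000)   # training moved them a lot

print(pac_bayes_gap(w_init, w_small_move, sigma=0.1, n=100_000))  # ~0.02: tight
print(pac_bayes_gap(w_init, w_large_move, sigma=0.1, n=100_000))  # >1: vacuous for a [0,1] loss
```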
G.5 Scaling Laws
Empirical scaling laws capture foundation model generalization:
Chinchilla Law: \[L(N, D) = A \cdot N^{-\alpha} + B \cdot D^{-\beta} + L_\infty \tag{G.7}\]
where \(N\) is the number of parameters, \(D\) the number of training tokens, \(A\) and \(B\) are fitted constants, \(L_\infty\) is the irreducible loss, and \(\alpha \approx 0.34\), \(\beta \approx 0.28\).
Implications:
- Test loss decreases as power law in both data and parameters
- Compute-optimal allocation scales \(N\) and \(D\) in roughly equal proportion (about 20 tokens per parameter in the original Chinchilla fit); see the sketch after this list
- No evidence of “diminishing returns” within studied regimes
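The allocation rule can be recovered numerically from (G.7): fix a compute budget, sweep over parameter counts, and keep the split with the lowest predicted loss. In the sketch below, \(A\), \(B\), and \(L_\infty\) are placeholder constants and \(C \approx 6ND\) is a common FLOP heuristic; only \(\alpha\) and \(\beta\) come from the text, so the printed tokens-per-parameter ratios are illustrative.

```python
# Compute-optimal allocation from Eq. (G.7) under a fixed FLOP budget C ~ 6*N*D.
# A, B, L_inf are placeholder constants (only alpha and beta come from the text).
import numpy as np

A, B, L_inf = 400.0, 400.0, 1.7
alpha, beta = 0.34, 0.28

def predicted_loss(N, D):
    return A * N ** (-alpha) + B * D ** (-beta) + L_inf

def optimal_split(C, num_grid=4000):
    """Grid-search N (with D = C / (6N)) minimizing the predicted loss."""
    N = np.logspace(6, 13, num_grid)      # candidate parameter counts
    D = C / (6.0 * N)                     # tokens implied by the compute budget
    losses = predicted_loss(N, D)
    i = int(np.argmin(losses))
    return N[i], D[i], losses[i]

for C in [1e21, 1e23, 1e25]:              # FLOP budgets
    N_opt, D_opt, L_opt = optimal_split(C)
    print(f"C = {C:.0e}  ->  N ~ {N_opt:.2e}, D ~ {D_opt:.2e}, "
          f"tokens/param ~ {D_opt / N_opt:.0f}, predicted loss {L_opt:.3f}")
```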
G.6 Implications for Genomic Foundation Models
- Pretraining provides implicit regularization that constrains the effective hypothesis space
- Transfer learning works because pretrained representations have low effective dimension; the “usable” capacity is much smaller than parameter count
- Benchmark overfitting is possible even for models that generalize well to new data; small test sets have high variance
- Scale helps within current regimes, but theoretical understanding of why remains incomplete
G.7 Further Reading
- Shalev-Shwartz & Ben-David (2014). *Understanding Machine Learning: From Theory to Algorithms*, for comprehensive foundations
- Mohri, Rostamizadeh & Talwalkar (2018). *Foundations of Machine Learning*, for a rigorous mathematical treatment
- Belkin et al. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off", for double descent and interpolation
- Hoffmann et al. (2022). "Training Compute-Optimal Large Language Models", for the Chinchilla scaling laws