Appendix G — Statistical Learning Theory Primer
A neural network with 1 billion parameters has enough capacity to memorize every training example verbatim. Classical learning theory predicts such models cannot generalize. Yet they do. This paradox sits at the heart of deep learning theory and shapes how we evaluate genomic foundation models.
G.1 The Generalization Problem
Setup. We observe \(n\) training examples \((x_1, y_1), \ldots, (x_n, y_n)\) drawn i.i.d. from an unknown distribution \(P\). We fit a model \(\hat{f}\) to minimize the training error (empirical risk):
\[\hat{R}_{train}(\hat{f}) = \frac{1}{n} \sum_{i=1}^n L(y_i, \hat{f}(x_i)) \tag{G.1}\]
Goal. Bound the generalization error:
\[R(\hat{f}) = \mathbb{E}_{(x,y) \sim P}[L(y, \hat{f}(x))] \tag{G.2}\]
The generalization gap is \(R(\hat{f}) - \hat{R}_{train}(\hat{f})\).
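In practice the population risk in (G.2) is unobservable and is estimated on a large held-out sample from the same distribution. A minimal sketch of the setup (the synthetic task, classifier, and sample sizes are illustrative assumptions, not from the text):

```python
# Sketch of Eqs. (G.1)-(G.2): training risk is computed on the sample; the
# population risk is approximated on a large held-out draw from the same P.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, d=50):
    """Draw n pairs (x, y) from a simple synthetic distribution P."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)   # noisy linear labels
    return X, y

X_train, y_train = sample(200)
X_test, y_test = sample(100_000)          # large held-out set standing in for P

f_hat = LogisticRegression(max_iter=1000).fit(X_train, y_train)
zero_one = lambda y_true, y_pred: np.mean(y_true != y_pred)    # 0/1 loss L

r_train = zero_one(y_train, f_hat.predict(X_train))            # empirical risk, Eq. (G.1)
r_true = zero_one(y_test, f_hat.predict(X_test))               # Monte Carlo estimate of Eq. (G.2)
print(f"train risk {r_train:.3f}  est. true risk {r_true:.3f}  gap {r_true - r_train:.3f}")
```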
G.2 VC Dimension and Capacity
The Vapnik-Chervonenkis (VC) dimension measures a hypothesis class’s complexity.
Definition. A hypothesis class \(\mathcal{H}\) shatters a set of points \(\{x_1, \ldots, x_m\}\) if for every labeling \((y_1, \ldots, y_m) \in \{0,1\}^m\), there exists \(h \in \mathcal{H}\) with \(h(x_i) = y_i\) for all \(i\).
The VC dimension \(d_{VC}(\mathcal{H})\) is the largest \(m\) such that some set of \(m\) points can be shattered.
Examples:
| Hypothesis Class | VC Dimension |
|---|---|
| Linear classifiers in \(\mathbb{R}^d\) | \(d + 1\) |
| Axis-aligned rectangles in \(\mathbb{R}^2\) | 4 |
| Circles in \(\mathbb{R}^2\) | 3 |
| Neural network with \(W\) weights | \(O(W \log W)\) |
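Shattering can be checked mechanically for small classes. The sketch below (an illustration, not from the text) tests whether linear classifiers in \(\mathbb{R}^2\) shatter a point set by solving a feasibility linear program for each of the \(2^m\) labelings: three points in general position are shattered, while the XOR labeling of the corners of a square is not, consistent with \(d_{VC} = d + 1 = 3\).

```python
# Brute-force shattering check for linear classifiers in R^2: a labeling is
# realizable iff the feasibility LP  y_i * (w . x_i + b) >= 1 for all i  has a
# solution. Point sets below are illustrative.
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """True if labels y in {-1,+1} on the rows of X admit a separating hyperplane."""
    n, d = X.shape
    # Variables z = [w_1, ..., w_d, b]; constraints -y_i * (w . x_i + b) <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """True if linear classifiers realize every labeling of the rows of X."""
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1, 1], repeat=X.shape[0]))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(triangle))   # True: 3 points in general position are shattered
print(shattered(square))     # False: the XOR labeling is not linearly separable
```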
G.3 Classical Generalization Bounds
VC Bound. With probability at least \(1 - \delta\):
\[R(\hat{f}) \leq \hat{R}_{train}(\hat{f}) + \sqrt{\frac{d_{VC} (\log(2n/d_{VC}) + 1) + \log(4/\delta)}{n}} \tag{G.3}\]
Interpretation:
- Generalization gap scales as \(O(\sqrt{d_{VC}/n})\)
- More capacity → larger gap
- More data → smaller gap
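Plugging numbers into (G.3) shows how fast the bound loses its bite as capacity grows; the arithmetic below is purely illustrative.

```python
# Illustrative arithmetic for the VC bound (G.3).
import math

def vc_gap_bound(d_vc, n, delta=0.05):
    """Upper bound on the generalization gap from Eq. (G.3)."""
    if n <= d_vc:
        # The sample can be shattered, so the bound carries no information.
        return float("inf")
    return math.sqrt((d_vc * (math.log(2 * n / d_vc) + 1) + math.log(4 / delta)) / n)

print(vc_gap_bound(d_vc=100, n=100_000))    # ~0.09: informative for small models
print(vc_gap_bound(d_vc=1e10, n=1e7))       # inf: a ~1B-parameter net vs 10M examples
print(vc_gap_bound(d_vc=1e10, n=2e10))      # ~1.1: still vacuous even with 20B examples
```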
Rademacher Complexity. A data-dependent measure of capacity:
\[\mathfrak{R}_n(\mathcal{H}) = \mathbb{E}\left[\sup_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^n \sigma_i h(x_i)\right] \tag{G.4}\]
where \(\sigma_i \in \{-1, +1\}\) are i.i.d. uniform random signs (Rademacher variables).
With probability \(1 - \delta\): \[R(\hat{f}) \leq \hat{R}_{train}(\hat{f}) + 2\mathfrak{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}} \tag{G.5}\]
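For simple classes the supremum in (G.4) has a closed form, so the empirical Rademacher complexity can be estimated by Monte Carlo over random sign vectors. The sketch below assumes a norm-bounded linear class (an illustration, not from the text), where \(\sup_{\|w\|_2 \le B} \frac{1}{n}\sum_i \sigma_i \langle w, x_i \rangle = \frac{B}{n}\,\big\|\sum_i \sigma_i x_i\big\|_2\).

```python
# Monte Carlo estimate of the empirical Rademacher complexity (G.4) for the
# norm-bounded linear class H = { x -> <w, x> : ||w||_2 <= B }, using the
# closed-form supremum (B/n) * ||sum_i sigma_i x_i||_2. Data are illustrative.
import numpy as np

def rademacher_linear(X, B=1.0, num_draws=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))   # random sign vectors sigma
    sups = B * np.linalg.norm(sigma @ X, axis=1) / n       # closed-form sup for each draw
    return sups.mean()

X = np.random.default_rng(1).normal(size=(500, 20))
print(rademacher_linear(X))                                # roughly B * sqrt(d/n) here, ~0.2
```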
G.4 Why Classical Theory Fails for Deep Learning
A neural network with 1B parameters has a VC dimension on the order of \(10^{10}\). For any realistic dataset size, the classical bounds above are vacuous: they predict no generalization at all. Yet these models work. Why?
G.4.1 Implicit Regularization
Gradient descent on overparameterized models converges to solutions with special properties:
Minimum Norm. For overparameterized linear models, gradient descent initialized at zero converges to the interpolating solution with minimum \(\ell_2\) norm; for deep networks the bias is harder to characterize but appears qualitatively similar (a minimal demonstration appears at the end of this subsection).
Flat Minima. Solutions found by SGD tend to lie in “flat” regions of the loss landscape, where small perturbations don’t increase loss much. These correlate with good generalization.
Simplicity Bias. Networks learn simple functions first (low-frequency Fourier components), only fitting complex patterns with extended training.
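The minimum-norm bias is a theorem in the overparameterized linear case and can be verified in a few lines. A minimal sketch, assuming the standard least-squares setting with zero initialization (sizes and data are illustrative):

```python
# Implicit regularization sketch: gradient descent from zero initialization on
# squared loss with more features than examples converges to the interpolating
# solution of minimum l2 norm (the pseudoinverse solution).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                                 # d >> n: infinitely many interpolating solutions
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                                # zero init keeps iterates in the row space of X
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2       # step size below 2 / largest eigenvalue of X^T X
for _ in range(5_000):
    w -= lr * X.T @ (X @ w - y)                # full-batch gradient descent

w_min_norm = np.linalg.pinv(X) @ y             # minimum-l2-norm interpolator
print(np.abs(X @ w - y).max())                 # ~0: zero training error
print(np.linalg.norm(w - w_min_norm))          # ~0: GD converged to the min-norm solution
```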
G.4.2 Benign Overfitting
In high dimensions, models can interpolate training data (zero training error) yet generalize:
Double Descent. As capacity increases, test error first traces the classical U-shape, peaks near the interpolation threshold, and then decreases again as capacity grows further (see the sketch at the end of this subsection).
Conditions for Benign Overfitting:
- Covariance spectrum decays (signal in few directions)
- Signal-to-noise ratio is large in signal directions
- Model’s implicit bias aligns with signal directions
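Double descent can be reproduced with minimum-norm regression on random features, in the spirit of Belkin et al. (2019). Everything in the sketch below (target function, feature map, sizes) is an illustrative assumption; the test error typically spikes near the interpolation threshold \(p \approx n\) and falls again as \(p\) grows.

```python
# Double-descent sketch: minimum-norm least squares on random ReLU features.
import numpy as np

rng = np.random.default_rng(0)
n, n_test, d = 100, 2000, 5

def target(X):
    return np.sin(X @ np.ones(d))                   # fixed nonlinear target

X_train, X_test = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y_train = target(X_train) + 0.1 * rng.normal(size=n)
y_test = target(X_test)

for p in [20, 50, 90, 100, 110, 200, 1000, 5000]:   # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)        # random, untrained first layer
    F_train = np.maximum(X_train @ W, 0.0)          # ReLU features
    F_test = np.maximum(X_test @ W, 0.0)
    beta = np.linalg.pinv(F_train) @ y_train        # min-norm least-squares second layer
    mse = np.mean((F_test @ beta - y_test) ** 2)
    print(f"p = {p:5d}   test MSE = {mse:8.3f}")
```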
G.4.3 PAC-Bayes Bounds
Tighter bounds that account for the distribution over hypotheses:
\[R(\hat{f}) \leq \hat{R}_{train}(\hat{f}) + \sqrt{\frac{KL(Q || P) + \log(n/\delta)}{2n}} \tag{G.6}\]
where \(Q\) is the posterior over models and \(P\) is a prior fixed before seeing the data (here \(P\) denotes the prior, not the data distribution of G.1).
For neural networks: The prior is implicit (initialization distribution), and the posterior concentrates near found solutions. If training moves parameters only slightly, \(KL\) is small and generalization is good.
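One concrete instantiation (an assumption for illustration, not the text's prescription) uses an isotropic Gaussian prior centered at the initialization and a Gaussian posterior of the same width centered at the trained weights; the KL term then reduces to the squared distance the weights traveled, scaled by the prior variance.

```python
# PAC-Bayes sketch: Gaussian prior N(w_init, s^2 I) and Gaussian posterior
# N(w_trained, s^2 I) give KL(Q || P) = ||w_trained - w_init||^2 / (2 s^2).
# All numbers below are illustrative assumptions.
import numpy as np

def pac_bayes_gap(w_init, w_trained, sigma, n, delta=0.05):
    """Gap term of Eq. (G.6) under the Gaussian prior/posterior assumption."""
    kl = np.sum((w_trained - w_init) ** 2) / (2 * sigma ** 2)
    return np.sqrt((kl + np.log(n / delta)) / (2 * n))

rng = np.random.default_rng(0)
w_init = rng.normal(scale=0.1, size=10_000)
w_small_move = w_init + rng.normal(scale=0.01, size=10_000)  # training barely moved the weights
w_large_move = w_init + rng.normal(scale=1.0, size=10_000)   # training moved them a lot

print(pac_bayes_gap(w_init, w_small_move, sigma=0.1, n=100_000))  # ~0.02: tight
print(pac_bayes_gap(w_init, w_large_move, sigma=0.1, n=100_000))  # >1: vacuous for a [0,1] loss
```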
G.5 Scaling Laws
Empirical scaling laws capture foundation model generalization:
Chinchilla Law: \[L(N, D) = A \cdot N^{-\alpha} + B \cdot D^{-\beta} + L_\infty \tag{G.7}\]
where \(N\) is the number of parameters, \(D\) the number of training tokens, \(A\) and \(B\) are fitted constants, \(L_\infty\) is the irreducible loss, and \(\alpha \approx 0.34\), \(\beta \approx 0.28\).
Implications:
- Test loss decreases as power law in both data and parameters
- Compute-optimal allocation scales \(N\) and \(D\) in roughly equal proportion (about 20 tokens per parameter in the original Chinchilla fit); see the sketch after this list
- No evidence of “diminishing returns” within studied regimes
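The allocation rule can be recovered numerically from (G.7): fix a compute budget, sweep over parameter counts, and keep the split with the lowest predicted loss. In the sketch below, \(A\), \(B\), and \(L_\infty\) are placeholder constants and \(C \approx 6ND\) is a common FLOP heuristic; only \(\alpha\) and \(\beta\) come from the text, so the printed tokens-per-parameter ratios are illustrative.

```python
# Compute-optimal allocation from Eq. (G.7) under a fixed FLOP budget C ~ 6*N*D.
# A, B, L_inf are placeholder constants (only alpha and beta come from the text).
import numpy as np

A, B, L_inf = 400.0, 400.0, 1.7
alpha, beta = 0.34, 0.28

def predicted_loss(N, D):
    return A * N ** (-alpha) + B * D ** (-beta) + L_inf

def optimal_split(C, num_grid=4000):
    """Grid-search N (with D = C / (6N)) minimizing the predicted loss."""
    N = np.logspace(6, 13, num_grid)      # candidate parameter counts
    D = C / (6.0 * N)                     # tokens implied by the compute budget
    losses = predicted_loss(N, D)
    i = int(np.argmin(losses))
    return N[i], D[i], losses[i]

for C in [1e21, 1e23, 1e25]:              # FLOP budgets
    N_opt, D_opt, L_opt = optimal_split(C)
    print(f"C = {C:.0e}  ->  N ~ {N_opt:.2e}, D ~ {D_opt:.2e}, "
          f"tokens/param ~ {D_opt / N_opt:.0f}, predicted loss {L_opt:.3f}")
```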
G.6 Implications for Genomic Foundation Models
- Pretraining provides implicit regularization that constrains the effective hypothesis space
- Transfer learning works because pretrained representations have low effective dimension; the “usable” capacity is much smaller than parameter count
- Benchmark overfitting is possible even for models that generalize well to new data; small test sets have high variance
- Scale helps within current regimes, but theoretical understanding of why remains incomplete
G.7 Further Reading
- Shalev-Shwartz & Ben-David (2014). *Understanding Machine Learning: From Theory to Algorithms*, for comprehensive foundations
- Mohri, Rostamizadeh & Talwalkar (2018). *Foundations of Machine Learning*, for a rigorous mathematical treatment
- Belkin et al. (2019). "Reconciling modern machine-learning practice and the classical bias-variance trade-off", for double descent and interpolation
- Hoffmann et al. (2022). "Training Compute-Optimal Large Language Models", for the Chinchilla scaling laws