The Geometry of Uncertainty

In introductory probability, we often treat “variance” as a monolithic number describing spread. But when we introduce information, represented by a $\sigma$-algebra or a random variable $X$, we can dissect these fluctuations with remarkable precision.

The core insight is geometric: if we view random variables as vectors in an $L^2$ space (the space of random variables with finite second moments), then conditional expectation is nothing more than an orthogonal projection.

1. The Projection Principle

Let $Y \in L^2(\Omega, \mathcal{F}, P)$ be our target variable. We observe some information $X$, which generates a sub-$\sigma$-algebra $\mathcal{G} = \sigma(X)$.

We seek the “best” approximation of $Y$ given only the information in $\mathcal{G}$. “Best” here means minimizing the mean squared error (the $L^2$ distance). The solution, $\mathbb{E}[Y|X]$, is the unique element $Z$ in the subspace of $\mathcal{G}$-measurable functions such that the error $Y - Z$ is orthogonal to the subspace.

$$Y = \underbrace{\mathbb{E}[Y|X]}_{\text{Signal}} + \underbrace{(Y - \mathbb{E}[Y|X])}_{\text{Noise } \varepsilon}$$

This isn’t just a definition; it’s a decomposition of the vector $Y$ into two orthogonal components:

  1. The Projection: $\mathbb{E}[Y|X]$ lies in the plane defined by $X$.
  2. The Residual: $\varepsilon$ is perpendicular to that plane, as the sketch below illustrates.
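To make the projection concrete, here is a minimal NumPy sketch (my own illustration, not from the text) using a toy model in which the conditional expectation is known in closed form: $Y = X^2 + \eta$ with $\eta$ independent of $X$ and mean zero, so that $\mathbb{E}[Y|X] = X^2$. The projection should beat any rival function of $X$ in mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Toy model (an assumption for illustration): Y = X^2 + noise, with the noise
# independent of X, so the true conditional expectation is E[Y|X] = X^2.
X = rng.normal(size=n)
Y = X**2 + rng.normal(scale=0.5, size=n)

def mse(pred):
    """Empirical mean squared error of a predictor of Y."""
    return np.mean((Y - pred) ** 2)

# The projection E[Y|X] = X^2 minimizes the MSE over all functions of X.
print("MSE of E[Y|X] = X^2        :", mse(X**2))                  # ~0.25, the noise variance
print("MSE of the rival g(X) = |X|:", mse(np.abs(X)))             # strictly larger
print("MSE of the rival g(X) = E[Y]:", mse(np.full(n, Y.mean()))) # larger still
```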

2. Orthogonality and the Vanishing Cross-Term

The power of this perspective becomes clear when we look at the orthogonality condition. By definition of projection, the residual $\varepsilon$ is uncorrelated with any function of $X$ (technically, orthogonal in the inner product $\langle U, V \rangle = \mathbb{E}[UV]$).

$$\mathbb{E}[\varepsilon \cdot g(X)] = 0 \quad \text{for any measurable } g \text{ with } g(X) \in L^2$$

This simple fact explains why there are no “cross terms” when we analyze variance.
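A quick numerical check of this orthogonality, continuing the same toy model ($Y = X^2 + $ noise, so $\varepsilon$ is just the noise); the particular functions $g$ below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=200_000)
Y = X**2 + rng.normal(scale=0.5, size=X.size)   # toy model: E[Y|X] = X^2

eps = Y - X**2   # residual of the projection

# Sample averages of eps * g(X) for a few functions g -- all should be near zero.
for name, gX in [("X", X), ("X**2", X**2), ("sin(X)", np.sin(X))]:
    print(f"E[eps * {name:6s}] ~ {np.mean(eps * gX):+.4f}")
```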

3. The Law of Total Variance via Pythagoras

Since the decomposition $Y - \mathbb{E}[Y] = (\mathbb{E}[Y|X] - \mathbb{E}[Y]) + \varepsilon$ is orthogonal (the first term is a function of $X$, so its cross term with $\varepsilon$ vanishes by the condition above), the Pythagorean theorem applies directly to their “lengths” (variances):

$$\|Y - \mathbb{E}[Y]\|^2 = \|\mathbb{E}[Y|X] - \mathbb{E}[Y]\|^2 + \|\varepsilon\|^2$$

Translated back into probabilistic language, this is the Law of Total Variance:

$$\operatorname{Var}(Y) = \operatorname{Var}(\mathbb{E}[Y|X]) + \mathbb{E}[\operatorname{Var}(Y|X)]$$

  • $\operatorname{Var}(\mathbb{E}[Y|X])$: The “Explained Variance.” This is how much our prediction moves around as $X$ changes. If $X$ tells us a lot about $Y$, this component is large.
  • $\mathbb{E}[\operatorname{Var}(Y|X)]$: The “Unexplained Variance.” This is the average squared length of the residual $\varepsilon$. It’s the noise we’re stuck with even after using $X$.
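Here is a small simulation (my own, with an arbitrarily chosen discrete grouping variable) that checks the identity empirically: $X$ picks one of three groups, and within group $k$ the response is Normal with mean $k$ and standard deviation $k+1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# X chooses a group in {0, 1, 2}; given X = k, Y ~ Normal(mean=k, sd=k+1).
X = rng.integers(0, 3, size=n)
Y = rng.normal(loc=X, scale=X + 1)

# Empirical conditional moments and group probabilities.
cond_mean = np.array([Y[X == k].mean() for k in range(3)])   # E[Y | X=k]
cond_var  = np.array([Y[X == k].var()  for k in range(3)])   # Var(Y | X=k)
p         = np.array([(X == k).mean()  for k in range(3)])   # P(X = k)

explained   = np.sum(p * (cond_mean - np.sum(p * cond_mean)) ** 2)  # Var(E[Y|X])
unexplained = np.sum(p * cond_var)                                  # E[Var(Y|X)]

print("Var(Y)                    :", Y.var())
print("Var(E[Y|X]) + E[Var(Y|X)] :", explained + unexplained)   # matches Var(Y)
```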

4. Why This Matters

This geometric structure repeats itself everywhere in statistics:

  • Regression (ANOVA): SST = SSR + SSE is just the finite-sample version of this same Pythagorean identity (checked numerically in the sketch after this list).
  • Machine Learning: The irreducible error (Bayes error) in the Bias-Variance decomposition corresponds exactly to the noise term $\mathbb{E}[\operatorname{Var}(Y|X)]$. No model, no matter how complex, can project onto a component of $Y$ that is orthogonal to the feature space $X$.
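As a sanity check of the ANOVA version, here is a sketch on simulated data of my own, using plain NumPy least squares rather than any particular statistics package. The fitted values play the role of $\mathbb{E}[Y|X]$, and the OLS residuals are orthogonal to the column space of the design matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated simple linear regression with an intercept.
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=n)

design = np.column_stack([np.ones(n), x])          # intercept + slope
beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # OLS fit
y_hat = design @ beta

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression ("explained") sum of squares
sse = np.sum((y - y_hat) ** 2)         # error ("residual") sum of squares

print("SST       :", sst)
print("SSR + SSE :", ssr + sse)   # equal up to floating-point error (model includes an intercept)
```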

Viewing conditional expectation as a projection rather than an integral formula clarifies why we split errors the way we do: we are simply projecting a vector onto a subspace and measuring what’s left.