The Geometry of Uncertainty
In introductory probability, we often treat “variance” as a monolithic number describing spread. But when we introduce information (represented by a $\sigma$-algebra $\mathcal{G}$ or a random variable $X$), we can dissect these fluctuations with remarkable precision.
The core insight is geometric: if we view random variables as vectors in an $L^2$ space (the space of random variables with finite second moments), then conditional expectation is nothing more than an orthogonal projection.
1. The Projection Principle
Let $Y$ be our target variable. We observe some information $X$, which generates a sub-$\sigma$-algebra $\mathcal{G} = \sigma(X)$.
We seek the “best” approximation of $Y$ given only the information in $\mathcal{G}$. “Best” here means minimizing the mean squared error (the $L^2$ distance). The solution, $E[Y \mid \mathcal{G}]$, is the unique element in the subspace of $\mathcal{G}$-measurable functions such that the error $Y - E[Y \mid \mathcal{G}]$ is orthogonal to the subspace.
This isn’t just a definition; it’s a decomposition of the vector $Y$ into two orthogonal components:
- The Projection: $E[Y \mid \mathcal{G}]$ lies in the plane defined by $\mathcal{G}$.
- The Residual: $Y - E[Y \mid \mathcal{G}]$ is perpendicular to that plane (a numerical sketch follows this list).
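To make the projection concrete, here is a minimal sketch (assuming NumPy and a toy setup with a discrete $X$, so that every $\mathcal{G}$-measurable variable is simply a function of $X$): the conditional expectation is the vector of group means, and any other function of $X$ incurs a larger mean squared error.

```python
import numpy as np

# Minimal sketch (toy setup, assumed for illustration): a discrete X, so every
# G-measurable random variable is simply some function g(X).
rng = np.random.default_rng(0)
n = 100_000
X = rng.integers(0, 4, size=n)               # X takes values 0, 1, 2, 3
Y = 2.0 * X + rng.normal(0.0, 1.0, size=n)   # target variable Y

# E[Y | X]: for discrete X this is the group mean, i.e. the orthogonal
# projection of Y onto the subspace of functions of X.
group_means = np.array([Y[X == k].mean() for k in range(4)])
Y_hat = group_means[X]

# Any other function of X has strictly larger mean squared error.
competitor = (group_means + 0.3)[X]          # a perturbed G-measurable candidate
print(np.mean((Y - Y_hat) ** 2) < np.mean((Y - competitor) ** 2))  # True
```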
2. Orthogonality and the Vanishing Cross-Term
The power of this perspective becomes clear when we look at the orthogonality condition. By definition of projection, the residual $Y - E[Y \mid \mathcal{G}]$ is uncorrelated with any function of $X$ (technically, orthogonal in the inner product $\langle U, V \rangle = E[UV]$).
This simple fact explains why there are no “cross terms” when we analyze variance.
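To spell out why, here is a short derivation (using the tower property and the notation above): since $E[Y \mid \mathcal{G}] - E[Y]$ is $\mathcal{G}$-measurable, conditioning on $\mathcal{G}$ makes the cross term vanish:

$$
\begin{aligned}
E\big[(E[Y \mid \mathcal{G}] - E[Y])\,(Y - E[Y \mid \mathcal{G}])\big]
&= E\Big[(E[Y \mid \mathcal{G}] - E[Y])\; E\big[\,Y - E[Y \mid \mathcal{G}] \mid \mathcal{G}\,\big]\Big] \\
&= E\big[(E[Y \mid \mathcal{G}] - E[Y]) \cdot 0\big] = 0.
\end{aligned}
$$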
3. The Law of Total Variance via Pythagoras
Since the decomposition is orthogonal, the Pythagorean theorem applies directly to their “lengths” (variances):

$$\|Y - E[Y]\|^2 \;=\; \|E[Y \mid \mathcal{G}] - E[Y]\|^2 \;+\; \|Y - E[Y \mid \mathcal{G}]\|^2.$$
Translated back into probabilistic language, this is the Law of Total Variance:

$$\operatorname{Var}(Y) \;=\; \operatorname{Var}\big(E[Y \mid X]\big) \;+\; E\big[\operatorname{Var}(Y \mid X)\big].$$
- $\operatorname{Var}(E[Y \mid X])$: The “Explained Variance.” This is how much our prediction $E[Y \mid X]$ moves around as $X$ changes. If $X$ tells us a lot about $Y$, this component is large.
- $E[\operatorname{Var}(Y \mid X)]$: The “Unexplained Variance.” This is the average squared length of the residual $Y - E[Y \mid X]$. It’s the noise we’re stuck with even after using $X$. (A numerical check follows this list.)
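As a quick sanity check, here is a sketch (same assumed NumPy toy setup as the earlier snippet, with a discrete $X$) computing both components and comparing their sum to the total variance:

```python
import numpy as np

# Sketch (same assumed toy setup): check Var(Y) = Var(E[Y|X]) + E[Var(Y|X)].
rng = np.random.default_rng(1)
n = 500_000
X = rng.integers(0, 4, size=n)
Y = 2.0 * X + rng.normal(0.0, 1.0, size=n)

cond_mean = np.array([Y[X == k].mean() for k in range(4)])[X]  # E[Y | X] per sample
cond_var = np.array([Y[X == k].var() for k in range(4)])[X]    # Var(Y | X) per sample

explained = cond_mean.var()      # Var(E[Y | X]), the "explained" part
unexplained = cond_var.mean()    # E[Var(Y | X)], the "unexplained" part

# The sample version of the identity is exact (it is the ANOVA decomposition),
# so the two numbers agree up to floating-point error.
print(Y.var(), explained + unexplained)
```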
4. Why This Matters
This geometric structure repeats itself everywhere in statistics:
- Regression (ANOVA): SST = SSR + SSE is just the finite-sample version of this same Pythagorean identity (see the sketch after this list).
- Machine Learning: The irreducible error (Bayes error) in the Bias-Variance decomposition corresponds exactly to the noise term $E[\operatorname{Var}(Y \mid X)]$. No model, no matter how complex, can recover a component of $Y$ that is orthogonal to the feature space $\sigma(X)$.
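For the ANOVA point, a minimal sketch (assuming NumPy and made-up data; `np.linalg.lstsq` plays the role of the projection onto the column space of the design matrix):

```python
import numpy as np

# Sketch (made-up data, assumed for illustration): the finite-sample identity
# SST = SSR + SSE for ordinary least squares with an intercept.
rng = np.random.default_rng(2)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

# Fit y ~ 1 + x; y_hat is the projection of y onto the column space of the
# design matrix (the finite-sample analogue of the feature subspace).
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ coef

sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression ("explained") sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual ("unexplained") sum of squares
print(np.isclose(sst, ssr + sse))      # True
```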
Viewing conditional expectation as a projection rather than an integral formula clarifies why we split errors the way we do: we are simply projecting a vector onto a subspace and measuring what’s left.