Prediction Error Decomposition (Bias–Variance Tradeoff)

The Journey of Regularization in ML Using Logistic Regression · Part 4 of 6

Understanding overfitting and underfitting leads to an important concept called the bias–variance tradeoff.

1. Introduction

When training models in machine learning, we want the model to learn patterns from the training data and make accurate predictions on unseen data. On the other hand, the expected prediction error in machine learning is decomposed into three parts known as 'predictive error decomposition' or 'total prediction error of a model', which can be decomposed as

$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$

Where:

$\text{Bias}^2$ : Error caused by incorrect assumptions in the model.

Variance: Error caused by sensitivity to fluctuations in the training data.

Irreducible Error: Noise inherent in the data that cannot be eliminated.

Example sources:

measurement noise
randomness in observations
incomplete features

Error ≠ Bias : Bias is only one contributor to the total error.

2. What Bias and Variance Represents

In Machine Learning, the Bias–Variance concept explains why models make errors and how model complexity affects performance.

2.1. Bias

Bias measures how far the model's average prediction is from the true function.

$\text{Bias}(x) = E[\hat{f}(x)] - f(x)$

Where:

$f(x)$ : True function

$E[\hat{f}(x)]$ : Expected prediction

$(x)$ : Predicted Model

Squared Bias:

$\text{Bias}^2(x) = (E[\hat{f}(x)] - f(x))^2$

Characteristics of Bias

High Bias Model:

Oversimplified
Cannot capture patterns
Underfitting

Examples:

Linear regression for nonlinear data
Very shallow neural network

 Key point: the goal of the ML engineer should be to obtained low bias model.

For example, suppose the true relationship is the following:

$y = x^2$

And the training data:

x	True y
1	1
2	4
3	9

If the model assumes:

$y=ax+b$

Predictions may be:

x	True y	Predicted
1	1	2
2	4	3
3	9	4

Let's estimate the bias average prediction error for the above model:

$\text{Bias} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$

x	True y	Predicted	Error ((\hat{y}-y))
1	1	2	1
2	4	3	-1
3	9	4	-5

$\begin{aligned}<br>\text{Bias} &= \frac{1 + (-1) + (-5)}{3} \\= \frac{-5}{3} \\\approx -1.67\end{aligned}$

$\text{Bias} = \frac{-5}{3}$

Bias = −1.67

$\text{Bias}^2 = (-1.67)^2 = 2.7889$

$\text{Bias}^2 = 2.78$

Interpretation

Bias = -1.67

Meaning:

The model predictions are on average 1.67 units lower than the true values.
This indicates systematic error, hence high bias (underfitting).

The model cannot capture the quadratic relationship $y = x^2$ because it is likely a simple linear model.

2.2. Variance

Variance measures how much predictions change when training data changes.

$\text{Var}(x) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$

Where:

$\hat{f}(x)$ : Prediction from a particular training dataset
$E[\hat{f}(x)])$ : Average prediction over many datasets

Characteristics of variance

High Variance Model:

Very complex
Fits training data closely
Overfitting

Examples:

Deep decision tree
High-degree polynomial
Very complex neural networks

Example:

Training datasets:

Dataset 1 → Model predicts: 5
Dataset 2 → Model predicts: 9
Dataset 3 → Model predicts: 2

Average prediction:

$E[\hat{f}(x)] = \frac{5 + 9 + 2}{3} = 5.33$

Variance:

$\text{Var}(x) = \frac{(5 - 5.33)^2 + (9 - 5.33)^2 + (2 - 5.33)^2}{3}$

$=30.11+13.44+11.09$

=8.21 (A large value means high variance.)

2.3. What Irreducible Error Means

Even if we had the perfect model, there is still noise in data. Example: house price prediction where two identical houses may sell for different prices due to the following:

negotiation
market conditions
random factors

This randomness cannot be removed. This is irreducible error.

3. Bias², Variance, and MSE

Let's compute Bias², Variance, and MSE together using a simple numerical example. This is the standard demonstration used in machine learning courses.

3.1. Problem Setup

Suppose the true value of the function at is:

We train the model on different datasets, producing different predictions. where total number of models are (n=4)

Model	Prediction $\hat f(x)$
Model 1	8
Model 2	9
Model 3	12
Model 4	11

Step 1: Average Prediction

$E[\hat{f}(x)] = \frac{8 + 9 + 12 + 11}{4} = 10$

$E[f^(x)]=440$

E[\hat{f}(x)] = 10

Step 2: Compute Bias

$E[\hat{f}(x)] = 10$

$Bias=10−10$

Bias=0

Step 3: Bias²

$\text{Bias}^2 = (0)^2 = 0$

$\text{Bias}^2=0$

Interpretation:

The average model prediction is exactly correct
The model has no systematic error

Step 4: Variance

$\text{Var} = \frac{1}{n} \sum (\hat{f}_i - E[\hat{f}(x)])^2$

Prediction	Deviation from mean	Squared
8	8−10 = −2	4
9	9−10 = −1	1
12	12−10 = 2	4
11	11−10 = 1	1

$\text{Var} = \frac{4 + 1 + 4 + 1}{4} = 2.5$

\text{Var} = \frac{10}{4} = 2.5

Step 5: Mean Squared Error (MSE)

\text{MSE} = \frac{1}{n} \sum (\hat{f}_i - f(x))^2

Prediction	True value	Error	Error²
8	10	-2	4
9	10	-1	1
12	10	2	4
11	10	1	1

$\text{MSE} = \frac{4 + 1 + 4 + 1}{4} = 2.5$

$\text{MSE} = \frac{10}{4} = 2.5$

3.1. Verify Bias–Variance Decomposition

$\text{MSE} = \text{Bias}^2 + \text{Var} + \epsilon$

MSE=0+2.5

MSE=2.5

Interpretation

Component	Value	Meaning
Bias²	0	No systematic error
Variance	2.5	Predictions fluctuate
MSE	2.5	Total prediction error

The model is unbiased
But predictions vary due to variance.

Conclusion:

Think of three model types:

Model Type	Bias	Variance
Underfitting	High	Low
Optimal	Low	Low
Overfitting	Low	High

← Overfitting and Underfitting in Machine... Regularization in Logistic Regression →

1. Introduction

2. What Bias and Variance Represents

2.1. Bias

Squared Bias:

Characteristics of Bias

2.2. Variance

Characteristics of variance

2.3. What Irreducible Error Means

3. Bias², Variance, and MSE

3.1. Problem Setup

3.1. Verify Bias–Variance Decomposition

0 Comments