1. Introduction
When training models in machine learning, we want the model to learn patterns from the training data and make accurate predictions on unseen data. On the other hand, the expected prediction error in machine learning is decomposed into three parts known as 'predictive error decomposition' or 'total prediction error of a model', which can be decomposed as
Where:
: Error caused by incorrect assumptions in the model.
Variance: Error caused by sensitivity to fluctuations in the training data.
Irreducible Error: Noise inherent in the data that cannot be eliminated.
Example sources:
- measurement noise
- randomness in observations
- incomplete features
Error ≠ Bias :Bias is only one contributor to the total error.
2. What Bias and Variance Represents
In Machine Learning, the Bias–Variance concept explains why models make errors and how model complexity affects performance.
2.1. Bias
Bias measures how far the model's average prediction is from the true function.
Where:
: True function
: Expected prediction
: Predicted Model
Squared Bias:
Characteristics of Bias
High Bias Model:
- Oversimplified
- Cannot capture patterns
- Underfitting
Examples:
- Linear regression for nonlinear data
- Very shallow neural network
Key point: the goal of the ML engineer should be to obtained low bias model.
For example, suppose the true relationship is the following:
And the training data:
| x | True y |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
If the model assumes:
Predictions may be:
| x | True y | Predicted |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 4 | 3 |
| 3 | 9 | 4 |
Let's estimate the bias average prediction error for the above model:
| x | True y | Predicted | Error ((\hat{y}-y)) |
|---|---|---|---|
| 1 | 1 | 2 | 1 |
| 2 | 4 | 3 | -1 |
| 3 | 9 | 4 | -5 |
Bias = −1.67
Interpretation
Bias = -1.67
Meaning:
- The model predictions are on average 1.67 units lower than the true values.
- This indicates systematic error, hence high bias (underfitting).
The model cannot capture the quadratic relationship because it is likely a simple linear model.
2.2. Variance
Variance measures how much predictions change when training data changes.
Where:
- : Prediction from a particular training dataset
- : Average prediction over many datasets
Characteristics of variance
High Variance Model:
- Very complex
- Fits training data closely
- Overfitting
Examples:
- Deep decision tree
- High-degree polynomial
- Very complex neural networks
Example:
Training datasets:
Dataset 1 → Model predicts: 5
Dataset 2 → Model predicts: 9
Dataset 3 → Model predicts: 2
Average prediction:
Variance:
=8.21 (A large value means high variance.)
2.3. What Irreducible Error Means
Even if we had the perfect model, there is still noise in data. Example: house price prediction where two identical houses may sell for different prices due to the following:
- negotiation
- market conditions
- random factors
This randomness cannot be removed. This is irreducible error.
3. Bias², Variance, and MSE
Let's compute Bias², Variance, and MSE together using a simple numerical example. This is the standard demonstration used in machine learning courses.
3.1. Problem Setup
Suppose the true value of the function at x is:
We train the model on different datasets, producing different predictions. where total number of models are (n=4)
| Model | Prediction |
|---|---|
| Model 1 | 8 |
| Model 2 | 9 |
| Model 3 | 12 |
| Model 4 | 11 |
Step 1: Average Prediction
E[\hat{f}(x)] = 10
Step 2: Compute Bias
Bias=0
Step 3: Bias²
Interpretation:
- The average model prediction is exactly correct
- The model has no systematic error
Step 4: Variance
| Prediction | Deviation from mean | Squared |
|---|---|---|
| 8 | 8−10 = −2 | 4 |
| 9 | 9−10 = −1 | 1 |
| 12 | 12−10 = 2 | 4 |
| 11 | 11−10 = 1 | 1 |
\text{Var} = \frac{10}{4} = 2.5
Step 5: Mean Squared Error (MSE)
\text{MSE} = \frac{1}{n} \sum (\hat{f}_i - f(x))^2
| Prediction | True value | Error | Error² |
|---|---|---|---|
| 8 | 10 | -2 | 4 |
| 9 | 10 | -1 | 1 |
| 12 | 10 | 2 | 4 |
| 11 | 10 | 1 | 1 |
3.1. Verify Bias–Variance Decomposition
MSE=0+2.5
MSE=2.5
Interpretation
| Component | Value | Meaning |
|---|---|---|
| Bias² | 0 | No systematic error |
| Variance | 2.5 | Predictions fluctuate |
| MSE | 2.5 | Total prediction error |
- The model is unbiased
- But predictions vary due to variance.
Conclusion:
Think of three model types:
| Model Type | Bias | Variance |
|---|---|---|
| Underfitting | High | Low |
| Optimal | Low | Low |
| Overfitting | Low | High |