StatLearning-02: Model Selection and Bias-Variance Tradeoff
Assessing Model Accuracy
For regression problems, we can assess the accuracy of a model with the mean squared error (MSE):
\[MSE = Ave[y_i - \hat{f}(x_i)]^2\]
Because the training error $MSE_{Tr}$ is biased in favor of overfit models, we should instead compute $MSE_{Te}$ on fresh test data to see how well the model performs:
\[MSE_{Te} = Ave_{i \in Te}[y_i - \hat{f}(x_i)]^2\]
The figure shows some examples of the relationship between the flexibility of a model and its MSE on training and test data.
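The contrast between training and test MSE can be reproduced numerically. The following is a minimal Python sketch, not taken from the book: the true function `true_f`, the noise level, the sample sizes, and the use of polynomial degree as a stand-in for flexibility are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical true regression function, chosen only for illustration.
    return np.sin(2 * x)

def simulate(n, sigma=0.3):
    # Draw n observations from Y = f(X) + eps with Gaussian noise (assumed).
    x = rng.uniform(-1.5, 1.5, size=n)
    y = true_f(x) + rng.normal(0.0, sigma, size=n)
    return x, y

def mse(y, y_hat):
    # Average of squared differences between observed and predicted values.
    return np.mean((y - y_hat) ** 2)

x_tr, y_tr = simulate(50)     # training data
x_te, y_te = simulate(1000)   # fresh test data from the same population

degrees = list(range(1, 11))  # polynomial degree plays the role of flexibility
mse_tr, mse_te = [], []
for d in degrees:
    fit = Polynomial.fit(x_tr, y_tr, deg=d)  # least-squares polynomial fit
    mse_tr.append(mse(y_tr, fit(x_tr)))
    mse_te.append(mse(y_te, fit(x_te)))

for d, tr, te in zip(degrees, mse_tr, mse_te):
    print(f"degree={d:2d}  MSE_Tr={tr:.4f}  MSE_Te={te:.4f}")
```

In a run of this kind, the training MSE keeps shrinking as the degree grows, while the test MSE need not follow it, which is why the test MSE is the quantity to compare models on.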
In the figure, the dashed line marks the irreducible error $Var(\epsilon)$, which appears in this formula:
\[E[(Y- \hat{f}(X))^2 | X=x] = [f(x) - \hat{f}(x)]^2 + Var(\epsilon)\]
Bias-Variance Tradeoff
Suppose we have a model $\hat{f}(X)$ fitted to training data that was drawn from the true model $Y = f(X) + \epsilon$, with $f(x) = E(Y | X=x)$. Given a test observation $(x_0, y_0)$ from the population, the expected test error decomposes as:
\[E[(y_0 - \hat{f}(x_0))^2] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)\]
Note that $Bias(\hat{f}(x_0)) = E[\hat{f}(x_0)] - f(x_0)$. That is, the bias of a model at the point $x_0$ is the difference between the expected value of $\hat{f}(x_0)$ and the true value $f(x_0)$.
But what is the expected value of $\hat{f}(x_0)$? Remember that different training data sets give different estimates $\hat{f}$, and therefore different values of $\hat{f}(x_0)$. The expected value is the average over all these possible predictions.
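With the expectation understood this way, the decomposition can be derived in a few lines. This is a sketch under two standard assumptions that are not stated explicitly here: $E[\epsilon] = 0$, and the noise $\epsilon$ in the test observation is independent of the training data. Writing $y_0 = f(x_0) + \epsilon$ and using the shorthand $\mu = E[\hat{f}(x_0)]$ (introduced only for this derivation):
\[\begin{aligned}
E[(y_0 - \hat{f}(x_0))^2] &= E[(\epsilon + (f(x_0) - \mu) + (\mu - \hat{f}(x_0)))^2] \\
&= E[\epsilon^2] + (f(x_0) - \mu)^2 + E[(\hat{f}(x_0) - \mu)^2] \\
&= Var(\epsilon) + [Bias(\hat{f}(x_0))]^2 + Var(\hat{f}(x_0))
\end{aligned}\]
The cross terms vanish in the second line because $E[\epsilon] = 0$, $E[\mu - \hat{f}(x_0)] = 0$, and $\epsilon$ is independent of $\hat{f}(x_0)$.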
Typically, as the flexibility of $\hat{f}$ increases, the expected value $E[\hat{f}(x_0)]$ moves toward the true value $f(x_0)$, so the bias decreases. At the same time, the variability of $\hat{f}(x_0)$ increases. The bias-variance tradeoff when choosing the flexibility of a model is demonstrated in the figure below.
In short, as the flexibility of a model increases, the variance increases and the bias decreases, so the test MSE is typically U-shaped.
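The tradeoff can also be estimated by simulation. The sketch below is an illustration under stated assumptions rather than the book's procedure: `true_f`, `SIGMA`, `X0`, and the polynomial degrees are hypothetical choices, and for simplicity the training inputs are held fixed so that training sets differ only in their noise. It averages over many simulated training sets to estimate the squared bias and variance of $\hat{f}(x_0)$, then compares their sum plus $Var(\epsilon)$ with a direct estimate of the expected test error.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

SIGMA = 0.3     # assumed noise standard deviation, so Var(eps) = SIGMA**2
X0 = 0.5        # assumed test point x_0
N_TRAIN = 40    # observations per simulated training set
N_REPS = 2000   # number of simulated training sets

def true_f(x):
    # Hypothetical true regression function, chosen only for illustration.
    return np.sin(2 * x)

# Simplification: fixed training inputs; only the noise is redrawn each time.
x_train = np.linspace(-1.5, 1.5, N_TRAIN)

def predictions_at_x0(degree):
    # Fit one polynomial per simulated training set and predict at X0.
    preds = np.empty(N_REPS)
    for r in range(N_REPS):
        y_train = true_f(x_train) + rng.normal(0.0, SIGMA, size=N_TRAIN)
        preds[r] = Polynomial.fit(x_train, y_train, deg=degree)(X0)
    return preds

results = {}
for degree in (1, 3, 9):
    preds = predictions_at_x0(degree)
    bias_sq = (preds.mean() - true_f(X0)) ** 2   # [E f_hat(x0) - f(x0)]^2
    variance = preds.var()                       # Var(f_hat(x0))
    # Direct estimate of E[(y0 - f_hat(x0))^2] using fresh test responses.
    y0 = true_f(X0) + rng.normal(0.0, SIGMA, size=N_REPS)
    results[degree] = {
        "bias_sq": bias_sq,
        "variance": variance,
        "decomposed_mse": bias_sq + variance + SIGMA**2,
        "direct_mse": np.mean((y0 - preds) ** 2),
    }

for degree, r in results.items():
    print(f"degree={degree}  bias^2={r['bias_sq']:.4f}  var={r['variance']:.4f}  "
          f"sum={r['decomposed_mse']:.4f}  direct={r['direct_mse']:.4f}")
```

The two estimates of the expected test error should agree up to Monte Carlo error, and comparing the rows shows how squared bias and variance move in opposite directions as the degree grows.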
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2015). An Introduction to Statistical Learning: with Applications in R.