The bias/variance trade-off is an important concept in learning theory. This post discusses the topic from both a theoretical and a practical perspective.

### General

The main goal of any learning algorithm is to predict and generalise well. More formally, this goal is equivalent to minimising the expected error on unseen data, so taking a closer look at the components of the expected error will provide a deeper understanding.

It can be proven that the expected (generalisation) error of an algorithm decomposes into three parts:

(1) Error = Bias^2 + Variance + Noise

As can be seen from (1), either high bias or high variance leads to a higher error, indicating that we need a balance between the two. Let's take a closer look at the components:

**Bias**
The bias of an algorithm can be expressed as the average difference between the algorithm's prediction and the 'desired prediction'. In other words, high bias means the algorithm is (on average) making wrong predictions and struggling to learn the patterns in the data. Note that bias is a theoretical concept, since the 'desired prediction' is usually unknown.

**Variance**
Variance is synonymous with over-fitting and can be described as the difference between the training error and the test (out-of-sample) error. An overly complex algorithm is likely to learn the noise and fail to generalise to unseen data.

As an example: given 15 noisy observations, a 9th-order polynomial will pass exactly through any 10 of the data points, but the error will explode when predicting on the remaining 5.
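A minimal numpy sketch of this effect (the sinusoidal ground truth and noise level are assumptions for illustration; a degree-9 polynomial has 10 coefficients, so it interpolates 10 points exactly):

```python
import numpy as np

rng = np.random.default_rng(0)

# 15 noisy observations of an assumed underlying function
x = np.linspace(0, 1, 15)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Fit a 9th-order polynomial on the first 10 points, test on the remaining 5
x_train, y_train = x[:10], y[:10]
x_test, y_test = x[10:], y[10:]

coeffs = np.polyfit(x_train, y_train, deg=9)
train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)

print(f"train MSE: {train_err:.2e}")  # interpolation: essentially zero
print(f"test  MSE: {test_err:.2e}")   # far larger on the held-out points
```

The training error is essentially zero because the polynomial has as many coefficients as training points, while the test error blows up on the held-out points.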

**Noise**
In the context of generalisation error, noise is irreducible: it cannot be reduced no matter what learning algorithm is used.
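The decomposition in (1) can be checked numerically with a small Monte Carlo sketch: train the same learner on many independently drawn datasets and estimate each term (the sine ground truth, noise level and polynomial degree are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                        # assumed true function
    return np.sin(2 * np.pi * x)

sigma = 0.3                      # noise std; the Noise term equals sigma**2
x_eval = np.linspace(0, 1, 50)   # fixed evaluation points
n_train, n_trials, degree = 20, 500, 3

# Train the same learner on many independently drawn datasets
preds = np.empty((n_trials, x_eval.size))
for t in range(n_trials):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    preds[t] = np.polyval(np.polyfit(x, y, degree), x_eval)

mean_pred = preds.mean(axis=0)
bias_sq = np.mean((mean_pred - f(x_eval)) ** 2)   # (average prediction - truth)^2
variance = np.mean(preds.var(axis=0))             # spread of predictions across datasets
noise = sigma ** 2                                # irreducible part

expected_err = bias_sq + variance + noise
print(f"bias^2={bias_sq:.3f}  variance={variance:.3f}  noise={noise:.3f}")
```

Summing the three estimated terms recovers the expected squared error on fresh noisy targets at the evaluation points, matching (1).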

### Practical aspects/FAQ

Q) What is an indication of high bias?
A) A large error on the training set indicates under-fitting, which is synonymous with high bias.

Q) What is an indication of high variance?
A) A small training error AND a large gap between the training and test errors indicate high variance: the algorithm has over-fitted the training data and doesn't generalise well.
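Both signatures can be seen in a small numpy sketch that fits polynomials of increasing complexity (the sine ground truth, noise level and degrees are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):                        # assumed ground truth
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.2, 30)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.2, 200)

def mse(deg):
    """Train/test MSE of a degree-`deg` polynomial fit."""
    c = np.polyfit(x_train, y_train, deg)
    return (np.mean((np.polyval(c, x_train) - y_train) ** 2),
            np.mean((np.polyval(c, x_test) - y_test) ** 2))

for deg in (0, 3, 15):
    tr, te = mse(deg)
    print(f"degree {deg:2d}: train={tr:.3f}  test={te:.3f}")
# degree 0:  large train error          -> high bias (under-fitting)
# degree 3:  errors small and close     -> reasonable balance
# degree 15: tiny train, larger test    -> high variance (over-fitting)
```

Reading the two numbers together, rather than either alone, is what distinguishes the two failure modes.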

Q) How to reduce over-fitting?
A) Several approaches can be used:

•  Obtain more data.
•  Decrease the complexity of the algorithm (via hyper-parameters or the choice of algorithm).
•  Apply regularisation.
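The regularisation approach can be sketched directly in numpy: ridge regression adds an L2 penalty that shrinks the weights of an over-complex model (the sine ground truth, polynomial degree and lambda values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                        # assumed ground truth
    return np.sin(2 * np.pi * x)

x_tr = np.sort(rng.uniform(0, 1, 15))
y_tr = f(x_tr) + rng.normal(0, 0.2, 15)
x_te = np.linspace(0.05, 0.95, 100)
y_te = f(x_te) + rng.normal(0, 0.2, 100)

def design(x, deg=9):
    return np.vander(x, deg + 1)   # high-degree polynomial features

def ridge(X, y, lam):
    # L2-regularised least squares: w = (X^T X + lam*I)^-1 X^T y
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

for lam in (0.0, 1e-3, 1e-1):
    w = ridge(design(x_tr), y_tr, lam)
    te = np.mean((design(x_te) @ w - y_te) ** 2)
    print(f"lambda={lam:g}: test MSE={te:.3f}")
```

Larger lambda trades a slightly higher training error for smaller weights and, typically, a lower test error; lambda itself is a hyper-parameter to tune (e.g. by cross-validation).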

By: Denis Lapchev