When I started my journey in data science, I found it difficult to remember the difference between bias and variance. The bias-variance tradeoff comes up whenever we talk about model predictions. This post presents a basic explanation of these terms and two related ones: underfitting and overfitting.
Bias is the difference between the model's prediction and the actual value. High bias means that your model fails to capture important patterns in the data and is underfitting the training data (high error on the training data). It is usually caused by an oversimplified model.
Variance describes how sensitive the model is to the particular training data it saw. In practice it shows up as a gap between the model's performance on the training data and on the test data. If the model performs well on the training data (low training error) but fails to generalize to the test data (high test error), it is said to have high variance. This essentially means that your model is overfitting the training data. This can happen because the model captures random noise in the data, or because it has not seen enough training data to generalize to unseen examples.
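To make the two definitions concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the data comes from a made-up rule (y = 2x plus noise), the "oversimplified" model is one that always predicts the training mean, and the "overly flexible" model is a 1-nearest-neighbor lookup that memorizes the training set.

```python
import random

random.seed(0)

def make_data(n):
    # Hypothetical ground truth for illustration: y = 2x plus Gaussian noise.
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys

def mse(predict, xs, ys):
    # Mean squared error of a prediction function on a dataset.
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(30)
test_x, test_y = make_data(30)

# Oversimplified model: ignores x and always predicts the training mean.
mean_y = sum(train_y) / len(train_y)

def constant_model(x):
    return mean_y

# Overly flexible model: memorizes the training set (1-nearest neighbor).
def nearest_neighbor_model(x):
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

const_train = mse(constant_model, train_x, train_y)
const_test = mse(constant_model, test_x, test_y)
nn_train = mse(nearest_neighbor_model, train_x, train_y)
nn_test = mse(nearest_neighbor_model, test_x, test_y)

print("constant model   train/test MSE:", const_train, const_test)
print("nearest neighbor train/test MSE:", nn_train, nn_test)
```

The constant model has a large error even on the data it was trained on, which is the high-bias (underfitting) pattern. The nearest-neighbor model has zero training error because it reproduces every training label, but its error on the test data is higher, which is the train/test gap described above.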
High Bias (UNDESIRABLE): A model that underfits the training data. In the extreme case it predicts the same output for every input, or does no better than a random guess.
Low Bias/High Variance (UNDESIRABLE): A model that overfits the training data. It performs poorly on unseen data.
Low Bias/Low Variance (DESIRABLE): A model that almost always gives the best results. It performs well on both seen and unseen examples.
To avoid underfitting (high bias), try to increase the number of features, either by finding new features or by creating new ones from the existing features.
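As a small sketch of "making new features from the existing ones", the example below fits a straight line to data that actually follows a curve, then fits the same kind of line to an engineered feature. The data (y = x squared), the helper names `fit_line` and `line_mse`, and the choice of x squared as the new feature are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    # Closed-form least squares for y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def line_mse(slope, intercept, xs, ys):
    # Mean squared error of the fitted line on the given data.
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: the target depends on x through a curve, y = x^2.
xs = [x / 2 for x in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
ys = [x * x for x in xs]

# Model 1: a straight line on the raw feature x (too simple for this data).
slope_raw, intercept_raw = fit_line(xs, ys)
mse_raw = line_mse(slope_raw, intercept_raw, xs, ys)

# Model 2: the same straight-line model, but on an engineered feature x^2.
xs_squared = [x * x for x in xs]
slope_eng, intercept_eng = fit_line(xs_squared, ys)
mse_engineered = line_mse(slope_eng, intercept_eng, xs_squared, ys)

print("training MSE with raw feature:       ", mse_raw)
print("training MSE with engineered feature:", mse_engineered)
```

The model itself did not change between the two fits. Only the input did, and the extra feature gave the line enough information to follow the data, so the training error drops sharply.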
To avoid overfitting (high variance), try the following:
1. Increase the amount of training data (collect more data or augment the training dataset).
2. Apply L1/L2 regularization to simplify your model.
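To show what L2 regularization does to a model, here is a minimal sketch for the simplest case I could think of: a single weight with no intercept, where the L2-penalized (ridge) solution has the closed form w = sum(x*y) / (sum(x*x) + lam). The data points, the helper name `ridge_weight`, and the penalty value `lam = 10.0` are hypothetical choices for illustration only.

```python
def ridge_weight(xs, ys, lam):
    # Minimizes sum((y - w * x)^2) + lam * w^2 for a single weight w.
    # With lam = 0 this is ordinary least squares without an intercept.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

w_plain = ridge_weight(xs, ys, lam=0.0)   # no regularization
w_ridge = ridge_weight(xs, ys, lam=10.0)  # L2 penalty shrinks the weight

print("weight without regularization:", w_plain)
print("weight with L2 regularization:", w_ridge)
```

The penalty pulls the weight toward zero, which is the sense in which regularization "simplifies" the model. A larger `lam` means more shrinkage, so too large a value can push the model back toward underfitting.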
By: Damanpreet Kaur