Building a statistical predictive model is good; building one with high prediction accuracy is even better. In our quest for accurate forecasts, we tend to make our models more and more complex. Such models may increase accuracy on the data they were fitted to, but will they perform the same way in the real world, on data they have never seen? Imagine getting high prediction accuracy from your model, only to watch it fail completely when tested in real life. Building an overly complex model that improves forecast accuracy on the in-sample (training) data but fails to predict correctly on real-world, out-of-sample (test) data is known as overfitting.
Let us understand this with a simple analogy. Driving a car has always excited me; since childhood I have dreamed of owning a car and cruising endlessly on the highways. To fulfill that dream, I enrolled in a driving school as soon as I turned 18. I was so excited to hold the steering wheel in my hands! I carefully noted every instruction the instructor gave. The ultimate goal was to drive smoothly, so let’s treat driving efficiently as the dependent variable and see what a complex model looks like. Having read and heard a lot from several so-called experts, I tried to incorporate everything while learning to drive. First variable: set the rear-view mirror properly; second variable: check the air pressure in the tyres; third variable: check that there is a spare tyre in case of emergency; and so on. After incorporating all these independent variables, I was able to drive effortlessly on the very first day of driving class. Pleased with myself, and never having thought that driving a car would be so easy, I took the keys to my dad’s car and went for a drive. To my surprise, I could not even move it out of first gear smoothly; you can imagine how the rest of the ride went. So what went wrong? I had driven so effortlessly during my driving class, but not now. Well, I had made the process of driving so complex by introducing so many independent variables that I overlooked the fact that the most important components, i.e., the brakes and the clutch, were also controlled by the instructor during the lesson. Even when I failed to apply the brakes or use the clutch properly, the instructor was overriding me, and it appeared to me as if I was driving smoothly. However, when tested in a real-world situation without any instructor controls, my model for driving failed completely. That is overfitting. Interesting, isn’t it?
Let’s take another example. Figure 1 shows a scatter plot of two variables, X and Y. In general, the relation between the two variables looks linear and could be represented by a straight line. However, to increase our accuracy and cover all the points in our dataset, we fit an overly complex polynomial model of order 5 (see Figure 2). The in-sample accuracy, as the R-squared also shows, is quite high for the complex model compared to the simple linear model. However, if we extrapolate the fitted curves of both models, the complex model fails miserably when predicting out of sample (the data point in red).
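The comparison described above can be reproduced with a small sketch. The data here are synthetic and purely illustrative (a hypothetical linear relation y = 2x + 1 with Gaussian noise, not the data behind the figures), and the helper name `r_squared` is our own. The sketch fits a straight line and an order-5 polynomial with NumPy, compares their in-sample R-squared, and then extrapolates both fits to a point outside the observed range.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data (assumption): y = 2x + 1 plus Gaussian noise
x = np.linspace(0, 9, 10)
y = 2.0 * x + 1.0 + rng.normal(scale=2.0, size=x.size)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


# Least-squares fits: a straight line and an order-5 polynomial
linear = Polynomial.fit(x, y, deg=1)
quintic = Polynomial.fit(x, y, deg=5)

# In-sample accuracy
r2_linear = r_squared(y, linear(x))
r2_quintic = r_squared(y, quintic(x))

# Out-of-sample check: extrapolate to a point beyond the observed range
x_new = 12.0
y_new_true = 2.0 * x_new + 1.0  # noise-free value of the assumed relation
err_linear = abs(linear(x_new) - y_new_true)
err_quintic = abs(quintic(x_new) - y_new_true)

print(f"in-sample R^2   linear: {r2_linear:.3f}   order 5: {r2_quintic:.3f}")
print(f"error at x={x_new}   linear: {err_linear:.2f}   order 5: {err_quintic:.2f}")
```

The order-5 fit can never have a lower in-sample R-squared than the straight line, because it contains the line as a special case. The extrapolation error is where the two models typically part ways.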
Bias and Variance
For anyone working on Machine Learning models, it is essential to understand the concepts of Bias and Variance and how to find a trade-off between them.
Simply put, Bias refers to the tendency of the model to keep learning the same thing without utilizing all the features of the data, because its assumptions are too simple. This is often referred to as underfitting. Bias gauges how far the model’s predictions are, on average, from the true values, no matter which training/in-sample data it is fitted on.
Variance, on the other hand, is the tendency of the model to learn random things by fitting a complex model that mimics the noise in the data too closely. This is often referred to as overfitting. Variance denotes how sensitive the model is to the training/in-sample data, that is, how much its predictions change when that data changes.
To understand the concept of Bias and Variance in more detail, let’s refer to the figure below:
In the first image we try to fit a linear model to a quadratic dataset:
- Clearly, the model does not take into account all the information in the data, a case of underfitting (high bias).
- Also, if we were to add new data points to the existing set, the model results would not change much, indicating low variance.
Now, in the third image, we try to fit a polynomial model of degree 30 to the same dataset:
- Here the model seems to represent our data quite well, indicating low bias.
- However, on a closer look, we observe that the model is trying to run through all the data points, mimicking the input data rather than its general behaviour/features. As a result, the model outcome would vary significantly if we were to add new data points to our existing dataset and re-estimate the model, indicating high variance.
Now, if we look at the second image, it seems to have struck the balance between Bias and Variance, and we may refer to it as the correct fit.
So always remember:
High Bias + Low Variance -> Underfitting
Low Bias + High Variance -> Overfitting
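The two rules above can be checked numerically. The following is a minimal sketch under stated assumptions: the true relation is taken to be y = x^2 with Gaussian noise, the model is refitted on many freshly drawn noisy samples, and a degree-12 polynomial stands in for the very flexible model (a smaller degree than the 30 mentioned earlier, chosen only to keep the fit numerically well behaved). The function name `bias_variance` and all constants are illustrative.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)  # fixed seed for reproducibility

# Assumed setup: fixed x grid, true relation y = x^2, Gaussian noise
x = np.linspace(-3, 3, 40)
f_true = x ** 2
n_repeats = 200   # number of simulated training sets
noise_sd = 1.0


def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of a given degree."""
    preds = np.empty((n_repeats, x.size))
    for i in range(n_repeats):
        # Draw a fresh noisy training set and refit the model
        y = f_true + rng.normal(scale=noise_sd, size=x.size)
        preds[i] = Polynomial.fit(x, y, deg=degree)(x)
    mean_pred = preds.mean(axis=0)
    # Squared bias: how far the average prediction is from the truth
    bias_sq = float(np.mean((mean_pred - f_true) ** 2))
    # Variance: how much predictions move from one training set to the next
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance


results = {d: bias_variance(d) for d in (1, 2, 12)}
for d, (bias_sq, variance) in results.items():
    print(f"degree {d:2d}: bias^2 = {bias_sq:.3f}, variance = {variance:.3f}")
```

Under these assumptions the straight line should show the largest squared bias and the smallest variance, the degree-12 fit the reverse, and the quadratic should sit in between on variance with almost no bias.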
So far we have understood that we can always make our in-sample results more accurate by increasing the complexity of the model. However, this approach fails miserably on real-world/out-of-sample data, because the model has memorized the data rather than generalizing from its important features.
There is always a trade-off between model accuracy and model complexity, as the figure below shows. Ideally, we should aim for the peak of the dotted line in that figure.
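To make the trade-off concrete, here is a rough sketch that sweeps model complexity and records both in-sample and held-out error. Everything in it is an assumption made for illustration: the quadratic data-generating relation, the 50/30 split, and the range of polynomial degrees.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(7)  # fixed seed for reproducibility

# Hypothetical data (assumption): y = x^2 plus Gaussian noise
x = rng.uniform(-3, 3, size=80)
y = x ** 2 + rng.normal(scale=1.0, size=x.size)

# First 50 points are the in-sample (training) data, the rest are held out
x_train, y_train = x[:50], y[:50]
x_test, y_test = x[50:], y[50:]

degrees = list(range(1, 11))
train_mse, test_mse = [], []
for d in degrees:
    model = Polynomial.fit(x_train, y_train, deg=d)
    # Mean squared error on the data used for fitting ...
    train_mse.append(float(np.mean((y_train - model(x_train)) ** 2)))
    # ... and on data the model has never seen
    test_mse.append(float(np.mean((y_test - model(x_test)) ** 2)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree {d:2d}: in-sample MSE = {tr:.3f}, out-of-sample MSE = {te:.3f}")
```

The in-sample error can only go down as the degree increases, since each model contains the previous one as a special case. The out-of-sample error has no such guarantee, which is why it is the more honest guide to how much complexity is worth having.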
So how do we ensure that we do not fall into the trap of overfitting and give extra weight to unnecessary noise in the data? The easiest way to check is through the out-of-sample prediction error. The model that gives the lowest out-of-sample prediction error, rather than the lowest in-sample prediction error, may be regarded as the best model. Another closely related method is K-fold cross-validation, in which the data is split into K parts and each part in turn is held out for testing while the model is fitted on the rest.
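As an illustration of K-fold cross-validation, the sketch below implements it by hand with NumPy and uses it to compare polynomial models of different degrees. The data are synthetic (an assumed quadratic relation with Gaussian noise), and the function name `kfold_mse`, the choice of K = 5, and the range of degrees are all our own assumptions.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(123)  # fixed seed for reproducibility

# Hypothetical data (assumption): y = x^2 plus Gaussian noise
x = rng.uniform(-3, 3, size=60)
y = x ** 2 + rng.normal(scale=1.0, size=x.size)

# Shuffle the indices once and split them into K folds,
# so that every model is evaluated on the same folds
k = 5
folds = np.array_split(rng.permutation(x.size), k)


def kfold_mse(degree):
    """Average held-out mean squared error over the K folds."""
    errors = []
    for fold in folds:
        train_mask = np.ones(x.size, dtype=bool)
        train_mask[fold] = False  # hold this fold out
        model = Polynomial.fit(x[train_mask], y[train_mask], deg=degree)
        errors.append(np.mean((y[fold] - model(x[fold])) ** 2))
    return float(np.mean(errors))


cv_errors = {d: kfold_mse(d) for d in range(1, 9)}
best_degree = min(cv_errors, key=cv_errors.get)

for d, err in cv_errors.items():
    print(f"degree {d}: cross-validated MSE = {err:.3f}")
print("degree with the lowest cross-validated error:", best_degree)
```

In practice, libraries such as scikit-learn offer ready-made helpers for this (for example `KFold` and `cross_val_score`); the manual version is used here only to keep each step visible.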
In this article we discussed what overfitting in Machine Learning models is, how it originates, and its impact on our predictions. We also discussed methods to check whether we have overfitted a model.